T-SQL RegExp to find sequential repeated characters - regex

I'm looking for a RegExp to find duplicated characters in a entire word in SQL Server and Regular Expressions (RegExp). For Example:
"AAUGUST" match (AA)
"ANDREA" don't match (are 2 vowels "A", buit are separated)
"ELEEPHANT" match (EE)
I was trying with:
SELECT field1
FROM exampleTable
WHERE field1 like '%([A-Z]){2}%'
But it doesn't work.
I apreciated for your help.
Thanks!

You can't do what you're asking with T-SQL's LIKE.
Your best bet is to look at using the Common Language Runtime (CLR), but it can also be achieved (albeit painfully slowly) using, for example, a scalar value function as follows:
create function dbo.ContainsRepeatingAlphaChars(#str nvarchar(max)) returns bit
as begin
declare #p int, -- the position we're looking at
#c char(1) -- the previous char
if #str is null or len(#str) < 2 return 0;
select #c = substring(#str, 1, 1), #p = 1;
while (1=1) begin
set #p = #p + 1; -- move position pointer ahead
if #p > len(#str) return 0; -- if we're at the end of the string and haven't already exited, we haven't found a match
if #c like '[A-Z]' and #c = substring(#str, #p, 1) return 1; -- if last char is A-Z and matches the current char then return "found!"
set #c = substring(#str, #p, 1); -- Get next char
end
return 0; -- this will never be hit but stops SQL Server complaining that not all paths return a value
end
GO
-- Example usage:
SELECT field1
FROM exampleTable
WHERE dbo.ContainsRepeatingAlphaChars(field1) = 1
Did I mention that it would be slow? Don't use this on a large table. Go CLR.

Related

Removing everything between nested parentheses

For removing everything between parentheses, currently i use:
SELECT
REGEXP_REPLACE('(aaa) bbb (ccc (ddd) / eee)', "\\([^()]*\\)", "");
Which is incorrect, because it gives bbb (ccc / eee), as that removes inner parentheses only.
How to remove everynting between nested parentheses? so expected result from this example is bbb
In case of Google BigQuery, this is only possible if you know your maximum number of nestings. Because it uses re2 library that doesn't support regex recursions.
let r = /\((?:(?:\((?:[^()])*\))|(?:[^()]))*\)/g
let s = "(aaa) bbb (ccc (ddd) / eee)"
console.log(s.replace(r, ""))
If you can iterate on the regular expression operation until you reach a fixed point you can do it like this:
repeat {
old_string = string
string := remove_non_nested_parens_using_regex(string)
} until (string == old_string)
For instance if we have
((a(b)) (c))x)
on the first iteration we remove (b) and (c): sequences which begin with (, end with ) and do not contain parentheses, matched by \([^()]*\). We end up with:
((a) )x)
Then on the next iteration, (a) is gone:
( )x)
and after one more iteration, ( ) is gone:
x)
when we try removing more parentheses, there is no more change, and so the algorithm terminates with x).

Regex for IFC with array attributed

IFC is a variation of STEP files used for construction projects. The IFC contains information about the building being constructed. The file is text based and it easy to read. I am trying to parse this information into a python dictionary.
The general format of each line will be similar to the following
2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
ideally this should be parsed int #2334, IFCMATERIALLAYERSETUSAGE, #2333,.AXIS2.,.POSITIVE.,-180.
I found a solution Regex includes two matches in first match
https://regex101.com/r/RHIu0r/10 for part of the problem.
However, there are some cases the data contains arrays instead of values as the example below
2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
This case need to be parsed as #2335, IFCRELASSOCIATESMATERIAL, '2ON6$yXXD1GAAH8whbdZmc', #5,$,$, [#40,#221,#268,#281],#2334
Where [#40,#221,#268,#281] is a stored in a single variable as an array
The array can be in the middle or the last variable.
Would you be able to assist in creating a regular expression to obtain desired results
I have created https://regex101.com/r/mqrGka/1 with cases to test
Here's a solution that continues from the point you reached with the regular expression in the test cases:
file = """\
#1=IFCOWNERHISTORY(#89024,#44585,$,.NOCHANGE.,$,$,$,1190720890);
#2=IFCSPACE(';;);',#1,$);some text);
#2=IFCSPACE(';;);',#1,$);
#2885=IFCRELAGGREGATES('1gtpBVmrDD_xsEb7NuFKc8',#5,$,$,#2813,(#2840,#2846,#2852,#2858,#2879));
#2334=IFCMATERIALLAYERSETUSAGE(#2333,.AXIS2.,.POSITIVE.,-180.);
#2335=IFCRELASSOCIATESMATERIAL('2ON6$yXXD1GAAH8whbdZmc',#5,$,$,(#40,#221,#268,#281),#2334);
""".splitlines()
import re
d = dict()
for line in file:
m = re.match(r"^#(\d+)\s*=\s*([a-zA-Z0-9]+)\s*\(((?:'[^']*'|[^;'])+)\);", line, re.I|re.M)
attr = m.group(3) # attribute list string
values = [m.group(2)] # first value is the entity type name
while attr:
start = 1
if attr[0] == "'": start += attr.find("'", 1) # don't split at comma within string
if attr[0] == "(": start += attr.find(")", 1) # don't split item within parentheses
end = attr.find(",", start) # search for a comma / end of item
if end < 0: end = len(attr)
value = attr[1:end-1].split(",") if attr[0] == "(" else attr[:end]
if value[0] == "'": value = value[1:-1] # remove quotes
values.append(value)
attr = attr[end+1:] # remove current attribute item
d[m.group(1)] = values # store into dictionary

Regex: How to remove English words from sentences using Regex?

I've number of rows in SQLite, each row has one column that contains data like this:
prosperکامیاب شدن ، موفق شدن ، رونق یافتن
As you can see, the sentence starts with English words, Now I want to remove English words at first of each sentence. Is there any way to do that via T-SQL query(using Regex)?
you may try this :) I have made it as a function to call upon
create function dbo.RemoveEngChars (#Unicode_string nvarchar(max))
returns nvarchar(max) as
begin
declare #i int = 1; -- must start from 1, as SubString is 1-based
declare #OriginalString nvarchar(100) = #Unicode_string collate SQL_Latin1_General_Cp1256_CS_AS
declare #ModifiedString nvarchar(100) = N'';
while #i <= Len(#OriginalString)
begin
if SubString(#OriginalString, #i, 1) not like '[a-Z]'
begin
set #ModifiedString = #ModifiedString + SubString(#OriginalString, #i, 1);
end
set #i = #i + 1;
end
return #ModifiedString
end
--To call the function , you can run the following script and pass the Unicode in N' prefix
select dbo.RemoveEngChars(N'prosperکامیاب شدن ، موفق شدن ، رونق یافتن')

How to use regular expressions to extract 3-tuple values from a string

I am trying to extract n 3-tuples (Si, Pi, Vi) from a string.
The string contains at least one such 3-tuple.
Pi and Vi are not mandatory.
SomeTextxyz#S1((property(P1)val(V1))#S2((property(P2)val(V2))#S3
|----------1-------------|----------2-------------|-- n
The desired output would be:
Si,Pi,Vi.
So for n occurrences in the string the output should look like this:
[S1,P1,V1] [S2,P2,V2] ... [Sn-1,Pn-1,Vn-1] (without the brackets)
Example
The input string could be something like this:
MyCarGarage#Mustang((property(PS)val(500))#Porsche((property(PS)val(425‌​)).
Once processed the output should be:
Mustang,PS,500 Porsche,PS,425
Is there an efficient way to extract those 3-tuples using a regular expression
(e.g. using C++ and std::regex) and what would it look like?
#(.*?)\(\(property\((.*?)\)val\((.*?)\)\) should do the trick.
example at http://regex101.com/r/bD1rY2
# # Matches the # symbol
(.*?) # Captures everything until it encounters the next part (ungreedy wildcard)
\(\(property\( # Matches the string "((property(" the backslashes escape the parenthesis
(.*?) # Same as the one above
\)val\( # Matches the string ")val("
(.*?) # Same as the one above
\)\) # Matches the string "))"
How you should implement this in C++ i don't know but that is the easy part :)
http://ideone.com/S7UQpA
I used C's <regex.h> instead of std::regex because std::regex isn't implemented in g++ (which is what IDEONE uses). The regular expression I used:
" In C(++)? regexes are strings.
# Literal match
([^(#]+) As many non-#, non-( characters as possible. This is group 1
( Start another group (group 2)
\\(\\(property\\( Yet more literal matching
([^)]+) As many non-) characters as possible. Group 3.
\\)val\\( Literal again
([^)]+) As many non-) characters as possible. Group 4.
\\)\\) Literal parentheses
) Close group 2
? Group 2 optional
" Close Regex
And some c++:
int getMatches(char* haystack, item** items){
first, calculate the length of the string (we'll use that later) and the number of # found in the string (the maximum number of matches)
int l = -1, ats = 0;
while (haystack[++l])
if (haystack[l] == '#')
ats++;
malloc a large enough array.
*items = (item*) malloc(ats * sizeof(item));
item* arr = *items;
Make a regex needle to find. REGEX is #defined elsewhere.
regex_t needle;
regcomp(&needle, REGEX, REG_ICASE|REG_EXTENDED);
regmatch_t match[5];
ret will hold the return value (0 for "found a match", but there are other errors you may want to be catching here). x will be used to count the found matches.
int ret;
int x = -1;
Loop over matches (ret will be zero if a match is found).
while (!(ret = regexec(&needle, haystack, 5, match,0))){
++x;
Get the name from match1
int bufsize = match[1].rm_eo-match[1].rm_so + 1;
arr[x].name = (char *) malloc(bufsize);
strncpy(arr[x].name, &(haystack[match[1].rm_so]), bufsize - 1);
arr[x].name[bufsize-1]=0x0;
Check to make sure the property (match[3]) and the value (match[4]) were found.
if (!(match[3].rm_so > l || match[3].rm_so<0 || match[3].rm_eo > l || match[3].rm_so< 0
|| match[4].rm_so > l || match[4].rm_so<0 || match[4].rm_eo > l || match[4].rm_so< 0)){
Get the property from match[3].
bufsize = match[3].rm_eo-match[3].rm_so + 1;
arr[x].property = (char *) malloc(bufsize);
strncpy(arr[x].property, &(haystack[match[3].rm_so]), bufsize - 1);
arr[x].property[bufsize-1]=0x0;
Get the value from match[4].
bufsize = match[4].rm_eo-match[4].rm_so + 1;
arr[x].value = (char *) malloc(bufsize);\
strncpy(arr[x].value, &(haystack[match[4].rm_so]), bufsize - 1);
arr[x].value[bufsize-1]=0x0;
} else {
Otherwise, set both property and value to NULL.
arr[x].property = NULL;
arr[x].value = NULL;
}
Move the haystack to past the match and decrement the known length.
haystack = &(haystack[match[0].rm_eo]);
l -= match[0].rm_eo;
}
Return the number of matches.
return x+1;
}
Hope this helps. Though it occurs to me now that you never answered kind of a vital question: What have you tried?

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW i guess...is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation, i was 'appending' instead of pre-allocate the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The blaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX and the comma can be omitted. Likewise, if there is neither maximum nor minimum everything within the {} including the braces themselves can be replaced with a * which signifies the preceding item will be matched zero or more times; this is equivalent to {0}.
This ensures that the regex only matches strings where every character after any leading and trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a question mark
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this algorithm ENSURES a O(n) performance rate, as the split causes n strings to be generated. This means that even FAILING strings must do at least n actions to reject the string.
Hopefully this clarifies why you were having performance issues.