Removing everything between nested parentheses - regex

For removing everything between parentheses, currently i use:
SELECT
REGEXP_REPLACE('(aaa) bbb (ccc (ddd) / eee)', "\\([^()]*\\)", "");
Which is incorrect, because it gives bbb (ccc / eee), as that removes inner parentheses only.
How to remove everynting between nested parentheses? so expected result from this example is bbb

In case of Google BigQuery, this is only possible if you know your maximum number of nestings. Because it uses re2 library that doesn't support regex recursions.
let r = /\((?:(?:\((?:[^()])*\))|(?:[^()]))*\)/g
let s = "(aaa) bbb (ccc (ddd) / eee)"
console.log(s.replace(r, ""))

If you can iterate on the regular expression operation until you reach a fixed point you can do it like this:
repeat {
old_string = string
string := remove_non_nested_parens_using_regex(string)
} until (string == old_string)
For instance if we have
((a(b)) (c))x)
on the first iteration we remove (b) and (c): sequences which begin with (, end with ) and do not contain parentheses, matched by \([^()]*\). We end up with:
((a) )x)
Then on the next iteration, (a) is gone:
( )x)
and after one more iteration, ( ) is gone:
x)
when we try removing more parentheses, there is no more change, and so the algorithm terminates with x).

Related

Regular expressions string splits

I'm trying to create a pattern that enables me split a string on comas but ignoring expressions within curly brackets.
my existing code works great if only one group of curly bracket expressions exist in the string.
Dim expression As New Regex(",(?=(?:[^\{]*\{[^\{]*\})*(?![^\}]*\}))")
Try
parts = expression.Split(sortString)
For Each Item In parts
If Not Item Is Nothing Then
result.Add(Item)
End If
Next
Return result
If I pass the string
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo_Pressures.Site = "AO",3,4)))},Site DESC,Pressure1 ASC
It works, the curly bracket grouping is ignored and each string after is broken out with the coma split.
Problem is. If I need to accommodate multiple groupings of curly bracket expressions in my string and it begins to fail.
This fails:
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo_Pressures.Site = "AO",3,4)))},Site DESC,{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo.Site = "AO",3,4)))}, Pressure1 ASC
one of the grouping is ignored as it should be, but the other grouping of curly brackets is not. Resulting in a dirty collection.
I would appreciate a second pair of eyes on this.
The pattern ,(?![^{]*\}) seems to be working for me, as long as there are no lose { or } floating around in the text.
Broken down, it evaluates as
,
zero-width negative lookahead
Any character not in "{"
* (zero or more times)
}
End Capture
You can use *,(?![^{]*\}) * to trim the spaces immediately before and after commas if required.
On your test string, it produces the follow splits:
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site =
"SVC",2,IIF(Hemo_Pressures.Site = "AO",3,4)))}
Site DESC
{IIF(Hemo.Site = "LV",1,IIF(Hemo.Site = "SVC",2,IIF(Hemo.Site =
"AO",3,4)))}
Pressure1 ASC
However, it will fail to properly split strings like
Apple, Ban}ana, Carrot {1,2,3}, Frog, Cat, 1,2,3
due to the lose } in "banana"

regex with all components optionals, how to avoid empty matches

I have to process a comma separated string which contains triplets of values and translate them to runtime types,the input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case, the expression works almost well but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version but it still matches the empty string plus I should first tokenize the input and then pass each token to the expression in order to assure that the test string could match the begin (^) and end ($) operators.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading and (try to) understanding the regex lookahead concept and with the help of Casimir et Hippolyte answer I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! it is able to detect complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h" which weren't intended to be matched at all). Unfortunately it captures emtpy matches everytime a unvalid match is found so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688", so I modified the Lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}) so the empty matches just before "adfank" and "12322134445688" are gone but the ones just before "1s1m1h", "1h1h1h" are stil detected.
So the question is: Is there any regular expression which matches three triplet values in a given order where all component is optional but should be composed of at least one component and doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the begining to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without to tokenize before), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positive (since you are looking for very small strings that can be a part of something else), you can add word-boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string to match ^ or the comma separator , for fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
std::sregex_iterator iter(test.begin(), test.end(), r);
std::sregex_iterator end_iter;
for(; iter != end_iter; ++iter)
std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");

Delphi RegEx Parenthesis Parser

I am looking for an regular expression as generally solution.
This regular expression is used to obtain parenthetical functions and parameters.
Input:
...alotOfText...
DBINFO("Parameter1"|'FirstFunction(Parameter)'|Parameter3|SecondFunction("Parameter1"|Parameter2)")
...alotOfOtherText...
Current regex:
cRegex =
'DBINFO\('// Looking for DBINFO(
+ '(?:' // Recursion for following Pattern(s)
+ '[^\)]' // no "("
+ '|(?R))' // or Repeat the Recursion (am i right?) I don't really understand this line
+ '*\)' // Quantifier for recursion (?) with unlimited Chars and one ")" at the end.
;
For inputs with only one set of () this works, but as soon as I need to parse the input mentioned above, the matches are only until the first occurrence of a ).
So I researched that multiple levels of parenthesis need to use sub routines. But even on my primary information source I can't find an example that brings me back on track. http://www.regular-expressions.info/subroutine.html
Remarks:
Each parameter could be blank, with " or with ' (mixed)
Source:
hRegEx := TRegEx.Create(cRegex), [roIgnoreCase, roMultiLine]);
hMatchCollection := hRegEx.Matches(aLayoutString);
for hMatch in hMatchCollection do
// Regarding the Regular Expression there should only be one Match in the Collection.
//Thats Subject to Change
begin
if hMatch.Success then
begin
Result := ParseParameter(hMatch.Value);
end;
end;
If you give an example: Please comment on it as mine. I want to believe .. ah learn. :)
Found!
cRegex =
'DBINFO' // some Searchinfo outside the parenthesis Expression
+ '(' // Outer Match Start for (?1)
+ '\(' // Search one "("
+ '(' // "SubGroup" Start
+ '(?>[^()]+)' // SubPattern: everything that is non-parentheses
+ '|(?1)' // or recursive match of the Subpattern 1
+ ')' // "SubGroup" End
+ '*\)' // any Numer of "SubGroup" and one ")"
+ ')' // Outer Match End
;
I was wrong with my first Expression. The Paranthesis Expression itself was perfectly fine. So this seems to work fine.
Found at:
http://mushclient.com/pcre/pcrepattern.html#SEC19
If someone with more knowledge could correct my Comments about the Expression. First i am using the wrong Names. Second i am not sure if (?1) reffers to the Inner () or the Outer () Match. And i dont know how to format Expressions.

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of dna such that it only considers every 3 characters (a codon of dna) at a time.
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use the regular expressions right now and the expression was
Regex.Replace(dna,"ACC","AAA") it would find a match, but in this case of looking at 3 characters at a time there would be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
public string DnaReplaceCodon(string input, string match, string replace) {
if (match.Length != 3 || replace.Length != 3)
throw new ArgumentOutOfRangeException();
var output = new StringBuilder(input.Length);
int i = 0;
while (i + 2 < input.Length) {
if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
output.Append(replace);
} else {
output.Append(input[i]);
output.Append(input[i]+1);
output.Append(input[i]+2);
}
i += 3;
}
// pick up trailing letters.
while (i < input.Length) output.Append(input[i]);
return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, #"\G((?:.{3})*?)" + codon, "$1" + replacement);
DEMO
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codon, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only starts from positions whose index is multiple of 3 with:
\G which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), follows by the replacement codon.
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW i guess...is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation, i was 'appending' instead of pre-allocate the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The blaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX and the comma can be omitted. Likewise, if there is neither maximum nor minimum everything within the {} including the braces themselves can be replaced with a * which signifies the preceding item will be matched zero or more times; this is equivalent to {0}.
This ensures that the regex only matches strings where every character after any leading and trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a question mark
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this algorithm ENSURES a O(n) performance rate, as the split causes n strings to be generated. This means that even FAILING strings must do at least n actions to reject the string.
Hopefully this clarifies why you were having performance issues.