Delphi RegEx Parenthesis Parser - regex

I am looking for an regular expression as generally solution.
This regular expression is used to obtain parenthetical functions and parameters.
Input:
...alotOfText...
DBINFO("Parameter1"|'FirstFunction(Parameter)'|Parameter3|SecondFunction("Parameter1"|Parameter2)")
...alotOfOtherText...
Current regex:
cRegex =
'DBINFO\('// Looking for DBINFO(
+ '(?:' // Recursion for following Pattern(s)
+ '[^\)]' // no "("
+ '|(?R))' // or Repeat the Recursion (am i right?) I don't really understand this line
+ '*\)' // Quantifier for recursion (?) with unlimited Chars and one ")" at the end.
;
For inputs with only one set of () this works, but as soon as I need to parse the input mentioned above, the matches are only until the first occurrence of a ).
So I researched that multiple levels of parenthesis need to use sub routines. But even on my primary information source I can't find an example that brings me back on track. http://www.regular-expressions.info/subroutine.html
Remarks:
Each parameter could be blank, with " or with ' (mixed)
Source:
hRegEx := TRegEx.Create(cRegex), [roIgnoreCase, roMultiLine]);
hMatchCollection := hRegEx.Matches(aLayoutString);
for hMatch in hMatchCollection do
// Regarding the Regular Expression there should only be one Match in the Collection.
//Thats Subject to Change
begin
if hMatch.Success then
begin
Result := ParseParameter(hMatch.Value);
end;
end;
If you give an example: Please comment on it as mine. I want to believe .. ah learn. :)

Found!
cRegex =
'DBINFO' // some Searchinfo outside the parenthesis Expression
+ '(' // Outer Match Start for (?1)
+ '\(' // Search one "("
+ '(' // "SubGroup" Start
+ '(?>[^()]+)' // SubPattern: everything that is non-parentheses
+ '|(?1)' // or recursive match of the Subpattern 1
+ ')' // "SubGroup" End
+ '*\)' // any Numer of "SubGroup" and one ")"
+ ')' // Outer Match End
;
I was wrong with my first Expression. The Paranthesis Expression itself was perfectly fine. So this seems to work fine.
Found at:
http://mushclient.com/pcre/pcrepattern.html#SEC19
If someone with more knowledge could correct my Comments about the Expression. First i am using the wrong Names. Second i am not sure if (?1) reffers to the Inner () or the Outer () Match. And i dont know how to format Expressions.

Related

Removing everything between nested parentheses

For removing everything between parentheses, currently i use:
SELECT
REGEXP_REPLACE('(aaa) bbb (ccc (ddd) / eee)', "\\([^()]*\\)", "");
Which is incorrect, because it gives bbb (ccc / eee), as that removes inner parentheses only.
How to remove everynting between nested parentheses? so expected result from this example is bbb
In case of Google BigQuery, this is only possible if you know your maximum number of nestings. Because it uses re2 library that doesn't support regex recursions.
let r = /\((?:(?:\((?:[^()])*\))|(?:[^()]))*\)/g
let s = "(aaa) bbb (ccc (ddd) / eee)"
console.log(s.replace(r, ""))
If you can iterate on the regular expression operation until you reach a fixed point you can do it like this:
repeat {
old_string = string
string := remove_non_nested_parens_using_regex(string)
} until (string == old_string)
For instance if we have
((a(b)) (c))x)
on the first iteration we remove (b) and (c): sequences which begin with (, end with ) and do not contain parentheses, matched by \([^()]*\). We end up with:
((a) )x)
Then on the next iteration, (a) is gone:
( )x)
and after one more iteration, ( ) is gone:
x)
when we try removing more parentheses, there is no more change, and so the algorithm terminates with x).

RegEx for matching the first {N} chars and last {M} chars

I'm having an issue filtering tags in Grafana with an InfluxDB backend. I'm trying to filter out the first 8 characters and last 2 of the tag but I'm running into a really weird issue.
Here are some of the names...
GYPSKSVLMP2L1HBS135WH
GYPSKSVLMP2L2HBS135WH
RSHLKSVLMP1L1HBS045RD
RSHLKSVLMP35L1HBS135WH
RSHLKSVLMP35L2HBS135WH
only want to return something like this:
MP8L1HBS225
MP24L2HBS045
I first started off using this expression:
[MP].*
But it only returns the following out of 148:
PAYNKSVLMP27L1HBS045RD
PAYNKSVLMP27L1HBS135WH
PAYNKSVLMP27L1HBS225BL
PAYNKSVLMP27L1HBS315BR
The pattern [MP].* Matches either a M or P and then matches any char until the end of the string not taking any char, digit or quantifing number afterwards into account.
If you want to match MP and the value does not end on a digit but the last in the match should be a digit, you could use:
MP[A-Z0-9]+[0-9]
Regex demo
If lookaheads are supported you might also use:
MP[A-Z0-9]+(?=[A-Z0-9]{2}$)
Regex demo
You may not even want to touch MP. You can simply define a left and right boundary, just like your question asks, and swipe everything in between which might be faster, maybe an expression similar to:
(\w{8})(.*)(\w{2})
which you can simply call it using $2. That is the second capturing group, just to be easy to replace.
Graph
This graph shows how the expression would work:
Performance
This JavaScript snippet shows the performance of this expression using a simple 1-million times for loop.
repeat = 1000000;
start = Date.now();
for (var i = repeat; i >= 0; i--) {
var string = "RSHLKSVLMP35L2HBS135WH";
var regex = /^(\w{8})(.*)(\w{2})$/g;
var match = string.replace(regex, "$2");
}
end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
Try Regex: (?<=\w{8})\w+(?=\w{2})
Demo

Regular Expressions matching difficulty

My current regex:
([\d]*)([^\d]*[\d][a-z]*-[\d]*)([\d][a-z?])(.?)
Right so I am attempting to make regex match a string based on: a count that can be any amount of number from 0-1million then followed by a number then sometimes a letter then - then any number for numbers followed by the same number and sometimes a letter then sometimes a letter. example of strings it should match:
1921-1220104081741b
192123212a-1220234104081742ab
an example of what it should return based on above (this is 2 examples it shouldn't read both lines.)
(192) (1-122010408174) (1) (b)
(19212321) (2a-122023410408174) (2a) (b)
My current regex works with the second one but it returns (1b) in the first when I would like it to return (1) (b) but also return (2a) in the case of the second one or the case of:
1926h-1220104081746h Should Return: (192) (6h-122010408174) (6h)
Not 100% sure if its possible, sense I'm fairly new to regex. For reference I'm doing this in excel-vba if there is any other way to do this easier.
You could capture the character(s) before the dash character, and then back reference that match.
In the expression below, \3 would match what was matched by the 3rd capturing group:
(\d*)((\d[a-z]*)-\d*)(\3)([a-z])?
Example Here
Output after merging the capture groups:
1921-1220104081741b
(192) (1-122010408174) (1) (b)
192123212a-1220234104081742ab
(19212321) (2a-122023410408174) (2a) (b)
1926h-1220104081746h
(192) (6h-122010408174) (6h)
Example:
Disregard the JS. Here is the output after merging the capture groups:
var strings = ['1921-1220104081741b', '192123212a-1220234104081742ab', '1926h-1220104081746h'], exp = /(\d*)((\d[a-z]*)-\d*)(\3)([a-z])?/;
strings.forEach(function(str) {
var m = str.match(exp);
snippet.log(str);
snippet.log('(' + m[1] + ') ('+ m[2] + ') (' + m[4] + ') (' + (m[5]||'') + ')');
snippet.log('---');
});
<script src="http://tjcrowder.github.io/simple-snippets-console/snippet.js"></script>
I think what you are saying with "followed by the same number" is that the piece right before the dash is repeated as your third capture group. I would suggest implementing this by splitting up your second capture group and then using a backreference:
([\d]*)([\d][a-z]*)-([\d]*)(\2)(.?)
For your three examples:
1921-1220104081741b
192123212a-1220234104081742ab
1926h-1220104081746h
This results in:
(192) (1) - (122010408174) (1) (b)
(19212321) (2a) - (122023410408174) (2a) (b)
(192) (6h) - (122010408174) (6h) ()
...and you can join the two middle groups back together to get the hyphenated term you wanted.

regex with all components optionals, how to avoid empty matches

I have to process a comma separated string which contains triplets of values and translate them to runtime types,the input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case, the expression works almost well but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version but it still matches the empty string plus I should first tokenize the input and then pass each token to the expression in order to assure that the test string could match the begin (^) and end ($) operators.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading and (try to) understanding the regex lookahead concept and with the help of Casimir et Hippolyte answer I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! it is able to detect complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h" which weren't intended to be matched at all). Unfortunately it captures emtpy matches everytime a unvalid match is found so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688", so I modified the Lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}) so the empty matches just before "adfank" and "12322134445688" are gone but the ones just before "1s1m1h", "1h1h1h" are stil detected.
So the question is: Is there any regular expression which matches three triplet values in a given order where all component is optional but should be composed of at least one component and doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the begining to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without to tokenize before), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positive (since you are looking for very small strings that can be a part of something else), you can add word-boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string to match ^ or the comma separator , for fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
std::sregex_iterator iter(test.begin(), test.end(), r);
std::sregex_iterator end_iter;
for(; iter != end_iter; ++iter)
std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of dna such that it only considers every 3 characters (a codon of dna) at a time.
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use the regular expressions right now and the expression was
Regex.Replace(dna,"ACC","AAA") it would find a match, but in this case of looking at 3 characters at a time there would be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
public string DnaReplaceCodon(string input, string match, string replace) {
if (match.Length != 3 || replace.Length != 3)
throw new ArgumentOutOfRangeException();
var output = new StringBuilder(input.Length);
int i = 0;
while (i + 2 < input.Length) {
if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
output.Append(replace);
} else {
output.Append(input[i]);
output.Append(input[i]+1);
output.Append(input[i]+2);
}
i += 3;
}
// pick up trailing letters.
while (i < input.Length) output.Append(input[i]);
return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, #"\G((?:.{3})*?)" + codon, "$1" + replacement);
DEMO
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codon, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only starts from positions whose index is multiple of 3 with:
\G which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), follows by the replacement codon.
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.