How to extract the operands on both sides of "==" using regex? - regex

Language and package
python3.8, regex
Description
The inputs and desired outputs are as follows:
if (programWorkflowState.getTerminal(1, 2) == Boolean.TRUE) {
Want: programWorkflowState.getTerminal(1, 2) and Boolean.TRUE
boolean ignore = !_isInStatic.isEmpty() && (_isInStatic.peek() == 3) && isAnonymous;
Want: _isInStatic.peek() and 3
boolean b = (num1 * ( 2 + num2)) == value;
Want: (num1 * ( 2 + num2)) and value
My current regex
((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)\s*==\s*((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)
This pattern is meant to match either \((?:[^\(\)]|(?R))*\) or [\w\.], repeated, on both sides of "=="
Result on regex101.com
Problem: it fails to match the recursive part (num1 * ( 2 + num2)).
The explanation of the recursive pattern \((?:m|(?R))*\) is here
But if I use the recursive pattern on its own, it does match (num1 * ( 2 + num2)), as the image shows.
What's the right regex to achieve my purpose?

The \((?:m|(?R))*\) pattern contains a (?R) construct (equal to (?0) subroutine) that recurses the entire pattern.
To recurse only part of a longer pattern, wrap that part in a named group and call it with a subroutine instead of the (?R) recursion construct, e.g. (?P<aux>\((?:m|(?&aux))*\)).
You can use
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++)\s*[!=]=\s*((?:(?&aux1)|[\w.])+)
See this regex demo (it takes just 6875 steps to match the string provided, yours takes 13680)
Details
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++) - Group 1, matches one or more occurrences (possessively, due to ++, not allowing backtracking into the pattern so that the regex engine could not re-try matching a string in another way if the subsequent patterns fail to match)
(?P<aux1>\((?:[^()]++|(?&aux1))*\)) - an auxiliary group "aux1" that matches (, then zero or more occurrences of either 1+ chars other than ( and ) or the whole Group "aux1" pattern, and then a )
| - or
[\w.] - a letter, digit, underscore or .
\s*[!=]=\s* - != or == with zero or more whitespace on both ends
((?:(?&aux1)|[\w.])+) - Group 2: one or more occurrences of the Group "aux1" pattern or a letter, digit, underscore or a dot.
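For readers who cannot install the third-party regex package, the same operands can be cross-checked with a small hand-written scanner in plain Python. This is not the regex from the answer, only a sketch that mirrors its rules: operand text is made of word characters, dots and balanced (...) groups. The helper names _scan and operands are made up for illustration.

```python
import re

# Swap "(" and ")" so a reversed string can be scanned left to right.
_SWAP = str.maketrans("()", ")(")

def _scan(text, i):
    """Consume word/dot characters and balanced (...) groups starting at i."""
    start = i
    while i < len(text):
        ch = text[i]
        if ch.isalnum() or ch in "_.":
            i += 1
        elif ch == "(":
            depth, j = 0, i
            while j < len(text):
                if text[j] == "(":
                    depth += 1
                elif text[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j == len(text):  # unbalanced: stop before this "("
                break
            i = j + 1
        else:
            break
    return text[start:i]

def operands(expr):
    """Return (left, right) pairs around every == or != in expr."""
    pairs = []
    for m in re.finditer(r"\s*[!=]=\s*", expr):
        right = _scan(expr, m.end())
        # Scan the left side by mirroring it, then mirror the result back.
        mirrored = expr[:m.start()][::-1].translate(_SWAP)
        left = _scan(mirrored, 0)[::-1].translate(_SWAP)
        if left and right:
            pairs.append((left, right))
    return pairs

print(operands("boolean b = (num1 * ( 2 + num2)) == value;"))
```

The scanner stops at the first character that is neither operand text nor the start of a balanced group, which is why the enclosing "(" in the second sample is left out of the result.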

Related

Find the Longest Common starting substring of S2 in S1

I was solving a problem. I solved the "longest common starting substring of S2 in S1" part, but the time complexity was very high.
In the code below I have to find the longest common starting substring of str3 in s[i].
Instead of the find function I have also tried the KMP algorithm, but the time complexity was still too high.
string str3=abstring1(c,1,2,3);
while(1)
{
    size_t found = s[i].find(str3);
    if(str3.length()==0)
        break;
    if (found != string::npos)
    {
        str1=str1+str3;
        break;
    }
    else
    {
        str3.pop_back();
    }
}
Example :
S1=balling S2=baller
ans=ball
S1=balling S2=uolling
ans=
We have to find the longest starting substring (prefix) of S2 that occurs in S1.
Can you help me do this in C++?
I found a similar post, but I was not able to work it out myself in C++.
Here is a solution that emits the faint aroma of a hack.
Suppose
s1 = 'snowballing'
s2 = 'baller'
Then form the string
s = s2 + '|' + s1
#=> 'baller|snowballing'
where the pipe ('|') can be any character that is not in either string. (If in doubt, one could use, say, "\x00".)
We may then match s against the regular expression
^(.*)(?=.*\|.*\1)
This will match the longest starting string in s2 that is present in s1, which in this example is 'ball'.
Demo
The regular expression can be broken down as follows.
^ # match beginning of string
( # begin capture group 1
.* # match zero or more characters, as many as possible
) # end capture group 1
(?= # begin a positive lookahead
.* # match zero or more characters, as many as possible
\| # match '|'
.* # match zero or more characters, as many as possible
\1 # match the contents of capture group 1
) # end positive lookahead
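The breakdown above can be tried out with Python's built-in re module. This is a minimal sketch; the function name longest_prefix_of_s2_in_s1 and the sep parameter are made up for illustration, and re.DOTALL is an added assumption so that the strings may contain newlines.

```python
import re

def longest_prefix_of_s2_in_s1(s1, s2, sep="|"):
    """Longest prefix of s2 that occurs anywhere in s1."""
    if sep in s1 or sep in s2:
        raise ValueError("separator must not occur in either string")
    # Same idea as ^(.*)(?=.*\|.*\1), with the separator escaped.
    pattern = r"^(.*)(?=.*" + re.escape(sep) + r".*\1)"
    m = re.match(pattern, s2 + sep + s1, flags=re.DOTALL)
    # The empty prefix always satisfies the lookahead, so m is never None.
    return m.group(1)

print(longest_prefix_of_s2_in_s1("snowballing", "baller"))  # ball
```

Group 1 starts with the whole string and backtracks one character at a time, so the first length at which the lookahead succeeds is the longest one.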

regex match longest substring with equal first and last char

/(\w)(\w*)\1/
For the string "mgntdygtxrvxjnwksqhxuxtrv" this matches "txrvxjnwksqhxuxt" (using Ruby), but not the even longer valid substring "tdygtxrvxjnwksqhxuxt".
For a given string, here are two ways to find the longest substring that begins and ends with the same character.
Suppose
str = "mgntdygtxrvxjnwksqhxuxtrv"
Use a regular expression
r = /(.)(?=(.*\1))/
str.gsub(r).map { $1 + $2 }.max_by(&:length)
#=> "tdygtxrvxjnwksqhxuxt".
When, as here, the regular expression contains capture groups, it may be more convenient to use String#gsub without a second argument or block (in which case it returns an enumerator, which can be chained) than String#scan ("If the pattern contains groups, each individual result is itself an array containing one entry per group."). Here gsub performs no substitutions; it merely generates matches of the regular expression.
The regular expression can be made self-documenting by writing it in free-spacing mode.
r = /
(.) # match any char and save to capture group 1
(?= # begin a positive lookahead
(.*\1) # match >= 0 characters followed by the contents of capture group 1
) # end the positive lookahead
/x # free-spacing regex definition mode
The following intermediate calculation is performed:
str.gsub(r).map { $1 + $2 }
#=> ["gntdyg", "ntdygtxrvxjn", "tdygtxrvxjnwksqhxuxt", "txrvxjnwksqhxuxt",
# "xrvxjnwksqhxux", "rvxjnwksqhxuxtr", "vxjnwksqhxuxtrv", "xjnwksqhxux",
# "xux"]
Notice that this does not enumerate all substrings beginning and ending with the same character (because .* is greedy). It does not generate, for example, the substring "xrvx".
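The same lookahead pattern carries over to Python's re module, where re.finditer plays the role of the block-less gsub. A minimal sketch; the function names same_end_candidates and longest_same_ends_regex are made up for illustration.

```python
import re

def same_end_candidates(s):
    """For each position, the longest substring starting there that ends
    with the same character it starts with (if that character recurs)."""
    return [m.group(1) + m.group(2)
            for m in re.finditer(r"(.)(?=(.*\1))", s)]

def longest_same_ends_regex(s):
    # default="" covers strings in which no character occurs twice
    return max(same_end_candidates(s), key=len, default="")

print(longest_same_ends_regex("mgntdygtxrvxjnwksqhxuxtrv"))
```

Each match consumes only one character because the rest sits inside the lookahead, so every starting position gets its own attempt.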
Do not use a regular expression
v = str.each_char.with_index.with_object({}) do |(c,i),h|
  if h.key?(c)
    h[c][:size] = i - h[c][:start] + 1
  else
    h[c] = { start: i, size: 1 }
  end
end.max_by { |_,h| h[:size] }.last
str[v[:start], v[:size]]
#=> "tdygtxrvxjnwksqhxuxt"
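The non-regex idea (remember where each character first appears, then measure the span to its latest occurrence) can be sketched in Python as a single pass. This is a loose port, not the Ruby code itself; the function name longest_same_ends is made up for illustration.

```python
def longest_same_ends(s):
    """Longest substring of s whose first and last characters are equal."""
    first_seen = {}              # character -> index of its first occurrence
    best_start, best_len = 0, 0
    for i, ch in enumerate(s):
        start = first_seen.setdefault(ch, i)
        length = i - start + 1
        if length > best_len:    # strict ">" keeps the earliest of equal spans
            best_start, best_len = start, length
    return s[best_start:best_start + best_len]

print(longest_same_ends("mgntdygtxrvxjnwksqhxuxtrv"))
```

Like the Ruby version, a string with no repeated character yields a one-character result, since a single character trivially begins and ends with itself.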

How do you find 3 UNIQUE digits in a string of digits?

I am trying to write a regex that is very specific. I want to find 3 digits in a list. The issue comes because I do not care about repeating digits (5, 555, and 55555555555555 are seen as 5). Also, within the 3 digits, they need to be 3 different digits (123 = good, 311 = bad).
Here is what I have so far to find 3 digits, ignoring repeats but it does not specify 3 unique digits.
^(?:([0]{1,}|[1]{1,}|[2]{1,}|[3]{1,}|[4]{1,}|[5]{1,}|[6]{1,}|[7]{1,}|[8]{1,}|[9]{1,}|[0]{1,})(?!.*\\1)){3}$
Here is an example of the types of data I see.
Matching:
458
3333335555111
2222555111
222255558888
111147
9533333333
And not matching:
999999999
222252
888887
Right now my regex will find all of these. How can I ignore any that do not have 3 unique digits?
If your regex tool of choice supports lookaheads, backreferences and possessive matching, you could use
^(\d)\1*+(?!.*\1)(\d)\2*+(\d)\3*+$
^ and $ are anchors to ensure, that we check the whole string
(\d) matches a digit into the first capturing group; \1*+ then possessively matches any following occurrences of this digit, and the negative lookahead (?!.*\1) ensures that this digit does not occur again later in the string.
(\d)\2*+ then matches the next, different digit, again matching any following occurrences possessively (try 122 without the possessive matching to see why I use it here)
(\d)\3*+ matches the last digit with any following occurrences.
Without possessive matching you could make more use of lookaheads, like ^(\d)\1*(?!.*\1)(\d)\2*(?!.*\2)(\d)\3*$
See https://regex101.com/r/pV2tB2/2 for a demo.
Side note: Regex might not be the best tool for this, but as you specifically asked for it, here you are.
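The lookahead-only variant can be checked against the sample data from the question with Python's built-in re module, which accepts possessive quantifiers only from version 3.11 on. A minimal sketch, assuming the final quantifier is written as a plain \3* so that no possessive syntax is needed; the names THREE_RUNS and has_three_unique_runs are made up for illustration.

```python
import re

# Three runs of digits; a lookahead after each of the first two runs
# rejects strings in which that digit shows up again later.
THREE_RUNS = re.compile(r"^(\d)\1*(?!.*\1)(\d)\2*(?!.*\2)(\d)\3*$")

def has_three_unique_runs(s):
    return THREE_RUNS.match(s) is not None

matching = ["458", "3333335555111", "2222555111",
            "222255558888", "111147", "9533333333"]
not_matching = ["999999999", "222252", "888887"]

for s in matching + not_matching:
    print(s, has_three_unique_runs(s))
```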
This can be done with regex, but it's not the best tool for your work.
Instead of a regex-only approach, you can easily achieve this using Python.
Example:
strings = ['458', '3333335555111', '2222555111', '222255558888', '111147', '9533333333', '955555555', '12222211']
for s in strings:
    if len(set(list(s))) == 3:
        print("Ok :", s)
    else:
        print("Error :", s)
Output:
>> Ok : 458
>> Ok : 3333335555111
>> Ok : 2222555111
>> Ok : 222255558888
>> Ok : 111147
>> Ok : 9533333333
>> Error : 955555555
>> Error : 12222211
I've used the following commands while iterating over the strings inside that list:
list()
set()
len()
Using negative lookahead, this should match any string of digits that contains at least 3 unique digits /^(\d)\1*(?!\1)(\d)(?:\2|\1)*(?!\2|\1)(\d)+$/
(\d) - Match a digit
\1* - Allow that digit to repeat
(?!\1) - Make sure that's followed by a digit that does not match the first match
(\d) - Match the new digit
(?:\2|\1)* - Allow repeats of either the first or second digit
(?!\2|\1) - Make sure that's followed by a digit that does not match the first or second match
(\d)+ - Capture the third unique digit, then allow any number of digits of any kind to follow
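Because this pattern accepts at least three unique digits, it behaves differently from an exactly-three check on longer inputs. A small Python sketch to show the difference; the names AT_LEAST_THREE and has_at_least_three_unique are made up for illustration.

```python
import re

AT_LEAST_THREE = re.compile(r"^(\d)\1*(?!\1)(\d)(?:\2|\1)*(?!\2|\1)(\d)+$")

def has_at_least_three_unique(s):
    return AT_LEAST_THREE.match(s) is not None

# "1213" and "1234" pass here: the first two digits may interleave,
# and anything may follow once a third distinct digit has been seen.
for s in ["458", "1213", "1234", "222252", "12222211", "999999999"]:
    print(s, has_at_least_three_unique(s))
```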
I'm not sure if an awk script will do it for you, but here it goes:
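awk '
function match_func(num) {
    if (match_array[num] == 0)
        match_array[num] = 1;
}
{
    # reset the per-line state so earlier lines do not leak into this one
    split("", match_array); match_sum = 0;
    # substr() is 1-based in awk
    for (i = 1; i <= length($1); i++)
        match_func(substr($1, i, 1));
    for (i = 0; i < 10; i++)
        if (match_array[i] == 1) match_sum++;
    if (match_sum == 3)
        print $1;
}'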

Extracting inner groups with regex

I have the following string
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6')) AND (((SUM_RevisionAnomalia_UltRevision_1M = 1) AND (CANT_ConsumoFact_UltRevision_1M > 1)) OR ((SUM_RevisionNoAnomalia_UltRevision_1M + 1) AND (CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2))) OR (SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
and I am trying to extract all inner groups, so my answer should contain
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6'))
(SUM_RevisionAnomalia_UltRevision_1M = 1)
(CANT_ConsumoFact_UltRevision_1M > 1)
(SUM_RevisionNoAnomalia_UltRevision_1M + 1)
(CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2)
(SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
It is quite easy to extract this when there is only 1 set of those strings inside parentheses, but when given the example above my regex captures the whole string.
The regex I am using is
/(\([a-zA-Z0-9\[\]:_+=-\s\.\(\),'óáéíúüçãôàäê><]+\))/g
It seems you want to match each (...) group that contains no further parentheses, except for a nested (...) that directly follows a word character (as in IN('3','6')).
You can use
\((?:[^()]|\b\([^()]*\))*\)
See the regex demo
The regex breakdown:
\( - matching a literal (
(?:[^()]|\b\([^()]*\))* - zero or more sequences of:
[^()] - any character other than ( and )
| - or...
\b\([^()]*\) - a word boundary (i.e. before that position, there must be a word character) followed by (, then zero or more characters other than ( and ), and then a )
\) - a closing )
An alternative pattern can be an unrolled one (more efficient with longer inputs):
\([^()]*(?:\b\([^()]*\)[^()]*)*\)
See another demo
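Both patterns can be run over the string from the question with Python's re module. A minimal sketch; the variable names INNER, INNER_UNROLLED, text and groups are made up for illustration.

```python
import re

# Pattern from the answer, plus its unrolled variant.
INNER = re.compile(r"\((?:[^()]|\b\([^()]*\))*\)")
INNER_UNROLLED = re.compile(r"\([^()]*(?:\b\([^()]*\)[^()]*)*\)")

text = (
    "([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6')) AND "
    "(((SUM_RevisionAnomalia_UltRevision_1M = 1) AND "
    "(CANT_ConsumoFact_UltRevision_1M > 1)) OR "
    "((SUM_RevisionNoAnomalia_UltRevision_1M + 1) AND "
    "(CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2))) OR "
    "(SUM_RevisionNoAnomalia_UltRevision_1M <= 1)"
)

# The patterns contain no capturing groups, so findall returns whole matches.
groups = INNER.findall(text)
for g in groups:
    print(g)
```

An opening parenthesis that is directly followed by another one, as in (((, cannot start a match: the next ( is neither a non-parenthesis character nor preceded by a word character, so the engine moves on to the innermost group.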

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of DNA such that it only considers every 3 characters (a codon of DNA) at a time?
For example, I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use a regular expression as is, for example
Regex.Replace(dna,"ACC","AAA"), it finds a match, but when looking at 3 characters at a time there should be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
// requires: using System; using System.Text;
public string DnaReplaceCodon(string input, string match, string replace) {
    if (match.Length != 3 || replace.Length != 3)
        throw new ArgumentOutOfRangeException();
    var output = new StringBuilder(input.Length);
    int i = 0;
    // walk the input one codon (3 characters) at a time
    while (i + 2 < input.Length) {
        if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
            output.Append(replace);
        } else {
            // copy the unmatched codon unchanged
            output.Append(input[i]);
            output.Append(input[i+1]);
            output.Append(input[i+2]);
        }
        i += 3;
    }
    // pick up trailing letters.
    while (i < input.Length) output.Append(input[i++]);
    return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, @"\G((?:.{3})*?)" + codon, "$1" + replacement);
DEMO
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codons, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only start from positions whose index is a multiple of 3 with:
\G which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), followed by the replacement codon.
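Python's built-in re module has no \G anchor, so the pattern above does not port directly. A different technique gives the same codon-aligned behaviour there: match every 3-character chunk and decide in a callback whether to swap it. This is a sketch of that alternative, not the \G construction; the function name replace_codon is made up for illustration.

```python
import re

def replace_codon(dna, codon, replacement):
    """Replace codon with replacement, looking only at aligned triplets."""
    if len(codon) != 3 or len(replacement) != 3:
        raise ValueError("codon and replacement must be 3 characters long")
    # Consecutive .{3} matches start at indexes 0, 3, 6, ... so every
    # match is one aligned codon; a trailing partial codon is left alone.
    return re.sub(
        r".{3}",
        lambda m: replacement if m.group(0) == codon else m.group(0),
        dna,
        flags=re.DOTALL,
    )

print(replace_codon("AAACCCTTTGGG", "ACC", "AAA"))  # unchanged
print(replace_codon("AAACCCTTTGGG", "CCC", "GGG"))
```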
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not make the same mistake.
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.