Nongreedy regex with alternation and repetition [duplicate] - regex

This question already has answers here:
Non-greedy regular expression match for multicharacter delimiters in awk
(3 answers)
Closed 8 years ago.
I am trying to match the contents between AB and BA using extended regex, for instance using awk.
Consider the two example strings AB12BABA and AB123BABA. I tried the following regex:
AB([^B]|([^B][^A]|B[^A]|[^B]A))*BA
But it matches the whole string (greedy) for both examples.
Can anyone explain how the regex engine works in this case, and how I should change my regex so that it works?

The BRE and ERE engines match according to the Leftmost Longest Rule, which differs from how Perl and other NFA-based regex engines match a regex.
The documentation of the Boost library is more detailed regarding the technical aspects, so I quote it here:
The Leftmost Longest Rule
Often there is more than one way of matching a regular expression at a particular location, for POSIX basic and extended regular expressions, the "best" match is determined as follows:
1. Find the leftmost match, if there is only one match possible at this location then return it.
2. Find the longest of the possible matches, along with any ties. If there is only one such possible match then return it.
3. If there are no marked sub-expressions, then all the remaining alternatives are indistinguishable; return the first of these found.
4. Find the match which has matched the first sub-expression in the leftmost position, along with any ties. If there is only one such match possible then return it.
5. Find the match which has the longest match for the first sub-expression, along with any ties. If there is only one such match then return it.
6. Repeat steps 4 and 5 for each additional marked sub-expression.
7. If there is still more than one possible match remaining, then they are indistinguishable; return the first one found.
Marked sub-expressions, as mentioned in the text, are () capturing groups. Note that they only capture; back-references are not supported.
Therefore, to get lazy matching, you need to construct a regular expression that matches the repeated part while avoiding the tail part until the very end. Since ERE and BRE are equivalent to theoretical regular expressions, as long as you can construct a DFA, there exists an equivalent regex that does the trick (although constructing it is not a trivial task in some cases).
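The effect of the rule on the regex from the question can be sketched in Python. Python's re module is a backtracking engine, so the helper below (leftmost_longest, a hypothetical name) only emulates the overall-match part of the rule by brute force: it tries each start position from the left and, for each start, the longest candidate end first. It ignores the sub-expression tie-breaking steps and is an illustration, not a description of how awk is implemented.

```python
import re

def leftmost_longest(pattern, text):
    """Return the leftmost, then longest, substring the pattern matches in full."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):               # leftmost start first
        for end in range(len(text), start - 1, -1):  # longest end first
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None

# The regex from the question, unchanged.
QUESTION_REGEX = r"AB([^B]|([^B][^A]|B[^A]|[^B]A))*BA"

for sample in ("AB12BABA", "AB123BABA"):
    posix_like = leftmost_longest(QUESTION_REGEX, sample)
    backtracking = re.search(QUESTION_REGEX, sample).group(0)
    print(sample, "->", posix_like, "(leftmost longest) vs", backtracking, "(Python re)")
```

Under the emulated rule the whole string is returned for both samples, which agrees with the behaviour described in the question, while the backtracking engine stops at the first BA it can reach.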
For your requirement, this regex should work:
AB([^B]|B+[^AB])*B*BA
The part ([^B]|B+[^AB])*B* matches any string that does not contain the string "BA".
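As a quick check, here is the regex applied to the two strings from the question, using Python's re module as a stand-in. This is only a sketch: Python is not a POSIX engine, but because the middle part cannot contain "BA", there is only one place where the match can end, so the result should not depend on the matching rule. The function name first_delimited is made up for the example.

```python
import re

# Regex from the answer; the group is made non-capturing for brevity.
FIXED = re.compile(r"AB(?:[^B]|B+[^AB])*B*BA")

def first_delimited(text):
    """Return the first AB...BA span in text, or None if there is none."""
    match = FIXED.search(text)
    return match.group(0) if match else None

print(first_delimited("AB12BABA"))   # AB12BA
print(first_delimited("AB123BABA"))  # AB123BA
```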
Derivation
This is the DFA for matching a string that does not contain the string "BA".
The notation here is not standard, so I will explain a bit:
State q1/B means that the state is named q1 (just like how you name a variable), B is the current progress towards matching BA.
* means any character in the alphabet. [^B] means any character in the alphabet except for B.
In the DFA, q0 and q1 are final states, q0 is the initial state. Note that q2 is a trap state, since it is a non-final state, and there is no transition out of this state.
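Since the diagram itself is not reproduced here, the following Python sketch rebuilds the DFA from the description above and from the equations for R0 and R1 that follow, then compares it with the regex for every string up to length 6 over a three-letter alphabet. The transition table is my reading of those equations (with C standing for any character other than A and B), not a copy of the original figure.

```python
import re
from itertools import product

# "other" stands for any character that is neither A nor B.
TRANSITIONS = {
    ("q0", "A"): "q0", ("q0", "B"): "q1", ("q0", "other"): "q0",
    ("q1", "A"): "q2", ("q1", "B"): "q1", ("q1", "other"): "q0",
    ("q2", "A"): "q2", ("q2", "B"): "q2", ("q2", "other"): "q2",  # trap state
}
ACCEPTING = {"q0", "q1"}

def dfa_accepts(text):
    """Run the DFA on text and report whether it ends in a final state."""
    state = "q0"
    for char in text:
        symbol = char if char in "AB" else "other"
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

NO_BA = re.compile(r"(?:[^B]|B+[^AB])*B*")

def check(max_length=6):
    """Return the strings on which the DFA, the regex and a substring test disagree."""
    bad = []
    for length in range(max_length + 1):
        for letters in product("ABC", repeat=length):
            text = "".join(letters)
            expected = "BA" not in text
            if dfa_accepts(text) != expected or (NO_BA.fullmatch(text) is not None) != expected:
                bad.append(text)
    return bad

print(check())
```

An empty list means that the DFA, the regex and the plain substring test agree on every string tried.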
Use the steps here, or just use JFLAP to derive the regular expression. (In JFLAP, you should use some character, such as C to represent [^AB]).
Since q2 is a trap state, we can exclude it from the formula:
R0 = [^B]R0 + BR1 + λ
R1 = [^AB]R0 + BR1 + λ
Apply Arden's theorem to R1:
R1 = B*([^AB]R0 + λ)
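For reference, the form of Arden's theorem used in this step (for equations with the unknown on the right of its coefficient, and assuming the coefficient Q does not contain the empty string) can be written as:

```latex
R = QR + P \quad\Longrightarrow\quad R = Q^{*}P, \qquad \lambda \notin Q
```

In the step above, Q = B and P = [^AB]R0 + λ, which gives R1 = B*([^AB]R0 + λ).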
Substitute R1 into R0:
R0 = [^B]R0 + BB*([^AB]R0 + λ) + λ
Distribute BB* over ([^AB]R0 + λ):
R0 = [^B]R0 + BB*[^AB]R0 + BB*λ + λ
Group together:
R0 = ([^B] + BB*[^AB])R0 + (BB* + λ)
Apply Arden's theorem to R0:
R0 = ([^B] + BB*[^AB])*(BB* + λ)
(BB* OR λ (empty string)) is equivalent to B*:
R0 = ([^B] + BB*[^AB])*B*
Let us rewrite it in awk's syntax: ([^B]|B+[^AB])*B*, which is what is shown above.

Use look arounds and a non greedy quantifier:
(?<=AB).*?(?=BA)
If you want to match the delimiters too, simply:
AB.*?BA
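Both patterns need a Perl-style engine; POSIX ERE, as used by awk, has neither lazy quantifiers nor lookarounds. A small Python sketch of what they return for the first sample string:

```python
import re

sample = "AB12BABA"

# Lazy quantifier with lookarounds: only the content between the delimiters.
inner = re.search(r"(?<=AB).*?(?=BA)", sample).group(0)

# Lazy quantifier alone: the delimiters are part of the match.
outer = re.search(r"AB.*?BA", sample).group(0)

print(inner)  # 12
print(outer)  # AB12BA
```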

Related

Match if there is a X in the first and X the second Y can be 0

I am currently working on a program. I need a regex for strings made of X and Y in which any two X's are separated by at least one Y. The numbers of X's and Y's do not have to be equal, but the string cannot contain two or more X's next to each other.
Examples:
# Don't match:
XXXYYYYY
# Match:
XYXYYYY
X
My try so far:
{Y*[X|^X]Y*[X|^X]Y*}*
The problem is that when both the first and the second position hold an X, the Y between them can still occur zero times. Can I test directly for a double X?
Since the other answers use look-ahead, this answer presents a solution using a vanilla regular expression:
^(X?Y)*X?$
The solution above assumes empty string is allowed. Otherwise:
^((X?Y)+X?|X)$
(Feel free to make the groups non-capturing)
Thanks to Unihedron for the simplification XY|Y to X?Y.
If anyone still has doubts about the validity of this answer, solve the equations below:
R1 = XR2 + YR3 + λ
R2 = YR3 + λ
R3 = XR2 + YR3 + λ
The DFA can be drawn from the equations above.
Remove the + λ in R1 if empty string is disallowed.
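Instead of solving the equations by hand, the two regexes can also be checked by brute force. The Python sketch below compares each of them with the plain condition "does not contain XX" on every string of X and Y up to length 10; the helper name strings_over_xy is made up for the example.

```python
import re
from itertools import product

ALLOW_EMPTY = re.compile(r"^(X?Y)*X?$")
NON_EMPTY = re.compile(r"^((X?Y)+X?|X)$")

def strings_over_xy(max_length):
    """Yield every string over the alphabet {X, Y} up to max_length characters."""
    for length in range(max_length + 1):
        for letters in product("XY", repeat=length):
            yield "".join(letters)

mismatches = []
for text in strings_over_xy(10):
    expected = "XX" not in text
    if bool(ALLOW_EMPTY.search(text)) != expected:
        mismatches.append(("allow empty", text))
    if bool(NON_EMPTY.search(text)) != (expected and text != ""):
        mismatches.append(("non-empty", text))

print(mismatches)  # []
```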
What's so unusual about it?
^(?:X(?!X)|Y)+$
Explanation: it's just a series of X and Y where an X cannot be followed by another X (negative lookahead).
^(?!.*?XX)X[YX]*$
Try this. It should fulfill your requirement. See the demo:
http://regex101.com/r/sU3fA2/8
You can use this pattern:
^Y*(XY+)*X?$
If you want to ensure that there is at least one character, you can check the length separately or add a lookahead at the beginning:
^(?=.)Y*(?:XY+)*X?$
About catastrophic backtracking:
If you use a DFA regex engine, there is no problem since there is no backtracking.
If you use an NFA regex engine, you can prevent catastrophic backtracking in several ways, for example:
^Y*(XY|Y)*X?$ # possible but not really efficient
^Y*(?>XY+)*X?$ # using an atomic group (if available)
^Y*(?:XY+)*+X?$ # using a possessive quantifier (if available)
^Y*(?=((XY+)*))\1X?$ # emulate `(?>(?:XY+)*)`
^Y*(?:(?=(XY+))\1)*X?$ # emulate `(?>XY+)*`
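The variants that do not rely on atomic groups or possessive quantifiers can be compared in Python. This sketch deliberately leaves out (?>...) and *+ because Python's re only accepts them from version 3.11 on. It checks that the plain pattern, the alternation and the two lookahead emulations all accept exactly the strings without XX; the dictionary keys are just labels chosen for the example.

```python
import re
from itertools import product

VARIANTS = {
    "plain": r"^Y*(XY+)*X?$",
    "alternation": r"^Y*(XY|Y)*X?$",
    "emulated atomic star": r"^Y*(?=((XY+)*))\1X?$",
    "emulated atomic group": r"^Y*(?:(?=(XY+))\1)*X?$",
}

def disagreements(max_length=10):
    """List (variant, string) pairs where a variant differs from 'no XX'."""
    found = []
    for length in range(max_length + 1):
        for letters in product("XY", repeat=length):
            text = "".join(letters)
            expected = "XX" not in text
            for name, pattern in VARIANTS.items():
                if (re.search(pattern, text) is not None) != expected:
                    found.append((name, text))
    return found

print(disagreements())
```

An empty list means the four patterns agree on every string tried; it says nothing about their relative speed.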

(V)C++ (2010) regular expressions, "recursive captures"

I want to match and capture the operators and operands of an expression like:
1
x
1 + x
x + y + 3 + 10
etc...
On regexpal, the pattern
(\w+)(\s*([+])\s*(\w+))*
appears to do it, but how do I obtain the matched captures? Notice that [+] and (\w+) are already inside one capture.
Unfortunately this is not possible (at least in any regex flavor that I know of). If one capturing group is used multiple times, the capture will always be filled with the last thing it captured. Simple example: ([a-z])* applied to abc will give you only c.
I recommend that you use the regex just to check for a valid format. Then you can split the string at the matches of \s*\b\s*. This should then result in an array containing x, +, y, +, 3, +, 10 for your last example.
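The same two points can be tried out in Python, whose re module also keeps only the last iteration of a repeated group. This is a sketch in a different language from the question, meant only to show the behaviour; the split on \s*\b\s* relies on splitting at empty matches, which Python supports from version 3.7, and the empty pieces it produces are filtered out.

```python
import re

EXPRESSION = re.compile(r"(\w+)(\s*([+])\s*(\w+))*")

text = "x + y + 3 + 10"

# A repeated group keeps only its last iteration.
match = EXPRESSION.fullmatch(text)
print(match.groups())  # ('x', ' + 10', '+', '10')
print(re.fullmatch(r"([a-z])*", "abc").group(1))  # c

# After validating the format, split at word boundaries and drop empty pieces.
tokens = [piece for piece in re.split(r"\s*\b\s*", text) if piece]
print(tokens)  # ['x', '+', 'y', '+', '3', '+', '10']
```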
Here is some example code that shows how to use regexes to split strings, using boost::regex.
Maybe this would be a better job for System.CodeDom.Compiler than for Regexes.
If Boost is an option for you, you can use boost::regex with the boost::match_extra flag; match_results::captures and sub_match::captures then contain the list of all captured items.

How to match variables in formula with regular expressions

If I have a formula similar to this: a + b - c * (exp(a*b) ) / 3
I want to match only variables(a, b, c). For me, [a-zA-Z]+ does the job. However I do not want to match exp function. How can I achieve this with regular expressions? I use javascript.
([a-zA-Z]+)\b(?!\s*\()
A more common notion of acceptable variable names would be
\b([a-zA-Z_]\w*)\b(?!\s*\()
With dots in function names it becomes
(?:[^.]|^)\b([a-zA-Z]+)\b(?!(\.|\s*\())
(the variable will be in the first capturing match)
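The question is about JavaScript, but the second pattern uses only constructs that Python's re module shares, so it can be tried there as a quick sketch on the formula from the question:

```python
import re

# Identifier that is not followed by an opening parenthesis.
VARIABLE = re.compile(r"\b([a-zA-Z_]\w*)\b(?!\s*\()")

formula = "a + b - c * (exp(a*b) ) / 3"
names = VARIABLE.findall(formula)

print(names)               # ['a', 'b', 'c', 'a', 'b']
print(sorted(set(names)))  # ['a', 'b', 'c']
```

The function name exp is skipped because it is directly followed by an opening parenthesis.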
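Changing it to \b[A-Za-z]\b will match single letters that do not stand next to other word characters. (The range [A-z] would also match the punctuation characters that sit between Z and a in ASCII.)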

Difference between * and + regex

Can anybody tell me the difference between the * and + operators in the example below:
[<>]+ [<>]*
Both are quantifiers. The star quantifier (*) means that the preceding expression can match zero or more times, like {0,}, while the plus quantifier (+) means that the preceding expression must match at least once, the same as {1,}.
To recap:
a* ---> a{0,} ---> matches a, aa, aaaaa or an empty string
a+ ---> a{1,} ---> matches a, aa, aaaa but not an empty string
* means zero-or-more, and + means one-or-more. So the difference is that the empty string would match the second expression but not the first.
+ means one or more of the previous atom. ({1,})
* means zero or more. This can match nothing, in addition to the characters specified in your square-bracket expression. ({0,})
Note that + is available in Extended and Perl-Compatible Regular Expressions, and is not available in Basic RE. * is available in all three RE dialects. Which dialect you are using most likely depends on the language you're in.
Pretty much, the only things in modern operating systems that still default to BRE are grep and sed (both of which have ERE capability as an option) and non-vim vi.
* means zero or more of the previous expression.
In other words, the expression is optional and may also repeat.
You might define an integer like this:
-*[0-9]+
In other words, zero or more negative signs followed by one or more digits. To allow at most one sign, use -?[0-9]+ instead.
They are quantifiers.
+ means 1 or many (at least one occurrence for the match to succeed)
* means 0 or many (the match succeeds regardless of the presence of the search string)
[<>]+ is same as [<>][<>]*
Here is an example to extend the answers above. Suppose we have this text:
100test10
test10
test
If we write \d+test\d+, the expression matches only 100test10, because it requires at least one digit on each side of test, whereas \d*test\d* matches all three lines.
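A quick way to check the behaviour of the two quantifiers is to run them, here in Python, against the three sample lines and against the empty string:

```python
import re

lines = ["100test10", "test10", "test"]

plus_hits = [line for line in lines if re.search(r"\d+test\d+", line)]
star_hits = [line for line in lines if re.search(r"\d*test\d*", line)]

print(plus_hits)  # ['100test10']
print(star_hits)  # ['100test10', 'test10', 'test']

# The empty string distinguishes [<>]+ from [<>]*.
print(re.fullmatch(r"[<>]+", "") is None)      # True
print(re.fullmatch(r"[<>]*", "") is not None)  # True
```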

What's a prefix regular expression?

I'm reading something that mentions prefix regular expressions, and cites as an example /^joey/
What's a prefix regular expression? Does that mean it starts with a caret?
In a regex, ^ at the start means "starts with".
/^joey/
would therefore match any string that starts with "joey", such as "joeyjoey" or "joey and jane".
A prefixed regular expression (PRE) is defined recursively:
The empty set ∅ and the empty string "" are PREs.
For each symbol a in the alphabet, "a" is a PRE.
If p and q are PREs denoting the regular sets P and Q, respectively, r is a regular expression denoting the regular set R such that ε belongs to R, and x belongs to the alphabet Σ, then the following expressions are also PREs:
p + q (union)
xp (concatenation with symbol x on the left)
pr (concatenation with an ε-regular expression on the right)
p* (star)
This definition was taken from "Fast Text Searching for Regular Expressions or Automaton Searching on Tries" by Ricardo A. Baeza-Yates and Gaston H. Gonnet.
In other words, a PRE is a regular expression whose language L contains only strings with some fixed prefix.
abc.* - is PRE
(A|B)cd - is not PRE
The caret means that you match the start of a string for example /^joey/ will match "joey is there" since the string starts with "joey" but not "Is joey around?" since joey is in the middle of the sentence.
It's not a standard term. Whoever wrote that obviously means a regex that matches only at the beginning of the target text, as the other responders have said. The caret is usually used for that purpose, but it can also mean the beginning of a logical line, if the match is being performed in multiline mode. Many regex flavors support an additional construct that matches the very beginning of the text regardless of the matching mode, \A being its usual form.
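The difference between the caret in single-line mode, the caret in multiline mode and \A can be seen in a short Python sketch (the sample strings are made up for the example):

```python
import re

PREFIX = re.compile(r"^joey")

print(bool(PREFIX.search("joey and jane")))    # True
print(bool(PREFIX.search("Is joey around?")))  # False

text = "first line\njoey on the second line"
print(bool(re.search(r"^joey", text)))                  # False
print(bool(re.search(r"^joey", text, re.MULTILINE)))    # True
print(bool(re.search(r"\Ajoey", text, re.MULTILINE)))   # False
```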