Is there any difference between (?>EXPR|) and (?:EXPR)?+ - regex

In the following I will consider the regular expressions (?>EXPR|) and (?:EXPR)?+.
Let's say we want to match the string ABC.
Using (?>A|AB|)C it will first try to match A, then it will fail (because the A character is not followed by C). Since the group is atomic, the engine cannot go back into it to try AB or the empty alternative at that position, so the attempt starting at the first character fails. Starting at B, neither A nor AB matches, the empty alternative does, and then C fails against B. Two characters after the start, the empty alternative matches again and the engine finds the substring C, that clearly matches the pattern.
Using (?:A|AB)?+C it will first try to match A, then it will fail (because the A character is not followed by C) and it hasn't got the possibility to go further because of the possessive quantifier +. Two characters later, it will find the substring C, that clearly matches the pattern.
The question is: even if (?>EXPR|) and (?:EXPR)?+ work in different ways, are they semantically equivalent?

See the atomic group reference:
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround groups are also atomic. Atomic grouping is supported by most modern regular expression flavors, including the JGsoft flavor, Java, PCRE, .NET, Perl, Boost, and Ruby. Most of these also support possessive quantifiers, which are essentially a notational convenience for atomic grouping.
Note that possessive quantifiers are a notational convenience for atomic grouping. They function in the same way: they make their patterns match once without allowing any backtracking into these patterns.
If you wrap a set of patterns in a non-capturing group and apply a possessive quantifier to this group, it behaves like an atomic group.
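The possessive form (?:A|AB)?+ is shorthand for (?>(?:A|AB)?), an atomic group that matches A, AB or nothing, and that is exactly what (?>A|AB|) does with its empty alternative (so, it is also optional in a way). The form (?>A|AB)?, with an ordinary ? outside the atomic group, is slightly weaker: the engine cannot backtrack into the group, but it can still fall back to zero repetitions of it. For example, (?>A)?A matches the string A, while (?>A|)A and (?:A)?+A do not. With C after the group that difference never shows, because C cannot match where A or AB was consumed.
(?>A|AB|)C = (?:A|AB)?+C, and for this pattern (?>A|AB)?C gives the same matches too.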

Related

How to re-use a capturing group to match a different alternation choice?

I have a group of words and another group with a conjunction. I’m looking for a regular expression that matches any single one of those words, demanding the conjunction in between:
If the words are (A|B|C)
and the conjunction is (&)
then do match A & C, C & B and even A & A
but don’t match A + C, A C or A & D
Practical example: Consider this platform-agnostic regex: /(Huey|Dewey|Louie) and \1/.
I want it to match “Huey and Louie” or “Dewey and Huey”, but it only matches “Huey and Huey”, because backreferences merely match previously matched texts.
I could repeat myself by using /(Huey|Dewey|Louie) and (Huey|Dewey|Louie)/ but I think there’s a smarter way of re-using capturing groups at a later time. Is that feasible somehow?
You can do this if you're using Perl (or a language with sufficiently compatible regexes):
/(Huey|Dewey|Louie) and (?1)/
The (?N) part is a "recursive subpattern", matching the same thing as the subregex in capturing group N. (The difference between this and backreferences like \N is that \N matches the same string that was matched by the capturing group. (?N) reuses the regex itself.)
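For flavors without (?N), a different technique gets a similar effect: build the pattern from a shared string fragment. The sketch below uses Python, whose built-in re module does not support (?1); the variable names are made up for illustration.

```python
import re

# Not a subpattern call: the alternation is simply reused as a string fragment.
NAME = r"(?:Huey|Dewey|Louie)"
reuse_pattern = re.compile(rf"{NAME} and {NAME}")

# The backreference version only accepts the same text twice.
backref_pattern = re.compile(r"(Huey|Dewey|Louie) and \1")

reuse_mixed = reuse_pattern.fullmatch("Huey and Louie") is not None
backref_mixed = backref_pattern.fullmatch("Huey and Louie") is not None
backref_same = backref_pattern.fullmatch("Huey and Huey") is not None

print(reuse_mixed, backref_mixed, backref_same)  # True False True
```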

How does the ? make a quantifier lazy in regex

I've been looking into regex lately and figured that the ? operator makes the *, +, or ? lazy. My question is how does it do that? Is it that *? for example is a special operator, or does the ? have an effect on the * ? In other words, does regex recognize *? as one operator in itself, or does regex recognize *? as the two separate operators * and ? ? If it is the case that *? is being recognized as two separate operators, how does the ? affect the * to make it lazy? If ? means that the * is optional, shouldn't this mean that the * doesn't have to exist at all? If so, then in a statement .*? wouldn't regex just match separate letters and the whole string instead of the shorter string? Please explain, I'm desperate to understand. Many thanks.
? can mean a lot of different things in different contexts.
Following a normal regex token (a character, a shorthand, a character class, a group...), it means "Match the previous item 0-1 times".
Following a quantifier like ?, *, +, {n,m}, it takes on a different meaning: "Make the previous quantifier lazy instead of greedy" (if greedy is the default; that can be changed, though. For example, in PHP the /U modifier makes all quantifiers lazy by default, so the additional ? makes them greedy).
Right after an opening parenthesis, it marks the start of a special construct like for example
a) (?s): mode modifiers ("turn on dotall mode")
b) (?:...): make the group non-capturing
c) (?=...) or (?!...): lookahead assertion
d) (?<=...) or (?<!...): lookbehind assertion
e) (?>...): atomic group
f) (?<foo>...): named capturing group
g) (?#comment): inline comments, ignored by the regex engine
h) (?(?=if)then|else): conditionals
and others. Not all constructs are available in all regex flavors.
Within a character class ([?]), it simply matches a verbatim ?.
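To see a few of these roles side by side, here is a small sketch in Python (the sample strings are made up; other flavors behave the same way for these particular constructs):

```python
import re

# After a normal token: "zero or one" (the u is optional).
optional = re.fullmatch(r"colou?r", "color") is not None

# After a quantifier: makes it lazy, so the match stops at the first ">".
lazy = re.search(r"<.+?>", "<a><b>").group(0)
greedy = re.search(r"<.+>", "<a><b>").group(0)

# After "(": starts a special construct, here a non-capturing group and a lookahead.
no_groups = re.search(r"(?:ab)+", "abab").groups()
lookahead = re.search(r"foo(?=bar)", "foobar").group(0)

# Inside a character class: a literal question mark.
literal = re.search(r"[?]", "why?").group(0)

print(optional, lazy, greedy, no_groups, lookahead, literal)
```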
I think a little history will make it easier to understand. When Larry Wall wanted to grow regex syntax to support new features, his options were severely limited. He couldn't just decree (for example) that % is now a metacharacter that supports new feature "XYZ". That would break the millions of existing regexes that happened to use % to match a literal percent sign.
What he could do is take an already-defined metacharacter and use it in such a way that its original function wouldn't make sense. For example, any regex that contained two quantifiers in a row would be invalid, so it was safe to say a ? after another quantifier now turns it into a reluctant quantifier (a much better name than "lazy" IMO; non-greedy good too). So the answer to your question is that ? doesn't modify the *, *? is a single entity: a reluctant quantifier. The same is true of the + in possessive quantifiers (*+, {0,2}+ etc.).
A similar process occurred with group syntax. It would never make sense to have a quantifier after an unescaped opening parenthesis, so it was safe to say (? now marks the beginning of a special group construct. But the question mark alone would only support one new feature, so the ? itself has to be followed by at least one more character to indicate which kind of group it is ((?:...), (?<!...), etc.). Again, the (?: is a single entity: the opening delimiter of a non-capturing group.
I don't know offhand why he used the question mark both times. I do know Perl 6 Rules (a bottom-up rewrite of Perl 5 regexes) has done away with all that crap and uses an infinitely more sensible syntax.
Imagine you have the following text:
BAAAAAAAAD
The following regexes will return:
/B(A+)/ => 'BAAAAAAAA'
/B(A+?)/ => 'BA'
/B(A*)/ => 'BAAAAAAAA'
/B(A*?)/ => 'B'
The addition of the "?" to the + and * operators makes them "lazy" - i.e. they will match the absolute minimum required for the expression to be true. Whereas by default the * and + operators are "greedy" and try to match AS MUCH AS POSSIBLE for the expression to be true.
Remember + means "one or more" so the minimum will be "one if possible, more if absolutely necessary" whereas the maximum will be "all if possible, one if absolutely necessary".
And * means "zero or more" so the minimum will be "nothing if possible, more if absolutely necessary" whereas the maximum will be "all if possible, zero if absolutely necessary".
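The four results above can be reproduced with a short script. This is a sketch in Python; the overall match (group 0) is what the list above shows.

```python
import re

text = "BAAAAAAAAD"
patterns = [r"B(A+)", r"B(A+?)", r"B(A*)", r"B(A*?)"]

# Map each pattern to the text of its overall match.
results = {p: re.search(p, text).group(0) for p in patterns}

for pattern, matched in results.items():
    print(pattern, "=>", repr(matched))
```

The lazy versions stop as soon as the rest of the pattern is satisfied. Since nothing follows the group here, A+? settles for a single A and A*? for none.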
This very much depends on the implementation, I guess. But since every quantifier I am aware of can be modified with ? it might be reasonable to implement it that way.

Are regex atomic groups distributive?

Are regex atomic groups distributive?
I.e. is (?>A?B?) always equivalent to (?>A?)(?>B?)?
If not please provide a counter example.
Atomic groups in general
The atomic group (?>regex1|regex2|regex3) takes only the first successful match within it. In other words, it doesn't allow backtracking.
Regexes are evaluated left-to-right, so you express the order you intend things to match. The engine starts at the first position, trying to make a successful match, backtracking if necessary. If any path through the expression would lead to a successful match, then it will match at that position.
Atomic groups are not distributive. Consider these patterns evaluated over ABC:
(?>(AB?))(?>(BC)) (no match) and (?>(AB?)(BC)) (matches ABC).
Atomic Groups with all optional components
But, your scenario where both parts are optional may be different.
Considering an atomic group with 2 greedy optional parts A and B ((A)? and (B)?). At any position, if A matches, it can move on to evaluate the optional B. Otherwise, if A doesn't match, that's fine, too because it's optional. Therefore, (A)? matches at any position. The same logic applies for the optional B. The question remaining is whether there can be any difference in backtracking.
In the case of all optional parts ((?>A?B?)), since each part always matches, there's no reason to backtrack within the atomic group, so it will always match. Then, since it is in an atomic group, it is prohibited from backtracking.
In the case of separate atomic groups ((?>A?)(?>B?)), each part always matches, and the engine is prohibited from backtracking in either case. This means the results will be the same.
To reiterate, the engine can only use the first possible match in (?>A?)(?>B?), which will always be the same match as the first possible match in (?>A?B?). Thus, if my reasoning is correct, for this special case the matches will be the same for multiple optional atomic groups as for a single atomic group with both optional components.
Since you didn't specify, I'll assume you're referring to Perl regexes, since I haven't seen the (?>) grouping operator in any other language.
Consider the following:
ra = 'A?'
rb = 'B?'
/(?>${ra} ${rb})/x is the same as /(?>${ra})(?>${rb})/x.
In this case, yes, it works either way; however, because (?>) disables backtracking, this is not the case with some other values of ra and rb.
For example, given:
ra = 'A*'
rb = 'AB*'
/(?>${ra} ${rb})/x != /(?>${ra})(?>${rb})/x.
In the latter, rb could never match, since ra would consume an entire sequence of A's, and would not allow backtracking. Note that this would work if we used (?:) as the grouping operator. Note also, that if we used capture groups (), then the match would be the same, but the side effects (assignment to \1, \2, ...) would be different.

What is the difference between atomic and non-capturing groups?

What is an atomic group, ((?>expr)) and what is it used for?
In https://www.regular-expressions.info/atomic.html, the only example is when expr is alternation, such as the regex a(?>bc|b)c matches abcc but not abc. Are there examples with expr not being alternation?
Are atomic and non-capturing groups, ((?:expr)) the same thing?
When atomic groups are used, the regex engine won't backtrack into the group to try further permutations when the rest of the regular expression fails to match for a given string.
Whenever you use an alternation, the regex will immediately try to match the rest of the expression if it is successful. Still, it will keep track of the position where other alternations are possible. If the rest of the expression is not matched, the regex will go back to the previously noted position and try the other combinations. If Atomic grouping had been used, the regex engine would not have kept track of the previous position and would just have given up matching.
The above example doesn't explain the purpose of using atomic groups. It just demonstrates the elimination of backtracking. Atomic groups would be used in specific scenarios where greedy quantifiers are used, and further combinations are possible even though there is no alternation.
Atomic and non-capturing groups are different. Non-capturing groups don't save the matches' value, while atomic groups disable backtracking if further combinations are needed.
For example, the regular expression a(?:bc|b)c matches both abcc and abc (without capturing the match), whilst a(?>bc|b)c only matches abcc. If the regex was a(?>b|bc)c, it would only match abc, whilst a(?:b|bc)c would still match both.
Atomic groups (and the possessive modifier) are useful to avoid catastrophic backtracking - which can be exploited by malicious users to trigger denial of service attacks by tying up a server's CPU.
Non-capturing groups are just that -- non-capturing. The regex engine can backtrack into a non-capturing group; not into an atomic group.
Are there examples with expr not being alternation?
Consider the following pattern:
(abc)?a
This finds a match in both abc and abca. But what happens when the optional part becomes atomic?
(?>(abc)?)a
It no longer finds a match in abc. It will never give up abc, so the final a fails.
As others have said, there are other situations where you might want to avoid backtracking, even if it has no effect on the final match, to optimise your regex.

What's the difference between () and [] in a regex?

Let's say:
/(a|b)/ vs /[ab]/
There's not much difference in your above example (in most languages). The major difference is that the () version creates a group that can be backreferenced by \1 in the match (or, sometimes, $1). The [] version doesn't do this.
Also,
/(ab|cd)/ # matches 'ab' or 'cd'
/[abcd]/ # matches 'a', 'b', 'c' or 'd'
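A quick illustration of both points in Python (the input strings are made up):

```python
import re

# The group alternates between whole sequences; the class picks single characters.
group_hits = re.findall(r"(ab|cd)", "abcd")
class_hits = re.findall(r"[abcd]", "abcd")
print(group_hits)  # ['ab', 'cd']
print(class_hits)  # ['a', 'b', 'c', 'd']

# Only the parenthesised version records a group that can be read back.
captured = re.search(r"(a|b)", "xb").group(1)
class_groups = re.search(r"[ab]", "xb").groups()
print(captured)      # b
print(class_groups)  # ()
```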
() in regular expression is used for grouping regular expressions, allowing you to apply operators to an entire expression rather than a single character. For instance, if I have the regular expression ab, then ab* refers to an a followed by any number of bs (for instance, a, ab, abb, etc), while (ab)* refers to any number of repetitions of the sequence ab (for instance, the empty string, ab, abab, etc). In many regular expression engines, () are also used for creating references that can be referred to after matching. For instance, in Ruby, after you execute "foo" =~ /f(o*)/, $1 will contain oo.
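The grouping and capturing examples translate to Python roughly like this (a sketch; Python exposes the capture as group(1) rather than $1):

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# ab* is an a followed by any number of b; (ab)* repeats the whole sequence.
print(whole(r"ab*", "abb"), whole(r"ab*", "abab"))  # True False
print(whole(r"(ab)*", "abab"), whole(r"(ab)*", ""))  # True True

# The parentheses also record what they matched.
captured = re.match(r"f(o*)", "foo").group(1)
print(captured)  # oo
```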
| in a regular expression indicates alternation; it means the expression before the bar, or the expression after it. You could match any digit with the expression 0|1|2|3|4|5|6|7|8|9. You will frequently see alternation wrapped in a set of parentheses for the purposes of grouping or capturing a sub-expression, but it is not required. You can use alternation on longer expressions as well, like foo|bar, to indicate either foo or bar.
You can express every regular expression (in the formal, theoretical sense, not the extended sense that many languages use), with just alternation |, kleene closure *, concatenation (just writing two expressions next to each other with nothing in between), and parentheses for grouping. But that would be rather inconvenient for complicated expressions, so several shorthands are commonly available. For instance, x? is just a shorthand for |x (that is, the empty string or x), while y+ is a shorthand for yy*.
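The two shorthands can be compared against their expansions on a few sample strings. This Python sketch uses whole-string matching, because in a backtracking engine |x and x? try their options in a different order and only the set of accepted strings is the same. The sample list is made up.

```python
import re

samples = ["", "x", "xx", "y", "yy", "yyy", "xy"]  # hypothetical test inputs

def same_language(p, q):
    """True if p and q accept exactly the same strings among the samples."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )

optional_ok = same_language(r"x?", r"(?:|x)")  # x? versus "empty string or x"
plus_ok = same_language(r"y+", r"yy*")         # y+ versus "y followed by y*"
print(optional_ok, plus_ok)  # True True
```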
[] are basically a shorthand for the alternation | of all of the characters, or ranges of characters, within it. As I said, I could write 0|1|2|3|4|5|6|7|8|9, but it's much more convenient to write [0-9]. I can also write [a-zA-Z] to represent any letter. Note that while [] do provide grouping, they do not generally introduce a new reference that can be referred to later on; you would have to wrap them in parentheses for that, like ([a-zA-Z]).
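As a small check that the class and the spelled-out alternation accept the same single characters, here is a Python sketch that runs both over the printable ASCII characters from string.printable:

```python
import re
import string

char_class = re.compile(r"[0-9]")
alternation = re.compile(r"0|1|2|3|4|5|6|7|8|9")

# Collect every printable ASCII character on which the two patterns disagree.
disagreements = [
    c
    for c in string.printable
    if bool(char_class.fullmatch(c)) != bool(alternation.fullmatch(c))
]
print(disagreements)  # []
```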
So, your two example regular expressions are equivalent in what they match, but the (a|b) will set the first sub-match to the matching character, while [ab] will not create any references to sub-matches.
First, when speaking about regexes, it's often important to specify what sort of regexes you're talking about. There are several variations (such as the traditional POSIX regexes, Perl and Perl-compatible regexes (PCRE), etc.).
Assuming PCRE or something very similar, which is probably the most common flavor these days, there are three key differences:
Using parenthetical groups, you can check options consisting of more than one character. So /(a|b)/ might instead be /(abc|defg)/.
Parenthetical groups perform a capture operation so that you can extract the result (so that if it matched on "b", you can get "b" back and see that). /[ab]/ does not. The capture operation can be overridden by adding ?: like so: /(?:a|b)/
Even if you override the capture behavior of parentheses, the underlying implementation may still be faster for [] when you're checking single characters (although nothing says non-capturing (?:a|b) can't be optimized as a special case into [ab], but regex compilation may take ever so slightly longer).
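The first two differences are easy to observe. The sketch below uses Python's re module, which is close to PCRE for these constructs; the compiled pattern's groups attribute reports how many capturing groups it has.

```python
import re

# Parenthesised alternatives may be longer than one character.
multi = re.search(r"(abc|defg)", "xxdefgxx").group(1)
print(multi)  # defg

# Number of capturing groups in each variant.
capturing = re.compile(r"(a|b)").groups
non_capturing = re.compile(r"(?:a|b)").groups
char_class = re.compile(r"[ab]").groups
print(capturing, non_capturing, char_class)  # 1 0 0
```

Whether [ab] is actually faster than (?:a|b) depends on the engine, so this sketch does not try to measure it.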