What is the difference between atomic and non-capturing groups? - regex

What is an atomic group, ((?>expr)) and what is it used for?
In https://www.regular-expressions.info/atomic.html, the only example is when expr is alternation, such as the regex a(?>bc|b)c matches abcc but not abc. Are there examples with expr not being alternation?
Are atomic and non-capturing groups, ((?:expr)) the same thing?

When an atomic group is used, the regex engine will not backtrack into the group to try further permutations when the rest of the regular expression fails to match.
With an alternation, as soon as one alternative succeeds the engine tries to match the rest of the expression, but it keeps track of the position where the other alternatives are still possible. If the rest of the expression does not match, the engine goes back to that noted position and tries the other alternatives. With atomic grouping, the engine discards that position once the group has matched, so it simply gives up instead.
The above example doesn't explain the purpose of using atomic groups. It just demonstrates the elimination of backtracking. Atomic groups would be used in specific scenarios where greedy quantifiers are used, and further combinations are possible even though there is no alternation.
Atomic and non-capturing groups are different. Non-capturing groups don't save the matches' value, while atomic groups disable backtracking if further combinations are needed.
For example, the regular expression a(?:bc|b)c matches both abcc and abc (without capturing the match), whilst a(?>bc|b)c only matches abcc. If the regex was a(?>b|bc)c, it would only match abc, whilst a(?:b|bc)c would still match both.

Atomic groups (and the possessive modifier) are useful to avoid catastrophic backtracking, which can be exploited by malicious users to trigger denial-of-service attacks by tying up a server's CPU.
Non-capturing groups are just that -- non-capturing. The regex engine can backtrack into a non-capturing group; not into an atomic group.

Are there examples with expr not being alternation?
Consider the following pattern:
(abc)?a
This finds a match in both abc and abca. But what happens when the optional part becomes atomic?
(?>(abc)?)a
It no longer finds a match in abc. It will never give up abc, so the final a fails.
As others have said, there are other situations where you might want to avoid backtracking, even if it has no effect on the final match, to optimise your regex.

Related

Is there any difference between (?>EXPR|) and (?:EXPR)?+

In the following I will consider the regular expressions (?>EXPR|) and (?:EXPR)?+.
Let's say we want to match the string ABC.
Using (?>A|AB|)C it will first try to match A, then it will fail (because the A character is not followed by C) and it will try to match AB without possibility of backtracking, so it will fail again (because the A character has already been consumed) and finally it will match the empty string, failing a third time. Two characters later, it will find the substring C, that clearly matches the pattern.
Using (?:A|AB)?+C it will first try to match A, then it will fail (because the A character is not followed by C) and it hasn't got the possibility to go further because of the possessive quantifier +. Two characters later, it will find the substring C, that clearly matches the pattern.
The question is: even if (?>EXPR|) and (?:EXPR)?+ work in different ways, are they semantically equivalent?
See the atomic group reference:
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround groups are also atomic. Atomic grouping is supported by most modern regular expression flavors, including the JGsoft flavor, Java, PCRE, .NET, Perl, Boost, and Ruby. Most of these also support possessive quantifiers, which are essentially a notational convenience for atomic grouping.
Note that possessive quantifiers are a notational convenience for atomic grouping. They work in the same way: they make their patterns match once, without allowing any backtracking into these patterns.
If you wrap a set of patterns with a non-capturing group and set a possessive quantifier to this group it behaves as an atomic group.
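In this pattern, the optional atomic group (?>A|AB)? (atomic groups are non-capturing) behaves the same as (?>A|AB|), which matches A, AB or an empty string (so it is also optional in a way). This does not hold for every pattern, though: the ? applied to an atomic group can still be backtracked to zero repetitions, so (?>A)?A matches the string A while (?>A|)A and (?:A)?+A do not. Here it makes no difference, because C can never match at the position where A was found.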
(?>A|AB)?C = (?>A|AB|)C = (?:A|AB)?+C

Regex query efficient?

I came up with the below regex expression to look for terms like Password,Passphrase,Pass001 etc and the word following it. Is it efficient or can it be made better? Thanks for the help
"([Pp][aA][sS][Ss]([wW][oO][rR][dD][sS]?|[Pp][hH][rR][aA][sS][eE])?|[Pp]([aA][sS]([sS])?)?[wW][Dd])[0-9]?[0-9]?[0-9]?[\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*"
I will be using it to scan files up to 300K for these terms. When I try now to scan a whole C: drive with this expression, it takes 5 hours or, in the worst case I have encountered, 5 days.
You may use the following enhancement:
(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}[-\s:=_\/#&'\]\[()+*\r\n]\S*
Instead of [sS], you may make the regex case insensitive by adding (?i) case insensitive modifier. Use corresponding option in your software if it does not work like this.
Make sure your alternations do not match at the same location in the string. It is not quite easy here, but p at the start of each alternative in the first group decreases the regex efficiency. So, move it outside (e.g. (?:pass|port) => p(?:ass|ort)).
Use non-capturing groups rather than capturing ones if you are not going to access submatches; this also has a slight impact on performance.
Use limiting quantifiers instead of repeating ? quantified patterns. Instead of a?a?a?, use a{0,3}.
Do not overescape chars inside the character class. I only left \/, \] and \[ as I am not sure what regex flavor you are using, it might appear you can avoid escaping at all.
Note that there is a big performance penalty when consecutive variable-width patterns can match the same type of chars. You have [\s\:\-\=\_\/\#\&\'\[\(\+\*\r\n\)\]]+\S*: the character class with + matches 1 or more special chars, and \S* matches 0 or more chars other than whitespace, which include some of the chars the class also matches. Remove the + from the character class.
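To see what the suggested pattern picks up, here is a small sketch that assumes Python's re module as the regex flavor (the question does not say which one is used). The sample lines and the helper find_terms are hypothetical.

```python
import re

# Pattern copied from the answer; (?i) makes the whole pattern case-insensitive.
PASSWORD_RE = re.compile(
    r"(?i)p(?:ass(?:words?|phrase)?|(?:ass?)?wd)[0-9]{0,3}"
    r"[-\s:=_\/#&'\]\[()+*\r\n]\S*"
)


def find_terms(text):
    """Return every keyword hit together with the text that follows it."""
    return [m.group(0) for m in PASSWORD_RE.finditer(text)]


# Hypothetical sample lines, not taken from the question.
for line in ("pwd=abc123", "Pass001 secret", "passwd:x", "Password: hunter2"):
    print(repr(line), find_terms(line))
```

One thing the last sample shows: with the + removed, only a single separator is consumed, so Password: hunter2 yields just Password: because the space after the colon stops \S*. If the following word is needed after several separators, one option in keeping with this page would be to keep the + and make it possessive or atomic, in a flavor that supports that.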

Are atomic groups always used with alternation | inside?

Are atomic groups always used with alternation | inside? I get the impression from "all backtracking positions remembered by any tokens inside the group" from:
An atomic group is a group that, when the regex engine exits from it,
automatically throws away all backtracking positions remembered by any
tokens inside the group. Atomic groups are non-capturing. The syntax
is (?>group).
An example will make the behavior of atomic groups clear. The regular
expression a(bc|b)c (capturing group) matches abcc and abc. The regex
a(?>bc|b)c (atomic group) matches abcc but not abc.
Can you give an example where an atomic group is used without alternation | inside it? Thanks.
Alternations have nothing to do with atomic groups. The point of atomic groups is to avoid backtracking. There are two main reasons for this:
Avoid unneeded backtracking when a regex is going to fail to match anyway.
Avoid backtracking into a part of an expression where you don't want to find a match
You asked for an example of atomic grouping without alternations.
Let's look at both uses.
A. Avoid Backtracking on Failure
For example, consider these two strings:
name=Joe species=hamster food=carrot says:{I love carrots}
name=Joe species=hamster food=carrot says:{I love peas}
Let's say we want to find a string that is well-formed (it has the key=value tokens) and has carrots after the tokens, perhaps in the says part. One way to attempt this could be:
Non-Atomic Version
^(?:\w+=\w+\s+)*.*carrots
This will match the first string and not the second. We're happy. Or... are we really? There are two reasons to be unhappy. We'll look at the second reason in part B (the second main reason for atomic groups). So what's the first reason?
Well, when you debug the failure case in RegexBuddy, you see that it takes the engine 401 steps to decide it cannot match the second string. It takes that long because after matching the tokens and failing to match carrots in the says:{I love peas}, the engine backtracks into the (?:\w+=\w+\s+)* in the hope of finding carrots there. Now let's look at an atomic version.
An Atomic Version
^(?>(?:\w+=\w+\s+)*).*carrots
Here, the atomic group prevents the engine from backtracking into the (?:\w+=\w+\s+)*. The result is that on the second string, the engine fails in 64 steps. Quite a lot faster than 401!
B. Avoid Backtracking into part of String where Match is Not Desired
Keeping the same regexes, let's modify the strings slightly:
name=Joe species=hamster food=carrots says:{I love carrots}
name=Joe species=hamster food=carrots says:{I love peas}
Our atomic regex still works (it matches the first string but not the second).
However, the non-atomic regex now matches both strings! That is because after failing to find carrots in says:{I love peas}, the engine backtracks into the tokens, and finds carrots in food=carrots
Therefore, in this instance an atomic group is a handy tool to skip the portion of the string where we don't want to find carrots, while still making sure that it is well-formed.

Can DFA regex engines handle atomic groups?

According to this page (and some others), DFA regex engines can deal with capturing groups rather well. I'm curious about atomic groups (or possessive quantifiers), as I recently used them a lot and can't imagine how this could be done.
I disagree with the first part of the answer:
A DFA does not need to deal with constructs like atomic grouping.... Atomic Grouping is a way to help the engine finish a match, that would otherwise cause endless backtracking
Atomic groups are important not only for the speed of NFA engines; they also let you write simpler and less error-prone regexes. Let's say I needed to find all C-style multiline comments in a program. The exact regex would be something like:
start with the literal /*
eat anything of the following
any char except *
a * followed by anything but /
repeat this as much as possible
end with the literal */
This sounds a bit complicated, the regex
/\* ( [^*] | \*[^/] )+ \*/
is complicated and wrong (it doesn't handle /* foo **/ correctly). Using a reluctant (lazy) quantifier is better
/\* .*? \*/
but also wrong as it can eat the whole line
/* foo */ ##$!!**##$ /* bar */
when a later sub-expression fails on the garbage and the engine backtracks into the lazy quantifier. Putting the above in an atomic group solves the problem nicely:
(?> /\* .*? \*/ )
This works always (I hope) and is as fast as possible (for NFA). So I wonder if a DFA engine could somehow handle it.
A DFA does not need to deal with constructs like atomic grouping. A DFA is "text directed", unlike the NFA, which is "regex directed", in other words:
Atomic Grouping is a way to help the engine finish a match that would otherwise cause endless backtracking, as the (NFA) engine tries every possible permutation to find a match at a position where no match is even possible.
Atomic grouping, simply said, throws away backtracking positions. Since a DFA does not backtrack (the text to be matched is checked against the regex, not the regex against the text like a NFA - the DFA opens a branch for each decision), throwing away something that is not there is pointless.
I suggest Jeffrey E. F. Friedl's Mastering Regular Expressions (Google Books); he explains the general idea of a DFA:
DFA Engine: Text-Directed
Contrast the regex-directed NFA engine with an engine that, while
scanning the string, keeps track of all matches “currently in the
works.” In the tonight example, the moment the engine hits t, it adds
a potential match to its list of those currently in progress:
[...]
Each subsequent character scanned updates the list of possible
matches. After a few more characters are matched, the situation
becomes
[...]
with two possible matches in the works (and one alternative, knight,
ruled out). With the g that follows, only the third alternative
remains viable. Once the h and t are scanned as well, the engine
realizes it has a complete match and can return success.
I call this “text-directed” matching because each character scanned
from the text controls the engine. As in the example, a partial match
might be the start of any number of different, yet possible, matches.
Matches that are no longer viable are pruned as subsequent characters
are scanned. There are even situations where a “partial match in
progress” is also a full match. If the regex were ⌈to(…)?⌋, for
example, the parenthesized expression becomes optional, but it’s still
greedy, so it’s always attempted. All the time that a partial match is
in progress inside those parentheses, a full match (of 'to') is
already confirmed and in reserve in case the longer matches don’t pan
out.
(Source: http://my.safaribooksonline.com/book/programming/regular-expressions/0596528124/regex-directed-versus-text-directed/i87)
Concerning capturing groups and DFAs: as far as I was able to understand from your link, these approaches are not pure DFA engines but hybrids of DFA and NFA.

Are regex atomic groups distributive?

Are regex atomic groups distributive?
I.e. is (?>A?B?) always equivalent to (?>A?)(?>B?)?
If not please provide a counter example.
Atomic groups in general
The atomic group (?>regex1|regex2|regex3) takes only the first successful match within it. In other words, once the group has matched, the engine cannot backtrack into it to try another alternative.
Regexes are evaluated left-to-right, so you express the order you intend things to match. The engine starts at the first position, trying to make a successful match, backtracking if necessary. If any path through the expression would lead to a successful match, then it will match at that position.
Atomic groups are not distributive. Consider these patterns evaluated over ABC:
(?>(AB?))(?>(BC)) (no match) and (?>(AB?)(BC)) (matches ABC).
Atomic Groups with all optional components
But, your scenario where both parts are optional may be different.
Consider an atomic group with two greedy optional parts, A and B ((A)? and (B)?). At any position, if A matches, the engine can move on to evaluate the optional B. Otherwise, if A doesn't match, that's fine too, because it's optional. Therefore, (A)? matches at any position. The same logic applies to the optional B. The remaining question is whether there can be any difference in backtracking.
In the case of all optional parts ((?>A?B?)), since each part always matches, there's no reason to backtrack within the atomic group, so it will always match. Then, since it is in an atomic group, it is prohibited from backtracking.
In the case of separate atomic groups ((?>A?)(?>B?)), each part always matches, and the engine is prohibited from backtracking in either case. This means the results will be the same.
To reiterate, the engine can only use the first possible match in (?>A?)(?>B?), which will always be the same match as the first possible match in (?>A?B?). Thus, if my reasoning is correct, for this special case the matches will be the same for multiple optional atomic groups as for a single atomic group with both optional components.
Since you didn't specify, I'll assume you're referring to Perl regexes, since I haven't seen the (?>) grouping operator in any other language.
Consider the following:
ra = 'A?'
rb = 'B?'
/(?>${ra} ${rb})/x is the same as /(?>${ra})(?>${rb})/x.
In this case, yes, it works either way; however, because (?>) disables backtracking, this is not the case with some other values of ra and rb.
For example, given:
ra = 'A*'
rb = 'AB*'
/(?>${ra} ${rb})/x != /(?>${ra})(?>${rb})/x.
In the latter, rb could never match, since ra would consume an entire sequence of A's, and would not allow backtracking. Note that this would work if we used (?:) as the grouping operator. Note also, that if we used capture groups (), then the match would be the same, but the side effects (assignment to \1, \2, ...) would be different.