Are regex atomic groups distributive?

I.e. is (?>A?B?) always equivalent to (?>A?)(?>B?)?
If not, please provide a counterexample.

Atomic groups in general
The atomic group (?>regex1|regex2|regex3) takes only the first successful match within it. In other words, once the group has matched, the engine is not allowed to backtrack into it.
Regexes are evaluated left-to-right, so you express the order you intend things to match. The engine starts at the first position, trying to make a successful match, backtracking if necessary. If any path through the expression would lead to a successful match, then it will match at that position.
Atomic groups are not distributive. Consider these patterns evaluated over ABC:
(?>(AB?))(?>(BC)) (no match) and (?>(AB?)(BC)) (matches ABC).
Atomic Groups with all optional components
But your scenario, where both parts are optional, may be different.
Consider an atomic group with two greedy optional parts A and B ((A)? and (B)?). At any position, if A matches, the engine can move on to evaluate the optional B. If A doesn't match, that's fine too, because it's optional. Therefore, (A)? matches at any position. The same logic applies to the optional B. The remaining question is whether there can be any difference in backtracking.
In the case of all optional parts ((?>A?B?)), since each part always matches, there's no reason to backtrack within the atomic group, so it will always match. Then, since it is in an atomic group, it is prohibited from backtracking.
In the case of separate atomic groups ((?>A?)(?>B?)), each part always matches, and the engine is prohibited from backtracking in either case. This means the results will be the same.
To reiterate, the engine can only use the first possible match in (?>A?)(?>B?), which will always be the same match as the first possible match in (?>A?B?). Thus, if my reasoning is correct, for this special case the matches will be the same for multiple optional atomic groups as for a single atomic group with both optional components.

Since you didn't specify, I'll assume you're referring to Perl regexes, since I haven't seen the (?>) grouping operator in any other language.
Consider the following:
ra = 'A?'
rb = 'B?'
/(?>${ra} ${rb})/x is the same as /(?>${ra})(?>${rb})/x.
In this case, yes, it works either way; however, because (?>) disables backtracking, this is not the case with some other values of ra and rb.
For example, given:
ra = 'A*'
rb = 'AB*'
/(?>${ra} ${rb})/x != /(?>${ra})(?>${rb})/x.
In the latter, rb could never match, since ra would consume an entire sequence of A's, and would not allow backtracking. Note that this would work if we used (?:) as the grouping operator. Note also, that if we used capture groups (), then the match would be the same, but the side effects (assignment to \1, \2, ...) would be different.

Related

Is there any difference between (?>EXPR|) and (?:EXPR)?+

In the following I will consider the regular expressions (?>EXPR|) and (?:EXPR)?+.
Let's say we want to match the string ABC.
Using (?>A|AB|)C it will first try to match A, then it will fail (because the A character is not followed by C) and it will try to match AB without possibility of backtracking, so it will fail again (because the A character has already been consumed) and finally it will match the empty string, failing a third time. Two characters later, it will find the substring C, that clearly matches the pattern.
Using (?:A|AB)?+C it will first try to match A, then it will fail (because the A character is not followed by C) and it hasn't got the possibility to go further because of the possessive quantifier +. Two characters later, it will find the substring C, that clearly matches the pattern.
The question is: even if (?>EXPR|) and (?:EXPR)?+ work in different ways, are they semantically equivalent?
See the atomic group reference:
An atomic group is a group that, when the regex engine exits from it, automatically throws away all backtracking positions remembered by any tokens inside the group. Atomic groups are non-capturing. The syntax is (?>group). Lookaround groups are also atomic. Atomic grouping is supported by most modern regular expression flavors, including the JGsoft flavor, Java, PCRE, .NET, Perl, Boost, and Ruby. Most of these also support possessive quantifiers, which are essentially a notational convenience for atomic grouping.
Note that possessive quantifiers are a notational convenience for atomic grouping. They function in the same way: they make their patterns match once without allowing any backtracking into these patterns.
If you wrap a set of patterns with a non-capturing group and set a possessive quantifier to this group it behaves as an atomic group.
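An atomic group with an empty last alternative, (?>A|AB|), matches A, AB, or the empty string and then discards its backtracking positions, which is exactly what the possessive (?:A|AB)?+ does:
(?>A|AB|)C = (?:A|AB)?+C
Note that (?>A|AB)?C puts the ? outside the atomic group, so the engine can still backtrack to skipping the group altogether. It gives the same results in this example, because C can never match at a position where A matched, but in general (?>EXPR)? is not interchangeable with (?>EXPR|): against the string A, (?>A)?A matches while (?>A|)A does not.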

Is it possible to erase a capture group that has already matched, making it non-participating?

In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?
For example, take the following PCRE regex:
^(?:(z)?(?(1)aa|a)){2}
When fed the string zaazaa, it matches the whole string, as desired. But when fed zaaaa, I would like it to match zaaa; instead, it matches zaaaa, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2} but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)
An alternative way of doing this, which also doesn't work as desired:
^(?:(?:z()|())(?:\1aa|\2a)){2}
Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:
^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))
So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z is non-empty:
^(?:(z?)(?(?!.*$\1)aa|a)){2}
Or just using an emulated conditional:
^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}
I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).
It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *, ?, or {0,N} quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:
\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b
But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1 on every iteration of the loop:
^(?:(?=a|(b)).(?(1)_))*$
and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_.
I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*), i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.
To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.
Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/Javascript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.
I found this documented in PCRE's man page, under "DIFFERENCES BETWEEN PCRE2 AND PERL":
12. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE2 it is set to "b".
I'm struggling to think of a practical problem that cannot be better solved with an alternative solution, but in the interests of keeping it simple, here goes:
Suppose you have a simple task well-suited to being solved by using forward references; for example, check the input string is a palindrome. This cannot be solved generally with recursion (due to the atomic nature of subroutine calls), and so we bang out the following:
/^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$/
Easy enough. Now suppose we are asked to verify that every line in the input is a palindrome. Let's try to solve this by placing the expression in a repeated group:
\A(?:^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$(?:\n|\z))+\z
Clearly that doesn't work, since the value of \2 persists from the first line to the next. This is similar to the problem you're facing, and so here are a number of ways to overcome it:
1. Enclose the entire subexpression in (?!(?! )):
\A(?:(?!(?!^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$)).+(?:\n|\z))+\z
Very easy, just shove 'em in there and you're essentially good to go. Not a great solution if you want any particular captured values to persist.
2. Branch reset group to reset the value of capture groups:
\A(?|^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$|\n()()|\z)+\z
With this technique, you can reset the value of capture groups from the first (\1 in this case) up to a certain one (\2 here). If you need to keep \1's value but wipe \2, this technique will not work.
3. Introduce a group that captures the remainder of the string from a certain position to help you later identify where you are:
\A(?:^(?:(.)(?=.*(\1(?(2)(?=\2\3\z)\2))([\s\S]*)))*+.?\2$(?:\n|\z))+\z
The whole rest of the collection of lines is saved in \3, allowing you to reliably check whether you have progressed to the next line (when (?=\2\3\z) is no longer true).
This is one of my favourite techniques because it can be used to solve tasks that seem impossible, such as the ol' matching nested brackets using forward references. With it, you can maintain any other capture information you need. The only downside is that it's horribly inefficient, especially for long subjects.
4. This doesn't really answer the question, but it solves the problem:
\A(?![\s\S]*^(?!(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$))
This is the alternative solution I was talking about. Basically, "re-write the pattern" :) Sometimes it's possible, sometimes it isn't.
With PCRE (and, as far as I'm aware, every other engine) it's not possible to unset a capturing group. However, because subroutine calls do not retain capture values from a previous call, you can use them to accomplish the same task:
(?(DEFINE)((z)?(?(2)aa|a)))^(?1){2}
If you are going to implement a behavior in your own regex flavor to unset a capturing group, I'd strongly suggest not letting it happen automatically. Provide a flag for it instead.
This is partially possible in .NET's flavour of regex.
The first thing to note is that .NET records all of the captures for a given capture group, not just the latest. For instance, ^(?=(.)*) records each character in the first line as a separate capture in the group.
To actually delete captures, .NET regex has a construction known as balancing groups. The full format of this construction is (?<name1-name2>subexpression).
First, name2 must have previously been captured.
The subexpression must then match.
If name1 is present, the substring between the end of the capture of name2 and the start of the subexpression match is captured into name1.
The latest capture of name2 is then deleted. (This means that the old value could be backreferenced in the subexpression.)
The match is advanced to the end of the subexpression.
If you know you have name2 captured exactly once then it can readily be deleted using (?<-name2>); if you don't know whether you have name2 captured then you could use (?>(?<-name2>)?) or a conditional. The problem arises if you might have name2 captured more than once since then it depends on whether you can organise enough repetitions of the deletion of name2. ((?<-name2>)* doesn't work because * is equivalent to ? for zero-length matches.)
There is also another way to "erase" capture groups in .NET. Unlike the (?<-name>) method, this empties the group instead of deleting it – so instead of not matching, it will then match an empty string.
In .NET, groups with the same name can be captured multiple times, even if that name is a number. This allows PCRE expressions using branch reset groups to be ported to .NET. Consider this PCRE pattern:
(?|(pattern)|())
Assuming both groups are \1 above, then using this technique, in .NET it would become:
(?:(pattern)|(?<1>))
I used this technique today to make a 38 byte .NET regex that matches strings whose length is a fourth power:
^((?=(?>^((?<3>\3|x))|\3(\3\2))*$)){2}
The above is a port of the following 35 byte PCRE regex, which uses a branch reset group:
^((?=(?|^((\2|x))|\2(\2\3))*+$)){2}
(In this example, the capture group isn't actually being emptied. But this technique can be used to do anything a branch reset group can do, including emptying a group.)

Are atomic groups always used with alternation | inside?

Are atomic groups always used with alternation | inside? I get the impression from "all backtracking positions remembered by any tokens inside the group" from:
An atomic group is a group that, when the regex engine exits from it,
automatically throws away all backtracking positions remembered by any
tokens inside the group. Atomic groups are non-capturing. The syntax
is (?>group).
An example will make the behavior of atomic groups clear. The regular
expression a(bc|b)c (capturing group) matches abcc and abc. The regex
a(?>bc|b)c (atomic group) matches abcc but not abc.
Can you give an example where atomic groups are used without alternation | inside them? Thanks.
Alternations have nothing to do with atomic groups. The point of atomic groups is to avoid backtracking. There are two main reasons for this:
Avoid unneeded backtracking when a regex is going to fail to match anyway.
Avoid backtracking into a part of an expression where you don't want to find a match
You asked for an example of atomic grouping without alternations.
Let's look at both uses.
A. Avoid Backtracking on Failure
For example, consider these two strings:
name=Joe species=hamster food=carrot says:{I love carrots}
name=Joe species=hamster food=carrot says:{I love peas}
Let's say we want to find a string that is well-formed (it has the key=value tokens) and has carrots after the tokens, perhaps in the says part. One way to attempt this could be:
Non-Atomic Version
^(?:\w+=\w+\s+)*.*carrots
This will match the first string and not the second. We're happy. Or... are we really? There are two reasons to be unhappy. We'll look at the second reason in part B (the second main reason for atomic groups). So what's the first reason?
Well, when you debug the failure case in RegexBuddy, you see that the engine takes 401 steps before it decides it cannot match the second string. It takes that long because, after matching the tokens and failing to find carrots in says:{I love peas}, the engine backtracks into the (?:\w+=\w+\s+)* in the hope of finding carrots there. Now let's look at an atomic version.
An Atomic Version
^(?>(?:\w+=\w+\s+)*).*carrots
Here, the atomic group prevents the engine from backtracking into the (?:\w+=\w+\s+)*. The result is that on the second string, the engine fails in 64 steps. Quite a lot faster than 401!
B. Avoid Backtracking into part of String where Match is Not Desired
Keeping the same regexes, let's modify the strings slightly:
name=Joe species=hamster food=carrots says:{I love carrots}
name=Joe species=hamster food=carrots says:{I love peas}
Our atomic regex still works (it matches the first string but not the second).
However, the non-atomic regex now matches both strings! That is because after failing to find carrots in says:{I love peas}, the engine backtracks into the tokens and finds carrots in food=carrots.
Therefore, in this instance an atomic group is a handy tool to skip the portion of the string where we don't want to find carrots, while still making sure that it is well-formed.

getting at least 1 of 2 zero or more sets with a regular expression

How would I write a regular expression that allows for zero or more of one group, and zero or more of another group, but at least one of the two groups has to exist?
Specifically, I want to get a spreadsheet like reference, so it should get A1:B5 (for a whole region), A:A (for a whole column), or 5:5 (for a whole row).
I first tried
[A-Za-z]*[\d]*:[A-Za-z]*[\d]*
but this wouldn't be sufficient because then simply typing : or B6: would also satisfy those criteria.
Any help would be appreciated.
You can do this with grouping...
/((how)|(now))+/
If you want to match a range but not a cell reference, you could just enumerate the ways to do that:
([A-Z]:[A-Z])|(\d+:\d+)|([A-Z]\d+:[A-Z]\d+)
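One way would be an explicit alternation, with the longest alternative listed first so that an unanchored search does not stop at A1:B when given A1:B5:
(?:[a-zA-Z]+\d+|[a-zA-Z]+|\d+):(?:[a-zA-Z]+\d+|[a-zA-Z]+|\d+)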
If your engine supports lookbehind, however, you could use that:
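(?>[a-zA-Z]*\d*(?<=[a-zA-Z\d])):(?>[a-zA-Z]*\d*(?<=[a-zA-Z\d]))
This says "zero or more letters, followed by zero or more digits, which must end in at least one letter or digit". That guarantees neither side is empty. The lookbehind tests for a letter or digit rather than any character (.), because (?<=.) would also be satisfied by the colon itself or by whatever precedes the match. The atomic grouping (?>...) stops the engine from backtracking into the letters and digits once the group has matched.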
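The explicit-alternation approach can be checked against the inputs from the question. A minimal sketch in Python follows; the function name is_reference is made up for the example, and the whole input is validated with fullmatch. The pattern lists the letters-plus-digits alternative first, which matters only for unanchored searching; with fullmatch either order gives the same verdicts.

```python
import re

# One side of the reference: letters and digits, letters only, or digits only.
part = r"(?:[a-zA-Z]+\d+|[a-zA-Z]+|\d+)"
reference = re.compile(rf"{part}:{part}")

def is_reference(text):
    """Return True when the whole string is a range, column, or row reference."""
    return reference.fullmatch(text) is not None

for sample in ["A1:B5", "A:A", "5:5", ":", "B6:"]:
    print(sample, is_reference(sample))
```

The first three samples are accepted and the last two are rejected, because each side of the colon must contain at least one character.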

What is the difference between atomic and non-capturing groups?

What is an atomic group ((?>expr)), and what is it used for?
In https://www.regular-expressions.info/atomic.html, the only example is when expr is alternation, such as the regex a(?>bc|b)c matches abcc but not abc. Are there examples with expr not being alternation?
Are atomic groups and non-capturing groups ((?:expr)) the same thing?
When an atomic group is used, the regex engine won't backtrack into it to try further permutations, even if the rest of the regular expression fails to match for a given string.
Whenever you use an alternation, the regex will immediately try to match the rest of the expression if it is successful. Still, it will keep track of the position where other alternations are possible. If the rest of the expression is not matched, the regex will go back to the previously noted position and try the other combinations. If Atomic grouping had been used, the regex engine would not have kept track of the previous position and would just have given up matching.
The above example doesn't explain the purpose of using atomic groups. It just demonstrates the elimination of backtracking. Atomic groups would be used in specific scenarios where greedy quantifiers are used, and further combinations are possible even though there is no alternation.
Atomic and non-capturing groups are different. Non-capturing groups don't save the matched text, while atomic groups disable backtracking into the group, even when a further combination would be needed for the overall match.
For example, the regular expression a(?:bc|b)c matches both abcc and abc (without capturing the match), whilst a(?>bc|b)c only matches abcc. If the regex were a(?>b|bc)c, it would only match abc, whilst a(?:b|bc)c would still match both.
Atomic groups (and possessive quantifiers) are useful for avoiding catastrophic backtracking, which can be exploited by malicious users to trigger denial-of-service attacks by tying up a server's CPU.
Non-capturing groups are just that -- non-capturing. The regex engine can backtrack into a non-capturing group; not into an atomic group.
Are there examples with expr not being alternation?
Consider the following pattern:
(abc)?a
This finds a match in both abc and abca. But what happens when the optional part becomes atomic?
(?>(abc)?)a
It no longer finds a match in abc. It will never give up abc, so the final a fails.
As others have said, there are other situations where you might want to avoid backtracking, even if it has no effect on the final match, to optimise your regex.