Are atomic groups always used with alternation | inside? - regex

Are atomic groups always used with alternation | inside? I get the impression from "all backtracking positions remembered by any tokens inside the group" from:
An atomic group is a group that, when the regex engine exits from it,
automatically throws away all backtracking positions remembered by any
tokens inside the group. Atomic groups are non-capturing. The syntax
is (?>group).
An example will make the behavior of atomic groups clear. The regular
expression a(bc|b)c (capturing group) matches abcc and abc. The regex
a(?>bc|b)c (atomic group) matches abcc but not abc.
Can you give an example where atomic groups are used without alternation | inside? Thanks.

Alternations have nothing to do with atomic groups. The point of atomic groups is to avoid backtracking. There are two main reasons for this:
Avoid unneeded backtracking when a regex is going to fail to match anyway.
Avoid backtracking into a part of an expression where you don't want to find a match.
You asked for an example of atomic grouping without alternations.
Let's look at both uses.
A. Avoid Backtracking on Failure
For example, consider these two strings:
name=Joe species=hamster food=carrot says:{I love carrots}
name=Joe species=hamster food=carrot says:{I love peas}
Let's say we want to find a string that is well-formed (it has the key=value tokens) and has carrots after the tokens, perhaps in the says part. One way to attempt this could be:
Non-Atomic Version
^(?:\w+=\w+\s+)*.*carrots
This will match the first string and not the second. We're happy. Or... are we really? There are two reasons to be unhappy. We'll look at the second reason in part B (the second main reason for atomic groups). So what's the first reason?
Well, when you debug the failure case in RegexBuddy, you see that the engine takes 401 steps before it decides it cannot match the second string. It takes that long because, after matching the tokens and failing to find carrots in says:{I love peas}, the engine backtracks into the (?:\w+=\w+\s+)* in the hope of finding carrots there. Now let's look at an atomic version.
An Atomic Version
^(?>(?:\w+=\w+\s+)*).*carrots
Here, the atomic group prevents the engine from backtracking into the (?:\w+=\w+\s+)*. The result is that on the second string, the engine fails in 64 steps. Quite a lot faster than 401!
B. Avoid Backtracking into part of String where Match is Not Desired
Keeping the same regexes, let's modify the strings slightly:
name=Joe species=hamster food=carrots says:{I love carrots}
name=Joe species=hamster food=carrots says:{I love peas}
Our atomic regex still works (it matches the first string but not the second).
However, the non-atomic regex now matches both strings! That is because after failing to find carrots in says:{I love peas}, the engine backtracks into the tokens and finds carrots in food=carrots.
Therefore, in this instance an atomic group is a handy tool to skip the portion of the string where we don't want to find carrots, while still making sure that it is well-formed.
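Both behaviours can be reproduced in Python. This is only a sketch, not the RegexBuddy setup used above: it emulates the atomic group with the lookahead-plus-backreference idiom (?=(...))\1, which works on any version of Python's re module (Python 3.11 and later also accept (?>...) directly). The constant and function names are made up for illustration.

```python
import re

# Non-atomic version from the answer.
NON_ATOMIC = re.compile(r"^(?:\w+=\w+\s+)*.*carrots")

# Emulated atomic version: the lookahead captures the key=value tokens and
# \1 then consumes exactly that text. A lookahead is not re-entered once it
# has succeeded, so the tokens cannot be given back.
ATOMIC = re.compile(r"^(?=((?:\w+=\w+\s+)*))\1.*carrots")

def finds_carrots(regex, line):
    """Return True if the regex matches the line."""
    return regex.search(line) is not None

love_carrots = "name=Joe species=hamster food=carrots says:{I love carrots}"
love_peas = "name=Joe species=hamster food=carrots says:{I love peas}"

print(finds_carrots(NON_ATOMIC, love_peas))   # True: found in food=carrots
print(finds_carrots(ATOMIC, love_peas))       # False: the tokens are skipped
print(finds_carrots(ATOMIC, love_carrots))    # True
```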

Related

Is it possible to erase a capture group that has already matched, making it non-participating?

In PCRE2 or any other regex engine supporting forward backreferences, is it possible to change a capture group that matched in a previous iteration of a loop into a non-participating capture group (also known as an unset capture group or non-captured group), causing conditionals that test that group to match with their "false" clause rather than their "true" clause?
For example, take the following PCRE regex:
^(?:(z)?(?(1)aa|a)){2}
When fed the string zaazaa, it matches the whole string, as desired. But when fed zaaaa, I would like it to match zaaa; instead, it matches zaaaa, the whole string. (This is just for illustration. Of course this example could be handled by ^(?:zaa|a){2} but that is beside the point. Practical usage of capture group erasure would tend to be in loops that most often do far more than 2 iterations.)
An alternative way of doing this, which also doesn't work as desired:
^(?:(?:z()|())(?:\1aa|\2a)){2}
Note that both of these work as desired when the loop is "unrolled", because they no longer have to erase a capture that has already been made:
^(?:(z)?(?(1)aa|a))(?:(z)?(?(2)aa|a))
^(?:(?:z()|())(?:\1aa|\2a))(?:(?:z()|())(?:\3aa|\4a))
So instead of being able to use the simplest form of conditional, a more complicated one must be used, which only works in this example because the "true" match of z is non-empty:
^(?:(z?)(?(?!.*$\1)aa|a)){2}
Or just using an emulated conditional:
^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}
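As a quick check of the emulated conditional, here is a sketch in Python, whose re module accepts backreferences inside lookaheads. Only the pattern comes from the question; the variable names are illustrative. Because (z?) is re-captured on every iteration, group 1 is empty rather than stale on the second pass.

```python
import re

# (z?) always participates, so \1 is either "z" or "".
# (?!.*$\1) succeeds only when \1 is non-empty: at the end of the string
# an empty \1 matches trivially, while "z" cannot.
EMULATED = re.compile(r"^(?:(z?)(?:(?!.*$\1)aa|(?=.*$\1)a)){2}")

for subject in ("zaazaa", "zaaaa"):
    m = EMULATED.match(subject)
    print(subject, "->", m.group(0))
# zaazaa -> zaazaa
# zaaaa -> zaaa
```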
I have scoured all the documentation I can find, and there seems not to even be any mention or explicit description of this behavior (that captures made within a loop persist through iterations of that loop even when they fail to be re-captured).
It's different than what I intuitively expected. The way I would implement it is that evaluating a capture group with 0 repetitions would erase/unset it (so this could happen to any capture group with a *, ?, or {0,N} quantifier), but skipping it due to being in a parallel alternative within the same group in which it gained a capture during a previous iteration would not erase it. Thus, this regex would still match words iff they contain at least one of every vowel:
\b(?:a()|e()|i()|o()|u()|\w)++\1\2\3\4\5\b
But skipping a capture group due to it being inside an unevaluated alternative of a group that is evaluated with nonzero repetitions which is nested within the group in which the capture group took on a value during a previous iteration would erase/unset it, so this regex would be able to either capture or erase group \1 on every iteration of the loop:
^(?:(?=a|(b)).(?(1)_))*$
and would match strings such as aaab_ab_b_aaaab_ab_aab_b_b_aaa. However, the way forward references are actually implemented in existing engines, it matches aaaaab_a_b_a_a_b_b_a_b_b_b_.
I would like to know the answer to this question not merely because it would be useful in constructing regexes, but because I have written my own regex engine, currently ECMAScript-compatible with some optional extensions (including molecular lookahead (?*), i.e. non-atomic lookahead, which as far as I know, no other engine has), and I would like to continue adding features from other engines, including forward/nested backreferences. Not only do I want my implementation of forward backreferences to be compatible with existing implementations, but if there isn't a way of erasing capture groups in other engines, I will probably create a way of doing it in my engine that doesn't conflict with other existing regex features.
To be clear: An answer stating that this is not possible in any mainstream engines will be acceptable, as long as it is backed up by adequate research and/or citing of sources. An answer stating that it is possible would be much easier to state, since it would require only one example.
Some information on what a non-participating capture group is:
http://blog.stevenlevithan.com/archives/npcg-javascript - this is the article that originally introduced me to the idea.
https://www.regular-expressions.info/backref2.html - the first section on this page gives a brief explanation.
In ECMAScript/Javascript regexes, backreferences to NPCGs always match (making a zero-length match). In pretty much every other regex flavor, they fail to match anything.
I found this documented in PCRE's man page, under "DIFFERENCES BETWEEN PCRE2 AND PERL":
12. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE2 it is set to "b".
I'm struggling to think of a practical problem that cannot be better solved with an alternative solution, but in the interests of keeping it simple, here goes:
Suppose you have a simple task well-suited to being solved by using forward references; for example, check the input string is a palindrome. This cannot be solved generally with recursion (due to the atomic nature of subroutine calls), and so we bang out the following:
/^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$/
Easy enough. Now suppose we are asked to verify that every line in the input is a palindrome. Let's try to solve this by placing the expression in a repeated group:
\A(?:^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$(?:\n|\z))+\z
Clearly that doesn't work, since the value of \2 persists from the first line to the next. This is similar to the problem you're facing, and so here are a number of ways to overcome it:
1. Enclose the entire subexpression in (?!(?! )):
\A(?:(?!(?!^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$)).+(?:\n|\z))+\z
Very easy, just shove 'em in there and you're essentially good to go. Not a great solution if you want any particular captured values to persist.
2. Branch reset group to reset the value of capture groups:
\A(?|^(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$|\n()()|\z)+\z
With this technique, you can reset the value of capture groups from the first (\1 in this case) up to a certain one (\2 here). If you need to keep \1's value but wipe \2, this technique will not work.
3. Introduce a group that captures the remainder of the string from a certain position to help you later identify where you are:
\A(?:^(?:(.)(?=.*(\1(?(2)(?=\2\3\z)\2))([\s\S]*)))*+.?\2$(?:\n|\z))+\z
The whole rest of the collection of lines is saved in \3, allowing you to reliably check whether you have progressed to the next line (when (?=\2\3\z) is no longer true).
This is one of my favourite techniques because it can be used to solve tasks that seem impossible, such as the ol' matching nested brackets using forward references. With it, you can maintain any other capture information you need. The only downside is that it's horribly inefficient, especially for long subjects.
4. This doesn't really answer the question, but it solves the problem:
\A(?![\s\S]*^(?!(?:(.)(?=.*(\1(?(2)\2))))*+.?\2$))
This is the alternative solution I was talking about. Basically, "re-write the pattern" :) Sometimes it's possible, sometimes it isn't.
With PCRE (and every other engine I'm aware of) it's not possible to unset a capturing group. However, because subroutine calls do not retain capture values from a previous call, you can use them to accomplish the same task:
(?(DEFINE)((z)?(?(2)aa|a)))^(?1){2}
If you are going to implement a behavior in your own regex flavor to unset a capturing group, I'd strongly suggest not letting it happen automatically. Provide a flag for it instead.
This is partially possible in .NET's flavour of regex.
The first thing to note is that .NET records all of the captures for a given capture group, not just the latest. For instance, ^(?=(.)*) records each character in the first line as a separate capture in the group.
To actually delete captures, .NET regex has a construction known as balancing groups. The full format of this construction is (?<name1-name2>subexpression).
First, name2 must have previously been captured.
The subexpression must then match.
If name1 is present, the substring between the end of the capture of name2 and the start of the subexpression match is captured into name1.
The latest capture of name2 is then deleted. (This means that the old value could be backreferenced in the subexpression.)
The match is advanced to the end of the subexpression.
If you know you have name2 captured exactly once then it can readily be deleted using (?<-name2>); if you don't know whether you have name2 captured then you could use (?>(?<-name2>)?) or a conditional. The problem arises if you might have name2 captured more than once since then it depends on whether you can organise enough repetitions of the deletion of name2. ((?<-name2>)* doesn't work because * is equivalent to ? for zero-length matches.)
There is also another way to "erase" capture groups in .NET. Unlike the (?<-name>) method, this empties the group instead of deleting it – so instead of not matching, it will then match an empty string.
In .NET, groups with the same name can be captured multiple times, even if that name is a number. This allows PCRE expressions using branch reset groups to be ported to .NET. Consider this PCRE pattern:
(?|(pattern)|())
Assuming both groups are \1 above, then using this technique, in .NET it would become:
(?:(pattern)|(?<1>))
I used this technique today to make a 38 byte .NET regex that matches strings whose length is a fourth power:
^((?=(?>^((?<3>\3|x))|\3(\3\2))*$)){2}
The above is a port of the following 35 byte PCRE regex, which uses a branch reset group:
^((?=(?|^((\2|x))|\2(\2\3))*+$)){2}
(In this example, the capture group isn't actually being emptied. But this technique can be used to do anything a branch reset group can do, including emptying a group.)

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill, either because they restrict their question to a very specific example or to a specific use case (e.g. single chars only), or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g. C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something more messy such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field, which I was hoping to keep clear. Also, everything else is matched, but only on a character-by-character basis instead of in big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is Not Discontinuous, it is continuous !!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single Match, there is no inverting (i.e. match not this only) that can extend past
a negative thing.
This is a tenet of Regular Expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, it's not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
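The pitfall is easy to reproduce. A small Python sketch (the variable names are illustrative, and the word boundaries are dropped for brevity): the tempered pattern stops before over, but the next attempt resumes one character later and picks up ver. Consuming over as part of each match, while capturing only the text before it, avoids that.

```python
import re

text = "The cow jumps over the moon"

# Tempered dot alone: the second match starts inside "over".
naive = re.findall(r"(?:(?!over).)+", text)
print(naive)      # ['The cow jumps ', 'ver the moon']

# Consume "over" as part of each match and capture only what precedes it.
captured = [m.group(1)
            for m in re.finditer(r"((?:(?!over).)+)(?:over)?", text)]
print(captured)   # ['The cow jumps ', ' the moon']
```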
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned in your question, when you already have an efficient pattern that you use with a match method, the easiest (and most efficient) way to obtain the complement is to use a split method with the same pattern.
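A sketch of that split-style approach in Python. The helper name complement is made up here; it walks the matches with re.finditer and collects the text between them. That yields the same pieces as re.split for these inputs, and it stays correct when the pattern contains capture groups (re.split would interleave the captured text with the pieces).

```python
import re

def complement(pattern, text):
    """Return the non-empty stretches of text not covered by any match."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return [p for p in pieces if p]

text = "The cow jumps over the moon"
print(complement(r"over", text))    # ['The cow jumps ', ' the moon']
print(complement(r"o.*?m", text))   # ['The c', 'ps ', 'oon']
```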
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes all on the left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
This approach is very simple to write (if lookahead is available), but it has the drawback of being slow, since the lookahead is tested at every position in the string.
The number of lookahead tests can be reduced if you can use the first character of the pattern (let's say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
3. list all that is not the pattern.
using character classes
Let's say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$)))))*
This approach quickly becomes tedious as the number of characters grows, but it can be useful for regex flavors that lack many features, such as POSIX regex.
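The first two workarounds might look like this in Python, with over standing in for the pattern. This is a sketch under two assumptions: the variable names are illustrative, and .+? is used instead of .*? for the "other content" branch so that it cannot produce empty matches.

```python
import re

text = "The cow jumps over the moon"

# Workaround 1: (pattern)|other content. Group 1 is set only when the
# pattern branch matched, so filter on it.
combined = re.compile(r"(over)|.+?(?=over|$)")
others = [m.group(0) for m in combined.finditer(text) if m.group(1) is None]
print(others)     # ['The cow jumps ', ' the moon']

# Workaround 2, unrolled form: the lookahead is only tested at each "o",
# the first character of the pattern.
unrolled = re.match(r"[^o]*(?:(?!over)o[^o]*)*", text).group(0)
print(unrolled)   # 'The cow jumps '
```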

Can DFA regex engines handle atomic groups?

According to this page (and some others), DFA regex engines can deal with capturing groups rather well. I'm curious about atomic groups (or possessive quantifiers), as I recently used them a lot and can't imagine how this could be done.
I disagree with the first part of the answer:
A DFA does not need to deal with constructs like atomic grouping.... Atomic Grouping is a way to help the engine finish a match, that would otherwise cause endless backtracking
Atomic groups are important not only for the speed of NFA engines; they also allow you to write simpler and less error-prone regexes. Let's say I needed to find all C-style multiline comments in a program. The exact regex would be something like:
start with the literal /*
eat anything of the following
any char except *
a * followed by anything but /
repeat this as much as possible
end with the literal */
This sounds a bit complicated, the regex
/\* ( [^*] | \*[^/] )+ \*/
is complicated and wrong (it doesn't handle /* foo **/ correctly). Using a reluctant (lazy) quantifier is better
/\* .*? \*/
but also wrong as it can eat the whole line
/* foo */ ##$!!**##$ /* bar */
when backtracking due to a later sub-expression failing on the garbage occurs. Putting the above in an atomic group solves the problem nicely:
(?> /\* .*? \*/ )
This works always (I hope) and is as fast as possible (for NFA). So I wonder if a DFA engine could somehow handle it.
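The three claims above can be checked with a short Python sketch. Two assumptions: the spaces in the patterns are treated as free-spacing and dropped, and the atomic group is emulated with (?=(...))\1 so that the code runs on any Python version (3.11 and later accept (?>...) directly). The trailing $ stands in for "a later sub-expression" that fails.

```python
import re

line = "/* foo */ ##$!!**##$ /* bar */"

# The alternation-based attempt rejects a comment ending in "**/".
naive = re.compile(r"/\*(?:[^*]|\*[^/])+\*/")
print(naive.search("/* foo **/"))               # None

# The lazy version is fine on its own...
lazy = r"/\*.*?\*/"
print(re.search(lazy, line).group(0))           # /* foo */

# ...but a failing continuation (here: end of line) makes it stretch.
print(re.search(lazy + r"$", line).group(0))    # the whole line

# Emulated atomic group: the comment found first is never extended.
atomic = r"(?=(/\*.*?\*/))\1"
print(re.search(atomic + r"$", line).group(0))  # /* bar */
```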
A DFA does not need to deal with constructs like atomic grouping. A DFA is "text directed", unlike the NFA, which is "regex directed", in other words:
Atomic grouping is a way to help the engine finish a match attempt that would otherwise cause endless backtracking, as the (NFA) engine tries every possible permutation to find a match at a position where no match is even possible.
Atomic grouping, simply said, throws away backtracking positions. Since a DFA does not backtrack (the text to be matched is checked against the regex, not the regex against the text like a NFA - the DFA opens a branch for each decision), throwing away something that is not there is pointless.
I suggest Jeffrey E. F. Friedl's Mastering Regular Expressions (Google Books); he explains the general idea of a DFA:
DFA Engine: Text-Directed
Contrast the regex-directed NFA engine with an engine that, while
scanning the string, keeps track of all matches “currently in the
works.” In the tonight example, the moment the engine hits t, it adds
a potential match to its list of those currently in progress:
[...]
Each subsequent character scanned updates the list of possible
matches. After a few more characters are matched, the situation
becomes
[...]
with two possible matches in the works (and one alternative, knight,
ruled out). With the g that follows, only the third alternative
remains viable. Once the h and t are scanned as well, the engine
realizes it has a complete match and can return success.
I call this “text-directed” matching because each character scanned
from the text controls the engine. As in the example, a partial match
might be the start of any number of different, yet possible, matches.
Matches that are no longer viable are pruned as subsequent characters
are scanned. There are even situations where a “partial match in
progress” is also a full match. If the regex were ⌈to(…)?⌋, for
example, the parenthesized expression becomes optional, but it’s still
greedy, so it’s always attempted. All the time that a partial match is
in progress inside those parentheses, a full match (of 'to') is
already confirmed and in reserve in case the longer matches don’t pan
out.
(Source: http://my.safaribooksonline.com/book/programming/regular-expressions/0596528124/regex-directed-versus-text-directed/i87)
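To make the "list of matches in progress" idea concrete, here is a toy Python sketch. It is not how a real DFA engine is built (a real one compiles the regex into a state table), and it assumes the regex in the quoted example boils down to the three literal alternatives tonite, toknight and tonight. The function name is hypothetical. The point is only that each character of the text is examined once and nothing is ever backtracked into, so there are no saved positions for an atomic group to throw away.

```python
def text_directed_search(words, text):
    """Scan text once, tracking every literal alternative still in progress.

    Returns (start, end) of the first alternative to complete, or None.
    Assumes every word is non-empty.
    """
    live = []  # (word, characters matched so far, start index)
    for pos, ch in enumerate(text):
        # Every position may also begin a fresh attempt for each word.
        candidates = live + [(word, 0, pos) for word in words]
        live = []
        for word, matched, start in candidates:
            if word[matched] != ch:
                continue  # pruned: this alternative is no longer viable
            if matched + 1 == len(word):
                return start, pos + 1
            live.append((word, matched + 1, start))
    return None

alternatives = ["tonite", "toknight", "tonight"]
text = "after tonight"
span = text_directed_search(alternatives, text)
print(text[span[0]:span[1]])   # tonight
```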
Concerning capturing groups and DFAs: as far as I was able to understand from your link, these approaches are not pure DFA engines but hybrids of DFA and NFA.

Positive lookahead that (also) matches the empty string

I'm doing an internship with some Groovy code and I came across the following pattern:
(?=(^\w)*)(\w)+(?=(^\w)*)
It basically just finds words (contiguous collections of word characters) to sift out punctuation and such. Is there a reason to not simply use this pattern?
\w+
Since it's not my code I imagine that there might have been a reason for using something so ridiculously complicated, but at the same time it seems like it would be very inefficient. Is there any difference between the two? They seem to give the same results on http://regexpal.com/.
The reason not to use just \w+ is the capturing groups, though that doesn't explain any possible subtlety or logic in the regex.
The (optional) prefix and suffix strings are partially captured for possible later use, and as noted by m.buettner, ^\w is quite likely meant to be [^\w], meaning the second (^\w) group, the final one, never matches (though there might be cases with multi-line input, see Pattern Matching Flags; I can't see one myself, since \w+ won't match and consume an end of line).
The use of both (?=) and * suggests that the author was perhaps not quite familiar with regexes: typically you use lookarounds to constrain a match (which * effectively undoes here) or to optimise matching.
A polite approach might be to assume that the regex was being "tweaked" during development and has been left with some unneeded subpatterns...
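The equivalence is easy to check. The sketch below uses Python's re module rather than the java.util.regex engine that Groovy uses, so treat it as an approximation; the sample text and variable names are made up. It compares the overall matches of the two patterns and then looks at what the extra groups capture.

```python
import re

COMPLICATED = re.compile(r"(?=(^\w)*)(\w)+(?=(^\w)*)")
SIMPLE = re.compile(r"\w+")

text = "Hello, world: it's 42!"

# group(0) is the overall match, ignoring the capturing groups.
complicated = [m.group(0) for m in COMPLICATED.finditer(text)]
simple = SIMPLE.findall(text)
print(complicated == simple)   # True

first = COMPLICATED.search(text)
print(first.group(2))          # 'o'  (last character of "Hello")
print(first.group(3))          # None (the trailing group never participates)
```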

What is the difference between atomic and non-capturing groups?

What is an atomic group, ((?>expr)) and what is it used for?
In https://www.regular-expressions.info/atomic.html, the only example is when expr is alternation, such as the regex a(?>bc|b)c matches abcc but not abc. Are there examples with expr not being alternation?
Are atomic and non-capturing groups, ((?:expr)) the same thing?
When Atomic groups are used, the regex engine won't backtrack for further permutations if the complete regular expression has not been matched for a given string.
Whenever you use an alternation, the regex will immediately try to match the rest of the expression if it is successful. Still, it will keep track of the position where other alternations are possible. If the rest of the expression is not matched, the regex will go back to the previously noted position and try the other combinations. If Atomic grouping had been used, the regex engine would not have kept track of the previous position and would just have given up matching.
The above example doesn't explain the purpose of using atomic groups. It just demonstrates the elimination of backtracking. Atomic groups would be used in specific scenarios where greedy quantifiers are used, and further combinations are possible even though there is no alternation.
Atomic and non-capturing groups are different. Non-capturing groups don't save the matches' value, while atomic groups disable backtracking if further combinations are needed.
For example, the regular expression a(?:bc|b)c matches both abcc and abc (without capturing the match), whilst a(?>bc|b)c only matches abcc. If the regex was a(?>b|bc)c, it would only match abc, whilst a(?:b|bc)c would still match both.
Atomic groups (and the possessive modifier) are useful to avoid catastrophic backtracking, which malicious users can exploit to trigger denial-of-service attacks by tying up a server's CPU.
Non-capturing groups are just that -- non-capturing. The regex engine can backtrack into a non-capturing group; not into an atomic group.
Are there examples with expr not being alternation?
Consider the following pattern:
(abc)?a
This finds a match in both abc and abca. But what happens when the optional part becomes atomic?
(?>(abc)?)a
It no longer finds a match in abc. It will never give up abc, so the final a fails.
As others have said, there are other situations where you might want to avoid backtracking, even if it has no effect on the final match, to optimise your regex.
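The difference between the group types can be seen in a few lines of Python. This is a sketch: the atomic groups are emulated with the (?=(...))\1 idiom so the code runs on any Python version (3.11 and later accept (?>...) directly), and the variable names are illustrative.

```python
import re

# Non-capturing group: the engine may backtrack into it.
non_capturing = re.compile(r"a(?:bc|b)c")
# Emulation of a(?>bc|b)c: once "bc" is taken, "b" is never tried.
atomic_alt = re.compile(r"a(?=(bc|b))\1c")

print(bool(non_capturing.fullmatch("abc")))   # True
print(bool(atomic_alt.fullmatch("abc")))      # False
print(bool(atomic_alt.fullmatch("abcc")))     # True

# No alternation needed: an optional group behaves the same way.
optional = re.compile(r"(abc)?a")
# Emulation of (?>(abc)?)a.
atomic_optional = re.compile(r"(?=((abc)?))\1a")

print(bool(optional.search("abc")))           # True  (gives up abc, matches "a")
print(bool(atomic_optional.search("abc")))    # False (abc is never given up)
print(bool(atomic_optional.search("abca")))   # True
```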