Can DFA regex engines handle atomic groups? - regex

According to this page (and some others), DFA regex engines can deal with capturing groups rather well. I'm curious about atomic groups (or possessive quantifiers), as I recently used them a lot and can't imagine how this could be done.
I disagree with the first part of the answer:
A DFA does not need to deal with constructs like atomic grouping.... Atomic Grouping is a way to help the engine finish a match, that would otherwise cause endless backtracking
Atomic groups are important not only for the speed of NFA engines; they also allow you to write simpler and less error-prone regexes. Let's say I needed to find all C-style multiline comments in a program. The exact rules are something like:
start with the literal /*
eat anything of the following
any char except *
a * followed by anything but /
repeat this as much as possible
end with the literal */
This sounds a bit complicated, and the regex
/\* ( [^*] | \*[^/] )+ \*/
is complicated and wrong (it doesn't handle /* foo **/ correctly). Using a reluctant (lazy) quantifier is better
/\* .*? \*/
but also wrong as it can eat the whole line
/* foo */ ##$!!**##$ /* bar */
when backtracking occurs due to a later sub-expression failing on the garbage. Putting the above in an atomic group solves the problem nicely:
(?> /\* .*? \*/ )
This always works (I hope) and is as fast as possible (for an NFA). So I wonder if a DFA engine could somehow handle it.
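For concreteness, here is a minimal Java sketch of that backtracking effect (java.util.regex supports (?>...)); the trailing literal " baz" and the sample input are invented stand-ins for "a later sub-expression failing on the garbage":

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AtomicCommentDemo {
        public static void main(String[] args) {
            String input = "/* foo */ garbage /* bar */ baz";

            // Lazy only: when the trailing " baz" fails after the first comment, the
            // engine backtracks into .*? and stretches the match across the garbage.
            Pattern lazy = Pattern.compile("/\\*.*?\\*/ baz");
            // Lazy inside an atomic group: once the group has matched "/* foo */" and
            // exited, its backtracking positions are gone, so the attempt at that start
            // position fails and the engine moves on to the next comment instead.
            Pattern atomic = Pattern.compile("(?>/\\*.*?\\*/) baz");

            Matcher m1 = lazy.matcher(input);
            Matcher m2 = atomic.matcher(input);
            System.out.println(m1.find() ? m1.group() : "no match"); // /* foo */ garbage /* bar */ baz
            System.out.println(m2.find() ? m2.group() : "no match"); // /* bar */ baz
        }
    }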

A DFA does not need to deal with constructs like atomic grouping. A DFA is "text directed", unlike the NFA, which is "regex directed", in other words:
Atomic Grouping is a way to help the engine finish a match that would otherwise cause endless backtracking, as the (NFA) engine tries every permutation possible to find a match at a position where no match is even possible.
Atomic grouping, simply said, throws away backtracking positions. Since a DFA does not backtrack (the text to be matched is checked against the regex, not the regex against the text like an NFA - the DFA opens a branch for each decision), throwing away something that is not there is pointless.
I suggest J.F.Friedl's Mastering Regular Expressions (Google Books), where he explains the general idea of a DFA:
DFA Engine: Text-Directed
Contrast the regex-directed NFA engine with an engine that, while
scanning the string, keeps track of all matches “currently in the
works.” In the tonight example, the moment the engine hits t, it adds
a potential match to its list of those currently in progress:
[...]
Each subsequent character scanned updates the list of possible
matches. After a few more characters are matched, the situation
becomes
[...]
with two possible matches in the works (and one alternative, knight,
ruled out). With the g that follows, only the third alternative
remains viable. Once the h and t are scanned as well, the engine
realizes it has a complete match and can return success.
I call this “text-directed” matching because each character scanned
from the text controls the engine. As in the example, a partial match
might be the start of any number of different, yet possible, matches.
Matches that are no longer viable are pruned as subsequent characters
are scanned. There are even situations where a “partial match in
progress” is also a full match. If the regex were ⌈to(…)?⌋, for
example, the parenthesized expression becomes optional, but it’s still
greedy, so it’s always attempted. All the time that a partial match is
in progress inside those parentheses, a full match (of 'to') is
already confirmed and in reserve in case the longer matches don’t pan
out.
(Source: http://my.safaribooksonline.com/book/programming/regular-expressions/0596528124/regex-directed-versus-text-directed/i87)
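As a toy illustration only (not how a real DFA is actually constructed), here is a hypothetical Java sketch of the bookkeeping Friedl describes, for the plain alternation to|tonight|knight: scan the text once, keep the alternatives still "in the works", and prune the rest as each character arrives. The subject string is invented:

    import java.util.ArrayList;
    import java.util.List;

    public class TextDirectedSketch {
        public static void main(String[] args) {
            String[] alternatives = {"to", "tonight", "knight"};
            String text = "see you tonight";

            for (int start = 0; start < text.length(); start++) {
                List<String> inProgress = new ArrayList<>(List.of(alternatives));
                List<String> confirmed = new ArrayList<>();
                for (int i = 0; start + i < text.length() && !inProgress.isEmpty(); i++) {
                    char c = text.charAt(start + i);
                    List<String> stillViable = new ArrayList<>();
                    for (String alt : inProgress) {
                        if (alt.charAt(i) != c) continue;               // pruned: no longer viable
                        if (i == alt.length() - 1) confirmed.add(alt);  // full match "in reserve"
                        else stillViable.add(alt);                      // partial match in the works
                    }
                    inProgress = stillViable;
                }
                if (!confirmed.isEmpty()) {
                    System.out.println("at index " + start + ": " + confirmed); // at index 8: [to, tonight]
                }
            }
        }
    }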
Concerning capturing groups and DFAs: as far as I was able to understand from your link, these approaches are not pure DFA engines but hybrids of DFA and NFA.

Related

Regex Unrolled Loop Generalization [duplicate]

I'm trying to understand unrolled loops in regex. What is the big difference between:
MINISTÉRIO[\s\S]*?PÁG
and
MINISTÉRIO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)(?:[\s\S]*?))PÁG
In this context:
http://regexr.com/3dmlr
Why should I use the second, if the first does the SAME thing?
Thanks.
What is Unroll-the-loop
See this Unroll the loop technique source:
This optimisation technique is used to optimize a repeated alternation of the form (expr1|expr2|...)*. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to a super-linear match. Super-linear matches arise from the non-deterministic expression (a*)*.
The unrolling-the-loop technique is based on the hypothesis that, in most cases, you know which case in a repeated alternation is the most usual and which one is exceptional. We will call the first one the normal case and the second one the special case. The general syntax of the unrolling-the-loop technique can then be written as:
normal* ( special normal* )*
So, this is an optimization technique where alternations are turned into linearly matching atoms.
This makes these unrolled patterns very efficient since they involve less backtracking.
Current Scenario
Your MINISTÉRIO[\s\S]*?PÁG is a non-unrolled pattern, while MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG is unrolled. See the demos (both saved with the PCRE option so the number of steps is shown in the box above; regex performance differs across regex engines, but this gives a good sense of the performance difference). Add more text after the matching text: the first regex will start requiring more steps to finish, while the second one will only require more steps after you add a P. So, in texts where the character used in the known part is not common, unrolled patterns are very efficient.
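If you want to convince yourself that both forms find the same text, here is a small Java sketch (the sample string is invented; Java won't show you regex101's step counts, only that the matches are equivalent):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UnrolledDemo {
        public static void main(String[] args) {
            Pattern lazy     = Pattern.compile("MINISTÉRIO[\\s\\S]*?PÁG");
            Pattern unrolled = Pattern.compile("MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG");

            String text = "MINISTÉRIO DA SAÚDE - Programa Interno PÁG";
            for (Pattern p : new Pattern[] { lazy, unrolled }) {
                Matcher m = p.matcher(text);
                System.out.println(m.find() ? m.group() : "no match");
            }
            // Both print the same span; the unrolled form just reaches it with far
            // fewer engine steps when the surrounding text grows.
        }
    }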
See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [\s\S]*? is the same as .*? with a DOTALL modifier in languages that allow a . to match a newline, too).
Performance Question
Is the lazy matching pattern always slow and inefficient? Not always. With very short strings (1-10 symbols), lazy dot matching is usually better. With long inputs, where the leading delimiter may be present but the trailing one missing, lazy matching may lead to excessive backtracking and time-out issues.
Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.
Use lazy matching when your input is controlled: you know there will always be a match, some known set of log formats, or the like.
Bonus: Commonly Unrolled patterns
Tempered greedy tokens
Regular string literals ("String\u0020:\"text\""): "[^"\\]*(?:\\.[^"\\]*)*"
Multiline comment regex (/* Comments */): /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
#<...># comment regex: #<[^>]*(?:>[^#]*)*#
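For example, the string-literal pattern above can be tried directly in Java; this is a minimal sketch with an invented input line:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class UnrolledStringLiteral {
        public static void main(String[] args) {
            // normal* (special normal*)* with normal = [^"\\] and special = \\.
            Pattern quoted = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

            String code = "print(\"String\\u0020:\\\"text\\\"\"); // trailing noise";
            Matcher m = quoted.matcher(code);
            while (m.find()) {
                System.out.println(m.group()); // "String\u0020:\"text\""
            }
        }
    }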

is this regex vulnerable to REDOS attacks

Regex :
^\d+(\.\d+)*$
I tried to break it with :
1234567890.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1x]
that is 200x".1"
I have read about ReDos attacks from :
Preventing Regular Expression Denial of Service (ReDoS)
Runaway Regular Expressions: Catastrophic Backtracking
However, I am not too confident in my skills to prepare a ReDos attack on an expression. I tried to trigger catastrophic backtracking due to "Nested Quantifiers".
Is that expression breakable? If so, what input should be used, and how did you come up with it?
"Nested quantifiers" isn't inherently a problem. It's just a simple way to refer to a problem which is actually quite a bit more complicated. The problem is "quantifying over a sub-expression which can, itself, match in many ways at the same position". It just turns out that you almost always need a quantifier in the inner sub-expression to provide a rich enough supply of matches, and so quantifiers inside quantifiers serve as a red flag that indicates the possibility of trouble.
(.*)* is problematic because .* has maximum symmetry — it can match anything between zero and all of the remaining characters at any point of the input. Repeating this leads to a combinatorial explosion.
([0-9a-f]+\d+)* is problematic because at any point in a string of digits, there will be many possible ways to allocate those digits between an initial substring of [0-9a-f]+ and a final substring of \d+, so it has the same exact issue as (.*)*.
(\.\d+)* is not problematic because \. and \d match completely different things. A digit isn't a dot and a dot isn't a digit. At any given point in the input there is only one possible way to match \., and only one possible way to match \d+ that leaves open the possibility of another repetition (consume all of the digits, because if we stop before a digit, the next character is certainly not a dot). Therefore (\.\d+)* is no worse, backtracking-wise, than a \d* would be in the same context, even though it contains nested quantifiers.
Your regex is safe, but only because of "\."
Testing on regex101.com shows that there are no combinations of inputs that create runaway checks - but your regex is VERY close to being vulnerable, so be careful when modifying it.
As you've read, catastrophic backtracking happens when two quantifiers are right next to each other. In your case, the regex expands to \d+\.\d+\.\d+\.\d+\. ... and so on. Because you make the dot required for every single match between \d+, your regex grows by only three steps for each period-number you add. (This translates to 4 steps per period-number if you put an invalid character at the end.) That's a linear growth rate, so your regex is fine. Demo
However, if you make the \. optional, accidentally forget the escape character so it becomes a plain ol' ., or remove it altogether, then you're in trouble. Such a regex would allow catastrophic backtracking; an invalid character at the end approximately doubles the runtime with every additional number you add before it. That's an exponential growth rate, and it's enough to time out the regex101 engine at its default settings with just 18 digits and 1 invalid character. Demo
As written, your regex is fine, and will remain so as long as you ensure there's something "solid" between the first \d+ and the second \d+, as well as something "solid" between the second \d+ and the * outside its capture group.
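Here is a small Java sketch of that linear behaviour, using the attack input from the question; the timing is illustrative and will vary by machine:

    import java.util.regex.Pattern;

    public class RedosProbe {
        public static void main(String[] args) {
            // The pattern from the question, fed the "200 x .1 plus an invalid tail" input.
            Pattern safe = Pattern.compile("^\\d+(\\.\\d+)*$");

            StringBuilder input = new StringBuilder("1234567890");
            for (int i = 0; i < 200; i++) input.append(".1");
            input.append("x");                               // force the match to fail at the end

            long start = System.nanoTime();
            boolean matched = safe.matcher(input).matches();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("matched=" + matched + " in ~" + micros + " microseconds");
            // Prints false almost instantly: the required \. leaves no ambiguity to explore.
            // A variant without the dot, e.g. "^\\d+(\\d+)*$", reintroduces the
            // nested-quantifier ambiguity and effectively hangs on the same kind of input.
        }
    }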

Why regular expression .* is slower at one place and faster at other

Lately I am using a lot of regular expressions in java/groovy. For testing I routinely use regex101.com. Obviously I am looking at the regular expressions performance too.
One thing I noticed is that using .* properly can significantly improve the overall performance. In particular, using .* in the middle of the regular expression, rather than at the end, is a performance killer.
For example, in this regular expression the required number of steps is 27:
If I change the first .* to \s*, it reduces the steps required significantly, to 16:
However, if I change the second .* to \s*, it does not reduce the steps any further:
I have a few questions:
Why the above? I don't want to compare \s and .*; I know the difference. I want to know why \s and .* cost differently based on their position in the complete regex, and what characteristics of a regex may cost differently based on their position in the overall regex (or based on any aspect other than position, if there is one).
Does the step counter given on this site really give any indication of regex performance?
What other simple or similar (position-related) regex performance observations do you have?
The following is output from the debugger.
The big reason for the difference in performance is that .* will consume everything until the end of the string (except the newline). The pattern will then continue, forcing the regex to backtrack (as seen in the first image).
The reason that \s and .* perform equally well at the end of the pattern is that the greedy pattern vs. consuming whitespace makes no difference if there's nothing else to match (besides WS).
If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern - the regex would be forced to backtrack.
EDIT
You can see the performance difference if you end with something besides whitespace:
Bad:
^myname.*mahesh.*hiworld
Better:
^myname.*mahesh\s*hiworld
Even better:
^myname\s*mahesh\s*hiworld
The way regex engines work with the * quantifier, aka greedy quantifier, is to consume everything in the input that matches, then:
try the next term in the regex. If it matches, proceed on
"unconsume" one character (move the pointer back one), aka backtrack and goto step 1.
Since . matches anything (almost), the first state after encountering .* is to move the pointer to the end of input, then start moving back through the input one char at a time trying the next term until there's a match.
With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be - no backtracking required to match the next term.
Something you should try is using the reluctant quantifier .*?, which will consume one char at a time until the next term matches, which should have the same time complexity as \s*, but be slightly more efficient because no check of the current char is required.
\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end of the input that matches, which leaves the pointer in the same position for both expressions.
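A minimal Java sketch of the three variants (the test line is invented; Java does not expose a step counter, so this only shows that all three match, while the cost difference is what regex101 visualises):

    import java.util.regex.Pattern;

    public class GreedyPlacementDemo {
        public static void main(String[] args) {
            // Invented sample line; the three patterns are the ones from this answer.
            String line = "myname   mahesh   hiworld";

            String[] patterns = {
                "^myname.*mahesh.*hiworld",     // bad: two greedy overshoots to undo
                "^myname.*mahesh\\s*hiworld",   // better: only the first .* backtracks
                "^myname\\s*mahesh\\s*hiworld"  // even better: no overshoot at all
            };
            for (String p : patterns) {
                boolean ok = Pattern.compile(p).matcher(line).find();
                System.out.printf("%-32s matches=%b%n", p, ok);
            }
            // All three match the same line; the difference is how often the engine
            // has to "unconsume" characters after .* has raced to the end of the
            // line, which is what regex101's step counter is reporting.
        }
    }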

How to invert an arbitrary Regex expression

This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill, either because they restrict their question to a very specific example, or to a specific use case (e.g. single chars only), or because you need substitution for a successful approach, or because you'd need to use a programming language (e.g. C#'s split, or Match().Value).
I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon", it would match The cow jumps and also match the moon.
That's only a simple example of course. The Regex could be something messier such as "o.*?m", in which case the matches would be: The c, ps, and oon.
Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field, which I was hoping to keep clear. Also, everything else is matched, but only on a character-by-character basis instead of in big chunks.
Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.
From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.
The answer -
A match is not discontinuous, it is continuous!
Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.
So within a single match, there is no inverting (i.e. "match everything but this") that can extend past the negated thing.
This is a tenet of regular expressions.
Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.
So, even with multiple matches, it's not good enough to say (?:(?!\bover\b).)+
because even though it will match up to (but not including) over, on the next match
it will match ver ....
There are ways to avoid this that are tedious, requiring variable-length lookbehinds.
But, the easiest way is to match up to over, then over, then the rest.
Several constructs can help. One is \K.
Unfortunately, there is no magical recipe to negate a pattern.
As you mentioned in your question, when you have an efficient pattern that you use with a match method, the easiest (and most efficient) way to obtain the complement is to use a split method with the same pattern.
To do it with the pattern itself, workarounds are:
1. consuming the characters that match the pattern
"other content" is the content until the next pattern or the end of the string.
alternation + capture group:
(pattern)|other content
Then you must check if the capture group exists to know which part of the alternation succeeds.
"other content" can be for example described in this way: .*?(?=pattern|$)
With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:
pattern(*SKIP)(*FAIL)|other content
With this variant, you don't need to check anything after, since the first branch is forced to fail.
or without alternation:
((?:pattern)*)(other content)
variant in PCRE, Perl, or Ruby with the \K feature:
(?:pattern)*\Kother content
Where \K removes everything to its left from the match result.
2. checking characters of the string one by one
(?:(?!pattern).)*
While this way is very simple to write (if the lookahead feature is available), it has the inconvenience of being slow, since each position of the string is tested with the lookahead.
The number of lookahead tests can be reduced if you can use the first character of the pattern (let's say "a"):
[^a]*(?:(?!pattern)a[^a]*)*
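A small Java sketch comparing the plain version with the first-character optimisation, again with the "over" example (zero-length matches produced by the optimised form are skipped):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TemperedScanDemo {
        public static void main(String[] args) {
            String text = "The cow jumps over the moon";

            Pattern naive     = Pattern.compile("(?:(?!over).)+");           // lookahead at every position
            Pattern optimized = Pattern.compile("[^o]*(?:(?!over)o[^o]*)*"); // lookahead only at 'o'

            for (Pattern p : new Pattern[] { naive, optimized }) {
                Matcher m = p.matcher(text);
                StringBuilder out = new StringBuilder();
                while (m.find()) {
                    if (!m.group().isEmpty()) out.append("[").append(m.group()).append("] ");
                }
                System.out.println(p.pattern() + "  ->  " + out);
            }
            // Both print: [The cow jumps ] [ver the moon]
            // (the "ver" fragment is the caveat raised in the previous answer;
            //  the second pattern simply runs the costly lookahead far less often)
        }
    }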
3. listing all that is not the pattern
using character classes
Let's say your pattern is /hello/:
([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$)))))*
This way quickly becomes tedious when the number of characters is large, but it can be useful for regex flavors that don't have many features, like POSIX regexes.

Are atomic groups always used with alternation | inside?

Are atomic groups always used with alternation | inside? I get that impression from "all backtracking positions remembered by any tokens inside the group" in:
An atomic group is a group that, when the regex engine exits from it,
automatically throws away all backtracking positions remembered by any
tokens inside the group. Atomic groups are non-capturing. The syntax
is (?>group).
An example will make the behavior of atomic groups clear. The regular
expression a(bc|b)c (capturing group) matches abcc and abc. The regex
a(?>bc|b)c (atomic group) matches abcc but not abc.
Can you give an example where atomic groups are used without alternation | inside? Thanks.
Alternations have nothing to do with atomic groups. The point of atomic groups is to avoid backtracking. There are two main reasons for this:
Avoid unneeded backtracking when a regex is going to fail to match anyway.
Avoid backtracking into a part of an expression where you don't want to find a match.
You asked for an example of atomic grouping without alternations.
Let's look at both uses.
A. Avoid Backtracking on Failure
For example, consider these two strings:
name=Joe species=hamster food=carrot says:{I love carrots}
name=Joe species=hamster food=carrot says:{I love peas}
Let's say we want to find a string that is well-formed (it has the key=value tokens) and has carrots after the tokens, perhaps in the says part. One way to attempt this could be:
Non-Atomic Version
^(?:\w+=\w+\s+)*.*carrots
This will match the first string and not the second. We're happy. Or... are we really? There are two reasons to be unhappy. We'll look at the second reason in part B (the second main reason for atomic groups). So what's the first reason?
Well, when you debug the failure case in RegexBuddy, you see that it takes the engine 401 steps before the engine decides it cannot match the second string. It is that long because after matching the tokens and failing to match carrots in the says:{I love peas}, the engine backtracks into the (\w+=\w+\s+)* in the hope of finding carrots there. Now let's look at an atomic version.
An Atomic Version
^(?>(?:\w+=\w+\s+)*).*carrots
Here, the atomic group prevents the engine from backtracking into the (?:\w+=\w+\s+)*. The result is that on the second string, the engine fails in 64 steps. Quite a lot faster than 401!
B. Avoid Backtracking into part of String where Match is Not Desired
Keeping the same regexes, let's modify the strings slightly:
name=Joe species=hamster food=carrots says:{I love carrots}
name=Joe species=hamster food=carrots says:{I love peas}
Our atomic regex still works (it matches the first string but not the second).
However, the non-atomic regex now matches both strings! That is because after failing to find carrots in says:{I love peas}, the engine backtracks into the tokens, and finds carrots in food=carrots.
Therefore, in this instance an atomic group is a handy tool to skip the portion of the string where we don't want to find carrots, while still making sure that it is well-formed.
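To see both behaviours at once, here is a minimal Java sketch running the two regexes over all four strings:

    import java.util.regex.Pattern;

    public class AtomicCarrotsDemo {
        public static void main(String[] args) {
            Pattern nonAtomic = Pattern.compile("^(?:\\w+=\\w+\\s+)*.*carrots");
            Pattern atomic    = Pattern.compile("^(?>(?:\\w+=\\w+\\s+)*).*carrots");

            String[] subjects = {
                "name=Joe species=hamster food=carrot says:{I love carrots}",
                "name=Joe species=hamster food=carrot says:{I love peas}",
                "name=Joe species=hamster food=carrots says:{I love carrots}",
                "name=Joe species=hamster food=carrots says:{I love peas}",
            };
            for (String s : subjects) {
                System.out.printf("non-atomic=%-5b atomic=%-5b <- %s%n",
                        nonAtomic.matcher(s).find(), atomic.matcher(s).find(), s);
            }
            // The last line is the telling one: the non-atomic regex backtracks into the
            // key=value tokens and "finds" carrots in food=carrots, while the atomic
            // version refuses to re-enter the group and correctly reports no match.
        }
    }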