Why is the regular expression .* slower in one place and faster in another?

Lately I have been using a lot of regular expressions in Java/Groovy. For testing I routinely use regex101.com, and naturally I am looking at regular expression performance too.
One thing I noticed is that using .* carefully can significantly improve overall performance. In particular, using .* in the middle of a regular expression, rather than at the end, can be a performance killer.
For example, this regular expression requires 27 steps:
If I change the first .* to \s*, the required steps drop significantly, to 16:
However, if I change the second .* to \s*, it does not reduce the steps any further:
I have a few questions:
Why the above? I don't want to compare \s* and .*; I know the difference. I want to know why \s* and .* cost differently depending on their position in the complete regex, and what other characteristics of a regex may cost differently depending on their position in the overall regex (or on any aspect other than position, if there is any).
Does the step counter on that site really give any indication of regex performance?
What other simple or similar (position-related) regex performance observations do you have?

The following is output from the debugger.
The big reason for the difference in performance is that .* will consume everything until the end of the string (except the newline). The pattern must then continue matching, forcing the regex to backtrack (as seen in the first image).
The reason that \s* and .* perform equally well at the end of the pattern is that greedy matching vs. consuming whitespace makes no difference when there is nothing else left to match (besides whitespace).
If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern: the regex would be forced to backtrack.
EDIT
You can see the performance difference if you end with something besides whitespace:
Bad:
^myname.*mahesh.*hiworld
Better:
^myname.*mahesh\s*hiworld
Even better:
^myname\s*mahesh\s*hiworld
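To see the effect outside regex101, here is a minimal Java benchmark sketch (the test string and iteration count are invented for illustration; absolute numbers will vary by engine and JVM):

import java.util.regex.Pattern;

public class StepDemo {
    public static void main(String[] args) {
        // Hypothetical test string of the same shape as the one above.
        String input = "myname  mahesh  hiworld";
        String[] patterns = {
            "^myname.*mahesh.*hiworld",    // bad
            "^myname.*mahesh\\s*hiworld",  // better
            "^myname\\s*mahesh\\s*hiworld" // even better
        };
        for (String p : patterns) {
            Pattern compiled = Pattern.compile(p);
            long start = System.nanoTime();
            for (int i = 0; i < 1_000_000; i++) {
                compiled.matcher(input).find();
            }
            System.out.printf("%-32s %d ms%n", p, (System.nanoTime() - start) / 1_000_000);
        }
    }
}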

The way regex engines work with the * quantifier, a.k.a. the greedy quantifier, is to consume everything in the input that matches, then:
1. Try the next term in the regex. If it matches, proceed on.
2. Otherwise, "unconsume" one character (move the pointer back one), a.k.a. backtrack, and go to step 1.
Since . matches anything (almost), the first state after encountering .* is to move the pointer to the end of input, then start moving back through the input one char at a time trying the next term until there's a match.
With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be - no backtracking required to match the next term.
Something you should try is the reluctant quantifier .*?, which will consume one char at a time until the next term matches. It should have the same time complexity as \s* here, but be slightly more efficient because no check of the current char is required.
\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end of the input that matches, which leaves the pointer in the same position for both expressions.
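A small Java sketch makes the difference in pointer movement visible by capturing what each quantifier actually consumes between two literals (the input string here is invented):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    public static void main(String[] args) {
        String input = "foo   bar   foo   bar";
        // Greedy: (.*) runs to the end of input first, then backtracks to the LAST "bar".
        show("foo(.*)bar", input);
        // Reluctant: (.*?) grows one char at a time, stopping at the FIRST "bar".
        show("foo(.*?)bar", input);
        // \s*: the pointer advances only over whitespace; no backtracking needed.
        show("foo(\\s*)bar", input);
    }

    static void show(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            System.out.printf("%-14s consumed \"%s\"%n", regex, m.group(1));
        }
    }
}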

Related

Regex Unrolled Loop Generalization

I'm trying to understand unrolled loops in regex. What is the big difference between:
MINISTÉRIO[\s\S]*?PÁG
and
MINISTÉRIO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)(?:[\s\S]*?))PÁG
In this context:
http://regexr.com/3dmlr
Why should I use the second, if the first does the SAME thing?
Thanks.
What is Unroll-the-loop
See this Unroll the loop technique source:
This optimisation technique is used to optimise repeated alternations of the form (expr1|expr2|...)*. These expressions are not uncommon, and the use of another repetition inside an alternation may also lead to super-linear matching. Super-linear matching arises from the nondeterministic expression (a*)*.
The unrolling-the-loop technique is based on the hypothesis that, in a repeated alternation, you usually know which case is the most common and which one is exceptional. We will call the first the normal case and the second the special case. The general syntax of the unrolling-the-loop technique can then be written as:
normal* ( special normal* )*
So, this is an optimization technique where alternations are turned into linearly matching atoms.
This makes these unrolled patterns very efficient since they involve less backtracking.
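As a concrete illustration, here is the quoted-string case in both forms, the repeated alternation and its unrolled equivalent, in a rough Java timing sketch (synthetic input; the gap you measure will differ across engines, since each optimizes differently):

import java.util.regex.Pattern;

public class UnrollDemo {
    public static void main(String[] args) {
        // Synthetic "quoted string" input with frequent escape sequences.
        String input = "\"" + "abcdefgh\\\"".repeat(5_000) + "\"";
        time("\"(?:[^\"\\\\]|\\\\.)*\"", input);           // repeated alternation form
        time("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", input); // unrolled form
    }

    static void time(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        long start = System.nanoTime();
        boolean ok = false;
        for (int i = 0; i < 100; i++) {
            ok = p.matcher(input).matches();
        }
        System.out.printf("%-42s %b in %d ms%n", regex, ok, (System.nanoTime() - start) / 1_000_000);
    }
}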
Current Scenario
Your MINISTÉRIO[\s\S]*?PÁG is a non-unrolled pattern, while MINISTÉRIO[^P]*(?:P(?!ÁG)[^P]*)*PÁG is unrolled. See the demos (both saved with the PCRE option so that the box above shows the number of steps; regex performance differs across regex engines, but this tells you exactly the performance difference). Add more text after text: the first regex will start requiring more steps to finish, while the second one will only show more steps after you add a P. So, in texts where the character used in the known part is not common, unrolled patterns are very efficient.
See the Difference between .*?, .* and [^"]*+ quantifiers section in my answer to understand how lazy matching works (your [\s\S]*? is the same as .*? with a DOTALL modifier in languages that allow a . to match a newline, too).
Performance Question
Is the lazy matching pattern always slow and inefficient? Not always. With very short strings (1-10 symbols), lazy dot matching is usually better. With long inputs, where there may be a leading delimiter but no trailing one, lazy matching can cause excessive backtracking, leading to timeout issues.
Use unrolled patterns when you have arbitrary inputs of potentially long length and where there may be no match.
Use lazy matching when your input is controlled and you know there will always be a match (some known set of log formats, or the like).
Bonus: Commonly Unrolled patterns
Tempered greedy tokens
Regular string literals ("String\u0020:\"text\""): "[^"\\]*(?:\\.[^"\\]*)*"
Multiline comment regex (/* Comments */): /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
#<...># comment regex: #<[^>]*(?:>[^#]*)*#
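For instance, the multiline-comment pattern above drops straight into Java like this (a small usage sketch; the sample source line is invented):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentDemo {
    public static void main(String[] args) {
        String source = "int a; /* first */ int b; /* second ** with stars **/ int c;";
        // Unrolled multiline-comment pattern from the list above.
        Pattern comment = Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");
        Matcher m = comment.matcher(source);
        while (m.find()) {
            System.out.println("found: " + m.group());
        }
    }
}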

Is this regex vulnerable to ReDoS attacks?

Regex :
^\d+(\.\d+)*$
I tried to break it with :
1234567890.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1x]
that is, 200 repetitions of ".1"
I have read about ReDos attacks from :
Preventing Regular Expression Denial of Service (ReDoS)
Runaway Regular Expressions: Catastrophic Backtracking
However, I am not too confident in my skills to prepare a ReDoS attack on an expression. I tried to trigger catastrophic backtracking due to "nested quantifiers".
Is that expression breakable? What input should be used for that and, if yes, how did you come up with it?
"Nested quantifiers" isn't inherently a problem. It's just a simple way to refer to a problem which is actually quite a bit more complicated. The problem is "quantifying over a sub-expression which can, itself, match in many ways at the same position". It just turns out that you almost always need a quantifier in the inner sub-expression to provide a rich enough supply of matches, and so quantifiers inside quantifiers serve as a red flag that indicates the possibility of trouble.
(.*)* is problematic because .* is maximally ambiguous: it can match anything from zero up to all of the remaining characters at any point of the input. Repeating this leads to a combinatorial explosion.
([0-9a-f]+\d+)* is problematic because at any point in a string of digits, there will be many possible ways to allocate those digits between an initial substring of [0-9a-f]+ and a final substring of \d+, so it has the same exact issue as (.*)*.
(\.\d+)* is not problematic because \. and \d match completely different things. A digit isn't a dot and a dot isn't a digit. At any given point in the input there is only one possible way to match \., and only one possible way to match \d+ that leaves open the possibility of another repetition (consume all of the digits, because if we stop before a digit, the next character is certainly not a dot). Therefore (\.\d+)* is no worse, backtracking-wise, than a \d* would be in the same context, even though it contains nested quantifiers.
Your regex is safe, but only because of "\."
Testing on regex101.com shows that there are no combinations of inputs that create runaway checks - but your regex is VERY close to being vulnerable, so be careful when modifying it.
As you've read, catastrophic backtracking happens when two quantifiers are right next to each other. In your case, the regex expands to \d+\.\d+\.\d+\.\d+\. ... and so on. Because you make the dot required for every single match between \d+, your regex grows by only three steps for each period-number you add. (This translates to 4 steps per period-number if you put an invalid character at the end.) That's a linear growth rate, so your regex is fine. Demo
However, if you make the \. optional, accidentally forget the escape character (leaving a plain ol' .), or remove it altogether, then you're in trouble. Such a regex allows catastrophic backtracking; an invalid character at the end approximately doubles the runtime with every additional number you add before it. That's an exponential growth rate, and it's enough to time out the regex101 engine at its default settings with just 18 digits and 1 invalid character. Demo
As written, your regex is fine, and will remain so as long as you ensure there's something "solid" between the first \d+ and the second \d+, as well as something "solid" between the second \d+ and the * outside its capture group.
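To reproduce this in a program rather than on regex101, here is a hedged Java sketch; the vulnerable variant simply makes the dot optional, as described above. Keep the digit count small, since the broken pattern's runtime roughly doubles with each extra digit:

import java.util.regex.Pattern;

public class RedosDemo {
    public static void main(String[] args) {
        // Attack input: a run of digits followed by one invalid character.
        String attack = "1".repeat(25) + "x";
        time("^\\d+(\\.\\d+)*$", attack);  // safe: the required \. keeps growth linear
        time("^\\d+(\\.?\\d+)*$", attack); // hypothetical broken variant: \. made optional
    }

    static void time(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        long start = System.nanoTime();
        boolean ok = p.matcher(input).matches();
        System.out.printf("%-20s %b in %d ms%n", regex, ok, (System.nanoTime() - start) / 1_000_000);
    }
}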

Why is a character class faster than alternation?

It seems that using a character class is faster than the alternation in an example like:
[abc] vs (a|b|c)
I have heard this being recommended, and with a simple test using Time::HiRes I verified it (the alternation was ~10 times slower).
Using (?:a|b|c) instead, in case the capturing parentheses make a difference, does not change the result.
But I cannot understand why. I think it is because of backtracking, but the way I see it, at each position there are 3 character comparisons either way, so I am not sure how backtracking comes into the alternation's cost. Is it a result of how alternation is implemented?
This is because the "OR" construct | backtracks between alternatives: if the first alternative does not match, the engine has to return the pointer to the position it had before attempting that alternative in order to try the next one, whereas a character class can advance sequentially. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
Pattern: [rf]at
Match string: carat
In short, the fact that the PCRE engine optimizes this away (single literal characters -> character class) is already a decent hint that alternations are inefficient.
A character class like [abc] is irreducible and can be optimised, whereas an alternation like (?:a|b|c) may just as well be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t).
The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.
There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".
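The effect is easy to reproduce. The original measurement was in Perl, but here is a minimal benchmark sketch in Java (synthetic input; numbers will vary by engine):

import java.util.regex.Pattern;

public class ClassVsAlternation {
    public static void main(String[] args) {
        // Mostly non-matching text so the engine does real work at every position.
        String input = "x".repeat(1_000) + "c";
        time("[abc]", input);
        time("(?:a|b|c)", input);
    }

    static void time(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            p.matcher(input).find();
        }
        System.out.printf("%-10s %d ms%n", regex, (System.nanoTime() - start) / 1_000_000);
    }
}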

Can DFA regex engines handle atomic groups?

According to this page (and some others), DFA regex engines can deal with capturing groups rather well. I'm curious about atomic groups (or possessive quantifiers), as I recently used them a lot and can't imagine how this could be done.
I disagree with the first part of the answer:
A DFA does not need to deal with constructs like atomic grouping.... Atomic Grouping is a way to help the engine finish a match, that would otherwise cause endless backtracking
Atomic groups are important not only for the speed of NFA engines; they also allow you to write simpler and less error-prone regexes. Let's say I needed to find all C-style multiline comments in a program. The exact regex would be something like:
start with the literal /*
eat any of the following:
any char except *
a * followed by anything but /
repeat this as much as possible
end with the literal */
This sounds a bit complicated, and the regex
/\* ( [^*] | \*[^/] )+ \*/
is complicated and wrong (it doesn't handle /* foo **/ correctly). Using a reluctant (lazy) quantifier is better:
/\* .*? \*/
but it is also wrong, as it can eat the whole line
/* foo */ ##$!!**##$ /* bar */
if backtracking occurs because a later sub-expression fails on the garbage. Putting the above in an atomic group solves the problem nicely:
(?> /\* .*? \*/ )
This always works (I hope) and is as fast as possible (for an NFA). So I wonder if a DFA engine could somehow handle it.
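For what it's worth, this is easy to check on an NFA engine such as Java's, which supports atomic groups as (?>...); a minimal sketch with an invented input:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomicDemo {
    public static void main(String[] args) {
        // "zz" after the comment plays the role of the later sub-expression
        // that fails on the garbage and triggers backtracking.
        String input = "/* foo */ ##$!!**##$ /* bar */zz";
        show("/\\*.*?\\*/zz", input);     // lazy only: .*? gets stretched across the garbage
        show("(?>/\\*.*?\\*/)zz", input); // atomic: backtracking positions are discarded
    }

    static void show(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        System.out.printf("%-22s -> %s%n", regex, m.find() ? m.group() : "no match");
    }
}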
A DFA does not need to deal with constructs like atomic grouping. A DFA is "text directed", unlike the NFA, which is "regex directed"; in other words:
Atomic grouping is a way to help the engine finish a match that would otherwise cause endless backtracking, as the (NFA) engine tries every possible permutation to find a match at a position where no match is even possible.
Atomic grouping, simply said, throws away backtracking positions. Since a DFA does not backtrack (the text to be matched is checked against the regex, not the regex against the text as in an NFA; the DFA opens a branch for each decision), throwing away something that is not there is pointless.
I suggest Jeffrey Friedl's Mastering Regular Expressions (Google Books); he explains the general idea of a DFA:
DFA Engine: Text-Directed
Contrast the regex-directed NFA engine with an engine that, while scanning the string, keeps track of all matches “currently in the works.” In the tonight example, the moment the engine hits t, it adds a potential match to its list of those currently in progress:
[...]
Each subsequent character scanned updates the list of possible matches. After a few more characters are matched, the situation becomes
[...]
with two possible matches in the works (and one alternative, knight, ruled out). With the g that follows, only the third alternative remains viable. Once the h and t are scanned as well, the engine realizes it has a complete match and can return success.
I call this “text-directed” matching because each character scanned from the text controls the engine. As in the example, a partial match might be the start of any number of different, yet possible, matches. Matches that are no longer viable are pruned as subsequent characters are scanned. There are even situations where a “partial match in progress” is also a full match. If the regex were ⌈to(…)?⌋, for example, the parenthesized expression becomes optional, but it’s still greedy, so it’s always attempted. All the time that a partial match is in progress inside those parentheses, a full match (of 'to') is already confirmed and in reserve in case the longer matches don’t pan out.
(Source: http://my.safaribooksonline.com/book/programming/regular-expressions/0596528124/regex-directed-versus-text-directed/i87)
Concerning capturing groups and DFAs: as far as I was able to understand from your link, these approaches are not pure DFA engines but hybrids of DFA and NFA.

Regex to *not* match any characters

I know it is quite a weird goal, but for a quick and dirty fix for one of our systems we need to not filter any input and let the corruption go into the system.
My current regex for this is "\^.*"
The problem with that is that it does not "match no characters" as planned: for one kind of input it does match. The string that makes it fail is ^#jj (basically anything containing ^ ...).
What would be the best way to match no characters at all? I was thinking of removing the \, but that alone would just turn the "not" into a "starts with" ...
The ^ character doesn't mean "not" except inside a character class ([]). If you want to not match anything, you could use a negative lookahead that matches anything: (?!.*).
A simple and cheap regex that will never match anything is to match against something that is simply unmatchable, for example: \b\B.
It's simply impossible for this regex to match, since it's a contradiction.
References
regular-expressions.info: Word Boundaries
\B is the negated version of \b. \B matches at every position where \b does not.
Another very well supported and fast pattern that fails to match anything, guaranteed in constant time:
$unmatchable pattern $anything goes here etc.
$ of course indicates the end of line. No characters could possibly go after $, so no further state transitions could possibly be made. The additional advantage is that your pattern is intuitive, self-descriptive and readable as well!
tl;dr: The most portable and efficient regex to never match anything is $- (end of line followed by a char).
Impossible regex
The most reliable solution is to create an impossible regex. There are many impossible regexes, but not all are equally good.
First you want to avoid "lookahead" solutions, because some regex engines don't support them.
Then you want to make sure your "impossible regex" is efficient and won't take too many computation steps to match... nothing.
I found that $- has a constant computation time (O(1)) and takes only two steps to compute regardless of the size of your text (https://regex101.com/r/yjcs1Z/3).
For comparison:
$^ and $. both take 36 steps to compute -> O(1)
\b\B takes 1507 steps on my sample and increases with the number of characters in your string -> O(n)
Empty regex (alternative solution)
If your regex engine accepts it, the best and simplest regex to never match anything might be an empty regex.
Instead of trying to not match any characters, why not just match all characters? ^.*$ should do the trick. If you have to not match any characters, then try ^\j$ (assuming, of course, that your regular expression engine will not throw an error when you provide it an invalid character class; if it does, try ^()$). A quick test with RegexBuddy suggests that this might work.
^ only means "not" when it's inside a character class (such as [^a-z], meaning anything but a-z). You've turned it into a literal ^ with the backslash.
What you're trying to do is [^]*, but that's not legal. You could try something like
" {10000}"
which would match exactly 10,000 spaces; if that's longer than your maximum input, it can never match.
((?iLmsux))
Try this, it matches only if the string is empty.
Interesting ... the most obvious and simple variant:
~^
https://regex101.com/r/KhTM1i/1
usually requires only one computation step (failing directly at the start, and becoming computationally expensive only if the matched string begins with a long series of ~), yet it was not mentioned among all the other answers ... for 12 years.
You want to match nothing at all? Negative lookarounds seem obvious, but they can be slow; perhaps ^$ (which matches the empty string only) would work as an alternative?
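For completeness, here is a quick Java harness to sanity-check a few of the never-match candidates from this thread (a hedged sketch; construct support varies by engine, so test in your own):

import java.util.regex.Pattern;

public class NeverMatch {
    public static void main(String[] args) {
        // Candidates from the answers above; all should match nothing.
        String[] candidates = { "$-", "\\b\\B", "(?!.*)" };
        String[] inputs = { "", "^#jj", "~~~abc", "hello world" };
        for (String c : candidates) {
            Pattern p = Pattern.compile(c);
            boolean matchedAny = false;
            for (String in : inputs) {
                if (p.matcher(in).find()) {
                    matchedAny = true;
                }
            }
            System.out.printf("%-8s matched any input: %b%n", c, matchedAny);
        }
    }
}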