In an answer to a recent question, I contrived a couple of clever little regexes (at the asker's request) to match a substring at either the beginning or the end of a string. When run on Regex101, however, I noted that the different patterns have different step counts (indicating that the regex engine has to do more work for one vs. the other). To my mind, however, there is no intuitive reason that this should be so.
The three patterns are as follows:
Fun with conditionals: /(^)?!next(?(1)|$)/ (demo - 86 steps)
Classic alternation: ^!next|!next$ (demo - 58 steps)
Nasty lookarounds: !next(?:(?<=^.{5})|(?=$)) (demo - 35 steps)
Why is the first pattern so much less efficient than the second, and, most confusingly, why is the third so efficient?
TL;DR
Why is the first pattern so much less efficient than the second, and,
most confusingly, why is the third so efficient?
Because the first two patterns are anchored and the third is not.
Real story, how steps are taken
Consider the regex /^x/gm: how many steps do you think the engine will take to return "no match" if the subject string is abc? You're right: two.
Assert beginning of string
Match x
Then the overall match fails, since no x comes immediately after the start-of-string assertion.
Well, I lied. It's not that I'm nasty; it just makes the things that are about to happen easier to understand. According to regex101.com, it takes no steps at all:
Should you believe it this time? Yes? No? Let's see.
PCRE start-up optimizations
PCRE, being kind to its users, provides a set of features called start-up optimizations to speed things up. Which optimizations are applied depends on the regular expression being used.
One important feature of these optimizations is a pre-scan of the subject string to ensure that:
the subject string contains at least one character that corresponds to the first character of a match, or
a known starting point exists.
If neither is found, the matching function never runs.
That said, if our regex is /x/ and our subject string is abc, then with start-up optimization enabled a pre-scan looks for an x. If none is found, the whole match fails; better yet, the engine doesn't even bother going through the matching process.
So how does this information help?
Let's flashback to our first example and change our regex a little bit. From:
/^x/gm
to
/^x/g
The difference is that the m flag has been unset. For those who don't know what the m flag does when it is set:
It changes the meaning of the ^ and $ symbols so that they no longer mean the start and end of the string, but the start and end of a line.
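The semantic difference is easy to demonstrate. Python's re.MULTILINE behaves the same way as PCRE's m flag here (the subject string below is my own):

```python
import re

subject = 'abc\nxyz\nx1'

# Without MULTILINE, ^ matches only at the very start of the subject.
print(re.findall(r'^x', subject))                    # []

# With MULTILINE, ^ matches at the start of every line, so the engine
# has to retry the pattern at each line start.
print(re.findall(r'^x', subject, re.MULTILINE))      # ['x', 'x']
```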
Now what if we run the regex /^x/g over our subject string abc? Should we expect a difference in the number of steps the engine takes? Absolutely, yes. Let's look at the info regex101.com returns:
I really encourage you to believe it this time. It's accurate.
What's happening?
Well, it seems a little confusing, but we are going to clear things up. When the m modifier is not set, the pre-scan tries to assert the start of the string (a known starting point). If the assertion passes, the actual matching function runs; otherwise, "no match" is returned.
But wait... every subject string has exactly one start-of-string position, and it's always at the very beginning. So wouldn't a pre-scan be obviously unnecessary? Yes, and the engine doesn't do a pre-scan here. With /^x/g it immediately asserts the start of the string and then fails (since ^ matches, it goes through the actual matching process). That's why regex101.com shows the number of steps as 2.
But... with the m modifier set, things differ. Now the meaning of both the ^ and $ anchors changes. With ^ matching the start of a line, the same position in the subject string abc is asserted, but the next character is not x. We are now inside the actual matching process, and since the g flag is on, the next match attempt starts at the position before b and fails, and this trial and error continues to the end of the subject string.
The debugger shows 6 steps but the main page says 0 steps. Why?
I'm not sure about the latter, but for the sake of debugging the regex101 debugger runs with (*NO_START_OPT), so 6 steps is accurate only when that verb is set. I say I'm not sure about the latter because anchored patterns prevent the pre-scan optimization entirely, and we should know what counts as an anchored pattern:
A pattern is automatically anchored by PCRE if all of its
top-level alternatives begin with one of the following:
^ unless PCRE_MULTILINE is set
\A always
\G always
.* if PCRE_DOTALL is set and there are no back references to the
subpattern in which .* appears
Now you can see exactly what I meant when I said no pre-scan happens while the m flag is not set in /^x/g: it's considered an anchored pattern, which disables the pre-scan optimization. When the m flag is on, /^x/gm is no longer an anchored pattern, so the pre-scan optimization can take place.
The engine knows that the start-of-string anchor \A (or ^ while multiline mode is disabled) can only match once, so it doesn't continue at the next position.
Back to your own RegExes
The first two are anchored (a leading ^ with the m flag unset); the third is not. That is, the third regex benefits from a pre-scan optimization. You can believe the 35 steps, since an optimization produced that number. But if you disable start-up optimization:
(*NO_START_OPT)!next(?:(?<=^.{5})|(?=$))
you will see 57 steps, which closely matches the debugger's step count.
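As a side note, the three patterns from the question really are interchangeable as far as matching goes. Here's a quick sanity check in Python, whose re module also supports conditionals and fixed-width lookbehinds (the test strings are mine):

```python
import re

patterns = [
    r'(^)?!next(?(1)|$)',          # conditional
    r'^!next|!next$',              # alternation
    r'!next(?:(?<=^.{5})|(?=$))',  # lookarounds
]
tests = ['!next at start', 'at end !next', 'in the !next middle']
for p in patterns:
    # Each pattern should accept "!next" at the start, accept it at the
    # end, and reject a mid-string occurrence: [True, True, False]
    print([bool(re.search(p, t)) for t in tests])
```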
In SAS, I am setting up PRXPARSE functions to extract meaningful information from free-text answers to a survey. For the most part, I have done this without issue. However, I've started needing lookarounds, and now I am getting an incorrect match despite my best efforts.
Here is the expression that is being evaluated:
hlhx=PRXPARSE('/yes|(?<!no).*homeless.*(for|in|year|age)|at\sage|couch|was\shomeless|multiple|
lived.*streets|(?<!\bnot).*at\srisk|has\sbeen|high\srisk|currently\shomeless|
liv(es|ing|ed).*car|many|(?<!\bno).*(hx|history|h.?o)|(?<!\bno)(?<!low).+risk/ox');
A couple of responses should not match this expression, but do:
no hx of homelessness and low risk of homelessness
owns home, no h/o homelessness; low risk for homelessness
no and little risk
Obviously I have not properly specified my lookbehinds. Any help would be greatly appreciated.
EDIT: To put a finer point on it, what part of the expression is causing a match with entries like those in the list?
Best,
Lauren
Here's how your regex matches no and little risk:
One of the branches in your regex is ...|(?<!\bno)(?<!low).+risk.
The regex engine starts by attempting a match at every position within the target string, starting at the beginning:
no and little risk
^
The first constraint is that the current position cannot be preceded by a word boundary followed by "no" (due to (?<!\bno)). This condition is satisfied: The beginning of the target string is not preceded by anything.
The second constraint is that the current position cannot be preceded by "low" (due to (?<!low)). This condition is also satisfied (see above).
Then we match one or more non-newline characters, as many as possible (this is the .+ part). Here we initially consume the whole string:
no and little risk
------------------^
But then the regex requires a match of risk, which fails (there are no more characters left in the target string). This causes .+ to backtrack and consume fewer and fewer characters, until this happens:
no and little risk
--------------^
At this point, risk successfully matches and the regex finishes.
The basic problem is that what you want is (?<!\bno.+)(?<!low.+)risk, but what you wrote is (?<!\bno)(?<!low).+risk. These are two very different things!
The former means "match 'risk', but only if it's not preceded by 'no' or 'low' anywhere in the string (up to 1 character before 'risk')". The latter means "match any non-empty substring followed by 'risk', as long as it's not preceded by either 'no' or 'low'". This gives the regex engine the freedom to look for any matching position in the string, as long as it's not immediately preceded by "no" or "low" and is followed by ".+risk" somewhere.
Unfortunately (?<!\bno.+) is not a valid regex because look-behind assertions must have a fixed length.
One possible workaround is to do the following:
^(?!.*(?:\bno|low).+risk).*risk
This says: Starting from the beginning of the string, first make sure there is no "no" or "low" followed by "risk" anywhere, then match "risk" anywhere within the string.
This is not quite equivalent to the (hypothetical) variable-width look-behind version, because that one would have matched
risk no risk
^^^^
due to the presence of "risk" without "no" preceding it, whereas this workaround first finds
risk no risk
^^^^^^^
and immediately rejects the whole string.
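The whole comparison is easy to reproduce outside SAS. Here's a sketch in Python, where lookbehinds and lookaheads behave the same way (the test strings come from this discussion):

```python
import re

broken   = r'(?<!\bno)(?<!low).+risk'           # what was written
reworked = r'^(?!.*(?:\bno|low).+risk).*risk'   # the workaround above

for s in ['no and little risk', 'high risk noted', 'risk no risk']:
    # broken matches all three; reworked only accepts "high risk noted",
    # where no "no"/"low" precedes a later "risk"
    print(s, '->', bool(re.search(broken, s)), bool(re.search(reworked, s)))
```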
I recently became aware of Regular expression Denial of Service attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given at the OWASP link above and wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms.
A description of evil regexes, from wikipedia:
the regular expression applies repetition ("+", "*") to a complex subexpression;
for the repeated subexpression, there exists a match which is also a suffix of another valid match.
With examples, again from wikipedia:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.
Why Are Evil Regexes A Problem?
Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.
Here is a simple pattern inspired by the first example in the OP's post:
^((ab)*)+$
Given the input:
abababababababababababab
The regex engine tries something like (abababababababababababab) and a match is found on the first try.
But then we throw the monkey wrench in:
abababababababababababab a
The engine will first try (abababababababababababab) but that fails because of that extra a. This causes catastrophic backtracking, because our pattern (ab)*, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:
(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope
The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.
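You can watch the blow-up happen in any backtracking engine. Here's a small sketch in Python, with sizes kept deliberately small so it still finishes; each extra "ab" pair roughly doubles the number of combinations the engine must reject:

```python
import re
import time

pat = re.compile(r'^((ab)*)+$')
for n in (10, 14, 18):
    subject = 'ab' * n + 'a'       # the trailing 'a' forces overall failure
    start = time.perf_counter()
    assert pat.match(subject) is None
    # watch the elapsed time grow as n increases
    print(n, round(time.perf_counter() - start, 4))
```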
How to Spot Evil Regexes
It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem, which Alan Turing proved is impossible to solve in general. I have written problematic regexes myself, even though I know what they are and generally how to avoid them. Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression: "lock in whatever you matched on the first try". Note, however, that atomic expressions don't prevent backtracking within the expression, so ^(?>((ab)*)+)$ is still dangerous, but ^(?>(ab)*)+$ is safe (it'll match (abababababababababababab) and then refuse to give up any of its matched characters, thus preventing catastrophic backtracking).
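If your flavor lacks atomic groups (Python's re only gained (?>...) and possessive quantifiers in 3.11), the classic emulation is a capturing lookahead plus a backreference. Lookarounds discard their internal backtracking positions once they succeed, so the capture can't be re-split:

```python
import re

# Emulates ^(?>(?:ab)*)$: the lookahead captures greedily, and the
# backreference \1 then consumes exactly that capture. The engine never
# re-enters the lookahead to try a shorter match.
safe = re.compile(r'^(?=((?:ab)*))\1$')
print(bool(safe.match('ab' * 12)))        # True
print(bool(safe.match('ab' * 12 + 'a')))  # False, and it fails fast
```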
Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event.
Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.
Static Detection of DoS Vulnerabilities in
Programs that use Regular Expressions
Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig
The University of Texas at Austin
Detecting evil regexes
Try Nicolaas Weideman's RegexStaticAnalysis project.
Try my ensemble-style vuln-regex-detector which has a CLI for Weideman's tool and others.
Rules of thumb
Evil regexes are always due to ambiguity in the corresponding NFA, which you can visualize with tools like regexper.
Here are some forms of ambiguity. Don't use these in your regexes.
Nesting quantifiers like (a+)+ (aka "star height > 1"). This can cause exponential blow-up. See substack's safe-regex tool.
Quantified Overlapping Disjunctions like (a|a)+. This can cause exponential blow-up.
Quantified Overlapping Adjacencies like \d+\d+. This can cause polynomial blow-up.
Additional resources
I wrote this paper on super-linear regexes. It includes loads of references to other regex-related research.
What you call an "evil" regex is a regex that exhibits catastrophic backtracking. The linked page (which I wrote) explains the concept in detail. Basically, catastrophic backtracking happens when a regex fails to match and different permutations of the same regex can find a partial match. The regex engine then tries all those permutations. If you want to go over your code and inspect your regexes these are the 3 key issues to look at:
Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking. A classic example is (.|\s)* to match any amount of any text when the regex flavor does not have a "dot matches line breaks" mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces (matched by both . and \s) will break the regex. The fix is to use (.|\n)* to make the alternatives mutually exclusive or even better to be more specific about which characters are really allowed, such as [\r\n\t\x20-\x7E] for ASCII printables, tabs, and line breaks.
Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive with what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A classic example is a.*?b.*?c to match 3 things with "anything" between them. When c can't be matched, the first .*? will expand character by character until the end of the line or file. For each expansion the second .*? will expand character by character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first run needs to stop at b and the second run needs to stop at c. With single characters, a[^b]*+b[^c]*+c is an easy solution. Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance.
A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier. This is the problem illustrated in JDB's answer.
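The second point is easy to observe in Python. The sketch below drops the possessive + (stock re only supports it from 3.11); even the plain greedy character classes remove the pathological backtracking, since each run stops at its delimiter:

```python
import re

lazy_dots = re.compile(r'a.*?b.*?c')      # every failure re-expands both .*?
exclusive = re.compile(r'a[^b]*b[^c]*c')  # each run stops at its delimiter

for s in ('a--b--c', 'a--b----', 'abc'):
    # Both patterns agree on what matches; they differ only in how much
    # backtracking a failure costs.
    print(s, bool(lazy_dots.search(s)), bool(exclusive.search(s)))
```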
While I was writing my answer I decided that this merited a full article on my website. This is now online too.
I would sum it up as "A repetition of a repetition". The first example you listed is a good one, as it states "the letter a, one or more times in a row. This can again happen one or more times in a row".
What to look for in this case is a combination of quantifiers, such as * and +.
A somewhat more subtle thing to look out for is the third and fourth one. Those examples contain an OR operation in which both sides can match the same text. Combined with a quantifier on the group, this can result in a LOT of potential matches depending on the input string.
To sum it up, TLDR-style:
Be careful how quantifiers are used in combination with other operators.
I have surprisingly come across ReDoS quite a few times while performing source code reviews. One thing I would recommend is to use a timeout with whatever regular expression engine you are using.
For example, in C# I can create the regular expression with a TimeSpan attribute.
string pattern = @"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$";
Regex regexTags = new Regex(pattern, RegexOptions.None, TimeSpan.FromSeconds(1.0));
try
{
string noTags = regexTags.Replace(description, "");
System.Console.WriteLine(noTags);
}
catch (RegexMatchTimeoutException ex)
{
System.Console.WriteLine("RegEx match timeout");
}
This regex is vulnerable to denial of service: without the timeout it will spin and eat resources. With the timeout, it will throw a RegexMatchTimeoutException after the given timeout instead of consuming the resources that lead to a denial-of-service condition.
You will want to experiment with the timeout value to make sure it works for your usage.
I would say this is related to the regex engine in use. You may not always be able to avoid these types of regexes, but if your regex engine is built right, then it is less of a problem. See this blog series for a great deal of information on the topic of regex engines.
Note the caveat at the bottom of the article, in that backtracking is an NP-Complete problem. There currently is no way to efficiently process them, and you might want to disallow them in your input.
I don't think you can recognize such regexes, at least not all of them, or not without restrictively limiting their expressiveness. If you really care about ReDoS, I'd try to sandbox them and kill the processing with a timeout. It's also possible that there are regex implementations that let you limit the maximum amount of backtracking.
There are some ways I can think of to implement simplification rules: running regexes on small test inputs, or analyzing the regex's structure.
(a+)+ can be reduced using some sort of rule for replacing redundant operators to just (a+)
([a-zA-Z]+)* could also be simplified with our new redundancy combining rule to ([a-zA-Z]*)
The computer could run tests by running the small subexpressions of the regex against randomly generated sequences of the relevant characters, and seeing what groups they all end up in. For the first one, the computer is like: hey, the regex wants a's, so let's try it with 6aaaxaaq. It then sees that all the a's end up in one group (and only the first group is used), and concludes that no matter how many a's it puts in, it won't matter, since + gets them all into the group. For the second one, it's like: hey, the regex wants a bunch of letters, so let's try it with -fg0uj=. It then sees that, again, each bunch ends up in one group, so it gets rid of the + at the end.
Now we need a new rule to handle the next ones: The eliminate-irrelevant-options rule.
With (a|aa)+, the computer takes a look at it and is like: we like that big second one, but we can use the first one to fill in more gaps. Let's get as many aa's as we can, and see if we can get anything else after we're done. It could run it against another test string, like eaaa#a~aa., to determine that.
You can protect yourself from (a|a?)+ by having the computer realize that the strings matched by a? are not the droids we are looking for: since it can always match the empty string anywhere, we decide that we don't like things like (a?)+, and throw it out.
We protect from (.*a){x} by getting it to realize that the characters matched by a would have already been grabbed by .*. We then throw out that part and use another rule to replace the redundant quantifiers in (.*){x}.
While implementing a system like this would be very complicated, this is a complicated problem, and a complicated solution may be necessary. You should also use techniques other people have brought up, like only allowing the regex some limited amount of execution resources before killing it if it doesn't finish.
NOTE: The question is a bit long, as it includes a section from a book.
I was reading about atomic groups from Mastering Regular Expression.
It says that atomic groups lead to faster failure. Quoting that particular section from the book:
Faster failures with atomic grouping. Consider ^\w+: applied to
Subject. We can see, just by looking at it, that it will fail
because the text doesn’t have a colon in it, but the regex engine
won’t reach that conclusion until it actually goes through the
motions of checking.
So, by the time : is first checked, the \w+
will have marched to the end of the string. This results in a lot of
states — one skip me state for each match of \w by the plus
(except the first, since plus requires one match). When then checked
at the end of the string, : fails, so the regex engine backtracks to the most recently saved state, at which point the : fails again, this time trying to match t. This backtrack-test-fail cycle happens all the way back to the oldest state. After the attempt from the final state fails, overall failure can finally be announced.
All that backtracking is a lot of work that after just a glance we
know to be unnecessary. If the colon can’t match after the last
letter, it certainly can’t match one of the letters the + is forced
to give up!
So, knowing that none of the states left by \w+, once
it’s finished, could possibly lead to a match, we can save the regex
engine the trouble of checking them: ^(?>\w+):. By adding the atomic
grouping, we use our global knowledge of the regex to enhance the
local working of \w+ by having its saved states (which we know to be
useless) thrown away. If there is a match, the atomic grouping won’t
have mattered, but if there’s not to be a match, having thrown away
the useless states lets the regex come to that conclusion more
quickly.
I tried these regexes here. It took 4 steps for ^\w+: and 6 steps for ^(?>\w+): (with internal engine optimization disabled).
My Questions
In the second paragraph from above section, it is mentioned that
So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states: one skip-me state for each match of \w by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state, at which point the : fails again, this time trying to match t. This backtrack-test-fail cycle happens all the way back to the oldest state.
but on this site, I see no backtracking. Why?
Is there some optimization going on inside(even after it is disabled)?
Can the number of steps a regex takes tell us whether one regex performs better than another?
The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:
After \w+ consumes all the letters, it tries to match : and fails. Then it gives back one character, tries the : again, and fails again. And so on, until there's nothing left to give back. Fifteen steps total. Now look at the atomic version (^(?>\w+):):
After failing to match the : the first time, it gives back all the letters at once, as if they were one character. A total of five steps, and two of those are entering and leaving the group. And using a possessive quantifier (^\w++:) eliminates even those:
As for your second question, yes, the number-of-steps metric from regex debuggers is useful, especially if you're just learning regexes. Every regex flavor has at least a few optimizations that allow even badly written regexes to perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you're doing something wrong.
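For engines without atomic groups or possessive quantifiers (Python's re gained both only in 3.11), the capturing-lookahead trick gives the same behavior. The two variants below agree on what matches, but the emulated-atomic one fails without backtracking through \w+:

```python
import re

plain  = re.compile(r'^\w+:')
atomic = re.compile(r'^(?=(\w+))\1:')  # emulates ^(?>\w+): on any Python version

for s in ('Subject', 'Subject: hello'):
    # plain gives back one character at a time before failing on 'Subject';
    # atomic throws away the whole capture at once, but the results agree
    print(s, bool(plain.match(s)), bool(atomic.match(s)))
```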
Lately I am using a lot of regular expressions in java/groovy. For testing I routinely use regex101.com. Obviously I am looking at the regular expressions performance too.
One thing I noticed is that using .* properly can significantly improve overall performance. In particular, using .* in the middle of a regular expression, rather than at the end, is a performance killer.
For example, in this regular expression the required number of steps is 27:
If I change first .* to \s*, it will reduce the steps required significantly to 16:
However, if I change second .* to \s*, it does not reduce the steps any further:
I have few questions:
Why the above? I don't want to compare \s* and .*; I know the difference. I want to know why \s* and .* cost differently depending on their position in the complete regex, and what other characteristics of a regex can cost differently based on their position in the overall regex (or on any aspect other than position, if there is one).
Does the step counter on this site really give any indication of regex performance?
What other simple or similar (position-related) regex performance observations do you have?
The following is output from the debugger.
The big reason for the difference in performance is that .* will consume everything until the end of the string (except the newline). The pattern will then continue, forcing the regex to backtrack (as seen in the first image).
The reason that \s* and .* perform equally well at the end of the pattern is that greedy consumption vs. whitespace-only consumption makes no difference when there's nothing else left to match (besides whitespace).
If your test string didn't end in whitespace, there would be a difference in performance, much like you saw in the first pattern - the regex would be forced to backtrack.
EDIT
You can see the performance difference if you end with something besides whitespace:
Bad:
^myname.*mahesh.*hiworld
Better:
^myname.*mahesh\s*hiworld
Even better:
^myname\s*mahesh\s*hiworld
The way regex engines work with the * quantifier, aka the greedy quantifier, is to consume everything in the input that matches, then:
try the next term in the regex. If it matches, proceed on.
Otherwise, "unconsume" one character (move the pointer back one), aka backtrack, and go to step 1.
Since . matches anything (almost), the first state after encountering .* is to move the pointer to the end of input, then start moving back through the input one char at a time trying the next term until there's a match.
With \s*, only whitespace is consumed, so the pointer is initially moved exactly where you want it to be - no backtracking required to match the next term.
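That consume-then-unconsume loop can be sketched directly. This is a toy model of the engine's behavior, not how a real engine is implemented:

```python
def greedy_star_then(text, i, matches_char, rest):
    """Model of how a backtracking engine handles X* followed by a literal
    `rest`: consume every char matching X, then give them back one at a
    time until `rest` fits. Returns (end_index, backtracks) on success,
    or (-1, backtracks) on failure."""
    j = i
    while j < len(text) and matches_char(text[j]):
        j += 1                     # greedy: consume everything that matches
    backtracks = 0
    while j >= i:
        if text.startswith(rest, j):
            return j + len(rest), backtracks
        j -= 1                     # "unconsume" one char and retry
        backtracks += 1
    return -1, backtracks

s = 'myname   mahesh'
# .* consumes to the end, then has to backtrack to find "mahesh"
print(greedy_star_then(s, 6, lambda c: True, 'mahesh'))   # (15, 6)
# \s* stops at the first non-space: "mahesh" matches with no backtracking
print(greedy_star_then(s, 6, str.isspace, 'mahesh'))      # (15, 0)
```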
Something you should try is using the reluctant quantifier .*?, which will consume one char at a time until the next term matches, which should have the same time complexity as \s*, but be slightly more efficient because no check of the current char is required.
\s* and .* at the end of the expression will perform similarly, because both will consume everything at the end of the input that matches, which leaves the pointer in the same position for both expressions.