What does it mean that there is faster failure with atomic grouping - regex

NOTE: The question is a bit long as it includes a section from a book.
I was reading about atomic groups in Mastering Regular Expressions.
It says that atomic groups lead to faster failure. Quoting that particular section from the book:
Faster failures with atomic grouping. Consider ^\w+: applied to
Subject. We can see, just by looking at it, that it will fail
because the text doesn’t have a colon in it, but the regex engine
won’t reach that conclusion until it actually goes through the
motions of checking.
So, by the time : is first checked, the \w+
will have marched to the end of the string. This results in a lot of
states — one "skip me" state for each match of \w by the plus
(except the first, since plus requires one match). When then checked
at the end of the string, : fails, so the regex engine backtracks to
the most recently saved state, at which point the : fails again, this
time trying to match t. This backtrack-test-fail cycle happens all the
way back to the oldest state. After the attempt from the final state
fails, overall failure can finally be announced.
All that backtracking is a lot of work that after just a glance we
know to be unnecessary. If the colon can’t match after the last
letter, it certainly can’t match one of the letters the + is forced
to give up!
So, knowing that none of the states left by \w+, once
it’s finished, could possibly lead to a match, we can save the regex
engine the trouble of checking them: ^(?>\w+):. By adding the atomic
grouping, we use our global knowledge of the regex to enhance the
local working of \w+ by having its saved states (which we know to be
useless) thrown away. If there is a match, the atomic grouping won’t
have mattered, but if there’s not to be a match, having thrown away
the useless states lets the regex come to that conclusion more
quickly.
I tried these regexes here. It took 4 steps for ^\w+: and 6 steps for ^(?>\w+): (with internal engine optimizations disabled).
My Questions
In the second paragraph of the above section, it is mentioned that
So, by the time : is first checked, the \w+ will have marched to the end of the string. This results in a lot of states — one "skip me" state for each match of \w by the plus (except the first, since plus requires one match). When then checked at the end of the string, : fails, so the regex engine backtracks to the most recently saved state, at which point the : fails again, this time trying to match t. This backtrack-test-fail cycle happens all the way back to the oldest state.
but on this site, I see no backtracking. Why?
Is there some optimization going on inside (even after it is disabled)?
Can the number of steps taken by a regex be used to decide whether one regex performs better than another?

The debugger on that site seems to gloss over the details of backtracking. RegexBuddy does a better job. Here's what it shows for ^\w+:
After \w+ consumes all the letters, it tries to match : and fails. Then it gives back one character, tries the : again, and fails again. And so on, until there's nothing left to give back. Fifteen steps total. Now look at the atomic version, ^(?>\w+):.
After failing to match the : the first time, it gives back all the letters at once, as if they were one character. A total of five steps, and two of those are entering and leaving the group. And using a possessive quantifier (^\w++:) eliminates even those.
As for your second question, yes, the number-of-steps metric from regex debuggers is useful, especially if you're just learning regexes. Every regex flavor has at least a few optimizations that allow even badly written regexes to perform adequately, but a debugger (especially a flavor-neutral one like RegexBuddy's) makes it obvious when you're doing something wrong.
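If you want to see the same effect outside a debugger, you can time the patterns against a long failing subject. Here's a minimal sketch, assuming Python 3.11 or later (the first version whose re module supports atomic groups and possessive quantifiers); note that an engine's own optimizations, such as scanning ahead for the literal :, can shrink or even hide the gap:

import re
import timeit

subject = "w" * 200_000  # a long run of word characters, no colon: guaranteed failure

for name, pattern in [("backtracking", r"^\w+:"),
                      ("atomic",       r"^(?>\w+):"),
                      ("possessive",   r"^\w++:")]:
    rx = re.compile(pattern)
    seconds = timeit.timeit(lambda: rx.search(subject), number=50)
    print(f"{name:12} {seconds:.3f}s")

All three fail in linear time here; the difference is only that the atomic and possessive versions skip the give-back-one-character loop.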

Related

Why do non-capturing and atomic groups seem to add many steps [duplicate]

In an answer to a recent question, I contrived a couple of clever little regexes (at the asker's request) to match a substring at either the beginning or the end of a string. When run on Regex101, however, I noted that the different patterns have different step counts (indicating that the regex engine has to do more work for one vs. the other). To my mind, however, there is no intuitive reason that this should be so.
The three patterns are as follows:
Fun with conditionals: /(^)?!next(?(1)|$)/ (demo - 86 steps)
Classic alternation: ^!next|!next$ (demo - 58 steps)
Nasty lookarounds: !next(?:(?<=^.{5})|(?=$)) (demo - 35 steps)
Why is the first pattern so much less efficient than the second, and, most confusingly, why is the third so efficient?
TL;DR
Why is the first pattern so much less efficient than the second, and,
most confusingly, why is the third so efficient?
Because the first two are anchored and the third is not.
The real story: how steps are taken
Consider the regex /^x/gm. How many steps do you think the engine will take to return "no match" if the subject string is abc? You are right, two:
Assert beginning of string
Match x
Then the overall match fails, since no x immediately follows the beginning-of-string assertion.
Well, I lied. It's not that I'm nasty; it just makes it easier to understand what is about to happen. According to regex101.com, it takes no steps at all.
Should you believe it this time? Yes? No? Let's see.
PCRE start-up optimizations
PCRE, being kind to its users, provides some features to speed things up, collectively called start-up optimizations. Which optimizations it performs depends on the regular expression being used.
One important feature of these optimizations is a pre-scan of the subject string to ensure that:
the subject string contains at least one character that corresponds to the first character of a match, or
a known starting point exists.
If neither is found, the matching function never runs.
That said, if our regex is /x/ and our subject string is abc, then with start-up optimizations enabled a pre-scan looks for an x; if none is found, the whole match fails, or better, the engine doesn't even bother to go through the matching process.
So how does this information help?
Let's flashback to our first example and change our regex a little bit. From:
/^x/gm
to
/^x/g
The difference is the m flag, which is now unset. For those who don't know what the m flag does when it is set:
It changes the meaning of the ^ and $ anchors: they no longer mean start and end of string, but start and end of line.
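As a quick, flavor-neutral illustration of what the m flag changes (sketched here in Python, where re.M plays the role of PCRE's m modifier):

import re

subject = "x1\nax\nx2"

print(re.findall(r"^x", subject))        # ['x']       start of string only
print(re.findall(r"^x", subject, re.M))  # ['x', 'x']  start of every line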
Now what if we run the regex /^x/g over our subject string abc? Should we expect a difference in the number of steps the engine takes? Absolutely, yes. Let's look at the info regex101.com returns:
I really encourage you to believe it this time. It's accurate.
What's happening?
Well, it seems a little confusing, but let's clear things up. When the m modifier is not set, a pre-scan would look to assert start of string (a known starting point); if the assertion passed, the actual matching function would run, otherwise "no match" would be returned.
But wait... every subject string has one and only one start-of-string position, and it's always at the very beginning. So wouldn't a pre-scan be obviously unnecessary? Yes: the engine doesn't do a pre-scan here. With /^x/g it immediately asserts start of string and then fails (since ^ matches, it goes through the actual matching process). That's why regex101.com shows the number of steps as 2.
But... with the m modifier set, things differ. Now the meaning of both the ^ and $ anchors changes. With ^ matching start of line, the same position in the subject string abc is asserted, but the next character is not x. We are now inside the actual matching process, and since the g flag is on, the next match attempt starts at the position before b and fails, and this trial and error continues up to the end of the subject string.
Debugger shows 6 steps but main page says 0 steps, why?
I'm not sure about the latter, but for the sake of debugging, the regex101 debugger runs with (*NO_START_OPT), so the 6 steps are true only if this verb is set. And I said I'm not sure about the latter because all anchored patterns prevent a pre-scan optimization, and we should know what counts as an anchored pattern:
A pattern is automatically anchored by PCRE if all of its
top-level alternatives begin with one of the following:
^ unless PCRE_MULTILINE is set
\A always
\G always
.* if PCRE_DOTALL is set and there are no back references to the
subpattern in which .* appears
Now you can see what I was talking about when I said no pre-scan happens while the m flag is not set in /^x/g: it is considered an anchored pattern, which disables the pre-scan optimization. When the m flag is on, /^x/gm is no longer an anchored pattern, hence the pre-scan optimization can take place.
The engine knows the start-of-string anchor \A (or ^ while multiline mode is disabled) can match only once, so after a failure there it doesn't continue at the next position.
Back to your own RegExes
The first two are anchored (they begin with ^, without the m flag), the third is not. That is, the third regex benefits from a pre-scan optimization. You can believe the 35 steps, since an optimization produced that count. But if you disable start-up optimizations:
(*NO_START_OPT)!next(?:(?<=^.{5})|(?=$))
You will see 57 steps, which is roughly the same as the debugger's step count.

How to evaluate the performance of a certain regex with a certain engine? [duplicate]

I recently became aware of Regular Expression Denial of Service (ReDoS) attacks, and decided to root out so-called 'evil' regex patterns wherever I could find them in my codebase - or at least those that are used on user input. The examples given at the OWASP link above and Wikipedia are helpful, but they don't do a great job of explaining the problem in simple terms.
A description of evil regexes, from Wikipedia:
the regular expression applies repetition ("+", "*") to a complex subexpression;
for the repeated subexpression, there exists a match which is also a suffix of another valid match.
With examples, again from Wikipedia:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
Is this a problem that just doesn't have a simpler explanation? I'm looking for something that would make it easier to avoid this problem while writing regexes, or to find them within an existing codebase.
Why Are Evil Regexes A Problem?
Because computers do exactly what you tell them to do, even if it's not what you meant or is totally unreasonable. If you ask a regex engine to prove that, for some given input, there either is or is not a match for a given pattern, then the engine will attempt to do that no matter how many different combinations must be tested.
Here is a simple pattern inspired by the first example in the OP's post:
^((ab)*)+$
Given the input:
abababababababababababab
The regex engine tries something like (abababababababababababab) and a match is found on the first try.
But then we throw the monkey wrench in:
abababababababababababab a
The engine will first try (abababababababababababab) but that fails because of that extra a. This causes catastrophic backtracking, because our pattern (ab)*, in a show of good faith, will release one of its captures (it will "backtrack") and let the outer pattern try again. For our regex engine, that looks something like this:
(abababababababababababab) - Nope
(ababababababababababab)(ab) - Nope
(abababababababababab)(abab) - Nope
(abababababababababab)(ab)(ab) - Nope
(ababababababababab)(ababab) - Nope
(ababababababababab)(abab)(ab) - Nope
(ababababababababab)(ab)(abab) - Nope
(ababababababababab)(ab)(ab)(ab) - Nope
(abababababababab)(abababab) - Nope
(abababababababab)(ababab)(ab) - Nope
(abababababababab)(abab)(abab) - Nope
(abababababababab)(abab)(ab)(ab) - Nope
(abababababababab)(ab)(ababab) - Nope
(abababababababab)(ab)(abab)(ab) - Nope
(abababababababab)(ab)(ab)(abab) - Nope
(abababababababab)(ab)(ab)(ab)(ab) - Nope
(ababababababab)(ababababab) - Nope
(ababababababab)(abababab)(ab) - Nope
(ababababababab)(ababab)(abab) - Nope
(ababababababab)(ababab)(ab)(ab) - Nope
(ababababababab)(abab)(abab)(ab) - Nope
(ababababababab)(abab)(ab)(abab) - Nope
(ababababababab)(abab)(ab)(ab)(ab) - Nope
(ababababababab)(ab)(abababab) - Nope
(ababababababab)(ab)(ababab)(ab) - Nope
(ababababababab)(ab)(abab)(abab) - Nope
(ababababababab)(ab)(abab)(ab)(ab) - Nope
(ababababababab)(ab)(ab)(ababab) - Nope
(ababababababab)(ab)(ab)(abab)(ab) - Nope
(ababababababab)(ab)(ab)(ab)(abab) - Nope
(ababababababab)(ab)(ab)(ab)(ab)(ab) - Nope
                              ...
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ababab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab)(ab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(abab) - Nope
(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab)(ab) - Nope
The number of possible combinations scales exponentially with the length of the input and, before you know it, the regex engine is eating up all your system resources trying to solve this thing until, having exhausted every possible combination of terms, it finally gives up and reports "There is no match." Meanwhile your server has turned into a burning pile of molten metal.
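You can watch this blow-up happen with a few lines of code. Here's a minimal sketch using Python's re module (any backtracking engine shows the same shape; the top of this range may take a few seconds):

import re
import time

pattern = re.compile(r"^((ab)*)+$")

for n in range(12, 24, 2):
    subject = "ab" * n + "a"  # the trailing "a" forces overall failure
    start = time.perf_counter()
    assert pattern.match(subject) is None
    print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")

Each extra "ab" roughly doubles the time to report "no match", so each step of 2 in this loop roughly quadruples it.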
How to Spot Evil Regexes
It's actually very tricky. Catastrophic backtracking in modern regex engines is similar in nature to the halting problem, which Alan Turing proved was impossible to solve. I have written problematic regexes myself, even though I know what they are and generally how to avoid them. Wrapping everything you can in an atomic group can help to prevent the backtracking issue. It basically tells the regex engine not to revisit a given expression: "lock whatever you matched on the first try". Note, however, that atomic expressions don't prevent backtracking within the expression, so ^(?>((ab)*)+)$ is still dangerous, but ^(?>(ab)*)+$ is safe (it'll match (abababababababababababab) and then refuse to give up any of its matched characters, thus preventing catastrophic backtracking).
Unfortunately, once it's written, it's actually very hard to immediately or quickly find a problem regex. In the end, recognizing a bad regex is like recognizing any other bad code - it takes a lot of time and experience and/or a single catastrophic event.
Interestingly, since this answer was first written, a team at the University of Texas at Austin published a paper describing the development of a tool capable of performing static analysis of regular expressions with the express purpose of finding these "evil" patterns. The tool was developed to analyse Java programs, but I suspect that in the coming years we'll see more tools developed around analysing and detecting problematic patterns in JavaScript and other languages, especially as the rate of ReDoS attacks continues to climb.
Static Detection of DoS Vulnerabilities in Programs that use Regular Expressions. Valentin Wüstholz, Oswaldo Olivo, Marijn J. H. Heule, and Isil Dillig. The University of Texas at Austin.
Detecting evil regexes
Try Nicolaas Weideman's RegexStaticAnalysis project.
Try my ensemble-style vuln-regex-detector which has a CLI for Weideman's tool and others.
Rules of thumb
Evil regexes are always due to ambiguity in the corresponding NFA, which you can visualize with tools like regexper.
Here are some forms of ambiguity. Don't use these in your regexes.
Nesting quantifiers like (a+)+ (aka "star height > 1"). This can cause exponential blow-up. See substack's safe-regex tool.
Quantified Overlapping Disjunctions like (a|a)+. This can cause exponential blow-up.
Quantified Overlapping Adjacencies like \d+\d+. This can cause polynomial blow-up.
Additional resources
I wrote this paper on super-linear regexes. It includes loads of references to other regex-related research.
What you call an "evil" regex is a regex that exhibits catastrophic backtracking. The linked page (which I wrote) explains the concept in detail. Basically, catastrophic backtracking happens when a regex fails to match and different permutations of the same regex can find a partial match. The regex engine then tries all those permutations. If you want to go over your code and inspect your regexes these are the 3 key issues to look at:
Alternatives must be mutually exclusive. If multiple alternatives can match the same text then the engine will try both if the remainder of the regex fails. If the alternatives are in a group that is repeated, you have catastrophic backtracking. A classic example is (.|\s)* to match any amount of any text when the regex flavor does not have a "dot matches line breaks" mode. If this is part of a longer regex then a subject string with a sufficiently long run of spaces (matched by both . and \s) will break the regex. The fix is to use (.|\n)* to make the alternatives mutually exclusive or even better to be more specific about which characters are really allowed, such as [\r\n\t\x20-\x7E] for ASCII printables, tabs, and line breaks.
Quantified tokens that are in sequence must either be mutually exclusive with each other or be mutually exclusive with what comes between them. Otherwise both can match the same text and all combinations of the two quantifiers will be tried when the remainder of the regex fails to match. A classic example is a.*?b.*?c to match 3 things with "anything" between them. When c can't be matched, the first .*? will expand character by character until the end of the line or file. For each expansion the second .*? will expand character by character to match the remainder of the line or file. The fix is to realize that you can't have "anything" between them. The first run needs to stop at b and the second run needs to stop at c. With single characters, a[^b]*+b[^c]*+c is an easy solution. Since we now stop at the delimiter, we can use possessive quantifiers to further increase performance.
A group that contains a token with a quantifier must not have a quantifier of its own unless the quantified token inside the group can only be matched with something else that is mutually exclusive with it. That ensures that there is no way that fewer iterations of the outer quantifier with more iterations of the inner quantifier can match the same text as more iterations of the outer quantifier with fewer iterations of the inner quantifier. This is the problem illustrated in JDB's answer.
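Here is a minimal sketch of the first two issues above, in Python (exact timings vary by engine, and the possessive *+ requires Python 3.11+):

import re
import time

def clock(pat, subject):
    start = time.perf_counter()
    re.search(pat, subject)
    print(f"{pat:20} {time.perf_counter() - start:.3f}s")

# Issue 1: overlapping alternatives inside a repeated group. "." and "\s"
# both match a space, so the failing match tries 2^n ways to split the
# run of spaces; with "\n" the alternatives are mutually exclusive.
subject = "<" + " " * 20 + "!"      # no ">", so both patterns fail
clock(r"<(.|\s)*>", subject)        # exponential
clock(r"<(.|\n)*>", subject)        # linear

# Issue 2: sequential quantifiers that can match the same text. With no
# "c" present, a.*?b.*?c retries every split between the two lazy dots;
# the possessive version fails after a single linear scan.
subject = "a" + "b" * 10_000        # an "a", many "b"s, no "c"
clock(r"a.*?b.*?c", subject)        # quadratic
clock(r"a[^b]*+b[^c]*+c", subject)  # linear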
While I was writing my answer I decided that this merited a full article on my website. This is now online too.
I would sum it up as "A repetition of a repetition". The first example you listed is a good one, as it states "the letter a, one or more times in a row. This can again happen one or more times in a row".
What to look for in this case is a combination of quantifiers, such as * and +.
A somewhat more subtle thing to look out for is the third and fourth examples. Those contain an OR operation in which both sides can be true. Combined with a quantifier on the expression, this can result in a LOT of potential matches depending on the input string.
To sum it up, TLDR-style:
Be careful how quantifiers are used in combination with other operators.
I have, surprisingly, come across ReDoS quite a few times while performing source code reviews. One thing I would recommend is to use a timeout with whatever regular expression engine you are using.
For example, in C# I can create the regular expression with a TimeSpan timeout argument.
string pattern = @"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$";
Regex regexTags = new Regex(pattern, RegexOptions.None, TimeSpan.FromSeconds(1.0));

try
{
    // Strip the tags; throws if matching exceeds the one-second timeout.
    string noTags = regexTags.Replace(description, "");
    System.Console.WriteLine(noTags);
}
catch (RegexMatchTimeoutException ex)
{
    System.Console.WriteLine("RegEx match timeout");
}
This regex is vulnerable to denial of service: without the timeout it will spin and eat resources. With the timeout, it will throw a RegexMatchTimeoutException after the given timeout instead of consuming the resources that lead to a denial-of-service condition.
You will want to experiment with the timeout value to make sure it works for your usage.
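Not every engine has a built-in timeout. Python's re module, for instance, has none, so a common workaround (a sketch, not the only approach) is to run the match in a separate process and kill it if it takes too long:

import re
import multiprocessing

def evil_match(subject):
    # The vulnerable pattern from earlier in this thread.
    print(re.match(r"^((ab)*)+$", subject))

if __name__ == "__main__":
    p = multiprocessing.Process(target=evil_match, args=("ab" * 100 + "a",))
    p.start()
    p.join(timeout=1.0)
    if p.is_alive():
        p.terminate()  # kill the runaway match
        p.join()
        print("RegEx match timeout")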
I would say this is related to the regex engine in use. You may not always be able to avoid these types of regexes, but if your regex engine is built right, then it is less of a problem. See this blog series for a great deal of information on the topic of regex engines.
Note the caveat at the bottom of the article: backtracking is an NP-complete problem. There is currently no way to process such regexes efficiently, and you might want to disallow them in your input.
I don't think you can recognize such regexes, at least not all of them, and not without restrictively limiting their expressiveness. If you really care about ReDoS, I'd try to sandbox them and kill their processing with a timeout. It's also possible that there are regex implementations that let you limit their maximum amount of backtracking.
There are some ways I can think of to implement simplification rules, by running the regexes on small test inputs or analyzing their structure.
(a+)+ can be reduced to just (a+) using some sort of rule for replacing redundant operators
([a-zA-Z]+)* could likewise be simplified to ([a-zA-Z]*) with our new redundancy-combining rule
The computer could run tests by running small subexpressions of the regex against randomly generated sequences of the relevant characters, and seeing which groups they all end up in. For the first one, the computer is like: hey, the regex wants a's, so let's try it with 6aaaxaaq. It then sees that all the a's end up in one group, and concludes that no matter how many a's it puts in, it won't matter, since + gets them all into the group. For the second one, it's like: hey, the regex wants a bunch of letters, so let's try it with -fg0uj=, and then it sees that again each bunch ends up in one group, so it gets rid of the + at the end.
Now we need a new rule to handle the next ones: the eliminate-irrelevant-options rule.
With (a|aa)+, the computer takes a look at it and is like: we like that big second one, but we can use the first one to fill in more gaps; let's get as many aa's as we can, and see if we can get anything else after we're done. It could run it against another test string, like eaaa#a~aa., to determine that.
You can protect yourself from (a|a?)+ by having the computer realize that the strings matched by a? are not the droids we are looking for: since a? can always match (it can match the empty string anywhere), we decide that we don't like things like (a?)+, and throw it out.
We protect against (.*a){x} by getting the computer to realize that the characters matched by a would already have been grabbed by .*. We then throw out that part and use another rule to replace the redundant quantifiers in (.*){x}.
While implementing a system like this would be very complicated, this is a complicated problem, and a complicated solution may be necessary. You should also use techniques other people have brought up, like only allowing the regex some limited amount of execution resources before killing it if it doesn't finish.

REGEX: PCRE atomic group doesn't work

In my PCRE regular expression I used an atomic group to reduce backtracking.
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
REGEX101
But in the example debugger I see that after the f branch fails, the engine goes on to try the other alternatives. I'm trying to stop it after the f branch fails so that it doesn't check the rest of the expression. What's wrong?
I will assume you know what you're doing by using regex here, since there's probably an argument to be made that PCRE is not the best approach to implementing this sort of matching in a "tree"-like fashion. But I'm not fussed about that.
The idea of using conditionals isn't bad, but it adds extra steps in the form of the conditions themselves. Also, you can only branch off in two directions per conditional.
PCRE has a feature called "backtracking control verbs" which allow you to do precisely what you want. They have varying levels of control, and the one I would suggest in this case is the strongest:
<\/?\s*\b(?>a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|em(?:bed)?|f(*COMMIT)(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|h(?:[1-6r]|ead(?:er)?|tml)|i(?:frame|mg|nput|ns)?|kbd|l(?:abel|egend|i(?:nk)?)|m(?:a(?:in|p|rk)|et(?:a|er))|n(?:av|o(?:frames|script))|o(?:bject|l|pt(?:group|ion)|utput)|p(?:aram|icture|re|rogress)?|q|r[pt]|ruby|s|s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|ul?|v(?:ar|ideo)|wbr)\b
https://regex101.com/r/p572K8/2
Just by adding a single (*COMMIT) verb after the 'f' branch, it's cut the number of steps required to find a failure in this case by half.
(*COMMIT) tells the engine to commit to the match at that point. It won't even re-attempt the match starting from </ again if no match is found.
To fully optimize the expression, you'll have to add (*COMMIT) at every point after branching has occurred.
Another thing you can do is try to re-order your alternatives in such a way as to prioritize those that are found most commonly. That might be something else to consider in your optimization process.
Because that's how atomic groups work. The idea is:
at the current position, find the first sequence that matches the pattern inside atomic grouping and hold on to it.
(Source: Confusion with Atomic Grouping - how it differs from the Grouping in regular expression of Ruby?)
So until a match is found inside the atomic group, the engine will still iterate through all the alternatives; the group only discards backtracking states after it has matched.
You can use conditionals instead:
</?\s*\b(?(?=a)a(?:bbr|cronym|ddress|pplet|r(?:ea|ticle)|side|udio)?|(?(?=b)b(?:ase|asefont|d[io]|ig|lockquote|ody|r|utton)?|(?(?=c)c(?:anvas|aption|enter|ite|ode|ol(?:group)?)|(?(?=d)d(?:ata(?:list)?|[dlt]|el|etails|fn|ialog|i[rv])|(?(?=e)em(?:bed)?|(?(?=f)f(?:i(?:eldset|g(?:caption|ure))|o(?:nt|oter|rm)|rame(?:set)?)|(?(?=h)h(?:[1-6r]|ead(?:er)?|tml)|(?(?=i)i(?:frame|mg|nput|ns)?|(?(?=k)kbd|(?(?=l)l(?:abel|egend|i(?:nk)?)|(?(?=m)m(?:a(?:in|p|rk)|et(?:a|er))|(?(?=n)n(?:av|o(?:frames|script))|(?(?=o)o(?:bject|l|pt(?:group|ion)|utput)|(?(?=p)p(?:aram|icture|re|rogress)?|(?(?=q)q|(?(?=r)r[pt]|(?(?=r)ruby|(?(?=s)s|(?(?=s)s(?:amp|ection|elect|mall|ource|pan|trike|trong|tyle|ub|ummary|up|vg)|(?(?=t)t(?:able|body|[dhrt]|emplate|extarea|foot|head|ime|itle|rack)|(?(?=u)ul?|(?(?=v)v(?:ar|ideo)|wbr))))))))))))))))))))))\b
Regex101


Rules of regex engines. Greediness, eagerness and laziness of regexes

As we all know, a regex engine uses two rules when it goes about its work:
Rule 1: The Match That Begins Earliest Wins or regular expressions
are eager.
Rule 2: Regular expressions are greedy.
These lines appear in a tutorial:
The two of these rules go hand in hand.
It's eager to give you a result, so what it does is it tries to just
keep letting that first one do all the work.
While we're already in the middle of it, let's keep going, get to the
end of the string and then when it doesn't work out, then it will
backtrack and try another one.
It doesn't backtrack back to the beginning; it doesn't try all sorts
of other combinations.
It's still eager to get you a result, so it says, what if I just gave
back one?
Would that allow me to give a result back?
If it does, great, it's done. It's able to just finish there.
It doesn't have to keep backtracking further in the string, looking
for some kind of a better match or match that's further along.
I don't quite understand these lines (especially the 2nd sentence ("While we're...") and the last one ("It doesn't have to keep backtracking...")).
And these lines about lazy mode.
It still defers to the overall match just like the greedy one does
clearly.
I don't understand the following analogy:
It's not necessarily any faster or slower to choose a lazy strategy or
a greedy strategy, but it will probably match different things.
Now as far as is faster or slower, it's a little bit like saying, if
you've lost your car keys and your sunglasses inside your house, is it
better to start looking in the kitchen or to start looking in the
living room?
You don't know which one's going to yield the best result, and you
don't know which one's going to find the sunglasses first or the keys
first; it's just about different strategies of starting the search.
So you will likely get different results depending on where you start,
but it's not necessarily faster to start in one place or the other.
What does 'faster or slower' mean?
I'm going to draw a scheme of how it works (in both cases), and I will contemplate these questions until I find out what's going on here!
I need to understand it exactly and unambiguously.
Thanks.
Let's try an example.
For the input this is input for test input and a regex like /this.*input/,
the match will be this is input for test input.
What happens is:
the engine starts examining the string, and it gets a match with this is input.
But now it's in the middle of the string, so it continues to see if it can match more (this is the "While we're already in the middle of it, let's keep going").
It matches up to this is input for test input and continues to the end of the string.
At the end, there are characters which are not part of the match, so the interpreter "backtracks" to the last place it matched.
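You can check this in code. A small sketch in Python, showing the greedy .* (the behavior walked through above) next to its lazy counterpart .*?:

import re

subject = "this is input for test input"

print(re.search(r"this.*input", subject).group())   # this is input for test input
print(re.search(r"this.*?input", subject).group())  # this is input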
The last part is more about OR'd regexes.
Consider the input string cdacdgabcdef and the regex (ab|a).*.
A common mistake is thinking it will return the more precise match (in this case abcdef), but it will return acdgabcdef, because the a at the earliest position is the first thing to match.
What happens here is: there's something matching this part, so let's continue to the next part of the pattern and forget about the other options for this part.
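The same sketch, extended to this example: the engine keeps the match that begins earliest rather than the alternative that would match more text.

import re

print(re.search(r"(ab|a).*", "cdacdgabcdef").group())  # acdgabcdef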
For the lazy and greedy questions, the link from @AvinashRaj is clear enough; I won't repeat it here.