Does a regular expression engine parse characters one by one? - regex

This may be an absurd question, but let me proceed.
While exploring regular expressions, I came across a scenario where the expression is
[A-Z0-9]+(\d\d\.\d+)
The input string is 123.456.
The pattern [A-Z0-9]+ could have matched up to 123, but that is not followed by two more digits (\d\d) and a literal dot character. So the engine ended up with the characters 23.456 in the first subgroup.
Does a regular expression engine check for a match by parsing one character at a time? That was my assumption.
Looking at this, it seems not. The engine must be parsing characters while also moving the matching window back and forth in order to find a match.
Correct me if I am wrong.

A regex engine parses the string according to the pattern it is given.
Your pattern is [A-Z0-9]+(\d\d\.\d+). Given the 123.456 string, [A-Z0-9]+ is first tried from the beginning of the string. 123 is grabbed first (since + is a greedy quantifier). Then the regex engine tries to match the rest of the string with (\d\d\.\d+) and fails. Backtracking occurs because the regex engine knows that [A-Z0-9]+ can match a different (smaller) portion of the string, and thus the 3 is dropped from the currently consumed chars, and (\d\d\.\d+) is retried against 3.456, but there must be 2 digits before the dot. Backtracking happens again: the 2 is dropped as well, and (\d\d\.\d+) now matches 23.456.
Thus, only 1 remains outside the capturing group 1 value.
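The outcome of that backtracking can be checked with a few lines of Python. This is only a sketch using the standard re module (the variable names are illustrative); it shows the final result, not the intermediate steps the engine takes.

```python
import re

pattern = re.compile(r"[A-Z0-9]+(\d\d\.\d+)")
m = pattern.search("123.456")

whole = m.group(0)    # text matched by the entire pattern
group1 = m.group(1)   # text captured by (\d\d\.\d+)
# Whatever [A-Z0-9]+ kept after backtracking is the part before Group 1.
prefix = whole[: m.start(1) - m.start(0)]

print(whole, group1, prefix)  # 123.456 23.456 1
```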
Also, have a look at the steps shown in the regex101.com debugger, which marks each point where backtracking occurs.

Related

Regex finding shortest string with starting word and ending word

I want to find a way to write a regular expression that searches for occurrences of a string which begins with a specified beginning substring and ends with another specified ending substring, but whose total length is minimal. For example, if my beginning string was bar and my ending string was foo, then when searching through the string barbazbarbazfoobazfoo I would want it to return barbazfoo.
I am aware of how to do this if it were just a single character at one end or the other. For example, replacing the words above with characters, I could search using a[^a]*?b to find the string axb within the string axaxbxb. But since I am looking for words rather than characters, I can't simply exclude a particular letter, since that letter is allowed to appear in between.
For context, I am attempting to read through logs from a server and would like to find, for example, which users encountered a specific error, but there is additional information between where the username appears and where the information about the exception occurs. As such, I am not looking for a solution which uses the fact that foo in the above example has the only occurrences of the letters f and o.
Additional example: From the first paragraph on this regex tutorial about lookahead and lookbehind
The text reads:
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not. Lookaround allows you to create regular expressions that are impossible to create without them, or that would get very longwinded without them.
If my start word was lookaround and my end word was match, then I expect to find the substring lookaround actually match, noting that there are potentially multiple occurrences of the target words and an unknown number of words and characters in between, possibly sharing characters with the target words. In the above example, for instance, lookaround[^lookaround]*?match finds no match, as the character class avoids each of the letters l, o, k, ... individually. I am looking for a way to avoid substrings rather than individual letters.
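You have to use a tempered greedy token, which consumes one character at a time only while neither the start word nor the end word begins at the current position: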
First (with word boundaries)
\blookaround\b(?:(?!\b(?:match|lookaround)\b).)*\bmatch\b
matches lookaround actually matches characters, but then gives up the match
Second (without)
lookaround(?:(?!(?:match|lookaround)).)*match
matches lookaround actually match
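As a rough illustration of the second (boundary-free) form, here is a small Python sketch. The helper name shortest_between is hypothetical, and the use of re.DOTALL (so the dot also crosses line breaks) is an assumption added for log-style input, not something the answer specifies.

```python
import re

def shortest_between(start: str, end: str, text: str) -> list[str]:
    """Find spans that begin with `start`, end with `end`, and contain neither in between."""
    s, e = re.escape(start), re.escape(end)
    # Tempered greedy token: consume a character only if neither word starts here.
    tempered = rf"{s}(?:(?!{s}|{e}).)*{e}"
    return re.findall(tempered, text, flags=re.DOTALL)

sample = (
    'called "lookaround", are assertions. The difference is that lookaround '
    "actually matches characters, but then gives up the match."
)

print(shortest_between("bar", "foo", "barbazbarbazfoobazfoo"))  # ['barbazfoo']
print(shortest_between("lookaround", "match", sample))  # ['lookaround actually match']
```

Escaping the two words with re.escape keeps the sketch safe for start and end strings that contain regex metacharacters.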

Email-similar regex catastrophic backtracking

I'd like to match something which may be called the beginning of an e-mail address, i.e.:
1 character (any letter of the alphabet or a digit)
0 or 1 dot
1 or more characters
Zero or more repetitions of the 2nd and 3rd points
a # character
The regex I've been trying to apply on Regex101 is \w(\.?\w+)*#.
I am getting the error Catastrophic backtracking. What am I doing wrong? Is the regex correct?
Catastrophic backtracking typically appears with nested quantifiers when the inner group contains at least one optional subpattern, so that the quantified group can match the same text as the subpattern before it, and the outer group is not at the end of the pattern.
Your regex causes the issue precisely because (\.?\w+)* is not at the end and contains an optional \.?, so for text without dots the expression effectively reduces to \w(\w+)*#.
For example, the pattern should accept aaa.aaaaaa.a.aa.aa but not aaa..aaaa.a.
What you need is
^\w+(?:\.\w+)*#
See the regex demo
^ - start of string (to avoid partial matches)
\w+ - 1 or more word chars
(?:\.\w+)* - zero or more sequences of:
\. - a literal dot
\w+ - 1 or more word chars
# - a literal # char.
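A quick check of the rewritten pattern in Python, as a sketch only. The sample strings are made up for illustration, and the long non-matching input is there to show that the pattern gives up without trouble when the # is missing.

```python
import re

prefix = re.compile(r"^\w+(?:\.\w+)*#")

accepted = prefix.search("aaa.aaaaaa.a.aa.aa#example")
rejected = prefix.search("aaa..aaaa.a#example")   # two adjacent dots
no_hash = prefix.search("a" * 5000 + "!")         # long input without '#'

print(accepted.group())  # aaa.aaaaaa.a.aa.aa#
print(rejected)          # None
print(no_hash)           # None
```

Because the dot is now mandatory inside the repeated group, \w+ and (?:\.\w+)* can no longer compete for the same characters, so there is only one way to split a run of word characters.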
The problem
"Catastrophic backtracking" occurs when a part of the string could match a part of the regex in many different ways, so the engine needs to retry repeatedly to determine whether or not the string actually matches. A simple case: the regex a+a+b, which matches two or more a followed by one b. If you were to run that on aaaaaaaaaaa, the problem arises: first, the first a+ matches everything, and the second a+ fails. Then, it tries with the first a+ matching all but one a, and the second a+ matching one a (this is "backtracking"), and then it fails on the b. But regex engines aren't "smart" enough to know that they could stop there, so the engine has to keep going in that pattern until it has tried every split that gives some a's to the first a+ and some to the second. Some regex engines will realize they're getting stuck like this, and quit after enough steps, with the error you saw.
For your specific pattern: what you have there matches any nonzero quantity of letters or digits, mixed with any quantity of . where the . cannot be first, followed by a #. The only additional limit is that there can't be two adjacent dots. Effectively, this is the same case as my example: the * applied to a section containing a + acts like multiple duplicates of that +-ed section.
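To make the a+a+b case concrete, the following Python sketch counts how many splits a naive backtracking search would try before giving up. It is a simplified model written for illustration (the function name count_failed_splits is hypothetical), not the algorithm of any particular engine; real engines may apply optimizations that skip some of this work.

```python
def count_failed_splits(n: int) -> int:
    """Count the (first a+, second a+) splits tried on n 'a' characters with no 'b'."""
    subject = "a" * n
    attempts = 0
    # The first a+ is greedy: longest run first, then give back one character at a time.
    for first in range(n, 0, -1):
        # The second a+ is greedy too, over whatever the first one left behind.
        for second in range(n - first, 0, -1):
            attempts += 1
            pos = first + second
            if pos < len(subject) and subject[pos] == "b":
                return attempts  # unreachable here: the subject has no 'b'
    return attempts

print(count_failed_splits(11))  # 55
print(count_failed_splits(22))  # 231
```

Doubling the input roughly quadruples the work for two adjacent quantifiers. When a quantified group is itself repeated, as in (\.?\w+)*, the number of ways to split the text grows much faster than that.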
Atomic grouping
You could try something with atomic grouping. That basically says "once you've found any match for this, don't backtrack into it". After all, if you've found some run of \w, it's not going to contain a \. and there's no need to keep rechecking that: dots are not letters or digits, and neither of those is a #.
In this case, the result would be \w(?>\.?\w+)*#. Note that not all regex engines support atomic grouping, though the one you linked does. If the whole string is a match, nothing will change; if it's not a match, or contains non-matching portions, the process will take fewer steps. Using #eddiem's example from the comments, it finds two matches in 166311 steps with your original, but only takes 623 steps with atomic grouping added.
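For engines without atomic groups, a commonly used substitute is a capturing lookahead followed by a backreference, (?=(...))\1, which relies on the engine not backtracking into a lookahead once it has succeeded. The Python sketch below uses that emulation rather than the (?>...) syntax itself, so that it also runs on Python versions before 3.11 (where native atomic groups and possessive quantifiers were added to re). The sample strings are made up.

```python
import re

# Emulation of \w(?>\.?\w+)*# : the lookahead captures one chunk,
# then \1 consumes exactly that chunk with no way to shorten it.
emulated_atomic = re.compile(r"\w(?:(?=(\.?\w+))\1)*#")

hit = emulated_atomic.search("first.last#example")
miss = emulated_atomic.search("a" * 40 + "!")  # no '#' anywhere

print(hit.group())  # first.last#
print(miss)         # None
```

The extra capturing group is a side effect of the emulation; where the native syntax is available it is the cleaner choice.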
Possessive quantifiers
Another option would be a possessive quantifier: \w(\.?\w+)*+# means roughly the same thing. *+, specifically, is "whatever the star matches, don't backtrack inside it". In the above case, it matches in 558 steps, but it has a slightly different meaning, in that it treats all the repeats together as one atomic value, instead of as several distinct atomic values. I don't think there's a difference in this case, but there might be in some cases. Again, this is not supported by all regex engines.

Regex to match html open and close tags(need some explanation)

I am having difficulty understanding some nuances of regular expressions. I am following the tutorial at http://www.regular-expressions.info/backref.html and am stuck on the example of matching opening and closing tags using backreferences.
We have the string:
Testing <B><I>bold italic</I></B> text
and expression:
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
I understand the overall logic, but I cannot see why the engine backtracks to the dot:
The engine has now arrived at the second < in the regex, and the
second < in the string. These match. The next token is /. This does
not match I, and the engine is forced to backtrack to the dot. The dot
matches the second < in the string. The star is still lazy, so the
engine again takes note of the available backtracking position and
advances to < and I. These do not match, so the engine again
backtracks.
Why does it backtrack to the dot? Is this because the previous part of the regex matched successfully and the engine always backtracks to the position of the previous successful match + 1?
The second part I do not fully understand either. If we have the string:
Testing <BOO><I>bold italic</I></B> text
and expression without word boundary:
<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>
...and look inside the regex engine at the point where \1 fails the
first time. First, .*? continues to expand until it has reached the
end of the string, and </\1> has failed to match each time .*? matched
one more character.
Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character.
Why does it backtrack into the capturing group and not to the dot as in the previous example? And why is [A-Z0-9]* forced to give up one character? Is there a general rule for where the engine will backtrack?
NOTE: It is not about HTML parsing, it is a drill-down into how backtracking works using an HTML string example from http://regular-expression.info/backref.html.
The problem is that I just cannot understand why backtracking rolls back to a particular position in general.
The point is that a regular expression engine tries to find a match by all means. If there are options, different paths it may follow based on the current pattern, it will try them once it finds unmatching symbols on its way. See this backtracking introduction at rexegg.com:
Backtracking is a wonderful feature of modern regex engines: if a token fails to match, the engine backtracks to any position where it could have taken a different path. A greedy quantifier may then give up one character, a lazy quantifier may expand to match one more, or the rightmost side of an alternation may be tried. If a pattern continues to fail, the engine systematically explores all available paths.
So, backtracking may roll back to every construct or grouping that has a quantifier/alternation set to make sure all possible combinations are tried before a match failure is asserted. Your assumption that it always backtracks to the last matched symbol is not correct.
The only places backtracking cannot enter are atomic groups and groups with possessive quantifiers. Also, the fact that a lookaround is zero-length automatically makes it atomic (see lookarounds).
In the first regex, \b marks a word boundary, and thus there can be no backtracking into the capturing group, as there is no word boundary other than the one already matched. When you remove it, backtracking can test all the preceding locations inside the capturing group.
To understand the importance of backtracking and \b, compare these regexes against the Testing <Boo><I>bold italic</I></Bo> text input:
<([A-Z][A-Z0-9]*)[^>]*>.*?<\/\1o> - the match is found as no word boundary is set and the engine backtracks into the capturing group freely, and the capturing group may contain B, Bo and Boo.
<([A-Z][A-Z0-9]*)\b[^>]*>.*?<\/\1o> - no match is found as Group 1 can only contain Boo.
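The comparison can be reproduced in Python. This sketch assumes case-insensitive matching (re.I), since [A-Z] would otherwise not match the lowercase letters in Boo; the variable names are illustrative.

```python
import re

text = "Testing <Boo><I>bold italic</I></Bo> text"

without_boundary = re.search(r"<([A-Z][A-Z0-9]*)[^>]*>.*?<\/\1o>", text, re.I)
with_boundary = re.search(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?<\/\1o>", text, re.I)

# Without \b the engine backtracks into Group 1 until "B" + "o" fits the closing tag.
print(without_boundary.group(0))  # <Boo><I>bold italic</I></Bo>
print(without_boundary.group(1))  # B
# With \b, Group 1 can only be "Boo", and </Booo> is not in the text.
print(with_boundary)              # None
```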

Regex how to match two similar numbers in separate match groups?

I got the following string:
[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS:
20.000
The two numbers (4.126 and 20.000) should be matched, each into its own capture group.
My current expression is (\d+.\d{3}), which matches 4.126. How can I match 20.000 into a second capture group? Adding the same capture group again makes it find nothing. So what I basically need is: "search for the first number, then ignore everything until you find the next digit."
You could use something like this: (\d+\.\d{3}).+?(\d+\.\d{3})$, which is essentially your regex (plus a minor fix) twice: the first copy captures the first number, .+? skips the text in between, and the second copy must match at the end of the string.
Another minor note: your regex has a potential issue in that it matches the decimal point with an unescaped period. In regular expressions, the period means any character, so your expression would also match 4s222. Adding a \ in front makes the regex engine treat it as a literal character, and not a special one.
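Both points can be checked with a short Python sketch. The log line is joined onto one line here, and the variable names are made up for the example.

```python
import re

line = "[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000"

m = re.search(r"(\d+\.\d{3}).+?(\d+\.\d{3})$", line)
tick_ms, tps = m.groups()
print(tick_ms, tps)  # 4.126 20.000

# The unescaped dot accepts any character between the digits; the escaped one does not.
loose = re.fullmatch(r"\d+.\d{3}", "4s222")
strict = re.fullmatch(r"\d+\.\d{3}", "4s222")
print(loose is not None, strict is not None)  # True False
```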

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick
Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a lookahead to validate the input before attempting the match. It looks for between 3 and 10 characters from the class [\d-], followed by a dash and a digit. After that comes the actual match, which confirms that the string has the form digits(dash)digits(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1
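The lists above can be reproduced with a short Python sketch that runs the lookahead pattern over the sample strings plus the two extra ones. Testing each string separately with fullmatch is a choice made for this example; the answer itself does not say how the pattern was applied.

```python
import re

lookahead_first = re.compile(r"(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b")

samples = [
    "22-22-1", "22-22-22", "333-333-1", "333-4444-1", "4444-4444-1",
    "4444-55555-1", "55555-4444-1", "666666-7777777-1", "88888888-88888888-1",
    "1-1-1", "88888888-88888888-22", "22-333-", "333-22",
    "22-7777777-1", "1-88888888-1",  # the two additional strings
]

matched = [s for s in samples if lookahead_first.fullmatch(s)]
print(matched)
```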
Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)
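The difference between the question's regex and the enumerated alternation can be seen on the two strings mentioned above. This is a Python sketch with illustrative names; the 11-digit string is an extra case added here to show that both patterns still reject overly long input.

```python
import re

original = re.compile(r"\b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b")
enumerated = re.compile(
    r"\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}"
    r"|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b"
)

candidates = ["22-22-1", "1-12345678-1", "123456-1-1", "12345-12345-1"]
# Map each string to (matched by original, matched by enumerated).
results = {
    s: (bool(original.fullmatch(s)), bool(enumerated.fullmatch(s)))
    for s in candidates
}
for s, outcome in results.items():
    print(s, outcome)
```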
You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design, these two patterns match at least three digits separated by hyphens (so there is no need to add a {5,n} quantifier anywhere).
The patterns are also built to fail fast:
I have chosen to start them with a digit \d; this way, each beginning of a line or word boundary not followed by a digit is immediately discarded. Also, since exactly one digit has been consumed, I know the maximum length of the remaining string.
Then I test the upper limit of the string length with a negative lookahead that checks whether there is one more character than the maximum length (if there are 12 characters after this position, there are at least 13 characters in the string). There is no need for anything more descriptive than the dot metacharacter here; the goal is to test the length quickly.
Finally, I describe the end of the string without doing anything special. That is probably the slowest part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.
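A small Python sketch of the multiline variant, run against the sample text from the question. The two boundary strings at the end (12 and 13 characters long) are added here only to illustrate the length cap.

```python
import re

text = """22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22"""

# re.MULTILINE makes ^ and $ match at every line start and end.
length_capped = re.compile(r"^\d(?!.{12})\d*-\d+-\d$", re.MULTILINE)
found = length_capped.findall(text)
print(found)

# One digit plus at most 11 more characters: 12 passes, 13 does not.
at_limit = length_capped.fullmatch("1-12345678-1")
over_limit = length_capped.fullmatch("12-12345678-1")
print(at_limit is not None, over_limit is not None)  # True False
```

Since the dot does not match a line break by default, the lookahead only ever measures the current line.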