Email-similar regex catastrophic backtracking

I'd like to match something which may be called the beginning of an e-mail address, i.e.:
1 character (any letter or digit)
0 or 1 dot
1 or more characters
The 2nd and 3rd points repeated zero or more times
A # character
The regex I've been trying to apply on Regex101 is \w(\.?\w+)*#.
I am getting the error Catastrophic backtracking. What am I doing wrong? Is the regex correct?

Catastrophic backtracking usually appears with nested quantifiers when the inner group contains at least one optional subpattern, so that the quantified group can match the same text as the subpattern before the outer group, and the outer group is not at the end of the pattern.
Your regex causes the issue precisely because (\.?\w+)* is not at the end and contains an optional \.?, so on input without dots the expression is effectively reduced to \w(\w+)*#.
For example, aaa.aaaaaa.a.aa.aa but not aaa..aaaa.a
What you need is
^\w+(?:\.\w+)*#
See the regex demo
^ - start of string (to avoid partial matches)
\w+ - 1 or more word chars
(?:\.\w+)* - zero or more sequences of:
\. - a literal dot
\w+ - 1 or more word chars
# - a literal # char.
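As a rough illustration, here is the rewritten pattern tried out in Python. The helper name is_prefix and the test strings with a trailing # are my own assumptions, not part of the answer; note also that \w in Python accepts underscores and non-ASCII letters.

```python
import re

# The rewritten pattern from the answer: each dot is a mandatory separator,
# so a run of word characters cannot be split between iterations.
PREFIX = re.compile(r'^\w+(?:\.\w+)*#')

def is_prefix(s: str) -> bool:
    """Return True if s starts with dot-separated word chars followed by '#'."""
    return PREFIX.match(s) is not None

print(is_prefix('aaa.aaaaaa.a.aa.aa#'))  # single dots between word chars
print(is_prefix('aaa..aaaa.a#'))         # two adjacent dots
print(is_prefix('a' * 40))               # no '#': fails without blowing up
```

The last call is the interesting one: there is no # to find, and the failure is reported after a linear number of attempts.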

The problem
"Catastrophic backtracking" occurs when a part of the string could match a part of the regex in many different ways, so the engine needs to retry repeatedly to determine whether or not the string actually matches. A simple case: the regex a+a+b, which matches two or more a followed by one b. If you were to run that on aaaaaaaaaaa, the problem arises: first, the first a+ matches everything, and the match fails at the second a+. Then the engine tries with the first a+ matching all but one a and the second a+ matching one a (this is "backtracking"), and then it fails on the b. But regex engines aren't "smart" enough to know that they could stop there, so they have to keep going in that pattern until they've tried every way of giving some a's to the first a+ and some to the second. Some regex engines will realize they're getting stuck like this and quit after enough steps, with the error you saw.
For your specific pattern: what you have there matches any nonzero quantity of letters or digits, mixed with any quantity of . where the . cannot be first, followed by an #. The only additional limit is that there can't be two adjacent dots. Effectively, this is the same case as my example: The * applied to a section containing a + acts like multiple duplicates of that +-ed section.
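To get a feel for how fast this grows, the sketch below counts in how many ways a run of n word characters (no dots) can be divided among the iterations of (\.?\w+)*. It is only a counting model with a made-up function name, not a trace of any real engine, which may prune some of these attempts.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    # Number of ways to cut a run of n word characters into consecutive
    # non-empty pieces, one piece per iteration of the starred group.
    if n == 0:
        return 1  # nothing left to cover: exactly one way (stop repeating)
    # The next iteration may take 1..n characters; the rest is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))

for n in (5, 10, 20):
    print(n, count_splits(n))
```

The count doubles with every extra character (16, 512, 524288 for the three lengths above), which is why a failing match on a moderately long input can take so many steps.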
Atomic grouping
You could try something with atomic grouping. That basically says "once you've found any match for this, don't backtrack into it". After all, if you've found some amount of \w, it's not going to contain a \. and there's no need to keep rechecking that: dots are not letters or digits, and neither of those is a #.
In this case, the result would be \w(?>\.?\w+)*#. Note that not all regex engines support atomic grouping, though the one you linked does. If the string is only a match, nothing will change; if it's not a match, or contains non-matches, the process will take fewer steps. Using @eddiem's example from the comments, it finds two matches in 166311 steps with your original, but only takes 623 steps with atomic grouping added.
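If your engine lacks atomic groups, a common workaround is a lookahead plus a backreference, since lookaheads are not backtracked into once they succeed. The sketch below uses that emulation in Python (which only gained native (?>...) support in 3.11); the variable name and the sample inputs are assumptions for illustration.

```python
import re

# Emulated atomic group: the lookahead captures \.?\w+ greedily, and the
# backreference \1 then consumes exactly that text, with no alternatives
# left for the engine to retry.
ATOMIC_LIKE = re.compile(r'\w(?:(?=(\.?\w+))\1)*#')

print(ATOMIC_LIKE.search('abc.def#').group(0))  # a matching input
print(ATOMIC_LIKE.search('a' * 30))             # no '#': fails quickly
```

The emulation costs one extra capture group, so any group numbers after it shift by one.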
Possessive quantifiers
Another option would be a possessive quantifier: \w(\.?\w+)*+# means roughly the same thing. *+, specifically, is "whatever the star matches, don't backtrack inside it". In the above case, it matches in 558 steps, but it has a slightly different meaning, in that it treats all the repeats together as one atomic value, instead of as several distinct atomic values. I don't think there's a difference in this case, but there might be in some cases. Again, not supported by all regex engines.

Related

Why are sequential regular expressions more efficient than a combined expression?

In answering a Splunk question on SO, the following sample text was given:
msg: abc.asia - [2021-08-23T00:27:08.152+0000] "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC&filters=%257B%2522stringMatchFilters%2522:%255B%257B%2522key%2522:%2522BFEESCE((json_data-%253E%253E'isNotSearchable')::boolean,%2520false)%2522,%2522value%2522:%2522false%2522,%2522operator%2522:%2522EQ%2522%257D%255D,%2522multiStringMatchFilters%2522:%255B%257B%2522key%2522:%2522json_data-%253E%253E'id'%2522,%2522values%2522:%255B%25224970111%2522%255D%257D%255D,%2522containmentFilters%2522:%255B%255D,%2522nestedMultiStringMatchFilter%2522:%255B%255D,%2522nestedStringMatchFilters%2522:%255B%255D%257D&sorts=%257B%2522sortOrders%2522:%255B%257B%2522key%2522:%2522id%2522,%2522order%2522:%2522DESC%2522%257D%255D%257D&pagination=null
The person wanted to extract everything in the "filters" portion of the URL if "factType" was "COMMERCIAL"
The following all-in-one regex pulls it out neatly (presuming the URL is always in the right order, i.e. factType coming before filters):
factType=(?<facttype>\w+).+filters=(?<filters>[^\&]+)
According to regex101, it finds its expected matches with 670 steps
But if I break it up to
factType=(?<facttype>\w+)
followed by
filters=(?<filters>[^\&]+)
regex101 reports the matches being found with 26 and 16 steps, respectively
What about breaking up the regex into two makes it so much more (~15x) efficient to match?
The main problem with the regexp is the presence of the .+, where . eats (nearly) anything and + is greedy by default. Quantifiers come in two kinds: greedy and lazy. A greedy quantifier first consumes as many characters as it can and then backtracks until the rest of the pattern matches, while a lazy one consumes as few characters as possible and only takes one more when the following pattern does not match. In most engines quantifiers are greedy unless you say otherwise. Fortunately, you can ask for a lazy quantifier with .+?. This means the engine will search for the shortest possible match for .+ instead of the longest one. This is what people usually do when writing a manual search. The result is 65 steps instead of 670 steps (10x better).
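Besides the step count, the two quantifiers can also pick different matches. A small sketch in Python, using a made-up query string with two filters= parameters (an assumption, the question's URL has only one) and unnamed groups for brevity:

```python
import re

# Hypothetical query string with two "filters=" parameters, to make the
# difference between the greedy and the lazy quantifier visible.
query = 'factType=A&filters=one&x=1&filters=two'

greedy = re.search(r'factType=(\w+).+filters=([^&]+)', query)
lazy = re.search(r'factType=(\w+).+?filters=([^&]+)', query)

print(greedy.group(2))  # .+ runs to the end, then backs up to the last filters=
print(lazy.group(2))    # .+? stops at the first filters= it reaches
```

So switching to .+? is only a pure optimisation when the parameter appears once, as in the sample log line.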
Note that lazy quantifiers do not always help in such a case. It is often better to make the regexp more precise (i.e. deterministic) to improve performance, by reducing the number of possible backtracks caused by taking a wrong path in the non-deterministic automaton.
Still, note that regexp engines are generally not very optimized compared to manual searches (as long as you use efficient search algorithms). They are great to make the code short, flexible and maintainable. For high-performance, a basic loop in native languages is often better. This is especially true if it is vectorized using SIMD instructions (which is generally not easy).
Here is a regex that would be inherently more efficient than .+ or .+? irrespective of the positions of those matches in input text.
factType=(?<facttype>\w+)(?:&(?!filters=)[^&\s]*)*&filters=(?<filters>[^\&]+)
This regex may look a bit longer but it will be more efficient because we are using a negative lookahead (?!filters=) after matching & to stop the match just before the filters query parameter.
Q. What is backtracking?
A. In simple words: If a match isn't complete, the engine will backtrack the string to try to find a whole match until it succeeds or fails. In the above example if you use .+ it matches longest possible match till the end of input then starts backtracking one position backward at a time to find the whole match of second pattern. When you use .+? it just does lazy match and moves forward one position at a time to get the full match.
This suggested approach is far more efficient than .* or .+ or .+? approaches because it avoids expensive backtracking while trying to find match of the second pattern.
RegEx Details:
factType=: Match factType=
(?<facttype>\w+): Match 1+ word characters and capture in named group facttype
(?:: Start non-capture group
&: Match a &
(?!filters=): Stop matching when we have filters= at next position
[^&\s]*: Match 0 or more of non-space non-& chars
)*: End non-capture group. Repeat this group 0 or more times
&: Match a &
filters=: Match filters=
(?<filters>[^\&]+): Match 1 or more non-& chars and capture in named group filters
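For anyone trying this outside Splunk, here is a sketch of the same pattern in Python. Python spells named groups (?P<name>...), so that part is adapted, and the log line is a shortened, made-up stand-in for the one in the question.

```python
import re

# Same pattern as above, with the named groups written in Python's
# (?P<name>...) spelling instead of (?<name>...).
PATTERN = re.compile(
    r'factType=(?P<facttype>\w+)'
    r'(?:&(?!filters=)[^&\s]*)*'   # skip whole parameters that are not filters=
    r'&filters=(?P<filters>[^&]+)'
)

# Hypothetical, shortened stand-in for the log line in the question.
line = ('msg: abc.asia "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN'
        '&sourceOwner=ABC&filters=%257B%2522key%2522%257D&sorts=x&pagination=null')

m = PATTERN.search(line)
print(m.group('facttype'))
print(m.group('filters'))
```

Because the middle group consumes one complete parameter per iteration, there is only one way to walk from factType to filters, which is what keeps backtracking low.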

Recurse subpattern doesn't seem to work with alternation

I want to match strings with numbers separated by commas. In a nutshell I want to match at most 8 numbers in range 1-16. So that string 1,2,3,4,5,6,7,8 is OK and 1,2,3,4,5,6,7,8,9 is not since it has 9 numbers. Also 16 is OK but 17 is not since 17 is not in range.
I tried to use this regex ^(?:(?:[1-9]|1[0-6]),){0,7}(?:[1-9]|1[0-6])$
and it worked fine. I use alternation to match numbers from 1-16, then I use 0..7 repetitions with a comma at the end and then the same without a comma at the end. But I do not like the repetition of the subpattern, so I tried (?1) to recurse the first capturing group. My regex looks like ^(?:([1-9]|1[0-6]),){0,7}(?1)$. However, this does not produce a match when the last number has two digits (10-16). It does match 1,1, but not 1,10. I do not understand why.
I created an example of the problem.
https://regex101.com/r/VkuPqP/1
In the debugger I see that the engine does not try the second alternative from the group when the pattern recurses. I expected it to work. Where is the problem?
That happens because the regex subroutines in PCRE are atomic.
The regex you have can be re-written as ^(?:([1-9]|1[0-6]),){0,7}(?>[1-9]|1[0-6])$, see its demo. (?>...|...) will not allow backtracking into this group, so if the first branch "wins" (as in your example), the subsequent ones will not be tried upon the next subpattern failure (here, $ fails to match the end of string after matching 1 - it is followed with 0).
You may swap the alternatives in this case, the longer should come first:
^(?:(1[0-6]|[1-9]),){0,7}(?1)$
See the regex demo.
In general, the best practice is that each alternative in a group should match at a different location inside a string. They should not match at the same locations.
If you can't rewrite an alternation group so that each alternative matches at unique locations in the string, you should repeat the group without using a regex subroutine.
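In engines without subroutines (Python's re module, for instance), the repetition can be avoided at the source level instead, by building the pattern from a single definition. A sketch, with NUM, LIST_RE and is_valid being names I made up:

```python
import re

# One definition of "a number from 1 to 16"; the two-digit branch comes first.
NUM = r'(?:1[0-6]|[1-9])'
# Up to 7 "number," prefixes followed by a final number: at most 8 numbers.
LIST_RE = re.compile(rf'^(?:{NUM},){{0,7}}{NUM}$')

def is_valid(s: str) -> bool:
    """Return True for 1 to 8 comma-separated numbers, each in the range 1-16."""
    return LIST_RE.match(s) is not None

for sample in ('1,10', '1,2,3,4,5,6,7,8', '1,2,3,4,5,6,7,8,9', '16', '17'):
    print(sample, is_valid(sample))
```

Since the group is pasted in as ordinary text, it is not atomic, so the order of the alternatives would not matter here; putting the longer one first just keeps it in line with the subroutine version.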

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars, as many as possible. (?s) and (?s:.*) can be used with regex engines that support these constructs so that . matches any char, including line breaks.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
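The general scheme can be wrapped in a small helper. This is only a sketch: last_occurrence is a hypothetical name, and it assumes the pattern contains no numbered backreferences, since it is pasted into the larger expression twice.

```python
import re

def last_occurrence(pattern: str, text: str):
    """Find the last match of pattern via pattern(?![\\s\\S]*pattern)."""
    # Wrap the pattern in (?:...) so alternations inside it stay contained.
    scheme = rf'(?:{pattern})(?![\s\S]*(?:{pattern}))'
    return re.search(scheme, text)

text = 'as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM'

m = last_occurrence(r'\s+', text)
print(repr(text[m.end():]))                              # what follows the last whitespace run
print(last_occurrence(r'\d+', 'a1 b22 c333').group(0))   # last number in a string
```

For the whitespace case the shorter \s+(?!\S*\s) from the answer finds the same span; the helper is just the unspecialised form.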
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
As we aren't interested in the remainder, we match it and don't use it any further. We could capture it with parentheses and use it as another group, but that's out of scope here.
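The two replacements above can be reproduced roughly like this in Python, assuming re.sub behaves like the replace tool used in the answer (the variable names are mine):

```python
import re

haystack = ('as.asd.sd ffindMyLastOccurrencedsfs. '
            'dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM')

# Single needle: a run of whitespace that follows a non-whitespace char.
simple = r'.*(?=((?<=\S)\s+)).*'
# Several needles: the word, whitespace runs, or runs of dots.
general = r'.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*'

# The greedy leading .* backs off from the end of the string until the
# lookahead sees a needle, so group 1 holds the last needle.
print(re.sub(simple, r'>\1<', haystack))
print(re.sub(general, r'>\1<', haystack))
```

This prints > < and >..<, the same results as shown above.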
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often this is a file extension (e.g. .c, .ts, or .json) or a top-level domain (e.g. .com, .org, or .io), but it could be something as arbitrary as McDonald's Mulan Szechuan Sauce. The point is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. It's a variance because the amount of white-space that needs to be matched varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often when working with regular expressions it's necessary to work in reverse.
We will start at the end of the problem described above and work backwards, so we are going to start at The Before Variance (or #3).
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespace with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAScript regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine not to be concerned with the case of the letters. In other words, the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as /\s+(?=\.com)/gi and /\s+(?=\.COM)/gi. Every one of the regular expressions just listed is equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though only one match was returned in the end, and it was the correct match, when implemented in a program or website that's running in production, you can pretty much guarantee that the regex is not only going to fail, but it's going to fail horribly and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$
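As a quick cross-check, three of the patterns suggested in this thread can be run side by side on the question's string. This is a Python sketch; the short variable names are arbitrary.

```python
import re

text = 'as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM'

a = re.search(r'\s+(?!\S*\s)', text)        # negative lookahead: no later whitespace
b = re.search(r'(\s+)(?=\.[^.]+$)', text)   # positive lookahead up to the end of the string
c = re.search(r'(\s+)\.[^\.]*$', text)      # consume the tail, keep the whitespace in group 1

# Each span is (start, end) of the whitespace that was found.
print(a.span(), b.span(1), c.span(1))
```

On this input all three land on the same single space before .COM. They would start to differ on other inputs, for example when whitespace follows the last dot.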

Regular Expression pattern explanation

I don't know what this regular expression is doing
(?>[^\,]*\,){3}([^\,]*)[\']?
(?>[^\,]*\,){4}([^\,]*)[\']?
Could anyone explain it to me in more detail?
There is an awesome site http://regex101.com for these needs! It describes regulars and allows you to test and debug them.
Your patterns match 4 (5 for the second one) values separated by commas and return the last one as a single matching group:
(?>...) are atomic groups. Once an atomic group has matched, the engine never backtracks into it.
[^\,] matches any character except a comma
[^\,]*\, means any number (even zero) of non-comma characters, and then a single comma
(?>[^\,]*\,){3} means repeat the above 3 times
([^\,]*)[\']? means one more word without commas as a group and possibly one apostrophe after it.
For example, in 1,,333,4,5 the first one will match 1,,333,4 and return 4 as the matched group. The second one will find 1,,333,4,5 and 5 as the group.
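Here is that example run in Python, as a sketch. Plain non-capturing groups (?:...) stand in for the atomic (?>...) groups, which older Python versions do not support; for these two patterns that should only affect speed on failing input, not what is matched.

```python
import re

row = '1,,333,4,5'

# Plain non-capturing groups stand in for the atomic (?>...) groups here.
fourth = re.search(r"(?:[^,]*,){3}([^,]*)'?", row)
fifth = re.search(r"(?:[^,]*,){4}([^,]*)'?", row)

print(fourth.group(0), fourth.group(1))  # whole match, then the 4th value
print(fifth.group(0), fifth.group(1))    # whole match, then the 5th value
```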
Edit: Even more description.
Regular expressions have groups. These are parts of regular expressions that can have quantifiers (how many times to repeat them, e.g. {3}) and some options. Also, after the regular expression has matched, we can find out what every group has matched.
Atomic groups, in short, take as much as they can going forward and never go back. Also, they are not capturing, so what they matched can't be inspected as described above. They are used here only for performance reasons.
So, we need to take as a group the 4th word from comma-separated values. We will do it like this:
We will take 3 times ({3}) an atomic group ((?>...)):
Which takes a word -- any number of characters (*) of any non-comma character ([^\,])
[^...] means any symbol except described ones.
And a comma (\,) that separates that word from the next one
Now, our wanted word starts. We open a group ((...))
That will take a word as described above: [^\,]*
There is possibly an apostrophe after it; take it if there is one ([\']?)
? means 0 or 1 of the preceding item, here a single apostrophe.
So, it starts on first word in first atomic group, takes it all, then takes a comma. After that, it is repeated 2 times more. That takes 3 first words with their commas.
After that, one non-atomic group takes the 4th word.

Why does this regex using *+ (possessive) not match

I need to get the last match of [0-9.]* in a string like
one 1.234 three
some text 1.2321 xyz 1 5 1.234 and more text
some other text
but also need the text around it - even when there is no number like in the 3rd line
I wanted to use ^(.*)([0-9\.]*+)(.*)$ but it just matches the first (.*).
On the other hand ^(.*?)([0-9\.]*+)(.*?)$ just matches the last (.*?).
Why is that? I thought that it will try to satisfy all rules?
I know that I can exclude 0-9. from the last .* to get what I want, but I want to understand why the above isn't working although I used *+
A possessive quantifier doesn't guarantee the longest possible match, it just prevents backtracking. Neither of your regexes ever tries to backtrack, so the possessive quantifier has no effect.
With the first regex, the first (.*) consumes the whole string, then ([0-9.]*+) and the second (.*) each consume nothing because there's nothing left to match.
With the second regex, the first (.*?) initially consumes nothing because it's reluctant. Then ([0-9.]*+) successfully matches some more nothing because it's still at the beginning of the string, which doesn't happen to start with a digit or a period. Finally, the last (.*?) is forced to consume what's left (the whole string) despite being reluctant, because of the anchor ($) following it.
To solve your problem, we need to know more about the kind of input you can expect. For example, if you know there will never be any digits or periods after the number you're looking for, you could use this:
^(.*?)(?:([0-9.]+)([^0-9.]*))?$
The key here is that the second capturing group, ([0-9.]+), uses a + instead of a *. If there are no digits or periods in the string, the enclosing group, (?:([0-9.]+)([^0-9.]*))?, will match nothing, and the initial (.*?) will be forced to consume the whole string. (The second and third groups will be empty.)
If there's more than one sequence of digits or periods in the string, the second group is guaranteed to match the last of them, because the third group, ([^0-9.]*), allows anything but those characters in the remainder of the string.
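The behaviour described above can be checked with a short Python sketch. Plain * is used in place of *+ (older Python versions lack possessive quantifiers), which by the argument above should not change the outcome for the first two patterns. One Python-specific detail: groups that take no part in the match come back as None rather than as empty strings.

```python
import re

lines = [
    'one 1.234 three',
    'some text 1.2321 xyz 1 5 1.234 and more text',
    'some other text',
]

# Plain * instead of *+: nothing ever backtracks into the middle group here.
greedy = re.compile(r'^(.*)([0-9.]*)(.*)$')
lazy = re.compile(r'^(.*?)([0-9.]*)(.*?)$')
fixed = re.compile(r'^(.*?)(?:([0-9.]+)([^0-9.]*))?$')

print(greedy.match(lines[0]).groups())  # first group takes the whole line
print(lazy.match(lines[0]).groups())    # last group takes the whole line
for line in lines:
    print(fixed.match(line).groups())   # text before, last number, text after
```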
This is pretty weak, but it's the best I can do with the information you've supplied. The point is, possessive quantifiers are brilliant when you can use them, but that doesn't happen nearly as often as you might expect.