Why are sequential regular expressions more efficient than a combined expression? - regex

In answering a Splunk question on SO, the following sample text was given:
msg: abc.asia - [2021-08-23T00:27:08.152+0000] "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC&filters=%257B%2522stringMatchFilters%2522:%255B%257B%2522key%2522:%2522BFEESCE((json_data-%253E%253E'isNotSearchable')::boolean,%2520false)%2522,%2522value%2522:%2522false%2522,%2522operator%2522:%2522EQ%2522%257D%255D,%2522multiStringMatchFilters%2522:%255B%257B%2522key%2522:%2522json_data-%253E%253E'id'%2522,%2522values%2522:%255B%25224970111%2522%255D%257D%255D,%2522containmentFilters%2522:%255B%255D,%2522nestedMultiStringMatchFilter%2522:%255B%255D,%2522nestedStringMatchFilters%2522:%255B%255D%257D&sorts=%257B%2522sortOrders%2522:%255B%257B%2522key%2522:%2522id%2522,%2522order%2522:%2522DESC%2522%257D%255D%257D&pagination=null
The person wanted to extract everything in the "filters" portion of the URL if "factType" was "COMMERCIAL".
The following all-in-one regex pulls it out neatly (presuming the URL is always in the right order, i.e. factType coming before filters):
factType=(?<facttype>\w+).+filters=(?<filters>[^\&]+)
According to regex101, it finds its expected matches with 670 steps
But if I break it up to
factType=(?<facttype>\w+)
followed by
filters=(?<filters>[^\&]+)
regex101 reports the matches being found with 26 and 16 steps, respectively
What about breaking up the regex into two makes it so much more (~15x) efficient to match?

The main problem with the regexp is the .+, where . matches (nearly) any character and + is greedy by default. Quantifiers come in two kinds: greedy and lazy. A greedy quantifier first consumes as many characters as it can and then backtracks, giving back one character at a time, until the rest of the pattern matches. A lazy quantifier consumes as few characters as it can and takes one more only when the rest of the pattern fails to match. Most engines (PCRE, Java, Python, JavaScript, .NET) make quantifiers greedy by default. Fortunately, you can ask for a lazy quantifier with .+?. The engine then looks for the shortest possible match for .+ instead of the longest one. This is what people usually do when writing a manual search. The result is 65 steps instead of 670 steps (10x better).
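As a rough illustration in Python (whose re module spells named groups (?P<name>...) and cannot report step counts the way regex101 does), the sketch below runs the greedy pattern, the lazy pattern, and the two separate patterns against a shortened, made-up version of the log line. The sample string and the variable names are assumptions for illustration only. All three approaches capture the same values here, so the difference lies purely in how much work the engine does to get there.

```python
import re

# Shortened, hypothetical stand-in for the log line in the question.
line = ('msg: abc.asia - "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN'
        '&sourceOwner=ABC&filters=%257Bkey%257D&sorts=id&pagination=null')

# Combined pattern, greedy: .+ runs to the end of the line and backtracks.
greedy = re.compile(r"factType=(?P<facttype>\w+).+filters=(?P<filters>[^&]+)")
# Combined pattern, lazy: .+? grows one character at a time.
lazy = re.compile(r"factType=(?P<facttype>\w+).+?filters=(?P<filters>[^&]+)")
# Two independent patterns: no quantified wildcard between them at all.
fact_only = re.compile(r"factType=(?P<facttype>\w+)")
filters_only = re.compile(r"filters=(?P<filters>[^&]+)")

greedy_result = greedy.search(line).groupdict()
lazy_result = lazy.search(line).groupdict()
sequential_result = {
    "facttype": fact_only.search(line).group("facttype"),
    "filters": filters_only.search(line).group("filters"),
}

print(greedy_result)
print(lazy_result)
print(sequential_result)
```

Note that the greedy and lazy versions would disagree on an input containing filters= more than once: the greedy one ends up at the last occurrence, the lazy one at the first.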
Note that lazy quantifiers do not always help in such a case. It is often better to make the regexp more precise (i.e. deterministic) to improve performance, by reducing the number of backtracks caused by wrong paths taken in the non-deterministic automaton.
Still, note that regexp engines are generally not as fast as manual searches (as long as you use efficient search algorithms). They are great for keeping the code short, flexible and maintainable. For high performance, a basic loop in a native language is often better. This is especially true if it is vectorized using SIMD instructions (which is generally not easy).
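To make the "manual search" idea concrete, here is a minimal sketch in Python using plain substring searches instead of a regex. The function name extract_fact_and_filters and the sample line are hypothetical, and the sketch mirrors the lazy version of the pattern (it stops at the first filters= after factType=). It is not a performance claim about Python itself, where the re module is implemented in C and may well be faster than a Python-level loop.

```python
def extract_fact_and_filters(line):
    """Return (facttype, filters) or None, using substring searches only."""
    fact_key = "factType="
    start = line.find(fact_key)
    if start == -1:
        return None
    start += len(fact_key)

    # Consume word characters, like \w+ in the regex.
    end = start
    while end < len(line) and (line[end].isalnum() or line[end] == "_"):
        end += 1
    if end == start:
        return None
    facttype = line[start:end]

    # Jump straight to the next "filters=" instead of scanning with a wildcard.
    filters_key = "filters="
    pos = line.find(filters_key, end)
    if pos == -1:
        return None
    pos += len(filters_key)

    # The value runs up to the next "&" (or the end of the line), like [^&]+.
    stop = line.find("&", pos)
    if stop == -1:
        stop = len(line)
    if stop == pos:
        return None
    return facttype, line[pos:stop]


# Hypothetical shortened log line.
sample = "GET /facts?factType=COMMERCIAL&sourceOwner=ABC&filters=%257Bkey%257D&sorts=id"
print(extract_fact_and_filters(sample))
```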

Here is a regex that would be inherently more efficient than .+ or .+? irrespective of the positions of those matches in input text.
factType=(?<facttype>\w+)(?:&(?!filters=)[^&\s]*)*&filters=(?<filters>[^\&]+)
RegEx Demo 1
RegEx Demo 2
This regex may look a bit longer, but it is more efficient because it uses a negative lookahead (?!filters=) after matching & to stop the match just before the filters query parameter.
Q. What is backtracking?
A. In simple words: if a match isn't complete, the engine backtracks through the string to try other ways of completing the match, until it succeeds or fails. In the above example, .+ first matches the longest possible run, up to the end of the input, and then backtracks one position at a time until the second pattern matches. With .+? the engine matches lazily and moves forward one position at a time until the second pattern matches.
This suggested approach is far more efficient than .* or .+ or .+? approaches because it avoids expensive backtracking while trying to find match of the second pattern.
RegEx Details:
factType=: Match factType=
(?<facttype>\w+): Match 1+ word characters and capture in named group facttype
(?:: Start non-capture group
&: Match a &
(?!filters=): Stop matching when we have filters= at next position
[^&\s]*: Match 0 or more of non-space non-& chars
)*: End non-capture group. Repeat this group 0 or more times
&: Match a &
filters=: Match filters=
(?<filters>[^\&]+): Match 1 or more non-& chars and capture in named group filters
Related article on catastrophic backtracking
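For readers who want to try the lookahead-based pattern outside regex101, here is a small Python sketch. Python's re module writes named groups as (?P<name>...), so the pattern is adapted accordingly; the sample strings are made up for illustration.

```python
import re

# Same structure as the regex above, with Python's named-group syntax.
pattern = re.compile(
    r"factType=(?P<facttype>\w+)"   # factType= and its value
    r"(?:&(?!filters=)[^&\s]*)*"    # any parameters that are not filters=
    r"&filters=(?P<filters>[^&]+)"  # the filters parameter itself
)

# Hypothetical shortened query strings.
in_order = "factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC&filters=%257Bkey%257D&sorts=id"
wrong_order = "filters=%257Bkey%257D&factType=COMMERCIAL"

match = pattern.search(in_order)
print(match.group("facttype"), match.group("filters"))
print(pattern.search(wrong_order))
```

Each repetition of the non-capturing group consumes exactly one query parameter, so there is only one way for the engine to walk from factType= to &filters=, and very little to backtrack over.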

Related

Can this be parsed by regular expression [duplicate]

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 9 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end is nice and clean, BUT it only captures the last match.
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Has anyone managed to make it work as a single regex-only solution without programming?
I'm mostly interested in how this can be done in PCRE, but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With @nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and, as it doesn't support \G, I'm wondering if there's another way, especially without using the /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to the problem. The author wrote 3-4 variations, each with subtle bugs that are tricky to detect, before reaching the current solution.
For your specific example, there are better, more maintainable solutions, such as matching and then splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind it is reusable for similar cases.
Summary
.NET supports capturing a repeating pattern with the CaptureCollection class.
For languages that support \G and look-behind, we may be able to construct a regex that works with a global matching function. It is not easy to get it completely correct, and it is easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports the \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). The Java, Perl, PCRE, .NET, and Ruby regex flavors support all of these features.
However, you can go with your regex in .NET, since .NET supports capturing all instances matched by a repeated capturing group via the CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about, since there are fewer alternations.
The \G match boundary is especially useful for tokenization, where you need to make sure the engine does not skip ahead and match stuff that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetitions of the form -\w+. This is in contrast to the first construction, where we match start:\w+- first, and then the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex work for matching in the middle of a string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevents this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
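Python's built-in re module has no \G, but a compiled pattern's match(string, pos) method only matches starting exactly at pos, which plays the same role. The sketch below follows the second construction under that assumption: one pattern finds start: with the look-ahead check and captures the first word, and a second pattern is applied repeatedly at the end of the previous match. The names HEAD, NEXT and tokens are hypothetical, and the sketch only handles the first start:...:end block in the text.

```python
import re

# "start:" plus the look-ahead that enforces 3 to 10 words before ":end",
# followed by the first word.
HEAD = re.compile(r"start:(?=\w+(?:-\w+){2,9}:end)(\w+)")
# One further "-word" step; always applied at an explicit position.
NEXT = re.compile(r"-(\w+)")


def tokens(text):
    """Return the words of the first valid start:...:end block, or []."""
    first = HEAD.search(text)
    if first is None:
        return []
    words = [first.group(1)]
    pos = first.end()
    while True:
        # match(text, pos) cannot skip ahead, which is what \G guarantees.
        step = NEXT.match(text, pos)
        if step is None:
            break
        words.append(step.group(1))
        pos = step.end()
    return words


sample = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print(tokens(sample))
print(tokens("start:a-b:end"))  # too few words, so the look-ahead rejects it
```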
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more iterations to fulfill a certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex first, which converts the matching type problem back into a validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
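The same match-then-split idea can be sketched in Python, with the 3 to 10 item limit from the question checked in ordinary code rather than in the regex. The function name parse_items and its lo/hi parameters are hypothetical.

```python
import re


def parse_items(text, lo=3, hi=10):
    """Validate the outer start:...:end frame, then split the inner part."""
    outer = re.fullmatch(r"start:([\w-]+):end", text)
    if outer is None:
        return None
    parts = outer.group(1).split("-")
    # Reject empty items (e.g. "a--b") and counts outside the allowed range.
    if not all(parts) or not (lo <= len(parts) <= hi):
        return None
    return parts


sample = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print(parse_items(sample))
print(parse_items("start:a-b:end"))
```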
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print(re.findall(r"(\b\w+?\b)(?:-|:end)", s))
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repetition of a single capture group overwrites the previous capture, so only the last capture of the capture group is returned.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which point 1 is not necessarily true. In that case the regex you have indicated, start:(?:(\w+)-?){3,10}:end, will do the job (source).

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?
In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.
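A short Python sketch of both forms, using the string from the question. The helper last_occurrence is a hypothetical name; it simply wraps a given pattern in the general pattern(?![\s\S]*pattern) scheme.

```python
import re

text = "as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"

# Specific form: whitespace not followed by any later whitespace.
specific = re.search(r"\s+(?!\S*\s)", text)


def last_occurrence(pattern, string):
    """Apply the general scheme: pattern(?![\\s\\S]*pattern)."""
    scheme = re.compile(rf"(?:{pattern})(?![\s\S]*(?:{pattern}))")
    return scheme.search(string)


general = last_occurrence(r"\s+", text)

# Both matches end right before ".COM".
print(repr(text[specific.end():]))
print(repr(text[general.end():]))
print(last_occurrence(r"\d+", "a1b22c333").group())
```

The general scheme rescans the rest of the string at every candidate position, so on long inputs iterating over all matches and keeping the last one may be the cheaper option.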
You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensures that everything consumed so far is followed by one of the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=NotNeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we match it and don't use it any further. We could capture it with parentheses and use it as another group, but that's out of scope here
SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often this is a file extension (e.g. .c, .ts, or .json), or a top level domain (e.g. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. It's a variance because the amount of white-space that needs to be matched varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often, when working with regular expressions, it's necessary to work in reverse.
We will start at the end of the problem described above and work backwards, so we are going to start at The Before Variance (or #3).
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespace with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAScript regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the case-insensitive flag i to tell the engine to ignore letter case. In other words, the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as /\s+(?=\.com)/gi and /\s+(?=\.COM)/gi. Every one of the regular expressions just listed is equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) that there are 3 lines in the string. Those lines mark places where the look-ahead assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Although only one match was returned in the end, and it was the correct one, once the regex is running in a production program or website you can pretty much guarantee that it is not only going to fail, but fail horribly, and you will come to hate it.
You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

Email-similar regex catastrophic backtracking

I'd like to match something which may be called the beginning of the e-mail, ie.
1 character (whichever letter from alphabet and digits)
0 or 1 dot
1 or more character
The repetition of {2nd and 3rd point} zero or more times
# character
The regex I've been trying to apply on Regex101 is \w(\.?\w+)*#.
I am getting the error Catastrophic backtracking. What am I doing wrong? Is the regex correct?
Catastrophic backtracking usually appears with nested quantifiers when the group inside contains at least one optional subpattern, so that the quantified group can match the same text as the subpattern before the outer group, and the outer group is not at the end of the pattern.
Your regex causes the issue precisely because (\.?\w+)* is not at the end and contains an optional \.?, so the expression effectively reduces to \w(\w+)*#.
For example, aaa.aaaaaa.a.aa.aa should match, but not aaa..aaaa.a
What you need is
^\w+(?:\.\w+)*#
See the regex demo
^ - start of string (to avoid partial matches)
\w+ - 1 or more word chars
(?:\.\w+)* - zero or more sequences of:
\. - a literal dot
\w+ - 1 or more word chars
# - a literal # char.
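A quick Python check of the corrected pattern, as a sketch: the test strings are made up (with a trailing # added to the two examples above), and the long non-matching input is only there to show that the rewritten pattern fails fast. The original nested pattern is deliberately not run on that input, since it could take an impractically long time.

```python
import re

# Corrected pattern: the dot inside the group is mandatory, so every
# character of the input can be consumed in only one way.
fixed = re.compile(r"^\w+(?:\.\w+)*#")

good = "aaa.aaaaaa.a.aa.aa#"
double_dot = "aaa..aaaa.a#"
# Long run of word characters with no "#" at all (hypothetical stress input).
no_hash = "a" * 5000 + "!"

print(fixed.search(good) is not None)
print(fixed.search(double_dot) is not None)
print(fixed.search(no_hash) is not None)
```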
The problem
"Catastrophic backtracking" occurs when a part of the string could match a part of the regex in many different ways, so the engine needs to retry repeatedly to determine whether or not the string actually matches. A simple case: the regex a+a+b, which matches two or more a followed by one b. If you were to run that on aaaaaaaaaaa, the problem arises: first, the first a+ matches everything, and the match fails at the second a+. Then the engine tries with the first a+ matching all but one a, and the second a+ matching one a (this is "backtracking"), and then it fails on the b. But regex engines aren't "smart" enough to know that they could stop there, so they have to keep going in that pattern until they have tried every split of giving some a's to the first a+ and some to the second. Some regex engines will realize they're getting stuck like this, and quit after enough steps, with the error you saw.
For your specific pattern: what you have there matches any nonzero quantity of letters or digits, mixed with any quantity of . where the . cannot be first, followed by an #. The only additional limit is that there can't be two adjacent dots. Effectively, this is the same case as my example: The * applied to a section containing a + acts like multiple duplicates of that +-ed section.
Atomic grouping
You could try something with atomic grouping. That basically says "once you've found any match for this, don't backtrack into it". After all, if you've found some amount of \w, it's not going to contain a \. and there's no need to keep rechecking that: dots are not letters or digits, and neither of those is a #.
In this case, the result would be \w(?>\.?\w+)*#. Note that not all regex engines support atomic grouping, though the one you linked does. If the string is only a match, nothing will change; if it's not a match, or contains non-matches, the process will take fewer steps. Using @eddiem's example from the comments, it finds two matches in 166311 steps with your original, but only takes 623 steps with atomic grouping added.
Possessive quantifiers
Another option would be a possessive quantifier: \w(\.?\w+)*+# means roughly the same thing. *+, specifically, is "whatever the star matches, don't backtrack inside it". In the above case, it matches in 558 steps, but it has a slightly different meaning, in that it treats all the repeats together as one atomic value, instead of as several distinct atomic values. I don't think there's a difference in this case, but there might be in some cases. Again, not supported by all regex engines.
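Python is one of the engines where support depends on the version: its re module accepts atomic groups and possessive quantifiers from Python 3.11 onward. The sketch below therefore guards on the interpreter version (the SUPPORTED flag, the helper name and the test strings are assumptions for illustration) and only checks that both variants still match a valid input and reject a long input with no #.

```python
import re
import sys

# Atomic groups (?>...) and possessive quantifiers (*+) need Python 3.11+.
SUPPORTED = sys.version_info >= (3, 11)


def build_patterns():
    """Compile both variants, or return None on older interpreters."""
    if not SUPPORTED:
        return None
    atomic = re.compile(r"\w(?>\.?\w+)*#")
    possessive = re.compile(r"\w(?:\.?\w+)*+#")
    return atomic, possessive


good = "aaa.aaaaaa.a.aa.aa#"
# No "#" anywhere: the un-guarded pattern would backtrack heavily here.
bad = "a" * 40 + "!"

results = []
patterns = build_patterns()
if patterns is not None:
    for pattern in patterns:
        results.append((pattern.search(good) is not None,
                        pattern.search(bad) is not None))
print(results)
```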

Regex to match html open and close tags(need some explanation)

I have difficulties in understanding some nuances in regular expressions. I am following the tutorial http://www.regular-expressions.info/backref.html and stuck on the example of matching open and close tag using backreferences.
We have string:
Testing <B><I>bold italic</I></B> text
and expression:
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
I can understand the whole logic, but can not get why engine backtracks to dot:
The engine has now arrived at the second < in the regex, and the
second < in the string. These match. The next token is /. This does
not match I, and the engine is forced to backtrack to the dot. The dot
matches the second < in the string. The star is still lazy, so the
engine again takes note of the available backtracking position and
advances to < and I. These do not match, so the engine again
backtracks.
Why does it backtrack to the dot? Is this because we have successfully matched the previous part of the regex and it always backtracks to the position of the previous successful match + 1?
And the second part I can not get completely. If we have a string:
Testing <BOO><I>bold italic</I></B> text
and expression without word boundary:
<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>
...and look inside the regex engine at the point where \1 fails the
first time. First, .*? continues to expand until it has reached the
end of the string, and </\1> has failed to match each time .*? matched
one more character.
Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo, but would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to give up one character.
Why does it backtrack into the capturing group and not to the dot as in the previous example? And I cannot see why [A-Z0-9]* is forced to give up one character. Is there some general rule for where the engine will backtrack?
NOTE: It is not about HTML parsing, it is a drill-down into how backtracking works using an HTML string example from http://regular-expression.info/backref.html.
The problem is that I just cannot understand why backtracking rolls back to a particular position in general.
The point is that a regular expression engine tries to find a match by all means. If there are options, different paths it may follow based on the current pattern, it will try them once it finds unmatching symbols on its way. See this backtracking introduction at rexegg.com:
Backtracking is a wonderful feature of modern regex engines: if a token fails to match, the engine backtracks to any position where it could have taken a different path. A greedy quantifier may then give up one character, a lazy quantifier may expand to match one more, or the rightmost side of an alternation may be tried. If a pattern continues to fail, the engine systematically explores all available paths.
So, backtracking may roll back to every construct or grouping that has a quantifier/alternation set to make sure all possible combinations are tried before a match failure is asserted. Your assumption that it always backtracks to the last matched symbol is not correct.
The only places where backtracking does not have access to are atomic groups, or groups that have possessive quantifiers. Also, the fact that a lookaround is zero-length automatically makes it atomic (see lookarounds).
In the first regex, \b marks a word boundary, and thus there can be no backtracking into the capturing group, as there is no word boundary other than the one already matched. When you remove it, backtracking can test all the preceding locations inside the capturing group.
To understand the importance of backtracking and \b, compare these regexes against the Testing <Boo><I>bold italic</I></Bo> text input (using case-insensitive matching):
<([A-Z][A-Z0-9]*)[^>]*>.*?<\/\1o> - the match is found as no word boundary is set and the engine backtracks into the capturing group freely, and the capturing group may contain B, Bo and Boo.
<([A-Z][A-Z0-9]*)\b[^>]*>.*?<\/\1o> - no match is found as Group 1 can only contain Boo.
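The comparison can be reproduced with a few lines of Python. This is a sketch that assumes case-insensitive matching (re.IGNORECASE), since the sample input uses lower-case letters in the tag names; the variable names are hypothetical.

```python
import re

text = "Testing <Boo><I>bold italic</I></Bo> text"

# Without \b the engine may shrink Group 1 from "Boo" to "Bo" to "B".
without_boundary = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1o>", re.IGNORECASE)
# With \b Group 1 must end at a word boundary, so it can only be "Boo".
with_boundary = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1o>", re.IGNORECASE)

loose = without_boundary.search(text)
strict = with_boundary.search(text)

print(loose.group(0), "| Group 1:", loose.group(1))
print(strict)
```

In the first case the match only succeeds after the engine has backtracked into the capturing group twice: Boo and Bo both fail against </Bo>, and B followed by the literal o finally fits.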

Collapse and Capture a Repeating Pattern in a Single Regex Expression

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing a repeating pattern with the CaptureCollection class.
For languages that support \G and look-behind, we may be able to construct a regex that works with a global matching function. It is hard to write it completely correctly and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
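Although the answer does not cover the chomping approach, a rough sketch of the idea might look like the following. It is written in Python as an illustration only: the function name extract_tokens and the pattern names FIRST and NEXT are made up for this example, and it handles only the validation case where the whole input is start:...:end.

```python
import re

# First token: anchored at the start, with a look-ahead checking that
# 3 to 10 words follow before ":end" (assumed limits from the question).
FIRST = re.compile(r"^start:(?=\w+(?:-\w+){2,9}:end$)(\w+)")
# Every later token: "^" on the chomped remainder plays the role of \G.
NEXT = re.compile(r"^-(\w+)")

def extract_tokens(text):
    """Return the list of tokens, or None if text is not well-formed."""
    m = FIRST.match(text)
    if m is None:
        return None
    tokens = [m.group(1)]
    rest = text[m.end():]      # chomp what was just matched
    while True:
        m = NEXT.match(rest)
        if m is None:
            return tokens      # stuck at ":end", so we are done
        tokens.append(m.group(1))
        rest = rest[m.end():]  # chomp again before the next attempt
```

Because every pattern is anchored with ^ and the matched part is cut off each time, the engine can never skip ahead, which is the guarantee \G would otherwise provide.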
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET, since .NET supports capturing all instances matched by a repeated capturing group via the CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). The correctness of this construction is simpler to reason about, since there are fewer alternations.
The \G match boundary is especially useful for tokenization, where you need to make sure the engine does not skip ahead and match stuff that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. This is in contrast to the first construction, where we match start:\w+- first, and then the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex work for matching in the middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevents this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
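The pitfalls above can be checked against an emulation of the second construction. Python's standard re module has no \G, so this sketch uses Pattern.match(text, pos), which only succeeds if the match begins exactly at pos. The names HEAD, TAIL and scan are hypothetical and not part of the original answer.

```python
import re

# Prefix plus first word; the look-ahead enforces 3 to 10 words before ":end".
HEAD = re.compile(r"start:(?=\w+(?:-\w+){2,9}:end)(\w+)")
# Delimiter plus next word; always applied with match(text, pos).
TAIL = re.compile(r"-(\w+)")

def scan(text):
    """Collect the words of every valid start:...:end run found in text."""
    runs, pos = [], 0
    while True:
        m = HEAD.search(text, pos)     # only "start:" may open a run
        if m is None:
            return runs
        words = [m.group(1)]
        pos = m.end()
        while True:
            m = TAIL.match(text, pos)  # anchored at pos, like \G
            if m is None:
                break                  # stuck at ":" of ":end"
            words.append(m.group(1))
            pos = m.end()
        runs.append(words)
```

In this sketch a continuation can only start where the previous match ended, so -junk after :end or a stray -x at the beginning of the input is never picked up.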
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loops to fulfill a certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert than a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex first, which converts a matching type problem back to a validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
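The same match-then-split idea can also enforce the 3 to 10 item limit from the question, which the snippets above do not check. A minimal Python sketch, assuming a hypothetical helper split_tokens whose lo and hi parameters are made up for this example:

```python
import re

def split_tokens(text, lo=3, hi=10):
    """Return the tokens between start: and :end, or None if invalid."""
    # \w+(?:-\w+)* rejects empty items such as "a--b", unlike [\w-]+.
    m = re.fullmatch(r"start:(\w+(?:-\w+)*):end", text)
    if m is None:
        return None
    tokens = m.group(1).split("-")
    # The count limit is checked in code instead of inside the regex.
    return tokens if lo <= len(tokens) <= hi else None
```

Keeping the count check outside the regex means the limits can change without touching the pattern.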
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print(re.findall(r"(\b\w+?\b)(?:-|:end)", s))
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repetition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).