Python Regex non greedy match - regex

This question comes from "Automate the boring stuff with python" book.
atRegex1 = re.compile(r'\w{1,2}at')
atRegex2 = re.compile(r'\w{1,2}?at')
atRegex1.findall('The cat in the hat sat on the flat mat.')
atRegex2.findall('The cat in the hat sat on the flat mat.')
I thought the question mark ? should make the match non-greedy, so \w{1,2}? should match only 1 character. But for both of these patterns, I get the same output:
['cat', 'hat', 'sat', 'flat', 'mat']
In the book,
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()
'HaHaHa'
Can anyone help me understand why there is a difference? Thanks!

The issue you're experiencing is due to the nature of backtracking in regex. The regex engine is parsing the string at each given position therein, and as such, will attempt every option of the pattern until it either matches or fails at that position. If it matches, it will consume those characters and if it fails it will continue to the next position until the end of the string is met.
The keyword here is backtracks. I think Microsoft's documentation does a great job of defining this term (I've bolded the important section):
Backtracking occurs when a regular expression pattern contains
optional quantifiers or alternation constructs, and the regular
expression engine returns to a previous saved state to continue its
search for a match. Backtracking is central to the power of regular
expressions; it makes it possible for expressions to be powerful and
flexible, and to match very complex patterns. At the same time, this
power comes at a cost. Backtracking is often the single most important
factor that affects the performance of the regular expression engine.
Fortunately, the developer has control over the behavior of the
regular expression engine and how it uses backtracking. This topic
explains how backtracking works and how it can be controlled.
The regex engine backtracks to a previous saved state. It cannot forward track to a future saved state, although that would be pretty neat! Since you've specified that your match should end with at (the lazy quantifier precedes it), it will exhaust every regex option until \w{1,2} ending in at proves true.
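A quick check in Python, using only the strings from the question, shows the effect. The lazy quantifier changes the order in which lengths are tried, not which match is found once at has to follow:

```python
import re

text = 'The cat in the hat sat on the flat mat.'

greedy = re.findall(r'\w{1,2}at', text)
lazy = re.findall(r'\w{1,2}?at', text)

# At 'flat', the lazy version tries one \w first ('f'), sees that 'la' is not
# 'at', and only then extends to two characters ('fl'), so it still finds 'flat'.
print(greedy)  # ['cat', 'hat', 'sat', 'flat', 'mat']
print(lazy)    # ['cat', 'hat', 'sat', 'flat', 'mat']

# With nothing required after the quantifier, the laziness becomes visible.
greedy_alone = re.search(r'\w{1,2}', 'flat').group()
lazy_alone = re.search(r'\w{1,2}?', 'flat').group()
print(greedy_alone)  # 'fl'
print(lazy_alone)    # 'f'
```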
How can you get around this? Well, the easiest way is probably to use a capture group:
\w*(\w{1,2}?at)
\w*(\w{1,2}at) # yields same results as above (but in more steps)
\w*(\wat) # yields same results as above (faster method)
\wat # yields same results as above (fastest method)
\b\w{1,2}at\b # perhaps this is what OP is after?
\w* Matches any word character any number of times. This is a fix to allow us to simulate forward tracking (this is not a proper term, just used in the context of the rest of my answer above). It will match as many characters as possible and work its way backwards until a match occurs.
The rest of the pattern is what the OP already had. In fact, \w{2} will never be met since \w will always only be met once (since the \w* token is greedy), therefore \wat can be used instead of \w*(\wat). Perhaps the OP intended to use anchors such as \b in the regex: \b\w{1,2}at\b? This doesn't differ from the original nature of the OP's regex either, since making the quantifier lazy would have theoretically yielded the same results in the context of forward tracking (one match of \w would have satisfied \w{1,2}?, thus \w{2} would never be reached).
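For reference, here is how those variants behave in Python on the sentence from the question. Note that re.findall returns the contents of group 1 when the pattern contains exactly one capture group:

```python
import re

text = 'The cat in the hat sat on the flat mat.'

# \w* greedily eats the word, then gives characters back until the group fits.
with_group = re.findall(r'\w*(\w{1,2}?at)', text)
# Same results without the group or the quantifier.
single_char = re.findall(r'\wat', text)
# Word boundaries keep whole words only, so 'flat' stays intact.
with_boundaries = re.findall(r'\b\w{1,2}at\b', text)

print(with_group)       # ['cat', 'hat', 'sat', 'lat', 'mat']
print(single_char)      # ['cat', 'hat', 'sat', 'lat', 'mat']
print(with_boundaries)  # ['cat', 'hat', 'sat', 'flat', 'mat']
```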

The second regex has a known pattern to match: Ha at least 3 and at most 5 times, but as few times as possible. So in this case it never goes beyond 3, the same as (Ha){3}. The engine is satisfied as soon as possible.
(Ha){3,5}? matches the same as below (consider groups as one):
(Ha){3}|(Ha){4}|(Ha){5}
and (Ha){3,5} matches the same as:
(Ha){5}|(Ha){4}|(Ha){3}
So if the first side of the alternation matches, in either regex, the engine does not try the remaining sides.
What about \w{1,2}?at? Let's translate it:
(?:\w{1}|\w{2})at
The first side of the alternation has priority: once it leads to a match, the matching process is done. That's true about \w{1,2}at too:
(?:\w{2}|\w{1})at
Note: if the first side doesn't lead to a match, the engine tries the other sides in order.
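The translation into alternations can be checked directly in Python. This is only a sketch of the equivalence for these particular patterns; non-capturing groups are used so that findall returns whole matches:

```python
import re

laughs = 'HaHaHaHaHa'

# Lazy range: shortest count is tried first.
lazy_ha = re.search(r'(?:Ha){3,5}?', laughs).group()
lazy_alt = re.search(r'(?:Ha){3}|(?:Ha){4}|(?:Ha){5}', laughs).group()

# Greedy range: longest count is tried first.
greedy_ha = re.search(r'(?:Ha){3,5}', laughs).group()
greedy_alt = re.search(r'(?:Ha){5}|(?:Ha){4}|(?:Ha){3}', laughs).group()

print(lazy_ha, lazy_alt)      # HaHaHa HaHaHa
print(greedy_ha, greedy_alt)  # HaHaHaHaHa HaHaHaHaHa

# When 'at' must follow, the first side failing forces the second side,
# so both orders end up with the same matches.
text = 'The cat in the hat sat on the flat mat.'
lazy_at = re.findall(r'(?:\w{1}|\w{2})at', text)
greedy_at = re.findall(r'(?:\w{2}|\w{1})at', text)
print(lazy_at == greedy_at)   # True
```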

Related

Why are sequential regular expressions more efficient than a combined expression?

In answering a Splunk question on SO, the following sample text was given:
msg: abc.asia - [2021-08-23T00:27:08.152+0000] "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN&sourceOwner=ABC&filters=%257B%2522stringMatchFilters%2522:%255B%257B%2522key%2522:%2522BFEESCE((json_data-%253E%253E'isNotSearchable')::boolean,%2520false)%2522,%2522value%2522:%2522false%2522,%2522operator%2522:%2522EQ%2522%257D%255D,%2522multiStringMatchFilters%2522:%255B%257B%2522key%2522:%2522json_data-%253E%253E'id'%2522,%2522values%2522:%255B%25224970111%2522%255D%257D%255D,%2522containmentFilters%2522:%255B%255D,%2522nestedMultiStringMatchFilter%2522:%255B%255D,%2522nestedStringMatchFilters%2522:%255B%255D%257D&sorts=%257B%2522sortOrders%2522:%255B%257B%2522key%2522:%2522id%2522,%2522order%2522:%2522DESC%2522%257D%255D%257D&pagination=null
The person wanted to extract everything in the "filters" portion of the URL if "factType" was "COMMERCIAL"
The following all-in-one regex pulls it out neatly (presuming the URL is always in the right order, i.e. factType coming before filters):
factType=(?<facttype>\w+).+filters=(?<filters>[^\&]+)
According to regex101, it finds its expected matches with 670 steps
But if I break it up to
factType=(?<facttype>\w+)
followed by
filters=(?<filters>[^\&]+)
regex101 reports the matches being found with 26 and 16 steps, respectively
What about breaking up the regex into two makes it so much more (~15x) efficient to match?
The main problem with the regexp is the presence of the .+, where . matches (nearly) anything and + is greedy by default. A greedy quantifier first consumes as many characters as it can and then backtracks until the rest of the pattern matches, while a lazy quantifier consumes one more character only when the following pattern fails to match at the current position. Quantifiers are greedy by default in most engines, Java included. Fortunately, you can specify that you want a lazy quantifier with .+?. This means the engine will search for the shortest possible match for .+ instead of the longest one. This is what people usually do when writing a manual search. The result is 65 steps instead of 670 steps (10x better).
Note that lazy quantifiers do not always help in such a case. It is often better to make the regexp more precise (i.e. deterministic) to improve performance, by reducing the number of backtracks caused by taking a wrong path in the non-deterministic automaton.
Still, note that regexp engines are generally not very optimized compared to manual searches (as long as you use efficient search algorithms). They are great to make the code short, flexible and maintainable. For high-performance, a basic loop in native languages is often better. This is especially true if it is vectorized using SIMD instructions (which is generally not easy).
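To make the comparison concrete, here is a small Python sketch. The log line is a shortened, made-up stand-in for the one in the question, and Python spells named groups as (?P<name>...) rather than (?<name>...). It only shows that the three approaches extract the same values; it does not measure steps:

```python
import re

# Shortened, hypothetical stand-in for the log line in the question.
line = ('msg: abc.asia "GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN'
        '&sourceOwner=ABC&filters=%257B%2522key%2522%257D&sorts=%257B%257D'
        '&pagination=null')

# Combined pattern, greedy: .+ runs to the end of the line and backtracks.
greedy = re.search(r'factType=(?P<facttype>\w+).+filters=(?P<filters>[^&]+)', line)
# Combined pattern, lazy: .+? grows one character at a time.
lazy = re.search(r'factType=(?P<facttype>\w+).+?filters=(?P<filters>[^&]+)', line)
# Two sequential patterns: no wildcard in between at all.
fact = re.search(r'factType=(?P<facttype>\w+)', line)
filt = re.search(r'filters=(?P<filters>[^&]+)', line)

print(greedy.group('facttype'), greedy.group('filters'))
print(lazy.group('facttype'), lazy.group('filters'))
print(fact.group('facttype'), filt.group('filters'))
```

One caveat with the sequential version: the two searches are independent, so nothing checks that filters comes after factType.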
Here is a regex that would be inherently more efficient than .+ or .+? irrespective of the positions of those matches in input text.
factType=(?<facttype>\w+)(?:&(?!filters=)[^&\s]*)*&filters=(?<filters>[^\&]+)
This regex may look a bit longer, but it will be more efficient because we are using a negative lookahead (?!filters=) after matching & to stop the match just before the filters query parameter.
Q. What is backtracking?
A. In simple words: If a match isn't complete, the engine will backtrack the string to try to find a whole match until it succeeds or fails. In the above example if you use .+ it matches longest possible match till the end of input then starts backtracking one position backward at a time to find the whole match of second pattern. When you use .+? it just does lazy match and moves forward one position at a time to get the full match.
This suggested approach is far more efficient than .* or .+ or .+? approaches because it avoids expensive backtracking while trying to find match of the second pattern.
RegEx Details:
factType=: Match factType=
(?<facttype>\w+): Match 1+ word characters and capture in named group facttype
(?:: Start non-capture group
&: Match a &
(?!filters=): Stop matching when we have filters= at next position
[^&\s]*: Match 0 or more of non-space non-& chars
)*: End non-capture group. Repeat this group 0 or more times
&: Match a &
filters=: Match filters=
(?<filters>[^\&]+): Match 1 or more non-& chars and capture in named group filters
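Here is the same pattern as a Python sketch, with the named groups rewritten in Python's (?P<name>...) syntax. Both sample strings are made up for illustration; the second one puts filters before factType to show that the pattern then finds nothing:

```python
import re

pattern = re.compile(
    r'factType=(?P<facttype>\w+)'   # factType= and its value
    r'(?:&(?!filters=)[^&\s]*)*'    # any parameters that are not filters=
    r'&filters=(?P<filters>[^&]+)'  # the filters parameter itself
)

# Hypothetical, shortened request lines.
in_order = ('GET /facts?factType=COMMERCIAL&sourceSystem=ADMIN'
            '&sourceOwner=ABC&filters=%257B%2522key%2522%257D&sorts=x')
out_of_order = 'GET /facts?filters=%257B%257D&factType=COMMERCIAL&sorts=x'

m = pattern.search(in_order)
print(m.group('facttype'))           # COMMERCIAL
print(m.group('filters'))            # %257B%2522key%2522%257D
print(pattern.search(out_of_order))  # None
```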
Related article on catastrophic backtracking

Can this be parsed by regular expression [duplicate]

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 9 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With @nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo Even shorter, but can it be described without repeating code?
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than to endorse the "everything regex" approach to problems. The author wrote 3-4 variations, each with subtle bugs that are tricky to detect, before reaching the current solution.
For your specific example, there are better solutions that are more maintainable, such as matching and then splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that support \G and look-behind, we may be able to construct a regex that works with a global matching function. It is not easy to write it completely correctly, and it is easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
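As a rough sketch of that last point: Python's standard re module has no \G, but Pattern.match(string, pos) only matches starting exactly at pos, which gives the same "continue where the previous match ended" behavior. The names HEAD, NEXT and tokens below are made up for this illustration, and only the first start: in the string is considered:

```python
import re

# First token, with the look-ahead that limits the count to 3..10 items.
HEAD = re.compile(r'start:(?=\w+(?:-\w+){2,9}:end)(\w+)')
# Every further token must begin exactly where the previous match ended.
NEXT = re.compile(r'-(\w+)')

def tokens(s):
    m = HEAD.search(s)
    if m is None:
        return []
    found = [m.group(1)]
    pos = m.end()
    while True:
        m = NEXT.match(s, pos)  # anchored at pos, like \G
        if m is None:
            break
        found.append(m.group(1))
        pos = m.end()
    return found

print(tokens('start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'))
print(tokens('start:a-b:end'))      # [] (only 2 items)
print(tokens('start:a-b-c:end-x'))  # ['a', 'b', 'c']
```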
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET, since .NET supports capturing all instances matched by a repeated capturing group via the CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). The correctness of this construction is simpler to reason about, since there are fewer alternations.
The \G match boundary is especially useful for tokenization, where you need to make sure the engine does not skip ahead and match stuff that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within the specified limit and that it ends with :end. Either that, or we check that the character just before the current position (where the previous match ended) is a -, and continue from the previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first and then the repetitions of the form -\w+. This is in contrast to the first construction, where we match start:\w+- first and then the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex work for matching in the middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevents this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more iterations to fulfill a certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex first, which converts the matching-type problem back into a validation-type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
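The same match-then-split idea in Python, as a minimal sketch. The function name split_tokens is made up here, and the {2,9} repetition is added to enforce the 3 to 10 item limit from the question, which the ECMAScript and PHP versions above do not check:

```python
import re

def split_tokens(s):
    # Validate the whole string and the item count, then split the inner part.
    m = re.fullmatch(r'start:(\w+(?:-\w+){2,9}):end', s)
    return m.group(1).split('-') if m else None

print(split_tokens('start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'))
print(split_tokens('start:a-b:end'))  # None (fewer than 3 items)
```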
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print(re.findall(r"(\b\w+?\b)(?:-|:end)", s))
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repetition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the number of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capture groups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which point 1 is not necessarily true. In that case the regex you have indicated, start:(?:(\w+)-?){3,10}:end, will do the job (source).

Regex for nested matches

Consider the string
cos(t(2))+t(51)
Using a regular expression, I'd like to match cos(t(2)), t(2) and t(51). The general pattern this fits is intended to be something like
variable or function name + opening_parenthesis + contents + closing_parenthesis,
where contents can be any expression that has an equal number of opening and closing parentheses.
I'm using [a-zA-Z]+\([\W\w]*\) which returns cos(t(2))+t(51), which of course is not the desired result.
Any ideas on how to achieve this using regex? I'm particularly stuck at this "equal number of opening and closing parentheses".
Niels, this is an interesting and tricky question because you are looking for overlapping matches. Even with recursion, the task is not trivial.
You asked about any idea how to achieve this with regex, so it sounds like even if this is not available in matlab, you would be interested in seeing an answer that shows you how to do it in regex.
This makes sense to me because tools often change the regex libraries they use. For instance Notepad++, which used to have crippled regex, switched to PCRE in version 6. (As it happens, PCRE would work with this solution.)
In Perl and PCRE, you can use this short regex:
(?=(\b\w+\((?:\d+|(?1))\)))
This will match:
cos(t(2))
t(2)
t(51)
For instance, in php, you could use this code (see the results at the bottom of the online demo).
$regex = "~(?=(\b\w+\((?:\d+|(?1))\)))~";
$string = "cos(t(2))+t(51)";
$count = preg_match_all($regex,$string,$matches);
print_r($matches[1]);
How does it work?
To allow overlapping matches, we use a lookahead. That way, after matching cos(t(2)), the engine will position itself NOT after cos(t(2)), but before the o in cos.
In fact the engine does not actually match cos(t(2)) but merely captures it to Group 1. What it matches is the assertion that at this position in the string, looking ahead, we can see x. After matching this assertion, it tries to match it again starting from the next position in the string.
The expression in the lookahead (which describes what we're looking for) is fairly simple: in (\b\w+\((?:\d+|(?1))\)), after the \d+, the alternation | allows us to repeat subroutine number one with (?1), which is to say, the whole expression we are currently within. So we don't recurse the entire regex (which includes a lookahead), but a subexpression thereof.
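If your tool has no recursion support (Python's standard re module, for example, does not accept (?1)), a different technique gets the same three results for this input: use a regex only to find each name followed by an opening parenthesis, then count parentheses by hand to find the matching close. This is a sketch of that swapped-in approach, not the recursive regex itself, and nested_calls is a name made up for it:

```python
import re

def nested_calls(expr):
    results = []
    # Each match is a name immediately followed by an opening parenthesis.
    for m in re.finditer(r'[A-Za-z]+\(', expr):
        depth = 0
        # Scan from the opening parenthesis until it is balanced again.
        for i in range(m.end() - 1, len(expr)):
            if expr[i] == '(':
                depth += 1
            elif expr[i] == ')':
                depth -= 1
                if depth == 0:
                    results.append(expr[m.start():i + 1])
                    break
    return results

print(nested_calls('cos(t(2))+t(51)'))  # ['cos(t(2))', 't(2)', 't(51)']
```

Unlike the PCRE pattern above, this accepts any balanced contents, not just digits or a nested call.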

Regular expression greedy match not working as expected

I have a very basic regular expression, and I just can't figure out why it's not working, so the question has two parts: why does my current version not work, and what is the correct expression?
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
Although it was a 2 part answer and Alan's really made me understand why. I didn't use his version of ^(?>%?)\S{3} because it worked but not in the javascript implementation. Both great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
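Here is a small Python check of both patterns against the cases listed in the question. It is only a sketch; the dictionary of expected results is copied from the question's list:

```python
import re

original = re.compile(r'^%?\S{3}')
fixed = re.compile(r'^(%\S{3}|[^%\s]\S{2})')

# Expected results from the question: True means "pass".
cases = {
    'AB': False, 'ABC': True, 'ABCDEFG': True,
    '%': False, '%AB': False, '%ABC': True,
    '%ABCDEFG': True, '%%AB': True,
}

# Collect the inputs where each pattern disagrees with the expected result.
wrong_original = [s for s, ok in cases.items() if bool(original.search(s)) != ok]
wrong_fixed = [s for s, ok in cases.items() if bool(fixed.search(s)) != ok]

print(wrong_original)  # ['%AB'] : %? gives the % back so \S{3} can match
print(wrong_fixed)     # []
```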
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= (str[0] == '%' ? 4 : 3)
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.