making a small regular expression a bit more readable - regex

I've got a working regular expression, but I'd like to make it a tad more readable, and I'm far from a regex guru, so I was humbly hoping for some tips.
This is designed to scrape the output of several different compilers, linkers, and other build tools, and is used to build a nice little summery report. It does it's job great, but I'm left feeling like I wrote it in a clunky fashion, and I'd sooner learn than keep it the wrong way.
(.*?)\s?:?\s?(informational|warning|error|fatal error)?\s([A-Z]+[0-9][0-9][0-9][0-9]):\s(.*)$
Which, broken down simply, is as follows:
(.*?) # non-greedily match up until...
\s?:?\s? # we come across a possible " : "
(informational|warning|error|fatal error)? # possibly followed by one of these
\s([A-Z]+[0-9][0-9][0-9][0-9]):\s # but 100% followed by this alphanum
(.*)$ # and then capture the rest
I'm mostly interested in making the 2nd and 4th entry above more... beautiful. For some reason, the regex tester I was using (The Regulator) didn't match plain spaces, so I had to use the \s... but it is not meant to match any other whitespace.
Any schooling will be greatly appreciated.

The easiest way to make a long regex more readable is to use the "free-spacing" (or \x) modifier, which would let you write your regex just like you did in the second block of code -- it makes whitespace ignored. This isn't supported by all engines, however (according to the page linked above, .NET, Java, Perl, PCRE, Python, Ruby and XPath support it).
Note also that in free-spacing mode, you can use [ ] instead of \s if you want to only match a space character (unless you're using Java, in which case you have to use \ , which is an escaped space).
There's not really anything you can do for the second line, if you want each element to be optional independently of the other elements, but the fourth can be shortened:
\s([A-Z]+\d{4}):\s
\d is a shorthand class equivalent to [0-9], and {4} specifies that it should appear exactly four times.
The third line can be slightly shortened as well ((?:…) specifies a non-capturing group):
(informational|warning|(?:fatal )? error)?
From an efficiency standpoint, unless you actually need to capture subpatterns each time you use brackets, you can remove all of them, except for on the third line, where the group is needed for the alternation) -- but that one can be made non-capturing. Putting this all together you'd get:
.*?
\s?:?\s?
(?:informational|warning|(?:fatal )?error)?
\s[A-Z]+\d{4}:\s
.*$

Line 2
I think your regular expression doesn't match with the comment. You probably want this instead:
(\s:\s)?
To make it non-capturing:
(?:\s:\s)?
You should be able to use a literal space instead of \s. This must be a restriction in the tool you are using.
Line 4
[0-9][0-9][0-9][0-9] can be replaced with [0-9]{4}.
In some languages [0-9] is equivalent to \d.

Perhaps you can build the RE from sub-expressions, so that your end RE would look something like this:
/$preamble$possible_colon$keyword$alphanum$trailer/

Related

Can this be parsed by regular expression [duplicate]

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

Multiple spaces, multiple commas and multiple hypens in alphanumeric regex

I am very new to regex and regular expressions, and I am stuck in a situation where I want to apply a regex on an JSF input field.
Where
alphanumeric
multiple spaces
multiple dot(.)
multiple hyphen (‐)
are allowed, and Minimum limit is 1 and Maximum limit is 5.
And for multiple values - they must be separated by comma (,)
So a Single value can be:
3kd-R
or
k3
or
-4
And multiple values (must be comma separated):
kdk30,3.K-4,ER--U,2,.I3,
By the help of stackoverflow, so far I am able to achieve only this:
(^[a-zA-Z0-9 ]{5}(,[a-zA-Z0-9 ]{5})*$)
Something like
^[-.a-zA-Z0-9 ]{1,5}(,[-.a-zA-Z0-9 ]{1,5})*$
Changes made
[-.a-zA-Z0-9 ] Added - and . to the character class so that those are matched as well.
{1,5} Quantifier, ensures that it is matched minimum 1 and maximum 5 characters
Regex demo
You've done pretty good. You need to add hyphen and dot to that first character class. Note: With the hyphen, since it delegates ranges within a character class, you need to position it where contextually it cannot be specifying a range--not to say put it where it seems like it would be an invalid range, e.g., 7-., but positionally cannot be a range, i.e., first or last. So your first character class would look something like this:
[a-zA-Z 0-9.-]{1,5} or [-a-zA-Z0-9 .]{1,5}
So, we've just defined what one segment looks like. That pattern can reoccur zero or more times. Of course, there are many ways to do that, but I would favor a regex subroutine because this allows code reuse. Now if the specs change or you're testing and realize you have to tweak that segment pattern, you only need to change it in one place.
Subroutines are not supported in BRE or ERE, but most widely-used modern regex engines support them (Perl, PCRE, Ruby, Delphi, R, PHP). They are very simple to use and understand. Basically, you just need to be able to refer to it (sound familiar? refer-back? back-reference?), so this means we need to capture the regex we wish to repeat. Then it's as simple as referring back to it, but instead of \1 which refers to the captured value (data), we want to refer to it as (?1), the capturing expression. In doing so, we've logically defined a subroutine:
([a-zA-Z 0-9.-]{1,5})(,(?1))*
So, the first group basically defines our subroutine and the second group consists of a comma followed by the same segment-definition expression we used for the first group, and that is optional ('*' is the zero-or-more quantifier).
If you operate on large quantities of data where efficiency is a consideration, don't capture when you don't have to. If your sole purpose for using parenthesis is to alternate (e.g., \b[bB](asset|eagle)\b hound) or to quantify, as in our second group, use the (?: ... ) notation, which signifies to the regex engine that this is a non-capturing group. Without going into great detail, there is a lot of overhead in maintaining the match locations--not that it's complex, per se, just potentially highly repetitive. Regex engines will match, store the information, then when the match fails, they "give up" the match and try again starting with the next matching substring. Each time they match your capture group, they're storing that information again. Okay, I'm off the soapbox now. :-)
So, we're almost there. I say "almost" because I don't have all the information. But if this should be the sole occupant of the "subject" (line, field, etc.--the data sample you're evaluating), you should anchor it to "assert" that requirement. The caret '^' is beginning of subject, and the dollar '$' is end of subject, so by encapsulating our expression in ^ ... $ we are asserting that the subject matches in it's entirety, front-to-back. These assertions have zero-length; they consume no data, only assert a relative position. You can operate on them, e.g., s/^/ / would indent your entire document two spaces. You haven't really substituted the beginning of line with two spaces, but you're able to operate on that imaginary, zero-length location. (Do some research on zero-length assertions [aka zero-width assertions, or look-arounds] to uncover a powerful feature of modern regex. For example, in the previous regex if I wanted to make sure I did not insert two spaces on blank lines: s/^(?!$)/ /)
Also, you didn't say if you need to capture the results to do something with it. My impression was it's validation only, so that's not necessary. However, if it is needed, you can wrap the entire expression in capturing parenthesis: ^( ... )$.
I'm going to provide a final solution that does not assume you need to capture but does assume the entire subject should consist of this value:
^([a-zA-Z 0-9. -]{1,5})(?:,(?1))*$
I know I went on a bit, but you said you were new to regex, so wanted to provide some detail. I hope it wasn't too much detail.
By the way, an excellent resource with tutorials is regular-expressions dot info, and a wonderful regex development and testing tool is regex101 dot com. And I can never say enough about stack overflow!

Collapse and Capture a Repeating Pattern in a Single Regex Expression

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

lookahead in kate for patterns

I'm working on compiling a table of cases for a legal book. I've converted it to HTML so I can use the tags for search and replace operations, and I'm currently working in Kate. The text refers to the names of cases and the citations for the cases are in the footnotes, e.g.
<i>Smith v Jones</i>127 ......... [other stuff including newline characters].......</br>127 (1937) 173 ER 406;
I've been able to get lookahead working in Kate, using:
<i>.*</i>([0-9]{1,4}) .+<br/>\1 .*<br/>
...but I've run into greediness problems.
The text is a mess, so I really need to find matches step by step rather than relying on a batch process.
Is there a Linux (or Windows) text editor that supports both lookahead AND non-greedy operators, or am I going to have to try grep or sed?
I'm not familiar with Kate, but it seems to use QRegExp, which is incompatible with other Perl-like regex flavors in many important ways. For example, most flavors allow you make individual quantifiers non-greedy by appending a question mark (e.g. .* => .+?), but in QRegExp you can only make them all greedy or all non-greedy. What's worse, it looks like Kate doesn't even let you do that--via a Non-Greedy checkbox, for example.
But it's best not to rely on non-greedy quantifiers all time anyway. For one thing, they don't guarantee the shortest possible match, as many people say. You should get in the habit of being more specific about what should and should not be matched, when that's not too difficult. For example, if the section you want to match doesn't contain any tags other than the ones in your sample string, you can do this:
<i>[^<]*</i>(\d+)\b[^<]+<br/>\1\b[^<]*<br/>
The advantage of using [^<]* instead of .* is that it will never try to match anything after the next <. .* will always grab the rest of the document at first, only to backtrack almost all the way to the starting point. The non-greedy version, .*?, will initially match only to the next <, but if the match attempt fails later on it will go ahead and consume the < and beyond, eventually to consume the whole document.
If there can be other tags, you can use [^<]*(<(?!br/>)[^<]*)* instead. It will consume any characters that are not <, or < if it's not the beginning of a <br/> tag.
<i>[^<]*</i>(\d+)\b[^<]*(<(?!br/>)[^<]*)*<br/>\1\b[^<]*(<(?!br/>)[^<]*)*<br/>
By the way, what you're calling a lookahead (I'm assuming you mean \1) is really a backreference. The (?!br/>) in my regex is an example of lookaheads--in this case a negative lookahead. The Kate/QRegExp docs claim that lookaheads are supported but non-capturing groups-- e.g. (?:...)--aren't, which is why used all capturing groups in that last regex.
If you have the option of switching to a different editor, I strongly recommend that you do so. My favorite is EditPad Pro; it has the best regex support I've ever seen in an editor.

Regex word-break with unicode diacritics

I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisks to (?:(?=\B).)*, which works fine for most cases. Where it fails is that apparently .Net considers the position between a unicode character with a diacritic and another character a word-break. I consider this a bug, and have submitted it to the Microsoft feedback site.
In the meantime, however, I need to get the functionality implemented and product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text, but, frankly, am in "I don't really understand what this is going to do" land. I mean, I can read the specifications, but am not confident that I could sufficiently test this to make sure it is doing what I expect. I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.
Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?
Thanks for your help.
The equivalent of /(?:(?=\B).)*/ in a unicode context would be:
/
(?:
(?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
| (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
)
.
)*
/
...or somewhat simplified:
/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/
This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.
A normal or negated word-boundary (\b or \B) is basically a double look-around. One looking behind, making sure of the type of character that precedes the current position. Similarly one looking ahead.
In the second regex, I removed the look-arounds and used simple character classes instead.