Regex for nested matches - regex

Consider the string
cos(t(2))+t(51)
Using a regular expression, I'd like to match cos(t(2)), t(2) and t(51). The general pattern this fits is intended to be something like
variable or function name + opening_parenthesis + contents + closing_parenthesis,
where contents can be any expression that has an equal number of opening and closing parentheses.
I'm using [a-zA-Z]+\([\W\w]*\) which returns cos(t(2)))+t(51), which of course is not the desired result.
Any ideas on how to achieve this using regex? I'm particularly stuck at this "equal number of opening and closing parentheses".

Niels, this is an interesting and tricky question because you are looking for overlapping matches. Even with recursion, the task is not trivial.
You asked about any idea how to achieve this with regex, so it sounds like even if this is not available in matlab, you would be interested in seeing an answer that shows you how to do it in regex.
This makes sense to me because tools often change the regex libraries they use. For instance Notepad++, which used to have crippled regex, switched to PCRE in version 6. (As it happens, PCRE would work with this solution.)
In Perl and PCRE, you can use this short regex:
(?=(\b\w+\((?:\d+|(?1))\)))
This will match:
cos(t(2))
t(2)
t(51)
For instance, in php, you could use this code (see the results at the bottom of the online demo).
$regex = "~(?=(\b\w+\((?:\d+|(?1))\)))~";
$string = "cos(t(2))+t(51)";
$count = preg_match_all($regex,$string,$matches);
print_r($matches[1]);
How does it work?
To allow overlapping matches, we use a lookahead. That way, after matching cos(t(2)), the engine will position itself NOT after cos(t(2)), but before the o in cos
In fact the engine does not actually match cos(t(2)) but merely captures it to Group 1. What it matches is the assertion that at this position in the string, looking ahead, we can see x. After matching this assertion, it tries to match it again starting from the next position in the string.
The expression in the lookahead (which describes what we're looking for) is almost very simple: in (\b\w+\((?:\d+|(?1))\)), after the \d+, the alternation | allows us to repeat subroutine number one with (?1), which is to say, the whole expression we are currently within. So we don't recurse the entire regex (which includes a lookahead), but a subexpression thereof.

Related

Why "ab(cd|c)*d" matches "abcdcdd" completely but "ab(c|cd)*d" does not match that? Whereas they're like each other

I tried this regex:
ab(cd|c)*d
in the regex101 and RegExr websites. It matched this text completely:
abcdcdd
Now let's swap "cd" and "c" in the regex:
ab(c|cd)*d
When I try this regex in the websites, I see this regex does not completely match the same text.
Why doesn't the regex engine recognize that ab(cd|c)*d and ab(c|cd)*d are the same, and how can I persuade ab(c|cd)*d to match the longest string?
REGEX: ab(cd|c)*d
Complete text matched in 13 steps: abcdcdd
REGEX: ab(c|cd)*d
Partial text matched in 9 steps: abcdcdd
#MurrayW's answer is excellent, but I would like to add some background information.
Regex as Finite State Automata
When I first learned regular expressions in university, we learned to convert them to finite state automata, essentially compiling them into graphs that were then processed to match the string. When you do that, (cd|c) and (c|cd) get compiled into the same graph, in which case both of your regular expressions would match the whole string. This is what grep actually does:
Both
echo abcdcdd | grep --color -E 'ab(c|cd)*d'
and
echo abcdcdd | grep --color -E 'ab(cd|c)*d'
color the whole string in red.
Patterns we call "regular expressions"
True finite state automata have many limitations that programmers don't like, such as the inability to capture matching groups, of to reuse those groups later in the pattern, and other limitations I forget, so the regular expression libraries that we use in most programming languages implement more complex formalisms. I don't remember that they are exactly, maybe push-down automata, but we have memory, we have backtracking, and all sorts of good stuff we use without thinking about it.
At the risk of seeming pedantic, the patterns we use are not "regular" at all. I know, the difference is usually not relevant, we just want our code to work, but once in a while it matters.
So, while the regular expressions (cd|c) and (c|cd) would be compiled into the same finite state automaton, those two (non-regular) patterns are instead turned into logic that says try the variants from left to right, and backtrack only if the rest of the pattern fails to match later, hence the results you observed.
Speed
While the patterns our "regular expression" libraries support offer us lots of goodies we like, those come at a performance cost. True regular expressions are blazingly fast, while our patterns, though usually fast, can sometimes be very expensive. Search for "catastrophic backtracking" on this site for many examples of patterns that take exponential time to fail. The same patterns, used with grep, would be compiled into a graph that is applied in linear time to the string to match no matter what.
Because the | character performs an or operation by testing the left-most condition first. If that matches, nothing further is tested in the or. If that fails, then the next or element is tested, and so on.
Using regex pattern ab(cd|c)*d, you can see that the cd part of (cd|c)* matches in your string, and is also repeated: abcdcdd.
However, in pattern ab(c|cd)*d, the c matches from the or operation in abcdcdd and so cd isn't tested at all. Then, the d at the end of the pattern matches the d after the first c and then the pattern stops, having only matched abcdcdd
As previously answered in the comments, they are not the same patterns. The alternation in the first one tries to match cd first, the second one c first.
First pattern
abcdcdd
^^^^
||
||
ab(cd|c)*d
Second pattern
abcdcdd
^^____
| |
| |
ab(c|cd)*d
If the d is optional, you can omit the pipe for the alternation and make the d optional.
ab(cd?)*d.
Regex demo
Note that this way you repeat the capturing group which will hold the value of the last iteration.
If you are not interrested in the value of the group and non capturing groups are supported you could use ab(?:cd?)*d.
Regex is always a left to right proposition.
The only way a regex engine will ignore a previous alternation construct
is if it has to satisfy a term on the right side of the alternation group
that cannot be satisfied otherwise.
The regex rule is that the pattern is traversed from left to right,
but is controlled by the target string being traversed from left to right.
The symbiosis ..
Given the target string was matched like so "abcdcdd"
its easy to assume that the regex subset of the full regex
ab
( c | cd )* # (1)
d
is clearly
ab
c*
d
where the cd term of the alternation to the right was never needed
for a successful match.
This proves regex engines are a Left to Right bias machine.

Can this be parsed by regular expression [duplicate]

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.
So let's say the text is:
start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end
This example has 8 items inside, but say it could have between 3 and 10 items.
I'd ideally like something like this:
start:(?:(\w+)-?){3,10}:end nice and clean BUT it only captures the last match. see here
I usually use something like this in simple situations:
start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end
3 groups mandatory and another 7 optional because of the max 10 limit, but this doesn't look 'nice' and it would be a pain to write and track if the max limit was 100 and the matches were more complex. demo
And the best I could do so far:
start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end
shorter especially if the matches are complex but still long. demo
Anyone managed to make it work as a 1 regex-only solution without programming?
I'm mostly interested on how can this be done in PCRE but other flavors would be ok too.
Update:
The purpose is to validate a match and capture individual tokens inside match 0 by RegEx alone, without any OS/Software/Programming-Language limitation
Update 2 (bounty):
With #nhahtdh's help I got to the RegExp below by using \G:
(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)
demo even shorter, but can be described without repeating code
I'm also interested in the ECMA flavor and as it doesn't support \G wondering if there's another way, especially without using /g modifier.
Read this first!
This post is to show the possibility rather than endorsing the "everything regex" approach to problem. The author has written 3-4 variations, each has subtle bug that are tricky to detect, before reaching the current solution.
For your specific example, there are other better solution that is more maintainable, such as matching and splitting the match along the delimiters.
This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind is reusable for similar cases.
Summary
.NET supports capturing repeating pattern with CaptureCollection class.
For languages that supports \G and look-behind, we may be able to construct a regex that works with global matching function. It is not easy to write it completely correct and easy to write a subtly buggy regex.
For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after a single match. (Not covered in this answer).
Solution
This solution assumes the regex engine supports \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). Java, Perl, PCRE, .NET, Ruby regex flavors support all those advanced features above.
However, you can go with your regex in .NET. Since .NET supports capturing all instances of that is matched by a capturing group that is repeated via CaptureCollection class.
For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)
DEMO. The construction is \w+- repeated, then \w+:end.
(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
DEMO. The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion). This construction is simpler to reason about its correctness, since there are less alternations.
\G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine not skipping ahead and matching stuffs that should have been invalid.
Explanation
Let us break down the regex:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?<=-)\G
)
(\w+)
(?:-|:end)
The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.
The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.
I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.
As for the explanation, the beginning of the regex checks for the presence of the string start:, and the following look-ahead checks that the number of words is within specified limit and it ends with :end. Either that, or we check that the character before the previous match is a -, and continue from previous match.
For the other construction:
(?:
start:(?=\w+(?:-\w+){2,9}:end)
|
(?!^)\G-
)
(\w+)
Everything is almost the same, except that we match start:\w+ first before matching the repetition of the form -\w+. In contrast to the first construction, where we match start:\w+- first, and the repeated instances of \w+- (or \w+:end for the last repetition).
It is quite tricky to make this regex works for matching in middle of the string:
We need to check the number of words between start: and :end (as part of the requirement of the original regex).
\G matches the beginning of the string also! (?!^) is needed to prevent this behavior. Without taking care of this, the regex may produce a match when there isn't any start:.
For the first construction, the look-behind (?<=-) already prevent this case ((?!^) is implied by (?<=-)).
For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.
The second construction doesn't run into this problem, since we will get stuck at : (of :end) after we have matched all the tokens in between.
Validation Version
If you want to do validation that the input string follows the format (no extra stuff in front and behind), and extract the data, you can add anchors as such:
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)
(Look-behind is also not needed, but we still need (?!^) to prevent \G from matching the start of the string).
Construction
For all the problems where you want to capture all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more loop to fulfill certain condition to match.
When the original regex describes the whole input string (validation type), it is usually easier to convert compared to a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex, and we convert matching type problem back to validation type problem.
We build such regex by going through these steps:
Write a regex that covers the part before the repetition (e.g. start:). Let us call this prefix regex.
Match and capture the first instance. (e.g. (\w+))
(At this point, the first instance and delimiter should have been matched)
Add the \G as an alternation. Usually also need to prevent it from matching the start of the string.
Add the delimiter (if any). (e.g. -)
(After this step, the rest of the tokens should have also been matched, except the last maybe)
Add the part that covers the part after the repetition (if necessary) (e.g. :end). Let us call the part after the repetition suffix regex (whether we add it to the construction doesn't matter).
Now the hard part. You need to check that:
There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
There is no way to start any match after the suffix regex has been matched. Take note of how \G branch starts a match.
For the first construction, if you mix the suffix regex (e.g. :end) with delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as delimiter.
Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.
In ECMAScript I would write it like this:
'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
.match(/^start:([\w-]+):end$/)[1] // match the inner part
.split('-') // split inner part (this could be a split regex as well)
In PHP:
$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
print_r(explode('-', $matches[1]));
}
Of course you can use the regex in this quoted string.
"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"
Is it a good idea? No, I don't think so.
Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:
http://regex101.com/r/gK0lX1
You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.
import re
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print re.findall(r"(\b\w+?\b)(?:-|:end)", s)
produces
['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']
When you combine:
Your observation: any kind of repitition of a single capture group will result in an overwrite of the last capture, thus returning only the last capture of the capture group.
The knowledge: Any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the amount of times the regex engine will repeat. The limit would have to be metadata (not regex).
With a requirement that the answer cannot involve programming (looping), nor an answer that involves simply copy-pasting capturegroups as you've done in your question.
It can be deduced that it cannot be done.
Update: There are some regex engines for which p. 1 is not necessarily true. In that case the regex you have indicated start:(?:(\w+)-?){3,10}:end will do the job (source).

Python Regex non greedy match

This question comes from "Automate the boring stuff with python" book.
atRegex1 = re.compile(r'\w{1,2}at')
atRegex2 = re.compile(r'\w{1,2}?at')
atRegex1.findall('The cat in the hat sat on the flat mat.')
atRegex2.findall('The cat in the hat sat on the flat mat.')
I thought the question market ? should conduct a non-greedy match, so \w{1,2}? should only return 1 character. But for both of these functions, I get the same output:
['cat', 'hat', 'sat', 'flat', 'mat']
In the book,
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedyHaRegex.search('HaHaHaHaHa')
mo2.group()
'HaHaHa'
Any one can help me understand why there is a difference? Thanks!
The issue you're experiencing is due to the nature of backtracking in regex. The regex engine is parsing the string at each given position therein, and as such, will attempt every option of the pattern until it either matches or fails at that position. If it matches, it will consume those characters and if it fails it will continue to the next position until the end of the string is met.
The keyword here is backtracks. I think Microsoft's documentation does a great job of defining this term (I've bolded the important section):
Backtracking occurs when a regular expression pattern contains
optional quantifiers or alternation constructs, and the regular
expression engine returns to a previous saved state to continue its
search for a match. Backtracking is central to the power of regular
expressions; it makes it possible for expressions to be powerful and
flexible, and to match very complex patterns. At the same time, this
power comes at a cost. Backtracking is often the single most important
factor that affects the performance of the regular expression engine.
Fortunately, the developer has control over the behavior of the
regular expression engine and how it uses backtracking. This topic
explains how backtracking works and how it can be controlled.
The regex engine backtracks to a previous saved state. It cannot forward track to a future saved state, although that would be pretty neat! Since you've specified that your match should end with at (the lazy quantifier precedes it), it will exhaust every regex option until \w{1,2} ending in at proves true.
How can you get around this? Well, the easiest way is probably to use a capture group:
See regex in use here
\w*(\w{1,2}?at)
\w*(\w{1,2}at) # yields same results as above (but in more steps)
\w*(\wat) # yields same results as above (faster method)
\wat # yields same results as above (fastest method)
\b\w{1,2}at\b # perhaps this is what OP is after?
\w* Matches any word character any number of times. This is a fix to allow us to simulate forward tracking (this is not a proper term, just used in the context of the rest of my answer above). It will match as many characters as possible and work its way backwards until a match occurs.
The rest of the pattern the OP already had. In fact, \w{2} will never be met since \w will always only be met once (since the \w* token is greedy), therefore \wat can be used instead \w*(\wat). Perhaps the OP intended to use anchors such as \b in the regex: \b\w{1,2}at\b? This doesn't differ from the original nature of the OP's regex either since making the quantifier lazy would have theoretically yielded the same results in the context of forward tracking (one match of \w would have satisfied \w{1,2}?, thus \w{2} would never be reached).
Second regex has a known pattern to match: Ha for minimum 3 times and maximum 5 but as few as possible. So in this case it never goes beyond 3, the same as (Ha){3}. Engine's satisfied as soon as possible.
(Ha){3,5}? matches the same as below (consider groups as one):
(Ha){3}|(Ha){4}|(Ha){5}
and (Ha){3,5} matches the same as:
(Ha){5}|(Ha){4}|(Ha){3}
So if first side of alternation, in both regexes, is found there is no more try for a new match from engine.
What about \w{1,2}?at? Let's translate it:
(?:\w{1}|\w{2})at
First side of alternation has a priority - when found matching process is done. That's true about \w{1,2}at too:
(?:\w{2}|\w{1})at
Note: if first side doesn't match, engine goes with other sides in order.

Look behinds: all the rage in regex?

Many regex questions lately have some kind of look-around element in the query that appears to me is not necessary to the success of the match. Is there some teaching resource that is promoting them? I am trying to figure out what kinds of cases you would be better off using a positive look ahead/behind. The main application I can see is when trying to not match an element. But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
And this one from another question:
$url = "www.example.com/id/1234";
preg_match("/\d+(?<=id\/[\d])/",$url,$matches);
When is it truly better to use a positive look-around? Can you give some examples?
I realize this is bordering on an opinion-based question, but I think the answers would be really instructive. Regex is confusing enough without making things more complicated... I have read this page and am more interested in some simple guidelines for when to use them rather than how they work.
Thanks for all the replies. In addition to those below, I recommend checking out m.buettner's great answer here.
You can capture overlapping matches, and you can find matches which could lie in the lookarounds of other matches.
You can express complex logical assertions about your match (because many engines let you use multiple lookbehind/lookahead assertions which all must match in order for the match to succeed).
Lookaround is a natural way to express the common constraint "matches X, if it is followed by/preceded by Y". It is (arguably) less natural to add extra "matching" parts that have to be thrown out by postprocessing.
Negative lookaround assertions, of course, are even more useful. Combined with #2, they can allow you do some pretty wizard tricks, which may even be hard to express in usual program logic.
Examples, by popular request:
Overlapping matches: suppose you want to find all candidate genes in a given genetic sequence. Genes generally start with ATG, and end with TAG, TAA or TGA. But, candidates could overlap: false starts may exist. So, you can use a regex like this:
ATG(?=((?:...)*(?:TAG|TAA|TGA)))
This simple regex looks for the ATG start-codon, followed by some number of codons, followed by a stop codon. It pulls out everything that looks like a gene (sans start codon), and properly outputs genes even if they overlap.
Zero-width matching: suppose you want to find every tr with a specific class in a computer-generated HTML page. You might do something like this:
<tr class="TableRow">.*?</tr>(?=<tr class="TableRow">|</table>)
This deals with the case in which a bare </tr> appears inside the row. (Of course, in general, an HTML parser is a better choice, but sometimes you just need something quick and dirty).
Multiple constraints: suppose you have a file with data like id:tag1,tag2,tag3,tag4, with tags in any order, and you want to find all rows with tags "green" and "egg". This can be done easily with two lookaheads:
(.*):(?=.*\bgreen\b)(?=.*\begg\b)
There are two great things about lookaround expressions:
They are zero-width assertions. They require to be matched, but they consume nothing of the input string. This allows to describe parts of the string which will not be contained in a match result. By using capturing groups in lookaround expressions, they are the only way to capture parts of the input multiple times.
They simplify a lot of things. While they do not extend regular languages, they easily allow to combine (intersect) multiple expressions to match the same part of a string.
Well one simple case where they are handy is when you are anchoring the pattern to the start or finish of a line, and just want to make sure that something is either right ahead or behind the pattern you are matching.
I try to address your points:
some kind of look-around element in the query that appears to me is not necessary to the success of the match
Of course they are necessary for the match. As soon as a lookaround assertions fails, there is no match. They can be used to ensure conditions around the pattern, that have additionally to be true. The whole regex does only match, if:
The pattern does fit and
The lookaround assertions are true.
==> But the returned match is only the pattern.
When is it truly better to use a positive look-around?
Simple answer: when you want stuff to be there, but you don't want to match it!
As Bergi mentioned in his answer, they are zero width assertions, this means they don't match a character sequence, they just ensure it is there. So the characters inside a lookaround expression are not "consumed", the regex engine continues after the last "consumed" character.
Regarding your first example:
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
I think there is a misunderstanding on your side, when you write "has a simple solution to capturing the .*". The .* is not "captured", it is the only thing that the expression does match. But only those characters are matched that have a "<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">" before and a "<\/a><span" after (those two are not part of the match!).
"Captured" is only something that has been matched by a capturing group.
The second example
\d+(?<=id\/[\d])
Is interesting. It is matching a sequence of digits (\d+) and after the sequence, the lookbehind assertion checks if there is one digit with "id/" before it. Means it will fail if there is more than one digit or if the text "id/" before the digit is missing. Means this regex is matching only one digit, when there is fitting text before.
teaching resources
www.regular-expressions.info
perlretut on Looking ahead and looking behind
I'm assuming you understand the good uses of lookarounds, and ask why they are used with no apparent reason.
I think there are four main categories of how people use regular expressions:
Validation
Validation is usually done on the whole text. Lookarounds like you describe are not possible.
Match
Extracting a part of the text. Lookarounds are used mainly due to developer laziness: avoiding captures.
For example, if we have in a settings file with the line Index=5, we can match /^Index=(\d+)/ and take the first group, or match /(?<=^Index=)\d+/ and take everything.
As other answers said, sometimes you need overlapping between matches, but these are relatively rare.
Replace
This is similar to match with one difference: the whole match is removed and is being replaced with a new string (and some captured groups).
Example: we want to highlight the name in "Hi, my name is Bob!".
We can replace /(name is )(\w+)/ with $1<b>$2</b>,
but it is neater to replace /(?<=name is )\w+/ with <b>$&</b> - and no captures at all.
Split
split takes the text and breaks it to an array of tokens, with your pattern being the delimiter. This is done by:
Find a match. Everything before this match is token.
The content of the match is discarded, but:
In most flavors, each captured group in the match is also a token (notably not in Java).
When there are no more matches, the rest of the text is the last token.
Here, lookarounds are crucial. Matching a character means removing it from the result, or at least separating it from its token.
Example: We have a comma separated list of quoted string: "Hello","Hi, I'm Jim."
Splitting by comma /,/ is wrong: {"Hello", "Hi, I'm Jim."}
We can't add the quote mark, /",/: {"Hello, "Hi, I'm Jim."}
The only good option is lookbehind, /(?<="),/: {"Hello", "Hi, I'm Jim."}
Personally, I prefer to match the tokens rather than split by the delimiter, whenever that is possible.
Conclusion
To answer the main question - these lookarounds are used because:
Sometimes you can't match text that need.
Developers are shiftless.
Lookaround assertions can also be used to reduce backtracking which can be the main cause for a bad performance in regexes.
For example: The regex ^[0-9A-Z]([-.\w]*[0-9A-Z])*#(1) can also be written ^[0-9A-Z][-.\w]*(?<=[0-9A-Z])#(2) using a positive look behind (simple validation of the user name in an e-mail address).
Regex (1) can cause a lot of backtracking essentially because [0-9A-Z] is a subset of [-.\w] and the nested quantifiers. Regex (2) reduces the excessive backtracking, more information here Backtracking, section Controlling Backtracking > Lookbehind Assertions.
For more information about backtracking
Best Practices for Regular Expressions in the .NET Framework
Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking
Runaway Regular Expressions: Catastrophic Backtracking
I typed this a while back but got busy (still am, so I might take a while to reply back) and didn't get around to post it. If you're still open to answers...
Is there some teaching resource that is promoting them?
I don't think so, it's just a coincidence I believe.
But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
This is most probably a C# regex, since variable width lookbehinds are not supported my many regex engines. Well, the lookarounds could be certainly avoided here, because for this, I believe it's really simpler to have capture groups (and make the .* lazy as we're at it):
(<td><a href="\/xxx\.html\?n=[0-9]{0,5}">).*?(<\/a><span)
If it's for a replace, or
<td><a href="\/xxx\.html\?n=[0-9]{0,5}">(.*?)<\/a><span
for a match. Though an html parser would definitely be more advisable here.
Lookarounds in this case I believe are slower. See regex101 demo where the match is 64 steps for capture groups but 94+19 = 1-3 steps for the lookarounds.
When is it truly better to use a positive look-around? Can you give some examples?
Well, lookarounds have the property of being zero-width assertions, which mean they don't really comtribute to matches while they contribute onto deciding what to match and also allows overlapping matches.
Thinking a bit about it, I think, too, that negative lookarounds get used much more often, but that doesn't make positive lookarounds less useful!
Some 'exploits' I can find browsing some old answers of mine (links below will be demos from regex101) follow. When/If you see something you're not familiar about, I probably won't be explaining it here, since the question's focused on positive lookarounds, but you can always look at the demo links I provided where there's a description of the regex, and if you still want some explanation, let me know and I'll try to explain as much as I can.
To get matches between certain characters:
In some matches, positive lookahead make things easier, where a lookahead could do as well, or when it's not so practical to use no lookarounds:
Dog sighed. "I'm no super dog, nor special dog," said Dog, "I'm an ordinary dog, now leave me alone!" Dog pushed him away and made his way to the other dog.
We want to get all the dog (regardless of case) outside quotes. With a positive lookahead, we can do this:
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*$)
to ensure that there are even number of quotes ahead. With a negative lookahead, it would look like this:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
to ensure that there are no odd number of quotes ahead. Or use something like this if you don't want a lookahead, but you'll have to extract the group 1 matches:
(?:"[^"]+"[^"]+?)?(\bdog\b)
Okay, now say we want the opposite; find 'dog' inside the quotes. The regex with the lookarounds just need to have the sign inversed, first and second:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*$)
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
But without the lookaheads, it's not possible. the closest you can get is maybe this:
"[^"]*(\bdog\b)[^"]*"
But this doesn't get all the matches, or you can maybe use this:
"[^"]*?(\bdog\b)[^"]*?(?:(\bdog\b)[^"]*?)?"
But it's just not practical for more occurrences of dog and you get the results in variables with increasing numbers... And this is indeed easier with lookarounds, because they are zero width assertions, you don't have to worry about the expression inside the lookaround to match dog or not, or the regex wouldn't have obtained all the occurrences of dog in the quotes.
Of course now, this logic can be extended to groups of characters, such as getting specific patterns between words such as start and end.
Overlapping matches
If you have a string like:
abcdefghijkl
And want to extract all the consecutive 3 characters possible inside, you can use this:
(?=(...))
If you have something like:
1A Line1 Detail1 Detail2 Detail3 2A Line2 Detail 3A Line3 Detail Detail
And want to extract these, knowing that each line starts with #A Line# (where # is a number):
1A Line1 Detail1 Detail2 Detail3
2A Line2 Detail
3A Line3 Detail Detail
You might try this, which fails because of greediness...
[0-9]+A Line[0-9]+(?: \w+)+
Or this, which when made lazy no more works...
[0-9]+A Line[0-9]+(?: \w+)+?
But with a positive lookahead, you get this:
[0-9]+A Line[0-9]+(?: \w+)+?(?= [0-9]+A Line[0-9]+|$)
And appropriately extracts what's needed.
Another possible situation is one where you have something like this:
#ff00fffirstword#445533secondword##008877thi#rdword#
Which you want to convert to three pairs of variables (first of the pair being a # and some hex values (6) and whatever characters after them):
#ff00ff and firstword
#445533 and secondword#
#008877 and thi#rdword#
If there were no hashes inside the 'words', it would have been enough to use (#[0-9a-f]{6})([^#]+), but unfortunately, that's not the case and you have to resort to .*? instead of [^#]+, which doesn't quite yet solve the issue of stray hashes. Positive lookaheads however make this possible:
(#[0-9a-f]{6})(.+?)(?=#[0-9a-f]{6}|$)
Validation & Formatting
Not recommended, but you can use positive lookaheads for quick validations. The following regex for instance allow the entry of a string containing at least 1 digit and 1 lowercase letter.
^(?=[^0-9]*[0-9])(?=[^a-z]*[a-z])
This can be useful when you're checking for character length but have patterns of varying length in the a string, for example, a 4 character long string with valid formats where # indicates a digit and the hyphen/dash/minus - must be in the middle:
##-#
#-##
A regex like this does the trick:
^(?=.{4}$)\d+-\d+
Where otherwise, you'd do ^(?:[0-9]{2}-[0-9]|[0-9]-[0-9]{2})$ and imagine now that the max length was 15; the number of alterations you'd need.
If you want a quick and dirty way to rearrange some dates in the 'messed up' format mmm-yyyy and yyyy-mm to a more uniform format mmm-yyyy, you can use this:
(?=.*(\b\w{3}\b))(?=.*(\b\d{4}\b)).*
Input:
Oct-2013
2013-Oct
Output:
Oct-2013
Oct-2013
An alternative might be to use a regex (normal match) and process separately all the non-conforming formats separately.
Something else I came across on SO was the indian currency format, which was ##,##,###.### (3 digits to the left of the decimal and all other digits groupped in pair). If you have an input of 122123123456.764244, you expect 1,22,12,31,23,456.764244 and if you want to use a regex, this one does this:
\G\d{1,2}\K\B(?=(?:\d{2})*\d{3}(?!\d))
(The (?:\G|^) in the link is only used because \G matches only at the start of the string and after a match) and I don't think this could work without the positive lookahead, since it looks forward without moving the point of replacement.)
Trimming
Suppose you have:
this is a sentence
And want to trim all the spaces with a single regex. You might be tempted to do a general replace on spaces:
\s+
But this yields thisisasentence. Well, maybe replace with a single space? It now yields " this is a sentence " (double quotes used because backticks eats spaces). Something you can however do is this:
^\s*|\s$|\s+(?=\s)
Which makes sure to leave one space behind so that you can replace with nothing and get "this is a sentence".
Splitting
Well, somewhere else where positive lookarounds might be useful is where, say you have a string ABC12DE3456FGHI789 and want to get the letters+digits apart, that is you want to get ABC12, DE3456 and FGHI789. You can easily do use the regex:
(?<=[0-9])(?=[A-Z])
While if you use ([A-Z]+[0-9]+) (i.e. the captured groups are put back in the resulting list/array/etc, you will be getting empty elements as well.
Note that this could be done with a match as well, with [A-Z]+[0-9]+
If I had to mention negative lookarounds, this post would have been even longer :)
Keep in mind that a positive/negative lookaround is the same for a regex engine. The goal of lookarounds is to perform a check somewhere in your "regular expression".
One of the main interest is to capture something without using capturing parenthesis (capturing the whole pattern), example:
string: aaabbbccc
regex: (?<=aaa)bbb(?=ccc)
(you obtain the result with the whole pattern)
instead of: aaa(bbb)ccc
(you obtain the result with the capturing group.)

Regex exception

I'd like to have regex that would match every [[ except these starting with some word, ex.:
Match [[DEF, but not match [[ABC:DEF.
Thanks for help and sorry for my English.
EDIT:
My regex (Python) is (\[\[)|(\{\{([Tt]emplate:|)[Cc]ategory).
It match every [[ and {{category}} or {{Template:Category}} or {{template:category}}, but I don't want to match [[ if it starting by ex. ABC. More examples:
Match [[SOMETHING, but not match [[ABC: SOMETHING,
Match [[EXAMPLE, but not match [[ABC: EXAMPLE.
EDIT2: "define ex. ABC"
I want match every [[ not followed by some string, for example ABC.
This depends heavily on the regex engine you are using. If I can assume it can handle look-arounds, the regex would probably be \[\[(?!ABC) for matching two opening brackets not followed by the three characters ABC.
match every [[ but don't match [[ if it starting by ex. ABC
Maybe you mean:
\[\[(?!ABC)
...or maybe something more like:
\[\[(?!\w+:)
Finally, after 8 years, here's an easy copy-paste code that should cover every possible case.
Watch out for:
Be careful when using this for "any-word-except", make sure to put \b in the theREGEX_BEFORE part, as you should be doing anyways for finding words.
If your regex is really complex, and you need to use this code in two different places in one regex expression, make sure to use exceptions_group_1 for the first time, exceptions_group_2 for the second time, etc. Read the explanation below to understand this better.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = [[
YOUR_NORMAL_PATTERN = \w+\d*
REGEX_AFTER = ]]
EXCEPTION_PATTERN = MyKeyword\d+
Python regex
pattern = r"\[\[(?>(?P<exceptions_group_1>MyKeyword\d+)|\w+\d*)(?(exceptions_group_1)always(?<=fail)|)\]\]"
Ruby regex
pattern = /\[\[(?>(?<exceptions_group_1>MyKeyword\d+)|\w+\d*)(?(<exceptions_group_1>)always(?<=fail)|)\]\]/
PCRE regex
\[\[(?>(?<exceptions_group_1>MyKeyword\d+)|\w+\d*)(?(exceptions_group_1)always(?<=fail)|)\]\]
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means its not allowed to backtrack: which means, If that group matches once, but then later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case).
The (?<exceptions_group_1> creates a named capture group. Its just easier than using numbers. Note that the pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
Note that the atomic pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because its looking for the word "always" and then it checks 'does "ways"=="fail"', which it never will.
Because the conditional fails, this means the atomic group fails, and because it's atomic that means its not allowed to backtrack (to try to look for the normal pattern) because it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question:
pattern = r"(\[\[(?>(?P<exceptions_group_1>ABC: )|(SOMETHING|EXAMPLE))(?(exceptions_group_1)always(?<=fail)|))"