Why does this regex using *+ (possessive) not match?

I need to get the last match of [0-9.]* in a string like
one 1.234 three
some text 1.2321 xyz 1 5 1.234 and more text
some other text
but also need the text around it - even when there is no number like in the 3rd line
I wanted to use ^(.*)([0-9\.]*+)(.*)$ but it just matches the first (.*).
On the other hand ^(.*?)([0-9\.]*+)(.*?)$ just matches the last (.*?).
Why is that? I thought that it will try to satisfy all rules?
I know that I can exclude 0-9. from the last .* to get what I want, but I want to understand why the above isn't working although I used *+

A possessive quantifier doesn't guarantee the longest possible match; it just prevents backtracking into whatever it has matched. Neither of your regexes ever needs to backtrack into ([0-9.]*+), so the possessive quantifier has no effect.
With the first regex, the first (.*) consumes the whole string, then ([0-9.]*+) and the second (.*) each consume nothing because there's nothing left to match.
With the second regex, the first (.*?) initially consumes nothing because it's reluctant. Then ([0-9.]*+) successfully matches some more nothing because it's still at the beginning of the string, which doesn't happen to start with a digit or a period. Finally, the last (.*?) is forced to consume what's left (the whole string) despite being reluctant, because of the anchor ($) following it.
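To see this concretely, here is a small sketch in Python (my choice of language, not the asker's; the groups helper is made up for illustration). Python's re module only accepts possessive quantifiers from version 3.11 on, so the possessive variants are tried in a guarded block and simply reuse the plain results on older versions.

```python
import re

LINE = "one 1.234 three"

def groups(pattern, text=LINE):
    """Return the captured groups, or None if the pattern does not match."""
    m = re.match(pattern, text)
    return m.groups() if m else None

# Plain (non-possessive) versions of the two regexes from the question.
greedy = groups(r"^(.*)([0-9.]*)(.*)$")
lazy = groups(r"^(.*?)([0-9.]*)(.*?)$")

try:
    # Possessive *+ is only understood by Python 3.11 and later.
    greedy_possessive = groups(r"^(.*)([0-9.]*+)(.*)$")
    lazy_possessive = groups(r"^(.*?)([0-9.]*+)(.*?)$")
except re.error:
    # Older Python: fall back so the rest of the sketch still runs.
    greedy_possessive, lazy_possessive = greedy, lazy

print(greedy)  # ('one 1.234 three', '', '')
print(lazy)    # ('', '', 'one 1.234 three')
print(greedy_possessive == greedy, lazy_possessive == lazy)  # True True
```

In both cases the middle group is empty, with or without the possessive quantifier.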
To solve your problem, we need to know more about the kind of input you can expect. For example, if you know there will never be any digits or periods after the number you're looking for, you could use this:
^(.*?)(?:([0-9.]+)([^0-9.]*))?$
The key here is that the second capturing group, ([0-9.]+), uses a + instead of a *. If there are no digits or periods in the string, the enclosing group, (?:([0-9.]+)([^0-9.]*))?, will match nothing, and the initial (.*?) will be forced to consume the whole string. (The second and third groups will be empty.)
If there's more than one sequence of digits or periods in the string, the second group is guaranteed to match the last of them, because the third group, ([^0-9.]*), allows anything but those characters in the remainder of the string.
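A quick way to check this against the three sample lines, sketched in Python (the split_last_number name is hypothetical, and it assumes single-line input without embedded newlines). Note that Python reports groups that did not participate in the match as None rather than as empty strings.

```python
import re

LAST_NUMBER = re.compile(r"^(.*?)(?:([0-9.]+)([^0-9.]*))?$")

def split_last_number(line):
    """Split a line into (text before, last number, text after)."""
    return LAST_NUMBER.match(line).groups()

print(split_last_number("one 1.234 three"))
# ('one ', '1.234', ' three')
print(split_last_number("some text 1.2321 xyz 1 5 1.234 and more text"))
# ('some text 1.2321 xyz 1 5 ', '1.234', ' and more text')
print(split_last_number("some other text"))
# ('some other text', None, None)
```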
This is pretty weak, but it's the best I can do with the information you've supplied. The point is, possessive quantifiers are brilliant when you can use them, but that doesn't happen nearly as often as you might expect.

Why put a lookahead at the beginning, or a lookbehind at the end? [duplicate]

I'm pretty decent with regular expressions, and now I'm trying once again to understand lookahead and lookbehind assertions. They mostly make sense, but I'm not quite sure how the order affects the result. I've been looking at this site which places lookbehinds before the expression, and lookaheads after the expression. My question is, does this change anything? A recent answer here on SO placed the lookahead before the expression which is leading to my confusion.
When tutorials introduce lookarounds, they tend to choose the simplest use case for each one. So they'll use examples like (?<!a)b ('b' not preceded by 'a') or q(?=u) ('q' followed by 'u'). It's just to avoid cluttering the explanation with distracting details, but it tends to create (or reinforce) the impression that lookbehinds and lookaheads are supposed to appear in a certain order. It took me quite a while to get over that idea, and I've seen several others afflicted with it, too.
Try looking at some more realistic examples. One question that comes up a lot involves validating passwords; for example, making sure a new password is at least six characters long and contains at least one letter and one digit. One way to do that would be:
^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$
The character class [A-Za-z0-9]{6,} could match all letters or all digits, so you use the lookaheads to ensure that there's at least one of each. In this case, you have to do the lookaheads first, because the later parts of the regex have to be able to examine the whole string.
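As a rough illustration in Python (the is_valid helper and the test strings are mine, not part of the original answer), the password regex behaves like this:

```python
import re

PASSWORD = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$")

def is_valid(password):
    """True if the password passes both lookaheads and the length check."""
    return PASSWORD.match(password) is not None

print(is_valid("abc123"))  # True: letters and digits, six characters
print(is_valid("abcdef"))  # False: the (?=.*\d) lookahead fails
print(is_valid("123456"))  # False: the (?=.*[A-Za-z]) lookahead fails
print(is_valid("ab1"))     # False: shorter than six characters
```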
For another example, suppose you need to find all occurrences of the word "there" unless it's preceded by a quotation mark. The obvious regex for that is (?<!")[Tt]here\b, but if you're searching a large corpus, that could create a performance problem. As written, that regex will do the negative lookbehind at each and every position in the text, and only when that succeeds will it check the rest of the regex.
Every regex engine has its own strengths and weaknesses, but one thing that's true of all of them is that they're quicker to find fixed sequences of literal characters than anything else--the longer the sequence, the better. That means it can be dramatically faster to do the lookbehind last, even though it means matching the word twice:
[Tt]here\b(?<!"[Tt]here)
So the rule governing the placement of lookarounds is that there is no rule; you put them wherever they make the most sense in each case.
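Here is a small Python sketch showing that the two placements of the lookbehind find the same words (the sample sentence is invented, and the sketch checks only the results; it makes no claim about timing in any particular engine):

```python
import re

TEXT = 'There it is, "there" she said, and there again; thereby hangs a tale.'

LOOKBEHIND_FIRST = re.compile(r'(?<!")[Tt]here\b')
LOOKBEHIND_LAST = re.compile(r'[Tt]here\b(?<!"[Tt]here)')

first = LOOKBEHIND_FIRST.findall(TEXT)
last = LOOKBEHIND_LAST.findall(TEXT)

print(first)  # ['There', 'there']
print(last)   # ['There', 'there']
```

The quoted "there" is rejected by the lookbehind in both versions, and "thereby" is rejected by \b.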
It's easier to show in an example than explain, I think. Let's take this regex:
(?<=\d)(?=(.)\1)(?!p)\w(?<!q)
What this means is:
(?<=\d) - make sure what comes before the match position is a digit.
(?=(.)\1) - make sure whatever character we match at this (same) position is followed by a copy of itself (through the backreference).
(?!p) - make sure what follows is not a p.
\w - match a letter, digit or underscore. Note that this is the first time we actually match and consume the character.
(?<!q) - make sure what we matched so far doesn't end with a q.
All this will match strings like abc5ddx or 9xx but not 5d or 6qq or asd6pp or add. Note that each assertion works independently. It just stops, looks around, and if all is well, allows the matching to continue.
Note also that in many implementations, lookbehinds have the limitation of being fixed-length: you can't use repetition/optionality operators like ?, *, and + in them. (This is not universal; .NET, for one, allows variable-length lookbehinds.) This is because to match a pattern we need a starting point - otherwise we'd have to try matching each lookbehind from every point in the string.
A sample run of this regex on the string a3b5ddx is as follows:
Text cursor position: 0.
Try to match the first lookbehind at position -1 (since \d always matches 1 character). We can't match at negative indices, so fail and advance the cursor.
Text cursor position: 1.
Try to match the first lookbehind at position 0. a does not match \d so fail and advance the cursor again.
Text cursor position: 2.
Try to match the first lookbehind at position 1. 3 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 2. b matches (.) and is captured. 5 does not match \1 (which is the captured b). Therefore, fail and advance the cursor.
Text cursor position: 3.
Try to match the first lookbehind at position 2. b does not match \d so fail and advance the cursor again.
Text cursor position: 4.
Try to match the first lookbehind at position 3. 5 does match \d so keep the cursor intact and continue matching.
Try to match the first lookahead at position 4. d matches (.) and is captured. The second d does match \1 (which is the first captured d). Allow the matching to continue from where we left off.
Try to match the second lookahead. d at position 4 does not match p, and since this is a negative lookahead, that's what we want; allow the matching to continue.
Try to match \w at position 4. d matches. Advance cursor since we have consumed a character and continue. Also mark this as the start of the match.
Text cursor position: 5.
Try to match the second lookbehind at position 4 (since q always matches 1 character). d does not match q which is what we want from a negative lookbehind.
Realize that we're at the end of the regex and report success by returning the substring from the start of the match to the current position (4 to 5), which is d.
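The outcome of that run can be checked with a few lines of Python (the first_match helper is just for illustration); the other strings are the ones listed earlier as matching or not matching:

```python
import re

PATTERN = re.compile(r"(?<=\d)(?=(.)\1)(?!p)\w(?<!q)")

def first_match(text):
    """Return (start, end, matched text) for the first match, or None."""
    m = PATTERN.search(text)
    return (m.start(), m.end(), m.group()) if m else None

print(first_match("a3b5ddx"))  # (4, 5, 'd')
print(first_match("abc5ddx"))  # (4, 5, 'd')
print(first_match("9xx"))      # (1, 2, 'x')
for text in ("5d", "6qq", "asd6pp", "add"):
    print(text, first_match(text))  # None for each of these
```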
1(?=ABC) means - look for 1, and match (but don't capture) ABC after it.
(?<=ABC)1 means - match (but don't capture) ABC before the current location, and continue to match 1.
So, normally, you'll place the lookahead after the expression and the lookbehind before it.
When we place a lookbehind after the expression, we're rechecking the string we've already matched. This is common when you have complex conditions (you can think about it as the AND of regexes). For example, take a look at this recent answer by Daniel Brückner:
.&.(?<! & )
First, you capture an ampersand between two characters. Next, you check that they were not both spaces (\S&\S would not work here, since the OP wanted to capture 1&_ as well).
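A short Python comparison, using a made-up sample string, shows the difference between the trailing lookbehind and the stricter \S&\S:

```python
import re

TEXT = "a & b, 1&2, 1& x"

# Ampersand between any two characters, unless both are spaces.
with_lookbehind = re.findall(r".&.(?<! & )", TEXT)
# Ampersand between two non-space characters only.
non_space_only = re.findall(r"\S&\S", TEXT)

print(with_lookbehind)  # ['1&2', '1& ']
print(non_space_only)   # ['1&2']
```

The " & " between a and b is rejected by both, but only the lookbehind version keeps the ampersand that has a space on one side.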

Email-similar regex catastrophic backtracking

I'd like to match something which may be called the beginning of an e-mail address, i.e.
1 character (any letter of the alphabet or digit)
0 or 1 dot
1 or more characters
The repetition of {2nd and 3rd point} zero or more times
# character
The regex I've been trying to apply on Regex101 is \w(\.?\w+)*#.
I am getting the error Catastrophic backtracking. What am I doing wrong? Is the regex correct?
Catastrophic backtracking typically appears with nested quantifiers when the inner group contains at least one optional subpattern, so that the quantified group can match the same text as the subpattern before the outer group, and the outer group is not at the end of the pattern.
Your regex causes the issue precisely because (\.?\w+)* is not at the end and contains an optional \.?, so whenever there is no dot the expression effectively reduces to \w(\w+)*#.
For example: aaa.aaaaaa.a.aa.aa but not aaa..aaaa.a
What you need is
^\w+(?:\.\w+)*#
See the regex demo
^ - start of string (to avoid partial matches)
\w+ - 1 or more word chars
(?:\.\w+)* - zero or more sequences of:
\. - a literal dot
\w+ - 1 or more word chars
# - a literal # char.
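A minimal Python check of that pattern against the two example strings (with a # appended, since the pattern requires one; the matches helper and the last test string are only for illustration):

```python
import re

SAFE = re.compile(r"^\w+(?:\.\w+)*#")

def matches(text):
    """Return the matched prefix, or None if the pattern does not match."""
    m = SAFE.match(text)
    return m.group() if m else None

print(matches("aaa.aaaaaa.a.aa.aa#"))  # aaa.aaaaaa.a.aa.aa#
print(matches("aaa..aaaa.a#"))         # None (two adjacent dots)
print(matches("john.doe#example"))     # john.doe#
```

Because every repetition of the group must start with a literal dot, there is only one way to split the input between \w+ and the group, so there is nothing to backtrack over.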
The problem
"Catastrophic backtracking" occurs when a part of the string could match a part of the regex in many different ways, so the engine needs to repeatedly retry to determine whether or not the string actually matches. A simple case: the regex a+a+b, to match two or more a followed by one b. If you were to run that on aaaaaaaaaaa, the problem arises: first, the first a+ matches everything, and it fails at the second a+. Then it tries with the first a+ matching all but one a, and the second a+ matches one a (this is "backtracking"), and then it fails on the b. But regexes aren't "smart" enough to know that they could stop there, so the engine has to keep going in that pattern until it has tried every split of giving some of the a's to the first a+ and some to the second. Some regex engines will realize they're getting stuck like this, and quit after enough steps, with the error you saw.
For your specific pattern: what you have there matches any nonzero quantity of letters or digits, mixed with any quantity of . where the . cannot be first, followed by a #. The only additional limit is that there can't be two adjacent dots. Effectively, this is the same case as my example: the * applied to a section containing a + acts like multiple duplicates of that +-ed section.
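To get a feel for the numbers, here is a toy Python model of the a+a+b example. It is not how any real engine is implemented: the count_split_attempts function is hypothetical, it only counts the splits tried from the start of the string, and a real engine would then also retry from later starting positions.

```python
def count_split_attempts(text):
    """Count the (first a+, second a+) splits tried from position 0.

    Returns (attempts, matched). Mirrors greedy backtracking: each a+
    starts by taking as much as it can, then gives back one 'a' at a time.
    """
    run = len(text) - len(text.lstrip("a"))  # length of the leading run of 'a'
    attempts = 0
    for first in range(run, 0, -1):               # first a+ shrinks step by step
        for second in range(run - first, 0, -1):  # second a+ takes the rest, then shrinks
            attempts += 1
            end = first + second
            if text[end:end + 1] == "b":          # is the next character the required b?
                return attempts, True
    return attempts, False

print(count_split_attempts("a" * 11))  # (55, False)
print(count_split_attempts("aaab"))    # (1, True)
```

With two quantifiers the count grows quadratically; a * around a + behaves like arbitrarily many such quantifiers in a row, which is where the explosion comes from.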
Atomic grouping
You could try something with atomic grouping. That basically says "once you've found any match for this, don't backtrack into it". After all, if you've found some amount of \w, it's not going to contain a \. and there's no need to keep rechecking that - dots are not letters or digits, and neither of those is a #.
In this case, the result would be \w(?>\.?\w+)*#. Note that not all regex engines support atomic grouping, though the one you linked does. If the string consists only of a match, nothing will change - if it's not a match, or contains non-matches, the process will take fewer steps. Using @eddiem's example from the comments, it finds two matches in 166311 steps with your original, but only takes 623 steps with atomic grouping added.
Possessive quantifiers
Another option would be a possessive quantifier - \w(\.?\w+)*+# means roughly the same thing. *+, specifically, is "whatever the star matches, don't backtrack inside it". In the above case, it matches in 558 steps - but it has a slightly different meaning, in that it treats all the repeats together as one atomic value, instead of as several distinct atomic values. I don't think there's a difference in this case, but there might be in some cases. Again, not supported by all regex engines.
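If you want to try both variants outside Regex101, here is a hedged Python sketch. Python's re module only accepts atomic groups and possessive quantifiers from version 3.11 on, so the compile_if_supported helper (a name invented here) returns None on older versions. The sample input is mine, and the sketch only checks that the variants agree with the original on a string that matches; it does not measure step counts.

```python
import re

ORIGINAL = re.compile(r"\w(\.?\w+)*#")

def compile_if_supported(pattern):
    """Compile the pattern, or return None if this Python rejects the syntax."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

ATOMIC = compile_if_supported(r"\w(?>\.?\w+)*#")     # atomic group
POSSESSIVE = compile_if_supported(r"\w(\.?\w+)*+#")  # possessive quantifier

def matched_text(pattern, text):
    """Return the text of the first match, or None."""
    m = pattern.search(text)
    return m.group() if m else None

SAMPLE = "john.doe#example"
expected = matched_text(ORIGINAL, SAMPLE)
print(expected)  # john.doe#

for name, variant in (("atomic", ATOMIC), ("possessive", POSSESSIVE)):
    if variant is None:
        print(name, "is not supported by this Python version")
    else:
        print(name, matched_text(variant, SAMPLE) == expected)
```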

How does \w+ select whole word?

In a simple regular expression, I understand that
\w
gives a single word character; however, I do not get how adding a plus (+) like so:
\w+
selects the whole word. In my mind, the plus just means one or more of the word character, so I do not understand how it would expand out to whole words.
Similar to how [0-9]+ means one or more digits, where each digit may be different, \w+ means one or more word characters, again where each character may be different. In this normal "greedy" mode it keeps on going until it can't find any more. (You can also make it non-greedy, finding as few as possible while still allowing the regex to match, via \w+? in some regex flavors.)
If you wanted what you expect, to require the same character repeatedly, you'd need to use back-references:
(\w)\1* - Find any word character, capture it, and then find zero or more of that same character.
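For instance, in Python (the input string is made up), the back-reference version splits a string into runs of the same character:

```python
import re

# finditer is used because findall would return only the captured group.
runs = [m.group() for m in re.finditer(r"(\w)\1*", "aaabbc xx")]
print(runs)  # ['aaa', 'bb', 'c', 'xx']
```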
One character at a time example
With the regex \w+ and the input string Hello World the regex will start at the beginning and say to itself, "Is the next character matched by \w? Yes, so we add it to the result and then move forward one character." Because of the + modifier, after doing this once it keeps on doing it, one step at a time, until it cannot find any more. At this point it moves on to the next part of the regular expression (if there is one) or it stops. With just \w+ this captures all of Hello but not the space or World.
A note on Backtracking
The default + modifier enables "backtracking". This is a (sometimes-expensive) feature of regex that allows you to express your desire simply while giving the best chance of succeeding. For example, if your regex was \w+l and your input string was Hello World, the regex engine would capture all of Hello, and then say "Oh dear, now I need to find an l. There isn't one after the o...maybe I went too far?" It will back up until it has captured Hell and see if there is an l next (there isn't), and then back up again to just Hel and see if there is an l next (there is). The end result will be capturing just the string Hell.
Even more interesting is the case of the regex \w+r and the input string "Hello World". In this case the engine will capture all of Hello and see if there is an r following it. Since it does not find one, it backtracks one character at a time, until it finds out that H isn't followed by an r. At this point it says "Maybe starting with the H wasn't a good idea" and goes forward in the string. Eventually it finds World, then backtracks to capture just Wo and finds that there is, finally, the r it needs. At this point it returns Wor as the match.
Adding the + makes it match 1 or more of the preceding token.
It's called a greedy match, and it will match as many characters as possible before moving on to satisfy the next token.
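Both backtracking examples are easy to reproduce in Python (the first_match helper is only for illustration):

```python
import re

TEXT = "Hello World"

def first_match(pattern, text=TEXT):
    """Return the text of the first match, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r"\w+"))   # Hello
print(first_match(r"\w+l"))  # Hell
print(first_match(r"\w+r"))  # Wor
```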
http://regexr.com is a great tool for using regex and it also explains what the operators do.
The + is a greedy quantifier. It means that it will match as many characters as possible, even if there are "lesser" matches that would be valid.
In the string Hello world, \w+ matches Hello and world.
Appending a ? to it makes it non-greedy and it will be satisfied with the minimal valid match.
In the string Hello world, \w+? matches every letter separately.
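The same comparison in Python, as a quick sketch:

```python
import re

TEXT = "Hello world"

greedy = re.findall(r"\w+", TEXT)
lazy = re.findall(r"\w+?", TEXT)

print(greedy)  # ['Hello', 'world']
print(lazy)    # ['H', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
```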

Does the greediness of a regex matter after all needed matches have been used?

The title pretty much says it all.
I have a regex that I need to match the names of virtual machines from an array.
The regex looks like this:
/^(?<id> \d+)\s+(?<name> .+?)\s+\[.+\]/mx
After the last capture group is matched I have no need for the leftovers other than using them to stop the match at the correct place so that all characters in the capture group are correctly matched. Does it matter how greedy the leftovers are if they are not being used?
Here is an example of the string I am matching, this is before the match.
432 TEST Box åäö!"''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ-_''_+Iw.vmx slesGuest vmx-04
Here is an example of the string I am matching, this is after the match.
432 TEST Box åäö!"''*#
Like I ask above, if I only need the first 2 capture groups does it matter how greedy the uncaptured part at the end is?
There would be no difference between \s+ and \s+? as long as the preceding quantifier .+? remains lazy; it will always match at least one space and expand as needed until the following [.
I first said that there might be a difference between \[.+\] and \[.+?\] if more than one pair of data items can occur on the same line. The former would match too much in that case. But I just noticed that you've anchored your regex to the start of the line. So no, in that case, it doesn't matter either.
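Here is a sketch of that comparison in Python, using the sample line from the question. Python spells named groups (?P<name>...) rather than (?<name>...), and the id_and_name helper is invented for the example; the two patterns differ only in the greediness of the trailing bracket part.

```python
import re

NAME = "TEST Box åäö!\"''*#"
LINE = ("432 " + NAME + " [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/"
        "TEST Box +w6XDpMO2IQ-_''_+Iw.vmx slesGuest vmx-04")

FLAGS = re.MULTILINE | re.VERBOSE  # the /mx flags from the question
GREEDY_TAIL = re.compile(r"^(?P<id> \d+)\s+(?P<name> .+?)\s+\[.+\]", FLAGS)
LAZY_TAIL = re.compile(r"^(?P<id> \d+)\s+(?P<name> .+?)\s+\[.+?\]", FLAGS)

def id_and_name(pattern, text=LINE):
    """Return the (id, name) captures, or None if there is no match."""
    m = pattern.search(text)
    return (m.group("id"), m.group("name")) if m else None

print(id_and_name(GREEDY_TAIL) == id_and_name(LAZY_TAIL))  # True
print(id_and_name(LAZY_TAIL)[0])  # 432
```

Both versions capture the same id and name; only the length of the overall match differs.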
