I was testing out some random regexes and came across some weird results. Say we have the regular expression (ab|(ba)*|a)*. It does not match aba, but if I remove the inner star, (ab|(ba)|a)*, or if I switch the ordering of the terms, (a|ab|(ba)*)*, these two cases now match aba. So why is this the case? Is it something to do with ambiguity or the nested *? I know it's a weird test case and the inner * is redundant, but I just want to understand these results. I was using regex101.com to test.
The alternation operator (|) is short-circuiting and will always try to match the left-most possible subpattern until that one fails, at which time it will attempt to match the next one. Only non-overlapping patterns can be matched. An empty-string match causes the current greedy pattern to end, because empty strings can be matched infinitely, and it doesn't make sense to keep doing so, greedy or not. Greedy does not necessarily mean stupid. :)
So in the case of the pattern (ab|(ba)*|a)*, and the string 'aba', it will match 'ab' from the beginning of the string. Since you're using a greedy quantifier, *, on the outermost capture group, the regex will continue trying to make a longer match with the outermost capture group. The match iterator will be at the 3rd character, and it will try to match 'ab', but it will fail. Then, upon realizing that it can potentially match (ba)* an infinite number of times with the empty string, it will end the match (without capturing anything with (ba)* and without attempting to match the last alternative pattern, a) and return the last iteration of the outermost repeated capturing group.
Now if you switch the ordering of the subpatterns linked with the alternation operator like (ab|a|(ba)*)*, that will match the whole string, since the matcher is able to advance the match iterator with a, and then completes the match with a final empty-string match of the 3rd alternative subpattern.
(ab|(ba)|a)* also works because the second alternative can't be matched with the empty string, so as soon as it fails to match ba, it successfully moves on to attempt to match a.
Another similar way to fix it would be to use (ab|(ba)+|a)*. Because (ba)+ needs at least one ba, the second alternative fails at the final a instead of matching the empty string, and the third alternative gets its turn.
A final way to fix it is to use the anchor to the end of the string, commonly represented by $. The pattern (ab|(ba)*|a)*$ is able to correctly fail on matching the second alternative, by realizing that it will never reach the end of the string by doing so. It will still match the second alternative eventually, but only after the match iterator has traversed to the end of the string.
That's why you see only one capture with the string 'aba' from your outermost capture group. The pattern (ba)* will always match from index 2-2 (or any empty substring for that matter), which then ends the current match and prevents the next a from matching, but will not capture anything unless you have an explicit 'ba' in your string that doesn't overlap with any earlier alternatives.
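If you want to reproduce this outside regex101, here is a minimal sketch in Python. It assumes Python's re module treats an empty iteration of the outer group the same way PCRE does (my understanding of CPython's engine, but take it as an assumption), and the helper name matched_text is made up for illustration.

```python
import re

def matched_text(pattern: str, text: str = "aba") -> str:
    """Return whatever the pattern matches at the start of text (hypothetical helper)."""
    m = re.match(pattern, text)
    return m.group(0) if m else ""

# Original pattern: the empty (ba)* iteration ends the outer loop early.
original = matched_text(r"(ab|(ba)*|a)*")

# The variants discussed above, all of which can consume the final 'a'.
reordered = matched_text(r"(ab|a|(ba)*)*")
no_inner_star = matched_text(r"(ab|(ba)|a)*")
inner_plus = matched_text(r"(ab|(ba)+|a)*")
anchored = matched_text(r"(ab|(ba)*|a)*$")

print(repr(original), repr(reordered), repr(no_inner_star),
      repr(inner_plus), repr(anchored))
```

Under that assumption the first call should stop after ab, while the other four should return the whole of aba.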
Your assumption is false: it matches aba, see here.
The point is that there is a difference in "what the regex" prefers to match. If you however force the regex to match from start-to-end, it will match aba completely.
Some more detail: if you use the disjunction pattern (for instance r|s with r and s other regexes), the regex "likes" to select the left regex r over the right regex s. For instance, if the regex says (a|aa)* and the input is aa, one can match the first item twice, or use the second one once. In that case, the regex likes to select the first item twice.
The same holds for repetitions, a regex wants to repeat the item within the Kleene star r* as much as possible.
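A quick way to see that preference is to look at what the group captured on its last iteration. This is only a sketch using Python's re module; the variable names are mine.

```python
import re

# Left alternative preferred: 'aa' is consumed as two one-character iterations.
left_first = re.match(r"(a|aa)*", "aa")

# With the alternatives swapped, a single iteration takes 'aa' in one go.
long_first = re.match(r"(aa|a)*", "aa")

# span(1) reports where the last iteration of the group matched.
print(left_first.span(1), long_first.span(1))
```

Both patterns match the whole input; only the way the iterations are split differs.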
Related
Consider the following test data:
x.foo,x.bar
y.foo,y.bar
yy.foo,yy.bar
x.foo,y.bar
y.foo,x.bar
yy.foo,x.bar
x.foo,yy.bar
yy.foo,y.bar
y.foo,yy.bar
I'm attempting to write a regular expression where the string before .foo and the string before .bar are different from each other. The first three items should not match. The other six should.
This mostly works:
^(.+?)\.foo,(?!\1)(.+?)\.bar$
However, it misses on the last one, because y is in match group 1, and thus yy is not matched in match group 2.
Interactive: https://regex101.com/r/Pv5062/1
How can I modify the negative lookahead pattern such that the last item matches as well?
Inline backreferences do not store the context information, they only keep the text captured. You need to specify the context yourself.
You may add a dot after \1:
^(.+?)\.foo,(?!\1\.)(.+?)\.bar$
                 ^^
Or, even repeat the part after the second (.+?):
^(.+?)\.foo,(?!\1\.bar$)(.+?)\.bar$
Or, if the bar part cannot contain ., you may make it more "generic":
^(.+?)\.foo,(?!\1\.[^.]+$)(.+?)\.bar$
See the regex demo and another regex demo.
The point is: your (?!\1) is not "anchored" and will fail the match in case the text stored in Group 1 appears immediately to the right of the current location regardless of the context. To solve this, you need to provide this context. As the value that can be matched with .+? can contain virtually anything all you can rely on is the "hardcoded" bits after the lookahead.
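To check the difference against the test data from the question, something like the following Python sketch can be used. The names loose and anchored are just labels I picked for the original pattern and the first suggested fix.

```python
import re

lines = [
    "x.foo,x.bar", "y.foo,y.bar", "yy.foo,yy.bar",   # should not match
    "x.foo,y.bar", "y.foo,x.bar", "yy.foo,x.bar",    # should match
    "x.foo,yy.bar", "yy.foo,y.bar", "y.foo,yy.bar",  # should match
]

# Original pattern: the lookahead only looks at the text of group 1.
loose = re.compile(r"^(.+?)\.foo,(?!\1)(.+?)\.bar$")

# Fixed pattern: the lookahead also requires the dot that follows.
anchored = re.compile(r"^(.+?)\.foo,(?!\1\.)(.+?)\.bar$")

loose_hits = [s for s in lines if loose.match(s)]
anchored_hits = [s for s in lines if anchored.match(s)]

print(loose_hits)
print(anchored_hits)
```

The loose version drops y.foo,yy.bar because yy starts with y, while the anchored version keeps all six lines that are supposed to match.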
I'm working with pattern matching in Postgresql 9.4. I run this query:
select regexp_matches('aaabbb', 'a+b+?')
and I expect it to return 'aaab' but instead it returns 'aaabbb'. Shouldn't the b+? atom match only one 'b' since it is not greedy? Is the greediness of the first quantifier setting the greediness for the whole regular expression?
Here is what I've found in postgresql 9.4's documentation:
Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
and
If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1
In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y, and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy. It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1. The sub-expression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match length; so it is forced to match just 1.
Meaning that the greediness of a quantifier is determined by the ones defined prior to it.
I guess you have to use a+?b+? to achieve what you want.
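For comparison only, this is what a Perl-style backtracking engine does with the same patterns. The sketch uses Python's re module, not PostgreSQL, so it shows the behaviour the question expected rather than what regexp_matches returns.

```python
import re

# In Python each quantifier keeps its own greediness,
# so the lazy b+? stops after a single 'b'.
per_quantifier = re.search(r"a+b+?", "aaabbb").group(0)

# The documentation example: here the greedy {1,3} still takes three digits,
# whereas PostgreSQL is documented above as returning just 1.
doc_example = re.search(r"Y*?([0-9]{1,3})", "XY1234Z").group(1)

print(per_quantifier, doc_example)
```

So the surprise comes from PostgreSQL assigning one greediness to the RE as a whole, as the quoted documentation describes.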
I have a string like aaa**b***c***ddd, and I want to get a sequence of matched text of pattern [^*]\*+[^*], which I think should be [a**b, b***c, c***d]. However, when I test this in a text editor like vim or emacs, the second (b***c) is not matched.
aaa**b***c***ddd
  |--|   |---|
  first  third
     |---|
     second, which I think should be matched but is not
How should I modify the regular expression to match the second?
Yes you can; the trick is to put the whole pattern in a capturing group inside a lookahead, which allows overlapping results:
(?=([^*]\*+[^*]))
But you can't use this to do replacements, since this pattern matches nothing. (or perhaps you can, if you can get the capture group length and the current offset)
EDIT:
it seems to be possible to obtain the capture group length in vim with strlen(submatch(1))
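Outside an editor the same trick looks like this. It is a minimal Python sketch using the string from the diagram in the question; the variable names are mine.

```python
import re

text = "aaa**b***c***ddd"

# Plain pattern: each match consumes its last character,
# so the next match cannot start there.
plain = re.findall(r"[^*]\*+[^*]", text)

# Lookahead version: the match itself is empty, the text lives in group 1,
# so overlapping candidates are all reported.
overlapping = re.findall(r"(?=([^*]\*+[^*]))", text)

# Offsets of each captured piece, in case replacements have to be done by hand.
spans = [m.span(1) for m in re.finditer(r"(?=([^*]\*+[^*]))", text)]

print(plain)
print(overlapping)
print(spans)
```

The spans are what you would need for a manual replacement, since the lookahead match itself has zero length.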
@CommuSoft is correct. One way to approach this problem would be to match the whole string against this regex and then, the second time around, match the regex against the substring that starts at (index_of_previous_match + 1) and runs to the end of the string. Hope that is clear.
So if the index of your first match above (a**b) was 2, then the new substring that you match against the regex the second time should run from index 3 to the end of the string. This will give you the two results.
However, Casimir's answer is much simpler.
For example, I want to exclude 'fitting', 'hollow', 'trillion'
but not 'hello' or 'pattern'
I already got the following to work
(.)(.)\2\1
which matches 'hollow' or 'fitting', but I have trouble negating this.
the closest thing I get is
^.(?!(.)(.)\2\1)
which excludes 'fitting' and 'hollow' but not 'trillion'
It's a little different from what you have. Your current regex only checks for the palindromicity (?) starting at the second character. Since you want to check the whole string, you need to change it a little to:
^(?!.*(.)(.)\2\1)
The first anchor will ensure that the check is made only at the beginning (otherwise, the regex can claim a match at the end of the string).
Then the .* within the negative lookahead will enable the check to be done anywhere within the string. If there's any match, fail the entire match.
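Here is a small Python sketch that runs both patterns over the words from the question, to show which ones survive. The list and variable names are only for illustration.

```python
import re

words = ["fitting", "hollow", "trillion", "hello", "pattern"]

# Pattern from the question: the check happens only right after the first character.
second_char_only = re.compile(r"^.(?!(.)(.)\2\1)")

# Pattern from this answer: .* lets the check land anywhere in the string.
whole_string = re.compile(r"^(?!.*(.)(.)\2\1)")

kept_before = [w for w in words if second_char_only.search(w)]
kept_after = [w for w in words if whole_string.search(w)]

print(kept_before)
print(kept_after)
```

With the first pattern trillion slips through because its illi starts at the third character; with the second only hello and pattern are kept.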
It doesn't exclude trillion because you wrote ^. in front of the lookahead, which means exactly one character must come between the start of the string and the position where the check is made. For your first two cases that character is h or f. So if you change this into ^..(?!(.)(.)\2\1) then it will work for trillion.
So in general the regex will be:
^(?!.*(.)(.)\2\1)
    ^^ any number of characters (other than \n)
I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
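The empty match is only what is left when nothing longer works: if your regex was /f(.*)\1/, it would match all of "foo", with the parens capturing the first 'o' and \1 matching the second. In blabl no substring is immediately followed by a copy of itself, so zero characters is the only option.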
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
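The same thing can be seen with Python's re module. This is only a sketch: Python's engine is not the POSIX regexec() from the question, but the empty match comes out the same way here.

```python
import re

# (.*)\1 succeeds by capturing nothing and then repeating nothing.
empty = re.search(r"(.*)\1", "blabl")

# (.+)\1 needs a non-empty piece followed by a copy of itself; blabl has none.
at_least_one = re.search(r"(.+)\1", "blabl")

# In 'foo' the piece 'o' is followed by another 'o'.
doubled = re.search(r"(.+)\1", "foo")

print(empty.span(), at_least_one, doubled.group(0))
```

The first result is a zero-length match at offset 0, which is the "successful but empty" match described in the question.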
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
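Both variants can be tried like this in Python (a sketch with the two example strings from above; the variable names are mine):

```python
import re

# Immediate re-match: the captured text must be followed directly by a copy.
immediate = re.search(r"(.+)\1", "BLABLA")

# Later re-match: anything may sit between the capture and its copy.
later = re.search(r"(.+).*\1", "BLAahemBLA")

# The immediate form finds nothing when other text sits in between.
immediate_on_gap = re.search(r"(.+)\1", "BLAahemBLA")

print(immediate.group(1), later.group(1), immediate_on_gap)
```

In both successful cases group 1 ends up holding BLA.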