Does greediness of first quantifier override greediness of all next quantifiers? - regex

I'm working with pattern matching in Postgresql 9.4. I run this query:
select regexp_matches('aaabbb', 'a+b+?')
and I expect it to return 'aaab' but instead it returns 'aaabbb'. Shouldn't the b+? atom match only one 'b' since it is not greedy? Is the greediness of the first quantifier setting the greediness for the whole regular expression?

Here is what I've found in postgresql 9.4's documentation:
Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
and
If the RE could match more than one substring starting at that point, either the longest possible match or the shortest possible match will be taken, depending on whether the RE is greedy or non-greedy.
An example of what this means:
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
Result: 123
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
Result: 1
In the first case, the RE as a whole is greedy because Y* is greedy. It can match beginning at the Y, and it matches the longest possible string starting there, i.e., Y123. The output is the parenthesized part of that, or 123. In the second case, the RE as a whole is non-greedy because Y*? is non-greedy. It can match beginning at the Y, and it matches the shortest possible string starting there, i.e., Y1. The sub-expression [0-9]{1,3} is greedy but it cannot change the decision as to the overall match length; so it is forced to match just 1.
This means that the greediness of the RE as a whole is determined by its first quantifier, and later quantifiers cannot change the overall match length.
I guess you have to use a+?b+? to achieve what you want.
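For contrast, here is a minimal sketch in Python. It assumes Python's re module, which is a backtracking engine where each quantifier keeps its own greediness, so it does not reproduce PostgreSQL's behaviour. It only shows the result the asker was expecting.

```python
import re

# Python's re is NOT PostgreSQL's ARE engine: every quantifier keeps
# its own greediness instead of inheriting it from the first one.
mixed = re.search(r'a+b+?', 'aaabbb').group()       # 'aaab'
all_lazy = re.search(r'a+?b+?', 'aaabbb').group()   # 'aaab'

# The documentation example, run through Python for comparison.
greedy_y = re.search(r'Y*([0-9]{1,3})', 'XY1234Z').group(1)  # '123'
lazy_y = re.search(r'Y*?([0-9]{1,3})', 'XY1234Z').group(1)   # '123' here, '1' in PostgreSQL

print(mixed, all_lazy, greedy_y, lazy_y)
```

In Python the lazy Y*? does not stop [0-9]{1,3} from being greedy, which is exactly the difference the documentation excerpt describes.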

Related

Capture a dot with postgres regexp

I have these strings :
3 FD160497. 2016 abcd
3 FD160497 2016 abcd
I want to capture "FD", the digits, then the dot if it is present.
I tried this :
SELECT
sqn[1] AS letters,
sqn[2] AS digits,
sqn[3] AS dot
FROM (
SELECT
regexp_matches(string, '.*?(FD)([0-9]{6})(\.)?.*') as sqn
FROM
mytable
) t;
(PostgreSQL 9.5.3)
"dot" column is NULL in both cases, and I really don't know why.
It works well on regex101.
The first lazy pattern made all quantifiers in the current branch lazy, so your pattern became equivalent to
.*?(FD)([0-9]{6})(\.)??.*?
                     ^^  ^
See its demo at regex101.com
See the 9.7.3.1. Regular Expression Details excerpt:
...matching is done in such a way that the branch, or whole RE, matches the longest or shortest possible substring as a whole. Once the length of the entire match is determined, the part of it that matches any particular subexpression is determined on the basis of the greediness attribute of that subexpression, with subexpressions starting earlier in the RE taking priority over ones starting later.
You need to use quantifiers consistently within one branch:
regexp_matches(string, '.*(FD)([0-9]{6})(\.)?.*') as sqn
or
regexp_matches(string, '.*[[:blank:]](FD)([0-9]{6})(\.)?.*') as sqn
See the regex demo
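To see why the pattern "works well on regex101" but not in PostgreSQL, here is a small sketch using Python's re module as a stand-in for a Perl-style engine (an assumption; it is not the PostgreSQL engine). It runs both the original pattern and the all-lazy pattern that PostgreSQL effectively uses.

```python
import re

rows = ['3 FD160497. 2016 abcd', '3 FD160497 2016 abcd']

# Pattern as written in the question: in a Perl-style engine only
# the leading .*? is lazy, (\.)? stays greedy.
original = re.compile(r'.*?(FD)([0-9]{6})(\.)?.*')
# What the pattern behaves like once every quantifier is lazy.
all_lazy = re.compile(r'.*?(FD)([0-9]{6})(\.)??.*?')

original_groups = [original.match(r).groups() for r in rows]
all_lazy_groups = [all_lazy.match(r).groups() for r in rows]

print(original_groups)  # the dot is captured for the first row only
print(all_lazy_groups)  # the dot group is None for both rows
```

The all-lazy variant never captures the dot because matching nothing is always the shortest option for (\.)??.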

Tcl greedy subexpression difference between + and *

I am trying to understand Tcl subexpression matches and "greediness" and am completely stumped as to what's going on. Referencing the example found at http://wiki.tcl.tk/396:
%regexp -inline (.*?)(n+)(.*) ennui
en e n {}
%regexp -inline ^(.*?)(n+)(.*)$ ennui
ennui e nn ui
Notwithstanding the fact that I don't completely understand "nested expressions" (that is what the parentheses indicate, right?) matching, I decided to start small and just try the difference between * and + as greedy operators:
% regexp -inline (.*)(u*)(.*) ennui
ennui ennui {} {}
% regexp -inline (.*)(u+)(.*) ennui
ennui enn u i
If * matches zero or more, and + matches one or more, I don't understand the difference in the output between the two commands. Why do u* and u+ produce two different results on the same string?
I feel like this is an extremely important nuance - that if I can grasp what's going on in this simple pattern match/regex, my life will be made whole. Help!
Thanks in advance.
Regarding the non-greediness. Tcl regular expressions have a quirk: the first quantifier in the expression sets the greediness for the whole expression. (See the "Matching" section of the re_syntax manual page, paying close attention to the word "preference"):
A branch has the same preference as the first quantified atom in it which has a preference.
%regexp -inline (.*?)(n+)(.*) ennui
en e n {}
(.*?) grabs zero or more characters, preferring the shortest match
(n+) grabs one or more n, inheriting the shortest preference
(.*) grabs zero or more characters, inheriting the shortest preference
The first subexpression matches from the first character up to but not including the first n. The 2nd part matches one n. The 3rd part matches zero characters between the first and the second n.
I'm a bit surprised that the first subexpression captured an e instead of capturing zero characters before the first n, but that can be explained by the higher priority of "leftmost" matching to the regex engine:
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string.
The anchored expression's result surprises me too: I would have expected e n nui instead of e nn ui. Adding the $ anchor seems to have discarded the expression's preference for shortest matching.
The reason for the (.*)(u*)(.*) and (.*)(u+)(.*) difference is that the second regex requires at least 1 u.
The ARE regex in Tcl uses backtracking (as most NFAs). With (.*), the engine grabs the whole string from the beginning to end, and starts backtracking to find if it can accommodate for the next subpattern.
In the first expression, u is optional (can be 0 due to *), thus, the greedy .* decides it won't yield any characters. Then, the last .* can also match 0 characters, again, no need to give any characters to that group.
In the second expression, the u is obligatory and must occur at least once. Thus, the engine grabs the whole string with the first .*, then backtracks and finds u. So, it puts the starting sequence into group 1, and matches and captures u with (u+). Since there is only one u, the last (.*) matches and captures the rest of the string.
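The same backtracking behaviour can be reproduced outside Tcl. A minimal sketch, assuming Python's re module (a different engine, but one that gives the same captures for these two all-greedy patterns):

```python
import re

# u* may match nothing, so the first greedy (.*) keeps the whole string.
star_groups = re.search(r'(.*)(u*)(.*)', 'ennui').groups()
# u+ needs at least one u, so (.*) has to give back "ui" and then "u".
plus_groups = re.search(r'(.*)(u+)(.*)', 'ennui').groups()

print(star_groups)  # ('ennui', '', '')
print(plus_groups)  # ('enn', 'u', 'i')
```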
@stribizhev's answer pretty much explains everything. As for your non-greedy version: the question mark at the end tells the engine that it shouldn't consume the whole string, but grab the shortest possible match and continue from there.
(.*?) for "ennui" matches 0 characters and it's ok, since we're not greedy
(n+) for "ennui" match fails, so the engine returns to matching (.*?) again
(.*?) for "ennui" now matches one character e
(n+) for "nnui" matches nn since it's greedy
(.*) for "ui" matches what's left, ui
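The steps above describe a Perl-style backtracking engine where each quantifier keeps its own greediness. A short sketch, assuming Python's re module, follows that trace. Note that Tcl itself returns e n {} for the unanchored pattern, as shown in the question.

```python
import re

# Lazy first group, greedy second and third groups (per-quantifier greediness).
unanchored = re.search(r'(.*?)(n+)(.*)', 'ennui').groups()
anchored = re.search(r'^(.*?)(n+)(.*)$', 'ennui').groups()

print(unanchored)  # ('e', 'nn', 'ui')
print(anchored)    # ('e', 'nn', 'ui')
```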

Can someone explain these regex results

I was testing out some random regex and came across some weird results. Say we have the regular expression (ab|(ba)*|a)*. It does not match aba, but if I remove the inner star, (ab|(ba)|a)*, or if I switch the ordering of the terms, (a|ab|(ba)*)*, these two cases now match aba. So why is this the case? Is it something to do with ambiguity or the nested *? I know it's a weird test case and the inner * is redundant, but I just want to understand these results. I was using regex101.com to test.
The alternation operator (|) is short-circuiting and will always try to match the left-most possible subpattern until that one fails, at which time it will attempt to match the next one. Only non-overlapping patterns can be matched. An empty-string match causes the current greedy pattern to end, because empty strings can be matched infinitely, and it doesn't make sense to keep doing so, greedy or not. Greedy does not necessarily mean stupid. :)
So in the case of the pattern (ab|(ba)*|a)*, and the string 'aba', it will match 'ab' from the beginning of the string. Since you're using a greedy quantifier on the outermost capture group, *, the regex will continue trying to make a longer match with the outermost capture group. The match iterator will be at the 3rd character, and it will try to match 'ab', but it will fail. Then, upon realizing that it can potentially match (ba)* an infinite amount of times with the empty string, it will end the match (without capturing anything with (ba)* and without attempting to match the last alternative pattern, a) and return the last iteration of the outermost repeated capturing group.
Now if you switch the ordering of the subpatterns linked with the alternation operator like (ab|a|(ba)*)*, that will match the whole string, since the matcher is able to advance the match iterator with a, and then completes the match with a final empty-string match of the 3rd alternative subpattern.
(ab|(ba)|a)* also works because the second alternative can't be matched with the empty string, so as soon as it fails to match ba, it successfully moves on to attempt to match a.
Another similar way to fix it would be to use (ab|(ba)+|a)*. Since (ba)+ cannot match the empty string, the second alternative fails at that position and the engine moves on to a.
A final way to fix it is to use the anchor to the end of the string, commonly represented by $. The pattern (ab|(ba)*|a)*$ is able to correctly fail on matching the second alternative, by realizing that it will never reach the end of the string by doing so. It will still match the second alternative eventually, but only after the match iterator has traversed to the end of the string.
That's why you see only one capture with the string 'aba' from your outermost capture group. The pattern (ba)* will always match from index 2-2 (or any empty substring for that matter), which then ends the current match and prevents the next a from matching, but will not capture anything unless you have an explicit 'ba' in your string that doesn't overlap with any earlier alternatives.
Your assumption is false: it matches aba, see here.
The point is that there is a difference in "what the regex" prefers to match. If you however force the regex to match from start-to-end, it will match aba completely.
Some more detail: if you use the disjunction pattern (for instance r|s with r and s other regexes): the regex "likes" to select the left regex r over the right regex s. For instance if the regex says (a|aa)* and the input is aa, one can match the first item twice, or use the second one once. In that case, the regex likes to select the first item twice.
The same holds for repetitions, a regex wants to repeat the item within the Kleene star r* as much as possible.
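Both answers can be checked side by side. This is a sketch assuming Python's re module rather than the regex101 engine used in the question. For these patterns it stops at the same place: re.match reports the preferred prefix match, while re.fullmatch forces the engine to reach the end of the string.

```python
import re

patterns = [
    r'(ab|(ba)*|a)*',   # original: the empty (ba)* ends the repetition early
    r'(ab|(ba)|a)*',    # inner star removed
    r'(a|ab|(ba)*)*',   # alternatives reordered
    r'(ab|(ba)+|a)*',   # inner star replaced with +
]

# What the engine prefers to match when it may stop anywhere.
prefix = {p: re.match(p, 'aba').group(0) for p in patterns}
# Whether the whole string can be matched when forced to.
full = {p: re.fullmatch(p, 'aba') is not None for p in patterns}

for p in patterns:
    print(p, repr(prefix[p]), full[p])
```

Only the original pattern stops at ab when unanchored, yet all four match aba once the end of the string is required.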

Difference (if any) between .+? and .*

I am looking through some old code bases and have come across two regular expression parts that I think are semantically identical. Wondering if the Stack Overflow community can confirm my understanding.
RegEx 1: (.+?) - One or more characters, but optional
RegEx 2: (.*) - Zero or more characters
I keep thinking of different scenarios but can't think of any input where both expressions wouldn't be the same.
(.+?) means match one or more characters, but instead of the default greedy matching (match as much as possible), the ? after the quantifier makes the matching lazy (match as few as possible).
Conceptually, greedy matching will first try the longest possible sequence that can be formed by the pattern inside, then gradually reduce the length of the sequence as the engine backtracks. Lazy matching will first try the shortest possible sequence that can be formed by the pattern inside, then gradually increase the length of the sequence as the engine backtracks.
Therefore, (.+?) and (.*) are completely different. Given a string "abc", the pattern (.+?) will match "a" for the first match, while (.*) will match "abc" for the first match.
When you correct the pattern to the intended meaning: ((?:.+)?), it is exactly the same as (.*) in behavior. Since quantifiers are greedy by default, ((?:.+)?) will first try the case of .+ before attempting the case of empty string. And .+ will try the longest sequence before the 1 character sequence. Therefore, the effect of ((?:.+)?) is exactly the same as (.*): it will find the longest sequence and backtrack gradually to the case of empty string.
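The "abc" example can be run directly. A minimal sketch, assuming Python's re module:

```python
import re

text = 'abc'

lazy_plus = re.match(r'(.+?)', text).group(1)          # lazy: as few as possible
greedy_star = re.match(r'(.*)', text).group(1)         # greedy: as many as possible
optional_plus = re.match(r'((?:.+)?)', text).group(1)  # "one or more, but optional"

print(lazy_plus, greedy_star, optional_plus)  # a abc abc
```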
First,
. is any character
Next
* is zero or more
+ is one or more
? is one or zero
You're thinking that .+? is one or more of any character and 0 or 1 of them I'm guessing?
You're missing this:
Lazy modifier
*? is zero or more getting as few as possible
+? is one or more getting as few as possible
See here for further discussion
Greedy vs. Reluctant vs. Possessive Quantifiers

What is the difference between .*? and .* regular expressions?

I'm trying to split up a string into two parts using regex. The string is formatted as follows:
text to extract<number>
I've been using (.*?)< and <(.*?)> which work fine but after reading into regex a little, I've just started to wonder why I need the ? in the expressions. I've only done it like that after finding them through this site so I'm not exactly sure what the difference is.
On greedy vs non-greedy
Repetition in regex by default is greedy: they try to match as many reps as possible, and when this doesn't work and they have to backtrack, they try to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition would match as many reps as possible.
The ? after a repetition quantifier changes this behavior to non-greedy, also called reluctant (in e.g. Java) (and sometimes "lazy"). In contrast, this repetition will first try to match as few reps as possible, and when this doesn't work and it has to backtrack, it starts matching one more rep at a time. As a result, when a match finally happens, a reluctant repetition would match as few reps as possible.
References
regular-expressions.info/Repetition - Laziness instead of Greediness
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:
eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then taking more . as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy
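The two match lists can be reproduced with a few lines. This sketch assumes Python's re module in place of rubular.com:

```python
import re

text = 'eeeAiiZuuuuAoooZeeee'

greedy = re.findall(r'A.*Z', text)       # one long match
reluctant = re.findall(r'A.*?Z', text)   # two short matches

print(greedy)     # ['AiiZuuuuAoooZ']
print(reluctant)  # ['AiiZ', 'AoooZ']
```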
Example: An alternative
In many applications, the two matches in the above input are what is desired, thus a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative, using a negated character class.
The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter whether you use the greedy or reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
References
regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers
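A quick check of the negated character class alternative, again as a sketch assuming Python's re module:

```python
import re

text = 'eeeAiiZuuuuAoooZeeee'

reluctant = re.findall(r'A.*?Z', text)
negated_greedy = re.findall(r'A[^Z]*Z', text)
negated_reluctant = re.findall(r'A[^Z]*?Z', text)

# With [^Z] there is only one way to match, so greediness does not matter.
print(reluctant == negated_greedy == negated_reluctant)  # True
```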
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
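The three results for Example 2 can be verified the same way. A sketch assuming Python's re module in place of the ideone.com demos:

```python
import re

text = 'eeAiiZooAuuZZeeeZZfff'

negated = re.findall(r'A[^Z]*ZZ', text)   # cannot cross a single Z
reluctant = re.findall(r'A.*?ZZ', text)   # stops at the first ZZ
greedy = re.findall(r'A.*ZZ', text)       # runs to the last ZZ

print(negated)    # ['AuuZZ']
print(reluctant)  # ['AiiZooAuuZZ']
print(greedy)     # ['AiiZooAuuZZeeeZZ']
```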
Related topics
These are links to questions and answers on stackoverflow that cover some topics that may be of interest.
One greedy repetition can outgreed another
Regex not being greedy enough
Regular expression: who's greedier
It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
.*? is non-greedy. * will match nothing, but then will try to match extra characters until it matches 1, eventually matching 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
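A small sketch of the same input, assuming Python's re module:

```python
import re

text = '101000000000100'

greedy = re.search(r'1.*1', text).group()   # runs to the last 1
lazy = re.search(r'1.*?1', text).group()    # stops at the next 1

print(greedy)
print(lazy)  # 101
```

The greedy result ends at the last 1 in the input, while the lazy one ends at the second.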
In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).
See Quantifier Cheat Sheet.
Let's say you have:
<a></a>
<(.*)> would match a></a whereas <(.*?)> would match a.
The latter stops at the first > it finds: .*? consumes as few characters as possible before the next part of the expression can match.
The first expression <(.*)> doesn't stop when matching the first >. It will continue until the last match of >.
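The <a></a> case, together with the negated character class suggested in an earlier answer, can be sketched like this (assuming Python's re module):

```python
import re

text = '<a></a>'

greedy = re.search(r'<(.*)>', text).group(1)      # up to the last >
lazy = re.search(r'<(.*?)>', text).group(1)       # up to the first >
negated = re.search(r'<([^>]*)>', text).group(1)  # cannot cross a >

print(greedy)   # a></a
print(lazy)     # a
print(negated)  # a
```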