What is the difference between .*? and .* regular expressions? - regex

I'm trying to split up a string into two parts using regex. The string is formatted as follows:
text to extract<number>
I've been using (.*?)< and <(.*?)> which work fine but after reading into regex a little, I've just started to wonder why I need the ? in the expressions. I've only done it like that after finding them through this site so I'm not exactly sure what the difference is.

On greedy vs non-greedy
Repetition in regex by default is greedy: they try to match as many reps as possible, and when this doesn't work and they have to backtrack, they try to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition would match as many reps as possible.
The ? as a repetition quantifier changes this behavior into non-greedy, also called reluctant (in e.g. Java) (and sometimes "lazy"). In contrast, this repetition will first try to match as few reps as possible, and when this doesn't work and they have to backtrack, they start matching one more rept a time. As a result, when a match finally happens, a reluctant repetition would match as few reps as possible.
References
regular-expressions.info/Repetition - Laziness instead of Greediness
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
\_______________/
A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:
eeeAiiZuuuuAoooZeeee
\______________/
A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
\__________/
A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
\___________/
A.*Z matched
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then taking more . as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
\__/r \___/r r = reluctant
\____g____/ g = greedy
Example: An alternative
In many applications, the two matches in the above input is what is desired, thus a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative, using negated character class.
The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter if you use greedy or reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called possessive quantifier, which doesn't backtrack at all.
References
regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
Here's a visual representation of what they matched:
___n
/ \ n = negated character class
eeAiiZooAuuZZeeeZZfff r = reluctant
\_________/r / g = greedy
\____________/g
Related topics
These are links to questions and answers on stackoverflow that cover some topics that may be of interest.
One greedy repetition can outgreed another
Regex not being greedy enough
Regular expression: who's greedier

It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
.*? is non-greedy. * will match nothing, but then will try to match extra characters until it matches 1, eventually matching 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).
See Quantifier Cheat Sheet.

Let's say you have:
<a></a>
<(.*)> would match a></a where as <(.*?)> would match a.
The latter stops after the first match of >. It checks for one
or 0 matches of .* followed by the next expression.
The first expression <(.*)> doesn't stop when matching the first >. It will continue until the last match of >.

Related

Regex match everything between word and comma [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Regular expression to extract string between a word and dot followed by strings [duplicate]

My regex pattern looks something like
<xxxx location="file path/level1/level2" xxxx some="xxx">
I am only interested in the part in quotes assigned to location. Shouldn't it be as easy as below without the greedy switch?
/.*location="(.*)".*/
Does not seem to work.
You need to make your regular expression lazy/non-greedy, because by default, "(.*)" will match all of "file path/level1/level2" xxx some="xxx".
Instead you can make your dot-star non-greedy, which will make it match as few characters as possible:
/location="(.*?)"/
Adding a ? on a quantifier (?, * or +) makes it non-greedy.
Note: this is only available in regex engines which implement the Perl 5 extensions (Java, Ruby, Python, etc) but not in "traditional" regex engines (including Awk, sed, grep without -P, etc.).
location="(.*)" will match from the " after location= until the " after some="xxx unless you make it non-greedy.
So you either need .*? (i.e. make it non-greedy by adding ?) or better replace .* with [^"]*.
[^"] Matches any character except for a " <quotation-mark>
More generic: [^abc] - Matches any character except for an a, b or c
How about
.*location="([^"]*)".*
This avoids the unlimited search with .* and will match exactly to the first quote.
Use non-greedy matching, if your engine supports it. Add the ? inside the capture.
/location="(.*?)"/
Use of Lazy quantifiers ? with no global flag is the answer.
Eg,
If you had global flag /g then, it would have matched all the lowest length matches as below.
Here's another way.
Here's the one you want. This is lazy [\s\S]*?
The first item:
[\s\S]*?(?:location="[^"]*")[\s\S]* Replace with: $1
Explaination: https://regex101.com/r/ZcqcUm/2
For completeness, this gets the last one. This is greedy [\s\S]*
The last item:[\s\S]*(?:location="([^"]*)")[\s\S]*
Replace with: $1
Explaination: https://regex101.com/r/LXSPDp/3
There's only 1 difference between these two regular expressions and that is the ?
The other answers here fail to spell out a full solution for regex versions which don't support non-greedy matching. The greedy quantifiers (.*?, .+? etc) are a Perl 5 extension which isn't supported in traditional regular expressions.
If your stopping condition is a single character, the solution is easy; instead of
a(.*?)b
you can match
a[^ab]*b
i.e specify a character class which excludes the starting and ending delimiiters.
In the more general case, you can painstakingly construct an expression like
start(|[^e]|e(|[^n]|n(|[^d])))end
to capture a match between start and the first occurrence of end. Notice how the subexpression with nested parentheses spells out a number of alternatives which between them allow e only if it isn't followed by nd and so forth, and also take care to cover the empty string as one alternative which doesn't match whatever is disallowed at that particular point.
Of course, the correct approach in most cases is to use a proper parser for the format you are trying to parse, but sometimes, maybe one isn't available, or maybe the specialized tool you are using is insisting on a regular expression and nothing else.
Because you are using quantified subpattern and as descried in Perl Doc,
By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?" . Note that the meanings don't change, just the
"greediness":
*? //Match 0 or more times, not greedily (minimum matches)
+? //Match 1 or more times, not greedily
Thus, to allow your quantified pattern to make minimum match, follow it by ? :
/location="(.*?)"/
import regex
text = 'ask her to call Mary back when she comes back'
p = r'(?i)(?s)call(.*?)back'
for match in regex.finditer(p, str(text)):
print (match.group(1))
Output:
Mary

Lazy ? in regex

Can someone please illustrate it what is the difference between the following regular expressions?
/b\w+?/g
and
/b\w/g
based on the documentation of regexr for Lazy ?:Makes the preceding quantifier lazy, causing it to match as few characters as possible. By default, quantifiers are greedy, and will match as many characters as possible.
Both regexes will match the same thing (one "alphanumeric" character). The difference between them becomes noticeable only when the regex becomes more complex.
For example,
\b\w\.
will match "A." but not "Mr." because \w matches only one character.
\b\w+?\.
will match both. This hasn't yet to do with the laziness but with the lack of a quantifier in the first regex. But if you compare
\b\w+\d
and
\b\w+?\d
you'll notice an important difference: Both regexes will match the string "R2D2" differently - the first one matches "R2D2", the second one only matches "R2" because the ? tells the \w+ token to match as few characters as possible, so it stops the match after R (even though it could match more. It will only match more if the regex wouldn't match otherwise).

Difference (if any) between .+? and .*

I am looking through some old code bases and have come across two regular expression parts that I think are semantically identical. Wondering it the Stackoverflow community can confirm my understanding.
RegEx 1: (.+?) - One or more characters, but optional
RegEx 2: (.*) - Zero or more characters
I keep thinking of different scenarios but can't think of any input where both expressions wouldn't be the same.
(.+?) means match one or more character, but instead of the default greedy matching (match as much as possible), the ? after the quantifier makes the matching lazy (match as few as possible).
Conceptually, greedy matching will first try the longest possible sequence that can be formed by the pattern inside, then gradually reduce the length of the sequence as the engine backtracks. Lazy matching will first try the shortest possible sequence that can be formed by the pattern inside, then gradually increase the length of the sequence as the engine backtracks.
Therefore, (.+?) and (.*) are completely different. Given a string "abc", the pattern (.+?) will match "a" for the first match, while (.*) will match "abc" for the first match.
When you correct the pattern to the intended meaning: ((?:.+)?), it is exactly the same as (.*) in behavior. Since quantifier is greedy by default, ((?:.+)?) will first try the case of .+ before attempting the case of empty string. And .+ will try the longest sequence before the 1 character sequence. Therefore, the effect of ((?:.+)?) is exactly the same as (.*): it will find the longest sequence and backtrack gradually to the case of empty string.
First,
. is any character
Next
* is zero or more
+ is one or more
? is one or zero
You're thinking that .+? is one or more of any character and 0 or 1 of them I'm guessing?
You're missing this:
Lazy modifier
*? is zero or more getting as few as possible
+? is one or more getting as few as possible
See here for further discussion
Greedy vs. Reluctant vs. Possessive Quantifiers

Hows does the Regular Expression: /.+?/ work?

How would the '.+?' regular expression work? Is the .+ part matching anything written, and the ? part saying it can either be there or not? So, for example, this regular expression would match:
'cat'
'' (ie, nothing written, just the empty string)
The "+?" is not a "+" quantifier followed by a "?" quantifier. Instead the "?" modifies the "+" to perform a "lazy" or "non greedy" match, meaning that the least number of characters that match is already sufficient.
So a "a+?" regex would match just a single "a" in "caaat".
Besides what Hans Kesting already said, a lazy multiplier will do the exact oposite of the normal greedy multipliers: The possible match is kept as small as possible and the rest of the regular expression is tested.
So if you’re having the string aaba and test the regular expression a.*b on it, the internal processing steps would be as follows:
a in a.*b matches aaba
.* in a.*b matches aaba, and since .* is greedy
.* then matches aaba
.* then matches aaba
b in a.*b fails as there is no letter left
backtracking goes one step back and .* will now only match bb in aaba
b in a.*b still fails on aaba
backtracking goes one step back and .* now matches only b in aaba
b in a.*b now matches b in aaba and we’re done.
So the full match is aaba.
If we do the same with a lazy multiplier (a.*?b), the processing will do the oposite, try to match the least possible characters as possible:
a in a.*?b matches aaba
.* in a.*?b matches nothing (* = zero or more repetitions), and since .* is declared as lazy (.*?), the rest of the regular expression is tested
b in a.*?b fails on aaba
backtracking will try to increase the match of .*
.* matches now aaba
b in a.*?b matches aaba and we’re done.
So the full match if aaba.
+? (lazy plus)
Repeats the previous item once or
more. Lazy, so the engine first
matches the previous item only once,
before trying permutations with ever
increasing matches of the preceding
item.
/".+?"/ matches "def" (and "ghi") in abc "def" "ghi" jkl, while /".+"/ matches "def" "ghi".
You can find more info here
There is documentation on how Perl handles these quantifiers perldoc perlre.
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
*? Match 0 or more times, not greedily
+? Match 1 or more times, not greedily
?? Match 0 or 1 time, not greedily
{n}? Match exactly n times, not greedily
{n,}? Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
*+ Match 0 or more times and give nothing back
++ Match 1 or more times and give nothing back
?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
For instance,
'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a 's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent subexpression (?>...) for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
link
inevitably, the regex is going to look for at least one character. I've come across case where an empty string wouldn't pass that test already, it would be better to use .*? or (.*)? instead, sometimes you have to specify the part of the string which may be null in braces before the question mark, it helps. E.g. \d{6}? will yield a wrong result, whereas if I had said (\d{6})? in a string say for example:
preg_match("/shu\.(\d{6})?/", "shu.321456")
this will yield true and so will the string "shu." without any int after the period