I have some text with following pattern
~, and optionally a - then some digits
So I can have (can be part of some larger text)
~7
~-6
~-11534
~-0
for example my text can be:
New Zealand~1 expenditure~-900
Right now I am using this pattern:
[~-]*[0-9]*[0-9]
It seems to does the work but I know [0-9]*[0-9] is greedy matching (0 to unlimited times)
I am wondering if there is a better pattern?
EDIT: Actually, to match your requirements better, I suggest
~-?[0-9]+
This way we specify that ~ required, and it can be followed by - and digits.
The question mark after a quantifier says to take as little as possible, making it non-greedy, but in current example it is unnecessary.
EDIT 2: I noticed rather lately that the digits are not entirely optional, and changed 'zero or more' quantifier * to 'one or more' quantifier +.
EDIT 3: To talk about 'greedy' and 'non-greedy'. A non-greedy algorithm would return as little as possible, and in case of multiple digits at the end of string it would include only the first one into result, which is not what you're looking for.
A bit more about greedy and non-greedy algorithms, thanks to Wiktor Stribiżew for excellently formulated explanation:
The lazy quantifier at the end of the pattern will match 0 (if *? is used) or 1 (if +? is used) chars. That happens because the lazily quantified patterns only grab what they must match first (the minimum number of occurrences) and are skipped so that the subsequent patterns could be checked. Only if there is no match the engine goes back to the lazily quantified subpattern to expand it one more char and retry.
Related
What are these two terms in an understandable way?
Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.
'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.
Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow
Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.
As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me
Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.
From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.
Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples
Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).
Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.
Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.
To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.
try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""
What does this combination of quantifiers *? mean?
Use this as the following example:
([0-9][AB]*?)
It's a non-greedy match. In [AB]*?, the regex looks for as few occurrences of [AB] as needed to make the overall regex match the searched string, whereas the greedy version [AB]* looks for as many occurrences as possible. It is a feature of Perl's regexes, and hence available in PCRE (Perl Compatible Regular Expressions) (see repetition) and other systems that look to Perl for their definition.
The PCRE page gives an example:
The classic example of where [greediness] gives problems is in trying to match comments in C programs. These appear between /* and */ and within the comment, individual * and / characters may appear. An attempt to match C comments by applying the pattern:
/\*.*\*/
to the string
/* first comment */ not comment /* second comment */
fails, because it matches the entire string owing to the greediness of the .* item.
If a quantifier is followed by a question mark, it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern
/\*.*?\*/
does the right thing with the C comments.
Jonathan already explained the difference, but here's an example that might help you understand what's happening here.
Given the string "9AB":
([0-9][AB]*?) matches only "9A" because it stop as soon as "A" matched (lazy)
([0-9][AB]*) matches the whole string ("9AB") because it consumes "A" and successes to match the following "B" (greedy)
Note that the second one will match a digit, followed by zero or more (unlimited) number of "A" or "B"s.
Can someone please illustrate it what is the difference between the following regular expressions?
/b\w+?/g
and
/b\w/g
based on the documentation of regexr for Lazy ?:Makes the preceding quantifier lazy, causing it to match as few characters as possible. By default, quantifiers are greedy, and will match as many characters as possible.
Both regexes will match the same thing (one "alphanumeric" character). The difference between them becomes noticeable only when the regex becomes more complex.
For example,
\b\w\.
will match "A." but not "Mr." because \w matches only one character.
\b\w+?\.
will match both. This hasn't yet to do with the laziness but with the lack of a quantifier in the first regex. But if you compare
\b\w+\d
and
\b\w+?\d
you'll notice an important difference: Both regexes will match the string "R2D2" differently - the first one matches "R2D2", the second one only matches "R2" because the ? tells the \w+ token to match as few characters as possible, so it stops the match after R (even though it could match more. It will only match more if the regex wouldn't match otherwise).
I am looking through some old code bases and have come across two regular expression parts that I think are semantically identical. Wondering it the Stackoverflow community can confirm my understanding.
RegEx 1: (.+?) - One or more characters, but optional
RegEx 2: (.*) - Zero or more characters
I keep thinking of different scenarios but can't think of any input where both expressions wouldn't be the same.
(.+?) means match one or more character, but instead of the default greedy matching (match as much as possible), the ? after the quantifier makes the matching lazy (match as few as possible).
Conceptually, greedy matching will first try the longest possible sequence that can be formed by the pattern inside, then gradually reduce the length of the sequence as the engine backtracks. Lazy matching will first try the shortest possible sequence that can be formed by the pattern inside, then gradually increase the length of the sequence as the engine backtracks.
Therefore, (.+?) and (.*) are completely different. Given a string "abc", the pattern (.+?) will match "a" for the first match, while (.*) will match "abc" for the first match.
When you correct the pattern to the intended meaning: ((?:.+)?), it is exactly the same as (.*) in behavior. Since quantifier is greedy by default, ((?:.+)?) will first try the case of .+ before attempting the case of empty string. And .+ will try the longest sequence before the 1 character sequence. Therefore, the effect of ((?:.+)?) is exactly the same as (.*): it will find the longest sequence and backtrack gradually to the case of empty string.
First,
. is any character
Next
* is zero or more
+ is one or more
? is one or zero
You're thinking that .+? is one or more of any character and 0 or 1 of them I'm guessing?
You're missing this:
Lazy modifier
*? is zero or more getting as few as possible
+? is one or more getting as few as possible
See here for further discussion
Greedy vs. Reluctant vs. Possessive Quantifiers
I'm trying to split up a string into two parts using regex. The string is formatted as follows:
text to extract<number>
I've been using (.*?)< and <(.*?)> which work fine but after reading into regex a little, I've just started to wonder why I need the ? in the expressions. I've only done it like that after finding them through this site so I'm not exactly sure what the difference is.
On greedy vs non-greedy
Repetition in regex by default is greedy: they try to match as many reps as possible, and when this doesn't work and they have to backtrack, they try to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition would match as many reps as possible.
The ? as a repetition quantifier changes this behavior into non-greedy, also called reluctant (in e.g. Java) (and sometimes "lazy"). In contrast, this repetition will first try to match as few reps as possible, and when this doesn't work and they have to backtrack, they start matching one more rept a time. As a result, when a match finally happens, a reluctant repetition would match as few reps as possible.
References
regular-expressions.info/Repetition - Laziness instead of Greediness
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
\_______________/
A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:
eeeAiiZuuuuAoooZeeee
\______________/
A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
\__________/
A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
\___________/
A.*Z matched
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then taking more . as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
\__/r \___/r r = reluctant
\____g____/ g = greedy
Example: An alternative
In many applications, the two matches in the above input is what is desired, thus a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative, using negated character class.
The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter if you use greedy or reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called possessive quantifier, which doesn't backtrack at all.
References
regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
Here's a visual representation of what they matched:
___n
/ \ n = negated character class
eeAiiZooAuuZZeeeZZfff r = reluctant
\_________/r / g = greedy
\____________/g
Related topics
These are links to questions and answers on stackoverflow that cover some topics that may be of interest.
One greedy repetition can outgreed another
Regex not being greedy enough
Regular expression: who's greedier
It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
.*? is non-greedy. * will match nothing, but then will try to match extra characters until it matches 1, eventually matching 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).
See Quantifier Cheat Sheet.
Let's say you have:
<a></a>
<(.*)> would match a></a where as <(.*?)> would match a.
The latter stops after the first match of >. It checks for one
or 0 matches of .* followed by the next expression.
The first expression <(.*)> doesn't stop when matching the first >. It will continue until the last match of >.