One of the HTML input fields in an app I'm working on is being validated with the following regex pattern:
.{5,}+
What is this checking for?
Other fields are being checked with this pattern which I also don't understand:
.+
We can break your pattern down into three parts:
The dot is a wildcard: it matches any character (except newlines by default, unless the /s modifier is set).
{5,} specifies repetition on the dot. It says that the dot must match at least 5 times. If there were a number after the comma, the dot would have to match between 5 and that number of times, but since there's no number, there is no upper limit.
In your first pattern, the + is a possessive quantifier (see below for how + can mean different things in different situations). It tells the regular expression engine that once it has satisfied the previous condition (i.e. .{5,}), it should not try to backtrack.
Your second pattern is simpler. The dot still means the same thing as above (works as a wildcard). However, here the + has a different meaning, and is a repetition operator, meaning that the dot must match 1 or more times (that could also be expressed as .{1,}, as we saw above).
As you can see, + has a different meaning depending on context. When used on its own, it is a repetition operator. However when it follows a different repetition operator (either *, ?, + or {...}) it becomes a possessive quantifier.
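As a rough check of the two patterns, here is a minimal Python sketch. It assumes the validator requires the whole field value to match, which is modelled with re.fullmatch; the helper name is_valid is made up for illustration. The possessive + is left out because, with nothing following .{5,}, it does not change which values are accepted.

```python
import re

def is_valid(pattern: str, value: str) -> bool:
    """Hypothetical helper: True if the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None

# .{5,} needs at least five characters, none of them a newline.
five_plus = [is_valid(r".{5,}", v) for v in ["", "abcd", "abcde", "abcdefgh"]]

# .+ needs at least one character.
one_plus = [is_valid(r".+", v) for v in ["", "a", "abcd"]]

print(five_plus)  # [False, False, True, True]
print(one_plus)   # [False, True, True]
```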
The + after another quantifier ({5,}) marks a possessive match, i.e. once a match is found, do not backtrack*.
For instance, the pattern .{5,}x will match abcdex:
.{5,} matches abcdex.
x matches nothing.
So backtrack .{5,} and let it match abcde.
Now x matches that last x.
But .{5,}+x will not match abcdex:
.{5,}+ matches abcdex.
x matches nothing.
Cannot backtrack the .{5,}+. We have to stop here.
*: Even though the engine cannot backtrack into the possessive quantifier, what it matched can still be discarded as a whole. For instance, matching a?.{5,}+x against abcdex first tries {a? → a, .{5,}+ → bcdex, x → no match}, then discards the whole .{5,}+ match and the a, and restarts with {a? → (empty), .{5,}+ → abcdex, x → no match}. Therefore, we can also say that the + makes the quantifier "atomic".
On the other hand, + alone just means {1,}, i.e. match one or more times.
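The two traces above can be reproduced roughly in Python. Python's re module only accepts possessive quantifiers from version 3.11, so this sketch emulates .{5,}+ with a lookahead plus a backreference, (?=(.{5,}))\1. That workaround is an assumption of this example, not something taken from the answer; the real syntax is tried only when the interpreter supports it.

```python
import re
import sys

text = "abcdex"

# Greedy: .{5,} first grabs "abcdex", then gives back the x so the literal x can match.
greedy = re.search(r".{5,}x", text)

# Emulated possessive: the lookahead captures as much as .{5,} can take and is never
# re-entered, so the backreference \1 consumes all of it and x has nothing left.
emulated = re.search(r"(?=(.{5,}))\1x", text)

# Native possessive syntax, compiled only on Python 3.11 or newer.
native = re.search(r".{5,}+x", text) if sys.version_info >= (3, 11) else None

print(greedy.group())    # abcdex
print(emulated, native)  # None None
```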
Any character, 5 or more times.
"." means any character except a line break.
{m,n} defines a bounded interval: m is the minimum and n is the maximum. If n is omitted, as it is here, there is no upper bound.
"+" means possessive.
.{5,}+ means
Match any single character that is not a line break character
Between 5 and unlimited times; as many times as possible, without giving back (possessive)
.+ is the same thing but it matches between 1 and unlimited times, giving back as needed (greedy).
As I've mentioned many times before, I'm a huge fan of RegexBuddy. Its "Create" mode is excellent for deconstructing regular expressions.
Related
What does the plus symbol in regex mean?
+ can actually have two meanings, depending on context.
Like the other answers mentioned, + usually is a repetition operator, and causes the preceding token to repeat one or more times. a+ would be expressed as aa* in formal language theory, and could also be expressed as a{1,} (match a minimum of 1 times and a maximum of infinite times).
However, + can also make other quantifiers possessive if it follows a repetition operator (i.e. ?+, *+, ++ or {m,n}+). A possessive quantifier is an advanced feature of some regex flavours (PCRE, Java and the JGsoft engine) which tells the engine not to backtrack once a match has been made.
To understand how this works, we need to understand two concepts of regex engines: greediness and backtracking. Greediness means that in general regexes will try to consume as many characters as they can. Let's say our pattern is .* (the dot is a special construct in regexes which means any character [1]; the star means match zero or more times), and our target is aaaaaaaab. The entire string will be consumed, because the entire string is the longest match that satisfies the pattern.
However, let's say we change the pattern to .*b. Now, when the regex engine tries to match against aaaaaaaab, the .* will again consume the entire string. However, since the engine will have reached the end of the string and the pattern is not yet satisfied (the .* consumed everything but the pattern still has to match b afterwards), it will backtrack, one character at a time, and try to match b. The first backtrack will make the .* consume aaaaaaaa, and then b can consume b, and the pattern succeeds.
Possessive quantifiers are also greedy, but as mentioned, once they return a match, the engine can no longer backtrack past that point. So if we change our pattern to .*+b (match any character zero or more times, possessively, followed by a b), and try to match aaaaaaaab, again the .*+ will consume the whole string, but then since it is possessive, backtracking information is discarded, and the b cannot be matched so the pattern fails.
[1] In most engines, the dot will not match a newline character, unless the /s ("singleline" or "dotall") modifier is specified.
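The greedy and backtracking steps described above can be observed with a short Python sketch. The capture group around .* is added here only to show what the star ends up holding, and the "ab\ncd" sample for the newline footnote is made up.

```python
import re

target = "aaaaaaaab"

# .* on its own is greedy and consumes the whole string.
whole = re.match(r".*", target).group()

# In .*b the star first takes everything, then gives back one character so b can match.
star_part = re.match(r"(.*)b", target).group(1)

# By default the dot stops at a newline; re.DOTALL lifts that restriction.
no_flag = re.match(r".*", "ab\ncd").group()
dotall = re.match(r".*", "ab\ncd", re.DOTALL).group()

print(whole)      # aaaaaaaab
print(star_part)  # aaaaaaaa
```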
In most implementations + means "one or more".
In some theoretical writings + is used to mean "or" (most implementations use the | symbol for that).
1 or more of previous expression.
[0-9]+
Would match:
1234567890
In:
I have 1234567890 dollars
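For what it's worth, the same example as a small Python sketch, using re.search to find the first run of digits:

```python
import re

m = re.search(r"[0-9]+", "I have 1234567890 dollars")
found = m.group()  # the run of digits that [0-9]+ matched
span = m.span()    # where that run sits in the sentence

print(found)  # 1234567890
```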
One or more occurrences of the preceding symbol.
E.g. a+ means the letter a one or more times. Thus, a+ matches a, aa, aaaaaa but not an empty string.
If you know what the asterisk (*) means, then you can express (exp)+ as (exp)(exp)*, where (exp) is any regular expression.
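A quick way to convince yourself of that equivalence is to compare both forms on a few inputs. This is only a spot check in Python on made-up samples, not a proof:

```python
import re

samples = ["", "a", "aa", "aaaaaa", "b"]

# a+ and aa* should accept exactly the same strings.
plus = [bool(re.fullmatch(r"a+", s)) for s in samples]
star = [bool(re.fullmatch(r"aa*", s)) for s in samples]

print(plus == star)  # True
```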
A lot depends on where + symbol appears and what the regex flavor is.
In posix-bre and vim (in a non-very magic mode) flavor, + matches a literal + char. E.g. sed 's/+//g' file > newfile removes all + chars in file. If you want to use + as a quantifier here, use \+ (supported in GNU tools), or replace it with \{1,\}, or repeat the quantified pattern and put * (the zero-or-more quantifier) after the second copy (e.g. sed 's/c++*//' removes c followed by one or more + chars).
In posix-ere and other regex flavors, outside a character class ([...]), + acts as a quantifier meaning "one or more, but as many as possible, occurrences of the quantified pattern". E.g. in javascript, s.replace(/\++/g, '-') will replace a string like ++++ with a single -. Note that in NFA regex flavors + has a lazy counterpart, +?, that matches "one or more, but as few as possible, occurrences of the quantified pattern".
Inside a character class, the + char is treated as a literal char, in every regex flavor. [+] always matches a single + literal char. E.g. in c#, Regex.Replace("1+2=3", @"[+]", "-") will result in 1-2=3. Note it is not a good idea to use a single char inside a character class; only use a character class for two or more chars, or for charsets. E.g. [+0-9] matches a + or any ASCII digit char. In php, preg_replace('~[\s+]+~', '-', '1 2+++3') will result in 1-2-3 since the regex matches one or more (due to the last +, which is a quantifier) whitespaces (\s) or plus chars (the + inside the character class).
The + symbol can also be part of a possessive quantifier in some PCRE-like regex flavors (php, ruby, java, boost, icu, etc., but not in .net, javascript, or python re before version 3.11). E.g. C\+++(?!\d) in php PCRE would match C and then one or more + symbols (\+ is a literal + and ++ means one or more occurrences without allowing backtracking into this quantified pattern) not followed by a digit. If there is a digit after the plus chars, the whole match fails. Other examples: a?+ (one or zero a chars), a{1,3}+ (one to three a chars, as many as possible), a{3}+ (= a{3}, three a chars), a*+ (zero or more a chars).
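The JavaScript, C# and PHP replacements mentioned above can be re-created in one language for comparison. This is a sketch with Python's re.sub, which treats + the same way in these three cases:

```python
import re

# \+ is a literal plus, the second + is the "one or more" quantifier.
collapsed = re.sub(r"\++", "-", "++++")

# Inside a character class the + is always literal.
in_class = re.sub(r"[+]", "-", "1+2=3")

# [\s+]+ : one or more whitespace or plus characters.
mixed = re.sub(r"[\s+]+", "-", "1 2+++3")

print(collapsed, in_class, mixed)  # - 1-2=3 1-2-3
```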
I'm trying to split up a string into two parts using regex. The string is formatted as follows:
text to extract<number>
I've been using (.*?)< and <(.*?)> which work fine but after reading into regex a little, I've just started to wonder why I need the ? in the expressions. I've only done it like that after finding them through this site so I'm not exactly sure what the difference is.
On greedy vs non-greedy
Repetition in regex is greedy by default: a quantifier tries to match as many reps as possible, and when this doesn't work and it has to backtrack, it tries to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition will have matched as many reps as possible.
The ? placed after a repetition quantifier changes this behavior to non-greedy, also called reluctant (in e.g. Java) and sometimes "lazy". In contrast, this repetition first tries to match as few reps as possible, and when this doesn't work and it has to backtrack, it matches one more rep at a time. As a result, when a match finally happens, a reluctant repetition will have matched as few reps as possible.
References
regular-expressions.info/Repetition - Laziness instead of Greediness
Example 1: From A to Z
Let's compare these two patterns: A.*Z and A.*?Z.
Given the following input:
eeeAiiZuuuuAoooZeeee
The patterns yield the following matches:
A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)
Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.
eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match
Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:
eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match
This happens a few more times, until finally we come to this:
eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match
Now Z can match, so the overall pattern matches:
eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched
By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then takes more . as necessary. This explains why it finds two matches in the input.
Here's a visual representation of what the two patterns matched:
eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy
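The same comparison can be run in Python as a quick sketch; re.findall returns every non-overlapping match, which makes the one-match versus two-match difference easy to see:

```python
import re

text = "eeeAiiZuuuuAoooZeeee"

greedy = re.findall(r"A.*Z", text)      # .* takes as much as it can
reluctant = re.findall(r"A.*?Z", text)  # .*? takes as little as it can

print(greedy)     # ['AiiZuuuuAoooZ']
print(reluctant)  # ['AiiZ', 'AoooZ']
```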
Example: An alternative
In many applications, the two matches in the above input are what is desired, thus a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative, using a negated character class.
The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.
The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter whether you use the greedy or the reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called a possessive quantifier, which doesn't backtrack at all.
References
regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers
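A small Python sketch of the negated character class on the same input. It also checks the claim that greedy and reluctant variants of [^Z] find the same matches; it does not measure the performance difference.

```python
import re

text = "eeeAiiZuuuuAoooZeeee"

negated_greedy = re.findall(r"A[^Z]*Z", text)
negated_reluctant = re.findall(r"A[^Z]*?Z", text)
dot_reluctant = re.findall(r"A.*?Z", text)

# All three find the same two matches on this input.
print(negated_greedy)  # ['AiiZ', 'AoooZ']
```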
Example 2: From A to ZZ
This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.
eeAiiZooAuuZZeeeZZfff
These are the matches for the above input:
A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)
Here's a visual representation of what they matched:
         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g
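The three results for this input can be checked with a short Python sketch:

```python
import re

text = "eeAiiZooAuuZZeeeZZfff"

negated = re.findall(r"A[^Z]*ZZ", text)  # cannot step over the single Z after Aii
reluctant = re.findall(r"A.*?ZZ", text)  # stops at the first ZZ after the first A
greedy = re.findall(r"A.*ZZ", text)      # runs on to the last ZZ

print(negated)    # ['AuuZZ']
print(reluctant)  # ['AiiZooAuuZZ']
print(greedy)     # ['AiiZooAuuZZeeeZZ']
```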
Related topics
These are links to questions and answers on stackoverflow that cover some topics that may be of interest.
One greedy repetition can outgreed another
Regex not being greedy enough
Regular expression: who's greedier
It is the difference between greedy and non-greedy quantifiers.
Consider the input 101000000000100.
Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
With 1.*?1, the .*? is non-greedy: it first matches nothing, then takes one extra character at a time until the final 1 can match, eventually matching 101.
All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.
In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).
See Quantifier Cheat Sheet.
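Here is a minimal Python sketch of the points in this answer. The sample string "text to extract<42>" is made up to fit the format given in the question.

```python
import re

text = "101000000000100"

greedy = re.search(r"1.*1", text).group()  # runs to the last 1 in the input
lazy = re.search(r"1.*?1", text).group()   # stops at the first possible 1

# Negated character class for the question's format (sample input is hypothetical).
number = re.search(r"<([^>]*)>", "text to extract<42>").group(1)

print(lazy)    # 101
print(number)  # 42
```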
Let's say you have:
<a></a>
<(.*)> would match a></a whereas <(.*?)> would match a.
The latter stops at the first >. The ? makes .* lazy, so it matches as few characters as possible before the next part of the expression (the >) can match.
The first expression <(.*)> doesn't stop when matching the first >. It will continue until the last match of >.
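The two captures can be confirmed with a short Python sketch:

```python
import re

html = "<a></a>"

greedy_capture = re.search(r"<(.*)>", html).group(1)  # runs to the last >
lazy_capture = re.search(r"<(.*?)>", html).group(1)   # stops at the first >

print(greedy_capture)  # a></a
print(lazy_capture)    # a
```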
How would the '.+?' regular expression work? Is the .+ part matching anything written, and the ? part saying it can either be there or not? So, for example, this regular expression would match:
'cat'
'' (ie, nothing written, just the empty string)
The "+?" is not a "+" quantifier followed by a "?" quantifier. Instead the "?" modifies the "+" to perform a "lazy" or "non greedy" match, meaning that the least number of characters that match is already sufficient.
So a "a+?" regex would match just a single "a" in "caaat".
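A small Python sketch of this, which also answers the original question: .+? still needs at least one character, so it does not match the empty string.

```python
import re

single = re.search(r"a+?", "caaat").group()  # lazy: one a is already enough
greedy = re.search(r"a+", "caaat").group()   # greedy: takes all three

empty = re.fullmatch(r".+?", "")             # no character to match, so no match
cat = re.fullmatch(r".+?", "cat").group()    # forced to expand to the whole string

print(single, greedy)  # a aaa
print(empty, cat)      # None cat
```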
Besides what Hans Kesting already said, a lazy multiplier will do the exact opposite of the normal greedy multipliers: the possible match is kept as small as possible and the rest of the regular expression is tested.
So if you have the string aaba and test the regular expression a.*b on it, the internal processing steps would be as follows:
a in a.*b matches the first a of aaba
.* in a.*b matches the second a of aaba, and since .* is greedy
.* then matches ab
.* then matches aba (the rest of the string)
b in a.*b fails as there is no letter left
backtracking goes one step back and .* will now only match ab
b in a.*b still fails on the final a of aaba
backtracking goes one step back and .* now matches only the second a
b in a.*b now matches the b in aaba and we're done.
So the full match is aab, the first three characters of aaba.
If we do the same with a lazy multiplier (a.*?b), the processing will do the opposite and try to match as few characters as possible:
a in a.*?b matches the first a of aaba
.* in a.*?b matches nothing (* = zero or more repetitions), and since .* is declared as lazy (.*?), the rest of the regular expression is tested
b in a.*?b fails on the second a of aaba
backtracking will try to increase the match of .*
.* now matches the second a of aaba
b in a.*?b matches the b in aaba and we're done.
So the full match is again aab.
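On aaba both walkthroughs end on the same three characters, aab, because the string contains only one b. A short Python sketch shows that result, plus a second input (aabab, made up for this example) where the two patterns diverge:

```python
import re

# Only one b in the string: greedy and lazy agree.
greedy_one_b = re.search(r"a.*b", "aaba").group()
lazy_one_b = re.search(r"a.*?b", "aaba").group()

# Two b's: greedy runs to the last one, lazy stops at the first.
greedy_two_b = re.search(r"a.*b", "aabab").group()
lazy_two_b = re.search(r"a.*?b", "aabab").group()

print(greedy_one_b, lazy_one_b)  # aab aab
print(greedy_two_b, lazy_two_b)  # aabab aab
```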
+? (lazy plus): Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.
/".+?"/ matches "def" (and "ghi") in abc "def" "ghi" jkl, while /".+"/ matches "def" "ghi".
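The quoted-string example can be checked with a Python sketch along these lines:

```python
import re

text = 'abc "def" "ghi" jkl'

lazy = re.findall(r'".+?"', text)   # each quoted word separately
greedy = re.findall(r'".+"', text)  # from the first quote to the last one

print(lazy)    # ['"def"', '"ghi"']
print(greedy)  # ['"def" "ghi"']
```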
You can find more info here
There is documentation on how Perl handles these quantifiers in perldoc perlre.
By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":
*? Match 0 or more times, not greedily
+? Match 1 or more times, not greedily
?? Match 0 or 1 time, not greedily
{n}? Match exactly n times, not greedily
{n,}? Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.
*+ Match 0 or more times and give nothing back
++ Match 1 or more times and give nothing back
?+ Match 0 or 1 time and give nothing back
{n}+ Match exactly n times and give nothing back (redundant)
{n,}+ Match at least n times and give nothing back
{n,m}+ Match at least n but not more than m times and give nothing back
For instance,
'aaaa' =~ /a++a/
will never match, as the a++ will gobble up all the a 's in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not help. See the independent subexpression (?>...) for more details; possessive quantifiers are just syntactic sugar for that construct. For instance the above example could also be written as follows:
/"(?>(?:(?>[^"\\]+)|\\.)*)"/
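For readers outside Perl, here is a hedged Python sketch of the double-quoted string pattern. Python's re module accepts possessive quantifiers only from version 3.11, so the sketch falls back to an ordinary, non-possessive pattern on older interpreters. That fallback and the sample strings are assumptions of this example and are not part of the perlre text.

```python
import re
import sys

# Possessive form from the quoted documentation (Python 3.11+ only).
POSSESSIVE = r'"(?:[^"\\]++|\\.)*+"'
# Fallback for older interpreters: matches the same strings with ordinary quantifiers.
PLAIN = r'"(?:[^"\\]|\\.)*"'

pattern = re.compile(POSSESSIVE if sys.version_info >= (3, 11) else PLAIN)

text = r'say "a \"quoted\" word" now'
quoted = pattern.search(text).group()

# An unterminated string has no closing quote, so nothing matches.
unterminated = pattern.search(r'say "no end')

print(quoted)  # "a \"quoted\" word"
```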
Inevitably, the regex is going to look for at least one character, so an empty string won't pass that test. If the empty string should be allowed, it would be better to use .*? or (.*)? instead. Sometimes you have to put the part of the string that may be absent in parentheses before the question mark. E.g. \d{6}? will yield a wrong result, because there the ? only makes the {6} quantifier lazy and the six digits are still required, whereas (\d{6})? makes them optional, for example:
preg_match("/shu\.(\d{6})?/", "shu.321456")
This returns 1 (a match), and so does the string "shu." without any digits after the period.
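The same distinction, sketched in Python for comparison with the PHP call above (Python treats {6}? as a lazy quantifier in the same way, so the behaviour carries over):

```python
import re

# (\d{6})? : the whole group is optional.
with_digits = re.search(r"shu\.(\d{6})?", "shu.321456")
without_digits = re.search(r"shu\.(\d{6})?", "shu.")

# \d{6}? : the ? only makes {6} lazy, so six digits are still required.
lazy_only = re.fullmatch(r"shu\.\d{6}?", "shu.")

print(with_digits.group(1))     # 321456
print(without_digits.group(1))  # None
print(lazy_only)                # None
```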