What is the meaning of + in a regex?

What does the plus symbol in regex mean?

+ can actually have two meanings, depending on context.
As the other answers mention, + is usually a repetition operator that causes the preceding token to repeat one or more times. a+ would be expressed as aa* in formal language theory, and could also be written as a{1,} (match a minimum of 1 time, with no upper limit).
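As a quick check of that equivalence, here is a minimal sketch in Python (the sample strings and the accepts helper are made up for illustration). It runs the three spellings against the same inputs and compares which ones are accepted as a whole.

```python
import re

# Three spellings of "one or more a".
patterns = [r"a+", r"aa*", r"a{1,}"]
# Hypothetical sample inputs, chosen only for this illustration.
samples = ["", "a", "aa", "aaaaaa", "b", "ab"]


def accepts(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Map each pattern to its accept/reject decision per sample.
results = {p: [accepts(p, s) for s in samples] for p in patterns}
for pattern, decisions in results.items():
    print(pattern, decisions)
```

All three rows should come out identical: the empty string is rejected, and so is anything containing a character other than a.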
However, + can also make other quantifiers possessive if it follows a repetition operator (i.e. ?+, *+, ++ or {m,n}+). A possessive quantifier is an advanced feature of some regex flavours (PCRE, Java and the JGsoft engine) which tells the engine not to backtrack into the quantified token once it has matched.
To understand how this works, we need to understand two concepts of regex engines: greediness and backtracking. Greediness means that in general regexes will try to consume as many characters as they can. Let's say our pattern is .* (the dot is a special construct in regexes which means any character [1]; the star means match zero or more times), and our target is aaaaaaaab. The entire string will be consumed, because the entire string is the longest match that satisfies the pattern.
However, let's say we change the pattern to .*b. Now, when the regex engine tries to match against aaaaaaaab, the .* will again consume the entire string. However, since the engine will have reached the end of the string and the pattern is not yet satisfied (the .* consumed everything but the pattern still has to match b afterwards), it will backtrack, one character at a time, and try to match b. The first backtrack will make the .* consume aaaaaaaa, and then b can consume b, and the pattern succeeds.
Possessive quantifiers are also greedy, but as mentioned, once they return a match, the engine can no longer backtrack past that point. So if we change our pattern to .*+b (match any character zero or more times, possessively, followed by a b), and try to match aaaaaaaab, again the .*+ will consume the whole string, but since it is possessive, the backtracking information is discarded, so the b cannot be matched and the pattern fails.
[1] In most engines, the dot will not match a newline character, unless the /s ("singleline" or "dotall") modifier is specified.
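The three cases above can be reproduced with a small Python sketch. One assumption to flag: Python's re module only accepts the native .*+ syntax from version 3.11 on, so the possessive case here is emulated with a lookahead plus a backreference, (?=(.*))\1, which is a common workaround and not the syntax used in the answer.

```python
import re

target = "aaaaaaaab"

# Greedy: .* on its own consumes the entire string.
greedy = re.match(r".*", target)
print(greedy.group())        # aaaaaaaab

# Backtracking: .* first grabs everything, then gives back one character
# so that the trailing b can match.
backtracked = re.match(r"(.*)b", target)
print(backtracked.group(1))  # aaaaaaaa

# Possessive behaviour, emulated (assumption: stands in for .*+b).
# The lookahead captures what .* would take, \1 consumes exactly that,
# and the engine cannot backtrack into the completed lookahead.
possessive_like = re.match(r"(?=(.*))\1b", target)
print(possessive_like)       # None

# Footnote [1]: the dot stops at a newline unless DOTALL is set.
default_dot = re.match(r".*", "a\nb").group()
dotall_dot = re.match(r".*", "a\nb", re.DOTALL).group()
print(repr(default_dot), repr(dotall_dot))
```

The emulation relies on lookarounds being atomic in Python's engine, which is why the last b can never be handed back.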

In most implementations + means "one or more".
In some theoretical writings + is used to mean "or" (most implementations use the | symbol for that).

One or more of the previous expression.
[0-9]+
Would match:
1234567890
In:
I have 1234567890 dollars
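The same example as a minimal Python sketch (Python is just one possible host language here; the pattern itself is unchanged):

```python
import re

text = "I have 1234567890 dollars"

# [0-9]+ takes the longest run of one or more ASCII digits it can find.
match = re.search(r"[0-9]+", text)
print(match.group())  # 1234567890
print(match.span())   # (7, 17)
```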

One or more occurrences of the preceding symbol.
E.g. a+ means the letter a one or more times. Thus, a+ matches a, aa, aaaaaa but not an empty string.
If you know what the asterisk (*) means, then you can express (exp)+ as (exp)(exp)*, where (exp) is any regular expression.

A lot depends on where the + symbol appears and what the regex flavor is.
In posix-bre and vim (in a non-very-magic mode) flavors, + matches a literal + char. E.g. sed 's/+//g' file > newfile removes all + chars in file. If you want to use + as a quantifier here, use \+ (supported in GNU tools), replace it with \{1,\}, or double the quantified pattern: drop the quantifier from the first copy and add * (the zero-or-more quantifier) after the second (e.g. sed 's/c++*//' removes c followed by one or more + chars).
In posix-ere and other regex flavors, outside a character class ([...]), + acts as a quantifier meaning "one or more, but as many as possible, occurrences of the quantified pattern". E.g. in JavaScript, s.replace(/\++/g, '-') will replace a string like ++++ with a single -. Note that in NFA regex flavors + has a lazy counterpart, +?, that matches "one or more, but as few as possible, occurrences of the quantified pattern".
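To make the greedy/lazy difference concrete, here is a rough Python equivalent of the JavaScript call above (the input string a++++b is made up for the illustration):

```python
import re

# Greedy: \++ takes the whole run of plus signs as a single match.
collapsed = re.sub(r"\++", "-", "a++++b")
print(collapsed)     # a-b

# Greedy vs lazy on the same input.
greedy_hit = re.search(r"\++", "++++").group()
lazy_hit = re.search(r"\++?", "++++").group()
print(greedy_hit)    # ++++
print(lazy_hit)      # +

# Lazy: each match is a single +, so every plus is replaced separately.
lazy_replaced = re.sub(r"\++?", "-", "a++++b")
print(lazy_replaced) # a----b
```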
Inside a character class, the + char is treated as a literal char, in every regex flavor. [+] always matches a single + literal char. E.g. in C#, Regex.Replace("1+2=3", @"[+]", "-") will result in 1-2=3. Note it is not a good idea to put a single char inside a character class; use a character class only for two or more chars, or for charsets. E.g. [+0-9] matches a + or any ASCII digit char. In PHP, preg_replace('~[\s+]+~', '-', '1 2+++3') will result in 1-2-3 since the regex matches one or more (due to the last +, which is a quantifier) whitespace chars (\s) or plus chars (the + inside the character class).
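The C# and PHP calls above translate fairly directly to Python's re.sub. This is only a sketch of the same three patterns; the findall input is invented for the example:

```python
import re

# [+] is a literal plus sign inside a character class.
single = re.sub(r"[+]", "-", "1+2=3")
print(single)        # 1-2=3

# [\s+]+ : the inner + is literal, the outer + is the quantifier,
# so any run of whitespace and/or plus chars collapses into one dash.
collapsed = re.sub(r"[\s+]+", "-", "1 2+++3")
print(collapsed)     # 1-2-3

# [+0-9] matches a plus sign or an ASCII digit, one char at a time.
chars = re.findall(r"[+0-9]", "a+1b2")
print(chars)         # ['+', '1', '2']
```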
The + symbol can also be part of a possessive quantifier in some PCRE-like regex flavors (php, ruby, java, boost, icu, etc., and python re since version 3.11, but not .net or javascript). E.g. C\+++(?!\d) in php PCRE would match C and then one or more + symbols (\+ is a literal + and ++ means one or more occurrences, without allowing backtracking into this quantified pattern) not followed by a digit. If there is a digit after the plus chars, the whole match fails. Other examples: a?+ (one or zero a chars), a{1,3}+ (one to three a chars, as many as possible), a{3}+ (=a{3}, three a chars), a*+ (zero or more a chars).
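Here is a hedged Python sketch of the C\+++(?!\d) example. It assumes the native possessive syntax compiles only on Python 3.11 or later; on older versions re.compile raises re.error for it, and the sketch falls back to a lookahead-plus-backreference emulation, which is a workaround and not part of the original pattern. The matched helper and the test strings are made up for the illustration.

```python
import re

# Plain greedy version: the engine may give back plus signs.
GREEDY = re.compile(r"C\++(?!\d)")

try:
    # Native possessive quantifier (Python 3.11+).
    POSSESSIVE = re.compile(r"C\+++(?!\d)")
except re.error:
    # Fallback emulation for older Pythons: the lookahead is atomic,
    # so the captured run of plus signs cannot be shortened afterwards.
    POSSESSIVE = re.compile(r"C(?=(\++))\1(?!\d)")


def matched(pattern, text):
    """Return the matched text, or None when there is no match."""
    m = pattern.search(text)
    return m.group() if m else None


print(matched(GREEDY, "C++1"))          # C+   (backtracks off one +)
print(matched(POSSESSIVE, "C++1"))      # None (no backtracking allowed)
print(matched(GREEDY, "C++ rocks"))     # C++
print(matched(POSSESSIVE, "C++ rocks")) # C++
```

The first line is the interesting one: without possessiveness, the engine quietly returns C+ because the second + satisfies the "not followed by a digit" check.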

Related

Regex expression to match everything between a ? and # OR ? to the end of string [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
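A short Python sketch of why the original pattern misbehaves. It uses the two URLs from the question; the other test strings are invented. Inside brackets, |, $ and \b stop being an alternation, an anchor and a word boundary.

```python
import re

# [&|$] matches one of three literal characters.
literal_hits = re.findall(r"[&|$]", "a&b|c$d")
print(literal_hits)   # ['&', '|', '$']

# [$] needs an actual dollar sign; it is not an end-of-string anchor.
dollar_in_class = re.search(r"[$]", "abc")
print(dollar_in_class)  # None

# [\b] is a backspace character in Python, not a word boundary.
backspace_hit = re.search(r"[\b]", "a\bb")
print(backspace_hit is not None)  # True

# The pattern from the question, run on both example URLs.
original = re.compile(r"[&|\?]list=.*?([&|$])")
at_end = original.search("index.php?test=1&list=UL")
in_middle = original.search("index.php?list=UL&more=1")
print(at_end)             # None
print(in_middle.group())  # ?list=UL&
```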
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android (Go's `regexp` also has \z, but it does not support lookaheads)
[&?]list=(.*?)(?=&|\Z) - OK for Python
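To illustrate why the "absolute" anchor matters, here is a small Python sketch. The trailing-newline input is a made-up edge case. In Python, $ also matches just before a final newline, while \Z matches only at the very end of the string.

```python
import re

with_newline = "index.php?list=UL\n"

# $ is satisfied right before the trailing newline.
dollar = re.search(r"[&?]list=(.*?)(?=&|$)", with_newline)
print(dollar.group(1))   # UL

# \Z insists on the real end of the string; the dot cannot cross the
# newline to get there, so nothing matches.
strict = re.search(r"[&?]list=(.*?)(?=&|\Z)", with_newline)
print(strict)            # None

# Without the trailing newline both anchors behave the same.
clean = re.search(r"[&?]list=(.*?)(?=&|\Z)", "index.php?list=UL")
print(clean.group(1))    # UL
```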
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use a negated character class (or bracket expression, in POSIX terms):
[&?]list=([^&]*)
Details:
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
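Applied to the two URLs from the question, the negated character class approach might look like this in Python (the LIST_PARAM name is just a label chosen for this sketch):

```python
import re

# [&?] then "list=", then capture everything up to the next & (or the end).
LIST_PARAM = re.compile(r"[&?]list=([^&]*)")

urls = [
    "index.php?test=1&list=UL",   # parameter at the end
    "index.php?list=UL&more=1",   # parameter in the middle
]

values = [LIST_PARAM.search(url).group(1) for url in urls]
print(values)  # ['UL', 'UL']
```

No end-of-string anchor is needed here: [^&]* simply stops when it runs out of characters.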
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to the matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
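The difference between consuming the delimiter and only checking for it shows up with consecutive matches. The following Python sketch uses invented inputs ("&a&b&c", "list=easyyes") to illustrate both lookahead variants and the multichar caveat:

```python
import re

text = "&a&b&c"

# Consuming the trailing & steals the leading & of the next item.
consuming = re.findall(r"&(.*?)(?:&|$)", text)
print(consuming)   # ['a', 'c']

# Positive lookahead: check for & or end of string without consuming it.
positive = re.findall(r"&(.*?)(?=&|$)", text)
print(positive)    # ['a', 'b', 'c']

# Negative lookahead: "not followed by a char other than &".
negative = re.findall(r"&(.*?)(?![^&])", text)
print(negative)    # ['a', 'b', 'c']

# [^yes] excludes the single chars y, e and s, not the sequence "yes".
not_yes_chars = re.findall(r"[^yes]", "yes, sir")
print(not_yes_chars)  # [',', ' ', 'i', 'r']

# With a multichar delimiter only the positive lookahead does the job.
with_lookahead = re.search(r"list=(.*?)(?=yes|$)", "list=easyyes").group(1)
with_negated = re.search(r"list=([^yes]*)", "list=easyyes").group(1)
print(with_lookahead)      # easy
print(repr(with_negated))  # ''
```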



