* and + behaving differently in Python regex [duplicate]

This question already has answers here:
Python regular expression pattern * is not working as expected
(2 answers)
Closed 3 years ago.
I am using Python's re module and cannot see why the following two calls behave differently. I expected the one with * to give the same result.
re.search(r'([0-9]+)',':329392.899')
Output: re.Match object; span=(1, 7), match='329392'
re.search('([0-9]*)',':329392.899')
Output: re.Match object; span=(0, 0), match=''

re.search first attempts to find a match starting at the beginning of the string, and advances the starting position only when no match can be found there. The [0-9]* pattern does match at the beginning of the string; it just matches zero characters (* matches zero or more).

* matches zero or more of the pattern. There are zero digits at the very beginning of the input string, before the :, and that empty position is what it matches.
+ matches one or more of the pattern, so it doesn't find a match until it gets to the 3, then it matches all the digits.
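The behaviour described above can be checked directly. This is a minimal sketch using the string from the question; the variable names are only illustrative, and the capture group is dropped because it does not change where the match starts.

```python
import re

text = ":329392.899"

plus = re.search(r"[0-9]+", text)  # needs at least one digit
star = re.search(r"[0-9]*", text)  # is satisfied with zero digits

print(plus.span(), repr(plus.group()))  # (1, 7) '329392'
print(star.span(), repr(star.group()))  # (0, 0) ''

# findall keeps scanning past the empty matches; dropping the empty
# strings leaves only the digit runs.
digit_runs = [m for m in re.findall(r"[0-9]*", text) if m]
print(digit_runs)  # ['329392', '899']
```

If the empty match is not wanted, + is usually the simpler choice than filtering the results of *.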

* means match zero or more times, so ([0-9]*) also matches (and captures) the empty string, which is why you get Output: re.Match object; span=(0, 0), match=''
+, on the other hand, means one or more, so it cannot match the empty string.
Also, you are missing the r prefix in the second snippet (it makes no difference for this particular pattern, which contains no backslashes, but it is a good habit).

Related

CMake regex simple digit match [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 1 year ago.
What is the difference between:
(.+?)
and
(.*?)
when I use it in my php preg_match regex?
They are called quantifiers.
* 0 or more of the preceding expression
+ 1 or more of the preceding expression
By default a quantifier is greedy, meaning it matches as many characters as possible.
A ? after a quantifier makes it "ungreedy" (lazy), meaning it matches as few characters as possible.
Example: greedy vs. ungreedy
On the string "abab":
a.*b will match "abab" (preg_match_all will return one match, "abab")
while a.*?b will match only the leading "ab" (preg_match_all will return two matches, each "ab")
You can test your regexes online, e.g. on Regexr.
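The greedy and ungreedy results on "abab" can be reproduced with a short script. This sketch uses Python's re.findall as a stand-in for preg_match_all; that substitution is an assumption made for illustration, although the two patterns behave the same way in both engines for this input.

```python
import re

text = "abab"

greedy = re.findall(r"a.*b", text)   # .* grabs as much as it can
lazy = re.findall(r"a.*?b", text)    # .*? stops at the first possible b

print(greedy)  # ['abab']
print(lazy)    # ['ab', 'ab']
```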
The first (+) is one or more characters. The second (*) is zero or more characters. Both are non-greedy (?) and match anything (.).
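The difference between (.+?) and (.*?) only shows when there is nothing to capture. The sketch below uses Python's re module rather than PHP, and the surrounding square brackets are a hypothetical delimiter chosen just to make the empty case visible.

```python
import re

# With nothing between the brackets, only the * version can succeed.
star_empty = re.search(r"\[(.*?)\]", "[]")
plus_empty = re.search(r"\[(.+?)\]", "[]")

print(repr(star_empty.group(1)))  # ''
print(plus_empty)                 # None

# With at least one character between the brackets, both capture the same text.
star_one = re.search(r"\[(.*?)\]", "[x]").group(1)
plus_one = re.search(r"\[(.+?)\]", "[x]").group(1)
print(star_one, plus_one)  # x x
```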
In RegEx, {i,f} means "between i and f matches". Let's take a look at the following examples:
{3,7} means between 3 and 7 matches
{,10} means up to 10 matches with no lower limit (i.e. the lower limit is 0); some engines only accept this when written as {0,10}
{3,} means at least 3 matches with no upper limit (i.e. the upper limit is infinity)
{,} means no upper or lower limit on the number of matches (i.e. the lower limit is 0 and the upper limit is infinity); where this form is not accepted, {0,} is the portable spelling
{5} means exactly 5
Most good languages contain abbreviations, so does RegEx:
+ is the shorthand for {1,}
* is the shorthand for {,}
? is the shorthand for {,1}
This means + requires at least 1 match while * accepts any number of matches or no matches at all and ? accepts no more than 1 match or zero matches.
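The shorthand equivalences can be spot-checked by comparing each short form with its brace form on a few inputs. This is only a sketch in Python's re module: the helper name same_results and the sample strings are made up for illustration, and the lower bound is written out as 0 because not every engine accepts an omitted one.

```python
import re

def same_results(p1, p2, samples):
    """True when both patterns give identical findall results on every sample."""
    return all(re.findall(p1, s) == re.findall(p2, s) for s in samples)

samples = ["", "a", "aa", "baaab", "aaaaaaaa"]

plus_same = same_results(r"a+", r"a{1,}", samples)
star_same = same_results(r"a*", r"a{0,}", samples)
opt_same = same_results(r"a?", r"a{0,1}", samples)
print(plus_same, star_same, opt_same)  # True True True

# {5} accepts exactly five repetitions: four or six do not fully match.
exactly_five = [bool(re.fullmatch(r"a{5}", "a" * n)) for n in range(4, 7)]
print(exactly_five)  # [False, True, False]

# {3,7} accepts any count from 3 through 7.
three_to_seven = [n for n in range(10) if re.fullmatch(r"a{3,7}", "a" * n)]
print(three_to_seven)  # [3, 4, 5, 6, 7]
```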
Credit: Codecademy.com
+ matches at least one character
* matches any number (including 0) of characters
The ? indicates a lazy expression, so it will match as few characters as possible.
A + matches one or more instances of the preceding pattern. A * matches zero or more instances of the preceding pattern.
So basically, if you use a + there must be at least one instance of the pattern, if you use * it will still match if there are no instances of it.
Consider the following string to match:
ab
The pattern (ab.*) matches, and the capture group contains ab.
The pattern (ab.+) does not match, so nothing is returned.
But if you change the string to the following, (ab.+) matches and returns aba:
aba
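The same two strings can be run through both patterns in a few lines. This is a sketch using Python's re.search; the dictionary layout is just a convenient way to show the four outcomes side by side.

```python
import re

results = {}
for text in ("ab", "aba"):
    star = re.search(r"(ab.*)", text)
    plus = re.search(r"(ab.+)", text)
    # Store the captured text, or None when the pattern did not match.
    results[text] = (
        star.group(1) if star else None,
        plus.group(1) if plus else None,
    )

print(results)  # {'ab': ('ab', None), 'aba': ('aba', 'aba')}
```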
+ requires at least one occurrence; * allows zero as well.
A star is very similar to a plus, the only difference is that while the plus matches 1 or more of the preceding character/group, the star matches 0 or more.
I think the previous answers fail to highlight a simple example:
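For example, suppose we have an array:
numbers = [5, 15]
The regex ^[0-9]+ matches both 5 and 15, because each starts with at least one digit; it would not match an empty string or a string that starts with a non-digit.
^[0-9]* also matches both 5 and 15, and additionally succeeds, with an empty match, on input that has no leading digit. The difference is that the + operator requires at least one occurrence of the preceding expression.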

Regex string matched failure

I do not understand how regex string matching works.
import re
r2 = r'a[bcd]*b'
m1 = re.findall(r2, "abcbd")
print(m1)  # ['abcb']
This is in line with the explanation I read, except for one step:
Step 3 The engine tries to match b, but the current position is at the end of the string, so it fails.
How? I do not understand this.
The regex a[bcd]*b matches the longest possible substring (because * is greedy) made of:
a: starts with a
[bcd]*: followed by any number (including zero, since [bcd]* can match the empty string) of characters from the set (b, c, d)
b: ends with b
EDIT: following a comment, backtracking occurs in the following example:
>>> re.findall(r2,"abcxb")
['ab']
abc matches a[bcd]*, but the next character, x, is not the b that must follow
a alone also matches a[bcd]* (because the empty string matches [bcd]*), and the b after it completes the pattern
so the engine backtracks and finally returns ab
Concerning greediness: the metacharacter * after a single character, a character set or a group means any number of times, matching as much as possible. Some regex engines accept the sequence *?, which changes the behavior to match as little as possible, for example:
>>> r2 = r'a[bcd]*?b'
>>> re.findall(r2,"abcbde")
['ab']
Your regular expression requires the match to end in b, so everything up to, but not including, the trailing d is matched. If b were optional, as in a[bcd]*b?, then the entire string would be matched.
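The four cases discussed in these answers can be run together to compare greedy matching, backtracking, the lazy variant and the optional trailing b. This is a small sketch with Python's re.findall; the variable names are illustrative only.

```python
import re

greedy = r"a[bcd]*b"

# [bcd]* first takes "bcbd", then gives characters back until a b can follow.
backtracked = re.findall(greedy, "abcbd")
print(backtracked)  # ['abcb']

# x stops the character set, so the engine backtracks all the way to "ab".
stopped_by_x = re.findall(greedy, "abcxb")
print(stopped_by_x)  # ['ab']

# The lazy version takes as little as possible before the required b.
lazy = re.findall(r"a[bcd]*?b", "abcbde")
print(lazy)  # ['ab']

# With an optional final b, the set is free to consume the whole string.
optional_b = re.findall(r"a[bcd]*b?", "abcbd")
print(optional_b)  # ['abcbd']
```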

How regex engine works for "[*=]+$" [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I'm trying to build a regex that removes * and =, or any combination of them, from the end of a string. I first tried "[*=]$", but it removed only one character: for the string this is a dog =*, it removes * and keeps =. Then I tried [*=]+$, and it did the job. But I can't understand how the regex engine works with the last regex, or in other words, how this regex becomes greedy.
Note that + repeats the previous token one or more times, so [*=]+ matches a run of one or more * or = symbols, and the $ after it requires that run to sit at the end.
What happens in the background: the engine scans from left to right for a position where [*=] matches. From there the greedy + keeps consuming * and = symbols for as long as it can. Then the end-of-line anchor $ has to match; if the run is not followed by the end of the line, that attempt is discarded and the engine tries again further along. You are left with the single match at the end of the line.
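The difference between the two patterns from the question can be seen with a substitution. This is a sketch in Python's re module, which is an assumption since the question does not name a language; the second input string is a made-up example with a run of symbols in the middle as well.

```python
import re

text = "this is a dog =*"

# Without +, only the single character directly before the end is removed.
one_char = re.sub(r"[*=]$", "", text)
print(repr(one_char))  # 'this is a dog ='

# With +, the whole trailing run of * and = is removed.
whole_run = re.sub(r"[*=]+$", "", text)
print(repr(whole_run))  # 'this is a dog '

# A run in the middle is left alone because $ cannot match after it.
inner_kept = re.sub(r"[*=]+$", "", "a=*b=*")
print(repr(inner_kept))  # 'a=*b'
```

Note that the space before the symbols is not part of the character class, so it stays in the result.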

What does the regular expression ^(\d{1,2})$ mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I am trying to understand what the regular expression ^(\d{1,2})$ stands for in Google Sheets. A quick look around the regex sites and tools left me confused. Can anybody please help?
^ Asserts position at start of the string
( Denotes the start of a capturing group
\d A numerical digit: 0, 1, 2, ... 9.
{1,2} one to two times.
) You guessed it - Closes the group.
$ Assert position at end of the string
^ - start of a line.
(\d{1,2}) - captures up to two digits (i.e. one or two digits).
$ - End of the line.
It means at least one and at most two digits, \d{1,2}, with no other characters at the beginning (^) or the end ($). The parentheses capture the matched string, i.e. whatever the digits are.
^ matches the start of the line
The parens can be ignored for now.
\d{1,2} means one or two digits (written without a space after the comma)
$ is the end of the line.
The parens, if you need them, can be used to retrieve the digit(s) that were found in the regex.
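To see which inputs the pattern accepts and what the group captures, here is a minimal sketch. It uses Python's re module as a stand-in for Google Sheets, which is an assumption: engines differ in small details, but this pattern behaves the same way for the plain inputs below. The helper name captured is hypothetical.

```python
import re

pattern = re.compile(r"^(\d{1,2})$")

def captured(text):
    """Return the digits captured by the group, or None when there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(captured("7"))    # 7
print(captured("42"))   # 42
print(captured("123"))  # None: three digits is one too many
print(captured("4a"))   # None: $ rejects the trailing letter
print(captured("a4"))   # None: ^ rejects the leading letter
print(captured(""))     # None: at least one digit is required
```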