Perl: regex not matching as desired when using the {x,y} metacharacter

I'm trying to work with the {x,y} metacharacter, so please help me understand why
1. 'Hello' =~ /\w{2,}/; # returns true, while...
2. 'Hello' =~ /\w{,6}/; # ...returns false?!
\w{2,} stands for *'match a [0-9A-Za-z_] character at least 2 times'*
\w{,6} stands for *'match a [0-9A-Za-z_] character at most 6 times'*
Am I reading this correctly? If so, why doesn't the second one match?

According to the Quantifiers section of the perlre documentation, only *, +, ?, {n}, {n,} and {n,m} are recognized:
The following standard quantifiers are recognized:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
So /{,6}/ matches the string '{,6}' literally.
Use /\w{0,6}/ or /\w{1,6}/ instead, depending on your needs.

The first argument to the {n,m} expression is required. See the perlre man page, for example:
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
A pattern like {,m} is not recognized. If you explicitly give the first argument as 1 it works:
print 'Hello' =~ /\w{1,6}/;
generates "1".

Actually:
\w{n,m} means match a word character at least n times, but at most m times.
\w{n,} means match a word character n or more times.
\w{n} means match a word character exactly n times.
However:
\w{,m} means match a word character followed by the literal text {,m}. This is because n is required: you must specify the first argument to the {n,m} expression.
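The behaviour described in these answers can be sketched in Python, used here only as a runnable stand-in for Perl. One assumption to keep in mind: Python's own re module does accept {,6} as a quantifier, so the "literal" reading that Perl applies is spelled with escaped braces below to mirror it.

```python
import re

word = "Hello"

# Forms with an explicit lower bound, as listed in perlre.
at_least_two = re.search(r"\w{2,}", word)   # 2 or more word characters
up_to_six = re.search(r"\w{0,6}", word)     # 0 to 6 word characters
one_to_six = re.search(r"\w{1,6}", word)    # 1 to 6 word characters

# Stand-in for Perl's reading of \w{,6}: one word character followed by
# the literal text "{,6}". The braces are escaped so Python also treats
# them as plain characters.
literal_form = re.compile(r"\w\{,6\}")

print(at_least_two.group())                  # Hello
print(up_to_six.group())                     # Hello
print(one_to_six.group())                    # Hello
print(literal_form.search(word))             # None
print(literal_form.search("o{,6}").group())  # o{,6}
```

The last two lines show why 'Hello' fails under the literal reading: the text would have to contain the characters {,6} after a word character.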

Related

CMake regex simple digit match [duplicate]

What is the difference between:
(.+?)
and
(.*?)
when I use it in my php preg_match regex?
They are called quantifiers.
* 0 or more of the preceding expression
+ 1 or more of the preceding expression
By default a quantifier is greedy, which means it matches as many characters as possible.
The ? after a quantifier makes it "ungreedy" (lazy), which means it will match as few characters as possible.
Example greedy/ungreedy
For example on the string "abab"
a.*b will match "abab" (preg_match_all will return one match, the "abab")
while a.*?b will match only the starting "ab" (preg_match_all will return two matches, "ab")
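The same greedy and lazy comparison can be reproduced in Python, assuming re.findall as a rough counterpart to PHP's preg_match_all (both collect all non-overlapping matches):

```python
import re

text = "abab"

# Greedy: .* grabs as much as possible, then backtracks to the last "b".
greedy = re.findall(r"a.*b", text)

# Lazy: .*? grabs as little as possible, stopping at the first "b".
lazy = re.findall(r"a.*?b", text)

print(greedy)  # ['abab']
print(lazy)    # ['ab', 'ab']
```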
You can test your regexes online, e.g. on Regexr.
The first (+) is one or more characters. The second (*) is zero or more characters. Both are non-greedy (?) and match anything (.).
In regex, {i,f} means "between i and f matches". Let's take a look at the following examples:
{3,7} means between 3 and 7 matches
{,10} means up to 10 matches with no lower limit (i.e. the lower limit is 0), but only in flavors that allow the lower bound to be omitted; Perl and PCRE (which PHP's preg_match uses) have traditionally treated it as literal text, so {0,10} is the portable spelling
{3,} means at least 3 matches with no upper limit (i.e. the upper limit is infinity)
{,} means no upper or lower limit on the number of matches (i.e. the lower limit is 0 and the upper limit is infinity), again only where an omitted lower bound is allowed; {0,} is the portable spelling
{5} means exactly 5
Most good languages contain abbreviations, and so does regex:
+ is the shorthand for {1,}
* is the shorthand for {0,}
? is the shorthand for {0,1}
This means + requires at least 1 match, * accepts any number of matches or none at all, and ? accepts either zero matches or one.
Credit: Codecademy.com
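The shorthand equivalences can be checked mechanically. The sketch below uses Python and writes every lower bound out explicitly ({0,} and {0,1}), since that spelling is the one accepted most widely; the sample strings and the helper name accepted are made up for illustration.

```python
import re

samples = ["", "a", "aa", "aaa", "b"]

# Each shorthand paired with its explicit brace form.
pairs = [
    (r"a+", r"a{1,}"),
    (r"a*", r"a{0,}"),
    (r"a?", r"a{0,1}"),
]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

results = {short: (accepted(short), accepted(long_form))
           for short, long_form in pairs}

for short, (by_short, by_long) in results.items():
    print(short, by_short, by_long)
# a+ ['a', 'aa', 'aaa'] ['a', 'aa', 'aaa']
# a* ['', 'a', 'aa', 'aaa'] ['', 'a', 'aa', 'aaa']
# a? ['', 'a'] ['', 'a']
```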
+ matches at least one character
* matches any number (including 0) of characters
The ? indicates a lazy expression, so it will match as few characters as possible.
A + matches one or more instances of the preceding pattern. A * matches zero or more instances of the preceding pattern.
So basically, if you use a + there must be at least one instance of the pattern, if you use * it will still match if there are no instances of it.
Consider the following string to match:
ab
The pattern (ab.*) will match, and the capture group will contain ab.
The pattern (ab.+) will not match, so it returns nothing.
But if you change the string to the following, the pattern (ab.+) will return aba:
aba
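A quick check of this ab / aba example, sketched in Python rather than PHP (the variable names are illustrative only):

```python
import re

star = re.search(r"(ab.*)", "ab")        # .* may match zero characters
plus_short = re.search(r"(ab.+)", "ab")  # .+ needs one more character
plus_long = re.search(r"(ab.+)", "aba")  # now there is one to consume

print(star.group(1))       # ab
print(plus_short)          # None
print(plus_long.group(1))  # aba
```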
+ requires at least one; * can be zero as well.
A star is very similar to a plus, the only difference is that while the plus matches 1 or more of the preceding character/group, the star matches 0 or more.
I think the previous answers fail to highlight a simple example:
for example we have an array:
numbers = [5, 15]
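The regex ^[0-9]+ matches both 5 and 15, since each contains at least one digit.
^[0-9]* matches them as well, but it also matches a string with no leading digit at all (even an empty string), because * allows zero occurrences. The difference is that the + operator requires at least one occurrence of the preceding expression.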

R grep to match dot

So I have two strings like mylist<-c('claim', 'cl.bi'), when I do
grep('^cl\\.*', mylist)
it returns both 1 and 2. But if I do
grep('^cl\\.', mylist)
it will only return 2. So why would the first match 'claim'? What happened to the period matching?
"^cl\\.*" matches "claim" because the * quantifier is defined thusly (here quoting from ?regex):
'*' The preceding item will be matched zero or more times.
"claim" contains a beginning of line, followed by a c, followed by an l, followed by zero (in this case) or more dots, so fulfilling all the requirements for a successful match.
If you want to only match strings beginning cl., use the one or more times quantifier, +, like this:
grep('^cl\\.+', mylist, value=TRUE)
# [1] "cl.bi"
The * operator tells the engine to match its preceding token "zero or more" times. In the first case, the engine tries matching a literal dot "zero or more" times, which might be none at all.
Essentially, if you use the * operator, it will still match if there are no instances of (.)
A better visualization:
* is equivalent to {0,}: match the preceding token 0 or more times
\\.* is equivalent to \\.{0,}: match ., .., ..., etc., or an empty match
The quantifier * means zero or more times. Pay attention to the zero. It applies to the preceding token, which is \. in your case.
In short, the cl part matches, and the dot after it isn't required.
Here are the matched substrings for both cases:
claim
--
cl.bi
---
To simplify what the others have said: '^cl\\.*' is just equivalent to '^cl', since the * matches 0+ occurrences of the \\.
Whereas '^cl\\.' forces it to match an actual dot. It is equivalent to '^cl\\.{1}'
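The matched substrings shown above can be reproduced with a short Python sketch, used here as a stand-in for R's grep. A Python raw string r"\." plays the role of R's '\\.', and the helper matched_part is hypothetical.

```python
import re

mylist = ["claim", "cl.bi"]

def matched_part(pattern, s):
    """Return the substring the pattern matched, or None if no match."""
    m = re.search(pattern, s)
    return m.group() if m else None

# Zero or more literal dots after "cl": the dot is optional.
star_hits = {s: matched_part(r"^cl\.*", s) for s in mylist}

# Exactly one literal dot after "cl": the dot is required.
dot_hits = {s: matched_part(r"^cl\.", s) for s in mylist}

print(star_hits)  # {'claim': 'cl', 'cl.bi': 'cl.'}
print(dot_hits)   # {'claim': None, 'cl.bi': 'cl.'}
```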

Why is bracket mandatory here?

1 . ^([0-9A-Za-z]{5})+$
vs
2 . ^[a-zA-Z0-9]{5}+$
My intention is to match any string of length n such that n is a multiple of 5.
Check here : https://regex101.com/r/sS6rW8/1.
Please elaborate on why case 1 matches the string whereas case 2 does not.
Because {n}+ doesn't mean what you think it does. In PCRE syntax, this turns {n} into a possessive quantifier. In other words, a{5}+ is the same as (?>a{5}). It's like the second + in the expression a++, which is the same as using an atomic group (?>a+).
This has no use with a fixed-length {n} but is more meaningful when used with {min,max}. So, a{2,5}+ is equivalent to (?>a{2,5}).
As a simple example, consider these patterns:
^(a{1,2})(ab) will match aab -> $1 is "a", $2 is "ab"
^(a{1,2}+)(ab) won't match aab -> $1 consumes "aa" possessively and $2 can't match
In ^([0-9A-Za-z]{5})+$ you're saying: any 5 letters or digits, repeated 1 or more times. The + applies to the entire group (whatever's inside the parentheses) and the {5} applies to the [0-9A-Za-z].
Your second example uses {5}+, a no-backtrack (possessive) quantifier, which is different from (stuff{5})+.
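As a rough check of the grouped form, here is a Python sketch of case 1 together with the ordinary backtracking example from the first answer. The possessive {5}+ form is left out on purpose, because support for possessive quantifiers varies by engine and version; the test strings are made up.

```python
import re

# Case 1: the + repeats the whole 5-character group.
multiple_of_five = re.compile(r"^([0-9A-Za-z]{5})+$")

candidates = ["abcde", "abcde12345", "abcd", "abcdef"]
verdicts = {c: bool(multiple_of_five.match(c)) for c in candidates}
print(verdicts)
# {'abcde': True, 'abcde12345': True, 'abcd': False, 'abcdef': False}

# Ordinary greedy quantifier: a{1,2} first takes "aa", then gives one "a"
# back so that (ab) can match.
m = re.match(r"^(a{1,2})(ab)", "aab")
print(m.groups())  # ('a', 'ab')
```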

How to match to a number of any length, not just a digit

How do I match a number of any length? If it's a one-digit number I can use \d, but if it's any length, do I have to use the non-word \W? Or can I use [0-max]?
You need to use a quantifier.
{n,m} is the generic quantifier: it matches at least n times and at most m times. When m is omitted ({n,}), there is no upper limit.
? is short for {0,1}, i.e. makes the preceding object optional, means match 0 or 1 times
+ is short for {1,}, i.e. repeats the preceding object 1 or more times
* is short for {0,}, i.e. repeats the preceding object 0 or more times (matches also on an empty string!)
But be careful, when you are searching for \d{1,2}, it will normally (depends on the language and method you use) also match on "123456". Then you need to have a look at anchors and word boundaries.
Use:
\d+
That means match at least one (1..n) digit character.
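A small Python sketch of both points: \d+ for numbers of any length, and the caveat that an unanchored \d{1,2} still matches inside a longer number unless word boundaries are added. The sample text is invented for illustration.

```python
import re

text = "room 7, floor 42, id 123456"

# One or more digits: picks up numbers of any length.
any_length = re.findall(r"\d+", text)

# Unanchored {1,2}: happily matches the first two digits of a longer number.
unanchored = re.search(r"\d{1,2}", "123456").group()

# Word boundaries restrict the match to whole 1- or 2-digit numbers.
bounded = re.findall(r"\b\d{1,2}\b", text)

print(any_length)  # ['7', '42', '123456']
print(unanchored)  # 12
print(bounded)     # ['7', '42']
```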