R grep to match dot - regex

So I have two strings like mylist<-c('claim', 'cl.bi'), when I do
grep('^cl\\.*', mylist)
it returns both 1 and 2. But if I do
grep('^cl\\.', mylist)
it will only return 2. So why would the first match 'claim'? What happened to the period matching?

"^cl\\.*" matches "claim" because the * quantifier is defined thusly (here quoting from ?regex):
'*' The preceding item will be matched zero or more times.
"claim" contains a beginning of line, followed by a c, followed by an l, followed by zero (in this case) or more dots, so fulfilling all the requirements for a successful match.
If you want to only match strings beginning cl., use the one or more times quantifier, +, like this:
grep('^cl\\.+', mylist, value=TRUE)
# [1] "cl.bi"
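As a cross-check outside R, here is a minimal sketch of the same comparison using Python's re module as a stand-in (the helper name grep is hypothetical and only loosely imitates R's grep(..., value=TRUE)). The quantifier semantics are the same for these patterns:

```python
import re

mylist = ["claim", "cl.bi"]

def grep(pattern, items):
    """Return the items in which the pattern is found somewhere."""
    return [s for s in items if re.search(pattern, s)]

star = grep(r"^cl\.*", mylist)   # "cl" followed by zero or more dots
plus = grep(r"^cl\.+", mylist)   # "cl" followed by one or more dots
exact = grep(r"^cl\.", mylist)   # "cl" followed by a dot

print(star)   # ['claim', 'cl.bi']
print(plus)   # ['cl.bi']
print(exact)  # ['cl.bi']
```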

The * operator tells the engine to match its preceding token "zero or more" times. In the first case, the engine tries matching a literal dot "zero or more" times, which might be none at all.
Essentially, if you use the * operator, it will still match if there are no instances of (.)
A better visualization:
* --→ equivalent to {0,} --→ match preceding token (0 or more times)
\\.* --→ equivalent to \\.{0,} --→ match ., .., ..., etc. or an empty match

The quantifier * means zero or more times. Pay attention to the zero. It applies to the preceding token, which is \. in your case.
In short, the cl part matches, and the dot after it isn't required.
Here are the matched substrings for both cases:
claim
--
cl.bi
---
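The matched substrings can be inspected directly. This is a small sketch in Python (used here as a stand-in for R) that prints what the pattern actually consumes in each string:

```python
import re

pattern = re.compile(r"^cl\.*")

# Map each input to the exact substring the pattern matched.
matched = {s: pattern.search(s).group() for s in ["claim", "cl.bi"]}

print(matched)  # {'claim': 'cl', 'cl.bi': 'cl.'}
```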

To simplify what the others have said: '^cl\\.*' is just equivalent to '^cl', since the * matches 0+ occurrences of the \\.
Whereas '^cl\\.' forces it to match an actual dot. It is equivalent to '^cl\\.{1}'

Related

Regular expression - Allow period('.') in the middle of the string but not at the end

I am using a regular expression to allow and reject strings based on the criteria--
The expression used-
^([\w\.,'()#&-]|\s)*$
Allows-
exmple_
example(ggg)
exam.pl56e
exam.pl56e.hbhbh.
exampleghh. vgvj
example (bb)ste kklk ae
_example_
Currently, it allows adding period in the middle of the string as well as at the end.
I just want to reject string if the period is added at the end of the string but allow it to be added in the middle using the above regular expression
For example, reject-
Test.test1.
Example.
Test Test.
test#example.
exam.pl56e.hbhbh.
You may use a single character class in the pattern (merge \s with the previous character class) to simplify the pattern, and use either
^([\w.,'()#&\s-]*[\w,'()#&\s-])?$
See the regex demo.
Details
^ - start of string
([\w.,'()#&\s-]*[\w,'()#&\s-])? - an optional sequence (if you want to match at least 1 char, remove ( and )?) of:
[\w.,'()#&\s-]* - 0+ word, ., ,, ', (, ), #, &, whitespace or hyphen chars
[\w,'()#&\s-] - a word, ,, ', (, ), #, &, whitespace or hyphen chars (but no .!)
$ - end of string
Or, a lookbehind version:
^[\w.,'()#&\s-]*$(?<!\.)
It matches a string that only consists of the chars inside the character class, and after the end of string is matched, the lookbehind checks if the last char is a dot. If it is, the match is failed.
Or, a lookahead
^(?!.*\.$)[\w.,'()#&\s-]*$
Here, (?!.*\.$) checks if the string ends with . after any 0+ chars, and if it does, no match is returned. Else, the string is matched against the [\w.,'()#&\s-]* pattern.
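A quick way to compare the three variants is to run them over the sample strings from the question. The sketch below uses Python's re module as an assumption (the question does not name an engine), and the names PATTERNS, ALLOW, REJECT and accepts are made up for illustration:

```python
import re

PATTERNS = {
    "two classes": r"^([\w.,'()#&\s-]*[\w,'()#&\s-])?$",
    "lookbehind": r"^[\w.,'()#&\s-]*$(?<!\.)",
    "lookahead": r"^(?!.*\.$)[\w.,'()#&\s-]*$",
}

ALLOW = ["exmple_", "example(ggg)", "exam.pl56e", "exampleghh. vgvj",
         "example (bb)ste kklk ae", "_example_"]
REJECT = ["Test.test1.", "Example.", "Test Test.", "test#example.",
          "exam.pl56e.hbhbh."]

def accepts(pattern, text):
    """True if the anchored pattern matches the whole text."""
    return re.search(pattern, text) is not None

# For each variant: (all ALLOW strings accepted, no REJECT string accepted)
results = {
    name: (all(accepts(p, s) for s in ALLOW),
           not any(accepts(p, s) for s in REJECT))
    for name, p in PATTERNS.items()
}

for name, outcome in results.items():
    print(name, outcome)
```

Each variant should report (True, True) on these samples. Lookbehind support varies between engines, so the first or third form is the more portable choice.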
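Just specify that the last character cannot be a period. Note that [^.] accepts any character other than a period, including characters outside the allowed set, and it requires the string to contain at least one character.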
^([\w\.,'()#&-]|\s)*[^.]$
A nice trick I've learned is to blacklist certain otherwise-allowed expressions by placing them on their own in an unmatched alternation in front of the matched one.
# sentences containing `foo` or `bar` but not the word `foobar`
^.*foobar.*$|(^.*foo.*$)|(^.*bar.*$)
This is admittedly a bit...verbose here:
^(?:[\w\.,'()#&-]|\s)*\.$|^([\w\.,'()#&-]|\s)*$
So it might be better to use a negative lookbehind
^([\w\.,'()#&-]|\s)*$(?<!\.)
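To make the alternation trick concrete: the blacklisted alternative has no capture group, so a match that leaves every group unset means the input hit the blacklist. A minimal Python sketch of that check (the function name allowed is hypothetical):

```python
import re

# First alternative (no group) swallows anything containing "foobar";
# the captured alternatives are the ones we actually want.
PATTERN = re.compile(r"^.*foobar.*$|(^.*foo.*$)|(^.*bar.*$)")

def allowed(text):
    """True only if one of the capturing alternatives matched."""
    m = PATTERN.search(text)
    return m is not None and m.lastindex is not None

checks = {
    "a foo here": allowed("a foo here"),        # captured by group 1
    "bar only": allowed("bar only"),            # captured by group 2
    "a foobar here": allowed("a foobar here"),  # matched, but no group set
    "nothing": allowed("nothing"),              # no match at all
}
print(checks)
```

The caller has to inspect the groups, not just whether the regex matched, which is why the lookbehind form is usually simpler when the engine supports it.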
You could use a negative lookahead (?!) to assert that what follows are not the characters in the character class repeated zero or more times ending with a dot at the end of the string.
^(?![\w\.,'()#&\s-]*\.$)[\w\.,'()#&\s-]*$
Note that with the asterisk * the character class matches zero or more times, so the empty string also matches.

Match 3 and 4 delimiters and between them; not less not more

I have a command-line program whose first argument ( = argv[ 1 ] ) is a regex pattern.
./program 's/one-or-more/anything/gi/digit-digit'
So I need a regex to check whether the input entered by the user is correct. This would be easy to solve, but I use the C++ library and std::regex_match, and that function behaves as if the pattern were anchored at the beginning and end of the string (^ and $), so the non-greedy quantifier is effectively ignored.
Let me clarify. If I want to match /anything/ then I can use /.*?/, but std::regex_match treats this pattern as ^/.*?/$, so if the user enters /anything/anything/anyhting/, std::regex_match still returns true even though the input is not correct. std::regex_match only returns true or false, and the expected input from the user can only be text that follows the pattern. Since the input varies, I cannot list all possibilities here, but I will give some examples.
Should match
/.//
s/.//
/.//g
/.//i
/././gi
/one-or-more/anything/
/one-or-more/anything/g/3
/one-or-more/anything/i
/one-or-more/anything/gi/99
s/one-or-more/anything/g/4
s/one-or-more/anything/i
s/one-or-more/anything/gi/54
and anything that looks like this pattern
Rules:
delimiters are /|##
the letter s at the beginning and g, i and 2 digits at the end are optional
std::regex_match returns true if the entire target character sequence can be matched, otherwise it returns false
between first and second delimiter can be one-or-more +
between second and third delimiter can be zero-or-more *
between third and fourth can be g or i
At least 3 delimiters must be matched, as in /.//, and not fewer, so /./ should not match
ECMAScript 262 is allowed for the pattern
NOTE
You may want to see my question about std::regex_match:
std::regex_match and lazy quantifier with strange behavior
I don't need any C++ code, I just need a pattern.
Do not try d?([/|##]).+?\1.*?\1[gi]?[gi]?\1?d?\d?\d?. It fails.
My attempt so far: ^(?!s?([/|##]).+?\1.*?\1.*?\1)s?([/|##]).+?\2.*?\2[gi]?[gi]?\d?\d?$
If you are willing to try, you should put ^ and $ around your pattern
If you need more details, please leave a comment and I will update the question.
Thanks.
You could use this regular expression:
^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$
See regex101.com
Note how this also rejects these cases:
///anything/
/./anything/gg
/./anything/ii
/./anything/i/12
How it works:
Some explanation of the parts that are different:
((?!\1).): this matches any character that is not the delimiter. This way you can keep track of the exact number of delimiters used. It also prevents the first character after the first delimiter from being that delimiter again, which should not be allowed.
(gi?|ig): matches any of the valid modifier combinations, except a sole i, which is treated separately. So this also excludes gg and ii as valid character sequences.
(\1\d\d?)?: optionally allows for an extra delimiter (after a g modifier -- see previous) to be added with one or two digits following it.
(...|i)?: for the case where there is no g modifier present, but just the i or none: then no digits are allowed to follow.
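The claims above can be checked against the question's examples. The sketch below uses Python's re.fullmatch as a rough stand-in for std::regex_match (an assumption: both require the whole input to match, and the constructs used here behave the same in Python as in the ECMAScript grammar for these inputs). The list names are made up:

```python
import re

PATTERN = re.compile(r"^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$")

SHOULD_MATCH = [
    "/.//", "s/.//", "/.//g", "/.//i", "/././gi",
    "/one-or-more/anything/", "/one-or-more/anything/g/3",
    "/one-or-more/anything/i", "/one-or-more/anything/gi/99",
    "s/one-or-more/anything/g/4", "s/one-or-more/anything/i",
    "s/one-or-more/anything/gi/54",
]
SHOULD_FAIL = [
    "/./", "/anything/anything/anyhting/",
    "///anything/", "/./anything/gg", "/./anything/ii", "/./anything/i/12",
]

def is_valid(text):
    """Whole-string match, like std::regex_match."""
    return PATTERN.fullmatch(text) is not None

accepted = [s for s in SHOULD_MATCH if is_valid(s)]
wrongly_accepted = [s for s in SHOULD_FAIL if is_valid(s)]

print(len(accepted), "of", len(SHOULD_MATCH), "accepted")
print("wrongly accepted:", wrongly_accepted)
```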
This is a tricky one, but I took the challenge - here is what I have ended up with:
^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$
Pattern breakdown:
^ matches start of string
s? matches an optional 's' character
([\/|##]) matches the delimiter characters and captures as group 1
(?:(?!\1).)+ matches anything other than the delimiter character one or more times (uses negative lookahead to make sure that the character isn't the delimiter matched in group 1)
\1 matches the delimiter character captured in group 1
(?:(?!\1).)* matches anything other than the delimiter character zero or more times
\1 matches the delimiter character captured in group 1
(?: starts a new group
i matches the i character
| or
(?:gi?|ig) matches either g, gi, or ig
(\1\d{1,2})? followed by an optional extra delimiter and 0-9 once or twice
)? closes group and makes it optional
$ matches end of string
I have used non-capturing groups throughout - these are groups that start with ?:
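Because everything else is non-capturing, only two groups remain: the delimiter and the optional count suffix. A short Python sketch (Python's re is assumed here purely for illustration; the helper name parts is hypothetical) shows what they contain:

```python
import re

PATTERN = re.compile(
    r"^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$"
)

def parts(text):
    """Return (delimiter, count suffix) or None if the input is invalid."""
    m = PATTERN.fullmatch(text)
    return None if m is None else (m.group(1), m.group(2))

with_count = parts("s/one-or-more/anything/gi/54")
pipe_delim = parts("|a|b|g|7")
no_suffix = parts("/.//")
invalid = parts("/./anything/i/12")

print(with_count)  # ('/', '/54')
print(pipe_delim)  # ('|', '|7')
print(no_suffix)   # ('/', None)
print(invalid)     # None
```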

Regex with more than one OR/AND operator

I'm trying to match text that is:
a combination of numbers and letters, and might contain [:,.]
OR
a * character plus at least one number OR letter (not necessarily in this order)
Meaning my regex should match all these
Bf1305020008401 6798ubbii230693
Nettbank til: Troij iudh Betalt: 03.05.13
7509*30.04
*87589
but not these:
0205
252,25
Yes, regex alternation with | does not have the meaning in a character group (e.g. [a-z|0-9]) that it does elsewhere in a pattern. (Think of it as implied between characters & character ranges within a character group, making it redundant.)
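A tiny Python sketch makes the difference visible: inside the brackets the | is just another literal character, whereas outside it separates alternatives.

```python
import re

text = "a|7-"

# Inside a character class, | is a literal pipe character.
in_class = re.findall(r"[a-z|0-9]", text)

# Outside a class, | separates two alternatives.
alternation = re.findall(r"[a-z]|[0-9]", text)

print(in_class)     # ['a', '|', '7']
print(alternation)  # ['a', '7']
```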
Pattern
This pattern should do what you need:
^((?=^.{0,}[0-9])(?=^.{0,}[a-zA-Z])[0-9a-zA-Z :,.]{2,}|(?!^\*$)(?=^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,})(?!^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,}\*)[*0-9.a-zA-Z]{2,})$
It matches...
Bf1305020008401 6798ubbii230693
Nettbank til: Troij iudh Betalt: 03.05.13
7509*30.04
*87589
...and does not match...
0205
252,25
...as you require.
You can try the pattern with the inputs you specified in a regex fiddle.
Explanation
Some explanation for the 1st subpattern (on the left side of the |) matching your 1st set of match criteria:
(?=^.{0,}[0-9]) - Assert that a number appears in the string.
(?=^.{0,}[a-zA-Z]) - Assert that a letter also (i.e. AND) appears in the string.
[0-9a-zA-Z :,.]{2,} - "a combination of numbers and letters, and might contain [ :,.]" (assuming the aforementioned assertions)
Similarly, some explanation for the 2nd subpattern (on the right side of the |) matching your 2nd set of match criteria:
(?!^\*$) - Assert that the string is not just *.
(?=^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,}) - Assert that the string contains *.
(?!^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,}\*) - Assert that the string does not contain more than one *.
[*0-9.a-zA-Z]{2,} - "a * character + atleast one number OR letter (not necessarily in this order)" (assuming the aforementioned assertions)
There is probably room to sand & polish the pattern - especially the lookahead assertions for * in the second subpattern I suspect; but it works and conveys the strategy I employed of multiple lookahead assertions to constrain each of the two subpatterns to fit your requirements.
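For anyone without a regex fiddle at hand, the pattern can be exercised with a few lines of Python (an assumption, since the question does not name an engine; the list names are invented for this sketch):

```python
import re

PATTERN = re.compile(
    r"^((?=^.{0,}[0-9])(?=^.{0,}[a-zA-Z])[0-9a-zA-Z :,.]{2,}"
    r"|(?!^\*$)(?=^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,})"
    r"(?!^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,}\*)[*0-9.a-zA-Z]{2,})$"
)

SHOULD_MATCH = [
    "Bf1305020008401 6798ubbii230693",
    "Nettbank til: Troij iudh Betalt: 03.05.13",
    "7509*30.04",
    "*87589",
]
SHOULD_FAIL = ["0205", "252,25"]

matched = [s for s in SHOULD_MATCH if PATTERN.search(s)]
wrongly_matched = [s for s in SHOULD_FAIL if PATTERN.search(s)]

print(matched == SHOULD_MATCH)  # True
print(wrongly_matched)          # []
```

Each lookahead acts as an independent condition on the same starting position, which is how the AND between "contains a digit" and "contains a letter" is expressed.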
As you commented below, I think you do want a full-line match, and by "numbers and letters" I take it that both digits and letters must occur in the match.
And by "a * character + at least one number OR letter" I suppose "*" occurs only once in the match.
Maybe you could try this one:
(^(?=.*[a-zA-Z]+)(?=.*[0-9]+)[0-9a-zA-Z :,.]+$)|(^[a-zA-Z0-9.]*\*[a-zA-Z0-9.]+$)|(^[a-zA-Z0-9.]+\*[a-zA-Z0-9.]*$)
It matches:
Bf1305020008401 6798ubbii230693
Nettbank til: Troij iudh Betalt: 03.05.13
7509*30.04
*87589
123456*
.*.
test123
123test
But won't match any of:
0205
252,25
*
123*345*789
rebound
test
123
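The two lists above can be verified mechanically. This is a minimal Python sketch (Python's re is assumed for illustration, and the variable names are hypothetical):

```python
import re

PATTERN = re.compile(
    r"(^(?=.*[a-zA-Z]+)(?=.*[0-9]+)[0-9a-zA-Z :,.]+$)"
    r"|(^[a-zA-Z0-9.]*\*[a-zA-Z0-9.]+$)"
    r"|(^[a-zA-Z0-9.]+\*[a-zA-Z0-9.]*$)"
)

MATCHES = [
    "Bf1305020008401 6798ubbii230693",
    "Nettbank til: Troij iudh Betalt: 03.05.13",
    "7509*30.04", "*87589", "123456*", ".*.", "test123", "123test",
]
NON_MATCHES = ["0205", "252,25", "*", "123*345*789", "rebound", "test", "123"]

missed = [s for s in MATCHES if not PATTERN.search(s)]
unexpected = [s for s in NON_MATCHES if PATTERN.search(s)]

print(missed)      # []
print(unexpected)  # []
```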
Original:
This should work
(^[A-Za-z0-9 ]*(([A-Za-z]+[ ]*[0-9]+)|([0-9]+[ ]*[A-Za-z]+))[A-Za-z0-9 ]*$)|(^\*[A-Za-z0-9]+$)

Deciphering a simple regex

The regular expression in question is
(\d{3,4}[.-]?)+
sample text
707-7019-789
My progress so far
( )+ a capturing group, capturing one or more
\d{3,4} digit, in quantities 3 or 4
[.-]? dot (or something) or hyphen, in quantities zero or one <-- this is the part I'm interested in
From my understanding this should match a 3 or 4 digit number, followed by a dot (or anything, since dot matches anything) or a hyphen, bundled in a group, one or more times. Why doesn't this match
707+123-4567
then?
. in a character group [] is just a literal ., it does not have the special meaning "anything". [.-]? means "a dot or a hyphen or nothing", because the entire group is made optional with the ?.
[.-]?
What this means literally:
character class [.-]
Match only one out of the following characters: . and - literally.
optional quantifier ?
Match the preceding token between 0 and 1 times, as many times as possible (this quantifier is greedy; the lazy form would be ??).
The brackets remove the functionality of the dot.
Brackets mean "Range"/"Character class".
Thus you are saying Choose from the list/range/character class .-
You aren't saying choose from the list "anything"- (anything is the regular meaning of .)
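A short Python sketch (Python's re is an assumption here, the question does not name an engine) shows both effects: the dot inside brackets is literal, and the pattern from the question stops at the + sign.

```python
import re

pattern = re.compile(r"(\d{3,4}[.-]?)+")

# Whole-string checks against the two sample inputs.
full_ok = pattern.fullmatch("707-7019-789") is not None
full_plus = pattern.fullmatch("707+123-4567") is not None

# Matching from the start stops before the "+", which [.-] cannot consume.
prefix = pattern.match("707+123-4567").group()

# Inside brackets the dot is literal; outside it matches any character.
dot_in_class = re.findall(r"[.-]", "a.b-c+d")
bare_dot = re.findall(r".", "a+")

print(full_ok, full_plus)  # True False
print(prefix)              # 707
print(dot_in_class)        # ['.', '-']
print(bare_dot)            # ['a', '+']
```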

Regular Expression to match set of arbitrary codes

I am looking for some help on creating a regular expression that would work with a unique input in our system. We already have some logic in our keypress event that will only allow digits, and will allow the letter A and the letter M. Now I need to come up with a RegEx that can match the input during the onblur event to ensure the format is correct.
I have some examples below of what would be valid. The letter A represents an age, so it is always followed by up to 3 digits. The letter M can only occur at the end of the string.
Valid Input
1-M
10-M
100-M
5-7
5-20
5-100
10-20
10-100
A5-7
A10-7
A100-7
A10-20
A5-A7
A10-A20
A10-A100
A100-A102
Invalid Input
a-a
a45
4
This matches all of the samples.
/A?\d{1,3}-A?\d{0,3}M?/
Not sure if 10-A10M should or shouldn't be legal or even if M can appear with numbers. If M can only appear on its own, without numbers:
/A?\d{1,3}-(A?\d{1,3}|M)/
Use the brute force method if you have a small amount of well defined patterns so you don't get bad corner-case matches:
^(\d+-M|\d+-\d+|A\d+-\d+|A\d+-A\d+)$
Here are the individual regexes broken out:
\d+-M <- matches anything like '1-M'
\d+-\d+ <- 5-7
A\d+-\d+ <- A5-7
A\d+-A\d+ <- A10-A20
/^[A]?[0-9]{1,3}-[A]?[0-9]{1,3}[M]?$/
Matches anything of the form:
A(optional)[1-3 numbers]-A(optional)[1-3 numbers]M(optional)
^A?\d+-(?:A?\d+|M)$
An optional A followed by one or more digits, a dash, and either another optional A and some digits or an M. The '(?: ... )' notation is a Perl 'non-capturing' set of parentheses around the alternatives; it means there will be no '$1' after the regex matches. Clearly, if you wanted to capture the various bits and pieces, you could - and would - do so, and the non-capturing clause might not be relevant any more.
(You could replace the '+' with '{1,3}' as JasonV did to limit the numbers to 3 digits.)
^A?\d{1,3}-(M|A?\d{1,3})$
^ -- the match must be done from the beginning
A? -- "A" is optional
\d{1,3} -- between one and 3 digits; [0-9]{1,3} also works
- -- A "-" character
(...|...) -- Either one of the two expressions
(M|...) -- Either "M" or...
(...|A?\d{1,3}) -- an optional "A" followed by at least one and at most three digits
$ -- the match should be done to the end
Some consequences of changing the format. If you do not put "^" at the beginning, the match may ignore an invalid beginning. For example, "MAAMA0-M" would be matched at "A0-M".
If, likewise, you leave $ out, the match may ignore an invalid trail. For example, "A0-MMMMAAMAM" would match "A0-M".
Using \d is usually preferred, as is \w for alphanumerics, \s for spaces, \D for non-digit, \W for non-alphanumeric or \S for non-space. But you must be careful that \d is not being treated as an escape sequence. You might need to write it \\d instead.
{x,y} means the last match must occur between x and y times.
? means the last match must occur once or not at all.
When using (), it is treated as one match. (ABC)? will match ABC or nothing at all.
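The effect of dropping an anchor is easy to reproduce. The question is about a browser onblur handler, so the following is only a Python sketch of the same regex behaviour (variable names are made up):

```python
import re

anchored = re.compile(r"^A?\d{1,3}-(M|A?\d{1,3})$")
no_start = re.compile(r"A?\d{1,3}-(M|A?\d{1,3})$")   # "^" removed
no_end = re.compile(r"^A?\d{1,3}-(M|A?\d{1,3})")     # "$" removed

# Without "^" the invalid prefix is skipped over.
skipped_prefix = no_start.search("MAAMA0-M").group()

# Without "$" the invalid trailer is ignored.
ignored_trailer = no_end.search("A0-MMMMAAMAM").group()

# With both anchors, neither input is accepted.
strict = [anchored.search(s) is not None
          for s in ("MAAMA0-M", "A0-MMMMAAMAM", "A0-M")]

print(skipped_prefix)   # A0-M
print(ignored_trailer)  # A0-M
print(strict)           # [False, False, True]
```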
I’d use this regular expression:
^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$
This matches either:
<number>-M or <number>-<number>
A<number>-<number> or A<number>-A<number>
Additionally <number> must not begin with a 0.
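A minimal Python sketch (an assumption; the original context is a JavaScript onblur check) runs this pattern over the question's samples plus a few leading-zero inputs that were added here for illustration:

```python
import re

PATTERN = re.compile(
    r"^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$"
)

VALID = ["1-M", "10-M", "100-M", "5-7", "5-20", "5-100", "10-20", "10-100",
         "A5-7", "A10-7", "A100-7", "A10-20", "A5-A7", "A10-A20",
         "A10-A100", "A100-A102"]
INVALID = ["a-a", "a45", "4"]
LEADING_ZERO = ["05-7", "5-07", "A0-7"]  # illustrative additions, not from the question

rejected_valid = [s for s in VALID if not PATTERN.search(s)]
accepted_invalid = [s for s in INVALID + LEADING_ZERO if PATTERN.search(s)]

print(rejected_valid)    # []
print(accepted_invalid)  # []
```

Note that with this pattern M is only accepted after a plain number, so an input such as A5-M is rejected.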