RE2 (Rust) regular expression doesn't work as expected

RE2 (Rust) regular expression doesn't work as expected - regex

I have a regex that seemingly is straightforward but does not act as required. The input to be parsed is described as follows (nb: {} are not part of the regex, only what's inside):
A sequence of 0 or more spaces {\s*}
A dash {-}
A sequence of 0 or more spaces {\s*}
A full person's name (first name, middle names, surname; all captured into f1). The name must not start with a number
must appear at the end of the line {[A-Za-z][\w\s]*)}
The whole construct SPACE-SPACEf1 is optional
Just to explain what is captured into f1:
For the first char, I'm using the set of chars represented by [A-Za-z]. Followed by \w or space 0 or more times. This is captured into f1.
(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?$
I expect the following sequences to match and capture a value into f1:
" - Bruce" (f1=Bruce)
" - Bruce Dickinson" (f1=Bruce Dickinson)
I expect the following to not match:
"Bruce" (there is no leading dash)
" - Bruce!" (there is a non word (\w) character after the name and before end of line
I expect the following match but not capture a value into f1 (I would prefer it to not match though):
" - 1Bruce" (leading character is numeric)
These are the actual results:
" - Bruce" (f1=Bruce) Tick; this works
" - Bruce Dickinson" (f1=Bruce Dickinson) Tick; this works
"Bruce" (f1= not captured, but expression is a match. This is wrong, because Bruce doesn't match the optional part, and $ comes next which doesn't match Bruce)
" - Bruce!" (f1= not cpatured, but expression is a match; this is wrong, because of the !, which means that match does not appear at the end of line.
I expect that:
(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?
would consume { - Bruce}, which should leave !, which should fail because of the next regex token being $; however, the computer says no, so I'm wrong but I don't know why :(
" - 1Bruce" (f1= not captured, but expression is match. This is understandable because the whole {space dash space f1} sequence is optional and because it doesn't match, that construct is skipped and then there is nothing else to process on the input; we hit the end of line)
If I can get this to work, I can get the rest of my expression to work the way I want it to. I need somebody else to jolt me into thinking about this differently. I've spent 2 days on this with no positive output, so very frustrating.
PS: I am using regex101.com to test regexes. The regexes will be used as part of a Rust application whose regex engine is based on google's RE2.
Eventually, I need to be able to recognise a sequence of names delimited by &, and the whole expression is optional by the use of ? and must appear at the end of line $.
So
{ - Bruce & Nicko & Dave Murray } would be valid
and
{ - Bruce & Nicko & Dave Murray & } should not be valid and NOT match
But 1 step at a time!

The point here is that you cannot match and not match something at the same time. If you make the whole pattern optional, and the end of string obligatory, even if there is nothing of interest the end of string will be matched - always.
The way out is to think of a subpattern you are interested in. You are interested in the names, so, make the first letter obligatory. The hyphen seems to be obligatory in all test cases you supplied, too. Everything else can be optional:
\s*-\s*(?P<f1>([^\W\d_])\w*(?:\s+\w+)*)(?:\s*&\s*(?P<f2>([^\W\d_])\w*(?:\s+\w+‌)*))*$
See the regex demo (the \s is replaced with \h and \n added to the negated character classes just for demo purposes as it is a multiline demo).
Note that I replaced [a-zA-Z] with [^\W\d_] to make the pattern more flexible ([^\W\d_] just matches any letter).

Related

Regex that gets a sentence containing a name

I am creating regexes that get the whole sentence if a piece of specific information exists. Right now I am working on my name regex, so if there is any composed name (example: "Jorge Martel", "Jorge Martel del Arnold Albuquerque") the regex should get the whole sentence that has the name.
If I have these two sentences:
(1) - "A hardworking guy is working at the supermarket. They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
The regex should return these two results from the sentences above:
(1) - "They call him Jorge Horizon, but that's not his real name."
(2) - "He has an identity document that contains the name, Jorge Martel Arnold."
This is my regex:
(?:(?(?<=[\.!?]\s([A-Z]))(.+?[^.])|))?((?:(?:[A-Z][A-zÀ-ÿ']+\s(?:(?:(?:[A-zÀ-ÿ']{1,3}\s)?(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']*\s?))+))\b)(.+?[\.!?](?:\s|\n|\Z)))
Basically, it verifies if there is a dot, exclamation, or interrogation symbol with a blank space and an upper case character and tells the regex that everything must be select, else it should get all the sentence.
My else case (|) right now is empty, because using (.+?) avoids my first condition...
Regex without the else case:
Validates until the dot, but doesn't get the second sentence.
Regex with the else case:
Validates the second sentence, but overrides the first condition that appears in the first sentence.
I expect my regex to return correctly the sentences:
"They call him Jorge Horizon, but that's not his real name."
"He has an identity document that contains the name, Jorge Martel Arnold."
I have also created a text to validate the regex operations as I will be using it a lot in texts. I added a lot of conditions in this text, which will probably appear in my daily work.
Check my regex, sentence, and text here:
Does anyone know what should I change in my regex? I have tried many variations and still cannot find the solution.
P.S.: I intend to use it in my python code, but I need to fix it with the regex and not with the python code.

you can try this.
[\w\ \,\']+\.\ ?([\w\ \,\']+\.)|^([\w\ \,\']+\.)$
prints $1$2. I.e if group one is empty it prints blank since there is no match, then will print group 2. Visa versa, it prints group 1 when group 2 is not there.
[\w\ ,']+.\ ?([\w\ ,']+.) - as matching anything with XXX. XXX.
then
^([\w\ ,']+.)$ - must start end with only 1 sentence.
Though honestly this can easily be done with a Tokenizer of (.) that check length of 1 or 2. It' really like using a sledgehammer to hammer a nail.

Matching names can be a very hard job using a regex, but if you want to match at least 2 consecutive uppercase words using the specified ranges.
Assuming the names start with an uppercase char A-Z (else you can extend that character class as well with the allowed chars or if supported use \p{Lu} to match an uppercase char that has a lowercase variant):
(?<!\S)[A-Z][A-Za-zÀ-ÿ]*(?:\s+[a-zÀ-ÿ,]+)*\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]*.*?[.!?](?!\S)
(?<!\S) Assert a whitespace boundary to the left
[A-Z][A-Za-zÀ-ÿ]* Match an uppercase char A-Z optionally followed by matching the defined ranges
(?:\s+[a-zÀ-ÿ,]*)* Optionally repeat matching 1+ whitespace chars and 1 or more of the ranges
\s+[A-Z][a-zÀ-ÿ]*\s+[A-Z][a-zÀ-ÿ,]* Match 2 times whitespace chars followed by an uppercase A-Z and optional chars defined in the character class
.*?[.!?] Match as least as possible chars followed by one of . ! or ?
(?!\S) Assert a whitspace boundary to the right
Regex demo

Try this:
((?:^|(?:[^\.!?]*))[^\.!?\n]*(?:(?:[A-ZÀ-Ÿ][A-zÀ-ÿ']+\s?){2,}[^\.!?]*[\.!?]))
It will capture sentences where name has at least two words, e.g. His name is John Smith.
It won't capture sentences like: John went to a concert.

Match 3 and 4 delimiters and between them; not less not more

I have a command-line program that its first argument ( = argv[ 1 ] ) is a regex pattern.
./program 's/one-or-more/anything/gi/digit-digit'
So I need a regex to check if the entered input from user is correct or not. This regex can be solve easily but since I use c++ library and std::regex_match and this function by default puts begin and end assertion (^ and $) at the given string, so the nan-greedy quantifier is ignored.
Let me clarify the subject. If I want to match /anything/ then I can use /.*?/ but std::regex_match considers this pattern as ^/.*?/$ and therefore if the user enters: /anything/anything/anyhting/ the std::regex_match still returns true whereas the input-pattern is not correct. The std::regex_match only returns true or false and the expected pattern form the user can only be a text according to the pattern. Since the pattern is various, here, I can not provide you all possibilities, but I give you some example.
Should be match
/.//
s/.//
/.//g
/.//i
/././gi
/one-or-more/anything/
/one-or-more/anything/g/3
/one-or-more/anything/i
/one-or-more/anything/gi/99
s/one-or-more/anything/g/4
s/one-or-more/anything/i
s/one-or-more/anything/gi/54
and anything look like this pattern
Rules:
delimiters are /|##
s letter at the beginning and g, i and 2 digits at the end are optional
std::regex_match function returns true if the entire target character sequence can be match, otherwise return false
between first and second delimiter can be one-or-more +
between second and third delimiter can be zero-or-more *
between third and fourth can be g or i
At least 3 delimiter should be match /.// not less so /./ should not be match
ECMAScript 262 is allowed for the pattern
NOTE
May you would need to see may question about std::regex_match:
std::regex_match and lazy quantifier with strange
behavior
I no need any C++ code, I just need a pattern.
Do not try d?([/|##]).+?\1.*?\1[gi]?[gi]?\1?d?\d?\d?. It fails.
My attempt so far: ^(?!s?([/|##]).+?\1.*?\1.*?\1)s?([/|##]).+?\2.*?\2[gi]?[gi]?\d?\d?$
If you are willing to try, you should put ^ and $ around your pattern
If you need more details please comment me, and I will update the question.
Thanks.

You could use this regular expression:
^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$
See regex101.com
Note how this also rejects these cases:
///anything/
/./anything/gg
/./anything/ii
/./anything/i/12
How it works:
Some explanation of the parts that are different:
((?!\1).): this will match any character that is not the delimiter. This way you are sure you can keep track of the exact number of delimiters used. You can this way also prevent that the first character after the first delimiter, is again that delimiter, which should not be allowed.
(gi?|ig): matches any of the valid modifier combinations, except a sole i, which is treated separately. So this also excludes gg and ii as valid character sequences.
(\1\d\d?)?: optionally allows for an extra delimiter (after a g modifier -- see previous) to be added with one or two digits following it.
( |i)?: for the case there is no g modifier present, but just the i or none: then no digits are allowed to follow.

This is a tricky one, but I took the challenge - here is what I have ended up with:
^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$
Pattern breakdown:
^ matches start of string
s? matches an optional 's' character
([\/|##]) matches the delimeter characters and captures as group 1
(?:(?!\1).)+ matches anything other than the delimiter character one or more times (uses negative lookahead to make sure that the character isn't the delimiter matched in group 1)
\1 matches the delimiter character captured in group 1
(?:(?!\1).)* matches anything other than the delimiter character zero or more times
\1 matches the delimiter character captured in group 1
(?: starts a new group
i matches the i character
| or
(?:gi?|ig) matches either g, gi, or ig
(\1\d{1,2})? followed by an optional extra delimiter and 0-9 once or twice
)? closes group and makes it optional
$ matches end of string
I have used non capturing groups throughout - these are groups that start ?:

Regex with more than one OR/AND operator

I'm trying to match text that is:
a combination of numbers and letters, and might contain [:,.]
OR
a * character plus at least one number OR letter (not necessarily in this order)
Meaning my regex should match all these
Bf1305020008401 6798ubbii230693
Nettbank til: Troij iudh Betalt: 03.05.13
7509*30.04
*87589
but not these:
0205
252,25

Yes, regex alternation with | does not have the meaning in a character group (e.g. [a-z|0-9]) that it does elsewhere in a pattern. (Think of it as implied between characters & character ranges within a character group, making it redundant.)
Pattern
This pattern should do what you need:
^((?=^.{0,}[0-9])(?=^.{0,}[a-zA-Z])[0-9a-zA-Z :,.]{2,}|(?!^\*$)(?=^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,})(?!^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,}\*)[*0-9.a-zA-Z]{2,})$
It matches...
Bf1305020008401 6798ubbii230693
Nettbank til: Troij iudh Betalt: 03.05.13
7509*30.04
*87589
...and does not match...
0205
252,25
...as you require.
You can try the pattern with the inputs you specified in a regex fiddle.
Explanation
Some explanation for the 1st subpattern (on the left side of the |) matching your 1st set of match criteria:
(?=^.{0,}[0-9]) - Assert that a number appears in the string.
(?=^.{0,}[a-zA-Z]) - Assert that a letter also (i.e. AND) appears in the string.
[0-9a-zA-Z :,.]{2,} - "a combination of numbers and letters, and might contain [ :,.]" (assuming the aforementioned assertions)
Similarly, some explanation for the 2nd subpattern (on the right side of the |) matching your 2nd set of match criteria:
(?!^\*$) - Assert that the string is not just *.
(?=^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,}) - Assert that the string contains *.
(?!^[0-9.a-zA-Z]{0,}\*[0-9.a-zA-Z]{0,}\*) - Assert that the string does not contain more than one *.
[*0-9.a-zA-Z]{2,} - "a * character + atleast one number OR letter (not necessarily in this order)" (assuming the aforementioned assertions)
There is probably room to sand & polish the pattern - especially the lookahead assertions for * in the second subpattern I suspect; but it works and conveys the strategy I employed of multiple lookahead assertions to constrain each of the two subpatterns to fit your requirements.

As you comment below, I think you dose want a full line match, and by saying number and letter, I think it means digits and letters both occurred in the right match.
And by saying "a * character + atleast one number OR letter" I suppose "*" occurs only once in match.
Maybe you could try this one:
(^(?=.*[a-zA-Z]+)(?=.*[0-9]+)[0-9a-zA-Z :,.]+$)|(^[a-zA-Z0-9.]*\*[a-zA-Z0-9.]+$)|(^[a-zA-Z0-9.]+\*[a-zA-Z0-9.]*$)
It matches:
Bf1305020008401 6798ubbii230693
Nettbank til: Troij iudh Betalt: 03.05.13
7509*30.04
*87589
123456*
.*.
test123
123test
But won't match any of:
0205
252,25
*
123*345*789
rebound
test
123
Original:
This should work
(^[A-Za-z0-9 ]*(([A-Za-z]+[ ]*[0-9]+)|([0-9]+[ ]*[A-Za-z]+))[A-Za-z0-9 ]*$)|(^\*[A-Za-z0-9]+$)

Need a regular expression for alphanumeric with 1 hypen present and space inbetween words

Can you please provide me with a regular expression that would
Allow only alphanumeric
Have definitely only one hyphen in the entire string
Hyphen or spaces not allowed at the front and back of the string
no consecutive space or hyphens allowed.
hypen and one space can be present near each other
Valid - "123-Abc test1","test- m e","abc slkh-hsds"
Invalid - " abc ", " -hsdj sdsd hjds- "
Thanks for helping me out on the same. Your help is much appreciated

/^([a-zA-Z0-9] ?)+-( ?[a-zA-Z0-9])+$/
See demo here.
EDIT:
If there can't be a space on both sides of the hyphen, then there needs to be a little more:
/^([a-zA-Z0-9] ?)+-(((?<! -) )?[a-zA-Z0-9])+$/
^^^^^^^^ ^
Alternatively, if negative lookbehind assertions aren't supported (e.g. in JavaScript), then an equivalent regex:
/^([a-zA-Z0-9]( (?!- ))?)+-( ?[a-zA-Z0-9])+$/
^ ^^^^^^^ ^

Only alphanumeric (hyphen and space included, otherwise it'd make no sense):
^[\da-zA-Z -]+$
This is the main part that will match the string and makes sure that every character is in the given set. I.e. digits and ASCII letters as well as space and hyphen (the use of which will be restricted in the following parts).
Only one hyphen and none at the start or end of the string:
(?=^[^-]+-[^-]+$)
This is a lookahead assertion making sure that the string starts and ends with at least one non-hyphen character. A single hyphen is required in the middle.
No space at the start or end or the string:
(?=^[^ ].*[^ ]$)
Again a lookahead, similar to the one above. They could be combined into one, but it looks much messier and is harder to explain.
No consecutive spaces (consecutive hyphens are ruled out already by 2. above):
(?!.* )
Putting it all together:
(?!.* )(?=^[^ ].*[^ ]$)(?=^[^-]+-[^-]+$)^[\da-zA-Z -]+$
Quick PowerShell test:
PS> $re='(?!.* )(?=^[^ ].*[^ ]$)(?=^[^-]+-[^-]+$)^[\da-zA-Z -]+$'
PS> "123-Abc test1","test- m e","abc slkh-hsds"," abc ", " -hsdj sdsd hjds- " -match $re
123-Abc test1
test- m e
abc slkh-hsds

Use this regex:
^(.+-.+)[\da-zA-Z]+[\da-zA-Z ]*[\da-zA-Z]+$

Regular Expression to match set of arbitrary codes

I am looking for some help on creating a regular expression that would work with a unique input in our system. We already have some logic in our keypress event that will only allow digits, and will allow the letter A and the letter M. Now I need to come up with a RegEx that can match the input during the onblur event to ensure the format is correct.
I have some examples below of what would be valid. The letter A represents an age, so it is always followed by up to 3 digits. The letter M can only occur at the end of the string.
Valid Input
1-M
10-M
100-M
5-7
5-20
5-100
10-20
10-100
A5-7
A10-7
A100-7
A10-20
A5-A7
A10-A20
A10-A100
A100-A102
Invalid Input
a-a
a45
4

This matches all of the samples.
/A?\d{1,3}-A?\d{0,3}M?/
Not sure if 10-A10M should or shouldn't be legal or even if M can appear with numbers. If it M is only there without numbers:
/A?\d{1,3}-(A?\d{1,3}|M)/

Use the brute force method if you have a small amount of well defined patterns so you don't get bad corner-case matches:
^(\d+-M|\d+-\d+|A\d+-\d+|A\d+-A\d+)$
Here are the individual regexes broken out:
\d+-M <- matches anything like '1-M'
\d+-\d+ <- 5-7
A\d+-\d+ <- A5-7
A\d+-A\d+ <- A10-A20

/^[A]?[0-9]{1,3}-[A]?[0-9]{1,3}[M]?$/
Matches anything of the form:
A(optional)[1-3 numbers]-A(optional)[1-3 numbers]M(optional)

^A?\d+-(?:A?\d+|M)$
An optional A followed by one or more digits, a dash, and either another optional A and some digits or an M. The '(?: ... )' notation is a Perl 'non-capturing' set of parentheses around the alternatives; it means there will be no '$1' after the regex matches. Clearly, if you wanted to capture the various bits and pieces, you could - and would - do so, and the non-capturing clause might not be relevant any more.
(You could replace the '+' with '{1,3}' as JasonV did to limit the numbers to 3 digits.)

^A?\d{1,3}-(M|A?\d{1,3})$
^ -- the match must be done from the beginning
A? -- "A" is optional
\d{1,3} -- between one and 3 digits; [0-9]{1,3} also work
- -- A "-" character
(...|...) -- Either one of the two expressions
(M|...) -- Either "M" or...
(...|A?\d{1,3}) -- "A" followed by at least one and at most three digits
$ -- the match should be done to the end
Some consequences of changing the format. If you do not put "^" at the beginning, the match may ignore an invalid beginning. For example, "MAAMA0-M" would be matched at "A0-M".
If, likewise, you leave $ out, the match may ignore an invalid trail. For example, "A0-MMMMAAMAM" would match "A0-M".
Using \d is usually preferred, as is \w for alphanumerics, \s for spaces, \D for non-digit, \W for non-alphanumeric or \S for non-space. But you must be careful that \d is not being treated as an escape sequence. You might need to write it \\d instead.
{x,y} means the last match must occur between x and y times.
? means the last match must occur once or not at all.
When using (), it is treated as one match. (ABC)? will match ABC or nothing at all.

I’d use this regular expression:
^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$
This matches either:
<number>-M or <number>-<number>
A<number>-<number> or A<number>-A<number>
Additionally <number> must not begin with a 0.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

RE2 (Rust) regular expression doesn't work as expected - regex

Related

Regex that gets a sentence containing a name

Match 3 and 4 delimiters and between them; not less not more

Regex with more than one OR/AND operator

Need a regular expression for alphanumeric with 1 hypen present and space inbetween words

Regular Expression to match set of arbitrary codes

Categories

Resources