Issues with Regular Expressions - regex

I understand the concept of repetition 0 or more times (*) and grouping '()' on their own, but I'm having trouble understanding them in practice examples.
For example, (yes)* contains both the empty string and the word 'yes', but not 'y' or 'ss'. I assume it doesn't contain those words because of grouping, but would that mean the word 'yesyes' is also valid, as the group has been repeated?
In contrast, I assume that with the regular expression 'yes*' any character can be repeated, for example 'y', 'ye', 'es', 'yes', 'yy'. However, the solutions we have been provided with state that the word 'y' isn't contained. I'm confused.

Your understanding of (yes)* is correct.
(yes)* matches the string "yes" (exactly, no shorter, no longer) 0 or more times, i.e. the empty string, yes, yesyes, yesyesyesyesyesyes, etc.
But your understanding of yes* is NOT correct.
yes* matches the string "ye" followed by 0 or more "s" characters, i.e. ye, yes, yess, yessssssss.

The "zero or more" * modifier applies only to the character or group immediately preceding it.
In the first example, we have the group (yes)* - this will match '', 'yes', 'yesyes', etc.
In the second example, yes*, the modifier applies only to the letter s. It will match 'ye', 'yes', 'yess', etc.
If this is not clear then perhaps you can elaborate a little on the source of your confusion.
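To check the two readings quickly, here is a minimal Python sketch. It assumes Python's re module and uses re.fullmatch so that the whole string has to match; the helper name full is just for illustration.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# (yes)* repeats the whole group "yes".
group_results = {s: full(r"(yes)*", s) for s in ["", "yes", "yesyes", "y", "ss"]}

# yes* repeats only the final "s"; "ye" must always be present.
char_results = {s: full(r"yes*", s) for s in ["ye", "yes", "yesss", "y", "yy"]}

print(group_results)
print(char_results)
```

The empty string, 'yes' and 'yesyes' are accepted by (yes)*, while 'y' is rejected by both patterns.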

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However it doesn't work correctly, because for above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df")))  # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df")))  # 'x' in first group
False
It seems only to check first group of three letters but it ignores the rest. How to write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ matches from the start to the end of the string
[...] matches one of the given characters
...{3} repeats the preceding element three times
(...)* repeats the group in the parentheses zero or more times
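A small sketch of how that pattern could be used in Python; the names TRIPLES and is_valid are made up for the example, and the inputs are the ones from the question.

```python
import re

# One triple, then zero or more ",triple" groups, anchored at both ends.
TRIPLES = re.compile(r"^[abc]{3}(,[abc]{3})*$")

def is_valid(data: str) -> bool:
    """True when `data` is a comma-separated list of a/b/c triples."""
    return TRIPLES.match(data) is not None

for sample in ["abc,bca,cbb", "ccc,abc,aab,baa", "bcb", "abc,defx,df", "abc,"]:
    print(sample, is_valid(sample))
```

The first three samples are accepted; "abc,defx,df" and the trailing-comma case "abc," are rejected.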
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,defx,df")
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
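To see why the original pattern returned True, it may help to look at what re.match actually consumed. This is only an illustrative sketch; the variable names are mine.

```python
import re

original = r"([abc][abc][abc],)+"
fixed = r"([abc][abc][abc],)*([abc][abc][abc])$"
text = "abc,defx,df"

# re.match anchors at the start only, so a matching prefix is enough.
prefix = re.match(original, text).group(0)

# With "$" the whole string has to be consumed, so the same input fails.
fixed_result = re.match(fixed, text)

# re.match never skips leading characters, which is why "^" is redundant;
# re.search would scan forward instead.
match_at_start = re.match(original, "xabc,")
search_anywhere = re.search(original, "xabc,")

print(repr(prefix), fixed_result, match_at_start, search_anywhere.group(0))
```

Only the prefix 'abc,' is matched by the original pattern, which is enough for bool() to give True.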
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
import re
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# the final $ rejects trailing text such as "df"; a trailing comma is still accepted
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, starting at the beginning of the string (re.match only anchors the start). Whatever follows the last matched group is ignored. So the most important part is to make sure that there is nothing more in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'), repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
import re
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means the end of the string. Its presence forces the match to extend to the very end of the string.
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group.
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. A classic example is the "re.findall behaves weird" issue: re.findall, and all other regex methods that use this function behind the scenes, return only the captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
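The re.findall behaviour described above can be reproduced with the standard library alone. A minimal sketch, with a sample string made up for the purpose:

```python
import re

text = "abcabc xyz abc"

# With a capturing group, findall returns what the group captured
# (its last repetition), not the whole match.
with_capture = re.findall(r"(abc)+", text)

# With a non-capturing group, findall returns the whole match.
without_capture = re.findall(r"(?:abc)+", text)

print(with_capture)
print(without_capture)
```

The capturing version yields ['abc', 'abc'], while the non-capturing version yields ['abcabc', 'abc'].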

Is there a more efficient RegEx for solving Wordle?

So I have a list of all 5 letter words in the English language that I can interrogate when I'm really stuck at Wordle. I found this an excellent exercise for brushing up on my Regular Expressions in BBEDIT, which is what I tell myself I'm doing.
The way wordle works, I can have three conditions.
A letter that is somewhere in the word (and must be present)
A letter that is not present in the word
A letter that is correct in presence and position
Condition 3 is easy. If my start word "crone" has the n in the right place, my pattern is
...n.
And I can add condition 2 fairly easily with
^(?!.*[croe])...n.
If my next guess is "burns" I'll know there's an "s"
^(?!.*[croebur])^(?=.*s)...n.
And that it's not in the last position:
^(?!.*[croebur])^(?=.*s)...n[^s]
If my next (very poor) guess is 'stone' I'll know there's a 't'.
^(?!.*[croebur])^(?=.*s)^(?=.*t)sa.n.
So that's a workable formula.
But if my next guess were "wimpy" I'd know there was an 'i' in the answer, but I have to add an additional ^(?=.*i) which just feels inefficient. I tried grouping the letters that must be in the word by using a bracket set, ^(?=.*[ist]) but of course that will match targets that contain any one of those characters rather than all.
Is there a more efficient way to express the phrase "the word must contain all of the following letters to match" than a series of "start at the beginning, scan for an occurrence of this single character until the end" phrases?
If you enter a word into Wordle, it displays all the matched characters in your word. It also shows the characters which exist in the word but not in the correct order.
Considering these requirements, I think you should create different rules for each letter's place. This way, your regex pattern stays simple, and you get the search results quickly. Let me give an example:
Input word: crone
Matched Characters: ...n.
Characters in the wrong place: -
Next regex search pattern: ^[^crone][^crone][^crone]n[^crone]$
Input word: burns
Matched Characters: ...n.
Characters in the wrong place: s
Next regex search pattern: ^(?=\S*[s]\S*)[^bucrone][^bucrone][^bucrone]n[^bucrones]$ (Note the "s" in the last bracket expression: we know "s" is not in that position.)
Input word: stone
Matched Characters: s..n.
Characters in the wrong place: t
Next regex search pattern: ^(?=\S*[t]\S*)s[^tsbucrone][^sbucrone]n[^sbucrones]$ (Note the "t" in the first bracket expression: we know "t" is not in that position.)
^ => Start of the line
[^abc] => Any character except "a" and "b" and "c"
(?=\S*[t]\S*)=> There must be a "t" character in the given string
(?=\S*[t]\S*)(?=\S*[u]\S*)=> There must be "t" and "u" characters in the given string
$ => End of the line
In a performance test of both patterns against a seven-word sample, my pattern found the result in 130 steps, whereas yours took 175 steps. The difference will grow as the word list grows. You can review it at the following links:
Suggested pattern: https://regex101.com/r/mvHL3J/1
Your pattern: https://regex101.com/r/Nn8EwL/1
Note: You need to click the "Regex Debugger" link in the left sidebar to see the steps.
Note 2: I updated my response to fix the bug in the following comment.
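As far as I know there is no single regex construct for "contains all of these letters", so one lookahead per letter is still needed, but the pattern does not have to be typed by hand. Below is a hedged Python sketch that assembles it from the clues; build_pattern, its parameters and the tiny word list are all hypothetical, and the positional template is written in the question's style rather than the per-position style of this answer.

```python
import re

def build_pattern(template: str, required: str, absent: str) -> str:
    """Assemble a Wordle filter from its three kinds of clue.

    template: positional part, e.g. "s[^t].n." ('.' for unknown positions)
    required: letters that must occur somewhere in the word
    absent:   letters that must not occur anywhere in the word
    """
    # One lookahead per required letter: all of them must succeed.
    lookaheads = "".join(f"(?=.*{re.escape(c)})" for c in required)
    # A single negative lookahead covers every absent letter at once.
    forbidden = f"(?!.*[{re.escape(absent)}])" if absent else ""
    return f"^{forbidden}{lookaheads}{template}$"

# Hypothetical sample; a real run would read the full word list from a file.
words = ["saint", "skint", "stink", "suing", "burns", "crone"]

pattern = build_pattern("s[^t].n.", required="ist", absent="croebuwmpy")
candidates = [w for w in words if re.match(pattern, w)]
print(pattern)
print(candidates)
```

Adding a newly discovered letter then means extending the required string rather than editing the regex.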

how do the regular expressions * and ? metacharacter work?

Hi I'm going through regular expressions but I'm confused about metacharacters, particularly '*' and '?'.
'*' is supposed to match the preceding character 0 or more times.
For example, 'ta*k' supposedly matches 'tak' and 'tk'.
But I wouldn't have thought this to be true at all - here's my reasoning:
for tak:
regexp: I need a 't'
string: I have 't'
regexp: okay, your next character needs to be an 'a'
string: yes it is
regexp: okay, keep giving me characters until your character isn't an 'a'
string: okay. I've just given you 'k'
regexp: okay, your next character needs to be a 'k'
string: I don't have any more characters left!
regexp: fail
for tk:
regexp: I need a 't'
string: I have 't'
regexp: okay, your next character needs to be an 'a'
string: no, it's a 'k'
regexp: fail
Can someone clarify for me why 'ta*k' matches 'tak' and 'tk'?
* does not mean to match a character zero or more times, but an atom zero or more times. A single character is an atom, but so is any grouping.
And * means zero or more. When the regex cursor has "swallowed" the t, the positions are:
in the regex: t|a*k
in the string: t|ak
The regex engine then tries to eat as many a's as possible. Here there is one. After it has swallowed it, the positions are:
in the regex: ta*|k
in the string: ta|k
Then the k is swallowed:
in the regex: ta*k|
in the string: tak|
End of regex, match. Note that the string may have other characters behind, the regex engine doesn't care: it has a match.
In the case where the string is tk, before a* the positions are:
in the regex: t|a*k
in the string: t|k
But * can match an empty sequence of a's, therefore a* is satisfied! Which means the positions then become:
in the regex: ta*|k
in the string: t|k
Rinse, repeat. Now, let's take taak as an input and ta?k as a regex: this will fail, but let's see how...
# before first character
regex: |ta?k
input: |taak
# t
regex: t|a?k
input: t|aak
# a?
regex: ta?|k
input: ta|ak
# k? Oops! No...
regex: |ta?k
input: t|aak
# t? Oops! No...
regex: |ta?k
input: ta|ak
# t? Oops! No...
regex: |ta?k
input: taa|k
# t? Oops! No...
regex: |ta?k
input: taak|
# t? Oops! No... And nothing to read anymore
# FAIL
Which is why it is VERY important to make regexes fail FAST.
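The outcomes of the walkthroughs above can be confirmed with a few lines of Python. This is just a sketch using re.match, which, like the engine described here, does not care about characters after the match.

```python
import re

# ta*k: zero or more a's between t and k; trailing characters are ignored.
star_hits = {s: bool(re.match(r"ta*k", s)) for s in ["tk", "tak", "taaak", "takxyz"]}

# ta?k: at most one a, so "taak" fails as traced above.
optional_hits = {s: bool(re.match(r"ta?k", s)) for s in ["tk", "tak", "taak"]}

print(star_hits)
print(optional_hits)
```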
Because a* means "zero or more instances of a".
When "it" asks for all characters that aren't "a", once it has one, it (roughly) pushes it back into the input stream. (Or it peeks ahead, or it just keeps it, etc.)
First sequence: here's your first non-"a", I'll hold on to that. You need a "k" next, that's what I have.
Second sequence: the next character doesn't need to be an "a"--it may be one or more "a". In this case it's none. I'll hold on to that non-"a". You need a "k"? I got your "k" right here still.
You are one character ahead:
regexp: okay, keep giving me characters until your character isn't an 'a'
string: next character is not an 'a'
regexp: okay, your next character needs to be a 'k'
string: next char is a 'k'
So it works. Note that 'a*' means "0 or more occurrences of 'a'", and not "1 or more occurrences of 'a'". For the latter there's the '+' sign, as in 'a+'.
ta*k means one 't', followed by 0 or more 'a's, followed by one 'k'. So 0 'a' characters would make 'tk' a possible match.
If you want "1 or more" instead of "0 or more", use the + instead of *. That is, ta+k will match 'tak' but not 'tk'.
Let me know if there's anything I didn't explain.
By the way, RegEx doesn't always go left to right. The engine often backtracks, peeks ahead and studies the input. It's really complicated, which is why it's so powerful. If you look at sites such as this one, they sometimes explain what the engine is doing. I recommend their tutorials because that's where I learned about RegEx!
The fundamental thing to remember is that a regular expression is a convenient shorthand for typing out a set of strings. a{1,5} is simply shorthand for the set of strings (a, aa, aaa, aaaa, aaaaa). a* is shorthand for ([empty], a, aa, aaa, ...).
Thus, in effect, when you feed a regular expression to a search algorithm, you are telling it the list of strings to search for.
Consequently, when you feed ta*k to your search algorithm, you are actually feeding it the set of strings (tk, tak, taak, taaak, taaaak, ...).
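The "set of strings" view is easy to try out. The sketch below writes out the first few members of each set by hand and checks them against the pattern with Python's re.fullmatch; the variable names are only for illustration.

```python
import re

# First few members of the set described by ta*k: tk, tak, taak, ...
ta_star_k = ["t" + "a" * n + "k" for n in range(5)]
all_match = all(re.fullmatch(r"ta*k", s) for s in ta_star_k)

# a{1,5} is the finite set a, aa, aaa, aaaa, aaaaa and nothing else.
bounded = ["a" * n for n in range(1, 6)]
bounded_match = all(re.fullmatch(r"a{1,5}", s) for s in bounded)
six_rejected = re.fullmatch(r"a{1,5}", "a" * 6) is None

print(ta_star_k, all_match, bounded_match, six_rejected)
```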
So, yes, it is useful to understand how the search algorithm will work, so that you can offer the most efficient regular expression, but don't let the tail wag the dog.

Regular expression matching any subset of a given set?

Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?
I.e. it should match any string where any of these characters appears at most once, there are no other characters and the relative order of the characters doesn't matter.
Some approaches that arise at once:
1. [a1,...,an]* or (a1|a2|...|an)*- this allows multiple presence of characters
2. (a1?a2?...an?) - no multiple presence, but relative order is important - this matches any subsequence but not subset.
3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. write all possible subsequences (just hardcode all matching strings :)) of course, not acceptable.
I also have a guess that it may be theoretically impossible, because during parsing the string we will need to remember which character we have already met before, and as far as I know regular expressions can check out only right-linear languages.
Any help will be appreciated. Thanks in advance.
This doesn't really qualify for the language-agnostic tag, but...
^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$
see a demo on ideone.com
The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.
This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.
Can't think how to do it with a single regex, but this is one way to do it with n regexes: (I will use 1 2 ... m n etc. for your a's)
^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$
If all of the above match, your string is a subset of 12..mn.
How this works: each line requires the string to consist exactly of:
any number of characters drawn from the set, except a particular one
perhaps that particular one
any number of characters drawn from the set, except a particular one
If this passes when every element in turn is considered as a particular one, we know:
there is nothing else in the string except the allowed elements
there is at most one of each of the allowed elements
as required.
For completeness I should say that I would only do this if I was under orders to "use regex"; if not, I'd track which allowed elements have been seen, and iterate over the characters of the string doing the obvious thing.
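A rough Python sketch of the n-regexes idea, assuming the allowed characters are given as a string of distinct characters; the function name is invented for the example and re.fullmatch stands in for the ^...$ anchors.

```python
import re

def is_subset_by_regexes(text: str, allowed: str) -> bool:
    """Apply one regex per allowed character, as described above."""
    for ch in allowed:
        # Character class of every allowed character except `ch`.
        others = "".join(re.escape(c) for c in allowed if c != ch)
        rest = f"[{others}]*" if others else ""
        # others*, then at most one `ch`, then others* again.
        pattern = f"{rest}(?:{re.escape(ch)})?{rest}"
        if re.fullmatch(pattern, text) is None:
            return False
    return True

for sample in ["ba", "cba", "", "abac", "cc", "abcd"]:
    print(repr(sample), is_subset_by_regexes(sample, "abc"))
```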
Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.
You use a hash (or an array, or whatever) to store if any of your allowed characters has already been seen or not in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.
In pseudo-code:
foreach char a in {a1, ..., an}
    hit[a] = false
foreach char c in string
    if c not in {a1, ..., an} => fail
    if hit[c] => fail
    hit[c] = true
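The same traversal could look like this in Python; a set takes the place of the hit table, and the function name is only illustrative.

```python
def is_subset(text: str, allowed: str) -> bool:
    """True when every character of `text` is allowed and occurs at most once."""
    allowed_set = set(allowed)
    seen = set()
    for c in text:
        # Bail out on a foreign character or on a repeat.
        if c not in allowed_set or c in seen:
            return False
        seen.add(c)
    return True

for sample in ["ba", "cba", "", "abac", "pabc"]:
    print(repr(sample), is_subset(sample, "abc"))
```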
Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:
#!/usr/bin/perl
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}
We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".
If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexps, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':
my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}
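For readers who do not use Perl, here is what I believe is an equivalent Python sketch. Python's re module has no interpolation of compiled patterns, so the inner expression is written out inside the look-ahead; the constant names are made up.

```python
import re

# Each block: one character from the set, not followed later by itself.
NO_REPEATS = re.compile(r"^(?:([abc])(?!.*\1))*$")

# Look-ahead asserts "no repeats"; the rest asserts "exactly three of a, b, c".
PERMUTATION = re.compile(r"^(?=(?:([abc])(?!.*\1))*$)[abc]{3}$")

subset_words = ["ba", "pabc", "abac", "a", "cc", "cba", "abcd", "abbbbc", ""]
subsets = [w for w in subset_words if NO_REPEATS.match(w)]

perm_words = ["ba", "abac", "abc", "acb", "bac", "bca", "cab", "cba", "aab", ""]
permutations = [w for w in perm_words if PERMUTATION.match(w)]

print(subsets)
print(permutations)
```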

How does Sencha Touch matcher work?

I am trying to create a simple matcher that matches any string consisting of alphanumeric characters. I tried the following:
Ext.regModel('RegistrationData', {
    fields: [
        {name: 'nickname', type: 'string'},
    ],
    validations: [
        {type: 'format', name: 'nickname', matcher: /[a-zA-Z0-9]*/}
    ]
});
However, this does not work as expected. I did not find any documentation on what a regular expression in a matcher should look like.
Thank you for help.
I found a blog on sencha.com, where they explain the validation.
I have no idea what sencha-touch is, but it would help if you told us what you are giving to your regex, what you expect it to do, and what it actually does ("does not work as expected" is a bit vague). According to the blog it accepts "regular expression format", so for your simple check, it should be pretty standard.
EDIT:
As a wild guess, maybe you want to use anchors to ensure that the name has really only letters and numbers:
/^[a-zA-Z0-9]*$/
^ is matching the start of the string
$ is matching the end of the string
Your current regex /[a-zA-Z0-9]*/ matches zero or more consecutive letters (a-z, A-Z) or digits anywhere in the string. That's why Joe#2, J/o/e and *89Joe match just as Joe, Joe24 and jOe28 do: they all contain zero or more consecutive occurrences of the respective characters.
If you want your string to contain only the respective characters you have to change the regex according to stema's answer:
/^[a-zA-Z0-9]*$/
But this still has one problem. Due to the *, which means zero or more occurrences, it also matches an empty string, so the correct regex should be:
/^[a-zA-Z0-9]+$/
with + meaning one or more occurrences. This will allow nicknames containing only one lowercase or uppercase character or number, such as a, F or 6.
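The difference between the three patterns can be tried outside Sencha Touch. The sketch below uses Python's re.search as a stand-in for an unanchored regex test; that the framework applies the matcher this way is an assumption based on the answers above, and the nickname list is invented.

```python
import re

names = ["Joe", "Joe#2", "J/o/e", "*89Joe", ""]

# Unanchored: zero occurrences always match somewhere, so everything passes.
unanchored = {n: bool(re.search(r"[a-zA-Z0-9]*", n)) for n in names}

# Anchored with *: only alphanumerics allowed, but the empty string passes.
anchored_star = {n: bool(re.search(r"^[a-zA-Z0-9]*$", n)) for n in names}

# Anchored with +: at least one alphanumeric character is required.
anchored_plus = {n: bool(re.search(r"^[a-zA-Z0-9]+$", n)) for n in names}

print(unanchored)
print(anchored_star)
print(anchored_plus)
```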