I'm unsure of what this RegEx matches:
(a+b)^n(c+d)^m
I know that the + metacharacter means "one or more times the preceding pattern". So, a+ would match one or more as whereas a* also includes the empty string.
But I think that in this case, the RegEx means a or b to the nth time concatenated with c or d to the mth time, so it'd match strings like these:
aaaacc (n=4, m=2)
bbbbbdddd (n=5, m=4)
aaaddddd (n=3, m=5)
bc (n=1, m=1)
aaaaaaaaaaaaccccc (n=12, m=5)
...
Is this correct? If it's not, can anyone provide examples of what this RegEx does match?
It doesn't look like a valid regular expression given the incorrect use of ^
^ should either be inside []'s like this [^a], or at the very start of the regular expression.
+ just means 1 or more occurrence of a character.
If ^n means can be repeated n times then these would be matches:
aaaaaabccccccccd,
aaaaaabaaaaaabaaaaaabccccccccdccccccccd
Apparently (a+b)^n(c+d)^m means "n slots for unordered a's and b's followed by m slots for unordered c's and d's"
e.g. an example of (a+b)^10(c+d)^5 would be: aaaababbbadcccd
If you're using Perl regular expressions with the 'm' option, e.g. /(a+b)^n(c+d)^m/m, the
'^' will match an internal beginning of line. So...
/
(a+b) # Match one or more as followed by b
^n # Match the beginning of a line followed by a literal n.
(c+d) # Match one or more cs followed by d
^m # Match the beginning of a line followed by a literal m.
/mx
(a+b) and (c+d) would be available in $1 and $2.
Related
Can someone please tell me how to generate a regular expression for strings in which number of a's is a multiple of 3? The alphabet set is {a,b}.
I tried to construct a DFA for it first and then derive a RE from that. What I got was ((ba*)(ba*)(ba*))*.
I would write the pattern as:
^b*(?:(?:ab*){3})*$
This matches:
^ from the start of the string
b* optional leading b zero or more times
(?:
(?:ab*){3} match a followed by zero or more b 3 times
)* match this group zero or more times
$ end of the string
Demo
I do not understand how regex string matching works
r2 = r'a[bcd]*b'
m1 = re.findall(r2,"abcbd")
abcb
This falls in line with what was explained in regex
Step 3 The engine tries to match b, but the current position is at the end of the string, so it fails.
How?I do not understand this?
The following regex a[bcd]*b matches the longest substring (because * is greedy):
a starting with a
[bcd]* followed by any number (0: can match empty string) of character in set (b,c,d)
b ending by b
EDIT: following comment, backtracking occurs in following example
>>> re.findall(r2,"abcxb")
['ab']
abc matches a[bcd]*, but x is not expected
a also matches a[bcd]* (because empty string matches [bcd]*)
finally returns ab
Concerning greediness, the metacharacter * after a single character, a character set or a group, means any number of times (the most possible match) some regexp engines accept the sequence of metacharacters *? which modifies the behavior to the least possible, for example:
>>> r2 = r'a[bcd]*?b'
>>> re.findall(r2,"abcbde")
['ab']
Your regular expression requires the match to end in b, therefore everything is matched up to the trailing d. If b were optional, as in a[bcd]*b?, then entire string would be matched.
Let Sigma = {a,b}. The regular expression RE = (ab)(ab)*(aa|bb)*b over Sigma.
Give a string of length 5 in the set denoted by RE.
Correct answer: abaab
My answer: (ab)aab
I placed the parentheses there because they are in the RE. I understand why I don't need to, but is my answer incorrect? I tested it using RegEx, and the expression (ab)aab matched the text abaab, but it did not match when I reversed this.
() is syntax of regex and has its semantic meaning, you may have a look here and here
Similar to ^ or & and other reserved character in regex, you have to special handle to match them using regex, for example: Regex to Match Symbols: !$%^&*()_+|~-=`{}[]:";'<>?,./
Also, specifically in your question context, () should not appear as part of the string as it is not in the charater set (alphabet) {a,b}. And the string you provide has a lengh of 7 instead of 5, so it is correct to say it is wrong.
Your answer is wrong because the parentheses do not belong to your set of symbols. The string (ab)aab cannot be generated using only symbols present in the {a,b} set.
Even more, you were asked to provide a string of 5 symbols but (ab)aab has length 7.
Parentheses have special meaning in regex. They create sub-regexps and capturing groups. For example, (ab)* means ab can be matched any number of times, including zero. Without parentheses, ab* means the regex matches one a followed by any number of bs. That's a different expression.
For example:
the regular expression (ab)* matches the empty string (ab zero times), ab, abab, ababab, abababab and so on;
the regular expression ab* matches a (followed by zero bs), ab, abb, abbb, abbbb and so on.
The first set of parentheses in your example is useless if you are looking only for sub-regexps. Both (ab) and ab expressions match only the ab string. But they can be used to capture the matched part of the string and re-use it either with back references or for replacement.
When parentheses are used for sub-expressions in regular expressions, they are meta-characters, do not match anything in the string. In order to match an open parenthesis character ( (found in the string) you have to escape it in the regex: \(.
Several strings that match the regular expression (ab)(ab)*(aa|bb)*b over Sigma = { 'a', 'b' }: abb, ababb, abababababb, ababababaabbaaaabbb.
The last string (ababababaabbaaaabbb) matches the regex pieces as follows:
ab - (ab)
ababab - (ab)* - ('ab' 3 times)
aabbaaaabb - (aa|bb)* - ('aa' or 'bb', 5 times in total)
b - b
A regex that matches the string (ab)aab is \(ab\)(ab)*(aa|bb)*b but in this case
Sigma = { 'a', 'b', '(', ')' }
Both seem to mean, match 0 or more characters? I don't understand the difference between them, or when to use ? and when to use *. Some examples would help.
In the Formal definition the symbols of regular expressions operators are
. : which is concatenation like a.b.c would match a text having abc . Some times to indicate concatenation simply two symbols are used back to back.
* : match the last symbol 0 or more times, (abc)* would match a null string, abc, abcabc, abcabcabc, but not abcaabc. Known as the Kleen's star.
+ : would match either the left-hand side or the right hand side . (abc + def) would match abc or def. Also the union operator or the | operator is used.
These are applied on a set of symbols sigma, which includes the symbols in your language within other special symbols are the epsilon which denotes the empty string, and the null means no symbols at all. For details see [3]
These are the formal definitions.
When you use applications accepting the POSIX regular expression syntax the meaning of the different operators are like this:
These are the POSIX Basic regular expression operations
. : The dot '.' matches any character like a.c could match abc, axc, amc, aoc anything.
^ : Indicates the start of line. ^abc would match the string which is starting at the line. abc appearing in between the line would not be matched
$ : Indicates the end of line. abc$ would only match the string abc at the end of the line. This would not match any 'abc' in between the lines.
* : Matches the last symbol preceding the '*' 0 or more times. So ab*c would match ac, abc, abbc, abbbc, abbbbbc, abbbbbbbbc etc.
{m, n} : Matches the the preceding symbol atleat 'm' times and at most 'n' times.
ab{2,4}c would not match 'abc', but would match 'abbc', 'abbbc', 'abbbbc', but will not match 'abbbbbc' . So if the number of 'b' is >= 2 and <= 4 it would match.
{m,} : means match the preceding symbol minimum 'm' times, and no limit in the maximum. (note the comma)
{n} : means match the preceding symbol exactly 'n' times. so ab{3}c would only match 'abbbc'.
[symbols] : will match any one of the symbols inside the box braces. like a[xyz]c would match 'axc' , 'ayc', and 'azc' and no other strings
[^symbols] : will match any symbol once which are not inside the box brackets. like a[xyz]c would match any strings 'a.b' with the '.' being any symbol except x, y, z.
These are the POSIX Extended regular expression operators (needs grep -E)
? : Will match the preceding symbol 0 or at most 1 time. so ab?c would match 'ac' and 'abc' only.
+ : Will match the preceding symbol at least 1 time and at most any number times (no upper bounds). Like ab+c would match abc, abbbc, abbbbbc, abbbbbbbbc, etc, but would not match 'ac'
| : Would match either the expression on the left side of the '|' or the right side expression on the right side of the '|'. like (ab+c)|(xy*z) .
Also have a look at the POSIX meta character classes like [:alpha:] represents all the alphabets. [:punct:] denotes all the punctuations etc.
Wild Characters/ Globs
If you are using * and ? as wild cards then the interpretations are as below
* : Match any number of any characters at this position. Like *.c means all strings ending with the string '.c' (here . has no special interpretations) . Test with ls *.c or ls *.doc
? : Match any character only one time at this position. Like file??.txt would match strings 'fileab.c', 'file00.c' etc, and match any exactly two characters. Test with ls *.??? which will list all the files having a three character extension.
I hope this answers your question. Or you might want to through some text about formal definitions and the POSIX and maybe the Perl style regular expressions for a clear idea.
References:
Wikipedia Page
grep manual Regular expression section
Theory of Computation by Michael Sipser
Note: This answer was reconstructed
? means zero or one of. * means any number of. So this:
^ab?$
would match a and ab, but not abb. This:
^ab*$
would match not only a and ab, but also abb, abbb, and a with any number of bs following it.
For sake of regex completeness something like *? is also used. In this case it is a lazy match or non-greedy match and will match as few characters as possible before matching the next token.
For example:
a.*a
would match whole of abaaba
while
a.*?a
will match aba, aba
I am looking for some help on creating a regular expression that would work with a unique input in our system. We already have some logic in our keypress event that will only allow digits, and will allow the letter A and the letter M. Now I need to come up with a RegEx that can match the input during the onblur event to ensure the format is correct.
I have some examples below of what would be valid. The letter A represents an age, so it is always followed by up to 3 digits. The letter M can only occur at the end of the string.
Valid Input
1-M
10-M
100-M
5-7
5-20
5-100
10-20
10-100
A5-7
A10-7
A100-7
A10-20
A5-A7
A10-A20
A10-A100
A100-A102
Invalid Input
a-a
a45
4
This matches all of the samples.
/A?\d{1,3}-A?\d{0,3}M?/
Not sure if 10-A10M should or shouldn't be legal or even if M can appear with numbers. If it M is only there without numbers:
/A?\d{1,3}-(A?\d{1,3}|M)/
Use the brute force method if you have a small amount of well defined patterns so you don't get bad corner-case matches:
^(\d+-M|\d+-\d+|A\d+-\d+|A\d+-A\d+)$
Here are the individual regexes broken out:
\d+-M <- matches anything like '1-M'
\d+-\d+ <- 5-7
A\d+-\d+ <- A5-7
A\d+-A\d+ <- A10-A20
/^[A]?[0-9]{1,3}-[A]?[0-9]{1,3}[M]?$/
Matches anything of the form:
A(optional)[1-3 numbers]-A(optional)[1-3 numbers]M(optional)
^A?\d+-(?:A?\d+|M)$
An optional A followed by one or more digits, a dash, and either another optional A and some digits or an M. The '(?: ... )' notation is a Perl 'non-capturing' set of parentheses around the alternatives; it means there will be no '$1' after the regex matches. Clearly, if you wanted to capture the various bits and pieces, you could - and would - do so, and the non-capturing clause might not be relevant any more.
(You could replace the '+' with '{1,3}' as JasonV did to limit the numbers to 3 digits.)
^A?\d{1,3}-(M|A?\d{1,3})$
^ -- the match must be done from the beginning
A? -- "A" is optional
\d{1,3} -- between one and 3 digits; [0-9]{1,3} also work
- -- A "-" character
(...|...) -- Either one of the two expressions
(M|...) -- Either "M" or...
(...|A?\d{1,3}) -- "A" followed by at least one and at most three digits
$ -- the match should be done to the end
Some consequences of changing the format. If you do not put "^" at the beginning, the match may ignore an invalid beginning. For example, "MAAMA0-M" would be matched at "A0-M".
If, likewise, you leave $ out, the match may ignore an invalid trail. For example, "A0-MMMMAAMAM" would match "A0-M".
Using \d is usually preferred, as is \w for alphanumerics, \s for spaces, \D for non-digit, \W for non-alphanumeric or \S for non-space. But you must be careful that \d is not being treated as an escape sequence. You might need to write it \\d instead.
{x,y} means the last match must occur between x and y times.
? means the last match must occur once or not at all.
When using (), it is treated as one match. (ABC)? will match ABC or nothing at all.
I’d use this regular expression:
^(?:[1-9]\d{0,2}-(?:M|[1-9]\d{0,2})|A[1-9]\d{0,2}-A?[1-9]\d{0,2})$
This matches either:
<number>-M or <number>-<number>
A<number>-<number> or A<number>-A<number>
Additionally <number> must not begin with a 0.