Do parentheses change the length of a regular expression? - regex

Let Sigma = {a,b}. The regular expression RE = (ab)(ab)*(aa|bb)*b over Sigma.
Give a string of length 5 in the set denoted by RE.
Correct answer: abaab
My answer: (ab)aab
I placed the parentheses there because they are in the RE. I understand why I don't need to, but is my answer incorrect? I tested it using RegEx, and the expression (ab)aab matched the text abaab, but it did not match when I reversed this.

() is regex syntax with its own semantic meaning; it is not matched literally.
As with ^, $ and other reserved characters in regex, you have to handle them specially (escape them) to match them literally, for example: Regex to Match Symbols: !$%^&*()_+|~-=`{}[]:";'<>?,./
Also, specifically in the context of your question, () should not appear in the string because it is not in the character set (alphabet) {a,b}. And the string you provided has a length of 7 instead of 5, so it is correct to say it is wrong.

Your answer is wrong because the parentheses do not belong to your set of symbols. The string (ab)aab cannot be generated using only symbols present in the {a,b} set.
Moreover, you were asked to provide a string of 5 symbols but (ab)aab has length 7.
Parentheses have special meaning in regex. They create sub-regexps and capturing groups. For example, (ab)* means ab can be matched any number of times, including zero. Without parentheses, ab* means the regex matches one a followed by any number of bs. That's a different expression.
For example:
the regular expression (ab)* matches the empty string (ab zero times), ab, abab, ababab, abababab and so on;
the regular expression ab* matches a (followed by zero bs), ab, abb, abbb, abbbb and so on.
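The two lists above can be checked with a small Python sketch. It assumes Python's re module and uses re.fullmatch, which succeeds only when the whole string matches; the sample strings are just illustrative.

```python
import re

grouped = re.compile(r'(ab)*')   # the group "ab" repeated zero or more times
ungrouped = re.compile(r'ab*')   # one "a" followed by zero or more "b"

samples = ['', 'a', 'ab', 'abb', 'abab', 'ababab']

# Keep only the samples that each expression matches in full.
grouped_matches = [s for s in samples if grouped.fullmatch(s)]
ungrouped_matches = [s for s in samples if ungrouped.fullmatch(s)]

print(grouped_matches)    # ['', 'ab', 'abab', 'ababab']
print(ungrouped_matches)  # ['a', 'ab', 'abb']
```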
The first set of parentheses in your example is useless if you are looking only for sub-regexps. Both (ab) and ab expressions match only the ab string. But they can be used to capture the matched part of the string and re-use it either with back references or for replacement.
When parentheses are used for sub-expressions in regular expressions, they are meta-characters and do not match anything in the string. In order to match an open parenthesis character ( found in the string, you have to escape it in the regex: \(.
Several strings that match the regular expression (ab)(ab)*(aa|bb)*b over Sigma = { 'a', 'b' }: abb, ababb, abababababb, ababababaabbaaaabbb.
The last string (ababababaabbaaaabbb) matches the regex pieces as follows:
ab - (ab)
ababab - (ab)* - ('ab' 3 times)
aabbaaaabb - (aa|bb)* - ('aa' or 'bb', 5 times in total)
b - b
A regex that matches the string (ab)aab is \(ab\)(ab)*(aa|bb)*b but in this case
Sigma = { 'a', 'b', '(', ')' }
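To make the point concrete, here is a minimal sketch in Python (an assumption, since the question does not say which regex tool was used). The helper name check is made up for illustration; it reports whether a candidate uses only symbols from Sigma, its length, and whether the whole string is in the set denoted by RE.

```python
import re

SIGMA = set('ab')
RE = re.compile(r'(ab)(ab)*(aa|bb)*b')

def check(candidate):
    """Return (uses only Sigma, length, whole string matches RE)."""
    in_alphabet = set(candidate) <= SIGMA
    return in_alphabet, len(candidate), RE.fullmatch(candidate) is not None

print(check('abaab'))    # (True, 5, True)
print(check('(ab)aab'))  # (False, 7, False)

# Used as a *pattern*, (ab)aab matches the text abaab, because the
# parentheses only group; that is what the asker observed.
as_pattern = re.fullmatch(r'(ab)aab', 'abaab') is not None

# With escaped parentheses the literal string (ab)aab can be matched,
# but then the alphabet has to include '(' and ')'.
escaped = re.compile(r'\(ab\)(ab)*(aa|bb)*b')
literal_ok = escaped.fullmatch('(ab)aab') is not None

print(as_pattern, literal_ok)  # True True
```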

Related

Matching letter grades within a body of text

I am trying to write a regular expression for matching letter grades embedded within a string; however, I am having some difficulty with certain characters. These characters are commas, backslashes, forward slashes, or apostrophes at word boundaries.
These strings may consist of either just a letter grade, or a mixture of a letter grade and notes left by an instructor. The valid range for these grades is anything from an A+ to D-, with an F reserved for failures. For a particular letter such as C the valid grades are: C+, C, or C-. Grades will never appear embedded within another word. Examples of some of these strings are as follows:
string1: "A+"
string2: "B. Submitted with deferral"
string3: "F. Could not read M/C answer sheet."
string4: "C+"
string5: "Received a B- with late submission penalty."
The expression that I have tried thus far is as follows:
((\b[A-D]\b[+-]?)|\bF\b)
For string1 and string2, this produces the expected matches, A+ and B respectively:
"A+"
"B. Submitted with deferral"
For string3 this expression should match only the F in
F. Could not read M/C answer sheet.
But instead it matches both the F and the C in M/C.
Any assistance would be much appreciated.
Edit:
For clarity a substring is a letter grade if and only if:
It is of the form A+, A, A-, B+, B, B-, ..., D+, D, D-, with F (without a sign) reserved for a failing grade
It is not embedded in a word, for example FOA+O would not match A+. Likewise substrings such as AC or FB should produce no matches
Letters separated by characters such as \ / ?' should not be matched, for example A/C, B+'C, F\D should not produce matches, whereas A, C or A,C should match both letters.
Letters separated by periods such as B.A. should not result in matches, whereas a letter occurring at the end of a sentence such as A. may be considered a match.
Consider the following example strings
string1: "A-- A-C, A\D, F/A, D'C, A,C, B+D, C-C, AB, XA, B.A. C C, Cat, F, C+, B-."
string2: " A "
string3: "B+."
string4: "X"
string5: "F"
in these strings the only valid matches should be
string1: "A-- A-C, A\D, F/A, D'C, A,C, B+D, C-C, AB, XA, B.A. C C, Cat, F, C+, B-."
string2: " A "
string3: "B+."
string5: "F"
I'm not sure which regex engine you're using but the following regex works for all of the test cases you presented:
(?<=^|[\s,])(?:[A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)
(?<=^|[\s,]) Lookbehind ensuring what precedes is either of the following options:
^ Asserts position at the start of the line.
[\s,] Match any whitespace character or the comma character.
(?:[A-D][-+]?|F) Match either of the following options:
[A-D][-+]? Match the following:
[A-D] Match any character in the range from A to D in the ASCII table (ABCD).
[-+]? Optionally match any character in the set (- or +)
F Match this literally.
(?=[-+.]\B|[\s,]|$) Lookahead ensuring what follows is either of the following options:
[-+.]\B Matches any character in the set (-+.), followed by an assertion for anything that doesn't match a word boundary (ensures what follows is not a letter).
[\s,] Matches any whitespace character or the comma character.
$ Asserts position at the end of the line.
Alternatives
Fixed-width lookbehind:
(?:^|(?<=[\s,]))(?:[A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)
Without lookbehind (uses a capture group instead):
(?:^|[\s,])([A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)
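Since the engine is not known, here is a hedged sketch of how this could be tried in Python. Python's built-in re module only accepts fixed-width lookbehind, so the sketch uses the fixed-width variant above rather than the first pattern. The test strings are the five examples from the original question.

```python
import re

# Fixed-width lookbehind variant: the start-of-line case is handled
# outside the lookbehind so that Python's re module accepts it.
GRADE = re.compile(r'(?:^|(?<=[\s,]))(?:[A-D][-+]?|F)(?=[-+.]\B|[\s,]|$)')

examples = [
    "A+",
    "B. Submitted with deferral",
    "F. Could not read M/C answer sheet.",
    "C+",
    "Received a B- with late submission penalty.",
]

# The pattern has no capturing groups, so findall returns whole matches.
results = {text: GRADE.findall(text) for text in examples}

for text, grades in results.items():
    print(repr(text), '->', grades)
```

With these inputs the C in M/C is no longer reported, because the character before it is a slash rather than whitespace, a comma, or the start of the line.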
The "C" in "M/C" is matched because \b considers the "/" a valid word boundary.
(?<=^|\s)[A-F][+-]{0,1}(?=\W)
This regular expression will match letter grades that are either at the beginning of the line (^), or are preceded with whitespace (\s). The positive lookbehind (?<=) ensures that the leading whitespace is not considered part of the match.
After the letter grade, we have (?=\W), which will require one non-word character, using positive lookahead to exclude the boundary character from the match.
Your original expression is just fine, but this version adds a check in front for the start of the line or whitespace, which might help here:
(?<=^|\s)\b[A-DF]\b[+-]?
Or with capturing group:
(?<=^|\s)(\b[A-DF]\b[+-]?)
Or without lookarounds, these might work:
(?:^|\s)(\b[A-DF]\b[+-]?)
(^|\s)(\b[A-DF]\b[+-]?)
^(\b[A-DF]\b[+-]?)|\s(\b[A-DF]\b[+-]?)

Regex string matched failure

I do not understand how regex string matching works
import re
r2 = r'a[bcd]*b'
m1 = re.findall(r2, "abcbd")
print(m1)  # ['abcb']
This falls in line with what was explained in regex
Step 3 The engine tries to match b, but the current position is at the end of the string, so it fails.
How? I do not understand this.
The regex a[bcd]*b matches the longest possible substring (because * is greedy):
a: starting with a
[bcd]*: followed by any number (including zero, so it can match the empty string) of characters from the set (b, c, d)
b: ending with b
EDIT: following a comment, backtracking occurs in the following example:
>>> re.findall(r2,"abcxb")
['ab']
abc matches a[bcd]*, but the next character is x, not the required b, so the engine backtracks
a alone also matches a[bcd]* (because the empty string matches [bcd]*), and the next character is b
so the final match is ab
Concerning greediness: the metacharacter * after a single character, a character set or a group means any number of times (the longest possible match). Some regexp engines accept the sequence of metacharacters *?, which changes the behavior to the shortest possible match, for example:
>>> r2 = r'a[bcd]*?b'
>>> re.findall(r2,"abcbde")
['ab']
Your regular expression requires the match to end in b, therefore everything up to, but not including, the trailing d is matched. If b were optional, as in a[bcd]*b?, then the entire string would be matched.
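The three behaviours discussed in these answers (greedy, lazy, and optional trailing b) can be compared side by side on the original input. This is only a small sketch using re.findall from Python's standard library.

```python
import re

text = 'abcbd'

# Greedy: [bcd]* first grabs "bcbd", then backtracks until a "b" can follow.
greedy = re.findall(r'a[bcd]*b', text)

# Lazy: [bcd]*? takes as little as possible, so the first "b" ends the match.
lazy = re.findall(r'a[bcd]*?b', text)

# Optional trailing b: no backtracking is needed, the whole string matches.
optional_b = re.findall(r'a[bcd]*b?', text)

print(greedy)      # ['abcb']
print(lazy)        # ['ab']
print(optional_b)  # ['abcbd']
```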

How can I write a regular expression to match strings starting with letters and ending with digits

I want to match the strings listed below; any other string should not match
rahul2803
albert1212
ra456
r1
only the above mentioned strings should match in the following group of data
rahul
2546rahul
456
rahul2803
albert1212
ra456
r1
rahulrenjan
r4ghyk
I tried ([a-z]*[0-9]) but it's not working.
In regular expressions * means zero or more so your regex matches zero letters. If you want one or more use + (\d means digit).
^[a-zA-Z]+\d+$
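As a quick check, this pattern can be run against the data from the question. The sketch below assumes Python's re module; the list is copied from the question.

```python
import re

pattern = re.compile(r'^[a-zA-Z]+\d+$')

data = [
    'rahul', '2546rahul', '456', 'rahul2803', 'albert1212',
    'ra456', 'r1', 'rahulrenjan', 'r4ghyk',
]

# Keep only the entries made of one or more letters followed by
# one or more digits, and nothing else.
matched = [s for s in data if pattern.match(s)]
print(matched)  # ['rahul2803', 'albert1212', 'ra456', 'r1']
```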
Regular expressions are fun to solve once you get the hang of the syntax.
This one should be pretty straight:
Start with a letter: ^[a-z] (I am not handling capital letters here; if they can occur, use ^[a-zA-Z])
Have multiple letters/digits in between .*
End the string with a digit [0-9]$
Combine all 3 and you get:
^[a-z].*[0-9]$
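One thing worth noting is that .* accepts any characters in the middle, so this pattern is looser than one that requires letters followed only by digits. A small sketch, assuming Python's re module; the two extra inputs are hypothetical and do not come from the question.

```python
import re

loose = re.compile(r'^[a-z].*[0-9]$')       # letter, then anything, then a digit
strict = re.compile(r'^[a-zA-Z]+\d+$')      # letters followed only by digits

# Hypothetical inputs chosen to show where the two patterns disagree.
extra = ['r4ghyk9', 'a-b 1']

loose_hits = [s for s in extra if loose.match(s)]
strict_hits = [s for s in extra if strict.match(s)]

print(loose_hits)   # ['r4ghyk9', 'a-b 1']
print(strict_hits)  # []
```

Whether the looser behaviour is acceptable depends on what other strings can appear in the data.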

What does the RegEx (a+b)^n(c+d)^m match?

I'm unsure of what this RegEx matches:
(a+b)^n(c+d)^m
I know that the + metacharacter means "one or more times the preceding pattern". So, a+ would match one or more as whereas a* also includes the empty string.
But I think that in this case, the RegEx means a or b to the nth time concatenated with c or d to the mth time, so it'd match strings like these:
aaaacc (n=4, m=2)
bbbbbdddd (n=5, m=4)
aaaddddd (n=3, m=5)
bc (n=1, m=1)
aaaaaaaaaaaaccccc (n=12, m=5)
...
Is this correct? If it's not, can anyone provide examples of what this RegEx does match?
It doesn't look like a valid regular expression given the incorrect use of ^
^ should either be inside []'s like this [^a], or at the very start of the regular expression.
+ just means 1 or more occurrence of a character.
If ^n means can be repeated n times then these would be matches:
aaaaaabccccccccd,
aaaaaabaaaaaabaaaaaabccccccccdccccccccd
Apparently (a+b)^n(c+d)^m means "n slots for unordered a's and b's followed by m slots for unordered c's and d's"
e.g. an example of (a+b)^10(c+d)^5 would be: aaaababbbadcccd
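If that formal-language reading is the intended one, it could be translated into programming-style regex syntax roughly as follows. This is a sketch under that assumption, written for Python's re module; the helper name formal_power_pattern is made up for illustration.

```python
import re

def formal_power_pattern(n, m):
    """Build a regex for (a+b)^n(c+d)^m read in formal notation:
    exactly n symbols from {a, b} followed by exactly m from {c, d}."""
    return re.compile(rf'[ab]{{{n}}}[cd]{{{m}}}')

# The example above, with n=10 and m=5.
ok_10_5 = formal_power_pattern(10, 5).fullmatch('aaaababbbadcccd') is not None

# Two of the asker's own examples.
ok_4_2 = formal_power_pattern(4, 2).fullmatch('aaaacc') is not None
ok_3_5 = formal_power_pattern(3, 5).fullmatch('aaaddddd') is not None

# Wrong number of trailing symbols: rejected.
bad = formal_power_pattern(4, 2).fullmatch('aaaac') is not None

print(ok_10_5, ok_4_2, ok_3_5, bad)  # True True True False
```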
If you're using Perl regular expressions with the 'm' option, e.g. /(a+b)^n(c+d)^m/m, the '^' will match an internal beginning of line. So...
/
(a+b) # Match one or more as followed by b
^n # Match the beginning of a line followed by a literal n.
(c+d) # Match one or more cs followed by d
^m # Match the beginning of a line followed by a literal m.
/mx
(a+b) and (c+d) would be available in $1 and $2.

What's the difference between * and ? in regular expressions?

Both seem to mean, match 0 or more characters? I don't understand the difference between them, or when to use ? and when to use *. Some examples would help.
In the formal definition, the regular expression operators are:
. : concatenation; a.b.c would match the text abc. Sometimes concatenation is indicated simply by writing two symbols back to back.
* : matches the preceding expression 0 or more times; (abc)* would match the empty string, abc, abcabc, abcabcabc, but not abcaabc. Known as the Kleene star.
+ : matches either the left-hand side or the right-hand side; (abc + def) would match abc or def. This is the union operator, also written |.
These operators are applied to a set of symbols sigma (the alphabet), which contains the symbols of your language. Two other special symbols are epsilon, which denotes the empty string, and null, which denotes the empty language (no strings at all). For details see [3]
These are the formal definitions.
When you use applications accepting the POSIX regular expression syntax the meaning of the different operators are like this:
These are the POSIX Basic regular expression operations
. : The dot '.' matches any character like a.c could match abc, axc, amc, aoc anything.
^ : Indicates the start of a line. ^abc would match abc only at the start of a line; abc appearing in the middle of a line would not be matched.
$ : Indicates the end of a line. abc$ would match abc only at the end of a line; it would not match an 'abc' in the middle of a line.
* : Matches the last symbol preceding the '*' 0 or more times. So ab*c would match ac, abc, abbc, abbbc, abbbbbc, abbbbbbbbc etc.
{m,n} : Matches the preceding symbol at least 'm' times and at most 'n' times (strictly, basic syntax writes this as \{m,n\}; the unescaped braces are the extended form).
ab{2,4}c would not match 'abc', but would match 'abbc', 'abbbc', 'abbbbc', but will not match 'abbbbbc' . So if the number of 'b' is >= 2 and <= 4 it would match.
{m,} : means match the preceding symbol minimum 'm' times, and no limit in the maximum. (note the comma)
{n} : means match the preceding symbol exactly 'n' times. so ab{3}c would only match 'abbbc'.
[symbols] : will match any one of the symbols inside the square brackets. For example a[xyz]c would match 'axc', 'ayc', and 'azc' and no other strings.
[^symbols] : will match any one symbol that is not inside the square brackets. For example a[^xyz]c would match any string 'a.c' with the '.' being any symbol except x, y, z.
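These operators can be tried out quickly in Python. Note that Python's re module is not a POSIX implementation, but for the simple patterns below the operators behave the same way. The helper name whole is made up for this sketch; it keeps the candidates that the pattern matches in full.

```python
import re

def whole(pattern, candidates):
    """Return the candidates matched in full by the pattern."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

dot = whole(r'a.c', ['abc', 'a c', 'ac'])                      # any one character
star = whole(r'ab*c', ['ac', 'abc', 'abbbc', 'adc'])           # zero or more b
interval = whole(r'ab{2,4}c', ['abc', 'abbc', 'abbbbc', 'abbbbbc'])
exact = whole(r'ab{3}c', ['abbc', 'abbbc'])                    # exactly three b
in_set = whole(r'a[xyz]c', ['axc', 'abc', 'azc'])              # one of x, y, z
not_in_set = whole(r'a[^xyz]c', ['axc', 'abc', 'a.c'])         # anything but x, y, z

print(dot)         # ['abc', 'a c']
print(star)        # ['ac', 'abc', 'abbbc']
print(interval)    # ['abbc', 'abbbbc']
print(exact)       # ['abbbc']
print(in_set)      # ['axc', 'azc']
print(not_in_set)  # ['abc', 'a.c']
```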
These are the POSIX Extended regular expression operators (needs grep -E)
? : Will match the preceding symbol 0 or at most 1 time. so ab?c would match 'ac' and 'abc' only.
+ : Will match the preceding symbol at least 1 time and at most any number times (no upper bounds). Like ab+c would match abc, abbbc, abbbbbc, abbbbbbbbc, etc, but would not match 'ac'
| : Would match either the expression on the left side of the '|' or the expression on the right side of the '|', like (ab+c)|(xy*z).
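The extended operators can be illustrated the same way. This sketch again assumes Python's re module (where ?, + and | need no special flag) rather than grep -E, and the candidate strings are only examples.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates matched in full by the pattern."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional = full_matches(r'ab?c', ['ac', 'abc', 'abbc'])        # zero or one b
one_or_more = full_matches(r'ab+c', ['ac', 'abc', 'abbc'])     # at least one b
either = full_matches(r'(ab+c)|(xy*z)', ['abbc', 'xz', 'xyyz', 'abxz'])

print(optional)     # ['ac', 'abc']
print(one_or_more)  # ['abc', 'abbc']
print(either)       # ['abbc', 'xz', 'xyyz']
```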
Also have a look at the POSIX character classes: [:alpha:] represents all alphabetic characters, [:punct:] denotes all punctuation characters, etc.
Wildcards / Globs
If you are using * and ? as wild cards then the interpretations are as below
* : Match any number of any characters at this position. Like *.c means all strings ending with the string '.c' (here . has no special interpretations) . Test with ls *.c or ls *.doc
? : Match any single character exactly once at this position. For example file??.txt would match 'fileab.txt', 'file00.txt' etc., that is, 'file' followed by exactly two characters and '.txt'. Test with ls *.??? which will list all the files having a three character extension.
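Without a shell at hand, the wildcard meanings of * and ? can be approximated with Python's fnmatch module, which implements shell-style patterns on plain strings. This is only a sketch; the file names are invented for illustration and nothing is read from disk.

```python
from fnmatch import fnmatchcase

# Invented file names, used only to exercise the patterns.
names = ['main.c', 'notes.doc', 'file00.txt', 'fileab.txt', 'file1.txt', 'README']

def glob_filter(pattern):
    """Return the names that match the shell-style pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]

c_files = glob_filter('*.c')            # anything ending in ".c"
two_chars = glob_filter('file??.txt')   # "file" + exactly two characters + ".txt"
three_ext = glob_filter('*.???')        # any name with a three character extension

print(c_files)    # ['main.c']
print(two_chars)  # ['file00.txt', 'fileab.txt']
print(three_ext)  # ['notes.doc', 'file00.txt', 'fileab.txt', 'file1.txt']
```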
I hope this answers your question. Or you might want to go through some text about the formal definitions, POSIX, and maybe the Perl style regular expressions for a clearer idea.
References:
Wikipedia Page
grep manual Regular expression section
Theory of Computation by Michael Sipser
Note: This answer was reconstructed
? means zero or one of. * means any number of. So this:
^ab?$
would match a and ab, but not abb. This:
^ab*$
would match not only a and ab, but also abb, abbb, and a with any number of bs following it.
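Here is a short sketch of that difference, assuming Python's re module; the candidate strings are just examples.

```python
import re

candidates = ['a', 'ab', 'abb', 'abbb', 'b']

# ? allows zero or one b after the a.
zero_or_one = [s for s in candidates if re.search(r'^ab?$', s)]

# * allows any number of b after the a, including none.
zero_or_more = [s for s in candidates if re.search(r'^ab*$', s)]

print(zero_or_one)   # ['a', 'ab']
print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb']
```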
For sake of regex completeness something like *? is also used. In this case it is a lazy match or non-greedy match and will match as few characters as possible before matching the next token.
For example:
a.*a
would match the whole of abaaba
while
a.*?a
will match aba, aba
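The same comparison, sketched with re.findall from Python's standard library (other engines that support *? should behave alike for this input):

```python
import re

text = 'abaaba'

# Greedy: .* runs to the end of the string and backs off to the last "a".
greedy = re.findall(r'a.*a', text)

# Lazy: .*? stops at the first "a" it can, so two separate matches are found.
lazy = re.findall(r'a.*?a', text)

print(greedy)  # ['abaaba']
print(lazy)    # ['aba', 'aba']
```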