I do not understand how regex string matching works
import re
r2 = r'a[bcd]*b'
m1 = re.findall(r2, "abcbd")
print(m1)  # ['abcb']
This is in line with what was explained about regex matching:
Step 3 The engine tries to match b, but the current position is at the end of the string, so it fails.
How? I do not understand this.
The regex a[bcd]*b matches the longest possible substring (because * is greedy) that:
a: starts with a
[bcd]*: continues with any number of characters from the set (b, c, d), including none, since [bcd]* can match the empty string
b: ends with b
EDIT: following a comment, backtracking occurs in the following example:
>>> re.findall(r2,"abcxb")
['ab']
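abc matches a[bcd]*, but the next character is x, not the expected b
the engine backtracks: ab also matches a[bcd]*, but the next character is c, not b
a alone also matches a[bcd]* (because the empty string matches [bcd]*), and this time the next character is b
so the engine finally returns ab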
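The backtracking steps above can be sketched by hand. This is only an illustration for this one pattern, not how the re module is implemented; the function name match_from is made up for the example.

```python
import re

def match_from(text, start=0):
    """Hand-rolled matcher for a[bcd]*b anchored at `start` (illustrative only)."""
    if start >= len(text) or text[start] != "a":
        return None
    # Greedy phase: [bcd]* grabs as many of b, c, d as it can.
    end = start + 1
    while end < len(text) and text[end] in "bcd":
        end += 1
    # Backtracking phase: give characters back one at a time
    # until the character at `end` is the required final b.
    while end > start:
        if end < len(text) and text[end] == "b":
            return text[start:end + 1]
        end -= 1
    return None

for s in ("abcbd", "abcxb"):
    # Compare the hand-rolled result with what re.findall reports.
    print(s, match_from(s), re.findall(r"a[bcd]*b", s))
```

For "abcbd" the greedy phase runs to the end of the string, which is why the engine first fails to find the final b there and has to step back.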
Concerning greediness: the metacharacter * after a single character, a character set or a group means "any number of times", matching as much as possible. Some regexp engines accept the sequence *?, which changes the behavior to match as little as possible, for example:
>>> r2 = r'a[bcd]*?b'
>>> re.findall(r2,"abcbde")
['ab']
Your regular expression requires the match to end in b, so everything up to, but not including, the trailing d is matched. If b were optional, as in a[bcd]*b?, then the entire string would be matched.
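A small side-by-side comparison of the three variants discussed so far, run on the string from the question (the variable names are just for illustration):

```python
import re

text = "abcbd"

greedy = re.findall(r"a[bcd]*b", text)       # as much as possible, must end in b
lazy = re.findall(r"a[bcd]*?b", text)        # as little as possible, must end in b
optional_b = re.findall(r"a[bcd]*b?", text)  # final b is no longer required

print(greedy)      # ['abcb']
print(lazy)        # ['ab']
print(optional_b)  # ['abcbd']
```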
I have the following regular expression:
/^[a-f0-9]{8}$/ --- This expression extracts an 8 character string as a md5 hash, for example: if I have the following string "hello world .305eef9f x1xxx 304ccf9f test1232" it will return "304ccf9f"
I also have the following regular expression:
/.[^.]*$/ --- This expression extracts a string after the last period (included), for example, if I have "hello world.this.is.atest.case9.23919sd3xxxs" it will return ".23919sd3xxxs"
The thing is, I've read a bit about regex, but I can't combine both expressions in order to find the md5 string after the last period (included), for example:
topLeftLogo.93f02a9d.controller.99f06a7s ----> must return ".99f06a7s"
Thanks in advance for your time and help!
/^[a-f0-9]{8}$/ --- This expression extracts an 8 character string as a md5 hash
Yes, but it doesn't return "304ccf9f" from "hello world .305eef9f x1xxx 304ccf9f test1232", because ^ in regex means start of string (and $ means end of string). How could it match in the middle of a string?
/.[^.]*$/ --- This expression extracts a string after the last period
Not reliably. It is only guaranteed to start at a period if you escape the first dot: \.
To combine these two you have to replace ^ with \.:
\.[a-f0-9]{8}$
To match 8 characters from the range [a-f0-9] after the last dot, you might use (if supported) a negative lookahead (?!.*\.) to assert that what follows the match does not contain a dot:
\.[a-f0-9]{8}(?!.*\.)
If you want to match characters from a-z instead of a-f (as in 99f06a7s, where s is not a hexadecimal digit), you could use [a-z0-9].
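A quick check of these patterns in Python, using the filename from the question. The second filename, ending in a valid hex digit, is a made-up variant added only to show the [a-f0-9] version succeeding.

```python
import re

name = "topLeftLogo.93f02a9d.controller.99f06a7s"
hex_name = "topLeftLogo.93f02a9d.controller.99f06a7e"  # hypothetical variant

# Anchored at the end of the string, hex digits only: fails on `name`
# because the trailing s is not in [a-f0-9].
anchored_hex = re.search(r"\.[a-f0-9]{8}$", name)
anchored_hex_ok = re.search(r"\.[a-f0-9]{8}$", hex_name)

# Wider character class, anchored with $ or with the negative lookahead.
anchored_alnum = re.search(r"\.[a-z0-9]{8}$", name)
lookahead_alnum = re.search(r"\.[a-z0-9]{8}(?!.*\.)", name)

print(anchored_hex)             # None
print(anchored_hex_ok.group())  # .99f06a7e
print(anchored_alnum.group())   # .99f06a7s
print(lookahead_alnum.group())  # .99f06a7s
```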
About the first example
This regex ^[a-f0-9]{8}$ will match one of the ranges in the character class 8 times from the start until the end of the string due to the anchors ^ and $. It would not find a match in hello world .305eef9f x1xxx 304ccf9f test1232 on the same line.
About the second example
.[^.]*$ will match any single character, followed by zero or more characters that are not a dot, up to the end of the string. It would, for example, also match a single a, and it is not bound to start at a dot, because you have to escape the dot to match it literally.
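Both points can be checked in Python. The helper found is an illustrative name, not part of the re module. Note that on the question's own example the unescaped and escaped versions happen to return the same text; the difference only shows up on input without a dot.

```python
import re

def found(pattern, s):
    """Return the matched text for the first match, or None."""
    m = re.search(pattern, s)
    return m.group() if m else None

sentence = "hello world .305eef9f x1xxx 304ccf9f test1232"
dotted = "hello world.this.is.atest.case9.23919sd3xxxs"

# First example: the anchors force the whole string to be 8 hex characters.
print(found(r"^[a-f0-9]{8}$", sentence))    # None
print(found(r"^[a-f0-9]{8}$", "304ccf9f"))  # 304ccf9f

# Second example: unescaped dot versus escaped dot.
print(found(r".[^.]*$", dotted))   # .23919sd3xxxs
print(found(r"\.[^.]*$", dotted))  # .23919sd3xxxs
print(found(r".[^.]*$", "abc"))    # abc
print(found(r"\.[^.]*$", "abc"))   # None
```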
I'm adding this in case other people need to solve a similar case:
Case 1: we want to get the 8-character hexadecimal ([a-f0-9]) string that sits between the last period of the name and the file extension, for example in order to remove that "hashed" part:
Example:
file.name2222.controller.2567d667.js ------> returns .2567d667
We will need to use the following regex:
\.[a-f0-9]{8}(?=\.\w+$)
Case 2: we want the same as above, but without the leading period:
Example:
file.name2222.controller.2567d667.js ------> returns 2567d667
We will need to use the following regex
[a-f0-9]{8}(?=\.\w+$)
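A short Python check of both cases, plus the removal of the hashed part mentioned in Case 1. Using re.sub for the removal is an assumption about how one might apply the pattern; the original only shows the extraction.

```python
import re

filename = "file.name2222.controller.2567d667.js"

# Case 1: hash including the leading period.
with_dot = re.search(r"\.[a-f0-9]{8}(?=\.\w+$)", filename).group()

# Case 2: hash without the leading period.
without_dot = re.search(r"[a-f0-9]{8}(?=\.\w+$)", filename).group()

# Removing the hashed part: the lookahead is not consumed,
# so the extension stays in place.
stripped = re.sub(r"\.[a-f0-9]{8}(?=\.\w+$)", "", filename)

print(with_dot)     # .2567d667
print(without_dot)  # 2567d667
print(stripped)     # file.name2222.controller.js
```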
A period p of a string w is any positive integer p such that w[i] = w[i+p] whenever both sides of this equation are defined. Let per(w) denote the size of the smallest period of w. We say that a string w is periodic iff per(w) <= |w|/2.
So informally a periodic string is just a string that is made up from a prefix repeated at least twice. The only complication is that at the end of the string we don't require a full copy of the prefix.
For example, consider the string x = abcab. per(abcab) = 3, as x[1] = x[1+3] = a, x[2] = x[2+3] = b and there is no smaller period. Since 3 > 5/2, the string abcab is not periodic. However, the string ababa is periodic, as per(ababa) = 2 <= 5/2.
As more examples, abcabca, ababababa and abcabcabc are also periodic.
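The definition translates almost directly into Python. This is a minimal sketch for non-empty strings; the function names smallest_period and is_periodic are chosen here for illustration, and Python indices start at 0 rather than 1.

```python
def smallest_period(w):
    """Return per(w) for a non-empty string w, straight from the definition."""
    for p in range(1, len(w) + 1):
        # p is a period if w[i] == w[i + p] wherever both indices exist.
        if all(w[i] == w[i + p] for i in range(len(w) - p)):
            return p
    raise ValueError("w must be non-empty")

def is_periodic(w):
    """A string is periodic iff per(w) <= |w| / 2."""
    return 2 * smallest_period(w) <= len(w)

for w in ("abcab", "ababa", "abcabca", "ababababa", "abcabcabc"):
    print(w, smallest_period(w), is_periodic(w))
```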
Is there a regex to determine if a string is periodic or not?
I don't really mind which flavor of regex but if it makes a difference, anything that Python re supports.
What you need is a backreference:
\b(\w*)(\w+\1)\2+\b
This matches even abcabca and ababababa.
\1 and \2 are used to match the first and second capturing groups, respectively.
You could use Regex back references.
For example (.+)\1+. This pattern will match a group () formed of at least one character .+. This group \1 (back reference) must repeat at least one time for a match.
The string ababa matches and it finds ab as the 1st group.
The string abcab is not a match.
Later edit
If you want a prefix that is repeated at least twice, you can change the pattern to: ^(.+)\1+. The problem is that I don't think you can match the end of the string to a substring of the prefix. So any string that starts with a repeating pattern will match but it will ignore the ending of the string.
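A short check of the behaviour described so far. The string ababx is an extra example added here: it is not periodic (its smallest period is 5), yet the prefix-only pattern still matches it, which illustrates the limitation just mentioned.

```python
import re

# Unanchored back reference: finds ab repeated inside ababa.
m1 = re.search(r"(.+)\1+", "ababa")
print(m1.group(1))  # ab

# abcab has no directly repeated substring, so there is no match.
m2 = re.search(r"(.+)\1+", "abcab")
print(m2)  # None

# Anchored at the start only: the trailing x is simply ignored.
m3 = re.match(r"^(.+)\1+", "ababx")
print(m3.group(0))  # abab
```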
Even later edit
Inspired by @tobias_k's answer, here is how I would do it: ^((.+)(?:.*))\1+\2?$. It looks for a string that has a prefix (the longest prefix it can find) that repeats at least twice, and the ending must be the starting part of the prefix.
The first capturing group from the match will be the prefix that is repeating.
https://regex101.com/r/jQ3yY1/2
If you want the shortest prefix that repeats, you can use this pattern ^((.+?)(?:.*?))\1+\2?$.
You can use a regex like ^(.+)(.*)(\1\2)+\1?$.
^...$ from start to end of string
(.+) part of period that is always repeated (e.g. a in ababa)
(.*) optional part of period that is repeated except at the end (e.g. b in ababa)
(\1\2)+ one or more repetitions of the entire period
\1? optional final repetition of first part of the period
In Python:
>>> p = r"^(.+)(.*)(\1\2)+\1?$"
>>> re.match(p, "abcab")
None
>>> re.match(p, "abcabca")
<_sre.SRE_Match at 0x7f5fde6e51f8>
Note that this does not match the empty string "" though, which could also be considered periodic. If the empty string should be matched, you will have to treat it separately, e.g. by simply appending |^$ at the end of the regex.
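One way to gain confidence in this pattern is to compare it with the definition by brute force on all short non-empty strings over a small alphabet. The sketch below does that for the alphabet {a, b} and lengths 1 to 8; the helper is_periodic and the chosen bounds are assumptions made for this check, not part of the answer above.

```python
import re
from itertools import product

PATTERN = re.compile(r"^(.+)(.*)(\1\2)+\1?$")

def is_periodic(w):
    """Definition-based check for non-empty w: per(w) <= |w| / 2."""
    for p in range(1, len(w) + 1):
        # The first p that works is the smallest period.
        if all(w[i] == w[i + p] for i in range(len(w) - p)):
            return 2 * p <= len(w)
    return False

words = ["".join(chars) for n in range(1, 9) for chars in product("ab", repeat=n)]

# Collect every word where the regex and the definition disagree.
mismatches = [w for w in words if bool(PATTERN.match(w)) != is_periodic(w)]
print(mismatches)  # [] when the two agree on every tested word
```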
Let Sigma = {a,b}. The regular expression RE = (ab)(ab)*(aa|bb)*b over Sigma.
Give a string of length 5 in the set denoted by RE.
Correct answer: abaab
My answer: (ab)aab
I placed the parentheses there because they are in the RE. I understand why I don't need to, but is my answer incorrect? I tested it using RegEx, and the expression (ab)aab matched the text abaab, but it did not match when I reversed this.
() is part of regex syntax and has its own semantic meaning.
Like ^ and other reserved characters in regex, parentheses need special handling (escaping) if you want to match them literally; see, for example, Regex to Match Symbols: !$%^&*()_+|~-=`{}[]:";'<>?,./
Also, specifically in the context of your question, ( and ) should not appear in the string, because they are not in the character set (alphabet) {a,b}. And the string you provided has a length of 7 instead of 5, so it is correct to say that it is wrong.
Your answer is wrong because the parentheses do not belong to your set of symbols. The string (ab)aab cannot be generated using only symbols present in the {a,b} set.
Moreover, you were asked to provide a string of 5 symbols, but (ab)aab has length 7.
Parentheses have special meaning in regex. They create sub-regexps and capturing groups. For example, (ab)* means ab can be matched any number of times, including zero. Without parentheses, ab* means the regex matches one a followed by any number of bs. That's a different expression.
For example:
the regular expression (ab)* matches the empty string (ab zero times), ab, abab, ababab, abababab and so on;
the regular expression ab* matches a (followed by zero bs), ab, abb, abbb, abbbb and so on.
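The difference between the two expressions can be checked with re.fullmatch, which requires the whole string to match. The helper name full and the list of candidate strings are illustrative.

```python
import re

def full(pattern, s):
    """True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None

candidates = ["", "a", "ab", "abb", "abab"]

grouped = [s for s in candidates if full(r"(ab)*", s)]
ungrouped = [s for s in candidates if full(r"ab*", s)]

print(grouped)    # ['', 'ab', 'abab']
print(ungrouped)  # ['a', 'ab', 'abb']
```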
The first set of parentheses in your example is useless if you are looking only for sub-regexps. Both (ab) and ab expressions match only the ab string. But they can be used to capture the matched part of the string and re-use it either with back references or for replacement.
When parentheses are used for sub-expressions in regular expressions, they are metacharacters and do not match anything in the string. In order to match an open parenthesis character ( found in the string, you have to escape it in the regex: \(.
Several strings that match the regular expression (ab)(ab)*(aa|bb)*b over Sigma = { 'a', 'b' }: abb, ababb, abababababb, ababababaabbaaaabbb.
The last string (ababababaabbaaaabbb) matches the regex pieces as follows:
ab - (ab)
ababab - (ab)* - ('ab' 3 times)
aabbaaaabb - (aa|bb)* - ('aa' or 'bb', 5 times in total)
b - b
A regex that matches the string (ab)aab is \(ab\)(ab)*(aa|bb)*b but in this case
Sigma = { 'a', 'b', '(', ')' }
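The original exercise can also be answered mechanically: enumerate every string of length 5 over {a, b} and keep those that fully match the expression. This is a small sketch; the name FORMAL_RE is made up, and Python's re syntax is assumed to coincide with the formal notation for this particular expression.

```python
import re
from itertools import product

FORMAL_RE = re.compile(r"(ab)(ab)*(aa|bb)*b")

# All strings of length 5 over the alphabet {a, b} that are in the language.
length_five = [
    "".join(chars)
    for chars in product("ab", repeat=5)
    if FORMAL_RE.fullmatch("".join(chars))
]
print(length_five)  # ['abaab', 'ababb', 'abbbb']

# With the parentheses as literal characters the string has length 7
# and is not matched, unless the regex escapes them.
print(len("(ab)aab"))                                    # 7
print(FORMAL_RE.fullmatch("(ab)aab"))                    # None
print(bool(re.fullmatch(r"\(ab\)(ab)*(aa|bb)*b", "(ab)aab")))  # True
```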
regex with quantifier and grouping in python
p = re.compile('[29]{1}')
p.match('29')
why does 29 match p? i thought i explicitly said it's [29] (2 or 9) with {1} quantifier.
Shouldn't it be JUST 2 OR 9? Or does it match the first group and not care about the rest
thanks!
It matches because re.match only requires the pattern to match at the beginning of the string, and '29' begins with '2'; the rest of the string is ignored. (re.search is even more permissive: it succeeds if any substring matches, so with search this pattern would also match '46657467562374746', because it contains a '2'.) If you need the whole string to match from beginning to end, you need to use anchors:
p = re.compile('^[29]{1}$')
p.match('29')
The hat (^) represents the beginning of the string and the dollar ($) represents the end of the string. So now this will only match if the whole string is a single 2 or a single 9, instead of just containing a 2 or 9.
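The difference between match, search and a whole-string match can be seen directly. re.fullmatch (available since Python 3.4) is shown as an alternative to adding the anchors by hand.

```python
import re

p = re.compile(r"[29]{1}")
anchored = re.compile(r"^[29]{1}$")
digits = "46657467562374746"

print(p.match("29").group())     # 2     (match only looks at the start)
print(p.match(digits))           # None  (the string starts with 4)
print(p.search(digits).group())  # 2     (search scans the whole string)
print(anchored.match("29"))      # None  (the whole string must be one character)
print(anchored.match("9").group())  # 9
print(p.fullmatch("29"))         # None  (same effect without explicit anchors)
```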
I'm unsure of what this RegEx matches:
(a+b)^n(c+d)^m
I know that the + metacharacter means "one or more times the preceding pattern". So, a+ would match one or more as whereas a* also includes the empty string.
But I think that in this case, the RegEx means a or b to the nth time concatenated with c or d to the mth time, so it'd match strings like these:
aaaacc (n=4, m=2)
bbbbbdddd (n=5, m=4)
aaaddddd (n=3, m=5)
bc (n=1, m=1)
aaaaaaaaaaaaccccc (n=12, m=5)
...
Is this correct? If it's not, can anyone provide examples of what this RegEx does match?
It doesn't look like a valid regular expression, given the incorrect use of ^.
^ should either be inside []'s, like this: [^a], or at the very start of the regular expression.
+ just means 1 or more occurrences of the preceding character.
If ^n means can be repeated n times then these would be matches:
aaaaaabccccccccd,
aaaaaabaaaaaabaaaaaabccccccccdccccccccd
Apparently (a+b)^n(c+d)^m means "n slots for unordered a's and b's followed by m slots for unordered c's and d's"
e.g. an example of (a+b)^10(c+d)^5 would be: aaaababbbadcccd
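Under that reading (formal-language notation, where + means "or" and ^n means "repeated n times"), the expression can be translated to Python's re syntax as [ab]{n}[cd]{m}. The sketch below assumes exactly that interpretation; the function name formal_to_regex is made up for the example.

```python
import re

def formal_to_regex(n, m):
    """Translate (a+b)^n(c+d)^m, read as formal notation, into a Python regex."""
    # The doubled braces produce literal { and } around the numbers.
    return re.compile(rf"[ab]{{{n}}}[cd]{{{m}}}")

examples = [
    ("aaaababbbadcccd", 10, 5),
    ("aaaacc", 4, 2),
    ("bbbbbdddd", 5, 4),
    ("aaaddddd", 3, 5),
    ("bc", 1, 1),
]
for s, n, m in examples:
    print(s, bool(formal_to_regex(n, m).fullmatch(s)))  # True for each
```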
If you're using Perl regular expressions with the 'm' option, e.g. /(a+b)^n(c+d)^m/m, the '^' will match an internal beginning of line. So...
/
(a+b) # Match one or more as followed by b
^n # Match the beginning of a line followed by a literal n.
(c+d) # Match one or more cs followed by d
^m # Match the beginning of a line followed by a literal m.
/mx
(a+b) and (c+d) would be available in $1 and $2.
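The same reading can be tried in Python, assuming re.MULTILINE treats ^ the way Perl's /m does. One consequence worth noting: as written, each ^ comes right after a consumed b or d, which is never the start of a line, so the pattern cannot match any input. The variant with explicit \n before each ^ is a hypothetical adjustment, added here only to show the groups being captured.

```python
import re

as_written = re.compile(r"(a+b)^n(c+d)^m", re.MULTILINE)
samples = ["aabnccdm", "aab\nnccd\nm", "ab\nncd\nm"]

# ^ can only succeed at the start of the string or right after a newline,
# never directly after the b matched by (a+b).
results = [as_written.search(s) for s in samples]
print(results)  # [None, None, None]

# Hypothetical variant that consumes the newlines explicitly.
with_newlines = re.compile(r"(a+b)\n^n(c+d)\n^m", re.MULTILINE)
m = with_newlines.search("aab\nnccd\nm")
print(m.groups())  # ('aab', 'ccd')
```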