Complicated Regular Expression - regex

Hello I am trying to figure out a regular expression for the following where the alphabet consists of 'a','b','c','d':
Any combination of 'a','b','c','d' is acceptable as long as 'd' is never followed by ('d'|'a') and 'b' is never followed by ('b'|'c'). Any help would be great! Thanks.
EDIT: The one that got me closest is (a|b?|c|d?)* but this does not account for the fact that a 'd' can not be followed by an 'a' and a 'b' can not be followed by a 'c'.

Break the problem down into its component parts:
Any combination of 'a','b','c','d' is acceptable
This would be the simple expression:
[abcd]
However, given the extra restrictions on the characters d and b this becomes:
[ac]
d not followed by d or a
This can be achieved with a simple negative lookahead:
d(?![da])
b not followed by b or c
This is only slightly different than the previous character match:
b(?![bc])
Adding it all together
The complete one-character regex therefore becomes:
[ac]|d(?![da])|b(?![bc])
Or as a full expression:
/^([ac]|d(?![da])|b(?![bc]))+$/
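A quick way to sanity-check the full expression is to run it against a few hand-picked strings. The sketch below uses Python's re module; the names PATTERN and is_valid and the sample strings are illustrative assumptions, not part of the question.

```python
import re

# The full expression from above: the lookaheads reject a 'd' followed by
# 'd' or 'a', and a 'b' followed by 'b' or 'c'.
PATTERN = re.compile(r'^([ac]|d(?![da])|b(?![bc]))+$')

def is_valid(s):
    """Return True if s is non-empty and obeys both 'never followed by' rules."""
    return PATTERN.match(s) is not None

for s in ['abdc', 'dbad', 'da', 'dd', 'bb', 'bc']:
    print(s, is_valid(s))
```

Note that the expression relies on lookahead, which is a feature of practical regex engines rather than of formal regular expressions.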

Related

Regex | Containing "bbb"

I'm trying to create a Regex with chars 'a' and 'b'.
The only rule is that the regex must contain the word 'bbb' somewhere.
These are possible: aabbbaaaaaababa, abbba, bbb, aabbbaa, abbabbba, ...
These are not possible: abba, a, abb, bba, abbaaaabbaaaabba, ...
I have no idea how I can express that.
Any ideas? Thanks in Advance!
Based on the tag "automata", I am guessing you are after the formal regular expression for this formal language. In that case, a regular expression is (a+b)*bbb(a+b)*. The anatomy of this regular expression is the following:
(a+b) gives either "a" or "b"
(a+b)* gives any string of "a"s and "b"s whatever
bbb gives the string bbb only
the whole regular expression describes any string that begins with anything, then has bbb, then ends with anything
To prove this regular expression is correct, note that:
This regular expression only generates strings that contain the substring bbb. This is due to the middle part.
This regular expression generates all strings that contain the substring bbb. Suppose there were some string containing the substring bbb that this regular expression didn't generate. The string either starts with bbb or it doesn't. If it does, then the string is generated by our regular expression by repeating the first (a+b) zero times and the second (a+b) n - 3 times, where n is the length of the string. Otherwise, if it doesn't start with bbb, consider the suffix of length n - 1 as a recursive case. Continue thusly until the subcase does begin with bbb (it eventually must). Because this suffix is describable by our regular expression, the original case must be too since we can just repeat the first (a+b) an additional number of times equal to the depth of recursion.
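The formal expression (a+b)*bbb(a+b)* can be tried out against the examples from the question. This is only a sketch: in Python syntax the formal union operator "+" is written "|", and the names CONTAINS_BBB and matches are hypothetical.

```python
import re

# Formal (a+b)*bbb(a+b)* in Python syntax: union "+" becomes "|".
CONTAINS_BBB = re.compile(r'(a|b)*bbb(a|b)*')

# Sample strings taken from the question.
accepted = ['aabbbaaaaaababa', 'abbba', 'bbb', 'aabbbaa', 'abbabbba']
rejected = ['abba', 'a', 'abb', 'bba', 'abbaaaabbaaaabba']

def matches(s):
    """True if the whole string s is generated by the expression."""
    return CONTAINS_BBB.fullmatch(s) is not None

print(all(matches(s) for s in accepted))
print(any(matches(s) for s in rejected))
```

fullmatch is used so that the entire string has to be generated by the expression, as in the formal setting.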
The pattern is fairly simple:
/b{3}/g
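If you need it to match a run of exactly three 'b's (one that is not part of a longer run), you can add a lookbehind and a lookahead, assuming your engine supports them:
/(?<!b)b{3}(?!b)/g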
Good evening! You can use this expression; it might work:
(a+b)*(bbb)(a+b)*
The shortest string it generates is bbb, and by taking the closure of (a+b) on either side you can generate any string that contains three consecutive b's.

ReGex - matching a logical implication

Given the alphabet {a, b, c}, how can I create a simple regular expression which matches exactly those words which meet the following criteria:
If the string "aa" occurs, then consequently "cc" must also occur (Note the logical implication).
The order of occurrence doesn't matter ("cc" as well as "aa" can occur first).
Due to the former logical implication (if-then relationship), the string "cc" can occur even without "aa", but not vice versa.
I am looking for a way to implement this by using these syntax elements (., *, +, ?, |, ) as well as brackets.
Example what should be matched:
cc
abba
bccb
bccaa
ε (epsilon - empty string)
What shouldn't be matched:
aa
aacbcb
abaaba
baaa
caac
I have tried the following: a?b?(ba)*(ccaa)*(aacc)*c*b?
Not sure if I should answer this, since it's homework, but here goes...
First off, you identify the clauses P: "string 'aa' occurs" and Q: "string 'cc' occurs". Then you transform P -> Q into !P v Q. Then you translate this into a regular expression formed of two parts:
^(((b|c)*(a(b|c)+)*a?)|(.*cc.*))$
The first part rejects any string containing 'aa', and goes like this: allow any number of b's and c's at the beginning (including none); then, whenever you find an a, force at least one non-a character after it. An a followed by non-a's may occur any number of times. At the end we also have the a? to allow the string to end in an a, and we can be sure there is no a right before it because neither of the two preceding groups ends in a.
The second part is trivial.
Test it here: http://rubular.com/r/mAPHO8bulo
See it here: http://jex.im/regulex/...
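The expression can also be checked locally against the examples listed in the question. A minimal sketch, assuming Python's re module; IMPLICATION and accepts are made-up names.

```python
import re

# Part 1 (no 'aa' anywhere) OR part 2 ('cc' somewhere), i.e. !P v Q.
IMPLICATION = re.compile(r'^(((b|c)*(a(b|c)+)*a?)|(.*cc.*))$')

# Examples copied from the question; '' stands for the empty string.
should_match = ['cc', 'abba', 'bccb', 'bccaa', '']
should_not_match = ['aa', 'aacbcb', 'abaaba', 'baaa', 'caac']

def accepts(s):
    """True if s satisfies 'aa occurs implies cc occurs'."""
    return IMPLICATION.match(s) is not None

print([accepts(s) for s in should_match])
print([accepts(s) for s in should_not_match])
```

One caveat: the dot in the second part matches any character, so this sketch does not by itself restrict the input to the alphabet {a, b, c}.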

Regular expression for formal languages

I'm trying to write a regular expression for a language consisting of:
Strings which contain any number of a’s followed by a single b and
Strings which contain any number of a’s followed by a single b followed by an even number of a's.
I thought (b | ((a^+)b)^* ) U (a | ( (b^+) a)* ) but it was wrong.
Does anyone know where I went wrong?
Assumption
I'll assume it should be "strings that consist of", not "strings which contain". The difference is that bbbbbaaabaabbbb would be a valid string if it's "contain" (since it contains aaabaa).
To make it "strings that contain", the only difference would be adding .*? to the start and .* to the end (or [ab]*? and [ab]* if you want to limit it to a and b).
Problem analysis
I believe you can simplify the problem to just "strings that consist of any number of a's followed by a single b followed by an even number of a's", since 0 is an even number.
I have no idea what ^ or U is doing in your regular expression. Is this language specific syntax (usually ^ indicates the start of the line / string)?
Solution
It should be as simple as:
a*b(aa)*
a* - any number of a's
b - a single b
(aa)* an even number of a's
EDIT:
According to comments, it appears that you may want strings that consist of something like:
any number of a's
followed by any number of the following:
a single b
followed by an even number of a's (number != 0)
optionally followed by a b
The regex would be:
a*(b(aa)+)*b?
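Both expressions can be compared side by side on a few inputs. This is a sketch under the "consist of" assumption stated above, so it uses fullmatch; SINGLE_B, REPEATED and full are hypothetical names.

```python
import re

SINGLE_B = re.compile(r'a*b(aa)*')        # a's, one b, an even number of a's
REPEATED = re.compile(r'a*(b(aa)+)*b?')   # the reading from the EDIT

def full(pattern, s):
    """True if pattern describes the whole string s."""
    return pattern.fullmatch(s) is not None

for s in ['b', 'aab', 'abaa', 'aba', 'abb']:
    print(s, full(SINGLE_B, s))

for s in ['', 'baab', 'abaabaaaab', 'bb', 'baba']:
    print(repr(s), full(REPEATED, s))
```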

How to create a regular expression to match non-consecutive characters?

How to create a regular expression for strings of a,b and c such that aa and bb will be rejected?
For example, abcabccababcccccab will be accepted and aaabc or aaabbcccc or abcccababaa will be rejected.
If this is not a purely academic question, you can simply search for aa and bb and negate your logic, for example:
import re
s = 'abcccabaa'
# continue if string does not match.
if re.search('(?:aa|bb)', s) is None:
    ...
or simply scan the string for the two substrings, avoiding regular expressions altogether:
if 'aa' not in s and 'bb' not in s:
    ...
For such an easy task RE is probably total overkill.
P.S.: The examples are in Python but the principle applies to other languages of course.
^(?!.*(?:aa|bb))[abc]+$
See it here on Regexr
This regex does two things:
verifies that your string consists only of a, b and c
fails on aa and bb
^ matches the start of the string
(?!.*(?:aa|bb)) negative lookahead assertion, will fail if there is aa or bb in the string
[abc]+ character class, allows only a,b,c at least one (+)
$ matches the end of the string
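The lookahead version can be exercised with the strings from the question. A small sketch in Python, where NO_DOUBLES and is_accepted are illustrative names:

```python
import re

# Negative lookahead rejects any string containing aa or bb;
# [abc]+ then restricts the alphabet and requires at least one character.
NO_DOUBLES = re.compile(r'^(?!.*(?:aa|bb))[abc]+$')

def is_accepted(s):
    return NO_DOUBLES.match(s) is not None

print(is_accepted('abcabccababcccccab'))
for s in ['aaabc', 'aaabbcccc', 'abcccababaa']:
    print(s, is_accepted(s))
```

Because of the +, the empty string is rejected; using [abc]* instead would accept it, if that is what the exercise wants.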
Using the & operator (intersection) and ~ (complement):
(a|b|c)*&~(.*(aa|bb).*)
Rewriting this without these operators is tricky. The usual approach is to break it into cases.
In this case it is not all that difficult.
Suppose that the letter c is taken out of the picture. The only sequences then which don't have aa and bb are:
e (empty string)
a
b
b?(ab)*a?
Next what we can do is insert some optional 'c' runs into all possible interior places:
e (empty string)
a
b
(bc*)?(ac*bc*)*a?
Next, we have to acknowledge that illegal sequences like aabb become accepted if non-optional 'c' runs are put in the middle, as in, for example, acacbcbc. We allow a final a and b. This pattern can take care of our lone a and b cases as well as matching the empty string:
(ac+|bc+)*(a|b)?
Then combine them together:
((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)
We are almost there: we also need to recognize that this pattern can occur an arbitrary number of times, as long as there are dividing 'c'-s between the occurrences, and with arbitrary leading or trailing runs of c-s around the whole thing:
c*((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)(c+((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?))*c*
Mr. Regex Philbin, I'm not coming up with any cases that this doesn't handle, so I'm leaving it as my final answer.
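Rather than hunting for counterexamples by hand, the final expression can be compared against the plain substring check on every short string over {a, b, c}. This is a brute-force sketch, not a proof; the names UNIT, FINAL, expected and mismatches, and the length bound of 6, are arbitrary choices for the illustration.

```python
import re
from itertools import product

# The repeated building block and the final expression from above.
UNIT = r'((ac+|bc+)*(a|b)?|(bc*)?(ac*bc*)*a?|(ac+|bc+)(a|b)?)'
FINAL = re.compile(r'c*' + UNIT + r'(c+' + UNIT + r')*c*')

def expected(s):
    """Reference answer: accept exactly the strings without aa or bb."""
    return 'aa' not in s and 'bb' not in s

def mismatches(max_len):
    """Return every string up to max_len on which the regex and the reference disagree."""
    bad = []
    for n in range(max_len + 1):
        for letters in product('abc', repeat=n):
            s = ''.join(letters)
            if (FINAL.fullmatch(s) is not None) != expected(s):
                bad.append(s)
    return bad

print(mismatches(6))
```

An empty list means the two checks agree on all strings up to that length.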

how do the regular expressions * and ? metacharacter work?

Hi I'm going through regular expressions but I'm confused about metacharacters, particularly '*' and '?'.
'*' is supposed to match the preceding character 0 or more times.
For example, 'ta*k' supposedly matches 'tak' and 'tk'.
But I wouldn't have thought this to be true at all - here's my reasoning:
for tak:
regexp: I need a 't'
string: I have 't'
regexp: okay, your next character needs to be an 'a'
string: yes it is
regexp: okay, keep giving me characters until your character isn't an 'a'
string: okay. I've just given you 'k'
regexp: okay, your next character needs to be a 'k'
string: I don't have any more characters left!
regexp: fail
for tk:
regexp: I need a 't'
string: I have 't'
regexp: okay, your next character needs to be an 'a'
string: no, it's a 'k'
regexp: fail
Can someone clarify for me why 'ta*k' matches 'tak' and 'tk'?
* does not mean to match a character zero or more times, but an atom zero or more times. A single character is an atom, but so is any grouping.
And * means zero or more. When the regex cursor has "swallowed" the t, the positions are:
in the regex: t|a*k
in the string: t|ak
The regex engine then tries to eat as many a's as possible. Here there is one. After it has swallowed it, the positions are:
in the regex: ta*|k
in the string: ta|k
Then the k is swallowed:
in the regex: ta*k|
in the string: tak|
End of regex, match. Note that the string may have other characters after this point; the regex engine doesn't care: it has a match.
In the case where the string is tk, before a* the positions are:
in the regex: t|a*k
in the string: t|k
But * can match an empty sequence of a's, therefore a* is satisfied! Which means the positions then become:
in the regex: ta*|k
in the string: t|k
Rinse, repeat. Now, let's take taak as an input and ta?k as a regex: this will fail, but let's see how...
# before first character
regex: |ta?k
input: |taak
# t
regex: t|a?k
input: t|aak
# a?
regex: ta?|k
input: ta|ak
# k? Oops! No...
regex: |ta?k
input: t|aak
# t? Oops! No...
regex: |ta?k
input: ta|ak
# t? Oops! No...
regex: |ta?k
input: taa|k
# t? Oops! No...
regex: |ta?k
input: taak|
# t? Oops! No... And nothing to read anymore
# FAIL
Which is why it is VERY important to make regexes fail FAST.
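The walkthroughs above can be reproduced with any regex engine. Here is a small sketch using Python's re.search; the helper name found is made up for the example.

```python
import re

def found(pattern, text):
    """Return the matched substring, or None if the pattern matches nowhere in text."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(found(r'ta*k', 'tak'))     # a* consumes one 'a'
print(found(r'ta*k', 'tk'))      # a* consumes nothing at all
print(found(r'ta*k', 'xtaaky'))  # characters around the match are ignored
print(found(r'ta?k', 'taak'))    # a? allows at most one 'a', so every start position fails
print(found(r'ta?k', 'tak'))
```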
Because a* means "zero or more instances of a".
When "it" asks for all characters that aren't "a", once it has one, it (roughly) pushes it back into the input stream. (Or it peeks ahead, or it just keeps it, etc.)
First sequence: here's your first non-"a", I'll hold on to that. You need a "k" next, that's what I have.
Second sequence: the next character doesn't need to be an "a"; it may be zero or more "a"s. In this case it's none. I'll hold on to that non-"a". You need a "k"? I got your "k" right here still.
You are one character ahead:
regexp: okay, keep giving me characters until your character isn't an 'a'
string: next character is not an 'a'
regexp: okay, your next character needs to be a 'k'
string: next char is a 'k'
So it works. Note that 'a*' means "0 or more occurrences of 'a'", and not "1 or more occurrences of 'a'". For the latter there's the '+' sign, as in 'a+'.
ta*k means one 't', followed by 0 or more 'a's, followed by one 'k'. So 0 'a' characters would make 'tk' a possible match.
If you want "1 or more" instead of "0 or more", use the + instead of *. That is, ta+k will match 'tak' but not 'tk'.
Let me know if there's anything I didn't explain.
By the way, RegEx doesn't always go left to right. The engine often backtracks, peeks ahead and studies the input. It's really complicated, which is why it's so powerful. If you look at sites such as this one, they sometimes explain what the engine is doing. I recommend their tutorials because that's where I learned about RegEx!
The fundamental thing to remember is that a regular expression is a convenient shorthand for typing out a set of strings. a{1,5} is simply shorthand for the set of strings (a, aa, aaa, aaaa, aaaaa). a* is shorthand for ([empty], a, aa, aaa, ...).
Thus, in effect, when you feed a regular expression to a search algorithm, you are telling it the list of strings to search for.
Consequently, when you feed ta*k to your search algorithm, you are actually feeding it the set of strings (tk, tak, taak, taaak, taaaak, ...).
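The "set of strings" view can be made concrete by generating the first few members of the set and confirming that each one is described by the expression. A sketch in Python; star_language is a hypothetical helper name.

```python
import re
from itertools import count, islice

def star_language():
    """Yield tk, tak, taak, ...: the strings that ta*k is shorthand for."""
    for n in count():
        yield 't' + 'a' * n + 'k'

first_five = list(islice(star_language(), 5))
print(first_five)

# Every generated string is described by the whole expression.
print(all(re.fullmatch(r'ta*k', s) for s in first_five))
```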
So, yes, it is useful to understand how the search algorithm will work, so that you can offer the most efficient regular expression, but don't let the tail wag the dog.