Proving that a language is regular by giving a regular expression - regex

I am stumped by this practice problem (not for marks):
{w is an element of {a,b}* : the number of a's is even and the number of b's is even }
I can't seem to figure this one out.
In this case 0 is considered even.
A few acceptable strings: {}, {aa}, {bb}, {aabb}, {abab}, {bbaa}, {babaabba}, and so on
I've done similar examples where the a's must be a prefix, where the answer would be:
(aa)*(bb)*
but in this case they can be in any order.
Kleene stars (*), unions (U), intersects (&), and concatenation may be used.
Edit: Also have trouble with this one
{w is an element of {0,1}* : w = 1^r 0 1^s 0 for some r,s >= 1}

This is kind of ugly, but it should work:
ε U ( (aa) U (bb) U ((ab) U (ba)) ((aa) U (bb))* ((ab) U (ba)) )*
For the second one:
11*011*0
Generally I would use a+ instead of aa* here.
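A quick way to gain confidence in expressions like these is a brute-force comparison against the set definition. The sketch below is an illustration only: it uses Python's re module (where union is written | instead of U), checks the expression (aa|bb|(ab|ba)(aa|bb)*(ab|ba))* for the even/even language and 11*011*0 for the second one, and the helper names words, even_even, second_language and mismatches are made up for this example. Agreement up to a length bound is evidence, not a proof.

```python
import re
from itertools import product

# Python's re writes union as "|" instead of "U".
EVEN_EVEN = re.compile(r"(aa|bb|(ab|ba)(aa|bb)*(ab|ba))*")
SECOND = re.compile(r"11*011*0")

def words(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def even_even(w):
    """Set definition: even number of a's and even number of b's."""
    return w.count("a") % 2 == 0 and w.count("b") % 2 == 0

def second_language(w):
    """Set definition: w = 1^r 0 1^s 0 with r, s >= 1 (w is over {0, 1})."""
    parts = w.split("0")
    return (len(parts) == 3 and parts[2] == ""
            and len(parts[0]) >= 1 and len(parts[1]) >= 1)

def mismatches(pattern, predicate, alphabet, max_len):
    """Strings on which the regex and the set definition disagree."""
    return [w for w in words(alphabet, max_len)
            if bool(pattern.fullmatch(w)) != predicate(w)]

print(mismatches(EVEN_EVEN, even_even, "ab", 10))
print(mismatches(SECOND, second_language, "01", 10))
```

Both lists come out empty for every string up to length 10.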

Edit: Undeleted re: the comments in NullUserException's answer.
1) I personally think this one is easier to conceptualize if you first construct a DFA that can accept the strings. I haven't written it down, but off the top of my head I think you can do this with 4 states and one accept state. From there you can create an equivalent regex by removing states one at a time using an algorithm such as this one. This is possible because DFAs and regexes are provably equivalent.
2) Consider the fact that the Kleene star only applies to the nearest regular expression. Hence, if you have two individual ungrouped atoms (an atom itself is a regex!), it only applies to the second one (as in, ab* would match a single a and then any number - including 0 - b's). You can use this to your advantage in a case where you want something to exist, but you're not sure of how many there are.
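The four-state DFA mentioned in point 1 can be sketched directly. In this illustrative Python version (the names START, ACCEPT, TRANSITIONS and accepts are invented here, not taken from the answer), each state is the pair (parity of a's, parity of b's), and the start state doubles as the only accept state.

```python
# Each state records (parity of a's seen so far, parity of b's seen so far).
START = ACCEPT = (0, 0)

# Reading an a flips the first parity, reading a b flips the second.
TRANSITIONS = {
    ((pa, pb), ch): ((pa + 1) % 2, pb) if ch == "a" else (pa, (pb + 1) % 2)
    for pa in (0, 1) for pb in (0, 1) for ch in "ab"
}

def accepts(word):
    """Run the DFA on `word`; raises KeyError on letters outside {a, b}."""
    state = START
    for ch in word:
        state = TRANSITIONS[(state, ch)]
    return state == ACCEPT

for sample in ["", "aa", "abab", "babaabba", "ab", "aab"]:
    print(repr(sample), accepts(sample))
```

Eliminating the three non-start states one at a time from this automaton is what yields a regular expression for the language.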

Related

Haskell check if the regular expression r made up of the single symbol alphabet Σ = {a} defines language L(r) = a*

I have got to write an algorithm programmatically using Haskell. The program takes a regular expression r made up of the unary alphabet Σ = {a} and checks if the regular expression r defines the language L(r) = a^* (Kleene star). I am looking for any kind of tip. I know that I can translate any regular expression to the corresponding NFA, then to the DFA, and at the very end minimize the DFA and compare, but is there any other way to achieve my goal? I am asking because it is clearly said that this is the unary alphabet, so I suppose that I have to use this information somehow to make this exercise much easier.
This is how my regular expression data type looks like
data Reg = Epsilon        -- epsilon regex
         | Literal Char   -- a
         | Or Reg Reg     -- (a|a)
         | Then Reg Reg   -- (aa)
         | Star Reg       -- (a)*
         deriving Eq
Yes, there is another way. Every DFA for regular languages on the single-letter alphabet is a "lollipop" [1]: an initial string of nodes that each point to the next (some of which are marked as final and some not) followed by a loop of nodes (again, some of which are marked as final and some not). So instead of doing a full compilation pass, you can go directly to a DFA, where you simply store two [Bool] saying which nodes in the lead-in and in the loop are marked final (or perhaps two [Integer] giving the indices and two Integer giving the lengths may be easier, depending on your implementation plans). You don't need to ensure the compiled version is minimal; it's easy enough to check that all the Bools are True. The base cases for Epsilon and Literal are pretty straightforward, and with a bit of work and thought you should be able to work out how to implement the combining functions for "or", "then", and "star" (hint: think about gcd's and stuff).
[1] You should try to prove this before you begin implementing, so you can be sure you believe me.
Edit 1: Hm, while on my afternoon walk today, I realized the idea I had in mind for "then" (and therefore "star") doesn't work. I'm not giving up on this idea (and deleting this answer) yet, but those operations may be trickier than I gave them credit for at first. This approach definitely isn't for the faint of heart!
Edit 2: Okay, I believe now that I have access to pencil and paper I've worked out how to do concatenation and iteration. Iteration is actually easier than concatenation. I'll give a hint for each -- though I have no idea whether the hint is a good one or not!
Suppose your two lollipops have a length m lead-in and a length n loop for the first one, and m'/n' for the second one. Then:
For iteration of the first lollipop, there's a fairly mechanical/simple way to produce a lollipop with a 2*m + 2*n-long lead-in and n-long loop.
For concatenation, you can produce a lollipop with m + n + m' + lcm(n, n')-long lead-in and n-long loop (yes, that short!).
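To see the lollipop shape concretely, here is a small sketch. It is not the direct lollipop-combining construction hinted at above; it is the NFA-to-DFA route from the question, specialised to one letter, where the subset-construction DFA is a single chain of state sets that eventually loops. It is written in Python purely for illustration, the tuples are a hypothetical stand-in for the Reg constructors, and all function names are assumptions of this example.

```python
from itertools import count

# Tuples stand in for the Haskell constructors of Reg (hypothetical encoding).
EPSILON = ("Epsilon",)
LITERAL = ("Literal",)   # the only letter is a, so the Char is dropped
def Or(r, s): return ("Or", r, s)
def Then(r, s): return ("Then", r, s)
def Star(r): return ("Star", r)

def build(reg, fresh, eps, step):
    """Thompson construction: return (start, end), filling the edge maps."""
    start, end = next(fresh), next(fresh)
    def link(src, dst):
        eps.setdefault(src, set()).add(dst)
    kind = reg[0]
    if kind == "Epsilon":
        link(start, end)
    elif kind == "Literal":
        step.setdefault(start, set()).add(end)   # the single a-transition
    elif kind == "Or":
        for sub in reg[1:]:
            s, e = build(sub, fresh, eps, step)
            link(start, s)
            link(e, end)
    elif kind == "Then":
        s1, e1 = build(reg[1], fresh, eps, step)
        s2, e2 = build(reg[2], fresh, eps, step)
        link(start, s1)
        link(e1, s2)
        link(e2, end)
    elif kind == "Star":
        s, e = build(reg[1], fresh, eps, step)
        link(start, s)
        link(start, end)
        link(e, s)
        link(e, end)
    else:
        raise ValueError(f"unknown constructor: {kind}")
    return start, end

def closure(states, eps):
    """All states reachable from `states` through epsilon edges."""
    todo, seen = list(states), set(states)
    while todo:
        for nxt in eps.get(todo.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return frozenset(seen)

def defines_a_star(reg):
    """True iff every state set on the lead-in and the loop is accepting."""
    eps, step = {}, {}
    start, end = build(reg, count(), eps, step)
    current, visited = closure({start}, eps), set()
    while current not in visited:        # stop once the loop closes
        if end not in current:           # some a^k is rejected
            return False
        visited.add(current)
        moved = set().union(*(step.get(q, ()) for q in current))
        current = closure(moved, eps)
    return True

print(defines_a_star(Star(LITERAL)))
print(defines_a_star(Star(Then(LITERAL, LITERAL))))
```

The loop in defines_a_star walks the lead-in and then once around the loop of the lollipop, which is why no minimisation step is needed.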

What is the difference between (a+b)* and (a*b*)*?

Assuming that Σ = {a, b}, I want to find out the regular expression (RE) Σ* (that being the set of all possible strings over the alphabet Σ).
I came up with below two possibilities:
(a+b)*
(a*b*)*
However, I can't decide by myself which RE is correct, or if both are bad. So, please tell me the correct answer.
The + operator is typically used to indicate union (|, "or") in academic regular expressions, not "one or more" as it typically means in non-academic settings (such as most regex implementations).
So, a+b means [ab] or a|b, thus (a+b)* means any string of length 0 or more, containing any number of as and bs in any order.
Likewise, (a*b*)* also means any string of length 0 or more, containing any number of as and bs in any order.
The two expressions are different ways of expressing the same language.
In normal regular expression grammar, (a+b)* means zero or more of any sequence that starts with a, then has zero or more further a's, then a b. This rules out things like baa (it doesn't start with a), abba, and a (there must be exactly one b after each group of a's), so it is not correct.
(a*b*)* means zero or more of any sequence that contain zero or more a followed by zero or more b. This is more correct since it allows for either starting character, any order and quantity of characters, and so on. It also allows the empty string which I'm pretty certain should be allowed by Σ* (but I'll leave that up to you).
However, it may be better to opt for the much simpler [ab]* (or [ab]+ in the unlikely event you consider an empty string invalid). This is basically zero (one for the + variant) or more of any character drawn from the class [ab].
However, it's possible, since you're using Σ, that you may be discussing formal language theory (where Σ is common) rather than regex grammar (where it tends not to be).
If that is the case then you should understand that there are variants of the formal language where the a | b expression (effectively [ab] in regex grammar) can instead be rendered as one of a ∪ b, a ∨ b or a + b, with each of those operator symbols representing "logical or".
That would mean that (a+b)* is actually correct (as it is equivalent to the regex grammar I gave above) for what you need since it basically means any character from the set {a, b}, repeated zero or more times.
Additionally, that's also covered by your (a*b*)* option but it's almost always better to choose the simplest one that does the job :-)
And just something else to keep in mind for the formal language case. In English (for example), "a" is a word but you'd struggle to find anyone supporting the possibility that "" is also a word. Try looking it up in a dictionary :-)
In other words, any regular expression that allows an empty sequence of the language characters (such as (a+b)*) may not be suitable. You may find that (a+b)(a+b)* is a better option. This depends on whether Σ* allows for the empty sequence.
According to the algebraic properties of regular expressions,
(a*b*)* = (a+b)*
Therefore (a+b)* = (a*b*)*
Extra information:
L((a+b)*)
= (L(a+b))*
= (L(a) U L(b))*
= ({a} U {b})*
= {a,b}*
= {ε, a, b, aa, bb, ab, abab, aba, bbba,...}
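The two readings of + discussed in this thread can be compared mechanically. The sketch below is an illustration using Python's re module, where + always means "one or more" and union has to be written |; the names UNION_STAR, NESTED_STAR, PLUS_AS_QUANTIFIER and words are invented for this example. It checks that the textbook (a+b)*, written (a|b)*, and (a*b*)* accept the same strings up to a length bound, and that the programming-style reading of (a+b)* does not accept baa.

```python
import re
from itertools import product

UNION_STAR = re.compile(r"(a|b)*")           # textbook (a+b)*, + read as union
NESTED_STAR = re.compile(r"(a*b*)*")
PLUS_AS_QUANTIFIER = re.compile(r"(a+b)*")   # + read as "one or more"

def words(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

# Both expressions should accept every string over {a, b}.
disagreements = [
    w for w in words("ab", 8)
    if bool(UNION_STAR.fullmatch(w)) != bool(NESTED_STAR.fullmatch(w))
]
print(disagreements)

# With + as a quantifier, each repetition must be a's followed by one b.
print(PLUS_AS_QUANTIFIER.fullmatch("aab") is not None)
print(PLUS_AS_QUANTIFIER.fullmatch("baa") is not None)
```

A bounded check like this is only supporting evidence; the algebraic derivation above is what actually establishes the equality.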

Does order not matter in regular expressions?

I was looking at the question posed in this stackoverflow link (Regular expression for odd number of a's), which asks for a regular expression for strings over Σ = {a,b} that have an odd number of a's.
The answer given by the top comment which works is b*(ab*ab*)*ab*.
I am quite confused - a was placed just before the last b*, does this ordering actually matter? Why can't it be b*a(ab*ab*)*b* instead (where a is placed after the first b*), or any other permutation of it?
Another thing I am confused about is why it is (ab*ab*)* and not (b*ab*ab*)*. Isn't b*ab*ab* the more accurate definition of 'having exactly 2 a'?
Why can't it be b*a(ab*ab*)*b* instead?
b*a(ab*ab*)*b* does not work because it would require the string to have two consecutive as before the first non-leading b, wouldn't it? For example, abaa would not be matched by your proposed regex when it should. Use the regex debugger on a site like Regex101 to see this for yourself.
On the other hand, moving the whole ab* part to the start (b*ab*(ab*ab*)*) works as well.
why it is (ab*ab*)* and not (b*ab*ab*)*?
(b*ab*ab*)* does work, but the leading b* inside the group is redundant: any b's that precede a repetition are already matched by the trailing b* of the previous repetition, or, before the first repetition, by the b* in front of the group. The leading b* therefore never has anything left to match.
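If no online debugger is at hand, the same comparison can be scripted. This is a sketch using Python's re module; the names ORIGINAL, MOVED_A, MOVED_TAIL, REDUNDANT, words and mismatches are made up for the example. Each candidate is compared with the plain definition "odd number of a's" on every string up to a small length bound.

```python
import re
from itertools import product

ORIGINAL = re.compile(r"b*(ab*ab*)*ab*")
MOVED_A = re.compile(r"b*a(ab*ab*)*b*")       # the proposed permutation
MOVED_TAIL = re.compile(r"b*ab*(ab*ab*)*")    # whole "ab*" moved to the front
REDUNDANT = re.compile(r"b*(b*ab*ab*)*ab*")   # extra b* inside the group

def words(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def mismatches(pattern, max_len=8):
    """Strings where the regex disagrees with 'odd number of a's'."""
    return [w for w in words("ab", max_len)
            if bool(pattern.fullmatch(w)) != (w.count("a") % 2 == 1)]

print(mismatches(ORIGINAL))
print(mismatches(MOVED_TAIL))
print(mismatches(REDUNDANT))
print(mismatches(MOVED_A)[:3])   # a few strings the permutation gets wrong
```

The first three lists are empty up to the bound, while the list for MOVED_A is not: it contains abaa, the counterexample given above.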
There are infinitely many equivalent regular expressions which generate a given (infinite) regular language. A particular expression might be preferable in some cases and by certain authors: one might prefer a minimal expression, or one which shows structure or symmetry, or even one that simplifies the reasoning in a proof by induction.
Your particular suggestion to move the a is insufficient since, as noted above, that ensures the substring aa will appear in any string with more than one a. However, (ab*ab*)* could be changed to (b*ab*a)* to make that placement work. Choosing (b*ab*ab*)* would work with either placement. You could even go for an expression like b*ab* + b*ab*ab*ab* + (b*ab*ab*)*a(b*ab*ab*)*, which might be nice to work with depending on your application. Something like b*(ab*ab*)*ab* has the advantage of being minimal (if it's not strictly minimal, it must be pretty close).

How do I convert language set notation to regular expressions?

I have the following question on regular expressions and I just can't get my head around these kinds of problems.
L1 = { 0^n 1^m | n ≥ 3 ∧ m is odd }
How would I write a regular expression for this sort of problem when the alphabet is {0,1}.
What's the answer?
The regular expression for your example is:
000+1(11)*
So what does this do?
The first two characters, 00, are literal zeros. This is going to be important for the next point
The second two characters, 0+, mean "at least one zero, no upper bound". These first four characters satisfy the first condition, which is that we have at least three zeros.
The next character, 1, is a literal one. Since we need to have an odd number of ones, this is the smallest number we're allowed to have
The last-but-one characters, (11), represent a logical grouping of two literal ones, and the ending * says to match this grouping zero or more times. Since we always have at least one 1, we'll always match an odd number. So we're done.
How'd I get that?
The key is knowing regular expression syntax. I happen to have quite a bit of experience in it, but this website helped me to verify.
Once you know the basic building blocks of regex, you need to break down your problem into what you can represent.
For example, regex lets us specify a lower and upper bound for matching (the {x,y} syntax), and {x} matches exactly x times. Many engines also accept an open-ended {x,} for a lower bound only, but + and * are the basic specifiers that permit an unbounded number of matches, so I knew I would use one of those for the zeros. I also knew that it didn't make sense to apply those modifiers to a group; the restriction that we must have at least 3 zeroes doesn't imply that we must have a multiple of three, for example, so (000)+ was out. I had to apply the modifier to only one character, which meant I had to match a few literals first. 000 matches exactly three 0s, and appending 0* (giving 0000*) allows any number of extra zeros, which is exactly what I want; I then condensed that to the equivalent 000+.
For the second condition, I had to think about what an odd number is. By definition, an odd number can be expressed by 2*k + 1, where k is an integer. So I had to match one 1 (Hence the literal 1), and some number of the substring 11. That led me to the group, and then the *. On a slightly different problem, you could write 1(11)+ to match any odd number of ones, and at least 3.
Note: A colleague of mine pointed out to me that the + operator isn't technically part of the formal definition of regular expressions. If this is an academic question rather than a programming one, you might find the 0000* version more helpful. In that case, the final string would be 0000*1(11)*
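Both spellings, 000+1(11)* and 0000*1(11)*, can be checked against the set definition by brute force. The following is a sketch using Python's re module; the names PROGRAMMING, FORMAL, in_l1 and words are assumptions made for this example.

```python
import re
from itertools import product

PROGRAMMING = re.compile(r"000+1(11)*")   # uses the + shorthand
FORMAL = re.compile(r"0000*1(11)*")       # star-only version

def in_l1(w):
    """Set definition: w = 0^n 1^m with n >= 3 and m odd."""
    n = len(w) - len(w.lstrip("0"))       # length of the leading run of zeros
    ones = w[n:]
    return n >= 3 and set(ones) <= {"1"} and len(ones) % 2 == 1

def words(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

for name, pattern in [("000+1(11)*", PROGRAMMING), ("0000*1(11)*", FORMAL)]:
    bad = [w for w in words("01", 10)
           if bool(pattern.fullmatch(w)) != in_l1(w)]
    print(name, bad)
```

For both patterns the list of disagreements is empty up to length 10, which matches the reasoning about "at least three zeros" and "2*k + 1 ones".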

Regular expressions to match word with three b, correct form?

I find this very ambiguous and vague and I would love to understand
I have these strings
abbb
bbb
aaaabaaabaaabaaabaaabaaab
babba
bbbaaaa
aaaaabbaba
They are all valid because each contains a multiple of three b's. I then tried:
(a*ba*ba*ba*)* and this matches them all
(a*ba*ba*b)*a* this matches them all as well
a*(ba*ba*ba*)* same as above
Are these really all the same? Or there are edge cases that I am not seeing?
All of your regexes match the empty string, which doesn't have 3 b's.
This one,
(a*ba*ba*ba*)*
does not match aa. But the following match aa, and they are also equivalent:
(a*ba*ba*b)*a*
a*(ba*ba*ba*)*
If you want to force at least 3 b's, you have to take the b's out of the Kleene star:
(a|b)*b(a|b)*b(a|b)*b(a|b)*
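These claims are easy to check by enumeration. The sketch below uses Python's re module with names invented for the example (FIRST, SECOND, THIRD, AT_LEAST_3, words); it compares the patterns on every string over {a, b} up to a small length bound.

```python
import re
from itertools import product

FIRST = re.compile(r"(a*ba*ba*ba*)*")
SECOND = re.compile(r"(a*ba*ba*b)*a*")
THIRD = re.compile(r"a*(ba*ba*ba*)*")
AT_LEAST_3 = re.compile(r"(a|b)*b(a|b)*b(a|b)*b(a|b)*")

def words(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def ok(pattern, w):
    return pattern.fullmatch(w) is not None

all_words = list(words("ab", 8))

# SECOND and THIRD accept exactly the strings whose b-count is a multiple of 3.
print(all(ok(SECOND, w) == ok(THIRD, w) == (w.count("b") % 3 == 0)
          for w in all_words))

# FIRST differs from them only on non-empty strings made of a's alone.
print([w for w in all_words if ok(FIRST, w) != ok(SECOND, w)][:3])

# AT_LEAST_3 accepts exactly the strings with three or more b's.
print(all(ok(AT_LEAST_3, w) == (w.count("b") >= 3) for w in all_words))
```

The second print shows a, aa, aaa: strings with zero b's that the last two patterns accept and the first does not.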
* means zero or more. So even if you use unrelated regexes like the ones below
(d*ef*gg*hi*)*
(s*o*m*e*t*h*i*n*g*)
etc.
they will still report a match on your strings, because each of them can match the empty string. Take your first regex:
(a*ba*ba*ba*)*
It reads: (zero or more a's, then a b, then zero or more a's, then a b, then zero or more a's, then a b, then zero or more a's), with the whole group repeated zero or more times. Zero repetitions are allowed, so finding nothing still counts as a match.
Similarly for your second case:
(a*ba*ba*b)*a*
(zero or more a's, then a b, then zero or more a's, then a b, then zero or more a's, then a b), repeated zero or more times, followed by zero or more a's.
So your regexes allow so many zero-occurrence cases that you cannot see a clear difference between them. It is better to use + instead of * where something must be present: a + quantifier matches only if the item occurs at least once.
you can play around with regex on this site here : http://regex101.com/r/rM5zQ1
for basic learnings regexone will be really helpful for you.
Hope that helps !
You should use + after the group instead of *, or else an empty string would be accepted:
(a*ba*ba*ba*)+
Although this would only allow multiples of 3. If you want at least 3 and any number of extras, it would be:
a*ba*ba*b(a|b)*
This works for those requirements. But it isn't a good approach. In your example you are searching for "a" and "b", which are single character patterns, and it's already an unreasonably long expression for the simple rule "has 3 b's" in my opinion. But what if the patterns were more complex? You would need to repeat them at least 3 times, making it even more unwieldy.
And what if the rules change slightly? If you wanted to match a maximum instead of a minimum number of b's, it would become even more complex / repetitive, because your only choice would be to combine the patterns for each possible number (1, 2, 3):
(a*ba*|a*ba*ba*|a*ba*ba*ba*)
Or if you decide the word must be a certain length, it actually becomes impossible, short of listing every permutation (for a 7 letter word, ba{3}bab, a{2}babab, b{3}a{4} etc.).
So, I think a better way to solve this is to match the basic generic pattern, then examine the results of the match to check the counts. For example, just match a "word":
(a|b)+
Then on the matching text, match b:
b
and test the number of matches and/or length of text as needed. Each pattern is only repeated a maximum of twice, and your code can easily be adapted to different requirements.
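A minimal sketch of this two-step approach, assuming Python's re module; the function name check_word and its parameters min_b, max_b and length are hypothetical and only serve to illustrate the idea.

```python
import re

WORD = re.compile(r"(a|b)+")   # step 1: the generic "word" pattern

def check_word(text, min_b=0, max_b=None, length=None):
    """Match the generic pattern first, then test the counts in code."""
    if WORD.fullmatch(text) is None:
        return False
    b_total = len(re.findall(r"b", text))   # step 2: count the b matches
    if b_total < min_b:
        return False
    if max_b is not None and b_total > max_b:
        return False
    if length is not None and len(text) != length:
        return False
    return True

print(check_word("babba", min_b=3))                        # at least three b's
print(check_word("bbbb", max_b=3))                         # at most three b's
print(check_word("baaabab", min_b=3, max_b=3, length=7))   # exactly 3 b's, 7 letters
```

Changing a requirement then means changing an argument rather than rewriting the pattern.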