What is the difference between (a+b)* and (a*b*)*? - regex

Assuming that Σ = {a, b}, I want to find out the regular expression (RE) Σ* (that being the set of all possible strings over the alphabet Σ).
I came up with below two possibilities:
(a+b)*
(a*b*)*
However, I can't decide by myself which RE is correct, or if both are bad. So, please tell me the correct answer.

The + operator is typically used to indicate union (|, "or") in academic regular expressions, not "one or more" as it typically means in non-academic settings (such as most regex implementations).
So, a+b means [ab] or a|b, thus (a+b)* means any string of length 0 or more, containing any number of as and bs in any order.
Likewise, (a*b*)* also means any string of length 0 or more, containing any number of as and bs in any order.
The two expressions are different ways of expressing the same language.
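A brute-force check can lend some confidence to this claim, though it is evidence rather than a proof. The sketch below assumes Python's re module, where the academic union a+b is written a|b; the helper name all_strings and the length bound of 8 are arbitrary choices made for illustration.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# The academic (a+b)* is spelled (a|b)* in Python's regex syntax.
union_star = re.compile(r"(a|b)*")
nested_star = re.compile(r"(a*b*)*")

# Strings on which the two expressions disagree (expected: none).
disagreements = [
    s for s in all_strings("ab", 8)
    if bool(union_star.fullmatch(s)) != bool(nested_star.fullmatch(s))
]

# Strings over {a, b} that (a*b*)* fails to match (expected: none).
rejected = [s for s in all_strings("ab", 8) if not nested_star.fullmatch(s)]

print(disagreements, rejected)
```

Both lists should come back empty, which is consistent with each expression describing every string over {a, b}.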

In normal regular expression grammar, (a+b)* means zero or more repetitions of a sequence that starts with a, continues with zero or more further a, and ends with a single b. This excludes things like baa (it doesn't start with a), abba, and a (there must be exactly one b after each group of a), so it is not correct.
(a*b*)* means zero or more repetitions of a sequence that contains zero or more a followed by zero or more b. This is more correct since it allows for either starting character, any order and quantity of characters, and so on. It also allows the empty string which I'm pretty certain should be allowed by Σ* (but I'll leave that up to you).
However, it may be better to opt for the much simpler [ab]* (or [ab]+ in the unlikely event you consider an empty string invalid). This is basically zero (one for the + variant) or more of any character drawn from the class [ab].
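To make the regex-grammar reading concrete, here is a small sketch, assuming Python's re module (where + means "one or more"). The sample strings are the ones discussed above plus a few extras chosen for illustration.

```python
import re

practical = re.compile(r"(a+b)*")   # '+' read as "one or more", not as union
char_class = re.compile(r"[ab]*")   # the simpler alternative

samples = ["", "aab", "abab", "a", "baa", "abba"]

# Map each sample to (matched by (a+b)*, matched by [ab]*).
results = {
    s: (bool(practical.fullmatch(s)), bool(char_class.fullmatch(s)))
    for s in samples
}

for s, (p, c) in results.items():
    print(repr(s), p, c)
```

Under this reading (a+b)* rejects a, baa and abba, while [ab]* accepts every sample.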
However, it's possible, since you're using Σ, that you may be discussing formal language theory (where Σ is common) rather than regex grammar (where it tends not to be).
If that is the case then you should understand that there are variants of the formal language where the a | b expression (effectively [ab] in regex grammar) can instead be rendered as one of a ∪ b, a ∨ b or a + b, with each of those operator symbols representing "logical or".
That would mean that (a+b)* is actually correct (as it is equivalent to the regex grammar I gave above) for what you need since it basically means any character from the set {a, b}, repeated zero or more times.
Additionally, that's also covered by your (a*b*)* option but it's almost always better to choose the simplest one that does the job :-)
And just something else to keep in mind for the formal language case. In English (for example), "a" is a word but you'd struggle to find anyone supporting the possibility that "" is also a word. Try looking it up in a dictionary :-)
In other words, a regular expression that allows an empty sequence of the language characters (such as (a+b)*) may not be suitable if you need at least one character; in that case (a+b)(a+b)* is the better option. Note, though, that Σ* as it is standardly defined does include the empty string, so (a+b)* is the expression that matches it exactly.

According to the algebraic properties of regular expressions,
(a*b*)* = (a+b)*
Therefore (a+b)* = (a*b*)*
Extra information:
L((a+b)*) = (L(a+b))*
= (L(a) U L(b))*
= ({a} U {b})*
= {a,b}*
= {ε, a, b, aa, bb, ab, abab, aba, bbba,...}
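The identity can also be justified without quoting it as a law. The following is one possible argument, a sketch that relies only on two standard facts: the star operation is monotone with respect to set inclusion, and (X*)* = X*.

```latex
\begin{align*}
\{a,b\} &\subseteq L(a^*b^*) \subseteq \{a,b\}^* \\
\{a,b\}^* &\subseteq \bigl(L(a^*b^*)\bigr)^* \subseteq \bigl(\{a,b\}^*\bigr)^* = \{a,b\}^* \\
L\bigl((a^*b^*)^*\bigr) &= \{a,b\}^* = L\bigl((a+b)^*\bigr)
\end{align*}
```

The first line holds because a and b are each matched by a*b*, and every string matched by a*b* consists only of a and b. Starring all three sets gives the second line, and the third follows because the outer sets coincide.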

Related

Automata - Regular Expression (Union Case)

Automata 1) Recognizes strings with at least 2 a
Regular Expression = b*ab*a(a+b)*
Automata 2) Recognizes strings with at least 2 b
Regular Expression = a*ba*b(a+b)*
The regular expression obtained from A3 = A1 U A2 is equivalent to R3 = R1 + R2? Or it's not?
R3 = b*ab*a(a+b)* + a*ba*b(a+b)*
There is neither "one" automaton nor "one" regular expression for any language; generally there are many reasonable ones and many more (maybe infinitely many) unreasonable ones. In this sense, your question is not entirely well-posed: the regular expression corresponding to the union of two DFAs may or may not look like regular expressions for the original DFAs, +'ed together.
So, if you mean, can they look the same, the answer is likely yes. If you mean, must they look the same, answer is likely no. If you instead want to fix the algorithms for constructing the union machine and getting the regular expression, maybe we could show that a fixed method of doing it always gives the same answer.
In your specific case, applying the Cartesian Product Machine construction to get a DFA for the union of the original DFAs and then applying the construction from the proof of equivalence between DFAs and REs, we can see that the structure of the RE obtained by +'ing the original REs can't be achieved starting from a DFA; you'd have needed an NFA to get a + between the LHS and RHS, but DFAs can only + among individual symbols, not subexpressions. Of course, it might be possible the RE can be algebraically manipulated to derive the target RE, but that isn't exactly the same.
All of the above holds for the question of equality of REs. However, you asked about equivalence. Almost always, we say two REs are equivalent if they generate the same language. If this is what you meant, then yes, +ing the two REs will give an RE equivalent to the one obtained by constructing a union machine and deriving an RE from that. The REs will not look the same but will generate the same language, just as (ab + e)(abab)* and (ab)* generate the same language despite looking a bit different.
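The equivalence claim can be spot-checked by simulating a product-style union machine and comparing it with R3 on all short strings. This is only a sketch: the state encoding (counts of a and b, each capped at 2) is an assumption about what A1 and A2 look like, the names are hypothetical, and Python's | stands in for the academic +.

```python
import re
from itertools import product

# R3 = R1 + R2, with the academic '+' written as '|'.
R3 = re.compile(r"b*ab*a(a|b)*|a*ba*b(a|b)*")

def union_dfa_accepts(word):
    """Simulate the product of A1 and A2 on a word over {a, b}.

    The state is (i, j): i counts a's seen so far (capped at 2, A1's
    component) and j counts b's seen so far (capped at 2, A2's component).
    The product accepts when either component is in its accepting state.
    """
    i = j = 0
    for ch in word:
        if ch == "a":
            i = min(i + 1, 2)
        else:
            j = min(j + 1, 2)
    return i == 2 or j == 2

words = ["".join(t) for n in range(9) for t in product("ab", repeat=n)]
mismatches = [w for w in words if union_dfa_accepts(w) != bool(R3.fullmatch(w))]
print(mismatches)
```

An empty list of mismatches is consistent with R3 and the union machine accepting the same language, even though an RE derived mechanically from the machine would look quite different.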
Regular expressions are not like finite state parsers and it's usually a mistake to try to incorporate them into complex parsing scenarios.
But also, they are marvelous tools for specific problems. After reading your descriptive requirements, there is a simple regular expression that accomplishes it, but in a way you might not expect. Your requirements:
strings with at least 2 a
strings with at least 2 b
The union of the two, or strings with at least two a's or two b's
([ab]).*?\1
This expression opens a capture group to capture either a or b. Then it allows zero or more 'any characters' followed by whatever was captured in the capture group (\1).
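A minimal sketch of how this backreference pattern behaves, assuming Python's re module and an unanchored search; the function name has_repeated_letter is made up for the example.

```python
import re

# Capture one a or b, skip as little as possible, then require the same letter.
pair = re.compile(r"([ab]).*?\1")

def has_repeated_letter(text):
    """Return True if some letter from {a, b} occurs at least twice in text."""
    return pair.search(text) is not None

for sample in ["", "a", "ab", "ba", "aba", "abb"]:
    print(repr(sample), has_repeated_letter(sample))
```

Note that . matches characters outside {a, b} as well, so this pattern is only faithful to the original languages when the input is already known to consist of a and b. Backreferences are also an extension beyond formal regular expressions.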

How does one define a "language" using a regular expression?

I would like to define "some language" using a regular expression. The requirements are:
1. The language must contain an infinite number of strings.
2. The underlying alphabet must have at least three different characters.
3. I also need to draw a deterministic finite state automaton that accepts the strings of that language.
4. Give two character strings that are accepted by that finite state automaton and two that are not.
Given this set of requirements, I have thus far (based on my 20 years old memory of set theory and the math associated with it), come up with the following and would appreciate some input from a set-theory, regular expression and formal language definition expert (I know there are many of you who have a deeply vested interest in this subject).
Does the following come even close to fulfilling (1) and (2) at least? What does (4) actually imply? For instance, if the set can hold infinite strings (in theory), as per requirement (1), then how can we fulfill requirement (4) which says "Given 2 strings that are accepted by the (FSA) and TWO THAT ARE NOT"???
My current (rather fallible) solution is:
Alphabet:
∑ = {s,a,e,t,n}
Language:
L* = { Ø , ∈ , taste, set, ate, sane, ….}
OR (using regular expression)
L* = [saetn]*
Any takers?
Thanks.
First of all, the regular expression [saetn]* would accept all strings over the alphabet you chose, so you would be unable to find two which are not in the language (The language would be L = Σ*) and can't satisfy requirement (4).
L = { Ø , ε , taste, set, ate, sane, ...}
is not a valid language, because a language cannot contain Ø. The empty set is not a string (and a language is a set of strings, not a set of sets). Let's remove the Ø.
L = { ε , taste, set, ate, sane, ...}
Does the following come even close to fulfilling (1) and (2) at least?
It doesn't fulfil (1), as there is no reasonable pattern for the ... to have any meaning. The language looks finite.
L = { ε , taste, set, ate, sane }
Would be a valid finite language where ε denotes the empty string. All finite languages are regular, since you can create an expression that is an OR of all the strings in the language (|taste|set|ate|sane).
It does fulfil (2), as you picked the alphabet ∑ = {s,a,e,t,n}, which has 5 elements.
What does (4) actually imply?
It means that the language can't contain all strings over the alphabet. There must be at least two strings in Σ* that are not in the language, and you must show what they are. That doesn't prevent the language from being infinite.
An example of an infinite language would be:
L = { ε, s, a, t, ss, aa, tt, sss, aaa, ttt, ssss, ... }
That language (over the alphabet {s, a, t}) contains all strings which have no more than one distinct character. One regular expression that would accept that language would be s*|a*|t*. The language is clearly infinite, and any strings which contain two different symbols, like at or sat are not in the language. That language satisfies all the requirements. There are many other languages that satisfy all the requirements.
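As a quick illustration of requirement (4) for this example language, the sketch below sorts a few candidate words into accepted and rejected. It assumes Python's re module, and the candidate list is chosen arbitrarily.

```python
import re

# Whole-string match against s*|a*|t*: at most one distinct letter.
one_letter = re.compile(r"s*|a*|t*")

candidates = ["", "ss", "aaa", "at", "sat"]
accepted = [w for w in candidates if one_letter.fullmatch(w)]
rejected = [w for w in candidates if not one_letter.fullmatch(w)]

print("accepted:", accepted)
print("rejected:", rejected)
```

Using fullmatch matters here: an unanchored search would succeed on every input, because each alternative can match the empty string.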
I will leave the drawing of the DFA to you. If you have any questions about it, feel free to comment on my answer.

Expressing regular expression in words

I am trying to express the following regular expression in words. Please note this is not so much a programming regex, as opposed to some CS work I am doing. The regular expression is:
(ab + b)* + (ba + b)*
The spaces are meaningless and the '+' functions as an 'or'. My answer right now is:
"This regular expression represents every string that does not contain the substring 'aa', and whose last letter is 'b' if the first letter is 'a'"
Is this correct? If so, that last condition I put makes me a bit wary. Is there a way to perhaps simplify the summation?
Thanks guys.
Hm, not sure I agree with @ChristianTernus's reduction.
Assuming these are implicitly anchored, the original, (ab|b)*|(ba|b)*, in English, is:
a string entirely composed of ab and b, or
a string entirely composed of ba and b.
So, for example, abb would match as the first kind but not the second, and bba would match the second kind but not the first.
Meanwhile, note how neither abb nor bba would match the reduction, (ab)*|(ba)*|(b)*, which actually means,
a string entirely composed of ab, or
a string entirely composed of ba, or
a string entirely composed of b.
Actually, the way you Englishified it, I think was already the best! Though, I'd style it like this:
This regular expression represents a string composed entirely of 'a's and 'b's, with no consecutive 'a's, and whose last character is 'b' if the first character is 'a'.
Nearly identical to what you already wrote.
As @ChristianTernus (and @slebetman) point out, the above fails to take into account that the original expression accepts a null string (or even a string without 'a's, which isn't clear from my Englishification), so in fact I believe OP's Englishification was indeed the strongest.
(ab + b)* + (ba + b)*
Translated into common (PCRE) regex, that's
(ab|b)*|(ba|b)*
In other words: a string composed of either zero or more instances of either 'ab' or 'b', or zero or more instances of either 'ba' or 'b'.
@acheong87's answer is also correct. I like this because it matches more closely the original structure of the regular expression -- it wouldn't be hard to turn this back into the regex from whence it came.
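The English description from the question can be tested against the expression by brute force. The sketch below assumes Python's re module with whole-string matching; the predicate name described and the length bound of 10 are choices made for the example.

```python
import re
from itertools import product

original = re.compile(r"(ab|b)*|(ba|b)*")

def described(word):
    """No 'aa' substring, and the last letter is b if the first letter is a."""
    no_double_a = "aa" not in word
    ends_ok = not word.startswith("a") or word.endswith("b")
    return no_double_a and ends_ok

words = ["".join(t) for n in range(11) for t in product("ab", repeat=n)]
mismatches = [w for w in words if described(w) != bool(original.fullmatch(w))]
print(mismatches)
```

An empty result supports the wording for strings over {a, b}, including the empty string, up to the tested length.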

Does this regular expression generate a regular language?

I was told that the language generated by the regular expression:
(a*b*)*
is regular.
However, my thinking goes against this, as follows. Can anyone please provide an explanation whether I'm thinking right or wrong?
My Thoughts
(a*b*) refers to a single sequence of any amount of a, followed by any amount of b (can be empty). And this single sequence (which can't be changed) can be repeated 0 or more times. For example:
a* = a
b* = bbbb
-> (a*b*) = abbbb
-> (a*b*)* = abbbbabbbbabbbb, ...
On the other hand, since aba is not an exact repetition of the sequence ab, it is not included in the language.
aaabaaabaaab => is included in the language
aba => is not included in the language
Thus, the language consists of sequences that are an arbitrary-time repetition of a subsequence that is any amount of a followed by any amount of b. Therefore, the language is not regular since it requires a stack.
It's a zero or more times, followed by b zero or more times, repeated zero or more times.
""
"a"
"b"
"ab"
"ba"
"aab"
"bbabb"
"aba"
all pass.
* is not +.
aba is in that language; it's just an overly-complicated way to say "the set of all strings consisting of as and bs".
EDIT: The repeating group doesn't mean that the contents of the group must be repeated exactly; that would require a backreference. ((a*b*)?\1*)
Rather, it means that the group itself should be repeated, matching any string that it can match.
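The difference between repeating a group and repeating its exact match can be seen in a short experiment. This sketch assumes Python's re module, and it uses a slightly simplified backreference pattern, (a*b*)\1*, rather than the one quoted above.

```python
import re

repeated_group = re.compile(r"(a*b*)*")   # each iteration may match differently
exact_copy = re.compile(r"(a*b*)\1*")     # later iterations must repeat the capture

samples = ["aba", "abab", "abbbbabbbb"]
results = {
    s: (bool(repeated_group.fullmatch(s)), bool(exact_copy.fullmatch(s)))
    for s in samples
}

for s, (group_ok, copy_ok) in results.items():
    print(s, group_ok, copy_ok)
```

Only the backreference version rejects aba, which is the behaviour the question had attributed to (a*b*)*.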
Technically, /(a*b*)*/ used without anchors will match any input at all, because it can always succeed with an empty match.
Because all the operators are *'s it means zero or more. So since zero is an option, it will pretty much match anything.
It's wrong, you don't need a stack. Your DFA just thinks "can I add just another a (or not)?" or "can I add just another b (or not)?" in an endless loop until the word is consumed.
It is a regular expression, yes.
The * means something like "can repeat 0 or more times". The + is similar; the only difference is that it needs at least one repetition (that is, 1 or more times).
This regular expression says something like:
Repeat "below group" zero or more times;
Repeat a zero or more times;
Repeat b zero or more times;
It works fine with all of your examples.
Edit/Note: aba is matched too.
I hope to help :p
Basically, it'll match any string that's empty or made up of a bunch of a and b. It reads:
(('a' zero or more times)('b' zero or more times)) zero or more times
That's why it matches aba:
(('a' one time)('b' one time)) one time ((a one time)(b zero time)) one time
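The same kind of decomposition can be computed for any string over {a, b}. The helper below is a hypothetical sketch, assuming Python's re module: it splits a word into maximal non-empty chunks, each of which is one possible iteration of the a*b* group.

```python
import re

def blocks(word):
    """Split a word over {a, b} into maximal non-empty chunks matching a*b*."""
    # a+b* covers chunks that start with a; b+ covers a leading run of b's.
    return re.findall(r"a+b*|b+", word)

for sample in ["aba", "bbabb", "aab", ""]:
    print(repr(sample), blocks(sample))
```

Every chunk is matched by a*b*, and concatenating the chunks gives back the word, which is why (a*b*)* accepts all of these strings.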
You're wrong. :)
0 is also an amount, so aba is in this language. It wouldn't be if the regex was (a+b+)+, because + would mean '1 or more' where * means '0 or more'.
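A short comparison of the two quantifiers on the disputed string, as a sketch assuming Python's re module with whole-string matching:

```python
import re

star_version = re.compile(r"(a*b*)*")   # '*' allows zero occurrences
plus_version = re.compile(r"(a+b+)+")   # '+' demands at least one

samples = ["", "ab", "abab", "aba"]
results = {
    s: (bool(star_version.fullmatch(s)), bool(plus_version.fullmatch(s)))
    for s in samples
}

for s, (star_ok, plus_ok) in results.items():
    print(repr(s), star_ok, plus_ok)
```

The trailing a in aba has no b after it, so the + version rejects the string while the * version accepts it.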

Proving that a language is regular by giving a regular expression

I am stumped by this practice problem (not for marks):
{w is an element of {a,b}* : the number of a's is even and the number of b's is even }
I can't seem to figure this one out.
In this case 0 is considered even.
A few acceptable strings: {}, {aa}, {bb}, {aabb}, {abab}, {bbaa}, {babaabba}, and so on
I've done similar examples where the a's must be a prefix, where the answer would be:
(aa)*(bb)*
but in this case they can be in any order.
Kleene stars (*), unions (U), intersects (&), and concatenation may be used.
Edit: Also have trouble with this one
{w is an element of {0,1}* : w = 1^r 0 1^s 0 for some r,s >= 1}
This is kind of ugly, but it should work:
ε U ( (aa) U (bb) U ((ab) U (ba)) ((aa) U (bb))* ((ab) U (ba)) )*
For the second one:
11*011*0
Generally I would use a+ instead of aa* here.
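Candidate expressions for both languages can be checked by brute force. The sketch below assumes Python's re module, with U written as |. The even/even pattern used here is one commonly given form; the inner (aa|bb)* between the two mixed pairs is what lets it cover strings such as abbbba.

```python
import re
from itertools import product

# Even number of a's and even number of b's.
even_even = re.compile(r"(aa|bb|(ab|ba)(aa|bb)*(ab|ba))*")

def is_even_even(word):
    """Reference definition: both letter counts are even (0 counts as even)."""
    return word.count("a") % 2 == 0 and word.count("b") % 2 == 0

words = ["".join(t) for n in range(11) for t in product("ab", repeat=n)]
mismatches = [w for w in words if is_even_even(w) != bool(even_even.fullmatch(w))]
print(mismatches)

# Second language: 1^r 0 1^s 0 with r, s >= 1.
second = re.compile(r"11*011*0")
print(bool(second.fullmatch("1010")), bool(second.fullmatch("010")))
```

An empty mismatch list only shows agreement up to the tested length; it is not a proof of correctness.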
Edit: Undeleted re: the comments in NullUserException's answer.
1) I personally think this one is easier to conceptualize if you first construct a DFA that can accept the strings. I haven't written it down, but off the top of my head I think you can do this with 4 states and one accept state. From there you can create an equivalent regex by removing states one at a time using an algorithm such as this one. This is possible because DFAs and regexes are provably equivalent.
2) Consider the fact that the Kleene star only applies to the nearest regular expression. Hence, if you have two individual ungrouped atoms (an atom itself is a regex!), it only applies to the second one (as in, ab* would match a single a and then any number - including 0 - b's). You can use this to your advantage in a case where you want something to exist, but you're not sure of how many there are.
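The four-state DFA mentioned in point 1 can be sketched as a small simulation. This is one possible encoding, not necessarily the one the answer had in mind: each state is a pair of parities, and the function name accepts_even_even is made up for the example.

```python
def accepts_even_even(word):
    """Simulate a 4-state DFA over {a, b}.

    State (p, q): p is the parity of a's read so far, q the parity of b's.
    The start state (0, 0) is also the only accepting state.
    """
    state = (0, 0)
    for ch in word:
        a_parity, b_parity = state
        if ch == "a":
            state = (1 - a_parity, b_parity)
        elif ch == "b":
            state = (a_parity, 1 - b_parity)
        else:
            return False  # symbol outside the alphabet
    return state == (0, 0)

for sample in ["", "aa", "abab", "babaabba", "aba", "ab"]:
    print(repr(sample), accepts_even_even(sample))
```

Eliminating states from this machine one at a time is the route to an equivalent regular expression that point 1 describes.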