Regular expression equivalences - regex

Is the following regular expression equivalence true? Why or why not?
(ab)* u (aba)* = (ab u aba)*
*=Kleene star
u=Union (Set Theory)

No, they aren't equivalent. The language on the RHS contains "abaab", but the language on the LHS doesn't. Is there any relationship among these? Yes; but I won't just give you the answer. Hint: are there any strings in the LHS that aren't in the RHS?
EDIT:
Just to expound a little for the interested reader. Languages are sets of strings. Therefore, relationships among sets are also relationships among languages. The most common set relationships are equality, subset, and superset. Given two sets A and B over universe sets U1 and U2, set A is a subset of B under universe set U1 u U2 (u stands for union) iff every element of A is also an element of B. Similarly, given two sets A and B over universe sets U1 and U2, set A is a superset of B under universe set U1 u U2 (u stands for union) iff every element of B is also an element of A (equivalently, iff B is a subset of A). Sets A and B are equal iff A is a subset of B and B is a subset of A (equivalently, iff A is a superset of B and B is a superset of A). Note that two sets A and B need not be in any of these three relationships; this happens when A contains an element not in B and, at the same time, B contains an element not in A.
To find which of the four possible relationships exist between sets A and B - equality, subset, superset, none - you usually check whether A is a subset of B and whether B is a subset of A. Call the first check B1 and the second check B2, where B1 and B2 are Boolean variables and are true iff the checks pass (i.e., A is a subset of B in the case of B1, and B is a subset of A in the case of B2). Then (B1 && B2) corresponds to equality, (B1 && !B2) corresponds to subset, (!B1 && B2) corresponds to superset and (!B1 && !B2) corresponds to no relationship.
In the example above, I demonstrate that the two languages are not equal by demonstrating that the RHS contains an element not in the LHS. Note that this also rules out the "A is a superset of B" relationship. Two remain: B is a superset of A, or there is no relationship between them. To decide this, you must determine whether A is a subset of B; whether the elements in the LHS language are all in the RHS language. If so, the LHS is a subset; otherwise, there is no relationship.
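The B1/B2 checks can be approximated by brute force. The Python sketch below enumerates every string over {a, b} up to a bounded length, so it can refute a relationship but cannot prove one for all lengths. The helper names `language` and `relationship`, the length bound, and the translation of `u` into Python's `|` are assumptions made for illustration.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=8):
    """All strings over `alphabet` of length <= max_len that `pattern` matches in full."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }

def relationship(a, b):
    """Classify sets a and b using the two Boolean checks B1 and B2."""
    b1 = a <= b  # B1: is A a subset of B?
    b2 = b <= a  # B2: is B a subset of A?
    if b1 and b2:
        return "equal"
    if b1:
        return "subset"
    if b2:
        return "superset"
    return "none"

lhs = language(r"(?:ab)*|(?:aba)*")  # (ab)* u (aba)*
rhs = language(r"(?:ab|aba)*")       # (ab u aba)*

print(relationship(lhs, rhs))
# Shortest strings in the RHS that the LHS does not contain.
print(sorted(rhs - lhs, key=lambda s: (len(s), s))[:2])
```

A bounded result of "none" or a concrete witness string is conclusive; a bounded "subset" or "equal" is only evidence and still needs a proof.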
To show that there is an element in one set and not another, the easiest and most convincing approach is a proof by counterexample: name a string in one but not the other. This is the approach I adopted. You can also make an argument that the language must contain such an element without explicitly naming it; this kind of proof can be significantly harder to get right.
To show that every element of set A is also in set B, you will need a more generic proof technique. Proof by induction and proof by contradiction are common examples. To do an inductive proof, you assign - explicitly or implicitly - a natural number to each string in the language and demonstrate with a simple argument that your claim (in this case, that the element is also an element of the other set) is true for the base case. Then, you assume it is true for the first n elements in your set (according to the numbering you give) and show that this implies it is also true for the next element. To do a proof by contradiction, you assume the opposite of what you want to prove, and derive a contradiction. If you succeed and your only assumption was that your claim is false, then your claim must have been true all along.

No, they are not equivalent.
Similarities:
both accept empty string
both accept "ab" and "aba" (the shortest non-empty strings of each)
Differences:
In regex1, a string is a repetition of only ab or only aba. In regex2, ab and aba may be mixed within the same string, so "ababa" (ab followed by aba) is accepted by regex2 but not by regex1.
Since we found a difference, we can say that they are not equivalent.
BUT
Since a regular expression is a representation (not a description) of a regular language, you cannot tell whether regex1 is equivalent to regex2 just by looking at the expressions. To prove it (a mathematical proof), you can convert those regular expressions into NFAs (nondeterministic finite automata) or DFAs (deterministic finite automata) and compare the differences.

Related

Automata - Regular Expression (Union Case)

Automata 1) Recognizes strings with at least 2 a
Regular Expression = b*ab*a(a+b)*
Automata 2) Recognizes strings with at least 2 b
Regular Expression = a*ba*b(a+b)*
Is the regular expression obtained from A3 = A1 U A2 equivalent to R3 = R1 + R2, or not?
R3 = b*ab*a(a+b)* + a*ba*b(a+b)*
There is neither "one" automaton nor "one" regular expression for any language; generally there are many reasonable ones and many more (maybe infinitely many) unreasonable ones. In this sense, your question is not entirely well-posed: the regular expression corresponding to the union of two DFAs may or may not look like regular expressions for the original DFAs, +'ed together.
So, if you mean, can they look the same, the answer is likely yes. If you mean, must they look the same, answer is likely no. If you instead want to fix the algorithms for constructing the union machine and getting the regular expression, maybe we could show that a fixed method of doing it always gives the same answer.
In your specific case, applying the Cartesian Product Machine construction to get a DFA for the union of the original DFAs and then applying the construction from the proof of equivalence between DFAs and REs, we can see that the structure of the RE obtained by +'ing the original REs can't be achieved starting from a DFA; you'd have needed an NFA to get a + between the LHS and RHS, but DFAs can only + among individual symbols, not subexpressions. Of course, it might be possible the RE can be algebraically manipulated to derive the target RE, but that isn't exactly the same.
All of the above hold for the question of equality of REs. However, you asked about equivalence. Almost always, we say two REs are equivalent if they generate the same language. If this is what you meant, then yes, +ing the two REs will give an RE equivalent to the one obtained by constructing a union machine and deriving an RE from that. The REs will not look the same but will generate the same language, just as (ab + e)(abab)* and (ab)* generate the same language despite looking a bit different.
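As a sanity check on the equivalence claims, here is a small Python sketch that compares the expressions on every string over {a, b} up to a bounded length. The academic + is written as Python's `|`, and the helper names `strings` and `in_union` and the length bound are assumptions for illustration; agreement on a bounded set is evidence, not a proof.

```python
import re
from itertools import product

def strings(alphabet="ab", max_len=8):
    """Yield every string over `alphabet` of length <= max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# R3 = b*ab*a(a+b)* + a*ba*b(a+b)*
R3 = re.compile(r"b*ab*a[ab]*|a*ba*b[ab]*")

def in_union(s):
    """Direct description of the union language: at least 2 a's or at least 2 b's."""
    return s.count("a") >= 2 or s.count("b") >= 2

r3_agrees = all(bool(R3.fullmatch(s)) == in_union(s) for s in strings())

# (ab + e)(abab)* versus (ab)*
X = re.compile(r"(?:ab)?(?:abab)*")
Y = re.compile(r"(?:ab)*")
xy_agree = all(bool(X.fullmatch(s)) == bool(Y.fullmatch(s)) for s in strings())

print(r3_agrees, xy_agree)
```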
Regular expressions are not like finite state parsers and it's usually a mistake to try to incorporate them into complex parsing scenarios.
But also, they are marvelous tools for specific problems. After reading your descriptive requirements, there is a simple regular expression that accomplishes it, but in a way you might not expect. Your requirements:
strings with at least 2 a
strings with at least 2 b
The union of the two, or strings with at least two a's or two b's
([ab]).*?\1
This expression opens a capture group to capture either a or b. Then it allows zero or more 'any characters' followed by whatever was captured in the capture group (\1).
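A short Python sketch of how this expression behaves. It assumes the pattern is used as an unanchored search (`re.search`); a full match would additionally force the string to begin and end with the same letter. The helper name `has_repeat` is hypothetical.

```python
import re

# Capture one a or b, skip as little as possible, then require the same letter again.
pattern = re.compile(r"([ab]).*?\1")

def has_repeat(s):
    """True if some letter a or b occurs at least twice anywhere in s."""
    return pattern.search(s) is not None

for s in ["", "a", "ab", "ba", "aa", "abb", "bab", "abab"]:
    print(repr(s), has_repeat(s))
```

Note that `.` matches any character, so on a larger alphabet the two occurrences may be separated by other symbols.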

What is the difference between (a+b)* and (a*b*)*?

Assuming that Σ = {a, b}, I want to find out the regular expression (RE) Σ* (that being the set of all possible strings over the alphabet Σ).
I came up with below two possibilities:
(a+b)*
(a*b*)*
However, I can't decide by myself which RE is correct, or if both are bad. So, please tell me the correct answer.
The + operator is typically used to indicate union (|, "or") in academic regular expressions, not "one or more" as it typically means in non-academic settings (such as most regex implementations).
So, a+b means [ab] or a|b, thus (a+b)* means any string of length 0 or more, containing any number of as and bs in any order.
Likewise, (a*b*)* also means any string of length 0 or more, containing any number of as and bs in any order.
The two expressions are different ways of expressing the same language.
In normal regular expression grammar, (a+b)* means zero or more of any sequence that starts with a, then has zero or more further a's, then a b. This rules out things like baa (it doesn't start with a), abba, and a (there must be exactly one b after each group of a's), so it is not correct.
(a*b*)* means zero or more of any sequence that contain zero or more a followed by zero or more b. This is more correct since it allows for either starting character, any order and quantity of characters, and so on. It also allows the empty string which I'm pretty certain should be allowed by Σ* (but I'll leave that up to you).
However, it may be better to opt for the much simpler [ab]* (or [ab]+ in the unlikely event you consider an empty string invalid). This is basically zero (one for the + variant) or more of any character drawn from the class [ab].
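To make the practical-regex reading concrete, the following Python sketch runs both patterns as full matches against a few sample strings. The sample list is an arbitrary choice for illustration.

```python
import re

practical = re.compile(r"(a+b)*")   # here + means "one or more", not union
char_class = re.compile(r"[ab]*")   # any string of a's and b's

# Each result is a pair: (matched by (a+b)*, matched by [ab]*).
results = {
    s: (practical.fullmatch(s) is not None, char_class.fullmatch(s) is not None)
    for s in ["", "aab", "abaab", "baa", "abba", "a"]
}

for s, (p, c) in results.items():
    print(repr(s), p, c)
```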
However, it's possible, since you're using Σ, that you may be discussing formal language theory (where Σ is common) rather than regex grammar (where it tends not to be).
If that is the case then you should understand that there are variants of the formal language where the a | b expression (effectively [ab] in regex grammar) can instead be rendered as one of a ∪ b, a ∨ b or a + b, with each of those operator symbols representing "logical or".
That would mean that (a+b)* is actually correct (as it is equivalent to the regex grammar I gave above) for what you need since it basically means any character from the set {a, b}, repeated zero or more times.
Additionally, that's also covered by your (a*b*)* option but it's almost always better to choose the simplest one that does the job :-)
And just something else to keep in mind for the formal language case. In English (for example), "a" is a word but you'd struggle to find anyone supporting the possibility that "" is also a word. Try looking it up in a dictionary :-)
In other words, a regular expression that allows an empty sequence of the language characters (such as (a+b)*) is only suitable if the empty string is meant to be included. By the standard definition, Σ* does contain the empty string, so (a+b)* is the one to use; (a+b)(a+b)* describes only the non-empty strings, usually written Σ+.
According to the algebraic properties of regular expressions,
(a*b*)* = (a+b)*
Therefore (a+b)* = (a*b*)*
Extra information:
L((a+b)*)
= (L(a+b))*
= (L(a) U L(b))*
= ({a} U {b})*
= {a,b}*
= {ε, a, b, aa, bb, ab, abab, aba, bbba,...}
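The same derivation can be replayed on finite sets. This Python sketch models languages as sets of strings truncated at a maximum length and builds union, concatenation and star directly from their definitions. The names `concat`, `star` and `MAX_LEN` are assumptions for illustration, and equality is only checked up to that length.

```python
MAX_LEN = 5  # truncate every language at this string length

def concat(x, y):
    """Concatenation of two languages, keeping only strings of length <= MAX_LEN."""
    return {u + v for u in x for v in y if len(u + v) <= MAX_LEN}

def star(x):
    """Kleene star of a language, truncated at MAX_LEN."""
    result = {""}      # the empty string is always in L*
    frontier = {""}
    while frontier:
        # Append one more factor from x and keep only strings not seen before.
        frontier = concat(frontier, x) - result
        result |= frontier
    return result

a, b = {"a"}, {"b"}
left = star(a | b)                       # (a+b)*
right = star(concat(star(a), star(b)))   # (a*b*)*

print(left == right, len(left))
```

Since there are 2^n strings of length n over {a, b}, both sets should contain 1 + 2 + 4 + 8 + 16 + 32 = 63 strings for MAX_LEN = 5.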

Given Two Regex, Determine if One is a Complement of Other

I'd like to know how you can tell if some regular expression is the complement of another regular expression. Let's say I have 2 regular expressions r_1 and r_2. I can certainly create a DFA out of each of them and then check to make sure that L(r_1) != L(r_2). But that doesn't necessarily mean that r_1 is the complement of r_2 and vice versa. Also, it seems that many different regular expressions could describe the same complement of a single regular expression.
So I'm wondering how, given two regular expressions, I can determine if one is the complement of another. This is also new to me, so perhaps I'm missing something that should be apparent.
Edit: I should point out that I am not simply trying to find the complement of a regular expression. I am given two regular expressions, and I am to determine if they are the complement of each other.
Here is one approach that is conceptually simple, if not terribly efficient (not that there is necessarily a more efficient solution...):
Construct NFAs M and N for regular expressions r and s, respectively. You can do this using the construction introduced in the proof that finite automata and regular expressions describe the same languages.
Determinize M and N to get M' and N'. We might as well go ahead and minimize them at this point... giving M'' and N''.
Construct a machine C using the Cartesian product machine construction on machines M'' and N''. Acceptance will be determined by the symmetric difference, or XOR, criterion: accepting states in the product machine correspond to pairs of states (m, n) where exactly one of the two states is accepting in its automaton.
Minimize C and call the result C'
If L(r) = L(s)', then the initial state of C' will be accepting and C' will have all transitions originating in the initial state also terminating in the initial state. If this is the case, r and s describe complementary languages; otherwise, they do not.
Why should this work? The symmetric difference of two sets is the set of everything in exactly one (not both, not neither). If L(s) and L(r) are complementary, then it is not difficult to see that the symmetric difference includes all strings (by definition, the complement of a set contains everything not in the set). Suppose now there were non-complementary sets whose symmetric difference were the universe of all strings. The sets are not complementary, so either (1) their intersection is non-empty or (2) their union is not the universe of all strings. In case (1), the symmetric difference will not include the shared element; in case (2), the symmetric difference will not include the missing strings. So, only complementary sets have the symmetric difference equal to the universe of all strings; and a minimal DFA for the set of all strings will always have an accepting initial state with self-loops.
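A minimal Python sketch of the product-machine step, under some assumptions: both inputs are already complete DFAs over the same alphabet (the regex-to-NFA and determinization steps are skipped), and instead of minimizing C it checks that every reachable product state is accepting under the XOR criterion, which amounts to the same condition. The dictionary layout and the name `are_complements` are hypothetical.

```python
from collections import deque

def are_complements(dfa_a, dfa_b, alphabet):
    """True iff every reachable state pair has exactly one accepting component."""
    start = (dfa_a["start"], dfa_b["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        # XOR criterion: fail if both accept or both reject.
        if (p in dfa_a["accept"]) == (q in dfa_b["accept"]):
            return False
        for sym in alphabet:
            nxt = (dfa_a["delta"][p][sym], dfa_b["delta"][q][sym])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Example DFAs over {a, b}: state 0 = even number of a's seen, state 1 = odd.
parity = {0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}
even_a = {"start": 0, "accept": {0}, "delta": parity}
odd_a = {"start": 0, "accept": {1}, "delta": parity}
# State 1 = at least one a has been seen.
has_a = {"start": 0, "accept": {1},
         "delta": {0: {"a": 1, "b": 0}, 1: {"a": 1, "b": 1}}}

print(are_complements(even_a, odd_a, "ab"))
print(are_complements(even_a, has_a, "ab"))
```

Here "even number of a's" and "odd number of a's" are complements, while "even number of a's" and "at least one a" are not, since both accept "aa".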
For complement: L(r_1) == !L(r_2)

Monad: Why does Identity matter, what's going to happen if there's no such special member in a set?

I'm trying to learn the concept of a monad, and I'm watching this excellent video in which Brian Beckman tries to explain what a monad is.
When he talks about a monoid, he says it's a collection of types with a rule of composition, and this composition has to obey 2 rules:
associative: x # (y # z ) = (x # y) # z
a special member in the collection: x # id = x and id # x = x
I'm using the # symbol to represent composition; id means the special member.
The second point is what I'm trying to understand. Why does this matter? What if there's no such special member?
When I learn a new concept, I always try to relate the abstract concept to something concrete, so that I can fully understand it and learn it by heart.
So what I'm trying to relate monads and monoids to is Lego. All the building blocks in a Lego set form a collection, and the composition rule is combining them into new shapes of building blocks. It's obvious that this composition obeys the first rule: associativity. But there's no special building block that can be composed with another building block to give the same block back, so it fails to obey the second rule.
But Lego is still highly composable. What is missing when Lego fails to obey the second rule? What is the consequence?
Or, to put it another way: compared to a monoid that obeys all those rules, what feature does the monoid have that Lego doesn't?
A monoid without an identity element is called a semigroup and it's still a fine and useful construct. It just gives us something different. Consider, for example, a fold on a list. We can do this by mapping every element of a list to a monoid and then composing them all. But if you only have a semigroup, you can't fold on a possibly empty list.
Consider another example -- the integers greater than zero, versus the integers greater than or equal to zero. In the latter case we have a monoid, since zero is literally our zero element. So I can solve for example, the equation "5 + x = 5". In the former case, with a semigroup, I can't solve that equation. Or I can say "you have no apples, I then give you five apples, how many do you have?" In a world without zero, we have to assume everyone starts with some apples to begin with! So, for the same reasons having a zero lying around is important with numbers, it is handy to have a "generalized zero" hanging around with more abstract algebraic structures.
(Note this doesn't mean one or the other is "better" -- just that they are different, and the extra structure, when available, can come in handy. Also note that there is a universal way to turn a semigroup into a monoid by adding a zero element, so since all semigroup results lift into the 'completed' results on monoids, it tends to be more convenient, typically, to just treat things in terms of the latter.)
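The fold example and the "add a zero element" construction can be sketched in Python. The function names `fold_monoid`, `fold_semigroup` and `lift` are hypothetical, and using None as the adjoined identity is an assumption that only works when None is not already a value of the semigroup.

```python
from functools import reduce
from operator import add

def fold_monoid(op, identity, items):
    """Fold with an identity element: well defined even for an empty list."""
    return reduce(op, items, identity)

def fold_semigroup(op, items):
    """Fold without an identity: reduce raises TypeError on an empty list."""
    return reduce(op, items)

def lift(op):
    """Turn a semigroup operation into a monoid operation by adjoining None as id."""
    def lifted(x, y):
        if x is None:
            return y
        if y is None:
            return x
        return op(x, y)
    return lifted

print(fold_monoid(add, 0, []))                    # integers >= 0 under +, id is 0
print(fold_monoid(lift(max), None, [3, 1, 4]))    # max has no identity on the integers
print(fold_monoid(lift(max), None, []))           # the adjoined None is returned
```

Calling `fold_semigroup(max, [])` fails, which is the "can't fold on a possibly empty list" problem in concrete form.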
The empty Lego could be considered as id, but then you would have to accept that empty space is Lego. But yes, if you don't want id, then as @sclv wrote, it would be a semigroup.

Proving that a language is regular by giving a regular expression

I am stumped by this practice problem (not for marks):
{w is an element of {a,b}* : the number of a's is even and the number of b's is even }
I can't seem to figure this one out.
In this case 0 is considered even.
A few acceptable strings: {}, {aa}, {bb}, {aabb}, {abab}, {bbaa}, {babaabba}, and so on
I've done similar examples where the a's must be a prefix, where the answer would be:
(aa)*(bb)*
but in this case they can be in any order.
Kleene stars (*), unions (U), intersects (&), and concatenation may be used.
Edit: Also have trouble with this one
{w is an element of {0,1}* : w = 1^r 0 1^s 0 for some r,s >= 1}
This is kind of ugly, but it should work:
ε U ( (aa) U (bb) U ((ab) U (ba)) ((aa) U (bb))* ((ab) U (ba)) )*
For the second one:
11*011*0
Generally I would use a+ instead of aa* here.
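A candidate expression for the even-even language can be sanity-checked by brute force. The Python sketch below tests one standard form, (aa U bb U (ab U ba)(aa U bb)*(ab U ba))*, against the direct definition on all strings up to a bounded length. U is written as Python's `|`, the helper names are hypothetical, and passing a bounded check is evidence rather than a proof.

```python
import re
from itertools import product

def is_even_even(s):
    """Direct definition: even number of a's and even number of b's."""
    return s.count("a") % 2 == 0 and s.count("b") % 2 == 0

def mismatches(pattern, max_len=8):
    """Strings up to max_len on which the pattern and the definition disagree."""
    compiled = re.compile(pattern)
    return [
        s
        for n in range(max_len + 1)
        for s in map("".join, product("ab", repeat=n))
        if (compiled.fullmatch(s) is not None) != is_even_even(s)
    ]

candidate = r"(?:aa|bb|(?:ab|ba)(?:aa|bb)*(?:ab|ba))*"
# Same expression without the inner (aa U bb)* between the two mixed pairs.
pairs_only = r"(?:aa|bb|(?:ab|ba)(?:ab|ba))*"

print(mismatches(candidate))
print("abbbba" in mismatches(pairs_only))
```

Dropping the inner (aa U bb)* loses strings such as abbbba, where a mixed pair is only closed by another mixed pair several symbols later.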
Edit: Undeleted re: the comments in NullUserException's answer.
1) I personally think this one is easier to conceptualize if you first construct a DFA that can accept the strings. I haven't written it down, but off the top of my head I think you can do this with 4 states and one accept state. From there you can create an equivalent regex by removing states one at a time using an algorithm such as this one. This is possible because DFAs and regexes are provably equivalent.
2) Consider the fact that the Kleene star only applies to the nearest regular expression. Hence, if you have two individual ungrouped atoms (an atom itself is a regex!), it only applies to the second one (as in, ab* would match a single a and then any number - including 0 - b's). You can use this to your advantage in a case where you want something to exist, but you're not sure of how many there are.
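A quick Python illustration of that binding rule; the helper name `accepts` is hypothetical and full matching is assumed.

```python
import re

def accepts(pattern, s):
    """True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None

# The star applies only to b: one a, then any number of b's.
print(accepts("ab*", "a"), accepts("ab*", "abbb"), accepts("ab*", "abab"))
# Grouping makes the star apply to the whole of ab.
print(accepts("(ab)*", "abab"), accepts("(ab)*", "abbb"))
```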