Infinite state automata - state

I was asked to prove that for an arbitrary language L ⊆ {a,b}∗, there exists an infinite state automaton M such that L = L(M). I have proved that the ISA can accept an arbitrary string w ∈ {a,b}∗. I am confused about how to define an arbitrary language, or how to define the final states of the ISA. Here's what I have done.

Related

why a^m b^n where m,n > 0 is a regular language but a^n b^n where n > 0 is non regular language

a^m b^n where m,n >= 0 is a regular language but why a^n b^n where n >= 0 is a non-regular language?
In both languages, we are taking an infinite number of a's and b's but why we could build a Finite Automata for the first case and not for the second case?
Let us call the above languages X and Y:
X = a^m b^n where m,n>0
Y = a^n b^n where n>0
Language X is a regular language but Language Y is not a regular language because we cannot construct a Finite automata for language Y.
A language that does not satisfy the pumping lemma is not regular; a language that does satisfy the pumping lemma is not necessarily regular.
In language X the numbers of a's and b's are independent, so we do not need to remember how many a's occurred: after accepting all the a's we can accept the b's and move to the final state. In language Y the numbers of a's and b's must be the same, so we need to remember the number of a's in order to check the b's, which is not possible with a finite automaton.
So to recognize these kinds of languages, a pushdown automaton is required.
Pushdown automaton = finite automaton + some amount of memory (a stack).
Languages like Y are called context-free languages, and they can be recognized by a pushdown automaton (PDA). A finite automaton has no memory in which to store the count of a's and b's, whereas a pushdown automaton has a stack, so it can recognize languages like Y.
For language Y we push all the a's onto the stack and pop one a for each b we encounter (keeping in mind that all the a's must come before the b's). If an a appears after a b, the machine moves to a dead state.
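The push/pop idea can be sketched in a few lines of Python. This is only a simulation of the stack discipline for Y = a^n b^n with n > 0, not a formal PDA definition; the function name accepts_anbn is made up for this illustration.

```python
def accepts_anbn(s: str) -> bool:
    """Simulate a stack-based recognizer for { a^n b^n | n > 0 }."""
    stack = []
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False  # an 'a' after a 'b': dead state
            stack.append("a")  # push every 'a'
        elif ch == "b":
            if not stack:
                return False  # more b's than a's
            stack.pop()  # pop one 'a' per 'b'
            seen_b = True
        else:
            return False  # symbol outside the alphabet {a, b}
    # Accept only if at least one 'b' was read and every 'a' was matched.
    return seen_b and not stack


print(accepts_anbn("aabb"))  # True
print(accepts_anbn("aab"))   # False
```

The list used as a stack can grow without bound, and that is exactly the memory a finite automaton lacks.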
In the second language the number of a's and the number of b's have to be the same.
A finite automaton would need to count the a's in order to check that the number of b's is the same. This would require an infinite number of states.
Maybe check out the pumping lemma on wikipedia: https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages
Using the pumping lemma, we have to prove that the above language is not a regular language.
L = { a^n b^n }
n is the number of states (the pumping length)
We can approach this problem in two ways:
1. by simply considering an example
2. by substituting values of i in u v^i w
First we take Z = a^n b^n
and split it as Z = uvw, so that the pumped strings are u v^i w,
where v is a non-empty substring of Z.
For example n=1
•Z=ab
n=2
•Z=aabb
n=3
•Z=aaabbb
The choice of v falls under 3 cases.
Case 1:
v consists only of a's
Assume v = a^x,
where x > 0 is the number of a's in v
•Z is written as u v^i w
For i = 0:
u v^0 w = a^(n-x) b^n, which does not belong to L
Case 2:
v consists only of b's
Assume v = b^x
For i = 0:
u v^0 w = a^n b^(n-x), which does not belong to L
Case 3:
v consists of a's followed by b's
Assume v = a^x b^x
For i = 0:
u v^0 w = a^(n-x) b^(n-x) ∈ L
For i = 1:
u v^1 w = a^(n-x) a^x b^x b^(n-x)
= a^n b^n ∈ L
For i = 2:
u v^2 w = a^(n-x) (a^x b^x)^2 b^(n-x)
= a^(n-x) a^x b^x a^x b^x b^(n-x), which does not belong to L
Every string in L consists of a's followed by b's.
Here we get an additional ab in the middle, so a b is followed by an a and the string is no longer of the form a^k b^k.
So in each of the three cases some pumped string falls outside L, and the given language is not a regular language.
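The case analysis can be sanity-checked by brute force. The sketch below (helper names in_L and pumpable_splits are invented here) tries every split Z = uvw with non-empty v, and looks for one whose pumped strings u v^i w stay in L for i = 0, 1, 2. It only checks small n, so it illustrates the argument rather than replacing it.

```python
def in_L(s: str) -> bool:
    """Membership test for L = { a^k b^k | k > 0 }."""
    k = len(s) // 2
    return k > 0 and s == "a" * k + "b" * k


def pumpable_splits(z: str):
    """Return every split z = u v w (v non-empty) with u v^i w in L for i = 0, 1, 2."""
    good = []
    for start in range(len(z)):
        for end in range(start + 1, len(z) + 1):
            u, v, w = z[:start], z[start:end], z[end:]
            if all(in_L(u + v * i + w) for i in (0, 1, 2)):
                good.append((u, v, w))
    return good


for n in range(1, 6):
    z = "a" * n + "b" * n
    print(n, pumpable_splits(z))  # the list is empty for every n tried
```

No split survives: a v made only of a's or only of b's fails at i = 0, and a v that straddles the boundary fails at i = 2.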
In the first language, a^m b^n, m and n are independent: m may be greater than n, less than n, or equal to n. No counting is needed, so it is a regular language.
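For comparison, here is a small sketch of a DFA for a^m b^n with m, n > 0, written as a Python transition table. The state numbering and the names TRANSITIONS, ACCEPTING and accepts_ambn are choices made for this illustration only.

```python
# States: 0 = start, 1 = reading a's, 2 = reading b's (accepting), 3 = dead.
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 3,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 3, (2, "b"): 2,
    (3, "a"): 3, (3, "b"): 3,
}
ACCEPTING = {2}


def accepts_ambn(s: str) -> bool:
    """Run the DFA on s; any symbol outside {a, b} leads to the dead state."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch), 3)
    return state in ACCEPTING


print(accepts_ambn("aaab"))  # True
print(accepts_ambn("aba"))   # False
```

Four states are enough because the machine only has to remember which phase it is in, never how many a's it has seen.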

How to tell whether a language is a regular language, context free language, Push down Automata, etc?

I know that the pumping lemma can be used to determine whether a language is a Regular Language, Context Free Language, Pushdown Automata, etc. However, I would like to know if there are any tricks in telling what type of language a given language is, or perhaps general tendencies for certain languages?
For example, is there any way of telling what the languages in the following examples are just by looking at the language description?
1. L = {(0^n)2(1^m) | n >= m }
2. L = {(0^n)2(1^m) | n >= 1, m >= 1, n + m <= 100 }
3. L = {(0^n)(1^m)2 | n >= 1, m >= 1, n + m <= 100 }
4. L = {ww^R | w element of {0, 1}*, where w^R is the reverse of w}
5. L = {w2w | w element of {0, 1}*}
6. L = {w2w^R | w element of {0, 1}*, where w^R is the reverse of w}
The answers are:
1. Not Finite Automata, Not DPDA by empty stack, but DPDA by final state
2. Finite Automata, but Not DPDA by empty stack.
3. Finite Automata, also DPDA by empty stack
4. Is a PDA, but not DPDA
5. Not any DPDA
6. DPDA by empty stack and DPDA by final state, not FSA
Thanks!
There are some simple points you can check by looking at a language that can help you decide which class it belongs to (P.S.: these are not stated rules, but are derived from the definitions of these languages).
Check whether the language is finite. A finite language is always accepted by a finite automaton. E.g. L = {a^n b^m | n+m<100} or L = {a^n b^n | n<50}. These examples may look like context-free languages, but they are finite and hence accepted by a finite automaton.
Check whether the condition given in the language involves a single comparison or not. If it involves more than one comparison, then it is neither regular nor context-free; it is a context-sensitive language. E.g. L = {a^n b^n c^n | n>1} and L = {a^n b^n c^m | m>n} are both cases where more than one comparison is present. In the first example both comparisons are in the body of the language; in the second, one comparison is in the body and the other is in the condition of the language.
Distinguishing between PDA and DPDA is easy if you know how to design a PDA. If the language contains a clear point at which to change state, then it is accepted by a DPDA; otherwise it needs a nondeterministic PDA.
If the context-free language involves a condition of equality, like L = {w.w^R | w element of {0,1}*} or L = {a^n b^n | n>1}, then it is accepted by a PDA both by empty stack and by final state; but if the condition is an inequality, then you need to check whether the stack will be empty or not.
Try to visualize a stack, and using that stack try to work out whether the given comparison can be made. For a language to be context-free, a single stack must suffice for the comparison.
If the language is too complex to be regular or context-free, then treat it as context-sensitive. There isn't any way to tell straight away whether a language is R.E. or context-sensitive, so it is assumed that every language which can be written in set-builder form and which is not regular or context-free is context-sensitive.
As I said earlier, these aren't stated rules or facts, just some points derived from the definitions of these languages. To learn to guess quickly, practise on some languages using these rules.

Given Two Regex, Determine if One is a Complement of Other

I'd like to know how you can tell if some regular expression is the complement of another regular expression. Let's say I have 2 regular expressions r_1 and r_2. I can certainly create a DFA out of each of them and then check to make sure that L(r_1) != L(r_2). But that doesn't necessarily mean that r_1 is the complement of r_2 and vice versa. Also, it seems to me that many different regular expressions could be the complement of a single regular expression.
So I'm wondering how, given two regular expressions, I can determine if one is the complement of another. This is also new to me, so perhaps I'm missing something that should be apparent.
Edit: I should point out that I am not simply trying to find the complement of a regular expression. I am given two regular expressions, and I am to determine if they are the complement of each other.
Here is one approach that is conceptually simple, if not terribly efficient (not that there is necessarily a more efficient solution...):
Construct NFAs M and N for regular expressions r and s, respectively. You can do this using the construction introduced in the proof that finite automata and regular expressions describe the same languages.
Determinize M and N to get M' and N'. We might as well go ahead and minimize them at this point... giving M'' and N''.
Construct a machine C using the Cartesian product machine construction on machines M'' and N''. Acceptance will be determined by the symmetric difference, or XOR, criterion: accepting states in the product machine correspond to pairs of states (m, n) where exactly one of the two states is accepting in its automaton.
Minimize C and call the result C'.
If L(r) = L(s)', then the initial state of C' will be accepting and C' will have all transitions originating in the initial state also terminating in the initial state. If this is the case, r and s are complements of each other.
Why should this work? The symmetric difference of two sets is the set of everything in exactly one (not both, not neither). If L(s) and L(r) are complementary, then it is not difficult to see that the symmetric difference includes all strings (by definition, the complement of a set contains everything not in the set). Suppose now there were non-complementary sets whose symmetric difference were the universe of all strings. The sets are not complementary, so either (1) their intersection is non-empty or (2) their union is not the universe of all strings. In case (1), the symmetric difference will not include the shared elements; in case (2), the symmetric difference will not include the missing strings. So, only complementary sets have the symmetric difference equal to the universe of all strings; and a minimal DFA for the set of all strings will always have an accepting initial state with self-loops.
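A minimal sketch of the XOR product check in Python, assuming each DFA is already complete and given as a hypothetical triple (start, accepting_states, transitions). Instead of minimizing the product machine, it checks that every reachable pair of states is accepting under the XOR criterion, which amounts to the same test. Converting the regular expressions to DFAs is left out.

```python
from collections import deque


def are_complements(dfa1, dfa2, alphabet):
    """True if every reachable product state has exactly one accepting component."""
    start1, accept1, delta1 = dfa1
    start2, accept2, delta2 = dfa2
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if (p in accept1) == (q in accept2):
            return False  # some string is in both languages or in neither
        for ch in alphabet:
            nxt = (delta1[p, ch], delta2[q, ch])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Example DFAs over {a, b}: even number of a's versus odd number of a's.
PARITY = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
EVEN = (0, {0}, PARITY)
ODD = (0, {1}, PARITY)

print(are_complements(EVEN, ODD, "ab"))   # True
print(are_complements(EVEN, EVEN, "ab"))  # False
```

Exploring only reachable pairs keeps the search to at most |M| x |N| states, and it stops at the first counterexample.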
For complement: L(r_1) == !L(r_2)

How does one define a "language" using a regular expression?

I would like to define "some language" using a regular expression. The requirements are:
1. The language must contain an infinite number of strings.
2. The underlying alphabet must have at least three different characters.
3. I also need to draw a deterministic finite state automaton that accepts the strings of that language.
4. Give two character strings that are accepted by that finite state automaton and two that are not.
Given this set of requirements, I have thus far (based on my 20 years old memory of set theory and the math associated with it), come up with the following and would appreciate some input from a set-theory, regular expression and formal language definition expert (I know there are many of you who have a deeply vested interest in this subject).
Does the following come even close to fulfilling (1) and (2) at least? What does (4) actually imply? For instance, if the set can hold infinite strings (in theory), as per requirement (1), then how can we fulfill requirement (4) which says "Given 2 strings that are accepted by the (FSA) and TWO THAT ARE NOT"???
My current (rather fallible) solution is:
Alphabet:
∑ = {s,a,e,t,n}
Language:
L* = { Ø , ∈ , taste, set, ate, sane, ….}
OR (using regular expression)
L* = [saetn]*
Any takers?
Thanks.
First of all, the regular expression [saetn]* would accept all strings over the alphabet you chose, so you would be unable to find two which are not in the language (The language would be L = Σ*) and can't satisfy requirement (4).
L = { Ø , ε , taste, set, ate, sane, ...}
is not a valid language, because a language cannot contain Ø. The empty set is not a string (and a language is a set of strings, not a set of sets). Let's remove the Ø.
L = { ε , taste, set, ate, sane, ...}
Does the following come even close to fulfilling (1) and (2) at least?
It doesn't fulfil (1), as there is no reasonable pattern for the ... to have any meaning. The language looks finite.
L = { ε , taste, set, ate, sane }
Would be a valid finite language where ε denotes the empty string. All finite languages are regular, since you can create an expression that is an OR of all the strings in the language (|taste|set|ate|sane).
It does fulfil (2), as you picked the alphabet ∑ = {s,a,e,t,n}, which has 5 elements.
What does (4) actually imply?
It means that the language can't contain all strings over the alphabet. There must be at least two strings in Σ* that are not in the language, and you must show what they are. That doesn't prevent the language from being infinite.
An example of an infinite language would be:
L = { ε, s, a, t, ss, aa, tt, sss, aaa, ttt, ssss, ... }
That language (over the alphabet {s, a, t}) contains all strings which have no more than one distinct character. One regular expression that would accept that language would be s*|a*|t*. The language is clearly infinite, and any strings which contain two different symbols, like at or sat are not in the language. That language satisfies all the requirements. There are many other languages that satisfy all the requirements.
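As a quick check of requirement (4), the expression s*|a*|t* can be tried against a few strings with Python's re module. The helper name in_language is invented for this sketch; re.fullmatch is used so that the whole string has to match, not just a prefix.

```python
import re

PATTERN = re.compile(r"s*|a*|t*")


def in_language(word: str) -> bool:
    """True if the whole word consists of at most one distinct symbol from {s, a, t}."""
    return PATTERN.fullmatch(word) is not None


for word in ["", "sss", "tt", "at", "sat"]:
    print(repr(word), in_language(word))
# '' True, 'sss' True, 'tt' True, 'at' False, 'sat' False
```

So sss and tt can serve as the two accepted strings, and at and sat as the two rejected ones.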
I will leave the drawing of the DFA to you. If you have any questions about it, feel free to comment on my answer.

Determining whether a regex is a subset of another

I have a large collection of regular expressions that, when matched, call a particular http handler. Some of the older regexes are unreachable (e.g. a.c* ⊃ abc*) and I'd like to prune them.
Is there a library that, given two regexes, will tell me if the second is a subset of the first?
I wasn't sure this was decidable at first (it smelled like the halting problem by a different name). But it turns out it's decidable.
Trying to find the complexity of this problem led me to this paper.
The formal definition of the problem can be found within: this is generally called the inclusion problem
The inclusion problem for R, is to test for two given expressions r, r′ ∈ R,
whether r ⊆ r′.
That paper has some great information (summary: all but the simplest expressions are fairly complex), however searching for information on the inclusion problem leads one directly back to StackOverflow. That answer already had a link to a paper describing a passable polynomial time algorithm which should cover a lot of common cases.
I found a python regex library that provides set operations.
http://github.com/ferno/greenery
The proof says Sub ⊆ Sup ⇔ Sub ∩ ¬Sup is {}. I can implement this with the python library:
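import sys
from greenery.lego import parse
# Parse both command-line arguments into greenery pattern objects.
subregex = parse(sys.argv[1])
supregex = parse(sys.argv[2])
# Sub ∩ ¬Sup: everything the first regex matches that the second does not.
s = subregex & (supregex.everythingbut())
if s.empty():
    print("%s is a subset of %s" % (subregex, supregex))
else:
    print("%s is not a subset of %s, it also matches %s" % (subregex, supregex, s))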
examples:
subset.py abcd.* ab.*
abcd.* is a subset of ab.*
subset.py a[bcd]f* a[cde]f*
a[bcd]f* is not a subset of a[cde]f*, it also matches abf*
The library may not be robust because as mentioned in the other answers you need to use the minimal DFA in order for this to work. I'm not sure ferno's library makes (or can make) that guarantee.
As an aside: playing with the library to calculate inverse or simplify regexes is lots of fun.
a(b|.).* simplifies to a.+. Which is pretty minimal.
The inverse of abf* is ([^a]|a([^b]|bf*[^f])).*|a?. Try to come up with that on your own!
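The claimed inverse can be spot-checked with Python's re module by brute force: every string should fully match exactly one of the two patterns. The sketch below only covers strings over the small alphabet {a, b, f} up to length 5, so it is evidence rather than proof; the function name disagreements is made up.

```python
import re
from itertools import product

ORIGINAL = re.compile(r"abf*")
INVERSE = re.compile(r"([^a]|a([^b]|bf*[^f])).*|a?")


def disagreements(alphabet="abf", max_len=5):
    """List the strings matched by both patterns or by neither."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            word = "".join(chars)
            in_orig = ORIGINAL.fullmatch(word) is not None
            in_inv = INVERSE.fullmatch(word) is not None
            if in_orig == in_inv:
                bad.append(word)
    return bad


print(disagreements())  # []
```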
If the regular expressions use "advanced features" of typical procedural matchers (like those in Perl, Java, Python, Ruby, etc.) that allow accepting languages that aren't regular, then you are out of luck. The problem is in general undecidable. E.g. the problem of whether one pushdown automaton recognizes the same context free (CF) language as another is undecidable. Extended regular expressions can describe CF languages.
On the other hand, if the regular expressions are "true" in the theoretical sense, consisting only of concatenation, alternation, and Kleene star over strings with a finite alphabet, plus the usual syntactic sugar on these (character classes, +, ?, etc), then there is a simple algorithm that runs in time polynomial in the size of the automata (the DFA for a regex can, in the worst case, be exponentially larger than the regex itself).
I can't give you libraries, but this:
For each pair of regexes r and s for languages L(r) and L(s)
    Find the corresponding Deterministic Finite Automata M(r) and M(s)
    Compute the cross-product machine M(r x s) and assign accepting states
        so that it computes L(r) - L(s)
    Use a DFS or BFS of the M(r x s) transition table to see if any
        accepting state can be reached from the start state
    If no, you can eliminate r because L(r) is a subset of L(s).
    Reassign accepting states so that M(r x s) computes L(s) - L(r)
    Repeat the steps above to see if it's possible to eliminate s
Converting a regex to a DFA generally uses Thompson's construction to get a non-deterministic automaton. This is converted to a DFA using the Subset Construction. The cross-product machine is another standard algorithm.
This was all worked out in the 1960's and is now part of any good undergrad computer science theory course. The gold standard for the topic is Hopcroft and Ullman, Automata Theory.
There is an answer in the mathematics section: https://math.stackexchange.com/questions/283838/is-one-regular-language-subset-of-another.
Basic idea:
Compute the minimal DFA for both languages.
Calculate the cross product of both automata M1 and M2, which means that each state consists of a pair [m1, m2] where m1 is from M1 and m2 from M2 for all possible combinations.
The new transition F12 is: F12([m1, m2], x) => [F1(m1, x), F2(m2, x)]. This means if there was a transition in M1 from state m1 to m1' while reading x and in M2 from state m2 to m2' while reading x then there is one transition in M12 from [m1, m2] to [m1', m2'] while reading x.
At the end you look into the reachable states:
If there is a pair [accepting, rejecting] then M1 is not a subset of M2
If there is a pair [rejecting, accepting] then M2 is not a subset of M1
It would be beneficial to compute only the new transitions and the resulting states, omitting all unreachable states from the beginning.
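The reachable-pairs idea can be sketched as follows, assuming complete DFAs given as hypothetical triples (start, accepting_states, transitions). Product states are generated on the fly from the start pair, so unreachable pairs are never built, and both subset directions are decided in one pass.

```python
from collections import deque


def compare_languages(dfa1, dfa2, alphabet):
    """Return (L1 subset of L2, L2 subset of L1), visiting reachable pairs only."""
    start1, accept1, delta1 = dfa1
    start2, accept2, delta2 = dfa2
    l1_in_l2 = l2_in_l1 = True
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        m1, m2 = queue.popleft()
        a1, a2 = m1 in accept1, m2 in accept2
        if a1 and not a2:
            l1_in_l2 = False  # [accepting, rejecting]: a string in L1 but not in L2
        if a2 and not a1:
            l2_in_l1 = False  # [rejecting, accepting]: a string in L2 but not in L1
        for x in alphabet:
            nxt = (delta1[m1, x], delta2[m2, x])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return l1_in_l2, l2_in_l1


# Example over {a, b}: strings starting with "aa" versus strings starting with "a".
STARTS_AA = (0, {2}, {(0, "a"): 1, (0, "b"): 3, (1, "a"): 2, (1, "b"): 3,
                      (2, "a"): 2, (2, "b"): 2, (3, "a"): 3, (3, "b"): 3})
STARTS_A = (0, {1}, {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 1,
                     (2, "a"): 2, (2, "b"): 2})

print(compare_languages(STARTS_AA, STARTS_A, "ab"))  # (True, False)
```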