Non-regular context-free languages and infinite regular sublanguages - regex

I had an assignment at university which basically said:
"Demonstrate that the non-regular language L = {0^n 1^n : n ∈ ℕ} has no infinite regular sublanguage."
I proved this by contradiction: I assumed there is a regular language S which is a sublanguage of L, argued that the only candidate regular expressions for S are 0*, 1*, (1+0)* and (01)*, and then checked each of them and showed that none describes a sublanguage of L.
However, how could I prove that ANY non-regular context-free language cannot contain an infinite regular sublanguage?
I don't want the proof per se, I just want to be pointed in the right direction.

L = {0^n 1^n : n ∈ ℕ} is non-regular and context-free.
M = 2*3* is infinite and regular.
N = L ∪ M is non-regular and context-free, and N contains M.
So the general claim is false: a non-regular context-free language can contain an infinite regular sublanguage.

For the 0^n 1^n language, it might be valuable to look into the pumping lemma. When I learned the pumping lemma it was applied to the a^n b^n language (the same thing), so it may well help in your proof.
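To make that concrete, here is a brute-force check in Python (a sketch; p = 7 below just stands in for the unknown pumping length). It enumerates every decomposition w = xyz of w = 0^p 1^p with |xy| ≤ p and |y| ≥ 1 and verifies that pumping y once more always leaves L:

    # Brute-force illustration of the pumping-lemma argument for 0^n 1^n.
    def in_L(s):
        n = len(s) // 2
        return s == '0' * n + '1' * n

    p = 7                              # stand-in for the pumping length
    w = '0' * p + '1' * p
    for i in range(p + 1):             # |x| = i
        for j in range(i + 1, p + 1):  # |xy| = j <= p, so y consists of 0s only
            x, y, z = w[:i], w[i:j], w[j:]
            assert not in_L(x + y * 2 + z)   # xy^2z has more 0s than 1s
    print("every valid split pumps out of L")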
Also, you can consider that regular languages are closed under complement, union, intersection, concatenation, and the Kleene star.
That is, if L1 and L2 are regular then:
L1 L2 (concatenation) is regular
L1 ∩ L2 is regular
L1 ∪ L2 is regular
¬L1 is regular
L1* is regular
It's possible that you could prove that any language that contains an infinite regular sublanguage is regular by using some of these rules.
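For instance, closure under intersection comes from the product construction. Here is a minimal sketch in Python (the DFA encoding as a transition dict plus start state and accepting set is my own illustrative choice):

    # Product construction: a DFA for the intersection of two DFAs.
    def intersect(d1, s1, f1, d2, s2, f2, alphabet):
        delta, start = {}, (s1, s2)
        states, todo = {start}, [start]
        while todo:
            p, q = todo.pop()
            for a in alphabet:
                nxt = (d1[(p, a)], d2[(q, a)])
                delta[((p, q), a)] = nxt
                if nxt not in states:
                    states.add(nxt)
                    todo.append(nxt)
        accept = {(p, q) for (p, q) in states if p in f1 and q in f2}
        return delta, start, accept

    def accepts(delta, start, accept, word):
        state = start
        for a in word:
            state = delta[(state, a)]
        return state in accept

    # DFA for "even number of 0s" and DFA for "ends in 1", over {0, 1}
    d1 = {(0, '0'): 1, (0, '1'): 0, (1, '0'): 0, (1, '1'): 1}
    d2 = {('x', '0'): 'x', ('x', '1'): 'y', ('y', '0'): 'x', ('y', '1'): 'y'}
    delta, start, accept = intersect(d1, 0, {0}, d2, 'x', {'y'}, '01')
    print(accepts(delta, start, accept, '001'))  # True: two 0s, ends in 1
    print(accepts(delta, start, accept, '011'))  # False: one 0

The product automaton has at most |Q1| * |Q2| states, which is why the intersection of two regular languages stays regular.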

Your instincts are good. Two things here.
First, almost always when the question takes the form "show that L is not regular / not context-free", the answer is going to involve one of the pumping lemmas. Similarly, when you get a question like "show there are no X that ...", the easy route is (almost always) a proof by contradiction.

EDIT: this is a false statement; it only applies to context-free languages.

Since you just want hints (and thankfully so, since I've forgotten how to do proofs since college), look at the definition of a regular language and the properties it has. Just from looking there I had enough info to prove the statement.

Related

Regex and context-free grammar conversion

Can the following CFG be converted to a regex?
Someone said that this could be its regex: (ab*a + b)*
Is this true, and why? I can't seem to understand it.
It's not a regular language.
Consider the subset of the language with exactly one b (in other words, the intersection of the language with a*ba*). If the language were regular, that subset would also be regular, since it would be the intersection of two regular languages.
But it's not regular: it consists of strings in which the number of a's following the b is at least as large as the number of a's preceding the b, and that is not a regular language ("regular languages can't count").

Check if a language is regular, given a CF grammar

I have this grammar
S->aSbA
S->e
A->aB
B->bA
How can I determine whether the language is regular? My problem is that derivations from A and B never reach a terminal string, so I don't know what languages they produce.
There is no general method for this. Regularity is an undecidable problem for context-free languages.
In your specific case, as J Earls has pointed out, the only word you can derive is the empty word: every derivation that uses a rule other than S -> e never terminates. Thus the language is finite and therefore regular.
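To see this by brute force, here is a sketch in Python (the rule encoding and the length bound are illustrative) that expands sentential forms breadth-first and collects every terminal word reachable below the bound:

    from collections import deque

    # S -> aSbA | e, A -> aB, B -> bA ('' encodes the empty word e)
    rules = {'S': ['aSbA', ''], 'A': ['aB'], 'B': ['bA']}

    def derivable_words(max_len=12):
        words, queue = set(), deque(['S'])
        while queue:
            form = queue.popleft()
            for i, ch in enumerate(form):
                if ch in rules:                 # expand the leftmost nonterminal
                    for rhs in rules[ch]:
                        new = form[:i] + rhs + form[i + 1:]
                        if len(new) <= max_len:
                            queue.append(new)
                    break
            else:
                words.add(form)                 # no nonterminal left: a word
        return words

    print(derivable_words())  # {''} -- only the empty word is derivable

Every derivation that keeps a nonterminal alive only grows past the bound, matching the argument above.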

Context Free Grammar for which a RegEx is impossible

I'm trying to find out if it's possible to have an example of a CFG for which it is impossible to give a regular expression accepting the same language.
Any language which requires unbounded counting/remembering can't be expressed as a regular expression.
For example, the language of balanced braces:
S -> { S } S
S -> ε
Since a regular machine/expression has only a limited (pre-defined) number of states, it cannot "remember" arbitrarily much of the earlier input.
As such, recognizing the following language is impossible for a state machine: a^n b^n (n ∈ ℕ).
You could make such a machine for n ≤ x, where x ∈ ℕ is fixed, but no state machine can do it for every possible value from ℕ.
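To make the counting argument concrete, here is a sketch of a recognizer for the brace language above; it needs an unbounded counter, which is exactly what a machine with a fixed, finite number of states cannot provide:

    def balanced(s):
        depth = 0                  # unbounded counter: the non-regular part
        for ch in s:
            if ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth < 0:      # a '}' with no matching '{'
                    return False
            else:
                return False       # only braces belong to the language
        return depth == 0

    print(balanced('{{}{}}'))  # True
    print(balanced('{}}{'))    # False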

Automatic regex builder

I have N strings.
Also, there are K regular expressions, unknown to me. Each string either matches one of the regular expressions or it is garbage. There are a total of L garbage strings in the set. Both K and L are unknown.
I'd like to deduce the regular expressions. Obviously, this problem has an infinite number of solutions. I need to find a "reasonably good solution", which
1) minimizes K
2) minimizes L
3) maximizes the "specificity" of the regular expressions. I don't know what's the right term for this quality. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific".
All 3 requirements need to be taken as one compound criterion, with certain reasonable weights.
A solution for one particular case: if L = 0 and K = 1 (just one regex and no garbage), then we can find the LCS (longest common subsequence) of the strings and come up with a corresponding regex from there. However, when we have "noise" (L > 0), this approach doesn't work.
Any ideas (or pointers to existing work) are greatly appreciated.
What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.
I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all n strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).
Two interesting queries at Google Scholar:
"minimum description length" automata
"language inference" automata
The key words in academia are "grammatical inference". Unfortunately, there aren't any efficient, general algorithms to do the sort of thing you're proposing. What's your real problem?
Edit: it sounds like you might be interested in Data Description Languages. PADS (http://www.padsproj.org/) is a typical example.
Nothing clever here, perhaps I don't fully understand the problem?
Why not just always reduce L to 0? Check each string against each regex; if a string doesn't match any of the regexes, it's garbage. If it does match, remember the regex/string(s) that matched, and do LCS on each L = 0, K = 1 group to deduce each regex's definition.
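Here is a sketch of that L = 0, K = 1 base case in Python (the function names and the '.*' gap encoding are illustrative choices, with a common subsequence as a cheap stand-in for a proper alignment):

    import re
    from functools import reduce

    def lcs(a, b):
        # Classic dynamic-programming longest common subsequence.
        dp = [[''] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                if ca == cb:
                    dp[i + 1][j + 1] = dp[i][j] + ca
                else:
                    dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
        return dp[-1][-1]

    def infer_regex(strings):
        common = reduce(lcs, strings)      # characters shared by all strings
        return '.*'.join(re.escape(c) for c in common) or '.*'

    samples = ['ab123', 'ab456', 'xab78']
    pattern = infer_regex(samples)
    print(pattern)                                      # a.*b
    print(all(re.search(pattern, s) for s in samples))  # True

As noted above, this collapses as soon as garbage strings are mixed in, so the strings would have to be clustered first.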

Complexity of Regex substitution

I didn't get the answer to this anywhere. What is the runtime complexity of a regex match and substitution?
Edit: I work in Python, but I'd like to know in general about the most popular languages/tools (Java, Perl, sed).
From a purely theoretical stance:
The implementation I am familiar with builds a deterministic finite automaton (DFA) to recognize the regex. This is done in O(2^m) time, m being the size of the regex, using a standard algorithm. Once it is built, running a string through it is linear in the length of the string, i.e. O(n), n being the string length. A replacement on a match found in the string should be constant time.
So overall, I suppose O(2^m + n).
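For illustration, here is a sketch of the determinisation (subset construction) step that causes the O(2^m) factor; the NFA encoding as a dict of (state, symbol) -> set of states is my own choice, and epsilon moves are omitted for brevity:

    from itertools import chain

    # Subset construction: each DFA state is a set of NFA states,
    # hence up to 2^m DFA states for an NFA with m states.
    def determinise(nfa, start, accept, alphabet):
        d_start = frozenset([start])
        delta, seen, todo = {}, {d_start}, [d_start]
        while todo:
            S = todo.pop()
            for a in alphabet:
                T = frozenset(chain.from_iterable(nfa.get((s, a), ()) for s in S))
                delta[(S, a)] = T
                if T not in seen:
                    seen.add(T)
                    todo.append(T)
        d_accept = {S for S in seen if S & accept}
        return delta, d_start, d_accept

    # NFA for a*b: state 0 loops on 'a' and reaches accepting state 1 on 'b'
    delta, d0, acc = determinise({(0, 'a'): {0}, (0, 'b'): {1}}, 0, {1}, 'ab')
    state = d0
    for ch in 'aab':
        state = delta[(state, ch)]   # one table lookup per input symbol
    print(state in acc)              # True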
Other theoretical info of possible interest.
For clarity, assume the standard definition of a regular expression from formal language theory (http://en.wikipedia.org/wiki/Regular_language). Practically, this means that the only building material is alphabet symbols, the operators of concatenation, alternation and Kleene closure, along with the unit and zero constants (which appear for algebraic reasons). Generally it's a good idea not to overload this term, despite the everyday practice in scripting languages, which leads to ambiguities.
There is an NFA construction that solves the matching problem for a regular expression r and an input text t in O(|r| |t|) time and O(|r|) space, where |·| is the length function. This algorithm was further improved by Myers (http://doi.acm.org/10.1145/128749.128755) to O(|r| |t| / log |t|) time and space by using automaton node listings and the Four Russians paradigm. The paradigm seems to be named after the four Russian authors of a groundbreaking paper that is not available online, but it is illustrated in these computational biology lecture notes: http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf. I find it hilarious to name a paradigm by the number and the nationality of the authors instead of their last names.
The matching problem for regular expressions with added backreferences is NP-complete, which was proven by Aho (http://portal.acm.org/citation.cfm?id=114877) by a reduction from the vertex-cover problem, a classical NP-complete problem.
To match regular expressions with backreferences deterministically we could employ backtracking (not unlike the Perl regex engine) to keep track of the possible subwords of the input text t that can be assigned to the variables in r. There are only O(|t|^2) subwords that can be assigned to any one variable in r. If there are n variables in r, then there are O(|t|^(2n)) possible assignments. Once an assignment of substrings to variables is fixed, the problem reduces to plain regular expression matching. Therefore the worst-case complexity for matching regular expressions with backreferences is O(|t|^(2n)).
Note, however, that regular expressions with backreferences are still not full-featured regexen.
Take, for example, the "don't care" symbol, apart from any other operators. There are several polynomial algorithms deciding whether a set of such patterns matches an input text. For example, Kucherov and Rusinowitch (http://dx.doi.org/10.1007/3-540-60044-2_46) define a pattern as a word w_1#w_2#...#w_n, where each w_i is a word (not a regular expression) and "#" is a variable-length "don't care" symbol not contained in any of the w_i. They derive an O((|t| + |P|) log |P|) algorithm for matching a set of patterns P against an input text t, where |t| is the length of the text and |P| is the length of all the words in P.
It would be interesting to know how these complexity measures combine, and what the complexity of the matching problem is for regular expressions with backreferences, "don't care" and other interesting features of practical regular expressions.
Alas, I haven't said a word about Python... :)
Depends on what you define as a regex. If you allow only the operators of concatenation, alternation and Kleene star, the time can actually be O(m*n + m), where m is the size of the regex and n is the length of the string. You do it by constructing an NFA (that is linear in m), and then simulating it by maintaining the set of states you're in and updating that (in O(m)) for every letter of input.
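A minimal sketch of that state-set simulation (the NFA below hand-encodes a*b, and epsilon transitions are left out for brevity):

    def run_nfa(transitions, start, accept, text):
        states = {start}              # all states we could currently be in
        for ch in text:
            states = {t for s in states for t in transitions.get((s, ch), ())}
            if not states:            # no state can continue: reject early
                return False
        return bool(states & accept)

    # NFA for a*b
    trans = {(0, 'a'): {0}, (0, 'b'): {1}}
    print(run_nfa(trans, 0, {1}, 'aaab'))  # True
    print(run_nfa(trans, 0, {1}, 'aba'))   # False

Each input character costs at most one update per NFA state, which is where the O(m*n) bound comes from.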
Things that make regex parsing difficult:
parentheses and backreferences: capturing is still OK with the aforementioned algorithm, although it would push the complexity higher, so it might be infeasible. Backreferences raise the recognition power of the regex; as noted above, matching with them is NP-complete.
positive look-ahead: just another name for intersection, which raises the complexity of the aforementioned algorithm to O(m^2 + n)
negative look-ahead: a disaster for constructing the automaton (O(2^m), possibly PSPACE-complete), but it should still be possible to tackle with the dynamic algorithm in something like O(n^2 * m)
Note that with a concrete implementation, things might get better or worse. As a rule of thumb, simple features should be fast enough, and unambiguous regexes (e.g. nothing like a*a*) are better.
To delve into theprise's answer: for the construction of the automaton, O(2^m) is the worst case, though it really depends on the form of the regular expression (for a very simple one that matches a word, it's in O(m), using for example the Knuth-Morris-Pratt algorithm).
Depends on the implementation. What language/library/class? There may be a best case, but it would be very specific to the number of features in the implementation.
You can trade space for speed by building a deterministic finite automaton instead of simulating the NFA: the DFA can be traversed in linear time, but in the worst case it needs O(2^m) space. I'd expect the tradeoff to be worth it.
If you're after matching and substitution, that implies grouping and backreferences.
Here is a Perl example where grouping and backreferences can be used to solve an NP-complete problem:
http://perl.plover.com/NPC/NPC-3SAT.html
This (coupled with a few other theoretical tidbits) means that matching and substitution with regular expressions that allow backreferences is NP-complete.
Note that this is different from the formal definition of a regular expression, which has no notion of grouping and which matches in polynomial time, as described by the other answers.
In Python's re library, even if a regex is compiled, matching can still take time exponential in the string length in some cases, because the engine is not built on a DFA.
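A quick way to see this (a sketch; absolute timings will vary by machine) is the classic nested quantifier (a+)+b, which forces the backtracking engine to try exponentially many ways of splitting the a's when the overall match fails:

    import re
    import time

    pattern = re.compile(r'(a+)+b')   # nested quantifiers: a known bad case
    for n in (18, 20, 22, 24):
        t0 = time.perf_counter()
        pattern.match('a' * n)        # no 'b', so every split must be tried
        print(n, round(time.perf_counter() - t0, 3))
    # The time roughly doubles with each step of n: exponential behaviour.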