Proving a Certain Language is Regular - regex

In my computational theory class we have been asked to prove that a certain language is regular:
Ln = { a^(2k+1) | k is a multiple of n } c { a } *
I'm unsure where to start, usually you would use one of an NFA, DFA, regular expression, or regular grammar. If anyone could help push me in the right direction it would be greatly appreciated.

Here are some hints to get you started:
Notice that Ln = { a2nr + 1 | r ∈ N }, which you can rewrite as { (a2n)ra | r ∈ N }. That might expose a bit more structure that you previously noticed.
If you want to go down the DFA route, think about what information you need to keep track of at each point in time. Each state in the DFA should tell you something about the string you've seen so far. It might help to notice that it doesn't really matter how many total characters are in the string, just what the remainder is modulo n.
If you want to go down the regex route, how would you express the idea of "any number of copies of a2n followed by another a?"

Related

Haskell check if the regular expression r made up of the single symbol alphabet Σ = {a} defines language L(r) = a*

I have got to write an algorithm programatically using haskell. The program takes a regular expression r made up of the unary alphabet Σ = {a} and check if the regular expression r defines the language L(r) = a^* (Kleene star). I am looking for any kind of tip. I know that I can translate any regular expression to the corresponding NFA then to the DFA and at the very end minimize DFA then compare, but is there any other way to achieve my goal? I am asking because it is clearly said that this is the unary alphabet, so I suppose that I have to use this information somehow to make this exercise much easier.
This is how my regular expression data type looks like
data Reg = Epsilon | -- epsilon regex
Literal Char | -- a
Or Reg Reg | -- (a|a)
Then Reg Reg | -- (aa)
Star Reg -- (a)*
deriving Eq
Yes, there is another way. Every DFA for regular languages on the single-letter alphabet is a "lollipop"1: an initial string of nodes that each point to each other (some of which are marked as final and some not) followed by a loop of nodes (again, some of which are marked as final and some not). So instead of doing a full compilation pass, you can go directly to a DFA, where you simply store two [Bool] saying which nodes in the lead-in and in the loop are marked final (or perhaps two [Integer] giving the indices and two Integer giving the lengths may be easier, depending on your implementation plans). You don't need to ensure the compiled version is minimal; it's easy enough to check that all the Bools are True. The base cases for Epsilon and Literal are pretty straightforward, and with a bit of work and thought you should be able to work out how to implement the combining functions for "or", "then", and "star" (hint: think about gcd's and stuff).
1 You should try to prove this before you begin implementing, so you can be sure you believe me.
Edit 1: Hm, while on my afternoon walk today, I realized the idea I had in mind for "then" (and therefore "star") doesn't work. I'm not giving up on this idea (and deleting this answer) yet, but those operations may be trickier than I gave them credit for at first. This approach definitely isn't for the faint of heart!
Edit 2: Okay, I believe now that I have access to pencil and paper I've worked out how to do concatenation and iteration. Iteration is actually easier than concatenation. I'll give a hint for each -- though I have no idea whether the hint is a good one or not!
Suppose your two lollipops have a length m lead-in and a length n loop for the first one, and m'/n' for the second one. Then:
For iteration of the first lollipop, there's a fairly mechanical/simple way to produce a lollipop with a 2*m + 2*n-long lead-in and n-long loop.
For concatenation, you can produce a lollipop with m + n + m' + lcm(n, n')-long lead-in and n-long loop (yes, that short!).

How Can I demonstrate this grammar is not ambiguous?

I know I need to show that there is no string that I can get using only left most operation that will lead to two different parsing trees. But how can I do it? I know there is not a simple way of doing it, but since this exercise is on the compilers Dragon book, then I am pretty sure there is a way of showing (no need to be a formal proof, just justfy why) it.
The Gramar is:
S-> SS* | SS+ | a
What this grammar represents is another way of simple arithmetic(I do not remember the name if this technique of anyone knows, please tell me ) : normal sum arithmetic has the form a+a, and this just represents another way of summing and multiplying. So aa+ also mean a+a, aaa*+ is a*a+a and so on
The easiest way to prove that a CFG is unambiguous is to construct an unambiguous parser. If the grammar is LR(k) or LL(k) and you know the value of k, then that is straightforward.
This particular grammar is LR(0), so the parser construction is almost trivial; you should be able to do it on a single sheet of paper (which is worth doing before you try to look up the answer.
The intuition is simple: every production ends with a different terminal symbol, and those terminal symbols appear nowhere else in the grammar. So when you read a symbol, you know precisely which production to use to reduce; there is only one which can apply, and there is no left-hand side you can shift into.
If you invert the grammar to produce Polish (or Łukasiewicz) notation, then you get a trivial LL grammar. Again the parsing algorithm is obvious, since every right hand side starts with a unique terminal, so there is only one prediction which can be made:
S → * S S | + S S | a
So that's also unambiguous. But the infix grammar is ambiguous:
S → S * S | S + S | a
The easiest way to provide ambiguity is to find a sentence which has two parses; one such sentence in this case is:
a + a + a
I think the example string aa actually shows what you need. Can it not be parsed as:
S => SS* => aa OR S => SS+ => aa

find Reg. Expr. over {0,1,2} so last symbol of string is the sum of the symbols so far on the string mod 3.

I'm learning by myself formal languages (Aho's,Hopcroft) but I'm having a hard time with regular expressions.
I've been able to tackle simple tasks but this one has posed a challenge, at least for me. How to solve this if you can't count so far, I'm not used to this type of computation.
There must be some property or something that let me generalize the answer that much that i can put it as a regular expresion.
So far I've devised that is possible that there may be at least 2 o 3 cases:
sums mod3=0 if sum=3k
sums mod3=1 if sum=3k+1
sums mod3=2 if sum=3k+2.
But I've come to realize that there may be many combinations for a sum to happen so can't find the pattern the regular expression must follow.
The string for ex. {122211}0 (braces are for easy read sake) has the zero at the end as it holds that {sum=3k}0, if the sum is "10" from a string for ex. {1222111}1 the case may be {sum=3k+1} so the one has to be at the end, and so on.
This may or not be the right track to tackle the problem but I'm open to any suggestions please, any help is very appreciated.
Here's a hint: think of what distinct final states you can possibly be in. You certainly have at least 3 states, since the number of values can be three different things mod three. Also, you need to have a distinct start state, since the empty string cannot be accepted. Do you need more states?
Hint2: I think you can easily do this with a DFA using a start state and nine other states, of which exactly three will be accepting.
EDIT: Once you have a DFA, you can use Kleene's Theorem to construct an equivalent regular expression. If you'd rather go straight for a regular expression, here's another hint: if you're looking at any string of length 3k, you can append: 0; any string of length 1, followed by 1; any string of length 2, followed by 2. So if you can write regular expressions for strings of lengths 3k, 1, and 2, you're practically done.

Automatic regex builder

I have N strings.
Also, there are K regular expressions, unknown to me. Each string is either matching one of the regular expressions, or it is garbage. There are total of L garbage strings in the set. Both K and L are unknown.
I'd like to deduce the regular expressions. Obviously, this problem has infinite number of solutions. I need to find a "reasonably good solution", which
1) minimizes K
2) minimizes L
3) maximizes "specifics" of the regular expressions. I don't know what't the right term for this quality. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific".
All 3 requirements need to be taken as one compound criteria, with certain reasonable weights.
A solution for one particular case: If L = 0 and K = 1 (just one regex, and no garbage), then we can just find LCS (longest common subsequence) for the strings and come up with a corresponding regex from there. However, when we have "noise" (L > 0), this approach doesn't work.
Any ideas (or pointers to existing work) are greatly appreciated.
What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.
I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all n strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).
Two interesting queries at Google Scholar:
"minimum description length" automata
"language inference" automata
The key words in academia are "grammatical inference". Unfortunately, there aren't any efficient, general algorithms to do the sort of thing you're proposing. What's your real problem?
Edit: it sounds like you might be interested in Data Description Languages. PADS (http://www.padsproj.org/) is a typical example.
Nothing clever here, perhaps I don't fully understand the problem?
Why not just always reduce L to 0? Check each string against each regex; if a string doesn't match any of the regex's, it's garbage. if it does match, remember the regex/string(s) that did match and do LCS on each L = 0, K = 1 to deduce each regex's definition.

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with an (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
* #return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there is no strings that matches both regex. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
/**
* #return true if the two regexes will never both match a given string
*/
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
Automaton automaton1 = new RegExp(regex1).toAutomaton();
Automaton automaton2 = new RegExp(regex2).toAutomaton();
return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
I believe kdgregory is correct you're using Orthogonal to mean Complement.
Is this correct?
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that nonesuch exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
I would do the following:
convert each regex to a FSA, using something like the following structure:
struct FSANode
{
bool accept;
Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
CombinedNode(FSANode left, FSANode right)
{
this.left = left;
this.right = right;
}
Map<char, CombinedNode> links;
bool valid { get { return !left.accept || !right.accept; } }
public FSANode left;
public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are know procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.