How to deal with the EmptySet regex constructor in Coq proofs (and other general Coq questions)?

I'm trying to figure out how to approach the app_ne problem in SF. My thinking is to induct over the first regular expression, as it will allow us to satisfy the first disjunct, whereas all the other regular expression forms will allow one to prove the existential right disjunct.
(i) Is this a correct approach to the problem?
(ii) If so, how does one deal with the empty set case? This got me right away.
(iii) Is there any way to admit a single part of a proof and then come back to it later (since this easy case is throwing me off and I would like to work through some of the other cases)?
Lemma app_ne : forall (a : ascii) s re0 re1,
    a :: s =~ (App re0 re1) <->
    ([ ] =~ re0 /\ a :: s =~ re1) \/
    exists s0 s1, s = s0 ++ s1 /\ a :: s0 =~ re0 /\ s1 =~ re1.
Proof.
  intros.
  split.
  - intros. induction re0.
    * right. inversion H.
      (* + apply re_not_empty_correct. *)
      (* + apply MEmpty. *)
Abort.

My thinking is to induct over the first regular expression, as it will allow us to satisfy the first disjunct,
I don't understand that reasoning (maybe it's not what you actually meant). If you could just prove the first disjunct, then there would be no point in having a disjunction in the first place.
whereas all the other regular expression forms will allow one to prove the existential right disjunct.
"other" than what?
(iii) Is there any way to admit a single part of a proof and then come back to it later (since this easy case is throwing me off and I would like to work through some of the other cases)?
There is the admit. tactic to skip the current goal and the Admitted. command to skip the whole proof. A proof that uses admit. has to be closed with Admitted. rather than Qed. until the skipped goals are filled in.
Hint for this problem: what does the first assumption mean: a :: s =~ (App re0 re1) (i.e., look at what the definition of =~ says about App)?

Related

Haskell check if the regular expression r made up of the single symbol alphabet Σ = {a} defines language L(r) = a*

I have to write an algorithm in Haskell. The program takes a regular expression r over the unary alphabet Σ = {a} and checks whether r defines the language L(r) = a^* (Kleene star). I am looking for any kind of tip. I know that I can translate any regular expression to the corresponding NFA, then to a DFA, minimize the DFA, and compare, but is there any other way to achieve my goal? I am asking because the exercise clearly says the alphabet is unary, so I suppose I have to use this information somehow to make the exercise much easier.
This is what my regular expression data type looks like:
data Reg = Epsilon        -- epsilon regex
         | Literal Char   -- a
         | Or Reg Reg     -- (a|a)
         | Then Reg Reg   -- (aa)
         | Star Reg       -- (a)*
         deriving Eq
Yes, there is another way. Every DFA for a regular language over a single-letter alphabet is a "lollipop" [1]: an initial chain of nodes, each pointing to the next (some of which are marked as final and some not), followed by a loop of nodes (again, some of which are marked as final and some not). So instead of doing a full compilation pass, you can go directly to a DFA, where you simply store two [Bool] saying which nodes in the lead-in and in the loop are marked final (or perhaps two [Integer] giving the indices and two Integer giving the lengths may be easier, depending on your implementation plans). You don't need to ensure the compiled version is minimal; it's easy enough to check that all the Bools are True. The base cases for Epsilon and Literal are pretty straightforward, and with a bit of work and thought you should be able to work out how to implement the combining functions for "or", "then", and "star" (hint: think about gcd's and stuff).
[1] You should try to prove this before you begin implementing, so you can be sure you believe me.
Edit 1: Hm, while on my afternoon walk today, I realized the idea I had in mind for "then" (and therefore "star") doesn't work. I'm not giving up on this idea (and deleting this answer) yet, but those operations may be trickier than I gave them credit for at first. This approach definitely isn't for the faint of heart!
Edit 2: Okay, I believe now that I have access to pencil and paper I've worked out how to do concatenation and iteration. Iteration is actually easier than concatenation. I'll give a hint for each -- though I have no idea whether the hint is a good one or not!
Suppose your two lollipops have a length m lead-in and a length n loop for the first one, and m'/n' for the second one. Then:
For iteration of the first lollipop, there's a fairly mechanical/simple way to produce a lollipop with a 2*m + 2*n-long lead-in and n-long loop.
For concatenation, you can produce a lollipop with m + n + m' + lcm(n, n')-long lead-in and n-long loop (yes, that short!).
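As a cross-check while developing the lollipop version, it may help to have a brute-force oracle. The sketch below is not the lollipop construction: it just uses the fact that a language over {a} is determined by the set of word lengths it contains, and computes those lengths up to a bound. It is written in Python with tuples standing in for the Reg constructors; the names lengths and looks_like_a_star and the default bound are made up for this illustration. Agreement up to a bound is only evidence, not a proof that L(r) = a*.

```python
# Hypothetical tuple encoding of the Reg type from the question:
#   ("eps",), ("lit",), ("or", r, s), ("then", r, s), ("star", r)

def lengths(r, bound):
    """Return the set of n <= bound such that a^n is in L(r)."""
    tag = r[0]
    if tag == "eps":
        return {0}
    if tag == "lit":
        return {1} if bound >= 1 else set()
    if tag == "or":
        return lengths(r[1], bound) | lengths(r[2], bound)
    if tag == "then":
        left, right = lengths(r[1], bound), lengths(r[2], bound)
        return {i + j for i in left for j in right if i + j <= bound}
    if tag == "star":
        base = lengths(r[1], bound) - {0}
        reach, frontier = {0}, {0}
        # Keep adding one more repetition until nothing new appears.
        while frontier:
            frontier = {i + j for i in frontier for j in base
                        if i + j <= bound} - reach
            reach |= frontier
        return reach
    raise ValueError(f"unknown constructor: {tag}")


def looks_like_a_star(r, bound=50):
    """Bounded check only: every length 0..bound is accepted."""
    return lengths(r, bound) == set(range(bound + 1))


A = ("lit",)
EVENS = ("star", ("then", A, A))          # (aa)*
print(looks_like_a_star(("star", A)))     # a*
print(looks_like_a_star(EVENS))           # (aa)* misses odd lengths
```

Comparing the output of a lollipop implementation against lengths on small expressions is a cheap way to catch mistakes in the "then" and "star" cases.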

a fully backtracking star operator in parsec

I am trying to build a real, fully backtracking + combinator on parsec.
That is, one that receives a parser, and tries to find one or more instances of the given combinator.
That would mean that parse_one_or_more foolish_a would be able to match nine chars a in a row, for example. (see code below for context)
As far as I understand it, the reason why it does not currently do so is that, after foolish_a finds a match (the first two a's), the many1 (try p1) never gives up on that match.
Is this possible in parsec? Pretty sure it would be very slow (this simple example is already exponential!) but I wonder if it can be done. It is for a programming challenge that runs without a time limit; I would not want to use it in the wild.
import Text.Parsec
import Text.Parsec.String (Parser)
parse_one_or_more :: Parser String -> Parser String
parse_one_or_more p1 = (many1 (try p1)) >> eof >> return "bababa"
foolish_a = parse_one_or_more (try (string "aa") <|> string "aaa")
good_a = parse_one_or_more (string "aaa")
-- |
-- >>> parse foolish_a "unused triplet" "aaaaaaaaa"
-- Left...
-- ...
-- >>> parse good_a "unused" "aaaaaaaaa"
-- Right...
You are correct - Parsec-like libraries can't do this in a way that works for any input. Parsec's implementation of (<|>) is left-biased and commits to the left parser if it matches, regardless of anything that may happen later in the grammar. When the two arguments of (<|>) overlap, such as in (try (string "aa") <|> string "aaa"), there is no way to cause parsec to backtrack into there and try the right side match if the left side succeeded.
If you want to do this, you will need a different library, one that doesn't have a (<|>) operator that's left-biased and commits.
Yes, since Parsec produces a recursive-descent parser, you would rather want to make an unambiguous guess first to minimize the need for backtracking. So if your first guess is "aa" and that happens to overlap with a later guess "aaa", backtracking is necessary. Sometimes a grammar is LL(k) for some k > 1 and you want to use backtracking out of pure necessity.
The only time I use try is when I know that the backtracking is quite limited (k is low). For example, I might have an operator ? that overlaps with another operator ?//; I want to parse ? first because of precedence rules, but I want the parser to fail in case it's followed by // so that it can eventually reach the correct parse. Here k = 2, so the impact is quite low, but also I don't need an operator here that lets me backtrack arbitrarily.
If you want a parser combinator library that lets you fully backtrack all the time, this may come at a severe cost to performance. You could look into Text.ParserCombinators.ReadP's +++ symmetric choice operator, which keeps the results of both alternatives. This is an example of what Carl suggested: a <|> that is not left-biased and does not commit.
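To make the idea of symmetric choice concrete, here is a minimal sketch, in Python rather than Haskell, of a parser that keeps every alternative alive instead of committing. It is neither Parsec nor ReadP: a parser here is just a function from an input string and a start position to the set of positions where a match could end, and the names lit, alt, many1 and full_match are invented for this example.

```python
def lit(text):
    """Parser for a literal string; yields the end position if it matches."""
    def parse(s, i):
        return {i + len(text)} if s.startswith(text, i) else set()
    return parse


def alt(p, q):
    """Symmetric choice: keep the results of both branches."""
    def parse(s, i):
        return p(s, i) | q(s, i)
    return parse


def many1(p):
    """One or more repetitions of p, tracking every reachable end position."""
    def parse(s, i):
        seen = set()
        frontier = p(s, i)
        while frontier:
            seen |= frontier
            frontier = {k for j in frontier for k in p(s, j)} - seen
        return seen
    return parse


def full_match(p, s):
    """True if some way of running p consumes the whole input."""
    return len(s) in p(s, 0)


foolish_a = many1(alt(lit("aa"), lit("aaa")))
print(full_match(foolish_a, "a" * 9))   # nine a's can be split as 3+3+3 or 2+2+2+3
print(full_match(foolish_a, "a"))       # a single a cannot be matched
```

Because the sketch tracks sets of end positions rather than individual parse paths, it avoids the exponential blow-up for pure recognition; a version that also had to build every parse result would not get that for free.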

Minimize specific regular expression

I have the following regex
(1*)+(1*0)(ε+11*0)*(11*)
If minimized, it should be
(1+01)*
But I cannot understand the minimization, could somebody explain it?
First off, for other people reading this: these are traditional formal computer science regular expressions, not the regex languages used in most programming languages. In programming language regex terms, the two expressions would be 1*|1*0(|11*0)*11* and (1|01)*.
Now, to the problem:
The initial expression has 1* at the front and back of the expression in both of the top-level alternatives. So we can rewrite it first as:
(1*)(ε+0(ε+11*0)*1)(1*)
Now, in general, (ε+x)* for any regular expression x is just x*. So that's:
(1*)(ε+0(11*0)*1)(1*)
Now, also x* is the same as ε+xx*, so we can expand that inner bit out:
(1*)(ε+0(ε+(11*0)(11*0)*)1)(1*)
And now apply a(x+y)b => axb+ayb:
(1*)(ε+01+0(11*0)(11*0)*1)(1*)
Now, apply (xy)*x => x(yx)*:
(1*)(ε+01+0(11*0)1(1*01)*)(1*)
And rearrange the parens:
(1*)(ε+01+(01)(1*)(01)(1*(01))*)(1*)
And factor out a common prefix:
(1*)(ε+(01)(ε+(1*(01))(1*(01))*))(1*)
Using an expansion we had before, but in reverse:
(1*)(ε+(01)(1*(01))*)(1*)
Now bring that left 1* in:
((1*)+(1*)(01)(1*(01))*)(1*)
Since 1* is the same as ε+1*, we can write this as:
((ε+1*)+(1*)(01)(1*(01))*)(1*)
Rearranging alternatives:
(1*+(ε+(1*)(01)(1*(01))*))(1*)
Applying that ε+xx* <=> x* equivalence again:
(1*+(1*(01))*)(1*)
Now, x*+(x*y)* on its own is not equivalent to (x+y)*, but combined with the trailing x* it is: (x*+(x*y)*)x* => x*+(x*y)*x* => (x*y)*x* => (x+y)*, because (x*y)* contains ε and so (x*y)*x* already covers x*. Applying that here, we're done:
(1+01)*
Okay, trying to work out a simpler derivation. First off, I need you to accept these identities, where x, y, a, and b are arbitrary regular expressions:
(ab)*a <=> a(ba)*
xa+ya <=> (x+y)a
ε+xx* <=> x*
a*(ba*)* <=> (a+b)*
As an aside, the last identity is often useful in constructing regexes that efficiently match grammars like strings with backslash escapes, where a naive approach might be ([^\\"]|\\.)*, but it's much more efficient in most regex matching libraries to use [^\\"]*(\\.[^\\"]*)*. Anyway, to the problem:
(1*)+(1*0)(ε+11*0)*(11*)
Well, (ε+x)* is still the same as x*, so let's do that first:
(1*)+(1*0)(11*0)*(11*)
Now apply identity 2 and pull the 1* out to the right:
(ε+(1*0)(11*0)*1)(1*)
Now, identity 1:
(ε+(1*0)1(1*01)*)(1*)
That's now ready for identity 3:
(1*01)*(1*)
Identity 1 again gives us:
1*((01)1*)*
And now identity 4 gives us the desired result:
(1+01)*
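If you want some reassurance that a derivation like this has not gone wrong, a brute-force comparison is easy to write. The sketch below uses Python's re module with the programming-language spellings of the two expressions given earlier in this answer; the helper name first_disagreement and the length bound are choices made for this example. Agreement on all strings up to a fixed length is evidence, not a proof of equivalence.

```python
import re
from itertools import product


def first_disagreement(pattern_a, pattern_b, alphabet, max_len):
    """Return the first string (shortest first) on which the two
    patterns disagree under full matching, or None if there is none."""
    a, b = re.compile(pattern_a), re.compile(pattern_b)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(a.fullmatch(s)) != bool(b.fullmatch(s)):
                return s
    return None


ORIGINAL = r"1*|1*0(|11*0)*11*"   # (1*)+(1*0)(ε+11*0)*(11*)
MINIMIZED = r"(1|01)*"            # (1+01)*

print(first_disagreement(ORIGINAL, MINIMIZED, "01", 10))
```

The same helper can be pointed at any intermediate line of the derivation to find the step where two expressions stop agreeing.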

will a regular expression applied in reverse produce the same match?

Suppose we have some text and a regular expression that matches it. Question: if I apply the same expression to text backwards (starting from the last letter to the first one), will it still match?
regex -----> text
xereg --?--> txet
In practice that seems to work, the question is rather about what the theory says about the general case.
Not if you use the Kleene star - if you reverse the regex, you will end up with an invalid regex or one that matches a different pattern:
ab* -> *ba (invalid syntax)
a*b -> b*a (the first one matches aaab but not abbb, while the second one matches bbba but not baaa)
On the other hand, I'm quite sure that it would be possible to design an algorithm that, given a regex, produces a regex that matches the reverse strings. The following recursive algorithm should work (if r is a regex, rev(r) means the regex that matches the reversed strings):
If r is a single symbol x, then rev(r) = x.
If r is a union A|B, then rev(r) = rev(A)|rev(B).
If r is a concatenation AB, then rev(r) = rev(B)rev(A).
If r is a Kleene star A*, then rev(r) = rev(A)*.
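The four rules above can be sketched directly as a recursive function. This is an illustration in Python with a small tuple-based syntax tree; the encoding and the names rev and render are invented for the example, and it covers only the four formal operators listed, not the extended features of practical regex engines.

```python
import re
from itertools import product

# Hypothetical encoding: ("sym", x), ("alt", A, B), ("cat", A, B), ("star", A)


def rev(r):
    """Build a regex tree matching exactly the reversed strings of r."""
    tag = r[0]
    if tag == "sym":
        return r
    if tag == "alt":
        return ("alt", rev(r[1]), rev(r[2]))
    if tag == "cat":
        return ("cat", rev(r[2]), rev(r[1]))   # swap the order of the parts
    if tag == "star":
        return ("star", rev(r[1]))
    raise ValueError(f"unknown node: {tag}")


def render(r):
    """Turn a tree into a pattern string for Python's re module."""
    tag = r[0]
    if tag == "sym":
        return re.escape(r[1])
    if tag == "alt":
        return "(?:" + render(r[1]) + "|" + render(r[2]) + ")"
    if tag == "cat":
        return render(r[1]) + render(r[2])
    if tag == "star":
        return "(?:" + render(r[1]) + ")*"
    raise ValueError(f"unknown node: {tag}")


a_star_b = ("cat", ("star", ("sym", "a")), ("sym", "b"))   # a*b
print(render(a_star_b), render(rev(a_star_b)))

# Spot check: s matches r exactly when reversed s matches rev(r).
words = ["".join(w) for n in range(6) for w in product("ab", repeat=n)]
print(all(bool(re.fullmatch(render(a_star_b), w))
          == bool(re.fullmatch(render(rev(a_star_b)), w[::-1]))
          for w in words))
```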
The general case is that it will not.
for example, the regex
ab
will match
ab
but not
ba
How come you think that the general case is that it should?
There are regexes that match the reversed string as well, like
(a|b)*
Will match
ab
and
ba
The cases where regex and xeger would both produce the same match on a text are:
regex is a simple (atomic) pattern that is a palindrome. e.g., abcba
regex is composed of several atomic patterns using commutative functions (e.g., or) and you do not reverse those individual atomic patterns. If you do, then they should be palindromes too. e.g., adef|bd881|cdavr if you do not reverse the atomic components or aba|defed if you do reverse the atomic components.
In general I would definitely say "no", but it really just depends on the complexity of the expressions.
Because not only would one need to reverse any simple (sub-)expressions, but if applicable one would also need to take into account more complex stuff which is not so easily "reversed" in just any regex: what about repetition operators, laziness vs. greediness, or back-references and look-arounds, quantifiers and modifiers… – items explained in e.g. this tutorial?
Perhaps if you have more specific examples or issues regarding such a "reversal", a more appropriate answer can be thought of.

Proving that a language is regular by giving a regular expression

I am stumped by this practice problem (not for marks):
{w is an element of {a,b}* : the number of a's is even and the number of b's is even }
I can't seem to figure this one out.
In this case 0 is considered even.
A few acceptable strings: {}, {aa}, {bb}, {aabb}, {abab}, {bbaa}, {babaabba}, and so on
I've done similar examples where the a's must be a prefix, where the answer would be:
(aa)*(bb)*
but in this case they can be in any order.
Kleene stars (*), unions (U), intersects (&), and concatenation may be used.
Edit: Also have trouble with this one
{w is an element of {0,1}* : w = 1^r 0 1^s 0 for some r,s >= 1}
This is kind of ugly, but it should work:
ε U ( (aa) U (bb) U ((ab) U (ba)) ((aa) U (bb))* ((ab) U (ba)) )*
For the second one:
11*011*0
Generally I would use a+ instead of aa* here.
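Expressions like these are easy to get subtly wrong, so it can be worth checking a candidate against a direct definition of the language. The sketch below does that in Python by brute force over all short strings. The first pattern is the commonly cited form for "even a's and even b's", written in Python regex syntax as an assumption for this example; the helper names and the length bounds are also made up here. Passing a bounded check is evidence, not a proof.

```python
import re
from itertools import product

EVEN_EVEN = re.compile(r"(?:aa|bb|(?:ab|ba)(?:aa|bb)*(?:ab|ba))*")
ONE_ZERO_ONE_ZERO = re.compile(r"11*011*0")


def mismatches(pattern, predicate, alphabet, max_len):
    """List every string up to max_len where the regex and the
    direct definition of the language disagree."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(pattern.fullmatch(s)) != predicate(s):
                bad.append(s)
    return bad


def even_even(s):
    """Even number of a's and even number of b's (0 counts as even)."""
    return s.count("a") % 2 == 0 and s.count("b") % 2 == 0


def one_zero_one_zero(s):
    """s = 1^r 0 1^s 0 with r, s >= 1, checked without regexes."""
    parts = s.split("0")
    return len(parts) == 3 and parts[0] != "" and parts[1] != "" and parts[2] == ""


print(mismatches(EVEN_EVEN, even_even, "ab", 10))
print(mismatches(ONE_ZERO_ONE_ZERO, one_zero_one_zero, "01", 8))
```

Strings such as abaaba, where an odd pair is separated from its partner by aa or bb, are the ones that tend to expose a missing inner star.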
Edit: Undeleted re: the comments in NullUserException's answer.
1) I personally think this one is easier to conceptualize if you first construct a DFA that can accept the strings. I haven't written it down, but off the top of my head I think you can do this with 4 states and one accept state. From there you can create an equivalent regex by removing states one at a time using an algorithm such as this one. This is possible because DFAs and regexes are provably equivalent.
2) Consider the fact that the Kleene star only applies to the nearest regular expression. Hence, if you have two individual ungrouped atoms (an atom itself is a regex!), it only applies to the second one (as in, ab* would match a single a and then any number - including 0 - b's). You can use this to your advantage in a case where you want something to exist, but you're not sure of how many there are.