I'd like to know how you can tell if some regular expression is the complement of another regular expression. Let's say I have two regular expressions r_1 and r_2. I can certainly create a DFA out of each of them and then check to make sure that L(r_1) != L(r_2). But that doesn't necessarily mean that r_1 is the complement of r_2 and vice versa. Also, it seems that many different regular expressions could denote the complement of a single regular expression.
So I'm wondering how, given two regular expressions, I can determine if one is the complement of another. This is also new to me, so perhaps I'm missing something that should be apparent.
Edit: I should point out that I am not simply trying to find the complement of a regular expression. I am given two regular expressions, and I am to determine if they are the complement of each other.
Here is one approach that is conceptually simple, if not terribly efficient (not that there is necessarily a more efficient solution...):
Construct NFAs M and N for regular expressions r and s, respectively. You can do this using the construction introduced in the proof that regular expressions and finite automata describe the same languages.
Determinize M and N to get M' and N'. We might as well go ahead and minimize them at this point... giving M'' and N''.
Construct a machine C using the Cartesian product machine construction on machines M'' and N''. Acceptance will be determined by the symmetric difference, or XOR, criterion: accepting states in the product machine correspond to pairs of states (m, n) where exactly one of the two states is accepting in its automaton.
Minimize C and call the result C'.
If L(r) = L(s)', then the initial state of C' will be accepting and all transitions originating in the initial state will also terminate in the initial state. If this is the case, r and s are complements; otherwise, they are not.
Why should this work? The symmetric difference of two sets is the set of everything in exactly one of them (not both, not neither). If L(s) and L(r) are complementary, then it is not difficult to see that the symmetric difference includes all strings (by definition, the complement of a set contains everything not in the set). Suppose now there were non-complementary sets whose symmetric difference were the universe of all strings. The sets are not complementary, so either (1) their intersection is non-empty or (2) their union is not the universe of all strings. In case (1), the symmetric difference will not include the shared element; in case (2), the symmetric difference will not include the missing strings. So only complementary sets have a symmetric difference equal to the universe of all strings; and a minimal DFA for the set of all strings will always have an accepting initial state with self-loops.
In short, for complement: L(r_1) == !L(r_2).
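To make the construction concrete, here is a minimal Python sketch (my own, not part of the original answer). Instead of minimizing C, it checks the equivalent condition that every reachable state of the XOR product machine is accepting. It assumes both inputs are complete DFAs over the same alphabet, encoded as hypothetical (start, accepting_states, transition_table) tuples:

from collections import deque

def are_complements(d1, d2, alphabet):
    # d1, d2: (start, accepting, trans) where trans[state][symbol] -> state.
    # Assumes complete DFAs over the same alphabet.
    (s1, acc1, t1), (s2, acc2, t2) = d1, d2
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        # XOR acceptance: exactly one component accepting. Complements
        # require every reachable product state to be accepting.
        if (p in acc1) == (q in acc2):
            return False
        for a in alphabet:
            nxt = (t1[p][a], t2[q][a])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Example: over {a, b}, "even number of a's" and "odd number of a's"
# are complements; both DFAs share the same transition table.
t = {0: {'a': 1, 'b': 0}, 1: {'a': 0, 'b': 1}}
print(are_complements((0, {0}, t), (0, {1}, t), 'ab'))  # True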
Automaton 1 recognizes strings with at least two a's.
Regular expression: R1 = b*ab*a(a+b)*
Automaton 2 recognizes strings with at least two b's.
Regular expression: R2 = a*ba*b(a+b)*
Is the regular expression obtained from A3 = A1 U A2 equivalent to R3 = R1 + R2, or is it not?
R3 = b*ab*a(a+b)* + a*ba*b(a+b)*
There is neither "one" automaton nor "one" regular expression for any language; generally there are many reasonable ones and many more (maybe infinitely many) unreasonable ones. In this sense, your question is not entirely well-posed: the regular expression corresponding to the union of two DFAs may or may not look like the regular expressions for the original DFAs, +'ed together.
So, if you mean, can they look the same, the answer is likely yes. If you mean, must they look the same, the answer is likely no. If you instead want to fix the algorithms for constructing the union machine and getting the regular expression, maybe we could show that a fixed method of doing it always gives the same answer.
In your specific case, applying the Cartesian product machine construction to get a DFA for the union of the original DFAs and then applying the construction from the proof of equivalence between DFAs and REs, we can see that the structure of the RE obtained by +'ing the original REs can't be achieved starting from a DFA; you'd have needed an NFA to get a + between the LHS and RHS, but DFAs can only + among individual symbols, not subexpressions. Of course, it might be possible that the RE can be algebraically manipulated to derive the target RE, but that isn't exactly the same.
All of the above holds for the question of equality of REs. However, you asked about equivalence. Almost always, we say two REs are equivalent if they generate the same language. If this is what you meant, then yes, +'ing the two REs will give an RE equivalent to the one obtained by constructing a union machine and deriving an RE from that. The REs will not look the same but will generate the same language, just as (ab + e)(abab)* and (ab)* generate the same language despite looking a bit different.
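If equivalence in this language sense is all you need, it can be tested mechanically with two subset checks. Here is a sketch using the greenery Python library shown in another answer on this page (treating its parse, &, everythingbut, and empty operations as given):

from greenery.lego import parse

def same_language(re1, re2):
    # L1 == L2 iff (L1 - L2) and (L2 - L1) are both empty.
    a, b = parse(re1), parse(re2)
    return (a & b.everythingbut()).empty() and (b & a.everythingbut()).empty()

# The example from above, with the empty string e written via '?':
print(same_language("(ab)?(abab)*", "(ab)*"))  # True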
Regular expressions are not like finite state parsers and it's usually a mistake to try to incorporate them into complex parsing scenarios.
But also, they are marvelous tools for specific problems. After reading your descriptive requirements, there is a simple regular expression that accomplishes it, but in a way you might not expect. Your requirements:
strings with at least two a's
strings with at least two b's
the union of the two, i.e., strings with at least two a's or two b's
([ab]).*?\1
This expression opens a capture group to capture either a or b. Then it allows zero or more 'any characters' followed by whatever was captured in the capture group (\1).
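A quick demonstration with Python's standard re module (the test strings are made up):

import re

# Capture an 'a' or 'b', then require that same letter to appear again later.
pattern = re.compile(r"([ab]).*?\1")

for s in ["cacba", "xbzb", "ab", "a"]:
    print(s, "->", bool(pattern.search(s)))
# cacba -> True   (two a's)
# xbzb -> True    (two b's)
# ab -> False
# a -> False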
I have a large collection of regular expressions that, when matched, call a particular HTTP handler. Some of the older regexes are unreachable (e.g. a.c* ⊃ abc*) and I'd like to prune them.
Is there a library that, given two regexes, will tell me if the second is a subset of the first?
I wasn't sure this was decidable at first (it smelled like the halting problem by a different name). But it turns out it's decidable.
Trying to find the complexity of this problem led me to this paper.
The formal definition of the problem can be found within; it is generally called the inclusion problem:
The inclusion problem for R is to test, for two given expressions r, r′ ∈ R, whether r ⊆ r′.
That paper has some great information (summary: all but the simplest expressions are fairly complex); however, searching for information on the inclusion problem leads one directly back to StackOverflow. That answer already had a link to a paper describing a passable polynomial-time algorithm which should cover a lot of common cases.
I found a python regex library that provides set operations.
http://github.com/ferno/greenery
The proof says Sub ⊆ Sup ⇔ Sub ∩ ¬Sup is {}. I can implement this with the python library:
import sys
from greenery.lego import parse

subregex = parse(sys.argv[1])
supregex = parse(sys.argv[2])
s = subregex & supregex.everythingbut()
if s.empty():
    print("%s is a subset of %s" % (subregex, supregex))
else:
    print("%s is not a subset of %s, it also matches %s" % (subregex, supregex, s))
examples:
subset.py abcd.* ab.*
abcd.* is a subset of ab.*
subset.py a[bcd]f* a[cde]f*
a[bcd]f* is not a subset of a[cde]f*, it also matches abf*
The library may not be robust because as mentioned in the other answers you need to use the minimal DFA in order for this to work. I'm not sure ferno's library makes (or can make) that guarantee.
As an aside: playing with the library to calculate inverse or simplify regexes is lots of fun.
a(b|.).* simplifies to a.+, which is pretty minimal.
The inverse of abf* is ([^a]|a([^b]|bf*[^f])).*|a?. Try to come up with that on your own!
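Both are easy to reproduce (assuming greenery's reduce() is the simplification entry point):

from greenery.lego import parse

print(parse("a(b|.).*").reduce())     # should print something like a.+
print(parse("abf*").everythingbut())  # the inverse quoted above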
If the regular expressions use "advanced features" of typical procedural matchers (like those in Perl, Java, Python, Ruby, etc.) that allow accepting languages that aren't regular, then you are out of luck. The problem is in general undecidable. E.g. the problem of whether one pushdown automaton recognizes the same context free (CF) language as another is undecidable. Extended regular expressions can describe CF languages.
On the other hand, if the regular expressions are "true" in the theoretical sense, consisting only of concatenation, alternation, and Kleene star over strings with a finite alphabet, plus the usual syntactic sugar on these (character classes, +, ?, etc), then there is a simple polynomial time algorithm.
I can't give you libraries, but here is the approach (a sketch of the key steps follows the list):
For each pair of regexes r and s, for languages L(r) and L(s):
1. Find the corresponding deterministic finite automata M(r) and M(s).
2. Compute the cross-product machine M(r × s) and assign accepting states so that it computes L(r) - L(s).
3. Use a DFS or BFS of the M(r × s) transition table to see if any accepting state can be reached from the start state.
4. If not, then L(r) - L(s) is empty, so L(r) is a subset of L(s) and you can eliminate r, because s already matches everything r does.
5. Reassign accepting states so that M(r × s) computes L(s) - L(r).
6. Repeat the search to see if it's possible to eliminate s.
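Here is a minimal Python sketch of steps 2 and 3, assuming the regexes have already been converted to complete DFAs over a shared alphabet (the tuple encoding is hypothetical):

from collections import deque

def is_subset(d_r, d_s, alphabet):
    # True iff L(d_r) is a subset of L(d_s). Each DFA is a hypothetical
    # (start, accepting, trans) tuple with trans[state][symbol] -> state.
    (sr, accr, tr), (ss, accs, ts) = d_r, d_s
    seen = {(sr, ss)}
    queue = deque([(sr, ss)])
    while queue:
        p, q = queue.popleft()
        # A product state accepting in r but rejecting in s witnesses a
        # string in L(r) - L(s), so L(r) is not contained in L(s).
        if p in accr and q not in accs:
            return False
        for a in alphabet:
            nxt = (tr[p][a], ts[q][a])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True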
Converting a regex to a DFA generally uses Thompson's construction to get a non-deterministic automaton. This is converted to a DFA using the Subset Construction. The cross-product machine is another standard algorithm.
This was all worked out in the 1960s and is now part of any good undergrad computer science theory course. The gold standard for the topic is Hopcroft and Ullman, Introduction to Automata Theory, Languages, and Computation.
There is an answer in the mathematics section: https://math.stackexchange.com/questions/283838/is-one-regular-language-subset-of-another.
Basic idea:
Compute the minimal DFA for both languages.
Calculate the cross product of both automata M1 and M2, which means that each state of the product consists of a pair [m1, m2], where m1 is a state from M1 and m2 from M2, for all possible combinations.
The new transition function F12 is: F12([m1, m2], x) => [F1(m1, x), F2(m2, x)]. This means that if there is a transition in M1 from state m1 to m1' while reading x, and in M2 from state m2 to m2' while reading x, then there is a transition in M12 from [m1, m2] to [m1', m2'] while reading x.
At the end, you look at the reachable states:
If there is a reachable pair [accepting, rejecting], then M1 is not a subset of M2.
If there is a reachable pair [rejecting, accepting], then M2 is not a subset of M1.
It is beneficial to compute only the new transitions and resulting states as you go, omitting all non-reachable states from the beginning.
How would I convert the following regular expression to a finite automaton?
(abUb)(bUaaa)b*b((a*b)*Ub)*
note: U means union in this case
There are five top-level concatenated components of this regex. According to the algorithm recoverable from a part of Kleene's theorem, you can make NFA-Lambdas for these, then form the concatenation by connecting final states of one to initial states of the next.
When you see a union, that means you make two machines and combine them by making a new start state with two lambda transitions.
Kleene closure is a little more involved, but basically make the machine for the thing being repeated, then transform it by adding a new accepting start state and a loop to it from the old final states.
The base case is the machine for a single letter, which is two states, initial and final, with the appropriately labelled transition.
Work recursively from the simplest machines (innermost subexpressions) up to the whole thing, combining as necessary. Simplify the result as much as you like, possibly converting to a minimal DFA.
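Here is a compact Python sketch of those combinators (all names are my own, not a standard library). Each builder returns a hypothetical (start, final, transitions) fragment with a single final state, and lambda moves are encoded with the symbol None:

from collections import defaultdict
import itertools

_ids = itertools.count()

def _new():
    return next(_ids)

def _merged(*tables):
    t = defaultdict(set)
    for table in tables:
        for key, targets in table.items():
            t[key] |= targets
    return t

def literal(ch):
    # Base case: two states with one labelled transition between them.
    s, f = _new(), _new()
    t = defaultdict(set)
    t[(s, ch)].add(f)
    return (s, f, t)

def concat(m1, m2):
    # Connect the final state of m1 to the start state of m2 with a lambda.
    (s1, f1, t1), (s2, f2, t2) = m1, m2
    t = _merged(t1, t2)
    t[(f1, None)].add(s2)
    return (s1, f2, t)

def union(m1, m2):
    # New start state with lambda transitions into both machines.
    (s1, f1, t1), (s2, f2, t2) = m1, m2
    s, f = _new(), _new()
    t = _merged(t1, t2)
    t[(s, None)] |= {s1, s2}
    t[(f1, None)].add(f)
    t[(f2, None)].add(f)
    return (s, f, t)

def star(m):
    # New accepting start state with a lambda loop back from the old final state.
    s0, f0, t0 = m
    s = _new()
    t = _merged(t0)
    t[(s, None)].add(s0)
    t[(f0, None)].add(s)
    return (s, s, t)

def string(w):
    m = literal(w[0])
    for ch in w[1:]:
        m = concat(m, literal(ch))
    return m

# (ab U b)(b U aaa)b*b((a*b)* U b)*
nfa = concat(concat(concat(
    union(string("ab"), literal("b")),
    union(literal("b"), string("aaa"))),
    concat(star(literal("b")), literal("b"))),
    star(union(star(concat(star(literal("a")), literal("b"))), literal("b"))))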
There is an application called Automatic Java Code Generator for Regular Expressions and Finite Automata that automatically generates the NFA, DFA (including the transition table), and Java code for a given regular expression or finite automaton. It can be downloaded from this link: www.s-solutions.info. You can always check whether your solution is correct or not.
I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?
There's no easy way.
As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.
Given any pair of NFAs, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).
The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFA. Alternatively, if you've already shown how to compute the complement of a NFA, then you can (DeMorgan's law style) get the intersection by complement(union(complement(A),complement(B))).
Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:
Some classes of regular languages can only be described by deterministic finite automata whose size grows exponentially in the size of the shortest equivalent regular expressions. The standard example here is the languages L_k consisting of all strings over the alphabet {a,b} whose kth-last letter equals a.
By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:
if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')
and X ->a Y (in the other NFA)
then (A,X) ->a (B,Y) in the intersection
(C,Z) is final iff C is final in the one NFA and Z is final in the other.
To start the process off, you start in the pair of start states for the two NFAs e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. You'd store the fact that you expanded a state's arcs (e.g. in a hash table) and end up exploring all the states reachable from the start.
If you allow epsilon transitions (that don't output a letter), that's fine:
if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.
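Putting the pieces together, here is a minimal Python sketch of the on-the-fly intersection-emptiness check (my own encoding, assuming epsilon transitions have already been eliminated; each NFA is a hypothetical (start, finals, trans) tuple with trans[(state, symbol)] -> set of states):

from collections import deque

def intersection_nonempty(n1, n2):
    # BFS over reachable state pairs; stop at the first final pair.
    (s1, f1, t1), (s2, f2, t2) = n1, n2
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:
        p, q = queue.popleft()
        if p in f1 and q in f2:
            return True            # some string matches both NFAs
        # Pair up arcs leaving p and q on the same symbol.
        for (state, sym), targets in t1.items():
            if state != p:
                continue
            for r in targets:
                for s in t2.get((q, sym), ()):
                    if (r, s) not in seen:
                        seen.add((r, s))
                        queue.append((r, s))
    return False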
This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).
This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
In full generality (with the non-regular features of real-world regex engines), the problem you describe is undecidable.
In practice, if you have a manageable number of regular expressions that use a limited subset of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.
Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.
Is there a way to find out if two arbitrary regular expressions are equivalent? Looks like a complex problem to me, but there might be some DFA simplification mechanism or something?
To test equivalence you can compute the minimal DFAs for the expressions and compare them.
Testability of equality is one of the classical properties of regular expressions. (N.B. This doesn't hold if you're really talking about Perl regular expressions or some other technically nonregular superlanguage.)
Turn your REs into generalised finite automata A and B, then construct a new automaton A-B such that the accepting states of A have null transitions to the start states of B, and the accepting states of B are inverted. This gives you an automaton that accepts all those strings accepted by A, except for all those accepted by B.
Do the same for B-A, and reduce both to pure FAs. If an FA has no accepting states accessible from a start state then it accepts the empty language. If you can show that both A-B and B-A are empty, you've shown that A = B.
Edit Heh, I can't believe no one noticed the gigantic error there -- an intentional one, of course :-p
The automaton A-B as described will accept those strings whose first half is accepted by A and whose second half is not accepted by B. Building the desired A-B is a slightly trickier process. I can't think of it off the top of my head, but I do know it's well-defined (and likely involves creating states to represent the products of accepting states in A and non-accepting states in B).
This really depends on what you mean by regular expressions. As the other posters pointed out, reducing both expressions to their minimal DFA should work, but it only works for the pure regular expressions.
Some of the constructs used in real-world regex libraries (backreferences in particular) give them the power to express languages that aren't regular, so the DFA algorithm won't work for them. For example, the regex ([a-z]*) \1 matches a doubled occurrence of the same word separated by a space ("a a" and "b b" but not "b a" nor "a b"). This cannot be recognized by a finite automaton at all.
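A quick check with Python's re module confirms the behaviour described:

import re

doubled = re.compile(r"([a-z]*) \1")
print(bool(doubled.fullmatch("a a")))  # True: the same word twice
print(bool(doubled.fullmatch("a b")))  # False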
These two Perlmonks threads discuss this question (specifically, read blokhead's responses):
Comparative satisfiability of regexps
Testing regex equivalence