Regular Expression Comparsion - regex

Is there any solution that can compare two regular expression for Subsumption, Partially overlapping, disjoint i.e. i want to know how to compare two regular expression. Secondly can i combine two regular expression if regex 1 is subsumpted by regex 2.

Say you have two expressions A and B and want to see if A matches a subset of what B does.
You need to compute the minimized DFA of B and then combine the two expressions to make a union of A and B and then compute the minimized DFA of that new expression. If those two DFAs are equal then A matches a subset of B.
In essence, you can't properly check this without going through the process of constructing a minimized automata. It will however, give a verifiable true answer to the question.
Combining the two expressions can be done by making a new expression like (A)|(B), perhaps substituting the paranthesis for non-capturing varieties if your engine supports that.
If you decide to go the whole way to do the algorithms, I've written a series of articles on the process:
http://binarysculpting.com/2012/02/11/regular-expressions-how-do-they-really-work-automata-theory-for-programmers-part-1/
http://binarysculpting.com/2012/02/15/converting-dfa-to-nfa-by-subset-construction-regular-expressions-part-2/
http://binarysculpting.com/2012/03/21/dfa-state-minimization/
To compare two automatas you could just check that the states and transitions are the same. They should be exactly equal.

Related

Automata - Regular Expression (Union Case)

Automata 1) Recognizes strings with at least 2 a
Regular Expression = b*ab*a(a+b)*
Automata 2) Recognizes strings with at least 2 b
Regular Expression = a*ba*b(a+b)*
The regular expression obtained from A3 = A1 U A2 is equivalent to R3 = R1 + R2? Or it's not?
R3 = b*ab*a(a+b)* + a*ba*b(a+b)*
There is neither "one" automaton nor "one" regular expression for any language; generally there many reasonable ones and many more (maybe infinitely many) unreasonable ones. In this sense, your question is not entirely well-posed: the regular expression corresponding to the union of two DFAs may or may not look like regular expressions for the original DFAs, +'ed together.
So, if you mean, can they look the same, the answer is likely yes. If you mean, must they look the same, answer is likely no. If you instead want to fix the algorithms for constructing the union machine and getting the regular expression, maybe we could show that a fixed method of doing it always gives the same answer.
In your specific case, applying the Cartesian Product Machine construction to get a DFA for the union of the original DFAs and then applying the construction from the proof of equivalence between DFAs and REs, we can see that the structure of the RE obtained by +'ing the original REs can't be achieved starting from a DFA; you'd have needed an NFA to get a + between the LHS and RHS, but DFAs can only + among individual symbols, not subexpressions. Of course, it might be possible the RE can be algebraically manipulated to derive the target RE, but that isn't exactly the same.
All of the above hold for the question of equality of REs. However, you asked about equivalence. Almost always, we say two REs are equivalent if they generate the same language. If this is what you meant, then yes, +ing the two REs will give an RE equivalent to the one obtained by constructing a union machine and deriving an RE from that. The REs will not look the same but will generate the same language, just as (ab + e)(abab)* and (ab)* generate the same language despite looking a bit different.
Regular expressions are not like finite state parsers and it's usually a mistake to try to incorporate them into complex parsing scenarios.
But also, they are marvelous tools for specific problems. After reading your descriptive requirements, there is a simple regular expression that accomplishes it, but in a way you might not expect. Your requirements:
strings with at least 2 a
strings with at least 2 b
The Union of the two, or strings withat least two a's or two b's
([ab]).*?\1
This expression opens a capture group to capture either a or b. Then it allows zero or more 'any characters' followed by whatever was captured in the capture group (\1).

Can regex match intersection between two regular expressions?

Given several regular expressions, can we write a regular expressions which is equal to their intersection?
For example, given two regular expressions c[a-z][a-z] and [a-z][aeiou]t, their intersection contains cat and cut and possibly more. How can we write a regular expression for their intersection?
Thanks.
A logical AND in regex is represented by
(?=...)(?=...)
So,
(?=[a-z][aeiou]t)(?=c[a-z][a-z])
The lookahead examples are easy to use, but technically are no longer regular languages. However it is possible to take the intersection of two regular languages, and that complement is regular.
First note that Regular Expressions can be converted to and from NFAs; they both are ways of expressing regular languages.
Second, by DeMorgan's law,
Thus these are the steps to compute the intersection of two RegExs:
Convert both RegExs to NFAs.
Compute the complement of both NFAs.
Compute the union of the two complements.
Compute the complement of that union.
Convert the resulting NFA to a RegEx.
Some sources:
Union and RegEx to NFA: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_06.pdf
NFA to RegEx: http://courses.engr.illinois.edu/cs373/sp2009/lectures/lect_08.pdf
Complement of NFA: https://cs.stackexchange.com/questions/13282/complement-of-non-deterministic-finite-automata
Mathematically speaking, an intersection of two regular languages is regular, so there has to be a regular expression that accepts it.
Building it via corresponding NFAs is probably the easiest. Consider the two NFAs that correspond to the two regexes. The new states Q are pairs (Q1,Q2) from the two NFAs. If there is a transition (P1,x,Q1) in the first NFA and (P2,x,Q2) in the second NFA, then and only then there is a transition ((P1,P2),x,(Q1,Q2)) in the new NFA. A new state (Q1,Q2) is initial/final iff both Q1 and Q2 are initial/final.
If you use NFAs with ε-moves, then also for each transition (P1,ε,Q1) there will be a transition ((P1,P2),ε,(Q1,P2)) for all states P2. Likewise for ε-moves in the second NFA.
Now convert the new NFA to a regular expression with any known algorithm, and that's it.
As for PCRE, they are not, strictly speaking, regular expressions. There is no way to do it in the general case. Sometimes you can use lookaheads, like ^(?=regex1$)(?=regex2$) but this is only good for matching the entire string and is no good for either searching or embedding in other regexps. Without anchoring, the two lookaheads may end up matching strings of different lengths. This is not intersection.
First, let's agree on terms. My syntactical assumption will be that
The intersection of several regexes is one regex that matches strings
that each of the component regexes also match.
The General Option
To check for the intersection of two patterns, the general method is (pseudo-code):
if match(regex1) && match(regex2) { champagne for everyone! }
The Regex Option
In some cases, you can do the same with lookaheads, but for a complex regex there is little benefit of doing so, apart from making your regex more obscure to your enemies. Why little benefit? Because the engine will have to parse the whole string multiple times anyway.
Boolean AND
The general pattern for an AND checking that a string exactly meets regex1 and regex2 would be:
^(?=regex1$)(?=regex2$)
The $ in each lookahead ensures that each string matches the pattern and nothing more.
Matching when AND
Of course, if you don't want to just check the boolean value of the AND but also do some actual matching, after the lookaheads, you can add a dot-star to consume the string:
^(?=regex1$)(?=regex2$).*
Or... After checking the first condition, just match the second:
^(?=regex1$)regex2$
This is a technique used for instance in password validation. For more details on this, see Mastering Lookahead and Lookbehind.
Bonus section: Union of Regexes
Instead of working on an intersection, let's say you are interested in the union of the following regexes, i.e., a regex that matches either of those regexes:
catch
cat1
cat2
cat3
cat5
This is accomplished with the alternation | operator:
catch|cat1|cat2|cat3|cat5
Furthermore, such a regex can often be compressed, as in:
cat(?:ch|[1-35])
For And operation, we have something like this in RegEx
(REGEX)(REGEX)
Taking your example
'Cat'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
["Cat", "C", "a", "t"]
'Ca'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
'Cat123'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
where
([A-Za-z]+) //Match All characters
and
([aeiouAEIOU]+) //Match all vowels
Combine them both will match
([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)
eg:
'Hmmmmmm'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
//null
'Stckvrflw'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
null
'StackOverflow'.match(/^([A-Za-z]+)([aeiouAEIOU]+)([A-Za-z]+)$/)
["StackOverflow", "StackOverfl", "o", "w"]

Does lookaround affect which languages can be matched by regular expressions?

There are some features in modern regex engines which allow you to match languages that couldn't be matched without that feature. For example the following regex using back references matches the language of all strings that consist of a word that repeats itself: (.+)\1. This language is not regular and can't be matched by a regex that does not use back references.
Does lookaround also affect which languages can be matched by a regular expression? I.e. are there any languages that can be matched using lookaround that couldn't be matched otherwise? If so, is this true for all flavors of lookaround (negative or positive lookahead or lookbehind) or just for some of them?
The answer to the question you ask, which is whether a larger class of languages than the regular languages can be recognised with regular expressions augmented by lookaround, is no.
A proof is relatively straightforward, but an algorithm to translate a regular expression containing lookarounds into one without is messy.
First: note that you can always negate a regular expression (over a finite alphabet). Given a finite state automaton that recognises the language generated by the expression, you can simply exchange all the accepting states for non-accepting states to get an FSA that recognises exactly the negation of that language, for which there are a family of equivalent regular expressions.
Second: because regular languages (and hence regular expressions) are closed under negation they are also closed under intersection since A intersect B = neg ( neg(A) union neg(B)) by de Morgan's laws. In other words given two regular expressions, you can find another regular expression that matches both.
This allows you to simulate lookaround expressions. For example u(?=v)w matches only expressions that will match uv and uw.
For negative lookahead you need the regular expression equivalent of the set theoretic A\B, which is just A intersect (neg B) or equivalently neg (neg(A) union B). Thus for any regular expressions r and s you can find a regular expression r-s which matches those expressions that match r which do not match s. In negative lookahead terms: u(?!v)w matches only those expressions which match uw - uv.
There are two reasons why lookaround is useful.
First, because the negation of a regular expression can result in something much less tidy. For example q(?!u)=q($|[^u]).
Second, regular expressions do more than match expressions, they also consume characters from a string - or at least that's how we like to think about them. For example in python I care about the .start() and .end(), thus of course:
>>> re.search('q($|[^u])', 'Iraq!').end()
5
>>> re.search('q(?!u)', 'Iraq!').end()
4
Third, and I think this is a pretty important reason, negation of regular expressions does not lift nicely over concatenation. neg(a)neg(b) is not the same thing as neg(ab), which means that you cannot translate a lookaround out of the context in which you find it - you have to process the whole string. I guess that makes it unpleasant for people to work with and breaks people's intuitions about regular expressions.
I hope I have answered your theoretical question (its late at night, so forgive me if I am unclear). I agree with a commentator who said that this does have practical applications. I met very much the same problem when trying to scrape some very complicated web pages.
EDIT
My apologies for not being clearer: I do not believe you can give a proof of regularity of regular expressions + lookarounds by structural induction, my u(?!v)w example was meant to be just that, an example, and an easy one at that. The reason a structural induction won't work is because lookarounds behave in a non-compositional way - the point I was trying to make about negations above. I suspect any direct formal proof is going to have lots of messy details. I have tried to think of an easy way to show it but cannot come up with one off the top of my head.
To illustrate using Josh's first example of ^([^a]|(?=..b))*$ this is equivalent to a 7 state DFSA with all states accepting:
A - (a) -> B - (a) -> C --- (a) --------> D
Λ | \ |
| (not a) \ (b)
| | \ |
| v \ v
(b) E - (a) -> F \-(not(a)--> G
| <- (b) - / |
| | |
| (not a) |
| | |
| v |
\--------- H <-------------------(b)-----/
The regular expression for state A alone looks like:
^(a([^a](ab)*[^a]|a(ab|[^a])*b)b)*$
In other words any regular expression you are going to get by eliminating lookarounds will in general be much longer and much messier.
To respond to Josh's comment - yes I do think the most direct way to prove the equivalence is via the FSA. What makes this messier is that the usual way to construct an FSA is via a non-deterministic machine - its much easier to express u|v as simply the machine constructed from machines for u and v with an epsilon transition to the two of them. Of course this is equivalent to a deterministic machine, but at the risk of exponential blow-up of states. Whereas negation is much easier to do via a deterministic machine.
The general proof will involve taking the cartesian product of two machines and selecting those states you wish to retain at each point you want to insert a lookaround. The example above illustrates what I mean to some extent.
My apologies for not supplying a construction.
FURTHER EDIT:
I have found a blog post which describes an algorithm for generating a DFA out of a regular expression augmented with lookarounds. Its neat because the author extends the idea of an NFA-e with "tagged epsilon transitions" in the obvious way, and then explains how to convert such an automaton into a DFA.
I thought something like that would be a way to do it, but I'm pleased that someone has written it up. It was beyond me to come up with something so neat.
As the other answers claim, lookarounds don't add any extra power to regular expressions.
I think we can show this using the following:
One Pebble 2-NFA (see the Introduction section of the paper which refers to it).
The 1-pebble 2NFA does not deal with nested lookaheads, but, we can use a variant of multi-pebble 2NFAs (see section below).
Introduction
A 2-NFA is a non deterministic finite automaton which has the ability to move either left or right on it's input.
A one pebble machine is where the machine can place a pebble on the input tape (i.e. mark a specific input symbol with a pebble) and do possibly different transitions based on whether there is a pebble at the current input position or not.
It is known the One Pebble 2-NFA has the same power as a regular DFA.
Non-nested Lookaheads
The basic idea is as follows:
The 2NFA allows us to backtrack (or 'front track') by moving forward or backward in the input tape. So for a lookahead we can do the match for the lookahead regular expression and then backtrack what we have consumed, in matching the lookahead expression. In order to know exactly when to stop backtracking, we use the pebble! We drop the pebble before we enter the dfa for the lookahead to mark the spot where the backtracking needs to stop.
Thus at the end of running our string through the pebble 2NFA, we know whether we matched the lookahead expression or not and the input left (i.e. what is left to be consumed) is exactly what is required to match the remaining.
So for a lookahead of the form u(?=v)w
We have the DFAs for u, v and w.
From the accepting state (yes, we can assume there is only one) of DFA for u, we make an e-transition to the start state of v, marking the input with a pebble.
From an accepting state for v, we e-transtion to a state which keeps moving the input left, till it finds a pebble, and then transitions to start state of w.
From a rejecting state of v, we e-transition to a state which keeps moving left until it finds the pebble, and transtions to the accepting state of u (i.e where we left off).
The proof used for regular NFAs to show r1 | r2, or r* etc, carry over for these one pebble 2nfas. See http://www.coli.uni-saarland.de/projects/milca/courses/coal/html/node41.html#regularlanguages.sec.regexptofsa for more info on how the component machines are put together to give the bigger machine for the r* expression etc.
The reason why the above proofs for r* etc work is that the backtracking ensures that the input pointer is always at the right spot, when we enter the component nfas for repetition. Also, if a pebble is in use, then it is being processed by one of the lookahead component machines. Since there are no transitions from lookahead machine to lookahead machine without completely backtracking and getting back the pebble, a one pebble machine is all that is needed.
For eg consider ([^a] | a(?=...b))*
and the string abbb.
We have abbb which goes through the peb2nfa for a(?=...b), at the end of which we are at the state: (bbb, matched) (i.e in input bbb is remaining, and it has matched 'a' followed by '..b'). Now because of the *, we go back to the beginning (see the construction in the link above), and enter the dfa for [^a]. Match b, go back to beginning, enter [^a] again two times, and then accept.
Dealing with Nested Lookaheads
To handle nested lookaheads we can use a restricted version of k-pebble 2NFA as defined here: Complexity Results for Two-Way and Multi-Pebble Automata and their Logics (see Definition 4.1 and Theorem 4.2).
In general, 2 pebble automata can accept non-regular sets, but with the following restrictions, k-pebble automata can be shown to be regular (Theorem 4.2 in above paper).
If the pebbles are P_1, P_2, ..., P_K
P_{i+1} may not be placed unless P_i is already on the tape and P_{i} may not be picked up unless P_{i+1} is not on the tape. Basically the pebbles need to be used in a LIFO fashion.
Between the time P_{i+1} is placed and the time that either P_{i} is picked up or P_{i+2} is placed, the automaton can traverse only the subword located between the current location of P_{i} and the end of the input word that lies in the direction of P_{i+1}. Moreover, in this sub-word, the automaton can act only as a 1-pebble automaton with Pebble P_{i+1}. In particular it is not allowed to lift up, place or even sense the presence of another pebble.
So if v is a nested lookahead expression of depth k, then (?=v) is a nested lookahead expression of depth k+1. When we enter a lookahead machine within, we know exactly how many pebbles have to have been placed so far and so can exactly determine which pebble to place and when we exit that machine, we know which pebble to lift. All machines at depth t are entered by placing pebble t and exited (i.e. we return to processing of a depth t-1 machine) by removing pebble t. Any run of the complete machine looks like a recursive dfs call of a tree and the above two restrictions of the multi-pebble machine can be catered to.
Now when you combine expressions, for rr1, since you concat, the pebble numbers of r1 must be incremented by the depth of r. For r* and r|r1 the pebble numbering remains the same.
Thus any expression with lookaheads can be converted to an equivalent multi-pebble machine with the above restrictions in pebble placement and so is regular.
Conclusion
This basically addresses the drawback in Francis's original proof: being able to prevent the lookahead expressions from consuming anything which are required for future matches.
Since Lookbehinds are just finite string (not really regexs) we can deal with them first, and then deal with the lookaheads.
Sorry for the incomplete writeup, but a complete proof would involve drawing a lot of figures.
It looks right to me, but I will be glad to know of any mistakes (which I seem to be fond of :-)).
I agree with the other posts that lookaround is regular (meaning that it does not add any fundamental capability to regular expressions), but I have an argument for it that is simpler IMO than the other ones I have seen.
I will show that lookaround is regular by providing a DFA construction. A language is regular if and only if it has a DFA that recognizes it. Note that Perl doesn't actually use DFAs internally (see this paper for details: http://swtch.com/~rsc/regexp/regexp1.html) but we construct a DFA for purposes of the proof.
The traditional way of constructing a DFA for a regular expression is to first build an NFA using Thompson's Algorithm. Given two regular expressions fragments r1 and r2, Thompson's Algorithm provides constructions for concatenation (r1r2), alternation (r1|r2), and repetition (r1*) of regular expressions. This allows you to build a NFA bit by bit that recognizes the original regular expression. See the paper above for more details.
To show that positive and negative lookahead are regular, I will provide a construction for concatenation of a regular expression u with positive or negative lookahead: (?=v) or (?!v). Only concatenation requires special treatment; the usual alternation and repetition constructions work fine.
The construction is for both u(?=v) and u(?!v) is:
In other words, connect every final state of the existing NFA for u to both an accept state and to an NFA for v, but modified as follows. The function f(v) is defined as:
Let aa(v) be a function on an NFA v that changes every accept state into an "anti-accept state". An anti-accept state is defined to be a state that causes the match to fail if any path through the NFA ends in this state for a given string s, even if a different path through v for s ends in an accept state.
Let loop(v) be a function on an NFA v that adds a self-transition on any accept state. In other words, once a path leads to an accept state, that path can stay in the accept state forever no matter what input follows.
For negative lookahead, f(v) = aa(loop(v)).
For positive lookahead, f(v) = aa(neg(v)).
To provide an intuitive example for why this works, I will use the regex (b|a(?:.b))+, which is a slightly simplified version of the regex I proposed in the comments of Francis's proof. If we use my construction along with the traditional Thompson constructions, we end up with:
The es are epsilon transitions (transitions that can be taken without consuming any input) and the anti-accept states are labeled with an X. In the left half of the graph you see the representation of (a|b)+: any a or b puts the graph in an accept state, but also allows a transition back to the begin state so we can do it again. But note that every time we match an a we also enter the right half of the graph, where we are in anti-accept states until we match "any" followed by a b.
This is not a traditional NFA because traditional NFAs don't have anti-accept states. However we can use the traditional NFA->DFA algorithm to convert this into a traditional DFA. The algorithm works like usual, where we simulate multiple runs of the NFA by making our DFA states correspond to subsets of the NFA states we could possibly be in. The one twist is that we slightly augment the rule for deciding if a DFA state is an accept (final) state or not. In the traditional algorithm a DFA state is an accept state if any of the NFA states was an accept state. We modify this to say that a DFA state is an accept state if and only if:
= 1 NFA states is an accept state, and
0 NFA states are anti-accept states.
This algorithm will give us a DFA that recognizes the regular expression with lookahead. Ergo, lookahead is regular. Note that lookbehind requires a separate proof.
I have a feeling that there are two distinct questions being asked here:
Are Regex engines that encorporate "lookaround" more
powerful than Regex engines that don't?
Does "lookaround"
empower a Regex engine with the ability to parse languages that are
more complex than those generated from a Chomsky Type 3 - Regular grammar?
The answer to the first question in a practical sense is yes. Lookaround will give a Regex engine that
uses this feature fundamentally more power than one that doesn't. This is because
it provides a richer set of "anchors" for the matching process.
Lookaround lets you define an entire Regex as a possible anchor point (zero width assertion). You can
get a pretty good overview of the power of this feature here.
Lookaround, although powerful, does not lift the Regex engine beyond the theoretical
limits placed on it by a Type 3 Grammar. For example, you will never be able to reliably
parse a language based on a Context Free - Type 2 Grammar using a Regex engine
equipped with lookaround. Regex engines are limited to the power of a Finite State Automation
and this fundamentally restricts the expressiveness of any language they can parse to the level of a Type 3 Grammar. No matter
how many "tricks" are added to your Regex engine, languages generated via a Context Free Grammar
will always remain beyond its capabilities. Parsing Context Free - Type 2 grammar requires pushdown automation to "remember" where it is in
a recursive language construct. Anything that requires a recursive evaluation of the grammar rules cannot be parsed using
Regex engines.
To summarize: Lookaround provides some practical benefits to Regex engines but does not "alter the game" on a
theoretical level.
EDIT
Is there some grammar with a complexity somewhere between Type 3 (Regular) and Type 2 (Context Free)?
I believe the answer is no. The reason is because there is no theoretical limit
placed on the size of the NFA/DFA needed to describe a Regular language. It may become arbitrarily large
and therefore impractical to use (or specify). This is where dodges such as "lookaround" are useful. They
provide a short-hand mechanism to specify what would otherwise lead to very large/complex NFA/DFA
specifications. They do not increase the expressiveness of
Regular languages, they just make specifying them more practical. Once you get this point, it becomes
clear that there are a lot of "features" that could be added to Regex engines to make them more
useful in a practical sense - but nothing will make them capable of going beyond the
limits of a Regular language.
The basic difference between a Regular and a Context Free language is that a Regular language
does not contain recursive elements. In order to evaluate a recursive language you need a
Push Down Automation
to "remember" where you are in the recursion. An NFA/DFA does not stack state information so cannot
handle the recursion. So given a non-recursive language definition there will be some NFA/DFA (but
not necessarily a practical Regex expression) to describe it.

How can you detect if two regular expressions overlap in the strings they can match?

I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?
There's no easy way.
As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.
Given any pair of NFA, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).
The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFA. Alternatively, if you've already shown how to compute the complement of a NFA, then you can (DeMorgan's law style) get the intersection by complement(union(complement(A),complement(B))).
Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:
Some classes of regular languages can
only be described by deterministic
finite automata whose size grows
exponentially in the size of the
shortest equivalent regular
expressions. The standard example are
here the languages L_k consisting of
all strings over the alphabet {a,b}
whose kth-last letter equals a.
By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:
if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')
and X ->a Y (in the other NFA)
then (A,X) ->a (B,Y) in the intersection
(C,Z) is final iff C is final in the one NFA and Z is final in the other.
To start the process off, you start in the pair of start states for the two NFAs e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. You'd store the fact that you expanded a state's arcs (e.g. in a hash table) and end up exploring all the states reachable from the start.
If you allow epsilon transitions (that don't output a letter), that's fine:
if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.
This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).
This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
In theory, the problem you describe is impossible.
In practice, if you have a manageable number of regular expressions that use a limited subset or of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.
Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with an (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
* #return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there is no strings that matches both regex. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
/**
* #return true if the two regexes will never both match a given string
*/
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
Automaton automaton1 = new RegExp(regex1).toAutomaton();
Automaton automaton2 = new RegExp(regex2).toAutomaton();
return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
I believe kdgregory is correct you're using Orthogonal to mean Complement.
Is this correct?
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that nonesuch exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
I would do the following:
convert each regex to a FSA, using something like the following structure:
struct FSANode
{
bool accept;
Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
CombinedNode(FSANode left, FSANode right)
{
this.left = left;
this.right = right;
}
Map<char, CombinedNode> links;
bool valid { get { return !left.accept || !right.accept; } }
public FSANode left;
public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are know procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.