What does an NFA with epsilon transitions let us do? - regex

I have the following automaton. I am supposed to understand the use of empty transitions through it.
I think that the regular expression of this automaton is the following: 0* 1* 2*
I just want to know what this automaton lets us do? in other words what is the use of empty transitions in that case?

You're right, your automaton describes 0*1*2*. q0 is where it handles a (possibly empty) series of 0s. q1 is where it handles a (possibly empty) series of 1s. By having an empty transition from q0 to q1, no character is needed between the 0s and the 1s.
The empty transition does not let you describe any language that could not be described without it. However, just try reworking your automaton to not use empty transitions. It's possible, but it requires more transitions, it requires making every state a final state, and when you're done, you hopefully see that it makes it more difficult to tell what language is being described.
So the use of empty transitions is that they make your automaton easier to understand.

i think u want to know about difference between this..
and this
difference is first first automaton accept string of 0,1 and 2 of anything...
while second automaton accepts string of 0,1 and 2 of...0 followed by 1,1 followed by 2....
ex. it accepts 012,001122,0012,0112 etc.
it won't accept 102,201,0121,0120,0011221.
so empty transitions make it nice and easy...

Related

How to derive the RegEx from the state diagram?

I found a state diagram of a DFA (deterministic finite automaton) with its RegEx in a script, but this diagram is just a sample without any explanations. So I tried to derive the RegEx from the DFA state diagram by myself and got the expression: ab+a+b(a*b)*. I dont understand how I get the original RegEx (ab+a*)+ab+ mentioned in the script. Here my derivation:
I am grateful for any help, links, references and hints!
You've derived the regex correctly here. The expression you have ab+a+b(a*b)* is equivalent to (ab+a*)+ab+ - once you've finished the DFA state elimination (you have a single transition from the start state to an accepting state), there aren't any more derivations to do. You may however get different final regexes depending on the order you eliminate the states in, and they should all be valid assuming you did the eliminations correctly. The state elimination method is also not guaranteed to be able to produce all equivalent regular expressions for a particular DFA, so it's okay that you didn't arrive at exactly the original regex. You can also check the equivalence of two regular expressions here.
For your particular example though to show that this DFA is equivalent to this original regex (ab+a*)+ab+, take a look at the DFA at this state of the elimination (somewhere in between the second and third steps you've shown above):
Let's expand our expression (ab+a*)+ab+ to (ab+a*)(ab+a*)*ab+. So in the DFA, the first (ab+a*) gets us from state 0 to partway between states 2 and 3 (the a* in a*a).
Then the next part (ab+a*)* means we're allowed to have 0 or more copies of (ab+a*). If there are 0 copies, we'll just finish with an ab+, reading an a from the second half of the a*a transition from 2 to 3 and a b from the 3 to 4 transition, landing us in state 4 which is accepting and where we can take the self loop and read as many b's as we want.
Otherwise we have 1 or more copies of (ab+a*), again reading an a from the second half of the a*a transition from 2 to 3 and a b from the 3 to 4 transition. The a* comes from the first half of the a*ab self loop on state 4 and the second half ab is either the final ab+ of the regex or the start of another copy of (ab+a*). I'm not sure if there's a state elimination that arrives at exactly the expression (ab+a*)+ab+ but for what it's worth, I think the regex you derived more clearly captures the structure of this DFA.

Do not understand empty string in finite automata from regular expression

I am reading my textbook ("Introduction to the theory of computation" by Michael Sipser) and there is a part that is trying to convert the regular expression (ab U b)* into a non-deterministic finite automata. The book is telling me to break the expression apart and write a NFA for each piece. When it reaches ab it shows:
An a to go from the first state to the second
An empty string to go from the second state to the third
A b to go from the third state to the final accepting state.
I am really confused trying to understand the second step. What's the purpose of the empty string. Would it be incorrect if it wasn't there?
You're probably making reference to example 1.56, right? First of all, remember he is describing a process - think of it as an algorithm, that will be executed always in the same matter. What he's saying is that, if you have a regexp concatenation as P.Q, no matter what P and Q are, you can join their AFNs by connecting the ending states of P with an empty string to the starting state of Q.
So, in his example, he's creating an AFN to represent a.b. AFNs to represent a and b isolated are trivial, as:
So, in order to join both, you just follow the algorithm, putting an empty string connection from the final state of the first AFN to the starting state of the second algorithm, as:
In this particular case, since it's very easy to create a a.b AFD, we look at it and perceive this empty string is not necessary. But the process works with every single concatenation - and, in fact, being self evident that this is equivalent of a trivially correct AFD just reinforces his statement: the process works.
Hope that helps.

DFA based regular expression matching - how to get all matches?

I have a given DFA that represent a regular expression.
I want to match the DFA against an input stream and get all possible matches back, not only the leastmost-longest match.
For example:
regex: a*ba|baa
input: aaaaabaaababbabbbaa
result:
aaaaaba
aaba
ba
baa
Assumptions
Based on your question and later comments you want a general method for splitting a sentence into non-overlapping, matching substrings, with non-matching parts of the sentence discarded. You also seem to want optimal run-time performance. Also I assume you have an existing algorithm to transform a regular expression into DFA form already. I further assume that you are doing this by the usual method of first constructing an NFA and converting it by subset construction to DFA, since I'm not aware of any other way of accomplishing this.
Before you go chasing after shadows, make sure your trying to apply the right tool for the job. Any discussion of regular expressions is almost always muddied by the fact that folks use regular expressions for a lot more things than they are really optimal for. If you want to receive the benefits of regular expressions, be sure you're using a regular expression, and not something broader. If what you want to do can't be somewhat coded into a regular expression itself, then you can't benefit from the advantages of regular expression algorithms (fully)
An obvious example is that no amount of cleverness will allow a FSM, or any algorithm, to predict the future. For instance, an expression like (a*b)|(a), when matched against the string aaa... where the ellipsis is the portion of the expression not yet scanned because the user has not typed them yet, cannot give you every possible right subgroup.
For a more detailed discussion of Regular expression implementations, and specifically Thompson NFA's please check this link, which describes a simple C implementation with some clever optimizations.
Limitations of Regular Languages
The O(n) and Space(O(1)) guarantees of regular expression algorithms is a fairly narrow claim. Specifically, a regular language is the set of all languages that can be recognized in constant space. This distinction is important. Any kind of enhancement to the algorithm that does something more sophisticated than accepting or rejecting a sentence is likely to operate on a larger set of languages than regular. On top of that, if you can show that some enhancement requires greater than constant space to implement, then you are also outside of the performance guarantee. That being said, we can still do an awful lot if we are very careful to keep our algorithm within these narrow constraints.
Obviously that eliminates anything we might want to do with recursive backtracking. A stack does not have constant space. Even maintaining pointers into the sentence would be verboten, since we don't know how long the sentence might be. A long enough sentence would overflow any integer pointer. We can't create new states for the automaton as we go to get around this. All possible states (and a few impossible ones) must be predictable before exposing the recognizer to any input, and that quantity must be bounded by some constant, which may vary for the specific language we want to match, but by no other variable.
This still allows some room for adding additonal behavior. The usual way of getting more mileage is to add some extra annotations for where certain events in processing occur, such as when a subexpression started or stopped matching. Since we are only allowed to have constant space processing, that limits the number of subexpression matches we can process. This usually means the latest instance of that subexpression. This is why, when you ask for the subgrouped matched by (a|)*, you always get an empty string, because any sequence of a's is implicitly followed by infinitely many empty strings.
The other common enhancement is to do some clever thing between states. For example, in perl regex, \b matches the empty string, but only if the previous character is a word character and the next is not, or visa versa. Many simple assertions fit this, including the common line anchor operators, ^ and $. Lookahead and lookbehind assertions are also possible, but much more difficult.
When discussing the differences between various regular language recognizers, it's worth clarifying if we're talking about match recognition or search recognition, the former being an accept only if the entire sentence is in the language, and the latter accepts if any substring in the sentence is in the language. These are equivalent in the sense that if some expression E is accepted by the search method, then .*(E).* is accepted in the match method.
This is important because we might want to make it clear whether an expression like a*b|a accepts aa or not. In the search method, it does. Either token will match the right side of the disjunction. It doesn't match, though, because you could never get that sentence by stepping through the expression and generating tokens from the transitions, at least in a single pass. For this reason, i'm only going to talk about match semantics. Obviously if you want search semantics, you can modify the expression with .*'s
Note: A language defined by expression E|.* is not really a very manageable language, regardless of the sublanguage of E because it matches all possible sentences. This represents a real challenge for regular expression recognizers because they are really only suited to recognizing a language or else confirming that a sentence is not in that same language, rather than doing any more specific work.
Implementation of Regular Language Recognizers
There are generally three ways to process a regular expression. All three start the same, by transforming the expression into an NFA. This process produces one or two states for each production rule in the original expression. The rules are extemely simple. Here's some crude ascii art: note that a is any single literal character in the language's alphabet, and E1 and E2 are any regular expression. Epsilon(ε) is a state with inputs and outputs, but ignores the stream of characters, and doesn't consume any input either.
a ::= > a ->
E1 E2 ::= >- E1 ->- E2 ->
/---->
E1* ::= > --ε <-\
\ /
E1
/-E1 ->
E1|E2 ::= > ε
\-E2 ->
And that's it! Common uses such as E+, E?, [abc] are equivalent to EE*, (E|), (a|b|c) respectively. Also note that we add for each production rule a very small number of new states. In fact each rule adds zero or one state (in this presentation). characters, quantifiers and dysjunction all add just one state, and the concatenation doesn't add any. Everything else is done by updating the fragments' end pointers to start pointers of other states or fragments.
The epsilon transition states are important, because they are ambiguous. When encountered, is the machine supposed to change state to once following state or another? should it change state at all or stay put? That's the reason why these automatons are called nondeterministic. The solution is to have the automaton transition to the right state, whichever allows it to match the best. Thus the tricky part is to figure out how to do that.
There are fundamentally two ways of doing this. The first way is to try each one. Follow the first choice, and if that doesn't work, try the next. This is recursive backtracking, appears in a few (and notable) implementations. For well crafted regular expressions, this implementation does very little extra work. If the expression is a bit more convoluted, recursive backtracking is very, very bad, O(2^n).
The other way of doing this is to instead try both options in parallel. At each epsilon transition, add to the set of current states both of the states the epsilon transition suggests. Since you are using a set, you can have the same state come up more than once, but you only need to track it once, either you are in that state or not. If you get to the point that there's no option for a particular state to follow, just ignore it, that path didn't match. If there are no more states, then the entire expression didn't match. as soon as any state reaches the final state, you are done.
Just from that explanation, the amount of work we have to do has gone up a little bit. We've gone from having to keep track of a single state to several. At each iteration, we may have to update on the order of m state pointers, including things like checking for duplicates. Also the amount of storage we needed has gone up, since now it's no longer a single pointer to one possible state in the NFA, but a whole set of them.
However, this isn't anywhere close to as bad as it sounds. First off, the number of states is bounded by the number of productions in the original regular expression. From now on we'll call this value m to distinguish it from the number of symbols in the input, which will be n. If two state pointers end up transitioning to the same new state, you can discard one of them, because no matter what else happens, they will both follow the same path from there on out. This means the number of state pointers you need is bounded by the number of states, so that to is m.
This is a bigger win in the worst case scenario when compared to backtracking. After each character is consumed from the input, you will create, rename, or destroy at most m state pointers. There is no way to craft a regular expression which will cause you to execute more than that many instructions (times some constant factor depending on your exact implementation), or will cause you to allocate more space on the stack or heap.
This NFA, simultaneously in some subset of its m states, may be considered some other state machine who's state represents the set of states the NFA it models could be in. each state of that FSM represents one element from the power set of the states of the NFA. This is exactly the DFA implementation used for matching regular expressions.
Using this alternate representation has an advantage that instead of updating m state pointers, you only have to update one. It also has a downside, since it models the powerset of m states, it actually has up to 2m states. That is an upper limit, because you don't model states that cannot happen, for instance the expression a|b has two possible states after reading the first character, either the one for having seen an a, or the one for having seen a b. No matter what input you give it, it cannot be in both of those states at the same time, so that state-set does not appear in the DFA. In fact, because you are eliminating the redundancy of epsilon transitions, many simple DFA's actually get SMALLER than the NFA they represent, but there is simply no way to guarantee that.
To keep the explosion of states from growing too large, a solution used in a few versions of that algorithm is to only generate the DFA states you actually need, and if you get too many, discard ones you haven't used recently. You can always generate them again.
From Theory to Practice
Many practical uses of regular expressions involve tracking the position of the input. This is technically cheating, since the input could be arbitrarily long. Even if you used a 64 bit pointer, the input could possibly be 264+1 symbols long, and you would fail. Your position pointers have to grow with the length of the input, and now your algorithm now requires more than constant space to execute. In practice this isn't relevant, because if your regular expression did end up working its way through that much input, you probably won't notice that it would fail because you'd terminate it long before then.
Of course, we want to do more than just accept or reject inputs as a whole. The most useful variation on this is to extract submatches, to discover which portion of an input was matched by a certain section of the original expression. The simple way to achieve this is to add an epsilon transition for each of the opening and closing braces in the expression. When the FSM simulator encounters one of these states, it annotates the state pointer with information about where in the input it was at the time it encountered that particular transition. If the same pointer returns to that transition a second time, the old annotation is discarded and replaced with a new annotation for the new input position. If two states pointers with disagreeing annotations collapse to the same state, the annotation of a later input position wins again.
If you are sticking to Thompson NFA or DFA implementations, then there's not really any notion of greedy or non-greedy matching. A backtracking algorithm needs to be given a hint about whether it should start by trying to match as much as it can and recursively trying less, or trying as little as it can and recursively trying more, when it fails it first attempt. The Thompson NFA method tries all possible quantities simultaneously. On the other hand, you might still wish to use some greedy/nongreedy hinting. This information would be used to determine if newer or older submatch annotations should be preferred, in order to capture just the right portion of the input.
Another kind of practical enhancement is assertions, productions which do not consume input, but match or reject based on some aspect of the input position. For instance in perl regex, a \b indicates that the input must contain a word boundary at that position, such that the symbol just matched must be a word character, but the next character must not be, or visa versa. Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, then the state pointer continues, otherwise it is discarded.
Lookahead and lookbehind assertions can be achieved with a bit more work. A typical lookbehind assertion r0(?<=r1)r2 is transformed into two separate expressions, .*r1 and r0εr2. Both expressions are applied to the input. Note that we added a .* to the assertion expression, because we don't actually care where it starts. When the simulator encounters the epsilon in the second generated fragment, it checks up on the state of the first fragment. If that fragment is in a state where it could accept right there, the assertion passes with the state pointer flowing into r2, but otherwise, it fails, and both fragments continue, with the second discarding the state pointer at the epsilon transition.
Lookahead also works by using an extra regex fragment for the assertion, but is a little more complex, because when we reach the point in the input where the assertion must succeed, none of the corresponding characters have been encountered (in the lookbehind case, they have all been encountered). Instead, when the simulator reaches the assertion, it starts a pointer in the start state of the assertion subexpression and annotates the state pointer in the main part of the simulation so that it knows it is dependent on the subexpression pointer. At each step, the simulation must check to see that the state pointer it depends upon is still matching. If it doesn't find one, then it fails wherever it happens to be. You don't have to keep any more copies of the assertion subexpressions state pointers than you do for the main part, if two state pointers in the assertion land on the same state, then the state pointers each of them depend upon will share the same fate, and can be reannotated to point to the single pointer you keep.
While were adding special instructions to epsilon transitions, it's not a terrible idea to suggest an instruction to make the simulator pause once in a while to let the user see what's going on. Whenever the simulator encounters such a transition, it will wrap up its current state in some kind of package that can be returned to the caller, inspected or altered, and then resumed where it left off. This could be used to match input interactively, so if the user types only a partial match, the simulator can ask for more input, but if the user types something invalid, the simulator is empty, and can complain to the user. Another possibility is to yield every time a subexpression is matched, allowing you to peek at every sub match in the input. This couldn't be used to exclude some submatches, though. For instance, if you tried to match ((a)*b) against aaa, you could see three submatches for the a's, even though the whole expression ultimately fails because there is no b, and no submatch for the corresponding b's
Finally, there might be a way to modify this to work with backreferences. Even if it's elegent, it's sure to be inefficient, specifically, regular expressions plus backreferences are in NP-Complete, so I won't even try to think of a way to do this, because we are only interested (here) in (asymptotically) efficient possibilities.

How can you detect if two regular expressions overlap in the strings they can match?

I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?
There's no easy way.
As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.
Given any pair of NFA, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).
The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFA. Alternatively, if you've already shown how to compute the complement of a NFA, then you can (DeMorgan's law style) get the intersection by complement(union(complement(A),complement(B))).
Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:
Some classes of regular languages can
only be described by deterministic
finite automata whose size grows
exponentially in the size of the
shortest equivalent regular
expressions. The standard example are
here the languages L_k consisting of
all strings over the alphabet {a,b}
whose kth-last letter equals a.
By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:
if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')
and X ->a Y (in the other NFA)
then (A,X) ->a (B,Y) in the intersection
(C,Z) is final iff C is final in the one NFA and Z is final in the other.
To start the process off, you start in the pair of start states for the two NFAs e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. You'd store the fact that you expanded a state's arcs (e.g. in a hash table) and end up exploring all the states reachable from the start.
If you allow epsilon transitions (that don't output a letter), that's fine:
if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.
This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).
This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
In theory, the problem you describe is impossible.
In practice, if you have a manageable number of regular expressions that use a limited subset or of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.
Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with an (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
* #return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there is no strings that matches both regex. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
/**
* #return true if the two regexes will never both match a given string
*/
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
Automaton automaton1 = new RegExp(regex1).toAutomaton();
Automaton automaton2 = new RegExp(regex2).toAutomaton();
return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
I believe kdgregory is correct you're using Orthogonal to mean Complement.
Is this correct?
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that nonesuch exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
I would do the following:
convert each regex to a FSA, using something like the following structure:
struct FSANode
{
bool accept;
Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
CombinedNode(FSANode left, FSANode right)
{
this.left = left;
this.right = right;
}
Map<char, CombinedNode> links;
bool valid { get { return !left.accept || !right.accept; } }
public FSANode left;
public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are know procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.