Efficiently querying one string against multiple regexes - regex

Lets say that I have 10,000 regexes and one string and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster,more efficient way to do it?
EDIT:
I have tried substituting it with DFA's (lex)
The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.

This is the way lexers work.
The regular expressions are converted into a single non deterministic automata (NFA) and possibily transformed in a deterministic automata (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here, they are called "lexer generator" and there are solutions that work with most of the languages.
You don't say which language are you using. For C programmers I would suggest to have a look at the re2c tool. Of course the traditional (f)lex is always an option.

I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string it matches. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithms. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.

We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.

Martin Sulzmann Has done quite a bit of work in this field.
He has a HackageDB project explained breifly here which use partial derivatives seems to be tailor made for this.
The language used is Haskell and thus will be very hard to translate to a non functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.
Also the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1) so may be uninterested/unable to supply any help or input.
this is a joke - I like professors, the less the smart ones try to work the more chance I have of getting paid!

10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful would be: if you had 1000 tests that would only return true if "ello" exists somewhere in the string and your test string is "goodbye;" you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't work, and because of this, you won't have to do them.

If you're thinking in terms of "10,000 regexes" you need to shift your though processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like somethings gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.

Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main Caveat: The patters to match were all language patters not regex patterns e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick
A = ahocorasick.Automaton()
patterns = [
[['cat','dog'],'mammals'],
[['bass','tuna','trout'],'fish'],
[['toad','crocodile'],'amphibians'],
]
for row in patterns:
vals = row[0]
for val in vals:
A.add_word(val, (row[1], val))
A.make_automaton()
_string = 'tom loves lions tigers cats and bass'
def test():
vals = []
for item in A.iter(_string):
vals.append(item)
return vals
Running %timeit test() on my 2000 categories with about 2-3 traces per category and a _string length of about 100,000 got me 2.09 ms vs 631 ms doing sequential re.search() 315x faster!.

You'd need to have some way of determining if a given regex was "additive" compared to another one. Creating a regex "hierarchy" of sorts allowing you to determine that all regexs of a certain branch did not match

You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what what captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have it's matched text. If not, it would be undefined/None/null/...

If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.

I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use ragel. Although it produces a single DFA and can result in many states, and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.

I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
disclaimer: the only one i know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.

I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.

You could compile the regex into a hybrid DFA/Bucchi automata where each time the BA enters an accept state you flag which regex rule "hit".
Bucchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.

I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
/.+ello/ % ello |
any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Their regular expressions are quite limited and the machine language is preferred instead, the braces from your example only work with the more general language.

Try combining them into one big regex?

I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.

The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
List<Regex> matches = new List<Regex>();
foreach (Regex r in regexes)
{
if (r.IsMatch(string))
{
matches.Add(r);
}
}
return matches;
}
Oh, you meant the fastest code? i don't know then....

Related

Regular expression with a given length with an exact number of a character

I'm looking for a regular expression for the language with the exact number of k a's in it.
I'm pretty much stuck at this. For a various length the solution would be easy with .
Does anybody have any advice on how I can achieve such an regex?
I'd use this one :
(b*ab*){k}
It just makes k blocks containing exactly one a. Therefore words have k a.
One of the b* can be factored out on the left or on the right.
There is no simple solution to this.
While that language is regular, it's ugly to describe. You can get it by intersecting the (trivial) DFAs for both languages ((a|b)^n and b*(ab*)^k) with each other, but you'll get a DFA with (n-k)*k states back. And transforming that it into a regular expression won't make it better.
However, if you're looking for an actual implementation it gets much easier. You can simply test the input against both regexes, or you can use lookahead to compose them into one regex:
/^(?=[ab]{n}$)b*(ab*){k}$/
You can use a look ahead to enforce the overall length:
^(?=.{5}$)([^a]*a){2}[^a]*$
See this demonstrated on rubular

Create regex from samples algorithm

AFAIK no one have implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I'd give my algorithm this two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back and you might find 1000 reasons why such algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill a new feature of MS Excel 2013 would do exactly the task you want, but it does not give you the regular expression. It's a NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, Go Flash Fill official website and read a few papers. They have pseudo-code and demo. movies as well.
There are many such algorithm in fact. This is a research area called "Grammatical inference".
I know RPNI, for example. (you could also look on the probabilistic branch, alergia, MDI, DEES). These algorithms generate DSA (Deterministic State Automata). In fact you absolutely don't need to enter the strings in your example. Only substrings.
There are also some algorithms to generate directly Non deterministic automata.
Of course, get the regular expression from an Non Deterministic Automata is easy.
The main ideas are simple:
Generate a PTSA (Prefix Tree State Automata) from your sample.
Then, you have to try to "merge" some states. From these merge, will emerge loops (i.e. * in the regular expression). All the difficulty being to choose the right rule to merge.
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Most efficient method to parse small, specific arguments

I have a command line application that needs to support arguments of the following brand:
all: return everything
search: return the first match to search
all*search: return everything matching search
X*search: return the first X matches to search
search#Y: return the Yth match to search
Where search can be either a single keyword or a space separated list of keywords, delimited by single quotes. Keywords are a sequence of one or more letters and digits - nothing else.
A few examples might be:
2*foo
bar#8
all*'foo bar'
This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.
What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?
It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.
Thanks!
I wouldn't reccomend a full lex/yacc parser just for this. What you described can fit a simple regular expression:
((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?
If you have a regex engine that support captures, it's easy to extract the single pieces of information you need. (Most probably in captures 1,3 and 4).
If I understood what you mean, you will probably want to check that capture 1 and capture 4 are not non-empty at the same time.
If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
Even without regex, I would hand write a function. It would be simpler than dealing with lex/yacc and I guess you could put together something that is even more efficient than a regular expression.
The answer mostly depends on a balance between how much coding you want to do and how much libraries you want to depend on - if your application can depend on other libraries, you can use any of the many regular expression libraries - e.g. POSIX regex which comes with all Linux/Unix flavors.
OR
If you just want those specific syntaxes, I would use the string tokenizer (strtok) - split on '*' and split on '#' - then handle each case.
In this case the strtok approach would be much better since the number of commands to be parsed are few.

In "aa67bc54c9", is there any way to print "aa" 67 times, "bc" 54 times and so on, using regular expressions?

I was asked this question in an interview for an internship, and the first solution I suggested was to try and use a regular expression (I usually am a little stumped in interviews). Something like this
(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)
I thought it would match the strings and store them in the variable "str" and the numbers in the variable "n". How, I was not sure of.
So it matches strings of type "a1b2c3", but a problem here is that it also matches strings of type "a1b". Could anyone suggest a solution to deal with this problem?
Also, is there any other regular expression that could solve this problem?
Do you know why "regular expressions" are called "regular"? :-)
That would be too long to explain, I'll just outline the way. To match a pattern (i.e. decide whether a given string is "valid" or "invalid"), a theoretical informatician would use a finite state automaton. That's an abstract machine that has a finite number of states; each tick it reads a char from the input and jumps to another state. The pattern of where to jump from particular state when a particular character is read is fixed. Some states are marked as "OK", some--as "FAIL", so that by examining state of a machine you can check whether your text is "valid" (i.e. a valid e-mail).
For example, this machine only accepts "nice" as its "valid" word (a pic from Wikipedia):
A set of "valid" words such a machine theoretically can distinguish from invalid is called "regular language". Not every set is a regular language: for example, finite state automata are incapable of checking whether parentheses in string are balanced.
But constructing state machines was a complex task, compared to the complexity of defining what "valid" is. So the mathematicians (mainly S. Kleene) noted that every regular language could be described with a "regular expression". They had *s and |s and were the prototypes of what we know as regexps now.
What does it have to do with the problem? The problem in subject is essentially non-regular. It can't be expressed with anything that works like a finite automaton.
The essence is that it should contain a memory cell that is capable to hold an arbitrary number (repetition count in your case). Finite automata and classical regular expressions can not do this.
However, modern regexps are more expressive and are said to be able to check balanced parentheses! But this may serve as a good example that you shouldn't use regexps for tasks they don't suit. Let alone that it contains code snippets; this makes the expression far from being "regular".
Answering the initial question, you can't solve your problem with using anything "regular" only. However, regexps could be aid you in solving this problem, as in tster's answer
Perhaps, I should look closer to tster's answer (do a "+1" there, please!) and show why it's not the "regular expression" solution. One may think that it is, it just contains print statement (not essential) and a loop--and loop concept is compatible with finite state automaton expressive power. But there is one more elusive thing:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1
x # <--- this one
$2;
}
The task of reading a string and a number and printing repeatedly that string given number of times, where the number is an arbitrary integer, is undoable on a finite state machine without additional memory. You use a memory cell to keep that number and decrease it, and check for it to be greater than zero. But this number may be arbitrarily big, and it contradicts with a finite memory available to the finite state machine.
However, there's nothing wrong with classical pattern /([abc]*){5}/ that matches something "regular" repeated fixed number of times. We essentially have states that correspond to "matched pattern once", "matched pattern twice" ... "matched pattern 5 times". There's finite number of them, and that's the gist of the difference.
how about:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1 x $2;
}
Answering your question directly:
No, regular expressions match text and don't print anything, so there is no way to do it solely using regular expressions.
The regular expression you gave will match one string/number pair; you can then print that repeatedly using an appropriate mechanism. The Perl solution from #tster is about as compact as it gets. (It doesn't use the names that you applied in your regex; I'm pretty sure that doesn't matter.)
The remaining details depend on your implementation language.
Nope, this is your basic 'trick question' - no matter how you answer it that answer is wrong unless you have exactly the answer the interviewer was trained to parrot. See the workup of the issue given by Pavel Shved - note that all invocations have 'not' as a common condition, the tool just keeps sliding: Even when it changes state there is no counter in that state
I have a rather advanced book by Kenneth C Louden who is a college prof on the matter, in which it is stated that the issue at hand is codified as "Regex's can't count." The obvious answer to the question seems to me at the moment to be using the lookahead feature of Regex's ...
Probably depends on what build of what brand of regex the interviewer is using, which probably depends of flight-dynamics of Golf Balls.
Nice answers so far. Regular expressions alone are generally thought of as a way to match patterns, not generate output in the manner you mentioned.
Having said that, there is a way to use regex as part of the solution. #Jonathan Leffler made a good point in his comment to tster's reply: "... maybe you need a better regex library in your language."
Depending on your language of choice and the library available, it is possible to pull this off. Using C# and .NET, for example, this could be achieved via the Regex.Replace method. However, the solution is not 100% regex since it still relies on other classes and methods (StringBuilder, String.Join, and Enumerable.Repeat) as shown below:
string input = "aa67bc54c9";
string pattern = #"([a-z]+)(\d+)";
string result = Regex.Replace(input, pattern, m =>
// can be achieved using StringBuilder or String.Join/Enumerable.Repeat
// don't use both
//new StringBuilder().Insert(0, m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToString()
String.Join("", Enumerable.Repeat(m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToArray())
+ Environment.NewLine // comment out to prevent line breaks
);
Console.WriteLine(result);
A clearer solution would be to identify the matches, loop over them and insert them using the StringBuilder rather than rely on Regex.Replace. Other languages may have compact idioms to handle the string multiplication that doesn't rely on other library classes.
To answer the interview question, I would reply with, "it's possible, however the solution would not be a stand-alone 100% regex approach and would rely on other language features and/or libraries to handle the generation aspect of the question since the regex alone is helpful in matching patterns, not generating them."
And based on the other responses here you could beef up that answer further if needed.

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with an (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would be not orthogonal as the string '2_a' is in the set S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
* #return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there is no strings that matches both regex. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
/**
* #return true if the two regexes will never both match a given string
*/
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
Automaton automaton1 = new RegExp(regex1).toAutomaton();
Automaton automaton2 = new RegExp(regex2).toAutomaton();
return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
I believe kdgregory is correct you're using Orthogonal to mean Complement.
Is this correct?
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that nonesuch exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
I would do the following:
convert each regex to a FSA, using something like the following structure:
struct FSANode
{
bool accept;
Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
CombinedNode(FSANode left, FSANode right)
{
this.left = left;
this.right = right;
}
Map<char, CombinedNode> links;
bool valid { get { return !left.accept || !right.accept; } }
public FSANode left;
public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are know procedures for converting a regular expression to a DFA, and converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.