Find simplest regular expression matching all given strings - regex

Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?
It is probably unrealistic to find such an algorithm for grammars of regular expressions with very "complicated" syntax (arbitrary repetitions, assertions, etc.), so let's start with a simplified one which only allows an OR of substrings:
foo(a|b|cd)bar should match fooabar, foobbar and foocdbar.
Examples
Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired output of the algorithm would be h_(q1|p2)_(a|b|c).
Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired output of the algorithm would be h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct, because that would expand to 4 strings, including h_p2_b, which was not in the original set of strings.
Use case
I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.
(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than an optimal one. The trivial answer is of course to just concatenate all strings like A|B|C|D|..., which is, however, not helpful.)

There is a straightforward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction (|) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)
The procedure is simple, if you have access to standard FSM algorithms:
Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. This is clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.
Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst case of the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it into a DFA at the end. The prefix tree cannot have more nodes than the original number of characters (and can have exactly that many nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.
Minimize the DFA.
DFA minimization is a well-studied problem. Hopcroft's algorithm is a worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of Hopcroft's algorithm is much better.
For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.
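Putting the first two steps together: below is a minimal Python sketch (the function names are mine) that builds the prefix tree directly and emits a concatenation/disjunction regex from it. It deliberately skips the minimization step, so shared suffixes are not factored out the way a minimized DFA would factor them.

import re

def build_trie(strings):
    """Steps 1-2 combined: a prefix tree as nested dicts; "" marks end of word."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[""] = {}  # accept marker
    return root

def trie_to_regex(node):
    """Emit a regex using only concatenation and '|' (plus '?' when one
    string is a proper prefix of another)."""
    alts, optional = [], False
    for ch, child in sorted(node.items()):
        if ch == "":
            optional = True
        else:
            alts.append(re.escape(ch) + trie_to_regex(child))
    if not alts:
        return ""
    body = alts[0] if len(alts) == 1 else "(" + "|".join(alts) + ")"
    return "(" + body + ")?" if optional else body

print(trie_to_regex(build_trie(["h_q1_a", "h_q1_b", "h_p2_a"])))
# h_(p2_a|q1_(a|b))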

You can try to use the Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be reasonably easy to generate the simplified regex. Your example input strings:
h_q1_a
h_q1_b
h_q1_c
h_p2_a
h_p2_b
h_p2_c
will generate a finite state machine that will most probably look like this:
        [h_]          <- level 0
       /    \
    [q1]    [p2]      <- level 1
       \    /
        [_]           <- level 2
       / | \
      a  b  c         <- level 3
Now, for every level/depth of the trie, all the strings (if there are multiple) go under an OR bracket, so
h_(q1|p2)_(a|b|c)
L0   L1  L2 L3
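For what it's worth, this level-wise grouping is only exact when the label set is a full cross product of the per-level alternatives (recall the h_(q1|p2)_(a|b) pitfall from the question). A small sketch under that assumption, with a made-up helper name, assuming the labels are '_'-separated:

from itertools import product

def levelwise_regex(labels, sep="_"):
    parts = [s.split(sep) for s in labels]
    depth = len(parts[0])
    assert all(len(p) == depth for p in parts), "labels must split to equal depth"
    levels = [sorted({p[i] for p in parts}) for i in range(depth)]
    # The grouping is exact only if the labels form a full cross product;
    # otherwise it would over-match.
    if set(map(tuple, parts)) != set(product(*levels)):
        raise ValueError("labels are not a full cross product")
    return sep.join(x[0] if len(x) == 1 else "(" + "|".join(x) + ")"
                    for x in levels)

print(levelwise_regex(["h_q1_a", "h_q1_b", "h_q1_c",
                       "h_p2_a", "h_p2_b", "h_p2_c"]))
# h_(p2|q1)_(a|b|c)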

Related

Is there a program or a way to automatically change ab|ac|an|db|df to (a(b|c|n)|d(b|f))?

Rather than testing all the OR alternatives, it would first check whether the first character matches. The title explains it all.
When a regex is processed, it is first converted to a non-deterministic finite-state automaton (NFA). Then the NFA is converted to a deterministic finite-state automaton (DFA), which typically results in a larger number of states. The third step is to convert the DFA into a "minimal" DFA with the fewest number of states.
Two regexes are equivalent if and only if they result in identical minimal DFAs. Hence, the two regexes ab|ac|an|db|df and (a(b|c|n)|d(b|f)) which are clearly equivalent result in the same minimal DFA and will perform pattern matching at the same speed. There is no particular reason to prefer one over the other.
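That said, for the concrete transformation the question asks about, one level of factoring is enough: group the alternatives by their first character. A minimal sketch (the function name is mine; it assumes each alternative is a plain non-empty string whose first character is the prefix to group on):

from collections import defaultdict

def factor(alternatives):
    groups = defaultdict(list)
    for alt in alternatives:
        groups[alt[0]].append(alt[1:])   # split off the first character
    parts = []
    for head, tails in sorted(groups.items()):
        if tails == [""]:
            parts.append(head)           # only one alternative with this head
        else:
            parts.append(head + "(" + "|".join(sorted(tails)) + ")")
    return "(" + "|".join(parts) + ")"

print(factor(["ab", "ac", "an", "db", "df"]))  # (a(b|c|n)|d(b|f))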

Compiler matching string to regex, using NFA

I was reading about compilers, in the chapter about lexical analyzers (scanners), and I'm puzzled by the following statement:
For an input string X and a regular expression R, the complexity for finding a match using non-deterministic finite automata is:
O(len R * len X)
How can the complexity be polynomial in len R?
I'm under the impression that it depends exponentially on len R, because whenever we have a character which may appear a variable number of times (i.e. followed by the * symbol), we must test all possible numbers of occurrences. If we have multiple characters which appear a variable number of times, we must check all possibilities (by backtracking).
Where am I wrong?
we must check all possibilities (by backtracking).
Not necessarily by backtracking. There are many ways to implement an NFA. By moving through the input linearly and transitioning to multiple states at the same time (storing the set of active states in an O(1)-lookup structure), you get exactly the mentioned runtime: the number of states in the NFA is linear in the length of the regex.
See also the popular article Regular Expression Matching Can Be Simple And Fast.
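To make the set-of-states idea concrete, here is a minimal sketch of the simulation over a tiny hand-rolled NFA (a hypothetical representation, not a real regex engine): it keeps the current set of states, follows epsilon closures, and makes one pass over the input, so the total work is O(len R * len X).

def run_nfa(trans, start, accept, text):
    """trans: {state: [(label, next_state), ...]}; a label of None is epsilon."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for label, t in trans.get(s, []):
                if label is None and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    current = closure({start})
    for ch in text:                      # one linear pass over the input
        moved = {t for s in current
                   for label, t in trans.get(s, []) if label == ch}
        current = closure(moved)
    return accept in current

nfa = {0: [('a', 0), ('b', 1)]}          # a hand-built NFA for a*b
print(run_nfa(nfa, 0, 1, "aaab"))        # True
print(run_nfa(nfa, 0, 1, "aba"))         # False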

Is there a way to negate a regular expression?

Given a regular expression R that describes a regular language (no fancy backreferences), is there an algorithmic way to construct a regular expression R* that describes the language of all words except those described by R? It should be possible, as Wikipedia says:
The regular languages are closed under the various operations, that is, if the languages K and L are regular, so is the result of the following operations: […] the complement ¬L
For example, given the alphabet {a,b,c}, the inverse of the language (abc*)+ is (a|(ac|b|c).*)?
As DPenner has already pointed out in the comments, the inverse of a regular expression can be exponentially larger than the original expression. This makes inverting regular expressions unsuitable for implementing negative partial-expression syntax for searching purposes. Is there an algorithm that preserves the O(n*m) runtime characteristic (where n is the size of the regex and m is the length of the input) of regular expression matching and allows for negated subexpressions?
Unfortunately, the answer given by nhahtdh in the comments is as good as we can do (so far). Deciding whether a given regular expression generates all strings (the universality problem) is PSPACE-complete. Since NP is contained in PSPACE, every problem in NP reduces to the universality problem, so an efficient solution to it would imply that P = NP.
If there were an efficient solution to your problem, would you be able to resolve the universality problem? Sure you would.
Use your efficient algorithm to generate a regular expression for the negation;
Determine whether the resulting regular expression generates the empty set.
Note that the problem "given a regular expression, does it generate the empty set" is fairly straightforward:
The regular expression {} generates the empty set.
(r + s) generates the empty set iff both r and s generate the empty set.
(rs) generates the empty set iff either r or s generates the empty set.
Nothing else generates the empty set.
Basically, it's pretty easy to tell whether a regular expression generates the empty set: just start evaluating the regular expression.
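Those rules translate directly into a recursive check over a regex syntax tree. A minimal sketch (the tuple-based AST is a made-up representation for illustration):

def is_empty(node):
    kind = node[0]
    if kind == "empty":                        # the {} expression
        return True
    if kind == "union":                        # (r + s): empty iff both are
        return is_empty(node[1]) and is_empty(node[2])
    if kind == "concat":                       # (r s): empty iff either is
        return is_empty(node[1]) or is_empty(node[2])
    return False                               # literals, epsilon, star: never empty

r = ("concat", ("lit", "a"), ("union", ("empty",), ("lit", "b")))
print(is_empty(r))  # False: a({}|b) still generates "ab"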
(Note that while the above procedure is efficient in terms of the output length, it might not be efficient in terms of the input length, if the output is more than polynomially larger than the input. However, if that were the case, we'd have the same result anyway, i.e., that your algorithm isn't really efficient, since it would take exponentially many steps to generate an exponentially longer output from a given input.)
Wikipedia says: "... if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions." We can deduce from this statement that there is an infinite number of expressions that describe the language of all words except those described by R.
Again (as #nhahtdh also tried to explain), the simplest way to address this question is to step outside the regular expression language itself: match the strings you want to exclude (which form a finite, workable subset) using the original regular expression, and then treat any failure to match as an actual match (out of the infinite set of other possibilities). So, if the result of the match is negative, your candidate strings are a subset of the valid solutions.
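In Python, that amounts to keeping the original R and inverting the result of a full match, rather than constructing the negated expression itself. A minimal sketch using the question's example language:

import re

R = re.compile(r"(abc*)+")            # the example language over {a,b,c}

def in_complement(s):
    # any failure to fully match R counts as a match of the complement
    return R.fullmatch(s) is None

print(in_complement("abcc"))  # False: "abcc" is in (abc*)+
print(in_complement("acb"))   # True:  "acb" is not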

Create regex from samples algorithm

AFAIK no one has implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I gave my algorithm these two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give back more than one regex, or even no possible regex, and you might find 1000 reasons why such an algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about the regex itself, actually. Basically, what I want is an algorithm that takes samples like the ones above and finds a pattern in them that can be used to easily find the "kind" of text I want in a string, without having to write any regex or code manually.
I don't have a proof, but I suspect no such discrete algorithm with a finite output can exist, since you are asking for the creation of a regular language which could be "large" with respect to the input size.
With that said, I suggest you take a peek at txt2re, which can break down sample texts one by one and help you build regexes.
Flash Fill, a new feature of MS Excel 2013, does exactly the task you want, but it does not give you the regular expression. It's an NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulations from multiple examples, go to the official Flash Fill website and read a few papers. They have pseudo-code and demo movies as well.
There are in fact many such algorithms. This is a research area called "grammatical inference".
I know RPNI, for example (you could also look at the probabilistic branch: ALERGIA, MDI, DEES). These algorithms generate DSA (Deterministic State Automata). In fact, you absolutely don't need to enter the full strings from your example, only the substrings.
There are also some algorithms that directly generate non-deterministic automata.
Of course, obtaining a regular expression from a non-deterministic automaton is easy.
The main ideas are simple:
Generate a PTSA (Prefix Tree State Automaton) from your sample.
Then try to "merge" some states. From these merges, loops will emerge (i.e. * in the regular expression). All the difficulty lies in choosing the right merging rule.
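A very rough sketch of those two steps (this is not RPNI itself; the merge below is deliberately naive, does no consistency checking, and a real algorithm would also fold together the targets whenever a merge makes two same-label edges collide):

def build_pta(samples):
    """Prefix Tree State Automaton: transitions as {(state, char): state}."""
    delta, accept, counter = {}, set(), [0]
    def fresh():
        counter[0] += 1
        return counter[0]
    for s in samples:
        q = 0
        for ch in s:
            q = delta.setdefault((q, ch), fresh())
        accept.add(q)
    return delta, accept

def merge(delta, accept, p, q):
    """Redirect every edge touching q onto p (naive, may drop conflicting edges)."""
    new = {}
    for (s, ch), t in delta.items():
        new[(p if s == q else s, ch)] = p if t == q else t
    return new, {p if s == q else s for s in accept}

delta, accept = build_pta(["a", "aa", "aaa"])
# Merging every state into the root creates a self-loop on 'a': the PTA for
# {a, aa, aaa} generalizes to the expression a*.
for state in sorted(set(delta.values()), reverse=True):
    delta, accept = merge(delta, accept, 0, state)
print(delta, accept)  # {(0, 'a'): 0} {0}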
Here you go:
import re
pattern = '|'.join(map(re.escape, substrings))  # escape regex metacharacters
If you want anything more general, your algorithm is going to have to make educated guesses about what types of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for every possible input set without simply enumerating all its members. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

How can you detect if two regular expressions overlap in the strings they can match?

I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?
There's no easy way.
As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.
Given any pair of NFA, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).
The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those pairs in which both components are final in their original DFAs. Alternatively, if you've already shown how to compute the complement of an NFA, then you can (De Morgan's law style) get the intersection via complement(union(complement(A), complement(B))).
Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:
Some classes of regular languages can only be described by deterministic finite automata whose size grows exponentially in the size of the shortest equivalent regular expressions. The standard example here are the languages L_k consisting of all strings over the alphabet {a,b} whose kth-last letter equals a.
By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:
if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')
and X ->a Y (in the other NFA)
then (A,X) ->a (B,Y) in the intersection
(C,Z) is final iff C is final in the one NFA and Z is final in the other.
To start the process off, you begin in the pair of start states of the two NFAs, e.g. (A,X); this is the start state of the intersection NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two component states, and then visit all the (new) states those arcs reach. Store the fact that you've expanded a state's arcs (e.g. in a hash table), and you end up exploring exactly the states reachable from the start.
If you allow epsilon transitions (that don't output a letter), that's fine:
if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.
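A minimal sketch of this product construction plus the search-based emptiness test, for tiny hand-rolled NFAs without epsilon arcs (the dict representation is made up for illustration):

def intersection_nonempty(nfa1, start1, final1, nfa2, start2, final2):
    """nfa: {state: [(letter, next_state), ...]}. True iff some string is
    accepted by both automata."""
    stack, seen = [(start1, start2)], {(start1, start2)}
    while stack:
        a, x = stack.pop()
        if a in final1 and x in final2:
            return True                       # reached a final pair
        for la, b in nfa1.get(a, []):
            for lx, y in nfa2.get(x, []):
                if la == lx and (b, y) not in seen:   # (A,X) ->a (B,Y)
                    seen.add((b, y))
                    stack.append((b, y))
    return False

n1 = {0: [('a', 1)], 1: [('a', 1)]}           # accepts a+
n2 = {0: [('a', 0), ('b', 1)]}                # accepts a*b
print(intersection_nonempty(n1, 0, {1}, n2, 0, {1}))  # False: no common string
n3 = {0: [('a', 0), ('a', 1)]}                # also accepts a+
print(intersection_nonempty(n1, 0, {1}, n3, 0, {1}))  # True: e.g. "a"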
This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).
This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
In theory, the problem you describe is impossible.
In practice, if you have a manageable number of regular expressions that use a limited subset of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.
Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.