I have two minimized DFAs and I need to check whether they are equivalent.
If they are equivalent, the problem becomes finding an efficient correspondence between states regardless of their labels. In my case each DFA is a table, so I need to find the permutation that matches the rows of the first DFA with the rows of the second DFA.
I also thought about doing a breadth-first search of each DFA, building the minimal access string for every state, and then comparing the first list with the second (this should be independent of the particular input symbols; for example, 001 and 110 could be interchangeable).
I'm interested both in a direct, inefficient algorithm and in more sophisticated ones.
The right approach is to construct another DFA with:
L3 = (L1 - L2) U (L2 - L1)
and test whether L3 is empty. If L3 is empty then L1 = L2; otherwise L1 ≠ L2.
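For completeness, here is a minimal sketch of that emptiness test in Python. The DFA encoding is my own assumption: each DFA is a (delta, start, finals) triple with a total transition table delta mapping (state, char) to a state, and both DFAs are complete over the same alphabet. It explores the product automaton and reports inequivalence as soon as it reaches a pair of states of which exactly one is accepting, i.e. as soon as L3 is known to be non-empty:

def dfas_equivalent(d1, d2, alphabet):
    # Each DFA is a (delta, start, finals) triple with delta: (state, char) -> state.
    (t1, s1, f1), (t2, s2, f2) = d1, d2
    seen = {(s1, s2)}
    stack = [(s1, s2)]
    while stack:
        p, q = stack.pop()
        if (p in f1) != (q in f2):
            return False          # a word accepted by exactly one DFA exists, so L3 is non-empty
        for c in alphabet:
            nxt = (t1[(p, c)], t2[(q, c)])
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return True                   # no such pair is reachable, L3 is empty, hence L1 = L2

The search visits at most |Q1| x |Q2| state pairs, which is the same product construction the symmetric-difference approach relies on.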
I found these algorithms:
- Symmetric difference
- Table-filling algorithm
- Faster Table-Filling algorithm O(n^2)
- Hopcroft algorithm
- Nearly Linear algorithm by Hopcroft and Karp
A complete reference is:
Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP (Norton, 2009)
I accepted this answer of mine because the one by #abbaasi is too incomplete.
I will accept any other answer with a significant contribution.
I recall that a minimal DFA is unique. So if you have two minimized DFAs, I think you only need to check whether they're the same (up to a relabelling of states).
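Since both inputs are already minimal, "the same up to a relabelling" can be checked with one parallel traversal that builds the relabelling as it goes. A rough sketch, assuming each DFA is a (delta, start, finals) triple with delta a dict mapping (state, char) to a state, and that both DFAs are complete over the same alphabet:

def minimal_dfas_isomorphic(d1, d2):
    (t1, s1, f1), (t2, s2, f2) = d1, d2
    alphabet = {c for (_, c) in t1} | {c for (_, c) in t2}
    mapping = {s1: s2}            # candidate relabelling: state of d1 -> state of d2
    stack = [s1]
    while stack:
        p = stack.pop()
        q = mapping[p]
        if (p in f1) != (q in f2):
            return False          # accepting status must agree
        for c in alphabet:
            if ((p, c) in t1) != ((q, c) in t2):
                return False      # both or neither must define this transition
            if (p, c) not in t1:
                continue
            np, nq = t1[(p, c)], t2[(q, c)]
            if np in mapping:
                if mapping[np] != nq:
                    return False  # inconsistent relabelling
            else:
                mapping[np] = nq
                stack.append(np)
    # the relabelling must be one-to-one
    return len(mapping) == len(set(mapping.values()))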
I implemented a Levenshtein trie to find words similar to a given word.
My goal was to have a fast way to do spell correction.
However, I found out that there is an even faster way to do that:
Levenshtein Automata
I just have a problem... I understand not a word of what is written here.
Can someone explain the idea and the basic functionality of a Levenshtein automaton in simple words?
Can someone explain the idea and the basic functionality of a Levenshtein automaton in simple words?
A deterministic finite automaton (DFA) consists of:
an alphabet (set of possible input characters)
a set of states (just abstract objects with no special properties)
a transition function (given any state and an input character, it returns a unique state)
a distinguished start state
a set of accepting states.
You can draw a DFA as a diagram like those in the paper. Conventionally, circular nodes are states. Directed edges each labeled with one character are transitions. Accepting states are marked as double-line circles. The start state has an inward pointing arrow with nothing at its tail.
A DFA accepts word W if and only if you can move a marker from the start state along transitions whose concatenated labels equal W to an accepting state. That is, if T is the transition function, and W = "cat", then T(T(T(Start, 'c'), 'a'), 't') must be an accepting state. Cycles in the transition function allow DFAs to accept strings of arbitrary length even though the DFA is finite.
In software a DFA is a simple loop and a table T(state, char) that implements the transition function.
def run_dfa(T, start, accepting, word):
    state = start
    for c in word:
        state = T[state, c]       # table lookup implementing the transition function
    return state in accepting     # True = ACCEPT, False = REJECT
The Wikipedia page on DFAs is not bad.
DFAs have nice properties. Accepting/rejecting an input of length N requires O(N) time (as long as the transition function runs in constant time). There is a unique minimum version of every DFA (among all those accepting the same set of words) and an efficient algorithm to find that minimum DFA. It's easy to compare DFAs for equality in time linear in the DFAs' size.
A Levenshtein Automaton L(W, d) for word W and Levenshtein distance d is just a DFA that accepts all words having Levenshtein distance at most d from W. That is, the automaton accepts W plus a bunch of other words that are W with no more than d "mistakes" in the usual sense of Levenshtein distance.
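To make that concrete, here is a small Python sketch that simulates the classical Levenshtein NFA for L(W, d) directly, tracking the set of reachable (position in W, edits used) states. This is only an illustration of which words the automaton accepts, not the paper's fast DFA construction:

def lev_accepts(W, d, word):
    def closure(states):
        # epsilon transitions model deleting a character of W
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < len(W) and e < d and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen
    states = closure({(0, 0)})
    for c in word:
        nxt = set()
        for i, e in states:
            if i < len(W) and c == W[i]:
                nxt.add((i + 1, e))              # match
            if e < d:
                nxt.add((i, e + 1))              # c is an inserted character
                if i < len(W):
                    nxt.add((i + 1, e + 1))      # c substitutes W[i]
        states = closure(nxt)
        if not states:
            return False
    return any(i == len(W) for i, e in states)   # all of W accounted for within d edits

For example, lev_accepts("cat", 1, "ct") is True (one deletion) while lev_accepts("cat", 1, "dog") is False.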
The contribution of the paper is a fast algorithm for computing Levenshtein DFAs and then a more advanced algorithm that accomplishes the same thing without computing the DFA explicitly.
Gene has provided a fantastic high-level description of Levenshtein Automata!
With that said, if you're looking for some code to further your understanding, you may find the Java LevenshteinAutomaton library useful. It implements both algorithms described in the paper you stumbled upon (among others), and is well-structured, easy to follow, and extensively commented. It is also maintained by yours truly :) .
Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?
It is probably unrealistic to find such an algorithm for grammars of regular expressions with very "complicated" syntax (including arbitrary repetitions, assertions, etc.), so let's start with a simplified one which only allows an OR of substrings:
foo(a|b|cd)bar should match fooabar, foobbar and foocdbar.
Examples
Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired output of the algorithm would be h_(q1|p2)_(a|b|c).
Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired output of the algorithm would be h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct because that expands to 4 strings, including h_p2_b, which was not in the original set of strings.
Use case
I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.
(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than the best. The trivial answer is of course to just concatenate all strings like A|B|C|D|... which is, however, not helpful.)
There is a straightforward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction (|) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)
The procedure is simple, if you have access to standard FSM algorithms:
Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. Clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.
Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst-case for the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.
Minimize the DFA.
DFA minimization is a well-studied problem. The Hopcroft algorithm is a worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of the Hopcroft algorithm is much better.
For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.
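To make the procedure concrete, here is a minimal Python sketch of the prefix-tree part (steps 1 and 2). It only factors out common prefixes; the DFA-minimization step (step 3) would additionally merge common suffixes, turning, say, h_(q1_(a|b|c)|p2_(a|b|c)) into h_(q1|p2)_(a|b|c). The '#' end-of-word marker is an ad-hoc convention of this sketch:

import re

def trie_to_regex(words):
    # Build a prefix tree; '#' marks the end of a word.
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node['#'] = {}
    def emit(node):
        branches = []
        for ch, child in sorted(node.items()):
            if ch == '#':
                branches.append('')                 # one of the words ends here
            else:
                branches.append(re.escape(ch) + emit(child))
        if len(branches) == 1:
            return branches[0]
        return '(' + '|'.join(branches) + ')'
    return emit(trie)

print(trie_to_regex(['h_q1_a', 'h_q1_b', 'h_p2_a']))   # h_(p2_a|q1_(a|b))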
You can try to use the Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be somewhat easy to generate the simplified regex. Taking your input strings as an example:
h_q1_a
h_q1_b
h_q1_c
h_p2_a
h_p2_b
h_p2_c
will generate a finite state machine that most probably looks like this:
       [h_]          <- level 0
      /    \
   [q1]    [p2]      <- level 1
      \    /
       [_]           <- level 2
      / | \
     a  b  c         <- level 3
Now, for every level/depth of the trie, all the strings (if there are multiple) go inside OR brackets, so
h_(q1|p2)_(a|b|c)
L0   L1  L2  L3
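A quick sketch of that per-level grouping, taking the '_' separators as the level boundaries instead of building the trie explicitly (the function name and the assumption that every label has the same number of segments are mine). Note that, like the output above, it can over-generalize when the label set is not a full cross product of its segments:

def per_level_regex(labels, sep='_'):
    # Assumes every label splits into the same number of sep-separated segments.
    columns = zip(*(label.split(sep) for label in labels))
    parts = []
    for col in columns:
        values = sorted(set(col))
        parts.append(values[0] if len(values) == 1 else '(' + '|'.join(values) + ')')
    return sep.join(parts)

print(per_level_regex(['h_q1_a', 'h_q1_b', 'h_q1_c', 'h_p2_a', 'h_p2_b', 'h_p2_c']))
# h_(p2|q1)_(a|b|c)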
I am looking at the time complexity analysis of converting DFAs to regular expressions in "Introduction to Automata Theory, Languages, and Computation", 2nd edition, page 151, by Ullman et al. This method is sometimes referred to as the transitive closure method. I don't understand how they came up with the 4^n factor in the O(n^3 * 4^n) time complexity.
I understand that the 4^n factor holds regarding space complexity, but, regarding time complexity, it seems that we are performing only four constant-time operations for each pair of states at each iteration, using the results of the previous iterations. What exactly am I missing?
It's a crude bound on the complexity of an algorithm that isn't using the right data structures. I don't think that there's much to explain other than that the authors clearly did not care to optimize here, probably because their main point was that regular expressions are at least as expressive as DFAs and because they feel that it's pointless to optimize this exponential-time algorithm.
There are three nested loops of n iterations each; the regular expressions constructed during iteration k of the outer loop inductively have size O(4^k), since they are constructed from at most four regular expressions constructed during the previous iteration. If the algorithm copies these subexpressions and we overestimate the regular-expression size bound at O(4^n) for all iterations, then we get O(n^3 4^n).
Obviously we can do better. Without eliminating the copying, we can get O(sum_{k=1}^n n^2 4^k) = O(n^2 (n + 4^n)) by bounding the geometric sum properly. Moreover, as you point out, we don't need to copy at all, except at the end if we agree with templatetypedef that the output must be completely written out, giving a running time of O(n^3) to prepare the regular expression and O(4^n) to write it out. The space complexity for this version equals the time complexity.
I suppose your doubt is about the O(n^3) time complexity.
Let us assume R_ij^(k) represents the set of all strings that take the automaton from state q_i to state q_j without passing through any state numbered higher than k.
Then the iterative formula for R_ij^(k) is shown below:
R_ij^(k) = R_ik^(k-1) (R_kk^(k-1))* R_kj^(k-1) + R_ij^(k-1)
This technique is similar to the all-pairs shortest path problem. The only difference is that we are taking the union and concatenation of regular expressions instead of summing up distances. The time complexity of the all-pairs shortest path problem is O(n^3), so we can expect the same complexity for the DFA-to-regular-expression conversion as well. The same method can also be used to convert NFAs and ε-NFAs to the corresponding regular expressions.
The main problem of transitive closure approach is that it creates very large regular expressions. This large length is due to the repeated union of concatenated terms.
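For reference, here is a compact Python sketch of the transitive closure construction being discussed. Regular expressions are kept as plain strings and copied naively, which is exactly why the intermediate expressions blow up as described; the encoding of the DFA (states 0..n-1, a dict delta mapping (state, char) to state) is my own assumption:

def dfa_to_regex(n, delta, start, accepting):
    # Helpers over "regex or None", where None is the empty language and '' is epsilon.
    def union(a, b):
        if a is None: return b
        if b is None: return a
        if a == b: return a
        return '(' + a + '|' + b + ')'
    def concat(a, b):
        if a is None or b is None: return None
        return a + b
    def star(a):
        if a is None or a == '': return ''
        return '(' + a + ')*'
    # R[i][j] holds a regex for the paths i -> j allowed so far
    R = [[None] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = ''                                   # epsilon
    for (i, c), j in delta.items():
        R[i][j] = union(R[i][j], c)
    # Allow intermediate states one at a time (same shape as Floyd-Warshall)
    for k in range(n):
        R = [[union(R[i][j],
                    concat(R[i][k], concat(star(R[k][k]), R[k][j])))
              for j in range(n)] for i in range(n)]
    result = None
    for f in accepting:
        result = union(result, R[start][f])
    return result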
How can you find the shortest distance between a regexp and a non-matching string? If it is possible, please explain how.
Regarding what "distance" means: "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, turning xyz into XYZ would take 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or, for instance, 123, then a23 would be closer to the pattern than ab3.
The distance above is the Damerau–Levenshtein distance.
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want efficient, approximate searching, a less flexible but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.
If you mean the string with the smallest Levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the regex to a DFA yourself, then try to match, and whenever something fails, non-deterministically continue as if it had passed while keeping track of the number of differences. You could use A* search or something similar for this; it would be quite inefficient though (O(2^n) worst case).
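Here is a rough Python sketch of that idea, assuming the regex has already been converted to a complete DFA (the (delta, start, accepting, alphabet) encoding is mine, and states are assumed to be plain integers so the heap entries compare cleanly). It runs Dijkstra over (DFA state, characters of the sample consumed) pairs, where matching a character costs 0 and an insertion, deletion, or substitution costs 1:

import heapq

def distance_to_dfa(s, delta, start, accepting, alphabet):
    # Minimum number of edits to s so that the edited string is accepted by the DFA.
    best = {}
    heap = [(0, start, 0)]                     # (cost, dfa_state, chars of s consumed)
    while heap:
        cost, q, i = heapq.heappop(heap)
        if best.get((q, i), float('inf')) <= cost:
            continue
        best[(q, i)] = cost
        if i == len(s) and q in accepting:
            return cost                         # first settled goal is optimal (Dijkstra)
        if i < len(s):
            heapq.heappush(heap, (cost + 1, q, i + 1))          # delete s[i]
        for c in alphabet:
            nq = delta[(q, c)]
            if i < len(s):
                step = 0 if c == s[i] else 1                     # match or substitute
                heapq.heappush(heap, (cost + step, nq, i + 1))
            heapq.heappush(heap, (cost + 1, nq, i))              # insert c
    return None                                 # the DFA accepts no string at all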
I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?
There's no easy way.
As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.
Given any pair of NFA, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).
The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFA. Alternatively, if you've already shown how to compute the complement of a NFA, then you can (DeMorgan's law style) get the intersection by complement(union(complement(A),complement(B))).
Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:
Some classes of regular languages can only be described by deterministic finite automata whose size grows exponentially in the size of the shortest equivalent regular expressions. The standard examples here are the languages L_k consisting of all strings over the alphabet {a,b} whose kth-last letter equals a.
By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:
if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')
and X ->a Y (in the other NFA)
then (A,X) ->a (B,Y) in the intersection
(C,Z) is final iff C is final in the one NFA and Z is final in the other.
To start the process off, you start in the pair of start states for the two NFAs e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. You'd store the fact that you expanded a state's arcs (e.g. in a hash table) and end up exploring all the states reachable from the start.
If you allow epsilon transitions (that don't output a letter), that's fine:
if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.
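A small Python sketch of this product construction plus the emptiness check. The NFA encoding is my own assumption: each NFA is a (transitions, start, finals) triple where transitions maps a state to a list of (label, next_state) pairs and a label of None marks an epsilon transition:

def intersection_nonempty(nfa1, nfa2):
    # Returns True iff some string is accepted by both NFAs,
    # i.e. the intersection language is non-empty.
    (t1, s1, f1), (t2, s2, f2) = nfa1, nfa2
    stack, seen = [(s1, s2)], {(s1, s2)}
    while stack:
        a, x = stack.pop()
        if a in f1 and x in f2:
            return True                       # a final pair (C,Z) of the product is reachable
        succs = []
        # epsilon moves advance one side at a time
        succs += [(b, x) for lab, b in t1.get(a, []) if lab is None]
        succs += [(a, y) for lab, y in t2.get(x, []) if lab is None]
        # labelled moves must agree on the letter
        for lab1, b in t1.get(a, []):
            if lab1 is None:
                continue
            for lab2, y in t2.get(x, []):
                if lab2 == lab1:
                    succs.append((b, y))
        for pair in succs:
            if pair not in seen:
                seen.add(pair)
                stack.append(pair)
    return False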
This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).
This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
In theory, the problem you describe is impossible.
In practice, if you have a manageable number of regular expressions that use a limited subset of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.
Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.