I am looking at the time complexity analysis of converting DFAs to regular expressions in
"Introduction to Automata Theory, Languages, and Computation", 2nd edition, page 151, by Hopcroft, Motwani, and Ullman. This method is sometimes referred to as the transitive closure method. I don't understand how they came up with the 4^n factor in the O((n^3)*(4^n)) time complexity.
I understand that the 4^n bound holds for space complexity, but, regarding time complexity, it seems that we perform only four constant-time operations for each pair of states at each iteration, using the results of the previous iterations. What exactly am I missing?
It's a crude bound on the complexity of an algorithm that isn't using the right data structures. I don't think that there's much to explain other than that the authors clearly did not care to optimize here, probably because their main point was that regular expressions are at least as expressive as DFAs and because they feel that it's pointless to optimize this exponential-time algorithm.
There are three nested loops of n iterations each; the regular expressions constructed during iteration k of the outer loop inductively have size O(4^k), since they are constructed from at most four regular expressions constructed during the previous iteration. If the algorithm copies these subexpressions and we overestimate the regular-expression size bound at O(4^n) for all iterations, then we get O(n^3 4^n).
Obviously we can do better. Without eliminating the copying, we can get O(sum_{k=1}^n n^2 4^k) = O(n^2 (n + 4^n)) by bounding the geometric sum properly. Moreover, as you point out, we don't need to copy at all, except at the end if we agree with templatetypedef that the output must be completely written out, giving a running time of O(n^3) to prepare the regular expression and O(4^n) to write it out. The space complexity for this version equals the time complexity.
I suppose your doubt is about the n^3 time complexity.
Let us assume R_ij^k represents the set of all strings that take the automaton from state q_i to state q_j without passing through any state numbered higher than k.
Then the iterative formula for R_ij^k is shown below:
R_ij^k = R_ik^(k-1) (R_kk^(k-1))* R_kj^(k-1) + R_ij^(k-1)
This technique is similar to the all-pairs shortest path problem (cf. Floyd-Warshall). The only difference is that we take the union and concatenation of regular expressions instead of summing up distances. The time complexity of the all-pairs shortest path problem is O(n^3), so we can expect the same complexity for the DFA-to-regular-expression conversion as well. The same method can also be used to convert NFAs and ε-NFAs to corresponding regular expressions.
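For concreteness, here is a minimal Python sketch of the recurrence (the DFA encoding and all names are my own illustration, not from the book). Because expressions are built as plain strings, each new entry is assembled from four copied subexpressions of the previous round, which is exactly where the 4^n size and time factor comes from:

    def dfa_to_regex(n, delta, start, finals):
        # Sketch of the transitive closure construction.  delta maps
        # (i, j) -> iterable of symbols labelling an i -> j edge; states
        # are numbered 1..n.  None encodes the empty language, and the
        # literal token 'ε' encodes the empty string (a real
        # implementation would simplify it away).
        EPS = 'ε'

        def union(a, b):
            if a is None: return b
            if b is None: return a
            return '(' + a + '|' + b + ')'

        def concat(a, b):
            if a is None or b is None: return None
            if a == EPS: return b
            if b == EPS: return a
            return a + b

        def star(a):
            return EPS if a is None else '(' + a + ')*'

        # Base case R_ij^0: direct edges, plus ε when i == j.
        R = {(i, j): None for i in range(1, n + 1) for j in range(1, n + 1)}
        for (i, j), syms in delta.items():
            for s in syms:
                R[i, j] = union(R[i, j], s)
        for i in range(1, n + 1):
            R[i, i] = union(R[i, i], EPS)

        # R_ij^k = R_ik^(k-1) (R_kk^(k-1))* R_kj^(k-1) + R_ij^(k-1).
        # Three nested loops; each new entry copies four old strings,
        # so the expression size can quadruple per iteration of k.
        for k in range(1, n + 1):
            Rk = {}
            for i in range(1, n + 1):
                for j in range(1, n + 1):
                    through_k = concat(concat(R[i, k], star(R[k, k])),
                                       R[k, j])
                    Rk[i, j] = union(through_k, R[i, j])
            R = Rk

        result = None
        for f in finals:
            result = union(result, R[start, f])
        return result

    # e.g. a 2-state DFA over {0, 1} accepting strings with an odd
    # number of 1s (state 1 = even, state 2 = odd):
    print(dfa_to_regex(2, {(1, 1): '0', (1, 2): '1',
                           (2, 2): '0', (2, 1): '1'}, 1, [2]))

The output is unsimplified and already quite long for two states, which illustrates the blow-up the next paragraph describes.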
The main problem with the transitive closure approach is that it creates very large regular expressions; the great length comes from the repeated copying of concatenated terms into unions.
So I have the following task, where I have to explain whether or not it's possible to construct a regular expression subject to these constraints:
Numbers N < 1,000,000 which contain the digit 1 exactly as often as the digit 2
My task is not to actually write the regular expression, but just to reason about whether it's possible or not.
My conclusion is that it's perfectly possible, but I would have to write down every possible string in my regex, which would be extremely long.
So far I have the following regular expression:
(([3-9]*[1]{n}[3-9]*[2]{n}[3-9]*){0,1000})
Where n iterates towards its maximum, bounded by the limit of 1,000,000.
I know that iterators are not a "thing" in regular languages. But is there an easier way to "iterate" n, instead of having to write it all out by hand?
My conclusion is that it's perfectly possible, but I would have to write down every possible string in my regex, which would be extremely long.
For questions about theory, the fact that it would be extremely long is irrelevant. You're correct. Any finite language is regular.
I know that iterators are not a "thing" in regular languages. But is there an easier way to "iterate" n, instead of having to write it all out by hand?
No, there isn't. Perhaps it would help to think about how to construct a state machine for this.
One state, the initial state, is the one where the number of 1s seen and the number of 2s seen are equal.
A second state is one where one more 1 has been seen than 2s.
A third state is one where one more 2 has been seen than 1s.
A fourth state is one where two more 1s have been seen than 2s. Etc...
It should be fairly straightforward to convince yourself that you cannot combine any of those states, as there is no overlap in the valid remaining suffixes for each state. Therefore you would need more than 1,000,000 states, and since the translation between a regular expression and a state machine is fairly direct, the complexity of a regular expression cannot be less than the complexity of the state machine.
Of course, when programming, you can implement this far more efficiently. You can, for instance, store the counts of 1s and 2s in additional variables. But variables are a feature that state machines do not have.
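A minimal sketch of that "just use variables" approach (my own illustration, not part of the original answer):

    def valid(n: int) -> bool:
        # One pass, two counters -- exactly the feature a finite
        # automaton lacks.
        s = str(n)
        return n < 1_000_000 and s.count('1') == s.count('2')

    print(valid(1212))   # True  (two 1s, two 2s)
    print(valid(122))    # False (one 1, two 2s)
    print(valid(7077))   # True  (zero of each)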
I have two minimized DFAs and I need to check whether they are equivalent.
If they are equivalent, the remaining problem is an efficient comparison of states regardless of their labels. In my case each DFA is a table, so I need to find the permutation that matches the rows of the first DFA with the rows of the second.
I also thought about doing a breadth-first search of each DFA to create the minimal access string for each state, and then comparing the first list with the second (this should be independent of the particular input; for example, 001 and 110 could be interchangeable).
I'm interested both in a direct, inefficient algorithm and in more sophisticated ones.
The right approach is to construct another DFA recognizing the symmetric difference:
L3 = (L1 - L2) U (L2 - L1)
and test whether L3 is empty. If L3 is empty then L1 = L2; otherwise L1 ≠ L2.
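A minimal Python sketch of that test (the DFA encoding is my own assumption: each DFA is a triple (delta, start, accepting) with a complete transition table). Rather than materializing the symmetric-difference DFA, it explores the product automaton and looks for a reachable pair of states where exactly one side accepts:

    from collections import deque

    def equivalent(dfa1, dfa2, alphabet):
        # delta[state][symbol] -> state.  The symmetric difference is
        # nonempty iff some reachable pair (p, q) has exactly one
        # accepting component.
        t1, s1, f1 = dfa1
        t2, s2, f2 = dfa2
        seen = {(s1, s2)}
        queue = deque([(s1, s2)])
        while queue:
            p, q = queue.popleft()
            if (p in f1) != (q in f2):   # witness in (L1-L2) U (L2-L1)
                return False
            for a in alphabet:
                pair = (t1[p][a], t2[q][a])
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
        return True

This runs in O(|Q1| * |Q2| * |alphabet|) time, since each pair of states is visited at most once.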
I found these algorithms:
- Symmetric difference
- Table-filling algorithm
- Faster Table-Filling algorithm O(n^2)
- Hopcroft algorithm
- Nearly Linear algorithm by Hopcroft and Karp
A complete reference is:
"Algorithms for testing equivalence of finite automata, with a grading tool for JFLAP" (Norton, 2009)
I accepted my own answer because the one from #abbaasi is too incomplete.
I will accept any other answer with a significant contribution.
I remember that the minimal DFA is unique (up to the naming of states). So if you have two minimized DFAs, I think you only need to check whether they're the same.
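Since "the same" here means "identical up to relabeling" (the questioner's permutation concern), here is a sketch of that check, using the same (delta, start, accepting) encoding as the sketch above. A synchronized BFS from the two start states either builds the renaming or finds a mismatch, which is essentially the access-string idea from the question:

    from collections import deque

    def isomorphic(dfa1, dfa2, alphabet):
        # Works because minimal DFAs are unique up to renaming and all
        # of their states are reachable.  Returns the state mapping,
        # or None if the DFAs are not isomorphic.
        t1, s1, f1 = dfa1
        t2, s2, f2 = dfa2
        mapping = {s1: s2}
        queue = deque([s1])
        while queue:
            p = queue.popleft()
            q = mapping[p]
            if (p in f1) != (q in f2):
                return None
            for a in alphabet:
                p2, q2 = t1[p][a], t2[q][a]
                if p2 in mapping:
                    if mapping[p2] != q2:
                        return None
                else:
                    mapping[p2] = q2
                    queue.append(p2)
        # the renaming must be a bijection between the two state sets
        if len(mapping) == len(t1) and len(set(mapping.values())) == len(t2):
            return mapping
        return None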
Is there an algorithm that can produce a regular expression (maybe limited to a simplified grammar) from a set of strings such that the evaluation of all possible strings that match the regular expression reproduces the initial set of strings?
It is probably unrealistic to find such an algorithm for grammars of regular expressions with very "complicated" syntax (including arbitrary repetitions, assertions, etc.), so let's start with a simplified one which only allows an OR of substrings:
foo(a|b|cd)bar should match fooabar, foobbar and foocdbar.
Examples
Given the set of strings h_q1_a, h_q1_b, h_q1_c, h_p2_a, h_p2_b, h_p2_c, the desired output of the algorithm would be h_(q1|p2)_(a|b|c).
Given the set of strings h_q1_a, h_q1_b, h_p2_a, the desired output of the algorithm would be h_(q1_(a|b)|p2_a). Note that h_(q1|p2)_(a|b) would not be correct, because that expands to 4 strings, including h_p2_b, which was not in the original set of strings.
Use case
I have a long list of labels which were all produced by putting together substrings. Instead of printing the vast list of strings, I would like to have a compact output indicating what labels are in the list. As the full list has been produced programmatically (using a finite set of pre- and suffixes) I expect the compact notation to be (potentially) much shorter than the initial list.
(The (simplified) regex should be as short as possible, although I am more interested in a practical solution than the best. The trivial answer is of course to just concatenate all strings like A|B|C|D|... which is, however, not helpful.)
There is a straight-forward solution to this problem, if what you want to find is the minimal finite state machine (FSM) for a set of strings. Since the resulting FSM cannot have loops (otherwise it would match an infinite number of strings), it should be easy to convert into a regular expression using only concatenation and disjunction (|) operators. Although this might not be the shortest possible regular expression, it will result in the smallest compiled regex if the regex library you use compiles to a minimized DFA. (Alternatively, you could use the DFA directly with a library like Ragel.)
The procedure is simple, if you have access to standard FSM algorithms:
Make a non-deterministic finite-state automaton (NFA) by just adding every string as a sequence of states, with each sequence starting from the start state. Clearly O(N) in the total size of the strings, since there will be precisely one NFA state for every character in the original strings.
Construct a deterministic finite-state automaton (DFA) from the NFA. The NFA is a tree, not even a DAG, which should avoid the exponential worst-case for the standard algorithm. Effectively, you're just constructing a prefix tree here, and you could have skipped step 1 and constructed the prefix tree directly, converting it directly into a DFA. The prefix tree cannot have more nodes than the original number of characters (and can have the same number of nodes if all the strings start with different characters), so its output is O(N) in size, but I don't have a proof off the top of my head that it is also O(N) in time.
Minimize the DFA.
DFA minimization is a well-studied problem. Hopcroft's algorithm is a worst-case O(NS log N) algorithm, where N is the number of states in the DFA and S is the size of the alphabet. Normally, S would be considered a constant; in any event, the expected time of Hopcroft's algorithm is much better.
For acyclic DFAs, there are linear-time algorithms; the most-frequently cited one is due to Dominique Revuz, and I found a rough description of it here in English; the original paper seems to be pay-walled, but Revuz's thesis (in French) is available.
You can try to use the Aho-Corasick algorithm to create a finite state machine from the input strings, after which it should be somewhat easy to generate the simplified regex. Taking your input strings as an example:
h_q1_a
h_q1_b
h_q1_c
h_p2_a
h_p2_b
h_p2_c
it will generate a finite machine that most probably looks like this:
[h_] <-level 0
/ \
[q1] [p2] <-level 1
\ /
[_] <-level 2
/\ \
/ \ \
a b c <-level 3
Now, for every level/depth of the trie, all the strings (if there are multiple) go under OR brackets, so:
h_(q1|p2)_(a|b|c)
L0 L1 L2 L3
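A small Python sketch of this idea (my own illustration, combining this answer with the prefix-tree answer above): build the trie, then, when emitting an alternation, factor out the longest common suffix of the alternatives so that shared tails are merged the way the diagram suggests:

    def build_trie(words):
        # Prefix tree; '$' marks the end of a word.
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node['$'] = {}
        return root

    def common_suffix(alts):
        # Longest common suffix of all alternatives that does not cut
        # into a parenthesised group.
        rev = [a[::-1] for a in alts]
        i = 0
        while all(len(r) > i for r in rev) and len({r[i] for r in rev}) == 1:
            i += 1
        suf = alts[0][len(alts[0]) - i:] if i else ''
        while suf and suf.count('(') != suf.count(')'):
            suf = suf[1:]
        return suf

    def to_regex(node):
        # A word that is a prefix of another word would produce an empty
        # alternative; the example inputs avoid that case.
        alts = sorted('' if ch == '$' else ch + to_regex(sub)
                      for ch, sub in node.items())
        if len(alts) == 1:
            return alts[0]
        suf = common_suffix(alts)
        heads = [a[:len(a) - len(suf)] for a in alts]
        return '(' + '|'.join(heads) + ')' + suf

    words = ['h_q1_a', 'h_q1_b', 'h_q1_c', 'h_p2_a', 'h_p2_b', 'h_p2_c']
    print(to_regex(build_trie(words)))   # h_(p2|q1)_(a|b|c)

For h_q1_a, h_q1_b, h_p2_a it prints h_(p2_a|q1_(a|b)), matching the second example above up to the order of the alternatives.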
What is the complexity, with respect to the string length, of matching a regular expression against a string?
The answer depends on what exactly you mean by "regular expressions." Classic regexes can be compiled into Deterministic Finite Automata that can match a string of length N in O(N) time. Certain extensions to the regex language change that for the worse.
You may find the following document of interest: Regular Expression Matching Can Be Simple And Fast.
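To illustrate the O(N) claim, here is a tiny hand-compiled DFA in Python (the table is the textbook machine for (a|b)*abb, chosen purely as an example): matching is one table lookup per input character, regardless of how the pattern was structured.

    def dfa_match(text):
        # delta[state][symbol] -> state; state 3 accepts; the input is
        # assumed to be over the alphabet {a, b}.
        delta = [
            {'a': 1, 'b': 0},
            {'a': 1, 'b': 2},
            {'a': 1, 'b': 3},
            {'a': 1, 'b': 0},
        ]
        state = 0
        for ch in text:          # exactly one lookup per character
            state = delta[state][ch]
        return state == 3

    print(dfa_match('ababb'))    # True
    print(dfa_match('abab'))     # False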
Unbounded: you can create a regular expression whose evaluation never terminates, even on an empty input string.
If you use a normal regexp in the theoretical-CS sense (no backreferences; just concatenation, alternation, and Kleene star) and the regexp is already compiled, then it's O(n).
If you're looking for tight asymptotic bounds on RegEx (without respect to the expression itself), then there isn't one. As Alex points out, you can create a regex that is O(1) or a regex that is Omega(infinity). As a purely mathematical algorithm, a regular expression engine would be far too complicated to perform any sort of formal asymptotic analysis (aside from the fact that such analysis would be basically worthless).
The growth rate of a particular expression (since that, really, constitutes an algorithm, anyway) would be far more meaningful, though not necessarily any easier to analyze.
I didn't get the answer to this anywhere. What is the runtime complexity of a Regex match and substitution?
Edit: I work in python. But would like to know in general about most popular languages/tools (java, perl, sed).
From a purely theoretical stance:
The implementation I am familiar with would be to build a Deterministic Finite Automaton to recognize the regex. This is done in O(2^m), m being the size of the regex, using a standard algorithm. Once this is built, running a string through it is linear in the length of the string - O(n), n being string length. A replacement on a match found in the string should be constant time.
So overall, I suppose O(2^m + n).
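For reference, a compact sketch of where the O(2^m) comes from (the NFA encoding is my own assumption, and ε-moves are assumed to have been eliminated already): every DFA state is a set of NFA states, and in the worst case all subsets are reachable.

    def subset_construction(nfa, start, alphabet):
        # nfa[state][symbol] -> set of successor states (epsilon-free).
        # Each DFA state is a frozenset of NFA states; in the worst
        # case all 2^m subsets show up, hence the exponential bound.
        start_set = frozenset([start])
        dfa, work = {}, [start_set]
        while work:
            S = work.pop()
            if S in dfa:
                continue
            dfa[S] = {}
            for a in alphabet:
                T = frozenset(q for s in S
                                for q in nfa.get(s, {}).get(a, ()))
                dfa[S][a] = T
                if T not in dfa:
                    work.append(T)
        return dfa, start_set

This is not the only option: simulating the NFA directly avoids the exponential construction at the cost of more work per input character, as a later answer describes.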
Other theoretical info of possible interest.
For clarity, assume the standard definition of a regular expression from formal language theory (http://en.wikipedia.org/wiki/Regular_language). Practically, this means that the only building material is alphabet symbols, the operators of concatenation, alternation and Kleene closure, along with the unit and zero constants (which appear for group-theoretic reasons). Generally it's a good idea not to overload this term despite the everyday practice in scripting languages, which leads to ambiguities.
There is an NFA construction that solves the matching problem for a regular expression r and an input text t in O(|r| |t|) time and O(|r|) space, where |.| is the length function. This algorithm was further improved by Myers (http://doi.acm.org/10.1145/128749.128755) to the time and space complexity O(|r| |t| / log |t|) by using automaton node listings and the Four Russians paradigm. This paradigm seems to be named after four Russian guys who wrote a groundbreaking paper which is not online; however, the paradigm is illustrated in these computational biology lecture notes: http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf. I find it hilarious to name a paradigm by the number and the nationality of the authors instead of their last names.
The matching problem for regular expressions with added backreferences is NP-complete, which was proven by Aho (http://portal.acm.org/citation.cfm?id=114877) by a reduction from the vertex cover problem, a classical NP-complete problem.
To match regular expressions with backreferences deterministically, we could employ backtracking (not unlike the Perl regex engine) to keep track of the possible subwords of the input text t that can be assigned to the variables in r. There are only O(|t|^2) subwords that can be assigned to any one variable in r. If there are n variables in r, then there are O(|t|^(2n)) possible assignments. Once an assignment of substrings to variables is fixed, the problem reduces to plain regular expression matching. Therefore the worst-case complexity for matching regular expressions with backreferences is O(|t|^(2n)).
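A toy Python sketch of that argument for n = 1 (the pattern encoding is my own: a list of literal strings with the marker 'X' for the single backreferenced variable, so the "plain matching" step degenerates to a string comparison):

    def match_one_backref(pattern, text):
        # Try all O(|t|^2) substrings for X, then reduce to a plain
        # comparison -- the n = 1 case of the O(|t|^(2n)) bound.
        n = len(text)
        for i in range(n + 1):
            for j in range(i, n + 1):
                x = text[i:j]
                if ''.join(x if p == 'X' else p for p in pattern) == text:
                    return x
        return None

    print(match_one_backref(['X', '-', 'X'], 'ab-ab'))   # 'ab'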
Note, however, that regular expressions with backreferences are still not full-featured regexen.
Take, for example, the "don't care" symbol, apart from any other operators. There are several polynomial algorithms deciding whether a set of patterns matches an input text. For example, Kucherov and Rusinowitch (http://dx.doi.org/10.1007/3-540-60044-2_46) define a pattern as a word w_1#w_2#...#w_n, where each w_i is a word (not a regular expression) and "#" is a variable-length "don't care" symbol not contained in any of the w_i. They derive an O((|t| + |P|) log |P|) algorithm for matching a set of patterns P against an input text t, where |t| is the length of the text and |P| is the total length of all the words in P.
It would be interesting to know how these complexity measures combine, and what the complexity of the matching problem is for regular expressions with backreferences, "don't care" symbols, and other interesting features of practical regular expressions.
Alas, I haven't said a word about Python... :)
Depends on what you mean by regex. If you allow only the operators of concatenation, alternation and Kleene star, the time can actually be O(m*n + m), where m is the size of the regex and n is the length of the string. You do it by constructing an NFA (that is linear in m), and then simulating it by maintaining the set of states you're in and updating that (in O(m)) for every letter of input.
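A sketch of that simulation in Python (the ε-free NFA encoding is my own assumption; the toy machine below happens to recognise (a|b)*abb):

    def nfa_accepts(nfa, start, accepting, text):
        # Carry the set of live states and update it once per input
        # symbol: O(m) work per letter, O(m*n) overall, no backtracking.
        # nfa[state][symbol] -> set of successor states.
        current = {start}
        for ch in text:
            current = {q for s in current
                         for q in nfa.get(s, {}).get(ch, ())}
            if not current:          # no live states: fail early
                return False
        return bool(current & accepting)

    nfa = {0: {'a': {0, 1}, 'b': {0}}, 1: {'b': {2}}, 2: {'b': {3}}}
    print(nfa_accepts(nfa, 0, {3}, 'ababb'))   # True
    print(nfa_accepts(nfa, 0, {3}, 'abab'))    # False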
Things that make regex parsing difficult:
parentheses and backreferences: capturing is still OK with the aforementioned algorithm, although it would push the complexity higher, so it might be infeasible. Backreferences raise the recognition power of the regex, and their difficulty is well known (matching with backreferences is NP-complete, as noted above)
positive look-ahead: is just another name for intersection, which raises the complexity of the aforementioned algorithm to O(m^2+n)
negative look-ahead: a disaster for constructing the automaton (O(2^m), possibly PSPACE-complete). But should still be possible to tackle with the dynamic algorithm in something like O(n^2*m)
Note that with a concrete implementation, things might get better or worse. As a rule of thumb, simple features should be fast enough, and unambiguous regexes (e.g., not like a*a*) are better.
To delve into theprise's answer, for the construction of the automaton, O(2^m) is the worst case, though it really depends on the form of the regular expression (for a very simple one that matches a word, it's in O(m), using for example the Knuth-Morris-Pratt algorithm).
Depends on the implementation. What language/library/class? There may be a best case, but it would be very specific to the number of features in the implementation.
You can trade space for speed by building a deterministic finite automaton (DFA) instead of an NFA. The DFA can be traversed in linear time, although in the worst case it can need O(2^m) space. I'd expect the tradeoff to be worth it.
If you're after matching and substitution, that implies grouping and backreferences.
Here is a Perl example where grouping and backreferences can be used to solve an NP-complete problem:
http://perl.plover.com/NPC/NPC-3SAT.html
This (coupled with a few other theoretical tidbits) means that using regular expressions for matching and substitution is NP-complete.
Note that this is different from the formal definition of a regular expression, which doesn't have the notion of grouping, and which matches in polynomial time, as described by the other answers.
In Python's re library, even if a regex is compiled, the complexity can still be exponential in the string length in some cases, since the engine is not built on a DFA.
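A classic way to observe this (my own demonstration; the pattern is the well-known pathological case of nested quantifiers with no possible match):

    import re
    import time

    # The running time roughly doubles with every extra 'a', even
    # though the pattern is precompiled.
    pattern = re.compile(r'(a+)+$')
    for n in (20, 22, 24, 26):
        t0 = time.perf_counter()
        pattern.match('a' * n + 'b')
        print(n, round(time.perf_counter() - t0, 3))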