Detect if a regexp is exponential - regex

This article shows that some regexps take O(2^n) time when matched by backtracking.
The example is (x+x+)+y.
When it attempts to match a string like xxxx...p, the engine backtracks for a long time before it figures out that the string cannot match.
Is there a way to detect such regexps?
thanks
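To make the blow-up concrete, here is a toy model of a backtracker hard-coded for (x+x+)+y. It is only a sketch: the function name `backtracking_steps` and the way steps are counted are assumptions made for illustration, not how any real engine works internally. Each step is one attempted way of splitting a run of x's between the two x+ parts.

```python
def backtracking_steps(text: str) -> tuple[bool, int]:
    """Full-match text against (x+x+)+y by naive backtracking.

    Returns (matched, steps), where steps counts every attempted
    split of a run of x's between the two x+ sub-patterns.
    """
    steps = 0
    n = len(text)

    def match_group(pos: int) -> bool:
        # Match one iteration of (x+x+) starting at pos, then either
        # another iteration of the group or the final y.
        nonlocal steps
        run_end = pos
        while run_end < n and text[run_end] == "x":
            run_end += 1
        # Greedy order: longest first x+ first, leaving at least one x
        # for the second x+.
        for first_end in range(run_end - 1, pos, -1):
            for second_end in range(run_end, first_end, -1):
                steps += 1
                if match_group(second_end):
                    return True
                if second_end == n - 1 and text[second_end] == "y":
                    return True
        return False

    return match_group(0), steps


if __name__ == "__main__":
    for count in (5, 10, 15, 20):
        matched, steps = backtracking_steps("x" * count + "p")
        print(count, matched, steps)
```

In this model, a failing input with n x's costs 2^(n-1) - 1 steps, so each extra x roughly doubles the work, while an input that ends in y succeeds on the first attempt.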

If your regexp engine exhibits exponential run time for (x+x+)+y, then it is broken, because a DFA or NFA can recognize this pattern in linear time:
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"
Both commands return immediately.
In fact, there are only a few cases (like backreferences) where backtracking is really needed (mainly, because a regexp with a backreference is not a regular expression in the language theoretic sense anymore). A capable implementation should switch to backtracking only when these corner cases are given.
In fairness, DFAs have a dark side too, because some regexps have exponential size requirements, but a size constraint is easier to enforce than a time constraint, and even a huge DFA runs in linear time on the input, so it's a better bargain than a small backtracker choking on a couple of x's.
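As a rough illustration of why an automaton does not have this problem, the sketch below simulates a hand-built NFA for (x+x+)+y by tracking the set of states it could be in. The state numbering and the name `nfa_fullmatch` are assumptions for this example, and it does a full match rather than the substring search that egrep performs.

```python
# Hand-built NFA for (x+x+)+y. State meanings:
#   0: start
#   1: inside the first x+ of the current group iteration
#   2: inside the second x+ of the current group iteration
#   3: accepted (the final y has been read)
NFA = {
    (0, "x"): {1},
    (1, "x"): {1, 2},   # stay in the first x+ or move on to the second
    (2, "x"): {2, 1},   # stay in the second x+ or start a new iteration
    (2, "y"): {3},
}
ACCEPTING = {3}


def nfa_fullmatch(text: str) -> bool:
    """Simulate the NFA on text, keeping the set of reachable states."""
    states = {0}
    for ch in text:
        # One pass over the current state set per input character.
        states = {t for s in states for t in NFA.get((s, ch), ())}
        if not states:
            return False  # no state survives, so no match is possible
    return bool(states & ACCEPTING)


if __name__ == "__main__":
    print(nfa_fullmatch("x" * 44))        # no trailing y
    print(nfa_fullmatch("x" * 44 + "y"))  # two or more x's, then y
```

The state set never holds more than four states, so the work per input character is bounded and the total time is linear in the input length.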
You should really read Russ Cox's excellent article series about the implementation of regexps (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/
To answer your question about decidability: you can't, because there is no single backtracking algorithm for regexps. Every implementation has its own strategies for dealing with exponential growth in certain cases and does not cover others. A rule that fits one engine might be catastrophic for another.
UPDATE:
For example, one implementation could contain an optimizer that uses algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same as xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next expression, and the problem is there again. Here someone describes how to craft a regexp that fools Perl's optimizer:
http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html
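The claimed equivalence of (x+x+)+y and xxx*y can be spot-checked by brute force. The sketch below compares the two patterns on every string over the alphabet {x, y} up to a small length, using Python's `re` module. The helper name `agree_up_to` is made up for this example, and a bounded check like this is evidence, not a proof.

```python
import re
from itertools import product

NESTED = re.compile(r"(x+x+)+y")
SIMPLE = re.compile(r"xxx*y")


def agree_up_to(max_len: int) -> bool:
    """Check that both patterns accept exactly the same short strings."""
    for length in range(max_len + 1):
        for chars in product("xy", repeat=length):
            candidate = "".join(chars)
            nested_ok = NESTED.fullmatch(candidate) is not None
            simple_ok = SIMPLE.fullmatch(candidate) is not None
            if nested_ok != simple_ok:
                return False
    return True


if __name__ == "__main__":
    # Lengths are kept small on purpose: Python's re is itself a
    # backtracking engine, so long failing inputs would be slow.
    print(agree_up_to(8))
```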

No, I don't think so, but you can use these guidelines:
If it contains two quantifiers that are open-ended at the high end and they are nested then it might be O(2^n).
If it does not contain two such quantifiers then I think it cannot be O(2^n).
Quantifiers that can cause this are: *, + and {k,}.
Also note that the worst case complexity of evaluating a regular expression might be very different from the complexity on typical strings and that the complexity depends on the specific regular expression engine.

Any regex without backreferences can be matched in linear time, though many regex engines out there in the real world don't do it that way (at least many regex engines that are plugged into programming language runtime environments support backreferences, and don't switch to a more efficient execution model when no backreferences are present).
There's no easy way to find out how much time a regex with backreferences is going to consume.

You could detect and reject nested repetitions using a regex parser, which corresponds to limiting the star height to 1. I've just written a module that computes star heights and rejects those >1, using a regex parser from npm.
$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true
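The same idea can be sketched in a few lines of Python. This is not the npm module's implementation: `star_height` and `is_safe` are hypothetical names, the parser only understands a small subset of regex syntax (literals, escapes, character classes, groups, alternation and the quantifiers *, +, ? and {...}), and counting {...} as a repetition is an assumption chosen so that the results line up with the output above.

```python
def star_height(pattern: str) -> int:
    """Return the maximum nesting depth of repetition quantifiers."""
    pos = 0

    def parse_alt() -> int:
        # alternation: the height is the maximum over all branches
        nonlocal pos
        height = parse_seq()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            height = max(height, parse_seq())
        return height

    def parse_seq() -> int:
        height = 0
        while pos < len(pattern) and pattern[pos] not in "|)":
            height = max(height, parse_quantified())
        return height

    def parse_quantified() -> int:
        nonlocal pos
        height = parse_atom()
        while pos < len(pattern) and pattern[pos] in "*+?{":
            ch = pattern[pos]
            if ch == "{":
                pos = pattern.index("}", pos) + 1
                height += 1
            else:
                pos += 1
                if ch in "*+":  # '?' (optional or lazy) adds no repetition
                    height += 1
        return height

    def parse_atom() -> int:
        nonlocal pos
        ch = pattern[pos]
        if ch == "(":
            pos += 1
            if pattern.startswith("?:", pos):  # non-capturing group prefix
                pos += 2
            height = parse_alt()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return height
        if ch == "[":
            pos += 1
            if pos < len(pattern) and pattern[pos] == "^":
                pos += 1
            if pos < len(pattern) and pattern[pos] == "]":
                pos += 1  # a leading ']' is a literal member of the class
            while pos < len(pattern) and pattern[pos] != "]":
                pos += 2 if pattern[pos] == "\\" else 1
            if pos >= len(pattern):
                raise ValueError("missing ']'")
            pos += 1
            return 0
        pos += 2 if ch == "\\" else 1  # escape sequence or plain literal
        return 0

    height = parse_alt()
    if pos != len(pattern):
        raise ValueError("unbalanced ')'")
    return height


def is_safe(pattern: str) -> bool:
    """Accept only patterns without nested repetition."""
    return star_height(pattern) <= 1
```

For example, `is_safe(r"(x+x+)+y")` is False and `is_safe(r"(beep|boop)*")` is True under these assumptions.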

Related

How to efficiently implement longest match in a lexer generator?

I'm interested in learning how to write a lexer generator like flex. I've been reading "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic idea of how flex works.
My initial approach is this: the user will supply a hash map of regexes mapping a regex to a token enum. The program will just loop through the regexes one by one in the order given and see if they match the start of the string (I could add a ^ to the beginning of each regex to implement this). If they do, I can add the token for that regex to a list of tokens for the program.
My first question is, is this the most efficient way to do it? Currently I have to loop through each regex, but in theory I could construct a DFA from all of the regexes combined and step through that more efficiently. However, there will be some overhead from creating this DFA.
My second question is, how would I implement the longest matching string tie breaker, like flex does? That is, I want to match ifa as an identifier, and not the keyword if followed by the letter a. I don't see any efficient way to do this with regex. I think I'll have to loop through all of the regexes, try to match them, and if I have more than one match, take the longest result. However, if I converted the regexes to a DFA (that is, my own DFA data structure), then I could continue stepping through the characters until there are no more possible transition edges on the DFA. At that point, I can take the last time I passed through an acceptance state as the actual match of a token, since that should be the longest match.
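The "try every regex and keep the longest" approach described here could look roughly like the following sketch. It uses Python's `re` module rather than Rust's regex crate, and the rule table and the name `longest_match_tokens` are invented for illustration. Ties are broken in favour of the rule listed first, which is how the keyword if wins over the identifier rule.

```python
import re

# Order matters only for ties: the earlier rule wins on equal length.
RULES = [
    ("IF", re.compile(r"if")),
    ("IDENT", re.compile(r"[a-z]+")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("WS", re.compile(r"\s+")),
]


def longest_match_tokens(text: str) -> list[tuple[str, str]]:
    """Tokenize text by trying every rule and keeping the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        for kind, regex in RULES:
            # Pattern.match(text, pos) anchors the attempt at pos.
            found = regex.match(text, pos)
            if found and found.end() > best_end:
                best_kind, best_end = kind, found.end()
        if best_kind is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_kind != "WS":  # whitespace is recognized but dropped
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    print(longest_match_tokens("ifa if 42"))
```

One caveat: in a backtracking engine such as Python's, alternation inside a single rule picks the first alternative that works, not the longest one, so a rule like a|ab may report a shorter match than expected.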
Both of my questions point to writing my own translator from regex to a DFA. Is this required, or can I still do this efficiently with plain regex (implemented by a standard library) and still get the longest match?
EDIT: I kept the regex engine I'm using out of this because I wanted a general answer, but I'm using Rust's regex library: http://static.rust-lang.org/doc/master/regex/index.html
Timewise, it's much more efficient to compile all the regexes down into a single automaton that matches all patterns in parallel. It might blow up the space usage significantly, though (DFAs can have exponentially many states relative to the pattern sizes), so it's worth investigating whether this will hurt.
Typically, the way you'd implement maximal-munch (matching the longest string you can) is to run the matching automaton as normal. Keep track of the index of the last match that you find. When the automaton enters a dead state and stops, you can then output the substring from the beginning of the token up through the match point, then jump back in the input sequence to the point right after the match finished. This can be done pretty efficiently and without much slowdown at all.
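A minimal sketch of that maximal-munch loop is shown below, using a small hand-written DFA instead of one generated from regexes. The token set (if, lowercase identifiers, integers, floats, a lone dot, spaces) and all names are assumptions chosen to keep the example short. The input 1.x shows the jump back: the scanner reads "1." hoping for a float, hits a dead state, and falls back to the last accepting position.

```python
START, I, IF, ID, NUM, NUM_DOT, FLOAT, DOT, WS = range(9)

# Accepting states and the token kind each one reports.
# NUM_DOT (a number followed by '.') is deliberately not accepting.
ACCEPT = {I: "IDENT", IF: "IF", ID: "IDENT", NUM: "NUMBER",
          FLOAT: "FLOAT", DOT: "DOT", WS: "WS"}


def step(state: int, ch: str):
    """Return the next DFA state, or None for the dead state."""
    letter = "a" <= ch <= "z"
    digit = "0" <= ch <= "9"
    if state == START:
        if ch == "i":
            return I
        if letter:
            return ID
        if digit:
            return NUM
        if ch == ".":
            return DOT
        if ch == " ":
            return WS
    elif state == I:
        if ch == "f":
            return IF
        if letter:
            return ID
    elif state in (IF, ID):
        if letter:
            return ID
    elif state == NUM:
        if digit:
            return NUM
        if ch == ".":
            return NUM_DOT
    elif state in (NUM_DOT, FLOAT):
        if digit:
            return FLOAT
    elif state == WS:
        if ch == " ":
            return WS
    return None


def tokenize(text: str) -> list[tuple[str, str]]:
    tokens = []
    pos = 0
    while pos < len(text):
        state, i = START, pos
        last_end, last_kind = None, None
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break  # dead state: stop scanning this token
            i += 1
            if state in ACCEPT:
                # Remember the most recent accepting position.
                last_end, last_kind = i, ACCEPT[state]
        if last_end is None:
            raise ValueError(f"no token at position {pos}")
        if last_kind != "WS":
            tokens.append((last_kind, text[pos:last_end]))
        pos = last_end  # jump back to just after the last match
    return tokens
```

With these rules, `tokenize("ifa if 42")` yields an IDENT for ifa, an IF for if and a NUMBER for 42.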
In case it helps, here are some lecture slides from a compilers course I taught that explores scanning techniques.
Hope this helps!

Is there any way to optimize a generic regular expression?

I code in Eclipse, and when I do a CTRL-F to find some string, I see that apart from the standardized options of whole word, case sensitive, there is an option for regular expression search also (it is there in Notepad++ too).
I have tried it once or twice, and generally the results are almost instantaneous. But after all, the code files are not humongous, the biggest ones are not more than 500 lines long, with most lines filled less than half. Is there any way to optimize such that any user supplied regex will run much faster on a large piece of text, say 10-15 MB of size?
I can't think of any method because no standardized search algorithm like Rabin-Karp, or suffix tree would apply here!
I have no idea how regular expressions are implemented in Eclipse or why they are so slow. Here are just some thoughts:
First of all, there are a few concepts you should know: nondeterministic finite automata (NFA) and deterministic finite automata (DFA). In theory, regular expressions, NFAs, and DFAs are equivalent, which means they have exactly the same ability to describe languages (sets of character sequences). This implies that any one of them can be converted to another (see this site).
A regular expression can be implemented by converting it to a DFA, and using a DFA to match text takes only linear time (many string matching algorithms, e.g. KMP, are actually special DFAs). However, the trouble is that most modern regular expression implementations have introduced features like backreferences, which make it impossible to use a DFA.
So, if discarding those complex features is possible, implementing a fast regular expression matcher would be feasible (matching in linear time). You may find more in this article.
What makes you think suffix tree isn't a suitable algorithm for this problem? From http://en.wikipedia.org/wiki/Suffix_tree:
Once [the suffix tree is] constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc.
I think a modified Boyer–Moore string search algorithm also would be possible.

What's the Time Complexity of Average Regex algorithms?

I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.
I'm not so good at algorithmic analysis though and don't understand how a regex compares to say, a basic linear search. I'm asking because on the surface it seems like a linear array search. (If the regex is simple.)
Where could I go to learn more about implementing a regex engine?
This is one of the most popular outlines: Regular Expression Matching Can Be Simple And Fast. Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).
Are you familiar with the term Deterministic/Non-Deterministic Finite Automata?
Real regular expressions (by "real" I am referring to regexes that recognize regular languages, not the regexes with backreferences etc. that almost every programming language includes) can be converted into a DFA or NFA, and both can be implemented mechanically in a programming language (an NFA can be converted into a DFA).
What you have to do is:
Find a way to convert a regex into an automaton
Implement the recognition of the automaton in the programming language of your preference
That way, given a regex, you can convert it to a DFA and run it to see whether it matches a specified text.
This can be implemented in O(n), because a DFA never moves backward over its input (unlike a Turing machine), so it simply reads the string once and either matches it or not. That assumes you don't take overlapping matches into account; otherwise you will have to go back and start matching again...
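The two steps can be sketched as follows, assuming the regex has already been turned into an NFA without epsilon transitions. The example NFA is hand-written for (a|b)*abb, and the function names `determinize` and `dfa_matches` are invented here. The first function is the textbook subset construction, the second runs the resulting DFA in one left-to-right pass.

```python
# NFA for (a|b)*abb: state 0 loops on a and b and guesses where "abb" starts.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
NFA_START = 0
NFA_ACCEPTING = {3}
ALPHABET = "ab"


def determinize(nfa, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            dfa[current][symbol] = target
            if target not in dfa:
                worklist.append(target)
    dfa_accepting = {state for state in dfa if state & accepting}
    return dfa, start_set, dfa_accepting


def dfa_matches(dfa, start, accepting, text):
    """Run the DFA over text: one table lookup per character."""
    state = start
    for ch in text:
        state = dfa[state].get(ch)
        if state is None:  # character outside the alphabet
            return False
    return state in accepting


DFA, DFA_START, DFA_ACCEPTING = determinize(
    NFA, NFA_START, NFA_ACCEPTING, ALPHABET
)
```

For this NFA the construction produces four DFA states. In the worst case the number of subsets can grow exponentially with the size of the NFA, which is the construction cost to weigh against the linear matching time.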
The classic regular expression can be implemented in a way which is fast in practice but has really bad worst-case behaviour (the standard backtracking matcher) or in a way which has guaranteed reasonable worst-case behaviour (keeping it as an NFA). The standard backtracking matcher can be extended to support lots of extra matching operators and flags, which make use of the fact that it is basically a backtracking search.
Examples of the standard approach are everywhere (e.g. built into Perl). There is an example that claims good worst case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.
If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.

substring match faster with regular expression?

After having read up on RE/NFA and DFA, it seems that finding a substring within a string might actually be asymptotically faster using an RE rather than a brute force O(mn) find. My reasoning is that a DFA would actually maintain state and avoid processing each character in the "haystack" more than once. Hence, searches in long strings may actually be much faster if done with regular expressions.
Of course, this is valid only for RE matchers that convert from NFA to DFA.
Has anyone experienced better string match performance in real life when using RE rather than a brute force matcher?
First of all, I would recommend you read the article about internals of regular expressions in several languages: Regular Expression Matching Can Be Simple And Fast.
Because regexps in many languages are not just for matching, but also provide group capturing and backreferences, almost all implementations use so-called "backtracking" when executing the NFA built from the given regexp. This implementation has exponential time complexity in the worst case.
There could be an RE implementation based on a DFA (with group capturing), but it has an overhead (see Laurikari's paper NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions).
For simple substring searching you could use the Knuth-Morris-Pratt algorithm, which builds a DFA to search for the substring and has optimal O(len(s)) complexity. But it also has overhead, and if you test the naive approach (O(nm)) against this optimal algorithm on real-world words and phrases (which are not so repetitive), you could find that the naive approach is better on average.
For exact substring searching you could also try the Boyer–Moore algorithm, which has O(mn) worst-case complexity but works better than KMP on average on real-world data.
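For reference, a compact version of Knuth-Morris-Pratt is sketched below. It uses the usual failure table in place of an explicit DFA, and the function names are my own. The point to notice is that the text index only moves forward, so no haystack character is re-read after a mismatch.

```python
def build_failure(pattern: str) -> list[int]:
    """failure[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern, or -1."""
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # fall back without moving i backward
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


if __name__ == "__main__":
    print(kmp_find("abracadabra", "cad"))
```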
Most regular expressions used in practice are PCRE (Perl-Compatible Regular Expressions), which are more powerful than regular languages and thus cannot be expressed with a regular grammar. PCRE has things like positive/negative lookahead/lookbehind assertions and even recursion, so matching may require processing some characters more than once. Of course, it all comes down to the particular RE implementation: whether or not it is optimized for expressions that stay within the bounds of a regular grammar.
Personally, I haven't done any sort of performance comparisons between the two. However, in my experience I never ever had performance issues with brute force find-and-replace, while I had to deal with RE performance bottlenecks on more than one occasion.
If you look at the documentation for most languages, it will mention that if you don't need the power of regex you should use the non-regex version for performance reasons. Example: http://www.php.net/manual/en/function.preg-split.php states: "If you don't need the power of regular expressions, you can choose faster (albeit simpler) alternatives like explode() or str_split()."
This is a trade-off that exists everywhere: the more flexible and feature-rich a solution is, the poorer its performance.

Formal language expressiveness of Perl patterns

Classical regular expressions are equivalent to finite automata. Most current implementations of "regular expressions" are not strictly speaking regular expressions but are more powerful. Some people have started using the term "pattern" rather than "regular expression" to be more accurate.
What is the formal language classification of what can be described with a modern "regular expression" such as the patterns supported in Perl 5?
Update: By "Perl 5" I mean that pattern matching functionality implemented in Perl 5 and adopted by numerous other languages (C#, JavaScript, etc) and not anything specific to Perl. I don't want to consider, for example, tricks for embedding Perl code in a pattern.
Perl regexps, like those of any pattern language in which "backreferences" are allowed, are not actually "regular".
A backreference is a mechanism for matching the same string that was matched by an earlier sub-pattern. For example, /^(a*)\1$/ matches only strings with an even number of a's, because after some a's the same number of them must follow.
It's easy to prove that, for instance, the pattern /^((a|b)*)\1$/ matches words from a non-regular language(*), so it's more expressive than any finite automaton. Regular expressions can't "remember" a string of arbitrary length and then match it again (the string may be very long, while a finite-state machine can only simulate a finite amount of "memory").
A formal proof would use the pumping lemma. (By the way, this language can't be described by a context-free grammar either.)
Let alone the tricks that allow Perl code to be used in Perl regexps (the non-regular language of balanced parentheses, for example).
(*) "Regular languages" are sets of words that are matched by finite automata. I already wrote an answer about that.
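The two backreference patterns above can be tried directly in any engine that supports \1. The sketch below uses Python's `re` module, with `fullmatch` standing in for the ^...$ anchors, and the helper names are made up for this example.

```python
import re

# (a*)\1 : some a's, then the same run of a's again
EVEN_AS = re.compile(r"(a*)\1")
# ((a|b)*)\1 : any word over {a, b}, then the same word again
REPEATED_WORD = re.compile(r"((a|b)*)\1")


def is_even_run_of_a(s: str) -> bool:
    return EVEN_AS.fullmatch(s) is not None


def is_repeated_word(s: str) -> bool:
    return REPEATED_WORD.fullmatch(s) is not None


if __name__ == "__main__":
    for n in range(6):
        print(n, is_even_run_of_a("a" * n))
    print(is_repeated_word("abab"), is_repeated_word("abba"))
```

The first pattern accepts a run of a's exactly when its length is even. The second accepts abab (the word ab twice) but rejects abba, which is not any word followed by a copy of itself.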
There was a recent discussion on this topic at PerlMonks: Turing completeness and regular expressions
I've always heard perl's regex implementation described as an NFA with backtracking. Wikipedia seems to have a little section on this:
This is possibly slightly too fuzzy, but it's informative nonetheless:
From Wikipedia:
There are at least three different algorithms that decide if and how a given regular expression matches a string.
The oldest and fastest two rely on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The DFA can be constructed explicitly and then run on the resulting input string one symbol at a time. Constructing the DFA for a regular expression of size m has the time and memory cost of O(2^m), but it can be run on a string of size n in time O(n). An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step, possibly with caching. This keeps the DFA implicit and avoids the exponential construction cost, but running cost rises to O(nm). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. As both can be seen as different ways of executing the same DFA, they are also often called the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.
The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. More complex implementations will often identify and speed up or abort common cases where they would otherwise run slowly.
Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must use a backtracking implementation.
Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.