Classical regular expressions are equivalent to finite automata. Most current implementations of "regular expressions" are not strictly speaking regular expressions but are more powerful. Some people have started using the term "pattern" rather than "regular expression" to be more accurate.
What is the formal language classification of what can be described with a modern "regular expression" such as the patterns supported in Perl 5?
Update: By "Perl 5" I mean that pattern matching functionality implemented in Perl 5 and adopted by numerous other languages (C#, JavaScript, etc) and not anything specific to Perl. I don't want to consider, for example, tricks for embedding Perl code in a pattern.
Perl regexps, like those of any pattern language that allows "backreferences", are not actually "regular".
A backreference is a mechanism for matching the same string that was matched by an earlier sub-pattern. For example, /^(a*)\1$/ matches only strings with an even number of as, because after some as the same number of them must follow.
It's easy to prove that, for instance, the pattern /^((a|b)*)\1$/ matches words from a non-regular language(*), so it's more expressive than any finite automaton. Regular expressions can't "remember" a string of arbitrary length and then match it again (the string may be very long, while a finite-state machine can only simulate a finite amount of "memory").
A formal proof would use the pumping lemma. (By the way, this language can't be described by a context-free grammar either.)
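A quick way to see this in practice is with Python's re module, which uses the same backreference syntax (a minimal sketch, with made-up test strings):

    import re

    # /^(a*)\1$/ -- matches only strings consisting of an even number of a's
    even_as = re.compile(r'^(a*)\1$')
    print(bool(even_as.match('aaaa')))   # True  (4 a's)
    print(bool(even_as.match('aaa')))    # False (3 a's)

    # /^((a|b)*)\1$/ -- matches the "copy" language { ww : w in {a,b}* },
    # which is neither regular nor context-free
    copy_lang = re.compile(r'^((a|b)*)\1$')
    print(bool(copy_lang.match('abab')))   # True  (w = "ab")
    print(bool(copy_lang.match('abba')))   # False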
That's without even considering the tricks that allow Perl code inside Perl regexps (which give you, for example, the non-regular language of balanced parentheses).
(*) "Regular languages" are sets of words that are matched by finite automata. I already wrote an answer about that.
There was a recent discussion on this topic at Perlmonks: Turing completeness and regular expressions
I've always heard Perl's regex implementation described as an NFA with backtracking. Wikipedia seems to have a little section on this:
This is possibly slightly too fuzzy, but it's informative nonetheless:
From Wikipedia:
There are at least three different algorithms that decide if and how a given regular expression matches a string.

The oldest and fastest two rely on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The DFA can be constructed explicitly and then run on the resulting input string one symbol at a time. Constructing the DFA for a regular expression of size m has the time and memory cost of O(2^m), but it can be run on a string of size n in time O(n). An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step, possibly with caching. This keeps the DFA implicit and avoids the exponential construction cost, but running cost rises to O(nm). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. As both can be seen as different ways of executing the same DFA, they are also often called the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.[12][13]

The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. More complex implementations will often identify and speed up or abort common cases where they would otherwise run slowly.

Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must use a backtracking implementation.

Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.
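The exponential behaviour of the backtracking ("NFA") algorithm is easy to provoke with exactly the (a|aa)*b expression mentioned above. A rough sketch in Python, whose re module is a backtracking implementation (timings are illustrative and machine-dependent; the growth trend is the point):

    import re
    import time

    pattern = re.compile(r'(a|aa)*b')

    # A string of a's with no trailing 'b' forces the engine to try an
    # exponentially growing number of ways to split the a's before failing.
    for n in range(24, 35, 2):
        subject = 'a' * n
        start = time.perf_counter()
        pattern.match(subject)        # fails, but only after heavy backtracking
        print(f'n={n:2d}  {time.perf_counter() - start:.3f}s')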
From Mastering Regular Expressions 3e:
As a result, broadly speaking, there are three types of regex engines:
DFA (POSIX or not—similar either way)
Traditional NFA (most common: Perl, .NET, PHP, Java, Python, . . . )
POSIX NFA
From the Theory of Computation: Formal Languages, Automata, and Complexity:
For each NFA, there is a DFA that accepts exactly the same language.
Can I argue that an NFA and a DFA are the same thing? Or, even though their abilities to recognize patterns are equivalent, are they still different in some way?
There are two things you're missing:
The "traditional NFA" implementations actually include abilities beyond the strict computer science definition of an NFA.
Performance characteristics are a thing to care about, even given two implementations that produce the same answers.
The net effect is that the backtracking implementations (I like that name better than "traditional NFA") are slightly more expressive than DFA implementations because they can match regexes like (\w{3,})\1, which matches three or more word characters repeated twice (something that can't be matched by a DFA). At the same time, DFA implementations are guaranteed O(n) in the input length, whereas with a backtracking implementation it is very easy to write regexes that exhibit O(n^2) or worse behavior when presented with a string they don't match. (See https://swtch.com/~rsc/regexp/regexp1.html )
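For instance, Python's re (also a backtracking implementation) handles that backreference pattern as described; a minimal sketch with made-up inputs:

    import re

    # (\w{3,})\1 : three or more word characters, immediately repeated
    doubled = re.compile(r'(\w{3,})\1')

    print(doubled.fullmatch('abcabc'))     # matches: "abc" repeated
    print(doubled.fullmatch('abcabd'))     # None: second half differs
    print(doubled.search('xx foofoo yy'))  # finds "foofoo" inside a longer string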
I code in Eclipse, and when I do a CTRL-F to find some string, I see that apart from the standardized options of whole word, case sensitive, there is an option for regular expression search also (it is there in Notepad++ too).
I have tried it once or twice, and generally the results are almost instantaneous. But after all, the code files are not humongous; the biggest ones are not more than 500 lines long, with most lines filled less than half. Is there any way to optimize so that any user-supplied regex runs much faster on a large piece of text, say 10-15 MB in size?
I can't think of any method, because no standard search algorithm like Rabin-Karp or suffix trees would apply here!
I have no idea how regular expressions are implemented in Eclipse or why they are so slow. Here are just some thoughts:
First of all, there are a few concepts you should know: Nondeterministic finite automaton (NFA) and Deterministic finite automaton (DFA). In theory, Regular Expressions, NFAs, and DFAs are equivalent, which means they have exactly the same ability to describe languages (sets of character strings). This implies that any one of them can be converted to another (see this site).
A Regular Expression can be implemented by converting it to a DFA, and using a DFA to match text takes only linear time (many string matching algorithms, e.g. KMP, are actually special DFAs). However, the trouble is that most modern Regular Expression implementations have introduced features like backreferences, which make it impossible to use a DFA.
So, if discarding those complex features is possible, implementing a fast Regular Expression engine (one that does the matching in linear time) would be feasible. You may find more in this article.
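To make the linear-time idea concrete, here is a minimal sketch (the regex, state names, and transition table are hand-built for illustration): a DFA for the regular expression (a|b)*abb, run over the input one character at a time.

    # Hand-built DFA for (a|b)*abb over the alphabet {a, b}.
    # States: 0 = start, 1 = seen "a", 2 = seen "ab", 3 = seen "abb" (accepting).
    TRANSITIONS = {
        0: {'a': 1, 'b': 0},
        1: {'a': 1, 'b': 2},
        2: {'a': 1, 'b': 3},
        3: {'a': 1, 'b': 0},
    }
    ACCEPTING = {3}

    def dfa_match(text: str) -> bool:
        """Return True if the whole text matches (a|b)*abb; runs in O(len(text))."""
        state = 0
        for ch in text:
            state = TRANSITIONS[state].get(ch)
            if state is None:        # character outside the alphabet
                return False
        return state in ACCEPTING

    print(dfa_match('ababb'))   # True
    print(dfa_match('abab'))    # False

No character is ever examined twice, which is why the match is linear in the length of the text regardless of how the regex is written.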
What makes you think suffix tree isn't a suitable algorithm for this problem? From http://en.wikipedia.org/wiki/Suffix_tree:
Once [the suffix tree is] constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc.
I think a modified Boyer–Moore string search algorithm would also be possible.
I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.
I'm not so good at algorithmic analysis though and don't understand how a regex compares to say, a basic linear search. I'm asking because on the surface it seems like a linear array search. (If the regex is simple.)
Where could I go to learn more about implementing a regex engine?
This is one of the most popular outlines: Regular Expression Matching Can Be Simple And Fast. Running a DFA-compiled regular expression against a string is indeed O(n), but it can require up to O(2^m) construction time/space (where m = regular expression size).
Are you familiar with the term Deterministic/Non-Deterministic Finite Automata?
Real regular expressions (when I say "real" I'm referring to regexes that recognize Regular Languages, not the regexes that almost every programming language includes, with backreferences, etc.) can be converted into a DFA/NFA, and both can be implemented in a mechanical way in a programming language (an NFA can be converted into a DFA).
What you have to do is:
Find a way to convert a regex into an automaton
Implement the recognition of the automaton in the programming language of your preference
That way, given a regex, you can convert it to a DFA and run it to see whether or not it matches a specified text.
This can be implemented in O(n), because a DFA doesn't go backward (like a Turing Machine does), so it either matches the string or it doesn't. That is assuming you don't need to account for overlapping matches; otherwise you will have to go back and start matching again...
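Here is a minimal sketch of step 2 for the NFA case (the example automaton, for the pattern a*b, is hand-built; a real engine would generate it from the regex): the simulation tracks the set of states the NFA could be in, so it needs no backtracking and runs in O(n * m).

    # Thompson-style NFA for the regular expression a*b.
    # EPS marks epsilon transitions (taken without consuming input).
    EPS = ''
    NFA = {
        0: {EPS: {1, 3}},
        1: {'a': {2}},
        2: {EPS: {1, 3}},
        3: {'b': {4}},
        4: {},
    }
    START, ACCEPTING = 0, {4}

    def eps_closure(states):
        """All states reachable from `states` via epsilon transitions alone."""
        stack, closure = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in NFA[s].get(EPS, ()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return closure

    def nfa_match(text: str) -> bool:
        """Simulate the NFA on `text` in O(len(text) * number_of_states)."""
        current = eps_closure({START})
        for ch in text:
            moved = set()
            for s in current:
                moved |= NFA[s].get(ch, set())
            current = eps_closure(moved)
            if not current:           # no live states left, fail early
                return False
        return bool(current & ACCEPTING)

    print(nfa_match('aaab'))  # True  (matches a*b)
    print(nfa_match('aba'))   # False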
The classic regular expression can be implemented in a way which is fast in practice but has really bad worst case behaviour (the standard backtracking search) or in a way which has guaranteed reasonable worst case behaviour (keeping it as an NFA, or compiling it to a DFA). The backtracking approach is the one that can be extended to support lots of extra matching constructs and flags, precisely because it is basically a back-tracking search.
Examples of the standard approach are everywhere (e.g. built into Perl). There is an example that claims good worst case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.
If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.
After having read up on RE/NFA and DFA, it seems that finding a substring within a string might actually be asymptotically faster using an RE rather than a brute force O(mn) find. My reasoning is that a DFA would actually maintain state and avoid processing each character in the "haystack" more than once. Hence, searches in long strings may actually be much faster if done with regular expressions.
Of course, this is valid only for RE matchers that convert from NFA to DFA.
Has anyone experienced better string match performance in real life when using RE rather than a brute force matcher?
First of all, I would recommend you read the article about internals of regular expressions in several languages: Regular Expression Matching Can Be Simple And Fast.
Because regexps in many languages are not just for matching but also provide group capturing and backreferencing, almost all implementations use so-called "backtracking" when executing the NFA built from the given regexp. And this implementation has exponential time complexity (in the worst case).
There could be an RE implementation through a DFA (even with group capturing), but it has an overhead (see Laurikari's paper NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions).
For simple substring searching you could use the Knuth-Morris-Pratt algorithm, which builds a DFA to search for the substring and has optimal O(len(s)) complexity. But it has overhead too, and if you test the naive approach (O(nm)) against this optimal algorithm on real-world words and phrases (which are not that repetitive), you may find that the naive approach is better on average.
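For reference, a compact sketch of KMP; the failure table plays the role of the DFA's "where to resume after a mismatch" information:

    def kmp_search(text: str, pattern: str) -> int:
        """Index of the first occurrence of pattern in text, or -1.
        Runs in O(len(text) + len(pattern))."""
        if not pattern:
            return 0
        # failure[i] = length of the longest proper prefix of pattern[:i+1]
        # that is also a suffix of it
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k

        k = 0                         # pattern characters matched so far
        for i, ch in enumerate(text):
            while k > 0 and ch != pattern[k]:
                k = failure[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - len(pattern) + 1
        return -1

    print(kmp_search('ababcabcabababd', 'ababd'))  # 10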
For exact substring searching you could also try the Boyer–Moore algorithm, which has O(mn) worst-case complexity but works better than KMP on average on real-world data.
Most regular expressions used in practice are PCRE (Perl-Compatible Regular Expressions), which are more powerful than regular languages and thus cannot be expressed with a regular grammar. PCRE has things like positive/negative lookahead/lookbehind assertions and even recursion, so matching may require processing some characters more than once. Surely, it all comes down to the particular RE implementation: whether or not it is optimized when the expression stays within the bounds of a regular grammar.
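To give a feel for those assertions, here is a small sketch using Python's re (which supports the same lookaround syntax as PCRE); the strings are made up:

    import re

    # Positive lookahead: match digits only when followed by " dollars",
    # without consuming " dollars" itself.
    print(re.findall(r'\d+(?= dollars)', 'pay 100 dollars now and 5 cents later'))
    # ['100']

    # Negative lookbehind: match "bar" only when it is not preceded by "foo".
    print(re.findall(r'(?<!foo)bar', 'foobar bar crowbar'))
    # ['bar', 'bar']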
Personally, I haven't done any sort of performance comparisons between the two. However, in my experience I never ever had performance issues with brute force find-and-replace, while I had to deal with RE performance bottlenecks on more than one occasion.
If you look at the documentation for most languages, it will mention that if you don't need the power of regexes, you should use the non-regex version for performance reasons. Example: http://www.php.net/manual/en/function.preg-split.php states: "If you don't need the power of regular expressions, you can choose faster (albeit simpler) alternatives like explode() or str_split()."
This is a trade-off that exists everywhere: the more flexible and feature-rich a solution is, the poorer its performance.
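The same trade-off is easy to observe in other languages; for example, a rough comparison in Python (numbers will vary by machine and are purely illustrative):

    import re
    import timeit

    text = ','.join(str(i) for i in range(10_000))

    plain = timeit.timeit(lambda: text.split(','), number=1_000)
    regex = timeit.timeit(lambda: re.split(',', text), number=1_000)

    print(f'str.split: {plain:.3f}s')
    print(f're.split : {regex:.3f}s')  # typically noticeably slower for a fixed delimiter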
If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?
That is, the list is valid if and only if for all strings a maximum of one item in the list will match the entire string.
It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.
The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.
If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible.
What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine them into a DFA that recognizes the intersection of the two languages. If that DFA has a path from the start state to an accepting state, then some string is accepted by both input regexen.

The problem with this is that the first step of the usual regex-to-DFA algorithm is to convert the regex to an NFA, then convert the NFA to a DFA. But that last step can result in an exponential blowup in the number of DFA states, so this will only be feasible for very simple regular expressions.

If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.
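For the pure-regex case, here is a minimal sketch of the intersection check (the two DFAs are hand-built toys standing in for the result of a regex-to-DFA conversion): explore the product automaton lazily and look for a reachable state that is accepting in both.

    from collections import deque

    def dfas_overlap(dfa_a, dfa_b, alphabet):
        """True if some string is accepted by both DFAs. Each DFA is a
        (start, transitions, accepting) triple, with transitions mapping
        (state, symbol) -> state. The product automaton is explored with a
        BFS and never built in full."""
        start = (dfa_a[0], dfa_b[0])
        seen, queue = {start}, deque([start])
        while queue:
            sa, sb = queue.popleft()
            if sa in dfa_a[2] and sb in dfa_b[2]:
                return True               # accepting in both DFAs
            for sym in alphabet:
                nxt = (dfa_a[1][(sa, sym)], dfa_b[1][(sb, sym)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    # Toy DFAs over the alphabet {a, b}.
    LAST_CHAR = {('start', 'a'): 'a', ('start', 'b'): 'b',
                 ('a', 'a'): 'a', ('a', 'b'): 'b',
                 ('b', 'a'): 'a', ('b', 'b'): 'b'}
    ends_in_a = ('start', LAST_CHAR, {'a'})
    ends_in_b = ('start', LAST_CHAR, {'b'})

    PARITY = {('even', 'a'): 'odd', ('even', 'b'): 'odd',
              ('odd', 'a'): 'even', ('odd', 'b'): 'even'}
    even_length = ('even', PARITY, {'even'})

    print(dfas_overlap(ends_in_a, ends_in_b, 'ab'))    # False: they can never both match
    print(dfas_overlap(ends_in_a, even_length, 'ab'))  # True: e.g. "ba" matches both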
The Wikipedia article on regular expressions does state:
It is possible to write an algorithm which, for two given regular expressions, decides whether the described languages are essentially equal; it reduces each expression to a minimal deterministic finite state machine, and determines whether they are isomorphic (equivalent).
but gives no further hints.
Of course the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.
You can't do that by only looking at the regular expression.
Consider the case where you have [0-9] and [0-9]+. They are obviously different expressions, but when applied to the string "1", they both produce the same result. When applied to the string "11", they produce different results.
The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.
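A minimal demonstration with Python's re:

    import re

    single = re.compile(r'[0-9]')
    plus = re.compile(r'[0-9]+')

    print(single.fullmatch('1'), plus.fullmatch('1'))    # both match "1"
    print(single.fullmatch('11'), plus.fullmatch('11'))  # None vs. a match for "11"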