If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?
That is, the list is valid if and only if, for every string, at most one item in the list matches the entire string.
It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.
The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.
If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible.
What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine each pair of them into a product DFA that recognizes the intersection of the two languages. If that product DFA has any path from the start state to an accepting state, the string spelled out along that path is accepted by both input regexen.
The problem with this is that the first step of the usual regex-to-DFA algorithm is to convert the regex to an NFA, then convert the NFA to a DFA. That last step can result in an exponential blowup in the number of DFA states, so this will only be feasible for very simple regular expressions.
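Once the regexes are in DFA form, the intersection check itself is short. Here is a minimal sketch in Python, assuming each DFA is already available as a complete transition dict; the function and state names are purely illustrative, not from any particular library:

```python
from collections import deque

def dfas_overlap(start1, accept1, delta1, start2, accept2, delta2, alphabet):
    """Return True if some string is accepted by both DFAs.

    Each DFA is given as a start state, a set of accepting states, and a
    complete transition dict mapping (state, symbol) -> state. Product
    states are explored lazily with BFS, so only reachable pairs are built.
    """
    start = (start1, start2)
    seen = {start}
    queue = deque([start])
    while queue:
        s1, s2 = queue.popleft()
        if s1 in accept1 and s2 in accept2:
            return True  # a reachable accepting pair means the intersection is non-empty
        for a in alphabet:
            nxt = (delta1[(s1, a)], delta2[(s2, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Example: DFA for a* versus DFA for "contains at least one b" -- no overlap.
delta_a_star = {("q0", "a"): "q0", ("q0", "b"): "qx", ("qx", "a"): "qx", ("qx", "b"): "qx"}
delta_has_b = {("p0", "a"): "p0", ("p0", "b"): "p1", ("p1", "a"): "p1", ("p1", "b"): "p1"}
print(dfas_overlap("q0", {"q0"}, delta_a_star, "p0", {"p1"}, delta_has_b, {"a", "b"}))  # False
```

For a tokenizer you would run this check on every pair of token regexes; any pair that reports an overlap is ambiguous.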
If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.
The Wikipedia article on regular expressions does state
It is possible to write an algorithm which, for two given regular expressions, decides whether the described languages are equal: it reduces each expression to a minimal deterministic finite state machine and determines whether they are isomorphic (equivalent).
but gives no further hints.
Of course the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.
You can't do that by only looking at the regular expression.
Consider the case where you have [0-9] and [0-9]+. They are obviously different expressions, but when applied to the string "1" they both produce the same result; when applied to the string "11" they produce different results.
The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.
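That difference is easy to see with any regex engine; for example, with Python's re module (shown here only to illustrate the point, using the same two expressions):

```python
import re

for s in ["1", "11"]:
    # fullmatch requires the whole string to match, mirroring an anchored comparison
    print(s, bool(re.fullmatch(r"[0-9]", s)), bool(re.fullmatch(r"[0-9]+", s)))
# "1"  -> True True   (the expressions agree)
# "11" -> False True  (they disagree, so the two expressions are not equivalent)
```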
Related
There are many ways to write a regular expression (regex) pattern that answers the same question and yields the same result. For instance, the patterns ^aa$|^aaa$ and ^a{2,3}$ are equivalent: both match exactly the inputs consisting of the character a repeated either 2 or 3 times.
Is there an algorithm for shortening a regular expression (regex) pattern? E.g., one that accepts ^aa$|^aaa$ and prints ^a{2,3}$?
From a theoretical perspective, given 2 regex patterns that yield exactly the same result, would their minimized deterministic finite automata be the same?
From a practical/computational perspective, given 2 regexes that yield exactly the same result, should one regex be favored over the other?
There exist algorithms to minimize a regex (depending on the metric you intend to minimize, e.g. regex length), but from what I've seen here, you're looking at a PSPACE-hard problem with no general solution: this and this site give different results, for example.
Yes, that is exactly the purpose of minimizing an automaton.
Yes, since they are not linked to the associated minimal automaton from an implementation point of view. Here's an interesting set of examples and explanations.
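To make the theoretical point concrete: instead of minimizing and testing isomorphism, equivalence can also be decided directly by checking that no reachable pair of product states disagrees about acceptance. A rough sketch in Python, assuming both patterns have already been compiled to complete DFAs given as transition dicts (all names, and the hand-built example DFA, are illustrative):

```python
from collections import deque

def dfas_equivalent(start1, accept1, delta1, start2, accept2, delta2, alphabet):
    """True iff the two DFAs accept exactly the same language.

    Walks the product automaton; the languages differ exactly when some
    reachable pair of states is accepting in one DFA but not the other.
    """
    start = (start1, start2)
    seen = {start}
    queue = deque([start])
    while queue:
        s1, s2 = queue.popleft()
        if (s1 in accept1) != (s2 in accept2):
            return False  # a distinguishing string has been found
        for a in alphabet:
            nxt = (delta1[(s1, a)], delta2[(s2, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# ^aa$|^aaa$ and ^a{2,3}$ both compile (by hand here) to the same DFA over {"a"}:
# count the a's, accept after 2 or 3, fall into a dead state (4) after that.
delta = {(0, "a"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 4, (4, "a"): 4}
print(dfas_equivalent(0, {2, 3}, delta, 0, {2, 3}, dict(delta), {"a"}))  # True
print(dfas_equivalent(0, {2, 3}, delta, 0, {2}, delta, {"a"}))           # False (^a{2,3}$ vs ^aa$)
```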
I code in Eclipse, and when I do a Ctrl-F to find some string, I see that apart from the standard options of whole word and case sensitive, there is also an option for regular expression search (it is there in Notepad++ too).
I have tried it once or twice, and generally the results are almost instantaneous. But after all, the code files are not humongous; the biggest ones are not more than 500 lines long, with most lines less than half full. Is there any way to optimize so that any user-supplied regex will run much faster on a large piece of text, say 10-15 MB in size?
I can't think of any method, because no standard search algorithm like Rabin-Karp or a suffix tree seems to apply here!
I have no idea how regular expressions are implemented in Eclipse or why it is so slow. Here are just some thoughts:
First of all, there are a few concepts you should know: the nondeterministic finite automaton (NFA) and the deterministic finite automaton (DFA). In theory, regular expressions, NFAs, and DFAs are equivalent, which means they have exactly the same ability to describe languages (sequences of characters). This implies that any one of them can be converted to another (see this site).
A regular expression can be implemented by converting it to a DFA, and using a DFA to match text takes only linear time (many string matching algorithms, e.g. KMP, are effectively special DFAs). However, the trouble is that most modern regular expression implementations have introduced features like backreferences that make a pure DFA implementation impossible.
So, if discarding those complex features is acceptable, implementing a fast regular expression engine that does the matching in linear time would be feasible. You may find more in this article.
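For a sense of what "run the DFA in linear time" means in code, here is a minimal sketch in Python; the example DFA (for a simple decimal-number pattern like [0-9]+\.[0-9]+) is built by hand rather than generated from the regex, and all names are illustrative:

```python
def dfa_match(text, start, accepting, delta):
    """Run a DFA over text in a single left-to-right pass: O(len(text))."""
    state = start
    for ch in text:
        state = delta.get((state, ch), "dead")  # a missing transition means reject
        if state == "dead":
            return False
    return state in accepting

# Hand-built DFA for [0-9]+\.[0-9]+ (digits, a dot, digits)
digits = "0123456789"
delta = {}
for d in digits:
    delta[("start", d)] = "int"   # first integer digit
    delta[("int", d)] = "int"     # further integer digits
    delta[("frac", d)] = "frac"   # fractional digits
delta[("int", ".")] = "dot"
for d in digits:
    delta[("dot", d)] = "frac"

print(dfa_match("3.14", "start", {"frac"}, delta))  # True
print(dfa_match("3.", "start", {"frac"}, delta))    # False
```

Each input character is examined exactly once, which is why the matching cost stays linear no matter how convoluted the original pattern was.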
What makes you think a suffix tree isn't a suitable algorithm for this problem? From http://en.wikipedia.org/wiki/Suffix_tree:
Once [the suffix tree is] constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc.
I think a modified Boyer–Moore string search algorithm would also be possible.
I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.
I'm not so good at algorithmic analysis, though, and don't understand how a regex compares to, say, a basic linear search. I'm asking because on the surface it seems like a linear array search (if the regex is simple).
Where could I go to learn more about implementing a regex engine?
This is one of the most popular outlines: Regular Expression Matching Can Be Simple And Fast. Running a DFA-compiled regular expression against a string is indeed O(n), but it can require up to O(2^m) construction time/space (where m = regular expression size).
Are you familiar with the terms deterministic and nondeterministic finite automata?
Real regular expressions (by real I mean those regexes that recognize regular languages, not the regexes that almost every programming language includes, with backreferences, etc.) can be converted into a DFA/NFA, and both can be implemented in a mechanical way in a programming language (an NFA can be converted into a DFA).
What you have to do is:
Find a way to convert a regex into an automaton
Implement the recognition of the automaton in the programming language of your preference
That way, given a regex, you can convert it to a DFA and run it to see whether or not it matches a specified text.
This can be implemented in O(n), because a DFA doesn't go backward (unlike a Turing machine), so it either matches the string or it doesn't. That is assuming you won't take overlapping matches into account; otherwise you will have to go back and start matching again...
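If you stop one step earlier and keep the NFA, you still get a guaranteed bound by simulating the set of states the NFA could currently be in (on-the-fly determinization). A small sketch in Python, with the NFA for the textbook pattern (a|b)*abb written out by hand; the names are illustrative, not from any library:

```python
def nfa_match(text, start, accepting, delta):
    """Simulate an NFA by tracking the set of states it could currently be in.

    delta maps (state, symbol) -> set of next states. Every input symbol is
    processed once, so the cost is O(len(text) * number_of_states).
    """
    current = {start}
    for ch in text:
        current = set().union(*(delta.get((s, ch), set()) for s in current))
        if not current:
            return False
    return bool(current & accepting)

# Hand-written NFA for (a|b)*abb
delta = {
    (0, "a"): {0, 1},  # either keep looping or guess that the final "abb" starts here
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
print(nfa_match("babb", 0, {3}, delta))  # True
print(nfa_match("abab", 0, {3}, delta))  # False
```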
The classic regular expression can be implemented in a way which is fast in practice but has really bad worst-case behaviour (the standard backtracking approach) or in a way which has guaranteed reasonable worst-case behaviour (keeping it as an NFA and simulating it). The standard backtracking approach can be extended to support lots of extra matching constructs and flags, because it is basically a backtracking search.
Examples of the standard approach are everywhere (e.g. built into Perl). There is an example that claims good worst case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.
If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.
Classical regular expressions are equivalent to finite automata. Most current implementations of "regular expressions" are not strictly speaking regular expressions but are more powerful. Some people have started using the term "pattern" rather than "regular expression" to be more accurate.
What is the formal language classification of what can be described with a modern "regular expression" such as the patterns supported in Perl 5?
Update: By "Perl 5" I mean that pattern matching functionality implemented in Perl 5 and adopted by numerous other languages (C#, JavaScript, etc) and not anything specific to Perl. I don't want to consider, for example, tricks for embedding Perl code in a pattern.
Perl regexps, like those of any pattern language where "backreferences" are allowed, are not actually "regular".
A backreference is a mechanism for matching the same string that was matched by a sub-pattern earlier. For example, /^(a*)\1$/ matches only strings with an even number of as, because after some as there must follow the same number of them.
It's easy to prove that, for instance, the pattern /^((a|b)*)\1$/ matches words from a non-regular language(*), so it's more expressive than any finite automaton. Regular expressions can't "remember" a string of arbitrary length and then match it again (the length may be very long, while a finite-state machine can only simulate a finite amount of "memory").
A formal proof would use the pumping lemma. (By the way, this language can't be described by a context-free grammar either.)
Not to mention the tricks that allow Perl code to be used inside Perl regexps (which give you, for example, the non-regular language of balanced parentheses).
(*) "Regular languages" are sets of words that are matched by finite automata. I already wrote an answer about that.
There was a recent discussion on this topic at Perlmonks: Turing completeness and regular expressions
I've always heard Perl's regex implementation described as an NFA with backtracking. Wikipedia seems to have a little section on this; it's possibly slightly too fuzzy, but it's informative nonetheless:
From Wikipedia:
There are at least three different algorithms that decide if and how a given regular expression matches a string.

The oldest and fastest two rely on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The DFA can be constructed explicitly and then run on the resulting input string one symbol at a time. Constructing the DFA for a regular expression of size m has the time and memory cost of O(2^m), but it can be run on a string of size n in time O(n). An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step, possibly with caching. This keeps the DFA implicit and avoids the exponential construction cost, but running cost rises to O(nm). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. As both can be seen as different ways of executing the same DFA, they are also often called the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.[12][13]

The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. More complex implementations will often identify and speed up or abort common cases where they would otherwise run slowly. Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must use a backtracking implementation.

Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.
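The exponential blow-up described for (a|aa)*b is easy to reproduce with a backtracking engine. A small timing sketch using Python's re (a backtracking implementation); exact times will vary by machine, but they grow exponentially with the number of a's, because with no trailing b the engine must try every way of splitting the a's before giving up:

```python
import re
import time

pattern = re.compile(r"(a|aa)*b")
for n in (10, 20, 25, 30):
    s = "a" * n  # no trailing "b", so the match can never succeed
    t0 = time.perf_counter()
    pattern.match(s)  # always returns None, but only after exhaustive backtracking
    print(n, round(time.perf_counter() - t0, 4), "seconds")
```

A DFA- or NFA-simulation engine such as RE2 answers the same queries in time proportional to the input length.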
I think that the title accurately summarizes my question, but just to elaborate a bit.
Instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.
Note: The function doesn't need to generate every string that satisfies the regular expression (because that would be an infinite number of strings for a lot of regexes). Just a sampling of the many valid strings is sufficient.
How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.NET) that do this.
Well, a regex is convertible to a DFA, which can be thought of as a graph. To generate a string given this DFA graph, you'd just find a path from a start state to an end state. You'd just have to think about how you want to handle cycles (maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.
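A rough sketch of that random-walk idea in Python, assuming the DFA has already been built and is supplied as an adjacency dict (the names and the toy DFA for [ab]*c are illustrative only):

```python
import random

def sample_string(start, accepting, delta, max_len=20):
    """Random walk over a DFA; returns an accepted string, or None if the walk fails.

    delta maps state -> list of (symbol, next_state). At an accepting state we
    flip a coin between stopping and continuing, which keeps cycles bounded.
    """
    state, out = start, []
    for _ in range(max_len):
        if state in accepting and random.random() < 0.5:
            return "".join(out)
        choices = delta.get(state, [])
        if not choices:
            break
        sym, state = random.choice(choices)
        out.append(sym)
    return "".join(out) if state in accepting else None

# Hand-built DFA for the regex [ab]*c: any mix of a/b, then a single c.
delta = {"s": [("a", "s"), ("b", "s"), ("c", "f")]}
for _ in range(5):
    print(sample_string("s", {"f"}, delta))
```

Biasing the stop/continue probability (or weighting the outgoing edges) is one way to control how long the sampled strings tend to be.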
This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:
[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
#|TH[12]
#(#|TH[12])?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
This can be done by traversing the DFA (includes pseudocode) or else by walking the regex's abstract-syntax tree directly or converting to NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)
These all work on regular expressions without back-references -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features it'd be easiest to add on a post-filter.
Added: code for this in Python, by Peter Norvig and me.
Since it is trivially possible to write a regular expression that matches no possible strings, and I believe it is also possible to write a regular expression for which calculating a matching string requires an exhaustive search of possible strings of all lengths, you'll probably need an upper bound on requesting an answer.
The easiest way to implement, but definitely the most CPU-intensive, approach would be to simply brute force it.
Set up a character table with the characters that your string should contain, and then just sequentially generate strings and do a Regex.IsMatch on them.
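A minimal sketch of that brute-force idea; the answer mentions .NET's Regex.IsMatch, but the same loop is shown here with Python's re for consistency with the other code in this thread (the alphabet, length cap, and result limit are arbitrary illustrative choices):

```python
import re
from itertools import product

def brute_force_matches(pattern, alphabet="ab01", max_len=4, limit=10):
    """Enumerate strings over `alphabet` up to `max_len` and keep full matches."""
    regex = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if regex.fullmatch(s):
                found.append(s)
                if len(found) >= limit:
                    return found
    return found

print(brute_force_matches(r"[01]{2,3}a?"))  # first few full matches, e.g. '00', '01', '10', '11', ...
```

As the answer says, this is extremely CPU-hungry: the number of candidate strings grows as len(alphabet)**max_len, so it is only usable for short strings over small alphabets.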
I, personally, believe that this is the holy grail of reg-ex. If you could implement this -- even only 3/4 working -- I have no doubt that you'd be rich in about 5 minutes.
All joking aside, I'm not sure that what you are truly going after is feasible. Reg-ex is a very open, flexible language, and giving the computer enough sample input to truly and accurately find what you need is probably not feasible.
If I'm proven wrong, I wish kudos to that developer.
To look at this from a different perspective, this is almost (not quite) like giving a computer its output and having it -- based on that -- write a program for you. This is a little overboard, but it kind of illustrates my point.