How to convert regex "x{m,n}" to NFA?

The regex x{m,n} matches from m to n repetitions of the preceding x, attempting to match as many as possible.
I have a naive solution, but the number of nodes and edges depends on m and n, which is unacceptable when n is big.
So, is there any effective way to convert the regex to NFA?

Unfortunately, NFAs don't "count" very well. You're essentially going to have to manually expand your regex into something Thompson's construction can handle, e.g.
m{2,5} -> mm(m(m(m)?)?)?
Search for the function SimplifyRepeat in Google's RE2 source to see their implementation of this expansion. Russ Cox's articles on regular expression matching are a good reference on regex implementation in practice.
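For illustration, here is a small Python sketch of that expansion (the function name is my own; it assumes the quantified atom is a single literal or an already-parenthesized group):

def expand_bounded_repeat(atom, m, n):
    # Rewrite atom{m,n} into the nested-optional form that Thompson's
    # construction can handle, e.g. ('m', 2, 5) -> 'mm(m(m(m)?)?)?'.
    required = atom * m           # the m mandatory copies
    optional = ''
    for _ in range(n - m):        # nest the n - m optional copies
        optional = '(' + atom + optional + ')?'
    return required + optional

print(expand_bounded_repeat('m', 2, 5))  # mm(m(m(m)?)?)?

Note that the output still has size O(n), which matches the question's observation: a plain NFA has no way around spelling the repetitions out.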

Related

Levenshtein distance in regular expression

Is it possible to include Levenshtein distance in a regular expression query?
(Except by taking a union of single-character-wildcard variants, like this to search for "hello" within Levenshtein distance 1:
.ello|h.llo|he.lo|hel.o|hell.
since that is clumsy and unusable for larger Levenshtein distances.)
There are a couple of regex dialects out there with an approximate matching feature - namely the TRE library and the regex PyPI module for Python.
The TRE approximate matching syntax is described in the "Approximate matching settings" section at https://laurikari.net/tre/documentation/regex-syntax/. A TRE regex to match stuff within Levenshtein distance 1 of hello would be:
(hello){~1}
The regex module's approximate matching syntax is described at https://pypi.org/project/regex/ in the bullet point that begins with the text Approximate “fuzzy” matching. A regex for the regex module to match stuff within Levenshtein distance 1 of hello would be:
(hello){e<=1}
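A quick demonstration with the regex module (a minimal sketch; the test strings are my own):

import regex  # the third-party PyPI module, not the stdlib re

pattern = regex.compile(r'(?:hello){e<=1}')   # at most 1 error of any kind
print(bool(pattern.fullmatch('helo')))        # True: one deletion
print(bool(pattern.fullmatch('heillo')))      # True: one insertion
print(bool(pattern.fullmatch('hxllp')))       # False: two substitutions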
Perhaps one or the other of these syntaxes will in time be adopted by other regex implementations, but at present I only know of these two.
You can generate the regex programmatically. I will leave that as an exercise for the reader, but for the output of this hypothetical function (given an input of "word") you want something like this string:
"^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$"
In English: first you try to match the word itself, then every possible single transposition, then every possible single insertion, then every possible single omission or substitution (the last two can be handled simultaneously).
The length of that string, given a word of length n, is linear (and notably not exponential) in n.
Which is reasonable, I think.
You pass this to your regex constructor (in Ruby it would be Regexp.new(str)) and bam, you've got a matcher for ANY word with a Damerau-Levenshtein distance of 1 from a given word.
(Damerau-Levenshtein distances of 2 are far more complicated.)
Note the use of the (?> non-backtracking (atomic) group, which means the order of the individual |'d alternatives in that output matters.
I could not think of a way to "compact" that expression.
EDIT: I got it to work, at least in Elixir! https://github.com/pmarreck/elixir-snippets/blob/master/damerau_levenshtein_distance_1.exs
I wouldn't necessarily recommend this though (except for educational purposes) since it will only get you to distances of 1; a legit D-L library will let you compute distances > 1. Although since this is regex, it would probably work pretty fast once constructed (note that you should save the "compiled" regex somewhere since this code currently reconstructs it on EVERY comparison!)
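For the curious, here is a rough Python rendering of the scheme described above (the function name is mine, and the ordering of alternatives may differ slightly from the string shown earlier):

import re

def dl1_pattern(word):
    # Match anything within Damerau-Levenshtein distance 1 of `word`:
    # the word itself, single transpositions, single insertions, and
    # single omissions-or-substitutions (handled together via '.?').
    w = [re.escape(c) for c in word]
    alts = [''.join(w)]
    for i in range(len(w) - 1):                       # transpositions
        alts.append(''.join(w[:i] + [w[i + 1], w[i]] + w[i + 2:]))
    for i in range(len(w), -1, -1):                   # insertions
        alts.append(''.join(w[:i] + ['.'] + w[i:]))
    for i in range(len(w) - 1, -1, -1):               # omission/substitution
        alts.append(''.join(w[:i] + ['.?'] + w[i + 1:]))
    return '^(?:' + '|'.join(alts) + ')$'

print(dl1_pattern('word'))

With plain (non-atomic) alternation and the ^...$ anchors, the ordering of the alternatives no longer affects correctness, only potentially speed.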
Is there a possibility to include Levenshtein distance in a regular expression query?
No, not in a sane way. Implementing - or using an existing - Levenshtein distance algorithm is the way to go.

Finding a string *and* its substrings in a haystack

Suppose you have a string (e.g. needle). Its 19 distinct contiguous substrings are:
needle
needl eedle
need eedl edle
nee eed edl dle
ne ee ed dl le
n e d l
If I were to build a regex to match, in a haystack, any of the substrings I could simply do:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle|ne|ee|ed|dl|le|n|e|d|l)/
but it doesn't look really elegant. Is there a better way to create a regex that will greedily match any one of the substrings of a given string?
Additionally, what if I posed another constraint, wanted to match only substrings longer than a threshold, e.g. for substrings of at least 3 characters:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle)/
note: I deliberately did not mention any particular regex dialect. Please state which one you're using in your answer.
As Qtax suggested, the expression
n(e(e(d(l(e)?)?)?)?)?|e(e(d(l(e)?)?)?)?|e(d(l(e)?)?)?|d(l(e)?)?|l(e)?|e
would be the way to go if you wanted to write an explicit regular expression (egrep syntax; optionally replace (...) with (?:...)). The reason this is better than the initial solution is that the condensed version requires only O(n^2) space, compared to O(n^3) space for the original version, where n is the length of the input. Try this with extraordinarily as input to see the difference. I guess the condensed version is also faster with many of the regexp engines out there.
The expression
nee(d(l(e)?)?)?|eed(l(e)?)?|edl(e)?|dle
will look for substrings of length 3 or longer.
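A small Python generator for these condensed expressions (a sketch; the function name and the use of non-capturing groups are my choices):

import re

def substring_regex(word, min_len=1):
    # One alternative per starting position: a mandatory prefix of
    # min_len characters followed by a nested chain of optional tails.
    alts = []
    for start in range(len(word) - min_len + 1):
        suffix = [re.escape(c) for c in word[start:]]
        optional = ''
        for c in reversed(suffix[min_len:]):
            optional = '(?:' + c + optional + ')?'
        alts.append(''.join(suffix[:min_len]) + optional)
    return '|'.join(alts)

print(substring_regex('needle'))     # n(?:e(?:e(?:d(?:l(?:e)?)?)?)?)?|...
print(substring_regex('needle', 3))  # nee(?:d(?:l(?:e)?)?)?|...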
As pointed out by vhallac, the generated regular expressions are a bit redundant and can be optimized. Apart from the proposed Emacs tool, there is a Perl package Regexp::Optimizer that I hoped would help here, but a quick check failed for the first regular expression.
Note that many regexp engines perform non-overlapping search by default. Check this with the requirements of your problem.
I have found an elegant almost-solution, depending on how badly you need just one regexp. For example, here is the regexp (Perl) which finds a common substring of length 7:
"$needle\0$haystack" =~ /(.{7}).*?\0.*\1/s
The matching string is in \1. The strings must not contain the null character, which is used as a separator.
You then run a loop that starts at the length of the needle and goes down to the threshold, trying to match the regexp at each length.
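The same trick carries over to Python; a hedged sketch of that loop (the function name is mine):

import re

def longest_common_substring(needle, haystack, threshold=1):
    # Join with NUL and let a backreference find a common substring,
    # trying lengths from len(needle) down to the threshold.
    combined = needle + '\0' + haystack
    for length in range(len(needle), threshold - 1, -1):
        m = re.search(r'(.{%d}).*?\0.*?\1' % length, combined, re.S)
        if m:
            return m.group(1)
    return None

print(longest_common_substring('needle', 'fine needlework'))  # needle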
Is there a better way to create a regex that will match any one of the
substrings of a given string?
No. But you can generate such an expression easily.
Perhaps you're just looking for
.*(.{1,6}).*

Algorithm to match list of regular expressions

I have two algorithmic questions for a project I am working on. I have thought about these, and have some suspicions, but I would love to hear the community's input as well.
Suppose I have a string, and a list of N regular expressions (actually they are wildcard patterns representing a subset of full regex functionality). I want to know whether the string matches at least one of the regular expressions in the list. Is there a data structure that can allow me to match the string against the list of regular expressions in sublinear (presumably logarithmic) time?
This is an extension of the previous problem. Suppose I have the same situation: a string and a list of N regular expressions, only now each of the regular expressions is paired with an offset within the string at which the match must begin (or, if you prefer, each of the regular expressions must match a substring of the given string beginning at the given offset).
To give an example, suppose I had the string:
This is a test string
and the regex patterns and offsets:
(a) his.* at offset 0
(b) his.* at offset 1
The algorithm should return true. Although regex (a) does not match the string beginning at offset 0, regex (b) does match the substring beginning at offset 1 ("his is a test string").
Is there a data structure that can allow me to solve this problem in sublinear time?
One possibly useful piece of information is that often, many of the offsets in the list of regular expressions are the same (i.e. often we are matching the substring at offset X many times). This may be useful to leverage the solution to problem #1 above.
Thank you very much in advance for any suggestions you may have!
I will assume that you really have the power of regular expressions.
To determine whether a string is matched by one of expressions e_1, e_2, ..., e_n, just match against the expression e_1 + e_2 + ... + e_n (sometimes the + operator is written as |).
Given expression-offset pairs (e_1, o_1), ..., (e_n, o_n) and a string, you can check whether there is i such that the string is matched by expression e_i at offset o_i by matching against the expression .{o_1}e_1 + ... + .{o_n}e_n.
Depending on the form of the individual expressions, you can get sublinear performance (not in general though).
If your expressions are sufficiently simple (wildcard patterns are), AND your set of expressions is predetermined, i.e. only the input to be matched changes, THEN you may construct a finite state machine that matches the union of your expressions, i.e., the expression "(r1)|(r2)|...".
Constructing that machine takes time and space at least O(N) (but I guess it is not exponential, which is the worst case for regular expressions in general). Matching is then O(length(input)), independent of N.
OTOH, if your set of expressions is to be part of the program's input, then there is no sublinear algorithm, simply because each expression must be considered.
(1) Combine all the regular expressions as a big union: (r1)|(r2)|(r3)|...
(2) For each regex with offset n, add n dots to the beginning, plus an anchor. So his.* at offset 6 becomes ^......his.*, or, if your regex syntax supports it, ^.{6}his.*.
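Both answers amount to the same combined pattern; a brief Python sketch (the function name is mine):

import re

def matches_any(pairs, s):
    # One alternation, each branch prefixed with .{offset}; re.match
    # anchors the whole thing at the start of the string.
    combined = '|'.join('.{%d}(?:%s)' % (off, pat) for pat, off in pairs)
    return re.match(combined, s) is not None

pairs = [('his.*', 0), ('his.*', 1)]
print(matches_any(pairs, 'This is a test string'))  # True, via offset 1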

Automatic regex builder

I have N strings.
Also, there are K regular expressions, unknown to me. Each string is either matching one of the regular expressions, or it is garbage. There are total of L garbage strings in the set. Both K and L are unknown.
I'd like to deduce the regular expressions. Obviously, this problem has infinite number of solutions. I need to find a "reasonably good solution", which
1) minimizes K
2) minimizes L
3) maximizes the "specificity" of the regular expressions. I don't know what's the right term for this quality. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific".
All 3 requirements need to be taken as one compound criteria, with certain reasonable weights.
A solution for one particular case: If L = 0 and K = 1 (just one regex, and no garbage), then we can just find LCS (longest common subsequence) for the strings and come up with a corresponding regex from there. However, when we have "noise" (L > 0), this approach doesn't work.
Any ideas (or pointers to existing work) are greatly appreciated.
What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.
I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all n strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).
Two interesting queries at Google Scholar:
"minimum description length" automata
"language inference" automata
The key words in academia are "grammatical inference". Unfortunately, there aren't any efficient, general algorithms to do the sort of thing you're proposing. What's your real problem?
Edit: it sounds like you might be interested in Data Description Languages. PADS (http://www.padsproj.org/) is a typical example.
Nothing clever here; perhaps I don't fully understand the problem?
Why not just always reduce L to 0? Check each string against each regex; if a string doesn't match any of the regexes, it's garbage. If it does match, remember the regex/string(s) that matched, then run the L = 0, K = 1 LCS approach on each group to deduce each regex's definition.
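For illustration, a rough Python sketch of the L = 0, K = 1 LCS step mentioned above (a pairwise LCS reduced over the set; this is a simplification, and the names are my own):

from functools import reduce
import re

def lcs(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[''] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + ca if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j], key=len))
    return dp[-1][-1]

def regex_from_strings(strings):
    # Anchor the common subsequence, allowing arbitrary gaps between
    # its characters; a real solution would tighten the gaps.
    common = reduce(lcs, strings)
    return '.*'.join(re.escape(c) for c in common)

print(regex_from_strings(['ab123', 'ab456', 'xab789']))  # a.*b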

Complexity of Regex substitution

I couldn't find the answer to this anywhere. What is the runtime complexity of a regex match and substitution?
Edit: I work in python. But would like to know in general about most popular languages/tools (java, perl, sed).
From a purely theoretical stance:
The implementation I am familiar with would be to build a Deterministic Finite Automaton to recognize the regex. This is done in O(2^m), m being the size of the regex, using a standard algorithm. Once this is built, running a string through it is linear in the length of the string - O(n), n being string length. A replacement on a match found in the string should be constant time.
So overall, I suppose O(2^m + n).
Other theoretical info of possible interest.
For clarity, assume the standard definition of a regular expression from formal language theory:
http://en.wikipedia.org/wiki/Regular_language
Practically, this means that the only building materials are alphabet symbols, the operators of concatenation, alternation and Kleene closure, along with the unit and zero constants (which appear for group-theoretic reasons). Generally it's a good idea not to overload this term, despite the everyday practice in scripting languages, which leads to ambiguities.
There is an NFA construction that solves the matching problem for a regular expression r and an input text t in O(|r| |t|) time and O(|r|) space, where |.| is the length function. This algorithm was further improved by Myers:
http://doi.acm.org/10.1145/128749.128755
to the time and space complexity O(|r| |t| / log |t|), by using automaton node listings and the Four Russians paradigm. This paradigm seems to be named after the four Russian guys who wrote a groundbreaking paper, which is not online. However, the paradigm is illustrated in these computational biology lecture notes:
http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf
I find it hilarious to name a paradigm by the number and the nationality of the authors instead of their last names.
The matching problem for regular expressions with added backreferences is NP-complete, which was proven by Aho:
http://portal.acm.org/citation.cfm?id=114877
by a reduction from the vertex-cover problem, a classical NP-complete problem.
To match regular expressions with backreferences deterministically we could employ backtracking (not unlike the Perl regex engine) to keep track of the possible subwords of the input text t that can be assigned to the variables in r. There are only O(|t|^2) subwords that can be assigned to any one variable in r. If there are n variables in r, then there are O(|t|^(2n)) possible assignments. Once an assignment of substrings to variables is fixed, the problem reduces to plain regular expression matching. Therefore the worst-case complexity for matching regular expressions with backreferences is O(|t|^(2n)).
Note, however, that regular expressions with backreferences are still not full-featured regexen.
Take, for example, the "don't care" symbol, apart from any other operators. There are several polynomial algorithms deciding whether a set of such patterns matches an input text. For example, Kucherov and Rusinowitch:
http://dx.doi.org/10.1007/3-540-60044-2_46
define a pattern as a word w_1#w_2#...#w_n, where each w_i is a word (not a regular expression) and "#" is a variable-length "don't care" symbol not contained in any of the w_i. They derive an O((|t| + |P|) log |P|) algorithm for matching a set of patterns P against an input text t, where |t| is the length of the text and |P| is the total length of the words in P.
It would be interesting to know how these complexity measures combine, and what the complexity of the matching problem is for regular expressions with backreferences, "don't care", and other interesting features of practical regular expressions.
Alas, I haven't said a word about Python... :)
Depends on what you mean by regex. If you allow only the operators of concatenation, alternation and Kleene star, the time can actually be O(m*n + m), where m is the size of the regex and n is the length of the string. You do it by constructing an NFA (that is linear in m), and then simulating it by maintaining the set of states you're in and updating that (in O(m)) for every letter of the input.
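A minimal sketch of that set-of-states simulation, using a hand-built NFA for (a|b)*abb (the transition table is my own; a real implementation would also build the NFA from the regex):

# transitions[state][symbol] -> set of next states
transitions = {
    0: {'a': {0, 1}, 'b': {0}},
    1: {'b': {2}},
    2: {'b': {3}},
    3: {},
}
accepting = {3}

def nfa_match(text):
    states = {0}
    for ch in text:
        # O(m) work per input character: advance every live state
        states = {t for s in states for t in transitions[s].get(ch, ())}
    return bool(states & accepting)

print(nfa_match('abababb'))  # True
print(nfa_match('abba'))     # False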
Things that make regex matching difficult:
parentheses and backreferences: capturing is still OK with the aforementioned algorithm, although it raises the complexity, so it might be infeasible. Backreferences raise the recognition power of the regex, and their difficulty is well known (matching with backreferences is NP-complete, as noted above)
positive look-ahead: just another name for intersection, which raises the complexity of the aforementioned algorithm to O(m^2 + n)
negative look-ahead: a disaster for constructing the automaton (O(2^m), possibly PSPACE-complete). But it should still be possible to tackle with the dynamic algorithm in something like O(n^2 * m)
Note that with a concrete implementation, things might get better or worse. As a rule of thumb, simple features should be fast enough, and unambiguous regexes (e.g. not ones like a*a*) are better.
To delve into theprise's answer, for the construction of the automaton, O(2^m) is the worst case, though it really depends on the form of the regular expression (for a very simple one that matches a word, it's in O(m), using for example the Knuth-Morris-Pratt algorithm).
Depends on the implementation. What language/library/class? There may be a best case, but it would be very specific to the number of features in the implementation.
You can trade speed for space by building a nondeterministic finite automaton instead of a DFA. The NFA can be simulated in time linear in the input (for a fixed pattern), and it avoids the DFA's worst-case O(2^m) space. I'd expect the tradeoff to be worth it.
If you're after matching and substitution, that implies grouping and backreferences.
Here is a Perl example where grouping and backreferences can be used to solve an NP-complete problem:
http://perl.plover.com/NPC/NPC-3SAT.html
This (coupled with a few other theoretical tidbits) means that matching and substitution with such regular expressions is NP-complete.
Note that this is different from the formal definition of a regular expression, which doesn't have the notion of grouping and matches in polynomial time, as described by the other answers.
In Python's re library, even if a regex is compiled, matching can still take time exponential in the string length in some cases, because the engine is based on backtracking rather than a DFA.
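A quick way to see this for yourself (a classic catastrophic-backtracking pattern; the timings will vary by machine):

import re
import time

pattern = re.compile(r'(a+)+b')   # nested quantifiers, failing suffix
for n in (18, 22, 26):
    s = 'a' * n                   # no 'b', so matching must fail
    t0 = time.perf_counter()
    pattern.match(s)
    print(n, round(time.perf_counter() - t0, 3), 'seconds')
# The time roughly doubles with each extra 'a', i.e. grows exponentially.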