Levenshtein distance in regular expression - regex

Is it possible to include Levenshtein distance in a regular expression query?
(Except by taking a union over all the one-edit variants, like this to search for "hello" with Levenshtein distance 1:
.ello | h.llo | he.lo | hel.o | hell.
since this is stupid and unusable for larger Levenshtein distances.)

There are a couple of regex dialects out there with an approximate matching feature - namely the TRE library and the regex PyPI module for Python.
The TRE approximate matching syntax is described in the "Approximate matching settings" section at https://laurikari.net/tre/documentation/regex-syntax/. A TRE regex to match stuff within Levenshtein distance 1 of hello would be:
(hello){~1}
The regex module's approximate matching syntax is described at https://pypi.org/project/regex/ in the bullet point that begins with the text Approximate “fuzzy” matching. A regex regex to match stuff within Levenshtein distance 1 of hello would be:
(hello){e<=1}
Perhaps one or the other of these syntaxes will in time be adopted by other regex implementations, but at present I only know of these two.
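For instance, with the Python regex module (assuming it is installed, e.g. via pip install regex), a quick check of the fuzzy syntax quoted above might look like this:

import regex

pattern = regex.compile(r"(hello){e<=1}")  # allow at most one insert/delete/substitute

for candidate in ["hello", "helo", "hxllo", "hellos", "heLLo"]:
    print(candidate, "->", bool(pattern.fullmatch(candidate)))
# The first four are within distance 1 and match; "heLLo" needs two edits and does not.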

You can generate the regex programmatically. I will leave that as an exercise for the reader, but for the output of this hypothetical function (given an input of "word") you want something like this string:
"^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$"
In English, first you try to match on the word itself, then on every possible single transposition, then on every possible single insertion, then on every possible single omission or substitution (can be done simultaneously).
The length of that string, given a word of length n, is linear (and notably not exponential) in n, which seems reasonable.
You pass this to your regex generator (in Ruby it would be Regexp.new(str)) and bam, you've got a matcher for ANY word with a Damerau-Levenshtein distance of 1 from a given word.
(Damerau-Levenshtein distances of 2 are far more complicated.)
Note the use of the (?> non-backtracking (atomic) group, which means the order of the individual |'d expressions in that output matters.
I could not think of a way to "compact" that expression.
EDIT: I got it to work, at least in Elixir! https://github.com/pmarreck/elixir-snippets/blob/master/damerau_levenshtein_distance_1.exs
I wouldn't necessarily recommend this though (except for educational purposes) since it will only get you to distances of 1; a legit D-L library will let you compute distances > 1. Although since this is regex, it would probably work pretty fast once constructed (note that you should save the "compiled" regex somewhere since this code currently reconstructs it on EVERY comparison!)
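For reference, here is a minimal Python sketch of the generator described above (the function name is just illustrative). It uses an ordinary (?:...) group rather than the atomic (?>...) mentioned earlier, since Python's re only supports atomic groups from 3.11, so the ordering of alternatives is not load-bearing here:

import re

def dl1_pattern(word):
    # Build a pattern matching any string within Damerau-Levenshtein distance 1
    # of `word`: the word itself, adjacent transpositions, single insertions,
    # and single omissions/substitutions (handled together with ".?").
    alts = [re.escape(word)]
    for i in range(len(word) - 1):  # adjacent transpositions
        alts.append(re.escape(word[:i] + word[i + 1] + word[i] + word[i + 2:]))
    for i in range(len(word), -1, -1):  # single insertions
        alts.append(re.escape(word[:i]) + "." + re.escape(word[i:]))
    for i in range(len(word) - 1, -1, -1):  # omission or substitution
        alts.append(re.escape(word[:i]) + ".?" + re.escape(word[i + 1:]))
    return "^(?:" + "|".join(alts) + ")$"

matcher = re.compile(dl1_pattern("word"))  # compile once, reuse for every comparison
print(bool(matcher.match("wodr")))    # True  (transposition)
print(bool(matcher.match("wrd")))     # True  (omission)
print(bool(matcher.match("woard")))   # True  (insertion)
print(bool(matcher.match("wooord")))  # False (distance 2)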

Is there a possibility to include Levenshtein distance in a regular expression query?
No, not in a sane way. Implementing - or using an existing - Levenshtein distance algorithm is the way to go.

Related

Check if a regex is ambiguous

I wonder if there is a way to check the ambiguity of a regular expression automatically. A regex is considered ambiguous if there is a string which can be matched in more than one way by the regex. For example, given a regex R = (ab)*(a|b)*, we can detect that R is an ambiguous regex since there are two ways to match the string ab with R.
UPDATE
The question is about how to check if the regex is ambiguous by definition. I know that in practical implementations of regex engines there is always one way to match a regex, but please read and think about this question in an academic way.
A regular expression is one-ambiguous if and only if the corresponding Glushkov automaton is not deterministic. This can be done in linear time now. Here's a link. BTW, deterministic regular expressions have been investigated also under the name of one-unambiguity.
You are forgetting greed. Usually one section gets first dibs because it is a greedy match, and so there is no ambiguity.
If instead you are talking about a mythical pattern matching engine without practical details like greed, then the answer is yes, you can.
Take every element of the pattern, and try every possible subset against every possible string. If more than one subset matches the same string, then there's an ambiguity. Optimizing this to take less than infinite time is left as an exercise for the reader.
I read a paper published around 1980 which showed that whether a regular expression is ambiguous can be determined in O(n^4) time. I wish I could give you a reference, but I no longer know the reference or even the journal.
A more expensive way to determine if a regular expression is ambiguous is to construct a finite state machine (exponential in time and space in the worst case) from the regular expression using subset construction. Now consider any state X of the FSM constructed from NFA states N. If, for any two NFA states n1, n2 of X, follow(n1) intersect follow(n2) is not empty, then the regular expression is ambiguous. If this is not true for any state of the FSM, then the regular expression is not ambiguous.
A possible solution:
Construct an NFA for the regexp. Then analyse the NFA, starting with a set of states consisting solely of the initial state, and do a depth-first or breadth-first traversal in which you keep track of whether you can be in multiple states. You also need to track the path taken in order to eliminate cycles.
For example, your (ab)*(a|b)* can be modeled with three states:
  |   a   |  b
p | {q,r} | {r}
q | {}    | {p}
r | {r}   | {r}
Here p is the starting state, and both p and r accept.
You then need to consider both letters and proceed with the sets {q,r} and {r}. The set {r} only leads to {r}, giving a cycle, so we can close that path. From {q,r}, an a takes us to {r}, which is an accepting state, but since this path cannot accept if we start by going to q, we only have a single path here, and we can close it once we identify the cycle. Getting a b from {q,r} takes us to {p,r}. Since both of these accept, we have identified an ambiguous position and we can conclude that the regexp is ambiguous.
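A minimal Python sketch of this kind of check, assuming the (epsilon-free) NFA is given as a transition table like the one above; it searches the self-product of the NFA and remembers whether the two runs have ever diverged:

from collections import deque

def is_ambiguous(delta, start, accepting):
    # delta: (state, symbol) -> set of successor states
    # Returns True if some string has two distinct accepting runs.
    alphabet = {sym for (_, sym) in delta}
    seen = {(start, start, False)}
    queue = deque(seen)
    while queue:
        p, q, diverged = queue.popleft()
        if diverged and p in accepting and q in accepting:
            return True
        for sym in alphabet:
            for p2 in delta.get((p, sym), ()):
                for q2 in delta.get((q, sym), ()):
                    nxt = (p2, q2, diverged or p2 != q2)
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return False

# The three-state NFA from the table above for (ab)*(a|b)*:
delta = {
    ("p", "a"): {"q", "r"}, ("p", "b"): {"r"},
    ("q", "b"): {"p"},
    ("r", "a"): {"r"},      ("r", "b"): {"r"},
}
print(is_ambiguous(delta, "p", {"p", "r"}))  # True: "ab" has two accepting runs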

Create regex from samples algorithm

AFAIK no one has implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I gave my algorithm these two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back, and you might find 1000 reasons why such an algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about the regex itself, either. Basically what I want is an algorithm that takes samples like the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want in a string, without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill, a new feature of MS Excel 2013, does exactly the task you want, but it does not give you the regular expression. It's an NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulations from multiple examples, go to the official Flash Fill website and read a few papers. They have pseudo-code and demo movies as well.
There are many such algorithms, in fact. This is a research area called "grammatical inference".
I know RPNI, for example (you could also look at the probabilistic branch: Alergia, MDI, DEES). These algorithms generate DSAs (deterministic state automata). In fact you absolutely don't need to enter the full strings from your example, only the substrings.
There are also some algorithms that generate non-deterministic automata directly.
Of course, getting a regular expression from a non-deterministic automaton is easy.
The main ideas are simple:
Generate a PTSA (prefix tree state automaton) from your sample.
Then try to "merge" some states. From these merges, loops will emerge (i.e. * in the regular expression). All the difficulty lies in choosing the right rule for merging.
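A minimal Python sketch of the first step, building the prefix tree acceptor from positive samples (the state-merging step that RPNI and friends perform is omitted here):

def build_prefix_tree_acceptor(samples):
    # One state per distinct prefix; states reached by full samples accept.
    transitions = {}   # (state, symbol) -> state
    accepting = set()
    next_state = 1     # state 0 is the root
    for word in samples:
        state = 0
        for ch in word:
            if (state, ch) not in transitions:
                transitions[(state, ch)] = next_state
                next_state += 1
            state = transitions[(state, ch)]
        accepting.add(state)
    return transitions, accepting

transitions, accepting = build_prefix_tree_acceptor(["1234", "456"])
print(transitions)  # chains of states, one per sample, sharing common prefixes
print(accepting)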
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Is it possible to calculate the edit distance between a regexp and a string?

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.
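For reference, the plain (transposition-free) edit distance between two strings can be computed with the standard dynamic program:

def levenshtein(a, b):
    # Classic row-by-row DP over insertions, deletions and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))     # substitution / match
        prev = curr
    return prev[-1]

print(levenshtein("xyz", "XYZ"))  # 3, as in the example above
print(levenshtein("xYZ", "XYZ"))  # 1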
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want the efficient, approximate searching, a less flexibly but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.
If you mean the string with the smallest Levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the regex to a DFA yourself, then try to match, and whenever something fails, non-deterministically continue as if it had passed and keep track of the number of differences. You could use A* search or something similar for this; it would be quite inefficient though (O(2^n) worst case).
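One way to make that concrete (a sketch, not an established library API) is to run Dijkstra over pairs of (position in the string, DFA state), so the answer is the cheapest way to reach the end of the string in an accepting state. The dict-based DFA representation below is just an illustrative choice, and only insertions, deletions and substitutions are counted:

import heapq

def edit_distance_to_dfa(s, delta, start, accepting, alphabet):
    # Smallest number of edits turning `s` into some string the DFA accepts.
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, q = heapq.heappop(heap)
        if d > dist.get((i, q), float("inf")):
            continue
        if i == len(s) and q in accepting:
            return d
        moves = []
        if i < len(s):
            moves.append((d + 1, i + 1, q))                    # delete s[i]
        for c in alphabet:
            q2 = delta.get((q, c))
            if q2 is None:
                continue
            moves.append((d + 1, i, q2))                       # insert c
            if i < len(s):
                moves.append((d + (s[i] != c), i + 1, q2))     # match / substitute
        for nd, ni, nq in moves:
            if nd < dist.get((ni, nq), float("inf")):
                dist[(ni, nq)] = nd
                heapq.heappush(heap, (nd, ni, nq))
    return None  # no accepting state is reachable at all

# DFA for [0-9]{3}: states 0..3, accepting state 3.
digits = "0123456789"
delta = {(i, c): i + 1 for i in range(3) for c in digits}
print(edit_distance_to_dfa("a23", delta, 0, {3}, digits))  # 1, as in the example above
print(edit_distance_to_dfa("ab3", delta, 0, {3}, digits))  # 2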

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with a (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would not be orthogonal, as the string '2_a' is in both S1 and S2.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
 * @return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise, or Null if it can't be determined
 */
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);
By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical languages, whereas the OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(g2) ≠ Ø.
It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there are no strings that match both regexes. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!
The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.
Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)
I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
/**
 * @return true if the two regexes will never both match a given string
 */
public boolean isRegexOrthogonal(String regex1, String regex2) {
    Automaton automaton1 = new RegExp(regex1).toAutomaton();
    Automaton automaton2 = new RegExp(regex2).toAutomaton();
    return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.
You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.
Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.
I believe kdgregory is correct: you're using "orthogonal" to mean "complement". Is this correct?
Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that no such thing exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.
I would do the following:
Convert each regex to an FSA, using something like the following structure:
struct FSANode
{
    bool accept;
    Map<char, FSANode> links;
}

List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
    CombinedNode(FSANode left, FSANode right)
    {
        this.left = left;
        this.right = right;
    }

    Map<char, CombinedNode> links;
    bool valid { get { return !left.accept || !right.accept; } }

    public FSANode left;
    public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."
Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are known procedures for converting a regular expression to a DFA, and for converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.
I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.
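A minimal Python sketch of that product-automaton idea, assuming both regexes have already been converted to DFAs given as (transitions, start, accepting) tuples, where transitions maps (state, symbol) -> state and missing entries are treated as a dead state:

from collections import deque

def dfas_are_orthogonal(dfa1, dfa2):
    # Orthogonal (disjoint languages) iff no reachable product state accepts in both DFAs.
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    alphabet = {sym for (_, sym) in d1} | {sym for (_, sym) in d2}
    seen = {(s1, s2)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        if p in f1 and q in f2:
            return False  # some string is accepted by both
        for sym in alphabet:
            nxt = (d1.get((p, sym)), d2.get((q, sym)))
            if None not in nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Tiny example: even-length vs odd-length strings of a's are disjoint.
even = ({(0, "a"): 1, (1, "a"): 0}, 0, {0})
odd  = ({(0, "a"): 1, (1, "a"): 0}, 0, {1})
print(dfas_are_orthogonal(even, odd))   # True
print(dfas_are_orthogonal(even, even))  # False (they share the empty string, "aa", ...)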

Complexity of Regex substitution

I didn't get the answer to this anywhere. What is the runtime complexity of a Regex match and substitution?
Edit: I work in python. But would like to know in general about most popular languages/tools (java, perl, sed).
From a purely theoretical stance:
The implementation I am familiar with would be to build a Deterministic Finite Automaton to recognize the regex. This is done in O(2^m), m being the size of the regex, using a standard algorithm. Once this is built, running a string through it is linear in the length of the string - O(n), n being string length. A replacement on a match found in the string should be constant time.
So overall, I suppose O(2^m + n).
Other theoretical info of possible interest.
For clarity, assume the standard definition for a regular expression
http://en.wikipedia.org/wiki/Regular_language
from formal language theory. Practically, this means that the only building material are alphabet symbols, the operators of concatenation, alternation and Kleene closure, along with the unit and zero constants (which appear for group-theoretic reasons). Generally it's a good idea not to overload this term despite the everyday practice in scripting languages, which leads to ambiguities.
There is an NFA construction that solves the matching problem for a regular expression r and an input text t in O(|r| |t|) time and O(|r|) space, where |-| is the length function. This algorithm was further improved by Myers
http://doi.acm.org/10.1145/128749.128755
to the time and space complexity O(|r| |t| / log |t|) by using automaton node listings and the Four Russians paradigm. This paradigm seems to be named after four Russian guys who wrote a groundbreaking paper which is not online. However, the paradigm is illustrated in these computational biology lecture notes
http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf
I find it hilarious to name a paradigm by the number and the nationality of the authors instead of their last names.
The matching problem for regular expressions with added backreferences is NP-complete, which was proven by Aho
http://portal.acm.org/citation.cfm?id=114877
by a reduction from the vertex-cover problem, a classical NP-complete problem.
To match regular expressions with backreferences deterministically we could employ backtracking (not unlike the Perl regex engine) to keep track of the possible subwords of the input text t that can be assigned to the variables in r. There are only O(|t|^2) subwords that can be assigned to any one variable in r. If there are n variables in r, then there are O(|t|^(2n)) possible assignments. Once an assignment of substrings to variables is fixed, the problem reduces to plain regular expression matching. Therefore the worst-case complexity for matching regular expressions with backreferences is O(|t|^(2n)).
Note, however, that regular expressions with backreferences are not yet full-featured regexen.
Take, for example, the "don't care" symbol, apart from any other operators. There are several polynomial algorithms deciding whether a set of patterns matches an input text. For example, Kucherov and Rusinowitch
http://dx.doi.org/10.1007/3-540-60044-2_46
define a pattern as a word w_1#w_2#...#w_n where each w_i is a word (not a regular expression) and "#" is a variable-length "don't care" symbol not contained in any of the w_i. They derive an O((|t| + |P|) log |P|) algorithm for matching a set of patterns P against an input text t, where |t| is the length of the text and |P| is the length of all the words in P.
It would be interesting to know how these complexity measures combine and what the complexity measure of the matching problem is for regular expressions with backreferences, "don't care" and other interesting features of practical regular expressions.
Alas, I haven't said a word about Python... :)
Depends on what you define by regex. If you allow the operators of concatenation, alternation and Kleene star, the time can actually be O(m*n+m), where m is the size of the regex and n is the length of the string. You do it by constructing an NFA (that is linear in m), and then simulating it by maintaining the set of states you're in and updating that (in O(m)) for every letter of input (see the sketch after the list below).
Things that make regex parsing difficult:
parentheses and backreferences: capturing is still OK with the aforementioned algorithm, although it pushes the complexity higher, so it might be infeasible. Backreferences raise the recognition power of the regex, and their difficulty is well known.
positive look-ahead: is just another name for intersection, which raises the complexity of the aforementioned algorithm to O(m^2+n)
negative look-ahead: a disaster for constructing the automaton (O(2^m), possibly PSPACE-complete). But should still be possible to tackle with the dynamic algorithm in something like O(n^2*m)
Note that with a concrete implementation, things might get better or worse. As a rule of thumb, simple features should be fast enough, and unambiguous (eg. not like a*a*) regexes are better.
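A minimal sketch of that set-of-states simulation (the NFA here is written by hand; building it from the regex via the Thompson or Glushkov construction is omitted):

def nfa_accepts(delta, start_states, accepting, text):
    # Carry the set of states the NFA could be in after each input character.
    current = set(start_states)
    for ch in text:
        current = {q2 for q in current for q2 in delta.get((q, ch), ())}
        if not current:
            return False
    return bool(current & accepting)

# Hand-written epsilon-free NFA for a*b (state 0 is the start, state 1 accepts).
delta = {(0, "a"): {0}, (0, "b"): {1}}
print(nfa_accepts(delta, {0}, {1}, "aaab"))  # True
print(nfa_accepts(delta, {0}, {1}, "aba"))   # False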
To delve into theprise's answer, for the construction of the automaton, O(2^m) is the worst case, though it really depends on the form of the regular expression (for a very simple one that matches a word, it's in O(m), using for example the Knuth-Morris-Pratt algorithm).
Depends on the implementation. What language/library/class? There may be a best case, but it would be very specific to the number of features in the implementation.
You can trade space for speed by building a nondeterministic finite automaton instead of a DFA. This can be traversed in linear time. Of course, in the worst case this could need O(2^m) space. I'd expect the tradeoff to be worth it.
If you're after matching and substitution, that implies grouping and backreferences.
Here is a Perl example where grouping and backreferences can be used to solve an NP-complete problem:
http://perl.plover.com/NPC/NPC-3SAT.html
This (coupled with a few other theoretical tidbits) means that using regular expressions for matching and substitution is NP-complete.
Note that this is different from the formal definition of a regular expression - which doesn't have the notion of grouping - and which matches in polynomial time, as described by the other answers.
In Python's re library, even if a regex is compiled, the complexity can still be exponential (in string length) in some cases, as it is not built on a DFA. Some references here, here or here.
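For instance, a quick way to see that blow-up with the standard library is the classic (a+)+$ pattern against a string that almost matches:

import re
import time

pattern = re.compile(r"(a+)+$")  # compiled, yet still backtracks exponentially
for n in (20, 22, 24):
    text = "a" * n + "b"
    start = time.perf_counter()
    pattern.match(text)
    print(n, f"{time.perf_counter() - start:.2f}s")
# Each additional 'a' roughly doubles the running time on CPython's backtracking engine.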