Is it possible to calucate the edit distance between a regexp and a string? - regex

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ and xyz.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.

You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want the efficient, approximate searching, a less flexibly but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.

If you mean the string with the smallest levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the Regex to a DFA yourself, then try to match and whenever something fails, non-deterministically continue as if it had passed and keep track of the number differences. you could use A* search or something similar for this, it would be quite inefficient though (O(2^n) worst case)

Related

Levenshtein distance in regular expression

Is it possible to include Levenshtein distance in a regular expression query?
(Except by making union between permutations, like this to search for "hello" with Levenshtein distance 1:
.ello | h.llo | he.lo | hel.o | hell.
since this is stupid and unusable for larger Levenshtein distances.)
There are a couple of regex dialects out there with an approximate matching feature - namely the TRE library and the regex PyPI module for Python.
The TRE approximate matching syntax is described in the "Approximate matching settings" section at https://laurikari.net/tre/documentation/regex-syntax/. A TRE regex to match stuff within Levenshtein distance 1 of hello would be:
(hello){~1}
The regex module's approximate matching syntax is described at https://pypi.org/project/regex/ in the bullet point that begins with the text Approximate “fuzzy” matching. A regex regex to match stuff within Levenshtein distance 1 of hello would be:
(hello){e<=1}
Perhaps one or the other of these syntaxes will in time be adopted by other regex implementations, but at present I only know of these two.
You can generate the regex programmatically. I will leave that as an exercise for the reader, but for the output of this hypothetical function (given an input of "word") you want something like this string:
"^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$"
In English, first you try to match on the word itself, then on every possible single transposition, then on every possible single insertion, then on every possible single omission or substitution (can be done simultaneously).
The length of that string, given a word of length n, is linear (and notably not exponential) with n.
Which is reasonable, I think.
You pass this to your regex generator (like in Ruby it would be Regexp.new(str)) and bam, you got a matcher for ANY word with a Damerau-Levenshtein distance of 1 from a given word.
(Damerau-Levenshtein distances of 2 are far more complicated.)
Note use of the (?> non-backtracing construct which means the order of the individual |'d expressions in that output matter.
I could not think of a way to "compact" that expression.
EDIT: I got it to work, at least in Elixir! https://github.com/pmarreck/elixir-snippets/blob/master/damerau_levenshtein_distance_1.exs
I wouldn't necessarily recommend this though (except for educational purposes) since it will only get you to distances of 1; a legit D-L library will let you compute distances > 1. Although since this is regex, it would probably work pretty fast once constructed (note that you should save the "compiled" regex somewhere since this code currently reconstructs it on EVERY comparison!)
is there possiblity how to include levenshtein distance in regular expression query?
No, not in a sane way. Implementing - or using an existing - Levenshtein distance algorithm is the way to go.

String matching algorithm

Say I have 3 strings. And then 1 more string.
Is there an algorithm that would allow me to find which one of the first 3 strings matches the 4th string the most?
None of the strings are going to be exact matches, I'm just trying to find the closest match.
And if the algorithm already exists in STL, that would be nice.
Thanks in advance.
You don't specify what exactly you mean by "matches the most", so I assume you don't have precise requirements. In that case, Levenshtein distance in a reasonable metric. Simply compute the Levenshtein distance between each of the three strings and the fourth, and pick the one that gives the lowest distance.
You can implement the Levenshtein Distance algorithm, it provides a very nice measure of how close a match between two strings you have. It measures how many keystrokes you need to make in order to turn one string into the other. You can find a C++ implementation here.
Compute Levenshtein Distance between string #4 and the three strings that you have. Pick the string with the smallest distance.
There's nothing ready in the STL, but what you need is some kind of string metric.
Use Levenshtein distance if the strings are similar up to some typos, e.g. Hello vs Helol.
Use Jaccard distance/Dice's coefficient on the set of n-grams if the strings might change more drastically, like hello world versus world hello.
You have approximate string matching problem. Depending on what kind of matching you want to perform, you will use different algorithm. There are many..SOUNDEX, Jaro-Winkler, Levenstein Distance, metaphore... etc. Regarding STL, I don't know any functions that implement those algorithms, but you can take a look here for some soource using c++. Also, note that if you are getting your strings from a database, it is very likely that your database engine implements some of those algorithms (most likely SOUNDEX).

Create regex from samples algorithm

AFAIK no one have implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I'd give my algorithm this two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back and you might find 1000 reasons why such algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill a new feature of MS Excel 2013 would do exactly the task you want, but it does not give you the regular expression. It's a NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, Go Flash Fill official website and read a few papers. They have pseudo-code and demo. movies as well.
There are many such algorithm in fact. This is a research area called "Grammatical inference".
I know RPNI, for example. (you could also look on the probabilistic branch, alergia, MDI, DEES). These algorithms generate DSA (Deterministic State Automata). In fact you absolutely don't need to enter the strings in your example. Only substrings.
There are also some algorithms to generate directly Non deterministic automata.
Of course, get the regular expression from an Non Deterministic Automata is easy.
The main ideas are simple:
Generate a PTSA (Prefix Tree State Automata) from your sample.
Then, you have to try to "merge" some states. From these merge, will emerge loops (i.e. * in the regular expression). All the difficulty being to choose the right rule to merge.
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.

Distance between regular expression

Can we compute a sort of distance between regular expressions ?
The idea is to mesure in which way two regular expression are similar.
You can build deterministic finite-state machines for both regular expressions and compare the transitions. The difference of both transitions can then be used to measure the distance of these regular expressions.
There are a few of metrics you could use:
The length of a valid match. Some regexs have a fixed size, some an upper limit and some a lower limit. Compare how similar their lengths or possible lengths are.
The characters that match. Any regex will have a set of characters a match can contain (maybe all characters). Compare the set of included characters.
Use a large document and see how many matches each regex makes and how many of those are identical.
Are you looking for strict equivalence?
I suppose you could compute a Levenshtein Distance between the actual Regular Experssion strings. That's certainly one way of measuring a "distance" between two different Regular Expression strings.
Of course, I think it's possible that regular expressions are not required here at all, and computing the Levenshtein Distance of the actual "value" strings that the Regular Expressions would otherwise be applied to, may yield a better result.
If you have two regular expressions and have a set of example inputs you could try matching every input against each regex. For each input:
If they both match or both don't match, score 0.
If one matches and the other doesn't, score 1.
Sum this score over all inputs, and this will give you a 'distance' between the regular expressions. This will give you an idea of how often two regular expressions will differ for typical input. It will be very slow to calculate if your sample input set is large. It won't work at all if both regexes fail to match for almost all random strings and your expected input is entirely random. For example the regex 'sgjlkwren' and the regex 'ueuenwbkaalf' would probably both never match anything if tested on random input, so this metric would say the distance between them is zero. That might or might not be what you want (probably not).
You might be able to analyze the structure of the regex and use biased random sampling to deliberately hit strings that match more frequently than in completely random input. For example, if both regex require that the string starts with 'foo', you could make sure that your test inputs also always start with foo, to avoid wasting time testing strings that you know will fail for both.
So in conclusion: unless you have a very specific situation with a restricted input set and/or restricted regular expression language, I'd say its not possible. If you do have some restrictions on your input and on the regular expression, it might be possible. Please specify what these restrictions are and maybe I can come up with something better.
There's an answer hidden in an earlier question here on SO: Generating strings from regexes. You can calculate an (asymmetric) distance measure by generating strings using one regex and checking how many of those match the other regex.
This can be optimized by stripping out shared prefixes/suffixes. E.g. a[0-9]* and a[0-7]* share the a prefix, so you can calculate the distance between [0-9]* and [0-7]* instead.
I think first you need to understand for yourself how you see a "difference" between two expressions. Basically, define a distance metric.
In general case, it would be quite different to make. Depending on what you need to do, you may see allowing one different character in some place as a big difference. In the other case, allowing any number of consequent but same characters may not yield much difference.
I'd like to emphasize as well that normally when they talk about distance functions, they apply them to..., well, let's call them, tokens. In our case, character sequences. What you are willing to do, is to apply this method not to those tokens, but to the rules a multitude of tokens will match. I'm not quite sure it even makes sense.
Still, I believe we could think of something, but not in general, but for one particular and quite restricted case. Do you have some sort of example to show us?

How can you detect if two regular expressions overlap in the strings they can match?

I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?
There's no easy way.
As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.
Given any pair of NFA, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).
The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFA. Alternatively, if you've already shown how to compute the complement of a NFA, then you can (DeMorgan's law style) get the intersection by complement(union(complement(A),complement(B))).
Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:
Some classes of regular languages can
only be described by deterministic
finite automata whose size grows
exponentially in the size of the
shortest equivalent regular
expressions. The standard example are
here the languages L_k consisting of
all strings over the alphabet {a,b}
whose kth-last letter equals a.
By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.
Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:
if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')
and X ->a Y (in the other NFA)
then (A,X) ->a (B,Y) in the intersection
(C,Z) is final iff C is final in the one NFA and Z is final in the other.
To start the process off, you start in the pair of start states for the two NFAs e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. You'd store the fact that you expanded a state's arcs (e.g. in a hash table) and end up exploring all the states reachable from the start.
If you allow epsilon transitions (that don't output a letter), that's fine:
if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.
This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).
This regex inverter (written using pyparsing) works with a limited subset of re syntax (no * or + allowed, for instance) - you could invert two re's into two sets, and then look for a set intersection.
In theory, the problem you describe is impossible.
In practice, if you have a manageable number of regular expressions that use a limited subset or of regexp syntax, and/or a limited selection of strings that can be used to match against the container of regular expressions, you might be able to solve it.
Assuming you're not trying to solve the abstract general case, there might be something you can do to solve a practical application. Perhaps if you provided a representative sample of the regexps, and described the strings you'd be matching with, a heuristic could be created to solve the problem.