What is the difference between the substitution-cost-1 and substitution-cost-2 variants of Levenshtein distance for string matching? - levenshtein-distance

I am learning fuzzy search, and for this I have been reading about Levenshtein distance. I have found two variations of the substitution operation: some sites say the substitution cost is 1, while others say it is 2 (an insertion plus a deletion). I am a bit confused about which one I should use for an implementation. I need a clear explanation of the concept.
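For what it's worth, here is a minimal Python sketch (my own, not from any of those sites) showing where the substitution cost plugs into the standard dynamic-programming recurrence; with a cost of 2, a substitution is never cheaper than a deletion followed by an insertion, which is the second variant described above.

```python
# Minimal sketch: Levenshtein distance with a configurable substitution cost.
# sub_cost=1 gives the classic Levenshtein distance; sub_cost=2 makes a
# substitution count the same as a deletion plus an insertion.
def edit_distance(a: str, b: str, sub_cost: int = 1) -> int:
    m, n = len(a), len(b)
    # dp[j] holds the distance between a[:i] and b[:j] for the current row i.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            if a[i - 1] == b[j - 1]:
                dp[j] = prev_diag
            else:
                dp[j] = min(dp[j] + 1,             # deletion
                            dp[j - 1] + 1,         # insertion
                            prev_diag + sub_cost)  # substitution
            prev_diag = cur
    return dp[n]

print(edit_distance("kitten", "sitting"))              # 3 with sub_cost=1
print(edit_distance("kitten", "sitting", sub_cost=2))  # 5 with sub_cost=2
```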

Related

Why do evil regular expressions cause ReDoS?

Currently I am looking into ReDoS: Regular expression Denial of Service.
Some bad (evil) regular expressions make validation very slow,
but why? I searched Wikipedia and OWASP; the answers are mainly about NFAs and DFAs, which I can hardly understand.
Could anyone help me with a good example and explanation?
It's called Catastrophic Backtracking.
It occurs when there's no match, but there are O(2^n) ways to not match that must all be explored before returning false.
The example in the linked article from https://www.regular-expressions.info is (x+x+)+y, which when used with input of xxxxxxxxxx will take about 2500 steps to discover there's no match. Add one x and it takes 5000 steps, and so on. With input of 100 xs, you're talking billions of years of computing time.
The reason is that there are roughly m ways for x+x+ to match m x's, and roughly n/m ways for (x+x+) to be repeated over n x's, where m can be any number less than n. The exploration tree is like a binary tree of ways of carving up the input, which leads to the O(2^n) time complexity.
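To see the blow-up concretely, here is a small timing sketch of my own (not from the linked article) using Python's backtracking re engine; each extra x roughly doubles the time it takes to report that there is no match.

```python
# Rough timing demo of catastrophic backtracking with (x+x+)+y.
# Python's re module uses a backtracking engine, so the work roughly
# doubles with every extra 'x' when the match fails.
import re
import time

pattern = re.compile(r'(x+x+)+y')

for n in range(16, 24):
    text = 'x' * n          # no 'y', so the match must fail
    start = time.perf_counter()
    pattern.match(text)
    elapsed = time.perf_counter() - start
    print(f'n={n}: {elapsed:.3f}s')
```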

Check if 2 minimum DFA are equivalent

I have 2 minimized DFAs and I need to check whether they are equivalent.
If they are equivalent, the problem is to find an efficient comparison of states regardless of the different labels. In my case the DFAs are tables, so I need to find the permutation that matches the rows of the first DFA with the rows of the second DFA.
I also thought about doing a breadth-first search of each DFA and building the minimum access string for each state, then comparing the first list with the second list (this should be independent of the particular input; for example, 001 and 110 could be interchangeable).
I'm interested both in a direct, inefficient algorithm and in more sophisticated algorithms.
The right approach is to construct another DFA with:
L3 = (L1 - L2) ∪ (L2 - L1)
and test whether L3 is empty or not. If L3 is empty then L1 = L2, otherwise L1 ≠ L2.
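As a rough illustration of that approach, here is a Python sketch (with a made-up DFA encoding of transition dicts and accepting-state sets, assuming both DFAs are complete and share an alphabet): build the product automaton and search for a reachable state that exactly one of the two DFAs accepts.

```python
# Sketch: equivalence of two complete DFAs via the symmetric difference
# L3 = (L1 - L2) U (L2 - L1). The DFA encoding here is my own assumption.
from collections import deque

def equivalent(start1, delta1, accept1, start2, delta2, accept2, alphabet):
    seen = {(start1, start2)}
    queue = deque([(start1, start2)])
    while queue:
        q1, q2 = queue.popleft()
        # A product state belongs to the symmetric difference iff exactly
        # one of the component states accepts; reaching one means L1 != L2.
        if (q1 in accept1) != (q2 in accept2):
            return False
        for a in alphabet:
            nxt = (delta1[(q1, a)], delta2[(q2, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True  # no distinguishing string is reachable, so L1 = L2

# Example: both DFAs accept strings over {0,1} with an even number of 1s.
d1 = {('e','0'):'e', ('e','1'):'o', ('o','0'):'o', ('o','1'):'e'}
d2 = {('A','0'):'A', ('A','1'):'B', ('B','0'):'B', ('B','1'):'A'}
print(equivalent('e', d1, {'e'}, 'A', d2, {'A'}, '01'))  # True
```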
I found these algorithms:
- Symmetric difference
- Table-filling algorithm
- Faster Table-Filling algorithm O(n^2)
- Hopcroft algorithm
- Nearly Linear algorithm by Hopcroft and Karp
A complete reference is:
Algorithms for testing equivalence of finite automata, with a grading tool for Jflap - Norton, 2009
I accepted my own answer because the one by #abbaasi is too incomplete.
I will accept any other answer with a significant contribution.
I remember that a minimal DFA is unique. So if you have 2 minimized DFAs, I think you only need to check whether they are the same, up to a renaming of the states.
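Building on that observation, here is a small sketch (same kind of made-up DFA encoding, and it assumes both inputs really are complete minimal DFAs): walk the two automata in lockstep from their start states and check that the induced state renaming is consistent and preserves acceptance.

```python
# Sketch: two *minimal* complete DFAs accept the same language iff they are
# identical up to a renaming of states. A lockstep BFS from the start states
# either builds that renaming or finds a conflict.
from collections import deque

def isomorphic(start1, delta1, accept1, start2, delta2, accept2, alphabet):
    mapping = {start1: start2}          # candidate renaming q1 -> q2
    queue = deque([start1])
    while queue:
        q1 = queue.popleft()
        q2 = mapping[q1]
        if (q1 in accept1) != (q2 in accept2):
            return False                 # acceptance is not preserved
        for a in alphabet:
            n1, n2 = delta1[(q1, a)], delta2[(q2, a)]
            if n1 in mapping:
                if mapping[n1] != n2:
                    return False         # renaming is not consistent
            else:
                mapping[n1] = n2
                queue.append(n1)
    return True

# Example: two minimal DFAs for "even number of 1s" over {0,1}.
d1 = {('e','0'):'e', ('e','1'):'o', ('o','0'):'o', ('o','1'):'e'}
d2 = {('A','0'):'A', ('A','1'):'B', ('B','0'):'B', ('B','1'):'A'}
print(isomorphic('e', d1, {'e'}, 'A', d2, {'A'}, '01'))  # True
```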

DFA to regular expression time complexity

I am looking at the time complexity analysis of converting DFAs to regular expressions in the
"Introduction to the Automata Theory, Languages and Computation", 2nd edition, page 151, by Ullman et al. This method is sometimes referred to as the transitive closure method. I don't understand how they came up with the 4^n expression in the O((n^3)*(4^n)) time complexity.
I understand that the 4^n expression holds regarding space complexity, but, regarding time complexity, it seems that we are performing only four constant time operations for each pair of states at each iteration, using the results of the previous iterations. What am I exactly missing?
It's a crude bound on the complexity of an algorithm that isn't using the right data structures. I don't think that there's much to explain other than that the authors clearly did not care to optimize here, probably because their main point was that regular expressions are at least as expressive as DFAs and because they feel that it's pointless to optimize this exponential-time algorithm.
There are three nested loops of n iterations each; the regular expressions constructed during iteration k of the outer loop inductively have size O(4^k), since they are constructed from at most four regular expressions constructed during the previous iteration. If the algorithm copies these subexpressions and we overestimate the regular-expression size bound at O(4^n) for all iterations, then we get O(n^3 4^n).
Obviously we can do better. Without eliminating the copying, we can get O(sum_{k=1}^n n^2 4^k) = O(n^2 (n + 4^n)) by bounding the geometric sum properly. Moreover, as you point out, we don't need to copy at all, except at the end if we agree with templatetypedef that the output must be completely written out, giving a running time of O(n^3) to prepare the regular expression and O(4^n) to write it out. The space complexity for this version equals the time complexity.
I suppose your doubt is about the n^3 factor in the time complexity.
Let us assume R_ij^(k) represents the set of all strings that take the automaton from state q_i to state q_j without passing through any state numbered higher than k.
Then the iterative formula for R_ij^(k) is shown below:
R_ij^(k) = R_ik^(k-1) (R_kk^(k-1))* R_kj^(k-1) + R_ij^(k-1)
This technique is similar to the all-pairs shortest path problem. The only difference is that we take the union and concatenation of regular expressions instead of summing up distances. The all-pairs shortest path problem takes O(n^3) operations, so we can expect the same O(n^3) count of regular-expression operations for the DFA to regular expression conversion as well. The same method can also be used to convert NFAs and ε-NFAs to the corresponding regular expressions.
The main problem of transitive closure approach is that it creates very large regular expressions. This large length is due to the repeated union of concatenated terms.
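To make the recurrence concrete, here is a rough Python sketch of the transitive-closure construction (states numbered 1..n, my own naive string representation of the regular expressions, with no simplification), which also shows exactly where the repeated copying and the size blow-up come from.

```python
# Sketch: transitive-closure (Kleene) construction of a regex from a DFA.
# Regular expressions are plain strings with no simplification ('ε' and '∅'
# are the usual theory symbols), so each iteration roughly quadruples sizes.
def dfa_to_regex(n_states, delta, start, accepting, alphabet):
    # R[(i, j)] is the regex for paths i -> j using no intermediate state
    # numbered higher than the current k (states are numbered 1..n_states).
    R = {}
    for i in range(1, n_states + 1):
        for j in range(1, n_states + 1):
            symbols = [a for a in alphabet if delta[(i, a)] == j]
            if i == j:
                symbols.append('ε')
            R[(i, j)] = '(' + '|'.join(symbols) + ')' if symbols else '∅'

    for k in range(1, n_states + 1):
        R_new = {}
        for i in range(1, n_states + 1):
            for j in range(1, n_states + 1):
                # R_ij^(k) = R_ik^(k-1) (R_kk^(k-1))* R_kj^(k-1) + R_ij^(k-1)
                through_k = f'{R[(i, k)]}{R[(k, k)]}*{R[(k, j)]}'
                R_new[(i, j)] = f'({through_k}|{R[(i, j)]})'
        R = R_new

    return '|'.join(R[(start, f)] for f in accepting)

# Example: DFA over {0,1} accepting strings that end in 1.
delta = {(1,'0'):1, (1,'1'):2, (2,'0'):1, (2,'1'):2}
print(dfa_to_regex(2, delta, start=1, accepting=[2], alphabet='01'))
```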

Levenshtein distance in regular expression

Is it possible to include Levenshtein distance in a regular expression query?
(Except by taking the union of single-edit variants, like this, to search for "hello" with Levenshtein distance 1:
.ello | h.llo | he.lo | hel.o | hell.
since this is stupid and unusable for larger Levenshtein distances.)
There are a couple of regex dialects out there with an approximate matching feature - namely the TRE library and the regex PyPI module for Python.
The TRE approximate matching syntax is described in the "Approximate matching settings" section at https://laurikari.net/tre/documentation/regex-syntax/. A TRE regex to match stuff within Levenshtein distance 1 of hello would be:
(hello){~1}
The regex module's approximate matching syntax is described at https://pypi.org/project/regex/ in the bullet point that begins with the text Approximate “fuzzy” matching. A regex for the regex module that matches stuff within Levenshtein distance 1 of hello would be:
(hello){e<=1}
Perhaps one or the other of these syntaxes will in time be adopted by other regex implementations, but at present I only know of these two.
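For example, with the regex module installed (pip install regex), a quick sketch of fuzzy matching (my own usage example, not taken from the module's docs) looks like this:

```python
# Sketch: fuzzy matching with the third-party `regex` module;
# {e<=1} allows at most one error (insertion, deletion, or substitution).
import regex

pattern = regex.compile(r'(?:hello){e<=1}')

for word in ['hello', 'helo', 'hallo', 'hxllo', 'heelllo']:
    m = pattern.fullmatch(word)
    print(word, '->', bool(m))
```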
You can generate the regex programmatically. I will leave that as an exercise for the reader, but for the output of this hypothetical function (given an input of "word") you want something like this string:
"^(?>word|wodr|wrod|owrd|word.|wor.d|wo.rd|w.ord|.word|wor.?|wo.?d|w.?rd|.?ord)$"
In English, first you try to match on the word itself, then on every possible single transposition, then on every possible single insertion, then on every possible single omission or substitution (can be done simultaneously).
The length of that string, given a word of length n, grows only polynomially (roughly quadratically) with n, and notably not exponentially.
Which is reasonable, I think.
You pass this to your regex generator (like in Ruby it would be Regexp.new(str)) and bam, you got a matcher for ANY word with a Damerau-Levenshtein distance of 1 from a given word.
(Damerau-Levenshtein distances of 2 are far more complicated.)
Note the use of the (?> non-backtracking (atomic group) construct, which means the order of the individual |'d expressions in that output matters.
I could not think of a way to "compact" that expression.
EDIT: I got it to work, at least in Elixir! https://github.com/pmarreck/elixir-snippets/blob/master/damerau_levenshtein_distance_1.exs
I wouldn't necessarily recommend this though (except for educational purposes) since it will only get you to distances of 1; a legit D-L library will let you compute distances > 1. Although since this is regex, it would probably work pretty fast once constructed (note that you should save the "compiled" regex somewhere since this code currently reconstructs it on EVERY comparison!)
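For comparison, here is a rough Python take on the same construction (my own sketch, not the linked Elixir code): generate one alternation covering the word itself, the transpositions, the insertions, and the deletions/substitutions, compile it once, and reuse it.

```python
# Sketch: build a regex matching anything within Damerau-Levenshtein
# distance 1 of `word`, in the same spirit as the pattern shown above.
import re

def dl1_regex(word: str) -> re.Pattern:
    n = len(word)
    alts = [re.escape(word)]
    # transpositions of adjacent characters
    for i in range(n - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        alts.append(re.escape(swapped))
    # single insertions ('.' stands for the inserted character)
    for i in range(n + 1):
        alts.append(re.escape(word[:i]) + '.' + re.escape(word[i:]))
    # single deletion or substitution ('.?' covers both at once)
    for i in range(n):
        alts.append(re.escape(word[:i]) + '.?' + re.escape(word[i + 1:]))
    return re.compile('^(?:' + '|'.join(alts) + ')$')

matcher = dl1_regex('word')
print(bool(matcher.match('word')))   # True (exact)
print(bool(matcher.match('wrod')))   # True (transposition)
print(bool(matcher.match('wordy')))  # True (insertion)
print(bool(matcher.match('wrd')))    # True (deletion)
print(bool(matcher.match('ward')))   # True (substitution)
print(bool(matcher.match('wrdo')))   # False (distance 2)
```

With a plain non-capturing group instead of an atomic one, the order of the alternatives no longer matters for a yes/no match, at the price of some extra backtracking.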
Is there any possibility to include Levenshtein distance in a regular expression query?
No, not in a sane way. Implementing - or using an existing - Levenshtein distance algorithm is the way to go.

Is it possible to calculate the edit distance between a regexp and a string?

If so, please explain how.
Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."
For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ than xyz is.
If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.
How can you find the shortest distance between a regexp and a non-matching string?
The above is the Damerau–Levenshtein distance algorithm.
You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.
If, however, you merely want efficient approximate searching, a less flexible but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.
If you mean the string with the smallest Levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the regex to a DFA yourself, then try to match and, whenever something fails, non-deterministically continue as if it had passed while keeping track of the number of differences. You could use A* search or something similar for this; it would be quite inefficient though (O(2^n) worst case).
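Roughly in that spirit, here is a sketch of my own (assuming the pattern has already been converted to a complete DFA, with a made-up dict encoding) that computes the minimum edit distance from a string to any string the DFA accepts, using a Dijkstra-style search over (characters consumed, DFA state) pairs rather than plain backtracking:

```python
# Sketch: minimum edit (Levenshtein) distance between a string and the
# language of a DFA, via Dijkstra over (characters consumed, DFA state).
# The DFA encoding (transition dict, accepting set) is an assumption.
import heapq

def edit_distance_to_dfa(s, start, delta, accepting, alphabet):
    n = len(s)
    dist = {(0, start): 0}
    heap = [(0, 0, start)]
    while heap:
        d, i, q = heapq.heappop(heap)
        if d > dist.get((i, q), float('inf')):
            continue
        if i == n and q in accepting:
            return d                      # cheapest way to consume all of s
        moves = []
        if i < n:
            moves.append((d + 1, i + 1, q))           # delete s[i]
            for a in alphabet:
                cost = 0 if a == s[i] else 1          # match or substitute
                moves.append((d + cost, i + 1, delta[(q, a)]))
        for a in alphabet:
            moves.append((d + 1, i, delta[(q, a)]))   # insert a character
        for nd, ni, nq in moves:
            if nd < dist.get((ni, nq), float('inf')):
                dist[(ni, nq)] = nd
                heapq.heappush(heap, (nd, ni, nq))
    return float('inf')                   # language is empty

# Example DFA for [0-9]{3}: states 0..3 count digits, state 4 is a dead state.
digits = '0123456789'
delta = {}
for q in range(5):
    for a in digits:
        delta[(q, a)] = q + 1 if q < 3 else 4
print(edit_distance_to_dfa('a23', 0, delta, {3}, digits))  # 1
print(edit_distance_to_dfa('ab3', 0, delta, {3}, digits))  # 2
```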