How to compute multiple related Levenshtein distances? - levenshtein-distance

Given two strings of equal length, the Levenshtein distance gives the minimum number of transformations necessary to get the second string from the first. However, I'd like to find a way to adjust the algorithm for multiple pairs of strings, given that they were all generated in the same way.

Reading the comments, it appears that this is the problem:
You are given a set of pairs of strings, all of the same length; each pair consists of an input to some function and the corresponding output. So, for the pair A,B, we know that f(A)=B. The goal is to reverse-engineer f() from a large set of A,B pairs.
Using Levenshtein distance on the entire set will, at most, tell you the maximum number of transformations that must take place.
A better start would be Hamming distance (modified to allow multiple characters) or Jaccard similarity, to identify which positions never change across all of the pairs. Then you are left only with the positions that do change.
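As a rough illustration of that first step (a sketch in Julia, the language used elsewhere on this page; the pair data here is made up):

string_pairs = [("abcde", "abxde"), ("qbcze", "qbxze"), ("mbcdn", "mbxdn")]

# Multi-character Hamming-style check: which positions are identical
# across every (input, output) pair?
len = length(first(first(string_pairs)))
unchanged = [all(p -> p[1][k] == p[2][k], string_pairs) for k in 1:len]
# => Bool[1, 1, 0, 1, 1]: only position 3 ever changes, so f() only
#    needs to be reverse-engineered at that position.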
This will fail if the letters shift.
To detect a shift, you want to use global alignment (Needleman-Wunsch). You will then see something like "ABCDE" => "xABCD", showing that between the input and the output there was a one-position shift.
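A compact global alignment with unit scores might look like this (a sketch; needleman_wunsch and its scoring scheme are my own choices, not a library function):

function needleman_wunsch(a::String, b::String)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    score = zeros(Int, m + 1, n + 1)
    score[:, 1] = 0:-1:-m          # gap penalty -1
    score[1, :] = 0:-1:-n          # match +1, mismatch -1
    for i in 2:m+1, j in 2:n+1
        score[i, j] = max(score[i-1, j-1] + (s[i-1] == t[j-1] ? 1 : -1),
                          score[i-1, j] - 1,
                          score[i, j-1] - 1)
    end
    # Trace back from the bottom-right corner to build the aligned strings.
    ra, rb = Char[], Char[]
    i, j = m + 1, n + 1
    while i > 1 || j > 1
        if i > 1 && j > 1 && score[i, j] == score[i-1, j-1] + (s[i-1] == t[j-1] ? 1 : -1)
            pushfirst!(ra, s[i-1]); pushfirst!(rb, t[j-1]); i -= 1; j -= 1
        elseif i > 1 && score[i, j] == score[i-1, j] - 1
            pushfirst!(ra, s[i-1]); pushfirst!(rb, '-'); i -= 1
        else
            pushfirst!(ra, '-'); pushfirst!(rb, t[j-1]); j -= 1
        end
    end
    return String(ra), String(rb)
end

needleman_wunsch("ABCDE", "xABCD")   # => ("-ABCDE", "xABCD-"): the shift shows up as two end gaps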
Overall, I feel that Levenshtein distance will do very little to help you get at the original algorithm.

Related

How to find the row in a matrix which is closest to a test row?

I am currently trying to come up with a smart way to store my data, such that I can easily search, sort and insert.
My data consists of a std::pair<vector<int>,string> which is stored in a std::vector, forming a sort of 2D matrix.
Each vector holds a number sequence that is paired with a string.
My problem is how to test against this structure.
A test consists of comparing a test vector, which might be the same length as or shorter than those stored in the matrix, against the stored rows. How do I find out which string the test vector matches best?
How do I search for that efficiently?
Some ideas I have had for solving the problem, and the problems with them:
One idea was to compute sub-sums, depending on the length of the test vector, and then see which sub-sum best matches the sum of the test vector.
Problem: I am looking for a specific pattern, and the same sum could be produced by a different pattern.
Another idea was to make a copy of the matrix, sort it column-wise, search at each index, and keep track of which string matched best.
Problem: this requires sorting all the columns. The matrix is much larger than described here (it has around 1000 columns), and one sort per search seems too expensive given the time I can spend on it. Insertion also needs to be supported, so something efficient would be appreciated.

Record all optimal sequence alignments when calculating Levenshtein distance in Julia

I'm computing the Levenshtein distance with the Wagner–Fischer algorithm in Julia.
It is easy to get the optimal value, but a little harder to get the optimal operation sequence (insertions, deletions, substitutions) while backtracking from the bottom-right corner of the matrix.
I can record the pointer information for each d[i][j], but a cell may offer up to three directions to go back: d[i-1][j-1] for substitution, d[i-1][j] for deletion and d[i][j-1] for insertion. So I'm trying to get every combination of operations that gives the optimal Levenshtein distance.
It seems I can store one operation sequence in one array, but I don't know the total number of combinations or their lengths in advance, so it is hard to define arrays up front to store the operation sequences during the enumeration. How can I create new arrays while keeping the earlier ones? Or should I use a DataFrame?
If you implement the Wagner-Fischer algorithm, at some point you choose the minimum over three alternatives (see the Wikipedia pseudo-code). At that point, you save the chosen alternative in another matrix, using a statement like:
c[i,j] = indmin([d[i-1,j] + 1,     # 1 = deletion
                 d[i,j-1] + 1,     # 2 = insertion
                 d[i-1,j-1] + 1])  # 3 = substitution (the mismatch branch)
# indmin returns the index of the minimum element (renamed argmin in Julia 1.0).
Now c[i,j] contains 1, 2 or 3 according to whether the step was a deletion, an insertion or a substitution.
At the end of the calculation you have the d matrix element achieving the minimum distance; you then follow the c matrix backwards and read off the action at each step. Keeping track of i and j lets you read off the exact substitution, by looking at which element was in string1 at i and in string2 at j at the current step. A matrix like c cannot be avoided, because at the end of the algorithm the information about the intermediate choices (made by min) would otherwise be lost.
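Here is a compact sketch of the whole scheme (the function name and unit costs are my own choices; the diagonal move is free on a match):

function edit_path(s1::String, s2::String)
    a, b = collect(s1), collect(s2)            # work on character vectors
    m, n = length(a), length(b)
    d = zeros(Int, m + 1, n + 1)               # d[i+1, j+1] = distance of a[1:i] vs b[1:j]
    c = zeros(Int, m + 1, n + 1)               # 1 = deletion, 2 = insertion, 3 = diagonal
    d[:, 1] = 0:m
    d[1, :] = 0:n
    for i in 2:m+1, j in 2:n+1
        costs = [d[i-1, j] + 1,                             # delete a[i-1]
                 d[i, j-1] + 1,                             # insert b[j-1]
                 d[i-1, j-1] + (a[i-1] == b[j-1] ? 0 : 1)]  # substitute or match
        d[i, j], c[i, j] = findmin(costs)
    end
    # Backtrace from the bottom-right corner, collecting the operations.
    ops = String[]
    i, j = m + 1, n + 1
    while i > 1 || j > 1
        if i > 1 && (j == 1 || c[i, j] == 1)
            pushfirst!(ops, "delete $(a[i-1])"); i -= 1
        elseif j > 1 && (i == 1 || c[i, j] == 2)
            pushfirst!(ops, "insert $(b[j-1])"); j -= 1
        else
            pushfirst!(ops, a[i-1] == b[j-1] ? "keep $(a[i-1])" :
                            "substitute $(a[i-1]) -> $(b[j-1])")
            i -= 1; j -= 1
        end
    end
    return d[m+1, n+1], ops
end

edit_path("kitten", "sitting") then returns 3 together with the operation list. To enumerate all optimal sequences, as the question asks, store every index attaining the minimum in c[i,j] (not just the first) and branch on each stored choice during the backtrace.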
I'm not sure that I got your question, but anyway: vectors in Julia are dynamic data structures, so you can always grow them using the appropriate function, e.g. push!(), append!(), prepend!(); it is also possible to reshape() the result vector into an array of the desired size.
But one particular approach for the above case could be built on a sparse() matrix:
# Note: written for the Julia 0.x era of this answer (ASCIIString was removed
# in Julia 1.0, where you would use String and `using SparseArrays` instead).
import Base.zero
Base.zero(::Type{ASCIIString}) = ""   # sparse storage needs a zero element for the type

module GSparse
export insertion, deletion, substitute, result

s = sparse(ASCIIString[])             # grows as edit operations are recorded

# A deletion advances the column count only.
deletion(newval::ASCIIString) = begin
    global s
    s.n += 1
    push!(s.colptr, last(s.colptr))
    s[s.m, s.n] = newval
end

# An insertion advances the row count only.
insertion(newval::ASCIIString) = begin
    global s
    s.m += 1
    s[s.m, s.n] = newval
end

# A substitution advances both the row and the column count.
substitute(newval::ASCIIString) = begin
    global s
    s.n += 1
    s.m += 1
    push!(s.colptr, last(s.colptr))
    s[s.m, s.n] = newval
end

# Collect the recorded strings, reversing them into forward order.
result() = begin
    global s
    ret = zeros(ASCIIString, size(s))
    le = length(s)
    for i in 1:le
        ret[le - i + 1] = s[i]
    end
    ret
end
end
using GSparse
insertion("test");
insertion("testo");
insertion("testok");
substitute("1estok");
deletion("1stok");
result()
I like the approach because for large texts you could have many zero elements. Also, I fill the data structure going forward and create the result by reversing it.

Levenshtein program variance?

I have a C++ task and I'm not sure how to approach this problem.
So you have a startString and an ENDSTRING. The point is to transform the startString into ENDSTRING with the given set of operations:
insert character
delete character
replace substring from dictionary
move substring (to the left)
reverse substring
AND the number of operations you use should be as low as possible.
So I searched Google and found that this is the string reconstruction problem.
This is the edit distance, and in particular the Levenshtein distance algorithm.
BUT the Levenshtein algorithm does NOT give you the steps you make - it gives you only the number of steps. I have to write an algorithm that reconstructs the startString into the ENDSTRING with as few operations as possible, and that writes a file describing the steps taken.
Can you please guide me on which algorithm I should use? The Levenshtein one only gives you the number of steps, but I need a list of the actual steps as well.
Thanks
Basically, you would have to modify the Levenshtein algorithm a bit. As it is an example of Dynamic Programming, the calculation of one specific state of the state space corresponds to a specific choice among alternatives. To my understanding, you have two options:
1.) use some auxiliary data structure where for each state you store the choice to which its value corresponds;
2.) use no auxiliary data structure; once the state space is evaluated, use backtracking to see which choices were possible.
Option 2.) might result in less code, but option 1.) might be easier to understand.
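As a rough sketch of option 2.) (in Julia for brevity, covering only the three basic Levenshtein operations, not the question's extended move/reverse ones; d is assumed to be the already-filled (m+1) x (n+1) distance matrix):

# Option 2: no auxiliary structure; re-derive the choices by walking the
# filled distance matrix d backwards. Assumes ASCII strings (1 byte per char).
function backtrack_steps(a::String, b::String, d::Matrix{Int})
    steps = String[]
    i, j = length(a) + 1, length(b) + 1
    while i > 1 || j > 1
        if i > 1 && j > 1 && d[i, j] == d[i-1, j-1] + (a[i-1] == b[j-1] ? 0 : 1)
            # Diagonal move: a match costs nothing, otherwise it is a replacement.
            a[i-1] == b[j-1] || pushfirst!(steps, "replace $(a[i-1]) with $(b[j-1])")
            i -= 1; j -= 1
        elseif i > 1 && d[i, j] == d[i-1, j] + 1
            pushfirst!(steps, "delete $(a[i-1])"); i -= 1
        else
            pushfirst!(steps, "insert $(b[j-1])"); j -= 1
        end
    end
    return steps
end

Each recovered step can then be written out line by line to the required file.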

Extract the upper and lower boundaries from a list (vector) of 2d coordinates

My program contains polygons which have the form of a vector containing points (2-dimensional double coordinates, stored in a self-made structure). I'm looking for a quick way of finding the smallest square containing my polygon (i.e. knowing the maximal and minimal coordinates of all the points).
Is there a quicker way than just parsing all the points and storing the minimum and maximum values?
The algorithm you are describing is straightforward: iterate over all your points and find the minimum and maximum for each coordinate. This is an O(n) algorithm, n being the number of points you have.
You can't do better, since you will need to check at least all your points once, otherwise the last one could be outside the square you found.
Now, the complexity is at best O(n), so you just have to minimize the constant factors, and in this case they are already pretty small: only one loop over your vector, looking for two maxima and two minima.
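That single pass might look like this (a sketch in Julia rather than the asker's C++; Point stands in for the self-made structure):

struct Point
    x::Float64
    y::Float64
end

# One loop over the points, tracking two minima and two maxima.
function bounding_box(points::Vector{Point})
    xmin = xmax = points[1].x
    ymin = ymax = points[1].y
    for p in points
        xmin = min(xmin, p.x); xmax = max(xmax, p.x)
        ymin = min(ymin, p.y); ymax = max(ymax, p.y)
    end
    return (xmin, xmax, ymin, ymax)   # the smallest enclosing axis-aligned box
end

bounding_box([Point(0.0, 1.5), Point(-2.0, 3.0), Point(4.0, -1.0)])  # => (-2.0, 4.0, -1.0, 3.0)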
You can either iterate through all the points and find the max and min values, or do some preprocessing, for example by storing your points in a treap (http://en.wikipedia.org/wiki/Treap).
Without some preprocessing, there is no way to do better than just iterating over all the points.
I'm not sure there can be any faster way to find the min and max values in an array than linear time. The only 'optimization' I can think of is to find these values on one of the other occasions when you're iterating over the array (filling it / applying a function to all points), and then to perform checks on every data update.

Compare similarity algorithms

I want to use string similarity functions to find corrupted data in my database.
I came upon several of them:
Jaro,
Jaro-Winkler,
Levenshtein,
Euclidean and
Q-gram,
I wanted to know what is the difference between them and in what situations they work best?
Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we determine if they're numerically comparable.
From Wikipedia, Jaro-Winkler:
In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly[citation needed] used in the area of record linkage (duplicate detection). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
Levenshtein distance:
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
Euclidean distance:
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as the Pythagorean metric.
And Q- or n-gram encoding:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected from a text or speech corpus.

The two core advantages of n-gram models (and algorithms that use them) are relative simplicity and the ability to scale up: by simply increasing n, a model can be used to store more context with a well-understood space-time tradeoff, enabling small experiments to scale up very efficiently.
The trouble is that these algorithms solve different problems, and have different applicability within the space of all possible algorithms for getting at the longest common subsequence in your data, or at a usable metric derived from it. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.
Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: use checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.
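For example, a per-row checksum makes corruption detection exact instead of heuristic (a sketch using Julia's CRC32c standard library; the field contents here are made up):

using CRC32c   # standard library

row = "some database field"
stored_crc = crc32c(row)     # store this alongside the row when writing

# On read-back, a mismatching checksum flags corruption exactly.
is_corrupted(value, crc) = crc32c(value) != crc

is_corrupted(row, stored_crc)                      # => false: data intact
is_corrupted("some databose field", stored_crc)    # => true: corrupted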
String similarity helps in a lot of different ways. For example:
Google's "did you mean" results are calculated using string similarity.
String similarity is used to correct OCR errors.
String similarity is used to correct keyboard entry errors.
String similarity is used to find the best-matching alignment of two DNA sequences in bioinformatics.
But one size does not fit all. Every string similarity algorithm is designed for a specific use, though most of them are similar. For example, Levenshtein distance is about how many characters you change to make two strings equal:
kitten → sitten
Here the distance is 1 character change. You may give different weights to deletion, addition and substitution. For example, OCR and keyboard error models give a lower weight to some changes: with OCR, some characters look very similar to others; on a keyboard, some characters sit very near each other. Bioinformatics string similarity allows a lot of insertions.
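A weighted variant is a small change to the standard recurrence (a sketch; the weights and the OCR cost table are made-up placeholders):

# Levenshtein distance with per-operation weights; subcost can encode
# domain knowledge such as OCR confusions being cheap.
function weighted_levenshtein(a::String, b::String;
                              wdel = 1.0, wins = 1.0,
                              subcost = (x, y) -> x == y ? 0.0 : 1.0)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    d = zeros(Float64, m + 1, n + 1)
    d[:, 1] = (0:m) .* wdel
    d[1, :] = (0:n) .* wins
    for i in 2:m+1, j in 2:n+1
        d[i, j] = min(d[i-1, j] + wdel,                      # deletion
                      d[i, j-1] + wins,                      # insertion
                      d[i-1, j-1] + subcost(s[i-1], t[j-1])) # substitution/match
    end
    return d[m+1, n+1]
end

# Hypothetical OCR model: visually confusable pairs cost little.
ocr_sub(x, y) = x == y ? 0.0 :
                ((x, y) in (('O','0'), ('0','O'), ('l','1'), ('1','l')) ? 0.2 : 1.0)

weighted_levenshtein("O1iver", "Oliver"; subcost = ocr_sub)   # => 0.2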
As your second quote says, the "Jaro–Winkler distance metric is designed and best suited for short strings such as person names".
Therefore you should keep your own problem in mind.
I want to use string similarity functions to find corrupted data in my database.
How is your data corrupted? Is it user error, similar to keyboard input errors? Is it similar to OCR errors? Or is it something else entirely?