Record all optimal sequence alignments when calculating Levenshtein distance in Julia - combinations

I'm implementing the Levenshtein distance with the Wagner–Fischer algorithm in Julia.
Getting the optimal value is easy, but getting an optimal operation sequence (insertions, deletions, substitutions) while backtracking from the bottom-right corner of the matrix is harder.
I can record pointer information for each d[i][j], but a cell may point back in up to three directions: d[i-1][j-1] for substitution, d[i-1][j] for deletion, and d[i][j-1] for insertion. So I'm trying to get all combinations of operation sequences that give the optimal Levenshtein distance.
It seems I can store one operation sequence per array, but I don't know the total number of combinations or their lengths in advance, so it's hard to predefine an array to hold them during enumeration. How can I create new arrays while keeping the earlier ones? Or should I use a DataFrame?

If you implement the Wagner–Fischer algorithm, at some point you choose the minimum over three alternatives (see the Wikipedia pseudo-code). At that point, save which alternative was chosen in a second matrix, using a statement like:
c[i,j] = argmin([d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1] + (string1[i] == string2[j] ? 0 : 1)])
# argmin returns the index of the minimum element in a collection
# (it was called indmin in older Julia versions). Note that the
# substitution term costs nothing when the characters already match.
Now c[i,j] contains 1, 2 or 3, corresponding to deletion, insertion or substitution.
At the end of the calculation, the final element of d holds the minimum distance; you then follow the c matrix backwards and read off the action at each step. Keeping track of i and j lets you recover the exact substitution by looking at which character string1 has at position i and string2 has at position j in the current step. A matrix like c cannot be avoided, because by the end of the algorithm the information about the intermediate choices (made by min) would otherwise be lost.
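Since the question asks for every optimal operation sequence, note that a single entry in c records only one of the ties: whenever two or three alternatives reach the same minimum, each of them starts a distinct optimal path. A minimal sketch of that enumeration (illustrative Python with my own function names; the same logic ports directly to Julia) rebuilds the d matrix and then recursively follows every predecessor that achieves the minimum:

```python
def all_optimal_scripts(s1, s2):
    m, n = len(s1), len(s2)
    # standard Wagner-Fischer table
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match

    def walk(i, j):
        # return every optimal operation list leading from (0,0) to (i,j)
        if i == 0 and j == 0:
            return [[]]
        paths = []
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            paths += [p + [("delete", s1[i - 1])] for p in walk(i - 1, j)]
        if j > 0 and d[i][j] == d[i][j - 1] + 1:
            paths += [p + [("insert", s2[j - 1])] for p in walk(i, j - 1)]
        if i > 0 and j > 0:
            sub = 0 if s1[i - 1] == s2[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + sub:
                op = "match" if sub == 0 else "substitute"
                paths += [p + [(op, s1[i - 1], s2[j - 1])]
                          for p in walk(i - 1, j - 1)]
        return paths

    return d[m][n], walk(m, n)
```

Each returned script is one complete edit sequence in left-to-right order; collecting them into a growing list of lists sidesteps the problem of not knowing their number or lengths in advance (beware that the count can grow exponentially for long, repetitive strings).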

I'm not sure I fully understood your question, but in any case, vectors in Julia are dynamic data structures, so you can always grow them with the appropriate function, e.g. push!(), append!(), prepend!(); it is also possible to reshape() the result vector into an array of the desired size.
One particular approach for the case above could be built on a sparse() matrix:
import Base.zero
Base.zero(::Type{String}) = ""   # so missing entries of a sparse String matrix read as ""

module GSparse
using SparseArrays
export insertion, deletion, substitute, result

# Coordinate lists for a growing sparse matrix of intermediate strings.
const rows = Int[]
const cols = Int[]
const vals = String[]
m = 0   # current row    (advanced by insertions and substitutions)
n = 0   # current column (advanced by deletions and substitutions)

deletion(newval::String) = begin
    global n += 1
    push!(rows, max(m, 1)); push!(cols, n); push!(vals, newval)
end

insertion(newval::String) = begin
    global m += 1
    push!(rows, m); push!(cols, max(n, 1)); push!(vals, newval)
end

substitute(newval::String) = begin
    global m += 1; global n += 1
    push!(rows, m); push!(cols, n); push!(vals, newval)
end

result() = begin
    s = sparse(rows, cols, vals, max(m, 1), max(n, 1))
    ret = fill("", size(s))
    le = length(s)
    for i in 1:le
        ret[le - i + 1] = s[i]   # reverse, so the final state comes first
    end
    ret
end
end # module

using .GSparse
insertion("test");
insertion("testo");
insertion("testok");
substitute("1estok");
deletion("1stok");
result()
I like this approach because, for large texts, you could have many zero elements. Also, I fill the data structure in a forward pass and create the result by reversing.

Related

How to find the row in a matrix which is closest to a test row?

I am currently trying to come up with a smart way to store my data such that I can easily search, sort, and insert.
My data consist of std::pair<std::vector<int>, std::string> objects stored in a std::vector, forming a sort of 2D matrix.
Something like this:
Each vector holds a number sequence that matches a string.
My problem is how it should be tested.
A test consists of comparing against a test vector, which might be the same length as, or shorter than, those stored in the matrix. How do I find out which string the test vector matches best?
How do I search for that efficiently?
Some ideas on how to solve the problem, and some of their weaknesses:
One idea was to compute sub-sums, depending on the length of the test vector, and then see which sub-sum best matches the sum of the test vector.
Problem: I am looking for the same pattern, but the same sum could occur for a different pattern.
Another idea was to make a copy of the matrix, sort it column-wise, run a search per index, and keep track of which string matches best.
Problem: this requires sorting all the columns. The matrix is far larger than shown above (around 1000 columns), and one sort per search seems too expensive for the time I would spend on it. An insert operation also needs to be supported, so something efficient would be appreciated.
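A brute-force baseline for the lookup described above can be sketched as follows (illustrative Python with hypothetical names; sum of absolute differences is chosen as the match score, which the question leaves open). A shorter test vector is slid across every offset of each stored row:

```python
def best_match(matrix, test):
    """matrix: list of (row_values, label) pairs; test may be shorter than a row."""
    best = (float("inf"), None)  # (score, label), lower score is better
    for values, label in matrix:
        # slide the test vector over every valid offset of this row
        for off in range(len(values) - len(test) + 1):
            score = sum(abs(a - b) for a, b in zip(values[off:], test))
            best = min(best, (score, label))
    return best[1]
```

With ~1000 columns this is O(rows x offsets x len(test)) per query, which is the cost any smarter index (e.g. sorting once, or a tree over prefixes) would have to beat.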

How to efficiently add and remove vector values C++

I am trying to efficiently calculate the averages from a vector.
I have a matrix (vector of vectors) where:
row: the days I am going back (250)
column: the types of things I am calculating the average of (10,000 different things)
Currently I am using .push_back() to append the newest value and erase() to remove the oldest one. Because erasing from the front shifts every remaining value, my code is very slow.
I am thinking of a method based on substitution, but I have a hard time implementing the idea, as all the values are ordered (i.e. I need to remove the oldest value, and the value I add / substitute will be the newest).
Below is my code so far.
Any ideas for a solution or guides for the right direction will be much appreciated.
//declaration
vector<vector<float>> vectorOne;
//initialization
vectorOne.assign(250, vector<float>(10000, 0.0f));
//This is the slow method
vectorOne[column].push_back(1); //add newest value
vectorOne[column].erase(vectorOne[column].begin()); //remove oldest value (shifts all later elements)
You probably need a different data structure.
The problem sounds like a queue: you add at the back and take from the front. With a real-world queue everyone then shuffles up a step; with a computer queue we can instead use a circular buffer (you do need a reasonable bound on the maximum queue length).
I suggest implementing your own on top of a plain C array first, then using the STL version when you've understood the principle.
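The circular-buffer idea can be sketched in a few lines (illustrative Python with a hypothetical class name; the same layout works over a plain C array as suggested above). Keeping a running sum makes each update and each average O(1) instead of O(window):

```python
class RingAverage:
    """Fixed-size circular buffer with an O(1) running average.

    Assumes the window is treated as full from the start (pre-filled
    with zeros), matching the question's zero-initialized matrix.
    """

    def __init__(self, capacity):
        self.buf = [0.0] * capacity   # fixed-size circular storage
        self.head = 0                 # index of the oldest value
        self.total = 0.0              # running sum, updated in O(1)

    def push(self, value):
        # overwrite the oldest value and advance the head
        self.total += value - self.buf[self.head]
        self.buf[self.head] = value
        self.head = (self.head + 1) % len(self.buf)

    def average(self):
        return self.total / len(self.buf)
```

One such buffer per column replaces the push_back/erase pair: no elements are ever shifted, the newest value simply overwrites the oldest in place.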

Finding the index position of the nearest value in a Fortran array

I have two sorted arrays: one containing factors (array a) that, when multiplied with values from the other array (array b), yield the desired value:
a(idx1) * b(idx2) = value
With idx2 known, I would like to find the idx1 of a that provides the factor necessary to get as close to value as possible.
I have looked at some different algorithms (like this one, for example), but I feel like they would all be subject to potential problems with floating point arithmetic in my particular case.
Could anyone suggest a method that would avoid this?
If I understand correctly, this expression
minloc(abs(a-value/b(idx2)))
will return the index into a of the first element that minimises the difference. I expect that the compiler will write code to scan all the elements of a, so this may not execute faster than a search that takes advantage of the knowledge that a and b are both sorted. In compensation, it is much quicker to write and, I expect, to debug.
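For completeness, the sorted-aware search alluded to above can be sketched with a binary search (illustrative Python using the standard bisect module and a function name of my own; the Fortran equivalent would be a hand-written binary search loop). The target passed in corresponds to value/b(idx2):

```python
import bisect

def nearest_index(a, target):
    """a must be sorted ascending; returns the 0-based index of the
    element of a closest to target, in O(log n)."""
    pos = bisect.bisect_left(a, target)  # first element >= target
    if pos == 0:
        return 0
    if pos == len(a):
        return len(a) - 1
    # compare the two neighbours around the insertion point
    return pos if a[pos] - target < target - a[pos - 1] else pos - 1
```

Comparing the two neighbours of the insertion point by absolute difference also sidesteps most of the floating-point worries: no equality test against the array values is ever made.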

Efficiently searching arrays in FORTRAN

I am trying to store the stiffness matrix in FORTRAN in sparse format to save memory, i.e. I am using three vectors of non-zero elements (irows, icols, A). After finding out the size of these arrays the next step is to insert the values in them. So I am using gauss points, i.e. for each gauss point I am going to find out the local stiffness matrix and then insert this local stiffness matrix in the Global (irows, icols, A) one.
The main problem with this insertion is that every time we have to check whether the new entry already exists in the global arrays: if it exists, add the new value to the old one; if not, append it to the end. That means searching the whole array on every insertion, and if the arrays (irows, icols, A) are large, this search is computationally very expensive.
Can anyone suggest a better way of inserting the local stiffness matrix for each gauss point into the global stiffness matrix?
I am fairly sure that this is a well known problem in FEM analysis - I found a reference to it in this scipy documentation, but of course the principles are language independent. Basically you should create your matrix in the format you have, but instead of searching the matrix to see whether an entry already exists, just assume that it doesn't. This means you will end up with duplicate entries, which need to be added together to get the correct value.
Once you have constructed your matrix, you will usually convert it to some more efficient form for solving (e.g. CSR); the exact format may be determined by the sparse solver you are using. During this conversion, duplicate entries should get added together, and some sparse matrix libraries will do this for you. I know that scipy does, and many of its internal routines are written in fortran, so you might be able to use one of them (they are all open source). Or you could check whether anything suitable is on netlib.
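The assemble-then-merge step can be shown in miniature (illustrative Python; the dictionary plays the role of the duplicate-summing pass that a CSR conversion performs, and the function name is my own). Every local contribution is appended blindly; duplicates are combined only once, at the end:

```python
def assemble(triplets):
    """triplets: iterable of (row, col, value), duplicates allowed.
    Returns merged (irows, icols, A) with duplicate entries summed."""
    merged = {}
    for i, j, v in triplets:
        merged[(i, j)] = merged.get((i, j), 0.0) + v   # sum duplicates
    # split back into the (irows, icols, A) form used in the question
    keys = sorted(merged)
    irows = [i for i, _ in keys]
    icols = [j for _, j in keys]
    A = [merged[k] for k in keys]
    return irows, icols, A
```

Each insertion is now O(1) amortized, and the single merge at the end replaces the per-insertion linear search through (irows, icols, A).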
If you use a data structure that is pre-sorted it would be very efficient to search it. Either as your primary data structure or as an auxiliary data structure. You want one that you can insert another entry into the middle. For example, a binary search tree (http://en.wikipedia.org/wiki/Binary_search_tree).

How to compute multiple related Levenshtein distances?

Given two strings of equal length, Levenshtein distance finds the minimum number of transformations necessary to turn the first string into the second. However, I'd like to find a way to adapt the algorithm to multiple pairs of strings, given that they were all generated in the same way.
Reading the comments, it appears that this is the problem:
You are given a set of pairs of strings, all the same length and each pair is the input to some function paired with the output from the function. So, for the pair A,B, we know that f(A)=B. The goal is to reverse engineer f() with a large set of A,B pairs.
Using Levenshtein distance on the entire set will, at most, tell you the maximum number of transformations that must take place.
A better start would be Hamming distance (modified to allow multiple characters) or Jaccard similarity, to identify the positions in the strings that do not change at all across the pairs. Then you are left only with those that do change.
This will fail if the letters shift.
To detect shift, you want to use global alignment (Needleman–Wunsch). You will then see something like "ABCDE" => "xABCD", showing that between the input and the output the characters were shifted by one position.
Overall, I feel that Levenshtein distance will do very little to help you get at the original algorithm.
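The first step proposed above, finding the positions that never change across the pairs, can be sketched briefly (illustrative Python with a function name of my own; equal-length strings assumed, as in the question):

```python
def stable_positions(pairs):
    """pairs: list of (input, output) strings of equal length.
    Returns the sorted positions where no pair ever differs."""
    length = len(pairs[0][0])
    stable = set(range(length))
    for a, b in pairs:
        # keep only the positions that also survive this pair
        stable &= {i for i in range(length) if a[i] == b[i]}
    return sorted(stable)
```

Whatever positions remain after this pass are the only part of f() worth modelling; a shift of the kind detected by Needleman-Wunsch would show up here as very few stable positions despite near-identical strings.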