Levenshtein distance with substitution, deletion and insertion count

There's a great blog post on Levenshtein distance here: https://davedelong.com/blog/2015/12/01/edit-distance-and-edit-steps/. I'm trying to implement it so that it also returns the counts of substitutions, deletions and insertions along with the Levenshtein distance. I'm just running a sanity check on my algorithm.
def get_levenshtein_w_counts(s1: str, s2: str):
    row_dim = len(s1) + 1  # +1 for empty string
    height_dim = len(s2) + 1
    # tuple = [ins, del, subs]
    # Moving across row is insertion
    # Moving down column is deletion
    # Moving diagonal is sub
    matrix = [[[n, 0, 0] for n in range(row_dim)] for m in range(height_dim)]
    for i in range(1, height_dim):
        matrix[i][0][1] = i
    for y in range(1, height_dim):
        for x in range(1, row_dim):
            left_scores = matrix[y][x - 1].copy()
            above_scores = matrix[y - 1][x].copy()
            diagonal_scores = matrix[y - 1][x - 1].copy()
            scores = [sum_list(left_scores), sum_list(diagonal_scores), sum_list(above_scores)]
            min_idx = scores.index(min(scores))
            if min_idx == 0:
                matrix[y][x] = left_scores
                matrix[y][x][0] += 1
            elif min_idx == 1:
                matrix[y][x] = diagonal_scores
                matrix[y][x][2] += (s1[x-1] != s2[y-1])
            else:
                matrix[y][x] = above_scores
                matrix[y][x][1] += 1
    return matrix[-1][-1]
So, according to the blog post, you make a matrix where the rows correspond to the first word plus an empty string and the columns to the second word plus an empty string, and you store the edit distance at each cell. For each cell you take the smallest of the values to the left, above and on the diagonal. If the minimum is the diagonal then you know you're just adding 1 substitution, if the minimum is from the left then you're just adding 1 insertion, and if the minimum is from above then you're just adding 1 deletion.
I think I did something wrong, because get_levenshtein_w_counts("Frank", "Fran") returned [3, 2, 2].

The problem was that Python assigns objects by reference, so I should have been copying the lists into these variables rather than keeping direct references into the matrix.
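To illustrate the aliasing described above, here is a small sketch (added for illustration, not part of the original post) of the difference between binding another name to a list and copying it:

scores = [1, 0, 0]
alias = scores               # same list object, not a copy
alias[0] += 1
print(scores)                # [2, 0, 0] -- mutating the alias changed the original

independent = scores.copy()  # a separate list
independent[0] += 1
print(scores)                # still [2, 0, 0]
print(independent)           # [3, 0, 0]

In the algorithm above, taking left_scores = matrix[y][x - 1].copy() (rather than a bare assignment) keeps the += updates from corrupting the neighbouring cells.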

Related

Creating Huge Sparse Matrices in python

I have been using normal matrices from numpy to store a Matrix for a physics project. The size of the matrix is determined by the physical system.
So, for instance, if the system has parameters:
L = 4 and N = 2, then the matrix is of dimension 4C2 = 6, so the matrix is a 6x6 matrix.
This is fine, except that I now need larger sizes, e.g. 20C10 = 184,756. The required matrix is then a 184756x184756 matrix, and when I try to create an empty matrix of this size I get a memory error (with 16 GB of RAM).
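(For scale: a dense 184756x184756 array of 64-bit floats needs about 184756^2 * 8 bytes, roughly 273 GB, so it cannot possibly fit in 16 GB; only the nonzero entries can realistically be stored.)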
The resulting matrix is mostly just diagonal and off-diagonal terms, so there is a huge number of zeros in the large matrices. Hence sparse matrices seem like the correct approach.
I have tried to get this to work by looking at other answers and by experimenting with the Python libraries myself, but to no avail.
Below is the code for my normal matrix:
def HamGen(L,N,delta,J):
    """
    Will Generate the hamiltonian matrix,
    Takes parameters:
    L : Number of sites
    N : Number of spin downs
    delta : anistropy
    Each term is gotten by getting H(i,j) = <Set(i)|H|Set(j)>
    The term will be a number
    Where H is an operator that acts on elements of the set
    """
    D = BS.dimension(L,N) # Gets the dimension of the matrix, i.e NxN matrix
    Hamiltonian = np.zeros((D,D)) # Creates empty matrix
    count1 = 0
    Set = BS.getSet(L,N) # The set of states to construct the hamiltonian
    for alpha in Set: #loop through the set (i)
        count2 = 0
        for beta in Set: # j
            """
            Compute ab = <alpha|Hamiltonian|beta>
            Then let Hamiltonian[a][b] = ab
            """
            if (alpha == beta):
                for i in range(L-1):
                    # Sz is just a function
                    Hamiltonian[count1][count2] += (J*delta*Sz(beta,i)*Sz(beta,i+1))
            b = check(alpha,beta)
            if b:
                del b[0]
                for j in b:
                    Hamiltonian[count1][count2] += (J*0.5*(Sp(beta,j)*Sm(beta,j+1) + Sm(beta,j)*Sp(beta,j+1)))
            count2 += 1
        count1 += 1
    return (np.asmatrix(Hamiltonian))
I mostly just need to know how to make the matrix without having to use as much memory, and then how to put the terms I calculate into the matrix.
Here is my attempt to make the matrix as a sparse matrix.
def SPHamGen(L,N,delta):
    """
    Will Generate the hamiltonian matrix,
    Takes parameters:
    L : Number of sites
    N : Number of spin downs
    delta : anistropy
    """
    start = timeit.default_timer()
    D = BS.dimension(L,N)
    Ham = sp.coo_matrix((D,D))
    print Ham
    #data = ([0])*D
    count1 = 0
    Set = BS.getSet(L,N)
    data = ([0])*(D*D)
    rows = ([0])*(D*D)
    cols = ([0])*(D*D)
    for alpha in Set:
        count2 = 0
        for beta in Set:
            """
            Compute ab = <alpha|Hamiltonian|beta>
            Then let Hamiltonian[a][b] = ab
            """
            if (alpha == beta):
                for i in range(L-1):
                    #Hamiltonian[count1][count2] += (J*delta*Sz(beta,i)*Sz(beta,i+1))
                    data[count2] += (J*delta*Sz(beta,i)*Sz(beta,i+1))
                    rows[count2] = count1
                    cols[count2] = count2
            b = check(alpha,beta)
            if b:
                del b[0]
                for j in b:
                    #Hamiltonian[count1][count2] += (J*0.5*(Sp(beta,j)*Sm(beta,j+1) + Sm(beta,j)*Sp(beta,j+1)))
                    data[count2] += (J*0.5*(Sp(beta,j)*Sm(beta,j+1) + Sm(beta,j)*Sp(beta,j+1)))
                    rows[count2] = count1
                    cols[count2] = count2
            count2 += 1
        count1 += 1
    Ham = Ham + sp.coo_matrix((data,(rows,cols)), shape = (D,D))
    time = (timeit.default_timer() - start)
    print "\n"+str(time) +"s to calculate H"
    #return Ham
    return sparse.csr_matrix(Ham)
Thanks, Phil.
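For reference, here is a minimal sketch (added for illustration, not part of the original question; the entry list and the build function are placeholders) of the usual way to construct a scipy COO matrix: collect only the nonzero (row, col, value) triplets in plain Python lists, hand them to coo_matrix in one call, then convert to CSR.

from scipy import sparse

def build_sparse(D, entries):
    # entries yields (i, j, value) for the nonzero terms only
    rows, cols, data = [], [], []
    for i, j, value in entries:
        if value != 0.0:
            rows.append(i)
            cols.append(j)
            data.append(value)
    H = sparse.coo_matrix((data, (rows, cols)), shape=(D, D))
    return H.tocsr()  # CSR is convenient for later arithmetic

# hypothetical usage: a 184756 x 184756 matrix with only four stored entries
entries = [(0, 0, 1.5), (0, 1, -0.5), (1, 0, -0.5), (1, 1, 2.0)]
H = build_sparse(184756, entries)
print(H.nnz)  # 4 stored values instead of ~34 billion zeros

Note that coo_matrix sums duplicate (i, j) entries when the matrix is converted, which plays the same role as the += accumulation in the loops above.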

Finding the two closest numbers in a list using sorting

If I am given a list of integers/floats, how would I find the two closest numbers using sorting?
Such a method will do what you want:
>>> def minDistance(lst):
        lst = sorted(lst)
        index = -1
        distance = max(lst) - min(lst)
        for i in range(len(lst)-1):
            if lst[i+1] - lst[i] < distance:
                distance = lst[i+1] - lst[i]
                index = i
        for i in range(len(lst)-1):
            if lst[i+1] - lst[i] == distance:
                print lst[i],lst[i+1]
In the first for loop we find out the minimum distance, and in the second loop, we print all the pairs with this distance. Works as below:
>>> lst = (1,2,3,6,12,9,1.4,145,12,83,53,12,3.4,2,7.5)
>>> minDistance(lst)
2 2
12 12
12 12
>>>
There could be more than one possibility. Consider this list:
[0, 1, 20, 25, 30, 200, 201]
[0, 1] and [200, 201] are equally close.
Jose has a valid point. However, you could just consider these cases equal and not care about returning one or the other.
I don't think you need a sorting algorithm per se, but maybe just a sort of 'champion' algorithm like this one:
import math
import sys

def smallestDistance(arr):
    championI = -1
    championJ = -1
    champDistance = sys.maxint
    i = 0
    while i < len(arr):
        j = i + 1
        while j < len(arr):
            if math.fabs(arr[i] - arr[j]) < champDistance:
                championI = i
                championJ = j
                champDistance = math.fabs(arr[i] - arr[j])
            j += 1
        i += 1
    r = [arr[championI], arr[championJ]]
    return r
This function will return a sub-array with the two values that are closest together. Note that it only works for an array at least two elements long; otherwise it will misbehave or raise an error.
I think the popular sorting algorithm known as bubble sort would do this quite well, though it runs in O(n^2) time, if that kind of thing matters to you...
Here is standard bubble sort based on the sorting of arrays by integer size.
def bubblesort( A ):
    for i in range( len( A ) ):
        for k in range( len( A ) - 1, i, -1 ):
            if ( A[k] < A[k - 1] ):
                swap( A, k, k - 1 )

def swap( A, x, y ):
    tmp = A[x]
    A[x] = A[y]
    A[y] = tmp
You can just modify the algorithm slightly to fit your purposes if you insist on doing this using a sorting algorithm. However, I think the initial function works as well...
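As a rough sketch of that modification (added for illustration, not part of the original answer): once the list is sorted, the two closest numbers must be adjacent, so a single pass over neighbouring pairs is enough.

def closest_pair(lst):
    s = sorted(lst)
    # after sorting, the closest two values in the whole list are adjacent
    return min(zip(s, s[1:]), key=lambda pair: pair[1] - pair[0])

print(closest_pair([0, 1, 20, 25, 30, 200, 201]))  # (0, 1); on a tie, min keeps the first pair it sees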
hope that helps.

Compress dict sum statement with Python

In my Python application I have a big list (currently with almost 9000 items). I need to find the two most similar items in this list. So, what I have now is something like:
aux1 = 0
aux2 = 1
min_distance = 0xffff
weights = get_weights()
for i in range(0, len(_list)):
    for j in range(i + 1, len(_list)):
        obj1 = _list[i]
        obj2 = _list[j]
        dist = 0
        for key in self.__fields:
            dist += weights[key] * (obj1[key] - obj2[key]) ** 2
        if dist < min_distance:
            min_distance = dist
            aux1 = i
            aux2 = j
return aux1, aux2, min_distance
In the code, weights is a dict, and obj1 and obj2 are both objects that implement __getitem__, whose return value also comes from a dict. And self.__fields is a list of the selected fields (it currently has 9 items).
My problem is that this loop takes too long to complete. Even after 5 hours, the i variable is still within the first 100 items of the list.
With the following silly code, I came to the conclusion that the problem is not the size of the list (the silly code finishes in about 5 minutes):
count = 0
total = 9000
for i in range(0, total):
    for j in range(i + 1, total):
        for k in range(0, 10):
            count += 1
print("Count is " + str(count))
Therefore, the problem seems to be in the innermost loop of my code:
for key in self.__fields:
    dist += weights[key] * (obj1[key] - obj2[key]) ** 2
I know Python, but I'm not a Python specialist. My conclusion is that accessing the values of three objects through their keys is a slow operation. Some time ago, I read on a blog that list comprehensions and/or lambda operations can be faster.
So, my question is: how do I make this innermost loop faster using list comprehensions and/or lambdas? Feel free to give any other advice as well.
Not sure whether it's any faster, but you could rewrite that code using itertools.combinations and get the min using a key function calculating the "distance".
from itertools import combinations

weights = get_weights()
aux1, aux2 = min(combinations(_list, 2),
                 key=lambda pair: sum(weights[key] * (pair[0][key] - pair[1][key]) ** 2
                                      for key in self.__fields))
If this does not help, you might consider temporarily turning the dictionaries in _list into lists, holding just the values of the relevant fields. Instead of using dictionary lookup, you can then just zip those lists together with the weights. Afterwards, turn them back into dicts.
weights_list = [weights[f] for f in self.__fields]
as_lists = [[d[f] for f in self.__fields] for d in _list]
aux1, aux2 = min(combinations(as_lists, 2),
                 key=lambda pair: sum(w * (x - y) ** 2
                                      for w, x, y in zip(weights_list, *pair)))
aux1, aux2 = (dict(zip(self.__fields, x)) for x in (aux1, aux2))
This should be a bit faster, but it will only work if the dicts do not have any other fields than those in self.__fields, otherwise the dicts can not be reconstructed from the lists (at least not as easily). Alternatively, you might use tuples instead of lists and use another dictionary to map those tuples to the original dictionaries...
Or try this, using the indices of the elements instead of the elements themselves (not tested):
idx1, idx2 = min(combinations(range(len(_list)), 2),
                 key=lambda pair: sum(w * (x - y) ** 2
                                      for w, x, y in zip(weights_list, as_lists[pair[0]], as_lists[pair[1]])))
aux1, aux2 = _list[idx1], _list[idx2]
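If the fields are all numeric, a different technique worth mentioning (this is an addition, not from the original answer: vectorising with numpy rather than using itertools, and assuming non-negative weights) is to stack the selected fields into one array and compute every pairwise weighted squared distance at once. For roughly 9000 items the full distance matrix is about 650 MB of float64, so this trades memory for a large speedup over the pure-Python loops.

import numpy as np

def closest_pair_indices(_list, fields, weights):
    # Stack the selected fields into an (n, k) float array and scale each
    # column by sqrt(weight): the weighted squared distance then becomes an
    # ordinary squared Euclidean distance.
    X = np.array([[d[f] for f in fields] for d in _list], dtype=float)
    X *= np.sqrt([weights[f] for f in fields])
    # All pairwise squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X.dot(X.T)
    np.fill_diagonal(d2, np.inf)  # ignore each item's zero distance to itself
    i, j = np.unravel_index(np.argmin(d2), d2.shape)
    return min(i, j), max(i, j), d2[i, j]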

Enumerating all possible matrices with constraints

I'm attempting to enumerate all possible matrices of size r by r with a few constraints.
Row and column sums must be in non-ascending order.
Starting from the top left element down the main diagonal, each row and column subset from that entry must be made up of combinations with replacements from 0 to the value in that upper left entry (inclusive).
The row and column sums must all be less than or equal to a predetermined n value.
The main diagonal must be in non-ascending order.
An important note is that I need every combination to be stored somewhere, or, if written in C++, to be run through another few functions after being found.
r and n are values that range from 2 to, say, 100.
I've tried a recursive way to do this, along with an iterative one, but I keep getting hung up on keeping track of the column and row sums, and on keeping all the data manageable.
I have attached my most recent attempt (which is far from completed), but it may give you an idea of what is going on.
The function first_section() builds row zero and column zero correctly, but other than that I don't have anything successful.
I need more than a push to get this going; the logic is a pain in the butt and is swallowing me whole. I need this written in either Python or C++.
import numpy as np
from itertools import combinations_with_replacement

global r
global n
r = 4
n = 8
global myarray
myarray = np.zeros((r,r))
global arraysums
arraysums = np.zeros((r,2))
printing = False  # flag controlling the debug prints below

def first_section():
    bigData = []
    myarray = np.zeros((r,r))
    arraysums = np.zeros((r,2))
    for i in reversed(range(1,n+1)):
        myarray[0,0] = i
        stuff = []
        stuff = list(combinations_with_replacement(range(i),r-1))
        for j in range(len(stuff)):
            myarray[0,1:] = list(reversed(stuff[j]))
            arraysums[0,0] = sum(myarray[0,:])
            for k in range(len(stuff)):
                myarray[1:,0] = list(reversed(stuff[k]))
                arraysums[0,1] = sum(myarray[:,0])
                if arraysums.max() > n:
                    break
                bigData.append(np.hstack((myarray[0,:],myarray[1:,0])))
                if printing: print 'myarray \n%s' %(myarray)
    return bigData

def one_more_section(bigData,index):
    newData = []
    for item in bigData:
        if printing: print 'item = %s' %(item)
        upperbound = int(item[index-1]) # will need to have logic worked out
        if printing: print 'upperbound = %s' % (upperbound)
        for i in reversed(range(1,upperbound+1)):
            myarray[index,index] = i
            stuff = []
            stuff = list(combinations_with_replacement(range(i),r-1))
            for j in range(len(stuff)):
                myarray[index,index+1:] = list(reversed(stuff[j]))
                arraysums[index,0] = sum(myarray[index,:])
                for k in range(len(stuff)):
                    myarray[index+1:,index] = list(reversed(stuff[k]))
                    arraysums[index,1] = sum(myarray[:,index])
                    if arraysums.max() > n:
                        break
                    if printing: print 'index = %s' %(index)
                    newData.append(np.hstack((myarray[index,index:],myarray[index+1:,index])))
                    if printing: print 'myarray \n%s' %(myarray)
    return newData

bigData = first_section()
bigData = one_more_section(bigData,1)
A possible matrix could look like this:
r = 4, n >= 6
|3 2 0 0| = 5
|3 2 0 0| = 5
|0 0 2 1| = 3
|0 0 0 1| = 1
6 4 2 2
Here's a solution in numpy and python 2.7. Note that all the rows and columns are in non-increasing order, because you only specified that they should be combinations with replacement, and not their sortedness (and generating combinations is the simplest with sorted lists).
The code could be optimized somewhat by keeping row and column sums around as arguments instead of recomputing them.
import numpy as np

r = 2     #matrix dimension
maxs = 5  #maximum sum of row/column

def generate(r, maxs):
    # We create an extra row and column for the starting "dummy" values.
    # Filling in the matrix becomes much simpler when we do not have to treat cells with
    # one or two zero indices in special way. Thus, we start iteration from the
    # (1, 1) index.
    m = np.zeros((r + 1, r + 1), dtype = np.int32)
    m[0] = m[:,0] = maxs + 1
    def go(n, i, j):
        # If we completely filled the matrix, yield a copy of the non-dummy parts.
        if (i, j) == (r, r):
            yield m[1:, 1:].copy()
            return
        # We compute the next indices in row major order (the choice is arbitrary).
        (i2, j2) = (i + 1, 1) if j == r else (i, j + 1)
        # Computing the maximum possible value for the current cell.
        max_val = min(
            maxs - m[i, 1:].sum(),
            maxs - m[1:, j].sum(),
            m[i, j-1],
            m[i-1, j])
        for n2 in xrange(max_val, -1, -1):
            m[i, j] = n2
            for matrix in go(n2, i2, j2):
                yield matrix
    return go(maxs, 1, 1) #note that this is a generator object

# testing
for matrix in generate(r, maxs):
    print
    print matrix
If you'd like to have all the valid permutations in the rows and columns, this code below should work.
def generate(r, maxs):
    m = np.zeros((r + 1, r + 1), dtype = np.int32)
    rows = [0]*(r+1) # We avoid recomputing row/col sums on each cell.
    cols = [0]*(r+1)
    rows[0] = cols[0] = m[0, 0] = maxs
    def go(i, j):
        if (i, j) == (r, r):
            yield m[1:, 1:].copy()
            return
        (i2, j2) = (i + 1, 1) if j == r else (i, j + 1)
        max_val = min(rows[i-1] - rows[i], cols[j-1] - cols[j])
        if i == j:
            max_val = min(max_val, m[i-1, j-1])
        if (i, j) != (1, 1):
            max_val = min(max_val, m[1, 1])
        for n in xrange(max_val, -1, -1):
            m[i, j] = n
            rows[i] += n
            cols[j] += n
            for matrix in go(i2, j2):
                yield matrix
            rows[i] -= n
            cols[j] -= n
    return go(1, 1)
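A brief usage note (added here, not part of the original answer): generate returns a generator, so the matrices have to be consumed with a loop or list(). For a small, hypothetical case, in the same Python 2 style as the answer:

matrices = list(generate(2, 3))  # 2x2 matrices with row/column sums <= 3
print len(matrices)
for m in matrices[:3]:
    print m
    print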

Connected Component Counting

In the standard algorithm for connected component counting, a disjoint-set data structure called union-find is used.
Why is this data structure used? I've written code that just scans the image linearly, maintaining two linear buffers to store the current and next component labels for each connected pixel by examining four neighbours (E, SE, S, SW), and, in case of a connection, updating the connection map to join the higher component with the lower one.
Once done, I search for all non-joined components and report the count.
I just can't see why this approach is less efficient than using union-find.
Here's my code. The input file has been reduced to 0s and 1s. The program outputs the number of connected components formed from 0s.
import sys

def CompCount(fname):
    fin = open(fname)
    b,l = fin.readline().split()
    b,l = int(b),int(l)+1
    inbuf = '1'*l + fin.read()
    prev = curr = [sys.maxint]*l
    nextComp = 0
    tree = dict()
    for i in xrange(1, b+1):
        curr = [sys.maxint]*l
        for j in xrange(0, l-1):
            curr[j] = sys.maxint
            if inbuf[i*l+j] == '0':
                p = [prev[j+n] for m,n in [(-l+1,1),(-l,0),(-l-1,-1)] if inbuf[i*l + j+m] == '0']
                curr[j] = min([curr[j]] + p + [curr[j-1]])
                if curr[j] == sys.maxint:
                    nextComp += 1
                    curr[j] = nextComp
                    tree[curr[j]] = 0
                else:
                    if curr[j] < prev[j+1]: tree[prev[j+1]] = curr[j]
                    if curr[j] < prev[j]: tree[prev[j]] = curr[j]
                    if curr[j] < prev[j-1]: tree[prev[j-1]] = curr[j]
                    if curr[j] < curr[j-1]: tree[curr[j-1]] = curr[j]
        prev = curr
    return len([x for x in tree if tree[x]==0])
I didn't completely understand your question; you would really benefit from writing it up clearly and structuring it better.
What I understand is that you want to do connected component labeling in a 0-1 image using the 8-neighborhood. If so, your assumption that the resulting neighborhood graph would be planar is wrong: you have crossings at the "diagonals", and it should be easy to construct a K_{3,3} or K_{5} in such an image.
Your algorithm is flawed. Consider this example:
11110
01010
10010
11101
Your algorithm says 2 components whereas it has only 1.
To test, I used this slightly-modified version of your code.
import sys

def CompCount(image):
    l = len(image[0])
    b = len(image)
    prev = curr = [sys.maxint]*(l+1)
    nextComp = 0
    tree = dict()
    for i in xrange(b):
        curr = [sys.maxint]*(l+1)
        for j in xrange(l):
            curr[j] = sys.maxint
            if image[i][j] == '0':
                p = [prev[j+n] for m,n in [(1,1),(-1,0),(-1,-1)] if 0<=i+m<b and 0<=j+n<l and image[i+m][j+n] == '0']
                curr[j] = min([curr[j]] + p + [curr[j-1]])
                if curr[j] == sys.maxint:
                    nextComp += 1
                    curr[j] = nextComp
                    tree[curr[j]] = 0
                else:
                    if curr[j] < prev[j+1]: tree[prev[j+1]] = curr[j]
                    if curr[j] < prev[j]: tree[prev[j]] = curr[j]
                    if curr[j] < prev[j-1]: tree[prev[j-1]] = curr[j]
                    if curr[j] < curr[j-1]: tree[curr[j-1]] = curr[j]
        prev = curr
    return len([x for x in tree if tree[x]==0])

print CompCount(['11110', '01010', '10010', '11101'])
Let me try to explain your algorithm in words (in terms of a graph rather than a grid).
Set 'roots' to be an empty set.
Iterate over the nodes in the graph.
For a node, n, look at all its neighbours already processed. Call this set A.
If A is empty, pick a new value k, set v[node] to be k, and add k to roots.
Otherwise, let k be the min of v[x] over x in A, set v[node] to k, and remove v[x] from roots for each x in A with v[x] != k.
The number of components is the number of elements of roots.
(Your tree is the same as my roots: note that you never use the value of tree[] elements, only whether they are 0 or not... this is just implementing a set)
It's like union-find, except that it assumes that when you merge two components, the one with the higher v[] value has never been previously merged with another component. In the counterexample this is exploited because the two 0s in the center column have been merged with the 0s to their left.
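For contrast, here is a minimal union-find sketch with path compression (added for illustration, not part of the original answer). Because merging always follows the parent pointers to the true root first, it does not rely on the assumption that the higher label was never merged before:

def find(parent, x):
    # follow parent pointers to the root, compressing the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # keep the smaller label as the root

# hypothetical usage with labels 1..4: 3 is merged into 2 first, then 2 into 1
parent = {k: k for k in range(1, 5)}
union(parent, 2, 3)
union(parent, 1, 2)
roots = set(find(parent, k) for k in parent)
print(len(roots))  # 2 components: {1, 2, 3} and {4}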
My variant:
Split your entire graph into edges. Add each edge to a set.
On next iteration, draw edges between the 2 outer nodes of the edge you made in step 2. This means adding new nodes (with their corresponding sets) to the set the original edge was from. (basically set merging)
Repeat 2 until the 2 nodes you're looking for are in the same set. You will also need to do a check after step 1 (just in case the 2 nodes are adjacent).
At first your nodes will be each in its set,
o-o o-o o1-o3 o2 o3-o4
\ / |
o-o-o-o o2 o1-o3-o4
As the algorithm progresses and merges the sets, it roughly halves the input at each step.
In the example I am checking for components in some graph. After merging all edges to their maximum possible set, I am left with 3 sets giving 3 disconnected components.
(The number of components is the number of sets you are able to get when the algorithm finishes.)
A possible graph (for the tree above):
o-o-o o4 o2
| |
o o3
|
o1