I have a large array of words (300k words) and I want to find the edit distance between each pair of words, so I was just iterating over the array and running this version of the Levenshtein algorithm:
unsigned int edit_distance(const std::string& s1, const std::string& s2)
{
    const std::size_t len1 = s1.size(), len2 = s2.size();
    std::vector<std::vector<unsigned int>> d(len1 + 1, std::vector<unsigned int>(len2 + 1));
    d[0][0] = 0;
    for (unsigned int i = 1; i <= len1; ++i) d[i][0] = i;
    for (unsigned int i = 1; i <= len2; ++i) d[0][i] = i;
    for (unsigned int i = 1; i <= len1; ++i)
        for (unsigned int j = 1; j <= len2; ++j)
            // note that std::min({arg1, arg2, arg3}) works only in C++11,
            // for C++98 use std::min(std::min(arg1, arg2), arg3)
            d[i][j] = std::min({ d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1) });
    return d[len1][len2];
}
So what I was wondering is whether there is a more efficient way of doing this. I have heard about Levenshtein automata, but I wasn't sure whether they would be any more efficient.
I would imagine you could avoid processing the same thing over and over again by preprocessing something, but I have no idea how to actually achieve it (a rough estimate suggests preprocessing everything would take around 10^28 operations, so that would not be an improvement).
As stated in his comment, the OP is actually looking for all the pairs with an edit distance of less than 2.
Given an input of n words, a naive approach would be to make n(n-1)/2 comparisons, but fewer comparisons may be required because the Levenshtein distance is a metric on strings.
Levenshtein distance is a metric and satisfies the four required metric axioms, including the triangle inequality.
Edit:
Given this, we can use the method proposed by Sergey Brin (Google's co-founder) in his paper Near Neighbor Search in Large Metric Spaces back in 1995, to solve our problem.
Quoting from the paper: Given a metric space (X, d), a data set Y ⊆ X, a query point x ∈ X, and a range r ∈ R, the near neighbors of x are the set of points y ∈ Y, such that d(x, y) ≤ r.
In this paper, Brin introduced GNAT (Geometric Near-neighbor Access Tree), a data structure to solve this problem. Brin actually tests the performance of his algorithm using the Levenshtein distance (which he calls "Edit distance") against two text corpora.
Over the years GNAT has become well-known and widely used. Some improvements to GNAT were suggested in Geometric Near-neighbor Access Tree (GNAT) revisited (Fredriksson, 2016).
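GNAT itself is more than a snippet, but the triangle-inequality pruning it relies on fits in a few lines. A hedged sketch of just that pruning idea (not GNAT), reusing the edit_distance function from the question and a single precomputed pivot word:

#include <string>
#include <vector>

// dist_to_pivot[i] holds the precomputed value d(pivot, words[i]).
// By the triangle inequality, |d(x, pivot) - d(pivot, w)| <= d(x, w),
// so any w with |d(x, pivot) - d(pivot, w)| > r can be skipped without
// computing the expensive edit distance d(x, w).
std::vector<std::size_t> near_neighbors(const std::string& x,
                                        const std::vector<std::string>& words,
                                        const std::vector<unsigned int>& dist_to_pivot,
                                        const std::string& pivot,
                                        unsigned int r)
{
    std::vector<std::size_t> result;
    const unsigned int dxp = edit_distance(x, pivot); // one extra distance per query
    for (std::size_t i = 0; i < words.size(); ++i) {
        const unsigned int dpw = dist_to_pivot[i];
        const unsigned int lower_bound = (dxp > dpw) ? dxp - dpw : dpw - dxp;
        if (lower_bound > r) continue;                 // pruned by the triangle inequality
        if (edit_distance(x, words[i]) <= r) result.push_back(i);
    }
    return result;
}

GNAT generalizes this by choosing several split points and recursing, so whole subtrees can be pruned at once.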
If as indicated in the comments what you actually want is to find pairs with edit distance at most two, you can generate from each word all possibilities of deleting at most two characters (should be at most 500 or so), and store these in a hash table. Then, you only need to check each pair of words in a hash bucket, which is probably not hard to do by looking at whether deletions coincide.
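A hedged sketch of that idea, reusing the question's edit_distance for the final verification (the helper names are invented, and the same pair can land in several buckets, so the output still needs deduplication):

#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Generate every string obtainable from w by deleting at most two characters.
std::vector<std::string> deletion_variants(const std::string& w)
{
    std::vector<std::string> out{w};
    for (std::size_t i = 0; i < w.size(); ++i) {
        std::string one = w.substr(0, i) + w.substr(i + 1);
        out.push_back(one);
        for (std::size_t j = 0; j < one.size(); ++j)
            out.push_back(one.substr(0, j) + one.substr(j + 1));
    }
    return out;
}

// Bucket words by their deletion variants; only words sharing a variant are
// candidate pairs, and each candidate pair is verified with the exact distance.
std::vector<std::pair<std::size_t, std::size_t>>
pairs_within_two(const std::vector<std::string>& words)
{
    std::unordered_map<std::string, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < words.size(); ++i)
        for (const std::string& v : deletion_variants(words[i]))
            buckets[v].push_back(i);

    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (const auto& kv : buckets)
        for (std::size_t a = 0; a < kv.second.size(); ++a)
            for (std::size_t b = a + 1; b < kv.second.size(); ++b)
                if (edit_distance(words[kv.second[a]], words[kv.second[b]]) <= 2)
                    result.push_back({kv.second[a], kv.second[b]});
    return result;
}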
There is a function in the dtw package:
dtw(x, y=NULL, dist.method="Euclidean", step.pattern=symmetric2, window.type="none", keep.internals=FALSE, distance.only=FALSE, open.end=FALSE, open.begin=FALSE, ... )
In the function, there are three step patterns for calculating the distance:
symmetric1 , symmetric2 , asymmetric
I am interested in the method step.pattern = symmetric2.
I have a C++ function that works exactly like symmetric1:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double dtw_rcpp(const NumericVector& x, const NumericVector& y) {
    size_t n = x.size(), m = y.size();
    NumericMatrix res = no_init(n + 1, m + 1);
    std::fill(res.begin(), res.end(), R_PosInf);
    res(0, 0) = 0;
    double cost = 0;
    // window width; cast to int before subtracting so the unsigned difference cannot wrap when m > n
    size_t w = std::abs(static_cast<int>(n) - static_cast<int>(m));
    for (size_t i = 1; i <= n; ++i) {
        // likewise, compute i - w in signed arithmetic so it cannot wrap when w > i
        for (size_t j = std::max(1, static_cast<int>(i) - static_cast<int>(w)); j <= std::min(m, i + w); ++j) {
            cost = std::abs(x[i - 1] - y[j - 1]);
            res(i, j) = cost + std::min(std::min(res(i - 1, j), res(i, j - 1)), res(i - 1, j - 1));
        }
    }
    return res(n, m);
}
What do I need to change in this C++ function so that it uses the symmetric2 step pattern?
I do not understand how symmetric2 works.
The documentation says very little about it:
1. Well-known step patterns
These common transition types are used in quite a lot of implementations.
symmetric1 (or White-Neely) is the commonly used quasi-symmetric, no local constraint, non-normalizable. It is biased in favor of oblique steps. symmetric2 is normalizable, symmetric, with no local slope constraints. Since one diagonal step costs as much as the two equivalent steps along the sides, it can be normalized dividing by N+M (query+reference lengths).
I could not understand it from the source code because I am a beginner programmer.
I do not speak English so forgive me for the mistakes.
Thank you.
OP is asking about dynamic time warping alignments in R. Printing the symmetric2 object should clarify the recursion rule:
g[i,j] = min(
    g[i-1,j-1] + 2 * d[i,j],
    g[i  ,j-1] +     d[i,j],
    g[i-1,j  ] +     d[i,j]
)
g is the global cost matrix, d the local distance. I can't comment on the rest of your code.
If you only need the distance value under this specific step pattern, and no other features, the code may be much simplified (see e.g. the pseudocode on Wikipedia).
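If I read the recursion above correctly, the only change needed in the question's dtw_rcpp is to make the diagonal step carry the local cost twice; a sketch of that one line (untested):

// symmetric2: the diagonal step pays the local cost twice
res(i, j) = std::min(std::min(res(i - 1, j) + cost,      // vertical step
                              res(i, j - 1) + cost),     // horizontal step
                     res(i - 1, j - 1) + 2 * cost);      // diagonal step, weight 2

If you also want the normalized distance that the package reports for symmetric2, divide the final res(n, m) by n + m, as the quoted documentation describes.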
Language/Compiler: C++ (Visual Studio 2013)
Experience: ~2 months
I am working in a rectangular grid in 3D space (size: xdim by ydim by zdim), where xgrid, ygrid, and zgrid are 3D arrays of the x-, y-, and z-coordinates, respectively. Now I am interested in finding all points that lie within a sphere of radius r centered about the point (vi,vj,vk). I want to store the index locations of these points in the vectors xidx, yidx, zidx. For a single point this algorithm works and is fast enough, but when I wish to iterate over many points within the 3D space I run into very long run times.
Does anyone have any suggestions on how I can improve the implementation of this algorithm in C++? After running some profiling software I found online (Very Sleepy, Luke Stackwalker) it seems that the "std::vector::size" and "std::vector::operator[]" member functions are bogging down my code. Any help is greatly appreciated.
Note: Since I do not know a priori how many voxels are within the sphere, I set the length of vectors xidx,yidx,zidx to be larger than necessary and then erase all the excess elements at the end of the function.
void find_nv(int vi, int vj, int vk, vector<double> &xidx, vector<double> &yidx, vector<double> &zidx, double*** &xgrid, double*** &ygrid, double*** &zgrid, int r, double xdim,double ydim,double zdim, double pdim)
{
double xcor, ycor, zcor,xval,yval,zval;
vector<double>xyz(3);
xyz[0] = xgrid[vi][vj][vk];
xyz[1] = ygrid[vi][vj][vk];
xyz[2] = zgrid[vi][vj][vk];
int counter = 0;
// Confine loop to be within boundaries of sphere
int istart = vi - r;
int iend = vi + r;
int jstart = vj - r;
int jend = vj + r;
int kstart = vk - r;
int kend = vk + r;
if (istart < 0) {
istart = 0;
}
if (iend > xdim-1) {
iend = xdim-1;
}
if (jstart < 0) {
jstart = 0;
}
if (jend > ydim - 1) {
jend = ydim-1;
}
if (kstart < 0) {
kstart = 0;
}
if (kend > zdim - 1)
kend = zdim - 1;
//-----------------------------------------------------------
// Begin iterating through all points
//-----------------------------------------------------------
// start from the clamped lower bounds so only the sphere's bounding box is scanned
for (int k = kstart; k < kend+1; ++k)
{
for (int j = jstart; j < jend+1; ++j)
{
for (int i = istart; i < iend+1; ++i)
{
if (i == vi && j == vj && k == vk)
continue;
else
{
xcor = pow((xgrid[i][j][k] - xyz[0]), 2);
ycor = pow((ygrid[i][j][k] - xyz[1]), 2);
zcor = pow((zgrid[i][j][k] - xyz[2]), 2);
double rsqr = pow(r, 2);
double sphere = xcor + ycor + zcor;
if (sphere <= rsqr)
{
xidx[counter]=i;
yidx[counter]=j;
zidx[counter] = k;
counter = counter + 1;
}
else
{
}
//cout << "counter = " << counter - 1;
}
}
}
}
// erase all appending zeros that are not voxels within sphere
xidx.erase(xidx.begin() + (counter), xidx.end());
yidx.erase(yidx.begin() + (counter), yidx.end());
zidx.erase(zidx.begin() + (counter), zidx.end());
}
You already appear to have used my favourite trick for this sort of thing, getting rid of the relatively expensive square root functions and just working with the squared values of the radius and center-to-point distance.
One other possibility which may speed things up (a) is to replace all the:
xyzzy = pow (plugh, 2)
calls with the simpler:
xyzzy = plugh * plugh
You may find the removal of the function call could speed things up, however marginally.
Another possibility, if you can establish the maximum size of the target array, is to use a real array rather than a vector. I know they make the vector code as insanely optimal as possible, but it still won't match a fixed-size array for performance (since it has to do everything the fixed-size array does plus handle possible expansion).
Again, this may only offer very marginal improvement at the cost of more memory usage but trading space for time is a classic optimisation strategy.
Other than that, ensure you're using the compiler optimisations wisely. The default build in most cases has a low level of optimisation to make debugging easier. Ramp that up for production code.
(a) As with all optimisations, you should measure, not guess! These suggestions are exactly that: suggestions. They may or may not improve the situation, so it's up to you to test them.
One of your biggest problems, and one that is probably preventing the compiler from making a lot of optimisations, is that you are not using the regular nature of your grid.
If you are really using a regular grid then
xgrid[i][j][k] = x_0 + i * dxi + j * dxj + k * dxk
ygrid[i][j][k] = y_0 + i * dyi + j * dyj + k * dyk
zgrid[i][j][k] = z_0 + i * dzi + j * dzj + k * dzk
If your grid is axis aligned then
xgrid[i][j][k] = x_0 + i * dxi
ygrid[i][j][k] = y_0 + j * dyj
zgrid[i][j][k] = z_0 + k * dzk
Replacing these inside your core loop should result in significant speedups.
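For example, assuming an axis-aligned grid with an origin (x0, y0, z0) and spacings dx, dy, dz (names invented, not from the question), the core loop could skip the coordinate arrays entirely; a rough sketch using the question's clamped bounds:

// Hypothetical: derive coordinates from the indices instead of array lookups.
const double rsqr = double(r) * double(r);
for (int k = kstart; k <= kend; ++k) {
    const double zc = z0 + k * dz - xyz[2];
    for (int j = jstart; j <= jend; ++j) {
        const double yc = y0 + j * dy - xyz[1];
        for (int i = istart; i <= iend; ++i) {
            const double xc = x0 + i * dx - xyz[0];
            if (xc * xc + yc * yc + zc * zc <= rsqr) {
                // (i, j, k) lies inside the sphere
            }
        }
    }
}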
You could do two things: reduce the number of points you are testing for inclusion, and simplify the problem to multiple 2D tests.
If you take the sphere and look at it down the z axis, you have all the points from y+r to y-r in the sphere. Using each of these levels you can slice the sphere into circles that contain all the points in the x/z plane, limited to the circle's radius at the specific y you are testing. Calculating the radius of the circle is a simple matter of solving for a side of a right-angled triangle.
Right now you are testing all the points in a cube, but towards the top and bottom of the sphere that cube contains mostly points outside it. The idea behind the above algorithm is that you can limit the points tested at each level of the sphere to the square containing the circle at that height.
Here is a simple hand-drawn graphic showing the sphere from the side view.
Here we are looking at the slice of the sphere that has the radius ab. Since you know the lengths ac and bc of the right-angled triangle, you can calculate ab using Pythagoras' theorem. Now you have a simple circle that you can test the points against; then move down a level, reduce the length ac, recalculate ab, and repeat.
Now once you have that you can actually do a little more optimization. Firstly, you do not need to test every point against the circle; you only need to test one quarter of the points. If you test the points in the upper left quadrant of the circle (the slice of the sphere), then the points in the other three quadrants are just mirror images of it, offset either to the right, below, or diagonally from the point determined to be in the first quadrant.
Then finally, you only need to do the circle slices for the top half of the sphere, because the bottom half is just a mirror of the top half. In the end you have only tested a fraction of the points for containment in the sphere. This should be a huge performance boost.
I hope that makes sense; I am not at a machine right now where I can test anything, but roughly, the idea could look like the sketch below.
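A hedged sketch in index space, assuming unit voxel spacing; vi, vj, vk, r and the grid dimensions mirror the question's parameters, and the exact inclusion test stays as in the original code:

#include <cmath>

// Visit only the voxels inside each z-slice's bounding square.
void for_each_voxel_near(int vi, int vj, int vk, int r, int xdim, int ydim, int zdim)
{
    for (int dz = -r; dz <= r; ++dz) {
        const int k = vk + dz;
        if (k < 0 || k >= zdim) continue;
        // radius of the circle this slice cuts out of the sphere (Pythagoras: ab^2 = bc^2 - ac^2)
        const int slice_r = static_cast<int>(std::sqrt(double(r) * r - double(dz) * dz));
        for (int dj = -slice_r; dj <= slice_r; ++dj) {
            const int j = vj + dj;
            if (j < 0 || j >= ydim) continue;
            for (int di = -slice_r; di <= slice_r; ++di) {
                const int i = vi + di;
                if (i < 0 || i >= xdim) continue;
                // (i, j, k) lies inside this slice's bounding square;
                // test it against the circle/sphere exactly as in the original code
            }
        }
    }
}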
A simpler option here would be a 3D flood fill from the center of the sphere rather than iterating over the enclosing cube, since you visit fewer points. Moreover, you should implement the iterative version of the flood fill for better efficiency.
Flood Fill
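A hedged sketch of such an iterative flood fill, assuming the sphere is defined directly in index space (unit voxel spacing), which may or may not match the coordinate grids in the question:

#include <array>
#include <queue>
#include <set>

// Iterative 3D flood fill from the center voxel: expand to the 6 axis neighbors,
// queueing only voxels whose squared index distance from the center is within r^2,
// so voxels outside the sphere are never visited.
void flood_fill_sphere(int vi, int vj, int vk, int r, int xdim, int ydim, int zdim)
{
    std::queue<std::array<int, 3>> frontier;
    std::set<std::array<int, 3>> seen;
    frontier.push({vi, vj, vk});
    seen.insert({vi, vj, vk});
    const int rsqr = r * r;
    const int steps[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};
    while (!frontier.empty()) {
        const std::array<int, 3> p = frontier.front();
        frontier.pop();
        // p is a voxel inside the sphere: record it here
        for (const auto& s : steps) {
            const int i = p[0] + s[0], j = p[1] + s[1], k = p[2] + s[2];
            if (i < 0 || i >= xdim || j < 0 || j >= ydim || k < 0 || k >= zdim) continue;
            const int di = i - vi, dj = j - vj, dk = k - vk;
            if (di * di + dj * dj + dk * dk > rsqr) continue; // outside the sphere
            if (seen.insert({i, j, k}).second) frontier.push({i, j, k});
        }
    }
}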
I came across this problem.
It asks for the number of ways a lock pattern of a specific length can be drawn on a 4x3 grid while following the rules below; some of the points may be forbidden and must not be included in the path.
A valid pattern has the following properties:
A pattern can be represented using the sequence of points which it touches for the first time (in the same order of drawing the pattern); a pattern going from (1,1) to (2,2) is not the same as a pattern going from (2,2) to (1,1).
For every two consecutive points A and B in the pattern representation, if the line segment connecting A and B passes through some other points, these points must be in the sequence also and comes before A and B, otherwise the pattern will be invalid. For example a pattern representation which starts with (3,1) then (1,3) is invalid because the segment passes through (2,2) which didn't appear in the pattern representation before (3,1), and the correct representation for this pattern is (3,1) (2,2) (1,3). But the pattern (2,2) (3,2) (3,1) (1,3) is valid because (2,2) appeared before (3,1).
In the pattern representation we don't mention the same point more than once, even if the pattern will touch this point again through another valid segment, and each segment in the pattern must be going from a point to another point which the pattern didn't touch before and it might go through some points which already appeared in the pattern.
The length of a pattern is the sum of the Manhattan distances between every two consecutive points in the pattern representation. The Manhattan distance between two points (X1, Y1) and (X2, Y2) is |X1 - X2| + |Y1 - Y2| (where |X| means the absolute value of X).
A pattern must touch at least two points.
My approach was brute force: loop over the points, start at each point, and recursively decrement the length until it reaches zero, then add 1 to the number of combinations.
Is there a way to calculate it with a mathematical formula, or is there a better algorithm for this?
UPDATE:
Here is what I have done; it gives some wrong answers! I think the problem is in the isOk function.
notAllowed is a global bitmask of the forbidden points.
bool isOk(int i, int j, int di,int dj, ll visited){
int mini = (i<di)?i:di;
int minj = (j<dj)?j:dj;
if(abs(i-di) == 2 && abs(j-dj) == 2 && !getbit(visited, mini+1, minj+1) )
return false;
if(di == i && abs(j - dj) == 2 && !getbit(visited, i,minj+1) )
return false;
if(di == i && abs(j-dj) == 3 && (!getbit(visited, i,1) || !getbit(visited, i,2)) )
return false;
if(dj == j && abs(i - di) == 2 && !getbit(visited, 1,j) )
return false;
return true;
}
int f(int i, int j, ll visited, int l){
if(l > L) return 0;
short& res = dp[i][j][visited][l];
if(res != -1) return res;
res = 0;
if(l == L) return ++res;
for(int di=0 ; di<gN ; ++di){
for(int dj=0 ; dj<gM ; ++dj){
if( getbit(notAllowed, di, dj) || getbit(visited, di, dj) || !isOk(i,j, di,dj, visited) )
continue;
res += f(di, dj, setbit(visited, di, dj), l+dist(i,j , di,dj));
}
}
return res;
}
My answer to another question can be adapted to this problem as well.
Let f(i,j,visited,k) be the number of ways to complete a partial pattern when we are currently at node (i,j), have already visited the vertices in the set visited, and have so far walked a path of length k. We can represent visited as a bitmask.
We can compute f(i,j,visited,k) recursively by trying all possible next moves and apply DP to reuse subproblem solutions:
f(i,j, visited, L) = 1
f(i,j, visited, k) = 0 if k > L
f(i,j, visited, k) = sum(possible moves (i',j'): f(i', j', visited UNION {(i',j')}, k + dist((i,j), (i',j'))))
Possible moves are those that cross a number of visited vertices and then end in an unvisited (and not forbidden) one.
If D is the set of forbidden vertices, the answer to the question is
sum((i,j) not in D: f(i,j, {(i,j)}, 0)).
The runtime is something like O(X^2 * Y^2 * 2^(X*Y) * maximum possible length). I guess the maximum possible length is in fact well below 1000.
UPDATE: I implemented this solution and it got accepted. I enumerated the possible moves in the following way: Assume we are at point (i,j) and have already visited the set of vertices visited. Enumerate all distinct coprime pairs (dx,dy) 0 <= dx < X and 0 <= dy < Y. Then find the smallest k with P_k = (i + kdx, j + kdy) still being a valid grid point and P_k not in visited. If P_k is not forbidden, it is a valid move.
The maximum possible path length is 39.
I'm using a DP array of size 3 * 4 * 2^12 * 40 to store the subproblem results.
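A hedged sketch of that move enumeration (signs are added so the walk can go in every direction; the forbidden-point filter and the DP itself are left to the caller, and all names here are invented):

#include <cstdlib>
#include <numeric>   // std::gcd, C++17
#include <utility>
#include <vector>

// From (i, j), for every coprime direction (dx, dy), walk along it and take the
// first grid point that is not yet visited; every point crossed before it is
// already visited, so the "crossed points must appear earlier" rule holds.
std::vector<std::pair<int, int>> moves_from(int i, int j, int X, int Y,
                                            const std::vector<std::vector<bool>>& visited)
{
    std::vector<std::pair<int, int>> result;
    for (int dx = -(X - 1); dx <= X - 1; ++dx) {
        for (int dy = -(Y - 1); dy <= Y - 1; ++dy) {
            if (dx == 0 && dy == 0) continue;
            if (std::gcd(std::abs(dx), std::abs(dy)) != 1) continue; // coprime directions only
            int x = i + dx, y = j + dy;
            while (x >= 0 && x < X && y >= 0 && y < Y && visited[x][y]) {
                x += dx; // skip over already-visited points along the direction
                y += dy;
            }
            if (x >= 0 && x < X && y >= 0 && y < Y)
                result.push_back({x, y}); // first unvisited point along this direction
        }
    }
    return result;
}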
There are a couple of attributes of the combinations that may be used to optimize the brute force method:
Using mirror images (horizontal, vertical, or both) you can generate 4 combinations for each one found (except horizontal or vertical lines). Maybe you could consider only combinations starting in one quadrant.
You can usually generate additional combinations of the same length by translation (moving a combination).
I have created a minimum distance filter for points.
The function takes a stream of points (x1,y1,x2,y2,...) and removes points that are closer to another point than the given minimum distance.
void minDistanceFilter(vector<float> &points, float distance = 0.0)
{
float p0x, p0y;
float dx, dy, dsq;
float mdsq = distance*distance; // minimum distance square
unsigned i, j, n = points.size();
for(i=0; i<n; i+=2) // each point occupies two entries (x, y), so advance by 2
{
p0x = points[i];
p0y = points[i+1];
for(j=0; j<n; j+=2)
{
if (i == j) continue; // skip the point itself, otherwise dsq == 0 and the point deletes itself
dx = p0x - points[j]; // delta x (p0x - p1x)
dy = p0y - points[j+1]; // delta y (p0y - p1y)
dsq = dx*dx + dy*dy; // distance square
if (dsq < mdsq)
{
auto del = points.begin() + j;
points.erase(del, del+2); // remove one (x,y) pair, i.e. two floats
n = points.size(); // update n
j -= 2; // decrement j
}
}
}
}
The only problem is that it is very slow, since it tests all points against all points (n^2).
How could it be improved?
kd-trees or range trees could be used for your problem. However, if you want to code from scratch and want something simpler, then you can use a hash table structure. For each point (a,b), hash using the key (round(a/d),round(b/d)) and store all the points that have the same key in a list. Then, for each key (m,n) in your hash table, compare all points in the list to the list of points that have key (m',n') for all 9 choices of (m',n') where m' = m + (-1 or 0 or 1) and n' = n + (-1 or 0 or 1). These are the only points that can be within distance d of your points that have key (m,n). The downside compared to a kd-tree or range tree is that for a given point, you are effectively searching within a square of side length 3*d for points that might have distance d or less, instead of searching within a square of side length 2*d which is what you would get if you used a kd-tree or range tree. But if you are coding from scratch, this is easier to code; also kd-trees and range trees are kinda overkill if you only have one universal distance d that you care about for all points.
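A rough sketch of that bucketing step (all names invented; points laid out as a flat x,y vector like in the question):

#include <cmath>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Cell key: the point's coordinates rounded to multiples of d, as in round(a/d).
using Cell = std::pair<long, long>;
struct CellHash {
    std::size_t operator()(const Cell& c) const {
        return std::hash<long>()(c.first) * 73856093u ^ std::hash<long>()(c.second) * 19349663u;
    }
};

// Bucket point indices by cell; only a point's own cell and the 8 surrounding
// cells can contain points within distance d of it.
std::unordered_map<Cell, std::vector<std::size_t>, CellHash>
build_grid(const std::vector<float>& points, float d)
{
    std::unordered_map<Cell, std::vector<std::size_t>, CellHash> grid;
    for (std::size_t i = 0; i + 1 < points.size(); i += 2) {
        const Cell c{ std::lround(points[i] / d), std::lround(points[i + 1] / d) };
        grid[c].push_back(i); // index of the point's x coordinate
    }
    return grid;
}

Then, for each occupied cell (m,n), compare its points only against the points stored under keys (m+a, n+b) with a, b in {-1, 0, 1}, using the same squared-distance test as before.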
Look up range tree, e.g. en.wikipedia.org/wiki/Range_tree . You can use this structure to store 2-dimensional points and very quickly find all the points that lie inside a query rectangle. Since you want to find points within a certain distance d of a point (a,b), your query rectangle will need to be [a-d,a+d]x[b-d,b+d] and then you test any points found inside the rectangle to make sure they are actually within distance d of (a,b). Range tree can be built in O(n log n) time and space, and range queries take O(log n + k) time where k is the number of points found in the rectangle. Seems optimal for your problem.
This question is a slight extension of the one answered here. I am working on re-implementing a version of the histogram approximation found in Section 2.1 of this paper, and I would like to get all my ducks in a row before beginning this process again. Last time, I used boost::multi_index, but performance wasn't the greatest, and I would like to avoid the insert/find complexity of a std::set, which is logarithmic in the number of buckets. Because of the number of histograms I'm using (one per feature per class per leaf node of a random tree in a random forest), the computational complexity must be as close to constant as possible.
A standard technique used to implement a histogram involves mapping the input real value to a bin number. To accomplish this, one method is to:
initialize a standard C array of size N, where N = number of bins; and
multiply the input value (real number) by some factor and floor the result to get its index in the C array.
This works well for histograms with uniform bin size, and is quite efficient. However, Section 2.1 of the above-linked paper provides a histogram algorithm without uniform bin sizes.
Another issue is that simply multiplying the input real value by a factor and using the resulting product as an index fails with negative numbers. To resolve this, I considered identifying a '0' bin somewhere in the array. This bin would be centered at 0.0; the bins above/below it could be calculated using the same multiply-and-floor method just explained, with the slight modification that the floored product be added to two or subtracted from two as necessary.
This then raises the question of merges: the algorithm in the paper merges the two closest bins, as measured from center to center. In practice, this creates a 'jagged' histogram approximation, because some bins would have extremely large counts and others would not. Of course, this is due to non-uniform-sized bins, and doesn't result in any loss of precision. A loss of precision does, however, occur if we try to normalize the non-uniform-sized bins to make them uniform. This is because of the assumption that m/2 samples fall to the left and right of the bin center, where m = bin count. We could model each bin as a Gaussian, but this will still result in a loss of precision (albeit minimal).
So that's where I'm stuck right now, leading to this major question: What's the best way to implement a histogram accepting streaming data and storing each sample in bins of uniform size?
Keep four variables.
int N; // assume for simplicity that N is even
int count[N];
double lower_bound;
double bin_size;
When a new sample x arrives, compute double i = floor((x - lower_bound) / bin_size). If i >= 0 && i < N, then increment count[i]. If i >= N, then repeatedly double bin_size until x - lower_bound < N * bin_size. On every doubling, adjust the counts (optimize this by exploiting sparsity for multiple doublings):
for (int j = 0; j < N / 2; j++) count[j] = count[2 * j] + count[2 * j + 1];
for (int j = N / 2; j < N; j++) count[j] = 0;
The case i < 0 is trickier, since we need to decrease lower_bound as well as increase bin_size (again, optimize for sparsity or adjust the counts in one step).
while (lower_bound > x) {
lower_bound -= N * bin_size;
bin_size += bin_size;
for (int j = N - 1; j > N / 2 - 1; j--) count[j] = count[2 * j - N] + count[2 * j - N + 1];
for (int j = 0; j < N / 2; j++) count[j] = 0;
}
The exceptional cases are expensive, but they happen only a logarithmic number of times in the ratio of your data's range to the initial bin size.
If you implement this in floating-point, be mindful that floating-point numbers are not real numbers and that statements like lower_bound -= N * bin_size may misbehave (in this case, if N * bin_size is much smaller than lower_bound). I recommend that bin_size be a power of the radix (usually two) at all times.
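Pulling the fragments above together, here is a minimal untested sketch of the growth logic as a class (names are mine; it implements only the uniform-bin doubling scheme described here, not the merging histogram from the paper):

#include <algorithm>
#include <cmath>
#include <vector>

// Fixed number of uniform bins whose span doubles whenever data falls outside it.
class StreamingHistogram {
public:
    StreamingHistogram(int n_bins, double lower_bound, double bin_size)
        : count_(n_bins, 0), lower_bound_(lower_bound), bin_size_(bin_size) {}

    void add(double x) {
        const int n = static_cast<int>(count_.size());
        // grow downward: halve the resolution and shift lower_bound until x fits
        while (x < lower_bound_) {
            lower_bound_ -= n * bin_size_;
            bin_size_ += bin_size_;
            for (int j = n - 1; j >= n / 2; --j)
                count_[j] = count_[2 * j - n] + count_[2 * j - n + 1];
            for (int j = 0; j < n / 2; ++j) count_[j] = 0;
        }
        // grow upward: halve the resolution until x fits below the top edge
        while (x - lower_bound_ >= n * bin_size_) {
            bin_size_ += bin_size_;
            for (int j = 0; j < n / 2; ++j)
                count_[j] = count_[2 * j] + count_[2 * j + 1];
            for (int j = n / 2; j < n; ++j) count_[j] = 0;
        }
        // clamp guards against the floating-point edge case at the very top boundary
        const int i = std::min(n - 1, static_cast<int>(std::floor((x - lower_bound_) / bin_size_)));
        ++count_[i];
    }

    const std::vector<int>& counts() const { return count_; }

private:
    std::vector<int> count_;
    double lower_bound_;
    double bin_size_;
};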