What is the best comparison algorithm for comparing the similarity of arrays? - c++

I am building a program that takes 2 arrays and returns some value showing to what degree they are similar. For example, images with few differences will have a good score, whereas images that are vastly different will have a worse score.
So far the only two algorithms I have come across for this problem are the sum of the squared differences and the normalized correlation.
Both of these will be fairly simple to implement; however, I was wondering if there is another algorithm I haven't been able to find that I could use?
Furthermore, which of the previously mentioned methods would be best? It would be great to know how they compare in terms of both accuracy and efficiency.
Thanks,

How you compare images usually depends on the application you are dealing with. Normally the distance function used depends on the image descriptor.
Take a look at these distance functions:
Euclidean Distance
Squared Euclidean Distance
Cosine Distance or Similarity [THIS SHOULD WORK FINE]
Sum of Absolute Differences
Sum of Squared Differences
Correlation Distance
Hellinger Distance
Grid Distance
Manhattan Distance
Chebyshev Distance
Statistical distance functions
Wasserstein Metric
Mahalanobis Distance
Bray-Curtis Distance
Canberra Distance
Binary distance functions
L0 Norm
Jaccard Similarity
Hamming Distance
As you are directly comparing images, using cosine similarity should work for you.
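For reference, a minimal sketch of cosine similarity over two equally sized arrays (e.g. flattened grayscale images); the function name and the zero-vector handling are my own choices:

```cpp
#include <cmath>
#include <cstddef>

// Cosine similarity between two equally sized arrays (e.g. flattened images).
// Returns a value in [-1, 1]; identical (up to scale) arrays give 1.
double cosineSimilarity(const double* a, const double* b, std::size_t n)
{
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0)
        return 0.0;                               // degenerate: an all-zero array
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```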

Comparing images well is quite non-trivial.
Just for one example, to get meaningful results you'll need to account for misalignment between the images being compared. If you just compare (for example) the top-left pixel in one image with the top-left pixel in the other, a fairly minor misalignment between the two can make those pixels entirely different, even though a person looking at the images would have difficulty seeing any difference at all.
One way to deal with this would be to start with something similar to the motion compensation used by MPEG-4. That is, break each image into small blocks (e.g., MPEG uses 16x16 pixel blocks) and compare blocks in one image to blocks in the other. This can eliminate (or at least drastically reduce) the effects of misalignment between the images.
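A rough sketch of that block-wise idea using sum of squared differences over a small search window; the block size, search radius and row-major grayscale layout are assumptions for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Sum of squared differences between a block of image A at (ax, ay)
// and a block of image B at (bx, by). Images are row-major grayscale.
long long blockSSD(const std::vector<std::uint8_t>& a, const std::vector<std::uint8_t>& b,
                   int width, int ax, int ay, int bx, int by, int blockSize)
{
    long long ssd = 0;
    for (int dy = 0; dy < blockSize; ++dy)
        for (int dx = 0; dx < blockSize; ++dx) {
            int diff = int(a[(ay + dy) * width + ax + dx])
                     - int(b[(by + dy) * width + bx + dx]);
            ssd += static_cast<long long>(diff) * diff;
        }
    return ssd;
}

// For a block of A at (ax, ay), find the best match in B within +/- searchRadius.
// The smallest SSD found is the misalignment-tolerant difference for that block.
long long bestBlockMatch(const std::vector<std::uint8_t>& a, const std::vector<std::uint8_t>& b,
                         int width, int height, int ax, int ay,
                         int blockSize, int searchRadius)
{
    long long best = std::numeric_limits<long long>::max();
    for (int oy = -searchRadius; oy <= searchRadius; ++oy)
        for (int ox = -searchRadius; ox <= searchRadius; ++ox) {
            int bx = ax + ox, by = ay + oy;
            if (bx < 0 || by < 0 || bx + blockSize > width || by + blockSize > height)
                continue;                         // candidate block would leave image B
            best = std::min(best, blockSSD(a, b, width, ax, ay, bx, by, blockSize));
        }
    return best;
}
```

Summing the best per-block scores over the whole image gives an overall similarity measure that tolerates small shifts.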

Related

Fast method to find distance from point to closest edge of polygon

Setup
Function will need to provide the distance from a point to the closest edge of a polygon
Point is known to be inside of the polygon
Polygon can be convex or concave
Many points (millions) will need to be tested
Many separate polygons (dozens) will need to be run through the function per point
Precalculated and persistently stored data structures are an option.
The final search function will be in C++
For the function implementation, I know a simple method would be to test the distance to all segments of the polygon using standard distance to line segment formulas. This option would be fairly slow at scale and I am confident there should be a better option.
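For reference, the per-segment check I mean is the standard point-to-segment projection; a minimal sketch (the struct and names are just for illustration):

```cpp
struct Point { double x, y; };

// Squared distance from point p to segment (a, b): project p onto the segment,
// clamp the projection parameter to [0, 1], then measure to that clamped point.
double distToSegmentSq(Point p, Point a, Point b)
{
    double vx = b.x - a.x, vy = b.y - a.y;
    double wx = p.x - a.x, wy = p.y - a.y;
    double len2 = vx * vx + vy * vy;
    double t = (len2 > 0.0) ? (wx * vx + wy * vy) / len2 : 0.0;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;
    double dx = p.x - (a.x + t * vx);
    double dy = p.y - (a.y + t * vy);
    return dx * dx + dy * dy;   // take one sqrt after the minimum over all segments
}
```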
My gut instinct is that there should be some very fast known algorithms for this type of function that would have been implemented in a game engine, but I'm not sure where to look.
I've found a reference for storing line segments in a quadtree, which would allow very rapid searching. I think I could use it to quickly narrow down which segment is the closest, and then I would only need to calculate the distance to one line segment.
https://people.cs.vt.edu/~shaffer/Papers/SametCVPR85.pdf
I've not been able to locate any code examples for how this would work. I don't mind implementing algorithms from scratch, but don't see the point in doing so if a working, tested code base exists.
I've been looking at a couple of quadtree implementations, and I think the way it would work is to create a quadtree per polygon and insert each polygon's line segments, with a bounding box, into that polygon's quadtree.
The "query" portion of the function would then consist of treating the point as a very small bounding box, searching it against the quadtree, and thereby finding only the very closest portions of the polygon.
http://www.codeproject.com/Articles/30535/A-Simple-QuadTree-Implementation-in-C
and
https://github.com/Esri/geometry-api-java/blob/master/src/main/java/com/esri/core/geometry/QuadTree.java
My real question would be, does this seem like a sound approach for a fast search time function?
Is there something that would work faster?
EDIT:
I've been looking around and found some issues with using a quadtree. The way quadtrees work is good for collision detection, but isn't set up to allow for efficient nearest neighbor searching.
https://gamedev.stackexchange.com/questions/14373/in-2d-how-do-i-efficiently-find-the-nearest-object-to-a-point
R-Trees look to be a better option.
https://en.wikipedia.org/wiki/R-tree
and
efficient way to handle 2d line segments
Based on those posts, R-trees look like the winner. It's also handy to see that C++ Boost already has them implemented. This looks close enough to what I was planning on doing that I'll go ahead and implement it and verify the results.
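For anyone curious, a minimal sketch of what I have in mind with Boost.Geometry (the square polygon is dummy data; storing segments as R-tree values requires a reasonably recent Boost version):

```cpp
#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <iostream>
#include <iterator>
#include <vector>

namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

using point_t   = bg::model::point<double, 2, bg::cs::cartesian>;
using segment_t = bg::model::segment<point_t>;
using rtree_t   = bgi::rtree<segment_t, bgi::quadratic<16>>;

int main()
{
    // One R-tree per polygon, holding that polygon's edges as segments.
    std::vector<segment_t> edges;
    edges.emplace_back(point_t(0, 0),   point_t(10, 0));
    edges.emplace_back(point_t(10, 0),  point_t(10, 10));
    edges.emplace_back(point_t(10, 10), point_t(0, 10));
    edges.emplace_back(point_t(0, 10),  point_t(0, 0));
    rtree_t tree(edges.begin(), edges.end());   // bulk-load (packing) constructor

    point_t query(3.0, 4.0);                    // a point known to be inside
    std::vector<segment_t> nearest;
    tree.query(bgi::nearest(query, 1), std::back_inserter(nearest));

    std::cout << "distance to closest edge: "
              << bg::distance(query, nearest.front()) << "\n";
}
```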
EDIT:
Since I have implemented a PMR quadtree, I now see that the nearest neighbour search is a bit more complex than I described.
If the quad search result for the search point is empty, then it gets more complex.
I remember a description somewhere in Hanan Samet's book on multidimensional search structures.
When giving the answer below, I had in mind searching for all objects within a specified distance. This is easy for the PMR quadtree, but just finding the closest is more complex.
Edit End
I would not use a R-Tree.
The weak point (and the strong point!) of R-trees is the separation of the space into rectangles.
There are three algorithms known to do that separation, but none is well suited to all situations.
R-trees are really complex to implement.
Why then do it? Just because R-trees can be twice as fast as a quadtree when perfectly implemented. The speed difference between a quadtree and an R-tree is not relevant. The monetary difference is. (If you have working code for both, I would use the PMR quadtree; if you have only code for the R-tree, then use that; if you have none, use the PMR quadtree.)
Quad trees (PMR) always work, and they are simple to implement.
Using the PMR quadtree, you just find all segments related to the search point. The result will be a few segments; then you just check them and you are done.
People who say quad trees are not suited for neighbour search do not know that there are hundreds of different quad trees. The unsuitability is only true for a point quadtree, not for the PMR one, which stores bounding boxes.
I once worked through the complex description of finding the neighbouring points in a point quadtree. For the PMR quadtree there was nothing to do (for a search within a specified rectangular interval): no code change, just iterate the result and find the closest.
I think there are even better solutions than a quadtree or R-tree for your specific question, but the point is that the PMR always works. Just implement it once and use it for all spatial searches.
Since there are so many more points to test than polygons, you could consider doing some fairly extensive pre-processing of the polygons in order to reduce the average number of tests needed to find the nearest line segment per point.
Consider an approach like this (assumes polygons have no holes):
Walk the edges of the polygon and define line segments along each equidistant line.
Test which side of a line segment a point is on to restrict the potential set of closest line segments (a sketch of this side-of-line test follows below).
Build an arithmetic coding tree with each test weighted by the amount of space that is culled by the half-space of the line segment. This should give good average performance in determining the closest segment for a point and open up the possibility of parallel testing over multiple points at once.
This diagram should illustrate the concept. The blue lines define the polygon and the red lines are the equidistant lines.
Notice that needing to support concave polygons greatly increases the complexity, as illustrated by the 6-7-8 region. Concave regions mean that the line segments that extend to infinity may be defined by vertices that are arbitrarily far apart.
You could decompose this problem by fitting a convex hull to the polygon and then doing a fast, convex test for most points and only doing additional work on points that are within the "region of influence" of the concave region, but I am not sure if there is a fast way to calculate that test.
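For the side-of-line test in the second step above, the sign of a 2D cross product is enough; a small sketch (names are mine):

```cpp
struct Point { double x, y; };

// Which side of the directed line a -> b does point p lie on?
// Positive: left of the line, negative: right, zero: exactly on it.
// This is the 2D cross product of (b - a) and (p - a).
double sideOfLine(Point a, Point b, Point p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}
```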
I am not sure how good the quadtree algorithm you proposed would be, so I will let someone else comment on that, but I had a thought on something that might be fast and robust.
My thought is that you could represent a polygon by a KD-tree (assuming the vertices are static in time) and then find the nearest two vertices to whatever query point lies in the polygon by doing a nearest-2-neighbor search. These two vertices should be the ones that define the nearest line segment, regardless of convexity, if my thinking is correct.

Better understanding of cosine similarity

I am doing a little research on text mining and data mining. I need more help in understanding cosine similarity. I have read about it and noticed that all of the given examples on the internet are using tf-idf before computing cosine similarity.
My questions
1. Is it possible to calculate cosine similarity just by using the highest-frequency terms from a text file, which will be the dataset? Most of the videos and tutorials that I go through have tf-idf run before inputting the data into cosine similarity. If not, what other types of equations/algorithms can be fed into cosine similarity?
2. Why is normalization used with tf-idf to compute cosine similarity? (Can I do it without normalization?) Cosine similarity is computed from the normalized tf-idf output. Why is normalization needed?
3. What does cosine similarity actually do to the tf-idf weights?
I do not understand question 1.
TF-IDF weighting is a weighting scheme that has worked well for lots of people on real data (think Lucene search). But its theoretical foundations are a bit weak. In particular, everybody seems to be using a slightly different version of it... and yes, it is weights + cosine similarity. In practice, you may want to try e.g. Okapi BM25 weighting instead, though.
I do not understand this question either. Angular similarity is beneficial because the length of the text has less influence than with other distances. Furthermore, sparsity can be nicely exploited. As for the weights, IDF is a heuristic with only loose statistical arguments: frequent words are more likely to occur at random, and thus should have less weight.
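To make questions 2 and 3 a bit more concrete: a tiny sketch of the usual weighting step, where raw term frequencies are scaled by IDF (the exact IDF variant and log base differ between implementations), after which cosine similarity implicitly length-normalizes the resulting vectors:

```cpp
#include <cmath>
#include <map>
#include <string>

// Turn raw term frequencies into tf-idf weights.
//   tf: term -> count in this document
//   df: term -> number of documents in the collection containing the term
//   N:  total number of documents in the collection
std::map<std::string, double> tfidf(const std::map<std::string, int>& tf,
                                    const std::map<std::string, int>& df,
                                    int N)
{
    std::map<std::string, double> weights;
    for (const auto& kv : tf) {
        auto it = df.find(kv.first);
        int docFreq = (it != df.end() && it->second > 0) ? it->second : 1;
        weights[kv.first] = kv.second * std::log(double(N) / docFreq);  // tf * idf
    }
    return weights;
}
```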
Maybe you can try to rephrase your questions so I can fully understand them. Also search for related questions such as these: Cosine similarity and tf-idf and
Better text documents clustering than tf/idf and cosine similarity?

Confusion about methods of pose estimation

I'm trying to do pose estimation (actually [Edit: 3DOF] rotation is all I need) from a planar marker with 4 corners = 4 coplanar points.
Up until today I was under the impression from everything I read that you will always compute a homography (e.g. using DLT) and decompose that matrix using the various methods available (Faugeras, Zhang, the analytic method which is also described in this post here on stackexchange) and refine it using non-linear optimization, if necessary.
First minor question: if this is an analytical method (simply taking two columns from a matrix and creating an orthonormal matrix out of them, resulting in the desired rotation matrix), what is there to optimize? I've tried it in Matlab and the result jitters badly, so I can clearly see the result is not perfect or even sufficient, but I also don't understand why one would want to use the rather expensive and complex SVDs used by Faugeras and Zhang if this simple method already yields results.
Then there are iterative pose estimation methods like the Orthogonal Iteration (OI) Algorithm by Lu et al. or the Robust Pose Estimation Algorithm by Schweighofer and Pinz, where there's not even a mention of the word 'homography'. All they need is an initial pose estimate which is then optimized (the reference implementation in Matlab done by Schweighofer uses the OI algorithm, for example, which itself uses some method based on SVD).
My problem is: everything I read so far was '4 points? Homography, homography, homography. Decomposition? Well, tricky, in general not unique, several methods.' Now this iterative world opens up and I just cannot connect these two worlds in my head, I don't fully understand their relation. I cannot even articulate properly what my problem is, I just hope someone understands where I am.
I'd be very thankful for a hint or two.
Edit: Is it correct to say: 4 points on a plane and their image are related by a homography, i.e. 8 parameters. Finding the parameters of the marker's pose can be done by calculating and decomposing the homography matrix using Faugeras, Zhang or a direct solution, each with their drawbacks. It can also be done using iterative methods like OI or Schweighofer's algorithm, which at no point calculate the homography matrix, but just use the corresponding points and which require an initial estimation (for which the initial guess from a homography decomposition could be used).
With only four points your solution will normally be very sensitive to small errors in their locations, particularly when the rectangle is nearly orthogonal to the optical axis (this is because the vanishing points are not observable - they are outside the image and very far from the measurements - and the pose is given by the cross product of the vectors from the centre of the quadrangle to the vanishing points).
Is your pattern such that the corners can be confidently located with subpixel accuracy? I recommend using "checkerboard-type" patterns for the corners, which allow using a good and simple iterative refining algorithm to achieve subpixel accuracy (look up "iterative saddle points algorithm", or look up the docs in OpenCV).
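Assuming OpenCV is available, the subpixel refinement step might look like this (window size and termination criteria are placeholder values):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Refine roughly detected marker corners to subpixel accuracy.
// 'gray' is the 8-bit grayscale image, 'corners' holds the initial corner guesses.
void refineCorners(const cv::Mat& gray, std::vector<cv::Point2f>& corners)
{
    cv::cornerSubPix(gray, corners,
                     cv::Size(5, 5),      // half-size of the search window
                     cv::Size(-1, -1),    // no dead zone in the middle
                     cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT,
                                      30, 0.01));
}
```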
I will not provide you with a full answer, but it looks like at least one of the points that need to be clarified is this:
A homography is an invertible mapping from P^2 (homogeneous 3-vectors) to itself, which can always be represented by an invertible 3x3 matrix. Having said that, note that if your 3D points are coplanar, you will always be able to use a homography to relate the world points to the image points.
In general, a point in 3-space is represented in homogeneous coordinates as a 4-vector. A projective transformation acting on P^3 is represented by a non-singular 4x4 matrix (15 degrees of freedom: 16 elements minus one for overall scale).
So, the bottom line is that if your model is planar, you will be able to get away with a homography (8 DOF) and an appropriate algorithm, while in the general case you will need to estimate a 4x4 matrix and would need a different algorithm for that.
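For what it's worth, with OpenCV (3.x or later) the homography path sketched above might look roughly like this; cv::solvePnP would be the more direct route to a single pose, but this mirrors the estimate-then-decompose pipeline discussed in the question:

```cpp
#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <vector>

// objectPts: the 4 marker corners in the marker plane (z = 0), in marker units.
// imagePts:  the corresponding detected corners in the image.
// K:         the 3x3 camera intrinsic matrix.
// Returns up to four candidate rotations; disambiguate with visibility/normal checks.
std::vector<cv::Mat> rotationCandidates(const std::vector<cv::Point2f>& objectPts,
                                        const std::vector<cv::Point2f>& imagePts,
                                        const cv::Mat& K)
{
    cv::Mat H = cv::findHomography(objectPts, imagePts);   // DLT under the hood
    std::vector<cv::Mat> rotations, translations, normals;
    cv::decomposeHomographyMat(H, K, rotations, translations, normals);
    return rotations;
}
```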
Hope this helps,
Alex

Trajectory interpolation and derivative

I'm working on the analysis of a particle's trajectory in a 2D plane. This trajectory typically consists of 5 to 50 (in rare cases more) points (discrete integer coordinates). I have already matched the points of my dataset to form a trajectory (thus I have time resolution).
I'd like to perform some analysis on the curvature of this trajectory; unfortunately the analysis framework I'm using has no support for fitting a trajectory. From what I've heard one can use splines/Bézier curves to get this done, but I'd like your opinion and/or suggestions on what to use.
As this is only an optional part of my work I can not invest a vast amount of time for implementing a solution on my own or understanding a complex framework. The solution has to be as simple as possible.
Let me specify the features I need from a possible library:
- create trajectory from varying number of points
- as the points are discrete it should interpolate their position; no need for exact matches for all points as long as the resulting distance between trajectory and point is less than a threshold
- it is essential that the library can yield the derivative of the trajectory for any given point
- it would be beneficial if the library could report a quality level (like chiSquare for fits) of the interpolation
EDIT: After reading the comments I'd like to add some more:
It is not necessary that the trajectory exactly matches the points. The points are created from values of a pixel matrix and thus they form a discrete matrix of coordinates with a space resolution limited by the number of pixel per given distance. Therefore the points (which are placed at the center of the firing pixel) do not (exactly) match the actual trajectory of the particle. Either interpolation or fit is fine for me as long as the solution can cope with a trajectory which may/most probably will be neither bijective nor injective.
Thus most traditional fit approaches (like fitting with polynomials or exponential functions using a least squares fit) can't fulfil my criteria.
Additionally, all traditional fit approaches I have tried yield a function which seems to describe the trajectory quite well, but when looking at their first derivative (or at higher resolution) one can find numerous "micro-oscillations" which (from my interpretation) are a result of fitting non-straight functions to (nearly) straight parts of the trajectory.
Edit2: There has been some discussion in the comments about what those trajectories may look like. Essentially they may have any shape, length and "curliness", although I try to exclude trajectories which overlap or cross in the previous steps. I have included two examples below; ignore the colored boxes, they're just a representation of the values of the raw pixel matrix. The black, circular dots are the points which I'd like to match to a trajectory; as you can see they are always centered on the pixels and therefore may only have discrete (integer) values.
Thanks in advance for any help & contribution!
This MIGHT be the way to go
http://alglib.codeplex.com/
From your description I would say that a parametric spline interpolation may suit your requirements. I have not used the above library myself, but it does have support for spline interpolation. Using an interpolant means you will not have to worry about goodness of fit - the curve will pass through every point that you give it.
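I have not checked the exact ALGLIB calls myself, so as an illustration of what a parametric interpolant with a derivative looks like, here is a self-contained Catmull-Rom sketch (the names and the per-segment parameterisation are my own choices, not ALGLIB's API):

```cpp
#include <utility>

struct Vec2 { double x, y; };

// Catmull-Rom evaluation on the segment between p1 and p2, with t in [0, 1].
// Returns the interpolated position; if 'deriv' is non-null it also receives dC/dt.
Vec2 catmullRom(Vec2 p0, Vec2 p1, Vec2 p2, Vec2 p3, double t, Vec2* deriv = nullptr)
{
    auto blend = [&](double a0, double a1, double a2, double a3) {
        double c1 = 0.5 * (a2 - a0);
        double c2 = 0.5 * (2 * a0 - 5 * a1 + 4 * a2 - a3);
        double c3 = 0.5 * (-a0 + 3 * a1 - 3 * a2 + a3);
        return std::make_pair(a1 + t * (c1 + t * (c2 + t * c3)),   // position
                              c1 + t * (2 * c2 + 3 * t * c3));     // derivative
    };
    auto [x, dx] = blend(p0.x, p1.x, p2.x, p3.x);
    auto [y, dy] = blend(p0.y, p1.y, p2.y, p3.y);
    if (deriv) *deriv = Vec2{dx, dy};
    return Vec2{x, y};
}
```

The curve passes through p1 and p2, so chaining segments over consecutive trajectory points gives an interpolating curve whose tangent is available everywhere.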
If you don't mind using matrix libraries, linear least squares is the easiest solution (look at the end of the General Problem section for the equation to use). You can also use linear/polynomial regression to solve something like this.
Linear least squares will always give the best solution, but it's not scalable, because matrix multiplication is moderately expensive. Regression is an iterative heuristic method, so you can just run it until you have a "sufficiently good" answer. I've seen guidelines for the cutoff at about 1000-10000 dimensions in your data. So, with your data set, I'd recommend linear least squares, unless you decide to make it highly dimensional for some reason.
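As a concrete example of the closed-form (non-iterative) route, an ordinary least squares fit of a straight line y = a + b*x needs no matrix library at all; with more parameters you would set up the normal equations and use a matrix solver instead (a sketch, names are mine):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Ordinary least squares fit of y = a + b*x over the samples (x[i], y[i]).
// Returns {a, b}. Assumes x and y have the same size and x is not constant.
std::pair<double, double> fitLine(const std::vector<double>& x,
                                  const std::vector<double>& y)
{
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double denom = n * sxx - sx * sx;           // zero if all x are identical
    double b = (n * sxy - sx * sy) / denom;     // slope
    double a = (sy - b * sx) / n;               // intercept
    return {a, b};
}
```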

Searching for Points close to a Vector in 3D space

I'm terrible with math, but I have a situation where I need to find all points in a 3D space that are arbitrarily close to a vector being projected through that same space. The points can be stored in any fashion the algorithm calls for, not that I can think of any particularly beneficial ordering for them.
Are there any existing C++ algorithms for this feat? And if so (or not), what kind of mathematical concept does or would it entail, since I'd love to attempt to understand it and tie my brain into a pretzel.
( This algorithm would be operating on a space with perhaps 100,000 points in it, it would need to test around 1,000,000 vectors, and it would need to complete those tests within 1/30th of a second. I of course doubt whether any algorithm can perform this feat at all, but it'll be fun to see if that's true or not. )
You would probably want to store your points in some spatial data structure. The ones that come to mind are:
oct-trees
BSP trees
kd-trees
They have slightly different properties. An oct-tree divides the entire world up into 8 equally sized cubes, organized so that they themselves form a larger cube. Each of these cubes is then in turn split into 8 evenly sized cubes. You keep splitting the cubes until you have fewer than some number of points in a cube. With this tree structure, you can quite easily traverse the tree, extracting all points that may intersect a given cube. Once you have that list of points, you can test them one at a time. Since your test geometry is a sphere (distance from a point), you would circumscribe a cube around the sphere and get the points that may intersect it. As an optimization, you may also inscribe a cube in your sphere, and anything that certainly intersects that can simply be included in your hit set right away.
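The key test in that traversal is "does this cube overlap the query sphere"; the usual squared-distance check looks roughly like this (names are mine):

```cpp
struct Vec3 { double x, y, z; };
struct AABB { Vec3 min, max; };   // an octree node's cube

// True if the axis-aligned box intersects the sphere (center c, radius r):
// clamp the center onto the box and compare the squared distance with r^2.
bool sphereIntersectsBox(const AABB& box, const Vec3& c, double r)
{
    auto clamp = [](double v, double lo, double hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    };
    double dx = c.x - clamp(c.x, box.min.x, box.max.x);
    double dy = c.y - clamp(c.y, box.min.y, box.max.y);
    double dz = c.z - clamp(c.z, box.min.z, box.max.z);
    return dx * dx + dy * dy + dz * dz <= r * r;
}
```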
The BSP tree is a binary space partitioning tree. It's a tree of planes in 3-space, forming a binary tree. The main problem with using this for your problem is that you might have to do a lot of square roots while traversing it, to find the distance to the planes. The principle is the same though: once you have fewer than some number of points, you form a leaf with those points in it. All leaves in a BSP tree are convex regions (except for the leaves that are along the perimeter, which will be infinitely large). When building the BSP, you want to split the points in half at each step, to truly get O(log n) searches.
The kd-tree is a special case of BSP, where all planes are axis aligned. This typically speeds up tests against them quite significantly, but doesn't allow you to optimize the planes based on your set of points quite as well.
I don't know of any C++ libraries that implement these, but I'm sure there are plenty of them. These are fairly common techniques used in video games, so you might want to look at game engines.
It might help your understanding of octrees to think of a curve that fills the space, traversing every coordinate only once and without crossing itself. The curve maps the 3D complexity to a 1D complexity. There are several of these monster curves, like the z-curve, the Hilbert curve, and the Moore curve. The latter is a copy of 4 Hilbert curves and has very good space-filling quality. But isn't a search for the closest points solved with Dijkstra's algorithm?
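For reference, a z-curve (Morton) key just interleaves the bits of the coordinates, so nearby cells in 3D tend to get nearby keys; a small sketch for 10-bit coordinates:

```cpp
#include <cstdint>

// Spread the lower 10 bits of v so there are two zero bits between each bit.
std::uint32_t spreadBits10(std::uint32_t v)
{
    v &= 0x3FF;                          // keep 10 bits
    v = (v | (v << 16)) & 0x030000FF;
    v = (v | (v << 8))  & 0x0300F00F;
    v = (v | (v << 4))  & 0x030C30C3;
    v = (v | (v << 2))  & 0x09249249;
    return v;
}

// 30-bit Morton (z-curve) key for 10-bit x, y, z cell coordinates.
std::uint32_t mortonKey3D(std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    return spreadBits10(x) | (spreadBits10(y) << 1) | (spreadBits10(z) << 2);
}
```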