How to find edit distance between two s-expressions? - compression

Would such an edit distance be a good measure of expression similarity? What if we wanted a semantic difference of two s-expressions? Can such an edit distance be used for s-expression compression?

Related

Principal component analysis on proportional data

Is it valid to run a PCA on data that is comprised of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformations you can apply to proportions so that they can be analyzed with multivariate techniques such as PCA. You can also find "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
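To make that concrete, here is a minimal sketch of one common transformation from that literature, the centered log-ratio (clr), followed by an ordinary PCA. The Dirichlet-generated data and the pseudocount used to handle zeros are illustrative assumptions, not part of the original question.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical compositional data: each row is a diet, columns are proportions summing to 1
X = np.random.dirichlet(alpha=np.ones(5), size=200)

# Centered log-ratio (clr) transform: take logs and subtract the per-row mean of the logs.
# The small pseudocount is a crude guard against zero proportions.
eps = 1e-6
logX = np.log(X + eps)
clr = logX - logX.mean(axis=1, keepdims=True)

# Ordinary PCA is then applied to the transformed data
pca = PCA(n_components=2)
scores = pca.fit_transform(clr)
print(pca.explained_variance_ratio_)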
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret this component?
In technical terms, the support of your data is a subset of the support of PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\mathbb{R}^k$
the support for your proportion vectors is the $k$-dimensional simplex. By simplex I mean the set of vectors $p$ of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, ..., k$
$\sum_{i=1}^k{p_i} = 1$
One way around this would be a one-to-one mapping between the $k$-simplex and all of $\mathbb{R}^k$. If such a mapping exists, you could map your proportions to $\mathbb{R}^k$, do PCA there, then map the PCA vectors back to the simplex.
But the simplex is not a self-contained linear space: if you add two elements of the simplex, you don't get an element of the simplex :/
A better approach, I think, is clustering, e.g. with Gaussian mixtures or spectral clustering. This is related to PCA. But a nice property of clustering is that you can express any element of your data as a "convex combination" of the clusters. If you analyze your proportion data and find clusters, they (unlike PCA vectors) will lie within the simplex, and any mixture of them will, too.
I also recommend looking into nonnegative matrix factorization (NMF). This is like PCA but, as the name suggests, it constrains both the components and their weights to be nonnegative. It's very useful for inferring structure in strictly positive data, like proportions. But NMF does not give you a basis for the simplex.
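As a rough illustration of the NMF suggestion, here is a sketch using scikit-learn; the data and the choice of 3 components are assumptions made up for the example.

import numpy as np
from sklearn.decomposition import NMF

# Hypothetical proportion data: rows are observations, columns sum to 1
X = np.random.dirichlet(alpha=np.ones(5), size=200)

# Factor X into nonnegative weights W and nonnegative "parts" H (X is approximately W @ H)
model = NMF(n_components=3, init="nndsvda", max_iter=500)
W = model.fit_transform(X)   # how much each observation uses each part
H = model.components_        # the parts themselves, all entries nonnegative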

Which algorithm or idea to find the convex envelope of a set of curves?

Let's define a curve as a set of 2D points which can be computed to arbitrary precision. For instance, this is a curve:
A set of N intersecting curves is given (N can be arbitrarily large), like in the following image:
How do I find the perimeter of the connected area (a bounding box is given if necessary) which is delimited by the set of curves, i.e., given the example above, the red curve? Note that the perimeter can be concave and it has no obvious parametrization.
A starting point of the red curve can be given
I am interested in efficient ideas to build up a generic algorithm, however...
I am coding in C++ and I can use any open-source library to help with this
I do not know if this problem has a name or if there is a ready-made solution, in case please let me know and I will edit the title and the tags.
Additional notes:
The solution is unique: in the region of interest there is only a single connected area which is free from any curve, but of course I can only compute a finite number of curves.
The curves are originally parametrized (and then affine transformations are applied), so I can add as many points as I want. I can compute distances, lengths and go along with them. Intersections are also feasible. Basically any geometric operation that can be built up from point coordinates is acceptable.
I have found that a similar problem is encountered when "cutting" gears, e.g. https://scialert.net/fulltext/?doi=jas.2014.362.367, but still I do not see how to solve it in a decently efficient way.
If the curves are given in order, you can find the pairwise intersections between successive curves. Depending on their nature, an analytical or numerical solution will do.
Then a first approximation of the envelope is the polyline through these points.
Another approximation can be obtained by drawing the common tangent to successive curves, and by intersecting those tangents pairwise. The common tangent problem is more difficult, anyway.
If the equations of the curves are known in terms of a single parameter, you can find the envelope curve by eliminating the parameter between the implicit equation of the curve and this equation differentiated with respect to the parameter. When the elimination cannot be done in closed form, you can solve the resulting pair of equations numerically.
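Here is a small sketch of that parameter-elimination idea with SymPy; the family of unit circles centred on a parabola is a made-up example, not the asker's curves.

import sympy as sp

x, y, t = sp.symbols('x y t', real=True)

# Hypothetical one-parameter family F(x, y, t) = 0: unit circles centred at (t, t**2)
F = (x - t)**2 + (y - t**2)**2 - 1

# The envelope satisfies F = 0 together with dF/dt = 0;
# solving the pair eliminates the parameter
solutions = sp.solve([F, sp.diff(F, t)], [x, y], dict=True)
for s in solutions:
    print(sp.simplify(s[x]), sp.simplify(s[y]))   # x(t), y(t) branches of the envelope

# When a symbolic solution is out of reach, the same two equations can be
# solved numerically for (x, y) over a sweep of parameter values t.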
When I run into such problems (the maths is not enough, or is terribly tricky) I decompose each curve into segments.
Then I search for segment-segment intersections, for example a segment in curve Ci against all of the segments in curve Cj. You can even replace a segment with its bounding box and do box-box intersection for a quick discard, focusing on those boxes that do intersect; see the sketch below.
This gives a rough approximation of the curve-curve intersections.
Apart from intersections you can search for max/min coordinates, also approximated with segments or boxes.
Once you get a decent approximation, you can refine it by reducing the length/size of the segments and boxes and repeating the intersection (or max/min) checks.
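A minimal sketch of that segment/bounding-box strategy in Python, assuming the two curves are supplied as lists of sampled 2D points (the sampling and function names are mine):

def bbox(seg):
    (x1, y1), (x2, y2) = seg
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)

def boxes_overlap(a, b):
    ax0, ay0, ax1, ay1 = bbox(a)
    bx0, by0, bx1, by1 = bbox(b)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def segment_intersection(a, b, eps=1e-12):
    # Solve p + s*r = q + u*w for the two segments a = (p, p2), b = (q, q2)
    (p, p2), (q, q2) = a, b
    rx, ry = p2[0] - p[0], p2[1] - p[1]
    wx, wy = q2[0] - q[0], q2[1] - q[1]
    denom = rx * wy - ry * wx
    if abs(denom) < eps:                      # parallel or degenerate segments
        return None
    dx, dy = q[0] - p[0], q[1] - p[1]
    s = (dx * wy - dy * wx) / denom
    u = (dx * ry - dy * rx) / denom
    if 0.0 <= s <= 1.0 and 0.0 <= u <= 1.0:
        return (p[0] + s * rx, p[1] + s * ry)
    return None

def curve_intersections(ci, cj):
    # ci, cj: lists of 2D points sampled densely along two curves
    points = []
    for a in zip(ci, ci[1:]):
        for b in zip(cj, cj[1:]):
            if boxes_overlap(a, b):           # cheap box-box test for quick discard
                p = segment_intersection(a, b)
                if p is not None:
                    points.append(p)
    return points

Refining then just means resampling the curves more densely near the reported intersections and repeating.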
You can get an approximate solution using a grid. First, find a bounding box for the curves, then build a grid inside the bounding box and search over the cells to find the specified area. Finally, approximate the value of the perimeter using the number of cells along it (since the size of the cells is known).
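A sketch of that grid idea, assuming the curves are supplied as densely sampled point arrays and a seed point inside the free region is known (both assumptions of mine):

import numpy as np
from scipy import ndimage

def approximate_perimeter(curves, seed, bounds, n=512):
    # curves: list of (M, 2) arrays sampled densely enough to mark every cell they cross
    # seed:   a point known to lie inside the curve-free connected region
    # bounds: (xmin, ymin, xmax, ymax) bounding box
    xmin, ymin, xmax, ymax = bounds
    hx, hy = (xmax - xmin) / n, (ymax - ymin) / n
    def cell(p):
        return (min(n - 1, max(0, int((p[0] - xmin) / hx))),
                min(n - 1, max(0, int((p[1] - ymin) / hy))))
    blocked = np.zeros((n, n), dtype=bool)
    for c in curves:
        for p in c:
            blocked[cell(p)] = True
    # Label the connected free regions and keep the one containing the seed
    labels, _ = ndimage.label(~blocked)
    region = labels == labels[cell(seed)]
    # Free cells of that region with a blocked neighbour approximate the boundary
    touching = ndimage.binary_dilation(blocked) & region
    return int(touching.sum()) * 0.5 * (hx + hy)   # crude perimeter estimate

Halving the cell size and repeating gives a rough convergence check on the estimate.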

How does Lowe's ratio test work?

Suppose I have a set of N images and I have already computed the SIFT descriptors of each image. I now would like to compute the matches between the different features. I have heard that a common approach is Lowe's ratio test but I cannot understand how it works. Can someone explain it to me?
Short version: each keypoint of the first image is matched with a number of keypoints from the second image. We keep the 2 best matches for each keypoint (best matches = the ones with the smallest distance measurement). Lowe's test checks that the two distances are sufficiently different. If they are not, then the keypoint is eliminated and will not be used for further calculations.
Long version:
David Lowe proposed a simple method for filtering keypoint matches by eliminating matches when the second-best match is almost as good. Do note that, although popularized in the context of computer vision, this method is not tied to CV. Here I describe the method, and how it is implemented/applied in the context of computer vision.
Let's suppose that L1 is the set of keypoints of image 1, each keypoint having a description that lists information about the keypoint, the nature of that info really depending on the descriptor algorithm that was used. And L2 is the set of keypoints for image 2. A typical matching algorithm will work by finding, for each keypoint in L1, the closest match in L2. If using Euclidean distance, as in Lowe's paper, this means the keypoint from set L2 that has the smallest Euclidean distance from the keypoint in L1.
Here we could be tempted to just set a threshold and eliminate all the pairings where the distance is above that threshold. But it's not that simple because not all variables inside the descriptors are as "discriminant": two keypoints could have a small distance measurement because most of the variables inside their descriptors have similar values, but then those variables could be irrelevant to the actual matching. One could always add weighting to the variables of the descriptors so that the more discriminating traits "count" more. Lowe proposes a much simpler solution, described below.
First, we match each keypoint in L1 with the two closest keypoints in L2. Working from the assumption that a keypoint in image 1 can't have more than one equivalent in image 2, we deduce that those two matches can't both be right: at least one of them is wrong. Following Lowe's reasoning, the match with the smallest distance is the "good" match, and the match with the second-smallest distance is the equivalent of random noise, a base rate of sorts. If the "good" match can't be distinguished from noise, then it should be rejected because it does not bring anything interesting, information-wise. So the general principle is that there needs to be enough difference between the best and second-best matches.
How the concept of "enough difference" is operationalized is important: Lowe uses a ratio of the two distances, often expressed as such:
if distance1 < distance2 * a_constant then ....
Where distance1 is the distance between the keypoint and its best match, and distance2 is the distance between the keypoint and its second-best match. The use of a "smaller than" sign can be somewhat confusing, but it becomes obvious once you take into consideration that a smaller distance means the points are closer. In the OpenCV world, the knnMatch function will return the matches from best to worst, so the 1st match will have the smaller distance. The question is really "how much smaller?" To figure that out we multiply distance2 by a constant that has to be between 0 and 1, thus decreasing the value of distance2. Then we look at distance1 again: is it still smaller than distance2? If it is, then it passed the test and will be added to the list of good points. If not, it must be eliminated.
So that explains the "smaller than" part, but what about the multiplication? Since we are looking at the difference between the distances, why not just use the actual mathematical difference between distance1 and distance2? Although technically we could, the resulting difference would be in absolute terms: it would be too dependent on the variables inside the descriptors, the type of distance measurement that we use, etc. What if the code for extracting the descriptors changes, affecting all distance measurements? In short, doing distance1 - distance2 would be less robust, would require frequent tweaking and would make methodological comparisons more complicated. It's all about the ratio.
Take-away message: Lowe's solution is interesting not only because of its simplicity, but because it is in many ways algorithm-agnostic.
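For reference, the whole pipeline fits in a few lines of OpenCV Python; the file names and the 0.75 constant here are only illustrative choices.

import cv2

# Hypothetical input images of the same scene
img1 = cv2.imread("scene_a.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor of image 1, fetch its 2 nearest neighbours in image 2
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep the best match only if it is clearly better than the runner-up
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]
print(len(good), "matches kept out of", len(pairs))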
Lowe's Ratio Test
Algorithm:
First, we compute the distance between feature fi in image one and all the features fj in image two.
We choose the feature fc in image two with the minimum distance to feature fi in image one as our closest match.
We then proceed to get fs, the feature in image two with the second-closest distance to feature fi.
Then we find how much nearer our closest match fc is than our second-closest match fs through the distance ratio.
Finally we keep the matches with distance ratio < distance ratio threshold.
The distance ratio d(fi, fc)/d(fi, fs) is the distance computed between feature fi in image one and fc, its closest match in image two, divided by the distance computed between feature fi and fs, its second-closest match in image two.
We usually set the distance ratio threshold (ρ) to around 0.5, which means that we require our best match to be at least twice as close as our second-best match to our initial feature's descriptor, thus discarding ambiguous matches and retaining the good ones.
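The steps above translate almost line for line into NumPy; this sketch assumes the descriptors are already available as two arrays and uses the ~0.5 threshold mentioned above.

import numpy as np

def ratio_test_matches(desc1, desc2, rho=0.5):
    # desc1: (N1, D) descriptors of image one, desc2: (N2, D) descriptors of image two
    matches = []
    for i, f in enumerate(desc1):
        d = np.linalg.norm(desc2 - f, axis=1)      # distances to every feature in image two
        c, s = np.argsort(d)[:2]                   # closest (fc) and second closest (fs)
        if d[c] / d[s] < rho:                      # distance ratio test
            matches.append((i, int(c)))
    return matches

Values of rho between roughly 0.5 and 0.8 are common in practice; a larger value keeps more, and more ambiguous, matches.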
To better understand the ratio test, you should read Lowe's article; only by reading it will you find the full answer.
The short answer is that the threshold is low, a value Lowe arrived at empirically during his experiments: when choosing between two matches with similar distances, keep the closest one only if its distance is less than about 0.7 times the distance of the other one.
Check the link below:
https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf

Measuring distance along ellipse

Suppose we have an ellipse x^2/a^2 + y^2/b^2 = 1.
Taking a point (a*cos(t), b*sin(t)) on the ellipse, what is the fastest way to find another point on the ellipse such that the distance along the ellipse between them is a given d? [d is less than pi*a*b]
The problem was encountered when I have a corner [quarter ellipse] and need to find points along it separated by some 'd'.
The length of a subsection of an ellipse is an elliptic integral, with no closed form solution.
In order to compute the distance along the ellipse, you will need a numerical integration routine. I recommend Romberg or Gauss quadrature (look them up on Wikipedia). If you are doing this repeatedly, then precompute the distance across a bunch of points around the ellipse so that you can rapidly get to the right region, then start integrating.
You will need to bisect (look up on Wikipedia) to find the desired length.
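A minimal sketch of that quadrature-plus-bisection recipe using SciPy; the function names are mine, and d is assumed to be shorter than the full circumference.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def arc_length(a, b, t0, t1):
    # Arc length of (a*cos(t), b*sin(t)) for t in [t0, t1]
    speed = lambda t: np.hypot(a * np.sin(t), b * np.cos(t))
    return quad(speed, t0, t1)[0]

def point_at_arc_distance(a, b, t0, d):
    # Find t1 > t0 so that the arc from t0 to t1 has length d (root-finding via brentq),
    # assuming 0 < d does not exceed one full circumference
    t1 = brentq(lambda t: arc_length(a, b, t0, t) - d, t0, t0 + 2 * np.pi)
    return a * np.cos(t1), b * np.sin(t1)

# Example: stepping every d = 0.3 along a quarter ellipse with a = 2, b = 1 just means
# calling point_at_arc_distance repeatedly; precomputing arc lengths at a grid of t values
# would speed up repeated queries, as suggested above.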
There is no analytical solution for the length of an elliptical arc. This means you won't be able to plug numbers into an equation to find the result, but will instead have to use a method of numerical integration.
Simpson's rule is very easy to implement, although most likely slower than the methods mentioned in other answers.
Now that you have a way to find the length of an elliptical arc, just measure to different end points until you find one at arc length d, to some acceptable tolerance.

Finding the spread of each cluster from Kmeans

I'm trying to detect how well an input vector fits a given cluster centre. I can find the best match quite easily (the centre with the minimum Euclidean distance to the input vector is the best), however, I now need to work out how good a match that is.
To do this I need to find the spread (standard deviation?) of the vectors which build up the centroid, then see if the distance from my input vector to the centre is less than the spread. If it's more than the spread then I should be able to say that I have no clusters to fit it (given that the best doesn't fit the input vector well).
I'm not sure how to find the spread per cluster. I have all the centre vectors, and all the training vectors are labelled with their closest cluster, I just can't quite fathom exactly what I need to do to get the spread.
I hope that's clear? If not I'll try to reword it!
TIA
Ian
Use the distance function to calculate the distance from the cluster centre to each point labelled with that cluster, then take the root mean square of those distances. That gives you a standard-deviation-like spread for each cluster (the plain mean of the distances works too, as a simpler spread measure).
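A small NumPy sketch of that per-cluster spread computation; the array names are assumptions, and the RMS variant is the one closest to a standard deviation.

import numpy as np

def cluster_spreads(X, labels, centers):
    # X: (N, D) training vectors, labels: (N,) index of the closest centre for each vector,
    # centers: (K, D) cluster centres from k-means
    spreads = np.empty(len(centers))
    for k, c in enumerate(centers):
        d = np.linalg.norm(X[labels == k] - c, axis=1)
        spreads[k] = np.sqrt(np.mean(d ** 2))      # RMS distance to the centre
    return spreads

# A new vector x "fits" cluster k reasonably well if
# np.linalg.norm(x - centers[k]) is within a couple of spreads[k].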
If you switch to using a different algorithm, such as Mixture of Gaussians, you get the spread (e.g., std. deviation) as part of the model (clustering result).
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/mixture.html
http://en.wikipedia.org/wiki/Mixture_model
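For completeness, a sketch of the mixture-of-Gaussians route with scikit-learn; the random data and the choice of 5 components are placeholders for your own training vectors.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 8)                         # hypothetical training vectors

gm = GaussianMixture(n_components=5, covariance_type="diag", random_state=0).fit(X)
# gm.means_ are the cluster centres and gm.covariances_ their per-dimension spread;
# for a new vector x, gm.predict_proba(x.reshape(1, -1)) says how well each component fits it.
x = np.random.rand(8)
print(gm.predict_proba(x.reshape(1, -1)))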