Query re. how to set up an SVM, which SVM variation … and how to define a metric - c++

I’d like to learn how best to set up an SVM in OpenCV (or another C++ library) for my particular problem (or whether there is a more appropriate algorithm).
My goal is to receive a weighting of how well an input set of labeled points on a 2D plane compares, or fits, against each of several ‘ideal’ sets of labeled 2D points.
I hope my illustrations make this clear. The first three boxes, labeled A through C, indicate different ideal placements of 3 points; in my illustrations the labelling is managed by colour:
The second graphic gives examples of possible inputs:
If I then pass for instance example input set 1 to the algorithm it will compare that input set with each ideal set, illustrated here:
I would suggest that most observers would agree that the example input 1 is most similar to ideal set A, then B, then C.
My problem is to get not only this ordering out of an algorithm, but ideally also a weighting of how much the input resembles A relative to B and C.
For the example given it might be something like:
A:60%, B:30%, C:10%
Example input 3 might yield something such as:
A:33%, B:32%, C:35% (i.e. different order, and a less 'determined' result)
My end goal is to interpolate between the ideal settings using these weights.
To get the ordering, I’m guessing the ‘cost’ of fitting the input to each set has probably been computed and compared anyway (?) … if so, could this cost be used to find the weighting? Or is the relationship non-linear, so that some kind of transformation needs to happen first? (Relative comparisons would obviously still be fine for determining the order.)
Am I on track?
Direct question>> Is the OpenCV SVM appropriate? Or, more specifically:
A series of separate binary SVM classifiers, one per ideal set, with a final ordering derived somehow? (i.e. what would the metric be?)
An SVM variant such as multiclass or structured SVM from another library? (These I still find hard to grasp conceptually, as the examples seem so unrelated.)
Another critical component I’m not fully grasping yet is how to define what determines a good fit between any example input set and an ideal set. I was thinking Euclidean distance: do I simply sum the distances? What about outliers? My vector calculus needs a brush-up, but maybe dot products have a role somewhere?
Direct question>> How best to define a metric that describes a fit in this case?
The real case would have 10~20 points per set and, time permitting, as many 'ideal' sets of points as possible; let's go with 30 for now. Could I expect to get away with ~2ms per iteration on a reasonable machine (MacBook Pro), or does this kind of thing blow up?
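To make the 'cost' and 'weighting' ideas concrete, here is the kind of naive thing I have in mind (purely a sketch of my own, no SVM involved yet; it assumes each label occurs exactly once per set, so the matching between input and ideal points is given entirely by the labels):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct LabeledPoint { int label; double x, y; };   // label = colour in my figures

    // Cost of fitting an input set to one ideal set: sum of Euclidean distances
    // between points sharing the same label.
    double fitCost(const std::vector<LabeledPoint>& input,
                   const std::vector<LabeledPoint>& ideal)
    {
        double c = 0.0;
        for (const LabeledPoint& p : input)
            for (const LabeledPoint& q : ideal)
                if (p.label == q.label)
                    c += std::hypot(p.x - q.x, p.y - q.y);
        return c;
    }

    // Turn one cost per ideal set into weights that sum to 1 (softmax on -cost,
    // so the lowest cost gets the largest weight; 'sharpness' controls how
    // "determined" the result looks).
    std::vector<double> costsToWeights(const std::vector<double>& costs,
                                       double sharpness = 1.0)
    {
        std::vector<double> w(costs.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < costs.size(); ++i) {
            w[i] = std::exp(-sharpness * costs[i]);
            sum += w[i];
        }
        for (double& wi : w) wi /= sum;
        return w;
    }

If something this simple holds up, 20 points against 30 ideal sets is only a few hundred distance evaluations per query, so the ~2ms budget looks comfortable.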
(disclaimer, I have asked this question more generally on Cross Validated, but there isn't much activity there (?))

Related

C++ support vector machine (SVM) template libraries?

I have a dataset of custom abstract objects and a custom distance function. Are there any good SVM libraries that allow me to train on my custom objects (not 2D points) with my custom distance function?
I searched the answers in this similar Stack Overflow question, but none of them allows me to use custom objects and distance functions.
First things first.
An SVM does not work on distance functions; it only accepts dot products. So your distance function (actually a similarity; usually 1 − distance serves as a similarity) has to:
be symmetric: s(a, b) = s(b, a)
be positive definite: s(a, a) >= 0, and s(a, a) = 0 <=> a = 0
be linear in the first argument: s(ka, b) = k s(a, b) and s(a+b, c) = s(a, c) + s(b, c)
This can be tricky to check, as you are really asking "is there a mapping phi from my objects to some vector space such that s(phi(x), phi(y)) is a dot product?", which leads to the definition of a so-called kernel, K(x,y) = s(phi(x), phi(y)). If your objects are themselves elements of a vector space, it is sometimes enough to put phi(x) = x, so K = s, but this is not true in general.
Once you have this kind of similarity, nearly any SVM library (for example libSVM) can work with it by being given the precomputed Gram matrix, which is simply defined as
G_ij = K(x_i, x_j)
This requires O(N^2) memory and time. Consequently it does not matter what your objects are, as the SVM only ever works on pairwise dot products, nothing more.
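For illustration, a minimal sketch of building such a Gram matrix from an arbitrary object type and kernel (the libSVM-specific plumbing for precomputed kernels is not shown):

    #include <cstddef>
    #include <vector>

    // G_ij = K(x_i, x_j); the matrix is symmetric, so each pair is computed once.
    template <typename Object, typename Kernel>
    std::vector<std::vector<double>> gramMatrix(const std::vector<Object>& xs, Kernel K)
    {
        const std::size_t n = xs.size();
        std::vector<std::vector<double>> G(n, std::vector<double>(n, 0.0));
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i; j < n; ++j)
                G[i][j] = G[j][i] = K(xs[i], xs[j]);
        return G;
    }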
If you lack the mathematical tools to show this property, what you can do is look into kernel learning from similarity. These methods are able to construct a valid kernel that behaves similarly to your similarity function.
Check out the following:
MLPack: a lightweight library that provides lots of functionality.
DLib: a very popular toolkit that is used both in industry and academia.
Apart from these, you can also use Python packages by calling them from C++.

Given 2 points with known speed direction and location, compute a path composed of (circle) arcs

So, I have two points, say A and B, each with a known (x, y) coordinate and a speed vector in the same coordinate system. I want to write a function to generate a set of arcs (radius and angle) that lead from state A to state B.
The angle difference is known, since I can get it from the two speed unit vectors. Say I move a certain distance along an arc with (radius = r, angle = theta); then I am back in the same kind of situation. Does this have a unique solution? I only need one solution, or even an approximation.
Of course I could solve it with one particular circle plus a straight line (radius = infinity), but that's not what I want to do. I suspect there's a library with a function for this, since it seems a fairly common problem.
A biarc is a smooth curve consisting of two circular arcs. Given two points with tangents, it is almost always possible to construct a biarc passing through them (with correct tangents).
This is a very basic routine in geometric modelling, and it is indispensable for smoothly approximating an arbitrary curve (Bézier, NURBS, etc.) with arcs. Approximation with arcs and lines is heavily used in CAM, because modellers work with NURBS without a problem, but machine controllers usually understand only lines and arcs. So I strongly suggest reading up on this topic.
In particular, here is a great article on biarcs; I seriously advise reading it. It even contains some working code and an interactive demo.
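If you want to experiment before reading, here is a rough sketch of the usual equal-parameter biarc construction (my own variable names; degenerate cases such as parallel tangents or collinear input are not handled):

    #include <cmath>

    struct Vec2 { double x, y; };
    static Vec2 add(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
    static Vec2 sub(Vec2 a, Vec2 b) { return {a.x - b.x, a.y - b.y}; }
    static Vec2 mul(Vec2 a, double s) { return {a.x * s, a.y * s}; }
    static double dot(Vec2 a, Vec2 b) { return a.x * b.x + a.y * b.y; }

    struct Arc { Vec2 center; double radius; Vec2 from, to; };

    // Circle through 'from' and 'to' that is tangent to direction 't' at 'from':
    // the center lies on the normal of 't' at 'from', equidistant from both points.
    static Arc arcFromTangent(Vec2 from, Vec2 t, Vec2 to)
    {
        Vec2 n = {-t.y, t.x};                        // unit normal at 'from'
        Vec2 d = sub(from, to);
        double r = -dot(d, d) / (2.0 * dot(n, d));   // signed radius
        return { add(from, mul(n, r)), std::fabs(r), from, to };
    }

    // p1, p2: endpoints; t1, t2: unit tangent directions at p1 and p2.
    void biarc(Vec2 p1, Vec2 t1, Vec2 p2, Vec2 t2, Arc& first, Arc& second)
    {
        Vec2 v = sub(p2, p1);
        Vec2 t = add(t1, t2);
        double denom = 2.0 * (1.0 - dot(t1, t2));
        // d solves 2*d^2*(1 - t1.t2) + 2*d*(v.t) - v.v = 0 (equal parameters d1 = d2 = d).
        double d = (-dot(v, t) + std::sqrt(dot(v, t) * dot(v, t) + denom * dot(v, v))) / denom;
        // Joint point where the two arcs meet with a common tangent.
        Vec2 pm = mul(add(add(p1, mul(t1, d)), sub(p2, mul(t2, d))), 0.5);
        first  = arcFromTangent(p1, t1, pm);
        second = arcFromTangent(p2, t2, pm);         // same circle as the arc pm -> p2
    }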

Polynomial Least Squares for Image Curve Fitting

I am trying to fit a curve to a number of pixels in an image so I can do further processing regarding its shape. Does anyone know how to implement a least-squares method in C/C++, preferably using the following parameters: an x array, a y array, and an answers array (the length of the answers array should tell how many coefficients need to be calculated)?
If this is not some exercise in implementing it yourself, I would suggest you use a ready-made library like GNU GSL. Have a look at the functions whose names start with gsl_multifit_; see e.g. the second example here.
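A rough sketch of what that looks like for the signature you describe (an x array, a y array, and an answers array whose length determines the number of coefficients); this follows my reading of the GSL manual's multifit example, with all error checking omitted:

    #include <gsl/gsl_multifit.h>
    #include <cmath>

    // Fit a polynomial of degree (ncoeff - 1) to (x[i], y[i]), i = 0..n-1,
    // writing the coefficients into 'answers' (answers[j] multiplies x^j).
    void polyFit(const double* x, const double* y, int n, double* answers, int ncoeff)
    {
        gsl_matrix* X   = gsl_matrix_alloc(n, ncoeff);   // Vandermonde design matrix
        gsl_vector* Y   = gsl_vector_alloc(n);
        gsl_vector* c   = gsl_vector_alloc(ncoeff);
        gsl_matrix* cov = gsl_matrix_alloc(ncoeff, ncoeff);

        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < ncoeff; ++j)
                gsl_matrix_set(X, i, j, std::pow(x[i], j));
            gsl_vector_set(Y, i, y[i]);
        }

        double chisq;
        gsl_multifit_linear_workspace* w = gsl_multifit_linear_alloc(n, ncoeff);
        gsl_multifit_linear(X, Y, c, cov, &chisq, w);    // least-squares solve
        gsl_multifit_linear_free(w);

        for (int j = 0; j < ncoeff; ++j)
            answers[j] = gsl_vector_get(c, j);

        gsl_matrix_free(X); gsl_vector_free(Y); gsl_vector_free(c); gsl_matrix_free(cov);
    }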
If you are trying to fit ordered points (x, y), as in a graph, you can use linear least-squares methods, but with such methods you will always need to specify the degree of the polynomial used to approximate (presumably the length of your answers array). If your points are general ordered points in the plane that can form a closed loop or the outline of a structure (for example, points describing an ellipse, a circle, or some other closed or more complex geometry), then you are going to need something more sophisticated. You can still use least squares, but you will need a parametric curve such as a spline. Take a look at this PDF, which may give you what you need (or at the very least illustrate what I am saying): http://folk.uio.no/in329/nchap6.pdf
Without seeing an image of exactly what you are trying to fit, it is hard to say. It is quite possible that your data can be fit without a parametric curve, using linear least-squares polynomials; if so, all you need is a linear algebra library, and you can code the approximation yourself: http://en.wikipedia.org/wiki/Ordinary_least_squares
Even so, all forms of approximation require you to decide on the form (function basis, degree, etc.) before you fit. For example, to decide whether a 4th, 5th, 6th, or 7th degree polynomial best fits your data, you would need to fit each one and assess its suitability yourself. There is no generic way (at least none that I know of) to tell you what degree of approximation your data needs.

Identifying local minima in a histogram

I'm interested in finding the local minima in a histogram that roughly resembles the one in the plot I've posted.
I'd want to find the local minimum at 109.258, and the easiest way to do so would be to identify whether the number of counts at 109.258 is lower than the average number of counts in some interval around (and including) 109.258. It's identifying this interval that is the most difficult part for me.
As for the source of this data, it's a histogram with 100 bins of non-uniform width. Each bin has a value (shown on the x-axis), and a count of the samples falling into that bin (shown on the y-axis). What I'm trying to do is find the "best" place to split the histogram. Each side of the split is propagated down a binary tree, as part of a classification algorithm.
I'm thinking that my best course of action would be to fit a curve to this histogram, using something like the Levenberg-Marquardt algorithm, and then compare the local minima to find the "best" one. A proper measure of "best" would include some indication of the significance of the split: the difference between the average counts in the interval to the left and the average counts in the interval to the right, perhaps weighted by the number of counts included, to give a composite measure of "best", if that makes sense.
Either way, the computational complexity of the algorithm isn't a huge issue; 100 bins is about the maximum number I'd expect to encounter. However, this calculation will be performed once for each sample, so keeping it linear in the number of bins would, of course, be ideal.
By the way, I'm doing everything in C++, and make use of the boost libraries and STL, so nothing is off-limits in that regard.
Any thoughts or insights concerning best practices would be greatly appreciated!
If I understand correctly, kmore wants to partition two "peaks" based on the largest separation (product of histogram count and bin distance). If this is true:
Find all maxima.
For each maximum, build rectangles as in the figure.
Find the rectangle with the maximum white area, which gives you the x range in which to find the desired bin at 109.258.
Levenberg–Marquardt is not a good choice in rugged optimization terrain, and yours is pretty rugged. There are lots of local minima there. Levenberg–Marquardt might well find the local minimum at about 100. Or it might find one of the two global minima at the extremes of the graph, where the function tails off to zero.
You want something that finds the most significant local minimum. For example, some kind of clustering algorithm. Here is a very simple one:
Step 1:
Find the local extrema in your data set. These are the extrema at the extremes of the range plus the internal local minima and maxima. With your histogram you should have an odd number of such extrema, alternating between minima and maxima.
Step 2:
Find the adjacent pair of extrema with the smallest delta in counts. This will be either a (local max, local min) or a (local min, local max) pair.
Step 3:
Find a pair of elements to remove, one of:
The pair found by step 2
The pair comprising the first element of the pair from step 2 and its predecessor
The pair comprising the last element of the pair from step 2 and its successor
When the found pair includes a boundary point you should use option 2 or 3, as appropriate. For an internal pair, you might want to use some heuristics in choosing between the three choices. Or you could just do the simple thing and use the found pair.
Step 4:
Delete the pair of elements from step 3, keeping track of the deleted pair.
Step 5:
Repeat steps 2 to 4 until there are only three elements left in the extrema data set (the extremes of the range plus a local max, hopefully the global max).
The last-removed minimum is what you want.
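A rough C++ sketch of the steps above (it assumes the alternating, odd-length extrema sequence from step 1; boundary handling is simplified so that pairs involving the two range endpoints are never candidates for removal, and histogram plateaus are ignored):

    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    struct Extremum { double x; double count; bool isMin; };

    // Step 1: boundary points plus interior local minima and maxima.
    std::vector<Extremum> findExtrema(const std::vector<double>& x,
                                      const std::vector<double>& y)
    {
        std::vector<Extremum> e;
        e.push_back({x.front(), y.front(), y[1] > y[0]});
        for (std::size_t i = 1; i + 1 < y.size(); ++i) {
            if (y[i] < y[i-1] && y[i] < y[i+1]) e.push_back({x[i], y[i], true});
            if (y[i] > y[i-1] && y[i] > y[i+1]) e.push_back({x[i], y[i], false});
        }
        e.push_back({x.back(), y.back(), y[y.size()-2] > y.back()});
        return e;
    }

    // Steps 2-5: repeatedly drop the adjacent interior pair with the smallest
    // count difference; the minimum removed last is the most significant one.
    double mostSignificantMinimum(std::vector<Extremum> e)
    {
        double lastRemovedMin = std::numeric_limits<double>::quiet_NaN();
        while (e.size() > 3) {
            std::size_t best = 1;
            double bestDelta = std::numeric_limits<double>::max();
            for (std::size_t i = 1; i + 2 < e.size(); ++i) {   // skip boundary pairs
                double delta = std::fabs(e[i].count - e[i+1].count);
                if (delta < bestDelta) { bestDelta = delta; best = i; }
            }
            lastRemovedMin = e[best].isMin ? e[best].x : e[best+1].x;
            e.erase(e.begin() + best, e.begin() + best + 2);    // remove the pair
        }
        return lastRemovedMin;   // x position at which to split the histogram
    }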
There are lots of other clustering algorithms. The one I presented is rather crude and obviously isn't particularly fast. One that extends nicely to a lot more data, and higher dimensional data is the Expectation Maximization algorithm. Simulated annealing (Metropolis-Hastings) could also be adapted to this problem.
The problem can, of course, be transformed into one of peak finding by functional manipulation of the data (inversion or negation are obvious candidates).
Alternatively, if the example is typical, one might begin with peak-finding in the untransformed data and seek regions where the peaks are (relatively) widely separated as candidates for containing a good local minimum.
I am forever recommending the method used by the ROOT TSpectrum classes for peak finding.
The underlying algorithm is discussed in detail in:
M. Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
M. Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
M. Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 443 (2000) 108-125.
Copies of these papers are maintained on the ROOT web site and linked in the TSpectrum documentation for those that do not have a subscription to NIM A.
What you want seems to be more complicated than just a local minimum. Also, the local minimum concept depends strongly on your choice of bins.
Have you heard about Otsu's method? It might be more along the lines of what you want.
Here's another Otsu's method link.
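For completeness, here is a rough sketch of Otsu's criterion adapted to a binned histogram with non-uniform bin centres (names and interface are mine, not from any library): it scans every possible split and keeps the one maximising the between-class variance of the bin values, weighted by counts.

    #include <cstddef>
    #include <vector>

    // Returns the index of the last bin of the left-hand class; everything up to
    // and including that bin goes left, the rest goes right.
    std::size_t otsuSplit(const std::vector<double>& centers,
                          const std::vector<double>& counts)
    {
        double total = 0.0, totalSum = 0.0;
        for (std::size_t i = 0; i < counts.size(); ++i) {
            total    += counts[i];
            totalSum += counts[i] * centers[i];
        }

        double wLeft = 0.0, sumLeft = 0.0, bestVar = -1.0;
        std::size_t bestSplit = 0;
        for (std::size_t i = 0; i + 1 < counts.size(); ++i) {
            wLeft   += counts[i];                 // total count in the left class
            sumLeft += counts[i] * centers[i];
            double wRight = total - wLeft;
            if (wLeft == 0.0 || wRight == 0.0) continue;
            double meanLeft  = sumLeft / wLeft;
            double meanRight = (totalSum - sumLeft) / wRight;
            double betweenVar = wLeft * wRight
                              * (meanLeft - meanRight) * (meanLeft - meanRight);
            if (betweenVar > bestVar) { bestVar = betweenVar; bestSplit = i; }
        }
        return bestSplit;
    }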

finding valid points in 2d space with restrictions on arbitrary regions

I have a 2D double-precision space with regions (arbitrarily defined, mostly circles) that are "not valid", so to say, and I'd like to get the nearest valid point, given a desired destination (which doesn't have to be valid). So far I've tried avoiding those regions on a case-by-case basis, but when there are multiple constraints (like having to avoid 2-3 regions that are close together or blended together) this approach doesn't work. I thought about some kind of search, but discretizing the space would be another problem, as these regions won't really conform to it.
I was hoping you guys could give me some advice on how to tackle a problem like this. A related but much simpler case would be this.
Thanks!
It's basically impossible, unless you can put some constraints on these invalid regions.
Consider an invalid region (or union of regions) in the form of a large irregular blob with a tiny pinhole of validity somewhere inside. And suppose your destination is inside the blob, near the pinhole, so that the desired point is actually in the pinhole. If the only way to examine this blob is with a yes/no method to test a point for validity, the only way to find the pinhole will be by exhaustive search, which will take forever.
If all of your invalid regions are disjoint, the problem is manageable. For a given point, if it is inside one of the regions, look for the closest point on that region's boundary. This isn't necessarily trivial, but there should be lots of references, even on this site, for doing that with various types of boundaries: straight lines, arcs, circles, splines, etc.
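As a concrete sketch of that disjoint case, specialised to circles (the common shape in your question), and assuming the regions are far enough apart that pushing a point out of one circle never lands it inside another: if the query point is inside a circle, the nearest valid point is simply its radial projection onto that circle's boundary, nudged outward by a tiny epsilon.

    #include <cmath>
    #include <vector>

    struct Point  { double x, y; };
    struct Circle { Point c; double r; };

    Point nearestValidPoint(Point p, const std::vector<Circle>& invalid)
    {
        const double eps = 1e-9;
        for (const Circle& circ : invalid) {
            double dx = p.x - circ.c.x, dy = p.y - circ.c.y;
            double d  = std::sqrt(dx * dx + dy * dy);
            if (d < circ.r) {                      // p lies inside this invalid circle
                if (d < eps) { dx = 1.0; dy = 0.0; d = 1.0; }  // at the centre: pick any direction
                double scale = (circ.r + eps) / d; // push p out radially past the boundary
                return { circ.c.x + dx * scale, circ.c.y + dy * scale };
            }
        }
        return p;                                  // already valid
    }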
If the regions are not disjoint, you can combine them into regions that are. CGAL provides libraries for 2D booleans (specifically unions).