How to find the best combination to minimize the total distance in all groups of points - grouping

Given 4n+k points with the position (x,y) (or (x,y,z) for a 3D case), where n=1,2,3,...; k∈{0,1,2,3}.
Group the points into n-k groups of 4 points, and k groups of 5 points.
A group centroid is the mean position of the 4(or 5) points in the group.
How to effectively get the best combination to minimize the sum of the distance of each point to its own group centroid?
Brutal enumeration is the only way I've achieve to get the best combination. However brutal force only works when n is quite small because of the computational limitation.
I also tried K-Means clustering and genetic algorithm, but neither of them nor the combination of these two algorithms can guarantee the best combination.

As the problem supposedly is NP-hard you won't be able to find an algorithm that guarantees to find the optimum and that runs in acceptable time on large data.
You'll have to settle for an approximation such as kmeans.

Related

What is the fastest algorithm to find the point from a set of points, which is closest to a line?

I have:
- a set of points of known size (in my case, only 6 points)
- a line characterized by x = s + t * r, where x, s and r are 3D vectors
I need to find the point closest to the given line. The actual distance does not matter to me.
I had a look at several different questions that seem related (including this one) and know how to solve this on paper from my highschool math classes. But I cannot find a solution without calculating every distance, and I am sure there has to be a better/faster way. Performance is absolutely crucial in my application.
One more thing: All numbers are integers (coordinates of points and elements of s and r vectors). Again, for performance reasons I would like to keep the floating-point math to a minimum.
You have to process every point at least once to know their distance. Unless you want to repeat the process many times with different lines, simply computing the distance of every point is unavoidable. So the algorithm has to be O(n).
Since you don't care about the actual distance, we can make some simplification to the point-distance computation. The exact distance is computed by (source):
d^2 = |r⨯(p-s)|^2 / |r|^2
where ⨯ is the cross product and |r|^2 is the squared length of vector r. Since |r|^2 is constant for all points, we can omit it from the distance computation without changing result:
d^2 = |r⨯(p-s)|^2
Compare the approximated square distances and keep the minimum. The advantage of this formula is that you can do everything with integers since you mentioned that all coordinates are integers.
I'm afraid you can't get away with computing less than 6 distances (if you could, at least one point would be left out -- including the nearest one).
See if it makes sense to preprocess: Is the line fixed and the points vary? Consider rotating coordinates to make the line horizontal.
As there are few points, it is doubtful that this is your bottleneck. Measure where the hot spots are, redesign algorithms/data representation, spice up compiler optimization, compile to assembly and bum that. Strictly in that order.
Jon Bentley's "Writing Efficient Programs" (sadly long out of print) and "Programming Pearls" (2nd edition) are full of advise on practical programming.

what kind of algorithm for generating height-map from contour line?

I'm looking for interpolating some contour lines to generating a 3D view. The contours are not stored in a picture, coordinates of each point of the contour are simply stored in a std::vector.
for convex contours :
, it seems (I didn't check by myself) that the height can be easily calculates (linear interpolation) by using the distance between the two closest points of the two closest contours.
my contours are not necessarily convex :
, so it's more tricky... actualy I don't have any idea what kind of algorithm I can use.
UPDATE : 26 Nov. 2013
I finished to write a Discrete Laplace example :
you can get the code here
What you have is basically the classical Dirichlet problem:
Given the values of a function on the boundary of a region of space, assign values to the function in the interior of the region so that it satisfies a specific equation (such as Laplace's equation, which essentially requires the function to have no arbitrary "bumps") everywhere in the interior.
There are many ways to calculate approximate solutions to the Dirichlet problem. A simple approach, which should be well suited to your problem, is to start by discretizing the system; that is, you take a finite grid of height values, assign fixed values to those points that lie on a contour line, and then solve a discretized version of Laplace's equation for the remaining points.
Now, what Laplace's equation actually specifies, in plain terms, is that every point should have a value equal to the average of its neighbors. In the mathematical formulation of the equation, we require this to hold true in the limit as the radius of the neighborhood tends towards zero, but since we're actually working on a finite lattice, we just need to pick a suitable fixed neighborhood. A few reasonable choices of neighborhoods include:
the four orthogonally adjacent points surrounding the center point (a.k.a. the von Neumann neighborhood),
the eight orthogonally and diagonally adjacent grid points (a.k.a. the Moore neigborhood), or
the eight orthogonally and diagonally adjacent grid points, weighted so that the orthogonally adjacent points are counted twice (essentially the sum or average of the above two choices).
(Out of the choices above, the last one generally produces the nicest results, since it most closely approximates a Gaussian kernel, but the first two are often almost as good, and may be faster to calculate.)
Once you've picked a neighborhood and defined the fixed boundary points, it's time to compute the solution. For this, you basically have two choices:
Define a system of linear equations, one per each (unconstrained) grid point, stating that the value at each point is the average of its neighbors, and solve it. This is generally the most efficient approach, if you have access to a good sparse linear system solver, but writing one from scratch may be challenging.
Use an iterative method, where you first assign an arbitrary initial guess to each unconstrained grid point (e.g. using linear interpolation, as you suggest) and then loop over the grid, replacing the value at each point with the average of its neighbors. Then keep repeating this until the values stop changing (much).
You can generate the Constrained Delaunay Triangulation of the vertices and line segments describing the contours, then use the height defined at each vertex as a Z coordinate.
The resulting triangulation can then be rendered like any other triangle soup.
Despite the name, you can use TetGen to generate the triangulations, though it takes a bit of work to set up.

How to implement k-means algorithm on string data

I am trying to implement K-means algorithm on the below data-set.It's stragiht-forward to calculate distance between any two numeric attributes but how do I calculate distance between two strings and also how do I sum up all the distances(i.e the distance between string attributes and the distance between numeric attributes.) Please kindly advise me.Thank you.
K-means is designed for Euclidean distance. You cannot just plug in arbitrary other distance functions. This may cause k-means to no longer converge.
The required property is that the mean must minimize the variances. If you cannot guarantee this property (and what is the mean of a string anyway?) then you lose guaranteed convergence.
Technically, k-means isn't even based on Euclidean distance, but it minimizes variances, which happen to be the same as squared Euclidean distances; and if you minimize these squares, you also minimize Euclidean distance. But what the algorithm really aims at minimizing is Var(Attribute 1, Cluster 1) + Var(Attribute 2, Cluster 1) + ... + Var(Attribute n, Cluster k).
You might want to look into k-medians, which by using a medoid instead of the mean, avoids both the need to be able to compute a mean and can give convergence guarantees for arbitrary distances as far as I know.
However, you might want to look into truly distance based algorithms, including the various density based clustering algorithms which usually also are distance-based.
To calculate the distance between strings, you can use the Levenshtein distance (aka edit distance).
To normalize the values between the string and numeric attributes, you can can try to state the attributes as percentages: find the min and max value of each type of attribute, and then for a given data instance, calculate its percentage within the respective range.

Identifying local minima in a histogram

I'm interested in finding the local minima in a histogram that roughly resembles
I'd want to find the local minimum at 109.258, and the easiest way to do so would be to identify whether the number of counts at 109.258 is lower than the average number of counts around in some interval around (and including 109.258). It's identifying this interval that's the most difficult part for me.
As for the source of this data, it's a histogram with 100 bins of non-uniform width. Each bin has a value (shown on the x-axis), and a count of the samples falling into that bin (shown on the y-axis). What I'm trying to do is find the "best" place to split the histogram. Each side of the split is propagated down a binary tree, as part of a classification algorithm.
I'm thinking that my best course of action would be to try to fit a curve to this histogram, using something like the Levenberg-Marquardt algorithm and then to compare the local minima to find the "best" one. A proper measure of "best" would include some indication of the significance of that split, which is measured as the difference between the average counts in the interval to the left and the average of the counts in the interval to the right, and then maybe weight each difference with the number of counts included to get a composite measurement of "best," if that makes sense.
Either way, computational complexity of the algorithm isn't a huge issue, 100 bins is about the maximum number I'd expect to encounter. However, this calculation will be performed once for each sample, so keeping it linear with respect to the number of bins would, of course, be ideal.
By the way, I'm doing everything in C++, and make use of the boost libraries and STL, so nothing is off-limits in that regard.
Any thoughts or insights concerning best practices would be greatly appreciated!
If I understand correctly kmore wants to partition two "peaks" based on the largest separation (product of histogram count and bin distance). If this is true:
Find all maxs.
for each max build rectangles like in Fig.
Find rectangle with max white area, which gives you the x range to find desirable bin 109.258
Levenberg–Marquardt is not so good a choice in a rugged optimization terrain -- and yours is pretty rugged. There are lots of local minima there. Levenberg–Marquardt might well find the local minimum at about 100. Or it might find one the two global minima at the extremes of the graph where the function tails off to zero.
You want something that finds the most significant local minimum. For example, some kind of clustering algorithm. Here is a very simple one:
Step 1:
Find the local extrema in your data set. These are the extrema at the extremes of the range plus the internal local minima and maxima. With your histogram you should have an odd number of such extrema, alternating between minima and maxima.
Step 2:
Find the pair with the smallest delta. This will either be a (local max, local min) or a (local min, local max) pair.
Step 3:
Find a pair of elements to remove, one of
The pair found by step 2
The pair comprising the first element of the pair from step 2 and its predecessor
The pair comprising the last element of the pair from step 2 and its successor
When the found pair includes a boundary point you should use option 2 or 3, as appropriate. For an internal pair, you might want to use some heuristics in choosing between the three choices. Or you could just do the simple thing and use the found pair.
Step 4:
Delete the pair of elements from step 3, keeping track of the deleted pair.
Step 5:
Repeat steps 2 to 4 until there are only three elements left in the extrema data set (the extremes of the range plus a local max, hopefully the global max).
The last-removed minima is what you want.
There are lots of other clustering algorithms. The one I presented is rather crude and obviously isn't particularly fast. One that extends nicely to a lot more data, and higher dimensional data is the Expectation Maximization algorithm. Simulated annealing (Metropolis-Hastings) could also be adapted to this problem.
The problem can, of course be transformed into one of peak finding by functional manipulation of the data (inversion or negation are obvious candidates).
Alternatively, if the example is typical, one might begin with peak-finding in the untransformed data and seek regions where the peaks are (relatively) widely separated as candidates for containing a good local minima.
I am forever recommending the method used by the ROOT TSpectrum classes for peak finding.
The underling algorithm is discussed in detail in
M.Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
M.Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
M.Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Research Physics A 443(2000), 108-125.
Copies of these papers are maintained on the ROOT web site and linked in the TSpectrum documentation for those that do not have a subscription to NIM A.
What you want seems to be more complicated than just a local minimum. Also, the local minimum concept depends strongly on your choice of bins.
Have you heard about Otsu's method? It might be more along the lines of what you want.
Here's another Otsu's method link.

All k nearest neighbors in 2D, C++

I need to find for each point of the data set all its nearest neighbors. The data set contains approx. 10 million 2D points. The data are close to the grid, but do not form a precise grid...
This option excludes (in my opinion) the use of KD Trees, where the basic assumption is no points have same x coordinate and y coordinate.
I need a fast algorithm O(n) or better (but not too difficult for implementation :-)) ) to solve this problem ... Due to the fact that boost is not standardized, I do not want to use it ...
Thanks for your answers or code samples...
I would do the following:
Create a larger grid on top of the points.
Go through the points linearly, and for each one of them, figure out which large "cell" it belongs to (and add the points to a list associated with that cell).
(This can be done in constant time for each point, just do an integer division of the coordinates of the points.)
Now go through the points linearly again. To find the 10 nearest neighbors you only need to look at the points in the adjacent, larger, cells.
Since your points are fairly evenly scattered, you can do this in time proportional to the number of points in each (large) cell.
Here is an (ugly) pic describing the situation:
The cells must be large enough for (the center) and the adjacent cells to contain the closest 10 points, but small enough to speed up the computation. You could see it as a "hash-function" where you'll find the closest points in the same bucket.
(Note that strictly speaking it's not O(n) but by tweaking the size of the larger cells, you should get close enough. :-)
I have used a library called ANN (Approximate Nearest Neighbour) with great success. It does use a Kd-tree approach, although there was more than one algorithm to try. I used it for point location on a triangulated surface. You might have some luck with it. It is minimal and was easy to include in my library just by dropping in its source.
Good luck with this interesting task!