What is the fastest algorithm to find the point from a set of points, which is closest to a line? - c++

I have:
- a set of points of known size (in my case, only 6 points)
- a line characterized by x = s + t * r, where x, s and r are 3D vectors
I need to find the point closest to the given line. The actual distance does not matter to me.
I had a look at several different questions that seem related (including this one) and know how to solve this on paper from my highschool math classes. But I cannot find a solution without calculating every distance, and I am sure there has to be a better/faster way. Performance is absolutely crucial in my application.
One more thing: All numbers are integers (coordinates of points and elements of s and r vectors). Again, for performance reasons I would like to keep the floating-point math to a minimum.

You have to process every point at least once to know their distance. Unless you want to repeat the process many times with different lines, simply computing the distance of every point is unavoidable. So the algorithm has to be O(n).
Since you don't care about the actual distance, we can make some simplification to the point-distance computation. The exact distance is computed by (source):
d^2 = |r⨯(p-s)|^2 / |r|^2
where ⨯ is the cross product and |r|^2 is the squared length of vector r. Since |r|^2 is constant for all points, we can omit it from the distance computation without changing result:
d^2 = |r⨯(p-s)|^2
Compare the approximated square distances and keep the minimum. The advantage of this formula is that you can do everything with integers since you mentioned that all coordinates are integers.

I'm afraid you can't get away with computing less than 6 distances (if you could, at least one point would be left out -- including the nearest one).
See if it makes sense to preprocess: Is the line fixed and the points vary? Consider rotating coordinates to make the line horizontal.
As there are few points, it is doubtful that this is your bottleneck. Measure where the hot spots are, redesign algorithms/data representation, spice up compiler optimization, compile to assembly and bum that. Strictly in that order.
Jon Bentley's "Writing Efficient Programs" (sadly long out of print) and "Programming Pearls" (2nd edition) are full of advise on practical programming.

Related

Given 2 points with known speed direction and location, compute a path composed of (circle) arcs

So, I have two points, say A and B, each one has a known (x, y) coordinate and a speed vector in the same coordinate system. I want to write a function to generate a set of arcs (radius and angle) that lead A to status B.
The angle difference is known, since I can get it by subtracting speed unit vector. Say I move a certain distance with (radius=r, angle=theta) then I got into the exact same situation. Does it have a unique solution? I only need one solution, or even an approximation.
Of course I can solve it by giving a certain circle and a line(radius=infine), but that's not what I want to do. I think there's a library that has a function for this, since it's quite a common approach.
A biarc is a smooth curve consisting of two circular arcs. Given two points with tangents, it is almost always possible to construct a biarc passing through them (with correct tangents).
This is a very basic routine in geometric modelling, and it is indispensable for smoothly approximating an arbirtrary curve (bezier, NURBS, etc) with arcs. Approximation with arcs and lines is heavily used in CAM, because modellers use NURBS without a problem, but machine controllers usually understand only lines and arcs. So I strongly suggest reading on this topic.
In particular, here is a great article on biarcs on biarcs, I seriously advice reading it. It even contains some working code, and an interactive demo.

Efficient Longest arithmetic progression for a set of linear Points

Longest arithmetic progression of a set of numbers {ab1,ab2,ab3 .... abn} is defined as a subset {bb1,bb2,bb3 .... bbn} such that bi+1 - bi is constant.
I would like to extend this problems to a set of two dimensional points lying on a straight line.
Lets define Dist(P1,P2) is the distance between two Points P1(X1,Y1) and P2(X2,Y2) on a line as
Dist(P1,P2) = Dist((X1,Y1),(X2,Y2)) = (X2 - X1)2 + (Y2 - Y1))2
Now For a given set of points I need to find the largest Arithmetic Progression such that Dist(Pi,Pi+1) is constant, assuming they all lie on the same line (m & C are constant).
I researched a bit but could not figure out an algorithm which is better than O(n2).
In fact currently the way I am doing is I am maintaining a Dictionary say
DistDict=dict()
and say Points are defined in a List as
Points = [(X1,Y1),(X2,Y2),....]
then this is what I am doing
for i,pi in enumerate(Points):
for pj in Points[i+1:]:
DistDict.setdefault(dist(pi,pj),set([])).add((pi,pj))
so all I have ended up with a dictionary of points which are of equal distance. So the only thing I have to do is to scan through to find out the longest set.
I am just wondering that this ought to have a better solution, but somehow I can't figure out one. I have also seen couple of similar older SO posts but none I can find to give something that is more efficient than O(n2). Is this somehow an NP Hard problem that we can never have something better or if not what approach could be take.
Please note I came across a post which claims about an efficient divide and conquer algorithm but couldn't make any head or tail out of it.
Any help in this regard?
Note*** I am tagging this Python because I understand Python better than maybe Matlab or Ruby. C/C++/Java is also fine as I am somewhat proficient in these too :-)
To sum things up: As #TonyK pointed out, if you assume that the points lie on a straight line, you can reduce it to the one-dimensional case that was discussed extensively here already. The solution uses Fast Fourier Transforms as mentioned by #YochaiTimmer.
Additional note: The problem is almost certainly not NP-hard as it has an efficient O(n log n) solution, so that would imply P=NP.
You could study the Fast Fourier Transform methods for multiplication.O(N log N)
You might be able to do something similar with your problem.
Firstly, your definition of distance is wrong. You have to take the square root. Secondly, if you know that all the points lie on a straight line, you can simply ignore the y-coordinates (unless the line is vertical) or the x-coordinates (unless the line is horizontal). Then it reduces to the problem in your first paragraph.

Looking for C/C++ library calculating max of Gaussian curve using discrete values

I have some discrete values and assumption, that these values lie on a Gaussian curve.
There should be an algorithm for max-calculation using only 3 discrete values.
Do you know any library or code in C/C++ implementing this calculation?
Thank you!
P.S.:
The original task is auto-focus implementation. I move a (microscope) camera and capture the pictures in different positions. The position having most different colors should have best focus.
EDIT
This was long time ago :-(
I'just wanted to remove this question, but left it respecting the good answer.
You have three points that are supposed to be on a Gaussian curve; this means that they lie on the function:
If you take the logarithm of this function, you get:
which is just a simple 2nd grade polynomial, i.e. a parabola with a vertical axis of simmetry:
with
So, if you know the three coefficients of the parabola, you can derive the parameters of the Gaussian curve; incidentally, the only parameter of the Gaussian function that is of some interest to you is b, since it tells you where the center of the distribution, i.e. where is its maximum. It's immediate to find out that
All that remains to do is to fit the parabola (with the "original" x and the logarithm of your values). Now, if you had more points, a polynomial fit would be involved, but, since you have just three points, the situation is really simple: there's one and only one parabola that goes through three points.
You now just have to write the equation of the parabola for each of your points and solve the system:
(with , where the zs are the actual values read at the corresponding x)
This can be solved by hand (with some time), with some CAS or... looking on StackOverflow :) ; the solution thus is:
So using these last equations (remember: the ys are the logarithm of your "real" values) and the other relations you can easily write a simple algebraic formula to get the parameter b of your Gaussian curve, i.e. its maximum.
(I may have done some mess in the calculations, double-check them before using the results, anyhow the procedure should be correct)
(thanks at http://www.codecogs.com/latex/eqneditor.php for the LaTeX equations)

All k nearest neighbors in 2D, C++

I need to find for each point of the data set all its nearest neighbors. The data set contains approx. 10 million 2D points. The data are close to the grid, but do not form a precise grid...
This option excludes (in my opinion) the use of KD Trees, where the basic assumption is no points have same x coordinate and y coordinate.
I need a fast algorithm O(n) or better (but not too difficult for implementation :-)) ) to solve this problem ... Due to the fact that boost is not standardized, I do not want to use it ...
Thanks for your answers or code samples...
I would do the following:
Create a larger grid on top of the points.
Go through the points linearly, and for each one of them, figure out which large "cell" it belongs to (and add the points to a list associated with that cell).
(This can be done in constant time for each point, just do an integer division of the coordinates of the points.)
Now go through the points linearly again. To find the 10 nearest neighbors you only need to look at the points in the adjacent, larger, cells.
Since your points are fairly evenly scattered, you can do this in time proportional to the number of points in each (large) cell.
Here is an (ugly) pic describing the situation:
The cells must be large enough for (the center) and the adjacent cells to contain the closest 10 points, but small enough to speed up the computation. You could see it as a "hash-function" where you'll find the closest points in the same bucket.
(Note that strictly speaking it's not O(n) but by tweaking the size of the larger cells, you should get close enough. :-)
I have used a library called ANN (Approximate Nearest Neighbour) with great success. It does use a Kd-tree approach, although there was more than one algorithm to try. I used it for point location on a triangulated surface. You might have some luck with it. It is minimal and was easy to include in my library just by dropping in its source.
Good luck with this interesting task!

Is there a data structure with these characteristics?

I'm looking for a data structure that would allow me to store an M-by-N 2D matrix of values contiguously in memory, such that the distance in memory between any two points approximates the Euclidean distance between those points in the matrix. That is, in a typical row-major representation as a one-dimensional array of M * N elements, the memory distance differs between adjacent cells in the same row (1) and adjacent cells in neighbouring rows (N).
I'd like a data structure that reduces or removes this difference. Really, the name of such a structure is sufficient—I can implement it myself. If answers happen to refer to libraries for this sort of thing, that's also acceptable, but they should be usable with C++.
I have an application that needs to perform fast image convolutions without hardware acceleration, and though I'm aware of the usual optimisation techniques for this sort of thing, I feel a specialised data structure or data ordering could improve performance.
Given the requirement that you want to store the values contiguously in memory, I'd strongly suggest you research space-filling curves, especially Hilbert curves.
To give a bit of context, such curves are sometimes used in database indexes to improve the locality of multidimensional range queries (e.g., "find all items with x/y coordinates in this rectangle"), thereby aiming to reduce the number of distinct pages accessed. A bit similar to the R-trees that have been suggested here already.
Either way, it looks that you're bound to an M*N array of values in memory, so the whole question is about how to arrange the values in that array, I figure. (Unless I misunderstood the question.)
So in fact, such orderings would probably still only change the characteristics of distance distribution.. average distance for any two randomly chosen points from the matrix should not change, so I have to agree with Oli there. Potential benefit depends largely on your specific use case, I suppose.
I would guess "no"! And if the answer happens to be "yes", then it's almost certainly so irregular that it'll be way slower for a convolution-type operation.
EDIT
To qualify my guess, take an example. Let's say we store a[0][0] first. We want a[k][0] and a[0][k] to be similar distances, and proportional to k, so we might choose to interleave the storage of first row and first column (i.e. a[0][0], a[1][0], a[0][1], a[2][0], a[0][2], etc.) But how do we now do the same for e.g. a[1][0]? All the locations near it in memory are now taken up by stuff that's near a[0][0].
Whilst there are other possibilities than my example, I'd wager that you always end up with this kind of problem.
EDIT
If your data is sparse, then there may be scope to do something clever (re Cubbi's suggestion of R-trees). However, it'll still require irregular access and pointer chasing, so will be significantly slower than straightforward convolution for any given number of points.
You might look at space-filling curves, in particular the Z-order curve, which (mostly) preserves spatial locality. It might be computationally expensive to look up indices, however.
If you are using this to try and improve cache performance, you might try a technique called "bricking", which is a little bit like one or two levels of the space filling curve. Essentially, you subdivide your matrix into nxn tiles, (where nxn fits neatly in your L1 cache). You can also store another level of tiles to fit into a higher level cache. The advantage this has over a space-filling curve is that indices can be fairly quick to compute. One reference is included in the paper here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.8959
This sounds like something that could be helped by an R-tree. or one of its variants. There is nothing like that in the C++ Standard Library, but looks like there is an R-tree in the boost candidate library Boost.Geometry (not a part of boost yet). I'd take a look at that before writing my own.
It is not possible to "linearize" a 2D structure into an 1D structure and keep the relation of proximity unchanged in both directions. This is one of the fundamental topological properties of the world.
Having that that, it is true that the standard row-wise or column-wise storage order normally used for 2D array representation is not the best one when you need to preserve the proximity (as much as possible). You can get better result by using various discrete approximations of fractal curves (space-filling curves).
Z-order curve is a popular one for this application: http://en.wikipedia.org/wiki/Z-order_(curve)
Keep in mind though that regardless of which approach you use, there will always be elements that violate your distance requirement.
You could think of your 2D matrix as a big spiral, starting at the center and progressing to the outside. Unwind the spiral, and store the data in that order, and distance between addresses at least vaguely approximates Euclidean distance between the points they represent. While it won't be very exact, I'm pretty sure you can't do a whole lot better either. At the same time, I think even at very best, it's going to be of minimal help to your convolution code.
The answer is no. Think about it - memory is 1D. Your matrix is 2D. You want to squash that extra dimension in - with no loss? It's not going to happen.
What's more important is that once you get a certain distance away, it takes the same time to load into cache. If you have a cache miss, it doesn't matter if it's 100 away or 100000. Fundamentally, you cannot get more contiguous/better performance than a simple array, unless you want to get an LRU for your array.
I think you're forgetting that distance in computer memory is not accessed by a computer cpu operating on foot :) so the distance is pretty much irrelevant.
It's random access memory, so really you have to figure out what operations you need to do, and optimize the accesses for that.
You need to reconvert the addresses from memory space to the original array space to accomplish this. Also, you've stressed distance only, which may still cause you some problems (no direction)
If I have an array of R x C, and two cells at locations [r,c] and [c,r], the distance from some arbitrary point, say [0,0] is identical. And there's no way you're going to make one memory address hold two things, unless you've got one of those fancy new qubit machines.
However, you can take into account that in a row major array of R x C that each row is C * sizeof(yourdata) bytes long. Conversely, you can say that the original coordinates of any memory address within the bounds of the array are
r = (address / C)
c = (address % C)
so
r1 = (address1 / C)
r2 = (address2 / C)
c1 = (address1 % C)
c2 = (address2 % C)
dx = r1 - r2
dy = c1 - c2
dist = sqrt(dx^2 + dy^2)
(this is assuming you're using zero based arrays)
(crush all this together to make it run more optimally)
For a lot more ideas here, go look for any 2D image manipulation code that uses a calculated value called 'stride', which is basically an indicator that they're jumping back and forth between memory addresses and array addresses
This is not exactly related to closeness but might help. It certainly helps for minimation of disk accesses.
one way to get better "closness" is to tile the image. If your convolution kernel is less than the size of a tile you typical touch at most 4 tiles at worst. You can recursively tile in bigger sections so that localization improves. A Stokes-like (At least I thinks its Stokes) argument (or some calculus of variations ) can show that for rectangles the best (meaning for examination of arbitrary sub rectangles) shape is a smaller rectangle of the same aspect ratio.
Quick intuition - think about a square - if you tile the larger square with smaller squares the fact that a square encloses maximal area for a given perimeter means that square tiles have minimal boarder length. when you transform the large square I think you can show you should the transform the tile the same way. (might also be able to do a simple multivariate differentiation)
The classic example is zooming in on spy satellite data images and convolving it for enhancement. The extra computation to tile is really worth it if you keep the data around and you go back to it.
Its also really worth it for the different compression schemes such as cosine transforms. (That's why when you download an image it frequently comes up as it does in smaller and smaller squares until the final resolution is reached.
There are a lot of books on this area and they are helpful.