Closest point from List for every point of other List - c++

I have a population of so-called "Dots" that search for food. Every Dot has a sight_ value, which indicates the range in which it can see food.
The position of each Dot is saved as a pair<uint16_t,uint16_t>. The positions of all food sources are in a vector<pair<uint16_t,uint16_t>>.
Now I want to find, for every Dot, the closest food source that it can see, and I don't want to calculate the distance of every combination.
My idea was to create a copy of the food vector, sort one copy by x and the other by y, then find the interval [x-sight, x+sight] in one vector and [y-sight, y+sight] in the other, and then build the intersection of both.
I've read about set_intersection, but it requires both ranges to be sorted by the same rule.
Any ideas how I could do this? It could also be that my idea is just the wrong approach.
Thanks
IceFreez3r
Edit:
I did some runtime approximations:
Sort Food: n log n
Find Interval for one Coordinate and one Dot: 2 log n (lower and upper bound)
If we assume an equal distribution of food sources, we can calculate the bound that is estimated to be closer to the middle first and then calculate the second bound in the remaining interval. This would reduce the runtime to: log n + log(n/2) (Just realized this is probably not *that* powerful: log(n/2) =~ log(n) - 1)
Build intersection: #x * #y =~ (n * sight/testgroundsize)^2
Compute exact Distance for every Food in Intersection: n * (sight/testgroundsize)^2
Sum: 2 n log n + 2 * #Dots * (log n + log(n/2) + (n * sight/testgroundsize)^2 + n * (sight/testgroundsize)^2)
Sum with just limiting one coordinate: n log n + #Dots * (log n + log(n/2) + n * sight/testgroundsize)
I did some tests and just calculated the above formulas on the fly:
int dots = dots_.size();
int sum = 2 * n * log(n) + 2 * dots * (log(n) + log(n/2) + pow(n * (sum_sight / dots) / testground_size_,2) + n * pow((sum_sight / dots) / testground_size_, 2));
int sum2 = n * log(n) + dots * (log(n) + log(n/2) + n * (sum_sight / dots) / testground_size_);
cout << n*dots << endl << sum << endl << sum2 << endl;
It turned out the intersection idea is just bad, while limiting just one coordinate is at least better than brute force.
I haven't thought about the grid idea yet, @Daniel Jour.

You're stepping into a whole field of interesting approaches to this problem. Terms to Google are binary space partitioning, quadtrees, ... and of course nearest neighbour search.
A relatively simple but effective approach when the dots are spread much farther apart than their "visible range":
Select a value "grid size".
Create a map from grid coordinates to a list/set of entities
For each food source: put them in the map at their grid coordinates
For each dot: put them in the map at their grid coordinates and also in the neighbour grid "cells". The size of the neighbourhood depends on the grid size and the dot's sight value
For each entry in the map which contains at least one dot: Either do this algorithm recursively with a smaller grid size or use the brute force approach: check each dot in that grid cell against each food source in that grid cell.
This is a linear algorithm, compared with the quadratic brute force approach.
Calculation of grid coordinates: grid_x = int(x / grid_size), and the same for the other coordinate.
Neighbourhood: steps = ceil(sight_value / grid_size); the neighbourhood is a square with side length 2×steps + 1, centred at the dot's grid coordinates.
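A minimal sketch of that in C++, assuming the pair<uint16_t,uint16_t> positions from the question; grid_size, build_grid and find_closest_food are illustrative names, and this variant only buckets the food sources while each dot scans its neighbourhood of cells directly:

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Pos = std::pair<uint16_t, uint16_t>;

// Squared Euclidean distance; avoids sqrt for comparisons.
static long long dist2(Pos a, Pos b) {
    long long dx = (long long)a.first - b.first;
    long long dy = (long long)a.second - b.second;
    return dx * dx + dy * dy;
}

// Bucket each food source by its grid cell (both cell coordinates packed into one key).
std::unordered_map<uint64_t, std::vector<size_t>>
build_grid(const std::vector<Pos>& food, uint32_t grid_size) {
    std::unordered_map<uint64_t, std::vector<size_t>> grid;
    for (size_t i = 0; i < food.size(); ++i) {
        uint64_t gx = food[i].first / grid_size;
        uint64_t gy = food[i].second / grid_size;
        grid[(gx << 32) | gy].push_back(i);
    }
    return grid;
}

// Brute force only over the cells within the dot's sight; returns the index
// of the closest visible food source, or -1 if none is in range.
long long find_closest_food(
        Pos dot, uint32_t sight, const std::vector<Pos>& food,
        const std::unordered_map<uint64_t, std::vector<size_t>>& grid,
        uint32_t grid_size) {
    long long best = -1;
    long long best_d2 = (long long)sight * sight;
    long long steps = (sight + grid_size - 1) / grid_size; // ceil(sight / grid_size)
    long long cx = dot.first / grid_size, cy = dot.second / grid_size;
    for (long long gx = cx - steps; gx <= cx + steps; ++gx)
        for (long long gy = cy - steps; gy <= cy + steps; ++gy) {
            if (gx < 0 || gy < 0) continue;
            auto it = grid.find(((uint64_t)gx << 32) | (uint64_t)gy);
            if (it == grid.end()) continue;
            for (size_t i : it->second) {
                long long d2 = dist2(dot, food[i]);
                if (d2 <= best_d2) { best_d2 = d2; best = (long long)i; }
            }
        }
    return best;
}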

I believe your approach is incorrect, and this can be verified mathematically. What you can do instead is calculate the magnitude of the vector joining the dot and the food source (via the Pythagorean theorem) and check that this magnitude is less than the sight limit. This deals exclusively with determining relative distance, as defined by the Cartesian coordinate system and the standard unit of measurement. Regarding efficiency, the first order of business is to determine whether the proposed approach is actually faster in computational terms, as measured by time, even if some individual calculations become less time consuming under the alternative implementation. Of course, the ideal is that the total time taken decreases, rather than being merely contained numerically by refactoring.
Now, if the position of a dot can be specified as any two numbers one may choose, this implies a frame of reference called the basis, and also one local to the dot in question; with respect to both, one can quantify position and other such properties. As a consequence, it would seem that you need 2n sorted data structures, where n is the number of dots in the environment, and frankly it is unclear whether this approach would even work, let alone be optimal. You state the constraint that the solution shall not compute the distance from each dot to each food source, but to achieve this you must implement other procedures to derive correct results, which is what my comments on efficiency are about. You may therefore be better off simply calculating the distance in each case. This is somewhat elegant.

Related

Histogram Binning of Gradient Vectors

I am working on a project that has a small component requiring the comparison of distributions over image gradients. Assume I have computed the image gradients in the x and y directions using a Sobel filter and have for each pixel a 2-vector. Obviously getting the magnitude and direction is reasonably trivial: the magnitude is sqrt(gx^2 + gy^2) and the direction is atan2(gy, gx).
However, what is not clear to me is how to bin these two components in to a two dimensional histogram for an arbitrary number of bins.
I had considered something along these lines (written in browser):
//Assuming normalised magnitudes.
//Histogram dimensions are bins * bins.
int getHistIdx(float mag, float dir, int bins) {
    const int magInt = reinterpret_cast<int>(mag);
    const int dirInt = reinterpret_cast<int>(dir);
    const int magMod = reinterpret_cast<int>(static_cast<float>(1.0));
    const int dirMod = reinterpret_cast<int>(static_cast<float>(TWO_PI));
    const int idxMag = (magInt % magMod) & bins;
    const int idxDir = (dirInt % dirMod) & bins;
    return idxMag * bins + idxDir;
}
However, I suspect that the mod operation will introduce a lot of incorrect overlap, i.e. completely different gradients getting placed in to the same bin.
Any insight in to this problem would be very much appreciated.
I would like to avoid using any off the shelf libraries as I want to keep this project as dependency light as possible. Also I intend to implement this in CUDA.
This is more of a "what is a histogram?" question rather than one about your tags. Two things:
In a 2D plane, two directions that are equal modulo 2*pi are in fact the same, so it makes sense to modulate the direction.
I see no practical or logical reason for modulating the norms.
Next, you say you want a "two dimensional histogram", but return a single number. A 2D histogram, and what would make sense in your context, is a 3D plot: the plane is indexed by theta and R, while the third axis is the "count".
So first suggestion, return
return std::make_pair(idxMag, idxDir);
Then you can make a 2D histogram, or 2 2D histograms.
Regarding the "number of bins"
this is use case dependent. You need to define the number of bins you want (maybe different for theta and R). Maybe just some constant 10 bins? Maybe it should depend on the amount of vectors? In any case, you need a function that receives either the number of vectors, or the total set of vectors, and returns the number of bins for each axis. This could be a constant (10 bins) initially, and you can play with it. Once you decide on the number of bins:
Determine the bins
For a bounded case such as 0 ≤ theta < 2*pi, this is easy. Divide the interval equally into the number of bins, assuming a flat distribution. Your modulation would actually handle this well, if you had actually modulated by 2*pi, which you didn't. You would still need to determine the bin bounds though.
For R this gets trickier, as it is unbounded. There are two options here, but both rely on the same tactic: choose a maximal bin. Either choose it arbitrarily (say R = 10), so that any vector longer than that is placed in the "longer than max" bin, and divide the rest equally (for example; you could choose other distributions). Or let the longest vector determine the edge of the maximal bin.
Getting the index
Once you have the bins, you need to find where the magnitude/direction of the current vector falls among your bins. If bins are pairs representing the min/max of a bin (and maybe an index), say in a linked list, then it would be something like this (for the magnitude, for example):
bin = histogram.first;
while (mag > bin.max) bin = bin.next;
magIdx = bin.index;
If the bin does not hold the index you can just use a counter and increase it in the while loop. Also, for the magnitude the final bin should hold "infinity" or some large number as its max. Note this has nothing to do with modulation, though modulation would work for your direction, as you have coded; I don't see how it makes sense for the norm.
Bottom line though, you have to think a bit about what you want. In any case all the "objects" here are trivial enough to write yourself, or even use small arrays.
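As a rough sketch of the above in C++ (equal-width direction bins on [0, 2*pi) and equal-width magnitude bins up to an arbitrary cap, with a final "longer than max" bin; binsR, binsTheta and rMax are illustrative parameters, not from the question):

#include <cmath>
#include <utility>

// Returns (magnitude bin, direction bin); the caller can keep a flat
// counts[binsR * binsTheta] array and index it as idxMag * binsTheta + idxDir.
std::pair<int, int> getHistIdx2D(float mag, float dir,
                                 int binsR, int binsTheta, float rMax) {
    const float TWO_PI = 6.2831853f;
    dir = std::fmod(dir, TWO_PI);      // wrap the direction into [0, 2*pi)
    if (dir < 0.0f) dir += TWO_PI;
    int idxDir = (int)(dir / TWO_PI * binsTheta);
    if (idxDir == binsTheta) idxDir = binsTheta - 1; // guard the upper edge
    int idxMag = (mag >= rMax) ? binsR - 1           // overflow bin
                               : (int)(mag / rMax * (binsR - 1));
    return { idxMag, idxDir };
}

Counting into a plain array keeps the whole thing dependency-free, which should also port straightforwardly to CUDA (one atomic add per increment).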
I think you should arrange your bins in a square array, and then bin by vx and vy independently.
If your gradients are reasonably even you just need to scan the data first to accumulate the min and max in x and y, and then split the gradients evenly.
If the gradients are very unevenly distributed, you might want to sort (e.g.) the vx values first and choose the bin boundaries so that the values divide evenly among the bins.
An intermediate solution might be to obtain the min and max after ignoring the (e.g.) 10% most extreme values.
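For the even-split variant, a minimal sketch for one component (values and bins are illustrative names; the input is assumed non-empty):

#include <algorithm>
#include <vector>

// Scan once for the min and max, then map each value to an equal-width bin.
std::vector<int> binComponent(const std::vector<float>& values, int bins) {
    auto [mnIt, mxIt] = std::minmax_element(values.begin(), values.end());
    float mn = *mnIt;
    float width = (*mxIt - mn) / bins;
    if (width <= 0.0f) width = 1.0f; // all values equal: everything lands in bin 0
    std::vector<int> idx(values.size());
    for (size_t i = 0; i < values.size(); ++i)
        idx[i] = std::min((int)((values[i] - mn) / width), bins - 1);
    return idx;
}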

Precomputed distances for spectral clustering with scikit-learn

I'm struggling to make sense of the spectral clustering documentation here.
Specifically.
If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements, and high values means very dissimilar elements, it can be transformed in a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:
np.exp(- X ** 2 / (2. * delta ** 2))
For my data, I have a complete distance matrix of size (n_samples, n_samples) where large entries represent dissimilar pairs, small values represent similar pairs and zero represents identical entries. (I.e. the only zeros are along the diagonal).
So all I need to do is build the SpectralClustering object with affinity = "precomputed" and then pass the transformed distance matrix to fit_predict.
I'm stuck on the suggested transformation equation. np.exp(- X ** 2 / (2. * delta ** 2)).
What is X here? The (n_samples, n_samples) distance matrix?
If so, what is delta? Is it just X.max() - X.min()?
Calling np.exp(- X ** 2 / (2. * (X.max()-X.min()) ** 2)) seems to do the right thing. I.e. big entries become relatively small, and small entries relatively big, with all the entries between 0 and 1. The diagonal is all 1's, which makes sense, since each point has maximal affinity with itself.
But I'm worried. I think if the author had wanted me to use np.exp(- X ** 2 / (2. * (X.max()-X.min()) ** 2)) he would have told me to use just that, instead of throwing delta in there.
So I guess my question is just this. What's delta?
Yes, X in this case is the matrix of distances. delta is a scale parameter that you can tune as you wish. It controls the "tightness", so to speak, of the distance/similarity relation, in the sense that a small delta increases the relative dissimilarity of faraway points.
Notice that delta corresponds to the gamma parameter of the RBF kernel mentioned earlier in the doc link you give (specifically, gamma = 1 / (2 * delta ** 2)): both are free parameters which can be used to tune the clustering results.
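To spell out that correspondence: equating the two parameterisations of the Gaussian kernel,
np.exp(- X ** 2 / (2. * delta ** 2)) == np.exp(- gamma * X ** 2)
gives gamma = 1 / (2 * delta ** 2), so choosing delta and choosing gamma are two views of the same knob. There is no single "right" value; delta sets the distance scale at which similarity decays, and X.max() - X.min() is merely one heuristic choice of that scale.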

Find optimal route in farm land-dynamic programming/Dijkstra's

I was trying to solve a question on InterviewStreet (the competition has since ended). The problem is to build a ditch from a pond to a farm, given an N*M grid of elevations. The pond and the farm are each one of the tiles within the N*M grid and won't be the same tile.
The elevations are numbers between 0 and 9. Additionally, you are given the coordinates of the pond and the farm (1-indexed, row followed by column), which each take up exactly one tile on the grid. You are to write a program that, given this data, computes the minimum cost to build an irrigation ditch.
More specifically, the input that will be fed into your program will be formatted as follows:
N M
pondLocationX pondLocationY
farmLocationX farmLocationY
elevationX1Y1elevationX1Y2...elevationX1YM
elevationX2Y1elevationX2Y2...elevationX2YM
.
.
.
elevationXNY1elevationXNY2...elevationXNYM
where pondLocationX and farmLocationX are integers in the interval [1, N], and pondLocationY and farmLocationY are integers in the interval [1, M], and all elements are integers in the interval [0, 9]. Note that a single space separates the X and Y coordinates of the farm and pond, but there are no spaces separating the elevations.
Given such an input, your program should print out the minimum cost to build an irrigation ditch from the pond to the farm. The constraints are as follows. The pond and farm will not be at the same location. The elevation of all tiles except for the pond can be increased or decreased at a cost of one for every unit of change (you may leave the elevation the same for a cost of 0). N and M will each be at most 300. After paying for any excavation that is necessary, you can build a ditch at 0 additional cost if there is a sequence of tiles starting at the pond and ending at the farm such that the following are true:
(Contiguous path) Each tile in the sequence is adjacent to the previous tile (no diagonal adjacency -- tiles in the interior of the map have exactly 4 adjacent tiles)
(Downhill path) Each tile in the sequence, including the pond and farm, has an elevation that is at most that of the previous tile in the sequence.
For example, if the input is the following:
3 5
1 1
3 4
27310
21171
77721
then we can build an irrigation ditch at a cost of just 4, since it suffices to lower the tile at location (1, 3) from 3 to 1 (cost 2), raise the tile at position (1, 5) from 0 to 1 (cost 1), and lower the farm, which is at location (3, 4), from 2 to 1 (cost 1). Note that you cannot travel diagonally to get from (2, 3) to (3, 4) in one step.
Solution:
I think this is a variation of Dijkstra's algorithm, i.e. use the farm as the source node, and stop when you calculate the shortest path to the pond. The "adjacent" tiles are your neighbours, and your edge weights are the differences in elevation.
However, you can modify the weights in two ways: if you are higher than your neighbour, then you can either 1) decrease your height to match your neighbour's, or 2) increase your neighbour's height to match yours. This effect can percolate outwards, and I'm not able to capture this in the algorithm.
How can I adjust Dijkstra's algorithm to accommodate the fact that the weights can be changed?
Use the Dijkstra algorithm on the 3D grid N*M*10. Two vertices (x,y,z) and (x',y',z') are connected (with an oriented arc) if (x,y) and (x',y') are adjacent and z' is not greater than z. The cost of the arc is the absolute difference between z' and the initial height at (x',y'). Then find the shortest path from the pond (at its initial height) to the farm (the z coordinate of the endpoint does not matter).
It is possible that the minimal path found in this way passes through the same point (x,y) twice, first as (x,y,z') and later as (x,y,z''). But if this happens you can cut out the part of the path between (x,y,z') and (x,y,z''), since replacing (x,y,z') with (x,y,z'') costs no more than that part of the path. So you can assume that for every point (x,y) the path uses only a single value of z.
The path found this way is therefore a solution to the given problem.
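A compact sketch of that in C++ (state = (row, col, z) with z in 0..9; entering an adjacent tile at level z' ≤ z costs |z' − elevation| there; elev, pondR/pondC, farmR/farmC are illustrative 0-indexed names):

#include <cstdint>
#include <cstdlib>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

int minDitchCost(const std::vector<std::vector<int>>& elev,
                 int pondR, int pondC, int farmR, int farmC) {
    const int N = elev.size(), M = elev[0].size(), Z = 10;
    const int INF = std::numeric_limits<int>::max();
    auto id = [&](int r, int c, int z) { return (r * M + c) * Z + z; };
    std::vector<int> dist(N * M * Z, INF);

    using State = std::pair<int, int>; // (cost so far, packed vertex id)
    std::priority_queue<State, std::vector<State>, std::greater<State>> pq;

    // The pond's elevation cannot be changed: start at its real height.
    int start = id(pondR, pondC, elev[pondR][pondC]);
    dist[start] = 0;
    pq.push({0, start});

    const int dr[4] = {-1, 1, 0, 0}, dc[4] = {0, 0, -1, 1};
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d != dist[u]) continue; // stale queue entry
        int z = u % Z, c = (u / Z) % M, r = u / (Z * M);
        if (r == farmR && c == farmC) return d; // any arrival height is fine
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nr >= N || nc < 0 || nc >= M) continue;
            for (int nz = 0; nz <= z; ++nz) { // downhill-or-level arcs only
                if (nr == pondR && nc == pondC && nz != elev[nr][nc])
                    continue; // the pond tile may not be excavated or raised
                int v = id(nr, nc, nz);
                int cost = d + std::abs(nz - elev[nr][nc]);
                if (cost < dist[v]) { dist[v] = cost; pq.push({cost, v}); }
            }
        }
    }
    return -1; // unreachable (cannot happen on a connected grid)
}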

Algorithm to produce a difference of two collections of intervals

Problem
Suppose I have two collections of intervals, named A and B. How would I find a difference (a relative complement) in a most time- and memory-efficient way?
Picture for illustration:
Interval endpoints are integers (≤ 2^128 − 1) and the intervals are always 2^n long and aligned on an m×2^n lattice (so you can make a binary tree out of them).
Intervals can overlap in the input but this does not affect the output (the result if flattened would be the same).
The problem is that there are MANY intervals in both collections (up to 100,000,000), so naïve implementations will probably be slow.
The input is read from two files and it is sorted in such a way that smaller sub-intervals (if overlapping) come immediately after their parents in order of size. For example:
[0,7]
[0,3]
[4,7]
[4,5]
[8,15]
...
What have I tried?
So far, I've been working on an implementation that builds a binary search tree and, while doing so, aggregates neighbouring intervals ( [0,3],[4,7] => [0,7] ) from both collections, then traverses the second tree and "bumps out" the intervals that are present in both (subdividing the larger intervals in the first tree if necessary).
While this appears to be working for small collections, it requires more and more RAM to hold the tree itself, not to mention the time it needs to complete the insertion and removal from the tree.
I figured that since intervals come pre-sorted, I could use some dynamic algorithm and finish in one pass. I am not sure if this is possible, however.
So, how would I go about solving this problem in an efficient way?
Disclaimer: This is not homework but a modification/generalization of an actual real-life problem I am facing. I am programming in C++ but I can accept an algorithm in any [imperative] language.
Recall one of the first programming exercises we all had back in school - writing a calculator program. Taking an arithmetic expression from the input line, parsing it and evaluating. Remember keeping track of the parentheses depth? So here we go.
Analogy: interval start points are opening parentheses, end points - closing parentheses. We keep track of the parentheses depth (nesting). The depth of two - intersection of intervals, the depth of one - difference of intervals
Algorithm:
No need to distinguish between A and B, just sort all start points and end points in the ascending order
Set the parentheses depth counter to zero
Iterate through the points, starting from the smallest one. If a point is a start point, increment the depth counter; if it is an end point, decrement the counter
Keep track of intervals where the depth is 1; those are the intervals of the A and B difference. The intervals where the depth is 2 are the A and B intersections
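A one-pass sketch of that counter in C++, assuming each collection was flattened first. Note that depth 1 marks points covered by exactly one collection (the symmetric difference); to get specifically A − B, additionally remember whether the currently open depth-1 region was started by an A point.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Interval = std::pair<uint64_t, uint64_t>; // closed [lo, hi]

// Emit the regions covered by exactly one of the two flattened collections.
std::vector<Interval> depthOne(const std::vector<Interval>& A,
                               const std::vector<Interval>& B) {
    std::vector<std::pair<uint64_t, int>> events; // (coordinate, +1 or -1)
    for (const auto* s : {&A, &B})
        for (const auto& [lo, hi] : *s) {
            events.push_back({lo, +1});
            events.push_back({hi + 1, -1}); // closed input -> half-open events
        }
    std::sort(events.begin(), events.end());

    std::vector<Interval> out;
    int depth = 0;
    uint64_t regionStart = 0;
    for (const auto& [x, d] : events) {
        if (depth == 1 && x > regionStart)
            out.push_back({regionStart, x - 1}); // close a depth-1 region
        depth += d;
        regionStart = x;
    }
    return out;
}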
Your intervals are sorted which is great. You can do this in linear time with almost no memory.
Start by "flattening" your two sets. That is for set A, start from the lowest interval, and combine any overlapping intervals until you have an interval set that has no overlaps. Then do that for B.
Now take your two sets and start with the first two intervals. We'll call these the interval indices for A and B, Ai and Bi.
Ai indexes the first interval in A, Bi the first interval in B.
While there are intervals to process do the following:
Consider the start points of both intervals: are they the same? If so, advance the start point of both intervals to the end point of the smaller interval and emit nothing to your output. Advance the index of the smaller interval to the next interval (that is, if Ai ends before Bi, then Ai advances to the next interval). If both intervals end in the same place, advance both Ai and Bi and emit nothing.
Is one start point earlier than the other? If so, emit the interval from the earlier start point to either a) the start of the later interval, or b) the end of the earlier interval, whichever comes first. If you chose option b, advance the index of the earlier interval.
So for example, if the interval at Ai starts first, you emit the interval from the start of Ai to the start of Bi, or to the end of Ai, whichever is smaller. If Ai ended before the start of Bi, you advance Ai.
Repeat until all intervals are consumed.
P.S. I assume you don't have spare memory to flatten the two interval sets into separate buffers, so do this in two functions: a "get next interval" function that advances the interval indices and does the flattening as necessary, feeding flattened data to the differencing function.
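A sketch of such a differencing function, assuming both inputs have already been flattened; half-open [lo, hi) intervals keep the arithmetic simple (names are illustrative):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Interval = std::pair<uint64_t, uint64_t>; // half-open [lo, hi)

// Emit A - B by walking both sorted, non-overlapping lists once.
std::vector<Interval> difference(const std::vector<Interval>& A,
                                 const std::vector<Interval>& B) {
    std::vector<Interval> out;
    size_t bi = 0;
    for (const auto& [lo, hi] : A) {
        uint64_t cur = lo;
        // Skip B intervals that end before this A interval starts.
        while (bi < B.size() && B[bi].second <= cur) ++bi;
        // Carve every overlapping B interval out of [cur, hi).
        for (size_t j = bi; j < B.size() && B[j].first < hi && cur < hi; ++j) {
            if (B[j].first > cur) out.push_back({cur, B[j].first});
            cur = std::max(cur, B[j].second);
        }
        if (cur < hi) out.push_back({cur, hi});
    }
    return out;
}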
What you are looking for is a Sweep line algorithm.
Simple logic then tells you when the sweep line is intersecting an interval in both A and B, and where it intersects only one set.
This is very similar to this problem. Just consider that you have a set of vertical lines passing through the end points of the B's segments.
This algorithm's complexity is O((m+n) log(m+n)), which is the cost of the initial sort; the sweep line itself runs in O(m+n) on a sorted set.
I think you should use boost.icl (Interval Container Library)
http://www.boost.org/doc/libs/1_50_0/libs/icl/doc/html/index.html
#include <iostream>
#include <boost/icl/interval_set.hpp>

using namespace boost::icl;

int main()
{
    typedef interval_set<int> TIntervalSet;
    TIntervalSet intSetA;
    TIntervalSet intSetB;

    intSetA += discrete_interval<int>::closed( 0, 2);
    intSetA += discrete_interval<int>::closed( 9,15);
    intSetA += discrete_interval<int>::closed(12,15);

    intSetB += discrete_interval<int>::closed( 1, 2);
    intSetB += discrete_interval<int>::closed( 4, 7);
    intSetB += discrete_interval<int>::closed( 9,10);
    intSetB += discrete_interval<int>::closed(12,13);

    std::cout << intSetA << std::endl;
    std::cout << intSetB << std::endl;
    std::cout << intSetA - intSetB << std::endl;

    return 0;
}
this prints
{[0,2][9,15]}
{[1,2][4,7][9,10][12,13]}
{[0,1)(10,12)(13,15]}

Best way to program piecewise-linear function on DSP TMS320C5509

There is a table of (X, Y) pairs which defines the bounds of the pieces.
And we are using the straightforward algorithm to compute
y = f(x)
Calculate the index n into the table using x.
Get Yn and Yn+1, compute the linear interpolation Y.
Y is the answer.
So I think there must be a more efficient method; could you please point me to one?
Depending on the number and distribution of pairs, you might be able to instead store a table T containing only the Y values at regular intervals. Pick the interval to be a power of 2: i=2^c. Then for a given X:
n = X >> c;
Y = T[n];
Y += ((T[n+1] - T[n]) * (X & (i-1))) >> c;
This should work as long as you have space for a table with small enough intervals to catch sudden changes in the slope of Y, and enough headroom in Y for the multiply.
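Putting that together as a complete routine, a sketch (T, tableLen and c are illustrative names; the 64-bit intermediate is one way to get the multiply headroom mentioned above):

#include <cstdint>

// Table T holds f() sampled every 2^c units of x; the remainder is
// interpolated with an integer multiply and a shift (no division).
int32_t piecewiseLinear(const int32_t* T, int tableLen, uint32_t x, unsigned c) {
    uint32_t n = x >> c;                  // index of the sample at or below x
    if ((int)n >= tableLen - 1)
        return T[tableLen - 1];           // clamp at the right edge of the table
    uint32_t frac = x & ((1u << c) - 1);  // x modulo the interval size 2^c
    return T[n] + (int32_t)(((int64_t)(T[n + 1] - T[n]) * frac) >> c);
}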
Use binary search for step 1.
EDIT: due to the comment you added afterwards, this is not necessary, since your intervals are equally spaced.
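For the general (non-uniform) case that binary search addresses, a minimal sketch using std::upper_bound (xs is an illustrative sorted array of the pairs' X values, with at least two entries):

#include <algorithm>
#include <vector>

// Find the piece index n such that xs[n] <= x < xs[n+1].
int findPiece(const std::vector<int>& xs, int x) {
    auto it = std::upper_bound(xs.begin(), xs.end(), x);
    if (it == xs.begin()) return 0;                // clamp below the range
    int n = (int)(it - xs.begin()) - 1;
    return std::min(n, (int)xs.size() - 2);        // clamp above the range
}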