Efficient data structure for sparse data lookup - C++

Situation:
Given some points with coordinates (x, y).
Range: 0 < x < 100,000,000 and 0 < y < 100,000,000.
I have to find the smallest square which contains at least N points on its edges or inside it.
I used a vector to store the coordinates and searched all squares from side length minLength up to side length maxLength (applying brute force over the relevant space):
struct Point
{
    int x;
    int y;
};

vector<Point> P;
int minLength = sqrt(N) - 1;
int maxLength = 0;

// bigx   = largest x coordinate of any point
// bigy   = largest y coordinate of any point
// smallx = smallest x coordinate of any point
// smally = smallest y coordinate of any point
maxLength = (bigx - smallx) < (bigy - smally) ? (bigx - smallx) : (bigy - smally);
For each candidate square, I traversed the complete vector to check whether at least N points lie on its edges or inside it.
This was quite time inefficient.
Q1. What data structure should I use to improve time efficiency without changing the algorithm I used?
Q2. Is there a more efficient algorithm for this problem?

There are points on two opposite edges - if not, you could shrink the square by 1 and still contain the same number of points. That means the possible coordinates of the edges are limited to those of the input points. The input points are probably not on the corners, though. (For a minimum rectangle, there would be points on all four edges, as you can shrink one dimension without altering the other.)
The next thing to realize is that each point divides the plane into four quadrants, and each quadrant contains a number of points. (These can add up to more than the total number of points, as the quadrants have a one-pixel overlap.) Let's say that NW(p) is the number of points in one fixed quadrant of p, in this answer's convention those with x >= px and y >= py. Then the number of points in a square is NW(bottomleft) + NW(topright) - NW(bottomright) - NW(topleft).
It's fairly easy to calculate NW(p) for all input points: sort them by x, and for equal x by y. The most northwestern point has NW(p) == 0. The next point has NW(p) == 1 if it's to the southeast of the first point, and NW(p) == 0 otherwise. It's also useful to keep track of SW(p) at this stage, as you're working through the points from west to east and they're therefore not sorted north to south. Having calculated NW(p), you can determine the number of points in a square S in O(1).
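To make the inclusion-exclusion step concrete, here is a small sketch. countNW(x, y) is a hypothetical helper returning the number of input points q with q.x >= x and q.y >= y (however it is obtained - from the per-point values above or from a Fenwick tree over compressed coordinates); the +1 offsets handle the one-pixel overlap mentioned earlier:

long long countNW(long long x, long long y);  // hypothetical: number of points q with q.x >= x and q.y >= y

// Points inside the axis-aligned square [x1, x2] x [y1, y2], edges included.
long long pointsInSquare(long long x1, long long y1, long long x2, long long y2)
{
    return countNW(x1, y1)           // everything up-and-right of the bottom-left corner
         - countNW(x2 + 1, y1)       // minus everything strictly right of the square
         - countNW(x1, y2 + 1)       // minus everything strictly above the square
         + countNW(x2 + 1, y2 + 1);  // plus the part that was subtracted twice
}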
Recall that the square size is restricted by the need to have points on opposite edges. Assume the points are on the left (western) and right edge - you still have the points sorted in x order. Start by assuming the left edge is at your leftmost x coordinate, and see what the right edge must be to contain N points. Now shift the left edge to the next x coordinate and find a new right edge (and thus a new square). Do this until the right edge of the square is at the rightmost point.
It's also possible that the square is constrained in the y direction. Just sort the points in the y direction and repeat, then choose the smaller square of the two outcomes.
Since you're running linearly through the points in the x and y directions, that part is just O(N), and the dominant factor is the O(N log N) sort.

Look at http://en.wikipedia.org/wiki/Space_partitioning for algorithms that use the divide-and-conquer technique to solve this. It is definitely solvable in polynomial time.
Another variant algorithm could run along the following lines:
1. Generate a Voronoi diagram on the points to get neighbour information. [O(n log n)]
2. Now use dynamic programming; the DP is similar to the problem of finding the maximum subarray in a 2D array, except that instead of the sum of numbers you keep a count of the points up to each position.
2.a Essentially, a recurrence similar to the following holds. [O(n)]
    Number of points in the square from (0,0) to (x,y)
        = (number of points in the square from (0,0) to (x-1,y))
        + (number of points in the square from (0,0) to (x,y-1))
        - (number of points in the square from (0,0) to (x-1,y-1))
        + (1 if there is a point at (x,y), else 0)
Your recurrence will have to change to range over the points in the neighbourhood and to the left and above, instead of just the single cells directly above and to the left as written above.
Once the DP table is ready, you can query the number of points in a square in O(1).
Another O(n^2) loop over all possible combinations then finds the smallest such square.
You can even greedily start from the smallest squares first; that way you can end your search as soon as you find a suitable square.
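To make the recurrence concrete, here is a minimal sketch of such a prefix-count table, assuming the point coordinates have already been compressed into a small range [0, W) x [0, H) (the raw range up to 100,000,000 is far too large for a dense table); the names are hypothetical:

#include <utility>
#include <vector>

struct PrefixCount {
    std::vector<std::vector<int>> S;  // S[x][y] = number of points with px <= x and py <= y

    PrefixCount(const std::vector<std::pair<int,int>>& pts, int W, int H)
        : S(W, std::vector<int>(H, 0))
    {
        for (const auto& p : pts)
            ++S[p.first][p.second];               // raw per-cell counts
        for (int x = 0; x < W; ++x)               // turn counts into 2D prefix sums
            for (int y = 0; y < H; ++y) {
                if (x > 0) S[x][y] += S[x-1][y];
                if (y > 0) S[x][y] += S[x][y-1];
                if (x > 0 && y > 0) S[x][y] -= S[x-1][y-1];
            }
    }

    // Number of points with x in [x1, x2] and y in [y1, y2] (all inclusive), in O(1).
    int query(int x1, int y1, int x2, int y2) const {
        int total = S[x2][y2];
        if (x1 > 0) total -= S[x1-1][y2];
        if (y1 > 0) total -= S[x2][y1-1];
        if (x1 > 0 && y1 > 0) total += S[x1-1][y1-1];
        return total;
    }
};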

An R-tree allows spatial searching, but it doesn't have an STL implementation, although SQLite's R-tree module would allow a binding. It can answer queries such as "get all points within range" and "k nearest neighbours".
Finding the region which has the most dense data is a problem similar to clustering.
Iterate over the points and find the N nearest entries to each point. Then generate the smallest enclosing circle - the centre would be at the midpoint of (min(x), max(x)) and (min(y), max(y)). A square can be formed which contains all the neighbours, with a side somewhere between sqrt(2)*r and 2r for a circle of radius r.
Time taken: O(X) to build the structure; O(X N log(X)) to search for the smallest cluster.

Note: There are a bunch of answers for your second question (which will probably reap bigger benefits), but I'm only addressing your first one, i.e. what data structure to use without changing the algorithm.
There, I think that your choice of a vector is already pretty good, because in general vectors offer the best payload/overhead ratio and also the fastest iteration. In order to find specific bottlenecks, use a profiler; otherwise you are only guessing. With large vectors, there are a few things to avoid though:
Overallocation: this wastes space.
Underallocation: this causes copying when the vector is grown to the necessary size.
Copying.
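For instance, if the number of points is known up front, a single reserve call sidesteps both underallocation and the copying it causes. A minimal sketch, where readPoint and count are hypothetical placeholders:

#include <cstddef>
#include <vector>

struct Point { int x; int y; };   // as in the question
Point readPoint();                // hypothetical input routine

std::vector<Point> readAll(std::size_t count)
{
    std::vector<Point> P;
    P.reserve(count);             // exactly one allocation, no copies on growth
    for (std::size_t i = 0; i < count; ++i)
        P.push_back(readPoint());
    return P;
}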

How does this Dijkstra code return minimum value (and not maximum)?

I am solving this question on LeetCode.com called Path With Minimum Effort:
You are given heights, a 2D array of size rows x columns, where heights[row][col] represents the height of cell (row, col). The aim is to go from the top-left cell to the bottom-right cell. You can move up, down, left, or right, and you wish to find a route that requires the minimum effort. A route's effort is the maximum absolute difference in heights between two consecutive cells of the route. Return the minimum effort required to travel from the top-left cell to the bottom-right cell. For example, if heights = [[1,2,2],[3,8,2],[5,3,5]], the answer is 2 (the path shown in green in the accompanying figure).
The code I have is:
class Solution {
public:
    vector<pair<int,int>> getNeighbors(vector<vector<int>>& h, int r, int c) {
        vector<pair<int,int>> n;
        if(r+1<h.size())    n.push_back({r+1,c});
        if(c+1<h[0].size()) n.push_back({r,c+1});
        if(r-1>=0)          n.push_back({r-1,c});
        if(c-1>=0)          n.push_back({r,c-1});
        return n;
    }

    int minimumEffortPath(vector<vector<int>>& heights) {
        int rows=heights.size(), cols=heights[0].size();
        using arr=array<int, 3>;
        priority_queue<arr, vector<arr>, greater<arr>> pq;
        vector<vector<int>> dist(rows, vector<int>(cols, INT_MAX));
        pq.push({0,0,0}); //r,c,weight
        dist[0][0]=0;
        //Dijkstra
        while(pq.size()) {
            auto [r,c,wt]=pq.top();
            pq.pop();
            if(wt>dist[r][c]) continue;
            vector<pair<int,int>> neighbors=getNeighbors(heights, r, c);
            for(auto n: neighbors) {
                int u=n.first, v=n.second;
                int curr_cost=abs(heights[u][v]-heights[r][c]);
                if(dist[u][v]>max(curr_cost,wt)) {
                    dist[u][v]=max(curr_cost,wt);
                    pq.push({u,v,dist[u][v]});
                }
            }
        }
        return dist[rows-1][cols-1];
    }
};
This gets accepted, but I have two questions:
a. Since we update dist[u][v] if it is greater than max(curr_cost, wt), how does it guarantee that in the end we return the minimum effort required? That is, why don't we end up returning the effort of the path shown in red above?
b. Some solutions, such as this one, short-circuit and return immediately when we reach the bottom right for the first time (i.e., if(r==rows-1 and c==cols-1) return wt;) - how does this work? Couldn't we possibly get a smaller dist when we revisit the bottom-right node later?
The problem statement requires that we find the path with the minimum "effort".
And "effort" is defined as the maximum difference in heights between adjacent cells on a path.
The expression max(curr_cost, wt) takes care of the maximum part of the problem statement. When moving from one cell to another, the distance to the new cell is either the same as the distance to the old cell, or it's the difference in heights, whichever is greater. Hence max(difference_in_heights, distance_to_old_cell).
And Dijkstra's algorithm takes care of the minimum part of the problem statement, where instead of using a distance from the start node, we're using the "effort" needed to get from the start node to any given node. Dijkstra's attempts to minimize the distance, and hence it minimizes the effort.
Dijkstra's has two closely related concepts: visited and explored. A node is visited when any incoming edge is used to arrive at the node. A node is explored when its outgoing edges are used to visit its neighbors. The key design feature of Dijkstra's is that after a node has been explored, additional visits to that node will never improve the distance to that node. That's the reason for the priority queue. The priority queue guarantees that the node being explored has the smallest distance of any unexplored nodes.
In the sample grid, the red path will be explored before the green path because the red path has effort 1 until the last move, whereas the green path has effort 2. So the red path will set the distance to the bottom right cell to 3, i.e. dist[2][2] = 3.
But when the green path is explored, and we arrive at the 3 at row=2, col=1, we have
dist[2][2] = 3
curr_cost=2
wt=2
So dist[2][2] > max(curr_cost, wt), and dist[2][2] gets reduced to 2.
The answers to the questions:
a. The red path does set the bottom right cell to a distance of 3, temporarily. But the result of the red path is discarded in favor of the result from the green path. This is the natural result of Dijkstra's algorithm searching for the minimum.
b. When the bottom right node is ready to be explored, i.e. it's at the head of the priority queue, then it has the best distance it will ever have, so the algorithm can stop at that point. This is also a natural result of Dijkstra's algorithm. The priority queue guarantees that after a node has been explored, no later visit to that node will reduce its distance.
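A small sketch of the early-exit variant from question (b), as a fragment meant to drop into the while loop of the code above rather than a standalone program. Note that this shortcut is only valid when the priority queue really pops the smallest effort first:

while (pq.size()) {
    auto [r, c, wt] = pq.top();
    pq.pop();
    if (r == rows - 1 && c == cols - 1)
        return wt;                  // the target is being explored: its effort is final
    if (wt > dist[r][c]) continue;  // stale entry; a better effort was already recorded
    // ... relax the neighbours exactly as in the code above ...
}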

Vector on upper half of hemisphere

I have a normal vector N, which defines the upper hemisphere, and a function which creates random points P on the sphere.
Now I want to know whether a randomly chosen point is on the upper half. Is it safe to assume that if the length of N + P is greater than or equal to 1, P is on the upper half, or is there a better way to calculate this in glm?
@Raxvan gave a perfectly valid answer on how to do it properly: use the dot product and check that it is positive (non-negative).
Answering your original idea, which you also re-stated in the comments:
if the length of N + P is greater than or equal to 1, P is on the upper half
this is an incorrect test. Yes, this test returns "true" for all the correct points, but it does not filter out all the incorrect points. For example, consider N = (0, 0, 1) (the unit vector along the Z-axis) and P = (0.99, 0, -0.14) (a vector just a bit below the XY-plane, at the far end along the X-axis). Obviously P is not in the "upper hemisphere", but N + P is (0.99, 0, 0.86) and its length is about 1.31, which is clearly more than 1.
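A minimal sketch of that dot-product test using GLM (the helper name is hypothetical; N and P are assumed to be glm::vec3):

#include <glm/glm.hpp>

// P lies in the upper hemisphere defined by N exactly when the angle between
// them is at most 90 degrees, i.e. when the dot product is non-negative.
bool isInUpperHemisphere(const glm::vec3& N, const glm::vec3& P)
{
    return glm::dot(N, P) >= 0.0f;
}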

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follows:
vector<vector<double>> matrix;
Each cell can have a value greater than or equal to 0, and this value represents how likely the cell is to be chosen. In particular, for example, a cell with a value equal to 3 has three times the probability of being chosen compared to a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) at random, but according to their probability of being selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * (number of cells) and uses O(log(number of cells)) space. It is good when N is small.
B works in time approximately O((number of cells + N) * log(number of cells)) and uses O(number of cells) space. So it is good when N is large (or even 'medium'), but it uses a lot more memory; in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me whether you assume they are normalized or not.) That means: sum all the entries and divide each entry by that sum. (This part is potentially slow, so it's better if you assume or require that it has already happened.)
Then you sample like this:
1. Choose a random [i, j] entry of the matrix (by choosing i and j each uniformly at random from the range of integers 0 to n-1).
2. Choose a uniformly random real number p in the range [0, 1].
3. Check if matrix[i][j] > p. If so, return the pair [i, j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output is equal to the probability that [i][j] was selected (which is the same for each entry) times the probability that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling chooses each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is: suppose we arbitrarily choose a number k and consider the distribution of the algorithm conditioned on stopping exactly after k rounds. No matter what value of k we choose, the distribution we sample has to be exactly right by the above argument, since once we eliminate the case that p is too small, the remaining possibilities all have their proportions correct. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect as well.
If you want to analyze the number of rounds typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 in any particular round. Since the rounds are independent, this probability is the same for every round, and statistically it means that the number of rounds is geometrically distributed. That means it is concentrated around its mean (within a constant factor, with high probability), and we can determine the mean by knowing that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability of stopping at step 3 given that we chose any particular entry [i][j]. By the law of total probability, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So the expected number of rounds per sample is about n^2 (that is, n^2 up to a constant factor), no matter what the entries in the matrix are. I don't think you can hope to do much better than that -- it's about the same amount of time it takes just to read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read in full.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
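A minimal sketch of method A under the assumptions above (a square n x n matrix with non-negative entries that sum to 1); the function name sampleOne is hypothetical:

#include <random>
#include <utility>
#include <vector>

std::pair<int, int> sampleOne(const std::vector<std::vector<double>>& matrix,
                              std::mt19937& gen)
{
    const int n = static_cast<int>(matrix.size());
    std::uniform_int_distribution<int> cell(0, n - 1);
    std::uniform_real_distribution<double> real(0.0, 1.0);
    while (true) {
        int i = cell(gen), j = cell(gen);  // step 1: pick a cell uniformly
        double p = real(gen);              // step 2: pick p uniformly in [0, 1]
        if (matrix[i][j] > p)              // step 3: accept in proportion to the entry
            return {i, j};
    }
}

To draw N cells, call it N times with the same generator.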
Method B:
Basically you just want to compute a cumulative histogram and sample from it by inversion, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make the cumulative histogram (needs <map>, <vector>, <cstdlib>)
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;
histogram_type histogram;
double cumulative = 0.0;
for (uint i = 0; i < Matrix.size(); ++i) {
    for (uint j = 0; j < Matrix[i].size(); ++j) {
        cumulative += Matrix[i][j];
        histogram[cumulative] = std::make_pair(i, j);
    }
}

std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
    // Do a sample (this should never repeat; if it does not find a lower bound
    // you could also assert false quite reasonably, since it means something is
    // wrong with the random number generation).
    while (true) {
        // For best results use std::mt19937 or boost::mt19937 and sample a
        // real in the range [0, cumulative] here instead of rand().
        double p = cumulative * (rand() / (double)RAND_MAX);
        histogram_type::iterator it = histogram.lower_bound(p);
        if (it != histogram.end()) {
            result.push_back(it->second);
            break;
        }
    }
}
return result;
Here the time to build the histogram is something like (number of cells) * O(log(number of cells)), since inserting into the map takes O(log n) time. You need an ordered data structure in order to get the cheap lookups, N * O(log(number of cells)), later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As @Bob__ points out in the comments, in method (B) as written there is potentially going to be some error due to floating-point round-off if the matrices are quite large, even when using type double, at this line:
cumulative += Matrix[i][j];
The problem is that if cumulative becomes much larger than Matrix[i][j], beyond what the floating-point precision can represent, then each time this statement is executed you may observe small errors which accumulate and introduce significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] in ascending order first, so that the small values are accumulated before the large ones. You could even do this in the general implementation to be safe -- sorting them isn't going to take more time asymptotically than you already spend anyway.

Predict the required number of preallocated nodes in a kD-Tree

I'm implementing a dynamic kD-tree in array representation (storing the nodes in a std::vector) in breadth-first fashion. Each i-th non-leaf node has a left child at (i<<1)+1 and a right child at (i<<1)+2. It should support incremental insertion of points and collection of points.
However, I'm having a problem determining the required number of possible nodes in order to incrementally preallocate space.
I've found a formula on the web, which seems to be wrong:
N = min(m − 1, 2n − ½m − 1),
where m is the smallest power of 2 greater than or equal to n, the
number of points.
My implementation of the formula is the following:
size_t required(size_t n)
{
    size_t m = nextPowerOf2(n);
    return min(m - 1, (n << 1) - (m >> 1) - 1);
}
The function nextPowerOf2 returns the smallest power of 2 greater than or equal to n.
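For reference, a minimal sketch of such a helper (assuming n > 0):

#include <cstddef>

// Smallest power of 2 greater than or equal to n.
std::size_t nextPowerOf2(std::size_t n)
{
    std::size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}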
Any help would be appreciated.
Each node of a kd-tree divides the space into two spaces. Hence, the number of nodes in the kd-tree depends on how you perform this division:
1) If you divide at the midpoint of the space (that is, if the space runs from x1 to x2, you divide it with the line x3 = (x1+x2)/2), then:
i) each point will be allocated its own node, and
ii) each intermediate node will be empty.
In this case, the number of nodes will depend on how large the coordinates of the points are. If the coordinates are bounded by |X|, then the total number of nodes in the kd-tree should be slightly less than log |X| * n (more precisely, around log |X| * n - n log n + 2n) in the worst case. To see this, consider the following way to add the points: you add multiple collections, each collection consisting of two extremely nearby points located at random. For each pair of points, the tree will need to keep dividing the space about log |X| times, creating about log |X| intermediate nodes in the process, if log |X| is significantly larger than log n.
2) If you divide using a point as the dividing line, then each node (including the intermediate nodes) will contain a point. Thus, the total number of nodes is simply n. However, note that using a point to divide the space may yield very bad performance if the points are not given in a random order (for example, if the points are given in ascending order of X, the depth of the tree would be O(n); for comparison, the depth of the tree in (1) is at most O(log |X|)).

Programming task: sum of submatrices

I have a problem with my programming task. In fact, I solved it, but my code doesn't pass some of the tests (time limit exceeded).
The text of the task is the following:
We have a matrix of size N*N. The first line of input contains two ints: N and K. K is the number of lines that define submatrices.
The next N lines contain the elements of the main matrix (whitespace as the delimiter between elements, \n as the delimiter between lines). After that we have K lines that define submatrices.
The definition is the following:
y_l x_l y_r x_r, where (x_l, y_l) are the column and line of the top-left corner of the submatrix in the main matrix and (x_r, y_r) are the column and line of the bottom-right corner. We have to calculate the sums of all submatrices and divide them into equivalence classes (submatrices belong to one class if their sums are equal).
The output of the program should be the following:
three ints (separated by whitespace), where the first one is the number of equivalence classes, the second one is the number of equivalence classes that have the maximum number of elements, and the third one is the average of the sums of all submatrices.
From the tests I figured out that the problem is in the calculation of the sums:
while (true) {
    for (int i = x_l; i <= x_r; i++)
        sum += *diff++;
    if (diff == d_end) break;
    d_start = d_start + size;
    diff = d_start;
}
But I have no idea how to optimize it. Maybe someone can give me an algorithm or some ideas on how to calculate those sums faster.
Thanks.
UPDATE: Answer
After a few days of searching I finally got a working version of my program. Thanks to Yakk, who gave some very useful advice.
Here's the final code.
A very useful link that I strangely couldn't find before until I asked a very specific question (based on the information that Yakk gave me): link.
I hope that my code might be helpful for somebody in the future.
Build a sum matrix.
At location (a,b) in the sum matrix, store the sum of all elements of the original matrix that are to the left of and above (a,b), including the element at (a,b) itself.
Now calculating the sum of a submatrix takes 4 lookups, one addition and two subtractions. Draw a 4x4 matrix and express the bottom-right 2x2 submatrix using such sums to see how.
If you double the stored data you can halve the lookups. But I would not bother.
Building the sum matrix requires only a modest amount of work if you do it carefully.
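A minimal sketch of that sum matrix, using 1-based indexing with a zero border so the edges need no special cases; the query is the 4-lookup formula described above (indices 1-based and inclusive, as in the task statement):

#include <vector>
using std::vector;

// S[i][j] = sum of the elements in rows 1..i and columns 1..j of the N*N
// matrix a (a is 0-based; S has an extra zero row and column).
vector<vector<long long>> buildSums(const vector<vector<int>>& a)
{
    const int N = static_cast<int>(a.size());
    vector<vector<long long>> S(N + 1, vector<long long>(N + 1, 0));
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N; ++j)
            S[i][j] = a[i-1][j-1] + S[i-1][j] + S[i][j-1] - S[i-1][j-1];
    return S;
}

// Sum of the submatrix with top-left (y_l, x_l) and bottom-right (y_r, x_r).
long long submatrixSum(const vector<vector<long long>>& S,
                       int y_l, int x_l, int y_r, int x_r)
{
    return S[y_r][x_r] - S[y_l-1][x_r] - S[y_r][x_l-1] + S[y_l-1][x_l-1];
}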