locality sensitive hashing for spatial data

locality sensitive hashing for spatial data - mapreduce

I am looking for a Locality Sensitive Hashing algorithm to split my spatial data into a number of buckets (reducer tasks). The spatial data are actually trajectories, so from my understanding of LSH, a trajectory will be represented as a set of 2D points.
Thanks,
Adam

Most probably you want a QuadTree:
"Quadtrees are most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions."
You could store the actual points in a quadtree, and define trajectories as lists of indices referencing points in the quadtree.
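As a rough illustration of that idea, here is a minimal point-quadtree sketch in C++. The names Point, QuadTree, capacity and max_depth are assumptions made for this example and do not come from any particular library. The tree stores indices into an external point array, so a trajectory can simply be a list of such indices.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Point { double x, y; };

// Quadtree over the half-open box [x0, x1) x [y0, y1).
// Leaves hold indices into an external point array and split into four
// quadrants once they hold more than `capacity` indices (hypothetical knob).
class QuadTree {
public:
    QuadTree(double x0, double y0, double x1, double y1,
             std::size_t capacity = 4, int max_depth = 16)
        : x0_(x0), y0_(y0), x1_(x1), y1_(y1),
          capacity_(capacity), max_depth_(max_depth) {
        assert(x0 < x1 && y0 < y1);
    }

    // Insert point index `idx`; returns false if the point is out of bounds.
    bool insert(const std::vector<Point>& pts, std::size_t idx) {
        const Point& p = pts[idx];
        if (p.x < x0_ || p.x >= x1_ || p.y < y0_ || p.y >= y1_) return false;
        place(pts, idx);
        return true;
    }

    // Append to `out` the indices of all points inside the closed query box.
    void query(const std::vector<Point>& pts, double qx0, double qy0,
               double qx1, double qy1, std::vector<std::size_t>& out) const {
        if (qx1 < x0_ || qx0 >= x1_ || qy1 < y0_ || qy0 >= y1_) return;
        for (std::size_t i : indices_) {
            const Point& p = pts[i];
            if (p.x >= qx0 && p.x <= qx1 && p.y >= qy0 && p.y <= qy1)
                out.push_back(i);
        }
        for (const auto& child : children_)
            if (child) child->query(pts, qx0, qy0, qx1, qy1, out);
    }

private:
    void place(const std::vector<Point>& pts, std::size_t idx) {
        if (children_[0]) { child_for(pts[idx]).place(pts, idx); return; }
        indices_.push_back(idx);
        // max_depth_ stops endless splitting when many points coincide.
        if (indices_.size() > capacity_ && max_depth_ > 0) split(pts);
    }

    void split(const std::vector<Point>& pts) {
        const double xm = (x0_ + x1_) / 2, ym = (y0_ + y1_) / 2;
        for (int q = 0; q < 4; ++q) {
            children_[q] = std::make_unique<QuadTree>(
                (q & 1) ? xm : x0_, (q & 2) ? ym : y0_,
                (q & 1) ? x1_ : xm, (q & 2) ? y1_ : ym,
                capacity_, max_depth_ - 1);
        }
        std::vector<std::size_t> moved;
        moved.swap(indices_);
        for (std::size_t i : moved) child_for(pts[i]).place(pts, i);
    }

    QuadTree& child_for(const Point& p) {
        const double xm = (x0_ + x1_) / 2, ym = (y0_ + y1_) / 2;
        return *children_[(p.x >= xm ? 1 : 0) | (p.y >= ym ? 2 : 0)];
    }

    double x0_, y0_, x1_, y1_;
    std::size_t capacity_;
    int max_depth_;
    std::array<std::unique_ptr<QuadTree>, 4> children_;
    std::vector<std::size_t> indices_;
};

// A trajectory is then just a list of indices into the shared point array.
using Trajectory = std::vector<std::size_t>;
```

For the MapReduce use case, one option (again only an assumption) would be to treat each leaf, or each node at a fixed depth, as one bucket.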

Related

Finding the best algorithm for nearest neighbor search in a 2D plane with moving points

I am looking for an efficient way to perform nearest neighbor searches within a specified radius in a two-dimensional plane. According to Wikipedia, space-partitioning data structures, such as:
k-d trees,
r-trees,
octrees,
quadtrees,
cover trees,
metric trees,
BBD trees,
locality-sensitive hashing,
and bins,
are often used for organizing points in a multi-dimensional space and can provide O(log n) performance for search and insert operations. However, in my case, the points in the two-dimensional plane are moving at each iteration, so I need to update the tree accordingly. Rebuilding the tree from scratch at each iteration seems easier, but I would like to avoid it if possible because the points only move slightly between iterations.
I have read that k-d trees are not naturally balanced, which could be an issue in my case. R-trees are better suited to storing rectangles. Bin algorithms, on the other hand, are easy to implement and provide near-linear search performance within local bins.
I am working on an autonomous agent simulation where 1,000,000 agents are rendered on the GPU, and the CPU is responsible for computing the next movement of each agent. Each agent is influenced by other agents within its line of sight, or in other words, other agents within a circular sector of angle θ and radius r. Here are the specific requirements for my use case:
The search space is a 2D plane.
Each object is a point identified by its x,y coordinates.
All points are frequently updated by a small amount.
I cannot afford any O(n^2) algorithms.
Searches are within a radius (circular sector).
Searches must return all candidates within the search surface.
Given these considerations, what would be the best algorithms for my use case?

I think you could solve this with a sort of scheduling approach. Suppose you know that no object will move more than distance d in each iteration, and you want to know which objects are within distance X of each other on each iteration. Because both objects in a pair can move, their separation changes by at most 2d per iteration. So, given the distances between all objects, the only pairs whose neighbor status could change on the next iteration are those with a distance between X-2d and X+2d. The iteration after that it would be X-4d and X+4d, and so on.
So I'm thinking that you could do an initial distance calculation between all pairs of objects, and then, based on each distance, create an NxN matrix where the value in each cell is the iteration at which you will next need to re-check that pair's distance. When you re-check a pair during that iteration, you update its value in the matrix to the next iteration at which it needs to be checked.
The only problem is whether calculating an initial NxN distance matrix is feasible: with the 1,000,000 agents in the question, that is on the order of 10^12 pairs.
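A small sketch of the scheduling rule, assuming each object moves at most d per iteration so that the separation of a pair changes by at most 2d. The function name and parameter names are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Number of iterations after which a pair should next be re-checked.
// `dist` is the pair's current separation, `radius` is the neighbor
// distance X, and `max_step` is d, the most one object moves per iteration.
// Both objects may move, so the separation changes by at most 2 * max_step
// per iteration. Rounding down may re-check slightly early but never late.
long iterations_until_recheck(double dist, double radius, double max_step) {
    assert(max_step > 0.0);
    const double gap = std::fabs(dist - radius);
    const long k = static_cast<long>(std::floor(gap / (2.0 * max_step)));
    return std::max(1L, k);  // always wait at least one iteration
}
```

With X = 5 and d = 1, a pair currently 10 apart has a gap of 5 and would be re-checked 2 iterations later, while a pair 5.5 apart is re-checked on the very next iteration.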

Partition large amount of 3D point data

I need to partition a large set of 3D points (using C++). The points are stored on the HDD as binary float array, and the files are usually larger than 10GB.
I need to divide the set into smaller subsets that have a size less than 1GB.
The points in the subset should still have the same neighborhood because I need to perform certain algorithms on the data (e.g., object detection).
I thought I could use a KD-Tree. But how can I construct the KD-Tree efficiently if I can't load all the points into RAM? Maybe I could map the file as virtual memory. Then I could save a pointer to each 3D point that belongs to a segment and store it in a node of the KD-Tree. Would that work? Any other ideas?
Thank you for your help. I hope you can understand the problem :D

You basically need an out-of-core algorithm for computing (approximate) medians. Given a large file, find its median and then partition it into two smaller files. A k-d tree is the result of applying this process recursively along varying dimensions (and when the smaller files start to fit in memory, you don't have to bother with the out-of-core algorithm any more).
To approximate the median of a large file, use reservoir sampling to grab a large but in-memory sample, then run an in-core median-finding algorithm. Alternatively, for an exact median, compute approximate 45th and 55th percentiles (for example), then make another pass to extract the data points between them and compute the median exactly (unless the sample was unusually non-random, in which case retry). Details are in the Motwani and Raghavan book on randomized algorithms.
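The sampling step could look roughly like the following. This is only a sketch: the class name ReservoirMedian, the sample capacity and the fixed seed are assumptions for the example. Values are fed in one at a time, for instance one coordinate of each point as the file is streamed from disk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Keeps a uniform random sample of at most `capacity` values from a stream
// (reservoir sampling, "Algorithm R") and reports the sample's median as an
// approximation of the stream's median.
class ReservoirMedian {
public:
    explicit ReservoirMedian(std::size_t capacity, std::uint64_t seed = 42)
        : capacity_(capacity), rng_(seed) {
        sample_.reserve(capacity);
    }

    // Offer the next value of the stream.
    void offer(float value) {
        ++seen_;
        if (sample_.size() < capacity_) {
            sample_.push_back(value);
            return;
        }
        // Keep the new value with probability capacity / seen.
        std::uniform_int_distribution<std::uint64_t> pick(0, seen_ - 1);
        const std::uint64_t j = pick(rng_);
        if (j < capacity_) sample_[j] = value;
    }

    // Median of the current sample (upper median for even sample sizes).
    float median() {
        assert(!sample_.empty());
        auto mid = sample_.begin() + sample_.size() / 2;
        std::nth_element(sample_.begin(), mid, sample_.end());
        return *mid;
    }

private:
    std::size_t capacity_;
    std::uint64_t seen_ = 0;
    std::mt19937_64 rng_;
    std::vector<float> sample_;
};
```

The returned value can then serve as the split point: a second pass over the file writes points below it to one output file and the rest to another, and the process repeats on each half along the next dimension.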

Extract the upper and lower boundaries from a list (vector) of 2d coordinates

My program contains polygons which have the form of a vector containing points (two-dimensional double coordinates, stored in a self-made structure). I'm looking for a quick way of finding the smallest square containing my polygon (i.e., knowing the maximal and minimal coordinates of all the points).
Is there a quicker way than just parsing all the points and storing the minimum and maximum values?

The algorithm you are describing is straightforward: iterate over all your points and find the minimum and maximum for each coordinate. This is an O(n) algorithm, n being the number of points you have.
You can't do better, since you need to check every point at least once; otherwise an unchecked point could lie outside the square you found.
Since the complexity is at best O(n), all you can do is minimize the constant factors, and here they are already pretty small: a single loop over your vector, tracking two maximums and two minimums.
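For reference, the single pass might be written as below. Point and BoundingBox are placeholder types standing in for the self-made structure mentioned in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point { double x, y; };
struct BoundingBox { double min_x, min_y, max_x, max_y; };

// One pass over the points, tracking the minimum and maximum of each
// coordinate. Requires at least one point.
BoundingBox bounding_box(const std::vector<Point>& pts) {
    assert(!pts.empty());
    BoundingBox box{pts[0].x, pts[0].y, pts[0].x, pts[0].y};
    for (const Point& p : pts) {
        box.min_x = std::min(box.min_x, p.x);
        box.min_y = std::min(box.min_y, p.y);
        box.max_x = std::max(box.max_x, p.x);
        box.max_y = std::max(box.max_y, p.y);
    }
    return box;
}
```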

You can either iterate through all points and find the max and min values, or do some preprocessing, for example by storing your points in a treap (http://en.wikipedia.org/wiki/Treap).
Without some preprocessing, there is no way to do better than iterating over all points.

I'm not sure if there can be any faster way to find the min & max values in an array of values than linear time. The only 'optimization' I can think of is to find these values on one of the other occasions you're iterating the array (filling it/performing a function on all points), then perform checks on any data update.

Thresholding a huge matrix to avoid overuse of memory, C++

I want to generate a huge weighted undirected graph, represented by a huge adjacency matrix AJM. So for the loop over i and j,
AJM[i][j] = AJM[j][i]
AJM[i][i] = 0
The weights are generated as random double numbers in the interval, say [0.01, 10.00]. If I have 10k vertices, the matrix would be 10k by 10k with double type entries, which is a huge chunk in the memory if I store it.
Now I want to set a threshold E for the wanted number of edges, and ignore all the edges with weight larger than some threshold T (T is determined by E, E is user-defined), storing just the smallest E edges with weight under T in a vector for later use. Could you give me some suggestions on how to achieve this in an efficient manner? It is best to avoid any kind of storage of the whole adjacency matrix and just use a streaming structure. So I'm wondering how I should generate the matrix and do the thresholding?
I guess writing and reading file is needed, right?
One approach would be, after some kind of manipulation with file, I set the threshold E and do the following:
I read the element from the matrix one by one so I don't read in the whole matrix (could you show some lines of C++ code for achieving this?), and insert its weight into a min-heap, store its corresponding edge index in a vector. I stop when the size of the heap reaches E so that the vector of edge indices is what I want.
Do you think it's the right way to do it? Any other suggestions? Please point out any errors I may have made here. Thank you so much!

If there is no need to keep the original threshold-ed graph then it sounds like there is an easy way to save yourself a lot of work. You are given the number of vertices (V=10,000), and the number of edges (E) is user configurable. Just randomly select pairs of vertices until you have the required number of edges. Am I missing an obvious reason why this would not be equivalent?
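A minimal sketch of that suggestion follows. The Edge struct, the function name and the max_weight parameter are assumptions for illustration; in particular, drawing the kept weights uniformly from [0.01, max_weight) is one guess at how to mimic the thresholded weights, not something stated above. Rejection sampling like this is only sensible when E is much smaller than the V(V-1)/2 possible edges.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <utility>
#include <vector>

struct Edge { std::size_t u, v; double weight; };

// Pick `num_edges` distinct undirected edges (u < v, no self-loops)
// uniformly at random, without ever materializing the adjacency matrix.
std::vector<Edge> random_edges(std::size_t num_vertices, std::size_t num_edges,
                               double max_weight, unsigned seed = 1) {
    assert(num_vertices >= 2 && max_weight > 0.01);
    assert(num_edges <= num_vertices * (num_vertices - 1) / 2);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> vertex(0, num_vertices - 1);
    std::uniform_real_distribution<double> weight(0.01, max_weight);

    std::set<std::pair<std::size_t, std::size_t>> used;
    std::vector<Edge> edges;
    edges.reserve(num_edges);
    while (edges.size() < num_edges) {
        std::size_t u = vertex(rng), v = vertex(rng);
        if (u == v) continue;                        // AJM[i][i] = 0
        if (u > v) std::swap(u, v);                  // AJM[i][j] = AJM[j][i]
        if (!used.insert({u, v}).second) continue;   // pair already chosen
        edges.push_back({u, v, weight(rng)});
    }
    return edges;
}
```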

O(n log n) algorithm for collision detection

I'm building a game engine and I was wondering: are there any algorithms out there for collision detection that have a time complexity of O(N log N)?
I haven't written any code yet, but I can only think of an O(N^2) algorithm (i.e., two nested for-loops over a list of objects to check for collisions).
Any advice and help will be appreciated.
Thanks

Spatial partitioning can create O(n log(n)) solutions. Depending on the exact structure and nature of your objects, you'll want a different spatial partitioning algorithm, but the most common are octrees and BSP.
Basically, the idea of spatial partitioning is to group objects by the space they occupy. An object in node Y can never collide with an object in node X (unless X is a subnode of Y or vice versa). Then you partition the objects by which go in which nodes. I implemented an octree myself.

You can minimize the number of checks by sorting the objects into areas of space.
(There is no point in checking for a collision between an object near 0,0 and one near 1000,1000.)
The obvious solution would be to successively divide your space in half and use a tree (BSP) structure. This works best for sparse clouds of objects, though; otherwise you spend all your time checking whether an object near a boundary hits an object just on the other side of the boundary.

I assume you have a restricted interaction length, i.e. when two objects are a certain distance, there is no more interaction.
If that is so, you normally would divide your space into domains of appropriate size (e.g. interaction length in each direction). Now, for applying the interaction to a particle, all you need to do is go through its own domain and the nearest neighbor domains, because all other particles are guaranteed further away than the interaction length is. Of course, you must check at particle updates whether any domain boundary is crossed, and adjust the domain membership accordingly. This extra bookkeeping is no problem, considering the enormous performance improvement due to the lowered interacting pair count.
For more tricks I suggest a scientific book about numerical N-Body-Simulation.
Of course each domain has its own list of particles that are in that domain. There's no use having a central list of particles in the simulation and going through all entries to check on each one whether it's in the current or the neighboring domains.
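The domain scheme described above might be sketched in 2D as follows. The class name CellGrid, the hash-map storage and the packing of cell coordinates into one 64-bit key are choices made for this example, and the sketch assumes cell coordinates fit in 32 bits and that the cell size is at least the interaction length.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { double x, y; };

// Uniform grid of square cells; each cell keeps its own list of particle
// indices, so a neighbor search only visits the 3x3 block around a particle.
class CellGrid {
public:
    explicit CellGrid(double cell_size) : cell_(cell_size) { assert(cell_size > 0.0); }

    void build(const std::vector<Point>& pts) {
        cells_.clear();
        for (std::size_t i = 0; i < pts.size(); ++i)
            cells_[key(coord(pts[i].x), coord(pts[i].y))].push_back(i);
    }

    // Bookkeeping after particle i moved: only acts if a boundary was crossed.
    void move(std::size_t i, const Point& from, const Point& to) {
        const std::uint64_t a = key(coord(from.x), coord(from.y));
        const std::uint64_t b = key(coord(to.x), coord(to.y));
        if (a == b) return;  // still in the same cell
        auto& old_cell = cells_[a];
        auto it = std::find(old_cell.begin(), old_cell.end(), i);
        if (it != old_cell.end()) { *it = old_cell.back(); old_cell.pop_back(); }
        cells_[b].push_back(i);
    }

    // Indices of all particles within `radius` of particle i (excluding i).
    // Requires radius <= cell size so the 3x3 block is sufficient.
    std::vector<std::size_t> neighbors(const std::vector<Point>& pts,
                                       std::size_t i, double radius) const {
        assert(radius <= cell_);
        std::vector<std::size_t> out;
        const std::int64_t cx = coord(pts[i].x), cy = coord(pts[i].y);
        for (std::int64_t dx = -1; dx <= 1; ++dx) {
            for (std::int64_t dy = -1; dy <= 1; ++dy) {
                auto it = cells_.find(key(cx + dx, cy + dy));
                if (it == cells_.end()) continue;
                for (std::size_t j : it->second) {
                    const double ex = pts[j].x - pts[i].x, ey = pts[j].y - pts[i].y;
                    if (j != i && ex * ex + ey * ey <= radius * radius) out.push_back(j);
                }
            }
        }
        return out;
    }

private:
    std::int64_t coord(double v) const {
        return static_cast<std::int64_t>(std::floor(v / cell_));
    }
    // Pack two cell coordinates (assumed to fit in 32 bits each) into one key.
    static std::uint64_t key(std::int64_t cx, std::int64_t cy) {
        return (static_cast<std::uint64_t>(cx) << 32) |
               (static_cast<std::uint64_t>(cy) & 0xffffffffULL);
    }

    double cell_;
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> cells_;
};
```

A hash map keeps memory proportional to the number of occupied cells; a dense 2D array of cells would be a reasonable alternative when the simulation area is bounded and known in advance.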

I'm using an octree for positions in 3D, which can be quite inhomogeneously distributed. The tree (re-)build is usually quite fast, but O(N log(N)). Finding all collisions for a given particle can then be done in O(K), with K the number of collisions per particle; in particular, there is no factor of log(N). So finding all collisions takes O(K*N) after the tree build.
