parallel quadtree construction from morton ordered points - c++

I have a collection of points [(x1,y1),(x2,y2), ..., (xn,yn)] which are Morton sorted. I wish to construct a quadtree from these points in parallel. My intuition is to construct a subtree on each core and merge all subtrees to form a complete quadtree. Can anyone provide some high level insights or pseudocode how may I do this efficiently?

First some thought on your plan:
Are you sure that parallelizing construction will help? I think there is a risk that you won't a much speedup. Quadtree construction is rather cheap on the CPU, so it will be partly bound by your memory bandwidth. Parallelization may not help much, unless you have separate memory buses, for example separate machines.
If you want to parallelize construction on parallel machines, it may be cheapest to simply create separate quadtrees by splitting your point collection in evenly sized chunks. This has one big advantage over other solution: When you want insert more points, or want to look up points, the morton order allows you to pretty efficiently determine which tree contains the point (or should contain it, for insertion). For window queries you can do a similar optimization, if the morton-codes of the 'min/min' and the 'max/max' corners of the query-window lie in the same 'chunk' (sub-tree), then you only need to query this one tree. More optimizations are possible.
If you really want to create a single quadtree on a single machine, there are several ways to split your dataset efficiently:
Walk through all points and identify global min/max. Then walk through all points and assign them (assuming 4 cores) to each core, where each core represents a quadrant. These steps are well parallelizable by splitting the dataset into 4 evenly sized chunks, and it results in a quadtree that exactly fits your dataset. You will have to synchronize insertion, into the trees, but since the dataset is morton ordered, there should be relatively few lock collisions.
You can completely avoid lock collisions during insertion by aligning the quadtrants with Morton coordinates, such that the morton-curve (a z-curve) crosses the quadrant borders only once. Disadvantage: the tree will be imbalanced, i.e. it is unlikely that all quadrants contain the same amount of data. This means your CPUs may have considerably different workloads, unless you split the sub-tree into sub-sub-trees, and so on, to distribute the load better. The split-planes for avoiding the z-curve to cross quadrant borders can be identified on the morton-code/z-code of your coordinates. Split the z-code in chunks of two bits, each to bits tell you which (sub-)quadrant to choose, i.e. 00 is lower/left, 01 is lower/right, 10 is upper/left and 11 is upper/right. Since your points a morton ordered, you can simply use binary search to find the chunks for each quadrant. I realize this maybe sound rather cryptic without more explanation. So maybe you can have a look at the PH-Tree, it is essentially are Z-Ordered (morton-ordered) quadtree (more a 'trie' than a 'tree'). There are also some in-depth explanations here and here (shameless self advertisement). The PH-Tree has some nice properties, such as inherently limiting depth to 64 levels (for 64bit numbers) while guaranteeing small nodes (4 entries max for 2 dimensions); it also guarantees, like the quadtree, that any insert/removal will never affect more than one node, plus possibly adding or removing a second node. There is also a C++ implementation here.

Related

3D-Grid of bins: nested std::vector vs std::unordered_map

pros, I need some performance-opinions with the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>> grid;// (SizeX, SizeY, SizeZ);
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add&delete deques when data is added/deleted. Probably less memory used, but maybe less fast? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, say object types ObjA, ObjB, ObjC per grid point, then my data essentially becomes 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of
grid-positions
Which is the nearest non-empty ObjC-bucket to
position x,y,z?
PS:
I had also thought about a tree based data-structure before, reading about nearest neighbour approaches. Since my data is so regular I had thought I'd save all the tree-building dividing of the cells into smaller pieces and just make a static 3D-grid of the final leafs. Thats how I came to ask about the best way to store this grid here.
Question associated with this, if I have a map<int, Obj> is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way the building of the above mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array that has fortran-ordering. It's a little bit like the chunks games like minecraft use. Which is a little like using a kd-tree with fixed leaf-size and fixed amount of leaves? Works pretty fast now so I'm happy with this approach.
Answer to 1st question
As #Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector or an array if you will. std::vector brings easier maintenance at a minimal overhead, so I'd prefer that. In terms of space it consumes O(N^3) space, where N is the number of grid points along one dimension. In order to get the best performance when iterating over the data, remember to resolve the indices in the reverse order as you defined it: innermost first, outermost last.
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in space O(N), with N being the number of elements. Here is a benchmark comparing several hash maps. I made good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add it to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
Answer to 2nd question
I'd say you data is 4D if you have a variable number of elements a long the 4th dimension, or a fixed large number of elements. With option 1B) you'd indeed add the bucket index, for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.

Graph memory implementation

The two ways commonly used to represent a graph in memory are to use either an adjacency list or and adjacency matrix. An adjacency list is implemented using an array of pointers to linked lists. Is there any reason that that is faster than using a vector of vectors? I feel like it should make searching and traversals faster because backtracking would be a lot simpler.
The vector of linked adjacencies is a favorite textbook meme with many variations in practice. Certainly you can use vectors of vectors. What are the differences?
One is that links (double ones anyway) allow edges to be easily added and deleted in constant time. This obviously is important only when the edge set shrinks as well as grows. With vectors for edges, any individual operation may require O(k) where k is the incident edge count.
NB: If the order of edges in adjacency lists is unimportant for your application, you can easily get O(1) deletions with vectors. Just copy the last element to the position of the one to be deleted, then delete the last! Alas, there are many cases (e.g. where you're worried about embedding in the plane) when order of adjacencies is important.
Even if order must be maintained, you can arrange for copying costs to amortize to an average that is O(1) per operation over many operations. Still in some applications this is not good enough, and it requires "deleted" marks (a reserved vertex number suffices) with compaction performed only when the number of marked deletions is a fixed fraction of the vector. The code is tedious and checking for deleted nodes in all operations adds overhead.
Another difference is overhead space. Adjacency list nodes are quite small: Just a node number. Double links may require 4 times the space of the number itself (if the number is 32 bits and both pointers are 64). For a very large graph, a space overhead of 400% is not so good.
Finally, linked data structures that are frequently edited over a long period may easily lead to highly non-contiguous memory accesses. This decreases cache performance compared to linear access through vectors. So here the vector wins.
In most applications, the difference is not worth worrying about. Then again, huge graphs are the way of the modern world.
As others have said, it's a good idea to use a generalized List container for the adjacencies, one that may be quickly implemented either with linked nodes or vectors of nodes. E.g. in Java, you'd use List and implement/profile with both LinkedList and ArrayList to see which works best for your application. NB ArrayList compacts the array on every remove. There is no amortization as described above, although adds are amortized.
There are other variations: Suppose you have a very dense graph, where there's a frequent need to search all edges incident to a given node for one with a certain label. Then you want maps for the adjacencies, where the keys are edge labels. Of course the maps can be hashes or trees or skiplists or whatever you like.
The list goes on. How to implement for efficient vertex deletion? As you might expect, there are alternatives here, too, each with advantages and disadvantages.

Efficiently insert integers into a growing array if no duplicates

There is a data structure which acts like a growing array. Unknown amount of integers will be inserted into it one by one, if and only if these integers has no dup in this data structure.
Initially I thought a std::set suffices, it will automatically grow as new integers come in and make sure no dups.
But, as the set grows large, the insertion speed goes down. So any other idea to do this job besides hash?
Ps
I wonder any tricks such as xor all the elements or build a Sparse Table (just like for rmq) would apply?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
This site includes all the possible containers and layout their running time for each action ,
so maybe this will be useful :
http://en.cppreference.com/w/cpp/container
Seems like unordered_set as suggested is your best way.
You could try a std::unordered_set, which should be implemented as a hash table (well, I do not understand why you write "besides hash"; std::set normally is implemented as a balanced tree, which should be the reason for insufficient insertion performance).
If there is some range the numbers fall in, then you can create several std::set as buckets.
EDIT- According to the range that you have specified, std::set, should be fast enough. O(log n) is fast enough for most purposes, unless you have done some measurements and found it slow for your case.
Also you can use Pigeonhole Principle along with sets to reject any possible duplicate, (applicable when set grows large).
A bit vector can be useful to detect duplicates
Even more requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
Alcott 32 bit integers, with about 10.000.000 elements (ie any 10m out of 2^32)
It is a BST (binary search tree) where every node stores two values, the beginning and the end of a continuous region. The first element stores the number where a region starts, the second the last. This arrangement allows big regions in the hope that you reach you 10M limit with a very small tree height, so cheap search. The data structure with 10m elements would take up 8 bytes per node, plus the links (2x4bytes) maximum two children per node. So that make 80M for all the 10M elements. And of course, if there are usually more elements inserted you can keep track of the once which are not.
Now if you need to be very careful with space and after running simulations and/or statistical checks you find that there are lots of small regions (less than 32 bit in length), you may want to change your node type to one number which starts the region, plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have continuous chunks with only 8 elements, then your memo requirement becuse 4+1 for the node and 4+4 bytes for the children. Hope this helps.

Threading an octree

Efficient way of threading an octree such that the pointers contained by each octcell in an oct make it easy in the traversal through the tree at the same level.
We have to make use of fully threaded trees here so that i can use openmp to parallelize the code at the same level.
I have some experience with oct-trees and coded several myself. The basic problem is that the tree has (at least) two directions of traversal: horizontal (between daughter cells) and vertical (between mother and daughter cells), which cannot be mapped to linear memory. Thus, traversing the tree (for example for neighbour search) will inevitably result in cache misses.
For a most efficient implementations, you should have all (up to 8) daughter cells of a non-final cell to be in one contiguous block of memory, avoiding both cache misses when traversing over them and the need to link them up with pointers. Each cell then only need one pointer/index for their first daughter cell and, possibly (depending on the needs of your applications), a pointer to their mother cell.
Similarly, any particles/positions sorted by the tree should be ordered such that all contained within a cell are contiguous in memory, at all tree levels. Then each cell only has to store the first and number of particles, allowing access to all them at at every level of the tree (not just final cells).
In practice, such an ordering can be achieved by first building a fully linked tree and then mapping it to the form described above. The overhead of this mapping is minor but the gain in a applications substantial.
Finally, when re-building the tree with only slightly changed particle positions, it makes for a significant speed up (depending on your algorithm) to feed the particles in the previous tree order to the tree building algorithm.

Is there a data structure with these characteristics?

I'm looking for a data structure that would allow me to store an M-by-N 2D matrix of values contiguously in memory, such that the distance in memory between any two points approximates the Euclidean distance between those points in the matrix. That is, in a typical row-major representation as a one-dimensional array of M * N elements, the memory distance differs between adjacent cells in the same row (1) and adjacent cells in neighbouring rows (N).
I'd like a data structure that reduces or removes this difference. Really, the name of such a structure is sufficient—I can implement it myself. If answers happen to refer to libraries for this sort of thing, that's also acceptable, but they should be usable with C++.
I have an application that needs to perform fast image convolutions without hardware acceleration, and though I'm aware of the usual optimisation techniques for this sort of thing, I feel a specialised data structure or data ordering could improve performance.
Given the requirement that you want to store the values contiguously in memory, I'd strongly suggest you research space-filling curves, especially Hilbert curves.
To give a bit of context, such curves are sometimes used in database indexes to improve the locality of multidimensional range queries (e.g., "find all items with x/y coordinates in this rectangle"), thereby aiming to reduce the number of distinct pages accessed. A bit similar to the R-trees that have been suggested here already.
Either way, it looks that you're bound to an M*N array of values in memory, so the whole question is about how to arrange the values in that array, I figure. (Unless I misunderstood the question.)
So in fact, such orderings would probably still only change the characteristics of distance distribution.. average distance for any two randomly chosen points from the matrix should not change, so I have to agree with Oli there. Potential benefit depends largely on your specific use case, I suppose.
I would guess "no"! And if the answer happens to be "yes", then it's almost certainly so irregular that it'll be way slower for a convolution-type operation.
EDIT
To qualify my guess, take an example. Let's say we store a[0][0] first. We want a[k][0] and a[0][k] to be similar distances, and proportional to k, so we might choose to interleave the storage of first row and first column (i.e. a[0][0], a[1][0], a[0][1], a[2][0], a[0][2], etc.) But how do we now do the same for e.g. a[1][0]? All the locations near it in memory are now taken up by stuff that's near a[0][0].
Whilst there are other possibilities than my example, I'd wager that you always end up with this kind of problem.
EDIT
If your data is sparse, then there may be scope to do something clever (re Cubbi's suggestion of R-trees). However, it'll still require irregular access and pointer chasing, so will be significantly slower than straightforward convolution for any given number of points.
You might look at space-filling curves, in particular the Z-order curve, which (mostly) preserves spatial locality. It might be computationally expensive to look up indices, however.
If you are using this to try and improve cache performance, you might try a technique called "bricking", which is a little bit like one or two levels of the space filling curve. Essentially, you subdivide your matrix into nxn tiles, (where nxn fits neatly in your L1 cache). You can also store another level of tiles to fit into a higher level cache. The advantage this has over a space-filling curve is that indices can be fairly quick to compute. One reference is included in the paper here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.8959
This sounds like something that could be helped by an R-tree. or one of its variants. There is nothing like that in the C++ Standard Library, but looks like there is an R-tree in the boost candidate library Boost.Geometry (not a part of boost yet). I'd take a look at that before writing my own.
It is not possible to "linearize" a 2D structure into an 1D structure and keep the relation of proximity unchanged in both directions. This is one of the fundamental topological properties of the world.
Having that that, it is true that the standard row-wise or column-wise storage order normally used for 2D array representation is not the best one when you need to preserve the proximity (as much as possible). You can get better result by using various discrete approximations of fractal curves (space-filling curves).
Z-order curve is a popular one for this application: http://en.wikipedia.org/wiki/Z-order_(curve)
Keep in mind though that regardless of which approach you use, there will always be elements that violate your distance requirement.
You could think of your 2D matrix as a big spiral, starting at the center and progressing to the outside. Unwind the spiral, and store the data in that order, and distance between addresses at least vaguely approximates Euclidean distance between the points they represent. While it won't be very exact, I'm pretty sure you can't do a whole lot better either. At the same time, I think even at very best, it's going to be of minimal help to your convolution code.
The answer is no. Think about it - memory is 1D. Your matrix is 2D. You want to squash that extra dimension in - with no loss? It's not going to happen.
What's more important is that once you get a certain distance away, it takes the same time to load into cache. If you have a cache miss, it doesn't matter if it's 100 away or 100000. Fundamentally, you cannot get more contiguous/better performance than a simple array, unless you want to get an LRU for your array.
I think you're forgetting that distance in computer memory is not accessed by a computer cpu operating on foot :) so the distance is pretty much irrelevant.
It's random access memory, so really you have to figure out what operations you need to do, and optimize the accesses for that.
You need to reconvert the addresses from memory space to the original array space to accomplish this. Also, you've stressed distance only, which may still cause you some problems (no direction)
If I have an array of R x C, and two cells at locations [r,c] and [c,r], the distance from some arbitrary point, say [0,0] is identical. And there's no way you're going to make one memory address hold two things, unless you've got one of those fancy new qubit machines.
However, you can take into account that in a row major array of R x C that each row is C * sizeof(yourdata) bytes long. Conversely, you can say that the original coordinates of any memory address within the bounds of the array are
r = (address / C)
c = (address % C)
so
r1 = (address1 / C)
r2 = (address2 / C)
c1 = (address1 % C)
c2 = (address2 % C)
dx = r1 - r2
dy = c1 - c2
dist = sqrt(dx^2 + dy^2)
(this is assuming you're using zero based arrays)
(crush all this together to make it run more optimally)
For a lot more ideas here, go look for any 2D image manipulation code that uses a calculated value called 'stride', which is basically an indicator that they're jumping back and forth between memory addresses and array addresses
This is not exactly related to closeness but might help. It certainly helps for minimation of disk accesses.
one way to get better "closness" is to tile the image. If your convolution kernel is less than the size of a tile you typical touch at most 4 tiles at worst. You can recursively tile in bigger sections so that localization improves. A Stokes-like (At least I thinks its Stokes) argument (or some calculus of variations ) can show that for rectangles the best (meaning for examination of arbitrary sub rectangles) shape is a smaller rectangle of the same aspect ratio.
Quick intuition - think about a square - if you tile the larger square with smaller squares the fact that a square encloses maximal area for a given perimeter means that square tiles have minimal boarder length. when you transform the large square I think you can show you should the transform the tile the same way. (might also be able to do a simple multivariate differentiation)
The classic example is zooming in on spy satellite data images and convolving it for enhancement. The extra computation to tile is really worth it if you keep the data around and you go back to it.
Its also really worth it for the different compression schemes such as cosine transforms. (That's why when you download an image it frequently comes up as it does in smaller and smaller squares until the final resolution is reached.
There are a lot of books on this area and they are helpful.