Managing large spatial data set with attributes in C++

Managing large spatial data set with attributes in C++ - c++

I have a data set with about 700 000 entries, and each entry is a set of 3D coordinates with attributes such as name, timestamp, ID, and so on.
Right now I'm just reading the coordinates and render them as points in OpenGL. However I want to associate each point with its corresponding attributes and I want to be able to sort and pick them during runtime based on their attributes. How would I go about to achieve this in an efficient manner?
I know I can put I can put the data in a struct and use stl sort for sorting, but is that a good design choice or is there a more efficient/elegant way of handling the problem?

The way I tend to look at these design choices is to first use one of the standard library containers (btw, if you need to "just" do lookup you don't necessarily have to sort, but you need a container that allows lookup), then check if this an "efficient enough" solution for the problem.
You can usually come up with a custom solution that is more efficient and maybe more elegant but you tend to run into two issues with that:
1) You end up having to implement some type of a container, which will cost you time both in implementation and debugging compared to a well understood and tested container that is already out there. Most of the time you're better off trying to solve the problem at hand rather than make it bigger by adding more code.
2) If someone else will have to maintain your code at some point, chances are they are familiar with standard library components both from a design and implementation perspective, but they won't be familiar with your custom container, thus increasing the learning curve.

If you consider each attribute of your point class as a component of a vector, then your selection process is a region query. Your example of a string attribute being equal to something means that the region is actually a line in your data space. However, there won't be any sorting made on other attributes within that selection, you will have to implement it by yourself, but it should be relatively straightforward for octrees, which partition data in ordered regions.
As advocated in another answer, try existing standard solutions first. If you can find an of the shelf implementation of one of these data structures:
R-tree
KD tree
BSP
Octree, or more likely, a n dimensional version of the quadtree or octree principle (I will use the term octree herein to denote the general data structure)
then go for it. These are the data structures I recommend for spatial data management.
You could also use an embedded RDBMS capable of working with spatial data (they usually implement R-tree for spatial indexing), but it may not be interesting if your dataset isn't dynamic.
If your dataset falls within the 10000 entries range, then by today standards it isn't that large, so using simpler structures should suffice. In that perimeter, I would go first for a simple std::vector, and use std::sort and std::find to filter the data in smaller set and sort it afterward.
I would probably try an ordered set or map on the most queried attribute in a second attempt, then do some benchmarks to pick the more performing solution.
For a more efficient one dimensional indexing algorithm (in essence, that`s what sets and maps are), you might want to try B-trees: there's C++ implementation available from google.
My third attempt would go toward an OpenCL solution (although if you are doing heavy OpenGL rendering, you might prefer doing the work on the CPU instead, but that depends on your framerate needs).
If your dataset is much larger, as it seems to be, then consider one of the more complex solutions I listed initially.
At any rate, without more details about your dataset and how you plan to use it, it will be difficult to provide a good solution, so the only real advice we can give is: try everthing you can and benchmark.

If you're dealing with point clouds, take a look at PCL, it could save you a lot of time and effort without having to dig into the intricacies of spatial indexing yourself. It also includes visualisation.

Related

Why do C++ data structures for graphs hide contiguous integer indices?

Data structures for directed and undirected graphs are of fundamental importance. Well-known and widely-used implementations such as the Boost Graph Library and Lemon are designed such that the contiguous integer indices of nodes and edges are not exposed to the user via the interface.
Instead, the user identifies nodes and edges by (small) representative objects. One advantage is that these objects are updated automatically when the indices of nodes and edges change due to the removal of edges or nodes from the graph.
In my opinion (!), this advantage is overrated. Users will typically store the representative objects of nodes and/or edges in a container, e.g., an std::vector. Now, if nodes or edges are removed from the graph and their representative objects become invalid, the user needs to either ignore this or rearrange the vector so as to keep valid integer indices contiguous, i.e., do exactly the bookkeeping that the design was supposed to make unnecessary.
Hence, my question is: Does the design choice (of hiding the contiguous integer indices of nodes and edges from the user) have other advantages?

(I'm at home in the Java world, but hope that it is OK to give an answer that is not focussed on the particular libraries in question)
There are several possible advantages of such an abstraction. One of the most important ones was already mentioned in the question: The consistency when performing modifications in a graph is much harder to accomplish when indices have to be maintained.
The reason why this may be hard lies in the different possible graph representations: It may be easy to maintain consistent indices if the internal representation only (and always) consisted of random-access lists of Vertex and Edge objects. But for other representations, determining an index may be difficult.
This directly related to the second main advantage: One is free to use different implementations of the graph interface. The section "Graph Data Structures" in the Review of Elementary Graph Theory of the boost documentation lists several data structures that are already offered by the BGL (and everybody may add his own implementation). The running times for certain operations are given in Big-O-notation, and once can see that they vary greatly between the different data structures.
So one can easily imagine that different implementations are better suited for certain tasks. For example, consider an algorithm frequently has to check whether a particular vertex is contained in a graph. For an indexed (that is, list-based) vertex storage, this would require O(n) for each test. With a set-based storage of the vertices, this could be done in O(1) - but there simply are no sensible "indices" in this case.
Additionally, as mentioned in the Graph Concepts overview:
In fact, the BGL interface need not even be implemented using a data-structure, as for some problems it is easier or more efficient to define a graph implicitly based on some functions.
So suggesting that there is an "indexed access" even when the graph does not even exist in memory may hinder such a purely functional implementation.

I can't speak for Lemon graph, but for boost graph I think the main goal is to be generic. So abstracting away the vertex (edge) access helps to achieve that goal.
It is stated in the documentation that boost graph is based on Dietmar Kühl's Masters Thesis on generic graph algorithms. (See my answer to Do property maps remain necessary for BGL?). So the main goal behind the library is to be generic and extensible. The choice of encapsulating access is part of abstracting the algorithms from the graph representation. To me, continuous integer indices are an implementation detail.
Boost doesn't make any assumptions on how you will use the graph or what performance trade offs are important to you. It lets you choose (or implement) the container that will best fit your needs.
If you want to break this encapsulation, you are free to do so. In fact, my most common use of boost graph involves vecS containers and a vector of structs. I usually work with graphs where the size is fixed. I could just as easily use a map of vertex_descriptors (or edge_descriptors) to objects to achieve the same goal.
So in summary, I would say that this not so much a design choice, but rather a consequence of achieving the broader goal of being generic. So the hiding of access has the benefit of being more generic.

C++ - fastest sorting algorithm for objects based on distance

I'm trying to make a game or 3D application using openGL. The game/program will have many objects in them and drawn to the screen(around 7000 of them). When I render them, I would need to calculate the distance between the camera and the object and sort them in order to correctly render the objects within the scene. Knowing this, what is the best way to sort them? I really want the sorting to be done really fast, but I've heard there are "trade off" for them, so what algorithm should I use to get the best performance out of it?
Any help would be greatly appreciated.
Edit: a lot of people are talking about the z-buffer/depth buffer. This doesn't work in some cases like a few people talked about. This is why I asked this question.

Sorting by distance doesn't solve the transparency problem perfectly. Consider the situation where two transparent surfaces intersect and each has a part which is closer to you. Perhaps rare in games, but still something to consider if you don't want an occasional glitched look to your renderer.
The better solution is order-independent transparency. With the latest graphics hardware supporting atomic operations, you can use an A-buffer to do this with little memory overhead and in a single pass so it is pretty efficient. See for example this article.
The issue of sorting your scene is still a valid one, though, even if it isn't for transparency -- it is still useful to sort opaque objects front to back to to allow depth testing to discard unseen fragments. For this, Vaughn provided the great solution of BSP trees -- these have been used for this purpose for as long as 3D games have been around.

Use http://en.wikipedia.org/wiki/Insertion_sort which has O(n) complexity for nearly sorted arrrays.
In your case by exploiting temporal cohesion insertion sort gives fastest results.
It is used for http://en.wikipedia.org/wiki/Sweep_and_prune
From link above:
In many applications, the configuration of physical bodies from one time step to the next changes very little. Many of the objects may not move at all. Algorithms have been designed so that the calculations done in a preceding time step can be reused in the current time step, resulting in faster completion of the calculation.
So in such cases insertion sort is best(or similar sorts with O(n) at best case)

Boost flat_map container

Working on some legacy code, I am running into memory issues due mainly (I believe) to the extensive use of STL maps (particularly “maps-of-maps”.)
I am looking at Boost flat_map as a possible solution. Does anyone have any firsthand experience with flat_maps, in particular with regards improvements in speed and/or memory usage? I realize of course this can be very dependent on the types of data stored and the manner in which they are stored but still curious of folk’s actual experience.
Can anyone point me to some solid examples?
As an example: there are several cases in this code of a map-of-a-map; that is, a map where the value is another map.
By replacing the “inner” map with a pair of vectors, I reduced the memory footprint 10:1 (3G to 300M). Of course this can slow down searches but for this particular case it doesn’t seem to matter much. And it involved about a day of refactoring and careful testing.
Boost’s flat_map sounds like it might be just what I need but I can’t seem to find out much about it other than the class description on the Boost web site. Looking for some firsthand feedback.

Boost's flat_map is a binary-tree-based map implementation, except that that binary tree is stored as a (sorted) vector of key-value pairs.
You can basically figure out the answers regarding performance (relative to an std::map by yourself based on that fact:
Iterating the map or a large part of it should be super-fast, relatively
Lookup should typically be relatively fast
Adding or removing values is theoretically much slower, but in practice - assuming your key and value types are small and the number of map elements not very high - probably comparable in speed (or even better on small maps - often no allocation is necessary on insert)
etc.
In your case - maps-of-maps - you're going to lose some of the benefit of "flattening things out", since you'll have an outer map with a pointer to an inner map (if not more levels of indirection); but the flat map would at least help you reduce that. Also, supposing you have two levels of maps, you could arrange it so that you store all of the inner maps contiguously (either by constructing the inner maps appropriately or by instantiating them with your own allocator, a trickier affair); in that case, you could replace pointers to maps with map indices, reducing the amount of space they take up and making life easier for the compiler.
You might also want read Boost's documentation of flat_map; and you could also just use the force and read the source (and the source of the underlying flat_tree) - like I have; I dont actually have flat_map experience myself.

I know this is an old question, but this might be of use to someone finding this question.
I found that flat_map was a big improvement in searching, lookup and iterating large maps. The fact the map is using contiguous data in memory also makes inserting faster than you might expect due to great data locality. If you're doing more inserts than lookups in your map then it might not be for you.
Having said that, repeatedly inserting a random value into a sorted vector is faster than the same on a linked list because of the data locality - despite what Big O might tell you. (tested in VS2017 and G++ 4.8).

Search structure with history (persistence)

I need a map-like data structure (in C++) for storing pairs (Key,T) with the following functionality:
You can insert new elements (Key,T) into the current structure
You can search for elements based on Key in the current structure
You can make a "snapshot" of the current version of the structure
You can switch to one of the versions of the structures which you took the snapshot of and continue all operations from there
Completely remove one of the versions
What I don't need
Element removal from the structure
Merging of different versions of the structure into one
Iteration over all (or some of) elements currently stored in the structure
In other words, you have some search structure that you can build up, but at any point you can jump in history, and expand the earlier/different version of the structure in a different way. Later on you may jump between those different versions.
In my project, Key and T are likely to be integers or pointer values, but not strings.
The primary objective is to reduce the time complexity; space consumption is secondary (but should be reasonable as well). To clarify, for me log(N)+log(S) (where N-number of elements, S-number of snapshots) would be enough, although faster is better :)
I have some rough idea how to implement it --- for example: being the structure a binary search tree, the insertion of a new element can clone the path from the root to the insertion location, while keeping the rest of the tree intact. Switching tree versions would be equivalent to picking a different version of the root node, for which some changes are simply not visible.
However, to make this custom tree efficient (e.g. self-balancing) it will require some additional effort and careful coding. Of course I can do it myself but perhaps there are already existing libraries to do exactly that?
Also, there is probably a proper name for this kind of data structure that I simply don't know, making my Google searches (or SO searches) total failures...
Thank you for your help!

I think what you are looking for is an immutable map. Functional (or functionally inspired) programming languages (such as Haskell or Scala) have immutable versions of most of the containers you'd find in the STL. Operations such as insertion/removal etc. then return a copy of the map (preserving the original) with the copy containing your requested modification. A lot of work has gone into designing the datastructures so that the copies are able to point to as much of the original datastructure as possible to reduce time and memory complexity of each operation.
You can find a lot more details in a book such as this one: http://www.amazon.co.uk/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504.

While searching for some persistent search trees libraries I stumbled on this:
http://cg.scs.carleton.ca/~dana/pbst/
While it does not have the exact same functionality as needed, it seems pretty close to it. I will investigate.
(posting here, as someone may find it useful as well)

c++ pivot table implementation

Similar to this question Pivot Table in c#, I'm looking to find an implementation of a pivot table in c++. Due to the project requirements speed is fairly critical and the rest of the performance critical part project is written in c++ so an implementation in c++ or callable from c++ would be highly desirable. Does anyone know of implementations of a pivot table similar to the one found in Excel or open office?
I'd rather not have to code such a thing from scratch, but if I was to do this how should I go about it? What algorithms and data structures would be good to be aware of? Any links to an algorithm would be greatly appreciated.

I am sure you are not asking full feature of pivot table in Excel. I think you want simple statistics table based on discrete explanatory variables and given statistics. If you do, I think this is the case that writing from scratch might be faster than looking at other implementations.
Just update std::map (or similar data structure) of key representing combination of explanatory variables and value of given statistics when program reading each data point.
After done with the reading, it's just matter of organizing output table with the map which might be trivial depending on your goal.
I believe most of C# examples in that question you linked do this approach anyway.

I'm not aware of an existing implementation that would suit your needs, so, assuming you were to write one...
I'd suggest using SQLite to store your data and use SQL to compute aggregates (Note: SQL won't do median, I suggest an abstraction at some stage to allow such behavior), The benefit of using SQLite is that it's pretty flexible and extremely robust, plus it lets you take advantage of their hard work in terms of storing and manipulating data. Wrapping the interface you expect from your pivot table around this concept seems like a good way to start, and save you quite a lot of time.
You could then combine this with a model-view-controller architecture for UI components, I anticipate that would work like a charm. I'm a very satisfied user of Qt, so in this regard I'd suggest using Qt's QTableView in combination with QStandardItemModel (if I can get away with it) or QAbstractItemModel (if I have to). Not sure if you wanted this suggestion, but it's there if you want it :).
Hope that gives you a starting point, any questions or additions, don't hesitate to ask.

I think the reason your question didn't get much attention is that it's not clear what your input data is, nor what options for pivot table you want to support.
A pivot table is in it's basic form, running through the data, aggregating operations into buckets. For example, you want to see how many items you shipped each week from each warehouse for the last few weeks:
You would create a multi-dimensional array of buckets (rows are weeks, columns are warehouses), and run through the data, deciding which bucket that data belongs to, adding the amount in the record you're looking at, and moving to the next record.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js