Multiple indexing with big data set of small data: space inefficient? - c++

I am not at all an expert in database design, so I will put my need in plain words before I try to translate it in CS terms: I am trying to find the right way to iterate quickly over large subsets (say ~100Mo of double) of data, in a potentially very large dataset (say several Go).
I have objects that basically consist of 4 integers (keys) and the value, a simple struct (1 double 1 short).
Since my keys can take only a small number of values (couple hundreds) I thought it would make sense to save my data as a tree (1 depth by key, values are the leaves, much like XML's XPath in my naive view at least).
I want to be able to iterate through subset of leaves based on key values / a fonction of those keys values. Which key combination to filter upon will vary. I think this is call a transversal search ?
So to avoid comparing n times the same keys, ideally I would need the data structure to be indexed by each of the permutation of the keys (12 possibilities: !4/!2 ). This seems to be what boost::multi_index is for, but, unless I'm overlooking smth, the way this would be done would be actually constructing those 12 tree structure, storing pointers to my value nodes as leaves. I guess this would be extremely space inefficient considering the small size of my values compared to the keys.
Any suggestions regarding the design / data structure I should use, or pointers to concise educational materials regarding these topics would be very appreciated.

With Boost.MultiIndex, you don't need as many as 12 indices (BTW, the number of permutations of 4 elements is 4!=24, not 12) to cover all queries comprising a particular subset of 4 keys: thanks to the use of composite keys, and with a little ingenuity, 6 indices suffice.
By some happy coincindence, I provided in my blog some years ago an example showing how to do this in a manner that almost exactly matches your particular scenario:
Multiattribute querying with Boost.MultiIndex
Source code is provided that you can hopefully use with little modification to suit your needs. The theoretical justification of the construct is also provided in a series of articles in the same blog:
A combinatory theorem
Generating permutation covers: part I
Generating permutation covers: part II
Multicolumn querying
The maths behind this is not trivial and you might want to safely ignore it: if you need assistance understanding it, though, do not hesitate to comment on the blog articles.
How much memory does this container use? In a typical 32-bit computer, the size of your objects is 4*sizeof(int)+sizeof(double)+sizeof(short)+padding, which typically yields 32 bytes (checked with Visual Studio on Win32). To this Boost.MultiIndex adds an overhead of 3 words (12 bytes) per index, so for each element of the container you've got
32+6*12 = 104 bytes + padding.
Again, I checked with Visual Studio on Win32 and the size obtained was 128 bytes per element. If you have 1 billion (10^9) elements, then 32 bits is not enough: going to a 64-bit OS will most likely double the size of obejcts, so the memory needed would amount to 256 GB, which is quite a powerful beast (don't know whether you are using something as huge as this.)

B-Tree index and Bitmap Index are two of the major indexes used, but they aren't the only ones. You should explore them. Something to get you started .
Article evaluating when to use B-Tree and when to use Bitmap

It depends on the algorithm accessing it, honestly. If this structure needs to be resident, and you can afford the memory consumption, then just do it. multi_index is fine, though it will destroy your compile times if it's in a header.
If you just need a one time traversal, then building the structure will be kind of a waste. Something like next_permutation may be a good place to start.

Related

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?
I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power of 2 sizes, and traditionally resizes at 75% occupation. Since I know I am going to be filling up the hash table, I pre-emptively double it's capacity to keep it sparse enough. This might be less efficient when adding small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my red/black tree implementation before. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would help make a cheaper hash.
Billy ONeal suggested, given a small number of elements (1024) that a simple linear search in a fixed array would be faster. I followed his advice and implemented one for side by side comparison. On my target hardware (roughly first generation iPhone) the hash table outperformed a linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependant. Cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice and try and profile it first.

3D-Grid of bins: nested std::vector vs std::unordered_map

pros, I need some performance-opinions with the following:
1st Question:
I want to store objects in a 3D-Grid-Structure, overall it will be ~33% filled, i.e. 2 out of 3 gridpoints will be empty.
Short image to illustrate:
Maybe Option A)
vector<vector<vector<deque<Obj>> grid;// (SizeX, SizeY, SizeZ);
grid[x][y][z].push_back(someObj);
This way I'd have a lot of empty deques, but accessing one of them would be fast, wouldn't it?
The Other Option B) would be
std::unordered_map<Pos3D, deque<Obj>, Pos3DHash, Pos3DEqual> Pos3DMap;
where I add&delete deques when data is added/deleted. Probably less memory used, but maybe less fast? What do you think?
2nd Question (follow up)
What if I had multiple containers at each position? Say 3 buckets for 3 different entities, say object types ObjA, ObjB, ObjC per grid point, then my data essentially becomes 4D?
Another illustration:
Using Option 1B I could just extend Pos3D to include the bucket number to account for even more sparse data.
Possible queries I want to optimize for:
Give me all Objects out of ObjA-buckets from the entire structure
Give me all Objects out of ObjB-buckets for a set of
grid-positions
Which is the nearest non-empty ObjC-bucket to
position x,y,z?
PS:
I had also thought about a tree based data-structure before, reading about nearest neighbour approaches. Since my data is so regular I had thought I'd save all the tree-building dividing of the cells into smaller pieces and just make a static 3D-grid of the final leafs. Thats how I came to ask about the best way to store this grid here.
Question associated with this, if I have a map<int, Obj> is there a fast way to ask for "all objects with keys between 780 and 790"? Or is the fastest way the building of the above mentioned tree?
EDIT
I ended up going with a 3D boost::multi_array that has fortran-ordering. It's a little bit like the chunks games like minecraft use. Which is a little like using a kd-tree with fixed leaf-size and fixed amount of leaves? Works pretty fast now so I'm happy with this approach.
Answer to 1st question
As #Joachim pointed out, this depends on whether you prefer fast access or small data. Roughly, this corresponds to your options A and B.
A) If you want fast access, go with a multidimensional std::vector or an array if you will. std::vector brings easier maintenance at a minimal overhead, so I'd prefer that. In terms of space it consumes O(N^3) space, where N is the number of grid points along one dimension. In order to get the best performance when iterating over the data, remember to resolve the indices in the reverse order as you defined it: innermost first, outermost last.
B) If you instead wish to keep things as small as possible, use a hash map, and use one which is optimized for space. That would result in space O(N), with N being the number of elements. Here is a benchmark comparing several hash maps. I made good experiences with google::sparse_hash_map, which has the smallest constant overhead I have seen so far. Plus, it is easy to add it to your build system.
If you need a mixture of speed and small data or don't know the size of each dimension in advance, use a hash map as well.
Answer to 2nd question
I'd say you data is 4D if you have a variable number of elements a long the 4th dimension, or a fixed large number of elements. With option 1B) you'd indeed add the bucket index, for 1A) you'd add another vector.
Which is the nearest non-empty ObjC-bucket to position x,y,z?
This operation is commonly called nearest neighbor search. You want a KDTree for that. There is libkdtree++, if you prefer small libraries. Otherwise, FLANN might be an option. It is a part of the Point Cloud Library which accomplishes a lot of tasks on multidimensional data and could be worth a look as well.

Efficiently insert integers into a growing array if no duplicates

There is a data structure which acts like a growing array. Unknown amount of integers will be inserted into it one by one, if and only if these integers has no dup in this data structure.
Initially I thought a std::set suffices, it will automatically grow as new integers come in and make sure no dups.
But, as the set grows large, the insertion speed goes down. So any other idea to do this job besides hash?
Ps
I wonder any tricks such as xor all the elements or build a Sparse Table (just like for rmq) would apply?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
This site includes all the possible containers and layout their running time for each action ,
so maybe this will be useful :
http://en.cppreference.com/w/cpp/container
Seems like unordered_set as suggested is your best way.
You could try a std::unordered_set, which should be implemented as a hash table (well, I do not understand why you write "besides hash"; std::set normally is implemented as a balanced tree, which should be the reason for insufficient insertion performance).
If there is some range the numbers fall in, then you can create several std::set as buckets.
EDIT- According to the range that you have specified, std::set, should be fast enough. O(log n) is fast enough for most purposes, unless you have done some measurements and found it slow for your case.
Also you can use Pigeonhole Principle along with sets to reject any possible duplicate, (applicable when set grows large).
A bit vector can be useful to detect duplicates
Even more requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
Alcott 32 bit integers, with about 10.000.000 elements (ie any 10m out of 2^32)
It is a BST (binary search tree) where every node stores two values, the beginning and the end of a continuous region. The first element stores the number where a region starts, the second the last. This arrangement allows big regions in the hope that you reach you 10M limit with a very small tree height, so cheap search. The data structure with 10m elements would take up 8 bytes per node, plus the links (2x4bytes) maximum two children per node. So that make 80M for all the 10M elements. And of course, if there are usually more elements inserted you can keep track of the once which are not.
Now if you need to be very careful with space and after running simulations and/or statistical checks you find that there are lots of small regions (less than 32 bit in length), you may want to change your node type to one number which starts the region, plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have continuous chunks with only 8 elements, then your memo requirement becuse 4+1 for the node and 4+4 bytes for the children. Hope this helps.

Efficient data structure for Zobrist keys

Zobrist keys are 64bit hashed values used in board games to univocally represent different positions found during a tree search. They are usually stored in arrays having a size of 1000K entries or more (each entry is about 10 bytes long). The table is usually accessed by hashKey % size as index. What kind of STL container would you use to represent this sort of table? Consider that since the size of the table is limited collisions might happen. With a "plain" array I would have to handle this case, so I thought of an unordered_map, but since the implementation is not specified, I am not sure how efficient it will be while the map is being populated.
Seems to me a standard hashmap would suit you well - very fast look up and it will handle the collisions for you reliably and invisibly.
If you wish to explore other territories apart STL, take a look at Judy arrays: these should fit your problem.
If you are on Linux you can experiment with them very easily, just install from your repository...
This application note could help to solve your task.
EDIT
There is this STL interface: I'm going to experiment with it, then I'll report my results.

Using Hash Maps to represent an extremely large data source

I have a very large possible data set that I am trying to visualize at once. The set itself consists of hundreds of thousands of segments, each of which is mapped to an id.
I have received a second data source that gives more real-time information for each segment, but the id's do not correspond to the id's I have.
I have a 1:1 mapping of the data id's (9-character strings) to the current id's (long integers). The problem is that there are a lot of id's, and the data that is coming in is in no specific order.
The solution I came up with is to have a hash-map that maps the strings to the road id's. The problem is that I don't know if the hash-map will be efficient enough to have all 166k data entries.
Does anyone have any suggestions and/or hashing algorithms that I can use for this?
If you're only dealing with hundreds of thousands of datapoints, it will likely not be a problem to go with the naive way and just stick with a hash-map.
Even if you have 500,000 9-character strings and an equal number of longs, that still only 16ish bytes per item, or 8,000,000 bytes total. Even if you double that for overhead, 16 MB is hardly too big to have in memory at one time.
Basically, try the easy way first, and only worry about it when your profiling tells you it's taking too long.
Judy Arrays are designed for this sort of thing: "Judy's key benefits are scalability, high performance, and memory efficiency. [...] Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions."
Since the comments on the question indicate the primary concern may be memory usage:
Use a pooling or other small-object-optimized allocator; assuming you have access to boost you can probably find a drop-in replacement in Pool. Using a better small-object allocator is probably the single biggest memory win you'll find.
If you know your strings are fixed-width, you may want to make sure you're allocating only enough space to store them. For example, a struct wrapped around a fixed-length char[] with a custom comparison operator may work better than a std::string. std::string comes with an additional dynamic allocation (and uses space for the corresponding pointer) and some extra size and capacity tracking overhead. (Generally, try to reduce the number of allocations that stick around; it reduces overhead.)
(Assuming STL) Look at the overhead difference between std::map and std::unordered_map (the latter may or may not be available to you at the moment); an RBtree-based std::map may be close enough to the lookup performance characteristics of your "hashmap" and may (or may not) be more memory efficient depending on your standard library implementation.
What route you take should be influenced by info you can gather -- try to get a picture of number of allocs and alloc size/alignment overhead.
You can either instrument your allocator or insert a few elements and see how you're doing compared to how you think you should be doing in terms of memory usage.
Since your strings are known up front and have a fixed length, theoretically and practically the best solution is a perfect hash. You could use cmph to generate it.
According to Wikipedia, your keys woud take 2.5 bits/key, or about 50KB. That's negligable compared to the 664KB for the values.
Although 166k data entries is rather small IMO you can have a look at google-sparsehash