Using Hash Maps to represent an extremely large data source

Using Hash Maps to represent an extremely large data source - c++

I have a very large possible data set that I am trying to visualize at once. The set itself consists of hundreds of thousands of segments, each of which is mapped to an id.
I have received a second data source that gives more real-time information for each segment, but the id's do not correspond to the id's I have.
I have a 1:1 mapping of the data id's (9-character strings) to the current id's (long integers). The problem is that there are a lot of id's, and the data that is coming in is in no specific order.
The solution I came up with is to have a hash-map that maps the strings to the road id's. The problem is that I don't know if the hash-map will be efficient enough to have all 166k data entries.
Does anyone have any suggestions and/or hashing algorithms that I can use for this?

If you're only dealing with hundreds of thousands of datapoints, it will likely not be a problem to go with the naive way and just stick with a hash-map.
Even if you have 500,000 9-character strings and an equal number of longs, that still only 16ish bytes per item, or 8,000,000 bytes total. Even if you double that for overhead, 16 MB is hardly too big to have in memory at one time.
Basically, try the easy way first, and only worry about it when your profiling tells you it's taking too long.

Judy Arrays are designed for this sort of thing: "Judy's key benefits are scalability, high performance, and memory efficiency. [...] Judy can replace many common data structures, such as arrays, sparse arrays, hash tables, B-trees, binary trees, linear lists, skiplists, other sort and search algorithms, and counting functions."

Since the comments on the question indicate the primary concern may be memory usage:
Use a pooling or other small-object-optimized allocator; assuming you have access to boost you can probably find a drop-in replacement in Pool. Using a better small-object allocator is probably the single biggest memory win you'll find.
If you know your strings are fixed-width, you may want to make sure you're allocating only enough space to store them. For example, a struct wrapped around a fixed-length char[] with a custom comparison operator may work better than a std::string. std::string comes with an additional dynamic allocation (and uses space for the corresponding pointer) and some extra size and capacity tracking overhead. (Generally, try to reduce the number of allocations that stick around; it reduces overhead.)
(Assuming STL) Look at the overhead difference between std::map and std::unordered_map (the latter may or may not be available to you at the moment); an RBtree-based std::map may be close enough to the lookup performance characteristics of your "hashmap" and may (or may not) be more memory efficient depending on your standard library implementation.
What route you take should be influenced by info you can gather -- try to get a picture of number of allocs and alloc size/alignment overhead.
You can either instrument your allocator or insert a few elements and see how you're doing compared to how you think you should be doing in terms of memory usage.

Since your strings are known up front and have a fixed length, theoretically and practically the best solution is a perfect hash. You could use cmph to generate it.
According to Wikipedia, your keys woud take 2.5 bits/key, or about 50KB. That's negligable compared to the 664KB for the values.

Although 166k data entries is rather small IMO you can have a look at google-sparsehash

Related

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?

I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the size is doubled. The implementation requires power of 2 sizes, and traditionally resizes at 75% occupation. Since I know I am going to be filling up the hash table, I pre-emptively double it's capacity to keep it sparse enough. This might be less efficient when adding small number of objects, but it is more efficient once the capacity starts to fill up. Since I cannot resize it I chose to start it doubled in size.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my red/black tree implementation before. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would help make a cheaper hash.
Billy ONeal suggested, given a small number of elements (1024) that a simple linear search in a fixed array would be faster. I followed his advice and implemented one for side by side comparison. On my target hardware (roughly first generation iPhone) the hash table outperformed a linear search by a factor of two to one. At smaller sizes (256 elements) the hash table was still superior. Of course these values are hardware dependant. Cache line sizes and memory access speed are terrible in my environment. However, others looking for a solution to this problem would be smart to follow his advice and try and profile it first.

How to make a fast dictionary that contains another dictionary?

I have a map<size_t, set<size_t>>, which, for better performance, I'm actually representing as a lexicographically-sorted vector<pair<size_t, vector<size_t>>>.
What I need is a set<T> with fast insertion times (removal doesn't matter), where T is the data type above, so that I can check for duplicates (my program runs until there are no more unique T's being generated.).
So far, switching from set to unordered_set has turned out to be quite beneficial (it makes my program run > 25% faster), but even now, inserting T still seems to be one of the main bottlenecks.
The maximum number of integers in a given T is around ~1000, and each integer is also <= ~1000, so the numbers are quite small (but there are thousands of these T's being generated).
What I have already tried:
Using unsigned short. It actually decreases performance slightly.
Using Google's btree::btree_map.
It's actually much slower because I have to work around the iterator invalidation.
(I have to copy the keys, and I think that's why it turned out slow. It was at least twice as slow.)
Using a different hash function. I haven't found any measurable difference as long as I use something reasonable, so it seems like this can't be improved.
What I have not tried:
Storing "fingerprints"/hashes instead of the actual sets.
This sounds like the perfect solution, except that the fingerprinting function needs to be fast, and I need to be extremely confident that collisions won't happen, or they'll screw up my program.
(It's a deterministic program that needs exact results; collisions render it useless.)
Storing the data in some other compact, CPU-friendly way.
I'm not sure how beneficial this would be, because it might involve copying around data, and most of the performance I've gained so far is by (cleverly) avoiding copying data in many situations.
What else can I do to improve the speed, if anything?

I am under the impression that you have 3 different problems here:
you need the T itself to be relatively compact and easy to move around
you need to quickly check whether a T is a possible duplicate of an already existing one
you finally need to quickly insert the new T in whatever data structure you have to check for duplicates
Regarding T itself, it is not yet as compact as it could be. You could probably use a single std::vector<size_t> to represent it:
N pairs
N Indexes
N "Ids" of I elements each
all that can be linearized:
[N, I(0), ..., I(N-1),
R(0) = Size(Id(0)), Id(0, 0), ... , Id(0, R(0)-1),
R(1) = ... ]
and this way you have a single chunk of memory.
Note: depending on the access pattern you may have to tweak it, specifically if you need random access to any ID.
Regarding the possibility of duplicates, a hash-map seems indeed quite appropriate. You will need a good hash function, but with a single array of size_t (or unsigned short if you can, it is smaller), you can just pick MurmurHash or CityHash or SipHash. They all are blazing fast and do their damnest to produce good quality hash (not cryptographic ones, emphasis is on speed).
Now, the question is when is it slow when checking for duplicates.
If you spend too much time checking for non-existing duplicates because the hash-map is too big, you might want to invest in a Bloom Filter in front of it.
Otherwise, check your hash function to make sure that it is really fast and has a low collision rate and your hash-map implementation to make sure it only ever computes the hash once.
Regarding insertion speed. Normally a hash-map, specifically if well-balanced and pre-sized, should have one of the quickest insertion. Make sure you move data into it and do not copy it; if you cannot move, it might be worth using a shared_ptr to limit the cost of copying.

Don't be afraid of collisions, use cryptographic hash. But choose a fast one. 256 bit collision is MUCH less probable than hardware error. Sun used it to deduplicate blocks in ZFS. ZFS uses SHA256. Probably you can use less secure hash. If it takes $1000000 to find one collision hash isn't secure but one collision don't seem to drop your performance. Many collisions would cost many $1000000. You can use something like (unordered) multimap<SHA, T> to deal with collisions. By the way, ANY hash table suffer from collisions (or takes too many memory), so ordered map (rbtree in gcc) or btree_map has better guaranteed time. Also hash table can be DOSed via hash collisions. Probably a secret salt can solve this problem. It is due to table size is much less than number of possible hashes.

You can also:
1) use short ints
2) reinterpret your array as an array of something like uint64_t for fast comparison (+some trailing elements), or even reinterpret it as an array of 128-bit values (or 256-bit, depending on your CPU) and compare via SSE. This should push your performance to memory speed limit.
From my experience SSE works fast with aligned memory access only. uint64_t comparison probably needs alignment for speed too, so you have to allocate memory manually with proper alignment (allocate more and skip first bytes). tcmalloc is 16 byte-aligned, uint64_t-ready. It is strange that you have to copy keys in btree, you can avoid it using shared_ptr. With fast comparisons and slow hash btree or std::map may turn out to be faster than hash table. I guess any hash is slower than memory. You can also calculate hash via SSE and probably find a library that does it.
PS I strongly recommend you to use profiler if you don't yet. Please tell % of time your program spend to insert, compare in insert and calculate hash.

data structure for storing array of strings in a memory

I'm considering of data structure for storing a large array of strings in a memory. Strings will be inserted at the beginning of the programm and will not be added or deleted while programm is running. The crucial point is that search procedure should be as fast as it can be. Saving of memory is not important. I incline to standard structure hash_set from standard library, that allows to search elements in the structure with about constant time. But it's not guaranteed that this time will be short. Will anyone suggest a better standard desicion?
Many thanks!

Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. Compared against a hash table, you could see this question

If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.

There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the
beginning of the programm and will not
be added or deleted while programm is running.
If the strings are immutable - so insertion/deletion is "infrequent", so to speak -, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
**Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).*

A hash_set with a suitable number of buckets would be ideal, alternatively a vector with the strings in dictionary order, searched used binary search, would be great too.

The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).

Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!

Well, assuming you truly want an array and not an associative contaner as you've mentioned, the allocation strategy mentioned in Raymond Chen's Blog would be efficient.

Need to store string as id for objects in some fast data structure

I'm implementing a session store for a web-server. Keys are string
and stored objects are pointers. I tried using map but need something
faster. I will look up an object 5-20 times
as frequent than insert.
I tried using hash-map but failed. I felt like I got more constraints than more free time.
I'm coding c/c++ under Linux.
I don't want to commit to boost, since my web server is going to outlive boost. :)
This is a highly relevant question since the hardware (ssd disk) is
changing rapidly. What was the right solution will not be in 2 years.

I was going to suggest a map, but I see you have already ruled this out.
I tried using map but need something
faster.
These are the std::map performance bounds courtesy of the Wikipedia page:
Searching for an element takes O(log n) time
Inserting a new element takes O(log n) time
Incrementing/decrementing an iterator takes O(log n) time
Iterating through every element of a map takes O(n) time
Removing a single map element takes O(log n) time
Copying an entire map takes O(n log n) time.
How have you measured and determined that a map is not optimised sufficiently for you? It's quite possible that any bottlenecks you are seeing are in other parts of the code, and a map is perfectly adequate.
The above bounds seem like they would fit within all but the most stringent scalability requirements.

The type of data structure that will be used will be determined by the data you want to access. Some questions you should ask:
How many items will be in the session store? 50? 100000? 10000000000?
How large is each item in the store (byte size)?
What kind of string input is used for the key? ASCII-7? UTF-8? UCS2?
...
Hash tables generally perform very well for look ups. You can optimize them heavily for speed by writing them yourself (and yes, you can resize the table). Suggestions to improve performance with hash tables:
Choose a good hash function! this will have preferably even distribution among your hash table and will not be time intensive to compute (this will depend on the format of the key input).
Make sure that if you are using buckets to not exceed a length of 6. If you do exceed 6 buckets then your hash function probably isn't distributing evenly enough. A bucket length of < 3 is preferable.
Watch out for how you allocate your objects. If at all possible, try to allocate them near each other in memory to take advantage of locality of reference. If you need to, write your own sub-allocator/heap manager. Also keep to aligned boundaries for better access speeds (aligned is processor/bus dependent so you'll have to determine if you want to target a particular processor type).
BTrees are also very good and in general perform well. (Someone can insert info about btrees here).
I'd recommend looking at the data you are storing and making sure that the data is as small as possible. use shorts, unsigned char, bit fields as necessary. There are other additional ways to squeeze out improved performance as well such as allocating your string data at the end of your struct at the same time that you allocate the struct. i.e.
struct foo {
int a;
char my_string[0]; // allocate an instance of foo to be
// sizeof(int) + sizeof(your string data) etc
}
You may also find that implementing your own string compare routine can actually boost performance dramatically, however this will depend upon your input data.

It is possible to make your own. But you shouldn't have any problems with boost or std::tr1::unordered_map.
A ternary trie may be faster than a hash map for a smaller number of elements.

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations. Inserting keys on from a very sparse domain (all 32bit integers, and approx. 100 are set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions which hit the same entries (like, 500k, but only 100 different ones).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?

Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.

I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.

For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 if it gets more than 70% full. Cuckoo hashing has been discussed on Stackoverflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this will be very simple so the coding and maintenance effort for an assembler implementation shouldn't be unduly hard to maintain.

You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.

A randomized data structure might be perfect for your job. Take a look at the skip list – though I don't know any decend C++ implementation of it. I intended to submit one to Boost but never got around to do it.

Maybe a set with a b-tree (instead of binary tree) as internal data structure. I found this article on codeproject which implements this.

Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?

How much memory do you have? 32-bits take "only" 4GB/8 bytes, which comes to 512MB, not much for a high-end server. That would make your insertions O(1). But that could make the iteration slow. Although skipping all words with only zeroes would optimize away most iterations. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.

Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure a naive unordered set of elements packed in a fixed array with a simple swap to front when an insert collisides is too slow? Its a simple experiment that might show you have memory locality issues rather than algorithmic issues.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js