Hash map optimised for lookup - c++

I am looking for a map with fixed keys (fixed at initialization) that offers faster look-up. It need not support adding or updating elements later. Is there an algorithm that examines the list of keys and formulates a function so that later look-ups are faster? In my case, the keys are strings.
Update:
Keys are not known at compile time, only at the application's initialization time. There won't be any further insertions later, but there will be lots of look-ups, so I want look-ups to be optimized.

CMPH may be what you're looking for. Basically this is gperf without requiring the set at compile-time.
Though of course std::unordered_map (available since C++11) might do just as well, though possibly with a few collisions.
Since you look up strings, a trie (any of the different trie flavours, crit-bit or whatever funky names they have) may also be worth looking into, especially if you have many of them. There are a lot of free trie implementations available.
The advantage of tries is that they can index-compress strings, so they use less memory, which has a higher likelihood of having data in cache. Also the access pattern is less random, which is also cache-friendly. A hash table must store the value plus the hash, and indexes more or less randomly (not randomly, but unpredictably) into memory. A trie/trie-like structure ideally only needs one extra bit that distinguishes a key from its common prefix in each node.
(Note by the way that O(log(N)) may quite possibly be faster than O(1) in such a case, because big-O does not consider things like that.)
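To make the trie idea concrete, here is a minimal sketch. It is a plain, uncompressed byte-wise trie, so it shows only the lookup pattern and none of the memory savings a crit-bit or Patricia trie would give. The class name `Trie` and its methods are made up for illustration.

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal uncompressed trie keyed by bytes. Real crit-bit or Patricia
// tries compress single-child chains; this sketch does not.
class Trie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool terminal = false;  // true if a key ends at this node
    };
    Node root_;

public:
    void insert(const std::string& key) {
        Node* node = &root_;
        for (char c : key) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node);
            node = child.get();
        }
        node->terminal = true;
    }

    bool contains(const std::string& key) const {
        const Node* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return node->terminal;
    }
};
```

Lookup cost is proportional to the key length rather than to the number of keys, which is why measuring it against a hash table on your own data is worthwhile.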

Note that these are distinct things: do you need an upper limit, do you need a fast typical rate, or do you need the fastest lookup ever, no questions asked? The last one will cost you, the first two ones may be conflicting goals.
You could attempt to create a perfect hash function based on the input (i.e. one that has no collisions on the input set). This is a more or less solved problem (e.g. this, this). However, these tools usually generate source code and may spend significant time generating the hash function.
A modification of this would be using a generic hash function (e.g. shift-multiply-add) and do a brute force search over suitable parameters.
This has to be traded off with the cost of a few string comparisons (which aren't that terribly expensive if you don't have to collate).
Another option is to use two distinct hash functions. This increases the cost of a single lookup but makes degradation slightly less likely than aliens stealing your clock cycles. It is rather unlikely that this would be a problem with typical strings and a decent hash function.
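The brute-force variant mentioned above (a generic multiply-add hash plus a search over its parameter) could look roughly like this. The function names, the choice of odd multipliers and the `max_tries` limit are all assumptions made for the sketch; keys are assumed to be distinct.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Multiply-add string hash parameterised by a multiplier.
inline std::uint64_t mul_add_hash(const std::string& s, std::uint64_t mult) {
    std::uint64_t h = 0;
    for (unsigned char c : s) h = h * mult + c;
    return h;
}

// Tries odd multipliers 3, 5, 7, ... until every key lands in its own
// slot of a table with `slots` entries. Returns 0 if no such multiplier
// is found within `max_tries` attempts (or if there are too few slots).
inline std::uint64_t find_collision_free_multiplier(
        const std::vector<std::string>& keys, std::size_t slots,
        std::uint64_t max_tries = 100000) {
    if (slots < keys.size()) return 0;
    std::vector<char> used(slots);
    for (std::uint64_t mult = 3; mult < 3 + 2 * max_tries; mult += 2) {
        used.assign(slots, 0);
        bool ok = true;
        for (const std::string& k : keys) {
            std::size_t slot = mul_add_hash(k, mult) % slots;
            if (used[slot]) { ok = false; break; }  // collision, next multiplier
            used[slot] = 1;
        }
        if (ok) return mult;
    }
    return 0;
}
```

A lookup then hashes with the found multiplier, indexes the slot and does a single string comparison. With a power-of-two slot count only the low bits of the multiplier and of the characters influence the slot, so a prime slot count gives the search more room.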

Try google-sparsehash: http://code.google.com/p/google-sparsehash/
An extremely memory-efficient hash_map implementation. 2 bits/entry overhead!
The SparseHash library contains several hash-map implementations, including
implementations that optimize for space or speed.

In a similar topic ((number of) items known at compile time) , I produced this one: Lookups on known set of integer keys. Low overhead, no need for perfect hash. Fortunately, it is in C ;-)

Related

Storing filepath and size in C++

I'm processing a large number of image files (tens of millions) and I need to return the number of pixels for each file.
I have a function that uses an std::map<string, unsigned int> to keep track of files already processed. If a path is found in the map, then the value is returned, otherwise the file is processed and inserted into the map. I do not delete entries from the map.
The problem is as the number of entries grow, the time for lookup is killing the performance. This portion of my application is single threaded.
I wanted to know if unordered_map is the solution to this, or whether the fact that I'm using std::string as keys is going to affect the hashing and require too many rehashings as the number of keys increases, thus once again killing the performance.
One other item to note is that the paths for the string are expected (but not guaranteed) to have the same prefix, for example: /common/until/here/now_different/. So all strings will likely have the same first N characters. I could potentially store these as relative to the common directory. How likely is that to help performance?
unordered_map will probably be better in this case. It will typically be implemented as a hash table, with amortized O(1) lookup time, while map is usually a binary tree with O(log n) lookups. It doesn't sound like your application would care about the order of items in the map, it's just a simple lookup table.
In both cases, removing the common prefix should be helpful, as it means less time has to be spent needlessly iterating over that part of the strings. For unordered_map it will have to traverse it twice: once to hash and then to compare against the keys in the table. Some hash functions also limit the amount of a string they hash, to prevent O(n) hash performance -- if the common prefix is longer than this limit, you'll end up with a worst-case hash table (everything is in one bucket).
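A rough sketch of the combination (unordered_map plus prefix stripping) follows. `PixelCache`, `key_for` and the `compute` callback are hypothetical names; `compute` stands in for whatever routine actually counts the pixels. Paths that lack the prefix are tagged with a leading NUL byte so they cannot be confused with a stripped path.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Cache of pixel counts keyed by the path relative to a common prefix.
class PixelCache {
    std::string prefix_;
    std::unordered_map<std::string, unsigned int> cache_;

    // Drops the common prefix when present. Other paths are kept whole
    // and tagged with a leading NUL, which cannot start a real path.
    std::string key_for(const std::string& path) const {
        if (path.compare(0, prefix_.size(), prefix_) == 0)
            return path.substr(prefix_.size());
        return std::string(1, '\0') + path;
    }

public:
    explicit PixelCache(std::string prefix, std::size_t expected = 0)
        : prefix_(std::move(prefix)) {
        cache_.reserve(expected);  // pre-size to avoid rehashing while filling
    }

    // Returns the cached value, or calls `compute(path)` once and stores it.
    template <class Compute>
    unsigned int get(const std::string& path, Compute&& compute) {
        std::string key = key_for(path);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        unsigned int pixels = compute(path);
        cache_.emplace(std::move(key), pixels);
        return pixels;
    }
};
```

Passing a realistic `expected` count to the constructor addresses the rehashing worry from the question: the table is sized once up front instead of growing repeatedly.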
I really like Galik's suggestion of using inodes if you can, but if not...
Will emphasise a point already made in comments: if you've reason to care, always implement the alternatives and measure. The more reason, the more effort it's worth expending on that....
So, another option is to use a 128-bit cryptographic strength hash function on your filepaths, then trust that statistically it's extremely unlikely to produce a collision. A rule of thumb is that if you have 2^n distinct keys, you want significantly more than a 2n-bit hash. For ~100m keys, that's about 2*27 bits, so you could probably get away with a 64 bit hash but it's a little too close for comfort and leaves little headroom if the number of images grows over the years. Then use a vector to back a hash table of just the hashes and file sizes, with say quadratic probing. Your caller would ideally pre-calculate the hash of an incoming file path in a different thread, passing your lookup API only the hash.
The above avoids the dynamic memory allocation, indirection, and of course memory usage when storing variable-length strings in the hash table and utilises the cache much better. Relying on hashes not colliding may make you uncomfortable, but past a point the odds of a meteor destroying the computer, or lightning frying it, will be higher than the odds of a collision in the hash space (i.e. before mapping to hash table bucket), so there's really no point fixating on that. Cryptographic hashing is relatively slow, hence the suggestion to let clients do it in other threads.
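As a sketch of what such a table might look like: the caller supplies a 128-bit digest (computing it is outside this example), and the table stores only digests and values in one vector. `Digest`, `DigestTable` and the fixed power-of-two capacity are assumptions for illustration; the probe sequence uses triangular-number steps, one common form of quadratic probing that visits every slot of a power-of-two table.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 128-bit digest of a file path, computed by the caller.
struct Digest {
    std::uint64_t hi, lo;
    bool operator==(const Digest& o) const { return hi == o.hi && lo == o.lo; }
};

class DigestTable {
    struct Slot { Digest key; unsigned int pixels; bool used; };
    std::vector<Slot> slots_;
    std::size_t mask_;

public:
    // The table has 2^capacity_log2 slots and never grows in this sketch.
    explicit DigestTable(unsigned capacity_log2)
        : slots_(std::size_t(1) << capacity_log2, Slot{{0, 0}, 0, false}),
          mask_(slots_.size() - 1) {}

    // Returns a pointer to the stored value, or nullptr if absent.
    const unsigned int* find(const Digest& d) const {
        std::size_t i = d.lo & mask_;
        for (std::size_t step = 1; step <= slots_.size(); ++step) {
            const Slot& s = slots_[i];
            if (!s.used) return nullptr;
            if (s.key == d) return &s.pixels;
            i = (i + step) & mask_;  // offsets 1, 3, 6, 10, ...
        }
        return nullptr;
    }

    // Inserts if absent (keeping the old value otherwise).
    // Returns false only when the table is full.
    bool insert(const Digest& d, unsigned int pixels) {
        std::size_t i = d.lo & mask_;
        for (std::size_t step = 1; step <= slots_.size(); ++step) {
            Slot& s = slots_[i];
            if (!s.used) { s = Slot{d, pixels, true}; return true; }
            if (s.key == d) return true;
            i = (i + step) & mask_;
        }
        return false;
    }
};
```

Because the digest is already uniformly distributed, its low bits can serve directly as the bucket index; no second hashing step is needed.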
(I have worked with a proprietary distributed database based on exactly this principle for path-like keys.)
Aside: beware Visual C++'s string hashing - they pick 10 characters spaced along your string to incorporate in the hash value, which would be extremely collision prone for you, especially if several of those were taken from the common prefix. The C++ Standard leaves implementations the freedom to provide whatever hashes they like, so re-measure such things if you ever need to port your system.

How to make a fast dictionary that contains another dictionary?

I have a map<size_t, set<size_t>>, which, for better performance, I'm actually representing as a lexicographically-sorted vector<pair<size_t, vector<size_t>>>.
What I need is a set<T> with fast insertion times (removal doesn't matter), where T is the data type above, so that I can check for duplicates (my program runs until there are no more unique T's being generated.).
So far, switching from set to unordered_set has turned out to be quite beneficial (it makes my program run > 25% faster), but even now, inserting T still seems to be one of the main bottlenecks.
The maximum number of integers in a given T is around ~1000, and each integer is also <= ~1000, so the numbers are quite small (but there are thousands of these T's being generated).
What I have already tried:
Using unsigned short. It actually decreases performance slightly.
Using Google's btree::btree_map.
It's actually much slower because I have to work around the iterator invalidation.
(I have to copy the keys, and I think that's why it turned out slow. It was at least twice as slow.)
Using a different hash function. I haven't found any measurable difference as long as I use something reasonable, so it seems like this can't be improved.
What I have not tried:
Storing "fingerprints"/hashes instead of the actual sets.
This sounds like the perfect solution, except that the fingerprinting function needs to be fast, and I need to be extremely confident that collisions won't happen, or they'll screw up my program.
(It's a deterministic program that needs exact results; collisions render it useless.)
Storing the data in some other compact, CPU-friendly way.
I'm not sure how beneficial this would be, because it might involve copying around data, and most of the performance I've gained so far is by (cleverly) avoiding copying data in many situations.
What else can I do to improve the speed, if anything?
I am under the impression that you have 3 different problems here:
you need the T itself to be relatively compact and easy to move around
you need to quickly check whether a T is a possible duplicate of an already existing one
you finally need to quickly insert the new T in whatever data structure you have to check for duplicates
Regarding T itself, it is not yet as compact as it could be. You could probably use a single std::vector<size_t> to represent it:
N pairs
N Indexes
N "Ids" of I elements each
all that can be linearized:
[N, I(0), ..., I(N-1),
R(0) = Size(Id(0)), Id(0, 0), ... , Id(0, R(0)-1),
R(1) = ... ]
and this way you have a single chunk of memory.
Note: depending on the access pattern you may have to tweak it, specifically if you need random access to any ID.
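The linearized layout above can be produced like this. The alias `Nested` and the function name `flatten` are made up for the sketch; the output order follows the layout described: count, then all indexes, then each id list preceded by its length.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Nested = std::vector<std::pair<std::size_t, std::vector<std::size_t>>>;

// Flattens into [N, I(0)..I(N-1), R(0), Id(0,0)..Id(0,R(0)-1), R(1), ...].
inline std::vector<std::size_t> flatten(const Nested& t) {
    std::size_t total = 1 + 2 * t.size();
    for (const auto& p : t) total += p.second.size();

    std::vector<std::size_t> flat;
    flat.reserve(total);  // one allocation for the whole structure
    flat.push_back(t.size());
    for (const auto& p : t) flat.push_back(p.first);
    for (const auto& p : t) {
        flat.push_back(p.second.size());
        flat.insert(flat.end(), p.second.begin(), p.second.end());
    }
    return flat;
}
```

Two values of T are equal exactly when their flattened vectors are equal, so the flat form can be hashed and compared as a single contiguous array.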
Regarding the possibility of duplicates, a hash-map seems indeed quite appropriate. You will need a good hash function, but with a single array of size_t (or unsigned short if you can, it is smaller), you can just pick MurmurHash or CityHash or SipHash. They are all blazing fast and do their damnedest to produce good quality hashes (not cryptographic ones, the emphasis is on speed).
Now, the question is when is it slow when checking for duplicates.
If you spend too much time checking for non-existing duplicates because the hash-map is too big, you might want to invest in a Bloom Filter in front of it.
Otherwise, check your hash function to make sure that it is really fast and has a low collision rate and your hash-map implementation to make sure it only ever computes the hash once.
Regarding insertion speed: normally a hash-map, specifically if well-balanced and pre-sized, should have among the quickest insertion times. Make sure you move data into it and do not copy it; if you cannot move, it might be worth using a shared_ptr to limit the cost of copying.
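Putting the last two points together, a pre-sized set that takes its elements by move might be sketched as follows. The hash here is a simple FNV-1a-style mix applied per word, used only as a stand-in for MurmurHash or CityHash; `Flat`, `FlatHash` and `SeenSet` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

using Flat = std::vector<std::size_t>;  // the linearized T

// FNV-1a-style mixing, one word at a time (a stand-in, not a tuned hash).
struct FlatHash {
    std::size_t operator()(const Flat& v) const {
        std::uint64_t h = 14695981039346656037ull;
        for (std::size_t x : v) {
            h ^= x;
            h *= 1099511628211ull;
        }
        return static_cast<std::size_t>(h);
    }
};

class SeenSet {
    std::unordered_set<Flat, FlatHash> seen_;

public:
    // Pre-sizing avoids rehashing while the set fills up.
    explicit SeenSet(std::size_t expected) { seen_.reserve(expected); }

    // Returns true if `t` was new. The vector is moved in, not copied.
    bool insert(Flat&& t) { return seen_.insert(std::move(t)).second; }

    std::size_t size() const { return seen_.size(); }
};
```

Whether the per-word hash is good enough for your data is something only a profile will tell; swapping in another hash only means replacing `FlatHash`.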
Don't be afraid of collisions: use a cryptographic hash, but choose a fast one. A 256-bit collision is MUCH less probable than a hardware error. Sun used this approach to deduplicate blocks in ZFS, which uses SHA256. You can probably use a less secure hash. If it takes $1000000 to find one collision, the hash isn't secure, but a single collision wouldn't noticeably hurt your performance; many collisions would cost many times $1000000. You can use something like an (unordered) multimap<SHA, T> to deal with collisions. By the way, ANY hash table suffers from collisions (or takes too much memory), because the table size is much smaller than the number of possible hashes, so an ordered map (an rbtree in gcc) or btree_map has a better guaranteed time. Also, a hash table can be DOSed via hash collisions; a secret salt can probably solve this problem.
You can also:
1) use short ints
2) reinterpret your array as an array of something like uint64_t for fast comparison (+some trailing elements), or even reinterpret it as an array of 128-bit values (or 256-bit, depending on your CPU) and compare via SSE. This should push your performance to memory speed limit.
From my experience, SSE is fast only with aligned memory access. uint64_t comparison probably needs alignment for speed too, so you have to allocate memory manually with proper alignment (allocate more and skip the first bytes). tcmalloc is 16-byte aligned, so uint64_t-ready. It is strange that you have to copy keys in the btree; you can avoid it by using shared_ptr. With fast comparisons and a slow hash, btree or std::map may turn out to be faster than a hash table. I guess any hash is slower than memory. You can also calculate the hash via SSE and can probably find a library that does it.
PS: I strongly recommend using a profiler if you don't already. Please tell us what % of its time your program spends inserting, comparing during insertion, and calculating the hash.

Removing large number of strings from a huge list

I have a large list of strings stored in one huge memory block (usually there are 100k+ or even 1M+ of them). These are actually hashes, so the alphabet of the strings is limited to A-F0-9 and each string is exactly 32 bytes long (so it's stored 'compressed'). I will call this list the main list from now on.
I want to be able to remove items from the main list. This will usually be done in bulk, so I get a large list (usually about 100 to 10k) of hashes which I need to find in this list and remove. At the end of this operation there cannot be any empty blocks in the large memory block, so I need to take that into account. It is not guaranteed that all of the items will be in the main list, but none will be there multiple times. No reallocation can be done; the main block will always stay the same size.
The naive approach of iterating through the main list and checking whether a given hash should be removed of course works, but is a bit slow. Also there is a bit too much moving of small memory blocks, because every time a hash is flagged for removal I overwrite it with the last element of the main list, thus satisfying the condition of no empty blocks. This of course creates thousands of small memcpy's, which in turn slow things down more because I get tons of cache misses.
Is there a better approach?
Some important notes:
the main list is not sorted and I cannot waste time sorting it; this is a limitation imposed by the whole project, and rewriting it so the list is always sorted is not an option (it might not even be possible)
memory is not really a problem, but the less is used the better
I can use STL, but not boost
Okay, here's what I'd do if I absolutely had to optimize the hell out of this.
I'm assuming order doesn't matter, which seems to be the case as you (IIUC) remove items by swapping them with the last item.
Store 128 bit integers (however you represent them, either your compiler supports them natively or you use a small array of 32/64 bit integers) instead of 32-char strings. See my comment on the question.
Roll my own hash set of 128 bit integers. Note that you can optimize a lot here if you're willing to think a bit, make some assumption, and get down 'n dirty. Some notes:
You only need to store the hashes themselves (for collision resolution), and a bit or two of metadata to identify deleted/unused slots. Have a look at what existing hash tables do if you're unsure how to guarantee correctness. I figure it's even simpler if you only ever delete (not add) after building the hash set. Though I think you could even do without that metadata if you had a value that's not a valid hash to denote empty slots, but this way removal is easier (just flip a bit, instead of overwriting 128 bit).
You don't need a hash function, as your inputs are already integers. You just need to do what every hash table does anyway: take the hashes modulo 2^n to derive an index that's not freaking huge. Choose n such that the load factor (the percentage of table entries used) is reasonable (< 2/3 seems standard). Choosing a power of two makes the modulo operation cheaper (masking off bits via binary AND), and allows you to just do it on the lower 32 or 64 bits (ignoring the rest).
Choosing a collision resolution strategy is hard. I'd probably go with open addressing with linear probing, as first attempt. It may work badly, but if your input hashes are any good, this seems unlikely. There's also a probing scheme that factors in more and more of the bits you originally cut off, used by CPython's dict.
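To give an idea of the size of such a hand-rolled set, here is a minimal sketch with open addressing, linear probing and a per-slot state byte for deleted entries. `Key128`, `HashSet128` and the fixed power-of-two capacity are assumptions; there is no resizing, so the caller must pick a slot count that keeps the load factor reasonable.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Key128 { std::uint64_t hi, lo; };

class HashSet128 {
    enum State : std::uint8_t { Empty, Full, Deleted };
    std::vector<Key128> keys_;
    std::vector<State> state_;
    std::size_t mask_;

    // Index of the slot holding `k`, or keys_.size() if absent.
    std::size_t locate(const Key128& k) const {
        std::size_t i = k.lo & mask_;  // low bits of the hash are the index
        for (std::size_t n = 0; n <= mask_; ++n, i = (i + 1) & mask_) {
            if (state_[i] == Empty) break;  // end of the probe chain
            if (state_[i] == Full && keys_[i].hi == k.hi && keys_[i].lo == k.lo)
                return i;
        }
        return keys_.size();
    }

public:
    explicit HashSet128(unsigned log2_slots)
        : keys_(std::size_t(1) << log2_slots),
          state_(keys_.size(), Empty),
          mask_(keys_.size() - 1) {}

    bool contains(const Key128& k) const { return locate(k) != keys_.size(); }

    // Returns false if `k` was already present or the table is full.
    bool insert(const Key128& k) {
        if (contains(k)) return false;
        std::size_t i = k.lo & mask_;
        for (std::size_t n = 0; n <= mask_; ++n, i = (i + 1) & mask_) {
            if (state_[i] != Full) { keys_[i] = k; state_[i] = Full; return true; }
        }
        return false;
    }

    // Marks the slot as deleted so later probe chains stay intact.
    bool erase(const Key128& k) {
        std::size_t i = locate(k);
        if (i == keys_.size()) return false;
        state_[i] = Deleted;
        return true;
    }
};
```

Removal only flips the state byte, as suggested above, so lookups for keys further along the same probe chain keep working.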
Now, this is a lot more work and maintenance burden than using off-the-shelf solutions. I wouldn't advise it unless this really is as performance-critical as it sounds in your description.
If C++11 is an option, and your compiler's unordered_set is any good, maybe you should just use it and save yourself most of the hassle (but be aware that this probably increases memory requirements). You still need to specialize std::hash and std::equal_to or operator==. Alternatively, supply your own Hash and KeyEqual for unordered_set, but that probably doesn't offer any benefit.
Two things might help. First, at least sort the list of items
to be removed; that way, you can use a binary search
(std::lower_bound) on it. Second, keep two pointers:
a source and a destination. If the source points to something
not in the list to be removed, copy it to the destination, and
advance both. If the source points to something to be removed,
just advance the source pointer, without copying. There should
never be a reason to copy an entry more than once.
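A sketch of that single pass, assuming each hash is stored packed as 16 bytes (32 hex characters). The names `Hash` and `remove_bulk` are made up, and std::binary_search is used as a shorthand for the lower_bound lookup.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Hash = std::array<std::uint8_t, 16>;  // 32 hex chars stored packed

// Removes every entry of `to_remove` from list[0..count) in one pass and
// returns the new count. `to_remove` is sorted in place; the main list is
// not sorted and the relative order of the survivors is preserved.
inline std::size_t remove_bulk(Hash* list, std::size_t count,
                               std::vector<Hash>& to_remove) {
    std::sort(to_remove.begin(), to_remove.end());
    std::size_t dst = 0;
    for (std::size_t src = 0; src < count; ++src) {
        if (std::binary_search(to_remove.begin(), to_remove.end(), list[src]))
            continue;                           // dropped: advance source only
        if (dst != src) list[dst] = list[src];  // each survivor copied at most once
        ++dst;
    }
    return dst;
}
```

Only the short removal list is sorted, so the constraint that the main list stays unsorted is respected, and the copies walk the block front to back instead of jumping to the last element.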

Selection of map or unordered_map based on keys's type

A generally asked question is whether we should use unordered_map or map for faster access.
The most common( rather age old ) answer to this question is:
If you want direct access to single elements, use unordered_map but if you want to iterate over elements(most likely in a sorted way) use map.
Shouldn't we consider the data type of key while making such a choice?
As hash algorithm for one dataType(say int) may be more collision prone than other(say string).
If that is the case (the hash algorithm is quite collision prone), then I would probably use map even for direct access, as in that case the O(1) constant time (probably averaged over a large number of inputs) for unordered_map may be more than lg(N) even for a fairly large value of N.
You raise a good point... but you are focusing on the wrong part.
The problem is not the type of the key, per se, but the hash function that is used to derive a hash value for that key.
Lexicographical ordering is easy: if you tell me you want to order a structure according to its 3 fields (and they already support ordering themselves) then I'll just write:
    bool operator<(Struct const& left, Struct const& right) {
        return boost::tie(left._1, left._2, left._3)
             < boost::tie(right._1, right._2, right._3);
    }
And I am done!
However, writing a hash function is difficult. You need some knowledge about the distribution of your data (statistics), you might need to prevent specially crafted attacks, etc... Honestly, I do not expect many people to be able to craft a good hash function. But the worst part is, composition is difficult too! Given two independent fields, combining their hash values right is hard (hint: boost::hash_combine).
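For comparison with the one-line ordering above, here is roughly what the hashing side takes when done by hand. The `hash_combine` helper follows the mixing formula long used by boost::hash_combine but is written out locally; `Record` and `RecordHash` are hypothetical, and nothing here defends against crafted input.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct Record {
    int id;
    std::string name;
    double weight;
    bool operator==(const Record& o) const {
        return id == o.id && name == o.name && weight == o.weight;
    }
};

// Mixes the hash of `v` into `seed` (formula modelled on boost::hash_combine).
template <class T>
void hash_combine(std::size_t& seed, const T& v) {
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct RecordHash {
    std::size_t operator()(const Record& r) const {
        std::size_t seed = 0;
        hash_combine(seed, r.id);
        hash_combine(seed, r.name);
        hash_combine(seed, r.weight);
        return seed;
    }
};

using RecordSet = std::unordered_set<Record, RecordHash>;
```

Even this small example needs an equality operator, a combiner and a hasher, and its quality still depends on the per-field std::hash implementations, which the Standard leaves to the library.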
So, indeed, if you have no idea what you are doing and you are treating user-crafted data, just stick to a map. It's maybe slower (not sure), but it's safer.
There isn't really such a thing as collision prone object, because this thing is dependent on the hash function you use. Assuming the objects are not identical - there is some feature that can be utilized to create an informative hash function to be used.
Assuming you have some knowledge on your data - and you know it is likely to have a lot of collision for some hash function h1() - then you should find and use a different hash function h2() which is better suited for this task.
That said, there are other reasons to favor tree-based data structures over hash-based ones (such as latency and the size of the set); some are covered by my answer in this thread.
There's no point trying to be too clever about this. As always, profile, compare, optimise if useful. There are many factors involved - quite a few of which aren't specified in the Standard and will vary across compilers. Some things may profile better or worse on specific hardware. If you are interested in this stuff (or paid to pretend to be) you should learn about these things a bit more systematically. You might start with learning a bit about actual hash functions and their characteristics. It's extremely rare to be unable to find a hash function that has - for all practical purposes - no more collision proneness than a random but repeatable value - it's just that sometimes it's slower to approach that point than it is to handle a few extra collisions.

data structure for storing array of strings in a memory

I'm considering which data structure to use for storing a large array of strings in memory. Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as it can be. Saving memory is not important. I am inclined towards the standard hash_set structure from the standard library, which allows searching for elements in about constant time. But it's not guaranteed that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a Prefix Tree
A Trie is better than a Binary Search Tree for searching elements. For a comparison against a hash table, see this question.
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.
If the strings are immutable (so insertion/deletion is "infrequent", so to speak), another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph that might* be faster than a hash table and has a better worst case guarantee.
*Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).
A hash_set with a suitable number of buckets would be ideal; alternatively a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
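A minimal sketch of such an atom table, using an index as the atom. `AtomTable`, `intern` and `name` are hypothetical names. It relies on the fact that std::unordered_map never moves its elements on rehash, so pointers to the stored keys stay valid.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class AtomTable {
    std::unordered_map<std::string, std::size_t> ids_;
    std::vector<const std::string*> names_;  // atom -> key stored in ids_

public:
    AtomTable() = default;
    AtomTable(const AtomTable&) = delete;             // names_ points into ids_
    AtomTable& operator=(const AtomTable&) = delete;

    // Returns the atom for `s`, creating one on first sight.
    std::size_t intern(const std::string& s) {
        auto res = ids_.emplace(s, names_.size());
        if (res.second) names_.push_back(&res.first->first);
        return res.first->second;
    }

    // Recovers the string for an atom previously returned by intern().
    const std::string& name(std::size_t atom) const { return *names_[atom]; }
};
```

After interning, comparing two strings for equality is a comparison of two integers, and each distinct string is stored once.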
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (Will search in a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
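The whole approach fits in a few lines. This sketch uses std::string rather than char* for brevity, and `SortedStringSet` is a made-up name.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

class SortedStringSet {
    std::vector<std::string> items_;

public:
    // Build once: sort lexicographically and drop duplicates.
    explicit SortedStringSet(std::vector<std::string> items)
        : items_(std::move(items)) {
        std::sort(items_.begin(), items_.end());
        items_.erase(std::unique(items_.begin(), items_.end()), items_.end());
    }

    // Lookup: binary search, O(log n) string comparisons.
    bool contains(const std::string& s) const {
        return std::binary_search(items_.begin(), items_.end(), s);
    }
};
```

Note that each std::string may still own a separate heap allocation, so the cache-locality argument holds fully only when the character data itself is stored contiguously.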
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy mentioned in Raymond Chen's blog would be efficient.