Hi, I want to use a hashmap that maps words in a dictionary to their indices in that dictionary.
What would be the fastest hash algorithm for this?
Thanks!
At the bottom of this page there is a section, "A Note on Hash Functions", with some information you might find useful.
For convenience, I'll just replicate some links here:
Bob Jenkins
Paul Hsieh
Fowler/Noll/Vo (FNV)
MurmurHash
There are many different hashing algorithms of varying efficiency, but the most important issue is that the one you choose scatters the items fairly uniformly across the hash buckets.
However, you may as well assume that the Microsoft/library engineers have done a decent job of writing an efficient and effective hash algorithm, and just use the built-in libraries/classes.
The fastest hash function will be
template <class T>
size_t hash(T key) {
    return 0;
}
however, though the hashing will be mighty fast, you will suffer performance elsewhere. What you want is to try several hashing algorithms against the actual data you expect to use and see which one gives you the best performance in aggregate, and only if hashing or lookup is even a performance bottleneck. Until then, go with something handy; MD5 is pretty widely available.
Have you tried just using the STL hash_map and seeing if it serves your needs before rolling anything more complex?
http://www.sgi.com/tech/stl/hash_map.html
Boost has a hash function that you can reuse for your own data (it's predefined for common types). That would probably work well and fast enough if your needs aren't special.
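For illustration, here's a minimal sketch (assuming Boost is available) that uses boost::unordered_map, which picks up boost::hash by default, to map each dictionary word to its index, as in the original question; the word list is made up:

#include <boost/unordered_map.hpp>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> dictionary;
    dictionary.push_back("apple");
    dictionary.push_back("banana");
    dictionary.push_back("cherry");

    // boost::hash<std::string> is the default hasher here.
    boost::unordered_map<std::string, std::size_t> index;
    for (std::size_t i = 0; i < dictionary.size(); ++i)
        index[dictionary[i]] = i;

    std::cout << index["banana"] << "\n";   // prints 1
    return 0;
}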
What is your use case? A radix search tree (trie) might be more suitable than a hash if you're mapping from string to integer. Tries have the advantage of reducing key comparisons for variable-length keys (e.g., strings).
Even a binary search tree (e.g., STL's map) might be superior to a hash based container in terms of memory use and number of key comparisons. A hash is more efficient only if you have very few collisions.
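To make the trie suggestion concrete, here is a minimal, illustrative sketch (the class and member names are made up, not from any library; C++11 is assumed) that maps strings to integer indices. A real implementation would likely use per-node child arrays rather than a std::map per node:

#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    int value = -1;                      // -1 means "no value stored here"
};

class Trie {
    TrieNode root_;
public:
    void insert(const std::string& key, int value) {
        TrieNode* node = &root_;
        for (char c : key) {
            auto& child = node->children[c];
            if (!child) child.reset(new TrieNode());
            node = child.get();
        }
        node->value = value;
    }
    // Returns -1 if the key is absent.
    int find(const std::string& key) const {
        const TrieNode* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return -1;
            node = it->second.get();
        }
        return node->value;
    }
};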
Related
A commonly asked question is whether we should use unordered_map or map for faster access.
The most common (rather age-old) answer to this question is:
If you want direct access to single elements, use unordered_map, but if you want to iterate over elements (most likely in a sorted way), use map.
Shouldn't we consider the data type of the key while making such a choice?
After all, the hash algorithm for one data type (say int) may be more collision prone than for another (say string).
If that is the case (the hash algorithm is quite collision prone), then I would probably use map even for direct access, since the O(1) constant time for unordered_map (probably averaged over a large number of inputs) may exceed lg(N) even for fairly large values of N.
You raise a good point... but you are focusing on the wrong part.
The problem is not the type of the key, per se, but on the hash function that is used to derive a hash value for that key.
Lexicographical ordering is easy: if you tell me you want to order a structure according to its 3 fields (and they already support ordering themselves) then I'll just write:
bool operator<(Struct const& left, Struct const& right) {
    return boost::tie(left._1, left._2, left._3)
         < boost::tie(right._1, right._2, right._3);
}
And I am done!
However, writing a hash function is difficult. You need some knowledge about the distribution of your data (statistics), you might need to prevent specially crafted attacks, etc. Honestly, I do not expect many people to be able to craft a good hash function. And the worst part is, composition is difficult too! Given two independent fields, combining their hash values correctly is hard (hint: boost::hash_combine).
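To make the composition point concrete, here is a hedged sketch of how the three-field Struct from the ordering example above might get a composed hash with boost::hash_combine (the field types are made up for illustration):

#include <boost/functional/hash.hpp>
#include <cstddef>
#include <string>

struct Struct { int _1; int _2; std::string _3; };

// Free function picked up by boost::hash<Struct> via argument-dependent lookup.
std::size_t hash_value(Struct const& s) {
    std::size_t seed = 0;
    boost::hash_combine(seed, s._1);
    boost::hash_combine(seed, s._2);
    boost::hash_combine(seed, s._3);
    return seed;
}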
So, indeed, if you have no idea what you are doing and you are handling user-supplied data, just stick to a map. It may be slower (not sure), but it's safer.
There isn't really such a thing as a collision-prone object, because collision behavior depends on the hash function you use. Assuming the objects are not identical, there is some feature that can be used to create an informative hash function.
Assuming you have some knowledge of your data, and you know it is likely to produce a lot of collisions for some hash function h1(), then you should find and use a different hash function h2() that is better suited to the task.
That said, there are other reasons to favor tree-based data structures over hash-based ones (such as latency and the size of the set); some are covered by my answer in this thread.
There's no point trying to be too clever about this. As always: profile, compare, and optimise if useful. There are many factors involved, quite a few of which aren't specified in the Standard and will vary across compilers. Some things may profile better or worse on specific hardware. If you are interested in this stuff (or paid to pretend to be) you should learn about these things a bit more systematically. You might start with learning a bit about actual hash functions and their characteristics. It's extremely rare to be unable to find a hash function that is, for all practical purposes, no more collision prone than a random but repeatable value; it's just that sometimes it's slower to approach that point than it is to handle a few extra collisions.
People have asked similar questions about the efficiency of various data structures but none I have read are totally applicable to my scenario so I wondered if people had suggestions for one that was tailored to satisfy the following criteria efficiently:
Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: The key is a 32-bit uint.
The elements are all unique and therefore can be thought of as a set.
The only operations required are adding and getting, not deletion. These need to be quick, as they will be used several hundred thousand times in a typical run!
The order in which elements are kept is irrelevant.
Speed is more important than memory consumption... though it can't be too greedy!
I am developing for a company that will use the program commercially, so any third-party data structure would have to be free of licensing restrictions; but if the STL has a data structure that will do the job efficiently then that would be perfect.
I know there are countless Hashmap/Dictionary style C++ data structures with implementations that are built to satisfy different criteria so if someone can suggest one ideal for this situation then that would be greatly appreciated.
Many thanks
Edit:
I found this passage on SO that seems to suggest unordered_map would be good?
hash_map and unordered_map are generally implemented with hash tables. Thus the order is not maintained. unordered_map insert/delete/query will be O(1) (constant time) where map will be O(log n), where n is the number of items in the data structure. So unordered_map is faster, and if you don't care about the order of the items it should be preferred over map. Sometimes you want to maintain order (ordered by the key), and for that map would be the choice.
Looks like a prefix tree (with the element stored at each terminal node) also fits this scenario. It's damn fast, even faster than a hash map, because no hash value calculation is done and getting a value is purely O(n) where n is the key length. It's a bit memory hungry, but common prefixes of keys are shared in the same node path.
EDIT: I assume the keys are strings, not simple values like integers.
As for built-in solutions, I'd recommend google::dense_hash_map. It is really fast, especially for numeric keys. You'll have to decide on a specific key value that will be reserved as the "empty_key". Moreover, here is a really nice comparison of different hash-map implementations.
An excerpt
Library          Linux-intCPU (sec)   Linux-strCPU (sec)   Linux PeakMem (MB)
glib             3.490                4.720                24.968
ghthash          3.260                3.460                61.232
CC’s hashtable   3.040                4.050                129.020
TR1              1.750                3.300                28.648
STL hash_set     2.070                3.430                25.764
google-sparse    2.560                6.930                5.42/8.54
google-dense     0.550                2.820                24.7/49.3
khash (C++)      1.100                2.900                6.88/13.1
khash (C)        1.140                2.940                6.91/13.1
STL set (RB)     7.840                18.620               29.388
kbtree (C)       4.260                17.620               4.86/9.59
NP’s splaytree   11.180               27.610               19.024
However, when a "deleted_key" is also set, this map can perform deletions, so maybe it'll be possible to create a custom solution that is even more efficient. But apart from that minor point, any hash map should suit your needs exactly (note that "map" is an ordered tree map and thus slower).
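For what it's worth, a minimal usage sketch might look like the following (the header path varies between installs, e.g. <google/dense_hash_map> vs. <sparsehash/dense_hash_map>, and the key values are illustrative):

#include <google/dense_hash_map>
#include <iostream>

int main() {
    google::dense_hash_map<unsigned int, int> counts;
    // The map cannot store the value chosen as the empty key,
    // so pick one you will never insert.
    counts.set_empty_key(0xFFFFFFFFu);
    // counts.set_deleted_key(0xFFFFFFFEu);  // only needed if you call erase()

    counts[42] = 7;
    counts[1000] = 3;
    std::cout << counts[42] << "\n";   // prints 7
    return 0;
}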
What you need definitely sounds like a hash set, C++ has this as either std::tr1::unordered_set or in Boost.Unordered.
P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.
It sounds like std::unordered_set would fit the bill, but without knowing more about the key, it's difficult to say. I'm curious about how you can guarantee that there will be no possibility of collisions: this implies a small (less than the size of the table), finite set of keys. If this is the case, it may be more efficient to map the keys to a small int, and use std::vector (with empty slots for the entries not present).
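To illustrate that last suggestion, here's a rough sketch (the names are made up) of a direct-indexed table, assuming the 32-bit keys can be mapped into a known range [0, size):

#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {
    bool present;
    int value;
};

class DirectTable {
    std::vector<Slot> slots_;
public:
    // Slots are value-initialized, so present starts out false.
    explicit DirectTable(std::size_t size) : slots_(size) {}

    void add(uint32_t key, int value) {
        slots_[key].present = true;
        slots_[key].value = value;
    }
    bool get(uint32_t key, int& value_out) const {
        if (!slots_[key].present) return false;
        value_out = slots_[key].value;
        return true;
    }
};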
What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that; it's also in Boost/TR1/C++0x.
Is there a way to write a simple hashtable with strings as the keys and the frequency as the value, so that there are NO collisions? There will be no removal from the hashtable, and if the object already exists in the hashtable, then its frequency is just updated (the counts are added together).
I was thinking there might be an algorithm that can compute a unique number from the string, which would be used as the index.
Yes, I am avoiding the use of all STL constructs, including unordered_map.
You can use any perfect hash generator, like gperf.
See here for a list: http://en.wikipedia.org/wiki/Perfect_hash_function
P.S. You'd still possibly want to use a map instead of a flat array/vector in case the mapped domain gets too big/sparse.
It really depends on what you mean by 'simple'.
The std::map is a fairly simple class. Still, it uses a red-black tree with all of the insertion, deletion, and balancing nicely hidden away, and it is templated to handle any orderable type as a key and any type as the value. Most map classes use a similar implementation, and avoid any sort of hashing functionality.
Hashing without collisions is not a trivial matter whatsoever. Perhaps the simplest method is Pearson Hashing.
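For reference, here is a sketch of Pearson hashing (not anyone's specific implementation): a 256-entry permutation table drives a one-byte rolling hash, and wider hashes are usually built by running it several times with different starting values. The permutation below is an arbitrary illustrative choice:

#include <cstdint>
#include <string>

uint8_t pearson_hash(const std::string& key) {
    // Any permutation of 0..255 works; this table maps i -> (i * 167 + 13) mod 256,
    // which is a permutation because 167 is coprime to 256.
    static uint8_t table[256];
    static bool initialized = false;
    if (!initialized) {
        for (int i = 0; i < 256; ++i)
            table[i] = static_cast<uint8_t>((i * 167 + 13) & 0xFF);
        initialized = true;
    }
    uint8_t h = 0;
    for (std::string::size_type i = 0; i < key.size(); ++i)
        h = table[h ^ static_cast<uint8_t>(key[i])];
    return h;
}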
It seems like you have 3 choices:
Implement your own perfect hashing class. This would be a pretty good sized class with a lot of functionality and some decently complex algorithms. I don't think this is simple.
Download and use a perfect hashing library that is already out there. Of course, you have to worry about deployability.
Use STL's map class. It's built in, well-documented, easy to use, type-flexible, and completely cross-platform. This seems like the 'simplest' solution.
If I may ask, Why are you avoiding STL?
If the set of possible strings is known beforehand, you can use a perfect hash function generator to do this. But otherwise, what you ask is impossible.
Now, it IS possible to make the likelihood of collisions extremely low by using a good hash function and making sure your table is huge. You basically need a big enough table to make the likelihood of invoking the Birthday Paradox low enough to suit you. Then you just use n bits of output from SHA-1, and 2^n will be your table size.
I'm also wondering if maybe you could use a Bloom filter and have an actual counter instead of bits. Keep a list of all the words you've stuffed into the Bloom filter and which entries they've incremented (which will be the same each time), and you have yourself a gigantic system of linear equations that you might be able to solve to get all the individual counts back out again.
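As a rough sketch of the counter part of that idea (essentially a counting filter; the two hash functions and the slot count are arbitrary choices for illustration, and C++11's std::hash is assumed):

#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class CountingFilter {
    std::vector<uint32_t> counters_;

    std::size_t h1(const std::string& s) const {
        return std::hash<std::string>()(s) % counters_.size();
    }
    std::size_t h2(const std::string& s) const {
        uint64_t h = 14695981039346656037ull;      // FNV-1a, just as a second hash
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            h ^= static_cast<unsigned char>(s[i]);
            h *= 1099511628211ull;
        }
        return static_cast<std::size_t>(h % counters_.size());
    }
public:
    explicit CountingFilter(std::size_t slots) : counters_(slots, 0) {}

    void add(const std::string& word) {
        ++counters_[h1(word)];
        ++counters_[h2(word)];
    }
    // Never under-counts; may over-count when other words collide.
    uint32_t estimate(const std::string& word) const {
        return std::min(counters_[h1(word)], counters_[h2(word)]);
    }
};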
I'm considering a data structure for storing a large array of strings in memory. Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as it can be. Saving memory is not important. I incline towards the standard structure hash_set from the standard library, which allows searching for elements in roughly constant time. But it's not guaranteed that this time will be short. Can anyone suggest a better standard solution?
Many thanks!
Try a Prefix Tree
A trie is better than a binary search tree for searching elements. For a comparison against a hash table, you could see this question.
If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.
It's almost never truly the case that speed is the only crucial point.
There is e.g. google-sparsehash.
It includes a dense hash set/map (re)implementation that may perform better than the standard library hash set/map.
See performance. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.
If the strings are immutable, so that insertion/deletion is "infrequent" so to speak, another option is to build a Directed Acyclic Word Graph or a Compact Directed Acyclic Word Graph, which might* be faster than a hash table and has a better worst-case guarantee.
* Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).
A hash_set with a suitable number of buckets would be ideal; alternatively, a vector with the strings in dictionary order, searched using binary search, would be great too.
The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.
A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
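Here's a small, illustrative sketch of the atom idea (the names are made up): each string is stored once, and everything else passes around its index, so equality tests become integer comparisons:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class AtomTable {
    std::unordered_map<std::string, std::size_t> index_;
    std::vector<std::string> strings_;
public:
    // Returns the existing atom for s, or creates a new one.
    std::size_t intern(const std::string& s) {
        auto it = index_.find(s);
        if (it != index_.end()) return it->second;
        strings_.push_back(s);
        index_[s] = strings_.size() - 1;
        return strings_.size() - 1;
    }
    const std::string& str(std::size_t atom) const { return strings_[atom]; }
};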
Your best bet would be as follows:
Building your structure:
Insert all your strings (char*s) into an array.
Sort the array lexicographically.
Lookup
Use a binary search on your array.
This maintains cache locality, allows for efficient lookup (it will search a space of ~4 billion strings with 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries, because they are complicated, and slower than they appear (especially if you have long strings).
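A minimal sketch of that approach, using std::string rather than raw char*s for brevity (it assumes the strings are all loaded once up front):

#include <algorithm>
#include <string>
#include <vector>

class SortedStringSet {
    std::vector<std::string> strings_;
public:
    explicit SortedStringSet(const std::vector<std::string>& strings)
        : strings_(strings) {
        std::sort(strings_.begin(), strings_.end());   // lexicographic sort, done once
    }
    bool contains(const std::string& s) const {
        return std::binary_search(strings_.begin(), strings_.end(), s);
    }
};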
Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!
Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy mentioned in Raymond Chen's blog would be efficient.
I've written a custom hash function for my custom key in stdext::hash_map and would like to check whether the hasher is good. I'm using the STL supplied with VS 2008. A typical check, as far as I know, is to check the uniformity of the distribution among buckets.
How should I organize such a check correctly? A solution that comes to mind is to modify the STL sources to add a method to hash_map that walks through the buckets and performs the check. Are there any better ways?
Maybe derive from hash_map and add such a method there?
Your best bet might be to just run your hashing algorithm over real-world data and count, in an array of ints, the number of times each hash bucket is hit. (I'm suggesting taking the STL out of the equation here, really.)
If you end up seeing high deviation in your counts with large sets of real-world data, your hashing algorithm is generating lots of collisions when there are plenty of empty (or emptier) buckets available.
Note that 'high deviation' is a relative term. A good hash algorithm behaves like a deterministic random process, and any random process has a chance of generating strange results, so test often, test well, and wherever possible, use your actual problem domain as a source of your tests and your controls.
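In that spirit, here is a small sketch of the tallying step, leaving the container out of it entirely (Key and the hash functor stand in for whatever you're testing, and bucket_count is whatever table size you plan to use):

#include <cstddef>
#include <vector>

template <class Key, class Hash>
std::vector<std::size_t> bucket_histogram(const std::vector<Key>& keys,
                                          Hash hash,
                                          std::size_t bucket_count) {
    std::vector<std::size_t> counts(bucket_count, 0);
    for (std::size_t i = 0; i < keys.size(); ++i)
        ++counts[hash(keys[i]) % bucket_count];   // tally real-world keys per bucket
    return counts;
}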
I'd run one (large) dataset through stl::hash_map. Once done, I'd collect the results for all buckets using the following method:
From hash_map:
size_type elems_in_bucket (size_type __n) const;
Finally, I would compute the standard deviation (SD) of the element-to-bucket distribution.
I'd do the above for different hash functions. Whichever hash function results in minimum SD is the winner (for this dataset).
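For reference, a sketch of that computation written against the standard bucket interface (bucket_count/bucket_size); the SGI-style hash_map quoted above spells the second call elems_in_bucket instead:

#include <cmath>
#include <cstddef>

template <class HashContainer>
double bucket_stddev(const HashContainer& c) {
    const std::size_t n = c.bucket_count();
    const double mean = static_cast<double>(c.size()) / n;
    double sum_sq = 0.0;
    for (std::size_t b = 0; b < n; ++b) {
        const double d = static_cast<double>(c.bucket_size(b)) - mean;
        sum_sq += d * d;
    }
    return std::sqrt(sum_sq / n);   // SD of the element-to-bucket distribution
}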