For example, if I have to store many entries like these in C:
hash["1"]="11"
hash["2"]="12"
hash["11"]="21"
The condition is that you only look for consecutive identical numbers, so only those are kept in the map.
There are many such entries needed to solve a problem. We could use a technique such as summing the ASCII values of the key to find the index and then applying a mod function.
But this approach doesn't guarantee that there will be no collisions, and even a single collision would make the whole answer wrong.
Can this be accomplished in C++ easily?
Please give some suggestions/hints.
Thanks in Advance
Regarding hash functions: no hash function can guarantee that no collisions will occur. They can only try to minimize the chance of collisions in typical workloads. One hash function which is often used is the Bernstein hash function. You can find a comparison of different string hash functions (including Bernstein) here
In C++ you can simply use the standard map (see here) template, which does not use a hash map but is typically implemented using red-black trees. The C++11 standard has an unordered_map (see here), which is implemented using a hash table.
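For reference, the Bernstein hash (often called djb2) is usually written along these lines. This is only a minimal sketch: the function name bernstein_hash is made up for illustration, and the constants 5381 and 33 are the commonly quoted ones.

```cpp
#include <cstddef>
#include <string>

// Bernstein (djb2) string hash: start at 5381, then h = h * 33 + c per byte.
// The name `bernstein_hash` is illustrative, not taken from any library.
std::size_t bernstein_hash(const std::string& s) {
    std::size_t h = 5381;
    for (unsigned char c : s) {
        h = h * 33 + c;  // wraps around on overflow, which is intended
    }
    return h;
}
```

To get a table index you would still reduce the result with something like bernstein_hash(key) % table_size, and different keys can still land on the same index.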
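As a rough sketch of what that looks like for the entries in the question (make_table and lookup are hypothetical helper names; swapping std::unordered_map for std::map needs no other change):

```cpp
#include <string>
#include <unordered_map>

// Builds the table from the question. `make_table` is a hypothetical helper.
std::unordered_map<std::string, std::string> make_table() {
    std::unordered_map<std::string, std::string> hash;
    hash["1"] = "11";
    hash["2"] = "12";
    hash["11"] = "21";
    return hash;
}

// Returns the stored value, or an empty string if the key is absent.
std::string lookup(const std::unordered_map<std::string, std::string>& table,
                   const std::string& key) {
    auto it = table.find(key);
    return it == table.end() ? std::string() : it->second;
}
```

The container compares full keys when two of them hash to the same bucket, so a collision costs a little time but never returns the wrong value.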
Related
I have a very large array storing some numbers. My task is to find efficiently whether a particular number exists in the array. Which algorithm and data structure should I go with?
A few assumptions:
Each number in the array is unique.
I am not concerned about where the data is found in the array; I just want to return true if the data is found, else false.
I will be using C++ as the programming language.
Please suggest.
Thanks
Constant time lookup with unordered_set.
Bitsets and similar structures are also an option. It depends on exactly how large "very large" is, and on how sparse the stored values are compared to the range they are drawn from.
It seems unordered_set is suitable for your requirement.
PS: Please remember that all elements in this set are immutable.
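A bitset-style index could look roughly like this. It assumes the values are non-negative and have a known upper bound, which the question does not state; the class name BitIndex is made up.

```cpp
#include <cstddef>
#include <vector>

// One bit per possible value in [0, max_value].
// Assumption (mine): values are non-negative with a known upper bound.
class BitIndex {
public:
    explicit BitIndex(std::size_t max_value) : bits_(max_value + 1, false) {}

    // Values above the bound are silently ignored in this sketch.
    void insert(std::size_t v) {
        if (v < bits_.size()) bits_[v] = true;
    }

    bool contains(std::size_t v) const {
        return v < bits_.size() && bits_[v];
    }

private:
    std::vector<bool> bits_;  // packed, roughly one bit per element
};
```

This only pays off when the range is not much larger than the number of stored values; for sparse data a hash set is likely the better fit.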
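A minimal sketch of the unordered_set approach, assuming the numbers fit in an int (build_index and contains are hypothetical names):

```cpp
#include <unordered_set>
#include <vector>

// Copies the array into a hash set once; each later query is O(1) on average.
std::unordered_set<int> build_index(const std::vector<int>& numbers) {
    return std::unordered_set<int>(numbers.begin(), numbers.end());
}

// True if `value` was present in the original array.
bool contains(const std::unordered_set<int>& index, int value) {
    return index.find(value) != index.end();
}
```

Building the index costs one pass over the array, so this is worthwhile when there are many lookups against the same data.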
A well-known way to check whether an element (number) is a member of a set (array) is to use a Bloom filter, with the caveat that it can report false positives. It works well if the set changes over time or if there are set operations among sets. Bloom filters are easy to implement, and good implementations are available.
If the set is static (i.e. never changes), a good approach is to use a perfect hash function. It takes time to build, but will outperform the usual hash function provided by std::unordered_set.
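To give an idea of the shape of a Bloom filter, here is a small sketch, not a tuned implementation. The class name, the double-hashing scheme and the mixing function (the splitmix64 finalizer) are my own choices. Keep in mind that a Bloom filter never gives false negatives but can give false positives, so a "yes" means "probably present".

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal Bloom filter over 64-bit integers. `bits` must be greater than 0.
class BloomFilter {
public:
    BloomFilter(std::size_t bits, std::size_t hashes)
        : bits_(bits, false), k_(hashes) {}

    void insert(std::uint64_t x) {
        for (std::size_t i = 0; i < k_; ++i) bits_[index(x, i)] = true;
    }

    // false: definitely absent. true: probably present.
    bool maybe_contains(std::uint64_t x) const {
        for (std::size_t i = 0; i < k_; ++i) {
            if (!bits_[index(x, i)]) return false;
        }
        return true;
    }

private:
    // splitmix64 finalizer, used here only as a cheap bit mixer.
    static std::uint64_t mix(std::uint64_t z) {
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }

    // Double hashing: the i-th probe position is h1 + i * h2.
    std::size_t index(std::uint64_t x, std::size_t i) const {
        std::uint64_t h1 = mix(x);
        std::uint64_t h2 = mix(x ^ 0x9e3779b97f4a7c15ULL) | 1;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

If an exact answer is required, the filter is typically used as a fast pre-check in front of an exact structure such as unordered_set.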
I want to condense a list of about 100 non-negative 32-bit integers into a single integer. Ideally the resulting integer is always unique, but a few relatively rare collisions are acceptable. How can I do this?
I'm writing a puzzle-solver. Part of my search algorithm is avoiding re-exploring puzzle states already seen. I'll use the integer generated from the list as the key into a statesAlreadySeen table. Currently I'm using strings as the keys. However I've seen noticeable performance improvements when going from string keys to integer keys in a map<,> hence I'd like to switch.
Edit: Thanks for the unordered map suggestions! However I'm still curious about an actual hashing function. IIRC there's a simple function involving basic bit manipulation and xoring. Would be great to see this and have some general understanding of the collision probabilities.
As suggested in the comments, you need to use a hash function. Most easily accomplished with boost::hash_range:
#include <boost/functional/hash.hpp>
#include <cstddef>
#include <vector>
// Hash the whole sequence; take the vector by const reference to avoid a copy.
std::size_t vectorhash(const std::vector<int>& f) {
    return boost::hash_range(f.begin(), f.end());
}
Having said that, if you don't have a real need to keep the states ordered (and I can't see why you would), I would go with us2012's solution of keeping string keys and switching to unordered_map - thus letting the container take care of the hashing.
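For the "basic bit manipulation and xoring" part of the question, a Boost-free sketch could look like the following. It folds each element into a running seed using the shift-and-xor step popularised by boost::hash_combine; the name hash_state is made up, and this is an illustration rather than a vetted hash.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Order-sensitive hash of a list of 32-bit integers.
// `hash_state` is a hypothetical name; the mixing step follows the
// well-known hash_combine pattern.
std::size_t hash_state(const std::vector<std::uint32_t>& state) {
    std::size_t seed = state.size();
    for (std::uint32_t v : state) {
        seed ^= static_cast<std::size_t>(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
}
```

On collision probabilities: if a b-bit hash behaved like a uniformly random function, n distinct states would collide with probability of roughly n^2 / 2^(b+1) while that value is small. A simple mixer like this one may do worse on structured input, so if the result is the only key in statesAlreadySeen, a collision silently drops a state.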
People have asked similar questions about the efficiency of various data structures but none I have read are totally applicable to my scenario so I wondered if people had suggestions for one that was tailored to satisfy the following criteria efficiently:
Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: *The key is a 32-bit uint.*
The elements are all unique and therefore can be thought of as a set.
The only operations required are adding and getting, not deletion. These need to be quick, as they will be used several hundred thousand times in a typical run!
The order in which elements are kept is irrelevant.
Speed is more important than memory consumption... though it can't be too greedy!
I am developing for a company that will use the program commercially so any third-party data structures should come with no copyright protection or anything, but if the STL has a data structure that will do the job efficiently then that would be perfect.
I know there are countless Hashmap/Dictionary style C++ data structures with implementations that are built to satisfy different criteria so if someone can suggest one ideal for this situation then that would be greatly appreciated.
Many thanks
Edit:
I found this passage on SO that seems to suggest unordered_map would be good?
hash_map and unordered_map are generally implemented with hash tables.
Thus the order is not maintained. unordered_map insert/delete/query
will be O(1) (constant time) where map will be O(log n) where n is the
number of items in the data structure. So unordered_map is faster, and
if you don't care about the order of the items should be preferred
over map. Sometimes you want to maintain order (ordered by the key)
and for that map would be the choice.
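A sketch of how that might look with 32-bit keys and only add/get operations. The value type std::string is a placeholder, since the question does not say what the elements are, and the helper names are made up.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Placeholder element type; substitute whatever is actually stored.
using Store = std::unordered_map<std::uint32_t, std::string>;

// Reserving up front avoids rehashing while the table fills.
Store make_store(std::size_t expected_elements) {
    Store s;
    s.reserve(expected_elements);
    return s;
}

// Returns false if the key was already present (the old value is kept).
bool add_value(Store& s, std::uint32_t key, std::string value) {
    return s.emplace(key, std::move(value)).second;
}

// Returns a pointer to the stored value, or nullptr if the key is absent.
const std::string* find_value(const Store& s, std::uint32_t key) {
    auto it = s.find(key);
    return it == s.end() ? nullptr : &it->second;
}
```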
It looks like a prefix tree (with an element at the end node of each key) also fits this scenario. It's damn fast, potentially even faster than a hash map, because no hash value is calculated and getting a value is purely O(n), where n is the key length. It's a bit memory hungry, but common prefixes of keys share the same node path.
EDIT: I assume the keys are strings, not simple values like integers.
As for ready-made solutions, I'd recommend google::dense_hash_map. It is really fast, especially for numeric keys. You'll have to decide on a specific key that will be reserved as the "empty_key". Moreover, here is a really nice comparison of different hash-map implementations.
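A minimal sketch of such a prefix tree for string keys, using an int as a stand-in value type (all names are made up). Children are kept in a std::map for brevity, so each step costs a small lookup rather than a direct array index.

```cpp
#include <map>
#include <memory>
#include <string>

// One node per character along a key; keys with a common prefix share nodes.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool has_value = false;
    int value = 0;  // placeholder value type
};

class Trie {
public:
    void put(const std::string& key, int value) {
        TrieNode* node = &root_;
        for (char c : key) {
            std::unique_ptr<TrieNode>& child = node->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->has_value = true;
        node->value = value;
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* get(const std::string& key) const {
        const TrieNode* node = &root_;
        for (char c : key) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return nullptr;
            node = it->second.get();
        }
        return node->has_value ? &node->value : nullptr;
    }

private:
    TrieNode root_;
};
```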
An excerpt:
Library          Linux-intCPU (sec)   Linux-strCPU (sec)   Linux PeakMem (MB)
glib             3.490                4.720                24.968
ghthash          3.260                3.460                61.232
CC’s hashtable   3.040                4.050                129.020
TR1              1.750                3.300                28.648
STL hash_set     2.070                3.430                25.764
google-sparse    2.560                6.930                5.42/8.54
google-dense     0.550                2.820                24.7/49.3
khash (C++)      1.100                2.900                6.88/13.1
khash (C)        1.140                2.940                6.91/13.1
STL set (RB)     7.840                18.620               29.388
kbtree (C)       4.260                17.620               4.86/9.59
NP’s splaytree   11.180               27.610               19.024
However, when setting a "deleted_key", this map can also perform deletions. So maybe it'll be possible to create a custom solution that is even more efficient. But apart from that minor point, any hash-map should exactly suit your needs (note that "map" is an ordered tree-map and thus slower).
What you need definitely sounds like a hash set. C++ has this as either std::tr1::unordered_set or in Boost.Unordered.
P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.
It sounds like std::unordered_set would fit the bill, but without
knowing more about the key, it's difficult to say. I'm curious about
how you can guarantee that there will be no possibility of collisions:
this implies a small (less than the size of the table), finite set of
keys. If this is the case, it may be more efficient to map the keys to
a small int, and use std::vector (with empty slots for the entries not
present).
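The vector-with-empty-slots idea could be sketched like this, assuming the keys have already been mapped to small integers in [0, key_count) (that mapping is not shown, and DirectTable is a made-up name; std::optional needs C++17):

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Direct-address table: slot i holds the element for key i, or is empty.
template <typename T>
class DirectTable {
public:
    explicit DirectTable(std::size_t key_count) : slots_(key_count) {}

    // Throws std::out_of_range if the key is outside the table.
    void add(std::size_t key, T value) {
        slots_.at(key) = std::move(value);
    }

    // Returns a pointer to the element, or nullptr for an empty slot.
    const T* get(std::size_t key) const {
        return (key < slots_.size() && slots_[key]) ? &*slots_[key] : nullptr;
    }

private:
    std::vector<std::optional<T>> slots_;
};
```

There is no hashing and no collision handling at all here, which is why it can beat a hash table when the key range is small enough to allocate up front.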
What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that, and is also available in Boost/TR1/C++0x.
Is there a way to write a simple hashtable with strings as the keys and frequencies as the values, so that there are NO collisions? There will be no removal from the hashtable, and if the object already exists in the hashtable, then its frequency is just updated (the counts are added together).
I was thinking there might be an algorithm that can compute a unique number from the string, which would be used as the index.
Yes, I am avoiding the use of all STL constructs, including unordered_map.
You can use any perfect hash generator like gperf
See here for a list: http://en.wikipedia.org/wiki/Perfect_hash_function
PS. You'd still possibly want to use a map instead of flat array/vector in case the mapped domain gets too big/sparse
It really depends on what you mean by 'simple'.
The std::map is a fairly simple class. Still, it uses a red-black tree with all of the insertion, deletion, and balancing nicely hidden away, and it is templated to handle any orderable type as a key and any type as the value. Most map classes use a similar implementation, and avoid any sort of hashing functionality.
Hashing without collisions is not a trivial matter whatsoever. Perhaps the simplest method is Pearson Hashing.
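For the curious, Pearson hashing is only a table lookup per byte. The sketch below builds the 256-entry permutation table by shuffling with an arbitrary fixed seed, which is my choice for illustration. With a table like this the hash is not collision-free; it becomes a perfect hash only if the permutation is chosen specifically for a known, small set of keys.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <numeric>
#include <random>
#include <string>

// Builds a permutation of 0..255. The seed value is arbitrary.
std::array<std::uint8_t, 256> make_pearson_table(unsigned seed = 12345) {
    std::array<std::uint8_t, 256> table{};
    std::iota(table.begin(), table.end(), 0);
    std::mt19937 rng(seed);
    std::shuffle(table.begin(), table.end(), rng);
    return table;
}

// Pearson hash: h = T[h XOR c] for each byte c, giving an 8-bit result.
std::uint8_t pearson_hash(const std::string& s,
                          const std::array<std::uint8_t, 256>& table) {
    std::uint8_t h = 0;
    for (unsigned char c : s) {
        h = table[h ^ c];
    }
    return h;
}
```

Since the output has only 256 possible values, any set of more than 256 strings must contain a collision.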
It seems like you have 3 choices:
Implement your own perfect hashing class. This would be a pretty good sized class with a lot of functionality and some decently complex algorithms. I don't think this is simple.
Download and use a perfect hashing library that is already out there. Of course, you have to worry about deployability.
Use STL's map class. It's embedded, well-documented, easy to use, type-flexible, and completely cross-platform. This seems like the 'simplest' solution.
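To show how little code the third option needs, here is a sketch of the frequency table with std::map (FrequencyTable and add_count are made-up names):

```cpp
#include <map>
#include <string>

// Word -> frequency. std::map compares whole keys, so there is nothing
// to collide in the first place.
using FrequencyTable = std::map<std::string, long>;

// Adds `count` to the word's frequency; a new word starts from zero.
void add_count(FrequencyTable& table, const std::string& word, long count = 1) {
    table[word] += count;
}
```

Lookups are O(log n) string comparisons rather than O(1), which may or may not matter for your data size.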
If I may ask, why are you avoiding the STL?
If the set of possible strings is known beforehand, you can use a perfect hash function generator to do this. But otherwise, what you ask is impossible.
Now, it IS possible to make the likelihood of collisions extremely low by using a good hash function and making sure your table is huge. You basically need a big enough table to make the likelihood of invoking the Birthday Paradox low enough to suit you. Then you just use n bits of output from SHA-1, and 2^n will be your table size.
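To put rough numbers on the birthday effect, the usual approximation for k keys hashed uniformly into 2^n slots is p = 1 - exp(-k(k-1) / (2 * 2^n)). A small sketch (the function name is made up, and it assumes the hash output is uniformly distributed):

```cpp
#include <cmath>

// Approximate probability that at least two of `keys` items collide when
// hashed uniformly into 2^bits slots (standard birthday approximation).
double collision_probability(double keys, int bits) {
    double slots = std::ldexp(1.0, bits);  // 2^bits
    double x = keys * (keys - 1.0) / (2.0 * slots);
    return -std::expm1(-x);                // 1 - exp(-x), accurate for small x
}
```

For example, with 32 bits of hash output the collision probability already reaches about one half at roughly 77,000 keys, so "extremely low" needs far more bits than the key count alone suggests.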
I'm also wondering if maybe you could use a Bloom filter and have an actual counter instead of bits. Keep a list of all the words you've stuffed into the bloom filter and what entries they've incremented (which will be the same each time) and you have yourself a gigantic linear function that you might be able to solve to get all the individual counts back out again.
Hi, I want to use a hashmap that maps the words in a dictionary to their indices in the dictionary.
What would be the fastest hash algorithm for this?
Thanks!
At the bottom of this page there is a section A Note on Hash Functions with some information which you might find useful.
For convenience, I'll just replicate some links here:
Bob Jenkins
Paul Hsieh
Fowler/Noll/Vo (FNV)
MurmurHash
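As an example of how small these functions can be, here is a sketch of the 32-bit FNV-1a variant from the list above, using the published offset basis and prime (the function name is mine):

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV-1a: xor in each byte, then multiply by the FNV prime.
std::uint32_t fnv1a_32(const std::string& s) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}
```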
There are many different hashing algorithms, of varying efficiency, but the most important issue is that the algorithm scatters the items fairly uniformly across the different hash buckets.
However, you may as well assume that the Microsoft engineers/library engineers have done a decent job of writing an efficient and effective hash algorithm, and just use the built-in libraries/classes.
The fastest hash function will be
template <class T>
size_t hash(T key) {
return 0;
}
However, though the hashing will be mighty fast, you will suffer performance elsewhere. What you want is to try several hashing algorithms on the actual data you expect to use and see which one gives you the best performance in aggregate, and only if the hashing or lookup is even a performance bottleneck. Until then, go with something handy. MD5 is pretty widely available.
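To see where the cost goes, a quick sketch: with a constant hash every element lands in the same bucket, so each lookup degrades to a linear scan. ZeroHash and largest_bucket are made-up names for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_set>

// The "fastest" hash: constant, so every key collides with every other key.
struct ZeroHash {
    std::size_t operator()(int) const { return 0; }
};

// Inserts 0..n-1 using the given hash and returns the largest bucket size.
template <class Hash>
std::size_t largest_bucket(int n) {
    std::unordered_set<int, Hash> s;
    for (int i = 0; i < n; ++i) s.insert(i);
    std::size_t worst = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        worst = std::max(worst, s.bucket_size(b));
    }
    return worst;
}
```

Comparing largest_bucket<ZeroHash>(n) with largest_bucket<std::hash<int>>(n) on your own key distribution is a cheap first check before timing anything.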
Have you tried just using the STL hash_map and seeing if it serves your needs before rolling anything more complex?
http://www.sgi.com/tech/stl/hash_map.html
boost has a hash function that you can reuse for your own data (predefined for common types). That'd probably work well & fast enough if your needs aren't special.
What is your use case? A radix search tree (trie) might be more suitable than a hash if you're mapping from string to integer. Tries have the advantage of reducing key comparisons for variable length keys. (e.g., strings)
Even a binary search tree (e.g., STL's map) might be superior to a hash based container in terms of memory use and number of key comparisons. A hash is more efficient only if you have very few collisions.