std::hash for a long is the identity function. This can cause poor hash distributions (e.g., if all the values are even, all hashes will also be even, etc.). Is there a better way to hash a long?
if all the values are even, all hashes will also be even
And that's fine, because they're not used as-is. Imagine if you allocated 4 billion buckets for one dictionary; it would be faster to just implement a linear search. Much, much faster.
Instead, the hash is reduced modulo a bucket count chosen to be co-prime with likely input patterns (and usually a straight-up prime number), for the very reason you mention.
All a hash has to do is be as different as possible for different input values (and when it can't, at least be different for the most common or close values), and an identity function for a long (which is, I'm assuming, the same size as your hash) is the perfect candidate.
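If your key values really are pathologically patterned and you don't trust the bucket count to compensate, you can always plug in your own mixing hasher. A minimal sketch of my own (not anything from the standard library), using the well-known splitmix64 finalizer constants:

#include <cstdint>
#include <unordered_set>

// Hypothetical mixing hasher for 64-bit keys, based on the splitmix64 finalizer.
// Only worth considering if your key patterns interact badly with the bucket count.
struct MixHash {
    std::size_t operator()(long x) const noexcept {
        std::uint64_t z = static_cast<std::uint64_t>(x) + 0x9e3779b97f4a7c15ULL;
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return static_cast<std::size_t>(z ^ (z >> 31));
    }
};

// Usage: std::unordered_set<long, MixHash> evens;  // well-spread buckets even for all-even keys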
Related
I was wondering: since unordered_set uses hashing, shouldn't it be faster for integers than for strings? The same question applies to unordered_map. I found no definite answer on the web. It would be great if someone could clarify this.
Is there any difference in performance of unordered_set (C++) in the case of strings vs. integers?
There can be. The language specification doesn't have guarantees one way or the other.
You can verify whether this is the case for your program on your target system by measuring the performance.
If you're considering whether to use string itself as the key, or a separately hashed string (i.e. integer), then technically separate hashing is more expensive since the integer would be hashed again. That said, hashing an integer is trivial (I think it may be the identity function), so this might have no noticeable effect.
Separate hashing + storing the integer does have a potential advantage: you can pre-hash the string keys once and reuse the hashed integer key, while a map with string keys requires the key to be re-hashed on every lookup. Whether this is useful in your case depends on what you're going to do with the map.
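For illustration, a rough sketch of that pre-hashing idea (the names are mine; note that a real version must still handle the rare case of two different strings hashing to the same value):

#include <string>
#include <unordered_map>

int main() {
    std::hash<std::string> hasher;
    std::unordered_map<std::size_t, int> counts;        // keyed by the pre-computed hash
    std::size_t key = hasher("some long string key");   // hash the string once...
    ++counts[key];                                       // ...then reuse the integer key
    ++counts[key];                                       // no re-hashing of the string here
}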
The abstract answer would be "it depends on the implementation and other details", like the sizes of the keys and containers. The standard doesn't specify anything about strings vs. ints, so you should not expect a globally valid answer.
On popular platforms like gcc/clang on x86/x86_64, your guess seems about right. In my experience, replacing string keys in maps with ints or pointers has yielded substantial performance wins.
Still, there might be specific circumstances where strings beat ints even on the platforms mentioned.
Hashing functions aren't specified by the C++ Standard.
That said, GCC, Clang and Visual C++ all use an identity hash for integers - meaning the std::hash<> specialised for integer types returns its input. Visual C++ uses power-of-2 bucket counts, while GCC uses prime numbers. Consequently, certain inputs are extremely collision-prone on Visual C++, e.g. pointers to objects with N-byte alignment - where N is largish - that have been converted to numbers will all collide at buckets 0, N, 2N, 3N etc., with all the buckets in between storing no data. On the other hand, if the integers are random enough that they happen to distribute well across the buckets without excessive collisions (which is much more likely with GCC's prime bucket count), then an identity hash saves the CPU time that would otherwise be spent processing them further.
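To make the collision-prone case concrete, here is a tiny sketch (the bucket count and key stride are made-up illustrative values) showing how an identity hash plus a power-of-2 bucket count maps aligned keys onto a single bucket:

#include <cstdio>

int main() {
    const unsigned long buckets = 1024;                   // assumed power-of-2 bucket count
    for (unsigned long key = 0; key < 8 * 4096; key += 4096) {
        // identity hash, then reduction modulo the bucket count
        std::printf("key %lu -> bucket %lu\n", key, key % buckets);
    }
    // every key lands in bucket 0, because 4096 is a multiple of 1024
}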
GCC uses MURMUR32 hashing for strings, while Visual C++ does some simple xoring and shifting on 10 characters roughly evenly sampled along the string (so GCC is slower but has massively better hash quality, particularly for things like same-length directory/filename paths with common prefixes and just an incrementing code at the end, where Visual C++ may only incorporate a single differing character into the hash). Compared to a string storing its text inside itself (a technique known as Short String Optimisation, or SSO) or an integer, any string storing longer text in dynamically allocated memory ("heap") will depend on at least one extra cache line to reach the text (and on modern x86 architectures, an extra cache miss may be incurred for each 64-byte chunk of the string accessed during hashing).
It is possible to create an object to store a string and a hash - calculated once - but that's not exactly what the question asks about, and after finding a match on hash value you still need to compare the entire string content to be certain of a match.
In conclusion, if you use the default identity hashing with collision-prone keys on Visual C++, it may be slower than using strings (if the strings hash with fewer collisions, which is far from certain). But in most cases, using integer keys will be faster.
If you really care, always benchmark on your own system and dataset.
The link below mentions the chance of collisions, but I am trying to use it for finding duplicate entries:
http://www.cplusplus.com/reference/functional/hash/
I am using std::hash<std::string> and storing the return value in a std::unordered_set. If emplace fails, I mark the string as a duplicate.
Hashes are generally functions from a large space of values into a small space of values, e.g. from the space of all strings to 64-bit integers. There are a lot more strings than 64-bit integers, so obviously multiple strings can have the same hash. A good hash function is such that there's no simple rule relating strings with the same hash value.
So, when we want to use hashes to find duplicate strings (or duplicate anything), it's always a two-phase process (at least):
Look for strings with identical hash (i.e. locate the "hash bucket" for your string)
Do a character-by-character comparison of your string with other strings having the same hash.
std::unordered_set does this - and never mind the specifics. Note that it does this for you, so it's redundant for you to hash yourself, then store the result in an std::unordered_set.
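A minimal sketch of letting the container do that work for you (illustrative only):

#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> seen;                 // hashes and compares the strings for you
    for (const char* s : {"alpha", "beta", "alpha"}) {
        if (!seen.insert(s).second)                       // insertion failed -> duplicate
            std::cout << s << " is a duplicate\n";
    }
}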
Finally, note that there are other features one could use for initial duplicate screening - or for searching among the same-hash values. For example, string length: Before comparing two strings character-by-character, you check their lengths (which you should be able to access without actually iterating the strings); different lengths -> non-equal strings.
Yes, it is possible that two different strings will share the same hash. Simply put, let's imagine you have a hash stored in an 8-bit type (unsigned char).
That is 2^8 = 256 possible values. That means you can only have 256 unique hashes of arbitrary inputs.
Since you can definitely create more than 256 different strings, there is no way the hash would be unique for all possible strings.
std::size_t is typically a 64-bit type (on 64-bit platforms), so if you used it as storage for the hash value, you'd have 2^64 possible hashes, which is marginally more than 256 possible unique hashes, but it's still not enough to differentiate between all the possible strings you can create.
You just can't store an entire book in only 64 bits.
Yes it can return the same result for different strings. This is a natural consequence of reducing an infinite range of possibilities to a single 64-bit number.
There exist things called "perfect hash functions", which are guaranteed to return unique results. However, this is only guaranteed for a known set of inputs. An unknown input from outside that set might produce a matching hash value. That possibility can be reduced by using a Bloom filter.
However, at some point with all these hash calculations the program would have been better off doing simple string comparisons in an unsorted linear array. Who cares if the operation is O(1)+C if C is ridiculously big.
Yes, std::hash can return the same result for different std::strings.
Bucket creation differs between compilers.
A compiler-specific discussion of the implementation can be found at this link:
hashing and rehashing for std::unordered_set
I have very long strings that I need to compare for equality. Since comparing them char by char is very time consuming, I'd like to create a hash for each string.
I'd like the generated hash code to be unique (or the chance that two strings generate the same hash to be very small). I think creating an int from a string as the hash is not strong enough to avoid having two different strings with the same hash code, so I am looking for a string hash code.
Am I right in the above assumption?
To clarify, assume that I have a string of, say, 1K length and I create a hash code of 10 chars; then comparing hash codes speeds things up by 100 times.
The question that I have is how to create such a hash code in C++?
I am developing on Windows using Visual Studio 2012.
To be useful in this case, the hash code must be quick to calculate. Using anything larger than the largest words supported by the hardware (usually 64 bits) may be counterproductive. Still, you can give it a try. I've found the following to work fairly well:
unsigned long long
hash( std::string const& s )
{
    unsigned long long results = 12345; // anything but 0 is probably OK.
    for ( auto current = s.begin(); current != s.end(); ++ current ) {
        results = 127 * results + static_cast<unsigned char>( *current );
    }
    return results;
}
Using a hash like this will probably not be advantageous, however, unless most of the comparisons are with strings that aren't equal, but have long common initial sequences. Remember that if the hashes are equal, you still have to compare the strings, and that comparison only has to go until the first characters which aren't equal. (In fact, most of the comparison functions I've seen start by comparing length, and only compare characters if the strings are of equal length.)
There are a lot of hashing algorithms available which you may use.
If you want to implement one yourself, then a simple one could be to take the ASCII value of each character, offset it so that a = 1, b = 2, ..., and multiply it by the character's index in the string. Keep adding these values and store the sum as the hash value for that string.
For example, the hash value for "abc" would be:
HASH("abc") = 1*1 + 2*2 + 3*3 = 14;
The probability of collision lowers as the string length increases (considering your strings will be lengthy).
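A direct transcription of this scheme as a rough sketch (the function name is mine; note that this is a much weaker hash than established ones like djb2 or FNV):

#include <cstddef>
#include <string>

unsigned long simplePositionalHash(const std::string& s) {
    unsigned long h = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        h += (i + 1) * static_cast<unsigned long>(s[i] - 'a' + 1);  // a = 1, b = 2, ...
    return h;
}

// simplePositionalHash("abc") == 1*1 + 2*2 + 3*3 == 14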
There are many known hash algorithms available. For example MD5, SHA1, etc. You should not need to implement your own algorithm but use one of the available ones. Use the search engine of your choice to find implementations like this one.
It really depends what your hard requirements are. If you have hard requirements such as "search may never take more than so and so long", then it's possible that no solution is applicable. If your intent is simply to speed up a large number of searches, then a simple, short hash will do fine.
While it is generally true that hashing a 1000-character string to an integer (a single 32-bit or 64-bit number) can, and eventually will, produce collisions, this is not something to be concerned about.
A 10-character hash will also produce collisions. This is a necessary consequence of the fact that 1000 > 10. For every possible 10-character hash value there exist, on average, 256^990 different 1000-character strings (assuming 8-bit characters)¹.
The important question is whether you will actually see collisions, how often you will see them, and whether it matters at all. Whether (or how likely) you see a collision depends not on the length of the strings, but on the number of distinct strings.
If you hash 77,100 strings (longer than 4 characters) using a 32-bit hash, you have roughly a 50% chance that at least two of them collide. At 25,000 strings, the likelihood is only somewhere around 5-6%. At 1000 strings, the likelihood is approximately 0.1%.
Note that when I say "50% at 77,100 strings", this does not mean that your chance of actually encountering a collision is that high. It's merely the chance of having two strings with the same hash value. Unless that's the case for the majority of strings, the chance of actually hitting one is again a lot lower.
Which means no more and no less than that for most use cases, it simply doesn't matter. Unless you want to hash hundreds of thousands of strings, stop worrying now and use a 32-bit hash.
Otherwise, unless you want to hash billions of strings, stop worrying here and use a 64-bit hash.
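(For reference, the 50%-at-77,100 figure quoted above comes from the standard birthday approximation p ~ 1 - exp(-n(n-1) / (2 * 2^32)); here is a quick sketch of my own, not part of the original answer, that reproduces it:)

#include <cmath>
#include <cstdio>

int main() {
    const double n = 77100.0;                             // number of distinct keys
    const double space = 4294967296.0;                    // 2^32 possible hash values
    const double p = 1.0 - std::exp(-n * (n - 1.0) / (2.0 * space));
    std::printf("P(at least one collision) ~= %.3f\n", p);  // roughly 0.5
}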
Thing is, you must be prepared to handle collisions in any case because as long as you have 2 strings, the likelihood for a collision is never exactly zero. Even hashing only 2 or 3 1000-character strings into a 500-byte hash could in principle have a collision (very unlikely but possible).
Which means you must do a string comparison if the hash matches in either case, no matter how long (or how good or bad) your hash is.
If collisions don't happen every time, they're entirely irrelevant. If you have a lot of collisions in your table and encounter one, say, on 1 in 10,000 lookups (which is a lot!), it has no practical impact. Yes, you will have to do a useless string comparison once in 10,000 lookups, but the other 9,999 work by comparing a single integer alone. Unless you have a hard realtime requirement, the measurable impact is exactly zero.
Even if you totally screw up and encounter a collision on every 5th search (a pretty disastrous case, which would mean that roughly 800 million string pairs collide, which is only possible at all with at least 1.6 billion strings), this still means that 4 out of 5 searches don't hit a collision, so you still discard 80% of non-matches without doing a comparison.
On the other hand, generating a 10-character hash is cumbersome and slow, and you are likely to create a hash function that has more collisions (because of bad design) than a readily existing 32- or 64-bit hash.
Cryptographic hash functions are certainly better, but they run slower than their non-cryptographic counterparts too, and the storage required to hold their 16- or 32-byte hash values is much larger, too (at virtually no benefit to most people). This is a space/time tradeoff.
Personally, I would just use something like djb2, which can be implemented in 3 lines of C code, works well, and runs very fast. There exist of course many other hash functions that you could use, but I like djb2 for its simplicity.
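For completeness, the classic djb2 loop looks roughly like this (written here in C++ for consistency with the rest of the thread):

#include <string>

unsigned long djb2(const std::string& s) {
    unsigned long hash = 5381;                            // djb2 seed
    for (unsigned char c : s)
        hash = hash * 33 + c;                             // often written ((hash << 5) + hash) + c
    return hash;
}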
Funnily enough, after reading James Kanze's answer, the posted code seems to be a variation of djb2, only with a different seed and a different multiplier (djb2 uses 5381 and 33, respectively).
In the same answer, the remark about comparing string lengths first is a good tip as well. It's noteworthy that you can consider a string's length a form of "hash function", too (albeit a rather weak one, but one that often comes "for free").
¹ However, strings are not "random binary garbage" the way hash values are. They are structured, low-entropy data. In that respect, the comparison does not really hold true.
Well, I would first compare string lengths. If they match, then I'd start comparing using an algorithm that uses random positions to test character equality, and stop at the first difference.
The random positions would be obtained from a stringLength-sized vector filled with random ints ranging from 0 to stringLength-1. I haven't measured this method, though; it's just an idea. But it would spare you any concerns about hash collisions while reducing comparison times.
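A sketch of that idea (untested for performance, as the answer itself notes; the names are mine):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <vector>

bool equalRandomOrder(const std::string& a, const std::string& b) {
    if (a.size() != b.size())                             // cheap length check first
        return false;
    std::vector<std::size_t> idx(a.size());
    std::iota(idx.begin(), idx.end(), 0);                 // indices 0 .. length-1
    std::shuffle(idx.begin(), idx.end(), std::mt19937{std::random_device{}()});
    for (std::size_t i : idx)
        if (a[i] != b[i])                                 // stop at the first difference
            return false;
    return true;                                          // every position matched
}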
Say I decided that my hasher for a hash_set of integers is the integer itself. And also say my integer range is very wide and sparse: 1-20, then 1000-1200, then 10000-12000.
e.g.: 1, 2, 5, 7, 1111, 1102, 1000, 10003, 10005
Wouldn't that be a very bad hashing function? How would the data be stored by hash_set in this case, say, in the GCC implementation, if anyone knows?
Thanks
EDIT:
Thank you for both replies. I should note that I have already specified my hasher to return the input value, e.g. the hash for 1001 would be 1001. So I am asking whether the implementation would take the liberty of doing another round of hashing, or whether it would see 1001 and grow the bucket array to 1001 entries.
Even if your data is clumped in certain ranges, typically only the least significant bits of each hash value will be used to pick a bucket. This means that if the low-order bits (representing, say, 0-127) are evenly distributed, your hash function will still behave well regardless of how the full hash values are distributed. It does mean, however, that if your values are all multiples of a certain binary value, e.g. eight, then the lower bits won't be evenly distributed and the values will clump in the hash table, causing excessive chaining and slowing down operations.
The hash table would start small, occasionally rehashing to grow when the load factor gets high enough. Just because the hash value is 12000 does not mean there would be 12000 buckets, of course--the hash_set will do something like "mod" the hash function's output to make it fit within the number of buckets.
The identity function you describe is not a bad hash function for many hash table implementations (including GCC's). In fact it is what many people use, and obviously it is efficient. What it would be a bad example of is a cryptographic hash function, but that has a different purpose.
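A small sketch of what that looks like in practice with unordered_set (the exact bucket count is implementation-defined, but it stays far below the largest key):

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s{1, 2, 5, 7, 1111, 1102, 1000, 10003, 10005};
    std::cout << "bucket_count: " << s.bucket_count() << "\n";   // small, nowhere near 12000
    for (int v : {1111, 10005})
        std::cout << v << " -> bucket " << s.bucket(v) << "\n";  // hash reduced mod bucket_count
}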
I'm writing a program right now which produces four unsigned 32-bit integers as output from a certain function. I'm wanting to hash these four integers, so I can compare the output of this function to future outputs.
I'm having trouble writing a decent hashing function though. When I originally wrote this code, I threw in a simple addition of each of the four integers, which I knew would not suffice. I've tried several other techniques, such as shifting and adding, to no avail. I get a hash, but it's of poor quality, and the function generates a ton of collisions.
The hash output can be either a 32-bit or 64-bit integer. The function in question generates many billions of hashes, so collisions are a real problem here, and I'm willing to use a larger variable to ensure that there are as few collisions as possible.
Can anyone help me figure out how to write a quality hash function?
Why don't you store the four integers in a suitable data structure and compare them all? The benefit of hashing them in this case appears dubious to me, unless storage is a problem.
If storage is the issue, you can use one of the hash functions analyzed here.
Here's a fairly reasonable hash function from 4 integers to 1 integer:
// Accumulate with multiplication by a prime (37); wrapped in a function for completeness.
unsigned int hash4( unsigned int const in[4] )
{
    unsigned int hash = in[0];
    hash *= 37;
    hash += in[1];
    hash *= 37;
    hash += in[2];
    hash *= 37;
    hash += in[3];
    return hash;
}
With uniformly-distributed input it gives uniformly-distributed output. All bits of the input participate in the output, and every input value (although not every input bit) can affect every output bit. Chances are it's faster than the function which produces the output, in which case there are no performance concerns.
There are other hashes with other characteristics, but accumulate-with-multiplication-by-prime is a good start until proven otherwise. You could try accumulating with xor instead of addition if you like. Either way, it's easy to generate collisions (for example {1, 0, a, b} collides with {0, 37, a, b} for all a, b), so you might want to pick a prime which you think has nothing to do with any plausible implementation bug in your function. So if your function has a lot of modulo-37 arithmetic in it, maybe use 1000003 instead.
Because hashing can generate collisions, you have to keep the keys in memory anyway in order to discover these collisions. Hash maps and other standard data structures already do this in their internal bookkeeping.
As the key is so small, just use the key directly rather than hashing. This will be faster and will ensure no collisions.
I fully agree with Vinko - just compare them all. If you still want a good hashing function, you need to analyse the distribution of your 4 unsigned integers. Then you have to craft your hashing function in such a way that the result is evenly distributed over the whole range of the 32-bit hash value.
A simple example: let's assume that most of the time, the result from each function is in the range from 0 to 255. Then you could easily blend the lower 8 bits from each value into your hash. Most of the time you'd find the result directly; only sometimes (when one function returns a larger result) would you have a collision.
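As a sketch of that blending idea (assuming, as above, that each value usually fits in 0..255; anything larger loses its high bits and may collide):

#include <cstdint>

std::uint32_t blendLow8(std::uint32_t a, std::uint32_t b,
                        std::uint32_t c, std::uint32_t d) {
    // pack the low byte of each input into one 32-bit hash
    return (a & 0xFFu) | ((b & 0xFFu) << 8) | ((c & 0xFFu) << 16) | ((d & 0xFFu) << 24);
}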
To sum it up - without information how the results of the 4 functions are distributed, we can't help you with a good hashing function.
Why a hash? It seems like a std::set or std::multiset would be better suited to store this kind of output. All you'd need to do is wrap the four integers up in a struct and write a simple compare function.
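For example, something along these lines (a rough sketch; here std::array stands in for the struct, since it already provides lexicographic comparison and no hand-written compare function is needed):

#include <array>
#include <cstdint>
#include <set>

int main() {
    std::set<std::array<std::uint32_t, 4>> seen;           // ordered by lexicographic compare
    seen.insert({1u, 2u, 3u, 4u});
    bool duplicate = !seen.insert({1u, 2u, 3u, 4u}).second; // second insert fails: already present
    return duplicate ? 0 : 1;
}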
Try using CRC or FNV. FNV is nice because it is fast and has a defined method of folding bits to get "smaller" hash values (e.g. 12-bit / 24-bit / etc.).
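A minimal FNV-1a sketch over the four 32-bit values (using the standard 64-bit offset basis and prime; fold the result down with xor if you need fewer bits):

#include <cstdint>

std::uint64_t fnv1a(const std::uint32_t (&in)[4]) {
    std::uint64_t h = 0xcbf29ce484222325ULL;              // FNV-1a 64-bit offset basis
    for (std::uint32_t v : in) {
        for (int i = 0; i < 4; ++i) {
            h ^= (v >> (8 * i)) & 0xFFu;                  // mix in one byte at a time
            h *= 0x100000001b3ULL;                        // FNV 64-bit prime
        }
    }
    return h;
}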
Also, the benefit of generating a 64-bit hash from a 128-bit (4 x 32-bit) value is a bit questionable because, as other people have suggested, you could just use the original value as a key in a set. You really want the number of bits in the hash to reflect the number of values you originally have. For example, if your dataset has 100,000 such 128-bit values, you probably want a 17-bit or 18-bit hash value, not a 64-bit hash.
Might be a bit overkill, but consider Boost.Hash. Generates very simple code and good values.