Say I decided that my hasher for a hash_set of a series of integers is the integer itself. And also say my integers are spread over a very wide, clumpy range: 1-20, then 1000-1200, then 10000-12000.
e.g.: 1, 2, 5, 7, 1111, 1102, 1000, 10003, 10005
Wouldn't that be a very bad hashing function? How would the data be stored by hash_set in this case, say by the gcc implementation, if anyone knows?
Thanks
EDIT:
Thank you for both replies. I should note that I have already specified my hasher to return the input value, e.g. the hash for 1001 would be 1001. So I'm asking whether the implementation would take the liberty of doing another round of hashing, or whether it would see 1001 and grow the array to 1001 entries.
Even if your data is clumped in certain ranges, typically only the least significant bits of each hash value are used to pick where it is stored. So as long as those low bits (the part representing, say, 0-127) are evenly distributed, your hash function will still behave well regardless of how the full hash values are distributed. It does mean, however, that if your values are all multiples of some power of two, e.g. eight, the lower bits won't be so evenly distributed and the values will clump in the hash table, causing excessive chaining and slowing down operations.
The hash table would start small, occasionally rehashing to grow when the load factor gets high enough. Just because the hash value is 12000 does not mean there would be 12000 buckets, of course--the hash_set will do something like "mod" the hash function's output to make it fit within the number of buckets.
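To make that concrete, here's a minimal sketch of that folding step (a toy separate-chaining view, not GCC's actual code):

#include <cstddef>

// Toy illustration: the bucket index is just the hash reduced modulo the
// current bucket count, so a hash value of 12000 never needs 12000 buckets.
std::size_t bucket_index(std::size_t hash_value, std::size_t bucket_count) {
    return hash_value % bucket_count;   // e.g. 12000 % 13 == 1
}

// With your identity hasher and a 13-bucket table, keys 1, 1111 and 10005
// land in buckets 1 % 13 == 1, 1111 % 13 == 6 and 10005 % 13 == 8.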
The identity function you describe is not a bad hash function for many hash table implementations (including GCC's). In fact it is what many people use, and obviously it is efficient. What it would be a bad example of is a cryptographic hash function, but that has a different purpose.
If the input data entries are around 10^9, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Kindly explain how these two things work. It's getting quite confusing. Thanks!!
I tried to make the table size around 75% of the input data size; call that X. Then I did key % X to get the hash code. But I am not sure if this is correct.
If the input data entries are around 10^9, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate-chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0, so for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, while higher load factors lead to more collisions, longer lists and longer search times. At load factors under 5 or 10, with an ok hash function, the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
One important decision you should make is whether to pick a bucket count around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious, and anyway - which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not, a prime number is the safer bet).
If we are using numbers in the range of 10^6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which usually is expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
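A rough sketch of keeping those two steps separate (the function names here are just for illustration):

#include <cstddef>
#include <functional>

// Step 1: hash the key to a full-width size_t (identity for integers here).
std::size_t hash_key(unsigned int key) {
    return std::hash<unsigned int>{}(key);
}

// Step 2a: fold the hash over the bucket count - general case (e.g. prime counts).
std::size_t fold(std::size_t h, std::size_t bucket_count) {
    return h % bucket_count;
}

// Step 2b: faster folding, only valid if bucket_count is a power of two.
std::size_t fold_pow2(std::size_t h, std::size_t bucket_count) {
    return h & (bucket_count - 1);
}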
I tried to make the table size around 75% of the input data size; call that X.
So that's a load factor around 1.33, which is ok.
Then I did key % X to get the hash code. But I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by all major C++ compiler Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two they'd tend to only map to every 16th bucket) then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Support for that is easy to add: first write a hash table class that doesn't support rehashing, simply taking the number of buckets to use as a constructor argument. Then write an outer rehashing-capable hash table class with a data member of that type. When an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), construct an additional inner hash table object, telling its constructor a new, larger bucket count to use, then iterate over the smaller hash table inserting (or - better - moving, see below) the elements into the new one, and finally swap the two hash tables so the data member ends up with the new, larger content and the smaller one is destructed. By "moving" above I mean simply relinking the linked-list nodes from the smaller hash table into the lists of the larger one, instead of deep-copying the elements, which will be dramatically faster and momentarily use less memory while rehashing.
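A very condensed sketch of that grow-and-relink step for a separate-chaining table (the class and member names are made up for illustration, and duplicate checks are omitted):

#include <cstddef>
#include <forward_list>
#include <functional>
#include <vector>

// Hypothetical outer table: buckets_ is a vector of singly-linked lists, and
// rehash() relinks existing nodes into a larger bucket array instead of copying.
template <typename Key, typename Hash = std::hash<Key>>
class ToyHashSet {
public:
    void insert(const Key& k) {                    // duplicate checks omitted
        if (size_ + 1 > buckets_.size())           // keep load factor <= 1.0
            rehash(buckets_.size() * 2);
        buckets_[hash_(k) % buckets_.size()].push_front(k);
        ++size_;
    }

private:
    void rehash(std::size_t new_bucket_count) {
        std::vector<std::forward_list<Key>> bigger(new_bucket_count);
        for (auto& bucket : buckets_)
            while (!bucket.empty()) {
                // "Move": relink the front node into its new bucket, no deep copy.
                std::size_t target = hash_(bucket.front()) % new_bucket_count;
                bigger[target].splice_after(bigger[target].before_begin(),
                                            bucket, bucket.before_begin());
            }
        buckets_.swap(bigger);                     // old, smaller table destructs here
    }

    std::vector<std::forward_list<Key>> buckets_ =
        std::vector<std::forward_list<Key>>(8);    // start with 8 buckets
    std::size_t size_ = 0;
    Hash hash_{};
};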
In a hash map data structure such as unordered_map in C++:
unordered_map<char, int> mp = { {'a', 10}, {'b', 20} };
if (mp.find('a') != mp.end())
cout << "found you";
we know the find() method takes constant time. But if I have composite data as the key:
unordered_map<tuple<char, string, int>, int> mp = { { {'a', "apple", 10}, 100 } };
if (mp.find( {'a', "apple", 10} ) != mp.end())
cout << "found you";
Will the find() method still take constant time? How do I evaluate the time complexity now?
In general, the more bytes of data in the key, the longer the hash function will take to generate a value (though some hash functions do not look at every byte, and can therefore have lower big-O complexity). There might be more or fewer bytes because the tuple has more elements, or because some element in the tuple is variable-sized (like a std::string). Similarly, with more bytes it generally takes longer to test two keys for equality, which is another crucial operation for hash tables.
So, you could say your table's operations scale linearly with the size of the keys - O(K) - all other things being equal.
But, more often, you're interested in comparing how the performance of any given insert/erase/find compares with how long it would take in another type of container, and in many other types of containers the performance tends to degrade as you add more and more keys. That's where people describe hash tables as generally having amortised average-case O(1) operational complexity, whereas e.g. balanced binary trees may be O(logN) where N is the number of elements stored.
There are some other considerations, such as that operations in a balanced binary tree tend to involve comparisons (i.e. key1 < key2), which may be short-circuited at the first differing byte, whereas hash functions tend to have to process all bytes in the key.
Now, if in your problem domain, the size of keys may vary widely, then it's meaningful to think in terms of O(K) complexity, but if the size of keys tends to hover around the same typical range - regardless of the number of keys you're storing, then the table property is reasonably expressed as O(1) - removing the near-constant multiplicative factor.
I think it helps to consider a familiar analogy. If you have 100 friends' names stored in your phone address book, or you have millions of names from a big city's phone book, the average length of names is probably pretty similar, so you could very reasonably talk about the big-O efficiency of your data structure in terms of "N" while ignoring the way it shrinks or grows with name length "K".
On the other hand, if you're thinking about storing arbitrary-length keys in a hash table, and some people might try to put XML versions of encyclopaedias, while others store novels, poems, or individual words, then there's enough variety in key length that it makes sense to describe the varying performance in terms of K.
The same would be true if you were storing, say, information about binary video data, and someone was considering using the raw binary video data as the hash table key: some videos are 8K HDR and hours long, others are tiny animated GIFs. (A better approach would be to generate a 64+ bit hash of the video data and use that for a key, which for most practical purposes will be reliably unique; if dealing with billions of videos, use 128 bits).
The theoretical running time is not in fact constant. The running time is constant only on average, given reasonable use cases.
A hash function is used in the implementation. If you implement a (good) hash function for your tuple that runs in constant time, the asymptotic running time of find is unaffected.
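For what it's worth, the Standard Library doesn't provide std::hash for std::tuple out of the box, so you would supply something like the following sketch (a boost::hash_combine-style mix; the constant and the struct name are just illustrative):

#include <cstddef>
#include <functional>
#include <string>
#include <tuple>
#include <unordered_map>

// Illustrative hasher: combine the std::hash of each tuple element.
struct TupleKeyHash {
    std::size_t operator()(const std::tuple<char, std::string, int>& t) const {
        std::size_t seed = 0;
        auto combine = [&seed](std::size_t h) {
            seed ^= h + 0x9e3779b9u + (seed << 6) + (seed >> 2);
        };
        combine(std::hash<char>{}(std::get<0>(t)));
        combine(std::hash<std::string>{}(std::get<1>(t)));  // O(length of the string)
        combine(std::hash<int>{}(std::get<2>(t)));
        return seed;
    }
};

// Usage:
// std::unordered_map<std::tuple<char, std::string, int>, int, TupleKeyHash> mp;
// mp[{'a', "apple", 10}] = 100;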
std::hash for a long is the identity function. This can cause poor hash distributions (e.g., if all the values are even, all hashes will also be even, etc.). Is there a better way to hash a long?
if all the values are even, all hashes will also be even
And that's fine, because they're not used as-is. Imagine if you allocated 4 billion buckets for one dictionary; it would be faster to just implement a linear search. Much, much faster.
Instead, the hash is reduced over a bucket count chosen to be co-prime with common key patterns (usually a straight-up prime number), for the very reason you mention.
All a hash has to do is be as different as possible for different input values (and, when it can't, at least be different for the most common or close values), and an identity function for a long (which is, I'm assuming, the same size as your hash) is a perfect candidate.
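That said, if you do want extra mixing on top of the identity, a sketch along the lines of the splitmix64 finalizer (the constants shown are the commonly published ones) can be dropped into a custom hasher:

#include <cstddef>
#include <cstdint>

// Bijective 64-bit mixer (splitmix64-style finalizer): spreads clustered
// inputs, such as all-even values, across the full range of bits.
std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

// Illustrative hasher usable as the Hash parameter of an unordered container.
struct MixedLongHash {
    std::size_t operator()(long v) const {
        return static_cast<std::size_t>(mix64(static_cast<std::uint64_t>(v)));
    }
};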
If I want to use extendible hashing to store a maximum of 100 records, then what is the minimum array size that I need?
I am guessing that an array of 100 would be sufficient, but I could be wrong. I also suspect that I can use a smaller array.
What do you know about your hash function?
You mentioned extendible hashing.
With extendible hashing you look at your hash as a bit string and typically implement the bucket lookup via a trie. Instead of a trie based lookup though I assume you are converting this to an index into your array.
You mentioned you will have at most 100 elements. If you wanted all distinct hashes you'd have 128 possibilities, since 7 bits (2^7 = 128) is the smallest number of bits that covers 100.
If your hashing function makes each element distinct in the first 7 (or more) bits, then you have the most optimal solution, with a bucket size of 1. That leaves 128 leaf nodes, or an array of size 128.
If it makes each element distinct in only 6 of those 7 (or more) bits, then you have a bucket size of 2. You would have 64 leaf nodes/combinations, i.e. an array of size 64.
If it makes each element distinct in only 5 of those 7 (or more) bits, then you have a bucket size of 4. You would have 32 leaf nodes/combinations, i.e. an array of size 32.
Since you said you want a bucket size of 4, I think your answer would be 32, with a hard requirement that your hashing function is good enough to give you at least 5 distinct leading bits.
I think it depends on whether you need high performance or want to save storage. You could just save the elements into an array of 100. I don't know a lot about extendible hashing, but my general understanding of hashing is that there will be some collisions, and if you use a bigger array the number of collisions is reduced and adding/deleting and querying will be faster. I think you should use at least 128 (just to have a power of two, 2^7; I am not an expert in hashing) :)
I'm writing a program right now which produces four unsigned 32-bit integers as output from a certain function. I'm wanting to hash these four integers, so I can compare the output of this function to future outputs.
I'm having trouble writing a decent hashing function, though. When I originally wrote this code, I threw in a simple addition of the four integers, which I knew would not suffice. I've tried several other techniques, such as shifting and adding, to no avail. I get a hash, but it's of poor quality, and the function generates a ton of collisions.
The hash output can be either a 32-bit or 64-bit integer. The function in question generates many billions of hashes, so collisions are a real problem here, and I'm willing to use a larger variable to ensure that there are as few collisions as possible.
Can anyone help me figure out how to write a quality hash function?
Why don't you store the four integers in a suitable data structure and compare them all? The benefit of hashing them in this case appears dubious to me, unless storage is a problem.
If storage is the issue, you can use one of the hash functions analyzed here.
Here's a fairly reasonable hash function from 4 integers to 1 integer:
// 'in' holds the four unsigned 32-bit values produced by your function
unsigned int hash4(const unsigned int in[4]) {
    unsigned int hash = in[0];
    hash *= 37;
    hash += in[1];
    hash *= 37;
    hash += in[2];
    hash *= 37;
    hash += in[3];
    return hash;
}
With uniformly-distributed input it gives uniformly-distributed output. All bits of the input participate in the output, and every input value (although not every input bit) can affect every output bit. Chances are it's faster than the function which produces the output, in which case there are no performance concerns.
There are other hashes with other characteristics, but accumulate-with-multiplication-by-prime is a good start until proven otherwise. You could try accumulating with xor instead of addition if you like. Either way, it's easy to generate collisions (for example {1, 0, a, b} collides with {0, 37, a, b} for all a, b), so you might want to pick a prime which you think has nothing to do with any plausible implementation bug in your function. So if your function has a lot of modulo-37 arithmetic in it, maybe use 1000003 instead.
Because hashing can generate collisions, you have to keep the keys in memory anyway in order to detect those collisions. Hash maps and other standard data structures already do this in their internal bookkeeping.
As the key is so small, just use the key directly rather than hashing. This will be faster and will ensure no collisions.
I fully agree with Vinko - just compare them all. If you still want a good hashing function, you need to analyse the distribution of your 4 unsigned integers. Then you have to craft your hashing function in such a way that the result is evenly distributed over the whole range of the 32-bit hash value.
A simple example: let's assume that most of the time, the result from each function is in the range 0 to 255. Then you could easily blend the lower 8 bits from each function into your hash. Most of the time you'd find the result directly; only sometimes (when one function returns a larger result) would you get a collision.
To sum it up - without information about how the results of the 4 functions are distributed, we can't help you with a good hashing function.
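A sketch of that blending idea (assuming each result usually fits in 8 bits):

#include <cstdint>

// Pack the low 8 bits of each of the four results into one 32-bit hash.
// Collisions only occur when some result exceeds 255.
std::uint32_t blend_hash(std::uint32_t a, std::uint32_t b,
                         std::uint32_t c, std::uint32_t d) {
    return (a & 0xFFu)
         | ((b & 0xFFu) << 8)
         | ((c & 0xFFu) << 16)
         | ((d & 0xFFu) << 24);
}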
Why a hash? It seems like a std::set or std::multiset would be better suited to store this kind of output. All you'd need to do is wrap the four integers up in a struct and write a simple comparison function.
Try using CRC or FNV. FNV is nice because it is fast and has a defined method of folding bits to get "smaller" hash values (i.e. 12-bit / 24-bit / etc).
Also, the benefit of generating a 64-bit hash from a 128-bit (4 x 32-bit) number is a bit questionable because, as other people have suggested, you could just use the original value as a key in a set. You really want the number of bits in the hash to reflect the number of values you originally have. For example, if your dataset has 100,000 4x32-bit values, you probably want a 17-bit or 18-bit hash value, not a 64-bit hash.
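For reference, a minimal FNV-1a sketch over the four values, using the published 32-bit offset basis and prime:

#include <cstddef>
#include <cstdint>

// FNV-1a over the 16 bytes of the four 32-bit inputs.
std::uint32_t fnv1a_hash4(const std::uint32_t in[4]) {
    std::uint32_t hash = 2166136261u;            // FNV-1a 32-bit offset basis
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(in);
    for (std::size_t i = 0; i < 4 * sizeof(std::uint32_t); ++i) {
        hash ^= bytes[i];
        hash *= 16777619u;                       // FNV-1a 32-bit prime
    }
    return hash;
}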
Might be a bit overkill, but consider Boost.Hash. Generates very simple code and good values.