What bucket_count value should I use if I know the intended number of map keys? - c++

I'm creating an std::unordered_map which I will immediately proceed to populate with n key-value pairs - and I know n. After that no more elements will be added - I will only be performing lookups.
What, therefore, should I pass as bucket_count to the constructor?
Notes:
I know it's not terribly critical and I could simply not specify anything and it will work.
This is related to, but not a dupe of, What should I pass to unordered_map's bucket count argument if I just want to specify a hash function?
If it helps your answer, you may assume I want to have a load factor between f_1 and f_2 (known in advance).
I'm using the default hash function, and I don't know what the input is like, but it's unlikely to be adversarial to the hashing.

According to n4296, 23.5.4.2 [unord.map.cnstr] (this is the final draft of C++14), the default max_load_factor for an unordered_map is 1.0, so you could just set the bucket_count to n.
There is obviously a space-time trade-off between increasing the bucket count for improved speed and decreasing it (and raising the max load factor) for improved space.
I would either not worry about it, or if it is a large map, set the bucket count to n. Then you can worry about optimizing when profiling shows you have a problem.
If you know the range of load factors you want, then you just set the bucket count to std::ceil(n / std::max(f_1, f_2)) (and set the max_load_factor before you fill the map).
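For concreteness, a minimal sketch of that recipe (int keys and values are just placeholders; n and f_2 come from the question):
#include <cmath>
#include <cstddef>
#include <unordered_map>

// Sketch: size the map up front for n elements at a target max load factor f_2.
std::unordered_map<int, int> make_map(std::size_t n, float f_2) {
    // ceil(n / f_2) buckets keeps the load factor at or below f_2;
    // the implementation may round this up (e.g. to the next prime).
    auto buckets = static_cast<std::size_t>(std::ceil(n / f_2));
    std::unordered_map<int, int> m(buckets);
    m.max_load_factor(f_2);  // set before filling the map
    return m;
}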

Given that you have a range for your load factor, the only missing information is the collision rate. You can simply use nb_buckets = n / f_2, which guarantees a load factor less than or equal to f_2. Ensuring correctness about f_1 requires data about the collision rate.

Related

Please reply::HashTable:Determining Table size and which hash function to use

If the input data entries are around 10 raised to power of 9, do we keep the size of the hash table the same as input size or reduce the size? how to decide the table size?
if we are using numbers in the range of 10 raised to power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator but modulo with what?
Kindly explain how these two things work. It's getting quite confusing. Thanks!
I tried to make the table size around 75% of the input data size, which you can call X. Then I did key%(X) to get the hash code. But I am not sure if this is correct.
If the input data entries are around 10 raised to power of 9, do we keep the size of the hash table the same as input size or reduce the size? how to decide the table size?
The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to more collisions, longer lists and longer search times, but at load factors under 5 or 10 with an ok hash function the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
One important decision you should make is whether to pick a number around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious, and in any case, which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not, a prime number is the safer bet).
if we are using numbers in the range of 10 raised to power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator but modulo with what?
Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which usually is expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
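A tiny sketch of those two folding strategies (the helper names are mine, purely illustrative):
#include <cstddef>

// General case: fold a hash value onto a bucket index with modulo.
std::size_t bucket_by_mod(std::size_t hash_value, std::size_t bucket_count) {
    return hash_value % bucket_count;
}

// Power-of-two bucket counts: bitwise AND with (count - 1) gives the same
// result as %, without a runtime division.
std::size_t bucket_by_mask(std::size_t hash_value, std::size_t bucket_count_pow2) {
    return hash_value & (bucket_count_pow2 - 1);
}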
I tried to make the table size around 75% of the input data size, which you can call X.
So that's a load factor around 1.33, which is ok.
Then I did key%(X) to get the hash code. But I am not sure if this is correct.
It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by all major C++ compiler Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two they'd tend to only map to every 16th bucket) then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
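If you do decide the identity hash isn't good enough, here is a hedged sketch of a stronger integer hasher (a splitmix64-style finalizer; the struct name and its use here are my own illustration, not part of the answer):
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct MixedIntHash {
    std::size_t operator()(std::uint64_t x) const noexcept {
        // Mix the bits so that patterned keys (e.g. multiples of 16)
        // still spread across all buckets.
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return static_cast<std::size_t>(x);
    }
};

std::unordered_map<std::uint64_t, int, MixedIntHash> counts;  // usage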
Rehashing
If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Implementing support for that can easily be done by first writing a hash table class that doesn't support rehashing - simply taking the number of buckets to use as a constructor argument. Then write an outer, rehashing-capable hash table class with a data member of the above type. When an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), you can construct an additional inner hash table object, telling the constructor a new larger bucket count to use, then iterate over the smaller hash table inserting (or - better - moving, see below) the elements into the new hash table, then swap the two hash tables so the data member ends up with the new, larger content and the smaller one is destructed. By "moving" I mean simply relinking the linked-list nodes from the smaller hash table into the lists in the larger one, instead of deep copying the elements, which will be dramatically faster and use less memory during the momentary rehash.
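A rough sketch of that node-relinking step, assuming the chains are std::forward_list buckets (the function and parameter names here are mine, not from the answer above):
#include <cstddef>
#include <forward_list>
#include <functional>
#include <vector>

// Move every node from the old buckets into a larger bucket array by
// splicing list nodes - no element is copied or reallocated.
template <typename Key, typename Hash = std::hash<Key>>
std::vector<std::forward_list<Key>>
rehash_buckets(std::vector<std::forward_list<Key>>& old_buckets,
               std::size_t new_bucket_count, Hash hash = Hash{}) {
    std::vector<std::forward_list<Key>> new_buckets(new_bucket_count);
    for (auto& chain : old_buckets) {
        while (!chain.empty()) {
            std::size_t b = hash(chain.front()) % new_bucket_count;
            // Relink the front node of this chain into its new bucket.
            new_buckets[b].splice_after(new_buckets[b].before_begin(),
                                        chain, chain.before_begin());
        }
    }
    return new_buckets;
}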

QHash: Any weak spots performance-wise, besides rehashing? Why not?

This is more of a theoretical question. In addition, I have to admit that I did some rather sophisticated performance tests about this some time ago, but I cannot find the source code anymore.
Some words about the type of application: I am focussing on scenarios with really big numbers of entries, from 50,000 up to several million, while memory consumption does not really matter.
I fully understand the basic concept of a hash data structure, why it generally has constant access times, and why rehashing is required at some point.
Any possible key is mapped to a certain slot in the hash structure. Of course, collisions may occur, resulting in multiple keys being mapped to the same slot. This is where implementation details come into play. As far as I know, there is some kind of logic using the "next" slot if the initially assigned slot is occupied.
My feeling is that there has to be a weak spot somewhere, performance-wise. Given a really big QHash, filled up just below its capacity, from which keys are then removed randomly while new keys are added (without ever increasing the total number of stored keys, making sure it does not rehash): I would think this has to lead to severe performance degradation in the long term.
Filling up a QHash just below its capacity, with random values, should result in a considerable amount of key collisions. The lookup of a key affected by collisions requires multiple slots to be inspected, resulting in performance penalties. Removing keys and adding new random keys instead should make things even worse: contiguous sequences of colliding keys will be fragmented. Collisions occupy 'foreign' slots, forcing a key actually mapped to this slot to be stored somewhere else. This slot might still be freed later...
To make it short, I would expect, for the given scenario (performing deletes and inserts on a QHash which is always kept just below its capacity), that performance should degrade in the long run, either because lookup times increase due to increasing disorder, or because of periodic reordering.
However, I had made some effort to show this performance degradation once (as I said, I cannot find this project anymore, so I stand here barehanded, I'm afraid), and I could not find any.
Is there a special magic in place regarding QHash handling collisions I am not aware of?
tl;dr;
Is there a special magic in place regarding QHash handling collisions I am not aware of
There is no magic. You just misunderstood how hash maps work.
(*) I will treat the concept of a hash map, not the specific implementation QHash. And although there are several approaches to handling collisions, what I am describing here is the most common pattern. It is used, among others, by C++ std::unordered_map, Qt QHash, and Java HashMap.
Your terminology with "slots" is a bit confusing. I first thought that by slot you meant "bucket", but I now think you mean an element of a bucket.
So in a hash, colliding keys are stored in a bucket. This can be any container, from vector to list.
Finding a bucket is O(1) and finding a key inside a bucket is O(k), where k is the bucket's length. So key access is constant in best case and linear in worst case.
You seem to assume that the number of buckets somehow increases when the hash map fills its capacity. Well, there is no such thing as a capacity for a hash map (like there is for a vector, for instance). So the situation that you describe ("having a hash with the capacity of let's say 100 and in the worst case all 100 elements collide and are stored in the same bucket") can't happen.
For a hash map you have a measure called the "load factor", which is the average number of elements per bucket (size / bucket_count). The hash will increase the number of buckets (and recompute the hashes of every element, repositioning them) when the load factor exceeds a threshold, the max load factor. Performance is assured first and foremost by the quality of the hash function, which must ensure that keys are uniformly spread across all buckets. But no matter how good the hash function is, you can still have situations where some buckets are considerably larger than the rest. The fail-safe for that is the aforementioned max load factor.
By consequence, the max load factor achieves two purposes:
it acts as a "logical capacity", if you will: it makes the hash naturally increase the bucket count in the scenario where elements are added to the hash uniformly, making all the buckets too large.
it acts as a "fail-safe" for the hash function: it increases the bucket count in the (rare) scenario where you have multiple keys colliding on a small subset of buckets.
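As an illustration (a throwaway program of mine, not part of the answer), you can watch std::unordered_map grow its bucket count whenever the load factor would otherwise exceed max_load_factor (1.0 by default):
#include <cstddef>
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    std::size_t last = m.bucket_count();
    for (int i = 0; i < 100000; ++i) {
        m[i] = i;
        if (m.bucket_count() != last) {  // a rehash just happened
            last = m.bucket_count();
            std::cout << "size " << m.size() << " -> " << last
                      << " buckets, load factor " << m.load_factor() << "\n";
        }
    }
}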

Strategy to set the number of initial buckets in `std::unordered_set` in C++

If we know that we're going to hash between m and n items, where m and n are relatively large, what's a reasonable strategy for setting the number of initial buckets for std::unordered_set? If it helps, in my case m=n/2. In general, I'd like to optimize for speed, but can't afford an unreasonable amount of memory. Thanks in advance.
tl;dr There's no simple answer. Either measure, or let the container manage the bucket size automatically.
As I tried to say in the comments, there are too many variables, and you don't seem to realise how vague you're being. It took an hour for you to even say which implementation you're interested in.
m and n are "relatively large" ... relative to what?
"These are the only two operations and I want them to be fast." Define fast? What's fast enough? What's too slow? Have you measured?
If you want to minimize the load factor, so that there is on average no more than one element per bucket (and so no iteration within a bucket is needed once the right bucket is known), then you'll need at least n buckets. But that doesn't guarantee one bucket per element, because the function used to determine the bucket from a hash code might return the same value for every pointer you put in the container. Knowing if that's likely depends on the hash function being used, the function that maps hash codes to buckets, and the pointer values themselves.
For GCC the hash function for pointers is the identity function. For the default unordered_map implementation the mapping to buckets is hash_function(x) % bucket_count() and the bucket count is always a prime number, to reduce the likelihood of collisions. If the addresses you're storing in the hash map tend to be separated by multiples of the bucket count then they're going to end up in the same bucket. Knowing how likely that is depends on the number of buckets used for n (which you haven't stated) and the distribution of pointer values you're using (which you haven't stated).
If you use a custom hash function that has knowledge of the pointer values you expect to store then you could use a perfect hash function that uniformly distributes between [0, n) and then set the bucket_count() to n and ensure no collisions.
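As a hedged sketch of pairing a pointer-aware hasher with an explicit initial bucket count (this is not a perfect hash; the struct name, the shift amount and the factory function are my own assumptions, not from the answer):
#include <cstddef>
#include <cstdint>
#include <unordered_set>

struct ShiftedPtrHash {
    std::size_t operator()(const void* p) const noexcept {
        // Drop the low alignment bits so pointers don't all fall into the
        // same residue classes of the bucket count.
        return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) >> 4);
    }
};

std::unordered_set<const void*, ShiftedPtrHash> make_pointer_set(std::size_t n) {
    // Request at least n buckets up front (GCC will round up to a prime).
    return std::unordered_set<const void*, ShiftedPtrHash>(n);
}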
But it's not obvious that ensuring only a single element per bucket is worth it, because it uses more memory. Iterating through a bucket containing two or three elements is not going to be a bottleneck in most programs. Maybe it will be in yours, it's impossible to know because you haven't said what you want except it has to be fast. Which is so vague it's meaningless.
The only way to answer these questions is for you to measure the real world performance, nobody can give you a magic number that will make your code faster based on your vague requirements. If there was an easy answer that always makes things faster for "relatively large" number of elements then the standard library implementation should already be doing that and so you'd just be wasting your time doing the same thing by hand.
As an alternative, if you can live with logarithmic performance (usually not a problem), use a std::map instead. Then you have guaranteed lookup complexity 100% of the time, no re-hashing. A very useful property e.g. in hard real-time systems.

insert is taking long long time in an Unordered_map (C++) with ULONG as key and unknown bucket count

I have an unordered_map with key as ULONG.
I know that there can be a huge number of entries but I'm not sure how many, so I can't specify the bucket count beforehand. I was expecting the time complexity for inserts to be O(1) since the keys are unique, but it seems like the inserts are taking a very long time.
I've read that this might happen if there has been a lot of re-hashing, since the bucket count is unspecified and re-hashing takes a non-deterministic amount of time.
Is there anything I can do to improve the time complexity of insert here? Or am I missing something?
Couple of things that might help:
You can actually compute when a rehash will take place and work out if that is the issue. From cplusplus.com:
"A rehash is forced if the new container size after the insertion operation would increase above its capacity threshold (calculated as the container's bucket_count multiplied by its max_load_factor)."
Try to isolate the insert operation and see if it does indeed take as long as it seems; otherwise, write a simple timer and place it at useful places in the code to see where the time is being eaten.
Between max_load_factor and preemptively calling reserve, you should be able to both minimize rehashing, and minimize bucket collisions. Getting the balance right is mostly a matter of performance testing.
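A minimal sketch of that combination (the function and parameter names are placeholders of mine):
#include <cstddef>
#include <unordered_map>

std::unordered_map<unsigned long, int> make_map(std::size_t expected_n) {
    std::unordered_map<unsigned long, int> m;
    m.max_load_factor(0.7f);  // trade some memory for fewer collisions
    m.reserve(expected_n);    // allocates >= expected_n / max_load_factor() buckets,
                              // so inserting expected_n elements never rehashes
    return m;
}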
For a start, many Standard Library implementations hash integers with an identity function, i.e. hash(x) == x. This is usually ok, especially if the implementation ensures the bucket count is prime (e.g. GCC), but some implementations use power-of-two bucket counts (e.g. Visual C++). You could run some numbers through your hash function and see if it's an identity function: if so, consider whether your inputs are sufficiently random for that not to matter. For example, if your numbers are all multiples of 4, then a power-of-two bucket count means you're only using a quarter of your buckets. If they tend to vary most in the high-order bits, that's also a problem, because selecting the bucket by bitwise-ANDing with (power-of-two bucket count - 1) effectively throws the high-order bits away (for example, with 1024 buckets only the 10 least significant bits of the hash value will affect the bucket selection).
A good way to check for this type of problem is to iterate over your buckets and create a histogram of the number of colliding elements, for example - given unordered_map m:
std::map<int, int> histogram;
for (size_t i = 0; i < m.bucket_count(); ++i)
++histogram[m.bucket_size(i)];
for (auto& kv : histogram)
std::cout << ".bucket_size() " << kv.first << " seen "
<< kv.second << " times\n";
You'd expect the frequency of larger bucket_size() values to trail away quickly: if not, work on your hash function.
On the other hand, if you've gone over the top and instantiated your unordered_map with a 512-bit cryptographic hash or something, that will unnecessarily slow down your table operations too. The strongest hash you need to consider for day-to-day use with tables of fewer than 4 billion elements is a 32-bit murmur hash. Still, use the Standard Library provided one unless the collisions reported above are problematic.
Switching to an open addressing / closed hashing hash table implementation (you could Google for one) is very likely to be dramatically faster if you don't have a lot of "churn": deleting hash table elements nearly as much as you insert new ones, with lookups interspersed. Open addressing means the hash table keys are stored directly in the bucket array, so you get far better cache hits and much faster rehashing as the table size increases. If the values your integral keys map to are large (memory-wise, as in sizeof(My_Map::value_type)), then do make sure the implementation only stores pointers to them in the table itself, so the entire objects do not need to be copied during resizing.
Note: the above assumes the hash table really is causing your performance problems. If there's any doubt, do use a profiler to confirm that if you haven't already.

how to choose max_load_factor based on number of elements?

I work with unordered_set.
Here it's written that it has a reserve function which sets the bucket count based on the number of elements N to contain.
However, mpic++ compiler on Ubuntu complains that there is no function reserve:
'class std::tr1::unordered_set<pair_int>' has no member named 'reserve'
I need to optimize my set to hold N elements.
It seems max_load_factor is available; how do I come up with one based on N?
Or can I optimize it some other way?
Thanks in advance
P.S. I saw some discussion for Java, but not for the C++ standard library.
Load factor is independent of the number of items you insert. It's basically the percentage of available space that's actually in use. If, for example, you currently have space for 100 elements allocated, the maximum load factor could say to start resizing the table when you had inserted, say, 80 items (this would correspond to a maximum load factor of 80%).
Setting the maximum load factor is, therefore, largely independent of the number of elements you're going to store. Rather, it's (mostly) an indication of how much extra space you're willing to use to improve search speed. All else being equal, a table that's closer to full will have more collisions, which will slow searching.
If you want to optimize an unordered set to hold N elements, you want to use the rehash function. This accepts an argument that sets the minimum number of buckets for the set, which will prevent a rehash from occurring while you are inserting elements into your set.
For instance, if your desired load factor is 75% then your bucket count should be N / 0.75
// This creates an unordered set optimized for 80 elements with a load factor of 75%
// (80 / 0.75 = 106.67, so request at least 107 buckets)
std::unordered_set<std::string> myset;
myset.rehash(107);