I'm a little confused about how unordered_map works, what buckets are, and how they are managed.
From this blog post, unordered_map is a vector of vectors.
My questions are:
is it correct to assume that the buckets are the "internal" vectors?
since each bucket (vector) can contain multiple elements, due to a hash collision on the hash table (the "external" vector), and since we have to scan this internal vector (in linear time), is it correct to assume that we have to define the equal method on the key type (in addition to the hash operator) in order to find the key inside the bucket?
what is the external vector (hash table) size by default?
what is the internal vector size by default?
what happens if the number of elements in one bucket becomes too big? Or, in other words, when does the rehash happen?
Sorry for these questions, but I didn't find any detailed explanation of how this structure works (on cppreference.com, for example).
std::unordered_map is the standard C++ hash table. It used to be called hash_map in the STL, but it missed the boat when many of the STL's interfaces were merged into C++ in 1998, and by 2011 so many libraries had their own hash_map that C++ had to pick another name (I think "unordered" was a great choice; assuming order in a hash table is a common source of bugs).
is it correct to assume that the buckets are the "internal" vectors?
No, it is both incorrect (incompatible with the iterator invalidation requirements) and dangerous (under that assumption you may end up subtracting pointers to elements in the same bucket).
In real life, the buckets are linked lists; e.g.
LLVM libc++ unordered_map is a unique_ptr to an array of linked lists of __hash_nodes
GNU libstdc++ unordered_map is a pointer to an array of linked lists of _Hash_nodes
is it correct to assume that we have to define the equal method on the key type (in addition to the hash operator) in order to find the key inside the bucket?
Yes, locating the key in a bucket is exactly what the 4th template parameter of std::unordered_map is for (it does not need to call the "equal method on the key type" literally, of course).
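For example, a minimal sketch with a hypothetical Point key (PointHash and PointEq are made-up names for illustration): the 3rd template parameter computes the bucket, the 4th compares keys within it.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
struct Point { int x, y; };                        // hypothetical key type, for illustration only
struct PointHash {                                 // supplies the 3rd template parameter (hashing)
    std::size_t operator()(const Point& p) const noexcept {
        return std::hash<int>{}(p.x) * 31u + std::hash<int>{}(p.y);
    }
};
struct PointEq {                                   // supplies the 4th template parameter (equality within a bucket)
    bool operator()(const Point& a, const Point& b) const noexcept {
        return a.x == b.x && a.y == b.y;
    }
};
std::unordered_map<Point, std::string, PointHash, PointEq> names;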
what is the external vector (hash table) size by default?
There is no "external vector". The number of buckets for a default-constructed std::unordered_map is implementation-defined; you can query it with bucket_count.
what is the internal vector size by default?
There is no "internal vector". The size of any given bucket equals the number of elements currently placed in that bucket. You can query it with bucket_size.
what happens if the number of elements in one bucket becomes too big? Or, in other words, when does the rehash happen?
Nothing happens if the number of elements in one bucket becomes too big. But if the average number of elements per bucket (a.k.a. the load_factor) exceeds max_load_factor, a rehash happens (e.g. on insert).
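You can watch this happen with a small test (just a sketch; the exact bucket counts you'll see are implementation-defined):
#include <cstddef>
#include <iostream>
#include <unordered_map>
int main() {
    std::unordered_map<int, int> m;
    std::cout << "max_load_factor: " << m.max_load_factor() << '\n';   // defaults to 1.0
    std::size_t buckets = m.bucket_count();
    for (int i = 0; i < 100; ++i) {
        m[i] = i;
        if (m.bucket_count() != buckets) {                             // this insert triggered a rehash
            std::cout << "rehash at size " << m.size() << ": " << buckets
                      << " -> " << m.bucket_count() << " buckets, load_factor now "
                      << m.load_factor() << '\n';
            buckets = m.bucket_count();
        }
    }
}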
This may help you understand the buckets:
http://www.cplusplus.com/reference/unordered_map/unordered_map/bucket_count/
http://www.cplusplus.com/reference/unordered_map/unordered_map/max_load_factor/
But generally, yes, the buckets are something like internal vectors. The container needs an equality operator (or a predicate) to distinguish keys that have the same hash, as you suggest.
The initial number of buckets is possibly 0. It can be set via rehash() or reserve() (they have slightly different semantics).
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
In an ideal case, each bucket will only have the one item. You can check this by using bucket_size. When the load factor (total items vs. bucket count) gets high, it rehashes automatically.
By default, it'll aim for a 1:1 load factor. If the hash function is good, this may last until max_bucket_count items are inserted.
Keep in mind that the specific implementation of this may vary. Each implementation (e.g. from different platforms or standard libraries) really only needs to have the correct semantics.
If these answers are important to your program, you can query the values as I've described. If you're just trying to wrap your head around it, query them in some test scenarios and it may become more clear.
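For instance, a tiny test scenario (just a sketch; the exact bucket counts are implementation-defined) showing the different semantics of rehash() and reserve(), plus the queries mentioned above:
#include <cstddef>
#include <iostream>
#include <unordered_map>
int main() {
    std::unordered_map<int, int> a, b;
    a.rehash(100);               // ask for at least 100 buckets, right now
    b.max_load_factor(0.5f);
    b.reserve(100);              // ask for room for 100 elements, i.e. at least 200 buckets here
    std::cout << a.bucket_count() << ' ' << b.bucket_count() << '\n';
    std::cout << a.bucket_size(0) << ' ' << a.load_factor() << '\n';   // both 0: nothing inserted yet
}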
I'm trying to understand unordered maps and hashing. As I understand it, an unordered map has a hash function inside it that takes an object of type T and returns an int, which it then uses as an index into an internal array. It keeps a List of objects of type T at each array position, so that if there's already something in that spot, additions are inserted into the List.
Conceptually, would using a Set instead of a List improve efficiency?
(maybe somehow binary search and a Set being ordered helps over having a List)
Or maybe a Vector instead of the List?
(maybe random access helps over the List.)
The datatype should not matter much, because in most cases, the container at the hashed index only contains zero or one element. If you regularly have many elements there, the hash map degrades in performance anyway. The remedy for that is to resize the initial array, which std::unordered_map<> does itself. However, if you have a bad hash function which causes many hash collisions, switching the hash function is necessary for proper operation.
If there are often a lot of collisions in the same bucket, then using a set is more efficient than using a list, and indeed some Java hash table implementations have adopted sets for this reason. Vectors can't be used for std::unordered_map or std::unordered_set implementations, as they need to reallocate to a different memory area when grown past their capacity, whilst the Standard requires that the elements in an unordered container are never moved by other operations on the container.
That said, the nature of hash tables is that - with a high quality hash function - the statistical distribution of number-of-elements colliding in particular buckets relates only to the load factor. If you can't trust the collisions not to get out of control, perhaps you shouldn't be using that hash function.
Some details: Standard-library unordered containers have a default max_load_factor() (load_factor() is the ratio of size() to bucket_count()) of 1.0, and with a strong pseudo-randomizing hash function they'll have 1/e ~= 36.8% of buckets empty, as many with one element, half that with 2 elements (~18.4%), a third of that with 3 elements (~6.13%), a quarter of that with 4 elements (~1.53%), a fifth of that with 5 elements (~0.3%), a sixth of that with 6 elements (~0.05%). As you can hopefully see, it's incredibly rare to have to search through many elements (even in the worst case scenario where the hash table is at its max load factor), so a list approach is usually adequate.
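Those percentages are just the Poisson distribution with mean 1 (P(k) = e^-1 / k!); a tiny sketch that reproduces them:
#include <cmath>
#include <cstdio>
int main() {
    // With a strong hash and load factor 1.0, bucket sizes are approximately Poisson(1):
    // P(k) = e^-1 / k!, and P(k+1) = P(k) / (k+1).
    double p = std::exp(-1.0);
    for (int k = 0; k <= 6; ++k) {
        std::printf("buckets with %d element(s): %5.2f%%\n", k, p * 100.0);
        p /= (k + 1);
    }
}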
I need to create a lookup function where a (X,Y) pair corresponds to a specific Z value. One major requirement for this is that I need to do it in as close to O(1) complexity as I can. My plan is to use an unordered_map.
I generally do not use a hash table for lookup, as the lookup time has never been important to me. Am I correct in thinking that as long as I built the unordered_map with no collisions, my lookup time will be O(1)?
My concern then is what the complexity becomes if the key is not present in the unordered map. If I use unordered_map::find(), for example, to determine whether a key is present in my hash table, how will it go about giving me an answer? Does it actually iterate over all the keys?
I greatly appreciate the help.
The standard more or less requires using buckets for collision resolution, which means that the actual lookup time will probably be linear with respect to the number of elements in the bucket, regardless of whether the element is present or not. It's possible to make it O(lg N), but it's not usually done, because the number of elements in the bucket should be small if the hash table is being used correctly.
To ensure that the number of elements in a bucket is small, you must ensure that the hashing function is effective. What effective means depends on the types and values being hashed. (The MS implementation uses FNV, which is one of the best generic hashes around, but if you have special knowledge of the actual data you'll be seeing, you might be able to do better.)
Another thing which can help reduce the number of elements per bucket is to force more buckets or use a smaller load factor. For the first, you can pass the minimum initial number of buckets as an argument to the constructor. If you know the total number of elements that will be in the map, you can control the load factor this way. You can also force a minimum number of buckets once the table has been filled, by calling rehash. Otherwise, there is a function std::unordered_map<>::max_load_factor which you can use. It is not guaranteed to do anything, but in any reasonable implementation, it will. Note that if you use it on an already filled unordered_map, you'll probably have to call unordered_map<>::rehash afterwards.
(There are several things I don't understand about the standard unordered_map: why the load factor is a float instead of a double, why it's not required to have an effect, and why it doesn't automatically call rehash for you.)
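As a sketch of those two knobs (a minimum bucket count passed to the constructor, and max_load_factor() followed by rehash() on an already filled map; the exact counts you'll see are implementation-defined):
#include <iostream>
#include <unordered_map>
int main() {
    std::unordered_map<int, int> m(1024);     // minimum initial bucket count
    for (int i = 0; i < 1000; ++i)
        m[i] = i;
    std::cout << m.bucket_count() << ' ' << m.load_factor() << '\n';
    m.max_load_factor(0.25f);                 // lower the load factor on the filled map...
    m.rehash(0);                              // ...then rehash so the new limit actually takes effect
    std::cout << m.bucket_count() << ' ' << m.load_factor() << '\n';
}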
As with any hash table, worst case is always linear complexity (Edit: if you built the map without any collisions like you stated in your original post, then you'll never see this case):
http://www.cplusplus.com/reference/unordered_map/unordered_map/find/
Complexity
Average case: constant.
Worst case: linear in container size.
Return Value
An iterator to the element, if the specified key value is found, or unordered_map::end if the specified key is not found in the container.
However, because an unordered_map can only contain unique keys, you will see an average complexity of constant time (the container first checks the hash index, and then iterates over the values at that index).
I think the documentation for unordered_map::count function is more informative:
Searches the container for elements whose key is k and returns the number of elements found. Because unordered_map containers do not allow for duplicate keys, this means that the function actually returns 1 if an element with that key exists in the container, and zero otherwise.
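To make that concrete, a small sketch of a hit and a miss (the key point being that a miss does not iterate over all the keys, only over the one bucket the key hashes to):
#include <iostream>
#include <unordered_map>
int main() {
    std::unordered_map<int, int> m{{1, 10}, {2, 20}};
    auto hit = m.find(1);                                            // hash the key, scan only that bucket
    if (hit != m.end()) std::cout << hit->second << '\n';            // prints 10
    auto miss = m.find(42);                                          // same cost for an absent key
    std::cout << (miss == m.end()) << ' ' << m.count(42) << '\n';    // prints "1 0"
}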
To have no collisions at all in a hashed data structure is incredibly difficult (if not impossible for a given hash function and arbitrary data). It would also require a table size at least equal to the number of keys. But no, it does not need to be that strict: as long as the hash function distributes the values in a relatively uniform way, you will have O(1) lookup complexity.
Hash tables are generally just arrays with linked lists taking care of the collisions (this is the chaining method - there are other methods, but this is likely the most utilized way of dealing with collisions). Thus, to find if a value is contained within a bucket, it will have to (potentially) iterate over all the values in that bucket. So if the hash function gives you a uniform distribution, and there are N buckets, and a total of M values, there should be (on average) M/N values per bucket. As long as this value is not too large, this allows O(1) lookup.
So, as a bit of a long winded answer to your question, as long as the hashing function is reasonable, you will get O(1) lookup, with it having to iterate over (on average) O(M/N) keys to give you a "negative" result.
I have not read the C++ standard, but this is how I think C++'s unordered_map is supposed to work:
Allocate a memory block in the heap.
With every put request, hash the object and map it to a space in this memory
During this process, handle collisions via chaining or open addressing.
I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates? What happens if, say, we allocated memory for 50 ints and ended up inserting 5000 integers?
That would be a lot of collisions, so I believe there should be some kind of re-hashing and re-sizing algorithm to decrease the number of collisions once a certain collision threshold is reached. Since they are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?
With every put request, hash the object and map it to a space in this memory
Unfortunately, this isn't exactly true. You are referring to an open addressing or closed hashing data structure which is not how unordered_map is specified.
Every unordered_map implementation stores linked lists of externally allocated nodes, reached through its array of buckets. Meaning that inserting an item will always allocate at least once (the new node), if not twice (resizing the array of buckets, then the new node).
No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements. Because inserting might cause the bucket array to grow (reallocate), it is not generally possible to have an iterator pointing directly into the bucket array and meet the stability guarantees.
unordered_map is a better data structure if you are storing expensive-to-copy items as your key or value. Which makes sense, given that its general design was lifted from Boost's pre-move-semantics design.
Chandler Carruth (Google) mentions this problem in his CppCon '14 talk "Efficiency with Algorithms, Performance with Data Structures".
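A small sketch of the stability guarantee that forces this node-based layout: a pointer to an element's mapped value stays valid even after the bucket array grows and the table rehashes.
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
int main() {
    std::unordered_map<int, std::string> m;
    m[1] = "one";
    std::string* p = &m[1];                    // points into the node holding the element
    std::size_t old_buckets = m.bucket_count();
    for (int i = 2; i < 10000; ++i)            // force several rehashes
        m[i] = "x";
    assert(m.bucket_count() > old_buckets);    // the bucket array was reallocated...
    assert(p == &m[1]);                        // ...but the element's node never moved
}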
std::unordered_map maintains a load factor that it uses to manage the size of its internal bucket array. std::unordered_map uses this factor to keep the ratio of elements to buckets somewhere between 0.0 and 1.0, which decreases the likelihood of a collision in a bucket. After that, I'm not sure if implementations fall back to linear probing within a bucket where a collision was found, but I would assume so.
Allocate a memory block in the heap.
True - there's a block of memory for an array of "buckets", which in the case of GCC are actually iterators capable of recording a place in a forward-linked list.
With every put request, hash the object and map it to a space in this memory
No... when you insert/emplace further items into the list, an additional dynamic (i.e. heap) allocation is done with space for the node's next link and the value being inserted/emplaced. The linked list is rewired accordingly, so the newly inserted element is linked to and/or from the other elements that hashed to the same bucket, and if other buckets also have elements, that group will be linked to and/or from the nodes for those elements.
At some point, the hash table content might look like this (GCC does things this way, but it's possible to do something simpler):
+-------> head
/ |
bucket# / #503
[0]----\/ |
[1] /\ /===> #1003
[2]===/==\====/ |
[3]--/ \ /==> #22
[4] \ / |
[5] \ / #7
[6] \ |
[7]=========/ \-----> #177
[8] |
[9] #100
The buckets on the left are the array from the original allocation: there are 10 elements in the illustrated array, so "bucket_count()" == 10.
A key with hash value X - denoted #x e.g. #177 - hashes to bucket X % bucket_count(); that bucket will need to store an iterator to the singly-linked list element immediately before the first element hashing to that bucket, so it can remove the last element from the bucket and rewire either head, or another bucket's next pointer, to skip over the erased element.
While elements in a bucket need to be contiguous in the forward-linked list, the ordering of buckets within that list is an unimportant consequence of the order of insertion of elements in the container, and isn't stipulated in the Standard.
During this process, handle collisions via chaining or open addressing.
The Standard library containers that are backed by hash tables always use separate chaining.
I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates?
No, the C++ Standard doesn't dictate what the initial memory allocation should be; it's up to the C++ implementation to choose. You can see how many buckets a newly created table has by printing out .bucket_count(), and in all likelihood if you multiply that by your pointer size you'll get the size of the heap allocation that the unordered container made: myUnorderedContainer.bucket_count() * sizeof(int*). That said, there's no prohibition on your Standard Library implementation varying the initial bucket_count() in arbitrary and bizarre ways (e.g. with optimisation level, depending on the Key type), but I can't imagine why any would.
What happens if, say, we allocated memory for 50 ints and ended up inserting 5000 integers? That would be a lot of collisions, so I believe there should be some kind of re-hashing and re-sizing algorithm to decrease the number of collisions once a certain collision threshold is reached.
Rehashing/resizing isn't triggered by a certain number of collisions, but a certain proneness for collisions, as measured by the load factor, which is .size() / .bucket_count().
When an insertion would push the .load_factor() above the .max_load_factor(), which you can change but is required by the C++ Standard to default to 1.0, then the hash table is resized. That effectively means it allocates more buckets - normally somewhere close to but not necessarily exactly twice as many - then it points the new buckets at the linked list nodes, then finally deletes the heap allocation with the old buckets.
Since they are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?
There is no C++ Standard requirement about how the resizing is implemented. That said, if I were implementing resize() I'd consider creating a function-local container with the newly desired bucket_count, then iterating over the elements in the *this object, calling extract() to detach them and merge() to add them to the function-local container, then eventually invoking swap on *this and the function-local container.
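In terms of the public (C++17) API, that idea looks roughly like this sketch (resize_sketch is a made-up name, and this is not what any library does internally; merge() splices the nodes across without copying the elements):
#include <cstddef>
#include <unordered_map>
template <class K, class V, class H, class E, class A>
void resize_sketch(std::unordered_map<K, V, H, E, A>& m, std::size_t new_buckets)
{
    // Function-local container with the desired bucket count, sharing m's hash/equality/allocator.
    std::unordered_map<K, V, H, E, A> tmp(new_buckets, m.hash_function(), m.key_eq(), m.get_allocator());
    tmp.merge(m);    // detach every node from m and re-link it into tmp (no element copies)
    m.swap(tmp);     // m now has the new bucket count; the old, now empty, table goes away with tmp
}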
While looking for a container suitable for an application I'm building, I ran across documentation for unordered_set. Given that my application typically requires only insert and find functions, this class seems rather attractive. I'm slightly put off, however, by the fact that find is O(1) amortized, but O(n) worst case - I would be using the function frequently, and it could make or break my application. What causes the spike in complexity? Is the likelihood of running into an O(n) search predictable?
unordered_set is implemented as a hash table; that said, a common implementation uses a container (e.g. vector-like) of hash buckets, where each bucket is itself a container (e.g. list-like) of the elements of the unordered_set that fall into that bucket.
When inserting an element into the unordered_set, a hash function is applied to it, which determines the bucket where it is placed.
Several inserted elements can end up in the same bucket. When you look up an element, the hash function is applied, giving you the bucket, and you then search through that bucket's elements for the one you are looking for.
The worst-case scenario is that all elements end up in the same bucket; with a list-like bucket container, the worst-case running time of a search is then O(n).
The key factors in elements ending up in the same bucket are the hash function (how good it is) and the elements themselves (they could expose a specific weakness of the hash function).
You normally cannot predict the elements, but if they are predictable enough in your case, you could select a hash function that spreads that kind of element evenly.
To speed up searching, the key point is using a good hash function (one that distributes the elements evenly across the buckets) and, if needed, rehashing to increase the bucket count (take care with this option: the hash function will be applied to all elements).
If the storage of these elements is that important for your application, I suggest you run performance tests with data as close as possible to production data and decide from there. That said, the standard containers, especially those in the same group (e.g. the associative containers), share almost the same interface, so it is easy to swap one for another with little or no change in the code that uses it.
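One way to run such a test (a small sketch; the types and data here are placeholders) is to inspect the bucket distribution directly, since a few very large buckets are exactly what produces the O(n) worst case:
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_set>
int main() {
    std::unordered_set<std::string> s{"alpha", "beta", "gamma", "delta", "epsilon"};   // substitute your real data here
    std::size_t worst = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b)
        worst = std::max(worst, s.bucket_size(b));
    std::cout << "buckets: " << s.bucket_count()
              << ", load factor: " << s.load_factor()
              << ", largest bucket: " << worst << '\n';
}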