As far as I understand, the number of buckets in an unordered_map increases automatically as the unordered_map fills.
If I copy an unordered_map (to another unordered_map), it is guaranteed to contain exactly the same pairs. But will they be in the same buckets? Will the number of buckets be the same?
I don't know the bucket-creation mechanism, and I couldn't find a short explanation of how the standard requires it to be implemented. But if the bucket count may depend on the sequence of insertions, allocations, etc., then after copying we may get a different number of buckets, or a different distribution of items across them (even if the items themselves are the same). Is that true? Both for Boost's implementation and for GCC's standard-library implementation?
The max load factor, but not the "current" load factor, is specified as being copied when an unordered_map is copied.
Both the entries for copy construction and copy assignment include the following:
In addition to the requirements of Table 64, copies the hash function,
predicate, and maximum load factor.
[unord.req]
In general this means there may be a different count of buckets, and thus a different distribution of elements into buckets in a copy.
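You can observe this directly (a minimal sketch; the printed values are implementation-defined, and on a given library the two bucket counts may happen to coincide even though nothing requires it):

    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> original;
        for (int i = 0; i < 1000; ++i)
            original.emplace(i, i);  // bucket count grows as elements are added

        std::unordered_map<int, int> copy = original;  // copies hash, pred, max_load_factor

        // Same elements, but only max_load_factor() is required to match;
        // the bucket counts (and hence the distribution) may differ.
        std::cout << original.bucket_count() << " vs " << copy.bucket_count() << '\n';
        std::cout << original.max_load_factor() << " vs " << copy.max_load_factor() << '\n';
    }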
Related
If a std::unordered_map<int,...> was to stay roughly the same size but continually add and remove items, would it continually allocate and free memory or cache and reuse the memory (ie. like a pool or vector)? Assuming a modern standard MS implementation of the libraries.
The standard is not specific about these aspects, so they are implementation defined. Most notably, a caching behaviour like you describe is normally achieved by using a custom allocator (e.g. for a memory pool allocator) so it should normally be decoupled from the container implementation.
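For illustration (a sketch, not a claim about Microsoft's internals), C++17's polymorphic allocators are one way to layer a memory pool under the container without relying on the container's own behaviour:

    #include <memory_resource>
    #include <unordered_map>

    int main() {
        // The pool resource recycles freed node-sized chunks, so repeated
        // insert/erase cycles can reuse memory rather than hitting the
        // global heap each time; whether the container itself caches
        // nodes remains implementation-defined.
        std::pmr::unsynchronized_pool_resource pool;
        std::pmr::unordered_map<int, int> m{&pool};

        for (int round = 0; round < 100; ++round) {
            for (int i = 0; i < 1000; ++i)
                m.emplace(i, i);
            m.clear();  // freed nodes return to the pool
        }
    }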
The relevant bits of the standard, ~p874 about unordered containers:
The elements of an unordered associative container are organized into
buckets. Keys with the same hash code appear in the same bucket. The
number of buckets is automatically increased as elements are added to
an unordered associative container, so that the average number of
elements per bucket is kept below a bound.
and insert:
The insert and emplace members shall not affect the validity of
iterators if (N+n) <= z * B, where N is the number of elements in the
container prior to the insert operation, n is the number of elements
inserted, B is the container’s bucket count, and z is the container’s
maximum load factor
You could read between the lines and assume that, since iterator validity is not affected, probably no memory allocations will take place. Though this is by no means guaranteed (e.g. if the bucket data structure is a linked list, you can append to it without invalidating iterators). The standard doesn't seem to specify what should happen when an element is removed, but since removal cannot violate the constraint above, I don't see a reason to deallocate memory.
The easiest way to find out for sure for your specific implementation is to read the source or profile your code.
Alternatively, you can try to take control of this behaviour (if you really need to) with the rehash and reserve member functions and by tuning the map's max_load_factor.
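A minimal sketch of those knobs (the exact bucket counts chosen are implementation-defined):

    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;

        // Ask for capacity up front: after this, inserting up to 10000
        // elements will not invalidate iterators, and typically no further
        // bucket-array reallocation occurs.
        m.reserve(10000);

        // Or control the trade-off directly: rehash sooner (more buckets,
        // fewer collisions) by lowering the maximum load factor.
        m.max_load_factor(0.5f);
        m.rehash(32768);  // request at least this many buckets

        for (int i = 0; i < 10000; ++i)
            m.emplace(i, i);
    }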
I'm a little confused about how unordered_map works, what buckets are, and how they are managed.
From this blog post, an unordered_map is a vector of vectors.
My questions are:
is it correct to assume that the buckets are the "internal" vectors?
since each bucket (vector) can contain multiple elements, due to a hash collision on the hash table (the "external" vector), and since we have to scan this internal vector (in linear time), is it correct to assume that we have to define the equal method on the key type (in addition to the hash operator) in order to find the key inside the bucket?
what is the external vector (hash table) size by default?
what is the internal vector size by default?
what happens if the number of elements in one bucket becomes too big? Or, in other words, when does the rehash happen?
Sorry for these questions, but I didn't find any detailed explanation how this structure works (on cppreference.com for example).
std::unordered_map is the standard C++ hash table. It used to be called hash_map in STL, but missed the boat when many of STL's interfaces were merged into C++ in 1998, and by 2011 so many libraries had their own hash_map that C++ had to pick another name (I think "unordered" was a great choice; assuming order in a hash table is a common source of bugs).
is it correct to assume that the buckets are the "internal" vectors?
No, it is both incorrect (incompatible with the iterator invalidation requirements) and dangerous (under that assumption you may end up subtracting pointers to elements in the same bucket).
In real life, the buckets are linked lists; e.g.
LLVM libc++ unordered_map is a unique_ptr to an array of linked lists of __hash_nodes
GNU libstdc++ unordered_map is a pointer to an array of linked lists of _Hash_nodes
is it correct to assume that we have to define the equal method on the key type (in addition to the hash operator) in order to find the key inside the bucket?
Yes, locating the key in a bucket is exactly what the 4th template parameter of std::unordered_map is for (it does not need to call the "equal method on the key type" literally, of course).
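For example (Point, PointHash and PointEqual are hypothetical names, not anything from the standard library), the 3rd and 4th template parameters supply the hash and the within-bucket equality test:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_map>

    struct Point { int x, y; };

    struct PointHash {
        std::size_t operator()(const Point& p) const {
            // combine the member hashes; this picks the bucket
            return std::hash<int>{}(p.x) * 31u ^ std::hash<int>{}(p.y);
        }
    };

    struct PointEqual {
        bool operator()(const Point& a, const Point& b) const {
            // used to find the key inside a bucket after a hash collision
            return a.x == b.x && a.y == b.y;
        }
    };

    int main() {
        std::unordered_map<Point, std::string, PointHash, PointEqual> names;
        names[{0, 0}] = "origin";
    }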
what is the external vector (hash table) size by default?
There is no "external vector". The number of buckets for a default-constructed std::unordered_map is implementation-defined, you can query it with bucket_count.
what is the internal vector size by default?
There is no "internal vector". The size of any given bucket equals the number of elements currently placed in the bucket. You can query it with bucket_size
what happens if the number of elements in one bucket becomes too big? Or, in other words, when does the rehash happen?
Nothing happens if the number of elements in one bucket becomes too big. But if the average number of elements per bucket (the load_factor) exceeds the max_load_factor, a rehash happens (e.g. on insert).
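All of these quantities can be queried (a minimal sketch; the printed values are implementation-defined):

    #include <cstddef>
    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;
        for (int i = 0; i < 100; ++i)
            m.emplace(i, i);

        std::cout << "bucket_count:    " << m.bucket_count() << '\n';
        std::cout << "load_factor:     " << m.load_factor() << '\n';  // size() / bucket_count()
        std::cout << "max_load_factor: " << m.max_load_factor() << '\n';

        std::size_t b = m.bucket(42);  // which bucket key 42 lives in
        std::cout << "bucket_size(" << b << "): " << m.bucket_size(b) << '\n';
    }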
This may help you understand the buckets:
http://www.cplusplus.com/reference/unordered_map/unordered_map/bucket_count/
http://www.cplusplus.com/reference/unordered_map/unordered_map/max_load_factor/
But generally, yes, the buckets are something like internal vectors. The container needs an equality operator (or a predicate) to distinguish keys that have the same hash, as you suggest.
The initial number of buckets is possibly 0. It can be set via rehash() or reserve() (they have slightly different semantics).
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
In the ideal case, each bucket will hold only one item. You can check this using bucket_size. When the load factor (total items vs. bucket count) gets high, it rehashes automatically.
By default, it'll aim for a 1:1 load factor. If the hash function is good, this may last until max_bucket_count items are inserted.
Keep in mind that the specific implementation of this may vary. Each implementation (e.g. from different platforms or standard libraries) really only needs to have the correct semantics.
If these answers are important to your program, you can query the values as I've described. If you're just trying to wrap your head around it, query them in some test scenarios and it may become more clear.
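For example, a test scenario along these lines (the concrete numbers are implementation-defined) makes the automatic rehashing visible:

    #include <cstddef>
    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;
        std::size_t buckets = m.bucket_count();
        std::cout << "initial buckets: " << buckets << '\n';

        for (int i = 0; i < 10000; ++i) {
            m.emplace(i, i);
            if (m.bucket_count() != buckets) {  // this insert triggered a rehash
                buckets = m.bucket_count();
                std::cout << "size " << m.size() << " -> " << buckets << " buckets\n";
            }
        }
    }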
I have not read the C++ standard, but this is how I imagine the unordered_map of C++ is supposed to work:
Allocate a memory block in the heap.
With every put request, hash the object and map it to a space in this memory
During this process, handle collisions via chaining or open addressing.
I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates? What happens if, let's say, we allocate memory for 50 ints but end up inserting 5000 integers?
That will be a lot of collisions, so I believe there should be some kind of rehashing and resizing algorithm to decrease the number of collisions after a certain collision threshold is reached. Since rehashing and resizing are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?
With every put request, hash the object and map it to a space in this memory
Unfortunately, this isn't exactly true. You are describing an open addressing (aka closed hashing) data structure, which is not how unordered_map is specified.
Every unordered_map implementation stores a linked list of externally allocated nodes, reached through the array of buckets. That means inserting an item always allocates at least once (the new node), if not twice (resizing the array of buckets, then the new node).
No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that references to elements must stay valid when inserting or deleting other elements (and iterators too, so long as no rehash occurs). Because inserting might cause the bucket array to grow (reallocate), it is not generally possible to store elements directly in the bucket array and still meet that stability guarantee.
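A small sketch of that guarantee in action (pointers and references to elements survive rehashing; iterators are only guaranteed to survive while no rehash occurs):

    #include <cassert>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m;
        m.emplace(-1, 123);
        int* p = &m[-1];  // pointer to an element's mapped value

        for (int i = 0; i < 100000; ++i)
            m.emplace(i, i);  // forces several rehashes along the way

        // Still valid: the node itself was never moved or copied;
        // only the bucket array was reallocated.
        assert(p == &m[-1] && *p == 123);
    }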
unordered_map is a better data structure if you are storing expensive-to-copy items as your key or value, which makes sense given that its general design was lifted from Boost's pre-move-semantics design.
Chandler Carruth (Google) mentions this problem in his CppCon '14 talk "Efficiency with Algorithms, Performance with Data Structures".
std::unordered_map tracks a load factor that it uses to manage the number of its internal buckets, keeping the ratio of elements to buckets at or below max_load_factor (1.0 by default). This decreases the likelihood of a collision in a bucket. After that, collisions within a bucket are resolved not by linear probing but by walking the bucket's chain of nodes.
Allocate a memory block in the heap.
True - there's a block of memory for an array of "buckets", which in the case of GCC are actually iterators capable of recording a place in a forward-linked list.
With every put request, hash the object and map it to a space in this memory
No... when you insert/emplace further items into the list, an additional dynamic (i.e. heap) allocation is done with space for the node's next link and the value being inserted/emplaced. The linked list is rewired accordingly, so the newly inserted element is linked to and/or from the other elements that hashed to the same bucket, and if other buckets also have elements, that group will be linked to and/or from the nodes for those elements.
At some point, the hash table content might look like this (GCC does things this way, but it's possible to do something simpler):
+-------> head
/ |
bucket# / #503
[0]----\/ |
[1] /\ /===> #1003
[2]===/==\====/ |
[3]--/ \ /==> #22
[4] \ / |
[5] \ / #7
[6] \ |
[7]=========/ \-----> #177
[8] |
[9] #100
The buckets on the left are the array from the original allocation: there are 10 elements in the illustrated array, so "bucket_count()" == 10.
A key with hash value X - denoted #X, e.g. #177 - hashes to bucket X % bucket_count(); that bucket will need to store an iterator to the singly-linked-list element immediately before the first element hashing to that bucket, so it can remove the last element from the bucket and rewire either head, or another bucket's next pointer, to skip over the erased element.
While elements in a bucket need to be contiguous in the forward-linked list, the ordering of buckets within that list is an unimportant consequence of the order of insertion of elements in the container, and isn't stipulated in the Standard.
During this process, handle collisions via chaining or open addressing.
The Standard library containers that are backed by hash tables always use separate chaining.
I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates?
No, the C++ Standard doesn't dictate what the initial memory allocation should be; it's up to the C++ implementation to choose. You can see how many buckets a newly created table has by printing out .bucket_count(), and in all likelihood if you multiply that by your pointer size you'll get the size of the heap allocation that the unordered container made: myUnorderedContainer.bucket_count() * sizeof(int*). That said, there's no prohibition on your Standard Library implementation varying the initial bucket_count() in arbitrary and bizarre ways (e.g. with optimisation level, or depending on the Key type), but I can't imagine why any would.
What happens if, let's say, we allocate memory for 50 ints but end up inserting 5000 integers? That will be a lot of collisions, so I believe there should be some kind of rehashing and resizing algorithm to decrease the number of collisions after a certain collision threshold is reached.
Rehashing/resizing isn't triggered by a certain number of collisions, but a certain proneness for collisions, as measured by the load factor, which is .size() / .bucket_count().
When an insertion would push the .load_factor() above the .max_load_factor() (which you can change, but which the C++ Standard requires to default to 1.0), the hash table is resized. That effectively means it allocates more buckets - normally somewhere close to, but not necessarily exactly, twice as many - then points the new buckets at the linked-list nodes, and finally deletes the heap allocation holding the old buckets.
Since rehashing and resizing are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?
There is no C++ Standard requirement about how the resizing is implemented. That said, if I were implementing rehash() I'd consider creating a function-local container whilst specifying the newly desired bucket_count, then iterating over the elements in the *this object, calling extract() to detach them and merge() to add them to the function-local container object, and eventually invoking swap on *this and the function-local container.
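A hypothetical sketch of that strategy (C++17; nothing says any library implements rehash() this way, and merge() does the node-splicing that a manual extract() loop would do):

    #include <cstddef>
    #include <unordered_map>

    // Illustrative only: move every node into a fresh container built with
    // the desired bucket count, then swap it into place. merge() splices
    // the nodes themselves, so no elements are copied.
    template <class K, class V>
    void rehash_into_fresh_table(std::unordered_map<K, V>& m, std::size_t buckets) {
        std::unordered_map<K, V> fresh(buckets);  // bucket-count constructor
        fresh.merge(m);                           // splices all nodes out of m
        m.swap(fresh);
    }

    int main() {
        std::unordered_map<int, int> m;
        for (int i = 0; i < 100; ++i)
            m.emplace(i, i);
        rehash_into_fresh_table(m, 1024);
    }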
C++ unordered_map collision handling, resize and rehash
This is a previous question opened by me, and I have seen that I have a lot of confusion about how unordered_map is implemented. I am sure many other people share that confusion. Based on the information I have without reading the standard:
Every unordered_map implementation stores a linked list of externally
allocated nodes, reached through the array of buckets... No, that is
not at all the most efficient way to implement a hash map for most
common uses. Unfortunately, a small "oversight" in the specification
of unordered_map all but requires this behavior. The required behavior
is that references to elements must stay valid when inserting or
deleting other elements
I was hoping that someone might explain the implementation, how it fits with the C++ standard definition (in terms of performance requirements), and, if it really is not the most efficient way to implement a hash map data structure, how it could be improved.
The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:
the default max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count()), and
the guarantee that the table will not be rehashed unless grown beyond that load factor.
That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() approaches 1.
References:
23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.
amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.
† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.
Regarding the text you quote:
No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that references to elements must stay valid when inserting or deleting other elements
There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.
As evidence of the awareness, from Matthew Austern's proposal here:
I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:
• It's necessary to distinguish between a vacant position and an occupied one.
• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.
• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.
• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.
• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.
Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.
Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.
A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.
While looking for a container suitable for an application I'm building, I ran across documentation for unordered_set. Given that my application typically requires only insert and find functions, this class seems rather attractive. I'm slightly put off, however, by the fact that find is O(1) amortized, but O(n) worst case - I would be using the function frequently, and it could make or break my application. What causes the spike in complexity? Is the likelihood of running into an O(n) search predictable?
unordered_set is implemented as a hash table. A common implementation uses a container (e.g. a vector) of hash buckets, where each bucket is a container (e.g. a list) of the unordered_set's elements that landed in that bucket.
When you insert an element into the unordered_set, the hash function is applied to it, which determines the bucket where it is placed.
Several inserted elements can end up in the same bucket. When you look up an element, the hash function is applied again, giving you the bucket, and you then search through that bucket's elements for the one you are looking for.
The worst-case scenario is that all elements end up in the same bucket; with the list-like containers typically used to store a bucket's elements, searching is then O(n), the worst-case running time, as the sketch below shows.
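To make that worst case concrete (AwfulHash is a deliberately pathological, hypothetical hash function):

    #include <cstddef>
    #include <iostream>
    #include <unordered_set>

    struct AwfulHash {
        std::size_t operator()(int) const { return 0; }  // every key collides
    };

    int main() {
        std::unordered_set<int, AwfulHash> s;
        for (int i = 0; i < 1000; ++i)
            s.insert(i);

        // All 1000 elements share one bucket, so find() degenerates into a
        // linear scan of that bucket: the O(n) worst case.
        std::cout << s.bucket_size(s.bucket(0)) << " elements in one bucket\n";
    }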
The key factors for elements ending up in the same bucket are the hash function (how good it is) and the elements themselves (which can expose a specific weakness of the hash function).
The elements usually cannot be predicted; if they are predictable enough in your case, you can select a hash function that spreads that kind of element evenly.
To speed up search, the key point is using a good hash function that distributes elements evenly across buckets, and, if needed, rehashing to increase the bucket count (take care with this option: the hash function will be applied to all elements).
If storing these elements is that important for your application, I suggest you run performance tests with data as close as possible to production data and decide from there. The containers in the STL, and especially containers of the same group (e.g. associative), share almost the same interface, making it easy to swap one for another with little or no change to the code that uses it.