I have not read the C++ standard, but this is how I think C++'s unordered_map is supposed to work:
Allocate a memory block in the heap.
With every put request, hash the object and map it to a space in this memory
During this process, handle collisions via chaining or open addressing.
I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates? What happens if, let's say, we allocated memory for 50 ints and ended up inserting 5000 integers?
That would cause a lot of collisions, so I believe there should be some kind of re-hashing and re-sizing algorithm to decrease the number of collisions once a certain collision threshold is reached. Since these are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?
With every put request, hash the object and map it to a space in this memory
Unfortunately, this isn't exactly true. You are describing an open addressing (aka closed hashing) data structure, which is not how unordered_map is specified.
Every unordered_map implementation stores a linked list to external nodes in the array of buckets. Meaning that inserting an item will always allocate at least once (the new node) if not twice (resizing the array of buckets, then the new node).
No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements. Because inserting might cause the bucket array to grow (reallocate), it is not generally possible to have an iterator pointing directly into the bucket array and meet the stability guarantees.
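A quick way to see this node-based behaviour in action (a small test I'd expect to pass on any conforming implementation): pointers and references to an element survive even a rehash, because the nodes themselves never move - only the bucket array is reallocated.

#include <cassert>
#include <cstddef>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    m[1] = 100;

    int* p = &m[1];                        // pointer to the mapped value
    std::size_t buckets = m.bucket_count();

    for (int i = 2; i < 10000; ++i)        // force one or more rehashes
        m[i] = i;

    assert(m.bucket_count() > buckets);    // the bucket array was reallocated...
    assert(p == &m[1] && *p == 100);       // ...but the node holding our value never moved
}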
unordered_map is a better data structure if you are storing expensive-to-copy items as your key or value. Which makes sense, given that its general design was lifted from Boost's pre-move-semantics design.
Chandler Carruth (Google) mentions this problem in his CppCon '14 talk "Efficiency with Algorithms, Performance with Data Structures".
std::unordered_map maintains a load factor that it uses to manage the size of its internal bucket array: it grows the array to keep the load factor somewhere between 0.0 and 1.0, which decreases the likelihood of collisions in a bucket. Beyond that, I'm not sure whether implementations fall back to linear probing within a bucket where a collision was found, but I would assume so.
Allocate a memory block in the heap.
True - there's a block of memory for an array of "buckets", which in the case of GCC are actually iterators capable of recording a place in a forward-linked list.
With every put request, hash the object and map it to a space in this memory
No... when you insert/emplace further items into the list, an additional dynamic (i.e. heap) allocation is done with space for the node's next link and the value being inserted/emplaced. The linked list is rewired accordingly, so the newly inserted element is linked to and/or from the other elements that hashed to the same bucket, and if other buckets also have elements, that group will be linked to and/or from the nodes for those elements.
At some point, the hash table content might look like this (GCC does things this way, but it's possible to do something simpler):
head -> #503 -> #1003 -> #22 -> #7 -> #177 -> #100 -> (end of list)

bucket#   iterator stored in bucket
[0] --->  #177     (bucket's element: #100)
[1]                (empty)
[2] --->  #1003    (bucket's element: #22)
[3] --->  head     (bucket's elements: #503, #1003)
[4]                (empty)
[5]                (empty)
[6]                (empty)
[7] --->  #22      (bucket's elements: #7, #177)
[8]                (empty)
[9]                (empty)
The buckets on the left are the array from the original allocation: there are 10 elements in the illustrated array, so "bucket_count()" == 10.
A key with hash value X - denoted #X, e.g. #177 - hashes to bucket X % bucket_count(); that bucket will need to store an iterator to the singly-linked list element immediately before the first element hashing to that bucket, so it can remove the last element from the bucket and rewire either head, or another bucket's next pointer, to skip over the erased element.
While elements in a bucket need to be contiguous in the forward-linked list, the ordering of buckets within that list is an unimportant consequence of the order of insertion of elements in the container, and isn't stipulated in the Standard.
During this process, handle collisions via chaining or open addressing.
The Standard library containers that are backed by hash tables always use separate chaining.
I am quite surprised that I could not find much about how memory is handled by unordered_map. Is there a specific initial amount of memory that unordered_map allocates?
No, the C++ Standard doesn't dictate what the initial memory allocation should be; it's up to the C++ implementation to choose. You can see how many buckets a newly created table has by printing out .bucket_count(), and in all likelihood if you multiply that by your pointer size you'll get the size of the heap allocation that the unordered container made: myUnorderedContainer.bucket_count() * sizeof(int*). That said, there's no prohibition on your Standard Library implementation varying the initial bucket_count() in arbitrary and bizarre ways (e.g. with optimisation level, or depending on the Key type), but I can't imagine why any would.
What happens if, let's say, we allocated memory for 50 ints and ended up inserting 5000 integers? That would cause a lot of collisions, so I believe there should be some kind of re-hashing and re-sizing algorithm to decrease the number of collisions once a certain collision threshold is reached.
Rehashing/resizing isn't triggered by a certain number of collisions, but by a certain proneness to collisions, as measured by the load factor, which is .size() / .bucket_count().
When an insertion would push the .load_factor() above the .max_load_factor(), which you can change but is required by the C++ Standard to default to 1.0, then the hash table is resized. That effectively means it allocates more buckets - normally somewhere close to but not necessarily exactly twice as many - then it points the new buckets at the linked list nodes, then finally deletes the heap allocation with the old buckets.
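You can watch this happen with a few lines of code - the exact growth pattern below is implementation-specific, but on common implementations the bucket count roughly doubles each time the load factor would exceed 1.0:

#include <cstddef>
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    std::cout << "default max_load_factor: " << m.max_load_factor() << '\n';   // 1.0

    std::size_t last = m.bucket_count();
    for (int i = 0; i < 1000; ++i)
    {
        m[i] = i;
        if (m.bucket_count() != last)      // a resize/rehash just happened
        {
            std::cout << "size " << m.size() << ": " << last
                      << " -> " << m.bucket_count() << " buckets\n";
            last = m.bucket_count();
        }
    }
}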
Since these are explicitly provided as member functions of the class, I assume they are used internally as well. Is there such a mechanism?
There is no C++ Standard requirement about how the resizing is implemented. That said, if I were implementing rehash() I'd consider creating a function-local container with the newly desired bucket count, then iterating over the elements in the *this object, calling extract() to detach them and merge() to add them to the function-local container, then finally invoking swap on *this and the function-local container.
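For illustration only, here's roughly what that looks like written against the public C++17 node-handle API (merge() does the extract-and-splice per node for us); real implementations do this internally on the raw node pointers, not via their own public interface:

#include <cstddef>
#include <unordered_map>

template <typename K, typename V>
void rehash_sketch(std::unordered_map<K, V>& m, std::size_t new_bucket_count)
{
    std::unordered_map<K, V> local(new_bucket_count);  // function-local container
    local.merge(m);   // detach each node from m and splice it into local (no copies)
    m.swap(local);    // m now owns the rehashed table; the old buckets die with local
}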
Related
If a std::unordered_map<int,...> were to stay roughly the same size but continually add and remove items, would it continually allocate and free memory, or cache and reuse the memory (i.e. like a pool or vector)? Assume a modern, standard MS implementation of the libraries.
The standard is not specific about these aspects, so they are implementation defined. Most notably, a caching behaviour like you describe is normally achieved by using a custom allocator (e.g. for a memory pool allocator) so it should normally be decoupled from the container implementation.
The relevant bits of the standard, ~p874 about unordered containers:
The elements of an unordered associative container are organized into
buckets. Keys with the same hash code appear in the same bucket. The
number of buckets is automatically increased as elements are added to
an unordered associative container, so that the average number of
elements per bucket is kept below a bound.
and insert:
The insert and emplace members shall not affect the validity of
iterators if (N+n) <= z * B, where N is the number of elements in the
container prior to the insert operation, n is the number of elements
inserted, B is the container’s bucket count, and z is the container’s
maximum load factor
You could read between the lines and assume that since iterator validity is not affected, probably no memory allocations will take place. Though this is by no means guaranteed (e.g. if the bucket data structure is a linked list, you can append to it without invalidating iterators). The standard doesn't seem to specify what should happen when an element is removed, but since it couldn't invalidate the constraint above I don't see a reason to deallocate memory.
The easiest way to find out for sure for your specific implementation is to read the source or profile your code.
Alternatively you can try to take control of this behaviour (if you really need to) by using the rehash and reserve methods and tuning the map's max_load_factor.
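For example, a sketch of pre-sizing the table so that a known number of insertions never triggers a rehash (the sizes here are arbitrary):

#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    m.max_load_factor(0.5f);   // trade memory for shorter chains (default is 1.0)
    m.reserve(5000);           // allocate enough buckets for 5000 elements up front

    for (int i = 0; i < 5000; ++i)
        m[i] = i;              // no rehash - and no iterator invalidation - here
}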
I'm a little bit confused about how unordered_map works, what buckets are, and how they are managed.
From this blog post, unordered_map is a vector of vectors.
My questions are:
is it correct to assume that the buckets are the "internal" vectors?
since each bucket (vector) can contain multiple elements, caused by a hash collision on the hash table (the "external" vector), and since we have to scan this internal vector (in linear time), is it correct to assume that we have to define the equal method on the key type (in addition to the hash operator) in order to find the key inside the bucket?
what is the external vector (hash table) size by default?
what is the internal vector size by default?
what happens if the number of elements in one bucket becomes too big? Or, in other words, when does the rehash happen?
Sorry for these questions, but I didn't find any detailed explanation how this structure works (on cppreference.com for example).
std::unordered_map is the standard C++ hash table. It used to be called hash_map in the STL, but it missed the boat when many of the STL's interfaces were merged into C++ in 1998, and by 2011 so many libraries had their own hash_map that C++ had to pick another name (I think "unordered" was a great choice; assuming order in a hash table is a common source of bugs).
is it correct to assume that the buckets are the "internal" vectors?
No, it is both incorrect (incompatible with the iterator invalidation requirements) and dangerous (under that assumption you may end up subtracting pointers to elements in the same bucket).
In real life, the buckets are linked lists; e.g.
LLVM libc++ unordered_map is a unique_ptr to an array of linked lists of __hash_nodes
GNU libstdc++ unordered_map is a pointer to an array of linked lists of _Hash_nodes
is it correct to assume that we have to define the equal method on the key type (in addition to the hash operator) in order to find the key inside the bucket?
Yes, locating the key in a bucket is exactly what the 4th template parameter of std::unordered_map is for (does not need to call the "equal method on the key type" literally, of course)
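To illustrate, here's a hypothetical Point key with a hand-rolled hash functor and the equality predicate supplied as that 4th template parameter - all the names below are made up for the example:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct Point { int x, y; };

struct PointHash {                 // 3rd template parameter: selects the bucket
    std::size_t operator()(const Point& p) const noexcept {
        return std::hash<int>{}(p.x) * 31u + std::hash<int>{}(p.y);
    }
};

struct PointEqual {                // 4th template parameter: finds the key within the bucket
    bool operator()(const Point& a, const Point& b) const noexcept {
        return a.x == b.x && a.y == b.y;
    }
};

int main() {
    std::unordered_map<Point, std::string, PointHash, PointEqual> m;
    m[Point{1, 2}] = "somewhere";
}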
what is the external vector (hash table) size by default?
There is no "external vector". The number of buckets for a default-constructed std::unordered_map is implementation-defined, you can query it with bucket_count.
what is the internal vector size by default?
There is no "internal vector". The size of any given bucket equals the number of elements currently placed in the bucket. You can query it with bucket_size
what happens if the number of elements in one bucket becomes too big? Or, in other words, when does the rehash happen?
Nothing happens if the number of elements in one bucket becomes too big. But if the average number of elements per bucket (aka the load_factor) exceeds the max_load_factor, a rehash happens (e.g. on insert).
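If you want to poke at the buckets yourself, the container exposes them directly - bucket() tells you where a key lands and bucket_size() how crowded that bucket is:

#include <cstddef>
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    for (int i = 0; i < 100; ++i)
        m[i] = i;

    std::cout << m.bucket_count() << " buckets, load factor "
              << m.load_factor() << '\n';

    std::size_t b = m.bucket(42);          // which bucket key 42 hashes to
    std::cout << "key 42 -> bucket " << b
              << " holding " << m.bucket_size(b) << " element(s)\n";
}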
This may help you understand the buckets:
http://www.cplusplus.com/reference/unordered_map/unordered_map/bucket_count/
http://www.cplusplus.com/reference/unordered_map/unordered_map/max_load_factor/
But generally, yes, the buckets are something like internal vectors. It needs an equality operator (or a predicate) to distinguish keys that had the same hash as you suggest.
The initial number of buckets is possibly 0. It can be set via rehash() or reserve() (they have slightly different semantics.)
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
In an ideal case, each bucket will only have the one item. You can check this by using bucket_size. When the load factor (total items vs. bucket count) gets high, it rehashes automatically.
By default, it'll aim for a 1:1 load factor. If the hash function is good, this may last until max_bucket_count items are inserted.
Keep in mind that the specific implementation of this may vary. Each implementation (e.g. from different platforms or standard libraries) really only needs to have the correct semantics.
If these answers are important to your program, you can query the values as I've described. If you're just trying to wrap your head around it, query them in some test scenarios and it may become more clear.
c++ unordered_map collision handling, resize and rehash
This is a previous question opened by me and I have seen that I am having a lot of confusion about how unordered_map is implemented. I am sure many other people share that confusion with me. Based on the information I now have, without reading the standard:
Every unordered_map implementation stores a linked list to external
nodes in the array of buckets... No, that is not at all the most
efficient way to implement a hash map for most common uses.
Unfortunately, a small "oversight" in the specification of
unordered_map all but requires this behavior. The required behavior is
that iterators to elements must stay valid when inserting or deleting
other elements
I was hoping that someone might explain the implementation and how it fits the C++ standard definition (in terms of performance requirements), and if it really is not the most efficient way to implement a hash map data structure, how can it be improved?
The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:
the default max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count()), and
the guarantee that the table will not be rehashed unless grown beyond that load factor.
That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() approaches 1.
References:
23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) <= z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container's bucket count, and z is the container's maximum load factor.
amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.
† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.
Regarding the text you quote:
No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements
There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.
As evidence of the awareness, from Matthew Austern's proposal here:
I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:
• It's necessary to distinguish between a vacant position and an occupied one.
• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.
• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.
• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.
• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.
Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.
Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.
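For a concrete picture of what such a table looks like, here is a deliberately minimal closed-hashing (linear probing) set for non-negative ints, using -1 as the unused-slot sentinel mentioned above - an insert-only sketch of the technique, not production code:

#include <cstddef>
#include <functional>
#include <vector>

class IntSet {
    std::vector<int> slots_;
    static constexpr int kEmpty = -1;   // sentinel: keys must be >= 0
public:
    explicit IntSet(std::size_t n) : slots_(n, kEmpty) {}

    bool insert(int key) {
        std::size_t i = std::hash<int>{}(key) % slots_.size();
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (slots_[i] == key)    return false;                   // already present
            if (slots_[i] == kEmpty) { slots_[i] = key; return true; }
            i = (i + 1) % slots_.size();                             // probe the next slot
        }
        return false;   // table full - a real table would grow and rehash here
    }

    bool contains(int key) const {
        std::size_t i = std::hash<int>{}(key) % slots_.size();
        for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
            if (slots_[i] == key)    return true;
            if (slots_[i] == kEmpty) return false;   // a gap means the key was never inserted
            i = (i + 1) % slots_.size();
        }
        return false;
    }
};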
A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.
So it's a thought experiment. I want to have a huge collection of structures such as:
struct Entry    // KeyType and ValueType stand in for the actual types
{
    KeyType   key;
    ValueType value;
};
And I need fast access by a key and fast insertion of new values.
I would not use std::map because it has too big a memory overhead per structure, and for huge amounts of data that might be drastic. Right?
So next I would consider using a sorted std::vector and binary_search. It's fine for searching, but adding new values to the vector would be too slow. Imagine you need to add a new value at the beginning of the sorted array: you'd have to shift data right a LOT!
What if I use a deque? As far as I know, it has O(1) push_back/push_front, but still O(n) insertion in the middle (as it would have to move data anyway, though less of it).
The questions are:
1) Is O(n) insertion into a deque much faster in a real situation than O(n) insertion into a vector?
2) What happens when you insert a value into a deque and the bucket it should go into is full?
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
Thanks!
I would not use std::map because it has too big a memory overhead per structure, and for huge amounts of data that might be drastic. Right?
That depends on the size of your structs... the bigger they are the less the overheads are as a proportion of the overall memory use. For example, a std::map implementation might average say 20 bytes of housekeeping data per element (I just made that up - measure on your own system), so if your struct size is in the hundreds of bytes - who cares...? But, if the struct holds 2 ints, it's a big proportion....
So next I would consider using a sorted std::vector and binary_search. It's fine for searching, but adding new values to the vector would be too slow. Imagine you need to add a new value at the beginning of the sorted array: you'd have to shift data right a LOT!
Totally unsuitable....
1) Is O(n) insertion into a deque much faster in a real situation than O(n) insertion into a vector?
As deque is likely implemented as a vector of fixed-sized arrays, insertion implies a shuffling of all elements towards the nearest end of the container. The shuffling's probably a tiny bit less cache efficient, but if inserting nearer the front of the container it would likely still end up faster.
2) What happens when you insert a value into a deque and the bucket it should go into is full?
As above, it'll need to shuffle, overflowing either:
the last element to become the first element of the next "bucket", moving all those elements along and overflowing into the next bucket, etc.
the first element to become the last element of the previous bucket, moving all those elements along and overflowing into the previous bucket, etc.
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
unordered_map, which is implemented as a hash map. If you have small objects (e.g. less than 20 or 30 bytes) or a firm cap on the number of elements, you can normally easily outperform unordered_map with custom code, but it's rarely worth the effort unless the table access dominates your application's performance, and that performance is critical.
3) Is there another preferable type of container in case you need to store lots of data and need two fast operations: search and insertion?
Consider using std::unordered_map, which is an implementation of a hash map. Insertion, lookup, and removal are all O(1) in the average case. This assumes that you will only ever look for an item based on its exact key; if your searches can have different constraints then you either need a different structure, or you need multiple maps to map the various keys you will search for to the corresponding object.
This requires that there is an available hash function for KeyType, either as part of the standard library or provided by you.
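For the "provided by you" case, the usual route is to specialise std::hash for your key; KeyType below is a hypothetical stand-in for whatever your key really is:

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct KeyType {
    int id;
    bool operator==(const KeyType& o) const { return id == o.id; }
};

namespace std {
    template <>
    struct hash<KeyType> {
        size_t operator()(const KeyType& k) const noexcept {
            return hash<int>{}(k.id);   // delegate to the built-in int hash
        }
    };
}

int main() {
    // With the specialisation in place, the default template arguments just work.
    std::unordered_map<KeyType, std::string> m;
    m[KeyType{42}] = "value";
}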
There's no container which provides the best of all worlds. As you say, you want the best lookup/insertion with the minimum amount of space needed for storing elements.
Below is a list of containers you could consider for your implementation:-
VECTOR :-
Strengths:-
1) Space is allocated only for holding data.
2) Good for random access.
3) Container of choice if insertions/deletions are not in the middle of the container.
Weakness:-
1) poor performance if insertions/deletions are in the middle.
2) reallocations happen if reserve is not used properly.
DEQUE:-
Choose deque over vector if insertions/deletions are at the beginning as well as the end of the container.
MAP:-
Disadvantage over vector:-
1) more space is allocated for holding pointers.
Advantages over vector:-
1) better insertions/deletions/lookup as compared to vector.
If std::unordered_map is used then these dictionary operations would be amortized O(1).
Firstly, in order to directly answer your questions:
1) Is O(n) insertion into a deque much faster in a real situation than O(n) insertion into a vector?
The number of elements that have to be moved is (on average) only half compared to a vector. However, it can actually perform worse, as the data is stored in non-contiguous memory, so copying/moving the same number of elements is much less efficient (it cannot e.g. be implemented in terms of a single memcpy operation).
2) What happens when you insert a value into a deque and the bucket it should go into is full?
At least for the GNU libstdc++ implementation, every bucket except the first and last one is always full. I believe that inserting in the middle means all elements are moved/copied one slot towards the closer end (front or back), and the effect ripples through the buckets until the first or last one is reached.
In summary, the only scenario where std::deque is consistently better than vector is if you use it as (surprise) a queue (only inserting and removing elements at the front or back), and that's what the implementation is optimized for. It is not optimized for insertions in the middle.
3) Is there another preferable type of container in case you need to
store lots of data and need two fast operations: search and insertion?
As already stated by others: A hash table like std::unordered_map is the data structure you are looking for.
From what I've heard, however, std::unordered_map is a slightly suboptimal implementation of one, as it uses buckets to resolve hash collisions and those buckets are implemented as linked lists (here is a very interesting talk from Chandler Carruth on the general topic of the performance of different data structures). For random access on big data structures, cache locality should matter a lot less, so this is probably not such a big issue in your case.
Finally, I'd like to mention that if your value and key types are small PODs, then depending on how big your huge collection is (are we talking about a few million or rather billions of elements) and how often you actually have to insert/remove elements, there might still be cases where a simple std::vector outperforms any other STL container. So as always: if your thought experiment ever becomes reality, try it out and measure.
I have the following scenario. The implementation is required for a real-time application.
1) I need to store at most 20 entries in a container (STL map, STL list, etc.).
2) If a new entry comes and 20 entries are already present, I have to overwrite the oldest entry with the new entry.
Considering point 2, I feel that if the container is full (max 20 entries), a list is the best bet, as I can always remove the first entry in the list and add the new one at the end (push_back). However, search won't be as efficient.
For only 20 entries, does it really make a big difference in terms of search efficiency if I use a list in place of a map?
Also, considering the cost of insertion in a map, I feel I should go for a list.
Could you please tell me which is the better bet for me?
1) I need to store at most 20 entries in a container (STL map, STL list, etc.). 2) If a new entry comes and 20 entries are already present, I have to overwrite the oldest entry with the new entry.
This seems to me the job for boost::circular_buffer.
In general the term circular buffer refers to an area in memory which is used to store incoming data. When the buffer is filled, new data is written starting at the beginning of the buffer and overwriting the old.
The circular_buffer is a STL compliant container. It is a kind of sequence similar to std::list or std::deque. It supports random access iterators, constant time insert and erase operations at the beginning or the end of the buffer and interoperability with std algorithms. The circular_buffer is especially designed to provide fixed capacity storage. When its capacity is exhausted, newly inserted elements will cause elements either at the beginning or end of the buffer (depending on what insert operation is used) to be overwritten.
The circular_buffer only allocates memory when created, when the capacity is adjusted explicitly, or as necessary to accommodate resizing or assign operations. On the other hand, there is also a circular_buffer_space_optimized available. It is an adaptor of the circular_buffer which does not allocate memory at once when created, rather it allocates memory as needed.
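Usage looks something like this (capacity 3 rather than your 20, just to keep the example short):

#include <boost/circular_buffer.hpp>
#include <iostream>

int main()
{
    boost::circular_buffer<int> cb(3);   // fixed capacity, allocated once

    cb.push_back(1);
    cb.push_back(2);
    cb.push_back(3);
    cb.push_back(4);                     // full: silently overwrites the oldest (1)

    for (int x : cb)
        std::cout << x << ' ';           // prints: 2 3 4
}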
For the fast search, I think that with just 20 elements (if their comparison isn't too complicated) you're OK with a "low-cost" container like this and plain linear search; in my opinion it would be difficult to achieve better performance with other STL containers.
Maintain order of insertion, or allow fast searching: choose one.
std::map is not an option here because it doesn't maintain the order of insertion. Besides, it's an associative container. You should choose between a list, a deque and a vector. In terms of performance your best bet is a list, since you can pop off an element from the back and insert a new one at the front (or vice-versa) without any shifting or performance penalty.
The cost of insertion in a map, just as a side note, isn't expensive at all: it's on the order of O(log n). Practically irrelevant in the case of 20 elements. The same holds for a std::set.
With only 20 elements, I would not worry much about which container you use. If you determine that the container chosen is in fact a detriment to the performance of your application, it should be relatively easy to swap out the container chosen and replace it with a more-efficient container later.
With that being said, for a large number of elements, the std::deque would probably give you the best all-around efficiency for what you are trying to accomplish. Unlike std::vector, std::deque allows for removal from the front without needing to move all of the other elements. Unlike std::list, std::deque allows for random access of its elements.
You just need to implement a priority queue. STL Map doesn't work.
It depends on the size of the elements.
I know from my own experience that for five integers an unordered array of integers searched with linear search is faster than a set, a list or insertion sort and binary search on an ordered array.
The big-O complexity of an unordered array may be much worse than any of the other options, but the constant factors that the O() notation hides are so much smaller.
A list, set or map (anything that uses dynamic memory and is linked by pointers) will be dominated by cache misses, memory allocations and indirect reference penalties.
You need a Priority Queue implemented on an array.
See the Binary Heap for an implementation.
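If you don't want to hand-roll the heap, note that std::priority_queue is already a binary heap layered over a contiguous array (std::vector by default) - a quick sketch:

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main()
{
    // Min-heap over a vector: no per-node allocations, O(log n) push/pop.
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq;

    for (int x : {42, 7, 19})
        pq.push(x);

    std::cout << pq.top() << '\n';   // 7 - the smallest element
    pq.pop();
}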
Do you already know that this is a bottleneck?
My advice would be to first use what is more natural to read while programming and only optimize it when you see that the performance is not what you need.
My suggestion would be to make a circular buffer. But that only works if "old" is determined by when it was inserted, and not some field.
If you need to have a proper LRU, then you should probably go and look at something like http://www.codeproject.com/KB/recipes/LRUCache.aspx?fid=1000025&df=90&mpp=25&noise=3&sort=Position&view=Quick&fr=15
But with 20 entries as your max, it will be very hard for you to find a complex algorithm that is actually faster than a trivial linear check of every element.