How std::unordered_map is implemented - c++

c++ unordered_map collision handling, resize and rehash
This is a previous question that I opened, and I have realised that I have a lot of confusion about how unordered_map is implemented. I am sure many other people share that confusion with me. Based on the information I know without reading the standard:
Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.
I was hoping that someone might explain the implementation, how it fits with the C++ standard definition (in terms of performance requirements), and, if it really is not the most efficient way to implement a hash map data structure, how it can be improved?

The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:
• the default max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count()), and
• the guarantee that the table will not be rehashed unless grown beyond that load factor.
That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() (https://en.cppreference.com/w/cpp/container/unordered_map/load_factor) approaches 1.
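To see those two rules interact in practice, here is a small sketch (not from any of the quoted answers) that prints the bucket count each time an insertion triggers a rehash; with the default max_load_factor() of 1.0, the load factor stays at or below 1 after every growth step:

#include <cstddef>
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    std::cout << "max_load_factor: " << m.max_load_factor() << '\n';   // 1.0 by default

    for (int i = 0; i < 100; ++i) {
        std::size_t buckets_before = m.bucket_count();
        m.emplace(i, i * i);
        if (m.bucket_count() != buckets_before)
            std::cout << "rehash at size " << m.size() << ": "
                      << buckets_before << " -> " << m.bucket_count()
                      << " buckets, load_factor now " << m.load_factor() << '\n';
    }
}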
References:
23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.
amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.
† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.
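As a rough illustration of that layout (a simplified sketch with hypothetical names, not GCC's actual code), the bucket array stores pointers to the node preceding each bucket's first element, so erasing a bucket's last value only requires rewiring one next pointer:

#include <cstddef>

// Simplified sketch: one global singly-linked list of nodes, plus a bucket
// array whose entries point at the node *before* that bucket's first node.
struct Node {
    Node* next;
    int   value;            // the stored element (an int here for simplicity)
};

struct Table {
    Node        before_begin;   // sentinel node preceding the whole list
    Node**      buckets;        // buckets[i] points at the node before bucket i's first node
    std::size_t bucket_count;
};

// Erasing the node after *prev rewires a single pointer, so iterators to the
// remaining elements stay valid, exactly as the Standard requires.
inline void erase_after(Node* prev) {
    Node* victim = prev->next;
    prev->next = victim->next;
    delete victim;
}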
Regarding the text you quote:
No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements
There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.
As evidence of the awareness, from Matthew Austern's proposal:
I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:
• It's necessary to distinguish between a vacant position and an occupied one.
• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.
• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.
• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.
• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.
Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.
Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.
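For a concrete sense of what such a specialized table can look like, here is a minimal insert-only closed hashing (open addressing) sketch with linear probing and a sentinel key of 0 for empty slots; the names and policy choices are illustrative, not taken from any quoted source:

#include <cstddef>
#include <cstdint>
#include <vector>

// Insert-only open-addressing set for non-zero 64-bit keys; 0 is reserved as
// the "empty slot" sentinel.  Linear probing over a power-of-two capacity,
// growing when the load factor would exceed roughly 0.7.
class FlatSet {
public:
    explicit FlatSet(std::size_t capacity = 16) : slots_(capacity, 0), size_(0) {}

    bool insert(std::uint64_t key) {
        if ((size_ + 1) * 10 > slots_.size() * 7) grow();
        return insert_no_grow(key);
    }

    bool contains(std::uint64_t key) const {
        for (std::size_t i = index(key); ; i = (i + 1) & (slots_.size() - 1)) {
            if (slots_[i] == key) return true;
            if (slots_[i] == 0)   return false;   // reached an empty slot: not present
        }
    }

private:
    std::size_t index(std::uint64_t key) const {
        return (key * 0x9E3779B97F4A7C15ull) & (slots_.size() - 1);   // cheap multiplicative mix
    }

    bool insert_no_grow(std::uint64_t key) {
        for (std::size_t i = index(key); ; i = (i + 1) & (slots_.size() - 1)) {
            if (slots_[i] == key) return false;                        // already present
            if (slots_[i] == 0) { slots_[i] = key; ++size_; return true; }
        }
    }

    void grow() {
        std::vector<std::uint64_t> old;
        old.swap(slots_);
        slots_.assign(old.size() * 2, 0);
        size_ = 0;
        for (std::uint64_t k : old)
            if (k != 0) insert_no_grow(k);
    }

    std::vector<std::uint64_t> slots_;
    std::size_t size_;
};

Note how erase is simply missing: supporting it would require tombstones or back-shifting, which is exactly the complication Austern's proposal points out.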
A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.

Related

Build large(ish) unordered set with all data available at the beginning

I have a situation where I need to optimize the creation of an unordered set. Expected number of elements is around 5-25M. My first thought is that I should prepare all data beforehand and do something like
unordered_set s(data);
instead of
for (auto& elem : data)
s.insert(elem);
Can the STL unordered set use bulk loading methods and speed up its creation? How can I tweak the parameters of the hash table (bucket size etc) if I know prior to the table construction the expected number of elements?
This question is quite broad and interesting.
First of all, there is a special method called reserve - it allows you to pre-allocate storage for a number of elements before actually inserting them. Pre-allocating sufficient memory (and avoiding rehashes/reallocations during the insertion) is a very powerful approach, which is commonly used for large data sets. Note that it is also available for various standard containers, including vector, unordered_map, etc.
Secondly, if you're using C++11, you might benefit from using move semantics while inserting the elements into your container (of course, provided you don't need them in your source once they are placed in the set, which should be true for 5 to 25 million objects).
These two techniques are a good start. You may need to tune things further by setting a different hashing function, or even choosing a different implementation of an unordered_set. But at this point, you should provide more information: what are your value objects and what is their life cycle; what insertion time do you find acceptable in your application?
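A minimal sketch combining both suggestions, assuming the elements are movable strings sitting in a std::vector named data:

#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

std::unordered_set<std::string> build(std::vector<std::string>& data) {
    std::unordered_set<std::string> s;
    s.reserve(data.size());            // size the bucket array for data.size() elements up front
    for (auto& elem : data)
        s.insert(std::move(elem));     // move instead of copy; leaves data's strings emptied
    return s;
}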
EDIT: of course it's all about C++11, as unordered_set was not available prior to it. Shame on me :)
My focus now is on whether I can use functions like rehash to notify the table of the upcoming size.
Suppose you call
unordered_set s(begin(data), end(data));
then, while the standard doesn't dictate an implementation, a good implementation will be able to discern the number of elements and preallocate accordingly. If you look at the source code used by gcc (on my machine, /usr/include/c++/5/tr1/hashtable.h), for example, it uses
_M_bucket_count = std::max(_M_rehash_policy._M_next_bkt(__bucket_hint),
                           _M_rehash_policy._M_bkt_for_elements(
                               __detail::__distance_fw(__f, __l)));
_M_buckets = _M_allocate_buckets(_M_bucket_count);
so it already preallocates size based on the number of elements.
The problem might be different, though. If you look at the documentation, it states:
constructs the container with the contents of the range [first, last). Sets max_load_factor() to 1.0.
This saves space, but might cause collisions. To reduce the collisions, you could use
unordered_set s(begin(data), end(data), k * data.size());
where k > 1 is some constant. This corresponds to a load factor that is 1 / k. YMMV.
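Putting it together, a small sketch (assuming data is a std::vector<int> and picking k = 2):

#include <iostream>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8};

    // Bucket-count hint of k * data.size() (k = 2 here), so the initial load
    // factor ends up around 1 / k instead of close to 1.
    std::unordered_set<int> s(data.begin(), data.end(), 2 * data.size());

    std::cout << "buckets: " << s.bucket_count()
              << ", load factor: " << s.load_factor() << '\n';
}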

iterate ordered versus unordered containers

I want to know which data structures are more efficient for iterating through their elements: std::set and std::map, or std::unordered_set and std::unordered_map.
I searched through SO and I found this question. The answers either propose to copy the elements into a std::vector or to use Boost.Container, which IMHO doesn't answer my question.
My purpose is to keep in a container a large number of unique elements that, most of the time, I only want to iterate through. Insertions and removals are rarer. I want to avoid std::vector in combination with std::unique.
Let's consider set vs unordered_set.
The main difference here is the 'nature' of the iteration: traversal of a set will give you the elements in order, while traversing a range in an unordered set will give you a bunch of values in no particular order.
Suppose you want to traverse a range [it1, it2]. If we exclude the lookup time needed to find the elements it1 and it2, there can be no direct mapping from one case to the other, since the elements in between are not guaranteed to be the same even if you've used the same elements to construct the container.
There are cases however where something like this has meaning, e.g. when you want to traverse a fixed number of elements (regardless of what they are) or when you need to traverse the whole container. In such cases you need to consider implementation mechanics:
Sets are usually implemented as red-black trees (a form of binary search tree). Like all binary search trees, they allow efficient in-order traversal (LRR: left, root, right) of their elements. That is, to traverse you pay the cost of pointer chasing (just like traversing a list).
Unordered sets, on the other hand, are hash tables, and to my knowledge the STL implementations use hashing with chaining. That means (at a very high level) that the structure is a (contiguous) buffer where each element is the head of a chain (list) that contains the elements. The way the elements are laid out across those chains (buckets) and across the buffer will affect the traversal time; however, you'll be chasing pointers once again, jumping through different lists this time. I don't think it'll vary significantly from the tree case, but it won't be any better, for sure.
In any case micro tuning and benchmarking will give you the answer for your particular application.
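If you want to measure it for your own data, a rough benchmark sketch along these lines can help (results will depend entirely on your platform, element type and compiler flags):

#include <chrono>
#include <cstdint>
#include <iostream>
#include <set>
#include <unordered_set>

template <class Container>
double iteration_ms(const Container& c) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t sum = 0;
    for (int v : c)
        sum += static_cast<std::uint64_t>(v);        // touch every element
    auto stop = std::chrono::steady_clock::now();
    std::cout << "checksum " << sum << ": ";         // keep the loop from being optimized away
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    std::set<int> ordered;
    std::unordered_set<int> unordered;
    for (int i = 0; i < 1000000; ++i) {
        ordered.insert(i);
        unordered.insert(i);
    }

    std::cout << iteration_ms(ordered)   << " ms for std::set\n";
    std::cout << iteration_ms(unordered) << " ms for std::unordered_set\n";
}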
The difference does not lie in the ordering or the lack of it, but in the backing container. If it's contiguous memory, it should be fast to iterate over, due to the simple iterator implementation and cache friendliness.
Unordered containers are usually stored as a vector of vectors (or something similar), while ordered containers are implemented using trees, but it is left to the implementation after all. This would suggest that iterating over the unordered version should be faster. However, this is implementation-defined, and I have seen implementations (which bent the rules a little, to be fair) with different behaviour.
Generally speaking, container performance is quite a complex topic and usually has to be tested in the actual application to get a reliable answer. There is plenty of implementation-defined stuff that might affect the performance. I'd go with hash_set if I had to go in blind. Copying into a vector might also turn out to be a good option.
EDIT: As #TonyD said in his comment, there is a rule that disallows invalidating iterators when adding elements as long as max_load_factor() is not exceeded; this practically rules out backing containers that are contiguous in memory.
Thus, copying everything into a vector seems like an even more reasonable option. If you need to remove duplicates, a feasible option might be to use std::sort (http://en.cppreference.com/w/cpp/algorithm/sort) and then skip the duplicates. I have heard that using a vector plus sort to obtain a sorted array (or vector) is quite a common choice when you need a container that has to be sorted and is iterated over more often than it is modified.
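A sketch of that vector-based approach (assuming int elements; std::unique needs the data sorted first):

#include <algorithm>
#include <vector>

// Build a sorted, duplicate-free vector that can then be iterated over cheaply.
std::vector<int> make_sorted_unique(std::vector<int> data) {
    std::sort(data.begin(), data.end());
    data.erase(std::unique(data.begin(), data.end()), data.end());
    return data;
}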
Iterating from fastest to slowest should be: set > map > unordered_set > unordered_map.
set is a little lighter than map, and both are ordered following the binary tree rule, so they should be faster than the unordered_ containers.

Would an unordered_map be a good choice?

I'm wondering if an unordered_map would be a good choice as the container for my specific problem. What I've read about maps does not really cover my case, which is:
• The container will store between 100 and 500 objects (not int/double...)
• The size will never change.
• The order is not important, as the objects themselves contain some kind of "index".
• Very often (!) I need to filter all elements in the container that have some property (e.g. color == blue)
Currently I use vectors, which works. However, if e.g. an unordered_map would improve performance (with regard to "filtering"), I could imagine changing that.
std::unordered_map wouldn't really help you if you have multiple search criteria (sometimes color == blue, sometimes flavour == up), because maps only offer fast query on a single, pre-determined key.
I'd say std::vector is just fine for you, ideally wrapped in your own structure which will provide the lookup interface. If profiling later tells you this is not fast enough, you could build your own indexes on top of such data. You wouldn't even have to do that manually: boost::multi_index is a generic container designed for multiple-criterion lookup.
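A sketch of that wrapper idea; the Widget type, its color field and the interface are made up purely for illustration:

#include <string>
#include <utility>
#include <vector>

// Hypothetical element type used only for illustration.
struct Widget {
    int         index;
    std::string color;
};

// Thin wrapper around a vector that exposes the one lookup the question needs:
// "give me every element with a given property".
class WidgetStore {
public:
    explicit WidgetStore(std::vector<Widget> items) : items_(std::move(items)) {}

    std::vector<const Widget*> with_color(const std::string& color) const {
        std::vector<const Widget*> result;
        for (const Widget& w : items_)
            if (w.color == color)
                result.push_back(&w);     // linear scan over contiguous storage
        return result;
    }

private:
    std::vector<Widget> items_;
};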
I would use a vector or simply an array for storing the actual data, and have a few maps that map a key to a pointer into the actual data.
This gives higher memory usage, but in case searching by different indexes is often needed, you may want to sacrifice a bit of memory.
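A sketch of that layout (again with a made-up Widget type, for illustration only): the actual data lives in one vector, and a secondary map indexes it by a non-unique property:

#include <string>
#include <unordered_map>
#include <vector>

struct Widget {               // hypothetical element type, for illustration only
    int         index;
    std::string color;
};

struct IndexedStore {
    std::vector<Widget> items;                                        // the actual data
    std::unordered_map<std::string, std::vector<Widget*>> by_color;   // secondary index

    void build_index() {
        by_color.clear();
        for (Widget& w : items)
            by_color[w.color].push_back(&w);   // pointers stay valid while items is not resized
    }
};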
A hash table (which std::unordered_map is) provides constant-time lookup for one key (key-value pair). However, its constant factors are always higher (i.e. the lookup is slower) than a simple array (which provides constant-time lookup for integer indices).
If you need to filter a collection of elements based on some criteria, then you need to inspect each individual element. In this case, a hash table would be strictly worse than an array/vector performance-wise, since its computational complexity is the same as that of array indexing, but with worse constant factors.
So no, there's no reason why you would want to use an unordered_map in this case.

Why did the STL choose a tree-based map instead of a hash-based map?

I'm wondering why STL's map is based on a red-black tree?
I mean, a hash-based map seems to be more efficient for inserting/deleting or even getting a value.
Are there any specific considerations?
The STL originally chose both. It had a hash table and the tree-based map.
However, when it was adopted into the standard, many parts were stripped away in order to simplify the task (it was easier to talk the committee into including a smaller library, and it required less work in terms of actually specifying their behavior).
So the hash table was skipped.
However, both data structures have their advantages. In particular, a binary tree allows the contents of the map to be ordered (you can iterate over the contents of a map in sorted order, or you can ask for all elements smaller than a specific element, for example), and I can only guess that this property was considered more important than the performance advantages of a hash map.
However, in C++11, std::unordered_map was added, which is the long-lost hash table. Its original omission was simply due to time pressure and, quite possibly, committee politics (keeping the library small to minimize resistance against it).
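As an aside, a small sketch of the kind of ordered queries a tree-based std::map supports and a hash table cannot (the property mentioned above):

#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<int, std::string> m = {{5, "five"}, {1, "one"}, {3, "three"}, {9, "nine"}};

    // Iteration visits keys in sorted order: 1, 3, 5, 9.
    for (const auto& kv : m)
        std::cout << kv.first << " -> " << kv.second << '\n';

    // Ask for all elements smaller than 5, using the ordering of the tree.
    auto upper = m.lower_bound(5);
    for (auto it = m.begin(); it != upper; ++it)
        std::cout << it->first << " is smaller than 5\n";
}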

Efficiently erasing elements in tr1::unordered_map

I am experimenting with tr1::unordered_map and stumbled upon the problem of how to efficiently delete elements. The 'erase' method offers to delete either by key or by iterator. I would assume the latter to be more efficient, since the former presumably involves an implicit find operation. On the other hand, my investigations on the internet have revealed that iterators may become invalid after calling the insert() method.
I am interested in a typical real-world situation, where objects put into a hash table have a life span which is long enough that calls to insert() happen during that life span. Thus may I conclude that in such a situation deletion by key is the only option left? Are there any alternatives for deleting objects more efficiently? I am fully aware that the question only matters in applications where deletions happen often. Whether this will be the case for my current project remains to be seen, but I would rather learn about these issues while designing my project rather than when there is already a lot of code present.
The whole point of the unordered containers is to have the fastest possible lookup time. Worrying about the time it takes to erase an element by key sounds like the classic example of premature optimization.
If it matters a great deal to you, because you're keeping the iterator for some other reason, then C++0x says of std::unordered_map (quoting from the FDIS), in 23.2.5/11:
The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.
I haven't checked whether the tr1 spec has the same guarantee, but it's fairly logical based on the expected implementation.
If you can use this guarantee, then you can protect your iterators up to a point. As Mark says, though, lookup in unordered_map is supposed to be fast. Keeping a key rather than an iterator is worse than keeping an index rather than an iterator in a vector, but better than the equivalent for map.
Yes, insert() can invalidate all iterators. Therefore, I don't think there's a way to avoid the (implicit) lookup. The good news is that the latter is likely to be cheap.
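A small sketch contrasting the two forms of erase (illustrative only; how much the key-based form costs in practice depends on the hash function and the implementation):

#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> m = {{"a", 1}, {"b", 2}, {"c", 3}};

    // Erase by key: performs an implicit hash + bucket lookup, then unlinks the node.
    m.erase("a");

    // Erase by iterator: skips the lookup, but the iterator must still be valid,
    // i.e. no rehash may have happened since it was obtained.
    auto it = m.find("b");
    if (it != m.end())
        m.erase(it);
}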