While looking for a container suitable for an application I'm building, I ran across documentation for unordered_set. Given that my application typically requires only insert and find functions, this class seems rather attractive. I'm slightly put off, however, by the fact that find is O(1) amortized, but O(n) worst case - I would be using the function frequently, and it could make or break my application. What causes the spike in complexity? Is the likelihood of running into an O(n) search predictable?
_unordered_set_ is implemented as a hash table. A common implementation uses a container (e.g. a vector) of hash buckets, where each bucket is itself a container (e.g. a list) holding the elements of the unordered_set that hash to that bucket.
When you insert an element into the unordered_set, a hash function is applied to it, which determines the bucket where it is placed.
Several elements can end up in the same bucket. When you look up an element, the hash function is applied again to find the bucket, and then the elements in that bucket are searched for the one you want.
The worst-case scenario is that all elements land in the same bucket; depending on the container used to store a bucket's elements, search then degrades to O(n).
Whether elements end up in the same bucket comes down to two things: the quality of the hash function and the elements themselves (which may expose a specific weakness of the hash function).
You normally cannot predict the elements, but if they are predictable enough in your case, you can select a hash function that spreads that kind of element evenly.
To speed up searches, the key is a good hash function that distributes elements evenly across the buckets, plus, if needed, rehashing to a larger bucket count (be careful with this option: the hash function will be reapplied to every element).
If storing these elements is that important to your application, I suggest you run performance tests with data as close as possible to production data and decide from there. That said, the STL containers, and especially containers of the same group (e.g. the associative ones), share almost the same interface, so it is easy to swap one for another with little or no change to the code that uses them.
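To make this concrete, here is a minimal sketch (container type and contents are arbitrary) that uses the standard observers bucket_count(), bucket_size(), and load_factor() to see how well elements are spread across buckets:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> names{"alice", "bob", "carol", "dave", "erin"};

    // Standard observers for judging how the hash spreads the elements.
    std::cout << "buckets:     " << names.bucket_count() << '\n';
    std::cout << "load factor: " << names.load_factor() << '\n';

    // The worst case (all elements in one bucket -> O(n) find) shows up
    // here as a single bucket holding everything.
    std::size_t largest = 0;
    for (std::size_t b = 0; b < names.bucket_count(); ++b)
        largest = std::max(largest, names.bucket_size(b));
    std::cout << "largest bucket: " << largest << '\n';
}
```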
I'm trying to understand unordered maps and hashing. As I understand it, an unordered map has a hash function inside it that takes an object of type T and returns an int, which is then used as an index into an internal array. At each array position it keeps a list of objects of type T, so that if something is already in that spot, additions are inserted into the list.
Conceptually, would using a Set instead of a List improve efficiency?
(maybe somehow binary search and a Set being ordered helps over having a List)
Or maybe a Vector instead of the List?
(maybe random access helps over the List.)
The datatype should not matter much, because in most cases, the container at the hashed index only contains zero or one element. If you regularly have many elements there, the hash map degrades in performance anyway. The remedy for that is to resize the initial array, which std::unordered_map<> does itself. However, if you have a bad hash function which causes many hash collisions, switching the hash function is necessary for proper operation.
If there are often a lot of collisions at the same bucket, then using a set is more efficient than using a list, and indeed some Java hash table implementations have adopted sets for this reason. vectors can't be used for std::unordered_map or std::unordered_set implementations, as they need to reallocate to a different memory area when grown past their capacity, whilst the Standard requires that the elements in an unordered container are never moved by other operations on the container.
That said, the nature of hash tables is that - with a high quality hash function - the statistical distribution of number-of-elements colliding in particular buckets relates only to the load factor. If you can't trust the collisions not to get out of control, perhaps you shouldn't be using that hash function.
Some details: Standard-library unordered containers have a default max_load_factor() (load_factor() is the ratio of size() to bucket_count()) of 1.0, and with a strong pseudo-randomizing hash function they'll have 1/e ~= 36.8% of buckets empty, as many with one element, half that with 2 elements (~18.4%), a third of that with 3 elements (~6.13%), a quarter of that with 4 elements (~1.53%), a fifth of that with 5 elements (~0.3%), a sixth of that with 6 elements (~0.05%). As you can hopefully see, it's incredibly rare to have to search through many elements (even in the worst case scenario where the hash table is at its max load factor), so a list approach is usually adequate.
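For intuition, here is a small sketch (element type, count, and seed are arbitrary) that fills an unordered_set with random integers and prints the bucket-size histogram. With a decent hash, the percentages should be in the same ballpark as the figures above, though they depend on the load factor the implementation happens to be at after its last rehash:

```cpp
#include <iostream>
#include <map>
#include <random>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    std::mt19937 gen(42);                        // fixed seed, arbitrary
    std::uniform_int_distribution<int> dist;
    while (s.size() < 1'000'000) s.insert(dist(gen));

    // Histogram: how many buckets hold 0, 1, 2, ... elements.
    std::map<std::size_t, std::size_t> histogram;
    for (std::size_t b = 0; b < s.bucket_count(); ++b)
        ++histogram[s.bucket_size(b)];

    for (auto [size, count] : histogram)
        std::cout << size << " element(s): "
                  << 100.0 * count / s.bucket_count() << "% of buckets\n";
}
```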
I need to create a lookup function where a (X,Y) pair corresponds to a specific Z value. One major requirement for this is that I need to do it in as close to O(1) complexity as I can. My plan is to use an unordered_map.
I generally do not use a hash table for lookup, as the lookup time has never been important to me. Am I correct in thinking that as long as I built the unordered_map with no collisions, my lookup time will be O(1)?
My concern then is what the complexity becomes if the key is not present in the unordered map. If I use unordered_map::find(), for example, to determine whether a key is present in my hash table, how will it go about giving me an answer? Does it actually iterate over all the keys?
I greatly appreciate the help.
The standard more or less requires using buckets for collision resolution, which means that the actual lookup time will probably be linear with respect to the number of elements in the bucket, regardless of whether the element is present or not. It's possible to make it O(lg N), but it's not usually done, because the number of elements in the bucket should be small if the hash table is being used correctly.
To ensure that the number of elements in a bucket is small, you must ensure that the hashing function is effective. What effective means depends on the types and values being hashed. (The MS implementation uses FNV, which is one of the best generic hashes around, but if you have special knowledge of the actual data you'll be seeing, you might be able to do better.)
Another thing which can help reduce the number of elements per bucket is to force more buckets or use a smaller load factor. For the first, you can pass the minimum initial number of buckets as an argument to the constructor. If you know the total number of elements that will be in the map, you can control the load factor this way. You can also force a minimum number of buckets once the table has been filled, by calling rehash. Otherwise, there is a function std::unordered_map<>::max_load_factor which you can use. It is not guaranteed to do anything, but in any reasonable implementation, it will. Note that if you use it on an already filled unordered_map, you'll probably have to call unordered_map<>::rehash afterwards.
(There are several things I don't understand about the standard unordered_map: why the load factor is a float, instead of double; why it's not required to have an effect; and why it doesn't automatically call rehash for you.)
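A short sketch of the knobs mentioned above (the numbers are arbitrary):

```cpp
#include <unordered_map>

int main() {
    // Pass a minimum initial bucket count to the constructor if you know
    // roughly how many elements the map will hold.
    std::unordered_map<int, int> m(1024);

    // Lower the maximum load factor; on an already filled map, follow up
    // with rehash(), since changing max_load_factor() alone is not
    // guaranteed to reorganize anything.
    m.max_load_factor(0.5f);
    m.rehash(0);  // forces at least size() / max_load_factor() buckets
}
```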
As with any hash table, worst case is always linear complexity (Edit: if you built the map without any collisions like you stated in your original post, then you'll never see this case):
http://www.cplusplus.com/reference/unordered_map/unordered_map/find/
Complexity
Average case: constant.
Worst case: linear in container size.
Return Value
An iterator to the element, if the specified key value is found, or unordered_map::end if the specified key is not found in the container.
However, because an unordered_map can only contain unique keys, you will see average complexity of constant time (container first checks hash index, and then iterates over values at that index).
I think the documentation for the unordered_map::count function is more informative:
Searches the container for elements whose key is k and returns the number of elements found. Because unordered_map containers do not allow for duplicate keys, this means that the function actually returns 1 if an element with that key exists in the container, and zero otherwise.
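Applied to the (X,Y) -> Z lookup from the question, a minimal sketch might look like the following. Note that std::pair has no std::hash specialization, so a hasher must be supplied; PairHash and its boost-style combining step are illustrative choices, not the only way to do it:

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_map>
#include <utility>

// std::pair has no std::hash specialization, so supply one.
// PairHash is a hypothetical name; the combining scheme is a simple sketch.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const {
        std::size_t h1 = std::hash<int>{}(p.first);
        std::size_t h2 = std::hash<int>{}(p.second);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

int main() {
    std::unordered_map<std::pair<int, int>, double, PairHash> z;
    z[{1, 2}] = 3.5;

    // find() hashes the key, then scans only that one bucket; a miss does
    // not iterate over all keys in the container.
    auto it = z.find({7, 8});
    std::cout << (it == z.end() ? "absent" : "present") << '\n';  // absent
    std::cout << z.count({1, 2}) << '\n';                         // 1
}
```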
To have no collisions in a hashed data structure is incredibly difficult (if not impossible for a given hash function and arbitrary data), and it would also require a table size exactly equal to the number of keys. But no, it does not need to be that strict: as long as the hash function distributes the values in a relatively uniform way, you will have O(1) lookup complexity.
Hash tables are generally just arrays with linked lists taking care of the collisions (this is the chaining method - there are other methods, but this is likely the most utilized way of dealing with collisions). Thus, to find if a value is contained within a bucket, it will have to (potentially) iterate over all the values in that bucket. So if the hash function gives you a uniform distribution, and there are N buckets, and a total of M values, there should be (on average) M/N values per bucket. As long as this value is not too large, this allows O(1) lookup.
So, as a bit of a long winded answer to your question, as long as the hashing function is reasonable, you will get O(1) lookup, with it having to iterate over (on average) O(M/N) keys to give you a "negative" result.
I'm a little confused about how unordered_map works, what buckets are, and how they are managed.
From this blog post, an unordered_map is a vector of vectors.
My questions are:
is it correct to assume that the buckets are the "internal" vectors?
since each bucket (vector) can contain multiple elements, due to a hash collision on the hash table (the "external" vector), and since we have to scan this internal vector (in linear time), is it correct to assume that we have to define the equality method on the key type (in addition to the hash operator) in order to find the key inside the bucket?
what is the external vector (hash table) size by default?
what is the internal vector size by default?
what happens if the number of elements in one bucket becomes too big? Or, in other words, when does the rehash happen?
Sorry for these questions, but I didn't find any detailed explanation how this structure works (on cppreference.com for example).
std::unordered_map is the standard C++ hash table. It used to be called hash_map in STL, but it missed the boat when many of STL's interfaces were merged into C++ in 1998, and by 2011 so many libraries had their own hash_map that C++ had to pick another name (I think "unordered" was a great choice; assuming order in a hash table is a common source of bugs).
is it correct to assume that the buckets are the "internal" vectors?
No, it is both incorrect (incompatible with the iterator invalidation requirements) and dangerous (under that assumption you may end up subtracting pointers to elements in the same bucket).
In real life, the buckets are linked lists; e.g.
LLVM libc++ unordered_map is a unique_ptr to an array of linked lists of __hash_nodes
GNU libstdc++ unordered_map is a pointer to an array of linked lists of _Hash_nodes
is it correct to assume that we have to define the equal method on the key type (in addiction to the hash operator) in order to find the key inside the bucket?
Yes, locating the key in a bucket is exactly what the 4th template parameter of std::unordered_map is for (it does not need to call an "equal method on the key type" literally, of course).
what is the external vector (hash table) size by default?
There is no "external vector". The number of buckets for a default-constructed std::unordered_map is implementation-defined, you can query it with bucket_count.
what is the internal vector size by default?
There is no "internal vector". The size of any given bucket equals the number of elements currently placed in the bucket. You can query it with bucket_size
what happens if the number of elements in one bucket becomes too big?bor in other words, when the rehash happens?
Nothing happens if the number of elements in one bucket becomes too big. But if average number of elements per bucket (aka load_factor) exceeds max_load_factor, rehash happens (e.g. on insert)
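All of this can be observed directly; a small sketch (map contents arbitrary):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> m{{"a", 1}, {"b", 2}, {"c", 3}};

    // Which bucket does a key hash into, and how many elements share it?
    std::size_t b = m.bucket("a");
    std::cout << "key 'a' lives in bucket " << b
              << " (holds " << m.bucket_size(b) << " element(s))\n";

    // Rehash is triggered by inserts once load_factor() would exceed
    // max_load_factor(), not by any single bucket growing large.
    std::cout << "load factor " << m.load_factor()
              << " / max " << m.max_load_factor() << '\n';
}
```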
This may help you understand the buckets:
http://www.cplusplus.com/reference/unordered_map/unordered_map/bucket_count/
http://www.cplusplus.com/reference/unordered_map/unordered_map/max_load_factor/
But generally, yes, the buckets are something like internal vectors. It needs an equality operator (or a predicate) to distinguish keys that have the same hash, as you suggest.
The initial number of buckets is possibly 0. It can be set via rehash() or reserve() (they have slightly different semantics.)
http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/
In an ideal case, each bucket will only have the one item. You can check this by using bucket_size. When the load factor (total items vs. bucket count) gets high, it rehashes automatically.
By default, it'll aim for a 1:1 load factor. If the hash function is good, this may last until max_bucket_count items are inserted.
Keep in mind that the specific implementation of this may vary. Each implementation (e.g. from different platforms or standard libraries) really only needs to have the correct semantics.
If these answers are important to your program, you can query the values as I've described. If you're just trying to wrap your head around it, query them in some test scenarios and it may become more clear.
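For the rehash()/reserve() distinction mentioned above, a minimal sketch:

```cpp
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;

    // rehash(n): make room for at least n *buckets*.
    m.rehash(100);

    // reserve(n): make room for at least n *elements* without triggering
    // a rehash, i.e. rehash(ceil(n / max_load_factor())).
    m.reserve(100);
}
```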
Can you share your thoughts on what the best STL data structure would be for storing a large list of names and perform searches on these names?
Edit:
The names are not unique, and the list can grow, as new names can continuously be added to it. And by large I am talking of 1 million to 10 million names.
Since you want to search names, you want a structure that supports fast keyed lookup. That means vector, deque, and list are all out of the question: searching them is linear. Also, vector/array are slow on random adds/inserts for sorted sets because they have to shift items to make room for each inserted item. Adding to the end is very fast, though.
Consider std::map, std::unordered_map or std::unordered_multimap (or their siblings std::set, std::unordered_set and std::unordered_multiset if you are only storing keys).
If you are purely going to do unique, random access, I'd start with one of the unordered_* containers.
If you need to store an ordered list of names and need to do range searches/iteration and sorted operations, a tree-based container like std::map or std::set should do better with iteration than a hash-based container, because the former stores items adjacent to their logical predecessors and successors. For lookups it is O(log N), which is still decent.
Prior to std::unordered_*, I used std::map to hold large numbers of objects for an object cache, and though there are faster containers for keyed access, it scaled well enough for our uses. The newer unordered_map is a hashed structure with O(1) average access time, so it should give you close to the best access times.
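Since the names in the question are not unique, the multiset variant fits; a minimal sketch (sizes and names are placeholders):

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    // Names are not unique, so the multiset variant keeps duplicates.
    std::unordered_multiset<std::string> names;
    names.reserve(10'000'000);  // avoid repeated rehashing for ~10M names

    names.insert("smith");
    names.insert("smith");
    names.insert("jones");

    // Average O(1) membership test and duplicate count.
    std::cout << names.count("smith") << " occurrences of smith\n";  // 2
}
```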
You could consider concatenating those names with a delimiter, but searching might take a hit; you would need to come up with an adapted binary search.
But you should try the obvious solution first which is a hashmap which is called unordered_map in stl. See if that meets your needs. Searching should be plenty fast there but at a cost of memory.