c++ - unordered_map complexity - c++

I need to create a lookup function where a (X,Y) pair corresponds to a specific Z value. One major requirement for this is that I need to do it in as close to O(1) complexity as I can. My plan is to use an unordered_map.
I generally do not use a hash table for lookup, as the lookup time has never been important to me. Am I correct in thinking that as long as I built the unordered_map with no collisions, my lookup time will be O(1)?
My concern then is what the complexity becomes if the key is not present in the unordered map. If I use unordered_map::find(), for example, to determine whether a key is present in my hash table, how will it go about giving me an answer? Does it actually iterate over all the keys?
I greatly appreciate the help.

The standard more or less requires using buckets for collision
resolution, which means that the actual lookup time will
probably be linear with respect to the number of elements in the
bucket, regardless of whether the element is present or not.
It's possible to make it O(lg N) in the bucket size, but it's not
usually done, because the number of elements in the bucket should
be small if the hash table is being used correctly.
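To make the "linear in the bucket" point concrete, here is a minimal sketch using the standard bucket interface. The helper name `elements_in_key_bucket` and the `std::string` to `int` map type are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Upper bound on the elements find(key) has to compare against: the size of
// the single bucket the key hashes to. The key does not need to be present.
std::size_t elements_in_key_bucket(const std::unordered_map<std::string, int>& table,
                                   const std::string& key) {
    if (table.bucket_count() == 0) return 0;  // bucket() requires at least one bucket
    return table.bucket_size(table.bucket(key));
}
```

For a missing key, the returned number is how many elements a lookup would walk past before reporting `end()`; the other buckets are never visited.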
To ensure that the number of elements in a bucket is small, you
must ensure that the hashing function is effective. What
effective means depends on the types and values being hashed.
(The MS implementation uses FNV, which is one of the best
generic hashes around, but if you have special knowledge of the
actual data you'll be seeing, you might be able to do better.)
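For the (X,Y) to Z lookup in the question, a hash has to be supplied because `std::pair` has no `std::hash` specialization. The following is only a sketch: `PairHash`, `Grid` and `lookup_z` are hypothetical names, and it assumes integer coordinates and a `double` Z. How well it spreads keys depends on the library's `std::hash` for 64-bit integers.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical hash for an (X, Y) key.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        // Pack the low 32 bits of each coordinate into one 64-bit value, then hash that.
        const std::uint64_t hi = static_cast<std::uint32_t>(p.first);
        const std::uint64_t lo = static_cast<std::uint32_t>(p.second);
        return std::hash<std::uint64_t>{}((hi << 32) | lo);
    }
};

using Grid = std::unordered_map<std::pair<int, int>, double, PairHash>;

// Returns a pointer to the Z stored for (x, y), or nullptr when the pair is absent.
const double* lookup_z(const Grid& grid, int x, int y) {
    const auto it = grid.find({x, y});
    return it == grid.end() ? nullptr : &it->second;
}
```

If the coordinates are known to fall in a small range, a hand-written mapping such as `x * width + y` might do better, which is the "special knowledge" case mentioned above.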
Another thing which can help reduce the number of elements per
bucket is to force more buckets or use a smaller load factor.
For the first, you can pass the minimum initial number of
buckets as an argument to the constructor. If you know the
total number of elements that will be in the map, you can
control the load factor this way. You can also force a minimum
number of buckets once the table has been filled, by calling
rehash. Otherwise, there is a function
std::unordered_map<>::max_load_factor which you can use. It
is not guaranteed to do anything, but in any reasonable
implementation, it will. Note that if you use it on an already
filled unordered_map, you'll probably have to call
unordered_map<>::rehash afterwards.
(There are several things I don't understand about the standard
unordered_map: why the load factor is a float, instead of
double; why it's not required to have an effect; and why it
doesn't automatically call rehash for you.)
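A short sketch of the two approaches described in this answer. The function names `make_table` and `spread_out` are made up for illustration. The sketch relies on the standard's postcondition that after `rehash(n)` the bucket count is at least `n` and at least `size() / max_load_factor()`.

```cpp
#include <cstddef>
#include <unordered_map>

// Passes a minimum initial bucket count to the constructor.
std::unordered_map<int, int> make_table(std::size_t min_buckets) {
    return std::unordered_map<int, int>(min_buckets);
}

// Lowers the load factor of an already filled table, then rehashes explicitly,
// because setting max_load_factor alone is not required to move anything.
void spread_out(std::unordered_map<int, int>& table, float max_load) {
    table.max_load_factor(max_load);
    table.rehash(0);  // 0 means "whatever the current max load factor requires"
}
```

Calling `spread_out(table, 0.25f)` on a table of 500 elements should leave it with at least 2000 buckets, at the cost of one full rehash.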

As with any hash table, worst case is always linear complexity (Edit: if you built the map without any collisions like you stated in your original post, then you'll never see this case):
http://www.cplusplus.com/reference/unordered_map/unordered_map/find/
Complexity
Average case: constant.
Worst case: linear in container size.
Return Value
An iterator to the element, if the specified key value is found, or unordered_map::end if the specified key is not found in the container.
However, because an unordered_map can only contain unique keys, you will see average complexity of constant time (container first checks hash index, and then iterates over values at that index).
I think the documentation for unordered_map::count function is more informative:
Searches the container for elements whose key is k and returns the
number of elements found. Because unordered_map containers do not
allow for duplicate keys, this means that the function actually
returns 1 if an element with that key exists in the container, and
zero otherwise.
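As a small illustration of the two documented behaviours, a presence check can be written either way. The helper names are hypothetical.

```cpp
#include <string>
#include <unordered_map>

// Presence test via find(): a missing key yields the end() iterator.
bool contains_via_find(const std::unordered_map<std::string, int>& m,
                       const std::string& key) {
    return m.find(key) != m.end();
}

// Presence test via count(): 0 or 1 for unordered_map, since keys are unique.
bool contains_via_count(const std::unordered_map<std::string, int>& m,
                        const std::string& key) {
    return m.count(key) == 1;
}
```

Both have the same complexity; `find()` is the better fit when the value is needed afterwards, since the returned iterator already points at it.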

To have no collisions in a hashed data structure is incredibly difficult (if not impossible for a given hash function and arbitrary data). It would also require a table with at least as many slots as there are keys. Fortunately, it does not need to be that strict: as long as the hash function distributes the values in a relatively uniform way, you will have O(1) lookup complexity.
Hash tables are generally just arrays with linked lists taking care of the collisions (this is the chaining method - there are other methods, but this is likely the most utilized way of dealing with collisions). Thus, to find if a value is contained within a bucket, it will have to (potentially) iterate over all the values in that bucket. So if the hash function gives you a uniform distribution, and there are N buckets, and a total of M values, there should be (on average) M/N values per bucket. As long as this value is not too large, this allows O(1) lookup.
So, as a bit of a long-winded answer to your question, as long as the hashing function is reasonable, you will get O(1) lookup, with it having to iterate over (on average) O(M/N) keys to give you a "negative" result.
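The chaining scheme described above can be sketched as a toy class. This is not how any particular standard library implements `unordered_map`; `ChainedTable` and its members are invented for illustration, it stores `int` keys and values only, never resizes, and needs a bucket count greater than zero.

```cpp
#include <cstddef>
#include <forward_list>
#include <functional>
#include <utility>
#include <vector>

// Toy chained table: an array of buckets, each a singly linked list of (key, value).
class ChainedTable {
public:
    // bucket_count must be greater than zero.
    explicit ChainedTable(std::size_t bucket_count) : buckets_(bucket_count) {}

    void insert(int key, int value) {
        auto& chain = buckets_[index_of(key)];
        for (auto& entry : chain) {
            if (entry.first == key) { entry.second = value; return; }  // overwrite existing key
        }
        chain.emplace_front(key, value);
        ++size_;
    }

    // Walks only the one chain the key hashes to; a miss costs the full chain length.
    const int* find(int key) const {
        for (const auto& entry : buckets_[index_of(key)]) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

    // M / N in the notation above: total values divided by bucket count.
    double average_chain_length() const {
        return static_cast<double>(size_) / static_cast<double>(buckets_.size());
    }

private:
    std::size_t index_of(int key) const { return std::hash<int>{}(key) % buckets_.size(); }

    std::vector<std::forward_list<std::pair<int, int>>> buckets_;
    std::size_t size_ = 0;
};
```

With 16 values in 8 buckets, `average_chain_length()` is 2, so a "negative" lookup compares against about two keys on average if the hash spreads them evenly.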

Related

Implementation of an Unordered Map

I'm trying to understand Unordered Maps and Hashing. As I understand it, an Unordered Map has a Hash function inside of it that takes an object of type T and returns an int, which is then used as an index into an internal array. Each array position holds a List of objects of type T, so that if there's something already in the spot, additions are inserted into the List.
Conceptually, would using a Set instead of a List improve efficiency?
(maybe somehow binary search and a Set being ordered helps over having a List)
Or maybe a Vector instead of the List?
(maybe random access helps over the List.)
The datatype should not matter much, because in most cases, the container at the hashed index only contains zero or one element. If you regularly have many elements there, the hash map degrades in performance anyway. The remedy for that is to resize the initial array, which std::unordered_map<> does itself. However, if you have a bad hash function which causes many hash collisions, switching the hash function is necessary for proper operation.
If there are often a lot of collisions at the same bucket, then using a set is more efficient than using a list, and indeed some Java hash table implementations have adopted sets for this reason. Vectors can't be used for std::unordered_map or std::unordered_set implementations, as they need to reallocate to a different memory area when grown past their capacity, whilst the Standard requires that the elements in an unordered container are never moved by other operations on the container.
That said, the nature of hash tables is that - with a high quality hash function - the statistical distribution of number-of-elements colliding in particular buckets relates only to the load factor. If you can't trust the collisions not to get out of control, perhaps you shouldn't be using that hash function.
Some details: Standard-library unordered containers have a default max_load_factor() (load_factor() is the ratio of size() to bucket_count()) of 1.0, and with a strong pseudo-randomizing hash function they'll have 1/e ~= 36.8% of buckets empty, as many with one element, half that with 2 elements (~18.4%), a third of that with 3 elements (~6.13%), a quarter of that with 4 elements (~1.53%), a fifth of that with 5 elements (~0.3%), a sixth of that with 6 elements (~0.05%). As you can hopefully see, it's incredibly rare to have to search through many elements (even in the worst case scenario where the hash table is at its max load factor), so a list approach is usually adequate.
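The percentages above follow from a Poisson distribution whose mean is the load factor, so they can be recomputed for other load factors. A small sketch, with `bucket_fraction` as a made-up helper name and the Poisson model as the stated assumption:

```cpp
#include <cmath>

// Expected fraction of buckets holding exactly k elements, assuming the hash
// scatters keys like a uniform random function (Poisson with mean load_factor).
double bucket_fraction(unsigned k, double load_factor) {
    double factorial = 1.0;
    for (unsigned i = 2; i <= k; ++i) factorial *= i;
    return std::exp(-load_factor) * std::pow(load_factor, k) / factorial;
}
```

At a load factor of 1.0 this gives 1/e for both k = 0 and k = 1, and each further k divides the previous value by k, which is where the "half that", "a third of that" sequence comes from.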

Is the complexity of unordered_set::find predictable?

While looking for a container suitable for an application I'm building, I ran across documentation for unordered_set. Given that my application typically requires only insert and find functions, this class seems rather attractive. I'm slightly put off, however, by the fact that find is O(1) amortized, but O(n) worst case - I would be using the function frequently, and it could make or break my application. What causes the spike in complexity? Is the likelihood of running into an O(n) search predictable?
An unordered_set is implemented as a hash table. One common implementation uses a container (such as a vector) of hash buckets, where each bucket is itself a container (such as a list) holding the elements of the unordered_set that fall in that bucket.
When an element is inserted into the unordered_set, a hash function is applied to it, which gives you the bucket where it is placed.
Several inserted elements can end up in the same bucket. When you look up an element, the hash function is applied to get the bucket, and you then have to walk through that bucket's elements to find the one you are looking for.
The worst-case scenario is that all elements end up in the same bucket. Depending on the container used to store the elements of a bucket, the search then takes O(n) time.
Whether elements end up in the same bucket depends on the hash function (how good it is) and on the elements themselves (which could expose a specific weakness of the hash function).
Normally you cannot predict the elements. If they are predictable enough in your case, you could select a hash function that spreads that kind of element evenly.
To speed up search, the key point is to use a good hash function that distributes the elements evenly over the buckets and, if needed, to rehash to increase the number of buckets (take care with this option, since the hash function will be applied to all elements).
If the storage of those elements is that important for your application, I suggest you run performance tests with data as close as possible to production data and decide from there. That said, the STL containers, and especially containers of the same group (associative, for example), share almost the same interface, so it is easy to swap one for another with little or no change to the code that uses them.
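The worst case is easy to provoke on purpose, which also suggests a way to check real data for it. In this sketch `ConstantHash` is a deliberately bad, hypothetical hash and `longest_bucket` is a made-up helper built on the standard bucket interface.

```cpp
#include <cstddef>
#include <unordered_set>

// Deliberately bad hash: every key lands in the same bucket.
struct ConstantHash {
    std::size_t operator()(int) const noexcept { return 0; }
};

// Size of the largest bucket, i.e. the longest chain find() might have to walk.
template <class Set>
std::size_t longest_bucket(const Set& s) {
    std::size_t longest = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        if (s.bucket_size(b) > longest) longest = s.bucket_size(b);
    }
    return longest;
}
```

With `std::unordered_set<int, ConstantHash>`, `longest_bucket` equals the number of elements, so every `find` is O(n). Running the same helper on a set filled with production-like data gives a quick read on how close the real hash comes to that.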

Hash table in C++

Is the insertion/deletion/lookup time of a C++ std::map O(log n)? Is it possible to implement an O(1) hash table?
Is the insertion/deletion/lookup time of a C++ map O(log n)?
Yes.
Is it possible to implement an O(1) hash table?
Definitely. The standard library also provides one as std::unordered_map.
C++ has an unordered_map type. The STL also contains a hash_map type, although this is not in the C++ standard library.
Now, for a bit of algorithmic theory. It is possible to implement an O(1) hash table under perfect conditions, and technically, hash tables are O(1) insertion and lookup. The perfect conditions in this case are that the hash function must be perfect (i.e. collision free), and you have infinite storage.
In practice, let's take a dumb hash table whose hash function returns 1 for any input key. In this case, when there is a collision (i.e. on the second and subsequent insertions), it will have to chain further to find some free space. It can either go to the next storage location, or use a linked list for this.
In any case, in the best case, yes, hash tables are O(1) (until you have exhausted all of your hash values, of course, since it is impractical to have a hash function with an infinite amount of output). In the worst case (e.g. with my completely dumb hash function), hash tables are O(n), since you will have to traverse over the storage in order to find your actual value from the given hash, since the initial value is not the correct value.
The implementation of std::map is a tree. This is not directly specified in the standard, but as some good books are saying: "It is difficult to imagine that it can be anything else". This means that the insertion/deletion/lookup time for map is O(log n).
Classic hash tables have lookup time O(n/num_slots). Once the expected number of items in the table is comparable with the number of slots, you will have saturated O(1).

what the difference between map and hashmap in STL [duplicate]

In the C++ STL there are two maps, map and hashmap. Does anyone know the main difference between them?
map uses a red-black tree as the data structure, so the elements you put in there are sorted, and insert/delete is O(log(n)). The elements need to implement at least operator<.
hashmap uses a hash, so elements are unsorted, insert/delete is O(1). Elements need to implement at least operator== and you need a hash function.
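The "sorted versus unsorted" difference is visible in iteration order. A minimal sketch, using `std::unordered_map` as the standard hash-based container and a hypothetical helper `keys_in_iteration_order`:

```cpp
#include <map>
#include <unordered_map>
#include <vector>

// Collects keys in the order the container iterates them.
template <class Map>
std::vector<typename Map::key_type> keys_in_iteration_order(const Map& m) {
    std::vector<typename Map::key_type> keys;
    keys.reserve(m.size());
    for (const auto& entry : m) keys.push_back(entry.first);
    return keys;
}
```

For a `std::map<int, char>` filled with keys 3, 1, 2 the result is always 1, 2, 3. For the equivalent `std::unordered_map` the same three keys come back in an order that depends on the implementation and the bucket count.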
hash_map uses a hash table. This is "constant" time in theory. Most implementations use a "collision" hash table. What happens in reality is:
It creates a big table
You have a "hash" function for your object that generates you a random place in the table (random-looking, but the hash function will always return the same value for your object) and usually this is the mod of the actual 32-bit (or 64-bit) hash value with the size of the table.
The table looks to see if the space is available. If so it places the item in the table. If not it checks if the element there is the one you are trying to insert. If so it is a duplicate so no insert. If not, this is called a "collision" and it uses some formula to find another cell and this continues until it either finds a duplicate or an empty cell.
When the table gets filled up too much it resizes. An efficient (in time) implementation will store all the original hash values together with the elements so it won't need to recalculate the hashes when it does this. In addition, comparing the hashes is usually faster than comparing the elements, so it can do this whilst searching to eliminate most of the collisions as a pre-step.
If you never delete anything it is simple. However, deleting elements adds extra complexity. A cell that had an element in it which has been deleted is in a different state from one that was just empty all along: other elements may have collided with it, and if you simply empty the cell, those elements won't be found. So there is usually some "deleted" mark. Of course, when we now want to reuse the cell, we still have to keep probing in case there is a duplicate further along (in which case we can't insert in this cell), then remember to reuse the deleted cell.
The usual constraint is that your objects must be implemented to check for equality, but Dinkumware (or was it SGI) implemented theirs with operator< which might be slower but has the advantage of decoupling your elements and the type of associated container they can be stored in, although you still need a hash function to store in a hash.
The theory is that if you have a big enough table, the operations are constant time, i.e. it does not depend on the number of actual elements you have. In practice, of course, the more elements you have the more collisions occur.
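The probing and "deleted mark" behaviour described in the steps above can be sketched as a toy class. This is an illustration only, not the layout of any real hash_map: `ProbingSet` is an invented name, it uses linear probing as its "formula", stores plain `int` keys, has a fixed non-zero capacity, and never resizes.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Toy open-addressing set with linear probing and "deleted" marks.
class ProbingSet {
    enum class State { Empty, Full, Deleted };
    struct Cell { State state = State::Empty; int key = 0; };

public:
    // capacity must be greater than zero.
    explicit ProbingSet(std::size_t capacity) : cells_(capacity) {}

    bool contains(int key) const { return locate(key) != npos; }

    // Returns false if the key is already present or no cell is free.
    bool insert(int key) {
        if (contains(key)) return false;  // probes past deleted marks to rule out a duplicate
        std::size_t i = home(key);
        for (std::size_t n = 0; n < cells_.size(); ++n, i = next(i)) {
            if (cells_[i].state != State::Full) {  // reuse the first empty or deleted cell
                cells_[i].state = State::Full;
                cells_[i].key = key;
                return true;
            }
        }
        return false;
    }

    bool erase(int key) {
        const std::size_t i = locate(key);
        if (i == npos) return false;
        cells_[i].state = State::Deleted;  // mark it, so keys stored further along stay reachable
        return true;
    }

private:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    std::size_t home(int key) const { return std::hash<int>{}(key) % cells_.size(); }
    std::size_t next(std::size_t i) const { return (i + 1) % cells_.size(); }

    // Probes from the home cell until the key or a never-used cell is found.
    std::size_t locate(int key) const {
        std::size_t i = home(key);
        for (std::size_t n = 0; n < cells_.size(); ++n, i = next(i)) {
            if (cells_[i].state == State::Empty) return npos;
            if (cells_[i].state == State::Full && cells_[i].key == key) return i;
        }
        return npos;
    }

    std::vector<Cell> cells_;
};
```

If `erase` reset the cell to `Empty` instead of `Deleted`, a lookup for any key that had been pushed past that cell by a collision would stop early and wrongly report it missing.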
std::map uses a balanced binary tree. There is no need to define a hash function for an object, just a strictly ordered comparison. On insertion it descends the tree to find the insertion point (and whether there are any duplicates), adds the node, and may need to rebalance the tree so that its depth stays logarithmic in the number of elements. Rebalancing time is relative to the depth of the tree too, so all these operations are O(log N), where N is the number of elements.
The advantage of the hash table is its complexity.
The advantages of the tree are:
Totally scalable. It only uses what it needs, no need for a huge table or to pre-empt the size of the table, although hash may require less "baggage" per element than a tree.
No need to hash first, which for a good function can take longer than the comparisons would if the data set is not large.
One other issue with std::map is that it uses a single strictly-ordered comparison function whilst a "compare" function that returned -1, 0 or 1 would be a lot more efficient, particularly with the most commonly used key type, std::string, which already has this function implemented (it is char_traits::compare). (This inefficiency is based on the premise that to check that x==y, you check x<y and y<x so you do two comparisons. You would do this just once per lookup).
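The two-comparison premise can be written out directly. In this sketch the function names are hypothetical; `std::string::compare` is the standard three-way member that is specified in terms of the character traits' `compare`.

```cpp
#include <string>

// Equivalence as a container keyed only on operator< has to derive it: two comparisons.
bool equivalent_via_less(const std::string& a, const std::string& b) {
    return !(a < b) && !(b < a);
}

// The same answer from a single three-way comparison.
bool equivalent_via_compare(const std::string& a, const std::string& b) {
    return a.compare(b) == 0;
}
```

Both functions agree for every pair of strings; the difference is only in how many times the characters may have to be scanned.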
map is a red-black tree with O(log(n)) access time. hash_map (which is not standard; unordered_map is the standard equivalent as of C++11) uses (conceptually) a hash of the key as an index into an array of linked lists, and therefore has a best-case access time of O(1) and a worst case of O(n).
See http://en.wikipedia.org/wiki/Red-black_tree
The main difference is the search time.
For small amounts of data, map is better.
For large amounts of data, hashmap is better.
In any case, the technical answers given previously are correct.