Implementation of an Unordered Map - c++

I'm trying to understand unordered maps and hashing. As I understand it, an unordered map has a hash function that takes a key and returns an integer (a std::size_t in C++), which is then reduced to an index into an internal array. Each array position holds a list of entries, so that if there's something already in the spot, additions are inserted into the list.
Conceptually, would using a Set instead of a List improve efficiency?
(maybe somehow binary search and a Set being ordered helps over having a List)
Or maybe a Vector instead of the List?
(maybe random access helps over the List.)

The datatype should not matter much, because in most cases, the container at the hashed index only contains zero or one element. If you regularly have many elements there, the hash map degrades in performance anyway. The remedy for that is to resize the initial array, which std::unordered_map<> does itself. However, if you have a bad hash function which causes many hash collisions, switching the hash function is necessary for proper operation.

If there are often a lot of collisions in the same bucket, then using a set is more efficient than using a list, and indeed Java's HashMap (since Java 8) converts long bucket chains into balanced trees for this reason. Vectors can't be used for std::unordered_map or std::unordered_set implementations, as they need to reallocate to a different memory area when grown past their capacity, whilst the Standard requires that pointers and references to elements in an unordered container stay valid across other operations on the container, including rehashing.
That said, the nature of hash tables is that - with a high quality hash function - the statistical distribution of number-of-elements colliding in particular buckets relates only to the load factor. If you can't trust the collisions not to get out of control, perhaps you shouldn't be using that hash function.
Some details: Standard-library unordered containers have a default max_load_factor() (load_factor() is the ratio of size() to bucket_count()) of 1.0, and with a strong pseudo-randomizing hash function they'll have 1/e ~= 36.8% of buckets empty, as many with one element, half that with 2 elements (~18.4%), a third of that with 3 elements (~6.13%), a quarter of that with 4 elements (~1.53%), a fifth of that with 5 elements (~0.3%), a sixth of that with 6 elements (~0.05%). As you can hopefully see, it's incredibly rare to have to search through many elements (even in the worst case scenario where the hash table is at its max load factor), so a list approach is usually adequate.
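The percentages above follow from the Poisson approximation for uniformly hashed keys: with load factor L, the chance that a bucket holds exactly k elements is roughly e^-L * L^k / k!. Here is a minimal sketch for reproducing the numbers; the function name bucket_occupancy_probability is made up for illustration and is not part of any library.

```cpp
#include <cmath>

// Probability that a bucket holds exactly k elements, assuming keys are
// hashed uniformly at random and the table has the given load factor.
// Poisson approximation: e^-L * L^k / k!
double bucket_occupancy_probability(double load_factor, unsigned k) {
    double p = std::exp(-load_factor);
    for (unsigned i = 1; i <= k; ++i) {
        p *= load_factor / i;  // builds L^k / k! one factor at a time
    }
    return p;
}
```

Calling it with a load factor of 1.0 and k from 0 to 6 should give values close to the percentages quoted above; a smaller load factor shifts even more of the weight toward empty and single-element buckets.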

Related

Time complexity of insert() in unordered_map when adding a std::vector as a value [duplicate]

I need to create a lookup function where a (X,Y) pair corresponds to a specific Z value. One major requirement for this is that I need to do it in as close to O(1) complexity as I can. My plan is to use an unordered_map.
I generally do not use a hash table for lookup, as the lookup time has never been important to me. Am I correct in thinking that as long as I built the unordered_map with no collisions, my lookup time will be O(1)?
My concern then is what the complexity becomes if the key is not present in the unordered map. If I use unordered_map::find(), for example, to determine whether a key is present in my hash table, how will it go about giving me an answer? Does it actually iterate over all the keys?
I greatly appreciate the help.
The standard more or less requires using buckets for collision
resolution, which means that the actual look up time will
probably be linear with respect to the number of elements in the
bucket, regardless of whether the element is present or not.
It's possible to make it O(lg N), but it's not usually done,
because the number of elements in the bucket should be small,
if the hash table is being used correctly.
To ensure that the number of elements in a bucket is small, you
must ensure that the hashing function is effective. What
effective means depends on the types and values being hashed.
(The MS implementation uses FNV, which is one of the best
generic hashes around, but if you have special knowledge of the
actual data you'll be seeing, you might be able to do better.)
Another thing which can help reduce the number of elements per
bucket is to force more buckets or use a smaller load factor.
For the first, you can pass the minimum initial number of
buckets as an argument to the constructor. If you know the
total number of elements that will be in the map, you can
control the load factor this way. You can also force a minimum
number of buckets once the table has been filled, by calling
rehash. Otherwise, there is a function
std::unordered_map<>::max_load_factor which you can use. It
is not guaranteed to do anything, but in any reasonable
implementation, it will. Note that if you use it on an already
filled unordered_map, you'll probably have to call
unordered_map<>::rehash afterwards.
(There are several things I don't understand about the standard
unordered_map: why the load factor is a float, instead of
double; why it's not required to have an effect; and why it
doesn't automatically call rehash for you.)
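As a rough sketch of the knobs mentioned above: the helper name make_presized_map and its parameters are invented for this example, while max_load_factor() and reserve() are the standard member functions (reserve(n) asks for enough buckets to hold n elements without exceeding the current maximum load factor, which saves the separate rehash call).

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Builds an empty map sized for about `expected` elements, with at most
// `max_load` elements per bucket on average.
std::unordered_map<int, std::string> make_presized_map(std::size_t expected,
                                                       float max_load) {
    std::unordered_map<int, std::string> m;
    // The standard treats this value as a hint; mainstream implementations honor it.
    m.max_load_factor(max_load);
    // Allocates enough buckets up front so that inserting `expected`
    // elements should not trigger a rehash.
    m.reserve(expected);
    return m;
}
```

Setting the load factor before calling reserve() matters here, because reserve() sizes the bucket array using whatever maximum load factor is in effect at the time of the call.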
As with any hash table, worst case is always linear complexity (Edit: if you built the map without any collisions like you stated in your original post, then you'll never see this case):
http://www.cplusplus.com/reference/unordered_map/unordered_map/find/
Complexity
Average case: constant.
Worst case: linear in container size.
Return Value
An iterator to the element, if the specified key value is found, or unordered_map::end if the specified key is not found in the container.
However, because an unordered_map can only contain unique keys, you will see average complexity of constant time (container first checks hash index, and then iterates over values at that index).
I think the documentation for unordered_map::count function is more informative:
Searches the container for elements whose key is k and returns the
number of elements found. Because unordered_map containers do not
allow for duplicate keys, this means that the function actually
returns 1 if an element with that key exists in the container, and
zero otherwise.
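For the (X, Y) to Z lookup in the question, a minimal sketch might look like the following. std::hash has no specialization for std::pair, so a hash functor has to be supplied; PairHash, Grid and lookup are names made up for this example, and the mixing step is just one simple choice (in the style of boost::hash_combine), not a tuned hash.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hash functor for (x, y) keys; hypothetical helper for this example.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        std::size_t h1 = std::hash<int>{}(p.first);
        std::size_t h2 = std::hash<int>{}(p.second);
        // Combine the two hashes so that (1, 2) and (2, 1) usually differ.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

using Grid = std::unordered_map<std::pair<int, int>, double, PairHash>;

// Returns a pointer to the Z value for (x, y), or nullptr if the key is absent.
// find() only inspects the single bucket the key hashes to.
const double* lookup(const Grid& grid, int x, int y) {
    auto it = grid.find(std::make_pair(x, y));
    return it == grid.end() ? nullptr : &it->second;
}
```

A miss costs the same as a hit here: the key is hashed once and only one bucket is examined, so the map never walks all of its keys.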
Having no collisions at all in a hashed data structure is incredibly difficult (if not impossible for a given hash function and arbitrary data), and it would require at least as many buckets as there are keys. Fortunately, it does not need to be that strict. As long as the hash function distributes the values in a relatively uniform way, you will have O(1) lookup complexity on average.
Hash tables are generally just arrays with linked lists taking care of the collisions (this is the chaining method - there are other methods, but this is likely the most utilized way of dealing with collisions). Thus, to find if a value is contained within a bucket, it will have to (potentially) iterate over all the values in that bucket. So if the hash function gives you a uniform distribution, and there are N buckets, and a total of M values, there should be (on average) M/N values per bucket. As long as this value is not too large, this allows O(1) lookup.
So, as a bit of a long winded answer to your question, as long as the hashing function is reasonable, you will get O(1) lookup, with it having to iterate over (on average) O(M/N) keys to give you a "negative" result.
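The chaining method described above can be sketched as a toy container. This is an illustration only, not how any particular standard library implements std::unordered_map: the class name ChainedSet is invented, the bucket count is fixed (no resizing), and it must be greater than zero.

```cpp
#include <cstddef>
#include <forward_list>
#include <functional>
#include <string>
#include <vector>

// Toy chained hash set of strings with a fixed number of buckets.
class ChainedSet {
public:
    // bucket_count must be greater than zero.
    explicit ChainedSet(std::size_t bucket_count) : buckets_(bucket_count) {}

    // Returns false if the key was already present.
    bool insert(const std::string& key) {
        auto& chain = buckets_[index_of(key)];
        for (const auto& k : chain) {
            if (k == key) return false;
        }
        chain.push_front(key);
        return true;
    }

    // Walks only the one chain the key hashes to, whether or not the key exists.
    bool contains(const std::string& key) const {
        for (const auto& k : buckets_[index_of(key)]) {
            if (k == key) return true;
        }
        return false;
    }

private:
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::forward_list<std::string>> buckets_;
};
```

With M keys spread over N buckets, contains() compares against about M/N keys on a miss, which is why keeping the load factor bounded keeps lookups at constant average cost.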

C++ Data Structure that would be best to hold a large list of names

Can you share your thoughts on what the best STL data structure would be for storing a large list of names and perform searches on these names?
Edit:
The names are not unique and the list can grow, as new names can continuously be added to it. And by large I am talking of from 1 million to 10 million names.
Since you want to search names, you want a structure that supports fast lookup by value. That rules out unsorted vector, deque and list, which all need a linear scan. A sorted vector/array can be binary-searched, but it is slow on inserts in the middle because it has to shift items to make room for each inserted item. Adding to the end is very fast, though.
Consider std::map, std::unordered_map or std::unordered_multimap (or their siblings std::set, std::unordered_set and std::unordered_multiset if you are only storing keys).
If you are purely going to do lookups by exact name, I'd start with one of the unordered_* containers (the multi variants if duplicates must be kept as separate entries).
If you need to store an ordered list of names, and need to do range searches/iteration and sorted operations, a tree-based container like std::map or std::set should do better than a hash-based container, because iterating a tree visits items in sorted order, each one next to its logical predecessor and successor. Lookup is O(log N), which is still decent.
Prior to std::unordered_*, I used std::map to hold large numbers of objects for an object cache, and though there are faster lookup containers, it scaled well enough for our uses. The newer unordered_map is a hashed structure, so it has O(1) average access time and should give you close to the best lookup times.
You can consider concatenating the names into one buffer separated by a delimiter, but searching might take a hit. You would need to come up with an adjusted binary search.
But you should try the obvious solution first, which is a hash map, called unordered_map in the standard library. See if that meets your needs. Searching should be plenty fast there, but at a cost in memory.
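Since the names are not unique, one option is to store each distinct name once together with a count. Below is a small sketch assuming that exact-match lookup is all that is needed; the class name NameIndex and its members are invented for this example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Counts how many times each name has been added.
class NameIndex {
public:
    void add(const std::string& name) { ++counts_[name]; }

    // Number of times `name` was added; 0 if it was never seen.
    std::size_t occurrences(const std::string& name) const {
        auto it = counts_.find(name);
        return it == counts_.end() ? 0 : it->second;
    }

    // Number of distinct names stored.
    std::size_t distinct() const { return counts_.size(); }

private:
    std::unordered_map<std::string, std::size_t> counts_;
};
```

If each occurrence carries its own payload rather than just a count, std::unordered_multimap would be the closer fit.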

Dynamic array with id?

I need some sort of dynamic array in C++ where each element has its own ID, represented by an int.
The datatype needs these functions:
int Insert() - return ID
Delete(int ID)
Get(ID) - return Element
What datatype should I use? I've looked at vector and list, but can't seem to find any sort of ID. I've also looked at map and hash table; these may be useful. I'm however not sure what to choose.
I would probably use a vector and free id list to handle deletions, then the index is the id. This is really fast to insert and get and fairly easy to manage (the only trick is the free list for deleted items).
Otherwise you probably want to use a map and just keep track of the lowest unused id and assign it upon insertion.
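The vector-plus-free-list idea above can be sketched roughly as follows. The class name IdPool, the std::string element type, and the choice to return nullptr for unknown IDs are all assumptions made for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stores values in a vector; the index is the ID. Deleted IDs are kept in
// a free list and handed out again by later inserts.
class IdPool {
public:
    // Stores a value and returns its ID, reusing a freed slot when available.
    int insert(const std::string& value) {
        if (!free_ids_.empty()) {
            int id = free_ids_.back();
            free_ids_.pop_back();
            slots_[id] = value;
            live_[id] = true;
            return id;
        }
        slots_.push_back(value);
        live_.push_back(true);
        return static_cast<int>(slots_.size()) - 1;
    }

    // Frees the slot; returns false if the ID is not currently in use.
    bool erase(int id) {
        if (!contains(id)) return false;
        live_[id] = false;
        free_ids_.push_back(id);
        return true;
    }

    bool contains(int id) const {
        return id >= 0 && static_cast<std::size_t>(id) < slots_.size() && live_[id];
    }

    // Returns nullptr for unknown or deleted IDs.
    const std::string* get(int id) const {
        return contains(id) ? &slots_[id] : nullptr;
    }

private:
    std::vector<std::string> slots_;  // slot i holds the element with ID i
    std::vector<bool> live_;          // whether slot i is currently in use
    std::vector<int> free_ids_;       // IDs available for reuse
};
```

Insert, delete and get are all constant time (amortized for insert). The trade-off is that IDs get recycled, so a stale ID held by a caller may later refer to a different element.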
A std::map could work for you; it associates a key with a value. The key would be your ID, but you have to provide it yourself when adding an element to the map.
A hash table is the basic mechanism used to implement an unordered map. It corresponds to std::unordered_map.
It seems that the best container to use is unordered_map.
It is based on hashing. You can insert, delete or search for elements in O(1) on average (O(n) in the worst case).
std::unordered_map has been part of the standard library since C++11. With an older compiler, use std::map (or the TR1/Boost unordered containers).
std::map is based on a tree. It inserts, deletes and searches for elements in O(log n).
Still, the container choice depends a lot on how intensively it is used. For example, if you will search for elements only rarely, vector and list could be OK. These containers do not have a find method, but the <algorithm> library provides std::find.
A vector gives constant-time random access, the "id" can simply be the offset (index) into the vector. A deque is similar, but doesn't store all items contiguously.
Either of these would be appropriate, if the ID values can start at 0 (or a known offset from 0 and increment monotonically). Over time if there are a large amount of removals, either vector or deque can become sparsely populated, which may be detrimental.
std::map doesn't have the problem of becoming sparsely populated, but look ups move from constant time to logarithmic time, which could impact performance.
boost::unordered_map may be the best yet, as a hash table will likely have the best overall performance characteristics given the question. This requires the Boost library, but there are also unordered container types in std::tr1 if available in your standard library implementation, and std::unordered_map itself from C++11 onwards.

what the difference between map and hashmap in STL [duplicate]

In the C++ STL, there are two maps, map and hashmap. Does anyone know the main difference between them?
map uses a red-black tree as the data structure, so the elements you put in there are sorted, and insert/delete is O(log(n)). The elements need to implement at least operator<.
hashmap uses a hash, so elements are unsorted, and insert/delete is O(1) on average. Elements need to implement at least operator== and you need a hash function.
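The ordering difference is easy to observe by collecting keys in iteration order. A small sketch follows; the helper name keys_in_iteration_order is invented for the example.

```cpp
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Collects the keys of a map-like container in the container's own
// iteration order.
template <typename Map>
std::vector<std::string> keys_in_iteration_order(const Map& m) {
    std::vector<std::string> keys;
    for (const auto& kv : m) {
        keys.push_back(kv.first);
    }
    return keys;
}
```

For a std::map<std::string, int> the result is always sorted by key. For a std::unordered_map<std::string, int> the order depends on the hash function and bucket count, and should not be relied on.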
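hash_map uses a hash table, which is "constant" time in theory. One family of implementations resolves collisions by probing for another cell inside the table itself (open addressing); note that the bucket interface of std::unordered_map effectively implies separate chaining instead. With probing, what happens is: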
It creates a big table
You have a "hash" function for your object that generates you a random place in the table (random-looking, but the hash function will always return the same value for your object) and usually this is the mod of the actual 32-bit (or 64-bit) hash value with the size of the table.
The table looks to see if the space is available. If so it places the item in the table. If not it checks if the element there is the one you are trying to insert. If so it is a duplicate so no insert. If not, this is called a "collision" and it uses some formula to find another cell and this continues until it either finds a duplicate or an empty cell.
When the table gets filled up too much it resizes. An efficient (in time) implementation will store all the original hash values together with the elements so it won't need to recalculate the hashes when it does this. In addition, comparing the hashes is usually faster than comparing the elements, so it can do this whilst searching to eliminate most of the collisions as a pre-step.
If you never delete anything it is simple. However, deleting elements adds extra complexity. A cell that had an element in it which has been deleted is in a different state from one that was just empty all along, as you may have had collisions, and if you just empty it, the elements that probed past it won't be found. So there is usually some "deleted" mark. Of course, when we want to reuse such a cell, we still have to keep probing in case there is a duplicate further along (in which case we can't insert at all), and then remember to reuse the deleted cell.
The usual constraint is that your objects must be implemented to check for equality, but Dinkumware (or was it SGI) implemented theirs with operator< which might be slower but has the advantage of decoupling your elements and the type of associated container they can be stored in, although you still need a hash function to store in a hash.
The theory is that if you have a big enough table, the operations are constant time, i.e. it does not depend on the number of actual elements you have. In practice, of course, the more elements you have the more collisions occur.
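The probing scheme with "deleted" marks described above can be sketched as a toy set of ints. This is a simplified illustration, not the layout of any real hash_map: the class name ProbingSet is invented, it uses linear probing as its "formula", the capacity is fixed (no resizing) and must be greater than zero.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Toy fixed-capacity open-addressing set of ints with linear probing.
class ProbingSet {
public:
    // capacity must be greater than zero.
    explicit ProbingSet(std::size_t capacity) : cells_(capacity) {}

    // Returns false if the key is a duplicate or no cell is free.
    bool insert(int key) {
        const std::size_t n = cells_.size();
        const std::size_t start = std::hash<int>{}(key) % n;
        std::size_t target = n;  // first reusable cell seen; n means none yet
        for (std::size_t step = 0; step < n; ++step) {
            const std::size_t i = (start + step) % n;
            const Cell& c = cells_[i];
            if (c.state == State::Full && c.key == key) return false;  // duplicate
            if (c.state == State::Deleted && target == n) target = i;  // remember it
            if (c.state == State::Empty) {
                if (target == n) target = i;
                break;  // an empty cell ends the probe sequence
            }
        }
        if (target == n) return false;  // table is full
        cells_[target].key = key;
        cells_[target].state = State::Full;
        return true;
    }

    bool contains(int key) const { return find_index(key) != cells_.size(); }

    // Marks the cell as deleted instead of emptying it, so keys that probed
    // past this cell can still be found.
    bool erase(int key) {
        const std::size_t i = find_index(key);
        if (i == cells_.size()) return false;
        cells_[i].state = State::Deleted;
        return true;
    }

private:
    enum class State { Empty, Full, Deleted };
    struct Cell {
        int key = 0;
        State state = State::Empty;
    };

    // Returns the index of the key, or cells_.size() if it is absent.
    std::size_t find_index(int key) const {
        const std::size_t n = cells_.size();
        const std::size_t start = std::hash<int>{}(key) % n;
        for (std::size_t step = 0; step < n; ++step) {
            const std::size_t i = (start + step) % n;
            const Cell& c = cells_[i];
            if (c.state == State::Empty) return n;  // never-used cell: stop
            if (c.state == State::Full && c.key == key) return i;
        }
        return n;
    }

    std::vector<Cell> cells_;
};
```

Note how insert() keeps scanning after it sees a deleted cell: it can only reuse that cell once it has ruled out a duplicate further along the probe sequence.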
std::map uses a balanced binary tree (typically a red-black tree). There is no need to define a hash function for an object, just a strictly ordered comparison. On insertion it walks down the tree to find the insertion point (and whether there are any duplicates) and adds the node, and may need to rebalance the tree so that it stays balanced (in a red-black tree, the longest path to a leaf is at most twice the shortest). Rebalancing time is relative to the depth of the tree too, so all these operations are O(log N), where N is the number of elements.
The advantage of the hash is the complexity: constant time on average instead of O(log N).
The advantages of the tree are:
Totally scalable. It only uses what it needs, no need for a huge table or to pre-empt the size of the table, although hash may require less "baggage" per element than a tree.
No need to hash first, which for a good function can take longer than the comparisons would if the data set is not large.
One other issue with std::map is that it uses a single strictly-ordered comparison function whilst a "compare" function that returned -1, 0 or 1 would be a lot more efficient, particularly with the most commonly used key type, std::string, which already has this function implemented (it is char_traits::compare). (This inefficiency is based on the premise that to check that x==y, you check x<y and y<x so you do two comparisons. You would do this just once per lookup).
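To make the one-versus-two comparisons point concrete, here is a small sketch with std::string keys. The two function names are invented for the example; std::string::compare is the standard three-way comparison member.

```cpp
#include <string>

// Equivalence derived from operator< alone: up to two full comparisons.
bool equivalent_via_less(const std::string& a, const std::string& b) {
    return !(a < b) && !(b < a);
}

// The same answer from a single three-way comparison.
bool equivalent_via_compare(const std::string& a, const std::string& b) {
    return a.compare(b) == 0;
}
```

Both functions return the same result for any pair of strings; the second one scans the characters only once.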
map is a red-black tree, O(log(n)) access time. hash_map (which is not standard; its standard counterpart is unordered_map, added in C++11) uses (conceptually) a hash of the key as an index in an array of linked lists, and therefore has a best-case access time of O(1) and a worst case of O(n).
See http://en.wikipedia.org/wiki/Red-black_tree
The main difference is the searching time.
For small amounts of data, map is better.
For large amounts of data, hashmap is better.
Anyway, the technical answers given previously are correct.