Downsides of abusing O(1) lookup of a hash table? - c++

Hash tables are very common data structures used for coding problems presented in competitive programming/interviews.
Hash tables store key-value pairs so that you can look up a key and get the value. However, I often find myself needing the O(1) lookup of a key and not really caring about the value.
For example:
If I need to know if some strings have been used previously, I might plug them into a hash table with key: string, value: bool where the value of the bool is always true.
What are the downsides of doing something like this? Are there other data structures that give O(1) lookup and don't need a key-value pair?
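For illustration, a minimal sketch of the pattern described above (the names are made up), where the mapped bool never carries any real information:

    #include <string>
    #include <unordered_map>

    int main() {
        // Key: string, value: bool -- but the bool is always true.
        std::unordered_map<std::string, bool> used;
        used["hello"] = true;

        // The membership test only ever looks at the key.
        bool seen = used.find("hello") != used.end();
        (void)seen;
    }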

You should use a data structure the way it's intended to be used. And then you can profile your code to see if the performance is adequate. If it isn't, then optimize bottlenecks.
Having said that, a better data structure to check if a string has already been used would be std::unordered_set or std::set. Your use case is a typical use case for a set data structure. Wikipedia:
In computer science, a set is an abstract data type that can store unique values, without any particular order. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

If the sole purpose of this container is to test whether a string has already been used, then unordered_set (a collection of unique keys, hashed by key) would do the trick.
Unordered set is an associative container that contains a set of unique objects of type Key. Search, insertion, and removal have average constant-time complexity.
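As a rough sketch of that suggestion (identifiers are illustrative), the same membership test with std::unordered_set needs no dummy value:

    #include <string>
    #include <unordered_set>

    int main() {
        std::unordered_set<std::string> used;   // keys only, no mapped value

        used.insert("hello");

        // Average O(1) membership test -- exactly what the question asks for.
        if (used.count("hello") > 0) {
            // the string has been seen before
        }
    }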

Unless you have a perfect hash algorithm you will have collisions, and at that point you have to look at your actual hash table implementation. Each implementation has its advantages and disadvantages; currently, std::unordered_map (and std::unordered_set) implementations use buckets of linked lists, which turns your O(1) lookup into a worst case of O(N).
The average is still O(1), so if your system is not time-critical and you use good hashing (as of a few years ago, the string hash used an academically well-regarded algorithm), then use a hash set. If, on the other hand, you cannot accept a worst-case O(N) lookup, use std::map or std::set.
Note 1: all hash tables have an Achilles heel that makes them worst case O(N), unless the buckets are implemented as balanced trees.
Note 2: Perfect hashing is without collisions.
Note 3: there is some literature on how to build a perfect hash from known data; I haven't seen any yet on perfect hashes for unknown data.
Note 4: if you find the latter then you might be able to prove P = NP.

Related

which container from std::map or std::unordered_map is suitable for my case

I don't know how a red-black tree works with string keys. I've already seen it with numbers on YouTube and it baffled me a lot. However, I know very well how unordered_map works (the internals of hash maps). std::map remains esoteric to me, but I read and tested that if we don't make many changes to the std::map, it can beat hash maps.
My case is simple: I have a std::map<std::string,bool>. The keys contain paths to XML elements (example of a key: "Instrument_Roots/Instrument_Root/Rating_Type"), and I use the Boolean value in my SAX parser to know whether we have reached a particular element.
I build this map "only once"; then all I do is use std::find to check whether a particular "key" ("path") exists in order to set its Boolean value to true, or search for the first element that has "true" as its associated value and use its corresponding "key", and finally I set all the Boolean values to false to guarantee that only a single "key" has a "true" Boolean value.
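A rough sketch of the workflow described above, under the assumption that the map's own find is what does the key lookup (the path and function names are illustrative):

    #include <map>
    #include <string>

    // Built "only once" at start-up; the paths are examples, not real data.
    std::map<std::string, bool> reached = {
        {"Instrument_Roots/Instrument_Root/Rating_Type", false}
    };

    void on_element(const std::string& path) {
        auto it = reached.find(path);   // does this "key" (path) exist?
        if (it != reached.end())
            it->second = true;          // mark the element we just reached
    }

    void reset_flags() {
        for (auto& kv : reached)        // guarantee at most one key is ever true
            kv.second = false;
    }

    int main() {
        on_element("Instrument_Roots/Instrument_Root/Rating_Type");
        reset_flags();
    }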
You shouldn't need to understand how red-black trees work in order to understand how to use a std::map. It's simply an associative array where the keys are in order (lexicographical order, in the case of string keys, at least with the default comparison function). That means that you can not only look keys up in a std::map, you can also make queries which depend on order. For example, you can find the largest key in the map which is not greater than the key you have. You can find the next larger key. Or (again in the case of strings) you can find all keys which start with the same prefix.
If you iterate over all the key-value pairs in a std::map, you will see them in order by key. That can be very useful, sometimes.
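For example, here is a small sketch (C++17, hypothetical data) of the order-based queries mentioned above, which an unordered container cannot offer:

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, int> m = {{"apple", 1}, {"apricot", 2}, {"banana", 3}};

        // Iteration visits the keys in lexicographical order.
        for (const auto& [key, value] : m)
            std::cout << key << ' ' << value << '\n';

        // Order-based query: every key that starts with the prefix "ap".
        const std::string prefix = "ap";
        for (auto it = m.lower_bound(prefix);
             it != m.end() && it->first.compare(0, prefix.size(), prefix) == 0;
             ++it)
            std::cout << it->first << '\n';
    }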
The extra functionality comes at a price. std::map is usually slower than std::unordered_map (though not always; for large string keys, the overhead of computing the hash function might be noticeable), and the underlying data structure has a certain amount of overhead, so they may occupy more space. The usual advice is to use a std::map if you find the fact that the keys are ordered to be essential or even useful.
But if you've benchmarked and concluded that for your application, a std::map is also faster, then go ahead and use it :)
It is occasionally useful to have a map whose mapped type is bool, but only if you need to distinguish between keys whose corresponding value is false and keys which are not present in the map at all. In effect, a std::map<T, bool> (or std::unordered_map<T, bool>) provides a ternary choice for each possible key.
If you don't need to distinguish between the two false cases, and you don't frequently change a key's value, then you may well be better off with a std::set (or std::unordered_set), which is exactly the same data structure but without the overhead of the bool in each element. (Although only one bit of the bool is useful, alignment considerations may end up using 8 additional bytes for each entry.) Other than storage space, there won't be much (if any) performance difference, though.
If you do really need a ternary case, then you would be well-advised to make the value an enum rather than a bool. What do true and false mean in the context of your usage? My guess is that they don't mean "true" and "false". Instead, they mean something like "is an attribute path" and "is an element path". That distinction could be made much clearer (and therefore less accident-prone) by using enum PathType {ATTRIBUTE_PATH, ELEMENT_PATH};. That will not involve any additional resources, since the bool is occupying eight bytes of storage in any case (because of alignment).
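A minimal sketch of that suggestion, reusing the path from the question (the enum values are the ones proposed above):

    #include <map>
    #include <string>

    // Same storage cost as bool after alignment, but self-documenting.
    enum PathType { ATTRIBUTE_PATH, ELEMENT_PATH };

    std::map<std::string, PathType> paths = {
        {"Instrument_Roots/Instrument_Root/Rating_Type", ELEMENT_PATH}
    };

    int main() {
        // Reads more clearly than a bare `true` at the call site.
        bool is_element = paths.at("Instrument_Roots/Instrument_Root/Rating_Type") == ELEMENT_PATH;
        return is_element ? 0 : 1;
    }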
By the way, there is no guarantee that the underlying data structure is precisely a red-black tree, although the performance guarantees would be difficult to achieve without some kind of self-balancing tree. I don't know of such an implementation, but it would be possible to use k-ary trees (for some small k) to take advantage of SIMD vector comparison operations, for example. Of course, that would need to be customized for appropriate key types.
If you do want to understand red-black trees, you could do worse than Robert Sedgewick's standard textbook on Algorithms. On the book's website, you'll find a brief illustrated explanation in the chapter on balanced trees.
I would recommend using std::unordered_set, because you really don't need to store this Boolean flag and you also don't need to keep these XML tags in sorted order, so std::unordered_set seems to me the logical and most efficient choice.

Would an unordered_map be a good choice?

I'm wondering if an unordered_map would be a good choice as container for my specific problem. What I've read about maps does not really cover my case, which is:
The container will store between 100 and 500 objects (not int/double...).
The size will never change.
The order is not important, as the objects themselves contain some kind of "index".
Very often (!) I need to filter all elements in the container that have some property (e.g. have color==blue).
Currently I use vectors, which works. However, if an unordered_map, for example, would improve performance (with regard to "filtering"), I could imagine changing that.
std::unordered_map wouldn't really help you if you have multiple search criteria (sometimes color == blue, sometimes flavour == up), because maps only offer fast query on a single, pre-determined key.
I'd say std::vector is just fine for you, ideally wrapped in your own structure which will provide the lookup interface. If profiling later tells you this is not fast enough, you could build your own indexes above such data. You wouldn't even have to do that manually, boost::multi_index is a generic container designed for multiple-criterion lookup.
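A hedged sketch of the "vector wrapped in your own structure" idea (the Object fields and names are invented for illustration):

    #include <string>
    #include <vector>

    struct Object {
        std::string color;    // illustrative properties, not from the question
        std::string flavour;
    };

    // Thin wrapper that owns the data and exposes only the lookup the caller needs.
    struct ObjectStore {
        std::vector<Object> items;

        std::vector<const Object*> with_color(const std::string& c) const {
            std::vector<const Object*> out;
            for (const auto& o : items)    // linear scan: fine for 100-500 elements
                if (o.color == c)
                    out.push_back(&o);
            return out;
        }
    };

    int main() {
        ObjectStore store;
        store.items.push_back({"blue", "up"});
        auto blues = store.with_color("blue");   // the frequent "filter" query
        return static_cast<int>(blues.size());
    }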
I would use a vector, or simply an array, for storing the actual data, and have a few maps that map a key to a pointer into that data.
This gives higher memory usage, but if searching by different indexes is needed often, you may be willing to sacrifice a bit of memory.
A hash table (which std::unordered_map is) provides constant-time lookup for one key (key-value pair). However, its constant factors are always higher (i. e. the lookup is slower) than a simple array (which provides constant-time lookup for integer indices).
If you need to filter a collection of elements based on some criteria, then you need to inspect each individual element. In this case, a hash table would be strictly worse than an array/vector performance-wise, since its computational complexity is the same as that of array indexing, but with worse constant factors.
So no, there's no reason why you would want to use an unordered_map in this case.

Why is std::map a red-black tree and not a hash table?

This is very strange to me; I expected it to be a hash table.
I saw 3 reasons in the following answer (which may be correct, but I don't think they are the real reason).
Hash tables v self-balancing search trees
Although hashing might not be a trivial operation, I think that for most types it is pretty simple.
When you use a map you expect something that will give you amortized O(1) insert, delete, and find, not log(n).
I agree that trees have better worst-case performance.
I think that there is a bigger reason for that, but I can't figure it out.
In C#, for example, Dictionary is a hash table.
It's largely a historical accident. The standard containers (along with iterators and algorithms) were one of the very last additions before the feature set of the standard was frozen. As it happened, they didn't have what they considered an adequate definition of a hash-based map at the time, and there wasn't time to add it before features were frozen, so the original specification included only a tree-based map.
C++11 added std::unordered_map (as well as std::unordered_set and multi versions of both), which are based on hashing.
The reason is that map is explicitly called out as an ordered container. It keeps the elements sorted and allows you to iterate over them in sorted order in linear time. A hash table couldn't fulfill those requirements.
In C++11 they added std::unordered_map, which is a hash table implementation.
A hash table requires an additional hash function. The current implementation of map, which uses a tree, can work without an extra hash function by using operator<. Additionally, the map allows sorted access to elements, which may be beneficial for some applications. With C++11 we now have the hash versions available in the form of unordered_map and unordered_set.
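A small sketch (C++17, made-up data) contrasting the two: std::map only needs operator< on the key and iterates in sorted order, while std::unordered_map needs a hash function and iterates in no particular order:

    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    int main() {
        std::map<std::string, int> ordered = {{"cherry", 3}, {"apple", 1}, {"banana", 2}};
        for (const auto& [k, v] : ordered)
            std::cout << k << '\n';        // apple, banana, cherry

        std::unordered_map<std::string, int> hashed(ordered.begin(), ordered.end());
        for (const auto& [k, v] : hashed)
            std::cout << k << '\n';        // implementation-defined order
    }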
Simple answer: because a hash table cannot satisfy the complexity requirements of iteration over a std::map.
Why does std::map hold these requirements? Unanswerable question. Historical factors contribute but, overall, that's just the way it is.
Hashes are available as std::unordered_map.
It doesn't really matter what the two are called, or what they're called in some other language.

c++ - unordered_map complexity

I need to create a lookup function where a (X,Y) pair corresponds to a specific Z value. One major requirement for this is that I need to do it in as close to O(1) complexity as I can. My plan is to use an unordered_map.
I generally do not use a hash table for lookup, as the lookup time has never been important to me. Am I correct in thinking that as long as I built the unordered_map with no collisions, my lookup time will be O(1)?
My concern then is what the complexity becomes if the key is not present in the unordered_map. If I use unordered_map::find(), for example, to determine whether a key is present in my hash table, how will it go about giving me an answer? Does it actually iterate over all the keys?
I greatly appreciate the help.
The standard more or less requires using buckets for collision resolution, which means that the actual lookup time will probably be linear with respect to the number of elements in the bucket, regardless of whether the element is present or not. It's possible to make it O(lg N), but it's not usually done, because the number of elements in the bucket should be small if the hash table is being used correctly.
To ensure that the number of elements in a bucket is small, you must ensure that the hashing function is effective. What effective means depends on the types and values being hashed. (The MS implementation uses FNV, which is one of the best generic hashes around, but if you have special knowledge of the actual data you'll be seeing, you might be able to do better.)
Another thing which can help reduce the number of elements per bucket is to force more buckets or use a smaller load factor. For the first, you can pass the minimum initial number of buckets as an argument to the constructor. If you know the total number of elements that will be in the map, you can control the load factor this way. You can also force a minimum number of buckets once the table has been filled, by calling rehash. Otherwise, there is a function std::unordered_map<>::max_load_factor which you can use. It is not guaranteed to do anything, but in any reasonable implementation, it will. Note that if you use it on an already filled unordered_map, you'll probably have to call unordered_map<>::rehash afterwards.
(There are several things I don't understand about the standard unordered_map: why the load factor is a float, instead of double; why it's not required to have an effect; and why it doesn't automatically call rehash for you.)
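As a hedged sketch of the knobs mentioned above (the numbers are arbitrary):

    #include <string>
    #include <unordered_map>

    int main() {
        // Pass a minimum initial bucket count to the constructor.
        std::unordered_map<std::string, int> m(1024);

        // Lower the maximum load factor, then rehash so the existing table honours it.
        m.max_load_factor(0.5f);
        m.rehash(2048);        // force at least this many buckets

        // Alternatively, reserve() sizes the table for an expected number of elements.
        m.reserve(5000);
    }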
As with any hash table, worst case is always linear complexity (Edit: if you built the map without any collisions like you stated in your original post, then you'll never see this case):
http://www.cplusplus.com/reference/unordered_map/unordered_map/find/
Complexity
Average case: constant.
Worst case: linear in container size.
Return Value
An iterator to the element, if the specified key value is found, or unordered_map::end if the specified key is not found in the container.
However, because an unordered_map can only contain unique keys, you will see average complexity of constant time (container first checks hash index, and then iterates over values at that index).
I think the documentation for unordered_map::count function is more informative:
Searches the container for elements whose key is k and returns the number of elements found. Because unordered_map containers do not allow for duplicate keys, this means that the function actually returns 1 if an element with that key exists in the container, and zero otherwise.
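A small sketch (made-up values) of what find() and count() report for a key that is absent:

    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> m = {{"present", 42}};

        // find() hashes the key and scans only the bucket it lands in;
        // it returns end() for a missing key -- it does not walk all keys.
        bool found = m.find("missing") != m.end();   // false

        // count() is 0 or 1, because keys are unique in an unordered_map.
        auto n = m.count("missing");                 // 0
        (void)found; (void)n;
    }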
Having no collisions at all in a hashed data structure is incredibly difficult (if not impossible for a given hash function and arbitrary data), and it would also require a table size at least equal to the number of keys. But it does not need to be that strict: as long as the hash function distributes the values in a relatively uniform way, you will have O(1) lookup complexity.
Hash tables are generally just arrays with linked lists taking care of the collisions (this is the chaining method - there are other methods, but this is likely the most utilized way of dealing with collisions). Thus, to find if a value is contained within a bucket, it will have to (potentially) iterate over all the values in that bucket. So if the hash function gives you a uniform distribution, and there are N buckets, and a total of M values, there should be (on average) M/N values per bucket. As long as this value is not too large, this allows O(1) lookup.
So, as a bit of a long-winded answer to your question: as long as the hashing function is reasonable, you will get O(1) lookup, with it having to iterate over (on average) O(M/N) keys to give you a "negative" result.

Hash table in C++

Is the insertion/deletion/lookup time of a C++ std::map O(log n)? Is it possible to implement an O(1) hash table?
Is the insertion/deletion/lookup time of a C++ map O(log n)?
Yes.
Is it possible to implement an O(1) hash table?
Definitely. The standard library also provides one as std::unordered_map.
C++ has an unordered_map type. Some STL implementations also provide a hash_map type, although this is not in the C++ standard library.
Now, for a bit of algorithmic theory. It is possible to implement an O(1) hash table under perfect conditions, and technically, hash tables are O(1) insertion and lookup. The perfect conditions in this case are that the hash function must be perfect (i.e. collision free), and you have infinite storage.
In practice, let's take a dumb hash table whose hash function returns 1 for any input key. In this case, when there is a collision (i.e. on the second and subsequent insertions), it will have to chain further to find some free space. It can either go to the next storage location, or use a linked list for this.
In any case, in the best case, yes, hash tables are O(1) (until you have exhausted all of your hash values, of course, since it is impractical to have a hash function with an infinite range of output). In the worst case (e.g. with my completely dumb hash function), hash tables are O(n), because you will have to traverse the storage to find your actual value, since the slot the hash points to does not hold the correct value.
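For illustration, a deliberately bad hash like the one described above can be plugged into an unordered container (a sketch, not something you would ever want in real code):

    #include <cstddef>
    #include <string>
    #include <unordered_set>

    // Every key hashes to 1, so every element lands in the same bucket.
    struct DumbHash {
        std::size_t operator()(const std::string&) const { return 1; }
    };

    int main() {
        std::unordered_set<std::string, DumbHash> s;
        s.insert("a");
        s.insert("b");              // collides with "a" by construction

        // find() now degenerates to a linear scan of the single shared bucket: O(n).
        auto it = s.find("b");
        (void)it;
    }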
The implementation of std::map is a tree. This is not directly specified in the standard, but as some good books are saying: "It is difficult to imagine that it can be anything else". This means that the insertion/deletion/lookup time for map is O(log n).
Classic hash tables have lookup time O(n/num_slots); as long as the expected number of items in the table stays comparable to (or below) the number of slots, that is effectively O(1).