What is the difference between set and hashset in C++ STL? - c++

When should I choose one over the other?
Are there any pointers that you would recommend for using the right STL containers?

hash_set is an extension that is not part of the C++ standard. Lookups should be O(1) rather than O(log n) for set, so it will be faster in most circumstances.
Another difference will be seen when you iterate through the containers. set will deliver the contents in sorted order, while hash_set will be essentially random (Thanks Lou Franco).
Edit: The C++11 update to the C++ standard introduced unordered_set which should be preferred instead of hash_set. The performance will be similar and is guaranteed by the standard. The "unordered" in the name stresses that iterating it will produce results in no particular order.

stl::set is implemented as a binary search tree.
hashset is implemented as a hash table.
The main issue here is that many people use stl::set thinking it is a hash table with look-up of O(1), which it isn't, and doesn't have. It really has O(log(n)) for look-ups. Other than that, read about binary trees vs hash tables to get a better idea of the data structures.

Another thing to keep in mind is that with hash_set you have to provide the hash function, whereas a set only requires a comparison function ('<') which is easier to define (and predefined for native types).

I don't think anyone has answered the other part of the question yet.
The reason to use hash_set or unordered_set is the usually O(1) lookup time. I say usually because every so often, depending on implementation, a hash may have to be copied to a larger hash array, or a hash bucket may end up containing thousands of entries.
The reason to use a set is if you often need the largest or smallest member of a set. A hash has no order so there is no quick way to find the smallest item. A tree has order, so largest or smallest is very quick. O(log n) for a simple tree, O(1) if it holds pointers to the ends.

A hash_set would be implemented by a hash table, which has mostly O(1) operations, whereas a set is implemented by a tree of some sort (AVL, red black, etc.) which have O(log n) operations, but are in sorted order.
Edit: I had written that trees are O(n). That's completely wrong.

Related

questions about the searching and sorting algorithms

I'm doing a little research about searching and sorting algorithms in the Standard library. I couldn't find something about those questions. I hope someone can help me out. You can also send me links if you know some.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
How can I adapt the behavior of the searching and sorting algorithms to make it more efficient?
Does the searching behavior change if the data is not sorted compared
to one which is sorted?
Depends. If you access your data in a vector/array by position, there's no performance improvement, and there's no need for sorting neither.
Searching can be done linearly, binary, keys, and by hash function.
For small (I guess something below a few dozens of items) and contiguous containers (e.g. a vector) linear search can be the fastest, just because of cache-friendly memory layout.
Binary search has O(log N) complexity which is likely the best you can get... I'm thinking in Information theory. It requires that you sort previously the container. It's useful for frequetly searches in the same container.
A std::set (and its cousin std::map) uses internally a tree, which makes searching O(log N) complexity too. Useful if you search by keys, instead of some criteria of your items. The drawback is that it's a bit slower on building (always keep sorted) than fill a vector an later sort it.
A hashmap or hashtable uses a function for getting the bucket where the item lies. The complexity is something near to O(1), depending on number of items and the function used (collisions issue).
As you see, selecting a type of container depends on how are you going to handle your data. Choose the one that fits your requirements.
How can I know if it is better to use std::sort() on a vector instead
of maybe to copy the vector to an already sorted set?
std::sort changes the container so the result is, obviously, sorted. If you need the original, unordered, container, then make a copy and sort the copy. Sorting the whole of the container is better that "insert-item-so-container-is-always-sorted" for all items, specially with a vector (many memory reallocations); a set/map filling process may be not that slow.
How can I adapt the behavior of the searching and sorting algorithms
to make it more efficient?
It's not clear to me what you mean. But, "The end justifies the means". Again, choose the container that servers best for your data handling.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
No. It depends on the algorithm you choose. General search std::find is O(n), binary search std::lower_bound is O(log n), but it works only on sorted ranges.
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
You can write a benchmark and measure. You can sort an std::vector (without duplicated elements) by copying it into an std::set, which maintains sorted order internally. std::set is typically implemented as a red-black tree and has in general high memory fragmentation in contrast to contiguous std::vector. So it is easy to predict the result. Alexander Stepanov discusses (if I remember correctly) this particular example in his lectures available on YouTube.

Why std::map is red black tree and not hash table ?

This is very strange for me, i expected it to be a hash table.
I saw 3 reasons in the following answer (which maybe correct but i don't think that they are the real reason).
Hash tables v self-balancing search trees
Although hash might be not a trivial operation. I think that for most of the types it is pretty simple.
when you use map you expect something that will give you amortized O(1) insert, delete, find , not log(n).
i agree that trees have better worst case performance.
I think that there is a bigger reason for that, but i can't figure it out.
In c# for example Dictionary is a hash table.
It's largely a historical accident. The standard containers (along with iterators and algorithms) were one of the very last additions before the feature set of the standard was frozen. As it happened, they didn't have what they considered an adequate definition of a hash-based map at the time, and there wasn't time to add it before features were frozen, so the original specification included only a tree-based map.
C++ 11 added std::unordered_map (as well as std::unordered_set and multi versions of both), which is based on hashing though.
The reason is that map is explicitly called out as an ordered container. It keeps the elements sorted and allows you to iterate in sorted order in linear time. A hashtable couldn't fulfill those requirements.
In C++11 they added std::unordered_map which is a hashtable implementation.
A hash table requires an additional hash function. The current implementation of map which uses a tree can work without an extra hash function by using operator<. Additionally the map allows sorted access to elements, which may be beneficial for some applications. With C++ we now have the hash versions available in form of unordered_set.
Simple answer: because a hash table cannot satisfy the complexity requirements of iteration over a std::map.
Why does std::map hold these requirements? Unanswerable question. Historical factors contribute but, overall, that's just the way it is.
Hashes are available as std::unordered_map.
It doesn't really matter what the two are called, or what they're called in some other language.

Confusion in the differences between hashmap and hashtable

I have a confusion:
I have read in many posts that Hash-maps are implemented as binary search trees which makes the various operations time complexity of logarithmic order.
Hashtables on the other hand provide constant time fetching.
But, as I read in this post, no difference has been provided in terms of the complexity for retrieval/searching of elements in the two data structures.
So, here is my question-
Since hashtables are guaranteed to provide constant searching time complexity, their implementation must differ from those of hash-maps. So, why will someone ever use hash-maps if they do not provide constant time searching. Also, why in the first place, they are implemented as binary search trees?
I know hash maps store the keys in sorted form and provide iteration through the map. But, the same could also be provide in the hashtable too.
Your confusion stems from the following:
Hash-maps are implemented as binary search trees
They are not. No sensible naming convention would use the term "hashmap" to describe a structure based on a tree.
For example, in Java, HashMap is based on a hash table and TreeMap is based on a tree.
C++ uses neither "hash" nor "tree" in the naming of its standard containers. The two containers that broaly correspond to HashMap and TreeMap are std::unordered_map and std::map.

Hash table in C++

Is the insertion/deletion/lookup time of a C++ std::map O(log n)? Is it possible to implement an O(1) hash table?
Is the insertion/deletion/lookup time of a C++ map O(log n)?
Yes.
Is it possible to implement an O(1) hash table?
Definitely. The standard library also provides one as std::unordered_map.
C++ has a unordered_map type. The STL also contains a hash_map type, although this is not in the C++ standard library.
Now, for a bit of algorithmic theory. It is possible to implement an O(1) hash table under perfect conditions, and technically, hash tables are O(1) insertion and lookup. The perfect conditions in this case are that the hash function must be perfect (i.e. collision free), and you have infinite storage.
In practise, let's take a dumb hash table. For any input key, it returns 1. In this case, when there is a collision (i.e. on the second and subsequent insertions), it will have to chain further to find some free space. It can either go to the next storage location, or use a linked list for this.
In any case, in the best case, yes, hash tables are O(1) (until you have exhausted all of your hash values, of course, since it is impractical to have a hash function with an infinite amount of output). In the worst case (e.g. with my completely dumb hash function), hash tables are O(n), since you will have to traverse over the storage in order to find your actual value from the given hash, since the initial value is not the correct value.
The implementation of std::map is a tree. This is not directly specified in the standard, but as some good books are saying: "It is difficult to imagine that it can be anything else". This means that the insertion/deletion/lookup time for map is O(log n).
Classic hash tables have lookup time O(n/num_slots). Once the expected number of items in the table is comparable with the number of slots, you will have saturated O(1).

Choosing between std::map and std::unordered_map [duplicate]

This question already has answers here:
Is there any advantage of using map over unordered_map in case of trivial keys?
(15 answers)
Closed 4 years ago.
Now that std has a real hash map in unordered_map, why (or when) would I still want to use the good old map over unordered_map on systems where it actually exists? Are there any obvious situations that I cannot immediately see?
As already mentioned, map allows to iterate over the elements in a sorted way, but unordered_map does not. This is very important in many situations, for example displaying a collection (e.g. address book). This also manifests in other indirect ways like: (1) Start iterating from the iterator returned by find(), or (2) existence of member functions like lower_bound().
Also, I think there is some difference in the worst case search complexity.
For map, it is O( lg N )
For unordered_map, it is O( N ) [This may happen when the hash function is not good leading to too many hash collisions.]
The same is applicable for worst case deletion complexity.
In addition to the answers above you should also note that just because unordered_map is constant speed (O(1)) doesn't mean that it's faster than map (of order log(N)). The constant may be bigger than log(N) especially since N is limited by 232 (or 264).
So in addition to the other answers (map maintains order and hash functions may be difficult) it may be that map is more performant.
For example in a program I ran for a blog post I saw that for VS10 std::unordered_map was slower than std::map (although boost::unordered_map was faster than both).
Note 3rd through 5th bars.
This is due to Google's Chandler Carruth in his CppCon 2014 lecture
std::map is (considered by many to be) not useful for performance-oriented work: If you want O(1)-amortized access, use a proper associative array (or for lack of one, std::unorderded_map); if you want sorted sequential access, use something based on a vector.
Also, std::map is a balanced tree; and you have to traverse it, or re-balance it, incredibly often. These are cache-killer and cache-apocalypse operations respectively... so just say NO to std::map.
You might be interested in this SO question on efficient hash map implementations.
(PS - std::unordered_map is cache-unfriendly because it uses linked lists as buckets.)
I think it's obvious that you'd use the std::map you need to iterate across items in the map in sorted order.
You might also use it when you'd prefer to write a comparison operator (which is intuitive) instead of a hash function (which is generally very unintuitive).
Say you have very large keys, perhaps large strings. To create a hash value for a large string you need to go through the whole string from beginning to end. It will take at least linear time to the length of the key. However, when you only search a binary tree using the > operator of the key each string comparison can return when the first mismatch is found. This is typically very early for large strings.
This reasoning can be applied to the find function of std::unordered_map and std::map. If the nature of the key is such that it takes longer to produce a hash (in the case of std::unordered_map) than it takes to find the location of an element using binary search (in the case of std::map), it should be faster to lookup a key in the std::map. It's quite easy to think of scenarios where this would be the case, but they would be quite rare in practice i believe.