VS2005 supports both ::stdext::hash_map and ::std::map.
However, in my test (fewer than 10000 items), ::stdext::hash_map's insert and remove operations seem to be slower than ::std::map's.
Interesting....
Can anyone offer a comparison article about them?
Normally you look to the complexities of the various operations, and that's a good guide: amortized O(1) insert, O(1) lookup, delete for a hashmap as against O(log N) insert, lookup, delete for a tree-based map.
However, there are certain situations where the complexities are misleading because the constant terms involved are extreme. For example, suppose that your 10k items are keyed off strings. Suppose further that those strings are each 100k characters long. Suppose that different strings typically differ near the beginning of the string (for example if they're essentially random, pairs will differ in the first byte with probability 255/256).
Then to do a lookup the hashmap has to hash a 100k string. This is O(1) in the size of the collection, but might take quite a long time since it's probably O(M) in the length of the string. A balanced tree has to do log N <= 14 comparisons, but each one only needs to look at a few bytes. This might not take very long at all.
In terms of memory access, with a 64 byte cache line size, the hashmap loads over 1500 sequential lines, and does 100k byte operations, whereas the tree loads 15 random lines (actually probably 30 due to the indirection through the string) and does 14 * (some small number) byte operations. You can see that the former might well be slower than the latter. Or it might be faster: how good are your architecture's FSB bandwidth, stall time, and speculative read caching?
If the lookup finds a match, then of course in addition to this both structures need to perform a single full-length string comparison. Also the hashmap might do additional failed comparisons if there happens to be a collision in the bucket.
So assuming that failed comparisons are so fast as to be negligible, while successful comparisons and hashing ops are slow, the tree might be roughly 1.5-2 times as fast as the hash. If those assumptions don't hold, then it won't be.
An extreme example, of course, but it's pretty easy to see that on your data, a particular O(log N) operation might be considerably faster than a particular O(1) operation. You are of course right to want to test, but if your test data is not representative of the real world, then your test results may not be representative either. Comparisons of data structures based on complexity refer to behaviour in the limit as N approaches infinity. But N doesn't approach infinity. It's 10000.
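A rough way to see the asymmetry in the long-string example is to count how many bytes each operation has to read. The sketch below is only an illustration under the assumptions stated above (very long keys that differ early): bytes_read_by_compare and fnv1a are hypothetical helper names, and FNV-1a merely stands in for whatever hash your hash_map really uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes a lexicographic comparison must read before it can decide:
// the common prefix plus, if there is one, the first differing byte.
std::size_t bytes_read_by_compare(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    auto diff = std::mismatch(a.begin(), a.begin() + static_cast<std::ptrdiff_t>(n), b.begin());
    const std::size_t prefix = static_cast<std::size_t>(diff.first - a.begin());
    return prefix < n ? prefix + 1 : n;
}

// FNV-1a (64-bit): a stand-in for "some hash that reads every byte of the key".
// bytes_read reports how much of the key was touched.
std::uint64_t fnv1a(const std::string& s, std::size_t& bytes_read) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ull;                  // FNV prime
    }
    bytes_read = s.size();
    return h;
}
```

With two 100k-character keys that differ in the first byte, the comparison reads one byte while the hash reads all 100000, which is the constant factor that the O(1) label hides.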
It is not just about insertion and removal. Memory is allocated differently in a hash_map than in a map, and every lookup has to compute the hash of the key being searched for.
I think this Dr.Dobbs article will answer your question best:
C++ STL Hash Containers and Performance
It depends upon your usage and your hash collisions. One is a binary tree and the other is a hashtable.
Ideally the hash map will have O(1) insertion and lookup, and the map O(log n), but that presumes the hashes do not clash.
hash_map uses a hash table, something that offers almost constant time O(1) operations assuming a good hash function.
map uses a balanced BST and offers O(lg(n)) operations; for 10000 elements that's about 13 or 14 steps, which is very acceptable.
I'd say stay with map, it's safer.
Hash tables are supposed to be faster than binary trees (i.e. std::map) for lookup. Nobody has ever suggested that they are faster for insert and delete.
A hash map computes a hash of the string/key and uses it as an index. Although its complexity is usually quoted as O(1), hash_map has to check for collisions on every insert, because the hash of one string can produce the same index as the hash of another. A hash map therefore carries the cost of managing these collisions, and how many of them occur depends on the input data.
However, if you are going to perform a lot of look-ups on the structure, opt for hash_map.
The newer standards introduced std::unordered_map/std::unordered_set, which use a hash function and have, on average, constant complexity for inserting, deleting and getting elements. When we do not need to iterate over the collection in a particular order, is there any reason left to use the "old" std::map/std::set? Or are there other cases where std::map/std::set would be a better choice? For example, do they consume less memory, or is ordering their only advantage over the "unordered" versions?
They are ordered, and writing < is easier than writing hash and equality.
Never underestimate ease of use, because 90% of your code has trivial impact on your code's performance. Making the 10% faster can use time you would have spent on writing a hash for yet another type.
OTOH, a good hash combiner is write once, and get-state-as-tie makes <, == and hash nearly free.
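As a minimal sketch of the "get-state-as-tie" idea: expose the members once as a tuple of references, then derive <, == and the hash from it. Employee, as_tie, hash_combine and EmployeeHash are hypothetical names chosen for illustration; the mixing constant follows the widely used Boost-style hash_combine recipe, which is an assumption here rather than anything the answer prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <unordered_map>

// Hypothetical key type, for illustration only.
struct Employee {
    std::string name;
    int department;

    // Expose the state once, as a tuple of references.
    std::tuple<const std::string&, const int&> as_tie() const {
        return std::tie(name, department);
    }
};

// Ordering and equality come for free from the tuple.
inline bool operator<(const Employee& a, const Employee& b) { return a.as_tie() < b.as_tie(); }
inline bool operator==(const Employee& a, const Employee& b) { return a.as_tie() == b.as_tie(); }

// Write-once combiner: folds one more hash value into the running seed.
inline void hash_combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct EmployeeHash {
    std::size_t operator()(const Employee& e) const {
        std::size_t seed = 0;
        hash_combine(seed, std::hash<std::string>()(e.name));
        hash_combine(seed, std::hash<int>()(e.department));
        return seed;
    }
};

// The same key type now works in both kinds of container.
using OrderedStaff = std::map<Employee, int>;
using HashedStaff = std::unordered_map<Employee, int, EmployeeHash>;
```

The ordered container needs only operator<; the unordered one needs both operator== and the hash functor, which is the extra work the answer is weighing.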
Splicing guarantees between containers with node based operations may be better, as splicing into a hash map isn't free like a good ordered container splice. But I am uncertain.
Finally, the iterator invalidation guarantees differ. Blindly replacing a mature, tested meow with an unordered meow could create bugs. And maybe the invalidation features of maps are worth it to you.
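To make the invalidation difference concrete, here is a small sketch (the function names are hypothetical). Inserting into a std::map never invalidates existing iterators, whereas inserting into a std::unordered_map invalidates all of its iterators whenever the insertion triggers a rehash; references and pointers to elements survive in both cases. The second function only detects that a rehash happened, since touching an invalidated iterator would be undefined behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <unordered_map>

// std::map: an iterator taken before the insertions is still usable afterwards.
int read_through_old_iterator(std::map<int, int>& m, int first, int last) {
    auto it = m.emplace(first, 42).first;            // iterator to one element
    for (int k = first + 1; k < last; ++k) m[k] = k; // many later insertions
    return it->second;                               // still valid for std::map
}

// std::unordered_map: report whether inserting [first, last) forced a rehash.
// After a rehash, every iterator obtained earlier must be treated as dead.
bool insertions_rehash(std::unordered_map<int, int>& m, int first, int last) {
    const std::size_t buckets_before = m.bucket_count();
    for (int k = first; k < last; ++k) m[k] = k;
    return m.bucket_count() != buckets_before;
}
```

Code that caches iterators across insertions works with the first container and silently breaks with the second, which is the kind of bug a blind replacement can introduce.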
std::set/std::map and std::unordered_set/std::unordered_map are used in very different problem areas and are not replaceable by each other.
std::set/std::map are used where the problem revolves around the order of the elements and O(log n) element access is acceptable. With std::set/std::map you can also retrieve order-based information, for example the number of elements greater than a given element.
std::unordered_set/std::unordered_map are used where element access has to be O(1) on average and order is not important. For example, suppose you want to store elements under integer keys. You could keep them in a std::vector, so that vec[10] = 10, but that is not practical when the keys vary drastically: if one key is 20 and the other is 50000, a std::vector of size 50001 has to be allocated to hold only two values. If you use std::set/std::map instead, element access is O(log n), not O(1). For this problem std::unordered_set/std::unordered_map offers O(1) average-case access through hashing, without allocating a large space.
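The 20-versus-50000 example can be sketched as follows. store_dense and store_hashed are hypothetical helpers, and the dense version assumes non-negative keys.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Dense layout: the index is the key, so the vector must reach the largest key.
// Assumes all keys are non-negative.
std::vector<int> store_dense(const std::vector<int>& keys) {
    if (keys.empty()) return {};
    const int largest = *std::max_element(keys.begin(), keys.end());
    std::vector<int> table(static_cast<std::size_t>(largest) + 1, 0);
    for (int k : keys) table[static_cast<std::size_t>(k)] = k;
    return table;
}

// Hashed layout: one entry per key, however far apart the keys are.
std::unordered_map<int, int> store_hashed(const std::vector<int>& keys) {
    std::unordered_map<int, int> table;
    for (int k : keys) table[k] = k;
    return table;
}
```

For the keys 20 and 50000 the dense table holds 50001 slots, while the hashed table holds two entries (plus its bucket array).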
Will a map get slower the longer it is? I'm not talking about iterating through it, but rather operations like .find() .insert() and .at().
For instance if we have map<int, Object> mapA which contains 100'000'000 elements and map<int, Object> mapB which only contains 100 elements.
Will there be any difference performance wise executing mapA.find(x) and mapB.find(x)?
The complexity of lookup and insertion operations on std::map is logarithmic in the number of elements in the map. So it gets slower as the map gets larger, but only very slowly (more slowly than any polynomial in the number of elements). To implement a container with such properties, operations usually take the form of a binary search.
To imagine how much slower it is, you essentially require one further operation every time you double the number of elements. So if you need k operations on a map with 4000 elements, you need k + 1 operations on a map with 8000 elements, k + 2 for 16000 elements, and so forth.
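The "one more operation per doubling" rule can be checked with a tiny sketch. halving_steps is a hypothetical helper that counts how often n can be halved before nothing is left, i.e. floor(log2(n)) + 1; it models an idealized binary search, not the exact number of comparisons any particular std::map implementation performs.

```cpp
#include <cassert>

// Number of halving steps needed to exhaust n candidates:
// floor(log2(n)) + 1 for n >= 1, and 0 for n == 0.
unsigned halving_steps(unsigned long long n) {
    unsigned steps = 0;
    while (n > 0) {
        n /= 2;
        ++steps;
    }
    return steps;
}
```

Under this model 4000, 8000 and 16000 elements need 12, 13 and 14 steps, and going from 100 elements to 100'000'000 raises the count only from 7 to 27.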
By contrast, std::unordered_map does not offer you an ordering of the elements, and in return it gives you a complexity that's constant on average. This container is usually implemented as a hash table. "On average" means that looking up one particular element may take long, but the time it takes to look up many randomly chosen elements, divided by the number of looked-up elements, does not depend on the container size. The unordered map offers you fewer features, and as a result can potentially give you better performance.
However, be careful when choosing which map to use (provided ordering doesn't matter), since asymptotic cost doesn't tell you anything about real wall-clock cost. The cost of hashing involved in the unordered map operations may contribute a significant constant factor that only makes the unordered map faster than the ordered map at large sizes. Moreover, the lack of predictability of the unordered map (along with potential complexity attacks using chosen keys) may make the ordered map preferable in situations where you need control on the worst case rather than the average.
The C++ standard only requires that std::map has logarithmic lookup time; not that it is a logarithm of any particular base or with any particular constant overhead.
As such, asking "how many times slower would a 100 million map be than a 100 map" is nonsensical; it could well be that the overhead easily dominates both, so that the operations are about the same speed. It could even well be that for small sizes the time growth is exponential! By design, none of these things are deducible purely from the specification.
Further, you're asking about time, rather than operations. This depends heavily on access patterns. To use some diagrams from Paul Khuong's (amazing) blog on binary searches, the runtimes for repeated searches (look at stl, the turquoise line) are almost perfectly logarithmic,
but once you start doing random access the performance becomes decidedly non-logarithmic due to memory access outside of level-1 cache:
Note that goog refers to Google's dense_hash_map, which is akin to unordered_map. In this case, even it does not escape performance degradation at larger sizes.
The latter graph is likely the more telling for most situations, and suggests that looking up a random index in a size-100 map will cost about 10x less than in a size-500'000 map. dense_hash_map will degrade worse than that, in that it will go from almost-free to certainly-not-free, albeit always remaining much faster than the STL's map.
In general, when asking these questions, an approach from theory can only give you very rough answers. A quick look at actual benchmarks and considerations of constant factors is likely to fine-tune these rough answers significantly.
Now, also remember that you're talking about map<int, Object>, which is very different from set<uint32_t>; if the Object is large this will emphasize the cost of cache misses and de-emphasize the cost of traversal.
A pedantic aside.
A quick note about hash maps: Their time complexity is often described as constant time, but this isn't strictly true. Most hash maps rather give you constant time with very high likelihood with regards to lookups, and amortized constant time with very high likelihood with regards to inserts.
The former means that for most hash tables there is an input that makes them perform less than optimally, and for user input this could be dangerous. For this reason, Rust uses a cryptographic hash by default, Java's HashMap resolves collisions with a binary search and CPython randomizes hashes. Generally, if you're exposing your hash table to untrusted input, you should make sure you're using some mitigation of this kind.
Some, like Cuckoo hashes, do better than probabilistic (on constrained data types, given a special kind of hash function) for the case where you're worried about attackers, and incremental resizing removes the amortized time cost (assuming cheap allocations), but neither is commonly used since these are rarely problems that need solving, and the solutions are not free.
That said, if you're struggling to think of why we'd go through the hassle of using unordered maps, look back up at the graphs. They're fast, and you should use them.
Is there a data structure in C++ with O(1) lookup?
A std::map has O(log(n)) lookup time (right?).
I'm looking for something in std preferably (so not Boost pls). Also, if there is, how does it work?
EDIT: Ok, I wasn't clear enough I guess. I want to associate values, kind of like in a map. So I want something like std::map<int,string>, and find and insert should take O(1).
Arrays have O(1) lookup.
A hash table (std::unordered_map, available since C++11) has O(1) lookup. (That is the average case, not the worst case, but in practice it is more or less constant.)
I would also like to mention that tree based data structures like maps come with great advantages and are only log(n) which is more often than not sufficient.
Answer to your edit -> You can literally associate an index of an array to one of the values. Also hash tables are associative but perfect hash (each key maps to exactly 1 value) is really difficult to get.
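For the std::map<int,string>-style association asked about in the edit, a minimal sketch with std::unordered_map could look like this. Table and lookup_or are hypothetical names; find() is used instead of operator[] because operator[] would insert a default value for a missing key.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using Table = std::unordered_map<int, std::string>;

// Returns the stored string, or the fallback if the key is absent.
// Average O(1); find() never modifies the table.
std::string lookup_or(const Table& t, int key, const std::string& fallback) {
    auto it = t.find(key);
    return it == t.end() ? fallback : it->second;
}
```

Insertion is done with t.emplace(key, value) or t[key] = value, both average O(1) as well.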
One more thing worth mentioning: arrays have great cache performance (due to locality, i.e. the elements sit right next to each other, so the prefetching engine can bring them into cache ahead of time). Trees, not so much. With a reasonable number of elements, cache performance can be more critical than asymptotic performance.
Data structures with O(1) lookup (ignoring the size of the key) include:
arrays
hash tables
For complex types, balanced trees will be fine at O(log n), or sometimes you can get away with a patricia trie at O(k).
For reference: complexity of search structures
An array has O(1) lookup.
This is a very naive question, yet I can't find an explicit discussion of it.
Everybody agrees that using a hash for a map container with only 10 elements is overkill; an ordered map will be much faster. With a hundred, a thousand, etc., a map should scale by log N, where N is the number of pairs in the map. So, relative to 10 elements, a thousand takes three times as long; a million, six times as long; a billion, nine times as long.
Of course, we are led to believe that a well designed hashed container can be searched in O(1) (constant) time vs O(logN) for a sorted container. But what are the implied constants? At what point does the hash map lose the map in the dust? Especially, if the keys are integers, there is little overhead in the key search, so the constant in map would be small.
Nevertheless, just about EVERYBODY thinks hashed containers are faster. Lots of real time tests have been done.
What's going on?
As you've said, a hash based map does have the potential to be asymptotically faster than a binary search tree, with queries in O(1) vs O(log(N)) time - but this is entirely dependent on the performance of the hash function used over the allowable distribution of input data.
There are two important worst-case situations to think about with a hash table:
All data items generate the same hash index, therefore all items end up in the same hash bucket - querying the hash map will take O(N) in this case.
The distribution of hash indices generated by the data is extremely sparse, therefore most hash buckets are empty. You can still end up with O(1) query time in this case but the space complexity can essentially become unbounded in the limit.
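The first worst case is easy to provoke on purpose. In this sketch ConstantHash is a deliberately terrible, hypothetical hash functor, and largest_bucket is a helper that inspects the table through the standard bucket interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Deliberately bad hash: every key collides. For illustration only.
struct ConstantHash {
    std::size_t operator()(int) const { return 0; }
};

// Size of the fullest bucket. A lookup that lands in it has to scan
// that many elements in the worst case.
template <class Set>
std::size_t largest_bucket(const Set& s) {
    std::size_t worst = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b)
        worst = std::max(worst, s.bucket_size(b));
    return worst;
}
```

With ConstantHash, a std::unordered_set<int, ConstantHash> holding N keys reports a largest bucket of size N, so every query degenerates into the O(N) scan described above.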
A binary search tree on the other hand (at least the red-black tree used in most standard library implementations) enjoys worst-case O(log(N)) time and O(N) space complexity.
The up-shot of all of this (in my opinion) is that if you know enough about your input data to design a "good" hash function (doesn't have too many collisions, doesn't generate too sparse a distribution of hash buckets) using the hash map will generally be a better choice.
If you can't ensure the performance of your hash function over your expected inputs use a BST.
The exact point at which one becomes better than the other is entirely problem/machine dependent.
Hope this helps.
As you have rightly noted, the devil is in the details (in this case, the constants). You have to benchmark your code to decide which is more efficient for you, because O-notation describes behaviour as the input size grows without bound, while you're dealing with real-world constraints.
The hash would be faster if it is indeed O(1) (i.e. the hash function is really, really good) and the hash function calculation is relatively fast (to begin with, it doesn't depend on the size of the input).
The overhead on the map is traversing the tree and while the key comparison may be more or less fast (integers faster, strings slower), traversing the tree is always dependent on the input (the tree depth). For larger trees, consider using B-Trees instead of the standard map (which in C++ is implemented usually with red-black trees).
Again, the magic word is benchmark.
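A minimal timing harness might look like the sketch below. time_lookups and Result are hypothetical names, and this is only a starting point: a serious benchmark would repeat the runs, use your real keys and access pattern, and be built with optimizations on.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <unordered_map>
#include <vector>

struct Result {
    std::size_t hits;  // returned so the compiler cannot discard the loop
    double seconds;    // wall-clock time for all probes
};

// Works for any container with count(key), e.g. std::map and std::unordered_map.
template <class Map>
Result time_lookups(const Map& m, const std::vector<int>& probes) {
    const auto start = std::chrono::steady_clock::now();
    std::size_t hits = 0;
    for (int key : probes) hits += m.count(key);
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return Result{hits, elapsed.count()};
}
```

Call it once with a std::map<int, int> and once with a std::unordered_map<int, int> filled with the same data and the same probe sequence, then compare the seconds fields; the hits fields should match.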
The exact point where hash maps are faster will be machine-dependent.
It is true that it only takes O(log n) "steps" to traverse the map. But looking at the constant factor for a moment, note that the base on that log is 2, not 10; and the binary tree is probably implemented as a red-black tree, which is not perfectly balanced in general. (If memory serves, it can be up to 2x deeper than log2(n).)
However, what really drives the difference is the poor locality of the ordered map. Each of those O(log n) steps involves an impossible-to-predict branch, which is bad for the instruction pipeline. Worse, it involves chasing a random pointer to memory. The rule of thumb on modern CPUs is: "Math is fast; memory is slow." This is a good rule to remember because it becomes more true with every generation. CPU core speeds generally improve faster than memory speeds.
So unless your map is small enough to fit in cache, those random pointer dereferences are very bad for overall performance. Computing a hash is just math (and therefore fast), and chasing O(1) pointers is better than chasing O(log n); usually much better for large n.
But again, the exact point of hash table dominance will depend on the specific system.
I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++. What container will have the lowest complexity upon retrieval? In this case, the data does not grow after initialization. In C# I would use a dictionary; in C++, I could use either a hash_map or a set. If the data were unordered, I would use boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to expand a bit on what has already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees in the few STL versions I have seen) because of the requirements on the insertion, retrieval and deletion operations (and notably because of the iterator invalidation requirements).
However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:
minimal overhead (in terms of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken down into 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
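The two steps above can be sketched as follows, using std::vector as the container. make_sorted and contains are hypothetical helper names; the assert is a debug-only check of the precondition that std::binary_search relies on (it costs O(n), so it disappears when NDEBUG is defined).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Step 1: push all values in, then sort once.
std::vector<int> make_sorted(std::vector<int> values) {
    std::sort(values.begin(), values.end());
    return values;
}

// Step 2: O(log n) membership test. The vector must already be sorted.
bool contains(const std::vector<int>& sorted, int value) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    return std::binary_search(sorted.begin(), sorted.end(), value);
}
```

Since the data never changes after initialization, the sort is paid once and every later query is a binary search over contiguous memory.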
If you don't want to reinvent the wheel, you can also check Alexandrescu's AssocVector. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then an unordered_set will be faster, especially because integers are so trivial to hash: size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the distribution of its elements.
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
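When the position matters and not just membership, std::lower_bound does the job. In this sketch index_of is a hypothetical helper, and returning sorted.size() for "not found" is just one possible convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index of value in a sorted vector, or sorted.size() if it is absent.
std::size_t index_of(const std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    // lower_bound returns the first element that is not less than value,
    // so equality still has to be checked.
    if (it != sorted.end() && *it == value)
        return static_cast<std::size_t>(it - sorted.begin());
    return sorted.size();
}
```

The equality check matters: for a missing value, std::lower_bound still returns a valid iterator, pointing at the place where the value would be inserted.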
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(logn) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.