Yet another ordered map vs. unordered (hash) map question - c++

This is a very naive question, yet I can't find an explicit discussion of it.
Everybody agrees that using a hash for a map container with only 10 elements is overkill; an ordered map will be much faster. With a hundred, a thousand, etc., a map should scale as log N, where N is the number of pairs in the map. So for a thousand elements it takes three times as long as for ten; for a million, six times as long; for a billion, nine times as long.
Of course, we are led to believe that a well designed hashed container can be searched in O(1) (constant) time vs O(log N) for a sorted container. But what are the implied constants? At what point does the hash map leave the map in the dust? In particular, if the keys are integers, key comparison is cheap, so the constant for map should be small.
Nevertheless, just about EVERYBODY thinks hashed containers are faster. Lots of real-world timing tests have been done.
What's going on?

As you've said, a hash based map does have the potential to be asymptotically faster than a binary search tree, with queries in O(1) vs O(log(N)) time - but this is entirely dependent on the performance of the hash function used over the allowable distribution of input data.
There are two important worst-case situations to think about with a hash table:
All data items generate the same hash index, therefore all items end up in the same hash bucket - querying the hash map will take O(N) in this case.
The distribution of hash indices generated by the data is extremely sparse, therefore most hash buckets are empty. You can still end up with O(1) query time in this case but the space complexity can essentially become unbounded in the limit.
A binary search tree on the other hand (at least the red-black tree used in most standard library implementations) enjoys worst-case O(log(N)) time and O(N) space complexity.
The upshot of all of this (in my opinion) is that if you know enough about your input data to design a "good" hash function (one that doesn't produce too many collisions and doesn't generate too sparse a distribution of hash buckets), the hash map will generally be the better choice.
If you can't ensure the performance of your hash function over your expected inputs use a BST.
The exact point at which one becomes better than the other is entirely problem/machine dependent.
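For illustration, here is a minimal sketch of plugging a hand-written hash into std::unordered_map (the PartId key type and its hash are invented for the example; whether such a hash is "good" depends entirely on your real key distribution):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type, used only for illustration.
struct PartId {
    int warehouse;
    int shelf;
    bool operator==(const PartId& other) const {
        return warehouse == other.warehouse && shelf == other.shelf;
    }
};

// A simple hand-rolled hash; how well it behaves depends on the
// distribution of keys you actually feed it.
struct PartIdHash {
    std::size_t operator()(const PartId& p) const {
        std::size_t h1 = std::hash<int>{}(p.warehouse);
        std::size_t h2 = std::hash<int>{}(p.shelf);
        return h1 ^ (h2 << 1);  // cheap combine; fine if the fields are well spread
    }
};

int main() {
    std::unordered_map<PartId, std::string, PartIdHash> names;
    names[{1, 42}] = "widget";
    return names.count({1, 42}) ? 0 : 1;
}
```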
Hope this helps.

As you have rightly noted - the devil is in the details (in this case - constants). You have to benchmark your code to decide which is more efficient for you, because big-O notation describes behaviour in the limit of arbitrarily large N, while you're dealing with real-world constraints.
The hash would be faster if it is indeed O(1) (i.e.: the hash function is really, really good) and the hash function itself is relatively cheap to compute (to begin with, it doesn't depend on the number of elements in the container).
The overhead in a map is traversing the tree: while key comparison may be more or less fast (integers faster, strings slower), the traversal always depends on the number of elements (the tree depth). For larger trees, consider using B-trees instead of the standard map (which in C++ is usually implemented with red-black trees).
Again, the magic word is benchmark.
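A minimal benchmark sketch in that spirit (sizes, key distribution and lookup count are arbitrary; measure with your own data, types and compiler flags):

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <unordered_map>

// Time `lookups` successful finds against a container of roughly `size` elements.
template <typename Map>
double time_lookups(std::size_t size, std::size_t lookups) {
    Map m;
    std::mt19937 rng(12345);
    for (std::size_t i = 0; i < size; ++i) m[rng()] = i;

    std::mt19937 rng2(12345);       // replay the same key sequence
    volatile std::size_t sink = 0;  // keep the optimizer honest
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < lookups; ++i) {
        auto it = m.find(rng2());
        if (it != m.end()) sink = sink + it->second;
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    for (std::size_t n : {1000u, 100000u, 10000000u}) {
        std::printf("n=%zu  map: %.3fs  unordered_map: %.3fs\n", n,
                    time_lookups<std::map<unsigned, std::size_t>>(n, 1000000),
                    time_lookups<std::unordered_map<unsigned, std::size_t>>(n, 1000000));
    }
}
```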

The exact point where hash maps are faster will be machine-dependent.
It is true that it only takes O(log n) "steps" to traverse the map. But looking at the constant factor for a moment, note that the base on that log is 2, not 10; and the binary tree is probably implemented as a red-black tree, which is not perfectly balanced in general. (If memory serves, it can be up to 2x deeper than log2(n).)
However, what really drives the difference is the poor locality of the ordered map. Each of those O(log n) steps involves an impossible-to-predict branch, which is bad for the instruction pipeline. Worse, it involves chasing a random pointer to memory. The rule of thumb on modern CPUs is: "Math is fast; memory is slow." This is a good rule to remember because it becomes more true with every generation. CPU core speeds generally improve faster than memory speeds.
So unless your map is small enough to fit in cache, those random pointer dereferences are very bad for overall performance. Computing a hash is just math (and therefore fast), and chasing O(1) pointers is better than chasing O(log n); usually much better for large n.
But again, the exact point of hash table dominance will depend on the specific system.

Related

Is there a data structure like a C++ std set which also quickly returns the number of elements in a range?

In a C++ std::set (often implemented using red-black binary search trees), the elements are automatically sorted, and key lookups and deletions at arbitrary positions take O(log n) time (worst case; being node-based, a std::set has no vector-style capacity or reallocations to amortise over).
In a sorted C++ std::vector, lookups are also fast (actually probably a bit faster than std::set), but insertions are slow (since maintaining sortedness takes time O(n)).
However, sorted C++ std::vectors have another property: they can find the number of elements in a range quickly (in time O(log n)).
i.e., a sorted C++ std::vector can quickly answer: how many elements lie between given x,y?
std::set can quickly find iterators to the start and end of the range, but gives no clue how many elements are within.
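For concreteness, the sorted-vector range count is just two binary searches, roughly:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of elements in [x, y], assuming `data` is sorted ascending.
std::size_t count_in_range(const std::vector<int>& data, int x, int y) {
    auto lo = std::lower_bound(data.begin(), data.end(), x);
    auto hi = std::upper_bound(data.begin(), data.end(), y);
    return static_cast<std::size_t>(hi - lo);  // O(log n) total
}
```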
So, is there a data structure that allows all the speed of a C++ std::set (fast lookups and deletions), but also allows fast computation of the number of elements in a given range?
(By fast, I mean time O(log n), or maybe a polynomial in log n, or maybe even sqrt(n). Just as long as it's faster than O(n), since O(n) is almost the same as the trivial O(n log n) to search through everything).
(If not possible, even an estimate of the number to within a fixed factor would be useful. For integers a trivial upper bound is y-x+1, but how to get a lower bound? For arbitrary objects with an ordering there's no such estimate).
EDIT: I have just seen the related question, which essentially asks whether one can compute the number of preceding elements. (Sorry, my fault for not seeing it before.) This is trivially equivalent to this question: to get the number in a range, just compute the number of elements preceding each endpoint and subtract.
However, that question also allows the data to be computed once and then be fixed, unlike here, so that question (and the sorted vector answer) isn't actually a duplicate of this one.
The data structure you're looking for is an Order Statistic Tree.
It's typically implemented as a binary search tree in which each node additionally stores the size of its subtree.
Unfortunately, I'm pretty sure the STL doesn't provide one.
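If you're on GCC, though, the non-standard policy-based data structures extension (__gnu_pbds) ships an order-statistic tree. A sketch of counting a range with it (GCC-specific, not portable):

```cpp
// GCC-only extension, not part of the C++ standard library.
// (Very old GCC releases spell null_type as null_mapped_type.)
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <cstddef>
#include <cstdio>
#include <functional>

using ordered_set = __gnu_pbds::tree<
    int, __gnu_pbds::null_type, std::less<int>,
    __gnu_pbds::rb_tree_tag,
    __gnu_pbds::tree_order_statistics_node_update>;

int main() {
    ordered_set s;
    for (int v : {2, 3, 5, 7, 11, 13}) s.insert(v);

    // Count of elements in [x, y] = rank(y + 1) - rank(x), both O(log n);
    // order_of_key(k) returns the number of elements strictly less than k.
    int x = 3, y = 11;
    std::size_t count = s.order_of_key(y + 1) - s.order_of_key(x);
    std::printf("%zu elements in [%d, %d]\n", count, x, y);  // prints 4
}
```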
All data structures have their pros and cons, which is why the standard library offers a bunch of containers.
And the rule is that there is often a trade-off between the speed of modifications and the speed of data extraction. Here you would like cheap access to the number of elements in a range. One possibility in a tree-based structure would be to cache in each node the number of elements of its subtree (see the sketch below). That would add about log(N) operations (the height of the tree) to each insertion or deletion, but would greatly speed up counting the elements in a range. Unfortunately, few classes from the C++ standard library are designed for derivation (and AFAIK std::set is not), so you will have to implement your tree from scratch.
Maybe you are looking for a C++ alternative to Java's LinkedHashSet: https://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashSet.html.

When std::map/std::set should be used over std::unordered_map/std::unordered_set?

Since the new standards introduced std::unordered_map/std::unordered_set, which use a hash function and on average have constant complexity for inserting/deleting/getting elements, it seems there is no reason to use the "old" std::map/std::set when we do not need to iterate over the collection in a particular order. Or are there other cases/reasons when std::map/std::set would be the better choice? For example, would they be less memory-consuming, or is their only advantage over the "unordered" versions the ordering?
They are ordered, and writing a < is easier than writing a hash and an equality.
Never underestimate ease of use: 90% of your code has a trivial impact on your program's performance. Making the other 10% faster can use the time you would otherwise have spent writing a hash for yet another type.
OTOH, a good hash combiner is write once, and get-state-as-tie makes <, == and hash nearly free.
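For illustration, a sketch of that approach: a Boost-style hash_combine plus a tie() helper, with an invented Point type standing in for "yet another type":

```cpp
#include <cstddef>
#include <functional>
#include <tuple>
#include <unordered_set>

// Boost-style hash_combine: write it once, reuse it everywhere.
inline void hash_combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct Point {
    int x, y;

    // "Get state as tie": one place that lists the members...
    auto tie() const { return std::tie(x, y); }

    // ...and < and == fall out of it nearly for free.
    friend bool operator<(const Point& a, const Point& b)  { return a.tie() < b.tie(); }
    friend bool operator==(const Point& a, const Point& b) { return a.tie() == b.tie(); }
};

struct PointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t seed = 0;
        hash_combine(seed, std::hash<int>{}(p.x));
        hash_combine(seed, std::hash<int>{}(p.y));
        return seed;
    }
};

int main() {
    std::unordered_set<Point, PointHash> seen;
    seen.insert({1, 2});
    return seen.count({1, 2}) ? 0 : 1;
}
```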
Splicing between containers via node-based operations may also favour the ordered containers, since splicing into a hash map isn't free the way a good ordered-container splice is. But I am uncertain.
Finally, the iterator invalidation guarantees differ. Blindly replacing a mature, tested meow with an unordered meow could create bugs. And maybe the invalidation guarantees of maps are worth it to you.
std::set/std::map and std::unordered_set/std::unordered_map are used in very different problem areas and are not replaceable by each other.
std::set/std::map are used where the problem revolves around the order of elements and O(log n) average-case element access is acceptable. With std::set/std::map, other information can also be retrieved, for example the number of elements greater than a given element.
std::unordered_set/std::unordered_map are used where element access has to be O(1) average-case and order is not important. For example, suppose you want to keep elements keyed by integer in a std::vector, so that vec[10] = 10. That is not practical if the keys vary drastically: if one key is 20 and another is 50000, then to keep only two values a std::vector of size 50001 has to be allocated, and if you use std::set/std::map instead, element access is O(log n), not O(1). For this problem std::unordered_set/std::unordered_map is used: it offers O(1) average-case access via hashing without allocating a large amount of space.
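To make the sparse-key example concrete (the 20/50000 numbers are taken from the paragraph above):

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

int main() {
    // Direct indexing: two values, but the table must span the whole key range.
    std::vector<int> direct(50001, 0);   // ~50001 ints allocated for 2 entries
    direct[20] = 1;
    direct[50000] = 2;

    // Hash map: storage proportional to the number of entries,
    // still O(1) average lookup.
    std::unordered_map<int, int> sparse;
    sparse[20] = 1;
    sparse[50000] = 2;

    std::printf("%d %d\n", direct[50000], sparse.at(50000));
}
```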

Efficiency of sorting Algorithms as it relates to input range

I was wondering if the typical fast sorting algorithms (e.g. quicksort) maintain their superiority when 'unnatural' inputs are used as opposed to rather more standard inputs.
That is, if we had an array of N integers in the range 0 to N^4, would quicksort still be the fastest, given the extremely wide range of the integers?
Quicksort isn't affected by the range of the numbers, but by their order (i.e. whether the numbers are already sorted or sorted in reverse order while you pick the first element as the pivot). If you use a random-pivot approach, even that problem is solved (see the sketch after this answer).
In summary, every algorithm has a worst case complexity and it is usually discussed in the literature about the algorithm.
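Here is a sketch of the random-pivot idea mentioned above, using a simple Lomuto partition (a minimal illustration, not a tuned implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

namespace {
std::mt19937 rng(std::random_device{}());

// Quicksort with a randomly chosen pivot (Lomuto partition).
// Random pivots make already-sorted or reverse-sorted inputs
// no worse than any other input, in expectation.
void quicksort(std::vector<int>& a, std::ptrdiff_t lo, std::ptrdiff_t hi) {
    if (lo >= hi) return;
    std::uniform_int_distribution<std::ptrdiff_t> pick(lo, hi);
    std::swap(a[pick(rng)], a[hi]);          // move a random pivot to the end
    int pivot = a[hi];
    std::ptrdiff_t i = lo;
    for (std::ptrdiff_t j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);                  // pivot into its final position
    quicksort(a, lo, i - 1);
    quicksort(a, i + 1, hi);
}
}  // namespace

int main() {
    std::vector<int> v = {5, 1, 4, 1, 5, 9, 2, 6};
    quicksort(v, 0, static_cast<std::ptrdiff_t>(v.size()) - 1);
    return std::is_sorted(v.begin(), v.end()) ? 0 : 1;
}
```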
N^4 isn't very big: an array of 2 billion integers would only require 128 bits per integer to meet that requirement. Since that is roughly 32 GB of memory (2 billion × 16 bytes), you will generally be limited to O(N*log(N)) sorting algorithms that can sort in place, like quicksort, rather than O(N) algorithms that require twice as much memory.
Algorithms that allow O(N) (in the best case, which is not likely here) will typically be limited by memory. The example given, radix sort, becomes O(N log(N)) with large data elements, because the data is effectively variable-length - consider an integer that is 32,768 bytes - on a 64-bit machine, your first bucket might be based on the first 8 bytes, the second bucket on the second 8 bytes, but because of the very large possible range and the non-random distribution within buckets, most buckets will be small, leaving a few very large buckets to be sorted using an O(N log(N)) algorithm. Also, this algorithm requires "buckets" to be allocated to hold elements for each radix, which will double the total memory requirement.
With small lists of elements that require very expensive comparisons, radix sort might be a good option, but the difference between O(N) and O(N log(N)) may not be as important with small lists.
Also, with very expensive comparisons, such as very large strings, some variation of a Schwartzian Transform would probably be helpful, and since each algorithm balances between memory and cpu, the optimal sorting algorithm will then be based on the choice between using more memory or using more cpu.
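As a sketch of the decorate-sort-undecorate (Schwartzian transform) idea, with a deliberately simple stand-in key (a lowercased copy) in place of whatever expensive key your data actually needs:

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stand-in for an expensive-to-compute sort key: here a lowercased copy,
// so the sort is case-insensitive without re-lowercasing on every comparison.
std::string sort_key(const std::string& s) {
    std::string k = s;
    std::transform(k.begin(), k.end(), k.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return k;
}

void schwartzian_sort(std::vector<std::string>& items) {
    // Decorate: compute each key exactly once.
    std::vector<std::pair<std::string, std::string>> decorated;
    decorated.reserve(items.size());
    for (auto& s : items) decorated.emplace_back(sort_key(s), std::move(s));

    // Sort: comparisons now touch only the precomputed keys.
    std::sort(decorated.begin(), decorated.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });

    // Undecorate: put the original values back in the new order.
    for (std::size_t i = 0; i < items.size(); ++i)
        items[i] = std::move(decorated[i].second);
}
```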
Extreme cases might favor a different sorting algorithm, such as nearly-sorted lists, but usually the cost of detecting those will be high, and making assumptions that an extreme case is true can cause big problems if there is ever a chance that it won't be.
Having said all of that, practical implementations should just use std::sort (with a suitable comparison function) unless there is a compelling reason not to, since std::sort implementations can choose from more than one algorithm depending on the input data.
All of the well-known general-purpose sorting algorithms are based on element comparison, i.e. they check whether an element is less than, equal to, or greater than another element. Therefore they are absolutely independent of the range.
However there are special cases where the relative performance of certain algorithms can differ strongly from the average case. Examples for such cases are:
The elements are already sorted except a single element or a small subset.
The elements are in reverse order.
All elements are equal except one.
That's why for each sort algorithm, an average and a worst-case performance can be determined.
The other answers are essentially right, in that generally sorting algorithms aren't better or worse based on the range of the inputs. However, there is at least one reason why an algorithm could be better or worse based on input range, and that is how they handle duplicate values.
For example, Quicksort is worse on average when there are more duplicate values (see this question for an explanation of why), and when the range of inputs is greater, the chance of duplicates decreases (assuming they are distributed throughout the full range).

Does a map get slower the longer it is

Will a map get slower the longer it is? I'm not talking about iterating through it, but rather operations like .find(), .insert() and .at().
For instance if we have map<int, Object> mapA which contains 100'000'000 elements and map<int, Object> mapB which only contains 100 elements.
Will there be any difference performance wise executing mapA.find(x) and mapB.find(x)?
The complexity of lookup and insertion operations on std::map is logarithmic in the number of elements in the map. So it gets slower as the map gets larger, but only very slowly (the slowdown grows more slowly than any polynomial in the element count). To implement a container with such properties, operations usually take the form of a binary search.
To imagine how much slower it is, you essentially require one further operation every time you double the number of elements. So if you need k operations on a map with 4000 elements, you need k + 1 operations on a map with 8000 elements, k + 2 for 16000 elements, and so forth.
By contrast, std::unordered_map does not offer you an ordering of the elements, and in return it gives you a complexity that's constant on average. This container is usually implemented as a hash table. "On average" means that looking up one particular element may take long, but the time it takes to look up many randomly chosen elements, divided by the number of looked-up elements, does not depend on the container size. The unordered map offers you fewer features, and as a result can potentially give you better performance.
However, be careful when choosing which map to use (provided ordering doesn't matter), since asymptotic cost doesn't tell you anything about real wall-clock cost. The cost of hashing involved in the unordered map operations may contribute a significant constant factor that only makes the unordered map faster than the ordered map at large sizes. Moreover, the lack of predictability of the unordered map (along with potential complexity attacks using chosen keys) may make the ordered map preferable in situations where you need control on the worst case rather than the average.
The C++ standard only requires that std::map has logarithmic lookup time; not that it is a logarithm of any particular base or with any particular constant overhead.
As such, asking "how many times slower would a 100 million map be than a 100 map" is nonsensical; it could well be that the overhead easily dominates both, so that the operations are about the same speed. It could even well be that for small sizes the time growth is exponential! By design, none of these things are deducible purely from the specification.
Further, you're asking about time rather than operations, and that depends heavily on access patterns. To use some diagrams from Paul Khong's (amazing) blog on binary searches: the runtimes for repeated searches (look at stl, the turquoise line) are almost perfectly logarithmic, but once you start doing random access the performance becomes decidedly non-logarithmic, due to memory accesses outside of level-1 cache.
Note that goog refers to Google's dense_hash_map, which is akin to unordered_map. In this case, even it does not escape performance degradation at larger sizes.
The latter graph is likely the more telling for most situations, and suggests that looking up a random key in a size-100 map costs about 10x less than in a size-500'000 map. dense_hash_map will degrade worse than that, in that it will go from almost-free to certainly-not-free, albeit always remaining much faster than the STL's map.
In general, when asking these questions, an approach from theory can only give you very rough answers. A quick look at actual benchmarks and considerations of constant factors is likely to fine-tune these rough answers significantly.
Now, also remember that you're talking about map<int, Object>, which is very different from set<uint32_t>; if the Object is large this will emphasize the cost of cache misses and de-emphasize the cost of traversal.
A pedantic aside.
A quick note about hash maps: Their time complexity is often described as constant time, but this isn't strictly true. Most hash maps rather give you constant time with very high likelihood with regards to lookups, and amortized constant time with very high likelihood with regards to inserts.
The former means that for most hash tables there is an input that makes them perform less than optimally, and for user-supplied input this could be dangerous. For this reason, Rust uses a DoS-resistant keyed hash (SipHash) by default, Java's HashMap falls back to balanced trees for buckets with many collisions, and CPython randomizes hashes. Generally, if you're exposing your hash table to untrusted input, you should make sure you're using some mitigation of this kind.
Some schemes, like cuckoo hashing, do better than probabilistic guarantees (on constrained data types, given a special kind of hash function) for the case where you're worried about attackers, and incremental resizing removes the amortized time cost (assuming cheap allocations), but neither is commonly used, since these are rarely problems that need solving and the solutions are not free.
That said, if you're struggling to think of why we'd go through the hassle of using unordered maps, look back up at the graphs. They're fast, and you should use them.

hash_map and map which is faster? less than 10000 items

VS2005 supports both
::stdext::hash_map
::std::map.
However, it seems ::stdext::hash_map's insert and remove operations are slower than ::std::map in my test
(fewer than 10000 items).
Interesting....
Can anyone offer a comparison article about them?
Normally you look to the complexities of the various operations, and that's a good guide: amortized O(1) insert, O(1) lookup, delete for a hashmap as against O(log N) insert, lookup, delete for a tree-based map.
However, there are certain situations where the complexities are misleading because the constant terms involved are extreme. For example, suppose that your 10k items are keyed off strings. Suppose further that those strings are each 100k characters long. Suppose that different strings typically differ near the beginning of the string (for example if they're essentially random, pairs will differ in the first byte with probability 255/256).
Then to do a lookup the hashmap has to hash a 100k string. This is O(1) in the size of the collection, but might take quite a long time since it's probably O(M) in the length of the string. A balanced tree has to do log N <= 14 comparisons, but each one only needs to look at a few bytes. This might not take very long at all.
In terms of memory access, with a 64 byte cache line size, the hashmap loads over 1500 sequential lines, and does 100k byte operations, whereas the tree loads 15 random lines (actually probably 30 due to the indirection through the string) and does 14 * (some small number) byte operations. You can see that the former might well be slower than the latter. Or it might be faster: how good are your architecture's FSB bandwidth, stall time, and speculative read caching?
If the lookup finds a match, then of course in addition to this both structures need to perform a single full-length string comparison. Also the hashmap might do additional failed comparisons if there happens to be a collision in the bucket.
So assuming that failed comparisons are so fast as to be negligible, while successful comparisons and hashing ops are slow, the tree might be roughly 1.5-2 times as fast as the hash. If those assumptions don't hold, then it won't be.
An extreme example, of course, but it's pretty easy to see that on your data, a particular O(log N) operation might be considerably faster than a particular O(1) operation. You are of course right to want to test, but if your test data is not representative of the real world, then your test results may not be representative either. Comparisons of data structures based on complexity refer to behaviour in the limit as N approaches infinity. But N doesn't approach infinity. It's 10000.
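A tiny sketch of that constant-factor argument (the 100k length and repetition count are arbitrary, and results will vary wildly by machine and library):

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>

int main() {
    // Two 100k-character strings that differ in the first byte.
    std::string a(100000, 'x');
    std::string b = a;
    b[0] = 'y';

    auto t0 = std::chrono::steady_clock::now();
    std::size_t h = 0;
    for (int i = 0; i < 1000; ++i) h ^= std::hash<std::string>{}(a);  // must read all 100k bytes
    auto t1 = std::chrono::steady_clock::now();
    int cmp = 0;
    for (int i = 0; i < 1000; ++i) cmp += (a < b);                    // stops at the first byte
    auto t2 = std::chrono::steady_clock::now();

    std::printf("hash: %f s, compare: %f s (h=%zu, cmp=%d)\n",
                std::chrono::duration<double>(t1 - t0).count(),
                std::chrono::duration<double>(t2 - t1).count(), h, cmp);
}
```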
It is not just about insertion and removal. You must consider that memory is allocated differently in a hash_map vs map, and that you have to calculate the hash of the value being searched for on every lookup.
I think this Dr.Dobbs article will answer your question best:
C++ STL Hash Containers and Performance
It depends upon your usage and your hash collisions. One is a binary tree and the other is a hashtable.
Ideally the hash map will have O(1) insertion and lookup, and the map O(log n), but that presumes non-clashing hashes.
hash_map uses a hash table, something that offers almost constant time O(1) operations assuming a good hash function.
map uses a BST; it offers O(lg(n)) operations, and for 10000 elements that's about 13, which is very acceptable.
I'd say stay with map, it's safer.
Hash tables are supposed to be faster than binary trees (i.e. std::map) for lookup. Nobody has ever suggested that they are faster for insert and delete.
A hash map creates a hash of the string/key for indexing. Although its complexity is usually quoted as O(1), a hash_map does collision detection on every insert, since the hash of one string can produce the same index as the hash of another string. A hash map therefore has added cost for managing these collisions, and, as you know, these collisions depend on the input data.
However, if you are going to perform a lot of look-ups on the structure, opt for hash_map.