Deciding when to use a hash table - C++

I was solving a competitive programming problem with the following requirements:
I had to maintain a list of unique 2D points (x, y); the number of unique points would be less than 500.
My idea was to store them in a hash table (a C++ unordered_set, to be specific) and each time a point turned up I would look it up in the table and, if it was not already there, insert it.
I also know for a fact that I wouldn't be doing more than 500 lookups.
Then I saw some solutions simply searching through an (unsorted) array and checking whether the point was already there before inserting.
My question is: is there any reasonable way to guess when I should use a hash table over a manual search over the keys, without having to benchmark them?

My question is: is there any reasonable way to guess when I should use a hash table over a manual search over the keys, without having to benchmark them?
I am guessing you are familiar with basic algorithmics and time complexity, know the C++ standard containers, and know that, with luck, hash table access is O(1) on average.
If the hash table code (or some balanced tree code, e.g. using std::map - assuming there is an easy order on keys) is more readable, I would prefer it for that readability reason alone.
Otherwise, you might make some guess taking into account the approximate timing for various operations on a PC. BTW, the entire http://norvig.com/21-days.html page is worth reading.
Basically, memory accesses are much slower than everything else in the CPU. The CPU cache is extremely important. A typical memory access with a cache miss, requiring a fetch from the DRAM modules, is several hundred times slower than an elementary arithmetic operation or machine instruction (e.g. adding two integers in registers).
In practice, it does not matter that much, as long as your data is tiny (e.g. less than a thousand elements), since in that case it is likely to sit in L2 cache.
Searching (linearly) in an array is really fast (since it is very cache friendly), up to several thousand (small) elements.
IIRC, Herb Sutter mentions in some video that even inserting an element into a vector is practically (but unintuitively) faster, taking into account the time needed to shift the following elements, than inserting it into some balanced tree (or perhaps some other container, e.g. a hash table), up to a container size of several thousand small elements. This is on a typical tablet, desktop or server microprocessor with a multi-megabyte cache. YMMV.
If you really care that much, you cannot avoid benchmarking.
Notice that 500 pairs of integers probably fit into the L1 cache!
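For what it's worth, a minimal sketch of the hash-table idea from the question. Note that std::unordered_set has no built-in hash for std::pair, so the pair hash below is an illustrative (not tuned) one you would have to supply yourself:

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>

// Illustrative hash for a 2D integer point: pack x into the high bits
// and y into the low bits, then hash the combined 64-bit value.
struct PointHash {
    std::size_t operator()(const std::pair<int, int>& p) const {
        long long packed = (static_cast<long long>(p.first) << 32) ^
                           static_cast<unsigned int>(p.second);
        return std::hash<long long>()(packed);
    }
};

int main() {
    std::unordered_set<std::pair<int, int>, PointHash> seen;
    seen.reserve(512);  // at most 500 unique points, per the question

    std::pair<int, int> p{3, 4};
    if (seen.insert(p).second) {
        // insert() returned true: the point was not there before
    }
}
```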

My rule of thumb is to assume the processor can deal with 10^9 operations per second.
In your case there are only 500 entries, so an algorithm up to O(N^2) should be safe. By using a contiguous data structure like a vector you can leverage the fast cache. Also, the hash function can sometimes be costly in terms of its constant factor. However, if you had a data size of 10^6, the safe complexity might be only O(N) in total; in that case you might need to consider an O(1) hash map for a single lookup.
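As a rough sketch of the vector approach (a linear scan before each insert, at most about 500 * 500 = 250,000 comparisons for this problem):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<int, int>> points;
    points.reserve(500);  // known upper bound from the problem

    auto add_if_new = [&](std::pair<int, int> p) {
        // Linear scan over a small contiguous array: very cache friendly.
        if (std::find(points.begin(), points.end(), p) == points.end())
            points.push_back(p);
    };

    add_if_new({3, 4});
    add_if_new({3, 4});  // duplicate, ignored
}
```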

You can use big-O complexity to roughly estimate the performance. For a hash table, searching for an element is O(1) on average and O(n) in the worst case. That means that in the best case your access time is independent of the number of elements in your map, but in the worst case it is linearly dependent on the number of elements in your hash table.
A balanced binary tree has a guaranteed search complexity of O(log(n)). That means that searching for an element always depends on the number of elements, but in the worst case it is faster than a hash table.
You can look up some Big O Complexities at this handy website here: http://bigocheatsheet.com/
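For completeness, the balanced-tree option needs no custom hash in C++, since std::pair already provides operator<; a small sketch:

```cpp
#include <set>
#include <utility>

int main() {
    // std::set is typically a red-black tree: guaranteed O(log n)
    // search/insert, with no hash function to write or tune.
    std::set<std::pair<int, int>> seen;
    seen.insert({3, 4});
    bool duplicate = !seen.insert({3, 4}).second;  // true: already present
    (void)duplicate;
}
```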

Related

Why O(1) strict LRU implementation is not used in production software(s) and hardware(s)?

Explaining more:
While reading about LRU, or Least Recently Used, cache implementations, I came across an O(1) solution which uses an unordered_map (C++) and a doubly linked list. This is very efficient, as accessing an element via the map is essentially O(1) (due to very good internal hashing), and then moving or deleting a node in the doubly linked list is also O(1).
Well Explained here
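For concreteness, a minimal sketch of that unordered_map plus doubly linked list design (the class name, int keys/values and capacity handling are made up for illustration):

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Sketch of an LRU cache: the list keeps usage order (front = most
// recently used), the map gives O(1) average lookup into the list.
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    bool get(int key, int& value) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        // Move the node to the front in O(1); list iterators stay valid.
        order_.splice(order_.begin(), order_, it->second);
        value = it->second->second;
        return true;
    }

    void put(int key, int value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = value;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {   // evict the least recently used
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, value);
        index_[key] = order_.begin();
    }

private:
    std::size_t capacity_;
    std::list<std::pair<int, int>> order_;
    std::unordered_map<int, std::list<std::pair<int, int>>::iterator> index_;
};
```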
But then, on further research, I came across How is an LRU cache implemented in a CPU?
"For larger associativity, the number of states increases
dramatically: factorial of the number of ways. So a 4-way cache would
have 24 states, requiring 5 bits per set and an 8-way cache would have
40,320 states, requiring 16 bits per set. In addition to the storage
overhead, there is also greater overhead in updating the value"
and this
Now I am having a hard time understanding why the O(1) solution won't fix most of the problems of saving states per set and "age bits" for tracking LRU.
My guess would be that, using the hash map, there is no way of preserving the associativity, but then each entry can be a standalone entry, and as access is O(1) it should not matter, right? The only thing I can think of right now is the size of this hash map, but I still can't figure it out.
Any help appreciated, Thanks!

Why are Get and MultiGet significantly slower for large key sets compared to using an Iterator?

I'm currently playing around with RocksDB (C++) and was curious about some performance metrics I've experienced.
For testing purposes, my database keys are file paths and the values are filenames. My database has around 2M entries in it. I'm running RocksDB locally on a MacBook Pro 2016 (SSD).
My use case is dominated by reads. Full key scans are quite common, as are key scans that include a "significant" number of keys (50%+).
I'm curious about the following observations:
1. An Iterator is dramatically faster than calling Get when performing full key scans.
When I want to look at all of the keys in the database, I'm seeing a 4-8x performance improvement when using an Iterator instead of calling Get for each key. The use of MultiGet makes no difference.
In the case of calling Get roughly 2M times, the keys have been previously fetched into a vector and sorted lexicographically. Why is calling Get repeatedly so much slower than using an Iterator? Is there a way to narrow the performance gap between the two APIs?
2. When fetching around half the keys, the performance between using an Iterator and Get starts to become negligible.
As the number of keys to fetch is reduced, making multiple calls to Get starts to take about as long as using an Iterator, since the iterator pays the price of scanning over keys that aren't in the desired key set.
Is there some "magic" ratio where this becomes true for most databases? For example, if I need to scan over 25% of the keys, then calling Get is faster, but if it's 75% of the keys, then an Iterator is faster. But those numbers are just "made up" by rough testing.
3. Fetching keys in sorted order does not appear to improve performance.
If I pre-sort the keys I want to fetch into the same order that an Iterator would return them in, that does not appear to make calling Get multiple times any faster. Why is that? It's mentioned in the documentation that it's recommended to sort keys before doing a batch insert. Does Get not benefit from the same look-ahead caching that an Iterator benefits from?
4. What settings are recommended for a read-heavy use case?
Finally, are there any specific settings recommended for a read-heavy use case that might involve scanning a significant number of keys at once?
macOS 10.14.3, MacBook Pro 2016 SSD, RocksDB 5.18.3, Xcode 10.1
RocksDB internally represents its data as a log-structured merge tree which has several sorted layers by default (this can be changed with plugins/config). The intuition from Paul's first answer holds, except there is no classical index; the data is actually sorted on disk with pointers to the next files. The lookup operation has on average logarithmic complexity, but advancing an iterator in a sorted range is constant time. So for dense sequential reads, iterating is much faster.
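To make the two access patterns concrete, a hedged sketch using the public RocksDB C++ API (error handling and setup omitted for brevity):

```cpp
#include <memory>
#include <string>
#include <vector>
#include <rocksdb/db.h>

void scan_two_ways(rocksdb::DB* db, const std::vector<std::string>& keys) {
    // Pattern 1: one Get per key; each call is a fresh top-down lookup.
    std::string value;
    for (const auto& k : keys) {
        db->Get(rocksdb::ReadOptions(), k, &value);
    }

    // Pattern 2: a single iterator; Next() just advances within the
    // already-positioned sorted runs.
    std::unique_ptr<rocksdb::Iterator> it(
        db->NewIterator(rocksdb::ReadOptions()));
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        // it->key() and it->value() are rocksdb::Slice views.
    }
}
```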
The point where the costs balance out is determined not only by the number of keys you read, but also by the size of the database. As the database grows, the lookup becomes slower, while Next() remains constant. Very recent inserts are likely to be read very fast, since they may still be in memory (memtables).
Sorting the keys actually just improves your cache hit rate. Depending on your disk, the difference may be very small; e.g., if you have an NVMe SSD, the difference in access time is just not as drastic anymore as it was with RAM vs. HDD. If you have to do several operations over the same or even different key sets, doing them in key order (f(a-c) g(a-c) f(d-g) ...) instead of sequentially should improve your performance, since you will have more cache hits and also benefit from the RocksDB block cache.
The tuning guide is a good starting point, especially the video on database solutions, but if RocksDB is too slow for you, also consider using a DB based on a different storage algorithm. LSM is typically better for write-heavy workloads, and while RocksDB lets you control read vs. write vs. space amplification very well, a B-tree or ISAM based solution may just be much faster for range reads/repeated reads.
I don't know anything about RocksDB per se, but I can answer a lot of this from first principles.
An Iterator is dramatically faster than calling Get when performing full key scans.
This is likely to be because Get has to do a full lookup in the underlying index (starting from the top) whereas advancing an iterator can be achieved by just moving from the current node to the next. Assuming the index is implemented as a red-black tree or similar, there's a lot less work in the second method than the first.
When fetching around half the keys, the performance between using an Iterator and Get starts to become negligible.
So you are skipping entries by calling iterator->Next () multiple times? If so, then there will come a point where it's cheaper to call Get for each key instead, yes. Exactly when that happens will depend on the number of entries in the index (since that determines the number of levels in the tree).
Fetching keys in sorted order does not appear to improve performance.
No, I would not expect it to. Get is (presumably) stateless.
What settings are recommended for a read-heavy use case?
That I don't know, sorry, but you might read:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

Why does CouchDB use an append-only B+ tree and not a HAMT

I'm reading up on datastructures, especially immutable ones like the append-only B+ tree used in CouchDB and the Hash array mapped trie used in Clojure and some other functional programming languages.
The main reason datastructures that work well in memory might not work well on disk appears to be time spent on disk seeks due to fragmentation, as with a normal binary tree.
However, HAMT is also very shallow, so doesn't require any more seeks than a B tree.
Another suggested reason is that deletions from an array mapped trie are more expensive than from a B tree. This is based on the assumption that we're talking about a dense vector, and doesn't apply when using either as a hash map.
What's more, it seems that a B tree does more rebalancing, so using it in an append-only manner produces more garbage.
So why do CouchDB and practically every other database and filesystem use B trees?
[edit] fractal trees? log-structured merge tree? mind = blown
[edit] Real-life B trees use a degree in the thousands, while a HAMT has a degree of 32. A HAMT of degree 1024 would be possible, but slower due to popcnt handling 32 or 64 bits at a time.
B-trees are used because they are a well-understood algorithm that achieves "ideal" sorted-order read-cost. Because keys are sorted, moving to the next or previous key is very cheap.
HAMTs, or other hash storage, store keys in random order. Keys are retrieved by their exact value, and there is no efficient way to find the next or previous key.
Regarding degree, it is normally selected indirectly, by selecting page size. HAMTs are most often used in memory, with pages sized for cache lines, while B-trees are most often used with secondary storage, where page sizes are related to IO and VM parameters.
Log Structured Merge (LSM) is a different approach to sorted-order storage which achieves better write efficiency by trading off some read efficiency. That hit to read efficiency can be a problem for read-modify-write workloads, but the fewer uncached reads there are, the better the overall throughput LSM provides versus a B-tree, at the cost of higher worst-case read latency.
LSM also offers the promise of a wider performance envelope. Putting new data into its proper place is "deferred", offering the possibility to tune read-to-write efficiency by controlling the proportion of deferred cleanup work to live work. In theory, an ideal LSM with zero deferral is a B-tree, and one with 100% deferral is a log.
However, LSM is more of a "family" of algorithms than a specific algorithm like a B-tree. Their usage is growing in popularity, but it is hindered by the lack of a de-facto optimal LSM design. LevelDB/RocksDB is one of the more practical LSM implementations, but it is far from optimal.
Another approach to achieving write-throughput efficiency is to write-optimize B-trees through write-deferral, while attempting to maintain their optimal read-throughput.
Fractal-trees, shuttle-trees, stratified-trees are this type of design, and represent a hybrid gray area between B-tree and LSM. Rather than deferring writes to an offline process, they amortize write-deferral in a fixed way. For example, such a design might represent a fixed 60%-write-deferral fraction. This means they can't achieve the 100% write-deferral performance of an LSM, but they also have a more predictable read-performance, making them more practical drop-in replacements for B-trees. (As in the commercial Tokutek MySQL and MongoDB fractal-tree backends)
B-trees are ordered by their key, while in a hash map similar keys have very different hash values and so are stored far from each other. Now think of a query that does a range scan, "give me yesterday's sales": with a hash map you have to scan the whole map to find them; with a B-tree on the sales_dtm column you'll find them nicely clustered, and you know exactly where to start and stop reading.
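A small sketch of that range-scan point, using standard containers as stand-ins for the two storage layouts (the timestamps and amounts are invented for illustration):

```cpp
#include <map>
#include <string>
#include <unordered_map>

int main() {
    // Ordered by key: "yesterday's sales" is one contiguous range.
    std::map<std::string, double> sales_by_dtm = {
        {"2019-03-11 09:00", 10.0}, {"2019-03-12 10:30", 25.0},
        {"2019-03-12 14:00", 5.0},  {"2019-03-13 08:15", 40.0}};
    double total = 0;
    auto lo = sales_by_dtm.lower_bound("2019-03-12");
    auto hi = sales_by_dtm.lower_bound("2019-03-13");
    for (auto it = lo; it != hi; ++it) total += it->second;

    // Hashed: similar keys land far apart, so the only option is to
    // walk the whole table and test every entry.
    std::unordered_map<std::string, double> sales_hashed(
        sales_by_dtm.begin(), sales_by_dtm.end());
    double total2 = 0;
    for (const auto& kv : sales_hashed)
        if (kv.first >= "2019-03-12" && kv.first < "2019-03-13")
            total2 += kv.second;
}
```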

What are the criteria for choosing a sorting algorithm?

I was reading about sorting methods, which include bubble sort, selection sort, merge sort, heap sort, bucket sort, etc. They also come with time complexities, which help us know which sort is efficient. So I had a basic question: if we have some data, how will we choose a sort? Time complexity is one parameter which helps us decide on a sorting method. But do we have other parameters for choosing a sorting method?
Just trying to figure out sorting for better understanding.
I have some queries about heap sort:
Where do we use heap sort?
What is the biggest advantage of heap sort (apart from its O(n log n) time complexity)?
What are the disadvantages of heap sort?
What is the build time for a heap? (I heard O(n) but I'm not sure.)
Is there any scenario where we have to use heap sort, or where heap sort is the better option (apart from a priority queue)?
Before applying heap sort to the data, what parameters should we look at in the data?
The two main theoretical features of sorting algorithms are time complexity and space complexity.
In general, time complexity lets us know how the performance of the algorithm changes as the size of the data set increases. Things to consider:
How much data are you expecting to sort? This will help you know whether you need to look for an algorithm with a very low time complexity.
How sorted will your data be already? Will it be partly sorted? Randomly sorted? This can affect the time complexity of the sorting algorithm. Most algorithms will have worst and best cases - you want to make sure you're not using an algorithm on a worst-case data set.
Time complexity is not the same as running time. Remember that time complexity only describes how the performance of an algorithm varies as the size of the data set increases. An algorithm that always does one pass over all the input would be O(n); its performance is linearly correlated with the size of the input. But an algorithm that always does two passes over the data set is also O(n); the correlation is still linear, even if the constant (and actual running time) is different.
Similarly, space complexity describes how much space an algorithm needs to run. For example, a simple sort such as insertion sort needs an additional fixed amount of space to store the value of the element currently being inserted. This is an auxiliary space complexity of O(1) - it doesn't change with the size of the input. However, merge sort creates extra arrays in memory while it runs, with an auxiliary space complexity of O(n). This means the amount of extra space it requires is linearly correlated with the size of the input.
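As a tiny illustration of the auxiliary-space point, an in-place insertion sort only ever needs one temporary value, regardless of input size (a sketch):

```cpp
#include <cstddef>
#include <vector>

// In-place insertion sort: O(1) auxiliary space, O(n^2) worst-case time,
// close to O(n) on input that is already nearly sorted.
void insertion_sort(std::vector<int>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        int current = a[i];          // the single extra value being inserted
        std::size_t j = i;
        while (j > 0 && a[j - 1] > current) {
            a[j] = a[j - 1];         // shift larger elements one slot right
            --j;
        }
        a[j] = current;
    }
}
```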
Of course, algorithm design is often a trade-off between time and space: algorithms with a low space complexity may require more time, and algorithms with a low time complexity may require more space.
For more information, you may find this tutorial useful.
To answer your updated question, you may find the wikipedia page on Heap Sort useful.
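Regarding the heap build time asked about above: a quick sketch with the standard library, where std::make_heap builds the heap in O(n) and std::sort_heap then pops repeatedly for O(n log n) total, all in place:

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v = {5, 1, 4, 2, 3};
    std::make_heap(v.begin(), v.end());  // heap construction: O(n)
    std::sort_heap(v.begin(), v.end());  // repeated pops: O(n log n), in place
    // v is now {1, 2, 3, 4, 5}
}
```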
If you mean criteria for what type of sort to choose, here are some other items to consider.
The amount of data you have: Do you have ten, one hundred, a thousand or millions of items to be sorted?
Complexity of the algorithm: The more complex it is, the more testing will need to be done to make sure it is correct. For small amounts of data, a bubble sort or quick sort is easy to code and test, versus other sorts which may be overkill for the amount of data you have to sort.
How much time will it take to sort: If you have a large set, bubble/quick sort will take a lot of time, but if you have a lot of time, that may not be an issue. However, using a more complex algorithm will cut down the time to sort, but at the cost of more effort in coding and testing, which may be worth it if sorting goes from long (hours/days) to a shorter amount of time.
The data itself: Is the data close to being the same for every item? For some sorts you may end up with a degenerate (linear) case, so if you know something about the composition of the data, it may help in determining which algorithm is worth the effort.
The amount of resources available: Do you have lots of memory in which to store all the items, or do you need to store some of them on disk? If everything cannot fit in memory, merge sort may be best, whereas others may be better if you can work with everything in memory.

Correct data structure to use for (this specific) expiring cache?

I need to read from a dataset which is very large, highly interlinked, the data is fairly localized, and reads are fairly expensive. Specifically:
The data sets are 2gigs - 30gigs in size, so I have to map sections of the file into memory to read. This is very expensive compared to the rest of the work I do in the algorithm. From profiling I've found roughly 60% of the time is spent reading the memory, so this is the right place to start optimizing.
When operating on a piece of this dataset, I have to follow links inside of it (imagine it like being similar to a linked list), and while those reads aren't guaranteed to anywhere near sequential, they are fairly localized. This means:
Let's say, for example, we operate on 2 megs of memory at a time. If you read 2 megs of data into memory, roughly 40% of the reads I will have to subsequently do will be in that same 2 megs of memory. Roughly 20% of the reads will be purely random access in the rest of the data, and the other 40% very likely links back into the 2meg segment which pointed to this one.
From knowledge of the problem and from profiling, I believe that introducing a cache to the program will help greatly. What I want to do is create a cache which holds N chunks of X megs of memory (N and X configurable so I can tune it) which I can check first, before having to map another section of memory. Additionally, the longer something has been in the cache, the less likely it is that we will request that memory in the short term, and so the oldest data will need to be expired.
After all that, my question is very simple: What data structure would be best to implement a cache of this nature?
I need to have very fast lookups to see if a given address is in the cache. With every "miss" of the cache, I'll want to expire the oldest member of it, and add a new member. However, I plan to try to tune it (by changing the amount that's cached) such that 70% or more of reads are hits.
My current thinking is that an AVL tree (log2 n for search/insert/delete) would be the safest (no degenerate cases). My other option is a sparse hash table, so that lookups would be O(1) in the best case. In theory this could degenerate into O(n), but in practice I could keep collisions low. The concern here would be how long it takes to find and remove the oldest entry in the hash table.
Does anyone have any thoughts or suggestions on what data structure would be best here, and why?
Put the cache into two sorted trees (AVL or any other reasonably balanced tree implementation is fine; you're better off using one from a library than creating your own).
One tree should sort by position in the file. This lets you do log(n) lookups to see whether a block is already in your cache.
The other tree should sort by time used (which can be represented by a number that increments by one on each use). When you use a cached block, you remove it, update the time, and insert it again. This will take log(n) also. When you miss, remove the smallest element of the tree, and add the new block as the largest. (Don't forget to also remove/add that block to the by-position-in-file tree.)
If your cache doesn't have very many items in it, you'll be better off still by just storing everything in a sorted array (using insertion sort to add new elements). Moving 16 items down one spot in an array is incredibly fast.
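A rough sketch of that two-tree design, using std::map as the balanced tree (the block type, key widths and capacity are placeholders, and insert() here assumes it is only called after a miss):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Two-tree cache sketch: one tree keyed by file offset for lookups,
// one keyed by last-use time for finding the eviction victim.
struct ChunkCache {
    using Block = std::vector<char>;

    std::map<std::uint64_t, std::pair<std::uint64_t, Block>> by_offset;  // offset -> (last_used, data)
    std::map<std::uint64_t, std::uint64_t> by_time;                      // last_used -> offset
    std::uint64_t clock = 0;
    std::size_t capacity = 16;

    Block* find(std::uint64_t offset) {
        auto it = by_offset.find(offset);        // O(log n)
        if (it == by_offset.end()) return nullptr;
        by_time.erase(it->second.first);         // re-stamp the usage time
        it->second.first = ++clock;
        by_time[clock] = offset;
        return &it->second.second;
    }

    void insert(std::uint64_t offset, Block block) {
        if (by_offset.size() >= capacity) {      // evict the oldest block
            auto oldest = by_time.begin();
            by_offset.erase(oldest->second);
            by_time.erase(oldest);
        }
        by_offset[offset] = {++clock, std::move(block)};
        by_time[clock] = offset;
    }
};
```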
Seems like you are looking for an LRU (Least Recently Used) cache: LRU cache design
If 60% of your algorithm is I/O, I suggest that the actual cache design doesn't really matter that much; any sort of cache could be a substantial speed-up.
However, the design depends a lot on what data you're using to access your chunks: string, int, etc. If you had an int key, you could use a hash map into a linked list: erase the back on a cache miss, and erase and then push to the front on a cache hit.
Hash maps are provided under varying names (most commonly, unordered_map) in many implementations. Boost has one, there's one in TR1, etc. A big advantage of a hash map is less performance loss with a growing number of elements, and more flexibility about key values.