Linked List vs Open Addressing in Hash Table - c++

I have a data set with 130000 elements and two different data structures: a doubly-linked list and a hash table. When inserting elements into the linked list, I place each node at the end of the list using a tail pointer. When inserting elements into the hash table, I use open addressing with a probe function. I get 110000 collisions for the last 10 elements in the data set.
However, the difference between the total insertion times for the two data structures is only about 0.09 seconds:
Linked list = 0.028521 seconds
Hash table = 0.120102 seconds
Are pointer operations slow, or is the probing method very fast?

Insertion at the end of a doubly-linked list via the tail pointer is O(1), as you can read here too.
Insertion into a hash table with open addressing can also be constant time, as you can read here too.
However, a correct and efficient implementation of a hash table with open addressing is quite tricky, since a lot of things can go wrong (the probing, the load factor, the hash function, etc.). Even Wikipedia mentions that.
I get 110000 collisions (in the hash table) for the last 10 elements
This is a strong indication that something is not good in your hash table implementation.
This explains why your time measurements, assuming they are correct, show the doubly-linked list as faster than your hash table.

Are pointer operations slow, or is the probing method very fast?
No actual code was posted, so the answer will be rather theoretical.
In terms of performance, the general answer is: cache misses are expensive. With DDR memory at 60 ns latency and a 3.2 GHz CPU, a last-level cache miss stalls the CPU for roughly 60 * 3.2 ≈ 200 cycles.
Double-linked List
For the doubly-linked list, you have to access the tail pointer, the tail element, and the new element. If you just add elements in a loop, there is a high probability that all of those accesses will hit the CPU cache.
In a real application, if you do other work between the additions, you might incur up to 3 cache misses (tail pointer, tail element, new element).
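For reference, a minimal sketch of such a tail insertion (illustrative names, not the question's code); the comments mark the three memory accesses mentioned above:

```cpp
struct Node {
    int   value;
    Node* prev;
    Node* next;
};

struct List {
    Node* head = nullptr;
    Node* tail = nullptr;

    void push_back(int v) {
        Node* n = new Node{v, tail, nullptr};  // access 1: the new element
        if (tail != nullptr)
            tail->next = n;                    // access 2: the old tail element
        else
            head = n;
        tail = n;                              // access 3: the tail pointer itself
    }
};
```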
Hash Table
For a hash table with open addressing the situation is a bit different. The hash function produces an effectively random index into the table, so the first access to the hash table is usually a cache miss. In your case the hash table is not that big (130K pointers), so it might fit into the L3 cache; but even then, an access that has to come from L3 stalls the CPU for about 30 cycles.
But what happens next? You just put a pointer into the table; there is no need to update a tail element or the new element. So no further cache misses here.
If the hash table slot is occupied, you just check the next one. Such sequential access is easily predicted by the CPU prefetcher, so those accesses usually do not produce cache misses either: the CPU prefetches the next hash table entries into the L1 cache.
So, in a real application the hash table usually incurs one cache miss per insertion, but since the hashed index is unpredictable, it almost always incurs that first cache miss.
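A minimal sketch of such an open-addressing insert, assuming linear probing and a table of pointers (the question's actual hash and probe functions are not shown, so this is only illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ProbingTable {
    std::vector<const int*> slots;                 // nullptr marks an empty slot

    explicit ProbingTable(std::size_t n) : slots(n, nullptr) {}

    // Assumes the table never becomes completely full.
    void insert(const int* item, std::uint64_t hash) {
        std::size_t i = hash % slots.size();       // first probe: usually a cache miss
        while (slots[i] != nullptr)                // occupied: check the next slot,
            i = (i + 1) % slots.size();            // a prefetcher-friendly sequential walk
        slots[i] = item;                           // store the pointer; nothing else to update
    }
};
```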
The Answer
To get a practical answer about what is going on in your application, you should use a tool that reads the CPU performance counters, like perf on Linux or VTune on Windows. The tool will show exactly what your CPU spends its time on.
Real Application
One more theoretical disclaimer here. I suspect that if you fix your hash table (say, keep a few elements per bucket rather than using open addressing) and effectively reduce the number of collisions, the performance could be on par.
Should you use the doubly-linked list over the hash table or vice versa? It depends on your application. A hash table is good for random access, i.e. you can access any element in O(1) expected time. For a doubly-linked list you have to traverse the list, so the estimate would be O(n).
On the other hand, just adding elements at the end of a list is not only a cheaper operation, but also much easier to implement: you do not have to care about collisions or hash table overflows.
So, in some cases a doubly-linked list has clear advantages over a hash table, and which one suits best depends on the application.

Related

Is there a way to simulate cache locality when benchmarking?

I'm trying to figure out what would be the best approach to benchmark C++ programs and would like to simulate both the scenario when the data related to the benchmarked section is present in the cache and when it's cold.
Is there a reliable way to enforce good and bad cache locality on x86-64 machines as a form of preparation for a test run, assuming the data it will involve is known?
Presumably you are benchmarking an algorithm that performs an operation over a range of objects, and you care about the locality of those objects in memory (and thus cache).
To "simulate" locality: Create locality. You can create a linked list with high locality as well as linked list with low locality:
Allocate the nodes in an array. To create a list with high locality, make sure that the first element of the array points to the second, and so on. To create a list with low locality, create a random permutation of the order so that each node points to another node at a random position in the array.
Make sure the number of elements is at least an order of magnitude greater than what fits in the largest cache.
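A minimal sketch of this setup (illustrative names): the nodes live in one array, and only the link order decides whether a traversal is cache-friendly or not:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct Node {
    int   payload = 0;
    Node* next    = nullptr;
};

// Link nodes[order[0]] -> nodes[order[1]] -> ... and return the head.
Node* link_in_order(std::vector<Node>& nodes, const std::vector<std::size_t>& order) {
    for (std::size_t i = 0; i + 1 < order.size(); ++i)
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order.back()].next = nullptr;
    return &nodes[order.front()];
}

long traverse(Node* head) {                        // the operation being benchmarked
    long sum = 0;
    for (Node* p = head; p != nullptr; p = p->next)
        sum += p->payload;
    return sum;
}

int main() {
    const std::size_t n = std::size_t{1} << 23;    // far larger than the last-level cache
    std::vector<Node> nodes(n);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);

    traverse(link_in_order(nodes, order));         // high locality: sequential walk

    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    traverse(link_in_order(nodes, order));         // low locality: random jumps
}
```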

Inserting to unordered_map takes too much time

I wrote a program that reads numbers out of a file (about 500,000 of them) and inserts them into a data structure. The numbers are distinct.
I'm inserting the numbers into an unordered_map together with an empty struct (using std::make_pair(myNumber, emptyStruct)).
After inserting all the numbers, I use the structure to search only a couple of hundred times. I never delete it until the program finishes running.
After profiling, I've noticed that the insert operation takes about 50% of the running time. (There is also some other code that runs as many times as the insertion, but it doesn't take nearly as much time.)
I thought maybe the resizing takes time, so I used the reserve function with 500,000, but the results are still the same.
As far as I know, this data structure should have O(1) insert and search (with large memory use as the trade-off), so I don't see why inserting takes so much time. How can I improve my results?
Unordered maps are implemented with a hash table, which has amortised constant insertion time. Reserving space in the map helps, but not by much. There is not much better you can do in terms of raw insertion speed.
This means that you might be able to shave some time, but it is only going to be marginal. For instance, inserting into a vector is slightly faster, but it is also amortized constant time. So you will shave some seconds in the insertion at the cost of the search.
This is where a database helps. Say you have the data in a sqlite database instead. You create the database, create the table with the search value as its primary key, and the data value as its other attribute, insert the values into a table once. Now, the program simply runs and queries the database. It only reads the minimum necessary. In this case, the sqlite database takes the role of the unordered map you are using.
Since you are specifically not using a value, and merely searching for existence, go with std::unordered_set. It does what you wanted when you made a dummy value to go with every key in the map.
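A minimal sketch of that suggestion, with the element type and count assumed from the question:

```cpp
#include <cstdint>
#include <unordered_set>

int main() {
    std::unordered_set<std::uint64_t> numbers;
    numbers.reserve(500000);                  // avoid rehashing during the bulk insert

    // ... for each number read from the file:
    numbers.insert(1234567);

    // Later, the few hundred searches become membership tests:
    bool present = numbers.count(1234567) != 0;
    (void)present;
}
```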
First, I want to re-iterate what everyone said: inserting 500,000 items to use it a few hundred times is going to take up a sizable chunk of your time, and you can't really avoid that, unless you can turn it around -- build a set of the things you are searching for, then search that 500,000 times.
All that said, I was able to get some improvement on the insertion of 500,000 items in a test app, by taking into account the nature of hash tables:
Reviewing http://en.cppreference.com/w/cpp/container/unordered_map, I found these:
[Insert] Complexity: Average case: O(1), worst case O(size())
By default, unordered_map containers have a max_load_factor of 1.0.
When you reserve space for 500000 items, you get 500000 buckets. If you put 500000 pieces of data in 500000 buckets, you are going to get a lot of collisions. I reserved extra space, and it was faster.
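As an illustration of that tweak (the exact factor is a guess and should be measured), you can either reserve for more elements than you will insert, or lower the allowed load factor before reserving:

```cpp
#include <unordered_map>

struct Empty {};                       // stands in for the question's empty struct

int main() {
    std::unordered_map<int, Empty> m;

    // Option 1: ask for noticeably more capacity than the element count.
    m.reserve(750000);                 // ~1.5x the 500,000 elements

    // Option 2 (alternative): lower the allowed load factor, then reserve,
    // which also yields more buckets per element.
    // m.max_load_factor(0.7f);
    // m.reserve(500000);

    for (int i = 0; i < 500000; ++i)
        m.emplace(i, Empty{});
}
```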
If you really need speed, and are willing to get some errors, look into bloom filters.

Tree or other data structure most efficient to lookup "recent searches"

I think there is a tree algorithm for what I'm looking for, but I forgot its name and Googling didn't help.
I'm searching for an algorithm that has the very best lookup performance for data with the following characteristics:
- Each lookup is expected to be a hit. So all keys which are looked up exist (there may be some misses, but these will be treated as a "misconfiguration", and the occurrence of such misses is negligible)
- It is very likely (the data set is optimized for this) that the same lookups occur consecutively - e.g. there are likely to be a million lookups for key 123, there may be a single lookup for key 456 in between, and then again millions of lookups for 123. Later, the next group of (likely identical) keys is looked up, and so on.
Sure, I could use a hash table. But for this purpose I remember there was a search-optimized tree that arranges itself so that the most recently looked-up keys sit at the very top of the tree. So potentially the first node of the tree would be a direct hit, O(1), without needing a hash function or the modulo of a hash store.
I'm seeking this algorithm to get raw performance for graphics rendering on mobile devices.
Perhaps a splay tree.
A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again.
But a hash table would be expected O(1), so you shouldn't expect the one to clearly outperform the other.
I would suggest using a hash table for the job. To speed up subsequent searches, you can cache the K most recently accessed distinct elements in an array. If K is small (< 20 or so), a linear search over that array will be very fast, because it can stay in the L1 cache.
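A minimal sketch of that idea (the key and value types are assumptions): a tiny recently-used array is scanned before falling back to the hash table:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>

struct CachedLookup {
    static constexpr std::size_t K = 16;                           // small enough to stay in L1

    std::array<std::pair<std::uint32_t, const int*>, K> recent{};  // value == nullptr: empty slot
    std::size_t next_slot = 0;                                     // simple round-robin eviction
    std::unordered_map<std::uint32_t, int> table;

    const int* find(std::uint32_t key) {
        for (const auto& e : recent)                               // fast path: linear scan
            if (e.second != nullptr && e.first == key)
                return e.second;

        auto it = table.find(key);                                 // slow path: full hash lookup
        if (it == table.end())
            return nullptr;

        recent[next_slot] = {key, &it->second};                    // remember it for later lookups
        next_slot = (next_slot + 1) % K;
        return &it->second;
    }
};
```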

How to improve performance of a hashtable with 1 million elements and 997 buckets?

This is an interview question.
Suppose that there are 1 million elements in the table and 997 buckets of unordered lists. Further suppose that the hash function distributes keys with equal probability (i.e., each bucket has 1000 elements).
What is the worst case time to find an element which is not in the table? To find one which is in the table? How can you improve this?
My solution:
The worst-case time for finding an element not in the table, and for finding one in the table, is O(1000) in both cases. 1000 is the length of the unsorted list.
Improve it:
(0) Straightforward: increase the number of buckets to more than 1 million.
(1) Each bucket holds a second hash table, which uses a different hash function to compute the hash value for the second table. That would be O(1).
(2) Each bucket holds a binary search tree. That would be O(lg n).
Is it possible to make a trade-off between space and time, keeping both of them in a reasonable range?
Any better ideas? Thanks!
The simplest and most obvious improvement would be to increase the number of buckets in the hash table to something like 1.2 million -- at least assuming your hash function can generate numbers in that range (which it typically will).
Obviously, increasing the number of buckets improves performance. Assuming that is not an option (for whatever reason), I suggest the following:
Normally a hash table consists of buckets, each of which holds a linked list (a pointer to its head). You may, however, create a hash table whose buckets hold a binary search tree (a pointer to its root) rather than a list.
You then have a hybrid of a hash table and a binary tree. I once implemented such a thing: I didn't have a limitation on the number of buckets in the hash table, but I didn't know the number of elements in advance and had no information about the quality of the hash function. Hence, I created a hash table with a reasonable number of buckets, and the remaining ambiguity was handled by the binary tree.
If N is the number of elements and M is the number of buckets, then with an even distribution the lookup complexity grows as O(log(N/M)).
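A minimal sketch of such a hybrid, using std::map as a stand-in for the per-bucket binary search tree (a real implementation would typically use its own tree nodes):

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

template <typename Key, typename Value>
class HashTreeTable {
    std::vector<std::map<Key, Value>> buckets_;   // one search tree per bucket
    std::hash<Key> hash_;

public:
    explicit HashTreeTable(std::size_t bucket_count) : buckets_(bucket_count) {}

    void insert(const Key& k, const Value& v) {
        buckets_[hash_(k) % buckets_.size()][k] = v;        // O(log(N/M)) expected
    }

    const Value* find(const Key& k) const {
        const auto& b = buckets_[hash_(k) % buckets_.size()];
        auto it = b.find(k);
        return it == b.end() ? nullptr : &it->second;
    }
};
```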
If you can't use another data structure or a larger table there are still options:
If the active set of elements is closer to 1000 than 1M you can improve the average successful lookup time by moving each element you find to the front of its list. That will allow it to be found quickly when it is reused.
Similarly if there is a set of misses that happens frequently you can cache the negative result (this can just be a special kind of entry in the hash table).
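A minimal sketch of the move-to-front heuristic on a hand-rolled bucket chain (illustrative names, not the interviewer's data structure):

```cpp
struct Entry {
    int    key;
    Entry* next;
};

// Searches one bucket's chain; on a hit, splices the entry to the front so the
// next lookup for the same key succeeds after a single comparison.
Entry* find_move_to_front(Entry*& head, int key) {
    Entry* prev = nullptr;
    for (Entry* e = head; e != nullptr; prev = e, e = e->next) {
        if (e->key == key) {
            if (prev != nullptr) {     // unlink and re-link at the head
                prev->next = e->next;
                e->next = head;
                head = e;
            }
            return e;
        }
    }
    return nullptr;                    // miss: the caller may cache a negative entry here
}
```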
Suppose that there are 1 million elements in the table and 997 buckets of unordered lists. Further suppose that the hash function distributes keys with equal probability (i.e., each bucket has 1000 elements).
That doesn't quite add up, but let's run with it....
What is the worst case time to find an element which is not in the table? To find one which is in the table? How can you improve this?
The worst (and best = only) case for missing elements is that you hash to a bucket then go through inspecting all the elements in that specific list (i.e. 1000) then fail. If they want big-O notation, by definition that describes how performance varies with the number of elements N, so we have to make an assumption about how the # buckets varies with N too: my guess is that the 997 buckets is a fixed amount, and is not going to be increased if the number of elements increases. The number of comparisons is therefore N/997, which - being a linear factor - is still O(N).
My solution: The worst-case time for finding an element not in the table, and for finding one in the table, is O(1000) in both cases. 1000 is the length of the unsorted list.
Nope - you're thinking of the number of comparisons - but big-O notation is about scalability.
Improve it: (0) Straightforward: increase the number of buckets to more than 1 million. (1) Each bucket holds a second hash table, which uses a different hash function to compute the hash value for the second table. That would be O(1). (2) Each bucket holds a binary search tree. That would be O(lg n).
Is it possible to make a trade-off between space and time, keeping both of them in a reasonable range?
Well, yes - the average number of collisions relates to the number of entries and buckets. If you want very few collisions, you'd need well over 1 million buckets in the table, but that gets wasteful of memory, though for large objects you can store an index or pointer to the actual object. An alternative is to look for faster collision-handling mechanisms, such as trying a series of offsets from the hashed-to bucket (using % to map the displacements back into the table size), rather than resorting to heap-allocated linked lists. Rehashing is another alternative, giving lower re-collision rates but typically needing more CPU, and maintaining an arbitrarily long list of good hashing algorithms is problematic.
Hash tables within hash tables is totally pointless and remarkably wasteful of memory. Much better to use a fraction of that space to reduce collisions in the outer hash table.

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations: inserting keys from a very sparse domain (all 32-bit integers, with approx. 100 set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions that hit the same entries (say 500k insertions, but only 100 different keys).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted and not bothering to do the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
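A minimal sketch of that run-skipping wrapper, here around the std::set from the question (the container choice is incidental):

```cpp
#include <optional>
#include <set>

struct RunFilteredSet {
    std::set<unsigned> values;
    std::optional<unsigned> last;      // the most recently inserted value

    void insert(unsigned v) {
        if (last && *last == v)        // a repeat in a run: one comparison, no tree lookup
            return;
        values.insert(v);
        last = v;
    }
};
```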
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 once it gets more than 70% full. Cuckoo hashing has been discussed on Stack Overflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and the lookup in assembler - on linear data structures this is very simple, so the coding and maintenance effort for an assembler implementation shouldn't be unduly hard.
You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list - though I don't know of any decent C++ implementation of it. I intended to submit one to Boost but never got around to it.
Maybe a set with a B-tree (instead of a binary tree) as the internal data structure. I found an article on CodeProject that implements this.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? A bitmap covering all 32-bit values takes "only" 2^32 bits / 8 = 512 MB, which is not much for a high-end server. That would make your insertions O(1). It could make the iteration slow, although skipping words that are all zeroes would optimize most of it away. If your 100 numbers are in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
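A minimal sketch of that brute-force bitmap, with min/max tracking to narrow the iteration range (the bit-scanning intrinsic is GCC/Clang-specific):

```cpp
#include <cstdint>
#include <vector>

struct BitmapSet {
    std::vector<std::uint64_t> bits = std::vector<std::uint64_t>(1ULL << 26);  // 2^32 bits = 512 MB
    std::uint32_t lo = UINT32_MAX, hi = 0;

    void insert(std::uint32_t v) {                         // O(1), no allocation
        bits[v >> 6] |= (1ULL << (v & 63));
        if (v < lo) lo = v;
        if (v > hi) hi = v;
    }

    template <typename Fn>
    void for_each(Fn fn) const {
        if (lo > hi) return;                               // empty set
        for (std::uint64_t w = lo >> 6; w <= (hi >> 6); ++w) {
            std::uint64_t word = bits[w];                  // all-zero words are skipped cheaply
            while (word != 0) {
                unsigned bit = __builtin_ctzll(word);      // GCC/Clang: index of lowest set bit
                fn(static_cast<std::uint32_t>((w << 6) + bit));
                word &= word - 1;                          // clear that bit
            }
        }
    }
};
```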
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure that a naive unordered set of elements packed into a fixed array, with a simple swap-to-front when an insert collides, is actually too slow? It's a simple experiment that might show you have memory locality issues rather than algorithmic ones.
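A minimal sketch of that experiment (the capacity and key type are assumptions): a fixed array scanned linearly, with a swap toward the front when an insert hits an existing key:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

struct SmallFlatSet {
    std::array<std::uint32_t, 128> keys{};     // fixed capacity, assumed sufficient (~100 live keys)
    std::size_t size = 0;

    void insert(std::uint32_t k) {
        for (std::size_t i = 0; i < size; ++i) {
            if (keys[i] == k) {                // already present: move it toward the front
                std::swap(keys[0], keys[i]);   // so repeated inserts find it immediately
                return;
            }
        }
        keys[size++] = k;                      // new key (capacity check omitted for brevity)
    }

    const std::uint32_t* begin() const { return keys.data(); }
    const std::uint32_t* end()   const { return keys.data() + size; }
};
```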