Tree or other data structure most efficient to lookup "recent searches" - c++

I thought there was a tree algorithm for what I'm looking for, but I've forgotten its name and Googling didn't help.
I'm searching for an algorithm with the very best lookup performance for data with these characteristics:
- Each lookup is expected to be a hit, so every key that is looked up exists (there may be some misses, but these are treated as a "misconfiguration", and the occurrence of such misses is negligible)
- It is very likely (the data set is optimized for this) that the same lookups occur in succession - e.g. there are likely to be a million lookups for key 123, perhaps a single lookup for key 456 in between, and then again millions of lookups for 123. Later, the next group of mostly identical keys is looked up, and so on
Sure, I could use a hash algorithm. But for this purpose I remember there was a search-optimized tree that keeps the most recently looked-up keys at the very top, so potentially the first node of the tree is already a hit, effectively O(1), without needing a hash function or a modulo into a hash store.
I'm after this algorithm to get raw performance for graphics rendering on mobile devices.

Perhaps a splay tree.
A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again.
But a hash table lookup is also expected O(1), so you shouldn't expect one to clearly outperform the other.

I would suggest using a hash table for the job. To speed up subsequent searches, you can cache the K most recently accessed distinct elements in an array. If K is small (< 20 or so), a linear search in that array will be very fast, because it can stay in the L1 cache.
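A minimal sketch of that suggestion, assuming copyable keys and values (the class name, template parameters and default K are illustrative only; note that the small cache is not invalidated when insert() overwrites a value):

    #include <array>
    #include <cstddef>
    #include <unordered_map>
    #include <utility>

    // unordered_map as the main store, fronted by a tiny array of the K most
    // recently found entries that is scanned linearly before the hash lookup.
    template <typename Key, typename Value, std::size_t K = 16>
    class RecentCache {
    public:
        void insert(const Key& key, const Value& value) { table_[key] = value; }

        // Returns a pointer to the value, or nullptr if the key is absent.
        const Value* find(const Key& key) {
            // 1) Linear scan of the small cache; for small K it stays in L1.
            for (std::size_t i = 0; i < used_; ++i)
                if (recent_[i].first == key)
                    return &recent_[i].second;
            // 2) Fall back to the hash table and remember the hit.
            auto it = table_.find(key);
            if (it == table_.end())
                return nullptr;
            remember(it->first, it->second);
            return &it->second;
        }

    private:
        void remember(const Key& key, const Value& value) {
            if (used_ < K) {
                recent_[used_++] = {key, value};
            } else {
                recent_[next_] = {key, value};      // overwrite round-robin once full
                next_ = (next_ + 1) % K;
            }
        }

        std::unordered_map<Key, Value> table_;
        std::array<std::pair<Key, Value>, K> recent_{};
        std::size_t used_ = 0;   // cache slots currently in use
        std::size_t next_ = 0;   // next slot to overwrite once the cache is full
    };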

Related

Storing filepath and size in C++

I'm processing a large number of image files (tens of millions) and I need to return the number of pixels for each file.
I have a function that uses an std::map<string, unsigned int> to keep track of files already processed. If a path is found in the map, then the value is returned, otherwise the file is processed and inserted into the map. I do not delete entries from the map.
The problem is that as the number of entries grows, the lookup time is killing performance. This portion of my application is single-threaded.
I wanted to know if unordered_map is the solution to this, or whether the fact that I'm using std::string as keys is going to affect the hashing and require too many rehashes as the number of keys increases, once again killing performance.
One other item to note is that the paths for the string are expected (but not guaranteed) to have the same prefix, for example: /common/until/here/now_different/. So all strings will likely have the same first N characters. I could potentially store these as relative to the common directory. How likely is that to help performance?
unordered_map will probably be better in this case. It will typically be implemented as a hash table, with amortized O(1) lookup time, while map is usually a binary tree with O(log n) lookups. It doesn't sound like your application would care about the order of items in the map, it's just a simple lookup table.
In both cases, removing the common prefix should be helpful, as it means less time spent needlessly iterating over that part of the strings. For unordered_map the string has to be traversed twice: once to hash it and then to compare against the keys in the table. Some hash functions also limit the amount of a string they hash, to prevent O(n) hash performance -- if the common prefix is longer than this limit, you'll end up with a worst-case hash table (everything in one bucket).
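A minimal sketch of those two changes, assuming the shared prefix is known up front; process_image() is a hypothetical stand-in for whatever decodes the file and counts its pixels:

    #include <string>
    #include <unordered_map>
    #include <utility>

    static const std::string kCommonPrefix = "/common/until/here/";  // assumed prefix

    unsigned int process_image(const std::string& path);   // hypothetical decoder

    std::unordered_map<std::string, unsigned int> pixel_cache;

    unsigned int pixels_for(const std::string& path) {
        // Store keys relative to the common directory so hashing and key
        // comparison only touch the part of the path that actually differs.
        std::string key =
            path.compare(0, kCommonPrefix.size(), kCommonPrefix) == 0
                ? path.substr(kCommonPrefix.size())
                : path;

        auto it = pixel_cache.find(key);
        if (it != pixel_cache.end())
            return it->second;

        unsigned int pixels = process_image(path);
        pixel_cache.emplace(std::move(key), pixels);
        return pixels;
    }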
I really like Galik's suggestion of using inodes if you can, but if not...
Will emphasise a point already made in comments: if you've reason to care, always implement the alternatives and measure. The more reason, the more effort it's worth expending on that....
So - another option is to use a 128-bit cryptographic-strength hash function on your file paths, then trust that statistically it's extremely unlikely to produce a collision. A rule of thumb is that if you have 2^n distinct keys, you want significantly more than a 2n-bit hash. For ~100M keys n is about 27, so 2n is about 54 bits: you could probably get away with a 64-bit hash, but that's a little too close for comfort and leaves no headroom if the number of images grows over the years. Then use a vector to back a hash table of just the hashes and file sizes, with say quadratic probing. Your caller would ideally pre-calculate the hash of an incoming file path in a different thread, passing your lookup API only the hash.
The above avoids the dynamic memory allocation, indirection, and of course memory usage when storing variable-length strings in the hash table and utilises the cache much better. Relying on hashes not colliding may make you uncomfortable, but past a point the odds of a meteor destroying the computer, or lightning frying it, will be higher than the odds of a collision in the hash space (i.e. before mapping to hash table bucket), so there's really no point fixating on that. Cryptographic hashing is relatively slow, hence the suggestion to let clients do it in other threads.
(I have worked with a proprietary distributed database based on exactly this principle for path-like keys.)
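A rough sketch of the vector-backed table under those assumptions: the caller passes the two 64-bit halves of whichever cryptographic hash it picked, an all-zero hash is treated as "empty slot", the capacity is a power of two kept well above the key count (no resizing shown), and triangular probing stands in for the quadratic probing suggested above:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Entry {
        std::uint64_t hash_lo = 0;   // low 64 bits of the path hash (0,0 = empty)
        std::uint64_t hash_hi = 0;   // high 64 bits
        std::uint64_t pixels  = 0;
    };

    class HashOnlyTable {
    public:
        // capacity must be a power of two, sized well above the expected key count.
        explicit HashOnlyTable(std::size_t capacity)
            : slots_(capacity), mask_(capacity - 1) {}

        // Returns a pointer to the stored pixel count, or nullptr if absent.
        std::uint64_t* find(std::uint64_t lo, std::uint64_t hi) {
            for (std::size_t i = 0, idx = lo & mask_; ; ++i) {
                Entry& e = slots_[idx];
                if (e.hash_lo == 0 && e.hash_hi == 0) return nullptr;      // empty slot
                if (e.hash_lo == lo && e.hash_hi == hi) return &e.pixels;  // hit
                idx = (idx + i + 1) & mask_;   // triangular probing: visits every slot
            }
        }

        void insert(std::uint64_t lo, std::uint64_t hi, std::uint64_t pixels) {
            for (std::size_t i = 0, idx = lo & mask_; ; ++i) {
                Entry& e = slots_[idx];
                if ((e.hash_lo == 0 && e.hash_hi == 0) ||
                    (e.hash_lo == lo && e.hash_hi == hi)) {
                    e = {lo, hi, pixels};
                    return;
                }
                idx = (idx + i + 1) & mask_;
            }
        }

    private:
        std::vector<Entry> slots_;
        std::size_t mask_;
    };

No file path is stored at all: each entry is 24 bytes, laid out contiguously in the vector, which is what gives the cache behaviour described above.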
Aside: beware Visual C++'s string hashing - they pick 10 characters spaced along your string to incorporate in the hash value, which would be extremely collision prone for you, especially if several of those were taken from the common prefix. The C++ Standard leaves implementations the freedom to provide whatever hashes they like, so re-measure such things if you ever need to port your system.

Linked List vs Open Addressing in Hash Table

I have a data set with 130000 elements and two different data structures: a doubly-linked list and a hash table. When inserting data set elements into the linked list, I place the node at the end of the list using a tail pointer. When inserting data set elements into the hash table, I use the open addressing method with a probe function. I face 110000 collisions for the last 10 elements in the data set.
However, the difference between the total insertion times for the two data structures is 0.0981 seconds:
Linked list = 0.028521 second
Hash Table = 0.120102 second
Is the pointer operation slow, or is the probing method very fast?
Insertion at the end of a doubly-linked list via the tail pointer is O(1), as you can read here too.
Insertion into a hash table with open addressing can also be constant, as you can read here too.
However, a correct and efficient implementation of a hash table with open addressing is quite tricky, since a lot of things can go wrong (probing, load factor, hash function, etc.). Even Wikipedia mentions that.
I face 110000 collisions (in the hash table) for the last 10 elements
This is a strong indication that something is not right in your hash table implementation.
This explains why your time measurements, if they are correct, show the doubly-linked list as faster than your hash table.
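For reference, here is a minimal open-addressing insert with linear probing and a load-factor cap, just to show the two knobs (hash quality and load factor) that most often cause collision counts like the one you see; this is an illustrative sketch, not your code:

    #include <cstddef>
    #include <functional>
    #include <vector>

    class IntSet {
    public:
        explicit IntSet(std::size_t initial = 16) : slots_(initial) {}

        void insert(int key) {
            if ((used_ + 1) * 10 > slots_.size() * 7)   // grow before load factor passes ~0.7
                grow();
            std::size_t idx = std::hash<int>{}(key) % slots_.size();
            while (slots_[idx].occupied) {
                if (slots_[idx].key == key) return;     // already present
                idx = (idx + 1) % slots_.size();        // linear probe
            }
            slots_[idx] = {key, true};
            ++used_;
        }

    private:
        struct Slot { int key = 0; bool occupied = false; };

        void grow() {
            std::vector<Slot> old;
            old.swap(slots_);
            slots_.resize(old.size() * 2);
            used_ = 0;
            for (const Slot& s : old)
                if (s.occupied) insert(s.key);          // re-hash into the larger table
        }

        std::vector<Slot> slots_;
        std::size_t used_ = 0;
    };

With the table kept below roughly 70% full and a reasonable hash, the expected number of probes per insertion stays small; 110000 collisions for 10 elements suggests the table was nearly full or the hash is degenerate.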
Is the pointer operation slow, or is the probing method very fast?
No actual code was posted, so the answer will be rather theoretical.
In terms of performance, the general answer is: cache misses are expensive. With DDR memory at 60 ns latency and a 3.2 GHz CPU, a last-level cache miss stalls the CPU for about 60 * 3.2 ≈ 200 cycles.
Double-linked List
For the doubly-linked list, you have to access the tail pointer, the tail element and the new element. If you just add elements in a loop, there is a high probability that all those accesses will hit the CPU cache.
In a real-life application, if you do something between the additions, you might have up to 3 cache misses (tail pointer, tail element, new element).
Hash Table
For a hash table with open addressing the situation is a bit different. The hash function produces a random index into the table, so the first access to the hash table is usually a cache miss. In your case the table is not that big (130K pointers), so it might fit into the L3 cache, but even an L3 cache miss stalls the CPU for about 30 cycles.
But what happens next? You just put a pointer into the table; there is no tail element or new element to update, so no further cache misses there.
If a hash table slot is occupied, you just check the next one. Such sequential access is easily predicted by the CPU prefetcher, so those accesses usually don't produce any cache misses either: the CPU prefetches the following hash table entries into the L1 cache.
So, in a real application, the hash table usually has one cache miss per insertion, and because the hash is unpredictable, it always pays that first cache miss.
The Answer
To get a practical answer about what is going on in your application, use a tool that reads the CPU performance counters, like perf on Linux or VTune on Windows. The tool will show exactly what your CPU spends its time on.
Real Application
One more theoretical disclaimer: I guess that if you fix your hash table (say, use a few elements per bucket rather than open addressing) and effectively reduce the number of collisions, the performance could be on par.
Should you use the doubly-linked list over the hash table or vice versa? It depends on your application. The hash table is good for random access, i.e. you can reach any element in O(1) time; with the doubly-linked list you have to traverse the list, so the estimate is O(n).
On the other hand, just adding elements at the end of the list is not only a cheaper operation, it is also much easier to implement: you don't have to care about collisions or hash table overflows.
So, in some cases the doubly-linked list has big advantages over the hash table, and it is up to the application which suits best.

How can we benefit from vs2010 hash_map's less?

See this if you don't know that vs2010's hash_map actually requires a total ordering, and hence requires a user-defined less.
One of the answers said binary search is possible, but I don't think so, because:
The hash function should be uniform, and it is better if the load factor is less than 1; that means, in most cases, one element per hash slot, i.e. no need for a binary search.
Obviously, it will slow down insertion because of locating the appropriate position.
How does hash_map benefit from this design, and how do we utilize this design?
thanks
The hash function should be uniform, and it is better if the load factor is less than 1; that means, in most cases, one element per hash slot, i.e. no need for a binary search.
There won't be at most one element per hash slot. Some buckets will have to keep more than one key. Unless the input is only from a pre-determined restricted set of values (i.e. perfect hashing), the hash function will have to deal with more inputs than the outputs that it can produce. There will be collisions; this is unavoidable in an implementation as generic as this one. However, good hash functions should produce well-distributed hashes and that makes the number of elements per hash slot stay low.
Obviously, it will slow down insertion because of locating the appropriate position.
Assuming a good hash function and non-degenerate input (input designed so that many elements result in the same hash), there will always be only a few keys per bucket. Inserting into such a binary search tree won't be that big of a cost, and that little cost may bring benefits elsewhere (searches may be faster than on an implementation with a linked list). And in case of degenerate input, the hash map will degenerate into a binary search tree, which is much better than a simple linked list.
Your question is largely irrelevant in practice, because C++ now supplies unordered_map etc. which use an Equal predicate rather than a less-than comparator.
However, consider a hash_map<string, ...>. Clearly, the value space of string is larger than that of size_t, so for any hash function there will be values that have the same hash and so are placed in the same bucket. In the pathological situation where all the items in the hash table are placed in the same bucket, exploiting ordering among keys will result in improved speed of access, insertion and removal.
Note that search on an ordered list (or binary tree) is O(log n) as opposed to O(n).
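A small per-bucket sketch of that point (not a full hash map): with the keys of a bucket kept sorted by the user-supplied less, lookup inside the bucket is a binary search rather than a linear walk:

    #include <algorithm>
    #include <string>
    #include <vector>

    struct Bucket {
        std::vector<std::string> keys;   // kept sorted

        bool contains(const std::string& k) const {
            auto it = std::lower_bound(keys.begin(), keys.end(), k);   // O(log n)
            return it != keys.end() && *it == k;
        }

        void insert(const std::string& k) {
            auto it = std::lower_bound(keys.begin(), keys.end(), k);
            if (it == keys.end() || *it != k)
                keys.insert(it, k);      // keeps the bucket ordered
        }
    };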

hash table vs. linear list

Will there be an instance where searching for a keyword in a linear list is quicker than in a hash table?
I'd basically like to know if there is a fringe case where the search for a keyword in a linear list will be faster than a hash table search.
Thanks!
Searching in a hash table is not always constant-time in reality. If the hash function is a poor match for the data, you can have a lot of collisions, and in the extreme case that every data item has the same hash value, the result looks much like linear search. Depending on the details, this effective linear search might work slower than a linear search over the data in an array. (E.g. open addressing with a quadratic probing sequence, which makes poor use of the processor caches, might well be slower than a linear search over an array.)
Here's an example of a real-world case where all keys ended up in the same bucket: Java bug 4669519.
Yes, in the cases of a very small number of elements. Think about how a hash works. It has to compute the hash to find a bucket, then search through the list in that bucket. Plus it could be a complex multi level hash, etc. So you break even around the point where searching through a linear list is more work than the hash lookup algorithm.
Another instance would be if the element you are looking for is always at the beginning or near the beginning of a list. Depending on what you are doing it could happen.
There are others, but that should help you think about it.
Still, don't get confused. The hash is usually what you want.
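To make the break-even point concrete, here are the two lookups side by side; where the cut-over lies depends entirely on your data and has to be measured:

    #include <algorithm>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Handful of keywords: no hash computation, one contiguous scan.
    bool contains_linear(const std::vector<std::string>& words, const std::string& w) {
        return std::find(words.begin(), words.end(), w) != words.end();
    }

    // Larger collections: pay for the hash once, then (expected) O(1).
    bool contains_hashed(const std::unordered_set<std::string>& words, const std::string& w) {
        return words.find(w) != words.end();
    }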

Fast container for setting bits in a sparse domain, and iterating (C++)?

I need a fast container with only two operations: inserting keys from a very sparse domain (all 32-bit integers, with approx. 100 set at a given time), and iterating over the inserted keys. It should deal with a lot of insertions that hit the same entries (like 500k insertions, but only 100 different keys).
Currently, I'm using a std::set (only insert and the iterating interface), which is decent, but still not fast enough. std::unordered_set was twice as slow, same for the Google Hash Maps. I wonder what data structure is optimized for this case?
Depending on the distribution of the input, you might be able to get some improvement without changing the structure.
If you tend to get a lot of runs of a single value, then you can probably speed up insertions by keeping a record of the last value you inserted, and don't bother doing the insertion if it matches. It costs an extra comparison per input, but saves a lookup for each element in a run beyond the first. So it could improve things no matter what data structure you're using, depending on the frequency of repeats and the relative cost of comparison vs insertion.
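A sketch of that trick around a std::set (the container the question already uses); only the duplicate-of-the-previous-key check is new:

    #include <cstdint>
    #include <set>

    struct RunAwareSet {
        std::set<std::uint32_t> values;
        std::uint32_t last = 0;
        bool have_last = false;

        void insert(std::uint32_t v) {
            if (have_last && v == last)
                return;              // repeat of the previous key: skip the set lookup
            values.insert(v);
            last = v;
            have_last = true;
        }
    };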
If you don't get runs, but you tend to find that values aren't evenly distributed, then a splay tree makes accessing the most commonly-used elements cheaper. It works by creating a deliberately-unbalanced tree with the frequent elements near the top, like a Huffman code.
I'm not sure I understand "a lot of insertions which hit the same entries". Do you mean that there are only 100 values which are ever members, but 500k mostly-duplicate operations which insert one of those 100 values?
If so, then I'd guess that the fastest container would be to generate a collision-free hash over those 100 values, then maintain an array (or vector) of flags (int or bit, according to what works out fastest on your architecture).
I leave generating the hash as an exercise for the reader, since it's something that I'm aware exists as a technique, but I've never looked into it myself. The point is to get a fast hash over as small a range as possible, such that for each n, m in your 100 values, hash(n) != hash(m).
So insertion looks like array[hash(value)] = 1;, deletion looks like array[hash(value)] = 0; (although you don't need that), and to enumerate you run over the array, and for each set value at index n, inverse_hash(n) is in your collection. For a small range you can easily maintain a lookup table to perform the inverse hash, or instead of scanning the whole array looking for set flags, you can run over the 100 potentially-in values checking each in turn.
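A sketch of that scheme; perfect_hash() is a placeholder (a real collision-free hash would be generated offline from the actual ~100 values, and key % N is emphatically not collision-free in general), and N is an assumed upper bound on the number of distinct keys:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t N = 128;                     // assumed bound on distinct values

    // Placeholder only - substitute the generated perfect hash here.
    std::size_t perfect_hash(std::uint32_t key) { return key % N; }

    std::array<std::uint8_t, N>  flags{};              // 1 = present
    std::array<std::uint32_t, N> inverse{};            // slot -> original key (inverse_hash)

    void insert(std::uint32_t key) {
        std::size_t h = perfect_hash(key);
        flags[h] = 1;
        inverse[h] = key;
    }

    std::vector<std::uint32_t> enumerate() {
        std::vector<std::uint32_t> out;
        for (std::size_t i = 0; i < N; ++i)
            if (flags[i])
                out.push_back(inverse[i]);
        return out;
    }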
Sorry if I've misunderstood the situation and this is useless to you. And to be honest, it's not very much faster than a regular hashtable, since realistically for 100 values you can easily size the table such that there will be few or no collisions, without using so much memory as to blow your caches.
For an in-use set expected to be this small, a non-bucketed hash table might be OK. If you can live with an occasional expansion operation, grow it in powers of 2 when it gets more than 70% full. Cuckoo hashing has been discussed on Stack Overflow before and might also be a good approach for a set this small. If you really need to optimise for speed, you can implement the hashing function and lookup in assembler - on linear data structures this is very simple, so the coding and maintenance effort for an assembler implementation shouldn't be unduly high.
You might want to consider implementing a HashTree using a base 10 hash function at each level instead of a binary hash function. You could either make it non-bucketed, in which case your performance would be deterministic (log10) or adjust your bucket size based on your expected distribution so that you only have a couple of keys/bucket.
A randomized data structure might be perfect for your job. Take a look at the skip list - though I don't know of any decent C++ implementation of it. I intended to submit one to Boost but never got around to it.
Maybe a set with a b-tree (instead of binary tree) as internal data structure. I found this article on codeproject which implements this.
Note that while inserting into a hash table is fast, iterating over it isn't particularly fast, since you need to iterate over the entire array.
Which operation is slow for you? Do you do more insertions or more iteration?
How much memory do you have? Flagging every 32-bit value takes "only" 2^32 bits / 8 = 512 MB, not much for a high-end server. That would make your insertions O(1). It could make iteration slow, although skipping all words that are zero would optimize away most of the iteration. If your 100 numbers lie in a relatively small range, you can optimize even further by keeping the minimum and maximum around.
I know this is just brute force, but sometimes brute force is good enough.
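A sketch of that brute force, with the zero-word skipping and the min/max range trimming mentioned above (the bit-scan intrinsic is GCC/Clang-specific):

    #include <cstdint>
    #include <vector>

    struct BitSet32 {
        std::vector<std::uint64_t> words = std::vector<std::uint64_t>(1ull << 26); // 2^32 bits = 512 MB
        std::uint32_t lo = UINT32_MAX, hi = 0;          // range that can contain set bits

        void insert(std::uint32_t key) {                // O(1)
            words[key >> 6] |= std::uint64_t(1) << (key & 63);
            if (key < lo) lo = key;
            if (key > hi) hi = key;
        }

        template <typename F>
        void for_each(F f) const {
            if (lo > hi) return;                        // nothing inserted yet
            for (std::uint32_t w = lo >> 6; w <= (hi >> 6); ++w) {
                std::uint64_t bits = words[w];          // all-zero words are skipped cheaply
                while (bits) {
                    unsigned b = __builtin_ctzll(bits); // GCC/Clang: index of lowest set bit
                    f((w << 6) | b);
                    bits &= bits - 1;                   // clear that bit
                }
            }
        }
    };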
Since no one has explicitly mentioned it, have you thought about memory locality? A really great data structure with an algorithm for insertion that causes a page fault will do you no good. In fact a data structure with an insert that merely causes a cache miss would likely be really bad for perf.
Have you made sure that a naive unordered set of elements packed in a fixed array, with a simple swap to front when an insert collides with an existing element, is too slow? It's a simple experiment that might show you have memory-locality issues rather than algorithmic issues.
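That experiment is only a few lines; a sketch, with the array size as an assumed upper bound on the number of distinct keys:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    struct SwapToFrontSet {
        static const std::size_t kMax = 128;    // assumed bound on distinct keys
        std::uint32_t keys[kMax];
        std::size_t count = 0;

        void insert(std::uint32_t key) {
            for (std::size_t i = 0; i < count; ++i) {
                if (keys[i] == key) {
                    if (i > 0) std::swap(keys[i], keys[i - 1]);  // hot keys drift to the front
                    return;
                }
            }
            if (count < kMax) keys[count++] = key;
        }

        // Iteration is a walk over one contiguous block.
        const std::uint32_t* begin() const { return keys; }
        const std::uint32_t* end()   const { return keys + count; }
    };

Everything sits in one contiguous array, so if this is still too slow the problem is probably not memory locality.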