C++ hashing: Open addressing and Chaining

For Chaining:
Can someone please explain this concept to me and provide a theory example and a simple code one?
I get the idea of "Each table location points to a linked list (chain) of items that hash to this location", but I can't seem to illustrate what's actually going on.
Suppose we had the hash function h(x) = x/10 mod 5. Now, to hash 12540, 51288, 90100, 41233, 54991, 45329, 14236, what would that look like?
And for open addressing (linear probing, quadratic probing, and probing every R-th location), can someone explain that to me as well? I tried Googling around, but I only seem to get more confused.

Chaining is probably the most obvious form of hashing. The hash-table is actually an array of linked-lists that are initially empty. Items are inserted by adding a new node to the linked-list at the item's calculated table index. If a collision occurs then a new node is linked to the previous tail node of the linked-list. (Actually, an implementation may sort the items in the list but let's keep it simple). One advantage of this mode is that the hash-table can never become 'full', a disadvantage is that you jump around memory a lot and your CPU cache will hate you.
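As an illustration (my sketch, not the answerer's code), here is a minimal chained table using the question's hash function h(x) = x/10 mod 5. For brevity it prepends on collision rather than appending to the tail:

#include <cstddef>
#include <forward_list>
#include <vector>

struct ChainedTable {
    std::vector<std::forward_list<long>> buckets;   // one chain per table location

    ChainedTable() : buckets(5) {}                  // 5 locations, since we take mod 5

    static std::size_t h(long x) { return (x / 10) % 5; }

    void insert(long x) { buckets[h(x)].push_front(x); }  // collision = another node

    bool contains(long x) const {
        for (long v : buckets[h(x)])                // walk the chain at x's location
            if (v == x) return true;
        return false;
    }
};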
Open addressing tries to take advantage of the fact that the hash-table is likely to be sparsely populated (large gaps between entries). The hash-table is an array of items. If a collision occurs, instead of chaining the item onto the occupant of that location, the algorithm searches for the next empty slot in the hash-table. This means, however, that you cannot rely on the hash code alone to see whether an item is present; you must also compare the contents whenever the hash code matches.
The 'probing' is the strategy the algorithm follows when trying to find the next free slot.
One issue is that the table can become full, i.e. no more empty slots. In this case the table will need to be resized and the hash function changed to take into account the new size. All existing items in the table must be reinserted too as their hash codes will no longer have the same values once the hash function is changed. This may take a while.
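To make the probing concrete, here is a minimal linear-probing sketch (an illustration of the strategy, not code from the answer); keys are assumed non-negative, and an empty std::optional marks a free slot:

#include <cstddef>
#include <optional>
#include <vector>

struct LinearProbeTable {
    std::vector<std::optional<int>> slots;          // empty optional = free slot
    explicit LinearProbeTable(std::size_t n) : slots(n) {}

    bool insert(int key) {
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        for (std::size_t probes = 0; probes < slots.size(); ++probes) {
            if (!slots[i] || *slots[i] == key) { slots[i] = key; return true; }
            i = (i + 1) % slots.size();             // linear probe: step to next slot
        }
        return false;                               // table full: time to resize and rehash
    }

    bool contains(int key) const {
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        for (std::size_t probes = 0; probes < slots.size(); ++probes) {
            if (!slots[i]) return false;            // an empty slot ends the probe chain
            if (*slots[i] == key) return true;      // compare contents, not just the hash
            i = (i + 1) % slots.size();
        }
        return false;
    }
};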
Here's a Java animation of a hash table.

Because you take the result mod 5, your table will have 5 locations.
Location 0 holds 90100, because (90100/10) mod 5 = 9010 mod 5 = 0. Applying the same rule to the rest:
location 0: 90100
location 1: (empty)
location 2: 45329
location 3: 51288->41233->14236
location 4: 12540->54991
You can check out more info on Wikipedia.

In open addressing every element is stored in the table itself, using one of the probing techniques, so the load factor must be less than or equal to one.
In the case of chaining, the hash table only stores the head pointers of linked lists, so the load factor can be greater than one.

Related

Least Recently Used (LRU) Cache

I know that I could use various container classes in the STL, but that's overkill and too expensive for this purpose.
We have over 1M users online, and per user we need to maintain 8 unrelated 32-bit data items. The goal is to
find whether an item exists in the list,
and if not, insert it, removing the oldest entry if the list is full.
The brute-force approach would be to maintain a last-write pointer and iterate (since there are only 8 items), but I am looking for input to better analyse and implement this.
I look forward to some interesting suggestions in terms of design patterns and algorithms.
Don Knuth gives several interesting and very efficient approximations in The Art of Computer Programming.
Self-organizing list I: when you find an entry, move it to the head of the list; delete from the end.
Self-organizing list II: when you find an entry, move it up one spot; delete from the end.
[Both the above in Vol. 3 §6.1(A).]
Another scheme maintains the list circularly, with 1 extra bit per entry; the bit is set when you find that entry and cleared when you skip past it to find something else. You always start searching at the last place you stopped, and if you don't find the entry, you replace the one with the next clear bit, i.e. one that hasn't been used since an entire trip around the list.
[Vol. 1 §2.5(G).]
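For the eight-items-per-user case above, the circular one-bit scheme comes out to very little code. Here is a sketch under the classic "clock" (second-chance) interpretation; the names and the find_or_insert interface are mine, and it assumes 0 is not a stored value, so the zero-initialised slots act as empty:

#include <array>
#include <cstdint>

struct ClockList {
    std::array<std::uint32_t, 8> items{};  // the 8 unrelated 32-bit data items
    std::array<bool, 8> used{};            // the 1 extra bit per entry
    int hand = 0;                          // the last place we stopped

    bool find_or_insert(std::uint32_t x) {
        for (int i = 0, pos = hand; i < 8; ++i, pos = (pos + 1) % 8) {
            if (items[pos] == x) {         // found: set its bit
                used[pos] = true;
                hand = pos;
                return true;
            }
        }
        while (used[hand]) {               // sweep past recently used entries,
            used[hand] = false;            // clearing their bits as we go
            hand = (hand + 1) % 8;
        }
        items[hand] = x;                   // replace the entry with a clear bit,
        used[hand] = true;                 // i.e. unused since the last full trip
        return false;
    }
};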
Here you want to use a combination of a hash table and a doubly linked list.
Each item is accessible via the hash table, which holds the key you need plus a pointer to the item's node in the list.
Algorithm:
Given a new item x, do:
1. Add x to the head of the list, saving a pointer to its node as ptr.
2. Add x to the hash table where the data is stored, together with ptr.
3. If the list is bigger than allowed, take the last element (from the tail of the list) and remove it. Use the key of this element to remove it from the hash table as well.
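A minimal sketch of this algorithm, assuming long keys and a placeholder Value type: std::list holds the items (most recently used at the head), and the hash table maps each key to its list iterator, so all three steps are O(1):

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

using Value = std::string;                               // placeholder payload

class LruCache {
    std::size_t capacity_;
    std::list<std::pair<long, Value>> items_;            // head = most recently used
    std::unordered_map<long, std::list<std::pair<long, Value>>::iterator> index_;

public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    void put(long key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) items_.erase(it->second);  // drop any old position
        items_.emplace_front(key, std::move(value));       // step 1: x at the head
        index_[key] = items_.begin();                      // step 2: save ptr in hash
        if (items_.size() > capacity_) {                   // step 3: evict the tail
            index_.erase(items_.back().first);             // its key removes it from the hash
            items_.pop_back();
        }
    }

    const Value* get(long key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second); // move node to head, O(1)
        return &it->second->second;
    }
};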
If you want a C implementation of LRU cache try this link
The idea is that we use two data structures to implement an LRU Cache.
A queue, implemented using a doubly linked list. The maximum size of the queue equals the total number of frames available (the cache size). The most recently used pages are near the front end and the least recently used pages near the rear end.
A hash with the page number as key and the address of the corresponding queue node as value.
When a page is referenced, it may already be in memory. If it is, we detach its node from the list and bring it to the front of the queue.
If the required page is not in memory, we bring it in. In simple words, we add a new node to the front of the queue and record the corresponding node's address in the hash. If the queue is full, i.e. all the frames are full, we remove a node from the rear of the queue and add the new node to the front of the queue.
I personally would either go with the self-organising lists as proposed by EJP or, as we only have eight elements, simply store them sequentially together with a timestamp.
When accessing an element, just update its timestamp; when replacing, replace the one with the oldest timestamp (one linear search). This is less efficient on replacements, but more efficient on access (no need to move any elements around). And it might be the easiest to implement...
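A sketch of this timestamp variant, with a single incrementing counter standing in for the clock (again assuming 0 is not a valid item, so the zero-initialised slots count as empty):

#include <array>
#include <cstdint>

struct TimestampedSet {
    std::array<std::uint32_t, 8> items{};
    std::array<std::uint64_t, 8> stamp{};  // last access "time" per slot
    std::uint64_t clock = 0;

    bool find_or_insert(std::uint32_t x) {
        int oldest = 0;
        for (int i = 0; i < 8; ++i) {
            if (items[i] == x) {           // access: just update the timestamp
                stamp[i] = ++clock;
                return true;
            }
            if (stamp[i] < stamp[oldest]) oldest = i;  // track replacement victim
        }
        items[oldest] = x;                 // replace the oldest (one linear search)
        stamp[oldest] = ++clock;
        return false;
    }
};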
A modification of the self-organising lists, if they are based on some array data structure: sure, on update you have to shift several elements (variant I) or at least swap two of them (variant II) - but if you organise the data as a ring buffer, then on replacement we just replace the last (oldest) element with the new one and move the buffer's pointer to this new element (in the diagrams below, ^ marks the buffer pointer, i.e. the most recently used element):
a, b, c, d
      ^
Accessing a:
d, b, a, c
      ^
New element e:
d, e, a, c
   ^
Special case: accessing the oldest element (d in this case) - then we can simply move the pointer, too:
d, e, a, c
^
Still: with only 8 elements, it might not be worth the effort to implement all this...
I agree with Drop's and Geza's comments. The straightforward implementation will take one cache-line read and cause one cache-line write.
The only performance question left is the lookup and update of that 32-bit value within 256 bits. Assuming modern x86, the lookup itself can be two instructions: _mm256_cmp_epi32_mask finds all equal values in parallel, and _mm256_lzcnt_epi32 counts leading zeroes = number of older non-matching items * 32. But even with older SIMD operations, the cache-line read/write operations will dominate the execution time. And that in turn is dominated by finding the right user, which in turn is dominated by the network I/O involved.
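For instance, here is a hedged sketch of that lookup using plain AVX2 rather than the AVX-512 intrinsics named above (__builtin_ctz is GCC/Clang-specific):

#include <cstdint>
#include <immintrin.h>

// Returns the index of 'needle' among 8 contiguous 32-bit items, or -1.
int find8(const std::uint32_t* items, std::uint32_t needle) {
    __m256i v  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(items));
    __m256i eq = _mm256_cmpeq_epi32(v, _mm256_set1_epi32(static_cast<int>(needle)));
    int mask   = _mm256_movemask_ps(_mm256_castsi256_ps(eq));  // 1 bit per matching lane
    return mask ? __builtin_ctz(mask) : -1;
}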
You could use a Cuckoo filter, which is a probabilistic, hash-based data structure that supports fast set-membership testing.
Time complexity of a Cuckoo filter:
Lookup: O(1)
Deletion: O(1)
Insertion: O(1)
For reference, here is how the cuckoo filter works.
Parameters of the filter:
1. Two hash functions: h1 and h2
2. An array B with n buckets. The i-th bucket will be called B[i].
Input: L, a list of elements to be inserted into the cuckoo filter.
Algorithm:
while L is not empty:
    Let x be the 1st item in the list L. Remove x from the list.
    if (B[h1(x)] == empty)
        place x in B[h1(x)];
    else if (B[h2(x)] == empty)
        place x in B[h2(x)];
    else
        Let y be the element in B[h2(x)]
        Prepend y to L
        place x in B[h2(x)]
For LRU you can add time stamping to your hash entries by keeping just a local counter variable. This approach works well even for very large data sets.
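A sketch of that insertion loop in C++ (h1 and h2 are placeholder hash functions, and an empty std::optional marks a free bucket). Note that the loop as stated can cycle forever on an unlucky input; real implementations cap the number of displacements:

#include <cstddef>
#include <deque>
#include <functional>
#include <optional>
#include <vector>

void cuckoo_insert(std::vector<std::optional<int>>& B,
                   std::deque<int> L,
                   const std::function<std::size_t(int)>& h1,
                   const std::function<std::size_t(int)>& h2) {
    while (!L.empty()) {
        int x = L.front();                     // take the 1st item off L
        L.pop_front();
        std::size_t i1 = h1(x) % B.size(), i2 = h2(x) % B.size();
        if (!B[i1]) {
            B[i1] = x;                         // first-choice bucket is free
        } else if (!B[i2]) {
            B[i2] = x;                         // second-choice bucket is free
        } else {
            int y = *B[i2];                    // evict the occupant of B[h2(x)]
            L.push_front(y);                   // y will be reinserted later
            B[i2] = x;
        }
    }
}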

What is the most efficient data structure for implementing Prim's algorithm?

I am designing a graph in C++ using a hash table for its elements. The hash table uses open addressing, and the graph has no more than 50,000 edges. I also implemented Prim's algorithm to find the minimum spanning tree of the graph. It creates storage for the following data:
A table named Q, holding all the nodes at the beginning. In every loop a node is visited, and at the end of the loop it is deleted from Q.
A table named Key, with one entry per node. The key is changed when necessary (at least once per loop).
A table named Parent, with one entry per node. In each loop, a new element is inserted into this table.
A table named A. The program stores here the final edges of the minimum spanning tree; it is the table that is returned.
What would be the most efficient data structure to use for these tables, assuming the graph has 50,000 edges?
Can I use arrays?
I fear that the elements of each array would be far too many. I don't even consider using linked lists, of course, because accessing an element would take too much time. Could I use hash tables?
But again, the elements are far too many. My algorithm works well for graphs consisting of a few nodes (10 or 20), but I am sceptical about the situation where the graphs consist of 40,000 nodes. Any suggestion is much appreciated.
(Since comments were getting a bit long:) The only part of the problem that seems to get ugly at very large sizes is that every node not yet selected has a cost, and you need to find the one with the lowest cost at each step, while executing each step reduces the cost of a few effectively random nodes.
A priority queue is perfect when you want to keep track of lowest cost. It is efficient for removing the lowest cost node (which you do at each step). It is efficient for adding a few newly reachable nodes, as you might on any step. But in the basic design, it does not handle reducing the cost of a few nodes that were already reachable at high cost.
So (having frequent need for a more functional priority queue), I typically create a heap of pointers to objects and in each object have an index of its heap position. The heap methods all do a callback into the object to inform it whenever its index changes. The heap also has some external calls into methods that might normally be internal only, such as the one that is perfect for efficiently fixing the heap when an existing element has its cost reduced.
I just reviewed the documentation for the std one
http://en.cppreference.com/w/cpp/container/priority_queue
to see if the features I always want to add were there in some form I hadn't noticed before (or had been added in some recent C++ version). So far as I can tell, NO. Most real world uses of priority queue (certainly all of mine) need minor extra features that I have no clue how to tack onto the standard version. So I have needed to rewrite it from scratch including the extra features. But that isn't actually hard.
The method I use has been reinvented by many people (I was doing this in C in the 70's, and wasn't first). A quick google search found one of many places my approach is described in more detail than I have described it.
http://users.encs.concordia.ca/~chvatal/notes/pq.html#heap
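A sketch of that design (the names are mine): a binary min-heap of node ids plus a position array that the heap keeps up to date on every swap - the "callback" that makes decrease_key possible:

#include <cstddef>
#include <utility>
#include <vector>

class MinHeap {
    std::vector<int> heap;            // node ids; heap[0] has the smallest cost
    std::vector<int> pos;             // pos[node] = index of node in 'heap', or -1
    std::vector<double> cost;         // cost[node] = current key

    void swap_nodes(std::size_t a, std::size_t b) {
        std::swap(heap[a], heap[b]);
        pos[heap[a]] = static_cast<int>(a);   // keep each node's index up to date
        pos[heap[b]] = static_cast<int>(b);
    }
    void sift_up(std::size_t i) {
        while (i > 0 && cost[heap[i]] < cost[heap[(i - 1) / 2]]) {
            swap_nodes(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t best = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap.size() && cost[heap[l]] < cost[heap[best]]) best = l;
            if (r < heap.size() && cost[heap[r]] < cost[heap[best]]) best = r;
            if (best == i) return;
            swap_nodes(i, best);
            i = best;
        }
    }

public:
    MinHeap(int n, double inf) : pos(n, -1), cost(n, inf) {}

    void push(int node, double c) {
        cost[node] = c;
        pos[node] = static_cast<int>(heap.size());
        heap.push_back(node);
        sift_up(heap.size() - 1);
    }
    int pop_min() {                   // removes and returns the cheapest node
        int top = heap[0];
        swap_nodes(0, heap.size() - 1);
        heap.pop_back();
        pos[top] = -1;
        if (!heap.empty()) sift_down(0);
        return top;
    }
    // The operation std::priority_queue lacks: lower a node's cost in place.
    void decrease_key(int node, double c) {
        cost[node] = c;
        sift_up(static_cast<std::size_t>(pos[node]));
    }
    bool contains(int node) const { return pos[node] >= 0; }
    bool empty() const { return heap.empty(); }
};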

How to reduce the pointer/address width in a hash table slot?

Assume that we have a hash table using chaining (linked lists) to resolve hash collisions. Each hash-table slot has a pointer field pointing to the first node of its linked list. This pointer occupies 4 or 8 bytes depending on whether the OS is x86 or x64.
For a large hash table with millions of slots, the pointers consume a huge amount of memory. For a hardware implementation we can customise the pointer/address width on an FPGA to save memory. My question is: for a software implementation, is there any way to reduce the pointer size to, say, 3 bytes?
You can reduce the pointer size overhead for your overflow lists to 0 bytes if you do not implement the hash table that way in the first place.
There is actually no drawback in implementing hash tables such that, if one slot of the table already holds a value, you apply "some strategy" to find another, empty slot. If you did so while writing, your read function needs to execute the analogous steps to find the right spot to read from.
This approach actually performs no worse than external overflow lists, because with overflow lists you end up doing a linear search within the list, and with an in-place hash table you perform - depending on the chosen strategy - something like linear probing.
One way to do this is to have a set of hash functions instead of one (typically two; it is then called double hashing). If you write and the slot of the table is already taken, you use the next hash function in your set and try again, until your hash functions are exhausted or you have found an empty spot. With N hash functions, you do at most N steps.
For reading, you try to find the entry by applying the set of hash functions in the same sequence as you did for the write, probing each candidate slot for the entry you need, just the same way as you would probe your overflow lists.
Since hash tables only "make sense" if they have a low fill rate, this strategy actually saves much of the memory that an overflow-list implementation would need.
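A sketch of the two-hash-function case described above (my illustration; the slot value 0 stands for "empty" purely for brevity):

#include <cstddef>
#include <initializer_list>
#include <vector>

struct OpenTable {
    std::vector<unsigned> slots;                 // no per-slot pointers at all
    explicit OpenTable(std::size_t n) : slots(n, 0) {}

    std::size_t h1(unsigned k) const { return k % slots.size(); }
    std::size_t h2(unsigned k) const { return 1 + k % (slots.size() - 1); }

    bool insert(unsigned k) {
        for (std::size_t idx : { h1(k), h2(k) }) {   // the "set" of hashes, N = 2
            if (slots[idx] == 0 || slots[idx] == k) { slots[idx] = k; return true; }
        }
        return false;            // both candidates taken: grow the table or add hashes
    }
    bool contains(unsigned k) const {
        for (std::size_t idx : { h1(k), h2(k) }) {   // same sequence as the write
            if (slots[idx] == k) return true;
            if (slots[idx] == 0) return false;       // an empty spot ends the search
        }
        return false;
    }
};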

Structure for top hit objects

I want to keep a hit counter for objects as they are received, recording each object's frequency, and I want to be able to retrieve the most frequent (top-hit) objects.
unordered_map fits the first part, with the object as the key and the hit count as the value:
unordered_map<object,int>
It enables searching fast for an object and incrementing its hits. But what about sorting? priority_queue gives you the top-hit object, but how do you increment an object's hits inside it?
I would suggest you have a look at a splay tree, which keeps objects such that the most recently and most frequently accessed objects are closer to the top. This relies on several heuristics and thus will give you an approximation of the perfect solution.
For an exact solution it is better to implement your own binary heap with an increment-priority operation. In theory the same structure backs priority_queue, but priority_queue has no change-priority operation, even though one can be added without affecting the complexity of the data structure's operations.
I managed to solve it by keeping a list of objects sorted by their hit number, maintained as I insert the objects, so there is always a list of the N top hits. There are 3,000,000 objects and I want the top 20.
Here are the structures I used:
key_hit to keep track of hits (by key, a string, I mean the object):
unordered_map<string, int> key_hit;
two arrays, hits[N] and keys[N], which contain the top hits and their corresponding keys (objects):
idx, hits, keys
0, 212, x
1, 200, y
...
N, 12, z
and another map key_idx to keep the key and its corresponding index:
unordered_map<string,int> key_idx;
Algorithm (without details):
key is the input.
Search for the key in key_hit, find its hit count and increment it (this is fast enough).
If hit < hits[N], ignore it.
Else, idx = key_idx[key] (if the key is not found there, add it to the structures and delete the existing smallest entry; it would take too long to write out all the details).
H = hits[idx]++
Check whether it is now greater than the entry above it, i.e. whether hits[idx-1] < H. If yes, swap idx and idx-1 in key_idx, hits and keys.
I tried to make it fast, but I don't know how fast it actually is.
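Fleshing that out, here is a compact sketch of the same idea (my reconstruction: for brevity it scans the 20-entry top list linearly instead of keeping the key_idx map, and it fills in the omitted insertion/eviction details with one plausible choice):

#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class TopHits {
    static const std::size_t N = 20;
    std::unordered_map<std::string, int> key_hit;  // every object -> hit count
    std::vector<std::pair<int, std::string>> top;  // top N, sorted descending by hits

public:
    void hit(const std::string& key) {
        int h = ++key_hit[key];
        auto it = std::find_if(top.begin(), top.end(),
                               [&](const std::pair<int, std::string>& p)
                               { return p.second == key; });
        if (it == top.end()) {
            if (top.size() < N) top.emplace_back(h, key);
            else if (h > top.back().first) top.back() = {h, key};  // evict the smallest
            else return;                                           // below the cut: ignore
            it = top.end() - 1;
        } else {
            it->first = h;
        }
        while (it != top.begin() && it->first > (it - 1)->first) { // bubble upward
            std::iter_swap(it, it - 1);
            --it;
        }
    }
    const std::vector<std::pair<int, std::string>>& topN() const { return top; }
};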

C++ design: How to cache most recent used

We have a C++ application for which we are trying to improve performance. We identified that data retrieval takes a lot of time and want to cache data. We can't store all data in memory as it is huge. We want to store up to 1000 items in memory. These items can be indexed by a long key. However, when the cache size goes over 1000, we want to remove the item that was not accessed for the longest time, as we assume some sort of "locality of reference", that is, we assume that items in the cache that were recently accessed will probably be accessed again.
Can you suggest a way to implement it?
My initial implementation was to have a map<long, CacheEntry> to store the cache, and add an accessStamp member to CacheEntry which will be set to an increasing counter whenever an entry is created or accessed. When the cache is full and a new entry is needed, the code will scan the entire cache map and find the entry with the lowest accessStamp, and remove it.
The problem with this is that once the cache is full, every insertion requires a full scan of the cache.
Another idea was to hold a list of CacheEntries in addition to the cache map, and on each access move the accessed entry to the top of the list, but the problem was how to quickly find that entry in the list.
Can you suggest a better approach?
Thanks, splintor
Have your map<long,CacheEntry> but instead of having an access timestamp in CacheEntry, put in two links to other CacheEntry objects to make the entries form a doubly-linked list. Whenever an entry is accessed, move it to the head of the list (this is a constant-time operation). This way you will both find the cache entry easily, since it's accessed from the map, and are able to remove the least-recently used entry, since it's at the end of the list (my preference is to make doubly-linked lists circular, so a pointer to the head suffices to get fast access to the tail as well). Also remember to put in the key that you used in the map into the CacheEntry so that you can delete the entry from the map when it gets evicted from the cache.
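A minimal sketch of that layout (Value is a placeholder payload; insert assumes the key is not already present, and error handling is omitted):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

using Value = std::string;                       // placeholder payload type

struct CacheEntry {
    long key = 0;                // copy of the map key, needed when evicting
    Value value;
    CacheEntry* prev = nullptr;  // intrusive links forming a circular list
    CacheEntry* next = nullptr;
};

class IntrusiveLruCache {
    std::unordered_map<long, CacheEntry> map_;  // pointers to entries stay valid
    CacheEntry* head_ = nullptr;                // most recent; head_->prev is the LRU
    std::size_t capacity_;

    void unlink(CacheEntry* e) {
        e->prev->next = e->next;
        e->next->prev = e->prev;
    }
    void push_front(CacheEntry* e) {
        if (head_ == nullptr) {
            e->prev = e->next = e;              // single-element circular list
        } else {
            e->next = head_;
            e->prev = head_->prev;              // the old tail
            head_->prev->next = e;
            head_->prev = e;
        }
        head_ = e;
    }

public:
    explicit IntrusiveLruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached value or nullptr; a hit moves the entry to the head.
    Value* find(long key) {
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        CacheEntry* e = &it->second;
        if (e != head_) { unlink(e); push_front(e); }  // constant-time relink
        return &e->value;
    }

    void insert(long key, Value v) {            // assumes key is not present yet
        if (map_.size() >= capacity_) {
            CacheEntry* lru = head_->prev;      // tail = least recently used
            long evicted = lru->key;            // the key stored inside the entry
            if (lru == head_) head_ = nullptr; else unlink(lru);
            map_.erase(evicted);
        }
        CacheEntry& e = map_[key];
        e.key = key;
        e.value = std::move(v);
        push_front(&e);
    }
};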
Scanning a map of 1000 elements will take very little time, and the scan will only be performed when the item is not in the cache which, if your locality of reference ideas are correct, should be a small proportion of the time. Of course, if your ideas are wrong, the cache is probably a waste of time anyway.
An alternative implementation that might make the 'aging' of the elements easier, at the cost of lower search performance, would be to keep your CacheEntry elements in a std::list (or use a std::list<std::pair<long, CacheEntry>>). The newest element gets added at the front of the list, so entries 'migrate' towards the end of the list as they age. When you check whether an element is already present in the cache, you scan the list (which is admittedly an O(n) operation, as opposed to an O(log n) operation in a map). If you find it, you remove it from its current location and re-insert it at the start of the list. If the list length exceeds 1000 elements, you remove the required number of elements from the end of the list to trim it back below 1000.
Update: I got it now...
This should be reasonably fast. Warning, some pseudo-code ahead.
#include <algorithm>  // for std::find
#include <map>
#include <vector>

// accesses contains a list of ids. The most recently used id is at front(),
// the least recently used id is at back(). CachedItem and noncached_fetch
// are assumed to be defined elsewhere.
std::vector<long> accesses;
std::map<long, CachedItem*> cache;

CachedItem* get(long id) {
    auto it = cache.find(id);
    if (it != cache.end()) {
        // In cache.
        // Move id to the front of accesses.
        auto pos = std::find(accesses.begin(), accesses.end(), id);
        if (pos != accesses.begin()) {
            accesses.erase(pos);
            accesses.insert(accesses.begin(), id);
        }
        return it->second;
    }
    // Not in cache, fetch and add it.
    CachedItem* item = noncached_fetch(id);
    accesses.insert(accesses.begin(), id);
    cache[id] = item;
    if (accesses.size() > 1000)
    {
        // Remove the dead (least recently used) item.
        cache.erase(accesses.back());
        accesses.pop_back();
    }
    return item;
}
The inserts and erases may be a little expensive, but may also not be too bad given the locality (few cache misses). Anyway, if they become a big problem, one could change to std::list.
In my approach, you need a hash table for looking up stored objects quickly and a linked list for maintaining the sequence of last use.
When an object is requested:
1) Try to find the object in the hash table.
2.yes) If found (the value holds a pointer to the object's node in the linked list), move the object's node to the top of the linked list.
2.no) If not, remove the last object from the linked list, remove its data from the hash table as well, then put the new object into the hash table and at the top of the linked list.
For example
Let's say we have a cache memory only for 3 objects.
The request sequence is 1 3 2 1 4.
1) Hash-table : [1]
Linked-list : [1]
2) Hash-table : [1, 3]
Linked-list : [3, 1]
3) Hash-table : [1,2,3]
Linked-list : [2,3,1]
4) Hash-table : [1,2,3]
Linked-list : [1,2,3]
5) Hash-table : [1,2,4]
Linked-list : [4,1,2] => 3 out
Create a std::priority_queue<map<long, CacheEntry>::iterator> with a comparator for the access stamp. For an insert, first pop the last item off the queue and erase it from the map. Then insert the new item into the map, and finally push its iterator onto the queue.
I agree with Neil, scanning 1000 elements takes no time at all.
But if you want to do it anyway, you could just use the additional list you propose and, in order to avoid scanning the whole list each time, instead of storing just the CacheEntry in your map, you could store the CacheEntry and a pointer to the element of your list that corresponds to this entry.
As a simpler alternative, you could create a map that grows indefinitely and clears itself out every 10 minutes or so (adjust time for expected traffic).
You could also log some very interesting stats this way.
I believe this is a good candidate for treaps. The priority would be the time (virtual or otherwise), in ascending order (older at the root), and the long key as the BST key.
There is also the second-chance algorithm, which is good for caches. Although you lose search ability, the impact won't be big if you only have 1000 items.
The naïve method would be to have a map associated with a priority queue, wrapped in a class. You use the map to search and the queue to remove (first remove from the queue, grabbing the item, and then remove by key from the map).
Another option might be to use boost::multi_index. It is designed to separate the index from the data, thereby allowing multiple indexes on the same data.
I am not sure this would really be faster than scanning through 1000 items. It might use more memory than it's worth, or slow down search and/or insert/remove.
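For illustration, the cache maps naturally onto a multi_index container with a sequenced index (the LRU order) and a hashed index (the key lookup). A sketch following the pattern from the Boost examples, with Value as a placeholder payload:

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/sequenced_index.hpp>
#include <string>
#include <utility>

using Value = std::string;                       // placeholder payload type

struct Entry {
    long key;
    Value value;
};

namespace mi = boost::multi_index;

using Cache = mi::multi_index_container<
    Entry,
    mi::indexed_by<
        mi::sequenced<>,                                         // index 0: LRU order
        mi::hashed_unique<mi::member<Entry, long, &Entry::key>>  // index 1: key lookup
    >
>;

const Value* get(Cache& cache, long key) {
    auto& lookup = cache.get<1>();
    auto it = lookup.find(key);
    if (it == lookup.end()) return nullptr;
    // Move the entry to the front of the sequenced index (mark as most recent).
    cache.get<0>().relocate(cache.get<0>().begin(), cache.project<0>(it));
    return &it->value;
}

void put(Cache& cache, long key, Value v) {
    cache.get<0>().push_front(Entry{key, std::move(v)});  // no-op if key exists
    if (cache.size() > 1000) cache.get<0>().pop_back();   // evict least recent
}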