Structure for top hit objects - c++

I want a hit counter for objects as they are received, tracking each object's frequency, and I want to be able to retrieve the most frequent ("top hit") objects.
An unordered_map fits the first part, with the object as the key and the hit count as the value:
unordered_map<object,int>
It makes it fast to find an object and increment its hit count. But what about keeping things sorted? A priority_queue gives me the top-hit object, but then how do I increment an arbitrary object's hit count?

I would suggest you have a look at a splay tree, which keeps objects arranged so that the most recently and most frequently accessed ones are closer to the top. This relies on several heuristics and thus gives you an approximation of the perfect solution.
For an exact solution it is better to implement your own binary heap with an increment-priority operation. In theory the same structure backs priority_queue, but it exposes no change-priority operation, even though one can be added without affecting the complexity of the data structure's operations.
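A minimal sketch of that suggestion (my own illustration, not part of the answer): a max-heap keyed by hit count, plus a map from each object to its current heap index, so an increment can sift the entry up in O(log n).

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Max-heap keyed by hit count, with a key -> heap-index map so that
// incrementing an arbitrary object's count is O(log n).
class HitHeap {
    struct Node { std::string key; int hits; };
    std::vector<Node> heap_;                       // binary heap, max at index 0
    std::unordered_map<std::string, size_t> pos_;  // key -> index in heap_

    void sift_up(size_t i) {
        while (i > 0) {
            size_t parent = (i - 1) / 2;
            if (heap_[parent].hits >= heap_[i].hits) break;
            std::swap(heap_[parent], heap_[i]);
            pos_[heap_[parent].key] = parent;
            pos_[heap_[i].key] = i;
            i = parent;
        }
    }

public:
    // Increment the object's hit count, inserting it on first sight.
    void increment(const std::string& key) {
        auto it = pos_.find(key);
        if (it == pos_.end()) {
            heap_.push_back({key, 1});
            pos_[key] = heap_.size() - 1;
            sift_up(heap_.size() - 1);
        } else {
            ++heap_[it->second].hits;
            sift_up(it->second);  // count only grew, so sifting up suffices
        }
    }

    const std::string& top() const { return heap_.front().key; }
    int top_hits() const { return heap_.front().hits; }
};
```

Since counts only ever increase, only the upward sift is needed; a general change-priority would also need a sift-down.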

I managed to solve it by keeping a list of objects sorted by hit count as I insert them, so there is always a list of the top N hits. There are 3,000,000 objects and I want the top 20.
Here are the structures I used:
key_hit to keep track of hits (the key, a string, is the object):
unordered_map<string, int> key_hit;
two arrays: hits[N] and keys[N], which contain the top hit counts and their corresponding keys (objects).
idx, hits, keys
0, 212, x
1, 200, y
...
N, 12, z
and another map key_idx to keep the key and its corresponding index:
unordered_map<string,int> key_idx;
Algorithm (without details):
key is input.
search for the key in key_hit, find its hit count and increment it (this is fast enough).
if hit < hits[N], ignore it.
else, idx = key_idx[key] (if the key is not found there, add it to the structures and delete the existing lowest entry; it's too long to write out all the details).
H = ++hits[idx]
check whether it is greater than the entry above, hits[idx-1] < H. If yes, swap entries idx and idx-1 in key_idx, hits, and keys.
I tried to make it fast, but I don't know how fast it really is.
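A condensed sketch of the structures described above (names follow the post; the "delete the existing one" edge case is simplified to replacing the weakest entry):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Top-N tracker: key_hit counts every key; hits/keys hold the current top N
// sorted descending; key_idx maps a top key to its index in those arrays.
class TopN {
    std::unordered_map<std::string, int> key_hit;  // every key -> hit count
    std::unordered_map<std::string, int> key_idx;  // top keys -> array index
    std::vector<int> hits;                         // sorted descending
    std::vector<std::string> keys;
    size_t n_;

public:
    explicit TopN(size_t n) : n_(n) {}

    void hit(const std::string& key) {
        int h = ++key_hit[key];
        auto it = key_idx.find(key);
        if (it == key_idx.end()) {
            if (keys.size() < n_) {            // room left: append
                key_idx[key] = static_cast<int>(hits.size());
                hits.push_back(h);
                keys.push_back(key);
            } else if (h > hits.back()) {      // replace the weakest entry
                key_idx.erase(keys.back());
                key_idx[key] = static_cast<int>(hits.size()) - 1;
                hits.back() = h;
                keys.back() = key;
            } else {
                return;                        // not in the top N, ignore
            }
            it = key_idx.find(key);
        } else {
            hits[it->second] = h;
        }
        // Bubble the entry up while it beats the one above it.
        int idx = it->second;
        while (idx > 0 && hits[idx - 1] < hits[idx]) {
            std::swap(hits[idx - 1], hits[idx]);
            std::swap(keys[idx - 1], keys[idx]);
            key_idx[keys[idx - 1]] = idx - 1;
            key_idx[keys[idx]] = idx;
            --idx;
        }
    }

    const std::vector<std::string>& top_keys() const { return keys; }
};
```

Each hit increments a count by one, so an entry already in the top list moves at most one slot per hit, which is what keeps the bubble-up cheap in practice.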

How to track changes to a list

I have an immutable base list of items that I want to perform a number of operations on: edit, add, delete, read. The actual operations will be queued up and performed elsewhere (sent up to the server and a new list will be sent down), but I want a representation of what the list would look like with the current set of operations applied to the base list.
My current implementation keeps a vector of ranges and where they map to. So an unedited list has one range from 0 to length that maps directly to the base list. If an add is performed at index 5, then we have 3 ranges: 0-4 maps to base list 0-4, 5 maps to the new item, and 6-(length+1) maps to 5-length. This works, however with a lot of adds and deletes, reads degrade to O(n).
I've thought of using hashmaps but the shifts in ranges that can occur with inserts and deletes presents a challenge. Is there some way to achieve this so that reads are around O(1) still?
If you had a roughly balanced tree of ranges, where each node kept a record of the total number of elements below it in the tree, you could answer reads in worst-case time proportional to the depth of the tree, which should be about log(n). Perhaps a treap (https://en.wikipedia.org/wiki/Treap) would be one of the easier balanced trees to implement for this.
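The core of the idea can be sketched as follows (balancing and rotations omitted; a real treap would also maintain heap priorities): each node stores the number of elements in its subtree, so the element at logical index i is found by walking a single root-to-leaf path.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Each node holds one item plus the count of items in its subtree, so a
// read of logical index i costs O(tree depth).
struct Node {
    std::string item;
    int subtree_size = 1;
    std::unique_ptr<Node> left, right;
};

int size_of(const Node* n) { return n ? n->subtree_size : 0; }

const std::string& read(const Node* n, int i) {
    int left_size = size_of(n->left.get());
    if (i < left_size) return read(n->left.get(), i);   // index is in left subtree
    if (i == left_size) return n->item;                 // this node is element i
    return read(n->right.get(), i - left_size - 1);     // skip left subtree + self
}
```

Inserts and deletes would adjust subtree_size along the path and rebalance, which is the part a treap's rotations take care of.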
If you had a lot of repetitious reads and few modifications, you might gain by keeping a hashmap of the results of reads since the last modification, clearing it on modification.

LRU sorted by score in C++, is there such container?

I need to implement a very efficient LRU cache with the following properties: entries are indices into a vector of cache entries; each cache hit updates an empirical score computed from values that can be kept in the container's value type, such as the number of hits, the size of the matched object, etc.
I need to be able to quickly pick a victim for cache eviction from the bottom of this LRU, and to quickly iterate over some number of the best-performing entries from the top, so the container needs to be sorted.
So far, I was only able to come up with a vector of structures that hold the values for score calculation plus bidirectional links, which I use to put an updated element back in place after score recalculation, by linear search from its current position with score comparisons. This search may happen upwards (when the score is updated, since it only grows) or downwards (when an element is evicted and its score resets to 0). Linear search may not be so bad: this runs for a long time, the scores of surviving elements grow large, and each increment is small, so an element does not have to move far to reach its place, and after a reset I can start the search from the bottom.
I am aware of STL sorted containers, folly's cache LRU implementation, and Boost.Bimap (this last one seems to be an overkill for what I need).
Can I do better than a linear search here? Does anyone know of an implementation?
Thanks in advance!
Update: I implemented a solution that involves a vector of iterators into a set; the set holds the index into the vector (for uniqueness) plus the data needed to compute the score, with a comparator sorting by score.
Seems to work well, maybe there is a more elegant solution out there?
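Along the lines of that update (the details here are my guesses, not the poster's code), a sketch: a std::set ordered by score holds (score, slot) pairs, and a vector maps each cache slot to its iterator in the set, so rescoring an entry is two O(log n) set operations.

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// The slot index makes entries unique even when scores tie, which is what
// lets a plain std::set stand in for a multimap here.
struct Entry {
    double score;
    int slot;  // index into the cache vector
    bool operator<(const Entry& o) const {
        return score != o.score ? score < o.score : slot < o.slot;
    }
};

class ScoredLru {
    std::set<Entry> by_score_;
    std::vector<std::set<Entry>::iterator> slot_to_entry_;

public:
    explicit ScoredLru(int slots) {
        for (int i = 0; i < slots; ++i)
            slot_to_entry_.push_back(by_score_.insert({0.0, i}).first);
    }

    // A cache hit on `slot` bumps its score: erase + reinsert, O(log n) each.
    void bump(int slot, double delta) {
        Entry e = *slot_to_entry_[slot];
        by_score_.erase(slot_to_entry_[slot]);
        e.score += delta;
        slot_to_entry_[slot] = by_score_.insert(e).first;
    }

    int victim() const { return by_score_.begin()->slot; }         // lowest score
    int best() const { return std::prev(by_score_.end())->slot; }  // highest score
};
```

Iterating the set from either end gives the sorted traversal over the best (or worst) entries that the question asks for.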

C++ hashing: Open addressing and Chaining

For Chaining:
Can someone please explain this concept to me and provide me a theory example and a simple code one?
I get the idea of "Each table location points to a linked list (chain) of items that hash to this location", but I can't seem to illustrate what's actually going on.
Suppose we had h(x) (hash function) = x/10 mod 5. Now to hash 12540, 51288, 90100, 41233, 54991, 45329, 14236, what would that look like?
And for open addressing (linear probing, quadratic probing, and probing for every R location), can someone explain that to me as well? I tried Googling around but I seem to get confused further.
Chaining is probably the most obvious form of hashing. The hash-table is actually an array of linked-lists that are initially empty. Items are inserted by adding a new node to the linked-list at the item's calculated table index. If a collision occurs, the new node is linked after the previous tail node of the linked-list. (Actually, an implementation may keep the items in each list sorted, but let's keep it simple.) One advantage of this scheme is that the hash-table can never become 'full'; a disadvantage is that you jump around memory a lot and your CPU cache will hate you.
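A minimal chained table along these lines (using plain key % table_size as the hash, not the question's h(x)):

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Chained hash table: each bucket is a linked list of (key, value) pairs,
// and colliding keys simply extend the chain.
class ChainedTable {
    std::vector<std::list<std::pair<int, std::string>>> buckets_;

    size_t bucket_for(int key) const {
        return static_cast<size_t>(key) % buckets_.size();
    }

public:
    explicit ChainedTable(size_t n) : buckets_(n) {}

    void insert(int key, const std::string& value) {
        buckets_[bucket_for(key)].push_back({key, value});  // append to the chain
    }

    const std::string* find(int key) const {
        for (const auto& kv : buckets_[bucket_for(key)])    // walk the chain
            if (kv.first == key) return &kv.second;
        return nullptr;
    }
};
```

With 5 buckets, keys 10 and 15 both land in bucket 0 and coexist on the same chain, which is exactly the collision case described above.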
Open Addressing tries to take advantage of the fact that the hash-table is likely to be sparsely populated (large gaps between entries). The hash-table is an array of items. If a collision occurs, instead of adding the item to the end of the current item at that location, the algorithm searches for the next empty space in the hash-table. However this means that you cannot rely on the hashcode alone to see if an item is present, you must also compare the contents if the hashcode matches.
The 'probing' is the strategy the algorithm follows when trying to find the next free slot.
One issue is that the table can become full, i.e. no more empty slots. In this case the table will need to be resized and the hash function changed to take into account the new size. All existing items in the table must be reinserted too as their hash codes will no longer have the same values once the hash function is changed. This may take a while.
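A matching linear-probing sketch (fixed-size table; the resize-and-rehash step described above is omitted):

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Open addressing with linear probing: on collision, step to the next slot
// until a free one is found. Lookups probe the same sequence and must
// compare keys, not just hash values.
class ProbedTable {
    std::vector<std::optional<std::pair<int, std::string>>> slots_;

public:
    explicit ProbedTable(size_t n) : slots_(n) {}

    bool insert(int key, const std::string& value) {
        size_t i = static_cast<size_t>(key) % slots_.size();
        for (size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i]) { slots_[i].emplace(key, value); return true; }
            i = (i + 1) % slots_.size();  // linear probe: try the next slot
        }
        return false;  // table full: a real implementation would resize here
    }

    const std::string* find(int key) const {
        size_t i = static_cast<size_t>(key) % slots_.size();
        for (size_t probes = 0; probes < slots_.size(); ++probes) {
            if (!slots_[i]) return nullptr;  // empty slot ends the probe sequence
            if (slots_[i]->first == key) return &slots_[i]->second;
            i = (i + 1) % slots_.size();
        }
        return nullptr;
    }
};
```

Quadratic probing or probing every R locations would only change the `i = (i + 1) % size` step; the rest of the structure is identical.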
Here's a Java animation of a hash table.
Because you do mod 5, your table will have 5 locations:
location 0: 90100
(because the result of 90100/10 mod 5 is 0)
For the same reason, you have:
location 1: None
location 2: 45329
location 3: 51288->41233->14236
location 4: 12540->54991
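You can check the layout above directly with the question's hash function, h(x) = (x/10) mod 5, using integer division:

```cpp
#include <cassert>

// The question's hash function: integer division by 10, then mod 5.
int h(int x) { return (x / 10) % 5; }

// From the table above: 90100 lands in bucket 0; 45329 in bucket 2;
// 51288, 41233 and 14236 all collide in bucket 3 and chain together;
// 12540 and 54991 collide in bucket 4.
```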
You can find more info on Wikipedia.
In open addressing we have to store every element in the table itself, using one of the probing techniques, so the load factor is at most one.
In chaining, the hash table stores only the head pointers of linked lists, so the load factor can be greater than one.

How to repeatedly insert elements into a sorted list fast

I do not have formal CS training, so bear with me.
I need to do a simulation, which can abstracted away to the following (omitting the details):
We have a list of real numbers representing the times of events. In
each step, we
remove the first event, and
as a result of "processing" it, a few other events may get inserted into the list at a strictly later time
and repeat this many times.
Questions
What data structure / algorithm can I use to implement this as efficiently as possible? I need to increase the number of events/numbers in the list significantly. The priority is to make this as fast as possible for a long list.
Since I'm doing this in C++, what data structures are already available in the STL or boost that will make it simple to implement this?
More details:
The number of events in the list is variable, but it's guaranteed to be between n and 2*n where n is some simulation parameter. While the event times are increasing, the time-difference of the latest and earliest events is also guaranteed to be less than a constant T. Finally, I suspect that the density of events in time, while not constant, also has an upper and lower bound (i.e. all the events will never be strongly clustered around a single point in time)
Efforts so far:
As the title of the question says, I was thinking of using a sorted list of numbers. If I use a linked list for constant-time insertion, then I have trouble finding the position at which to insert new events in a fast (sublinear) way.
Right now I am using an approximation where I divide time into buckets and keep track of how many events are in each bucket. Then I process the buckets one by one as time "passes", always adding a new bucket at the end when removing one from the front, thus keeping the number of buckets constant. This is fast, but only an approximation.
A min-heap might suit your needs. There's an explanation here and I think STL provides the priority_queue for you.
Insertion time is O(log N), removal is O(log N)
It sounds like you need/want a priority queue. If memory serves, the priority queue adapter in the standard library is written to retrieve the largest items instead of the smallest, so you'll have to specify that it use std::greater for comparison.
Other than that, it provides just about exactly what you've asked for: the ability to quickly access/remove the smallest/largest item, and the ability to insert new items quickly. While it doesn't maintain all the items in order, it does maintain enough order that it can still find/remove the one smallest (or largest) item quickly.
I would start with a basic priority queue, and see if that's fast enough.
If not, then you can look at writing something custom.
http://en.wikipedia.org/wiki/Priority_queue
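A minimal version of that suggestion: std::priority_queue with std::greater as the comparator behaves as a min-heap, so top() is always the earliest event. The +1.5 offset below is only a stand-in for whatever new event times the simulation actually computes.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Min-priority queue of event times: std::greater flips the default
// max-heap, so top() is the earliest (smallest) time.
using EventQueue =
    std::priority_queue<double, std::vector<double>, std::greater<double>>;

// One simulation step: pop the earliest event, "process" it, and push a
// strictly later event that processing spawned.
double step(EventQueue& q) {
    double now = q.top();  // earliest event, O(1)
    q.pop();               // O(log n)
    q.push(now + 1.5);     // hypothetical follow-up event, O(log n)
    return now;
}
```

Since the queue never holds more than about 2n events and each step is O(log n), this matches the n-to-2n bound in the question without needing the full list kept sorted.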
A binary search tree is always sorted and has faster access times than a linear list: search, insert, and delete are all O(log n).
But it depends on whether the items have to stay sorted the whole time, or only after the process is finished. In the latter case a hash table is probably faster; at the end of the process you would copy the items to an array or a list and sort it.

C++ design: How to cache most recent used

We have a C++ application whose performance we are trying to improve. We identified that data retrieval takes a lot of time and want to cache the data. We can't store all the data in memory as it is huge, so we want to store up to 1000 items in memory. These items can be indexed by a long key. However, when the cache size goes over 1000, we want to remove the item that has not been accessed for the longest time, as we assume some sort of "locality of reference": items in the cache that were recently accessed will probably be accessed again.
Can you suggest a way to implement it?
My initial implementation was to have a map<long, CacheEntry> to store the cache, and add an accessStamp member to CacheEntry which will be set to an increasing counter whenever an entry is created or accessed. When the cache is full and a new entry is needed, the code will scan the entire cache map and find the entry with the lowest accessStamp, and remove it.
The problem with this is that once the cache is full, every insertion requires a full scan of the cache.
Another idea was to hold a list of CacheEntries in addition to the cache map, and on each access move the accessed entry to the top of the list, but the problem was how to quickly find that entry in the list.
Can you suggest a better approach?
Thanks, splintor
Have your map<long,CacheEntry> but instead of having an access timestamp in CacheEntry, put in two links to other CacheEntry objects to make the entries form a doubly-linked list. Whenever an entry is accessed, move it to the head of the list (this is a constant-time operation). This way you will both find the cache entry easily, since it's accessed from the map, and are able to remove the least-recently used entry, since it's at the end of the list (my preference is to make doubly-linked lists circular, so a pointer to the head suffices to get fast access to the tail as well). Also remember to put in the key that you used in the map into the CacheEntry so that you can delete the entry from the map when it gets evicted from the cache.
Scanning a map of 1000 elements will take very little time, and the scan will only be performed when the item is not in the cache which, if your locality of reference ideas are correct, should be a small proportion of the time. Of course, if your ideas are wrong, the cache is probably a waste of time anyway.
An alternative implementation that might make the 'aging' of the elements easier, at the cost of lower search performance, would be to keep your CacheEntry elements in a std::list (or a std::list of std::pair<long, CacheEntry>). The newest element gets added at the front of the list, so elements 'migrate' towards the end of the list as they age. When you check whether an element is already present in the cache, you scan the list (which is admittedly an O(n) operation, as opposed to O(log n) in a map). If you find it, you remove it from its current location and re-insert it at the start of the list. If the list grows beyond 1000 elements, you remove the required number of elements from the end to trim it back below 1000.
Update: I got it now...
This should be reasonably fast. A sketch:
#include <algorithm>
#include <map>
#include <vector>

// accesses holds the ids in usage order: the most recently used id is at
// the front, the least recently used at the back.
std::vector<long> accesses;
std::map<long, CachedItem*> cache;

CachedItem* get(long id) {
    auto it = cache.find(id);
    if (it != cache.end()) {
        // In cache: move id to the front of accesses.
        auto pos = std::find(accesses.begin(), accesses.end(), id);
        if (pos != accesses.begin()) {
            accesses.erase(pos);
            accesses.insert(accesses.begin(), id);
        }
        return it->second;
    }
    // Not in cache: fetch and add it.
    CachedItem* item = noncached_fetch(id);
    accesses.insert(accesses.begin(), id);
    cache[id] = item;
    if (accesses.size() > 1000) {
        // Evict the least recently used item.
        cache.erase(accesses.back());
        accesses.pop_back();
    }
    return item;
}
The inserts and erases may be a little expensive, but may also not be too bad given the locality (few cache misses). Anyway, if they become a big problem, one could switch to std::list.
In my approach, you need a hash table to look up stored objects quickly, and a linked list to maintain the order of last use.
When an object is requested:
1) try to find the object in the hash table.
2.yes) if found (the hash table value holds a pointer to the object's node in the linked list), move that node to the top of the linked list.
2.no) if not found, remove the last object from the linked list and remove its entry from the hash table, then put the new object into the hash table and at the top of the linked list.
For example, let's say we have cache memory for only 3 objects.
The request sequence is 1 3 2 1 4.
1) Hash-table : [1]
Linked-list : [1]
2) Hash-table : [1, 3]
Linked-list : [3, 1]
3) Hash-table : [1,2,3]
Linked-list : [2,3,1]
4) Hash-table : [1,2,3]
Linked-list : [1,2,3]
5) Hash-table : [1,2,4]
Linked-list : [4,1,2] => 3 out
Create a std::priority_queue<map<long, CacheEntry>::iterator> with a comparator on the access stamp. For an insert, first pop the last item off the queue and erase it from the map. Then insert the new item into the map, and finally push its iterator onto the queue.
I agree with Neil, scanning 1000 elements takes no time at all.
But if you want to do it anyway, you could just use the additional list you propose and, in order to avoid scanning the whole list each time, instead of storing just the CacheEntry in your map, you could store the CacheEntry and a pointer to the element of your list that corresponds to this entry.
As a simpler alternative, you could create a map that grows indefinitely and clears itself out every 10 minutes or so (adjust time for expected traffic).
You could also log some very interesting stats this way.
I believe this is a good candidate for treaps. The priority would be the time (virtual or otherwise), in ascending order (older at the root), with the long as the key.
There's also the second chance algorithm, that's good for caches. Although you lose search ability, it won't be a big impact if you only have 1000 items.
The naïve method would be to have a map associated with a priority queue, wrapped in a class. You use the map to search and the queue to evict (first remove from the queue, grabbing the item, and then remove it by key from the map).
Another option might be to use boost::multi_index. It is designed to separate the index from the data, thereby allowing multiple indexes on the same data.
I am not sure this would really be faster than scanning through 1000 items. It might use more memory than is good, or slow down search and/or insert/remove.