How to track changes to a list - C++

I have an immutable base list of items that I want to perform a number of operations on: edit, add, delete, read. The actual operations will be queued up and performed elsewhere (sent up to the server and a new list will be sent down), but I want a representation of what the list would look like with the current set of operations applied to the base list.
My current implementation keeps a vector of ranges and where they map to. So an unedited list has one range from 0 to length that maps directly to the base list. If an add is performed at index 5, then we have 3 ranges: 0-4 maps to base list 0-4, 5 maps to the new item, and 6-(length+1) maps to base list 5-length. This works; however, with a lot of adds and deletes, reads degrade to O(n).
I've thought of using hashmaps, but the shifts in ranges that inserts and deletes cause present a challenge. Is there some way to achieve this so that reads stay around O(1)?

If you had a roughly balanced tree of ranges, where each node kept a record of the total number of elements below it in the tree, you could answer reads in worst-case time proportional to the depth of the tree, which should be about log(n). Perhaps a https://en.wikipedia.org/wiki/Treap would be one of the easier balanced trees to implement for this.
If you had a lot of repetitious reads and few modifications, you might gain by keeping a hashmap of the results of reads since the last modification, clearing it on modification.

Related

Least Recently Used (LRU) Cache

I know that I can use various container classes in STL but it's an overkill and expensive for this purpose.
We have over 1M users online, and per user we need to maintain 8 unrelated 32-bit data items. The goal is to:
1. find if an item exists in the list,
2. if not, insert it, removing the oldest entry if full.
A brute-force approach would be to maintain a last-write pointer and iterate (since there are only 8 items), but I am looking for inputs to better analyze and implement this.
Look forward to some interesting suggestions in terms of design pattern and algorithm.
Don Knuth gives several interesting and very efficient approximations in The Art of Computer Programming.
Self-organizing list I: when you find an entry, move it to the head of the list; delete from the end.
Self-organizing list II: when you find an entry, move it up one spot; delete from the end.
[Both the above in Vol. 3 §6.1(A).]
Another scheme maintains the list circularly with 1 extra bit per entry, which is set when you find that entry and cleared when you skip past it to find something else. You always start searching at the last place you stopped, and if you don't find the entry, you replace the one with the next clear bit, i.e. the one that hasn't been used since one entire trip around the list.
[Vol. 1 §2.5(G).]
What you want here is a combination of a hash table and a doubly linked list.
Each item is accessible via the hash table that holds the key you need plus a pointer to the element in the list.
Algorithm:
Given new item x, do:
1. Add x to the head of the list, save pointer as ptr.
2. Add x to the hash table where the data is stored, and add ptr.
3. If the list is bigger than allowed, take the last element (from the tail of the list) and remove it. Use the key of this element to remove it from the Hash table as well.
If you want a C implementation of an LRU cache, try this link.
The idea is that we use two data structures to implement an LRU Cache.
A queue, implemented using a doubly linked list. The maximum size of the queue equals the total number of frames available (the cache size). The most recently used pages are near the front end and the least recently used pages near the rear end.
A Hash with page number as key and address of the corresponding queue node as value.
When a page is referenced, the required page may be in the memory. If it is in the memory, we need to detach the node of the list and bring it to the front of the queue.
If the required page is not in the memory, we bring that in memory. In simple words, we add a new node to the front of the queue and update the corresponding node address in the hash. If the queue is full, i.e. all the frames are full, we remove a node from the rear of queue, and add the new node to the front of queue.
I personally would either go with the self-organising lists as proposed by EJP or, as we only have eight elements, simply store them sequentially together with a timestamp.
When accessing an element, just update the timestamp, when replacing, replace the one with oldest timestamp (one linear search). This is less efficient on replacements, but more efficient on access (no need to move any elements around). And it might be the easiest to implement...
Modification of the self-organising lists, if based on some array data structure: sure, on update you have to shift several elements (variant I) or at least swap two of them (variant II), but if you organize the data as a ring buffer, then on replacement we just replace the last element with the new one and move the buffer's pointer to this new element:
a, b, c, d
^
Accessing a:
d, b, a, c
^
New element e:
d, e, a, c
^
Special case: accessing the oldest element (d in this case) - we then simply can move the pointer, too:
d, e, a, c
^
Just: with only 8 elements, it might not be worth the effort to implement all this...
I agree with Drop and Geza's comments. The straightforward implementation will take one cache line read, and cause one cache line write.
The only performance question left is the lookup and update of that 32-bit value within 256 bits. Assuming modern x86, the lookup itself can be two instructions: _mm256_cmp_epi32_mask finds all equal values in parallel, and _mm256_lzcnt_epi32 counts leading zeroes = number of older non-matching items * 32. But even with older SIMD operations, the cache-line read/write operations will dominate the execution time. And that in turn is dominated by finding the right user, which in turn is dominated by the network I/O involved.
You could use a cuckoo filter, which is a probabilistic, hash-based data structure that supports fast set-membership testing.
Time complexity of a cuckoo filter:
Lookup: O(1)
Deletion: O(1)
Insertion: O(1)
For reference, here is how the cuckoo filter works.
Parameters of the filter:
1. Two hash functions: h1 and h2
2. An array B with n buckets. The i-th bucket will be called B[i]
Input: L, a list of elements to be inserted into the cuckoo filter.
Algorithm:
while L is not empty:
    Let x be the first item in L. Remove x from the list.
    if B[h1(x)] is empty:
        place x in B[h1(x)]
    else if B[h2(x)] is empty:
        place x in B[h2(x)]
    else:
        Let y be the element in B[h2(x)]
        Prepend y to L
        place x in B[h2(x)]
For LRU you can use time stamping in your hash function by keeping just a local variable.
This approach scales well even to very large data sets.

Rank-Preserving Data Structure other than std:: vector?

I am faced with an application where I have to design a container that has random access (or at least better than O(n)), has inexpensive (O(1)) insert and removal, and stores the data according to the order (rank) specified at insertion.
For example if I have the following array:
[2, 9, 10, 3, 4, 6]
I can call remove on index 2 to remove 10, and I can also call insert on index 1 to insert 13.
After those two operations I would have:
[2, 13, 9, 3, 4, 6]
The numbers are stored in a sequence and insert/remove operations require an index parameter to specify where the number should be inserted or which number should be removed.
My question is: what kind of data structures, besides a linked list and a vector, could maintain something like this? I am leaning towards a heap that prioritizes on the next available index. But I have been seeing something about a Fusion Tree being useful (though more in a theoretical sense).
What kind of data structure would give me the most optimal running time while still keeping memory consumption down? I have been playing around with an insertion-order-preserving hash table, but it has been unsuccessful so far.
The reason I am tossing out using a std::vector straight up is that I must construct something that outperforms a vector in terms of these basic operations. The size of the container has the potential to grow to hundreds of thousands of elements, so committing to shifts in a std::vector is out of the question. The same problem lies with a linked list (even if doubly linked): traversing it to a given index would take in the worst case O(n/2), which is still O(n).
I was thinking of a doubly linked list that contained a head, tail, and middle pointer, but I felt that it wouldn't be much better.
In a basic usage, to be able to insert and delete at an arbitrary position, you can use linked lists. They allow for O(1) insert/remove, but only provided that you have already located the position in the list at which to insert. You can insert "after a given element" (that is, given a pointer to an element), but you cannot as efficiently insert "at a given index".
To be able to insert and remove an element given its index, you will need a more advanced data structure. There exist at least two such structures that I am aware of.
One is a rope structure, which is available in some C++ extensions (SGI STL, or in GCC via #include <ext/rope>). It allows for O(log N) insert/remove at arbitrary position.
Another structure allowing for O(log N) insert/remove is an implicit treap (aka implicit cartesian tree); you can find some information at http://codeforces.com/blog/entry/3767 ("Treap with implicit keys") or https://codereview.stackexchange.com/questions/70456/treap-with-implicit-keys.
An implicit treap can also be modified to support finding the minimal value in it (and to support much more besides). I'm not sure whether a rope can handle this.
UPD: In fact, I guess that you can adapt any O(log N) binary search tree (such as an AVL or red-black tree) for this purpose by converting it to an "implicit key" scheme. A general outline is as follows.
Imagine a binary search tree which, at each given moment, stores the consecutive numbers 1, 2, ..., N as its keys (N being the number of nodes in the tree). Every time we change the tree (insert or remove a node) we would recalculate all the stored keys so that they still run from 1 to the new value of N. This would allow insert/remove at arbitrary position, as the key is now the position, but it would require too much time to update all the keys.
To avoid this, we will not store keys in the tree explicitly. Instead, for each node, we will store the number of nodes in its subtree. As a result, any time we go down from the tree root, we can keep track of the index (position) of the current node: we just need to sum the sizes of the subtrees we have to our left. This allows us, given k, to locate the node that has index k (that is, the k-th node in the standard order of the binary search tree) in O(log N) time. After this, we can perform insert or delete at this position using the standard binary tree procedure; we just need to update the subtree sizes of all the nodes changed during the update, but this is easily done in O(1) time per changed node, so the total insert or remove time will be O(log N), as in the original binary search tree.
So this approach allows us to insert/remove/access nodes at a given position in O(log N) time using any O(log N) binary search tree as a basis. You can of course store the additional information ("values") you need in the nodes, and you can even calculate the minimum of these values in the tree just by keeping the minimum value of each node's subtree.
However, the aforementioned treap and rope are more advanced as they allow also for split and merge operations (taking a substring/subarray and concatenating two strings/arrays).
Consider a skip list, which can implement logarithmic-time rank operations in its "indexable" variation.
For algorithms (pseudocode), see A Skip List Cookbook, by Pugh.
It may be that the "implicit key" binary search tree method outlined by Petr above is easier to get to, and may even perform better.

Insertion into a skip list

A skip list is a data structure in which the elements are stored in sorted order and each node of the list may contain more than 1 pointer, and is used to reduce the time required for a search operation from O(n) in a singly linked list to O(lg n) for the average case. It looks like this:
Reference: "Skip list" by Wojciech Muła - Own work. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Skip_list.svg#mediaviewer/File:Skip_list.svg
It can be seen as analogous to a ruler.
In a skip list, searching an element and deleting one is fine, but when it comes to insertion, it becomes difficult, because according to Data Structures and Algorithms in C++: 2nd edition by Adam Drozdek:
To insert a new element, all nodes following the node just inserted have to be restructured; the number of pointers and the value of pointers have to be changed.
I construe from this that although choosing a random number of pointers for each inserted node doesn't create a perfect skip list, it gets really close to one when a large number of elements (~9 million, for example) is considered.
My question is: Why can't we insert the new element in a new node, determine its number of pointers based on the previous node, attach it to the end of the list, and then use efficient sorting algorithms to sort just the data present in the nodes, thereby maintaining the perfect structure of the skip list and also achieving the O(lg n) insert complexity?
Edit: I haven't tried any code yet, I'm just presenting a view. Simply because implementing a skip list is somewhat difficult. Sorry.
There is no need to modify any following nodes when you insert a node. See the original paper, Skip Lists: A Probabilistic Alternative to Balanced Trees, for details.
I've implemented a skip list from that reference, and I can assure you that my insertion and deletion routines do not modify any nodes forward of the insertion point.
I haven't read the book you're referring to, but out of context the passage you highlight is just flat wrong.
There is a problem with this point: "and then use efficient sorting algorithms to sort just the data present in the nodes". Sorting the data will have complexity O(n*lg(n)) and thus will increase the complexity of insertion. In theory you can choose a "perfect" number of links for each node being inserted, but even if you do, remove operations will break the perfectness. The randomized approach is close enough to the perfect structure to perform well.
You need a function/method that searches for the location.
It needs to do the following:
If you insert unique keys, it needs to locate the node; then you keep everything and just change the data (payload), e.g. node->data = data.
If you allow duplicates, or if the key is not found, then this function/method needs to give you the previous node on each height (lane). Then you determine the height of the new node and insert it after the found nodes.
Here is my C implementation:
https://github.com/nmmmnu/HM2/blob/master/hm_skiplist.c
You need to check the following function:
static const hm_skiplist_node_t *_hm_skiplist_locate(const hm_skiplist_t *l, const char *key, int complete_evaluation);
It stores the position inside the hm_skiplist_t struct.
complete_evaluation is used to save time in case you need the data and do not intend to insert/delete.

Implementation of a Data-Structure supporting various operations

I have to implement a data structure which supports the following three functions. The data is a pair(a,b) of two double values and the data is concentrated in a particular region. Let's say with values of 'a' in the range 500-600.
Insert(double a, double b) - Insert the data, a pair(double,double) in the data structure. If the first element of the pair already exists, update its second element to the new value.
Delete(double a) - Delete the data containing the first element = a.
PrintData(int count) - Print the value of the data which has the count-th largest value. Value is compared according to data.first.
The input file contains a series of Insert, Delete and PrintData operations. Currently, I have implemented the data structure as a height-balanced binary search tree using an STL map, but it is not fast enough.
Is there any other implementation which is faster than a map?
We can use caching to store the most common PrintData queries.
I'd recommend 2 binary search trees (BSTs) - one being the map from a to b (sorted by a), the other should be sorted by b.
The second will need to be a custom BST - you'll need to let each node store a count of the number of nodes in the subtree with it as root - these counts can be updated in O(log n), and will allow for O(log n) queries to get the k-th largest element.
When doing an insert, you'll first look up a in the first BST to get the old b value, then remove that value from the second, then update the first and insert the new value into the second.
For a delete, you'll look up a in the first BST to get its b value (and remove that pair), then remove that value from the second.
All mentioned operations should take O(log n).
Caching
If you are, for instance, only going to query the top 10 elements, you could maintain another BST containing only those 10 elements (or even just an optionally-sorted array, since there's only 10 elements), which we'll then query instead of the second BST above.
When inserting, also insert into this structure if the value is greater than the smallest one, and remove the smallest.
When removing, we need to look up the next largest value and insert it into the small BST. This could also be done lazily: when removing, just remove it from this BST and don't fill it up to 10 again. When querying, if there are enough elements in this BST, we just use it; otherwise we find, in the big BST, all the values required to fill this BST up, and then we query.
This would result in best-case O(1) query (worst-case O(log n)), while the other operations will still be O(log n).
Although the added complexity is probably not worth it - O(log n) is pretty fast, even for a large n.
Building on this idea, we could only have this small BST along with the BST mapping a to b - this would require that we check all values to find the required ones during a query after a removal, so it would only really be beneficial if there aren't a whole lot of removals.
I would recommend an indexed skip list. That will give you O(log n) insert and delete, and O(log n) access to the nth largest value (assuming that you maintain the list in descending order).
A skip list isn't any more difficult to implement than a self-balancing binary tree, and gives much better performance in some situations. Well worth considering.
The original skip list paper.

How to repeatedly insert elements into a sorted list fast

I do not have formal CS training, so bear with me.
I need to do a simulation, which can be abstracted away to the following (omitting the details):
We have a list of real numbers representing the times of events. In each step, we
1. remove the first event, and
2. as a result of "processing" it, a few other events may get inserted into the list at a strictly later time,
and repeat this many times.
Questions
What data structure / algorithm can I use to implement this as efficiently as possible? I need to increase the number of events/numbers in the list significantly. The priority is to make this as fast as possible for a long list.
Since I'm doing this in C++, what data structures are already available in the STL or boost that will make it simple to implement this?
More details:
The number of events in the list is variable, but it's guaranteed to be between n and 2*n where n is some simulation parameter. While the event times are increasing, the time-difference of the latest and earliest events is also guaranteed to be less than a constant T. Finally, I suspect that the density of events in time, while not constant, also has an upper and lower bound (i.e. all the events will never be strongly clustered around a single point in time)
Efforts so far:
As the title of the question says, I was thinking of using a sorted list of numbers. If I use a linked list for constant time insertion, then I have trouble finding the position where to insert new events in a fast (sublinear) way.
Right now I am using an approximation where I divide time into buckets and keep track of how many events there are in each bucket. I then process the buckets one by one as time "passes", always adding a new bucket at the end when removing one from the front, thus keeping the number of buckets constant. This is fast, but only an approximation.
A min-heap might suit your needs. There's an explanation here and I think STL provides the priority_queue for you.
Insertion time is O(log N), removal is O(log N)
It sounds like you need/want a priority queue. If memory serves, the priority queue adapter in the standard library is written to retrieve the largest items instead of the smallest, so you'll have to specify that it use std::greater for comparison.
Other than that, it provides just about exactly what you've asked for: the ability to quickly access/remove the smallest/largest item, and the ability to insert new items quickly. While it doesn't maintain all the items in order, it does maintain enough order that it can still find/remove the one smallest (or largest) item quickly.
I would start with a basic priority queue, and see if that's fast enough.
If not, then you can look at writing something custom.
http://en.wikipedia.org/wiki/Priority_queue
A binary search tree keeps its elements sorted and has faster access times than a linear list: search, insert and delete times are all O(log(n)).
But it depends on whether the items have to be sorted all the time, or only after the process is finished. In the latter case a hash table is probably faster; at the end of the process you would then copy the items to an array or a list and sort it.