Best data structure to use - list

The problem is: I receive strings as input, one per numbered line, ordered 1, 2, 3, 4, 5..., and I have to place each string on its own line. For example, if the input is
"1.Hi john 2. How are you? 3. XXXX 4.TTTTT"
The output will be:
(1) Hi john
(2) How are you?
(3) XXXX
(4) TTTTT
I can't receive line 7 as input if lines 5 and 6 aren't already filled.
The input also contains some commands, for example:
print line 3
change line 2 with a given string
delete line 3 (the following lines then shift up, so 4 becomes 3, 5 becomes 4, ...)
undo
redo
Which data structure is best to implement here? I started with a heap, because everything is ordered, but if I delete a node I need to shift every following line up, and I have some problems doing that with heaps. I also thought about a persistent tree, because I need to remember previous states to be able to do the undo and redo.

The answer primarily depends on which operations you want to optimize on.
Array - access is fastest, so it optimizes print line and change line.
Doubly linked list - access anywhere other than the head and tail requires traversal; however, it easily handles your delete, undo and redo situations.
For your undo and redo functions, you should create a separate undo stack where you store the removed or changed nodes together with their old state. Redo is then a matter of popping elements off a second stack of undone operations to reapply them to the main list.

If you expect to be adding and deleting lines a lot, and you also expect to be referring to lines by their (current) index, what you want is an order statistic tree. This is a type of binary tree which, in addition to normal binary tree operations (including efficient insertion and deletion), allows you to efficiently access items by index. In this case, it's not a binary search tree because you don't have a sort key; all of your accesses will be by (current) index.
In order to efficiently support undo/redo, you would additionally make the tree into a persistent data structure, using "path copying" to partially modify the data while still allowing access to the old version of it. (Path copying would be ideal for order-statistic trees, since all your updates propagate to the root anyway.) Undo would simply be reverting to that old version.
But: Unless you are dealing with millions or billions of lines, these weird exotic data structures are not going to be worth your time. So while the literal answer is "persistent order-statistic tree" the practical answer is "probably just put stuff in an array, and have a stack of undo operations to support undo/redo".
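The practical variant can be sketched as follows: a `std::vector` of lines plus undo/redo stacks of full snapshots. The class and method names are invented for the illustration, and snapshotting the whole vector is O(n) per edit; a real editor would store inverse operations instead.

```cpp
#include <cassert>
#include <stack>
#include <string>
#include <vector>

// Minimal sketch: lines in a vector, undo/redo as stacks of full snapshots.
class LineBuffer {
public:
    const std::string& line(std::size_t i) const { return lines_.at(i); }
    std::size_t size() const { return lines_.size(); }

    void append(const std::string& s)                { snapshot(); lines_.push_back(s); }
    void change(std::size_t i, const std::string& s) { snapshot(); lines_.at(i) = s; }
    void erase(std::size_t i)                        { snapshot(); lines_.erase(lines_.begin() + i); }

    void undo() {
        if (undo_.empty()) return;
        redo_.push(lines_);
        lines_ = undo_.top();
        undo_.pop();
    }
    void redo() {
        if (redo_.empty()) return;
        undo_.push(lines_);
        lines_ = redo_.top();
        redo_.pop();
    }

private:
    void snapshot() {
        undo_.push(lines_);
        redo_ = {};  // a new edit invalidates the redo history
    }
    std::vector<std::string> lines_;
    std::stack<std::vector<std::string>> undo_, redo_;
};
```

Note that `erase` makes the renumbering automatic: line 4 simply becomes line 3 because the vector closes the gap.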

You can use a combination of a map and a doubly linked list.
Doubly linked list: where the actual strings are stored; the position of a node represents the number of the corresponding string.
Ex -> Print line 3: print the 3rd node in this list.
Map: here the key is the line number, and the value is a reference to the corresponding node in the doubly linked list. This helps with the insertion and deletion of a string in the list.
Ex -> Delete line 3: find the reference to the line-3 node in the map, fix the neighbouring references in the list, and remove the node. The process for updating and inserting a line is similar.
For undo and redo, as #John advised, you can use 2 different stacks: one reflecting the operations done (which supports undo), and one reflecting the undone operations (which supports redo).

Related

Moving values between lockfree lists

Background
I am trying to design and implement a lock-free hashmap using the chaining method in C++. Each hash table cell is supposed to contain a lock-free list. To enable resizing, my data structure is supposed to contain two arrays - a small one which is always available, and a bigger one for resizing, for when the smaller one is no longer sufficient. When the bigger one is created, I would like the data stored in the small one to be transferred to the bigger one element by element, whenever any thread does something with the data structure (adds, searches for, or removes an element). When all the data has been transferred, the bigger array is moved in place of the smaller one, and the latter is deleted. The cycle repeats whenever the array needs to be enlarged.
Problem
As mentioned before, each array is supposed to contain lists in its cells. I am trying to find a way to transfer a value or node from one lock-free list to another in such a manner that the value stays visible in at least one (or both) of the lists. This is needed to ensure that a search in the hash map won't give the user false negatives. So my questions are:
Is such lockfree list implementation possible?
If so, what would be the general concept of such list and "moving node/value" operation? I would be thankful for any pseudocode, C++ code or scientific article describing it.
To be able to resize the array, while maintaining the lock-free progress guarantees, you will need to use operation descriptors. Once the resize starts, add a descriptor that contains references to the old and the new arrays.
On any operation (add, search, or remove):
Add - search the old array; if the element already exists, move it to the new array before returning. Indicate, with a descriptor or a special null value, that the element has already been moved, so that other threads don't attempt the move again.
Search - search the old array and move the element as indicated above.
Remove - remove, too, will have to search the old array first.
Now the problem is that some thread has to verify that the move is complete, so that you can remove the descriptor and free up the old array. To maintain lock-freedom, all active threads need to attempt this validation, which makes it very expensive.
You can look at:
https://dl.acm.org/citation.cfm?id=2611495
https://dl.acm.org/citation.cfm?id=3210408
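The "already moved" indication described above can be illustrated in miniature: a single atomic slot per array rather than a full lock-free list, with a `MOVED` sentinel. The reduction to one slot and all names here are my own simplification, not a complete implementation; the point is only the ordering (install in the new array *before* marking the old slot) that keeps the value visible to a reader that checks old-then-new.

```cpp
#include <atomic>
#include <cassert>

// Illustration only: 0 = empty, MOVED = already migrated; anything else is
// a live value. A real implementation applies the same idea to list nodes.
constexpr int EMPTY = 0;
constexpr int MOVED = -1;

// Migrates the value in old_slot to new_slot. Returns the value now owned
// by new_slot. Safe for several "helper" threads to attempt concurrently:
// the CAS on the new slot lets only one installation win, and the CAS on
// the old slot marks it so nobody tries the move again.
int migrate(std::atomic<int>& old_slot, std::atomic<int>& new_slot) {
    int v = old_slot.load();
    if (v != EMPTY && v != MOVED) {
        int expected = EMPTY;
        // Another thread may have installed it first; that is fine.
        new_slot.compare_exchange_strong(expected, v);
        // Mark the old slot so no thread attempts the move again.
        old_slot.compare_exchange_strong(v, MOVED);
    }
    return new_slot.load();
}
```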

Least Recently Used (LRU) Cache

I know that I can use various container classes in the STL, but that's overkill and too expensive for this purpose.
We have over 1M users online, and per user we need to maintain 8 unrelated 32-bit data items. The goal is to
find if an item exists in the list,
if not, insert. Remove oldest entry if full.
A brute-force approach would be to maintain a last-write pointer and iterate (since there are only 8 items), but I am looking for input to better analyze and implement this.
Look forward to some interesting suggestions in terms of design pattern and algorithm.
Don Knuth gives several interesting and very efficient approximations in The Art of Computer Programming.
Self-organizing list I: when you find an entry, move it to the head of the list; delete from the end.
Self-organizing list II: when you find an entry, move it up one spot; delete from the end.
[Both the above in Vol. 3 §6.1(A).]
Another scheme maintains the list circularly with 1 extra bit per entry, which is set when you find that entry and cleared when you skip past it while looking for something else. You always start searching at the last place you stopped; if you don't find the entry, you replace the first entry with a clear bit, i.e. one that hasn't been used since one entire trip around the list.
[Vol. 1 §2.5(G).]
You want to use a combination of a hash table and a doubly linked list here.
Each item is accessible via the hash table, which holds the key you need plus a pointer to the element in the list.
Algorithm:
Given a new item x, do:
1. Add x to the head of the list, saving the pointer as ptr.
2. Add x to the hash table, where the data is stored, together with ptr.
3. If the list is bigger than allowed, remove the last element (from the tail of the list). Use that element's key to remove it from the hash table as well.
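A minimal sketch of this hash-table-plus-list design (the class and method names are invented for the example). `std::list` gives stable iterators, so the map can point directly into the list, and `splice` moves a node to the front in O(1):

```cpp
#include <cassert>
#include <list>
#include <unordered_map>

class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true and moves the key to the front if present.
    bool touch(int key) {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        order_.splice(order_.begin(), order_, it->second);  // move to head, O(1)
        return true;
    }

    // Inserts a key; evicts the least recently used entry when full.
    void insert(int key) {
        if (touch(key)) return;
        if (order_.size() == capacity_) {
            index_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(key);
        index_[key] = order_.begin();
    }

    bool contains(int key) const { return index_.count(key) != 0; }

private:
    std::size_t capacity_;
    std::list<int> order_;                                   // front = most recent
    std::unordered_map<int, std::list<int>::iterator> index_;
};
```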
If you want a C implementation of LRU cache try this link
The idea is that we use two data structures to implement an LRU Cache.
A queue, implemented using a doubly linked list. The maximum size of the queue equals the total number of frames available (the cache size). The most recently used pages are near the front end, and the least recently used pages are near the rear end.
A Hash with page number as key and address of the corresponding queue node as value.
When a page is referenced, the required page may be in the memory. If it is in the memory, we need to detach the node of the list and bring it to the front of the queue.
If the required page is not in the memory, we bring that in memory. In simple words, we add a new node to the front of the queue and update the corresponding node address in the hash. If the queue is full, i.e. all the frames are full, we remove a node from the rear of queue, and add the new node to the front of queue.
I personally would either go with the self-organising lists proposed by EJP or, as we only have eight elements, simply store them sequentially together with a timestamp.
When accessing an element, just update its timestamp; when replacing, replace the one with the oldest timestamp (one linear search). This is less efficient on replacements but more efficient on access (no need to move any elements around), and it might be the easiest to implement...
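A sketch of this timestamp variant, assuming the 32-bit values and the fixed 8 slots mentioned in the question (the names are made up). Lookup and replacement are both a single linear scan over the 8 entries:

```cpp
#include <cassert>
#include <cstdint>

struct TinyLru {
    static constexpr int N = 8;
    uint32_t value[N] = {};
    uint64_t stamp[N] = {};   // 0 = slot unused
    uint64_t clock = 0;       // monotonically increasing "timestamp"

    // Returns true if found (and refreshes the stamp); otherwise inserts,
    // replacing the entry with the oldest stamp when all slots are taken.
    bool access(uint32_t v) {
        ++clock;
        int oldest = 0;
        for (int i = 0; i < N; ++i) {
            if (stamp[i] != 0 && value[i] == v) { stamp[i] = clock; return true; }
            if (stamp[i] < stamp[oldest]) oldest = i;  // unused slots (stamp 0) win
        }
        value[oldest] = v;
        stamp[oldest] = clock;
        return false;
    }
};
```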
Modification of the self-organising lists, if based on some array data structure: sure, on update you have to shift several elements (variant I) or at least swap two of them (variant II) - but if you organise the data as a ring buffer, on replacement we just replace the last element with the new one and move the buffer's pointer to this new element:
a, b, c, d
^
Accessing a:
d, b, a, c
      ^
New element e:
d, e, a, c
   ^
Special case: accessing the oldest element (d in this case) - we then simply can move the pointer, too:
d, e, a, c
^
Just: with only 8 elements, it might not be worth the effort to implement all this...
I agree with Drop's and Geza's comments. The straightforward implementation will take one cache-line read and cause one cache-line write.
The only performance question left is the lookup and update of that 32-bit value within 256 bits. Assuming modern x86, the lookup itself can be two instructions: _mm256_cmp_epi32_mask finds all equal values in parallel, and _mm256_lzcnt_epi32 counts leading zeroes (= the number of older, non-matching items times 32). But even with older SIMD operations, the cache-line read/write will dominate the execution time. And that in turn is dominated by finding the right user, which in turn is dominated by the network I/O involved.
You could use a cuckoo filter, which is a probabilistic, hash-based data structure that supports fast set-membership testing.
Time complexity of the cuckoo filter:
Lookup: O(1)
Deletion: O(1)
Insertion: O(1)
For reference, here is how the cuckoo filter works.
Parameters of the filter:
1. Two hash functions: h1 and h2
2. An array B with n buckets; the i-th bucket will be called B[i]
Input: L, a list of elements to be inserted into the cuckoo filter.
Algorithm:
while L is not empty:
    let x be the first item in L; remove x from L
    if B[h1(x)] is empty:
        place x in B[h1(x)]
    else if B[h2(x)] is empty:
        place x in B[h2(x)]
    else:
        let y be the element in B[h2(x)]
        prepend y to L
        place x in B[h2(x)]
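As a rough sketch, the loop above maps to cuckoo hashing's insertion on plain integers. The hash functions, table size, and kick limit below are arbitrary demo choices, and a real cuckoo filter stores short fingerprints rather than the keys themselves, rebuilding with new hash functions when the kick limit is hit:

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

class CuckooTable {
public:
    explicit CuckooTable(std::size_t n) : buckets_(n) {}

    std::size_t h1(int x) const { return static_cast<std::size_t>(x) % buckets_.size(); }
    std::size_t h2(int x) const {
        std::size_t u = static_cast<std::size_t>(x);
        return (u / buckets_.size() + 7 * u) % buckets_.size();  // ad hoc second hash
    }

    bool contains(int x) const {
        return buckets_[h1(x)] == x || buckets_[h2(x)] == x;
    }

    // Returns false if the insertion cycles (the table needs a rebuild).
    bool insert(int x) {
        for (int kicks = 0; kicks < 64; ++kicks) {
            if (!buckets_[h1(x)]) { buckets_[h1(x)] = x; return true; }
            if (!buckets_[h2(x)]) { buckets_[h2(x)] = x; return true; }
            std::swap(x, *buckets_[h2(x)]);  // evict y, place x; loop reinserts y
        }
        return false;
    }

private:
    std::vector<std::optional<int>> buckets_;  // nullopt = empty bucket
};
```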
For LRU you can keep a timestamp per entry in the table, maintained with just a local counter variable.
This is the best approach for very large data sets to date.

C++ writing a list of names to a file in a specific order without loading them all in memory

I have a school task: load a list of names from one text file into another while ordering them, yet I am not allowed to keep them all in memory (in an array, for example) at the same time. What would be the best way to do this? I have to do a binary search on them afterwards.
My first thought was to generate a hash key for each of them and write them at a location relative to their key, but the fact that I have to do a binary search afterwards makes me think this is redundant.
The problem is that I don't know all the names beforehand (which means I have to somehow insert some names in the middle).
This is probably the easiest way:
1) Read the file line by line and find the name that comes first in your sort order,
e.g.
- read name_1;
- read the next name, name_2;
- if name_2 < name_1 then name_1 = name_2, and repeat.
2) Read the file line by line again and find the second name,
i.e. the lowest name that is still higher than the first name.
3) Write the first name into the output file.
4) Now read line by line again for the third name.
5) Append the second name to the output file.
etc...
This will not be fast, but it has virtually no memory overhead: you will never have more than three names in memory at once.
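The repeated-scan selection above can be sketched like this, written against `std::istream` so a string stream can stand in for the file in the demo (with an `std::ifstream` the memory use stays constant regardless of file size). The function names are mine; note that exact duplicates collapse with this scheme, since step 2 looks for a name *strictly* higher than the previous one:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// One full pass: returns the smallest name strictly greater than `prev`
// (an empty prev means "find the overall smallest"), or "" when done.
std::string next_name(std::istream& in, const std::string& prev) {
    in.clear();
    in.seekg(0);              // rewind for a fresh pass
    std::string best, name;
    while (std::getline(in, name)) {
        if (name > prev && (best.empty() || name < best)) best = name;
    }
    return best;
}

// Drives the whole sort; `out` here stands in for the output file, which is
// only ever appended to, so nothing sorted needs to stay in memory.
std::string sort_names(std::istream& in) {
    std::string out, prev;
    for (std::string n = next_name(in, prev); !n.empty(); n = next_name(in, prev)) {
        out += n + "\n";
        prev = n;
    }
    return out;
}
```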
Some ways:
1) You could split the data into multiple temporary files; sort each file separately; merge the files.
2) Call the operating system to sort the file, something like
system ("sort input>output")
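The merge step of option 1 might look like the sketch below, holding only one line from each already-sorted run in memory (a full external sort would first split the input into runs small enough to sort in RAM, then merge them pairwise or k at a time). String streams stand in for the temporary files:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Merges two sorted runs into one sorted output, one line per run in memory.
std::string merge_runs(std::istream& a, std::istream& b) {
    std::string out, la, lb;
    bool has_a = static_cast<bool>(std::getline(a, la));
    bool has_b = static_cast<bool>(std::getline(b, lb));
    while (has_a || has_b) {
        if (has_a && (!has_b || la <= lb)) {   // <= keeps the merge stable
            out += la + "\n";
            has_a = static_cast<bool>(std::getline(a, la));
        } else {
            out += lb + "\n";
            has_b = static_cast<bool>(std::getline(b, lb));
        }
    }
    return out;
}
```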
Ok, I don't know if I used the term 'lexical tree' correctly in my comment, but I would build a tree like a binary tree, except that instead of only two possible children per node, there is one per letter of the alphabet. I believe this is called a 'trie'.
In each node you keep a counter of how many entries ended at that particular node. You create nodes dynamically as you need them, so space consumption stays low.
Then you can traverse the whole tree and retrieve all elements in order. That is a non-trivial sort, but it works very well for entries with common prefixes. It is fast, as all inserts are linear and traversal is also linear, so it takes O(2*N) = O(N), where N is the number of characters in the whole set to sort. Memory consumption is also good when the data set has common prefixes.
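A sketch of that trie, using `std::map` so that children stay in alphabetical order and a depth-first walk emits the names sorted, duplicates included (the type and function names are made up for the example):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    int ends = 0;   // how many entries end at this node
    std::map<char, std::unique_ptr<TrieNode>> child;  // ordered by letter
};

void insert(TrieNode& root, const std::string& name) {
    TrieNode* n = &root;
    for (char c : name) {
        auto& slot = n->child[c];
        if (!slot) slot = std::make_unique<TrieNode>();  // create on demand
        n = slot.get();
    }
    ++n->ends;
}

// Depth-first walk: children are visited in alphabetical order.
void collect(const TrieNode& n, std::string& prefix, std::string& out) {
    for (int i = 0; i < n.ends; ++i) out += prefix + "\n";
    for (const auto& [c, sub] : n.child) {
        prefix.push_back(c);
        collect(*sub, prefix, out);
        prefix.pop_back();
    }
}

std::string sorted_names(const TrieNode& root) {
    std::string prefix, out;
    collect(root, prefix, out);
    return out;
}
```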

C++ - Map-like data structure with structural sharing/immutability

Functional programming languages often work on immutable data structures but stay efficient through structural sharing. E.g. you work on some map of information; if you insert an element, you do not modify the existing map but create a new, updated version. To avoid massive copying and memory usage, the map shares (as far as possible) the unchanged data between both instances.
I would be interested to know whether there exists a template library providing such a map-like data structure for C++. I searched a bit and found nothing, besides internal classes in LLVM.
A copy-on-write B+tree sounds like what you're looking for. It basically creates a new snapshot of itself every time it gets modified, but it shares unmodified leaf nodes between versions. Most of the implementations I've seen are baked into append-only database log files; CouchDB has a very nice write-up on them. They are, however, "relatively easy" to implement, as far as map data structures go.
You can use an ordinary map, but marking every element with a timestamp or "map version number". If you want to remove elements too, use two marks. If you might reinsert removed elements, then you need a list of values and pairs of marks per element.
For example, you search for the key "foo", and you find that it had the value 5 in versions 0 to 3 (included), then it was "removed", and then it had the value -8 in versions 9 to current.
This eats a lot of memory and time, though.
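The version-marking idea might be sketched like this: "fat node" persistence rather than true structural sharing, where each key keeps a history of (version, value) entries and a nullopt records a removal, matching the "foo" example above (all names are invented):

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

class VersionedMap {
public:
    // Each mutation bumps the global version and returns it.
    int set(const std::string& k, int v) { hist_[k].push_back({++version_, v}); return version_; }
    int erase(const std::string& k)      { hist_[k].push_back({++version_, std::nullopt}); return version_; }

    // Lookup as of version `at`: the latest entry at or before it wins.
    std::optional<int> get(const std::string& k, int at) const {
        auto it = hist_.find(k);
        if (it == hist_.end()) return std::nullopt;
        std::optional<int> result;
        for (const auto& [ver, val] : it->second) {
            if (ver > at) break;
            result = val;   // later in-range entries override earlier ones
        }
        return result;
    }

private:
    int version_ = 0;
    std::map<std::string, std::vector<std::pair<int, std::optional<int>>>> hist_;
};
```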

How to store a tree on the disk and make add/delete/swap operations easy

All right, this question requires a bit of reading on your side. I'll try to keep this short and simple.
I have a tree (not a binary tree, just a tree) with data associated with each node (binary data; I don't know what it is AND I don't know how long it is).
Each node of the tree also has an index which isn't related to its position in the tree. In short, the index number represents the order in which the user WANTS the tree to be navigated, and it cannot be duplicated.
I need to store this structure in a file on the disk.
My problem is: how to design a flexible disk storing format that can make loading and working on the tree as easy as possible.
In fact the user should be allowed to
Create a child block for an element (this should be easy enough; it's sufficient to append data to the file while avoiding duplicate indices)
Delete a child (I should prompt the user: "do you want to delete all this node's children as well, or should I add its children to its parent?"). The tricky part is that deleting a node can also free up an index, and I can't let the user reuse that index when adding another node (or the order he set could be messed up); I need to update the entire tree!
Swap an index with another one
I'm using C++ and Qt, and so far I have thought of structures with many fields, like this one:
struct dataToBeStoredInTheFile
{
    long data_size;
    byte *data;             // the node's payload
    int index;              // the user-visible index
    int number_of_children;
    int *children_indices;  // array of the children's indices
};
This has the advantage of identifying each node by its index, but it is very slow when swapping indices between two nodes, or when deleting a node and updating every other node's index, because you have to traverse all the nodes and all their children_indices arrays.
Would using something like a hash to identify each node be more flexible?
Should I use two indices, one for the position in the tree and one for the user's index? If you have any better idea for storing the data, you're welcome to share it.
I would suggest using something like boost.serialization; then you don't have to worry about the actual format when saving to disk, and can concentrate on an effective in-memory solution.
Edit: Re-reading your question, I see you are using Qt; in that case it has its own serialization framework that you can use.
If it doesn't have to be a SINGLE file, you could use the file/directory structure to represent your tree, where each node corresponds to a single file (w/ a directory for each interior node). Maybe not the most efficient, but incredibly easy to do.
Again, if you have some flexibility on the number of files (but not as much as above), you could have one file for the tree structure (so that each node is a fixed size, simplifying its manipulation) and a separate one for storing node contents. To speed up working with the "content file", you could treat it the way a garbage collecting system would: just keep adding new/updated nodes on the end, marking old nodes as no longer in use, and periodically clearing things out.
Better yet, follow #JoachimPileborg's advice :)
I don't think you should use the user-specified index to identify the nodes, as that's not directly related to the way you're storing the tree, and you don't have an efficient way of accessing the nodes by index. You should either keep two indices for each node - the user-specified one, and another one that's implementation dependent; or maintain an array mapping the user-specified index to one you're using for the implementation.
Also, it might be better if you use a different structure to store the tree. For each node, store the following:
the index of the parent
the index of the leftmost son
the index of the left brother
the index of the right brother
This way, adding a node and swapping two nodes can be done with some simple pointer manipulations (I don't mean explicit pointers - the indices act somewhat like pointers anyway). Deleting a node would still probably be slow, as you have to visit all of its children.
As a bonus, if you use this structure, every node has a fixed size (unlike with the linked list you're proposing). This means that you can access a node directly by seeking in the file.
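The fixed-size layout above might be sketched like this, with a vector standing in for the file (the field and function names are my own; -1 plays the role of a null index). Because every node is the same size, node k would live at a fixed file offset and could be seeked to directly:

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Node {
    int parent = -1;
    int first_child = -1;   // the leftmost son
    int left_sibling = -1;  // the left brother
    int right_sibling = -1; // the right brother
    int user_index = 0;     // the user's ordering, separate from storage
};

// Appends a new node as the last child of `parent` (or as a root when
// parent < 0) and returns its storage index.
int add_node(std::vector<Node>& t, int parent, int user_index) {
    int id = static_cast<int>(t.size());
    t.push_back({parent, -1, -1, -1, user_index});
    if (parent >= 0) {
        int c = t[parent].first_child;
        if (c < 0) {
            t[parent].first_child = id;
        } else {
            while (t[c].right_sibling >= 0) c = t[c].right_sibling;
            t[c].right_sibling = id;
            t[id].left_sibling = c;
        }
    }
    return id;
}

// Swapping two user indices never touches the link fields at all.
void swap_user_index(std::vector<Node>& t, int a, int b) {
    std::swap(t[a].user_index, t[b].user_index);
}
```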
You should also maintain the smallest index the user can use for new nodes - so, for example, even if the largest index was 5 and it was deleted, you still keep 6 as the next free index, so that 5 cannot be reused.