Insertion into a skip list - c++

A skip list is a data structure in which the elements are stored in sorted order and each node of the list may contain more than 1 pointer, and is used to reduce the time required for a search operation from O(n) in a singly linked list to O(lg n) for the average case. It looks like this:
Reference: "Skip list" by Wojciech Muła - Own work. Licensed under Public domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Skip_list.svg#mediaviewer/File:Skip_list.svg
It can be seen as an analogy to a ruler:
In a skip list, searching an element and deleting one is fine, but when it comes to insertion, it becomes difficult, because according to Data Structures and Algorithms in C++: 2nd edition by Adam Drozdek:
To insert a new element, all nodes following the node just inserted have to be restructured; the number of pointers and the value of pointers have to be changed.
I can construe from this that although choosing a node with random number of pointers based on the likelihood of nodes to insert a new element, doesn't create a perfect skip list, it gets really close to it when large number of elements (~9 million for example) are considered.
My question is: Why can't we insert the new element in a new node, determine its number of pointers based on the previous node, attach it to the end of the list, and then use efficient sorting algorithms to sort just the data present in the nodes, thereby maintaining the perfect structure of the skip list and also achieving the O(lg n) insert complexity?
Edit: I haven't tried any code yet, I'm just presenting a view. Simply because implementing a skip list is somewhat difficult. Sorry.

There is no need to modify any following nodes when you insert a node. See the original paper, Skip Lists: A Probabilistic Alternative to Balanced Trees, for details.
I've implemented a skip list from that reference, and I can assure you that my insertion and deletion routines do not modify any nodes forward of the insertion point.
I haven't read the book you're referring to, but out of context the passage you highlight is just flat wrong.

You have a problem on this point and then use efficient sorting algorithms to sort just the data present in the nodes. Sorting the data will have complexity O(n*lg(n)) and thus it will increase the complexity of insertion. In theory you can choose "perfect" number of links for each node being inserted, but even if you do that, when you perform remove operations, the perfectness will be "broken". Using the randomized approach is close enough to perfect structure to perform well.

You need to have function / method that search for location.
It need to do following:
if you insert unique keys, it need to locate the node. then you keep everything, just change the data (baggage). e.g. node->data = data.
if you allow duplicates, or if key is not found, then this function / method need to give you previous node on each height (lane). Then you determine height of new node and insert it after the found nodes.
Here is my C realisation:
https://github.com/nmmmnu/HM2/blob/master/hm_skiplist.c
You need to check following function:
static const hm_skiplist_node_t *_hm_skiplist_locate(const hm_skiplist_t *l, const char *key, int complete_evaluation);
it stores the position inside hm_skiplist_t struct.
complete_evaluation is used to save time in case you need the data and you are not intend to insert / delete.

Related

Is there any way to move the cursor position of a linked list in constant time?

I have a linked list like this:
Head->A->B->C->D->Tail.
There can be N (1<N<10^5) items in the list.
The current cursor position is, cursor->B which is 2 if we think like an array.
I have to perform the following operation on my list:
insert x characters in the list at the cursor position and update the
cursor.
delete y (y < N) characters starting from the
cursor position and update the cursor.
move the cursor to a specific position within in the list.
I want all this operation in constant time.
Can anyone kindly help by suggesting any data structure model?
There isn't. Searching / iterating is linear in complexity - O(n). If you want a constant complexity, you need to use the different data structure. Since you are using C++, you should utilize one from the Containers library.
If the data can be sorted then by using "skip lists" a speed up can be achieved.
The principle is that extra pointers are used to skip ahead.
skip list is a data structure that allows fast search within an ordered sequence of elements. Fast search is made possible by maintaining a linked hierarchy of subsequences, with each successive subsequence skipping over fewer elements than the previous one ...
wikipedia
Therefore, with O(√n) extra space, we are able to reduce the time complexity to O(√n).
Skip-list
Of course it is not possible to use the linked list for that. As said before a linked list has a linear complexity.
You can try to use a more complex data structure like a hash as a lookup-container for the items in your list, which has a complexity of - O(n). Instead of storing the items itself the stored item can contain a pointer / index showing to the next item. But you have to keep in mind that the deletion will be still expensive because when removing one item you have to refresh the links showing to this item as well, So the item itself will need to know, if any other items are pointing to it.

Rank-Preserving Data Structure other than std:: vector?

I am faced with an application where I have to design a container that has random access (or at least better than O(n)) has inexpensive (O(1)) insert and removal, and stores the data according to the order (rank) specified at insertion.
For example if I have the following array:
[2, 9, 10, 3, 4, 6]
I can call the remove on index 2 to remove 10 and I can also call the insert on index 1 by inserting 13.
After those two operations I would have:
[2, 13, 9, 3, 4, 6]
The numbers are stored in a sequence and insert/remove operations require an index parameter to specify where the number should be inserted or which number should be removed.
My question is, what kind of data structures, besides a Linked List and a vector, could maintain something like this? I am leaning towards a Heap that prioritizes on the next available index. But I have been seeing something about a Fusion Tree being useful (but more in a theoretical sense).
What kind of Data structures would give me the most optimal running time while still keeping memory consumption down? I have been playing around with an insertion order preserving hash table, but it has been unsuccessful so far.
The reason I am tossing out using a std:: vector straight up is because I must construct something that out preforms a vector in terms of these basic operations. The size of the container has the potential to grow to hundreds of thousands of elements, so committing to shifts in a std::vector is out of the question. The same problem lines with a Linked List (even if doubly Linked), traversing it to a given index would take in the worst case O (n/2), which is rounded to O (n).
I was thinking of a doubling linked list that contained a Head, Tail, and Middle pointer, but I felt that it wouldn't be much better.
In a basic usage, to be able to insert and delete at arbitrary position, you can use linked lists. They allow for O(1) insert/remove, but only provided that you have already located the position in the list where to insert. You can insert "after a given element" (that is, given a pointer to an element), but you can not as efficiently insert "at given index".
To be able to insert and remove an element given its index, you will need a more advanced data structure. There exist at least two such structures that I am aware of.
One is a rope structure, which is available in some C++ extensions (SGI STL, or in GCC via #include <ext/rope>). It allows for O(log N) insert/remove at arbitrary position.
Another structure allowing for O(log N) insert/remove is a implicit treap (aka implicit cartesian tree), you can find some information at http://codeforces.com/blog/entry/3767, Treap with implicit keys or https://codereview.stackexchange.com/questions/70456/treap-with-implicit-keys.
Implicit treap can also be modified to allow to find minimal value in it (and also to support much more operations). Not sure whether rope can handle this.
UPD: In fact, I guess that you can adapt any O(log N) binary search tree (such as AVL or red-black tree) for your request by converting it to "implicit key" scheme. A general outline is as follows.
Imagine a binary search tree which, at each given moment, stores the consequitive numbers 1, 2, ..., N as its keys (N being the number of nodes in the tree). Every time we change the tree (insert or remove the node) we recalculate all the stored keys so that they are still from 1 to the new value of N. This will allow insert/remove at arbitrary position, as the key is now the position, but it will require too much time for all keys update.
To avoid this, we will not store keys in the tree explicitly. Instead, for each node, we will store the number of nodes in its subtree. As a result, any time we go from the tree root down, we can keep track of the index (position) of current node — we just need to sum the sizes of subtrees that we have to our left. This allows us, given k, locate the node that has index k (that is, which is the k-th in the standard order of binary search tree), on O(log N) time. After this, we can perform insert or delete at this position using standard binary tree procedure; we will just need to update the subtree sizes of all the nodes changed during the update, but this is easily done in O(1) time per each node changed, so the total insert or remove time will be O(log N) as in original binary search tree.
So this approach allows to insert/remove/access nodes at given position in O(log N) time using any O(log N) binary search tree as a basis. You can of course store the additional information ("values") you need in the nodes, and you can even be able to calculate the minimum of these values in the tree just by keeping the minimum value of each node's subtree.
However, the aforementioned treap and rope are more advanced as they allow also for split and merge operations (taking a substring/subarray and concatenating two strings/arrays).
Consider a skip list, which can implement linear time rank operations in its "indexable" variation.
For algorithms (pseudocode), see A Skip List Cookbook, by Pugh.
It may be that the "implicit key" binary search tree method outlined by #Petr above is easier to get to, and may even perform better.

How to improve linked list searching. C++

I have simple method in C++ which searchs for string in linked list. That works well but I need to make it faster. Is it possible? Maybe I need to insert items into list in alphabetical order? But I dont think it could help in serching list anymore. In list there is about 300 000 items (words).
int GetItemPosition(const char* stringToFind)
{
int i = 0;
MyList* Tmp = FistListItem;
while (Tmp){
if (!strcmp(Tmp->Value, stringToFind))
{
return i;
}
Tmp = Tmp->NextItem;
i++;
}
return -1;
}
Method returns the position number if item found, otherwise returns -1.
Any sugesstion will be helpfull.
Thanks for answers, I can change structure. I have only one constraint. Code must implement the following interface:
int Count(void);
int AddItem(const char* StringValue, int WordOccurrence);
int GetItemPosition(const char* StringValue);
char* GetString(int Index);
int GetOccurrenceNum(int Index);
void SetInteger(int Index, int WordOccurrence);
So which structure will be the in your opinion the most suitable?
Searching a linked list is linear so you need to iterate from beginning one by one so it is O(n). Linked lists are not the best if you will use it for searching, you can utilize more suitable data structures such as binary trees.
Ordering elements does not help much because still you need to iterate each element anyway.
Wikipedia article says:
In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again.
Another common approach is to "index" a linked list using a more
efficient external data structure. For example, one can build a
red-black tree or hash table whose elements are references to the
linked list nodes. Multiple such indexes can be built on a single
list. The disadvantage is that these indexes may need to be updated
each time a node is added or removed (or at least, before that index
is used again).
So in the first case you can slightly improve (by statistical assumptions) your search performance by moving items found previously closer to the beginning of the list. This assumes that previously found elements will be searched more frequently.
Second method requires to use other data structures.
If using linked lists is not a hard requirement, consider using hash tables, sorted arrays (random access) or balanced trees.
Consider using array or std::vector as a storage instead of linked list, and use binary search to find particular string, or even better, std::set, if you don't need a numerical index. If for some reasons it is not possible to use other containers, there is not much possible to do - you may want to speed up the process of comparison by storing hash of the string along with it in node.
I suggest hashing.
Since you've already got a linked list of your own), you can try chaining with linked lists for collision resolution.
Rather than using a linear linked list, you may want to use a binary search tree, or a red/black tree. These trees are designed on minimizing the traversals to find an item.
You could also store "short cut links". For example, if the list is of strings, you could have an array of links of where to start searching based on the first letter.
For example, shortcut['B'] would return a pointer to the first link to start searching for strings starting with 'B'.
The answer is no, you cannot improve the search without changing your data-structure.
As it stands, sorting the list will not give you a faster search for any random item.
It will only allow you to quickly decide if the given item is in the list by testing against the first item (which will be either the smallest or the largest entry) and this improvement is not likely to make a big difference.
So can you please edit your question and explain to us your constraints?
Can you use a completely different data structure, like an array or a tree? (as others have suggested)
If not, can you modify the way your linked list is linked?
If not, we will be unlikely to help you...
The best option is to use faster data structure for storing strings:
std::map - red-black tree behind the scenes. Has O(logn) for search/insert/delete operations. Suitable if you want to store additional values with strings (for example - positions).
std::set - basically the same tree but without values. Best for case when you need only contains operation.
std::unordered_map - hash table. O(1) access.
std::unordered_set - hash set. Also O(1) access.
Note. But in all of these cases there is a catch. Complexity is calculated only based on n (count of strings). In reality string comparison is not free. So, O(1) becomes O(m), O(logn) becomes O(mlogn) (where m is maximal length of string). This does not matter in case of relatively short strings. But if this is not true consider using Trie. In practice trie can be even faster than hash table - each character of query string is accessed only once instead of multiple times. For hash table/set it's at least once for hash calculation and at least once for actual string comparison (depending on collision resolution strategy - not sure how it is implemented in C++).

Is there any data structure in C++ STL for performing insertion, searching and retrieval of kth element in log(n)?

I need a data structure in c++ STL for performing insertion, searching and retrieval of kth element in log(n)
(Note: k is a variable and not a constant)
I have a class like
class myClass{
int id;
//other variables
};
and my comparator is just based on this id and no two elements will have the same id.
Is there a way to do this using STL or I've to write log(n) functions manually to maintain the array in sorted order at any point of time?
Afaik, there is no such datastructure. Of course, std::set is close to this, but not quite. It is a red black tree. If each node of this red black tree was annotated with the tree weight (the number of nodes in the subtree rooted at this node), then a retrieve(k) query would be possible. As there is no such weight annotation (as it takes valuable memory and makes insert/delete more complex as weights have to be updated), it is impossible to answer such a query efficently with any search tree.
If you want to build such a datastructure, use a conventional search tree implementation (red-black,AVL,B-Tree,...) and add a weight field to each node that counts the number of entries in its subtree. Then searching for the k-th entry is quite simple:
Sketch:
Check the weight of the child nodes, and find the child c which has the largest weight (accumulated from left) that is not greater than k
Subtract from k all weights of children that are left of c.
Descend down to c and call this procedure recursively.
In case of a binary search tree, the algorithm is quite simple since each node only has two children. For a B-tree (which is likely more efficient), you have to account as many children as the node contains.
Of course, you must update the weight on insert/delete: Go up the tree from the insert/delete position and increment/decrement the weight of each node up to the root. Also, you must exchange the weights of nodes when you do rotations (or splits/merges in the B-tree case).
Another idea would be a skip-list where the skips are annotated with the number of elements they skip. But this implementation is not trivial, since you have to update the skip length of each skip above an element that is inserted or deleted, so adjusting a binary search tree is less hassle IMHO.
Edit: I found a C implementation of a 2-3-4 tree (B-tree), check out the links at the bottom of this page: http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html
You can not achieve what you want with simple array or any other of the built-in containers. You can use a more advanced data structure for instance a skip list or a modified red-black tree(the backing datastructure of std::set).
You can get the k-th element of an arbitrary array in linear time and if the array is sorted you can do that in constant time, but still the insert will require shifting all the subsequent elements which is linear in the worst case.
As for std::set you will need additional data to be stored at each node to be able to get the k-th element efficiently and unfortunately you can not modify the node structure.

Fast bucket implementation

In a graph class I need to handle nodes with integer values (1-1000 mostly). In every step I want to remove a node and all its neighbors from the graph. Also I want to always begin with the node of the minimal value. I thought long about how to do this in the fastest possible manner and decided to do the following:
The graph is stored using adjancency lists
There is a huge array std::vector<Node*> bucket[1000] to store the nodes by its value
The index of the lowest nonempty bucket is always stored and kept track off
I can find the node of minimal value very fast by picking a random element of that index or if the bucket is already empty increase the index
Removing the selected node from the bucket can clearly done in O(1), the problem is that for removing the neighbors I need to search the bucket bucket[value of neighbor] first for all neighbor nodes, which is not really fast.
Is there a more efficient approach to this?
I thought of using something like std::list<Node*> bucket[1000], and assign every node a pointer to its "list element", such that I can remove the node from the list in O(1). Is this possible with stl lists, clearly it can be done with a normal double linked list that I could implement by hand?
I recently did something similar to this for a priority queue implementation using buckets.
What I did was use a hash tables (unordered_map), that way, you don't need to store 1000 empty vectors and you still get O(1) random access (general case, not guaranteed). Now, if you only need to store/create this graph class one time, it probably doesn't matter. In my case I needed to create the priority queue tens/hundreds of time per second and using the hash map made a huge difference (due to the fact that I only created unordered_sets when I actually had an element of that priority, so no need to initialize 1000 empty hash sets). Hash sets and maps are new in C++11, but have been available in std::tr1 for a while now, or you could use the Boost libraries.
The only difference that I can see between your & my usecase, is that you also need to be able to remove neighboring nodes. I'm assuming every node contains a list of pointers to it's neighbors. If so, deletion of the neighbors should take k * O(1) with k the number of neighbors (again, O(1) in general, not guaranteed, worst case is O(n) in an unordered_map/set). You just go over every neighboring node, get its priority, that gives you the correct index into the hash map. Then you find the pointer in the hash set which the priority maps to, this search in general will be O(1) and removing the element is again O(1) in general.
All in all, I think you got a pretty good idea of what to do, but I believe that using hash maps/sets will speed up your code by quite a lot (depends on the exact usage of course). For me, the speed improvement of an implementation with unordered_map<int, unordered_set> versus vector<set> was around 50x.
Here's what I would do. Node structure:
struct Node {
std::vector<Node*>::const_iterator first_neighbor;
std::vector<Node*>::const_iterator last_neighbor;
int value;
bool deleted;
};
Concatenate the adjacency lists and put them in a single std::vector<Node*> to lower the overhead of memory management. I'm using soft deletes so update speed is not important.
Sort pointers to the nodes by value into another std::vector<Node*> with a counting sort. Mark all nodes as not deleted.
Iterate through the nodes in sorted order. If the node under consideration has been deleted, go to the next one. Otherwise, mark it deleted and iterate through its neighbors and mark them deleted.
If your nodes are stored contiguously in memory, then you can omit last_neighbor at the cost of an extra sentinel node at the end of the structure, because last_neighbor of a node is first_neighbor of the succeeding node.