How sets, multisets, maps and multimaps work internally

How sets, multisets, maps and multimaps work internally - c++

How do multisets work? If a set can't have a value mapped to a key, does it only hold keys?
Also, how do associative containers work? I mean vector and deque in the memory is located sequentially it means that deleting/removing (except beginning [deque] and end [vector, deque]) are slow if they are large.
And list is a set of pointers which are not sequentially located in the memory which causes longer search but faster delete/remove.
How are sets, maps, multisets and multimaps stored and how do they work?

These 4 containers are typically all implemented using "nodes". A node is an object that stores one element. In the [multi]set case, the element is just the value; in the [multi]map case each node stores one key and its associated value. A node also stores multiple pointers to other nodes. Unlike a list, the nodes in sets and maps form a tree. You'd typically arrange it such that branches on the "left" of a certain node have values less than that node, while branches on the "right" of a certain node have values higher than that node.
Operations like finding a map key/set value are now quite fast. Start at the root node of the tree. If that matches, you're done. If the root is larger, search in the left branch. If the root is smaller than the value you're looking for, follow the pointer to the right branch. Repeat until you find a value or an empty branch.
Inserting an element is done by creating a new node, finding the location in the tree where it should be placed, and then inserting the node there by adjusting the pointers around it. Finally, there is a "rebalancing" operation to prevent your tree from ending up all out of balance. Ideally each right and left branch is about the same size. Rebalancing works by shifting some nodes from the left to the right or vice versa. E.g. if you have values {1 2 3} and your root node would be 1, you'd have 2 and 3 on the left branch and an empty right branch:
1
\
2
\
3
This is rebalanced by picking 2 as the new root node:
2
/ \
1 3
The STL containers use a smarter, faster rebalancing technique but that level of detail should not matter. It's not even specified in the standard which better technique should be used so implementations can differ.

There can be any implementation, as long as they match the standard specifications for those containers.
AFAIK, the associative containers are implemented as binary trees (red-black). More details... depending on the implementation.

All associate container classes(map,multi-map,set,multi-set)are implemented with Red and Black(R-B Tree) tree. So the R-B tree implementation could be similar to this:-
struct Rb_node {
int value;
struct node *left, *right;
int color;
int size;
};

Related

Should B-Tree nodes contain a pointer to their parent (C++ implementation)?

I am trying to implement a B-tree and from what I understand this is how you split a node:
Attempt to insert a new value V at a leaf node N
If the leaf node has no space, create a new node and pick a middle value of N and anything right of it move to the new node and anything to the left of the middle value leave in the old node, but move it left to free up the right indices and insert V in the appropriate of the now two nodes
Insert the middle value we picked into the parent node of N and also add the newly created node to the list of children of the parent of N (thus making N and the new node siblings)
If the parent of N has no free space, perform the same operation and along with the values also split the children between the two split nodes (so this last part applies only to non-leaf nodes)
Continue trying to insert the previous split's middle point into the parent until you reach the root and potentially split the root itself, making a new root
This brings me to the question - how do I traverse upwards, am I supposed to keep a pointer of the parent?
Because I can only know if I have to split the leaf node when I have reached it for insertion. So once I have to split it, I have to somehow go back to its parent and if I have to split the parent as well, I have to keep going back up.
Otherwise I would have to re-traverse the tree again and again each time to find the next parent.
Here is an example of my node class:
template<typename KEY, typename VALUE, int DEGREE>
struct BNode
{
KEY Keys[DEGREE];
VALUE Values[DEGREE];
BNode<KEY, VALUE, DEGREE>* Children[DEGREE + 1];
BNode<KEY, VALUE, DEGREE>* Parent;
bool IsLeaf;
};
(Maybe I should not have an IsLeaf field and instead just check if it has any children, to save space)

Even if you don't use recursion or an explicit stack while going down the tree, you can still do it without parent pointers if you split nodes a bit sooner with a slightly modified algorithm, which has this key characteristic:
When encountering a node that is full, split it, even when it is not a leaf.
With this pre-emptive splitting algorithm, you only need to keep a reference to the immediate parent (not any other ancestor) to make the split possible, since now it is guaranteed that a split will not lead to another, cascading split more upwards in the tree. This algorithm requires that the maximum degree (number of children) of the B-tree is even (as otherwise one of the two split nodes would have too few keys to be considered valid).
See also Wikipedia which describes this alternative algorithm as follows:
An alternative algorithm supports a single pass down the tree from the root to the node where the insertion will take place, splitting any full nodes encountered on the way preemptively. This prevents the need to recall the parent nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this algorithm, we must be able to send one element to the parent and split the remaining 𝑈−2 elements into two legal nodes, without adding a new element. This requires 𝑈 = 2𝐿 rather than 𝑈 = 2𝐿−1, which accounts for why some textbooks impose this requirement in defining B-trees.
The same article defines 𝑈 and 𝐿:
Every internal node contains a maximum of 𝑈 children and a minimum of 𝐿 children.
For a comparison with the standard insertion algorithm, see also Will a B-tree with preemptive splitting always have the same height for any input order of keys?

You don't need parent pointers if all your operations start at the root.
I usually code the insert recursively, such that calling node.insert(key) either returns null or a new key to insert at its parent's level. The insert starts with root.insert(key), which finds the appropriate child and calls child.insert(key).
When a leaf node is reached the insert is performed, and non-null is returned if the leaf splits. The parent would then insert the new internal key and return non-null if it splits, etc. If root.insert(key) returns non-null, then it's time to make a new root

Rank-Preserving Data Structure other than std:: vector?

I am faced with an application where I have to design a container that has random access (or at least better than O(n)) has inexpensive (O(1)) insert and removal, and stores the data according to the order (rank) specified at insertion.
For example if I have the following array:
[2, 9, 10, 3, 4, 6]
I can call the remove on index 2 to remove 10 and I can also call the insert on index 1 by inserting 13.
After those two operations I would have:
[2, 13, 9, 3, 4, 6]
The numbers are stored in a sequence and insert/remove operations require an index parameter to specify where the number should be inserted or which number should be removed.
My question is, what kind of data structures, besides a Linked List and a vector, could maintain something like this? I am leaning towards a Heap that prioritizes on the next available index. But I have been seeing something about a Fusion Tree being useful (but more in a theoretical sense).
What kind of Data structures would give me the most optimal running time while still keeping memory consumption down? I have been playing around with an insertion order preserving hash table, but it has been unsuccessful so far.
The reason I am tossing out using a std:: vector straight up is because I must construct something that out preforms a vector in terms of these basic operations. The size of the container has the potential to grow to hundreds of thousands of elements, so committing to shifts in a std::vector is out of the question. The same problem lines with a Linked List (even if doubly Linked), traversing it to a given index would take in the worst case O (n/2), which is rounded to O (n).
I was thinking of a doubling linked list that contained a Head, Tail, and Middle pointer, but I felt that it wouldn't be much better.

In a basic usage, to be able to insert and delete at arbitrary position, you can use linked lists. They allow for O(1) insert/remove, but only provided that you have already located the position in the list where to insert. You can insert "after a given element" (that is, given a pointer to an element), but you can not as efficiently insert "at given index".
To be able to insert and remove an element given its index, you will need a more advanced data structure. There exist at least two such structures that I am aware of.
One is a rope structure, which is available in some C++ extensions (SGI STL, or in GCC via #include <ext/rope>). It allows for O(log N) insert/remove at arbitrary position.
Another structure allowing for O(log N) insert/remove is a implicit treap (aka implicit cartesian tree), you can find some information at http://codeforces.com/blog/entry/3767, Treap with implicit keys or https://codereview.stackexchange.com/questions/70456/treap-with-implicit-keys.
Implicit treap can also be modified to allow to find minimal value in it (and also to support much more operations). Not sure whether rope can handle this.
UPD: In fact, I guess that you can adapt any O(log N) binary search tree (such as AVL or red-black tree) for your request by converting it to "implicit key" scheme. A general outline is as follows.
Imagine a binary search tree which, at each given moment, stores the consequitive numbers 1, 2, ..., N as its keys (N being the number of nodes in the tree). Every time we change the tree (insert or remove the node) we recalculate all the stored keys so that they are still from 1 to the new value of N. This will allow insert/remove at arbitrary position, as the key is now the position, but it will require too much time for all keys update.
To avoid this, we will not store keys in the tree explicitly. Instead, for each node, we will store the number of nodes in its subtree. As a result, any time we go from the tree root down, we can keep track of the index (position) of current node — we just need to sum the sizes of subtrees that we have to our left. This allows us, given k, locate the node that has index k (that is, which is the k-th in the standard order of binary search tree), on O(log N) time. After this, we can perform insert or delete at this position using standard binary tree procedure; we will just need to update the subtree sizes of all the nodes changed during the update, but this is easily done in O(1) time per each node changed, so the total insert or remove time will be O(log N) as in original binary search tree.
So this approach allows to insert/remove/access nodes at given position in O(log N) time using any O(log N) binary search tree as a basis. You can of course store the additional information ("values") you need in the nodes, and you can even be able to calculate the minimum of these values in the tree just by keeping the minimum value of each node's subtree.
However, the aforementioned treap and rope are more advanced as they allow also for split and merge operations (taking a substring/subarray and concatenating two strings/arrays).

Consider a skip list, which can implement linear time rank operations in its "indexable" variation.
For algorithms (pseudocode), see A Skip List Cookbook, by Pugh.
It may be that the "implicit key" binary search tree method outlined by #Petr above is easier to get to, and may even perform better.

Is there any data structure in C++ STL for performing insertion, searching and retrieval of kth element in log(n)?

I need a data structure in c++ STL for performing insertion, searching and retrieval of kth element in log(n)
(Note: k is a variable and not a constant)
I have a class like
class myClass{
int id;
//other variables
};
and my comparator is just based on this id and no two elements will have the same id.
Is there a way to do this using STL or I've to write log(n) functions manually to maintain the array in sorted order at any point of time?

Afaik, there is no such datastructure. Of course, std::set is close to this, but not quite. It is a red black tree. If each node of this red black tree was annotated with the tree weight (the number of nodes in the subtree rooted at this node), then a retrieve(k) query would be possible. As there is no such weight annotation (as it takes valuable memory and makes insert/delete more complex as weights have to be updated), it is impossible to answer such a query efficently with any search tree.
If you want to build such a datastructure, use a conventional search tree implementation (red-black,AVL,B-Tree,...) and add a weight field to each node that counts the number of entries in its subtree. Then searching for the k-th entry is quite simple:
Sketch:
Check the weight of the child nodes, and find the child c which has the largest weight (accumulated from left) that is not greater than k
Subtract from k all weights of children that are left of c.
Descend down to c and call this procedure recursively.
In case of a binary search tree, the algorithm is quite simple since each node only has two children. For a B-tree (which is likely more efficient), you have to account as many children as the node contains.
Of course, you must update the weight on insert/delete: Go up the tree from the insert/delete position and increment/decrement the weight of each node up to the root. Also, you must exchange the weights of nodes when you do rotations (or splits/merges in the B-tree case).
Another idea would be a skip-list where the skips are annotated with the number of elements they skip. But this implementation is not trivial, since you have to update the skip length of each skip above an element that is inserted or deleted, so adjusting a binary search tree is less hassle IMHO.
Edit: I found a C implementation of a 2-3-4 tree (B-tree), check out the links at the bottom of this page: http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html

You can not achieve what you want with simple array or any other of the built-in containers. You can use a more advanced data structure for instance a skip list or a modified red-black tree(the backing datastructure of std::set).
You can get the k-th element of an arbitrary array in linear time and if the array is sorted you can do that in constant time, but still the insert will require shifting all the subsequent elements which is linear in the worst case.
As for std::set you will need additional data to be stored at each node to be able to get the k-th element efficiently and unfortunately you can not modify the node structure.

Remove an element from unbalanced binary search tree

I have been wanting to write remove() method for my Binary Search Tree (which happens to be an array representation). But before writing it, I must consider all cases. Omitting all cases (since they are easy) except when the node has two children, in all the explanations I have read so far, most of the cases I see remove an element from an already balanced binary search tree. In the few cases where I have seen an element being removed from an unbalanced binary search tree, I find that they balance it through zigs and zags, and then remove the element.
Is there a way that I can possibly remove an element from an unbalanced binary search tree without having to balance it beforehand?
If not, would it be easier to write an AVL tree (in array representation)?

You don't need to balance it, but you do need to recursively go down the tree performing some swaps here and there so you actually end up with a valid BST.
Deletion of a node with 2 children in an (unbalanced) BST: (from Wikipedia)
Call the node to be deleted N. Do not delete N. Instead, choose either its in-order successor node or its in-order predecessor node, R. Copy the value of R to N, then recursively call delete on R until reaching one of the first two cases.
Deleting a node with two children from a binary search tree. First the rightmost node in the left subtree, the inorder predecessor 6, is identified. Its value is copied into the node being deleted. The inorder predecessor can then be easily deleted because it has at most one child. The same method works symmetrically using the inorder successor labelled 9.
Although, why do you want an unbalanced tree? All operations on it take on it take longer (or at least as long), and the additional overhead to balance doesn't change the asymptotic complexity of any operations. And, if you're using the array representation where the node at index i has children at indices 2i and 2i+1, it may end up fairly sparse, i.e. there will be quite a bit of wasted memory.

Fast bucket implementation

In a graph class I need to handle nodes with integer values (1-1000 mostly). In every step I want to remove a node and all its neighbors from the graph. Also I want to always begin with the node of the minimal value. I thought long about how to do this in the fastest possible manner and decided to do the following:
The graph is stored using adjancency lists
There is a huge array std::vector<Node*> bucket[1000] to store the nodes by its value
The index of the lowest nonempty bucket is always stored and kept track off
I can find the node of minimal value very fast by picking a random element of that index or if the bucket is already empty increase the index
Removing the selected node from the bucket can clearly done in O(1), the problem is that for removing the neighbors I need to search the bucket bucket[value of neighbor] first for all neighbor nodes, which is not really fast.
Is there a more efficient approach to this?
I thought of using something like std::list<Node*> bucket[1000], and assign every node a pointer to its "list element", such that I can remove the node from the list in O(1). Is this possible with stl lists, clearly it can be done with a normal double linked list that I could implement by hand?

I recently did something similar to this for a priority queue implementation using buckets.
What I did was use a hash tables (unordered_map), that way, you don't need to store 1000 empty vectors and you still get O(1) random access (general case, not guaranteed). Now, if you only need to store/create this graph class one time, it probably doesn't matter. In my case I needed to create the priority queue tens/hundreds of time per second and using the hash map made a huge difference (due to the fact that I only created unordered_sets when I actually had an element of that priority, so no need to initialize 1000 empty hash sets). Hash sets and maps are new in C++11, but have been available in std::tr1 for a while now, or you could use the Boost libraries.
The only difference that I can see between your & my usecase, is that you also need to be able to remove neighboring nodes. I'm assuming every node contains a list of pointers to it's neighbors. If so, deletion of the neighbors should take k * O(1) with k the number of neighbors (again, O(1) in general, not guaranteed, worst case is O(n) in an unordered_map/set). You just go over every neighboring node, get its priority, that gives you the correct index into the hash map. Then you find the pointer in the hash set which the priority maps to, this search in general will be O(1) and removing the element is again O(1) in general.
All in all, I think you got a pretty good idea of what to do, but I believe that using hash maps/sets will speed up your code by quite a lot (depends on the exact usage of course). For me, the speed improvement of an implementation with unordered_map<int, unordered_set> versus vector<set> was around 50x.

Here's what I would do. Node structure:
struct Node {
std::vector<Node*>::const_iterator first_neighbor;
std::vector<Node*>::const_iterator last_neighbor;
int value;
bool deleted;
};
Concatenate the adjacency lists and put them in a single std::vector<Node*> to lower the overhead of memory management. I'm using soft deletes so update speed is not important.
Sort pointers to the nodes by value into another std::vector<Node*> with a counting sort. Mark all nodes as not deleted.
Iterate through the nodes in sorted order. If the node under consideration has been deleted, go to the next one. Otherwise, mark it deleted and iterate through its neighbors and mark them deleted.
If your nodes are stored contiguously in memory, then you can omit last_neighbor at the cost of an extra sentinel node at the end of the structure, because last_neighbor of a node is first_neighbor of the succeeding node.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js