How is std::set (red-black tree) forward iteration implemented? - C++

If I did an in-order traversal of a balanced BST from the smallest to the largest value, I'd use a DFS that maintains a stack of size lg(n). But if I needed to find the in-order successor of an arbitrary node, that is a worst-case lg(n) operation. So if I wanted to iterate in order, I'd need to find the in-order successor repeatedly for each node, yielding O(n*lg(n)). Does std::set use some trick for in-order iteration, does it really cost O(n*lg(n)), or is the time cost amortized somehow?

There is no trick in the in-order iteration; all you need is an O(1) mechanism for finding the parent of the current node.
The in-order scan traverses each parent-child edge exactly twice: once from parent-to-child and once from child-to-parent. Since there are the same number of edges as non-root nodes in the tree, it follows that the complete iteration performs Θ(n) transitions to iterate over n nodes, which is amortised constant time per node.
The usual way to find the parent is to store a parent link in the node. The extra link certainly increases the size of a node (three pointers instead of two; counting the colour field and alignment, typically four words of overhead instead of three) but the cost is within reason.
If it were not for the C++ iterator invalidation rules, which require that iterators to elements of ordered associative containers must not be invalidated by insertion or deletion of other elements, it would be possible to maintain a stack of size O(log n) in the iterator. Such an iterator would be bulky but not unmanageably so (since log n is limited in practice to a smallish integer), but it would not be usable after any tree modification.
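The stack-based alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions: PlainNode, InorderCursor and collect are hypothetical names, the node has no parent link, and nothing here reflects how any standard library actually implements std::set.

```cpp
#include <cassert>
#include <stack>
#include <vector>

// Hypothetical node type without a parent link (illustration only).
struct PlainNode {
    int value;
    PlainNode* left = nullptr;
    PlainNode* right = nullptr;
};

// In-order cursor that keeps the path to the current node on a stack.
// The stack never holds more entries than the height of the tree.
class InorderCursor {
public:
    explicit InorderCursor(const PlainNode* root) { push_left(root); }

    bool done() const { return path_.empty(); }

    int value() const {
        assert(!done());
        return path_.top()->value;
    }

    void advance() {
        assert(!done());
        const PlainNode* n = path_.top();
        path_.pop();
        push_left(n->right);  // next is the leftmost node of the right subtree, if any
    }

private:
    void push_left(const PlainNode* n) {
        for (; n; n = n->left) path_.push(n);
    }

    std::stack<const PlainNode*> path_;
};

// Collects all values in sorted order using the cursor.
std::vector<int> collect(const PlainNode* root) {
    std::vector<int> out;
    for (InorderCursor c(root); !c.done(); c.advance()) out.push_back(c.value());
    return out;
}
```

Because the cursor caches a path through the tree, any insertion or erasure that restructures the tree would leave the cached path stale, which is exactly why this design does not fit the container's iterator invalidation rules.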

Related

Is there a sorted data structure with logarithmic time insertion, deletion and find (with distance)?

I have a sorted array in which I find number of items less than particular value using binary search (std::upper_bound) in O(logn) time.
Now I want to insert and delete from this array while keeping it sorted. I want the overall complexity to be O(logn).
I know that using a binary search tree or std::multiset I can do insertion, deletion and upper_bound in O(logn), but I am not able to get the distance/index in logarithmic time (std::distance is O(n) for sets).
So is there a way to achieve what I want to do?
You can augment any balanced-binary-search-tree data structure (e.g. a red-black tree) by including a "subtree size" data-member in each node (alongside the standard "left child", "right child", and "value" members). You can then calculate the number of elements less than a given element as you navigate downward from the root to that element.
It adds a fair bit of bookkeeping, and of course it means you need to use your own balanced-binary-search-tree implementation instead of one from the standard library; but it's quite doable, and it doesn't affect the asymptotic complexities of any of the operations.
You can use a balanced BST that stores the size of the left subtree in each node to calculate the distance.
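A minimal sketch of the "subtree size" idea, assuming a hypothetical SNode type and helper names (add_key, count_less) that do not come from any library. To keep it short the tree is a plain unbalanced BST; a real implementation would rebalance (e.g. as a red-black tree) and repair the size fields during rotations.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical node type: a BST node augmented with its subtree size.
struct SNode {
    int key;
    std::size_t size = 1;  // number of nodes in the subtree rooted here
    std::unique_ptr<SNode> left, right;
    explicit SNode(int k) : key(k) {}
};

std::size_t size_of(const std::unique_ptr<SNode>& n) { return n ? n->size : 0; }

// Multiset-style insert without rebalancing (assumption: balancing omitted for brevity).
void add_key(std::unique_ptr<SNode>& n, int key) {
    if (!n) {
        n = std::make_unique<SNode>(key);
        return;
    }
    ++n->size;
    add_key(key < n->key ? n->left : n->right, key);
    // Invariant that count_less relies on.
    assert(n->size == size_of(n->left) + size_of(n->right) + 1);
}

// Number of stored keys strictly less than `key`, computed in O(height).
std::size_t count_less(const std::unique_ptr<SNode>& root, int key) {
    std::size_t count = 0;
    const SNode* n = root.get();
    while (n) {
        if (key <= n->key) {
            n = n->left.get();
        } else {
            count += size_of(n->left) + 1;  // whole left subtree plus n itself
            n = n->right.get();
        }
    }
    return count;
}
```

With the tree kept balanced, count_less touches one node per level, so it runs in O(log n), which is the "distance" query that std::multiset cannot answer faster than O(n).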

C++ : Running time of next() and prev() in a multiset iterator?

What is the time complexity of applying the next() and prev() functions to a multiset<int>::iterator object when the corresponding multiset contains N elements?
I understand that in STL, a multiset is implemented as a balanced binary search tree and hence I expect the time complexity to be O(log N) per operation (in the worst case) in case we just traverse the tree until we find the appropriate value, but I have a hunch that this should be O(1) on average.
But what if the tree is implemented as follows: when inserting element x into the balanced binary search tree, we can also retrieve the largest number in the tree smaller than x and the smallest number in the tree larger than x in O(log N). Thus, in theory, we can have each node in the tree maintain pointers to its next and prev elements so that next() and prev() then run in constant time per query.
Can anybody shed some light on what's up?
The standard mandates that all operations on iterators run in amortized constant time: http://www.eel.is/c++draft/iterator.requirements#general-10. The basic idea is that each iterator category only defines operations which can be implemented in amortized constant time.
Iteration is a common thing to do, and if operator++ on an iterator (I guess that's what you mean by next?) was logN, then traversing a container in a loop would be NlogN. The standard makes this impossible; since operator++ is amortized constant, iterating over any data structure in the standard is always O(N).
However, I dug into the implementation of multiset in gcc 5.4 to have at least one example. Both set and multiset are implemented in terms of the same underlying structure, _Rb_tree. Delving into that structure a bit, its nodes not only have left and right node pointers, but also a parent node pointer, and an iterator is just a pointer to a node.
Given a node in a binary search tree that includes a pointer to its parent, it's easy to figure out what the next node in the tree is:
1. If it has a right child, descend to the right child. Then descend to the left child as far as you can; that's the next node.
2. If it does not have a right child, ascend to the parent, and determine whether the original node is the left or right child of the parent. If the node is the left child of the parent, then the parent is the next node. If the node is the right child of the parent, the parent was already processed, so you need to apply the same logic recursively between the parent and its grandparent.
This SO question shows the source code with the core logic: What is the definition of _Rb_tree_increment in bits/stl_tree.h? (it's surprisingly hard to find for some reason).
This does not run in constant time: in both 1. and 2. we have loops that either descend or ascend and could take up to log(N) steps. However, you can easily convince yourself that the amortized time is constant, because as you traverse the tree with this algorithm, each node is touched at most 4 times:
Once on the way down to its left child.
Once when it comes back up from the left child and needs to consider itself.
Once when it descends to its right child.
Once when ascending from the right child.
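The successor rule and the counting argument above can be checked with a small sketch. Everything here is hypothetical illustration code (Node, build, successor and traverse are made-up names, and this is not the libstdc++ source); it counts how many parent/child links a full in-order walk follows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node with a parent link (a real red-black node also stores a colour).
struct Node {
    int value = 0;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
};

// Builds a balanced BST holding lo..hi inside `pool`.
// Assumption: the caller reserved enough capacity, so node addresses stay stable.
Node* build(std::vector<Node>& pool, int lo, int hi, Node* parent) {
    if (lo > hi) return nullptr;
    assert(pool.size() < pool.capacity());
    const int mid = lo + (hi - lo) / 2;
    pool.emplace_back();
    Node* n = &pool.back();
    n->value = mid;
    n->parent = parent;
    n->left = build(pool, lo, mid - 1, n);
    n->right = build(pool, mid + 1, hi, n);
    return n;
}

// In-order successor of n (nullptr after the last node).
// Adds the number of links followed to `steps`.
Node* successor(Node* n, std::size_t& steps) {
    if (n->right) {  // case 1: go right once, then left as far as possible
        n = n->right;
        ++steps;
        while (n->left) {
            n = n->left;
            ++steps;
        }
        return n;
    }
    // case 2: climb while we are a right child, then step to that parent
    while (n->parent && n == n->parent->right) {
        n = n->parent;
        ++steps;
    }
    if (n->parent) ++steps;
    return n->parent;
}

// Walks the whole tree in order and returns the total number of links followed.
std::size_t traverse(Node* root, std::vector<int>& out) {
    std::size_t steps = 0;
    Node* n = root;
    while (n && n->left) n = n->left;  // start at the leftmost node (not counted)
    for (; n; n = successor(n, steps)) out.push_back(n->value);
    return steps;
}
```

A single call to successor may follow up to about log(N) links, but since each link is followed at most once downwards and once upwards, a complete walk over N nodes follows at most 2(N - 1) links in total.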
In retrospect I would say this is the fairly obvious choice. Iteration over the whole data structure is a common operation, so the performance is very important. Adding a third pointer to the node is not a trivial amount of space, but it's not the end of the world either; at most it will bloat the data structure from 3 to 4 words (2 pointers + data, which alignment makes 3 at the minimum, vs 3 pointers + data). If you work with ranges, as opposed to two iterators, an alternative would be to maintain a stack, and then you don't need the parent pointer; but this only works if you iterate from the very beginning to the end. It wouldn't allow iteration from an iterator in the middle to the end (which is also an important operation for BSTs).
I think next() and prev() will take anywhere between 1 and h steps, where h is the height of the tree, which is O(log N). If you use next() to walk from beginning to end over N nodes, the iterator visits the entire tree, and that takes about 2N steps (2 because the iterator has to traverse downwards and then upwards through the pointers that link the nodes). The total traversal is not O(N * log N), as some steps are cheaper than others. At the very worst, a next() might go from a leaf node to the head node, which takes h steps, approximately O(log N). But that will only occur twice (once to arrive at begin(), the second time from the rightmost node of the left subtree to the head node). So on average next() and prev() take about 2 steps, which is O(1).

How is the ordering of std::map achieved?

We can see from several sources that std::map is implemented using a red-black tree. It is my understanding that these types of data structures do not hold their elements in any particular order and just maintain the BST property and the height balancing requirements.
So, how is it that map::begin is constant time, and we are able to iterate over an ordered sequence?
Let's start from the premise that std::map maintains a BST internally (which is not strictly required by the standard, but most libraries probably do that, typically with a red-black tree).
In a BST, to find the smallest element, you would just follow the left branches until you reach a leaf, which is O(log(N)). However, if you want to deliver the "begin()" iterator in constant time, it is quite simple to just keep track, internally, of the smallest element. Every time an insertion causes the smallest element to change, you update it, that's all. It's memory overhead, of course, but that's a trade-off.
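A minimal sketch of the "keep track of the smallest element" idea, assuming a hypothetical TinySet class built on a plain unbalanced BST. Rebalancing and erase are left out (erasing the minimum would have to move the cached pointer to its successor), and none of these names come from a real library.

```cpp
#include <cassert>
#include <memory>

// Hypothetical BST node (illustration only).
struct MNode {
    int key;
    std::unique_ptr<MNode> left, right;
    explicit MNode(int k) : key(k) {}
};

// Tiny set that caches a pointer to its smallest node so that
// looking up the minimum does not need to walk down the left spine.
class TinySet {
public:
    void insert(int key) {
        std::unique_ptr<MNode>* slot = &root_;
        while (*slot) {
            if (key == (*slot)->key) return;  // already present
            slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
        }
        *slot = std::make_unique<MNode>(key);
        // Update the cached minimum only when the new key is smaller.
        if (!leftmost_ || key < leftmost_->key) leftmost_ = slot->get();
    }

    bool empty() const { return leftmost_ == nullptr; }

    // O(1): no descent through the tree.
    int min() const {
        assert(!empty());
        return leftmost_->key;
    }

private:
    std::unique_ptr<MNode> root_;
    const MNode* leftmost_ = nullptr;
};
```

The cost is one extra pointer per container plus a comparison on each insertion, which is the memory/time trade-off described above.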
There are possibly other ways to single out the smallest element (like keeping the root node unbalanced on purpose). Either way, it's not hard to do.
To iterate through the "ordered" sequence, you simply have to do an in-order traversal of the tree. Starting from the left-most leaf node, you go (up), (right), (up, up), (right), and so on. It's a simple set of rules and it's easy to implement; just see a quick implementation of a simple BST inorder iterator that I wrote a while back. As you do the in-order traversal, you will visit every node from the smallest to the biggest, in the correct order. In other words, it just gives you the illusion that an "array" is sorted, but in reality, it's the traversal that makes it look like it's sorted.
The balancing properties of a red-black tree allow you to insert a node, anywhere in the tree, at O(log N) cost. For typical std::map implementations, the container will keep the tree sorted, and whenever you insert a new node, insert it into the correct location to keep the tree sorted, and then rebalance the tree to maintain the red-black property.
So no, red-black trees are not inherently sorted.
RB trees are binary search trees. Binary search trees don't necessarily store their elements in any particular order, but you can always get an inorder traversal. I'm not sure how map::begin guarantees constant time, I'd assume this involves always remembering the path to the smallest element (normally it'd be O(log(n))).

Difference between std::set and std::priority_queue

Since both std::priority_queue and std::set (and std::multiset) are data containers that store elements and allow you to access them in an ordered fashion, and have same insertion complexity O(log n), what are the advantages of using one over the other (or, what kind of situations call for the one or the other?)?
While I know that the underlying structures are different, I am not as much interested in the difference in their implementation as I am in the comparison of their performance and suitability for various uses.
Note: I know about the no-duplicates rule in a set. That's why I also mentioned std::multiset, since it has exactly the same behavior as std::set but can be used where the stored data is allowed to contain elements that compare equal. So please, don't comment on the single/multiple keys issue.
A priority queue only gives you access to one element in sorted order -- i.e., you can get the highest priority item, and when you remove that, you can get the next highest priority, and so on. A priority queue also allows duplicate elements, so it's more like a multiset than a set. [Edit: As @Tadeusz Kopec pointed out, building a heap is also linear on the number of items in the heap, where building a set is O(N log N) unless it's being built from a sequence that's already ordered (in which case it is also linear).]
A set allows you full access in sorted order, so you can, for example, find two elements somewhere in the middle of the set, then traverse in order from one to the other.
std::priority_queue allows you to do the following:
Insert an element O(log n)
Get the top element O(1) (the largest by default; the smallest with std::greater as the comparator)
Erase the top element O(log n)
while std::set has more possibilities:
Insert any element O(log n) and the constant is greater than in std::priority_queue
Find any element O(log n)
Find the first element >= the one you are looking for O(log n) (lower_bound)
Erase any element O(log n)
Erase any element by its iterator, amortized O(1)
Move to the previous/next element in sorted order, amortized O(1)
Get the smallest element O(1)
Get the largest element O(1)
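The two lists above can be made concrete with the real std::set and std::priority_queue interfaces. The helper names below (first_at_least, successor_of, pop_top) and the `fallback` parameter are hypothetical conveniences for this sketch, not standard functions.

```cpp
#include <cassert>
#include <queue>
#include <set>

// Smallest element >= key, found in O(log n); `fallback` if there is none.
int first_at_least(const std::set<int>& s, int key, int fallback) {
    auto it = s.lower_bound(key);
    return it == s.end() ? fallback : *it;
}

// Element that follows `key` in sorted order; `fallback` if key is absent or last.
// find is O(log n); the ++ on the iterator is amortized O(1).
int successor_of(const std::set<int>& s, int key, int fallback) {
    auto it = s.find(key);
    if (it == s.end()) return fallback;
    ++it;
    return it == s.end() ? fallback : *it;
}

// Removes and returns the top of a priority_queue.
// With the default comparator (std::less) the top is the largest element.
int pop_top(std::priority_queue<int>& pq) {
    assert(!pq.empty());
    int top = pq.top();  // O(1)
    pq.pop();            // O(log n)
    return top;
}
```

For example, with s = {10, 20, 30, 40}, first_at_least(s, 25, -1) yields 30 and successor_of(s, 20, -1) yields 30, while *s.begin() and *s.rbegin() give the smallest and largest elements. The priority queue offers nothing comparable: only its top element is reachable.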
set/multiset are generally backed by a binary tree. http://en.wikipedia.org/wiki/Binary_tree
priority_queue is generally backed by a heap. http://en.wikipedia.org/wiki/Heap_(data_structure)
So the question is really when should you use a binary tree instead of a heap?
Both structures are laid out in a tree; however, the rules about the relationship between ancestors are different.
We will call the positions P for parent, L for left child, and R for right child.
In a binary tree L < P < R.
In a heap P < L and P < R
So binary trees sort "sideways" and heaps sort "upwards".
So if we look at this as a triangle, then in the binary tree L, P, R are completely sorted, whereas in the heap the relationship between L and R is unknown (only their relationship to P is known).
This has the following effects:
If you have an unsorted array and want to turn it into a binary tree it takes O(nlogn) time. If you want to turn it into a heap it only takes O(n) time, (as it just compares to find the extreme element)
Heaps are more efficient if you only need the extreme element (lowest or highest by some comparison function). Heaps only do the comparisons (lazily) necessary to determine the extreme element.
Binary trees perform the comparisons necessary to order the entire collection, and keep the entire collection sorted all-the-time.
Heaps have constant-time lookup (peek) of lowest element, binary trees have logarithmic time lookup of lowest element.
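As a rough illustration of "heaps only do the comparisons necessary", the sketch below heapifies an unsorted vector in O(n) with std::make_heap and then extracts only the k smallest elements. The function name k_smallest is hypothetical; std::make_heap, std::pop_heap and std::greater are standard.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Returns the k smallest elements of v in ascending order.
// Heapify is O(n); each extraction is O(log n), so the total is O(n + k log n)
// rather than the O(n log n) needed to order the whole collection.
std::vector<int> k_smallest(std::vector<int> v, std::size_t k) {
    assert(k <= v.size());
    // std::greater turns the default max-heap into a min-heap.
    std::make_heap(v.begin(), v.end(), std::greater<int>{});
    std::vector<int> out;
    auto heap_end = v.end();
    for (std::size_t i = 0; i < k; ++i) {
        // Moves the current minimum to the last position of the heap range.
        std::pop_heap(v.begin(), heap_end, std::greater<int>{});
        --heap_end;
        out.push_back(*heap_end);
    }
    return out;
}
```

If every element is eventually needed in order, this advantage disappears: extracting all n elements costs O(n log n), the same as building a sorted tree.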
Even though insert and erase operations for both containers have the same complexity O(log n), these operations are slower for std::set than for std::priority_queue. That's because std::set makes many memory allocations: every element of std::set is stored in its own allocation, whereas std::priority_queue (with std::vector as the default underlying container) stores all elements in a single contiguous allocation. On the other hand, std::priority_queue performs many swap operations on its elements, whereas std::set only swaps pointers. So if swapping is a very slow operation for the element type, using std::set may be more efficient. Moreover, the element type may not be swappable at all.
The memory overhead of std::set is also much bigger, because it has to store several pointers per node.

Which STL Container?

I need a container (not necessarily an STL container) which lets me do the following easily:
Insertion and removal of elements at any position
Accessing elements by their index
Iterate over the elements in any order
I used std::list, but it won't let me insert at any position (it does, but for that I'll have to iterate over all elements and then insert at the position I want, which is slow, as the list may be huge). So can you recommend any efficient solution?
It's not completely clear to me what you mean by "Iterate over the elements in any order" - does this mean you don't care about the order, as long as you can iterate, or that you want to be able to iterate using arbitrarily defined criteria? These are very different conditions!
Assuming you meant iteration order doesn't matter, several possible containers come to mind:
std::map [a red-black tree, typically]
Insertion, removal, and access are O(log(n))
Iteration is ordered by index
hash_map or std::tr1::unordered_map [a hash table]
Insertion, removal, and access are all (approx) O(1)
Iteration is 'random'
I think this diagram will help you a lot.
Either a vector or a deque will suit. vector will provide faster accesses, but deque will provide faster insertions and removals.
Well, you can't have all of those in constant time, unfortunately. Decide if you are going to do more insertions or reads, and base your decision on that.
For example, a vector will let you access any element by index in constant time, iterate over the elements in linear time (all containers should allow this), but insertion and removal takes linear time (slower than a list).
You can try std::deque. It will not provide constant time removal of elements in the middle, but it supports:
random access to elements
constant time insertion and removal of elements at the end of the sequence
linear time insertion and removal of elements in the middle.
A vector. When you erase any item, copy the last item over the one to be erased (or swap them, whichever is faster) and pop_back. To insert at a position (but why should you, if the order doesn't matter!?), push_back the item at that position and overwrite it (or swap) with the item to be inserted.
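The swap-and-pop trick described above might look like this. The function names unordered_erase and unordered_insert are hypothetical; both run in O(1) (amortized for the insert) but do not preserve the relative order of the remaining elements.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Erases v[index] in O(1) by moving the last element into its place.
// The order of the remaining elements is NOT preserved.
template <typename T>
void unordered_erase(std::vector<T>& v, std::size_t index) {
    assert(index < v.size());
    if (index + 1 != v.size()) v[index] = std::move(v.back());
    v.pop_back();
}

// Places `value` at v[index] in amortized O(1); the element that was there
// moves to the end of the vector.
template <typename T>
void unordered_insert(std::vector<T>& v, std::size_t index, T value) {
    assert(index <= v.size());
    v.push_back(std::move(value));
    if (index + 1 != v.size()) std::swap(v[index], v.back());
}
```

For example, erasing index 1 from {1, 2, 3, 4, 5} leaves {1, 5, 3, 4}, and then inserting 9 at index 0 gives {9, 5, 3, 4, 1}.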
By "iterating over the elements in any order", do you mean you need support for both forward and backwards by index, or do you mean order doesn't matter?
You want a special tree called an unsorted counted tree. This allows O(log(n)) indexed insertion, O(log(n)) indexed removal, and O(log(n)) indexed lookup. It also allows O(n) iteration in either the forward or reverse direction. One example where these are used is text editors, where each line of text in the editor is a node.
Here are some references:
Counted B-Trees
Rope (computer science)
An order statistic tree might be useful here. It's basically just a normal tree, except that every node in the tree includes a count of the nodes in its left sub-tree. This supports all the basic operations with no worse than logarithmic complexity. During insertion, anytime you insert an item in a left sub-tree, you increment the node's count. During deletion, anytime you delete from the left sub-tree, you decrement the node's count. To index to node N, you start from the root. The root has a count of nodes in its left sub-tree, so you check whether N is less than, equal to, or greater than the count for the root. If it's less, you search in the left subtree in the same way. If it's greater, you descend the right sub-tree, add the root's count to that node's count, and compare that to N. Continue until A) you've found the correct node, or B) you've determined that there are fewer than N items in the tree.
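A minimal sketch of indexing by left-subtree count, assuming a hypothetical CNode type and helper names (insert_key, node_at, key_at). The tree here is keyed and unbalanced to keep the example short; a real order statistic tree would rebalance and adjust the counts during rotations. Instead of adding the parent's count on the way down, this version subtracts it from the index being searched for, which is equivalent.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical node: a BST node that counts the nodes in its left subtree.
struct CNode {
    int key;
    std::size_t left_count = 0;
    std::unique_ptr<CNode> left, right;
    explicit CNode(int k) : key(k) {}
};

// Insert without rebalancing; bumps the count whenever we go left.
void insert_key(std::unique_ptr<CNode>& n, int key) {
    if (!n) {
        n = std::make_unique<CNode>(key);
        return;
    }
    if (key < n->key) {
        ++n->left_count;
        insert_key(n->left, key);
    } else {
        insert_key(n->right, key);
    }
}

// Node at zero-based position `index` in sorted order, or nullptr if out of range.
const CNode* node_at(const CNode* n, std::size_t index) {
    while (n) {
        if (index < n->left_count) {
            n = n->left.get();
        } else if (index == n->left_count) {
            return n;
        } else {
            index -= n->left_count + 1;  // skip the left subtree and n itself
            n = n->right.get();
        }
    }
    return nullptr;
}

// Convenience wrapper that requires a valid index.
int key_at(const CNode* root, std::size_t index) {
    const CNode* n = node_at(root, index);
    assert(n && "index out of range");
    return n->key;
}
```

Each step of node_at moves one level down the tree, so with a balanced tree indexed lookup costs O(log n).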
But it sounds like you're looking for a single container with the following properties:
All the best benefits of various containers
None of their ensuing downsides
And that's impossible. One benefit causes a detriment. Choosing a container is about compromise.
std::vector