How is the ordering of std::map achieved? - c++

We can see from several sources that std::map is implemented using a red-black tree. It is my understanding that these types of data structures do not hold their elements in any particular order and just maintain the BST property and the height balancing requirements.
So, how is it that map::begin is constant time, and we are able to iterate over an ordered sequence?

This answer starts from the premise that std::map maintains a BST internally. The standard does not strictly require that, but most libraries do it, typically with a red-black tree.
In a BST, to find the smallest element, you would just follow the left branches until you reach a node with no left child, which is O(log(N)) in a balanced tree. However, if you want to deliver the "begin()" iterator in constant time, it is quite simple to just keep track, internally, of the smallest element. Every time an insertion or erasure causes the smallest element to change, you update it, that's all. It's a small memory overhead, of course, but that's a trade-off.
There are possibly other ways to single out the smallest element (like keeping the root node unbalanced on purpose). Either way, it's not hard to do.
To iterate through the "ordered" sequence, you simply have to do an in-order traversal of the tree. Starting from the left-most node, you go (up), (right), (up, up), (right), and so on. It's a simple set of rules and it's easy to implement. As you do the in-order traversal, you will visit every node from the smallest to the biggest, in the correct order. In other words, it just gives you the illusion that the "array" is sorted, but in reality, it's the traversal that makes it look like it's sorted.
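The "keep track of the smallest element" idea can be sketched as follows. This is only an illustration under simplifying assumptions: `TinyTree`, `Node`, `attach` and `slow_begin` are hypothetical names, the tree is not rebalanced, and erasure is left out, so it is not how any particular standard library is written.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type for illustration; a real red-black tree would also store a colour.
struct Node {
    int key;
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

class TinyTree {
public:
    // Inserts key if absent and keeps the cached minimum up to date.
    void insert(int key) {
        Node* added = attach(key);
        // A new node is the new minimum exactly when its key is smaller than the cached one.
        if (added && (!leftmost_ || key < leftmost_->key)) leftmost_ = added;
        assert(leftmost_ == slow_begin());  // debug-only check of the invariant
    }

    // O(1): returns the cached pointer, no walk down the tree.
    const Node* begin() const { return leftmost_; }

    // O(height): what begin() would cost without the cached pointer.
    const Node* slow_begin() const {
        const Node* n = root_.get();
        while (n && n->left) n = n->left.get();
        return n;
    }

private:
    // Plain BST insertion without rebalancing; returns the new node, or nullptr for a duplicate.
    Node* attach(int key) {
        std::unique_ptr<Node>* slot = &root_;
        while (*slot) {
            if (key == (*slot)->key) return nullptr;
            slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
        }
        *slot = std::make_unique<Node>(key);
        return slot->get();
    }

    std::unique_ptr<Node> root_;
    Node* leftmost_ = nullptr;  // the one extra pointer that buys constant-time begin()
};
```

After inserting 5, 3, 8 and 1 into a `TinyTree`, `begin()->key` is 1 without any descent; a later `insert(0)` moves the cached pointer to the new node.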

The balancing properties of a red-black tree allow you to insert a node, anywhere in the tree, at O(log N) cost. For typical std::map implementations, the container will keep the tree sorted, and whenever you insert a new node, insert it into the correct location to keep the tree sorted, and then rebalance the tree to maintain the red-black property.
So no, red-black trees are not inherently sorted.

RB trees are binary search trees. Binary search trees don't necessarily store their elements in any particular order, but you can always get an inorder traversal. I'm not sure how map::begin guarantees constant time, I'd assume this involves always remembering the path to the smallest element (normally it'd be O(log(n))).

Related

Do not understand how c++ set works

I am using the std::set class for a leetcode question. From Googling I learned that std::set keeps the elements in an ordered manner and I heard that set.begin() returns the smallest element. But I also heard set uses red-black trees and has O(log n) time complexity. I don't understand how these two can go together, as in how does set.begin() return smallest element when a red-black tree doesn't guarantee the smallest element will be the head.
Also, the set.begin() function makes it seem like this container uses an array instead of a linked structure to build the red-black tree, which again I don't understand. How can an array be used instead of a tree?
In the underlying tree, the leftmost node is the smallest, and begin() is the leftmost node.
Iterating traverses the tree's nodes in the appropriate order.
For instance, if the tree is (this is a simpler "regular" binary search tree, but the principle is the same with red-black trees)
      4
    /   \
   2     6
  / \   /
 1   3 5
then iterating will start at 1, then move up to 2, down again to 3, up two steps to 4, down two steps to 5, and finally up to 6.
(This means that the individual "steps" when iterating over a tree are not all constant-time operations, although a full traversal still takes time linear in the number of elements.)
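The walk described above can be observed from the outside with std::set itself. The helper names `example_set` and `iteration_order` below are made up for this sketch; only the std::set and std::vector calls are real library API.

```cpp
#include <set>
#include <vector>

// Builds a set from the keys of the example tree, deliberately not in sorted order.
std::set<int> example_set() {
    return std::set<int>{4, 2, 6, 1, 3, 5};
}

// Copies the elements in the order the set's iterators visit them.
std::vector<int> iteration_order(const std::set<int>& s) {
    return std::vector<int>(s.begin(), s.end());
}
```

Whatever shape the library's internal tree ends up with, `iteration_order(example_set())` yields 1, 2, 3, 4, 5, 6, and `*example_set().begin()` is 1.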
The standard does not impose a particular implementation for any of the containers. A red-black tree is a possible implementation of set. It's also free to choose how to make begin constant complexity.
Assuming the implementation chooses a tree, the most obvious way is by keeping a pointer to the least element, as well as to the root of the tree.
You can also implement an ordered set with an array, e.g. boost::flat_set, although that doesn't meet all the same complexity requirements as std::set. It keeps the array ordered by only inserting into the sorted position, and only if there isn't an equivalent element already present.
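A minimal sketch of that array-based approach, using a sorted std::vector. `FlatIntSet` is a hypothetical class written for this illustration and is not the boost::flat_set implementation; it only shows the "insert at the sorted position, skip equivalents" rule.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

class FlatIntSet {
public:
    // Returns true if the value was inserted, false if an equal element already existed.
    // The search is O(log n), but shifting the tail of the vector makes insertion O(n).
    bool insert(int value) {
        auto pos = std::lower_bound(data_.begin(), data_.end(), value);
        if (pos != data_.end() && *pos == value) return false;
        data_.insert(pos, value);
        assert(std::is_sorted(data_.begin(), data_.end()));  // debug-only invariant check
        return true;
    }

    // O(log n) lookup by binary search over the contiguous storage.
    bool contains(int value) const {
        return std::binary_search(data_.begin(), data_.end(), value);
    }

    // The smallest element is simply the first array slot (if any).
    const std::vector<int>& items() const { return data_; }

private:
    std::vector<int> data_;  // always kept sorted and free of duplicates
};
```

The linear-time insertion is the complexity requirement this layout gives up compared with std::set; in exchange, the elements sit in contiguous memory.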

Is there a sorted data structure with logarithmic time insertion, deletion and find (with distance)?

I have a sorted array in which I find number of items less than particular value using binary search (std::upper_bound) in O(logn) time.
Now I want to insert and delete from this array while keeping it sorted. I want the overall complexity to be O(logn).
I know that using a binary search tree or std::multiset I can do insertion, deletion and upper_bound in O(logn), but I am not able to get the distance/index (std::distance is O(n) for sets) in logarithmic time.
So is there a way to achieve what I want to do?
You can augment any balanced-binary-search-tree data structure (e.g. a red-black tree) by including a "subtree size" data-member in each node (alongside the standard "left child", "right child", and "value" members). You can then calculate the number of elements less than a given element as you navigate downward from the root to that element.
It adds a fair bit of bookkeeping, and of course it means you need to use your own balanced-binary-search-tree implementation instead of one from the standard library; but it's quite doable, and it doesn't affect the asymptotic complexities of any of the operations.
You can use a balanced BST that stores the size of the left subtree in each node to calculate the distance.
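Here is a rough sketch of the subtree-size idea. To stay short it uses a plain BST without rebalancing, so the O(log n) bound only holds if a balancing scheme (which would also have to update `size` during rotations) is added. `SizedNode`, `insert_key` and `count_less` are names invented for this example.

```cpp
#include <cstddef>
#include <memory>

struct SizedNode {
    int key;
    std::size_t size = 1;  // number of nodes in the subtree rooted at this node
    std::unique_ptr<SizedNode> left, right;
    explicit SizedNode(int k) : key(k) {}
};

std::size_t subtree_size(const std::unique_ptr<SizedNode>& n) {
    return n ? n->size : 0;
}

// Inserts key (duplicates allowed, equal keys go right) and bumps the size of
// every node on the path, since each of them gains one descendant.
void insert_key(std::unique_ptr<SizedNode>& root, int key) {
    std::unique_ptr<SizedNode>* slot = &root;
    while (*slot) {
        ++(*slot)->size;
        slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
    }
    *slot = std::make_unique<SizedNode>(key);
}

// Counts stored keys strictly less than value in O(height).
std::size_t count_less(const std::unique_ptr<SizedNode>& root, int value) {
    std::size_t count = 0;
    const SizedNode* cur = root.get();
    while (cur) {
        if (value <= cur->key) {
            cur = cur->left.get();  // this node and its right subtree are all >= value
        } else {
            count += subtree_size(cur->left) + 1;  // the left subtree and this node are < value
            cur = cur->right.get();
        }
    }
    return count;
}
```

With the keys 50, 30, 70, 20, 40, 60, 80, 40 inserted, `count_less(root, 40)` returns 2 and `count_less(root, 45)` returns 4.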

how is std::set (red/black tree) forward iteration implemented?

If I did an in-order traversal of a balanced BST from smallest to largest value, I'd use a DFS which maintains a stack of size lg(n). But if I needed to find the in-order successor of an arbitrary node, it's a worst-case lg(n) operation. So if I wanted to iterate in order, I'd need to find the in-order successor repeatedly for each node, yielding O(n*lg(n)). Does std::set use some trick for in-order iteration, or does it really cost O(n*lg(n)), or is the time cost amortized somehow?
There is no trick in the in-order iteration; all you need is an O(1) mechanism for finding the parent of the current node.
The in-order scan traverses each parent-child edge exactly twice: once from parent-to-child and once from child-to-parent. Since there are the same number of edges as non-root nodes in the tree, it follows that the complete iteration performs Θ(n) transitions to iterate over n nodes, which is amortised constant time per node.
The usual way to find the parent is to store a parent link in the node. The extra link certainly increases the size of a node (four pointers instead of three) but the cost is within reason.
If it were not for the C++ iterator invalidation rules, which require that iterators to elements of ordered associative containers must not be invalidated by insertion or deletion of other elements, it would be possible to maintain a stack of size O(log n) in the iterator. Such an iterator would be bulky but not unmanageably so (since log n is limited in practice to a smallish integer), but it would not be usable after any tree modification.
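The edge-counting argument can be checked with a small sketch that uses parent links. Everything here (`PNode`, `insert_key`, `successor`, `walk_in_order`, the `hops` counter) is hypothetical and exists only to count link traversals; the tree is unbalanced, which does not affect the count.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct PNode {
    int key;
    PNode* parent = nullptr;  // the extra link that makes stackless iteration possible
    std::unique_ptr<PNode> left, right;
    explicit PNode(int k) : key(k) {}
};

// Unbalanced BST insertion that also sets the parent link; duplicates are ignored.
void insert_key(std::unique_ptr<PNode>& root, int key) {
    PNode* parent = nullptr;
    std::unique_ptr<PNode>* slot = &root;
    while (*slot) {
        if (key == (*slot)->key) return;
        parent = slot->get();
        slot = key < parent->key ? &parent->left : &parent->right;
    }
    *slot = std::make_unique<PNode>(key);
    (*slot)->parent = parent;
}

// In-order successor of n, or nullptr after the last node.
// Adds the number of parent/child links followed to hops.
const PNode* successor(const PNode* n, std::size_t& hops) {
    assert(n != nullptr);
    if (n->right) {  // one step right, then as far left as possible
        n = n->right.get(); ++hops;
        while (n->left) { n = n->left.get(); ++hops; }
        return n;
    }
    const PNode* p = n->parent;  // otherwise climb while we are a right child
    while (p && n == p->right.get()) { n = p; p = p->parent; ++hops; }
    if (p) ++hops;  // final step up from a left child to its parent
    return p;
}

struct Walk {
    std::vector<int> keys;
    std::size_t hops = 0;
};

// Visits every node in order, starting from the leftmost node.
Walk walk_in_order(const PNode* root) {
    Walk w;
    const PNode* n = root;
    while (n && n->left) n = n->left.get();  // a real container would cache this node
    while (n) { w.keys.push_back(n->key); n = successor(n, w.hops); }
    return w;
}
```

For a tree built from 4, 2, 6, 1, 3, 5 the walk visits the keys in sorted order, and `hops` stays within 2 * (n - 1) = 10 because no edge is crossed more than once in each direction.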

How do multimaps internally handle duplicate keys?

With maps, I can understand it being implemented as a binary search tree (a red/black tree, for example) and the time complexity of it.
But with multimaps, how are key collisions handled internally? Is it that a list is maintained for all the nodes with same keys? Or some other handling is undertaken. I came across a situation where I could use either a map<int,vector<strings>> or a multimap<int,string> and would like to know the tradeoffs.
The C++ spec doesn't give a specific implementation for std::multimap, but instead gives requirements on how fast the operations on std::multimap should be and what guarantees should hold on those operations. For example, insert on a multimap needs to insert the key/value pair into the multimap and has to do so in a way that makes it come after all existing entries with the same key. That has to work in time O(log n), and specifically amortized O(1) if the insertion occurs with a hint and the hint is the spot right before where the element should go. With just this information, the multimap could work by having a red/black tree with many nodes, one per key/value pair, or it could be a red/black tree storing a vector of values for each key. (This rules out an AVL tree, though, because the rotations involved in an AVL tree insertion don't run in amortized O(1) time. However, it also permits things like 2-3-4 trees or deterministic skiplists.)
As we add more requirements, though, certain implementations get ruled out. For example, the erase operation needs to run in amortized constant time if given an iterator to the element to erase. That rules out the use of a single node with a key and a vector of values, but it doesn't rule out a single node with a key and a doubly-linked list of values. The iterator type needs to be able to dereference to a value_type, which needs to match the underlying allocator's value_type. This rules out the possibility that you could have individual nodes in a red/black tree with a single key and a linked list of values, since you couldn't obtain a reference to a value_type that way.
Overall, the restrictions are such that one permissible implementation is a red/black tree with one node per key/value pair, though others may be possible as well. Other ideas - like using an AVL tree, or coalescing values for a given key into a vector or list - aren't possible.
Hope this helps!
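For the practical side of the question, the two candidate containers can be compared with a small sketch. The overloaded helper `values_for` is a made-up name; the container member functions it calls (`equal_range`, `find`) are standard.

```cpp
#include <map>
#include <string>
#include <vector>

// Collects the values stored under key in a multimap, in iteration order.
// Each key/value pair is a separate element, found via equal_range.
std::vector<std::string> values_for(const std::multimap<int, std::string>& mm, int key) {
    std::vector<std::string> out;
    auto range = mm.equal_range(key);
    for (auto it = range.first; it != range.second; ++it) out.push_back(it->second);
    return out;
}

// Same lookup on the map-of-vectors alternative: one element per key,
// with all of that key's values held in the mapped vector.
std::vector<std::string> values_for(const std::map<int, std::vector<std::string>>& m, int key) {
    auto it = m.find(key);
    return it == m.end() ? std::vector<std::string>{} : it->second;
}
```

One visible difference: after adding ("a" under 1), ("x" under 2), ("b" under 1), the multimap reports `size() == 3` while the map of vectors reports `size() == 2`. Both return "a", "b" for key 1, since the multimap inserts a new entry after existing entries with an equal key.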

Binary Heap vs Binary Tree C++

I am having some confusion on the runtimes of the find_min operation on a Binary Search Tree and a Binary Heap. I understand that returning the min in a Binary Heap is an O(1) operation. I also understand why, in theory, returning the minimum element in a Binary Search Tree is an O(log(N)) operation. Much to my surprise, when I read up on the data structure in the C++ STL, the documentation states that returning an iterator to the first element in a map (which is the same as returning the minimum element) occurs in constant time! Shouldn't this be returned in logarithmic time? I need someone to help me understand what C++ is doing under the hood to return this in constant time. Because then, there is no point in really using a binary heap in C++: the map data structure would support retrieving both min and max in constant time, delete and search in O(log(N)), and keep everything sorted. This means that the data structure has the benefits of both a BST and a Binary Heap all tied up in one!
I had an argument about this with an interviewer (not really an argument), but I was trying to explain to him that in C++, returning min and max from a map (which is a self-balancing binary search tree) occurs in constant time. He was baffled and kept saying I was wrong and that a binary heap was the way to go. Clarification would be much appreciated.
The constant-time lookup of the minimum and maximum is achieved by storing references to the leftmost and the rightmost nodes of the RB-tree in the header structure of the map. Here is a comment from the source code of the RB-tree, a template from which the implementation of std::set, std::map, and std::multimap are derived:
the header cell is maintained with links not only to the root but also to the leftmost node of the tree, to enable constant time begin(), and to the rightmost node of the tree, to enable linear time performance when used with the generic set algorithms (set_union, etc.)
The tradeoff here is that these pointers need to be maintained, so insertion and deletion operations would have another "housekeeping operation" to do. However, insertions and deletions are already done in logarithmic time, so there is no additional asymptotic cost for maintaining these pointers.
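From the caller's side, the two cached ends are reached through begin() and rbegin(). The wrappers `min_key` and `max_key` below are hypothetical helpers for illustration; how the library finds those nodes internally is implementation-specific, as the quoted comment describes for one implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

// Smallest key: begin() is a constant-time call.
int min_key(const std::map<int, std::string>& m) {
    assert(!m.empty());
    return m.begin()->first;
}

// Largest key: rbegin() is also a constant-time call; dereferencing it
// reads the element just before end().
int max_key(const std::map<int, std::string>& m) {
    assert(!m.empty());
    return m.rbegin()->first;
}
```

For a map filled with the keys 3, 1 and 2 in that order, `min_key` returns 1 and `max_key` returns 3, with no search through the tree in user code.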
At least in a typical implementation, std::set (and std::map) will be implemented as a threaded binary tree [1]. In other words, each node contains not only a pointer to its (up to) two children, but also to the previous and next node in order. The set class itself then has pointers to not only the root of the tree, but also to the beginning and end of the threaded list of nodes.
To search for a node by key, the normal binary pointers are used. To traverse the tree in order, the threading pointers are used.
This does have a number of disadvantages compared to a binary heap. The most obvious is that it stores four pointers for each data item, where a binary heap can store just data, with no pointers (the relationships between nodes are implicit in the positions of the data). In an extreme case (e.g., std::set<char>) this could end up using a lot more storage for the pointers than for the data you actually care about (e.g., on a 64-bit system you could end up with 4 pointers of 64-bits apiece, to store each 8-bit char). This can lead to poor cache utilization, which (in turn) tends to hurt speed.
In addition, each node will typically be allocated individually, which can reduce locality of reference quite badly, again hurting cache usage and further reducing speed.
As such, even though the threaded tree can find the minimum or maximum, or traverse to the next or previous node in O(1), and search for any given item in O(log N), the constants can be substantially higher than doing the same with a heap. Depending on the size of items being stored, total storage used may be substantially greater than with a heap as well (worst case is obviously when only a little data is stored in each node).
1. With some balancing algorithm applied, most often red-black, but sometimes AVL trees or B-trees. Any number of other balanced trees could be used (e.g., alpha-balanced trees, k-neighbor trees, binary b-trees, general balanced trees).
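To make the storage comparison concrete, here is a small sketch that puts the two standard containers side by side. `MinHeap`, `heap_min`, `set_min` and `drain` are names chosen for this example; std::priority_queue with std::greater is used as the binary min-heap.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <set>
#include <vector>

// Binary min-heap stored in one contiguous std::vector: no per-element pointers.
using MinHeap = std::priority_queue<int, std::vector<int>, std::greater<int>>;

// O(1): the minimum sits at the front of the heap's array.
int heap_min(const MinHeap& h) {
    assert(!h.empty());
    return h.top();
}

// Also a constant-time call, but every element lives in its own tree node.
int set_min(const std::multiset<int>& s) {
    assert(!s.empty());
    return *s.begin();
}

// Pops everything from a copy of the heap; elements come out in ascending order,
// at O(log n) per pop, because the heap itself is not stored fully sorted.
std::vector<int> drain(MinHeap h) {
    std::vector<int> out;
    while (!h.empty()) { out.push_back(h.top()); h.pop(); }
    return out;
}
```

Both containers hand back the minimum cheaply; the heap cannot be iterated in sorted order without popping, while the set can, which is the functionality the extra pointers pay for.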
I'm not an expert at maps, but returning the first element of a map would be considered a 'root' of sorts. There is always a pointer to it, so the lookup time for it would be instant. The same would go for a BSTree, as it clearly has a root node, then 2 nodes off of it, and so on (which, by the way, I would look into using an AVL Tree, as the lookup time for the worst-case scenario is much better than the BSTree).
The O(log(N)) is normally only used to get an estimate of the worst-case scenario. So if you have a completely unbalanced BSTree you'll actually have O(N), so if you're searching for the last node you have to do a comparison with every node.
I'm not too sure about your last statement, though: a map is different from a self-balancing tree; those are called AVL Trees (or that's what I was taught). A map uses 'keys' to organize objects in a specific way. The key is found by serializing the data into a number, and the number is for the most part placed in a list.