C++ STL "set" and "map" support logarithmic worst-case time for insert,
erase, and find operations. Consequently, the underlying implementation is
a binary search tree. An important issue in implementing set and map is
providing support for the iterator class. Of course, internally the
iterator maintains a pointer to the "current" node in the iteration.
The hard part is efficiently advancing to the next node.
My questions are:
If "set" and "map" implement a binary search tree, how do we advance to the next node using the iterator class? That is, what do ++ and -- return: is it the left subtree or the right subtree?
In general, how do most STL implementations implement the BST, and how is ++ or -- of the iterator implemented?
Thanks!
There is nothing in the C++ specification that requires the use of a binary tree. Because of this, ++ and -- are defined in terms of the sequence of elements. map and set store an ordered sequence of elements. Therefore, ++ will go to the next element, and -- will go to the previous element. The next and previous elements are defined by the order of the elements.
Note that while the C++ specification doesn't require the use of a binary tree, the particular performance requirements pretty much force some use of a binary tree.
Generally, they use some form of Red/Black self-balancing binary tree.
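To make the "defined in terms of the sequence" point concrete, here is a small sketch that walks a std::map with ++ and with --. The helper names keys_forward and keys_backward are made up for this illustration; only the standard std::map interface is used.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Collect the keys by walking the map forward with ++.
std::vector<int> keys_forward(const std::map<int, std::string>& m) {
    std::vector<int> out;
    for (auto it = m.begin(); it != m.end(); ++it) out.push_back(it->first);
    return out;
}

// Collect the keys by walking backward from end() with --.
std::vector<int> keys_backward(const std::map<int, std::string>& m) {
    std::vector<int> out;
    for (auto it = m.end(); it != m.begin();) {
        --it;  // step to the previous element in key order
        out.push_back(it->first);
    }
    return out;
}
```

Whatever order the elements were inserted in, the forward walk yields ascending keys and the backward walk yields descending keys, regardless of how the library lays out its nodes.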
It mostly depends on the particular implementation.
One way (my description only: not necessarily the one you have) can be the following:
a node has 3 pointers: a left, a right and an up one.
what ++ does is:
if there is a "right", go right, then go as far left as possible.
Otherwise, keep going up for as long as you arrive from a right child, then go up once more.
what -- does is exactly the inverse.
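The rules above can be sketched as follows. This is only an illustration of the description, not the code of any particular standard library; the Node layout and the names successor, predecessor, set_left and set_right are assumptions made for the example.

```cpp
#include <cassert>

// Hypothetical node layout: a left, a right and an "up" (parent) pointer.
struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* up = nullptr;
    explicit Node(int k) : key(k) {}
};

// Small helpers to wire nodes together for the example.
void set_left(Node* parent, Node* child) { parent->left = child; child->up = parent; }
void set_right(Node* parent, Node* child) { parent->right = child; child->up = parent; }

// What ++ does: returns the in-order successor, or nullptr past the last node.
Node* successor(Node* n) {
    if (n->right) {
        n = n->right;                  // go right once...
        while (n->left) n = n->left;   // ...then as far left as possible
        return n;
    }
    // Climb while we arrive from a right child, then go up once more.
    while (n->up && n == n->up->right) n = n->up;
    return n->up;
}

// What -- does: the exact mirror image.
Node* predecessor(Node* n) {
    if (n->left) {
        n = n->left;
        while (n->right) n = n->right;
        return n;
    }
    while (n->up && n == n->up->left) n = n->up;
    return n->up;
}
```

Each step costs at most the height of the tree, but a full traversal touches every edge only twice, so iterating over all n elements is O(n) overall.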
Besides the worst case operational complexities both map & set (and all Ordered/Sorted Associative Containers in the C++ Standard Library) have a very important property: their elements are always sorted in order by key (according to a comparator).
A self-balancing binary search tree satisfies the operational complexities and the only traversal that will give you the elements in sorted order is, you've guessed it, the in-order one.
Nicol Bolas' answer gives you more details about the usual implementation. If you're curious to actually see such an implementation, you can take a gander at the RDE STL implementation.
It's a lot easier to read than what you'll find in your average C++ Standard Library implementation. As a side note, both set & map implementations in RDESTL have only forward iterator (not bidirectional as the standard says) but it should be pretty easy to figure out the -- part.
We can see from several sources that std::map is implemented using a red-black tree. It is my understanding that these types of data structures do not hold their elements in any particular order and just maintain the BST property and the height balancing requirements.
So, how is it that map::begin is constant time, and we are able to iterate over an ordered sequence?
Let's start from the premise that std::map maintains a BST internally (which is not strictly required by the standard, but most libraries probably do that, typically with a red-black tree).
In a BST, to find the smallest element, you would just follow the left branches until you reach a leaf, which is O(log(N)). However, if you want to deliver the "begin()" iterator in constant time, it is quite simple to just keep track, internally, of the smallest element. Every time an insertion causes the smallest element to change, you update it, that's all. It's memory overhead, of course, but that's a trade-off.
There are possibly other ways to single out the smallest element (like keeping the root node unbalanced on purpose). Either way, it's not hard to do.
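As a rough sketch of the "keep track of the smallest element" idea, the toy class below caches a pointer to the leftmost node on every insert. TinySet and its members are hypothetical names; the tree is deliberately unbalanced and has no erase, which a real container would also need to handle when updating the cached pointer.

```cpp
#include <cassert>
#include <memory>

// Hypothetical, unbalanced BST used only to illustrate caching the minimum.
class TinySet {
    struct Node {
        int key;
        std::unique_ptr<Node> left, right;
        explicit Node(int k) : key(k) {}
    };
    std::unique_ptr<Node> root_;
    Node* leftmost_ = nullptr;  // cached smallest node, updated on insert

public:
    // Inserts key; returns false if it was already present.
    bool insert(int key) {
        std::unique_ptr<Node>* slot = &root_;
        bool went_right = false;
        while (*slot) {
            if (key < (*slot)->key) {
                slot = &(*slot)->left;
            } else if ((*slot)->key < key) {
                slot = &(*slot)->right;
                went_right = true;
            } else {
                return false;  // duplicate
            }
        }
        slot->reset(new Node(key));
        // The new node is the new minimum only if the path never turned right.
        if (!went_right) leftmost_ = slot->get();
        return true;
    }

    bool empty() const { return !root_; }

    // O(1): no walk down the left spine. Precondition: !empty().
    int min() const { return leftmost_->key; }
};
```

The extra cost is one pointer per container plus one comparison flag per insert, which is the trade-off mentioned above.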
To iterate through the "ordered" sequence, you simply have to do an in-order traversal of the tree. Starting from the left-most node, you go (up), (right), (up, up), (right), and so on. It's a simple set of rules and it's easy to implement; just see a quick implementation of a simple BST inorder iterator that I wrote a while back. As you do the in-order traversal, you will visit every node from the smallest to the biggest, in the correct order. In other words, it just gives you the illusion that the "array" is sorted, but in reality, it's the traversal that makes it look like it's sorted.
The balancing properties of a red-black tree allow you to insert a node, anywhere in the tree, at O(log N) cost. For typical std::map implementations, the container will keep the tree sorted, and whenever you insert a new node, insert it into the correct location to keep the tree sorted, and then rebalance the tree to maintain the red-black property.
So no, red-black trees are not inherently sorted.
RB trees are binary search trees. Binary search trees don't necessarily store their elements in any particular order, but you can always get an inorder traversal. I'm not sure how map::begin guarantees constant time, I'd assume this involves always remembering the path to the smallest element (normally it'd be O(log(n))).
I have created a binary tree structure to store a bounded volume hierarchy, to make it easier to use (and safer) I created two iterators to complement it: breadth-first and depth-first.
The breadth-first iterator is essentially a wrapper for the underlying QList. But I am stuck on the depth-first iterator (bidirectional only). I can handle the actual iteration around the tree, I just do not know how to create a past-the-end iterator.
I can't just use the QList::end() because there is no guarantee the lowest-level rightmost node is also the rightmost node of the whole tree. I'm reluctant to make a 'fake' BVH node that can be tested for because it will involve a large code change (and probably overhead) to have the various node management mechanisms ignore the fake node, and disable a lot of the tree building automation (for example the parent of the fake node will have to be told it is a leaf). But if this is the only way - then it is the only way.
Having looked briefly at qlist.h, it appears that you won't be able to use the same end() for both iteration types. But that's OK--you can use a null pointer or a static dummy or other techniques to make an end() iterator for your second iteration method. I don't see why this would have to impact a huge amount of other code (most of which should just refer to end() without knowing its implementation details).
Can you not just use null or something like this as end? This is at least what I would expect e.g. for a linked list structure.
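Here is a minimal sketch of the null-pointer approach, assuming each node stores a parent pointer. BvhNode, DepthFirstIterator and preorder_ids are hypothetical stand-ins, not the asker's types, and only pre-order ++ is shown. To support -- from the end position, the iterator would additionally have to remember the root so it can find the last node.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the real BVH node type.
struct BvhNode {
    int id;
    BvhNode* parent = nullptr;
    BvhNode* left = nullptr;
    BvhNode* right = nullptr;
    explicit BvhNode(int i) : id(i) {}
};

// Pre-order depth-first iterator; a null current node means "past the end".
class DepthFirstIterator {
    BvhNode* cur_;

public:
    explicit DepthFirstIterator(BvhNode* n = nullptr) : cur_(n) {}

    BvhNode& operator*() const { return *cur_; }
    bool operator!=(const DepthFirstIterator& o) const { return cur_ != o.cur_; }

    DepthFirstIterator& operator++() {
        if (cur_->left) { cur_ = cur_->left; return *this; }
        if (cur_->right) { cur_ = cur_->right; return *this; }
        // Leaf: climb until some ancestor has an unvisited right subtree.
        BvhNode* n = cur_;
        while (n->parent && (n == n->parent->right || !n->parent->right))
            n = n->parent;
        cur_ = n->parent ? n->parent->right : nullptr;  // nullptr == end
        return *this;
    }
};

// Example use: collect node ids in depth-first (pre-order) order.
std::vector<int> preorder_ids(BvhNode* root) {
    std::vector<int> out;
    for (DepthFirstIterator it(root), end; it != end; ++it)
        out.push_back((*it).id);
    return out;
}
```

No fake node is added to the tree, so the node management code does not need to know about the end position at all.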
Lately I've been studying Patricia tries, and working with a really good C++ implementation which can be used as an STL Sorted Associative Container. Patricia tries differ from normal binary trees because leaf nodes have back pointers which point back to internal nodes. Nonetheless, it's possible to traverse a Patricia trie in alphabetical order by doing an in-order traversal, if you only visit internal nodes through leaf-node back pointers.
Which brings me to the question: is it possible to implement the STL lower_bound and upper_bound functions with a Patricia trie? The implementation I'm using does in fact, implement these functions, but they don't work as expected.
For example:
typedef uxn::patl::trie_set<std::string> trie;
trie ts;
ts.insert("LR");
ts.insert("BLQ");
ts.insert("HCDA");
trie::iterator it = ts.lower_bound("GG");
std::cout << *it << std::endl;
This outputs BLQ, when I would expect it to output HCDA. (An std::set, for example, would certainly output HCDA here.)
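For comparison, the expected behaviour can be checked against std::set, whose lower_bound returns the first element that is not less than the key. The helper name lower_bound_or_empty is made up for this sketch.

```cpp
#include <cassert>
#include <set>
#include <string>

// Returns the first element >= key, or an empty string if there is none.
std::string lower_bound_or_empty(const std::set<std::string>& s,
                                 const std::string& key) {
    auto it = s.lower_bound(key);
    return it == s.end() ? std::string() : *it;
}
```

With the three strings from the example, the sorted order is BLQ, HCDA, LR, so a lookup for "GG" lands on HCDA.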
I emailed the developer who made this library, but never got a response. Regardless, I feel I have a pretty good understanding of how Patricia tries work, and I can't figure out how something like lower_bound would even be possible. The problem is that lower_bound seems to rely on the ability to lexicographically compare the two strings. Since "GG" doesn't exist in the tree, we'd need to find out which element is >= to GG. But Radix/Patricia tries don't use lexicographical comparison to move from node to node; rather each node stores a bit index which is used to perform a bit comparison on the search key. The result of the bit comparison tells you whether to move left or right. This makes it easy to find a particular prefix in the tree. But if the prefix doesn't exist in the tree, (as in the case of my search for "GG"), there doesn't seem to be any way, short of a lexicographical comparison, to get the lower_bound.
The fact that the C++ implementation I'm using doesn't seem to implement lower_bound properly confirms my suspicion that it may not be possible. Still, the fact that you can iterate over the tree in alphabetical order makes me think there might be a way to do it.
Does anyone have experience with this, or know if it is possible to implement a lower_bound functionality with a Patricia Trie?
Yes, it is possible. I have implemented a variant which does this, and D. J. Bernstein's page describes that as one of the fast operations.
http://cr.yp.to/critbit.html
In principle, you keep matching the prefix until you can't match any more, and then you go to the next value, and there's the node you're after.
I would like to know how a set is implemented in C++. If I were to implement my own set container without using the STL provided container, what would be the best way to go about this task?
I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array?
Also, how does insert() work for a set? How does the set check whether an element already exists in it?
I read on wikipedia that another way to implement a set is with a hash table. How would this work?
Step debug into g++ 6.4 libstdc++ source
Did you know that on Ubuntu's 16.04 default g++-6 package or a GCC 6.4 build from source you can step into the C++ library without any further setup?
By doing that we easily conclude that a red-black tree is used in this implementation.
This makes sense, since std::set can be traversed in order, which would not be efficient if a hash map were used.
main.cpp
#include <cassert>
#include <set>

int main() {
    std::set<int> s;
    s.insert(1);
    s.insert(2);
    // Present elements are found; a missing one yields end().
    assert(s.find(1) != s.end());
    assert(s.find(2) != s.end());
    assert(s.find(3) == s.end());
}
Compile and debug:
g++ -g -std=c++11 -O0 -o main.out main.cpp
gdb -ex 'start' -q --args main.out
Now, if you step into s.insert(1) you immediately reach /usr/include/c++/6/bits/stl_set.h:
#if __cplusplus >= 201103L
      std::pair<iterator, bool>
      insert(value_type&& __x)
      {
        std::pair<typename _Rep_type::iterator, bool> __p =
          _M_t._M_insert_unique(std::move(__x));
        return std::pair<iterator, bool>(__p.first, __p.second);
      }
#endif
which clearly just forwards to _M_t._M_insert_unique.
So we open the source file in vim and find the definition of _M_t:
typedef _Rb_tree<key_type, value_type, _Identity<value_type>,
key_compare, _Key_alloc_type> _Rep_type;
_Rep_type _M_t; // Red-black tree representing set.
So _M_t is of type _Rep_type and _Rep_type is a _Rb_tree.
OK, now that is enough evidence for me. If you don't believe that _Rb_tree is a red-black tree, step a bit further and read the algorithm.
unordered_set uses hash table
Same procedure, but replace set with unordered_set in the code.
This makes sense, since std::unordered_set cannot be traversed in order, so the standard library chose hash map instead of Red-black tree, since hash map has a better amortized insert time complexity.
Stepping into insert leads to /usr/include/c++/6/bits/unordered_set.h:
      std::pair<iterator, bool>
      insert(value_type&& __x)
      { return _M_h.insert(std::move(__x)); }
So we open the source file in vim and search for _M_h:
typedef __uset_hashtable<_Value, _Hash, _Pred, _Alloc> _Hashtable;
_Hashtable _M_h;
So hash table it is.
std::map and std::unordered_map
Analogous to std::set vs std::unordered_set; see: What data structure is inside std::map in C++?
Performance characteristics
You could also infer the data structure used by timing them:
The graph generation procedure and a Heap vs BST analysis are at: Heap vs Binary Search Tree (BST)
We clearly see for:
std::set, a logarithmic insertion time
std::unordered_set, a more complex hashmap pattern:
on the non-zoomed plot, we clearly see the backing dynamic array doubling on huge one off linearly increasing spikes
on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the std::map, except for very small map sizes
Several strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to average linearly increasing linked list walks within each bin. Then when the array doubles, we have more bins, so shorter walks.
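The growth of the bucket array can also be observed directly, without timing anything, through the standard bucket_count() member. The sketch below records each distinct bucket count seen while inserting; bucket_growth is a made-up helper name, and the exact sizes (and whether they double or follow some other sequence) depend on the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Insert 0..n-1 and record every distinct bucket count seen along the way.
std::vector<std::size_t> bucket_growth(int n) {
    std::unordered_set<int> s;
    std::vector<std::size_t> counts{s.bucket_count()};
    for (int i = 0; i < n; ++i) {
        s.insert(i);
        // A change in bucket_count() means the table was rehashed.
        if (s.bucket_count() != counts.back())
            counts.push_back(s.bucket_count());
    }
    return counts;
}
```

Each recorded change corresponds to one rehash, which is the kind of event that would show up as a one-off spike in an insertion-time plot.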
As KTC said, how std::set is implemented can vary -- the C++ standard simply specifies an abstract data type. In other words, the standard does not specify how a container should be implemented, just what operations it is required to support. However, most implementations of the STL do, as far as I am aware, use red-black trees or other balanced binary search trees of some kind (GNU libstdc++, for instance, uses red-black trees).
While you could theoretically implement a set as a hash table and get faster asymptotic performance (amortized O(key length) versus O(log n) for lookup and insert), that would require having the user supply a hash function for whatever type they wanted to store (see Wikipedia's entry on hash tables for a good explanation of how they work). As for an implementation of a binary search tree, you wouldn't want to use an array -- as Raul mentioned, you would want some kind of Node data structure.
You could implement a binary search tree by first defining a Node struct:
struct Node
{
    void *nodeData;     // payload stored in this node
    Node *leftChild;    // subtree with smaller elements
    Node *rightChild;   // subtree with larger elements
};
Then, you could define a root of the tree with another Node *rootNode;
The Wikipedia entry on Binary Search Tree has a pretty good example of how to implement an insert method, so I would also recommend checking that out.
In terms of duplicates, they are generally not allowed in sets, so you could either just discard that input, throw an exception, etc, depending on your specification.
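A minimal sketch of such an insert, using the "discard duplicates" policy, might look like the following. It swaps the void* payload for a template parameter to keep the example type-safe, and the names TreeNode, bst_insert, bst_contains and bst_destroy are hypothetical. The tree is not balanced, so it does not give the logarithmic worst case that std::set guarantees.

```cpp
#include <cassert>

template <typename T>
struct TreeNode {
    T value;
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
    explicit TreeNode(const T& v) : value(v) {}
};

// Inserts value; returns false (and changes nothing) if it is already present.
template <typename T>
bool bst_insert(TreeNode<T>*& root, const T& value) {
    TreeNode<T>** slot = &root;
    while (*slot) {
        if (value < (*slot)->value) slot = &(*slot)->left;
        else if ((*slot)->value < value) slot = &(*slot)->right;
        else return false;  // equal element found: discard the duplicate
    }
    *slot = new TreeNode<T>(value);
    return true;
}

// Same walk as insert, but only reports whether the value exists.
template <typename T>
bool bst_contains(const TreeNode<T>* root, const T& value) {
    while (root) {
        if (value < root->value) root = root->left;
        else if (root->value < value) root = root->right;
        else return true;
    }
    return false;
}

// Frees every node of the tree.
template <typename T>
void bst_destroy(TreeNode<T>* root) {
    if (!root) return;
    bst_destroy(root->left);
    bst_destroy(root->right);
    delete root;
}
```

Only operator< is used for comparisons, which mirrors how std::set treats two elements as equivalent when neither is less than the other.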
I understand STL sets are based on the abstract data structure of a binary search tree.
So what is the underlying data structure? An array?
As others have pointed out, it varies. A set is commonly implemented as a tree (red-black tree, balanced tree, etc.), but other implementations may exist.
Also, how does insert() work for a set?
It depends on the underlying implementation of your set. If it is implemented as a binary tree, Wikipedia has a sample recursive implementation for the insert() function. You may want to check it out.
How does the set check whether an element already exists in it?
If it is implemented as a tree, then it walks down from the root, comparing the new element against each node on the path; if an equal element is found, the insert is rejected. Sets do not allow duplicate elements to be stored. If you want a set that allows duplicate elements, then you need a multiset.
I read on wikipedia that another way to implement a set is with a hash table. How would this work?
You may be referring to a hash_set, where the set is implemented using hash tables. You'll need to provide a hash function to know which location to store your element. This implementation is ideal when you want to be able to search for an element quickly. However, if it is important for your elements to be stored in particular order, then the tree implementation is more appropriate since you can traverse it preorder, inorder or postorder.
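With the standardized std::unordered_set, supplying the hash function for your own type could look like the sketch below. Point, PointHash, PointEqual and PointSet are made-up names for illustration, and the way the two hashes are combined is just one common mixing recipe, not something the standard prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

// Hypothetical user-defined key type.
struct Point {
    int x;
    int y;
};

// Equality predicate: decides whether two keys are the same element.
struct PointEqual {
    bool operator()(const Point& a, const Point& b) const {
        return a.x == b.x && a.y == b.y;
    }
};

// Hash function: decides which bucket an element is stored in.
struct PointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t h1 = std::hash<int>()(p.x);
        std::size_t h2 = std::hash<int>()(p.y);
        // Mix the two hashes so that (1, 2) and (2, 1) usually differ.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

using PointSet = std::unordered_set<Point, PointHash, PointEqual>;
```

Note that the tree-based std::set would instead need an ordering (operator< or a comparator) for Point, which is the other side of the trade-off described above.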
How a particular container is implemented in C++ is entirely implementation dependent. All that is required is for the result to meet the requirements set out in the standard, such as complexity requirement for the various methods, iterators requirements etc.
cppreference says:
Sets are usually implemented as red-black trees.
I checked, and both libc++ and libstdc++ do use red-black trees for std::set.
std::unordered_set was implemented with a hash table in libc++ and I presume the same for libstdc++ but didn't check.
Edit: Apparently my word is not good enough.
libc++: 1 2
libstdc++: 1
Chiming in on this, because I did not see anyone explicitly mention it... The C++ standard does not specify the data structure to use for std::set and std::map. What it does however specify is the run-time complexity of various operations. The requirements on computational complexity for the insert, delete and find operations more-or-less force an implementation to use a balanced tree algorithm.
There are two common algorithms for implementing balanced binary trees: Red-Black and AVL. Of the two, Red-Black is a little bit simpler of an implementation, requiring 1 less bit of storage per tree node (which hardly matters, since you're going to burn a byte on it in a simple implementation anyway), and is a little bit faster than AVL on node deletions (this is due to a more relaxed requirement on the balancing of the tree).
All of this, combined with std::map's requirement that the key & datum are stored in an std::pair force this all upon you without explicitly naming the data structure you must use for the container.
This all, in turn, is compounded by the C++17 additions to the containers that allow splicing of nodes from one tree into another.
I'm wondering if anyone can recommend a good C++ tree implementation, hopefully one that is STL compatible if at all possible.
For the record, I've written tree algorithms many times before, and I know it can be fun, but I want to be pragmatic and lazy if at all possible. So an actual link to a working solution is the goal here.
Note: I'm looking for a generic tree, not a balanced tree or a map/set, the structure itself and the connectivity of the tree is important in this case, not only the data within.
So each branch needs to be able to hold arbitrary amounts of data, and each branch should be separately iterable.
I don't know about your requirements, but wouldn't you be better off with a graph (implementations for example in Boost Graph) if you're interested mostly in the structure and not so much in tree-specific benefits like speed through balancing? You can 'emulate' a tree through a graph, and maybe it'll be (conceptually) closer to what you're looking for.
Take a look at this.
The tree.hh library for C++ provides an STL-like container class for n-ary trees, templated over the data stored at the nodes. Various types of iterators are provided (post-order, pre-order, and others). Where possible the access methods are compatible with the STL or alternative algorithms are available.
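To give a feel for what a generic n-ary tree involves, here is a bare-bones sketch. It is not the tree.hh API; NaryNode, add_child and preorder are hypothetical names, and a real library would provide proper iterators rather than a traversal function.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A node owns its data and any number of child subtrees.
template <typename T>
struct NaryNode {
    T data;
    std::vector<std::unique_ptr<NaryNode>> children;

    explicit NaryNode(T d) : data(std::move(d)) {}

    // Appends a child and returns a reference to it so branches can be chained.
    NaryNode& add_child(T d) {
        children.push_back(std::unique_ptr<NaryNode>(new NaryNode(std::move(d))));
        return *children.back();
    }
};

// Pre-order traversal of the subtree rooted at node: parent first, then children.
template <typename T>
void preorder(const NaryNode<T>& node, std::vector<T>& out) {
    out.push_back(node.data);
    for (const auto& child : node.children) preorder(*child, out);
}
```

Because preorder takes any node as its starting point, each branch can be walked on its own, which is the "separately iterable" property asked for in the question.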
HTH
I am going to suggest using std::map instead of a tree.
The complexity characteristics of a tree are:
Insert: O(ln(n))
Removal: O(ln(n))
Find: O(ln(n))
These are the same characteristics the std::map guarantees.
Thus as a result most implementations of std::map use a tree (Red-Black Tree) underneath the covers (though technically this is not required).
If you don't have (key, value) pairs, but simply keys, use std::set. That uses the same Red-Black tree as std::map.
Ok folks, I found another tree library; stlplus.ntree. But haven't tried it out yet.
Let us suppose the question is about balanced binary trees (in some form, mostly red-black trees), even if that is not the case.
Balanced binary trees, like a vector, can maintain an ordering of elements without any need for a key (for example by inserting elements anywhere, as in a vector), but:
With optimal O(log(n)) or better complexity for every modification of a single element (add/remove at the beginning, at the end, and before or after any iterator)
With persistence of iterators through any modification except direct destruction of the element pointed to by the iterator.
Optionally, one may support access by index as in a vector (at a cost of one size_t per element), with O(log(n)) complexity. If used, the iterators will be random access.
Optionally, order can be enforced by some comparison function, but the persistence of iterators allows the use of non-repeatable comparison schemes (e.g. arbitrary lane changes by cars during a traffic jam).
In practice, a balanced binary tree can offer the interface of vector, list, doubly linked list, map, multimap, deque, queue, priority_queue... while attaining the theoretically optimal O(log(n)) complexity for all single-element operations.
<sarcastic> this is probably why c++ stl does not propose it </sarcastic>
Individuals may not want to implement a general balanced tree by themselves, because it is difficult to get the balancing right, especially during element extraction.
There is no widely available implementation of a balanced binary tree because the known state-of-the-art red-black tree implementation (at this time the best type of balanced tree, due to the fixed number of costly tree reorganizations during removal), slavishly copied by every implementer from the initial code of the structure's inventor, does not allow iterator persistence. This is probably the reason for the absence of a fully functional tree template.