What is the underlying data structure of a STL set in C++? - c++

I would like to know how a set is implemented in C++. If I were to implement my own set container without using the STL provided container, what would be the best way to go about this task?
I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array?
Also, how does insert() work for a set? How does the set check whether an element already exists in it?
I read on wikipedia that another way to implement a set is with a hash table. How would this work?

Step debug into g++ 6.4 stdlibc++ source
Did you know that on Ubuntu's 16.04 default g++-6 package or a GCC 6.4 build from source you can step into the C++ library without any further setup?
By doing that we easily conclude that a Red-black tree used in this implementation.
This makes sense, since std::set can be traversed in order, which would not be efficient in if a hash map were used.
main.cpp
#include <cassert>
#include <set>
int main() {
std::set<int> s;
s.insert(1);
s.insert(2);
assert(s.find(1) != s.end());
assert(s.find(2) != s.end());
assert(s.find(3) == s3.end());
}
Compile and debug:
g++ -g -std=c++11 -O0 -o main.out main.cpp
gdb -ex 'start' -q --args main.out
Now, if you step into s.insert(1) you immediately reach /usr/include/c++/6/bits/stl_set.h:
487 #if __cplusplus >= 201103L
488 std::pair<iterator, bool>
489 insert(value_type&& __x)
490 {
491 std::pair<typename _Rep_type::iterator, bool> __p =
492 _M_t._M_insert_unique(std::move(__x));
493 return std::pair<iterator, bool>(__p.first, __p.second);
494 }
495 #endif
which clearly just forwards to _M_t._M_insert_unique.
So we open the source file in vim and find the definition of _M_t:
typedef _Rb_tree<key_type, value_type, _Identity<value_type>,
key_compare, _Key_alloc_type> _Rep_type;
_Rep_type _M_t; // Red-black tree representing set.
So _M_t is of type _Rep_type and _Rep_type is a _Rb_tree.
OK, now that is enough evidence for me. If you don't believe that _Rb_tree is a Black-red tree, step a bit further and read the algorithm.
unordered_set uses hash table
Same procedure, but replace set with unordered_set on the code.
This makes sense, since std::unordered_set cannot be traversed in order, so the standard library chose hash map instead of Red-black tree, since hash map has a better amortized insert time complexity.
Stepping into insert leads to /usr/include/c++/6/bits/unordered_set.h:
415 std::pair<iterator, bool>
416 insert(value_type&& __x)
417 { return _M_h.insert(std::move(__x)); }
So we open the source file in vim and search for _M_h:
typedef __uset_hashtable<_Value, _Hash, _Pred, _Alloc> _Hashtable;
_Hashtable _M_h;
So hash table it is.
std::map and std::unordered_map
Analogous for std::set vs std:unordered_set: What data structure is inside std::map in C++?
Performance characteristics
You could also infer the data structure used by timing them:
Graph generation procedure and Heap vs BST analysis and at: Heap vs Binary Search Tree (BST)
We clearly see for:
std::set, a logarithmic insertion time
std::unordered_set, a more complex hashmap pattern:
on the non-zoomed plot, we clearly see the backing dynamic array doubling on huge one off linearly increasing spikes
on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the std::map, except for very small map sizes
Several strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to average linearly increasing linked list walks withing each bin. Then when the array doubles, we have more bins, so shorter walks.

As KTC said, how std::set is implemented can vary -- the C++ standard simply specifies an abstract data type. In other words, the standard does not specify how a container should be implemented, just what operations it is required to support. However, most implementations of the STL do, as far as I am aware, use red-black trees or other balanced binary search trees of some kind (GNU libstdc++, for instance, uses red-black trees).
While you could theoretically implement a set as a hash table and get faster asymptotic performance (amortized O(key length) versus O(log n) for lookup and insert), that would require having the user supply a hash function for whatever type they wanted to store (see Wikipedia's entry on hash tables for a good explanation of how they work). As for an implementation of a binary search tree, you wouldn't want to use an array -- as Raul mentioned, you would want some kind of Node data structure.

You could implement a binary search tree by first defining a Node struct:
struct Node
{
void *nodeData;
Node *leftChild;
Node *rightChild;
}
Then, you could define a root of the tree with another Node *rootNode;
The Wikipedia entry on Binary Search Tree has a pretty good example of how to implement an insert method, so I would also recommend checking that out.
In terms of duplicates, they are generally not allowed in sets, so you could either just discard that input, throw an exception, etc, depending on your specification.

I understand STL sets are based on the abstract data structure of a binary search tree.
So what is the underlying data structure? An array?
As others have pointed out, it varies. A set is commonly implemented as a tree (red-black tree, balanced tree, etc.) though but there may be other implementations that exist.
Also, how does insert() work for a
set?
It depends on the underlying implementation of your set. If it is implemented as a binary tree, Wikipedia has a sample recursive implementation for the insert() function. You may want to check it out.
How does the set check whether an
element already exists in it?
If it is implemented as a tree, then it traverses the tree and check each element. However, sets do not allow duplicate elements to be stored though. If you want a set that allows duplicate elements, then you need a multiset.
I read on wikipedia that another way
to implement a set is with a hash
table. How would this work?
You may be referring to a hash_set, where the set is implemented using hash tables. You'll need to provide a hash function to know which location to store your element. This implementation is ideal when you want to be able to search for an element quickly. However, if it is important for your elements to be stored in particular order, then the tree implementation is more appropriate since you can traverse it preorder, inorder or postorder.

How a particular container is implemented in C++ is entirely implementation dependent. All that is required is for the result to meet the requirements set out in the standard, such as complexity requirement for the various methods, iterators requirements etc.

cppreference says:
Sets are usually implemented as red-black trees.
I checked, and both libc++ and libstdc++ do use red-black trees for std::set.
std::unordered_set was implemented with a hash table in libc++ and I presume the same for libstdc++ but didn't check.
Edit: Apparently my word is not good enough.
libc++: 1 2
libstdc++: 1

Chiming in on this, because I did not see anyone explicitly mention it... The C++ standard does not specify the data structure to use for std::set and std::map. What it does however specify is the run-time complexity of various operations. The requirements on computational complexity for the insert, delete and find operations more-or-less force an implementation to use a balanced tree algorithm.
There are two common algorithms for implementing balanced binary trees: Red-Black and AVL. Of the two, Red-Black is a little bit simpler of an implementation, requiring 1 less bit of storage per tree node (which hardly matters, since you're going to burn a byte on it in a simple implementation anyway), and is a little bit faster than AVL on node deletions (this is due to a more relaxed requirement on the balancing of the tree).
All of this, combined with std::map's requirement that the key & datum are stored in an std::pair force this all upon you without explicitly naming the data structure you must use for the container.
This all, in turn is compounded by the c++14/17 supplemental features to the container that allow splicing of nodes from one tree into another.

Related

What data structure is inside std::map in C++?

I'm beginner and learning C++
Having hard times to understand std::map concepts, because the code I'm playing with implies that the map is a search tree, i.e. all the names of std::map objects have *tree in it as well as comments.
However after reading this material http://www.cprogramming.com/tutorial/stl/stlmap.html I tend to think that std::map has nothing to do with tree or hash.
So I`m confused -- either the variables and comment in the code lie to me, or the subject is more complex then I think it is :)
Step debug into g++ 6.4 stdlibc++ source
Did you know that on Ubuntu's 16.04 default g++-6 package or a GCC 6.4 build from source you can step into the C++ library without any further setup?
By doing that we easily conclude that a Red-black tree used in this implementation.
This makes sense, since std::map, unlike std::unordered_map, can be traversed in key order, which would not be efficient in if a hash map were used.
main.cpp
#include <cassert>
#include <map>
int main() {
std::map<int, int> m;
m.emplace(1, -1);
m.emplace(2, -2);
assert(m[1] == -1);
assert(m[2] == -2);
}
Compile and debug:
g++ -g -std=c++11 -O0 -o main.out main.cpp
gdb -ex 'start' -q --args main.out
Now, if you step into s.emplace(1, -1) you immediately reach /usr/include/c++/6/bits/stl_map.h:
556 template<typename... _Args>
557 std::pair<iterator, bool>
558 emplace(_Args&&... __args)
559 { return _M_t._M_emplace_unique(std::forward<_Args>(__args)...); }
which clearly just forwards to _M_t._M_emplace_unique.
So we open the source file in vim and find the definition of _M_t:
typedef _Rb_tree<key_type, value_type, _Select1st<value_type>,
key_compare, _Pair_alloc_type> _Rep_type;
/// The actual tree structure.
_Rep_type _M_t;
So _M_t is of type _Rep_type and _Rep_type is a _Rb_tree.
OK, now that is enough evidence for me. If you don't believe that _Rb_tree is a Black-red tree, step a bit further and read the algorithm
unordered_map uses hash table
Same procedure, but replace map with unordered_map on the code.
This makes sense, since std::unordered_map cannot be traversed in order, so the standard library chose hash map instead of Red-black tree, since hash map has a better amortized insert time complexity.
Stepping into emplace leads to /usr/include/c++/6/bits/unordered_map.h:
377 template<typename... _Args>
378 std::pair<iterator, bool>
379 emplace(_Args&&... __args)
380 { return _M_h.emplace(std::forward<_Args>(__args)...); }
So we open the source file in vim and search for the definition of _M_h:
typedef __umap_hashtable<_Key, _Tp, _Hash, _Pred, _Alloc> _Hashtable;
_Hashtable _M_h;
So hash table it is.
std::set and std::unordered_set
Analogous to std::map vs std::unordered_map: What is the underlying data structure of a STL set in C++?
Performance characteristics
You could also infer the data structure used by timing them:
Graph generation procedure and Heap vs BST analysis and at: Heap vs Binary Search Tree (BST)
Since std::map is analogous to std::set we clearly see for:
std::map, a logarithmic insertion time
std::unordered_map, a more complex hashmap pattern:
on the non-zoomed plot, we clearly see the backing dynamic array doubling on huge one off linearly increasing spikes
on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the std::map, except for very small map sizes
Several strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to average linearly increasing linked list walks withing each bin. Then when the array doubles, we have more bins, so shorter walks.
std::map is an associative container. The only requirement by the standard is that the container must have an associative container interface and behavior, the implementation is not defined. While the implementation fits the complexity and interface requirements, is a valid implementation.
On the other hand, std::map is usually implemented with a red-black tree, as the reference says.
Viewed externally a map is just an associative container: it behave externally as an "array" (supports an a[x] expression) where x can be whatever type (not necessarily integer) is "comparable by <" (hence ordered).
But:
Because x can be any value, it cannot be a plain array (otherwise it must support whatever index value: if you assign a[1] and a[100] you need also the 2..99 elements in the middle)
Because it has to to be fast in insert and find at whatever position, it cannot be a "linear" structure (otherwise elements shold be shifted, and finding must be sequential, and the requirements are "less then proportional finding time".
The most common implementation uses internally a self-balancing tree (each node is a key/value pair, and are linked togheter so that the left side has lower keys, and the right side has higer keys, so that seraching is re-conducted to a binary search), a multi-skip-list (fastest than tree in retrieval, slower in insert) or a hash-based table (where each x value is re-conducted to an index of an array)
As chris have written, the standard doesn't define the internal structure of the std::map or std::set. It defines the interface and complexity requirements for operations like insertion of an element. Those data structures of course may be implemented as trees. For example the implementation shipped with VisualStudio is based on a red-black tree.
Map internally uses self-balancing BST . Please have a look on this link.self-balancing binary search trees
I would say that if you think of a map as a pair you can't go wrong. Map can be implemented as a tree or a hash map, but the way it is implemented is not as important since you can be sure any implementation is STL is an efficient one.

BST implementations in STL

C++ STL "set" and "map" support logarthmic worst case time for insert,
erase, and find operations. Consequently underlying implementation is
Binary search tree. An important issue in implementing set and map is
providing support for iterator class. Ofcourse, internally the
iterator maintains a pointer to the "current" node in the iterator .
The hard point is efficiently advancing to next node.
My question are
if "set" and "map" implements binary search tree how we advance to next node using iterator class i.e., what we return ++ or -- i.e., is it left subtree or right subtree?
In general how most of the STL implementaions BST and how ++ or -- of iterator is implemented?
Thanks!
There is nothing in the C++ specification that requires the use of a binary tree. Because of this, ++ and -- are defined in terms of the sequence of elements. map and set store an ordered sequence of elements. Therefore, ++ will go to the next element, and -- will go to the previous element. The next and previous elements are defined by the order of the elements.
Note that while the C++ specification doesn't require the use of a binary tree, the particular performance requirements pretty much force some use of a binary tree.
Generally, they use some from of Red/Black self-balancing binary tree.
It mostly depends on the particular implementation.
One way (my description only: not necessarily the one you have) can be the following:
a node has 3 pointers: a left, a right and an up one.
what ++ does is:
if there is "right", go right than go deepest left as possible.
Otherwise go up until we came from the right, and up once again.
what -- does its exactly the inverse.
Besides the worst case operational complexities both map & set (and all Orderer/Sorted Associative Containers in the C++ Standard Library) have a very important property: their elements are always sorted in order by key (according to a comparator).
A self-balancing binary search tree satisfies the operational complexities and the only traversal that will give you the elements in sorted order is, you've guessed it, the in-order one.
Nicol Bolas' answer gives you more details about the usual implementation. If you're curios to actually see such an implementation, you can take a gander at the RDE STL implementation.
It's a lot easier to read than what you'll find in your average C++ Standard Library implementation. As a side note, both set & map implementations in RDESTL have only forward iterator (not bidirectional as the standard says) but it should be pretty easy to figure out the -- part.

What is the best data structure for a generic step function?

I need to implement a step function (i.e. piecewise constant). There are a few requirements that it will need to have.
It will have to be evaluated repeatedly at random locations, then evaluated sequentially over an interval.
It Will have to be easily updated, i.e. adding an increase/decrease over an interval.
So my question is what is the best data structure for this sort of thing? I was thinking that due to the random access nature a Binary tree is the most likely, but I'm hoping I'm not missing something. Also is there a good implementation already out there for C++.
Yes, since the intervals cannot overlap, binary tree is the right data structure. If you did not need fast updates, it could be as simple as having a sorted sequence of the interval endpoints; querying the function value at a certain location could be done using, for example, one of the STL binary search methods std::lower_bound or std::upper_bound, evaluating over a range using two searches. However, for fast updates a self-balancing binary tree is naturally better.
The standard library does define a container with tree-like access: std::map (although it is not required to be implemented using a binary tree, it usually is).
for completeness, it should be mentioned that Boost ICL library has an interval_map, a std::map-like container that uses intervals as keys and with aggregate on overlap insertion semantics.

What is the difference between set and hashset in C++ STL?

When should I choose one over the other?
Are there any pointers that you would recommend for using the right STL containers?
hash_set is an extension that is not part of the C++ standard. Lookups should be O(1) rather than O(log n) for set, so it will be faster in most circumstances.
Another difference will be seen when you iterate through the containers. set will deliver the contents in sorted order, while hash_set will be essentially random (Thanks Lou Franco).
Edit: The C++11 update to the C++ standard introduced unordered_set which should be preferred instead of hash_set. The performance will be similar and is guaranteed by the standard. The "unordered" in the name stresses that iterating it will produce results in no particular order.
stl::set is implemented as a binary search tree.
hashset is implemented as a hash table.
The main issue here is that many people use stl::set thinking it is a hash table with look-up of O(1), which it isn't, and doesn't have. It really has O(log(n)) for look-ups. Other than that, read about binary trees vs hash tables to get a better idea of the data structures.
Another thing to keep in mind is that with hash_set you have to provide the hash function, whereas a set only requires a comparison function ('<') which is easier to define (and predefined for native types).
I don't think anyone has answered the other part of the question yet.
The reason to use hash_set or unordered_set is the usually O(1) lookup time. I say usually because every so often, depending on implementation, a hash may have to be copied to a larger hash array, or a hash bucket may end up containing thousands of entries.
The reason to use a set is if you often need the largest or smallest member of a set. A hash has no order so there is no quick way to find the smallest item. A tree has order, so largest or smallest is very quick. O(log n) for a simple tree, O(1) if it holds pointers to the ends.
A hash_set would be implemented by a hash table, which has mostly O(1) operations, whereas a set is implemented by a tree of some sort (AVL, red black, etc.) which have O(log n) operations, but are in sorted order.
Edit: I had written that trees are O(n). That's completely wrong.

What's a good and stable C++ tree implementation?

I'm wondering if anyone can recommend a good C++ tree implementation, hopefully one that is
stl compatible if at all possible.
For the record, I've written tree algorithms many times before, and I know it can be fun, but I want to be pragmatic and lazy if at all possible. So an actual link to a working solution is the goal here.
Note: I'm looking for a generic tree, not a balanced tree or a map/set, the structure itself and the connectivity of the tree is important in this case, not only the data within.
So each branch needs to be able to hold arbitrary amounts of data, and each branch should be separately iterateable.
I don't know about your requirements, but wouldn't you be better off with a graph (implementations for example in Boost Graph) if you're interested mostly in the structure and not so much in tree-specific benefits like speed through balancing? You can 'emulate' a tree through a graph, and maybe it'll be (conceptually) closer to what you're looking for.
Take a look at this.
The tree.hh library for C++ provides an STL-like container class for n-ary trees, templated over the data stored at the nodes. Various types of iterators are provided (post-order, pre-order, and others). Where possible the access methods are compatible with the STL or alternative algorithms are available.
HTH
I am going to suggest using std::map instead of a tree.
The complexity characteristics of a tree are:
Insert: O(ln(n))
Removal: O(ln(n))
Find: O(ln(n))
These are the same characteristics the std::map guarantees.
Thus as a result most implementations of std::map use a tree (Red-Black Tree) underneath the covers (though technically this is not required).
If you don't have (key, value) pairs, but simply keys, use std::set. That uses the same Red-Black tree as std::map.
Ok folks, I found another tree library; stlplus.ntree. But haven't tried it out yet.
Let suppose the question is about balanced (in some form, mostly red black tree) binary trees, even if it is not the case.
Balanced binaries trees, like vector, allow to manage some ordering of elements without any need of key (like by inserting elements anywhere in vector), but :
With optimal O(log(n)) or better complexity for all the modification of one element (add/remove at begin, end and before & after any iterator)
With persistance of iterators thru any modifications except direct destruction of the element pointed by the iterator.
Optionally one may support access by index like in vector (with a cost of one size_t by element), with O(log(n)) complexity. If used, iterators will be random.
Optionally order can be enforced by some comparison func, but persistence of iterators allow to use non repeatable comparison scheme (ex: arbitrary car lanes change during traffic jam).
In practice, balanced binary tree have interface of vector, list, double linked list, map, multimap, deque, queue, priority_queue... with attaining theoretic optimal O(log(n)) complexity for all single element operations.
<sarcastic> this is probably why c++ stl does not propose it </sarcastic>
Individuals may not implement general balanced tree by themselves, due to the difficulties to get correct management of balancing, especially during element extraction.
There is no widely available implementation of balanced binary tree because the state of the art red black tree (at this time the best type of balanced tree due to fixed number of costly tree reorganizations during remove) know implementation, slavishly copied by every implementers’ from the initial code of the structure inventor, does not allow iterator persistency. It is probably the reason of the absence of fully functionnal tree template.