What data structure is inside std::map in C++? - c++

I'm a beginner learning C++.
I'm having a hard time understanding std::map, because the code I'm playing with implies that the map is a search tree: all the names of the std::map objects have *tree in them, and so do the comments.
However, after reading this material http://www.cprogramming.com/tutorial/stl/stlmap.html I tend to think that std::map has nothing to do with trees or hashes.
So I'm confused: either the variables and comments in the code lie to me, or the subject is more complex than I think it is :)

Step debug into g++ 6.4 libstdc++ source
Did you know that with Ubuntu 16.04's default g++-6 package or a GCC 6.4 build from source you can step into the C++ library without any further setup?
By doing that we easily conclude that a red-black tree is used in this implementation.
This makes sense, since std::map, unlike std::unordered_map, can be traversed in key order, which would not be efficient if a hash map were used.
main.cpp
#include <cassert>
#include <map>
int main() {
std::map<int, int> m;
m.emplace(1, -1);
m.emplace(2, -2);
assert(m[1] == -1);
assert(m[2] == -2);
}
Compile and debug:
g++ -g -std=c++11 -O0 -o main.out main.cpp
gdb -ex 'start' -q --args main.out
Now, if you step into m.emplace(1, -1) you immediately reach /usr/include/c++/6/bits/stl_map.h:
template<typename... _Args>
std::pair<iterator, bool>
emplace(_Args&&... __args)
{ return _M_t._M_emplace_unique(std::forward<_Args>(__args)...); }
which clearly just forwards to _M_t._M_emplace_unique.
So we open the source file in vim and find the definition of _M_t:
typedef _Rb_tree<key_type, value_type, _Select1st<value_type>,
key_compare, _Pair_alloc_type> _Rep_type;
/// The actual tree structure.
_Rep_type _M_t;
So _M_t is of type _Rep_type and _Rep_type is a _Rb_tree.
OK, now that is enough evidence for me. If you don't believe that _Rb_tree is a red-black tree, step a bit further and read the algorithm.
unordered_map uses hash table
Same procedure, but replace map with unordered_map in the code.
This makes sense, since std::unordered_map cannot be traversed in order, so the standard library chose a hash map instead of a red-black tree, since a hash map has a better amortized insert time complexity.
Stepping into emplace leads to /usr/include/c++/6/bits/unordered_map.h:
template<typename... _Args>
std::pair<iterator, bool>
emplace(_Args&&... __args)
{ return _M_h.emplace(std::forward<_Args>(__args)...); }
So we open the source file in vim and search for the definition of _M_h:
typedef __umap_hashtable<_Key, _Tp, _Hash, _Pred, _Alloc> _Hashtable;
_Hashtable _M_h;
So hash table it is.
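The hash table is also visible without a debugger, through the bucket interface that the standard requires of std::unordered_map. The following is only a minimal sketch; the helper name key_is_in_reported_bucket is hypothetical and not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: asks the container which bucket `key` hashes to and
// then walks only that bucket looking for the key.
bool key_is_in_reported_bucket(const std::unordered_map<std::string, int>& m,
                               const std::string& key) {
    const std::size_t b = m.bucket(key);   // index into the bucket array
    assert(b < m.bucket_count());
    for (auto it = m.cbegin(b); it != m.cend(b); ++it) {  // local iterators
        if (it->first == key) {
            return true;
        }
    }
    return false;
}
```

std::map has no bucket(), bucket_count() or load_factor() members, which is consistent with it not being a hash table.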
std::set and std::unordered_set
Analogous to std::map vs std::unordered_map: What is the underlying data structure of a STL set in C++?
Performance characteristics
You could also infer the data structure used by timing them:
The graph generation procedure and a Heap vs BST analysis are at: Heap vs Binary Search Tree (BST)
Since std::map is analogous to std::set we clearly see for:
std::map, a logarithmic insertion time
std::unordered_map, a more complex hashmap pattern:
on the non-zoomed plot, we clearly see the backing dynamic array doubling on huge one off linearly increasing spikes
on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the std::map, except for very small map sizes
Several strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to linked list walks within each bin that get linearly longer on average. Then when the array doubles, we have more bins, so shorter walks.
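The growth of the bucket array can be observed directly instead of being inferred from timings. This is a rough sketch under the assumption that we only care about when bucket_count() changes; the function name bucket_count_history is made up for illustration, and the exact values it records depend on the standard library implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Inserts the keys 0..n-1 and records every distinct bucket_count() seen.
// Each new entry in the result corresponds to one rehash of the table.
std::vector<std::size_t> bucket_count_history(int n) {
    std::unordered_map<int, int> m;
    std::vector<std::size_t> history;
    history.push_back(m.bucket_count());
    for (int i = 0; i < n; ++i) {
        m.emplace(i, -i);
        if (m.bucket_count() != history.back()) {
            history.push_back(m.bucket_count());  // the table was rehashed
        }
    }
    return history;
}
```

The history is short compared with the number of inserts, which matches the picture of rare, expensive spikes between long runs of cheap insertions.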

std::map is an associative container. The only requirement in the standard is that the container must have the associative container interface and behavior; the implementation is not defined. As long as an implementation meets the complexity and interface requirements, it is a valid implementation.
On the other hand, std::map is usually implemented with a red-black tree, as the reference says.
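One piece of that required behavior is that iterating a std::map visits the keys in sorted order, whatever order they were inserted in. A small sketch (the function name keys_in_iteration_order is just for illustration):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Collects the keys in the order in which the map's iterators visit them.
std::vector<int> keys_in_iteration_order(const std::map<int, std::string>& m) {
    std::vector<int> keys;
    for (const auto& kv : m) {
        keys.push_back(kv.first);
    }
    return keys;
}
```

Inserting the keys 3, 1, 2 and then calling this function yields 1, 2, 3. A balanced search tree gives that ordering naturally through an in-order walk.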

Viewed externally, a map is just an associative container: it behaves like an "array" (it supports an a[x] expression) where x can be of any type (not necessarily an integer) that is "comparable by <" (hence ordered).
But:
Because x can be any value, it cannot be a plain array (otherwise it would have to support every index value: if you assign a[1] and a[100] you also need the elements 2..99 in between).
Because it has to be fast at insert and find at any position, it cannot be a "linear" structure (otherwise elements would have to be shifted and finding would be sequential, while the requirement is a "less than proportional" finding time).
The most common implementations internally use a self-balancing tree (each node is a key/value pair, and the nodes are linked together so that the left side has lower keys and the right side has higher keys, so that searching reduces to a binary search), a multi-skip-list (faster than a tree in retrieval, slower in insert) or a hash-based table (where each x value is mapped to an index of an array).

As chris has written, the standard doesn't define the internal structure of std::map or std::set. It defines the interface and the complexity requirements for operations like insertion of an element. Those data structures may of course be implemented as trees. For example, the implementation shipped with Visual Studio is based on a red-black tree.

Map internally uses a self-balancing BST. Please have a look at this link: self-balancing binary search trees

I would say that if you think of a map as a collection of pairs you can't go wrong. A map can be implemented as a tree or a hash map, but the way it is implemented is not that important, since you can be sure any implementation in the STL is an efficient one.

Related

Finding the number of elements less than or equal to k in a multiset

I have a multiset, implemented as follows:
#include <bits/stdc++.h>
using namespace std;
multiset <int> M;
int numunder(int k){
/*this function must return the number of elements smaller than or equal to k
in M (taking multiplicity into account).
*/
}
At first I thought I could just return M.upper_bound(k)-M.begin()+1. Unfortunately it seems we cannot subtract pointers like that. We ended up having to implement an AVLNodes structure. Is there a way to get this to work taking advantage of the c++ std?
Sticking closely to your proposed M.upper_bound(k)-M.begin()+1 solution (which clearly does not compile, because the multiset iterator is a bidirectional iterator that does not implement operator-), you could use std::distance to get the distance between two multiset iterators. Note that the +1 should be dropped: upper_bound(k) already points one past the last element that is <= k, so its distance from begin() is exactly the count you want.
Note that this solution will have O(n) complexity, because if the iterator is not a random access iterator, std::distance will just increment the iterator passed in as first parameter, until it finds the iterator passed in as second argument.
I also don't really think that this problem can be solved in better than O(n) complexity with std::multiset.
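The std::distance approach described above can be sketched as follows. The function takes the multiset as a parameter instead of using the global M from the question; that change and the name num_under are my own choices for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Number of elements <= k, counting multiplicity.
// upper_bound(k) is O(log n), but std::distance walks the bidirectional
// iterators one step at a time, so the whole call is O(n).
std::size_t num_under(const std::multiset<int>& m, int k) {
    return static_cast<std::size_t>(std::distance(m.begin(), m.upper_bound(k)));
}
```

For the multiset {1, 2, 2, 3, 5}, num_under(m, 2) is 3 and num_under(m, 0) is 0.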
This can be solved using the policy-based data structures available in GCC. You can use the red-black tree with order statistics; here is a discussion
GCC implements multisets as red-black trees. In a binary tree there is no non-trivial way to get the "sorted index" of a node without storing extra info in the node, such as the number of nodes in its subtree.
Also know that iterating through the iterators returned by find, upper_bound, etc. will walk the tree, because the iterators are not random access. See https://en.cppreference.com/w/cpp/container/multiset
If you want to only use built-in data structures you could maintain a separate vector that you can perform binary search on. This is more organizational work but if you are only inserting or erasing then it is pretty simple. Anything more complicated probably warrants its own data structure.
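A minimal sketch of the separate sorted vector idea, assuming int elements. The class name SortedBag and its member names are hypothetical. Queries are O(log n) thanks to binary search on random access iterators, while insert and erase are O(n) because the vector has to shift elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper class: a multiset-like bag kept as a sorted vector.
class SortedBag {
public:
    // Inserts x at a position that keeps the vector sorted.
    void insert(int x) {
        v_.insert(std::upper_bound(v_.begin(), v_.end(), x), x);
    }

    // Erases one occurrence of x; returns false if x is not present.
    bool erase_one(int x) {
        auto it = std::lower_bound(v_.begin(), v_.end(), x);
        if (it == v_.end() || *it != x) {
            return false;
        }
        v_.erase(it);
        return true;
    }

    // Number of elements <= k; iterator subtraction is valid for vectors.
    std::size_t count_less_equal(int k) const {
        return static_cast<std::size_t>(
            std::upper_bound(v_.begin(), v_.end(), k) - v_.begin());
    }

private:
    std::vector<int> v_;
};
```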

What is the difference between set and hashset in C++ STL?

When should I choose one over the other?
Are there any pointers that you would recommend for using the right STL containers?
hash_set is an extension that is not part of the C++ standard. Lookups should be O(1) rather than O(log n) for set, so it will be faster in most circumstances.
Another difference will be seen when you iterate through the containers. set will deliver the contents in sorted order, while hash_set will be essentially random (Thanks Lou Franco).
Edit: The C++11 update to the C++ standard introduced unordered_set which should be preferred instead of hash_set. The performance will be similar and is guaranteed by the standard. The "unordered" in the name stresses that iterating it will produce results in no particular order.
std::set is implemented as a binary search tree.
hash_set is implemented as a hash table.
The main issue here is that many people use std::set thinking it is a hash table with O(1) look-up, which it isn't. It really has O(log(n)) look-ups. Other than that, read about binary trees vs hash tables to get a better idea of the data structures.
Another thing to keep in mind is that with hash_set you have to provide the hash function, whereas a set only requires a comparison function ('<') which is easier to define (and predefined for native types).
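To illustrate the difference in what the user has to supply, here is a sketch for a made-up Point type, using the standard std::unordered_set in place of the non-standard hash_set. All the names (Point, PointLess, PointHash, PointEqual) are hypothetical, and the hash combination is deliberately simplistic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <tuple>
#include <unordered_set>

struct Point {
    int x;
    int y;
};

// std::set only needs a strict weak ordering on the elements.
struct PointLess {
    bool operator()(const Point& a, const Point& b) const {
        return std::tie(a.x, a.y) < std::tie(b.x, b.y);
    }
};

// std::unordered_set needs a hash function and an equality predicate.
struct PointHash {
    std::size_t operator()(const Point& p) const {
        // Simple illustrative combination, not a high-quality hash.
        return std::hash<int>()(p.x) * 31u + std::hash<int>()(p.y);
    }
};

struct PointEqual {
    bool operator()(const Point& a, const Point& b) const {
        return a.x == b.x && a.y == b.y;
    }
};

using PointSet = std::set<Point, PointLess>;
using PointHashSet = std::unordered_set<Point, PointHash, PointEqual>;
```

For built-in types and std::string, both std::less and std::hash are already provided, so neither container needs extra code.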
I don't think anyone has answered the other part of the question yet.
The reason to use hash_set or unordered_set is the usually O(1) lookup time. I say usually because every so often, depending on implementation, a hash may have to be copied to a larger hash array, or a hash bucket may end up containing thousands of entries.
The reason to use a set is if you often need the largest or smallest member of a set. A hash has no order so there is no quick way to find the smallest item. A tree has order, so largest or smallest is very quick. O(log n) for a simple tree, O(1) if it holds pointers to the ends.
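A short sketch of that difference; the function names are made up for the example. With std::set the extremes sit at the two ends of the iteration order, while with std::unordered_set every element has to be inspected.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <unordered_set>

// Smallest element of an ordered set: the first element in iteration order.
int smallest(const std::set<int>& s) {
    assert(!s.empty());
    return *s.begin();
}

// Largest element of an ordered set: the first element in reverse order.
int largest(const std::set<int>& s) {
    assert(!s.empty());
    return *s.rbegin();
}

// A hash set has no order, so finding the minimum is a linear scan.
int smallest_unordered(const std::unordered_set<int>& s) {
    assert(!s.empty());
    return *std::min_element(s.begin(), s.end());
}
```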
A hash_set would be implemented by a hash table, which has mostly O(1) operations, whereas a set is implemented by a tree of some sort (AVL, red black, etc.) which have O(log n) operations, but are in sorted order.
Edit: I had written that trees are O(n). That's completely wrong.

What is the underlying data structure of a STL set in C++?

I would like to know how a set is implemented in C++. If I were to implement my own set container without using the STL provided container, what would be the best way to go about this task?
I understand STL sets are based on the abstract data structure of a binary search tree. So what is the underlying data structure? An array?
Also, how does insert() work for a set? How does the set check whether an element already exists in it?
I read on wikipedia that another way to implement a set is with a hash table. How would this work?
Step debug into g++ 6.4 libstdc++ source
Did you know that with Ubuntu 16.04's default g++-6 package or a GCC 6.4 build from source you can step into the C++ library without any further setup?
By doing that we easily conclude that a red-black tree is used in this implementation.
This makes sense, since std::set can be traversed in order, which would not be efficient if a hash map were used.
main.cpp
#include <cassert>
#include <set>
int main() {
std::set<int> s;
s.insert(1);
s.insert(2);
assert(s.find(1) != s.end());
assert(s.find(2) != s.end());
assert(s.find(3) == s.end());
}
Compile and debug:
g++ -g -std=c++11 -O0 -o main.out main.cpp
gdb -ex 'start' -q --args main.out
Now, if you step into s.insert(1) you immediately reach /usr/include/c++/6/bits/stl_set.h:
#if __cplusplus >= 201103L
std::pair<iterator, bool>
insert(value_type&& __x)
{
std::pair<typename _Rep_type::iterator, bool> __p =
_M_t._M_insert_unique(std::move(__x));
return std::pair<iterator, bool>(__p.first, __p.second);
}
#endif
which clearly just forwards to _M_t._M_insert_unique.
So we open the source file in vim and find the definition of _M_t:
typedef _Rb_tree<key_type, value_type, _Identity<value_type>,
key_compare, _Key_alloc_type> _Rep_type;
_Rep_type _M_t; // Red-black tree representing set.
So _M_t is of type _Rep_type and _Rep_type is a _Rb_tree.
OK, now that is enough evidence for me. If you don't believe that _Rb_tree is a red-black tree, step a bit further and read the algorithm.
unordered_set uses hash table
Same procedure, but replace set with unordered_set in the code.
This makes sense, since std::unordered_set cannot be traversed in order, so the standard library chose a hash map instead of a red-black tree, since a hash map has a better amortized insert time complexity.
Stepping into insert leads to /usr/include/c++/6/bits/unordered_set.h:
std::pair<iterator, bool>
insert(value_type&& __x)
{ return _M_h.insert(std::move(__x)); }
So we open the source file in vim and search for _M_h:
typedef __uset_hashtable<_Value, _Hash, _Pred, _Alloc> _Hashtable;
_Hashtable _M_h;
So hash table it is.
std::map and std::unordered_map
Analogous to std::set vs std::unordered_set: What data structure is inside std::map in C++?
Performance characteristics
You could also infer the data structure used by timing them:
The graph generation procedure and a Heap vs BST analysis are at: Heap vs Binary Search Tree (BST)
We clearly see for:
std::set, a logarithmic insertion time
std::unordered_set, a more complex hashmap pattern:
on the non-zoomed plot, we clearly see the backing dynamic array doubling on huge one off linearly increasing spikes
on the zoomed plot, we see that the times are basically constant and going towards 250ns, therefore much faster than the std::set, except for very small set sizes
Several strips are clearly visible, and their inclination becomes smaller whenever the array doubles.
I believe this is due to linked list walks within each bin that get linearly longer on average. Then when the array doubles, we have more bins, so shorter walks.
As KTC said, how std::set is implemented can vary -- the C++ standard simply specifies an abstract data type. In other words, the standard does not specify how a container should be implemented, just what operations it is required to support. However, most implementations of the STL do, as far as I am aware, use red-black trees or other balanced binary search trees of some kind (GNU libstdc++, for instance, uses red-black trees).
While you could theoretically implement a set as a hash table and get faster asymptotic performance (amortized O(key length) versus O(log n) for lookup and insert), that would require having the user supply a hash function for whatever type they wanted to store (see Wikipedia's entry on hash tables for a good explanation of how they work). As for an implementation of a binary search tree, you wouldn't want to use an array -- as Raul mentioned, you would want some kind of Node data structure.
You could implement a binary search tree by first defining a Node struct:
struct Node
{
void *nodeData;
Node *leftChild;
Node *rightChild;
};
Then, you could define a root of the tree with another Node *rootNode;
The Wikipedia entry on Binary Search Tree has a pretty good example of how to implement an insert method, so I would also recommend checking that out.
In terms of duplicates, they are generally not allowed in sets, so you could either just discard that input, throw an exception, etc, depending on your specification.
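As a minimal sketch of how insert and the duplicate check could look, here is an unbalanced binary search tree of ints. It departs from the Node struct above in two ways that are my own assumptions: it stores an int instead of a void*, and it uses std::unique_ptr for the children so that the nodes are freed automatically. The names IntNode and IntTreeSet are hypothetical. A real std::set additionally rebalances the tree after modifications; that part is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct IntNode {
    int value;
    std::unique_ptr<IntNode> left;
    std::unique_ptr<IntNode> right;
    explicit IntNode(int v) : value(v) {}
};

class IntTreeSet {
public:
    // Walks down from the root; returns false if v is already stored.
    bool insert(int v) {
        std::unique_ptr<IntNode>* slot = &root_;
        while (*slot) {
            if (v < (*slot)->value) {
                slot = &(*slot)->left;
            } else if ((*slot)->value < v) {
                slot = &(*slot)->right;
            } else {
                return false;  // duplicate: discard the input
            }
        }
        slot->reset(new IntNode(v));  // attach the new leaf
        ++size_;
        return true;
    }

    // Same walk as insert, without modifying the tree.
    bool contains(int v) const {
        const IntNode* n = root_.get();
        while (n) {
            if (v < n->value) {
                n = n->left.get();
            } else if (n->value < v) {
                n = n->right.get();
            } else {
                return true;
            }
        }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    std::unique_ptr<IntNode> root_;
    std::size_t size_ = 0;
};
```

Without balancing, inserting already sorted input degenerates the tree into a linked list with O(n) operations, which is exactly what red-black or AVL rebalancing prevents.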
I understand STL sets are based on the abstract data structure of a binary search tree.
So what is the underlying data structure? An array?
As others have pointed out, it varies. A set is commonly implemented as a tree (red-black tree, balanced tree, etc.), but other implementations may exist.
Also, how does insert() work for a set?
It depends on the underlying implementation of your set. If it is implemented as a binary tree, Wikipedia has a sample recursive implementation for the insert() function. You may want to check it out.
How does the set check whether an element already exists in it?
If it is implemented as a tree, it walks down from the root, comparing the new element with each node on the search path; if it finds an equivalent element, the element is already present. Sets do not allow duplicate elements to be stored. If you want a set that allows duplicate elements, you need a multiset.
I read on wikipedia that another way to implement a set is with a hash table. How would this work?
You may be referring to a hash_set, where the set is implemented using hash tables. You'll need to provide a hash function to know which location to store your element. This implementation is ideal when you want to be able to search for an element quickly. However, if it is important for your elements to be stored in particular order, then the tree implementation is more appropriate since you can traverse it preorder, inorder or postorder.
How a particular container is implemented in C++ is entirely implementation dependent. All that is required is for the result to meet the requirements set out in the standard, such as complexity requirement for the various methods, iterators requirements etc.
cppreference says:
Sets are usually implemented as red-black trees.
I checked, and both libc++ and libstdc++ do use red-black trees for std::set.
std::unordered_set was implemented with a hash table in libc++ and I presume the same for libstdc++ but didn't check.
Edit: Apparently my word is not good enough.
libc++: 1 2
libstdc++: 1
Chiming in on this, because I did not see anyone explicitly mention it... The C++ standard does not specify the data structure to use for std::set and std::map. What it does however specify is the run-time complexity of various operations. The requirements on computational complexity for the insert, delete and find operations more-or-less force an implementation to use a balanced tree algorithm.
There are two common algorithms for implementing balanced binary trees: Red-Black and AVL. Of the two, Red-Black is a little bit simpler of an implementation, requiring 1 less bit of storage per tree node (which hardly matters, since you're going to burn a byte on it in a simple implementation anyway), and is a little bit faster than AVL on node deletions (this is due to a more relaxed requirement on the balancing of the tree).
All of this, combined with std::map's requirement that the key and datum are stored in a std::pair, effectively forces this design on an implementation without the standard explicitly naming the data structure you must use for the container.
This, in turn, is compounded by the C++14/17 supplemental features of the containers that allow splicing of nodes from one tree into another.
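The node splicing mentioned above corresponds to the extract and insert(node handle) members that std::set and std::map gained in C++17. A small sketch, assuming a C++17 compiler; the helper name move_node is made up for the example.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Moves the node holding `value` from `from` into `to` without copying or
// reallocating the element. Requires C++17. Returns false if `value` is not
// in `from`, or if `to` already holds an equal key (the node is put back).
bool move_node(std::set<int>& from, std::set<int>& to, int value) {
    auto node = from.extract(value);  // node handle; empty if not found
    if (node.empty()) {
        return false;
    }
    auto result = to.insert(std::move(node));
    if (!result.inserted) {
        // `to` rejected the node; hand it back to the original container.
        from.insert(std::move(result.node));
        return false;
    }
    return true;
}
```

An interface like this only makes sense for node-based containers, where each element lives in its own separately allocated node.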

What's a good and stable C++ tree implementation?

I'm wondering if anyone can recommend a good C++ tree implementation, hopefully one that is STL compatible if at all possible.
For the record, I've written tree algorithms many times before, and I know it can be fun, but I want to be pragmatic and lazy if at all possible. So an actual link to a working solution is the goal here.
Note: I'm looking for a generic tree, not a balanced tree or a map/set, the structure itself and the connectivity of the tree is important in this case, not only the data within.
So each branch needs to be able to hold arbitrary amounts of data, and each branch should be separately iterateable.
I don't know about your requirements, but wouldn't you be better off with a graph (implementations for example in Boost Graph) if you're interested mostly in the structure and not so much in tree-specific benefits like speed through balancing? You can 'emulate' a tree through a graph, and maybe it'll be (conceptually) closer to what you're looking for.
Take a look at this.
The tree.hh library for C++ provides an STL-like container class for n-ary trees, templated over the data stored at the nodes. Various types of iterators are provided (post-order, pre-order, and others). Where possible the access methods are compatible with the STL or alternative algorithms are available.
HTH
I am going to suggest using std::map instead of a tree.
The complexity characteristics of a tree are:
Insert: O(ln(n))
Removal: O(ln(n))
Find: O(ln(n))
These are the same characteristics the std::map guarantees.
Thus as a result most implementations of std::map use a tree (Red-Black Tree) underneath the covers (though technically this is not required).
If you don't have (key, value) pairs, but simply keys, use std::set. That uses the same Red-Black tree as std::map.
Ok folks, I found another tree library; stlplus.ntree. But haven't tried it out yet.
Let's suppose the question is about balanced binary trees (in some form, mostly red-black trees), even if that is not the case.
Balanced binary trees, like vector, can maintain an ordering of elements without any key (as when inserting elements anywhere in a vector), but:
With optimal O(log(n)) or better complexity for every modification of a single element (add/remove at the beginning, at the end, and before or after any iterator)
With persistence of iterators through any modification except direct destruction of the element pointed to by the iterator.
Optionally, one may support access by index as in vector (at a cost of one size_t per element), with O(log(n)) complexity. If used, the iterators will be random access.
Optionally, order can be enforced by some comparison function, but persistence of iterators allows the use of non-repeatable comparison schemes (e.g. arbitrary lane changes by cars during a traffic jam).
In practice, a balanced binary tree can offer the interface of vector, list, doubly linked list, map, multimap, deque, queue, priority_queue... while attaining the theoretically optimal O(log(n)) complexity for all single-element operations.
<sarcastic> this is probably why c++ stl does not propose it </sarcastic>
Individuals may not want to implement a general balanced tree themselves, because it is difficult to get the balancing right, especially during element removal.
There is no widely available implementation of a balanced binary tree because the known state-of-the-art red-black tree implementation (at this time the best type of balanced tree, due to the fixed number of costly tree reorganizations during removal), slavishly copied by every implementer from the initial code of the structure's inventor, does not allow iterator persistence. That is probably the reason for the absence of a fully functional tree template.

I don't understand std::tr1::unordered_map

I need an associative container that lets me index a certain object through a string, but that also keeps the order of insertion, so I can look for a specific object by its name or just iterate over the container and retrieve the objects in the same order I inserted them.
I think this hybrid of linked list and hash map should do the job. Before that, I tried std::tr1::unordered_map, thinking that it worked the way I described, but it didn't. So could someone explain the meaning and behavior of unordered_map?
@wesc: I'm sure std::map is implemented by the STL, while I'm sure std::hash_map is NOT in the STL (I think older versions of Visual Studio put it in a namespace called stdext).
@cristopher: so, if I get it right, the difference is in the implementation (and thus performance), not in the way it behaves externally.
You've asked for the canonical reason why Boost::MultiIndex was made: list insertion order with fast lookup by key. Boost MultiIndex tutorial: list fast lookup
You need to index an associative container two ways:
Insertion order
String comparison
Try Boost.MultiIndex or Boost.Intrusive. I haven't used it this way but I think it's possible.
Boost documentation of unordered containers
The difference is in the method of how you generate the look up.
In the map/set containers the operator< is used to generate an ordered tree.
In the unordered containers, an operator( key ) => index is used.
See hashing for a description of how that works.
Sorry, read your last comment wrong. Yes, hash_map is not in STL, map is. But unordered_map and hash_map are the same from what I've been reading.
map -> log (n) insertion, retrieval, iteration is efficient (and ordered by key comparison)
hash_map/unordered_map -> constant time insertion and retrieval, iteration time is not guaranteed to be efficient
Neither of these will work for you by themselves, since the map orders things based on the key content, and not the insertion sequence (unless your key contains info about the insertion sequence in it).
You'll have to do either what you described (list + hash_map), or create a key type that has the insertion sequence number plus an appropriate comparison function.
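The list + hash map combination can be sketched as below. It uses the C++11 std::unordered_map instead of the tr1 one, stores int values for brevity, and the class name InsertionOrderedMap and its members are hypothetical. The list keeps the insertion order, and the hash map points each key at its list node.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class InsertionOrderedMap {
public:
    // Appends (key, value); returns false if the key is already present.
    bool insert(const std::string& key, int value) {
        if (index_.count(key) != 0) {
            return false;
        }
        items_.emplace_back(key, value);
        index_.emplace(key, std::prev(items_.end()));  // list iterators stay valid
        return true;
    }

    // Average O(1) lookup by name; returns nullptr if the key is absent.
    const int* find(const std::string& key) const {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &it->second->second;
    }

    // Keys in the order in which they were inserted.
    std::vector<std::string> keys() const {
        std::vector<std::string> out;
        for (const auto& kv : items_) {
            out.push_back(kv.first);
        }
        return out;
    }

private:
    typedef std::list<std::pair<std::string, int> > Items;
    Items items_;
    std::unordered_map<std::string, Items::iterator> index_;
};
```

std::list is used because its iterators remain valid when other elements are inserted or erased, so the stored iterators never dangle while their element exists.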
I think that an unordered_map and hash_map are more or less the same thing. The difference is that the STL doesn't officially have a hash_map (what you're using is probably a compiler specific thing), so unordered_map is the fix for that omission.
unordered_map is just that... unordered. You can't depend on it preserving any ordering on iteration.
You sure that std::hash_map exists in all STL implementations? SGI STL implements it, however GNU g++ doesn't have it (it's located in the __gnu_cxx namespace) as of 4.3.1 anyway. As far as I know, hash_map has always been non-standard, and now tr1 is fixing that.
@wesc: STL has std::map... so what's the difference with unordered_map? I don't think the STL would implement the same thing twice and call it differently.