How is the C++ multimap container implemented?


For example a C++ vector is implemented using a dynamic array where each element uses consecutive memory spaces.
I know that a C++ multimap is a one to many relationship but what is the internal structure?

The C++ standard does not define how the standard containers should be implemented; it only imposes certain constraints, like the one you mention for vectors.
Multimaps have guaranteed runtime complexity (O(lg n) for the interesting operations) and other guarantees, and can be implemented as red-black trees. This is how they are implemented in the GNU standard C++ library.

Very often, a red-black tree. See e.g. STL's Red-Black Trees from Dr. Dobb's.

Addition to the "preferred" answer, because SO won't let me comment:
Given a key with values B, C, D, the behavior of iterators is a lot easier to implement if each element has its own node. In common implementations find() returns the first element in the series (the standard only promises some element with that key; lower_bound() always gives the first), and subsequent iteration takes you across the remaining elements. The de facto difference between a map and a multimap is that multimap is sorted using < over the entire value_type, where the map uses < over only the key_type.
Correction: the C++11 standard is explicit that new (key, mapping) pairs are inserted at the end of any existing values having the same key. This raises a question I hadn't considered: can a multimap contain two nodes in which both the key and the mapped target are the same. The standard doesn't seem to take a clear position on this, but it's noteworthy that no comparison operator is required on the mapped type. If you write a test program, you will find that a multimap can map X to 1, 2, 1. That is: "1" can appear multiple times as a target and the two instances will not be merged. For some algorithms that's a deficiency.
This article from Dr. Dobbs talks about the underlying rb-tree implementation that is commonly used. The main point to note is that the re-balance operation actually doesn't care about the keys at all, which is why you can build an rb-tree that admits duplicated keys.
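The test program mentioned above can be sketched roughly like this. It is only a minimal illustration; the helper names values_for and make_example are made up here, and it relies on the C++11 rule that elements with equivalent keys keep their insertion order:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <vector>

// Hypothetical helper: collects the mapped values stored under `key`,
// in the order the multimap iterates over them.
std::vector<int> values_for(const std::multimap<char, int>& m, char key) {
    auto range = m.equal_range(key);
    // Every (key, value) pair is a separate element, so the length of the
    // range must equal count(key).
    assert(static_cast<std::size_t>(std::distance(range.first, range.second)) == m.count(key));
    std::vector<int> out;
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}

// Builds the "X maps to 1, 2, 1" example from the text.
std::multimap<char, int> make_example() {
    std::multimap<char, int> m;
    m.insert({'X', 1});
    m.insert({'X', 2});
    m.insert({'X', 1});  // same key and same value: kept as a third element
    m.insert({'A', 9});
    return m;
}
```

Calling values_for(make_example(), 'X') should give 1, 2, 1: the duplicate is not merged, and the values come back in insertion order.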

The multimap, just like its simpler version std::map, is mostly built using red-black trees. The C++ standard itself does not specify the implementation, but in most cases (I personally checked SGI STL) red-black trees are used. Red-black trees are height-balanced trees, so a fetch/read operation on them is always guaranteed to take O(log(n)) time. As for how the values of a key are stored: each key->value pair is saved as a separate node in the red-black tree (even though the same key might appear multiple times, as with key 'b' below). The key is used to search the rb-tree. Once the key is found, the value stored in that node is returned.
std::multimap<char,int> mmp;
mmp.insert(std::pair<char,int>('a',10));
mmp.insert(std::pair<char,int>('b',20));
mmp.insert(std::pair<char,int>('b',10));
mmp.insert(std::pair<char,int>('b',15));
mmp.insert(std::pair<char,int>('b',20));
mmp.insert(std::pair<char,int>('c',25));
mmp.insert(std::pair<char,int>('a',15));
mmp.insert(std::pair<char,int>('a',7));
for (std::multimap<char,int>::iterator it=mmp.begin(); it!=mmp.end(); ++it){
    std::cout << (*it).first << " => " << (*it).second << " . Address of (*it).second = " << &((*it).second) << '\n';
}
Output :
a => 10 . Address of (*it).second = 0x96cca24
a => 15 . Address of (*it).second = 0x96ccae4
a => 7 . Address of (*it).second = 0x96ccb04
b => 20 . Address of (*it).second = 0x96cca44
b => 10 . Address of (*it).second = 0x96cca64
b => 15 . Address of (*it).second = 0x96cca84
b => 20 . Address of (*it).second = 0x96ccaa4
c => 25 . Address of (*it).second = 0x96ccac4
Initially I thought the values of a single key like 'b' might be stored in a std::vector .
template <class K, class V>
struct Node {
    K key;
    std::vector<V> values;
    Node* left;
    Node* right;
};
But later I realized that would violate the guaranteed fetch time of O(log(n)). Moreover, printing out the addresses of the values confirms that values with a common key are not contiguous.
The keys are inserted using operator<, so values with the same key are stored in the order in which they are inserted.
So if we insert first
(key = 'b', value = 20)
and then
(key = 'b', value = 10)
The insertion is done using operator<. Since the second 'b' is NOT less than the first inserted 'b', it is inserted in the 'right branch of a binary tree'.
The compiler I have used is gcc-5.1 ( C++14).

Related

How are elements in an std::unordered_set stored in memory in C++?

While messing around with type-punning iterators, I came across the ability to do
std::vector<int> vec{ 3, 7, 1, 8, 4 };
int* begin_i = (int*)(void*)&*vec.begin();
std::cout << "1st: " << begin_i << " = " << *begin_i << std::endl;
begin_i++;
std::cout << "2nd: " << begin_i << " = " << *begin_i << std::endl;
Then I tried to do the same kind of thing with an std::unordered_set:
std::unordered_set<int> set{ 3, 7, 1, 8, 4 };
for (auto& el : set)
{   // Display the order the set is currently in
    std::cout << el << ", ";
}
std::cout << '\n' <<std::endl;
int* begin_i = (int*)(void*)&*set.begin();
std::cout << "1st: " << begin_i << " = " << *begin_i << std::endl;
begin_i++;
std::cout << "2nd: " << begin_i << " = " << *begin_i << std::endl;
But the output I got was:
4, 8, 1, 7, 3,
1st: [address] = 4
2nd: [address] = 0
I'm supposing this is because an unordered set's elements are located in different parts of memory? I was confused here, considering that I also printed the order the elements were stored in using a range-based loop.
My question is how does an std::unordered_set store its elements in memory? What happens when an element is added to the set? Where does it go in memory and how is that kept track of if it's not stored in an array-like container where the elements are one-right-after-the-other?

An unordered_set is implemented as a hash table using external chaining.
That basically means you have an array of linked lists (which are usually called "buckets"). So, to add an item to an unordered_set you start by hashing the new item you're going to insert. You then take that hash and reduce it to the range of the current size of the array (which can/will expand as you add more items). You then add the new item at the tail of that linked list.
So, depending on the value produced by the hash, two consecutively inserted items may (and often will) be inserted in the linked lists at completely different parts of the table. Then the node in the linked list will typically be dynamically allocated, so even two consecutive items in the same linked list may be at completely unrelated addresses.
As I noted in an earlier answer, however, quite a bit more about this is actually specified in the standard than most people seem to realize. As I outlined there, it might be (barely) possible to violate the expectation and still (sort of) meet the requirements in the standard, but even at best, doing so would be quite difficult. For most practical purposes, you can assume it's something quite a bit like a vector of linked lists.
Most of the same things apply to an unordered_multiset--the only fundamental difference is that you can have multiple items with the same key instead of only one item with a particular key.
Likewise, there are also unordered_map and unordered_multimap, which are pretty similar again, except that they separate the things being stored into a key and a value associated with that key, and when they do hashing, they only look at the key part, not the value part.
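The bucket structure described above is visible through the container's own bucket interface (bucket_count, bucket_size, bucket, and the per-bucket begin/end). A small sketch, with helper names invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Sums the sizes of all buckets. Every element lives in exactly one bucket,
// so the total must equal s.size().
std::size_t total_in_buckets(const std::unordered_set<int>& s) {
    std::size_t total = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        total += s.bucket_size(b);
    }
    assert(total == s.size());
    return total;
}

// Returns true if `value` is found by walking only the bucket that its
// hash maps to, which is what a lookup conceptually does.
bool found_in_own_bucket(const std::unordered_set<int>& s, int value) {
    if (s.bucket_count() == 0) return false;  // bucket() needs at least one bucket
    std::size_t b = s.bucket(value);
    for (auto it = s.begin(b); it != s.end(b); ++it) {
        if (*it == value) return true;
    }
    return false;
}
```

Which bucket a given value lands in depends on the hash function and the current bucket count, so the sketch deliberately avoids assuming any particular bucket index.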

Rather than directly answer the question, I would like to address the "type-punning" trick. (I put that in quotes because the provided code does not demonstrate type-punning. Perhaps the code was appropriately simplified for this question. In any event, *vec.begin() gives an int, so &*vec.begin() is an int*. Further casting to void* then back to int* is a net no-op.)
The property your code takes advantage of is
*(begin_i + 1) == *(vec.begin() + 1) // Using the initial value of begin_i
*(&*vec.begin() + 1) == *(vec.begin() + 1) // Without using an intermediary
This is a property of a contiguous iterator, which is associated with a contiguous container. These are the containers that store their elements in adjacent memory locations. The contiguous containers in the standard library are string, array, and vector; these are the only standard containers for which your trick is guaranteed to work. Trying it on a deque will probably seem to work at first, but the attempt will fail if enough is added to &*begin(). Other containers tend to dynamically allocate elements individually, so there need not be any relation between the addresses of elements; elements are linked together by pointers rather than by position/index.
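To make the contiguity property concrete, here is a rough check that compares the addresses of neighbouring elements. The function names are hypothetical, and the check only tells you what a particular container instance happens to look like in memory:

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <vector>

// True if every element sits directly after the previous one in memory.
template <typename Container>
bool is_laid_out_contiguously(const Container& c) {
    auto it = c.begin();
    if (it == c.end()) return true;
    const auto* prev = std::addressof(*it);
    for (++it; it != c.end(); ++it) {
        const auto* cur = std::addressof(*it);
        if (cur != prev + 1) return false;  // next element is somewhere else
        prev = cur;
    }
    return true;
}

// vector is a contiguous container, so this must hold for any vector.
void check_vector_is_contiguous() {
    std::vector<int> v{3, 7, 1, 8, 4};
    assert(is_laid_out_contiguously(v));
}
```

For a std::list the same check fails as soon as there are two elements, because each int lives inside its own node next to the link pointers.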
So that I'm not ignoring the asked question:
An unordered set is merely required to organize elements into buckets. There are no requirements on how this is done, other than requiring that all elements with the same hash value are placed in the same bucket. (This does not imply that all elements in the same bucket have the same hash value.) In practice, each bucket is probably implemented as a list, and the container of buckets is probably a vector, simply because re-using code is cool. At the same time, this is an implementation detail, so it can vary from compiler to compiler, and even from compiler version to compiler version. There are no guarantees.

The way std::unordered_set stores its elements in memory is left to the implementation. The standard doesn't care as long as the requirements are satisfied.
In the VS version the elements are stored inside a std::list (fast access is provided by creating and managing additional data), so each element also carries pointers to the previous and next elements and is allocated via new (at least that's what I remember from std::list).

emplace_hint performance when hint is wrong

I am trying to determine if emplace_hint should be used to insert a key into a multimap (as opposed to regular emplace). I have already calculated the range of the key in an earlier operation (on the same key):
range = multimap.equal_range(key);
Should I use range.first, range.second, or nothing as a hint to insert the key, value pair? What if the range is empty?

Should I use range.first, range.second, or nothing as a hint to insert the key, value pair?
As std::multimap::emplace_hint() states:
Inserts a new element into the container as close as possible to the position just before hint.
(emphasis is mine) you should use second iterator from range and it should make insertion more efficient:
Complexity
Logarithmic in the size of the container in general, but amortized constant if the new element is inserted just before hint.
As for an empty range, it is still fine to use the second iterator, as it always points to the first element greater than the key, or past the last element if no such element exists.
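A minimal sketch of that advice, assuming a multimap from std::string to int (the helper names append_with_hint and values_of are made up for this example):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Inserts (key, value) using the upper end of equal_range(key) as the hint.
// The new element goes right before the hint, i.e. after all existing
// elements with the same key.
void append_with_hint(std::multimap<std::string, int>& m,
                      const std::string& key, int value) {
    auto range = m.equal_range(key);
    auto it = m.emplace_hint(range.second, key, value);
    assert(it->first == key && it->second == value);
}

// Collects the values stored under `key` in iteration order.
std::vector<int> values_of(const std::multimap<std::string, int>& m,
                           const std::string& key) {
    std::vector<int> out;
    auto range = m.equal_range(key);
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

With range.second as the hint, repeated calls for the same key keep the values in insertion order, and the same call works unchanged when the range is empty.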

First, performance wise, it will not make any difference if you use range.first or range.second. Let's have a look at the return value of equal_range:
std::equal_range - return value
std::pair containing a pair of iterators defining the wanted range,
the first pointing to the first element that is not less than value
and the second pointing to the first element greater than value. If
there are no elements not less than value, last is returned as the
first element. Similarly if there are no elements greater than value,
last is returned as the second element
This means that, when obtained for a value key, both range.first and range.second represent positions right before which key may be correctly inserted. So performance-wise it should not matter whether you use range.first or range.second. Complexity should be "amortized constant", since the new element is inserted just before hint.
Second, when the range is "empty", range.first and range.second are equal: both point to the first element greater than key, or to end() if there is no such element. Performance and result are therefore identical for either hint, and the resulting position is the same as if you used emplace without any hint.
See the following program demonstrating this:
#include <iostream>
#include <map>
#include <string>
#include <utility>

int main()
{
    std::multimap<std::string, std::string> m;
    // some clutter:
    m.emplace(std::make_pair(std::string("k"), std::string("1")));
    m.emplace(std::make_pair(std::string("k"), std::string("2")));
    m.emplace(std::make_pair(std::string("z"), std::string("1")));
    m.emplace(std::make_pair(std::string("z"), std::string("2")));
    // relevant portion of demo data: order a-c-b may be preserved
    m.emplace(std::make_pair(std::string("x"), std::string("a")));
    m.emplace(std::make_pair(std::string("x"), std::string("c")));
    m.emplace(std::make_pair(std::string("x"), std::string("b")));
    auto r = m.equal_range("x");
    // will insert "x.zzzz" before "x.a":
    m.emplace_hint(r.first, std::make_pair(std::string("x"), std::string("zzzz")));
    // will insert "x.0" right after "x.b":
    m.emplace_hint(r.second, std::make_pair(std::string("x"), std::string("0")));
    auto rEmpty = m.equal_range("e");
    // "empty" range, normal lookup:
    m.emplace_hint(rEmpty.first, std::make_pair(std::string("e"), std::string("b")));
    m.emplace_hint(rEmpty.second, std::make_pair(std::string("e"), std::string("a")));
    auto rWrong = m.equal_range("k");
    m.emplace_hint(rWrong.first, std::make_pair(std::string("z"), std::string("a")));
    for (const auto &p : m) {
        std::cout << p.first << " => " << p.second << '\n';
    }
}
Output:
e => b
e => a
k => 1
k => 2
x => zzzz
x => a
x => c
x => b
x => 0
z => a
z => 1
z => 2
In short: if you have a valid range for key pre-calculated, then use it when inserting key. It will help anyway.
EDIT:
There have been discussions around whether an "invalid" hint might lead to an insertion at a position that does not then reflect the "order of insertion" for values with the same key. This might be concluded from a general multimap statement "The order of the key-value pairs whose keys compare equivalent is the order of insertion and does not change. (since C++11)".
I did not find support for the one or the other point of view in any normative document. I just found the following statement in cplusplus multimap/emplace_hint documentation:
template <class... Args>
iterator emplace_hint (const_iterator position, Args&&... args);
position Hint for the position where the element can be inserted. The function optimizes its insertion time if position points to the
element that will follow the inserted element (or to the end, if it
would be the last). Notice that this does not force the new element to
be in that position within the multimap container (the elements in a
multimap always follow a specific order). const_iterator is a member
type, defined as a bidirectional iterator type that points to
elements.
I know that this is not a normative reference, but at least my Apple LLVM 8.0 compiler adheres to this definition (see demo above):
If one inserts an element with a "wrong" hint, i.e. one pointing even before the position where a pair shall be inserted, the algorithm recognizes this and chooses a valid position (see inserting "z"=>"a" where a hint points to an "x"-element).
If we use a range for key "x" and use range.first, the position right before the first x is interpreted as a valid position.
So: I think that m.emplace_hint(r.first,... behaves in a way that the algorithm chooses a valid position immediately, and that inserting as close as possible to the hint overrules the general statement "The order of the key-value pairs whose keys compare equivalent is the order of insertion and does not change. (since C++11)".

How to retrieve collisions of unordered map?

I have two elements (6 and 747) that share their key ("eggs"). I want to find all the elements that share a key (let's say "eggs", but I would in real life do that for every key). How to do that?
There must be a way to get a container or something back from the data structure . . .

You're still mistaking key's value with key's hash. But to answer question as asked: you can use unordered_map's bucket() member function with bucket iterators:
std::unordered_map<int,int,dumbest_hash> m;
m[0] = 42;
m[1] = 43;
size_t bucket = m.bucket(1);
for(auto it = m.begin(bucket), e = m.end(bucket); it != e; ++it) {
    cout << "bucket " << bucket << ": " << it->first << " -> " << it->second << '\n';
}
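The definition of dumbest_hash is not shown in the snippet. A self-contained variant might look like the following, where constant_hash is a hypothetical stand-in that sends every key to the same hash value so that all keys collide:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for `dumbest_hash`: every key hashes to 0,
// so all keys end up in one bucket.
struct constant_hash {
    std::size_t operator()(int) const { return 0; }
};

// Returns the keys stored in the bucket that `key` maps to
// (including `key` itself if it is present).
std::vector<int> keys_sharing_bucket(
        const std::unordered_map<int, int, constant_hash>& m, int key) {
    std::vector<int> keys;
    if (m.bucket_count() == 0) return keys;  // bucket() needs at least one bucket
    std::size_t b = m.bucket(key);
    for (auto it = m.begin(b); it != m.end(b); ++it) {
        keys.push_back(it->first);
    }
    assert(keys.size() == m.bucket_size(b));
    return keys;
}
```

With a realistic hash function the bucket would usually hold just the one key; the constant hash is only there to force collisions.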
In simple and mostly correct terms, unordered containers imitate their ordered counterparts in terms of interface. That means that if a map will not allow you to have duplicate keys, then neither will unordered_map.
Unordered containers do employ a hash function to speed up lookup, but if two keys have the same hash, they will not necessarily have the same value. To keep the behaviour similar to the ordered containers, unordered_set and unordered_map will only consider elements equal when they're actually equal (using operator== or a provided equality predicate), not when their hashed values collide.
To put things in perspective, let's assume that "eggs" and "chicken" have the same hash value and that there's no equality checking. Then the following code would be "correct":
unordered_map<string, int> m;
m["eggs"] = 42;
m.insert(make_pair("chicken", 0)); // not inserted, key already exists
assert(m["chicken"] == 42);
But if you want to allow duplicate keys in the same map, simply use unordered_multimap.

Unordered map does not have elements that share a key.
Unordered multi map does.
Use umm.equal_range(key) to get a pair of iterators describing the elements in the map that match a given key.
However, note that "collision" when talking about hashed containers usually refers to elements with the same hashed key, not the same key.
Also, consider using an unordered_map<key, std::vector<value>> instead of a multimap.
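For the "eggs" example from the question, the equal_range approach could look like this. It is just a sketch; values_for_key is a made-up helper name, and the result is sorted because the order of elements within one key is not specified for unordered containers:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Collects all values stored under `key` in an unordered_multimap.
std::vector<int> values_for_key(
        const std::unordered_multimap<std::string, int>& umm,
        const std::string& key) {
    std::vector<int> out;
    auto range = umm.equal_range(key);
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    assert(out.size() == umm.count(key));
    std::sort(out.begin(), out.end());  // make the result order deterministic
    return out;
}
```

For a multimap holding eggs -> 6, eggs -> 747 and some other key, values_for_key(umm, "eggs") returns 6 and 747, and an absent key gives an empty vector.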

Inserting elements at desired positions in a STL map

map <int, string> rollCallRegister;
map <int, string> :: iterator rollCallRegisterIter;
map <int, string> :: iterator tempRollCallRegisterIter;
rollCallRegisterIter = rollCallRegister.begin ();
tempRollCallRegisterIter = rollCallRegister.insert (rollCallRegisterIter, pair <int, string> (55, "swati"));
rollCallRegisterIter++;
tempRollCallRegisterIter = rollCallRegister.insert (rollCallRegisterIter, pair <int, string> (44, "shweta"));
rollCallRegisterIter++;
tempRollCallRegisterIter = rollCallRegister.insert (rollCallRegisterIter, pair <int, string> (33, "sindhu"));
// Displaying contents of this map.
cout << "\n\nrollCallRegister contains:\n";
for (rollCallRegisterIter = rollCallRegister.begin(); rollCallRegisterIter != rollCallRegister.end(); ++rollCallRegisterIter)
{
    cout << (*rollCallRegisterIter).first << " => " << (*rollCallRegisterIter).second << endl;
}
Output:
rollCallRegister contains:
33 => sindhu
44 => shweta
55 => swati
I have incremented the iterator. Why is it still getting sorted? And if the position is supposed to be changed by the map on its own, then what's the purpose of providing an iterator?

Because std::map is a sorted associative container.
In a map, the key value is generally used to uniquely identify the element, while the mapped value is some sort of value associated to this key.
According to the documentation, the position parameter is
the position of the first element to be compared for the insertion
operation. Notice that this does not force the new element to be in
that position within the map container (elements in a set always
follow a specific ordering), but this is actually an indication of a
possible insertion position in the container that, if set to the
element that precedes the actual location where the element is
inserted, makes for a very efficient insertion operation. iterator is
a member type, defined as a bidirectional iterator type.
So the purpose of this parameter is mainly to speed up insertion slightly by narrowing the range of elements to search.
You can use std::vector<std::pair<int,std::string>> if the order of insertion is important.

The interface is indeed slightly confusing, because it looks very much like std::vector<int>::insert (for example) and yet does not produce the same effect...
For associative containers, such as set, map and the new unordered_set and co, you completely relinquish the control over the order of the elements (as seen by iterating over the container). In exchange for this loss of control, you gain efficient look-up.
It would not make sense to suddenly give you control over the insertion, as it would let you break invariants of the container, and you would lose the efficient look-up that is the reason to use such containers in the first place.
And thus insert(It position, value_type&& value) does not insert at said position...
However, this gives us some room for optimization: when inserting an element in an associative container, a look-up needs to be performed to locate where to insert this element. By letting you specify a hint, you are given an opportunity to help the container speed up the process.
This can be illustrated for a simple example: suppose that you receive elements already sorted by way of some interface, it would be wasteful not to use this information!
template <typename Key, typename Value, typename InputStream>
void insert(std::map<Key, Value>& m, InputStream& s) {
    typename std::map<Key, Value>::iterator it = m.begin();
    for (; s; ++s) {
        it = m.insert(it, *s); // the hinted overload returns an iterator, not a pair
    }
}
Some of the items might not be well sorted, but it does not matter, if two consecutive items are in the right order, then we will gain, otherwise... we'll just perform as usual.

The map is always sorted, but you give a "hint" as to where the element may go as an optimisation.
The insertion is O(log N), but if you are able to successfully tell the container where the element goes, it is amortized constant time.
Thus if you are creating a large container of already-sorted values, then each value will get inserted at the end, although the tree will need rebalancing quite a few times.
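As a rough sketch of that bulk-insert case (build_from_sorted is a name invented for this example), the end() iterator can be passed as the hint for every element:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Builds a map from (key, value) pairs, always hinting at end().
// If the input is sorted by key the hint is right every time; if it is not,
// the map simply falls back to a normal lookup and is still sorted.
std::map<int, std::string> build_from_sorted(
        const std::vector<std::pair<int, std::string>>& input) {
    std::map<int, std::string> m;
    for (const auto& kv : input) {
        m.insert(m.end(), kv);  // a hint, not a position: ordering is never broken
    }
    assert(m.size() <= input.size());  // duplicate keys are not inserted twice
    return m;
}
```

Feeding it the roll call entries in the order 55, 44, 33 still yields a map that iterates as 33, 44, 55; only the insertion cost differs from the sorted case.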

As sad_man says, it's associative. If you set a value with an existing key, then you overwrite the previous value.
Now the iterators are necessary because you don't know what the keys are, usually.

Is the unordered_map really unordered?

I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it different: Is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
    bool operator()(const T& v1, const T& v2) const {
        return hash<T>()(v1) < hash<T>()(v2);
    }
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly, STL will complain here because there may be keys k1,k2 and neither k1 < k2 nor k2 < k1. You would need to use multimap and overwrite the equal-check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?

In answer to your edited question, no those two snippets are not equivalent at all. std::map stores nodes in a tree structure, unordered_map stores them in a hashtable*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets" where each bucket corresponds to a range of hash values. Basically, the implementation goes like this:
function add_value(object key, object value) {
    int hash = key.getHash();
    int bucket_index = hash % NUM_BUCKETS;
    if (buckets[bucket_index] == null) {
        buckets[bucket_index] = new linked_list();
    }
    buckets[bucket_index].add(new key_value(key, value));
}
function get_value(object key) {
    int hash = key.getHash();
    int bucket_index = hash % NUM_BUCKETS;
    if (buckets[bucket_index] == null) {
        return null;
    }
    foreach(key_value kv in buckets[bucket_index]) {
        if (kv.key == key) {
            return kv.value;
        }
    }
    return null; // key not present in its bucket
}
Obviously that's a serious simplification and real implementation would be much more advanced (for example, supporting resizing the buckets array, maybe using a tree structure instead of linked list for the buckets, and so on), but that should give an idea of how you can't get back the values in any particular order. See wikipedia for more information.
* Technically, the internal implementation of std::map and unordered_map are implementation-defined, but the standard requires certain Big-O complexity for operations that implies those internal implementations
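The pseudocode above translates fairly directly into C++. The following is only a teaching sketch, not how any standard library implements unordered_map: the class name TinyHashMap is invented, the bucket count is fixed (no resizing), and, unlike the pseudocode, add_value overwrites the value when the key already exists:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Simplified chained hash table: a vector of buckets, each bucket a list.
class TinyHashMap {
public:
    explicit TinyHashMap(std::size_t num_buckets) : buckets_(num_buckets) {
        assert(num_buckets > 0);
    }

    void add_value(const std::string& key, int value) {
        auto& bucket = buckets_[index_of(key)];
        for (auto& kv : bucket) {
            if (kv.first == key) {  // key already present: overwrite
                kv.second = value;
                return;
            }
        }
        bucket.emplace_back(key, value);
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* get_value(const std::string& key) const {
        for (const auto& kv : buckets_[index_of(key)]) {
            if (kv.first == key) return &kv.second;
        }
        return nullptr;
    }

private:
    // Reduces the hash to a bucket index, as in the pseudocode.
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

Iterating over buckets_ in index order would visit the keys grouped by hash modulo the bucket count, which is why the visible order looks arbitrary and changes when the number of buckets changes.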

"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.

As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.

If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.

You are right, unordered_map is actually hash ordered. Note that most current implementations (pre TR1) call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, so this means that the order is not so unordered...
Now, what does it mean that it is hash ordered? As a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it has been renamed in TR1: the old name suggested an order. Now we know that an order is actually used, but you can disregard it as it is unpredictable.
