how boost multi_index is implemented - c++

I have some difficulties understanding how Boost.MultiIndex is implemented. Lets say I have the following:
typedef multi_index_container<
employee,
indexed_by<
ordered_unique<member<employee, std::string, &employee::name> >,
ordered_unique<member<employee, int, &employee::age> >
>
> employee_set;
I imagine that I have one array, Employee[], which actually stores the employee objects, and two maps
map<std::string, employee*>
map<int, employee*>
with name and age as keys. Each map has employee* value which points to the stored object in the array. Is this ok?

A short explanation on the underlying structure is given here, quoted below:
The implementation is based on nodes interlinked with pointers, just as say your favorite std::set implementation. I'll elaborate a bit on this: A std::set is usually implemented as an rb-tree where nodes look like
struct node
{
// header
color c;
pointer parent,left,right;
// payload
value_type value;
};
Well, a multi_index_container's node is basically a "multinode" with as many headers as indices as well as the payload. For instance, a multi_index_container with two so-called ordered indices uses an internal node that looks like
struct node
{
// header index #0
color c0;
pointer parent0,left0,right0;
// header index #1
color c1;
pointer parent1,left1,right2;
// payload
value_type value;
};
(The reality is more complicated, these nodes are generated through some metaprogramming etc. but you get the idea) [...]

Conceptually, yes.
From what I understand of Boost.MultiIndex (I've used it, but not seen the implementation), your example with two ordered_unique indices will indeed create two sorted associative containers (like std::map) which store pointers/references/indices into a common set of employees.
In any case, every employee is stored only once in the multi-indexed container, whereas a combination of map<string,employee> and map<int,employee> would store every employee twice.
It may very well be that there is indeed a (dynamic) array inside some multi-indexed containers, but there is no guarantee that this is true:
[Random access indices] do not provide memory contiguity,
a property of std::vectors by which
elements are stored adjacent to one
another in a single block of memory.
Also, Boost.Bimap is based on Boost.MultiIndex and the former allows for different representations of its "backbone" structure.

Actually I do not think it is.
Based on what is located in detail/node_type.hpp. It seems to me that like a std::map the node will contain both the value and the index. Except that in this case the various indices differ from one another and thus the node interleaving would actually differ depending on the index you're following.
I am not sure about this though, Boost headers are definitely hard to parse, however it would make sense if you think in term of memory:
less allocations: faster allocation/deallocation
better cache locality
I would appreciate a definitive answer though, if anyone knows about the gore.

Related

performance of boost multi_index_container

I am interested to know the performance of multi_index_container for the following use case:
struct idx_1 {};
struct idx_2 {};
typedef multi_index_container<
Object,
indexed_by<
// Keyed by: idx1
hashed_unique<
tag<idx_1>,
unique_key >,
// Keyed by: (attribute1, attribute2 and attribute3)
ordered_non_unique<
tag<idx_2>,
composite_key<
Object,
attribute1,
attribute2,
attribute3 > >
>
> ObjectMap;
I need a map to save the object, and the number of objects should be more than 300,000. while each object has 1 unique key and 3 attributes. The details of the keys:
unique key is "unique" as the name
each attribute only has a few possible values, say there's only 16 combinations. So with 300,000 objects, each combination will have a list of 300,000/16 objects
attribute1 needs to be modified from one value to another value occasionally
object finding is always be done via the unique_key while the composite_key is used to iterating objects with one or several attributes
For such use case, multi_index_container is a very good fit as I don't need to maintain several map independently. For the unique key part, I believe hashed_unique is a good candidate instead of ordered_unique.
But I am extremely not comfortable about the "ordered_non_unique" part. I don't know how's implemented in boost. My guess it boost maintain a list of objects in a single list for each combination similar to the unordered_map(forgive me if it's too naive!). If that's the case, modify the attribute an existing object will be a big pain as it requires to 1) go through a long list of objects for a particular combination 2) execute the equal comparison 3) and move the destination combination.
the steps that I suspect with high latency:
ObjectMap objects_;
auto& by_idx1 = objects_.get<idx1>();
auto it = by_idx1.find(some_unique_key);
Object new_value;
by_idx1.modify(it, [&](const Object& object) {
object = new_value;
});
My concern is that whether the last "modify" function has some liner behavior as stated to go through some potential long list of objects under one combination...
As this is a very specific piece of code, I'd suggest you benchmark and profile it using a large amount of real-world data.
As Fabio comments, your best option is to profile the case and see the outcome. Anyway, an ordered_non_unique index is implemented exactly as a std::multimap, namely via a regular rb-tree with the provision that elements with equivalent keys are allowed to coexist in the container; no lists of equivalent elements or anything. As for modify (for your particular use case replace is a better fit), the following procedure is executed:
Check if the element is in place: O(1).
If not, rearrange: O(log n), which for 300,000 elements amounts to a maximum of 19 element comparisons (not 300,000/16=18,750 as you suggest): these comparisons are done lexicographically on the triple (attribute1, attribute2, attribute3). Is this fast enough or not? Well, that depends on your performance requirements, so only profiling can really tell.

Searching data using different keys

I am no expert in C++ and STL.
I use a structure in a Map as data. Key is some class C1.
I would like to access the same data but using a different key C2 too (where C1 and C2 are two unrelated classes).
Is this possible without duplicating the data?
I tried searching in google, but had a tough time finding an answer that I could understand.
This is for an embedded target where boost libraries are not supported.
Can somebody offer help?
You may store pointers to Data as std::map values, and you can have two maps with different keys pointing to the same data.
I think a smart pointer like std::shared_ptr is a good option in this case of shared ownership of data:
#include <map> // for std::map
#include <memory> // for std::shared_ptr
....
std::map<C1, std::shared_ptr<Data>> map1;
std::map<C2, std::shared_ptr<Data>> map2;
Instances of Data can be allocated using std::make_shared().
Not in the Standard Library, but Boost offers boost::multi_index
Two keys of different types
I must admit I've misread a bit, and didn't really notice you want 2 keys of different types, not values. The solution for that will base on what's below, though. Other answers have pretty much what will be needed for that, I'd just add that you could make an universal lookup function: (C++14-ish pseudocode).
template<class Key>
auto lookup (Key const& key) { }
And specialize it for your keys (arguably easier than SFINAE)
template<>
auto lookup<KeyA> (KeyA const& key) { return map_of_keys_a[key]; }
And the same for KeyB.
If you wanted to encapsulate it in a class, an obvious choice would be to change lookup to operator[].
Key of the same type, but different value
Idea 1
The simplest solution I can think of in 60 seconds: (simplest meaning exactly that it should be really thought through). I'd also switch to unordered_map as default.
map<Key, Data> data;
map<Key2, Key> keys;
Access via data[keys["multikey"]].
This will obviously waste some space (duplicating objects of Key type), but I am assuming they are much smaller than the Data type.
Idea 2
Another solution would be to use pointers; then the only cost of duplicate is a (smart) pointer:
map<Key, shared_ptr<Data>> data;
Object of Data will be alive as long as there is at least one key pointing to it.
What I usually do in these cases is use non-owned pointers. I store my data in a vector:
std::vector<Data> myData;
And then I map pointers to each element. Since it is possible that pointers are invalidated because of the future growth of the vector, though, I will choose to use the vector indexes in this case.
std::map<Key1, int> myMap1;
std::map<Key2, int> myMap2;
Don't expose the data containers to your clients. Encapsulate element insertion and removal in specific functions, which insert everywhere and remove everywhere.
Bartek's "Idea 1" is good (though there's no compelling reason to prefer unordered_map to map).
Alternatively, you could have a std::map<C2, Data*>, or std::map<C2, std::map<C1, Data>::iterator> to allow direct access to Data objects after one C2-keyed search, but then you'd need to be more careful not to access invalid (erased) Data (or more precisely, to erase from both containers atomically from the perspective of any other users).
It's also possible for one or both maps to move to shared_ptr<Data> - the other could use weak_ptr<> if that's helpful ownership-wise. (These are in the C++11 Standard, otherwise the obvious source - boost - is apparently out for you, but maybe you've implemented your own or selected another library? Pretty fundamental classes for modern C++).
EDIT - hash tables versus balanced binary trees
This isn't particularly relevant to the question, but has received comments/interest below and I need more space to address it properly. Some points:
1) Bartek's casually advising to change from map to unordered_map without recommending an impact study re iterator/pointer invalidation is dangerous, and unwarranted given there's no reason to think it's needed (the question doesn't mention performance) and no recommendation to profile.
3) Relatively few data structures in a program are important to performance-critical behaviours, and there are plenty of times when the relative performance of one versus another is of insignificant interest. Supporting this claim - masses of code were written with std::map to ensure portability before C++11, and perform just fine.
4) When performance is a serious concern, the advice should be "Care => profile", but saying that a rule of thumb is ok - in line with "Don't pessimise prematurely" (see e.g. Sutter and Alexandrescu's C++ Coding Standards) - and if asked for one here I'd happily recommend unordered_map by default - but that's not particularly reliable. That's a world away from recommending every std::map usage I see be changed.
5) This container performance side-track has started to pull in ad-hoc snippets of useful insight, but is far from being comprehensive or balanced. This question is not a sane venue for such a discussion. If there's another question addressing this where it makes sense to continue this discussion and someone asks me to chip in, I'll do it sometime over the next month or two.
You could consider having a plain std::list holding all your data, and then various std::map objects mapping arbitrary key values to iterators pointing into the list:
std::list<Data> values;
std::map<C1, std::list<Data>::iterator> byC1;
std::map<C2, std::list<Data>::iterator> byC2;
I.e. instead of fiddling with more-or-less-raw pointers, you use plain iterators. And iterators into a std::list have very good invalidation guarantees.
I had the same problem, at first holding two map for shared pointers sound very cool. But you will still need to manage this two maps(inserting, removing etc...).
Than I came up with other way of doing this.
My reason was; accessing a data with x-y or radius-angle. Think like each point will hold data but point could be described as cartesian x,y or radius-angle .
So I wrote a struct like
struct MyPoint
{
std::pair<int, int> cartesianPoint;
std::pair<int, int> radianPoint;
bool operator== (const MyPoint& rhs)
{
if (cartesianPoint == rhs.cartesianPoint || radianPoint == rhs.radianPoint)
return true;
return false;
}
}
After that I could used that as key,
std::unordered_map<MyPoint, DataType> myMultIndexMap;
I am not sure if your case is the same or adjustable to this scenerio but it can be a option.

How to convert a struct property to a pointer reference in C++?

I have a DAG in a JSON format, where each node is an entry: it has a name, and two arrays. One array is for other nodes with arrows coming into it, another array for nodes that this node is directed towards (outgoing arrows).
So, for example:
{
'id': 'A',
'connected_from' : ['B','C'],
'connects_to' : ['D','E']
}
And I have a collection of these nodes, that all together form a DAG.
I'd like to map the nodes to a struct to hold these nodes, where the id is simply a string, and I'd like the arrays to be vectors of pointers of this struct:
struct node {
string id;
vector<node*> connected_from;
vector<node*> connected_to;
}
How do I convert the node entries as 'id' in the arrays of the JSON to a pointer to the correct struct holding that node?
One obvious approach is to build a map of key-value pairs, where key = id, value = the pointer to the struct with that id, and do a lookup - but is there a better way?
no, given only the information that you've provided there isn't a better way: you need to build a map.
however, for single letter id's the map can possibly take the form of a simple array with e.g. 26 entries for the English alphabet.
There's going to be some container object holding all the nodes (otherwise you're going to leak them.) You could always scan over the container to find the nodes. But this will be inefficient - O(N^2) while a map lookup will be O(N log N ).
Though if you store the objects in sorted order in the container (or use a sorted container) you can reduce both cases to O(N log N).
The constants will be different though, so for a small graph the scan may be faster.
I think your suggestion is fine... Map from ID to node. It's simple, intuitive and fast enough for practical purposes. Considering the data is being parsed from JSON, your storage and lookups are not going to significantly impact performance. If you're really concerned, then implement a Dictionary to replace your map.
In general terms, I always advocate the simplest, cleanest approach that gets the job done. Too many people obsess about memory or performance hits in algorithms, when the actual bottleneck in their code lies elsewhere.

Implementing list position locator in C++?

I am writing a basic Graph API in C++ (I know libraries already exist, but I am doing it for the practice/experience). The structure is basically that of an adjacency list representation. So there are Vertex objects and Edge objects, and the Graph class contains:
list<Vertex *> vertexList
list<Edge *> edgeList
Each Edge object has two Vertex* members representing its endpoints, and each Vertex object has a list of Edge* members representing the edges incident to the Vertex.
All this is quite standard, but here is my problem. I want to be able to implement deletion of Edges and Vertices in constant time, so for example each Vertex object should have a Locator member that points to the position of its Vertex* in the vertexList. The way I first implemented this was by saving a list::iterator, as follows:
vertexList.push_back(v);
v->locator = --vertexList.end();
Then if I need to delete this vertex later, then rather than searching the whole vertexList for its pointer, I can call:
vertexList.erase(v->locator);
This works fine at first, but it seems that if enough changes (deletions) are made to the list, the iterators will become out-of-date and I get all sorts of iterator errors at runtime. This seems strange for a linked list, because it doesn't seem like you should ever need to re-allocate the remaining members of the list after deletions, but maybe the STL does this to optimize by keeping memory somewhat contiguous?
In any case, I would appreciate it if anyone has any insight as to why this happens. Is there a standard way in C++ to implement a locator that will keep track of an element's position in a list without becoming obsolete?
Much thanks,
Jeff
(I am assuming you are single-threaded, obviously list isn't thread-safe)
but maybe the STL does this to optimize by keeping memory somewhat contiguous?
Incorrect - list::insert, list::push_front and list::push_back do not affect the validity of list::iterators. If you are only calling these mutators on the list, than it will remain valid.
In any case, I would appreciate it if anyone has any insight as to why this happens. Is there a standard way in C++ to implement a locator that will keep track of an element's position in a list without becoming obsolete?
Your method should work, please post some code demonstrating it not working. In the meantime here are two alternative representations:
Why not use:
struct Graph
{
typedef unique_ptr<Vertex> PVertex;
typedef unique_ptr<Edge> PEdge;
unordered_set<PVertex> verticies;
unordered_set<PEdge> edges;
};
That way you can delete them in constant time like you wish. unordered_set is generally implemented with a hash table so its amortized O(1) access time.
And also unique_ptr means that you can have the unordred_sets "owning" them
If verticies are countable and have a fixed maxmimum upper limit (N), another representation would be:
struct Graph
{
typedef unique_ptr<Vertex> PVertex;
typedef unique_ptr<Edge> PEdge;
array<PVertex, N> verticies;
array<array<PEdge, N>, N> edges;
};
Where edges[i][j] holds the edge between verticies[i] and verticies[j]
If verticies[x] or edges[x][y] is nullptr in means the corresponding vertex or edge does not exist.
Old C++ Versions:
unordered_set was introduced in TR1. If you don't have this you can use boost. if you don't want to use boost you can use a plain old set which will give logn access time, or you can implement your own hash table.
unique_ptr can be replaced with auto_ptr for older versions.
array can be replaced with a regular array or with a vector.

How to associate to a number another number without using array

Let's say we have read these values:
3
1241
124515
5322353
341
43262267234
1241
1241
3213131
And I have an array like this (with the elements above):
a[0]=1241
a[1]=124515
a[2]=43262267234
a[3]=3
...
The thing is that the elements' order in the array is not constant (I have to change it somewhere else in my program).
How can I know on which position does one element appear in the read document.
Note that I can not do:
vector <int> a[1000000000000];
a[number].push_back(all_positions);
Because a will be too large (there's a memory restriction). (let's say I have only 3000 elements, but they're values are from 0 to 2^32)
So, in the example above, I would want to know all the positions 1241 is appearing on without iterating again through all the read elements.
In other words, how can I associate to the number "1241" the positions "1,6,7" so I can simply access them in O(1) (where 1 actually is the number of positions the element appears)
If there's no O(1) I want to know what's the optimal one ...
I don't know if I've made myself clear. If not, just say it and I'll update my question :)
You need to use some sort of dynamic array, like a vector (std::vector) or other similar containers (std::list, maybe, it depends on your needs).
Such data structures are safer and easier to use than C-style array, since they take care of memory management.
If you also need to look for an element in O(1) you should consider using some structures that will associate both an index to an item and an item to an index. I don't think STL provides any, but boost should have something like that.
If O(log n) is a cost you can afford, also consider std::map
You can use what is commonly refered to as a multimap. That is, it stores Key and multiple values. This is an O(log) look up time.
If you're working with Visual Studios they provide their own hash_multimap, else may I suggest using Boost::unordered_map with a list as your value?
You don't need a sparse array of 1000000000000 elements; use an std::map to map positions to values.
If you want bi-directional lookup (that is, you sometimes want "what are the indexes for this value?" and sometimes "what value is at this index?") then you can use a boost::bimap.
Things get further complicated as you have values appearing more than once. You can sacrifice the bi-directional lookup and use a std::multimap.
You could use a map for that. Like:
std::map<int, std::vector<int>> MyMap;
So everytime you encounter a value while reading the file, you append it's position to the map. Say X is the value you read and Y is the position then you just do
MyMap[X].push_back( Y );
Instead of you array use
std::map<int, vector<int> > a;
You need an associative collection but you might want to associated with multiple values.
You can use std::multimap< int, int >
or
you can use std::map< int, std::set< int > >
I have found in practice the latter is easier for removing items if you just need to remove one element. It is unique on key-value combinations but not on key or value alone.
If you need higher performance then you may wish to use a hash_map instead of map. For the inner collection though you will not get much performance in using a hash, as you will have very few duplicates and it is better to std::set.
There are many implementations of hash_map, and it is in the new standard. If you don't have the new standard, go for boost.
It seems you need a std::map<int,int>. You can store the mapping such as 1241->0 124515->1 etc. Then perform a look up on this map to get the array index.
Besides the std::map solution offered by others here (O(log n)), there's the approach of a hash map (implemented as boost::unordered_map or std::unordered_map in C++0x, supported by modern compilers).
It would give you O(1) lookup on average, which often is faster than a tree-based std::map. Try for yourself.
You can use a std::multimap to store both a key (e.g. 1241) and multiple values (e.g. 1, 6 and 7).
An insert has logarithmic complexity, but you can speed it up if you give the insert method a hint where it can insert the item.
For O(1) lookup you could hash the number to find its entry (key) in a hash map (boost::unordered_map, dictionary, stdex::hash_map etc)
The value could be a vector of indices where the number occurs or a 3000 bit array (375 bytes) where the bit number for each respective index where the number (key) occurs is set.
boost::unordered_map<unsigned long, std::vector<unsigned long>> myMap;
for(unsigned long i = 0; i < sizeof(a)/sizeof(*a); ++i)
{
myMap[a[i]].push_back(i);
}
Instead of storing an array of integer, you could store an array of structure containing the integer value and all its positions in an array or vector.