Searching data using different keys - c++

I am no expert in C++ and STL.
I use a structure in a Map as data. Key is some class C1.
I would like to access the same data but using a different key C2 too (where C1 and C2 are two unrelated classes).
Is this possible without duplicating the data?
I tried searching on Google, but had a tough time finding an answer that I could understand.
This is for an embedded target where boost libraries are not supported.
Can somebody offer help?

You may store pointers to Data as std::map values, and you can have two maps with different keys pointing to the same data.
I think a smart pointer like std::shared_ptr is a good option in this case of shared ownership of data:
#include <map> // for std::map
#include <memory> // for std::shared_ptr
....
std::map<C1, std::shared_ptr<Data>> map1;
std::map<C2, std::shared_ptr<Data>> map2;
Instances of Data can be allocated using std::make_shared().
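For instance, a minimal sketch of the idea (C1, C2 and Data here are stand-ins for your actual types):

#include <map>
#include <memory>
#include <string>

struct C1 { int id; bool operator<(const C1& o) const { return id < o.id; } };
struct C2 { std::string name; bool operator<(const C2& o) const { return name < o.name; } };
struct Data { int payload; };

int main()
{
    std::map<C1, std::shared_ptr<Data>> map1;
    std::map<C2, std::shared_ptr<Data>> map2;

    auto d = std::make_shared<Data>();  // single allocation
    map1[C1{42}] = d;                   // both entries share ownership
    map2[C2{"forty-two"}] = d;          // of the same Data instance
}

The Data object is destroyed automatically once the last map entry referring to it is erased.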

Not in the Standard Library, but Boost offers boost::multi_index

Two keys of different types
I must admit I misread a bit and didn't notice you want 2 keys of different types, not different values. The solution for that will be based on what's below, though. Other answers have pretty much what's needed for it; I'd just add that you could make a universal lookup function (C++14-ish pseudocode):
template<class Key>
auto lookup (Key const& key) { }
And specialize it for your keys (arguably easier than SFINAE):
template<>
auto lookup<KeyA> (KeyA const& key) { return map_of_keys_a[key]; }
And the same for KeyB.
If you wanted to encapsulate it in a class, an obvious choice would be to change lookup to operator[].
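A minimal sketch of that encapsulation, using plain overloading instead of specialization and assuming int and std::string as the two key types (all names here are illustrative):

#include <map>
#include <memory>
#include <string>

struct Data { int value; };

class MultiKeyMap
{
public:
    void insert(int a, const std::string& b, const std::shared_ptr<Data>& d)
    {
        map_of_keys_a[a] = d;
        map_of_keys_b[b] = d;
    }

    // Overload resolution picks the right map from the key type.
    // at() throws std::out_of_range for a missing key.
    Data& operator[](int key)                { return *map_of_keys_a.at(key); }
    Data& operator[](const std::string& key) { return *map_of_keys_b.at(key); }

private:
    std::map<int, std::shared_ptr<Data>> map_of_keys_a;
    std::map<std::string, std::shared_ptr<Data>> map_of_keys_b;
};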
Key of the same type, but different value
Idea 1
The simplest solution I can think of in 60 seconds (simplest meaning exactly that: it should really be thought through). I'd also switch to unordered_map by default.
map<Key, Data> data;
map<Key2, Key> keys;
Access via data[keys["multikey"]].
This will obviously waste some space (duplicating objects of Key type), but I am assuming they are much smaller than the Data type.
Idea 2
Another solution would be to use pointers; then the only cost of each duplicate is a (smart) pointer:
map<Key, shared_ptr<Data>> data;
The Data object will stay alive as long as at least one key points to it.

What I usually do in these cases is use non-owning pointers. I store my data in a vector:
std::vector<Data> myData;
And then I map pointers to each element. Since pointers can be invalidated when the vector grows, though, I choose to use vector indexes in this case.
std::map<Key1, int> myMap1;
std::map<Key2, int> myMap2;
Don't expose the data containers to your clients. Encapsulate element insertion and removal in specific functions, which insert everywhere and remove everywhere.
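A sketch of that encapsulation, assuming Key1, Key2 and Data are simple placeholder types (erasure is only hinted at, since removing from the middle of the vector shifts indexes):

#include <cstddef>
#include <map>
#include <vector>

struct Data { int value; };
using Key1 = int;
using Key2 = long;

class Store
{
public:
    void insert(const Key1& k1, const Key2& k2, const Data& d)
    {
        myData.push_back(d);
        const std::size_t index = myData.size() - 1;
        myMap1[k1] = index;   // indexes stay valid even if the
        myMap2[k2] = index;   // vector reallocates on growth
    }

    Data& findByKey1(const Key1& k) { return myData[myMap1.at(k)]; }
    Data& findByKey2(const Key2& k) { return myData[myMap2.at(k)]; }

    // Removal needs care: erasing from the middle shifts indexes,
    // so it must update every map (e.g. swap-and-pop plus fixups).

private:
    std::vector<Data> myData;
    std::map<Key1, std::size_t> myMap1;
    std::map<Key2, std::size_t> myMap2;
};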

Bartek's "Idea 1" is good (though there's no compelling reason to prefer unordered_map to map).
Alternatively, you could have a std::map<C2, Data*>, or std::map<C2, std::map<C1, Data>::iterator> to allow direct access to Data objects after one C2-keyed search, but then you'd need to be more careful not to access invalid (erased) Data (or more precisely, to erase from both containers atomically from the perspective of any other users).
It's also possible for one or both maps to move to shared_ptr<Data> - the other could use weak_ptr<> if that's helpful ownership-wise. (These are in the C++11 Standard; otherwise the obvious source - boost - is apparently out for you, but maybe you've implemented your own or selected another library? They're pretty fundamental classes for modern C++.)
EDIT - hash tables versus balanced binary trees
This isn't particularly relevant to the question, but has received comments/interest below and I need more space to address it properly. Some points:
1) Bartek's casually advising a change from map to unordered_map, without recommending an impact study re iterator/pointer invalidation, is dangerous and unwarranted given there's no reason to think it's needed (the question doesn't mention performance) and no recommendation to profile.
2) Relatively few data structures in a program are important to performance-critical behaviours, and there are plenty of times when the relative performance of one versus another is of insignificant interest. Supporting this claim: masses of code were written with std::map to ensure portability before C++11, and they perform just fine.
3) When performance is a serious concern the advice should be "Care => profile", but a rule of thumb is OK - in line with "Don't pessimise prematurely" (see e.g. Sutter and Alexandrescu's C++ Coding Standards) - and if asked for one here I'd happily recommend unordered_map by default; that's still not particularly reliable, and a world away from recommending that every std::map usage I see be changed.
4) This container-performance side-track has started to pull in ad-hoc snippets of useful insight, but is far from comprehensive or balanced. This question is not a sane venue for such a discussion. If there's another question addressing this where it makes sense to continue, and someone asks me to chip in, I'll do it sometime over the next month or two.

You could consider having a plain std::list holding all your data, and then various std::map objects mapping arbitrary key values to iterators pointing into the list:
std::list<Data> values;
std::map<C1, std::list<Data>::iterator> byC1;
std::map<C2, std::list<Data>::iterator> byC2;
I.e. instead of fiddling with more-or-less-raw pointers, you use plain iterators. And iterators into a std::list have very good invalidation guarantees.
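A minimal sketch of how insertion and coordinated erasure could look (Data and the key types are placeholders; int and long stand in for C1 and C2):

#include <iterator>
#include <list>
#include <map>

struct Data { int value; };

int main()
{
    std::list<Data> values;
    std::map<int, std::list<Data>::iterator> byC1;
    std::map<long, std::list<Data>::iterator> byC2;

    values.push_back(Data{7});
    auto it = std::prev(values.end());  // stays valid until erased
    byC1[1] = it;
    byC2[2L] = it;

    // Erase from every map and then from the list, together:
    byC1.erase(1);
    byC2.erase(2L);
    values.erase(it);
}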

I had the same problem. At first, holding two maps of shared pointers sounded very appealing, but you still need to manage both maps (inserting, removing, etc.).
Then I came up with another way of doing it.
My use case was accessing data by either x-y or radius-angle: each point holds data, but a point can be described in Cartesian (x, y) or polar (radius, angle) coordinates.
So I wrote a struct like this:
struct MyPoint
{
    std::pair<int, int> cartesianPoint;
    std::pair<int, int> radianPoint;

    bool operator== (const MyPoint& rhs) const
    {
        return cartesianPoint == rhs.cartesianPoint
            || radianPoint == rhs.radianPoint;
    }
};
After that I could use it as a key:
std::unordered_map<MyPoint, DataType> myMultIndexMap;
Note that an unordered_map also needs a hash function for MyPoint, and the standard requires that keys which compare equal produce the same hash; an either-or comparison like the one above can't strictly guarantee that, so treat this as a starting point rather than a finished design.
I am not sure if your case is the same or adjustable to this scenario, but it can be an option.


C++ Counting Map

Recently I was dealing with what I am sure is a very common problem, which essentially boils down into the following:
Given a long text, calculate the frequency of each word occurring in the text.
I was able to solve this problem using std::unordered_map. This, however, turned out quite ugly: for every word in the text that had already been encountered, I had to do a find, erase, and then a re-insert into the map with the value incremented.
I realise there are other ways of doing this, such as using a hashing function on top of a vanilla array/vector and incrementing the value there, but I was wondering if there was a more elegant way of solving this problem, like an STL component or function that would have a similar interface to Python's Counter collection.
I know C++ being C++ I can't really expect such high-level concepts to always be implemented for me, but I was just wondering if you guys knew about anything (or at least your Googling skills are superior to mine) which could make my code a little nicer.
I'm not quite sure why an std::unordered_map (or just std::map) would involve much complexity. I'd write the code something like this:
std::unordered_map<std::string, int> words;
std::string word;
while (input >> word)   // input: any std::istream, e.g. std::cin
    ++words[word];
There's no need for any kind of find/erase/reinsert.
In case it's not clear how/why this works: operator[] will create an entry for a value if none exists yet in the map. The associated value will be a value-initialized object of the specified type, which will be zero in the case of an int (or similar). We then increment that every time we encounter the word.
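For completeness, here is a self-contained version of the idea, reading whitespace-delimited words from standard input (no punctuation handling, just the counting pattern):

#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> words;
    std::string word;
    while (std::cin >> word)
        ++words[word];   // operator[] value-initializes missing entries to 0

    for (const auto& entry : words)
        std::cout << entry.first << ' ' << entry.second << '\n';
}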
An alternative solution:
std::multiset<std::string> m;
while (input >> word)
    m.insert(word);
m.count("some word");
The advantage is that you don't have to rely on the 'trick' with operator[], making the code more readable.
EDIT: As Kerrek pointed out in the comments, this solution is slower. multiset stores all the elements you insert, even if they are deemed equal (they might still differ in some aspect that operator== does not check). This causes a significant overhead compared to unordered_map<std::string, int>, which only has to store each word once.
(As a side note, processing the complete works of William Shakespeare using the map solution takes about 0.33s on my machine, as opposed to 0.78s for the multiset solution.)

Is using a map where value is std::shared_ptr a good design choice for having multi-indexed lists of classes?

The problem is simple:
We have a class that has members a, b, c, d...
We want to be able to quickly search (the key being the value of one member) and update the class list with new values by providing the current value of a or b or c ...
I thought about having a bunch of
std::map<decltype(MyClass::a) /* likewise for b, c, d */, std::shared_ptr<MyClass>>.
1) Is that a good idea?
2) Is boost multi index superior to this handcrafted solution in every way?
PS SQL is out of the question for simplicity/perf reasons.
Boost MultiIndex may have a distinct disadvantage that it will attempt to keep all indices up to date after each mutation of the collection.
This may be a large performance penalty if you have a data load phase with many separate writes.
The usage patterns of Boost MultiIndex may not fit with the coding style (and taste...) of the project members. This should be a minor disadvantage, but I thought I'd mention it.
As ildjarn mentioned, Boost MI doesn't support move semantics as of yet
Otherwise, I'd consider Boost MultiIndex superior in most occasions, since you'd be unlikely to reach the amount of testing it received.
You may want to consider containing all of your maps in a single class, arbitrarily deciding on one of the containers as the one that stores the "real" objects, and then using std::maps with a mapped type of raw pointers to the elements of that first map.
This would be a little more difficult if you ever need to make copies of those maps, however.
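A rough sketch of what such a wrapper class might look like, assuming hypothetical members a and b (the real class would index c, d, ... the same way):

#include <map>
#include <memory>
#include <string>

struct MyClass
{
    int a;
    std::string b;
};

class MultiIndex
{
public:
    void insert(const std::shared_ptr<MyClass>& p)
    {
        byA[p->a] = p;   // every index map shares
        byB[p->b] = p;   // ownership of the object
    }

    void eraseByA(int a)
    {
        auto it = byA.find(a);
        if (it == byA.end()) return;
        byB.erase(it->second->b);   // remove from every index together
        byA.erase(it);
    }

    std::shared_ptr<MyClass> findByA(int a) const
    {
        auto it = byA.find(a);
        if (it == byA.end()) return nullptr;
        return it->second;
    }

private:
    std::map<int, std::shared_ptr<MyClass>> byA;
    std::map<std::string, std::shared_ptr<MyClass>> byB;
};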

stl map performance?

I am using map<MyStruct, I*> map1;. Apparently 9% of my total app time is spent in there, specifically on one line of one of my major functions. The map isn't very big (<1k almost always, <20 is common).
Is there an alternative implementation I may want to use? I think I shouldn't write my own, but I could if I thought it was a good idea.
Additional info: I always check before adding an element. If a key exists, I need to report a problem. Then after a point I will be using the map heavily for lookups and will not add any more elements.
First you need to understand what a map is and what the operations you are doing represent. A std::map is a balanced binary tree; lookup takes O(log N) operations, each of which is a comparison of the keys plus some extra work that you can ignore in most cases (pointer management). Insertion takes roughly the same time to locate the point of insertion, plus allocation of the new node, the actual insertion into the tree, and rebalancing. The complexity is again O(log N), although the hidden constants are higher.
When you try to determine whether a key is in the map prior to insertion, you incur the cost of the lookup, and if it does not succeed, the same cost again to locate the point of insertion. You can avoid the extra cost by using std::map::insert, which returns a pair with an iterator and a bool telling you whether the insertion actually happened or the element was already there.
Beyond that, you need to understand how costly it is to compare your keys, which cannot be determined from what the question shows (MyStruct could hold just one int or a thousand of them), and which is something you need to take into account.
Finally, it might be the case that a map is not the most efficient data structure for your needs. You might consider using either a std::unordered_map (hash table), which has expected constant-time insertions (if the hash function is not horrible), or, for small data sets, even a plain sorted array (or std::vector) on which you can use binary search to locate elements (this will reduce the number of allocations at the cost of more expensive insertions, but if the held types are small enough it might be worth it).
As always with performance, measure and then try to understand where the time is being spent. Also note that 10% of the time spent in a particular function or data structure might be a lot or almost nothing at all, depending on what your application is. For example, if your application just performs lookups and insertions into a data set, and that takes only 10% of the CPU, you have a lot to optimize everywhere else!
Probably it will be quicker to just do an insert and check whether pair.second is false, which tells you the key already exists:
// key is a MyStruct, value is an I*
if (myMap.insert(std::make_pair(key, value)).second == false)
{
    // report error: the key was already present
}
else
{
    // inserted new value
}
... rather than doing a find call every time.
Instead of map you could try unordered_map which uses hash keys, instead of a tree, to find elements. This answer gives some hints when to prefer unordered_map over map.
It might be a long shot, but for small collections, sometimes the most critical factor is the cache performance.
Since std::map implements a red-black tree, which is [AFAIK] not very cache-efficient, maybe implementing the map as a std::vector<std::pair<MyStruct, I*>> and using binary search there [instead of map lookups] would be a good idea. At the very least it should be efficient once you stop inserting elements and only do lookups, since a std::vector is more likely to fit in cache than the map.
This factor [the CPU cache] is usually neglected and hidden as a constant in big-O notation, but for large collections it might have a major effect.
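If you want to try it, a sketch of the sorted-vector approach might look like this (MyStruct's contents are hypothetical):

#include <algorithm>
#include <utility>
#include <vector>

struct MyStruct
{
    int key;   // stand-in for the real members
    bool operator<(const MyStruct& o) const { return key < o.key; }
};
struct I { };

int main()
{
    std::vector<std::pair<MyStruct, I*>> v;
    // ... insertion phase: fill v ...
    std::sort(v.begin(), v.end(),
              [](const std::pair<MyStruct, I*>& a, const std::pair<MyStruct, I*>& b)
              { return a.first < b.first; });

    // Lookup phase: binary search instead of tree traversal.
    MyStruct needle{42};
    auto it = std::lower_bound(v.begin(), v.end(), needle,
                               [](const std::pair<MyStruct, I*>& e, const MyStruct& k)
                               { return e.first < k; });
    if (it != v.end() && !(needle < it->first))
    {
        // found: it->second is the associated I*
    }
}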
The way you are using the map, you're doing lookups on the basis of a MyStruct instance, and depending on your particular implementation, the required comparison may or may not be costly.
Is there an alternative implementation i may want to use? I think i shouldn't write my own but i could if i thought it was a good idea.
If you understand the problem well enough, you should detail how your implementation will be superior.
Is map the proper structure? If so, then your standard library's implementation will likely be of good quality (well optimized).
Can MyStruct comparison be simplified?
Where is the problem -- resizing? lookup?
Have you minimized copy and assign costs for your structures?
As stated in the comments, without the actual code there are few universal answers to give you. However, if MyStruct is really huge, copying it around may be costly. Perhaps it makes sense to store pointers to MyStruct and implement your own compare mechanism:
template <typename T> struct deref_cmp {
    bool operator()(const std::shared_ptr<T>& lhs, const std::shared_ptr<T>& rhs) const {
        return *lhs < *rhs;
    }
};

std::map<std::shared_ptr<MyStruct>, I*, deref_cmp<MyStruct>> mymap;
However, this is something you will have to profile. It might speed things up.
You would look up an element like this
template <typename T> struct NullDeleter {
    void operator()(T const*) const {}
};

// needle being a MyStruct
mymap.find(std::shared_ptr<MyStruct>(&needle, NullDeleter<MyStruct>()));
Needless to say, there is more potential to optimise.

C++ some questions on boost::unordered_map & boost::hash

I've only recently started delving into Boost and its containers, and I read a few articles on the web and on stackoverflow saying that a boost::unordered_map is the fastest-performing container for big collections.
So, I have this class State, which must be unique in the container (no duplicates), and there will be millions if not billions of states in the container.
Therefore I have been trying to optimize it for small size and as few computations as possible. I was using a boost::ptr_vector before, but as I read on stackoverflow, a vector is only good as long as there are not that many objects in it.
In my case, the State describes sensorimotor information from a robot, so there can be an enormous number of states, and therefore fast lookup is of topmost priority.
Following the boost documentation for unordered_map, I realized that there are two things I could do to speed things up: use a hash function, and use an equality operator that compares States based on their hash values.
So, I implemented a private hash() function which takes in the State information and, using boost::hash_combine, creates a std::size_t hash value.
The operator== basically compares the states' hash values.
So:
1) Is std::size_t enough to cover billions of possible hash-function combinations? In order to avoid duplicate states I intend to use their hash values.
2) When creating a state_map, should I use the State* or the hash value as the key? i.e.:
boost::unordered_map<State*, std::size_t> state_map;
or
boost::unordered_map<std::size_t, State*> state_map;
3) Are the lookup times with a boost::unordered_map::iterator = state_map.find() faster than going through a boost::ptr_vector and comparing each iterator's key value?
4) Finally, any tips or tricks on how to optimize such an unordered map for speed and fast lookups would be greatly appreciated.
EDIT: I have seen quite a few answers, one being not to use boost but C++0x, another being to use an unordered_set instead of an unordered_map, but to be honest, I still want to see how boost::unordered_set is used with a hash function.
I have followed boost's documentation and implemented it, but I still cannot figure out how to use boost's hash function with the unordered set.
This is a bit muddled.
What you say are not "things that you can do to speed things up"; rather, they are mandatory requirements of your type to be eligible as the element type of an unordered map, and also for an unordered set (which you might rather want).
You need to provide an equality operator that compares objects, not hash values. The whole point of the equality is to distinguish elements with the same hash.
size_t is an unsigned integral type, 32 bits on x86 and 64 bits on x64. Since you want "billions of elements", which means many gigabytes of data, I assume you have a solid x64 machine anyway.
What's crucial is that your hash function is good, i.e. has few collisions.
You want a set, not a map. Put the objects directly in the set: std::unordered_set<State>. Use a map if you are mapping to something, i.e. states to something else. Oh, use C++0x, not boost, if you can.
Using hash_combine is good.
Baby example:
#include <unordered_set>

struct State
{
    bool operator==(const State &) const;
    /* Stuff */
};

namespace std
{
    template <> struct hash<State>
    {
        std::size_t operator()(const State & s) const
        {
            /* your hash algorithm here */
        }
    };
}

// An alternative hash, as a function object type (the container's
// hash parameter must be a type, not a function):
struct Foo
{
    std::size_t operator()(const State & s) const { /* some code */ }
};

int main()
{
    std::unordered_set<State> states;        // uses std::hash<State>
    std::unordered_set<State, Foo> states2;  // uses the custom hash
}
An unordered_map is a hash table. You don't store the hash yourself; it is computed and used internally as the storage and lookup mechanism.
Given your requirements, an unordered_set might be more appropriate, since your object is the only item to store.
You are a little confused though -- the equality operator and hash function are not truly performance items, but required for nontrivial objects for the container to work correctly. A good hash function will distribute your nodes evenly across the buckets, and the equality operator will be used to remove any ambiguity about matches based on the hash function.
std::size_t is fine for the hash function. Remember that no hash is perfect; there will be collisions, and these collision items are stored in a linked list at that bucket position.
Thus, .find() will be O(1) in the optimal case and very close to O(1) in the average case (and O(N) in the worst case, but a decent hash function will avoid that.)
You don't mention your platform or architecture; at billions of entries you still might have to worry about out-of-memory situations depending on those and the size of your State object.
Forget about the hash; there is nothing (at least in your question) that suggests you have a meaningful key.
Let's take a step back and rephrase your actual performance goal:
you want to quickly validate that no duplicates ever exist for any of your State objects
(comment if I need to add others).
From the aforementioned goal, and from your comment, I would suggest you actually use an ordered set (std::set) rather than an unordered_map. Yes, the ordered search uses binary search, O(log N), while the unordered lookup is O(1).
However, the difference is that with this approach you need the ordered set ONLY to check that a similar state doesn't already exist when you are about to create a new one, that is, at State creation time.
In all the other lookups, you don't actually need to search the ordered set at all, because you already have the key: the State*, and the key gives you the value through the magic dereference operator: *key.
So with this approach you are using the ordered set only as an index to verify States at creation time. In all other cases, you access your State through the dereference operator of your pointer-value key.
If all the above wasn't enough to convince you, here is the final nail in the coffin of the idea of using a hash to quickly determine equality: a hash function has a small probability of collision, but as the number of states grows, that probability approaches certainty. So, depending on your fault tolerance, you are going to have to deal with state collisions (and from your question and the number of States you expect to handle, it seems you will deal with a lot of them).
For this to work, you obviously need the compare predicate to test all the internal properties of your state (gyroscope, thrusters, accelerometers, proton rays, etc.).
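A sketch of that creation-time index could look like this (State's fields and the raw-pointer ownership scheme are hypothetical simplifications):

#include <set>

struct State
{
    double x, y;   // stand-ins for the real sensorimotor fields
    bool operator<(const State& o) const
    {
        if (x != o.x) return x < o.x;
        return y < o.y;
    }
};

// Compares the pointed-to contents, not the pointer values.
struct DerefLess
{
    bool operator()(const State* a, const State* b) const { return *a < *b; }
};

// Used only at creation time to reject duplicates:
std::set<const State*, DerefLess> creation_index;

const State* create(const State& candidate)
{
    const State* p = new State(candidate);
    auto result = creation_index.insert(p);
    if (!result.second)           // an equal State already exists:
    {
        delete p;                 // drop the duplicate and hand back
        return *result.first;     // the pointer to the original
    }
    return p;                     // new unique State; keep it
}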

Aggregating contributions from multiple donors

As I try to modernize my C++ skills, I keep encountering this situation where "the STL way" isn't obvious to me.
I have an object that wants to gather contributions from multiple sources into a container (typically a std::vector). Each source is an object, and each of those objects provides a method get_contributions() that returns any number of contributions (from 0 to many). The gatherer will call get_contributions() on each contributor and aggregate the results into a single collection.
The question is, what's the best signature for get_contributions()?
Option 1: std::vector<contribution> get_contributions() const
This is the most straightforward, but it leads to lots of copying as the gatherer copies each set of results into the master collection. And yes, performance matters here. For example, if the contributors were geometric models and getting contributions amounted to tessellating them into triangles for rendering, then speed would count and the number of contributions could be enormous.
Option 2: template <typename container> void get_contributions(container &target) const
This allows each contributor to add its contributions directly to the master container by calling target.push_back(foo). The drawback here is that we're exposing the container to other types of inspection and manipulation. I'd prefer to keep the interface as narrow as possible.
Option 3: template <typename out_it> void get_contributions(out_it &it) const
In this solution, the aggregator would pass a std::back_insert_iterator for the master collection, and the individual contributors would do *it++ = foo; for each contribution. This is the best I've come up with so far, but I'm left with the feeling that there must be a more elegant way. The back_insert_iterator feels like a kludge.
Is Option 3 the best, or is there a better approach? Does this gathering pattern have a name?
There's a fourth option, which would require you to define your own iterator ranges. Check out Alexandrescu's presentation "Iterators Must Go".
Option 3 is the most idiomatic way. Note that you don't have to use back_insert_iterator. If you know how many elements are going to be added, you can resize the vector, and then provide a regular vector iterator instead. It won't call push_back then (and potentially save you some copying)
back_insert_iterator's main advantage is that it expands the vector as needed.
It's not a kludge though. It's designed for this exact purpose.
One minor adjustment would be to pass the iterator by value, and then return it when the function returns.
I would say there are two idiomatic STL ways: your Option 3 (taking an output iterator, which you'd pass by value, by the way) and taking a functor which you would call with each of the contributions.
Each of these is only appropriate if it is suitable to implement get_contributions as a template, of course.
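To make Option 3 concrete, here is a small sketch; the contribution type and the contributor's internal storage are made up for illustration:

#include <iterator>
#include <vector>

struct contribution { int value; };

class contributor
{
public:
    // Option 3: write contributions through any output iterator.
    template <typename OutIt>
    OutIt get_contributions(OutIt out) const
    {
        for (const contribution& c : contributions_)
            *out++ = c;
        return out;   // pass by value, return the advanced iterator
    }

private:
    std::vector<contribution> contributions_;
};

int main()
{
    std::vector<contribution> master;
    contributor a, b;
    auto it = a.get_contributions(std::back_inserter(master));
    b.get_contributions(it);   // keeps appending to the same master vector
}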