Avoid extra process in unordered_map insertion

I have an std::unordered_map, and I want both to increment the first value in a std::pair, hashed by key, and to create a reference to key. For example:
std::unordered_map<int, std::pair<int, int> > hash;
hash[key].first++;
auto it(hash.find(key));
int& my_ref(it->first);
I could, instead of using the [] operator, insert the data with insert(), but I'd allocate a pair, even if it were to be deallocated later, as hash may already have key -- not sure of it, though. Making it clearer:
// If "key" is already inserted, the pair(s) will be allocated
// and then deallocated, right?
auto it(hash.insert(std::make_pair(key, std::make_pair(0, 0))));
it->second.first++;
// Here I can have my reference, with extra memory operations,
// but without an extra search in `hash`
int& my_ref(it->first);
I'm pretty much inclined to use the first option, but I can't seem to decide which one is the best. Any better solution to this?
P.S.: an ideal solution for me would be something like an insertion that does not require an initial, possibly useless, allocation of the value.

As others have pointed out, "allocating" a std::pair<int,int> is really nothing more than copying two integers (on the stack). For map<int,pair<int,int>>::value_type, which is pair<int const, pair<int, int>>, you are at three ints, so there is no significant overhead in using your second approach. You can optimize slightly by using emplace instead of insert, i.e.:
// Here an `int` and a struct containing two `int`s are passed as arguments (by value)
auto it(hash.emplace(key, std::make_pair(0, 0)).first);
it->second.first++;
// You get your reference, without an extra search in `hash`
// Not sure what "extra memory operations" you worry about
int const& my_ref(it->first);
Your first approach, using both hash[key] and hash.find(key) is bound to be more expensive, because an element search will certainly be more expensive than an iterator dereference.
Premature copying of arguments on their way to construction of the unordered_map<...>::value_type is a negligible problem, when all arguments are just ints. But if instead you have a heavyweight key_type or a pair of heavyweight types as mapped_type, you can use the following variant of the above to forward everything by reference as far as possible (and use move semantics for rvalues):
// Here key and arguments to construct mapped_type
// are forwarded as tuples of universal references
// There is no copying of key or value nor construction of a pair
// unless a new map element is needed.
auto it(hash.emplace(std::piecewise_construct,
std::forward_as_tuple(key), // one-element tuple
std::forward_as_tuple(0, 0) // args to construct mapped_type
).first);
it->second.first++;
// As in all solutions, get your reference from the iterator we already have
int const& my_ref(it->first);
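
If C++17 is available, try_emplace expresses the same idea more directly: one lookup, and the mapped pair is only constructed when a new element is actually inserted. A minimal sketch, where the alias Counts and the helper bump are hypothetical names used only for illustration:

```cpp
#include <unordered_map>
#include <utility>

using Counts = std::unordered_map<int, std::pair<int, int>>;

// Hypothetical helper: bumps the first counter for `key`, inserting {0, 0}
// first if the key is absent, and returns a reference to the stored key.
inline const int& bump(Counts& hash, int key) {
    // With no extra arguments the mapped pair is value-initialized to {0, 0}.
    auto it = hash.try_emplace(key).first;
    ++it->second.first;
    return it->first;  // keys are const inside the map
}
```

Calling bump(hash, key) twice on an empty map leaves a single element whose first counter is 2.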

How about this:
auto it = hash.find(key);
if (it == hash.end()) { it = hash.emplace(key, std::make_pair(0, 0)).first; }
++it->second.first;
int const & my_ref = it->first; // must be const
(If it were an ordered map, you'd use lower_bound and hinted insertion to recycle the tree walk.)

If I understand correctly, what you want is an operator[] that returns an iterator, not a mapped_type. The current interface of unordered_map does not provide such a feature, and the operator[] implementation relies on private members (at least in the boost implementation; I don't have access to C++11 std files in my environment).
I suppose that JoergB's answer will be faster and Kerrek SB's one will have a smaller memory footprint. It's up to you to decide what is more critical for your project.

Related

std::map - adding element using subscript operator Vs insert method

I am trying to understand and make sure if three different ways to insert elements into a std::map are effectively the same.
std::map<int, char> mymap;
Just after declaring mymap - will inserting an element with value a for key 10 be same by these three methods?
mymap[10]='a';
mymap.insert(mymap.end(), std::make_pair(10, 'a'));
mymap.insert(std::make_pair(10, 'a'));
Especially, does it make any sense using mymap.end() when there is no existing element in std::map?

The main difference is that (1) first default-constructs a value object (the mapped_type) in the map in order to be able to return a reference to this object. This enables you to assign something to it.
Keep that in mind if you are working with types that are stored in a map, but have no default constructor. Example:
struct A {
explicit A(int) {};
};
std::map<int, A> m;
m[10] = A(42); // Error! A has no default ctor
m.insert(std::make_pair(10, A(42))); // Ok
m.insert(m.end(), std::make_pair(10, A(42))); // Ok
The other notable difference is that (as @PeteBecker pointed out in the comments) (1) overwrites existing entries in the map, while (2) and (3) don't.

Yes, they are effectively the same. Just after declaring mymap, all three methods leave mymap holding the single entry {10, 'a'}.
It is OK to use mymap.end() when there is no existing element in std::map. In this case, begin() == end(), which is the universal way of denoting an empty container.

(1) is different from (2) and (3) if there already exists an element with the same key. (1) will replace the element's value, whereas (2) and (3) will leave the map unchanged and return a value indicating that no insertion happened.
(1) also requires that mapped type is default constructible. In fact (1) first default constructs the object if not present already and replaces that with the value specified.
(2) and (3) are also different. To understand the difference we need to understand what the iterator in (2) does. From cppreference, the iterator refers to a hint where insertion happens as close to that hint as possible. There is a performance difference depending on the validity of the hint. Quoting from the same page:
Amortized constant if the insertion happens in the position just after the hint, logarithmic in the size of the container otherwise.(until C++11)
Amortized constant if the insertion happens in the position just before the hint, logarithmic in the size of the container otherwise. (since C++11)
So for large maps we can get a performance boost if we already know the position somehow.
Having said all of this, if the map has just been created and you are doing the operation with no prior elements in the map, as you said in the question, then I would say that all three will be practically the same (though their internal operation will differ as described above).
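
The difference on an existing key can be checked with a small sketch like the one below (the function names are made up for this example):

```cpp
#include <map>
#include <utility>

// Returns the value stored under key 10 after three attempts to set it.
inline char demo_insert_keeps_existing() {
    std::map<int, char> m;
    m[10] = 'a';                                 // inserts {10, 'a'}
    m.insert(std::make_pair(10, 'b'));           // key exists: no effect
    m.insert(m.end(), std::make_pair(10, 'c'));  // hinted insert: no effect either
    return m.at(10);
}

// operator[] assigns through the returned reference, so it overwrites.
inline char demo_subscript_overwrites() {
    std::map<int, char> m;
    m.insert(std::make_pair(10, 'a'));
    m[10] = 'z';
    return m.at(10);
}
```

The first function returns 'a' and the second returns 'z'.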

Emplace empty vector into std::map()

How can I emplace an empty vector into a std::map? For example, if I have a std::map<int, std::vector<int>>, and I want map[4] to contain an empty std::vector<int>, what can I call?

If you use operator[](const Key&), the map will automatically emplace a value-initialized (i.e. in the case of std::vector, default-constructed) value if you access an element that does not exist. See here:
http://en.cppreference.com/w/cpp/container/map/operator_at
(Since C++ 11 the details are a tad more complicated, but in your case this is what matters).
That means if your map is empty and you do map[4], it will readily give you a reference to an empty (default-constructed) vector. Assigning an empty vector is unnecessary, although it may make your intent more clear.
Demo: https://godbolt.org/g/rnfW7g

Unfortunately the strictly-correct answer is indeed to use std::piecewise_construct as the first argument, followed by two tuples. The first represents the arguments to create the key (4), and the second represents the arguments to create the vector (empty argument set).
It would look like this:
map.emplace(std::piecewise_construct, // signal piecewise construction
std::make_tuple(4), // key constructed from int(4)
std::make_tuple()); // value is default constructed
Of course this looks unsightly, and other alternatives will work. They may even generate no more code in an optimised build:
This one notionally invokes default-construction of a temporary followed by move-construction into the map, but it is likely that the optimiser will see through it.
map.emplace(4, std::vector<int>());
This one invokes default-construction followed by copy-assignment. But again, the optimiser may well see through it.
map[4] = {};

To ensure an empty vector is placed at position 4, you may simply attempt to clear the vector at position 4.
std::map<int, std::vector<int>> my_map;
my_map[4].clear();
As others have mentioned, the indexing operator for std::map will construct an empty value at the specified index if none already exists. If that is the case, calling clear is redundant. However, if a std::vector<int> does already exist, the call to clear serves to, well, clear the vector there, resulting in an empty vector.
This may be more efficient than my previous approach of assigning {} (see below), because we probably plan on adding elements to the vector at position 4, and we don't pay any cost of new allocation this way. Additionally, if previous usage of my_map[4] indicates future usage, then our new vector will likely eventually be resized to nearly the same size as before, meaning we save on reallocation costs.
Previous approach:
just assign to {} and the container should properly construct an empty vector there:
std::map<int, std::vector<int>> my_map;
my_map[4] = {};
std::cout << my_map.size() << std::endl; // prints 1
Edit: As Jodocus mentions, if you know that the std::map doesn't already contain a vector at position 4, then simply attempting to access the vector at that position will default-construct one, e.g.:
std::map<int, std::vector<int>> my_map;
my_map[4]; // default-constructs a vector there

What's wrong with the simplest possible solution? map[4] = {};
In modern C++, this should do what you want with no or at least, very little, overhead.
If you must use emplace, the best solution I can come up with is this:
std::map<int, std::vector<int>> map;
map.emplace(4, std::vector<int>());

Use piecewise_construct with std::make_tuple:
map.emplace(std::piecewise_construct, std::make_tuple(4), std::make_tuple());
We are inserting an empty vector at position 4.
And for a more general case, such as emplacing a vector of 100 elements that are all equal to 10:
map.emplace(std::piecewise_construct, std::make_tuple(4), std::make_tuple(100, 10));
piecewise_construct: This constant value is passed as the first argument to construct a pair object to select the constructor form that constructs its members in place by forwarding the elements of two tuple objects to their respective constructor.
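
Both calls can be put side by side in a short sketch. The names VecMap and build_examples are made up here, and std::forward_as_tuple is used in place of std::make_tuple, which is an equally valid choice for this call:

```cpp
#include <map>
#include <tuple>
#include <utility>
#include <vector>

using VecMap = std::map<int, std::vector<int>>;

inline VecMap build_examples() {
    VecMap m;
    // Key 4: the mapped vector is constructed from an empty argument list.
    m.emplace(std::piecewise_construct,
              std::forward_as_tuple(4),
              std::forward_as_tuple());
    // Key 5: calls std::vector<int>(100, 10), i.e. 100 elements equal to 10.
    m.emplace(std::piecewise_construct,
              std::forward_as_tuple(5),
              std::forward_as_tuple(100, 10));
    return m;
}
```

After build_examples(), the vector at key 4 is empty and the vector at key 5 holds 100 copies of 10.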

Is emplace for basic types worth it?

Let's say I have a map<int, int>:
std::map<int, int> map;
map.emplace(1, 2);
map.insert({3, 4});
Will there be any difference between the two calls?
In the first call, the two integers will be copied by value to the emplace function and then again to the std::pair<int, int> constructor. In the second call, the two integers will be copied by value to the std::pair<int, int> constructor and then be copied by value to the internal std::pair<int, int> again as members of the first pair.
I understand the benefits of emplace for types like std::string where they would be copied by value in the second call and moved all the way in the first one, but is there any benefit in using emplace in the situation described?

Emplace is slower, if there is a chance that the emplace will fail (the key is already present).
This is because emplace in general has to allocate a node and construct the pair<Key const, Value> into it, then extract the key from that node and check whether the key is already present, then deallocate the node if the key is already present. On the other hand, insert can extract the key from the passed value to be inserted, so it does not need to allocate a node if the insert would fail. See: performance of emplace is worse than check followed by emplace.
To fix this, C++17 adds a member function try_emplace(const key_type& k, Args&&... args) (etc.)
In case of success, there is no real difference between the two cases; the order of operations is different, but that will not affect performance in any predictable fashion. Code size will still be slightly larger for the emplace variant, as it has to be ready to perform more work in the failure case.
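
One observable consequence of try_emplace is that its arguments are left untouched when the key already exists. A small sketch using a move-only mapped type (the function name is hypothetical, and std::make_unique needs C++14 on top of the C++17 requirement of try_emplace):

```cpp
#include <map>
#include <memory>
#include <utility>

// Returns true if `p` still owns its object after a failed try_emplace.
inline bool try_emplace_keeps_argument() {
    std::map<int, std::unique_ptr<int>> m;
    m.try_emplace(1, std::make_unique<int>(10));

    auto p = std::make_unique<int>(20);
    auto result = m.try_emplace(1, std::move(p));  // key 1 is already present
    // No insertion happened, so p must not have been moved from.
    return !result.second && p != nullptr && *m.at(1) == 10;
}
```

With plain emplace, whether p is still valid after a failed insertion is not guaranteed.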

Creating unordered_set of unordered_set

I want to create a container that will store unique sets of integers inside.
I want to create something similar to
std::unordered_set<std::unordered_set<unsigned int>>
But g++ does not let me do that and says:
invalid use of incomplete type 'struct std::hash<std::unordered_set<unsigned int> >'
What I want to achieve is to have unique sets of unsigned ints.
How can I do that?

I'm adding yet another answer to this question as currently no one has touched upon a key point.
Everyone is telling you that you need to create a hash function for unordered_set<unsigned>, and this is correct. You can do so by specializing std::hash<unordered_set<unsigned>>, or you can create your own functor and use it like this:
unordered_set<unordered_set<unsigned>, my_unordered_set_hash_functor> s;
Either way is fine. However there is a big problem you need to watch out for:
For any two unordered_set<unsigned> that compare equal (x == y), they must hash to the same value: hash(x) == hash(y). If you fail to follow this rule, you will get run time errors. Also note that the following two unordered_sets compare equal (using pseudo code here for clarity):
{1, 2, 3} == {3, 2, 1}
Therefore hash({1, 2, 3}) must equal hash({3, 2, 1}). Said differently, the unordered containers have an equality operator where order does not matter. So however you construct your hash function, its result must be independent of the order of the elements in the container.
Alternatively you can replace the equality predicate used in the unordered_set such that it does respect order:
unordered_set<unordered_set<unsigned>, my_unordered_set_hash_functor,
my_unordered_equal> s;
The burden of getting all of this right, makes:
unordered_set<set<unsigned>, my_set_hash_functor>
look fairly attractive. You still have to create a hash functor for set<unsigned>, but now you don't have to worry about getting the same hash code for {1, 2, 3} and {3, 2, 1}. Instead you have to make sure these hash codes are different.
I note that Walter's answer gives a hash functor that has the right behavior: it ignores order in computing the hash code. But then his answer (currently) tells you that this is not a good solution. :-) It actually is a good solution for unordered containers. An even better solution would be to return the sum of the individual hashes instead of hashing the sum of the elements.
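
A minimal sketch of such an order-independent functor, summing the per-element hashes. The names IntSet, UnorderedSetHash and SetOfSets are made up for this example, and the sum is only one possible combining step:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_set>

using IntSet = std::unordered_set<unsigned>;

// Hypothetical functor: adds up the hashes of the individual elements.
// Addition is commutative, so the result does not depend on iteration order.
struct UnorderedSetHash {
    std::size_t operator()(const IntSet& s) const {
        std::size_t h = 0;
        std::hash<unsigned> elem_hash;
        for (unsigned e : s) {
            h += elem_hash(e);  // unsigned overflow simply wraps around
        }
        return h;
    }
};

using SetOfSets = std::unordered_set<IntSet, UnorderedSetHash>;
```

With this functor, {1, 2, 3} and {3, 2, 1} hash to the same value, and the default operator== of unordered_set treats them as the same element.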

You can do this, but like every unordered_set/map element type, the inner unordered_set now needs a hash function to be defined. It does not have one by default, but you can write one yourself.

What you have to do is to define an appropriate hash for keys of type std::unordered_set<unsigned int> (since operator== is already defined for this key, you will not need to also provide the EqualKey template parameter for std::unordered_set<std::unordered_set<unsigned int>, Hash, EqualKey>).
One simple (albeit inefficient) option is to hash on the total sum of all elements of the set. This would look similar to this:
// needs <cstddef>, <functional>, <numeric> and <unordered_set>
template<typename T>
struct hash_on_sum
: private std::hash<typename T::value_type>
{
typedef typename T::value_type count_type; // containers expose value_type, not element_type
typedef std::hash<count_type> base;
std::size_t operator()(T const&obj) const
{
return base::operator()(std::accumulate(obj.begin(),obj.end(),count_type()));
}
};
typedef std::unordered_set<unsigned int> inner_type;
typedef std::unordered_set<inner_type, hash_on_sum<inner_type>> set_of_unique_sets;
However, while simple, this is not good, since it does not guarantee the following requirement. For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().

std::unordered_set<unsigned int> does not meet the requirements to be an element of a std::unordered_set, since there is no default hash function (i.e. std::hash<> is not specialized for std::unordered_set<unsigned int>).
you can provide one (it should be fast, and avoid collisions as much as possible) :
class MyHash
{
public:
std::size_t operator()(const std::unordered_set<unsigned int>& s) const
{
return ... // return some meaningful hash of the set elements
}
};
int main() {
std::unordered_set<std::unordered_set<unsigned int>, MyHash> u;
}
You can see very good examples of hash functions in this answer.
You should really provide both a Hash and an Equality function meeting the standard requirement of an Unordered Associative Container.

Hash() the default function to create hashes of your set's elements does not know how to deal with an entire set as an element. Create a hash function that creates a unique value for every unique set and you're good to go.
This is the constructor for an unordered_set
explicit unordered_set( size_type bucket_count = /*implementation-defined*/,
const Hash& hash = Hash(),
const KeyEqual& equal = KeyEqual(),
const Allocator& alloc = Allocator() );
http://en.cppreference.com/w/cpp/container/unordered_set/unordered_set
Perhaps the simplest thing for you to do is create a hash function for your unordered_set<unsigned int>
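std::size_t my_hash(const std::unordered_set<unsigned int>& element)
{
std::size_t h = 0;
for (unsigned int e : element)
{
// some sort of math to create a hash for the set; it must not depend on
// iteration order. Summing per-element hashes is one possible choice.
h += std::hash<unsigned int>()(e);
}
return h;
}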
edit: as seen in another answer, which I forgot completely, the hashing function must be within a Hash object. At least according to the constructor I pasted in my answer.

There's a reason there is no hash for unordered_set. An unordered_set is a mutable sequence by default. A hash must hold the same value for as long as the object is in the unordered_set. Thus your elements must be immutable. This is not guaranteed by using the modifier const&, as it only guarantees that the main unordered_set and its methods will not modify the sub-unordered_set. Not using a reference could be a safe solution (you'd still have to write the hash function), but do you really want the overhead of moving/copying unordered_sets?
You could instead use some kind of pointer. This is fine; a pointer is only a memory address and your unordered_set itself does not relocate (it might reallocate its element pool, but who cares?). Therefore your pointer is constant and it can hold the same hash for its lifetime in the unordered_set.
( EDIT: as Howard pointed out, you must ensure that two sets with the same elements are considered equal, whatever order the elements are stored in. By enforcing an order in how you store your integers, you get for free that two equal sets correspond to two equal vectors. )
As a bonus, you can now use a smart pointer within the main set itself to manage the memory of the sub-unordered_sets if you allocated them on the heap.
Note that this is still not the most efficient way to get a collection of sets of int. To make your sub-sets, you could write a quick wrapper around std::vector that stores the ints ordered by value. ints are small and cheap to compare, and a binary (dichotomic) search is only O(log n) in complexity. A std::unordered_set is a heavy structure, and what you lose by going from O(1) to O(log n) you gain back by having compact memory for each set. This shouldn't be too hard to implement and is almost guaranteed to perform better.
A harder-to-implement solution would involve a trie.
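
The sorted-vector idea could look roughly like this. It is only a sketch: the helper names are invented, and the outer container is a plain std::set (an assumption made to keep the example short, relying on the operator< that std::vector already provides) rather than a pointer-based unordered_set:

```cpp
#include <algorithm>
#include <set>
#include <vector>

using IntVec = std::vector<unsigned>;

// Hypothetical helper: sorts and removes duplicates, so that equal sets of
// integers always end up as equal vectors.
inline IntVec normalize(IntVec v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Membership test on a normalized vector: binary search, O(log n).
inline bool contains(const IntVec& sorted, unsigned x) {
    return std::binary_search(sorted.begin(), sorted.end(), x);
}

// std::vector is ordered lexicographically, so no extra functor is needed.
using UniqueSets = std::set<IntVec>;
```

Inserting normalize({3, 1, 2}) and normalize({1, 2, 3, 3}) into a UniqueSets stores a single element.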

Is it wise to use a pointer to access values in an std::map

Is it dangerous to return a pointer to the data obtained from std::map::find and use that, as opposed to getting a copy of the data?
Currently, I get a pointer to an entry in my map and pass it to another function to display the data. I'm concerned that items moving could cause the pointer to become invalid. Is this a legitimate concern?
Here is my sample function:
MyStruct* StructManagementClass::GetStructPtr(int structId)
{
std::map<int, MyStruct>::iterator foundStruct;
foundStruct= myStructList.find(structId);
if (foundStruct== myStructList.end())
{
MyStruct newStruct;
memset(&newStruct, 0, sizeof(MyStruct));
newStruct.structId= structId;
myStructList.insert(pair<int, MyStruct>(structId, newStruct));
foundStruct= myStructList.find(structId);
}
return (MyStruct*) &foundStruct->second;
}

It would undoubtedly be more typical to return an iterator than a pointer, though it probably makes little difference.
As far as remaining valid goes: a map iterator remains valid until/unless the item it refers to is removed/erased from the map.
When you insert or delete some other node in the map, that can result in the nodes in the map being rearranged. That's done by manipulating the pointers between the nodes though, so it changes what other nodes contain pointers to the node you care about, but does not change the address or content of that particular node, so pointers/iterators to that node remain valid.
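
This can be checked with a small experiment along these lines (the function name is made up for the example):

```cpp
#include <map>
#include <string>

// Returns true if a pointer taken before many further insertions and one
// unrelated erase still points at the same, unchanged element.
inline bool pointer_survives_other_changes() {
    std::map<int, std::string> m;
    m[500] = "kept";
    std::string* p = &m.find(500)->second;

    for (int i = 0; i < 1000; ++i) {
        m[i];  // inserts every missing key; key 500 already exists
    }
    m.erase(3);  // erasing a different element does not affect p

    return p == &m.at(500) && *p == "kept";
}
```

Erasing key 500 itself would of course leave p dangling.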

As long as you, your code, and your development team understand the lifetime of std::map values ( valid after insert, and invalid after erase, clear, assign, or operator= ), then using an iterator, const_iterator, ::mapped_type*, or ::mapped_type const* are all valid. Also, if the return is always guaranteed to exist, then ::mapped_type&, or ::mapped_type const& are also valid.
As for wise, I'd prefer the const versions over the mutable versions, and I'd prefer references over pointers over iterators.
Returning an iterator vs. a pointer is bad:
it exposes an implementation detail.
it is awkward to use, as the caller has to know to dereference the iterator, that the result is an std::pair, and that one must then call .second to get the actual value.
.first is the key that the user may not care about.
determining if an iterator is invalid requires knowledge of ::end(), which is not obviously available to the caller.

It's not dangerous - the pointer remains valid just as long as an iterator or a reference does.
However, in your particular case, I would argue that it is not the right thing anyway. Your function unconditionally returns a result. It never returns null. So why not return a reference?
Also, some comments on your code.
std::map<int, MyStruct>::iterator foundStruct;
foundStruct = myStructList.find(structId);
Why not combine declaration and assignment into initialization? Then, if you have C++11 support, you can just write
auto foundStruct = myStructList.find(structId);
Then:
myStructList.insert(pair<int, MyStruct>(structId, newStruct));
foundStruct = myStructList.find(structId);
You can simplify the insertion using make_pair. You can also avoid the redundant lookup, because insert returns an iterator to the newly inserted element (as the first element of a pair).
foundStruct = myStructList.insert(make_pair(structId, newStruct)).first;
Finally:
return (MyStruct*) &foundStruct->second;
Don't ever use C-style casts. It might not do what you expect. Also, don't use casts at all when they're not necessary. &foundStruct->second already has type MyStruct*, so why insert a cast? The only thing it does is hide a place that you need to change if you ever, say, change the value type of your map.
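
Putting these suggestions together, the function might end up looking something like the sketch below. MyStruct and its payload member are stand-ins, since the real definition is not shown in the question, and the method is renamed GetStruct here because it returns a reference:

```cpp
#include <map>
#include <utility>

// Hypothetical stand-in for the struct in the question.
struct MyStruct {
    int structId = 0;
    int payload = 0;
};

class StructManagementClass {
public:
    // Returns the entry for structId, creating a zeroed one if needed.
    MyStruct& GetStruct(int structId) {
        MyStruct fresh;
        fresh.structId = structId;
        // insert does nothing if the key exists; either way .first is the
        // iterator to the element, so no second lookup is required.
        auto it = myStructList.insert(std::make_pair(structId, fresh)).first;
        return it->second;
    }

private:
    std::map<int, MyStruct> myStructList;
};
```

Two calls with the same id return references to the same object, and no cast is needed anywhere.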

Yes,
If you build a generic function without knowing how it will be used, it can be dangerous to return the pointer (or the iterator), since it can become invalid.
I would advise doing one of two things:
1. work with std::shared_ptr and return that. (see below)
2. return the struct by value (can be slower)
// change the definition of the list to
std::map<int, std::shared_ptr<MyStruct>> myStructList;
std::shared_ptr<MyStruct> StructManagementClass::GetStructPtr(int structId)
{
std::map<int, std::shared_ptr<MyStruct>>::iterator foundStruct;
foundStruct = myStructList.find(structId);
if (foundStruct == myStructList.end())
{
MyStruct newStruct;
memset(&newStruct, 0, sizeof(MyStruct));
newStruct.structId = structId;
// insert returns the iterator to the new element, so no second find is needed
foundStruct = myStructList.insert(std::make_pair(structId, std::make_shared<MyStruct>(newStruct))).first;
}
return foundStruct->second;
}
