How to hash an unordered_map? - c++

boost::hash has hashing functions for most builtin types including containers.
But as stated in the boost::hash_range function description, the hash algorithm for ranges
is sensitive to the order of the elements, so it wouldn't be appropriate to use it with an unordered container.
Thus there is no boost::hash specialization for std::unordered_map or boost::unordered_map.
The question is:
Is there an "easy and efficient" way to hash an unordered_map without reimplementing a hash algorithm from scratch?

The problem here is that there is no guarantee that the items even have an ordering among them.
So, sorting the items may very well not work for arbitrary unordered containers. You have 2 options:
Just XOR the hashes of all the individual elements. This is the fastest.
Sort the hashes of the individual elements first, and then hash that sorted sequence. This may result in a better hash.

You can of course convert the unordered_map to some other data structure that has a guaranteed order and use that to generate the hash.
A better idea might be to hash each individual element of the map, put those hashes into a vector, then sort and combine the hashes. See for example How do I combine hash values in C++0x? to combine the hashes.
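#include <algorithm>
#include <cstddef>
#include <vector>

template<typename Hash, typename Iterator>
std::size_t order_independent_hash(Iterator begin, Iterator end, Hash hasher)
{
    // Hash every element on its own.
    std::vector<std::size_t> hashes;
    for (Iterator it = begin; it != end; ++it)
        hashes.push_back(hasher(*it));
    // Sorting makes the result independent of iteration order.
    std::sort(hashes.begin(), hashes.end());
    // Fold the sorted hashes together (the classic boost::hash_combine formula).
    std::size_t result = 0;
    for (auto it2 = hashes.begin(); it2 != hashes.end(); ++it2)
        result ^= *it2 + 0x9e3779b9 + (result<<6) + (result>>2);
    return result;
}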
Testing this on shuffled vectors shows that it always returns the same hash.
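The shuffle check can be reproduced with a small sketch along these lines. The order_independent_hash template repeats the algorithm from above; hash_survives_shuffles is a hypothetical helper name introduced only for this illustration, and the fixed seed is an arbitrary choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Same algorithm as above: hash each element, sort the hashes, then fold them.
template <typename Hash, typename Iterator>
std::size_t order_independent_hash(Iterator begin, Iterator end, Hash hasher)
{
    std::vector<std::size_t> hashes;
    for (Iterator it = begin; it != end; ++it)
        hashes.push_back(hasher(*it));
    std::sort(hashes.begin(), hashes.end());
    std::size_t result = 0;
    for (std::size_t h : hashes)
        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
    return result;
}

// Hypothetical helper: true if the hash stays the same across `rounds` shuffles.
bool hash_survives_shuffles(std::vector<int> values, int rounds)
{
    const std::size_t expected =
        order_independent_hash(values.begin(), values.end(), std::hash<int>());
    std::mt19937 rng(12345); // fixed seed so the run is reproducible
    for (int i = 0; i < rounds; ++i) {
        std::shuffle(values.begin(), values.end(), rng);
        if (order_independent_hash(values.begin(), values.end(), std::hash<int>()) != expected)
            return false;
    }
    return true;
}
```

Because the hashes are sorted before being folded, the input order cannot influence the result, so the helper should return true for any input.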
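Now to adapt that basic concept to work specifically with unordered_map. Since the iterator of unordered_map returns a pair, we need a hash function for that too. Note that the standard only allows specializing std templates when the specialization depends on a program-defined type, so the specializations of std::hash for std::pair and std::unordered_map below are formally undefined behavior; a separately named hasher struct is the safer choice in production code.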
namespace std
{
template<typename T1, typename T2>
struct hash<std::pair<T1,T2> >
{
typedef std::pair<T1,T2> argument_type;
typedef std::size_t result_type;
result_type operator()(argument_type const& s) const
{
result_type const h1 ( std::hash<T1>()(s.first) );
result_type const h2 ( std::hash<T2>()(s.second) );
return h1 ^ (h2 + 0x9e3779b9 + (h1<<6) + (h1>>2));
}
};
template<typename Key, typename T>
struct hash<std::unordered_map<Key,T> >
{
typedef std::unordered_map<Key,T> argument_type;
typedef std::size_t result_type;
result_type operator()(argument_type const& s) const
{
return order_independent_hash(s.begin(), s.end(), std::hash<std::pair<Key,T> >());
}
};
}
See it in action: http://ideone.com/WOLFbc

I think you may be confusing what hash is used for. It is for keys used to identify elements to determine where to store them. Two equivalent elements should hash to the same value.
Are you trying to see if two unordered maps are equivalent and storing them in some kind of container?
The keys to an unordered map - well those are hashed. In fact the container would have been called hash_map except that such a container already existed.
But ok, suppose you really do want to store unordered-maps and compare to see if two are equivalent. Well you'd have to come up with a hashing algorithm that would return the same value regardless of the position of the elements it contains. A checksum of all its elements (keys and values) would be one possible way.
Note also that just because two elements have the same hash value, it doesn't mean they are equivalent. It just means that if the hash value differs they definitely are not equivalent. In fact checksums are often used to verify data for exactly this reason. A wrong checksum is proof the data is not valid, and given a good formula, a correct one makes it highly likely although not certain that it is.

I'm curious given that you are trying to hash the unordered_map to use it as a key, and given that once you've hashed the unordered_map you won't be changing it (unless you use it to create a new key), would the performance hit of converting the unordered_map to an ordered map be acceptable (and then, of course, hashing the ordered map and using that as a key)? Or is the problem with that approach that you need the faster lookup time provided by an unordered_map?
For what it's worth there may be a space advantage to using an ordered map (based on the accepted answer in the following post an unordered_map generally uses more memory):
Is there any advantage of using map over unordered_map in case of trivial keys?
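The conversion idea could look roughly like the sketch below. It assumes string keys and int values purely for illustration, requires the key type to support operator<, and hash_via_ordered_copy is a hypothetical name. The folding step reuses the hash_combine-style formula seen earlier in this thread.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical helper: copy into a std::map (sorted by key), then fold the
// key and value hashes in that fixed order.
std::size_t hash_via_ordered_copy(const std::unordered_map<std::string, int>& m)
{
    const std::map<std::string, int> ordered(m.begin(), m.end());
    std::size_t seed = 0;
    for (const auto& kv : ordered) {
        seed ^= std::hash<std::string>()(kv.first) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        seed ^= std::hash<int>()(kv.second) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
}
```

The copy costs O(N log N) time and extra memory on every call, which is the performance hit the question above is asking about.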

You haven't specified any performance requirements, but if you just want a "quick and dirty" solution that won't require much coding on your behalf and would take advantage of boost::hash, you can copy the range of items from unordered_map to a vector, std::sort the vector, and then pass it to boost::hash_range.
Hardly the most effective solution, though, and not one you'd want to use often or with many elements.
My preferred approach would be a specialization of unordered_map that keeps a running, up-to-date hash of the contents — you shouldn't have to pass over all elements and perform a calculation to get the current value. Instead, a member of the data structure should reflect the hash, and be modified in real-time as elements are inserted or removed, and read when needed.
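A minimal sketch of that running-hash idea follows. RunningHashMap and its member names are hypothetical, and only insert and erase are wrapped. It assumes XOR as the combining operation, because XOR is its own inverse: applying an element's hash a second time removes its contribution.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical wrapper that keeps an order-independent hash up to date.
template <typename Key, typename T>
class RunningHashMap {
public:
    // Inserts the pair if the key is new; returns false if the key exists.
    bool insert(const Key& key, const T& value)
    {
        auto result = map_.emplace(key, value);
        if (result.second)
            hash_ ^= element_hash(key, value); // add this element's contribution
        return result.second;
    }

    // Removes the key if present; returns false if it was absent.
    bool erase(const Key& key)
    {
        auto it = map_.find(key);
        if (it == map_.end())
            return false;
        hash_ ^= element_hash(it->first, it->second); // cancel its contribution
        map_.erase(it);
        return true;
    }

    std::size_t hash() const { return hash_; } // O(1), no pass over the elements
    std::size_t size() const { return map_.size(); }

private:
    // Asymmetric pair hash so that (a, b) and (b, a) differ in general.
    static std::size_t element_hash(const Key& k, const T& v)
    {
        const std::size_t h1 = std::hash<Key>()(k);
        const std::size_t h2 = std::hash<T>()(v);
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }

    std::unordered_map<Key, T> map_;
    std::size_t hash_ = 0;
};
```

The sketch deliberately exposes no mutable access to stored values; any operation that changes a value would have to XOR out the old element hash and XOR in the new one.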

Related

Hash value for a std::unordered_map

According to the standard there's no support for containers (let alone unordered ones) in the std::hash class. So I wonder how to implement that. What I have is:
std::unordered_map<std::wstring, std::wstring> _properties;
std::wstring _class;
I thought about iterating the entries, computing the individual hashes for keys and values (via std::hash<std::wstring>) and concatenate the results somehow.
What would be a good way to do that and does it matter if the order in the map is not defined?
Note: I don't want to use boost.
A simple XOR was suggested, so it would be like this:
size_t MyClass::GetHashCode()
{
std::hash<std::wstring> stringHash;
size_t mapHash = 0;
for (auto property : _properties)
mapHash ^= stringHash(property.first) ^ stringHash(property.second);
return ((_class.empty() ? 0 : stringHash(_class)) * 397) ^ mapHash;
}
?
I'm really unsure if that simple XOR is enough.
Response
If by enough, you mean whether or not your function is injective, the answer is No. The reasoning is that the set of all hash values your function can output has cardinality 2^64, while the space of your inputs is much larger. However, this is not really important, because you can't have an injective hash function given the nature of your inputs. A good hash function has these qualities:
It's not easily invertible. Given the output k, it's not computationally feasible within the lifetime of the universe to find m such that h(m) = k.
The range is uniformly distributed over the output space.
It's hard to find two inputs m and m' such that h(m) = h(m')
Of course, the extents of these really depend on whether you want something that's cryptographically secure, or you want to take some arbitrary chunk of data and just map it to some arbitrary 64-bit integer. If you want something cryptographically secure, writing it yourself is not a good idea. In that case, you'd also need the guarantee that the function is sensitive to small changes in the input. The std::hash function object is not required to be cryptographically secure. It exists for use cases isomorphic to hash tables. cppreference says:
For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
I'll show below how your current solution doesn't really guarantee this.
Collisions
I'll give you a few of my observations on a variant of your solution (I don't know what your _class member is).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= h(p.first) ^ h(p.second);
}
return result;
}
It's easy to generate collisions. Consider the following maps:
std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';
On my machine, compiling with g++ 4.9.1, this outputs:
1225586629984767119
1225586629984767119
Whether this matters depends on how often you expect to have maps where keys and values are reversed. More generally, because every key hash and every value hash is XORed into the same accumulator, any two maps whose keys and values together form the same collection of strings will collide, regardless of which string is a key and which is a value.
Order of Iteration
Two unordered_map instances having exactly the same key-value pairs will not necessarily have the same order of iteration. cppreference says:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2).
This is a basic requirement for any hash function. Your solution satisfies it because XOR is commutative, so the order of iteration doesn't affect the result.
A Possible Solution
If you don't need something that's cryptographically secure, you can modify your solution slightly to kill the symmetry. This approach is okay in practice for hash tables and the like. This solution is also independent of the fact that order in an unordered_map is undefined. It uses the same property your solution used (Commutativity of XOR).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
const std::size_t prime = 19937;
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= prime*h(p.first) + h(p.second);
}
return result;
}
All you need in a hash function in this case is a way to map a key-value pair to an arbitrary good hash value, and a way to combine the hashes of the key-value pairs using a commutative operation. That way, order does not matter. In the example hash_code I wrote, the key-value pair hash value is just a linear combination of the hash of the key and the hash of the value. You can construct something a bit more intricate, but there's no need for that.

C++ get index of element of array by value

So far, I have been storing the array in a vector and then looping through the vector to find the matching element and then returning the index.
Is there a faster way to do this in C++? The STL structure I use to store the array doesn't really matter to me (it doesn't have to be a vector). My array is also unique (no repeating elements) and ordered (e.g. a list of dates going forward in time).
Since the elements are sorted, you can use a binary search to find the matching element. The C++ Standard Library has a std::lower_bound algorithm that can be used for this purpose. I would recommend wrapping it in your own binary search algorithm, for clarity and simplicity:
/// Performs a binary search for an element
///
/// The range `[first, last)` must be ordered via `comparer`. If `value` is
/// found in the range, an iterator to the first element comparing equal to
/// `value` will be returned; if `value` is not found in the range, `last` is
/// returned.
template <typename RandomAccessIterator, typename Value, typename Comparer>
auto binary_search(RandomAccessIterator const first,
RandomAccessIterator const last,
Value const& value,
Comparer comparer) -> RandomAccessIterator
{
RandomAccessIterator it(std::lower_bound(first, last, value, comparer));
if (it == last || comparer(*it, value) || comparer(value, *it))
return last;
return it;
}
(The C++ Standard Library has a std::binary_search, but it returns a bool: true if the range contains the element, false otherwise. It's not useful if you want an iterator to the element.)
Once you have an iterator to the element, you can use the std::distance function to compute the index of the element in the range.
Both of these algorithms work equally well with any random access sequence, including both std::vector and ordinary arrays.
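Putting the two steps together, a minimal sketch for a sorted vector of int might look like this. The name index_of_sorted is hypothetical, it uses the default operator< ordering, and returning values.size() for a missing element is just one possible convention.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: index of `value` in a sorted vector, or values.size()
// if the value is absent.
std::size_t index_of_sorted(const std::vector<int>& values, int value)
{
    // lower_bound finds the first element that is not less than `value`.
    auto it = std::lower_bound(values.begin(), values.end(), value);
    if (it == values.end() || value < *it)
        return values.size(); // not found
    return static_cast<std::size_t>(std::distance(values.begin(), it));
}
```

Each lookup takes O(log N) comparisons, and std::distance is O(1) here because vector iterators are random access.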
If you want to associate a value with an index and find the index quickly you can use std::map or std::unordered_map. You can also combine these with other data structures (e.g. a std::list or std::vector) depending on the other operations you want to perform on the data.
For example, when creating the vector we also create a lookup table:
vector<int> test(test_size);
unordered_map<int, size_t> lookup;
int value = 0;
for(size_t index = 0; index < test_size; ++index)
{
test[index] = value;
lookup[value] = index;
value += rand()%100+1;
}
Now to look up the index you simply:
size_t index = lookup[find_value];
Using a hash table based data structure (e.g. the unordered_map) is a fairly classical space/time tradeoff and can outperform doing a binary search for this sort of "reverse" lookup operation when you need to do a lot of lookups. The other advantage is that it also works when the vector is unsorted.
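One caveat with the lookup[find_value] form: operator[] inserts a default-constructed entry (index 0) when the value is not in the table, which is indistinguishable from a real hit at index 0. A sketch of a lookup based on find() avoids that; lookup_index is a hypothetical helper name.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical helper: returns true and sets `index` if `value` is in the table.
// Unlike operator[], find() never inserts a new entry.
bool lookup_index(const std::unordered_map<int, std::size_t>& lookup,
                  int value, std::size_t& index)
{
    auto it = lookup.find(value);
    if (it == lookup.end())
        return false; // value was never stored; `index` is left untouched
    index = it->second;
    return true;
}
```

Taking the table by const reference also guarantees at compile time that the lookup cannot modify it.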
For fun :-) I've done a quick benchmark in VS2012RC comparing James' binary search code with a linear search and with using unordered_map for lookup, all on a vector:
Up to ~50000 elements the unordered_map significantly (3-4x) outperforms the binary search, which exhibits the expected O(log N) behavior. The somewhat surprising result is that unordered_map loses its O(1) behavior past 10000 elements, presumably due to hash collisions or perhaps an implementation issue.
EDIT: max_load_factor() for the unordered map is 1 so there should be no collisions. The difference in performance between the binary search and the hash table for very large vectors appears to be caching related and varies depending on the lookup pattern in the benchmark.
Choosing between std::map and std::unordered_map talks about the difference between ordered and unordered maps.

Inserting elements at desired positions in a STL map

map <int, string> rollCallRegister;
map <int, string> :: iterator rollCallRegisterIter;
map <int, string> :: iterator tempRollCallRegisterIter;
rollCallRegisterIter = rollCallRegister.begin ();
tempRollCallRegisterIter = rollCallRegister.insert (rollCallRegisterIter, pair <int, string> (55, "swati"));
rollCallRegisterIter++;
tempRollCallRegisterIter = rollCallRegister.insert (rollCallRegisterIter, pair <int, string> (44, "shweta"));
rollCallRegisterIter++;
tempRollCallRegisterIter = rollCallRegister.insert (rollCallRegisterIter, pair <int, string> (33, "sindhu"));
// Displaying contents of this map.
cout << "\n\nrollCallRegister contains:\n";
for (rollCallRegisterIter = rollCallRegister.begin(); rollCallRegisterIter != rollCallRegister.end(); ++rollCallRegisterIter)
{
cout << (*rollCallRegisterIter).first << " => " << (*rollCallRegisterIter).second << endl;
}
Output:
rollCallRegister contains:
33 => sindhu
44 => shweta
55 => swati
I have incremented the iterator. Why is it still getting sorted? And if the position is supposed to be changed by the map on its own, then what's the purpose of providing an iterator?
Because std::map is a sorted associative container.
In a map, the key value is generally used to uniquely identify the element, while the mapped value is some sort of value associated to this key.
According to here position parameter is
the position of the first element to be compared for the insertion
operation. Notice that this does not force the new element to be in
that position within the map container (elements in a set always
follow a specific ordering), but this is actually an indication of a
possible insertion position in the container that, if set to the
element that precedes the actual location where the element is
inserted, makes for a very efficient insertion operation. iterator is
a member type, defined as a bidirectional iterator type.
So the purpose of this parameter is mainly to speed up insertion slightly by telling the container where to start looking for the insertion point.
You can use std::vector<std::pair<int,std::string>> if the order of insertion is important.
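As a rough illustration of the vector alternative, the sketch below stores the same roll call in insertion order. The function name make_roll_call and the extra entry (22, "example") are made up for this example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper that builds the roll call in insertion order.
std::vector<std::pair<int, std::string>> make_roll_call()
{
    std::vector<std::pair<int, std::string>> roll_call;
    roll_call.emplace_back(55, "swati");
    roll_call.emplace_back(44, "shweta");
    roll_call.emplace_back(33, "sindhu");
    // On a vector, insert() really does place the element at the given position.
    roll_call.insert(roll_call.begin() + 1, std::make_pair(22, std::string("example")));
    return roll_call;
}
```

The price is that looking up a roll number becomes a linear scan unless the vector is kept sorted or a separate index is maintained.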
The interface is indeed slightly confusing, because it looks very much like std::vector<int>::insert (for example) and yet does not produce the same effect...
For associative containers, such as set, map and the new unordered_set and co, you completely relinquish the control over the order of the elements (as seen by iterating over the container). In exchange for this loss of control, you gain efficient look-up.
It would not make sense to suddenly give you control over the insertion, as it would let you break invariants of the container, and you would lose the efficient look-up that is the reason to use such containers in the first place.
And thus insert(It position, value_type&& value) does not insert at said position...
However this gives us some room for optimization: when inserting an element in an associative container, a look-up needs to be performed to locate where to insert this element. By letting you specify a hint, you are given an opportunity to help the container speed up the process.
This can be illustrated with a simple example: suppose that you receive elements already sorted by way of some interface, it would be wasteful not to use this information!
template <typename Key, typename Value, typename InputStream>
void insert(std::map<Key, Value>& m, InputStream& s) {
typename std::map<Key, Value>::iterator it = m.begin();
for (; s; ++s) {
// The hinted overload returns an iterator (not a pair), which becomes the next hint.
it = m.insert(it, *s);
}
}
Some of the items might not be well sorted, but it does not matter, if two consecutive items are in the right order, then we will gain, otherwise... we'll just perform as usual.
The map is always sorted, but you give a "hint" as to where the element may go as an optimisation.
The insertion is O(log N), but if the hint correctly identifies where the element goes, it is amortized constant time.
Thus if you are creating a large container of already-sorted values, then each value will get inserted at the end, although the tree will need rebalancing quite a few times.
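For the already-sorted case, a sketch using end() as the hint could look like the following. The name build_from_sorted is hypothetical, and the int/std::string element types are chosen only for illustration.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds a map from input that is expected to be sorted by key.
std::map<int, std::string> build_from_sorted(
    const std::vector<std::pair<int, std::string>>& sorted_items)
{
    std::map<int, std::string> m;
    for (const auto& item : sorted_items) {
        // With sorted input every new key belongs right before end(), so the hint
        // is correct. A wrong hint is still safe: the map simply falls back to a
        // normal O(log N) insertion and the result stays sorted.
        m.insert(m.end(), item);
    }
    return m;
}
```

The hint only affects speed, never the final position, so unsorted input produces the same map as an unhinted insert would.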
As sad_man says, it's associative. If you assign through operator[] with an existing key, you overwrite the previous value; insert with an existing key leaves the map unchanged.
Now the iterators are necessary because you don't know what the keys are, usually.

Is the unordered_map really unordered?

I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it different: Is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
bool operator()(const T& v1, const T& v2) const {
return hash<T>()(v1) < hash<T>()(v2);
}
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly, STL will complain here because there may be keys k1,k2 and neither k1 < k2 nor k2 < k1. You would need to use multimap and overwrite the equal-check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?
In answer to your edited question, no those two snippets are not equivalent at all. std::map stores nodes in a tree structure, unordered_map stores them in a hashtable*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets", where each bucket collects the keys whose hash values map to the same bucket index (typically the hash modulo the number of buckets). Basically, the implementation goes like this:
function add_value(object key, object value) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
buckets[bucket_index] = new linked_list();
}
buckets[bucket_index].add(new key_value(key, value));
}
function get_value(object key) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
return null;
}
foreach(key_value kv in buckets[bucket_index]) {
if (kv.key == key) {
return kv.value;
}
}
return null; // key not present in this bucket
}
Obviously that's a serious simplification and a real implementation would be much more advanced (for example, supporting resizing the buckets array, maybe using a tree structure instead of a linked list for the buckets, and so on), but that should give an idea of how you can't get back the values in any particular order. See Wikipedia for more information.
* Technically, the internal implementations of std::map and unordered_map are left to the library implementer, but the standard requires certain Big-O complexities for their operations, which in practice imply those internal structures.
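The pseudocode above translates to C++ roughly as in the sketch below. ToyHashMap and its members are hypothetical names, and the table is fixed to string keys and int values. It is a toy with a fixed, non-zero bucket count and no resizing, not how any particular standard library implements unordered_map. Unlike the pseudocode, add_value overwrites the value when the key already exists.

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// A toy chained hash table (fixed bucket count, no resizing), for illustration only.
class ToyHashMap {
public:
    // num_buckets must be greater than zero.
    explicit ToyHashMap(std::size_t num_buckets) : buckets_(num_buckets) {}

    void add_value(const std::string& key, int value)
    {
        auto& bucket = buckets_[bucket_index(key)];
        for (auto& kv : bucket) {
            if (kv.first == key) { kv.second = value; return; } // overwrite
        }
        bucket.emplace_back(key, value);
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* get_value(const std::string& key) const
    {
        for (const auto& kv : buckets_[bucket_index(key)]) {
            if (kv.first == key) return &kv.second;
        }
        return nullptr;
    }

    // Visits buckets in index order, which is the only "order" the table has.
    std::vector<std::string> keys_in_iteration_order() const
    {
        std::vector<std::string> keys;
        for (const auto& bucket : buckets_)
            for (const auto& kv : bucket)
                keys.push_back(kv.first);
        return keys;
    }

private:
    std::size_t bucket_index(const std::string& key) const
    {
        return std::hash<std::string>()(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

The order returned by keys_in_iteration_order depends on the hash function and on the bucket count, so constructing the table with a different num_buckets can reorder the same set of keys.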
"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.
As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.
If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.
You are right, unordered_map is actually hash ordered. Note that most current implementations (pre TR1) call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, so this means that the order is not so unordered...
Now, what does it mean that it is hash ordered? As a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it was renamed in TR1: the old name suggested an order. Now we know that an order is actually used, but you can disregard it as it is unpredictable.

How to get the first n elements of a std::map

Since there is no .resize() member function in C++ std::map I was wondering, how one can get a std::map with at most n elements.
The obvious solution is to loop from 0 to n to reach the nth iterator and pass it as the first argument to the map's erase() member.
I was wondering if there is any solution that does not need the loop (at least not in my user code) and is more "the STL way to go".
You can use std::advance( iter, numberofsteps ) for that.
Here is a universal solution for almost any container, such as std::list, std::map, or boost::multi_index. The only thing you must check is that n does not exceed the size of the container.
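#include <algorithm>  // std::min
#include <cstddef>    // std::size_t
#include <iterator>   // std::advance

// Returns a copy of `it` advanced by n steps.
template<class It>
It myadvance(It it, std::size_t n) {
    std::advance(it, n);
    return it;
}
// Erases everything after the first n elements (no-op if the container is smaller).
template<class Cont>
void resize_container(Cont & cont, std::size_t n) {
    cont.erase(myadvance(cont.begin(), std::min(n, cont.size())),
               cont.end());
}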
The correct way for this is to use std::advance. But here is a funny (slow) way allowing to 'use resize on map'. More generally, this kind of trick can be used for other things working on vector but not on map.
map<K,V> m; //your map
vector< pair<K,V> > v(m.begin(), m.end());
if (n < v.size()) v.resize(n); // only shrink; growing would add default-constructed pairs
m = map<K,V>(v.begin(),v.end());
Why would you want to resize a map?
The elements in a map aren't stored in any order - the first 'n' doesn't really mean anything
edit:
Interestingly std::map does have an order, not sure how useful this concept is.
Are the entries in the same sort order as the keys?
What does that mean? If you have Names keyed by SSN does that mean the names are stored in SSN numeric order?
A std::map is not a list. There are no "first n" elements.
BTW: erasing from a std::map invalidates only the iterators (and references) to the erased elements; all other iterators stay valid.
If you really need a smaller map you could iterate through it and add all elements up to the n-th into a new map.
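Copying into a new map does not need a hand-written loop either. The sketch below relies on the fact that std::map iterates in ascending key order, so "the first n" means the n smallest keys; first_n is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical helper: returns a new map holding the n smallest keys of `m`
// (or a full copy if m has fewer than n elements).
template <typename K, typename V>
std::map<K, V> first_n(const std::map<K, V>& m, std::size_t n)
{
    const std::size_t count = std::min(n, m.size());
    // std::next walks `count` steps from begin(); the range constructor copies
    // exactly that prefix.
    return std::map<K, V>(m.begin(),
                          std::next(m.begin(), static_cast<std::ptrdiff_t>(count)));
}
```

Clamping with std::min keeps std::next from walking past end() when n is larger than the map.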