Hash value for a std::unordered_map - c++

According to the standard there's no support for containers (let alone unordered ones) in the std::hash class. So I wonder how to implement that. What I have is:
std::unordered_map<std::wstring, std::wstring> _properties;
std::wstring _class;
I thought about iterating over the entries, computing the individual hashes for keys and values (via std::hash<std::wstring>), and combining the results somehow.
What would be a good way to do that and does it matter if the order in the map is not defined?
Note: I don't want to use boost.
A simple XOR was suggested, so it would be like this:
size_t MyClass::GetHashCode()
{
std::hash<std::wstring> stringHash;
size_t mapHash = 0;
for (auto property : _properties)
mapHash ^= stringHash(property.first) ^ stringHash(property.second);
return ((_class.empty() ? 0 : stringHash(_class)) * 397) ^ mapHash;
}
?
I'm really unsure if that simple XOR is enough.

Response
If by enough, you mean whether or not your function is injective, the answer is No. The reasoning is that the set of all hash values your function can output has cardinality 2^64, while the space of your inputs is much larger. However, this is not really important, because you can't have an injective hash function given the nature of your inputs. A good hash function has these qualities:
It's not easily invertible. Given the output k, it's not computationally feasible within the lifetime of the universe to find m such that h(m) = k.
The range is uniformly distributed over the output space.
It's hard to find two inputs m and m' such that h(m) = h(m')
Of course, the extent to which these matter depends on whether you want something that's cryptographically secure, or you just want to take some arbitrary chunk of data and map it to some arbitrary 64-bit integer. If you want something cryptographically secure, writing it yourself is not a good idea. In that case, you'd also need the guarantee that the function is sensitive to small changes in the input. The std::hash function object is not required to be cryptographically secure. It exists for use cases such as hash tables. cppreference says:
For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
I'll show below how your current solution doesn't really guarantee this.
Collisions
I'll give you a few of my observations on a variant of your solution (I don't know what your _class member is).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= h(p.first) ^ h(p.second);
}
return result;
}
It's easy to generate collisions. Consider the following maps:
std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';
On my machine, compiling with g++ 4.9.1, this outputs:
1225586629984767119
1225586629984767119
Whether this matters depends on how often you expect to have maps whose keys and values are swapped. More generally, because XOR is commutative and key hashes and value hashes are combined in the same way, two maps collide whenever the combined collection of all their key and value strings is the same, regardless of how those strings are paired up.
Order of Iteration
Two unordered_map instances having exactly the same key-value pairs will not necessarily have the same order of iteration. cppreference says:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2).
This is a trivial requirement for a hash function. Your solution satisfies it: since XOR is commutative, the order of iteration doesn't affect the result.
A Possible Solution
If you don't need something that's cryptographically secure, you can modify your solution slightly to kill the symmetry. This approach is okay in practice for hash tables and the like. This solution is also independent of the fact that order in an unordered_map is undefined. It uses the same property your solution used (Commutativity of XOR).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
const std::size_t prime = 19937;
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= prime*h(p.first) + h(p.second);
}
return result;
}
All you need in a hash function in this case is a way to map a key-value pair to an arbitrary good hash value, and a way to combine the hashes of the key-value pairs using a commutative operation. That way, order does not matter. In the example hash_code I wrote, the key-value pair hash value is just a linear combination of the hash of the key and the hash of the value. You can construct something a bit more intricate, but there's no need for that.
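If you do want something "a bit more intricate", here is a minimal sketch of the same idea. It is only an illustration: the names mix_into and hash_entry are made up for this example, the mixing formula is the Boost-style one, and I sum the per-entry hashes instead of XORing them (both operations are commutative, so either keeps the result independent of iteration order).

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical helper: folds one more hash value into seed (Boost-style formula).
inline void mix_into(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash of a single key-value pair. The key and the value are mixed in
// sequence, so swapping them generally yields a different result.
inline std::size_t hash_entry(const std::string& key, const std::string& value) {
    std::hash<std::string> h;
    std::size_t seed = 0;
    mix_into(seed, h(key));
    mix_into(seed, h(value));
    return seed;
}

// Order-independent hash of the whole map: per-entry hashes are summed,
// and unsigned addition is commutative, so iteration order cannot matter.
inline std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
    std::size_t result = 0;
    for (const auto& p : m) {
        result += hash_entry(p.first, p.second);
    }
    return result;
}
```

Two maps holding the same pairs hash to the same value no matter how they were filled, and an empty map hashes to 0.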

Related

Faster map operations in C++

In my code I'm storing parameters sets for interactions between different groups in a map. Currently at startup I add each structure (testvals in the code below) with the key created from joining the two group names into a single string.
string nKey = key1;
nKey += JOIN_STRING;
nKey += key2;
map<string, testvals> mymap_string;
mymap_string.insert( make_pair(nKey, testval ));
When it comes to looking up the data for two groups, I'm again creating that string and then using find on the map to retrieve my data.
string nKey = key1;
nKey += JOIN_STRING;
nKey += key2;
auto it = mymap_string.find( nKey );
if ( it != mymap_string.end() )
{
testvals vals = it->second;
}
In my code I'm creating the map once at startup but doing the lookup part millions of times. I'm wondering if there's a better way of doing this as string concatenation seems to be relatively expensive and find may not be the fastest way to search and compare strings?
My testing seems to show that strings are faster than using std::pair<string1, string2> as the key for the map. I've looked at map vs unordered_map but there doesn't seem to be much of a difference. unordered_map may be slightly faster when the number of keys is large.
Does anyone have any suggestion on what might be a better, quicker approach? Given the number of calls made to this, if I can make it significantly quicker I can save a lot of time. I don't mind if the insertion or setup isn't blindingly fast since it only happens once, but lookup is important. It would be better to use something standard that works on Windows and Linux.
Update:
OK so from the questions it seems that more background information is required.
testvals is a structure of doubles holding the input parameters for the current model being used, and the number of variables in it varies with the model. Typically this is between 4 and 10 values. A typical set is shown here:
typedef struct
{
double m_temp_min;
double m_temp_max;
double m_liquid_content;
double m_growth_rate;
double m_alpha;
double m_beta;
} testvals;
Key1 and Key2 are always strings that are passed from the programs core module, but the strings are user-defined, meaning they could be anything from "a" to "my_big_yellow_submarine_3".
The number of keys in the map will depend on the number of groups in the data. If there are only two groups for which interactions parameters need to be provided, then the map would only have 4 unique string keys: group1~~group1, group1~~group2, group2~~group1 and group2~~group2. Normally there are 3 or 4 group types in the map so the number of keys is usually in the number of tens. This size may be why I don't see much of a difference in map and unordered_map performance.
One of the comments mentioned std::pair<std::string,std::string> and as I originally said, the cost of calling make_pair() seems much higher than the cost of making the string and was more than 50% slower when I tested it. But I didn't try the combination of std::pair with unordered_map. I assumed that if std::pair is slower with map, it is also going to be slower with unordered_map. Is there a reason to expect it to be very different?
I hope this helps clarify some of the things.
You have only a limited number of keys, which makes calculating the hash expensive compared to the actual lookup. That's why std::map and std::unordered_map aren't much different in your case. Besides, JOIN_STRING also introduces unnecessary operations while computing the hash or comparing strings.
I suggest you avoid those group names altogether and use group IDs instead. With N group types you only have N² different types of interactions, and the IDs belong to the range [0, N). If N is known at compile time you can even make it an array. So instead of
string nKey = key1;
nKey += JOIN_STRING;
nKey += key2;
you'll use
std::vector<testvals> vals(N*N); // vector with N² elements
uint32_t nKey = key1*N + key2; // index of the <key1, key2> mapping
const auto &val = vals[nKey]; // get the mapped value
You should use & to get a reference instead of a copy. You can also use a map keyed by the integer instead of a vector. It's slower than a vector but still much faster than a map keyed by strings. You can calculate the mapped key as above, or use a mapping like nKey = (key1 << 16) ^ key2 or nKey = ((uint64_t)key1 << 32) | key2.
Group names are only used when you convert the names to ID at the beginning, or when you want to print them out. You can use some struct like this to store the name
struct GroupInfo
{
std::string groupName;
uint32_t groupID;
};
No need to use typedef for structs in C++ like in your code. You can also use std::vector<std::string> or std::map<uint32_t, std::string> to map from ID to name. The ID can be a smaller type like uint8_t or uint16_t
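Putting the pieces together, a rough sketch could look like the following. The InteractionTable class and its member names are invented for this example, and testvals is cut down to two fields; it assumes all group names are known when the table is built.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Trimmed-down stand-in for the parameter struct from the question.
struct testvals {
    double m_temp_min = 0.0;
    double m_temp_max = 0.0;
};

// Hypothetical wrapper: names are converted to IDs once, lookups use a flat N*N table.
class InteractionTable {
public:
    explicit InteractionTable(const std::vector<std::string>& groupNames)
        : n_(groupNames.size()), vals_(groupNames.size() * groupNames.size()) {
        for (std::uint32_t id = 0; id < groupNames.size(); ++id) {
            ids_[groupNames[id]] = id;  // IDs are assigned in the range [0, N)
        }
    }

    // Slow path, meant to be called once per name; throws std::out_of_range if unknown.
    std::uint32_t idOf(const std::string& name) const { return ids_.at(name); }

    // Fast path: plain index arithmetic, no string concatenation or comparison.
    testvals& at(std::uint32_t key1, std::uint32_t key2) {
        return vals_[key1 * n_ + key2];
    }

private:
    std::size_t n_;
    std::vector<testvals> vals_;
    std::unordered_map<std::string, std::uint32_t> ids_;
};
```

The core module would call idOf once per group and keep the IDs around, so the millions of lookups only ever go through at(key1, key2).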

Good hash function over C++ unordered_set

I'm looking to implement a hash function over a C++ std::unordered_set<char>. I initially tried using boost::hash_range:
namespace std
{
template<> struct hash<unordered_set<char> >
{
size_t operator()(const unordered_set<char> &s) const
{
return boost::hash_range(begin(s), end(s));
}
};
}
But then I realised that because the set is unordered, the iteration order isn't stable, and the hash function is thus wrong. What are some better options for me? I guess I could std::set instead of std::unordered_set, but using an ordered set just because it's easier to hash seems ... wrong.
A very similar question, albeit in C#, was asked here:
Hash function on list independant of order of items in it
Over there, Per gave a nice language-independent answer that should put you on the right track. In short, for the input
x1, …, xn
you should map it to
f(x1) op … op f(xn)
where
f is a good hash function for single elements (integer in your case)
op is a commutative operator, such as xor or plus
Hashing an integer may seem pointless at first, but the goal is to make neighboring integers hash to dissimilar values, so that combining them with op does not produce the same result for different inputs. E.g. if you use + as the operator, you want f(1)+f(2) to give a different result than f(0)+f(3).
If standard hashing functions are not good candidates for f and you cannot find one, check the linked answer for more details...
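As a rough sketch of the f and op scheme above: the names mix_element, CharSetHash and naive_sum are made up for this example, and the constants in mix_element are borrowed from the splitmix64 generator as one possible choice of f. Any well-mixing function would do.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// f: scrambles a single element so that neighbouring values end up far apart.
// Constants borrowed from splitmix64 (an assumption, not the only option).
inline std::uint64_t mix_element(char c) {
    std::uint64_t x = static_cast<unsigned char>(c) + 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// op: unsigned addition, which is commutative, so iteration order is irrelevant.
struct CharSetHash {
    std::size_t operator()(const std::unordered_set<char>& s) const {
        std::uint64_t sum = 0;
        for (char e : s) sum += mix_element(e);
        return static_cast<std::size_t>(sum);
    }
};

// For contrast: without f, {1, 2} and {0, 3} collide because 1 + 2 == 0 + 3.
inline std::size_t naive_sum(const std::unordered_set<char>& s) {
    std::size_t sum = 0;
    for (char e : s) sum += static_cast<unsigned char>(e);
    return sum;
}
```

Passing the functor as a template argument, e.g. std::unordered_set<std::unordered_set<char>, CharSetHash>, avoids having to specialize anything in namespace std.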
You could try simply adding which is independent of order and returning the hash of that:
namespace std
{
template<> struct hash<unordered_set<char> >
{
size_t operator()(const unordered_set<char> &s) const {
long long sum{0};
for ( auto e : s )
sum += e; // add each element, not the set itself
return std::hash<long long>()(sum);
}
};
}

Hashing a user defined type for use in unordered map

Say I have a user defined type
struct Key
{
short a;
int b,c,d;
};
And I would like to use this as a key in an unordered map. What's a good (efficient) hashing technique, given that I might need to do a lot of reads?
Is there something using hash_combine or hash_append that I should be doing?
The safest path is probably to reuse standard hashing for your atomic types and combine them as you suggested. AFAIK there are no hash combination routines in the standard, but Boost does provide one:
#include <boost/functional/hash.hpp>
#include <functional>
namespace std
{
template<>
struct hash<Key>
{
public:
std::size_t
operator()(Key const& k) const
{
size_t hash = 0;
boost::hash_combine(hash, std::hash<short>()(k.a));
boost::hash_combine(hash, std::hash<int>()(k.b));
boost::hash_combine(hash, std::hash<int>()(k.c));
boost::hash_combine(hash, std::hash<int>()(k.d));
return hash;
}
};
}
If depending on Boost is not an option, their hash combination routine is small enough to be reasonably and shamelessly stolen:
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
If your four integral values are purely random (i.e. they can take any value in their range with equal probability), this is probably very close to optimal. If your values are more specific (one has only three possible values, for instance, or they are correlated) you could do slightly better. However, this will perform "well" in any circumstance.
Anyway, I don't think you should be too worried unless you're doing something extremely specific, or at least until actual performance issues arise. There will still be time to change the hashing algorithm then, with no other impact.
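For completeness, here is a sketch of how the Boost-free hash_combine could be wired up for Key. The functor name KeyHash and the alias KeyMap are made up for this example, and I pass the functor as a template argument instead of specializing std::hash. Note that unordered_map also needs an equality comparison for the key.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>

struct Key {
    short a;
    int b, c, d;
};

// unordered_map needs equality on the key type in addition to a hash.
inline bool operator==(const Key& l, const Key& r) {
    return l.a == r.a && l.b == r.b && l.c == r.c && l.d == r.d;
}

template <class T>
inline void hash_combine(std::size_t& seed, const T& v) {
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical functor name; hash_combine already applies std::hash to each field.
struct KeyHash {
    std::size_t operator()(const Key& k) const {
        std::size_t seed = 0;
        hash_combine(seed, k.a);
        hash_combine(seed, k.b);
        hash_combine(seed, k.c);
        hash_combine(seed, k.d);
        return seed;
    }
};

using KeyMap = std::unordered_map<Key, int, KeyHash>;
```

Since hash_combine hashes its argument itself, the raw fields are passed in directly rather than pre-hashed values.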
The main issue is that you need to reduce the number of equal hash values for different keys as much as possible. So depending on the actual values you can use different approaches (starting with a simple xor, up to using a CRC).
So critical factors are:
- range of values
- typical values that actually occur
- number of elements in the map
If you use a "simple" approach: Be sure to actually check the content of your map to ensure that the items are equally distributed over all different buckets.
If you use a "complex" approach: Be sure to check it doesn't have a too big performance impact (usually not a problem. But if it is, you may want to "cache" the hash value...)

Efficient way to hash a 2D point

OK, so the task is this: I am given (x, y) coordinates of points, with both x and y ranging from -10^6 to 10^6 inclusive. I have to check whether a particular point, i.e. an (x, y) tuple, was given to me or not. In simple words, how do I answer the query of whether a particular 2D point is set or not? So far the best I could think of is maintaining a std::map<std::pair<int,int>, bool> and marking a point as 1 whenever it is given. Although this should run in logarithmic time and is a fairly optimized way to answer the query, I am wondering if there's a better way to do this.
Also, I would be glad if anyone could tell me what the actual complexity would be if I used the above data structure as a hash. I mean, is the complexity of std::map going to be O(log N) in the number of elements present, irrespective of the structure of the key?
In order to use a hash map you need to be using std::unordered_map instead of std::map. The constraint of using this is that your key type needs to have a hash function defined for it, as described in this answer. Either that or just use boost::hash for this:
std::unordered_map<std::pair<int, int>, bool, boost::hash<std::pair<int, int> > > map_of_pairs;
Another method which springs to mind is to store the 32 bit int values in a 64 bit integer like so:
uint64_t i64;
uint32_t a32, b32;
i64 = ((uint64_t)a32 << 32) | b32;
As described in this answer. The x and y components can be stored in the high and low 32 bits of the integer, and then you can use a std::unordered_map<uint64_t, bool>. I'd be interested to know whether this is any more efficient than the previous method, or whether it even produces different code.
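A small sketch of the packing approach for signed coordinates follows. The names pack_point and PointSet are made up for this example, and it stores the packed values in an unordered_set rather than mapping them to a bool.

```cpp
#include <cstdint>
#include <unordered_set>

// Packs x into the high 32 bits and y into the low 32 bits.
// The casts to uint32_t keep a negative coordinate from sign-extending
// into the other half of the 64-bit value.
inline std::uint64_t pack_point(std::int32_t x, std::int32_t y) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(x)) << 32)
           | static_cast<std::uint32_t>(y);
}

// Hypothetical wrapper: a membership test on the packed 64-bit keys.
struct PointSet {
    std::unordered_set<std::uint64_t> packed;

    void add(std::int32_t x, std::int32_t y) { packed.insert(pack_point(x, y)); }

    bool contains(std::int32_t x, std::int32_t y) const {
        return packed.count(pack_point(x, y)) != 0;
    }
};
```

Because the packing is one-to-one, (x, y) and (y, x) always produce different keys, and std::hash<uint64_t> is already provided by the standard library.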
Instead of mapping each point to a bool, why not store all the points given to you in a set? Then, you can simply search the set to see if it contains the point you are looking for. It is essentially the same as what you are doing without having to do an additional lookup of the associated bool. For example:
set<pair<int, int>> points;
Then, you can check whether the set contains a certain point or not like this :
pair<int, int> examplePoint = make_pair(0, 0);
set<pair<int, int>>::iterator it = points.find(examplePoint);
if (it == points.end()) {
// examplePoint not found
} else {
// examplePoint found
}
As mentioned, std::set is normally implemented as a balanced binary search tree, so each lookup would take O(log n) time.
If you wanted to use a hash table instead, you could do the same thing using std::unordered_set instead of std::set. Assuming you use a good hash function, this would speed your lookups up to O(1) time. However, in order to do this, you will have to define the hash function for pair<int, int>. Here is an example taken from this answer:
namespace std {
template <> struct hash<std::pair<int, int>> {
inline size_t operator()(const std::pair<int, int> &v) const {
std::hash<int> int_hasher;
return int_hasher(v.first) ^ int_hasher(v.second);
}
};
}
Edit: Nevermind, I see you already got it working!

Is the unordered_map really unordered?

I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it different: Is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
bool operator()(const T& v1, const T& v2) const {
return hash<T>()(v1) < hash<T>()(v2);
}
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly, STL will complain here because there may be keys k1,k2 and neither k1 < k2 nor k2 < k1. You would need to use multimap and overwrite the equal-check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?
In answer to your edited question, no those two snippets are not equivalent at all. std::map stores nodes in a tree structure, unordered_map stores them in a hashtable*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets" where each bucket corresponds to a range of hash values. Basically, the implementation goes like this:
function add_value(object key, object value) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
buckets[bucket_index] = new linked_list();
}
buckets[bucket_index].add(new key_value(key, value));
}
function get_value(object key) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
return null;
}
foreach(key_value kv in buckets[bucket_index]) {
if (kv.key == key) {
return kv.value;
}
}
return null; // key not found in its bucket
}
Obviously that's a serious simplification and real implementation would be much more advanced (for example, supporting resizing the buckets array, maybe using a tree structure instead of linked list for the buckets, and so on), but that should give an idea of how you can't get back the values in any particular order. See wikipedia for more information.
* Technically, the internal implementation of std::map and unordered_map are implementation-defined, but the standard requires certain Big-O complexity for operations that implies those internal implementations
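You can observe the bucket structure from C++ itself, since std::unordered_map exposes a bucket interface (bucket, bucket_count and the per-bucket begin/end overloads). A minimal sketch, with found_in_own_bucket being a made-up helper name:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Returns true if key is found by walking only the bucket that the
// container itself reports for that key.
inline bool found_in_own_bucket(const std::unordered_map<std::string, int>& m,
                                const std::string& key) {
    if (m.bucket_count() == 0) return false;
    const std::size_t b = m.bucket(key);  // index of the bucket this key maps to
    for (auto it = m.begin(b); it != m.end(b); ++it) {
        if (it->first == key) return true;
    }
    return false;
}
```

How the bucket index is derived from the hash value is up to the implementation, which is one more reason not to rely on the iteration order.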
"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.
As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.
If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.
You are right, unordered_map is actually hash ordered. Note that most current implementations (pre TR1) call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, so this means that the order is not so unordered...
Now, what does it mean that it is hash ordered? As a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it was renamed in TR1: the old name suggested an order. Now we know that an order is actually used, but you can disregard it as it is unpredictable.