I tried to implement an unordered map keyed on a class called Pair, which stores an integer and a bitset. Then I found out that there is no hash function for this class.
So I want to write my own hash function. Instead of XOR or comparable functions, I would like to use the following approach:
The bitsets in my class have a fixed size, so I want to do the following:
Example: for an instance of Pair with bitset<6> = 101101 and the integer 6:
create a string = "1011016"
and then use the default hash function on this string.
Because the bitsets have a fixed size, each key would map to a unique string.
How could I implement this approach?
Thank you in advance.
To expand on a comment, as requested:
Converting to string and then hashing that string would be somewhat slow. At least slower than it needs to be. A faster approach would be to combine the bit patterns, e.g. like this:
struct Pair
{
std::bitset<6> bits;
int intval;
};
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
return rtrn;
}
};
This works on two assumptions:
The upper bits of the integer are generally not interesting
The size of the bitset is always small compared to size_t
I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.
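To make the snippet above usable as a map key, something like the following sketch should work. The operator== and the lookup_demo helper are my own additions for illustration (unordered_map needs an equality comparison in addition to the hash), and the specialization is written inside namespace std so that older compilers accept it too.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <unordered_map>

struct Pair
{
    std::bitset<6> bits;
    int intval;
};

// unordered_map also needs equality on the key type (assumed member-wise here)
inline bool operator==(const Pair& a, const Pair& b) noexcept
{
    return a.bits == b.bits && a.intval == b.intval;
}

namespace std
{
template<>
struct hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::size_t rtrn = static_cast<std::size_t>(pair.intval);
        // make room for the bitset, then drop its bits into the gap
        rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
        return rtrn;
    }
};
}

// hypothetical helper: build a small map and look one key up again
inline int lookup_demo()
{
    std::unordered_map<Pair, int> m;
    m[Pair{std::bitset<6>("101101"), 6}] = 42;
    m[Pair{std::bitset<6>("000001"), 6}] = 7;
    return m.at(Pair{std::bitset<6>("101101"), 6});
}
```

For the example from the question (bits 101101, integer 6) the hash is simply (6 << 6) | 45, so distinct keys with small non-negative integers get distinct hash values.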
Possible improvements
We can preserve the upper bits by rotating instead of shifting.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
// number of bits in std::size_t, so that this is a true rotation (requires <limits>)
std::size_t hashdigits = std::numeric_limits<std::size_t>::digits;
std::size_t bitdigits = pair.bits.size();
// can be simplified to std::rotl(rtrn, bitdigits) in C++20
rtrn = (rtrn << bitdigits) | (rtrn >> (hashdigits - bitdigits));
rtrn ^= pair.bits.to_ulong();
return rtrn;
}
};
Nothing will change for small integers (except some bitflips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...
If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't simply XOR them, because that invites hash collisions on small integers.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::hash<decltype(pair.intval)> ihash;
std::hash<decltype(pair.bits)> bhash;
return ihash(pair.intval) * 31 + bhash(pair.bits);
}
};
I picked the simple polynomial hash approach common in Java (it is what String.hashCode uses). As far as I know, GCC's libstdc++ does not use it for string hashing; it hashes the bytes with a Murmur-style function instead. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen because it is a prime number that is one less than a power of two, so the multiplication can be computed quickly as (x << 5) - x.
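A small sketch of the shift trick, in case it is not obvious. The helper names times31 and combine31 are made up for this example.

```cpp
#include <cstddef>

// x * 31 written as a shift and a subtraction; unsigned arithmetic wraps on overflow
inline std::size_t times31(std::size_t x)
{
    return (x << 5) - x;
}

// hypothetical helper: combine two already-computed hashes as in the snippet above
inline std::size_t combine31(std::size_t first, std::size_t second)
{
    return times31(first) + second;
}
```

Compilers usually perform this strength reduction on their own, so writing `* 31` in the source is fine.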
Related
I have to write a hash function, so that I can place an std::pair<int,std::string> in an unordered_set.
Regarding the input:
The strings that will be hashed are very small (1-3 letters in length).
Likewise, the integers will be unsigned numbers which are small (much smaller than the limit of unsigned int).
Does it make sense to use the hash of the string (as a number), and just use Cantor's enumeration of pairs to generate a "new" hash?
Since the "built-in" hash function for std::string should be a decent hash function...
struct intStringHash{
public:
inline std::size_t operator()(const std::pair<int,std::string>&c)const{
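std::size_t x = static_cast<std::size_t>(c.first);
const std::string& s = c.second;
std::hash<std::string> stringHash;
std::size_t y = stringHash(s); // keep the full hash; unsigned arithmetic wraps instead of overflowing
return ((x+y)*(x+y+1)/2 + y); // Cantor's enumeration of pairs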
}
};
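For reference, Cantor's pairing function on its own can be sketched as below, using unsigned 64-bit arithmetic (the name cantor_pair is mine). It is only injective while the intermediate product fits into 64 bits. With a full-width string hash as y it will wrap, so the uniqueness argument no longer holds, although the result is still a valid hash value.

```cpp
#include <cstdint>

// Cantor's pairing function on unsigned 64-bit values.
// Unique only while (x + y) * (x + y + 1) fits in 64 bits; beyond that it wraps.
inline std::uint64_t cantor_pair(std::uint64_t x, std::uint64_t y)
{
    const std::uint64_t s = x + y;
    return s * (s + 1) / 2 + y;
}
```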
boost::hash_combine is an easy way to create hashes. Even if you can't use Boost, the function is quite simple, so it's trivial to copy the implementation.
Usage sample:
struct intStringHash
{
public:
std::size_t operator()(const std::pair<int, std::string>& c) const
{
std::size_t hash = 0;
hash_combine(hash, c.first);
hash_combine(hash, c.second);
return hash;
}
};
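If Boost is not available, a stand-in along these lines should do. This is a sketch that uses the classic mixing formula quoted elsewhere in this thread together with std::hash; it is not Boost's actual code, and the IntStringSet alias is just for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Classic hash_combine mixing step, built on std::hash instead of boost::hash
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct intStringHash
{
    std::size_t operator()(const std::pair<int, std::string>& c) const
    {
        std::size_t hash = 0;
        hash_combine(hash, c.first);
        hash_combine(hash, c.second);
        return hash;
    }
};

// hypothetical alias showing where the hasher goes
using IntStringSet = std::unordered_set<std::pair<int, std::string>, intStringHash>;
```

std::pair already provides operator==, so nothing else is needed to use the set.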
Yes, you would generate a hash for each member whose type has a hash function.
A common way to combine the hashes is to XOR them:
int hash1;
int hash2;
int combined = hash1 ^ hash2;
Discussion:
Let's say I have a struct/class with an arbitrary number of attributes that I want to use as key to a std::unordered_map e.g.,:
struct Foo {
int i;
double d;
char c;
bool b;
};
I know that I have to define a hasher-functor for it e.g.,:
struct FooHasher {
std::size_t operator()(Foo const &foo) const;
};
And then define my std::unordered_map as:
std::unordered_map<Foo, MyValueType, FooHasher> myMap;
What bothers me though, is how to define the call operator for FooHasher. One way to do it, that I also tend to prefer, is with std::hash. However, there are numerous variations e.g.,:
std::size_t operator()(Foo const &foo) const {
return std::hash<int>()(foo.i) ^
std::hash<double>()(foo.d) ^
std::hash<char>()(foo.c) ^
std::hash<bool>()(foo.b);
}
I've also seen the following scheme:
std::size_t operator()(Foo const &foo) const {
return std::hash<int>()(foo.i) ^
(std::hash<double>()(foo.d) << 1) ^
(std::hash<char>()(foo.c) >> 1) ^
(std::hash<bool>()(foo.b) << 1);
}
I've seen also some people adding the golden ratio:
std::size_t operator()(Foo const &foo) const {
return (std::hash<int>()(foo.i) + 0x9e3779b9) ^
(std::hash<double>()(foo.d) + 0x9e3779b9) ^
(std::hash<char>()(foo.c) + 0x9e3779b9) ^
(std::hash<bool>()(foo.b) + 0x9e3779b9);
}
Questions:
What are they trying to achieve by adding the golden ratio or shifting bits in the result of std::hash?
Is there an "official scheme" to std::hash an object with arbitrary number of attributes of fundamental type?
A simple xor is symmetric and behaves badly when fed the "same" value multiple times (hash(a) ^ hash(a) is zero).
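The two weaknesses are easy to see with a two-field key. Point, xor_hash and combined_hash below are hypothetical names for this sketch; the second function uses the hash_combine style mixing step instead of a bare XOR.

```cpp
#include <cstddef>
#include <functional>

struct Point { int x; int y; };  // hypothetical two-field key

// Plain XOR of the member hashes
inline std::size_t xor_hash(const Point& p)
{
    return std::hash<int>()(p.x) ^ std::hash<int>()(p.y);
}

// Order-sensitive alternative using the hash_combine mixing step
inline std::size_t combined_hash(const Point& p)
{
    std::size_t seed = 0;
    seed ^= std::hash<int>()(p.x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    seed ^= std::hash<int>()(p.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}
```

With xor_hash, every point on the diagonal (x == y) hashes to 0, and (1, 2) always collides with (2, 1). The shifts and the added constant are there to break exactly that symmetry.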
This is the question of combining hashes. boost has a hash_combine that is pretty decent. Write a hash combiner, and use it.
There is no "official scheme" to solve this problem.
Myself, I typically write a super-hasher that can take anything and hash it. It hash combines tuples and pairs and collections automatically, where it first hashes the count of elements in the collection, then the elements.
It finds hash(t) via ADL first, and if that fails checks if it has a manually written hash in a helper namespace (used for std containers and types), and if that fails does a std::hash<T>{}(t).
Then my hash for Foo support looks like:
struct Foo {
int i;
double d;
char c;
bool b;
friend auto mytie(Foo const& f) {
return std::tie(f.i, f.d, f.c, f.b);
}
friend std::size_t hash(Foo const& f) {
return hasher::hash(mytie(f));
}
};
where I use mytie to move Foo into a tuple, then use the std::tuple overload of hasher::hash to get the result.
I like the idea of hashes of structurally similar types having the same hash. This lets me act as if my hash is transparent in some cases.
Note that hashing unordered meows in this manner is a bad idea, as an asymmetric hash of an unordered meow may generate spurious misses.
(Meow is the generic name for map and set. Do not ask me why: Ask the STL.)
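The tuple overload of hasher::hash is not shown above. A minimal C++14 sketch of what it might look like follows; the namespace hasher_sketch and the helper names are assumptions of mine, and it falls back to std::hash for every element rather than doing the ADL lookup described earlier.

```cpp
#include <cstddef>
#include <functional>
#include <tuple>
#include <type_traits>
#include <utility>

namespace hasher_sketch {  // hypothetical name

inline void combine(std::size_t& seed, std::size_t h)
{
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

template <class Tuple, std::size_t... Is>
std::size_t hash_tuple(const Tuple& t, std::index_sequence<Is...>)
{
    std::size_t seed = sizeof...(Is);  // start from the element count
    // expand the pack left to right: hash each element and fold it into the seed
    int expand[] = {0, (combine(seed,
        std::hash<typename std::decay<typename std::tuple_element<Is, Tuple>::type>::type>()(
            std::get<Is>(t))), 0)...};
    (void)expand;
    return seed;
}

template <class... Ts>
std::size_t hash(const std::tuple<Ts...>& t)
{
    return hash_tuple(t, std::index_sequence_for<Ts...>{});
}

}  // namespace hasher_sketch

struct Foo {
    int i;
    double d;
    char c;
    bool b;
    friend auto mytie(Foo const& f) {
        return std::tie(f.i, f.d, f.c, f.b);
    }
    friend std::size_t hash(Foo const& f) {
        return hasher_sketch::hash(mytie(f));
    }
};
```

Because references are decayed before hashing, a Foo and a std::tuple<int, double, char, bool> holding the same values produce the same hash, which is the "structurally similar types" property mentioned above.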
The standard hash framework is lacking in respect of combining hashes. Combining hashes using xor is sub-optimal.
A better solution is proposed in N3980 "Types Don't Know #".
The main idea is using the same hash function and its state to hash more than one value/element/member.
With that framework your hash function would look like this:
template <class HashAlgorithm>
void hash_append(HashAlgorithm& h, Foo const& x) noexcept
{
using std::hash_append;
hash_append(h, x.i);
hash_append(h, x.d);
hash_append(h, x.c);
hash_append(h, x.b);
}
And the container:
std::unordered_map<Foo, MyValueType, std::uhash<>> myMap;
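Note that hash_append and uhash come from the proposal and are not part of the standard library, so you have to supply them yourself. Here is a toy sketch of the idea in the global namespace, with a 64-bit FNV-1a as the hash algorithm. All definitions below are my own simplified stand-ins, not the proposal's reference implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// A toy HashAlgorithm: 64-bit FNV-1a fed with raw bytes
class fnv1a
{
    std::uint64_t state_ = 14695981039346656037ull;  // FNV offset basis
public:
    using result_type = std::uint64_t;
    void operator()(void const* key, std::size_t len) noexcept
    {
        unsigned char const* p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i < len; ++i)
            state_ = (state_ ^ p[i]) * 1099511628211ull;  // FNV prime
    }
    explicit operator result_type() const noexcept { return state_; }
};

// hash_append for arithmetic types: feed the object representation
template <class HashAlgorithm, class T>
typename std::enable_if<std::is_arithmetic<T>::value>::type
hash_append(HashAlgorithm& h, T const& t) noexcept
{
    h(&t, sizeof(t));
}

struct Foo { int i; double d; char c; bool b; };

template <class HashAlgorithm>
void hash_append(HashAlgorithm& h, Foo const& x) noexcept
{
    hash_append(h, x.i);
    hash_append(h, x.d);
    hash_append(h, x.c);
    hash_append(h, x.b);
}

// stand-in for the proposed uhash<>: runs one algorithm over the whole key
template <class HashAlgorithm = fnv1a>
struct uhash
{
    template <class T>
    std::size_t operator()(T const& t) const noexcept
    {
        HashAlgorithm h;
        hash_append(h, t);
        return static_cast<std::size_t>(static_cast<typename HashAlgorithm::result_type>(h));
    }
};
```

The members are appended one by one, so struct padding never reaches the hash. Hashing the raw bytes of a double does mean that 0.0 and -0.0 hash differently even though they compare equal; the proposal treats that case specially, this sketch does not.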
I implemented this solution for getting a hash value from vector<T>:
namespace std
{
template<typename T>
struct hash<vector<T>>
{
typedef vector<T> argument_type;
typedef std::size_t result_type;
result_type operator()(argument_type const& in) const
{
size_t size = in.size();
size_t seed = 0;
for (size_t i = 0; i < size; i++)
//Combine the hash of the current vector with the hashes of the previous ones
hash_combine(seed, in[i]);
return seed;
}
};
}
//using boost::hash_combine
template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
But this solution doesn't scale at all: with a vector<double> of 10 million elements it takes more than 2.5 s (according to VS).
Is there a fast hash function for this scenario?
Note that creating a hash value from the vector's address is not a feasible solution, since the related unordered_map will be used in different runs, and in addition two vector<double> with the same content but different addresses would be mapped differently (undesired behavior for this application).
NOTE: As per the comments, you get a 25-50x speed-up by compiling with optimizations. Do that, first. Then, if it's still too slow, see below.
I don't think there's much you can do. You have to touch all the elements, and that combination function is about as fast as it gets.
One option may be to parallelize the hash function. If you have 8 cores, you can run 8 threads to each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may be worth it for very large vectors.
The approach that MSVC's old hashmap used was to sample less often.
This means that isolated changes won't show up in your hash, but the thing you are trying to avoid is reading and processing the entire 80 MB of data in order to hash your vector. If you want to avoid that, skipping some elements is pretty much unavoidable.
The second thing you should do is not specialize std::hash on all vectors. This may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least it is a bad plan (as the std is sure to permit itself to add hash combining and hashing of vectors).
When I write a custom hash, I usually use ADL (Koenig Lookup) to make it easy to extend.
namespace my_utils {
namespace hash_impl {
namespace details {
namespace adl {
template<class T>
std::size_t hash(T const& t) {
return std::hash<T>{}(t);
}
}
template<class T>
std::size_t hasher(T const& t) {
using adl::hash;
return hash(t);
}
}
struct hash_tag {};
template<class T>
std::size_t hash(hash_tag, T const& t) {
return details::hasher(t);
}
template<class T>
std::size_t hash_combine(hash_tag, std::size_t seed, T const& t) {
// unqualified call with hash_tag so that ADL also finds the overloads declared below
seed ^= hash(hash_tag{}, t) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
return seed;
}
template<class Container>
std::size_t fast_hash_random_container(hash_tag, Container const& c ) {
std::size_t size = c.size();
std::size_t stride = 1 + size/10;
std::size_t r = hash(hash_tag{}, size);
for(std::size_t i = 0; i < size; i += stride) {
r = hash_combine(hash_tag{}, r, c.data()[i]);
}
return r;
}
// std specializations go here:
template<class T, class A>
std::size_t hash(hash_tag, std::vector<T,A> const& v) {
return fast_hash_random_container(hash_tag{}, v);
}
template<class T, std::size_t N>
std::size_t hash(hash_tag, std::array<T,N> const& a) {
return fast_hash_random_container(hash_tag{}, a);
}
// etc
}
struct my_hasher {
template<class T>
std::size_t operator()(T const& t)const {
return hash_impl::hash(hash_impl::hash_tag{}, t);
}
};
}
now my_hasher is a universal hasher. It uses either hashes declared in my_utils::hash_impl (for std types), or free functions called hash that will hash a given type, to hash things. Failing that, it tries to use std::hash<T>. If that fails, you get a compile-time error.
Writing a free hash function in the namespace of the type you want to hash tends to be less annoying than having to go off and open std and specialize std::hash in my experience.
It understands vectors and arrays, recursively. Doing tuples and pairs requires a bit more work.
It samples each vector or array at roughly 10 evenly spaced positions.
(Note: hash_tag is both a bit of a joke, and a way to force ADL and prevent having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)
The price of sampling is that you could get more collisions.
Another approach if you have a huge amount of data is to hash them once, and keep track of when they are modified. To do this approach, use a copy-on-write monad interface for your type that keeps track of if the hash is up to date. Now a vector gets hashed once; if you modify it, the hash is discarded.
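A much simplified sketch of the caching idea (no copy-on-write, not thread-safe). The class name CachedHashVector and its interface are hypothetical. All writes go through set(), which discards the cached hash.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical wrapper: every write goes through set(), which invalidates the cache
class CachedHashVector
{
    std::vector<double> data_;
    mutable std::size_t cached_ = 0;
    mutable bool valid_ = false;
public:
    explicit CachedHashVector(std::vector<double> v) : data_(std::move(v)) {}

    double get(std::size_t i) const { return data_[i]; }
    void set(std::size_t i, double x) { data_[i] = x; valid_ = false; }

    std::size_t hash() const
    {
        if (!valid_) {  // recompute only after a modification
            std::size_t seed = data_.size();
            for (double x : data_)
                seed ^= std::hash<double>()(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
            cached_ = seed;
            valid_ = true;
        }
        return cached_;
    }
    bool cache_valid() const { return valid_; }
};
```

Keep in mind that a key must not be modified while it sits inside an unordered container, so the invalidation only matters for objects that are not currently stored as keys.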
One can go further and have a random-access hash (where it is easy to predict what happens when you edit a given value hash-wise), and mediate all access to the vector. That is tricky.
You could also multi-thread the hashing, but I would guess that your code is probably memory-bandwidth bound, and multi-threading won't help much there. Worth trying.
You could use a fancier structure than a flat vector (something tree-like), where changes to the values bubble up in a hash-like way to a root hash value. This would add a lg(n) overhead to all element access. Again, you'd have to wrap the raw data up in controls that keep the hashing up to date (or keep track of which ranges are dirty and need to be updated).
Finally, because you are working with 10 million elements at a time, consider moving over to a strong large-scale storage solution, like databases or what have you. Using 80 megabyte keys in a map seems strange to me.
Let's say I would like to create a unordered set of unordered multisets of unsigned int. For this, I need to create a hash function to calculate a hash of the unordered multiset. In fact, it has to be good for CRC as well.
One obvious solution is to put the items in vector, sort them and return a hash of the result. This seems to work, but it is expensive.
Another approach is to xor the values, but obviously if I have one item twice or none the result will be the same - which is not good.
Any ideas how I can implement this more cheaply? I have an application that will be doing this for thousands of sets, and relatively big ones.
Since it is a multiset, you would like for the hash value to be the same for identical multisets, whose representation might have the same elements presented, added, or deleted in a different order. You would then like for the hash value to be commutative, easy to update, and change for each change in elements. You would also like for two changes to not readily cancel their effect on the hash.
One operation that meets all but the last criteria is addition. Just sum the elements. To keep the sum bounded, do the sum modulo the size of your hash value. (E.g. modulo 2^64 for a 64-bit hash.) To make sure that inserting or deleting zero values changes the hash, add one to each value first.
A drawback of the sum is that two changes can readily cancel. E.g. replacing 1 3 with 2 2. To address that, you can use the same approach and sum a polynomial of the entries, still retaining commutativity. E.g. instead of summing x+1, you can sum x^2+x+1. Now it is more difficult to contrive sets of changes with the same sum.
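Both variants fit in a few lines. This is a sketch with made-up function names (sum_hash, poly_hash); unsigned 64-bit arithmetic gives the reduction modulo 2^64 for free.

```cpp
#include <cstdint>
#include <unordered_set>

// Sum of (x + 1): commutative, but {1, 3} and {2, 2} collide
inline std::uint64_t sum_hash(const std::unordered_multiset<unsigned>& items)
{
    std::uint64_t h = 0;
    for (unsigned x : items)
        h += static_cast<std::uint64_t>(x) + 1;  // wraps modulo 2^64
    return h;
}

// Sum of x^2 + x + 1: still independent of iteration order
inline std::uint64_t poly_hash(const std::unordered_multiset<unsigned>& items)
{
    std::uint64_t h = 0;
    for (unsigned x : items) {
        const std::uint64_t v = x;
        h += v * v + v + 1;
    }
    return h;
}
```

For the example above, the plain sum gives 6 for both {1, 3} and {2, 2}, while the polynomial gives 16 and 14. Inserting x adds its term and erasing x subtracts it, so the hash can be maintained incrementally.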
Here's a reasonable hash function for std::unordered_multiset<int>. It would be better if the computations were taken mod a large prime, but the idea stands.
#include <iostream>
#include <unordered_set>
namespace std {
template<>
struct hash<unordered_multiset<int>> {
typedef unordered_multiset<int> argument_type;
typedef std::size_t result_type;
const result_type BASE = static_cast<result_type>(0xA67);
result_type log_pow(result_type ex) const {
result_type res = 1;
result_type base = BASE;
while (ex > 0) {
if (ex % 2) {
res = res * base;
}
base *= base;
ex /= 2;
}
return res;
}
result_type operator()(argument_type const & val) const {
result_type h = 0;
for (const int& el : val) {
h += log_pow(el);
}
return h;
}
};
}
int main() {
std::unordered_set<std::unordered_multiset<int>> mySet;
std::unordered_multiset<int> set1{1,2,3,4};
std::unordered_multiset<int> set2{1,1,2,2,3,3,4,4};
std::cout << "Hash 1: " << std::hash<std::unordered_multiset<int>>()(set1)
<< std::endl;
std::cout << "Hash 2: " << std::hash<std::unordered_multiset<int>>()(set2)
<< std::endl;
return 0;
}
Output:
Hash 1: 2290886192
Hash 2: 286805088
When the modulus is a prime p, the number of collisions is proportional to 1/p. I'm not sure what the analysis is for powers of two. You can make updates to the hash efficient by adding/subtracting BASE^x when you insert/remove the integer x.
Implement the inner multiset as a value->count hash map.
This will allow you to avoid the problem that an even number of elements cancels out via xor in the following way: Instead of xor-ing each element, you construct a new number from the count and the value (e.g. multiplying them), and then you can build the full hash using xor.
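A sketch of that idea follows. Instead of multiplying count and value, I pack both into one 64-bit number and run it through an invertible mixer, so two different (value, count) entries can never produce the same term. The names CountedMultiset, mix64 and multiset_hash are assumptions of this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Invertible 64-bit mixer (the splitmix64 finalizer); maps 0 to 0
inline std::uint64_t mix64(std::uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
    x ^= x >> 27; x *= 0x94d049bb133111ebull;
    x ^= x >> 31;
    return x;
}

// Multiset stored as value -> count (hypothetical alias)
using CountedMultiset = std::unordered_map<std::uint32_t, std::uint32_t>;

inline std::uint64_t multiset_hash(const CountedMultiset& ms)
{
    std::uint64_t h = 0;
    for (const auto& entry : ms) {
        // pack value and count into one number, mix it, then XOR the terms together
        const std::uint64_t packed =
            (static_cast<std::uint64_t>(entry.first) << 32) | entry.second;
        h ^= mix64(packed);
    }
    return h;
}
```

Since each value appears only once in the map, XOR-ing the terms no longer cancels duplicates, and the result does not depend on iteration order.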
I'm trying to implement an unordered_map for a vector< pair < int,int> >. Since there's no such default hash function, I tried to imagine a function of my own :
struct ObjectHasher
{
std::size_t operator()(const Object& k) const
{
std::string h_string("");
for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
{
h_string.push_back(97+i->first);
h_string.push_back(45); // '-'
h_string.push_back(97+i->second);
h_string.push_back(43); // '+'
}
return std::hash<std::string>()(h_string);
}
};
The main idea is to change the list of integers, say ( (0, 1), (8, 10) ), into a formatted string like "a-b+i-k+" and to compute its hash with hash<string>(). I chose the offset 97 ('a') and the separators '-' and '+' only so that the hash string can be easily displayed in a terminal during my tests.
I know this kind of function might be a very naive idea since a good hash function should be fast and strong against collisions. Well, if the integers given to push_back() are greater than 255 I don't know what might happen... So, what do you think of the following questions :
(1) is my function ok for big integers ?
(2) is my function ok for all environments/platforms ?
(3) is my function too slow to be a hash function ?
(4) ... do you have anything better ?
All you need is a function to "hash in" an integer. You can steal such a function from boost:
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
Now your function is trivial:
struct ObjectHasher
{
std::size_t operator()(const Object& k) const
{
std::size_t hash = 0;
for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
{
hash_combine(hash, i->first);
hash_combine(hash, i->second);
}
return hash;
}
};
This function is probably very slow compared to other hash functions since it uses dynamic memory allocation. Also, std::hash<std::string> is not a very good hash function here since it is very general. It's probably better to XOR all ints and use std::hash<int>.
This is a perfectly valid solution. All a hash function needs is a sequence of bytes and by concatenating your elements together as a string you are providing a unique byte representation of the map.
Of course this could become unruly if your map contains a large number of items.