Fast hash function for `std::vector` - c++

I implemented this solution to get a hash value from a vector<T>:
// based on boost::hash_combine; declared first so the specialization below can find it
template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
namespace std
{
template<typename T>
struct hash<vector<T>>
{
typedef vector<T> argument_type;
typedef std::size_t result_type;
result_type operator()(argument_type const& in) const
{
size_t size = in.size();
size_t seed = 0;
for (size_t i = 0; i < size; i++)
// Combine the hash of the current element with the hashes of the previous ones
hash_combine(seed, in[i]);
return seed;
}
};
}
But this solution doesn't scale at all: with a vector<double> of 10 million elements it takes more than 2.5 s (according to VS).
Is there a fast hash function for this scenario?
Note that creating a hash value from the vector's address is not a feasible solution, since the related unordered_map will be used across different runs and, in addition, two vector<double> with the same content but different addresses would be mapped differently (undesired behavior for this application).

NOTE: As per the comments, you get a 25-50x speed-up by compiling with optimizations. Do that, first. Then, if it's still too slow, see below.
I don't think there's much you can do. You have to touch all the elements, and that combination function is about as fast as it gets.
One option may be to parallelize the hash function. If you have 8 cores, you can run 8 threads to each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may be worth it for very large vectors.
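A minimal sketch of that idea, assuming std::async is an acceptable way to get the threads. The names mix_into, hash_range and parallel_hash are made up for illustration, and whether this pays off at all depends on the machine and the vector size:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Same mixing step as boost::hash_combine, applied to an already-computed hash.
inline void mix_into(std::size_t& seed, std::size_t h) {
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash the half-open index range [first, last) of v sequentially.
inline std::size_t hash_range(const std::vector<double>& v,
                              std::size_t first, std::size_t last) {
    std::size_t seed = 0;
    std::hash<double> h;
    for (std::size_t i = first; i < last; ++i) mix_into(seed, h(v[i]));
    return seed;
}

// Hypothetical helper: split v into `chunks` pieces, hash each piece in its
// own task, then combine the partial hashes in chunk order.
// The result depends on `chunks`, so it has to stay fixed for a given map.
inline std::size_t parallel_hash(const std::vector<double>& v,
                                 std::size_t chunks = 8) {
    assert(chunks > 0);
    const std::size_t n = v.size();
    const std::size_t step = (n + chunks - 1) / chunks;  // ceil(n / chunks)
    std::vector<std::future<std::size_t>> parts;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t first = std::min(n, c * step);
        const std::size_t last = std::min(n, first + step);
        parts.push_back(std::async(std::launch::async, hash_range,
                                   std::cref(v), first, last));
    }
    std::size_t seed = n;  // fold the length into the final value
    for (auto& p : parts) mix_into(seed, p.get());
    return seed;
}
```

Note that the value differs from the sequential hash in the question, which is fine as long as the same function is used everywhere.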

The approach that MSVC's old hashmap used was to sample less often.
This means that isolated changes won't show up in your hash, but the thing you are trying to avoid is reading and processing the entire 80 MB of data in order to hash your vector. Skipping some elements is pretty much unavoidable.
The second thing you should do is not specialize std::hash on all vectors: this may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least it is a bad plan (the standard reserves the right to add hash combining and hashing of vectors itself).
When I write a custom hash, I usually use ADL (Koenig Lookup) to make it easy to extend.
namespace my_utils {
namespace hash_impl {
namespace details {
namespace adl {
template<class T>
std::size_t hash(T const& t) {
return std::hash<T>{}(t);
}
}
template<class T>
std::size_t hasher(T const& t) {
using adl::hash;
return hash(t);
}
}
struct hash_tag {};
template<class T>
std::size_t hash(hash_tag, T const& t) {
return details::hasher(t);
}
template<class T>
std::size_t hash_combine(hash_tag, std::size_t seed, T const& t) {
// hash(hash_tag{}, t) is found via ADL on hash_tag, so the container overloads declared below are visible too
seed ^= hash(hash_tag{}, t) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
return seed;
}
template<class Container>
std::size_t fash_hash_random_container(hash_tag, Container const& c ) {
std::size_t size = c.size();
std::size_t stride = 1 + size/10;
std::size_t r = hash(hash_tag{}, size);
for(std::size_t i = 0; i < size; i += stride) {
r = hash_combine(hash_tag{}, r, c.data()[i]);
}
return r;
}
// std specializations go here:
template<class T, class A>
std::size_t hash(hash_tag, std::vector<T,A> const& v) {
return fash_hash_random_container(hash_tag{}, v);
}
template<class T, std::size_t N>
std::size_t hash(hash_tag, std::array<T,N> const& a) {
return fash_hash_random_container(hash_tag{}, a);
}
// etc
}
struct my_hasher {
template<class T>
std::size_t operator()(T const& t)const {
return hash_impl::hash(hash_impl::hash_tag{}, t);
}
};
}
Now my_hasher is a universal hasher. It uses either hashes declared in my_utils::hash_impl (for std types), or free functions called hash that will hash a given type, to hash things. Failing that, it tries to use std::hash<T>. If that fails, you get a compile-time error.
Writing a free hash function in the namespace of the type you want to hash tends to be less annoying than having to go off and open std and specialize std::hash in my experience.
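As a cut-down illustration of that extension point (not the full my_hasher above), here is a sketch in which a hypothetical type geo::Point gets a free hash function in its own namespace. The names geo, util, fallback and adl_hasher are assumptions made for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

namespace geo {  // hypothetical user namespace
struct Point {
    int x;
    int y;
    bool operator==(Point const& o) const { return x == o.x && y == o.y; }
};

// Free function named `hash`, found by argument-dependent lookup.
inline std::size_t hash(Point const& p) {
    std::size_t seed = std::hash<int>{}(p.x);
    seed ^= std::hash<int>{}(p.y) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}
}  // namespace geo

namespace util {  // hypothetical helper namespace
namespace fallback {
// Chosen only when ADL finds no better `hash` for T.
template <class T>
std::size_t hash(T const& t) { return std::hash<T>{}(t); }
}  // namespace fallback

struct adl_hasher {
    template <class T>
    std::size_t operator()(T const& t) const {
        using fallback::hash;  // make the fallback visible, then let ADL compete
        return hash(t);
    }
};
}  // namespace util
```

With this in place, std::unordered_set<geo::Point, util::adl_hasher> works without touching namespace std, and plain int keys still go through std::hash<int>.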
It understands vectors and arrays, recursively. Doing tuples and pairs requires a bit more work.
It samples roughly 10 elements of each vector or array.
(Note: hash_tag is both a bit of a joke, and a way to force ADL and prevent having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)
The price of sampling is that you could get more collisions.
Another approach, if you have a huge amount of data, is to hash it once and keep track of when it is modified. To do this, use a copy-on-write monad interface for your type that keeps track of whether the hash is up to date. Now a vector gets hashed once; if you modify it, the hash is discarded.
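A rough sketch of the caching part only (no copy-on-write), assuming all writes are funnelled through one member function. The class name hashed_vector and its members are invented for this example, and it is not thread-safe:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical wrapper: every write goes through modify(), which drops the
// cached hash; reads never invalidate it.
class hashed_vector {
public:
    explicit hashed_vector(std::vector<double> data) : data_(std::move(data)) {}

    const std::vector<double>& get() const { return data_; }

    // Run f on the underlying vector and mark the cached hash as stale.
    template <class F>
    void modify(F&& f) {
        valid_ = false;
        std::forward<F>(f)(data_);
    }

    // The full O(n) pass happens only when the cache is stale.
    std::size_t hash() const {
        if (!valid_) {
            std::size_t seed = data_.size();
            std::hash<double> h;
            for (double d : data_)
                seed ^= h(d) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
            cache_ = seed;
            valid_ = true;
            ++recomputations_;
        }
        return cache_;
    }

    // Only here to make the caching observable in a demo.
    std::size_t recomputations() const { return recomputations_; }

private:
    std::vector<double> data_;
    mutable std::size_t cache_ = 0;
    mutable bool valid_ = false;
    mutable std::size_t recomputations_ = 0;
};
```

Keys stored in an unordered_map are const, so once such a key is inserted its cached hash stays valid for every later lookup and rehash.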
One can go further and have a random-access hash (where it is easy to predict what happens when you edit a given value hash-wise), and mediate all access to the vector. That is tricky.
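One possible shape for such a random-access hash, as a sketch only: XOR together a per-slot value that depends on both index and element, so that a single write can be applied to the hash in O(1). The names scramble, slot_hash and xor_hashed_vector are illustrative; the scrambling step uses the splitmix64 constants, which is my choice rather than anything from the answer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Scramble a 64-bit value (splitmix64 mixing step).
inline std::uint64_t scramble(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Contribution of one slot: depends on the index as well as the value,
// so reordering elements generally changes the total.
inline std::uint64_t slot_hash(std::size_t i, double v) {
    return scramble(scramble(i) ^ std::hash<double>{}(v));
}

class xor_hashed_vector {
public:
    explicit xor_hashed_vector(std::size_t n) : data_(n, 0.0) {
        for (std::size_t i = 0; i < n; ++i) hash_ ^= slot_hash(i, 0.0);
    }

    double operator[](std::size_t i) const { return data_[i]; }

    // O(1) update: XOR out the old contribution, XOR in the new one.
    void set(std::size_t i, double v) {
        assert(i < data_.size());
        hash_ ^= slot_hash(i, data_[i]) ^ slot_hash(i, v);
        data_[i] = v;
    }

    std::uint64_t hash() const { return hash_ ^ scramble(data_.size()); }

private:
    std::vector<double> data_;
    std::uint64_t hash_ = 0;
};
```

The trade-off is that XOR is a weaker combiner than a sequential one; it is used here because it can be undone for a single slot.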
You could also multi-thread the hashing, but I would guess that your code is probably memory-bandwidth bound, and multi-threading won't help much there. Worth trying.
You could use a fancier structure than a flat vector (something tree-like), where changes to the values bubble up in a hash-like way to a root hash value. This would add a lg(n) overhead to all element access. Again, you'd have to wrap the raw data up in controls that keep the hashing up to date (or keep track of which ranges are dirty and need to be updated).
Finally, because you are working with 10 million elements at a time, consider moving over to a strong large-scale storage solution, like databases or what have you. Using 80 megabyte keys in a map seems strange to me.

Related

Create custom Hash Function

I tried to implement an unordered map for a class called Pair that stores an integer and a bitset. Then I found out that there isn't a hash function for this class.
Now I want to create my own hash function. But instead of using XOR or comparable functions, I want a hash function along the following lines:
The bitsets in my class have a fixed size, so I wanted to do the following:
Example: for an instance of Pair with the bitset<6> = 101101 and the integer 6:
create a string = "1011016"
and now use the default hash function on this string.
Because the bitsets have a fixed size, each key would be unique.
How could I implement this approach?
Thank you in advance.
To expand on a comment, as requested:
Converting to string and then hashing that string would be somewhat slow. At least slower than it needs to be. A faster approach would be to combine the bit patterns, e.g. like this:
struct Pair
{
std::bitset<6> bits;
int intval;
};
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
return rtrn;
}
};
This works on two assumptions:
The upper bits of the integer are generally not interesting
The size of the bitset is always small compared to size_t
I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.
Possible improvements
We can preserve the upper bits by rotating instead of shifting.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
std::size_t intdigits = std::numeric_limits<decltype(pair.intval)>::digits;
std::size_t bitdigits = pair.bits.size();
// can be simplified to std::rotl(rtrn, bitdigits) in C++20
rtrn = (rtrn << bitdigits) | (rtrn >> (intdigits - bitdigits));
rtrn ^= pair.bits.to_ulong();
return rtrn;
}
};
Nothing will change for small integers (except some bitflips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...
If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't just xor them to avoid hash collisions on small integers.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::hash<decltype(pair.intval)> ihash;
std::hash<decltype(pair.bits)> bhash;
return ihash(pair.intval) * 31 + bhash(pair.bits);
}
};
I picked the simple polynomial hash approach common in Java. I believe GCC uses the same one internally for string hashing. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen as it is a prime number one less than a power of two, so the multiplication can be computed quickly as (x << 5) - x.

std::hash variations of object with arbitrary number of attributes of fundamental type

Discussion:
Let's say I have a struct/class with an arbitrary number of attributes that I want to use as key to a std::unordered_map e.g.,:
struct Foo {
int i;
double d;
char c;
bool b;
};
I know that I have to define a hasher-functor for it e.g.,:
struct FooHasher {
std::size_t operator()(Foo const &foo) const;
};
And then define my std::unordered_map as:
std::unordered_map<Foo, MyValueType, FooHasher> myMap;
What bothers me though, is how to define the call operator for FooHasher. One way to do it, that I also tend to prefer, is with std::hash. However, there are numerous variations e.g.,:
std::size_t operator()(Foo const &foo) const {
return std::hash<int>()(foo.i) ^
std::hash<double>()(foo.d) ^
std::hash<char>()(foo.c) ^
std::hash<bool>()(foo.b);
}
I've also seen the following scheme:
std::size_t operator()(Foo const &foo) const {
return std::hash<int>()(foo.i) ^
(std::hash<double>()(foo.d) << 1) ^
(std::hash<char>()(foo.c) >> 1) ^
(std::hash<bool>()(foo.b) << 1);
}
I've seen also some people adding the golden ratio:
std::size_t operator()(Foo const &foo) const {
return (std::hash<int>()(foo.i) + 0x9e3779b9) ^
(std::hash<double>()(foo.d) + 0x9e3779b9) ^
(std::hash<char>()(foo.c) + 0x9e3779b9) ^
(std::hash<bool>()(foo.b) + 0x9e3779b9);
}
Questions:
What are they trying to achieve by adding the golden ratio or shifting bits in the result of std::hash?
Is there an "official scheme" to std::hash an object with arbitrary number of attributes of fundamental type?
A simple xor is symmetric and behaves badly when fed the "same" value multiple times (hash(a) ^ hash(a) is zero).
This is the question of combining hashes. boost has a hash_combine that is pretty decent. Write a hash combiner, and use it.
There is no "official scheme" to solve this problem.
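A minimal sketch of that advice, assuming a boost-style combiner re-implemented locally. FooHasher follows the question; hash_combine is a local stand-in for boost's function, and xor_pair_hash exists only to show the XOR pitfall:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>

struct Foo {
    int i;
    double d;
    char c;
    bool b;
    bool operator==(Foo const& o) const {
        return i == o.i && d == o.d && c == o.c && b == o.b;
    }
};

// Order-sensitive combiner in the style of boost::hash_combine.
template <class T>
void hash_combine(std::size_t& seed, T const& v) {
    seed ^= std::hash<T>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct FooHasher {
    std::size_t operator()(Foo const& foo) const {
        std::size_t seed = 0;
        hash_combine(seed, foo.i);
        hash_combine(seed, foo.d);
        hash_combine(seed, foo.c);
        hash_combine(seed, foo.b);
        return seed;
    }
};

// The plain-XOR pitfall: equal member hashes cancel out, and the
// result does not depend on the order of the members.
inline std::size_t xor_pair_hash(int a, int b) {
    return std::hash<int>{}(a) ^ std::hash<int>{}(b);
}
```

Here xor_pair_hash(7, 7) is always 0 and xor_pair_hash(1, 2) equals xor_pair_hash(2, 1), whereas the combiner feeds the running seed back into each step and so is sensitive to order.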
Myself, I typically write a super-hasher that can take anything and hash it. It hash combines tuples and pairs and collections automatically, where it first hashes the count of elements in the collection, then the elements.
It finds hash(t) via ADL first, and if that fails checks if it has a manually written hash in a helper namespace (used for std containers and types), and if that fails does a std::hash<T>{}(t).
Then my hash for Foo support looks like:
struct Foo {
int i;
double d;
char c;
bool b;
friend auto mytie(Foo const& f) {
return std::tie(f.i, f.d, f.c, f.b);
}
friend std::size_t hash(Foo const& f) {
return hasher::hash(mytie(f));
}
};
where I use mytie to move Foo into a tuple, then use the std::tuple overload of hasher::hash to get the result.
I like the idea of hashes of structurally similar types having the same hash. This lets me act as if my hash is transparent in some cases.
Note that hashing unordered meows in this manner is a bad idea, as an asymmetric hash of an unordered meow may generate spurious misses.
(Meow is the generic name for map and set. Do not ask me why: Ask the STL.)
The standard hash framework is lacking in respect of combining hashes. Combining hashes using xor is sub-optimal.
A better solution is proposed in N3980 "Types Don't Know #".
The main idea is using the same hash function and its state to hash more than one value/element/member.
With that framework your hash function would look like this:
template <class HashAlgorithm>
void hash_append(HashAlgorithm& h, Foo const& x) noexcept
{
using std::hash_append;
hash_append(h, x.i);
hash_append(h, x.d);
hash_append(h, x.c);
hash_append(h, x.b);
}
And the container:
std::unordered_map<Foo, MyValueType, std::uhash<>> myMap;
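Since hash_append and uhash come from a proposal rather than the standard library, the following is a self-contained sketch of the idea with everything defined locally. The fnv1a class, the arithmetic hash_append overload and this simplified uhash are my own approximations of what N3980 describes, not its reference implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <unordered_map>

// A tiny 64-bit FNV-1a "HashAlgorithm": it consumes raw bytes and yields
// the final value on conversion.
class fnv1a {
    std::uint64_t state_ = 14695981039346656037ULL;  // FNV offset basis
public:
    void operator()(void const* key, std::size_t len) noexcept {
        auto p = static_cast<unsigned char const*>(key);
        for (std::size_t i = 0; i < len; ++i) {
            state_ ^= p[i];
            state_ *= 1099511628211ULL;  // FNV prime
        }
    }
    explicit operator std::size_t() const noexcept {
        return static_cast<std::size_t>(state_);
    }
};

// hash_append for arithmetic types: feed the object representation.
template <class H, class T>
typename std::enable_if<std::is_arithmetic<T>::value>::type
hash_append(H& h, T const& t) noexcept {
    T v = t;
    if (v == 0) v = 0;  // map -0.0 to +0.0 so that equal values hash equally
    h(&v, sizeof v);
}

struct Foo {
    int i;
    double d;
    char c;
    bool b;
    bool operator==(Foo const& o) const {
        return i == o.i && d == o.d && c == o.c && b == o.b;
    }
};

// The type only says WHAT gets hashed; the algorithm H decides HOW.
template <class H>
void hash_append(H& h, Foo const& x) noexcept {
    hash_append(h, x.i);
    hash_append(h, x.d);
    hash_append(h, x.c);
    hash_append(h, x.b);
}

// Simplified stand-in for the proposal's uhash adaptor.
template <class H>
struct uhash {
    template <class T>
    std::size_t operator()(T const& t) const noexcept {
        H h;
        hash_append(h, t);
        return static_cast<std::size_t>(h);
    }
};
```

A map would then be declared as std::unordered_map<Foo, MyValueType, uhash<fnv1a>>, and swapping in another algorithm means changing only the template argument.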

hash function for a vector of pair<int, int>

I'm trying to implement an unordered_map for a vector< pair < int,int> >. Since there's no such default hash function, I tried to imagine a function of my own :
struct ObjectHasher
{
std::size_t operator()(const Object& k) const
{
std::string h_string("");
for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
{
h_string.push_back(97+i->first);
h_string.push_back(45); // '-'
h_string.push_back(97+i->second);
h_string.push_back(43); // '+'
}
return std::hash<std::string>()(h_string);
}
};
The main idea is to turn the list of integers, say ((0, 1), (8, 10)), into a formatted string like "a-b+i-k+" and to compute its hash with hash<string>(). I chose the numbers 97, 45 and 43 only so that the string can easily be displayed in a terminal during my tests.
I know this kind of function might be a very naive idea since a good hash function should be fast and strong against collisions. Well, if the integers given to push_back() are greater than 255 I don't know what might happen... So, what do you think of the following questions :
(1) is my function ok for big integers ?
(2) is my function ok for all environments/platforms ?
(3) is my function too slow to be a hash function ?
(4) ... do you have anything better ?
All you need is a function to "hash in" an integer. You can steal such a function from boost:
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
Now your function is trivial:
struct ObjectHasher
{
std::size_t operator()(const Object& k) const
{
std::size_t hash = 0;
for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
{
hash_combine(hash, i->first);
hash_combine(hash, i->second);
}
return hash;
}
};
This function is probably very slow compared to other hash functions since it uses dynamic memory allocation. Also, std::hash<std::string> is not a very good hash function here since it is very general. It's probably better to XOR all ints and use std::hash<int>.
This is a perfectly valid solution. All a hash function needs is a sequence of bytes and by concatenating your elements together as a string you are providing a unique byte representation of the map.
Of course this could become unruly if your map contains a large number of items.

Returning container from function: optimizing speed and modern style

This is not entirely a question, more something I have been pondering: how to write such code more elegantly while making full use of the new C++ standard. Here is the example:
Returning a Fibonacci sequence into a container, up to N values (for those not mathematically inclined, each value is the sum of the previous two, with the first two values equal to 1, i.e. 1, 1, 2, 3, 5, 8, 13, ...)
example run from main:
std::vector<double> vec;
running_fibonacci_seq(vec,30000000);
1)
template <typename T, typename INT_TYPE>
void running_fibonacci_seq(T& coll, const INT_TYPE& N)
{
coll.resize(N);
coll[0] = 1;
if (N>1) {
coll[1] = 1;
for (auto pos = coll.begin()+2;
pos != coll.end();
++pos)
{
*pos = *(pos-1) + *(pos-2);
}
}
}
2) the same, but using an rvalue reference (&&) instead of &, i.e.
void running_fibonacci_seq(T&& coll, const INT_TYPE& N)
EDIT: as noticed by the users who commented below, the rvalue and lvalue play no role in timing - the speeds were actually the same for reasons discussed in the comments
results for N = 30,000,000
Time taken for &:919.053ms
Time taken for &&: 800.046ms
Firstly, I know this isn't really a question as such, but which of these is the better modern C++ code? With the rvalue reference (&&) it appears that move semantics are in place and no unnecessary copies are made, which gives a small improvement in time (important for me due to future real-time application development). Some specific "questions" are:
a) Passing a container (a vector in my example) to a function as a parameter is NOT an elegant demonstration of how rvalue references should really be used. Is this true? If so, how would rvalue references really shine in the above example?
b) The coll.resize(N); call and the N=1 case: is there a way to avoid these so that the user gets a simple interface and only calls the function, without sizing the vector dynamically? Can template metaprogramming be of use here so that the vector is allocated with a particular size at compile time (i.e. running_fibonacci_seq<30000000>)? Since the numbers can be large, is there any need for template metaprogramming, and if so can we use this [1] as well?
c) Is there an even more elegant method? I have a feeling std::transform could be used with a lambda, e.g.
void running_fibonacci_seq(T&& coll, const INT_TYPE& N)
{
coll.resize(N);
coll[0] = 1;
coll[1] = 1;
std::transform (coll.begin()+2,
coll.end(), // source
coll.begin(), // destination
[????](????) { // lambda as function object
return ????????;
});
}
[1] http://cpptruths.blogspot.co.uk/2011/07/want-speed-use-constexpr-meta.html
Due to "reference collapsing" this code does NOT use an rvalue reference, or move anything:
template <typename T, typename INT_TYPE>
void running_fibonacci_seq(T&& coll, const INT_TYPE& N);
running_fibonacci_seq(vec,30000000);
All of your questions (and the existing comments) become quite meaningless when you recognize this.
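A small check makes the collapsing visible. This is a sketch; deduced_lvalue_ref is a made-up helper that simply reports what T was deduced to:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// T&& is a forwarding reference here: for an lvalue argument T is deduced
// as `std::vector<double>&`, and `T&&` then collapses to an lvalue reference.
template <typename T>
constexpr bool deduced_lvalue_ref(T&&) {
    return std::is_lvalue_reference<T>::value;
}

// The collapsing rule itself: (X&)&& is just X&.
using LvalueRef = std::vector<double>&;
static_assert(std::is_same<LvalueRef&&, LvalueRef>::value,
              "& combined with && collapses to &");
```

Calling deduced_lvalue_ref(vec) with a named vector yields true, while passing a temporary or std::move(vec) yields false; only in the latter cases does the parameter actually bind as an rvalue reference.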
Obvious answer:
std::vector<double> running_fibonacci_seq(uint32_t N);
Why ?
Because of const-ness:
std::vector<double> const result = running_fibonacci_seq(....);
Because of easier invariants:
void running_fibonacci_seq(std::vector<double>& t, uint32_t N) {
// Oh, forgot to clear "t"!
t.push_back(1);
...
}
But what of speed ?
There is an optimization called Return Value Optimization that allows the compiler to omit the copy (and build the result directly in the caller's variable) in a number of cases. It is specifically allowed by the C++ Standard even when the copy/move constructors have side effects.
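To see this in action, here is a sketch with a hypothetical Tracked type that counts copy constructions. Since C++11, returning a local variable by name is either elided or falls back to the move constructor, so the counter stays at zero either way:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Counts copy constructions so the absence of copies can be observed.
struct Tracked {
    static int copies;
    std::vector<double> data;
    explicit Tracked(std::size_t n) : data(n, 1.0) {}
    Tracked(Tracked const& o) : data(o.data) { ++copies; }
    Tracked(Tracked&& o) noexcept : data(std::move(o.data)) {}
};
int Tracked::copies = 0;

inline Tracked make_tracked(std::size_t n) {
    assert(n > 0);
    Tracked t(n);    // named local: the compiler may build it in the caller's object
    t.data[0] = 2.0;
    return t;        // elided, or at worst moved; never copied
}
```

After Tracked r = make_tracked(10); the counter Tracked::copies is still 0, which is why returning the container by value costs nothing extra here.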
So why pass "out" parameters?
you can only have one return value (sigh)
you may wish to reuse the allocated resources (here the memory buffer of t)
Profile this:
#include <vector>
#include <cstddef>
#include <type_traits>
template <typename Container>
Container generate_fibbonacci_sequence(std::size_t N)
{
Container coll;
coll.resize(N);
coll[0] = 1;
if (N>1) {
coll[1] = 1;
for (auto pos = coll.begin()+2;
pos != coll.end();
++pos)
{
*pos = *(pos-1) + *(pos-2);
}
}
return coll;
}
struct fibbo_maker {
std::size_t N;
fibbo_maker(std::size_t n):N(n) {}
template<typename Container>
operator Container() const {
typedef typename std::remove_reference<Container>::type NRContainer;
typedef typename std::decay<NRContainer>::type VContainer;
return generate_fibbonacci_sequence<VContainer>(N);
}
};
fibbo_maker make_fibbonacci_sequence( std::size_t N ) {
return fibbo_maker(N);
}
int main() {
std::vector<double> tmp = make_fibbonacci_sequence(30000000);
}
The fibbo_maker stuff is just me being clever. But it lets me deduce the type of fibbo sequence you want without you having to repeat it.

Unordered (hash) map from bitset to bitset on boost

I want to use a cache, implemented by boost's unordered_map, from a dynamic_bitset to a dynamic_bitset. The problem, of course, is that there is no default hash function from the bitset. It doesn't seem to be like a conceptual problem, but I don't know how to work out the technicalities. How should I do that?
I found an unexpected solution. It turns out boost has an option to #define BOOST_DYNAMIC_BITSET_DONT_USE_FRIENDS. When this is defined, private members including m_bits become public (I think it's there to deal with old compilers or something).
So now I can use @KennyTM's answer, changed a bit:
namespace boost {
template <typename B, typename A>
std::size_t hash_value(const boost::dynamic_bitset<B, A>& bs) {
return boost::hash_value(bs.m_bits);
}
}
There's a to_block_range function that copies the words the bitset consists of into some buffer. To avoid the actual copying, you could define your own "output iterator" that just processes individual words and computes a hash from them. As for how to compute the hash: see e.g. the FNV hash function.
Unfortunately, the design of dynamic_bitset is, IMHO, braindead because it does not give you direct access to the underlying buffer (not even as const).
It is a feature request.
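The output-iterator idea could look roughly like the sketch below. To keep it self-contained it does not include Boost: a std::vector of blocks plus a bit count stands in for the bitset, and std::copy stands in for boost::to_block_range, which would be handed the same kind of iterator. The names fnv1a_block_sink and hash_blocks are invented, and folding in the bit count is my addition so that bitsets with identical blocks but different lengths hash differently:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Output iterator that folds every block written to it into a 64-bit
// FNV-1a state instead of storing it anywhere.
template <class Block>
class fnv1a_block_sink {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit fnv1a_block_sink(std::uint64_t& state) : state_(&state) {}

    // "Writing" a block hashes its bytes, least significant byte first.
    fnv1a_block_sink& operator=(Block const& block) {
        for (std::size_t i = 0; i < sizeof(Block); ++i) {
            *state_ ^= static_cast<unsigned char>(block >> (8 * i));
            *state_ *= 1099511628211ULL;  // FNV prime
        }
        return *this;
    }
    fnv1a_block_sink& operator*() { return *this; }
    fnv1a_block_sink& operator++() { return *this; }
    fnv1a_block_sink operator++(int) { return *this; }

private:
    std::uint64_t* state_;  // shared, so copies of the iterator keep accumulating
};

// Stand-in for hashing a bitset: the bit count first, then every block.
template <class Block>
std::uint64_t hash_blocks(std::vector<Block> const& blocks, std::size_t num_bits) {
    std::uint64_t state = 14695981039346656037ULL;  // FNV offset basis
    fnv1a_block_sink<std::size_t> length_sink(state);
    *length_sink = num_bits;
    std::copy(blocks.begin(), blocks.end(), fnv1a_block_sink<Block>(state));
    return state;
}
```

No temporary vector is built: each block is consumed as soon as it is written through the iterator.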
One could implement a not-so-efficient unique hash by converting the bitset to a vector temporary:
namespace boost {
template <typename B, typename A>
std::size_t hash_value(const boost::dynamic_bitset<B, A>& bs) {
std::vector<B, A> v;
boost::to_block_range(bs, std::back_inserter(v));
return boost::hash_value(v);
}
}
We can't directly calculate the hash because the underlying data in dynamic_bitset is private (m_bits).
But we can easily finesse past (subvert!) the c++ access specification system without either
hacking at the code or
pretending your compiler is non-conforming (BOOST_DYNAMIC_BITSET_DONT_USE_FRIENDS)
The key is the template function to_block_range which is a friend to dynamic_bitset. Specialisations of this function, therefore, also have access to its private data (i.e. m_bits).
The resulting code couldn't be simpler
namespace boost {
// specialise to_block_range for size_t& to return the hash of the underlying data
template <>
inline void
to_block_range(const dynamic_bitset<>& b, size_t& hash_result)
{
hash_result = boost::hash_value(b.m_bits);
}
inline std::size_t hash_value(const dynamic_bitset<>& bs)
{
size_t hash_result;
to_block_range(bs, hash_result);
return hash_result;
}
}
The proposed solution generates the same hash in the following situation:
#define BOOST_DYNAMIC_BITSET_DONT_USE_FRIENDS
namespace boost {
template <typename B, typename A>
std::size_t hash_value(const boost::dynamic_bitset<B, A>& bs) {
return boost::hash_value(bs.m_bits);
}
}
boost::dynamic_bitset<> test(1,false);
auto hash1 = boost::hash_value(test);
test.push_back(false);
auto hash2 = boost::hash_value(test);
// and so on...
test.push_back(false);
auto hash31 = boost::hash_value(test);
// all of hash1 to hash31 are the same!
So the proposed solution is sometimes unsuitable for a hash map.
I read the source code of dynamic_bitset to find out why this happens and realized that dynamic_bitset packs one bit per value, the same way vector<bool> does. For example, if you call dynamic_bitset<> test(1, false), dynamic_bitset initially allocates one 4-byte block, all zero, and records the number of bits (in this case, 1). Once the number of bits exceeds 32, it allocates another 4-byte block and pushes it back into dynamic_bitset<>::m_bits (so m_bits is a vector of 4-byte blocks).
If I call test.push_back(x), it sets the second bit to x and increases the number of bits to 2. If x is false, m_bits[0] does not change at all! To compute the hash correctly, we need to include m_num_bits in the hash computation.
Then, the question is how?
1: Use boost::hash_combine
This approach is simple and straightforward. I have not checked whether it compiles.
namespace boost {
template <typename B, typename A>
std::size_t hash_value(const boost::dynamic_bitset<B, A>& bs) {
size_t tmp = 0;
boost::hash_combine(tmp,bs.m_num_bits);
boost::hash_combine(tmp,bs.m_bits);
return tmp;
}
}
2: Flip the (m_num_bits % bits_per_block)-th bit.
Flip one bit of the hash, chosen by the number of bits. I believe this approach is faster than 1.
namespace boost {
template <typename B, typename A>
std::size_t hash_value(const boost::dynamic_bitset<B, A>& bs) {
// Use a std::size_t mask so the shift and the flip operate on the full width of the hash.
const std::size_t bit = std::size_t(1) << (bs.m_num_bits % bs.bits_per_block);
const std::size_t return_val = boost::hash_value(bs.m_bits);
// Flip that bit: clear it if it is set, set it otherwise.
return return_val ^ bit;
}
}