hash function for a vector of pair<int, int> - c++

I'm trying to implement an unordered_map for a vector< pair < int,int> >. Since there's no such default hash function, I tried to imagine a function of my own :
struct ObjectHasher
{
std::size_t operator()(const Object& k) const
{
std::string h_string("");
for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
{
h_string.push_back(97+i->first);
h_string.push_back(45); // '-'
h_string.push_back(97+i->second);
h_string.push_back(43); // '+'
}
return std::hash<std::string>()(h_string);
}
};
The main idea is to turn the list of integers, say ( (0, 1), (8, 10) ), into a formatted string like "a-b+i-k+" and to compute its hash with hash < string >(). I chose the numbers 97, 45 and 43 only so that the hash string can be displayed easily in a terminal during my tests.
I know this kind of function might be a very naive idea since a good hash function should be fast and strong against collisions. Well, if the integers given to push_back() are greater than 255 I don't know what might happen... So, what do you think of the following questions :
(1) is my function ok for big integers ?
(2) is my function ok for all environments/platforms ?
(3) is my function too slow to be a hash function ?
(4) ... do you have anything better ?

All you need is a function to "hash in" an integer. You can steal such a function from boost:
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
Now your function is trivial:
struct ObjectHasher
{
std::size_t operator()(const Object& k) const
{
std::size_t hash = 0;
for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
{
hash_combine(hash, i->first);
hash_combine(hash, i->second);
}
return hash;
}
};
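A self-contained sketch of how the pieces above fit together. The `Object` struct here is a guess at the question's type (only its `vec` member appears in the source), and `operator==` is added because `unordered_map` also needs key equality.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed shape of the question's Object; only `vec` is known from the source.
struct Object
{
    std::vector<std::pair<int, int>> vec;
    bool operator==(const Object& other) const { return vec == other.vec; }
};

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct ObjectHasher
{
    std::size_t operator()(const Object& k) const
    {
        std::size_t hash = 0;
        for (const auto& p : k.vec)
        {
            hash_combine(hash, p.first);
            hash_combine(hash, p.second);
        }
        return hash;
    }
};

// Equal keys hash equally, so a lookup with a freshly built key succeeds.
void demo_object_map()
{
    std::unordered_map<Object, std::string, ObjectHasher> names;
    names[Object{{{97, 98}, {105, 107}}}] = "first";
    assert(names.at(Object{{{97, 98}, {105, 107}}}) == "first");
}
```

The hasher is passed as the third template argument of `unordered_map`, so nothing has to be added to namespace `std`.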

This function is probably very slow compared to other hash functions since it uses dynamic memory allocation. Also, std::hash<std::string> is not a very good hash function here since it is very general. It's probably better to XOR all ints and use std::hash<int>.
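One caveat about plain XOR, sketched below with a hypothetical `xor_hash` helper (not from the thread): XOR is commutative and self-cancelling, so swapped or repeated values collide regardless of what `std::hash<int>` returns.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using PairVec = std::vector<std::pair<int, int>>;

// Hypothetical helper: XOR the std::hash<int> of every integer in the vector.
std::size_t xor_hash(const PairVec& vec)
{
    std::hash<int> h;
    std::size_t result = 0;
    for (const auto& p : vec)
        result ^= h(p.first) ^ h(p.second);
    return result;
}

void demo_xor_collisions()
{
    // Order is ignored: (1,2) and (2,1) always collide.
    assert(xor_hash({{1, 2}}) == xor_hash({{2, 1}}));
    // Equal values cancel out: every (n,n) pair hashes to 0.
    assert(xor_hash({{5, 5}}) == 0);
    assert(xor_hash({{9, 9}}) == 0);
}
```

Whether these collisions matter depends on the data; if pairs like (n,n) or mirrored pairs are common, an order-sensitive combiner is probably the safer choice.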

This is a perfectly valid solution. All a hash function needs is a sequence of bytes and by concatenating your elements together as a string you are providing a unique byte representation of the map.
Of course this could become unruly if your map contains a large number of items.
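One caveat on the "unique byte representation" claim, which also bears on question (1): each integer is pushed as a single char, so values are narrowed. The sketch below assumes a mainstream compiler where that conversion wraps modulo 256 (guaranteed since C++20, implementation-defined before). The helper name `key_string` is made up for illustration and mirrors the question's string-building loop.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper mirroring the question's string-building loop.
std::string key_string(const std::vector<std::pair<int, int>>& vec)
{
    std::string s;
    for (const auto& p : vec)
    {
        s.push_back(static_cast<char>(97 + p.first));   // int narrowed to one char
        s.push_back('-');
        s.push_back(static_cast<char>(97 + p.second));  // int narrowed to one char
        s.push_back('+');
    }
    return s;
}

// Two different keys produce the same string, hence the same hash.
void demo_wraparound()
{
    assert(key_string({{0, 0}}) == key_string({{256, 256}}));
}
```

The hash is still valid (equal keys hash equally), but integers that differ by a multiple of 256 always collide, so the string is not a unique representation for large values.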

Related

Create custom Hash Function

I tried to implement an unordered map for a class called Pair that stores an integer and a bitset. Then I found out that there isn't a hash function for this class.
Now I want to create my own hash function. But instead of using XOR or comparable functions, I want a hash function along the following lines:
the bitsets in my class obviously have a fixed size, so I want to do the following:
example: for an instance of Pair with the bitset<6> = 101101 and the integer 6:
create a string = "1011016"
and now use the default hashfunction on this string
because the bitsets have fixed size, each key would be unique
how could I implement this approach?
thank you in advance
To expand on a comment, as requested:
Converting to string and then hashing that string would be somewhat slow. At least slower than it needs to be. A faster approach would be to combine the bit patterns, e.g. like this:
struct Pair
{
std::bitset<6> bits;
int intval;
};
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
return rtrn;
}
};
This works on two assumptions:
The upper bits of the integer are generally not interesting
The size of the bitset is always small compared to size_t
I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.
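As a worked instance of the shift-and-or combination, here is a sketch using the question's example values (bitset `101101`, integer 6). It uses a standalone functor named `PairShiftHash` (a name invented here) instead of a `std::hash` specialization, so it can be read in isolation.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

struct Pair
{
    std::bitset<6> bits;
    int intval;
};

// Invented name; same arithmetic as the answer's operator().
struct PairShiftHash
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::size_t rtrn = static_cast<std::size_t>(pair.intval);
        // Make room for the 6 bitset bits, then place them in the low bits.
        return (rtrn << pair.bits.size()) | pair.bits.to_ulong();
    }
};

void demo_shift_hash()
{
    Pair p{std::bitset<6>("101101"), 6};
    // 6 << 6 is 384, and 0b101101 is 45, so the result is 384 | 45 = 429.
    assert(PairShiftHash{}(p) == 429u);
}
```

Because the two fields occupy disjoint bit ranges, distinct small keys get distinct hash values, which is the property the question was after with the string approach.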
Possible improvements
We can preserve the upper bits by rotating instead of shifting.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::size_t rtrn = static_cast<std::size_t>(pair.intval);
std::size_t intdigits = std::numeric_limits<decltype(pair.intval)>::digits;
std::size_t bitdigits = pair.bits.size();
// can be simplified to std::rotl(rtrn, bitdigits) in C++20
rtrn = (rtrn << bitdigits) | (rtrn >> (intdigits - bitdigits));
rtrn ^= pair.bits.to_ulong();
return rtrn;
}
};
Nothing will change for small integers (except some bitflips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...
If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't just xor them to avoid hash collisions on small integers.
template<>
struct std::hash<Pair>
{
std::size_t operator()(const Pair& pair) const noexcept
{
std::hash<decltype(pair.intval)> ihash;
std::hash<decltype(pair.bits)> bhash;
return ihash(pair.intval) * 31 + bhash(pair.bits);
}
};
I picked the simple polynomial hash approach common in Java. I believe GCC uses the same one internally for string hashing. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen because it is a prime number one less than a power of two, so x * 31 can be computed quickly as (x << 5) - x.

Proper way to use large amount of known constant variables

The program receives a vector that represents a character.
It then compares the received vector with all the known vectors that represents characters.
I'm not sure how should I use the known vectors.
A few options I thought of:
1) Using global variables:
vector<int> charA{1,2,3,4,5};
vector<int> charB{5,3,7,1};
...
vector<int> charZ{3,2,5,6,8,9,0};
char getLetter(const vector<int> &input){
if(compareVec(input,charA)) return 'A';
if(compareVec(input,charB)) return 'B';
....
if(compareVec(input,charZ)) return 'Z';
}
2) Declaring all variables in function:
char getLetter(const vector<int> &input){
vector<int> charA{1,2,3,4,5};
vector<int> charB{5,3,7,1};
...
vector<int> charZ{3,2,5,6,8,9,0};
if(compareVec(input,charA)) return 'A';
if(compareVec(input,charB)) return 'B';
....
if(compareVec(input,charZ)) return 'Z';
}
3) Passing the variables
char getLetter(const vector<int> &input, vector<int> charA,
vector<int> charB... , vector<int> charZ){
if(compareVec(input,charA)) return 'A';
if(compareVec(input,charB)) return 'B';
....
if(compareVec(input,charZ)) return 'Z';
}
This sounds like an application for a perfect hash generator such as GNU gperf.
To quote the documentation
gperf is a perfect hash function generator written in C++. It
transforms an n element user-specified keyword set W into a perfect
hash function F. F uniquely maps keywords in W onto the range 0..k,
where k >= n-1. If k = n-1 then F is a minimal perfect hash function.
gperf generates a 0..k element static lookup table and a pair of C
functions. These functions determine whether a given character string
s occurs in W, using at most one probe into the lookup table.
If this is not a suitable solution then I'd recommend using function statics. You want to avoid function locals as this will badly affect performance, and globals will pollute your namespace.
So something like
char getLetter(const vector<int> &input){
static vector<int> charA{1,2,3,4,5};
static vector<int> charB{5,3,7,1};
Given your snippet, I'd go for:
char getLetter(const vector<int> &input)
{
struct
{
char result;
std::vector<int> data;
} const data[]=
{
{ 'A', {1,2,3,4,5}, },
{ 'B', {5,3,7,1}, },
...
};
for(auto const & probe : data)
{
if (compareVec(input, probe.data))
return probe.result;
}
// input does not match any of the given values
throw "That's not the input I'm looking for!";
}
For 40 such pairs, if this is not called in a tight inner loop, the linear search is good enough.
Alternatives:
use a std::map<std::vector<int>, char> to map valid values to results, turn compareVec into a functor suitable as key comparison for the map, and initialize it the same way.
as above, but use a std::unordered_map.
use gperf, as suggested by @PaulFloyd above
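The `std::map` alternative can be sketched as follows, assuming exact equality is what `compareVec` checks (the source does not show it). Under that assumption no custom comparator is needed, because `std::vector` already provides a lexicographic `operator<`. Returning `'\0'` for unknown input is a choice made for this sketch only.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Sketch: look the input up in a table that is built once, on the first call.
// Only the three sample vectors from the question are listed.
char getLetter(const std::vector<int>& input)
{
    static const std::map<std::vector<int>, char> table = {
        {{1, 2, 3, 4, 5}, 'A'},
        {{5, 3, 7, 1}, 'B'},
        {{3, 2, 5, 6, 8, 9, 0}, 'Z'},
    };
    const auto it = table.find(input);
    return it != table.end() ? it->second : '\0';  // '\0' means "no match"
}

void demo_get_letter()
{
    assert(getLetter({5, 3, 7, 1}) == 'B');
    assert(getLetter({1, 2}) == '\0');
}
```

The function-local `static const` table keeps the data out of the global namespace and avoids rebuilding it on every call.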
I would start by suggesting that you hash or represent the numbers in their binary collection so that you are not comparing vectors each time as that would prove very costly. That said, your question is about how to make a dictionary, so whether you improve your keys as I suggested or not, I'd prefer the use of a map:
map<vector<int>, char, function<bool(const vector<int>&, const vector<int>&)>> dictionary([](const auto& lhs, const auto& rhs){
const auto its = mismatch(cbegin(lhs), cend(lhs), cbegin(rhs), cend(rhs));
return its.second != cend(rhs) && (its.first == cend(lhs) || *its.first < *its.second);
});
If possible, dictionary should be constructed as a constant from an initializer_list containing all mappings, plus the comparator. If mappings must be looked up before all letters have been added, then you obviously can't make it constant. Either way, this map should be a private member of the class responsible for translating strings, and adding and mapping should be public functions of that class.

Hashing a string and an int together?

I have to write a hash function, so that I can place an std::pair<int,std::string> in an unordered_set.
Regarding the input:
The strings that will be hashed are very small (1-3 letters in length).
Likewise, the integers will be unsigned numbers which are small (much smaller than the limit of unsigned int).
Does it make sense to use the hash of the string (as a number), and just use Cantor's enumeration of pairs to generate a "new" hash?
Since the "built-in" hash function for std::string should be a decent hash function...
struct intStringHash{
public:
inline std::size_t operator()(const std::pair<int,std::string>&c)const{
int x = c.first;
std::string s = c.second;
std::hash<std::string> stringHash;
int y = stringHash(s);
return ((x+y)*(x+y+1)/2 + y); // Cantor's enumeration of pairs
}
};
boost::hash_combine is an easy way to create hashes: even if you can't use Boost, the function is quite simple, so it's trivial to copy the implementation.
Usage sample:
struct intStringHash
{
public:
std::size_t operator()(const std::pair<int, std::string>& c) const
{
std::size_t hash = 0;
hash_combine(hash, c.first);
hash_combine(hash, c.second);
return hash;
}
};
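Put together with a local copy of `hash_combine` (the Boost-style function quoted elsewhere on this page), a sketch of the complete set might look like this. The demo function and its values are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Local copy of the Boost-style combiner, so Boost itself is not required.
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

struct intStringHash
{
    std::size_t operator()(const std::pair<int, std::string>& c) const
    {
        std::size_t hash = 0;
        hash_combine(hash, c.first);
        hash_combine(hash, c.second);
        return hash;
    }
};

void demo_pair_set()
{
    std::unordered_set<std::pair<int, std::string>, intStringHash> seen;
    seen.insert({1, "ab"});
    seen.insert({1, "ab"});  // duplicate, ignored
    seen.insert({2, "ab"});
    assert(seen.size() == 2);
    assert(seen.count({1, "ab"}) == 1);
}
```

`std::pair` already has `operator==`, so only the hasher needs to be supplied as the second template argument.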
Yes, you would generate hashes for each type that you have a hash function for.
It's normal to exclusive-or hashes to combine them:
int hash1;
int hash2;
int combined = hash1 ^ hash2;

Fast hash function for `std::vector`

I implemented this solution for getting a hash value from vector<T>:
namespace std
{
template<typename T>
struct hash<vector<T>>
{
typedef vector<T> argument_type;
typedef std::size_t result_type;
result_type operator()(argument_type const& in) const
{
size_t size = in.size();
size_t seed = 0;
for (size_t i = 0; i < size; i++)
//Combine the hash of the current vector with the hashes of the previous ones
hash_combine(seed, in[i]);
return seed;
}
};
}
//using boost::hash_combine
template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
But this solution doesn't scale at all: with a vector<double> of 10 million elements it takes more than 2.5 s (according to VS).
Is there a fast hash function for this scenario?
Note that creating a hash value from the vector's address is not a feasible solution, since the related unordered_map will be used in different runs, and in addition two vector<double> with the same content but different addresses would be mapped differently (undesired behavior for this application).
NOTE: As per the comments, you get a 25-50x speed-up by compiling with optimizations. Do that, first. Then, if it's still too slow, see below.
I don't think there's much you can do. You have to touch all the elements, and that combination function is about as fast as it gets.
One option may be to parallelize the hash function. If you have 8 cores, you can run 8 threads to each hash 1/8th of the vector, then combine the 8 resulting values at the end. The synchronization overhead may be worth it for very large vectors.
The approach that MSVC's old hashmap used was to sample less often.
This means that isolated changes won't show up in your hash, but the thing you are trying to avoid is reading and processing the entire 80 MB of data in order to hash your vector. Skipping some elements is pretty much unavoidable.
The second thing you should do is not specialize std::hash on all vectors, this may make your program ill-formed (as suggested by a defect resolution whose status I do not recall), and at the least is a bad plan (as the std is sure to permit itself to add hash combining and hashing of vectors).
When I write a custom hash, I usually use ADL (Koenig Lookup) to make it easy to extend.
namespace my_utils {
namespace hash_impl {
namespace details {
namespace adl {
template<class T>
std::size_t hash(T const& t) {
return std::hash<T>{}(t);
}
}
template<class T>
std::size_t hasher(T const& t) {
using adl::hash;
return hash(t);
}
}
struct hash_tag {};
template<class T>
std::size_t hash(hash_tag, T const& t) {
return details::hasher(t);
}
template<class T>
std::size_t hash_combine(hash_tag, std::size_t seed, T const& t) {
seed ^= hash(hash_tag{}, t) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
return seed;
}
template<class Container>
std::size_t fash_hash_random_container(hash_tag, Container const& c ) {
std::size_t size = c.size();
std::size_t stride = 1 + size/10;
std::size_t r = hash(hash_tag{}, size);
for(std::size_t i = 0; i < size; i += stride) {
r = hash_combine(hash_tag{}, r, c.data()[i]);
}
return r;
}
// std specializations go here:
template<class T, class A>
std::size_t hash(hash_tag, std::vector<T,A> const& v) {
return fash_hash_random_container(hash_tag{}, v);
}
template<class T, std::size_t N>
std::size_t hash(hash_tag, std::array<T,N> const& a) {
return fash_hash_random_container(hash_tag{}, a);
}
// etc
}
struct my_hasher {
template<class T>
std::size_t operator()(T const& t)const {
return hash_impl::hash(hash_impl::hash_tag{}, t);
}
};
}
Now my_hasher is a universal hasher. To hash things, it uses either the hashes declared in my_utils::hash_impl (for std types) or free functions called hash that hash a given type. Failing that, it tries to use std::hash<T>. If that fails, you get a compile-time error.
Writing a free hash function in the namespace of the type you want to hash tends to be less annoying than having to go off and open std and specialize std::hash in my experience.
It understands vectors and arrays, recursively. Doing tuples and pairs requires a bit more work.
It samples each vector or array at about 10 positions.
(Note: hash_tag is both a bit of a joke, and a way to force ADL and prevent having to forward-declare the hash specializations in the hash_impl namespace, because that requirement sucks.)
The price of sampling is that you could get more collisions.
Another approach if you have a huge amount of data is to hash them once, and keep track of when they are modified. To do this approach, use a copy-on-write monad interface for your type that keeps track of if the hash is up to date. Now a vector gets hashed once; if you modify it, the hash is discarded.
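A much simpler variant of this idea than a full copy-on-write interface is a wrapper that caches the hash and invalidates it on every write. The class name `CachedHashVector` and its methods are invented for this sketch, which assumes single-threaded use.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Invented wrapper: all writes go through set(), which drops the cached hash.
class CachedHashVector
{
public:
    explicit CachedHashVector(std::vector<double> data) : data_(std::move(data)) {}

    double get(std::size_t i) const { return data_[i]; }

    void set(std::size_t i, double value)
    {
        data_[i] = value;
        valid_ = false;  // cached hash no longer matches the contents
    }

    bool cached() const { return valid_; }

    // Recomputes only when a write has happened since the last call.
    std::size_t hash() const
    {
        if (!valid_)
        {
            std::size_t seed = 0;
            for (double d : data_)
                seed ^= std::hash<double>()(d) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
            cache_ = seed;
            valid_ = true;
        }
        return cache_;
    }

private:
    std::vector<double> data_;
    mutable std::size_t cache_ = 0;
    mutable bool valid_ = false;
};

void demo_cached_hash()
{
    CachedHashVector v(std::vector<double>{1.0, 2.0, 3.0});
    const std::size_t first = v.hash();
    assert(v.cached() && v.hash() == first);  // second call reuses the cache
    v.set(0, 4.0);
    assert(!v.cached());                      // the write invalidated the cache
}
```

The full scan still happens once per modification burst, so this only pays off when the same vector is hashed many times between writes.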
One can go further and have a random-access hash (where it is easy to predict what happens when you edit a given value hash-wise), and mediate all access to the vector. That is tricky.
You could also multi-thread the hashing, but I would guess that your code is probably memory-bandwidth bound, and multi-threading won't help much there. Worth trying.
You could use a fancier structure than a flat vector (something tree like), where changes to the values bubble-up in a hash-like way to a root hash value. This would add a lg(n) overhead to all element access. Again, you'd have to wrap the raw data up in controls that keep the hashing up to date (or, keep track of what ranges are dirty and needs to be updated).
Finally, because you are working with 10 million elements at a time, consider moving over to a strong large-scale storage solution, like databases or what have you. Using 80 megabyte keys in a map seems strange to me.

How to write qHash for a QSet<SomeClass*> container?

I need to implement a set of sets in my application.
Using QSet with a custom class requires providing a qHash() function and an operator==.
The code is as follows:
class Custom{
public:
int x;
int y;
//some other members, irrelevant here
};
inline uint qHash(Custom* c){
return (qHash(c->x) ^ qHash(c->y));
}
bool operator==(const Custom &c1, const Custom &c2){
return ((c1.x==c2.x) && (c1.y == c2.y));
}
//now I can use: QSet<Custom*>
How can I implement qHash(QSet<Custom*>), to be able to use QSet< QSet<SomeClass*> >?
Edit:
Additional question:
In my application the "set of sets" can contain up to 15000 sets. Each subset up to 25 Custom class pointers. How to guarantee that qHash(QSet<Custom*>) will be unique enough?
You cannot implement qHash with boost::hash_range/boost::hash_combine (which is what pmr's answer does, effectively). QSet is the Qt equivalent of std::unordered_set and, as the STL name suggests, these containers are unordered, whereas the Boost documentation states that hash_combine is order-dependent, i.e. it will hash permutations to different hash values.
This is a problem because if you naively hash-combine the elements in stored order, you cannot guarantee that two sets that compare equal also hash to the same value, which is one of the requirements of a hash function:
For all x, y: x == y => qHash(x) == qHash(y)
So, if your hash-combining function needs to produce the same output for any permutation of the input values, it needs to be commutative. Fortunately, both (unsigned) addition and the xor operation just fit the bill:
template <typename T>
inline uint qHash(const QSet<T> &set, uint seed=0) {
return std::accumulate(set.begin(), set.end(), seed,
[](uint seed, const T&value) {
return seed + qHash(value); // or ^
});
}
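The same idea without Qt, as a sketch: summing element hashes with unsigned arithmetic gives the same result for any iteration order. `unordered_hash` is a name invented here, and `std::hash` stands in for `qHash`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Invented helper: a commutative combination of element hashes.
template <typename Range>
std::size_t unordered_hash(const Range& range, std::size_t seed = 0)
{
    using T = typename Range::value_type;
    return std::accumulate(range.begin(), range.end(), seed,
                           [](std::size_t acc, const T& value) {
                               return acc + std::hash<T>()(value);  // or ^
                           });
}

void demo_order_independent()
{
    // Vectors are used here only to force two different iteration orders.
    const std::vector<int> a{1, 2, 3};
    const std::vector<int> b{3, 1, 2};
    assert(unordered_hash(a) == unordered_hash(b));
}
```

Addition is often preferred over XOR here because XOR cancels repeated element hashes, although a set cannot contain duplicates in the first place.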
A common way to hash containers is to combine the hashes of all elements. Boost provides hash_combine and hash_range for this purpose. This should give you an idea how to implement this for the results of your qHash.
So, given your qHash for Custom:
uint qHash(const QSet<Custom*>& c) {
uint seed = 0;
for(auto x : c) {
seed ^= qHash(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
return seed;
}