I have to write a hash function, so that I can place an std::pair<int,std::string> in an unordered_set.
Regarding the input:
The strings that will be hashed are very small (1-3 letters in length).
Likewise, the integers will be small unsigned numbers (much smaller than the limit of unsigned int).
Does it make sense to use the hash of the string (as a number), and just use Cantor's enumeration of pairs to generate a "new" hash?
Since the "built-in" hash function for std::string should be a decent hash function...
struct intStringHash {
public:
    inline std::size_t operator()(const std::pair<int, std::string>& c) const {
        const std::size_t x = static_cast<std::size_t>(c.first);
        // std::size_t, not int, so the string hash is not truncated
        const std::size_t y = std::hash<std::string>{}(c.second);
        return (x + y) * (x + y + 1) / 2 + y; // Cantor's enumeration of pairs
    }
};
boost::hash_combine is an easy way to create hashes: even if you can't use Boost, the function is quite simple, and so it's trivial to copy the implementation.
Usage sample:
struct intStringHash
{
public:
    std::size_t operator()(const std::pair<int, std::string>& c) const
    {
        std::size_t hash = 0;
        hash_combine(hash, c.first);
        hash_combine(hash, c.second);
        return hash;
    }
};
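For reference, the implementation referred to above is only a few lines; this is the classic snippet that gets copied out of Boost (the same function appears again in an answer further below):

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    // Classic boost::hash_combine: mixes the hash of v into the running seed.
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}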
Yes, you would generate hashes for each type that you have a hash function for.
It's common to exclusive-or (XOR) the hashes to combine them:
std::size_t hash1 = /* hash of the first member */;
std::size_t hash2 = /* hash of the second member */;
std::size_t combined = hash1 ^ hash2;
I tried to implement an unordered map for a class called Pair that stores an integer and a bitset. Then I found out that there isn't a hash function for this class.
Now I want to create my own hash function. But instead of using XOR or comparable functions, I want a hash function like the following approach:
The bitsets in my class have a fixed size, so I want to do the following:
Example: for an instance of Pair with the bitset<6> = 101101 and the integer 6:
create the string "1011016"
and now use the default hash function on this string.
Because the bitsets have a fixed size, each key would be unique.
How could I implement this approach?
Thank you in advance.
To expand on a comment, as requested:
Converting to a string and then hashing that string would be somewhat slow, or at least slower than it needs to be. A faster approach is to combine the bit patterns, e.g. like this:
struct Pair
{
    std::bitset<6> bits;
    int intval;
};

template<>
struct std::hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::size_t rtrn = static_cast<std::size_t>(pair.intval);
        rtrn = (rtrn << pair.bits.size()) | pair.bits.to_ulong();
        return rtrn;
    }
};
This works on two assumptions:
The upper bits of the integer are generally not interesting
The size of the bitset is always small compared to size_t
I think it is a suitable hash function for use in unordered_map. One may argue that it has poor mixing and a very good hash should change many bits if only a few bits in its input change. But that is not required here. unordered_map is generally designed to work with cheap hash functions. For example GCC's hash for builtin types and pointers is just a static- or reinterpret-cast.
Possible improvements
We can preserve the upper bits by rotating instead of shifting.
template<>
struct std::hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::size_t rtrn = static_cast<std::size_t>(pair.intval);
        std::size_t intdigits = std::numeric_limits<decltype(pair.intval)>::digits;
        std::size_t bitdigits = pair.bits.size();
        // can be simplified to std::rotl(rtrn, bitdigits) in C++20
        rtrn = (rtrn << bitdigits) | (rtrn >> (intdigits - bitdigits));
        rtrn ^= pair.bits.to_ulong();
        return rtrn;
    }
};
Nothing will change for small integers (except some bit flips for small negative ints). But for large integers we still use the whole range of inputs, which might be of interest for pathological cases such as the integer series 2^30, 2^30 + 2^29, 2^30 + 2^28, ...
If the size of the bitset may increase, stop doing fancy stuff and just combine the hashes. I wouldn't simply XOR them, in order to avoid hash collisions on small integers.
template<>
struct std::hash<Pair>
{
    std::size_t operator()(const Pair& pair) const noexcept
    {
        std::hash<decltype(pair.intval)> ihash;
        std::hash<decltype(pair.bits)> bhash;
        return ihash(pair.intval) * 31 + bhash(pair.bits);
    }
};
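With any of these std::hash<Pair> specialisations in place, std::unordered_map picks the hash up automatically. Note that unordered containers also need equality for the key, which the Pair shown above does not define; a minimal sketch (the operator== here is an assumption, not part of the original answer):

#include <bitset>
#include <unordered_map>

// Assumed equality for Pair; unordered containers require it alongside the hash.
bool operator==(const Pair& a, const Pair& b)
{
    return a.intval == b.intval && a.bits == b.bits;
}

std::unordered_map<Pair, int> counts; // uses the std::hash<Pair> specialisation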
I picked the simple polynomial hash approach common in Java. I believe GCC uses the same one internally for string hashing. Someone else may expand on the topic or suggest a better one. 31 is commonly chosen because it is a prime one off a power of two, so multiplication by 31 can be computed quickly as (x << 5) - x.
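As a quick compile-time sanity check of that identity (an illustration, not part of the original answer):

// Verifies 31 * x == (x << 5) - x for an arbitrary unsigned value.
static_assert(31u * 12345u == ((12345u << 5) - 12345u),
              "31 * x equals (x << 5) - x");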
I need to map a std::vector<uint64_t> to a single uint64_t. Is it possible to do? I thought of using a hash function. Would that be a solution?
For example, this vector:
std::vector<uint64_t> v {
    16377,
    2631694347470643681,
    11730294873282192384
};
should be converted into one uint64_t.
If a hash function is not a good solution (e.g. a high percentage of collisions), is there an alternative way to do this mapping?
I need to hash a std::vector<uint64_t> to a single uint64_t. Is it possible to do?
Yes, variable length hash functions exist, and it's possible to implement them in C++.
The C++ standard library comes with a few hash functions, but unfortunately not for vector (other than the bool specialisation). We can reuse the hash function provided for string views, but this is a bit of a kludge:
const char* data = reinterpret_cast<const char*>(v.data());
std::size_t size = v.size() * sizeof(v[0]);
std::hash<std::string_view> hash;
std::cout << hash(std::string_view(data, size));
Note that using this is reasonable only if std::has_unique_object_representations_v<T> is true for the element type T of the vector. I think it's reasonable to assume that is the case for std::uint64_t.
A caveat when using standard library hash functions is that they don't have exact specification and as such you cannot rely on hashes being identical across separate systems. You should use another hash function if that is a concern.
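Wrapped into a functor so it can be used with an unordered container, this might look as follows (a sketch; the name VectorHash is illustrative):

#include <cstdint>
#include <functional>
#include <string_view>
#include <vector>

struct VectorHash
{
    std::size_t operator()(const std::vector<std::uint64_t>& v) const
    {
        // Reinterpret the vector's contiguous storage as bytes and hash those.
        const char* data = reinterpret_cast<const char*>(v.data());
        const std::size_t size = v.size() * sizeof(v[0]);
        return std::hash<std::string_view>{}(std::string_view(data, size));
    }
};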
You can create a std::map<std::vector<uint64_t>, uint64_t>, create a compare function for your vectors, and just keep adding vectors to the map while incrementing a counter.
That counter will be your hash value.
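A minimal sketch of that counter scheme (the function name vector_id is illustrative); note that std::map already compares vectors lexicographically, so a custom compare function is optional:

#include <cstdint>
#include <map>
#include <vector>

std::uint64_t vector_id(const std::vector<std::uint64_t>& v)
{
    // Assign the next counter value the first time a vector is seen;
    // return the previously assigned value on subsequent lookups.
    static std::map<std::vector<std::uint64_t>, std::uint64_t> ids;
    static std::uint64_t counter = 0;
    auto [it, inserted] = ids.try_emplace(v, counter);
    if (inserted) ++counter;
    return it->second;
}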
The comment above, expressed in code:
#include <array>
#include <algorithm>
#include <cstdint>
#include <vector>
#include <iostream>

static std::array<size_t, 5> primes = { 3, 5, 7, 11, 13 };

static std::uint64_t hash(const std::vector<std::uint64_t>& v)
{
    // v[0] contributes with weight 1; only the first primes.size()
    // elements contribute to the hash at all.
    std::uint64_t hash = v[0];
    for (size_t n = 1; n < std::min(primes.size(), v.size()); ++n)
        hash += primes[n] * v[n];
    return hash;
}

int main()
{
    std::vector<uint64_t> v{ 16377, 2631694347470643681, 11730294873282192384 };
    std::cout << hash(v);
    return 0;
}
Let's say I have a struct/class with an arbitrary number of attributes that I want to use as a key to a std::unordered_map, e.g.:
struct Foo {
    int i;
    double d;
    char c;
    bool b;
};
I know that I have to define a hasher-functor for it, e.g.:
struct FooHasher {
    std::size_t operator()(Foo const &foo) const;
};
And then define my std::unordered_map as:
std::unordered_map<Foo, MyValueType, FooHasher> myMap;
What bothers me, though, is how to define the call operator for FooHasher. One way to do it, which I also tend to prefer, is with std::hash. However, there are numerous variations, e.g.:
std::size_t operator()(Foo const &foo) const {
    return std::hash<int>()(foo.i) ^
           std::hash<double>()(foo.d) ^
           std::hash<char>()(foo.c) ^
           std::hash<bool>()(foo.b);
}
I've also seen the following scheme:
std::size_t operator()(Foo const &foo) const {
    return std::hash<int>()(foo.i) ^
           (std::hash<double>()(foo.d) << 1) ^
           (std::hash<char>()(foo.c) >> 1) ^
           (std::hash<bool>()(foo.b) << 1);
}
I've also seen some people adding the golden ratio:
std::size_t operator()(Foo const &foo) const {
    return (std::hash<int>()(foo.i) + 0x9e3779b9) ^
           (std::hash<double>()(foo.d) + 0x9e3779b9) ^
           (std::hash<char>()(foo.c) + 0x9e3779b9) ^
           (std::hash<bool>()(foo.b) + 0x9e3779b9);
}
Questions:
What are they trying to achieve by adding the golden ratio or shifting bits in the result of std::hash?
Is there an "official" scheme to std::hash an object with an arbitrary number of attributes of fundamental type?
A simple XOR is symmetric and behaves badly when fed the "same" value multiple times (hash(a) ^ hash(a) is zero).
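A small example of why this matters (an illustration, not from the original answer):

#include <cstddef>
#include <functional>
#include <iostream>

int main()
{
    std::size_t ha = std::hash<int>{}(42);
    std::size_t hb = std::hash<int>{}(7);
    std::cout << (ha ^ hb) << '\n'; // identical for (42, 7) and (7, 42): XOR is symmetric
    std::cout << (ha ^ ha) << '\n'; // always 0: equal members cancel out
}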
This is the question of combining hashes. boost has a hash_combine that is pretty decent. Write a hash combiner, and use it.
There is no "official scheme" to solve this problem.
Myself, I typically write a super-hasher that can take anything and hash it. It hash combines tuples and pairs and collections automatically, where it first hashes the count of elements in the collection, then the elements.
It finds hash(t) via ADL first, and if that fails checks if it has a manually written hash in a helper namespace (used for std containers and types), and if that fails does a std::hash<T>{}(t).
Then my hash support for Foo looks like:
struct Foo {
    int i;
    double d;
    char c;
    bool b;

    friend auto mytie(Foo const& f) {
        return std::tie(f.i, f.d, f.c, f.b);
    }
    friend std::size_t hash(Foo const& f) {
        return hasher::hash(mytie(f));
    }
};
where I use mytie to move Foo into a tuple, then use the std::tuple overload of hasher::hash to get the result.
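hasher::hash is the answerer's own framework and is not shown; a minimal C++17 sketch of just the tuple part, built on a Boost-style combiner, might look like this (all names here are illustrative):

#include <cstddef>
#include <functional>
#include <tuple>
#include <type_traits>

namespace hasher {
    // Boost-style combiner: mixes one element hash into a running seed.
    inline void combine(std::size_t& seed, std::size_t h)
    {
        seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }

    // Hash a tuple by combining the hashes of its elements in order.
    template <class... Ts>
    std::size_t hash(const std::tuple<Ts...>& t)
    {
        std::size_t seed = 0;
        std::apply([&seed](const auto&... vs) {
            (combine(seed, std::hash<std::decay_t<decltype(vs)>>{}(vs)), ...);
        }, t);
        return seed;
    }
}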
I like the idea of hashes of structurally similar types having the same hash. This lets me act as if my hash is transparent in some cases.
Note that hashing unordered meows in this manner is a bad idea, as an asymmetric hash of an unordered meow may generate spurious misses.
(Meow is the generic name for map and set. Do not ask me why: Ask the STL.)
The standard hash framework is lacking in respect of combining hashes. Combining hashes using xor is sub-optimal.
A better solution is proposed in N3980 "Types Don't Know #".
The main idea is using the same hash function and its state to hash more than one value/element/member.
With that framework, your hash function would look like this:
template <class HashAlgorithm>
void hash_append(HashAlgorithm& h, Foo const& x) noexcept
{
    using std::hash_append;
    hash_append(h, x.i);
    hash_append(h, x.d);
    hash_append(h, x.c);
    hash_append(h, x.b);
}
And the container:
std::unordered_map<Foo, MyValueType, std::uhash<>> myMap;
I'm trying to implement an unordered_map for a std::vector<std::pair<int,int>>. Since there's no default hash function for it, I tried to write one of my own:
struct ObjectHasher
{
    std::size_t operator()(const Object& k) const
    {
        std::string h_string("");
        for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
        {
            h_string.push_back(97 + i->first);  // 'a' + first
            h_string.push_back(45);             // '-'
            h_string.push_back(97 + i->second); // 'a' + second
            h_string.push_back(43);             // '+'
        }
        return std::hash<std::string>()(h_string);
    }
};
The main idea is to change the list of integers, say ((97, 98), (105, 107)), into a formatted string like "a-b+i-k" and to compute its hash with std::hash<std::string>(). I chose the values 97, 45 and 43 only so that the hash string can be easily displayed in a terminal during my tests.
I know this kind of function might be a very naive idea, since a good hash function should be fast and robust against collisions. Well, if the integers given to push_back() are greater than 255, I don't know what might happen... So, what do you think of the following questions:
(1) Is my function OK for big integers?
(2) Is my function OK for all environments/platforms?
(3) Is my function too slow to be a hash function?
(4) ... do you have anything better?
All you need is a function to "hash in" an integer. You can steal such a function from boost:
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
Now your function is trivial:
struct ObjectHasher
{
    std::size_t operator()(const Object& k) const
    {
        std::size_t hash = 0;
        for (auto i = k.vec.begin(); i != k.vec.end(); ++i)
        {
            hash_combine(hash, i->first);
            hash_combine(hash, i->second);
        }
        return hash;
    }
};
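Usage would then look like this (assuming Object also defines operator==, which unordered containers require; the mapped type int is just an example):

std::unordered_map<Object, int, ObjectHasher> myMap;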
This function is probably very slow compared to other hash functions, since it uses dynamic memory allocation. Also, std::hash<std::string> is not a very good hash function, since it is very general. It's probably better to XOR all the ints and use std::hash<int>.
This is a perfectly valid solution. All a hash function needs is a sequence of bytes, and by concatenating your elements together as a string you are providing a unique byte representation of the object.
Of course, this could become unwieldy if your container holds a large number of items.
I am trying to use std::string as a key in an stxxl::map.
Insertion was fine for a small number of strings, about 10-100.
But while trying to insert a large number of strings, about 100000, I am getting a segmentation fault.
The code is as follows:
struct CompareGreaterString {
    bool operator () (const std::string& a, const std::string& b) const {
        return a > b;
    }
    static std::string max_value() {
        return "";
    }
};

// template parameter <KeyType, DataType, CompareType, RawNodeSize, RawLeafSize, PDAllocStrategy (optional)>
typedef stxxl::map<std::string, unsigned int, CompareGreaterString, DATA_NODE_BLOCK_SIZE, DATA_LEAF_BLOCK_SIZE> name_map;
name_map strMap((name_map::node_block_type::raw_size)*3, (name_map::leaf_block_type::raw_size)*3);

for (unsigned int i = 0; i < 1000000; i++) { // inserting 1 million strings
    std::stringstream strStream;
    strStream << i;
    Console::println("Inserting: " + strStream.str());
    strMap[strStream.str()] = i;
}
I am unable to identify why I cannot insert a larger number of strings; I get the segmentation fault exactly while inserting "1377". Yet I am able to add any number of integers as keys, so I suspect the variable size of the strings might be causing this trouble.
I am also unsure what to return from max_value for a string; I simply returned a blank string.
According to the documentation:
CompareType must also provide a static max_value method, that returns a value of type KeyType that is larger than any key stored in map
Because the empty string happens to compare as smaller than any other string, it breaks this precondition and may thus cause unspecified behaviour.
Here's a max_value that should work. MAX_KEY_LEN is just an integer that is larger than or equal to the length of the longest possible string key the map can have.
struct CompareGreaterString {
    // ...
    static std::string max_value() {
        return std::string(MAX_KEY_LEN, std::numeric_limits<unsigned char>::max());
    }
};
I have finally found the solution to my problem, with great help from Timo Bingmann, user2079303 and Martin Ba. Thank you.
I would like to share it with you.
Firstly, stxxl supports only POD types. That means it stores only fixed-size structures, so std::string cannot be a key. stxxl::map worked for about 100-1000 strings because they were held in physical memory itself; when more strings are inserted, it has to write to disk, which internally causes problems.
Hence we need to use a fixed-size string based on char[], as follows:
static const int MAX_KEY_LEN = 16;

class FixedString {
public:
    char charStr[MAX_KEY_LEN];

    bool operator< (const FixedString& fixedString) const {
        return std::lexicographical_compare(charStr, charStr + MAX_KEY_LEN,
                                            fixedString.charStr, fixedString.charStr + MAX_KEY_LEN);
    }
    bool operator==(const FixedString& fixedString) const {
        return std::equal(charStr, charStr + MAX_KEY_LEN, fixedString.charStr);
    }
    bool operator!=(const FixedString& fixedString) const {
        return !std::equal(charStr, charStr + MAX_KEY_LEN, fixedString.charStr);
    }
};
struct comp_type : public std::less<FixedString> {
    static FixedString max_value()
    {
        FixedString s;
        std::fill(s.charStr, s.charStr + MAX_KEY_LEN, 0x7f);
        return s;
    }
};
Please note that all the operators (mainly <, ==, and !=) need to be overridden for all the stxxl::map functions to work.
Now we may define the map type fixed_name_map as follows:
typedef stxxl::map<FixedString, unsigned int, comp_type, DATA_NODE_BLOCK_SIZE, DATA_LEAF_BLOCK_SIZE> fixed_name_map;
fixed_name_map myFixedMap((fixed_name_map::node_block_type::raw_size)*5, (fixed_name_map::leaf_block_type::raw_size)*5);
Now the program compiles fine and accepts about 10^8 strings without any problem.
Also, we can use myFixedMap like std::map itself (for example: myFixedMap[fixedString] = 10).
If you are using C++11, then as an alternative to the FixedString class you could use std::array<char, MAX_KEY_LEN>. It is an STL layer on top of an ordinary fixed-size C array, implementing comparisons and iterators as you are used to from std::string, but it's a POD type, so STXXL should support it.
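A sketch of that std::array alternative (the helper to_key and the zero-padding convention are assumptions for illustration):

#include <algorithm>
#include <array>
#include <string>

static const int MAX_KEY_LEN = 16;
using FixedKey = std::array<char, MAX_KEY_LEN>;

// Copy a string into a zero-padded fixed-size key; the comparison
// operators (<, ==, !=) come with std::array for free.
FixedKey to_key(const std::string& s)
{
    FixedKey key{}; // value-initialised: all bytes zero
    std::copy_n(s.begin(),
                std::min<std::size_t>(s.size(), MAX_KEY_LEN),
                key.begin());
    return key;
}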
Alternatively, you can use serialization_sort in TPIE. It can sort elements of type std::pair<std::string, unsigned int> just fine, so if all you need is to insert everything in bulk and then access it in bulk, this will be sufficient for your case (and probably faster depending on the exact case).