I would like to have a user-defined key in a C++ std::map. The key is a binary representation of an integer set with maximum value 2^V so I can't represent all 2^V possible values. I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Now the problem is that to put this user-defined bitset as key in a std::map, I need to define a valid comparison between bitset values but if I have a maximum size of, say, V=1000, then I cannot get a number I can compare, let alone aggregating them all i.e., 2^1000 is not representable.
Therefore my question is, suppose I have two different sets (by setting the right bits in my bitset representation) and I cannot represent the final number because it will overflow:
id_1 = 2^0 + 2^1 + ... + 2^V
id_2 = 2^0 + 2^1 + ... + 2^V
Is there a suitable transformation that would lead to a value I can compare? I need to be able to say id_1 < id_2 so I would like to transform a sum of exponentials to a value that is representable BUT maintaining the invariant of the "less than". I was thinking along the lines of e.g. applying a log transformation in a clever way to preserve "less than".
Here is an example:
set_1 = {2,3,4}; set_2 = {8}
id(set_1) = 2^2 + 2^3 + 2^4 = 28; id(set_2) = 2^8 = 256
id(set_1) < id(set_2)
Perfect! How about a general set that can have {1,...,V}, and thus 2^V possible subsets?
I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Supposing that this array is accessed via a data member ra of the key type Key, and both arrays are of length N, then you want a comparator something like this:
bool operator<(const Key &lhs, const Key &rhs) {
return std::lexicographical_compare(lhs.ra, &lhs.ra[N], rhs.ra, &rhs.ra[N]);
}
This implicitly considers the array to be big-endian, i.e. the first uint64_t is the most significant. If you don't like that, that's fair enough, since you might already have in mind some relative significance for whatever order you've stored your V bits into your array. There's no great mystery to lexicographical_compare, so just look at an example implementation and modify as required.
This is called "lexicographical order". Other than the facts that I've used uint64_t instead of char and both arrays are the same length, it is how strings are compared[*] -- in fact the use of uint64_t isn't important, you could just use std::memcmp in your comparator instead of comparing 64-bit chunks. operator< for strings doesn't work by converting the whole string to an integer, and neither should your comparator.
[*] until you bring locale-specific collation rules into play.
Related
This is in C++. I need to keep a count for every pair of numbers. The two numbers are of type "int". I sort the two numbers, so (n1 n2) pair is the same as (n2 n1) pair. I'm using the std::unordered_map as the container.
I have been using the elegant pairing function by Matthew Szudzik, Wolfram Research, Inc.. In my implementation, the function gives me a unique number of type "long" (64 bits on my machine) for every pair of two numbers of type "int". I use this long as my key for the unordered_map (std::unordered_map). Is there a better way to keep count of such pairs? By better I mean, faster and if possible with lesser memory usage.
Also, I don't need all the bits of long. Even though you can assume that the two numbers can range up to max value for 32 bits, I anticipate the max possible value of my pairing function to require at most 36 bits. If nothing else, at least is there a way to have just 36 bits as key for the unordered_map? (some other data type)
I thought of using bitset, but I'm not exactly sure if the std::hash will generate a unique key for any given bitset of 36 bits, which can be used as key for unordered_map.
I would greatly appreciate any thoughts, suggestions etc.
First of all I think you came with wrong assumption. For std::unordered_map and std::unordered_set hash does not have to be unique (and it cannot be in principle for data types like std::string for example), there should be low probability that 2 different keys will generate the same hash value. But if there is a collision it would not be end of the world, just access would be slower. I would generate 32bit hash from 2 numbers and if you have an idea of typical values just test for probability of hash collision and choose hash function accordingly.
For that to work you should use pair of 32bit numbers as a key in std::unordered_map and provide a proper hash function. Calculating unique 64bit key and use it with hash map is controversal as hash_map will then calculate another hash of this key, so it is possible you are making it slower.
About 36 bits key, this is not a good idea unless you have a special CPU that handles 36 bit data. Your data either will be aligned on 64bit boundary and you would not have any benefits of saving memory, or you will get penalty of unaligned data access otherwise. In first case you would just have extra code to get 36 bits from 64bit data (if processor supports it). In the second your code will be slower than 32 bit hash even if there are some collisions.
If that hash_map is a bottleneck you may consider different implementation of hash map like goog-sparsehash.sourceforge.net
Just my two cents, the pairing functions that you've got in the article are WAY more complicated than you actually need. Mapping 2 32 bit UNISIGNED values to 64 uniquely is easy. The following does that, and even handles the non-pair states, without hitting the math peripheral too heavily (if at all).
uint64_t map(uint32_t a, uint32_t b)
{
uint64_t x = a+b;
uint64_t y = abs((int32_t)(a-b));
uint64_t ans = (x<<32)|(y);
return ans;
}
void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
uint64_t x = map>>32;
uint64_t y = map&0xFFFFFFFFL;
*a = (x+y)>>1;
*b = (x-*a);
}
Another alternative:
uint64_t map(uint32_t a, uint32_t b)
{
bool bb = a>b;
uint64_t x = ((uint64_t)a)<<(32*(bb));
uint64_t y = ((uint64_t)b)<<(32*!(bb));
uint64_t ans = x|y;
return ans;
}
void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
*a = map>>32;
*b = map&0xFFFFFFFF;
}
That works as a unique key. You can easily modify that to be a hash function provider for unordered map, though whether or not that will be faster than std::map is dependent on the number of values you've got.
NOTE: this will fail if the values a+b > 32 bits.
Is the floating point specialisation of std::hash (say, for doubles or floats) reliable regarding almost-equality? That is, if two values (such as (1./std::sqrt(5.)/std::sqrt(5.)) and .2) should compare equal but will not do so with the == operator, how will std::hash behave?
So, can I rely on a double as an std::unordered_map key to work as expected?
I have seen "Hashing floating point values" but that asks about boost; I'm asking about the C++11 guarantees.
std::hash has same guarantees for all types over which it can
be instantiated: if two objects are equal, their hash codes will
be equal. Otherwise, there's a very large probability that they
won't. So you can rely on a double as a key in an
unordered_map to work as expected: if two doubles are not
equal (as defined by ==), they will probably have a different
hash (and even if they don't, they're different keys, because
unordered_map also checks for equality).
Obviously, if your values are the results of inexact
calculations, they aren't appropriate keys for unordered_map
(nor perhaps for any map).
Multiple problems with this question:
The reason that your two expressions don't compare as equal is NOT that there are two binary expressions of 0.2, but that there is NO exact (finite) binary representation of 0.2, or sqrt(5) ! So in fact, while (1./std::sqrt(5.)/std::sqrt(5.)) and .2 should be the same algebraically, they may well not be the same in computer-precision arithmetic. (They aren't even in pen-on-paper arithmetic with finite precision. Say you are working with 10 digits after the decimal point. Write out sqrt(5) with 10 digits and calculate your first expression. It will not be .2.)
Of course you have a sensible concept of two numbers being close. In fact you have at least two: One absolute (|a-b| < eps) , one relative. But that doesn't translate into sensible hashes. If you want all numbers within eps of each other to have the same hash, then 1, 1+eps, 1+2*eps, ... would all have the same hash and therefore, ALL numbers would have the same hash. That is a valid, but useless hash function. But it is the only one that satisfies your requirement of mapping nearby values to the same hash!
Behind the default hashing of an unordered_map there is a std::hash struct which provides the operator() to compute the hash of a given value.
A set of default specializations of this templates is available, including std::hash<float> and std::hash<double>.
On my machine (LLVM+clang) these are defined as
template <>
struct hash<float> : public __scalar_hash<float>
{
size_t operator()(float __v) const _NOEXCEPT
{
// -0.0 and 0.0 should return same hash
if (__v == 0)
return 0;
return __scalar_hash<float>::operator()(__v);
}
};
where __scalar_hash is defined as:
template <class _Tp>
struct __scalar_hash<_Tp, 0> : public unary_function<_Tp, size_t>
{
size_t operator()(_Tp __v) const _NOEXCEPT
{
union
{
_Tp __t;
size_t __a;
} __u;
__u.__a = 0;
__u.__t = __v;
return __u.__a;
}
};
Where basically the hash is built by setting a value of an union to the source value and then getting just a piece which is large as a size_t.
So you get some padding or you get your value truncated, but that doesn't really matter because as you can see the raw bits of the number are used to compute the hash, this means that it works exactly as the == operator. Two floating numbers, to have the same hash (excluding collision given by truncation), must be the same value.
There is no rigorous concept of "almost equality". So behavior can't be guaranteed in principle. If you want to define your own concept of "almost equal" and construct a hash function such that two "almost equal" floats have the same hash, you can. But then it will only be true for your particular notion of "almost equal" floats.
Background
I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:
it is of length 12;
the order of the sequence elements does not matter;
no element appears twice in the same sequence;
all elements are smaller than about 300.
Note that the properties 2. and 3. imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.
I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.
Question
I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.
Any suggestions?
http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/
Here's a basic idea; feel free to modify it at will.
Hashing an integer is just the identity.
We use the formula from boost::hash_combine to get combine hashes.
We sort the array to get a unique representative.
Code:
#include <algorithm>
std::size_t array_hash(int (&array)[12])
{
int a[12];
std::copy(array, array + 12, a);
std::sort(a, a + 12);
std::size_t result = 0;
for (int * p = a; p != a + 12; ++p)
{
std::size_t const h = *p; // the "identity hash"
result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2);
}
return result;
}
Update: scratch that. You just edited the question to be something completely different.
If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you an extra 12!, which is about 29 bits, so it doesn't really make a difference.
You can either look for a 128 bit unsigned integral type and store the sorted, packed set of integers in that directly. Or you can split that range up into two 64-bit integers and compute the hash as above:
uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
I would just use the sum function as the hash and see how far you come with that. This doesn’t take advantage of the non-repeating property of the data, nor of the fact that they are all < 300. On the other hand, it’s blazingly fast.
std::size_t hash(int (&arr)[12]) {
return std::accumulate(arr, arr + 12, 0);
}
Since the function needs to be unaware of ordering, I don’t see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I’d hard-code a sorting network (i.e. a number of if…else statements) to sort the 12 values in-place (but I have no idea how a sorting network for 12 values would look like or even if it’s practical).
EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:
struct pow2 {
int operator ()(int n) const { return n * n; }
};
std::size_t hash(int (&arr)[12]) {
int raised[12];
std::transform(arr, arr + 12, raised, pow2());
return std::accumulate(raised, raised + 12, 0);
}
You could toggle bits, corresponding to each of the 12 integers, in the bitset of size 300. Then use formula from boost::hash_combine to combine ten 32-bit integers, implementing this bitset.
This gives commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
This approach may be generalized if we choose arbitrary bitset size and if we set or toggle arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or using a pre-computed lookup table). Which results in a Bloom filter or related structures.
We can choose Bloom filter of size 32 or 64 bits. In this case, there is no need to combine pieces of large bit vector into a single hash value. In case of classical implementation of Bloom filter with size 32, optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.
If, instead of "or" operation of classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get a solution, mentioned by Jim Balter.
If, instead of "or" operation, we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution, similar to one, suggested by Konrad Rudolph.
I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.
Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that their binary representation contains exactly 32 true and 32 false bits. The positions of the true bits are randomly distributed.
import itertools
import random
import sys
def random_combination(iterable, r):
"Random selection from itertools.combinations(iterable, r)"
pool = tuple(iterable)
n = len(pool)
indices = sorted(random.sample(xrange(n), r))
return tuple(pool[i] for i in indices)
mask_size = 64
mask_size_over_2 = mask_size/2
nmasks = 300
suffix='UL'
print 'HashType mask[' + str(nmasks) + '] = {'
for i in range(nmasks):
combo = random_combination(xrange(mask_size),mask_size_over_2)
mask = 0;
for j in combo:
mask |= (1<<j);
if(i<nmasks-1):
print '\t' + str(mask) + suffix + ','
else:
print '\t' + str(mask) + suffix + ' };'
The C++ array generated by the script is used as follows:
typedef int_least64_t HashType;
const int maxTableSize = 300;
HashType mask[maxTableSize] = {
// generated array goes here
};
inline HashType xorrer(HashType const &l, HashType const &r) {
return l^mask[r];
}
HashType hashConfig(HashType *sequence, int n) {
return std::accumulate(sequence, sequence+n, (HashType)0, xorrer);
}
This algorithm is by far the fastest of those that I have tried (this, this with cubes and this with a bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.
So I have two different field types, a char* of length n and an int. I want to generate a hashvalue using both as keys. I add the last 16 bits of the int variable, we'll call the sum integer x, then I use collate: hash to generate a hashvalue for the char*, we'll call it integer y. I then add x+y together, then use hash with the sum to generate a hash value. Lets say i want to limit the hashvalues to a range of [1,4]. Can i just hashvalue%4 to get what i want? Also if there is a better way of generating a hashvalue from the two key let me know.
For the range [1,4] you will have to add 1 to hashvalue%4. However, a hash of 4 is a very small hash. That will give you a lot of collisions, limiting the effectiveness of the hash (that is, many different values of the fields will give you the same hash value.)
I recommend that you add more size (bits) to the hash, maybe 64K (16 bit hash). That will give you less collisions. Also, why not using std::unordered_map, that already implements a hash table?
Finally, as per the hashing function, it depends on the meaning of each of the fields. For example, if in your implementation, only the low 16 bits of the integers count, then the hash should be based only on those bits. There are general hashing functions for strings and for integers, so you could use any of them. Finally, for combining hash values for both fields, summing (or xor-ing) them is a common approach. Just ensure that the generated hash values are as much equally spread over the range as possible.
So, what you describe in many words is written:
struct noname {
int ifield;
char[N] cfield;
};
int hash(const noname &n) {
int x = n.ifield;
int y = ???(n.cfield);
return x + y;
// return (x + y) & 3;
}
Whether this hash function is good depends on the data. For example, if the ifield is always a multiple of 4, it is clearly bad. If the values of the fields are roughly evenly distributed, everything is fine.
Well, except for your requirement to limit the hash range to [1;4]. First, [0;3] is easier to compute, second, such a small range would be appropriate if you only have two or three different things that will have their hash code generated. The range should be at least twice as large as the number of expected different elements.
I want to hash a char array in to an int or a long. The resulting value has to adhere to a given precision value.
The function I've been using is given below:
int GetHash(const char* zKey, int iPrecision /*= 6*/)
{
/////FROM : http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp
unsigned long h = 0;
long M = pow(10, iPrecision);
while(*zKey)
{
h = (h << 4) + *zKey++;
unsigned long g = h & 0xF0000000L;
if (g) h ^= g >> 24;
h &= ~g;
}
return (int) (h % M);
}
The string to be hashed is similar to "SAEUI1210.00000010_1".
However, this produces duplicate values in some cases.
Are there any good alternatives which wouldn't duplicate the same hash for different string values.
The very definition of a hash is that it produces duplicate values for some values, due to hash value range being smaller than the space of the hashed data.
In theory, a 32-bit hash has enough range to hash all ~6 character strings (A-Z,a-z,0-9 only), without causing a collision. In practice, hashes are not a perfect permutation of the input. Given a 32-bit hash, you can expect to get hash collisions after hashing ~16 bit of random inputs, due to the birthday paradox.
Given a static set of data values, it's always possible to construct a hash function designed specifically for them, which will never collide with itself (of course, size of its output will be at least log(|data set|). However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.
That being said, here are a few alternatives which should get you started (they are designed to minimize collisions)
Every hash will have collisions. Period. That's called a Birthday Problem.
You may want to check cryptographic has functions like MD5 (relatively fast and you don't care that it's insecure) but it also will have collisions.
Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have this additional constraint of precision (0-5 ?) then you are going to hit collisions far more often.
MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.