I want to hash a char array into an int or a long. The resulting value has to fit within a given precision (a given number of decimal digits).
The function I've been using is given below:
#include <cmath>

int GetHash(const char* zKey, int iPrecision /*= 6*/)
{
    ///// FROM: http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp
    unsigned long h = 0;
    long M = (long)std::pow(10.0, iPrecision); // limit result to iPrecision decimal digits
    while (*zKey)
    {
        h = (h << 4) + *zKey++;            // shift in the next character
        unsigned long g = h & 0xF0000000L; // grab the top nibble
        if (g) h ^= g >> 24;               // fold it back into the lower bits
        h &= ~g;                           // and clear it
    }
    return (int)(h % M);
}
The string to be hashed is similar to "SAEUI1210.00000010_1".
However, this produces duplicate values in some cases.
Are there any good alternatives which wouldn't produce the same hash for different string values?
The very definition of a hash is that it produces duplicate values for some inputs, because the range of hash values is smaller than the space of the data being hashed.
In theory, a 32-bit hash has enough range to hash all ~6-character strings (A-Z, a-z, 0-9 only) without causing a collision. In practice, hashes are not a perfect permutation of the input. With a 32-bit hash, you can expect collisions after hashing roughly 2^16 (~65,000) random inputs, due to the birthday paradox.
Given a static set of data values, it's always possible to construct a hash function designed specifically for them that never collides on that set (of course, the size of its output will be at least log(|data set|)). However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.
That being said, here are a few alternatives that should get you started (they are designed to minimize collisions).
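As one illustration of a general-purpose alternative (not part of the original answer), here is a minimal FNV-1a sketch reduced to the same decimal precision as the question's GetHash:

#include <cmath>
#include <cstdint>

// FNV-1a, a widely used non-cryptographic string hash; reducing the result
// modulo 10^iPrecision mirrors the question's GetHash and is this sketch's assumption.
int GetHashFnv1a(const char* zKey, int iPrecision /*= 6*/)
{
    uint32_t h = 2166136261u;        // FNV offset basis
    while (*zKey)
    {
        h ^= (unsigned char)*zKey++;
        h *= 16777619u;              // FNV prime
    }
    long M = (long)std::pow(10.0, iPrecision);
    return (int)(h % M);
}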
Every hash will have collisions. Period. The pigeonhole principle guarantees it, and the birthday problem means you will see them far sooner than you might expect.
You may want to check cryptographic hash functions like MD5 (relatively fast, and you don't care that it's insecure), but they will also have collisions.
Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have the additional constraint of precision (0-5?), you are going to hit collisions far more often.
MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.
Basically I'm using the hash function used in Rabin-Karp.
It's the same function as in Fast implementation of Rolling hash, but instead of hashing a string, I am hashing a vector of integers.
#include <cstddef>
#include <vector>

const unsigned PRIME_BASE = 257;
const unsigned PRIME_MOD  = 1000000007;

unsigned hash(const std::vector<unsigned int>& Line)
{
    unsigned long long ret = 0;
    for (std::size_t i = 0; i < Line.size(); i++)
    {
        ret = ret * PRIME_BASE + Line[i]; // polynomial rolling hash
        ret %= PRIME_MOD;                 // keep the value below the modulus
    }
    return (unsigned)ret;                 // ret < PRIME_MOD, so this fits in 32 bits
}
The problem is that I am getting lots of collisions. Changing the prime numbers can reduce or increase the collisions, but I can't avoid them entirely.
Any ideas on how to avoid collisions with such a function, or a better one?
You don't.
The whole point of a hash is to take an input from a large domain, and produce an output in a smaller domain.
Collisions are, by the very nature of the process, inevitable and unavoidable.
You can try to reduce their likelihood, for some particular class of dataset, but you've already explored doing that.
You can make it a little better (decrease the chance of collisions) by adding more hash functions. E.g.:
create two hash functions with different PRIME_BASE and PRIME_MOD values, and store the pair of results (see the sketch at the end of this answer).
Another problem arises if Line contains many zeros, so it's better to add some random shift (fixed after initialization) to the values.
E.g. with Rabin-Karp, if you want to hash 'A' and 'AA', it's better to add a shift value; otherwise both strings hash to 0 (assuming you convert the characters like f(char c){ return c - 'A'; }).
Another interesting point: if you choose a good hash function (good in the random sense) and your inputs are also random, collisions shouldn't happen as long as the number of different items in the Line vector is less than sqrt(range of the hash function); this is the birthday paradox. Your current range is 1e9+7, whose square root is around 3e4. If you use two hash functions, the combined range is the product of their ranges.
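A minimal sketch of the two-hash idea described above; the second base/modulus pair and the SHIFT value are illustrative choices, not prescribed by the answer:

#include <cstddef>
#include <stdint.h>
#include <utility>
#include <vector>

// Two independent polynomial hashes with a fixed shift so runs of zeros still contribute.
const uint64_t BASE1 = 257,     MOD1 = 1000000007ULL;
const uint64_t BASE2 = 1000003, MOD2 = 998244353ULL;
const uint64_t SHIFT = 12345;

std::pair<uint64_t, uint64_t> hash2(const std::vector<unsigned>& line)
{
    uint64_t h1 = 0, h2 = 0;
    for (std::size_t i = 0; i < line.size(); ++i)
    {
        h1 = (h1 * BASE1 + line[i] + SHIFT) % MOD1;
        h2 = (h2 * BASE2 + line[i] + SHIFT) % MOD2;
    }
    return std::make_pair(h1, h2); // compare both halves when checking equality
}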
This is in C++. I need to keep a count for every pair of numbers. The two numbers are of type int. I sort the two numbers, so the pair (n1, n2) is the same as (n2, n1). I'm using std::unordered_map as the container.
I have been using the elegant pairing function by Matthew Szudzik, Wolfram Research, Inc. In my implementation, the function gives me a unique number of type long (64 bits on my machine) for every pair of two ints. I use this long as the key for the std::unordered_map. Is there a better way to keep count of such pairs? By better I mean faster and, if possible, with less memory usage.
Also, I don't need all the bits of a long. Even though you can assume that the two numbers can range up to the max value for 32 bits, I anticipate the max possible value of my pairing function to require at most 36 bits. If nothing else, is there at least a way to use just 36 bits as the key for the unordered_map (some other data type)?
I thought of using bitset, but I'm not exactly sure if the std::hash will generate a unique key for any given bitset of 36 bits, which can be used as key for unordered_map.
I would greatly appreciate any thoughts, suggestions etc.
First of all, I think you started from a wrong assumption. For std::unordered_map and std::unordered_set the hash does not have to be unique (and in principle it cannot be, for data types like std::string); there should just be a low probability that two different keys generate the same hash value. If there is a collision it's not the end of the world, access just becomes slower. I would generate a 32-bit hash from the two numbers and, if you have an idea of the typical values, test the probability of a hash collision and choose the hash function accordingly.
For that to work, you should use a pair of 32-bit numbers as the key in std::unordered_map and provide a proper hash function. Computing a unique 64-bit key and using it with the hash map is questionable, since the hash map will then compute another hash of that key, so you may actually be making it slower.
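A minimal sketch of what such a pair hash could look like; storing the smaller value first and the MurmurHash3 finalizer constant are illustrative choices, not part of this answer:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hashes a pair of 32-bit values by packing them into 64 bits and mixing.
struct PairHash {
    std::size_t operator()(const std::pair<uint32_t, uint32_t>& p) const {
        uint64_t v = ((uint64_t)p.first << 32) | p.second;
        v ^= v >> 33;
        v *= 0xff51afd7ed558ccdULL;   // MurmurHash3 finalizer constant
        v ^= v >> 33;
        return (std::size_t)v;
    }
};

std::unordered_map<std::pair<uint32_t, uint32_t>, int, PairHash> counts;

// Normalize the order when inserting so (n1, n2) counts the same as (n2, n1).
void addPair(uint32_t n1, uint32_t n2) {
    if (n1 > n2) std::swap(n1, n2);
    ++counts[std::make_pair(n1, n2)];
}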
About a 36-bit key: this is not a good idea unless you have a special CPU that handles 36-bit data. Your data will either be aligned on a 64-bit boundary, in which case you get no memory savings, or you will pay the penalty of unaligned access. In the first case you would just have extra code to extract 36 bits from 64-bit data (if the processor supports it); in the second, your code will be slower than a 32-bit hash even if there are some collisions.
If that hash map is a bottleneck, you may consider a different hash map implementation such as goog-sparsehash.sourceforge.net.
Just my two cents: the pairing functions you've got in the article are WAY more complicated than you actually need. Mapping two 32-bit unsigned values to 64 bits uniquely is easy. The following does that, and even handles the order of the pair not mattering, without hitting the math unit too heavily (if at all).
#include <cstdint>
#include <cstdlib>

uint64_t map(uint32_t a, uint32_t b)
{
    uint64_t x = a + b;                  // sum (order-independent)
    uint64_t y = abs((int32_t)(a - b));  // absolute difference (order-independent)
    uint64_t ans = (x << 32) | (y);      // pack sum and difference into 64 bits
    return ans;
}

void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
    uint64_t x = map >> 32;        // sum
    uint64_t y = map & 0xFFFFFFFFL; // absolute difference
    *a = (x + y) >> 1;             // (sum + diff) / 2 = larger value
    *b = (x - *a);                 // sum - larger = smaller value
}
Another alternative:
uint64_t map(uint32_t a, uint32_t b)
{
    bool bb = a > b;
    uint64_t x = ((uint64_t)a) << (32 * (bb));  // larger value goes to the high word
    uint64_t y = ((uint64_t)b) << (32 * !(bb));
    uint64_t ans = x | y;
    return ans;
}

void unwind(uint64_t map, uint32_t* a, uint32_t* b)
{
    *a = map >> 32;        // larger value
    *b = map & 0xFFFFFFFF; // smaller value
}
That works as a unique key. You can easily modify that to be a hash function provider for unordered_map, though whether or not that will be faster than std::map depends on the number of values you've got.
NOTE: the first version will fail if a + b overflows 32 bits.
Background
I have a large collection (~thousands) of sequences of integers. Each sequence has the following properties:
it is of length 12;
the order of the sequence elements does not matter;
no element appears twice in the same sequence;
all elements are smaller than about 300.
Note that the properties 2. and 3. imply that the sequences are actually sets, but they are stored as C arrays in order to maximise access speed.
I'm looking for a good C++ algorithm to check if a new sequence is already present in the collection. If not, the new sequence is added to the collection. I thought about using a hash table (note however that I cannot use any C++11 constructs or external libraries, e.g. Boost). Hashing the sequences and storing the values in a std::set is also an option, since collisions can be just neglected if they are sufficiently rare. Any other suggestion is also welcome.
Question
I need a commutative hash function, i.e. a function that does not depend on the order of the elements in the sequence. I thought about first reducing the sequences to some canonical form (e.g. sorting) and then using standard hash functions (see refs. below), but I would prefer to avoid the overhead associated with copying (I can't modify the original sequences) and sorting. As far as I can tell, none of the functions referenced below are commutative. Ideally, the hash function should also take advantage of the fact that elements never repeat. Speed is crucial.
Any suggestions?
http://partow.net/programming/hashfunctions/index.html
http://code.google.com/p/smhasher/
Here's a basic idea; feel free to modify it at will.
Hashing an integer is just the identity.
We use the formula from boost::hash_combine to combine hashes.
We sort the array to get a unique representative.
Code:
#include <algorithm>
#include <cstddef>

std::size_t array_hash(int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a); // work on a copy, the original must not be modified
    std::sort(a, a + 12);            // canonical (sorted) representative
    std::size_t result = 0;
    for (int* p = a; p != a + 12; ++p)
    {
        std::size_t const h = *p; // the "identity hash"
        result ^= h + 0x9e3779b9 + (result << 6) + (result >> 2); // boost::hash_combine formula
    }
    return result;
}
Update: scratch that. You just edited the question to be something completely different.
If every number is at most 300, then you can squeeze the sorted array into 9 bits each, i.e. 108 bits. The "unordered" property only saves you an extra 12!, which is about 29 bits, so it doesn't really make a difference.
You can either look for a 128 bit unsigned integral type and store the sorted, packed set of integers in that directly. Or you can split that range up into two 64-bit integers and compute the hash as above:
uint64_t hash = lower_part + 0x9e3779b9 + (upper_part << 6) + (upper_part >> 2);
(Or maybe use 0x9E3779B97F4A7C15 as the magic number, which is the 64-bit version.)
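As a rough sketch of that packing idea (not from the original answer), assuming the elements fit in 9 bits and splitting the sorted array into two halves of six values each:

#include <algorithm>
#include <stdint.h>

// Packs the sorted 12 values (9 bits each) into two 64-bit halves and combines
// them with the hash_combine formula. The 6/6 split is an illustrative choice.
uint64_t packed_hash(const int (&array)[12])
{
    int a[12];
    std::copy(array, array + 12, a);
    std::sort(a, a + 12);                                    // canonical representative

    uint64_t lower_part = 0, upper_part = 0;
    for (int i = 0; i < 6; ++i)
        lower_part = (lower_part << 9) | (uint64_t)a[i];     // 6 * 9 = 54 bits
    for (int i = 6; i < 12; ++i)
        upper_part = (upper_part << 9) | (uint64_t)a[i];

    return lower_part + 0x9E3779B97F4A7C15ULL + (upper_part << 6) + (upper_part >> 2);
}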
Sort the elements of your sequences numerically and then store the sequences in a trie. Each level of the trie is a data structure in which you search for the element at that level ... you can use different data structures depending on how many elements are in it ... e.g., a linked list, a binary search tree, or a sorted vector.
If you want to use a hash table rather than a trie, then you can still sort the elements numerically and then apply one of those non-commutative hash functions. You need to sort the elements in order to compare the sequences, which you must do because you will have hash table collisions. If you didn't need to sort, then you could multiply each element by a constant factor that would smear them across the bits of an int (there's theory for finding such a factor, but you can find it experimentally), and then XOR the results. Or you could look up your ~300 values in a table, mapping them to unique values that mix well via XOR (each one could be a random value chosen so that it has an equal number of 0 and 1 bits -- each XOR flips a random half of the bits, which is optimal).
I would just use the sum function as the hash and see how far you get with that. This doesn't take advantage of the non-repeating property of the data, nor of the fact that they are all < 300. On the other hand, it's blazingly fast.
#include <cstddef>
#include <numeric>

std::size_t hash(int (&arr)[12]) {
    return std::accumulate(arr, arr + 12, 0); // order-independent: just the sum
}
Since the function needs to be unaware of ordering, I don't see a smart way of taking advantage of the limited range of the input values without first sorting them. If this is absolutely required, collision-wise, I'd hard-code a sorting network (i.e. a number of if…else statements) to sort the 12 values in-place (but I have no idea what a sorting network for 12 values would look like or even whether it's practical).
EDIT After the discussion in the comments, here’s a very nice way of reducing collisions: raise every value in the array to some integer power before summing. The easiest way of doing this is via transform. This does generate a copy but that’s probably still very fast:
#include <algorithm>
#include <cstddef>
#include <numeric>

struct pow2 {
    int operator ()(int n) const { return n * n; } // square each element
};

std::size_t hash(int (&arr)[12]) {
    int raised[12];
    std::transform(arr, arr + 12, raised, pow2()); // square into a temporary copy
    return std::accumulate(raised, raised + 12, 0); // then sum as before
}
You could toggle one bit, corresponding to each of the 12 integers, in a bitset of size 300. Then use the formula from boost::hash_combine to combine the ten 32-bit integers that implement this bitset.
This gives a commutative hash function, does not use sorting, and takes advantage of the fact that elements never repeat.
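A minimal sketch of this idea (my illustration, not code from the answer); extracting the 32-bit chunks via test() is done for portability rather than speed:

#include <bitset>
#include <cstddef>

// Toggle one bit per element, then fold the ten 32-bit chunks of the bitset
// into one value with the boost::hash_combine formula.
std::size_t bitset_hash(const int (&array)[12])
{
    std::bitset<300> bits;
    for (int i = 0; i < 12; ++i)
        bits.flip(array[i]);                 // elements never repeat, so flip == set

    std::size_t result = 0;
    for (int word = 0; word < 10; ++word)    // ten 32-bit chunks cover 320 >= 300 bits
    {
        unsigned long chunk = 0;
        for (int b = 0; b < 32; ++b)
        {
            int pos = word * 32 + b;
            if (pos < 300 && bits.test(pos))
                chunk |= 1UL << b;
        }
        result ^= chunk + 0x9e3779b9 + (result << 6) + (result >> 2);
    }
    return result;
}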
This approach may be generalized if we choose an arbitrary bitset size and set or toggle an arbitrary number of bits for each of the 12 integers (which bits to set/toggle for each of the 300 values is determined either by a hash function or by a pre-computed lookup table). This results in a Bloom filter or related structures.
We can choose a Bloom filter of size 32 or 64 bits. In that case, there is no need to combine pieces of a large bit vector into a single hash value. For a classical Bloom filter of size 32, the optimal number of hash functions (or non-zero bits for each value of the lookup table) is 2.
If, instead of the "or" operation of the classical Bloom filter, we choose "xor" and use half non-zero bits for each value of the lookup table, we get the solution mentioned by Jim Balter.
If, instead of "or", we choose "+" and use approximately half non-zero bits for each value of the lookup table, we get a solution similar to the one suggested by Konrad Rudolph.
I accepted Jim Balter's answer because he's the one who came closest to what I eventually coded, but all of the answers got my +1 for their helpfulness.
Here is the algorithm I ended up with. I wrote a small Python script that generates 300 64-bit integers such that each binary representation contains exactly 32 set and 32 unset bits. The positions of the set bits are randomly distributed.
import itertools
import random
import sys

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(xrange(n), r))
    return tuple(pool[i] for i in indices)

mask_size = 64
mask_size_over_2 = mask_size/2
nmasks = 300
suffix = 'UL'

print 'HashType mask[' + str(nmasks) + '] = {'
for i in range(nmasks):
    combo = random_combination(xrange(mask_size), mask_size_over_2)
    mask = 0
    for j in combo:
        mask |= (1 << j)
    if i < nmasks-1:
        print '\t' + str(mask) + suffix + ','
    else:
        print '\t' + str(mask) + suffix + ' };'
The C++ array generated by the script is used as follows:
#include <numeric>
#include <stdint.h>

typedef int_least64_t HashType;

const int maxTableSize = 300;
HashType mask[maxTableSize] = {
    // generated array goes here
};

// XOR the mask corresponding to each element into the running hash.
inline HashType xorrer(HashType const &l, HashType const &r) {
    return l ^ mask[r];
}

HashType hashConfig(HashType *sequence, int n) {
    return std::accumulate(sequence, sequence + n, (HashType)0, xorrer);
}
This algorithm is by far the fastest of those that I have tried (this, this with cubes and this with a bitset of size 300). For my "typical" sequences of integers, collision rates are smaller than 1E-7, which is completely acceptable for my purpose.
So I have two different field types, a char* of length n and an int, and I want to generate a hash value using both as keys. I take the last 16 bits of the int variable; we'll call the result integer x. Then I use collate::hash to generate a hash value for the char*; we'll call it integer y. I then add x + y together and hash the sum to generate the final hash value. Let's say I want to limit the hash values to the range [1,4]. Can I just take hashvalue % 4 to get what I want? Also, if there is a better way of generating a hash value from the two keys, let me know.
For the range [1,4] you will have to add 1 to hashvalue % 4. However, a range of 4 is a very small hash. That will give you a lot of collisions, limiting the effectiveness of the hash (that is, many different values of the fields will give you the same hash value).
I recommend that you give the hash more bits, maybe 64K values (a 16-bit hash). That will give you fewer collisions. Also, why not use std::unordered_map, which already implements a hash table?
Finally, as for the hashing function, it depends on the meaning of each of the fields. For example, if in your implementation only the low 16 bits of the integer count, then the hash should be based only on those bits. There are general hashing functions for strings and for integers, so you could use any of them. For combining hash values for both fields, summing (or xor-ing) them is a common approach. Just ensure that the generated hash values are spread as evenly over the range as possible.
So, what you describe in many words is written:
struct noname {
    int ifield;
    char cfield[N];   // N is whatever length the string field has
};

int hash(const noname &n) {
    int x = n.ifield;
    int y = ???(n.cfield);   // whatever string hash you pick for the char array
    return x + y;
    // return (x + y) & 3;   // if you really need the small range
}
Whether this hash function is good depends on the data. For example, if the ifield is always a multiple of 4, it is clearly bad. If the values of the fields are roughly evenly distributed, everything is fine.
Well, except for your requirement to limit the hash range to [1;4]. First, [0;3] is easier to compute; second, such a small range is only appropriate if you have just two or three different things that will have their hash code generated. The range should be at least twice as large as the number of expected different elements.
I need to count a lot of different items. I'm processing a list of pairs such as:
A34223,34
B23423,-23
23423212,16
What I was planning to do was hash the first value (the key) into a 32-bit integer, which would then be the key into a sparse structure where the 'value' is accumulated (all counts start at zero, and values can be negative).
Given that the keys are short and alphanumeric, is there a way to design a hash algorithm that is fast on 32-bit x86 architectures? Or is there an existing suitable hash?
I don't know anything about the design of hashes, but I was hoping that, given the simple input, there would be a way to generate a high-performance hash that guarantees no collisions for a given key length of "X" and has high dispersion, so it minimizes collisions when the length exceeds "X".
As you are using C++, the first thing you should do is create a trivial implementation using std::map. Is it fast enough (it probably will be)? If so, stick with it; otherwise investigate whether your C++ implementation provides a hash table. If it does, use it to create a trivial implementation, test it, and time it. Is it fast enough (almost certainly yes)?
Only after you have exhausted these options should you think of implementing your own hash table and hashing function.
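For reference, the trivial baseline could look something like this (how the "A34223,34" lines are split into key and value is left out; the names are illustrative):

#include <map>
#include <string>

// Count values per key with std::map; missing keys start at zero.
void addRecord(std::map<std::string, long>& counts,
               const std::string& key, long value)
{
    counts[key] += value;
}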
A guarantee for no collisions is difficult.
In your case, the keys
A34223
B23423
23423212
can be translated to 32-bit integers with little effort.
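One possible reading of "translated with little effort", assuming each key is an optional letter prefix followed by digits (the encoding below is an illustration, not part of the answer):

#include <cctype>
#include <cstdint>
#include <cstdlib>

// Encode an optional letter prefix in the top 5 bits and the number in the low 27 bits.
uint32_t keyToInt(const char* key)
{
    uint32_t prefix = 0;
    if (std::isalpha((unsigned char)*key))
        prefix = (uint32_t)(*key++ - 'A' + 1);           // 'A' -> 1, 'B' -> 2, ...
    uint32_t digits = (uint32_t)std::strtoul(key, 0, 10);
    return (prefix << 27) | (digits & 0x07FFFFFF);
}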
And here is a good function that generates hashes from strings:
/**
 * "The Practice of Programming", Hash Tables, section 2.9, pg. 57
 *
 * computes hash value of string
 */
unsigned int
strhash( const char* str )
{
    //#define MULTIPLIER 31 or 37
    unsigned int h;
    const unsigned char* p;

    h = 0;
    for ( p = (const unsigned char*)str; *p != '\0'; p++ )
        h = 31 * h + *p; // <- FIXED MULTIPLIER
    return h;
}
Check Bob Jenkins's website for good hash functions. IIRC it is the same hash that is used in Perl.
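For reference, here is a sketch of Jenkins's one-at-a-time hash, the variant Perl used for a long time (written from memory; see burtleburtle.net for the reference version):

#include <stddef.h>
#include <stdint.h>

// Jenkins's one-at-a-time hash: mix in one byte at a time, then finalize.
uint32_t one_at_a_time(const char* key, size_t len)
{
    uint32_t hash = 0;
    for (size_t i = 0; i < len; ++i)
    {
        hash += (unsigned char)key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}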