Understanding bits and bytes of a hashing algorithm


In a question about a very simple hashing algorithm called djb2, the author wants to know why the number 33 is chosen in the algorithm (see the C code below).
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++)) // fetch the next character; stop at the terminating NUL
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}
In the top answer, point 2 talks about the hashing accumulator and how it makes two copies of itself, and then it says something about the spreading.
Can someone explain what is meant by "copying itself" and the "spread" of answer 2?

The step 2 being referenced is this:
As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. Using a larger shift would duplicate fewer bits, using a smaller shift would keep bit interactions more local and make it take longer for the interactions to spread.
33 is 32+1. That means, thanks to multiplication being distributive, that hash * 33 = (hash * 32) + (hash * 1) - or in other words, make two copies of hash, shift one of them left by 5 bits, then add them together, which is what (hash << 5) + hash expresses in a more direct way.
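To make the "two copies" idea concrete, here is a minimal sketch in C++ (the function names are made up for illustration). With hash = 1011 in binary (decimal 11), hash << 5 is 101100000 (352), and adding the unshifted copy gives 101101011 (363 = 11 * 33): the pattern 1011 appears twice, five bit positions apart.

```cpp
#include <cassert>
#include <cstdint>

// Multiply by 33 the "shift and add" way: one copy shifted left by 5 bits
// (that is hash * 32) plus one unshifted copy (hash * 1).
std::uint32_t times33_shift_add(std::uint32_t h) {
    return (h << 5) + h;
}

// The same step written as an explicit multiplication, for comparison.
std::uint32_t times33_multiply(std::uint32_t h) {
    return h * 33u;
}

// Spot-check that both forms agree, including values where the result
// wraps around modulo 2^32.
void check_forms_agree() {
    const std::uint32_t samples[] = {0u, 11u, 5381u, 0x80000000u, 0xFFFFFFFFu};
    for (std::uint32_t h : samples) {
        assert(times33_shift_add(h) == times33_multiply(h));
    }
}
```

Because the arithmetic is unsigned, both forms wrap the same way on overflow, so the equivalence holds for every input and not just small ones.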

Related

Does a 64-bit packed structure contain a field set to a specified value?

I have an odd structure with 5 fields of bit length 12 and 4 boolean flags stored in the high bits. This all fits nicely into a 64 bit long, and as such they are stored as a 64 bit word array. What I want to do is search the array and find if any of the 12 bit fields are set to a given value.
I have tried the obvious solution of using bit shifts and masks; however, this is a very hot function and needs to be optimized for speed. This led me to this page, which contains a way to check for a byte in a word in very few operations. This makes me think it is possible to do something similar with the 12-bit fields, but I am struggling to find which constants should replace the ones given on that page.

I'm not very versed in low level languages, but I'm in the mood to fiddle with some bits so I thought I'd give it a try.
POC: JS can't do 64bit longs, but we can check if we can adapt the algorithm to deal with 2x12bit fields + 8boolean flags (noise) in an 32bit (u)int.
The noise is there because the original algorithm dealt with exactly 4 bytes and no further bits, but neither 32 nor 64 is divisible by 12, so we need to ensure that these additional bits don't interfere or, worse, get matched.
function hasValue(x, n) { return hasZero(x ^ (0x001001 * n)); }
function hasZero(v) { return ((v - 0x001001) & ~(v) & 0x800800); }
function hex(v) { return "0x" + v.toString(16) }
// create a random value, 2x12bit fields plus 8 random flags.
var v = Math.floor(Math.random() * 0x100000000);
console.log("value", hex(v));
// get the two fields
var a = v & 0xFFF;
console.log("check", hex(a), !!hasValue(v, a));
var b = (v >> 12) & 0xFFF;
console.log("check", hex(b), !!hasValue(v, b));
// brute force.
// check if any other value is matched.
// these should only return the 2 values from above.
for (var i = 0; i < 0x1000; ++i) {
if (hasValue(v, i)) {
console.log("matched", hex(i));
}
}
Extrapolating from this, your solution should be
#define hasValue(x,n) hasZero((x) ^ (0x001001001001001ULL * (n)))
#define hasZero(v) (((v) - 0x001001001001001ULL) & ~(v) & 0x800800800800800ULL)
where all values are unsigned 64-bit integers. The ULL suffix makes each constant an unsigned long long, and the extra parentheses keep the macros safe when the arguments are expressions.
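For reference, here is a sketch of the same trick as C++ functions instead of macros. It assumes the layout described in the question (five 12-bit fields in bits 0 to 59, four flags in bits 60 to 63); the names has_zero_field, has_field_value and pack are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kOnes  = 0x001001001001001ULL;  // lowest bit of each 12-bit field
constexpr std::uint64_t kHighs = 0x800800800800800ULL;  // highest bit of each 12-bit field

// Nonzero if at least one of the five 12-bit fields of v is zero.
// The mask kHighs excludes bits 60..63, so the flags cannot produce a match.
std::uint64_t has_zero_field(std::uint64_t v) {
    return (v - kOnes) & ~v & kHighs;
}

// True if at least one 12-bit field of x equals n.
bool has_field_value(std::uint64_t x, std::uint64_t n) {
    assert(n <= 0xFFF);  // n must fit in one field
    // XOR with n replicated into every field turns matching fields into zero.
    return has_zero_field(x ^ (kOnes * n)) != 0;
}

// Hypothetical helper for building test words: f0 is the lowest field.
std::uint64_t pack(std::uint64_t f0, std::uint64_t f1, std::uint64_t f2,
                   std::uint64_t f3, std::uint64_t f4, std::uint64_t flags) {
    return f0 | (f1 << 12) | (f2 << 24) | (f3 << 36) | (f4 << 48) | (flags << 60);
}
```

For example, with x = pack(0x123, 0x456, 0x789, 0xABC, 0xDEF, 0xF), has_field_value(x, 0x789) is true while has_field_value(x, 0x788) is false. The result is only a yes/no answer; it does not tell you which field matched.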

How to hash a 96-bit struct/number?

So I can't figure out how to do this in C++. I need to do a modulus operation and integer conversion on data that is 96 bits in length.
Example:
struct Hash96bit
{
char x[12];
};
int main()
{
Hash96bit n;
// set n to something
int size = 23;
int result = n % size; // the goal; does not compile because Hash96bit has no operator%
}
Edit: I'm trying to have a 96-bit hash because I have 3 floats which, when combined, create a unique combination. I thought that would be best to use as the hash because you don't really have to process it at all.
Edit: Okay... so at this point I might as well explain the bigger issue. I have a 3D world that I want to subdivide into sectors, so that groups of objects can be placed in sectors, which would make frustum culling and physics iterations take less time. So at the beginning, let's say you are at sector 0,0,0. Sure, we store them all in an array, cool, but what happens when we get far away from 0,0,0? We don't care about those sectors anymore. So we use a hashmap, since memory isn't an issue and because we will be accessing data with sector values rather than handles. Now a sector is 3 floats, and hashing that could easily be done with any number of algorithms. I thought it might be better if I could just say the 3 floats together are the key and go from there; I just needed a way to mod a 96-bit number to fit it in the data segment. Anyway, I think I'm just gonna take the bottom bits of each of these floats and use a 64-bit hash unless anyone comes up with something brilliant. Thank you for the advice so far.

UPDATE: Having just read your second edit to the question, I'd recommend you use the Jenkins approach from David's answer (which I upvoted a while back)... just point it at the lowest byte in your struct of three floats.
Regarding "Anyway I think i'm just gonna take the bottom bits of each of these floats": again, the idea with a hash function used by a hash table is not just to map each bit in the input (less still some subset of them) to a bit in the hash output. You could easily end up with a lot of collisions that way, especially if the number of buckets is not a prime number. For example, if you take 21 bits from each float, and the number of buckets happens to be 1024 currently, then after % 1024 only 10 bits from one of the floats will be used, with no regard to the values of the other floats... hash(a,b,c) == hash(d,e,c) for all c. (It's actually a little worse than that: values like 5.5, 2.75 etc. only use a couple of bits of the mantissa, so their low bits are all zero.)
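The collision problem described above can be reproduced with a small sketch. It assumes 32-bit IEEE 754 floats and a made-up key layout (21 low bits of each float packed into one 64-bit word); the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit floats");

// Returns the raw bit pattern of a float.
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical "bottom bits" key: the low 21 bits of each float, packed together.
std::uint64_t low_bits_key(float a, float b, float c) {
    const std::uint64_t mask = (std::uint64_t{1} << 21) - 1;
    return ((float_bits(a) & mask) << 42) |
           ((float_bits(b) & mask) << 21) |
            (float_bits(c) & mask);
}

// Bucket index for a table with 1024 buckets.
std::uint64_t bucket_1024(float a, float b, float c) {
    return low_bits_key(a, b, c) % 1024;
}

// Demonstrates the two problems: a and b are ignored, and "round" values of c
// such as 5.5 or 2.75 all land in bucket 0 because their low mantissa bits are zero.
void check_collisions() {
    assert(bucket_1024(1.5f, 2.5f, 7.25f) == bucket_1024(-3.0f, 100.0f, 7.25f));
    assert(bucket_1024(1.0f, 2.0f, 5.5f) == 0);
    assert(bucket_1024(9.0f, 8.0f, 2.75f) == 0);
}
```

With 1024 buckets the index is just the low 10 bits of the third float, which is why a mixing hash over all 12 bytes is the safer choice.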
Since you're insisting on this (though it's very likely not what you need, and a misnomer to boot):
struct Hash96bit
{
union {
float f[3];
char x[12];
uint32_t u[3];
};
Hash96bit(float a, float b, float c)
{
f[0] = a;
f[1] = b;
f[2] = c;
}
// the operator will support your "int result = n % size;" usage...
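// uint128_t is not a standard type; this assumes a typedef such as unsigned __int128 (GCC/Clang)
operator uint128_t() const
{
return u[0] * ((uint128_t)1 << 64) + // arbitrary ordering
u[1] * ((uint128_t)1 << 32) + // multiply (not add) so u[1] occupies bits 32..63
u[2];
}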
};

You can use the Jenkins one-at-a-time hash.
uint32_t jenkins_one_at_a_time_hash(const char *key, size_t len)
{
    uint32_t hash, i;
    for(hash = i = 0; i < len; ++i)
    {
        hash += (unsigned char)key[i]; // cast avoids sign extension where char is signed
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    // final avalanche
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return hash;
}
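As a rough sketch of how this could be pointed at three floats, the example below copies them into a 12-byte buffer and hashes that. The Sector struct and hash_sector function are hypothetical names for illustration, and the one-at-a-time routine is repeated so the block stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Jenkins one-at-a-time hash over raw bytes.
std::uint32_t one_at_a_time(const unsigned char* key, std::size_t len) {
    std::uint32_t hash = 0;
    for (std::size_t i = 0; i < len; ++i) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

// Hypothetical sector key made of three floats.
struct Sector {
    float x, y, z;
};

// Hashes the 12 bytes of the three floats; memcpy avoids aliasing problems
// and any padding the struct might have.
std::uint32_t hash_sector(const Sector& s) {
    unsigned char bytes[3 * sizeof(float)];
    std::memcpy(bytes, &s.x, sizeof(float));
    std::memcpy(bytes + sizeof(float), &s.y, sizeof(float));
    std::memcpy(bytes + 2 * sizeof(float), &s.z, sizeof(float));
    return one_at_a_time(bytes, sizeof bytes);
}

// Bucket index for a table with bucket_count buckets.
std::size_t sector_bucket(const Sector& s, std::size_t bucket_count) {
    assert(bucket_count > 0);
    return hash_sector(s) % bucket_count;
}
```

One caveat with hashing raw float bytes: 0.0f and -0.0f compare equal but have different bit patterns, so they would hash differently unless the coordinates are normalized first (integer sector indices sidestep this).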

knuth multiplicative hash

Is this a correct implementation of the Knuth multiplicative hash?
int hash(int v)
{
v *= 2654435761;
return v >> 32;
}
Does overflow in the multiplication affect the algorithm?
How to improve the performance of this method?

The Knuth multiplicative hash is used to compute a hash value in {0, 1, 2, ..., 2^p - 1} from an integer k.
Suppose that p is between 0 and 32; the algorithm goes like this:
Compute alpha as the closest integer to 2^32 (-1 + sqrt(5)) / 2. We get alpha = 2 654 435 769.
Compute k * alpha and reduce the result modulo 2^32:
k * alpha = n0 * 2^32 + n1 with 0 <= n1 < 2^32
Keep the highest p bits of n1:
n1 = m1 * 2^(32-p) + m2 with 0 <= m2 < 2^(32 - p)
So, a correct implementation of Knuth multiplicative algorithm in C++ is:
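#include <cassert>
#include <cstdint>
std::uint32_t knuth(int x, int p) {
    assert(p >= 0 && p <= 32);
    if (p == 0) return 0; // shifting a 32-bit value by 32 bits would be undefined behavior
    const std::uint32_t knuth = 2654435769u;
    const std::uint32_t y = x;
    return (y * knuth) >> (32 - p);
}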
Forgetting to shift the result by (32 - p) is a major mistake, as you would lose all the good properties of the hash. It would transform an even sequence into an even sequence, which would be very bad as all the odd slots would stay unoccupied. That's like taking a good wine and mixing it with Coke. By the way, the web is full of people misquoting Knuth and using a multiplication by 2 654 435 761 without taking the higher bits. I just opened the Knuth and he never said such a thing. It looks like some guy who decided he was "smart" decided to take a prime number close to 2 654 435 769.
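The even-sequence problem is easy to see with a small sketch. It uses the constant 2 654 435 769 from above; the function names are made up, and p is restricted to 1..31 here to keep both shifts well defined.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kAlpha = 2654435769u;

// Knuth multiplicative hash: the top p bits of (k * alpha) mod 2^32.
std::uint32_t knuth_high_bits(std::uint32_t k, int p) {
    assert(p >= 1 && p <= 31);
    return (k * kAlpha) >> (32 - p);
}

// The common mistake: keep the bottom p bits instead.
std::uint32_t low_bits(std::uint32_t k, int p) {
    assert(p >= 1 && p <= 31);
    return (k * kAlpha) & ((std::uint32_t{1} << p) - 1);
}

// Counts how many even keys in [0, limit) land in an odd slot of a 2^p table.
int odd_slots_hit_by_even_keys(std::uint32_t (*h)(std::uint32_t, int),
                               std::uint32_t limit, int p) {
    int count = 0;
    for (std::uint32_t k = 0; k < limit; k += 2) {
        if (h(k, p) % 2 == 1) ++count;
    }
    return count;
}
```

With p = 4, the keys 2 and 4 map to slots 3 and 7 when the high bits are taken. With the low bits, an even key times an odd constant is always even, so odd_slots_hit_by_even_keys(low_bits, 1000, 4) is 0: half of the table is never used.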
Bear in mind that most hash table implementations don't allow this kind of signature in their interface, as they only allow
uint32_t hash(int x);
and reduce hash(x) modulo 2^p to compute the hash value for x. Those hash tables cannot accept the Knuth multiplicative hash. This might be a reason why so many people completely ruined the algorithm by forgetting to take the higher p bits.
So you can't use the Knuth multiplicative hash with std::unordered_map or std::unordered_set. But I think that those hash tables use a prime number as a size, so the Knuth multiplicative hash is not useful in this case. Using hash(x) = x would be a good fit for those tables.
Source: "Introduction to Algorithms, third edition", Cormen et al., 13.3.2 p:263
Source: "The Art of Computer Programming, Volume 3, Sorting and Searching", D.E. Knuth, 6.4 p:516

Ok, I looked it up in TAOCP volume 3 (2nd edition), section 6.4, page 516.
This implementation is not correct, though as I mentioned in the comments it may give the correct result anyway.
A correct way (I think - feel free to read the relevant chapter of TAOCP and verify this) is something like this: (important: yes, you must shift the result right to reduce it, not use bitwise AND. However, that is not the responsibility of this function - range reduction is not properly part of hashing itself)
uint32_t hash(uint32_t v)
{
return v * UINT32_C(2654435761);
// do not comment about the lack of right shift. I'm not ignoring it. read on.
}
Note the uint32_t's (as opposed to int's) - they make sure the multiplication overflows modulo 2^32, as it is supposed to do if you choose 32 as the word size. There is also no right shift by k here, because there is no reason to give responsibility for range-reduction to the basic hashing function and it is actually more useful to get the full result. The constant 2654435761 is from the question, the actual suggested constant is 2654435769, but that's a small difference that as far as I know does not affect the quality of the hash.
Other valid implementations shift the result right by some amount (not the full word size though; that doesn't make sense and C++ doesn't like it), depending on how many bits of hash you need. Or they may use another constant (subject to certain conditions) or another word size. Reducing the hash modulo something is not a valid implementation but a common mistake, probably because it is the de facto standard way to do range reduction on a hash. The bottom bits of a multiplicative hash are the worst-quality bits (they depend on less of the input); you only want to use them if you really need more bits, whereas reducing the hash modulo a power of two would return only the worst bits. Indeed, that is equivalent to throwing away most of the input bits too. Reducing modulo a non-power-of-two is not so bad since it does mix in the higher bits, but it's not how the multiplicative hash was defined.
So to be clear, yes there is a right shift, but that is range reduction not hashing and can only be the responsibility of the hash table, since it depends on its internal size.
The type should be unsigned; otherwise the overflow is undefined behavior (thus possibly wrong, not just on non-2's-complement architectures but also with overly clever compilers) and the optional right shift would be a signed shift (wrong).
On the page I mention at the top, there is this formula:
h(K) = floor(M * (((A / w) * K) mod 1))
Here we have A = 2654435761 (or 2654435769), w = 2^32 and M = 2^32. Calculating AK/w gives a fixed-point result with the format Q32.32; the mod 1 step takes only the 32 fraction bits. But that's just the same thing as doing a modular multiplication and then saying that the result is the fraction bits. Of course, when multiplied by M, all the fraction bits become integer bits because of how M was chosen, and so it simplifies to just a plain old modular multiplication. When M is a lower power of two, that just right-shifts the result, as mentioned.
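As a sanity check of that argument, the sketch below evaluates the formula in the form it is usually stated, h(K) = floor(M * ((A/w) * K mod 1)) with w = 2^32 and M = 2^p, in 64-bit fixed point and compares it with the multiply-and-shift version. The constant 2654435769 and the function names are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kA = 2654435769u;

// Evaluates floor(M * ((A / w) * K mod 1)) with w = 2^32 and M = 2^p,
// using a 64-bit product as a Q32.32 fixed-point number.
std::uint32_t hash_by_formula(std::uint32_t key, int p) {
    assert(p >= 0 && p <= 32);
    const std::uint64_t product = std::uint64_t{kA} * key;  // A*K; dividing by w gives Q32.32
    const std::uint64_t fraction = product & 0xFFFFFFFFu;   // the "mod 1" step: fraction bits only
    return static_cast<std::uint32_t>((fraction << p) >> 32);  // floor(M * fraction / w)
}

// The simplified version: 32-bit modular multiplication, then a right shift.
std::uint32_t hash_by_shift(std::uint32_t key, int p) {
    assert(p >= 0 && p <= 32);
    if (p == 0) return 0;  // a table with a single slot; avoids a 32-bit shift by 32
    return (key * kA) >> (32 - p);
}

// Compares both versions for a range of keys and every table size 2^p.
bool formula_matches_shift(std::uint32_t key_limit) {
    for (std::uint32_t key = 0; key < key_limit; ++key) {
        for (int p = 0; p <= 32; ++p) {
            if (hash_by_formula(key, p) != hash_by_shift(key, p)) return false;
        }
    }
    return true;
}
```

For p = 32 the shift is zero and the result is the plain modular product, matching the case M = 2^32 discussed above.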

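Might be late, but here's a Java implementation of Knuth's method:
For a hashtable of size N:
public long hash(int key) {
    long l = 2654435769L;
    // keep only the low 32 bits of the product (multiplication modulo 2^32)
    long product = (key * l) & 0xFFFFFFFFL;
    // scale the 32-bit fraction to [0, N); taking % N of (key * l >> 32) would only use the integer part
    return (product * N) >>> 32;
}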

If the input argument is a pointer then I use this
#include <stdint.h>
uint32_t knuth_mul_hash(void* k) {
    uintptr_t v = (uintptr_t)k * UINT32_C(2654435761); // unsigned arithmetic wraps on overflow instead of being undefined
    v >>= ((sizeof(uintptr_t) - sizeof(uint32_t)) * 8); // Right-shift v by the size difference between a pointer and a 32-bit integer (0 on 32-bit x86, 32 on x64)
    return (uint32_t)(v & UINT32_MAX);
}
I usually use this as the default fallback hashing function in hashmap implementations, dictionaries, sets, etc...

Suggest any good hash function [duplicate]

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.

To simply sum the letters is not a good strategy because a permutation gives the same result.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while ((c = *str++)) /* extra parentheses mark the assignment as intentional */
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
More info here.
If you need more alternatives and some performance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions, e.g. the above works slightly better with ASCII input), which is the most usual scenario. If you have a known restricted domain (a fixed set of inputs) you can do better; see Fionn's answer.
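To illustrate the point about permutations, here is a small sketch comparing a plain character sum with djb2. It uses a 64-bit accumulator and std::string for convenience, which are choices made for this example and not part of the original djb2 code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Naive hash: sum of the character values. Order does not matter.
std::uint64_t sum_hash(const std::string& s) {
    std::uint64_t hash = 0;
    for (char ch : s) {
        hash += static_cast<unsigned char>(ch);
    }
    return hash;
}

// djb2: hash * 33 + c for each character. Order matters because earlier
// characters get multiplied by higher powers of 33.
std::uint64_t djb2(const std::string& s) {
    std::uint64_t hash = 5381;
    for (char ch : s) {
        hash = ((hash << 5) + hash) + static_cast<unsigned char>(ch);
    }
    return hash;
}

// "ab" and "ba" collide under the sum but not under djb2.
void check_permutation_example() {
    assert(sum_hash("ab") == sum_hash("ba"));
    assert(djb2("ab") != djb2("ba"));
}
```

Here sum_hash gives 195 for both "ab" and "ba", while djb2 gives 5863208 and 5863240 respectively.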

Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates an optimized hashing function for the input domain.

If you don't need it to be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast, has high diffusion, and is easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
If you do need a cryptographically secure hash, then I suggest SHA1 via OpenSSL.
http://www.openssl.org/docs/crypto/sha.html

A bit late, but here is a hashing function with an extremely low collision rate for the 64-bit version below, and ~almost~ as good for the 32-bit version:
uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
union { uint64_t h; uint8_t u[8]; } uu;
int i=0; uu.h=strlen(s);
while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
return uu.h; //64-bit
//return (uu.h+(uu.h>>32)); //32-bit
}
The hash values are also very evenly spread across the possible range, with no clumping that I could detect; this was checked using random strings only.
[edit]Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)

