What's a good hash function for English words?

What's a good hash function for English words? - c++

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.

To simply sum the letters is not a good strategy because a permutation gives the same result.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
More info here.
If you need more alternatives and some perfomance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions: eg the above works slightly better with ascii input), which is the most usual scenario. If you have a known restricted domain (set of inputs fixed) you can do better, see Fionn's answer.

Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates a optimized hashing function for the input domain.

If you don't need it be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast and has high diffusion. Easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
If you do need a cryptographically secure hash, then I suggest SHA1 via OpenSSL.
http://www.openssl.org/docs/crypto/sha.html

A bit late, but here is a hashing function with an extremely low collision rate for 64-bit version below, and ~almost~ as good for the 32-bit version:
uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
union { uint64_t h; uint8_t u[8]; } uu;
int i=0; uu.h=strlen(s);
while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
return uu.h; //64-bit
//return (uu.h+(uu.h>>32)); //32-bit
}
The hash-numbers are also very evenly spread across the possible range, with no clumping that I could detect - this was checked using the random strings only.
[edit]Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)

Related

Understanding bits and bytes of a hashing algorithm

In a question about a very simple hashing algorithm called djb2, the author wants to know why the number 33 is chosen in the algorithm (see below code in C).
unsigned long;
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++) //just the character
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
In the top answer, point 2 talks about the hashing accumulator and how it makes two copies of itself, and then it says something about the spreading.
Can someone explain what is meant by "copying itself" and the "spread" of answer 2?

The step 2 being references is this:
As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. Using a larger shift would duplicate fewer bits, using a smaller shift would keep bit interactions more local and make it take longer for the interactions to spread.
33 is 32+1. That means, thanks to multiplication being distributive, that hash * 33 = (hash * 32) + (hash * 1) - or in other words, make two copies of hash, shift one of them left by 5 bits, then add them together, which is what (hash << 5) + hash expresses in a more direct way.

How to hash a 96-bit struct/number?

So I can't figure out how to do this in C++. I need to do a modulus operation and integer conversion on data that is 96 bits in length.
Example:
struct Hash96bit
{
char x[12];
};
int main()
{
Hash96bit n;
// set n to something
int size = 23;
int result = n % size
}
Edit: I'm trying to have a 96 bit hash because i have 3 floats which when combined create a unique combination. Thought that would be best to use as the hash because you don't really have to process it at all.
Edit: Okay... so at this point I might as well explain the bigger issue. I have a 3D world that I want to subdivide into sectors, that way groups of objects can be placed in sectors that would make frustum culling and physics iterations take less time. So at the begging lets say you are at sector 0,0,0. Sure we store them all in array, cool, but what happens when we get far away from 0,0,0? We don't care about those sectors there anymore. So we use a hashmap since memory isn't an issue and because we will be accessing data with sector values rather than handles. Now a sector is 3 floats, hashing that could easily be done with any number of algorithms. I thought it might be better if I could just say the 3 floats together is the key and go from there, I just needed a way to mod a 96 bit number to fit it in the data segment. Anyway I think i'm just gonna take the bottom bits of each of these floats and use a 64 bit hash unless anyone comes up with something brilliant. Thank you for the advice so far.

UPDATE: Having just read your second edit to the question, I'd recommend you use David's jenkin's approach (which I upvoted a while back)... just point it at the lowest byte in your struct of three floats.
Regarding "Anyway I think i'm just gonna take the bottom bits of each of these floats" - again, the idea with a hash function used by a hash table is not just to map each bit in the input (less till some subset of them) to a bit in the hash output. You could easily end up with a lot of collisions that way, especially if the number of buckets is not a prime number. For example, if you take 21 bits from each float, and the number of buckets happens to be 1024 currently, then after % 1024 only 10 bits from one of the floats will be used with no regard to the values of the other floats... hash(a,b,c) == hash(d,e,c) for all c (it's actually a little worse than that - values like 5.5, 2.75 etc. will only use a couple bits of the mantissa....).
Since you're insisting on this (though it's very likely not what you need, and a misnomer to boot):
struct Hash96bit
{
union {
float f[3];
char x[12];
uint32_t u[3];
};
Hash96bit(float a, float b, float c)
{
f[0] = a;
f[1] = b;
f[2] = c;
}
// the operator will support your "int result = n % size;" usage...
operator uint128_t() const
{
return u[0] * ((uint128_t)1 << 64) + // arbitrary ordering
u[1] + ((uint128_t)1 << 32) +
u[2];
}
};

You can use jenkins hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}

Get 32-bit hash value from boost::hash

I am using boost::hash to get hash value for a string.
But it is giving different hash values for same string on Windows 32-bit and Debian 64-bit systems.
So how can I get same hash value (32-bit or 64-bit) using boost::hash irrespective of platform?

What is the guarantee concerning boost::hash? I don't see any
guarantees that a generated hash code is usable outside of the
process which generates it. (This is frequently the case with
hash functions.) If you need a hash value for external data,
valid over different programs and different platforms (e.g. for
a hashed access to data on disk), then you'll have to write your
own. Something like:
uint32_t
hash( std::string const& key )
{
uint32_t results = 12345;
for ( auto current = key.begin(); current != key.end(); ++ current ) {
results = 127 * results + static_cast<unsigned char>( *current );
}
return results;
}
should do the trick, as long as you don't have to worry about
porting to some exotic mainframes (which might not support
uint32_t).

Use some of the well-known universal hash functions such as SHA instead, because those are supposed to guarantee that the same string will have the same hash everywhere. Note that in case you are doing something security-related, SHA might be too fast. It's a strange thing to say, but sometimes fast does not mean good as it opens a possibility for a brute force attack - in this case, there are other, slower hash function, some of which basically re-apply SHA many times in a row. Another thing, if you are hashing passwords, remember to salt them (I won't go into details, but the information is readily accessible online).

Hash-function above is simple, but weak and vulnerable.
For example, pass to that function string like "bb" "bbbb" "bbddbb" "ddffbb" -- any combination of pairs symbols with even ASCII codes, and watch for low byte.
It always will be 57.
Rather, I recommend to use my hash function, which is relative lightweight,
and does not have easy vulnerabilities:
#define NLF(h, c) (rand[(uint8_t)(c ^ h)])
uint32_t rand[0x100] = { 256 random non-equal values };
uint32_t oleg_h(const char *key) {
uint32_t h = 0x1F351F35;
char c;
while(c = *key++)
h = ((h >> 11) | (h << (32 - 11))) + NLF(h, c);
h ^= h >> 16;
return h ^ (h >> 8);
}

Suggest any good hash function [duplicate]

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.

To simply sum the letters is not a good strategy because a permutation gives the same result.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
More info here.
If you need more alternatives and some perfomance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions: eg the above works slightly better with ascii input), which is the most usual scenario. If you have a known restricted domain (set of inputs fixed) you can do better, see Fionn's answer.

Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates a optimized hashing function for the input domain.

If you don't need it be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast and has high diffusion. Easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
If you do need a cryptographically secure hash, then I suggest SHA1 via OpenSSL.
http://www.openssl.org/docs/crypto/sha.html

A bit late, but here is a hashing function with an extremely low collision rate for 64-bit version below, and ~almost~ as good for the 32-bit version:
uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
union { uint64_t h; uint8_t u[8]; } uu;
int i=0; uu.h=strlen(s);
while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
return uu.h; //64-bit
//return (uu.h+(uu.h>>32)); //32-bit
}
The hash-numbers are also very evenly spread across the possible range, with no clumping that I could detect - this was checked using the random strings only.
[edit]Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)

quickest way to generate random bits

What would be the fastest way to generate a large number of (pseudo-)random bits. Each bit must be independent and be zero or one with equal probability. I could obviously do some variation on
randbit=rand()%2;
but I feel like there should be a faster way, generating several random bits from each call to the random number generator. Ideally I'd like to get an int or a char where each bit is random and independent, but other solutions are also possible.
The application is not cryptographic in nature so strong randomness isn't a major factor, whereas speed and getting the correct distribution is important.

convert a random number into binary
Why not get just one number (of appropriate size to get enough bits you need) and then convert it to binary. You'll actually get bits from a random number which means they are random as well.
Zeros and ones also have the probability of 50%, since taking all numbers between 0 and some 2^n limit and counting the number of zeros and ones are equal > meaning that probability of zeros and ones is the same.
regarding speed
this would probably be very fast, since getting just one random number compared to number of bits in it is faster. it purely depends on your binary conversion now.

Take a look at Boost.Random, especially boost::uniform_int<>.

As you say just generate random integers.
Then you have 32 random bits with ones and zeroes all equally probable.
Get the bits in a loop:
for (int i = 0; i < 32; i++)
{
randomBit = (randomNum >> i) & 1
...
// Do your thing
}
Repeat this for as many times you need to to get the correct amount of bits.

Here's a very fast one I coded in Java based on George Marsaglia's XORShift algorithm: gets you 64 bits at a time!
/**
* State for random number generation
*/
private static volatile long state=xorShift64(System.nanoTime()|0xCAFEBABE);
/**
* Gets a long random value
* #return Random long value based on static state
*/
public static final long nextLong() {
long a=state;
state = xorShift64(a);
return a;
}
/**
* XORShift algorithm - credit to George Marsaglia!
* #param a Initial state
* #return new state
*/
public static final long xorShift64(long a) {
a ^= (a << 21);
a ^= (a >>> 35);
a ^= (a << 4);
return a;
}

SMP Safe (i.e. Fastest way possiable these days) and good bits
Note the use of the [ThreadStatic] attribute, this object automatically handle's new thread's, no locking. That's the only way your going to ensure high-performance random, SMP lockfree.
http://blogs.msdn.com/pfxteam/archive/2009/02/19/9434171.aspx

If I rememeber correctly, the least significant bits are normally having a "less random"
distribution for most pseuodo random number generators, so using modulo and/or each bit in the generated number would be bad if you are worried about the distribution.
(Maybe you should at least google what Knuth says...)
If that holds ( and its hard to tell without knowing exactly what algorithm you are using) just use the highest bit in each generated number.
http://en.wikipedia.org/wiki/Pseudo-random

#include <iostream>
#include <bitset>
#include <random>
int main()
{
const size_t nrOfBits = 256;
std::bitset<nrOfBits> randomBits;
std::default_random_engine generator;
std::uniform_real_distribution<float> distribution(0.0,1.0);
float randNum;
for(size_t i = 0; i < nrOfBits; i++)
{
randNum = distribution(generator);
if(randNum < 0.5) {
randNum = 0;
} else {
randNum = 1;
}
randomBits.set(i, randNum);
}
std::cout << "randomBits =\n" << randomBits << std::endl;
return 0;
}
This took 4.5886e-05s in my test with 256 bits.

You can generate a random number and keep on right shifitng and testing the least significant bit to get the random bits instead of doing a mod operation.

How large do you need the number of generated bits to be? If it is not larger than a few million, and keeping in mind that you are not using the generator for cryptography, then I think the fastest possible way would be to precompute a large set of integers with the correct distribution, convert it to a text file like this:
unsigned int numbers[] =
{
0xABCDEF34, ...
};
and then compile the array into your program and go through it one number at a time.
That way you get 32 bits with every call (on a 32-bit processor), the generation time is the shortest possible because all the numbers are generated in advance, and the distribution is controlled by you. The downside is, of course, that these numbers are not random at all, which depending on what you are using the PRNG for may or may not matter.

if you only need a bit at a time try
bool coinToss()
{
return rand()&1;
} It would technically be a faster way to generate bits because of replacing the %2 with a &1 which are equivalent.

Just read some memory - take a n bit section of raw memory. It will be pretty random.
Alternatively, generate a large random int x and just use the bit values.
for(int i = (bitcount-1); i >= 0 ; i--) bin += x & (1 << i);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

What's a good hash function for English words? - c++

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.

Maybe something like this would help you: http://www.gnu.org/s/gperf/ It generates a optimized hashing function for the input domain.

Related

Understanding bits and bytes of a hashing algorithm

How to hash a 96-bit struct/number?

Get 32-bit hash value from boost::hash

Suggest any good hash function [duplicate]

quickest way to generate random bits

Categories

Resources