Get 32-bit hash value from boost::hash - c++

I am using boost::hash to get a hash value for a string.
But it gives different hash values for the same string on Windows 32-bit and Debian 64-bit systems.
So how can I get the same hash value (32-bit or 64-bit) from boost::hash irrespective of platform?

What is the guarantee concerning boost::hash? I don't see any
guarantees that a generated hash code is usable outside of the
process which generates it. (This is frequently the case with
hash functions.) If you need a hash value for external data,
valid over different programs and different platforms (e.g. for
a hashed access to data on disk), then you'll have to write your
own. Something like:
uint32_t
hash( std::string const& key )
{
uint32_t results = 12345;
for ( auto current = key.begin(); current != key.end(); ++ current ) {
results = 127 * results + static_cast<unsigned char>( *current );
}
return results;
}
should do the trick, as long as you don't have to worry about
porting to some exotic mainframes (which might not support
uint32_t).
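If you would rather not invent the constants yourself, the same idea works with a published algorithm. The following is a minimal sketch of 32-bit FNV-1a (a swapped-in technique, not something boost::hash provides); the function name fnv1a_32 is my own choice. Because every operation is done on a fixed-width type, the result should not depend on the word size of the platform.

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
// All arithmetic is on std::uint32_t, so the result does not depend on the
// width of size_t or unsigned long on the target platform.
std::uint32_t fnv1a_32(const std::string& key)
{
    std::uint32_t h = 2166136261u;      // FNV offset basis
    for (unsigned char byte : key) {
        h ^= byte;
        h *= 16777619u;                 // FNV prime
    }
    return h;
}
```

With this, fnv1a_32("a") gives 0xe40c292c on both 32-bit and 64-bit builds.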

Use one of the well-known standardized hash functions such as SHA instead, because those are specified to produce the same hash for the same string everywhere. Note that if you are doing something security-related, SHA might be too fast. It's a strange thing to say, but sometimes fast does not mean good, as it opens up the possibility of a brute force attack. For that case there are other, slower hash functions, some of which basically re-apply SHA many times in a row. Another thing: if you are hashing passwords, remember to salt them (I won't go into details, but the information is readily accessible online).

The hash function above is simple, but weak and vulnerable.
For example, pass it strings like "bb", "bbbb", "bbddbb" or "ddffbb" (any combination of doubled characters with even ASCII codes) and watch the low byte of the result.
It will always be 57.
Instead, I recommend my hash function, which is relatively lightweight
and has no such easy vulnerabilities:
#define NLF(h, c) (rand[(uint8_t)(c ^ h)])
uint32_t rand[0x100] = { 256 random non-equal values };
uint32_t oleg_h(const char *key) {
uint32_t h = 0x1F351F35;
char c;
while(c = *key++)
h = ((h >> 11) | (h << (32 - 11))) + NLF(h, c);
h ^= h >> 16;
return h ^ (h >> 8);
}
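The low-byte weakness described for the multiply-by-127 hash can be checked directly. This is a small sketch that reproduces that hash under the hypothetical name hash127 and extracts its low byte; low_byte is also a name made up for this check.

```cpp
#include <cstdint>
#include <string>

// The multiply-by-127 hash from the first answer, reproduced for the check.
std::uint32_t hash127(const std::string& key)
{
    std::uint32_t results = 12345;
    for (unsigned char c : key)
        results = 127 * results + c;
    return results;
}

// Low byte of the hash. 127 * 127 == 16129 == 63 * 256 + 1, so two steps
// multiply the low byte by 1, and a doubled character c adds 128 * c, which
// is a multiple of 256 whenever c is even.
std::uint8_t low_byte(const std::string& key)
{
    return static_cast<std::uint8_t>(hash127(key) & 0xFF);
}
```

The starting value 12345 has low byte 57 (12345 == 48 * 256 + 57), and each doubled even character leaves that byte unchanged, which is why low_byte returns 57 for all four example strings.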

Related

Hash algorithm for string of characters using XOR and bit shift

I was given this algorithm to write a hash function:
BEGIN Hash (string)
UNSIGNED INTEGER key = 0;
FOR_EACH character IN string
key = ((key << 5) + key) ^ character;
END FOR_EACH
RETURN key;
END Hash
The << operator shifts bits to the left, ^ is the XOR operation, and character is the ASCII value of the character. Seems pretty straightforward.
Below is my code
unsigned int key = 0;
for (int i = 0; i < data.length(); i++) {
key = ((key<<5) + key) ^ (int)data[i];
}
return key;
However, I keep getting ridiculously large positive and negative numbers when I should actually get a hash value from 0 to n, where n is a value set by the user beforehand. I'm not sure where things went wrong, but I'm thinking it could be the XOR operation.
Any suggestions or opinions will be greatly appreciated. Thanks!
The output of this code is a 32-bit (or 64-bit or however wide your unsigned int is) unsigned integer. To restrict it to the range from 0 to n−1, simply reduce it modulo n, using the % operator:
unsigned int hash = key % n;
(It should be obvious that your code, as written, cannot return "a hash value from 0 - n", since n does not appear anywhere in your code.)
In fact, there's a good reason not to reduce the hash value modulo n too soon: if you ever need to grow your hash, storing the unreduced hash codes of your strings saves you the effort of recalculating them whenever n changes.
Finally, a few general notes on your hash function:
As Joachim Pileborg comments above, the explicit (int) cast is unnecessary. If you want to keep it for clarity, it really should say (unsigned int) to match the type of key, since that's what the value actually gets converted into.
For unsigned integer types, ((key<<5) + key) is equal to 33 * key (since shifting left by 5 bits is the same as multiplying by 2^5 = 32). On modern CPUs, using multiplication is almost certainly faster; on old or very low-end processors with slow multiplication, it's likely that any decent compiler will optimize multiplication by a constant into a combination of shifts and adds anyway. Thus, either way, expressing the operation as a multiplication is IMO preferable.
You don't want to call data.length() on every iteration of the loop. Call it once before the loop and store the result in a variable.
Initializing key to zero means that your hash value is not affected by any leading zero bytes in the string. The original version of your hash function, due to Dan Bernstein, uses a (more or less random) initial value of 5381 instead.
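Putting the notes above together, a sketch might look like the following. The names hash_string and bucket_for are hypothetical, and the cast goes through unsigned char (an assumption on my part, so that bytes above 127 are not sign-extended); for plain ASCII it gives the same value as the original cast.

```cpp
#include <cstddef>
#include <string>

// Full-width hash: same recurrence as in the question, written as a
// multiplication and seeded with 5381 instead of 0.
unsigned int hash_string(const std::string& data)
{
    unsigned int key = 5381;
    const std::size_t length = data.length();   // read once, outside the loop
    for (std::size_t i = 0; i < length; ++i)
        key = (33 * key) ^ static_cast<unsigned char>(data[i]);
    return key;
}

// Reduce to a bucket index in [0, n) only at the point of use.
unsigned int bucket_for(unsigned int hash, unsigned int n)
{
    return hash % n;
}
```

Keeping hash_string and bucket_for separate means the unreduced value can be stored alongside the string and reused when n changes.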

How to hash a 96-bit struct/number?

So I can't figure out how to do this in C++. I need to do a modulus operation and integer conversion on data that is 96 bits in length.
Example:
struct Hash96bit
{
char x[12];
};
int main()
{
Hash96bit n;
// set n to something
int size = 23;
int result = n % size
}
Edit: I'm trying to have a 96-bit hash because I have 3 floats which, when combined, create a unique combination. I thought that would be best to use as the hash because you don't really have to process it at all.
Edit: Okay... so at this point I might as well explain the bigger issue. I have a 3D world that I want to subdivide into sectors, so that groups of objects can be placed in sectors, which would make frustum culling and physics iterations take less time. So at the beginning, let's say you are at sector 0,0,0. Sure, we store them all in an array, cool, but what happens when we get far away from 0,0,0? We don't care about those sectors anymore. So we use a hashmap, since memory isn't an issue and because we will be accessing data with sector values rather than handles. Now a sector is 3 floats, and hashing that could easily be done with any number of algorithms. I thought it might be better if I could just say the 3 floats together are the key and go from there; I just needed a way to mod a 96-bit number to fit it in the data segment. Anyway, I think I'm just going to take the bottom bits of each of these floats and use a 64-bit hash unless anyone comes up with something brilliant. Thank you for the advice so far.
UPDATE: Having just read your second edit to the question, I'd recommend you use the Jenkins approach from David's answer (which I upvoted a while back)... just point it at the first byte of your struct of three floats.
Regarding "Anyway I think i'm just gonna take the bottom bits of each of these floats" - again, the idea with a hash function used by a hash table is not just to map each bit in the input (let alone only some subset of them) to a bit in the hash output. You could easily end up with a lot of collisions that way, especially if the number of buckets is not a prime number. For example, if you take 21 bits from each float, and the number of buckets happens to be 1024 currently, then after % 1024 only 10 bits from one of the floats will be used with no regard to the values of the other floats... hash(a,b,c) == hash(d,e,c) for all c (it's actually a little worse than that - values like 5.5, 2.75 etc. will only use a couple of bits of the mantissa...).
Since you're insisting on this (though it's very likely not what you need, and a misnomer to boot):
struct Hash96bit
{
union {
float f[3];
char x[12];
uint32_t u[3];
};
Hash96bit(float a, float b, float c)
{
f[0] = a;
f[1] = b;
f[2] = c;
}
// the operator will support your "int result = n % size;" usage...
// (uint128_t is not a standard type; substitute e.g. unsigned __int128 on GCC/Clang)
operator uint128_t() const
{
return u[0] * ((uint128_t)1 << 64) + // arbitrary ordering
u[1] * ((uint128_t)1 << 32) +
u[2];
}
};
You can use the Jenkins one-at-a-time hash.
uint32_t jenkins_one_at_a_time_hash(char *key, size_t len)
{
uint32_t hash, i;
for(hash = i = 0; i < len; ++i)
{
hash += (unsigned char)key[i]; /* avoid sign extension where char is signed */
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}
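Applied to the sector problem from the question, this could be sketched as follows. The Sector struct and hash_sector are hypothetical names, and the byte-wise function is restated over unsigned char so the sketch stands on its own. The floats are copied with memcpy rather than read through a union or a cast pointer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Jenkins one-at-a-time over raw bytes (unsigned, so no sign extension).
std::uint32_t one_at_a_time(const unsigned char* key, std::size_t len)
{
    std::uint32_t hash = 0;
    for (std::size_t i = 0; i < len; ++i) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}

// Hypothetical sector key: three floats, as described in the question.
struct Sector { float x, y, z; };

std::uint32_t hash_sector(const Sector& s)
{
    const float coords[3] = { s.x, s.y, s.z };
    unsigned char bytes[sizeof coords];
    std::memcpy(bytes, coords, sizeof coords);  // copy the object representation
    return one_at_a_time(bytes, sizeof bytes);
}
```

The bucket index is then hash_sector(s) % bucket_count. One caveat with hashing the raw bytes: 0.0f and -0.0f compare equal but have different byte patterns, so they are treated as different keys unless you normalize them first. The value also depends on the platform's float layout, which is fine for an in-memory table.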

Suggest any good hash function [duplicate]

I have a long list of English words and I would like to hash them. What would be a good hashing function? So far my hashing function sums the ASCII values of the letters then modulo the table size. I'm looking for something efficient and simple.
To simply sum the letters is not a good strategy because a permutation gives the same result.
This one (djb2) is quite popular and works nicely with ASCII strings.
unsigned long hashstring(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
More info here.
If you need more alternatives and some performance measures, read here.
Added: These are general hashing functions, where the input domain is not known in advance (except perhaps some very general assumptions: eg the above works slightly better with ascii input), which is the most usual scenario. If you have a known restricted domain (set of inputs fixed) you can do better, see Fionn's answer.
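The permutation problem with summing the letters is easy to see side by side with djb2. A small sketch, with sum_hash as a hypothetical stand-in for the additive hash from the question and djb2 rewritten to take a std::string:

```cpp
#include <string>

// Additive hash from the question: the order of the letters is lost.
unsigned long sum_hash(const std::string& word)
{
    unsigned long hash = 0;
    for (unsigned char c : word)
        hash += c;
    return hash;
}

// djb2 as in the answer above, taking a std::string for convenience.
unsigned long djb2(const std::string& word)
{
    unsigned long hash = 5381;
    for (unsigned char c : word)
        hash = hash * 33 + c;
    return hash;
}
```

For "ab" and "ba", sum_hash returns 195 both times, while djb2 returns 5863208 and 5863240, because each multiplication by 33 makes the position of a letter matter.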
Maybe something like this would help you: http://www.gnu.org/s/gperf/
It generates an optimized hashing function for the input domain.
If you don't need it to be cryptographically secure, I would suggest the Murmur Hash. It's extremely fast, has high diffusion, and is easy to use.
http://en.wikipedia.org/wiki/MurmurHash
http://code.google.com/p/smhasher/wiki/MurmurHash3
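If you do need a cryptographically secure hash, then I suggest a SHA hash via OpenSSL (SHA-1, as originally suggested here, is no longer considered collision-resistant; SHA-256 is the usual choice today).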
http://www.openssl.org/docs/crypto/sha.html
A bit late, but here is a hashing function with an extremely low collision rate for 64-bit version below, and ~almost~ as good for the 32-bit version:
uint64_t slash_hash(const char *s)
//uint32_t slash_hash(const char *s)
{
union { uint64_t h; uint8_t u[8]; } uu;
int i=0; uu.h=strlen(s);
while (*s) { uu.u[i%8] += *s + i + (*s >> ((uu.h/(i+1)) % 5)); s++; i++; }
return uu.h; //64-bit
//return (uu.h+(uu.h>>32)); //32-bit
}
The hash-numbers are also very evenly spread across the possible range, with no clumping that I could detect - this was checked using the random strings only.
[edit]Also tested against words extracted from local text-files combined with LibreOffice dictionary/thesaurus words (English and French - more than 97000 words and constructs) with 0 collisions in 64-bit and 1 collision in 32-bit :)
(Also compared with FNV1A_Hash_Yorikke, djb2 and MurmurHash2 on same sets: Yorikke & djb2 did not do well; slash_hash did slightly better than MurmurHash2 in all the tests)

Best way to merge hex strings in c++? [heavily edited]

I have two hex strings, accompanied by masks, that I would like to merge into a single string value/mask pair. The strings may have bytes that overlap, but after applying the masks, no overlapping bits should contradict each other. For example, value1 = 0x0A, mask1 = 0xFE and value2 = 0x0B, mask2 = 0x0F say that the resulting merge must have an upper nibble of all 0s and a lower nibble of 1011.
I've done this already in plain C, converting strings to byte arrays and memcpy'ing into buffers as a prototype. It's tested and seems to work. However, it's ugly and hard to read, and it doesn't throw exceptions for specific bit requirements that contradict each other. I've considered using bitsets, but is there another way that might not demand the conversion overhead? Performance would be nice, but is not crucial.
EDIT: More detail, although writing this makes me realize I've made a simple problem too difficult. But, here it is, anyway.
I am given a large number of inputs that are binary searches of a mixed-content document. The document is broken into pages, and pages are provided by an API that delivers a single page at a time. Each page needs to be searched with the provided search terms.
I have all the search terms prior to requesting pages. The inputs are strings representing hex digits (this is what I mean by hex strings) as well as a mask to indicate which bits are significant in the input hex string. Since I'm given all input up front, I wanted to improve the search of each page returned by pre-processing and merging these hex strings together. To make the problem more interesting, every string has an optional offset into the page where it must appear, and the lack of an offset indicates that the string can appear anywhere in a requested page. So, something like this:
class Input {
public:
int input_id;
std::string value;
std::string mask;
bool offset_present;
unsigned int offset;
};
If a given Input object has offset_present = false, then any value assigned to offset is ignored. If offset_present is false, then it clearly can't be merged with other inputs.
To make the problem more interesting, I want to report an output that provides information about what was found (input_id that was found, where the offset was, etc). Merging some input (but not others) makes this a bit more difficult.
I had considered defining a CompositeInput class with a bitset as the underlying merger, but further reading about bitsets made me realize they weren't what I thought. My inexperience made me give up on the composite idea and go brute force. I necessarily skipped some details about other input types and additional information to be collected for the output (say, page number, paragraph number) when an input is found. Here's an example output class:
class Output {
public:
Output();
int id_result;
unsigned int offset_result;
};
I would want to produce N of these if I merge N hex strings, keeping any merger details hidden from the user.
I don't know what a hexstring is... but other than that it should be like this:
outcome = (value1 & mask1) | (value2 & mask2);
it sounds like |, & and ~ would work?
const size_t prefix = 2; // "0x"
const size_t bytes = 2;
const char* value1 = "0x0A";
const char* mask1 = "0xFE";
const char* value2 = "0x0B";
const char* mask2 = "0x0F";
char output[prefix + bytes + 1] = "0x";
uint8_t char2int[] = { /*zeroes until index '0'*/ 0,1,2,3,4,5,6,7,8,9 /*...*/ 10,11,12,13,14,15 };
char int2char[] = { '0', /*...*/ 'F' };
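for (size_t ii = prefix; ii != prefix + bytes; ++ii)
{
// decode one hex digit of each value and mask
uint8_t v1 = char2int[(unsigned char)value1[ii]], m1 = char2int[(unsigned char)mask1[ii]];
uint8_t v2 = char2int[(unsigned char)value2[ii]], m2 = char2int[(unsigned char)mask2[ii]];
// a conflict exists only where both masks select a bit and the values differ
if ((v1 ^ v2) & m1 & m2)
throw invalid_argument("conflicting bits");
output[ii] = int2char[(v1 & m1) | (v2 & m2)];
}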
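Once the strings are decoded to bytes, the merge rule itself is small. Here is a minimal sketch for a single byte, assuming a hypothetical MaskedByte struct and merge function (neither is from the question). Two inputs conflict only on bits that both masks mark as significant and where the values differ; the merged mask is the union of the two masks.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical value/mask pair for a single byte.
struct MaskedByte {
    std::uint8_t value;
    std::uint8_t mask;
};

MaskedByte merge(MaskedByte a, MaskedByte b)
{
    // Bits required by both inputs must agree.
    const std::uint8_t both = a.mask & b.mask;
    if ((a.value ^ b.value) & both)
        throw std::invalid_argument("conflicting bits");

    MaskedByte out;
    out.value = static_cast<std::uint8_t>((a.value & a.mask) | (b.value & b.mask));
    out.mask  = static_cast<std::uint8_t>(a.mask | b.mask);
    return out;
}
```

For the example in the question, merging {0x0A, 0xFE} with {0x0B, 0x0F} gives value 0x0B and mask 0xFF. If mask1 were 0xFF instead, bit 0 would be required to be both 0 and 1, and merge would throw.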