Best string hashing function for short filenames - c++

What would be the best string hashing function for say filename like strings?
The strings would be similar to:
pics/test.pic
maps/test.map
materials/metal.mtl

If the nature of the data to be hashed doesn't require anything fancy (and short textual strings usually don't), you may want to try the FNV hashing function. The FNV hash, short for Fowler/Noll/Vo in honor of its creators, is a very fast algorithm that has been used in many applications with excellent results, and given its simplicity, it should be one of the first hashes tried in an application.
unsigned int fnv_hash (const void* key, int len)
{
    const unsigned char* p = static_cast<const unsigned char*>(key); /* C++ needs an explicit cast from void* */
    unsigned int h = 2166136261u; /* FNV offset basis */
    for (int i = 0; i < len; i++)
        h = (h * 16777619u) ^ p[i]; /* multiply by the FNV prime, then xor in the next byte */
    return h;
}
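For example, a minimal driver (hypothetical, assuming the fnv_hash above) on the asker's sample paths:
#include <cstdio>
#include <cstring>

int main()
{
    const char* paths[] = { "pics/test.pic", "maps/test.map", "materials/metal.mtl" };
    for (const char* p : paths)
        std::printf("%-22s -> %u\n", p, fnv_hash(p, (int)std::strlen(p)));
    return 0;
}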
Or go with the MD5 algorithm instead, which is general-purpose and thus covers your needs quite well.

There is no universally "best" hashing function independent of how the hash is used.
Let's suppose you want a 32-bit int in order to use a small in-memory hash table.
Then you can use the FNV-1a algorithm:
hash = offset_basis
for each octet_of_data to be hashed
    hash = hash xor octet_of_data
    hash = hash * FNV_prime
return hash
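In C++, a minimal sketch of that pseudocode (32-bit FNV-1a, using the standard offset basis and prime) might look like:
#include <cstdint>
#include <string>

std::uint32_t fnv1a(const std::string& s)
{
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (unsigned char c : s) {
        hash ^= c;          // hash = hash xor octet_of_data
        hash *= 16777619u;  // hash = hash * FNV_prime
    }
    return hash;
}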
If your purpose is to be confident that two different paths give different hashes, then you can use the SHA-1 algorithm.
If you want it to be very hard to maliciously create collisions, then use SHA-256.
Note that these last two algorithms generate long hashes (longer than your typical path).

Just use std::hash<std::string>. That's your library implementer's idea of the 'best' general purpose, non-cryptographic hash function.
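For example (the exact values are implementation-defined, so this is only a sketch):
#include <functional>
#include <iostream>
#include <string>

int main()
{
    std::hash<std::string> hasher;  // the library's general-purpose string hash
    std::cout << hasher("pics/test.pic") << '\n'
              << hasher("maps/test.map") << '\n';
    return 0;
}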

Related

Hash Function without collision

Basically I'm using the hash function used in Rabin-Karp.
It's the same function as in Fast implementation of Rolling hash, but instead of hashing a string, I am hashing a vector of integers.
const unsigned PRIME_BASE = 257;
const unsigned PRIME_MOD = 1000000007;
unsigned hash(const std::vector< unsigned int >& Line)
{
    unsigned long long ret = 0;
    for (std::size_t i = 0; i < Line.size(); i++)
    {
        ret = ret * PRIME_BASE + Line[i];
        ret %= PRIME_MOD; // keep the accumulator below the modulus
    }
    return (unsigned)ret; // always < PRIME_MOD, so this fits in 32 bits
}
The problem is that I'm getting lots of collisions. Changing the prime numbers can reduce or increase the number of collisions, but I can't avoid them entirely.
Any ideas on how to avoid collisions with such a function, or a better one?
You don't.
The whole point of a hash is to take an input from a large domain, and produce an output in a smaller domain.
Collisions are, by the very nature of the process, inevitable and unavoidable.
You can try to reduce their likelihood, for some particular class of dataset, but you've already explored doing that.
You can make things a little better (decrease the chance of collisions) by adding more hash functions. E.g.:
create two hash functions with different PRIME_BASE and PRIME_MOD values, and store pairs of hashes; a sketch follows below. A collision then requires both hashes to collide at once.
Another problem arises if the Line vector stores many zeros, so it's better to add some random shift (fixed after initialization) to the values.
E.g., with Rabin-Karp, if you want to hash 'A' and 'AA', it's better to add a shift value; otherwise both strings hash to 0 (assuming you convert the characters like f(char c) { return c - 'A'; }).
Another interesting point: if you choose a good hash function (good in the random sense) and your inputs are also random, then collisions shouldn't happen as long as the number of distinct items in the Line vector is less than the square root of the hash function's range; this is the birthday paradox. Your current range is 1e9+7, whose square root is around 3e4. If you use two hash functions, the combined range is the product of their ranges.
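A minimal sketch of the two-hash idea (the second base/modulus pair here is an arbitrary choice of primes, not from the question):
#include <cstdint>
#include <utility>
#include <vector>

// Two independent Rabin-Karp style hashes; a collision must now occur
// under both moduli simultaneously, which is far less likely.
std::pair<std::uint64_t, std::uint64_t>
double_hash(const std::vector<unsigned int>& line)
{
    const std::uint64_t BASE1 = 257, MOD1 = 1000000007ULL;
    const std::uint64_t BASE2 = 263, MOD2 = 998244353ULL; // second pair: arbitrary primes
    std::uint64_t h1 = 0, h2 = 0;
    for (unsigned int v : line) {
        h1 = (h1 * BASE1 + v) % MOD1;
        h2 = (h2 * BASE2 + v) % MOD2;
    }
    return std::make_pair(h1, h2);
}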

Hash value for a std::unordered_map

According to the standard there's no support for containers (let alone unordered ones) in the std::hash class. So I wonder how to implement that. What I have is:
std::unordered_map<std::wstring, std::wstring> _properties;
std::wstring _class;
I thought about iterating the entries, computing the individual hashes for keys and values (via std::hash<std::wstring>) and concatenate the results somehow.
What would be a good way to do that and does it matter if the order in the map is not defined?
Note: I don't want to use boost.
A simple XOR was suggested, so it would be like this:
size_t MyClass::GetHashCode()
{
    std::hash<std::wstring> stringHash;
    size_t mapHash = 0;
    for (const auto& property : _properties)
        mapHash ^= stringHash(property.first) ^ stringHash(property.second);
    return ((_class.empty() ? 0 : stringHash(_class)) * 397) ^ mapHash;
}
?
I'm really unsure if that simple XOR is enough.
If by enough, you mean whether or not your function is injective, the answer is No. The reasoning is that the set of all hash values your function can output has cardinality 2^64, while the space of your inputs is much larger. However, this is not really important, because you can't have an injective hash function given the nature of your inputs. A good hash function has these qualities:
It's not easily invertible. Given the output k, it's not computationally feasible within the lifetime of the universe to find m such that h(m) = k.
The range is uniformly distributed over the output space.
It's hard to find two inputs m and m' such that h(m) = h(m')
Of course, the extents of these really depend on whether you want something that's cryptographically secure, or you want to take some arbitrary chunk of data and just send it to some arbitrary 64-bit integer. If you want something cryptographically secure, writing it yourself is not a good idea. In that case, you'd also need the guarantee that the function is sensitive to small changes in the input. The std::hash function object is not required to be cryptographically secure. It exists for use cases isomorphic to hash tables. cppreference.com says:
For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
I'll show below how your current solution doesn't really guarantee this.
Collisions
I'll give you a few of my observations on a variant of your solution (I don't know what your _class member is).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
    std::hash<std::string> h;
    std::size_t result = 0;
    for (auto&& p : m) {
        result ^= h(p.first) ^ h(p.second);
    }
    return result;
}
It's easy to generate collisions. Consider the following maps:
std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';
On my machine, compiling with g++ 4.9.1, this outputs:
1225586629984767119
1225586629984767119
Whether this matters or not depends on your data. What's relevant is how often you're going to have maps where keys and values are reversed; these collisions will occur between any two maps in which the sets of keys and values are the same.
Order of Iteration
Two unordered_map instances holding exactly the same key-value pairs will not necessarily have the same order of iteration. cppreference.com says:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2).
This is a trivial requirement for a hash function. Your solution meets it even though the order of iteration is unspecified, because XOR is commutative, so the order in which the pairs are combined doesn't matter.
A Possible Solution
If you don't need something that's cryptographically secure, you can modify your solution slightly to kill the symmetry between keys and values. This approach is okay in practice for hash tables and the like, and it stays independent of the fact that order in an unordered_map is undefined, because it uses the same property your solution used (commutativity of XOR).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
    const std::size_t prime = 19937;
    std::hash<std::string> h;
    std::size_t result = 0;
    for (auto&& p : m) {
        result ^= prime * h(p.first) + h(p.second);
    }
    return result;
}
All you need in a hash function in this case is a way to map a key-value pair to an arbitrary good hash value, and a way to combine the hashes of the key-value pairs using a commutative operation. That way, order does not matter. In the example hash_code I wrote, the key-value pair hash value is just a linear combination of the hash of the key and the hash of the value. You can construct something a bit more intricate, but there's no need for that.
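Re-running the earlier collision example against this version, the two maps should now hash differently, since swapping a key with its value changes prime*h(key) + h(value):
std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
// prime*h("123") + h("456")  vs  prime*h("456") + h("123"):
// these coincide only if the underlying string hashes conspire.
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';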

String to Integer Hashing Function with Precision

I want to hash a char array in to an int or a long. The resulting value has to adhere to a given precision value.
The function I've been using is given below:
#include <cmath>

int GetHash(const char* zKey, int iPrecision /*= 6*/)
{
    ///// FROM: http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp
    unsigned long h = 0;
    unsigned long M = (unsigned long)std::pow(10.0, iPrecision); // 10^iPrecision bounds the output
    while (*zKey)
    {
        h = (h << 4) + *zKey++;
        unsigned long g = h & 0xF0000000L;
        if (g) h ^= g >> 24;
        h &= ~g;
    }
    return (int)(h % M);
}
The string to be hashed is similar to "SAEUI1210.00000010_1".
However, this produces duplicate values in some cases.
Are there any good alternatives that wouldn't produce the same hash for different string values?
The very definition of a hash is that it produces duplicate values for some inputs, because the hash's value range is smaller than the space of the hashed data.
In theory, a 32-bit hash has enough range to hash all ~6-character strings (A-Z, a-z, 0-9 only) without causing a collision. In practice, hashes are not a perfect permutation of the input: given a 32-bit hash, you can expect collisions after hashing roughly 2^16 (about 65,000) random inputs, due to the birthday paradox.
Given a static set of data values, it's always possible to construct a hash function designed specifically for them which will never collide with itself (of course, the size of its output will be at least log(|data set|)). However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.
That being said, here are a few alternatives which should get you started (they are designed to minimize collisions)
Every hash will have collisions. Period. That's the birthday problem.
You may want to check cryptographic hash functions like MD5 (relatively fast, if you don't care that it's insecure), but they will also have collisions.
Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have this additional constraint of precision (0-5 ?) then you are going to hit collisions far more often.
MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.

specialised hash table c++

I need to count a lot of different items. I'm processing a list of pairs such as:
A34223,34
B23423,-23
23423212,16
What I was planning to do was hash the first value (the key) into a 32-bit integer, which would then be a key into a sparse structure where the 'value' gets added to a running total (all totals start at zero, and may go negative).
Given that the keys are short and alphanumeric, is there a way to generate a hash algorithm that is fast on 32-bit x86 architectures? Or is there an existing suitable hash?
I don't know anything about the design of hashes, but I was hoping that, due to the simple input, there would be a way of generating a high-performance hash that guarantees no collisions for a given key length of "X" and has high enough dispersion to minimize collisions when the length exceeds "X".
As you are using C++, the first thing you should do is create a trivial implementation using std::map. Is it fast enough (it probably will be)? If so, stick with it; otherwise investigate whether your C++ implementation provides a hash table. If it does, use it to create a trivial implementation, then test and time it. Is it fast enough (almost certainly yes)?
Only after you have exhausted these options should you think of implementing your own hash table and hashing function.
A guarantee for no collisions is difficult.
In your case, the keys
A34223
B23423
23423212
can be translated to 32-bit integers with little effort.
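For example, case-insensitive alphanumeric keys of up to six characters fit uniquely into 32 bits if you treat them as base-37 numbers (37^6 < 2^32); a hedged sketch, assuming all keys match that pattern:
#include <cctype>
#include <cstdint>

// Encode a key of up to six case-insensitive alphanumeric characters as a
// unique 32-bit integer: digits map to 1..10, letters to 11..36, and 0 means
// "no character", so distinct keys always get distinct values. Longer or
// non-alphanumeric keys would need a real hash instead.
std::uint32_t encode_key(const char* key)
{
    std::uint32_t v = 0;
    for (int i = 0; i < 6 && key[i]; ++i) {
        unsigned char c = (unsigned char)key[i];
        std::uint32_t d = std::isdigit(c)
            ? (std::uint32_t)(c - '0' + 1)
            : (std::uint32_t)(std::toupper(c) - 'A' + 11);
        v = v * 37 + d;
    }
    return v;
}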
And here is a good function that generates hashes from strings:
/**
 * "The Practice of Programming", Hash Tables, section 2.9, pg. 57
 *
 * computes hash value of string
 */
unsigned int
strhash( const char* str )
{
    //#define MULTIPLIER 31 or 37
    unsigned int h = 0;
    for ( const unsigned char* p = (const unsigned char*)str; *p != '\0'; p++ )
        h = 31 * h + *p; // <- FIXED MULTIPLIER
    return h;
}
Check Bob Jenkins's website for good hash functions. IIRC it is the same hash that is used in Perl.

What's the best hashing algorithm to use on a stl string when using hash_map?

I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high-performance lookups. What are some good examples of fast and efficient hashing algorithms that should avoid most collisions?
I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.
unsigned int
hash(
    const char* s,
    unsigned int seed = 0)
{
    unsigned int hash = seed;
    while (*s)
    {
        hash = hash * 101 + *s++;
    }
    return hash;
}
From some old code of mine:
/* magic numbers from http://www.isthe.com/chongo/tech/comp/fnv/ */
static const size_t InitialFNV = 2166136261U;
static const size_t FNVMultiple = 16777619;

/* Fowler / Noll / Vo (FNV) Hash */
size_t myhash(const std::string& s)
{
    size_t hash = InitialFNV;
    for (size_t i = 0; i < s.length(); i++)
    {
        hash = hash ^ (s[i]);      /* xor the low 8 bits */
        hash = hash * FNVMultiple; /* multiply by the magic number */
    }
    return hash;
}
It's fast. Really freaking fast.
That always depends on your data set.
I, for one, had surprisingly good results using the CRC32 of the string. It works very well with a wide range of different input sets.
Lots of good CRC32 implementations are easy to find on the net.
Edit: Almost forgot: This page has a nice hash-function shootout with performance numbers and test-data:
http://smallcode.weblogs.us/ <-- further down the page.
Boost has a boost::hash library which provides some basic hash functions for most common types.
I've used the Jenkins hash to write a Bloom filter library; it has great performance.
Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c
This is what Perl uses for its hashing operation, fwiw.
If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer (and translation of keywords to tokens) is a common usage of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_map with a simple array or vector.
If you're not hashing a fixed set of words, then obviously this doesn't apply.
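As a toy illustration of the idea (hand-rolled, not gperf output; the keyword set and the slot function are made up for this sketch):
#include <cstring>

// Toy perfect hash for the fixed keyword set {"if", "else", "while", "return"}:
// (strlen(word) + word[0]) % 7 maps these four words to the distinct slots
// 2, 0, 5 and 1, so membership testing is a single probe plus one strcmp.
bool is_keyword(const char* word)
{
    static const char* const table[7] =
        { "else", "return", "if", 0, 0, "while", 0 };
    unsigned slot = (unsigned)((std::strlen(word) + (unsigned char)word[0]) % 7);
    return table[slot] && std::strcmp(table[slot], word) == 0;
}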
One classic suggestion for a string hash is to step through the letters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (allowing the hash value to overflow).
template <class T> struct myhash; // primary template; only specializations are defined

template <> struct myhash<std::string>
{
    size_t operator()(const std::string& to_hash) const
    {
        const char* in = to_hash.c_str();
        size_t out = 0;
        while ('\0' != *in) // stop at the terminator
        {
            out *= 53; // just a prime number
            out += *in;
            ++in;
        }
        return out;
    }
};

hash_map<std::string, int, myhash<std::string> > my_hash_map;
It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters and not their whole content, you can do faster.
You might try caching the hash value by creating a new subclass of basic_string that remembers its hash value, if the value gets calculated too often. hash_map should be doing that internally, though.
I did a little searching and, funny thing, Paul Larson's little algorithm showed up at
http://www.strchr.com/hash_functions
as having the fewest collisions of any hash tested under a number of conditions, and it's very fast for one that isn't unrolled or table-driven.
Larson's is the simple multiply-by-101-and-add loop above.
Python 3.4 includes a new hash algorithm based on SipHash. PEP 456 is very informative.
From Hash Functions all the way down:
MurmurHash got quite popular, at least in game developer circles, as a “general hash function”.
It’s a fine choice, but let’s see later if we can generally do better. Another fine choice, especially if you know more about your data than “it’s gonna be an unknown number of bytes”, is to roll your own (e.g. see Won Chun’s replies, or Rune’s modified xxHash/Murmur that are specialized for 4-byte keys etc.). If you know your data, always try to see whether that knowledge can be used for good effect!
Without more information I would recommend MurmurHash as a general purpose non-cryptographic hash function. For small strings (of the size of the average identifier in programs) the very simple and famous djb2 and FNV are very good.
Here (data sizes < 10 bytes) we can see that the ILP smartness of other algorithms does not get to show itself, and the super-simplicity of FNV or djb2 win in performance.
djb2
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++)) /* extra parentheses: the assignment is intentional */
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}
FNV-1
hash = FNV_offset_basis
for each byte_of_data to be hashed
    hash = hash × FNV_prime
    hash = hash XOR byte_of_data
return hash
FNV-1a
hash = FNV_offset_basis
for each byte_of_data to be hashed
    hash = hash XOR byte_of_data
    hash = hash × FNV_prime
return hash
A note about security and availability
Hash functions can make your code vulnerable to denial-of-service attacks. If an attacker is able to force your server to handle too many collisions, your server may not be able to cope with requests.
Some hash functions like MurmurHash accept a seed that you can provide to drastically reduce the ability of attackers to predict the hashes your server software is generating. Keep that in mind.
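The Larson hash quoted earlier in this thread already takes a seed parameter; a minimal sketch of seeding it once per process (assuming that hash function is in scope):
#include <random>

// Pick an unpredictable seed once at startup so attackers cannot
// precompute keys that collide under a known, fixed hash function.
static const unsigned int g_hash_seed = std::random_device{}();

unsigned int seeded_hash(const char* s)
{
    return hash(s, g_hash_seed); // the multiply-by-101-and-add loop above
}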
If your strings are on average longer than a single cache line, but their length plus a short prefix are reasonably unique, consider hashing just the length and the first 8/16 characters. (The length is contained in the std::string object itself and is therefore cheap to read.)
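A hedged sketch of that idea, hashing the length plus the first 16 bytes (the helper name and the Larson-style inner loop are this sketch's choices):
#include <cstddef>
#include <string>

// Hash only the length and a short prefix; appropriate only when
// length + prefix are reasonably unique across your key set.
std::size_t prefix_hash(const std::string& s)
{
    std::size_t h = s.size(); // the length is cheap to read
    const std::size_t n = s.size() < 16 ? s.size() : 16;
    for (std::size_t i = 0; i < n; ++i)
        h = h * 101 + (unsigned char)s[i]; // multiply-and-add over the prefix
    return h;
}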