Small String Hash Function - c++

I have a string that will be anywhere from 1 to 5 characters long, and I need to hash it with good performance and minimal collisions. Any suggestions? (I don't need security.)

Depending on your scenario, you may have certain needs for what kind of hash that you need. But if all you need is something to separate them, then std::hash() does spring to mind...
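A minimal sketch of the std::hash suggestion follows. The helper name hash_short_string is made up for illustration. The standard only promises that equal strings hash to the same value within one run of the program, so the result is best treated as an in-memory key rather than something to store.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: hashes a short string with the standard library hasher.
// The value is implementation-defined and may differ between runs or platforms.
std::size_t hash_short_string(const std::string& s) {
    return std::hash<std::string>{}(s);
}
```

Calling hash_short_string("abc") twice in the same run yields the same std::size_t both times; nothing more is guaranteed about the value itself.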
Another option would be something like:
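#include <algorithm>
#include <cstring>
#include <string>

long long hash(const std::string &val) {
    long long hash = 0;
    // Copy at most sizeof(hash) bytes of the string into the integer.
    std::memcpy(&hash, val.data(), std::min(sizeof(hash), val.length()));
    return hash;
}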
Apologies for any typos in the above code; it has not been compiled or tested. It has no collisions for strings of up to sizeof(long long) bytes, I would guess pretty good performance, and not-so-great quality. By quality I mean separation of values that are near each other and usage of the entire key space.
There is also, of course, the whole range of regular cryptographic hash functions to choose from, but I take it from your question that these are not what you are aiming for.
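The byte-packing idea can also be written with shifts, so that the result does not depend on the machine's byte order the way a raw memcpy does. This is only a sketch; pack_short_string is a hypothetical name, and it assumes the strings contain no trailing NUL characters (those would collide with the shorter string).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: packs up to 8 bytes of the string into a 64-bit integer,
// first character in the lowest byte. Longer strings are truncated.
std::uint64_t pack_short_string(const std::string& s) {
    std::uint64_t packed = 0;
    const std::size_t n = s.size() < 8 ? s.size() : 8;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint64_t byte = static_cast<unsigned char>(s[i]);
        packed |= byte << (8 * i);
    }
    return packed;
}
```

For strings of length 1 to 5 this gives every distinct string a distinct value, but neighbouring strings get neighbouring values, which is the "not-so-great quality" mentioned above.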

Do you have Linux? Try /bin/gperf; it generates a perfect hash function from a key set.

Related

Hashing an std::string to Something Other than std::size_t

As a part of the project I am currently working on, I need to use several relatively short strings (e.g. "ABCD1234") as keys for a custom container. The problem is, the objects in this container are of a type whose "primary key", so to speak, is numeric. So I need to take the unique strings given to me, translate them into something numeric, and make sure I preserve the uniqueness.
I've been trying to use boost::hash, and while I think it's going to work, I am annoyed by how big the hash value ends up being, especially considering that I know I am going to start off with short strings.
Is there another library, native or third-party, I could maybe use? This is obviously a convenience thing, so I am not too worried about it, but figured I might as well ask.
You could write your own that returns a short, but that's going to be prone to collisions.
Here is one I adapted to return a short/16 bits. Might need some tweaking.
unsigned short hash( std::string const& s ) {
    // Accumulate in an unsigned 32-bit value so overflow wraps predictably.
    unsigned int results = 3;
    for ( auto current = s.begin(); current != s.end(); ++current ) {
        unsigned char c = static_cast<unsigned char>( *current );
        results = results + (results << 5) + c + (c << 7);
    }
    // Fold the upper 16 bits into the lower 16 bits.
    return static_cast<unsigned short>( (results ^ (results >> 16)) & 0xffff );
}
Also, if you know what your keys are ahead of time and there aren't a lot of them, you could look into a perfect hash.
You can use proper cryptographically strong hashes (digests).
These have the nice property that they can be truncated without removing their random distribution properties (this is NOT the case with general purpose hash values and also not of UUIDs).
While, say, a raw SHA-1 is much longer (160 bits) and also not nearly as fast, you can truncate it to much smaller values as long as you can still provide a usefully small collision probability.
This is the approach that Darcs, Mercurial, Git etc. take with their commit identifiers.
A note on speed: SHA-512 (from the SHA-2 family) tends to be the faster SHA-2 variant on 64-bit machines and produces a 512-bit digest, so there is a special approach, known e.g. as SHA-512/64, that truncates those 512 bits to a 64-bit digest. Also, you could look at faster hashes such as BLAKE or BLAKE2.
If you're looking for a perfect hash for known strings, here's an old answer of mine that gives a complete example of this:
Is it possible to map string to int faster than using hashmap?
Turns out neither solution is viable for me. I'll just have to work with size_ts. Thanks though.

Is there a library that would produce a string that would hash (SHA1) to a given input?

I'm wondering if it's possible to find a block of text that would hash to a known value. In particular, I'm looking for a function CreateDataFromHash() that could be called as follows:
unsigned char myHash[] = "da39a3ee5e6b4b0d3255bfef95601890afd80709";
unsigned int length = 10000;
CreateDataFromHash(myHash, length);
Here CreateDataFromHash would return the string of the length 10000 containing arbitrary data, which would hash to myHash using SHA1.
Thanks.
There's no known easy or even moderately difficult way to do this in general.
The entire point of hashes (or so-called one-way functions), is that it's easy to compute them, but next to impossible to reverse their computation (find input values based on output). That said, for some hash functions, there are known methods that may allow computing inputs for a given hash value in reasonable time.
For example, this MD5 sum technique will find collisions (but not input for a given output) in about 8 hours on a 1.6GHz computer.
For SHA-1 in particular you may be interested in reading this.
One of the purposes of SHA1 is that this should be very hard to do.
Hashing is a one-way function. You can't get the input from the output.
This would be a "preimage attack". No such thing is publicly known against SHA-1.
The only attack known against SHA-1 is a collision attack. That means I find two inputs that produce the same result, but neither of them is pre-ordained, so to speak. Even so, this attack isn't really feasible for most people -- based on the amount of computation involved, the closest I can figure is that you'd have to spend somewhere in the range of a few million dollars to build a machine that would give you about one colliding pair of keys per week (assuming it ran, doing nothing else 24/7).
You have to brute force it. See
PHP brute force password generator
Get string, do hash, compare, repeat
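The "get string, do hash, compare, repeat" loop can be sketched as below. To keep the search small enough to finish, the sketch uses a made-up 16-bit toy hash (FNV-1a folded to 16 bits) rather than SHA-1; against SHA-1's 160-bit output the same loop would not terminate in any practical amount of time. The names toy_hash and brute_force are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Toy 16-bit hash (FNV-1a folded to 16 bits). NOT SHA-1; chosen only
// because its output space is small enough to search exhaustively.
std::uint16_t toy_hash(const std::string& s) {
    std::uint32_t h = 2166136261u;            // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;                       // FNV prime
    }
    return static_cast<std::uint16_t>(h ^ (h >> 16));
}

// Tries every lowercase string of up to max_len characters, shortest first.
// Returns true and stores the match in 'out' if one hashes to 'target'.
bool brute_force(std::uint16_t target, std::size_t max_len, std::string& out) {
    for (std::size_t len = 1; len <= max_len; ++len) {
        std::string candidate(len, 'a');
        while (true) {
            if (toy_hash(candidate) == target) {
                out = candidate;
                return true;
            }
            // Advance like an odometer: bump the last letter, carry on 'z'.
            std::size_t pos = len;
            while (pos > 0 && candidate[pos - 1] == 'z') {
                candidate[pos - 1] = 'a';
                --pos;
            }
            if (pos == 0) break;              // every string of this length was tried
            ++candidate[pos - 1];
        }
    }
    return false;
}
```

Note that the string found is some input with the requested hash, not necessarily the one that originally produced it.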

Guessing a string with time comparison. Is it possible?

I was wondering about a strange idea: you are given an algorithm which takes a string as input and compares it to a string that you don't know. The algorithm is just a trivial comparison, one char at a time. When a pair that doesn't match is found, 0 is returned. Otherwise it returns 1.
Can you guess the secret string in polynomial time by using the provided algorithm?
When a string doesn't match, the time taken to return 0 is less than the time taken to return 1, because fewer comparisons are needed. The times involved are very small, so you can run a single instance many times to get a more accurate estimate. By estimating the time taken, we could obtain information about the secret string. If this works properly, we can guess the string one char at a time, in polynomial time. So if this can happen, we can try some kind of brute-force attack char by char.
Does this make sense? Or is there something I'm misunderstanding?
Thanks in advance.
You can guess the secret string if you can input your own strings to compare, or just observe enough strings (not chosen by you) being compared to the secret string, provided the string comparison has been written in such a way that its execution time reveals information about the secret string.
This is a known weakness cryptographic software can have, and all serious cryptographic software written nowadays avoids this weakness.
For instance, to avoid revealing information about its arguments, a function that tests whether two buffers are the same or different may be written:
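#include <stddef.h>

int crypto_memcmp(const char *s1, const char *s2, size_t n)
{
    size_t i;
    int answer = 0; /* stays 0 only if every byte matches */
    /* Always scan all n bytes, so the running time does not depend on where the buffers differ. */
    for (i=0; i<n; i++)
        answer = answer | (s1[i] != s2[i]);
    return answer;
}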
You can use several techniques to check that a piece of code does not leak secrets through timing attacks. I wrote how to do it with static analysis here but this is based on a previous idea that used Valgrind (dynamic analysis) here.
Note that it goes further than that. This article showed how you did not even need the execution path to depend on the secret to leak information. It was enough that the secret was used in the computation of some array indices that were subsequently accessed. On modern computers, this changes the execution time because the cache will make two successive accesses to similar indices faster than two successive accesses to indices that are far from each other, revealing information about the secret.
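The character-by-character attack from the question can be sketched as follows. Everything here is an illustrative assumption: leaky_equal is a hypothetical early-exit comparison, its step counter stands in for a (noisy, repeated) time measurement, and the secret is assumed to be lowercase ASCII of known length.

```cpp
#include <cstddef>
#include <string>

// Hypothetical leaky comparison: returns at the first mismatch.
// 'steps' counts character comparisons and stands in for measured time.
bool leaky_equal(const std::string& secret, const std::string& guess,
                 std::size_t& steps) {
    steps = 0;
    if (secret.size() != guess.size()) return false;
    for (std::size_t i = 0; i < secret.size(); ++i) {
        ++steps;
        if (secret[i] != guess[i]) return false;
    }
    return true;
}

// Recovers the secret one position at a time: at each position, the
// candidate letter that makes the comparison run longest is the right one.
std::string recover(const std::string& secret, std::size_t length) {
    const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
    std::string guess(length, alphabet[0]);
    for (std::size_t pos = 0; pos < length; ++pos) {
        std::size_t best_steps = 0;
        char best = alphabet[0];
        for (char c : alphabet) {
            guess[pos] = c;
            std::size_t steps = 0;
            if (leaky_equal(secret, guess, steps)) return guess;
            if (steps > best_steps) {
                best_steps = steps;
                best = c;
            }
        }
        guess[pos] = best;
    }
    return guess;
}
```

This needs at most 26 calls per position, so the work grows linearly with the length of the secret instead of exponentially. Replacing leaky_equal with a comparison that always scans the whole buffer removes the signal the attack relies on.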
You can determine the string character by character, using a binary search for each character.
For example:
Suppose you already know the first a characters; call them (Sa).
Now you have to determine the (a+1)th character.
There is an upper bound (Sa)zzzzzzz... and a lower bound (Sa)azzzzzz....
First you guess that the (a+1)th character is (a+z)/2, say r, so the string is (Sa)rzzzzzz...; with the result, you update the upper bound and the lower bound.

Two-way "Hashing" of string

I want to generate an int from a string and be able to convert it back to the string.
Something like a hash function, but two-way.
I want to use ints as ID in my application, but want to be able to convert it back in case of logging or debugging.
Like:
int id = IDProvider::getHash("NameOfMyObject");
object * a = createObject(id);
...
if(error)
{
LOG(IDProvider::getOriginalString(a->getId()), "some message");
}
I have heard of slightly modified CRC32 to be fast and 100% reversible, but I can not find it and I am not able to write it by myself.
Any hints what should I use?
Thank you!
edit
I have just found the source where I got the whole CRC32 thing from:
Jason Gregory : Game Engine Architecture
quotation:
"As with any hashing system, collisions are a possibility (i.e., two different strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all reasonable input strings we might use in our game. After all, a 32-bit hash code represents more than four billion possible values. So if our hash function does a good job of distributing strings evenly throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune."
Reducing an arbitrary-length string to a fixed-size int is mathematically impossible to reverse. See the pigeonhole principle. There is an unbounded number of strings, but only 2^32 32-bit integers.
32-bit hashes (assuming your int is 32 bits) can have collisions very easily. So it's not a good unique ID either.
There are hash functions which allow you to create a message with a predefined hash, but it most likely won't be the original message. This is called a pre-image.
For your problem it looks like the best idea is creating a dictionary that maps integer-ids to strings and back.
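A minimal sketch of such a dictionary is shown below. The class name IdTable is made up, and the method names merely echo the question's IDProvider; IDs are handed out consecutively rather than computed from the string, which is what makes the mapping collision-free and reversible.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical ID table: hands out consecutive integer IDs and remembers the strings.
class IdTable {
public:
    // Returns the existing ID for 'name', or assigns the next free one.
    int getId(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const int id = static_cast<int>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Returns the string registered under 'id'; throws std::out_of_range
    // if no such ID was ever handed out.
    std::string getOriginalString(int id) const {
        return names_.at(static_cast<std::size_t>(id));
    }

private:
    std::unordered_map<std::string, int> ids_;  // string -> ID
    std::vector<std::string> names_;            // ID -> string
};
```

The trade-off is that an ID depends on registration order, so it is only meaningful within one run (or one saved table), unlike a hash that can be recomputed anywhere.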
To get the likelihood of a collision when you hash n strings, check out the birthday paradox. The most important property in that context is that collisions become likely once the number of hashed messages approaches the square root of the number of available hash values. So with a 32-bit integer, collisions become likely if you hash around 65,000 strings. But if you're unlucky it can happen much earlier.
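The birthday estimate can be computed directly. The sketch below uses the usual approximation p = 1 - exp(-n(n-1) / (2N)) for n hashes drawn uniformly from N possible values; the function name collision_probability is hypothetical, and real hash functions are only approximately uniform.

```cpp
#include <cmath>

// Approximate probability that at least two of n uniformly random hashes
// collide when there are 'buckets' possible hash values:
//   p = 1 - exp(-n * (n - 1) / (2 * buckets))
// expm1 keeps the result accurate when p is very small.
double collision_probability(double n, double buckets) {
    return -std::expm1(-n * (n - 1.0) / (2.0 * buckets));
}
```

With 2^32 possible values, 65,536 strings give a probability of about 0.39, and the 50% mark is reached at roughly 77,000 strings.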
I have exactly what you need. It is called a "pointer". In this system, the "pointer" is always unique, and can always be used to recover the string. It can "point" to any string of any length. As a bonus, on a 32-bit platform it also has the same size as your int. You can obtain a "pointer" to a string by using the & operator, as shown in my example code:
#include <iostream>
#include <string>
int main() {
    std::string s = "Hai!";
    std::string* ptr = &s; // this is a pointer
    std::string copy = *ptr; // this retrieves the original string
    std::cout << copy; // prints "Hai!"
}
What you need is encryption. Hashing is by definition one way. You might try simple XOR Encryption with some addition/subtraction of values.
Reversible hash function?
How come MD5 hash values are not reversible?
checksum/hash function with reversible property
http://groups.google.com/group/sci.crypt.research/browse_thread/thread/ffca2f5ac3093255
... and many more via google search...
You could look at perfect hashing
http://en.wikipedia.org/wiki/Perfect_hash_function
It only works when all the potential strings are known up front. In practice, this lets you create a limited-range 'hash' mapping that you can reverse-lookup.
In general, the [hash code + hash algorithm] are never enough to get the original value back. However, with a perfect hash, collisions are by definition ruled out, so if the source domain (list of values) is known, you can get the source value back.
gperf is a well-known, age-old program for generating perfect hashes as C/C++ code. Many more exist (see the Wikipedia page).
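To illustrate the reverse-lookup property without relying on gperf's generated code, here is a deliberately simple stand-in: a sorted table of known keys searched with binary search. It is not a hash function in the gperf sense (lookup is logarithmic, not constant time), but it has the same two properties discussed above: no collisions, and the ID can be turned back into the string. The key set and the names kKeys, keyToId and idToKey are made up.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

// Hypothetical fixed key set, kept sorted so it can be binary-searched.
const std::array<std::string, 4> kKeys = {"alpha", "beta", "delta", "gamma"};

// Maps a known key to its index (0..3), or returns -1 for an unknown string.
int keyToId(const std::string& key) {
    auto it = std::lower_bound(kKeys.begin(), kKeys.end(), key);
    if (it == kKeys.end() || *it != key) return -1;
    return static_cast<int>(it - kKeys.begin());
}

// Reverse lookup; throws std::out_of_range for an ID that keyToId never returns.
const std::string& idToKey(int id) {
    return kKeys.at(static_cast<std::size_t>(id));
}
```

A generated perfect hash replaces the binary search with a constant-time computation, but the surrounding table and the reverse lookup look much the same.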
It is not possible. Hashing is a non-reversible function, by definition.
As everyone mentioned, it is not possible to have a "reversible hash". However, there are alternatives (like encryption).
Another one is to zip/unzip your string using any lossless algorithm.
That's a simple, fully reversible method, with no possible collision.

What's the best way to hash a string vector not very long (urls)?

I am now dealing with URL classification. I split each URL on delimiters such as "/?", generating a bunch of parts. In the process, I need to hash the first through the kth parts; say k=2, then for "http://stackoverflow.com/questions/ask" the key is the string vector "stackoverflow.com questions". Currently, the hash is like Hash. But it consumes a lot of memory. I wonder whether MD5 can help, or whether there are other alternatives. In effect, I do not need to recover the key exactly, as long as different keys can be told apart.
Thanks!
It consumes a lot of memory
If your code already works, you may want to consider leaving it as-is. If you don't have a target, you won't know when you're done. Are you sure "a lot" is synonymous with "too much" in your case?
If you decide you really need to change your working code, you should consider the large variety of the options you have available, rather than taking someone's word for a specific algorithm:
http://en.wikipedia.org/wiki/List_of_hash_functions
http://en.wikipedia.org/wiki/Comparison_of_cryptographic_hash_functions
http://www.strchr.com/hash_functions
etc
Not sure about memory implications, and it certainly would change your perf profile, but you could also look into using Tries:
http://en.wikipedia.org/wiki/Trie
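A rough sketch of the trie idea for URL parts is given below. PartTrie and idFor are hypothetical names. Keys that share a prefix (such as the same host name) share nodes, which may reduce memory compared with storing every full key, though whether it actually does depends on the data and would need measuring.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical trie keyed by URL parts; each distinct path gets a small integer ID.
class PartTrie {
public:
    // Returns the ID for this sequence of parts, creating nodes as needed.
    int idFor(const std::vector<std::string>& parts) {
        Node* node = &root_;
        for (const std::string& part : parts) {
            std::unique_ptr<Node>& child = node->children[part];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        if (node->id < 0) node->id = nextId_++;  // first time this exact key is seen
        return node->id;
    }

private:
    struct Node {
        int id = -1;  // -1 means no key ends at this node yet
        std::map<std::string, std::unique_ptr<Node>> children;
    };
    Node root_;
    int nextId_ = 0;
};
```

For k=2, the key for "http://stackoverflow.com/questions/ask" would be passed as {"stackoverflow.com", "questions"}, and every URL with the same two leading parts gets the same ID.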
MD5 is a nice hash code for stuff where security is not an issue. It's fast and reasonably long (128 bits is enough for most applications). Also the distribution is very good.
Adler32 would be a possible alternative. It's very easy to implement, just a few lines of code. It's even faster than MD5. And it's long enough/good enough for many applications (though for many it is not).
(I know Adler32 is strictly not a hash-code, but it will still do fine for many applications)
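For reference, the "few lines of code" look roughly like this. It is a straightforward, unoptimized sketch of Adler-32 (two running sums modulo 65521); the function name adler32_checksum is chosen here to avoid clashing with zlib's adler32.

```cpp
#include <cstdint>
#include <string>

// Straightforward Adler-32: 'a' is the running byte sum (starting at 1),
// 'b' is the running sum of the 'a' values, both modulo 65521.
std::uint32_t adler32_checksum(const std::string& data) {
    const std::uint32_t kMod = 65521;  // largest prime below 2^16
    std::uint32_t a = 1;
    std::uint32_t b = 0;
    for (unsigned char byte : data) {
        a = (a + byte) % kMod;
        b = (b + a) % kMod;
    }
    return (b << 16) | a;
}
```

For short inputs the two sums stay small, so many of the 32 output bits are barely used.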
However, if storing the hash-code is consuming a lot of memory, you can always truncate the hash-code, or use XOR to "shrink" it. E.g.
uint8_t md5[16];
GetMD5(md5, ...);
// use XOR to shrink the MD5 to 32 bits
for (size_t i = 4; i < 16; i++)
    md5[i % 4] ^= md5[i];
// assemble the parts into one uint32_t (cast first so the shifts are done on unsigned 32-bit values)
uint32_t const hash = uint32_t(md5[0]) | (uint32_t(md5[1]) << 8) | (uint32_t(md5[2]) << 16) | (uint32_t(md5[3]) << 24);
Personally I think MD5 would be overkill though. Have a look at Adler32, I think it will do.
EDIT
I have to correct myself: Adler32 is a rather poor choice for short strings (less than a few thousand bytes). I had completely forgotten about that. But there is always the obvious: CRC32. Not as fast as Adler32 (about the same speed as MD5), but still acceptably easy to implement, and there are also a ton of existing implementations with all kinds of licenses out there.
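A bit-at-a-time CRC-32 is short enough to show here. This is a sketch of the common reflected variant (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF), written for clarity rather than speed; table-driven versions are much faster. The name crc32_checksum is chosen to avoid clashing with zlib's crc32.

```cpp
#include <cstdint>
#include <string>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), one input bit per inner step.
std::uint32_t crc32_checksum(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the bit shifted out was set.
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
    }
    return ~crc;
}
```

The standard check value for this variant is 0xCBF43926 for the input "123456789", which is a quick way to validate any implementation.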
If you're only trying to find out whether two URLs are the same, have you considered storing a binary version of the IP address of the server? If two server names resolve to the same address, is that incorrect or an advantage for your application?