is there any function in C++ that calculates a fingerprint or hash of a string that's guaranteed to be at least 64 bits wide? - c++

is there any function in C++ that calculates a fingerprint or hash of a string that's guaranteed to be at least 64 bits wide?
I'd like to replace my unordered_map<string, int> with unordered_map<long long, int>.
Given the answers that I'm getting (thanks Stack Overflow community...) the technique that I'm describing is not well-known. The reason that I want an unordered map of fingerprints instead of strings is for space and speed. The second map does not have to store strings and when doing the lookup, it doesn't incur any extra cache misses to fetch those strings. The only downside is the slight chance of a collision. That's why the key has to be 64 bits: a probability of 2^(-64) is basically an impossibility. Of course, this is predicated on a good hash function, which is exactly what my question is seeking.
Thanks again, Stack Overflowers.

unordered_map always hashes the key into a size_t variable. This is independent of the type of the key and depends solely on the architecture you are working with.

C++ has no native 128-bit type, nor does it have native hashing support. Such extensions for hashing are supposed to be added in TR1, but as far as I am aware 128-bit ints aren't supported by many compilers. (Some compilers do support an __int128 type on 64-bit platforms, though.)
I'd expect the functions included with unordered_map would be faster in any case.
If you really do want to do things that way, MD5 provides a good 128-bit hash.
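If a well-distributed, non-cryptographic 64-bit fingerprint is enough, here is a minimal sketch using the public FNV-1a constants (the function name and the map usage at the end are just illustrative, not from any particular library):
#include <cstdint>
#include <string>
#include <unordered_map>

// 64-bit FNV-1a: simple, fast, well distributed, but not cryptographic.
std::uint64_t fingerprint64(const std::string& s)
{
    std::uint64_t h = 14695981039346656037ULL;   // FNV offset basis
    for (unsigned char c : s)
        h = (h ^ c) * 1099511628211ULL;          // FNV prime
    return h;
}

// The map then stores 8-byte keys instead of whole strings:
// std::unordered_map<std::uint64_t, int> counts;
// counts[fingerprint64("some key")] = 42;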

If you want to map any string to a unique integer:
#include <map>
#include <string>

typedef std::map<std::string, long long> Strings;
static Strings s_strings;
long long s_highWaterMark = 0;

long long my_function(const std::string& s)
{
    Strings::const_iterator it = s_strings.find(s);
    if (it != s_strings.end())
    {
        // we've previously returned a fingerprint for this string
        // now return the same fingerprint again
        return it->second;
    }
    // else new fingerprint
    long long rc = ++s_highWaterMark;
    // ... remember it for next time
    s_strings.insert(Strings::value_type(s, rc));
    // ... and return it this time
    return rc;
}
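A quick usage sketch (the strings are just examples):
long long a = my_function("apple");    // first time: new fingerprint, e.g. 1
long long b = my_function("banana");   // new fingerprint, e.g. 2
long long c = my_function("apple");    // same string again: c == a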

What exactly is it that you seek to achieve? Your map is not going to work any better with a "bigger" hash function. Not noticeably, anyway.

Related

Hashing an std::string to Something Other than std::size_t

As a part of the project I am currently working on, I need to use several relatively short strings (e.g. "ABCD1234") as keys for a custom container. The problem is, the objects in this container are of a type whose "primary key", so to speak, is numeric. So I need to take the unique strings given to me, translate them into something numeric, and make sure I preserve the uniqueness.
I've been trying to use boost::hash, and while I think it's going to work, I am annoyed by how big the hash value ends up being, especially considering that I know I am going to start off with short strings.
Is there another library, native or third-party, I could maybe use? This is obviously a convenience thing, so I am not too worried about it, but figured I might as well ask.
You could write your own hash that returns a short, but that's going to be prone to collisions.
Here is one I adapted to return a short (16 bits). It might need some tweaking.
#include <string>

unsigned short hash( std::string const& s ) {
    unsigned int results = 3;
    for ( auto current = s.begin(); current != s.end(); ++current ) {
        unsigned char c = static_cast<unsigned char>( *current );
        results = results + (results << 5) + c + (static_cast<unsigned int>(c) << 7);
    }
    // fold the upper half into the lower half and keep 16 bits
    return static_cast<unsigned short>( (results ^ (results >> 16)) & 0xffff );
}
Also, if you know what your keys are ahead of time and there aren't a lot of them, you could look into a perfect hash
You can use proper cryptographically strong hashes (digests).
These have the nice property that they can be truncated without losing their random distribution properties (this is NOT the case with general-purpose hash values, nor with UUIDs).
While, say, a raw SHA-1 is much longer (160 bits) and also not nearly as fast, you can truncate it to much smaller values as long as the resulting collision probability is still usefully small.
This is the approach that Darcs, Mercurial, Git etc. take with their commit identifiers.
Note that, for speed, SHA-512 (a SHA-2 variant) is fast and results in a 512-bit digest, and there is a truncation approach (e.g. "SHA-512/64") to cut those 512 bits down to a 64-bit digest. Also, you could look at faster hashes such as BLAKE or BLAKE2.
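If you end up with a wider digest from a library, a minimal sketch of the truncation step itself (the digest buffer is assumed to come from whatever hash implementation you use):
#include <cstdint>
#include <cstring>

// Keep the first 8 bytes of a wider digest as a 64-bit value.
std::uint64_t truncate_to_64(const unsigned char* digest) // e.g. 20-byte SHA-1 output
{
    std::uint64_t x = 0;
    std::memcpy(&x, digest, sizeof(x));   // byte order doesn't matter for uniqueness
    return x;
}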
If you're looking for a perfect hash for known strings, here's an old answer of mine that gives a complete example of this:
Is it possible to map string to int faster than using hashmap?
Turns out neither solution is viable for me. I'll just have to work with size_ts. Thanks though.

Small String Hash Function

I have a string that will be from anywhere from length 1 to length 5 and need to hash it with good performance and minimal collisions. Any suggestions? (I don't need security)
Depending on your scenario, you may have certain needs for what kind of hash that you need. But if all you need is something to separate them, then std::hash() does spring to mind...
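For instance, a minimal sketch of the std::hash route (C++11; the wrapper function is just for illustration):
#include <functional>
#include <string>

std::size_t hash_of(const std::string& s)
{
    return std::hash<std::string>{}(s);   // size_t-wide, implementation-defined but well distributed
}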
Another option would be something like:
#include <algorithm>
#include <cstring>
#include <string>

long long hash(const std::string &val) {
    long long hash = 0;
    // copy at most sizeof(hash) bytes of the (short) string into the integer
    std::memcpy(&hash, val.c_str(), std::min(sizeof(hash), val.length()));
    return hash;
}
Apologies for any mistypes in the above code, it has not been compiled or tested. This has minimal collisions (none), I would guess pretty good performance, and not-so-great quality. By quality I mean separation of values that are near each other and usage of entire key space.
There are also of course the whole range of regular cryptographic hash functions to choose from, but I take it from your question that these are not what you are aiming for.
Do you have Linux? Try gperf; it generates a perfect hash function from a key set.

Two-way "Hashing" of string

I want to generate an int from a string and be able to generate the string back.
Something like a hash function, but a two-way function.
I want to use ints as ID in my application, but want to be able to convert it back in case of logging or debugging.
Like:
int id = IDProvider::getHash("NameOfMyObject");
object * a = createObject(id);
...
if (error)
{
    LOG(IDProvider::getOriginalString(a->getId()), "some message");
}
I have heard of a slightly modified CRC32 that is fast and 100% reversible, but I cannot find it and I am not able to write it myself.
Any hints what should I use?
Thank you!
edit
I have just found the source I got the whole CRC32 thing from:
Jason Gregory : Game Engine Architecture
quotation:
"As with any hashing system, collisions are a possibility (i.e., two different strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all reasonable input strings we might use in our game. After all, a 32-bit hash chode represents more than four billion possible values. So if our hash function does a good job of distributing strings evently throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune."
Reducing an arbitrary-length string to a fixed-size int is mathematically impossible to reverse. See the pigeonhole principle. There is a practically unlimited number of strings, but only 2^32 32-bit integers.
32-bit hashes (assuming your int is 32 bits) can collide very easily, so they don't make a good unique ID either.
There are hashfunctions which allow you to create a message with a predefined hash, but it most likely won't be the original message. This is called a pre-image.
For your problem it looks like the best idea is creating a dictionary that maps integer-ids to strings and back.
To get the likelihood of a collision when you hash n strings, check out the birthday paradox. The most important property in that context is that collisions become likely once the number of hashed messages approaches the square root of the number of available hash values. So with a 32-bit integer, collisions become likely if you hash around 65,000 strings. But if you're unlucky it can happen much earlier.
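A minimal sketch of the dictionary approach suggested above (the function names mirror the ones in the question; the global containers are just for illustration):
#include <string>
#include <unordered_map>
#include <vector>

// Registry: maps each distinct string to a small int and back.
static std::unordered_map<std::string, int> g_ids;
static std::vector<std::string> g_strings;

int getHash(const std::string& s)             // same name as in the question
{
    auto it = g_ids.find(s);
    if (it != g_ids.end())
        return it->second;                    // already registered
    int id = static_cast<int>(g_strings.size());
    g_ids.emplace(s, id);
    g_strings.push_back(s);
    return id;
}

const std::string& getOriginalString(int id)  // reverse lookup for logging/debugging
{
    return g_strings[id];
}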
I have exactly what you need. It is called a "pointer". In this system, the "pointer" is always unique and can always be used to recover the string. It can "point" to any string of any length. As a bonus, it also has the same size as your int. You can obtain a "pointer" to a string by using the & operator, as shown in my example code:
#include <iostream>
#include <string>

int main() {
    std::string s = "Hai!";
    std::string* ptr = &s;     // this is a pointer
    std::string copy = *ptr;   // this retrieves the original string
    std::cout << copy;         // prints "Hai!"
}
What you need is encryption. Hashing is by definition one way. You might try simple XOR Encryption with some addition/subtraction of values.
Reversible hash function?
How come MD5 hash values are not reversible?
checksum/hash function with reversible property
http://groups.google.com/group/sci.crypt.research/browse_thread/thread/ffca2f5ac3093255
... and many more via google search...
You could look at perfect hashing
http://en.wikipedia.org/wiki/Perfect_hash_function
It only works when all the potential strings are known up front. In practice what you enable by this, is to create a limited-range 'hash' mapping that you can reverse-lookup.
In general, the [hash code + hash algorithm] are never enough to get the original value back. However, with a perfect hash, collisions are by definition ruled out, so if the source domain (list of values) is known, you can get the source value back.
gperf is a well-known, age old program to generate perfect hashes in c/c++ code. Many more do exist (see the Wikipedia page)
It is not possible. Hashing is a non-reversible function - by definition.
As everyone mentioned, it is not possible to have a "reversible hash". However, there are alternatives (like encryption).
Another one is to zip/unzip your string using any lossless algorithm.
That's a simple, fully reversible method, with no possible collision.

Fast 64 bit comparison

I'm working on a GUI framework, where I want all the elements to be identified by ASCII strings of up to 8 characters (or 7 would be OK).
Every time an event is triggered (some are just clicks, but some are continuous), the framework would callback to the client code with the id and its value.
I could use actual strings and strcmp(), but I want this to be really fast (for mobile devices), so I was thinking of using char constants (e.g. int id = 'BTN1';) so you'd be doing a single int comparison to test for the id. However, 4 chars isn't readable enough.
I tried an experiment, something like-
long int id = L'abcdefg';
... but it looks as if char constants can only hold 4 characters, and the only thing a long char constant gives you is the ability for your 4 characters to be twice as wide, not to have twice as many characters. Am I missing something here?
I want to make it easy for the person writing the client code. The gui is stored in xml, so the id's are loaded in from strings, but there would be constants written in the client code to compare these against.
So, the long and the short of it is, I'm looking for a cross-platform way to do quick 7-8 character comparison, any ideas?
Are you sure this is not premature optimisation? Have you profiled another GUI framework that is slow purely from string comparisons? Why are you so sure string comparisons will be too slow? Surely you're not doing that many string compares. Also, consider strcmp should have a near optimal implementation, possibly written in assembly tailored for the CPU you're compiling for.
Anyway, other frameworks just use named integers, for example:
static const int MY_BUTTON_ID = 1;
You could consider that instead, avoiding the string issue completely. Alternatively, you could simply write a helper function to convert a const char[9] into a 64-bit integer. This should accept a null-terminated string "like so" of up to 8 characters (assuming you intend to throw away the null character). Then your program is passing around 64-bit integers, but the programmer is dealing with strings.
Edit: here's a quick function that turns a string in to a number:
#include <cstring>

__int64 makeid(const char* str)
{
    __int64 ret = 0;
    // copies up to 8 characters; strncpy zero-pads the rest, so a shorter
    // string always produces the same 64-bit value
    strncpy((char*)&ret, str, sizeof(__int64));
    return ret;
}
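A hypothetical usage of makeid, comparing an incoming id against a named constant (eventId and BTN_OK are made-up names, and __int64 is kept from the snippet above):
void onEvent(__int64 eventId)
{
    static const __int64 BTN_OK = makeid("btn_ok");   // illustrative id
    if (eventId == BTN_OK) { /* handle the click */ }
}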
One possibility is to define your IDs as a union of a 64-bit integer and an 8-character string:
union ID {
    Int64 id;      // Assuming Int64 is an appropriate typedef somewhere
    char name[8];
};
Now you can do things like:
ID id;
strncpy(id.name, "Button1", 8);
if (anotherId.id == id.id) ...
The concept of string interning can be useful for this problem, turning string compares into pointer compares.
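A minimal interning sketch along those lines (assumption: a std::set is used so the stored strings, and therefore the returned pointers, stay put):
#include <set>
#include <string>

// Equal strings always come back as the same pointer, so ids can be
// compared with a single pointer comparison.
const char* intern(const std::string& s)
{
    static std::set<std::string> pool;
    return pool.insert(s).first->c_str();
}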
Easy to get pre-rolled Components
binary search tree for the win -- you get a red-black tree from most STL implementations of set and map, so you might want to consider that.
Intrusive versions of the STL containers perform MUCH better when you move the container nodes around a lot (in the general case) -- however they have quite a few caveats.
Specific Opinion -- First Alternative
If I were you I'd stick to a 64-bit integer type, bundle it in an intrusive container, and use the library provided by Boost. However, if you are new to this sort of thing then use std::map; it is conceptually simpler to grasp, and it has less chance of leaking resources, since there is more literature and guidance out there for these types of containers and the best practices.
Alternative 2
The problem you are trying to solve, I believe, is to have a global naming scheme which maps names to handles. You can create a mapping of names to handles so that you can use the names to retrieve handles:
// WidgetHandle is a polymorphic base class (i.e., it has a virtual method),
// and foo::Luv implement WidgetHandle's interface (public inheritance)
foo::WidgetHandle * LuvComponent =
    Factory.CreateComponent<foo::Luv>( "meLuvYouLongTime" );
....
.... // in a different function
foo::WidgetHandle * LuvComponent =
    Factory.RetrieveComponent<foo::Luv>( "meLuvYouLongTime" );
Alternative 2 is a common idiom for IPC: you create an IPC object, say a pipe, in one process, and you can ask the kernel to retrieve the other end of the pipe by name.
I see a distinction between easily read identifiers in your code, and the representation being passed around.
Could you use an enumerated type (or a large header file of constants) to represent the identifier? The names of the enumerated types could then be as long and meaningful as you wish, and still fit in (I am guessing) a couple of bytes.
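For example, a minimal sketch of the enumerated-type idea (the names are made up):
// Each identifier is readable in code but is just an integer at runtime.
enum WidgetId
{
    BTN_OK,
    BTN_CANCEL,
    SLIDER_VOLUME
};
// if (id == BTN_OK) { ... }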
In C++0x, you'll be able to use user-defined string literals, so you could add something like 7chars..id or "7chars.."id:
template <char...> constexpr unsigned long long operator ""id();
constexpr unsigned long long operator ""id(const char *, size_t);
Although I'm not sure you can use constexpr for the second one.
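A sketch of how the literal-operator form could look once the feature is available (this assumes C++14 constexpr rules, uses an FNV-1a style hash, and writes the suffix as _id because suffixes without a leading underscore are reserved):
#include <cstddef>
#include <cstdint>

// Hashes the literal's characters at compile time.
constexpr std::uint64_t operator""_id(const char* s, std::size_t n)
{
    std::uint64_t h = 14695981039346656037ULL;                        // FNV offset basis
    for (std::size_t i = 0; i < n; ++i)
        h = (h ^ static_cast<unsigned char>(s[i])) * 1099511628211ULL; // FNV prime
    return h;
}

constexpr std::uint64_t BTN1 = "Button1"_id;   // usable as a compile-time constant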

Hash of a string to be of specific length

Is there a way to generate a hash of a string so that the hash itself would be of a specific length? I've got a function that generates 41-byte hashes (SHA-1), but I need it to be 33 bytes max (because of certain hardware limitations). If I truncate the 41-byte hash to 33, I'd probably (certainly!) lose the uniqueness.
Or actually I suppose an MD5 algorithm would fit nicely, if I could find some C code for one with your help.
EDIT: Thank you all for the quick and knowledgeable responses. I've chosen to go with an MD5 hash and it fits fine for my purpose. The uniqueness is an important issue, but I don't expect the number of those hashes to be very large at any given time - these hashes represent software servers on a home LAN, so at max there would be 5, maybe 10 running.
If I truncate the 41-byte hash to 33, I'd probably (certainly!) lose the uniqueness.
What makes you think you've got uniqueness now? Yes, there's clearly a higher chance of collision when you're only playing with 33 bytes instead of 41, but you need to be fully aware that collisions are only ever unlikely, not impossible, for any situation where it makes sense to use a hash in the first place. If you're hashing more than 41 bytes of data, there are clearly more possible combinations than there are hashes available.
Now, whether you'd be better off truncating the SHA-1 hash or using a shorter hash such as MD5, I don't know. I think I'd be more generally confident when keeping the whole of a hash, but MD5 has known vulnerabilities which may or may not be a problem for your particular application.
Given the way hashes are calculated, that's unfortunately not possible. To limit the hash length to 33 bytes, you will have to cut it. You could XOR the first and last 33 bytes, as that might keep more of the information. But even with 33 bytes you don't have that big a chance of a collision.
md5: http://www.md5hashing.com/c++/
By the way, MD5 is 16 bytes, SHA-1 is 20 bytes, and SHA-256 is 32 bytes; however, as hex strings they all double in size. If you can store raw bytes, you can even use SHA-256.
There is no more chance of collision with substring(sha_hash, 0, 33) than with any other hash that is 33 bytes long, due to the way hash algorithms are designed (entropy is evenly spread out in the resulting string).
You could use an ELF hash (C code included) or some other simple hash function like that instead of MD5 or SHA-X.
They are not secure, but they can be tuned to any length you need:
#include <string>
using namespace std;

static unsigned int ELFHash(const string& str) {
    unsigned int hash = 0;
    unsigned int x = 0;
    unsigned int len = str.length();
    for (unsigned int i = 0; i < len; i++)
    {
        hash = (hash << 4) + str[i];
        if ((x = hash & 0xF0000000) != 0)
        {
            hash ^= (x >> 24);
        }
        hash &= ~x;
    }
    return hash;
}
Example
string data = "jdfgsdhfsdfsd 6445dsfsd7fg/*/+bfjsdgf%$^";
unsigned int value = ELFHash(data);
Output
248446350
Hashes are by definition only unique for a small amount of data (and even then it's still not guaranteed). It is impossible to map a large amount of information uniquely to a small amount of information, by virtue of the fact that you can't magically get rid of information and get it back later. Keep in mind this isn't compression going on.
Personally, I'd use MD5 (if you need to store as text), or a 256-bit (32-byte) hash such as SHA-256 (if you can store binary) in this situation. Truncating another hash algorithm to 33 bytes works too, but MAY increase the possibility of hash collisions; it depends a lot on the algorithm.
Also, yet another C implementation of MD5, by the people who designed it.
I believe the MD5 hashing algorithm results in a 32-character hex digest, so maybe that one will be more suitable.
Edit: to access MD5 functionality, it should be possible to hook into the OpenSSL libraries. However, you mentioned hardware limitations, so this may not be possible in your case.
Here is an MD5 implementation in C.
The chance of a 33-byte collision is 1/2^132 (by the Birthday Paradox)
So don't worry about losing uniqueness.
Update: I didn't check the actual byte length of SHA-1. Here's the relevant calculation: a 32-nibble collision (33 bytes of hex minus 1 termination char) only becomes likely when the number of strings hashed approaches sqrt(2^(32*4)) = 2^64.
Use Apache's DigestUtils:
http://commons.apache.org/codec/api-release/org/apache/commons/codec/digest/DigestUtils.html#md5Hex(java.lang.String)
Converts the hash into 32 character hex string.