Hash of a string to be of specific length - c++

Is there a way to generate a hash of a string so that the hash itself would be of specific length? I've got a function that generates 41-byte hashes (SHA-1), but I need it to be 33 bytes max (because of certain hardware limitations). If I truncate the 41-byte hash to 33, I'd probably (certainly!) lose the uniqueness.
Or actually I suppose an MD5 algorithm would fit nicely, if I could find some C code for one with your help.
EDIT: Thank you all for the quick and knowledgeable responses. I've chosen to go with an MD5 hash and it fits fine for my purpose. The uniqueness is an important issue, but I don't expect the number of those hashes to be very large at any given time - these hashes represent software servers on a home LAN, so at max there would be 5, maybe 10 running.

What makes you think you've got uniqueness now? Yes, there's clearly a higher chance of collision when you're only playing with 33 bytes instead of 41, but you need to be fully aware that collisions are only ever unlikely, not impossible, for any situation where it makes sense to use a hash in the first place. If you're hashing more than 41 bytes of data, there are clearly more possible combinations than there are hashes available.
Now, whether you'd be better off truncating the SHA-1 hash or using a shorter hash such as MD5, I don't know. I think I'd be more generally confident when keeping the whole of a hash, but MD5 has known vulnerabilities which may or may not be a problem for your particular application.

Given the way hashes are calculated, that's unfortunately not possible. To limit the hash length to 33 bytes, you will have to cut it. You could XOR the first and last 33 bytes, as that might keep more of the information. But even with 33 bytes you don't have that big a chance of a collision.
md5: http://www.md5hashing.com/c++/
btw. md5 is 16 bytes, sha1 20 bytes and sha256 is 32 bytes, however as hexstrings, they all double in size. If you can store bytes, you can even use sha256.
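To make the size arithmetic concrete, here is a minimal sketch of the two ideas above: XOR-folding a raw digest down to a shorter length, and hex-encoding it (which doubles the size). The names xor_fold and to_hex are made up for this illustration, and the digest bytes are assumed to come from whatever MD5/SHA-1 routine you already have.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// XOR-fold a digest down to out_len bytes: input byte i is XORed
// into output position i % out_len. (Hypothetical helper.)
std::vector<std::uint8_t> xor_fold(const std::vector<std::uint8_t>& digest,
                                   std::size_t out_len) {
    std::vector<std::uint8_t> out(out_len, 0);
    if (out_len == 0) return out;
    for (std::size_t i = 0; i < digest.size(); ++i) {
        out[i % out_len] ^= digest[i];
    }
    return out;
}

// Render raw bytes as lowercase hex; the result is twice as long as the input.
std::string to_hex(const std::vector<std::uint8_t>& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    hex.reserve(bytes.size() * 2);
    for (std::uint8_t b : bytes) {
        hex.push_back(digits[b >> 4]);
        hex.push_back(digits[b & 0x0F]);
    }
    return hex;
}
```

Folding a 20-byte SHA-1 digest to 16 bytes and hex-encoding it gives 32 characters, which fits a 33-byte buffer including the terminator.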

There is no more chance of collision with substring(sha_hash, 0, 33) than with any other hash that is 33 bytes long, due to the way hash algorithms are designed (entropy is evenly spread out in the resulting string).

You could use an Elf hash(<- C code included) or some other simple hash function like that instead of MD5 or SHA-X.
They are not secure, but they can be tuned to any length you need
// Requires <string>. Assumes unsigned int is 32 bits wide.
#include <string>

static unsigned int ELFHash(const std::string& str) {
    unsigned int hash = 0;
    unsigned int x = 0;
    for (std::size_t i = 0; i < str.length(); i++) {
        hash = (hash << 4) + (str[i]);
        // fold the top four bits back into the low bits, then clear them
        if ((x = hash & 0xF0000000) != 0) {
            hash ^= (x >> 24);
        }
        hash &= ~x;
    }
    return hash;
}

std::string data = "jdfgsdhfsdfsd 6445dsfsd7fg/*/+bfjsdgf%$^";
unsigned int value = ELFHash(data);

Hashes are by definition only unique for small amounts of data (and even then it's still not guaranteed). It is impossible to map a large amount of information uniquely to a small amount of information by virtue of the fact that you can't magically get rid of information and get it back later. Keep in mind this isn't compression going on.
Personally, I'd use MD5 (if you need to store in text), or a 256b (32B) hash such as SHA256 (if you can store in binary) in this situation. Truncating another hash algorithm to 33B works too, and MAY increase the possibility of generating hash collisions. It depends a lot on the algorithm.
Also, yet another C implementation of MD5, by the people who designed it.

I believe the MD5 hashing algorithm results in a 32 digit number so maybe that one will be more suitable.
Edit: to access MD5 functionality, it should be possible to hook into the OpenSSL libraries. However, you mentioned hardware limitations, so this may not be possible in your case.

Here is an MD5 implementation in C.

The chance of a 33-byte collision is 1/2^132 (by the Birthday Paradox)
So don't worry about losing uniqueness.
Update: I didn't check the actual byte length of SHA1. Here's the relevant calculation: a 32-nibble collision (33 bytes of hex - 1 termination char), occurs only when the number of strings hashed becomes around sqrt(2^(32*4)) = 2^64.

Use Apache's DigestUtils:
Converts the hash into 32 character hex string.


Two-way "Hashing" of string

I want to generate int from a string and be able to generate it back.
Something like hash function but two-way function.
I want to use ints as ID in my application, but want to be able to convert it back in case of logging or debugging.
int id = IDProvider::getHash("NameOfMyObject");
object * a = createObject(id);
LOG(IDProvider::getOriginalString(a->getId()), "some message");
I have heard of slightly modified CRC32 to be fast and 100% reversible, but I can not find it and I am not able to write it by myself.
Any hints what should I use?
Thank you!
I have just found the source I have the whole CRC32 thing from:
Jason Gregory : Game Engine Architecture
"As with any hashing system, collisions are a possibility (i.e., two different strings might end up with the same hash code). However, with a suitable hash function, we can all but guarantee that collisions will not occur for all reasonable input strings we might use in our game. After all, a 32-bit hash code represents more than four billion possible values. So if our hash function does a good job of distributing strings evenly throughout this very large range, we are unlikely to collide. At Naughty Dog, we used a variant of the CRC-32 algorithm to hash our strings, and we didn't encounter a single collision in over two years of development on Uncharted: Drake's Fortune."
Reducing an arbitrary-length string to a fixed-size int is mathematically impossible to reverse. See the pigeonhole principle. There is a near-infinite number of strings, but only 2^32 32-bit integers.
32 bit hashes(assuming your int is 32 bit) can have collisions very easily. So it's not a good unique ID either.
There are hashfunctions which allow you to create a message with a predefined hash, but it most likely won't be the original message. This is called a pre-image.
For your problem it looks like the best idea is creating a dictionary that maps integer-ids to strings and back.
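A minimal sketch of such a dictionary, assuming IDs can simply be handed out in order of first use. The class name IdProvider echoes the question; getId and the internal layout are my own invention for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical two-way ID table: each distinct string gets the next integer.
class IdProvider {
public:
    // Returns the existing id for name, or assigns a new one.
    int getId(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Recovers the original string; id must have come from getId().
    const std::string& getOriginalString(int id) const {
        return names_.at(static_cast<std::size_t>(id));
    }

private:
    std::unordered_map<std::string, int> ids_;  // string -> id
    std::vector<std::string> names_;            // id -> string
};
```

Unlike a hash, this can never collide, at the cost of keeping every string in memory and of IDs not being stable between runs.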
To get the likelihood of a collision when you hash n strings, check out the birthday paradox. The most important property in that context is that collisions become likely once the number of hashed messages approaches the square root of the number of available hash values. So with a 32-bit integer, collisions become likely if you hash around 65,000 strings. But if you're unlucky it can happen much earlier.
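For a rough feel of the numbers, the usual birthday approximation p ≈ 1 - exp(-n^2 / (2N)) can be evaluated directly. This is only a sketch: it assumes the hash values are uniformly distributed, and the function name is made up.

```cpp
#include <cmath>

// Approximate probability that at least two of n uniformly random hash
// values collide when there are 2^bits possible values:
//   p ~= 1 - exp(-n^2 / (2 * 2^bits))
double collision_probability(double n, int bits) {
    double space = std::ldexp(1.0, bits);  // 2^bits
    // expm1 keeps precision when the probability is tiny
    return -std::expm1(-(n * n) / (2.0 * space));
}
```

With n = 65536 and bits = 32 the exponent is exactly -0.5, so the estimate is about 39%.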
I have exactly what you need. It is called a "pointer". In this system, the "pointer" is always unique, and can always be used to recover the string. It can "point" to any string of any length. As a bonus, it also has the same size as your int. You can obtain a "pointer" to a string by using the & operator, as shown in my example code:
#include <iostream>
#include <string>
int main() {
    std::string s = "Hai!";
    std::string* ptr = &s; // this is a pointer
    std::string copy = *ptr; // this retrieves the original string
    std::cout << copy; // prints "Hai!"
}
What you need is encryption. Hashing is by definition one way. You might try simple XOR Encryption with some addition/subtraction of values.
Reversible hash function?
How come MD5 hash values are not reversible?
checksum/hash function with reversible property
... and many more via google search...
You could look at perfect hashing
It only works when all the potential strings are known up front. In practice what you enable by this, is to create a limited-range 'hash' mapping that you can reverse-lookup.
In general, the [hash code + hash algorithm] are never enough to get the original value back. However, with a perfect hash, collisions are by definition ruled out, so if the source domain (list of values) is known, you can get the source value back.
gperf is a well-known, age old program to generate perfect hashes in c/c++ code. Many more do exist (see the Wikipedia page)
It is not possible. Hashing is a non-reversible function, by definition.
As everyone mentioned, it is not possible to have a "reversible hash". However, there are alternatives (like encryption).
Another one is to zip/unzip your string using any lossless algorithm.
That's a simple, fully reversible method, with no possible collision.

What's the best way to hash a string vector not very long (urls)?

I am now dealing with url classification. I partition url with "/?", etc, generating a bunch of parts. In the process, I need to hash the first part to the kth part, say, k=2, then for "http://stackoverflow.com/questions/ask", the key is a string vector "stackoverflow.com questions". Currently, the hash is like Hash. But it consumes a lot of memory. I wonder whether MD5 can help or are there other alternatives. In effect, I do not need to recover the key exactly, as long as differentiating different keys.
It consumes a lot of memory
If your code already works, you may want to consider leaving it as-is. If you don't have a target, you won't know when you're done. Are you sure "a lot" is synonymous with "too much" in your case?
If you decide you really need to change your working code, you should consider the large variety of the options you have available, rather than taking someone's word for a specific algorithm:
Not sure about memory implications, and it certainly would change your perf profile, but you could also look into using Tries:
MD5 is a nice hash code for stuff where security is not an issue. It's fast and reasonably long (128 bits is enough for most applications). Also the distribution is very good.
Adler32 would be a possible alternative. It's very easy to implement, just a few lines of code. It's even faster than MD5. And it's long enough/good enough for many applications (though for many it is not).
(I know Adler32 is strictly not a hash-code, but it will still do fine for many applications)
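To show what "just a few lines" looks like, here is a straightforward (unoptimized) Adler-32 sketch. Taking a std::string as input is my assumption; real implementations usually defer the modulo for speed.

```cpp
#include <cstdint>
#include <string>

// Plain Adler-32: a is 1 plus the sum of all bytes, b is the sum of all
// intermediate values of a, both modulo 65521.
std::uint32_t adler32(const std::string& data) {
    const std::uint32_t MOD = 65521;  // largest prime below 2^16
    std::uint32_t a = 1, b = 0;
    for (unsigned char c : data) {
        a = (a + c) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}
```

For example, adler32("Wikipedia") gives 0x11E60398.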
However, if storing the hash-code is consuming a lot of memory, you can always truncate the hash-code, or use XOR to "shrink" it. E.g.
uint8_t md5[16];
GetMD5(md5, ...);
// use XOR to shrink the MD5 to 32 bits
for (size_t i = 4; i < 16; i++)
    md5[i % 4] ^= md5[i];
// assemble the parts into one uint32_t
// (cast first so the shifts are done in unsigned 32-bit arithmetic)
uint32_t const hash = (uint32_t)md5[0] + ((uint32_t)md5[1] << 8) + ((uint32_t)md5[2] << 16) + ((uint32_t)md5[3] << 24);
Personally I think MD5 would be overkill though. Have a look at Adler32, I think it will do.
I have to correct myself: Adler32 is a rather poor choice for short strings (fewer than a few thousand bytes). I had completely forgotten about that. But there is always the obvious: CRC32. Not as fast as Adler32 (about the same speed as MD5), but still acceptably easy to implement, and there are also a ton of existing implementations with all kinds of licenses out there.
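For reference, a bit-at-a-time CRC-32 sketch (the common reflected variant with polynomial 0xEDB88320, as used by zip and PNG). It is the slow, table-free form, chosen here only because it is short; the function name is made up.

```cpp
#include <cstdint>
#include <string>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320, init and final XOR
// of 0xFFFFFFFF). Table-driven versions are much faster.
std::uint32_t crc32_bitwise(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : data) {
        crc ^= c;
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return ~crc;
}
```

The standard check value is crc32_bitwise("123456789") == 0xCBF43926. For the URL case, the key would be something like crc32_bitwise("stackoverflow.com questions").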
If you're only trying to find out if two URL's are the same have you considered storing a binary version of the IP address of the server? If two server names resolve to the same address is that incorrect or an advantage for your application?

10 character id that's globally and locally unique

I need to generate a 10 character unique id (SIP/VOIP folks need to know that it's for a param icid-value in the P-Charging-Vector header). Each character shall be one of the 26 ASCII letters (case sensitive), one of the 10 ASCII digits, or the hyphen-minus.
It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!
Here's my take on it. I FIRST encode the local IP address (which MUST be globally unique) into base-63 (it's an unsigned long int that will occupy 1-6 characters after encoding), and then as much as I can of the current timestamp (it's a time_t/long long int that will occupy 9-4 characters after encoding, depending on how much space the encoded IP address occupies in the first place).
I've also added loop count 'i' to the time stamp to preserve the uniqueness in case the function is called more than once in a second.
Is this good enough to be globally and locally unique or is there another better approach?
#include <stdio.h>
#include <string.h>
#include <time.h>
//base-63 character set
static char set[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";
// b63() returns the next vacant location in char array x
int b63(long long longlong,char *x,int index){
    if(index > 9)
        return index+1;
    if(longlong < 63){
        x[index] = set[longlong];
        return index+1;
    }
    x[index] = set[longlong%63];
    return b63(longlong/63,x,index+1);
}
int main(){
    char x[11],y[11] = {0}; /* '\0' is taken care of here */
    //let's generate 10 million ids
    for(int i=0; i<10000000; i++){
        /* add i to timestamp to take care of sub-second function calls,
        3770168404(is a sample ip address in n/w byte order) = */
        b63((long long)time(NULL)+i,x,b63((long long)3770168404,x,0));
        // reverse the char array to get proper base-63 output
        for(int j=0,k=9; j<10; j++,k--)
            y[j] = x[k];
    }
    return 0;
}
Are you in control of all the software generating ids? Are you doling out the ids? If not...
I know nothing about SIP, but there's got to be a misunderstanding that you have about the spec (or the spec must be wrong). If another developer attempts to build an id using a different algorithm than the one you've cooked up, you will have collisions with their ids, meaning they will no longer be globally unique in that system.
I'd go back to the SIP documentation, see if there's an appendix with an algorithm for generating these ids. Or maybe a smarter SO user than I can answer what the SIP algorithm for generating these id's is.
I would have a serious look at RFC 4122 which describes the generation of 128-bit GUIDs. There are several different generation algorithms, some of which may fit (MAC address-based one springs to mind). This is a bigger number-space than yours 2^128 = 3.4 * 10^38 compared with 63^10 = 9.8 * 10^17, so you may have to make some compromises on uniqueness. Consider factors like how frequently the IDs will be generated.
However in the RFC, they have considered some practical issues, like the ability to generate large numbers of unique values efficiently by pre-allocating blocks of IDs.
Can't you just have a distributed ID table ?
Machines on NAT'ed LANs will often have an IP from a small range, and not all of the 32-bit values would be valid (think multicast, etc). Machines may also grab the same timestamp, especially if the granularity is large (such as seconds); keep in mind that the year is very often going to be the same, so it's the lower bits that will give you the most 'uniqueness'.
You may want to take the various values, hash them with a cryptographic hash, and translate that to the characters you are permitted to use, truncating to the 10 characters.
But you're dealing with a value with less than 60 bits; you need to think carefully about the implications of a collision. You might be approaching the problem the wrong way...
Well, if I cast aside the fact that I think this is a bad idea, and concentrate on a solution to your problem, here's what I would do:
You have an id range of 63^10, which corresponds to roughly 60 bits. You want it to be both "globally" and "locally" unique. Let's generate the first N bits to be globally unique, and the rest to be locally unique. The concatenation of the two will have the properties you are looking for.
First, the global uniqueness : IP won't work, especially local ones, they hold very little entropy. I would go with MAC addresses, they were made for being globally unique. They cover a range of 256^6, so using up 6*8 = 48 bits.
Now, for the locally unique part: why not use the process ID? I'm making the assumption that the uniqueness is per process; if it's not, you'll have to think of something else. On Linux, the process ID is 32 bits. If we wanted to nitpick, the 2 most significant bytes probably hold very little entropy, as they would be 0 on most machines. So discard them if you know what you're doing.
So now you'll see you have a problem as it would use up to 70 bits to generate a decent (but not bulletproof) globally and locally unique ID (using my technique anyway). And since I would also advise to put in a random number (at least 8 bits long) just in case, it definitely won't fit. So if I were you, I would hash the ~78 generated bits to SHA1 (for example), and convert the first 60 bits of the resulting hash to your ID format. To do so, notice that you have a 63 characters range to chose from, so almost the full range of 6 bits. So split the hash in 6 bits pieces, and use the first 10 pieces to select the 10 characters of your ID from the 63 character range. Obviously, the range of 6 bits is 64 possible values (you only want 63), so if you have a 6 bits piece equals to 63, either floor it to 62 or assume modulo 63 and pick 0. It will slightly bias the distribution, but it's not too bad.
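The last step (6-bit pieces to characters) could be sketched like this. I'm assuming the first 60 bits of the hash have already been packed into the low bits of a 64-bit integer by some SHA-1 routine not shown here; encode_id is a made-up name, and the modulo-63 variant is the one used for out-of-range pieces.

```cpp
#include <cstdint>
#include <string>

// Turn the low 60 bits of hash_bits into a 10-character ID, taking
// 6 bits per character, most significant piece first. A piece equal
// to 63 wraps to index 0 (the slight bias mentioned above).
std::string encode_id(std::uint64_t hash_bits) {
    static const char alphabet[] =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";
    std::string id;
    id.reserve(10);
    for (int piece = 9; piece >= 0; --piece) {
        unsigned value =
            static_cast<unsigned>((hash_bits >> (piece * 6)) & 0x3F);
        id.push_back(alphabet[value % 63]);
    }
    return id;
}
```

Because of the wrap, the values 0 and 63 in any position both map to 'a', so that character is twice as likely as the others.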
So there, that should get you a decent globally and locally pseudo-unique ID.
A few last points: according to the Birthday paradox, you'll get a ~1% chance of collisions after generating ~142 million IDs, and a 99% chance after generating 3 billion IDs. So if you hit great commercial success and have millions of IDs being generated, get a larger ID.
Finally, I think I provided a "better than the worse" solution to your problem, but I can't help but think you're attacking this problem in the wrong fashion, and possibly as other have mentioned, misreading the specs. So use this if there are no other ways that would be more "bulletproof" (centralised ID provider, much longer ID ... ).
Edit: I re-read your question, and you say you call this function possibly many times a second. I was assuming this was to serve as some kind of application ID, generated once at the start of your application, and never changed afterwards until it exited. Since it's not the case, you should definitely add a random number and if you generate a lot of IDs, make that at least a 32 bits number. And read and re-read the Birthday Paradox I linked to above. And seed your number generator to a highly entropic value, like the usec value of the current timestamp for example. Or even go so far as to get your random values from /dev/urandom .
Very honestly, my take on your endeavour is that 60 bits is probably not enough...
Hmm, using the system clock may be a weakness... what if someone sets the clock back? You might re-generate the same ID again. But if you are going to use the clock, you might call gettimeofday() instead of time(); at least that way you'll get better resolution than one second.
@Doug T.
No, I'm not in control of all the software generating the ids.
I agree that without a standardized algorithm there may be collisions; I've raised this issue in the appropriate mailing lists.
Taking a cue from your reply, I decided to use the /dev/urandom PRNG for a 32-bit random number as the space-unique component of the id. I assume that every machine will have its own noise signature and it can be assumed to be safely globally unique in space at an instant of time. The time-unique component that I used earlier remains the same.
These unique ids are generated to collate all the billing information collected from different network functions that independently generated charging information of a particular call during call processing.
Here's the updated code below:
#include <stdio.h>
#include <string.h>
#include <time.h>
//base-63 character set
static char set[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";
// b63() returns the next vacant location in char array x
int b63(long long longlong, char *x, int index){
    if(index > 9)
        return index+1;
    if(longlong < 63){
        x[index] = set[longlong];
        return index+1;
    }
    x[index] = set[longlong%63];
    return b63(longlong/63, x, index+1);
}
int main(){
    unsigned int number;
    char x[11], y[11] = {0};
    FILE *urandom = fopen("/dev/urandom", "r");
    if(urandom == NULL)
        return -1;
    //let's generate a 1 billion ids
    for(int i=0; i<1000000000; i++){
        fread(&number, 1, sizeof(number), urandom);
        // add i to timestamp to take care of sub-second function calls,
        b63((long long)time(NULL)+i, x, b63((long long)number, x, 0));
        // reverse the char array to get proper base-63 output
        for(int j=0, k=9; j<10; j++, k--)
            y[j] = x[k];
        printf("%s\n", y);
    }
    fclose(urandom);
    return 0;
}

Can this checksum algorithm be improved?

We have a very old, unsupported program which copies files across SMB shares. It has a checksum algorithm to determine if the file contents have changed before copying. The algorithm seems easily fooled -- we've just found an example where two files, identical except a single '1' changing to a '2', return the same checksum. Here's the algorithm:
unsigned long GetFileCheckSum(CString PathFilename)
{
    FILE* File;
    unsigned long CheckSum = 0;
    unsigned long Data = 0;
    unsigned long Count = 0;
    if ((File = fopen(PathFilename, "rb")) != NULL)
    {
        while (fread(&Data, 1, sizeof(unsigned long), File) != FALSE)
        {
            CheckSum ^= Data + ++Count;
            Data = 0;
        }
        fclose(File);
    }
    return CheckSum;
}
I'm not much of a programmer (I am a sysadmin) but I know an XOR-based checksum is going to be pretty crude. What're the chances of this algorithm returning the same checksum for two files of the same size with different contents? (I'm not expecting an exact answer, "remote" or "quite likely" is fine.)
How could it be improved without a huge performance hit?
Lastly, what's going on with the fread()? I had a quick scan of the documentation but I couldn't figure it out. Is Data being set to each byte of the file in turn? Edit: okay, so it's reading the file into unsigned long (let's assume a 32-bit OS here) chunks. What does each chunk contain? If the contents of the file are abcd, what is the value of Data on the first pass? Is it (in Perl):
(ord('a') << 24) & (ord('b') << 16) & (ord('c') << 8) & ord('d')
MD5 is commonly used to verify the integrity of transfer files. Source code is readily available in c++. It is widely considered to be a fast and accurate algorithm.
See also Robust and fast checksum algorithm?
I'd suggest you take a look at Fletcher's checksum, specifically fletcher-32, which ought to be fairly fast, and detect various things the current XOR chain would not.
You could easily improve the algorithm by using a formula like this one:
Checksum = (Checksum * a + Data * b) + c;
If a, b and c are large primes, this should return good results. After this, rotating (not shifting!) the bits of checksum will further improve it a bit.
Using primes, this is a similar algorithm to that used for Linear congruential generators - it guarantees long periods and good distribution.
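A rough sketch of that formula with a rotation added, working one byte at a time. The three constants are arbitrary picks (they are the 32-bit constants xxHash uses), the rotation amount of 13 is likewise arbitrary, and mixed_checksum is a made-up name; treat it as an illustration of the idea rather than a vetted checksum.

```cpp
#include <cstdint>
#include <string>

// Rotate left; r must be between 1 and 31.
std::uint32_t rotl32(std::uint32_t v, unsigned r) {
    return (v << r) | (v >> (32 - r));
}

// Checksum = rotl(Checksum * a + Data * b + c), applied per byte.
std::uint32_t mixed_checksum(const std::string& data) {
    const std::uint32_t a = 2654435761u;  // assumed constants, all odd
    const std::uint32_t b = 2246822519u;
    const std::uint32_t c = 3266489917u;
    std::uint32_t checksum = 0;
    for (unsigned char byte : data) {
        checksum = checksum * a + byte * b + c;  // wraps modulo 2^32
        checksum = rotl32(checksum, 13);
    }
    return checksum;
}
```

Because a and b are odd, each step is invertible, so two inputs of the same length that differ in a single byte always produce different checksums, which covers the '1' versus '2' case from the question.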
The fread bit is reading in the file one chunk at a time. Each chunk is the size of a long (in C this is not a well-defined size, but you can assume 32 or 64 bits). Depending on how it gets buffered, this might not be too bad. OTOH, reading a larger chunk into an array and looping over it might be a lot faster.
Even "expensive" cryptographic hash functions usually require multiple iterations to take significant amounts of time. Although no longer recommended for cryptographic purposes, where users would deliberately try to create collisions, functions like SHA1 and MD5 are widely available and suitable for this purpose.
If a smaller hash value is needed, CRC is alright, but not great. An n-bit CRC will fail to detect a small fraction of changes that are longer than n bits. For example, suppose just a single dollar amount in a file is changed, from $12,345 to $34,567. A 32-bit CRC might miss that change.
Truncating the result of a longer cryptographic hash will detect changes more reliably than a CRC.
It seems like your algorithm makes no effort to deal with files that are not an exact multiple of 4 bytes in size. The return value of fread is not a boolean but the number of bytes actually read, which will differ from 4 in the case of an EOF or if an error occurred. You check for neither, but simply assume that if it didn't return 0, you have 4 valid bytes in 'Data' with which to calculate your hash.
If you really want to use a hash, I'd recommend several things. First, use a simple cryptographic hash like MD5, not CRC32. CRC32 is decent for checking data validity, but for spanning a file system and ensuring no collisions, it's not as great a tool because of the birthday paradox mentioned in the comments elsewhere. Second, don't write the function yourself. Find an existing implementation. Finally, consider simply using rsync to replicate files instead of rolling your own solution.
CheckSum ^= Data + ++Count;
Data = 0;
I don't think "++Count" does much work. The code is equivalent to
CheckSum ^= Data;
XORing a sequence of bytes is not enough. Especially with text files.
I suggest to use a hash function.
SHA-1 and (more recently) SHA-2 provide excellent hashing functions and, I believe, are slowly supplanting MD5 due to better hashing properties. All of them (MD2, SHA, etc.) have efficient implementations and return a hash of a buffer that is several characters long (although always a fixed length). They are provably more reliable than reducing a hash to an integer. If I had my druthers, I'd use SHA-2. Follow this link for libraries that implement SHA checksums.
If you don't want to compile in those libraries, linux (and probably cygwin) has the following executables: md5sum, sha1sum, sha224sum, sha256sum, sha384sum, sha512sum; to which you can provide your file and they will print out the checksum as a hex string.
You can use popen to execute those programs -- with something like this:
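const int maxBuf=1024;
char buf[maxBuf];
// open the command for reading ("r") so its output can be captured
FILE* f = popen( "sha224sum myfile", "r" );
if ( f != NULL ) {
    size_t bytesRead = fread( buf, 1, maxBuf, f );
    pclose( f ); // streams opened with popen are closed with pclose
}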
Obviously this will run quite a lot slower, but makes for a useful first pass.
If speed is an issue, given that file hashing operations like this are I/O bound (memory and disk access will be your bottlenecks), I'd expect all of these algorithms to run about as fast as one that produces an unsigned int. Perl and Python also come with implementations of MD5, SHA1 and SHA2 and will probably run as fast as in C/C++.

is there any function in C++ that calculates a fingerprint or hash of a string that's guaranteed to be at least 64 bits wide?

I'd like to replace my unordered_map<string, int> with unordered_map<long long, int>.
Given the answers that I'm getting (thanks Stack Overflow community...) the technique that I'm describing is not well-known. The reason that I want an unordered map of fingerprints instead of strings is for space and speed. The second map does not have to store strings and when doing the lookup, it doesn't incur any extra cache misses to fetch those strings. The only downside is the slight chance of a collision. That's why the key has to be 64 bits: a probability of 2^(-64) is basically an impossibility. Of course, this is predicated on a good hash function, which is exactly what my question is seeking.
Thanks again, Stack Overflowers.
unordered_map always hashes the key into a size_t variable. This is independent from the type of the key and depends solely on the architecture you are working with.
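For what the question describes (a map keyed by a 64-bit fingerprint instead of the string), a sketch could look like this. FNV-1a is my stand-in here, not something from this thread; it is a simple non-cryptographic hash, so whether its collision behaviour is good enough is something to judge for your own data.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// 64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ull;  // FNV prime; wraps modulo 2^64
    }
    return h;
}

// Map keyed by fingerprint rather than by the string itself.
// Two strings with the same fingerprint would silently share an entry.
using FingerprintMap = std::unordered_map<std::uint64_t, int>;
```

Usage is then along the lines of m[fnv1a64(key)] = value. Note that the map still reduces the 64-bit key to a size_t internally to pick a bucket.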
C++ has no native 128-bit type, nor does it have native hashing support. Such extensions for hashing are supposed to be added in TR1, but as far as I am aware 128-bit ints aren't supported by many compilers. (Microsoft supports an __int128 type -- only on x64 platforms though)
I'd expect the functions included with unordered_map would be faster in any case.
If you really do want to do things that way, MD5 provides a good 128 bit hash.
If you want to map any string to a unique integer:
typedef std::map<std::string,long long> Strings;
static Strings s_strings;
long long s_highWaterMark = 0;
long long my_function(const std::string& s)
{
    Strings::const_iterator it = s_strings.find(s);
    if (it != s_strings.end())
    {
        //we've previously returned a fingerprint for this string
        //now return the same fingerprint again
        return it->second;
    }
    //else new fingerprint
    long long rc = ++s_highWaterMark;
    //... remember it for next time
    s_strings.insert(Strings::value_type(s, rc));
    //... and return it this time
    return rc;
}
What exactly is it that you seek to achieve? Your map is not gonna work any better with a "bigger" hash function. Not notably, anyway.