10 character id that's globally and locally unique - c++

I need to generate a 10 character unique id (SIP/VOIP folks need to know that it's for the icid-value param in the P-Charging-Vector header). Each character shall be one of the 26 ASCII letters (in either case, case sensitive), one of the 10 ASCII digits, or the hyphen-minus.
It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!
Here's my take on it. I'm first encoding the local IP address (the part that MUST be globally unique) into base 63 (it's an unsigned long int that will occupy 1-6 characters after encoding), and then as much as I can of the current timestamp (it's a time_t/long long int that will occupy 9-4 characters after encoding, depending on how much space the encoded IP address takes up in the first place).
I've also added the loop count 'i' to the timestamp to preserve uniqueness in case the function is called more than once in a second.
Is this good enough to be globally and locally unique or is there another better approach?
Gaurav
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

// base-63 character set
static char set[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

// b63() encodes 'longlong' into x (least significant digit first)
// and returns the next vacant location in char array x
int b63(long long longlong, char *x, int index){
    if(index > 9)
        return index + 1;
    if(longlong < 63){
        x[index] = set[longlong];
        return index + 1;
    }
    x[index] = set[longlong % 63];
    return b63(longlong / 63, x, index + 1);
}

int main(){
    char x[11] = {0}, y[11] = {0}; /* '\0' (and any unused slots) taken care of here */
    // let's generate 10 million ids
    for(int i = 0; i < 10000000; i++){
        /* add i to the timestamp to take care of sub-second function calls,
           3770168404 (a sample ip address in n/w byte order) = 84.52.184.224 */
        b63((long long)time(NULL) + i, x, b63((long long)3770168404, x, 0));
        // reverse the char array to get proper base-63 output
        for(int j = 0, k = 9; j < 10; j++, k--)
            y[j] = x[k];
        printf("%s\n", y);
    }
    return 0;
}

It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!
Are you in control of all the software generating ids? Are you doling out the ids? If not...
I know nothing about SIP, but there's got to be a misunderstanding on your part about the spec (or the spec must be wrong). If another developer builds an id using a different algorithm than the one you've cooked up, you will have collisions with their ids, meaning they will no longer be globally unique in that system.
I'd go back to the SIP documentation and see if there's an appendix with an algorithm for generating these ids. Or maybe a smarter SO user than I can answer what the SIP algorithm for generating these ids is.

I would have a serious look at RFC 4122, which describes the generation of 128-bit GUIDs. There are several different generation algorithms, some of which may fit (the MAC-address-based one springs to mind). This is a bigger number space than yours (2^128 = 3.4 * 10^38 compared with 63^10 = 9.8 * 10^17), so you may have to make some compromises on uniqueness. Consider factors like how frequently the IDs will be generated.
However in the RFC, they have considered some practical issues, like the ability to generate large numbers of unique values efficiently by pre-allocating blocks of IDs.
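If a full RFC 4122 UUID is acceptable to whatever consumes these ids, a minimal sketch using libuuid (assuming it's installed; link with -luuid) looks like the following. The 36-character text form obviously doesn't fit into 10 characters, so this only illustrates the generation side:
#include <stdio.h>
#include <uuid/uuid.h>   /* libuuid, link with -luuid */

int main(){
    uuid_t id;        /* 16 raw bytes */
    char text[37];    /* 36 chars + '\0' */
    uuid_generate(id);      /* random- or time/MAC-based, chosen by the library */
    uuid_unparse(id, text);
    printf("%s\n", text);
    return 0;
}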

Can't you just have a distributed ID table?

Machines on NAT'ed LANs will often have an IP from a small range, and not all of the 32-bit values would be valid (think multicast, etc). Machines may also grab the same timestamp, especially if the granularity is large (such as seconds); keep in mind that the year is very often going to be the same, so it's the lower bits that will give you the most 'uniqueness'.
You may want to take the various values, hash them with a cryptographic hash, and translate that to the characters you are permitted to use, truncating to the 10 characters.
But you're dealing with a value with less than 60 bits; you need to think carefully about the implications of a collision. You might be approaching the problem the wrong way...

Well, if I cast aside the fact that I think this is a bad idea, and concentrate on a solution to your problem, here's what I would do:
You have an id space of 63^10 values, which corresponds to roughly 60 bits. You want it to be both "globally" and "locally" unique. Let's generate the first N bits to be globally unique, and the rest to be locally unique. The concatenation of the two will have the properties you are looking for.
First, the global uniqueness: IP addresses won't work, especially local ones; they hold very little entropy. I would go with MAC addresses, they were made to be globally unique. They cover a range of 256^6, so they use up 6*8 = 48 bits.
Now, for the locally unique part: why not use the process ID? I'm making the assumption that the uniqueness is per process; if it's not, you'll have to think of something else. On Linux, a process ID is 32 bits. If we wanted to nitpick, the 2 most significant bytes probably hold very little entropy, as they would be 0 on most machines. So discard them if you know what you're doing.
So now you'll see you have a problem, as it would take up to 70 bits to generate a decent (but not bulletproof) globally and locally unique ID (using my technique, anyway). And since I would also advise putting in a random number (at least 8 bits long) just in case, it definitely won't fit. So if I were you, I would hash the ~78 generated bits with SHA-1 (for example), and convert the first 60 bits of the resulting hash to your ID format. To do so, notice that you have a 63-character range to choose from, so almost the full range of 6 bits. So split the hash into 6-bit pieces, and use the first 10 pieces to select the 10 characters of your ID from the 63-character range. Obviously, the range of 6 bits is 64 possible values (you only want 63), so if you have a 6-bit piece equal to 63, either floor it to 62 or take it modulo 63 and pick 0. It will slightly bias the distribution, but it's not too bad.
So there, that should get you a decent globally and locally pseudo-unique ID.
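As a rough sketch of that last mapping step, assuming you already have a digest from your hash library of choice (the hard-coded bytes below are just a stand-in for real SHA-1 output over the MAC, PID and random bits):
#include <stdio.h>
#include <stdint.h>

static const char set63[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

/* Take the first 60 bits of a digest and map them to 10 base-63 characters. */
void digest_to_id(const uint8_t *digest, char out[11]){
    for(int i = 0; i < 10; i++){
        int bit = i * 6;
        /* extract 6 bits starting at offset 'bit' (big-endian within the digest) */
        unsigned piece = (digest[bit / 8] << 8) | digest[bit / 8 + 1];
        piece = (piece >> (10 - bit % 8)) & 0x3F;
        out[i] = set63[piece % 63];   /* fold 63 onto 0, slight bias */
    }
    out[10] = '\0';
}

int main(){
    uint8_t fake_digest[20] = {0xde,0xad,0xbe,0xef,0x01,0x23,0x45,0x67,
                               0x89,0xab,0xcd,0xef,0x10,0x32,0x54,0x76,
                               0x98,0xba,0xdc,0xfe};  /* stand-in for SHA-1 output */
    char id[11];
    digest_to_id(fake_digest, id);
    printf("%s\n", id);
    return 0;
}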
A few last points: according to the Birthday Paradox, you'll have a ~1% chance of a collision after generating ~142 million IDs, and a 99% chance after generating ~3 billion IDs. So if you hit great commercial success and have millions of IDs being generated, get a larger ID.
Finally, I think I provided a "better than the worst" solution to your problem, but I can't help thinking you're attacking this problem in the wrong fashion, and possibly, as others have mentioned, misreading the specs. So use this only if there are no other approaches that would be more "bulletproof" (a centralised ID provider, a much longer ID...).
Edit: I re-read your question, and you say you call this function possibly many times a second. I was assuming this was to serve as some kind of application ID, generated once at the start of your application and never changed afterwards until it exited. Since that's not the case, you should definitely add a random number, and if you generate a lot of IDs, make that at least a 32-bit number. And read and re-read the Birthday Paradox I linked to above. And seed your number generator with a highly entropic value, like the usec value of the current timestamp, for example. Or even go as far as getting your random values from /dev/urandom.
Very honestly, my take on your endeavour is that 60 bits is probably not enough...

Hmm, using the system clock may be a weakness... what if someone sets the clock back? You might re-generate the same ID again. But if you are going to use the clock, you might call gettimeofday() instead of time(); at least that way you'll get better resolution than one second.
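For example, a small sketch of folding microseconds into the value that gets encoded (the helper name is mine):
#include <sys/time.h>

long long now_usec(void){
    struct timeval tv;
    gettimeofday(&tv, NULL);   /* seconds + microseconds since the epoch */
    return (long long)tv.tv_sec * 1000000 + tv.tv_usec;
}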

@Doug T.
No, I'm not in control of all the software generating the ids.
I agree that without a standardized algorithm there may be collisions; I've raised this issue on the appropriate mailing lists.
@Florian
Taking a cue from your reply, I decided to use the /dev/urandom PRNG for a 32-bit random number as the space-unique component of the id. I assume that every machine will have its own noise signature and that it can be taken to be safely globally unique in space at an instant of time. The time-unique component that I used earlier remains the same.
These unique ids are generated to collate all the billing information collected from different network functions that independently generated charging information of a particular call during call processing.
Here's the updated code below:
Gaurav
#include <stdio.h>
#include <string.h>
#include <time.h>

// base-63 character set
static char set[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";

// b63() returns the next vacant location in char array x
int b63(long long longlong, char *x, int index){
    if(index > 9)
        return index + 1;
    if(longlong < 63){
        x[index] = set[longlong];
        return index + 1;
    }
    x[index] = set[longlong % 63];
    return b63(longlong / 63, x, index + 1);
}

int main(){
    unsigned int number;
    char x[11] = {0}, y[11] = {0};
    FILE *urandom = fopen("/dev/urandom", "r");
    if(!urandom)
        return -1;
    // let's generate 1 billion ids
    for(int i = 0; i < 1000000000; i++){
        if(fread(&number, sizeof(number), 1, urandom) != 1)
            break;
        // add i to the timestamp to take care of sub-second function calls
        b63((long long)time(NULL) + i, x, b63((long long)number, x, 0));
        // reverse the char array to get proper base-63 output
        for(int j = 0, k = 9; j < 10; j++, k--)
            y[j] = x[k];
        printf("%s\n", y);
    }
    fclose(urandom);
    return 0;
}

Related

What are some checksum implementations that allow for incremental computation?

In my program I have a set of sets that are stored in a proprietary hash table. Like all hash tables, I need two functions for each element. First, I need the hash value to use for insertion. Second, I need a compare function for when there are conflicts. It occurs to me that a checksum function would be perfect for this: I could use the value in both functions. There's no shortage of checksum functions, but I would like to know if there are any commonly available ones that I wouldn't need to bring in a library for (my company is a PIA when it comes to that). A system library would be OK.
But I have an additional, more complicated requirement. I need for the checksum to be incrementally calculable. That is, if a set contains A B C D E F and I subtract D from the set, it should be able to return a new checksum value without iterating over all the elements in the set again. The reason for this is to prevent non-linearity in my code. Ideally, I'd like for the checksum to be order independent but I can sort them first if needed. Does such an algorithm exist?
Simply store a dictionary of items in your set, and their corresponding hash value. The hash value of the set is the hash value of the concatenated, sorted hashes of the items. In Python:
import hashlib

# dictionary of hashes in string (hex) representation; items are assumed to be bytes
hashes = { item: hashlib.sha384(item).hexdigest() for item in items }
sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join(sorted_hashes)
hash_of_the_set = hashlib.sha384(concatenated_hashes.encode()).hexdigest()
As hash function I would use sha384, but you might want to try Keccak-384.
Because there are (of course) no cryptographic hash functions with a length of only 32 bits, you have to use a checksum instead, like Adler-32 or CRC32. The idea remains the same. Best use Adler-32 on the items and CRC-32 on the concatenated hashes:
import zlib

hashes = { item: zlib.adler32(item) for item in items }   # adler32 returns an int
sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join(str(h) for h in sorted_hashes)
hash_of_the_set = zlib.crc32(concatenated_hashes.encode())
In C++ you can use Adler-32 and CRC-32 of Botan.
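If pulling in Botan is not an option, a rough C++ sketch of the same idea on top of zlib (usually available as a system library; link with -lz) might look like this; the container and item type are placeholders:
#include <algorithm>
#include <string>
#include <vector>
#include <zlib.h>

// Order-independent checksum of a set of strings:
// adler32 each item, sort the per-item checksums, then crc32 their decimal forms.
uLong set_checksum(const std::vector<std::string>& items){
    std::vector<uLong> sums;
    for(const std::string& s : items){
        uLong a = adler32(0L, Z_NULL, 0);
        a = adler32(a, reinterpret_cast<const Bytef*>(s.data()), static_cast<uInt>(s.size()));
        sums.push_back(a);
    }
    std::sort(sums.begin(), sums.end());
    uLong crc = crc32(0L, Z_NULL, 0);
    for(uLong a : sums){
        std::string dec = std::to_string(a);
        crc = crc32(crc, reinterpret_cast<const Bytef*>(dec.data()), static_cast<uInt>(dec.size()));
    }
    return crc;
}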
A CRC is a set of bits that are calculated from an input.
If your input is the same size (or less) as the CRC (in your case - 32 bits), you can find the input that created this CRC - in effect reversing it.
If your input is larger than 32 bits, but you know all the input except for 32 bits, you can still reverse the CRC to find the missing bits.
If, however, the unknown part of the input is larger than 32 bits, you can't find it as there is more than one solution.
Why am I telling you this? Imagine you have the CRC of the set
{A,B,C}
Say you know what B is, and you can now calculate easily the CRC of the set
{A,C}
(by "easily" I mean - without going over the entire A and C inputs - like you wanted)
Now you have 64 bits describing A and C! And since we didn't have to go over the entirety of A and C to do it - it means we can do it even if we're missing information about A and C.
So it looks like IF such a method exists, we can magically fix more than 32 unknown bits from an input if we have the CRC of it.
This obviously is wrong. Does that mean there's no way to do what you want? Of course not. But it does give us constraints on how it can be done:
Option 1: we don't gain more information from CRC({A,C}) that we didn't have in CRC({A,B,C}). That means that the (relative) effect of A and C on the CRC doesn't change with the removal of B. Basically - it means that when calculating the CRC we use some "order not important" function when adding new elements:
we can use, for example, CRC({A,B,C}) = CRC(A) ^ CRC(B) ^ CRC(C) (not very good, as if A appears twice it's the same CRC as if it never appeared at all), or CRC({A,B,C}) = CRC(A) + CRC(B) + CRC(C), or CRC({A,B,C}) = CRC(A) * CRC(B) * CRC(C) (make sure CRC(X) is odd, so it's actually just 31 bits of CRC), or CRC({A,B,C}) = g^CRC(A) * g^CRC(B) * g^CRC(C) (where ^ is power - useful if you want it cryptographically secure), etc.
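A minimal sketch of option 1 using the addition variant (per-item CRC32s summed modulo 2^32, so removing an item is a constant-time subtraction); crc32 comes from zlib, and the item type is a placeholder:
#include <cstdint>
#include <string>
#include <zlib.h>   // link with -lz

struct SetChecksum {
    uint32_t sum = 0;   // order-independent: sum of item CRCs mod 2^32

    static uint32_t item_crc(const std::string& item){
        return static_cast<uint32_t>(
            crc32(crc32(0L, Z_NULL, 0),
                  reinterpret_cast<const Bytef*>(item.data()),
                  static_cast<uInt>(item.size())));
    }
    void add(const std::string& item)    { sum += item_crc(item); }
    void remove(const std::string& item) { sum -= item_crc(item); }  // O(1) update
};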
Option 2: we do need all of A and C to calculate CRC({A,C}), but we have a data structure that makes it less than linear in time to do so if we already calculated CRC({A,B,C}).
This is useful if you want specifically CRC32, and don't mind remembering more information in addition to the CRC after the calculation (the CRC is still 32 bit, but you remember a data structure that's O(len(A,B,C)) that you will later use to calculate CRC{A,C} more efficiently)
How will that work? Many CRCs are just the application of a polynomial on the input.
Basically, if you divide the input into n chunks of 32 bit each - X_1...X_n - there is a matrix M such that
CRC(X_1...X_n) = M^n * X_1 + ... + M^1 * X_n
(where ^ here is power)
How does that help? This sum can be calculated in a tree-like fashion:
CRC(X_1...X_n) = M^(n/2) * CRC(X_1...X_(n/2)) + CRC(X_(n/2+1)...X_n)
So you begin with all the X_i on the leaves of the tree, start by calculating the CRC of each consecutive pair, then combine them in pairs until you get the combined CRC of all your input.
If you remember all the partial CRCs on the nodes, you can then easily remove (or add) an item anywhere in the list by doing just O(log(n)) calculations!
So there - as far as I can tell, those are your two options. I hope this wasn't too much of a mess :)
I'd personally go with option 1, as it's just simpler... but the resulting CRC isn't standard, and is less... good. Less "CRC"-like.
Cheers!

Generate nonce c++

I am wondering if there is a way to generate a Cryptographic Nonce using OpenSSL or Crypto++ libraries. Is there anything more to it than just generating a set of random bytes using autoseeded pools?
I am wondering if there is a way to generate a cryptographic nonce using OpenSSL or Crypto++ libraries.
Crypto++:
#include <cryptopp/osrng.h>   // AutoSeededRandomPool, SecByteBlock
using namespace CryptoPP;
SecByteBlock nonce(16);
AutoSeededRandomPool prng;
prng.GenerateBlock(nonce, nonce.size());
OpenSSL:
#include <openssl/rand.h>
#include <openssl/err.h>

unsigned char nonce[16];
int rc = RAND_bytes(nonce, sizeof(nonce));
unsigned long err = ERR_get_error();
if(rc != 1) {
    /* RAND_bytes failed */
    /* `err` is valid */
}
/* OK to proceed */
Is there anything more to it than just generating a set of random bytes using autoseeded pools?
A nonce is basically an IV. It's usually considered a public parameter, like an IV or a salt.
A nonce must be unique within a security context. You may need a nonce to be unpredictable, too.
Uniqueness and unpredictability are two different properties. For example, a counter starting at 0000000000000000 is unique, but it's also predictable.
When you need both uniqueness and unpredictability, you can partition the nonce into a random value and a counter. The random value will take up 8 bytes of a 16 byte nonce; while the counter will take up the remaining 8 bytes of a 16 byte nonce. Then you use an increment function to basically perform i++ each time you need a value.
You don't need an 8-8 split. 12-4 works, as does 4-12. It depends on the application and the number of nonces required before rekeying. Rekeying is usually driven by plain text byte counts.
16-0 also works. In this case, you're using random values, avoiding the counter, and avoiding the increment function. (The increment function is basically a cascading add).
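As a hedged sketch of the 8-8 split (a random prefix fixed per key/session plus a big-endian counter in the last 8 bytes); RAND_bytes is OpenSSL's, everything else is illustrative:
#include <cstdint>
#include <cstring>
#include <openssl/rand.h>

struct NonceGenerator {
    unsigned char prefix[8];   // random half, generated once
    uint64_t counter = 0;      // deterministic half

    bool init() { return RAND_bytes(prefix, sizeof(prefix)) == 1; }

    void next(unsigned char out[16]){
        std::memcpy(out, prefix, 8);
        uint64_t c = counter++;
        for(int i = 7; i >= 0; i--){   // big-endian counter in bytes 8..15
            out[8 + i] = static_cast<unsigned char>(c & 0xFF);
            c >>= 8;
        }
    }
};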
NIST SP800-38C and SP800-38D offer a couple of methods for creating nonces because CCM and GCM uses them.
Also see What are the requirements of a nonce? on the Crypto Stack Exchange.
You need a unique number for each nonce. You can use either a serial number or a random number. To help ensure uniqueness, it is common, though not required, to add a timestamp to the nonce, either passing the timestamp as a separate field or concatenating it with the nonce. Sometimes information such as IP addresses and process IDs is also added.
When you use a serial number, you don't need to worry about skipping numbers. That's fine; just make sure you never repeat. It must be unique across restarts of your software. This is one place where adding a timestamp can help, because time-in-millis + serial-number is almost certainly unique across restarts of the server.
For the pseudo-random number generator, any one should be fine. Just make sure that you use a sufficiently large space to make the chances of getting a duplicate effectively impossible. Again, adding time will reduce the likelihood of getting duplicates, as you'd need to get the same random number twice in the same millisecond.
You may wish to hash the nonce to obscure the data in it (eg: process ID) though the hash will only be secure if you include a secure random number in the nonce. Otherwise it may be possible for a viewer of the nonce to guess the components and validate by redoing the hash (ie: they guess the time and try all possible proc IDs).
No. If the nonce is large enough then an autoseeded DRBG (deterministic random bit generator - NIST nomenclature) is just fine. I would suggest a nonce of about 12 bytes. If the nonce needs to be 16 bytes then you can leave the least significant bits - most often the rightmost bytes - set to zero for maximum compatibility.
Just using the cryptographically secure random number generators provided by the API should be fine - they should be seeded using information obtained from the operating system (possibly among other data). It never hurts to add the system time to the seed data just to be sure.
Alternatively you could use a serial number, but that would require you to keep some kind of state which may be hard across invocations. Beware that there are many pitfalls that may allow a clock to repeat itself (daylight saving, OS changes, dead battery etc. etc.).
It never hurts to double check that the random number generator doesn't repeat for a large enough output. There have been issues just with programming or system configuration mistakes, e.g. when a fix after a static code analysis for Debian caused the OpenSSL RNG not to be seeded at all.

Is there a way to uniquely build an integer from below data?

I have structured data like below:
struct Leg
{
    char type;
    char side;
    int  qty;
    int  id;
} Legs[5];
where
type is O or E,
side is B or S;
qty is 1 to 9999, and the qty values across all Legs are relatively prime to each other, i.e. 1 2 3, not 2 4 6;
id is an integer from 1 to 9999999, and all ids are unique within the group of Legs.
To build a unique signature of the above data, I currently build a string like below:
first sort Legs based on id;
then
signature=""
for i=1 to 5
signature+=id+type+qty+side of leg-i
and I insert it into an unordered_map, so that when any matching structured data comes in, I can look it up by building its signature the same way.
An unordered_map keyed on a string means the key compare is a string compare, and the hash function also has to traverse the string, which is usually around 25 chars.
For efficiency, if it is possible to build a unique integer out of the above data for each structure, the lookups/insertions in the unordered_map will be much faster.
Just wondering if there is any mathematical properties I can take advantage of.
Edit:
The map will contain key,value pairs like
<unique-signature = key, value = the int value that needs to be located when looking up another repeating Leg group, whose signature is constructed as above after sorting its Legs based on id>
<123O2B234E3S456O3S567O2S789E2B, 989>
The goal is to build a unique signature for each such unique repeating group of legs. Legs can arrive in a different order and yet still match another group of legs that is in a different order; that's why I sort based on id, which is unique, and then build the signature.
My signature is string based, if there was a way to construct a unique number signature, then my lookups/insertions will be faster.
You can just create a unique 40-bit number from the fields you have. Why 40 bits? I'm glad you asked.
You have 9,999,999 possible id values, which means you can use 24 bits to represent all possibilities (log2(9999999) = a little over 23).
You have 9,999 possible qty values, which requires another 14 bits.
type and side require 1 bit each, which gives you a total of 40 bits of information. Store this number as a long long and you have a nice, fast key for your map.
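A hedged sketch of that packing (field widths as described above, no range checking, and the 'O'/'B' encoding is my own choice):
#include <cstdint>

// Pack one Leg into 40 bits: [type:1][side:1][qty:14][id:24]
long long leg_key(const Leg& l){
    long long key = (l.type == 'O') ? 1 : 0;
    key = (key << 1)  | ((l.side == 'B') ? 1 : 0);
    key = (key << 14) | (l.qty & 0x3FFF);
    key = (key << 24) | (l.id  & 0xFFFFFF);
    return key;
}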
If you really want a unique int key then you're probably out of luck because it's going to be pretty tricky to get rid of 8 bits of information. You might be able to take advantage of the co-primality of the qty field to represent it in fewer than 14 bits, however I doubt that you can get it down to 6 bits because that only gives you 64 possible values for qty.
That's a way to get what you asked for, but @David Schwartz's answer is probably what you actually need: hash collisions are generally not expensive unless you have a really bad hash function - see Application vulnerability due to Non Random Hash Functions for an example of how that can bite you - or a carefully crafted data set that happens to hit the worst case.
In your case you should be fine with David's answer. It'll be fast enough unless you are extremely unfortunate with your set of data.
EDIT: Just noticed that you are computing your signature over the set of 5 Legs. The same math applies, you will just need 200 bits rather than 40. So it won't fit in a long long unless you have some information that can be shared amongst all 5 Leg objects; if each set of 5 shares the same id, for example.
Stick with David's answer.
It doesn't have to be unique. I would suggest something like:
std::size_t hash_value(const Leg& l)
{
    std::size_t ret = l.type;
    ret <<= 8;
    ret |= l.side;
    ret *= 2654435761;
    ret += l.qty;
    ret *= 2654435761;
    ret += l.id;
    return ret * 2654435761;
}
In order to create an order-independent hash function for groups of five legs, first choose a hash function for individual legs -- David's answer looks great. Compute the hashes for each of the five legs. Now choose an order-independent function to combine these five hash values. You could, for example, xor the hashes together, or add them all together, or multiply them all together.
The fact that multiplication distributes over addition, and multiplication was the last operation to happen, makes me a little bit wary of using that. I think xor might be the best option of the ones I give here; but before using this in production, you should definitely run a few tests to see if you can easily generate collisions with any of them.
Probably superfluous, but here is a simple implementation that calls hash_value from David's answer:
std::size_t hash_value(const Leg_Array& legs) {
    std::size_t ret = 0;
    for (int i = 0; i < 5; ++i) {
        ret ^= hash_value(legs[i]);
    }
    return ret;
}

Fast code for searching bit-array for contiguous set/clear bits?

Is there some reasonably fast code out there which can help me quickly search a large bitmap (a few megabytes) for runs of contiguous zero or one bits?
By "reasonably fast" I mean something that can take advantage of the machine word size and compare entire words at once, instead of doing bit-by-bit analysis which is horrifically slow (such as one does with vector<bool>).
It's very useful for e.g. searching the bitmap of a volume for free space (for defragmentation, etc.).
Windows has an RTL_BITMAP data structure one can use along with its APIs.
But I needed the code for this some time ago, and so I wrote it here (warning, it's a little ugly):
https://gist.github.com/3206128
I have only partially tested it, so it might still have bugs (especially on reverse). But a recent version (only slightly different from this one) seemed to be usable for me, so it's worth a try.
The fundamental operation for the entire thing is being able to -- quickly -- find the length of a run of bits:
long long GetRunLength(
const void *const pBitmap, unsigned long long nBitmapBits,
long long startInclusive, long long endExclusive,
const bool reverse, /*out*/ bool *pBit);
Everything else should be easy to build upon this, given its versatility.
I tried to include some SSE code, but it didn't noticeably improve the performance. However, in general, the code is many times faster than doing bit-by-bit analysis, so I think it might be useful.
It should be easy to test if you can get a hold of vector<bool>'s buffer somehow -- and if you're on Visual C++, then there's a function I included which does that for you. If you find bugs, feel free to let me know.
I can't figure out how to do this well directly on memory words, so I've made up a quick solution that works on bytes; for convenience, let's sketch the algorithm for counting contiguous ones:
Construct two tables of size 256 where, for each number between 0 and 255, you store the number of consecutive 1's at the beginning and at the end of the byte. For example, for the number 167 (10100111 in binary), put 1 in the first table and 3 in the second table. Let's call the first table BBeg and the second table BEnd. Then, for each byte b, there are two cases: if it is 255, add 8 to your current sum for the current contiguous set of ones, and you are still in a region of ones. Otherwise, you end a region with BBeg[b] bits and begin a new one with BEnd[b] bits.
Depending on what information you want, you can adapt this algorithm (this is one reason why I don't put any code here, I don't know what output you want).
A flaw is that it does not count (small) contiguous sets of ones inside one byte...
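For what it's worth, here's a hedged sketch of one possible shape of that byte-at-a-time scan (it tracks only the longest run of ones that the two tables can see, i.e. runs touching byte boundaries or whole 0xFF bytes, as noted above; the function name and output choice are mine):
#include <cstdint>
#include <cstddef>
#include <algorithm>

// Longest run of 1-bits visible to the byte-table algorithm.
size_t longest_one_run(const uint8_t* bytes, size_t n){
    uint8_t bbeg[256], bend[256];          // leading / trailing ones per byte value
    for(int v = 0; v < 256; v++){          // build the tables (cheap, done once here for brevity)
        uint8_t lead = 0, trail = 0;
        for(int bit = 7; bit >= 0 && (v >> bit) & 1; bit--) lead++;
        for(int bit = 0; bit < 8  && (v >> bit) & 1; bit++) trail++;
        bbeg[v] = lead; bend[v] = trail;
    }
    size_t best = 0, run = 0;
    for(size_t i = 0; i < n; i++){
        uint8_t b = bytes[i];
        if(b == 0xFF){                     // whole byte of ones: run keeps going
            run += 8;
        } else {
            run += bbeg[b];                // ones at the start close the open run
            best = std::max(best, run);
            run = bend[b];                 // ones at the end open a new run
        }
    }
    return std::max(best, run);
}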
Beside this algorithm, a friend tells me that if it is for disk compression, just look for bytes different from 0 (empty disk area) and 255 (full disk area). It is a quick heuristic to build a map of what blocks you have to compress. Maybe it is beyond the scope of this topic ...
Sounds like this might be useful:
http://www.aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
and
http://www.aggregate.org/MAGIC/#Leading%20Zero%20Count
You don't say whether you want to do some sort of RLE or to simply count the zero and one bits within bytes (e.g. 0b1001 should return 1x1, 2x0, 1x1).
A look-up table plus a SWAR algorithm for the fast check might give you that information easily.
A bit like this:
byte lut[0x10000] = { /* see below */ };
for (uint *word = words; word < words + bitmapSize; word++) {
    if (*word == 0 || *word == (uint)-1) { // Fast bailout
        // Do what you want if all 0 or all 1
        continue;
    }
    byte hiVal = lut[*word >> 16], loVal = lut[*word & 0xFFFF];
    // Do what you want with hiVal and loVal
}
The LUT will have to be constructed depending on your intended algorithm. If you want to count the number of contiguous 0s and 1s in the word, you'll build it like this:
for (int i = 0; i < sizeof(lut); i++)
    lut[i] = countContiguousZero(i); // Or countContiguousOne(i)
// The implementation of countContiguousZero can be slow, you don't care.
// It should return the largest number of contiguous zeros (0 to 15) in the
// 4 low bits of the byte, and might return the position of the run in the
// 4 high bits of the byte.
// Since you've already dismissed *word == 0, you don't need the 16-contiguous-zero case.

Hash of a string to be of specific length

Is there a way to generate a hash of a string so that the hash itself would be of a specific length? I've got a function that generates 41-byte hashes (SHA-1), but I need it to be 33 bytes max (because of certain hardware limitations). If I truncate the 41-byte hash to 33, I'd probably (certainly!) lose the uniqueness.
Or actually I suppose an MD5 algorithm would fit nicely, if I could find some C code for one with your help.
EDIT: Thank you all for the quick and knowledgeable responses. I've chosen to go with an MD5 hash and it fits fine for my purpose. The uniqueness is an important issue, but I don't expect the number of those hashes to be very large at any given time - these hashes represent software servers on a home LAN, so at max there would be 5, maybe 10 running.
If I truncate the 41-byte hash to 33, I'd probably (certainly!) lose the uniqueness.
What makes you think you've got uniqueness now? Yes, there's clearly a higher chance of collision when you're only playing with 33 bytes instead of 41, but you need to be fully aware that collisions are only ever unlikely, not impossible, for any situation where it makes sense to use a hash in the first place. If you're hashing more than 41 bytes of data, there are clearly more possible combinations than there are hashes available.
Now, whether you'd be better off truncating the SHA-1 hash or using a shorter hash such as MD5, I don't know. I think I'd be more generally confident when keeping the whole of a hash, but MD5 has known vulnerabilities which may or may not be a problem for your particular application.
The way hashes are calculated, that's unfortunately not possible. To limit the hash length to 33 bytes, you will have to cut it. You could XOR the first and last 33 bytes, as that might keep more of the information. But even with 33 bytes you don't have that big a chance of a collision.
md5: http://www.md5hashing.com/c++/
btw. md5 is 16 bytes, sha1 is 20 bytes and sha256 is 32 bytes; however, as hex strings, they all double in size. If you can store raw bytes, you can even use sha256.
There is no more chance of collision with substring(sha_hash, 0, 33) than with any other hash that is 33 bytes long, due to the way hash algorithms are designed (entropy is evenly spread out in the resulting string).
You could use an Elf hash(<- C code included) or some other simple hash function like that instead of MD5 or SHA-X.
They are not secure, but they can be tuned to any length you need
#include <string>
using namespace std;

static unsigned int ELFHash(const string& str) {
    unsigned int hash = 0;
    unsigned int x = 0;
    unsigned int i = 0;
    unsigned int len = str.length();
    for (i = 0; i < len; i++)
    {
        hash = (hash << 4) + (str[i]);
        if ((x = hash & 0xF0000000) != 0)
        {
            hash ^= (x >> 24);
        }
        hash &= ~x;
    }
    return hash;
}
Example
string data = "jdfgsdhfsdfsd 6445dsfsd7fg/*/+bfjsdgf%$^";
unsigned int value = ELFHash(data);
Output
248446350
Hashes are by definition only unique for small amounts of data (and even then it's still not guaranteed). It is impossible to map a large amount of information uniquely to a small amount of information, by virtue of the fact that you can't magically get rid of information and get it back later. Keep in mind this isn't compression going on.
Personally, I'd use MD5 (if you need to store as text), or a 256-bit (32-byte) hash such as SHA-256 (if you can store binary) in this situation. Truncating another hash algorithm to 33 bytes works too, and MAY increase the possibility of generating hash collisions. It depends a lot on the algorithm.
Also, yet another C implementation of MD5, by the people who designed it.
I believe the MD5 hashing algorithm results in a 32-digit hex string, so maybe that one will be more suitable.
Edit: to access MD5 functionality, it should be possible to hook into the OpenSSL libraries. However, you mentioned hardware limitations, so this may not be possible in your case.
Here is an MD5 implementation in C.
The chance of two particular 33-character hashes colliding is 1/2^132, so don't worry about losing uniqueness.
Update: I didn't check the actual byte length of SHA-1. Here's the relevant calculation: by the Birthday Paradox, a 32-nibble collision (33 bytes of hex minus 1 termination char) only becomes likely when the number of strings hashed reaches around sqrt(2^(32*4)) = 2^64.
Use Apache's DigestUtils:
http://commons.apache.org/codec/api-release/org/apache/commons/codec/digest/DigestUtils.html#md5Hex(java.lang.String)
Converts the hash into 32 character hex string.