What are some checksum implementations that allow for incremental computation? - c++

In my program I have a set of sets that are stored in a proprietary hash table. Like all hash tables, I need two functions for each element. First, I need the hash value to use for insertion. Second, I need a compare function for when there are conflicts. It occurs to me that a checksum function would be perfect for this. I could use the value in both functions. There's no shortage of checksum functions, but I would like to know if there are any commonly available ones that I wouldn't need to bring in a library for (my company is a PIA when it comes to that). A system library would be OK.
But I have an additional, more complicated requirement. I need for the checksum to be incrementally calculable. That is, if a set contains A B C D E F and I subtract D from the set, it should be able to return a new checksum value without iterating over all the elements in the set again. The reason for this is to prevent non-linearity in my code. Ideally, I'd like for the checksum to be order independent but I can sort them first if needed. Does such an algorithm exist?

Simply store a dictionary of items in your set, and their corresponding hash value. The hash value of the set is the hash value of the concatenated, sorted hashes of the items. In Python:
import hashlib
# hashes: dictionary mapping each item to its hash in string representation.
# Items are assumed to be bytes objects, e.g.
hashes = { item: hashlib.sha384(item).hexdigest() for item in items }
sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join(sorted_hashes)
hash_of_the_set = hashlib.sha384(concatenated_hashes.encode('ascii')).hexdigest()
As hash function I would use sha384, but you might want to try Keccak-384.
Because there are (of course) no cryptographic hash functions with a length of only 32 bits, you have to use a checksum instead, like Adler-32 or CRC-32. The idea remains the same. It is best to use Adler-32 on the items and CRC-32 on the concatenated hashes:
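import zlib
hashes = { item: zlib.adler32(item) for item in items }
sorted_hashes = sorted(hashes.values())
# Each checksum is an unsigned 32-bit integer; pack it into 4 bytes before joining.
concatenated_hashes = b''.join(h.to_bytes(4, 'big') for h in sorted_hashes)
hash_of_the_set = zlib.crc32(concatenated_hashes)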
In C++ you can use Adler-32 and CRC-32 of Botan.

A CRC is a set of bits that are calculated from an input.
If your input is the same size (or less) as the CRC (in your case - 32 bits), you can find the input that created this CRC - in effect reversing it.
If your input is larger than 32 bits, but you know all the input except for 32 bits, you can still reverse the CRC to find the missing bits.
If, however, the unknown part of the input is larger than 32 bits, you can't find it as there is more than one solution.
Why am I telling you this? Imagine you have the CRC of the set
{A,B,C}
Say you know what B is, and you can now calculate easily the CRC of the set
{A,C}
(by "easily" I mean - without going over the entire A and C inputs - like you wanted)
Now you have 64 bits describing A and C! And since we didn't have to go over the entirety of A and C to do it - it means we can do it even if we're missing information about A and C.
So it looks like IF such a method exists, we can magically fix more than 32 unknown bits from an input if we have the CRC of it.
This obviously is wrong. Does that mean there's no way to do what you want? Of course not. But it does give us constraints on how it can be done:
Option 1: we don't gain more information from CRC({A,C}) that we didn't have in CRC({A,B,C}). That means that the (relative) effect of A and C on the CRC doesn't change with the removal of B. Basically - it means that when calculating the CRC we use some "order not important" function when adding new elements:
we can use, for example:
CRC({A,B,C}) = CRC(A) ^ CRC(B) ^ CRC(C) (where ^ is XOR; not very good, because if A appears twice the CRC is the same as if it never appeared at all), or
CRC({A,B,C}) = CRC(A) + CRC(B) + CRC(C), or
CRC({A,B,C}) = CRC(A) * CRC(B) * CRC(C) (make sure CRC(X) is odd, so it's actually just 31 bits of CRC), or
CRC({A,B,C}) = g^CRC(A) * g^CRC(B) * g^CRC(C) (where ^ is power; useful if you want it to be cryptographically secure), etc.
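As a rough illustration of the additive variant of Option 1, here is a minimal C++ sketch. It makes some assumptions that are not in the answer above: elements are identified by 64-bit integers, and the per-element hash is a splitmix64-style mixer (the hypothetical helper mix64) standing in for CRC(X). The class name SetChecksum is also made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a splitmix64-style mixer stands in for the per-element CRC(X).
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Order-independent checksum of a set: the sum (mod 2^64) of the element hashes.
// Adding or removing one element costs O(1); no other element is revisited.
class SetChecksum {
public:
    void add(std::uint64_t element) {
        sum_ += mix64(element);
        ++count_;
    }
    // The caller must only remove elements that are actually in the set.
    void remove(std::uint64_t element) {
        assert(count_ > 0 && "remove() called on an empty set");
        sum_ -= mix64(element);
        --count_;
    }
    std::uint64_t value() const { return sum_; }
    std::size_t size() const { return count_; }

private:
    std::uint64_t sum_ = 0;
    std::size_t count_ = 0;
};
```

Building {A,B,C,D,E,F} and then calling remove(D) gives the same value() as building {A,B,C,E,F} from scratch in any order. Equal checksums still do not prove that two sets are equal, so the hash table's compare function should fall back to a real comparison when the values match.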
Option 2: we do need all of A and C to calculate CRC({A,C}), but we have a data structure that makes it less than linear in time to do so if we already calculated CRC({A,B,C}).
This is useful if you want specifically CRC32, and don't mind remembering more information in addition to the CRC after the calculation (the CRC is still 32 bit, but you remember a data structure that's O(len(A,B,C)) that you will later use to calculate CRC{A,C} more efficiently)
How will that work? Many CRCs are just the application of a polynomial on the input.
Basically, if you divide the input into n chunks of 32 bit each - X_1...X_n - there is a matrix M such that
CRC(X_1...X_n) = M^n * X_1 + ... + M^1 * X_n
(where ^ here is power)
How does that help? This sum can be calculated in a tree-like fashion:
CRC(X_1...X_n) = M^(n/2) * CRC(X_1...X_n/2) + CRC(X_(n/2+1)...X_n)
So you begin with all the X_i on the leaves of the tree, start by calculating the CRC of each consecutive pair, then combine them in pairs until you get the combined CRC of all your input.
If you remember all the partial CRCs on the nodes, you can then easily remove (or add) an item anywhere in the list by doing just O(log(n)) calculations!
So there - as far as I can tell, those are your two options. I hope this wasn't too much of a mess :)
I'd personally go with option 1, as it's just simpler... but the resulting CRC isn't standard, and is less... good. Less "CRC"-like.
Cheers!

Related

Algo: find max Xor in array for various interval limits, given N inputs, and p,q where 0<=p<=i<=q<=N

the problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, … xn as key. To implement this algorithm efficiently, Xorq needs to find maximum value for (a xor xj) for given integers a,p and q such that p<=j<=q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describe a query which consists of three integers ai,pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
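#include <iostream>
using namespace std;

int xArray[100000];

int main()
{
    int t, n, q;
    cin >> t;
    for (int j = 0; j < t; j++)
    {
        cin >> n >> q;
        int i, a, pi, qi;
        for (i = 0; i < n; i++)
        {
            cin >> xArray[i];
        }
        for (i = 0; i < q; i++)
        {
            cin >> a >> pi >> qi;
            // Brute force: scan every element in the 1-based range [pi, qi].
            int max = 0;
            for (int it = pi - 1; it < qi; it++)
            {
                int candidate = xArray[it] ^ a;
                if (candidate > max)
                    max = candidate;
            }
            cout << max << "\n";
        }
    }
    return 0;
}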
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted).
The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?
XOR flips bits. For 8-bit values (used below for illustration; the values in the problem have 15 bits), the max result of XOR is 0b11111111.
To get the best result
if 'a' on ith place has 1 then you have to XOR it with key that has ith bit = 0
if 'a' on ith place has 0 then you have to XOR it with key that has ith bit = 1
saying simply, for bit B you need !B
Another obvious thing is that higher order bits are more important than lower order bits.
That is:
if 'a' on highest place has B and you have found a key with highest bit = !B
then ALL keys that have highest bit = B are worse than this one
This cuts your amount of numbers by half "on average".
How about building a huge binary tree from all the keys and ordering them in the tree by their bits, from MSB to LSB. Then, cutting the A bit-by-bit from MSB to LSB would tell you which left-right branch to take next to get the best result. Of course, that ignores PI/QI limits, but surely would give you the best result since you always pick the best available bit on i-th level.
Now if you annotate the tree nodes with low/high index ranges of their subelements (done only once, when building the tree), then later when querying against a case A-PI-QI you could use that to filter out branches that do not fall in the index range.
The point is that if you order the tree levels like the MSB->LSB bit order, then the decision performed at the "upper nodes" could guarantee you that currently you are in the best possible branch, and it would hold even if all the subbranches were the worst:
Being at level 3, the result of
0b111?????
can be then expanded into
0b11100000
0b11100001
0b11100010
and so on, but even if the ????? are expanded poorly, the overall result is still greater than
0b11011111
which would be the best possible result if you had picked the other branch at level 3.
I have absolutely no idea how long preparing the tree would take, but querying it for an A-PI-QI with 32-bit keys seems to be something like 32 rounds of comparisons and jumps, certainly faster than iterating up to 100000 times and xor/maxing. And since you have up to 50000 queries, building such a tree can actually be a good investment, since the tree would be built once per keyset.
Now, the best part is that you actually don't need the whole tree. You may build one from, e.g., the first two or four or eight bits only, and use the index ranges from the nodes to limit your xor-max loop to a smaller part. At worst, you'd end up with the same range as PiQi. At best, it'd be down to one element.
But, looking at the max N keys, I think the whole tree might actually fit in the memory pool and you may get away without any xor-maxing loop.
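To make the tree idea above more concrete, here is a minimal C++ sketch of one way it could be done. It differs from the description in one respect, which is an assumption of this sketch: instead of a single low/high index range per node, every node stores the sorted list of positions of the keys below it, so a binary search can tell exactly whether a branch contains a key inside [lo, hi]. The class name RangeXorTrie and its members are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Binary trie over fixed-width keys, ordered from MSB to LSB. Each node keeps
// the sorted positions (array indices) of all keys that pass through it.
class RangeXorTrie {
public:
    explicit RangeXorTrie(const std::vector<int>& keys, int bits = 15)
        : bits_(bits), size_(static_cast<int>(keys.size())), nodes_(1) {
        for (int pos = 0; pos < size_; ++pos) {
            assert(keys[pos] >= 0 && keys[pos] < (1 << bits_));
            int node = 0;
            for (int b = bits_ - 1; b >= 0; --b) {
                int bit = (keys[pos] >> b) & 1;
                if (nodes_[node].child[bit] < 0) {
                    nodes_[node].child[bit] = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                }
                node = nodes_[node].child[bit];
                // Positions arrive in increasing order, so each list stays sorted.
                nodes_[node].positions.push_back(pos);
            }
        }
    }

    // Maximum of (a ^ keys[j]) for lo <= j <= hi (0-based, inclusive).
    int max_xor(int a, int lo, int hi) const {
        assert(0 <= lo && lo <= hi && hi < size_);
        int node = 0, result = 0;
        for (int b = bits_ - 1; b >= 0; --b) {
            int want = ((a >> b) & 1) ^ 1;  // the opposite bit maximises the XOR
            int next = nodes_[node].child[want];
            if (next >= 0 && has_position_in(next, lo, hi)) {
                result |= 1 << b;
            } else {
                // The other branch must contain a key in range, because the
                // current node does and the wanted branch does not.
                next = nodes_[node].child[want ^ 1];
            }
            node = next;
        }
        return result;
    }

private:
    struct Node {
        int child[2] = {-1, -1};
        std::vector<int> positions;
    };

    bool has_position_in(int node, int lo, int hi) const {
        const std::vector<int>& p = nodes_[node].positions;
        auto it = std::lower_bound(p.begin(), p.end(), lo);
        return it != p.end() && *it <= hi;
    }

    int bits_;
    int size_;
    std::vector<Node> nodes_;
};
```

For the problem as stated, the trie would be built once per test case and each query (a, p, q) answered with max_xor(a, p - 1, q - 1), since the problem uses 1-based indices. Each query then costs roughly 15 binary searches instead of a scan over the whole range.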
I've spent some time googling this problem and it seems that you can find it in the context of various programming competitions. While the brute force approach is intuitive, it does not really solve the challenge as it is too slow.
There are a few constraints in the problem which you need to exploit in order to write a faster algorithm:
the input consists of max 100k numbers, but there are only 32768 (2^15) possible numbers
for each input array there are Q, max 50k, test cases; each test case consists of 3 values, a,pi,and qi. Since 0<=a<2^15 and there are 50k cases, there is a chance the same value will come up again.
I've found 2 ideas for solving the problem: splitting the input in sqrt(N) intervals and building a segment tree ( a nice explanation for these approaches can be found here )
The biggest problem is the fact that for each test case you can have different values for a, and that would make previous results useless, since you need to compute max(a^x[i]), for a small number of test cases. However when Q is large enough and the value a repeats, using previous results can be possible.
I will come back with the actual results once I finish implementing both methods

Is there a way to uniquely build an integer from below data?

I have structured data like below:
struct Leg
{
char type;
char side;
int qty;
int id;
} Legs[5];
where
type is O or E,
side is B or S;
qty is 1 to 9999 and the qty values across all Legs are relatively prime to each other, i.e. 1 2 3, not 2 4 6
id is an integer from 1 to 9999999 and all ids are unique in the group of Legs
To build unique signature of above data, currently I am building a string like below:
first sort Legs based on id;
then
signature=""
for i=1 to 5
signature+=id+type+qty+side of leg-i
and I insert it into an unordered_map, so that when matching structured data arrives, I can build its signature as above and look it up.
An unordered_map keyed on a string means a key compare, which is a string compare, and also a hash function which needs to traverse the string, which is usually around 25 chars.
For efficiency, if it is possible to build a unique integer out of the above data for each such structure, the lookups/insertions in the unordered_map will be much faster.
Just wondering if there is any mathematical properties I can take advantage of.
Edit:
The map will contain key,value pairs like
<unique-signature=key, value=int-value needs to be located on looking up another repeating Leg group by constructing signature like above after sorting Legs based on id>
<123O2B234E3S456O3S567O2S789E2B, 989>
The goal is to build a unique signature from each such unique repeating group of legs. Legs can be in a different order and yet still match another group of legs which is in a different order; that's why I sort based on id, which is unique, and then build the signature.
My signature is string based, if there was a way to construct a unique number signature, then my lookups/insertions will be faster.
You can just create a unique 40-bit number from the fields you have. Why 40 bits? I'm glad you asked.
You have 9,999,999 possible id values, which means you can use 24 bits to represent all possibilities (log2(9999999) = a little over 23).
You have 9,999 possible qty values, which requires another 14 bits.
type and side require 1 bit each, which gives you a total of 40 bits of information. Store this number as a long long and you have a nice, fast key for your map.
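The packing described above could look roughly like the following C++ sketch. The bit layout (id in the top 24 bits, then qty, then type, then side) and the function name pack_leg are choices made for this example; any fixed layout would work equally well.

```cpp
#include <cassert>
#include <cstdint>

struct Leg {
    char type;  // 'O' or 'E'
    char side;  // 'B' or 'S'
    int qty;    // 1 to 9999
    int id;     // 1 to 9999999
};

// Packs one Leg into 40 bits:
// id (24 bits) | qty (14 bits) | type (1 bit) | side (1 bit).
inline std::uint64_t pack_leg(const Leg& leg) {
    assert(leg.type == 'O' || leg.type == 'E');
    assert(leg.side == 'B' || leg.side == 'S');
    assert(leg.qty >= 1 && leg.qty <= 9999);
    assert(leg.id >= 1 && leg.id <= 9999999);
    std::uint64_t key = static_cast<std::uint64_t>(leg.id);
    key = (key << 14) | static_cast<std::uint64_t>(leg.qty);
    key = (key << 1) | (leg.type == 'E' ? 1u : 0u);
    key = (key << 1) | (leg.side == 'S' ? 1u : 0u);
    return key;
}
```

Because every field gets its own non-overlapping bit range, two legs produce the same key only if all four fields match, so the key is unique for a single Leg.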
If you really want a unique int key then you're probably out of luck because it's going to be pretty tricky to get rid of 8 bits of information. You might be able to take advantage of the co-primality of the qty field to represent it in fewer than 14 bits, however I doubt that you can get it down to 6 bits because that only gives you 64 possible values for qty.
That's a way to get what you asked for, but @David Schwartz's answer is probably what you actually need: hash collisions are generally not expensive unless you have a really bad hash function - see Application vulnerability due to Non Random Hash Functions for an example of how that can bite you - or a carefully crafted data set that happens to hit the worst-case.
In your case you should be fine with David's answer. It'll be fast enough unless you are extremely unfortunate with your set of data.
EDIT: Just noticed that you are computing your signature over the set of 5 Legs. The same math applies, you just will need 200 bits rather than 40. So it won't fit in a long long unless you have some information that can be shared amongst all 5 Leg objects; if each set of 5 shares the same id, for example.
Stick with David's answer.
It doesn't have to be unique. I would suggest something like:
std::size_t hash_value(const Leg& l)
{
std::size_t ret = l.type;
ret <<= 8;
ret |= l.side;
ret *= 2654435761;
ret += l.qty;
ret *= 2654435761;
ret += l.id;
return ret * 2654435761;
}
In order to create an order-independent hash function for groups of five legs, first choose a hash function for individual legs -- David's answer looks great. Compute the hashes for each of the five legs. Now choose an order-independent function to combine these five hash values. You could, for example, xor the hashes together, or add them all together, or multiply them all together.
The fact that multiplication distributes over addition, and multiplication was the last operation to happen, makes me a little bit wary of using that. I think xor might be the best option of the ones I give here; but before using this in production, you should definitely run a few tests to see if you can easily generate collisions with any of them.
Probably superfluous, but here is a simple implementation that calls hash_value from David's answer:
std::size_t hash_value(const Leg_Array& legs) {
std::size_t ret = 0;
for (int i = 0; i < 5; ++i) {
ret ^= hash_value(legs[i]);
}
return ret;
}

Using CRCs as a digest to detect duplicates among files

The primary use of CRCs and similar computations (such as Fletcher and Adler) seems to be for the detection of transmission errors. As such, most studies I have seen seem to address the issue of the probability of detecting small-scale differences between two data sets. My needs are slightly different.
What follows is a very approximate description of the problem. Details are much more complicated than this, but the description below illustrates the functionality I am looking for. This little disclaimer is intended to ward off answers such as "Why are you solving your problem this way when you can more easily solve it this other way I propose?" - I need to solve my problem this way for a myriad of reasons that are not germane to this question or post, so please don't post such answers.
I am dealing with collections of data sets (size ~1MB) on a distributed network. Computations are performed on these data sets, and speed/performance is critical. I want a mechanism to allow me to avoid re-transmitting data sets. That is, I need some way to generate a unique identifier (UID) for each data set of a given size. (Then, I transmit data set size and UID from one machine to another, and the receiving machine only needs to request transmission of the data if it does not already have it locally, based on the UID.)
This is similar to the difference between using CRC to check changes to a file, and using a CRC as a digest to detect duplicates among files. I have not seen any discussions of the latter use.
I am not concerned with issues of tampering, i.e. I do not need cryptographic strength hashing.
I am currently using a simple 32-bit CRC of the serialized data, and that has so far served me well. However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
The other question I have is a bit more subtle. In my current implementation, I ignore the structure of my data set, and effectively just CRC the serialized string representing my data. However, for various reasons, I want to change my CRC methodology as follows. Suppose my top-level data set is a collection of some raw data and a few subordinate data sets. My current scheme essentially concatenates the raw data and all the subordinate data sets and then CRC's the result. However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction. The question is, how does using this methodology affect the probability of collisions?
To put it in a language that will allow me to discuss my thoughts, I'll define a bit of notation. Call my top-level data set T, and suppose it consists of raw data set R and subordinate data sets Si, i=1..n. I can write this as T = (R, S1, S2, ..., Sn). If & represents concatenation of data sets, my original scheme can be thought of as:
UID_1(T) = CRC(R & S1 & S2 & ... & Sn)
and my new scheme can be thought of as
UID_2(T) = CRC(R & CRC(S1) & CRC(S2) & ... & CRC(Sn))
Then my questions are: (1) if T and T' are very different, what CRC algorithm minimizes prob( UID_1(T)=UID_1(T') ), and what CRC algorithm minimizes prob( UID_2(T)=UID_2(T') ), and how do these two probabilities compare?
My (naive and uninformed) thoughts on the matter are this. Suppose the differences between T and T' are in only one subordinate data set, WLOG say S1!=S1'. If it happens that CRC(S1)=CRC(S1'), then clearly we will have UID_2(T)=UID_2(T'). On the other hand, if CRC(S1)!=CRC(S1'), then the difference between R & CRC(S1) & CRC(S2) & ... & CRC(Sn) and R & CRC(S1') & CRC(S2) & ... & CRC(Sn) is a small difference on 4 bytes only, so the ability of UID_2 to detect differences is effectively the same as a CRC's ability to detect transmission errors, i.e. its ability to detect errors in only a few bits that are not widely separated. Since this is what CRC's are designed to do, I would think that UID_2 is pretty safe, so long as the CRC I am using is good at detecting transmission errors. To put it in terms of our notation,
prob( UID_2(T)=UID_2(T') ) = prob(CRC(S1)=CRC(S1')) + (1-prob(CRC(S1)=CRC(S1'))) * prob(CRC not detecting an error of a few bits).
Let us call the probability of the CRC not detecting an error of a few bits P, and the probability of it not detecting large differences on a large data set Q. The above can be written approximately as
prob( UID_2(T)=UID_2(T') ) ~ Q + (1-Q)*P
Now I will change my UID a bit more as follows. For a "fundamental" piece of data, i.e. a data set T=(R) where R is just a double, integer, char, bool, etc., define UID_3(T)=(R). Then for a data set T consisting of a vector of subordinate data sets T = (S1, S2, ..., Sn), define
UID_3(T) = CRC(UID_3(S1) & UID_3(S2) & ... & UID_3(Sn))
Suppose a particular data set T has subordinate data sets nested m-levels deep, then, in some vague sense, I would think that
prob( UID_3(T)=UID_3(T') ) ~ 1 - (1-Q)(1-P)^m
Given these probabilities are small in any case, this can be approximated as
1 - (1-Q)(1-P)^m = Q + (1-Q)*P*m - (1-Q)*P*P*m*(m-1)/2 + ... ~ Q + m*P
So if I know my maximum nesting level m, and I know P and Q for various CRCs, what I want is to pick the CRC that gives me the minimum value for Q + m*P. If, as I suspect might be the case, P~Q, the above simplifies to this. My probability of error for UID_1 is P. My probability of error for UID_3 is (m+1)P, where m is my maximum nesting (recursion) level.
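The quality of the approximation Q + m*P can be checked numerically with a small C++ sketch like the one below. The function names are made up for this example, and the values used in the note after the code (P = Q = 2^-32, the idealized figure for a 32-bit checksum on unrelated data) are an illustrative assumption, not measured collision rates.

```cpp
#include <cassert>
#include <cmath>

// Exact expression 1 - (1-Q)(1-P)^m. log1p/expm1 are used because P and Q
// are tiny, and a naive evaluation would lose most of its precision.
inline double collision_exact(double P, double Q, int m) {
    assert(P >= 0.0 && P < 1.0 && Q >= 0.0 && Q < 1.0 && m >= 0);
    return -std::expm1(std::log1p(-Q) + m * std::log1p(-P));
}

// First-order approximation from the text above.
inline double collision_approx(double P, double Q, int m) {
    return Q + m * P;
}
```

With P = Q = 2^-32 and m = 5, both functions give about 1.4e-9, and the terms dropped by the approximation are on the order of P*P, so for this purpose Q + m*P is accurate to many significant digits.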
Does all this seem reasonable?
I want a mechanism to allow me to avoid re-transmitting data sets.
rsync has already solved this problem, using generally the approach you outline.
However, I would like to know if anyone can recommend which 32-bit CRC
algorithm (i.e. which polynomial?) is best for minimizing the
probability of collisions in this situation?
You won't see much difference among well-selected CRC polynomials. Speed may be more important to you, in which case you may want to use a hardware CRC, e.g. the crc32 instruction on modern Intel processors. That one uses the CRC-32C (Castagnoli) polynomial. You can make that really fast by using all three arithmetic units on a single core in parallel by computing the CRC on three buffers in the same loop, and then combining them. See below how to combine CRCs.
However, most of the time I already have the CRC's of the subordinate
data sets, and I would rather construct my UID of the top-level data
set by concatenating the raw data with the CRC's of the subordinate
data sets, and then CRC this construction.
Or you could quickly compute the CRC of the entire set as if you had done a CRC on the whole thing, but using the already calculated CRCs of the pieces. Look at crc32_combine() in zlib. That would be better than taking the CRC of a bunch of CRCs. By combining, you retain all the mathematical goodness of the CRC algorithm.
Mark Adler's answer was bang on. If I'd taken my programmer's hat off and put on my mathematician's hat, some of it should have been obvious. He didn't have the time to explain the mathematics, so I will here for those who are interested.
The process of calculating a CRC is essentially the process of doing a polynomial division. The polynomials have coefficients mod 2, i.e. the coefficient of each term is either 0 or 1, hence a polynomial of degree less than N can be represented by an N-bit number, each bit being the coefficient of a term (and the process of doing a polynomial division amounts to doing a whole bunch of XOR and shift operations). When CRC'ing a data block, we view the "data" as one big polynomial, i.e. a long string of bits, each bit representing the coefficient of a term in the polynomial. We'll call our data-block polynomial A. For each CRC "version", a polynomial has been chosen for the CRC, which we'll call P. For 32-bit CRCs, P is a polynomial with degree 32, so it has 33 terms and 33 coefficients. Because the top coefficient is always 1, it is implicit and we can represent the 32nd-degree polynomial with a 32-bit integer. (Computationally, this is quite convenient actually.) The process of calculating the CRC for a data block A is the process of finding the remainder when A is divided by P. That is, A can always be written
A = Q * P + R
where R is a polynomial of degree less than degree of P, i.e. R has degree 31 or less, so it can be represented by a 32-bit integer. R is essentially the CRC. (Small note: typically one prepends 0xFFFFFFFF to A, but that is unimportant here.) Now, if we concatenate two data blocks A and B, the "polynomial" corresponding to the concatenation of the two blocks is the polynomial for A, "shifted to the left" by the number of bits in B, plus B. Put another way, the polynomial for A&B is A*S+B, where S is the polynomial corresponding to a 1 followed by N zeros, where N is the number of bits in B. (i.e. S = x**N ). Then, what can we say about the CRC for A&B? Suppose we know A=Q*P+R and B=Q'*P+R', i.e. R is the CRC for A and R' is the CRC for B. Suppose we also know S=q*P+r. Then
A * S + B = (Q*P+R)*(q*P+r) + (Q'*P+R')
= Q*(q*P+r)*P + R*q*P + R*r + Q'*P + R'
= (Q*S + R*q + Q') * P + R*r + R'
So to find the remainder when A*S+B is divided by P, we need only find the remainder when R*r+R' is divided by P. Thus, to calculate the CRC of the concatenation of two data streams A and B, we need only know the separate CRCs of the data streams, i.e. R and R', and the length N of the trailing data stream B (so we can compute r). This is also the content of one of Mark's other comments: if the lengths of the trailing data streams B are constrained to a few values, we can pre-compute r for each of these lengths, making combination of two CRCs quite trivial. (For an arbitrary length N, computing r is not trivial, but it takes on the order of log_2 N steps, which is much faster than re-doing the division over the entire B.)
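The derivation above can be tried out with a short C++ sketch. It implements the simplified model only (plain remainder of the message polynomial, MSB first, with no 0xFFFFFFFF prefix and no final shift), so its results are not standard CRC-32 values and it is not how zlib's crc32_combine() is written. The polynomial 0x04C11DB7 and all function names are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// P(x) including its implicit top bit x^32 (assumed polynomial: 0x04C11DB7).
constexpr std::uint64_t kPoly = 0x104C11DB7ULL;

// R = A mod P, feeding the message bits MSB first.
inline std::uint32_t poly_rem(const std::string& data) {
    std::uint64_t reg = 0;
    for (unsigned char byte : data) {
        for (int bit = 7; bit >= 0; --bit) {
            reg = (reg << 1) | ((byte >> bit) & 1u);
            if (reg & (1ULL << 32)) reg ^= kPoly;
        }
    }
    return static_cast<std::uint32_t>(reg);
}

// (a * b) mod P, with polynomials over GF(2) stored as bit masks.
inline std::uint32_t mul_mod(std::uint32_t a, std::uint32_t b) {
    std::uint64_t result = 0, shifted = a;  // shifted = a * x^i mod P
    while (b != 0) {
        if (b & 1u) result ^= shifted;
        shifted <<= 1;
        if (shifted & (1ULL << 32)) shifted ^= kPoly;
        b >>= 1;
    }
    return static_cast<std::uint32_t>(result);
}

// r = x^n mod P by square-and-multiply: about log_2(n) squarings.
inline std::uint32_t x_pow_mod(std::uint64_t n) {
    std::uint32_t result = 1, base = 2;  // the polynomials "1" and "x"
    while (n != 0) {
        if (n & 1u) result = mul_mod(result, base);
        base = mul_mod(base, base);
        n >>= 1;
    }
    return result;
}

// Remainder of A&B from R = rem(A), R' = rem(B) and the bit length of B.
inline std::uint32_t combine(std::uint32_t ra, std::uint32_t rb,
                             std::uint64_t bits_in_b) {
    return mul_mod(ra, x_pow_mod(bits_in_b)) ^ rb;  // (R*r + R') mod P
}
```

For example, combine(poly_rem("hello "), poly_rem("world"), 8 * 5) gives the same value as poly_rem("hello world"), without touching the bytes of either block again.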
Note: the above is not a precise exposition of CRC. There is some shifting that goes on. To be precise, if L is the polynomial represented by 0xFFFFFFFF, i.e. L=x**31+x**30+...+x+1, and S_n is the "shift left by n bits" polynomial, i.e. S_n = x**n, then the CRC of a data block with polynomial A of N bits, is the remainder when ( L * S_N + A ) * S_32 is divided by P, i.e. when (L&A)*S_32 is divided by P, where & is the "concatenation" operator.
Also, I think I disagree with one of Mark's comments, but he can correct me if I'm wrong. If we already know R and R', comparing the time to compute the CRC of A&B using the above methodology, as compared with computing it the straightforward way, does not depend on the ratio of len(A) to len(B) - to compute it the "straightforward" way, one really does not have to re-compute the CRC on the entire concatenated data set. Using our notation above, one only needs to compute the CRC of R*S+B. That is, instead of pre-pending 0xFFFFFFFF to B and computing its CRC, we prepend R to B and compute its CRC. So it's a comparison of the time to compute B's CRC over again with the time to compute r (followed by dividing R*r+R' by P, which is trivial and likely inconsequential in time).
Mark Adler's answer addresses the technical question so that's not what I'll do here. Here I'm going to point out a major potential flaw in the synchronization algorithm proposed in the OP's question and suggest a small improvement.
Checksums and hashes provide a single signature value for some data. However, being of finite length, the number of possible unique values of a checksum/hash is always smaller than the possible combinations of the raw data if the data is longer. For instance, a 4 byte CRC can only ever take on 4 294 967 296 unique values, whilst even a 5 byte value, which might be the data, can take on 256 times as many values. This means for any data longer than the checksum itself, there always exists one or more byte combinations with exactly the same signature.
When used to check integrity, the assumption is that the likelihood of a slightly different stream of data resulting in the same signature is small so that we can assume the data is the same if the signature is the same. It is important to note that we start with some data d and verify that given a checksum, c, calculated using a checksum function, f that f(d) == c.
In the OP's algorithm, however, the different use introduces a subtle, detrimental degradation of confidence. In the OP's algorithm, server A would start with the raw data [d1A,d2A,d3A,d4A] and generate a set of checksums [c1,c2,c3,c4] (where dnA is the n-th data item on server A). Server B would then receive this list of checksums and check its own list of checksums to determine if any are missing. Say Server B has the list [c1,c2,c3,c5]. What should then happen is that it requests d4 from Server A and the synchronization has worked properly in the ideal case.
If we recall the possibility of collisions, and that it doesn't always take that much data to produce one (e.g. CRC("plumless") == CRC("buckeroo")), then we'll quickly realize that the best guarantee our scheme provides is that server B definitely doesn't have d4A, but it cannot guarantee that it has [d1A,d2A,d3A]. This is because it is possible that f(d1A) = c1 and f(d1B) = c1 even though d1A and d1B are distinct and we would like both servers to have both. In this scheme, neither server can ever know about the existence of both d1A and d1B. We can use more and more collision resistant checksums and hashes but this scheme can never guarantee complete synchronization. This becomes more important, the greater the number of files the network must keep track of. I would recommend using a cryptographic hash such as SHA-256. Note that SHA-1 is no longer a safe choice on this criterion, since practical collisions for it have been demonstrated.
A possible mitigation of this risk is to introduce redundant hashes. One way of doing this is to use a completely different algorithm, since whilst it is possible that crc32(d1) == crc32(d2), it is less likely that adler32(d1) == adler32(d2) simultaneously. This paper suggests you don't gain all that much this way though. To use the OP notation, it is also less likely that crc32('a' & d1) == crc32('a' & d2) and crc32('b' & d1) == crc32('b' & d2) are simultaneously true, so you can "salt" to less collision prone combinations. However, I think you may just as well use a collision resistant hash function like SHA512, which in practice likely won't have that great an impact on your performance.

Fast code for searching bit-array for contiguous set/clear bits?

Is there some reasonably fast code out there which can help me quickly search a large bitmap (a few megabytes) for runs of contiguous zero or one bits?
By "reasonably fast" I mean something that can take advantage of the machine word size and compare entire words at once, instead of doing bit-by-bit analysis which is horrifically slow (such as one does with vector<bool>).
It's very useful for e.g. searching the bitmap of a volume for free space (for defragmentation, etc.).
Windows has an RTL_BITMAP data structure one can use along with its APIs.
But I needed the code for this sometime ago, and so I wrote it here (warning, it's a little ugly):
https://gist.github.com/3206128
I have only partially tested it, so it might still have bugs (especially on reverse). But a recent version (only slightly different from this one) seemed to be usable for me, so it's worth a try.
The fundamental operation for the entire thing is being able to -- quickly -- find the length of a run of bits:
long long GetRunLength(
const void *const pBitmap, unsigned long long nBitmapBits,
long long startInclusive, long long endExclusive,
const bool reverse, /*out*/ bool *pBit);
Everything else should be easy to build upon this, given its versatility.
I tried to include some SSE code, but it didn't noticeably improve the performance. However, in general, the code is many times faster than doing bit-by-bit analysis, so I think it might be useful.
It should be easy to test if you can get a hold of vector<bool>'s buffer somehow -- and if you're on Visual C++, then there's a function I included which does that for you. If you find bugs, feel free to let me know.
I can't figure out how to do this well directly on memory words, so I've made up a quick solution which works on bytes; for convenience, let's sketch the algorithm for counting contiguous ones:
Construct two tables of size 256 in which you record, for each number between 0 and 255, the number of consecutive 1's at the beginning and at the end of the byte. For example, for the number 167 (10100111 in binary), put 1 in the first table and 3 in the second table. Let's call the first table BBeg and the second table BEnd. Then, for each byte b, there are two cases: if it is 255, add 8 to the length of your current contiguous run of ones, and you remain in a region of ones. Otherwise, you end the current region with BBeg[b] bits and begin a new one with BEnd[b] bits.
Depending on what information you want, you can adapt this algorithm (this is a reason why I don't put here any code, I don't know what output you want).
A flaw is that it does not count (small) contiguous set of ones inside one byte ...
Beside this algorithm, a friend tells me that if it is for disk compression, just look for bytes different from 0 (empty disk area) and 255 (full disk area). It is a quick heuristic to build a map of what blocks you have to compress. Maybe it is beyond the scope of this topic ...
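The table-based scan described above could be sketched as follows. This is only one possible reading of it: the names OnesTables, buildOnesTables and longestOnesRun are made up for the example, the output chosen here is the longest run of ones, and bits are assumed to be ordered most significant first within each byte (which matches the 167 example).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct OnesTables {
    std::array<std::uint8_t, 256> beg{};  // BBeg: ones at the start (MSB side) of the byte
    std::array<std::uint8_t, 256> end{};  // BEnd: ones at the end (LSB side) of the byte
};

OnesTables buildOnesTables() {
    OnesTables t;
    for (int b = 0; b < 256; ++b) {
        int lead = 0;
        while (lead < 8 && (b & (0x80 >> lead))) ++lead;
        int trail = 0;
        while (trail < 8 && (b & (1 << trail))) ++trail;
        t.beg[b] = static_cast<std::uint8_t>(lead);
        t.end[b] = static_cast<std::uint8_t>(trail);
    }
    return t;
}

// Longest run of ones that touches a byte boundary or covers whole bytes.
// Runs lying strictly inside a single byte are missed, as noted above.
std::size_t longestOnesRun(const std::vector<std::uint8_t>& bytes) {
    static const OnesTables t = buildOnesTables();
    std::size_t best = 0, current = 0;
    for (std::uint8_t b : bytes) {
        if (b == 0xFF) {
            current += 8;            // still inside a region of ones
        } else {
            current += t.beg[b];     // the current region ends inside this byte
            if (current > best) best = current;
            current = t.end[b];      // a new region may start at the end of this byte
        }
    }
    return current > best ? current : best;
}
```

For the bytes 0x0F, 0xFF, 0xF0 this reports a run of 16, while a lone 0x7E reports 0 because its six ones never touch a byte boundary.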
Sounds like this might be useful:
http://www.aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
and
http://www.aggregate.org/MAGIC/#Leading%20Zero%20Count
You don't say whether you want to do some sort of RLE or simply count the zero and one bits within each byte (for example, 0b1001 should return 1x1 2x0 1x1).
A lookup table plus a SWAR algorithm for a fast check might give you that information easily.
A bit like this:
byte lut[0x10000] = { /* see below */ };
for (uint * word = words; word < words + bitmapSize; word++) {
    if (*word == 0 || *word == (uint)-1) // Fast bailout
    {
        // Do what you want if all 0 or all 1
        continue;
    }
    byte hiVal = lut[*word >> 16], loVal = lut[*word & 0xFFFF];
    // Do what you want with hiVal and loVal
}
The LUT will have to be constructed depending on your intended algorithm. If you want to count the number of contiguous 0s and 1s in the word, you'd build it like this:
for (int i = 0; i < 0x10000; i++)
    lut[i] = countContiguousZero(i); // Or countContiguousOne(i)
// The implementation of countContiguousZero can be slow; it only runs once, while building the table.
// It should return the largest number of contiguous zeros (0 to 15) in the 4 low bits of the byte,
// and might return the position of the run in the 4 high bits of the byte.
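// Caveat: a 16-bit half can still be all zeros when the full word is not (e.g. 0x00010000),
// so the 16 contiguous zero case still needs separate handling if the length is stored in 4 bits.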
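As an illustration of the table-building step, here is a small sketch. The function names longestZeroRun16 and buildZeroRunLut are hypothetical stand-ins for countContiguousZero and the loop above, and the table stores only the run length (0 to 16, one byte per entry), not the position of the run.

```cpp
#include <cstdint>
#include <vector>

// Longest run of zero bits in a 16-bit value (0 to 16).
// Deliberately simple: it only runs while the table is being built.
int longestZeroRun16(std::uint16_t v) {
    int best = 0, current = 0;
    for (int bit = 0; bit < 16; ++bit) {
        if ((v >> bit) & 1) {
            current = 0;
        } else {
            ++current;
            if (current > best) best = current;
        }
    }
    return best;
}

// One entry per possible 16-bit value.
std::vector<std::uint8_t> buildZeroRunLut() {
    std::vector<std::uint8_t> lut(0x10000);
    for (std::uint32_t i = 0; i < 0x10000; ++i)
        lut[i] = static_cast<std::uint8_t>(
            longestZeroRun16(static_cast<std::uint16_t>(i)));
    return lut;
}
```

Note that looking up the two halves of a 32-bit word separately does not account for a run that crosses the 16-bit boundary; that would need extra per-entry information, such as the leading and trailing run lengths.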

10 character id that's globally and locally unique

I need to generate a 10 character unique id (SIP/VOIP folks need to know that it's for a param icid-value in the P-Charging-Vector header). Each character shall be one of the 26 ASCII letters (case sensitive), one of the 10 ASCII digits, or the hyphen-minus.
It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!
Here's my take on it. I FIRST encode the local IP address, which MUST be globally unique, into base-63 (it's an unsigned long int that will occupy 1-6 characters after encoding), and then as much as I can of the current timestamp (it's a time_t/long long int that will occupy the remaining 4-9 characters, depending on how much space the encoded IP address takes).
I've also added the loop counter 'i' to the timestamp to preserve uniqueness in case the function is called more than once in a second.
Is this good enough to be globally and locally unique, or is there a better approach?
Gaurav
#include <stdio.h>
#include <string.h>
#include <time.h>
//base-63 character set
static char set[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";
// b63() writes the base-63 digits of longlong into x (least significant digit first),
// starting at index, and returns the next vacant location in char array x
int b63(long long longlong,char *x,int index){
    if(index > 9)
        return index+1;
    //printf("index=%d,longlong=%lld,longlong%63=%lld\n",index,longlong,longlong%63);
    if(longlong < 63){
        x[index] = set[longlong];
        return index+1;
    }
    x[index] = set[longlong%63];
    return b63(longlong/63,x,index+1);
}
int main(){
    char x[11],y[11] = {0}; /* '\0' is taken care of here */
    //let's generate 10 million ids
    for(int i=0; i<10000000; i++){
        // pad with the zero digit so that a short encoding leaves no unset characters in x
        memset(x,set[0],10);
        /* add i to timestamp to take care of sub-second function calls,
           3770168404(is a sample ip address in n/w byte order) = 84.52.184.224 */
        b63((long long)time(NULL)+i,x,b63((long long)3770168404,x,0));
        // reverse the char array to get proper base-63 output
        for(int j=0,k=9; j<10; j++,k--)
            y[j] = x[k];
        printf("%s\n",y);
    }
    return 0;
}
It MUST be 'globally unique (outside of the machine generating the id)' and sufficiently 'locally unique (within the machine generating the id)', and all that needs to be packed into 10 characters, phew!
Are you in control of all the software generating ids? Are you doling out the ids? If not...
I know nothing about SIP, but you must be misunderstanding something about the spec (or the spec must be wrong). If another developer builds an id using a different algorithm than the one you've cooked up, you will have collisions with their ids, meaning the ids will no longer be globally unique in that system.
I'd go back to the SIP documentation and see if there's an appendix with an algorithm for generating these ids. Or maybe a smarter SO user than I can answer what the SIP algorithm for generating these ids is.
I would have a serious look at RFC 4122, which describes the generation of 128-bit GUIDs. There are several different generation algorithms, some of which may fit (the MAC address-based one springs to mind). This is a bigger number space than yours: 2^128 ≈ 3.4 * 10^38 compared with 63^10 ≈ 9.8 * 10^17, so you may have to make some compromises on uniqueness. Consider factors like how frequently the IDs will be generated.
However in the RFC, they have considered some practical issues, like the ability to generate large numbers of unique values efficiently by pre-allocating blocks of IDs.
Can't you just have a distributed ID table?
Machines on NAT'ed LANs will often have an IP from a small range, and not all of the 32-bit values would be valid (think multicast, etc). Machines may also grab the same timestamp, especially if the granularity is large (such as seconds); keep in mind that the year is very often going to be the same, so it's the lower bits that will give you the most 'uniqueness'.
You may want to take the various values, hash them with a cryptographic hash, and translate that to the characters you are permitted to use, truncating to the 10 characters.
But you're dealing with a value with less than 60 bits; you need to think carefully about the implications of a collision. You might be approaching the problem the wrong way...
Well, if I cast aside the fact that I think this is a bad idea, and concentrate on a solution to your problem, here's what I would do:
You have an id range of 63^10, which corresponds to roughly 60 bits. You want it to be both "globally" and "locally" unique. Let's generate the first N bits to be globally unique, and the rest to be locally unique. The concatenation of the two will have the properties you are looking for.
First, the global uniqueness: IP addresses won't work, especially local ones, as they hold very little entropy. I would go with MAC addresses, which were made to be globally unique. They cover a range of 256^6, so they use up 6*8 = 48 bits.
Now, for the local uniqueness: why not use the process ID? I'm making the assumption that the uniqueness is per process; if it's not, you'll have to think of something else. On Linux, the process ID is 32 bits. If we wanted to nitpick, the 2 most significant bytes probably hold very little entropy, as they would be 0 on most machines. So discard them if you know what you're doing.
So now you'll see you have a problem: it would take up to 80 bits (48 for the MAC address plus 32 for the process ID, or 64 if you drop the two high bytes of the PID) to generate a decent (but not bulletproof) globally and locally unique ID (using my technique anyway). And since I would also advise adding a random number (at least 8 bits long) just in case, it definitely won't fit in 60 bits. So if I were you, I would hash all of the generated bits with SHA1 (for example), and convert the first 60 bits of the resulting hash to your ID format. To do so, notice that you have a 63-character range to choose from, so almost the full range of 6 bits. Split the hash into 6-bit pieces, and use the first 10 pieces to select the 10 characters of your ID from the 63-character range. Obviously, 6 bits give 64 possible values (you only want 63), so if a 6-bit piece equals 63, either floor it to 62 or take it modulo 63 and pick 0. It will slightly bias the distribution, but it's not too bad.
So there, that should get you a decent globally and locally pseudo-unique ID.
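To make the 6-bit splitting concrete, here is a minimal sketch. It assumes a 20-byte SHA-1 digest has already been computed elsewhere (no hashing is done here), reads the first 60 bits most significant first, and uses the "modulo 63" option for a piece equal to 63. The name idFromDigest is hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Maps the first 60 bits of a digest to 10 characters from the 63-character set.
// Assumption: the digest comes from a hash (e.g. SHA-1) computed by other code.
std::string idFromDigest(const std::array<std::uint8_t, 20>& digest) {
    static const char kSet[] =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";
    // Pack the first 8 bytes into a 64-bit value, most significant byte first.
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | digest[i];
    std::string id;
    for (int i = 0; i < 10; ++i) {
        // Piece i covers bits 63-6i down to 58-6i; the lowest 4 bits are unused.
        unsigned piece = static_cast<unsigned>((v >> (58 - 6 * i)) & 0x3F);
        id += kSet[piece % 63];  // a piece of 63 wraps around to index 0
    }
    return id;
}
```

Because 63 wraps to 0, the first character of the set is twice as likely as any other, which is the slight bias mentioned above.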
A few last points: according to the Birthday paradox, you'll get a ~1% chance of a collision after generating ~142 million IDs, and a 99% chance after generating 3 billion IDs. So if you hit great commercial success and have millions of IDs being generated, get a larger ID.
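These figures can be checked with the usual birthday approximation n ≈ sqrt(2 * N * ln(1 / (1 - p))), where N is the size of the ID space and p the target collision probability. The sketch below applies it to N = 63^10; the function name is made up for the example, and the formula is an approximation rather than an exact count.

```cpp
#include <cmath>

// Approximate number of random IDs needed to reach collision probability p
// in a space of spaceSize equally likely values (birthday approximation).
double idsForCollisionProbability(double spaceSize, double p) {
    return std::sqrt(2.0 * spaceSize * std::log(1.0 / (1.0 - p)));
}

// Size of the 10-character, 63-symbol ID space.
double idSpaceSize() {
    return std::pow(63.0, 10);
}
```

Calling idsForCollisionProbability(idSpaceSize(), 0.01) gives roughly 1.4 * 10^8 and idsForCollisionProbability(idSpaceSize(), 0.99) gives roughly 3 * 10^9, in line with the numbers quoted above.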
Finally, I think I provided a "better than the worst" solution to your problem, but I can't help thinking you're attacking this problem in the wrong fashion and possibly, as others have mentioned, misreading the specs. So use this only if there is no other way that would be more "bulletproof" (centralised ID provider, much longer ID ...).
Edit: I re-read your question, and you say you may call this function many times a second. I was assuming this was to serve as some kind of application ID, generated once at the start of your application and never changed until it exited. Since that's not the case, you should definitely add a random number, and if you generate a lot of IDs, make it at least a 32-bit number. And read and re-read the Birthday Paradox mentioned above. And seed your number generator with a highly entropic value, like the usec value of the current timestamp, for example. Or even go so far as to get your random values from /dev/urandom.
Very honestly, my take on your endeavour is that 60 bits is probably not enough...
Hmm, using the system clock may be a weakness... what if someone sets the clock back? You might re-generate the same ID again. But if you are going to use the clock, you might call gettimeofday() instead of time(); at least that way you'll get better resolution than one second.
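A minimal sketch of the gettimeofday() suggestion, assuming a POSIX system; the helper name nowMicros is hypothetical. It returns microseconds since the epoch, so calls within the same second yield different values far more often than with time().

```cpp
#include <sys/time.h>
#include <cstdint>

// Wall-clock time in microseconds since the Unix epoch (POSIX only).
std::int64_t nowMicros() {
    timeval tv{};
    gettimeofday(&tv, nullptr);
    return static_cast<std::int64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
}
```

This is still wall-clock time, so it does not help if someone sets the clock back.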
@Doug T.
No, I'm not in control of all the software generating the ids.
I agree that without a standardized algorithm there may be collisions; I've raised this issue on the appropriate mailing lists.
@Florian
Taking a cue from your reply, I decided to use /dev/urandom to get a 32-bit random number as the space-unique component of the id. I assume that every machine will have its own noise signature, so the value can safely be assumed to be globally unique in space at a given instant of time. The time-unique component that I used earlier remains the same.
These unique ids are generated to collate all the billing information collected from different network functions that independently generate charging information for a particular call during call processing.
Here's the updated code below:
Gaurav
#include <stdio.h>
#include <string.h>
#include <time.h>
//base-63 character set
static char set[]="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-";
// b63() writes the base-63 digits of longlong into x (least significant digit first),
// starting at index, and returns the next vacant location in char array x
int b63(long long longlong, char *x, int index){
    if(index > 9)
        return index+1;
    if(longlong < 63){
        x[index] = set[longlong];
        return index+1;
    }
    x[index] = set[longlong%63];
    return b63(longlong/63, x, index+1);
}
int main(){
    unsigned int number;
    char x[11], y[11] = {0};
    FILE *urandom = fopen("/dev/urandom", "r");
    if(!urandom)
        return -1;
    //let's generate 1 billion ids
    for(int i=0; i<1000000000; i++){
        // stop if /dev/urandom does not deliver a full 32-bit value
        if(fread(&number, sizeof(number), 1, urandom) != 1)
            break;
        // pad with the zero digit so that a short encoding leaves no stale characters in x
        memset(x, set[0], 10);
        // add i to timestamp to take care of sub-second function calls
        b63((long long)time(NULL)+i, x, b63((long long)number, x, 0));
        // reverse the char array to get proper base-63 output
        for(int j=0, k=9; j<10; j++, k--)
            y[j] = x[k];
        printf("%s\n", y);
    }
    fclose(urandom);
    return 0;
}