Algo: find max Xor in array for various interval limis, given N inputs, and p,q where 0<=p<=i<=q<=N - c++

the problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, … xn as key. To implement this algorithm efficiently, Xorq needs to find maximum value for (a xor xj) for given integers a,p and q such that p<=j<=q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describe a query which consists of three integers ai,pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
int xArray[100000];
cin >>t;
for(int j =0;j<t;j++)
{
cin>> n >>q;
//int* xArray = (int*)malloc(n*sizeof(int));
int i,a,pi,qi;
for(i=0;i<n;i++)
{
cin>>xArray[i];
}
for(i=0;i<q;i++)
{
cin>>a>>pi>>qi;
int max =0;
for(int it=pi-1;it<qi;it++)
{
int t = xArray[it] ^ a;
if(t>max)
max =t;
}
cout<<max<<"\n" ;
}
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted).
The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?

XOR flips bits. The max result of XOR is 0b11111111.
To get the best result
if 'a' on ith place has 1 then you have to XOR it with key that has ith bit = 0
if 'a' on ith place has 0 then you have to XOR it with key that has ith bit = 1
saying simply, for bit B you need !B
Another obvious thing is that higher order bits are more important than lower order bits.
That is:
if 'a' on highest place has B and you have found a key with highest bit = !B
then ALL keys that have highest bit = !B are worse that this one
This cuts your amount of numbers by half "in average".
How about building a huge binary tree from all the keys and ordering them in the tree by their bits, from MSB to LSB. Then, cutting the A bit-by-bit from MSB to LSB would tell you which left-right branch to take next to get the best result. Of course, that ignores PI/QI limits, but surely would give you the best result since you always pick the best available bit on i-th level.
Now if you annotate the tree nodes with low/high index ranges of its subelements (performed only done once when building the tree), then later when querying against a case A-PI-QI you could use that to filter-out branches that does not fall in the index range.
The point is that if you order the tree levels like the MSB->LSB bit order, then the decision performed at the "upper nodes" could guarantee you that currently you are in the best possible branch, and it would hold even if all the subbranches were the worst:
Being at level 3, the result of
0b111?????
can be then expanded into
0b11100000
0b11100001
0b11100010
and so on, but even if the ????? are expanded poorly, the overall result is still greater than
0b11011111
which would be the best possible result if you even picked the other branch at level 3rd.
I habe absolutely no idea how long would preparing the tree cost, but querying it for an A-PI-QI that have 32 bits seems to be something like 32 times N-comparisons and jumps, certainly faster than iterating randomly 0-100000 times and xor/maxing. And since you have up to 50000 queries, then building such tree can actually be a good investment, since such tree would be build once per keyset.
Now, the best part is that you actually dont need the whole tree. You may build such from i.e. first two or four or eight bits only, and use the index ranges from the nodes to limit your xor-max loop to a smaller part. At worst, you'd end up with the same range as PiQi. At best, it'd be down to one element.
But, looking at the max N keys, I think the whole tree might actually fit in the memory pool and you may get away without any xor-maxing loop.

I've spent some time google-ing this problem and it seams that you can find it in the context of various programming competitions. While the brute force approach is intuitive it does not really solve the challenge as it is too slow.
There are a few contraints in the problem which you need to speculate in order to write a faster algorithm:
the input consists of max 100k numbers, but there are only 32768 (2^15) possible numbers
for each input array there are Q, max 50k, test cases; each test case consists of 3 values, a,pi,and qi. Since 0<=a<2^15 and there are 50k cases, there is a chance the same value will come up again.
I've found 2 ideas for solving the problem: splitting the input in sqrt(N) intervals and building a segment tree ( a nice explanation for these approaches can be found here )
The biggest problem is the fact that for each test case you can have different values for a, and that would make previous results useless, since you need to compute max(a^x[i]), for a small number of test cases. However when Q is large enough and the value a repeats, using previous results can be possible.
I will come back with the actual results once I finish implementing both methods

Related

SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

A piece of C++ code determines the occurances of zero and keeps a binary flag variable for each number that is checked. The value of the flag toggles between 0 and 1 each time a zero is encountered in a 1 dimensional array.
I am attempting to use SSE to speed it up, but I am unsure of how to go about this. Evaluating the individual fields of __m128i is inefficient, I've read.
The code in C++ is:
int flag = 0;
int var_num2[1000];
for(int i = 0; i<1000; i++)
{
if (var[i] == 0)
{
var_num2[i] = flag;
flag = !flag; //toggle value upon encountering a 0
}
}
How should I go about this using SSE intrinsics?
You'd have to recognize the problem, but this is a variation of a well-known problem. I'll first give a theoretical description
Introduce a temporary array not_var[] which contains 1 if var contains 0 and 0 otherwise.
Introduce a temporary array not_var_sum[] which holds the partial sum of not_var.
var_num2 is now the LSB of not_var_sum[]
The first and third operation are trivially parallelizable. Parallelizing a partial sum is only a bit harder.
In a practical implementation, you wouldn't construct not_var[], and you'd write the LSB directly to var_num2 in all iterations of step 2. This is valid because you can discard the higher bits. Keeping just the LSB is equivalent to taking the result modulo 2, and (a+b)%2 == ((a%2) + (b%2))%s.
What type are the elements of var[]? int? Or char? Are zeroes frequent?
A SIMD prefix sum aka partial is possible (with log2(vector_width) work per element, e.g. 2 shuffles and 2 adds for a vector of 4 float), but the conditional-store based on the result is the other major problem. (Your array of 1000 elements is probably too small for multi-threading to be profitable.)
An integer prefix-sum is easier to do efficiently, and the lower latency of integer ops helps. NOT is just adding without carry, i.e. XOR, so use _mm_xor_si128 instead of _mm_add_ps. (You'd be using this on the integer all-zero/all-one compare result vector from _mm_cmpeq_epi32 (or epi8 or whatever, depending on the element size of var[]. You didn't specify, but different choices of strategy are probably optimal for different sizes).
But, just having a SIMD prefix sum actually barely helps: you'd still have to loop through and figure out where to store and where to leave unmodified.
I think your best bet is to generate a list of indices where you need to store, and then
for (size_t j = 0 ; j < scatter_count ; j+=2) {
var_num2[ scatter_element[j+0] ] = 0;
var_num2[ scatter_element[j+1] ] = 1;
}
You could generate the whole list if indices up-front, or you could work in small batches to overlap the search work with the store work.
The prefix-sum part of the problem is handled by alternately storing 0 and 1 in an unrolled loop. The real trick is avoiding branch mispredicts, and generating the indices efficiently.
To generate scatter_element[], you've transformed the problem into left-packing (filtering) an (implicit) array of indices based on the corresponding _mm_cmpeq_epi32( var[i..i+3], _mm_setzero_epi32() ). To generate the indices you're filtering, start with a vector of [0,1,2,3] and add [4,4,4,4] to it (_mm_add_epi32). I'm assuming the element size of var[] is 32 bits. If you have smaller elements, this require unpacking.
BTW, AVX512 has scatter instructions which you could use here, otherwise doing the store part with scalar code is your best bet. (But beware of Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake when just storing without loading.)
To overlap the left-packing with the storing, I think you want to left-pack until you have maybe 64 indices in a buffer. Then leave that loop and run another loop that left-packs indices and consumes indices, only stopping if your circular buffer is full (then just store) or empty (then just left-pack). This lets you overlap the vector compare / lookup-table work with the scatter-store work, but without too much unpredictable branching.
If zeros are very frequent, and var_num2[] elements are 32 or 64 bits, and you have AVX or AVX2 available, you could consider doing an standard prefix sum and using AVX masked stores. e.g. vpmaskmovd. Don't use SSE maskmovdqu, though: it has an NT hint, so it bypasses and evicts data from cache, and is quite slow.
Also, because your prefix sum is mod 2, i.e. boolean, you could use a lookup table based on the packed-compare result mask. Instead of horizontal ops with shuffles, use the 4-bit movmskps result of a compare + a 5th bit for the initial state as an index to a lookup table of 32 vectors (assuming 32-bit element size for var[]).

Finding all values that occurs odd number of times in huge list of positive integers

I came across this question from a colleague.
Q: Given a huge list (say some thousands)of positive integers & has many values repeating in the list, how to find those values occurring odd number of times?
Like 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 1 2 3 4 5 6 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1...
Here,
1 occrus 8 times
2 occurs 7 times (must be listed in output)
3 occurs 6 times
4 occurs 5 times (must be listed in output)
& so on... (the above set of values is only for explaining the problem but really there would be any positive numbers in the list in any order).
Originally we were looking at deriving a logic (to be based on c).
I suggested the following,
Using a hash table and the values from the list as an index/key to the table, keep updating the count in the corresponding index every time when the value is encountered while walking through the list; however, how to decide on the size of the hash table?? I couldn't say it surely though it might require Hashtable as big as the list.
Once the list is walked through & the hash table is populated (with the 'count' number of occurrences for each values/indices), only way to find/list the odd number of times occurring value is to walk through the table & find it out? Is that's the only way to do?
This might not be the best solution given this scenario.
Can you please suggest on any other efficient way of doing it so??
I sought in SO, but there were queries/replies on finding a single value occurring odd number of times but none like the one I have mentioned.
The relevance for this question is not known but seems to be asked in his interview...
Please suggest.
Thank You,
If the values to be counted are bounded by even a moderately reasonable limit then you can just create an array of counters, and use the values to be counted as the array indices. You don't need a tight bound, and "reasonable" is somewhat a matter of platform. I would not hesitate to take this approach for a bound (and therefore array size) sufficient for all uint16_t values, and that's not a hard limit:
#define UPPER_BOUND 65536
uint64_t count[UPPER_BOUND];
void count_values(size_t num_values, uint16_t values[num_values]) {
size_t i;
memset(count, 0, sizeof(count));
for (i = 0; i < num_values; i += 1) {
count[values[i]] += 1;
)
}
Since you only need to track even vs. odd counts, though, you really only need one bit per distinct value in the input. Squeezing it that far is a bit extreme, but this isn't so bad:
#define UPPER_BOUND 65536
uint8_t odd[UPPER_BOUND];
void count_values(size_t num_values, uint16_t values[num_values]) {
size_t i;
memset(odd, 0, sizeof(odd));
for (i = 0; i < num_values; i += 1) {
odd[values[i]] ^= 1;
)
}
At the end, odd[i] contains 1 if the value i appeared an odd number of times, and it contains 0 if i appeared an even number of times.
On the other hand, if the values to be counted are so widely distributed that an array would require too much memory, then the hash table approach seems reasonable. In that case, however, you are asking the wrong question. Rather than
how to decide on the size of the hash table?
you should be asking something along the lines of "what hash table implementation doesn't require me to manage the table size manually?" There are several. Personally, I have used UTHash successfully, though as of recently it is no longer maintained.
You could also use a linked list maintained in order, or a search tree. No doubt there are other viable choices.
You also asked
Once the list is walked through & the hash table is populated (with the 'count' number of occurrences for each values/indices), only way to find/list the odd number of times occurring value is to walk through the table & find it out? Is that's the only way to do?
If you perform the analysis via the general approach we have discussed so far then yes, the only way to read out the result is to iterate through the counts. I can imagine alternative, more complicated, approaches wherein you switch numbers between lists of those having even counts and those having odd counts, but I'm having trouble seeing how whatever efficiency you might gain in readout could fail to be swamped by the efficiency loss at the counting stage.
In your specific case, you can walk the list and toggle the value's existence in a set. The resulting set will contain all of the values that appeared an odd number of times. However, this only works for that specific predicate, and the more generic count-then-filter algorithm you describe will be required if you wanted, say, all of the entries that appear an even number of times.
Both algorithms should be O(N) time and worst-case O(N) space, and the constants will probably be lower for the set-based algorithm, but you'll need to benchmark it against your data. In practice, I'd run with the more generic algorithm unless there was a clear performance problem.

pseudo random distribution which guarantees all possible permutations of value sequence - C++

Random question.
I am attempting to create a program which would generate a pseudo-random distribution. I am trying to find the right pseudo-random algorithm for my needs. These are my concerns:
1) I need one input to generate the same output every time it is used.
2) It needs to be random enough that a person who looks at the output from input 1 sees no connection between that and the output from input 2 (etc.), but there is no need for it to be cryptographically secure or truly random.
3)Its output should be a number between 0 and (29^3200)-1, with every possible integer in that range a possible and equally (or close to it) likely output.
4) I would like to be able to guarantee that every possible permutation of sequences of 410 outputs is also a potential output of consecutive inputs. In other words, all the possible groupings of 410 integers between 0 and (29^3200)-1 should be potential outputs of sequential inputs.
5) I would like the function to be invertible, so that I could take an integer, or a series of integers, and say which input or series of inputs would produce that result.
The method I have developed so far is to run the input through a simple halson sequence:
boost::multiprecision::mpz_int denominator = 1;
boost::multiprecision::mpz_int numerator = 0;
while (input>0) {
denominator *=3;
numerator = numerator * 3 + (input%3);
input = input/3;
}
and multiply the result by 29^3200. It meets requirements 1-3, but not 4. And it is invertible only for single integers, not for series (since not all sequences can be produced by it). I am working in C++, using boost multiprecision.
Any advice someone can give me concerning a way to generate a random distribution meeting these requirements, or just a class of algorithms worth researching towards this end, would be greatly appreciated. Thank you in advance for considering my question.
----UPDATE----
Since multiple commenters have focused on the size of the numbers in question, I just wanted to make clear that I recognize the practical problems that working with such sets poses but in asking this question I'm interested only in the theoretical or conceptual approach to the problem - for example, imagine working with a much smaller set of integers like 0 to 99, and the permutations of sets of 10 of output sequences. How would you design an algorithm to meet these five conditions - 1)input is deterministic, 2)appears random (at least to the human eye), 3)every integer in the range is a possible output, 4)not only all values, but also all permutations of value sequences are possible outputs, 5)function is invertible.
---second update---
with many thanks to #Severin Pappadeux I was able to invert an lcg. I thought I'd add a little bit about what I did to hopefully make it easier for anyone seeing this in the future. First of all, these are excellent sources on inverting modular functions:
https://www.khanacademy.org/computing/computer-science/cryptography/modarithmetic/a/modular-inverses
https://www.khanacademy.org/computer-programming/discrete-reciprocal-mod-m/6253215254052864
If you take the equation next=ax+c%m, using the following code with your values for a and m will print out the euclidean equations you need to find ainverse, as well as the value of ainverse:
int qarray[12];
qarray[0]=0;
qarray[1]=1;
int i =2;
int reset = m;
while (m % a >0) {
int remainder=m%a;
int quotient=m/a;
std::cout << m << " = " << quotient << "*" << a << " + " << remainder << "\n";
qarray[i] =qarray[i-2]-(qarray[i-1]*quotient);
m=a;
a=remainder;
i++;
}
if (qarray[i-1]<0) {qarray[i-1]+=reset;}
std::cout << qarray[i-1] << "\n";
The other thing it took me a while to figure out is that if you get a negative result, you should add m to it. You should add a similar term to your new equation:
prev = (ainverse(next-c))%m;
if (prev<0) {prev+=m;}
I hope that helps anyone who ventures down this road in the future.
Ok, I'm not sure if there is a general answer, so I would concentrate on random number generator having, say, 64bit internal state/seed, producing 64bit output and having 2^64-1 period. In particular, I would look at linear congruential generator (aka LCG) in the form of
next = (a * prev + c) mod m
where a and m are primes to each other
So:
1) Check
2) Check
3) Check (well, for 64bit space of course)
4) Check (again, except 0 I believe, but each and every permutation of 64bits is output of LCG starting with some seed)
5) Check. LCG is known to be reversible, i.e. one could get
prev = (next - c) * a_inv mod m
where a_inv could be computed from a, m using Euclid's algorithm
Well, if it looks ok to you, you could try to implement LCG in your 15546bits space
UPDATE
And quick search shows reversible LCG discussion/code here
Reversible pseudo-random sequence generator
In your update, "appears random (to the human eye)" is the phrasing you use. The definition of "appears random" is not a well agreed upon topic. There are varying degrees of tests for "randomness."
However, if you're just looking to make it appear random to the human eye, you can just use ring multiplication.
Start with the idea of generating N! values between 0 and M (N>=410, M>=29^3200)
Group this together into one big number. we're going to generate a single number ranging from 0 to *M^N!. If we can show that the pseudorandom number generator generates every value from 0 to M^N!, we guarantee your permutation rule.
Now we need to make it "appear random." To the human eye, Linear Congruent Generators are enough. Pick a LCG with a period greater than or equal to 410!*M^N satisfying the rules to ensure a complete period. Easiest way to ensure fairness is to pick a LCG in the form x' = (ax+c) mod M^N!
That'll do the trick. Now, the hard part is proving that what you did was worth your time. Consider that the period of just a 29^3200 long sequence is outside the realm of physical reality. You'll never actually use it all. Ever. Consider that a superconductor made of Josephine junctions (10^-12kg processing 10^11bits/s) weighing the mass of the entire universe 3*10^52kg) can process roughly 10^75bits/s. A number that can count to 29^3200 is roughly 15545 bits long, so that supercomputer can process roughly 6.5x10^71 numbers/s. This means it will take roughly 10^4600s to merely count that high, or somewhere around 10^4592 years. Somewhere around 10^12 years from now, the stars are expected to wink out, permanently, so it could be a while.
There are M**N sequences of N numbers between 0 and M-1.
You can imagine writing all of them one after the other in a (pseudorandom) sequence and placing your read pointer randomly in the resulting loop of N*(M**N) numbers between 0 and M-1...
def output(input):
total_length = N*(M**N)
index = input % total_length
permutation_index = shuffle(index / N, M**N)
element = input % N
return (permutation_index / (N**element)) % M
Of course for every permutation of N elements between 0 and M-1 there is a sequence of N consecutive inputs that produces it (just un-shuffle the permutation index). I'd also say (just using symmetry reasoning) that given any starting input the output of next N elements is equally probable (each number and each sequence of N numbers is equally represented in the total period).

Using CRCs as a digest to detect duplicates among files

The primary use of CRCs and similar computations (such as Fletcher and Adler) seems to be for the detection of transmission errors. As such, most studies I have seen seem to address the issue of the probability of detecting small-scale differences between two data sets. My needs are slightly different.
What follows is a very approximate description of the problem. Details are much more complicated than this, but the description below illustrates the functionality I am looking for. This little disclaimer is intended to ward of answers such as "Why are you solving your problem this way when you can more easily solve it this other way I propose?" - I need to solve my problem this way for a myriad of reasons that are not germane to this question or post, so please don't post such answers.
I am dealing with collections of data sets (size ~1MB) on a distributed network. Computations are performed on these data sets, and speed/performance is critical. I want a mechanism to allow me to avoid re-transmitting data sets. That is, I need some way to generate a unique identifier (UID) for each data set of a given size. (Then, I transmit data set size and UID from one machine to another, and the receiving machine only needs to request transmission of the data if it does not already have it locally, based on the UID.)
This is similar to the difference between using CRC to check changes to a file, and using a CRC as a digest to detect duplicates among files. I have not seen any discussions of the latter use.
I am not concerned with issues of tampering, i.e. I do not need cryptographic strength hashing.
I am currently using a simple 32-bit CRC of the serialized data, and that has so far served me well. However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
The other question I have is a bit more subtle. In my current implementation, I ignore the structure of my data set, and effectively just CRC the serialized string representing my data. However, for various reasons, I want to change my CRC methodology as follows. Suppose my top-level data set is a collection of some raw data and a few subordinate data sets. My current scheme essentially concatenates the raw data and all the subordinate data sets and then CRC's the result. However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction. The question is, how does using this methodology affect the probability of collisions?
To put it in a language what will allow me to discuss my thoughts, I'll define a bit of notation. Call my top-level data set T, and suppose it consists of raw data set R and subordinate data sets Si, i=1..n. I can write this as T = (R, S1, S2, ..., Sn). If & represents concatenation of data sets, my original scheme can be thought of as:
UID_1(T) = CRC(R & S1 & S2 & ... & Sn)
and my new scheme can be thought of as
UID_2(T) = CRC(R & CRC(S1) & CRC(S2) & ... & CRC(Sn))
Then my questions are: (1) if T and T' are very different, what CRC algorithm minimizes prob( UID_1(T)=UID_1(T') ), and what CRC algorithm minimizes prob( UID_2(T)=UID_2(T') ), and how do these two probabilities compare?
My (naive and uninformed) thoughts on the matter are this. Suppose the differences between T and T' are in only one subordinate data set, WLOG say S1!=S1'. If it happens that CRC(S1)=CRC(S1'), then clearly we will have UID_2(T)=UID_2(T'). On the other hand, if CRC(S1)!=CRC(S1'), then the difference between R & CRC(S1) & CRC(S2) & ... & CRC(Sn) and R & CRC(S1') & CRC(S2) & ... & CRC(Sn) is a small difference on 4 bytes only, so the ability of UID_2 to detect differences is effectively the same as a CRC's ability to detect transmission errors, i.e. its ability to detect errors in only a few bits that are not widely separated. Since this is what CRC's are designed to do, I would think that UID_2 is pretty safe, so long as the CRC I am using is good at detecting transmission errors. To put it in terms of our notation,
prob( UID_2(T)=UID_2(T') ) = prob(CRC(S1)=CRC(S1')) + (1-prob(CRC(S1)=CRC(S1'))) * probability of CRC not detecting error a few bits.
Let call the probability of CRC not detecting an error of a few bits P, and the probability of it not detecting large differences on a large size data set Q. The above can be written approximately as
prob( UID_2(T)=UID_2(T') ) ~ Q + (1-Q)*P
Now I will change my UID a bit more as follows. For a "fundamental" piece of data, i.e. a data set T=(R) where R is just a double, integer, char, bool, etc., define UID_3(T)=(R). Then for a data set T consisting of a vector of subordinate data sets T = (S1, S2, ..., Sn), define
UID_3(T) = CRC(ID_3(S1) & ID_3(S2) & ... & ID_3(Sn))
Suppose a particular data set T has subordinate data sets nested m-levels deep, then, in some vague sense, I would think that
prob( UID_3(T)=UID_3(T') ) ~ 1 - (1-Q)(1-P)^m
Given these probabilities are small in any case, this can be approximated as
1 - (1-Q)(1-P)^m = Q + (1-Q)*P*m + (1-Q)*P*P*m*(m-1)/2 + ... ~ Q + m*P
So if I know my maximum nesting level m, and I know P and Q for various CRCs, what I want is to pick the CRC that gives me the minimum value for Q + m*P. If, as I suspect might be the case, P~Q, the above simplifies to this. My probability of error for UID_1 is P. My probability of error for UID_3 is (m+1)P, where m is my maximum nesting (recursion) level.
Does all this seem reasonable?
I want a mechanism to allow me to avoid re-transmitting data sets.
rsync has already solved this problem, using generally the approach you outline.
However, I would like to know if anyone can recommend which 32-bit CRC
algorithm (i.e. which polynomial?) is best for minimizing the
probability of collisions in this situation?
You won't see much difference among well-selected CRC polynomials. Speed may be more important to you, in which case you may want to use a hardware CRC, e.g. the crc32 instruction on modern Intel processors. That one uses the CRC-32C (Castagnoli) polynomial. You can make that really fast by using all three arithmetic units on a single core in parallel by computing the CRC on three buffers in the same loop, and then combining them. See below how to combine CRCs.
However, most of the time I already have the CRC's of the subordinate
data sets, and I would rather construct my UID of the top-level data
set by concatenating the raw data with the CRC's of the subordinate
data sets, and then CRC this construction.
Or you could quickly compute the CRC of the entire set as if you had done a CRC on the whole thing, but using the already calculated CRCs of the pieces. Look at crc32_combine() in zlib. That would be better than taking the CRC of a bunch of CRCs. By combining, you retain all the mathematical goodness of the CRC algorithm.
Mark Adler's answer was bang on. If I'd taken my programmers hat off and put on my mathematicians hat, some of it should have been obvious. He didn't have the time to explain the mathematics, so I will here for those who are interested.
The process of calculating a CRC is essentially the process of doing a polynomial division. The polynomials have coefficients mod 2, i.e. the coefficient of each term is either 0 or 1, hence a polynomial of degree N can be represented by an N-bit number, each bit being the coefficient of a term (and the process of doing a polynomial division amounts to doing a whole bunch of XOR and shift operations). When CRC'ing a data block, we view the "data" as one big polynomial, i.e. a long string of bits, each bit representing the coefficient of a term in the polynomial. Well call our data-block polynomial A. For each CRC "version", there has been chosen the polynomial for the CRC, which we'll call P. For 32-bit CRCs, P is a polynomial with degree 32, so it has 33 terms and 33 coefficients. Because the top coefficient is always 1, it is implicit and we can represent the 32nd-degree polynomial with a 32-bit integer. (Computationally, this is quite convenient actually.) The process of calculating the CRC for a data block A is the process of finding the remainder when A is divided by P. That is, A can always be written
A = Q * P + R
where R is a polynomial of degree less than degree of P, i.e. R has degree 31 or less, so it can be represented by a 32-bit integer. R is essentially the CRC. (Small note: typically one prepends 0xFFFFFFFF to A, but that is unimportant here.) Now, if we concatenate two data blocks A and B, the "polynomial" corresponding to the concatenation of the two blocks is the polynomial for A, "shifted to the left" by the number of bits in B, plus B. Put another way, the polynomial for A&B is A*S+B, where S is the polynomial corresponding to a 1 followed by N zeros, where N is the number of bits in B. (i.e. S = x**N ). Then, what can we say about the CRC for A&B? Suppose we know A=Q*P+R and B=Q'*P+R', i.e. R is the CRC for A and R' is the CRC for B. Suppose we also know S=q*P+r. Then
A * S + B = (Q*P+R)*(q*P+r) + (Q'*P+R')
= Q*(q*P+r)*P + R*q*P + R*r + Q'*P + R'
= (Q*S + R*q + Q') * P + R*r + R'
So to find the remainder when A*S+B is divided by P, we need only find the remainder when R*r+R' is divided by P. Thus, to calculate the CRC of the concatenation of two data streams A and B, we need only know the separate CRC's of the data streams, i.e. R and R', and the length N of the trailing data stream B (so we can compute r). This is also the content of one of Marks other comments: if the lengths of the trailing data streams B are constrained to a few values, we can pre-compute r for each of these lengths, making combination of two CRC's quite trivial. (For an arbitrary length N, computing r is not trivial, but it is much faster (log_2 N) than re-doing the division over the entire B.)
Note: the above is not a precise exposition of CRC. There is some shifting that goes on. To be precise, if L is the polynomial represented by 0xFFFFFFFF, i.e. L=x*31+x*30+...+x+1, and S_n is the "shift left by n bits" polynomial, i.e. S_n = x**n, then the CRC of a data block with polynomial A of N bits, is the remainder when ( L * S_N + A ) * S_32 is divided by P, i.e. when (L&A)*S_32 is divided by P, where & is the "concatenation" operator.
Also, I think I disagree with one of Marks comments, but he can correct me if I'm wrong. If we already know R and R', comparing the time to compute the CRC of A&B using the above methodology, as compared with computing it the straightforward way, does not depend on the ratio of len(A) to len(B) - to compute it the "straight forward" way, one really does not have to re-compute the CRC on the entire concatenated data set. Using our notation above, one only needs to compute the CRC of R*S+B. That is, instead of pre-pending 0xFFFFFFFF to B and computing its CRC, we prepend R to B and compute its CRC. So its a comparison of the time to compute B's CRC over again with the time to compute r, (followed by dividing R*r+R' by P, which is trivial and inconsequential in time likely).
Mark Adler's answer addresses the technical question so that's not what I'll do here. Here I'm going to point out a major potential flaw in the synchronization algorithm proposed in the OP's question and suggest a small improvement.
Checksums and hashes provide a single signature value for some data. However, being of finite length, the number of possible unique values of a checksum/hash is always smaller than the possible combinations of the raw data if the data is longer. For instance, a 4 byte CRC can only ever take on 4 294 967 296 unique values whilst even a 5 byte value which might be the data can take on 8 times as many values. This means for any data longer than the checksum itself, there always exists one or more byte combinations with exactly the same signature.
When used to check integrity, the assumption is that the likelihood of a slightly different stream of data resulting in the same signature is small so that we can assume the data is the same if the signature is the same. It is important to note that we start with some data d and verify that given a checksum, c, calculated using a checksum function, f that f(d) == c.
In the OP's algorithm, however, the different use introduces a subtle, detrimental degradation of confidence. In the OP's algorithm, server A would start with the raw data [d1A,d2A,d3A,d4A] and generate a set of checksums [c1,c2,c3,c4] (where dnA is the n-th data item on server A). Server B would then receive this list of checksums and check its own list of checksums to determine if any are missing. Say Server B has the list [c1,c2,c3,c5]. What should then happen is that it requests d4 from Server A and the synchronization has worked properly in the ideal case.
If we recall the possibilty of collisions, and that it doesn't always take that much data to produce one (e.g. CRC("plumless") == CRC("buckeroo")), then we'll quickly realize that the best guarantee our scheme provides is that server B definitely doesn't have d4A but it cannot guarantee that it has [d1A,d2A,d3A]. This is because it is possible that f(d1A) = c1 and f(d1B) = c1 even though d1A and d1B are distinct and we would like both servers to have both. In this scheme, neither server can ever know about the existence of both d1A and d1B. We can use more and more collision resistant checksums and hashes but this scheme can never guarantee complete synchronization. This becomes more important, the greater the number of files the network must keep track of. I would recommend using a cryptographic hash like SHA1 for which no collisions have been found.
A possible mitigation of the risk of this is to introduce redundant hashes. One way of doing is is to use a completely different algorithm since whilst it is possible crc32(d1) == crc32(d2) it is less likely that adler32(d1) == adler32(d2) simultaneously. This paper suggests you don't gain all that much this way though. To use the OP notation, it is also less likely that crc32('a' & d1) == crc32('a' & d2) and crc32('b' & d1) == crc32('b' & d2) are simultaneously true so you can "salt" to less collision prone combinations. However, I think you may just as well just use a collision resistant hash function like SHA512 which in practice likely won't have that great an impact on your performance.

Mutually exclusive contiguous ranges from multiple bitfields

(This is not a CS class homework, even if it looks like one)
I'm using bitfields to represent ranges between 0 and 22. As an input, I have several different ranges, for example (order doesn't matter). I used . for 0 and X for 1 for better readability.
.....XXXXX..............
..XXXX..................
.....XXXXXXXXXXXXXXX....
........XXXXXXX.........
XXXXXXXXXXXXXXXXXXXXXXXX
The number of bitfield ranges is typically below 10, but can potentially become as high as 100. From that input, I want to calculate the mutually exclusive, contiguous ranges, like this:
XX......................
..XXX...................
.....X..................
......XX................
........XX..............
..........XXXXX.........
...............XXXXX....
....................XXXX
(again, the output order doesn't matter, they just need to be mutually exclusive and contiguous, i.e. they can't have holes in them. .....XXX.......XXXXX.... must be split up in two individual ranges).
I tried a couple of algorithms, but all of them ended up being rather complex and unelegant. What would help me immensely is a way to detect that .....XXX.......XXXXX.... has a hole and a way to determine the index of one of the bits in the hole.
Edit: The bitfield range represent zoomlevels on a map. They are intended to be used for outputting XML stylesheets for Mapnik (the tile rendering system that is, among others, used by OpenStreetMap).
I'm assuming the solution you're mentioning in the comment is something like this:
Start at the left or right (so index = 0), and scan which bits are set (upto 100 operations). Name that set x. Also set a variable block=0.
At index=1, repeat and store to set y. If x XOR y = 0, both are identical sets, so move on to index=2. If it x XOR y = z != 0, then range [block, index) is contiguous. Now set x = y, block = index, and continue.
If you have 100 bit-arrays of length 22 each, this takes something on the order of 2200 operations.
This is an optimum solution because the operation cannot be reduced further -- at each stage, your range is broken if another set doesn't match your set, so to check if the range is broken you must check all 100 bits.
I'll take a shot at your sub-problem, at least..
What would help me immensely is a way to detect that
.....XXX.......XXXXX.... has a hole and a way to determine the index
of one of the bits in the hole.
Finding the lowest and highest set ("1") bits in a bitmask is a pretty
solved problem; See, for example, ffs(3) in glibc, or see
e.g. http://en.wikipedia.org/wiki/Bit_array#Find_first_one
Given the first and last indexes of a bitmap, call them i, and j,
you can compute the bitmap that has all bits betweem i and j set
using M = ((1 << i) - 1) & (~((1 << j) - 1)) (apologies for any
off-by-one-errors).
You can then test if the original bitmap has a hole by comparing it to
M. If it doesn't match, you can take the input xor M to find the
holes and repeat.