Relinearizing one Ciphertext in SEAL - c++

Let's say I calculated the addition or multiplication of two Ciphertexts and saved the result in a third one. If I want to perform additional mathematical operations on my result Ciphertext (the destination Ciphertext), is it advisable to use evaluator.relinearize() on it before doing so? Because if I understood correctly, some operations on Ciphertexts cause the result Ciphertext's size to grow beyond 2. If yes, would this be a good approach for relinearizing one Ciphertext?
EvaluationKeys ev_keys;
int size = result.size();
keygen.generate_evaluation_keys(size - 2, ev_keys); // We need size - 2 ev_keys for performing this relinearization.
evaluator.relinearize(result, ev_keys);

Only Evaluator::multiply will increase the size of your ciphertext. Every ciphertext has size at least 2 (fresh encryptions have size 2), and multiplying ciphertexts of sizes a and b results in a ciphertext of size a+b-1. Thus, multiplying two ciphertexts of size 2, you'll end up with a ciphertext of size 3. In almost all cases you'll want to relinearize at this point to bring the size back down to 2, as further operations on size-3 ciphertexts can be significantly more computationally costly.
There are some exceptions to this rule: say you want to compute the sum of many products. In this case you might want to relinearize only the final sum and not the individual summands, since computing sums of size 3 ciphertexts is still very fast.
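For example, a minimal sketch of that pattern, using the same SEAL 2.x-style Evaluator calls as in the question (cts is a hypothetical std::vector<Ciphertext> of size-2 ciphertexts, assumed to have an even number of elements for brevity):
Ciphertext sum, product;
evaluator.multiply(cts[0], cts[1], sum);              // size 3
for (size_t i = 2; i + 1 < cts.size(); i += 2)
{
    evaluator.multiply(cts[i], cts[i + 1], product);  // size 3
    evaluator.add(sum, product, sum);                 // sum stays size 3
}
evaluator.relinearize(sum, ev_keys);                  // relinearize once, at the end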
To make relinearization possible, the party who generates keys needs to also generate evaluation keys as follows:
EvaluationKeys ev_keys;
keygen.generate_evaluation_keys(60, ev_keys);
Later the evaluating party can use these as:
evaluator.relinearize(result, ev_keys);
Here I used 60 as the decomposition_bit_count in generate_evaluation_keys, which is the fastest and most often the best choice. You should probably never use a count parameter different from 1 (the default) in generate_evaluation_keys; that is meant for use cases where you let your ciphertexts grow in size beyond 3 and need to bring them down from, e.g., size 4 or 5 to 2.

Related

SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

A piece of C++ code determines the occurrences of zero and keeps a binary flag variable for each number that is checked. The value of the flag toggles between 0 and 1 each time a zero is encountered in a one-dimensional array.
I am attempting to use SSE to speed it up, but I am unsure of how to go about this. Evaluating the individual fields of __m128i is inefficient, I've read.
The code in C++ is:
int flag = 0;
int var_num2[1000];
for (int i = 0; i < 1000; i++)
{
    if (var[i] == 0)
    {
        var_num2[i] = flag;
        flag = !flag; // toggle value upon encountering a 0
    }
}
How should I go about this using SSE intrinsics?
You might not recognize it at first, but this is a variation of a well-known problem. I'll first give a theoretical description.
1. Introduce a temporary array not_var[] which contains 1 where var contains 0, and 0 otherwise.
2. Introduce a temporary array not_var_sum[] which holds the partial sums of not_var[].
3. var_num2[] is now the LSB of not_var_sum[].
The first and third operation are trivially parallelizable. Parallelizing a partial sum is only a bit harder.
In a practical implementation, you wouldn't construct not_var[], and you'd write the LSB directly to var_num2 in all iterations of step 2. This is valid because you can discard the higher bits: keeping just the LSB is equivalent to taking the result modulo 2, and (a+b)%2 == ((a%2) + (b%2))%2.
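To make the reformulation concrete before vectorizing, here is a scalar restatement (my sketch; it is equivalent to the original loop because the flag at each zero equals the parity of the zeros seen so far):
int count = 0;                     // running count of zeros seen so far
for (int i = 0; i < 1000; i++)
{
    if (var[i] == 0)
    {
        var_num2[i] = count & 1;   // LSB of the prefix sum == old flag value
        count++;
    }
}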
What type are the elements of var[]? int? Or char? Are zeroes frequent?
A SIMD prefix sum (aka partial sum, or scan) is possible, with log2(vector_width) work per element, e.g. 2 shuffles and 2 adds for a vector of 4 floats, but the conditional store based on the result is the other major problem. (Your array of 1000 elements is probably too small for multi-threading to be profitable.)
An integer prefix sum is easier to do efficiently, and the lower latency of integer ops helps. Since you only keep the LSB, addition without carry is all you need, i.e. XOR, so use _mm_xor_si128 instead of _mm_add_ps. (You'd be using this on the integer all-zero/all-ones compare result vector from _mm_cmpeq_epi32, or epi8 or whatever, depending on the element size of var[]. You didn't specify, but different choices of strategy are probably optimal for different element sizes.)
But, just having a SIMD prefix sum actually barely helps: you'd still have to loop through and figure out where to store and where to leave unmodified.
I think your best bet is to generate a list of indices where you need to store, and then
for (size_t j = 0; j < scatter_count; j += 2) {
    var_num2[ scatter_element[j+0] ] = 0;
    var_num2[ scatter_element[j+1] ] = 1;
}
You could generate the whole list of indices up-front, or you could work in small batches to overlap the search work with the store work.
The prefix-sum part of the problem is handled by alternately storing 0 and 1 in an unrolled loop. The real trick is avoiding branch mispredicts, and generating the indices efficiently.
To generate scatter_element[], you've transformed the problem into left-packing (filtering) an (implicit) array of indices based on the corresponding _mm_cmpeq_epi32( var[i..i+3], _mm_setzero_si128() ). To generate the indices you're filtering, start with a vector of [0,1,2,3] and add [4,4,4,4] to it (_mm_add_epi32) each iteration. I'm assuming the element size of var[] is 32 bits. If you have smaller elements, this requires unpacking.
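A rough sketch of that left-packing loop (my code, assuming 32-bit var[] elements and SSSE3 for _mm_shuffle_epi8; pack_table and popcnt4 are hypothetical lookup tables filled once at startup, and scatter_element needs a few entries of padding because each store writes a whole vector):
#include <emmintrin.h>   // SSE2
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8
#include <stddef.h>
#include <stdint.h>

static __m128i pack_table[16];   // byte shuffles that left-pack the set lanes
static uint8_t popcnt4[16];      // number of set bits in a 4-bit mask

size_t collect_zero_indices(const int *var, size_t n, int *scatter_element)
{
    __m128i idx        = _mm_setr_epi32(0, 1, 2, 3);
    const __m128i four = _mm_set1_epi32(4);
    const __m128i zero = _mm_setzero_si128();
    size_t out = 0;
    for (size_t i = 0; i < n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(var + i));
        __m128i eq   = _mm_cmpeq_epi32(v, zero);
        int mask     = _mm_movemask_ps(_mm_castsi128_ps(eq));    // 4-bit mask
        __m128i pckd = _mm_shuffle_epi8(idx, pack_table[mask]);  // left-pack
        _mm_storeu_si128((__m128i *)(scatter_element + out), pckd);
        out += popcnt4[mask];    // only the first popcnt4[mask] lanes are kept
        idx = _mm_add_epi32(idx, four);
    }
    return out;
}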
BTW, AVX512 has scatter instructions which you could use here; otherwise doing the store part with scalar code is your best bet. (But beware of "Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake" when just storing without loading.)
To overlap the left-packing with the storing, I think you want to left-pack until you have maybe 64 indices in a buffer. Then leave that loop and run another loop that left-packs indices and consumes indices, only stopping if your circular buffer is full (then just store) or empty (then just left-pack). This lets you overlap the vector compare / lookup-table work with the scatter-store work, but without too much unpredictable branching.
If zeros are very frequent, and var_num2[] elements are 32 or 64 bits, and you have AVX or AVX2 available, you could consider doing a standard prefix sum and using AVX masked stores, e.g. vpmaskmovd. Don't use SSE maskmovdqu, though: it has an NT hint, so it bypasses and evicts data from cache, and is quite slow.
Also, because your prefix sum is mod 2, i.e. boolean, you could use a lookup table based on the packed-compare result mask. Instead of horizontal ops with shuffles, use the 4-bit movmskps result of a compare + a 5th bit for the initial state as an index to a lookup table of 32 vectors (assuming 32-bit element size for var[]).

What are some checksum implementations that allow for incremental computation?

In my program I have a set of sets that are stored in a proprietary hash table. Like all hash tables, I need two functions for each element. First, I need the hash value to use for insertion. Second, I need a compare function for when there are conflicts. It occurs to me that a checksum function would be perfect for this: I could use the value in both functions. There's no shortage of checksum functions, but I would like to know if there are any commonly available ones that I wouldn't need to bring in a library for (my company is a PIA when it comes to that). A system library would be OK.
But I have an additional, more complicated requirement. I need for the checksum to be incrementally calculable. That is, if a set contains A B C D E F and I subtract D from the set, it should be able to return a new checksum value without iterating over all the elements in the set again. The reason for this is to prevent non-linearity in my code. Ideally, I'd like for the checksum to be order independent but I can sort them first if needed. Does such an algorithm exist?
Simply store a dictionary of the items in your set and their corresponding hash values. The hash value of the set is then the hash of the concatenated, sorted item hashes. In Python:
import hashlib

# dictionary mapping each item (as bytes) to its hash in string form, e.g.
hashes = { item: hashlib.sha384(item).hexdigest() for item in items }
sorted_hashes = sorted(hashes.values())
concatenated_hashes = ''.join(sorted_hashes)
hash_of_the_set = hashlib.sha384(concatenated_hashes.encode())
As hash function I would use sha384, but you might want to try Keccak-384.
Because there are (of course) no cryptographic hash functions with a length of only 32 bits, you would have to use a checksum instead, like Adler-32 or CRC-32. The idea remains the same. Best to use Adler-32 on the items and CRC-32 on the concatenated hashes:
import zlib

# item: bytes; adler32 returns an int
hashes = { item: zlib.adler32(item) for item in items }
sorted_hashes = sorted(hashes.values())
concatenated_hashes = b''.join(h.to_bytes(4, 'big') for h in sorted_hashes)
hash_of_the_set = zlib.crc32(concatenated_hashes)
In C++ you can use Adler-32 and CRC-32 of Botan.
A CRC is a set of bits that are calculated from an input.
If your input is the same size (or less) as the CRC (in your case - 32 bits), you can find the input that created this CRC - in effect reversing it.
If your input is larger than 32 bits, but you know all the input except for 32 bits, you can still reverse the CRC to find the missing bits.
If, however, the unknown part of the input is larger than 32 bits, you can't find it as there is more than one solution.
Why am I telling you this? Imagine you have the CRC of the set
{A,B,C}
Say you know what B is, and you can now calculate easily the CRC of the set
{A,C}
(by "easily" I mean - without going over the entire A and C inputs - like you wanted)
Now you have 64 bits describing A and C! And since we didn't have to go over the entirety of A and C to do it - it means we can do it even if we're missing information about A and C.
So it looks like IF such a method existed, we could magically recover more than 32 unknown bits of an input given only its CRC.
This obviously is wrong. Does that mean there's no way to do what you want? Of course not. But it does give us constraints on how it can be done:
Option 1: we gain no information from CRC({A,C}) that we didn't already have in CRC({A,B,C}). That means the (relative) effect of A and C on the CRC doesn't change with the removal of B. Basically, it means that when calculating the CRC we use some order-independent function when adding new elements:
We can use, for example:
CRC({A,B,C}) = CRC(A) ^ CRC(B) ^ CRC(C) (not very good: a set where A appears twice gets the same CRC as one where A never appeared at all), or
CRC({A,B,C}) = CRC(A) + CRC(B) + CRC(C), or
CRC({A,B,C}) = CRC(A) * CRC(B) * CRC(C) (make sure each CRC(X) is odd, so it's effectively just 31 bits of CRC), or
CRC({A,B,C}) = g^CRC(A) * g^CRC(B) * g^CRC(C) (where ^ is exponentiation - useful if you want cryptographic security), etc.
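A minimal sketch of the additive variant in C++ (my illustration, not from the answer; it assumes zlib's crc32() as the per-item hash, and relies on unsigned wraparound so removal exactly undoes insertion):
#include <zlib.h>
#include <cstdint>
#include <string>

struct SetChecksum {
    uint32_t sum = 0;  // order-independent: addition mod 2^32 commutes

    static uint32_t item_hash(const std::string &item) {
        return static_cast<uint32_t>(crc32(0L,
            reinterpret_cast<const Bytef *>(item.data()),
            static_cast<uInt>(item.size())));
    }
    void insert(const std::string &item) { sum += item_hash(item); }  // O(1)
    void erase(const std::string &item)  { sum -= item_hash(item); }  // O(1)
};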
Option 2: we do need all of A and C to calculate CRC({A,C}), but we have a data structure that makes it less than linear in time to do so if we already calculated CRC({A,B,C}).
This is useful if you want specifically CRC32, and don't mind remembering more information in addition to the CRC after the calculation (the CRC is still 32 bits, but you keep a data structure of size O(len(A,B,C)) that you will later use to calculate CRC({A,C}) more efficiently).
How will that work? Many CRCs are just the application of a polynomial on the input.
Basically, if you divide the input into n chunks of 32 bits each - X_1...X_n - there is a matrix M such that
CRC(X_1...X_n) = M^n * X_1 + ... + M^1 * X_n
(where ^ here is power)
How does that help? This sum can be calculated in a tree-like fashion:
CRC(X_1...X_n) = M^(n/2) * CRC(X_1...X_(n/2)) + CRC(X_(n/2+1)...X_n)
So you begin with all the X_i on the leaves of the tree, start by calculating the CRC of each consecutive pair, then combine them in pairs until you get the combined CRC of all your input.
If you remember all the partial CRCs on the nodes, you can then easily remove (or add) an item anywhere in the list by doing just O(log(n)) calculations!
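A hedged sketch of the node-combining step, using zlib's crc32_combine() (which computes the CRC of two concatenated byte ranges from the two individual CRCs and the length of the second range; the Node layout is my own illustration):
#include <zlib.h>
#include <cstddef>

struct Node {
    unsigned long crc;  // CRC-32 of this node's byte range
    size_t len;         // length of that range in bytes
};

// Parent covers the left child's bytes followed by the right child's.
Node combine(const Node &left, const Node &right) {
    Node parent;
    parent.len = left.len + right.len;
    parent.crc = crc32_combine(left.crc, right.crc,
                               static_cast<z_off_t>(right.len));
    return parent;
}
// After changing one leaf, re-run combine() along its root path: O(log n).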
So there - as far as I can tell, those are your two options. I hope this wasn't too much of a mess :)
I'd personally go with option 1, as it's just simpler... but the resulting CRC isn't standard, and is less... good. Less "CRC"-like.
Cheers!

Algo: find max XOR in array for various interval limits, given N inputs, and p,q where 0<=p<=i<=q<=N

The problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, … xn as key. To implement this algorithm efficiently, Xorq needs to find maximum value for (a xor xj) for given integers a,p and q such that p<=j<=q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describe a query which consists of three integers ai,pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
int t, n, q;
int xArray[100000];
cin >> t;
for (int j = 0; j < t; j++)
{
    cin >> n >> q;
    //int* xArray = (int*)malloc(n*sizeof(int));
    int i, a, pi, qi;
    for (i = 0; i < n; i++)
    {
        cin >> xArray[i];
    }
    for (i = 0; i < q; i++)
    {
        cin >> a >> pi >> qi;
        int max = 0;
        for (int it = pi - 1; it < qi; it++)
        {
            int t = xArray[it] ^ a;
            if (t > max)
                max = t;
        }
        cout << max << "\n";
    }
}
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted).
The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?
XOR flips bits. The best possible XOR result has all bits set (e.g. 0b11111111 for 8-bit values).
To get the best result:
if 'a' has a 1 in the ith place, you have to XOR it with a key whose ith bit = 0
if 'a' has a 0 in the ith place, you have to XOR it with a key whose ith bit = 1
Put simply, for bit B you need !B.
Another obvious thing is that higher order bits are more important than lower order bits.
That is:
if 'a' has B in the highest place and you have found a key whose highest bit is !B,
then ALL keys whose highest bit is B are worse than that one.
This cuts the number of candidate keys in half, on average.
How about building a huge binary tree from all the keys, ordering them in the tree by their bits, from MSB to LSB? Then, walking down the tree bit-by-bit along A, from MSB to LSB, would tell you which left/right branch to take next to get the best result. Of course, that ignores the pi/qi limits, but it would surely give you the best result, since you always pick the best available bit at the ith level.
Now, if you annotate the tree nodes with the low/high index ranges of their subelements (done only once, when building the tree), then later, when querying against a case a-pi-qi, you could use that to filter out branches that do not fall in the index range.
The point is that if you order the tree levels in MSB-to-LSB bit order, then the decisions made at the upper nodes guarantee that you are currently in the best possible branch, and that holds even if all the subbranches below were the worst:
Being at level 3, the result of
0b111?????
can be then expanded into
0b11100000
0b11100001
0b11100010
and so on, but even if the ????? are expanded poorly, the overall result is still greater than
0b11011111
which would be the best possible result if you had picked the other branch at level 3.
I have absolutely no idea what preparing the tree would cost, but querying it for a case a-pi-qi would be something like one compare-and-jump per bit (32 for 32-bit values; 15 here, since the keys are below 2^15), certainly faster than iterating over up to 100,000 elements and xor/maxing. And since you have up to 50,000 queries, building such a tree can actually be a good investment, since it is built once per key set.
Now, the best part is that you actually don't need the whole tree. You may build it from, e.g., only the first two, four, or eight bits, and use the index ranges from the nodes to limit your xor-max loop to a smaller part. At worst, you'd end up with the same range as pi..qi. At best, it'd be narrowed down to one element.
But, looking at the max N keys, I think the whole tree might actually fit in the memory pool and you may get away without any xor-maxing loop.
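For what it's worth, a minimal sketch of such a trie (my illustration, assuming the problem's 15-bit keys and ignoring the pi/qi index-range annotation described above):
struct TrieNode {
    TrieNode *child[2] = { nullptr, nullptr };
};

void trie_insert(TrieNode *root, int key) {
    TrieNode *node = root;
    for (int bit = 14; bit >= 0; bit--) {
        int b = (key >> bit) & 1;
        if (!node->child[b]) node->child[b] = new TrieNode();
        node = node->child[b];
    }
}

// Maximum of (a ^ key) over all inserted keys: greedily prefer the
// opposite bit at each level, MSB to LSB. Assumes at least one key
// has been inserted.
int max_xor(const TrieNode *root, int a) {
    const TrieNode *node = root;
    int result = 0;
    for (int bit = 14; bit >= 0; bit--) {
        int want = ((a >> bit) & 1) ^ 1;  // opposite of a's bit
        if (node->child[want]) {
            result |= 1 << bit;           // this bit of a^key is 1
            node = node->child[want];
        } else {
            node = node->child[want ^ 1];
        }
    }
    return result;
}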
I've spent some time googling this problem and it seems you can find it in the context of various programming competitions. While the brute-force approach is intuitive, it does not really solve the challenge, as it is far too slow.
There are a few constraints in the problem which you need to exploit in order to write a faster algorithm:
the input consists of max 100k numbers, but there are only 32768 (2^15) possible numbers
for each input array there are Q (max 50k) test cases; each test case consists of 3 values: a, pi, and qi. Since 0<=a<2^15 and there are 50k cases, there is a chance the same value of a will come up again.
I've found 2 ideas for solving the problem: splitting the input into sqrt(N) intervals, and building a segment tree (a nice explanation of these approaches can be found here).
The biggest problem is the fact that each test case can have a different value of a, which makes previous results useless, since max(a^x[i]) must be recomputed per query. However, when Q is large enough and the value a repeats, reusing previous results becomes possible.
I will come back with the actual results once I finish implementing both methods

Using CRCs as a digest to detect duplicates among files

The primary use of CRCs and similar computations (such as Fletcher and Adler) seems to be for the detection of transmission errors. As such, most studies I have seen seem to address the issue of the probability of detecting small-scale differences between two data sets. My needs are slightly different.
What follows is a very approximate description of the problem. Details are much more complicated than this, but the description below illustrates the functionality I am looking for. This little disclaimer is intended to ward off answers such as "Why are you solving your problem this way when you can more easily solve it this other way I propose?" - I need to solve my problem this way for a myriad of reasons that are not germane to this question or post, so please don't post such answers.
I am dealing with collections of data sets (size ~1MB) on a distributed network. Computations are performed on these data sets, and speed/performance is critical. I want a mechanism to allow me to avoid re-transmitting data sets. That is, I need some way to generate a unique identifier (UID) for each data set of a given size. (Then, I transmit data set size and UID from one machine to another, and the receiving machine only needs to request transmission of the data if it does not already have it locally, based on the UID.)
This is similar to the difference between using CRC to check changes to a file, and using a CRC as a digest to detect duplicates among files. I have not seen any discussions of the latter use.
I am not concerned with issues of tampering, i.e. I do not need cryptographic strength hashing.
I am currently using a simple 32-bit CRC of the serialized data, and that has so far served me well. However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
The other question I have is a bit more subtle. In my current implementation, I ignore the structure of my data set, and effectively just CRC the serialized string representing my data. However, for various reasons, I want to change my CRC methodology as follows. Suppose my top-level data set is a collection of some raw data and a few subordinate data sets. My current scheme essentially concatenates the raw data and all the subordinate data sets and then CRC's the result. However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction. The question is, how does using this methodology affect the probability of collisions?
To put it in a language what will allow me to discuss my thoughts, I'll define a bit of notation. Call my top-level data set T, and suppose it consists of raw data set R and subordinate data sets Si, i=1..n. I can write this as T = (R, S1, S2, ..., Sn). If & represents concatenation of data sets, my original scheme can be thought of as:
UID_1(T) = CRC(R & S1 & S2 & ... & Sn)
and my new scheme can be thought of as
UID_2(T) = CRC(R & CRC(S1) & CRC(S2) & ... & CRC(Sn))
Then my question is: if T and T' are very different, what CRC algorithm minimizes prob( UID_1(T)=UID_1(T') ), what CRC algorithm minimizes prob( UID_2(T)=UID_2(T') ), and how do these two probabilities compare?
My (naive and uninformed) thoughts on the matter are this. Suppose the differences between T and T' are in only one subordinate data set, WLOG say S1!=S1'. If it happens that CRC(S1)=CRC(S1'), then clearly we will have UID_2(T)=UID_2(T'). On the other hand, if CRC(S1)!=CRC(S1'), then the difference between R & CRC(S1) & CRC(S2) & ... & CRC(Sn) and R & CRC(S1') & CRC(S2) & ... & CRC(Sn) is a small difference on 4 bytes only, so the ability of UID_2 to detect differences is effectively the same as a CRC's ability to detect transmission errors, i.e. its ability to detect errors in only a few bits that are not widely separated. Since this is what CRC's are designed to do, I would think that UID_2 is pretty safe, so long as the CRC I am using is good at detecting transmission errors. To put it in terms of our notation,
prob( UID_2(T)=UID_2(T') ) = prob(CRC(S1)=CRC(S1')) + (1-prob(CRC(S1)=CRC(S1'))) * prob(CRC fails to detect an error of a few bits)
Let's call the probability of the CRC not detecting an error of a few bits P, and the probability of it not detecting large differences on a large data set Q. The above can be written approximately as
prob( UID_2(T)=UID_2(T') ) ~ Q + (1-Q)*P
Now I will change my UID a bit more, as follows. For a "fundamental" piece of data, i.e. a data set T=(R) where R is just a double, integer, char, bool, etc., define UID_3(T)=R. Then for a data set T consisting of a vector of subordinate data sets, T = (S1, S2, ..., Sn), define
UID_3(T) = CRC(UID_3(S1) & UID_3(S2) & ... & UID_3(Sn))
Suppose a particular data set T has subordinate data sets nested m-levels deep, then, in some vague sense, I would think that
prob( UID_3(T)=UID_3(T') ) ~ 1 - (1-Q)(1-P)^m
Given these probabilities are small in any case, this can be approximated as
1 - (1-Q)(1-P)^m = Q + (1-Q)*m*P - (1-Q)*P*P*m*(m-1)/2 + ... ~ Q + m*P
So if I know my maximum nesting level m, and I know P and Q for various CRCs, what I want is to pick the CRC that gives me the minimum value of Q + m*P. If, as I suspect might be the case, P~Q, this simplifies further: my probability of error for UID_1 is P, and my probability of error for UID_3 is (m+1)*P, where m is my maximum nesting (recursion) level.
Does all this seem reasonable?
I want a mechanism to allow me to avoid re-transmitting data sets.
rsync has already solved this problem, using generally the approach you outline.
However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
You won't see much difference among well-selected CRC polynomials. Speed may be more important to you, in which case you may want to use a hardware CRC, e.g. the crc32 instruction on modern Intel processors. That one uses the CRC-32C (Castagnoli) polynomial. You can make that really fast by using all three arithmetic units on a single core in parallel by computing the CRC on three buffers in the same loop, and then combining them. See below how to combine CRCs.
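As a rough single-stream illustration of that instruction (my sketch, not Mark's code; compile with -msse4.2 - the three-buffer trick he describes would be layered on top of this):
#include <nmmintrin.h>  // SSE4.2: _mm_crc32_u8 / _mm_crc32_u64
#include <cstddef>
#include <cstdint>
#include <cstring>

uint32_t crc32c(const uint8_t *buf, size_t len) {
    uint32_t crc = ~0u;               // conventional initial value
    while (len >= 8) {
        uint64_t chunk;
        memcpy(&chunk, buf, 8);       // safe unaligned load
        crc = static_cast<uint32_t>(_mm_crc32_u64(crc, chunk));
        buf += 8;
        len -= 8;
    }
    while (len--) crc = _mm_crc32_u8(crc, *buf++);
    return ~crc;                      // conventional final XOR
}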
However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction.
Or you could quickly compute the CRC of the entire set as if you had done a CRC on the whole thing, but using the already calculated CRCs of the pieces. Look at crc32_combine() in zlib. That would be better than taking the CRC of a bunch of CRCs. By combining, you retain all the mathematical goodness of the CRC algorithm.
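A hedged sketch of that suggestion in the question's own notation (my code; it assumes the byte lengths of the pieces are known alongside their CRCs):
#include <zlib.h>
#include <cstddef>

// UID_1(T) = CRC(R & S1 & ... & Sn), computed from CRC(R) and the
// already-known CRC(Si) without touching the Si bytes again.
uLong combined_uid(uLong crcR,
                   const uLong *crcS, const size_t *lenS, size_t n) {
    uLong crc = crcR;
    for (size_t i = 0; i < n; i++)
        crc = crc32_combine(crc, crcS[i], static_cast<z_off_t>(lenS[i]));
    return crc;
}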
Mark Adler's answer was bang on. If I'd taken my programmer's hat off and put on my mathematician's hat, some of it should have been obvious. He didn't have the time to explain the mathematics, so I will here, for those who are interested.
The process of calculating a CRC is essentially the process of doing a polynomial division. The polynomials have coefficients mod 2, i.e. the coefficient of each term is either 0 or 1, hence a polynomial of degree N can be represented by an N-bit number, each bit being the coefficient of a term (and the process of doing a polynomial division amounts to doing a whole bunch of XOR and shift operations). When CRC'ing a data block, we view the "data" as one big polynomial, i.e. a long string of bits, each bit representing the coefficient of a term in the polynomial. We'll call our data-block polynomial A.
For each CRC "version", a polynomial has been chosen for the CRC, which we'll call P. For 32-bit CRCs, P is a polynomial of degree 32, so it has 33 terms and 33 coefficients. Because the top coefficient is always 1, it is implicit and we can represent the 32nd-degree polynomial with a 32-bit integer. (Computationally, this is quite convenient actually.) The process of calculating the CRC for a data block A is the process of finding the remainder when A is divided by P. That is, A can always be written
A = Q * P + R
where R is a polynomial of degree less than the degree of P, i.e. R has degree 31 or less, so it can be represented by a 32-bit integer. R is essentially the CRC. (Small note: typically one prepends 0xFFFFFFFF to A, but that is unimportant here.)
Now, if we concatenate two data blocks A and B, the "polynomial" corresponding to the concatenation of the two blocks is the polynomial for A, "shifted to the left" by the number of bits in B, plus B. Put another way, the polynomial for A&B is A*S+B, where S is the polynomial corresponding to a 1 followed by N zeros, where N is the number of bits in B (i.e. S = x**N). Then, what can we say about the CRC for A&B? Suppose we know A = Q*P + R and B = Q'*P + R', i.e. R is the CRC for A and R' is the CRC for B. Suppose we also know S = q*P + r. Then
A * S + B = (Q*P+R)*(q*P+r) + (Q'*P+R')
= Q*(q*P+r)*P + R*q*P + R*r + Q'*P + R'
= (Q*S + R*q + Q') * P + R*r + R'
So to find the remainder when A*S+B is divided by P, we need only find the remainder when R*r+R' is divided by P. Thus, to calculate the CRC of the concatenation of two data streams A and B, we need only know the separate CRCs of the data streams, i.e. R and R', and the length N of the trailing data stream B (so we can compute r). This is also the content of one of Mark's other comments: if the lengths of the trailing data streams B are constrained to a few values, we can pre-compute r for each of these lengths, making combination of two CRCs quite trivial. (For an arbitrary length N, computing r is not trivial, but it is much faster (O(log_2 N)) than re-doing the division over the entire B.)
Note: the above is not a precise exposition of CRC; there is some shifting that goes on. To be precise, if L is the polynomial represented by 0xFFFFFFFF, i.e. L = x**31 + x**30 + ... + x + 1, and S_n is the "shift left by n bits" polynomial, i.e. S_n = x**n, then the CRC of a data block with polynomial A of N bits is the remainder when (L*S_N + A) * S_32 is divided by P, i.e. when (L&A)*S_32 is divided by P, where & is the "concatenation" operator.
Also, I think I disagree with one of Mark's comments, but he can correct me if I'm wrong. If we already know R and R', comparing the time to compute the CRC of A&B using the above methodology with computing it the straightforward way does not depend on the ratio of len(A) to len(B) - to compute it the "straightforward" way, one really does not have to re-compute the CRC on the entire concatenated data set. Using our notation above, one only needs to compute the CRC of R*S+B. That is, instead of prepending 0xFFFFFFFF to B and computing its CRC, we prepend R to B and compute its CRC. So it's a comparison of the time to compute B's CRC over again with the time to compute r (followed by dividing R*r+R' by P, which is trivial and likely inconsequential in time).
Mark Adler's answer addresses the technical question, so that's not what I'll do here. Instead, I'm going to point out a major potential flaw in the synchronization algorithm proposed in the OP's question and suggest a small improvement.
Checksums and hashes provide a single signature value for some data. However, being of finite length, the number of possible unique checksum/hash values is always smaller than the number of possible combinations of the raw data, if the data is longer. For instance, a 4-byte CRC can only ever take on 4,294,967,296 unique values, whilst even a 5-byte value, which might be the data, can take on 256 times as many values. This means that for any data longer than the checksum itself, there always exists one or more byte combinations with exactly the same signature.
When used to check integrity, the assumption is that the likelihood of a slightly different stream of data resulting in the same signature is small, so that we can assume the data is the same if the signature is the same. It is important to note that we start with some data d and verify that, given a checksum c calculated using a checksum function f, f(d) == c.
In the OP's algorithm, however, the different use introduces a subtle, detrimental degradation of confidence. In the OP's algorithm, server A would start with the raw data [d1A,d2A,d3A,d4A] and generate a set of checksums [c1,c2,c3,c4] (where dnA is the n-th data item on server A). Server B would then receive this list of checksums and check its own list of checksums to determine if any are missing. Say Server B has the list [c1,c2,c3,c5]. What should then happen is that it requests d4 from Server A and the synchronization has worked properly in the ideal case.
If we recall the possibility of collisions, and that it doesn't always take that much data to produce one (e.g. CRC("plumless") == CRC("buckeroo")), then we'll quickly realize that the best guarantee our scheme provides is that server B definitely doesn't have d4A, but it cannot guarantee that it has [d1A,d2A,d3A]. This is because it is possible that f(d1A) = c1 and f(d1B) = c1 even though d1A and d1B are distinct, and we would like both servers to have both. In this scheme, neither server can ever know about the existence of both d1A and d1B. We can use more and more collision-resistant checksums and hashes, but this scheme can never guarantee complete synchronization. This becomes more important the greater the number of files the network must keep track of. I would recommend using a cryptographic hash like SHA-256 (collisions have since been demonstrated for SHA-1).
A possible mitigation of this risk is to introduce redundant hashes. One way of doing this is to use a completely different algorithm: whilst it is possible that crc32(d1) == crc32(d2), it is less likely that adler32(d1) == adler32(d2) holds simultaneously. This paper suggests you don't gain all that much this way, though. To use the OP's notation, it is also less likely that crc32('a' & d1) == crc32('a' & d2) and crc32('b' & d1) == crc32('b' & d2) are simultaneously true, so you can "salt" your way to less collision-prone combinations. However, I think you may just as well use a collision-resistant hash function like SHA-512, which in practice likely won't have that great an impact on your performance.

how to efficiently access 3^20 vectors in a 2^30 bits of memory

I want to store a 20-dimensional array where each coordinate can have 3 values, in a minimal amount of memory (2^30, or 1 gigabyte). It is not a sparse array; I really need every value.
Furthermore, I want the values to be integers of arbitrary but fixed precision, say 256 bits or 8 words.
Example:
set_big_array(1,0,0,0,1,2,2,0,0,2,1,1,2,0,0,0,1,1,1,2, some_256_bit_value);
and
get_big_array(1,0,0,0,1,2,2,0,0,2,1,1,2,0,0,0,1,1,1,2, &some_256_bit_value);
Because 3 is relatively prime to 2, it's difficult to implement this using efficient bitwise shift, AND, and OR operators. I want this to be as fast as possible. Any thoughts?
Seems tricky to me without some compression:
3^20 = 3486784401 values to store
256bits / 8bitsPerByte = 32 bytes per value
3486784401 * 32 = 111577100832 size for values in bytes
111577100832 / (1024^3) = ~104 GB
You're trying to fit 104 GB in 1 GB. There'd need to be some pattern to the data that could be used to compress it.
Sorry, I know this isn't much help, but maybe you can rethink your strategy.
There are 3.48e9 variants of 20-tuple of indexes that are 0,1,2. If you wish to store a 256 bit value at each index, that means you're talking about 8.92e11 bits - about a terabit, or about 100GB.
I'm not sure what you're trying to do, but that sounds computationally expensive. It may be reasonably feasible as a memory-mapped file, and may be reasonably fast as a memory-mapped file on an SSD.
What are you trying to do?
So, a practical solution would be to use a 64-bit OS and a large memory-mapped file (preferably on an SSD) and simply compute the address for a given element in the typical way for arrays, i.e. as sum-over-i(index_i * 3^i) * 32 bytes, in pseudo-math. Or, use a very, very expensive machine with that much memory, or another algorithm that doesn't require this array in the first place.
A few notes on platforms: Windows 7 supports just 192GB of memory, so using physical memory for a structure like this is possible but really pushing it (more expensive editions support more). If you can find a machine at all that is. According to microsoft's page on the matter the user-mode virtual address space is 7-8TB, so mmap/virtual memory should be doable. Alex Ionescu explains why there's such a low limit on virtual memory despite an apparently 64-bit architecture. Wikipedia puts linux's addressable limits at 128TB, though probably that's before the kernel/usermode split.
Assuming you want to address such a multidimensional array, you must process each index at least once: that means any algorithm will be O(N), where N is the number of indexes. As mentioned before, you don't need to convert to base-2 addressing or anything else; the only thing that matters is that you can compute the integer offset, and which base the math happens in is irrelevant. You should use the most compact representation possible and ignore the fact that each dimension is not a multiple of 2.
So, for a 16-dimensional array, that address computation function could be:
int offset = 0;
for (int ii = 0; ii < 16; ii++)
    offset = offset * 3 + indexes[ii];
return &the_array[offset];
As previously said, this is just the common array indexing formula, nothing special about it. Note that even for "just" 16 dimensions, if each item is 32 bytes, you're dealing with a little more than a gigabyte of data.
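To connect this back to the question's accessors, a hedged sketch for the full 20 dimensions (my code: the indices are passed as an array rather than 20 separate arguments, the 256-bit value as four 64-bit words, and big_array is assumed to be allocated elsewhere):
#include <cstdint>
#include <cstring>

static uint64_t *big_array;  // 3^20 * 4 words, allocated elsewhere

static size_t offset20(const int idx[20]) {
    size_t off = 0;
    for (int i = 0; i < 20; i++)
        off = off * 3 + idx[i];  // mixed-radix (base-3) positional code
    return off * 4;              // 4 x 64-bit words per 256-bit value
}

void set_big_array(const int idx[20], const uint64_t value[4]) {
    memcpy(big_array + offset20(idx), value, 32);
}

void get_big_array(const int idx[20], uint64_t value[4]) {
    memcpy(value, big_array + offset20(idx), 32);
}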
Maybe I'm misunderstanding your question, but can't you just use a normal array?
INT256 bigArray[3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3];
OR
INT256 ********************bigArray = malloc(3^20 * 8);
bigArray[1][0][0][1][2][0][1][1][0][0][0][0][1][1][2][1][1][1][1][1] = some_256_bit_value;
etc.
Edit:
This will not work, because you would need 3^20 * 32 bytes (INT256 is 32 bytes), i.e. roughly 104 GB. The malloc variant is also wrong as written.
I'll start by doing a direct calculation of the address, then see if I can optimize it:
address = 0;
for (i = 19; i >= 0; i--)
{
    address = 3*address + array[i];
}
address = address * number_of_bytes_needed_for_array_value;
2^30 bits is 2^27 bytes so not actually a gigabyte, it's an eighth of a gigabyte.
It appears impossible to do because of the mathematics, although of course you can create the data at a bigger size and then compress it, which may get you down to the required size, though it cannot be guaranteed. (It must fail some of the time, as the compression is lossless.)
If you do not require immediate "random" access, your solution may be a variable-sized two-bit word, so your most commonly stored value takes only 1 bit and the other two take 2 bits.
If 0 is your most common value then:
0 = 0
10 = 1
11 = 2
or something like that.
In that case you will be able to store your bits in sequence this way.
It could take up to 2^40 bits this way but probably will not.
You could pre-run through your data and see which is the commonly occurring value and use that to indicate your single-bit word.
You can also compress your data after you have serialized it in up to 2^40 bits.
My assumption here is that you will be using disk possibly with memory mapping as you are unlikely to have that much memory available.
My assumption is that space is everything and not time.
You might want to take a look at something like STXXL, an implementation of the STL designed for handling very large volumes of data
You can actually use a pointer-to-array20 to have your compiler implement the index calculations for you:
/* Note: there are 19 of the [3]'s below */
my_256bit_type (*foo)[3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3][3];
foo = allocate_giant_array();
foo[0][1][1][0][2][1][2][2][0][2][1][0][2][1][0][0][2][1][0][0] = some_256bit_value;
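For completeness, a hedged sketch of what allocate_giant_array() might look like as the memory-mapped file suggested in an earlier answer (my code, POSIX only; error handling omitted and the file name is made up). The returned pointer is cast to the pointer-to-array type above at the call site:
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

void *allocate_giant_array(void)
{
    // 3^20 elements * 32 bytes each, roughly 104 GB of backing store
    const off_t size = 3486784401LL * 32;
    int fd = open("big_array.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, size);  // extend the file to the full array size
    return mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}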