Parallel binomial coefficients using SIMD instructions

Parallel binomial coefficients using SIMD instructions - c++

Background
I've recently been taking some old code (~1998) and re-writing some of it to improve performance. Previously in the basic data structures for a state I stored elements in several arrays, and now I'm using raw bits (for the cases that requires less than 64 bits). That is, before I had an array of b elements and now I have b bits set in a single 64-bit integer that indicate whether that value is part of my state.
Using intrinsics like _pext_u64 and _pdep_u64 I've managed to get all operations 5-10x faster. I am working on the last operation, which has to do with computing a perfect hash function.
The exact details of the hash function aren't too important, but it boils down to computing binomial coefficients (n choose k - n!/((n-k)!k!) for various n and k. My current code uses a large lookup table for this, which is probably hard to speed up significantly on its own (except for possible cache misses in the table which I haven't measured).
But, I was thinking that with SIMD instructions I might be able to directly compute these for several states in parallel, and thus see an overall performance boost.
Some constraints:
There are always exactly b bits set in each 64-bit state (representing small numbers).
The k value in the binomial coefficients is related to b and changes uniformly in the calculation. These values are small (most of the time <= 5).
The final hash will be < 15 million (easily fits in 32 bits).
So, I can fairly easily write out the math for doing this in parallel and for keeping all operations as integer multiple/divide without remainders while keeping within 32 bits. The overall flow is:
Extract the bits into values suitable for SIMD instructions.
Perform the n choose k computation in a way to avoid overflow.
Extract out the final hash value from each entry
But, I haven't written SIMD code before, so I'm still getting up to speed on all the functions available and their caveats/efficiencies.
Example:
Previously I would have had my data in an array, supposing there are always 5 elements:
[3 7 19 31 38]
Now I'm using a single 64-bit value for this:
0x880080088
This makes many other operations very efficient. For the perfect hash I need to compute something like this efficiently (using c for choose):
(50c5)-(38c5) + (37c4)-(31c4) + (30c3)-(19c3) + ...
But, in practice I have a bunch of these to compute, just with slightly different values:
(50c5)-(Xc5) + ((X-1)c4)-(Yc4) + ((Y-1)c3)-(Zc3) + ...
All the X/Y/Z... will be different but the form of the calculation is identical for each.
Questions:
Is my intuition on gaining efficiency by converting to SIMD operations reasonable? (Some sources suggest "no", but that's the problem of computing a single coefficient, not doing several in parallel.)
Is there something more efficient than repeated _tzcnt_u64 calls for extracting bits into the data structures for SIMD operations? (For instance, I could temporarily break my 64-bit state representation into 32-bit chunks if it would help, but then I wouldn't be guaranteed to have the same number of bits set in each element.)
What are the best intrinsics for computing several sequential multiply/divide operations for the binomial coefficients when I know there won't be overflow. (When I look through the Intel references I have trouble interpreting the naming quickly when going through all the variants - it isn't clear that what I want is available.)
If directly computing the coefficients is unlikely to be efficient, can SIMD instructions be used for parallel lookups into my previous lookup table of coefficients?
(I apologize for putting several questions together, but given the specific context, I thought it would be better to put them together as one.)

Here is one possible solution that does the computation from a lookup table using one state at a time. It's probably going to be more efficient to do this in parallel over several states instead of using a single state. Note: This is hard-coded for the fixed case of getting combinations of 6 elements.
int64_t GetPerfectHash2(State &s)
{
// 6 values will be used
__m256i offsetsm1 = _mm256_setr_epi32(6*boardSize-1,5*boardSize-1,
4*boardSize-1,3*boardSize-1,
2*boardSize-1,1*boardSize-1,0,0);
__m256i offsetsm2 = _mm256_setr_epi32(6*boardSize-2,5*boardSize-2,
4*boardSize-2,3*boardSize-2,
2*boardSize-2,1*boardSize-2,0,0);
int32_t index[9];
uint64_t value = _pext_u64(s.index2, ~s.index1);
index[0] = boardSize-numItemsSet+1;
for (int x = 1; x < 7; x++)
{
index[x] = boardSize-numItemsSet-_tzcnt_u64(value);
value = _blsr_u64(value);
}
index[8] = index[7] = 0;
// Load values and get index in table
__m256i firstLookup = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[0]), offsetsm2);
__m256i secondLookup = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)&index[1]), offsetsm1);
// Lookup in table
__m256i values1 = _mm256_i32gather_epi32(combinations, firstLookup, 4);
__m256i values2 = _mm256_i32gather_epi32(combinations, secondLookup, 4);
// Subtract the terms
__m256i finalValues = _mm256_sub_epi32(values1, values2);
_mm256_storeu_si256((__m256i*)index, finalValues);
// Extract out final sum
int64_t result = 0;
for (int x = 0; x < 6; x++)
{
result += index[x];
}
return result;
}
Note that I actually have two similar cases. In the first case I don't need the _pext_u64 and this code is ~3x slower than my existing code. In the second case I need it, and it is 25% faster.

Related

Count number of matching bytes between two _m128i SIMD vectors

I'm developing a bioinformatics tool and I'm trying to use SIMD to boost its speed.
Given two char arrays of length 16, I need to rapidly count the number of indices at which the strings match. For example, the two following strings, "TTTTTTTTTTTTTTTT" and "AAAAGGGGTTTTCCCC", match from 9th through 12th positions ("TTTT"), and therefore the output should be 4.
As shown in the following function foo (which works fine but slow), I packed each characters in seq1 and seq2 into __m128i variables s1 and s2, and used _mm_cmpeq_epi8 to compare every position simultaneously. Then, using popcnt128 (from Fast counting the number of set bits in __m128i register by Marat Dukhan) to add up the number of matching bits.
float foo(char* seq1, char* seq2) {
__m128i s1, s2, ceq;
int match;
s1 = _mm_load_si128((__m128i*)(seq1));
s2 = _mm_load_si128((__m128i*)(seq2));
ceq = _mm_cmpeq_epi8(s1, s2);
match = (popcnt128(ceq)/8);
return match;
}
Although popcnt128 by Marat Dukhan is a lot faster than naïvely adding up every bit in __m128i, __popcnt128() is the slowest bottleneck in the function, taking up about 80% of the computational speed. So, I would like to come up with an alternative to popcnt128.
I tried to interpret __m128i ceq as a string and to use it as a key for a precomputed look-up table that maps a string to the total number of bits. If char array were hashable, I could do something like
union{__m128i ceq; char c_arr[16];}
match = table[c_arr] // table = unordered map
If I try to do something similar for strings (i.e. union{__m128i ceq; string s;};), I get the following error message "::()’ is implicitly deleted because the default definition would be ill-formed". When I tried other things, I ran into segmentation faults.
Is there any way I can tell the compiler to read __m128i as string so I can directly use __m128i as a key for unordered_map? I don't see why it shouldn't work because string is a contiguous array of chars, which can be naturally represented by __m128i. But I couldn't get it to work and unable to find any solution online.

You're probably doing this for longer sequences, multiple SIMD vectors of data. In that case, you can accumulate counts in a vector that you only sum up at the end. It's a lot less efficient to popcount every vector separately.
See How to count character occurrences using SIMD - instead of _mm256_set1_epi8(c); to search for a specific character, load from the other string. Do everything else the same, including
counts = _mm_sub_epi8(counts, _mm_cmpeq_epi8(s1, s2));
in the inner loop, and the loop unrolling. (A compare result is an integer 0 / -1, so subtracting it adds 0 or 1 to another vector.) This is at risk of overflow after 256 iterations, so do at most 255. That linked question uses AVX2, but the __m128i versions of those intrinsics only require SSE2. (Of course, AVX2 would let you get twice as much work done per vector instruction.)
Horizontal sum the byte counters in the outer loop, using _mm_sad_epu8(v, _mm_setzero_si128()); and then accumulating into another vector of counts. Again, this is all in the code in the linked Q&A, so just copy/paste that and add a load from the other string into the inner loop, instead of using a broadcast constant.
Can counting byte matches between two strings be optimized using SIMD? shows basically the same thing for 128-bit vectors, including a version at the bottom that only does SAD hsums after an inner loop. It's written for two input pointers already, rather than char and string.
For a single vector:
You don't need to count all the bits in your __m128i; take advantage of the fact that all 8 bits in each byte are the same by extracting 1 bit per element to a scalar integer. (x86 SIMD can do that efficiently, unlike some other SIMD ISAs)
count = __builtin_popcnt(_mm_movemask_epi8(cmp_result));
Another possible option is psadbw against 0 (hsum of bytes on the compare result), but that needs a final hsum step of qword halves, so that's going to be worse than HW popcnt. But if you can't compile with -mpopcnt then it's worth considering if you need baseline x86-64 with just SSE2. (Also you need to negate before psadbw, or scale the sum down by 1/255...)
(Note that the psadbw strategy is basically what I described in the first section of the answer, but for only a single vector, not taking advantage of the ability to cheaply add multiple counts into one vector accumulator.)
If you really need the result as a float, then that makes a psadbw strategy less bad: you can keep the value in SIMD vectors the whole time, using _mm_cvtepi32_ps to do packed conversion on the horizontal sum result (even cheaper than cvtsi2ss int->float scalar conversion). _mm_cvtps_f32 is free; a scalar float is just the low element of an XMM register.
But seriously, do you really need an integer count as a float now? Can't you at least wait until you have the sum across all vectors, or keep it integer?
-mpopcnt is implied by gcc -msse4.2, or -march=native on anything less than 10 years old. Core 2 lacked hardware popcnt, but Nehalem had it for Intel.

Why is vectorization not beneficial in this for loop?

I am trying to vectorize this for loop. After using the Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not benificial" mean the array indexing is not proper?

It's hard to answer without more details about your types. But in general, starting a loop incurs some costs and vectorising also implies some costs (such as moving data to/from SIMD registers, ensuring proper alignment of data)
I'm guessing here that the compiler tells you that the vectorisation cost here is bigger than simply running the 8 iterations without it, so it's not doing it.
Try to increase the number of iterations, or help the compiler for computing alignement for example.
Typically, unless the type of array's item are exactly of the proper alignment for SIMD vector, accessing an array from a "unknown" offset (what you've called someOuterVariable) prevents the compiler to write an efficient vectorisation code.
EDIT: About the "interleaving" question, it's hard to guess without knowning your tool. But in general, interleaving usually means mixing 2 streams of computations so that the compute units of the CPU are all busy. For example, if you have 2 ALU in your CPU, and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computation so that both the addition and multiplication happens at the same time (provided you have 2 ALU available). Typically, this means that the multiplication which is a bit longer to compute (for example 6 cycles) will be started before the addition (for example 3 cycles). You'll then get the result of both operation after only 6 cycles instead of 9 if the compiler serialized the computations. This is only possible if there is no dependencies between the computation (if d required c, it can not work). A compiler is very cautious about this, and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.

Storing multiple int values into one variable - C++

I am doing an algorithmic contest, and I'm trying to optimize my code. Maybe what I want to do is stupid and impossible but I was wondering.
I have these requirements:
An inventory which can contains 4 distinct types of item. This inventory can't contain more than 10 items (all type included). Example of valid inventory: 1 / 1 / 1 / 0. Example of invalid inventories: 11 / 0 / 0 / 0 or 5 / 5 / 5 / 0
I have some recipe which consumes or adds items into my inventory. The recipe can't add or consume more than 10 items since the inventory can't have more than 10 items. Example of valid recipe: -1 / -2 / 3 /
0. Example of invalid recipe: -6 / -6 / +12 / 0
For now, I store the inventory and the recipe into 4 integers. Then I am able to perform some operations like:
ApplyRecepe: Inventory(1/1/1/0).Apply(Recepe(-1/1/0/0)) = Inventory(0/2/1/0)
CanAfford: Iventory(1/1/0/0).CanAfford(Recepe(-2/1/0/0)) = False
I would like to know if it is possible (and if yes, how) to store the 4 values of an inventory/recipe into one single integer and to performs previous operations on it that would be faster than comparing / adding the 4 integers as I'm doing now.
I thought of something like having the inventory like that:
int32: XXXX (number of items of the first type) - YYYY (number of items of the second type) - ZZZ (number of items of the third type) - WWW (number of item of the fourth type)
But I have two problems with that:
I don't know how to handle the possible negative values
It seems to me much slower than just adding the 4 integers since I have to bit shift the inventory and the recipe to get the value I want and then proceed with the addition.

Storing multiple int values into one variable
Here are two alternatives:
An array. The advantage of this is that you may iterate over the elements:
int variable[] {
1,
1,
1,
0,
};
Or a class. The advantage of this is the ability to name the members:
struct {
int X;
int Y;
int Z;
int W;
} variable {
1,
1,
1,
0,
};
Then I am able to perform some operations like:
Those look like SIMD vector operations (Single Instruction Multiple Data). The array is the way to go in this case. Since the number of operands appears to be constant and small in your description, an efficient way to perform them are vector operations on the CPU 1.
There is no standard way to use SIMD operations directly in C++. To give the compiler optimal opportunity to use them, these steps need to be followed:
Make sure that the CPU you use supports the operations that you need. AVX-2 instruction set and its expansions have wide support for integer vector operations.
Make sure that you tell the compiler that the program should be optimised for that architecture.
Make sure to tell the compiler to perform vectorisation optimisations.
Make sure that the integers are sufficiently aligned as required by the operations. This can be achieved with alignas.
Make sure that the number of integers is known at compile time.
If the prospect of relying on the optimiser worries you, then you may instead prefer to use vector extensions that may be provided by your compiler. The use of language extensions would come at the cost of portability to other compilers naturally. Here is an example with GCC:
constexpr int count = 4;
using v4si = int __attribute__ ((vector_size (sizeof(int) * count)));
#include <iostream>
int main()
{
v4si inventory { 1, 1, 1, 0};
v4si recepe {-1, 1, 0, 0};
v4si applied = inventory + recepe;
for (int i = 0; i < count; i++) {
std::cout << applied[i] << ", ";
}
}
1 If the number of operands were large, then specialised vector processor such as a GPU could be faster.

Especially if you're learning, it's not a bad opportunity to try implementing your own helper class for vectorization, and consequently deepen your understanding about data in C++, even if your use case might not warrant the technique.
The insight you want to exploit is that arithmetic operations seem invariant to bitshifts, if one considers the pesky carry-bit and effects of signage (e.g. two's complement). But precisely because of these latter factors, it's much better to use some standardized underlying type like an int8_t[], as #Botje suggests.
To begin, implement the following functions. (My C++ is rusty, consider this pseudocode.)
int8_t* add(int8_t[], int8_t[], size_t);
int8_t* multiply(int8_t[], int8_t[], size_t);
int8_t* zeroes(size_t); // additive identity
int8_t* ones(size_t); // multiplicative identity
Also considering:
How would you like to handle overflows and underflows? Let them be and ask the developer to be cautious? Or throw exceptions?
Maybe you'd like to pin down the size of the array and avoid having to deal with a dynamic size_t?
Maybe you'd like to go as far as overloading operators?
The end result of an exercise like this, but generalized and polished, is something like Armadillo. But you'll understand it on a whole different level by doing the exercise yourself first. Also, if all this makes sense so far, you can now take a look at How to vectorize my loop with g++?—even the compiler can vectorize for you in certain cases.
Bitpacking as #Botje mentions is another step beyond this. You won't even have the safety and convenience of an integer type like int8_t or int4_t. Which additionally means the code you write might stop being platform-independent. I recommend at least finishing the vectorization exercise before delving into this.

This will be something of a non-answer, just intended to show what you're up against if you do bitpacking.
Suppose, for simplicity's sake, that recipes can only remove from inventory, and only contain positive values (you could represent negative numbers using two's complement, but it would take more bits, and add much complexity to working with the bit-packed numbers).
You then have 11 possible values for an item, so you need 4 bits for each item. Four items can then be represented in one uint16.
So, say you have an inventory with 10,4,6,9 items; this would be uint16_t inv = 0b1010'0100'0110'1001.
Then, a recipe with 2,2,2,2 items or uint16_t rec = 0b0010'0010'0010'0010.
inv - rec would give 0b1000'0010'0100'0111 for 8,2,4,7 items.
So far, so good. No need here to shift and mask to get at the individual values before doing the calculation. Yay.
Now, a recipe with 6,6,6,6 items which would be 0b0110'0110'0110'0110, giving inv - rec = 0b0011'1110'0000'0011 for 3,14,0,3 items.
Oops.
The arithmetic will work, but only if you check beforehand that the individual 4-bit results don't go out of bounds; in this example this would mean that you know beforehand that there are enough items in the inventory to fill a recipe.
You could get at, say, the third item in the inventory by doing: (inv >> 4) & 0b1111 or (inv << 8) >> 12 for doing your checks.
For testing, you would then get expressions like:
if ((inv >> 4) & 0b1111 >= (rec >> 4) & 0b1111)
or, comparing the 4 bits "in place":
if (inv & 0b0000000011110000 >= rec & 0b0000000011110000)
for each 4-bit part.
All these things are doable, but do you want to? It probably won't be faster than what is suggested in the other answers after the compiler has done its job, and it certainly won't be more readable.
It becomes even more horrible when you allow negative numbers (two's complement or otherwise) in recipes, especially if you want to bit-shift them.
So, bitpacking is nice for storage, and in some rare cases you can even do math without unpacking the bits, but I wouldn't try to go there (unless you are very performance and memory constrained).
Having said that, it could be fun to try to get it to work; there's always that.

SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

A piece of C++ code determines the occurances of zero and keeps a binary flag variable for each number that is checked. The value of the flag toggles between 0 and 1 each time a zero is encountered in a 1 dimensional array.
I am attempting to use SSE to speed it up, but I am unsure of how to go about this. Evaluating the individual fields of __m128i is inefficient, I've read.
The code in C++ is:
int flag = 0;
int var_num2[1000];
for(int i = 0; i<1000; i++)
{
if (var[i] == 0)
{
var_num2[i] = flag;
flag = !flag; //toggle value upon encountering a 0
}
}
How should I go about this using SSE intrinsics?

You'd have to recognize the problem, but this is a variation of a well-known problem. I'll first give a theoretical description
Introduce a temporary array not_var[] which contains 1 if var contains 0 and 0 otherwise.
Introduce a temporary array not_var_sum[] which holds the partial sum of not_var.
var_num2 is now the LSB of not_var_sum[]
The first and third operation are trivially parallelizable. Parallelizing a partial sum is only a bit harder.
In a practical implementation, you wouldn't construct not_var[], and you'd write the LSB directly to var_num2 in all iterations of step 2. This is valid because you can discard the higher bits. Keeping just the LSB is equivalent to taking the result modulo 2, and (a+b)%2 == ((a%2) + (b%2))%s.

What type are the elements of var[]? int? Or char? Are zeroes frequent?
A SIMD prefix sum aka partial is possible (with log2(vector_width) work per element, e.g. 2 shuffles and 2 adds for a vector of 4 float), but the conditional-store based on the result is the other major problem. (Your array of 1000 elements is probably too small for multi-threading to be profitable.)
An integer prefix-sum is easier to do efficiently, and the lower latency of integer ops helps. NOT is just adding without carry, i.e. XOR, so use _mm_xor_si128 instead of _mm_add_ps. (You'd be using this on the integer all-zero/all-one compare result vector from _mm_cmpeq_epi32 (or epi8 or whatever, depending on the element size of var[]. You didn't specify, but different choices of strategy are probably optimal for different sizes).
But, just having a SIMD prefix sum actually barely helps: you'd still have to loop through and figure out where to store and where to leave unmodified.
I think your best bet is to generate a list of indices where you need to store, and then
for (size_t j = 0 ; j < scatter_count ; j+=2) {
var_num2[ scatter_element[j+0] ] = 0;
var_num2[ scatter_element[j+1] ] = 1;
}
You could generate the whole list if indices up-front, or you could work in small batches to overlap the search work with the store work.
The prefix-sum part of the problem is handled by alternately storing 0 and 1 in an unrolled loop. The real trick is avoiding branch mispredicts, and generating the indices efficiently.
To generate scatter_element[], you've transformed the problem into left-packing (filtering) an (implicit) array of indices based on the corresponding _mm_cmpeq_epi32( var[i..i+3], _mm_setzero_epi32() ). To generate the indices you're filtering, start with a vector of [0,1,2,3] and add [4,4,4,4] to it (_mm_add_epi32). I'm assuming the element size of var[] is 32 bits. If you have smaller elements, this require unpacking.
BTW, AVX512 has scatter instructions which you could use here, otherwise doing the store part with scalar code is your best bet. (But beware of Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake when just storing without loading.)
To overlap the left-packing with the storing, I think you want to left-pack until you have maybe 64 indices in a buffer. Then leave that loop and run another loop that left-packs indices and consumes indices, only stopping if your circular buffer is full (then just store) or empty (then just left-pack). This lets you overlap the vector compare / lookup-table work with the scatter-store work, but without too much unpredictable branching.
If zeros are very frequent, and var_num2[] elements are 32 or 64 bits, and you have AVX or AVX2 available, you could consider doing an standard prefix sum and using AVX masked stores. e.g. vpmaskmovd. Don't use SSE maskmovdqu, though: it has an NT hint, so it bypasses and evicts data from cache, and is quite slow.
Also, because your prefix sum is mod 2, i.e. boolean, you could use a lookup table based on the packed-compare result mask. Instead of horizontal ops with shuffles, use the 4-bit movmskps result of a compare + a 5th bit for the initial state as an index to a lookup table of 32 vectors (assuming 32-bit element size for var[]).

Using CRCs as a digest to detect duplicates among files

The primary use of CRCs and similar computations (such as Fletcher and Adler) seems to be for the detection of transmission errors. As such, most studies I have seen seem to address the issue of the probability of detecting small-scale differences between two data sets. My needs are slightly different.
What follows is a very approximate description of the problem. Details are much more complicated than this, but the description below illustrates the functionality I am looking for. This little disclaimer is intended to ward of answers such as "Why are you solving your problem this way when you can more easily solve it this other way I propose?" - I need to solve my problem this way for a myriad of reasons that are not germane to this question or post, so please don't post such answers.
I am dealing with collections of data sets (size ~1MB) on a distributed network. Computations are performed on these data sets, and speed/performance is critical. I want a mechanism to allow me to avoid re-transmitting data sets. That is, I need some way to generate a unique identifier (UID) for each data set of a given size. (Then, I transmit data set size and UID from one machine to another, and the receiving machine only needs to request transmission of the data if it does not already have it locally, based on the UID.)
This is similar to the difference between using CRC to check changes to a file, and using a CRC as a digest to detect duplicates among files. I have not seen any discussions of the latter use.
I am not concerned with issues of tampering, i.e. I do not need cryptographic strength hashing.
I am currently using a simple 32-bit CRC of the serialized data, and that has so far served me well. However, I would like to know if anyone can recommend which 32-bit CRC algorithm (i.e. which polynomial?) is best for minimizing the probability of collisions in this situation?
The other question I have is a bit more subtle. In my current implementation, I ignore the structure of my data set, and effectively just CRC the serialized string representing my data. However, for various reasons, I want to change my CRC methodology as follows. Suppose my top-level data set is a collection of some raw data and a few subordinate data sets. My current scheme essentially concatenates the raw data and all the subordinate data sets and then CRC's the result. However, most of the time I already have the CRC's of the subordinate data sets, and I would rather construct my UID of the top-level data set by concatenating the raw data with the CRC's of the subordinate data sets, and then CRC this construction. The question is, how does using this methodology affect the probability of collisions?
To put it in a language what will allow me to discuss my thoughts, I'll define a bit of notation. Call my top-level data set T, and suppose it consists of raw data set R and subordinate data sets Si, i=1..n. I can write this as T = (R, S1, S2, ..., Sn). If & represents concatenation of data sets, my original scheme can be thought of as:
UID_1(T) = CRC(R & S1 & S2 & ... & Sn)
and my new scheme can be thought of as
UID_2(T) = CRC(R & CRC(S1) & CRC(S2) & ... & CRC(Sn))
Then my questions are: (1) if T and T' are very different, what CRC algorithm minimizes prob( UID_1(T)=UID_1(T') ), and what CRC algorithm minimizes prob( UID_2(T)=UID_2(T') ), and how do these two probabilities compare?
My (naive and uninformed) thoughts on the matter are this. Suppose the differences between T and T' are in only one subordinate data set, WLOG say S1!=S1'. If it happens that CRC(S1)=CRC(S1'), then clearly we will have UID_2(T)=UID_2(T'). On the other hand, if CRC(S1)!=CRC(S1'), then the difference between R & CRC(S1) & CRC(S2) & ... & CRC(Sn) and R & CRC(S1') & CRC(S2) & ... & CRC(Sn) is a small difference on 4 bytes only, so the ability of UID_2 to detect differences is effectively the same as a CRC's ability to detect transmission errors, i.e. its ability to detect errors in only a few bits that are not widely separated. Since this is what CRC's are designed to do, I would think that UID_2 is pretty safe, so long as the CRC I am using is good at detecting transmission errors. To put it in terms of our notation,
prob( UID_2(T)=UID_2(T') ) = prob(CRC(S1)=CRC(S1')) + (1-prob(CRC(S1)=CRC(S1'))) * probability of CRC not detecting error a few bits.
Let call the probability of CRC not detecting an error of a few bits P, and the probability of it not detecting large differences on a large size data set Q. The above can be written approximately as
prob( UID_2(T)=UID_2(T') ) ~ Q + (1-Q)*P
Now I will change my UID a bit more as follows. For a "fundamental" piece of data, i.e. a data set T=(R) where R is just a double, integer, char, bool, etc., define UID_3(T)=(R). Then for a data set T consisting of a vector of subordinate data sets T = (S1, S2, ..., Sn), define
UID_3(T) = CRC(ID_3(S1) & ID_3(S2) & ... & ID_3(Sn))
Suppose a particular data set T has subordinate data sets nested m-levels deep, then, in some vague sense, I would think that
prob( UID_3(T)=UID_3(T') ) ~ 1 - (1-Q)(1-P)^m
Given these probabilities are small in any case, this can be approximated as
1 - (1-Q)(1-P)^m = Q + (1-Q)*P*m + (1-Q)*P*P*m*(m-1)/2 + ... ~ Q + m*P
So if I know my maximum nesting level m, and I know P and Q for various CRCs, what I want is to pick the CRC that gives me the minimum value for Q + m*P. If, as I suspect might be the case, P~Q, the above simplifies to this. My probability of error for UID_1 is P. My probability of error for UID_3 is (m+1)P, where m is my maximum nesting (recursion) level.
Does all this seem reasonable?

I want a mechanism to allow me to avoid re-transmitting data sets.
rsync has already solved this problem, using generally the approach you outline.
However, I would like to know if anyone can recommend which 32-bit CRC
algorithm (i.e. which polynomial?) is best for minimizing the
probability of collisions in this situation?
You won't see much difference among well-selected CRC polynomials. Speed may be more important to you, in which case you may want to use a hardware CRC, e.g. the crc32 instruction on modern Intel processors. That one uses the CRC-32C (Castagnoli) polynomial. You can make that really fast by using all three arithmetic units on a single core in parallel by computing the CRC on three buffers in the same loop, and then combining them. See below how to combine CRCs.
However, most of the time I already have the CRC's of the subordinate
data sets, and I would rather construct my UID of the top-level data
set by concatenating the raw data with the CRC's of the subordinate
data sets, and then CRC this construction.
Or you could quickly compute the CRC of the entire set as if you had done a CRC on the whole thing, but using the already calculated CRCs of the pieces. Look at crc32_combine() in zlib. That would be better than taking the CRC of a bunch of CRCs. By combining, you retain all the mathematical goodness of the CRC algorithm.

Mark Adler's answer was bang on. If I'd taken my programmers hat off and put on my mathematicians hat, some of it should have been obvious. He didn't have the time to explain the mathematics, so I will here for those who are interested.
The process of calculating a CRC is essentially the process of doing a polynomial division. The polynomials have coefficients mod 2, i.e. the coefficient of each term is either 0 or 1, hence a polynomial of degree N can be represented by an N-bit number, each bit being the coefficient of a term (and the process of doing a polynomial division amounts to doing a whole bunch of XOR and shift operations). When CRC'ing a data block, we view the "data" as one big polynomial, i.e. a long string of bits, each bit representing the coefficient of a term in the polynomial. Well call our data-block polynomial A. For each CRC "version", there has been chosen the polynomial for the CRC, which we'll call P. For 32-bit CRCs, P is a polynomial with degree 32, so it has 33 terms and 33 coefficients. Because the top coefficient is always 1, it is implicit and we can represent the 32nd-degree polynomial with a 32-bit integer. (Computationally, this is quite convenient actually.) The process of calculating the CRC for a data block A is the process of finding the remainder when A is divided by P. That is, A can always be written
A = Q * P + R
where R is a polynomial of degree less than degree of P, i.e. R has degree 31 or less, so it can be represented by a 32-bit integer. R is essentially the CRC. (Small note: typically one prepends 0xFFFFFFFF to A, but that is unimportant here.) Now, if we concatenate two data blocks A and B, the "polynomial" corresponding to the concatenation of the two blocks is the polynomial for A, "shifted to the left" by the number of bits in B, plus B. Put another way, the polynomial for A&B is A*S+B, where S is the polynomial corresponding to a 1 followed by N zeros, where N is the number of bits in B. (i.e. S = x**N ). Then, what can we say about the CRC for A&B? Suppose we know A=Q*P+R and B=Q'*P+R', i.e. R is the CRC for A and R' is the CRC for B. Suppose we also know S=q*P+r. Then
A * S + B = (Q*P+R)*(q*P+r) + (Q'*P+R')
= Q*(q*P+r)*P + R*q*P + R*r + Q'*P + R'
= (Q*S + R*q + Q') * P + R*r + R'
So to find the remainder when A*S+B is divided by P, we need only find the remainder when R*r+R' is divided by P. Thus, to calculate the CRC of the concatenation of two data streams A and B, we need only know the separate CRC's of the data streams, i.e. R and R', and the length N of the trailing data stream B (so we can compute r). This is also the content of one of Marks other comments: if the lengths of the trailing data streams B are constrained to a few values, we can pre-compute r for each of these lengths, making combination of two CRC's quite trivial. (For an arbitrary length N, computing r is not trivial, but it is much faster (log_2 N) than re-doing the division over the entire B.)
Note: the above is not a precise exposition of CRC. There is some shifting that goes on. To be precise, if L is the polynomial represented by 0xFFFFFFFF, i.e. L=x*31+x*30+...+x+1, and S_n is the "shift left by n bits" polynomial, i.e. S_n = x**n, then the CRC of a data block with polynomial A of N bits, is the remainder when ( L * S_N + A ) * S_32 is divided by P, i.e. when (L&A)*S_32 is divided by P, where & is the "concatenation" operator.
Also, I think I disagree with one of Marks comments, but he can correct me if I'm wrong. If we already know R and R', comparing the time to compute the CRC of A&B using the above methodology, as compared with computing it the straightforward way, does not depend on the ratio of len(A) to len(B) - to compute it the "straight forward" way, one really does not have to re-compute the CRC on the entire concatenated data set. Using our notation above, one only needs to compute the CRC of R*S+B. That is, instead of pre-pending 0xFFFFFFFF to B and computing its CRC, we prepend R to B and compute its CRC. So its a comparison of the time to compute B's CRC over again with the time to compute r, (followed by dividing R*r+R' by P, which is trivial and inconsequential in time likely).

Mark Adler's answer addresses the technical question so that's not what I'll do here. Here I'm going to point out a major potential flaw in the synchronization algorithm proposed in the OP's question and suggest a small improvement.
Checksums and hashes provide a single signature value for some data. However, being of finite length, the number of possible unique values of a checksum/hash is always smaller than the possible combinations of the raw data if the data is longer. For instance, a 4 byte CRC can only ever take on 4 294 967 296 unique values whilst even a 5 byte value which might be the data can take on 8 times as many values. This means for any data longer than the checksum itself, there always exists one or more byte combinations with exactly the same signature.
When used to check integrity, the assumption is that the likelihood of a slightly different stream of data resulting in the same signature is small so that we can assume the data is the same if the signature is the same. It is important to note that we start with some data d and verify that given a checksum, c, calculated using a checksum function, f that f(d) == c.
In the OP's algorithm, however, the different use introduces a subtle, detrimental degradation of confidence. In the OP's algorithm, server A would start with the raw data [d1A,d2A,d3A,d4A] and generate a set of checksums [c1,c2,c3,c4] (where dnA is the n-th data item on server A). Server B would then receive this list of checksums and check its own list of checksums to determine if any are missing. Say Server B has the list [c1,c2,c3,c5]. What should then happen is that it requests d4 from Server A and the synchronization has worked properly in the ideal case.
If we recall the possibilty of collisions, and that it doesn't always take that much data to produce one (e.g. CRC("plumless") == CRC("buckeroo")), then we'll quickly realize that the best guarantee our scheme provides is that server B definitely doesn't have d4A but it cannot guarantee that it has [d1A,d2A,d3A]. This is because it is possible that f(d1A) = c1 and f(d1B) = c1 even though d1A and d1B are distinct and we would like both servers to have both. In this scheme, neither server can ever know about the existence of both d1A and d1B. We can use more and more collision resistant checksums and hashes but this scheme can never guarantee complete synchronization. This becomes more important, the greater the number of files the network must keep track of. I would recommend using a cryptographic hash like SHA1 for which no collisions have been found.
A possible mitigation of the risk of this is to introduce redundant hashes. One way of doing is is to use a completely different algorithm since whilst it is possible crc32(d1) == crc32(d2) it is less likely that adler32(d1) == adler32(d2) simultaneously. This paper suggests you don't gain all that much this way though. To use the OP notation, it is also less likely that crc32('a' & d1) == crc32('a' & d2) and crc32('b' & d1) == crc32('b' & d2) are simultaneously true so you can "salt" to less collision prone combinations. However, I think you may just as well just use a collision resistant hash function like SHA512 which in practice likely won't have that great an impact on your performance.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js