How to vectorize a loop with many conditions? - c++

I have the loop below. The goal is to perform an operation between all elements of an array tmp and accumulate the result into a scalar b. The operation behaves like an addition, so there is no specific execution order: for a + b + c + d we can compute in any order, which means (a+b) + (c+d) is possible as well, and the same applies to this operation. However, there are some special conditions which lead to the result along different paths.
tmp.e and b.e are longs, while tmp.x and b.x are doubles.
Is there any way to compare all the tmp.e values, for example in pairs of 2 for SSE, and perform the correct computation of b.x accordingly? In every case the update can be viewed as a multiply-add: in the first case the multiplier is just 1, in the others 0 or BOUND. Is it possible to vectorize this? If so, how?
Thanks.
void op(vec& tmp, scalar& b)
{
    for (int i = 1; i < n; ++i)
    {
        if (b.e == tmp.e[i])
        {
            b.x += tmp.x[i];
            b.normalize();
            continue;
        }
        else if (b.e > tmp.e[i])
        {
            if (b.e > tmp.e[i] + 1)
            {
                continue;
            }
            b.x += tmp.x[i] * BOUND;
            b.normalize();
        }
        else
        {
            if (tmp.e[i] > b.e + 1)
            {
                b.x = tmp.x[i];
                b.e = tmp.e[i];
                b.normalize();
                continue;
            }
            b.x = b.x * BOUND + tmp.x[i];
            b.e = tmp.e[i];
            b.normalize();
        }
    }
}

Per-element conditionals in SIMD code are usually handled by using a packed-compare instruction to generate a mask of all-zero and all-one elements. You can use this mask to AND or OR other vectors. E.g. you can increment only the elements that pass a test by ANDing the mask with a vector of 1s, giving a vector with 1 in elements that should be incremented and 0 in elements that shouldn't, because 0 is the identity value for addition (x + 0 = x).
You can also compute two results and then blend them together according to a mask (using AND and OR, or using vector blend instructions).
This method of doing SIMD conditionals is like a cmov: you have to compute both sides of the branch, even if all the elements you're processing in a vector take the same side of the branch.
It looks like your data is in struct-of-arrays format already, so you could generate masks from operations on vectors of e values, for use with vectors of x values. If long is 32 bits, you could compare 4 elements at once, then unpack-low and unpack-high to get two masks with 64-bit elements to match your doubles. If the arrays are small (so they'd fit in cache even with .e[] taking as much space as .x[]), making the longs the same width as the doubles means less unpacking.
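For instance, the equal-exponents case on its own could be handled with a compare-and-AND accumulation like this (a minimal sketch, assuming 64-bit longs, the SoA layout above, and SSE4.1 for the 64-bit integer compare; the function name and arguments are mine, not from the question):
#include <smmintrin.h> // SSE4.1: _mm_cmpeq_epi64

// Sum x[i] only where e[i] == target; masked-out lanes add 0, the identity.
double masked_sum(const long long* e, const double* x, int n, long long target)
{
    __m128i tgt = _mm_set1_epi64x(target);
    __m128d acc = _mm_setzero_pd();
    int i = 0;
    for (; i + 2 <= n; i += 2)
    {
        __m128i ev   = _mm_loadu_si128((const __m128i*)&e[i]);
        __m128d xv   = _mm_loadu_pd(&x[i]);
        __m128d mask = _mm_castsi128_pd(_mm_cmpeq_epi64(ev, tgt)); // all-ones or all-zero per lane
        acc = _mm_add_pd(acc, _mm_and_pd(mask, xv));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];
    for (; i < n; ++i)           // scalar tail for odd n
        if (e[i] == target) sum += x[i];
    return sum;
}
This only covers one of the three branches, though, which is part of why the full loop looks unpromising.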
Anyway, it doesn't look promising. Too many conditions, and I have no idea what the whole thing is really trying to accomplish, and what restrictions there might be on the input data. If I knew more about the problem, maybe I could think of a vectorized way to do some of it.
Oh, I think another fatal flaw is that each iteration depends on the previous iteration, because it might modify b. So you can't vectorize to do multiple iterations in parallel, unless you can work out a rule for updating b based on the last vector element.

Related

Efficient way of ensuring newness of a set

Given set N = {1,...,n}, consider P different pre-existing subsets of N. A subset S_p is characterized by the 0-1 n-vector x_p, whose ith element is 1 or 0 depending on whether the ith (of n) items is part of the subset or not. Let us call such x_p's indicator vectors.
For example, if N={1,2,3,4,5}, subset {1,2,5} is represented by the vector (1,1,0,0,1).
Now, suppose we are given the P pre-existing subsets and their associated vectors x_p.
A candidate subset, denoted by vector y, is computed.
What is the most efficient way of checking whether y is already among the P pre-existing subsets, or whether y is indeed a new subset not among the P subsets?
The following are the methods I can think of:
(Method 1) Basically, do an element-by-element check against all pre-existing sets. Pseudocode follows:
for (int p = 0; p < P; p++) {
    // check whether x_p == y, element by element
    bool equal = true;
    for (int i = 0; i < n; i++) {
        if (x_p[i] != y[i]) {
            equal = false;
            break;
        }
    }
    if (equal)
        return that y is pre-existing
}
return that y is new
(Method 2) Another thought that comes to mind is to store the decimal equivalent of the indicator vectors x_ps (where the indicator vectors are taken to be binary representations) and compare it with the decimal equivalent of y. That is, if set of P pre-existing sets is: { (0,1,0,0,1), (1,0,1,1,0) }, the stored decimals for this set would be {9, 22}. If y is (0,1,1,0,0), we compute 12 and check this against the set {9, 22}. The benefit of this method is that for each new y, we don't have to check against the n elements of every pre-existing set. We can just compare the decimal numbers.
Question 1. It appears to me that (Method 2) should be more efficient than (Method 1). For (Method 2), is there an efficient way (an inbuilt library function in C/C++) to convert the x_p's and y from binary vectors to decimal numbers? What should the data type of these indicator variables be? E.g., bool y[5]; or char y[5];?
Question 2. Is there any method more efficient than (Method 2)?
As you've noticed, there's a trivial isomorphism between your indicator vectors and N-bit integers. That means the answer to your question 2 is "no": the tools available for maintaining a set and testing membership in it are the same as for integers (hash tables being the normal approach). A commenter mentioned Bloom filters, which can efficiently test membership at the risk of some false positives, but Bloom filters are generally for much larger data sizes than you're looking at.
As for your question 1: Method 2 is reasonable, and it's even easier than you think. While vector<bool> doesn't give you an easy way to turn it into integer blocks, on implementations I'm aware of it's already implemented this way (the C++ standard allows special treatment of that particular vector type, something that is generally considered nowadays to have been a poor decision, but which occasionally yields some benefit). And those vectors are hashable. So just keep an unordered_set<vector<bool>> around, and you'll get performance which is reasonably close to the optimum. (If you know N at compile time you may want to prefer bitset to vector<bool>.)
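For what it's worth, a minimal sketch of that suggestion (std::hash is specialized for std::vector<bool> in C++11, so no extra work is needed):
#include <unordered_set>
#include <vector>

std::unordered_set<std::vector<bool>> seen;

// Returns true if y is new; as a side effect, y is recorded for later queries.
bool is_new(const std::vector<bool>& y)
{
    return seen.insert(y).second; // .second is false if y already existed
}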
Method 2 can be optimized by calculating the decimal equivalent of the given subset modulo a big prime such as 1e9+7 and storing the result in a hash set. With N <= 1000, collisions are unlikely in practice, though not impossible, since distinct subsets can still map to the same residue.
#define M 1000000007 // big prime number

unordered_set<long long> subset; // decimal representations (mod M) of all
                                 // the previously found subsets

/* fast modular exponentiation: (num^pow) % M */
long long Pow(long long num, long long pow) {
    long long result = 1;
    while (pow) {
        if (pow & 1) {
            result *= num;
            result %= M;
        }
        num *= num;
        num %= M;
        pow >>= 1;
    }
    return result;
}

/* returns true if the subset is new, i.e. does not pre-exist */
bool check(const vector<bool>& booleanVector) {
    long long result = 0;
    for (size_t i = 0; i < booleanVector.size(); i++)
        if (booleanVector[i])
            result = (result + Pow(2, i)) % M;
    return subset.find(result) == subset.end();
}
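Presumably the caller also records the hash whenever check() reports a new subset, along these lines (hypothetical usage, reusing the names above):
/* remembers a subset so later check() calls will see it */
void record(const vector<bool>& booleanVector) {
    long long result = 0;
    for (size_t i = 0; i < booleanVector.size(); i++)
        if (booleanVector[i])
            result = (result + Pow(2, i)) % M;
    subset.insert(result);
}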

Computing size of symmetric difference of two sorted arrays using SIMD AVX

I am looking for a way to optimize an algorithm that I am working on. Its most repetitive, and thus compute-intensive, part is the comparison of two sorted arrays of any size containing unique unsigned integer (uint32_t) values, in order to obtain the size of their symmetric difference (the number of elements that exist in only one of the arrays). The target machine on which the algorithm will be deployed uses Intel processors supporting AVX2, therefore I am looking for a way to perform it in-place using SIMD.
Is there a way to exploit the AVX2 instructions to obtain the size of symmetric difference of two sorted arrays of unsigned integers?
Since both arrays are sorted it should be fairly easy to implement this with SIMD (AVX2). You would just need to iterate through the two arrays concurrently; when you get a mismatch comparing two vectors of 8 ints, you need to resolve the mismatch, i.e. count the differences and get the two array indices back in phase, and continue until you reach the end of one of the arrays. Then just add the number of remaining elements in the other array, if any, to get the final count.
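For reference, the scalar version of that two-pointer walk is short (a sketch of what the vectorized code has to reproduce 8 elements at a time):
#include <cstddef>
#include <cstdint>

// Count elements present in exactly one of two sorted, duplicate-free arrays.
size_t symmetric_difference_size(const uint32_t* a, size_t an,
                                 const uint32_t* b, size_t bn)
{
    size_t i = 0, j = 0, count = 0;
    while (i < an && j < bn) {
        if (a[i] == b[j])     { ++i; ++j; }     // shared: not counted
        else if (a[i] < b[j]) { ++count; ++i; } // only in a
        else                  { ++count; ++j; } // only in b
    }
    return count + (an - i) + (bn - j);         // whatever is left over
}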
Unless your arrays are tiny (like <= 16 elements), you need to perform a merge of the two sorted arrays, with additional code for emitting the non-equal elements.
If the size of the symmetric difference is expected to be very small, then use the method described by PaulR.
If the size is expected to be large (like 10% of the total number of elements), then you will have real trouble vectorizing it. It is much easier to optimize the scalar solution.
After writing several versions of code, the fastest one for me is:
int Merge3(const int *aArr, int aCnt, const int *bArr, int bCnt, int *dst) {
    int i = 0, j = 0, k = 0;
    while (i < aCnt - 32 && j < bCnt - 32) {
        for (int t = 0; t < 32; t++) {
            int aX = aArr[i], bX = bArr[j];
            dst[k] = (aX < bX ? aX : bX);
            k += (aX != bX);
            i += (aX <= bX);
            j += (aX >= bX);
        }
    }
    while (i < aCnt && j < bCnt) {
        ... // use simple code to merge the tails
The main optimizations here are:
Perform merging iterations in blocks (32 iterations per block). This allows simplifying the stop criterion from (i < aCnt && j < bCnt) to t < 32. This can be done for most of the elements, and the tails can be processed with slower code (sketched below).
Perform iterations in a branchless fashion. Note that the ternary operator compiles into a cmov instruction and the comparisons compile into setXX instructions, so there are no branches in the loop body. The output is stored with the well-known trick: write every element, but advance the index only past the valid ones.
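The elided tail might look like this (my sketch, consistent with the blocked loop above: equal elements are dropped, everything else is copied):
    while (i < aCnt && j < bCnt) {
        int aX = aArr[i], bX = bArr[j];
        if (aX == bX)     { ++i; ++j; }           // in both arrays: skip
        else if (aX < bX) { dst[k++] = aX; ++i; }
        else              { dst[k++] = bX; ++j; }
    }
    while (i < aCnt) dst[k++] = aArr[i++];        // remainder of a
    while (j < bCnt) dst[k++] = bArr[j++];        // remainder of b
    return k;                                     // size of the result
}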
What else I have tried:
(no vectorization) perform a (4 + 4) bitonic merge, check consecutive elements for duplicates, move pointers so that the 4 minimal elements (in total) are skipped:
gets 4.95ns vs 4.65ns --- slightly worse.
(fully vectorized) compare 4 x 4 elements pairwise, extract the comparison results into a 16-bit mask, pass it through a perfect hash function, use _mm256_permutevar8x32_epi32 with a 128-entry LUT to get the 8 elements sorted, check consecutive elements for duplicates, use _mm_movemask_ps + a 16-entry LUT + _mm_shuffle_epi8 to store only the unique elements among the minimal 4 elements: gets 4.00ns vs 4.65ns --- slightly better.
Here is the file with the solutions and the file with the perfect hash + LUT generator.
P.S. Note that the similar problem for the intersection of sets is solved here. The solution is somewhat similar to what I outlined as point 2 above.

Vectorizing a conditional involving shorts

I'm using a compact struct of 2 unsigned shorts indicating a start and end position.
I need to be able to quickly determine if there are any Range objects with a length (difference from start to end) past a threshold value.
I'm going to have a huge quantity of objects each with their own Range array, so it is not feasible to track which Range objects are above the threshold in a list or something. This code is also going to be run very often (many times a second for each array), so it needs to be efficient.
struct Range
{
    unsigned short start;
    unsigned short end;
};
I will always have an array of Range sized 2^n. While I would like to abort as soon as I find something over the threshold, I'm pretty sure it'd be faster to simply OR it all together and check at the end... assuming I can vectorize the loop. Although if I could do an if statement on the chunk of results for each vector, that would be grand.
size_t rangecount = 1 << resolution;
Range* ranges = new Range[rangecount];
...
bool result = false;
for (size_t i = 0; i < rangecount; ++i)
{
    result |= (ranges[i].end - ranges[i].start) > 4;
}
Not surprisingly, the auto-vectorizer gives the 1202 error because my data type isn't 32 or 64 bits wide. I really don't want to double my data size and make each field an unsigned int. So I'm guessing the auto-vectorizer approach is out for this.
Are there vector instructions that can handle 16 bit variables? If there are, how could I use them in c++ to vectorize my loop?
You are wondering if any value is greater than 4?
Yes, there are SIMD instructions for this. It's unfortunate that the auto-vectorizer isn't able to handle this scenario. Here's a vectorized algorithm:
diff_v = end_v - start_v; // _mm_hsub_epi16
floor_v = max(4_v, diff_v); // _mm_max_epi16
if (floor_v != 4_v) return true; // wide scalar comparison
Use _mm_sub_epi16 with a structure of arrays, or _mm_hsub_epi16 with an array of structures.
Actually, since start is stored first in memory, you will be working on start_v - end_v, so use _mm_min_epi16 and a vector of -4s.
Each vector instruction performs 8 comparisons at a time (note that _mm_hsub_epi16 requires SSSE3). It will still be fastest to return early instead of looping over everything. However, unrolling the loop a bit more may buy you additional speed (pass the first set of results into the packed min/max function to combine them).
So you end up with (approximately):
most_negative = threshold = _mm_set1_epi16(-4); // a vector of eight 16-bit -4s
loop:
    a = load from range;
    b = load from range;
    diff = _mm_hsub_epi16(a, b);
    most_negative = _mm_min_epi16(most_negative, diff);
    // unroll by repeating the above four instructions 4 times or so
    if (most_negative != threshold) return true;
repeat loop
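Fleshed out with real intrinsics, the loop might look like this (a sketch, assuming the Range struct above, a count that is a multiple of 8, and lengths that fit in a signed 16-bit range; the function name is mine):
#include <cstddef>
#include <emmintrin.h> // SSE2
#include <tmmintrin.h> // SSSE3: _mm_hsub_epi16

bool any_length_over_4(const Range* ranges, size_t count)
{
    const __m128i neg4 = _mm_set1_epi16(-4);
    __m128i most_neg = neg4;
    const __m128i* p = reinterpret_cast<const __m128i*>(ranges);
    for (size_t i = 0; i < count / 8; ++i) {
        __m128i a = _mm_loadu_si128(p + 2 * i);     // 4 Ranges (8 shorts)
        __m128i b = _mm_loadu_si128(p + 2 * i + 1); // 4 more Ranges
        __m128i diff = _mm_hsub_epi16(a, b);        // start - end for each Range
        most_neg = _mm_min_epi16(most_neg, diff);   // track the most negative
    }
    // If any lane dropped below -4, some end - start exceeded 4.
    __m128i eq = _mm_cmpeq_epi16(most_neg, neg4);
    return _mm_movemask_epi8(eq) != 0xFFFF;
}
An early-exit version would hoist the movemask test into the loop every few iterations, as suggested above.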

How to approximate the count of distinct values in an array in a single pass through it

I have several huge arrays (millions++ members). All those are arrays of numbers and they are not sorted (and I cannot do that). Some are uint8_t, some uint16_t/32/64. I would like to approximate the count of distinct values in these arrays. The conditions are following:
speed is VERY important, I need to do this in one pass through the array and I must go through it sequentially (cannot jump back and forth) (I am doing this in C++, if that's important)
I don't need EXACT counts. What I want to know is that if it is an uint32_t array if there are like 10 or 20 distinct numbers or if there are thousands or millions.
I have quite a bit of memory that I can use, but the less is used the better
the smaller the array data type, the more accurate I need to be
I don't mind STL, but if I can do it without it that would be great (no BOOST though, sorry)
if the approach can be easily parallelized, that would be cool (but it's not a mandatory condition)
Examples of perfect output:
ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values
I understand that some of the constraints may seem illogical, but that's the way it is.
As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However if someone has thoughts for that too, feel free to share them!
EDIT: a bit of clarification regarding STL. If you have a solution using it, please post it. Not using STL would be just a bonus for us, we don't fancy it too much. However if it is a good solution, it will be used!
For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.
For larger values, if you are not interested in counts above 100000, std::map is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
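A minimal sketch of that counting-table idea for the 16-bit case (one counter per possible value; the first time a slot leaves zero, a new distinct value has been found):
#include <cstddef>
#include <cstdint>
#include <vector>

size_t count_distinct_u16(const uint16_t* data, size_t n)
{
    std::vector<uint32_t> counts(65536, 0); // one counter per possible value
    size_t distinct = 0;
    for (size_t i = 0; i < n; ++i)
        if (counts[data[i]]++ == 0)         // first occurrence of this value
            ++distinct;
    return distinct;
}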
I'm pretty sure you can do it by:
Create a Bloom filter
Run through the array inserting each element into the filter (this is a "slow" O(n), since it requires computing several independent decent hashes of each value)
Count how many bits are set in the Bloom Filter
Compute back from the density of the filter to an estimate of the number of distinct values (the standard formula is sketched below). I don't know the calculation off the top of my head, but any treatment of the theory of Bloom filters goes into this, because it's vital to the probability of the filter giving a false positive on a lookup.
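For reference, the standard estimate from the Bloom filter literature, for a filter of m bits with k hash functions and X bits set, is n ≈ -(m/k) * ln(1 - X/m):
#include <cmath>

// Estimated number of distinct inserted values, given filter size m (bits),
// k hash functions, and X observed set bits.
double bloom_estimate(double m, double k, double X)
{
    return -(m / k) * std::log(1.0 - X / m);
}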
Presumably if you're simultaneously computing the top 10 most frequent values, then if there are fewer than 10 distinct values you'll know exactly what they are and you don't need an estimate.
I believe the "most frequently used" problem is difficult (well, memory-consuming). Suppose for a moment that you only want the top 1 most frequently used value. Suppose further that you have 10 million entries in the array, and that after the first 9.9 million of them, none of the numbers you've seen so far has appeared more than 100k times. Then any of the values you've seen so far might be the most-frequently used value, since any of them could have a run of 100k values at the end. Even worse, any two of them could have a run of 50k each at the end, in which case the count from the first 9.9 million entries is the tie-breaker between them. So in order to work out in a single pass which is the most frequently used, I think you need to know the exact count of each value that appears in the 9.9 million. You have to prepare for that freak case of a near-tie between two values in the last 0.1 million, because if it happens you aren't allowed to rewind and check the two relevant values again. Eventually you can start culling values -- if there's a value with a count of 5000 and only 4000 entries left to check, then you can cull anything with a count of 1000 or less. But that doesn't help very much.
So I might have missed something, but I think that in the worst case, the "most frequently used" problem requires you to maintain a count for every value you have seen, right up until nearly the end of the array. So you might as well use that collection of counts to work out how many distinct values there are.
One approach that can work, even for big values, is to spread them into lazily allocated buckets.
Suppose that you are working with 32-bit integers: creating an array of 2**32 bits is relatively impractical (2**29 bytes, hum). However, we can probably assume that 2**16 pointers is still reasonable (2**19 bytes: 512kB), so we create 2**16 buckets (null pointers).
The big idea therefore is to take a "sparse" approach to counting, and hope that the integers won't be too dispersed, so that many of the bucket pointers will remain null.
#include <algorithm>
#include <utility>
#include <vector>
#include <stdint.h>

typedef std::pair<int32_t, int32_t> Pair; // (value, count)
typedef std::vector<Pair> Bucket;
typedef std::vector<Bucket*> Vector;

struct Comparator {
    bool operator()(Pair const& left, Pair const& right) const {
        return left.first < right.first;
    }
};

void add(Bucket& v, int32_t value) {
    Pair const pair(value, 1);
    Bucket::iterator it =
        std::lower_bound(v.begin(), v.end(), pair, Comparator());
    if (it == v.end() or it->first > value) {
        v.insert(it, pair);
        return;
    }
    it->second += 1;
}

void gather(Vector& v, int32_t const* begin, int32_t const* end) {
    for (; begin != end; ++begin) {
        uint16_t const index = uint32_t(*begin) >> 16; // top 16 bits pick the bucket
        Bucket*& bucket = v[index];
        if (bucket == 0) { bucket = new Bucket(); }
        add(*bucket, *begin);
    }
}
Once you have gathered your data, then you can count the number of different values or find the top or bottom pretty easily.
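For example, counting the distinct values from the gathered buckets is a short walk (assuming the Vector was created with 2**16 null pointers, as described above):
std::size_t count_distinct(Vector const& v) {
    std::size_t distinct = 0;
    for (std::size_t i = 0; i != v.size(); ++i)
        if (v[i] != 0)
            distinct += v[i]->size(); // each Pair is one distinct value
    return distinct;
}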
A few notes:
the number of buckets is completely customizable (thus letting you control the amount of original memory)
the bucketing strategy is customizable as well (this is just a cheap hash table I have made here)
it is possible to monitor the number of allocated buckets and abandon, or switch gear, if it starts blowing up.
if each value is different, then it just won't work, but when you realize that, you will already have collected many counts, so you'll at least be able to give a lower bound on the number of different values, and you'll also have a starting point for the top/bottom.
If you manage to gather those statistics, then the work is cut out for you.
For 8- and 16-bit values it's pretty obvious: you can track every possibility on every iteration.
When you get to 32 and 64 bit integers, you don't really have the memory to track every possibility.
Here are a few natural suggestions that are likely outside the bounds of your constraints.
I don't really understand why you can't sort the array. RadixSort is O(n), and once sorted it would take one more pass to get accurate distinct-value and top-X information. In reality it would be 6 passes altogether for 32-bit if you used a 1-byte radix (1 pass for counting + 4 passes, one per byte, + 1 pass for getting the values).
In the same vein as above, why not just use SQL? You could create a stored procedure that takes the array as a table-valued parameter and returns the number of distinct values and the top X values in one go. This stored procedure could also be called in parallel.
-- number of distinct
SELECT COUNT(DISTINCT(n)) FROM #tmp
-- top x
SELECT TOP 10 n, COUNT(n) FROM #tmp GROUP BY n ORDER BY COUNT(n) DESC
I've just thought of an interesting solution. It's based on a law of Boolean algebra called idempotence of multiplication, which states that:
X * X = X
From it, and using the commutative property of boolean multiplication, we can deduce that:
X * Y * X = X * X * Y = X * Y
Now, you see where I'm going with this? This is how the algorithm would work (I'm terrible with pseudo-code):
c = element[1] & element[2]   // bitwise AND of the binary representations
for i = 3 to size_of_array:
    b = c & element[i]
    if b != c then different_values++
    c = b
In the first iteration, we compute (element1 * element2) * element3. We could represent it as:
(X * Y) * Z
If Z (element3) is equal to X (element1), then:
(X * Y) * Z = X * Y * X = X * Y
And if Z is equal to Y (element2), then:
(X * Y) * Z = X * Y * Y = X * Y
So, if Z isn't different from X or Y, then X * Y won't change when we multiply it by Z.
This remains valid for big expressions, like:
(X * A * Z * G * T * P * S) * S = X * A * Z * G * T * P * S
If we receive a value which is a factor of our big multiplicand (meaning it has already been incorporated), then the big multiplicand won't change when we multiply it by the received input, so there's no new distinct value.
So that's how it goes. Each time a different value arrives, the product of our big multiplicand and that distinct value differs from the big operand. So, with b = c & element[i], if b != c we just increment our distinct-values counter.
I guess I'm not being clear enough. If that's the case, please let me know.

iterating through TWO sparse matrices

I'm using boost sparse matrices holding bools and trying to write a comparison function for storing them in a map. It is a very simple comparison function. Basically, the idea is to look at the matrix as a binary number (after being flattened into a vector) and sort based on the value of that number. This can be accomplished in this way:
for (unsigned int j = 0; j < maxJ; j++)
{
    for (unsigned int i = 0; i < maxI; i++)
    {
        if (matrix1(i,j) < matrix2(i,j)) return true;
        else if (matrix1(i,j) > matrix2(i,j)) return false;
    }
}
return false;
However, this is inefficient because of the sparseness of the matrices and I'd like to use iterators for the same result. The algorithm using iterators seems straightforward, i.e.
1) grab the first nonzero cell in each matrix, 2) compare j*maxJ+i for both, 3) if equal, grab the next nonzero cells in each matrix and repeat. Unfortunately, in code this is extremely tedious and I'm worried about errors.
What I'm wondering is (a) is there a better way to do this and (b) is there a simple way to get the "next nonzero cell" for both matrices? Obviously, I can't use nested for loops like one would to iterate through one sparse matrix.
Thanks for your help.
--
Since it seems that the algorithm I proposed above may be the best solution in my particular application, I figured I should post the code I developed for the tricky part, getting the next nonzero cells in the two sparse matrices. This code is not ideal and not very clear, but I'm not sure how to improve it. If anyone spots a bug or knows how to improve it, I would appreciate some comments. Otherwise, I hope this is useful to someone else.
typedef boost::numeric::ublas::mapped_matrix<bool>::const_iterator1 iter1;
typedef boost::numeric::ublas::mapped_matrix<bool>::const_iterator2 iter2;

// Grabs the next nonzero cell in a sparse matrix after the cell pointed to
// by i1, i2.
std::pair<iter1, iter2> next_cell(iter1 i1, iter2 i2, iter1 end) const
{
    if (i2 == i1.end())
    {
        if (i1 == end)
            return std::pair<iter1, iter2>(i1, i2);
        ++i1;
        if (i1 != end)              // don't take begin() of the end iterator
            i2 = i1.begin();
    }
    else
    {
        ++i2;
    }
    // Advance until a row with a remaining nonzero cell is found.
    for (; i1 != end;)
    {
        if (i2 != i1.end())
            return std::pair<iter1, iter2>(i1, i2);
        ++i1;
        if (i1 != end) i2 = i1.begin();
    }
    return std::pair<iter1, iter2>(i1, i2);
}
I like this question, by the way.
Let me pseudocode out what I think you're asking:
declare a list of sparse matrices, ListA
declare a map MatMap from sparse matrix to double, along with a
    `StrictWeakMatrixOrderer` function which takes two sparse matrices
insert ListA into MatMap
The question: how do I write a StrictWeakMatrixOrderer efficiently?
This is an approach. I'm inventing this on the fly....
Define a function flatten() and precompute the flattened matrices, storing the flattened vectors in a vector (or another container with a random-indexing upper bound). flatten() could be as simple as concatenating each row (or column) with the previous one (which can be done in linear time if you have a constant-time function to grab a row/column).
This yields a set of vectors with size on the order of 10^6. This is a tradeoff: saving this information instead of computing it on the fly. This is useful if you're going to be doing a lot of compares as you go along.
Remember, zeros contain information; dropping them could yield two vectors equal to each other whose generating matrices are not equal.
Then, we have transformed the algorithm question from "order matrices" into "order vectors".
I've never heard of a distance metric for matrices, but I've heard of distance metrics for vectors.
You could use a "sum of differences" ordering, aka the Hamming distance (for each element that differs, add 1). That gives an O(n) algorithm:
for i = 0 to max:
    if a[i] != b[i]:
        distance++
return distance
The Hamming distance satisfies these conditions
d(a,b) = d(b,a)
d(a,a) = 0
d(x, z) <= d(x, y) + d(y, z)
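In C++ the metric is only a few lines (a sketch over the flattened vectors described above, assuming equal lengths):
#include <cstddef>
#include <vector>

std::size_t hamming(const std::vector<bool>& a, const std::vector<bool>& b)
{
    std::size_t distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) // each differing element adds 1
            ++distance;
    return distance;
}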
Now to do some off-the-cuff analysis....
10^6 elements in a matrix (or its corresponding vector).
An O(n) distance metric.
But that's O(n) comparisons, and if each array access itself cost O(n) time, you would end up with an O(n^2) metric, so you need constant-time access. It turns out that std::vector's [] operator provides "amortized constant time access to arbitrary elements" according to SGI's STL site.
Provided you have sufficient memory to store k*2*10^6 elements, where k is the number of matrices you are managing, this is a working solution that uses lots of memory in exchange for being linear.
(a) I don't fully understand what you're trying to accomplish, but if you want to compare whether both matrices have the same value at the same index, it's sufficient to use elementwise matrix multiplication (which should be implemented for sparse matrices as well):
matrix3 = element_prod (matrix1, matrix2);
That way you'll get for each index:
0 (false) * 1 (true) = 0 (false)
0*0 = 0
1*1 = 1
So the resulting matrix3 will have your solution in one line :)
It seems to me we're talking about implementing bitwise, elementwise operators on boost::sparse_matrix, since comparing whether one vector (or matrix) is smaller than another without using any standard vector norm demands special operators (or special mappings/norms).
To my knowledge boost does not provide special operators for binary matrices (not to speak of sparse binary matrices). There are unlikely to be any straightforward solutions to this using BLAS-level matrix/vector algebra. Binary matrices have their own place in the linear algebra field, so there are tricks and theorems, but I doubt those are easier than your solution.
Your question could be reformulated as: how do I efficiently sort astronomically large numbers represented by 2d bitmaps (n=100, so 100x100 elements would give you a number like 2^10000)?
Good question!