Vectorization & #pragma omp simd - c++

I got lost reading about SIMD and OpenMP and how they relate to vectorization, so I would like to ask whether somebody can clarify the above for me.
Specifically, I have a part of a C++ code I want to parallelize, but I am pretty stuck at the moment and can't figure it out on my own.
Any help clarifying what exactly vectorization is, and how I can use it in the following part of the code, would be greatly appreciated!
for (unsigned short i = 1; i <= N_a; i++) {
    for (unsigned short j = 1; j <= N_b; j++) {
        temp[0] = H[i-1][j-1] + similarity_score(seq_a[i-1], seq_b[j-1]);
        temp[1] = H[i-1][j] - delta;
        temp[2] = H[i][j-1] - delta;
        temp[3] = 0.;
        H[i][j] = find_array_max(temp, 4);
        switch (ind) { // ind: index of the maximum, apparently set by find_array_max
        case 0: // score in (i,j) stems from a match/mismatch
            I_i[i][j] = i-1;
            I_j[i][j] = j-1;
            break;
        case 1: // score in (i,j) stems from a deletion in sequence A
            I_i[i][j] = i-1;
            I_j[i][j] = j;
            break;
        case 2: // score in (i,j) stems from a deletion in sequence B
            I_i[i][j] = i;
            I_j[i][j] = j-1;
            break;
        case 3: // (i,j) is the beginning of a subsequence
            I_i[i][j] = i;
            I_j[i][j] = j;
            break;
        }
    }
}
Regards!

So ind is constant for both nested loops?
You might get a compiler to auto-vectorize this for you with OpenMP. (Put the line #pragma omp simd right before either of your for loops, and see if that affects the asm when you compile with -O3. I don't know OpenMP that well, so IDK if you need other options.)
Wrap it in a function that actually compiles, so I can see what happens. (e.g. by putting the code on http://gcc.godbolt.org/ to get nicely formatted asm output).
If it doesn't auto-vectorize, it's probably not too hard to manually vectorize with Intel intrinsics for x86, since you're just initializing some arrays with the array index. Keep a vector of loop counters, starting with a vector of __m128i jvec = _mm_set_epi32(3, 2, 1, 0);, and increment it with _mm_add_epi32() by a vector of [ 4 4 4 4 ] (_mm_set1_epi32(4)) to increment every element by 4.
Keep a separate vector of i values, which you only modify in the outer loop, but still store in the inner loop.
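For illustration, here is a minimal sketch of that idea, assuming contiguous int rows for the index arrays and handling only the "case 0" stores (the function name, the row pointers, and the fixed case are all hypothetical, not from the question):
#include <immintrin.h>

// Hypothetical sketch: fill one row of I_i/I_j with (i-1, j-1), four elements
// at a time, as if "case 0" applied everywhere. Assumes the rows are
// contiguous int arrays and n is a multiple of 4.
void fill_case0_row(int *I_i_row, int *I_j_row, int i, int n) {
    const __m128i ivec = _mm_set1_epi32(i - 1); // i-1 broadcast: constant per row
    const __m128i four = _mm_set1_epi32(4);     // step for the j counters
    __m128i jvec = _mm_set_epi32(3, 2, 1, 0);   // j-1 values for j = 1..4
    for (int j = 1; j + 3 <= n; j += 4) {
        _mm_storeu_si128((__m128i *)&I_i_row[j], ivec); // I_i[i][j..j+3] = i-1
        _mm_storeu_si128((__m128i *)&I_j_row[j], jvec); // I_j[i][j..j+3] = j-1..j+2
        jvec = _mm_add_epi32(jvec, four);       // advance every j counter by 4
    }
}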
See the x86 tag wiki for instruction-set stuff.
See the sse tag wiki for some SIMD guides, including this nice intro to SIMD and what it's all about.

Related

nested vectorized openmp loops with multiple lines of code in inner-most loop

I was wondering if it is valid to use an OpenMP simd construct to collapse multiple nested loops, where the code in the innermost loop first calculates a number of indices (as shown below) and then uses those indices to modify a multidimensional array (also shown below). In other words, would the lines labelled I1-I4 below all be vectorized? In all the OpenMP examples I have seen, there is always a single variable whose result gets vectorized. Would the code below be considered valid? Thanks
for (std::size_t a = 0; a < A; a++)
{
    #pragma omp simd collapse(3)
    for (std::size_t b = 0; b < B; b++)
    {
        for (std::size_t c = 0; c < C; c++)
        {
            for (std::size_t d = 0; d < D; d++)
            {
                std::size_t idx1 = c*B + b;                // I1
                std::size_t idx2 = d*(B*C) + c*B + b;      // I2
                std::size_t idx3 = d*(E) + c*F + b;        // I3
                W1[idx1][idx3] += W1[idx1][a]*W2[a][idx3]; // I4
            }
        }
    }
}
This is definitely valid OpenMP code. Depending on the compiler and the target architecture, the results of compiling it may change, but at least some compilers will vectorize it. Because the indices are likely nonlinear, it will only vectorize well on a platform with both gather and scatter instructions, but it's valid regardless.

Example of C++ code optimization for parallel computing

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some cycles of length "nc" and one cycle of length "np", where the number "np" is much larger than "nc"). I present part of the code here. The rest of the code is not very essential as a percentage of computational time, so I prefer to keep the rest of the algorithm clean. However, the critical cycle of length "np" is a pretty simple piece of code and it can be parallelized, so it will not hurt if I rewrite this part into some more effective and less clear version (maybe using SSE instructions). I'm using the gcc compiler, C++ code, and OpenMP parallelization.
This code is part of the well-known particle-in-cell algorithm (and this is also a basic variant). I'm trying to learn code optimization on this version (so my goal is not only to have an effective PIC algorithm, which has already been written in a thousand variants, but to produce a demonstrative example of code optimization). I have done some work, but I am not very sure that I handled all the optimization issues correctly.
const int NT = ...;        // number of threads (in two versions: about 6 or about 30)
const int np = 10000000;   // np is commonly about 1000-10000 times larger than nc
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc], weight[np]; // weight is per particle (indexed by k), so size np
int p, num;
for (i = 0; i < step; i++) {
    // ***
    // *** some not very time consuming code for calculation of
    // *** a, a_lin from the values of rho_full and rho_diff
    #pragma omp for private(p,num)
    for (k = np; --k; ) {   // note: k runs from np-1 down to 1, so element 0 is never touched
        num = omp_get_thread_num();
        p = (int) x[k];
        u[k] += a[p] + a_lin[p] * (x[k] - p);
        x[k] += u[k];
        if (x[k] < 0)  { x[k] += nc; } else
        if (x[k] > nc) { x[k] -= nc; }
        p = (int) x[k];
        rho_full[num][p] += weight[k];
        rho_diff[num][p] += weight[k] * (x[k] - p);
    }
}
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p], where num is an index for each thread. After the computation I just sum these arrays up (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] + ...). The reason is to avoid two different threads writing into the same part of an array. I am not sure whether this is an efficient solution (note that the number "nc" is relatively small, so the operations over "np" probably still dominate).
2) (also an important question) I need to read x[k] many times, and it is also changed many times. Maybe it would be better to read this value into a register, forget the whole x array during the computation, and store the obtained value back only after all the calculation. I believe the compiler does this work for me, but I am not sure, because I modify x[k] in the middle of the algorithm; the compiler probably does some effective work on its own, but in this version it may load and store the value more times than necessary, because I switch between reading and writing it more than once.
3) (probably not relevant) The code works with the integer part and the fractional part below the decimal point, and it needs both of these values. I obtain the integer part as p = (int) x and the remainder as x - p. I calculate this split at the beginning and also at the end of the loop body. This split could be stored somewhere and reused at the next step (I mean the step of the i index). Do you think the following version is better? It stores the integer and fractional parts in arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
for (k = np; --k; ) {
    num = omp_get_thread_num();
    u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
    x_rem[k] += u[k];
    p = (int) x_rem[k]; // *** This part is added to the code to simplify the rest.
    x_int[k] += p;      // *** Maybe there is a better way to realize
    x_rem[k] -= p;      // *** this "pushing correction".
    if (x_int[k] < 0)  { x_int[k] += nc; } else
    if (x_int[k] > nc) { x_int[k] -= nc; }
    rho_full[num][x_int[k]] += weight[k];
    rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
};
You can use an OpenMP reduction for your for loop:
float result = 0; // float, not int: it accumulates float weights
#pragma omp for nowait reduction(+:result)
for (k = np; --k; ) {
    num = omp_get_thread_num();
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0)  { x[k] += nc; } else
    if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    result += weight[k] + weight[k] * (x[k] - p);
}

How can writing to a shared array (over a pointer) in a nested for loop parallelized with OpenMP produce wrong results?

I have a very strange problem that I'm trying to solve and understand. I have a nested for loop of the following form:
#pragma omp parallel for schedule(guided) shared(Array) collapse(3)
for (int i = istart; i < iend; i++)
{
    for (int j = jstart; j < jend; j++)
    {
        for (int k = kstart; k < kend; k++)
        {
            int IJK = i*(jend-jstart)*(kend-kstart) + (k-kstart);
            Array[3*IJK + 2] = an operation with some shared values;
        }
    }
}
There are three loops of this form, writing to Array[3*IJK], Array[3*IJK + 1] and Array[3*IJK + 2] respectively. Array is actually accessed through a shared pointer, and the value of IJK actually comes from a function call (which is inlined).
I first tried parallelizing all loops and the program runs through, but the results are different compared to my serial results.
Now come the strange parts.
The for loop that has this same structure but writes Array[3*IJK + 1] instead produces correct results when it is parallelized (with the other loops kept serial in this case). But as soon as I parallelize one of the other loops, I get different results. It is only this single loop that produces correct results when parallelized by itself.
Also, if I don't use collapse, or use collapse(2) instead of collapse(3), I get different results. Only with the #pragma statement exactly as above do I get correct results in the Array[3*IJK + 1] loop.
I thought it might have something to do with the order in which Array was written to, but even with an ordered clause and construct, I still get wrong results.
What can be the cause of this?
Are you sure your serial case is correct?
Your IJK calculation makes no sense to me; for one thing, it doesn't depend on j at all. As it is, if two threads get the same (i,k) pair with different j -- certainly possible with collapse(3) -- there's going to be a race condition as they both will be trying to write to the same IJK.
Are you sure you don't want something like
int IJK = (i*(jend-jstart) + (j-jstart))*(kend-kstart) + (k-kstart);
?

Digital filter and std::inner_product optimization

In a digital filtering C++ application, I use std::inner_product (with std::vector<double> and std::deque<double>) to compute the dot product between the filter coefficients and the input data, for each data sample. After profiling my application, I figured out that no less than 85% of the execution time is spent in std::inner_product!
To what extent is std::inner_product optimized, in GCC for example?
Does it use SIMD instructions? Does it perform loop unrolling? How can I make sure of that?
Based on this, would it be worth it to implement custom dot product function(s), especially if the number of coefficients is low? (But I would like to keep the function as generic as possible.)
More specifically, this is the piece of code I use to apply a filter:
std::deque<double> in(filterNum.size(), 0.0);
std::deque<double> out(filterDenom.size() - 1, 0.0);
const double gain = filterDenom.back();
for (unsigned int s = 0, size = data.size(); s < size; ++s) {
    in.pop_front();
    in.push_back(data[s] / gain);
    data[s] = inner_product(in.begin(), in.end(), filterNum.begin(),
                            -inner_product(out.begin(), out.end(),
                                           filterDenom.begin(), 0.0));
    out.pop_front();
    out.push_back(data[s]);
}
Typically, I use second order bandpass IIR filters, which means that the size of filterNum and filterDenom (numerator and denominator coefficients of the filter) is 5. data is the vector containing the input samples.
Getting an additional factor of 2 out of this shouldn't be hard if you just write the code directly. Part of it might come from removing some of the generality of inner_product, but some would also come from eliminating the use of deques: each of those inner_products has to run iterators through deques, whereas if you just keep a pointer into your input array you can index off it and off the filter array in the inner loop, and increment the pointer to the input array in the outer loop. Most of the (coding) effort then becomes handling the edge conditions.
And take that division out of there - it should be a multiplication by a constant calculated outside the loop.
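For instance, against the question's own loop (nothing else changed):
const double invGain = 1.0 / gain;   // one division, hoisted out of the loop
for (unsigned int s = 0, size = data.size(); s < size; ++s) {
    in.pop_front();
    in.push_back(data[s] * invGain); // was: data[s] / gain
    // ... rest of the loop body unchanged ...
}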
inner_product itself is pretty efficient (there's not much to do there), but it needs to increment two iterators on each pass through the inner loop. There is no explicit loop unrolling, but a good compiler can unroll a loop that simple, and the compiler is more likely to know how far to unroll a loop before running into instruction cache issues.
Deque iterators are not nearly as efficient as ++ on a pure pointer: there is at least a test on every ++, and there may be more than one assignment.
This is what a simple (FIR) filter can look like, without the code for the edge conditions (which goes outside the loop):
double norm = 1.0/sum;
double *p = data.values();   // start of input data
double *q = output.values(); // start of output buffer
int width = data.size() - filter.size();
for (int i = 0; i < width; ++i)
{
    double *f = filter.values();
    double accumulator = f[0] * p[i];
    for (int j = 1; j < filter.size(); ++j)
    {
        accumulator += f[j] * p[i + j];
    }
    *q++ = accumulator * norm;
}
Note that there are messy details left out, and this is not the same as your filter, but it gives the idea. What's inside the outer loop easily fits in a modern instruction cache. The inner loop may be unrolled by the compiler. Most modern architectures can do the add and multiply in parallel.
You can ask GCC to compute most of the algorithms in <algorithm> and <numeric> in parallel mode; it may give a performance boost if your data set is very large (I think it really only uses OpenMP inside).
However, on small data sets it may give a performance hit.
A comparison with the other solution would be more than welcome!
http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
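For reference, a minimal sketch of how parallel mode is enabled (the compile flags are from the libstdc++ manual; the dot function is just an example):
// Build with:  g++ -O3 -fopenmp -D_GLIBCXX_PARALLEL dot.cpp
// With _GLIBCXX_PARALLEL defined, standard algorithms such as
// std::inner_product may dispatch to parallel implementations.
#include <numeric>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}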

C/C++ optimization: negate doubles fast

I need to negate a very large number of doubles quickly. If bit_generator generates 0, then the sign must be changed. If bit_generator generates 1, then nothing happens. The loop is run many times and bit_generator is extremely fast. On my platform, case 2 is noticeably faster than case 1; it looks like my CPU doesn't like branching. Is there any faster and portable way to do it? What do you think about case 3?
// generates 0 and 1
int bit_generator();

// big vector (C++)
vector<double> v;

// case 1
for (size_t i = 0; i < v.size(); ++i)
    if (bit_generator() == 0)
        v[i] = -v[i];

// case 2
const int sign[] = {-1, 1};
for (size_t i = 0; i < v.size(); ++i)
    v[i] *= sign[bit_generator()];

// case 3
const double sign[] = {-1, 1};
for (size_t i = 0; i < v.size(); ++i)
    v[i] *= sign[bit_generator()];

// case 4 uses a C array
double a[N];
double number_generator(); // generates doubles
double z[2];               // used as a buffer
for (size_t i = 0; i < N; ++i) {
    z[0] = number_generator();
    z[1] = -z[0];
    a[i] = z[bit_generator()];
}
EDIT: Added case 4 and the C tag, because the vector can be a plain array. Since I can control how the doubles are generated, I redesigned the code as shown in case 4. It avoids the extra multiplication and the branching at the same time. I presume it should be quite fast on all platforms.
Unless you want to resize the vector in the loop, lift the v.size() out of the for expression, i.e.
const size_t SZ = v.size();
for (size_t i = 0; i < SZ; ++i)
    if (bit_generator() == 0)
        v[i] = -v[i];
If the compiler can't see what happens in bit_generator(), then it might be very hard for the compiler to prove that v.size() does not change, which makes loop unrolling or vectorization impossible.
UPDATE: I've made some tests, and on my machine method 2 seems to be the fastest. However, it seems to be even faster to use a pattern which I call "group action" :-). Basically, you group multiple decisions into one value and switch over it:
const size_t SZ = v.size();
for (size_t i = 0; i < SZ; i += 2) // manual loop unrolling
{
    int val = 2*bit_generator() + bit_generator();
    switch (val) // only one conditional
    {
    case 0:
        break; // nothing happens
    case 1:
        v[i+1] = -v[i+1];
        break;
    case 2:
        v[i] = -v[i];
        break;
    case 3:
        v[i] = -v[i];
        v[i+1] = -v[i+1];
    }
}
// not shown: wrap up the loop if SZ%2==1
If you can assume that the sign is represented by one specific bit, like in x86 implementations, you can simply do:
v[i] ^= !bit_generator() << SIGN_BIT_POSITION; // negate the output of bit_generator()
                                               // because 0 means negate and 1 means
                                               // leave unchanged
In x86 the sign bit is the MSB, so for doubles that's bit 63:
#define SIGN_BIT_POSITION 63
will do the trick.
Edit:
Based on comments, I should add that you might need to do some extra work to get this to compile, since v is an array of double, while bit_generator() returns int. You could do it like this:
union int_double {
    double d;        // assumption: double is 64 bits wide
    long long int i; // assumption: long long is 64 bits wide
};
(syntax might be a bit different for C because you might need a typedef.)
Then define v as a vector of int_double and use:
v[i].i ^= (long long)!bit_generator() << SIGN_BIT_POSITION; // widen before the 63-bit shift;
                                                            // the ! keeps "0 means negate"
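Side note: in C++ (as opposed to C), reading the union member that was not last written is technically undefined behaviour; a memcpy-based variant avoids that. A sketch, keeping the question's convention that 0 means negate:
#include <cstdint>
#include <cstring>

// Flip the IEEE-754 sign bit without union type punning.
inline void maybe_negate(double &d, int bit) { // bit == 0 means "negate"
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);               // bit-copy double -> integer
    u ^= static_cast<std::uint64_t>(!bit) << 63; // XOR the sign bit
    std::memcpy(&d, &u, sizeof d);               // bit-copy back
}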
Generally, an if() inside a loop prevents vectorization and unrolling unless the compiler can turn the branch into a conditional move, and the code has to branch once per pass, maximizing the loop overhead. Case 3 should perform very well, especially if the compiler can use SSE instructions.
For fun, if you're using GCC, use the -S -o foo.S -c foo.c flags instead of the usual -o foo.o -c foo.c flags. This will give you the assembly code, and you can see what is getting compiled for your three cases.
You don't need the lookup table, a simple formula suffices:
const size_t size = v.size();
for (size_t i = 0; i < size; ++i)
    v[i] *= 2*bit_generator() - 1;
Assuming that the actual negation is fast (a good assumption on a modern compiler and CPU), you could use a conditional assignment, which is also fast on modern CPUs, to choose between two possibilities:
v[i] = bit_generator() ? v[i] : -v[i];
This avoids branches and allows the compiler to vectorize the loop and make it faster.
Are you able to rewrite bit_generator so it returns 1 and -1 instead? That removes an indirection from the equation at the possible cost of some clarity.
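That would reduce the loop to a single multiply per element (a sketch, assuming the rewritten generator):
// Hypothetical: bit_generator() now returns -1 or +1 directly.
for (size_t i = 0; i < v.size(); ++i)
    v[i] *= bit_generator();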
Premature optimization is the root of insipid SO questions
On my machine, running at 5333.24 BogoMIPS, the timings for 1'000 iterations across an array of 1'000'000 doubles yield the following times per expression:
p->d = -p->d 7.33 ns
p->MSB(d) ^= 0x80 6.94 ns
Where MSB(d) is pseudo-code for grabbing the most significant byte of d. This means that the obfuscated approach is 5.32% faster than the naive d = -d. For a billion such negations this means the difference between 7.3 and 6.9 seconds.
Someone must have an awfully big pile of doubles to care about that optimization.
Incidentally, I had to print out the content of the array when completed or my compiler optimized the whole test into zero op codes.