Auto-vectorization of scalar product in loop - C++

I am trying to auto-vectorize the following loop. The i- and j-loops run over the lower triangle of a matrix. According to the vectorization report, the compiler cannot vectorize (i.e. translate to AVX SIMD instructions) the j- and the k-loop. I would expect this to be straightforward, because there is no pointer aliasing (#pragma ivdep and the compiler option -D NOALIAS) and the data (the 1D arrays x and p) is aligned to 64 bytes.
The if statement could be a problem, but even with a branch-free alternative (an expensive shift that counts the sign of a double) the compiler does not vectorize this loop.
__assume_aligned(x, 64);
__assume_aligned(p, 64);
#pragma omp parallel for simd reduction(+:accum)
for ( int i = 1 ; i < N ; i++ ){              // loop over lower triangle (i,j); OpenMP SIMD LOOP WAS VECTORIZED
    for ( int j = 0 ; j < i ; j++ ){          // <-- remark #25460: No loop optimizations reported
        double __attribute__((aligned(64))) scalarp = 0.0;
        #pragma omp simd
        for ( int k = 0 ; k < D ; k++ ){      // <-- remark #25460: No loop optimizations reported
            // scalar product \sum_k x_{i,k} \cdot x_{j,k}
            scalarp += x[i*D + k] * x[j*D + k];
        }
        // Alternative to the following if:
        // accum += - ( (long long) floor( - ( scalarp + p[i] + p[j] ) ) >> 63);
        #pragma ivdep
        if ( scalarp + p[i] + p[j] >= 0 ){    // check if the condition is satisfied
            accum += 1;
        }
    }
}
Is this related to the fact that the OpenMP starting points for each thread are not known until run time? I thought the simd clause resolves this and that Intel's auto-vectorizer is aware of it.
Intel Compiler: 18.0.2 20180210
Edit: I've looked into the assembly and it is now clear that the code is already vectorized; sorry for bothering all of you.

Looking into the assembly really helps. The code is already vectorized: OpenMP SIMD LOOP WAS VECTORIZED also covers the inner loop in this particular case.

Related

C/C++ fast absolute difference between two series

I am interested in generating efficient C/C++ code to get the differences between two time series.
More precisely: the time series values are stored as uint16_t arrays with a fixed and equal length of 128.
I am fine with either a pure C or a pure C++ implementation. My code examples are in C++.
My intention is the following: let A, B and C be discrete time series of length l with value type uint16_t, and compute
∀ n < l: C[n] = |A[n] - B[n]|
What I can think of, in pseudo code:
for index i:
    if a[i] > b[i]:
        c[i] = a[i] - b[i]
    else:
        c[i] = b[i] - a[i]
Or in C/C++:
for (uint8_t idx = 0; idx < 128; idx++) {
    c[idx] = a[idx] > b[idx] ? a[idx] - b[idx] : b[idx] - a[idx];
}
But I really don't like the if/else statement in the loop.
I am okay with looping; this can be unrolled by the compiler.
Something like:
void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
    #pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    }
}
What I am looking for is some 'magic code' which avoids the if/else and gets me the absolute difference between the two unsigned values.
I am okay with +/- 1 precision (in case this allows some bit magic to happen). I am also okay with changing the data type to get faster results, and with dropping the loop for something else.
So something like:
void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
    #pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = magic_code_for_abs_diff(a[i], b[i]);
    }
}
I did try XORing the two values, but that gives the proper result in only one of the two cases.
EDIT 2:
I did a quick test of different approaches on my laptop.
For 250000000 entries this is the performance (256 rounds); a minimal harness sketch for this kind of test follows the list:
c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];                          ~500ms
c[i] = std::abs(a[i] - b[i]);                                            ~800ms
c[i] = ((a[i] - b[i]) + ((a[i] - b[i]) >> 15)) ^ ((a[i] - b[i]) >> 15);  ~425ms
uint16_t tmp = (a[i] - b[i]); c[i] = tmp * ((tmp > 0) - (tmp < 0));      ~600ms
uint16_t ret[2] = { a[i] - b[i], b[i] - a[i] }; c[i] = ret[a[i] < b[i]]; ~900ms
c[i] = ((a[i] - b[i]) >> 31 | 1) * (a[i] - b[i]);                        ~375ms
c[i] = (a[i] - b[i]) ^ ((a[i] - b[i]) >> 15);                            ~425ms
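A minimal harness sketch for a test of this kind (the array length, fill values, and checksum sink are placeholders of my own, not the original setup):
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = 250000;   // placeholder size, much smaller than the original 250000000
    const int rounds = 256;
    std::vector<uint16_t> a(n, 3), b(n, 7), c(n);

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; r++)
        for (std::size_t i = 0; i < n; i++)
            c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    auto t1 = std::chrono::steady_clock::now();

    // Consume the result so the compiler cannot drop the work entirely.
    uint64_t sum = 0;
    for (std::size_t i = 0; i < n; i++)
        sum += c[i];

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("checksum %llu, %lld ms\n", (unsigned long long)sum, (long long)ms);
    return 0;
}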
Your problem is a good candidate for SIMD. GCC can do it automatically, here is a simplified example: https://godbolt.org/z/36nM8bYYv
void absDiff(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
    for (uint8_t i = 0; i < 16; i++)
        c[i] = a[i] - b[i];
}
Note that I added __restrict__ to enable autovectorization, otherwise the compiler has to assume your arrays may overlap and it isn't safe to use SIMD (because some writes could change future reads in the loop).
I simplified it to just 16 at a time, and removed the absolute value for the sake of illustration. The generated assembly is:
vld1.16 {q9}, [r0]!
vld1.16 {q11}, [r1]!
vld1.16 {q8}, [r0]
vld1.16 {q10}, [r1]
vsub.i16 q9, q9, q11
vsub.i16 q8, q8, q10
vst1.16 {q9}, [r2]!
vst1.16 {q8}, [r2]
bx lr
That means it loads 8 integers at once from a, then from b, repeats that once, then does 8 subtracts at once, then again, then stores 8 values twice into c. Many fewer instructions than without SIMD.
Of course it requires benchmarking to see if this is actually faster on your system (after you add back the absolute value part, I suggest using your ?: approach which does not defeat autovectorization), but I expect it will be significantly faster.
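For completeness, a sketch of what that might look like with the absolute value put back and the same __restrict__ hint; the ?: form is the one GCC can typically recognize as an unsigned absolute-difference / min-max pattern, but verify the generated code with your own compiler and flags:
#include <cstdint>

void absDiff(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
    for (uint8_t i = 0; i < 128; i++)
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}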
Try to let the compiler see the conditional lane-selection pattern for SIMD instructions like this (pseudo code):
// store a, b in SIMD registers
for (0 to 32)
    a[...] = input[...]
    b[...] = input2[...]

// single-type operation, easily parallelizable
for (0 to 32)
    vector1[...] = a[...] - b[...]

// single-type operation, easily parallelizable
// maybe better to compute b - a to reduce the dependency on the first step,
// since a and b are already in SIMD registers
for (0 to 32)
    vector2[...] = -vector1[...]

// single-type operation, easily parallelizable
// re-use the a, b registers again
for (0 to 32)
    vector3[...] = a[...] < b[...]

// the x86 architecture has SIMD instructions for this
// the operation is simple: no other calculations inside, just 3 inputs and 1 output
// all operands are registers (at least they should be, if the compiler does its job)
for (0 to 32)
    vector4[...] = vector3[...] ? vector2[...] : vector1[...]
If you post your benchmark code, I can compare this with the other solutions. But it shouldn't matter with a good compiler (or good compiler flags) that does the same thing automatically for the first benchmark snippet in the question.
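For reference, a plain C++ rendering of that pseudo code could look like the sketch below (the function name and the fixed length of 32 are mine). Every statement in the body is a simple lane-wise operation, so the final select can map to a SIMD compare-and-blend:
#include <cstdint>

void absDiffSelect(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
    for (int i = 0; i < 32; i++) {
        uint16_t d1 = a[i] - b[i];   // a - b (wraps around if negative)
        uint16_t d2 = b[i] - a[i];   // b - a, i.e. the negation of d1
        bool sel   = a[i] < b[i];    // per-lane comparison
        c[i] = sel ? d2 : d1;        // lane-wise selection (compare + blend)
    }
}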
Fast abs (for two's-complement integers) can be implemented as (x + (x >> N)) ^ (x >> N), where N is the number of bits in the integer minus one, i.e. 15 in your case. That is one possible implementation of std::abs; still, you can try it.
– answer by freakish
Since you write "I am okay with a +/- 1 precision", you can use a XOR-only solution: instead of abs(x), compute x ^ (x >> 15). This gives an off-by-one result for negative values.
If you want the correct result even for negative values, use the other answer (with the x >> 15 correction).
In any case, this XOR trick only works if overflow is impossible; that is also why the compiler can't replace abs with XOR-based code on its own.
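Put together as code, the two variants discussed in these comments might look like this (a sketch for 16-bit two's-complement values; behaviour at INT16_MIN is ignored):
#include <cstdint>

// Exact branchless abs: (x + (x >> 15)) ^ (x >> 15).
int16_t absExact(int16_t x)
{
    int16_t m = x >> 15;   // 0 for x >= 0, all ones (-1) for x < 0
    return (x + m) ^ m;
}

// XOR-only variant: exact for x >= 0, off by one (|x| - 1) for x < 0.
int16_t absOffByOne(int16_t x)
{
    return x ^ (x >> 15);
}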

Iterative Karatsuba algorithm parallelized and vectorized using OpenACC in C++

I'm trying to parallelize an iterative version of the Karatsuba algorithm using OpenACC in C++. I would like to ask how I can vectorize the inner for loop. My compiler shows me this message about that loop:
526, Complex loop carried dependence of result-> prevents parallelization
Loop carried dependence of result-> prevents parallelization
Loop carried backward dependence of result-> prevents vectorization
and here is the code of the two nested loops:
#pragma acc kernels num_gangs(1024) num_workers(32) copy(result[0:2*size-1]) copyin(A[0:size], B[0:size], D[0:size])
{
    #pragma acc loop gang
    for (TYPE position = 1; position < 2 * (size - 1); position++) {
        // for an even coefficient add D[position/2]
        if (position % 2 == 0)
            result[position] += D[position / 2];
        TYPE start = (position >= size) ? (position % size) + 1 : 0;
        TYPE end = (position + 1) / 2;
        // inner loop: sum(Ds*Dt) - sum(Ds + Dt) where s + t = i
        #pragma acc loop worker
        for (TYPE inner = start; inner < end; inner++) {
            result[position] += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
            result[position] -= (D[inner] + D[position - inner]);
        }
    }
}
Actually, I'm not sure whether it is possible to vectorize it at all. But if it is, I can't figure out what I'm doing wrong. Thank you.
The "Complex loop carried dependence of result" problem is due to pointer aliasing. The compiler can't tell if the object that "result" points to overlaps with one of the other pointer's objects.
As a C++ extension, you can add the C99 "restrict" keyword to the declaration of your arrays. This will assert to the compiler that pointers don't alias.
Alternatively, you can add the OpenACC "independent" clause on your loop directives to tell the compiler that the loops do not have any dependencies.
Note that OpenACC does not support array reductions, so you won't be able to parallelize the "inner" loop unless you modify the code to use a scalar. Something like:
rtmp = result[position];
#pragma acc loop vector reduction(+:rtmp)
for (TYPE inner = start; inner < end; inner++) {
    rtmp += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
    rtmp -= (D[inner] + D[position - inner]);
}
result[position] = rtmp;
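Putting the two suggestions together, the whole region could look roughly like the sketch below. The independent clause and the completed copyin list are my additions, inferred from the arrays the loops read, and rtmp is assumed to have the same type as the elements of result:
#pragma acc kernels copy(result[0:2*size-1]) copyin(A[0:size], B[0:size], D[0:size])
{
    #pragma acc loop gang independent
    for (TYPE position = 1; position < 2 * (size - 1); position++) {
        if (position % 2 == 0)
            result[position] += D[position / 2];
        TYPE start = (position >= size) ? (position % size) + 1 : 0;
        TYPE end   = (position + 1) / 2;
        TYPE rtmp  = result[position];
        #pragma acc loop vector reduction(+:rtmp)
        for (TYPE inner = start; inner < end; inner++) {
            rtmp += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
            rtmp -= (D[inner] + D[position - inner]);
        }
        result[position] = rtmp;
    }
}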

Intel compiler (ICC) unable to auto vectorize inner loop (matrix multiplication)

EDIT:
ICC (after adding -qopt-report=5 -qopt-report-phase:vec):
LOOP BEGIN at 4.c(107,2)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

   LOOP BEGIN at 4.c(108,3)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

      LOOP BEGIN at 4.c(109,4)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
         remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      <Remainder>
      LOOP END
   LOOP END
LOOP END
It seems that c[i][j] would be read before it is written if the loop were vectorized (since I am doing a reduction). The question is why the reduction is allowed when a local variable (tmp) is introduced.
Original issue:
I have a C snippet below which does matrix multiplication: a and b are the operands, c holds the result a*b, and n is the row/column length.
double ** c = create_matrix(...); // initialize n*n matrix with zeroes
double ** a = fill_matrix(...);   // fills n*n matrix with random doubles
double ** b = fill_matrix(...);   // fills n*n matrix with random doubles

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
ICC (version 18.0.0.1) is not able to vectorize the inner loop (with the -O3 flag).
ICC output:
LOOP BEGIN at 4.c(107,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(108,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(109,4)
         remark #25460: No loop optimizations reported
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      <Remainder>
      LOOP END
   LOOP END
LOOP END
However, with the changes below, the compiler vectorizes the inner loop.
// OLD
for (k = 0; k < n; k++) {
    c[i][j] += a[i][k] * b[k][j];
}

// NEW
double tmp = 0;
for (k = 0; k < n; k++) {
    tmp += a[i][k] * b[k][j];
}
c[i][j] = tmp;
ICC vectorized output:
LOOP BEGIN at 4.c(119,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(120,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(134,4)
      <Peeled loop for vectorization>
      LOOP END

      LOOP BEGIN at 4.c(134,4)
         remark #15300: LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      <Alternate Alignment Vectorized Loop>
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      <Remainder loop for vectorization>
      LOOP END
   LOOP END
LOOP END
Instead of accumulating the vector multiplication result in the matrix C cell, the result is accumulated in a separate variable and assigned later.
Why does the compiler not optimize the first version? Could it be due to potential aliasing of a and/or b with elements of c (a read-after-write problem)?
Leverage Your Compiler
You don't show the flags you're using to get your vectorization report. I recommend:
-qopt-report=5 -qopt-report-phase:vec
The documentation says:
For levels n=1 through n=5, each level includes all the information of the previous level, as well as potentially some additional information. Level 5 produces the greatest level of detail. If you do not specify n, the default is level 2, which produces a medium level of detail.
With the higher level of detail, the compiler will likely tell you (in mysterious terms) why it is not vectorizing.
My suspicion is that the compiler is worried about the memory being aliased. The solution you've found allows the compiler to prove that it is not, so it performs the vectorization.
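Another way to give the compiler that guarantee is to qualify the pointers yourself. A sketch using the __restrict__ extension, shown with a flat row-major layout rather than the double** from the question:
void matmul(const double* __restrict__ a, const double* __restrict__ b,
            double* __restrict__ c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
Whether the k-loop reduction then vectorizes can still depend on floating-point reassociation (ICC allows it by default at -O3; GCC may need -ffast-math or a local accumulator as in your NEW version).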
A Portable Solution
If you're into OpenMP, you could use:
#pragma omp simd
for (k = 0; k < n; k++)
    c[i][j] += a[i][k] * b[k][j];
to accomplish the same thing. I think Intel also has a set of compiler-specific directives that will do this in a non-portable way.
A Miscellaneous Note
This line:
double ** c = create_matrix(...)
makes me nervous. It suggests that somewhere you have something like this:
for (int i = 0; i < 10; i++)
    c[i] = new double[20];
That is, you have an array of arrays.
The problem is, this gives no guarantee that your subarrays are contiguous in memory. The result is suboptimal cache utilization. You want a 1D array that is addressed like a 2D array. Making a 2D array class which does this or using functions/macros to access elements will allow you to preserve much the same syntax while benefiting from better cache performance.
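A minimal sketch of such a contiguous matrix wrapper (the class name and interface are mine, just for illustration):
#include <cstddef>
#include <vector>

// One contiguous allocation with row-major indexing: rows sit next to each
// other in memory, which is friendlier to the cache and to vector loads.
struct Matrix {
    std::vector<double> data;
    std::size_t n;
    explicit Matrix(std::size_t n) : data(n * n, 0.0), n(n) {}
    double&       operator()(std::size_t i, std::size_t j)       { return data[i * n + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return data[i * n + j]; }
};
The multiplication then becomes c(i, j) += a(i, k) * b(k, j) with the same triple-loop structure as before.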
You might also consider compiling with the -align flag and decorating your code appropriately. This will give better SIMD performance by allowing aligned accesses to memory.
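With ICC, that decoration could be an aligned allocation plus an alignment assertion, e.g. (a sketch; 64-byte alignment chosen to match a cache line):
#include <cstddef>
#include <xmmintrin.h>   // _mm_malloc / _mm_free

void allocate_example(int n)
{
    // 64-byte aligned allocation for the flat matrix storage.
    double* a = (double*) _mm_malloc((std::size_t)n * n * sizeof(double), 64);
    __assume_aligned(a, 64);   // ICC-specific hint, as used in the first question above
    for (int i = 0; i < n * n; i++)
        a[i] = 0.0;
    _mm_free(a);
}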

Is this a good practice for vectorization?

I'm trying to improve the performance of this code by vectorizing this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}
From my knowledge, you can vectorize loops that involve exactly one math operation. In the code above we have 5 math operations, so (using OMP):
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
This isn't going to work. However, I was wondering whether breaking the loop above into multiple loops with exactly one math operation each is a good practice for vectorization. The resulting code would be:
double p0[n], p3[n], p1[n], p2[n];

#pragma omp simd
for( int k = 0; k < n; k++ )
    p0[k] = origin[f[k].p0]*f[k].w;

#pragma omp simd
for( int k = 0; k < n; k++ )
    p3[k] = origin[f[k].p3]*f[k].w;

#pragma omp simd
for( int k = 0; k < n; k++ )
    p1[k] = origin[f[k].p1]*f[k].w;

#pragma omp simd
for( int k = 0; k < n; k++ )
    p2[k] = origin[f[k].p2]*f[k].w;

#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p0[k];

#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p1[k];

#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p2[k];

#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p3[k];
Is this a good solution, or is there anything better? Will modern compilers (say GCC) do this (or better) kind of optimization by themselves (e.g. with -O3 enabled), so that there is actually no gain in performance?
This is generally bad HPC programming practice, for several reasons:
These days you normally have to make your code as computationally dense as possible. To achieve that you need the highest Arithmetic Intensity (AI) you can get for the loop. For simplicity, think of AI as the ratio of [amount of computation] to [number of bytes moved to/from memory in order to perform that computation].
By splitting the loops you make the AI of each loop much lower in your case, because you no longer reuse the same bytes for different computations.
A (vector- or thread-) parallel reduction (in your case over the variable "d") has a cost/overhead which you don't want to multiply by 8 or 10 (i.e. by the number of loops you produced by hand).
In many cases the Intel/GCC compiler vectorization engines can optimize slightly better when several data fields of the same object are processed in the same loop body, as opposed to the split-loops case.
There are a few theoretical advantages of loop splitting as well, but they don't apply to your case, so I list them just in case. Loop splitting is reasonable/profitable when:
- There is more than one loop-carried dependency or reduction in the same loop.
- The loop split helps to get rid of some out-of-order-execution data-flow dependencies.
- You do not have enough vector registers to perform all the computations (for complicated algorithms).
Intel Advisor (mentioned by you in previous questions) helps to analyze many of these factors and measures AI.
It is also true that good compilers "don't care" whether you have one such loop or a loop split, because they can easily transform one case into the other on the fly.
However, the applicability of this kind of transformation is very limited in real codes, because the compiler would need a lot of extra information at compile time: whether pointers or dynamic arrays overlap, whether data is aligned, etc. So you shouldn't rely on compiler transformations or on a specific compiler minor version, but just write HPC-ready code as far as you are able to.
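For reference, the fused version of the question's loop with the reduction spelled out would be the sketch below (SurfHF and its fields are as in the question; whether the gather-heavy body vectorizes profitably still depends on the target and should be measured):
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    #pragma omp simd reduction(+:d)
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}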

Example of C++ code optimization for parallel computing

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some loops of length "nc" and one loop of length "np", where "np" is much larger than "nc"). I present part of the code here. The rest of the code is not very significant in terms of computational time, so I prefer to keep it clean. However, the critical loop of length "np" is a pretty simple piece of code and it can be parallelized, so it will not hurt to rewrite this part in a more efficient but less readable version (maybe with SSE instructions). I'm using the GCC compiler, C++ code, and OpenMP parallelization.
This code is part of the well-known particle-in-cell (PIC) algorithm (and this is a basic variant of it). I'm trying to learn code optimization on this version (so my goal is not just an efficient PIC algorithm, which has already been written in a thousand variants, but also a demonstrative example of code optimization). I've done some work, but I am not sure that I have addressed all the optimization issues correctly.
const int NT = ...;        // number of threads (in two versions: about 6 or about 30)
const int np = 10000000;   // np is commonly about 1000-10000 times larger than nc
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc], weight[nc];
int p, num;

for ( i = 0 ; i < step ; i++) {
    // ***
    // *** some not very time consuming code for calculation of
    // *** a, a_lin from the values of rho_full and rho_diff
    #pragma omp for private(p,num)
    for ( k = np ; --k ; ) {
        num = omp_get_thread_num();
        p = (int) x[k];
        u[k] += a[p] + a_lin[p] * (x[k] - p);
        x[k] += u[k];
        if (x[k] < 0 ) { x[k] += nc; } else
        if (x[k] > nc) { x[k] -= nc; }
        p = (int) x[k];
        rho_full[num][p] += weight[k];
        rho_diff[num][p] += weight[k] * (x[k] - p);
    }
}
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p], where num is the index of each thread. After the computation I simply sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] + ...). The reason is to avoid two different threads writing into the same part of the array. I am not sure whether this is an efficient solution (note that "nc" is relatively small, so the number of operations over "np" is probably still the dominant cost).
2) (also an important question) I need to read x[k] many times, and it is also modified many times. Maybe it is better to read this value into a register and then forget about the whole x array, or to fix a pointer here; after all the calculations I can access x[k] again and store the obtained value. I believe the compiler does this work for me, but I am not sure, because I modify x[k] in the middle of the algorithm. So the compiler probably does something effective on its own, but maybe in this version it reloads the value more often than necessary, because I switch between loading and storing it more than once.
3) (probably not relevant) The code works with the integer part and the fractional part of x, and it needs both. I obtain the integer part as p = (int) x and the remainder as x - p. I compute this both at the beginning and at the end of the loop body. One can see that this splitting could be stored somewhere and reused at the next step (I mean the step over the i index). Do you think the following version is better? I store the integer and fractional parts in separate arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
    for ( k = np ; --k ; ) {
        num = omp_get_thread_num();
        u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
        x_rem[k] += u[k];
        p = (int) x_rem[k];   // *** This part is added to the code to simplify the rest.
        x_int[k] += p;        // *** And maybe there is a better way to realize
        x_rem[k] -= p;        // *** this "pushing correction".
        if (x_int[k] < 0 ) { x_int[k] += nc; } else
        if (x_int[k] > nc) { x_int[k] -= nc; }
        rho_full[num][x_int[k]] += weight[k];
        rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
    }
}
You can use OMP reduction for your for loop:
int result = 0;
#pragma omp for nowait reduction(+:result)
for ( k = np ; --k ; ) {
    num = omp_get_thread_num();
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0 ) { x[k] += nc; } else
    if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    result += weight[k] + weight[k] * (x[k] - p);
}
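Regarding point 1 of the question, the per-thread rho_full[NT][nc] copies can also be expressed as an OpenMP array-section reduction (available since OpenMP 4.5), so the summation across threads is done for you. A sketch, with the loop body cut down to the histogram updates and assuming x[k] already lies in [0, nc):
const int np = 10000000;
const int nc = 10000;

float x[np], weight[np];
float rho_full[nc], rho_diff[nc];

void accumulate()
{
    // Each thread gets a private copy of the two nc-sized arrays; the
    // copies are summed into the shared arrays at the end of the loop.
    #pragma omp parallel for reduction(+ : rho_full[0:nc], rho_diff[0:nc])
    for (int k = 0; k < np; k++) {
        int p = (int) x[k];                      // cell index of particle k
        rho_full[p] += weight[k];                // charge assignment
        rho_diff[p] += weight[k] * (x[k] - p);   // linear-weighting part
    }
}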