Iterative Karatsuba algorithm parallelized and vectorized using OpenACC in C++

I'm trying to parallelize an iterative version of the Karatsuba algorithm using OpenACC in C++. I would like to ask how I can vectorize the inner for loop. My compiler shows me this message about that loop:
526, Complex loop carried dependence of result-> prevents parallelization
Loop carried dependence of result-> prevents parallelization
Loop carried backward dependence of result-> prevents vectorization
and here is the code of the two nested loops:
#pragma acc kernels num_gangs(1024) num_workers(32) copy(result[0:2*size-1]) copyin(A[0:size], B[0:size], D[0:size])
{
    #pragma acc loop gang
    for (TYPE position = 1; position < 2 * (size - 1); position++) {
        // for even coefficients add D[position/2]
        if (position % 2 == 0)
            result[position] += D[position / 2];
        TYPE start = (position >= size) ? (position % size) + 1 : 0;
        TYPE end = (position + 1) / 2;
        // inner loop: sum(D_st) - sum(D_s + D_t) where s + t = position
        #pragma acc loop worker
        for (TYPE inner = start; inner < end; inner++) {
            result[position] += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
            result[position] -= (D[inner] + D[position - inner]);
        }
    }
}
Actually, I'm not sure if it is possible to vectorize it. But if it is, I can't figure out what I'm doing wrong. Thank you

The "Complex loop carried dependence of result" problem is due to pointer aliasing. The compiler can't tell if the object that "result" points to overlaps with one of the other pointer's objects.
As a C++ extension, you can add the C99 "restrict" keyword to the declaration of your arrays. This will assert to the compiler that pointers don't alias.
Alternatively, you can add the OpenACC "independent" clause on your loop directives to tell the compiler that the loops do not have any dependencies.
Note that OpenACC does not support array reductions, so you won't be able to parallelize the "inner" loop unless you modify the code to use a scalar. Something like:
TYPE rtmp = result[position];
#pragma acc loop vector reduction(+:rtmp)
for (TYPE inner = start; inner < end; inner++) {
    rtmp += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
    rtmp -= (D[inner] + D[position - inner]);
}
result[position] = rtmp;
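Putting both suggestions together, a minimal sketch of the restructured loop nest (assuming TYPE is a signed integer type and that result, A, B, and D point to non-overlapping arrays):

#pragma acc kernels copy(result[0:2*size-1]) copyin(A[0:size], B[0:size], D[0:size])
{
    // "independent" asserts there are no cross-iteration dependences
    #pragma acc loop independent gang
    for (TYPE position = 1; position < 2 * (size - 1); position++) {
        TYPE rtmp = result[position];
        if (position % 2 == 0)
            rtmp += D[position / 2];
        TYPE start = (position >= size) ? (position % size) + 1 : 0;
        TYPE end = (position + 1) / 2;
        // the scalar reduction lets the inner loop vectorize
        #pragma acc loop vector reduction(+:rtmp)
        for (TYPE inner = start; inner < end; inner++) {
            rtmp += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
            rtmp -= (D[inner] + D[position - inner]);
        }
        result[position] = rtmp;
    }
}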

Related

nested vectorized openmp loops with multiple lines of code in inner-most loop

I was wondering if it is valid to use an OpenMP simd construct to collapse multiple nested loops, where the code in the innermost loop first calculates a number of indices (as shown below) and then uses those indices to modify a multidimensional array (also shown below). In other words, would the lines labelled I1-I4 below all be vectorized? In all the OpenMP examples I have seen, there is always a single variable whose result gets vectorized. Would the code below be considered valid? Thanks
for (std::size_t a = 0; a < A; a++)
{
    #pragma omp simd collapse(3)
    for (std::size_t b = 0; b < B; b++)
    {
        for (std::size_t c = 0; c < C; c++)
        {
            for (std::size_t d = 0; d < D; d++)
            {
                std::size_t idx1 = c*B + b;                 // I1
                std::size_t idx2 = d*(B*C) + c*B + b;       // I2
                std::size_t idx3 = d*(E) + c*F + b;         // I3
                W1[idx1][idx3] += W1[idx1][a] * W2[a][idx3]; // I4
            }
        }
    }
}
This is definitely valid OpenMP code. Depending on the compiler and the target architecture, the results of compiling it may change, but at least some compilers will definitely vectorize it. Because the indices are likely non-linear, it will only vectorize well on a platform with both gather and scatter instructions, but it's valid regardless.

Auto-vectorization of scalar product in loop

I am trying to autovectorize the following loop. The i- and j-loops run over the lower triangle of a matrix. Unfortunately, according to the vectorization report the compiler cannot vectorize (= translate to AVX SIMD instructions) the j- and k-loops. But I think it should be straightforward, because there are no pointer aliases (#pragma ivdep and compiler option -D NOALIAS) and the data (x: 1D array and p: 1D array) is aligned to 64 bytes.
It could be that the if statement is a problem, but even with the if-free solution (an expensive shifting operation counting the sign of a double) the compiler is not able to vectorize this loop.
__assume_aligned(x, 64);
__assume_aligned(p, 64);
#pragma omp parallel for simd reduction(+:accum)
for (int i = 1; i < N; i++) {       // loop over lower triangle (i,j); OpenMP SIMD LOOP WAS VECTORIZED
    for (int j = 0; j < i; j++) {   // <-- remark #25460: No loop optimizations reported
        double __attribute__((aligned(64))) scalarp = 0.0;
        #pragma omp simd
        for (int k = 0; k < D; k++) { // <-- remark #25460: No loop optimizations reported
            // scalar product \sum_k x_{i,k} \cdot x_{j,k}
            scalarp += x[i*D + k] * x[j*D + k];
        }
        // Alternative to the following if:
        // accum += - ( (long long) floor( - ( scalarp + p[i] + p[j] ) ) >> 63);
        #pragma ivdep
        if (scalarp + p[i] + p[j] >= 0) { // check if condition is satisfied
            accum += 1;
        }
    }
}
Does this relate to the problem that the OpenMP starting points for each thread are not known until run time? I thought the simd clause resolves this and that Intel's auto-vectorization is aware of it.
Intel Compiler: 18.0.2 20180210
edit: I've looked into the assembly and now it is clear that the code is already vectorized, sorry for bothering all of you.
Looking into the assembly really helps. The code is already vectorized. "OpenMP SIMD LOOP WAS VECTORIZED" also covers the inner loop in this particular case.
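As a footnote, the if at the end of the j-loop can also be written branch-free without the floor/shift trick mentioned in the question, since a C++ comparison yields 0 or 1 that the vectorizer can add under a mask; a sketch only:

// Branch-free form of the conditional count (sketch): the comparison result
// converts to 0 or 1, which vectorizers typically implement as a masked add.
accum += (scalarp + p[i] + p[j] >= 0);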

Intel compiler (ICC) unable to auto vectorize inner loop (matrix multiplication)

EDIT:
ICC (after adding -qopt-report=5 -qopt-report-phase:vec):
LOOP BEGIN at 4.c(107,2)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
   remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

   LOOP BEGIN at 4.c(108,3)
      remark #15344: loop was not vectorized: vector dependence prevents vectorization
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
      remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)

      LOOP BEGIN at 4.c(109,4)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
         remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      <Remainder>
      LOOP END
   LOOP END
LOOP END
It seems that c[i][j] is read before it is written if vectorized (as I am doing a reduction). The question is why the reduction is allowed when a local variable (temp) is introduced?
Original issue:
I have the C snippet below, which does matrix multiplication. a and b are the operands, c holds the result a*b, and n is the row and column length.
double ** c = create_matrix(...); // initialize n*n matrix with zeroes
double ** a = fill_matrix(...);   // fills n*n matrix with random doubles
double ** b = fill_matrix(...);   // fills n*n matrix with random doubles

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
ICC (version 18.0.0.1) is not able to vectorize the inner loop (with the -O3 flag provided).
ICC output:
LOOP BEGIN at 4.c(107,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(108,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(109,4)
         remark #25460: No loop optimizations reported
      LOOP END

      LOOP BEGIN at 4.c(109,4)
      <Remainder>
      LOOP END
   LOOP END
LOOP END
Though, with the changes below, the compiler vectorizes the inner loop.
// OLD
for (k = 0; k < n; k++) {
    c[i][j] += a[i][k] * b[k][j];
}

// TO (NEW)
double tmp = 0;
for (k = 0; k < n; k++) {
    tmp += a[i][k] * b[k][j];
}
c[i][j] = tmp;
ICC vectorized output:
LOOP BEGIN at 4.c(119,2)
   remark #25460: No loop optimizations reported

   LOOP BEGIN at 4.c(120,3)
      remark #25460: No loop optimizations reported

      LOOP BEGIN at 4.c(134,4)
      <Peeled loop for vectorization>
      LOOP END

      LOOP BEGIN at 4.c(134,4)
         remark #15300: LOOP WAS VECTORIZED
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      <Alternate Alignment Vectorized Loop>
      LOOP END

      LOOP BEGIN at 4.c(134,4)
      <Remainder loop for vectorization>
      LOOP END
   LOOP END
LOOP END
Instead of accumulating the vector multiplication result in the matrix C cell, the result is accumulated in a separate variable and assigned afterwards.
Why does the compiler not optimize the first version? Could it be due to potential aliasing of a and/or b with elements of c (a read-after-write problem)?
Leverage Your Compiler
You don't show the flags you're using to get your vectorization report. I recommend:
-qopt-report=5 -qopt-report-phase:vec
The documentation says:
For levels n=1 through n=5, each level includes all the information of the previous level, as well as potentially some additional information. Level 5 produces the greatest level of detail. If you do not specify n, the default is level 2, which produces a medium level of detail.
With the higher level of detail, the compiler will likely tell you (in mysterious terms) why it is not vectorizing.
My suspicion is that the compiler is worried about the memory being aliased. The solution you've found allows the compiler to prove that it is not, so it performs the vectorization.
A Portable Solution
If you're into OpenMP, you could use:
#pragma omp simd
for (k = 0; k < n; k++)
    c[i][j] += a[i][k] * b[k][j];
to accomplish the same thing. I think Intel also has a set of compiler-specific directives that will do this in a non-portable way.
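One such directive is #pragma ivdep (it also appears in the scalar-product question above), which tells ICC to ignore assumed, as opposed to proven, dependences in the loop that follows; a sketch:

// ICC-specific sketch: ivdep discards *assumed* vector dependences for the
// next loop; proven dependences are still honored.
#pragma ivdep
for (k = 0; k < n; k++)
    c[i][j] += a[i][k] * b[k][j];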
A Miscellaneous Note
This line:
double ** c = create_matrix(...)
makes me nervous. It suggests that somewhere you have something like this:
for (int i = 0; i < 10; i++)
    c[i] = new double[20];
That is, you have an array of arrays.
The problem is, this gives no guarantee that your subarrays are contiguous in memory. The result is suboptimal cache utilization. You want a 1D array that is addressed like a 2D array. Making a 2D array class which does this or using functions/macros to access elements will allow you to preserve much the same syntax while benefiting from better cache performance.
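A minimal sketch of such a wrapper (a hypothetical Matrix class, not from the question):

#include <cstddef>
#include <vector>

// Hypothetical sketch: one contiguous buffer addressed with 2D indices, so
// rows are contiguous in memory and cache/vectorizer friendly.
struct Matrix {
    std::vector<double> data;
    std::size_t cols;
    Matrix(std::size_t r, std::size_t c) : data(r * c, 0.0), cols(c) {}
    double&       operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    const double& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};
// Usage: c(i, j) += a(i, k) * b(k, j);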
You might also consider compiling with the -align flag and decorating your code appropriately. This will give better SIMD performance by allowing aligned accesses to memory.
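A sketch of what that could look like, using the ICC-specific __assume_aligned hint shown in an earlier question together with 64-byte-aligned allocation (names illustrative):

#include <cstddef>
#include <immintrin.h>  // _mm_malloc / _mm_free

// Sketch: allocate 64-byte-aligned storage and assert the alignment to ICC.
double* make_aligned(std::size_t n) {
    double* x = static_cast<double*>(_mm_malloc(n * sizeof(double), 64));
    __assume_aligned(x, 64);  // ICC-specific: promise 64-byte alignment
    return x;                 // release later with _mm_free(x)
}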

Vectorization & #pragma omp simd

Since I got lost reading about SIMD and OpenMP with respect to vectorization, I would like to ask whether somebody can clarify the above for me.
Specifically, I have a part of a C++ code I want to parallelize, but I am pretty stuck at the moment and can't figure it out on my own.
Any help clarifying what exactly vectorization is and how I can use it in the following part of the code would be greatly appreciated!
for (unsigned short i = 1; i <= N_a; i++) {
    for (unsigned short j = 1; j <= N_b; j++) {
        temp[0] = H[i-1][j-1] + similarity_score(seq_a[i-1], seq_b[j-1]);
        temp[1] = H[i-1][j] - delta;
        temp[2] = H[i][j-1] - delta;
        temp[3] = 0.;
        H[i][j] = find_array_max(temp, 4);
        switch (ind) {
            case 0: // score in (i,j) stems from a match/mismatch
                I_i[i][j] = i-1;
                I_j[i][j] = j-1;
                break;
            case 1: // score in (i,j) stems from a deletion in sequence A
                I_i[i][j] = i-1;
                I_j[i][j] = j;
                break;
            case 2: // score in (i,j) stems from a deletion in sequence B
                I_i[i][j] = i;
                I_j[i][j] = j-1;
                break;
            case 3: // (i,j) is the beginning of a subsequence
                I_i[i][j] = i;
                I_j[i][j] = j;
                break;
        }
    }
}
Regards!
So ind is constant for both nested loops?
You might get a compiler to auto-vectorize this for you with OpenMP. (Put the line #pragma omp simd right before either of your for loops, and see if that affects the asm when you compile with -O3. I don't know OpenMP that well, so IDK if you need other options.)
Wrap it in a function that actually compiles, so I can see what happens. (e.g. by putting the code on http://gcc.godbolt.org/ to get nicely formatted asm output).
If it doesn't auto-vectorize, it's probably not too hard to manually vectorize with Intel intrinsics for x86, since you're just initializing some arrays with the array index. Keep a vector of loop counters starting with __m128i jvec = _mm_set_epi32(3, 2, 1, 0);, and increment it with _mm_add_epi32() by a vector of [ 4 4 4 4 ] (_mm_set1_epi32(4)) to increment every element by 4.
Keep a separate vector of i values, which you only modify in the outer loop, but still store in the inner loop.
See the x86 tag wiki for instruction-set stuff.
See the sse tag wiki for some SIMD guides, including this nice intro to SIMD and what it's all about.
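As a concrete illustration of the counter-vector idea above, a sketch only (it assumes ind == 0 everywhere, i.e. the match/mismatch case, and a plain int row, both hypothetical):

#include <emmintrin.h>  // SSE2 intrinsics

// Store j-1 into one row of I_j for j = 1..N_b using a vector of loop
// counters, advancing the counters by 4 each iteration.
void fill_row_case0(int* I_j_row, int N_b) {
    __m128i jvec = _mm_set_epi32(3, 2, 1, 0);          // j-1 values for j = 1..4
    const __m128i four = _mm_set1_epi32(4);
    int j = 1;
    for (; j + 3 <= N_b; j += 4) {
        _mm_storeu_si128((__m128i*)&I_j_row[j], jvec); // store four indices
        jvec = _mm_add_epi32(jvec, four);              // advance the counters
    }
    for (; j <= N_b; j++)
        I_j_row[j] = j - 1;                            // scalar remainder
}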

Example of C++ code optimization for parallel computing

I'm trying to understand optimization routines. I'm focusing on the most critical part of my code (the code has some cycles of length "nc" and one cycle of length "np", where the number "np" is much larger than "nc"). I present that part of the code here. The rest of the code is not very significant as a percentage of computational time, so I prefer to keep the rest of the algorithm clean. However, the critical cycle of length "np" is a pretty simple piece of code and it can be parallelized, so it will not hurt to rewrite this part into some more effective and less clear version (maybe with SSE instructions). I'm using the gcc compiler, C++ code, and OpenMP parallelization.
This code is part of the well-known particle-in-cell algorithm (and a basic variant at that). I'm trying to learn code optimization on this version (so my goal is not just to have an effective PIC algorithm, which has already been written in a thousand variants, but to produce a demonstrative example of code optimization). I have done some work, but I am not very sure whether I have handled all the optimization properties correctly.
const int NT = ...;      // number of threads (in two versions: about 6 or about 30)
const int np = 10000000; // np is commonly about 1000-10000 times larger than nc
const int nc = 10000;
const int step = 1000;
float u[np], x[np];
float a[nc], a_lin[nc], rho_full[NT][nc], rho_diff[NT][nc], weight[nc];
int p, num;

for (i = 0; i < step; i++) {
    // ***
    // *** some not very time consuming code for calculating
    // *** a, a_lin from the values of rho_full and rho_diff
    #pragma omp for private(p,num)
    for (k = np - 1; k > 0; k--) { // canonical loop form, required by OpenMP
        num = omp_get_thread_num();
        p = (int) x[k];
        u[k] += a[p] + a_lin[p] * (x[k] - p);
        x[k] += u[k];
        if (x[k] < 0)       { x[k] += nc; }
        else if (x[k] > nc) { x[k] -= nc; }
        p = (int) x[k];
        rho_full[num][p] += weight[k];
        rho_diff[num][p] += weight[k] * (x[k] - p);
    }
}
I realize this has problems:
1) (main question) I use a set of arrays rho_full[num][p] where num is the index of each thread. After the computation I just sum these arrays (rho_full[0][p] + rho_full[1][p] + rho_full[2][p] + ...; see the sketch after the code below). The reason is to avoid two different threads writing into the same part of an array. I am not very sure whether this is an effective solution (note that the number "nc" is relatively small, so the operations over "np" probably still dominate).
2) (also an important question) I need to read x[k] many times, and it is also changed many times. Maybe it would be better to read this value into a register and then forget the whole x array, or fix some pointer here; after all the calculations I could address x[k] again and store the obtained value. I believe the compiler does this work for me, but I am not very sure, because I modify x[k] in the middle of the algorithm. So the compiler probably does some effective work on its own, but maybe in this version it loads the value more times than necessary, because I switch between reading and storing it more than once.
3) (probably not relevant) The code works with the integer part and the fractional part below the decimal point; it needs both of these values. I obtain the integer part as p = (int) x and the remainder as x - p. I calculate this at the beginning and also at the end of the loop body. One can see that this splitting could be stored somewhere and reused in the next step (I mean the step in the i index). Do you think the following version is better? I store the integer and remainder parts in arrays instead of the whole value x.
int x_int[np];
float x_rem[np];
//...
for (k = np - 1; k > 0; k--) {
    num = omp_get_thread_num();
    u[k] += a[x_int[k]] + a_lin[x_int[k]] * x_rem[k];
    x_rem[k] += u[k];
    p = (int) x_rem[k]; // *** This part is added to simplify the rest,
    x_int[k] += p;      // *** and maybe there is a better way to realize
    x_rem[k] -= p;      // *** this "pushing correction".
    if (x_int[k] < 0)       { x_int[k] += nc; }
    else if (x_int[k] > nc) { x_int[k] -= nc; }
    rho_full[num][x_int[k]] += weight[k];
    rho_diff[num][x_int[k]] += weight[k] * x_rem[k];
}
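For reference, the combine step described in point 1 might look like this (a sketch; serial, since nc is small):

// Sum the per-thread copies of rho_full/rho_diff into row 0 after the
// parallel loop, as described in point 1.
for (int t = 1; t < NT; t++) {
    for (int p = 0; p < nc; p++) {
        rho_full[0][p] += rho_full[t][p];
        rho_diff[0][p] += rho_diff[t][p];
    }
}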
You can use an OpenMP reduction for your for loop:
float result = 0; // float, since the weights are floats
#pragma omp for nowait reduction(+:result)
for (k = np - 1; k > 0; k--) {
    num = omp_get_thread_num();
    p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0)       { x[k] += nc; }
    else if (x[k] > nc) { x[k] -= nc; }
    p = (int) x[k];
    result += weight[k] + weight[k] * (x[k] - p);
}
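If you need the per-bin rho arrays rather than a single scalar, OpenMP 4.5 and later also supports reductions over array sections, which keeps the original accumulation semantics without the manual rho_full[NT][nc] copies. A minimal sketch, assuming hypothetical 1D arrays rho_full_1d[nc] and rho_diff_1d[nc]:

// OpenMP 4.5+ array-section reduction (sketch): each thread gets a private
// copy of the nc-element sections, combined with + at the end.
#pragma omp parallel for reduction(+:rho_full_1d[0:nc], rho_diff_1d[0:nc])
for (int k = 1; k < np; k++) {
    int p = (int) x[k];
    u[k] += a[p] + a_lin[p] * (x[k] - p);
    x[k] += u[k];
    if (x[k] < 0)       x[k] += nc;
    else if (x[k] > nc) x[k] -= nc;
    p = (int) x[k];
    rho_full_1d[p] += weight[k];
    rho_diff_1d[p] += weight[k] * (x[k] - p);
}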