Intel compiler (ICC) unable to auto-vectorize inner loop (matrix multiplication) - C++

EDIT:
ICC (after adding -qopt-report=5 -qopt-report-phase:vec):
LOOP BEGIN at 4.c(107,2)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(108,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(109,4)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
It seems that c[i][j] would be read before it is written if the loop were vectorized (as I am doing a reduction). The question is: why is the reduction allowed when a local variable (tmp) is introduced?
Original issue:
I have a C snippet below which does matrix multiplication: a and b are the operands, c holds the result a*b, and n is the row and column length.
double ** c = create_matrix(...); // initializes an n*n matrix with zeroes
double ** a = fill_matrix(...);   // fills an n*n matrix with random doubles
double ** b = fill_matrix(...);   // fills an n*n matrix with random doubles

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
ICC (version 18.0.0.1) is not able to vectorize the inner loop when compiling with the -O3 flag.
ICC output:
LOOP BEGIN at 4.c(107,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(108,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(109,4)
remark #25460: No loop optimizations reported
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
However, with the change below, the compiler does vectorize the inner loop.
// OLD
for (k = 0; k < n; k++) {
    c[i][j] += a[i][k] * b[k][j];
}

// NEW
double tmp = 0;
for (k = 0; k < n; k++) {
    tmp += a[i][k] * b[k][j];
}
c[i][j] = tmp;
ICC vectorized output:
LOOP BEGIN at 4.c(119,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(120,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(134,4)
<Peeled loop for vectorization>
LOOP END
LOOP BEGIN at 4.c(134,4)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at 4.c(134,4)
<Alternate Alignment Vectorized Loop>
LOOP END
LOOP BEGIN at 4.c(134,4)
<Remainder loop for vectorization>
LOOP END
LOOP END
LOOP END
Instead of accumulating the multiplication results directly in the matrix c cell, they are accumulated in a separate variable that is assigned afterwards.
Why does the compiler not vectorize the first version? Could it be due to potential aliasing of a and/or b with elements of c (a read-after-write problem)?

Leverage Your Compiler
You don't show the flags you're using to get your vectorization report. I recommend:
-qopt-report=5 -qopt-report-phase:vec
The documentation says:
For levels n=1 through n=5, each level includes all the information of the previous level, as well as potentially some additional information. Level 5 produces the greatest level of detail. If you do not specify n, the default is level 2, which produces a medium level of detail.
With the higher level of detail, the compiler will likely tell you (in mysterious terms) why it is not vectorizing.
My suspicion is that the compiler is worried about the memory being aliased. The solution you've found allows the compiler to prove that it is not, so it performs the vectorization.
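For instance, with flat 1D arrays you could make that guarantee explicit with C99 restrict (a sketch, not your original code):
void matmul(double *restrict c, const double *restrict a,
            const double *restrict b, int n)
{
    // restrict asserts that c, a, and b never overlap, so the
    // compiler is free to vectorize the k-loop reduction.
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}
Note that restrict on a double ** only covers the pointer array itself, not the rows it points to, which is another reason to prefer flat storage (see the miscellaneous note below).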
A Portable Solution
If you're into OpenMP, you could use:
#pragma omp simd
for (k = 0; k < n; k++)
    c[i][j] += a[i][k] * b[k][j];
to accomplish the same thing. I think Intel also has a set of compiler-specific directives that will do this in a non-portable way.
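For example, ICC's #pragma ivdep tells the vectorizer to ignore assumed (unproven) dependences; applied to your loop it may be enough (a sketch, check the ICC documentation for details):
// ICC-specific: drops the assumed aliasing-based dependences on c[i][j]
#pragma ivdep
for (k = 0; k < n; k++)
    c[i][j] += a[i][k] * b[k][j];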
A Miscellaneous Note
This line:
double ** c = create_matrix(...)
makes me nervous. It suggests that somewhere you have something like this:
for(int i=0;i<10;i++)
    c[i] = new double[20];
That is, you have an array of arrays.
The problem is, this gives no guarantee that your subarrays are contiguous in memory. The result is suboptimal cache utilization. You want a 1D array that is addressed like a 2D array. Making a 2D array class which does this or using functions/macros to access elements will allow you to preserve much the same syntax while benefiting from better cache performance.
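A minimal sketch of the idea in C (the IDX macro is a hypothetical helper, not from your code):
#include <stdlib.h>

// Row-major addressing of a single contiguous n*n block.
#define IDX(i, j, n) ((i) * (n) + (j))

double *c = calloc((size_t)n * n, sizeof *c); // one contiguous, zeroed block
/* ... */
c[IDX(i, j, n)] += a[IDX(i, k, n)] * b[IDX(k, j, n)];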
You might also consider compiling with the -align flag and decorating your code appropriately. This will give better SIMD performance by allowing aligned accesses to memory.
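For instance (a sketch; _mm_malloc and __assume_aligned are Intel-specific):
#include <immintrin.h>

// 64-byte-aligned allocation, suitable for full-width SIMD loads/stores.
double *a = _mm_malloc((size_t)n * n * sizeof *a, 64);
__assume_aligned(a, 64); // promise ICC the pointer is 64-byte aligned
/* ... use a ..., then: */
_mm_free(a);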

Related

How do I indicate OpenACC to sequentially execute one instruction inside a parallel loop?

I would like the 'r_m[i] /= lines_samples;' line to be executed only once, by a single thread I mean. Do I have to add a special pragma or do anything else for the compiler to understand it?
Here is the code:
#pragma acc parallel loop
for(i=0; i<bands; i++)
{
    #pragma acc loop seq // This may be a reduction, not a seq, who knows? ^^
    for(j=0; j<lines_samples; j++)
        r_m[i] += image_vector[i*lines_samples+j];
    r_m[i] /= lines_samples;
    #pragma acc loop
    for(j=0; j<lines_samples; j++)
        R_o[i*lines_samples+j] = image_vector[i*lines_samples+j] - r_m[i];
}
Thank you a lot!
Assuming the loops are scheduled with the outer loop being "gang" and the inner loops as "vector", this line would be executed once per gang (i.e. by only one thread in the gang). So it will work as you expect.
Depending on the trip count of the first "j" loop, you may or may not want to use a reduction. Reductions do have overhead, so if the trip count is small, it may be better to leave the loop sequential. Otherwise, I suggest accumulating into a temporary scalar: as written, the loop would require an array reduction, which incurs even more overhead.
Something like:
float rmi = r_m[i];
#pragma acc loop reduction(+:rmi)
for(j=0; j<lines_samples; j++)
    rmi += image_vector[i*lines_samples+j];
r_m[i] = rmi / lines_samples;

nested vectorized openmp loops with multiple lines of code in inner-most loop

I was wondering if it is valid to use an OpenMP simd construct to collapse multiple nested loops where the code in the inner-most loop first calculates a number of indices and then uses those indices to modify a multidimensional array, as shown below. In other words, would the lines labelled I1-I4 below all be vectorized? In all the OpenMP examples I have seen, there is always a single variable whose result gets vectorized. Would the code below be considered valid? Thanks
for(std::size_t a=0;a<A;a++)
{
    #pragma omp simd collapse(3)
    for(std::size_t b=0;b<B;b++)
    {
        for(std::size_t c=0;c<C;c++)
        {
            for(std::size_t d=0;d<D;d++)
            {
                std::size_t idx1 = c*B + b;            //I1
                std::size_t idx2 = d*(B*C) + c*B + b;  //I2
                std::size_t idx3 = d*(E) + c*F + b;    //I3
                W1[idx1][idx3] += W1[idx1][a]*W2[a][idx3]; //I4
            }
        }
    }
}
This is definitely valid OpenMP code. Depending on the compiler and the target architecture, the results of compiling it may vary, but at least some compilers will definitely vectorize it. Because the indexing is likely non-linear, it will only vectorize well on a platform with both gather and scatter instructions, but it is valid regardless.

Auto-vectorization of scalar product in loop

I am trying to auto-vectorize the following loop. The i- and j-loops run over the lower triangle of a matrix. Unfortunately, according to the vectorization report, the compiler cannot vectorize (i.e. translate to AVX SIMD instructions) the j- and k-loops. But I think it should be straightforward, because there are no pointer aliases (#pragma ivdep and compiler option -D NOALIAS) and the data (x: 1D array and p: 1D array) is aligned to 64 bytes.
It could be that the if-statement is a problem, but even with the if-free alternative (an expensive shift operation counting the sign of a double) the compiler is not able to vectorize this loop.
__assume_aligned(x, 64);
__assume_aligned(p, 64);
#pragma omp parallel for simd reduction(+:accum)
for ( int i = 1 ; i < N ; i++ ){ // loop over lower triangle (i,j), OpenMP SIMD LOOP WAS VECTORIZED
    for ( int j = 0 ; j < i ; j++ ){ // <-- remark #25460: No loop optimizations reported
        double __attribute__((aligned(64))) scalarp = 0.0;
        #pragma omp simd
        for ( int k=0 ; k < D ; k++ ){ // <-- remark #25460: No loop optimizations reported
            // scalar product of \sum_k x_{i,k} \cdot x_{j,k}
            scalarp += x[i*D + k] * x[j*D + k];
        }
        // Alternative to following if:
        // accum += - ( (long long) floor( - ( scalarp + p[i] + p[j] ) ) >> 63);
        #pragma ivdep
        if ( scalarp + p[i] + p[j] >= 0 ){ // check if condition is satisfied
            accum += 1;
        }
    }
}
Does it relate to the problem that the OpenMP starting points for each thread are not known until run-time? I thought the simd clause resolves this and that Intel's auto-vectorizer is aware of it.
Intel Compiler: 18.0.2 20180210
edit: I've looked into the assembly and now it is clear that the code is already vectorized. Sorry for bothering all of you.
Looking into the assembly really helps. The code is already vectorized; the OpenMP SIMD LOOP WAS VECTORIZED remark also covers the inner loop in this particular case.

Iterative Karatsuba algorithm parallelized and vectorized using OpenACC in C++

I'm trying to parallelize an iterative version of the Karatsuba algorithm using OpenACC in C++. I would like to ask how I can vectorize the inner for loop. My compiler shows me this message about that loop:
526, Complex loop carried dependence of result-> prevents parallelization
Loop carried dependence of result-> prevents parallelization
Loop carried backward dependence of result-> prevents vectorization
and here the code of two nested loops:
#pragma acc kernels num_gangs(1024) num_workers(32) copy(result[0:2*size-1]) copyin(A[0:size], B[0:size], D[0:size])
{
    #pragma acc loop gang
    for (TYPE position = 1; position < 2 * (size - 1); position++) {
        // for even coefficients add D[i/2]
        if (position % 2 == 0)
            result[position] += D[position / 2];
        TYPE start = (position >= size) ? (position % size) + 1 : 0;
        TYPE end = (position + 1) / 2;
        // inner loop: sum (Dst) - sum (Ds + Dt) where s+t=i
        #pragma acc loop worker
        for(TYPE inner = start; inner < end; inner++){
            result[position] += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
            result[position] -= (D[inner] + D[position - inner]);
        }
    }
}
Actually, I'm not sure if it is possible to vectorize it. But if it is, I can't figure out what I'm doing wrong. Thank you
The "Complex loop carried dependence of result" problem is due to pointer aliasing. The compiler can't tell if the object that "result" points to overlaps with one of the other pointer's objects.
As a C++ extension, you can add the C99 "restrict" keyword to the declaration of your arrays. This will assert to the compiler that pointers don't alias.
Alternatively, you can add the OpenACC "independent" clause on your loop directives to tell the compiler that the loops do not have any dependencies.
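For instance (a sketch with a hypothetical signature, not your exact code):
// Option 1: restrict asserts that result, A, B, and D never alias,
// which removes the assumed dependence on result.
void multiply(TYPE *restrict result, const TYPE *restrict A,
              const TYPE *restrict B, const TYPE *restrict D, TYPE size);

// Option 2: assert directly that the iterations do not conflict.
#pragma acc loop independent
for (TYPE position = 1; position < 2 * (size - 1); position++) {
    /* loop body as before */
}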
Note that OpenACC does not support array reductions, so you won't be able to parallelize the "inner" loop unless you modify the code to use a scalar. Something like:
rtmp = result[position];
#pragma acc loop vector reduction(+:rtmp)
for(TYPE inner = start; inner < end; inner++){
    rtmp += (A[inner] + A[position - inner]) * (B[inner] + B[position - inner]);
    rtmp -= (D[inner] + D[position - inner]);
}
result[position] = rtmp;

Vectorization & #pragma omp simd

Since I got lost in all the reading about SIMD and OpenMP as they relate to vectorization, I would like to ask if somebody can clarify the above for me.
Specifically, I have a part of a C++ code I want to parallelize, but I am pretty stuck at the moment and can't figure it out on my own.
Any help clarifying what exactly vectorization is and how I can use it in the following part of code would be greatly appreciated!
for(unsigned short i=1; i<=N_a; i++) {
    for(unsigned short j=1; j<=N_b; j++) {
        temp[0] = H[i-1][j-1]+similarity_score(seq_a[i-1],seq_b[j-1]);
        temp[1] = H[i-1][j]-delta;
        temp[2] = H[i][j-1]-delta;
        temp[3] = 0.;
        H[i][j] = find_array_max(temp, 4);
        switch(ind) {
            case 0: // score in (i,j) stems from a match/mismatch
                I_i[i][j] = i-1;
                I_j[i][j] = j-1;
                break;
            case 1: // score in (i,j) stems from a deletion in sequence A
                I_i[i][j] = i-1;
                I_j[i][j] = j;
                break;
            case 2: // score in (i,j) stems from a deletion in sequence B
                I_i[i][j] = i;
                I_j[i][j] = j-1;
                break;
            case 3: // (i,j) is the beginning of a subsequence
                I_i[i][j] = i;
                I_j[i][j] = j;
                break;
        }
    }
}
Regards!
So ind is constant for both nested loops?
You might get a compiler to auto-vectorize this for you with OpenMP. (Put the line #pragma omp simd right before either of your for loops, and see if that affects the asm when you compile with -O3. I don't know OpenMP that well, so IDK if you need other options.)
Wrap it in a function that actually compiles, so I can see what happens. (e.g. by putting the code on http://gcc.godbolt.org/ to get nicely formatted asm output).
If it doesn't auto-vectorize, it's probably not too hard to manually vectorize with Intel intrinsics for x86, since you're just initializing some arrays with the array index. Keep a vector of loop counters starting with a vector of __m128i jvec = _mm_set_epi32(3, 2, 1, 0);, and increment it with _mm_add_epi32() with a vector of [ 4 4 4 4 ] (_mm_set1_epi32(4)) to increment every element by 4.
Keep a separate vector of i values, which you only modify in the outer loop, but still store in the inner loop.
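A minimal sketch of that idea (my code, assuming I_j rows are contiguous arrays of 32-bit int, which the question doesn't show):
#include <emmintrin.h> // SSE2

// Store consecutive j values into I_j[i][1..N_b], four at a time.
// This is the case-3 pattern (I_j[i][j] = j); the other cases just offset by -1.
__m128i jvec = _mm_set_epi32(4, 3, 2, 1);   // first four j values (j starts at 1)
const __m128i four = _mm_set1_epi32(4);
unsigned short j;
for (j = 1; j + 3 <= N_b; j += 4) {
    _mm_storeu_si128((__m128i*)&I_j[i][j], jvec); // store j, j+1, j+2, j+3
    jvec = _mm_add_epi32(jvec, four);             // advance all four lanes by 4
}
for (; j <= N_b; j++) // scalar remainder
    I_j[i][j] = j;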
See the x86 tag wiki for instruction-set stuff.
See the sse tag wiki for some SIMD guides, including this nice intro to SIMD and what it's all about.