Is this a good practice for vectorization? - c++

I'm trying to improve the performance of this code by vectorizing this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}
From my knowledge, you can vectorize loops that involve exactly one math operation. In the code above we have 5 math operations, so this (using OpenMP):
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
isn't going to work. However, I was wondering whether breaking the loop above into multiple loops with exactly one math operation each is a good practice for vectorization. The resulting code would be:
double p0[n], p3[n], p1[n], p2[n];
#pragma omp simd
for( int k = 0; k < n; k++ )
    p0[k] = origin[f[k].p0]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p3[k] = origin[f[k].p3]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p1[k] = origin[f[k].p1]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    p2[k] = origin[f[k].p2]*f[k].w;
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p0[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p1[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d -= p2[k];
#pragma omp simd
for( int k = 0; k < n; k++ )
    d += p3[k];
Is this a good solution, or is there anything better? Will modern compilers (say gcc) do this (or a better) kind of optimization by themselves (e.g. with -O3 enabled), so that there is actually no gain in performance?

This is generally bad HPC programming practice, for several reasons:
These days you normally have to make your code as computationally dense as possible. To achieve that, you need the highest Arithmetic Intensity (AI) for the loop that you can get. For simplicity, think of AI as the ratio of [amount of computation] divided by [number of bytes moved to/from memory in order to perform that computation].
By splitting the loop you make the AI of each resulting loop much lower in your case, because you no longer reuse the same bytes for several computations.
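To make that concrete with a rough, hedged estimate (the exact byte counts are an assumption, since they depend on the SurfHF layout and on what is already in cache): the original loop body performs about 5 floating-point operations per iteration while loading one SurfHF record plus 4 gathered origin values, i.e. a few tens of bytes, so its AI is already well below 1 operation per byte. The split version performs the same 5 operations per k, but additionally writes the temporaries p0[k]..p3[k] out to memory and reads them back in the reduction loops, so the total traffic grows severalfold and the AI of each individual loop drops accordingly.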
A (vector- or thread-) parallel reduction (in your case over the variable "d") has a cost/overhead which you don't want to multiply by 8 or 10 (i.e. by the number of loops you produced by hand).
In many cases the Intel/GCC compiler vectorization engines can optimize slightly better when several data fields of the same data object are processed in the same loop body, as opposed to the split-loop case.
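For reference, a minimal sketch of the single-loop variant with the reduction declared explicitly (an illustration, not the answerer's code; the gather-style indexing may still limit how much SIMD speedup the compiler can extract, so measure it):
double d = 0;
// keep the whole expression in one loop and tell the compiler about the reduction on d
#pragma omp simd reduction(+:d)
for( int k = 0; k < n; k++ )
    d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;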
There are a few theoretical advantages of loop splitting as well, but they don't apply to your case; I list them just in case. Loop splitting is reasonable/profitable when:
There is more than one loop-carried dependency or reduction in the same loop.
The loop split helps to get rid of some out-of-order-execution data-flow dependencies.
You do not have enough vector registers to perform all the computations (for complicated algorithms).
Intel Advisor (mentioned by you in previous questions) helps to analyze many of these factors and measures AI.
It is also true that good compilers "don't care" whether you have one such loop or the split version, because they could easily transform one case into the other on the fly.
However, the applicability of this kind of transformation is very limited in real codes, because in order to do it the compiler has to know a lot of extra information at compile time: whether pointers or dynamic arrays overlap or not, whether the data is aligned or not, etc. So you shouldn't rely on compiler transformations and on a specific compiler minor version, but should just write HPC-ready code as far as you are capable.

Related

How to obtain performance enhancement while multiplying two sub-matrices?

I've got a program multiplying two sub-matrices residing in the same container matrix. I'm trying to obtain some performance gain by using the OpenMP API for parallelization. Below is the multiplication algorithm I use.
#pragma omp parallel for
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
        for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}
The algorithm accesses the elements of both input sub-matrices row-wise to enhance cache usage through spatial locality.
What other OpenMP directives can be used to obtain better performance from that simple algorithm? Is there any other directive for optimizing the operations on the overlapping areas of two sub-matrices?
You can assume that all the sub-matrices have the same size and they are square-shaped. The resulting sub-matrix resides in another container matrix.
For the matrix-matrix product, any permutation of the i,j,k indices computes the right result when run sequentially. In parallel, not so. In your original code the k iterations do not write to unique locations, so you can not just collapse the outer two loops. Do a k,j interchange (giving i,j,k order, as in the sketch below) and then it is allowed.
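A minimal sketch of that suggestion, assuming the same matrix accessors as in the question (with the i,j,k order, each collapsed (i,j) iteration writes its own resultMatrix(i, j) element, so the collapse is race-free):
#pragma omp parallel for collapse(2)
for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
    for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
        }
    }
}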
Of course OpenMP gets you from 5 percent efficiency on one core to 5 percent on all cores. You really want to block the loops. But that is a lot harder. See the paper by Goto and van de Geijn.
I'm adding something related to the containing matrix. Do you use this code to multiply two bigger matrices? Then one of the sub-matrices is re-used between different iterations and is likely to benefit from the CPU cache. For example, if a matrix is split into 4 sub-matrices, then each sub-matrix is used twice to produce a block of the result matrix.
To benefit the most from the cache, the re-used data should be kept in the cache of the same thread (core). To do this, it may be better to move the work distribution up to the level where you select the two sub-matrices.
So, something like this:
select sub-matrix A
#pragma omp parallel for
select sub-matrix B
    for(size_t i = 0; i < matrixA.m_edgeSize; i++) {
        for(size_t k = 0; k < matrixA.m_edgeSize; k++) {
            for(size_t j = 0; j < matrixA.m_edgeSize; j++) {
                resultMatrix(i, j) += matrixA(i, k) * matrixB(k, j);
            }
        }
    }
could work faster, since the whole working set always stays within the same thread (core).
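A hedged sketch of what that pseudocode might look like in practice; numSubMatricesB, selectSubMatrix and selectResultBlock are hypothetical helpers standing in for whatever the container matrix actually provides:
#pragma omp parallel for
for(size_t blockB = 0; blockB < numSubMatricesB; blockB++) {
    const auto& matrixB     = selectSubMatrix(containerB, blockB);    // hypothetical accessor
    auto&       resultBlock = selectResultBlock(containerC, blockB);  // hypothetical accessor
    for(size_t i = 0; i < matrixA.m_edgeSize; i++)
        for(size_t k = 0; k < matrixA.m_edgeSize; k++)
            for(size_t j = 0; j < matrixA.m_edgeSize; j++)
                resultBlock(i, j) += matrixA(i, k) * matrixB(k, j);
}
This way each B sub-matrix and its result block stay with one thread, while the shared A sub-matrix is re-read by every core.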

Auto-vectorization of scalar product in loop

I am trying to auto-vectorize the following loop. The i- and j-loops run over the lower triangle of a matrix. Unfortunately, according to the vectorization report the compiler cannot vectorize (= translate to AVX SIMD instructions) the j- and k-loops. But I think it should be straightforward, because there are no pointer aliases (#pragma ivdep and compiler option -D NOALIAS) and the data (x: 1D array and p: 1D array) is aligned to 64 bytes.
It could be that the if-statement is a problem, but even with the if-free alternative (an expensive shift operation that counts the sign of a double) the compiler is not able to vectorize the loop.
__assume_aligned(x, 64);
__assume_aligned(p, 64);
#pragma omp parallel for simd reduction(+:accum)
for ( int i = 1 ; i < N ; i++ ){          // loop over lower triangle (i,j), OpenMP SIMD LOOP WAS VECTORIZED
    for ( int j = 0 ; j < i ; j++ ){      // <-- remark #25460: No loop optimizations reported
        double __attribute__((aligned(64))) scalarp = 0.0;
        #pragma omp simd
        for ( int k=0 ; k < D ; k++ ){    // <-- remark #25460: No loop optimizations reported
            // scalar product of \sum_k x_{i,k} \cdot x_{j,k}
            scalarp += x[i*D + k] * x[j*D + k];
        }
        // Alternative to the following if:
        // accum += - ( (long long) floor( - ( scalarp + p[i] + p[j] ) ) >> 63);
        #pragma ivdep
        if ( scalarp + p[i] + p[j] >= 0 ){  // check if condition is satisfied
            accum += 1;
        }
    }
}
Could the problem be that the OpenMP starting points for each OpenMP thread are not known until run time? I thought the simd clause resolves this and Intel's auto-vectorization is aware of it.
Intel Compiler: 18.0.2 20180210
edit: I've looked into the assembly and now it is clear that the code is already vectorized; sorry for bothering all of you.
Looking into the assembly really helps. The code is already vectorized. The "OpenMP SIMD LOOP WAS VECTORIZED" remark also covers the inner loop in this particular case.
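As a side note (not part of the original answer), the branch at the end of the j-loop can also be written without the shift trick, since the result of the comparison converts to 0 or 1; a minimal sketch:
accum += ( scalarp + p[i] + p[j] >= 0 );   // bool promotes to 0 or 1, no branch needed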

Efficient Tensor Multiplication

I have a matrix that is a representation of a higher-dimensional tensor, which could in principle be N-dimensional, but each dimension has the same size. Let's say I want to compute the following:
D_ijkl = sum_{m,n} a_im * C_mnkl * b_jn
and C is stored as a matrix via
C_ijkl = C(I, J)
where there is some mapping from ij to I and kl to J.
I can do this with nested for loops where each dimension of my tensor is of size 3 via
for (int i=0; i<3; i++){
    for (int j=0; j<3; j++){
        int I = map_ij_to_I(i,j);
        for (int k=0; k<3; k++){
            for (int l=0; l<3; l++){
                int J = map_kl_to_J(k,l);
                D(I,J) = 0.;
                for (int m=0; m<3; m++){
                    for (int n=0; n<3; n++){
                        int M = map_mn_to_M(m,n);
                        D(I,J) += a(i,m)*C(M,J)*b(j,n);
                    }
                }
            }
        }
    }
}
but that's pretty messy and not very efficient. I'm using the Eigen matrix library so I suspect there is probably a much better way to do this than either a for loop or coding each entry separately. I've tried the unsupported tensor library and found it was slower than my explicit loops. Any thoughts?
As a bonus question, how would I compute something like the following efficiently?
There's a lot of work that your compiler's optimizer will do for you under the hood. For one, loops with a constant number of iterations are unrolled. That may be the reason why your code is faster than the library.
I would suggest taking a look at the assembly produced with optimizations turned on, to really get a grasp of where you can optimize and what your program actually looks like once compiled.
Then of course, you can think about parallel implementations, either on the CPU (multiple threads) or on the GPU (CUDA, OpenCL, OpenACC, etc.).
As for the bonus question: if you think about writing it as two nested loops, I would suggest rearranging the expression so that the a_km term sits between the two sums. There is no need to perform that multiplication inside the inner sum, as it doesn't depend on n, although this will probably give only a slight performance benefit on modern CPUs.
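A minimal sketch of that rearrangement (the exact bonus expression is not shown here, so a(k,m) and f(m,n) are hypothetical accessors in the same style as the loops above):
// inside a loop over k; f(m,n) stands for the n-dependent part of the expression
double acc = 0.0;
for (int m = 0; m < 3; m++){
    double inner = 0.0;
    for (int n = 0; n < 3; n++)
        inner += f(m, n);      // sum the n-dependent factor first
    acc += a(k, m) * inner;    // multiply by a(k,m) once per m instead of once per (m,n)
}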

Which has better memory access? (C++) [duplicate]

This question already has answers here:
Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate]
Which version is more efficient and why?
It seems that both make the same computations. The only thing I can think of is whether the compiler recognizes that in (a) j (and hence j*M) does not change inside the inner loop and therefore doesn't have to be recomputed over and over again.
Any input would be great!
#define M /* some mildly large number */
double a[M*M], x[M], c[M];
int i, j;
(a) First version
for (j = 0; j < M; j++)
    for (i = 0; i < M; i++)
        c[j] += a[i+j*M]*x[i];
(b) Second version
for (i = 0; i < M; i++)
    for (j = 0; j < M; j++)
        c[j] += a[i+j*M]*x[i];
This is about memory-access patterns rather than computational efficiency. In general (a) is faster because it accesses memory with unit stride, which is much more cache-efficient than (b), which has a stride of M. In the case of (a) each cache line is fully utilised, whereas with (b) it is possible that only one array element will be used from each cache line before it is evicted.
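For completeness, a small hedged refinement of version (a), not taken from the question: keeping the running sum in a local variable spares the compiler from having to prove that c[] never aliases a[] or x[] before it can hold the accumulator in a register.
for (j = 0; j < M; j++) {
    double sum = c[j];          // keep the running sum in a register-friendly local
    for (i = 0; i < M; i++)
        sum += a[i+j*M]*x[i];
    c[j] = sum;                 // one store per row instead of one per inner iteration
}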
Having said that, some compilers can perform loop reordering optimisations, so in practice you may not see any difference if that happens. As always, you should benchmark/profile your code, rather than just guessing.

compiler nested loop optimization for sequential memory access.

I came across a strange performance issue in a matrix multiply benchmark (matrix_mult in Metis from the MOSBENCH suite). The benchmark was optimized to tile the data such that the active working set was 12 KB (3 tiles of 32x32 ints) and would fit into the L1 cache. To make a long story short, swapping the two innermost loops made a performance difference of almost 4x on certain array input sizes (4096, 8192) and about a 30% difference on others. The problem essentially came down to accessing elements sequentially versus in a strided pattern. Certain array sizes, I think, created a bad stride that generated a lot of cache-line collisions. The performance difference is noticeably smaller when changing from a 2-way associative L1 to an 8-way associative L1.
My question is why doesn't gcc optimize loop ordering to maximize sequential memory accesses?
Below is a simplified version of the problem (note that performance times are highly dependent on the L1 configuration; the numbers indicated below are from a 2.3 GHz AMD system with a 64 KB 2-way associative L1, compiled with -O3).
const int N = ARRAY_SIZE;   // 1024
int* mat_A = (int*)malloc(N*N*sizeof(int));
int* mat_B = (int*)malloc(N*N*sizeof(int));
int* mat_C = (int*)malloc(N*N*sizeof(int));

// Elements of mat_B are accessed in a stride pattern of length N
// This takes 800 msec
for (int t = 0; t < 1000; t++)
    for (int a = 0; a < 32; a++)
        for (int b = 0; b < 32; b++)
            for (int c = 0; c < 32; c++)
                mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];

// Inner two loops are swapped
// Elements are now accessed sequentially in the inner loop
// This takes 172 msec
for (int t = 0; t < 1000; t++)
    for (int a = 0; a < 32; a++)
        for (int c = 0; c < 32; c++)
            for (int b = 0; b < 32; b++)
                mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
gcc might not be able to prove that the pointers don't overlap. If you are fine with using non-standard extensions, you could try using __restrict.
By default, gcc doesn't take full advantage of your architecture, to avoid the need to recompile for every processor. Using the option -march with the appropriate value for your system (e.g. -march=native) might help.
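A minimal sketch of the __restrict suggestion, assuming the tile loops from the question are moved into a helper function (the __restrict qualifier is a GCC extension, not standard C++):
void multiply_tile(int* __restrict mat_C, const int* __restrict mat_A,
                   const int* __restrict mat_B, int N)
{
    // the qualifiers promise gcc that the three matrices never overlap
    for (int a = 0; a < 32; a++)
        for (int c = 0; c < 32; c++)
            for (int b = 0; b < 32; b++)
                mat_C[N*a+b] += mat_A[N*a+c] * mat_B[N*c+b];
}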
gcc has a bunch of optimizations that just do what you want.
Look up the -floop-strip-mine and -floop-block compiler options.
Quote from the manual:
Perform loop blocking transformations on loops. Blocking strip mines each loop in the loop nest such that the memory accesses of the element loops fit inside caches. The strip length can be changed using the loop-block-tile-size parameter.