Intel compiler vectorisation report: heavy-overhead vs. lightweight? (Fortran)

In this vectorization report from Intel's Fortran compiler:
LOOP BEGIN at MLFMATranslationProd.f90(38,2)
remark #15399: vectorization support: unroll factor set to 4
remark #15300: LOOP WAS VECTORIZED
remark #15462: unmasked indexed (or gather) loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 12
remark #15477: vector loop cost: 20.000
remark #15478: estimated potential speedup: 2.340
remark #15479: lightweight vector operations: 5
remark #15481: heavy-overhead vector operations: 1
remark #15488: --- end vector loop cost summary ---
LOOP END
What are the meanings of lightweight vector and heavy-overhead vector operations here?
The relevant loop looks like
do ir=1,N(lev)
   G1(lev)%D(ir) = 0.d0
   G2(lev)%D(ir) = 0.d0
enddo
with lev some integer.

Related

MATMUL result not equal with explicit calculation for double precision?

Sorry for a seemingly stupid question. I was testing the computational efficiency of replacing explicit loop operations on matrices with intrinsic functions. When I checked the matrix product results of the two methods, it confused me that the two outputs were not the same. Here is the simplified code I used:
program matmultest
   integer,parameter::nx=64,ny=32,nz=16
   real*8::mat1(nx,ny),mat2(ny,nz)
   real*8::result1(nx,nz),result2(nx,nz),diff(nx,nz)
   real*8::localsum
   integer::i,j,m
   do i=1,ny
      do j=1,nx
         mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
      enddo
   enddo
   do i=1,nz
      do j=1,ny
         mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
      enddo
   enddo
   do j=1,nz
      do i=1,nx
         localsum=0d0
         do m=1,ny
            localsum=localsum+mat1(i,m)*mat2(m,j)
         enddo
         result1(i,j)=localsum
      enddo
   enddo
   result2=matmul(mat1,mat2)
   diff=result2-result1
   print*,sum(abs(diff)),maxval(diff)
end program matmultest
And the result gives
1.6705598682165146E-008 5.8207660913467407E-011
The difference is non-zero for real*8, but it was zero when I later tested with integers. I wonder whether this is due to a fault somewhere in my code, or whether the numerical accuracy of MATMUL() is only single precision?
And the compiler I am using is GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Thanks!
francescalus explained that reordering of operations causes these differences. Let's try to find out how it actually happened.
A few words about matrix product
Consider matrices A(n,p), B(p,q), C(n,q) and C = A*B.
The naive approach, a variant of which you used, involves the following nested loops:
c = 0
do i = 1, n
   do j = 1, q
      do k = 1, p
         c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
   end do
end do
These loops can be executed in any of 6 orders, depending on the variable that you choose at each level. In the example above, the loop is named "ijk", and the other variants "ikj", "jik", etc. are all correct.
There is a speed difference, due to the memory cache: when the inner loop runs across contiguous memory elements, the loop is faster. Those are the jki and kji cases.
Indeed, since Fortran matrices are stored in column-major order, if the innermost loop runs on i, then in the instruction c(i, j) = c(i, j) + a(i, k) * b(k, j) the value b(k, j) is constant, and the operation is equivalent to v(i) = v(i) + x * u(i), where the elements of the vectors v and u are contiguous.
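For concreteness, here is the jki variant written out (just a sketch, with n, p, q and the arrays as defined above):
c = 0
do j = 1, q
   do k = 1, p
      do i = 1, n
         ! b(k,j) is loop-invariant here; c(:,j) and a(:,k) are walked contiguously
         c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
   end do
end do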
However, regarding the order of operations, there shouldn't be a difference: you can check for yourself that all elements of C are computed in the same order. At least at the "higher level": the compiler might optimize things differently, and it's where it becomes really interesting.
What about MATMUL? I believe it's usually a naive matrix product, based on the nested loops above, say a jki loop.
There are other ways to multiply matrices that use the Strassen algorithm to improve the algorithmic complexity, or blocking (i.e. computing products of submatrices) to improve cache use. Other methods that could change the result are OpenMP (i.e. multithreading) or FMA instructions. But here we are not going to delve into these methods: it's really only about the nested loops. If you are interested, there are many resources online.
A few words about optimization
Three remarks first:
On a processor without SIMD instructions, you would get the same result as MATMUL (i.e. you would print zero in the end).
If you had implemented the loops as above, you would also get the same result. There is a tiny but significant difference in your code.
If you had implemented the loops as a subroutine, you would also get the same result. Here I suspect the compiler optimizer is doing some reordering, as I can't reproduce your "accumulator" code with a subroutine, at least with Intel Fortran.
Here is your implementation:
do i = 1, n
   do j = 1, q
      s = 0
      do k = 1, p
         s = s + a(i, k) * b(k, j)
      end do
      c(i, j) = s
   end do
end do
It's also correct of course. Here, you are using an accumulator, and at the end of the innermost loop, the value of the accumulator is written in the matrix C.
Optimization typically matters mainly for the innermost loop. For our purpose, if we strip away all other details, two "basic" instructions in the innermost loop are relevant:
v(i) = v(i) + x*u(i) (the jki loop)
s = s + x(k)*y(k) (the accumulator loop where y is contiguous in memory, but not x)
The first is usually called a "daxpy" (from the name of a BLAS routine), for "A X Plus Y", the "D" meaning double precision. The second one is just an accumulator.
On an old sequential processor, there is not much to be done to optimize. On a modern processor with SIMD, registers can hold several values, and computations can be done on all of them at once, in parallel. For instance, on x86, an XMM register (from the SSE instruction set) can hold two double precision floating-point numbers, a YMM register (from AVX) can hold four, and a ZMM register (from AVX-512, found on Xeon) can hold eight.
For instance, on YMM the innermost loop will be "unrolled" to deal with four vector elements at a time (or even more if using several registers).
Here is what the basic loop block is then roughly doing:
daxpy case:
Read 4 numbers from u into register YMM1
Read 4 numbers from v into register YMM2
x is constant and is kept in another register
Multiply in parallel x with YMM1, add in parallel to YMM2, put the result in YMM2
Write back the result to corresponding elements of v
The read/write part is faster if the elements are contiguous in memory, but if they are not it's still worth doing this in parallel.
Note that here, we haven't changed the execution order of additions of the high level Fortran loop.
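In Fortran terms, the unrolled daxpy body amounts roughly to the following (a sketch only, assuming for simplicity that n is a multiple of 4 so there is no remainder loop):
do i = 1, n, 4
   ! one packed (4-wide) multiply-add per iteration;
   ! each element v(i) still receives its additions in the original order
   v(i:i+3) = v(i:i+3) + x * u(i:i+3)
end do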
accumulator case
For the parallelism to be useful, there will be a trick: accumulate four values in parallel in a YMM register, and then add the four accumulated values.
The basic loop block is thus doing this:
The accumulator is kept in YMM3 (four numbers)
Read 4 numbers from X into register YMM1
Read 4 numbers from Y into register YMM2
Multiply in parallel YMM1 with YMM2, add in parallel to YMM3
At the end of the innermost loop, add the four components of the accumulator, and write this back as the matrix element.
It's as if we had computed:
s1 = x(1)*y(1) + x(5)*y(5) + ... + x(29)*y(29)
s2 = x(2)*y(2) + x(6)*y(6) + ... + x(30)*y(30)
s3 = x(3)*y(3) + x(7)*y(7) + ... + x(31)*y(31)
s4 = x(4)*y(4) + x(8)*y(8) + ... + x(32)*y(32)
And then the matrix element written is c(i,j) = s1+s2+s3+s4.
Here the order of additions has changed! And since floating-point addition is not associative, a different order very likely gives a slightly different result.
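In loop form, the same four-lane accumulation looks roughly like this (a sketch for the ny = 32 inner dimension of the question, assuming the trip count is a multiple of 4):
s1 = 0d0; s2 = 0d0; s3 = 0d0; s4 = 0d0
do k = 1, ny, 4
   s1 = s1 + a(i,k  )*b(k  ,j)
   s2 = s2 + a(i,k+1)*b(k+1,j)
   s3 = s3 + a(i,k+2)*b(k+2,j)
   s4 = s4 + a(i,k+3)*b(k+3,j)
end do
c(i,j) = s1 + s2 + s3 + s4   ! this final reduction is where the summation order differs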
I can replicate the results when using fast math (I have Intel Fortran). When I compile with the default /fp:fast I get the following max error and speed:
! Error Loops Matmul
! 0.58208E-10 107526.9 140056.0 FAST
The error is just maxval(abs(diff)); the speed is measured in matrix products per second.
But when I compile with /fp:strict I get no error, but a slowdown in the loops:
! Error Loops Matmul
! 0.0000 43140.6 141844.0 STRICT
I see a roughly 60% slowdown in the loops with strict floating-point handling, but surprisingly no slowdown with the matmul() function.
Source Code for completeness
program Console1
   use iso_fortran_env
   implicit none
   integer,parameter :: nr = 100000
   integer,parameter :: nx=64,ny=32,nz=16
   real(real64) :: mat1(nx,ny),mat2(ny,nz)
   real(real64) :: result1(nx,nz),result2(nx,nz),diff(nx,nz)
   real(real64) :: localsum
   integer :: i,j,r
   integer(int64) :: tic, toc, rate
   real(real64) :: dt1, dt2

   do i=1,ny
      do j=1,nx
         mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
      enddo
   enddo
   do i=1,nz
      do j=1,ny
         mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
      enddo
   enddo

   call SYSTEM_CLOCK(tic,rate)
   do r=1, nr
      result1=mymatmul(mat1,mat2)
   end do
   call SYSTEM_CLOCK(toc,rate)
   dt1 = dble(toc-tic)/rate

   call SYSTEM_CLOCK(tic,rate)
   do r=1, nr
      result2=matmul(mat1,mat2)
   end do
   call SYSTEM_CLOCK(toc,rate)
   dt2 = dble(toc-tic)/rate

   diff=result2-result1
   print ('(1x,a16,1x,a16,1x,a16)'), "Error", "Loops", "Matmul"
   print ('(1x,g16.5,1x,f16.1,1x,f16.1)'), maxval(abs(diff)), nr/dt1, nr/dt2
   ! Error            Loops            Matmul
   ! 0.58208E-10      107526.9         140056.0   FAST
   ! 0.0000           43140.6          141844.0   STRICT
   !
contains

   pure function mymatmul(a,b) result(c)
      real(real64), intent(in) :: a(:,:), b(:,:)
      real(real64) :: c(size(a,1), size(b,2))
      integer :: i,j,k
      real(real64) :: sum
      do j=1, size(c,2)
         do i=1, size(c,1)
            sum = 0d0
            do k=1, size(a,2)
               sum = sum + a(i,k)*b(k,j)
            end do
            c(i,j) = sum
         end do
      end do
   end function

end program Console1
Always compiled as Release-x64 and not Debug.

Explanation of difference in performance between CPLEX, Gurobi and FICO Xpress using interior point method (barrier) without crossover?

I am working with a very large (stochastic) LP with the barrier algorithm without crossover. My model is implemented in Pyomo, and I have tried to use CPLEX, Gurobi and FICO Xpress to solve it. The solver settings in Pyomo are as follows:
For CPLEX:
opt = SolverFactory("cplex")
opt.options["lpmethod"] = 4
opt.options["barrier crossover"] = -1
results = opt.solve(instance)
For Gurobi:
opt = SolverFactory('gurobi_persistent')
opt.set_instance(instance)
opt.options["Crossover"]=0
opt.options["Method"]=2
results = opt.solve(instance, load_solutions=True)
results = opt.load_vars()
results = opt.load_duals()
For FICO Xpress:
opt = SolverFactory("xpress")
opt.options["defaultAlg"] = 4
opt.options["crossover"] = 0
results = opt.solve(instance)
All solvers find a solution, but with (very) varying speed:
Gurobi finds a sub-optimal solution in 4593 s (9.98262837e+11) using 337 iterations
FICO Xpress finds the optimal solution in 7981 s (9.98246410e+11) using 169 iterations
CPLEX finds a sub-optimal solution in 40,178 s (9.98250954e+11) using 258 iterations
My question is: Why is there such a huge difference between the solvers (especially comparing CPLEX and Gurobi)? What is going on in the different barrier algorithms?
Summary of log for CPLEX:
IBM(R) ILOG(R) CPLEX(R) Interactive Optimizer 12.8.0.0
Read time = 52.61 sec. (3283.43 ticks)
Objective sense : Minimize
Variables : 17684371
Objective nonzeros : 7976817
Linear constraints : 26929486 [Less: 25202826, Equal: 1726660]
Nonzeros : 83463204
RHS nonzeros : 621453
Tried aggregator 1 time.
DUAL formed by presolve
Aggregator has done 14545 substitutions...
LP Presolve eliminated 8512063 rows and 3459945 columns.
Reduced LP has 14209881 rows, 21009396 columns, and 61814653 nonzeros.
Presolve time = 148.20 sec. (209740.04 ticks)
Parallel mode: using up to 24 threads for barrier.
***NOTE: Found 243 dense columns.
Number of nonzeros in lower triangle of A*A' = 268787475
Elapsed ordering time = 17.45 sec. (10000.00 ticks)
Using Nested Dissection ordering
Total time for automatic ordering = 376.13 sec. (209058.23 ticks)
Summary statistics for Cholesky factor:
Threads = 24
Rows in Factor = 14210124
Integer space required = 145889976
Total non-zeros in factor = 12261481354
Total FP ops to factor = 39156639536792
Total time on 24 threads = 40177.89 sec. (62234738.71 ticks)
Barrier - Non-optimal: Objective = 9.9825095360e+11
Solution time = 40177.90 sec. Iterations = 258
Summary of log for Xpress:
FICO Xpress Solver 64bit v8.5.0:
Problem Statistics
26929486 ( 0 spare) rows
17684371 ( 0 spare) structural columns
83463204 ( 0 spare) non-zero elements
Presolved problem has:
18426768 rows 14805105 cols 59881674 elements
Barrier cache sizes : L1=32K L2=20480K
Using AVX support
Cores per CPU (CORESPERCPU): 12
Barrier starts, using up to 24 threads, 12 cores
Matrix ordering - Dense cols.: 6776 NZ(L): 485925484 Flops: 273369311062
Barrier method finished in 7874 seconds
Optimal solution found
Barrier solved problem
169 barrier iterations in 7981s
Final objective : 9.982464100682021e+11
Max primal violation (abs / rel) : 1.612e-03 / 1.612e-03
Max dual violation (abs / rel) : 1.837e+02 / 7.381e+01
Max complementarity viol. (abs / rel) : 1.837e+02 / 1.675e-07
Summary of log for Gurobi:
Gurobi 8.0.0:
Optimize a model with 26929485 rows, 17684370 columns and 83463203 nonzeros
Coefficient statistics:
Matrix range [1e-05, 4e+00]
Objective range [2e+00, 8e+06]
Bounds range [0e+00, 0e+00]
RHS range [1e-01, 2e+08]
Presolve removed 8527789 rows and 2871939 columns
Presolve time: 53.79s
Presolved: 18401696 rows, 14812431 columns, 59808411 nonzeros
Ordering time: 29.38s
Barrier statistics:
Dense cols : 4607
AA' NZ : 6.262e+07
Factor NZ : 5.722e+08 (roughly 18.0 GBytes of memory)
Factor Ops : 3.292e+11 (roughly 4 seconds per iteration)
Threads : 12
Barrier performed 337 iterations in 4592.92 seconds
Sub-optimal termination - objective 9.98262837e+11
With CPLEX the factorization is much more expensive.
#FLOPS CPLEX ~3.92e+13
#FLOPS Xpress ~2.73e+11
#FLOPS Gurobi ~3.29e+11
I've seen such a discrepancy a few times for large-scale LPs. In most of these cases, CPLEX made a bad decision by passing the dual to the optimizer:
DUAL formed by presolve
Setting the PreDual parameter to -1 prevents that.

Intel compiler (ICC) unable to auto vectorize inner loop (matrix multiplication)

EDIT:
ICC (after adding -qopt-report=5 -qopt-report-phase:vec):
LOOP BEGIN at 4.c(107,2)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(108,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(109,4)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
It seems that c[i][j] would be read before it is written if the loop were vectorized (since I am doing a reduction). The question is: why is the reduction allowed when a local variable (tmp) is introduced?
Original issue:
I have a C snippet below which does matrix multiplication: a and b are the operands, c is the a*b result, and n is the row and column length.
double ** c = create_matrix(...) // initialize n*n matrix with zeroes
double ** a = fill_matrix(...) // fills n*n matrix with random doubles
double ** b = fill_matrix(...) // fills n*n matrix with random doubles
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
The ICC (version 18.0.0.1) is not able to vectorize (provided -O3 flag) the inner loop.
ICC output:
LOOP BEGIN at 4.c(107,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(108,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(109,4)
remark #25460: No loop optimizations reported
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
However, with the changes below, the compiler vectorizes the inner loop.
// OLD
for (k = 0; k < n; k++) {
    c[i][j] += a[i][k] * b[k][j];
}
// TO (NEW)
double tmp = 0;
for (k = 0; k < n; k++) {
    tmp += a[i][k] * b[k][j];
}
c[i][j] = tmp;
ICC vectorized output:
LOOP BEGIN at 4.c(119,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(120,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(134,4)
<Peeled loop for vectorization>
LOOP END
LOOP BEGIN at 4.c(134,4)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at 4.c(134,4)
<Alternate Alignment Vectorized Loop>
LOOP END
LOOP BEGIN at 4.c(134,4)
<Remainder loop for vectorization>
LOOP END
LOOP END
LOOP END
Instead of accumulating vector multiplication result in matrix C cell, the result is accumulated in a separate variable and assigned later.
Why does the compiler not optimize the first version? Could it be due to potential aliasing of a and/or b with elements of c (a read-after-write hazard)?
Leverage Your Compiler
You don't show the flags you're using to get your vectorization report. I recommend:
-qopt-report=5 -qopt-report-phase:vec
The documentation says:
For levels n=1 through n=5, each level includes all the information of the previous level, as well as potentially some additional information. Level 5 produces the greatest level of detail. If you do not specify n, the default is level 2, which produces a medium level of detail.
With the higher level of detail, the compiler will likely tell you (in mysterious terms) why it is not vectorizing.
My suspicion is that the compiler is worried about the memory being aliased. The solution you've found allows the compiler to prove that it is not, so it performs the vectorization.
A Portable Solution
If you're into OpenMP, you could use:
#pragma omp simd
for (k = 0; k < n; k++)
    c[i][j] += a[i][k] * b[k][j];
to accomplish the same thing. I think Intel also has a set of compiler-specific directives that will do this in a non-portable way.
A Miscellaneous Note
This line:
double ** c = create_matrix(...)
makes me nervous. It suggests that somewhere you have something like this:
for(int i=0;i<10;i++)
c[i] = new double[20];
That is, you have an array of arrays.
The problem is, this gives no guarantee that your subarrays are contiguous in memory. The result is suboptimal cache utilization. You want a 1D array that is addressed like a 2D array. Making a 2D array class which does this or using functions/macros to access elements will allow you to preserve much the same syntax while benefiting from better cache performance.
You might also consider compiling with the -align flag and decorating your code appropriately. This will give better SIMD performance by allowing aligned accesses to memory.

Alignment of multi-dimensional array for omp simd

If I understand the aligned clause of the omp simd construct correctly, it refers to the alignment of the whole array.
How is it used for multi-dimensional arrays? Assume
ni = 131; nj = 137; nk = 127
! allocates arr(1:131,1:137,1:127) aligned to 64 bytes
call somehow_allocate_aligned(arr, [ni,nj,nk], 64)
!$omp parallel do collapse(2)
do k = 1, nk
   do j = 1, nj
      call some_complicated_subroutine(arr(:,j,k))
      !$omp simd aligned(arr:64)
      do i = 1, ni
         arr(i,j,k) = some arithmetic expression involving arr(i,j,k)
      end do
   end do
end do
!$omp end parallel do
Is this the correct way to indicate the alignment of the array although the iteration of the inner loop starts at arr(1,j,k)?
How does the compiler use that information to infer anything about the alignment of the inner loop subarray?
Does it matter for the performance if the run-time sizes are nicer (say 128, 128, 128)?
It is explained here, slides 160-165 : http://irpf90.ups-tlse.fr/files/parallel_programming.pdf
You should:
1) align the array;
2) use padding to force all your columns to be aligned: your first dimension (the one specified in the allocate statement) should be a multiple of the number of elements that fill a 16-, 32- or 64-byte boundary, depending on the instruction set.
For example, for a 99x29x200 matrix with the AVX instruction set (32-byte alignment) in double precision (8 bytes per element), you could do:
real(8), allocatable :: A(:,:,:)
!DIR$ ATTRIBUTES ALIGN : 32 :: A   ! the alignment directive goes with the declaration
integer :: n, l, m, n_pad, delta_n, i, j, k

n = 99
l = 29
m = 200
delta_n = mod(n, 32/8)
if (delta_n == 0) then
   n_pad = n
else
   n_pad = n - delta_n + 32/8
end if
allocate( A(n_pad,l,m) )

do k = 1, m
   do j = 1, l
      !$OMP SIMD
      do i = 1, n
         A(i,j,k) = ...
      end do
   end do
end do
You can use the C preprocessor to make the code portable, replacing the 32 and 8 in the previous example.
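A minimal sketch of that idea, assuming the file is passed through the preprocessor (the macro and variable names here are only illustrative):
! 32-byte alignment for AVX; 8 bytes per double precision element
#define ALIGN_BYTES 32
#define ELEM_BYTES 8
! number of elements per alignment boundary
integer, parameter :: epb = ALIGN_BYTES / ELEM_BYTES
! round the leading (first) dimension up to the next multiple of epb
n_pad = ((n + epb - 1) / epb) * epb
allocate( A(n_pad, l, m) )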
Note: be careful with statements such as B = A for arrays, as the physical dimensions will no longer correspond to the logical dimensions. Good practice is to write the bounds explicitly, B(1:n,1:l,1:m) = A(1:n,1:l,1:m), as this will still work if you change the physical dimensions.

openmp issues when three do-loops are involved (fortran)

I am very confused about this problem regarding openmp in fortran. Specifically, when I write the program like this:
PROGRAM TEST
   IMPLICIT NONE
   INTEGER :: i,j,l
   INTEGER :: M(2,2)
   i=2
   j=2
   l=41
   !$OMP PARALLEL SHARED(M),PRIVATE(l,i,j)
   !$OMP DO
   DO i=1,2
      DO j=1,2
         DO l=0,41
            M(i,j)=M(i,j)+1
         ENDDO
      ENDDO
   ENDDO
   !$OMP END DO
   !$OMP END PARALLEL
END PROGRAM TEST
After compiling with ifort -openmp test.f90, it works well, and the result M(1,1) is 42 as expected.
However, when I merely change the order of the summation over l and {i,j}, like the following:
PROGRAM TEST
   IMPLICIT NONE
   INTEGER :: i,j,l
   INTEGER :: M(2,2)
   i=2
   j=2
   l=41
   !$OMP PARALLEL SHARED(M),PRIVATE(l,i,j)
   !$OMP DO
   DO l=0,41
      DO i=1,2
         DO j=1,2
            M(i,j)=M(i,j)+1
         ENDDO
      ENDDO
   ENDDO
   !$OMP END DO
   !$OMP END PARALLEL
END PROGRAM TEST
After compiling with ifort -openmp test.f90, it doesn't work well. In fact, when you run a.out several times, the result M(1,1) seems to be random. Does anyone know what the problem is? Also, if I want to obtain the right results with the summation order
DO l=0,41
DO i=1,2
DO j=1,2
which part of this code should I modify?
Many thanks for any help.
You have a race condition: threads with different l are trying to update the same element M(i,j). You can use tools like Intel Inspector or Oracle Thread Analyzer to find it (I checked with Intel). The best thing to do is to use your original loop order. You can also use a reduction, but be careful with larger arrays:
PROGRAM TEST
   IMPLICIT NONE
   INTEGER :: i,j,l
   INTEGER :: M(2,2)
   M = 0
   !$OMP PARALLEL DO PRIVATE(l,i,j), REDUCTION(+:M)
   DO l = 0, 41
      DO i = 1, 2
         DO j = 1, 2
            M(i,j) = M(i,j) + 1
         END DO
      END DO
   END DO
   !$OMP END PARALLEL DO
   print *, M
END PROGRAM
There are several problems with your approach. First of all, your array M is never initialized. Inside your loop, you execute
M(i,j) = M(i,j) + 1
without having given any initial value to M(i,j). So the algorithm is nondeterministic even in the serial case, and it is just a matter of luck that you obtain the right result with any specific compiler or any specific summation order.
Additionally, if you parallelize the loop over l, like
!$OMP PARALLEL DO SHARED(M),PRIVATE(l,i,j)
DO l = 0, 41
   DO i = 1, 2
      DO j = 1, 2
         M(i,j) = M(i,j) + 1
      END DO
   END DO
END DO
every thread will have its own nested loop construct over i and j covering all matrix elements. Consequently, different threads will access the same elements of the matrix at the same time, and the result is again nondeterministic. You could, of course, try to solve the issue by ensuring, via OpenMP constructs, that the threads wait on each other before accessing a certain matrix element. However, that would make the algorithm far too slow. The best you can do in this case, in my opinion, is to parallelize over the matrix elements (the loops over i and j), as sketched below.
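A sketch of that approach (M still has to be initialised to zero first; with a 2x2 matrix the parallelism is of course trivial, but the pattern is what matters):
M = 0
!$OMP PARALLEL DO COLLAPSE(2) SHARED(M) PRIVATE(i,j,l)
DO i = 1, 2
   DO j = 1, 2
      DO l = 0, 41
         ! each element M(i,j) is now updated by exactly one thread
         M(i,j) = M(i,j) + 1
      END DO
   END DO
END DO
!$OMP END PARALLEL DO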
By the way, the lines
i=2
j=2
l=41
in your code are superfluous, since you immediately use i, j and l as loop variables, so their initial values will be overwritten anyway.