Do most compilers optimize MATMUL(TRANSPOSE(A),B)? - fortran

In a Fortran program, I need to compute several expressions like M·v, Mᵀ·v, Mᵀ·M, M·Mᵀ, etc.
Here, M and v are 2D and 1D arrays of small size (less than 100, typically around 2-10).
I was wondering if writing MATMUL(TRANSPOSE(M),v) would unfold at compile time into code as efficient as MATMUL(N,v), where N is explicitly stored as N=TRANSPOSE(M). I am specifically interested in the GNU and ifort compilers with "strong" optimization flags (-O2, -O3 or -Ofast, for instance).

Below you find a couple of execution times of various methods.
system:
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
cache size : 6144 KB
RAM : 16 GB
GNU Fortran (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
ifort (IFORT) 18.0.5 20180823
BLAS : for the GNU compiler, the default (reference) BLAS is used
compilation:
[gnu] $ gfortran -O3 x.f90 -lblas
[intel] $ ifort -O3 -mkl x.f90
execution:
[gnu] $ ./a.out > matmul.gnu.txt
[intel] $ export MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt
To make the results as neutral as possible, I've rescaled the timings by the average time of an equivalent set of operations.
I ignored threading.
matrix times vector
Six different implementations were compared:
manual: do j=1,n; w(j)=0; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
matmul: w=matmul(P,v)
blas N: dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
matmul-transpose: w=matmul(transpose(P),v)
matmul-transpose-tmp: Q=transpose(P); w=matmul(Q,v)
blas T: dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
In Figures 1 and 2 you can compare the timing results for the above cases. Overall, using an explicit temporary is not advisable with either gfortran or ifort: both compilers optimize MATMUL(TRANSPOSE(P),v) much better. With gfortran, the built-in MATMUL is faster than the default BLAS; with ifort, MKL BLAS is clearly faster.
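One way to see why no temporary is needed: the j-th component of TRANSPOSE(P)·v is a dot product with the j-th column of P, which is contiguous in Fortran's column-major layout. A minimal sketch of this access pattern (an illustration of the idea, not necessarily the code either compiler emits):
do j = 1, n
   ! matmul(transpose(P), v) evaluated column-wise: P is read with
   ! unit stride and the transpose is never materialized.
   w(j) = dot_product(P(:,j), v)
end do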
figure 1: Matrix-vector multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.
figure 2: Matrix-vector multiplication. Comparison of various implementations run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n² × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.
matrix times matrix
Six different implementations were compared:
manual: do l=1,n; do j=1,n; Q(j,l)=0; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
matmul: Q=matmul(P,P)
blas N: dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
matmul-transpose: Q=matmul(transpose(P),P)
matmul-transpose-tmp: Q=transpose(P); R=matmul(Q,P)
blas T: dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
In Figures 3 and 4 you can compare the timing results for the above cases. In contrast to the vector case, using a temporary is advisable only with gfortran. As before, with gfortran the built-in MATMUL is faster than the default BLAS, while with ifort MKL BLAS is clearly faster. Remarkably, the manual implementation is comparable to MKL BLAS.
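A plausible reason the transpose again costs little: every entry of TRANSPOSE(P)·P is a dot product of two columns of P, so both operands stream through memory with unit stride. A sketch of that formulation (again an illustration, not the compilers' actual code):
do l = 1, n
   do j = 1, n
      ! (P^T P)(j,l) = dot product of columns j and l; both
      ! P(:,j) and P(:,l) are contiguous in column-major storage.
      R(j,l) = dot_product(P(:,j), P(:,l))
   end do
end do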
figure 3: Matrix-matrix multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.
figure 4: Matrix-matrix multiplication. Comparison of various implementations run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n³ × δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.
The code used:
program matmul_test
implicit none
double precision, dimension(:,:), allocatable :: P,Q,R
double precision, dimension(:), allocatable :: v,w
integer :: n,i,j,k,l
double precision,dimension(12) :: t1,t2
do n = 1,1000
allocate(P(n,n),Q(n,n), R(n,n), v(n),w(n))
call random_number(P)
call random_number(v)
i=0
i=i+1
call cpu_time(t1(i))
do j=1,n; w(j)=0; do k=1,n; w(j) = w(j) + P(j,k)*v(k); end do; end do
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
w=matmul(P,v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
w=matmul(transpose(P),v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=transpose(P)
w=matmul(Q,v)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
do l=1,n; do j=1,n; Q(j,l)=0; do k=1,n; Q(j,l) = Q(j,l) + P(j,k)*P(k,l); end do; end do; end do
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=matmul(P,P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=matmul(transpose(P),P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
Q=transpose(P)
R=matmul(Q,P)
call cpu_time(t2(i))
i=i+1
call cpu_time(t1(i))
call dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
call cpu_time(t2(i))
write(*,'(I6,12D25.17)') n, t2-t1
deallocate(P,Q,R,v,w)
end do
end program matmul_test

Related

Compiler optimization when variables are reused

While benchmarking 'subtracting a vector from a matrix', I noticed that Fortran compilers appear to perform some sort of optimization when I reuse variables/code. It looks like the arrays are being reused from cache memory, but I'm not sure.
I believe this optimization is causing discrepancies in my benchmark results, and I would like to identify the specific type of optimization and, if possible, turn it off.
For example, in the following code comparing two cases, an additional Case 3 is introduced that is identical to Case 1. However, the time reported for Case 3 is much less than that for Case 1.
program main
implicit none
integer :: n = 1E7
real*8, dimension(3) :: a
real*8, allocatable, dimension(:, :) :: b, c
real :: start, finish
integer :: i
allocate(b(n, 3))
allocate(c(n, 3))
call random_number(a)
call random_number(b)
! Case 1: Do loop
call cpu_time(start)
do i = 1, 3
c(:, i) = b(:, i) - a(i)
enddo
call cpu_time(finish)
print*, 'do-loop : ', finish-start
! Case 2: Spread
call cpu_time(start)
c = b - spread(a, dim=1, ncopies=n)
call cpu_time(finish)
print*, 'spread : ', finish-start
! Case 3: Do loop (again)
call cpu_time(start)
do i = 1, 3
c(:, i) = b(:, i) - a(i)
enddo
call cpu_time(finish)
print*, 'do-loop : ', finish-start
end program main
This produces similar results with the Intel and GNU compilers, as shown below. I have tried investigating with flags like -O0 and -qopt-report, but I cannot understand why the code behaves this way. Because the arrays are large, ulimit -s unlimited might be required (on Linux) to avoid a segmentation fault.
$ ifort reuse.f90 && ./a.out
do-loop : 0.2072840
spread : 0.4781271
do-loop : 3.6670923E-02
$ gfortran reuse.f90 && ./a.out
do-loop : 0.232345015
spread : 0.342370987
do-loop : 4.52849865E-02
At least on Linux, the memory allocator uses an "optimistic memory allocation strategy" (or see Why can Fortran allocate such large arrays? for Fortran). It assumes that there will be enough memory, assigns the virtual address space, and that is all. The memory pages are only assigned when you first access the memory by storing some values (or trying to read the undefined garbage).
That has two implications:
If you requested too much memory, the allocate may still succeed and the program may crash later.
The first access will take more time.
To remove the problem with the latter, initialize the memory first, e.g. C = 0.
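In the code above, that means touching c once right after allocation, before any timed region; a minimal sketch:
! First touch: forces the OS to actually map the pages of c,
! so Case 1 is no longer charged for the page faults.
c = 0.0d0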
There are other reasons why you should disregard the first runs of any tests and always run them multiple times - not just one long test, but multiple short runs. There are various turbo modes in modern CPUs that may take some time to start, for example.

Poor scaling and a segmentation fault in a Fortran OpenMP code

I'm having some trouble when executing a program with a parallel do. Here is a test code.
module test
use, intrinsic :: iso_fortran_env, only: dp => real64
implicit none
contains
subroutine Addition(x,y,s)
real(dp),intent(in) :: x,y
real(dp), intent(out) :: s
s = x+y
end subroutine Addition
function linspace(length,xi,xf) result (vec)
! function to create an equally spaced vector given a begin and end point
real(dp),intent(in) :: xi,xf
integer, intent(in) :: length
real(dp),dimension(1:length) :: vec
integer ::i
real(dp) :: increment
increment = (xf-xi)/real(length-1,dp) ! convert in dp to avoid a single-precision intermediate
vec(1) = xi
do i = 2,length
vec(i) = vec(i-1) + increment
end do
end function linspace
end module test
program paralleltest
use, intrinsic :: iso_fortran_env, only: dp => real64
use test
use :: omp_lib
implicit none
integer, parameter :: length = 1000
real(dp),dimension(length) :: x,y
real(dp) :: s
integer:: i,j
integer :: num_threads = 8
real(dp),dimension(length,length) :: SMatrix
x = linspace(length,.0d0,1.0d0)
y = linspace(length,2.0d0,3.0d0)
!$ call omp_set_num_threads(num_threads)
!$OMP PARALLEL DO
do i=1,size(x)
do j = 1,size(y)
call Addition(x(i),y(j),s)
SMatrix(i,j) = s
end do
end do
!$OMP END PARALLEL DO
open(unit=1,file ='Add6.dat')
do i= 1,size(x)
do j= 1,size(y)
write(1,*) x(i),";",y(j),";",SMatrix(i,j)
end do
end do
close(unit=1)
end program paralleltest
I'm compiling and running the program in the following way: gfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium and then export OMP_NUM_THREADS=8.
This simple code raises at least two big questions about parallel do for me. The first is that if I run it with length = 1100 or greater, I get a Segmentation fault (core dumped) error message, but with smaller values it runs with no problem. The second is about the time it takes. When I run it with length = 1000 (timed with time ./pt.out), it takes 1.732 s, but if I run it sequentially (without the -fopenmp flag, and with taskset -c 4 time ./pt.out) it takes 1.714 s. I guess the difference between the two would appear in a longer and more complex code, where parallelism is more useful. In fact, when I tried it with more complex calculations running in parallel with eight threads, the time was reduced to half of the sequential time, but not to an eighth as I expected. In view of this, my questions are: is optimization always available, or is it code dependent? And second, is there a friendly way to control which thread runs which iterations? That is, the first thread running the first length/8 iterations, and so on, like performing several tasksets with different code, where each contains the iterations I want.
As I commented, the segmentation fault has been treated elsewhere (Why Segmentation fault is happening in this openmp code?). I would use an allocatable array, but you can also set the stack size using ulimit -s.
Regarding the time, almost all of the runtime is spent in writing the array to the external file.
But even if you remove that and measure only the time spent in the parallel section using omp_get_wtime(), and increase the problem size, it still does not scale too well. This is because there is very little computation for the CPU to do and a lot of array writing to memory (accessing main memory is slow - cache misses).
As Jean-Claude Arbaut pointed out, your loop order is wrong and makes accessing the memory even slower. Some compilers can change that for you with higher optimization levels (-O2 or -O3), but only some of them.
And even worse, as Jim Cownie pointed out, you have a race condition. Multiple threads try to use the same s for both reading and writing, and the program is invalid. You need to make s private using private(s).
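A sketch of the parallel region with both fixes applied (s made private, and the loops swapped so the writes to SMatrix are unit-stride):
!$OMP PARALLEL DO PRIVATE(s)
do j = 1, size(y)
   do i = 1, size(x)
      call Addition(x(i), y(j), s)
      SMatrix(i, j) = s   ! i varies fastest: column-major friendly
   end do
end do
!$OMP END PARALLEL DO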
With the above fixes I get a roughly two times faster parallel section with four cores and four threads. Don't try to use hyper-threading, it slows the program down.
If you give the CPU more computational work to do, like s = Bessel_J0(x)/Bessel_J1(y), it scales pretty well for me: almost four times faster with four threads, and hyper-threading does speed it up a little bit.
Finally, I suggest removing the manual setting of the number of threads; it is a pain for testing. If you remove it, you can use OMP_NUM_THREADS=4 ./a.out easily.

fortran limits of vectorization

I have written a function that returns a vector A equal to the product of a sparse matrix Sparse and a vector F. The non-zero values of the matrix are stored in Sparse(nnz); rowind(nnz) and colind(nnz) contain the row and column of each particular value of Sparse. It was relatively simple to vectorize the (now commented) inner loop via the two lines beneath do kx...; I cannot see how to vectorize the outer loop, since pos has a different size for each kx.
The question is : can the outer loop (do kx=1,nxy) be vectorized, and if yes how?
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Vladimir F correctly surmises that I come from the Python/Octave world. I have moved (back) to Fortran to get more performance out of my hardware, as the PDEs I solve become larger. As of half an hour ago, vectorization meant getting rid of do loops, something Fortran seems very good at: the time savings from replacing the "inner loop" (do ky=1,size(pos)...) by the two lines above it are astonishing. I have looked at the information given by gfortran (really gcc?) when -fopt-info is invoked, and I see that loop modification is often used. I will immediately go and read about SIMD and array notation. Please, please, if there are good sources on this topic, let me know.
In reply to Holz: there are myriad ways to store sparse matrices, usually lowering the rank of the operator by 1. The example I cooked up involves forcing and solution vectors that are evaluated at each position in some field, and therefore have rank 1. The operator that relates them (S, as in A = S . F) is two-dimensional but sparse. It is stored in such a way that only non-zero values are kept. If there are nnz non-zero values in S, then Sp, the sparse equivalent of S, is Sp(1:nnz). If pos is the location within that sequence of some number Sp(pos), then its column and row position in the original matrix S are given by colind(pos) and rowind(pos).
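To make the storage concrete, here is a tiny hypothetical example of this coordinate-style format (the values are invented purely for illustration):
! 3x3 matrix S = [ 2 0 0 ]   stored as nnz = 3 triplets:
!                [ 0 0 5 ]
!                [ 0 7 0 ]
real (kind=8)    :: Sparse(3) = [2.d0, 5.d0, 7.d0]
integer (kind=4) :: rowind(3) = [1, 2, 3]   ! row of each non-zero
integer (kind=4) :: colind(3) = [1, 3, 2]   ! column of each non-zero
! A = S . F then means: A(i) is the sum of Sparse(k)*F(colind(k))
! over all k with rowind(k) == i.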
With that background, I might enlarge the question to: what is the very best (measured by execution time) way to accomplish the multiplication?
pure function SparseMul(Sparse,F) result(A)
implicit none
integer (kind=4),allocatable :: pos(:)
integer (kind=4) :: kx,ky ! gp counters
real (kind=8),intent(in) :: Sparse(:),F(:)
real (kind=8),allocatable :: A(:)
allocate(A(nxy))
do kx=1,nxy !for each row
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=sum(Sparse(pos)*F(colind(pos)))
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$ A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
end do
end function SparseMul
I assume the question "as is", i.e.:
we do not want to change the matrix storage format
we do not want to use an external library to perform the task
Otherwise, I think that using an external library should
be the best way to approach the problem, e.g.
https://software.intel.com/en-us/node/520797.
It is not easy to predict the "best" Fortran way to write the
multiplication. It depends on several factors (compiler, architecture,
matrix size,...). I think that the best strategy is to propose
some (reasonable) attempts and test them in a realistic configuration.
If I correctly understand the matrix storage format, my attempts -- including those reported in the question -- are provided
below:
save non-zero positions using pack
do kx=1,nxy
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=0
do ky=1,size(pos)
A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
end do
end do
as the previous one but using Fortran array syntax
do kx=1,nxy
pos=pack([(ky,ky=1,nnz)],rowind==kx)
A(kx)=sum(Sparse(pos)*F(colind(pos)))
end do
use a conditional to determine the components to be used
do kx=1,nxy
A(kx)=0
do ky=1,nnz
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
as the previous one but interchanging loops
A(:)=0
do ky=1,nnz
do kx=1,nxy
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
use the intrinsic sum with the mask argument
do kx=1,nxy
A(kx)=sum(Sparse*F(colind), mask=(rowind==kx))
enddo
as the previous one but using an implied do-loop
A =[(sum(Sparse*F(colind), mask=(rowind==kx)), kx=1,nxy)]
These are the results for a 1000x1000 matrix with
33% non-zero values. The machine is an Intel Xeon,
and the tests were performed with the Intel v17 and GNU 6.1
compilers using no optimization, high optimization
without vectorization, and high optimization.
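For reference, a timing loop of the kind used for these measurements might look like the following sketch (a hypothetical harness using system_clock, wrapped around variant 3, with nxy, nnz, Sparse, rowind, colind and F as defined in the question):
integer (kind=8) :: c0, c1, rate
call system_clock(c0, rate)
do kx=1,nxy
   A(kx)=0
   do ky=1,nnz
      if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
   end do
end do
call system_clock(c1)
print '(A,F8.3,A)', 'V3: ', real(c1-c0)/real(rate), ' s'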
V1 V2 V3 V4 V5 V6
-O0
ifort 4.28 4.26 0.97 0.91 1.33 2.70
gfortran 2.10 2.10 1.10 1.05 0.30 0.61
-O3 -no-vec
ifort 0.94 0.91 0.23 0.22 0.23 0.52
gfortran 1.73 1.80 0.16 0.15 0.16 0.32
-O3
ifort 0.59 0.56 0.23 0.23 0.30 0.60
gfortran 1.52 1.50 0.16 0.15 0.16 0.32
A few short comments on the results:
Versions 3-4-5 are usually the fastest ones
The role of compiler optimizations is crucial for any version
The vectorization seems to play an important role only for
the non-optimal versions
Version 4 is the best for both compilers
gfortran V4 is the "best" version
Elegance does not always mean good performance (V6 is not very good)
Additional insight can be gained by analyzing the
compilers' optimization reports.
If we have a multi-core machine, we can try to use
all the cores. This implies dealing with the code parallelization, which
is a wide issue, but just to give some hints let us test
two possible OpenMP parallelizations. We work on the
fastest serial version (even though there is no guarantee it
is also the best version to parallelize).
OpenMP 1.
!$omp parallel
!$omp workshare
A(:)=0
!$omp end workshare
!$omp do
do ky=1,nnz
do kx=1,nxy !for each row
if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
end do
end do
!$omp end do
!$omp end parallel
OpenMP 2. add firstprivate to read-only vectors to improve memory access
!$omp parallel firstprivate(Sparse, colind, rowind)
...
!$omp end parallel
These are the results for up to 16 threads on 16 cores:
#threads 1 2 4 8 16
OpenMP v1
ifort 0.22 0.14 0.088 0.050 0.027
gfortran 0.155 0.11 0.064 0.035 0.020
OpenMP v2
ifort 0.24 0.12 0.065 0.042 0.029
gfortran 0.157 0.11 0.052 0.036 0.029
The scalability (a speedup of around 8 with 16 threads) is reasonable considering
that this is a memory-bound computation. The firstprivate optimization
has advantages only for a small number of threads. gfortran with
16 threads is the "best" OpenMP solution.
I am having a hard time seeing where COLIND is declared and what it is doing... and likewise KX and KY. You want the inner loop vectorized, and that seems easiest to me using OpenMP SIMD REDUCTION. I am specifically looking here:
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$ A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
If you have to gather (PACK), then it may not help much. If more than 7/8 of the entries of F are zeros, then it is likely better to PACK F. Otherwise it may be better to vector-multiply everything (including the zero terms).
The main rule is that the data needs to be contiguous, so you cannot vectorize across the second dimension... It feels like Sparse and F are rank 2, but they are shown as rank 1. That works fine for traversal as a vector, even if they are really rank-2 arrays. UNION/MAP can also be used to implement a 2D array as also being a 1D vector.
Are Sparse and F really rank 1? And what are nax, nay, nxy and colind used for? Many of those are not defined (e.g. nay, nnz and colind).

Fortran: 10 nested loops slow with ending print statement

I have some code that runs in about a second, but slows to a standstill after a very minor edit.
The following code runs in 1 sec with gfortran -O3
program loop
implicit none
integer n, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10
parameter(n=18) !<=== most important
integer i,array(n)
real cal
real p1(n)
do i=1,n
p1(i)=float(i)/10.
enddo
write (*,1) p1
1 format (10(f6.2))
cal=0.
i1=0
i2=0
do i1=1,n
!write(*,1) cal !<-- too slow if write here
do i2=1,n
do i3=1,n
do i4=1,n
do i5=1,n
do i6=1,n
do i7=1,n
do i8=1,n
do i9=1,n
do i10=1,n
cal=p1(i1) !<-- perfectly happy to compute, as long as I don't write
array(i1)=i1+i2
enddo
enddo
enddo
enddo
enddo
enddo
enddo
enddo
enddo
!write(*,1) cal !<-- and too slow if write here too!
enddo
write(*,*) (array(i),i=1,n)
stop
end
First of all, forgive me for the mixture of F77 & F90; it's a boiled-down example based on a real problem. The salient point is that with the parameter n=17, everything is fine: the second-to-last write statement can be uncommented, and the code runs in about a second. However, with n=18, the code slows to a halt... unless the second-to-last write statement is commented out, in which case it runs in a second even with n=18.
In the two tests there are 17^10 and 18^10 iterations in total. I have been unable to find any indication of a limit on the total number of iterations. I keep thinking 18^10 must exceed some limit, but I do not know which. And why would the print statement matter for n=18 but not for n=17? More info: memory usage is near zero. The CPU is an i5-4570 @ 3.20GHz.
If I use -O0 the code always runs extremely slowly.
With gfortran 4.8.3, I don't see much runtime difference between including the write statements and leaving them out, but there is a huge difference between -O3 and -O0. The reason is that the compiler can massively optimise the loops with -O3, which it doesn't do with -O0. The compiler can essentially work out the answer in advance and completely omit the loops. With the higher optimisations, the compiler can also use more advanced features of your CPU, which work faster.
Putting the write statements inside the loop somewhat disrupts the ability of the compiler to aggressively optimise the loops, meaning it can no longer omit them entirely, which leads to the slower runtimes you're seeing. You are probably using an older version of gfortran which doesn't cope very well with this situation.
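To see what the optimizer is allowed to do here, note that only the final iterations of the loops are observable: cal keeps the value from the last pass, and array(i1) depends only on the final value of i2. At -O3 the entire nest can therefore legally collapse to something like this sketch:
! What the optimizer can reduce the nest to (an illustration):
do i1 = 1, n
   array(i1) = i1 + n   ! value from the last i2 iteration
end do
cal = p1(n)             ! value from the last i1 iteration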

Mandated vectorization for gfortran compiler

I want to execute a Fortran loop in a vectorial way on a vector processor (Intel Xeon). I recently learned that with the Intel compiler ifort this can be done by adding !DIR$ SIMD before the loop.
But when I work with the gfortran compiler, I find that all vectorization is automatic. For example,
PROGRAM MAIN1
IMPLICIT NONE
DOUBLE PRECISION :: X(100)
INTEGER :: NELEM = 100, NELMAX = 100, LV = 4
INTEGER :: IKLE(100), I, IB, IELEM
DOUBLE PRECISION :: W(100)
DOUBLE PRECISION :: MASKEL(100)
LOGICAL :: MSK = .FALSE.
DO I = 1, 100
X(I) = I
IKLE(I) = I
W(I) = 0
END DO
DO IB = 1,(NELEM+LV-1)/LV
!------------loop to vectorize------------------
DO IELEM = 1+(IB-1)*LV , MIN(NELEM,IB*LV)
X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
ENDDO ! IELEM
!-----------------------------------------------
ENDDO ! IB
PRINT *, X
END PROGRAM
Part of the output of gfortran main1.f -O3 -fopt-info-optimized is printed below
main1.f:18:0: note: not vectorized: not suitable for gather load _33 = x[_32];
main1.f:18:0: note: bad data references.
main1.f:18:0: note: not vectorized: not enough data-refs in basic block.
main1.f:18:0: note: not vectorized: not enough data-refs in basic block.
Since the program's output X is correct when the loop is compiled by ifort in mandated-vectorization mode, I wonder whether there is a similar option for gfortran.
In this case with scatter stores, forcing vectorization by directive could change the results when there are repeated entries in the index array IKLE(:), as it doesn't preserve the sequence of memory access. As far as I know, the only directive of this nature available in gfortran is !$omp simd, which gfortran is free to ignore. omp simd directives are active only when corresponding compile options are set.
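For completeness, a sketch of what that looks like in the example above (compile with -fopenmp-simd, or -fopenmp, so that gfortran honours the directive; as noted, this is unsafe if IKLE contains repeated indices):
!$OMP SIMD
DO IELEM = 1+(IB-1)*LV , MIN(NELEM,IB*LV)
   ! scatter update: results may change if IKLE has duplicate entries
   X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
ENDDO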
ifort offers (-opt-report4 in recent versions) an assessment of peak speedup possible by vectorization. I don't know whether that assessment is based on the declared array sizes. If there is a speedup, it would be achieved more by changing the operation sequence than by actual SIMD parallelism.