Fortran limits of vectorization

I have written a function that returns a vector A equal to the product of a sparse matrix Sparse and another vector F. The non-zero values of the matrix are stored in Sparse(nnz); rowind(nnz) and colind(nnz) contain the row and column of each particular value of Sparse. It was relatively simple to vectorize the (now commented) inner loop with the two lines beneath the do kx statement. I cannot see how to vectorize the outer loop, since pos has a different size for each kx.
The question is: can the outer loop (do kx=1,nxy) be vectorized, and if so, how?
EDIT:
Vladimir F correctly surmises that I come from the Python/Octave world. I have moved (back) to Fortran to get more performance out of my hardware, as the PDEs I solve become larger. As of half an hour ago, vectorization meant getting rid of do loops, something Fortran seems very good at: the time savings from replacing the inner loop (do ky=1,size(pos)...) with the two lines above are astonishing. I looked at the information given by gfortran (really gcc?) when -fopt-info is invoked and saw that loop modification is often used. I will immediately go and read about SIMD and array notation. Please, please, if there are good sources on this topic, let me know.
In reply to Holz: there are myriad ways to store sparse matrices, usually resulting in lowering the rank of the operator by one. The example I cooked up involves forcing and solution vectors that are evaluated at each position in some field, and therefore have rank 1. The operator that relates them (S, as in A = S . F) is two-dimensional BUT sparse. It is stored in such a way that only non-zero values are kept. If there are nnz non-zero values in S, then Sp, the sparse equivalent of S, is Sp(1:nnz). If pos represents the location within that sequence of some number Sp(pos), then the column and row position in the original matrix S are given by colind(pos) and rowind(pos).
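For concreteness, here is a minimal sketch of that coordinate (COO) storage scheme; the 3x3 matrix and its values below are hypothetical, chosen only to illustrate the three arrays:

  ! hypothetical 3x3 matrix S with nnz = 4 non-zero values:
  !     | 5  0  0 |
  ! S = | 0  8  3 |
  !     | 0  0  6 |
  real (kind=8)    :: Sp(4)     = [5d0, 8d0, 3d0, 6d0]
  integer (kind=4) :: rowind(4) = [1, 2, 2, 3]
  integer (kind=4) :: colind(4) = [1, 2, 3, 3]
  ! so that S(rowind(k), colind(k)) == Sp(k) for every k in 1..nnz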
With that background, I might enlarge the question to: What is the very best (measured by execution time) that can be done to accomplish the multiplication?
pure function SparseMul(Sparse,F) result(A)
  implicit none
  integer (kind=4), allocatable :: pos(:)
  integer (kind=4) :: kx, ky            ! gp counters
  real (kind=8), intent(in)  :: Sparse(:), F(:)
  real (kind=8), allocatable :: A(:)
  allocate(A(nxy))
  do kx = 1, nxy                        ! for each row
    pos = pack([(ky, ky=1,nnz)], rowind==kx)
    A(kx) = sum(Sparse(pos)*F(colind(pos)))
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$   A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
  end do
end function SparseMul

I assume the question "as is", i.e.:
we do not want to change the matrix storage format
we do not want to use an external library to perform the task
Otherwise, I think that using an external library would be the best way to approach the problem, e.g. https://software.intel.com/en-us/node/520797.
It is not easy to predict the "best" Fortran way to write the
multiplication. It depends on several factors (compiler, architecture,
matrix size,...). I think that the best strategy is to propose
some (reasonable) attempts and test them in a realistic configuration.
If I correctly understand the matrix storage format, my attempts -- including those reported in the question -- are provided
below:
V1: save the non-zero positions using pack
do kx=1,nxy
  pos=pack([(ky,ky=1,nnz)],rowind==kx)
  A(kx)=0
  do ky=1,size(pos)
    A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
  end do
end do
V2: as the previous one, but using Fortran array syntax
do kx=1,nxy
  pos=pack([(ky,ky=1,nnz)],rowind==kx)
  A(kx)=sum(Sparse(pos)*F(colind(pos)))
end do
V3: use a conditional to determine the components to be used
do kx=1,nxy
  A(kx)=0
  do ky=1,nnz
    if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
  end do
end do
V4: as the previous one, but interchanging the loops
A(:)=0
do ky=1,nnz
  do kx=1,nxy
    if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
  end do
end do
V5: use the intrinsic sum with the mask argument
do kx=1,nxy
  A(kx)=sum(Sparse*F(colind), mask=(rowind==kx))
end do
V6: as the previous one, but using an implied do-loop
A=[(sum(Sparse*F(colind), mask=(rowind==kx)), kx=1,nxy)]
These are the results using a 1000x1000 matrix with 33% non-zero values. The machine is an Intel Xeon; the tests were performed with the Intel v17 and GNU 6.1 compilers, using no optimization, high optimization without vectorization, and high optimization.
             V1     V2     V3     V4     V5     V6
-O0
  ifort     4.28   4.26   0.97   0.91   1.33   2.70
  gfortran  2.10   2.10   1.10   1.05   0.30   0.61
-O3 -no-vec
  ifort     0.94   0.91   0.23   0.22   0.23   0.52
  gfortran  1.73   1.80   0.16   0.15   0.16   0.32
-O3
  ifort     0.59   0.56   0.23   0.23   0.30   0.60
  gfortran  1.52   1.50   0.16   0.15   0.16   0.32
A few short comments on the results:
Versions 3, 4 and 5 are usually the fastest ones.
The role of compiler optimization is crucial for every version.
Vectorization seems to play an important role only for the non-optimal versions.
Version 4 is the best for both compilers; gfortran's V4 is the overall "best" version.
Elegance does not always mean good performance (V6 is not very good).
Additional insight can be gained by analyzing the compiler optimization reports.
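For reproducibility, here is a minimal sketch of the kind of driver used for these timings (my own reconstruction; the 1000x1000 size and 33% fill come from above, while the random setup and the use of cpu_time are assumptions). It times version 4:

program bench
  implicit none
  integer, parameter :: nxy = 1000
  integer :: i, j, k, kx, ky, nnz
  integer, allocatable :: rowind(:), colind(:)
  real (kind=8), allocatable :: Sparse(:), F(:), A(:), dense(:,:)
  real (kind=8) :: t0, t1
  ! build a random nxy-by-nxy matrix with roughly 33% non-zero values
  allocate(dense(nxy,nxy), F(nxy), A(nxy))
  call random_number(dense)
  where (dense > 0.33d0) dense = 0d0
  nnz = count(dense /= 0d0)
  allocate(Sparse(nnz), rowind(nnz), colind(nnz))
  k = 0
  do j = 1, nxy
    do i = 1, nxy
      if (dense(i,j) /= 0d0) then
        k = k + 1
        Sparse(k) = dense(i,j); rowind(k) = i; colind(k) = j
      end if
    end do
  end do
  call random_number(F)
  ! time version 4 (loop interchange with a conditional)
  call cpu_time(t0)
  A(:) = 0
  do ky = 1, nnz
    do kx = 1, nxy
      if (rowind(ky) == kx) A(kx) = A(kx) + Sparse(ky)*F(colind(ky))
    end do
  end do
  call cpu_time(t1)
  print *, 'V4 time (s): ', t1 - t0
end program bench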
If we have a multi-core machine, we can try to use all the cores. This implies dealing with code parallelization, which is a wide topic, but just to give some hints let us test two possible OpenMP parallelizations. We work on the fastest serial version (even though there is no guarantee that it is also the best version to parallelize).
OpenMP 1.
!$omp parallel
!$omp workshare
A(:)=0
!$omp end workshare
!$omp do
do ky=1,nnz
  do kx=1,nxy   ! for each row
    if(rowind(ky)==kx) A(kx)=A(kx)+Sparse(ky)*F(colind(ky))
  end do
end do
!$omp end do
!$omp end parallel
OpenMP 2. Add firstprivate to the read-only vectors to improve memory access.
!$omp parallel firstprivate(Sparse, colind, rowind)
...
!$omp end parallel
These are the results for up to 16 threads on 16 cores:
#threads      1      2      4      8      16
OpenMP v1
  ifort      0.22   0.14   0.088  0.050  0.027
  gfortran   0.155  0.11   0.064  0.035  0.020
OpenMP v2
  ifort      0.24   0.12   0.065  0.042  0.029
  gfortran   0.157  0.11   0.052  0.036  0.029
The speedup (around 8x at 16 threads) is reasonable considering that this is a memory-bound computation. The firstprivate optimization has advantages only for a small number of threads. gfortran with 16 threads gives the "best" OpenMP solution.
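For reference (an assumption on my part, not stated above; test.f90 is a placeholder name), such tests are typically built and run along the lines of:
gfortran -O3 -fopenmp test.f90 && OMP_NUM_THREADS=16 ./a.out
ifort -O3 -qopenmp test.f90 && OMP_NUM_THREADS=16 ./a.out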

I am having a hard time seeing where COLIND is and what it is doing... and likewise KX and KY. You want the inner loop vectorized, and that seems easiest to me using an OpenMP SIMD REDUCTION. I am looking specifically here:
!!$ A(kx)=0
!!$ do ky=1,size(pos)
!!$ A(kx)=A(kx)+Sparse(pos(ky))*F(colind(pos(ky)))
!!$ end do
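A sketch of what that OpenMP SIMD reduction could look like on this loop (tmp is a scalar I introduce; compile with -fopenmp, or -fopenmp-simd for the SIMD directives alone):
tmp = 0d0
!$omp simd reduction(+:tmp)
do ky = 1, size(pos)
   tmp = tmp + Sparse(pos(ky))*F(colind(pos(ky)))
end do
A(kx) = tmp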
If you have to gather (PACK) first, then it may not help much. If more than 7/8 of the entries of F are zero, then it is probably better to PACK F; otherwise it may be better to vector-multiply everything (including the zero terms).
The main rule is that the data needs to be contiguous, so you cannot vectorize across the second dimension... It feels like Sparse and F are rank 2, but they are shown as rank 1. That works fine for going through them as a vector, even if they are really rank-2 arrays. UNION/MAP (a non-standard extension) can also be used to treat a 2D array as a 1D vector.
Are Sparse and F really rank 1? And what are nax, nay, nxy and colind used for? Many of those are not defined (e.g. nay, nnz and colind).


Is a matrix automatically deallocated at the end? [duplicate]

I am interested in the difference between alloc_array and automatic_array in the following extract:
subroutine mysub(n)
  integer, intent(in) :: n
  integer :: automatic_array(n)
  integer, allocatable :: alloc_array(:)
  allocate(alloc_array(n))
  ...[code]...
I am familiar enough with the basics of allocation (not so much the advanced techniques) to know that allocation allows you to change the size of the array in the middle of the code (as pointed out in this question), but I'm interested in the case where you don't need to change the size of the array; the arrays might be passed to other subroutines for operation, but the only purpose of both variables in the code and any subroutine is to hold the data of an array of dimension n (and maybe change the data, but not the size).
(1) Is there any difference in memory usage? I am not an expert in low-level procedures, but I have a slight knowledge of how they matter and how they can impact higher-level programming (the kind of experience I'm talking about: once, trying to run a big code in Fortran, I was getting a mistake I didn't understand, and the sysadmin told me "oh, yeah, you are probably saturating the stack; try adding this line in your running script"; anything that gives me insight into how to consider these things when actually coding, rather than having to patch them later, is welcome). I've been told it might depend on many other things like the compiler or the architecture, but I interpreted from those responses that the people were not completely sure exactly how. Is it so absolutely dependent on a multitude of factors, or is there a default/intended behavior in the coding that may then be overridden by optional compiler flags or system preferences?
(2) Would the subroutines have different interface needs? Again, not an expert, but it has happened to me before that, because of the way I declared the variables of a subroutine, I ended up having to put the subroutine in a module. I've been given to understand this may vary depending on whether I use things that are specific to allocatable variables. I am thinking about the case in which everything I do with the variables can be done both with allocatables and with automatics, not intentionally using anything specific to allocatables (other than allocation before use, that is).
Finally, in case this is of use: the reason I am asking is that we are developing in a group, and we have recently noticed that different people use these two declarations in different ways. We needed to determine whether this can be left to personal preference or whether there are reasons to set a clear criterion (and how to set it). I don't need extremely detailed answers; I am trying to determine whether this is something I should research carefully, and in what direction.
Though I would be interested to know of "interesting tricks" that can be done with allocation but are not directly related to the need for size variability, I am leaving those for a possible future follow-up question and focusing here on the strictly functional differences (meaning: what I am explicitly telling compilers to do with my code). The two items I mentioned are the things I could come up with from previous experiences, but if there is any other important one that I am missing, please do mention it.
Because gfortran or ifort + Linux (x86_64) are among the most popular combinations used for HPC, I made some performance comparisons between local allocatable and automatic arrays for these combinations. The CPU is a Xeon E5-2650 v2 @ 2.60GHz, and the compilers are gfortran 4.8.2 and ifort 14.0. The test program is the following.
In test.f90:
!------------------------------------------------------------------------
subroutine use_automatic( n )
    integer :: n
    integer :: a( n )   !! local automatic array (with unknown size at compile-time)
    integer :: i
    do i = 1, n
        a( i ) = i
    enddo
    call sub( a )
end
!------------------------------------------------------------------------
subroutine use_alloc( n )
    integer :: n
    integer, allocatable :: a( : )   !! local allocatable array
    integer :: i
    allocate( a( n ) )
    do i = 1, n
        a( i ) = i
    enddo
    call sub( a )
    deallocate( a )   !! not necessary for modern Fortran but for clarity
end
!------------------------------------------------------------------------
program main
    implicit none
    integer :: i, nsizemax, nsize, nloop, foo
    common /dummy/ foo
    nloop    = 10**7
    nsizemax = 10
    do i = 1, nloop
        nsize = mod( i, nsizemax ) + 1
        call use_automatic( nsize )
        ! call use_alloc( nsize )
    enddo
    print *, "foo = ", foo   !! to check if sub() is really called
end
In sub.f90:
!------------------------------------------------------------------------
subroutine sub( a )
    integer a( * )
    integer foo
    common /dummy/ foo
    foo = a( 1 )
end
In the above program, I tried to prevent the compiler from optimizing a(:) away (because filling it is effectively a no-op) by placing sub() in a different file and keeping the interface implicit. First, I compiled the program using gfortran as
gfortran -O3 test.f90 sub.f90
and tested different values of nsizemax while keeping nloop = 10^7. The result is in the following table (time is in sec, measured several times by the time command).
nsizemax use_automatic() use_alloc()
10 0.30 0.31 # average result
50 0.48 0.47
500 1.0 0.90
5000 4.3 4.2
100000 75.6 75.7
So the overall timing seems almost the same for the two calls when -O3 is used (but see the Edit below for different options). Next, I compiled with ifort as
[O3] ifort -O3 test.f90 sub.f90
or
[O3h] ifort -O3 -heap-arrays test.f90 sub.f90
In the former case the automatic array is stored on the stack, while with -heap-arrays attached it is stored on the heap. The obtained result is
nsizemax   use_automatic()    use_alloc()
           [O3]    [O3h]      [O3]   [O3h]
10         0.064   0.39       0.48   0.48
50         0.094   0.56       0.65   0.66
500        0.45    1.03       1.12   1.12
5000       3.8     4.4        4.4    4.4
100000     74.5    75.3       76.5   75.5
So for ifort, the use of automatic arrays seems beneficial when relatively small arrays are mainly used. On the other hand, gfortran -O3 shows no difference because both arrays are treated the same way (see Edit for more details).
Additional comparison:
Below is the result for the Oracle Fortran compiler 12.4 for Linux (used with f90 -O3). The overall trend seems similar; automatic arrays are faster for small n, indicating the internal use of the stack.
nsizemax use_automatic() use_alloc()
10 0.16 0.45
50 0.17 0.62
500 0.37 0.97
5000 2.04 2.67
100000 65.6 65.7
Edit
Thanks to Vladimir's comment, it has turned out that gfortran -O3 puts automatic arrays (with unknown size at compile time) on the heap. This explains why use_automatic() and use_alloc() did not make any difference above. So I made another comparison between different options below:
[O3] gfortran -O3
[O5] gfortran -O5
[O3s] gfortran -O3 -fstack-arrays
[Of] gfortran -Ofast # this includes -fstack-arrays
Here, -fstack-arrays means that the compiler puts all local arrays with unknown size on the stack. Note that this flag is enabled by default with -Ofast. The obtained result is
nsizemax use_automatic() use_alloc()
[Of] [O3s] [O5] [O3] [Of] [O3s] [O5] [O3]
10 0.087 0.087 0.29 0.29 0.29 0.29 0.29 0.29
50 0.15 0.15 0.43 0.43 0.45 0.44 0.44 0.45
500 0.57 0.56 0.84 0.84 0.92 0.92 0.92 0.92
5000 3.9 3.9 4.1 4.1 4.2 4.2 4.2 4.2
100000 75.1 75.0 75.6 75.6 75.6 75.3 75.7 76.0
where the average of ten measurements is shown. This table demonstrates that if -fstack-arrays is included, the execution time for small n becomes shorter. This trend is consistent with the results obtained for ifort above.
It should be mentioned, however, that the above comparison probably corresponds to the "best-case" scenario that highlights the difference between them, so the timing difference can be much smaller in practice. For example, I have compared the timing for the above options by using some other program (involving both small and large arrays), and the results were not much affected by the stack options. Also the result should depend on machine architecture as well as compilers, of course. So your mileage may vary.
For the sake of clarity, I'll briefly mention terminology. The two arrays are both local variables and arrays of rank 1.
alloc_array is an allocatable array;
automatic_array is an explicit-shape automatic object.
Being local variables, their scope is that of the procedure. Automatic arrays and unsaved allocatable arrays come to an end when execution of the procedure completes (with the allocatable array being deallocated); automatic objects cannot be saved, and saved allocatable objects are not deallocated on completion of execution.
Again, as in the linked question, after the allocation statement both arrays are of size n. These are still two very different things. Of course, the allocatable array can have its allocation status changed and its allocation moved. I'll leave both of those (mostly) out of the scope of this answer. An allocatable array, of course, doesn't have to have these things changed once it's been allocated.
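As a small illustration of these lifetimes (my own sketch, not part of the original answer):

subroutine demo(n)
  integer, intent(in) :: n
  integer :: auto_arr(n)                       ! automatic: ceases to exist on return
  integer, allocatable :: local_arr(:)         ! unsaved: automatically deallocated on return
  integer, allocatable, save :: kept_arr(:)    ! saved: keeps its allocation between calls
  allocate(local_arr(n))
  if (.not. allocated(kept_arr)) allocate(kept_arr(n))
end subroutine demo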
Memory usage
What was partly contentious about a previous revision of the question is how ill-defined the concept of memory usage is. Fortran, as a language definition, tells us that both arrays come to be the same size, have the same storage layout, and are both contiguous. Beyond that, much comes down to two terms you'll hear a lot: implementation specific and processor dependent.
In a comment you expressed interest in ifort. So that I don't wander too far, I'll stick to that one compiler. Other compilers have similar concepts, albeit with different names and options.
Often, ifort will place automatic objects and array temporaries on the stack. There is a (default) compiler option -no-heap-arrays described as having the effect:
The compiler puts automatic arrays and temporary arrays in the stack storage area.
Using the alternative option -heap-arrays allows one to control that slightly:
This option puts automatic arrays and arrays created for temporary computations on the heap instead of the stack.
There is a possibility to control size thresholds for which heap/stack would be chosen (when that is known at compile-time):
If the compiler cannot determine the size at compile time, it always puts the automatic array on the heap.
As n isn't a constant, one would expect automatic_array to be on the heap with this option, regardless of the size specified. To determine the size, n, of the array at compile time, the compiler would potentially need to do quite a bit of code analysis, even if it is possible.
There's probably more to be said, but this answer would be far too long if I tried. One thing to note, however, is that automatic local objects and (post-Fortran 90) allocatable local objects can be expected not to leak memory.
Interface needs
There is nothing special about the interface requirements of the subroutine mysub: local variables have no impact on that. Any program unit calling it would be happy with an implicit interface. What you are really asking is how the two local arrays can be used.
This largely comes down to what uses the two arrays can be put to.
If the dummy argument of a second procedure has the allocatable attribute then only the allocatable array here can be passed to that procedure. It will also need to have an explicit interface. This is true whether or not the procedure changes the allocation.
Of course, both arrays could be passed as arguments to a dummy argument without the allocatable attribute and then we don't have different interface requirements.
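To illustrate (my own sketch, not from the original answer), a procedure with an allocatable dummy argument could be given its required explicit interface by placing it in a module:

module alloc_demo
contains
  subroutine maybe_grow(a, n)
    integer, allocatable, intent(inout) :: a(:)   ! allocatable dummy: passes bounds and status
    integer, intent(in) :: n
    if (.not. allocated(a)) then
      allocate(a(n))
    else if (size(a) < n) then
      deallocate(a)
      allocate(a(n))
    end if
  end subroutine maybe_grow
end module alloc_demo

Only alloc_array from the question could be passed to maybe_grow; automatic_array could not.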
Anyway, why would one want to pass an argument to an allocatable dummy when there will be no change in allocation status, etc.? There are good reasons:
there may be a code path in the procedure which does have an allocation change (controlled by a switch, say);
allocatable dummy arguments also pass bounds;
etc.,
This second one is more obvious if the subroutine had the specification
subroutine mysub(n)
  integer, intent(in) :: n
  integer :: automatic_array(2:n+1)
  integer, allocatable :: alloc_array(:)
  allocate(alloc_array(2:n+1))
Finally, an automatic object has quite strict conditions on its size specification. n here is clearly allowed, but things don't have to get much more complicated before allocation is the only plausible way, depending on how much one wants to play with block constructs.
Taking also a comment from IanH: if we have a very large n, the automatic object is likely to lead to crash-and-burn. With the allocatable, one could use the stat= option to come to some amicable agreement with the compiler run-time.
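A sketch of that stat= handling (the variable name ierr and the message are mine):

integer :: ierr
allocate(alloc_array(n), stat=ierr)
if (ierr /= 0) then
  print *, 'allocation of ', n, ' elements failed (stat = ', ierr, ')'
  return   ! or take some other graceful recovery path
end if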

Do most compilers optimize MATMUL(TRANSPOSE(A),B)?

In a Fortran program, I need to compute several expressions like M·v, M^T·v, M^T·M, M·M^T, etc.
Here, M and v are 2D and 1D arrays of small size (less than 100, typically around 2-10).
I was wondering if writing MATMUL(TRANSPOSE(M),v) would unfold at compile time into code as efficient as MATMUL(N,v), where N is explicitly stored as N=TRANSPOSE(M). I am specifically interested in the gnu and ifort compilers with "strong" optimization flags (-O2, -O3 or -Ofast, for instance).
Below you find a couple of execution times of various methods.
system:
Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
cache size : 6144 KB
RAM : 16MB
GNU Fortran (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
ifort (IFORT) 18.0.5 20180823
BLAS : for the gnu compiler, the default BLAS version is used
compilation:
[gnu] $ gfortran -O3 x.f90 -lblas
[intel] $ ifort -O3 -mkl x.f90
execution:
[gnu] $ ./a.out > matmul.gnu.txt
[intel] $ export MKL_NUM_THREADS=1; ./a.out > matmul.intel.txt
In order to make the results as neutral as possible, I've rescaled the timings by the average time of an equivalent set of operations. I ignored threading.
matrix times vector
Six different implementations were compared:
manual: do j=1,n; w(j)=0; do k=1,n; w(j)=w(j)+P(j,k)*v(k); end do; end do
matmul: w=matmul(P,v)
blas N: call dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
matmul-transpose: w=matmul(transpose(P),v)
matmul-transpose-tmp: Q=transpose(P); w=matmul(Q,v)
blas T: call dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
In Figure 1 and Figure 2 you can compare the timing results for the above cases. Overall we can say that the use of a temporary is not advisable with either gfortran or ifort; both compilers can optimize MATMUL(TRANSPOSE(P),v) much better. While in gfortran the MATMUL implementation is faster than the default BLAS, for ifort the MKL BLAS is clearly faster.
figure 1: Matrix-vector multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n^2 · δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.
figure 2: Matrix-vector multiplication. Comparison of various implementations run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n^2 · δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000.
matrix times matrix
Six different implementations were compared:
manual: do l=1,n; do j=1,n; Q(j,l)=0; do k=1,n; Q(j,l)=Q(j,l)+P(j,k)*P(k,l); end do; end do; end do
matmul: matmul(P,P)
blas N:dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
matmul-transpose: matmul(transpose(P),P)
matmul-transpose-tmp: Q=transpose(P); R=matmul(Q,P)
blas T: dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
In Figure 3 and Figure 4 you can compare the timing results for the above cases. In contrast to the vector case, the use of a temporary is advisable only for gfortran. While in gfortran the MATMUL implementation is faster than the default BLAS, for ifort the MKL BLAS is clearly faster. Remarkably, the manual implementation is comparable to the MKL BLAS.
figure 3: Matrix-matrix multiplication. Comparison of various implementations run with gfortran. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n^3 · δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.
figure 4: Matrix-matrix multiplication. Comparison of various implementations run with a single-threaded ifort compilation. The left panels show the absolute timing divided by the total time of the manual computation for a system of size 1000. The right panels show the absolute timing divided by n^3 · δ, where δ is the average time of the manual computation of size 1000 divided by 1000 × 1000 × 1000.
The used code:
program matmul_test
  implicit none
  double precision, dimension(:,:), allocatable :: P, Q, R
  double precision, dimension(:),   allocatable :: v, w
  integer :: n, i, j, k, l
  double precision, dimension(12) :: t1, t2

  do n = 1, 1000
    allocate(P(n,n), Q(n,n), R(n,n), v(n), w(n))
    call random_number(P)
    call random_number(v)
    i = 0

    i = i + 1                    ! 1: manual matrix-vector
    call cpu_time(t1(i))
    do j = 1, n
      w(j) = 0d0
      do k = 1, n
        w(j) = w(j) + P(j,k)*v(k)
      end do
    end do
    call cpu_time(t2(i))

    i = i + 1                    ! 2: matmul
    call cpu_time(t1(i))
    w = matmul(P,v)
    call cpu_time(t2(i))

    i = i + 1                    ! 3: BLAS dgemv 'N'
    call cpu_time(t1(i))
    call dgemv('N',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
    call cpu_time(t2(i))

    i = i + 1                    ! 4: matmul(transpose(P),v)
    call cpu_time(t1(i))
    w = matmul(transpose(P),v)
    call cpu_time(t2(i))

    i = i + 1                    ! 5: explicit temporary
    call cpu_time(t1(i))
    Q = transpose(P)
    w = matmul(Q,v)
    call cpu_time(t2(i))

    i = i + 1                    ! 6: BLAS dgemv 'T'
    call cpu_time(t1(i))
    call dgemv('T',n,n,1.0D0,P,n,v,1,0.0D0,w,1)
    call cpu_time(t2(i))

    i = i + 1                    ! 7: manual matrix-matrix
    call cpu_time(t1(i))
    do l = 1, n
      do j = 1, n
        Q(j,l) = 0d0
        do k = 1, n
          Q(j,l) = Q(j,l) + P(j,k)*P(k,l)
        end do
      end do
    end do
    call cpu_time(t2(i))

    i = i + 1                    ! 8: matmul
    call cpu_time(t1(i))
    Q = matmul(P,P)
    call cpu_time(t2(i))

    i = i + 1                    ! 9: BLAS dgemm 'N','N'
    call cpu_time(t1(i))
    call dgemm('N','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
    call cpu_time(t2(i))

    i = i + 1                    ! 10: matmul(transpose(P),P)
    call cpu_time(t1(i))
    Q = matmul(transpose(P),P)
    call cpu_time(t2(i))

    i = i + 1                    ! 11: explicit temporary
    call cpu_time(t1(i))
    Q = transpose(P)
    R = matmul(Q,P)
    call cpu_time(t2(i))

    i = i + 1                    ! 12: BLAS dgemm 'T','N'
    call cpu_time(t1(i))
    call dgemm('T','N',n,n,n,1.0D0,P,n,P,n,0.0D0,R,n)
    call cpu_time(t2(i))

    write(*,'(I6,12D25.17)') n, t2 - t1
    deallocate(P, Q, R, v, w)
  end do
end program matmul_test

Why doesn't vectorization speed up these loops?

I'm getting up to speed with vectorization, since my current PC supports it. I have an Intel i7-7600U. It has 2 cores running at 2.8/2.9 GHz and supports SSE4.1, SSE4.2 and AVX2. I'm not sure of the vector register size; I believe it is 256 bits, so it will work with four 64-bit double-precision values at a time. I believe this should give a peak rate of:
(2.8 GHz) x (2 cores) x (4 doubles per vector) x (2 for paired add/multiply) ≈ 45 GFlops.
I am using GNU Gfortran and g++.
I have a set of fortran loops I built up back in my days of working on various supercomputers.
One loop I tested is:
do j=1,m
  s(:) = s(:) + a(:,j)*b(:,j)
enddo
The vector length is 10000, m = 200, and the nest was executed 500 times to give 2e9 operations. I ran it with the j loop unrolled 0, 1, 2, 3 and 5 times. Unrolling should reduce the number of times s is loaded and stored. It is also optimal because all the memory accesses are stride-one and there is a paired add and multiply. I ran it both using array syntax as shown above and using an inner do loop, but that seems to make little difference. With do loops and no unrolling it looks like:
do j=1,m
  do i=1,n
    s(i) = s(i) + a(i,j)*b(i,j)
  end do
end do
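For reference, the unroll-by-2 variant of the same nest would look something like this (my sketch, assuming m is even); it loads and stores s half as often:
do j=1,m,2
  do i=1,n
    s(i) = s(i) + a(i,j)*b(i,j) + a(i,j+1)*b(i,j+1)
  end do
end do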
The build looks like:
gfortran -O3 -w -fimplicit-none -ftree-vectorize -fopt-info-vec loops.f90
The compiler says the loops are all vectorized. The best results I have gotten are about 2.8 GFlops, which is one flop per cycle. If I run it with:
gfortran -O2 -w -fimplicit-none -fno-tree-vectorize -fopt-info-vec loops.f90
No vectorization is reported. It executes a little slower without unrolling, but the same with unrolling. Can someone tell me what is going on here? Do I have the characterization of my processor wrong? Why doesn't vectorization speed it up? I was expecting to get at least some improvement. I apologize if this plows old ground, but I could not find a clean example similar to this.

Fortran: 10 nested loops slow with ending print statement

I have some code that runs in about a second, but slows to a standstill after a very minor edit.
The following code runs in 1 sec with gfortran -O3
program loop
  implicit none
  integer n, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10
  parameter(n=18)   !<=== most important
  integer i, array(n)
  real cal
  real p1(n)
  do i = 1, n
    p1(i) = float(i)/10.
  enddo
  write (*,1) p1
1 format (10(f6.2))
  cal = 0.
  i1 = 0
  i2 = 0
  do i1 = 1, n
    !write(*,1) cal            !<-- too slow if write here
    do i2 = 1, n
      do i3 = 1, n
        do i4 = 1, n
          do i5 = 1, n
            do i6 = 1, n
              do i7 = 1, n
                do i8 = 1, n
                  do i9 = 1, n
                    do i10 = 1, n
                      cal = p1(i1)       !<-- perfectly happy to compute, as long as I don't write
                      array(i1) = i1 + i2
                    enddo
                  enddo
                enddo
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo
    !write(*,1) cal            !<-- and too slow if write here too!
  enddo
  write(*,*) (array(i), i=1,n)
  stop
end
First of all, forgive me for the mixture of F77 and F90; it's a boiled-down example based on a real problem. The salient point is that with the parameter n=17 everything is fine: the second-to-last write statement can be uncommented and the code runs in about a second. However, with n=18 the code slows to a halt... unless the second-to-last write statement is commented out, in which case it runs in a second even with n=18.
In the two tests there are 17^10 and 18^10 iterations in total. I have been unable to find any indication that there is a limit on the total number of iterations. I keep thinking 18^10 must exceed some limit, but I do not know what. And why would the print statement matter for n=18 but not for n=17? More info: memory usage is near zero. The CPU is an i5-4570 @ 3.20GHz.
If I use -O0 the code always runs extremely slowly.
With gfortran 4.8.3, I don't see much runtime difference between including the write statements and leaving them out, but there is a huge difference between -O3 and -O0. The reason is that with -O3 the compiler can massively optimise the loops, which it doesn't do with -O0: it can essentially work out the answer in advance and omit the loops entirely. With the higher optimisations the compiler can also use more advanced features of your CPU, which work faster.
Putting the write statements inside the loop somewhat disrupts the compiler's ability to optimise the loops aggressively, meaning it can no longer omit them entirely, which leads to the slower runtimes you're seeing. You are probably using an older version of gfortran which doesn't cope very well with this situation.
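One way to see this in practice (my suggestion, using the same reporting flag that appears elsewhere on this page) is to compare the compiler's optimization reports with and without the write statements, e.g. gfortran -O3 -fopt-info loop.f90.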

Using OpenMP critical and ordered

I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.
The problem seems to be the reductions. Using OpenMP reductions works and gives the correct results, but they are not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for this test). Therefore I put the reductions inside a CRITICAL section, but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether the order of the additions makes any difference, but it should not produce errors this big.
Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.
So my question is: why doesn't a critical section work in this case?
My code is below. All shared variables except np, tm, hm, gam are read-only.
EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e., generate a pair i,j within the ranges of the loops; if they have already been "visited", generate new ones), and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!
TL;DR: The discrepancies in the results were caused by the ordering of the floating point values. Using double precision instead helps.
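In sketch form, the fix the TL;DR refers to is declaring the accumulator arrays in double precision; the actual declarations are not shown in the excerpt, so the names and shapes below are my assumptions:
integer, parameter :: dp = kind(1.0d0)
real(dp), allocatable :: np(:,:,:,:), tm(:,:,:,:), hm(:,:,:,:), gam(:,:,:,:)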
!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP&   dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
DO i = 1, nd
  !$OMP CRITICAL (main)
  ! Second loop over the data:
  DO j = 1, nd
    ! The lag:
    zdis = z(j) - z(i)
    IF (zdis >= 0.0) THEN
      izl =  INT( zdis/dzlag + 0.5)
    ELSE
      izl = -INT(-zdis/dzlag + 0.5)
    END IF
    ! ---- SNIP ----
    ! Loop over all variograms for this lag:
    DO cur_variogram = 1, nvarg
      variogram_type = ivtype(cur_variogram)
      ! Get the head and tail values:
      indx = i + (ivhead(cur_variogram)-1)*maxdim
      vrh = vr(indx)
      indx = j + (ivtail(cur_variogram)-1)*maxdim
      vrt = vr(indx)
      IF (vrh < tmin .OR. vrh >= tmax .OR. vrt < tmin .OR. vrt >= tmax) CYCLE
      ! ----- PROBLEM AREA -------
      np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1) + 1.   ! <-- This never fails
      tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1) + vrt
      hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1) + vrh
      gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
      ! ----- END OF PROBLEM AREA -----
      !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
    END DO
  END DO
  !$OMP END CRITICAL (main)
END DO
!$OMP END DO
!$OMP END PARALLEL
Thanks very much in advance!
If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538, that is, a difference of 1 in the least significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit floating-point number only has about 7 decimal digits to play with.
Ordinary floating-point arithmetic is not strictly associative. For real numbers (in the mathematical, not Fortran, sense) (a+b)+c == a+(b+c), but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.
The non-determinism arises because, in using OpenMP, you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that two executions of the same OpenMP program will produce the same-to-the-last-bit results.
I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time, the compiler has to create a parallelised executable whether it is eventually run in parallel or not.
High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C).
float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
printf("sum %f\n", a+b+c); //output 1.234560
printf("sum %f\n", a+(b+c)); //output 0.000000
But I would like to point out that it is possible to preserve the order in OpenMP. I discussed this here: C++ OpenMP: Split for loop in even chunks static and join data at the end.
Edit:
Actually, I confused commutativity and associativity. If you have an operator which is associative but not commutative, then it's possible to preserve the order with OpenMP as I did in the post above. However, IEEE floating point is commutative but NOT associative, so the only thing that can be done is to break IEEE and let it be associative.