Fortran: 10 nested loops slow with ending print statement

I have some code that runs in about a second, but it slows to a standstill after a very minor edit.
The following code runs in about 1 second when compiled with gfortran -O3:
program loop
  implicit none
  integer n, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10
  parameter (n=18) !<=== most important
  integer i, array(n)
  real cal
  real p1(n)

  do i=1,n
    p1(i)=float(i)/10.
  enddo
  write (*,1) p1
1 format (10(f6.2))

  cal=0.
  i1=0
  i2=0
  do i1=1,n
    !write(*,1) cal !<-- too slow if write here
    do i2=1,n
      do i3=1,n
        do i4=1,n
          do i5=1,n
            do i6=1,n
              do i7=1,n
                do i8=1,n
                  do i9=1,n
                    do i10=1,n
                      cal=p1(i1) !<-- perfectly happy to compute, as long as I don't write
                      array(i1)=i1+i2
                    enddo
                  enddo
                enddo
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo
    !write(*,1) cal !<-- and too slow if write here too!
  enddo
  write(*,*) (array(i),i=1,n)
  stop
end
First of all, forgive me for the mixture of f77 & f90; it's a boiled-down example based on a real problem. The salient point is that with the parameter n=17, everything is fine: the second-to-last write statement can be uncommented and the code runs in about a second. With n=18, however, the code slows to a halt, unless the second-to-last write statement is commented out, in which case it again runs in about a second.
The two tests perform 17^10 and 18^10 iterations in total. I have been unable to find any indication that there is a limit on the total number of iterations. I keep thinking 18^10 must exceed some limit, but I do not know what. And why would the print statement matter for n=18 but not for n=17? More info: memory usage is near zero. The CPU is an i5-4570 @ 3.20GHz.
With -O0 the code always runs extremely slowly.

With gfortran 4.8.3, I don't see much runtime difference between including the write statements and leaving them out, but there is a huge difference between -O3 and -O0. The reason is that at -O3 the compiler can massively optimise the loops, which it does not do at -O0. The compiler can essentially work out the answer in advance and omit the loops entirely. With the higher optimisation levels the compiler can also use more advanced features of your CPU, which run faster.
Putting the write statements inside the loop disrupts the compiler's ability to optimise the loops aggressively: it can no longer omit them entirely, which leads to the slower runtimes you're seeing. You are probably using an older version of gfortran that doesn't cope very well with this situation.
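To see why the loops can be elided, note that nothing inside the nest is observable except the final values of cal and array, so the optimiser can compute them directly. A hypothetical sketch of what -O3 can reduce the nest to (the compiler's actual transformation may differ):

cal = p1(n)            ! the last assignment to cal happens at i1 = n
do i1 = 1, n
  array(i1) = i1 + n   ! for each i1, the last write uses i2 = n
enddo

Once a write statement inside the i1 loop makes intermediate values of cal observable, this collapse is no longer possible in the same form.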

Related

Poor scaling and a segmentation fault in a Fortran OpenMP code

I'm having some trouble when executing a program with a parallel do. Here is a test code.
module test
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
contains
  subroutine Addition(x,y,s)
    real(dp), intent(in)  :: x,y
    real(dp), intent(out) :: s
    s = x+y
  end subroutine Addition

  function linspace(length,xi,xf) result (vec)
    ! function to create an equally spaced vector given begin and end points
    real(dp), intent(in) :: xi,xf
    integer,  intent(in) :: length
    real(dp), dimension(1:length) :: vec
    integer  :: i
    real(dp) :: increment
    increment = (xf-xi)/(real(length)-1)
    vec(1) = xi
    do i = 2,length
      vec(i) = vec(i-1) + increment
    end do
  end function linspace
end module test

program paralleltest
  use, intrinsic :: iso_fortran_env, only: dp => real64
  use test
  use :: omp_lib
  implicit none
  integer, parameter :: length = 1000
  real(dp), dimension(length) :: x,y
  real(dp) :: s
  integer :: i,j
  integer :: num_threads = 8
  real(dp), dimension(length,length) :: SMatrix

  x = linspace(length,.0d0,1.0d0)
  y = linspace(length,2.0d0,3.0d0)
  !$ call omp_set_num_threads(num_threads)
  !$OMP PARALLEL DO
  do i=1,size(x)
    do j = 1,size(y)
      call Addition(x(i),y(j),s)
      SMatrix(i,j) = s
    end do
  end do
  !$OMP END PARALLEL DO

  open(unit=1,file='Add6.dat')
  do i= 1,size(x)
    do j= 1,size(y)
      write(1,*) x(i),";",y(j),";",SMatrix(i,j)
    end do
  end do
  close(unit=1)
end program paralleltest
I'm compiling and running the program as follows: gfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium, and then export OMP_NUM_THREADS=8.
This simple code raises at least two big questions about parallel do. First, if I run it with length = 1100 or greater, I get a Segmentation fault (core dumped) error message, but with smaller values it runs with no problem. Second, the timing: when I run it with length = 1000 (via time ./pt.out) it takes 1.732 s, but if I run it sequentially (without the -fopenmp option, with taskset -c 4 time ./pt.out) it takes 1.714 s. I guess the difference between the two would show up in longer, more complex code where parallelism is more useful. In fact, when I tried it with more complex calculations running in parallel with eight threads, the time was cut in half compared to sequential, not to an eighth as I expected. In view of this my questions are: is a speed-up from parallelization always available, or is it code dependent? And second, is there a friendly way to control which thread runs which iteration, i.e. the first thread running the first length/8 iterations and so on, like performing several taskset's, each covering the iterations I want?
As I commented, the segmentation fault has been treated elsewhere (Why Segmentation fault is happening in this openmp code?). I would use an allocatable array, but you can also raise the stack size with ulimit -s.
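The arithmetic also explains the threshold you see: SMatrix is 1000 x 1000 reals of 8 bytes, about 7.6 MiB, which just fits under the common 8 MiB default stack limit on Linux, while 1100 x 1100 (about 9.2 MiB) does not. A minimal sketch of the allocatable variant, changing only the declaration and adding an allocate:

real(dp), dimension(:,:), allocatable :: SMatrix   ! was: dimension(length,length)
allocate(SMatrix(length, length))                  ! now lives on the heap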
Regarding the time, almost all of the runtime is spent in writing the array to the external file.
But even if you remove that, measure only the time spent in the parallel section using omp_get_wtime(), and increase the problem size, it still does not scale well. This is because there is very little computation for the CPU to do and a lot of writing to memory, and accessing main memory is slow (cache misses).
As Jean-Claude Arbaut pointed out, your loop order is wrong and makes the memory accesses even slower: Fortran stores arrays in column-major order, so the first index of SMatrix should vary fastest, i.e. the loop over i belongs innermost. Some compilers can swap the loops for you at higher optimization levels (-O2 or -O3), but only some of them.
And even worse, as Jim Cownie pointed out, you have a race condition: multiple threads read and write the same s, so the program is invalid. You need to make s private using private(s).
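A sketch of the parallel section with both fixes applied:

!$OMP PARALLEL DO PRIVATE(i, s)
do j = 1, size(y)
  do i = 1, size(x)
    call Addition(x(i), y(j), s)   ! s is now a per-thread temporary
    SMatrix(i,j) = s               ! first index varies fastest (column-major)
  end do
end do
!$OMP END PARALLEL DO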
With the above fixes, the parallel section runs roughly twice as fast for me with four cores and four threads. Don't try to use hyper-threading; it slows the program down.
If you give the CPU more computational work to do, for example s = bessel_j0(x(i))/bessel_j1(y(j)), it scales pretty well for me: almost four times faster with four threads, and there hyper-threading does speed it up a little.
Finally, I suggest removing the manual setting of the number of threads; it is a pain for testing. If you remove it, you can simply run OMP_NUM_THREADS=4 ./a.out.

Bad performance of parallel subroutine

I was trying to parallelize the following code, but when it was executed in the main program there didn't seem to be a significant speed-up. I tested the same subroutine in another program, and there it took even longer to run than the serial code.
SUBROUTINE rotate(r,qt,n,np,i,a,b)
  IMPLICIT NONE
  INTEGER n,np,i
  DOUBLE PRECISION a,b,r(np,np),qt(np,np)
  INTEGER j
  DOUBLE PRECISION c,fact,s,w,y

  if (a.eq.0.d0) then
    c=0.d0
    s=sign(1.d0,b)
  else if (abs(a).gt.abs(b)) then
    fact=b/a
    c=sign(1.d0/sqrt(1.d0+fact**2),a)
    s=fact*c
  else
    fact=a/b
    s=sign(1.d0/sqrt(1.d0+fact**2),b)
    c=fact*s
  endif

!$omp parallel shared(i,n,c,s,r,qt) private(y,w,j)
!$omp do schedule(static,2)
  do 11 j=i,n
    y=r(i,j)
    w=r(i+1,j)
    r(i,j)=c*y-s*w
    r(i+1,j)=s*y+c*w
11 continue
!$omp do schedule(static,2)
  do 12 j=1,n
    y=qt(i,j)
    w=qt(i+1,j)
    qt(i,j)=c*y-s*w
    qt(i+1,j)=s*y+c*w
12 continue
!$omp end parallel
  return
END
! (C) Copr. 1986-92 Numerical Recipes Software Vs94z&):9+X%1j49#:`*.
However, when I used the Linux time utility to measure it, I got:
real 0m12.160s
user 4m49.894s
sys 0m0.880s
which is ridiculous compared to the time of the serial code:
real 0m2.078s
user 0m2.068s
sys 0m0.000s
So you have something like
do i=1,n
  do j=1,n
    do k=1,n
      call rotate()
    end do
  end do
end do
for n = 100, and you are parallelizing two simple loops inside rotate.
That is hopeless. If you want decent performance, you must parallelize the outermost loop that it is possible to parallelize.
There is simply not enough work inside the loops in rotate, and it is called far too many times: 100^3 = 1,000,000 calls, each containing two worksharing loops, so the threads must be synchronised or re-launched 2,000,000 times. That synchronisation accounts for the entire increase in run time you see.
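Schematically, the parallelism has to move out of rotate and up to the caller, assuming the outer iterations are actually independent (in the real algorithm they may not be, in which case the algorithm itself needs rethinking):

!$omp parallel do private(j, k)
do i = 1, n
  do j = 1, n
    do k = 1, n
      call rotate()   ! rotate stays serial; the threads split the i loop
    end do
  end do
end do
!$omp end parallel do

That way the threads are launched and synchronised once instead of millions of times.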

Could the compiler vectorize a loop over an array indexed by another array?

I would like to vectorize the code below (just an example); assume that, for whatever reason, I have to index one array with another.
PROGRAM TEST
  IMPLICIT NONE
  REAL,    DIMENSION(2000) :: A,B,C !100000
  INTEGER, DIMENSION(2000) :: E
  REAL(KIND=8) :: TIME1,TIME2
  INTEGER :: I

  DO I=1, 2000                !Actually only this loop could be vectorized
    B(I)=100.00               !by the compiler
    C(I)=200.00
    E(I)=I
  END DO

  !Computing computer's running time (start)
  CALL CPU_TIME (TIME1)

  DO I=1, 2000                !This is the problem: somehow I have to put
    A(E(I))=B(E(I))*C(E(I))   !an integer array E(I) inside an array.
  END DO                      !I would like to vectorize this loop too, but it didn't work.

  PRINT *, 'Results =', A(2000)
  PRINT *, ' '

  !Computing computer's running time (finish)
  CALL CPU_TIME (TIME2)
  PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM TEST
At first I thought the compiler could understand what I want and would vectorize it somewhat like this:
DO I=1, 2000, 4   !Unrolled 4 times
  A(E(I))  =B(E(I))  *C(E(I))
  A(E(I+1))=B(E(I+1))*C(E(I+1))
  A(E(I+2))=B(E(I+2))*C(E(I+2))
  A(E(I+3))=B(E(I+3))*C(E(I+3))
END DO
but I was wrong. I compiled with gfortran -Ofast -fopt-info-optimized Tes.F95 and the optimization report told me that only the first loop was successfully vectorized.
Do you have any idea how I could vectorize it? Or can it not be vectorized at all?
If E has equal values for different I, then you would be manipulating the same element of A multiple times, in which case the order could matter (though not in your case). Also, if you have multiple index arrays, like E1, E2 and E3, and
DO I=1, 2000
  A(E3(I))=B(E1(I))*C(E2(I))
END DO
the order could matter too. So I think this kind of indexing is, in general, not allowed in parallel loops.
With ifort one can use !DIR$ IVDEP, which means "ignore vector dependence". It only works when E(I) is linear, as in the example...
Assuming one wants to cover all the indexes anyway, just replace (E(I)) with (I) and work out the obvious E(I) ordering later...
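For concreteness, the directive goes immediately before the loop (ifort-specific, and only safe if E introduces no repeated indexes):

!DIR$ IVDEP               ! tell ifort to assume no vector dependences
DO I=1, 2000
  A(E(I))=B(E(I))*C(E(I))
END DO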

No speedup with OpenMP in DO loop [duplicate]

I have a Fortran 90 program that calls a multi-threaded routine, and I would like to time the program from the calling routine. If I use cpu_time(), I get the CPU time of all the threads (8 in my case) added together, not the actual wall time the program takes. The etime() routine seems to do the same. Any idea how I can time this program (without using a stopwatch)?
Try omp_get_wtime(); see http://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fwtime.html for the signature.
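A minimal usage sketch: omp_get_wtime() returns wall-clock seconds as a double precision value, so you simply difference two calls:

program time_it
  use omp_lib
  implicit none
  double precision :: t0, t1
  t0 = omp_get_wtime()
  ! ... the multi-threaded work to be timed ...
  t1 = omp_get_wtime()
  print *, 'wall time (s):', t1 - t0
end program time_it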
If this is a one-off thing, then I agree with larsmans that using gprof or some other profiler is probably the way to go; but I also agree that it is very handy to have coarser timers in your code for timing different phases of the computation. The best timing information you have is the stuff you actually use, and it's hard to beat stuff that's output every single time you run your code.
Jeremia Wilcock pointed out omp_get_wtime(), which is very useful; it's standards-compliant, so it should work on any OpenMP compiler. (I originally claimed it only has one-second resolution; that was completely wrong. Its resolution is implementation-defined and is reported by omp_get_wtick().)
Fortran 90 defines system_clock(), which can also be used with any standards-compliant compiler. The standard doesn't specify a resolution, but with gfortran it seems to be milliseconds and with ifort microseconds. I usually use it in something like this:
subroutine tick(t)
  integer, intent(OUT) :: t
  call system_clock(t)
end subroutine tick

! returns time in seconds from now to time described by t
real function tock(t)
  integer, intent(in) :: t
  integer :: now, clock_rate
  call system_clock(now, clock_rate)
  tock = real(now - t)/real(clock_rate)
end function tock
And using them:
call tick(calc)
! do big calculation
calctime = tock(calc)
print *,'Timing summary'
print *,'Calc: ', calctime

Puzzling performance difference between ifort and gfortran

Recently, I read a post on Stack Overflow about finding integers that are perfect squares. As I wanted to play with this, I wrote the following small program:
PROGRAM PERFECT_SQUARE
  IMPLICIT NONE
  INTEGER*8 :: N, M, NTOT
  LOGICAL :: IS_SQUARE

  N=Z'D0B03602181'
  WRITE(*,*) IS_SQUARE(N)

  NTOT=0
  DO N=1,1000000000
    IF (IS_SQUARE(N)) THEN
      NTOT=NTOT+1
    END IF
  END DO
  WRITE(*,*) NTOT ! should find 31622 squares
END PROGRAM

LOGICAL FUNCTION IS_SQUARE(N)
  IMPLICIT NONE
  INTEGER*8 :: N, M

  ! check if negative
  IF (N.LT.0) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF

  ! check if the ending 4 bits belong to (0,1,4,9)
  M=IAND(N,15)
  IF (.NOT.(M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF

  ! try to find the nearest integer to sqrt(n)
  M=DINT(SQRT(DBLE(N)))
  IF (M**2.NE.N) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  IS_SQUARE=.TRUE.
  RETURN
END FUNCTION
When compiled with gfortran -O2, the running time is 4.437 seconds; with -O3 it is 2.657 seconds. I then thought that compiling with ifort -O2 could be faster, since it might have a faster SQRT function, but it turned out the running time was now 9.026 seconds, and the same with ifort -O3. I tried to analyze it with Valgrind, and the Intel-compiled program indeed executes many more instructions.
My question is: why? Is there a way to find out where exactly the difference comes from?
EDITS:
gfortran version 4.6.2 and ifort version 12.0.2
times are obtained by running time ./a.out and are the real/user times (sys was always almost 0)
this is on Linux x86_64; both gfortran and ifort are 64-bit builds
ifort inlines everything; gfortran only does so at -O3, but the latter's assembly code is simpler than ifort's, which uses xmm registers a lot
fixed a line of code and added NTOT=0 before the loop; this should fix the issue seen with other gfortran versions
when the complex IF statement is removed, gfortran takes about 4 times as long (10-11 seconds). This is to be expected, since the statement throws out about 75% of the numbers, avoiding the SQRT for them. On the other hand, ifort only takes slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
EDIT2:
I tried ifort version 12.1.2.273; it's much faster, so it looks like they fixed that.
What compiler versions are you using?
Interestingly, it looks like a case where there is a performance regression from 11.1 to 12.0: for me, 11.1 (ifort -fast square.f90) takes 3.96 s, while 12.0 (same options) takes 13.3 s.
gfortran (4.6.1) (-O3) is still faster (3.35 s).
I have seen this kind of regression before, although not quite as dramatic.
BTW, replacing the if statement with
is_square = any(m == [0, 1, 4, 9])
if(.not. is_square) return
makes it run twice as fast with ifort 12.0, but slower in gfortran and ifort 11.1.
It looks like part of the problem is that 12.0 is overly aggressive in trying to vectorize things: adding
!DEC$ NOVECTOR
right before the DO loop (without changing anything else in the code) cuts the run time down to 4.0 sec.
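That is, in the counting loop from the question:

!DEC$ NOVECTOR          ! ifort-specific: do not vectorize the next loop
DO N=1,1000000000
  IF (IS_SQUARE(N)) THEN
    NTOT=NTOT+1
  END IF
END DO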
Also, as a side benefit: if you have a multi-core CPU, try adding -parallel to the ifort command line :)