Puzzling performance difference between ifort and gfortran - fortran

Recently, I read a post on Stack Overflow about finding integers that are perfect squares. As I wanted to play with this, I wrote the following small program:
PROGRAM PERFECT_SQUARE
IMPLICIT NONE
INTEGER*8 :: N, M, NTOT
LOGICAL :: IS_SQUARE
N=Z'D0B03602181'
WRITE(*,*) IS_SQUARE(N)
NTOT=0
DO N=1,1000000000
IF (IS_SQUARE(N)) THEN
NTOT=NTOT+1
END IF
END DO
WRITE(*,*) NTOT ! should find 31622 squares
END PROGRAM
LOGICAL FUNCTION IS_SQUARE(N)
IMPLICIT NONE
INTEGER*8 :: N, M
! check if negative
IF (N.LT.0) THEN
IS_SQUARE=.FALSE.
RETURN
END IF
! check if ending 4 bits belong to (0,1,4,9)
M=IAND(N,15)
IF (.NOT.(M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) THEN
IS_SQUARE=.FALSE.
RETURN
END IF
! try to find the nearest integer to sqrt(n)
M=DINT(SQRT(DBLE(N)))
IF (M**2.NE.N) THEN
IS_SQUARE=.FALSE.
RETURN
END IF
IS_SQUARE=.TRUE.
RETURN
END FUNCTION
When compiling with gfortran -O2, running time is 4.437 seconds, with -O3 it is 2.657 seconds. Then I thought that compiling with ifort -O2 could be faster since it might have a faster SQRT function, but it turned out running time was now 9.026 seconds, and with ifort -O3 the same. I tried to analyze it using Valgrind, and the Intel compiled program indeed uses many more instructions.
My question is why? Is there a way to find out where exactly the difference comes from?
EDITS:
gfortran version 4.6.2 and ifort version 12.0.2
times are obtained from running time ./a.out and is the real/user time (sys was always almost 0)
this is on Linux x86_64, both gfortran and ifort are 64-bit builds
ifort inlines everything, gfortran only at -O3, but the latter assembly code is simpler than that of ifort, which uses xmm registers a lot
fixed line of code, added NTOT=0 before loop, should fix issue with other gfortran versions
When the complex IF statement is removed, gfortran takes about 4 times as much time (10-11 seconds). This is to be expected since the statement approximately throws out about 75% of the numbers, avoiding to do the SQRT on them. On the other hand, ifort only uses slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
EDIT2:
I tried with ifort version 12.1.2.273 it's much faster, so looks like they fixed that.

What compiler versions are you using?
Interestingly, it looks like a case where there is a performance regression from 11.1 to 12.0 -- e.g. for me, 11.1 (ifort -fast square.f90) takes 3.96s, and 12.0 (same options) took 13.3s.
gfortran (4.6.1) (-O3) is still faster (3.35s).
I have seen this kind of a regression before, although not quite as dramatic.
BTW, replacing the if statement with
is_square = any(m == [0, 1, 4, 9])
if(.not. is_square) return
makes it run twice as fast with ifort 12.0, but slower in gfortran and ifort 11.1.
It looks like part of the problem is that 12.0 is overly aggressive in trying to vectorize things: adding
!DEC$ NOVECTOR
right before the DO loop (without changing anything else in the code) cuts the run time down to 4.0 sec.
Also, as a side benefit: if you have a multi-core CPU, try adding -parallel to the ifort command line :)

Related

Why doesn't vectorization speed up these loops?

I'm getting up to speed with vectorization, since my current PC supports it. I have an Intel i7-7600u. It has 2 cores running at 2.8/2.9 GHz and supports SSE4.1, SSE4.2 and AVX2. I'm not sure of the vector register size. I believe it is 256 bits, so will work with 4 64 bit double precision values at a time. I believe this should give a peak rate of:
(2.8GHz)(2 core)(4 vector)(2 add/mult) = 45 GFlops.
I am using GNU Gfortran and g++.
I have a set of fortran loops I built up back in my days of working on various supercomputers.
One loop I tested is:
do j=1,m
s(:) = s(:) + a(:,j)*b(:,j)
enddo
The vector length is 10000, m = 200 and the nest was executed 500 times to give 2e9 operations. I ran it with the j loop unrolled 0, 1, 2, 3 and 5 times. Unrolling should reduce the number of times s is loaded and stored. It is also optimal because all the memory accesses are stride one and it has a paired add and multiply. I ran it using both array syntax as shown above and by using an inner do loop, but that seems to make little difference. With do loops and no unrolling it looks like:
do j=1,m
do i=1,n
s(i)=s(i)+a(i,j)*b(i,j)
end do
end do
The build looks like:
gfortran -O3 -w -fimplicit-none -ftree-vectorize -fopt-info-vec loops.f90
The compiler says the loops are all vectorized. The best results I have gotten is about 2.8 GFlops, which is one per cycle. If I run it with:
gfortran -O2 -w -fimplicit-none -fno-tree-vectorize -fopt-info-vec loops.f90
No vectorization is reported. It executes a little slower without unrolling, but the same with unrolling. Can someone tell me what is going on here? Do I have the characterization of my processor wrong? Why doesn't vectorization speed it up? I was expecting to get at least some improvement. I apologize if this plows old ground, but I could not find a clean example similar to this.

Fortran: 10 nested loops slow with ending print statement

I have some code that runs in about a second, but slows to a standstill after a very minor edit.
The following code runs in 1 sec with gfortran -O3
program loop
implicit none
integer n, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10
parameter(n=18) !<=== most important
integer i,array(n)
real cal
real p1(n)
do i=1,n
p1(i)=float(i)/10.
enddo
write (*,1) p1
1 format (10(f6.2))
cal=0.
i1=0
i2=0
do i1=1,n
!write(*,1) cal !<-- too slow if write here
do i2=1,n
do i3=1,n
do i4=1,n
do i5=1,n
do i6=1,n
do i7=1,n
do i8=1,n
do i9=1,n
do i10=1,n
cal=p1(i1) !<-- perfectly happy to compute, as long as I don't write
array(i1)=i1+i2
enddo
enddo
enddo
enddo
enddo
enddo
enddo
enddo
enddo
!write(*,1) cal !<-- and too slow if write here too!
enddo
write(*,*) (array(i),i=1,n)
stop
end
First of all, forgive me for the mixture of f77 & 90. It's a boiled down example based on a real problem. However, the salient point is that if the parameter n=17, everything's fine. The second to last write statement can be uncommented, and the code runs in about a second. However, with n=18, the code slows to a halt... unless, if the second to last write statement is commented out, it runs in a second with n=18.
In the two tests, there are a total of 17^10 and 18^10 iterations total. I have been unable to find any indication there is a limit on the number of total iterations. I keep thinking 18^10 must be exceeding some limit, but I do not know what. And why would the print statement matter for n=18 but not n=17? More info: Mem usage is near zero. CPU is i5-4570 CPU # 3.20GHz.
If I use -O0 the code always runs extremely slowly.
With gfortan 4.8.3, I don't see much runtime difference between including the write statements and leaving them out, but there is a huge difference between -O3 and -O0. The reason for this is because the compiler is able to massively optimise the loops with -O3, which it doesn't do with -O0. The compiler can essentially work out the answer in advance and completely omit the loops. With the higher optimisations, the compiler can also use more advanced features of your CPU, which work faster.
Putting the write statements inside the loop somewhat disrupts the ability of the compiler to aggressively optimise the loops, meaning it can no longer omit them entirely, which leads to the slower runtimes you're seeing. You are probably using an older version of gfortran which doesn't cope very well with this situation.

Why xlf and ifort take differently on Array Bounds Read?

I'm trying to port a code from ifort compiler to ibm xlf compiler. It works well under ifort on redhat but give results contain "NaNQ" under xlf on AIX system. It turns out that there is a array bounds reading in the code cost this problem, here is a simplified example:
program main
implicit none
real(8)::a(1,0:10)=0.D0
print *, a(1,-1)
end program main
Using both compiler I can successfully compile it, without any mistake or warning.
On ifort I get result:
0.000000000000000E+000
But on xlf, I get:
0.247032822920623272E-322
However, if I read more beyond the boundary, the xlf won't compile but ifort compile successfully.
program main
implicit none
real(8)::a(1,0:10)=0.D0
print *, a(1,-3:-1)
end program main
On ifort I get:
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
On xlf it won't compile:
"1.f90", line 5.9: 1516-023 (S) Subscript is out of bounds.
** main === End of Compilation 1 ===
1501-511 Compilation failed for file 1.f90.
Why the ifort and xlf take differently on this cross boundary read? Is there any way to make the compiler to check it strictly and prevent cross boundary read to happen? After all, it took me a long time to catch this bug in our code, since our group have been using this code for more than 15 years without any problems on ifort. Thanks.
Most Fortran compilers have options to check for array bounds errors at runtime. In these examples with constant indices the error can be found at compile time, which some compilers but not others will do without using non-default options. With ifort use -check bounds to request array bounds checking. You can get additional checking with -check all. These options are generally not the default because there is a runtime cost. But the cost of getting a wrong answer can be much higher!! I have found the runtime cost to frequently be surprisingly low and recommend using runtime checks during code development and even in production if the runtime cost is acceptable.

Fortran I/O: Specifying large record sizes

I am trying to write an array to file, where I have opened the file this way:
open(unit=20, FILE="output.txt", form='unformatted', access='direct', recl=sizeof(u))
Here, u is an array and sizeof(u) is 2730025920, which is ~2.5GB.
When I run the program, I get an error Fortran runtime error: RECL parameter is non-positive in OPEN statement, which I believe means that the record size is too large.
Is there a way to handle this? One option would be to write the array in more than one write call such that the record size in each write is smaller than 2.5GB. But I am wondering if I can write the entire array in a single call.
Edit:
u has been declared as double precision u(5,0:408,0:408,0:407)
The program was compiled as gfortran -O3 -fopenmp -mcmodel=medium test.f
There is some OpenMP code in this program, but the file I/O is sequential.
gfortran v 4.5.0, OS: Opensuse 11.3 on 64 bit AMD Opteron
Thanks for your help.
You should be able to write big arrays as long as it's memory permitting. It seems like you are getting integer overflow with the sizeof function. sizeof is not Fortran standard and I would not recommend using it (implementations may vary between compilers). Instead, it is a better practice to use the inquire statement to obtain record length. I was able to reproduce your problem with ifort and this solution works for me. You can avoid integer overflow by declaring a higher kind variable:
integer(kind=8) :: reclen
inquire(iolength=reclen)u
open(unit=20,file='output.txt',form='unformatted',&
access='direct',recl=reclen)
EDIT: After some investigation, this seems to be a gfortran problem. Setting a higher kind for integer reclen solves the problem for ifort and pgf90, but not for gfortran - I just tried this with version 4.6.2. Even though reclen has the correct positive value, it seems that recl is 32-bit signed integer internally with gfortran (Thanks #M.S.B. for pointing this out). The Fortran run-time error suggests this, and not that the value is larger than maximum. I doubt it is an OS issue. If possible, try using ifort (free for non-commercial use): Intel Non-Commercial Software Download.

Variable strangely takes the value zero after the call of a subroutine

I have been facing some issues trying to convert a code previously compiled with compaq visual fortran 6.6 to gfortran.
Here is a specific problem I have met with gfortran :
There is a variable called "et" which takes the value 3E+10. Then the program calls a subroutine. "et" doesn't appear in the subroutine, but after coming back to the main program it has now the value 0.
When compliling with compaq visual fortran I didn't have this problem.
The code I am working on is a huge scientific program, so I put below only a small part of it :
c
c calculate load/unload modulus
c
500 t=(s1-s3)/2.
aa=1.00
if(iconeps.ne.1)bb=1.00
if(smean.lt.ap1) smean=ap1
if(xn.gt.0.000001) aa=(smean/atmp)**xn
if(iconeps.eq.1)go to 220
if(xm.gt.0.000001) bb=(smean/atmp)**xm
220 if(t.ge.0.99*sm1) go to 600
et=xku*aa*atmp+tt*tm1
if(iconeps.ne.1)bt=xkb*atmp*bb
go to 900
600 et=(xkl*aa*atmp+tt*tm1)*(1.0-rf*sr)**2
if(iconeps.ne.1)bt=xkb*atmp*bb
900 continue
btmax=17.0*et
btmin=0.33*et
if(iconeps.ne.1)then
tbt=(alf1+alf3*dtt)*dtt*(1.+vide)*tm2
btf=bt+tbt
bt=btf
endif
if(bt.lt.btmin) bt=btmin
if(bt.gt.btmax) bt=btmax
if(iconeps.eq.1)go to 1100
1000 continue
1050 if(mt.eq.mtyp4c)goto 1100
s=0.0
t=0.0
call shap4n(s,t,f,pfs,pft) ! Modification by NHV
call thick4n(s,t,xe,ye,thick)
call bmat4n(xe,ye,f,pfs,pft,b,detj,thick)
c calculate incremental strains
do 1300 i=1,4
temp=0.0
do 1200 j=1,8
1200 temp=temp+b(i,j)*disp(j)
1300 depi(i)=temp
epsv=0.0
do 1400 i=1,2
1400 epsv=epsv+depi(i)
epsv=epsv+depi(4)
ev=vide-(1.+vide)*epsv
if(ev.lt.0.0)ev=vide*.01
1100 continue
call perm(permws,xkw,coef,rw,tvisc,ev,vide,tt,pp)
: "et" keeps the good value until just before calling the subroutine "perm". Just after this subroutine it takes the value zero.
"et" isn't in any common block
This piece of code is part of a subroutine called by several different subroutines. What is even more strange is that when it is called in other parts of the code I doesn't have this problem ("et" keeps its value)
So if someone has ever met this kind of problem or have any idea about it I will be very gratefull
Perhaps you have a memory access error, such as an array bounds violation, or a mismatch between actual and dummy arguments. Are the interfaces of the subroutines explicit, such as being "used" from a module? Also try turning on compiler debugging options ... obviously subscript checking, but others might catch something. An extensive set for gfortran 4.5 or 4.6 is:
-O2 -fimplicit-none -Wall -Wline-truncation -Wcharacter-truncation -Wsurprising -Waliasing -Wimplicit-interface -Wunused-parameter -fwhole-file -fcheck=all -std=f2008 -pedantic -fbacktrace
Subscript checking is included in fcheck=all
I had this problem. In my main program, I was using double precision but the numbers I calculated with in my subroutine were single precision. After I changed them to double it fixed the problem and I got actual values instead of 0.