I was wondering whether it is preferable, for performance, to work on the real and imaginary parts of an array separately instead of on a complex variable. For example,
program test
  implicit none
  integer, parameter :: n = 1e8
  real(kind=8), parameter :: pi = 4.0d0*atan(1.0d0)
  complex(kind=8), parameter :: i_ = (0.0d0, 1.0d0)
  double complex :: s
  real(kind=8) :: th(n), sz, t1, t2, s1, s2
  integer :: i

  sz = 2.0d0*pi/n
  do i = 1, n
    th(i) = sz*i
  enddo

  ! complex version
  call cpu_time(t1)
  s = sum(exp(th*i_))
  call cpu_time(t2)
  print *, t2-t1

  ! split real/imaginary version
  call cpu_time(t1)
  s1 = sum(cos(th))
  s2 = sum(sin(th))
  call cpu_time(t2)
  print *, t2-t1
end program test
And the times it takes:
3.7041089999999999
2.6299830000000002
So the split calculation does take less time. This was a very simple calculation, but I have some long calculations where using complex variables improves readability and takes fewer lines of code. Will it sacrifice the performance of my code? Or is it always advisable to work on the real and imaginary parts separately?
It is better to understand what kind of tricks the compiler can do for you; generally, splitting the calculation by hand is not worth the effort nowadays. Create a little script to study the CPU time of your code.
#!/bin/bash
src=a.f90
for fcc in gfortran ifort; do
  $fcc --version
  for flag in "-O0" "-O1" "-O2" "-O3"; do
    fexe=$fcc$flag
    echo $fcc $src -o $fexe $flag
    $fcc $src -o $fexe $flag
    echo "run $fexe ..."
    ./$fexe
  done
done
You will notice that some of the CPU times may come out very close to 0, because the compiler is clever enough to discard computations whose results are never used. Change the print statements so the compiler cannot optimize your computation away:
print *, t2-t1, s
print *, t2-t1, s1, s2
The results using ifort are below. Besides the speed, notice the ACCURACY; speed comes at a price:
ifort (IFORT) 14.0.2
ifort a.f90 -o ifort-O0 -O0
run ifort-O0 ...
3.57999900000000 (-2.319317404797516E-009,7.034712528404704E-009)
4.07666600000000 -2.319317404797516E-009 7.034712528404704E-009
ifort a.f90 -o ifort-O1 -O1
run ifort-O1 ...
3.30333300000000 (-2.319317404797516E-009,7.034712528404704E-009)
3.54666700000000 -2.319317404797516E-009 7.034712528404704E-009
ifort a.f90 -o ifort-O2 -O2
run ifort-O2 ...
3.08000000000000 (-2.319317404797516E-009,7.034712528404704E-009)
1.13666600000000 -6.304215927066537E-009 1.737099880017717E-009
ifort a.f90 -o ifort-O3 -O3
run ifort-O3 ...
3.08333400000000 (-2.319317404797516E-009,7.034712528404704E-009)
1.13666600000000 -6.304215927066537E-009 1.737099880017717E-009
You may wonder what happens between the -O1 and -O2 flags. If you check the compiled object file (e.g. with nm), the math functions it links against change from:
U cexp
U cos
U sin
to :
U __svml_cos2
U __svml_sin2
U cexp
SVML stands for Short Vector Math Library. Some of the trade-offs between speed and accuracy are documented under Intel IPP Library Fixed-Accuracy Arithmetic Functions.
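If you want most of the -O2/-O3 speed but tighter accuracy, ifort has knobs for this. A sketch (the exact flags and their effect vary by compiler version, so treat the settings below as an assumption to verify against your documentation):
ifort a.f90 -O2 -fp-model precise -o ifort-O2-precise
./ifort-O2-precise
nm ifort-O2-precise | grep -iE 'svml|cexp|sin|cos'
With value-safe floating-point models the compiler typically falls back to the higher-accuracy math routines, at some cost in speed.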
I am aware of this and this, but I ask again as the first link is pretty old now, and the second link did not seem to reach a conclusive answer. Has any consensus developed?
My problem is simple:
I have a DO loop that has elements that may be run concurrently. Which method do I use ?
Below is code to generate particles on a simple cubic lattice.
npart is the number of particles
npart_edge and npart_face are the numbers of particles along an edge and on a face, respectively
space is the lattice spacing
Rx, Ry, Rz are the position arrays
x, y, z are temporary variables that decide the position on the lattice
Note the difference: x, y and z have to be arrays in the CONCURRENT case, but not in the OpenMP case, where they can be declared PRIVATE (see also the Fortran 2018 sketch after the two loops below).
So do I use DO CONCURRENT (which, as I understand from the links above, uses SIMD):
DO CONCURRENT (i = 1:npart)
  x(i) = MODULO(i-1, npart_edge)
  Rx(i) = space*x(i)
  y(i) = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y(i)
  z(i) = (i-1)/npart_face
  Rz(i) = space*z(i)
END DO
Or do I use OpenMP?
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(x,y,z)
!$OMP DO
DO i = 1, npart
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y
  z = (i-1)/npart_face
  Rz(i) = space*z
END DO
!$OMP END DO
!$OMP END PARALLEL
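As an aside (not part of the original question): Fortran 2018 adds locality specifiers to DO CONCURRENT, so with a new enough compiler x, y and z can stay scalars there too. A sketch, assuming F2018 locality-spec support:
DO CONCURRENT (i = 1:npart) LOCAL(x, y, z)
  ! LOCAL(...) gives each iteration its own x, y, z, like OpenMP PRIVATE
  x = MODULO(i-1, npart_edge)
  Rx(i) = space*x
  y = MODULO((i-1)/npart_edge, npart_edge)
  Ry(i) = space*y
  z = (i-1)/npart_face
  Rz(i) = space*z
END DO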
My tests:
Placing 64 particles in a box of side 10:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 6.870000000000001E-003
Real time = 3.600000000000000E-003
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 6.699999999999979E-005
Real time = 0.000000000000000E+000
Placing 100000 particles in a box of side 100:
$ ifort -qopenmp -real-size 64 omp.f90
$ ./a.out
CPU time = 8.213300000000000E-002
Real time = 1.280000000000000E-002
$ ifort -real-size 64 concurrent.f90
$ ./a.out
CPU time = 2.385000000000000E-003
Real time = 2.400000000000000E-003
Using the DO CONCURRENT construct seems to be giving me at least an order of magnitude better performance. This was done on an i7-4790K. Also, the advantage of concurrency seems to decrease with increasing size.
DO CONCURRENT does not do any parallelization per se. The compiler may decide to parallelize it using threads, use SIMD instructions, or even offload to a GPU. For threads, you often have to instruct it to do so. For GPU offloading you need a particular compiler with particular options. Or (often!) the compiler just treats DO CONCURRENT as a regular DO and uses SIMD if it would use it for the regular DO.
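For instance (my addition, not the answerer's; verify against your compiler's documentation), the options that actually turn on parallelism look like this:
ifort -parallel concurrent.f90        # auto-parallelize DO CONCURRENT with threads
nvfortran -stdpar=gpu concurrent.f90  # offload DO CONCURRENT to an NVIDIA GPU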
OpenMP is also not just threads; the compiler can use SIMD instructions if it wants. There is also the omp simd directive, but that is only a suggestion to the compiler to use SIMD, and it can be ignored.
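A minimal illustration of that directive (a hint only; the compiler may still generate scalar code):
!$omp simd
do i = 1, npart
  Rz(i) = space * ((i-1) / npart_face)
end do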
You should try, measure and see. There is no single definitive answer, not even for a given compiler, let alone for all compilers.
If you would not use OpenMP anyway, I would give DO CONCURRENT a try to see if the automatic parallelizer does a better job with this construct. Chances are good that it will help. If your code is already in OpenMP, I do not see any point introducing DO CONCURRENT.
My practice is to use OpenMP and try to make sure the compiler vectorizes (SIMD) what it can. Especially because I use OpenMP all over my program anyway. DO CONCURRENT still has to prove it is actually useful. I am not convinced, yet, but some GPU examples look promising - however, real codes are often much more complex.
Your specific examples and the performance measurement:
Too little code is given, and there are subtle points in every benchmark. I wrote some simple code around your loops and did my own tests. I was careful NOT to include the thread creation in the timed block: you should not include the !$omp parallel startup in your timing. I also took the minimum real time over multiple runs, because sometimes the first run is longer (certainly with DO CONCURRENT) and the CPU has various throttle modes and may need some time to spin up. I also added SCHEDULE(STATIC).
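A minimal sketch of that kind of harness (my own reconstruction, not the answerer's actual code; npart_edge = 216 and space = 1.0d0 are assumed values). Most OpenMP runtimes create the thread pool at the first parallel region and reuse it afterwards, which keeps thread creation out of the timed block:
program time_omp
  implicit none
  integer, parameter :: npart = 10000000, nrep = 5
  integer, parameter :: npart_edge = 216, npart_face = npart_edge**2
  real(kind=8), parameter :: space = 1.0d0
  real(kind=8), allocatable :: Rx(:), Ry(:), Rz(:)
  real(kind=8) :: tmin
  integer(kind=8) :: c0, c1, rate
  integer :: i, rep, x, y, z

  allocate(Rx(npart), Ry(npart), Rz(npart))

  !$omp parallel
  continue   ! warm-up: the thread pool is created here, outside the timing
  !$omp end parallel

  tmin = huge(tmin)
  do rep = 1, nrep   ! report the minimum wall time over several runs
    call system_clock(c0, rate)
    !$omp parallel do default(shared) private(x, y, z) schedule(static)
    do i = 1, npart
      x = modulo(i-1, npart_edge)
      Rx(i) = space*x
      y = modulo((i-1)/npart_edge, npart_edge)
      Ry(i) = space*y
      z = (i-1)/npart_face
      Rz(i) = space*z
    end do
    !$omp end parallel do
    call system_clock(c1)
    tmin = min(tmin, dble(c1 - c0)/dble(rate))
  end do
  print *, 'min wall time (s):', tmin
end program time_omp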
npart=10000000
ifort -O3 concurrent.f90: 6.117300000000000E-002
ifort -O3 concurrent.f90 -parallel: 5.044600000000000E-002
ifort -O3 concurrent_omp.f90: 2.419600000000000E-002
npart=10000, default 8 threads (hyper-threading)
ifort -O3 concurrent.f90: 5.430000000000000E-004
ifort -O3 concurrent.f90 -parallel: 8.899999999999999E-005
ifort -O3 concurrent_omp.f90: 1.890000000000000E-004
npart=10000, OMP_NUM_THREADS=4 (ignore hyper-threading)
ifort -O3 concurrent.f90: 5.410000000000000E-004
ifort -O3 concurrent.f90 -parallel: 9.200000000000000E-005
ifort -O3 concurrent_omp.f90: 1.070000000000000E-004
Here, DO CONCURRENT seems to be somewhat faster for the small case, but not too much if we make sure to use the right number of cores. It is clearly slower for the big case. The -parallel option is clearly necessary for the automatic parallelization.
Thank you for having a look at this problem.
Problem:
Segmentation fault when returning from an f90 subroutine that contains the KINSOL solving process, after the correct computation result has been generated. There is no problem when the same solving process is in the main program.
Environment:
linux,
gcc,
sundials static libs
How to reproduce the problem:
build the attached REDUCED test code
module moduleNonlinearSolve
  integer, save :: nEq
contains
  subroutine solveNonlinear(u)
    double precision :: u(*)
    integer :: iout(15), ier, i
    double precision :: rout(2), koefScal(nEq)
    koefScal(:) = 1d0
    call fnvinits(3, nEq, ier)
    call fkinmalloc(iout, rout, ier)
    call fkinspgmr(50, 10, ier)
    call fkinsol(u, 1, koefScal, koefScal, ier)
    call fkinfree()
    do i = 1, nEq
      write(*,*) i, u(i)
    end do
  end subroutine
end module

subroutine fkfun(u, fval, ier)
  use moduleNonlinearSolve
  double precision :: u(*)
  double precision :: fval(*)
  integer :: ier
  forall(i = 2:nEq-1)
    fval(i) = -u(i-1) + 2d0*u(i) - u(i+1) - 1d0
  end forall
  fval(1) = u(1) + 2d0*u(1) - u(2) - 1d0
  fval(nEq) = -u(nEq-1) + 2d0*u(nEq) + u(nEq) - 1d0
  ier = 0
end subroutine

program test
  use moduleNonLinearSolve
  double precision :: u(10)
  nEq = size(u)
  u(:) = 10d0
  call solveNonlinear(u)
end program
compile
$ gfortran -c -Wall -g test.f90
$ gfortran -Wall -g -o test test.o -lsundials_fkinsol -lsundials_fnvecserial -lsundials_kinsol -lsundials_nvecserial -llapack -lblas
run
$ ./test
Note: it works flawlessly if I put all the SUNDIALS procedures in the main program.
Thank you very much for any input.
Mianzhi
According to the KINSOL documentation, the first argument of fkinmalloc must be of the same integer type as the C type long int. In your case, long int is 8 bytes long, but you are passing in an array of 4 byte integers. This will lead to fkinmalloc trying to write beyond the bounds of the array, and into some other memory. This typically leads to memory corruption, which has symptoms just like what you are observing: Crash at some random later point, such as when returning from a function. You should be able to confirm this by running the program through valgrind, which will probably report invalid writes of size 8. Anyway, replacing
integer :: iout(15)
with
integer*8 :: iout(15)
should solve the problem.
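As a side note (my addition, not from the original answer): since the requirement is "same type as the C long int", spelling the kind out via iso_c_binding avoids hard-coding 8 bytes:
use iso_c_binding, only: c_long
integer(kind=c_long) :: iout(15)   ! matches C long int on any platform
Running the unfixed program under valgrind (valgrind ./test) should confirm the out-of-bounds writes.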
I tried to write a Taylor series expansion for exp(x)/sin(x) in Fortran, but when I tested my implementation for small numbers (N=3 and X=1.0) and added the terms by hand, the results did not match my expectation. By hand I calculated 4.444..., but the program gives 7.54113. Could you please check my code and tell me if I got anything wrong?
Here is the expansion formula for e^x/sin(x) in wolframalpha: http://www.wolframalpha.com/input/?i=e%5Ex%2Fsin%28x%29
PROGRAM Taylor
  IMPLICIT NONE
  INTEGER :: Count1, Count2, N=3
  REAL :: X=1.0, Sum=0.0
  COMPLEX :: i=(0.0,0.1)
  INTEGER :: FACT
  DO Count1=1,N,1
    DO Count2=0,N,1
      Sum=Sum+EXP(i*X*(-1+2*Count1))*(X**Count2)/FACT(Count2)
    END DO
  END DO
  PRINT*,Sum
END PROGRAM Taylor
INTEGER FUNCTION FACT(n)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: n
  INTEGER :: i, Ans
  Ans = 1
  DO i = 1, n
    Ans = Ans * i
  END DO
  FACT = Ans
END FUNCTION FACT
I don't see any complex terms in that expansion by Wolfram, so I'd wonder why you think you need the complex number in the exponential term. And you can't get that 1/x term the way you've programmed it. You need an x**(-1.0) term somewhere.
Your factorial implementation is rather naive, too.
I'd recommend that you forget about the loop and factorials and start with a polynomial, coefficients, and Horner's method for evaluation. Get that working and then see if you can sort out the loop.
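A sketch of that approach (my addition), using the first few coefficients of the Wolfram expansion, 1/x + 1 + 2x/3 + x**2/3 + 13x**3/90 + ...; verify the coefficients against the page yourself:
program horner_demo
  implicit none
  ! c(k) is the coefficient of x**k in the non-singular part of the series
  real, parameter :: c(0:3) = [1.0, 2.0/3.0, 1.0/3.0, 13.0/90.0]
  real :: x, p
  integer :: k
  x = 1.0
  p = 0.0
  do k = 3, 0, -1          ! Horner: p = ((c3*x + c2)*x + c1)*x + c0
    p = p*x + c(k)
  end do
  print *, 1.0/x + p       ! truncated series for exp(x)/sin(x)
end program horner_demo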
The Wolfram article has the expansion formula stated using q = e**(ix) so there is a complex term. Therefore "sum" should be declared complex.
As already stated, the factorial function is simplistic. Be careful of overflow.
It is best to place your procedures into a module and "use" that module from the main program. Use as many compiler debugging options as possible. For example, gfortran, when the appropriate warning option is used, warns about the type of "sum": "Warning: Possible change of value in conversion from COMPLEX(4) to REAL(4)". If you are using gfortran try: -O2 -fimplicit-none -Wall -Wline-truncation -Wcharacter-truncation -Wsurprising -Waliasing -Wimplicit-interface -Wunused-parameter -fwhole-file -fcheck=all -std=f2008 -pedantic -fbacktrace
Since you can do this problem by hand, try outputting each step with a write statement and comparing to your hand calculation. You will probably quickly see where the calculation diverges. If it isn't clear why the calculation is different, break it down into pieces.
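For example (illustrative only), a print inside the loops of the program above would show each term as it is accumulated:
DO Count1=1,N,1
  DO Count2=0,N,1
    PRINT *, Count1, Count2, EXP(i*X*(-1+2*Count1))*(X**Count2)/FACT(Count2)
    Sum=Sum+EXP(i*X*(-1+2*Count1))*(X**Count2)/FACT(Count2)
  END DO
END DO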
Recently, I read a post on Stack Overflow about finding integers that are perfect squares. As I wanted to play with this, I wrote the following small program:
PROGRAM PERFECT_SQUARE
  IMPLICIT NONE
  INTEGER*8 :: N, M, NTOT
  LOGICAL :: IS_SQUARE
  N=Z'D0B03602181'
  WRITE(*,*) IS_SQUARE(N)
  NTOT=0
  DO N=1,1000000000
    IF (IS_SQUARE(N)) THEN
      NTOT=NTOT+1
    END IF
  END DO
  WRITE(*,*) NTOT ! should find 31622 squares
END PROGRAM
LOGICAL FUNCTION IS_SQUARE(N)
  IMPLICIT NONE
  INTEGER*8 :: N, M
  ! check if negative
  IF (N.LT.0) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  ! check if ending 4 bits belong to (0,1,4,9)
  M=IAND(N,15)
  IF (.NOT.(M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  ! try to find the nearest integer to sqrt(n)
  M=DINT(SQRT(DBLE(N)))
  IF (M**2.NE.N) THEN
    IS_SQUARE=.FALSE.
    RETURN
  END IF
  IS_SQUARE=.TRUE.
  RETURN
END FUNCTION
When compiling with gfortran -O2, running time is 4.437 seconds, with -O3 it is 2.657 seconds. Then I thought that compiling with ifort -O2 could be faster since it might have a faster SQRT function, but it turned out running time was now 9.026 seconds, and with ifort -O3 the same. I tried to analyze it using Valgrind, and the Intel compiled program indeed uses many more instructions.
My question is why? Is there a way to find out where exactly the difference comes from?
EDITS:
gfortran version 4.6.2 and ifort version 12.0.2
times are obtained from running time ./a.out and is the real/user time (sys was always almost 0)
this is on Linux x86_64, both gfortran and ifort are 64-bit builds
ifort inlines everything; gfortran does so only at -O3, but the latter's assembly code is simpler than ifort's, which uses xmm registers a lot
fixed line of code, added NTOT=0 before loop, should fix issue with other gfortran versions
When the complex IF statement is removed, gfortran takes about 4 times as long (10-11 seconds). This is to be expected, since the test throws out about 75% of the numbers (see the quick check below), avoiding the SQRT for them. On the other hand, ifort only uses slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
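The 75% figure follows because a square can only be 0, 1, 4 or 9 modulo 16, i.e. 4 of the 16 possible residues. A quick check (my addition):
program squares_mod16
  implicit none
  integer :: k
  ! residues of k**2 mod 16 for k = 0..15: only 0, 1, 4 and 9 occur
  print *, [(mod(k*k, 16), k = 0, 15)]
end program squares_mod16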
EDIT2:
I tried with ifort version 12.1.2.273 and it's much faster, so it looks like they fixed that.
What compiler versions are you using?
Interestingly, it looks like a case where there is a performance regression from 11.1 to 12.0 -- e.g. for me, 11.1 (ifort -fast square.f90) takes 3.96s, and 12.0 (same options) took 13.3s.
gfortran (4.6.1) (-O3) is still faster (3.35s).
I have seen this kind of a regression before, although not quite as dramatic.
BTW, replacing the if statement with
is_square = any(m == [0, 1, 4, 9])
if(.not. is_square) return
makes it run twice as fast with ifort 12.0, but slower in gfortran and ifort 11.1.
It looks like part of the problem is that 12.0 is overly aggressive in trying to vectorize things: adding
!DEC$ NOVECTOR
right before the DO loop (without changing anything else in the code) cuts the run time down to 4.0 sec.
Also, as a side benefit: if you have a multi-core CPU, try adding -parallel to the ifort command line :)
In trying to mix precision in a simple program - using both real and double precision - and use the ddot routine from BLAS, I'm getting incorrect output for the double-precision piece. Here's the code:
program test
  !! adding this statement narrowed the issue down to ddot being considered real(4)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  !! The following 2 lines were added for the calls to the BLAS routines.
  !! This fixed the issue.
  real(dp), external :: ddot
  real, external :: sdot
  real, dimension(3) :: a, b
  real(dp), dimension(3) :: d, e
  integer :: i
  do i = 1, 3
    a(i) = 1.0*i
    b(i) = 3.5*i
    d(i) = 1.0d0*i
    e(i) = 3.5d0*i
  end do
  write (*,200) "sdot real(4) = ", sdot(3,a,1,b,1) ! should work and return 49.0
  write (*,200) "ddot real(4) = ", ddot(3,a,1,b,1) ! should not work
  write (*,200) "sdot real(8) = ", sdot(3,d,1,e,1) ! should not work
  write (*,200) "ddot real(8) = ", ddot(3,d,1,e,1) ! should work and return 49.0
200 format(a,f5.2)
end program test
I've tried compiling with both gfortran and ifort using the MKL BLAS libraries as follows:
ifort -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
gfortran -lmkl_intel_lp64 -lmkl_sequential -lmkl_core main.f90
The output is:
sdot real(4) = 49.00
ddot real(4) = 0.00
sdot real(8) = 4.10
ddot real(8) = 0.00
How can I get the ddot routine to correctly process the double precision values?
Additionally, adding the -autodouble (ifort) or -fdefault-real-8 (gfortran) flag makes both ddot calls work, but then the sdot calls fail.
Edit:
I added the implicit none statement and the two type declarations for the ddot and sdot functions. Without an explicit type for the functions, ddot was being implicitly typed as single-precision real.
I haven't used MKL, but perhaps you need a "use" statement so that the compiler knows the interfaces of the functions? Or you need to otherwise declare the functions: they are not declared, so the compiler is probably assuming that the return value of ddot is single precision and misinterpreting the bits.
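For example, an explicit interface along these lines (a sketch of the standard BLAS ddot signature) lets the compiler check the call; MKL also ships Fortran interface files that serve the same purpose and catch kind mismatches at compile time:
interface
  double precision function ddot(n, dx, incx, dy, incy)
    integer :: n, incx, incy
    double precision :: dx(*), dy(*)
  end function ddot
end interface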
Turning on the warning option causes the compiler to tell you about the problem. With gfortran, try:
-fimplicit-none -Wall -Wline-truncation -Wcharacter-truncation -Wsurprising -Waliasing -Wimplicit-interface -Wunused-parameter -fwhole-file -fcheck=all -std=f2008 -pedantic -fbacktrace
Passing incorrect kind variables is a case of interface mismatch (which is illegal, so in principle the compiler might do anything including starting WW III), so maybe this is messing up the stack and hence following calls also return incorrect results. Try to comment out those incorrect calls (your lines marked with "should not work") and see if that helps.
Also, enable all kinds of debug options you can find, as e.g. the answer by M.S.B. shows for gfortran.