I have parallelized a subroutine. It shows a very good benchmark: a 4X speedup on a quad core. I keep the two versions in two different sources, serial.f and paral.f, and the comparison is made by running them from the terminal and printing the elapsed wall-clock time. Inside each source there is only the call to the corresponding subroutine. But when I modify the sources like this:
serial.f :
do i=1,100
call serial
end do
and like this
paral.f :
do i=1,100
call paral
end do
performance goes down to 0.96X: the parallel version is now worse than the serial one! The code can be found in the question "why calling many N times a serial subroutine is faster than calling N times the parallel version of the same subroutine".
To obtain serial.f, just comment out the block containing the call to paral; to obtain paral.f, just comment out the block containing the call to serial.
My questions are: is this a common problem? How can I solve it so that I keep the 4X speedup while keeping the loop of calls?
Please note:
(1) I've tried translating it to C and timing it; the benchmarks and the problem remain the same.
(2) I've tried translating it to modern Fortran and timing it; the benchmarks and the problem remain the same.
(3) I've tried all kinds of tricks and rewrites of the code. I'm sure the problem is not how the subroutine is parallelized (I achieved 4X), but that it is called too many times inside a loop.
Thank you.
EDIT:
As requested, I'm posting a program written in modern Fortran that exhibits the same issue:
program main

  use omp_lib

  implicit none

  integer ( kind = 4 ), parameter :: m = 5000
  integer ( kind = 4 ), parameter :: n = 5000
  integer ( kind = 4 ) i
  integer ( kind = 4 ) j
  integer ( kind = 4 ) nn
  real ( kind = 8 ) u(m,n)
  real ( kind = 8 ) w(m,n)
  real ( kind = 8 ) wtime, h

  call random_seed()
  do j = 1, n
     do i = 1, m
        call random_number(u(i,j))
     end do
  end do

  w = 0.0D+00   ! initialize w, since the stencil below never touches the boundary

  wtime = omp_get_wtime ( )

  do nn = 1, 100
     !$omp parallel do default(none) shared(u, w) private(i,j)
     do j = 2, n - 1
        do i = 2, m - 1
           w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
        end do
     end do
     !$omp end parallel do
  end do

  wtime = omp_get_wtime ( ) - wtime

  h = 0.0D+00
  do j = 1, n
     do i = 1, m
        h = h + w(i,j)
     end do
  end do

  write ( *, '(a,g14.6)' ) ' Wall clock time = ', wtime
  write ( *, '(a,g14.6)' ) ' h ', h

  stop
end
To get serial_with_loop.f90, just comment out the OpenMP directives and the nn loop. With a similar method you must also obtain paral_with_loop.f90, and the serial and paral versions without the loop. You can compile with "gfortran -o name.out -fopenmp -O3 name.f90" and launch from the terminal with output redirected to a text file: "./name.out > time_result.txt".
The problem is that you are parallelizing the loop over j, which is located inside the loop over nn. Therefore, for each value of nn, your machine needs time to create a pool of threads that do the job for the different values of j. That time (required for creating the pool) is serial and cannot be divided by the number of threads used. As far as I can see, there is no reason why you cannot parallelize the nn loop instead and create that pool only once, instead of nn times. I think your code will work better if you write:
wtime = omp_get_wtime ( )
!$omp parallel do default(none) shared(u, w) private(nn,i,j)
do nn = 1, 100
   do j = 2, n - 1
      do i = 2, m - 1
         w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
      end do
   end do
end do
!$omp end parallel do
wtime = omp_get_wtime ( ) - wtime
I hope that this helps you.
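If the nn loop had to stay ordered (for example, if each sweep depended on the previous one), another option that still pays the thread start-up cost only once would be to open the parallel region outside the nn loop and keep only the worksharing do inside it. This is just a sketch under that assumption:

wtime = omp_get_wtime ( )
!$omp parallel default(none) shared(u, w) private(nn,i,j)
do nn = 1, 100                ! every thread executes the outer loop; the work below is shared
   !$omp do
   do j = 2, n - 1
      do i = 2, m - 1
         w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
      end do
   end do
   !$omp end do               ! implicit barrier keeps the nn sweeps in step
end do
!$omp end parallel
wtime = omp_get_wtime ( ) - wtime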
I am using persistent communication in my CFD code. I have the communication set up in another subroutine, and in the main subroutine, where I have the do loop, I use MPI_STARTALL() and MPI_WAITALL().
To keep it short, I am showing only the first part of the setup; the rest of the arrays are handled exactly the same way.
My setup subroutine looks like:
Subroutine MPI_Subroutine
use Variables
use mpi
implicit none
!Starting up MPI
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,npes,ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,MyRank,ierr)
!Compute the size of local block (1D Decomposition)
Jmax = JmaxGlobal
Imax = ImaxGlobal/npes
if (MyRank.lt.(ImaxGlobal - npes*Imax)) then
Imax = Imax + 1
end if
if (MyRank.ne.0.and.MyRank.ne.(npes-1)) then
Imax = Imax + 2
Else
Imax = Imax + 1
endif
! Computing neighbours
if (MyRank.eq.0) then
Left = MPI_PROC_NULL
else
Left = MyRank - 1
end if
if (MyRank.eq.(npes -1)) then
Right = MPI_PROC_NULL
else
Right = MyRank + 1
end if
! Initializing the Arrays in each processor, according to the number of local nodes
Call InitializeArrays
!Creating the channel of communication for this computation,
!Sending and receiving the u_old (Ghost cells)
Call MPI_SEND_INIT(u_old(2,:),Jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(1),ierr)
Call MPI_RECV_INIT(u_old(Imax,:),jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(2),ierr)
Call MPI_SEND_INIT(u_old(Imax-1,:),Jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(3),ierr)
Call MPI_RECV_INIT(u_old(1,:),jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(4),ierr)
Since I am debugging my code, I am just checking these arrays. When I check them, my ghost cells are full of zeroes, so I guess I am messing up the instructions.
The main code, where I call MPI_STARTALL and MPI_WAITALL, looks like:
Program
use Variables
use mpi
implicit none
open(32, file = 'error.dat')
Call MPI_Subroutine
!kk=kk+1
DO kk=1, 2001
! A lot of calculation
! communicating the maximum error among the processes and delta t
call MPI_REDUCE(eps,epsGlobal,1,MPI_DOUBLE_PRECISION,MPI_MAX,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(epsGlobal,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_DOUBLE_PRECISION,MPI_MIN,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) delta_t = delta_tGlobal
call MPI_BCAST(delta_t,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) then
write(*,*) kk,epsGlobal,(kk*delta_t)
write(32,*) kk,epsGlobal
endif
Call Swap
Call MPI_STARTALL(4,req,ierr) !
Call MPI_WAITALL(4,req,status,ierr)
enddo
The variables are set in another module. The MPI-related variables look like:
! MPI variables
INTEGER :: npes, MyRank, ierr, Left, Right, tag
INTEGER :: status(MPI_STATUS_SIZE,4)
INTEGER,dimension(4) :: req
I appreciate your time and suggestions on this problem.
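For reference, here is a minimal, self-contained sketch of the persistent-communication pattern used above (MPI_SEND_INIT / MPI_RECV_INIT set up once, then MPI_STARTALL / MPI_WAITALL reused every iteration). The buffer and neighbour names here are illustrative, use contiguous 1-D buffers, and are not taken from the code above:

program persistent_demo
  use mpi
  implicit none
  integer, parameter :: jmax = 8
  integer :: myrank, npes, left, right, ierr, k
  integer :: req(4), stats(MPI_STATUS_SIZE,4)
  double precision :: send_l(jmax), send_r(jmax), recv_l(jmax), recv_r(jmax)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
  left  = merge(MPI_PROC_NULL, myrank - 1, myrank == 0)
  right = merge(MPI_PROC_NULL, myrank + 1, myrank == npes - 1)

  ! Set up the channels once; the buffers must keep the same address afterwards.
  call MPI_SEND_INIT(send_l, jmax, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(1), ierr)
  call MPI_RECV_INIT(recv_r, jmax, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, req(2), ierr)
  call MPI_SEND_INIT(send_r, jmax, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, req(3), ierr)
  call MPI_RECV_INIT(recv_l, jmax, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(4), ierr)

  do k = 1, 3
     send_l = dble(myrank)          ! fill the send buffers before each start
     send_r = dble(myrank)
     call MPI_STARTALL(4, req, ierr)
     call MPI_WAITALL(4, req, stats, ierr)
  end do

  do k = 1, 4
     call MPI_REQUEST_FREE(req(k), ierr)
  end do
  call MPI_FINALIZE(ierr)
end program persistent_demo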
PROGRAM ShareNeighbors
IMPLICIT REAL (a-h,o-z)
INCLUDE "mpif.h"
PARAMETER (m = 500, n = 500)
DIMENSION a(m,n), b(m,n)
DIMENSION h(m,n)
INTEGER istatus(MPI_STATUS_SIZE)
INTEGER iprocs, jprocs
PARAMETER (ROOT = 0)
integer dims(2),coords(2)
logical periods(2)
data periods/2*.false./
integer status(MPI_STATUS_SIZE)
integer comm2d,req,source
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
! Get a new communicator for a decomposition of the domain.
! Let MPI find a "good" decomposition
dims(1) = 0
dims(2) = 0
CALL MPI_DIMS_CREATE(nprocs,2,dims,ierr)
if (myrank.EQ.Root) then
print *,nprocs,'processors have been arranged into',dims(1),'X',dims(2),'grid'
endif
CALL MPI_CART_CREATE(MPI_COMM_WORLD,2,dims,periods,.true., &
comm2d,ierr)
! Get my position in this communicator
CALL MPI_COMM_RANK(comm2d,myrank,ierr)
! Get the decomposition
CALL fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
! print *,ista,jsta,iend,jend
ilen = iend - ista + 1
jlen = jend - jsta + 1
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
iprocs = dims(1)
jprocs = dims(2)
myranki = coords(1)
myrankj = coords(2)
DO j = jsta, jend
DO i = ista, iend
a(i,j) = myrank+1
ENDDO
ENDDO
! Send data from each processor to Root
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
! Receive the results from the other processors
if (myrank.EQ.Root) then
do source = 0,nprocs-1
call MPI_RECV(ista,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(iend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jsta,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
ilen = iend - ista + 1
jlen = jend - jsta + 1
call MPI_RECV(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
source,1,MPI_COMM_WORLD,status,ierr)
! print the results
call ZMINMAX(m,n,ista,iend,jsta,jend,a(:,:),amin,amax)
print *, 'myid=',source,amin,amax
call MPI_Wait(req, status, ierr)
enddo
endif
CALL MPI_FINALIZE(ierr)
END
subroutine fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
integer comm2d
integer m,n,ista,jsta,iend,jend
integer dims(2),coords(2),ierr
logical periods(2)
! Get (i,j) position of a processor from Cartesian topology.
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
! Decomposition in first (ie. X) direction
CALL MPE_DECOMP1D(m,dims(1),coords(1),ista,iend)
! Decomposition in second (ie. Y) direction
CALL MPE_DECOMP1D(n,dims(2),coords(2),jsta,jend)
return
end
SUBROUTINE MPE_DECOMP1D(n,numprocs,myid,s,e)
integer n,numprocs,myid,s,e,nlocal,deficit
nlocal = n / numprocs
s = myid * nlocal + 1
deficit = mod(n,numprocs)
s = s + min(myid,deficit)
! Give one more slice to processors
if (myid .lt. deficit) then
nlocal = nlocal + 1
endif
e = s + nlocal - 1
if (e .gt. n .or. myid .eq. numprocs-1) e = n
return
end
SUBROUTINE ZMINMAX(IX,JX,SX,EX,SY,EY,ZX,ZXMIN,ZXMAX)
INTEGER :: IX,JX,SX,EX,SY,EY
REAL :: ZX(IX,JX)
REAL :: ZXMIN,ZXMAX
ZXMIN=1000.
ZXMAX=-1000.
DO II=SX,EX
DO JJ=SY,EY
IF(ZX(II,JJ).LT.ZXMIN)ZXMIN=ZX(II,JJ)
IF(ZX(II,JJ).GT.ZXMAX)ZXMAX=ZX(II,JJ)
ENDDO
ENDDO
RETURN
END
When I run the above code with 4 processors, Root receives garbage values, whereas with 15 processors the data transfer is fine. How can I tackle this?
I guess it is related to buffering, a point which is not clear to me. How should I handle the buffering wisely?
Problem 1
You are doing multiple sends
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
and all of them with the same request variable req. That can't work: each pending operation needs its own request.
Problem 2
You are using an array section a(ista:iend,jsta:jend) in a non-blocking MPI call. That is not allowed*. You need to copy the section into a temporary contiguous buffer, or use an MPI derived subarray datatype (too hard for you at this stage).
The reason for the problem is that the compiler creates a temporary copy just for the call to ISend. ISend remembers the address of that temporary but may not send anything yet. The temporary is then deleted and its address becomes invalid, so when MPI_Wait later tries to use that address, it fails.
Problem 3
Your MPI_Wait is in the wrong place. It must come after the sends, and outside of any if conditions, so that it is always executed (provided you are always sending).
You must collect all the requests separately and then wait for all of them. It is best to keep them in an array and wait for all of them at once using MPI_Waitall.
Remember, ISend typically does not actually send anything when the buffer is large; the exchange often happens during the Wait operation, at least for larger messages.
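Putting those three points together, the send side of the code above might look roughly like the sketch below. The names sendbuf, reqs and stats are illustrative, all five sends keep the original tag, and the receive side has to be adapted to match:

  integer :: reqs(5), stats(MPI_STATUS_SIZE,5)
  real, allocatable :: sendbuf(:,:)

  allocate(sendbuf(ilen,jlen))
  sendbuf = a(ista:iend,jsta:jend)   ! explicit contiguous copy that stays alive until the wait

  call MPI_ISEND(ista, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_ISEND(iend, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(2), ierr)
  call MPI_ISEND(jsta, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(3), ierr)
  call MPI_ISEND(jend, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(4), ierr)
  call MPI_ISEND(sendbuf, ilen*jlen, MPI_REAL, Root, 1, MPI_COMM_WORLD, reqs(5), ierr)

  ! ... Root's receive loop goes here, receiving into its own local variables ...

  call MPI_WAITALL(5, reqs, stats, ierr)   ! every rank waits, outside any if block
  deallocate(sendbuf)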
Recommendation:
Take a simple example problem and try to exchange just two small arrays with MPI_IRecv and MPI_ISend between two processes, as simple a test case as you can manage (a minimal sketch of such an exchange is given after the footnote below). Learn from it and take small steps. Take no offence, but your current understanding of non-blocking MPI is too weak to write full-scale programs. MPI is hard; non-blocking MPI is even harder.
* Not allowed with the interfaces available up to MPI-2. MPI-3 brings a new interface, available via use mpi_f08, with which it is possible. But learn the basics first.
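Here is a minimal, self-contained version of that exercise: two ranks exchange one small array each with MPI_IRECV/MPI_ISEND and a single MPI_WAITALL. It assumes it is launched with exactly two processes (mpiexec -n 2):

program exchange_two
  use mpi
  implicit none
  integer, parameter :: n = 4
  integer :: myrank, nprocs, other, ierr
  integer :: reqs(2), stats(MPI_STATUS_SIZE,2)
  real :: sendbuf(n), recvbuf(n)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  other   = 1 - myrank              ! with 2 processes: rank 0 talks to rank 1 and vice versa
  sendbuf = real(myrank + 1)

  ! Post the receive and the send, each with its own request.
  call MPI_IRECV(recvbuf, n, MPI_REAL, other, 0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_ISEND(sendbuf, n, MPI_REAL, other, 0, MPI_COMM_WORLD, reqs(2), ierr)

  ! Neither buffer may be reused until both requests have completed.
  call MPI_WAITALL(2, reqs, stats, ierr)

  print *, 'rank', myrank, 'received', recvbuf
  call MPI_FINALIZE(ierr)
end program exchange_two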
I'm trying to run this MPI Fortran code. There are several problems:
1) When I run this code, I expect the program to write 'Enter the number of intervals: (0 quits)' to the screen and then ask me for n. Instead it asks me for n first! Why?
2) If I don't comment out the line 'goto 10', the program keeps asking me for n forever and never shows me anything else.
3) If I comment out 'goto 10', the program asks me for n and then writes the results. The problem is that every time, the program writes only part of the result, not the complete output; it truncates it. Below are the outputs from three consecutive runs of the program:
> mpiexec -n 40 ./a.out
10000000
Enter the number of intervals: (0 quits)
pi is 3.14159265358978 Error is 1.287858708565182E-014
time is 1.687502861022949E-002 seconds
> mpiexec -n 40 ./a.out
10000000
Enter the number of intervals: (0 quits)
pi is 3.14159265358978 Error is 1.287858708565182E-014
time is 1.68750286102
> mpiexec -n 40 ./a.out
10000000
Enter the number of intervals: (0 quits)
pi is 3.14159265358978 Error is 1.287858708565182E-014
time is 1.687502861022949E-002 se
Does anyone have an idea what's going on? I appreciate your help in advance.
program main
use mpi
double precision starttime, endtime
double precision PI25DT
parameter (PI25DT = 3.141592653589793238462643d0)
double precision mypi, pi, h, sum, x, f, a
integer n, myid, numprocs, i, ierr
f(a) = 4.d0 / (1.d0 + a*a) ! function to integrate
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
10 if ( myid .eq. 0 ) then
print *, 'Enter the number of intervals: (0 quits) '
read(*,*) n
endif
starttime = MPI_WTIME()
! broadcast n
call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
! check for quit signal
if ( n .le. 0 ) goto 30
! calculate the interval size
h = 1.0d0/n
sum = 0.0d0
do 20 i = myid+1, n, numprocs
x = h * (dble(i) - 0.5d0)
sum = sum + f(x)
20 continue
mypi = h * sum
! collect all the partial sums
call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0, &
MPI_COMM_WORLD,ierr)
! node 0 prints the answer.
endtime = MPI_WTIME()
if (myid .eq. 0) then
print *, 'pi is ', pi, 'Error is ', abs(pi - PI25DT)
print *, 'time is ', endtime-starttime, ' seconds'
endif
go to 10
30 call MPI_FINALIZE(ierr)
stop
end
This program is designed to loop via the "goto 10" at the end. The only way to break out of this is for n to have a value <= 0, which will activate the "goto 30" and branch past the "goto 10". Additional clues that this is the intent are the comment "check for quit signal" and that the prompt for the input of n includes "(0 quits)". So try inputting 0!
This is not a good example of modern Fortran. Despite clearly using a Fortran 90 or more recent compiler (the "use" statement shows this), it is written in the style of FORTRAN 77 or earlier. Fixed-source layout, with the source lines apparently starting in column 7. Comment characters in the first column (old Fortran required a C in the first column). "Double Precision". Heavy use of gotos for the program logic. (In my opinion, and others may disagree, there is a place for the goto statement, but not for the basic control flow of a program.) Obsolete Fortran (in my opinion).
The modern Fortran way to express the basic flow:
MainLoop: do
.....
if (n .le. 0) exit MainLoop
....
end do MainLoop
You have to explicitly flush your output. I do not remember whether Fortran has a standard flush function; if flush does not work, try flush_.
Basically, what happens is that your process zero buffers its output, and unless you explicitly tell it to write that buffer out, you end up with funny, truncated output.
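For what it's worth, Fortran 2003 and later do provide a standard FLUSH statement. A sketch of how the printing block of the program above might use it (output_unit comes from the intrinsic iso_fortran_env module; only the flush line is new):

   use, intrinsic :: iso_fortran_env, only : output_unit
   ...
   if (myid .eq. 0) then
      print *, 'pi is ', pi, 'Error is ', abs(pi - PI25DT)
      print *, 'time is ', endtime-starttime, ' seconds'
      flush(output_unit)   ! push the buffered lines to the terminal before looping again
   endif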