I have parallelized a subroutine. It shows a very good benchmark: a 4X speedup on a quad core. I keep the two versions in two different sources, serial.f and paral.f, and the comparison is made by running them from the terminal and printing the elapsed wall-clock time. Inside each source there is only the call to the corresponding subroutine. But when I modify the sources like this:
serial.f :
do i=1,100
call serial
end do
and like this
paral.f :
do i=1,100
call paral
end do
performance goes down to 0.96X: the parallel version is now worse than the serial one! The code can be found in the question "why calling many N times a serial subroutine is faster than calling N times the parallel version of the same subroutine".
To obtain serial.f, just comment out the block containing the call to paral; to obtain paral.f, just comment out the block containing the call to serial.
My questions are: is this a common problem? How can I solve it so that I keep the 4X speedup while keeping the loop of calls?
Please note:
(1) I've tried translating it to C and timing it; the benchmarks and the problem remain the same.
(2) I've tried translating it to modern Fortran and timing it; the benchmarks and the problem remain the same.
(3) I've tried all kinds of tricks and rewrites of the code. I'm sure the problem is not how the subroutine is parallelized (I achieved 4X), but that it is called too many times inside a loop.
Thank you.
EDIT:
As requested, I'm posting a program written in modern Fortran that exhibits the same issue:
program main

  use omp_lib

  implicit none

  integer ( kind = 4 ), parameter :: m = 5000
  integer ( kind = 4 ), parameter :: n = 5000
  integer ( kind = 4 ) i
  integer ( kind = 4 ) j
  integer ( kind = 4 ) nn
  real ( kind = 8 ) u(m,n)
  real ( kind = 8 ) w(m,n)
  real ( kind = 8 ) wtime, h

  call random_seed()
  do j = 1, n
     do i = 1, m
        call random_number(u(i,j))
     end do
  end do

  w = 0.0D+00   ! initialize w, since the stencil below never touches the boundary

  wtime = omp_get_wtime ( )

  do nn = 1, 100
     !$omp parallel do default(none) shared(u, w) private(i,j)
     do j = 2, n - 1
        do i = 2, m - 1
           w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
        end do
     end do
     !$omp end parallel do
  end do

  wtime = omp_get_wtime ( ) - wtime

  h = 0.0D+00
  do j = 1, n
     do i = 1, m
        h = h + w(i,j)
     end do
  end do

  write ( *, '(a,g14.6)' ) ' Wall clock time = ', wtime
  write ( *, '(a,g14.6)' ) ' h ', h

  stop
end
To get serial_with_loop.f90, just comment out the OpenMP directives and the nn loop. With a similar method you must also obtain paral_with_loop.f90, and the serial and paral versions without the loop. You can compile with "gfortran -o name.out -fopenmp -O3 name.f90" and launch from the terminal with output redirected to a text file: "./name.out > time_result.txt".
The problem is that you are parallelizing the loop over j, which is located inside the loop over nn. Therefore, for each value of nn, your machine needs time to create a pool of threads that do the job for the different values of j. That time (required for creating the pool) is serial and cannot be divided by the number of threads used. As far as I can see, there is no reason why you cannot parallelize the nn loop instead and create that pool only once, instead of nn times. I think your code will work better if you write:
wtime = omp_get_wtime ( )
!$omp parallel do default(none) shared(u, w) private(nn,i,j)
do nn = 1, 100
   do j = 2, n - 1
      do i = 2, m - 1
         w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
      end do
   end do
end do
!$omp end parallel do
wtime = omp_get_wtime ( ) - wtime
I hope that this helps you.
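If the nn loop had to stay ordered (for example, if each sweep depended on the previous one), another option that still pays the thread start-up cost only once would be to open the parallel region outside the nn loop and keep only the worksharing do inside it. This is just a sketch under that assumption:

wtime = omp_get_wtime ( )
!$omp parallel default(none) shared(u, w) private(nn,i,j)
do nn = 1, 100                ! every thread executes the outer loop; the work below is shared
   !$omp do
   do j = 2, n - 1
      do i = 2, m - 1
         w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
      end do
   end do
   !$omp end do               ! implicit barrier keeps the nn sweeps in step
end do
!$omp end parallel
wtime = omp_get_wtime ( ) - wtime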
I am using persistent communication in my CFD code. I have the communication set up in another subroutine, and in the main subroutine, where I have the do loop, I use MPI_STARTALL() and MPI_WAITALL().
To keep it short, I am showing only the first part of the setup; the rest of the arrays are handled exactly the same way.
My setup subroutine looks like:
Subroutine MPI_Subroutine
use Variables
use mpi
implicit none
!Starting up MPI
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,npes,ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,MyRank,ierr)
!Compute the size of local block (1D Decomposition)
Jmax = JmaxGlobal
Imax = ImaxGlobal/npes
if (MyRank.lt.(ImaxGlobal - npes*Imax)) then
Imax = Imax + 1
end if
if (MyRank.ne.0.and.MyRank.ne.(npes-1)) then
Imax = Imax + 2
Else
Imax = Imax + 1
endif
! Computing neighbours
if (MyRank.eq.0) then
Left = MPI_PROC_NULL
else
Left = MyRank - 1
end if
if (MyRank.eq.(npes -1)) then
Right = MPI_PROC_NULL
else
Right = MyRank + 1
end if
! Initializing the Arrays in each processor, according to the number of local nodes
Call InitializeArrays
!Creating the channel of communication for this computation,
!Sending and receiving the u_old (Ghost cells)
Call MPI_SEND_INIT(u_old(2,:),Jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(1),ierr)
Call MPI_RECV_INIT(u_old(Imax,:),jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(2),ierr)
Call MPI_SEND_INIT(u_old(Imax-1,:),Jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(3),ierr)
Call MPI_RECV_INIT(u_old(1,:),jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(4),ierr)
Since I am debugging my code, I am just checking these arrays. When I check them, my ghost cells are full of zeroes, so I guess I am messing up the instructions.
The main code, where I call MPI_STARTALL and MPI_WAITALL, looks like:
Program
use Variables
use mpi
implicit none
open(32, file = 'error.dat')
Call MPI_Subroutine
!kk=kk+1
DO kk=1, 2001
! A lot of calculation
! communicating the maximum error among the processes and delta t
call MPI_REDUCE(eps,epsGlobal,1,MPI_DOUBLE_PRECISION,MPI_MAX,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(epsGlobal,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_DOUBLE_PRECISION,MPI_MIN,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) delta_t = delta_tGlobal
call MPI_BCAST(delta_t,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) then
write(*,*) kk,epsGlobal,(kk*delta_t)
write(32,*) kk,epsGlobal
endif
Call Swap
Call MPI_STARTALL(4,req,ierr) !
Call MPI_WAITALL(4,req,status,ierr)
enddo
The variables are set in another module. The MPI-related variables look like:
! MPI variables
INTEGER :: npes, MyRank, ierr, Left, Right, tag
INTEGER :: status(MPI_STATUS_SIZE,4)
INTEGER,dimension(4) :: req
I appreciate your time and suggestions on this problem.
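For reference, here is a minimal, self-contained sketch of the persistent-communication pattern used above (MPI_SEND_INIT / MPI_RECV_INIT set up once, then MPI_STARTALL / MPI_WAITALL reused every iteration). The buffer and neighbour names here are illustrative, use contiguous 1-D buffers, and are not taken from the code above:

program persistent_demo
  use mpi
  implicit none
  integer, parameter :: jmax = 8
  integer :: myrank, npes, left, right, ierr, k
  integer :: req(4), stats(MPI_STATUS_SIZE,4)
  double precision :: send_l(jmax), send_r(jmax), recv_l(jmax), recv_r(jmax)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, ierr)
  left  = merge(MPI_PROC_NULL, myrank - 1, myrank == 0)
  right = merge(MPI_PROC_NULL, myrank + 1, myrank == npes - 1)

  ! Set up the channels once; the buffers must keep the same address afterwards.
  call MPI_SEND_INIT(send_l, jmax, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(1), ierr)
  call MPI_RECV_INIT(recv_r, jmax, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, req(2), ierr)
  call MPI_SEND_INIT(send_r, jmax, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, req(3), ierr)
  call MPI_RECV_INIT(recv_l, jmax, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(4), ierr)

  do k = 1, 3
     send_l = dble(myrank)          ! fill the send buffers before each start
     send_r = dble(myrank)
     call MPI_STARTALL(4, req, ierr)
     call MPI_WAITALL(4, req, stats, ierr)
  end do

  do k = 1, 4
     call MPI_REQUEST_FREE(req(k), ierr)
  end do
  call MPI_FINALIZE(ierr)
end program persistent_demo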
PROGRAM ShareNeighbors
IMPLICIT REAL (a-h,o-z)
INCLUDE "mpif.h"
PARAMETER (m = 500, n = 500)
DIMENSION a(m,n), b(m,n)
DIMENSION h(m,n)
INTEGER istatus(MPI_STATUS_SIZE)
INTEGER iprocs, jprocs
PARAMETER (ROOT = 0)
integer dims(2),coords(2)
logical periods(2)
data periods/2*.false./
integer status(MPI_STATUS_SIZE)
integer comm2d,req,source
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
! Get a new communicator for a decomposition of the domain.
! Let MPI find a "good" decomposition
dims(1) = 0
dims(2) = 0
CALL MPI_DIMS_CREATE(nprocs,2,dims,ierr)
if (myrank.EQ.Root) then
print *,nprocs,'processors have been arranged into',dims(1),'X',dims(2),'grid'
endif
CALL MPI_CART_CREATE(MPI_COMM_WORLD,2,dims,periods,.true., &
comm2d,ierr)
! Get my position in this communicator
CALL MPI_COMM_RANK(comm2d,myrank,ierr)
! Get the decomposition
CALL fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
! print *,ista,jsta,iend,jend
ilen = iend - ista + 1
jlen = jend - jsta + 1
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
iprocs = dims(1)
jprocs = dims(2)
myranki = coords(1)
myrankj = coords(2)
DO j = jsta, jend
DO i = ista, iend
a(i,j) = myrank+1
ENDDO
ENDDO
! Send data from each processor to Root
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
! Receive the results from the other processors
if (myrank.EQ.Root) then
do source = 0,nprocs-1
call MPI_RECV(ista,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(iend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jsta,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
ilen = iend - ista + 1
jlen = jend - jsta + 1
call MPI_RECV(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
source,1,MPI_COMM_WORLD,status,ierr)
! print the results
call ZMINMAX(m,n,ista,iend,jsta,jend,a(:,:),amin,amax)
print *, 'myid=',source,amin,amax
call MPI_Wait(req, status, ierr)
enddo
endif
CALL MPI_FINALIZE(ierr)
END
subroutine fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
integer comm2d
integer m,n,ista,jsta,iend,jend
integer dims(2),coords(2),ierr
logical periods(2)
! Get (i,j) position of a processor from Cartesian topology.
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
! Decomposition in first (ie. X) direction
CALL MPE_DECOMP1D(m,dims(1),coords(1),ista,iend)
! Decomposition in second (ie. Y) direction
CALL MPE_DECOMP1D(n,dims(2),coords(2),jsta,jend)
return
end
SUBROUTINE MPE_DECOMP1D(n,numprocs,myid,s,e)
integer n,numprocs,myid,s,e,nlocal,deficit
nlocal = n / numprocs
s = myid * nlocal + 1
deficit = mod(n,numprocs)
s = s + min(myid,deficit)
! Give one more slice to processors
if (myid .lt. deficit) then
nlocal = nlocal + 1
endif
e = s + nlocal - 1
if (e .gt. n .or. myid .eq. numprocs-1) e = n
return
end
SUBROUTINE ZMINMAX(IX,JX,SX,EX,SY,EY,ZX,ZXMIN,ZXMAX)
INTEGER :: IX,JX,SX,EX,SY,EY
REAL :: ZX(IX,JX)
REAL :: ZXMIN,ZXMAX
ZXMIN=1000.
ZXMAX=-1000.
DO II=SX,EX
DO JJ=SY,EY
IF(ZX(II,JJ).LT.ZXMIN)ZXMIN=ZX(II,JJ)
IF(ZX(II,JJ).GT.ZXMAX)ZXMAX=ZX(II,JJ)
ENDDO
ENDDO
RETURN
END
When I run the above code with 4 processors, Root receives garbage values, whereas with 15 processors the data transfer is fine. How can I tackle this?
I guess it is related to buffering, a point which is not clear to me. How should I handle the buffering wisely?
Problem 1
You are doing multiple sends
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
and all of them with the same request variable req. That can't work: each pending operation needs its own request.
Problem 2
You are using an array section a(ista:iend,jsta:jend) in a non-blocking MPI call. That is not allowed*. You need to copy the section into a temporary contiguous buffer, or use an MPI derived subarray datatype (too hard for you at this stage).
The reason for the problem is that the compiler creates a temporary copy just for the call to ISend. ISend remembers the address of that temporary but may not send anything yet. The temporary is then deleted and its address becomes invalid, so when MPI_Wait later tries to use that address, it fails.
Problem 3
Your MPI_Wait is in the wrong place. It must come after the sends, and outside of any if conditions, so that it is always executed (provided you are always sending).
You must collect all the requests separately and then wait for all of them. It is best to keep them in an array and wait for all of them at once using MPI_Waitall.
Remember, ISend typically does not actually send anything when the buffer is large; the exchange often happens during the Wait operation, at least for larger messages.
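Putting those three points together, the send side of the code above might look roughly like the sketch below. The names sendbuf, reqs and stats are illustrative, all five sends keep the original tag, and the receive side has to be adapted to match:

  integer :: reqs(5), stats(MPI_STATUS_SIZE,5)
  real, allocatable :: sendbuf(:,:)

  allocate(sendbuf(ilen,jlen))
  sendbuf = a(ista:iend,jsta:jend)   ! explicit contiguous copy that stays alive until the wait

  call MPI_ISEND(ista, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_ISEND(iend, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(2), ierr)
  call MPI_ISEND(jsta, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(3), ierr)
  call MPI_ISEND(jend, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, reqs(4), ierr)
  call MPI_ISEND(sendbuf, ilen*jlen, MPI_REAL, Root, 1, MPI_COMM_WORLD, reqs(5), ierr)

  ! ... Root's receive loop goes here, receiving into its own local variables ...

  call MPI_WAITALL(5, reqs, stats, ierr)   ! every rank waits, outside any if block
  deallocate(sendbuf)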
Recommendation:
Take a simple example problem and try to exchange just two small arrays with MPI_IRecv and MPI_ISend between two processes, as simple a test case as you can manage (a minimal sketch of such an exchange is given after the footnote below). Learn from it and take small steps. Take no offence, but your current understanding of non-blocking MPI is too weak to write full-scale programs. MPI is hard; non-blocking MPI is even harder.
* Not allowed with the interfaces available up to MPI-2. MPI-3 brings a new interface, available via use mpi_f08, with which it is possible. But learn the basics first.
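Here is a minimal, self-contained version of that exercise: two ranks exchange one small array each with MPI_IRECV/MPI_ISEND and a single MPI_WAITALL. It assumes it is launched with exactly two processes (mpiexec -n 2):

program exchange_two
  use mpi
  implicit none
  integer, parameter :: n = 4
  integer :: myrank, nprocs, other, ierr
  integer :: reqs(2), stats(MPI_STATUS_SIZE,2)
  real :: sendbuf(n), recvbuf(n)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  other   = 1 - myrank              ! with 2 processes: rank 0 talks to rank 1 and vice versa
  sendbuf = real(myrank + 1)

  ! Post the receive and the send, each with its own request.
  call MPI_IRECV(recvbuf, n, MPI_REAL, other, 0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_ISEND(sendbuf, n, MPI_REAL, other, 0, MPI_COMM_WORLD, reqs(2), ierr)

  ! Neither buffer may be reused until both requests have completed.
  call MPI_WAITALL(2, reqs, stats, ierr)

  print *, 'rank', myrank, 'received', recvbuf
  call MPI_FINALIZE(ierr)
end program exchange_two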
I'm trying to run this MPI Fortran code. There are several problems:
1) When I run this code, I expect the program to write 'Enter the number of intervals: (0 quits)' to the screen and then ask me for n. Instead it asks me for n first! Why?
2) If I don't comment out the line 'goto 10', the program keeps asking me for n forever and never shows me anything else.
3) If I comment out 'goto 10', the program asks me for n and then writes the results. The problem is that every time, the program writes only part of the result, not the complete output; it truncates it. Below are the outputs from three consecutive runs of the program:
> mpiexec -n 40 ./a.out
10000000
Enter the number of intervals: (0 quits)
pi is 3.14159265358978 Error is 1.287858708565182E-014
time is 1.687502861022949E-002 seconds
> mpiexec -n 40 ./a.out
10000000
Enter the number of intervals: (0 quits)
pi is 3.14159265358978 Error is 1.287858708565182E-014
time is 1.68750286102
> mpiexec -n 40 ./a.out
10000000
Enter the number of intervals: (0 quits)
pi is 3.14159265358978 Error is 1.287858708565182E-014
time is 1.687502861022949E-002 se
Does anyone have an idea what's going on? I appreciate your help in advance.
program main
use mpi
double precision starttime, endtime
double precision PI25DT
parameter (PI25DT = 3.141592653589793238462643d0)
double precision mypi, pi, h, sum, x, f, a
integer n, myid, numprocs, i, ierr
f(a) = 4.d0 / (1.d0 + a*a) ! function to integrate
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
10 if ( myid .eq. 0 ) then
print *, 'Enter the number of intervals: (0 quits) '
read(*,*) n
endif
starttime = MPI_WTIME()
! broadcast n
call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
! check for quit signal
if ( n .le. 0 ) goto 30
! calculate the interval size
h = 1.0d0/n
sum = 0.0d0
do 20 i = myid+1, n, numprocs
x = h * (dble(i) - 0.5d0)
sum = sum + f(x)
20 continue
mypi = h * sum
! collect all the partial sums
call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0, &
MPI_COMM_WORLD,ierr)
! node 0 prints the answer.
endtime = MPI_WTIME()
if (myid .eq. 0) then
print *, 'pi is ', pi, 'Error is ', abs(pi - PI25DT)
print *, 'time is ', endtime-starttime, ' seconds'
endif
go to 10
30 call MPI_FINALIZE(ierr)
stop
end
This program is designed to loop via the "goto 10" at the end. The only way to break out of this is for n to have a value <= 0, which will activate the "goto 30" and branch past the "goto 10". Additional clues that this is the intent are the comment "check for quit signal" and that the prompt for the input of n includes "(0 quits)". So try inputting 0!
This is not a good example of modern Fortran. Despite clearly using a Fortran 90 or more recent compiler (the "use" statement shows this), it is written in the style of FORTRAN 77 or earlier. Fixed-source layout, with the source lines apparently starting in column 7. Comment characters in the first column (old Fortran required a C in the first column). "Double Precision". Heavy use of gotos for the program logic. (In my opinion, and others may disagree, there is a place for the goto statement, but not for the basic control flow of a program.) Obsolete Fortran (in my opinion).
The modern Fortran way to express the basic flow:
MainLoop: do
.....
if (n .le. 0) exit MainLoop
....
end do MainLoop
You have to explicitly flush your output. I do not remember whether Fortran has a standard flush function; if flush does not work, try flush_.
Basically, what happens is that your process zero buffers its output, and unless you explicitly tell it to write that buffer out, you end up with funny, truncated output.
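For what it's worth, Fortran 2003 and later do provide a standard FLUSH statement. A sketch of how the printing block of the program above might use it (output_unit comes from the intrinsic iso_fortran_env module; only the flush line is new):

   use, intrinsic :: iso_fortran_env, only : output_unit
   ...
   if (myid .eq. 0) then
      print *, 'pi is ', pi, 'Error is ', abs(pi - PI25DT)
      print *, 'time is ', endtime-starttime, ' seconds'
      flush(output_unit)   ! push the buffered lines to the terminal before looping again
   endif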