I am trying to reproduce this C example in Fortran. My code so far:
use mpi
implicit none
integer, parameter :: maxn = 8
integer, allocatable :: xlocal(:,:)
integer :: i, j, lsize, errcnt, toterr, buff
integer :: ierror, nproc, pid, root = 0, nreq = 0
integer, allocatable :: request(:), status(:,:)
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierror)
if (mod(maxn, nproc) /= 0) then
write(*,*) 'Array size (maxn) should be a multiple of the number of processes'
call MPI_ABORT(MPI_COMM_WORLD, 1, ierror)
end if
lsize = maxn/nproc
allocate(xlocal(0:lsize+1, maxn))
allocate(request(nproc))
allocate(status(MPI_STATUS_SIZE,nproc))
xlocal(0,:) = -1
xlocal(1:lsize,:) = pid
xlocal(lsize+1,:) = -1
! send down unless on bottom
if (pid < nproc-1) then
nreq = nreq + 1
call MPI_ISEND(xlocal(lsize,:), maxn, MPI_INTEGER, &
pid+1, 0, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' sent to process ', pid+1, ':'
write(*,*) xlocal(lsize,:)
end if
if (pid > 0) then
nreq = nreq + 1
call MPI_IRECV(xlocal(0,:), maxn, MPI_INTEGER, &
pid-1, 0, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' received from process ', pid-1, ':'
write(*,*) xlocal(0,:)
end if
! send up unless on top
if (pid > 0) then
nreq = nreq + 1
call MPI_ISEND(xlocal(1,:), maxn, MPI_INTEGER, &
pid-1, 1, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' sent to process ', pid-1, ':'
write(*,*) xlocal(1,:)
end if
if (pid < nproc-1) then
nreq = nreq + 1
call MPI_IRECV(xlocal(lsize+1,:), maxn, MPI_INTEGER, &
pid+1, 1, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' received from process ', pid+1, ':'
write(*,*) xlocal(lsize+1,:)
end if
call MPI_WAITALL(nreq, request, status, ierror)
! check results
errcnt = 0
do i = 1, lsize
do j = 1, maxn
if (xlocal(i,j) /= pid) errcnt = errcnt + 1
end do
end do
do j = 1, maxn
if (xlocal(0,j) /= pid-1) errcnt = errcnt + 1
if ((pid < nproc-1) .and. (xlocal(lsize+1,j) /= pid+1)) errcnt = errcnt + 1
end do
call MPI_REDUCE(errcnt, toterr, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD)
if (pid == root) then
if (toterr == 0) then
write(*,*) "no errors found"
else
write(*,*) "found ", toterr, " errors"
end if
end if
deallocate(xlocal)
deallocate(request)
deallocate(status)
call MPI_FINALIZE(ierror)
However, I am running into segmentation faults and cannot figure out why. I have a feeling it is due to the request array. Can someone explain the correct way of using the request array in Fortran? None of the references I found clarify this.
Thanks in advance.
In case you haven't already done so, consider compiling your program with some flags that will help you in debugging, e.g. with gfortran, you can use -O0 -g -fbounds-check (if that does not help, you might add -fsanitize=address for versions >= 4.8). Other compilers have similar options for debugging.
Doing that, and running with 2 processes, your program crashes at the MPI_Reduce line. If you look up the specifications (e.g. OpenMPI 1.8) you can see that this subroutine requires one more argument, i.e., you forgot to add the ierror argument at the end.
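For reference, the corrected call simply appends ierror as the final argument:

call MPI_REDUCE(errcnt, toterr, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, ierror)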
It is a bit tragic that even though the subprograms from the mpi module are accessible through a use association, and thus should be checked for argument consistency to avoid these trivial errors, not all subprograms are necessarily in that module. I don't know which MPI implementation you use, but I checked my local MPICH installation and it does not have most subroutines in the module, so no explicit interfaces exist for them. You are probably in a similar situation, and other implementations may suffer the same fate. You could compare it to the C header file missing the function prototype for MPI_Reduce. The reason is probably that originally there was only a Fortran 77 interface for most implementations.
Some final comments: be careful not to just copy-paste the C code. The array sections you pass are not contiguous, so a temporary copy will be created and passed to the MPI routines, which is very inefficient (not that it really matters in this case).
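One way to avoid those temporary copies, as a sketch (the swapped layout below is my suggestion, not part of your original code), is to make the ghost layers whole columns rather than rows, since columns are contiguous in Fortran:

allocate(xlocal(maxn, 0:lsize+1))
xlocal(:,0)       = -1
xlocal(:,1:lsize) = pid
xlocal(:,lsize+1) = -1
! a column section such as xlocal(:,lsize) is contiguous, so no hidden
! temporary is needed when it is passed to MPI_ISEND/MPI_IRECV
call MPI_ISEND(xlocal(:,lsize), maxn, MPI_INTEGER, &
pid+1, 0, MPI_COMM_WORLD, request(nreq), ierror)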
I have an application where the root rank is sending messages to all ranks in the following way:
tag = 22
if( myrankid == 0 )then
do i = 1, nproc
if(I==1)then
do j = 1, nvert
xyz((j-1)*3+1) = data((j-1)*3+1,1)
xyz((j-1)*3+2) = data((j-1)*3+2,1)
xyz((j-1)*3+3) = data((j-1)*3+3,1)
enddo
else
call mpi_send(data, glb_nvert(i)*3, mpi_real, i-1, tag, comm, ierr)
endif
enddo
else
call mpi_recv(data, glb_nvert(i)*3, mpi_real, 0, tag,comm, stat,ierr)
endif
My problem is that only when running with more than 3000 ranks does this send/receive pair hang at a certain MPI rank (in my specific app it is rank 2009).
Now, I do check that the sizes and arrays are consistent and the only thing I found interesting was the comm. The comm is a communicator which I have duplicated from another MPI communicator.
When I print comm with print*, comm, all ranks print the same integer, except for the root.
E.g.
The root prints:
-1006632941
while the remaining 2999 ranks print:
-1006632951
Is that really what is causing the problem?
I have tried using Intel MPI and Cray MPI.
I have a Fortran array in which each row is calculated by a different process. I then want to gather the full array on all processes. I can get it to work for two variations of mpi_allgather, but not the more elegant one using mpi_type_vector, given here, for example:
program allGather
! test of MPI_allgather with rows
use mpi
implicit none
integer i, mpierr, myRank, noProc
integer, allocatable:: survived(:,:)
integer rowtype
real(4) ran
call MPI_INIT( mpierr )
call MPI_COMM_RANK(MPI_COMM_WORLD, myRank, mpierr) ! get rank of this process in world
call MPI_COMM_SIZE(MPI_COMM_WORLD, noProc, mpiErr)
allocate( survived(noProc,2) ) ! used to find best solution out of all processes
! Each process calculates one row of survived and each row is tagged with its rank for sorting
! For this test example, just use a random number
call RANDOM_SEED()
call RANDOM_NUMBER(ran)
survived(myRank+1,:) = [int(1.e6*ran), myRank]
write(*,'(3(a,i0),a)') 'Rank: ', myRank, ' survived = [', survived(myRank+1,1), ', ', survived(myRank+1,2),']'
call mpi_barrier(MPI_COMM_WORLD, mpiErr)
! specify type to capture rows of survived
call MPI_TYPE_VECTOR(2, 1, noProc, MPI_INTEGER, rowtype, mpierr)
call MPI_TYPE_COMMIT(rowtype, mpierr)
call MPI_allgather(MPI_IN_PLACE, 1, rowtype, survived , 1, rowtype, MPI_COMM_WORLD, mpierr)
if (myRank == 0) then
write(*,'(/,a)') 'Passed array is:'
write(*,'(i0,'', '',i0)') (survived(i,:),i=1,noProc)
end if
deallocate( survived ) ! used to find best solution out of all processes
call MPI_FINALIZE(mpierr)
end program allGather
This generates an output on 4 processes something like:
Rank: 0 survived = [819935, 0]
Rank: 1 survived = [971609, 1]
Rank: 2 survived = [859684, 2]
Rank: 3 survived = [747759, 3]
Passed array is:
819935, 0
0, 1
0, 0
0, 0
Clearly I'm missing something. Can someone point out my error? Thanks!
PROGRAM ShareNeighbors
IMPLICIT REAL (a-h,o-z)
INCLUDE "mpif.h"
PARAMETER (m = 500, n = 500)
DIMENSION a(m,n), b(m,n)
DIMENSION h(m,n)
INTEGER istatus(MPI_STATUS_SIZE)
INTEGER iprocs, jprocs
PARAMETER (ROOT = 0)
integer dims(2),coords(2)
logical periods(2)
data periods/2*.false./
integer status(MPI_STATUS_SIZE)
integer comm2d,req,source
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
! Get a new communicator for a decomposition of the domain.
! Let MPI find a "good" decomposition
dims(1) = 0
dims(2) = 0
CALL MPI_DIMS_CREATE(nprocs,2,dims,ierr)
if (myrank.EQ.Root) then
print *,nprocs,'processors have been arranged into',dims(1),'X',dims(2),'grid'
endif
CALL MPI_CART_CREATE(MPI_COMM_WORLD,2,dims,periods,.true., &
comm2d,ierr)
! Get my position in this communicator
CALL MPI_COMM_RANK(comm2d,myrank,ierr)
! Get the decomposition
CALL fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
! print *,ista,jsta,iend,jend
ilen = iend - ista + 1
jlen = jend - jsta + 1
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
iprocs = dims(1)
jprocs = dims(2)
myranki = coords(1)
myrankj = coords(2)
DO j = jsta, jend
DO i = ista, iend
a(i,j) = myrank+1
ENDDO
ENDDO
! Send data from each processor to Root
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
! Receive the results from the other processors
if (myrank.EQ.Root) then
do source = 0,nprocs-1
call MPI_RECV(ista,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(iend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jsta,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
ilen = iend - ista + 1
jlen = jend - jsta + 1
call MPI_RECV(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
source,1,MPI_COMM_WORLD,status,ierr)
! print the results
call ZMINMAX(m,n,ista,iend,jsta,jend,a(:,:),amin,amax)
print *, 'myid=',source,amin,amax
call MPI_Wait(req, status, ierr)
enddo
endif
CALL MPI_FINALIZE(ierr)
END
subroutine fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
integer comm2d
integer m,n,ista,jsta,iend,jend
integer dims(2),coords(2),ierr
logical periods(2)
! Get (i,j) position of a processor from Cartesian topology.
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
! Decomposition in first (ie. X) direction
CALL MPE_DECOMP1D(m,dims(1),coords(1),ista,iend)
! Decomposition in second (ie. Y) direction
CALL MPE_DECOMP1D(n,dims(2),coords(2),jsta,jend)
return
end
SUBROUTINE MPE_DECOMP1D(n,numprocs,myid,s,e)
integer n,numprocs,myid,s,e,nlocal,deficit
nlocal = n / numprocs
s = myid * nlocal + 1
deficit = mod(n,numprocs)
s = s + min(myid,deficit)
! Give one more slice to processors
if (myid .lt. deficit) then
nlocal = nlocal + 1
endif
e = s + nlocal - 1
if (e .gt. n .or. myid .eq. numprocs-1) e = n
return
end
SUBROUTINE ZMINMAX(IX,JX,SX,EX,SY,EY,ZX,ZXMIN,ZXMAX)
INTEGER :: IX,JX,SX,EX,SY,EY
REAL :: ZX(IX,JX)
REAL :: ZXMIN,ZXMAX
ZXMIN=1000.
ZXMAX=-1000.
DO II=SX,EX
DO JJ=SY,EY
IF(ZX(II,JJ).LT.ZXMIN)ZXMIN=ZX(II,JJ)
IF(ZX(II,JJ).GT.ZXMAX)ZXMAX=ZX(II,JJ)
ENDDO
ENDDO
RETURN
END
When I run the above code with 4 processors, Root receives garbage values, whereas with 15 processors the data transfer is correct. How can I tackle this?
I guess it is related to buffering, a point which is not clear to me. How do I handle the buffering wisely?
Problem 1
You are doing multiple sends:
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
and all of them with the same request variable req. That can't work.
Problem 2
You are using a subarray a(ista:iend,jsta:jend) in non-blocking MPI. That is not allowed*. You need to copy the array into a temporary buffer or use an MPI derived subarray datatype (too hard for you at this stage).
The reason for the problem is that the compiler will create a temporary copy just for the call to ISend. The ISend will remember the address, but will not send anything yet. Then the temporary is deleted and the address becomes invalid. And then the MPI_Wait will try to use that address and will fail.
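A minimal sketch of the temporary-buffer approach (atmp is an illustrative name):

real, allocatable :: atmp(:,:)
allocate(atmp(ista:iend, jsta:jend))
atmp = a(ista:iend, jsta:jend)
call MPI_ISEND(atmp, ilen*jlen, MPI_REAL, Root, 1, MPI_COMM_WORLD, req, ierr)
! atmp must stay allocated until the matching MPI_Wait/MPI_Waitall has completed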
Problem 3
Your MPI_Wait is in the wrong place. It must come after the sends, outside of any if conditions, so that it is always executed (provided you are always sending).
You must collect all requests separately and then wait for all of them. Best to have them in an array and wait for all of them at once using MPI_Waitall, as sketched below.
Remember, ISend typically does not actually send anything if the buffer is large; the exchange often happens during the Wait operation, at least for larger arrays.
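A minimal sketch of collecting the requests (requests, statuses and nreq are illustrative names):

integer :: requests(5), statuses(MPI_STATUS_SIZE,5), nreq, ierr
nreq = 0
nreq = nreq + 1
call MPI_ISEND(ista, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, requests(nreq), ierr)
nreq = nreq + 1
call MPI_ISEND(iend, 1, MPI_INTEGER, Root, 1, MPI_COMM_WORLD, requests(nreq), ierr)
! ... one requests(nreq) entry per ISEND (and per IRECV, if non-blocking receives are used) ...
call MPI_WAITALL(nreq, requests, statuses, ierr)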
Recommendation:
Take a simple example and try to exchange just two small arrays with MPI_IRecv and MPI_ISend between two processes, as simple a test problem as you can manage. Learn from it, take small steps. Take no offence, but your current understanding of non-blocking MPI is too weak to write full-scale programs. MPI is hard, non-blocking MPI is even harder.
* Not allowed when using the interface available in MPI-2. MPI-3 brings a new interface, available via use mpi_f08, where it is possible. But learn the basics first.
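In case it helps, here is one possible version of that exercise, a minimal sketch of two processes exchanging one small array each and finishing with a single MPI_Waitall (run with exactly 2 processes):

program exchange
  use mpi
  implicit none
  integer, parameter :: n = 4
  integer :: rank, nprocs, other, ierr
  integer :: requests(2), statuses(MPI_STATUS_SIZE,2)
  integer :: sendbuf(n), recvbuf(n)
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  other = 1 - rank                ! partner rank when running on 2 processes
  sendbuf = rank
  recvbuf = -1
  call MPI_IRECV(recvbuf, n, MPI_INTEGER, other, 0, MPI_COMM_WORLD, requests(1), ierr)
  call MPI_ISEND(sendbuf, n, MPI_INTEGER, other, 0, MPI_COMM_WORLD, requests(2), ierr)
  call MPI_WAITALL(2, requests, statuses, ierr)
  print *, 'rank', rank, 'received', recvbuf
  call MPI_FINALIZE(ierr)
end program exchange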
I am fairly new to writing/running parallel code. Currently I am experimenting with basic tutorials in writing parallel code to get a feel for the process. My computer is running Ubuntu with MPICH.
I am attempting to run the code entitled "The complete parallel program to sum a vector" on this page: http://condor.cc.ku.edu/~grobe/docs/intro-MPI.shtml
and am encountering the following error upon execution after being prompted for/entering a number:
Fatal error in MPI_Send: Invalid tag, error stack:
MPI_Send(174): MPI_Send(buf=0x7ffeab0f2d3c, count=1, MPI_INT, dest=1, tag=1157242880, MPI_COMM_WORLD) failed
MPI_Send(101): Invalid tag, value is 1157242880
I also receive the warning while compiling:
sumvecp.f90:41:23:
call mpi_send(vector(start_row),num_rows_to_send, mpi_real, an_id, send_data_tag, mpi_comm_world,ierr)
1
Warning: Legacy Extension: REAL array index at (1)
This is my code
program sumvecp
include '/usr/include/mpi/mpif.h'
parameter (max_rows = 10000000)
parameter (send_data_tag = 2001, return_data_tag = 2002)
integer my_id, root_proces, ierr, status(mpi_status_size)
integer num_procs, an_id, num_rows_to_receive
integer avg_rows_per_process, num_rows,num_rows_to_send
real vector(max_rows), vector2(max_rows), partial_sum, sum
root_process = 0
call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world,my_id,ierr)
call mpi_comm_size(mpi_comm_world,num_procs,ierr)
if (my_id .eq. root_process) then
print *, "please enter the number of numbers to sum: "
read *, num_rows
if (num_rows .gt. max_rows) stop "Too many numbers."
avg_rows_per_process = num_rows / num_procs
do ii = 1,num_rows
vector(ii) = float(ii)
end do
do an_id = 1, num_procs -1
start_row = (an_id*avg_rows_per_process) +1
end_row = start_row + avg_rows_per_process -1
if (an_id .eq. (num_procs - 1)) end_row = num_rows
num_rows_to_send = end_row - start_row + 1
call mpi_send(num_rows_to_send, 1, mpi_int, an_id, send_data_tag, mpi_comm_world,ierr)
call mpi_send(vector(start_row),num_rows_to_send, mpi_real, an_id, send_data_tag, mpi_comm_world,ierr)
end do
summ = 0.0
do ii = 1, avg_rows_per_process
summ = summ + vector(ii)
end do
print *,"sum", summ, "calculated by the root process."
do an_id =1, num_procs -1
call mpi_recv(partial_sum, 1, mpi_real, mpi_any_source, mpi_any_tag, mpi_comm_world, status, ierr)
sender = status(mpi_source)
print *, "partial sum", partial_sum, "returned from process", sender
summ = summ + partial_sum
end do
print *, "The grand total is: ", sum
else
call mpi_recv(num_rows_to_receive, 1, mpi_int, root_process, mpi_any_tag, mpi_comm_world,status,ierr)
call mpi_recv(vector2,num_rows_to_received, mpi_real,root_process,mpi_any_tag,mpi_comm_world,status,ierr)
num_rows_received = num_rows_to_receive
partial_sum = 0.0
do ii=1,num_rows_received
partial_sum = partial_sum + vector2(ii)
end do
call mpi_send(partial_sum,1,mpi_real,root_process,return_data_tag,mpi_comm_world,ierr)
endif
call mpi_finalize(ierr)
stop
end
You are missing IMPLICIT NONE and you have a large number of undeclared variables.
The reported error is because
send_data_tag = 2001, return_data_tag = 2002
are implicitly real variables and not integers. But you probably have many more problems.
First add IMPLICIT NONE and declare all variables. Also, I highly recommend using use mpi instead of include '/usr/include/mpi/mpif.h'; it is likely to help you find more problems.
Now I see the code is copied from some website. I wouldn't trust this website, because the codes are clearly wrong.
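For illustration, a sketch of how the start of the program could look with those changes (only the declarations are shown; the remaining variables, such as ii, start_row, end_row, sender and summ, would need declarations as well, and implicit none will then also flag leftover typos like num_rows_to_received):

program sumvecp
  use mpi
  implicit none
  integer, parameter :: max_rows = 10000000
  integer, parameter :: send_data_tag = 2001, return_data_tag = 2002
  integer :: my_id, root_process, ierr, status(mpi_status_size)
  integer :: num_procs, an_id, num_rows_to_receive
  integer :: avg_rows_per_process, num_rows, num_rows_to_send
  real :: vector(max_rows), vector2(max_rows), partial_sum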
I have an MPI-parallelized code that loops through n persons; for each one it calls some subroutines to do some calculations, and at the end of the loop body it calls a post-processing subroutine.
In the post-processing subroutine, I write the output I want in the following way:
person_number var1 var2
Let's say that every person belongs to a different rank. The problem is that while I am writing the file for person1, the process of rank 3 that holds the person3 variables may be executing the post-processing subroutine at the same time, so it overwrites my data for person1.
What I want is to find a way to pause the other processes before they call the post-processing subroutine, and then, once this subroutine is no longer being used by the previous rank, to run it for the next rank, and so on.
This is a sketch of the code:
call MPI_Init(ierr)
do i = 1, npersons
call subroutine1(arg1,arg2,arg3)
! call it only if post_process not executed by other process
! otherwise wait until it ends and then call it
call post_process(i, var1, var2)
enddo
call MPI_Finalize(ierr)
subroutine post_process(i, var1, var2)
integer:: i
real*8:: var1, var2
write(111,*) i, var1, var2
end subroutine post_process
Reading your comment: "Also, I am wondering if for example process 3 is faster than process 2, if i can use the same way but as soon rank 1 finishes with the routine to notify rank 3 to run the routine and then rank 3 to notify rank 2. Is there any automatic way of this? to know which rank waits before the post-processing step longer?"
This can be addressed by letting all the I/O be performed by the process with irank==0 and using buffered sends.
In this case you don't want to make the processes wait, so no barriers here; you want to let them dispatch their result as soon as it's ready and then continue calculating. When it's time for process 0, it will receive all the buffered data, write them, and then write its own data. You can try to use a standard MPI_SEND (it is buffered up to a certain size), but the best way is to use MPI_BSEND and attach a correctly sized buffer with MPI_BUFFER_ATTACH(). Something like this:
subroutine post_process(i, var1, var2, irank)
  integer:: i, irank
  real*8:: var1, var2
  integer:: ir, ierr, ipos
  real*8:: var1r, var2r
  character:: buffer(100)
  logical:: flag

  if (irank .gt. 0) then
    ipos = 0
    call MPI_PACK(i, 1, MPI_INTEGER, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
    call MPI_PACK(var1, 1, MPI_REAL8, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
    call MPI_PACK(var2, 1, MPI_REAL8, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
    call MPI_BSEND(buffer, ipos, MPI_PACKED, 0, 0, MPI_COMM_WORLD, ierr)
  else
    ! process 0: drain whatever has already arrived, then write its own data
    do
      call MPI_IPROBE(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, flag, MPI_STATUS_IGNORE, ierr)
      if (.not. flag) exit
      call MPI_RECV(buffer, 100, MPI_PACKED, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      ipos = 0
      call MPI_UNPACK(buffer, 100, ipos, ir, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
      call MPI_UNPACK(buffer, 100, ipos, var1r, 1, MPI_REAL8, MPI_COMM_WORLD, ierr)
      call MPI_UNPACK(buffer, 100, ipos, var2r, 1, MPI_REAL8, MPI_COMM_WORLD, ierr)
      write(111,*) ir, var1r, var2r
    enddo
    write(111,*) i, var1, var2
  end if
end subroutine post_process
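The MPI_BUFFER_ATTACH part mentioned above could look something like this, done once during setup (a sketch; bsend_buf and bufsize are illustrative names, and the buffer must be large enough for all outstanding bsends plus MPI_BSEND_OVERHEAD per message):

integer, parameter :: bufsize = 10000
character(len=1) :: bsend_buf(bufsize)
integer :: detached_size, ierr

call MPI_BUFFER_ATTACH(bsend_buf, bufsize, ierr)
! ... the calls to post_process, whose MPI_BSENDs copy into this buffer ...
call MPI_BUFFER_DETACH(bsend_buf, detached_size, ierr)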
I'd perform this task by serializing with barriers. Assuming you have got irank from MPI_COMM_RANK() and nprocs from MPI_COMM_SIZE():
call MPI_Init(ierr)
do i = 1, npersons
  call subroutine1(arg1,arg2,arg3)
  do ir = 0, nprocs-1
    if (ir .eq. irank) then
      ! call it only if post_process not executed by other process
      ! otherwise wait until it ends and then call it
      call post_process(i, var1, var2)
    endif
    call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  enddo
enddo
All the processes wait at the MPI_BARRIER() until the irank-th one completes and reaches the barrier too.
I have to say that since all the processes write to a shared filesystem in post_process, this is not guaranteed to work: the synchronization imposed at the MPI level is usually very fast (isn't MPI optimized for this?) and can be faster than the synchronization provided by a shared filesystem (be it NFS, GPFS, ...), especially on large clusters. Furthermore, doing it with a plain Fortran write to a shared file means you can quite easily run into random file corruption, because of caching and timing differences on the different hosts.
The typical way to approach this is to let only the process with irank==0 write to the file, with all the other processes sending it the data to be written. Better yet, use MPI-2 I/O.
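For reference, a minimal MPI-I/O sketch of that last idea (this is an assumption on my part, not your current format: fixed-size binary records of three doubles per person in a hypothetical results.dat, each written by the rank that owns person i):

integer :: fh, ierr
integer(kind=MPI_OFFSET_KIND) :: offset
real*8 :: record(3)

call MPI_FILE_OPEN(MPI_COMM_WORLD, 'results.dat', &
IOR(MPI_MODE_WRONLY, MPI_MODE_CREATE), MPI_INFO_NULL, fh, ierr)
record = (/ dble(i), var1, var2 /)
offset = int(i-1, MPI_OFFSET_KIND) * 3 * 8   ! three 8-byte values per record
call MPI_FILE_WRITE_AT(fh, offset, record, 3, MPI_REAL8, MPI_STATUS_IGNORE, ierr)
call MPI_FILE_CLOSE(fh, ierr)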
The first thing is to initialize the MPI environment properly, by adding the following lines:
! Initialization of MPI
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numproc,ierr)
The subroutine MPI_COMM_RANK will return rank, an identifier for each process (i.e. each person in your example). You can use this variable to define the order in which the processes execute the program. Also, since code in an MPI program is executed by all processes unless you tell them otherwise, you don't need a do loop to call your first subroutine.
You can use a call to MPI_RECV to block the execution of the program for each process until it receives a message. The trick is to work with the variable rank, which indicates the number of each process (in your example it seems to be numbers from 1 to n; be careful, MPI rank numbering starts at 0). Tell your processes to pause and wait for a message, except the first process, which is allowed to execute the post-processing subroutine. Once process 1 is done with writing, tell it to send a message to process 2. As soon as process 2 receives the message, it will start executing the subroutine (which is now safe to do, since 1 is done) and send a message to process 3, and so on.
You can try to implement something like this:
integer:: tag
character(1):: mess
integer:: stat(MPI_STATUS_SIZE), ierr

call subroutine1(arg1,arg2,arg3)

tag = 22   ! or any integer you like
mess = 'a' ! The content here doesn't matter

if (rank .gt. 1) call MPI_RECV(mess,1,MPI_CHARACTER,rank-1,tag,MPI_COMM_WORLD,stat,ierr)
do k = 1, npersons
  if (rank .eq. k) then
    call post_process(var1, var2)
    if (rank .lt. npersons) then
      call MPI_SEND(mess,1,MPI_CHARACTER,rank+1,tag,MPI_COMM_WORLD,ierr)
    end if
  end if
end do