Wait until a subroutine is not used by other process - fortran

I have an MPI-parallelized code where it loops through n persons, and for each one it calls some subroutines to do some calculations and after all inside the loop calls a post-processing subroutine.
In the post-processing subroutine, I write the output I want in the following way:
person_number var1 var2
Let's say that every person belongs to a different rank. The problem is that when I write the file for person1, then maybe process of rank3 that includes person3 variables is executing the post-processing subroutine, so it overwrites my data of person1.
What I want is to find a way, to pause other processes before calling the post-processing subroutine, and then once this subroutine is not used by the previous rank, to run it for the next rank and so on.
This is a sketch of the code:
call MPI_Init(ierr)
do i = 1, npersons
call subroutine1(arg1,arg2,arg3)
! call it only if post_process not executed by other process
! otherwise wait until it ends and then call it
call post_process(i, var1, var2)
enddo
call MPI_Finalize(ierr)
subroutine post_process(i, var1, var2)
integer:: i
real*8:: var1, var2
write(111,*) i, var1, var2
end subroutine post_process

reading your comment: " Also, I am wondering if for example process 3 is faster than process 2, if i can use the same way but as soon rank 1 finishes with the routine to notify rank 3 to run the routine and then rank 3 to notify rank 2. Is there any automatic way of this? to know which rank waits before the post-processing step longer?"
This can be addressed exactly by letting all the I/O be performed on process with irank==0 and using buffered sends.
In this case you don't want to let the processes wait, no barriers here, but you want to let them dispatch their result as soon as it's ready, and then continue calculating. When it's time for process 0, it will receive all the buffered data and write them, then it write its own data. You can try to use standard MPI_SEND (it's buffered up to a prefixed size), but the best way is to use MPI_BSEND and attach a correctly sized buffer with MPI_BUFFER_ATTACH(). Something like this:
subroutine post_process(i, var1, var2, irank)
integer:: i, irank
real*8:: var1, var2
integer:: ir
real*8:: var1r, var2r
character buffer(100)
integer ipos
boolean flag
if (irank .gt. 0) then
ipos = 0
call MPI_PACK(i, 1, MPI_INTEGER, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
call MPI_PACK(var1, 1, MPI_REAL8, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
call MPI_PACK(var2, 1, MPI_REAL8, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
call MPI_BSend( buffer, ipos, MPI_PACKED, 0, 0, MPI_COMM_WORLD, ierr)
else
do
call MPI_IPROBE(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, flag, MPI_STATUS_IGNORE, ierr)
if (flag .eq. false) exit
call MPI_RECV(buffer, 100, MPI_PACKED, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
ipos = 0
call MPI_UNPACK(buffer, 100, ipos, ir, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
call MPI_UNPACK(buffer, 100, ipos, var1r, 1, MPI_REAL8, MPI_COMM_WORLD, ierr)
call MPI_UNPACK(buffer, 100, ipos, var2r, 1, MPI_REAL8, MPI_COMM_WORLD, ierr)
write(111,*) ir, var1r, var2r
enddo
write(111,*) i, var1, var2
end if
end subroutine post_process

I'd perform this task serializing with barriers. Assuming you have got irank the result from MPI_COMM_RANK() and nprocs from MPI_COMM_SIZE():
call MPI_Init(ierr)
do i = 1, npersons
call subroutine1(arg1,arg2,arg3)
do ir = 0, nprocs-1
if (ir .eq. irank) then
! call it only if post_process not executed by other process
! otherwise wait until it ends and then call it
call post_process(i, var1, var2)
endif
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
enddo
enddo
All the processes wait at the MPI_BARRIER(), until the irank-th completes, and reach the barrier too.
I have to say that since all the processes write on a shared filesystem in post_process this is not guaranteed to work: the synchronization imposed at MPI level is usually very fast (isn't MPI optimized for this?), and can be faster than the synchronization present in a shared filesystem (being it NFS, GPFS,...), especially on large clusters. Furthermore performing it with a plain fortran write to a shared file... quite sure you can randomly incur in file corruptions, because of caching and timings on the different hosts.
The typical way to approach it is to let only processor with irank==0 write to the file, all the others send data to be written to it. Better, using MPI2 I/O.

The first thing is to initialize the MPI environment properly, by adding the following lines:
! Initialization of MPI
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numproc,ierr)
The function MPI_COMM_RANK will return a variable rank, which is an identifier for each process (i.e. each person of your example). You can use this variable for defining the order in which the processes execute the program. Also, since code in a MPI program is executed by all processes unless you tell them otherwise, you don't need a do loop to call your first subroutine.
You can use a call to MPI_RECV to block the execution of the program for each process until they receive a message. The trick is to work with the variable rank which indicates the number of each process (in your example, it seems to be numbers from 1 to n - be careful, it is likely that the ordering of ranks starts at 0). Tell your processes to pause and wait for a message, except the first process, which is allowed to execute the post-processing subroutine. Once process 1 is done with writing, tell it to send a message to process 2. As soon as process 2 receives the message, it will start executing the subroutine (which is now safe to do, since 1 is done) and send a message to process 3, and so on.
You can try to implement something like this:
integer:: tag
character(1):: mess
call subroutine1(arg1,arg2,arg3)
tag=22 ! or any integer you like
mess='a' ! The content here doesn't matter
if(rank .gt. 1) call MPI_RECV(mess,1,MPI_CHARACTER,rank-1,tag,MPI_COMM_WORLD,stat,ierr)
do k = 1,npersons
if (rank .eq. k) then
call post_process(var1, var2)
if(rank .lt. npersons) then
call MPI_SEND(mess,1,MPI_CHARACTER,rank+1,tag,MPI_COMM_WORLD,ierr)
end if
end if
end do

Related

What's wrong with this example using MPI-IO set_view and write_all?

When I try to construct a simple array and write to a output file using MPIIO, the output file is very large and not correct. The simple array 'buff' is 2 * 3 and every element equals to 1. I first create the subarray filetype which is 1 * 1, and then create a new output file 'out.dat'. I use set_view function and write_all to write this buff array. I compile this and run it use 6 cores so that one core write one element. I expect that I can get a 'out.dat' and a 2*3 array with value 1.
I successfully compile and run this code, and the results are all 0 and the size is very large.
program test
use mpi
implicit none
integer::rank,nproc,ierr,buffsize,status(MPI_STATUS_SIZE),intsize,i,j,filetype,cart_comm
integer::fh
integer(kind=mpi_offset_kind):: offset=0
integer::buff(2,3)
character:: filename*50
integer::sizes(2)
integer::gsize(2)
integer::start(2)
integer::subsize(2)
integer::coords(2)
integer:: nprocs_cart(2)=(/2,3/)
logical::periods(2)
intsize=sizeof(rank)
call MPI_init(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
CALL MPI_Dims_create(nproc, 2, nprocs_cart, ierr)
CALL MPI_Cart_create(MPI_COMM_WORLD, 2, nprocs_cart, periods, .TRUE., &
cart_comm, ierr)
CALL MPI_Comm_rank(cart_comm, rank, ierr)
CALL MPI_Cart_coords(cart_comm, rank, 2, coords, ierr)
start=coords
gsize=(/2,3/)
subsize=(/1,1/)
buff=1
call MPI_TYPE_CREATE_SUBARRAY(2,gsize,subsize,start,MPI_ORDER_FORTRAN,&
MPI_int,filetype,ierr)
call MPI_TYPE_COMMIT(filetype,ierr)
call MPI_File_open(MPI_COMM_WORLD,'out.dat',&
MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh,ierr)
call MPI_File_set_view(fh,0,MPI_int,filetype,&
"native",MPI_INFO_NULL,ierr)
CALL MPI_FILE_WRITE_all(fh, buff,6, filetype, MPI_STATUS_ignore, ierr)
call MPI_File_close(fh,ierr)
call MPI_FINALIZE(ierr)
end program test

Does MPI_finalize release memory?

In the following code, I have an array b which is being used in MPI. As far as I understand, each processor gets a copy of b even before the call to MPI_INIT. But what happens after we call MPI_FINALIZE? Is that piece of memory still available to each processor?
In a similar manner, what would happen if b is instead declared as a pointer, it is allocated inside MPI_INIT-MPI_FINALIZE but it is not deallocated? Is that memory still available after finalizing MPI?
program main
use mpi
implicit none
integer myr, numpr, ier
integer b(1000)
call MPI_INIT(ier)
call MPI_COMM_RANK(MPI_COMM_WORLD, myr, ier)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numpr, ier)
if (myr .eq. 0) then
!initialize b array
endif
call MPI_BCAST(b, 100, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
call MPI_FINALIZE(ier)
!do more calculations with b
end
If you imagine the code that you've written without any of the MPI stuff, you'll see that each processor starts with a B array of size 1000, because you declare it as such:
integer b(1000)
Neither MPI_Init nor MPI_Finalise are involved in allocating or deallocating any of this memory.
Likewise, you can allocate an array at run time (C) and it will stick around until you explicitly deallocate it:
PROGRAM main
use mpi
implicit none
integer myr, numpr, ier
integer b(1000)
INTEGER, ALLOCATABLE :: C(:)
call MPI_INIT(ier)
call MPI_COMM_RANK(MPI_COMM_WORLD, myr, ier)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numpr, ier)
ALLOCATE(C(1000))
if (myr .eq. 0) then
b = 100 ! Set all values to 100
c = 99 ! Ditto 99
endif
call MPI_BCAST(b, 1000, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
call MPI_BCAST(c, 1000, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
call MPI_FINALIZE(ier)
PRINT *, myr, B(200)
PRINT *, myr, C(200)
DEALLOCATE(C)
END PROGRAM main
produces output:
1 100
1 99
0 100
0 99
Also, note that you had a typo (I think) in your initial code. You only send the first 100 members of B (which has size 1000).

Multiple simuntaniously overlapping IBcast

Since the collective communications of MPI don't have a flag parameter I was wondering if this can cause problems:
I have 2 MPI_communicators, that are not equal, but overlapping and I want to execute MPI_IBcast on both of them simuntaniously:
program main
use mpi
implicit none
integer :: comm1, comm2, ierr, me, req(2)
integer :: color1(4) = [1,1,1,0]
integer :: color2(4) = [0,1,1,1]
integer :: a, b, new_me
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
a = me - 1
b = me + 1
call MPI_Comm_split(MPI_COMM_WORLD, color1(me+1), me, comm1, ierr)
call MPI_Comm_split(MPI_COMM_WORLD, color2(me+1), me, comm2, ierr)
CALL MPI_IBcast(a,1, MPI_INTEGER, 0, comm1, req(1), ierr)
CALL MPI_IBcast(b,1, MPI_INTEGER, 0, comm2, req(2), ierr)
call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
call MPI_Finalize(ierr)
write (*,*) me, a, b
end program main
It worked fine on my local machine, but my question is: Is this guaranteed to work or do I have to use MPI_Bcast, rather than MPI_Ibcast?
The communications use different communicators. As such it is fine. It doesn't matter that they both contain the same set of processes, they are different communicators and that is all that matters.

What is the difference between using MPI_Type_contiguous and MPI_Type_create_struct for sending/receiving declared data structure with MPI

In this thread it has been explained in two ways how to pass messages using MPI with declared data types. I have a data structure with allocatables
type t_example
real, allocatable :: x(:), y(:), z(:)
end type
For maintability of the code would the easiest thing not be to use MPI_TYPE_CONTIGUOUS as follows
! -- declare
type(t_example) :: p1
type(MPI_DATATYPE) :: mpi_dtexample
(...)
call MPI_TYPE_CONTIGUOUS(sizeof(p1), MPI_BYTE, mpi_dtexample, ierr);
call MPI_TYPE_COMMIT(mpi_dtexample, ierr)
Following this I can simply use the send/recv with mpi_dtexample as being the data type.
I cannot come to my mind when it becomes more sensible to use the mpi_type_create_struct, as this would require you to explicitly tell the sequence of the declared type, with the data type and their corresponding sizes.
YES, the MPI_TYPE_CONTIGUOUS approach assumes that the declared type is contiguous and I would not be able to use this approach if I wanted to pass certain strided elements of the declared type.
Is there else anything I should raise my alarm bells on when using MPI_TYPE_CONTIGUOUS
A FULL EXAMPLE
Running with 2 ranks only.
module md_
use mpi_f08
integer numtasks, rank, tag, i, ierr
type(mpi_status) stat
type T_PART
real, allocatable :: x(:), y(:), z(:)
end type
contains
end module
program struct
use md_
implicit none
type(t_part) :: test_
type(mpi_datatype) :: mpidt
integer :: sz, szz
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
tag = 1
szz= 10
allocate(test_% x(szz),test_% y(szz), test_% z(szz) )
sz = sizeof(test_)
call MPI_Type_contiguous(sz, MPI_BYTE ,mpidt, ierr)
call MPI_TYPE_COMMIT(mpidt, ierr)
if (rank .eq. 0) then
do i=1,szz
test_%x(i) = i*1+mod(i,3)
test_%y(i) = i*2+mod(i,3)
test_%z(i) = i*3+mod(i,3)
end do
call MPI_SEND(test_, 1, mpidt, 1, tag, &
MPI_COMM_WORLD, ierr)
else
call MPI_RECV(test_, 1, mpidt, 0, tag, &
MPI_COMM_WORLD, stat, ierr)
endif
print *, 'rank= ',rank,' test_% x= ', test_%z(1) ! seg faults for rank 2
call mpi_barrier(MPI_COMM_WORLD, ierr)
! free datatype when done using it
call MPI_TYPE_FREE(mpidt, ierr)
call MPI_FINALIZE(ierr)
end

request array in MPI non-blocking send/recv

I am trying to reproduce this C example in Fortran. my code so far:
use mpi
implicit none
integer, parameter :: maxn = 8
integer, allocatable :: xlocal(:,:)
integer :: i, j, lsize, errcnt, toterr, buff
integer :: ierror, nproc, pid, root = 0, nreq = 0
integer, allocatable :: request(:), status(:,:)
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierror)
if (mod(maxn, nproc) /= 0) then
write(*,*) 'Array size (maxn) should be a multiple of the number of processes'
call MPI_ABORT(MPI_COMM_WORLD, 1, ierror)
end if
lsize = maxn/nproc
allocate(xlocal(0:lsize+1, maxn))
allocate(request(nproc))
allocate(status(MPI_STATUS_SIZE,nproc))
xlocal(0,:) = -1
xlocal(1:lsize,:) = pid
xlocal(lsize+1,:) = -1
! send down unless on bottom
if (pid < nproc-1) then
nreq = nreq + 1
call MPI_ISEND(xlocal(lsize,:), maxn, MPI_INTEGER, &
pid+1, 0, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' sent to process ', pid+1, ':'
write(*,*) xlocal(lsize,:)
end if
if (pid > 0) then
nreq = nreq + 1
call MPI_IRECV(xlocal(0,:), maxn, MPI_INTEGER, &
pid-1, 0, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' received from process ', pid-1, ':'
write(*,*) xlocal(0,:)
end if
! send up unless on top
if (pid > 0) then
nreq = nreq + 1
call MPI_ISEND(xlocal(1,:), maxn, MPI_INTEGER, &
pid-1, 1, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' sent to process ', pid-1, ':'
write(*,*) xlocal(1,:)
end if
if (pid < nproc-1) then
nreq = nreq + 1
call MPI_IRECV(xlocal(lsize+1,:), maxn, MPI_INTEGER, &
pid+1, 1, MPI_COMM_WORLD, request(nreq), ierror)
write(*,'(2(A,I1),A)') 'process ', pid, ' received from process ', pid+1, ':'
write(*,*) xlocal(lsize+1,:)
end if
call MPI_WAITALL(nreq, request, status, ierror)
! check results
errcnt = 0
do i = 1, lsize
do j = 1, maxn
if (xlocal(i,j) /= pid) errcnt = errcnt + 1
end do
end do
do j = 1, maxn
if (xlocal(0,j) /= pid-1) errcnt = errcnt + 1
if ((pid < nproc-1) .and. (xlocal(lsize+1,j) /= pid+1)) errcnt = errcnt + 1
end do
call MPI_REDUCE(errcnt, toterr, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD)
if (pid == root) then
if (toterr == 0) then
write(*,*) "no errors found"
else
write(*,*) "found ", toterr, " errors"
end if
end if
deallocate(xlocal)
deallocate(request)
deallocate(status)
call MPI_FINALIZE(ierror)
but i am running into segmentation faults and can not figure out why. I have a feeling it is due to the request array. can someone explain the correct way of using the request array in Fortran? none of the references I found clarify this.
thx in advance
In case you haven't already done so, consider compiling your program with some flags that will help you in debugging, e.g. with gfortran, you can use -O0 -g -fbounds-check (if that does not help, you might add -fsanitize=address for versions >= 4.8). Other compilers have similar options for debugging.
Doing that, and running with 2 processes, you program crashes at the MPI_Reduce line. If you look up the specifications (e.g. OpenMPI 1.8) you can see that this subroutine requires one more argument, i.e., you forgot to add the ierror argument at the end.
It is a bit tragic that even though the subprograms from the mpi module are accessible through a use association, and thus should be checked for argument consistency to avoid these trivial errors, not all subprograms are necessarily in that module. I don't know which MPI implementation you use, but I checked my local MPICH installation and it does not have most subroutines in the module, so no explicit interfaces exist for them. I guess you are in a similar situation, but I guess other implementations might suffer a similar fate. You could compare it to the C header file missing the function prototype for MPI_Reduce. I guess the reason for this is that originally there was only a Fortran 77 interface for most implementations.
Some final comments: be careful not to just copy-paste the C code. The arrays you pass are not contiguous and will result in a temporary copy to be passed to the MPI routines, which is very inefficient (not that it really matters in this case).