Multiple simuntaniously overlapping IBcast - fortran

Since the collective communications of MPI don't have a flag parameter I was wondering if this can cause problems:
I have 2 MPI_communicators, that are not equal, but overlapping and I want to execute MPI_IBcast on both of them simuntaniously:
program main
use mpi
implicit none
integer :: comm1, comm2, ierr, me, req(2)
integer :: color1(4) = [1,1,1,0]
integer :: color2(4) = [0,1,1,1]
integer :: a, b, new_me
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
a = me - 1
b = me + 1
call MPI_Comm_split(MPI_COMM_WORLD, color1(me+1), me, comm1, ierr)
call MPI_Comm_split(MPI_COMM_WORLD, color2(me+1), me, comm2, ierr)
CALL MPI_IBcast(a,1, MPI_INTEGER, 0, comm1, req(1), ierr)
CALL MPI_IBcast(b,1, MPI_INTEGER, 0, comm2, req(2), ierr)
call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
call MPI_Finalize(ierr)
write (*,*) me, a, b
end program main
It worked fine on my local machine, but my question is: Is this guaranteed to work or do I have to use MPI_Bcast, rather than MPI_Ibcast?

The communications use different communicators. As such it is fine. It doesn't matter that they both contain the same set of processes, they are different communicators and that is all that matters.

Related

What's wrong with this example using MPI-IO set_view and write_all?

When I try to construct a simple array and write to a output file using MPIIO, the output file is very large and not correct. The simple array 'buff' is 2 * 3 and every element equals to 1. I first create the subarray filetype which is 1 * 1, and then create a new output file 'out.dat'. I use set_view function and write_all to write this buff array. I compile this and run it use 6 cores so that one core write one element. I expect that I can get a 'out.dat' and a 2*3 array with value 1.
I successfully compile and run this code, and the results are all 0 and the size is very large.
program test
use mpi
implicit none
integer::rank,nproc,ierr,buffsize,status(MPI_STATUS_SIZE),intsize,i,j,filetype,cart_comm
integer::fh
integer(kind=mpi_offset_kind):: offset=0
integer::buff(2,3)
character:: filename*50
integer::sizes(2)
integer::gsize(2)
integer::start(2)
integer::subsize(2)
integer::coords(2)
integer:: nprocs_cart(2)=(/2,3/)
logical::periods(2)
intsize=sizeof(rank)
call MPI_init(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
CALL MPI_Dims_create(nproc, 2, nprocs_cart, ierr)
CALL MPI_Cart_create(MPI_COMM_WORLD, 2, nprocs_cart, periods, .TRUE., &
cart_comm, ierr)
CALL MPI_Comm_rank(cart_comm, rank, ierr)
CALL MPI_Cart_coords(cart_comm, rank, 2, coords, ierr)
start=coords
gsize=(/2,3/)
subsize=(/1,1/)
buff=1
call MPI_TYPE_CREATE_SUBARRAY(2,gsize,subsize,start,MPI_ORDER_FORTRAN,&
MPI_int,filetype,ierr)
call MPI_TYPE_COMMIT(filetype,ierr)
call MPI_File_open(MPI_COMM_WORLD,'out.dat',&
MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh,ierr)
call MPI_File_set_view(fh,0,MPI_int,filetype,&
"native",MPI_INFO_NULL,ierr)
CALL MPI_FILE_WRITE_all(fh, buff,6, filetype, MPI_STATUS_ignore, ierr)
call MPI_File_close(fh,ierr)
call MPI_FINALIZE(ierr)
end program test

Does MPI_finalize release memory?

In the following code, I have an array b which is being used in MPI. As far as I understand, each processor gets a copy of b even before the call to MPI_INIT. But what happens after we call MPI_FINALIZE? Is that piece of memory still available to each processor?
In a similar manner, what would happen if b is instead declared as a pointer, it is allocated inside MPI_INIT-MPI_FINALIZE but it is not deallocated? Is that memory still available after finalizing MPI?
program main
use mpi
implicit none
integer myr, numpr, ier
integer b(1000)
call MPI_INIT(ier)
call MPI_COMM_RANK(MPI_COMM_WORLD, myr, ier)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numpr, ier)
if (myr .eq. 0) then
!initialize b array
endif
call MPI_BCAST(b, 100, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
call MPI_FINALIZE(ier)
!do more calculations with b
end
If you imagine the code that you've written without any of the MPI stuff, you'll see that each processor starts with a B array of size 1000, because you declare it as such:
integer b(1000)
Neither MPI_Init nor MPI_Finalise are involved in allocating or deallocating any of this memory.
Likewise, you can allocate an array at run time (C) and it will stick around until you explicitly deallocate it:
PROGRAM main
use mpi
implicit none
integer myr, numpr, ier
integer b(1000)
INTEGER, ALLOCATABLE :: C(:)
call MPI_INIT(ier)
call MPI_COMM_RANK(MPI_COMM_WORLD, myr, ier)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numpr, ier)
ALLOCATE(C(1000))
if (myr .eq. 0) then
b = 100 ! Set all values to 100
c = 99 ! Ditto 99
endif
call MPI_BCAST(b, 1000, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
call MPI_BCAST(c, 1000, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
call MPI_FINALIZE(ier)
PRINT *, myr, B(200)
PRINT *, myr, C(200)
DEALLOCATE(C)
END PROGRAM main
produces output:
1 100
1 99
0 100
0 99
Also, note that you had a typo (I think) in your initial code. You only send the first 100 members of B (which has size 1000).

What is the difference between using MPI_Type_contiguous and MPI_Type_create_struct for sending/receiving declared data structure with MPI

In this thread it has been explained in two ways how to pass messages using MPI with declared data types. I have a data structure with allocatables
type t_example
real, allocatable :: x(:), y(:), z(:)
end type
For maintability of the code would the easiest thing not be to use MPI_TYPE_CONTIGUOUS as follows
! -- declare
type(t_example) :: p1
type(MPI_DATATYPE) :: mpi_dtexample
(...)
call MPI_TYPE_CONTIGUOUS(sizeof(p1), MPI_BYTE, mpi_dtexample, ierr);
call MPI_TYPE_COMMIT(mpi_dtexample, ierr)
Following this I can simply use the send/recv with mpi_dtexample as being the data type.
I cannot come to my mind when it becomes more sensible to use the mpi_type_create_struct, as this would require you to explicitly tell the sequence of the declared type, with the data type and their corresponding sizes.
YES, the MPI_TYPE_CONTIGUOUS approach assumes that the declared type is contiguous and I would not be able to use this approach if I wanted to pass certain strided elements of the declared type.
Is there else anything I should raise my alarm bells on when using MPI_TYPE_CONTIGUOUS
A FULL EXAMPLE
Running with 2 ranks only.
module md_
use mpi_f08
integer numtasks, rank, tag, i, ierr
type(mpi_status) stat
type T_PART
real, allocatable :: x(:), y(:), z(:)
end type
contains
end module
program struct
use md_
implicit none
type(t_part) :: test_
type(mpi_datatype) :: mpidt
integer :: sz, szz
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
tag = 1
szz= 10
allocate(test_% x(szz),test_% y(szz), test_% z(szz) )
sz = sizeof(test_)
call MPI_Type_contiguous(sz, MPI_BYTE ,mpidt, ierr)
call MPI_TYPE_COMMIT(mpidt, ierr)
if (rank .eq. 0) then
do i=1,szz
test_%x(i) = i*1+mod(i,3)
test_%y(i) = i*2+mod(i,3)
test_%z(i) = i*3+mod(i,3)
end do
call MPI_SEND(test_, 1, mpidt, 1, tag, &
MPI_COMM_WORLD, ierr)
else
call MPI_RECV(test_, 1, mpidt, 0, tag, &
MPI_COMM_WORLD, stat, ierr)
endif
print *, 'rank= ',rank,' test_% x= ', test_%z(1) ! seg faults for rank 2
call mpi_barrier(MPI_COMM_WORLD, ierr)
! free datatype when done using it
call MPI_TYPE_FREE(mpidt, ierr)
call MPI_FINALIZE(ierr)
end

Basic RMA issue with MPI_Win_create, Null pointer in parameter NULL base pointer is invalid when size is nonzero

I'm working in Fortran 90 with the latest stable version of MPICH (3.3)
I want to have an MPI_Window expose an array on the root process, and all other processes in the communicator call MPI_Get to copy the array into their own "local" copy.
Unfortunately, providing MPI_BOTTOM as the "base" argument in the non-root processes for MPI_Win_create(base, ...) results in the error
MPI_Win_create(192): MPI_Win_create(base=(nil), size=0, disp_unit=1275069467, MPI_INFO_NULL, MPI_COMM_WORLD, win=0x7ffcb343d9fc) failed
MPI_Win_create(156): Null pointer in parameter NULL base pointer is invalid when size is nonzero
I've been working off a textbook example, pg. 61 Fig 3.2, Using Advanced MPI, Modern Features of the Message-Passing Interface, Gropp, Hoefler, Thakur, Lusk.
What is the alternative kind(MPI_ADDRESS_KIND) that I'm supposed to use besides MPI_BOTTOM? Is this the correct way to initialize an MPI_Window on a process which isn't actually exposing it's internal memory, just accessing that of another process?
Obviously, changing the argument for base to an already allocated (non-null) array works, but this changes the behavior of the later GET so it doesn't work (creates an invalid memory access).
I don't know why the runtime error specifically says that a null base pointer is invalid with nonzero size, since I'm clearly specifying the size as 0 in the call to mpi_win_create(MPI_BOTTOM, 0, MPI_INTEGER, ...).
Here's all the code I have for this example for myself. It sets up the buffers and attempts to create the windows for each process. There is a commented out section between two calls to MPI_Fence that is the section where all non-root processes attempt the GET.
program main
use mpi
implicit none
integer :: ierr, procno, nprocs, comm
integer, allocatable :: root_data(:), local_data(:)
integer, parameter :: root = 0, NUM_ELEMENTS = 10
integer :: win
integer :: i
!======================================
call mpi_init(ierr)
comm = mpi_comm_world
call mpi_comm_rank(comm, procno, ierr)
call mpi_comm_size(comm, nprocs, ierr)
!======================================
if (procno .eq. root) then
allocate(root_data(1:NUM_ELEMENTS))
do i=1,NUM_ELEMENTS
root_data(i) = i
enddo
call mpi_win_create(root_data, NUM_ELEMENTS, MPI_INTEGER, &
MPI_INFO_NULL, comm, win, ierr)
else
allocate(local_data(1:NUM_ELEMENTS))
local_data = 0
call mpi_win_create(MPI_BOTTOM, 0, MPI_INTEGER, &
MPI_INFO_NULL, comm, win, ierr)
endif
!======================================
call mpi_win_fence(0, win, ierr)
!if (procno .ne. root) then
! call mpi_get(local_data, NUM_ELEMENTS, MPI_INTEGER, &
! root, 0, NUM_ELEMENTS, MPI_INTEGER, &
! win, ierr)
!endif
call mpi_win_fence(0, win, ierr)
!======================================
if (procno .ne. root) then
print *, "proc", procno
print *, local_data
endif
!======================================
call MPI_Win_free(win, ierr)
call mpi_finalize(ierr)
end program main
The expected result is that each process prints its version of local_data., which in this case should be ten 0's since the MPI_Get is commented out.
I hit the runtime error instead.
The size argument of MPI_Win_create() has type INTEGER(KIND=MPI_ADDRESS_KIND).
I was then able to successfully run the modified version with both MPICH 3.3 and the latest Open MPI
program main
use mpi
implicit none
integer :: ierr, procno, nprocs, comm
integer, allocatable :: root_data(:), local_data(:)
integer, parameter :: root = 0
integer (KIND=MPI_ADDRESS_KIND) :: NUM_ELEMENTS = 10, zero = 0
integer :: win
integer :: i
!======================================
call mpi_init(ierr)
comm = mpi_comm_world
call mpi_comm_rank(comm, procno, ierr)
call mpi_comm_size(comm, nprocs, ierr)
!======================================
if (procno .eq. root) then
allocate(root_data(1:NUM_ELEMENTS))
do i=1,NUM_ELEMENTS
root_data(i) = i
enddo
call mpi_win_create(root_data, NUM_ELEMENTS, MPI_INTEGER, MPI_INFO_NULL, comm, win, ierr)
else
allocate(local_data(1:NUM_ELEMENTS))
local_data = 0
call mpi_win_create(MPI_BOTTOM, zero, MPI_INTEGER, MPI_INFO_NULL, comm, win, ierr)
endif
!======================================
call mpi_win_fence(0, win, ierr)
!if (procno .ne. root) then
! call mpi_get(local_data, NUM_ELEMENTS, MPI_INTEGER, &
! root, 0, NUM_ELEMENTS, MPI_INTEGER, &
! win, ierr)
!endif
call mpi_win_fence(0, win, ierr)
!======================================
if (procno .ne. root) then
print *, "proc", procno
print *, local_data
endif
!======================================
call MPI_Win_free(win, ierr)
call mpi_finalize(ierr)
end program main

Wait until a subroutine is not used by other process

I have an MPI-parallelized code where it loops through n persons, and for each one it calls some subroutines to do some calculations and after all inside the loop calls a post-processing subroutine.
In the post-processing subroutine, I write the output I want in the following way:
person_number var1 var2
Let's say that every person belongs to a different rank. The problem is that when I write the file for person1, then maybe process of rank3 that includes person3 variables is executing the post-processing subroutine, so it overwrites my data of person1.
What I want is to find a way, to pause other processes before calling the post-processing subroutine, and then once this subroutine is not used by the previous rank, to run it for the next rank and so on.
This is a sketch of the code:
call MPI_Init(ierr)
do i = 1, npersons
call subroutine1(arg1,arg2,arg3)
! call it only if post_process not executed by other process
! otherwise wait until it ends and then call it
call post_process(i, var1, var2)
enddo
call MPI_Finalize(ierr)
subroutine post_process(i, var1, var2)
integer:: i
real*8:: var1, var2
write(111,*) i, var1, var2
end subroutine post_process
reading your comment: " Also, I am wondering if for example process 3 is faster than process 2, if i can use the same way but as soon rank 1 finishes with the routine to notify rank 3 to run the routine and then rank 3 to notify rank 2. Is there any automatic way of this? to know which rank waits before the post-processing step longer?"
This can be addressed exactly by letting all the I/O be performed on process with irank==0 and using buffered sends.
In this case you don't want to let the processes wait, no barriers here, but you want to let them dispatch their result as soon as it's ready, and then continue calculating. When it's time for process 0, it will receive all the buffered data and write them, then it write its own data. You can try to use standard MPI_SEND (it's buffered up to a prefixed size), but the best way is to use MPI_BSEND and attach a correctly sized buffer with MPI_BUFFER_ATTACH(). Something like this:
subroutine post_process(i, var1, var2, irank)
integer:: i, irank
real*8:: var1, var2
integer:: ir
real*8:: var1r, var2r
character buffer(100)
integer ipos
boolean flag
if (irank .gt. 0) then
ipos = 0
call MPI_PACK(i, 1, MPI_INTEGER, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
call MPI_PACK(var1, 1, MPI_REAL8, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
call MPI_PACK(var2, 1, MPI_REAL8, buffer, 100, ipos, MPI_COMM_WORLD, ierr)
call MPI_BSend( buffer, ipos, MPI_PACKED, 0, 0, MPI_COMM_WORLD, ierr)
else
do
call MPI_IPROBE(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, flag, MPI_STATUS_IGNORE, ierr)
if (flag .eq. false) exit
call MPI_RECV(buffer, 100, MPI_PACKED, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
ipos = 0
call MPI_UNPACK(buffer, 100, ipos, ir, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
call MPI_UNPACK(buffer, 100, ipos, var1r, 1, MPI_REAL8, MPI_COMM_WORLD, ierr)
call MPI_UNPACK(buffer, 100, ipos, var2r, 1, MPI_REAL8, MPI_COMM_WORLD, ierr)
write(111,*) ir, var1r, var2r
enddo
write(111,*) i, var1, var2
end if
end subroutine post_process
I'd perform this task serializing with barriers. Assuming you have got irank the result from MPI_COMM_RANK() and nprocs from MPI_COMM_SIZE():
call MPI_Init(ierr)
do i = 1, npersons
call subroutine1(arg1,arg2,arg3)
do ir = 0, nprocs-1
if (ir .eq. irank) then
! call it only if post_process not executed by other process
! otherwise wait until it ends and then call it
call post_process(i, var1, var2)
endif
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
enddo
enddo
All the processes wait at the MPI_BARRIER(), until the irank-th completes, and reach the barrier too.
I have to say that since all the processes write on a shared filesystem in post_process this is not guaranteed to work: the synchronization imposed at MPI level is usually very fast (isn't MPI optimized for this?), and can be faster than the synchronization present in a shared filesystem (being it NFS, GPFS,...), especially on large clusters. Furthermore performing it with a plain fortran write to a shared file... quite sure you can randomly incur in file corruptions, because of caching and timings on the different hosts.
The typical way to approach it is to let only processor with irank==0 write to the file, all the others send data to be written to it. Better, using MPI2 I/O.
The first thing is to initialize the MPI environment properly, by adding the following lines:
! Initialization of MPI
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numproc,ierr)
The function MPI_COMM_RANK will return a variable rank, which is an identifier for each process (i.e. each person of your example). You can use this variable for defining the order in which the processes execute the program. Also, since code in a MPI program is executed by all processes unless you tell them otherwise, you don't need a do loop to call your first subroutine.
You can use a call to MPI_RECV to block the execution of the program for each process until they receive a message. The trick is to work with the variable rank which indicates the number of each process (in your example, it seems to be numbers from 1 to n - be careful, it is likely that the ordering of ranks starts at 0). Tell your processes to pause and wait for a message, except the first process, which is allowed to execute the post-processing subroutine. Once process 1 is done with writing, tell it to send a message to process 2. As soon as process 2 receives the message, it will start executing the subroutine (which is now safe to do, since 1 is done) and send a message to process 3, and so on.
You can try to implement something like this:
integer:: tag
character(1):: mess
call subroutine1(arg1,arg2,arg3)
tag=22 ! or any integer you like
mess='a' ! The content here doesn't matter
if(rank .gt. 1) call MPI_RECV(mess,1,MPI_CHARACTER,rank-1,tag,MPI_COMM_WORLD,stat,ierr)
do k = 1,npersons
if (rank .eq. k) then
call post_process(var1, var2)
if(rank .lt. npersons) then
call MPI_SEND(mess,1,MPI_CHARACTER,rank+1,tag,MPI_COMM_WORLD,ierr)
end if
end if
end do