MPI_Alltoallw working and MPI_Ialltoallw failing - Fortran

I am trying to introduce non-blocking communications into a large code, but it fails when I do. I have reproduced the error with the program below: run on one CPU, it works when switch is set to .false. but fails when switch is set to .true.
program main
    use mpi
    implicit none
    logical :: switch
    integer, parameter :: maxSize = 128
    integer scounts(maxSize), sdispls(maxSize)
    integer rcounts(maxSize), rdispls(maxSize)
    integer :: types(maxSize)
    double precision sbuf(maxSize), rbuf(maxSize)
    integer comm, size, rank, req
    integer ierr
    integer ii

    call MPI_INIT(ierr)
    comm = MPI_COMM_WORLD
    call MPI_Comm_size(comm, size, ierr)
    call MPI_Comm_rank(comm, rank, ierr)

    switch = .true.

    ! Init
    sbuf(:) = rank
    scounts(:) = 0
    rcounts(:) = 0
    sdispls(:) = 0
    rdispls(:) = 0
    types(:) = MPI_INTEGER

    if (switch) then
        ! Send one time N double precision
        scounts(1) = 1
        rcounts(1) = 1
        sdispls(1) = 0
        rdispls(1) = 0
        call MPI_Type_create_subarray(1, (/maxSize/), &
                                      (/maxSize/), &
                                      (/0/), &
                                      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, &
                                      types(1), ierr)
        call MPI_Type_commit(types(1), ierr)
    else
        ! Send N times one double precision
        do ii = 1, maxSize
            scounts(ii) = 1
            rcounts(ii) = 1
            sdispls(ii) = ii - 1
            rdispls(ii) = ii - 1
            types(ii) = MPI_DOUBLE_PRECISION
        enddo
    endif

    call MPI_Ibarrier(comm, req, ierr)
    call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)

    if (switch) then
        call MPI_Ialltoallw(sbuf, scounts, sdispls, types, &
                            rbuf, rcounts, rdispls, types, &
                            comm, req, ierr)
        call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)
        call MPI_TYPE_FREE(types(1), ierr)
    else
        call MPI_Alltoallw(sbuf, scounts, sdispls, types, &
                           rbuf, rcounts, rdispls, types, &
                           comm, ierr)
    endif

    call MPI_Finalize(ierr)
end program main
Compiling with the debug flag and running with mpirun -np 1 valgrind --vgdb=yes --vgdb-error=0 ./a.out leads to the following errors in valgrind and gdb:
valgrind:
==249074== Invalid read of size 8
==249074== at 0x4EB0A6D: release_vecs_callback (coll_base_util.c:222)
==249074== by 0x4EB100A: complete_vecs_callback (coll_base_util.c:245)
==249074== by 0x74AD1CC: ompi_request_complete (request.h:441)
==249074== by 0x74AE86D: ompi_coll_libnbc_progress (coll_libnbc_component.c:466)
==249074== by 0x4FC0C39: opal_progress (opal_progress.c:231)
==249074== by 0x4E04795: ompi_request_wait_completion (request.h:415)
==249074== by 0x4E047EB: ompi_request_default_wait (req_wait.c:42)
==249074== by 0x4E80AF7: PMPI_Wait (pwait.c:74)
==249074== by 0x48A30D2: mpi_wait (pwait_f.c:76)
==249074== by 0x10961A: MAIN__ (tmp.f90:61)
==249074== by 0x1096C6: main (tmp.f90:7)
==249074== Address 0x7758830 is 0 bytes inside a block of size 8 free'd
==249074== at 0x483CA3F: free (vg_replace_malloc.c:540)
==249074== by 0x4899CCC: PMPI_IALLTOALLW (pialltoallw_f.c:125)
==249074== by 0x1095FC: MAIN__ (tmp.f90:61)
==249074== by 0x1096C6: main (tmp.f90:7)
==249074== Block was alloc'd at
==249074== at 0x483B7F3: malloc (vg_replace_malloc.c:309)
==249074== by 0x4899B4A: PMPI_IALLTOALLW (pialltoallw_f.c:90)
==249074== by 0x1095FC: MAIN__ (tmp.f90:61)
==249074== by 0x1096C6: main (tmp.f90:7)
gdb:
Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x0000000004eb0a6d in release_vecs_callback (request=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:222
222 if (NULL != request->data.vecs.stypes[i]) {
(gdb) bt
#0 0x0000000004eb0a6d in release_vecs_callback (request=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:222
#1 0x0000000004eb100b in complete_vecs_callback (req=0x7758af8) at ../../../../openmpi-4.1.0/ompi/mca/coll/base/coll_base_util.c:245
#2 0x00000000074ad1cd in ompi_request_complete (request=0x7758af8, with_signal=true) at ../../../../../openmpi-4.1.0/ompi/request/request.h:441
#3 0x00000000074ae86e in ompi_coll_libnbc_progress () at ../../../../../openmpi-4.1.0/ompi/mca/coll/libnbc/coll_libnbc_component.c:466
#4 0x0000000004fc0c3a in opal_progress () at ../../openmpi-4.1.0/opal/runtime/opal_progress.c:231
#5 0x0000000004e04796 in ompi_request_wait_completion (req=0x7758af8) at ../../openmpi-4.1.0/ompi/request/request.h:415
#6 0x0000000004e047ec in ompi_request_default_wait (req_ptr=0x1ffeffdbb8, status=0x1ffeffdbc0) at ../../openmpi-4.1.0/ompi/request/req_wait.c:42
#7 0x0000000004e80af8 in PMPI_Wait (request=0x1ffeffdbb8, status=0x1ffeffdbc0) at pwait.c:74
#8 0x00000000048a30d3 in ompi_wait_f (request=0x1ffeffe6cc, status=0x10c0a0 <mpi_fortran_status_ignore_>, ierr=0x1ffeffeee0) at pwait_f.c:76
#9 0x000000000010961b in MAIN__ () at tmp.f90:61
Any help would be appreciated. Ubuntu 20.04, gfortran 9.3.0, Open MPI 4.1.0. Thanks.

The posted program is valid, but it currently triggers a bug in Open MPI; see issue https://github.com/open-mpi/ompi/issues/8763. The current workaround is to use MPICH.
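If switching MPI implementations is not an option right away, a possible stopgap within the question's own program (a sketch, not an official fix) is to fall back to the blocking variant, which the question shows working:

    ! Workaround sketch on affected Open MPI versions: blocking call instead
    ! of MPI_Ialltoallw + MPI_Wait, using the question's variables
    call MPI_Alltoallw(sbuf, scounts, sdispls, types, &
                       rbuf, rcounts, rdispls, types, &
                       comm, ierr)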
EDIT: the bug is fixed in the main branch of Open MPI and should be fixed in versions 5.0 and above.

Related

Does MPI_Finalize release memory?

In the following code, I have an array b which is used in MPI. As far as I understand, each processor gets a copy of b even before the call to MPI_INIT. But what happens after we call MPI_FINALIZE? Is that piece of memory still available to each processor?
Similarly, what would happen if b were instead declared as a pointer and allocated between MPI_INIT and MPI_FINALIZE but never deallocated? Would that memory still be available after finalizing MPI?
program main
    use mpi
    implicit none
    integer myr, numpr, ier
    integer b(1000)

    call MPI_INIT(ier)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myr, ier)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numpr, ier)
    if (myr .eq. 0) then
        !initialize b array
    endif
    call MPI_BCAST(b, 100, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
    call MPI_FINALIZE(ier)
    !do more calculations with b
end
If you imagine the code that you've written without any of the MPI stuff, you'll see that each processor starts with a B array of size 1000, because you declare it as such:
integer b(1000)
Neither MPI_Init nor MPI_Finalize is involved in allocating or deallocating any of this memory.
Likewise, you can allocate an array (C in the code below) at run time and it will stick around until you explicitly deallocate it:
PROGRAM main
    use mpi
    implicit none
    integer myr, numpr, ier
    integer b(1000)
    INTEGER, ALLOCATABLE :: C(:)

    call MPI_INIT(ier)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myr, ier)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numpr, ier)
    ALLOCATE(C(1000))
    if (myr .eq. 0) then
        b = 100   ! Set all values to 100
        c = 99    ! Ditto 99
    endif
    call MPI_BCAST(b, 1000, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
    call MPI_BCAST(c, 1000, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)
    call MPI_FINALIZE(ier)
    PRINT *, myr, B(200)
    PRINT *, myr, C(200)
    DEALLOCATE(C)
END PROGRAM main
produces output:
1 100
1 99
0 100
0 99
Also, note that you have a typo (I think) in your initial code: you only broadcast the first 100 elements of B, which has size 1000.
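The broadcast that matches the declared size of B would be:

    call MPI_BCAST(b, 1000, MPI_INTEGER, 0, MPI_COMM_WORLD, ier)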

Is MPI_IBcast guaranteed to send even if some ranks don't participate

I am creating an MPI program where I am trying to send the same data to all processes as soon as they finish their calculation. The processes can have large differences in their computation time, so I don't want one process to wait for another.
The root process is guaranteed to always send first.
I know that MPI_Bcast acts as a barrier, so I experimented with MPI_Ibcast:
program main
    use mpi
    implicit none
    integer rank, nprcos, ierror, a(10), req

    call MPI_INIT(ierror)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nprcos, ierror)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
    a = rank
    if (rank /= 2) then
        call MPI_IBCAST(a, size(a), MPI_INTEGER, 0, MPI_COMM_WORLD, req, ierror)
        call MPI_WAIT(req, MPI_STATUS_IGNORE, ierror)
    endif
    write (*,*) 'Hello World from process: ', rank, 'of ', nprcos, "a = ", a(1)
    call MPI_FINALIZE(ierror)
end program main
From my experiments it seems that, regardless of which rank is "boycotting" the MPI_Ibcast, it always works on all the others:
> $ mpifort test.f90 && mpiexec --mca btl tcp,self -np 4 ./a.out
Hello World from process: 2 of 4 a = 2
Hello World from process: 1 of 4 a = 0
Hello World from process: 0 of 4 a = 0
Hello World from process: 3 of 4 a = 0
Is this guaranteed behavior or is it specific to my Open MPI implementation? How else could I implement this? I can only think of a loop over MPI_Isend.
No, this is not guaranteed: all ranks in the communicator must participate. Within MPI, that is the definition of a collective communication. A point-to-point alternative is sketched below.
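If some ranks genuinely must be allowed to skip the operation, the loop over point-to-point sends that the question mentions is one legal way to do it. A minimal sketch, reusing the names from the question's program plus two hypothetical declarations (integer :: dest and integer, allocatable :: reqs(:)); note that MPI_ISEND is only guaranteed to complete locally for messages small enough for the eager protocol, so MPI_IBSEND would be the standard-guaranteed choice if a rank may never receive:

    if (rank == 0) then
        ! Root posts one nonblocking send per destination; no destination
        ! has to be inside any collective for these to be matched later.
        allocate(reqs(nprcos - 1))
        do dest = 1, nprcos - 1
            call MPI_ISEND(a, size(a), MPI_INTEGER, dest, 0, MPI_COMM_WORLD, reqs(dest), ierror)
        end do
        call MPI_WAITALL(nprcos - 1, reqs, MPI_STATUSES_IGNORE, ierror)
    else
        ! Each worker picks the data up whenever it is ready.
        call MPI_RECV(a, size(a), MPI_INTEGER, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierror)
    end if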

How to send and receive arrays using an MPI struct in Fortran 90

I'm attempting to package 1D and 2D double precision arrays in an MPI struct in Fortran 90. I've successfully done this in C++ in a very similar problem and the procedure seems to be almost exactly the same, but I can't seem to figure out where I'm going wrong here, despite the extremely helpful MPI error codes...
My best guess is that the problem is in the block length or displacement calculations. The code compiles and runs when the call to MPI_RECV() is commented out, but results in an error when it is uncommented.
main.f90
program main
    use mymodule
    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nPROC, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myID, ierr)
    call PASS_MESH_MPI()
    call MPI_FINALIZE(ierr)
end program
mymodule.f90
module mymodule
    use mpi
    double precision, dimension(:,:) :: U(0:10,0:10)
    double precision, dimension(:) :: r(0:10), z(0:10)
    integer, public :: ierr, nPROC, nWRs, myID, stat(MPI_STATUS_SIZE)

    type mytype
        double precision, dimension(:,:), allocatable :: U
        double precision, dimension(:), allocatable :: r
        double precision, dimension(:), allocatable :: z
    end type

contains

    subroutine PASS_MESH_MPI()
        implicit none
        type(mytype) :: package
        integer :: blocklen(3), types(3), myMPItype
        integer(KIND=MPI_ADDRESS_KIND) :: displacement(3), base

        allocate( package%U(0:10,0:10) )
        allocate( package%r(0:10) )
        allocate( package%z(0:10) )

        call MPI_GET_ADDRESS(package%U, displacement(1), ierr)
        call MPI_GET_ADDRESS(package%r, displacement(2), ierr)
        call MPI_GET_ADDRESS(package%z, displacement(3), ierr)
        base = displacement(1)
        displacement(1) = displacement(1) - base
        displacement(2) = displacement(2) - base
        displacement(3) = displacement(3) - base

        blocklen(1) = (11)*(11)
        blocklen(2) = 11
        blocklen(3) = 11
        types(1) = MPI_DOUBLE_PRECISION
        types(2) = MPI_DOUBLE_PRECISION
        types(3) = MPI_DOUBLE_PRECISION

        call MPI_TYPE_CREATE_STRUCT(3, blocklen, displacement, types, myMPItype, ierr)
        call MPI_TYPE_COMMIT(myMPItype, ierr)

        if ( myID .eq. 0 ) then
            U(:,:) = 5
            r(:) = 5
            z(:) = 5
            package%r(:) = r(:)
            package%z(:) = z(:)
            package%U(:,:) = U(:,:)
            call MPI_SEND(package, 1, myMPItype, 1, 0, MPI_COMM_WORLD, ierr)
        end if
        if ( myID .ne. 0 ) then
            call MPI_RECV(package, 1, myMPItype, 0, 0, MPI_COMM_WORLD, stat, ierr)
        end if

        call MPI_TYPE_FREE(myMPItype, ierr)
    end subroutine

end module
makefile
COMP=mpif90
EXT=f90
CFLAGS=-Wall -Wextra -Wimplicit-interface -fPIC -fmax-errors=1 -g -fcheck=all -fbacktrace
PROG=TESTDAT.x
INPUT=input.dat
OUTPUT=output
TARGS=main.f90 mymodule.f90
OBJS=main.o mymodule.o

$(PROG): $(OBJS)
	$(COMP) $(CFLAGS) -o $(PROG) $(OBJS) $(LFLAGS)

mymodule.mod: mymodule.f90 mymodule.o
	$(COMP) -c $(CFLAGS) mymodule.f90

mymodule.o: mymodule.f90
	$(COMP) -c $(CFLAGS) mymodule.f90

main.o: main.f90 mymodule.mod
	$(COMP) -c $(CFLAGS) main.f90

run:
	make
	mpiexec -np 2 $(PROG)
	make clean

clean:
	rm -f $(PROG) *.mod *.o DONE watch
Here is the error when I attempt to run this code with 2 processes.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x1020d86fd
#1 0x1020d7a93
#2 0x7fff6f520b5c
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node MacBook-Pro exited on signal 11 (Segmentation fault: 11).
--------------------------------------------------------------------------
And I've seen this somewhat less descriptive one appear as well.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x1046906fd
#1 0x10468fa93
#2 0x7fff6f520b5c
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node MacBook-Pro exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
I realize the instructions explicitly say to not paste full files, but these are quite small and I was hoping to make it more convenient for anyone who wanted to run the code themselves. Any help would be greatly appreciated!
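One observation, offered as a hedged guess rather than a verified answer: package has allocatable components, so the three arrays live on the heap, not at the address of package itself. The displacements are measured from package%U, yet MPI_SEND and MPI_RECV are given package as the buffer, which mixes two unrelated base addresses. The usual remedy is to keep the absolute addresses returned by MPI_GET_ADDRESS (that is, skip the subtraction of base) and communicate relative to MPI_BOTTOM, roughly like this:

    ! Untested sketch: keep absolute displacements, no 'base' subtraction,
    ! and use MPI_BOTTOM as the buffer on both sides
    call MPI_SEND(MPI_BOTTOM, 1, myMPItype, 1, 0, MPI_COMM_WORLD, ierr)
    call MPI_RECV(MPI_BOTTOM, 1, myMPItype, 0, 0, MPI_COMM_WORLD, stat, ierr)

Each rank must build myMPItype from its own component addresses, which the code above already does.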

MPI_WIN_ALLOCATE_SHARED: memory limited?

It seems that whenever I try to allocate a window of around 30-32 MB I get a segmentation fault.
I am using the routine MPI_WIN_ALLOCATE_SHARED.
Does anybody know if there is a limit to how big my window can be? If so, is there a way to compile my code to relax that limit?
I am using Intel MPI 19.0.3 and ifort 19.0.3.
The example is written in Fortran. By varying the integer size_ you can see when the segmentation fault occurs. I tested it with size_=10e3 and size_=10e4; the latter caused a segmentation fault.
program TEST_STACK
    use, intrinsic :: ISO_C_BINDING
    implicit none
    include 'mpif.h'

    !--- Parameters (They should not be changed ! )
    integer, parameter :: whoisroot = 0   ! - Root always 0 here

    !--- General parallel
    integer :: whoami                     ! - My rank
    integer :: mpi_nproc                  ! - no. of procs
    integer :: mpierr                     ! - Error status
    integer :: status(MPI_STATUS_SIZE)    ! - For MPI_RECV

    !--- Shared memory stuff
    integer :: whoami_shm                 ! - Local rank in shared memory group
    integer :: mpi_shm_nproc              ! - No. of procs in shared memory group
    integer :: no_partners                ! - No. of partners for shared memory
    integer :: info_alloc

    !--- MPI groups
    integer :: world_group                ! - All procs across all nodes
    integer :: shared_group               ! - Only procs that share memory
    integer :: MPI_COMM_SHM               ! - Shared memory communicator (for those in shared_group)

    type(C_PTR) :: ptr_buf
    integer(kind=MPI_ADDRESS_KIND) :: size_bytes, lb
    integer :: win, size_, disp_unit

    call MPI_INIT(mpierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, whoami, mpierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, whoami, mpierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, mpi_nproc, mpierr)
    call MPI_COMM_SPLIT_TYPE(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                             MPI_INFO_NULL, MPI_COMM_SHM, mpierr)
    call MPI_COMM_RANK(MPI_COMM_SHM, whoami_shm, mpierr)
    call MPI_COMM_SIZE(MPI_COMM_SHM, mpi_shm_nproc, mpierr)

    size_ = 10e4   ! - seg fault
    size_bytes = size_ * MPI_REAL
    disp_unit = MPI_REAL
    size_bytes = size_ * disp_unit

    call MPI_INFO_CREATE(info_alloc, mpierr)
    call MPI_INFO_SET(info_alloc, "alloc_shared_noncontig", "true", mpierr)

    call MPI_WIN_ALLOCATE_SHARED(size_bytes, disp_unit, info_alloc, &
                                 MPI_COMM_SHM, ptr_buf, win, mpierr)
    call MPI_WIN_FREE(win, mpierr)
end program TEST_STACK
I run my code using the following command:
mpif90 test_stack.f90; mpirun -np 2 ./a.out
This wrapper is linked to my ifort 19.0.3 and the Intel MPI library. This has been verified by running
mpif90 -v
and, to be very precise, my mpif90 is a symbolic link to my mpiifort wrapper. This is for personal convenience but shouldn't be causing problems, I guess?
The manual says that the call to MPI_WIN_ALLOCATE_SHARED looks like this:
USE MPI
MPI_WIN_ALLOCATE_SHARED(SIZE, DISP_UNIT, INFO, COMM, BASEPTR, WIN, IERROR)
INTEGER(KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR
INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR
At least the types of disp_unit and baseptr do not match in your program.
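For what it's worth, a sketch of declarations that match the quoted interface; with include 'mpif.h' there is no compile-time argument checking, and the type(C_PTR) variant of baseptr is provided by the MPI modules rather than mpif.h:

    integer(kind=MPI_ADDRESS_KIND) :: size_bytes, baseptr
    integer :: disp_unit, win, mpierr
    call MPI_WIN_ALLOCATE_SHARED(size_bytes, disp_unit, info_alloc, &
                                 MPI_COMM_SHM, baseptr, win, mpierr)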
I was finally able to diagnose where the error stems from.
In the code I have:
disp_unit = MPI_REAL
size_bytes = size_*disp_unit
MPI_REAL is a constant/parameter defined by MPI and is not equal to 4, as I very wrongly expected (4 bytes for single precision). In my implementation it is set to 1275069468, which is most likely a handle id rather than any sensible byte count.
Hence, multiplying this number by the size of my array very quickly exceeds both the available memory and the range of the integer type used to hold the size.
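For reference, a sketch of the corrected size computation (assuming a default real occupies 4 bytes; the baseptr point from the previous answer still applies):

    disp_unit = storage_size(1.0) / 8                      ! bytes per default real; MPI_REAL is a handle, not a size
    size_bytes = int(size_, MPI_ADDRESS_KIND) * disp_unit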

Invalid pointer and segmentation fault when using MPI_Gather in Fortran

I have a simple program, which is supposed to gather a number of small arrays into one big one using MPI.
PROGRAM main
    include 'mpif.h'
    integer ierr, i, myrank, thefile, n_procs
    integer, parameter :: BUFSIZE = 3
    complex*16, allocatable :: loc_arr(:), glob_arr(:)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, n_procs, ierr)

    allocate(loc_arr(BUFSIZE))
    loc_arr = 0.7 * myrank - cmplx(0.3, 0, kind=8)

    allocate(glob_arr(n_procs * BUFSIZE))
    write (*,*) myrank, shape(glob_arr)

    call MPI_Gather(loc_arr, BUFSIZE, MPI_DOUBLE_COMPLEX, &
                    glob_arr, n_procs * BUFSIZE, MPI_DOUBLE_COMPLEX, &
                    0, MPI_COMM_WORLD, ierr)

    write (*,*) myrank, "Errorcode:", ierr
    call MPI_FINALIZE(ierr)
END PROGRAM main
I have some experience with MPI in C, but with Fortran 90 nothing seems to work. Here is how I compile (I use ifort) and run it:
mpif90 test.f90 -check all && mpirun -np 4 ./a.out
1 12
3 12
3 Errorcode: 0
1 Errorcode: 0
0 12
2 12
2 Errorcode: 0
0 Errorcode: 0
*** Error in `./a.out': free(): invalid pointer: 0x0000000000a25790 ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 10889 RUNNING AT LenovoX1kabel
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 10889 RUNNING AT LenovoX1kabel
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
What am I doing wrong? Sometimes I get this pointer problem, sometimes a segmentation fault, but none of the ifort checks seem to complain.
All the error codes are 0, so I'm not sure where I am going wrong.
As a simple rule of thumb: counts in MPI collectives are per process, so you should never multiply by the number of processes.
Therefore the receive count n_procs * BUFSIZE is clearly wrong.
And indeed the manual states: recvcount - number of elements for any single receive (integer, significant only at root).
You should just use BUFSIZE. This is the same in C and Fortran; the corrected call is shown below.
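With that change, the call from the question becomes:

    call MPI_Gather(loc_arr, BUFSIZE, MPI_DOUBLE_COMPLEX, &
                    glob_arr, BUFSIZE, MPI_DOUBLE_COMPLEX, &
                    0, MPI_COMM_WORLD, ierr)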