MPI_WIN_ALLOCATE_SHARED: memory limited?

MPI_WIN_ALLOCATE_SHARED: memory limited? - fortran

It seems like whenever I am trying to allocate a window around 30-32 Mb I get a segmentation fault?
I am using following routine MPI_WIN_ALLOCATE_SHARED
Does anybody know if there is a limit to how big my window can be? If so, is there a way to compile my code relaxing that limit?
I am using INTEL MPI 19.0.3 and ifort 19.0.3 -
Example written in Fortran. By varying the integer size_ you can see when the segmentation fault occurs. I tested it with size_=10e3 and size_=10e4 the latter caused a segmentation fault
C------
program TEST_STACK
use, INTRINSIC ::ISO_C_BINDING
implicit none
include 'mpif.h'
!--- Parameters (They should not be changed ! )
integer, parameter :: whoisroot = 0 ! - Root always 0 here
!--- General parallel
integer :: whoami ! - My rank
integer :: mpi_nproc ! - no. of procs
integer :: mpierr ! - Error status
integer :: status(MPI_STATUS_SIZE)! - For MPI_RECV
!--- Shared memory stuff
integer :: whoami_shm ! - Local rank in shared memory group
integer :: mpi_shm_nproc ! - No. of procs in Shared memory group
integer :: no_partners ! - No. of partners for share memory
integer :: info_alloc
!--- MPI groups
integer :: world_group ! - All procs across all nodes
integer :: shared_group ! - Only procs that share memory
integer :: MPI_COMM_SHM ! - Shared memory communicators (for those in shared_group)
type(C_PTR) :: ptr_buf
integer(kind = MPI_ADDRESS_KIND) :: size_bytes, lb
integer :: win, size_, disp_unit
call MPI_INIT ( mpierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, whoami, mpierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, whoami, mpierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, mpi_nproc, mpierr)
call MPI_COMM_SPLIT_TYPE( MPI_COMM_WORLD
& , MPI_COMM_TYPE_SHARED
& , 0
& , MPI_INFO_NULL
& , MPI_COMM_SHM
& , mpierr )
call MPI_COMM_RANK( MPI_COMM_SHM, whoami_shm, mpierr )
call MPI_COMM_SIZE( MPI_COMM_SHM, mpi_shm_nproc, mpierr )
size_ = 10e4! - seg fault
size_bytes = size_ * MPI_REAL
disp_unit = MPI_REAL
size_bytes = size_*disp_unit
call MPI_INFO_CREATE( info_alloc, mpierr )
call MPI_INFO_SET( info_alloc
& , "alloc_shared_noncontig"
& , "true"
& , mpierr )
!
call MPI_WIN_ALLOCATE_SHARED( size_bytes
& , disp_unit
& , info_alloc
& , MPI_COMM_SHM
& , ptr_buf
& , win
& , mpierr )
call MPI_WIN_FREE(win, mpierr)
end program TEST_STACK
I run my code using following command
mpif90 test_stack.f90; mpirun -np 2 ./a.out
This wrapper is linked to my ifort 19.0.3 and Intel MPI library. This has been verified by running
mpif90 -v
and to be very precise my mpif90 is a symbolic link to my mpiifort wrapper. This is made for personal convenience but shouldn't be causing problems I guess?

The manual says that the call to MPI_WIN_ALLOCATE_SHARED looks like this
USE MPI
MPI_WIN_ALLOCATE_SHARED(SIZE, DISP_UNIT, INFO, COMM, BASEPTR, WIN, IERROR)
INTEGER(KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR
INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR
At least the types of disp_unit and baseptr do not match in your program.

I was finally able to diagnose where the error stems from.
In the code I have
disp_unit = MPI_REAL
size_bytes = size_*disp_unit
MPI_REAL is a constant/parameter defined by MPI and is not equal to 4 as I very wrongly expected (4 for 4 bytes for single precision)!. In my version it is set to 1275069468 which most likely refers to an id rather than any sensible number.
Hence, multiplying this number with the size of my array can very quickly exceeds the available memory, but and also the number of digits that can be represented by a byte integer

Related

Using MPI_PUT in fortran and different ranks have different displacements using c_loc

I have MPI ranks split up to calculate different parts an an array, then I want to put/send those slices onto a different rank that doesn't participate in the calculation. That rank is the master of a new communicator set up to do other things with the array (averaging, IO, etc). I got it to work with MPI_isend and MPI_irecv, and now I want to try MPI_Put.
use mpi_f08
use iso_c_binding
implicit none
integer, parameter :: n=10, gps = 18, pes=12, dpes = 6
integer :: main=pes, d=dpes
integer :: diag_master
integer :: global_size, global_rank, diag_size, diag_rank
type(MPI_comm),allocatable :: diag_comm
integer :: pelist_diag
TYPE(MPI_Win) :: win
integer :: ierr, i, j
type(MPI_COMM) :: comm, mycomm
integer :: gsz, grk
integer :: lsz, lrk
integer(KIND=MPI_ADDRESS_KIND) :: local_group
logical :: local_flag
integer :: color,key
!!! THIS IS THE ARRAY
real, dimension(n,pes) :: r
!!!
logical :: on_dpes = .false.
logical,allocatable,dimension(:) :: dpes_list ! true if on dpes list
integer :: comm_manager
integer :: dmg
integer(KIND=MPI_ADDRESS_KIND) :: buff_size !< the size of a variable type
integer(kind=MPI_ADDRESS_KIND) :: displacement
integer :: disp_size
integer :: loc_base
integer, pointer :: fptr
!!!!!!!! THIS ALL WORKS BEGIN !!!!!!!!
comm=MPI_COMM_WORLD
call MPI_INIT(ierr)
call MPI_COMM_SIZE(COMM, gsz, ierr)
call MPI_COMM_RANK(COMM, grk, ierr)
allocate(dpes_list(gsz))
! write (6,*) "I am ",grk," of ",gsz
!> Find the group
call MPI_COMM_GET_ATTR(COMM,MPI_APPNUM,local_group,local_flag,ierr)
!> Split a new communicator as mycom
color = int(local_group)
key = 0
call MPI_COMM_SPLIT(COMM, color, key, mycomm, ierr)
!> Get information about the split communicators
call mpi_comm_size(mycomm,lsz,ierr)
call mpi_comm_rank(mycomm,lrk,ierr)
!> Create data on the main communicator
if (lsz == pes) then
comm_manager = main
on_dpes = .false.
r = 0.0
if (mod(lrk,2) == 0) then
c_loop: do concurrent (i=1:n)
r(i,lrk+1) = sin(real(i))+real(i)
enddo c_loop
else
r(:,lrk+1) = 10.0-dble(lrk)
endif
if (lsz == dpes) then
diag_size = lsz
diag_rank = lrk
comm_manager = d
on_dpes = .true.
diag_comm = mycomm
if (lrk==0) then
dmg = grk
endif
endif
call MPI_ALLGATHER(on_dpes,1,MPI_LOGICAL, &
dpes_list,gsz,MPI_LOGICAL, MPI_COMM_WORLD, ierr)
!> Get the master of dpes
do i=1,gsz
if (dpes_list(i)) then
dmg = i-1
exit
endif
enddo
diag_master = dmg
diag_global_master = dmg
!!!!!!!! THIS ALL WORKS END !!!!!!!!
!! At this point, the ranks that participate in the calculation
!! have values in r(i,lrk+1) where lrk is their rank
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!! THIS IS WHERE THINGS GO WRONG? !!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
disp_size = storage_size(r)
buff_size = disp_size*size(r)
call c_f_pointer(c_loc(r(1,1)),fptr)
loc_base = fptr
nullify(fptr)
write (6,*) loc_base, grk
call MPI_Win_create(loc_base,buff_size,disp_size,MPI_INFO_NULL,&
mpi_comm_world,win,ierr)
call MPI_Win_Fence(0,win,ierr)
displacement = loc_base + disp_size *buff_size
! if (.not.allocated(diag_comm)) then
if (grk == 11) then
call MPI_Put(r(:,global_rank+1),size(r,1),MPI_FLOAT,&
diag_master,displacement,size(r,1), MPI_FLOAT, win ,ierr)
endif
call MPI_Win_Fence(0,win,ierr)
CALL MPI_WIN_FREE(win, ierr)
call MPI_FINALIZE(ierr)
I have ! if (.not.allocated(diag_comm)) then commented out because I tried to do this with all of the ranks that calculate r, but I got the same result.
I am compiling with mpiifort -O0 -fpe0 -init=snan,arrays -no-wrap-margin -traceback -stand f18 and run with mpirun -n 12 ./$#.x : -n 6 ./$#.x in my Makefile. The version of mpiifort I am using is
> mpiifort -v
mpiifort for the Intel(R) MPI Library 2019 Update 2 for Linux*
Copyright 2003-2019, Intel Corporation.
ifort version 19.0.2.187
The output (write (6,*) loc_base, grk)is strange.
1072411986 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
0 10
0 11
2142952877 12
2142952877 13
2142952877 14
2142952877 15
2142952877 16
2142952877 17
Rank 12-17 are the ranks that don't participate in "calculating r", but I'm not sure why c_loc(r(1,1)) is different for these ranks. Also, it is different for rank 0.
My actual questions are
1) How do I calculate the displacement variable? Am I doing it correctly? Is it supposed to be different between ranks because it will be in this case?
2) Why is c_loc(r(1,1)) different for the ranks 12-17? Does it have anything to do with the fact that this is a SPMD program? Why is it different for rank 0?
3) Can I do the one way communication with all of the ranks instead of just one? I had each rank call mpi_isend, and then i just called mpi_irecv in a loop through all of the ranks sending when I did this the other way. Can I do something similar with MPI_Put? Should I be using MPI_Get? Something else?
4) How do I get this to work? This is just an educational example for myself, and what I actually need to do is much more complicated.

I can answer item 2, at least. You have:
call c_f_pointer(c_loc(r(1,1)),fptr)
loc_base = fptr
where loc_base is declared integer. You seem to be assuming that loc_base is some sort of address, but it is not. In Fortran, intrinsic assignment from a pointer assigns the value of the target, not the location of the target. So you're effectively doing a TRANSFER of the REAL values of r to loc_base - probably not what you want.

Application throws segmentation fault after some time on calling a procedure (containing local coarrays) in a loop

I am trying to call a subroutine in a loop. This subroutine has a local coarray. Following is the code that I am using:
! Test local coarray in procedure called in a loop.
!
program main
use, intrinsic :: iso_fortran_env, only : input_unit, output_unit, error_unit
implicit none
! Variable declaration.
integer :: me, ti
integer :: GHOST_WIDTH, TSTART, TSTEPS
sync all
! Initialize.
GHOST_WIDTH = 1
TSTART = 0
TSTEPS = 100000
me = this_image()
! Iterate.
do ti = TSTART + 1, TSTART + TSTEPS
call Aldeal( GHOST_WIDTH )
if ( me == 1 ) write( output_unit, * ) ti
end do
if ( me == 1 ) write( output_unit, * ) "All done!"
contains
subroutine Aldeal( width )
integer, intent(in) :: width
integer, allocatable, codimension[:] :: shell1_Co, shell2_Co, shell3_Co
allocate( shell1_Co[*], shell2_Co[*], shell3_Co[*] )
deallocate( shell1_Co, shell2_Co, shell3_Co )
return
end subroutine Aldeal
end program main
Right now the subroutine is not doing anything other than allocating the local coarray and deallocating it. But even while doing this, the program is throwing me the following error after some iterations:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
In coarray image 1
Image PC Routine Line Source
coarray_main 0000000000406063 Unknown Unknown Unknown
libpthread-2.17.s 00007F21D8B845F0 Unknown Unknown Unknown
libicaf.so 00007F21D90970D5 for_rtl_ICAF_CO_D Unknown Unknown
coarray_main 0000000000405054 main_IP_aldeal_ 37 coarray_main.f90
coarray_main 0000000000404AEC MAIN__ 23 coarray_main.f90
coarray_main 0000000000404A22 Unknown Unknown Unknown
libc-2.17.so 00007F21D85C5505 __libc_start_main Unknown Unknown
coarray_main 0000000000404929 Unknown Unknown Unknown
Abort(0) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 0) - process 0
And the same error is repeated for other images as well.
Line 23 is call Aldeal( GHOST_WIDTH ) inside the do loop of the main program. And line 37 corresponds to deallocate( shell1_Co, shell2_Co, shell3_Co ) statement in the subroutine.
Additionally, if I remove the deallocate statement from the subroutine, it throws the same error but the line number in the error statement this time are 23 and 39. Line 39 corresponds to the end subroutine Aldeal statement.
I am not able to understand what exactly I am doing wrong. Please help.
P.S. I am using Centos 7 with Intel(R) Parallel Studio XE 2019 Update 4 for Linux.

Observations:
If I modify the code to have a derived-type with an allocatable component and use that to create the coarray in the subroutine, the code runs a little longer but eventually aborts with an error. Following is the modification:
module mod_coarray_error
implicit none
type :: int_t
integer, allocatable, dimension(:) :: var
end type int_t
contains
subroutine Aldeal_type( width )
integer, intent(in) :: width
type(int_t), allocatable, codimension[:] :: int_t_Co
allocate( int_t_Co[*] )
allocate( int_t_Co%var(width) )
sync all
! deallocate( int_t_Co%var )
deallocate( int_t_Co )
return
end subroutine Aldeal_type
end module mod_coarray_error
program main
use, intrinsic :: iso_fortran_env, only : input_unit, output_unit, error_unit
use :: mod_coarray_error
implicit none
! Variable declaration.
integer :: me, ti
integer :: GHOST_WIDTH, TSTART, TSTEPS, SAVET
sync all
! Initialize.
GHOST_WIDTH = 3
TSTART = 0
TSTEPS = 100000
SAVET = 1000
me = this_image()
! Iterate.
do ti = TSTART + 1, TSTART + TSTEPS
sync all
call Aldeal_type( GHOST_WIDTH )
if ( mod( ti, SAVET ) == 0 ) then
if ( me == 1 ) write( output_unit, * ) ti
end if
end do
sync all
if ( me == 1 ) write( output_unit, * ) "All done!"
end program main
Additionally, this code runs fine till the end when compiled in Windows.
Now if I add the compiler option heap-arrays 0, the code seems to run till the end even in Linux.
I tried to increase the number of loops, ie, TSTEPS in the code to 1e7. Even then, it runs successfully till the end. But I observe the following effects:
Code gets slower as loop count increases, ie, it takes more time to run from ti = 1e6 to ti = 2e6 than the time it takes to run from ti = 1 to ti = 1e6.
Memory used by the program keeps on increasing, ie, each image which consumes 2GB at start of the program run, consumes 3.5GB at ti = 2e6, 4.7GB at ti = 4e6, and 6GB at ti = 6e6.
Memory used by the program is relatively less when run in Windows, but it still keeps on increasing as the loop count increases. Eg each image which consumes 100MB at start, consumes 1.5GB at ti = 2e6, 2.5GB at ti = 4e6, and 3.5GB at ti = 6e6.
Using the compiler option /heap-arrays0 in Windows has no effect either on the run (as it was already successfully running without it) or on the amount of memory consumed while running.
The original code posted in the question still throws an error even when compiled using the above compiler option. It does not seem to run in Windows too.
Ultimately, I am still confused as to what is happening.
P.S. I posted the question in Intel forum but have not received any response yet.

Incorrect results when reading binary file with MPI I/O

I am new to MPI and am struggling with reading a binary file.
Specifically, I have a $198\times 50 \times 50$ array of integers (16 bit integers, to be specific) stored in a binary file. I want to use 2 compute nodes to process this file. So there are two MPI processes and each process will process half of the input. I am using the function MPI_FILE_READ_AT to read respective regions. I expect the array values to fill in the variable/argument 'bucket' that I pass in to the function call. But a sanity check print out of the 'bucket' entries tells me that the values in bucket are all incorrect. I feel that I am going wrong with the arguments.
program main
use mpi
implicit none
integer :: i, error, num_processes, id, fh
integer(MPI_OFFSET_KIND) :: filesize, offset
integer(MPI_OFFSET_KIND) :: num_bytes_per_process
integer(MPI_OFFSET_KIND) :: num_bytes_this_process
integer :: num_ints_per_process, num_ints_this_process
integer(kind = 2), dimension(:), allocatable :: bucket
character(len=100) :: inputFileName
integer, parameter :: INTKIND=2
! Initialize
inputFileName = 'xyz_50x50'
print *, 'MPI_OFFSET_KIND =', MPI_OFFSET_KIND
! MPI basics
call MPI_Init ( error )
call MPI_Comm_size ( MPI_COMM_WORLD, num_processes, error )
call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )
! Open the file
call MPI_FILE_OPEN(MPI_COMM_WORLD, inputFileName, MPI_MODE_RDONLY, &
MPI_INFO_NULL, fh, error)
! get the size of the file
call MPI_File_get_size(fh, filesize, error)
! Note: filesize is the TOTAL number of bytes in the file
num_bytes_per_process = filesize/num_processes
num_ints_per_process = num_bytes_per_process/INTKIND
offset = id * num_bytes_per_process
num_bytes_this_process = min(num_bytes_per_process, filesize - offset)
num_ints_this_process = num_bytes_this_process/INTKIND
allocate(bucket(num_ints_this_process))
call MPI_FILE_READ_AT(fh, offset, bucket, num_ints_this_process, &
MPI_SHORT, MPI_STATUS_SIZE, error)
do i = 1, num_ints_this_process
if (bucket(i) /= 0) then
print *, "my id is ", id, " and bucket(",i,")=", bucket(i)
endif
enddo
! close the file
call MPI_File_close(fh, error)
! close mpi
call MPI_Finalize(error)
end program main

you have to use MPI_STATUS_IGNORE instead of MPI_STATUS_SIZE
(fwiw, i am unable to compile this program unless i fixe this)
call MPI_FILE_READ_AT(fh, offset, bucket, num_ints_this_process, &
MPI_SHORT, MPI_STATUS_IGNORE, error)
note that since all MPI tasks read the file at the same time, you'd rather use the collective MPI_File_read_at_all() subroutine in order to improve performances.

MPI in Fortran gives garbage values

PROGRAM ShareNeighbors
IMPLICIT REAL (a-h,o-z)
INCLUDE "mpif.h"
PARAMETER (m = 500, n = 500)
DIMENSION a(m,n), b(m,n)
DIMENSION h(m,n)
INTEGER istatus(MPI_STATUS_SIZE)
INTEGER iprocs, jprocs
PARAMETER (ROOT = 0)
integer dims(2),coords(2)
logical periods(2)
data periods/2*.false./
integer status(MPI_STATUS_SIZE)
integer comm2d,req,source
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
! Get a new communicator for a decomposition of the domain.
! Let MPI find a "good" decomposition
dims(1) = 0
dims(2) = 0
CALL MPI_DIMS_CREATE(nprocs,2,dims,ierr)
if (myrank.EQ.Root) then
print *,nprocs,'processors have been arranged into',dims(1),'X',dims(2),'grid'
endif
CALL MPI_CART_CREATE(MPI_COMM_WORLD,2,dims,periods,.true., &
comm2d,ierr)
! Get my position in this communicator
CALL MPI_COMM_RANK(comm2d,myrank,ierr)
! Get the decomposition
CALL fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
! print *,ista,jsta,iend,jend
ilen = iend - ista + 1
jlen = jend - jsta + 1
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
iprocs = dims(1)
jprocs = dims(2)
myranki = coords(1)
myrankj = coords(2)
DO j = jsta, jend
DO i = ista, iend
a(i,j) = myrank+1
ENDDO
ENDDO
! Send data from each processor to Root
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
! Recieved the results from othe precessors
if (myrank.EQ.Root) then
do source = 0,nprocs-1
call MPI_RECV(ista,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(iend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jsta,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
ilen = iend - ista + 1
jlen = jend - jsta + 1
call MPI_RECV(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
source,1,MPI_COMM_WORLD,status,ierr)
! print the results
call ZMINMAX(m,n,ista,iend,jsta,jend,a(:,:),amin,amax)
print *, 'myid=',source,amin,amax
call MPI_Wait(req, status, ierr)
enddo
endif
CALL MPI_FINALIZE(ierr)
END
subroutine fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
integer comm2d
integer m,n,ista,jsta,iend,jend
integer dims(2),coords(2),ierr
logical periods(2)
! Get (i,j) position of a processor from Cartesian topology.
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
! Decomposition in first (ie. X) direction
CALL MPE_DECOMP1D(m,dims(1),coords(1),ista,iend)
! Decomposition in second (ie. Y) direction
CALL MPE_DECOMP1D(n,dims(2),coords(2),jsta,jend)
return
end
SUBROUTINE MPE_DECOMP1D(n,numprocs,myid,s,e)
integer n,numprocs,myid,s,e,nlocal,deficit
nlocal = n / numprocs
s = myid * nlocal + 1
deficit = mod(n,numprocs)
s = s + min(myid,deficit)
! Give one more slice to processors
if (myid .lt. deficit) then
nlocal = nlocal + 1
endif
e = s + nlocal - 1
if (e .gt. n .or. myid .eq. numprocs-1) e = n
return
end
SUBROUTINE ZMINMAX(IX,JX,SX,EX,SY,EY,ZX,ZXMIN,ZXMAX)
INTEGER :: IX,JX,SX,EX,SY,EY
REAL :: ZX(IX,JX)
REAL :: ZXMIN,ZXMAX
ZXMIN=1000.
ZXMAX=-1000.
DO II=SX,EX
DO JJ=SY,EY
IF(ZX(II,JJ).LT.ZXMIN)ZXMIN=ZX(II,JJ)
IF(ZX(II,JJ).GT.ZXMAX)ZXMAX=ZX(II,JJ)
ENDDO
ENDDO
RETURN
END
When I am running the above code with 4 processors Root receives garbage values. Where as for 15 processors, the data transfer is proper. How I can tackle this?
I guess it is related buffer, a point which is not clear to me. How I have to tackle the buffer wisely?

1. problem
You are doing multiple sends
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
and all of them with the same request variable req. That can't work.
2. problem
You are using a subarray a(ista:iend,jsta:jend) in non-blocking MPI. That is not allowed*. You need to copy the array into some temporary array buffer or use MPI derived subarray datatype (too hard for you at this stage).
The reason for the problem is that the compiler will create a temporary copy just for the call to ISend. The ISend will remember the address, but will not send anything. Then temporary is deleted and the address becomes invalid. And then the MPI_Wait will try to use that address and will fail.
3. problem
Your MPI_Wait is in the wrong place. It must be after the sends out of any if conditions so that they are always executed (provided you are always sending).
You must collect all request separately and than wait for all of them. Best to have them in a an array and wait for all of them at once using MPI_Waitall.
Remeber, the ISend typically does not actually send anything if the buffer is large. The exchange often happens during the Wait operation. At least for larger arrays.
Recommendation:
Take a simple problem example and try to exchange just two small arrays with MPI_IRecv and MPI_ISend between two processes. As simple test problem as you can do. Learn from it, do simple steps. Take no offence, but your current understanding of non-blocking MPI is too weak to write full scale programs. MPI is hard, non-blocking MPI is even harder.
* not allowed when using the interface available in MPI-2. MPI-3 brings a new interface available by using use mpi_f08 where it is possible. But learn the basics first.

Incorrect output via fftw_mpi_r2c_2d and fftw_mpi_c2r_2d

I wrote a simple test program in order to implement the FFTW with MPI in a 2d domain (with Fortran). The domain is 'Ny x Nx' wide and partitioned in the second ('x') index.
After proper (I believe?) declaration and allocation of variables and plans, I call the fftw_mpi r2c_2d function and next I transform back its output with the fftw_mpi c2r_2d, in order to check if I get the original input. The r2c_2d part seems to work fine. However, I don't get the original input after transforming back the output (apart normalization) with the c2r_2d function: the resulting vector displays 'zeros' at the indices (:,j) with j corresponding to multiples of 'Ny/2'. What am I doing wrong? Thanks!
Here is the extract from the code:
Program TEST
use, intrinsic :: iso_c_binding
Implicit none
include 'mpif.h'
include 'fftw3-mpi.f03'
Integer*8,parameter :: nx=16, ny=16
!MPI
integer*8 :: ipe,npe
integer*8 ::mpi_realtype,icomm=mpi_comm_world,istat(mpi_status_size),ierr
! FFTW VARIABLES DECLARATION
type(C_PTR) :: p1, p2, cdatar, cdatac
integer(C_INTPTR_T) :: alloc_local, local_L, local_L_offset, local_M, local_M_offset
real(C_DOUBLE), pointer :: faux(:,:) ! real input 2d function
complex(C_DOUBLE), pointer :: gaux(:,:) ! complex output of 2d FFTW (transposed)
! MPI initialization
call mpi_init(ierr)
call mpi_comm_rank(icomm,ipe,ierr)
call mpi_comm_size(icomm,npe,ierr)
! FFTW ALLOCATIONS AND PLANS
call fftw_mpi_init()
alloc_local = fftw_mpi_local_size_2d(ny/2+1,nx &
,MPI_COMM_WORLD, local_L, local_L_offset)
cdatac = fftw_alloc_complex(alloc_local)
call c_f_pointer(cdatac, gaux, [nx,local_L]) !transposed
alloc_local = fftw_mpi_local_size_2d(nx,ny/2+1, MPI_COMM_WORLD, &
local_M, local_M_offset)
cdatar = fftw_alloc_real(2*alloc_local)
call c_f_pointer(cdatar, faux, [ny,local_M])
! Create plans
p1 = fftw_mpi_plan_dft_r2c_2d(nx,ny,faux,gaux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_OUT))
p2 = fftw_mpi_plan_dft_c2r_2d(nx,ny,gaux,faux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_IN))
! EXECUTE FFTW
call random_number(faux)
print *, "real input:", real(faux(1,:))
call fftw_mpi_execute_dft_r2c(p1,faux,gaux)
call fftw_mpi_execute_dft_c2r(p2, gaux, faux)
print *, "real output:", real(faux(1,:))/(nx*ny)
call fftw_destroy_plan(p1)
call fftw_destroy_plan(p2)
call mpi_finalize(ierr)
End Program TEST

The issue is due to the padding needed by fftw:
Although the real data is conceptually n0 × n1 × n2 × … × nd-1 , it is physically stored as an n0 × n1 × n2 × … × [2 (nd-1/2 + 1)] array, where the last dimension has been padded to make it the same size as the complex output. This is much like the in-place serial r2c/c2r interface (see Multi-Dimensional DFTs of Real Data), except that in MPI the padding is required even for out-of-place data.
Hence, the input array for a 16x16 transform is therefore a 16x18 array. The value of the extra two numbers at the end of each rows are meaningless in the real space. Yet, these extra numbers must not be forgotten as the c pointer is cast to a fortran 2D array:
call c_f_pointer(cdatar, faux, [2*(ny/2+1),local_M])
The extra numbers are still printed at the end of each row. The array can be sliced to avoid printing these worthless values:
print *, "real input:", real(faux(1:ny,:))
...
print *, "real output:", real(faux(1:ny,:))/(nx*ny)
Here is the complete code, based on yours and the one of How to do a fftw3 MPI "transposed" 2D transform if possible at all? It can be compiled by mpif90 main.f90 -o main -I/usr/include -L/usr/lib -lfftw3_mpi -lfftw3 -lm and ran by mpirun -np 2 main.
Program TEST
use, intrinsic :: iso_c_binding
Implicit none
include 'mpif.h'
include 'fftw3-mpi.f03'
Integer*8,parameter :: nx=4, ny=8
!MPI
integer*8 :: ipe,npe
integer*8 ::mpi_realtype,icomm=mpi_comm_world,istat(mpi_status_size),ierr
! FFTW VARIABLES DECLARATION
type(C_PTR) :: p1, p2, cdatar, cdatac
integer(C_INTPTR_T) :: alloc_local, local_L, local_L_offset, local_M, local_M_offset
real(C_DOUBLE), pointer :: faux(:,:) ! real input 2d function
complex(C_DOUBLE), pointer :: gaux(:,:) ! complex output of 2d FFTW (transposed)
! MPI initialization
call mpi_init(ierr)
call mpi_comm_rank(icomm,ipe,ierr)
call mpi_comm_size(icomm,npe,ierr)
! FFTW ALLOCATIONS AND PLANS
call fftw_mpi_init()
alloc_local = fftw_mpi_local_size_2d(ny/2+1,nx &
,MPI_COMM_WORLD, local_L, local_L_offset)
cdatac = fftw_alloc_complex(alloc_local)
call c_f_pointer(cdatac, gaux, [nx,local_L]) !transposed
alloc_local = fftw_mpi_local_size_2d(nx,ny/2+1, MPI_COMM_WORLD, &
local_M, local_M_offset)
cdatar = fftw_alloc_real(2*alloc_local)
call c_f_pointer(cdatar, faux, [2*(ny/2+1),local_M])
! Create plans
p1 = fftw_mpi_plan_dft_r2c_2d(nx,ny,faux,gaux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_OUT))
p2 = fftw_mpi_plan_dft_c2r_2d(nx,ny,gaux,faux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_IN))
! EXECUTE FFTW
call random_number(faux)
print *, "real input:", real(faux(1:ny,:))
call fftw_mpi_execute_dft_r2c(p1,faux,gaux)
call fftw_mpi_execute_dft_c2r(p2, gaux, faux)
print *, "real output:", real(faux(1:ny,:))/(nx*ny)
call fftw_destroy_plan(p1)
call fftw_destroy_plan(p2)
call mpi_finalize(ierr)
End Program TEST

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js