MPI in Fortran gives garbage values

PROGRAM ShareNeighbors
IMPLICIT REAL (a-h,o-z)
INCLUDE "mpif.h"
PARAMETER (m = 500, n = 500)
DIMENSION a(m,n), b(m,n)
DIMENSION h(m,n)
INTEGER istatus(MPI_STATUS_SIZE)
INTEGER iprocs, jprocs
PARAMETER (ROOT = 0)
integer dims(2),coords(2)
logical periods(2)
data periods/2*.false./
integer status(MPI_STATUS_SIZE)
integer comm2d,req,source
CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
! Get a new communicator for a decomposition of the domain.
! Let MPI find a "good" decomposition
dims(1) = 0
dims(2) = 0
CALL MPI_DIMS_CREATE(nprocs,2,dims,ierr)
if (myrank.EQ.Root) then
print *,nprocs,'processors have been arranged into',dims(1),'X',dims(2),'grid'
endif
CALL MPI_CART_CREATE(MPI_COMM_WORLD,2,dims,periods,.true., &
comm2d,ierr)
! Get my position in this communicator
CALL MPI_COMM_RANK(comm2d,myrank,ierr)
! Get the decomposition
CALL fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
! print *,ista,jsta,iend,jend
ilen = iend - ista + 1
jlen = jend - jsta + 1
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
iprocs = dims(1)
jprocs = dims(2)
myranki = coords(1)
myrankj = coords(2)
DO j = jsta, jend
DO i = ista, iend
a(i,j) = myrank+1
ENDDO
ENDDO
! Send data from each processor to Root
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
! Receive the results from the other processors
if (myrank.EQ.Root) then
do source = 0,nprocs-1
call MPI_RECV(ista,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(iend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jsta,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
call MPI_RECV(jend,1,MPI_INTEGER,source, &
1,MPI_COMM_WORLD,status,ierr )
ilen = iend - ista + 1
jlen = jend - jsta + 1
call MPI_RECV(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
source,1,MPI_COMM_WORLD,status,ierr)
! print the results
call ZMINMAX(m,n,ista,iend,jsta,jend,a(:,:),amin,amax)
print *, 'myid=',source,amin,amax
call MPI_Wait(req, status, ierr)
enddo
endif
CALL MPI_FINALIZE(ierr)
END
subroutine fnd2ddecomp(comm2d,m,n,ista,iend,jsta,jend)
integer comm2d
integer m,n,ista,jsta,iend,jend
integer dims(2),coords(2),ierr
logical periods(2)
! Get (i,j) position of a processor from Cartesian topology.
CALL MPI_Cart_get(comm2d,2,dims,periods,coords,ierr)
! Decomposition in first (ie. X) direction
CALL MPE_DECOMP1D(m,dims(1),coords(1),ista,iend)
! Decomposition in second (ie. Y) direction
CALL MPE_DECOMP1D(n,dims(2),coords(2),jsta,jend)
return
end
SUBROUTINE MPE_DECOMP1D(n,numprocs,myid,s,e)
integer n,numprocs,myid,s,e,nlocal,deficit
nlocal = n / numprocs
s = myid * nlocal + 1
deficit = mod(n,numprocs)
s = s + min(myid,deficit)
! Give one more slice to processors
if (myid .lt. deficit) then
nlocal = nlocal + 1
endif
e = s + nlocal - 1
if (e .gt. n .or. myid .eq. numprocs-1) e = n
return
end
SUBROUTINE ZMINMAX(IX,JX,SX,EX,SY,EY,ZX,ZXMIN,ZXMAX)
INTEGER :: IX,JX,SX,EX,SY,EY
REAL :: ZX(IX,JX)
REAL :: ZXMIN,ZXMAX
ZXMIN=1000.
ZXMAX=-1000.
DO II=SX,EX
DO JJ=SY,EY
IF(ZX(II,JJ).LT.ZXMIN)ZXMIN=ZX(II,JJ)
IF(ZX(II,JJ).GT.ZXMAX)ZXMAX=ZX(II,JJ)
ENDDO
ENDDO
RETURN
END
When I run the above code with 4 processors, Root receives garbage values, whereas with 15 processors the data transfer works properly. How can I tackle this?
I guess it is related to buffering, a point which is not clear to me. How should I handle the buffer properly?

1. problem
You are doing multiple sends
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1, &
MPI_COMM_WORLD,req,ierr)
call MPI_ISEND(a(ista:iend,jsta:jend),(ilen)*(jlen),MPI_REAL, &
Root,1,MPI_COMM_WORLD,req,ierr )
and all of them with the same request variable req. That can't work.
2. problem
You are using a subarray a(ista:iend,jsta:jend) in non-blocking MPI. That is not allowed*. You need to copy the array into a temporary buffer or use an MPI derived subarray datatype (too hard for you at this stage).
The reason for the problem is that the compiler will create a temporary copy just for the call to ISend. The ISend will remember the address of that temporary, but will not send anything yet. Then the temporary is deleted and the address becomes invalid. The MPI_Wait will later try to use that address and will fail.
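For illustration, a minimal sketch of the temporary-buffer workaround (sendbuf and req_a are hypothetical names, not from the original code): allocate an ordinary contiguous buffer, copy the block into it, and send that buffer instead of the array section.
real, allocatable :: sendbuf(:,:)
integer :: req_a
! explicit, contiguous copy whose lifetime we control
allocate(sendbuf(ilen,jlen))
sendbuf = a(ista:iend,jsta:jend)
call MPI_ISEND(sendbuf,ilen*jlen,MPI_REAL,Root,1, &
MPI_COMM_WORLD,req_a,ierr)
! sendbuf must stay allocated (and untouched) until the matching MPI_WAIT/MPI_WAITALL has completed
The key point is that the buffer handed to MPI_ISEND is a real variable that still exists when the wait completes, not a compiler-generated temporary.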
3. problem
Your MPI_Wait is in the wrong place. It must come after the sends and outside of any if conditions, so that it is always executed (provided you are always sending).
You must collect all requests separately and then wait for all of them. Best to keep them in an array and wait for all of them at once using MPI_Waitall.
Remember, ISend typically does not actually send anything if the buffer is large. The exchange often happens during the Wait operation, at least for larger arrays.
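Putting problems 1 and 3 together, the send side could be sketched roughly like this (reqs and stats are hypothetical names; sendbuf is the contiguous copy from the sketch above):
integer :: reqs(5)
integer :: stats(MPI_STATUS_SIZE,5)
! each send gets its own request
call MPI_ISEND(ista,1,MPI_INTEGER,Root,1,MPI_COMM_WORLD,reqs(1),ierr)
call MPI_ISEND(iend,1,MPI_INTEGER,Root,1,MPI_COMM_WORLD,reqs(2),ierr)
call MPI_ISEND(jsta,1,MPI_INTEGER,Root,1,MPI_COMM_WORLD,reqs(3),ierr)
call MPI_ISEND(jend,1,MPI_INTEGER,Root,1,MPI_COMM_WORLD,reqs(4),ierr)
call MPI_ISEND(sendbuf,ilen*jlen,MPI_REAL,Root,1,MPI_COMM_WORLD,reqs(5),ierr)
! ... the root's MPI_RECV loop goes here ...
! executed by every rank, outside any if block
call MPI_WAITALL(5,reqs,stats,ierr)
Only after MPI_WAITALL returns is it safe to deallocate or reuse the send buffers.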
Recommendation:
Take a simple example and try to exchange just two small arrays with MPI_IRecv and MPI_ISend between two processes, as simple a test problem as you can make it (a minimal sketch is given below the footnote). Learn from it, take small steps. Take no offence, but your current understanding of non-blocking MPI is too weak to write full-scale programs. MPI is hard, non-blocking MPI is even harder.
* Not allowed when using the interface available in MPI-2. MPI-3 brings a new interface, available via use mpi_f08, where it is possible. But learn the basics first.
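Along the lines of that recommendation, a self-contained two-rank exchange could look roughly like this (a sketch only, meant to be run with exactly two processes):
program two_rank_exchange
  use mpi
  implicit none
  integer, parameter :: n = 4
  integer :: myrank, nprocs, other, ierr
  integer :: reqs(2), stats(MPI_STATUS_SIZE,2)
  real :: sendbuf(n), recvbuf(n)
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)   ! assumes nprocs == 2
  other   = 1 - myrank
  sendbuf = real(myrank + 1)
  ! post the receive and the send, each with its own request
  call MPI_IRECV(recvbuf, n, MPI_REAL, other, 0, MPI_COMM_WORLD, reqs(1), ierr)
  call MPI_ISEND(sendbuf, n, MPI_REAL, other, 0, MPI_COMM_WORLD, reqs(2), ierr)
  ! neither buffer may be touched before the wait completes
  call MPI_WAITALL(2, reqs, stats, ierr)
  print *, 'rank', myrank, 'received', recvbuf
  call MPI_FINALIZE(ierr)
end program two_rank_exchange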

Related

Error on file_set_view in a collective mpi_write

I am just trying to write collectively with MPI I/O in Fortran from a CFD code. In each process, the data are divided into blocks with a general number of cells, and a structure var(b) is created which hosts the two variables r and p of block b. Then a two-level MPI derived datatype is created to collect all the data in a process: the first type collects all variables of a block, and the second one collects all the block structures of a process. So each process has to write one of these two-level derived datatypes, where the offset accounts for all the data in the preceding processes (0 for rank 0, all the data of rank 0 for rank 1, and so on). The code is the following
module var_mod
type vt
sequence
double precision,dimension(:,:,:),allocatable :: r,p
end type vt
type(vt),target,dimension(:),allocatable :: var
end module var_mod
PROGRAM main
USE MPI_F08
USE var_mod
IMPLICIT NONE
! FILES
INTEGER,PARAMETER :: NB = 4
!----------------------------------------------------------------
INTEGER :: b,i,j,k,me,np
TYPE(MPI_File) :: mpifh
INTEGER(KIND=MPI_OFFSET_KIND) :: mpidisp,sum_dim
integer,dimension(:),allocatable :: ni,nj,nk,mpiblock,mpistride
integer :: cont,mpierr
INTEGER,dimension(nb) :: Blocks
INTEGER(KIND=MPI_ADDRESS_KIND),dimension(:),allocatable :: Offsets,Pos
INTEGER(KIND=MPI_COUNT_KIND) :: lb, ext8
TYPE(MPI_Datatype),dimension(:),allocatable :: Elem_Type,Types
TYPE(MPI_Datatype) :: All_Type,mpiparflowtype
TYPE(MPI_Status) :: status
CHARACTER(LEN=MPI_MAX_ERROR_STRING) :: string
INTEGER :: resultlen
!----------------------------------------------------------------
call mpi_init
call mpi_comm_size(mpi_comm_world,np)
call mpi_comm_rank(mpi_comm_world,me)
allocate (ni(nb))
allocate (nj(nb))
allocate (nk(nb))
allocate (var(nb))
do b = 1,NB
ni(b) = b/3+1
nj(b) = b
nk(b) = b/5+1
allocate (var(b)%r(ni(b),nj(b),nk(b)))
allocate (var(b)%p(ni(b),nj(b),nk(b)))
END DO
!
! Initialize the data
!
do b = 1,nb
DO k = 1,nk(b)
DO j = 1,nj(b)
DO i = 1,ni(b)
var(b)%r(i,j,k) = 10000*me+1000*b+i*100+j*10+k
var(b)%p(i,j,k) = -var(b)%r(i,j,k)
END DO
END DO
END DO
end do
! (1) Create a separate structure datatype for each record
allocate (Offsets(2),Pos(2),Types(2),Elem_Type(nb))
DO b = 1,nb
CALL MPI_GET_ADDRESS(var(b)%r,POS(1))
CALL MPI_GET_ADDRESS(var(b)%p,POS(2))
Offsets = POS-POS(1)
Types = MPI_REAL8
Blocks = ni(b)*nj(b)*nk(b)
CALL MPI_TYPE_CREATE_STRUCT(2,Blocks,Offsets,Types,Elem_Type(b),mpierr)
END DO
deallocate (Offsets,Pos,Types)
! Create a structure of structures that describes the whole array
allocate (Offsets(nb),Pos(nb))
Blocks = 1
DO b = 1,nb
CALL MPI_GET_ADDRESS(var(b)%r,POS(b))
END DO
Offsets = POS-POS(1)
CALL MPI_TYPE_CREATE_STRUCT(nb,Blocks,Offsets,Elem_Type,All_Type)
CALL MPI_TYPE_COMMIT(All_Type,mpierr)
! Free the intermediate datatypes
DO b = 1,nb
CALL MPI_TYPE_FREE(Elem_Type(b))
END DO
deallocate(Offsets,Pos,Elem_Type)
! Set index
cont = 1
allocate(mpiblock(cont))
allocate(mpistride(cont))
mpiblock=1
mpistride=0
call MPI_TYPE_INDEXED(cont,mpiblock,mpistride,All_Type,mpiparflowtype)
call MPI_TYPE_COMMIT(mpiparflowtype)
deallocate(mpiblock,mpistride)
! Position where to write
CALL MPI_Type_get_extent(MPI_REAL8, lb, ext8)
mpidisp = 0
do b = 1,nb
mpidisp = mpidisp + (ni(b)*nj(b)*nk(b)) ! number of cells in block b
end do
mpidisp = mpidisp*2*ext8*me ! multiply by the number of variables and the bytes per variable, and shift by the process rank
! Open file
call MPI_FILE_OPEN(MPI_COMM_WORLD,'MPIDATA',IOR(MPI_MODE_CREATE,MPI_MODE_WRONLY),MPI_INFO_NULL,mpifh)
! setting file view
call MPI_FILE_SET_VIEW(mpifh,mpidisp,All_Type,mpiparflowtype,'native',MPI_INFO_NULL,mpierr)
write(*,*) me,'error on file set view:',mpierr
call MPI_Error_string(mpierr, string, resultlen)
write(*,*) 'string:',trim(string),resultlen
! MPI Write file
call MPI_FILE_WRITE_ALL(mpifh,var(1)%r,1,All_Type,status)
! Close file
call MPI_FILE_CLOSE(mpifh)
! deallocations and free
CALL MPI_TYPE_FREE(All_Type)
CALL MPI_TYPE_FREE(mpiparflowtype)
do b = 1,nb
deallocate (var(b)%r,var(b)%p)
END DO
deallocate (var)
deallocate (ni,nj,nk)
! end
call mpi_finalize
END PROGRAM main
When the code is launched on, for instance, two processes (no problems in the compilation phase with either the Intel or GNU compilers), the run completes, but an MPI_TYPE_ERR error is reported by MPI_FILE_SET_VIEW and the data file contains only rank 0's data.
I would expect a file with data from all ranks, but I cannot understand what the problem is.

MPI_WIN_ALLOCATE_SHARED: memory limited?

It seems like whenever I try to allocate a window of around 30-32 MB I get a segmentation fault.
I am using the routine MPI_WIN_ALLOCATE_SHARED.
Does anybody know if there is a limit to how big my window can be? If so, is there a way to compile my code relaxing that limit?
I am using Intel MPI 19.0.3 and ifort 19.0.3.
The example is written in Fortran. By varying the integer size_ you can see when the segmentation fault occurs. I tested it with size_=10e3 and size_=10e4; the latter caused a segmentation fault.
C------
program TEST_STACK
use, INTRINSIC ::ISO_C_BINDING
implicit none
include 'mpif.h'
!--- Parameters (They should not be changed ! )
integer, parameter :: whoisroot = 0 ! - Root always 0 here
!--- General parallel
integer :: whoami ! - My rank
integer :: mpi_nproc ! - no. of procs
integer :: mpierr ! - Error status
integer :: status(MPI_STATUS_SIZE)! - For MPI_RECV
!--- Shared memory stuff
integer :: whoami_shm ! - Local rank in shared memory group
integer :: mpi_shm_nproc ! - No. of procs in Shared memory group
integer :: no_partners ! - No. of partners for share memory
integer :: info_alloc
!--- MPI groups
integer :: world_group ! - All procs across all nodes
integer :: shared_group ! - Only procs that share memory
integer :: MPI_COMM_SHM ! - Shared memory communicators (for those in shared_group)
type(C_PTR) :: ptr_buf
integer(kind = MPI_ADDRESS_KIND) :: size_bytes, lb
integer :: win, size_, disp_unit
call MPI_INIT ( mpierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, whoami, mpierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, whoami, mpierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, mpi_nproc, mpierr)
call MPI_COMM_SPLIT_TYPE( MPI_COMM_WORLD
& , MPI_COMM_TYPE_SHARED
& , 0
& , MPI_INFO_NULL
& , MPI_COMM_SHM
& , mpierr )
call MPI_COMM_RANK( MPI_COMM_SHM, whoami_shm, mpierr )
call MPI_COMM_SIZE( MPI_COMM_SHM, mpi_shm_nproc, mpierr )
size_ = 10e4! - seg fault
size_bytes = size_ * MPI_REAL
disp_unit = MPI_REAL
size_bytes = size_*disp_unit
call MPI_INFO_CREATE( info_alloc, mpierr )
call MPI_INFO_SET( info_alloc
& , "alloc_shared_noncontig"
& , "true"
& , mpierr )
!
call MPI_WIN_ALLOCATE_SHARED( size_bytes
& , disp_unit
& , info_alloc
& , MPI_COMM_SHM
& , ptr_buf
& , win
& , mpierr )
call MPI_WIN_FREE(win, mpierr)
end program TEST_STACK
I run my code using the following command:
mpif90 test_stack.f90; mpirun -np 2 ./a.out
This wrapper is linked to my ifort 19.0.3 and Intel MPI library. This has been verified by running
mpif90 -v
and, to be precise, my mpif90 is a symbolic link to my mpiifort wrapper. This is done for personal convenience and shouldn't be causing problems, I guess.
The manual says that the call to MPI_WIN_ALLOCATE_SHARED looks like this
USE MPI
MPI_WIN_ALLOCATE_SHARED(SIZE, DISP_UNIT, INFO, COMM, BASEPTR, WIN, IERROR)
INTEGER(KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR
INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR
At least the types of disp_unit and baseptr do not match in your program.
I was finally able to diagnose where the error stems from.
In the code I have
disp_unit = MPI_REAL
size_bytes = size_*disp_unit
MPI_REAL is a constant/parameter defined by MPI and is not equal to 4, as I had very wrongly expected (4 as in 4 bytes for single precision). In my version it is set to 1275069468, which is most likely a handle/id rather than any sensible byte count.
Hence, multiplying this number by the size of my array very quickly exceeds the available memory, and can also overflow the range representable by a 4-byte integer.
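For reference, a sketch of the call with kinds matching the mpif.h interface quoted above (realsize is a hypothetical helper variable; MPI_COMM_SHM is the shared-memory communicator from MPI_COMM_SPLIT_TYPE as in the question; with use mpi_f08 the baseptr argument would be a TYPE(C_PTR) instead):
integer(kind=MPI_ADDRESS_KIND) :: size_bytes, baseptr
integer :: size_, disp_unit, realsize, win, mpierr
size_ = 100000
! ask MPI for the byte size of one MPI_REAL instead of using the handle value
call MPI_TYPE_SIZE(MPI_REAL, realsize, mpierr)
disp_unit  = realsize
size_bytes = int(size_, MPI_ADDRESS_KIND) * realsize
call MPI_WIN_ALLOCATE_SHARED(size_bytes, disp_unit, MPI_INFO_NULL, &
     MPI_COMM_SHM, baseptr, win, mpierr)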

Incorrect results when reading binary file with MPI I/O

I am new to MPI and am struggling with reading a binary file.
Specifically, I have a 198 x 50 x 50 array of integers (16-bit integers, to be specific) stored in a binary file. I want to use 2 compute nodes to process this file, so there are two MPI processes and each process handles half of the input. I am using the function MPI_FILE_READ_AT to read the respective regions. I expect the array values to fill the variable/argument bucket that I pass into the function call, but a sanity-check printout of the bucket entries tells me that the values in bucket are all incorrect. I feel that I am going wrong with the arguments.
program main
use mpi
implicit none
integer :: i, error, num_processes, id, fh
integer(MPI_OFFSET_KIND) :: filesize, offset
integer(MPI_OFFSET_KIND) :: num_bytes_per_process
integer(MPI_OFFSET_KIND) :: num_bytes_this_process
integer :: num_ints_per_process, num_ints_this_process
integer(kind = 2), dimension(:), allocatable :: bucket
character(len=100) :: inputFileName
integer, parameter :: INTKIND=2
! Initialize
inputFileName = 'xyz_50x50'
print *, 'MPI_OFFSET_KIND =', MPI_OFFSET_KIND
! MPI basics
call MPI_Init ( error )
call MPI_Comm_size ( MPI_COMM_WORLD, num_processes, error )
call MPI_Comm_rank ( MPI_COMM_WORLD, id, error )
! Open the file
call MPI_FILE_OPEN(MPI_COMM_WORLD, inputFileName, MPI_MODE_RDONLY, &
MPI_INFO_NULL, fh, error)
! get the size of the file
call MPI_File_get_size(fh, filesize, error)
! Note: filesize is the TOTAL number of bytes in the file
num_bytes_per_process = filesize/num_processes
num_ints_per_process = num_bytes_per_process/INTKIND
offset = id * num_bytes_per_process
num_bytes_this_process = min(num_bytes_per_process, filesize - offset)
num_ints_this_process = num_bytes_this_process/INTKIND
allocate(bucket(num_ints_this_process))
call MPI_FILE_READ_AT(fh, offset, bucket, num_ints_this_process, &
MPI_SHORT, MPI_STATUS_SIZE, error)
do i = 1, num_ints_this_process
if (bucket(i) /= 0) then
print *, "my id is ", id, " and bucket(",i,")=", bucket(i)
endif
enddo
! close the file
call MPI_File_close(fh, error)
! close mpi
call MPI_Finalize(error)
end program main
You have to use MPI_STATUS_IGNORE instead of MPI_STATUS_SIZE.
(FWIW, I am unable to compile this program unless I fix this.)
call MPI_FILE_READ_AT(fh, offset, bucket, num_ints_this_process, &
MPI_SHORT, MPI_STATUS_IGNORE, error)
Note that since all MPI tasks read the file at the same time, you would rather use the collective MPI_File_read_at_all() subroutine in order to improve performance.
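A sketch of that collective variant, keeping the question's variable names:
! collective read: every rank that opened the file must make this call
call MPI_FILE_READ_AT_ALL(fh, offset, bucket, num_ints_this_process, &
     MPI_SHORT, MPI_STATUS_IGNORE, error)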

Using Persistent Communication in Fortran MPI

I am using persistent communication in my CFD code. I have the communication setup in another subroutine, and in the main subroutine, which contains the do loop, I use MPI_STARTALL() and MPI_WAITALL().
To keep it short, I am showing the first part of the setup; the rest of the arrays are handled in exactly the same way.
My setup subroutine looks like:
Subroutine MPI_Subroutine
use Variables
use mpi
implicit none
!Starting up MPI
call MPI_INIT(ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,npes,ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,MyRank,ierr)
!Compute the size of local block (1D Decomposition)
Jmax = JmaxGlobal
Imax = ImaxGlobal/npes
if (MyRank.lt.(ImaxGlobal - npes*Imax)) then
Imax = Imax + 1
end if
if (MyRank.ne.0.and.MyRank.ne.(npes-1)) then
Imax = Imax + 2
Else
Imax = Imax + 1
endif
! Computing neighbours
if (MyRank.eq.0) then
Left = MPI_PROC_NULL
else
Left = MyRank - 1
end if
if (MyRank.eq.(npes -1)) then
Right = MPI_PROC_NULL
else
Right = MyRank + 1
end if
! Initializing the Arrays in each processor, according to the number of local nodes
Call InitializeArrays
!Creating the channel of communication for this computation,
!Sending and receiving the u_old (Ghost cells)
Call MPI_SEND_INIT(u_old(2,:),Jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(1),ierr)
Call MPI_RECV_INIT(u_old(Imax,:),jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(2),ierr)
Call MPI_SEND_INIT(u_old(Imax-1,:),Jmax,MPI_DOUBLE_PRECISION,Right,tag,MPI_COMM_WORLD,req(3),ierr)
Call MPI_RECV_INIT(u_old(1,:),jmax,MPI_DOUBLE_PRECISION,Left,tag,MPI_COMM_WORLD,req(4),ierr)
Since I am debugging my code, I am just checking these arrays. When I check, my ghost cells are full of zeroes, so I guess I am messing up the instructions.
The main code, where I call the MPI_STARTALL, MPI_WAITALL looks like:
Program
use Variables
use mpi
implicit none
open(32, file = 'error.dat')
Call MPI_Subroutine
!kk=kk+1
DO kk=1, 2001
! A lot of calculation
! communicating the maximum error among the processes and delta t
call MPI_REDUCE(eps,epsGlobal,1,MPI_DOUBLE_PRECISION,MPI_MAX,0,MPI_COMM_WORLD,ierr)
call MPI_BCAST(epsGlobal,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
call MPI_REDUCE(delta_t,delta_tGlobal,1,MPI_DOUBLE_PRECISION,MPI_MIN,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) delta_t = delta_tGlobal
call MPI_BCAST(delta_t,1,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierr)
if(MyRank.eq.0) then
write(*,*) kk,epsGlobal,(kk*delta_t)
write(32,*) kk,epsGlobal
endif
Call Swap
Call MPI_STARTALL(4,req,ierr) !
Call MPI_WAITALL(4,req,status,ierr)
enddo
The variables are set in another module. The MPI-related variables look like:
! MPI variables
INTEGER :: npes, MyRank, ierr, Left, Right, tag
INTEGER :: status(MPI_STATUS_SIZE,4)
INTEGER,dimension(4) :: req
I appreciate your time and suggestions on this problem.
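For reference only, the general shape of the persistent halo-exchange pattern described here is roughly the following sketch (hypothetical names, contiguous halo buffers assumed; note that array sections such as u_old(2,:) passed to MPI_SEND_INIT can hit the same array-temporary issue discussed in the first answer above):
! set up once, outside the time loop (halo_send/halo_recv are contiguous buffers)
call MPI_SEND_INIT(halo_send, jmax, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, req(1), ierr)
call MPI_RECV_INIT(halo_recv, jmax, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, req(2), ierr)
do kk = 1, nsteps
   call MPI_STARTALL(2, req, ierr)              ! (re)activate the persistent requests
   call MPI_WAITALL(2, req, statuses, ierr)     ! complete them before reading halo_recv
   ! ... computation that uses the freshly received ghost cells ...
end do
call MPI_REQUEST_FREE(req(1), ierr)             ! release the persistent requests at the end
call MPI_REQUEST_FREE(req(2), ierr)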

Incorrect output via fftw_mpi_r2c_2d and fftw_mpi_c2r_2d

I wrote a simple test program in order to implement FFTW with MPI on a 2D domain (in Fortran). The domain is 'Ny x Nx' and is partitioned along the second ('x') index.
After proper (I believe?) declaration and allocation of variables and plans, I call the fftw_mpi r2c_2d function and then transform its output back with the fftw_mpi c2r_2d function, in order to check whether I recover the original input. The r2c_2d part seems to work fine. However, I don't get the original input back (apart from normalization) after the c2r_2d transform: the resulting array shows zeros at the indices (:,j) with j corresponding to multiples of Ny/2. What am I doing wrong? Thanks!
Here is the extract from the code:
Program TEST
use, intrinsic :: iso_c_binding
Implicit none
include 'mpif.h'
include 'fftw3-mpi.f03'
Integer*8,parameter :: nx=16, ny=16
!MPI
integer*8 :: ipe,npe
integer*8 ::mpi_realtype,icomm=mpi_comm_world,istat(mpi_status_size),ierr
! FFTW VARIABLES DECLARATION
type(C_PTR) :: p1, p2, cdatar, cdatac
integer(C_INTPTR_T) :: alloc_local, local_L, local_L_offset, local_M, local_M_offset
real(C_DOUBLE), pointer :: faux(:,:) ! real input 2d function
complex(C_DOUBLE), pointer :: gaux(:,:) ! complex output of 2d FFTW (transposed)
! MPI initialization
call mpi_init(ierr)
call mpi_comm_rank(icomm,ipe,ierr)
call mpi_comm_size(icomm,npe,ierr)
! FFTW ALLOCATIONS AND PLANS
call fftw_mpi_init()
alloc_local = fftw_mpi_local_size_2d(ny/2+1,nx &
,MPI_COMM_WORLD, local_L, local_L_offset)
cdatac = fftw_alloc_complex(alloc_local)
call c_f_pointer(cdatac, gaux, [nx,local_L]) !transposed
alloc_local = fftw_mpi_local_size_2d(nx,ny/2+1, MPI_COMM_WORLD, &
local_M, local_M_offset)
cdatar = fftw_alloc_real(2*alloc_local)
call c_f_pointer(cdatar, faux, [ny,local_M])
! Create plans
p1 = fftw_mpi_plan_dft_r2c_2d(nx,ny,faux,gaux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_OUT))
p2 = fftw_mpi_plan_dft_c2r_2d(nx,ny,gaux,faux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_IN))
! EXECUTE FFTW
call random_number(faux)
print *, "real input:", real(faux(1,:))
call fftw_mpi_execute_dft_r2c(p1,faux,gaux)
call fftw_mpi_execute_dft_c2r(p2, gaux, faux)
print *, "real output:", real(faux(1,:))/(nx*ny)
call fftw_destroy_plan(p1)
call fftw_destroy_plan(p2)
call mpi_finalize(ierr)
End Program TEST
The issue is due to the padding needed by fftw:
Although the real data is conceptually n0 × n1 × n2 × … × nd-1 , it is physically stored as an n0 × n1 × n2 × … × [2 (nd-1/2 + 1)] array, where the last dimension has been padded to make it the same size as the complex output. This is much like the in-place serial r2c/c2r interface (see Multi-Dimensional DFTs of Real Data), except that in MPI the padding is required even for out-of-place data.
Hence, the input array for a 16x16 transform is in fact a 16x18 array. The values of the two extra numbers at the end of each row are meaningless in real space. Yet these extra numbers must not be forgotten when the C pointer is cast to a Fortran 2D array:
call c_f_pointer(cdatar, faux, [2*(ny/2+1),local_M])
The extra numbers are still printed at the end of each row. The array can be sliced to avoid printing these worthless values:
print *, "real input:", real(faux(1:ny,:))
...
print *, "real output:", real(faux(1:ny,:))/(nx*ny)
Here is the complete code, based on yours and on the one from 'How to do a fftw3 MPI "transposed" 2D transform if possible at all?'. It can be compiled with mpif90 main.f90 -o main -I/usr/include -L/usr/lib -lfftw3_mpi -lfftw3 -lm and run with mpirun -np 2 main.
Program TEST
use, intrinsic :: iso_c_binding
Implicit none
include 'mpif.h'
include 'fftw3-mpi.f03'
Integer*8,parameter :: nx=4, ny=8
!MPI
integer*8 :: ipe,npe
integer*8 ::mpi_realtype,icomm=mpi_comm_world,istat(mpi_status_size),ierr
! FFTW VARIABLES DECLARATION
type(C_PTR) :: p1, p2, cdatar, cdatac
integer(C_INTPTR_T) :: alloc_local, local_L, local_L_offset, local_M, local_M_offset
real(C_DOUBLE), pointer :: faux(:,:) ! real input 2d function
complex(C_DOUBLE), pointer :: gaux(:,:) ! complex output of 2d FFTW (transposed)
! MPI initialization
call mpi_init(ierr)
call mpi_comm_rank(icomm,ipe,ierr)
call mpi_comm_size(icomm,npe,ierr)
! FFTW ALLOCATIONS AND PLANS
call fftw_mpi_init()
alloc_local = fftw_mpi_local_size_2d(ny/2+1,nx &
,MPI_COMM_WORLD, local_L, local_L_offset)
cdatac = fftw_alloc_complex(alloc_local)
call c_f_pointer(cdatac, gaux, [nx,local_L]) !transposed
alloc_local = fftw_mpi_local_size_2d(nx,ny/2+1, MPI_COMM_WORLD, &
local_M, local_M_offset)
cdatar = fftw_alloc_real(2*alloc_local)
call c_f_pointer(cdatar, faux, [2*(ny/2+1),local_M])
! Create plans
p1 = fftw_mpi_plan_dft_r2c_2d(nx,ny,faux,gaux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_OUT))
p2 = fftw_mpi_plan_dft_c2r_2d(nx,ny,gaux,faux, MPI_COMM_WORLD, &
ior(FFTW_MEASURE, FFTW_MPI_TRANSPOSED_IN))
! EXECUTE FFTW
call random_number(faux)
print *, "real input:", real(faux(1:ny,:))
call fftw_mpi_execute_dft_r2c(p1,faux,gaux)
call fftw_mpi_execute_dft_c2r(p2, gaux, faux)
print *, "real output:", real(faux(1:ny,:))/(nx*ny)
call fftw_destroy_plan(p1)
call fftw_destroy_plan(p2)
call mpi_finalize(ierr)
End Program TEST