The code I am working on uses MPI to split a large 3 dimensional array (a cube) into sub domains along all the three axes to form smaller cubes. I previously had worked on a simpler 2 dimensional equivalent with no issues.
Now since MPI has this annoying habit (or gratifying habit, depending on how you see it) of treating MPI_SEND and MPI_RECV as non-blocking calls for small chunks of data, this migration from 2D to 3D brought a lot struggle. All the calls that worked perfectly for 2D started deadlocking at the slightest provocation in 3D since the data that had to be passed between processes were now 3D arrays and therefore were larger.
After a week of fight and pulling out much hair, a complicated set of MPI_SEND and MPI_RECV calls were constructed that managed to pass data across faces, edges and corners of every cube in the domain (with periodicity and non-periodicity set appropriately at different boundaries) smoothly. The happiness was not to last. After adding a new boundary condition that required an extra path of communication between cells on one side of the domain, the code plunged into another vicious bout of deadlocking.
Having had enough, I decided to resort to non-blocking calls. Now with this much background, I hope my intentions with the below code will be quite plain. I am not including the codes I used to transfer data across edges and corners of the sub-domains. If I can sort out the communication between the faces of the cubes, I would be able to bring everything else to fall neatly into place.
The code uses five arrays to simplify data transfer:
rArr = Array of ranks of neighbouring cells
tsArr = Array of tags for sending data to each neighbour
trArr = Array of tags for receiving data from each neighbour
lsArr = Array of limits (indices) to describe the chunk of data to be sent
lrArr = Array of limits (indices) to describe the chunk of data to be received
Now since each cube has 6 neighbours sharing a face each, rArr, tsArr and trArr each are integer arrays of length 6. The limits array on the other hand is a two dimensional array as described below:
lsArr = [[xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize], !for face 1 sending
[xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize], !for face 2 sending
.
.
[xStart, xEnd, yStart, yEnd, zStart, zEnd, dSize]] !for face 6 sending
So a call to send values of the variable dCube across the ith face of a cell (process) will happen as below:
call MPI_SEND(dCube(lsArr(i, 1):lsArr(i, 2), lsArr(i, 3):lsArr(i, 4), lsArr(i, 5):lsArr(i, 6)), lsArr(i, 7), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
And another process with the matching destination rank and tag will receive the same chunk as below:
call MPI_RECV(dCube(lrArr(i, 1):lrArr(i, 2), lrArr(i, 3):lrArr(i, 4), lrArr(i, 5):lrArr(i, 6)), lrArr(i, 7), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
The lsArr and lrArr of source and destination process were tested to show matching sizes (but different limits). The tags arrays were also checked to see if they matched.
Now my earlier version of the code with blocking calls worked perfectly and hence I am 99% confident about the correctness of values in the above arrays. If there is reason to doubt their accuracy, I can add those details, but then the post will become extremely long.
Below is the blocking version of my code which worked perfectly. I apologize if it is a bit intractable. If it is necessary to elucidate it further to identify the problem, I shall do so.
subroutine exchangeData(dCube)
use someModule
implicit none
integer :: i, j
double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube
do j = 1, 3
if (mod(edArr(j), 2) == 0) then !!edArr = [xRank, yRank, zRank]
i = 2*j - 1
call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
i = 2*j
call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
else
i = 2*j
call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
i = 2*j - 1
call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
end if
if (mod(edArr(j), 2) == 0) then
i = 2*j
call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
i = 2*j - 1
call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
else
i = 2*j - 1
call MPI_RECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, stVal, ierr)
i = 2*j
call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
end if
end do
end subroutine exchangeData
Basically it goes along each direction, x, y and z and first sends data from odd numbered faces and then even numbered faces. I don't know if there is an easier way to do this. This was arrived at after innumerable deadlocking codes that nearly drove me mad. The codes to send data across edges and corners are even longer.
Now comes the actual problem I am having now. I replaced the above code with the following (a bit naively, maybe?)
subroutine exchangeData(dCube)
use someModule
implicit none
integer :: i, j
integer, dimension(6) :: fRqLst
integer :: stLst(MPI_STATUS_SIZE, 6)
double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube
fRqLst = MPI_REQUEST_NULL
do i = 1, 6
call MPI_IRECV(dCube(lrArr(1, i):lrArr(2, i), lrArr(3, i):lrArr(4, i), lrArr(5, i):lrArr(6, i)), &
lrArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), trArr(i), MPI_COMM_WORLD, fRqLst(i), ierr)
end do
do i = 1, 6
call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
end do
call MPI_WAITALL(6, fRqLst, stLst, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
end subroutine exchangeData
someModule is a placeholder module that contains all the variables. Actually they are distributed across a series of modules, but I'll gloss over that for now. The main idea was to use non-blocking MPI_IRECV calls to prime every process to receive data and then send data using a series of blocking MPI_SEND calls. However, I doubt that if things were this easy, parallel programming would have been a piece of cake.
This code gives a SIGABRT and exits with a double-free error. Moreover it seems to be a Heisenbug and disappears at times.
Error message:
*** Error in `./a.out': double free or corruption (!prev): 0x00000000010315c0 ***
*** Error in `./a.out': double free or corruption (!prev): 0x00000000023075c0 ***
*** Error in `./a.out': double free or corruption (!prev): 0x0000000001d465c0 ***
Program received signal SIGABRT: Process abort signal.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
Backtrace for this error:
#0 0x7F5807D3C7D7
#1 0x7F5807D3CDDE
#2 0x7F580768ED3F
#3 0x7F580768ECC9
#4 0x7F58076920D7
#5 0x7F58076CB393
#6 0x7F58076D766D
#0 0x7F4D387D27D7
#1 0x7F4D387D2DDE
#2 0x7F4D38124D3F
#3 0x7F4D38124CC9
#4 0x7F4D381280D7
#5 0x7F4D38161393
#0 #6 0x7F4D3816D66D
0x7F265643B7D7
#1 0x7F265643BDDE
#2 0x7F2655D8DD3F
#3 0x7F2655D8DCC9
#4 0x7F2655D910D7
#5 0x7F2655DCA393
#6 0x7F2655DD666D
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
#0 0x7FC9DA96B7D7
#1 0x7FC9DA96BDDE
#2 0x7FC9DA2BDD3F
#3 0x7FC9DA2BDCC9
#4 0x7FC9DA2C10D7
#5 0x7FC9DA2FA393
#6 0x7FC9DA30666D
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
#7 0x42F659 in exchangedata_ at solver.f90:1542 (discriminator 1)
#8 0x42EFFB in processgrid_ at solver.f90:431
#9 0x436CF0 in MAIN__ at solver.f90:97
I tried searching for similar errors on this site with the '(discriminator 1)' part, but couldn't find any. I also searched for cases where MPI produces double-free memory corruption error and again to no avail.
I must also point out that the line 1542 in the error message corresponds to the blocking MPI_SEND call in my code.
The above error popped when I was using gfortran 4.8.2 with ompi 1.6.5. However, I also tried running the above code with the Intel fortran compiler and received a curious error message:
[21] trying to free memory block that is currently involved to uncompleted data transfer operation
I searched the above error on net and got almost nothing. :( So that was a dead end as well. The full error message is a bit too long, but below is a part of it:
*** glibc detected *** ./a.out: munmap_chunk(): invalid pointer: 0x0000000001c400a0 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000001c40410 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000000a67790 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x0000000000a67790 ***
*** glibc detected *** ./a.out: free(): invalid next size (normal): 0x0000000000d28c80 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x00000000015354b0 ***
*** glibc detected *** ./a.out: malloc(): memory corruption: 0x00000000015354b0 ***
*** glibc detected *** ./a.out: free(): invalid next size (normal): 0x0000000000f51520 ***
[20] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x26bd800 len=3966637480
RTC entry - addr=0x26a9e70 len=148800 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1397: 0
internal ABORT - process 20
[21] trying to free memory block that is currently involved to uncompleted data transfer operation
free mem - addr=0x951e90 len=2282431520
RTC entry - addr=0x93e160 len=148752 cnt=1
Assertion failed in file ../../i_rtc_cache.c at line 1397: 0
internal ABORT - process 21
If my error is some careless one or borne from insufficient knowledge, the above details might suffice. If it is a deeper issue, then I'll gladly elaborate further.
Thanks in advance!
Though the question attracted useful comments, I believe that posting an answer on how the suggestion helped me solve the problem might turn out to be useful for those who may stumble on this post in the future with the same issue.
As pointed out, non-blocking MPI calls with non-contiguous arrays in Fortran - bad idea
I used the idea of copying the non-contiguous array into a contiguous one and using that instead. However, with blocking calls, non-contiguous arrays seem to be behaving well. Since I was using blocking MPI_SEND and non-blocking MPI_IRECV, the code makes only one copy - for just receiving, and continues to send data non-contiguously as before. This seems to be working for now, but if it can cause any hiccups later on, please warn me in the comments.
It does add a lot of repeating lines of code ( ruining the aesthetics :P ). That is mainly because the limits for sending/receiving are not same for all the 6 faces. So the arrays for temporarily storing the data to be received have to allocated (and copied) individually for each of the six faces.
subroutine exchangeData(dCube)
use someModule
implicit none
integer :: i
integer, dimension(6) :: fRqLst
integer :: stLst(MPI_STATUS_SIZE, 6)
double precision, intent(inout), dimension(xS:xE, yS:yE, zS:zE) :: dCube
double precision, allocatable, dimension(:,:,:) :: fx0, fx1, fy0, fy1, fz0, fz1
allocate(fx0(lrArr(1, 1):lrArr(2, 1), lrArr(3, 1):lrArr(4, 1), lrArr(5, 1):lrArr(6, 1)))
allocate(fx1(lrArr(1, 2):lrArr(2, 2), lrArr(3, 2):lrArr(4, 2), lrArr(5, 2):lrArr(6, 2)))
allocate(fy0(lrArr(1, 3):lrArr(2, 3), lrArr(3, 3):lrArr(4, 3), lrArr(5, 3):lrArr(6, 3)))
allocate(fy1(lrArr(1, 4):lrArr(2, 4), lrArr(3, 4):lrArr(4, 4), lrArr(5, 4):lrArr(6, 4)))
allocate(fz0(lrArr(1, 5):lrArr(2, 5), lrArr(3, 5):lrArr(4, 5), lrArr(5, 5):lrArr(6, 5)))
allocate(fz1(lrArr(1, 6):lrArr(2, 6), lrArr(3, 6):lrArr(4, 6), lrArr(5, 6):lrArr(6, 6)))
fRqLst = MPI_REQUEST_NULL
call MPI_IRECV(fx0, lrArr(7, 1), MPI_DOUBLE_PRECISION, rArr(1), trArr(1), MPI_COMM_WORLD, fRqLst(1), ierr)
call MPI_IRECV(fx1, lrArr(7, 2), MPI_DOUBLE_PRECISION, rArr(2), trArr(2), MPI_COMM_WORLD, fRqLst(2), ierr)
call MPI_IRECV(fy0, lrArr(7, 3), MPI_DOUBLE_PRECISION, rArr(3), trArr(3), MPI_COMM_WORLD, fRqLst(3), ierr)
call MPI_IRECV(fy1, lrArr(7, 4), MPI_DOUBLE_PRECISION, rArr(4), trArr(4), MPI_COMM_WORLD, fRqLst(4), ierr)
call MPI_IRECV(fz0, lrArr(7, 5), MPI_DOUBLE_PRECISION, rArr(5), trArr(5), MPI_COMM_WORLD, fRqLst(5), ierr)
call MPI_IRECV(fz1, lrArr(7, 6), MPI_DOUBLE_PRECISION, rArr(6), trArr(6), MPI_COMM_WORLD, fRqLst(6), ierr)
do i = 1, 6
call MPI_SEND(dCube(lsArr(1, i):lsArr(2, i), lsArr(3, i):lsArr(4, i), lsArr(5, i):lsArr(6, i)), &
lsArr(7, i), MPI_DOUBLE_PRECISION, rArr(i), tsArr(i), MPI_COMM_WORLD, ierr)
end do
call MPI_WAITALL(6, fRqLst, stLst, ierr)
dCube(lrArr(1, 1):lrArr(2, 1), lrArr(3, 1):lrArr(4, 1), lrArr(5, 1):lrArr(6, 1)) = fx0
dCube(lrArr(1, 2):lrArr(2, 2), lrArr(3, 2):lrArr(4, 2), lrArr(5, 2):lrArr(6, 2)) = fx1
dCube(lrArr(1, 3):lrArr(2, 3), lrArr(3, 3):lrArr(4, 3), lrArr(5, 3):lrArr(6, 3)) = fy0
dCube(lrArr(1, 4):lrArr(2, 4), lrArr(3, 4):lrArr(4, 4), lrArr(5, 4):lrArr(6, 4)) = fy1
dCube(lrArr(1, 5):lrArr(2, 5), lrArr(3, 5):lrArr(4, 5), lrArr(5, 5):lrArr(6, 5)) = fz0
dCube(lrArr(1, 6):lrArr(2, 6), lrArr(3, 6):lrArr(4, 6), lrArr(5, 6):lrArr(6, 6)) = fz1
deallocate(fx0, fx1, fy0, fy1, fz0, fz1)
end subroutine exchangeData
This partially nullifies the advantage I sought to gain by storing ranks and tags in arrays. I did that mainly to put the sending and receiving calls in a loop. With this fix, only the send calls can be put in loop.
Since allocating and deallocating in each call of the subroutine can waste time, the arrays can be put in a module and allocated at the beginning of the code. The limits don't change in each call.
When the same method is applied for corners and edges as well, it does bloat the code a bit, but it seems to be working. :)
Thanks for the comments.
Related
I am making a module in Fortran 90 to run PARPACK on a given matrix. I have an existing ARPACK code which functions normally as expected. I tried converting it into PARPACK and it runs into memory clear errors. I am fairly new to coding and fortran, please excuse any blunders I've made.
The code:
!ARPACK module
module parpack
implicit none
contains
subroutine parp
! use mpi
include '/usr/lib/x86_64-linux-gnu/openmpi/include/mpif.h'
integer comm, myid, nprocs, rc, nloc, status(MPI_STATUS_SIZE)
integer, parameter :: pres=8
integer nev, ncv, maxn, maxnev, maxncv
parameter (maxn=10**7, maxnev=maxn-1, maxncv=maxn)
! Arrays for SNAUPD
integer iparam(11), ipntr(14)
logical, allocatable :: select(:)
real(kind=pres), allocatable :: workd(:), workl(:), worktmp1(:), worktmp2(:)
! Scalars for SNAUPD
character bmat*1, which*2
integer ido, n, info, ierr, ldv
integer i, j, ishfts, maxitr, mode1, nconv
integer(kind=pres) lworkl
real(kind=pres) tol
! Arrays for SNEUPD
real(kind=pres), allocatable :: d(:,:), resid(:), v(:,:), workev(:), z(:,:)
! Scalars for SNEUPD
logical rvec, first
real sigmar, sigmai
!==============================================
real(kind=pres), allocatable :: mat(:,:)
open (11, file = 'matrix.dat', status = 'old')
read (11,*) n
!=============================================
! Dimension of the problem
nev = n/10
ncv = nev+2
ldv = n
bmat = 'I'
which = 'LM'
! Additional environment variables
ido = 0
tol = 0.0E+0
info = 0
lworkl = 3*ncv**2+6*ncv
! Algorithm Mode specifications:
ishfts = 1
maxitr = 300
mode1 = 1
iparam(1) = ishfts
iparam(3) = maxitr
iparam(7) = mode1
! Distribution to nodes
!=============================================
! Matrix allocation
allocate (mat(n,n))
! PDNAUPD
allocate (workd(5*n))
allocate (workl(lworkl))
allocate (resid(n))
allocate (worktmp1(n))
allocate (worktmp2(n))
! PDNEUPD
allocate (d(n,3))
allocate (v(ldv,ncv))
allocate (workev(3*n))
allocate (z(ldv,ncv))
allocate (select(ncv))
!===========================================
! Read Matrix from the provided file
mat = 0
read(11,*) mat
mat = transpose(mat)
!===========================================
! MPI Calling
call MPI_INIT(ierr)
comm = MPI_COMM_WORLD
call MPI_COMM_RANK(comm, myid, ierr)
call MPI_COMM_SIZE(comm, nprocs, ierr)
nloc = n/nprocs
! if ( mod(n, nprocs) .gt. myid ) nloc = nloc + n
!===============================================
20 continue
call pdnaupd(comm, ido, bmat, nloc, which, nev, tol, resid, ncv, v, ldv, iparam, ipntr, workd, workl, lworkl, info) !Top level solver
call MPI_BARRIER(comm,ierr)
print *, ido, info, iparam(5) !for testing
!===============================================
if (ido .eq. -1 .or. ido .eq. 1) then
worktmp1 = 0
if (myid .ne. 0) then !It is slave
call MPI_SEND(workd(ipntr(1)), nloc, MPI_DOUBLE_PRECISION, 0, 0, comm, ierr)
else !It is host
worktmp1(1:nloc) = workd(ipntr(1):ipntr(1)+nloc-1)
i = nprocs
if (i .gt. 1) then
do i=1,nprocs-1
call MPI_RECV(worktmp1(i*nloc+1), nloc, MPI_DOUBLE_PRECISION, i, 0, comm, status, ierr)
end do
endif
endif
call MPI_BARRIER(comm,ierr)
if (myid .eq. 0) then !It is host
! Matrix multiplication
worktmp2 = 0
call matmultiply(n, mat, worktmp1, worktmp2)
workd(ipntr(2):ipntr(2)+nloc-1) = worktmp2(1:nloc)
i = nprocs
if (i .gt. 1) then
do i=1,nprocs-1
call MPI_SEND(worktmp2(i*nloc+1), nloc, MPI_DOUBLE_PRECISION, i, 100*i, comm, ierr)
end do
endif
else !It is slave
call MPI_RECV(workd(ipntr(2)), nloc, MPI_DOUBLE_PRECISION, 0, 100*myid, comm, status, ierr)
endif
go to 20
! call matmultiply(n, mat, workd(ipntr(1):ipntr(1)+n-1), workd(ipntr(2):ipntr(2)+n-1))
! go to 20
endif
! print *, info !for testing
!===============================================================
! Post-processing for eigenvalues
rvec = .true.
if (myid .eq. 0) then
call pdneupd ( comm, rvec, 'A', select, d, d(1,2), z, ldv, sigmar, sigmai, &
workev, bmat, n, which, nev, tol, resid, ncv, v, ldv, iparam, ipntr, &
workd, workl, lworkl, info)
endif
! print *, info !for testing
close(11)
call MPI_FINALIZE(ierr)
return
end subroutine
!==============================================================================================
! Additional Function definitions
subroutine matmultiply(n, mat, v, w)
integer n, i, j
integer, parameter :: pres=8
real(kind = pres) mat(n,n), temp(n)
real(kind = pres) v(n), w(n)
temp = 0
do j = 1,n
do i = 1,n
temp(j) = temp(j) + mat(i,j)*v(i)
end do
end do
w = temp
return
end subroutine
end module
I apologize for the ton of redundant lines and comments, I am yet to clean it up for finalization.
When I run the code on a single thread with ./a.out, I get the following output:
Invalid MIT-MAGIC-COOKIE-1 key 1 0 1629760560
1 0 1629760560
1 0 1629760560
1 0 1629760560
.
.
. <A long chain as the code is exhausting all iterations>
.<first of the numbers is ido, which starts with 1 instead of -1 for some reason, second being
.info and third being iparam(5) which is a random number until the final iteration>
.
99 1 1
munmap_chunk(): invalid pointer
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x7f5a863d0d01 in ???
#1 0x7f5a863cfed5 in ???
#2 0x7f5a8620420f in ???
#3 0x7f5a8620418b in ???
#4 0x7f5a861e3858 in ???
#5 0x7f5a8624e3ed in ???
#6 0x7f5a8625647b in ???
#7 0x7f5a862566cb in ???
#8 0x560f05ac1819 in ???
#9 0x560f05abd7bc in checker
at /home/srivatsank/Desktop/fortran/lap_vs_arp/ptest/ptest.f90:45
#10 0x560f05abd8d9 in main
at /home/srivatsank/Desktop/fortran/lap_vs_arp/ptest/ptest.f90:3
Aborted (core dumped)
line 45 in ptest is call parp
line 3 in ptest is use parpack(name of the module)
The main code is as follows:
program checker
use parpack
use arpack
! use lapack
implicit none
!Program to test LAPACK and ARPACK
! 1. Variable definition
integer a,n,i
real, allocatable :: mat(:,:)
real t0, t1
a=2
! Loop
! do 20 a = 1,3
! Open File
open(unit=10, file = 'matrix.dat', status = 'replace')
! 2. Generate Symmetric matrices
n = 10**a
allocate (mat(n,n))
call RANDOM_NUMBER(mat)
! 3. Save symmetric matrices to r.dat
write (10,*) n
do 30 i=1,n
write(10,*) mat(i,:)
30 end do
deallocate(mat)
close(10)
! 4. Test time taken by each of the routines
! call cpu_time(t0)
! call arp
! call cpu_time(t1)
! print *, 'n:', n, 'ARPACK time taken:', t1-t0
call cpu_time(t0)
call parp
call cpu_time(t1)
print *, 'n:', n, 'PARPACK time taken:', t1-t0
!20 end do
end program checker
The memory error occurs at the very end of the subroutine, when the mail program tries to exit from the subroutine. I have verified this by printing statements as the last line in the subroutine.
And on running mpirun -np 4 a.out, the code just enters the pdneupd process and sits there for eternity. Could anyone help?
I have the following code:
real :: s_s, d_s, s_r(size), d_r(size)
integer :: k, k_r(size)
! - size = number of processors
! - Do something to initialise s_s, d_s, k
write(*,*) "SENDING >>>>"
write(*,*) s_s, d_s
call MPI_Allgather( s_s, 1, MPI_REAL,
& s_r, 1, MPI_REAL, MPI_COMM_PGM, mpi_err)
call MPI_Allgather( d_s, 1, MPI_REAL,
& d_r, 1, MPI_REAL, MPI_COMM_PGM, mpi_err)
call MPI_Allgather ( k, 1, MPI_INTEGER,
& k_r, 1, MPI_INTEGER, MPI_COMM_PGM, mpi_err)
write(*,*) "RECEIVED <<<<"
write(*,*) s_r, d_r, kr
This generates the following output:
SENDING >>>>
-1803.80339864908 0.616157856320407
RECEIVED <<<<
6.953077622513053E-310 3.565412685916647E-314 1.221334434576037E-314
1.498827614035474E-314 6.952991536467244E-310 6.953288052096687E-310
6.953108563966064E-310 2.350861403096908E-314 4 1
2 3
kr is being gathered correctly however, s_r and d_r seem to be receiving junk. Could this be because of the MPI datatypes? I tried with MPI_REAL MPI_REAL8 and MPI_DOUBLE but that didn't work. Furthermore, mpi_err = MPI_SUCCESS
What could I do to resolve this?
EDIT 1
I worked on the following prototype program:
program allgather
implicit none
include "mpif.h"
real a(4)
integer rank,size,ierr
real as(4)
real ar(16)
integer i, j, k,z
a=1
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
if(size.ne.4)then
write(*,*)'Error!:# of processors must be equal to 4'
write(*,*)'Programm aborting....'
call MPI_ABORT(ierr)
endif
do k=1,4
if ( rank == (mod(k, size))) then
a(k) = k
else
a(k) = 0.0
endif
enddo
write(*,*) "Rank :", rank
write(*,*) a
call MPI_Allgather(a, 4, MPI_REAL, ar,
& 4,
& MPI_REAL, MPI_COMM_WORLD, ierr)
write(*,*) "Recieved array"
write(*,*) ar
do i = 1, 16
if ( ar(i) /= 0.0 ) then
z = mod(i, size)
if ( z == 0 ) then
a( size ) = ar(i)
else
a ( z ) = ar(i)
endif
endif
enddo
write(*,*) "---------"
write(*,*) a
write(*,*) "---------"
call MPI_FINALIZE(ierr)
end
And this generates the expected results i.e. ar doesn't gather junk. I'm unable to however tell the difference between the implementations.
It turns out that for the project, the data type to be used was MPI_FLT. It is strange that MPI_FLT works and not MPI_REALx where x=4,8 also not MPI_FLOAT. I grep-ed MPI_FLT in the project to see what it is defined as but didn't turn up anywhere in the project.
The OpenMPI version I'm using is:
$ mpirun --version
mpirun (Open MPI) 3.0.0
The compiler I use is:
$ mpifort --version
ifort (IFORT) 19.0.1.144 20181018
In a future edit I will elaborate on the cause.
I have problem with combining several 2D arrays into one big 2D array using MPI in Fortran.
I have equal size 2D arrays containing real numbers, every array is contained in different process:
numerical(subdsize,nt)
I want to combine them to one big array
numerical_final(nx,nt)
I am using the following command
CALL MPI_Gather(numerical(1:subdsize,nt),subdsize*nt,MPI_DOUBLE_PRECISION,numerical_final,subdsize*nt,MPI_DOUBLE_PRECISION,0, MPI_COMM_WORLD, mpierror)
Unfortunately the data that numerical_final array contains are a complete mess. I was looking for solutions really everywhere. I read this topic but it did not help me:
sending blocks of 2D array in C using MPI
I am using Intel Fortran 2018 compiler and Ubuntu 16.04.
Full code below.
I will be grateful for the help.
PROGRAM Advection
IMPLICIT NONE
INCLUDE 'mpif.h'
INTEGER :: nt,nx,i,steptime,tag,j
DOUBLE PRECISION :: R_dx, R_dt, R_c, R_cfl, R_t
DOUBLE PRECISION, DIMENSION(3) :: R_input
DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:) :: xcoord
DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: numerical, numerical_final
DOUBLE PRECISION :: time_begin,time_end,time_elapsed
INTEGER:: myrank,nproc,mpierror,xdomains,subdsize
INTEGER:: status(MPI_STATUS_SIZE)
CALL MPI_Init(mpierror)
CALL MPI_Comm_size(MPI_COMM_WORLD,nproc,mpierror)
CALL MPI_Comm_rank(MPI_COMM_WORLD,myrank,mpierror)
IF (nproc<2) THEN
PRINT*, "Error, only more than 1"
CALL MPI_ABORT
END IF
IF (myrank .EQ. 0) THEN
OPEN(UNIT = 1, FILE = 'inputdata.dat')
READ(1,*) R_input(1)
READ(1,*) R_input(2)
READ(1,*) R_input(3)
READ(1,*) nx
CLOSE(1)
END IF
CALL MPI_Bcast(R_input, 3, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, mpierror)
CALL MPI_Bcast(nx, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, mpierror)
R_c=R_input(1)
R_cfl=R_input(2)
R_t=R_input(3)
R_dx=80./(nx-1)
nt=15
R_dt=R_t/(nt-1)
IF (myrank .EQ. 0) THEN
PRINT*, R_c*R_dt/R_dx
END IF
xdomains = nproc
IF ((MOD(nx,xdomains))==0) THEN
subdsize =nx/xdomains
ELSE
DO
nx=nx+1
IF ((MOD(nx,xdomains)) .EQ. 0) THEN
subdsize=nx/xdomains
EXIT
END IF
END DO
END IF
RAYS
ALLOCATE(xcoord(0:subdsize+1))
ALLOCATE(numerical(0:subdsize+1,nt))
DO i=0,subdsize+1
xcoord(i) = -40.-R_dx+i*R_dx+myrank*R_dx*subdsize
END DO
DO i = 0,subdsize+1
numerical(i,1)=0.5*(sign(1.,xcoord(i))+1.0)
END DO
IF (myrank .EQ. 0) THEN
DO i=1,nt
numerical(0:1,i)=0.
END DO
END IF
IF (myrank .EQ. nproc-1) THEN
DO i=1,nt
numerical(subdsize:subdsize+1,i)=1.
END DO
END IF
DO steptime=1, nt-1
tag = 1
IF (myrank .LT. nproc-1) THEN
CALL MPI_Send (numerical(subdsize,steptime),1,MPI_DOUBLE_PRECISION,myrank+1,tag,MPI_COMM_WORLD,mpierror)
END IF
IF (myrank .GT. 0) THEN
CALL MPI_Recv (numerical(0,steptime),1,MPI_DOUBLE_PRECISION,myrank-1,tag,MPI_COMM_WORLD,status,mpierror )
END IF
IF (myrank .EQ. 0) THEN
DO i = 2, subdsize+1
numerical(i,steptime+1)=numerical(i,steptime)-R_c*R_dt/R_dx*(numerical(i,steptime)-numerical(i-1,steptime))
END DO
ELSE
DO i = 1, subdsize+1
numerical(i,steptime+1)=numerical(i,steptime)-R_c*R_dt/R_dx*(numerical(i,steptime)-numerical(i-1,steptime))
END DO
END IF
END DO
ALLOCATE(numerical_final(nx,nt))
CALL MPI_Gather(numerical(1:subdsize,nt),subdsize*nt,MPI_DOUBLE_PRECISION,numerical_final,subdsize*nt,MPI_DOUBLE_PRECISION,0, MPI_COMM_WORLD, mpierror)
CALL MPI_Finalize(mpierror)
DEALLOCATE (numerical,numerical_final)
END PROGRAM
And inputfile
1.5 !c
0.5 !Courant
5.0 !time
100 !x points
I am rewriting a numerical simulation code that is parallelized using MPI in one direction.
So far, the arrays containing the data were saved by the master MPI process, which implied transferring the data from all MPI processes to one and allocate huge arrays to store the whole thing. It is not very efficient nor classy, and is a problem for large resolutions.
I am therefore trying to use MPI-IO to write directly the file from the distributed arrays. One of the constraint I have is that the written file needs to respect the fortran "unformatted" format (i.e. 4 bytes integer before and after each field indicating its size).
I wrote a simple test program, that works when I write only one distributed array to the file. However, when I write several arrays, the total size of the file is wrong and when comparing to the equivalent fortran 'unformatted' file, the files are different.
Here is the sample code :
module arrays_dim
implicit none
INTEGER, PARAMETER :: dp = kind(0.d0)
integer, parameter :: imax = 500
integer, parameter :: jmax = 50
integer, parameter :: kmax = 10
end module arrays_dim
module mpi_vars
use mpi
implicit none
integer, save :: ierr, myID, numprocs
integer, save :: i_start, i_end, i_mean, i_loc
integer, save :: subArray, fileH
integer(MPI_OFFSET_KIND), save :: offset, currPos
end module mpi_vars
program test
use mpi
use arrays_dim
use mpi_vars
real(dp), dimension(0:imax,0:jmax+1,0:kmax+1) :: v, w
real(dp), dimension(:,:,:), allocatable :: v_loc, w_loc
integer :: i, j, k
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myID, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
i_mean = (imax+1)/numprocs
i_start = myID*i_mean
i_end = i_start+i_mean-1
if(i_mean*numprocs<imax+1) then
if(myID == numprocs-1) i_end = imax
endif
i_loc = i_end - i_start + 1
allocate(v_loc(i_start:i_end,0:jmax+1,0:kmax+1))
allocate(w_loc(i_start:i_end,0:jmax+1,0:kmax+1))
print*, 'I am:', myID, i_start, i_end, i_loc
do k=0,kmax+1
do j=0,jmax+1
do i=0,imax
v(i,j,k) = i+j+k
w(i,j,k) = i*j*k
enddo
enddo
enddo
if(myID==0) then
open(10,form='unformatted')
write(10) v
!write(10) w
close(10)
endif
do k=0,kmax+1
do j=0,jmax+1
do i=i_start,i_end
v_loc(i,j,k) = i+j+k
w_loc(i,j,k) = i*j*k
enddo
enddo
enddo
call MPI_Type_create_subarray (3, [imax+1, jmax+2, kmax+2], [i_loc, jmax+2, kmax+2], &
[i_start, 0, 0], &
MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, subArray, ierr)
call MPI_Type_commit(subArray, ierr)
call MPI_File_open(MPI_COMM_WORLD, 'mpi.dat', &
MPI_MODE_WRONLY + MPI_MODE_CREATE + MPI_MODE_APPEND, &
MPI_INFO_NULL, fileH, ierr )
call saveMPI(v_loc, (i_loc)*(jmax+2)*(kmax+2))
!call saveMPI(w_loc, (i_loc)*(jmax+2)*(kmax+2))
call MPI_File_close(fileH, ierr)
deallocate(v_loc,w_loc)
call MPI_FINALIZE(ierr)
end program test
!
subroutine saveMPI(array, n)
use mpi
use arrays_dim
use mpi_vars
implicit none
real(dp), dimension(n) :: array
integer :: n
offset = (imax+1)*(jmax+2)*(kmax+2)*8
if(myID==0) then
call MPI_File_seek(fileH, int(0,MPI_OFFSET_KIND), MPI_SEEK_CUR, ierr)
call MPI_File_write(fileH, [(imax+1)*(jmax+2)*(kmax+2)*8], 1, MPI_INTEGER, MPI_STATUS_IGNORE, ierr)
call MPI_File_seek(fileH, offset, MPI_SEEK_CUR, ierr)
call MPI_File_write(fileH, [(imax+1)*(jmax+2)*(kmax+2)*8], 1, MPI_INTEGER, MPI_STATUS_IGNORE, ierr)
endif
call MPI_File_set_view(fileH, int(4,MPI_OFFSET_KIND), MPI_DOUBLE_PRECISION, subArray, 'native', MPI_INFO_NULL, ierr)
call MPI_File_write_all(fileH, array, (i_loc)*(jmax+2)*(kmax+2), MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
end subroutine saveMPI
when the lines !write(10) w and !call saveMPI(w_loc, (i_loc)*(jmax+2)*(kmax+2)) are commented (i.e. I only write the v array), the code is working fine :
mpif90.openmpi -O3 -o prog main.f90
mpirun.openmpi -np 4 ./prog
cmp mpi.dat fort.10
cmp does not generate an output, so the files are identical.
If however I uncomment these lines, then the resulting files (mpi.dat and fort.10) are different. I am sure that the problem lies in the way I define the offset I use to write the data at the right position on the file, but I do not know how to indicate to the second call of saveMPI that the initial position should be the end of the file. What am I missing ?
Only the first call to saveMPI is working as you expect it to. Everything get messed up from the second call up. Here are few indications of what is happening:
MPI_File_set_view resets the independent file pointers and the shared file pointer to zero. See MPI_File_set_view for more details. So you are actually overwriting v data with w data when you call MPI_File_set_view in saveMPI.
with MPI_File_write, the data is written into those parts of the file specified by the current view. This mean that the way you are adding the size information into the file, is not really compatible with the view previously set for v.
calling MPI_File_seek with MPI_SEEK_CUR set the position relative to the current position of the individual pointer. So, for the second call, it is relative to the individual pointer of process 0
I do not use parallel IO that much, so I can not help more that this unless I step into the docs, which I do not have time to. The hint I can give is to:
add an additional parameter to saveMPI that will contain the absolute displacement of the data to write; this can be an [in out] arg. For the first call, it will be zero and for subsequent calls, it will be the size of all data already written to file, including the size information. It can be updated in saveMPI.
before writing the size information (by process 0) call MPI_File_set_view to reset the view to linear byte stream as originally given by MPI_File_open. This can be done by setting the etype and filetype to both MPI_BYTE in calling MPI_File_set_view. look into the doc of MPI_File_open for more information. You will then have to calls to MPI_File_set_view in saveMPI.
Your saveMPI subroutine could look like
subroutine saveMPI(array, n, disp)
use mpi
use arrays_dim
use mpi_vars
implicit none
real(dp), dimension(n) :: array
integer :: n, disp
offset = (imax+1)*(jmax+2)*(kmax+2)*8
call MPI_File_set_view(fileH, int(disp,MPI_OFFSET_KIND), MPI_BYTE, MPI_BYTE, 'native', MPI_INFO_NULL, ierr)
if(myID==0) then
call MPI_File_seek(fileH, int(0,MPI_OFFSET_KIND), MPI_SEEK_END, ierr)
call MPI_File_write(fileH, [(imax+1)*(jmax+2)*(kmax+2)*8], 1, MPI_INTEGER, MPI_STATUS_IGNORE, ierr)
call MPI_File_seek(fileH, int(offset,MPI_OFFSET_KIND), MPI_SEEK_CUR, ierr)
call MPI_File_write(fileH, [(imax+1)*(jmax+2)*(kmax+2)*8], 1, MPI_INTEGER, MPI_STATUS_IGNORE, ierr)
endif
call MPI_File_set_view(fileH, int(disp+4,MPI_OFFSET_KIND), MPI_DOUBLE_PRECISION, subArray, 'native', MPI_INFO_NULL, ierr)
call MPI_File_write_all(fileH, array, (i_loc)*(jmax+2)*(kmax+2), MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, ierr)
disp = disp+offset+8
end subroutine saveMPI
and called like:
disp = 0
call saveMPI(v_loc, (i_loc)*(jmax+2)*(kmax+2), disp)
call saveMPI(w_loc, (i_loc)*(jmax+2)*(kmax+2), disp)
Finally, make sure that you delete the file between two calls because you are using MPI_MODE_APPEND.
During the realization of the course work I have to write MPI program to solve PDE continuum mechanics. (FORTRAN)
In parallel program a big computational 3D domain (grid MxNxL) is shared between processes. Every process has it's own subdomain for computation(grid NXxNYxNZ = (M/P1)x(N/P2)x(L/P3) , P1*P2*P3 = P - number of processes). After every computation step I have to exchange two border layers between neighbor processes or (it is not important) in case if I don't have a neighbor towards I have to process the border conditions. I have wrote the following implementation, but I dont know how to get rid of extra data shuffling. I want to accelerate my code as possible.
if (XB2 /= MPI_PROC_NULL) then ! If process has right-X neighbor process
call CopyToSb(NX-1,NX,1,NY,1,NZ,XPkgSize)
else
call CopyXLevel(NX,NX+1,-1) ! BORDER PROCESSING (NOT IMPORTANT)
end if
call MPI_SENDRECV_REPLACE(BPtmp, 7*XPkgSize, MPI_REAL4, XB2, 1, XB1, 1, COMM_CART, MPI_STATUS_IGNORE, ierr)
if (XB1 /= MPI_PROC_NULL) then ! If process has left-X neighbor process
call CopyFromSb(-1,0,1,NY,1,NZ,XPkg,XPkgSize)
call CopyToSb(1,2,1,NY,1,NZ,XPkgSize)
else
call CopyXLevel(1,0,1) ! BORDER PROCESSING (NOT IMPORTANT)
end if
call MPI_SENDRECV_REPLACE(BPtmp, 7*XPkgSize, MPI_REAL4, XB1, 2, XB2, 2, COMM_CART, MPI_STATUS_IGNORE, ierr)
if (XB2 /= MPI_PROC_NULL) call CopyFromSb(NX+1,NX+2,1,NY,1,NZ,XPkg,XPkgSize) ! If process has right-X neighbor process
if(iam /= YB1) then ! Because we have cyclic grid on Y direction
if (YB2 /= MPI_PROC_NULL) call CopyToSb(1,NX,NY-1,NY,1,NZ,YPkgSize) ! If process has right-Y neighbor process
call MPI_SENDRECV_REPLACE(BPtmp, 7*YPkgSize, MPI_REAL4, YB2, 3, YB1, 3, COMM_CART, MPI_STATUS_IGNORE, ierr)
if (YB1 /= MPI_PROC_NULL) then ! If process has left-Y neighbor process
call CopyFromSb(1,NX,-1,0,1,NZ,YPkg,YPkgSize)
call CopyToSb(1,NX,1,2,1,NZ,YPkgSize)
end if
call MPI_SENDRECV_REPLACE(BPtmp, 7*YPkgSize, MPI_REAL4, YB1, 4, YB2, 4, COMM_CART, MPI_STATUS_IGNORE, ierr)
if (YB2 /= MPI_PROC_NULL) call CopyFromSb(1,NX,NY+1,NY+2,1,NZ,YPkg,YPkgSize) ! If process has right-Y neighbor process
else
call CopyYLevel(1,NY+1,1) ! BORDER PROCESSING (NOT IMPORTANT)
call CopyYLevel(NY,0,1) ! BORDER PROCESSING (NOT IMPORTANT)
call CopyYLevel(2,NY+2,1) ! BORDER PROCESSING (NOT IMPORTANT)
call CopyYLevel(NY-1,-1,1) ! BORDER PROCESSING (NOT IMPORTANT)
end if
if (ZB2 /= MPI_PROC_NULL) then
call CopyToSb(1,NX,1,NY,NZ-1,NZ,ZPkgSize)
else
call CopyZLevel(NZ,NZ+1,-1) ! BORDER PROCESSING (NOT IMPORTANT)
end if
call MPI_SENDRECV_REPLACE(BPtmp, 7*ZPkgSize, MPI_REAL4, ZB2, 5, ZB1, 5, COMM_CART, MPI_STATUS_IGNORE, ierr)
if (ZB1 /= MPI_PROC_NULL) then
call CopyFromSb(1,NX,1,NY,-1,0,ZPkg,ZPkgSize)
call CopyToSb(1,NX,1,NY,1,2,ZPkgSize)
else
call CopyZLevel(1,0,-1) ! BORDER PROCESSING (NOT IMPORTANT)
end if
call MPI_SENDRECV_REPLACE(BPtmp, 7*ZPkgSize, MPI_REAL4, ZB1, 6, ZB2, 6, COMM_CART, MPI_STATUS_IGNORE, ierr)
if (ZB2 /= MPI_PROC_NULL) call CopyFromSb(1,NX,1,NY,NZ+1,NZ+2,ZPkg,ZPkgSize)
contains
! This is copyToSendBuffer and copyFromSendBuffer functions
! I think it is most problem part of my code.
subroutine CopyToSb(x1,x2,y1,y2,z1,z2,PkgSize)
integer :: x1,x2,y1,y2,z1,z2,PkgSize
BPtmp(1:PkgSize) = RESHAPE(R(x1:x2,y1:y2,z1:z2),[PkgSize])
BPtmp(PkgSize+1:2*PkgSize) = RESHAPE(U(x1:x2,y1:y2,z1:z2),[PkgSize])
BPtmp(2*PkgSize+1:3*PkgSize) = RESHAPE(V(x1:x2,y1:y2,z1:z2),[PkgSize])
BPtmp(3*PkgSize+1:4*PkgSize) = RESHAPE(W(x1:x2,y1:y2,z1:z2),[PkgSize])
BPtmp(4*PkgSize+1:5*PkgSize) = RESHAPE(P(x1:x2,y1:y2,z1:z2),[PkgSize])
BPtmp(5*PkgSize+1:6*PkgSize) = RESHAPE(H(x1:x2,y1:y2,z1:z2),[PkgSize])
BPtmp(6*PkgSize+1:7*PkgSize) = RESHAPE(S(x1:x2,y1:y2,z1:z2),[PkgSize])
end subroutine CopyToSb
subroutine CopyFromSb(x1,x2,y1,y2,z1,z2,Pkg,PkgSize)
integer, dimension(3) :: Pkg
integer :: x1,x2,y1,y2,z1,z2,PkgSize
R(x1:x2,y1:y2,z1:z2) = RESHAPE(BPtmp(1:PkgSize),Pkg)
U(x1:x2,y1:y2,z1:z2) = RESHAPE(BPtmp(PkgSize+1:2*PkgSize),Pkg)
V(x1:x2,y1:y2,z1:z2) = RESHAPE(BPtmp(2*PkgSize+1:3*PkgSize),Pkg)
W(x1:x2,y1:y2,z1:z2) = RESHAPE(BPtmp(3*PkgSize+1:4*PkgSize),Pkg)
P(x1:x2,y1:y2,z1:z2) = RESHAPE(BPtmp(4*PkgSize+1:5*PkgSize),Pkg)
H(x1:x2,y1:y2,z1:z2) = RESHAPE(BPtmp(5*PkgSize+1:6*PkgSize),Pkg)
S(x1:x2,y1:y2,z1:z2) = RESHAPE(BPtmp(6*PkgSize+1:7*PkgSize),Pkg)
end subroutine CopyFromSb
end
The problem is that I have to excahnge slices of array like 2*NY*NZ or NX*2*NZ or NX*NY*2 but in fact I have (-1:NX+2, -1:NY+2, -1:NZ+2) arrays. It is noncontiguous problem. I know what MPI derived datatypes is, but I need some advices(examples) on my way.