I'm parallelizing a Fortran 90 program using MPI and I get some truly bizarre behavior. I have an array ia of length nn+1, which I'm sending in chunks from process 0 to processes 1,...,ntasks-1. Each process also has a list proc_start, which gives the starting position in ia for every process, and a list pts_per_proc, which gives the number of points each process owns. The following code works:
if (me == 0) then
   print *, 'Eat my shorts'
else
   allocate( ia(pts_per_proc(me+1)+1) )
endif
! If this is the boss process, send the array ia,
if (me == 0) then
   do n=1,ntasks-1
      call mpi_send(ia(proc_start(n+1)),pts_per_proc(n+1)+1, &
         & mpi_integer,n,n,mpi_comm_world,ierr)
   enddo
! but if it's a worker, receive this array.
else
   call mpi_recv(ia,pts_per_proc(me+1)+1,mpi_integer, &
      & 0,me,mpi_comm_world,stat,ierr)
endif
with no seg faults. When I comment out the line
print *, 'Eat my shorts'
it seg faults, no matter where I include a call to mpi_barrier. For example, replacing the first bit with the code
call mpi_barrier(mpi_comm_world,ierr)
if (me /= 0) then
allocate( ia(pts_per_proc(me+1)+1) )
endif
call mpi_barrier(mpi_comm_world,ierr)
gives me a seg fault. I could use mpi_scatterv instead to circumvent this issue, but I'd like to know just what's going wrong here: the barriers should guarantee that nothing runs out of order.
A segmentation fault hidden by a print * statement is not unusual, and is often a symptom of memory corruption somewhere in your program.
In cases like these, the memcheck tool of Valgrind may save you a lot of trouble, though you need to configure the tool properly for use with MPI (and possibly expect a few false positives, which are usually easy to spot).
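With Open MPI or MPICH, the basic invocation is just valgrind placed between the launcher and your executable, for example (the program name here is a placeholder):
mpirun -np 4 valgrind --leak-check=full --track-origins=yes ./myprog
Some MPI implementations also ship a Valgrind suppression file that silences the known false positives coming from the library internals.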
I am currently starting to develop a parallel code for scientific applications. I have to exchange some buffers from p0 to p1 and from p1 to p0 (I am creating ghost points at the processor boundaries).
The error can be summarized by this sample code:
program test
   use mpi
   implicit none
   integer id, ids, idr, ierr, tag, istat(MPI_STATUS_SIZE)
   real sbuf, rbuf
   call mpi_init(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD,id,ierr)
   if(id.eq.0) then
      ids=0
      idr=1
      sbuf=1.5
      tag=id
   else
      ids=1
      idr=0
      sbuf=3.5
      tag=id
   endif
   call mpi_send(sbuf,1,MPI_REAL,ids,tag,MPI_COMM_WORLD,ierr)
   call mpi_recv(rbuf,1,MPI_REAL,idr,tag,MPI_COMM_WORLD,istat,ierr)
   call mpi_finalize(ierr)
end program test
What is wrong with this?
Coding with MPI can be difficult at first, and it's good that you're going through the steps of making a sample code. Your sample code as posted hangs due to deadlock. Both processes are busy MPI_SEND-ing, and the send cannot complete until it has been MPI_RECV-ed. So the code is stuck.
There are two common ways around this problem.
Send and Receive in a Particular Order
This is the simple and easy-to-understand solution. Code your send and receive operations such that nobody ever gets stuck. For your 2-process test case, you could do:
if (id==0) then
   call mpi_send(sbuf,1,MPI_REAL,ids,tag,MPI_COMM_WORLD,ierr)
   call mpi_recv(rbuf,1,MPI_REAL,idr,tag,MPI_COMM_WORLD,istat,ierr)
else
   call mpi_recv(rbuf,1,MPI_REAL,idr,tag,MPI_COMM_WORLD,istat,ierr)
   call mpi_send(sbuf,1,MPI_REAL,ids,tag,MPI_COMM_WORLD,ierr)
endif
Now, process 1 receives first, so there is never a deadlock. This particular example is not extensible, but there are various looping structures that can help. You can imagine a routine to send data from every process to every other process as:
do sending_process=1,nproc
   if (id == sending_process) then
      ! -- I am sending
      do destination_process = 1,nproc
         if (sending_process == destination_process) cycle
         call MPI_SEND ! Send to destination_process
      enddo
   else
      ! -- I am receiving
      call MPI_RECV ! Receive from sending_process
   endif
enddo
This works reasonably well and is easy to follow. I recommend this structure for beginners.
However, it has several issues for truly large problems. You are sending a number of messages equal to the number of processes squared, which can overload a large network. Also, depending on your operation, you probably do not need to send data from every process to every other process. (I suspect this is true for you given you mentioned ghosts.) You can modify the above loop to only send if data are required, but for those cases there is a better option.
Use Non-Blocking MPI Operations
For many-core problems, this is often the best solution. I recommend sticking to the simple MPI_ISEND and MPI_IRECV. Here, you start all necessary sends and receives, and then wait.
Here I am using a list structure, set up beforehand, that defines the complete set of necessary destinations for each process.
! -- Open sends
do d=1,Number_Destinations
   idest = Destination_List(d)
   call MPI_ISEND ! To destination idest
enddo
! -- Open receives
do s=1,Number_Senders
   isrc = Senders_List(s)
   call MPI_IRECV ! From source isrc
enddo
call MPI_WAITALL ! Wait for all sends and receives to complete
This option may look simpler but it is not. You must set up all necessary lists beforehand, and there are a variety of potential problems with buffer size and data alignment. Even still, it is typically the best answer for big codes.
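For example, the two-process exchange from the question could be written with non-blocking calls roughly like this (a sketch only; other is assumed to be the rank of the neighbour, i.e. 1-id in a two-process run, and a single tag of 0 is used in both directions):
integer :: other, req(2), istats(MPI_STATUS_SIZE,2)

other = 1 - id    ! the only other rank in a 2-process run
! post the receive and the send, then wait for both to complete
call MPI_IRECV(rbuf,1,MPI_REAL,other,0,MPI_COMM_WORLD,req(1),ierr)
call MPI_ISEND(sbuf,1,MPI_REAL,other,0,MPI_COMM_WORLD,req(2),ierr)
call MPI_WAITALL(2,req,istats,ierr)
! only after the wait is it safe to read rbuf or reuse sbuf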
As pointed out by Vladimir, your code is too incomplete to provide a definitive answer.
That being said, this could be a well-known error.
MPI_Send() might block. From a pragmatic point of view, MPI_Send() is likely to return immediately when sending a short message, but is likely to block when sending a large message. Note that "small" and "large" depend on your MPI library, the interconnect you are using, and other runtime parameters. MPI_Send() might block until a matching MPI_Recv() is posted on the other end.
It seems you call MPI_Send() and MPI_Recv() in the same block of code, so you can try using MPI_Sendrecv() to do both in one shot. MPI_Sendrecv() issues a non-blocking send under the hood, so it will help if your issue is really an MPI_Send() deadlock.
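For the two-rank example above, that could look something like this (a sketch; other is assumed to be the rank of the neighbour, 1-id in the two-process case, and a single tag of 0 is used both ways so the send and receive match):
call MPI_SENDRECV(sbuf,1,MPI_REAL,other,0, &
                  rbuf,1,MPI_REAL,other,0, &
                  MPI_COMM_WORLD,istat,ierr)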
So, I want to help my researchers a bit with debugging Fortran programs, and for demonstration purposes I created a program that intentionally causes a segfault.
Here's the source:
program segfault
   implicit none
   integer :: n(10), i
   integer :: ios, u

   open(newunit=u, file='data.txt', status='old', action='read', iostat=ios)
   if (ios /= 0) STOP "error opening file"

   i = 0
   do
      i = i + 1
      read(u, *, iostat=ios) n(i)
      if (ios /= 0) exit
   end do
   close(u)

   print*, sum(n)
end program segfault
The data.txt file contains 100 random numbers:
for i in {1..100}; do
echo $RANDOM >> data.txt;
done
When I compile this program with
gfortran -O3 -o segfault.exe segfault.f90
the resulting executable dutifully crashes. But when I compile with debugging enabled:
gfortran -O0 -g -o segfault.exe segfault.f90
Then it reads in only the first 10 values, and prints their sum. For what it's worth, -O2 causes the desired segfault, -O1 does not.
I find this deeply concerning. After all, how can I debug properly if the bug goes away when I compile with debugging symbols enabled?
Can someone explain this behaviour?
I am using GNU Fortran (MacPorts gcc5 5.3.0_1) 5.3.0
A segfault is a manifestation of undefined behaviour. The program does not conform to the Fortran standard, so you cannot expect any particular outcome. It can do anything at all. You cannot count on a segfault happening, much less be deeply concerned when it does not happen.
There are compiler check options (-fcheck=) and sanitizers (-fsanitize=) available for a reason. Waiting for a segfault is not guaranteed to work. Not in Fortran, not in C, not in any similar language.
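With the program from the question, for example, either of the following builds should report the out-of-bounds access on n at run time instead of relying on a lucky segfault (-fsanitize=address requires platform support):
gfortran -O0 -g -fcheck=bounds -o segfault.exe segfault.f90
gfortran -O0 -g -fsanitize=address -o segfault.exe segfault.f90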
The outcome of a non-conforming program may depend on many things, such as the placement of a variable in memory or in a register, the alignment of variables in memory, or the position of stack frames. You can't count on anything at all. These details obviously depend on the optimization level.
If the program accesses an array out of bounds, but the address in memory happens to be part of memory which still belongs to the process, a segfault may not happen. It is just some bytes in memory which the process is allowed to read or write (or both). You may be overwriting some other variable, you may be reading garbage from some old stack frame, or you may be overwriting malloc's internal book-keeping data and corrupting the heap. The crash may be waiting to happen somewhere else, or maybe just the numeric result of the program will be slightly wrong. Anything can happen.
I'm trying to send derived-type data with an allocatable array in MPI and got a seg fault.
program test_type
   use mpi
   implicit none

   type mytype
      real,allocatable::x(:)
      integer::a
   end type mytype

   type(mytype),allocatable::y(:)
   type(mytype)::z
   integer::n,i,ierr,myid,ntasks,status,request
   integer :: datatype, oldtypes(2), blockcounts(2)
   integer(KIND=MPI_ADDRESS_KIND) :: offsets(2)

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world,myid,ierr)
   call mpi_comm_size(mpi_comm_world,ntasks,ierr)

   n=2
   allocate(z%x(n))
   if(myid==0)then
      allocate(y(ntasks-1))
      do i=1,ntasks-1
         allocate(y(i)%x(n))
      enddo
   else
      call random_number(z%x)
      z%a=myid
      write(0,*) "z in process", myid, z%x, z%a
   endif

   call mpi_get_address(z%x,offsets(1),ierr)
   call mpi_get_address(z%a,offsets(2),ierr)
   offsets=offsets-offsets(1)
   oldtypes=(/ mpi_real,mpi_integer /)
   blockcounts=(/ n,1 /)
   write(0,*) "before commit",myid,offsets,blockcounts,oldtypes
   call mpi_type_create_struct(2,blockcounts,offsets,oldtypes,datatype,ierr)
   call mpi_type_commit(datatype, ierr)
   write(0,*) "after commit",myid,datatype, ierr

   if(myid==0) then
      do i=1,ntasks-1
         call mpi_irecv(y(i),1,datatype,1,0,mpi_comm_world,request,ierr)
         write(0,*) "received", y(i)%x,y(i)%a
      enddo
   else
      call mpi_isend(z,1,datatype,0,0,mpi_comm_world,request,ierr)
      write(0,*) "sent"
      write(0,*) myid, z%x, z%a
   end if

   call mpi_finalize(ierr)
end program
And this is what I got printed out running with 2 processes:
before commit 0 0 -14898056 2 1 13 7
after commit 0 73 0
z in process 1 3.9208680E-07 2.5480442E-02 1
before commit 1 0 -491689432 2 1 13 7
after commit 1 73 0
received 0.0000000E+00 0.0000000E+00 0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
It seems to get negative address offsets. Please help.
Thanks.
There are multiple issues with this code.
Allocatable arrays with most Fortran compilers are like pointers in C/C++: the real object behind the array name is something that holds a pointer to the allocated data. That data is usually allocated on the heap and that could be anywhere in the virtual address space of the process, which explains the negative offset. By the way, negative offsets are perfectly acceptable in MPI datatypes (that's why MPI_ADDRESS_KIND specifies a signed integer kind), so no big problem here.
The bigger problem is that the offsets between dynamically allocated things usually vary with each allocation. You could check that:
ADDR(y(1)%x) - ADDR(y(1)%a)
is completely different than
ADDR(y(i)%x) - ADDR(y(i)%a), for i = 2..ntasks-1
(ADDR here is just a shorthand notation for the object address as returned by MPI_GET_ADDRESS)
Even if the offsets happen to match for some value(s) of i, that is more of a coincidence than a rule.
That leads to the following: the type that you construct using offsets from the z variable cannot be used to send elements of the y array. To solve this, simply remove the allocatable property of mytype%x if that is possible (e.g. if n is known in advance).
Another option that should work well for small values of ntasks is to define as many MPI datatypes as the number of elements of the y array. Then use datatype(i), which is based on the offsets of y(i)%x and y(i)%a, to send y(i).
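A sketch of that second option on rank 0, reusing offsets, blockcounts and oldtypes from your code; datatypes is a hypothetical allocatable array of handles, and note that the displacements are measured from y(i)%x, so y(i)%x must also be passed as the receive buffer:
integer,allocatable :: datatypes(:)

allocate(datatypes(ntasks-1))
do i=1,ntasks-1
   call mpi_get_address(y(i)%x,offsets(1),ierr)
   call mpi_get_address(y(i)%a,offsets(2),ierr)
   offsets = offsets - offsets(1)
   call mpi_type_create_struct(2,blockcounts,offsets,oldtypes,datatypes(i),ierr)
   call mpi_type_commit(datatypes(i),ierr)
enddo
! later: call mpi_recv(y(i)%x,1,datatypes(i),i,0,mpi_comm_world,mpi_status_ignore,ierr)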
A more severe issue is the fact that you are using non-blocking MPI operations and never wait for them to complete before accessing the data buffers. This code simply won't work:
do i=1,ntasks-1
call mpi_irecv(y(i),1,datatype,1,0,mpi_comm_world,request,ierr)
write(0,*) "received", y(i)%x,y(i)%a
enddo
Calling MPI_IRECV starts an asynchronous receive operation. The operation is probably still in progress by the time the WRITE statement gets executed, therefore completely random data is being accessed (some memory allocators might actually zero the data in debug mode). Either insert a call to MPI_WAIT in between the MPI_IRECV and WRITE calls or use the blocking receive MPI_RECV.
A similar problem exists with the use of the non-blocking send call MPI_ISEND. Since you never wait on the completion of the request or test for it, the MPI library is allowed to postpone indefinitely the actual progression of the operation and the send might never actually occur. Again, since there is absolutely no justification for the use of the non-blocking send in your case, replace MPI_ISEND by MPI_SEND.
And last but not least, rank 0 is receiving messages from rank 1 only:
call mpi_irecv(y(i),1,datatype,1,0,mpi_comm_world,request,ierr)
^^^
At the same time, all other processes are sending to rank 0. Therefore, your program will only work if run with two MPI processes. You might want to replace the underlined 1 in the receive call with i.
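Putting all of that together, here is a minimal sketch of the exchange with those fixes applied, under the assumption that mytype%x is turned into a fixed-size array (so the single datatype built from z describes every element of y as well) and that blocking calls are used:
if (myid == 0) then
   do i=1,ntasks-1
      ! blocking receive of one message from each worker rank i
      call mpi_recv(y(i),1,datatype,i,0,mpi_comm_world,mpi_status_ignore,ierr)
      write(0,*) "received from", i, y(i)%x, y(i)%a
   enddo
else
   ! blocking send of this rank's z to rank 0
   call mpi_send(z,1,datatype,0,0,mpi_comm_world,ierr)
endif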
When I use Send/Recv my code works, but when I replace Send/Recv with Isend/Irecv it yields a segmentation fault. Before going anywhere else, I wanted to verify whether the following snippet looks alright or not.
The rest of the code should be fine since Send/Recv works; I haven't pasted it here as it's a long code.
INTEGER :: IERR,TASKID,NUMTASKS,SPANX,SPANY,SPANZ,PROCSX,PROCSY,PROCSZ,STAT,STATUS(MPI_STATUS_SIZE),ISTAT(MPI_STATUS_SIZE,52)
INTEGER,DIMENSION(1:52) :: REQ
ALLOCATE(RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND),STAT=IERR)
IF (IERR /=0) PRINT*,'ERROR IN RCC BY',TASKID
DO I=1,52
   REQ(I)=MPI_REQUEST_NULL
ENDDO
IF (TASKID.NE.0) THEN
   NT=TASKID
   CALL MPI_ISEND(RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND),SIZE(RCC),MPI_DOUBLE_PRECISION,0,8,MPI_COMM_WORLD,REQ(NT),IERR)
ENDIF
IF (TASKID.EQ.0) THEN
   DO NT = 1,26
      CALL MPI_IRECV(CC(RSPANX(NT):RSPANXE(NT),RSPANY(NT):RSPANYE(NT),RSPANZ(NT):RSPANZE(NT)),SIZECC(NT),MPI_DOUBLE_PRECISION,NT,8,MPI_COMM_WORLD,REQ(NT+26),IERR)
   ENDDO
ENDIF
CALL MPI_WAITALL(52,REQ,ISTAT,IERR)
DEALLOCATE(RCC,STAT=IERR)
IF (IERR /=0) PRINT*,'ERROR IN DEALLOCATE RCC BY',TASKID
CALL MPI_FINALIZE(IERR)
RETURN
END
However, when I use Isend/Irecv, the following line doesn't give a segmentation fault.
CALL MPI_IRECV(CC(RSPANX(NT),RSPANY(NT),RSPANZ(NT)),SIZECC(NT),MPI_DOUBLE_PRECISION,NT,8,MPI_COMM_WORLD,REQ(NT+26),IERR)
Calling asynchronous communication routines like MPI_ISEND and MPI_IRECV with array sections, e.g. RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND), is very dangerous. The reason is that due to limitations in the older Fortran standards most MPI implementations do not provide proper interfaces for those routines and the compiler copies the data from the array section into a temporary contiguous storage, which then gets passed to the subroutine. The segmentation fault probably occurs due to this temporary storage being freed on return from MPI_ISEND/MPI_IRECV before the actual data transfer takes place. You can prevent this from happening by manually allocating the contiguous array and copying the data there.
On the other side, CC(RSPANX(NT),RSPANY(NT),RSPANZ(NT)) does not refer to a section of the array but rather to the location of a single element. No temporary copy of the data is created in this case.
MPI-3.0 provides an improved set of Fortran bindings, mpi_f08, which uses modern features of Fortran 2008 and TS 29113 to mark such arguments with the ASYNCHRONOUS attribute and to enable safe passing of arrays of different rank (TYPE(*), DIMENSION(..)).
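For completeness, here is a minimal sketch of the manual copy-in workaround on the send side, reusing the names from your snippet; SBUF is a hypothetical temporary buffer that must stay allocated until MPI_WAITALL has completed the request:
DOUBLE PRECISION,ALLOCATABLE :: SBUF(:,:,:)

IF (TASKID.NE.0) THEN
   NT=TASKID
   ALLOCATE(SBUF(IIST:IIEND,JJST:JJEND,KKST:KKEND))
   SBUF = RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND)   ! explicit copy into storage we control
   CALL MPI_ISEND(SBUF,SIZE(SBUF),MPI_DOUBLE_PRECISION,0,8,MPI_COMM_WORLD,REQ(NT),IERR)
ENDIF
! receives on rank 0 stay as they are
CALL MPI_WAITALL(52,REQ,ISTAT,IERR)
IF (TASKID.NE.0) DEALLOCATE(SBUF)   ! safe only after the wait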
My 2D hydro code stalls during the following subroutine (which computes the y-direction flux):
ALLOCATE(W1d(1:my,nFields),q1d(nFields),&
         Wl(1:my,nFields),Wr(1:my,nFields))
PRINT *,"Main loop"
DO i=1,mx
   DO j=1,my
      q1d(1) = qVar(i,j,1,iRho)
      q1d(2) = qVar(i,j,1, iE)
      q1d(3) = qVar(i,j,1, ivy)
      q1d(4) = qVar(i,j,1, ivx)
      CALL Cons2Prim(q1d(:), W1d(j,:))
   ENDDO
   CALL lr_states(grid, W1d, dt, dy, Wl, Wr, dir)
   DO j=1,my
      Flux(i,j,:) = hllc_flux(wl(j,:), wr(j,:))
   ENDDO
   DO j=1,my
      CALL Prim2Cons(Wl(j,:),Ul(i,j,:))
      CALL Prim2Cons(Wr(j,:),Ur(i,j,:))
   ENDDO
ENDDO
PRINT *,"Deallocating"
DEALLOCATE(W1d,q1d,Wl,Wr)
PRINT *,"Returning"
I separated the DEALLOCATE statement into 4 separate statements and found that whichever 2D array came first, W1d, Wl, or Wr, was the cause of the stall. Omitting the DEALLOCATE statement (which should produce an automatic deallocation when returning to the main program) also causes a stall. The subroutine for the x-direction flux has the same arrays, is called before this subroutine, and has no problem deallocating them.
Any suggestions?
EDIT This is run on Fedora 18 and compiled with Intel Fortran 2013.3. It is a parallelized code, but I am running it on a single processor for testing/debugging purposes.
I did three different things and it suddenly started working again. Two of them I do not believe could have done it, while it is possible the third did it. The changes I made:
I did have the bounds of i and j loops defined slightly differently, so I made it uniform between the two directional sweeps
I ran make clean and make
I added -check bounds -check pointers -check uninit flags to the Makefile
I think the first two did not really do anything. The variable grid in the code above is a 2x2 array that contains the bounds of qVar; in the x-sweep I had defined mx = grid(1,2) - grid(1,1) + 1, and similarly for my, but grid(1,1) is 1, so it does not really change anything. The second item above I had done at least 3 times already.
But the last one I tried once and it started working again. I do not know how that could have fixed it, so if someone does know, please tell me!
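For reference, the compile line with those runtime checks might look something like this (the file name is a placeholder; adding -g -traceback is my own suggestion so the runtime error report points at the offending source line):
ifort -g -traceback -check bounds -check pointers -check uninit -o hydro hydro.f90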