I have seen people produce segmentation faults with MPI_Barrier in C (Segmentation fault while using MPI_Barrier in `libpmpi.12.dylib`) and C++ (Why does MPI_Barrier cause a segmentation fault in C++), although I could not reproduce the errors they reported. Now, however, I get the same error with MPI_Barrier in Fortran.
My code is simple:
program main
  implicit none
  include 'mpif.h'

  ! local variables
  character(len=80) :: filename, input
  character(len=4) :: command
  integer :: ierror, i, l, cmdunit
  logical :: terminate
  integer :: num_procs, my_id, impi_error
  real :: program_start, program_end

  call MPI_INIT(impi_error)
  call MPI_COMM_RANK(MPI_COMM_WORLD,my_id,impi_error)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,num_procs,impi_error)

  call MPI_Barrier(MPI_COMM_WORLD)
  program_start = MPI_Wtime()

  filename='sc.cmd'
  cmdunit=8
  print *, my_id, cmdunit

  call MPI_Barrier(MPI_COMM_WORLD)
  call MPI_Barrier(MPI_COMM_WORLD)
  call MPI_Barrier(MPI_COMM_WORLD)
  call MPI_Barrier(MPI_COMM_WORLD)
  call MPI_Barrier(MPI_COMM_WORLD)

  program_end = MPI_Wtime()
  if (my_id == 0) then
     write(*,'(a,F25.16,a)') "MDStressLab runs in ", program_end - program_start, " s."
  endif

  call MPI_FINALIZE(impi_error)
end program
Nothing special about the code. However, when I compile it with mpif90 tmp.f90 and run it with mpirun -n 2 ./a.out, it gives me:
0 8
1 8
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7FBF2C700E08
#1 0x7FBF2C6FFF90
#0 0x7F2EDF972E08
#2 0x7FBF2C3514AF
#1 0x7F2EDF971F90
#2 0x7F2EDF5C34AF
#3 0x7FBF2CA4F808
#4 0x400EB4 in MAIN__ at tmp.f90:?
#3 0x7F2EDFCC1808
#4 0x400EB4 in MAIN__ at tmp.f90:?
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 35660 on node min-virtual-machine exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The funny thing is that it only crashes with 2 processes. It runs fine with anywhere from 1 to 10 processes, except 2. Since this also seems to happen randomly in C and C++, I think there might be some hidden bug somewhere in the MPI library, but that is just my guess. Could anybody help?
Just replace

call MPI_Barrier(MPI_COMM_WORLD)

with

call MPI_Barrier(MPI_COMM_WORLD, impi_error)

Note that if your Fortran compiler and MPI library support Fortran 2008, you also have the option of replacing

include 'mpif.h'

with

use mpi_f08

and then you no longer need the impi_error argument, since the Fortran 2008 bindings make it optional.
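For reference, a minimal sketch of what the program looks like with the Fortran 2008 bindings (assuming your MPI implementation ships the mpi_f08 module; the error argument is simply omitted):

program main
  use mpi_f08            ! Fortran 2008 bindings: ierror arguments become optional
  implicit none
  integer :: num_procs, my_id
  double precision :: program_start, program_end

  call MPI_Init()
  call MPI_Comm_rank(MPI_COMM_WORLD, my_id)
  call MPI_Comm_size(MPI_COMM_WORLD, num_procs)

  call MPI_Barrier(MPI_COMM_WORLD)   ! no ierror needed here either
  program_start = MPI_Wtime()
  call MPI_Barrier(MPI_COMM_WORLD)
  program_end = MPI_Wtime()

  if (my_id == 0) print *, "elapsed: ", program_end - program_start, " s"
  call MPI_Finalize()
end program main

With the legacy mpif.h interface there is no explicit interface checking, so the missing ierror argument goes unnoticed at compile time and the library ends up writing the error code through an invalid address at run time, which is what produces the segmentation fault above.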
Related
I'm modifying an existing Fortran code that uses OpenMP. The original version of this code works perfectly in parallel.
I obtain a segmentation fault when a certain variable is accessed during the multi-threaded run (I verified this by setting flags all over the code). This array is declared allocatable, then declared threadprivate, and then allocated, whereas in the original version it is not allocatable and its size is set immediately. I modified this part because of the work plan I was given.
Here is a basic piece of code that reproduces the error. The guilty variable is an array, here named "var".
program testparallel
  use omp_lib
  implicit none
  integer :: thread_id, thread_num
  integer :: i,N
  integer,dimension(:),allocatable,save :: var
  !$omp threadprivate(var)

  N = 20
  allocate(var(5))

  !$omp parallel default(shared) private(thread_id)
  thread_id = omp_get_thread_num()
  thread_num = omp_get_num_threads()
  write(*,*)'Parallel execution on ',thread_num, ' Threads'
  !$omp do
  do i=1,N
     var = 0
     write(*,*) thread_id,i
  end do
  !$omp end do
  !$omp end parallel
end program testparallel
This is more or less how the original code is structured; I didn't modify this part directly. var is initialised within the loop and, depending on the inputs, its values are used later by other routines.
This is the error traceback I obtained:
Parallel execution on 2 Threads
0 1
0 2
Parallel execution on 2 Threads
0 3
0 4
0 5
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
0 6
Backtrace for this error:
0 7
0 8
0 9
0 10
#0 0x7F0149194697
#1 0x7F0149194CDE
#2 0x7F014824E33F
#3 0x400FB2 in MAIN__._omp_fn.0 at testparallel.F90:?
#4 0x7F0148C693C4
#5 0x7F01485ECDD4
#6 0x7F0148315F6C
#7 0xFFFFFFFFFFFFFFFF
The segfault doesn't occur if I don't declare var as allocatable but set its size straight away (as in the original code). If I allocate it before declaring it threadprivate, I get a compilation error.
How can I avoid this error but keep var allocatable (which is necessary)?
EDIT: I corrected the description of the original code.
Your issue comes from the fact that, although your allocatable array var is declared threadprivate, it is only allocated in the non-parallel part of the code. Therefore, once in a parallel section, only the master thread can safely access the array.
A very simple fix is to enclose your array allocation (and subsequent de-allocation) within a parallel section like this:
!$omp parallel
allocate(var(5))
!$omp end parallel
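Applied to the test program above, a minimal sketch of that fix (each thread allocates, uses, and deallocates its own copy inside the parallel region):

program testparallel
  use omp_lib
  implicit none
  integer :: thread_id, thread_num
  integer :: i, N
  integer, dimension(:), allocatable, save :: var
  !$omp threadprivate(var)

  N = 20

  !$omp parallel default(shared) private(thread_id)
  allocate(var(5))                  ! every thread allocates its own private copy
  thread_id = omp_get_thread_num()
  thread_num = omp_get_num_threads()
  write(*,*) 'Parallel execution on ', thread_num, ' Threads'
  !$omp do
  do i = 1, N
     var = 0
     write(*,*) thread_id, i
  end do
  !$omp end do
  deallocate(var)                   ! and releases it before leaving the region
  !$omp end parallel
end program testparallel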
I am working on a large Fortran code. Before compiling with fast options (in order to run tests on a large database), I usually compile with warning options in order to detect and backtrace all problems.
So with gfortran -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow -Wall -fcheck=all -ftrapv -g2, I get the following error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7fec64cdfef7 in ???
#1 0x7fec64cdf12d in ???
#2 0x7fec6440e4af in ???
#3 0x7fec64a200b4 in ???
#4 0x7fec649dc5ce in ???
#5 0x4cf93a in __f_mod_MOD
at /f_mod.f90:132
#6 0x407d55 in main_loop_
at main.f90:419
#7 0x40cf5c in main_prog
at main.f90:180
#8 0x40d5d3 in main
at main.f90:68
And the portion of the code at f_mod.f90:132 contains a where construct:
! Compute s parameter
do i = 1, Imax
   where (dprim .ne. 1.0)
      s(:,:,:,:) = s(:,:,:,:) + vprim(:,:,:,i,:)*dprim(:,:,:,:)*dprim(:,:,:,:)/(1.0 - dprim(:,:,:,:))
   endwhere
enddo
But I do not see any mistake here. All the other backtrace locations are the calls of the subroutines leading to this part. And of course, since it is a SIGFPE error, I have no problem at execution when I compile with gfortran -g1. (I use gfortran 6.4.0 on Linux.)
Moreover, this error appears and disappears when I modify completely different parts of the code. So does the problem come from this where construct? Or from somewhere else, with the backtrace being wrong? If that is the case, how can I find the mistake?
EDIT: Since I cannot reproduce this error in a minimal example (the minimal examples work), I think the problem comes from somewhere else. But how can I find the problem in a large code?
As the code is dying with a SIGFPE, use each of the individual possible traps to learn whether it is FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW, or FE_UNDERFLOW. If it is an underflow, change your mask to 1 - dprim .ne. 0.
PS: Don't use array section notation when a whole-array reference can be used instead.
PPS: You may want to compute dprim*dprim / (1 - dprim) outside of the do loop, as it is loop invariant.
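To illustrate the last two points, here is a rough sketch with made-up array shapes (the real declarations are not shown in the question): the loop-invariant factor is computed once under the mask, and whole-array references replace the section notation. Compiling with one trap at a time (e.g. -ffpe-trap=underflow alone) tells you which exception actually fires.

program where_rewrite
  implicit none
  integer, parameter :: n = 4, Imax = 3              ! hypothetical sizes
  real :: s(n,n,n,n), dprim(n,n,n,n), vprim(n,n,n,Imax,n), w(n,n,n,n)
  integer :: i

  call random_number(dprim)
  call random_number(vprim)
  s = 0.0

  ! loop-invariant factor, evaluated only where the mask holds
  w = 0.0
  where (dprim /= 1.0) w = dprim*dprim/(1.0 - dprim)

  do i = 1, Imax
     s = s + vprim(:,:,:,i,:)*w
  end do

  print *, 'sum(s) =', sum(s)
end program where_rewrite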
I am getting strange reactions to my Fortran 95 code from my machine and do not know what is going wrong. Here's the situation:
I am trying to get acquainted with LAPACK and wrote a shamefully simple 1-D "FEM" program just to see how to use LAPACK:
program bla
  ! Solving the easiest of all FE static cases: one-dimensional, axially loaded elastic rod composed of 2-noded elements
  implicit none
  integer :: nelem, nnodes, i,j, info
  real, parameter :: E=2.1E9, crossec=19.634375E-6, L=1., F=10E3
  real :: initelemL
  real, allocatable :: A(:,:)
  real, allocatable :: b(:), u(:)
  integer, allocatable :: ipiv(:)

  print *,'Number of elements?'
  read *,nelem
  nnodes=nelem+1
  allocate(A(nnodes,nnodes),u(nnodes),b(nnodes), ipiv(nnodes))
  initelemL=L/nelem

  A(1,1)=1
  do i=2, nnodes
     A(1,i)=0
  end do
  do i=2,nnodes
     do j=1,nnodes
        A(i,j)=0
     end do
     A(i,i)=1
     A(i,i-1)=-1
  end do

  b(1)=0   ! That's the BC of zero-displacement of the first node
  do i=2,nnodes
     b(i)=((F/crossec)/E +1)*initelemL
  end do

  ! calling the LAPACK subroutine:
  call SGESV(nnodes,nnodes, A, nnodes, ipiv, b, nnodes, info)
  print *,info
  print *,b
end program bla
I'm on a Mac, so in order to include LAPACK I compile with:
gfortran -fbacktrace -g -Wall -Wextra -framework accelerate bla.f95
and it compiles without a warning.
When I run the code, strange things happen:
If I put in 2 as the number of elements, I get the answer b as expected.
If I put in 5, I get a segmentation fault:
Number of elements?
5
0
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x10bd3eff6
#1 0x10bd3e593
#2 0x7fff98001f19
#3 0x7fff93087d62
#4 0x7fff93085dd1
#5 0x7fff930847e3
#6 0x7fff93084666
#7 0x7fff93083186
#8 0x7fff9696c63f
#9 0x7fff96969393
#10 0x7fff969693b4
#11 0x7fff9696967a
#12 0x7fff96991bb2
#13 0x7fff969ba80e
#14 0x7fff9699efb4
#15 0x7fff9699f013
#16 0x7fff9698f3b9
#17 0x10bdc7cee
#18 0x10bdc8fd6
#19 0x10bdc9936
#20 0x10bdc0f42
#21 0x10bd36c40
#22 0x10bd36d20
Segmentation fault: 11
If I put in 50, I get the answer, but THEN the program fails, although there is nothing left for it to do:
Number of elements?
50
0
0.00000000 2.48505771E-02 4.97011542E-02 7.45517313E-02 9.94023085E-02 0.124252886 0.149103463 0.173954040 0.198804617 0.223655194 0.248505771 0.273356348 0.298206925 0.323057503 0.347908080 0.372758657 0.397609234 0.422459811 0.447310388 0.472160965 0.497011542 0.521862149 0.546712756 0.571563363 0.596413970 0.621264577 0.646115184 0.670965791 0.695816398 0.720667005 0.745517612 0.770368218 0.795218825 0.820069432 0.844920039 0.869770646 0.894621253 0.919471860 0.944322467 0.969173074 0.994023681 1.01887429 1.04372489 1.06857550 1.09342611 1.11827672 1.14312732 1.16797793 1.19282854 1.21767914 1.24252975
a.out(1070,0x7fff7aa05300) malloc: *** error for object 0x7fbdd9406028: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x106c61ff6
#1 0x106c61593
#2 0x7fff98001f19
Abort trap: 6
This is reproducible. But if I put another print statement somewhere in the code, the numbers (here: 2, 5, 50) change. I'm probably committing a rookie mistake here, but I am currently feeling rather helpless, as it works sometimes and I am not sure how to interpret the backtrace.
My ideas are currently:
Some really stupid mistake in using SGESV
LAPACK library somehow broken
Some hardware issue with my memory.
Has anybody experienced something like this before and could offer any advice on what is going on?
Thanks in advance, cheers,
N.F.
You are defining b as a 1-dimensional real array and allocating it on the heap. You are passing it as the 6th parameter to SGESV. The documentation of SGESV defines the 6th parameter as a 2-dimensional real array:
\param[in,out] B
\verbatim
B is REAL array, dimension (LDB,NRHS)
On entry, the N-by-NRHS matrix of right hand side matrix B.
On exit, if INFO = 0, the N-by-NRHS solution matrix X.
\endverbatim
SGESV consequently writes into memory locations that it believes to lie in a 2D array addressed by b, when in fact they might, luckily, happen to lie in the 1D array you have actually passed, or, unluckily, in who-knows-what other part of your program's memory layout. So it's vandalizing itself. The damage done will be unpredictable and will vary according to your input parameter, which determines the expected and actual size of the mis-allocated array.
Comparing your code with the documented interface of SGESV, you appear to be confusing the parameters B and IPIV.
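For what it is worth, since there is only one right-hand side here, a call that matches the documented interface would pass 1 as NRHS; this is just a sketch of the corrected line, with the rest of the program unchanged:

! N = nnodes equations, NRHS = 1 right-hand side, LDA = LDB = nnodes.
! With NRHS = 1 the rank-1 array b(nnodes) is sequence-associated with the
! dummy argument B(LDB,NRHS), so no reshaping is needed.
call SGESV(nnodes, 1, A, nnodes, ipiv, b, nnodes, info)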
I'm trying to send derived-type data containing an allocatable array in MPI and I got a seg fault.
program test_type
  use mpi
  implicit none

  type mytype
     real,allocatable::x(:)
     integer::a
  end type mytype

  type(mytype),allocatable::y(:)
  type(mytype)::z
  integer::n,i,ierr,myid,ntasks,status,request
  integer :: datatype, oldtypes(2), blockcounts(2)
  integer(KIND=MPI_ADDRESS_KIND) :: offsets(2)

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world,myid,ierr)
  call mpi_comm_size(mpi_comm_world,ntasks,ierr)

  n=2
  allocate(z%x(n))
  if(myid==0)then
     allocate(y(ntasks-1))
     do i=1,ntasks-1
        allocate(y(i)%x(n))
     enddo
  else
     call random_number(z%x)
     z%a=myid
     write(0,*) "z in process", myid, z%x, z%a
  endif

  call mpi_get_address(z%x,offsets(1),ierr)
  call mpi_get_address(z%a,offsets(2),ierr)
  offsets=offsets-offsets(1)
  oldtypes=(/ mpi_real,mpi_integer /)
  blockcounts=(/ n,1 /)
  write(0,*) "before commit",myid,offsets,blockcounts,oldtypes

  call mpi_type_create_struct(2,blockcounts,offsets,oldtypes,datatype,ierr)
  call mpi_type_commit(datatype, ierr)
  write(0,*) "after commit",myid,datatype, ierr

  if(myid==0) then
     do i=1,ntasks-1
        call mpi_irecv(y(i),1,datatype,1,0,mpi_comm_world,request,ierr)
        write(0,*) "received", y(i)%x,y(i)%a
     enddo
  else
     call mpi_isend(z,1,datatype,0,0,mpi_comm_world,request,ierr)
     write(0,*) "sent"
     write(0,*) myid, z%x, z%a
  end if

  call mpi_finalize(ierr)
end program
And this is what I got printed out running with 2 processes:
before commit 0 0 -14898056
2 1 13 7
after commit 0 73 0
z in process 1 3.9208680E-07 2.5480442E-02 1
before commit 1 0 -491689432
2 1 13 7
after commit 1 73 0
received 0.0000000E+00 0.0000000E+00 0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
It seems to get negative address offsets. Please help.
Thanks.
There are multiple issues with this code.
Allocatable arrays with most Fortran compilers are like pointers in C/C++: the real object behind the array name is something that holds a pointer to the allocated data. That data is usually allocated on the heap and that could be anywhere in the virtual address space of the process, which explains the negative offset. By the way, negative offsets are perfectly acceptable in MPI datatypes (that's why MPI_ADDRESS_KIND specifies a signed integer kind), so no big problem here.
The bigger problem is that the offsets between dynamically allocated things usually vary with each allocation. You could check that:
ADDR(y(1)%x) - ADDR(y(1)%a)
is completely different than
ADDR(y(i)%x) - ADDR(y(i)%a), for i = 2..ntasks-1
(ADDR here is just a shorthand notation for the object address as returned by MPI_GET_ADDRESS)
Even if the offsets happen to match for some value(s) of i, that is more of a coincidence than a rule.
That leads to the following: the type that you construct using offsets from the z variable cannot be used to send elements of the y array. To solve this, simply remove the allocatable property of mytype%x if that is possible (e.g. if n is known in advance).
Another option that should work well for small values of ntasks is to define as many MPI datatypes as the number of elements of the y array. Then use datatype(i), which is based on the offsets of y(i)%x and y(i)%a, to send y(i).
A more severe issue is the fact that you are using non-blocking MPI operations and never wait for them to complete before accessing the data buffers. This code simply won't work:
do i=1,ntasks-1
call mpi_irecv(y(i),1,datatype,1,0,mpi_comm_world,request,ierr)
write(0,*) "received", y(i)%x,y(i)%a
enddo
Calling MPI_IRECV starts an asynchronous receive operation. The operation is probably still in progress by the time the WRITE statement gets executed, therefore completely random data is being accessed (some memory allocators might actually zero the data in debug mode). Either insert a call to MPI_WAIT between the MPI_IRECV and WRITE calls or use the blocking receive MPI_RECV.
A similar problem exists with the use of the non-blocking send call MPI_ISEND. Since you never wait on the completion of the request or test for it, the MPI library is allowed to postpone indefinitely the actual progression of the operation and the send might never actually occur. Again, since there is absolutely no justification for the use of the non-blocking send in your case, replace MPI_ISEND by MPI_SEND.
And last but not least, rank 0 is receiving messages from rank 1 only:
call mpi_irecv(y(i),1,datatype,1,0,mpi_comm_world,request,ierr)
^^^
At the same time, all other processes are sending to rank 0. Therefore, your program will only work if run with two MPI processes. You might want to replace the underlined 1 in the receive call with i.
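Putting the last three points together, a sketch of what the corrected communication could look like (this assumes the first suggested fix, i.e. mytype%x is made a fixed-size component such as real :: x(2), so that the single committed datatype describes every element of y; blocking calls replace the non-blocking ones, and rank 0 receives from each rank i in turn):

if (myid == 0) then
   do i = 1, ntasks-1
      call mpi_recv(y(i), 1, datatype, i, 0, mpi_comm_world, &
                    MPI_STATUS_IGNORE, ierr)
      write(0,*) "received from", i, ":", y(i)%x, y(i)%a
   enddo
else
   call mpi_send(z, 1, datatype, 0, 0, mpi_comm_world, ierr)
   write(0,*) myid, "sent", z%x, z%a
end if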
I'm parallelizing a Fortran 90 program using MPI and I get some truly bizarre behavior. I have an array ia of length nn+1, which I'm sending in chunks from process 0 to processes 1,...,ntasks-1. Each process also has a list proc_start, which gives the starting position in ia for each of the other processes, and a list pts_per_proc, which gives the number of points each process has. The following code works:
if (me == 0) then
   print *, 'Eat my shorts'
else
   allocate( ia(pts_per_proc(me+1)+1) )
endif

! If this is the boss process, send the array ia,
if (me == 0) then
   do n=1,ntasks-1
      call mpi_send(ia(proc_start(n+1)),pts_per_proc(n+1)+1, &
           & mpi_integer,n,n,mpi_comm_world,ierr)
   enddo
! but if it's a worker, receive this array.
else
   call mpi_recv(ia,pts_per_proc(me+1)+1,mpi_integer, &
        & 0,me,mpi_comm_world,stat,ierr)
endif
with no seg faults. When I comment out the line
print *, 'Eat my shorts'
it seg faults, no matter where I include a call to mpi_barrier. For example, replacing the first bit with the code
call mpi_barrier(mpi_comm_world,ierr)
if (me /= 0) then
allocate( ia(pts_per_proc(me+1)+1) )
endif
call mpi_barrier(mpi_comm_world,ierr)
gives me a seg fault. I could use mpi_scatterv instead in order to circumvent this issue but I'd like to know just what's going wrong here -- the barriers should guarantee that nothing runs out of order.
A segmentation fault hidden by a print * statement is not unusual, and is often a symptom of memory corruption somewhere in your program.
In cases like these, the memcheck tool of Valgrind may save you a lot of trouble, though you need to configure the tool properly for use with MPI (and possibly expect a few false positives, which are easily detectable).
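As a rough illustration (the exact flags and any MPI-specific suppression files depend on your Valgrind and MPI installations), you would run every rank under memcheck with something along the lines of:

mpirun -n 2 valgrind --track-origins=yes --log-file=valgrind.%p.log ./a.out

and then inspect the per-process log files for invalid reads or writes near the reported crash site.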