I am working on a large Fortran code, and before compiling with fast options (in order to run tests on a large database), I usually compile with "warnings" options in order to detect and backtrace all the problems.
When I compile with gfortran -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow -Wall -fcheck=all -ftrapv -g2, I get the following error at run time:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7fec64cdfef7 in ???
#1 0x7fec64cdf12d in ???
#2 0x7fec6440e4af in ???
#3 0x7fec64a200b4 in ???
#4 0x7fec649dc5ce in ???
#5 0x4cf93a in __f_mod_MOD
at /f_mod.f90:132
#6 0x407d55 in main_loop_
at main.f90:419
#7 0x40cf5c in main_prog
at main.f90:180
#8 0x40d5d3 in main
at main.f90:68
The portion of the code at f_mod.f90:132 contains a where loop:
! Compute s parameter
do i = 1, Imax
   where (dprim .ne. 1.0)
      s(:,:,:, :) = s(:,:,:, :) +vprim(:,:,:, i,:)*dprim(:,:,:, :)*dprim(:,:,:, :)/(1.0 -dprim(:,:,:, :))
   endwhere
enddo
But I do not see any mistake here. All the other locations in the backtrace are the calls to the subroutines leading to this part. And of course, since it is a SIGFPE error, I have no problem at execution when I compile with gfortran -g1. (I use gfortran 6.4.0 on Linux.)
Moreover, this error appears and disappears when I modify completely different parts of the code. So does the problem come from this where loop? Or does it come from somewhere else, and the backtrace is wrong? If that is the case, how can I find the mistake?
EDIT:
Since I cannot reproduce this error in a minimal example (they all work), I think that the problem comes from somewhere else. But how can I find the problem in a large code?
As the code is dying with a SIGFPE, use each of the individual
possible traps to learn if it is a FE_DIVBYZERO, FE_INVALID,
FE_OVERFLOW, or FE_UNDERFLOW. If it is an underflow, change
your mask to '1 - dprim .ne. 0'.
PS: Don't use array section notation when a whole array reference
can be used instead.
PPS: You may want to compute dprim*dprim / (1 - dprim) outside
of the do-loop as it is loop invariant.
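A minimal sketch of the PS and PPS suggestions, assuming small hypothetical array sizes and random test data (the names s_param_sketch, n, factor and mask are illustrative and not from the original code). The loop-invariant factor is computed once, whole-array references replace the (:,:,:,:) sections, and the mask is written as 1 - dprim /= 0:

```fortran
program s_param_sketch
  implicit none
  ! Hypothetical sizes, chosen only to keep the sketch small
  integer, parameter :: n = 2, Imax = 3
  real    :: s(n,n,n,n), dprim(n,n,n,n), factor(n,n,n,n)
  real    :: vprim(n,n,n,Imax,n)
  logical :: mask(n,n,n,n)
  integer :: i

  ! Made-up test data; one element is forced to the singular value 1.0
  call random_number(dprim)
  call random_number(vprim)
  dprim(1,1,1,1) = 1.0
  s = 0.0

  ! The mask tests the denominator itself rather than dprim
  mask = (1.0 - dprim) /= 0.0

  ! Loop-invariant factor, computed once and only where the mask holds
  factor = 0.0
  where (mask) factor = dprim*dprim/(1.0 - dprim)

  ! Only the i-dependent section of vprim remains inside the loop
  do i = 1, Imax
     where (mask) s = s + vprim(:,:,:,i,:)*factor
  end do

  print *, 'masked element left untouched: ', s(1,1,1,1) == 0.0
end program s_param_sketch
```

Compiling such a reduced case once per trap (for example only -ffpe-trap=underflow, then only -ffpe-trap=invalid) is one way to see which exception is actually raised.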
My long-running application crashes randomly with a segmentation fault. When trying to debug the generated core dump, I get stuck with a weird stack trace:
(gdb) bt full
#0 __memmove_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:2582
No locals.
#1 0x00000000 in ?? ()
No symbol table info available.
How can it happen that the backtrace starts at 0x00000000?
What can I do to debug this issue further? I can't run it in gdb, as it may take up to a week until the crash occurs.
Generally this means that the return address on the stack has been overwritten with 0, probably due to overrunning the end of an on-stack array. You can try building with the address sanitizer in gcc or clang (if you are using them). Or you can try running under valgrind to see if it will tell you about invalid memory writes.
Pretty simple setup, using gfortran 4.8.5 on linux (red hat):
I get a segfault if my array of reals (inside a derived type) has size > 2,000,000. This seems to be a standard stack/heap issue as my stack size is 8mb if I check with ulimit.
There is no problem if the array is NOT inside a derived type
Note that, as @francescalus guesses, removing the initial value = 0.0 eliminates the problem
Edit to add: Note that I have posted a followup question Segmentation fault related to component of derived type that represents a more realistic use case and further narrows down the conditions under which this seems to occur.
program main
   call sub1 ! seg fault if col size > 2,100,000
   call sub2 ! works fine at col size = 100,000,000
end program main
subroutine sub1
   type table
      real :: col(2100000) = 0.0 ! works if "= 0.0" removed
   end type table
   type(table) :: table1
   table1%col = 1.0
end subroutine sub1
subroutine sub2
   real :: col(100000000) = 0.0
   col = 1.0
end subroutine sub2
Some obvious questions here:
Is this expected behavior, or some bug that was fixed in newer versions of gfortran?
Am I following standard fortran operating procedures here, or doing something wrong?
What is the recommended way to avoid this (please assume that I am unable to update to a newer version of gfortran in the near term)? I will almost certainly solve with an allocatable array component for reasons not specific to this question, but that might not be an ideal general solution and I would like to know of all good options I have here.
In particular, is initializing the components of a derived type bad practice?
This is likely to be a runtime issue due to insufficient stack, rather than a bug with gfortran.
Gfortran uses the stack to store automatic arrays and other initialization data. When code does not create problems when one such array is small, but segfaults when the size of the array increases, a possible reason is running out of stack.
The issue seems to be the same in more recent versions of gfortran. I compiled and ran your program with gfortran 4.8.4, 4.9.3, 5.5.0, 6.4.0, 7.3.0 and 8.2.0. In all cases I obtained a segmentation fault with the default stack size, but no error when the stack size was slightly increased.
$ ./sfa
Segmentation fault
$ ulimit -s
8192
$ ulimit -s 8256
$ ./sfa && echo "DONE"
DONE
Your problem may be solved by running
$ ulimit -s unlimited
before executing your binary. I am not aware of any particular penalty for doing this, but programmers more aware of the fine details of memory management, such as compiler developers, may think otherwise.
Initializing the components of a derived type is not bad practice, but as you can see, it can create problems with the stack if the component is a big array, either because of the storage of the component itself or because of the temporary storage needed to evaluate the RHS of the assignment. If the component is made allocatable and allocated in a subroutine, the array is stored on the heap rather than on the stack, and this issue is usually avoided. In this case, that means setting the values of the array at run time in a subroutine rather than at compile time. It may be less elegant, but I think it's worth it, since it is the kind of development work that prevents avoidable, environment-related errors when executing the binary.
Your code above is standards compliant. As explained in the comments, lack of explicit interfaces for subroutines is not good practice, but for these simple subroutines it's not against the rules.
Some compilers have flags that allow you to change where some objects are allocated in memory. While it may fix a particular issue, flags are compiler dependent, and usually not equivalent when comparing different compilers. Using dynamic memory via allocatables is a more robust solution, according to my experience.
Finally, note that, if you are using OpenMP, the ulimit command above only affects the master thread - you need to set the stack size of each of the other threads via the environment variable OMP_STACKSIZE, which cannot be unlimited. And bear in mind that non-master threads running out of stack are a problem much more difficult to diagnose, since the binary may stop without a proper Segmentation fault error.
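A minimal sketch of the allocatable-component approach described above, assuming a hypothetical module table_alloc_mod and a hypothetical helper init_table (neither name appears in the question). The component is allocated and given its starting value at run time, so it is expected to live on the heap rather than on the stack:

```fortran
module table_alloc_mod
  implicit none
  type table
     ! No default initialization: the component is allocated on demand
     real, allocatable :: col(:)
  end type table
contains
  ! Hypothetical helper: allocate the component and set its initial value
  subroutine init_table(t, n)
    type(table), intent(out) :: t
    integer, intent(in)      :: n
    allocate(t%col(n))
    t%col = 0.0
  end subroutine init_table
end module table_alloc_mod

program main
  use table_alloc_mod
  implicit none
  type(table) :: table1

  ! Same size as the failing case in the question
  call init_table(table1, 2100000)
  table1%col = 1.0
  print *, size(table1%col)
end program main
```

Placing the helper in the module also gives it an explicit interface, which addresses the good-practice remark made in the comments.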
These are not necessarily useful solutions, but below are some conditions under which the seg fault disappears. A couple of people mentioned the lack of an explicit interface (as bad practice though not technically incorrect), and it seems that this might be one key here as either of these two changes to the code gets rid of the seg fault, although it's not quite that simple, as I'll explain:
1. Put everything in main, with no subroutine calls
2. Put the type definition table in a module
Let me expand on #2 briefly. Simply taking the example in the OP and then giving it an explicit interface by putting the subroutine in a module does NOT work. However, if I put the type definition in a module and then use it (as shown below) the segfault does not occur:
program main
   use table_mod
   type(table) :: table1
   table1%col = 1.0
end program main
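The module table_mod itself is not shown in the answer. A plausible sketch, assuming it simply holds the type definition from the question unchanged (the contents of the module are an assumption), would sit ahead of the program in the same source file:

```fortran
module table_mod
  implicit none
  ! Assumed contents: the type from the question, moved out of sub1
  type table
     real :: col(2100000) = 0.0   ! same default initialization as in the question
  end type table
end module table_mod
```

Whether this avoids the segfault on a given gfortran version is something to confirm locally; the answer only reports that it did in that setup.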
I am getting strange reactions to my Fortran95 code from my machine and do not know what is going wrong. Here's the situation:
I am trying to get acquainted with LAPACK and wrote a shamefully simple 1-D "FEM" program just to see how to use LAPACK:
program bla
   ! Solving the easiest of all FE static cases: one-dimensional, axially loaded elastic rod. composed of 2-noded elements
   implicit none
   integer :: nelem, nnodes, i,j, info
   real, parameter :: E=2.1E9, crossec=19.634375E-6, L=1., F=10E3
   real :: initelemL
   real, allocatable :: A(:,:)
   real, allocatable :: b(:), u(:)
   integer, allocatable :: ipiv(:)
   print *,'Number of elements?'
   read *,nelem
   nnodes=nelem+1
   allocate(A(nnodes,nnodes),u(nnodes),b(nnodes), ipiv(nnodes))
   initelemL=L/nelem
   A(1,1)=1
   do i=2, nnodes
      A(1,i)=0
   end do
   do i=2,nnodes
      do j=1,nnodes
         A(i,j)=0
      end do
      A(i,i)=1
      A(i,i-1)=-1
   end do
   b(1)=0 !That's the BC of zero-displacement of the first node
   do i=2,nnodes
      b(i)=((F/crossec)/E +1)*initelemL
   end do
   !calling the LAPACK subroutine:
   call SGESV(nnodes,nnodes, A, nnodes, ipiv, b, nnodes, info)
   print *,info
   print *,b
end program bla
I'm on a Mac, so in order to include LAPACK I compile with:
gfortran -fbacktrace -g -Wall -Wextra -framework accelerate bla.f95
without a warning.
When I run the code, the strange things happen:
If I put in 2 as number of elements, I get the answer "b" as expected.
If I put in 5, I am getting a segmentation fault:
Number of elements?
5
0
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x10bd3eff6
#1 0x10bd3e593
#2 0x7fff98001f19
#3 0x7fff93087d62
#4 0x7fff93085dd1
#5 0x7fff930847e3
#6 0x7fff93084666
#7 0x7fff93083186
#8 0x7fff9696c63f
#9 0x7fff96969393
#10 0x7fff969693b4
#11 0x7fff9696967a
#12 0x7fff96991bb2
#13 0x7fff969ba80e
#14 0x7fff9699efb4
#15 0x7fff9699f013
#16 0x7fff9698f3b9
#17 0x10bdc7cee
#18 0x10bdc8fd6
#19 0x10bdc9936
#20 0x10bdc0f42
#21 0x10bd36c40
#22 0x10bd36d20
Segmentation fault: 11
If I put in 50, I get the answer but THEN the program fails although there is nothing more to do for it:
Number of elements?
50
0
0.00000000 2.48505771E-02 4.97011542E-02 7.45517313E-02 9.94023085E-02 0.124252886 0.149103463 0.173954040 0.198804617 0.223655194 0.248505771 0.273356348 0.298206925 0.323057503 0.347908080 0.372758657 0.397609234 0.422459811 0.447310388 0.472160965 0.497011542 0.521862149 0.546712756 0.571563363 0.596413970 0.621264577 0.646115184 0.670965791 0.695816398 0.720667005 0.745517612 0.770368218 0.795218825 0.820069432 0.844920039 0.869770646 0.894621253 0.919471860 0.944322467 0.969173074 0.994023681 1.01887429 1.04372489 1.06857550 1.09342611 1.11827672 1.14312732 1.16797793 1.19282854 1.21767914 1.24252975
a.out(1070,0x7fff7aa05300) malloc: *** error for object 0x7fbdd9406028: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x106c61ff6
#1 0x106c61593
#2 0x7fff98001f19
Abort trap: 6
This is reproducible. But: if I put another 'print' statement somewhere in the code, the numbers (here: 2,5,50) change. I'm probably committing a rookie mistake here but I am currently feeling rather helpless as it works sometimes and I am not sure how to interpret the Backtrace.
My ideas are currently:
Some really stupid mistake in using SGESV
LAPACK library somehow broken
Some hardware issue with my memory.
Has anybody experienced something like this before and could offer any advice on what is going on?
Thanks in advance, cheers,
N.F.
You are defining b as a 1-dimensional real array and allocating it on
the heap. You are passing it as the 6th parameter to SGESV.
The documentation of SGESV
defines the 6th parameter as a 2-dimensional real array:
\param[in,out] B
\verbatim
B is REAL array, dimension (LDB,NRHS)
On entry, the N-by-NRHS matrix of right hand side matrix B.
On exit, if INFO = 0, the N-by-NRHS solution matrix X.
\endverbatim
SGESV consequently writes into memory locations that it believes to
lie in a 2D array addressed by b when in fact they might, luckily, happen to
lie in the 1D array you have actually passed, or unluckily in who-knows-what
other part of your program's memory layout. So it's vandalizing itself. The
damage done will be unpredictable and will vary according to your input
parameter, which determines the expected and actual size of the mis-allocated
array.
Comparing your call with the documented interface of SGESV, which is
SGESV(N, NRHS, A, LDA, IPIV, B, LDB, INFO), you are passing nnodes as
NRHS. With a single right-hand-side vector, NRHS should be 1.
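For reference, the documented argument order is SGESV(N, NRHS, A, LDA, IPIV, B, LDB, INFO), so with one right-hand side the second argument is 1 and B is an N-by-1 array. A minimal sketch of the call under those assumptions, with the number of elements fixed at 5 instead of read from input, and with the program name rod_fixed chosen only for illustration. It needs to be linked against LAPACK, for example with -framework accelerate as in the question (on other systems -llapack is a common choice, but that is an assumption about the local setup):

```fortran
program rod_fixed
  implicit none
  integer, parameter :: nelem  = 5           ! fixed here; the question reads it from input
  integer, parameter :: nnodes = nelem + 1
  real, parameter :: E = 2.1E9, crossec = 19.634375E-6, L = 1., F = 10E3
  real    :: A(nnodes,nnodes)
  real    :: b(nnodes,1)                     ! N-by-NRHS with NRHS = 1
  integer :: ipiv(nnodes), i, info
  external :: sgesv

  ! Same matrix as in the question: unit diagonal, -1 on the subdiagonal
  A = 0.0
  do i = 1, nnodes
     A(i,i) = 1.0
     if (i > 1) A(i,i-1) = -1.0
  end do

  ! Right-hand side: zero displacement at the first node
  b(1,1)  = 0.0
  b(2:,1) = ((F/crossec)/E + 1)*(L/nelem)

  ! NRHS is 1, so SGESV only touches the nnodes elements of b
  call sgesv(nnodes, 1, A, nnodes, ipiv, b, nnodes, info)

  print *, info
  print *, b(:,1)
end program rod_fixed
```

Keeping b as a rank-1 array of length nnodes also works with most implementations, since the routine has an implicit interface; declaring it as (nnodes,1) just makes the expected shape visible.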
I'm on an ARM Cortex M0 (Nordic NRF51822) using the Segger JLink. When my code hard faults (say, due to dereferencing an invalid pointer), I see only the following stack trace:
(gdb) bt
#0 HardFault_HandlerC (hardfault_args=<optimized out>) at main_display.cpp:440
#1 0x00011290 in ?? ()
I have a hard fault handler installed and it can give me the lr and pc:
(gdb) p/x stacked_pc
$1 = 0x18ea6
(gdb) p/x stacked_lr
$2 = 0x18b35
And I know I can use addr2line to translate these to source code lines:
> arm-none-eabi-addr2line -e main_display.elf 0x18ea6
/Users/cmason/code/nrf/src/../libs/epaper/EPD_Display.cpp:33
> arm-none-eabi-addr2line -e main_display.elf 0x18b35
/Users/cmason/code/nrf/src/../libs/epaper/EPD.cpp:414
Can I get the rest of the backtrace somehow? If I stop at a normal breakpoint I can get a backtrace, so I know GDB can do the (somewhat complex) algorithm to unwind the stack on ARM. I understand that, in the general case, the stack may be screwed up by my code to the point where it's unreadable, but I don't think that's what's happening in this case.
I think this may be complicated by Nordic's memory protection scheme. Their bluetooth stack installs its own interrupt vector and prevents access to certain memory regions. Or maybe this is Segger's fault? On other examples of Cortex M0 do most people see regular back traces from hard faults?
Thanks!
-c
Cortex-M0 and Cortex-M3 are close enough that you can use the answer from this question:
Stack Backtrace for ARM core using GCC compiler (when there is a MSP to PSP switch)
in short: GCC has a function _Unwind_Backtrace to generate a full call stack; this needs to be hacked up a bit to simulate doing a backtrace from before the exception entry happened. Details in the linked question.
I'm trying to port a code from the ifort compiler to the IBM xlf compiler. It works well under ifort on Red Hat, but the results contain "NaNQ" under xlf on an AIX system. It turns out that an out-of-bounds array read in the code causes this problem. Here is a simplified example:
program main
   implicit none
   real(8)::a(1,0:10)=0.D0
   print *, a(1,-1)
end program main
With both compilers I can compile it successfully, without any error or warning.
On ifort I get result:
0.000000000000000E+000
But on xlf, I get:
0.247032822920623272E-322
However, if I read further beyond the boundary, xlf won't compile the code, while ifort compiles it successfully.
program main
   implicit none
   real(8)::a(1,0:10)=0.D0
   print *, a(1,-3:-1)
end program main
On ifort I get:
0.000000000000000E+000 0.000000000000000E+000 0.000000000000000E+000
On xlf it won't compile:
"1.f90", line 5.9: 1516-023 (S) Subscript is out of bounds.
** main === End of Compilation 1 ===
1501-511 Compilation failed for file 1.f90.
Why do ifort and xlf treat this out-of-bounds read differently? Is there any way to make the compiler check this strictly and prevent out-of-bounds reads from happening? After all, it took me a long time to catch this bug in our code, since our group has been using this code for more than 15 years without any problems on ifort. Thanks.
Most Fortran compilers have options to check for array bounds errors at runtime. In these examples with constant indices the error can be found at compile time, which some compilers but not others will do without using non-default options. With ifort use -check bounds to request array bounds checking. You can get additional checking with -check all. These options are generally not the default because there is a runtime cost. But the cost of getting a wrong answer can be much higher!! I have found the runtime cost to frequently be surprisingly low and recommend using runtime checks during code development and even in production if the runtime cost is acceptable.
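When the index is only known at run time, no compiler can flag it at compile time, which is where the runtime checks come in. A minimal sketch, assuming a hypothetical index j that some earlier computation produced (the program name bounds_demo is also illustrative). The explicit lbound/ubound guard does by hand what -check bounds asks ifort to do automatically for every array reference:

```fortran
program bounds_demo
  implicit none
  real(8) :: a(1,0:10) = 0.D0
  integer :: j

  ! Hypothetical index; in real code this would come from a computation
  j = -1

  ! Manual guard against the declared bounds of the second dimension
  if (j < lbound(a,2) .or. j > ubound(a,2)) then
     print *, 'index ', j, ' is outside ', lbound(a,2), ':', ubound(a,2)
  else
     print *, a(1,j)
  end if
end program bounds_demo
```

Without the guard, the read of a(1,j) would be the same silent out-of-bounds access as in the question unless bounds checking is switched on at compile time.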