Debugging with gdb and gfortran - FPEs

I'm debugging a larger numerical program that I have added on to. It is written in Fortran 90, compiled with gfortran (the latest version available for Mac), and I am debugging it using gdb (again the latest version available for Mac).
My additions have a bug somewhere: running the program does not produce the expected result, and I am trying to locate the cause. When I run it in gdb, I get the following output at the end:
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
[Inferior 1 (process 83843) exited normally]
I would like to identify exactly where this FPE occurs, but it seems that a floating point exception does not cause the program to crash. I tested this by explicitly dividing by 0 in my code - it did not cause the program to stop running, but led to unexpected behavior.
What is the proper flag for either gdb or gfortran to ensure that the program stops running (ideally with a backtrace) when it reaches a floating point exception? I tried following the instructions here but it did not appear to change anything.

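You probably need to add these flags when compiling your code (add invalid to the -ffpe-trap list to also trap invalid operations such as 0.0/0.0, which your output reports as well):
gfortran -g -fbacktrace -ffpe-trap=zero,overflow,underflow yourcode.f90 -o run.exe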
Explanation for compiler flags from gfortran manual:
-g
to include debug data
-fbacktrace
Specify that, when a runtime error is encountered or a deadly signal is emitted (segmentation fault, illegal instruction, bus error or floating-point exception), the Fortran runtime library should output a backtrace of the error. This option only has influence for compilation of the Fortran main program.
-ffpe-trap=list
Specify a list of IEEE exceptions when a Floating Point Exception (FPE) should be raised. On most systems, this will result in a SIGFPE signal being sent and the program being interrupted, producing a core file useful for debugging. list is a (possibly empty) comma-separated list of the following IEEE exceptions: invalid (invalid floating point operation, such as SQRT(-1.0)), zero (division by zero), overflow (overflow in a floating point operation), underflow (underflow in a floating point operation), precision (loss of precision during operation) and denormal (operation produced a denormal value).
Some of the routines in the Fortran runtime library, like ‘CPU_TIME’, are likely to trigger floating point exceptions when ffpe-trap=precision is used. For this reason, the use of ffpe-trap=precision is not recommended.
Take a look at these two places for more info:
https://gcc.gnu.org/onlinedocs/gcc-4.3.2/gfortran.pdf
http://faculty.washington.edu/rjl/uwamath583s11/sphinx/notes/html/gfortran_flags.html

Related

EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0) - underlying causes

What are the underlying causes of EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0) and where is it documented? Does it relate to a specific thing in the Intel CPU documentation somewhere maybe?
(There are lots of questions about this exception happening in specific cases, e.g. when using Swift, but I'm staring at the disassembly of a C++ program and have no real understanding of what I'm looking for here.)
The identifier "EXC_BAD_INSTRUCTION" seems to indicate an attempt to execute an invalid opcode. On IA32 hardware, it is a fault of type Invalid Opcode Exception (#UD) and has an interrupt vector value of 6. It can be triggered for a number of reasons, but it generally means that the byte sequence that encodes the incoming instruction is invalid or reserved, or has inconsistent operands relative to the instruction, or is too long (>15 bytes), or uses an IA32 extension not supported by the current hardware, or is the UD2 opcode, or is an attempt to execute some instruction while the machine is in a state that prevents its execution, or falls into some other corner case.
Now, one possible cause for this kind of fault is that the compiler you are using assumes that some hardware features are available on the target (executing) machine and compiles code accordingly. The target machine features can generally be specified as compile flag options. For example, floating point operations and standard math functions will normally only generate x87 FPU instructions, but using the combination of -mfpmath=sse and -msse instructs the compiler to generate SSE scalar instructions for the usual floating point calculations. The SSE instruction sets are an extension feature of the IA32 architecture and are not available on all machines. Portable code for an architecture should be compiled to machine-independent generic code for this architecture.
Another possible but less probable cause for this hardware fault is that the compiler might itself be bugged and generate some invalid byte sequence.
I don't think undefined behavior would lead to an invalid opcode fault on most architectures. While it is true that UB may lead to any kind of erratic behavior, UB conditions are almost never detectable at compile time and, as such, they usually generate a General Protection Fault (#GP) on IA32 (which is called a segmentation fault in Unix nomenclature). Conversely, attempting to generate an undefined opcode is a condition that is always detectable at compile time (unless the code is generated at runtime, or the byte code stream gets misaligned and the instruction pointer points into the middle of some opcode).
The most likely reason is that your code has some undefined behavior, for example writing to an array at a negative index. As a result, on many implementations, you end up writing over the stack structure, and the next call, return or jump has your instruction pointer ending up over data instead of compiled code.
Check with a debugger; with some luck, enough of the stack will be intact to tell you roughly where the issue might be. If not, run with some type of bounds-checking program, Valgrind for example. Address all errors and warnings.

What kind of GCC optimizations may change a double based on whether it is printed or not?

I am debugging code that implements an algorithm whose main loop terminates when a statement à la s >= u || s <= l is true, where s, u and l are doubles that are updated in the main loop. In this example, all three variables are always between 0.5 and 1.5. I am not including the code here, as it is not written by me and extracting a MWE is hard. I am puzzled by the code behaving differently on different architectures, and I'm hoping the clues below can help me narrow down the error in the algorithm.
Some floating point rounding seems to be the root cause of the bug. Here is what I have ascertained so far:
The algorithm terminates correctly on all optimization levels on x86-64.
The algorithm terminates correctly with -O3 (other opt levels were not tried) on arm64, mips64 and ppc64.
The algorithm terminates correctly with -O0 on i686.
The algorithm loops indefinitely with -O1, -O2 and -O3 on i686.
Main point of question: In the cases when the algorithm loops indefinitely, it can be made to terminate correctly if s is printed (std::cout << s << std::endl) before it is compared to l and u.
What kind of compiler optimizations could be relevant here?
All behaviors above were observed on a GNU/Linux system and reproduced with GCC 6.4, 7.3 and 8.1.
Since you say your code works as intended on x86-64 and other instruction sets, but breaks on i686, but only with some optimisation levels, the likely culprit is x86 extended precision.
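On x86, the x87 floating point instructions keep results in 80-bit registers, which have greater precision than the 64-bit double those values become when they are stored in memory. Therefore, when the compiler can re-use a value already held in a register, the result may differ from when it has to save and re-load the value. Printing a value may require saving and re-loading it, which is why the output statement changes the behavior. x86-64 is not affected because GCC uses SSE2 arithmetic there by default, which works in plain double precision.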
This is a well-known non-bug in GCC.
GCC provides a -ffloat-store command-line option which may be of help:
-ffloat-store
  Do not store floating-point variables in registers, and inhibit other options that might change whether a floating-point value is taken from a register or memory.
  This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
As mentioned there though, it doesn't automatically let your code work the same as on other instruction sets. You may need to modify your code to explicitly store results in variables.
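As a rough illustration of "explicitly store results in variables", here is a minimal sketch. The names force_double and should_terminate are hypothetical and not taken from the code in the question; the sketch assumes the excess precision comes from values kept in x87 registers, and uses a volatile store as one common way to make the compiler round the value to a 64-bit double before the comparison.

```cpp
// Hypothetical helper: writing through a volatile variable forces the value
// out of any extended-precision register, rounding it to a 64-bit double.
double force_double(double x) {
    volatile double stored = x;
    return stored;
}

// Hypothetical version of the termination test from the question. It compares
// the rounded value rather than a possibly wider register value.
bool should_terminate(double s, double l, double u) {
    const double rounded = force_double(s);
    return rounded >= u || rounded <= l;
}
```

The main loop would then call should_terminate(s, l, u) instead of evaluating s >= u || s <= l directly. Whether this is enough depends on where the excess precision enters the computation, so l and u may need the same treatment at the point where they are updated.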

Detecting underflow during execution

Is there any way to detect underflow automatically during execution?
Specifically, I believe there should be a compiler option to generate code that checks for underflows and similar flags right after mathematical operations that could cause them.
I'm talking about the G++ compiler.
C99/C++11 have floating point control functions (e.g. fetestexcept) and defined flags (including FE_UNDERFLOW) that should let you detect floating point underflow reasonably portably (i.e., with any compiler/library that supports these).
Though it's not as portable, glibc (the GNU C library used with gcc on Linux) has a feenableexcept function that lets you choose which floating point exceptions are trapped. When one of the exceptions you've enabled fires, your program will receive a SIGFPE signal.
At least on most hardware, there's no equivalent for integer operations -- an underflow simply produces a 2's complement (or whatever) result and (for example) sets the flags (e.g., carry and sign bits) to signal what happened. C99/C++11 do not define flags for things like integer overflow; that case is left to compiler-specific facilities such as GCC's -ftrapv option or __builtin_add_overflow.
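A minimal sketch of the portable flag-testing approach, using only the standard <cfenv> functions feclearexcept and fetestexcept. The function name product_underflows is hypothetical, and the volatile variables are an assumption made here to keep the compiler from folding the multiplication away or moving it across the flag calls.

```cpp
#include <cfenv>

// Returns true if computing a * b raised the underflow flag.
// The volatile variables keep the compiler from evaluating the product at
// compile time or reordering it relative to the fenv calls.
bool product_underflows(double a, double b) {
    volatile double x = a;
    volatile double y = b;
    std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean flag state
    volatile double result = x * y;      // operation under test
    (void)result;
    return std::fetestexcept(FE_UNDERFLOW) != 0;
}
```

For instance, product_underflows(1e-300, 1e-300) should report an underflow, while product_underflows(2.0, 3.0) should not. This only inspects the flags after the fact; it does not stop the program at the offending operation the way a trapped exception would.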

Stopping the debugger when a NaN floating point number is produced

I have a C++ program. Somewhere in the program (hard to reproduce, but reproducible) a calculation results in a float being set to NaN. Since a floating point operation involving a NaN results in a NaN, this spreads fast.
Is there any way I can set up the compiler (gcc 4.4) or the debugger (gdb) to stop when a floating point operation results in a NaN? That would be extremely useful.
Thanks!
Nathan
PS: It might matter: I am working under Ubuntu Linux 10.10.
You could enable floating point exceptions (see "Control Functions" in the glibc manual, e.g. feenableexcept); you will then get a SIGFPE when the NaN value is produced.

Break on NaNs or infs

It is often hard to find the origin of a NaN, since it can happen at any step of a computation and propagate itself.
So is it possible to make a C++ program halt when a computation returns NaN or inf? The best in my opinion would be to have a crash with a nice error message:
Foo: NaN encountered at Foo.c:624
Is something like this possible? Do you have a better solution? How do you debug NaN problems?
EDIT: Clarification: I'm working with GCC under Linux.
You can't do it in a completely portable way, but many platforms provide C APIs that allow you to access the floating point status control register(s).
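Specifically, you want to unmask the overflow and invalid floating-point exceptions, which will cause the processor to signal an exception when arithmetic in your program produces a NaN or infinity result. Note that dividing a nonzero value by zero also yields infinity but raises the separate divide-by-zero exception (FE_DIVBYZERO), so unmask that one too if you want to catch it.
On your Linux system this should do the trick: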
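#define _GNU_SOURCE  /* feenableexcept is a GNU extension; g++ defines this by default */
#include <fenv.h>
...
feenableexcept(FE_INVALID | FE_OVERFLOW);  /* trap on NaN-producing and overflowing operations */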
You may want to learn to write a trap handler so that you can print a diagnostic message or otherwise continue execution when one of these exceptions is signaled.
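To see which exception each kind of bad result corresponds to before deciding what to unmask, the flags can be inspected without trapping. This is a minimal sketch using only the standard <cfenv> functions; the function name flags_raised_by_division is hypothetical, and volatile is used here as an assumption to keep the compiler from folding the division at compile time.

```cpp
#include <cfenv>

// Returns the subset of FE_INVALID, FE_DIVBYZERO and FE_OVERFLOW that was
// raised while computing num / den.
int flags_raised_by_division(double num, double den) {
    volatile double n = num;
    volatile double d = den;
    std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean flag state
    volatile double q = n / d;           // operation under test
    (void)q;
    return std::fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
}
```

With this, 0.0 / 0.0 (a NaN) should report FE_INVALID, 1.0 / 0.0 (an infinity) should report FE_DIVBYZERO rather than FE_OVERFLOW, and 1e308 / 1e-10 (an infinity from a result that is too large) should report FE_OVERFLOW.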
Yes! Set (perhaps more or less portably) your IEEE 754-compliant processor to generate an interrupt when a NaN or infinity is encountered.
I googled and found these slides, which are a start. The slide on page 5 summarizes all the information you need.
I'm no C expert, but I expect the answer is no.
This would require every float calculation to have this check. A huge performance impact.
NaN and Inf aren't evil. They may be legitimately used in some library your app uses, and trapping on them could break it.
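For comparison, the explicit per-value check that the last answer has in mind could look like the sketch below. The names check_finite and CHECK_FINITE are hypothetical; the helper only reports the location in roughly the format asked for in the question and leaves it to the caller to decide whether to abort.

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical helper: prints the location when `value` is NaN or infinite
// and returns whether the value was finite.
bool check_finite(double value, const char* file, int line) {
    if (std::isfinite(value)) {
        return true;
    }
    std::fprintf(stderr, "%s encountered at %s:%d\n",
                 std::isnan(value) ? "NaN" : "inf", file, line);
    return false;
}

// Hypothetical convenience macro that fills in the call site.
#define CHECK_FINITE(x) check_finite((x), __FILE__, __LINE__)
```

A call such as CHECK_FINITE(result) placed after a suspicious computation reports the first point at which a NaN or infinity is seen there. Because every checked value costs a call, this is only practical at selected points, which is the performance concern raised above.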