I have a C++ program. Somewhere in the program (hard to reproduce, but reproducible) a calculation results in a float being set to NaN. Since any floating-point operation involving a NaN results in a NaN, this spreads fast.
Is there any way I can set up the compiler (gcc 4.4) or the debugger (gdb) to stop when a floating-point operation results in a NaN? That would be extremely useful.
Thanks!
Nathan
PS: It might matter: I am working under Ubuntu Linux 10.10.
You could enable floating-point exceptions (see "Control Functions" in the glibc manual); then you'll get a SIGFPE when your NaN value is produced.
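A minimal sketch of that suggestion, assuming glibc on Linux: `feenableexcept` is a GNU extension declared in `<fenv.h>`, and the helper name `trap_on_nan` is made up for illustration. With the invalid-operation exception unmasked, an operation that would produce a NaN raises SIGFPE instead, and gdb stops on SIGFPE by default.

```cpp
#include <fenv.h>   // feenableexcept: GNU extension, not standard C++

// Hypothetical helper: unmask the "invalid operation" exception so that
// operations such as 0.0/0.0 or sqrt(-1.0) raise SIGFPE instead of
// quietly returning NaN. Returns true on success.
bool trap_on_nan() {
    // feenableexcept returns the previously enabled mask, or -1 on failure.
    return feenableexcept(FE_INVALID) != -1;
}
```

Calling `trap_on_nan()` at the top of `main`, building with `-g`, and running under gdb should then stop the program at the operation that triggers the exception.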
I am debugging code that implements an algorithm whose main loop terminates when a statement à la s >= u || s <= l is true, where s, u and l are doubles that are updated in the main loop. In this example, all three variables are always between 0.5 and 1.5. I am not including the code here, as it is not written by me and extracting a MWE is hard. I am puzzled by the code behaving differently on different architectures, and I'm hoping the clues below can help me narrow down the error in the algorithm.
Some floating point rounding seems to be the root cause of the bug. Here is what I have ascertained so far:
The algorithm terminates correctly on all optimization levels on x86-64.
The algorithm terminates correctly with -O3 (other opt levels were not tried) on arm64, mips64 and ppc64.
The algorithm terminates correctly with -O0 on i686.
The algorithm loops indefinitely with -O1, -O2 and -O3 on i686.
Main point of question: In the cases when the algorithm loops indefinitely, it can be made to terminate correctly if s is printed (std::cout << s << std::endl) before it is compared to l and u.
What kind of compiler optimizations could be relevant here?
All behaviors above were observed on a GNU/Linux system and reproduced with GCC 6.4, 7.3 and 8.1.
Since you say your code works as intended on x86-64 and other instruction sets but breaks on i686, and only at some optimisation levels, the likely culprit is x86 extended precision.
On x86, the x87 floating-point instructions keep results in 80-bit registers, which hold more precision than the same values have once they are stored to memory as doubles. Therefore, when the compiler can re-use a value already held in a register, the result may differ from when it has to save and re-load the value. Printing a value may require saving and re-loading it.
This is a well-known non-bug in GCC.
GCC provides a -ffloat-store command-line option which may be of help:
-ffloat-store
Do not store floating-point variables in registers, and inhibit other options that might change whether a floating-point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
As mentioned there though, it doesn't automatically let your code work the same as on other instruction sets. You may need to modify your code to explicitly store results in variables.
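One way to "explicitly store results in variables" is sketched below. It assumes excess x87 precision really is the cause. The helper names `as_stored_double` and `out_of_bounds` are hypothetical, and `volatile` is used here as a simple way to force the store rather than as the method the GCC manual prescribes.

```cpp
// Hypothetical helper: force a value through a memory store so that any
// excess x87 precision is rounded away to a true 64-bit double.
inline double as_stored_double(double x) {
    volatile double stored = x;  // volatile prevents keeping it in a register
    return stored;
}

// The loop-exit test from the question (s >= u || s <= l), applied to
// values that have all been rounded to double first.
inline bool out_of_bounds(double s, double l, double u) {
    const double sv = as_stored_double(s);
    return sv >= as_stored_double(u) || sv <= as_stored_double(l);
}
```

On i686 targets that have SSE2, compiling with `-msse2 -mfpmath=sse` is another common way to avoid x87 arithmetic altogether.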
I'm debugging a larger numerical program that I have added on to. It is written in Fortran 90, compiled with gfortran (the latest version available for Mac), and I am debugging it using gdb (again the latest version available for Mac).
My additions have a bug somewhere, which is clear because running the program does not produce the expected result, and I am trying to locate it. When I run it in gdb, I get the following output at the end:
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
[Inferior 1 (process 83843) exited normally]
I would like to identify exactly where this FPE occurs, but it seems that a floating point exception does not cause the program to crash. I tested this by explicitly dividing by 0 in my code - it did not cause the program to stop running, but led to unexpected behavior.
What is the proper flag for either gdb or gfortran to ensure that the program stops running (ideally with a backtrace) when it reaches a floating point exception? I tried following the instructions here but it did not appear to change anything.
Probably you need to add these flags when compiling your code:
gfortran -g -fbacktrace -ffpe-trap=zero,overflow,underflow yourcode.f90 -o run.exe
Explanation for compiler flags from gfortran manual:
-g
to include debug data
-fbacktrace
Specify that, when a runtime error is encountered or a deadly signal is emitted (segmentation fault, illegal instruction, bus error or floating-point exception), the Fortran runtime library should output a backtrace of the error. This option only has influence for compilation of the Fortran main program.
-ffpe-trap=list
Specify a list of IEEE exceptions when a Floating Point Exception (FPE) should be raised. On most systems, this will result in a SIGFPE signal being sent and the program being interrupted, producing a core file useful for debugging. list is a (possibly empty) comma-separated list of the following IEEE exceptions: invalid (invalid floating point operation, such as SQRT(-1.0)), zero (division by zero), overflow (overflow in a floating point operation), underflow (underflow in a floating point operation), precision (loss of precision during operation) and denormal (operation produced a denormal value).
Some of the routines in the Fortran runtime library, like ‘CPU_TIME’, are likely to trigger floating point exceptions when ffpe-trap=precision is used. For this reason, the use of ffpe-trap=precision is not recommended.
Take a look at these two places for more info:
https://gcc.gnu.org/onlinedocs/gcc-4.3.2/gfortran.pdf
http://faculty.washington.edu/rjl/uwamath583s11/sphinx/notes/html/gfortran_flags.html
When I run the exact same code performing the exact same floating point calculation (using doubles) compiled on Windows and Solaris, I get slightly different results.
I am aware that the results are not exact due to rounding errors. However, I would have expected the rounding errors to be platform-independent, thereby giving me the same (slightly incorrect) result on both platforms, which is not the case.
Is this normal, or do I have another problem in my code?
On x86, usually most calculations happen with 80-bit quantities, unless otherwise forced to be double-precision. Most other architectures I know of do all calculations in double-precision (again, unless otherwise overridden).
I don't know if you're running Solaris on SPARC or x86, but if the former, then I highly suspect that to be the cause of the difference.
The subject of your question suggests that it might depend on the compiler. It might, but the fact that you are running on different hardware (assuming your Solaris is not x86) suggests a much more likely reason for the difference - the difference in hardware itself.
Different hardware platforms might use completely different hardware devices (FPUs, CPUs) to perform floating-point calculations, arriving at different results.
Moreover, FPUs are often configurable through persistent settings such as the infinity model and the rounding mode. Different hardware might have a different default setup. The compiler will normally generate code that initializes the FPU at program startup, but that initial setup can differ as well.
Finally, different implementations of the C++ language might implement floating-point semantics differently, so you might even get different results from different C++ compilers on the same hardware.
I believe that under Windows/x86, your code will run with the x87 precision already set to 53 bits (double precision), though I'm not sure exactly when this gets set. On Solaris/x86 the x87 FPU is likely to be using its default precision of 64 bits (extended precision), hence the difference.
There's a simple check you can do to detect which precision (53 bits or 64 bits) is being used: try computing something like 1e16 + 2.9999, while being careful to avoid compiler constant-folding optimizations (e.g., define a separate add function to do the addition, and turn off any optimizations that might inline functions). When using 53-bit precision (SSE2, or x87 in double-precision mode) this gives 1e16 + 2; when using 64-bit precision (x87 in extended precision mode) this gives 1e16 + 4. The latter result comes from an effect called 'double rounding', where the result of the addition is rounded first to 64 bits, and then to 53 bits. (Doing this calculation directly in Python, I get 1e16 + 4 on 32-bit Linux, and 1e16+2 on Windows, for exactly the same reason.)
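The check described above can be sketched as follows. The function names are hypothetical, and `volatile` variables are used here in place of a separate non-inlined function to keep the compiler from folding the sum at compile time.

```cpp
// Hypothetical helper: add through volatile variables so the compiler
// cannot evaluate the sum at compile time.
double add_at_runtime(double a, double b) {
    volatile double x = a;
    volatile double y = b;
    volatile double sum = x + y;  // the store rounds the result to 53 bits
    return sum;
}

// Returns 2.0 when the addition is rounded once, directly to 53 bits
// (SSE2, or x87 in double-precision mode), and 4.0 when it is rounded
// first to 64 bits (x87 extended precision) and then to 53 bits.
double precision_probe() {
    return add_at_runtime(1e16, 2.9999) - 1e16;
}
```

Printing `precision_probe()` on each platform shows which of the two cases applies there.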
Here's a very nice article (that goes significantly beyond the oft-quoted Goldberg's "What every computer scientist should know...") that explains some of the problems arising from the use of the x87 FPU:
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
It is often hard to find the origin of a NaN, since it can happen at any step of a computation and propagate itself.
So is it possible to make a C++ program halt when a computation returns NaN or inf? The best in my opinion would be to have a crash with a nice error message:
Foo: NaN encountered at Foo.c:624
Is something like this possible? Do you have a better solution? How do you debug NaN problems?
EDIT: Clarification: I'm working with GCC under Linux.
You can't do it in a completely portable way, but many platforms provide C APIs that allow you to access the floating point status control register(s).
Specifically, you want to unmask the overflow and invalid floating-point exceptions, which will cause the processor to signal an exception when arithmetic in your program produces a NaN or infinity result.
On your linux system this should do the trick:
#include <fenv.h>
...
feenableexcept(FE_INVALID | FE_OVERFLOW);
You may want to learn to write a trap handler so that you can print a diagnostic message or otherwise continue execution when one of these exceptions is signaled.
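A rough sketch of such a trap handler, assuming a POSIX system and that the exceptions have already been unmasked with `feenableexcept` as above. The names `on_fpe` and `install_fpe_handler` are made up for illustration. This version reports and exits; it does not try to continue execution.

```cpp
#include <signal.h>   // sigaction, siginfo_t, FPE_* codes (POSIX)
#include <unistd.h>   // write, _exit (POSIX)

// Hypothetical handler: say which exception fired, then exit.
// Only async-signal-safe calls (write, _exit) are used, and the handler
// never returns, because resuming after a hardware-generated SIGFPE is
// not well defined.
extern "C" void on_fpe(int, siginfo_t* info, void*) {
    static const char invalid[]  = "SIGFPE: invalid operation (NaN)\n";
    static const char overflow[] = "SIGFPE: overflow (inf)\n";
    static const char other[]    = "SIGFPE: other arithmetic exception\n";
    ssize_t ignored = 0;
    if (info->si_code == FPE_FLTINV)
        ignored = write(STDERR_FILENO, invalid, sizeof invalid - 1);
    else if (info->si_code == FPE_FLTOVF)
        ignored = write(STDERR_FILENO, overflow, sizeof overflow - 1);
    else
        ignored = write(STDERR_FILENO, other, sizeof other - 1);
    (void)ignored;
    _exit(1);
}

// Hypothetical helper: install the handler; returns true on success.
bool install_fpe_handler() {
    struct sigaction sa = {};
    sa.sa_sigaction = on_fpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGFPE, &sa, nullptr) == 0;
}
```

The handler only identifies the kind of exception. For the source location, a debugger or a core file is usually the simpler route.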
Yes! Set (perhaps more or less portably) your IEEE 754-compliant processor to generate an interrupt when a NaN or infinity is encountered.
I googled and found these slides, which are a start. The slide on page 5 summarizes all the information you need.
I'm no C expert, but I expect the answer is no.
This would require every float calculation to have this check. A huge performance impact.
NaN and Inf aren't evil. They may be used legitimately in some library your app uses, and trapping on them would break it.
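On the performance concern above: IEEE 754 hardware records exceptions in sticky status flags as a side effect of each operation, so a program can poll the flags at a few checkpoints instead of testing every result. The sketch below uses the standard `<cfenv>` functions; the helper `fp_checkpoint` and the macro `FP_CHECKPOINT` are hypothetical names. Optimizing compilers may fold or move floating-point operations across such a check, so this is best treated as a debugging aid.

```cpp
#include <cfenv>
#include <cstdio>

// Hypothetical checkpoint helper: returns true (and reports the location)
// if an invalid operation or overflow has been flagged since the flags
// were last cleared.
inline bool fp_checkpoint(const char* file, int line) {
    const int raised = std::fetestexcept(FE_INVALID | FE_OVERFLOW);
    if (raised == 0) return false;
    std::fprintf(stderr, "NaN or inf produced before %s:%d\n", file, line);
    std::feclearexcept(FE_INVALID | FE_OVERFLOW);  // re-arm for the next checkpoint
    return true;
}

#define FP_CHECKPOINT() fp_checkpoint(__FILE__, __LINE__)
```

Placing `FP_CHECKPOINT()` after each major stage of a computation, and then bisecting between the last clean checkpoint and the first one that fires, narrows down where the NaN originates without trapping inside third-party libraries.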
My application generates different floating-point values when I compile it in release mode and in debug mode. I only found out because I save a binary trace log, and the log from the release build is ever so slightly off from the debug build's: the bottom two bits of the 32-bit float values differ in about half of the cases.
Would you consider this "difference" to be a bug, or would this type of difference be expected? Would this be a compiler bug or an internal library bug?
For example:
LEFTPOS and SPACING are defined floating point values.
float def_x;
int xpos;
def_x = LEFTPOS + (xpos * (SPACING / 2));
The issue is in regards to the X360 compiler.
Release mode may have a different FP strategy set. There are different floating point arithmetic modes depending on the level of optimization you'd like. MSVC, for example, has strict, fast, and precise modes.
I know that on PC, the x87 floating-point registers are 80 bits wide. So if a calculation is done entirely within the FPU, you get the benefit of 80 bits of precision. On the other hand, if an intermediate result is moved out to memory and back, it gets rounded to the width of its type (32 bits for a float), which gives different results.
Now consider that a release build will have optimisations which keep intermediate results in FPU registers, whereas a debug build will probably naively copy intermediate results back and forward between memory and registers - and there you have your difference in behaviour.
I don't know whether this happens on X360 too or not.
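One way to find out for a given toolchain is to look at the standard `FLT_EVAL_METHOD` macro, sketched below. This assumes the compiler's `<cfloat>` defines it (it is part of C++11); whether the X360 compiler does is not something the thread establishes. The function name `eval_method` is made up for illustration.

```cpp
#include <cfloat>

// Describe how the current build evaluates floating-point expressions,
// based on the standard FLT_EVAL_METHOD macro from <cfloat>.
const char* eval_method() {
    switch (FLT_EVAL_METHOD) {
        case 0:  return "each operation is evaluated in its own type";
        case 1:  return "float and double are evaluated as double";
        case 2:  return "all operations are evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
}
```

A value of 2 corresponds to the situation described above, where intermediate results carry more precision than the variables they are eventually stored in.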
It's not a bug. Any floating point operation has a certain imprecision. In Release mode, optimization will change the order of the operations and you'll get a slightly different result. The difference should be small, though. If it's big you might have other problems.
I helped a co-worker find a compiler switch that was different in release vs. debug builds that was causing his differences.
Take a look at /fp (Specify Floating-Point Behavior).
In addition to the different floating-point modes others have pointed out, SSE or similar vector optimizations may be turned on for release. Converting floating-point arithmetic from standard registers to vector registers can have an effect on the lower bits of your results, as the vector registers will generally be narrower (fewer bits) than the standard floating-point registers.
Not a bug. This type of difference is to be expected.
For example, some platforms have float registers that use more bits than are stored in memory, so keeping a value in the register can yield a slightly different result compared to storing to memory and re-loading from memory.
This discrepancy may very well be caused by the compiler optimization, which is typically done in the release mode, but not in debug mode. For example, the compiler may reorder some of the operations to speed up execution, which can conceivably cause a slight difference in the floating point result.
So, I would say most likely it is not a bug. If you are really worried about this, try turning on optimization in the Debug mode.
Like others mentioned, floating point registers have higher precision than floats, so the accuracy of the final result depends on the register allocation.
If you need consistent results, you can make the variables volatile, which will result in slower, less precise, but consistent results.
If you set a compiler switch that allowed the compiler to reorder floating-point operations (e.g. /fp:fast), then obviously it's not a bug.
If you didn't set any such switch, then it's a bug: the C and C++ standards don't allow the compilers to reorder operations without your permission.
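To see why reordering needs permission, here is a small sketch showing that floating-point addition is not associative. The function names are hypothetical, and the `volatile` intermediates are there only to force each partial sum to be rounded to a 32-bit float.

```cpp
// Sum three floats left to right: (a + b) + c.
float sum_left(float a, float b, float c) {
    volatile float t = a + b;  // volatile store rounds the intermediate to float
    volatile float r = t + c;
    return r;
}

// Same operands, regrouped: a + (b + c).
float sum_right(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}
```

With `a = 1e8f`, `b = -1e8f` and `c = 1.0f`, `sum_left` returns 1.0f, while `sum_right` returns 0.0f, because `-1e8f + 1.0f` rounds back to `-1e8f` (adjacent floats near 1e8 are 8 apart).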