I use both gcc10 and clang12 on Ubuntu. I just found that if I enable the -ffast-math flag in my C++ project, I get roughly a 4x performance improvement.
However, if I only enable -ffast-math at compile time and not at link time, there will be no performance improvement. What does it mean to use -ffast-math when linking, and will it link to any special ffast-math libraries in the system?
P.S.: This performance improvement actually just makes the performance normal. I once asked a question about the poor performance of AVX instructions on an Intel processor. Now I can get normal performance as long as I use the -ffast-math flag to compile and link the program on Linux, but even if I use clang and -ffast-math on Windows, the performance is still poor. So I wonder whether I have linked in some special system library under Linux.
It turns out GCC links in crtfastmath.o when -ffast-math is passed to the linker (an undocumented feature).
For x86, see https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/i386/crtfastmath.c#L83; it sets the following CPU options:
#define MXCSR_DAZ (1 << 6) /* Enable denormals are zero mode */
#define MXCSR_FTZ (1 << 15) /* Enable flush to zero mode */
Denormalized floating-point numbers are much slower to handle, so disabling them in the CPU makes floating-point computations faster.
From Intel 64 and IA-32 Architectures Optimization Reference Manual:
6.5.3 Flush-to-Zero and Denormals-are-Zero Modes
The flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes are not compatible with the IEEE Standard 754. They are provided to improve performance for applications where underflow is common and where the generation of a denormalized result is not necessary.
3.8.3.3 Floating-point Exceptions in SSE/SSE2/SSE3 Code
Most special situations that involve masked floating-point exceptions are handled efficiently in hardware. When a masked overflow exception occurs while executing SSE/SSE2/SSE3 code, processor hardware can handle it without performance penalty.
Underflow exceptions and denormalized source operands are usually treated according to the IEEE 754 specification, but this can incur significant performance delay. If a programmer is willing to trade pure IEEE 754 compliance for speed, two non-IEEE 754 compliant modes are provided to speed situations where underflows and input denormals are frequent: FTZ mode and DAZ mode.
When the FTZ mode is enabled, an underflow result is automatically converted to a zero with the correct sign. Although this behavior is not compliant with IEEE 754, it is provided for use in applications where performance is more important than IEEE 754 compliance. Since denormal results are not produced when the FTZ mode is enabled, the only denormal floating-point numbers that can be encountered in FTZ mode are the ones specified as constants (read only).
The DAZ mode is provided to handle denormal source operands efficiently when running a SIMD floating-point application. When the DAZ mode is enabled, input denormals are treated as zeros with the same sign. Enabling the DAZ mode is the way to deal with denormal floating-point constants when performance is the objective.
If departing from the IEEE 754 specification is acceptable and performance is critical, run SSE/SSE2/SSE3 applications with FTZ and DAZ modes enabled.
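If you want the same speed-up without relying on the link-time behaviour of -ffast-math, you can set the FTZ and DAZ bits in MXCSR yourself at program startup. A minimal sketch using the SSE intrinsics (x86 only; roughly what crtfastmath.o does, and it deliberately gives up IEEE 754 handling of subnormals):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

int main() {
    // Set FTZ (flush results that underflow to zero) and DAZ (treat
    // subnormal inputs as zero) in the MXCSR control register.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    // ... run the floating-point-heavy code here ...
    return 0;
}

Note that MXCSR is per-thread state, so this only affects SSE/AVX code running on the thread that sets it.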
I am debugging code that implements an algorithm whose main loop terminates when a statement à la s >= u || s <= l is true, where s, u and l are doubles that are updated in the main loop. In this example, all three variables are always between 0.5 and 1.5. I am not including the code here, as it is not written by me and extracting a MWE is hard. I am puzzled by the code behaving differently on different architectures, and I'm hoping the clues below can help me narrow down the error in the algorithm.
Some floating point rounding seems to be the root cause of the bug. Here is what I have ascertained so far:
The algorithm terminates correctly on all optimization levels on x86-64.
The algorithm terminates correctly with -O3 (other opt levels were not tried) on arm64, mips64 and ppc64.
The algorithm terminates correctly with -O0 on i686.
The algorithm loops indefinitely with -O1, -O2 and -O3 on i686.
Main point of question: In the cases when the algorithm loops indefinitely, it can be made to terminate correctly if s is printed (std::cout << s << std::endl) before it is compared to l and u.
What kind of compiler optimizations could be relevant here?
All behaviors above were observed on a GNU/Linux system and reproduced with GCC 6.4, 7.3 and 8.1.
Since you say your code works as intended on x86-64 and other instruction sets but breaks on i686, and only at some optimisation levels, the likely culprit is x86 extended precision.
On x86, floating point instructions store results in registers with greater precision than when those values are subsequently stored in memory. Therefore, when the compiler can re-use the same value already loaded in a register, the results may be different compared to when it has to save and re-load the value. Printing a value may require saving and re-loading it.
This is a well-known non-bug in GCC.
GCC provides a -ffloat-store command-line option which may be of help:
-ffloat-store
Do not store floating-point variables in registers, and inhibit other options that might change whether a floating-point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
As mentioned there though, it doesn't automatically let your code work the same as on other instruction sets. You may need to modify your code to explicitly store results in variables.
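For illustration only (this is not the asker's code), here is a hypothetical sketch of the effect. Built for i686 with the x87 FPU, whether the two branches agree can change with the optimisation level, because the optimiser decides when the quotient is kept in an 80-bit register and when it is stored as a 64-bit double:

#include <cstdio>

// Build e.g. with: g++ -m32 -mfpmath=387 -O0, then try -O2, or add -ffloat-store.
__attribute__((noinline)) void compare(double x, double y) {
    double q = x / y;      // may be rounded to 64 bits when spilled to memory...
    if (q != x / y)        // ...while this operand may still carry 80 bits
        std::puts("excess precision changed the result");
    else
        std::puts("results match");
}

int main() {
    compare(1.0, 3.0);
    return 0;
}

Printing s in your loop has a similar effect: it forces the value out of the 80-bit register, so the subsequent comparison sees the 64-bit rounded value.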
Is there any way to detect underflow automatically during execution?
Specifically, I believe there should be a compiler option to generate code that checks for underflow and similar flags right after mathematical operations that could cause them.
I'm talking about the G++ compiler.
C99/C++11 have floating point control functions (e.g. fetestexcept) and defined flags (including FE_UNDERFLOW) that should let you detect floating point underflow reasonably portably (i.e., with any compiler/library that supports these).
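A minimal C++11 sketch of that polling approach (the pragma is the standard way to request strict FP-environment semantics, though compiler support for it varies):

#include <cfenv>
#include <cstdio>

#pragma STDC FENV_ACCESS ON   // ask the compiler not to optimize across FP-flag accesses

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double tiny = 1e-308;
    volatile double r = tiny / 1e10;      // produces a subnormal, raising FE_UNDERFLOW
    if (std::fetestexcept(FE_UNDERFLOW))
        std::printf("underflow detected, r = %g\n", static_cast<double>(r));
    return 0;
}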
Though it's not as portable, glibc provides feenableexcept, which lets you choose which floating point exceptions are trapped. When one of the exceptions you've enabled fires, your program will receive a SIGFPE signal.
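A sketch of the trapping approach (feenableexcept is a GNU extension, so this is Linux/glibc specific; you may need to link with -lm):

#include <fenv.h>     // feenableexcept (GNU extension; needs _GNU_SOURCE, which g++ defines)
#include <csignal>
#include <cstdio>
#include <cstdlib>

void on_fpe(int) {
    // Keep the handler trivial; returning would re-execute the faulting instruction.
    std::_Exit(1);
}

int main() {
    std::signal(SIGFPE, on_fpe);
    feenableexcept(FE_UNDERFLOW);         // unmask the underflow exception
    volatile double tiny = 1e-308;
    volatile double r = tiny / 1e10;      // underflows -> SIGFPE delivered here
    std::printf("not reached if the trap fired: %g\n", static_cast<double>(r));
    return 0;
}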
At least on most hardware, there's no equivalent for integer operations -- an underflow simply produces a 2's complement (or whatever) result and (for example) sets the flags (e.g., carry and sign bits) to signal what happened. C99/C++11 do have some flags for things like integer overflow, but I don't believe they're nearly as widely supported.
I detected some differences in my program's results between Release and Debug versions. After some research I realized that some floating-point optimizations were causing those differences. I have solved the problem by using the fenv_access pragma to disable some optimizations in a few critical methods.
Thinking about it, I realized that it is probably better to use the fp:strict model instead of fp:precise in my program because of its characteristics, but I am worried about performance. I have tried to find information about the performance cost of fp:strict, or the difference in performance between the precise and strict models, but I have found very little.
Does anyone know anything about this??
Thanks in advance.
This happens because you are compiling in 32-bit mode, which uses the x87 floating-point processor. The code optimizer removes redundant moves from the FPU registers to memory and back, leaving intermediate results on the FPU stack. A pretty important optimization.
The problem is that the FPU stores values with 80 bits of precision, instead of the 64 bits of precision of a double. Intel originally assumed this was a feature, producing more accurate intermediate calculations, but it is really a bug. They didn't make the same mistake when they designed the SSE2 instruction set, which 64-bit compilers use for floating-point math: arithmetic on doubles in the XMM registers is done at exactly 64 bits.
So in the release build you get subtly different results, since the calculations are performed with more bits. This should never be a problem in a program that uses floating-point values to calculate; a double can only store about 15 significant digits. What differs are the noise digits, the ones beyond the first 15, though sometimes fewer if your calculation loses significant digits badly, like calculating 1 - 3 * (1/3.0).
But yeah, you can use fp:precise to get consistent noise digits. It forces the intermediate values to be flushed to memory so they cannot remain in the FPU with 80 bits of precision. It makes your code slow of course.
I am not sure if this is a solution, but it is what I have :)
As I posted previously, I wrote a test program that performs floating-point operations which are said to be optimized under fp:precise but not under fp:strict, and then measured the performance. I ran it 10000 times and, on average, fp:strict is 2.85% slower than fp:precise.
Just offering my two cents:
I have an image-processing program that autovectorizes; the aim was to compare its performance and accuracy, taking MATLAB as the gold standard.
Using VS2012 and an Intel i950.
Critical region error & runtime:
strict:  2.3328196e-02 error, 465 ms
precise: 7.1277611e-02 error, 182 ms
fast:    7.1277611e-02 error, 188 ms
strict did not vectorize.
Using strict slowed the code down by about 2x, which was not acceptable.
It's absolutely normal to see performance difference between a Debug and Release version.
The compiler and runtime do a lot of additional sanity checks in the debug version; don't compare one to the other, especially with regard to performance; compare release vs. release with different compiler switches.
On the other hand, if the results are different between the 2 versions, then you will have to go in and check for programming errors (most probably).
Max.
When I run the exact same code performing the exact same floating point calculation (using doubles) compiled on Windows and Solaris, I get slightly different results.
I am aware that the results are not exact due to rounding errors. However, I would have expected the rounding errors to be platform-independent, thereby giving me the same (slightly incorrect) result on both platforms, which is not the case.
Is this normal, or do I have another problem in my code?
On x86, usually most calculations happen with 80-bit quantities, unless otherwise forced to be double-precision. Most other architectures I know of do all calculations in double-precision (again, unless otherwise overridden).
I don't know if you're running Solaris on SPARC or x86, but if the former, then I highly suspect that to be the cause of the difference.
The subject of your question suggests that it might depend on the compiler. It might, but the fact that you are running on different hardware (assuming your Solaris is not x86) suggests a much more likely reason for the difference - the difference in hardware itself.
Different hardware platforms might use completely different hardware devices (FPUs, CPUs) to perform floating-point calculations, arriving at different results.
Moreover, FPUs are often configurable through persistent settings, like the infinity model, rounding mode, etc. Different hardware might have a different default setup. The compiler will normally generate code that initializes the FPU at program startup, and that initial setup can differ as well.
Finally, different implementations of the C++ language might implement floating-point semantics differently, so you might even get different results from different C++ compilers on the same hardware.
I believe that under Windows/x86, your code will run with the x87 precision already set to 53 bits (double precision), though I'm not sure exactly when this gets set. On Solaris/x86 the x87 FPU is likely to be using its default precision of 64 bits (extended precision), hence the difference.
There's a simple check you can do to detect which precision (53 bits or 64 bits) is being used: try computing something like 1e16 + 2.9999, while being careful to avoid compiler constant-folding optimizations (e.g., define a separate add function to do the addition, and turn off any optimizations that might inline functions). When using 53-bit precision (SSE2, or x87 in double-precision mode) this gives 1e16 + 2; when using 64-bit precision (x87 in extended precision mode) this gives 1e16 + 4. The latter result comes from an effect called 'double rounding', where the result of the addition is rounded first to 64 bits, and then to 53 bits. (Doing this calculation directly in Python, I get 1e16 + 4 on 32-bit Linux, and 1e16+2 on Windows, for exactly the same reason.)
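A hedged sketch of that check (the noinline attribute is GCC-specific and is there only to keep the addition from being constant-folded, as suggested above):

#include <cstdio>

__attribute__((noinline)) double add(double a, double b) { return a + b; }

int main() {
    // 53-bit arithmetic (SSE2, or x87 set to double precision) should print
    // 10000000000000002; 64-bit x87 extended precision followed by a store
    // (double rounding) should print 10000000000000004.
    std::printf("%.0f\n", add(1e16, 2.9999));
    return 0;
}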
Here's a very nice article (that goes significantly beyond the oft-quoted Goldberg's "What every computer scientist should know...") that explains some of the problems arising from the use of the x87 FPU:
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
My application generates different floating-point values when I compile it in release mode and in debug mode. The only reason I found out is that I save a binary trace log, and the one from the release build is ever so slightly off from the debug build; it looks like the bottom two bits of the 32-bit float values differ in about half of the cases.
Would you consider this "difference" to be a bug, or is this type of difference to be expected? Would this be a compiler bug or an internal library bug?
For example:
LEFTPOS and SPACING are defined floating point values.
float def_x;
int xpos;
def_x = LEFTPOS + (xpos * (SPACING / 2));
The issue is with the X360 compiler.
Release mode may have a different FP strategy set. There are different floating point arithmetic modes depending on the level of optimization you'd like. MSVC, for example, has strict, fast, and precise modes.
I know that on PC, the floating-point registers are 80 bits wide. So if a calculation is done entirely within the FPU, you get the benefit of 80 bits of precision. On the other hand, if an intermediate result is spilled out to memory and loaded back, it gets rounded to 32 bits, which gives different results.
Now consider that a release build will have optimisations which keep intermediate results in FPU registers, whereas a debug build will probably naively copy intermediate results back and forward between memory and registers - and there you have your difference in behaviour.
I don't know whether this happens on X360 too or not.
It's not a bug. Any floating-point operation has a certain imprecision. In Release mode, optimization will change the order of the operations and you'll get a slightly different result. The difference should be small, though. If it's big, you might have other problems.
I helped a co-worker find a compiler switch that was different in release vs. debug builds that was causing his differences.
Take a look at /fp (Specify Floating-Point Behavior).
In addition to the different floating-point modes others have pointed out, SSE or similar vector optimizations may be turned on for release. Moving floating-point arithmetic from the x87 registers to the vector registers can affect the lower bits of your results, since arithmetic in the vector registers is performed at the stored width (32 or 64 bits per element) rather than the 80 bits of the x87 registers.
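As a hypothetical illustration, a reduction like the one below can only be vectorized if the compiler reassociates the additions (e.g. keeping several partial sums), and a different summation order accumulates different rounding errors, so the low bits of the result change:

// Scalar order: ((v[0] + v[1]) + v[2]) + ...; a vectorized build may instead
// keep several interleaved partial sums and combine them at the end.
float sum(const float* v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += v[i];
    return s;
}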
Not a bug. This type of difference is to be expected.
For example, some platforms have float registers that use more bits than are stored in memory, so keeping a value in the register can yield a slightly different result compared to storing to memory and re-loading from memory.
This discrepancy may very well be caused by compiler optimization, which is typically enabled in release mode but not in debug mode. For example, the compiler may reorder some of the operations to speed up execution, which can conceivably cause a slight difference in the floating-point result.
So, I would say most likely it is not a bug. If you are really worried about this, try turning on optimization in the Debug mode.
Like others mentioned, floating point registers have higher precision than floats, so the accuracy of the final result depends on the register allocation.
If you need consistency, you can make the variables volatile, which will give slower and less precise, but consistent, results.
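For instance (a sketch only; the LEFTPOS and SPACING values here are made up), volatile temporaries force each intermediate to be rounded to a 32-bit float instead of staying in a wider register:

#define LEFTPOS 10.0f    // hypothetical value
#define SPACING 0.25f    // hypothetical value

float position(int xpos) {
    volatile float half_step = SPACING / 2;    // rounded to float here
    volatile float offset = xpos * half_step;  // and here, not kept in a wide register
    return LEFTPOS + offset;                   // should now round the same way in Debug and Release
}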
If you set a compiler switch that allowed the compiler to reorder floating-point operations, -- e.g. /fp:fast -- then obviously it's not a bug.
If you didn't set any such switch, then it's a bug -- the C and C++ standards don't allow the compilers to reorder operations without your permission.