My application generates different floating point values depending on whether I compile it in release mode or in debug mode. I only found out because I save a binary trace log, and the one from the release build is ever so slightly off from the debug build: the bottom two bits of the 32-bit float values differ in about half of the cases.
Would you consider this difference a bug, or is this type of difference expected? Would it be a compiler bug or an internal library bug?
For example:
LEFTPOS and SPACING are defined floating point values.
float def_x;
int xpos;
def_x = LEFTPOS + (xpos * (SPACING / 2));
The issue is in regards to the X360 compiler.
Release mode may have a different FP strategy set. There are different floating point arithmetic modes depending on the level of optimization you'd like. MSVC, for example, has strict, fast, and precise modes.
I know that on PC, floating point registers are 80 bits wide. So if a calculation is done entirely within the FPU, you get the benefit of 80 bits of precision. On the other hand, if an intermediate result is stored out to memory and loaded back, it gets rounded to 32 bits (for a float), which gives different results.
Now consider that a release build will have optimisations which keep intermediate results in FPU registers, whereas a debug build will probably naively copy intermediate results back and forward between memory and registers - and there you have your difference in behaviour.
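To make the mechanism concrete, here is a minimal sketch that evaluates the expression from the question twice: once rounding every step to float, and once carrying the intermediates in double and rounding only at the end. The values of LEFTPOS and SPACING and all function names are hypothetical (the question does not give them), and double merely stands in for a wider FPU register. Whether the two results actually differ depends on the inputs and the target.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical values; the question does not give the real ones.
const float LEFTPOS = 10.1f;
const float SPACING = 0.3f;

// Every intermediate is rounded to float, as when each step is stored to memory.
float def_x_float_steps(int xpos) {
    volatile float half = SPACING / 2;
    volatile float offset = xpos * half;
    volatile float result = LEFTPOS + offset;
    return result;
}

// Intermediates are carried in double (standing in for a wider register)
// and rounded to float only once, at the end.
float def_x_wide_steps(int xpos) {
    double wide = static_cast<double>(LEFTPOS)
                + xpos * (static_cast<double>(SPACING) / 2);
    return static_cast<float>(wide);
}

// Raw bit pattern of a float, for comparing two results bit by bit.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Comparing float_bits(def_x_float_steps(x)) with float_bits(def_x_wide_steps(x)) over a range of x shows which inputs, if any, end up with different low bits.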
I don't know whether this happens on X360 too or not.
It's not a bug. Any floating point operation has a certain imprecision. In Release mode, optimization will change the order of the operations and you'll get a slightly different result. The difference should be small, though. If it's big you might have other problems.
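Floating point addition is not associative, which is why reordering can change the result. A minimal sketch (the function names are made up for illustration; volatile is only there to stop the compiler from folding or reassociating the sums itself):

```cpp
// Computes (a + b) + c.
double sum_left_to_right(double a, double b, double c) {
    volatile double t = a + b;   // force the partial sum to be rounded to double
    return t + c;
}

// Computes a + (b + c).
double sum_right_to_left(double a, double b, double c) {
    volatile double t = b + c;   // force the partial sum to be rounded to double
    return a + t;
}
```

With the deliberately extreme inputs a = 1e20, b = -1e20, c = 1.0, the first function returns 1 and the second returns 0, because the 1.0 is lost when it is added to -1e20 first. With ordinary inputs the two orders typically disagree only in the last bit or two, if at all.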
I helped a co-worker find a compiler switch that was different in release vs. debug builds that was causing his differences.
Take a look at /fp (Specify Floating-Point Behavior).
In addition to the different floating-point modes others have pointed out, SSE or similar vector optimizations may be turned on for release. Converting floating-point arithmetic from standard registers to vector registers can have an effect on the lower bits of your results, as each value in a vector register is generally narrower (fewer bits) than in the standard floating-point registers.
Not a bug. This type of difference is to be expected.
For example, some platforms have float registers that use more bits than are stored in memory, so keeping a value in the register can yield a slightly different result compared to storing to memory and re-loading from memory.
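One way to check whether your compiler claims to evaluate in a wider format than the declared type is the standard FLT_EVAL_METHOD macro from <cfloat>. This is only a sketch: the function name is made up, and how faithfully a given compiler and set of switches report this macro is something I would verify rather than assume.

```cpp
#include <cfloat>
#include <string>

// Describes the evaluation method the compiler reports for this build.
std::string describe_eval_method() {
    switch (FLT_EVAL_METHOD) {
        case 0:  return "each operation is evaluated in the precision of its type";
        case 1:  return "float and double operations are evaluated as double";
        case 2:  return "all operations are evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
}
```

A value of 2 is what you would expect from x87 code generation, where registers hold more bits than a float or double in memory.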
This discrepancy may very well be caused by the compiler optimization, which is typically done in the release mode, but not in debug mode. For example, the compiler may reorder some of the operations to speed up execution, which can conceivably cause a slight difference in the floating point result.
So, I would say most likely it is not a bug. If you are really worried about this, try turning on optimization in the Debug mode.
Like others mentioned, floating point registers have higher precision than floats, so the accuracy of the final result depends on the register allocation.
If you need consistent results, you can make the variables volatile, which will result in slower, less precise, but consistent results.
If you set a compiler switch that allows the compiler to reorder floating-point operations (e.g. /fp:fast), then obviously it's not a bug.
If you didn't set any such switch, then it's a bug: the C and C++ standards don't allow compilers to reorder operations without your permission.
I am debugging code that implements an algorithm whose main loop terminates when a statement à la s >= u || s <= l is true, where s, u and l are doubles that are updated in the main loop. In this example, all three variables are always between 0.5 and 1.5. I am not including the code here, as it is not written by me and extracting a MWE is hard. I am puzzled by the code behaving differently on different architectures, and I'm hoping the clues below can help me narrow down the error in the algorithm.
Some floating point rounding seems to be the root cause of the bug. Here is what I have ascertained so far:
The algorithm terminates correctly on all optimization levels on x86-64.
The algorithm terminates correctly with -O3 (other opt levels were not tried) on arm64, mips64 and ppc64.
The algorithm terminates correctly with -O0 on i686.
The algorithm loops indefinitely with -O1, -O2 and -O3 on i686.
Main point of question: In the cases when the algorithm loops indefinitely, it can be made to terminate correctly if s is printed (std::cout << s << std::endl) before it is compared to l and u.
What kind of compiler optimizations could be relevant here?
All behaviors above were observed on a GNU/Linux system and reproduced with GCC 6.4, 7.3 and 8.1.
Since you say your code works as intended on x86-64 and other instruction sets, but breaks on i686, but only with some optimisation levels, the likely culprit is x86 extended precision.
On x86, floating point instructions store results in registers with greater precision than when those values are subsequently stored in memory. Therefore, when the compiler can re-use the same value already loaded in a register, the results may be different compared to when it has to save and re-load the value. Printing a value may require saving and re-loading it.
This is a well-known non-bug in GCC.
GCC provides a -ffloat-store command-line option which may be of help:
-ffloat-store
Do not store floating-point variables in registers, and inhibit other options that might change whether a floating-point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
As mentioned there though, it doesn't automatically let your code work the same as on other instruction sets. You may need to modify your code to explicitly store results in variables.
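As a sketch of what "explicitly store results in variables" could look like for the loop condition in the question: round s to a real double before comparing it. Both function names are hypothetical, and using volatile to force the store is my assumption about one way to do it, not something the GCC documentation prescribes.

```cpp
// Hypothetical helper: forces x through a memory store, which discards
// any excess precision a register may have been carrying.
inline double store_as_double(double x) {
    volatile double stored = x;
    return stored;
}

// Hypothetical termination test mirroring "s >= u || s <= l" from the question.
inline bool should_stop(double s, double l, double u) {
    const double s_rounded = store_as_double(s);
    return s_rounded >= u || s_rounded <= l;
}
```

This would also be consistent with printing s making the loop terminate: if the print forces s to be stored and re-loaded, it has the same rounding effect.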
So I have been researching how the variable uint8 works and I have realized that it is actually not faster than int! In order to multiply, divide, add, or subtract, the program must turn uint8 into an int which will make it about the same speed or slightly slower.
Why did C++ not implement multiplying, dividing, adding, or subtracting directly to uint8?
Because the optimal way of doing that is platform specific.
Most CPUs provide these operations as assembler instructions that work on integer values of a specific default size (e.g. 32 or 64 bits); they may or may not have such instructions for uint8 values.
The bit size is usually optimized for the CPU's cache line mechanisms.
So the optimal implementation depends on the available target CPU instructions and cannot be covered by the C++ standard.
I'm not sure whether or not a compiler will produce 8-bit arithmetic operations for uint8_t when appropriate (quite unlikely, since it is unlikely to be faster).
@harold mentioned that what I said before is no longer so true on modern hardware. The partial register update problem is no longer so serious for 8-bit operations, so the point is just that most 8-bit operations are not faster. 8-bit division is a little faster, though, and I'm trying to figure out why MS's compiler won't use it. (Not so sure: the partial update problem has only been mostly reduced, not completely removed, and AMD has even kept it, so that one-cycle benefit of 8-bit division may just not be worth it.)
Original:
On modern x86 processors, 8-bit operations face a problem called partial register update: you change only part of the full register, which results in a false dependency that seriously impacts performance.
And FYI, at the language level there is no arithmetic for integral types smaller than int in C++. There is the usual arithmetic promotion to lift the type.
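A small sketch of that promotion in action (the function names are made up for illustration):

```cpp
#include <cstdint>
#include <type_traits>

// Both operands are promoted to int before the addition, so the result type is int.
static_assert(std::is_same<decltype(std::uint8_t{} + std::uint8_t{}), int>::value,
              "uint8_t + uint8_t has type int");

// The sum is computed in int, so it does not wrap at 255.
int add_promoted(std::uint8_t a, std::uint8_t b) {
    return a + b;
}

// Wrapping to 8 bits only happens when the int result is converted back.
std::uint8_t add_wrapped(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b);
}
```

For example, add_promoted(200, 100) returns 300, while add_wrapped(200, 100) returns 44 (300 modulo 256).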
I detected some differences in my program results between Release and Debug versions. After some research I realized that some floating point optimizations are causing those differences. I have solved the problem by using the fenv_access pragma for disabling some optimizations for some critical methods.
Thinking about it, I realized that it is probably better to use the fp:strict model instead of fp:precise in my program because of its characteristics, but I am worried about performance. I have tried to find some information about the performance issues of fp:strict, or about the differences in performance between the precise and strict models, but I have found very little.
Does anyone know anything about this?
Thanks in advance.
This happens because you are compiling in 32-bit mode, which uses the x86 floating point processor. The code optimizer removes redundant moves from the FPU registers to memory and back, leaving intermediary results on the FPU stack. A pretty important optimization.
The problem is that the FPU holds those intermediary values in an 80-bit format instead of the 64 bits of a double. Intel originally assumed this was a feature, producing more accurate intermediary calculations, but it is really a bug. They didn't make the same mistake when they designed the SSE2 instruction set, used by 64-bit compilers to do floating point math: the XMM registers are 128 bits wide, but each double in them is held and computed in 64 bits.
So in the release mode build you get subtly different results since the calculations are performed with more bits. This should never be a problem in a program that uses floating point values to calculate, a double can only store 15 significant digits. What's different are the noise digits, the ones beyond the first 15 digits. But sometimes less if your calculation loses significant digits badly. Like calculating 1 - 3 * (1/3.0).
But yeah, you can use fp:precise to get consistent noise digits. It forces the intermediate values to be flushed to memory so they cannot remain in the FPU with 80 bits of precision. It makes your code slow of course.
I am not sure if this is a solution but is what I have :)
As I posted previously, I wrote a test program that performs floating point operations that are said to be optimized under fp:precise but not under fp:strict, and then measured performance. I ran it 10000 times and, on average, fp:strict is 2.85% slower than fp:precise.
Just offering my two cents:
I have an image processing program that autovectorizes; the aim was to compare performance and accuracy, taking MATLAB as the gold standard.
Using VS2012 and an Intel i950.
Critical region error and runtime:
/fp model   error           runtime
strict      2.3328196e-02   465 ms
precise     7.1277611e-02   182 ms
fast        7.1277611e-02   188 ms
With strict, the code was not vectorized.
Using strict slowed the code down by more than 2x, which was not acceptable.
It's absolutely normal to see performance difference between a Debug and Release version.
The compiler and run-times will do a lot more additional sanity checks in debug version; don't compare one to the other, especially in regards to performance; compare release vs. release with different compiler switches.
On the other hand, if the results are different between the 2 versions, then you will have to go in and check for programming errors (most probably).
Max.
When I run the exact same code performing the exact same floating point calculation (using doubles) compiled on Windows and Solaris, I get slightly different results.
I am aware that the results are not exact due to rounding errors. However, I would have expected the rounding errors to be platform-independent, thereby giving me the same (slightly incorrect) result on both platforms, which is not the case.
Is this normal, or do I have another problem in my code?
On x86, usually most calculations happen with 80-bit quantities, unless otherwise forced to be double-precision. Most other architectures I know of do all calculations in double-precision (again, unless otherwise overridden).
I don't know if you're running Solaris on SPARC or x86, but if the former, then I highly suspect that to be the cause of the difference.
The subject of your question suggests that it might depend on the compiler. It might, but the fact that you are running on different hardware (assuming your Solaris is not x86) suggests a much more likely reason for the difference - the difference in hardware itself.
Different hardware platforms might use completely different hardware devices (FPUs, CPUs) to perform floating-point calculations, arriving at different results.
Moreover, FPUs are often configurable through persistent settings, like infinity model, rounding mode etc. Different hardware might have a different default setup. The compiler will normally generate code that initializes the FPU at program startup, but that initial setup can be different as well.
Finally, different implementations of the C++ language might implement floating-point semantics differently, so you might even get different results from different C++ compilers on the same hardware.
I believe that under Windows/x86, your code will run with the x87 precision already set to 53 bits (double precision), though I'm not sure exactly when this gets set. On Solaris/x86 the x87 FPU is likely to be using its default precision of 64 bits (extended precision), hence the difference.
There's a simple check you can do to detect which precision (53 bits or 64 bits) is being used: try computing something like 1e16 + 2.9999, while being careful to avoid compiler constant-folding optimizations (e.g., define a separate add function to do the addition, and turn off any optimizations that might inline functions). When using 53-bit precision (SSE2, or x87 in double-precision mode) this gives 1e16 + 2; when using 64-bit precision (x87 in extended precision mode) this gives 1e16 + 4. The latter result comes from an effect called 'double rounding', where the result of the addition is rounded first to 64 bits, and then to 53 bits. (Doing this calculation directly in Python, I get 1e16 + 4 on 32-bit Linux, and 1e16+2 on Windows, for exactly the same reason.)
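A minimal sketch of that check. The function name is made up, and instead of a separate non-inlined add function I use volatile variables to keep the compiler from constant-folding the sum; that is a substitution on my part, not what the paragraph above describes.

```cpp
// Returns (1e16 + 2.9999) - 1e16, computed at run time.
// 2.0 suggests 53-bit precision (SSE2, or x87 in double-precision mode);
// 4.0 suggests 64-bit precision (x87 in extended-precision mode).
double precision_probe() {
    volatile double big = 1e16;
    volatile double small = 2.9999;
    volatile double sum = big + small;   // rounded to double when stored
    return sum - big;                    // exact: yields 2.0 or 4.0
}
```

Printing precision_probe() on each platform (or in each build configuration) shows which of the two cases applies.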
Here's a very nice article (that goes significantly beyond the oft-quoted Goldberg's "What every computer scientist should know...") that explains some of the problems arising from the use of the x87 FPU:
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
We can use left shift operators in C/C++ as a faster way to multiply integers by powers of 2.
But we cannot use left shift operators on floats or doubles, because they are represented in a different way, with an exponent component and a mantissa component.
My question is:
Is there any way, like the left shift operator for integers, to multiply float numbers faster, even if only by powers of 2?
No, you can't. But depending on your problem, you might be able to use SIMD instructions to perform one operation on several packed variables. Read about the SSE2 instruction set.
http://en.wikipedia.org/wiki/SSE2
http://softpixel.com/~cwright/programming/simd/sse2.php
In any event, if you are optimizing floating-point multiplications, you are in 99% of the cases looking in the wrong place. Without going on a major rant regarding premature optimization, at least justify it by performing proper profiling.
You could do this:
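float f = 5.0f;
int* i = (int*)&f;    // view the float's bits as an integer (breaks strict aliasing, so formally undefined behavior)
*i += 0x00800000;     // add 1 to the exponent field, i.e. multiply f by 2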
But then you have the overhead of moving the float out of the register, into memory, then back into a different register, only to be flushed back to memory ... about 15 or so cycles more than if you'd just done fmul. Of course, that's even assuming your system has IEEE floats at all.
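If you want to experiment with this anyway, here is a sketch of the same exponent trick written with std::memcpy, which avoids the pointer cast, next to the portable standard-library route, std::ldexp. The function names are made up, the bit version assumes IEEE 754 single-precision floats, and neither is claimed to be faster than a plain multiplication.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "sketch assumes IEEE 754 binary32 floats");

// Hypothetical helper: doubles f by adding 1 to its exponent field.
// Only valid for normal, finite values whose doubled result does not overflow.
float double_via_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // copy the bits out instead of casting pointers
    bits += 0x00800000u;                   // the exponent field starts at bit 23
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Portable equivalent: std::ldexp(x, n) computes x * 2^n.
float double_via_ldexp(float f) {
    return std::ldexp(f, 1);
}
```

Both functions turn 5.0f into 10.0f. The ldexp version also handles zero, denormals, infinities and NaN, which the bit version does not.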
Don't try to optimize this. You should look at the rest of your program to find algorithmic optimizations instead of trying to discover ways to microoptimize things like floats. It will only end in blood and tears.
Truly, any decent compiler would recognize compile-time power-of-two constants and use the smartest operation.
In Microsoft Visual C++, don't forget the "floating point model" switch. The default is /fp:precise but you can change it to /fp:fast. The fast model trades some floating point accuracy for more speed. In some cases, the speedups can be drastic (the blog post referenced below notes speedups as high as x5 in some cases). Note that Xbox games are compiled with the /fp:fast switch by default.
I just switched from /fp:precise to /fp:fast on a math-heavy application of mine (with many float multiplications) and got an immediate 27% speedup with almost no loss in accuracy across my test suite.
Read the Microsoft blog post on this switch for the details. It seems that the main reasons not to enable it would be if you need all the accuracy available (e.g. games with large worlds, long-running simulations where errors may accumulate) or you need robust double or float NaN processing.
Lastly, also consider enabling the SSE2 instruction extensions. This gave an extra 3% boost in my application. The effects of this will vary depending on the number of operands in your arithmetic; for example, these extensions can provide a speedup in cases where you are adding or multiplying more than 2 numbers together at a time.