acos(double) gives different result on x64 and x32 Visual Studio - c++

acos(double) gives different result on x64 and x32 Visual Studio.
printf("%.30g\n", double(acosl(0.49990774364240564)));
printf("%.30g\n", acos(0.49990774364240564));
on x64: 1.0473040763868076
on x32: 1.0473040763868078
on linux4.4 x32 and x64 with sse enabled: 1.0473040763868078
is there a way to make VSx64 acos() give me 1.0473040763868078 as result?

TL:DR: this is normal and you can't reasonably change it.
The 32-bit library may be using 80-bit FP values in x87 registers for its temporaries, avoiding rounding off to 64-bit double after every operation. (Unless there's a whole separate library, compiling your own code to use SSE doesn't change what's inside the library, or even the calling convention for passing data to the library. But since 32-bit passes double and float in memory on the stack, a library is free to load them with SSE2 or with x87. Still, a library only gets the performance advantage of receiving FP values in xmm registers if it requires SSE, i.e. if it's impossible for non-SSE code to use it.)
It's also possible that they're different simply because they use a different order of operations, producing different temporaries along the way. That's less plausible, unless they're separately hand-written in asm. If they're built from the same C source (without "unsafe" FP optimizations), then the compiler isn't allowed to reorder things, because of this non-associative behaviour of FP math.
glibc's libm (used on Linux) typically favours precision over speed, so it's giving you the correctly-rounded result out to the last bit of the mantissa for both 32-bit and 64-bit. The IEEE FP standard only requires the basic operations (+ - * / FMA and FP remainder) to be "correctly rounded" out to the last bit of the mantissa (i.e. a rounding error of at most 0.5 ulp). (The exact result, according to calc, is 1.047304076386807714.... Keep in mind that double (on x86 with normal compilers) is IEEE754 binary64, so internally the mantissa and exponent are in base 2. If you print enough extra decimal digits, though, you can tell that ...7714 should round up to ...78, although really you should print more digits in case they're not zero beyond that. I'm just assuming it's ...78000.)
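To see that the two printed values are in fact neighbouring doubles (i.e. they differ by exactly 1 ulp), here is a minimal check, assuming an IEEE754 binary64 double; this is only an illustration, not part of the original answer:
#include <cmath>
#include <cstdio>

int main() {
    double msvc64 = 1.0473040763868076;   // what the MSVC x64 CRT produced
    double other  = 1.0473040763868078;   // the x87 / glibc result
    // If 'other' is the next representable double above 'msvc64', they are 1 ulp apart.
    std::printf("adjacent doubles: %d\n", std::nextafter(msvc64, 2.0) == other);
    std::printf("difference = %.30g\n", other - msvc64);
}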
So Microsoft's 64-bit library implementation produces 1.0473040763868076 and there's pretty much nothing you can do about it, other than not use it. (e.g. find your own acos() implementation and use it.) But FP determinism is hard, even if you limit yourself to just x86 with SSE. See Does any floating point-intensive code produce bit-exact results in any x86-based architecture?. If you limit yourself to a single compiler, it can be possible if you avoid complicated library functions like acos().
You might be able to get the 32-bit library version to produce the same value as the 64-bit version, if it uses x87 and changing the x87 precision setting affects it. But the other way around is not possible: SSE2 has separate instructions for 64-bit double and 32-bit float, and always rounds after every instruction, so you can't change any setting that will increase precision result. (You could change the SSE rounding mode, and that will change the result, but not in a good way!)
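If you do want to experiment with the 32-bit build's x87 behaviour, here is a minimal sketch using MSVC's _controlfp_s to change the x87 precision control. This only affects x87 code paths; SSE2 math ignores it, and _MCW_PC is not supported at all in x64 builds.
#include <float.h>

unsigned int original = 0;
_controlfp_s(&original, 0, 0);            // mask 0: just read the current control word
unsigned int dummy;
_controlfp_s(&dummy, _PC_53, _MCW_PC);    // round x87 temporaries to a 53-bit mantissa
// ... call the 32-bit CRT's acos() here and compare results ...
_controlfp_s(&dummy, original, _MCW_PC);  // restore the previous precision setting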
See also:
Intermediate Floating-Point Precision and the rest of Bruce Dawson's excellent series of articles about floating point (table of contents).
The linked article describes how some versions of VC++'s CRT runtime startup set the x87 FP register precision to a 53-bit mantissa instead of the full 80-bit precision. Also, D3D9 will set it to a 24-bit mantissa, so even double only has the precision of float if the math is done with x87.
https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma
What Every Computer Scientist Should Know About Floating-Point Arithmetic

You may have reached the precision limit. Double precision gives roughly 16 significant decimal digits; beyond that, there is no guarantee the digits are valid. If so, you cannot change this behavior, except by changing the type double to something else that supports higher precision.
E.g. long double, if your machine and compiler support the 80-bit extended double or a 128-bit quadruple-precision type (also machine dependent; see here for an example).
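A quick, illustrative way to check what long double gives you on a particular compiler (MSVC maps it to plain 64-bit double; glibc on x86 typically gives the 80-bit extended type):
#include <cfloat>
#include <cstdio>

int main() {
    std::printf("sizeof(long double) = %zu, LDBL_MANT_DIG = %d\n",
                sizeof(long double), LDBL_MANT_DIG);
    // LDBL_MANT_DIG: 53 -> plain double, 64 -> 80-bit x87 extended, 113 -> IEEE quad
}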

I disagree that there isn't much you can do about it.
For example, you could try changing the floating point model compiler options.
Here are my results with different floating point models (note /fp:precise is the default):
/fp:precise 1.04730407638680755866289473488
/fp:strict 1.04730407638680755866289473488
/fp:fast 1.04730407638680778070749965991
So it seems you are looking for /fp:fast. Whether that gives the most accurate result remains to be seen though.
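For reference, a rough sketch of how these numbers can be reproduced from a Developer Command Prompt, assuming the snippet from the question is saved as main.cpp (the file name is just a placeholder):
cl /O2 /fp:precise main.cpp && main.exe
cl /O2 /fp:strict  main.cpp && main.exe
cl /O2 /fp:fast    main.cpp && main.exe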

Related

Microsoft Visual Studio: Setting Rounding Modes on Floating Point for x64

I am trying to figure out how to set ROUND_UP, ROUND_DOWN, ROUND_to_NEAREST, and ROUND_to_INFINITY for an MS Visual Studio project.
The representation of the floating-point numbers should follow the IEEE 754 standard, which means /fp:strict is selected. However, the code is running in an x64 environment.
Through careful selection of rounding mode, I want to cause -0.000000 to be equal to -0.000001, for example.
I am making some computations and saving the results (tuples). After each operation, I query the saved data to know whether I already have the value. (-0.000000, -0.202319) would be equal to (-0.000001, -0.202319) when rounding to nearest. How can I do this with Visual Studio?
In general, == and != for floating-point are not a 'safe' method for doing floating-point comparison except in the specific case of 'binary representation equality' even using 'IEEE-754 compliant' code generation. This is why clang/LLVM for example has the -Wfloat-equal warning.
Be sure to read What Every Computer Scientist Should Know About Floating-Point Arithmetic. I'd also recommend reading the many great Bruce Dawson blog posts on the topic of floating-point.
Instead, you should explicitly use an 'epsilon' comparison:
#include <cmath>  // fabsf

constexpr float c_EPSILON = 0.000001f;

if (fabsf(a - b) <= c_EPSILON)
{
    // A & B are equal within the epsilon value.
}
In general, you can't assume that SSE/SSE2-based floating-point math (required for x64) will match legacy x87-based floating-point math, and in many cases you can't even assume AMD and Intel will always agree even with /fp:strict.
For example, IEEE-754 doesn't specify what happens with fast reciprocal operations such as RCP or RSQRT. AMD and Intel give different answers in these cases in the lower bits.
With all that said, you are intended to use _controlfp or _controlfp_s rather than _control87 to control rounding mode behavior for all platforms. _control87 really only works for 32-bit (x86) platforms when using /arch:IA32.
Keep in mind that changing the floating-point control word and calling code outside of your control is often problematic. Code assumes the default of "no-exceptions, round-to-nearest" and deviation from that can result in untested/undefined behavior. You can really only change the control word, do your own code, then change it back to the default in any safe way. See this old DirectX article.
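A minimal sketch of that "change, compute, restore" pattern with _controlfp_s (the _RC_* and _MCW_RC constants come from <float.h>):
#include <float.h>

unsigned int saved = 0;
_controlfp_s(&saved, 0, 0);               // read the current control word
unsigned int dummy;
_controlfp_s(&dummy, _RC_DOWN, _MCW_RC);  // switch to round-toward-negative-infinity
// ... only your own, tested code here ...
_controlfp_s(&dummy, saved, _MCW_RC);     // restore the original rounding mode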
That’s not how it works. Choosing the rounding mode affects the last bit of the result, that’s it. Comparisons are not affected. And comparisons between 0.000001 and 0.234567 are most definitely not affected.
What you want cannot be achieved with rounding modes. Feel free to write a function that returns true if two numbers are close together.
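One possible shape for such a function, as a sketch only (an absolute epsilon; for values of very different magnitudes a relative tolerance is usually better):
#include <cmath>

bool nearly_equal(double a, double b, double eps = 1e-6)
{
    return std::fabs(a - b) <= eps;
}

// nearly_equal(-0.000000, -0.000001) is true with the default epsilon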

Truncate Floats and Doubles after user defined points in X87 and SSE FPUs

I have made a function g that is able to approximate a function to a certain degree. This function gives accurate results up to 5 decimals (1.23456xxxxxxxxxxxx, where the x positions are just rounding errors / junk).
To avoid spreading error to other computations that will use the results of g, I would like to just set all the x positions to zero, or better yet, set to 0 everything after the 5th decimal.
I haven't found anything in the X87 and SSE literature that lets me play with IEEE 754 bits or their representation the way I would like to.
There is an old reference to the FISTP instruction for X87 that is apparently mirrored in the SSE world with FISTTP, with the benefit that FISTTP doesn't necessarily modify the control word and is therefore faster.
I have noticed that FISTTP was called "chopping mode", but in more modern literature it is just "rounding toward zero" or "truncate", and this confuses me because "chopping" means removing something altogether, whereas "rounding toward zero" doesn't necessarily mean the same thing to me.
I don't need to round to zero, I only need to preserve up to 5 decimals as the last step in my function before storing the result in memory; how do I do this in X87 (scalar FPU) and SSE (vector FPU)?
As several people commented, more early rounding doesn't help the final result be more accurate. If you want to read more about floating-point comparisons and weirdness / gotchas, I highly recommend Bruce Dawson's series of articles on floating point. Here's a quote from the post that serves as the index of the series:
We’ve finally reached the point in this series that I’ve been waiting
for. In this post I am going to share the most crucial piece of
floating-point math knowledge that I have. Here it is:
[Floating-point] math is hard.
You just won’t believe how vastly, hugely, mind-bogglingly hard it is.
I mean, you may think it’s difficult to calculate when trains from
Chicago and Los Angeles will collide, but that’s just peanuts to
floating-point math.
(Bonus points if you recognize that last paragraph as a paraphrase of a famous line about space.)
How you could actually implement your bad idea:
There aren't any machine instructions or C standard library functions to truncate or round to anything other than integer.
Note that there are machine instructions (and C functions) that round a double to nearest (representable) integer without converting it to intmax_t or anything, just double->double. So no round-trip through a fixed-width 2's complement integer.
So to use them, you could scale your float up by some factor, round to nearest integer, then scale back down. (Like chux's round()-based function, but I'd recommend C99 double rint(double) instead of round(); round has weird rounding semantics that don't match any of the available rounding modes on x86, so it compiles to worse code.)
The x86 asm instructions you keep mentioning are nothing special, and don't do anything that you can't ask the compiler to do with pure C.
FISTP (Floating-point/Integer STore and Pop the x87 stack) is one way for a compiler or asm programmer to implement long lrint(double) or (int)nearbyint(double). Some compilers make better code for one or the other. It rounds using the current x87 rounding mode (default: round to nearest), which is the same semantics as those ISO C standard functions.
FISTTP (Floating-point/Integer STore with Truncation and Pop the x87 stack) is part of SSE3, even though it operates on the x87 stack. It lets compilers avoid setting the rounding mode to truncation (round-towards-zero) to implement the C truncation semantics of (long)x, and then restoring the old rounding mode.
This is what the "not modify the control word" stuff is about. Neither instruction does that, but to implement (int)x without FISTTP, the compiler has to use other instructions to modify and restore the rounding mode around a FIST instruction. Or just use SSE2 CVTTSD2SI to convert a double in an xmm register with truncation, instead of an FP value on the legacy x87 stack.
Since FISTTP is only available with SSE3, you'd only use it for long double, or in 32-bit code that had FP values in x87 registers anyway because of the crusty old 32-bit ABI which returns FP values on the x87 stack.
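For illustration, these are the portable C/C++ spellings those instructions back, which you can use instead of reaching for asm (assuming the default round-to-nearest mode):
#include <cmath>
#include <cstdio>

int main() {
    double x = 2.7;
    long nearest   = std::lrint(x);   // uses the current rounding mode (default nearest) -> 3
    long truncated = (long)x;         // always truncates toward zero -> 2
    std::printf("%ld %ld\n", nearest, truncated);
    // With SSE2 these typically compile to cvtsd2si and cvttsd2si respectively;
    // on bare x87 without SSE3, the cast needs the rounding-mode save/restore dance described above.
}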
PS. if you didn't recognize Bruce's HHGTG reference, the original is:
Space is big. Really big. You just won’t believe how vastly hugely
mindbogglingly big it is. I mean you may think it’s a long way down
the road to the chemist’s, but that’s just peanuts to space.
how do I do this in X87 (scalar FPU) and SSE (vector FPU)?
The following does not use X87 nor SSE. I've included it as a community reference for general purpose code. If anything, it can be used to test a X87 solution.
Any "chopping" of the result of g() will at least marginally increase error, hopefully tolerable as OP said "To avoid spreading error to other computations ..."
It is unclear if OP wants "accurate results up to 5 decimals" to reflect absolute precision (+/- 0.000005) or relative precision (+/- 0.000005 * result). Will assume "absolute precision".
Since float and double are almost always binary floating-point types, any "chop" will produce the FP number nearest to a multiple of 0.00001.
Text Method:
// - x xxx...xxx . xxxxx \0
char buf[1+1+ DBL_MAX_10_EXP+1 +5 +1];
sprintf(buf, "%.5f", x);
x = atof(buf);
round() rint() method:
#define SCALE 100000.0
if (fabs(x) < DBL_MAX/SCALE) {
    x = x*SCALE;
    x = rint(x)/SCALE;
}
Direct bit manipulation of x. Simply zero select bits in the significand.
TBD code.
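A small test harness (not part of the original answer) that wraps the two snippets above so they can be compared on sample values:
#include <cfloat>
#include <cmath>
#include <cstdio>
#include <cstdlib>

static double chop5_text(double x) {          // sprintf/atof method from above
    char buf[1 + 1 + DBL_MAX_10_EXP + 1 + 5 + 1];
    std::sprintf(buf, "%.5f", x);
    return std::atof(buf);
}

static double chop5_rint(double x) {          // scale / rint / unscale method from above
    const double SCALE = 100000.0;
    if (std::fabs(x) < DBL_MAX / SCALE)
        x = std::rint(x * SCALE) / SCALE;
    return x;
}

int main() {
    double x = 1.2345678912345;
    std::printf("%.15f -> %.15f and %.15f\n", x, chop5_text(x), chop5_rint(x));
}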

precision differences in matlab and c++

I am trying to make equivalence tests on an algorithm written in C++ and in Matlab.
The algorithm contains some kind of a loop in time and runs more than 1000 times. It has arithmetic operations and some math functions.
I feed the initial inputs to both platforms by hand (like a=1.767, b=6.65, ...) and when I check the hexadecimal representations of those inputs they are the same, so there is no problem with the inputs. I get the outputs from C++ into Matlab via a text file with 16 decimal digits (I use a "setprecision(32)" statement).
But here comes the problem: although after the 614th step both codes give exactly the same results, at step 615 I get a difference of about 2.xxx..xxe-19. After this step the error becomes larger and larger, and at the end of the runs it is about 5.xx..xxe-14.
0x3ff1 3e42 a211 6cca--->[C++ function]--->0x3ff4 7619 7005 5a42
0x3ff1 3e42 a211 6cca--->[MATLAB function]--->ans
ans - 0x3ff4 7619 7005 5a42
= 2.xxx..xxe-19
I searched for how Matlab handles numbers and found really interesting things like the "denormalized mantissa". While realmin is about 1e-308, by denormalizing the mantissa Matlab has a smallest real number of about 1e-324. Furthermore, Matlab holds many more digits for "pi" or "exp(1)" than C++ does.
On the other hand, matlab help says that whatever the format it displays, matlab uses the double precision internally.
So, I'd really appreciate it if someone could explain what the exact reason is for these differences. How can we make equivalence tests between Matlab and C++?
There is one thing about floating-point numbers on x86 CPUs. Internally, the floating-point unit uses registers that are 10 bytes, i.e. 80 bits, wide. Furthermore, the CPU has a setting that tells whether the floating-point calculations should be made with 32-bit (float), 64-bit (double) or 80-bit precision. Less precision means faster floating-point operations. (The 32-bit mode used to be popular for video games, where speed wins over precision.)
This reminds me of a bug I tracked down in a calculation library (DLL) that, given the same input, did not give the same result depending on whether it was started from a test C++ executable or from MatLab. Furthermore, this did not happen in Debug mode, only in Release!
The final conclusion was that MatLab set the CPU floating-point precision to 80 bits, whereas our test executable did not (and left the default 64-bit precision). Furthermore, this calculation mismatch did not happen in Debug mode because all the variables were written to memory into 64-bit double variables and reloaded from there afterward, nullifying the additional 16 bits. In Release mode, some variables were optimized out (not written to memory), and all calculations were done with floating-point registers only, on 80 bits, keeping the additional 16 bits non-zero.
Don't know if this helps, but maybe worth knowing.
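One classic way to test that hypothesis is to force intermediates through 64-bit memory; a sketch only (gcc's -ffloat-store does something similar globally, and this only matters for x87 code, SSE2 math is unaffected):
#include <cstdio>

int main() {
    double a = 1.767, b = 6.65;     // sample inputs like the question's
    volatile double t = a * b;      // the store forces rounding to a 64-bit double here
    double r = t / 3.0;             // later operations start from the rounded value
    std::printf("%.17g\n", r);
}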
A similar discussion occurred before, the conclusion was that IEEE 754 tolerates error in the last bit for transcendental functions (cos, sin, exp, etc..). So you can't expect exactly same results between MATLAB and C (not even same C code compiled in different compilers).
I may be way off track here, and you may already have investigated this possibility, but it could be that there are differences between C++ and Matlab in the way the mathematical library functions (the sin(), cos() and exp() that you mention) are implemented internally. Ultimately, some kind of functional approximation must be used to generate the function values, and if those methods differ, it is plausible that this manifests itself as numerical rounding error over a large number of iterations.
This question basically covers what I am trying to suggest How does C compute sin() and other math functions?

Avoiding denormal values in C++

After searching a long time for a performance bug, I read about denormal floating point values.
Apparently denormalized floating-point values can be a major performance concern as is illustrated in this question:
Why does changing 0.1f to 0 slow down performance by 10x?
I have an Intel Core 2 Duo and I am compiling with gcc, using -O2.
So what do I do? Can I somehow instruct g++ to avoid denormal values?
If not, can I somehow test if a float is denormal?
Wait. Before you do anything, do you actually know that your code is encountering denormal values, and that they're having a measurable performance impact?
Assuming you know that, do you know if the algorithm(s) that you're using is stable if denormal support is turned off? Getting the wrong answer 10x faster is not usually a good performance optimization.
Those issues aside:
If you want to detect denormal values to confirm their presence, you have a few options. If you have a C99 standard library or Boost, you can use the fpclassify macro. Alternatively, you can compare the absolute values of your data to the smallest positive normal number.
You can set the hardware to flush denormal values to zero (FTZ), or treat denormal inputs as zero (DAZ). The easiest way, if it is properly supported on your platform, is probably to use the fesetenv( ) function in the C header fenv.h. However, this is one of the least widely supported features of the C standard, and is inherently platform-specific anyway. You may want to just use some inline assembly to directly set the FPU state (DAZ/FTZ).
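On x86 with SSE, the Intel intrinsic headers also expose this directly; a minimal sketch (FTZ affects denormal results, DAZ affects denormal inputs; both apply only to SSE math, not x87):
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);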
You can test whether a float is denormal using
#include <cmath>
if ( std::fpclassify( flt ) == FP_SUBNORMAL )
(Caveat: I'm not sure that this will execute at full speed in practice.)
In C++03, and this code has worked for me in practice,
#include <cmath>
#include <limits>
if ( flt != 0 && std::fabs( flt ) < std::numeric_limits<float>::min() ) {
    // it's denormalized
}
To decide where to apply this, you may use a sample-based analyzer like Shark, VTune, or Zoom, to highlight the instructions slowed by denormal values. Micro-optimization, even more than other optimizations, is totally hopeless without analysis both before and after.
Most math coprocessors have an option to truncate denormal values to zero. On x86 it is the FZ (Flush to Zero) flag in the MXCSR control register. Check your CRT implementation for a support function to set the control register. It ought to be in <float.h>, something resembling _controlfp(). The option bit usually has "FLUSH" in the #defined symbol.
Double-check your math results after you set this. Which is something you ought to do anyway, getting denormals is a sign of health problems.
To enable FTZ (flush-to-zero) in gcc (assuming the underflow exception is masked, as it is by default):
#define CSR_FLUSH_TO_ZERO (1 << 15)        // FTZ bit in MXCSR
unsigned csr = __builtin_ia32_stmxcsr();   // read the current MXCSR
csr |= CSR_FLUSH_TO_ZERO;
__builtin_ia32_ldmxcsr(csr);               // write it back with FTZ set
In case it's not obvious from the names, __builtin_ia32_stmxcsr and __builtin_ia32_ldmxcsr are available only if you're targeting a x86 processor. ARM, Sparc, MIPS, etc. will each need separate platform-specific code with this approach.
You apparently want the CPU modes called FTZ (Flush To Zero) and DAZ (Denormals Are Zero).
I found the information on an audio web site, but their link to the Intel documentation was missing. They are apparently SSE2 features, so they should work on AMD CPUs that support SSE2.
I don't know what you can do in GCC to force that on in a portable way. You can always write inline assembly code to use them though. You may have to force GCC to use only SSE2 for floating point math.
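For what it's worth, the usual flags for forcing SSE math on 32-bit x86 with gcc look like this (a sketch; your_file.cpp is a placeholder, on x86-64 SSE2 math is already the default, and -ffast-math normally also links startup code that enables FTZ/DAZ on x86):
g++ -O2 -msse2 -mfpmath=sse your_file.cpp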
Just as an addition to the other answers, if you actually have a problem with denormal floating point values you probably have a precision problem in addition to your performance issue.
It may be a good idea to check if you can restructure your computations to keep the numbers larger to avoid losing precision and performance.

Switching from a 53-bit to 64-bit FPU in Fortran

I am porting Fortran code from Fortran PowerStation (version 4.0) to the Fortran 11 (2003) compiler. The old compiler (PowerStation) has 53-bit precision. When porting to the new compiler, I am not getting the proper or exact values for my real/float variables. I hope the new compiler uses 64-bit precision, so I think I need to change the FPU (floating-point unit) from 53-bit to 64-bit precision. Is this correct? If so, how do I go about changing from 53-bit to 64-bit precision using the properties of the new compiler? If not, what should I be doing?
Thanks in advance.
The portable way to request floating point precision in Fortran 90/95/2003 is with the selected_real_kind intrinsic function. For example,
integer, parameter :: DoubleReal_K = selected_real_kind (14)
will define an integer DoubleReal_K that specifies a floating-point variable with at least 14 decimal digits:
real (DoubleReal_K) :: MyFloat
Requesting 14 decimal digits will typically produce a double-precision float with 53 bits -- but the only guarantee is 14 decimal digits.
If you need more precision, use a larger value than 14 to specify a longer type -- 17 decimal digits might get extended precision (64 bits), or it might get quadruple precision, or nothing, depending on the compiler. If the compiler has a larger type available, it will provide it... Otherwise, get a better compiler. Why are you using such an old and unsupported compiler? Also, don't expect exact results from floating point calculations -- it is normal for changes to cause small changes in the results.
You hope the new compiler uses 64-bit precision? I would sort of expect you to read the manual and figure that out yourself, but if you can't do that, tell us which compiler you are using and someone might help.
How different are the results of the old code and the code compiled with the new compiler? Of course the results won't be exactly the same if the precision has changed -- how could they be, unless you take special steps to ensure sameness.