Change floating point rounding mode - c++

What is the most efficient way to change the rounding mode* of IEEE 754 floating point numbers? A portable C function would be nice, but a solution that uses x86 assembly is ok too.
*I am referring to the standard rounding modes of towards nearest, towards zero, and towards positive/negative infinity

This is the standard C solution:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
// store the original rounding mode
const int originalRounding = fegetround( );
// establish the desired rounding mode
fesetround(FE_TOWARDZERO);
// do whatever you need to do ...
// ... and restore the original mode afterwards
fesetround(originalRounding);
On backwards platforms lacking C99 support, you may need to resort to assembly. In this case, you may want to set the rounding for both the x87 unit (via the fldcw instruction) and SSE (via the ldmxcsr instruction).
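As a rough illustration, here's a GCC-flavoured sketch of what that could look like; it assumes an x86 target, GNU inline-asm syntax, and the <xmmintrin.h> intrinsics, and it only shows round-toward-zero rather than being a full fesetround() replacement:
#include <xmmintrin.h>   // _MM_SET_ROUNDING_MODE (SSE)
static void set_round_toward_zero(void)
{
    // x87: the rounding-control field is bits 10-11 of the control word (11 = toward zero).
    unsigned short cw;
    __asm__ volatile ("fnstcw %0" : "=m"(cw));
    cw = (unsigned short)((cw & ~0x0C00) | 0x0C00);
    __asm__ volatile ("fldcw %0" : : "m"(cw));
    // SSE: the rounding-control field is bits 13-14 of MXCSR; the intrinsic wraps stmxcsr/ldmxcsr.
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
}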
Edit
You don't need to resort to assembly for MSVC. You can use the (totally non-standard) _controlfp( ) instead:
unsigned int originalRounding = _controlfp(0, 0);
_controlfp(_RC_CHOP, _MCW_RC);
// do something ...
_controlfp(originalRounding, _MCW_RC);
You can read more about _controlfp( ) on MSDN.
And, just for completeness, a decoder ring for the macro names for rounding modes:
rounding mode    C name           MSVC name
--------------------------------------------
to nearest       FE_TONEAREST     _RC_NEAR
toward zero      FE_TOWARDZERO    _RC_CHOP
to +infinity     FE_UPWARD        _RC_UP
to -infinity     FE_DOWNWARD      _RC_DOWN

This might help.
Edit: I would say you need your own function. You can use assembly inside C.
But if your register size is 64 bits, don't assume that rounding values down to 32 bits will make your calculations faster; it will actually make them slower. Remember that 64-bit arithmetic is easy for a 64-bit microprocessor, unlike splitting the work into two 32-bit operations. I don't know exactly what you want to achieve, but I know performance is one of your criteria.

Related

C++ rounding double with maximum possible precision [duplicate]


acos(double) gives different result on x64 and x32 Visual Studio

acos(double) gives different result on x64 and x32 Visual Studio.
printf("%.30g\n", double(acosl(0.49990774364240564)));
printf("%.30g\n", acos(0.49990774364240564));
on x64: 1.0473040763868076
on x32: 1.0473040763868078
on linux4.4 x32 and x64 with sse enabled: 1.0473040763868078
is there a way to make VSx64 acos() give me 1.0473040763868078 as result?
TL:DR: this is normal and you can't reasonably change it.
The 32-bit library may be using 80-bit FP values in x87 registers for its temporaries, avoiding rounding off to 64-bit double after every operation. (Unless there's a whole separate library, compiling your own code to use SSE doesn't change what's inside the library, or even the calling convention for passing data to the library. But since 32-bit passes double and float in memory on the stack, a library is free to load it with SSE2 or with x87. Still, you don't get the performance advantage of passing FP values in xmm registers unless it's impossible for non-SSE code to use the library.)
It's also possible that they're different simply because they use a different order of operations, producing different temporaries along the way. That's less plausible, unless they're separately hand-written in asm. If they're built from the same C source (without "unsafe" FP optimizations), then the compiler isn't allowed to reorder things, because of this non-associative behaviour of FP math.
glibc's libm (used on Linux) typically favours precision over speed, so it's giving you the correctly-rounded result out to the last bit of the mantissa for both 32-bit and 64-bit. The IEEE FP standard only requires the basic operations (+ - * /, FMA and FP remainder) to be "correctly rounded" out to the last bit of the mantissa, i.e. a rounding error of at most 0.5 ulp. (The exact result, according to calc, is 1.047304076386807714.... Keep in mind that double (on x86 with normal compilers) is IEEE754 binary64, so internally the mantissa and exponent are in base 2. If you print enough extra decimal digits, though, you can tell that ...7714 should round up to ...78, although really you should print more digits in case they're not zero beyond that. I'm just assuming it's ...78000.)
So Microsoft's 64-bit library implementation produces 1.0473040763868076 and there's pretty much nothing you can do about it, other than not use it. (e.g. find your own acos() implementation and use it.) But FP determinism is hard, even if you limit yourself to just x86 with SSE. See Does any floating point-intensive code produce bit-exact results in any x86-based architecture?. If you limit yourself to a single compiler, it can be possible if you avoid complicated library functions like acos().
You might be able to get the 32-bit library version to produce the same value as the 64-bit version, if it uses x87 and changing the x87 precision setting affects it. But the other way around is not possible: SSE2 has separate instructions for 64-bit double and 32-bit float, and always rounds after every instruction, so you can't change any setting that will make the result more precise. (You could change the SSE rounding mode, and that will change the result, but not in a good way!)
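If you want to experiment with the x87 precision setting, one knob is the precision-control field of the x87 control word. A hedged, 32-bit-MSVC-only sketch (it assumes the CRT's acos() actually runs on x87, and uses the non-standard _controlfp() with the _MCW_PC / _PC_53 macros from <float.h>; _MCW_PC isn't supported on x64):
#include <float.h>
// Save the current precision-control bits, then make x87 temporaries
// round to a 53-bit mantissa so they behave like double.
unsigned int oldPrecision = _controlfp(0, 0) & _MCW_PC;
_controlfp(_PC_53, _MCW_PC);
// ... call the 32-bit acos() here ...
_controlfp(oldPrecision, _MCW_PC);
Note that this can only make the x87 path less precise (closer to the 64-bit SSE2 behaviour); as said above, nothing makes the SSE2 version more precise.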
See also:
Intermediate Floating-Point Precision and the rest of Bruce Dawson's excellent series of articles about floating point (table of contents).
The linked article describes how some versions of VC++'s CRT runtime startup set the x87 FP register precision to a 53-bit mantissa instead of 80-bit full precision. Also, D3D9 will set it to 24-bit, so even double only has the precision of float if the work is done with x87.
https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma
What Every Computer Scientist Should Know About Floating-Point Arithmetic
You may have reached the precision limit. Double precision is approximately 16 significant digits; beyond that, there is no guarantee the digits are valid. If so, you cannot change this behavior, except by changing the type from double to something that supports higher precision.
E.g. long double, if your machine and compiler support the extended 80-bit double or the 128-bit quadruple format (this is also machine-dependent; see here for example).
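For instance, a small illustration, assuming an x86 gcc/clang target where long double is the 80-bit extended type (MSVC's long double is the same as double, so this gains nothing there):
#include <math.h>
#include <stdio.h>
int main(void) {
    // acosl computes in long double, giving roughly 18-19 significant decimal digits
    long double y = acosl(0.49990774364240564L);
    printf("%.21Lg\n", y);
    return 0;
}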
I disagree that there isn't much you can do about it.
For example, you could try changing the floating point model compiler options.
Here are my results with different floating point models (note /fp:precise is the default):
/fp:precise  1.04730407638680755866289473488
/fp:strict   1.04730407638680755866289473488
/fp:fast     1.04730407638680778070749965991
So it seems you are looking for /fp:fast. Whether that gives the most accurate result remains to be seen though.

Truncate Floats and Doubles after user defined points in X87 and SSE FPUs

I have made a function g that is able to approximate a function to a certain degree. This function gives accurate results up to 5 decimals (1.23456xxxxxxxxxxxx, where the x positions are just rounding errors / junk).
To avoid spreading error to other computations that will use the results of g, I would like to just set all the x positions to zero, or better yet, set to 0 everything after the 5th decimal.
I haven't found anything in the x87 and SSE literature that lets me play with IEEE 754 bits or their representation the way I would like to.
There is an old reference to the FISTP instruction for x87 that is apparently mirrored in the SSE world with FISTTP, with the benefit that FISTTP doesn't necessarily modify the control word and is therefore faster.
I have noticed that FISTTP was called "chopping mode", but in more modern literature it is just "rounding toward zero" or "truncate", and this confuses me because "chopping" means removing something altogether, whereas "rounding toward zero" doesn't necessarily mean the same thing to me.
I don't need to round to zero; I only need to preserve up to 5 decimals as the last step in my function before storing the result in memory. How do I do this in x87 (scalar FPU) and SSE (vector FPU)?
As several people commented, rounding earlier doesn't help the final result be more accurate. If you want to read more about floating point comparisons and weirdness / gotchas, I highly recommend Bruce Dawson's series of articles on floating point. Here's a quote from the one with the series index:
We’ve finally reached the point in this series that I’ve been waiting
for. In this post I am going to share the most crucial piece of
floating-point math knowledge that I have. Here it is:
[Floating-point] math is hard.
You just won’t believe how vastly, hugely, mind-bogglingly hard it is.
I mean, you may think it’s difficult to calculate when trains from
Chicago and Los Angeles will collide, but that’s just peanuts to
floating-point math.
(Bonus points if you recognize that last paragraph as a paraphrase of a famous line about space.)
How you could actually implement your bad idea:
There aren't any machine instructions or C standard library functions to truncate or round to anything other than integer.
Note that there are machine instructions (and C functions) that round a double to nearest (representable) integer without converting it to intmax_t or anything, just double->double. So no round-trip through a fixed-width 2's complement integer.
So to use them, you could scale your float up by some factor, round to nearest integer, then scale back down (like chux's round()-based function, but I'd recommend C99 double rint(double) instead of round(); round has weird rounding semantics that don't match any of the available rounding modes on x86, so it compiles to worse code).
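A minimal sketch of that scale / rint / un-scale idea (keep5 is just a name made up for illustration; note the result is only the representable double closest to a multiple of 0.00001, not an exact decimal):
#include <math.h>
static double keep5(double x) {
    // Scale so the 5th decimal becomes the units digit, round to the nearest
    // integer in the current rounding mode, then scale back down.
    return rint(x * 100000.0) / 100000.0;
}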
The x86 asm instructions you keep mentioning are nothing special, and don't do anything that you can't ask the compiler to do with pure C.
FISTP (Float Integer STore and Pop the x87 stack) is one way for a compiler or asm programmer to implement long lrint(double) or (int)nearbyint(double). Some compilers make better code for one or the other. It rounds using the current x87 rounding mode (default: round to nearest), which is the same semantics as those ISO C standard functions.
FISTTP (Float Integer STore with Truncation and Pop the x87 stack) is part of SSE3, even though it operates on the x87 stack. It lets compilers avoid setting the rounding mode to truncation (round-towards-zero) to implement the C truncation semantics of (long)x, and then restoring the old rounding mode.
This is what the "not modify the control word" stuff is about. Neither instruction modifies the control word, but to implement (int)x without FISTTP, the compiler has to use other instructions to modify and restore the rounding mode around a FIST instruction. Or just use SSE2 CVTTSD2SI to convert a double in an xmm register with truncation, instead of an FP value on the legacy x87 stack.
Since FISTTP is only available with SSE3, you'd only use it for long double, or in 32-bit code that had FP values in x87 registers anyway because of the crusty old 32-bit ABI which returns FP values on the x87 stack.
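To make the C-level semantics concrete, here's a tiny example of the two conversions those instructions exist to implement (assuming C99's <math.h> and the default round-to-nearest mode):
#include <math.h>
double x = 2.7;
long nearest = lrint(x);   // rounds in the current FP rounding mode: 3
long chopped = (long)x;    // the C cast truncates toward zero: 2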
PS. if you didn't recognize Bruce's HHGTG reference, the original is:
Space is big. Really big. You just won’t believe how vastly hugely
mindbogglingly big it is. I mean you may think it’s a long way down
the road to the chemist’s, but that’s just peanuts to space.
how do I do this in x87 (scalar FPU) and SSE (vector FPU)?
The following uses neither x87 nor SSE. I've included it as a community reference for general purpose code. If anything, it can be used to test an x87 solution.
Any "chopping" of the result of g() will at least marginally increase error, hopefully tolerably so, as OP said "To avoid spreading error to other computations ...".
It is unclear if OP wants "accurate results up to 5 decimals" to reflect absolute precision (+/- 0.000005) or relative precision (+/- 0.000005 * result). I will assume absolute precision.
Since float and double are most often binary floating point types, any "chop" will produce the FP number nearest to a multiple of 0.00001.
Text Method:
// needs <stdio.h> (sprintf), <stdlib.h> (atof), <float.h> (DBL_MAX_10_EXP)
// buffer layout: - x xxx...xxx . xxxxx \0
char buf[1 + 1 + DBL_MAX_10_EXP + 1 + 5 + 1];
sprintf(buf, "%.5f", x);
x = atof(buf);
round() rint() method:
// needs <math.h> (fabs, rint) and <float.h> (DBL_MAX)
#define SCALE 100000.0
if (fabs(x) < DBL_MAX/SCALE) {
    x = x*SCALE;
    x = rint(x)/SCALE;
}
Direct bit manipulation of x. Simply zero select bits in the significand.
TBD code.

how to get float rounding mode for c/c++ in vs2008/2012/2010 [duplicate]


Avoiding denormal values in C++

After searching a long time for a performance bug, I read about denormal floating point values.
Apparently denormalized floating-point values can be a major performance concern as is illustrated in this question:
Why does changing 0.1f to 0 slow down performance by 10x?
I have an Intel Core 2 Duo and I am compiling with gcc, using -O2.
So what do I do? Can I somehow instruct g++ to avoid denormal values?
If not, can I somehow test if a float is denormal?
Wait. Before you do anything, do you actually know that your code is encountering denormal values, and that they're having a measurable performance impact?
Assuming you know that, do you know if the algorithm(s) that you're using is stable if denormal support is turned off? Getting the wrong answer 10x faster is not usually a good performance optimization.
Those issues aside:
If you want to detect denormal values to confirm their presence, you have a few options. If you have a C99 standard library or Boost, you can use the fpclassify macro. Alternatively, you can compare the absolute values of your data to the smallest positive normal number.
You can set the hardware to flush denormal values to zero (FTZ), or treat denormal inputs as zero (DAZ). The easiest way, if it is properly supported on your platform, is probably to use the fesetenv( ) function in the C header fenv.h. However, this is one of the least-widely supported features of the C standard, and is inherently platform specific anyway. You may want to just use some inline assembly to directly set the FPU state to DAZ/FTZ.
You can test whether a float is denormal using
#include <cmath>
if ( std::fpclassify( flt ) == FP_SUBNORMAL )
(Caveat: I'm not sure that this will execute at full speed in practice.)
In C++03, and this code has worked for me in practice,
#include <cmath>
#include <limits>
if ( flt != 0 && std::fabs( flt ) < std::numeric_limits<float>::min() ) {
    // it's denormalized
}
To decide where to apply this, you may use a sample-based analyzer like Shark, VTune, or Zoom, to highlight the instructions slowed by denormal values. Micro-optimization, even more than other optimizations, is totally hopeless without analysis both before and after.
Most math coprocessors have an option to truncate denormal values to zero. On x86 it is the FZ (Flush to Zero) flag in the MXCSR control register. Check your CRT implementation for a support function to set the control register. It ought to be in <float.h>, something resembling _controlfp(). The option bit usually has "FLUSH" in the #defined symbol.
Double-check your math results after you set this, which is something you ought to do anyway; getting denormals is a sign of health problems.
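For MSVC specifically, a hedged sketch of what that might look like (this assumes the _MCW_DN / _DN_FLUSH macros in <float.h> are available; support depends on the CRT and on SSE2 being used for the FP math):
#include <float.h>
// Set the denormal-control field so denormal results are flushed to zero.
unsigned int current;
_controlfp_s(&current, _DN_FLUSH, _MCW_DN);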
To get FTZ (flush-to-zero) in gcc (assuming underflow is masked, as it is by default):
#define CSR_FLUSH_TO_ZERO (1 << 15)
unsigned csr = __builtin_ia32_stmxcsr();
csr |= CSR_FLUSH_TO_ZERO;
__builtin_ia32_ldmxcsr(csr);
In case it's not obvious from the names, __builtin_ia32_stmxcsr and __builtin_ia32_ldmxcsr are available only if you're targeting a x86 processor. ARM, Sparc, MIPS, etc. will each need separate platform-specific code with this approach.
You apparently want the CPU modes called FTZ (Flush To Zero) and DAZ (Denormals Are Zero).
I found the information on an audio web site, but their link to the Intel documentation was missing. They are apparently controlled through SSE2, so they should work on AMD CPUs that support it.
I don't know what you can do in GCC to force that on in a portable way. You can always write inline assembly code to use them, though. You may have to force GCC to use only SSE2 for floating point math.
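One non-portable but widely available way is the Intel-style intrinsics that GCC, Clang, and MSVC all ship (a sketch assuming an x86 target with SSE; the DAZ bit also needs hardware support, which all but the very earliest SSE2 CPUs have):
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (FTZ)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (DAZ)
// Flush denormal results to zero, and treat denormal inputs as zero.
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
Note these only affect SSE/AVX math; x87 instructions still handle denormals (slowly).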
Just as an addition to the other answers, if you actually have a problem with denormal floating point values you probably have a precision problem in addition to your performance issue.
It may be a good idea to check if you can restructure your computations to keep the numbers larger to avoid losing precision and performance.