Avoiding denormal values in C++ - c++

After searching a long time for a performance bug, I read about denormal floating point values.
Apparently denormalized floating-point values can be a major performance concern as is illustrated in this question:
Why does changing 0.1f to 0 slow down performance by 10x?
I have an Intel Core 2 Duo and I am compiling with gcc, using -O2.
So what do I do? Can I somehow instruct g++ to avoid denormal values?
If not, can I somehow test if a float is denormal?

Wait. Before you do anything, do you actually know that your code is encountering denormal values, and that they're having a measurable performance impact?
Assuming you know that, do you know if the algorithm(s) that you're using is stable if denormal support is turned off? Getting the wrong answer 10x faster is not usually a good performance optimization.
Those issues aside:
If you want to detect denormal values to confirm their presence, you have a few options. If you have a C99 standard library or Boost, you can use the fpclassify macro. Alternatively, you can compare the absolute values of your data to the smallest positive normal number.
You can set the hardware to flush denormal values to zero (FTZ), or treat denormal inputs as zero (DAZ). The easiest way, if it is properly supported on your platform, is probably to use the fesetenv( ) function in the C header fenv.h. However, this is one of the least-widely supported features of the C standard, and is inherently platform specific anyway. You may want to just use some inline assembly to directly set the FPU state to (DAZ/FTZ).

You can test whether a float is denormal using
#include <cmath>
if ( std::fpclassify( flt ) == FP_SUBNORMAL )
(Caveat: I'm not sure that this will execute at full speed in practice.)
In C++03, the following has worked for me in practice:
#include <cmath>
#include <limits>
// std::fabs has a float overload; std::fabsf is not part of C++03.
if ( flt != 0 && std::fabs( flt ) < std::numeric_limits<float>::min() ) {
// it's denormalized
}
To decide where to apply this, you may use a sample-based analyzer like Shark, VTune, or Zoom, to highlight the instructions slowed by denormal values. Micro-optimization, even more than other optimizations, is totally hopeless without analysis both before and after.

Most math coprocessors have an option to truncate denormal values to zero. On x86 it is the FZ (Flush to Zero) flag in the MXCSR control register. Check your CRT implementation for a support function to set the control register. It ought to be in <float.h>, something resembling _controlfp(). The option bit usually has "FLUSH" in the #defined symbol.
Double-check your math results after you set this, which is something you ought to do anyway: getting denormals is a sign of numerical health problems.

To enable flush-to-zero (FTZ) in gcc (assuming the underflow exception is masked, which is the default):
#define CSR_FLUSH_TO_ZERO (1 << 15)
unsigned csr = __builtin_ia32_stmxcsr();
csr |= CSR_FLUSH_TO_ZERO;
__builtin_ia32_ldmxcsr(csr);
In case it's not obvious from the names, __builtin_ia32_stmxcsr and __builtin_ia32_ldmxcsr are available only if you're targeting an x86 processor. ARM, Sparc, MIPS, etc. will each need separate platform-specific code with this approach.
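A rough sketch of the same idea that is not tied to gcc builtins: the _mm_getcsr/_mm_setcsr intrinsics from <xmmintrin.h> are accepted by gcc, clang and MSVC. ScopedFlushDenormals and half_of_smallest_normal are names made up for this example. The masks 0x8000 (FTZ) and 0x0040 (DAZ) are the MXCSR bit positions, and I am assuming the CPU supports DAZ, which as far as I know every x86-64 part does. On other targets the class deliberately does nothing.

```cpp
#include <cassert>
#include <limits>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr
#define HAVE_MXCSR 1
#else
#define HAVE_MXCSR 0
#endif

constexpr bool kHaveMxcsr = (HAVE_MXCSR != 0);

// Hypothetical RAII helper: turns on FTZ and DAZ for the current thread
// and puts the previous MXCSR value back when the scope ends.
class ScopedFlushDenormals {
public:
    ScopedFlushDenormals() {
#if HAVE_MXCSR
        saved_ = _mm_getcsr();
        _mm_setcsr(saved_ | 0x8000u | 0x0040u);  // bit 15 = FTZ, bit 6 = DAZ
#endif
    }
    ~ScopedFlushDenormals() {
#if HAVE_MXCSR
        _mm_setcsr(saved_);
#endif
    }
    ScopedFlushDenormals(const ScopedFlushDenormals&) = delete;
    ScopedFlushDenormals& operator=(const ScopedFlushDenormals&) = delete;

private:
    unsigned int saved_ = 0;
};

// Demo: halving the smallest normal float yields a denormal, or zero
// when flushing is active. volatile keeps the multiply at run time.
float half_of_smallest_normal(bool flush) {
    volatile float tiny = std::numeric_limits<float>::min();
    volatile float half = 0.5f;
    volatile float result = 0.0f;
    if (flush) {
        ScopedFlushDenormals guard;
        result = tiny * half;
    } else {
        result = tiny * half;
    }
    return result;
}
```

Restoring the old value on scope exit matters because library code you call afterwards may assume the default floating-point environment.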

You apparently want the CPU modes called FTZ (Flush To Zero) and DAZ (Denormals Are Zero).
I found the information on an audio web site but their link to the Intel documentation was missing. They are control bits in the SSE MXCSR register rather than instructions, so they should also work on AMD CPUs that support SSE2.
I don't know what you can do in GCC to force that on in a portable way. You can always write inline assembly code to use them though. You may have to force GCC to use only SSE2 for floating point math.

Just as an addition to the other answers, if you actually have a problem with denormal floating point values you probably have a precision problem in addition to your performance issue.
It may be a good idea to check if you can restructure your computations to keep the numbers larger to avoid losing precision and performance.

Related

Microsoft Visual Studio: Setting Rounding Modi on Floating Point for x64

I am trying to figure out how to set ROUND_UP, ROUND_DOWN, ROUND_to_NEAREST, and ROUND_to_INFINITY for an MS Visual Studio project.
The representation of numbers should follow the IEEE 754 standard, which means /fp:strict is selected. However, the code is running in an x64 environment.
Through careful selection of rounding mode, I want to cause -0.000000 to be equal to -0.000001, for example.
Cheers
Commodore
I am making some computations and saving the results (tuple). After each operation, I query the saved data to know whether I already had the value. (-0.000000,-0.202319) would be equal to (-0.000001,-0.202319) when rounding to nearest. How can I do this with Visual Studio?
In general, == and != for floating-point are not a 'safe' method for doing floating-point comparison except in the specific case of 'binary representation equality' even using 'IEEE-754 compliant' code generation. This is why clang/LLVM for example has the -Wfloat-equal warning.
Be sure to read What Every Computer Scientist Should Know About Floating-Point Arithmetic. I'd also recommend reading the many great Bruce Dawson blog posts on the topic of floating-point.
Instead, you should explicitly use an 'epsilon' comparison:
constexpr float c_EPSILON = 0.000001f;
if (fabsf(a - b) <= c_EPSILON)
{
// A & B are equal within the epsilon value.
}
In general, you can't assume that SSE/SSE2-based floating-point math (required for x64) will match legacy x87-based floating-point math, and in many cases you can't even assume AMD and Intel will always agree even with /fp:strict.
For example, IEEE-754 doesn't specify what happens with fast reciprocal operations such as RCP or RSQRT. AMD and Intel give different answers in these cases in the lower bits.
With all that said, you are intended to use _controlfp or _controlfp_s rather than _control87 to control rounding mode behavior for all platforms. _control87 really only works for 32-bit (x86) platforms when using /arch:IA32.
Keep in mind that changing the floating-point control word and calling code outside of your control is often problematic. Code assumes the default of "no-exceptions, round-to-nearest" and deviation from that can result in untested/undefined behavior. You can really only change the control word, do your own code, then change it back to the default in any safe way. See this old DirectX article.
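The "change it, run your own code, change it back" pattern can be sketched with the standard <cfenv> functions instead of _controlfp. ScopedRoundingMode and divide_rounding_up are hypothetical names for this example. Strictly speaking the standard wants #pragma STDC FENV_ACCESS ON for this, which not every compiler honours, so the volatile variables are there to keep the division from being folded at compile time.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical RAII helper: switches the rounding mode for one scope only
// and restores whatever mode was active before.
class ScopedRoundingMode {
public:
    explicit ScopedRoundingMode(int mode) : saved_(std::fegetround()) {
        const int rc = std::fesetround(mode);
        assert(rc == 0);  // a non-zero result means the mode is unsupported
        (void)rc;
    }
    ~ScopedRoundingMode() { std::fesetround(saved_); }
    ScopedRoundingMode(const ScopedRoundingMode&) = delete;
    ScopedRoundingMode& operator=(const ScopedRoundingMode&) = delete;

private:
    int saved_;
};

// Divides with rounding toward +infinity, then returns to the old mode.
double divide_rounding_up(double a, double b) {
    ScopedRoundingMode guard(FE_UPWARD);
    volatile double va = a;
    volatile double vb = b;
    volatile double q = va / vb;  // evaluated at run time under FE_UPWARD
    return q;
}
```

Code called after the function returns sees the default round-to-nearest mode again.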
That’s not how it works. Choosing the rounding mode affects the last bit of the result, that’s it. Comparisons are not affected. And comparisons between 0.000001 and 0.234567 are most definitely not affected.
What you want cannot be achieved with rounding modes. Feel free to write a function that returns true if two numbers are close together.
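A minimal sketch of such a "close together" function. nearly_equal is a made-up name and both default tolerances are arbitrary assumptions; pick values that match the accuracy of your own data.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true if a and b differ by at most abs_tol, or by at
// most rel_tol times the larger magnitude (whichever bound is looser).
bool nearly_equal(double a, double b,
                  double abs_tol = 1e-6, double rel_tol = 1e-9) {
    assert(abs_tol >= 0.0 && rel_tol >= 0.0);
    if (a == b) {
        return true;   // covers +0.0 == -0.0 and equal infinities
    }
    if (!std::isfinite(a) || !std::isfinite(b)) {
        return false;  // NaN, or infinities that are not equal
    }
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

The absolute bound handles values near zero, where a purely relative test would reject almost everything; the relative bound handles large values, where a fixed epsilon is smaller than one ulp.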

acos(double) gives different result on x64 and x32 Visual Studio

acos(double) gives different result on x64 and x32 Visual Studio.
printf("%.30g\n", double(acosl(0.49990774364240564)));
printf("%.30g\n", acos(0.49990774364240564));
on x64: 1.0473040763868076
on x32: 1.0473040763868078
on linux4.4 x32 and x64 with sse enabled: 1.0473040763868078
is there a way to make VSx64 acos() give me 1.0473040763868078 as result?
TL;DR: this is normal and you can't reasonably change it.
The 32-bit library may be using 80-bit FP values in x87 registers for its temporaries, avoiding rounding off to 64-bit double after every operation. (Unless there's a whole separate library, compiling your own code to use SSE doesn't change what's inside the library, or even the calling convention for passing data to the library. But since 32-bit passes double and float in memory on the stack, a library is free to load it with SSE2 or with x87. Still, you don't get the performance advantage of passing FP values in xmm registers unless it's impossible for non-SSE code to use the library.)
It's also possible that they're different simply because they use a different order of operations, producing different temporaries along the way. That's less plausible, unless they're separately hand-written in asm. If they're built from the same C source (without "unsafe" FP optimizations), then the compiler isn't allowed to reorder things, because of this non-associative behaviour of FP math.
glibc's libm (used on Linux) typically favours precision over speed, so it's giving you the correctly-rounded result out to the last bit of the mantissa for both 32 and 64-bit. The IEEE FP standard only requires the basic operations (+ - * / sqrt, FMA and FP remainder) to be "correctly rounded" out to the last bit of the mantissa (i.e. rounding error of at most 0.5 ulp). (The exact result, according to calc, is 1.047304076386807714.... Keep in mind that double (on x86 with normal compilers) is IEEE754 binary64, so internally the mantissa and exponent are in base 2. If you print enough extra decimal digits, though, you can tell that ...7714 should round up to ...78, although really you should print more digits in case they're not zero beyond that. I'm just assuming it's ...78000.)
So Microsoft's 64-bit library implementation produces 1.0473040763868076 and there's pretty much nothing you can do about it, other than not use it. (e.g. find your own acos() implementation and use it.) But FP determinism is hard, even if you limit yourself to just x86 with SSE. See Does any floating point-intensive code produce bit-exact results in any x86-based architecture?. If you limit yourself to a single compiler, it can be possible if you avoid complicated library functions like acos().
You might be able to get the 32-bit library version to produce the same value as the 64-bit version, if it uses x87 and changing the x87 precision setting affects it. But the other way around is not possible: SSE2 has separate instructions for 64-bit double and 32-bit float, and always rounds after every instruction, so there is no setting you can change that will increase the precision of the result. (You could change the SSE rounding mode, and that will change the result, but not in a good way!)
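To see how far apart two such results are, it helps to count units in the last place rather than compare decimal printouts. This is a sketch under the assumption that double is IEEE 754 binary64 and both inputs are finite; ordered_bits and ulp_distance are names invented here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Maps a finite double to an integer that grows monotonically with the
// value, so adjacent doubles map to adjacent integers.
std::int64_t ordered_bits(double x) {
    assert(std::isfinite(x));
    std::int64_t i;
    std::memcpy(&i, &x, sizeof i);  // reinterpret the bit pattern
    // Negative doubles are stored sign-magnitude; flip them so that
    // larger magnitudes become more negative integers.
    return i < 0 ? std::numeric_limits<std::int64_t>::min() - i : i;
}

// Number of representable doubles you step over going from a to b.
std::int64_t ulp_distance(double a, double b) {
    return ordered_bits(b) - ordered_bits(a);
}
```

For the two values quoted in the question, ulp_distance(1.0473040763868076, 1.0473040763868078) comes out as 1: the libraries disagree only in the very last bit.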
See also:
Intermediate Floating-Point Precision and the rest of Bruce Dawson's excellent series of articles about floating point (table of contents).
The linked article describes how some versions of VC++'s CRT runtime startup set the x87 FP register precision to 53-bit mantissa instead of 80-bit full precision. Also that D3D9 will set it to 24, so even double only has the precision of float if done with x87.
https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma
What Every Computer Scientist Should Know About Floating-Point Arithmetic
You may have reached the precision limit. Double precision is approximately 16 digits. After, there is no guarantee the digits are valid. If so, you cannot change this behavior, except changing the type double to something else, supporting higher precision.
E.g. long double, if your machine and compiler support the extended 80-bit double or the 128-bit quadruple format (this is also machine dependent; see here for example).
I disagree that there isn't much you can do about it.
For example, you could try changing the floating point model compiler options.
Here are my results with different floating point models (note /fp:precise is the default):
/fp:precise 1.04730407638680755866289473488
/fp:strict 1.04730407638680755866289473488
/fp:fast 1.04730407638680778070749965991
So it seems you are looking for /fp:fast. Whether that gives the most accurate result remains to be seen though.

Truncate Floats and Doubles after user defined points in X87 and SSE FPUs

I have made a function g that is able to approximate a function to a certain degree, this function gives accurate results up to 5 decimals ( 1,23456xxxxxxxxxxxx where the x positions are just rounding errors / junk ) .
To avoid spreading error to other computations that will use the results of g I would like to just set all the x positions to zero, better yet, just set to 0 everything after the 5th decimal .
I haven't found anything in X87 and SSE literature that lets me play with IEEE 754 bits or their representation the way I would like to .
There is an old reference to the FISTP instruction for X87 that is apparently mirrored in the SSE world with FISTTP, with the benefit that FISTTP doesn't necessarily modify the control word and is therefore faster .
I have noticed that FISTTP was called "chopping mode", but now in more modern literature it is just "rounding toward zero" or "truncate", and this confuses me because "chopping" means removing something altogether, whereas "rounding toward zero" doesn't necessarily mean the same thing to me .
I don't need to round to zero, I only need to preserve up to 5 decimals as the last step in my function before storing the result in memory; how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?
As several people commented, more early rounding doesn't help the final result be more accurate. If you want to read more about floating point comparisons and weirdness / gotchas, I highly recommend Bruce Dawson's series of articles on floating point. Here's a quote from the one with the index
We’ve finally reached the point in this series that I’ve been waiting
for. In this post I am going to share the most crucial piece of
floating-point math knowledge that I have. Here it is:
[Floating-point] math is hard.
You just won’t believe how vastly, hugely, mind-bogglingly hard it is.
I mean, you may think it’s difficult to calculate when trains from
Chicago and Los Angeles will collide, but that’s just peanuts to
floating-point math.
(Bonus points if you recognize that last paragraph as a paraphrase of a famous line about space.)
How you could actually implement your bad idea:
There aren't any machine instructions or C standard library functions to truncate or round to anything other than integer.
Note that there are machine instructions (and C functions) that round a double to nearest (representable) integer without converting it to intmax_t or anything, just double->double. So no round-trip through a fixed-width 2's complement integer.
So to use them, you could scale your float up by some factor, round to nearest integer, then scale back down. (This is like chux's round()-based function, but I'd recommend C99 double rint(double) instead of round(). round has weird rounding semantics that don't match any of the available rounding modes on x86, so it compiles to worse code.)
The x86 asm instructions you keep mentioning are nothing special, and don't do anything that you can't ask the compiler to do with pure C.
FISTP (Float Integer STore (and Pop the x87 stack)) is one way for a compiler or asm programmer to implement long lrint(double) or (int)nearbyint(double). Some compilers make better code for one or the other. It rounds using the current x87 rounding mode (default: round to nearest), which is the same semantics as those ISO C standard functions.
FISTTP (Float Integer STore with Truncation (and Pop the x87 stack)) is part of SSE3, even though it operates on the x87 stack. It lets compilers avoid setting the rounding mode to truncation (round-towards-zero) to implement the C truncation semantics of (long)x, and then restoring the old rounding mode.
This is what the "not modify the control word" stuff is about. Neither instruction does that, but to implement (int)x without FISTTP, the compiler has to use other instructions to modify and restore the rounding mode around a FIST instruction. Or just use SSE2 CVTTSD2SI to convert a double in an xmm register with truncation, instead of an FP value on the legacy x87 stack.
Since FISTTP is only available with SSE3, you'd only use it for long double, or in 32-bit code that had FP values in x87 registers anyway because of the crusty old 32-bit ABI which returns FP values on the x87 stack.
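The difference between the three conversions mentioned above shows up in a few lines of portable C++. This sketch assumes the default round-to-nearest mode is active; check_rounding_semantics is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Documents, via assertions, how each conversion treats the same inputs
// under the default rounding mode (round to nearest, ties to even).
void check_rounding_semantics() {
    // lrint / rint: use the current rounding mode (what FISTP does).
    assert(std::lrint(2.5) == 2);   // tie goes to the even neighbour
    assert(std::lrint(3.5) == 4);
    assert(std::rint(2.5) == 2.0);  // double -> double, no integer type

    // A cast always truncates toward zero (what FISTTP / CVTTSD2SI do).
    assert(static_cast<long>(2.9) == 2);
    assert(static_cast<long>(-2.9) == -2);

    // round: ties away from zero, regardless of the rounding mode.
    assert(std::round(2.5) == 3.0);
    assert(std::round(-2.5) == -3.0);
}
```

The round() results are the ones that match no hardware rounding mode on x86, which is why rint() tends to compile to tighter code.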
PS. if you didn't recognize Bruce's HHGTG reference, the original is:
Space is big. Really big. You just won’t believe how vastly hugely
mindbogglingly big it is. I mean you may think it’s a long way down
the road to the chemist’s, but that’s just peanuts to space.
how do I do this in X87 ( scalar FPU ) and SSE ( vector FPU ) ?
The following does not use X87 nor SSE. I've included it as a community reference for general purpose code. If anything, it can be used to test a X87 solution.
Any "chopping" of the result of g() will at least marginally increase error, hopefully tolerable as OP said "To avoid spreading error to other computations ..."
It is unclear if OP wants "accurate results up to 5 decimals" to reflect absolute precision (+/- 0.000005) or relative precision (+/- 0.000005 * result). Will assume "absolute precision".
Since float and double are far more often binary floating point than decimal, any "chop" will yield the FP number nearest to a multiple of 0.00001, not an exact multiple.
Text Method:
// - x xxx...xxx . xxxxx \0
char buf[1+1+ DBL_MAX_10_EXP+1 +5 +1];
sprintf(buf, "%.5f", x);
x = atof(buf);
round() rint() method:
#define SCALE 100000.0
if (fabs(x) < DBL_MAX/SCALE) {
x = x*SCALE;
x = rint(x)/SCALE;
}
Direct bit manipulation of x. Simply zero select bits in the significand.
TBD code.
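One possible shape for that bit-manipulation variant, as a sketch only and not the original poster's code. It assumes IEEE 754 binary64 and finite input, and clear_low_significand_bits is an invented name. Note that this truncates in binary: the result has fewer significant bits, but it is generally not a multiple of 0.00001.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: zeroes the lowest `bits` bits of the 52-bit stored
// significand, which truncates the magnitude toward zero.
double clear_low_significand_bits(double x, unsigned bits) {
    assert(bits <= 52);  // only the stored fraction field may be cleared
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);                      // read the bit pattern
    const std::uint64_t mask = ~((std::uint64_t{1} << bits) - 1);
    u &= mask;                                          // drop the low bits
    std::memcpy(&x, &u, sizeof x);                      // write it back
    return x;
}
```

For values in [1, 2), clearing 35 bits keeps 17 fraction bits, so the result is within 2^-17 (about 7.6e-6) of the input; for other magnitudes the absolute error scales with the exponent.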

Should I use floating point's NaN, or floating point + bool for a data set that contains invalid values?

I have a large amount of data to process with math intensive operations on each data set. Much of it is analogous to image processing. However, since this data is read directly from a physical device, many of the pixel values can be invalid.
This makes NaN's property of representing values that are not a number and spreading on arithmetic operations very compelling. However, it also seems to require turning off some optimizations such as gcc's -ffast-math, plus we need to be cross platform. Our current design uses a simple struct that contains a float value and a bool indicating validity.
While it seems NaN was designed with this use in mind,
others think it is more trouble than it is worth. Does anyone have advice based on their more intimate experience with IEEE754 with performance in mind?
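For reference, the NaN behaviour in question looks like this in C++. The sketch assumes the code is built without -ffast-math or -ffinite-math-only (which would allow the compiler to drop the checks); calibrate and mean_of_valid are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Any arithmetic on a NaN pixel yields NaN, so invalidity propagates
// through the pipeline without an explicit flag.
float calibrate(float pixel, float gain, float offset) {
    return pixel * gain + offset;
}

// Reductions still need an explicit test, or one NaN poisons the result.
double mean_of_valid(const std::vector<float>& pixels) {
    double sum = 0.0;
    std::size_t count = 0;
    for (float p : pixels) {
        if (!std::isnan(p)) {
            sum += p;
            ++count;
        }
    }
    return count ? sum / static_cast<double>(count)
                 : std::numeric_limits<double>::quiet_NaN();
}

void check_nan_propagation() {
    const float invalid = std::numeric_limits<float>::quiet_NaN();
    assert(std::isnan(calibrate(invalid, 2.0f, 1.0f)));
    assert(!(invalid == invalid));  // NaN never compares equal, even to itself
    assert(mean_of_valid({1.0f, invalid, 3.0f}) == 2.0);
}
```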
BRIEF: For strictest portability, don't use NaNs. Use a separate valid bit. E.g. a template like Valid<T>. However, if you know that you will only ever run on IEEE 754-2008 machines, and not IEEE 754-1985 (see below), then you may get away with it.
For performance, it is probably faster not to use NaNs on most of the machines that you have access to. However, I have been involved with hardware design of FP on several machines that are improving NaN handling performance, so there is a trend to make NaNs faster, and, in particular, signalling NaNs should soon be faster than Valid.
DETAIL:
Not all floating point formats have NaNs. Not all systems use IEEE floating point. IBM hex floating point can still be found on some machines - actually systems, since IBM now supports IEEE FP on more recent machines.
Furthermore, IEEE Floating Point itself had compatibility issues wrt NaNs, in IEEE 754-1985. E.g, see wikipedia http://en.wikipedia.org/wiki/NaN:
The original IEEE 754 standard from 1985 (IEEE 754-1985) only
described binary floating point formats, and did not specify how the
signaled/quiet state was to be tagged. In practice, the most
significant bit of the significand determined whether a NaN is
signalling or quiet. Two different implementations, with reversed
meanings, resulted.
* most processors (including those of the Intel/AMD x86-32/x86-64 family, the Motorola 68000 family, the AIM PowerPC family, the ARM
family, and the Sun SPARC family) set the signaled/quiet bit to
non-zero if the NaN is quiet, and to zero if the NaN is signaling.
Thus, on these processors, the bit represents an 'is_quiet' flag.
* in NaNs generated by the PA-RISC and MIPS processors, the signaled/quiet bit is zero if the NaN is quiet, and non-zero if the
NaN is signaling. Thus, on these processors, the bit represents an
'is_signaling' flag.
Thus, if your code may run on older HP machines, or current MIPS machines (which are ubiquitous in embedded systems), you should not depend on a fixed encoding of NaN, but should have a machine dependent #ifdef for your special NaNs.
IEEE 754-2008 standardizes NaN encodings, so this is getting better. It depends on your market.
As for performance: many machines essentially trap, or otherwise take a major hiccup in performance, when performing computations involving both SNaNs (which must trap) and QNaNs (which don't need to trap, i.e. which could be fast - and which are getting faster in some machines as we speak.)
I can say with confidence that on older machines, particularly older Intel machines, you did NOT want to use NaNs if you cared about performance. E.g. http://www.cygnus-software.com/papers/x86andinfinity.html says "The Intel Pentium 4 handles infinities, NANs, and denormals very badly. ... If you write code that adds floating point numbers at the rate of one per clock cycle, and then throw infinities at it as input, the performance drops. A lot. A huge amount. ... NANs are even slower. Addition with NANs takes about 930 cycles. ... Denormals are a bit trickier to measure."
Get the picture? Almost 1000x slower to use a NaN than to do a normal floating point operation? In this case it is almost guaranteed that using a template like Valid will be faster.
However, see the reference to "Pentium 4"? That's a really old web page. For years people like me have been saying "QNaNs should be faster", and it has slowly taken hold.
More recently (2009), Microsoft says http://connect.microsoft.com/VisualStudio/feedback/details/498934/big-performance-penalty-for-checking-for-nans-or-infinity "If you do math on arrays of double that contain large numbers of NaN's or Infinities, there is an order of magnitude performance penalty."
If I feel impelled, I may go and run a microbenchmark on some machines. But you should get the picture.
This should be changing because it is not that hard to make QNaNs fast. But it has always been a chicken and egg problem: hardware guys like those I work with say "Nobody uses NaNs, so we won't make them fast", while software guys don't use NaNs because they are slow. Still, the tide is slowly changing.
Heck, if you are using gcc and want best performance, you turn on optimizations like "-ffinite-math-only ... Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs." Similar is true for most compilers.
By the way, you can google like I did, "NaN performance floating point" and check refs out yourself. And/or run your own microbenchmarks.
Finally, I have been assuming that you are using a template like
template<typename T> class Valid {
...
bool valid;
T value;
...
};
I like templates like this, because they can bring "validity tracking" not just to FP, but also to integer (Valid<int>), etc.
But, they can have a big cost. The operations are probably not much more expensive than NaN handling on old machines, but the data density can be really poor. sizeof(Valid<float>) may sometimes be 2*sizeof(float). This bad density may hurt performance much more than the operations involved.
By the way, you should consider template specialization, so that Valid<float> uses NaNs if they are available and fast, and a valid bit otherwise.
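template <> class Valid<float> {
float value;
public:
bool is_valid() const {
// A NaN never compares equal to anything (itself included), so
// "value != my_special_NaN" is always true; use std::isnan from <cmath>.
return !std::isnan(value);
}
};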
etc.
Anyway, you are better off having as few valid bits as possible, and packing them elsewhere, rather than Valid right close to the value. E.g.
struct Point { float x, y, z; };
Valid<Point> pt;
is better (density wise) than
struct Point_with_Valid_Coords { Valid<float> x, y, z; };
unless you are using NaNs - or some other special encoding.
And
struct Point_with_Valid_Coords { float x, y, z; bool valid_x, valid_y, valid_z; };
is in between - but then you have to do all the code yourself.
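To illustrate "packing the valid bits elsewhere", here is a sketch that keeps the floats contiguous and stores one validity bit per element in a separate array, at a cost of 1 bit instead of up to 4 bytes per value. MaskedFloats is a made-up class, not something from the post above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: dense float storage plus a packed bitmap that
// records which elements hold valid data. Everything starts out invalid.
class MaskedFloats {
public:
    explicit MaskedFloats(std::size_t n)
        : values_(n, 0.0f), valid_((n + 63) / 64, 0) {}

    std::size_t size() const { return values_.size(); }

    void set(std::size_t i, float v) {
        assert(i < size());
        values_[i] = v;
        valid_[i / 64] |= std::uint64_t{1} << (i % 64);     // mark valid
    }

    void invalidate(std::size_t i) {
        assert(i < size());
        valid_[i / 64] &= ~(std::uint64_t{1} << (i % 64));  // clear the bit
    }

    bool is_valid(std::size_t i) const {
        assert(i < size());
        return ((valid_[i / 64] >> (i % 64)) & 1u) != 0;
    }

    float value(std::size_t i) const {
        assert(i < size());
        return values_[i];
    }

private:
    std::vector<float> values_;        // contiguous, SIMD-friendly data
    std::vector<std::uint64_t> valid_; // one bit per element
};
```

A side benefit of this layout is that a whole word of 64 validity bits can be tested at once, so fully valid runs can skip per-element checks.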
BTW, I have been assuming you are using C++. If FORTRAN or Java ...
BOTTOM LINE: separate valid bits is probably faster and more portable.
But NaN handling is speeding up, and one day soon will be good enough.
By the way, my preference: create a Valid template. Then you can use it for all data types. Specialize it for NaNs if it helps. Although my life is making things faster, IMHO it is usually more important to make the code clean.
If invalid data is very common, you are of course wasting a lot of time on running this data through the processing. If the invalid data is common enough it is probably better to be running some kind of sparse datastructure of only the valid data. If it is not very common, you can of course keep a sparse datastructure of which data is invalid. That way you would not waste a bool for each value. But maybe memory is not a problem for you...
If you are doing operations such as multiplying two possibly invalid data entries, I understand it is compelling to use NaNs instead of doing checks on both variables to see if they are valid and setting the same flag in the result.
How portable do you need to be? Will you ever need to be able to port it to an architecture with only fixed point support? If that is the case, I think your choice is clear.
Personally I would only use NaNs if it proved to be much faster. Otherwise I'd say the code gets more clear if you have explicit handling of invalid data.
Since the floating-point numbers come from a device, they probably have a limited range. You can use some other special number, rather than NaN, to indicate absence of data, e.g. 1e37. This solution is portable. I do not know whether it is more convenient for you than using a bool flag.

Signed zero linux vs windows

I am running a C++ program on Windows and on Linux.
The output is meant to be identical.
I am trying to make sure that the only differences are real differences, as opposed to working-environment differences.
So far I have taken care of all the differences that can be caused by \r\n line endings,
but there is one thing that I can't seem to figure out.
In the Windows output there is a 0.000 and in the Linux output it is -0.000.
Does anyone know what could be causing the difference?
Thanks
Probably it comes from differences in how the optimizer optimizes some FP calculations (that can be configurable - see e.g. here); in one case you get a value slightly less than 0, in the other slightly more. Both in output are rounded to a 0.000, but they keep their "real" sign.
Since in the IEEE floating point format the sign bit is separate from the value, you have two different values of 0, a positive and a negative one. In most cases it doesn't make a difference; both zeros will compare equal, and they indeed describe the same mathematical value (mathematically, 0 and -0 are the same). Where the difference can be significant is when you have underflow and need to know whether the underflow occurred from a positive or from a negative value. Also if you divide by 0, the sign of the infinity you get depends on the sign of the 0 (i.e. 1/+0.0 gives +Inf, but 1/-0.0 gives -Inf). In other words, most probably it won't make a difference for you.
Note however that the different output does not necessarily mean that the number itself is different. It could well be that the value in Windows is also -0.0, but the output routine on Windows doesn't distinguish between +0.0 and -0.0 (they compare equal, after all).
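If the goal is simply identical text output on both platforms, one option is to normalise the sign at formatting time. The sketch below uses invented names (is_negative_zero, format_fixed) and assumes the formatted number fits in the 400-byte buffer.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <string>

// True only for the IEEE negative zero; -0.0 == 0.0 compares equal, so
// the sign bit has to be inspected separately.
bool is_negative_zero(double x) {
    return x == 0.0 && std::signbit(x);
}

// Formats x with a fixed number of decimals and drops the minus sign when
// every printed digit is zero, so "-0.000" and "0.000" come out the same.
std::string format_fixed(double x, int decimals) {
    assert(decimals >= 0);
    char buf[400];
    std::snprintf(buf, sizeof buf, "%.*f", decimals, x);
    std::string s(buf);
    if (!s.empty() && s[0] == '-' &&
        s.find_first_not_of("0.", 1) == std::string::npos) {
        s.erase(0, 1);  // only zeros and the decimal point follow the sign
    }
    return s;
}
```

This covers both an exact -0.0 and small negative values such as -0.0001 that merely round to zero at the chosen precision.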
Unless using (unsafe) flags like -ffast-math, the compiler is limited in the assumptions it can make when 'optimizing' IEEE-754 arithmetic. First check that both platforms are using the same rounding.
Also, if possible, check they are using the same floating-point unit. i.e., SSE vs FPU on x86. The latter might be an issue with math library function implementations - especially trigonometric / transcendental functions.