I am trying to figure out how to set ROUND_UP, ROUND_DOWN, ROUND_TO_NEAREST, and ROUND_TO_INFINITY for an MS Visual Studio project.
The representation of floating-point numbers should follow the IEEE 754 standard, which means /fp:strict is selected. However, the code is running in an x64 environment.
Through careful selection of rounding mode, I want to cause -0.000000 to be equal to -0.000001, for example.
Cheers
Commodore
I am making some computations and saving the results (tuples). After each operation, I query the saved data to check whether I already have the value. (-0.000000, -0.202319) would be equal to (-0.000001, -0.202319) when rounding to nearest. How can I do this with Visual Studio?
In general, == and != for floating-point are not a 'safe' method for doing floating-point comparison except in the specific case of 'binary representation equality' even using 'IEEE-754 compliant' code generation. This is why clang/LLVM for example has the -Wfloat-equal warning.
Be sure to read What Every Computer Scientist Should Know About Floating-Point Arithmetic. I'd also recommend reading the many great Bruce Dawson blog posts on the topic of floating-point.
Instead, you should explicitly use an 'epsilon' comparison:
#include <cmath>

constexpr float c_EPSILON = 0.000001f;

if (fabsf(a - b) <= c_EPSILON)
{
    // A & B are equal within the epsilon value.
}
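If your values span a wide range of magnitudes, a fixed epsilon breaks down; Dawson's posts cover relative comparisons in depth. A minimal sketch of that idea (the name nearly_equal and the tolerance are illustrative choices, not a standard API):

#include <cmath>
#include <algorithm>

// A and B count as "equal" if their difference is small relative to
// their magnitudes. The 1e-5f tolerance is illustrative, not canonical.
bool nearly_equal(float a, float b, float relTol = 1e-5f)
{
    return std::fabs(a - b) <= relTol * std::max(std::fabs(a), std::fabs(b));
}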
In general, you can't assume that SSE/SSE2-based floating-point math (required for x64) will match legacy x87-based floating-point math, and in many cases you can't even assume AMD and Intel will always agree even with /fp:strict.
For example, IEEE-754 doesn't specify what happens with fast reciprocal operations such as RCP or RSQRT. AMD and Intel give different answers in these cases in the lower bits.
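To see this concretely, here is a small sketch using the SSE rcpss intrinsic; the approximate result is exactly the part IEEE-754 leaves unspecified:

#include <immintrin.h>
#include <cstdio>

int main()
{
    // RCPSS computes an *approximate* reciprocal (roughly 12 bits of
    // precision); IEEE-754 says nothing about it, so AMD and Intel
    // may legitimately differ in the low bits.
    float approx = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(3.0f)));
    float exact  = 1.0f / 3.0f; // ordinary division IS correctly rounded
    std::printf("%.9g vs %.9g\n", approx, exact);
}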
With all that said, you are intended to use _controlfp or _controlfp_s rather than _control87 to control rounding mode behavior for all platforms. _control87 really only works for 32-bit (x86) platforms when using /arch:IA32.
Keep in mind that changing the floating-point control word and calling code outside of your control is often problematic. Code assumes the default of "no-exceptions, round-to-nearest" and deviation from that can result in untested/undefined behavior. You can really only change the control word, do your own code, then change it back to the default in any safe way. See this old DirectX article.
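A minimal sketch of that save/change/restore pattern with _controlfp_s (MSVC-specific; error handling omitted; under /fp:strict the compiler already assumes the FP environment may be accessed):

#include <float.h>

void compute_with_round_down()
{
    unsigned int saved = 0, tmp = 0;
    _controlfp_s(&saved, 0, 0);                   // read current control word, change nothing
    _controlfp_s(&tmp, _RC_DOWN, _MCW_RC);        // select round-toward-negative-infinity
    // ... your own computations here ...
    _controlfp_s(&tmp, saved & _MCW_RC, _MCW_RC); // restore the previous rounding mode
}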
That’s not how it works. Choosing the rounding mode affects the last bit of the result, that’s it. Comparisons are not affected. And comparisons between 0.000001 and 0.234567 are most definitely not affected.
What you want cannot be achieved with rounding modes. Feel free to write a function that returns true if two numbers are close together.
Related
I have some cross platform code I'm working with. On the Mac it's compiled with Clang, on Windows it's compiled with Visual C++.
There is a calculation that can be sensitive, and there was a difference between Mac and Windows that was triggering asserts. It ends up there is a difference between acos results, but I'm not clear why.
On both platforms, the input to acos is exactly -1.0f. In Visual C++, acos(-1.0f) is 3.14159274. That's the value of pi as a float, which is what I'd expect.
But on macOS:
float value = acos(-1.0f);
...evaluates to 3.1415925. That's just enough of an accuracy difference to trigger issues in the code. acos is an operation that can be imprecise with float - I understand that. And different compilers can have different implementations of acos. I'm just unclear why Clang seems to have issues with such a simple acos result while Visual C++ doesn't. A float is capable of representing 3.14159274, but that's not the result I'm getting.
It is possible to get an accurate/Visual C++ aligned value out of Xcode's version of Clang with:
float value = (float)acos((double)-1.0f);
So I can fix the issue by moving to higher accuracy, and then casting the value back down to float to preserve the same rounding as Windows. I'm just looking for a justification as to why the extra precision is necessary when the VC++ compiler doesn't seem to have a precision issue. It could be differences between the Clang/Xcode/VC++ math libraries as well. I just assumed that acos(-1.0) might be more settled across compilers. I couldn't find any difference in rounding modes (even though rounding should not be necessary), and fresh projects in Xcode and Visual Studio show the same difference. Both machines are Intel.
If you look at the binary representation of these floating-point values, you can see that mac/clang's value A is the next floating-point number below windows/msvc's value B:
A 3.14159250 0x40490FDA
B 3.14159274 0x40490FDB
Whilst B is closest to the true value of π, it is actually greater than π as #njuffa points out in their comment.
Reading the specification, it looks like acosf is supposed to return a value in the closed range [0, π]. Technically A meets this criterion whilst B doesn't.
In summary -
A is the closest value to, but less than, π
B is the closest value to π
The difference in these may be as a result of a deliberate decision of the respective standard library implementors.
I'd also observe that both values are true inverses of cosf as both cosf(A) and cosf(B) equal -1.0f.
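If you want to check the bit patterns on your own machines, a sketch like this avoids the undefined behaviour of pointer-based type punning:

#include <cstdio>
#include <cstring>
#include <cmath>

int main()
{
    float a = std::acos(-1.0f); // 0x40490FDA or 0x40490FDB depending on the libm
    unsigned int bits;
    std::memcpy(&bits, &a, sizeof bits); // reinterpret the float's bytes safely
    std::printf("%.9g = 0x%08X\n", a, bits);
}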
Generally speaking, though, it is unwise to rely on exact bit-level accuracy with any floating point calculations. If you are not already aware of it, the document What Every Computer Scientist Should Know About Floating-Point Arithmetic explains why.
Edit: I was curious and found what might be relevant Apple source code here.
Return value:
...
Otherwise:
...
Returns a value in [0, pi] (C 7.12.4.1 3). Note that
this prohibits returning a correctly rounded value for acosf(-1),
since pi rounded to a float lies outside that interval.
acos(double) gives different result on x64 and x32 Visual Studio.
printf("%.30g\n", double(acosl(0.49990774364240564)));
printf("%.30g\n", acos(0.49990774364240564));
on x64: 1.0473040763868076
on x32: 1.0473040763868078
on linux4.4 x32 and x64 with sse enabled: 1.0473040763868078
is there a way to make VSx64 acos() give me 1.0473040763868078 as result?
TL:DR: this is normal and you can't reasonably change it.
The 32-bit library may be using 80-bit FP values in x87 registers for its temporaries, avoiding rounding off to 64-bit double after every operation. (Unless there's a whole separate library, compiling your own code to use SSE doesn't change what's inside the library, or even the calling convention for passing data to the library. But since 32-bit passes double and float in memory on the stack, a library is free to load it with SSE2 or with x87. Still, you don't get the performance advantage of passing FP values in xmm registers unless it's impossible for non-SSE code to use the library.)
It's also possible that they're different simply because they use a different order of operations, producing different temporaries along the way. That's less plausible, unless they're separately hand-written in asm. If they're built from the same C source (without "unsafe" FP optimizations), then the compiler isn't allowed to reorder things, because of this non-associative behaviour of FP math.
glibc's libm (used on Linux) typically favours precision over speed, so it's giving you the correctly-rounded result out to the last bit of the mantissa for both 32 and 64-bit. The IEEE FP standard only requires the basic operations (+ - * /, sqrt, FMA, and FP remainder) to be "correctly rounded" out to the last bit of the mantissa, i.e. a rounding error of at most 0.5 ulp. (The exact result, according to calc, is 1.047304076386807714.... Keep in mind that double (on x86 with normal compilers) is IEEE 754 binary64, so internally the mantissa and exponent are in base 2. If you print enough extra decimal digits, though, you can tell that ...7714 should round up to ...78, although really you should print more digits in case they're not zero beyond that. I'm just assuming it's ...78000.)
So Microsoft's 64-bit library implementation produces 1.0473040763868076 and there's pretty much nothing you can do about it, other than not use it. (e.g. find your own acos() implementation and use it.) But FP determinism is hard, even if you limit yourself to just x86 with SSE. See Does any floating point-intensive code produce bit-exact results in any x86-based architecture?. If you limit yourself to a single compiler, it can be possible if you avoid complicated library functions like acos().
You might be able to get the 32-bit library version to produce the same value as the 64-bit version, if it uses x87 and changing the x87 precision setting affects it. But the other way around is not possible: SSE2 has separate instructions for 64-bit double and 32-bit float, and always rounds after every instruction, so you can't change any setting that will increase the precision of the result. (You could change the SSE rounding mode, and that will change the result, but not in a good way!)
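If you want to experiment with that, _controlfp_s exposes the x87 precision control; a sketch (note that per Microsoft's docs the _MCW_PC mask is not supported on x64 or ARM, so this only does anything in 32-bit x87 builds):

#include <float.h>

void try_extended_precision()
{
    unsigned int prev = 0;
    // _PC_64 selects the full 64-bit x87 significand; it only affects
    // x87 code paths, and is ignored/unsupported for SSE2 and on x64.
    _controlfp_s(&prev, _PC_64, _MCW_PC);
    // ... call the 32-bit library acos() here and compare results ...
}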
See also:
Intermediate Floating-Point Precision and the rest of Bruce Dawson's excellent series of articles about floating point (table of contents).
The linked article describes how some versions of VC++'s CRT startup code set the x87 FP register precision to a 53-bit mantissa instead of 80-bit full precision. Also, D3D9 will set it to 24-bit, so even double only has the precision of float if the math is done with x87.
https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma
What Every Computer Scientist Should Know About Floating-Point Arithmetic
You may have reached the precision limit. Double precision is approximately 16 significant digits; beyond that, there is no guarantee the digits are valid. If so, you cannot change this behavior, except by changing the type from double to something that supports higher precision,
e.g. long double, if your machine and compiler support the extended 80-bit double or 128-bit quadruple format (this is machine dependent; see here for example).
I disagree that there isn't much you can do about it.
For example, you could try changing the floating point model compiler options.
Here are my results with different floating point models (note /fp:precise is the default):
/fp:precise 1.04730407638680755866289473488
/fp:strict 1.04730407638680755866289473488
/fp:fast 1.04730407638680778070749965991
So it seems you are looking for /fp:fast. Whether that gives the most accurate result remains to be seen though.
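For reference, no source changes are needed to reproduce the table above; each result comes from recompiling with a different flag, e.g. (acos_test.cpp is a placeholder name):

cl /fp:fast acos_test.cpp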
I'm in the process of converting a program from Scilab code to C++. One loop in particular is producing a slightly different result than the original Scilab code (it's a long piece of code so I'm not going to include it in the question but I'll try my best to summarise the issue below).
The problem is, each step of the loop uses calculations from the previous step. Additionally, the difference between calculations only becomes apparent around the 100,000th iteration (out of approximately 300,000).
Note: I'm comparing the output of my C++ program with the outputs of Scilab 5.5.2 using the "format(25);" command. Meaning I'm comparing 25 significant digits. I'd also like to point out I understand how precision cannot be guaranteed after a certain number of bits but read the sections below before commenting. So far, all calculations have been identical up to 25 digits between the two languages.
In attempts to get to the bottom of this issue, so far I've tried:
Examining the data type being used:
I've managed to confirm that Scilab is using IEEE 754 doubles (according to the language documentation). Also, according to Wikipedia, C++ isn't required to use IEEE 754 for doubles, but from what I can tell, everywhere I use a double in C++ it has perfectly matched Scilab's results.
Examining the use of transcendental functions:
I've also read from What Every Computer Scientist Should Know About Floating-Point Arithmetic that IEEE does not require transcendental functions to be exactly rounded. With that in mind, I've compared the results of these functions (sin(), cos(), exp()) in both languages and again, the results appear to be the same (up to 25 digits).
The use of other functions and predefined values:
I repeated the above steps for the use of sqrt() and pow(). As well as the value of Pi (I'm using M_PI in C++ and %pi in Scilab). Again, the results were the same.
Lastly, I've rewritten the loop (very carefully) in order to ensure that the code is identical between the two languages.
Note: Interestingly, I noticed that for all the above calculations the results between the two languages match farther than the actual result of the calculations (outside of floating point arithmetic). For example:
Value of sin(x) using Wolfram Alpha = 0.123456789.....
Value of sin(x) using Scilab & C++ = 0.12345yyyyy.....
Even once the value computed using Scilab or C++ started to differ from the actual result (from Wolfram), the two languages' results still matched each other. This leads me to believe that most of the values are being calculated the same way in both languages, even though they're not required to be by IEEE 754.
My original thinking was one of the first three points above are implemented differently between the two languages. But from what I can tell everything seems to produce identical results.
Is it possible that even though all the inputs to these loops are identical, the results can be different? Possibly because a very small error (past what I can see with 25 digits) is occurring that accumulates over time? If so, how can I go about fixing this issue?
No, using the same number format does not guarantee equivalent answers from functions in different languages.
Functions, such as sin(x), can be implemented in different ways, using the same language (as well as different languages). The sin(x) function is an excellent example. Many implementations will use a look-up table or look-up table with interpolation. This has speed advantages. However, some implementations may use a Taylor Series to evaluate the function. Some implementations may use polynomials to come up with a close approximation.
Having the same numeric format is one hurdle to solve between languages. Function implementation is another.
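As a toy illustration (not any real library's algorithm), even a deliberately naive truncated Taylor series diverges from std::sin after a handful of digits:

#include <cstdio>
#include <cmath>

// Naive 4-term Taylor series for sin(x) around 0 -- NOT how any real
// libm works, just a demonstration that two "reasonable"
// implementations can disagree.
double taylor_sin(double x)
{
    double term = x, sum = x;
    for (int n = 1; n < 4; ++n) {
        term *= -x * x / ((2.0 * n) * (2.0 * n + 1.0)); // next odd-power term
        sum += term;
    }
    return sum;
}

int main()
{
    std::printf("%.17g\n%.17g\n", std::sin(0.5), taylor_sin(0.5));
}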
Remember, you need to consider the platform as well. A program that uses an 80-bit floating point processor will have different results than a program that uses a 64-bit floating point software implementation.
Some architectures provide the capability of using extended precision floating point registers (e.g. 80 bits internally, versus 64-bit values in RAM). So, it's possible to get slightly different results for the same calculation, depending on how the computations are structured, and the optimization level used to compile the code.
Yes, it's possible to have different results. It's possible even if you are using exactly the same source code in the same programming language on the same platform. Sometimes it's enough to have a different compiler switch; for example -ffast-math leads the compiler to optimize your code for speed rather than accuracy, and if your computational problem is not well-conditioned to begin with, the result may be significantly different.
For example, suppose you have this code:
x_8th = x*x*x*x*x*x*x*x;
One way to compute this is to perform 7 multiplications. This would be the default behavior for most compilers. However, you may want to speed this up by specifying the compiler option -ffast-math, and the resulting code would perform only 3 multiplications:
temp1 = x*x; temp2 = temp1*temp1; x_8th = temp2*temp2;
The result would be slightly different because finite precision arithmetic is not associative, but sufficiently close for most applications and much faster. However, if your computation is not well-conditioned that small error can quickly get amplified into a large one.
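You can observe this directly; a sketch comparing the two multiplication orders:

#include <cstdio>

int main()
{
    double x = 1.1; // not exactly representable in binary64
    double seven_mults = x * x * x * x * x * x * x * x; // left-to-right
    double t = x * x;
    double three_mults = (t * t) * (t * t);             // repeated squaring

    // Each intermediate product is rounded to double, so the two
    // orderings can differ in the last bits.
    std::printf("%.17g\n%.17g\n", seven_mults, three_mults);
}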
Note that it is possible that the Scilab and C++ are not using the exact same instruction sequence, or that one uses FPU and the other uses SSE, so there may not be a way to get them to be exactly the same.
As commented by IInspectable, if your compiler has _control87() or something similar, you can use it to change the precision and/or rounding settings. You could try combinations of this to see if it has any effect, but again, even if you manage to get the settings identical for Scilab and C++, differences in the actual instruction sequences may be the issue.
http://msdn.microsoft.com/en-us/library/e9b52ceh.aspx
If SSE is used, I'm not sure what can be adjusted, as I don't think SSE has an 80-bit precision mode.
In the case of using FPU in 32 bit mode, and if your compiler doesn't have something like _control87, you could use assembly code. If inline assembly is not allowed, you would need to call an assembly function. This example is from an old test program:
static short fcw; /* 16-bit floating point control word */
/* ... */
/* set precision control to extended precision */
__asm{
    fnstcw fcw
    or     fcw,0300h
    fldcw  fcw
}
I have a large amount of data to process with math intensive operations on each data set. Much of it is analogous to image processing. However, since this data is read directly from a physical device, many of the pixel values can be invalid.
This makes NaN's property of representing invalid values and propagating through arithmetic operations very compelling. However, it also seems to require turning off some optimizations such as gcc's -ffast-math, plus we need to be cross-platform. Our current design uses a simple struct that contains a float value and a bool indicating validity.
While it seems NaN was designed with this use in mind,
others think it is more trouble than it is worth. Does anyone have advice based on their more intimate experience with IEEE754 with performance in mind?
BRIEF: For strictest portability, don't use NaNs. Use a separate valid bit, e.g. a template like Valid<float>. However, if you know that you will only ever run on IEEE 754-2008 machines, and not IEEE 754-1985 (see below), then you may get away with it.
For performance, it is probably faster not to use NaNs on most of the machines that you have access to. However, I have been involved with hardware design of FP on several machines that are improving NaN handling performance, so there is a trend to make NaNs faster, and, in particular, signalling NaNs should soon be faster than Valid<float>.
DETAIL:
Not all floating point formats have NaNs. Not all systems use IEEE floating point. IBM hex floating point can still be found on some machines - actually systems, since IBM now supports IEEE FP on more recent machines.
Furthermore, IEEE floating point itself had compatibility issues with respect to NaNs in IEEE 754-1985. E.g., see Wikipedia http://en.wikipedia.org/wiki/NaN:
The original IEEE 754 standard from 1985 (IEEE 754-1985) only described binary floating point formats, and did not specify how the signaled/quiet state was to be tagged. In practice, the most significant bit of the significand determined whether a NaN is signalling or quiet. Two different implementations, with reversed meanings, resulted.

* most processors (including those of the Intel/AMD x86-32/x86-64 family, the Motorola 68000 family, the AIM PowerPC family, the ARM family, and the Sun SPARC family) set the signaled/quiet bit to non-zero if the NaN is quiet, and to zero if the NaN is signaling. Thus, on these processors, the bit represents an 'is_quiet' flag.

* in NaNs generated by the PA-RISC and MIPS processors, the signaled/quiet bit is zero if the NaN is quiet, and non-zero if the NaN is signaling. Thus, on these processors, the bit represents an 'is_signaling' flag.
Thus, if your code may run on older HP machines, or current MIPS machines (which are ubiquitous in embedded systems), you should not depend on a fixed encoding of NaN, but should have a machine-dependent #ifdef for your special NaNs.
IEEE 754-2008 standardizes NaN encodings, so this is getting better. It depends on your market.
As for performance: many machines essentially trap, or otherwise take a major hiccup in performance, when performing computations involving both SNaNs (which must trap) and QNaNs (which don't need to trap, i.e. which could be fast - and which are getting faster in some machines as we speak.)
I can say with confidence that on older machines, particularly older Intel machines, you did NOT want to use NaNs if you cared about performance. E.g. http://www.cygnus-software.com/papers/x86andinfinity.html says "The Intel Pentium 4 handles infinities, NANs, and denormals very badly. ... If you write code that adds floating point numbers at the rate of one per clock cycle, and then throw infinities at it as input, the performance drops. A lot. A huge amount. ... NANs are even slower. Addition with NANs takes about 930 cycles. ... Denormals are a bit trickier to measure."
Get the picture? Almost 1000x slower to use a NaN than to do a normal floating point operation? In this case it is almost guaranteed that using a template like Valid<float> will be faster.
However, see the reference to "Pentium 4"? That's a really old web page. For years people like me have been saying "QNaNs should be faster", and it has slowly taken hold.
More recently (2009), Microsoft says http://connect.microsoft.com/VisualStudio/feedback/details/498934/big-performance-penalty-for-checking-for-nans-or-infinity "If you do math on arrays of double that contain large numbers of NaN's or Infinities, there is an order of magnitude performance penalty."
If I feel impelled, I may go and run a microbenchmark on some machines. But you should get the picture.
This should be changing, because it is not that hard to make QNaNs fast. But it has always been a chicken-and-egg problem: hardware guys like those I work with say "Nobody uses NaNs, so we won't make them fast", while software guys don't use NaNs because they are slow. Still, the tide is slowly changing.
Heck, if you are using gcc and want best performance, you turn on optimizations like "-ffinite-math-only ... Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs." Similar is true for most compilers.
By the way, you can google like I did, "NaN performance floating point" and check refs out yourself. And/or run your own microbenchmarks.
Finally, I have been assuming that you are using a template like
template <typename T> class Valid {
    ...
    bool valid;
    T value;
    ...
};
I like templates like this, because they can bring "validity tracking" not just to FP, but also to integers (Valid<int>), etc.
But, they can have a big cost. The operations are probably not much more expensive than NaN handling on old machines, but the data density can be really poor. sizeof(Valid<float>) may sometimes be 2*sizeof(float). This bad density may hurt performance much more than the operations involved.
By the way, you should consider template specialization, so that Valid<float> uses NaNs if they are available and fast, and a valid bit otherwise.
#include <cmath>

template <> class Valid<float> {
    float value;
public:
    bool is_valid() const {
        // A NaN compares unequal to everything (even itself), so testing
        // 'value != my_special_NaN' would always be true; use std::isnan
        // (or a bit-pattern compare for a specific NaN payload) instead.
        return !std::isnan(value);
    }
};
etc.
Anyway, you are better off having as few valid bits as possible, and packing them elsewhere, rather than storing a Valid<float> right next to each value. E.g.
struct Point { float x, y, z; };
Valid<Point> pt;
is better (density wise) than
struct Point_with_Valid_Coords { Valid<float> x, y, z; };
unless you are using NaNs - or some other special encoding.
And
struct Point_with_Valid_Coords { float x, y, z; bool valid_x, valid_y, valid_z; };
is in between - but then you have to do all the code yourself.
BTW, I have been assuming you are using C++. If FORTRAN or Java ...
BOTTOM LINE: separate valid bits is probably faster and more portable.
But NaN handling is speeding up, and one day soon will be good enough
By the way, my preference: create a Valid template. Then you can use it for all data types. Specialize it for NaNs if it helps. Although my life's work is making things faster, IMHO it is usually more important to make the code clean.
If invalid data is very common, you are of course wasting a lot of time running this data through the processing. If invalid data is common enough, it is probably better to use some kind of sparse data structure holding only the valid data. If it is not very common, you can of course keep a sparse data structure recording which data is invalid. That way you would not waste a bool for each value. But maybe memory is not a problem for you...
If you are doing operations such as multiplying two possibly invalid data entries, I understand it is compelling to use NaNs instead of checking both variables to see if they are valid and setting the same flag in the result.
How portable do you need to be? Will you ever need to be able to port it to an architecture with only fixed point support? If that is the case, I think your choice is clear.
Personally I would only use NaNs if it proved to be much faster. Otherwise I'd say the code gets more clear if you have explicit handling of invalid data.
Since the floating-point numbers come from a device, they probably have a limited range. You can use some other special number, rather than NaN, to indicate absence of data, e.g. 1e37. This solution is portable. I do not know whether or not it is more convenient for you than using a bool flag.
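A minimal sketch of that sentinel approach (the name and the 1e37 value are made up for illustration, and only sensible if real data can never reach it):

// By assumption the device can never produce 1e37.
constexpr float INVALID_SENTINEL = 1e37f;

inline bool is_valid(float v)
{
    // Unlike a NaN, an ordinary sentinel value compares reliably with ==/!=.
    return v != INVALID_SENTINEL;
}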
I am running a program in C++ on Windows and on Linux.
The output is meant to be identical.
I am trying to make sure that the only differences are real differences, as opposed to working-environment differences.
So far I have taken care of all the differences that can be caused by \r\n line endings,
but there is one thing that I can't seem to figure out:
in the Windows output there is a 0.000, and in Linux it is -0.000.
Does anyone know what could be making the difference?
Thanks
Probably it comes from differences in how the optimizer optimizes some FP calculations (which can be configurable - see e.g. here); in one case you get a value slightly less than 0, in the other slightly more. Both are rounded to 0.000 in the output, but they keep their "real" sign.
Since in the IEEE floating point format the sign bit is separate from the value, you have two different values of 0, a positive and a negative one. In most cases it doesn't make a difference; both zeros will compare equal, and they indeed describe the same mathematical value (mathematically, 0 and -0 are the same). Where the difference can be significant is when you have underflow and need to know whether the underflow occurred from a positive or from a negative value. Also if you divide by 0, the sign of the infinity you get depends on the sign of the 0 (i.e. 1/+0.0 give +Inf, but 1/-0.0 gives -Inf). In other words, most probably it won't make a difference for you.
Note however that the different output does not necessarily mean that the number itself is different. It could well be that the value in Windows is also -0.0, but the output routine on Windows doesn't distinguish between +0.0 and -0.0 (they compare equal, after all).
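Since == cannot tell the two zeros apart, std::signbit (or the division trick above) is the way to check which one you actually have; a sketch:

#include <cstdio>
#include <cmath>

int main()
{
    double pz = +0.0, nz = -0.0;
    std::printf("nz == pz: %d, signbit(nz): %d\n",
                (int)(nz == pz), (int)std::signbit(nz));    // equal, yet signs differ
    std::printf("1/pz = %g, 1/nz = %g\n", 1.0 / pz, 1.0 / nz); // +inf vs -inf
}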
Unless using (unsafe) flags like -ffast-math, the compiler is limited in the assumptions it can make when 'optimizing' IEEE-754 arithmetic. First check that both platforms are using the same rounding.
Also, if possible, check they are using the same floating-point unit. i.e., SSE vs FPU on x86. The latter might be an issue with math library function implementations - especially trigonometric / transcendental functions.