I'm porting a CUDA code to C++ and using Visual Studio 2010. The CUDA code uses the rint function, which does not seem to be present in the Visual Studio 2010 math.h, so it seems that I need to implement it by myself.
According to this link, the CUDA rint function
rounds x to the nearest integer value in floating-point format, with halfway cases rounded towards zero.
I think I could use a cast to int, which discards the fractional part, effectively rounding towards zero, so I ended up with the following function
inline double rint(double x)
{
    int temp = (x >= 0. ? (int)(x + 0.5) : (int)(x - 0.5));
    return (double)temp;
}
which involves two casts, one to int and one to double.
I have three questions:
Is the above function fully equivalent to CUDA rint for "small" numbers? Will it fail for "large" numbers that cannot be represented as an int?
Is there any more computationally efficient way (rather than using two casts) of defining rint?
Thank you very much in advance.
The cited description of rint() in the CUDA documentation is incorrect. Roundings to integer with floating-point result map the IEEE-754 (2008) specified rounding modes as follows:
trunc() // round towards zero
floor() // round down (towards negative infinity)
ceil() // round up (towards positive infinity)
rint() // round to nearest or even (i.e. ties are rounded to even)
round() // round to nearest, ties away from zero
Generally, these functions work as described in the C99 standard. For rint(), the standard specifies that the function rounds according to the current rounding mode (which defaults to round to nearest or even). Since CUDA does not support dynamic rounding modes, all functions that are defined to use the current rounding mode use the rounding mode "round to nearest or even". Here are some examples showing the difference between round() and rint():
argument rint() round()
1.5 2.0 2.0
2.5 2.0 3.0
3.5 4.0 4.0
4.5 4.0 5.0
round() can be emulated fairly easily along the lines of the code you posted; I am not aware of a similarly simple emulation for rint(). Please note that you would not want to use an intermediate cast to integer, as 'int' supports a narrower numeric range than the integers that are exactly representable by a 'double'. Instead, use trunc(), ceil(), or floor() as appropriate.
Since rint() is part of both the current C and C++ standards, I am a bit surprised that MSVC does not include this function; I would suggest checking MSDN to see whether a substitute is offered. If your platforms are SSE4 capable, you could use the SSE intrinsics _mm_round_sd(), _mm_round_pd() defined in smmintrin.h, with the rounding mode set to _MM_FROUND_TO_NEAREST_INT, to implement the functionality of CUDA's rint().
While the SSE intrinsics are, in my experience, portable across Windows, Linux, and Mac OS X, you may want to avoid hardware-specific code. In this case, you could try the following code (lightly tested):
double my_rint(double a)
{
    const double two_to_52 = 4.5035996273704960e+15;
    double fa = fabs(a);
    double r = two_to_52 + fa;
    if (fa >= two_to_52) {
        r = a;
    } else {
        r = r - two_to_52;
        r = _copysign(r, a);
    }
    return r;
}
Note that MSVC 2010 seems to lack the standard copysign() function as well, so I had to substitute _copysign(). The above code assumes that the current rounding mode is round-to-nearest-even (which it is by default). By adding 2**52 it makes sure that rounding occurs at the integer unit bit. Note that this also assumes that pure double-precision computation is performed. On platforms that use some higher precision for intermediate results one might need to declare 'fa' and 'r' as volatile.
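As a quick sanity check of the 2**52 trick, here is a self-contained version of the function above (using the standard copysign() rather than MSVC's _copysign()), which on round-to-nearest-even reproduces the tie cases from the earlier table:

```c
#include <math.h>

/* Magic-number rounding: assumes round-to-nearest-even is in
   effect and that pure double-precision arithmetic is used. */
double my_rint(double a)
{
    const double two_to_52 = 4.5035996273704960e+15; /* 2^52 */
    double fa = fabs(a);
    double r = two_to_52 + fa;  /* rounding happens at the unit bit */
    if (fa >= two_to_52) {
        r = a;                  /* |a| >= 2^52: already an integer */
    } else {
        r = r - two_to_52;
        r = copysign(r, a);     /* restore the sign; also handles -0.0 */
    }
    return r;
}
```

With ties rounded to even, both my_rint(1.5) and my_rint(2.5) yield 2.0, matching CUDA's rint().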
Related
I have some cross platform code I'm working with. On the Mac it's compiled with Clang, on Windows it's compiled with Visual C++.
There is a calculation that can be sensitive, and there was a difference between Mac and Windows that was triggering asserts. It ends up there is a difference between acos results, but I'm not clear why.
On both platforms, the input to acos is exactly -1.0f. In Visual C++, acos(-1.0f) is 3.14159274. That's the value of pi as a float, which is what I'd expect.
But on macOS:
float value = acos(-1.0f);
...evaluates to 3.1415925. That's just enough of an accuracy difference to trigger issues in the code. acos is an operation that can be imprecise with float - I understand that. And different compilers can have different implementations of acos. I'm just unclear why Clang seems to have trouble with such a simple acos result when Visual C++ doesn't. A float is capable of representing 3.14159274, but that's not the result I'm getting.
It is possible to get an accurate/Visual C++ aligned value out of Xcode's version of Clang with:
float value = (float)acos((double)-1.0f);
So I can fix the issue by moving to higher accuracy, and then down casting the value back to float to preserve the same rounding as Windows. I'm just looking for a justification as to why the extra precision is necessary when the VC++ compiler doesn't seem to have a precision issue. It could be differences between the Clang/Xcode/VC++ math libraries as well. I just assumed that acos(-1.0) might be more settled across the compilers. I couldn't find any difference in round modes (even though rounding should not be necessary) and fresh projects in Xcode and Visual Studio show the same difference. Both machines are Intel.
If you look at the binary representation of these floating point values you can see that the mac/clang's value A is the next lowest floating-point number than windows/msvc's value B
A 3.14159250 0x40490FDA
B 3.14159274 0x40490FDB
Whilst B is closest to the true value of π, it is actually greater than π as #njuffa points out in their comment.
Reading the specification, it looks like acosf is supposed to return a value in the closed range [0,π]. Technically A meets this criteria whilst B doesn't.
In summary -
A is the closest value to, but less than, π
B is the closest value to π
The difference in these may be as a result of a deliberate decision of the respective standard library implementors.
I'd also observe that both values invert back under cosf, as both cosf(A) and cosf(B) equal -1.0f.
Generally speaking, though, it is unwise to rely on exact bit-level accuracy with any floating point calculations. If you are not already aware of it, the document What Every Computer Scientist Should Know About Floating-Point Arithmetic explains why.
Edit: I was curious and found Apple source code here that might be relevant.
Return value:
...
Otherwise:
...
Returns a value in [0, pi] (C 7.12.4.1 3). Note that
this prohibits returning a correctly rounded value for acosf(-1),
since pi rounded to a float lies outside that interval.
In a specific online judge running 32-bit GCC 7.3.0, this:
#include <iostream>
volatile float three = 3.0f, seven = 7.0f;
int main()
{
    float x = three / seven;
    std::cout << x << '\n';
    float y = three / seven;
    std::cout << (x == y) << '\n';
}
Outputs
0.428571
0
To me this seems like it violates IEEE 754, since the standard requires basic operations to be correctly rounded. Now I know there are a couple reasons for IEEE 754 floating-point calculations to be non-deterministic as discussed here, but I don't see how any of them applies to this example. Here are some of the things I considered:
Excess precision and contraction: I'm doing a single calculation and assigning the result to a float, which should force both of the values to be rounded to float precision.
Compile-time calculations: three and seven are volatile so both calculations must be done at runtime.
Floating-point flags: The calculations are done in the same thread almost immediately after each other, so the flags should be the same.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Also, removing the statement printing x, adding a statement to print y, or making y volatile all changes the result. This seems to contradict my understanding of the C++ standard which I think requires the assignments to round off any excess precision.
Thanks to geza for pointing out that this is a known issue. I would still like a definitive answer on whether this conforms to the C++ standard and IEEE 754 though, since the C++ standard appears to require assignments to round off excess precision. Here's the quote from draft N4860 [expr.pre]:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.50
50) The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8 and 7.6.19.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Yes, with minor caveats.
One, C++ cannot just “conform” to IEEE 754. There has to be some specification of how things in C++ bind (connect) to IEEE 754, such as statements that the float format is IEEE-754 binary32, that x / y uses IEEE-754 division, and so on. C++ 2017 draft N4659 refers to LIA-1, but I do not see that it clearly requires LIA-1 be used even if std::numeric_limits<float>::is_iec559 reports true, and LIA-1 apparently only suggests language bindings.
The C++ standard tells us the fact that std::numeric_limits<float>::is_iec559 reports true means the float type conforms to ISO/IEC/IEEE 60559, which is effectively IEEE 754-2008. But, in addition to the binding problem, I do not see a statement in the C++ standard that nullifies 8 [expr] 13 (“The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.”) when is_iec559 is true. Although it is true that the cast and conversion operators must “perform their specific conversions” (footnote 64), and this forces float y = three / seven; to produce the correct IEEE-754 binary32 results even if binary64 or Intel’s 80-bit floating-point are used for the division, it might not force it to produce the correct result if only a little excess precision is used. (If at least 48 bits of precision are used, no double-rounding errors occur for division when rounded to the 24-bits of the binary32 format. If fewer excess bits are used, there may be some cases that experience double rounding errors.)
I believe the intent of is_iec559 is to indicate a sensible binding, and the behavior shown in the question does violate this. In particular, the defect shown in the question is caused by failing to round the excess precision used in the division to the actual float type; it is not caused by the hypothetical use of less-than-enough excess precision mentioned above.
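As an illustration of that defect, the usual workaround on targets with excess precision is to force each result through a float object in memory (a sketch; the variable names are ours, and on SSE2 targets the comparison holds trivially because no excess precision is used):

```c
volatile float three_v = 3.0f, seven_v = 7.0f;

/* Storing each quotient through a volatile float object forces any
   excess precision to be discarded at the store, so two identical
   divisions must then compare equal when reloaded. */
int same_after_store(void)
{
    volatile float x = three_v / seven_v;  /* store rounds to binary32 */
    volatile float y = three_v / seven_v;
    return x == y;
}
```

This is the manual equivalent of GCC's -ffloat-store behavior for these two values.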
I have to check an inequality containing square roots. To avoid incorrect results due to floating point inaccuracy and rounding, I use std::nextafter() to get an upper/lower bound:
#include <cfloat> // DBL_MAX
#include <cmath> // std::nextafter, std::sqrt
double x = 42.0; //just an example number
double y = std::nextafter(std::sqrt(x), DBL_MAX);
a) Is y*y >= x guaranteed using GCC compiler?
b) Will this work for other operations like + - * / or even std::cos() and std::acos()?
c) Are there better ways to get upper/lower bounds?
Update:
I read this is not guaranteed by the C++ Standard, but should work according to IEEE-754. Will this work with the GCC compiler?
In general, floating point operations will result in some ULP error. IEEE 754 requires that results for most operations be correct to within 0.5 ULP, but errors can accumulate, which means a result may not be within one ULP of the exact result. There are limits to precision as well, so depending on the number of digits in the resulting values, you also may not be working with values of the same magnitudes. Transcendental functions are also somewhat notorious for introducing error into calculations.
However, if you're using GNU glibc, sqrt will be correct to within 0.5 ULP (rounded), so your specific example would work (neglecting NaN, +/-0, +/-Inf). Although, it's probably better to define some epsilon as your error tolerance and use that as your bound. For example,
bool gt(double a, double b, double eps) {
    return (a > b - eps);
}
Depending on the level of precision you need in calculations, you also may want to use long double instead.
So, to answer your questions...
a) Is y*y >= x guaranteed using GCC compiler?
Assuming you use GNU glibc or SSE2 intrinsics, yes.
b) Will this work for other operations like + - * / or even std::cos() and std::acos()?
Assuming you use GNU glibc and one operation, yes. Although some transcendentals are not guaranteed correctly rounded.
c) Are there better ways to get upper/lower bounds?
You need to know what your error tolerance in calculations is, and use that as an epsilon (which may be larger than one ULP).
For GCC this page suggests that it will work if you use the GCC builtin sqrt function __builtin_sqrt.
Additionally, this behavior depends on how you compile your code and the machine it runs on:
If the processor supports SSE2 then you should compile your code with the flags -mfpmath=sse -msse2 to ensure that all floating point operations are done using the SSE registers.
If the processor doesn't support SSE2 then you should use the long double type for the floating point values and compile with the flag -ffloat-store to force GCC to not use registers to store floating point values (you'll have a performance penalty for doing this)
Concerning
c) Are there better ways to get upper/lower bounds?
Another way is to use a different rounding mode, i.e. FE_UPWARD or FE_DOWNWARD instead of the default FE_TONEAREST. See https://stackoverflow.com/a/6867722. This may be slower, but gives a tighter upper/lower bound.
I'm a bit surprised with MSVC ldexp behavior (it happens in Visual Studio 2013, but also with all older versions at least down to 2003...).
For example:
#include <math.h>
#include <stdio.h>
int main()
{
    double g = ldexp(2.75, -1074);
    double e = ldexp(3.0, -1074);
    printf("g=%g e=%g \n", g, e);
    return 0;
}
prints
g=9.88131e-324 e=1.4822e-323
The first one g is strangely rounded...
It is 2.75 * fmin_denormalized, so I definitely expect the second result e.
If I evaluate 2.75*ldexp(1.0,-1074) I correctly get same value as e.
Are my expectations too high, or does Microsoft fail to comply with some standard?
While the question does not explicitly state this, I assume that the output expected by the asker is:
g=1.4822e-323 e=1.4822e-323
This is what we would expect from a C/C++ compiler that promises strict adherence to IEEE-754. The question is tagged both C and C++, I will address C99 here as that is the standard I have in hand.
In Annex F, which describes IEC 60559 floating-point arithmetic (where IEC 60559 is basically another name for IEEE-754) the C99 standard specifies:
An implementation that defines __STDC_IEC_559__ shall conform to the
specifications in this annex. [...] The scalbn and scalbln
functions in <math.h> provide the scalb function recommended in the
Appendix to IEC 60559.
Further down in that annex, section F.9.3.6 specifies:
On a binary system, ldexp(x, exp) is equivalent to scalbn(x, exp).
The appendix referenced by the C99 standard is the appendix of the 1985 version of IEEE-754, where we find the scalb function defined as follows:
Scalb(y, N) returns y × 2^N for integral values N without computing 2^N.
scalb is defined as a multiplication with a power of two, and multiplications must be rounded correctly based on the current rounding mode according to the standard. Therefore, with a conforming C99 compiler ldexp() must return a correctly rounded result if the compiler defines __STDC_IEC_559__. In the absence of a library call setting the rounding mode, the default rounding mode "round to nearest or even" is in effect.
I do not have access to MSVC 2013, so I do not know whether it defines that symbol or not. This could even depend on a compiler flag setting, such as /fp:strict.
After tracking down my copy of the C++11 standard, I cannot find any reference to __STDC_IEC_559__ or any language about IEEE-754 bindings. According to the answer to this question this is because that support is included by referring to the C99 standard.
This happens because during the ldexp calculation the 2.75 gets truncated to 2, which happens because at that small of a denormalized number the bits that represent the '.75' part get shifted off the end of the representable number and disappear. Whether this is a bug or designed behavior can be debated.
When calculating 2.75*ldexp(1.0,-1074) normal rounding happens, and the 2.75 becomes 3.
EDIT: ldexp should round correctly, and this is a bug.
The OP's results do not fail to comply with the C spec, as that spec does not define the precision of calculations.
The OP's result may have failed IEEE 754, but that depends on the rounding mode in use at the time, which was not posted. Yet the OP reports that 2.75*ldexp(1.0,-1074) worked as expected, implying that the expected rounding mode was in effect at that time.
Using printf("%la",x) aids in seeing clearly what is happening near the limits of double.
I would expect g to "round to nearest, ties to even" with the result of 0x1.8p-1073 - which did occur with gcc on Windows.
Ideally g would have the value of 0x1.6p-1073
0x0.0p-1073 Zero
0x0.8p-1073 next higher double DBL_TRUE_MIN
0x1.0p-1073 next higher double
0x1.6p-1073 ideal `g` answer, but not available as a double
0x1.8p-1073 next higher double
To be fair, it could be a processor bug - it has happened before.
Ref
double g=ldexp(2.75,-1074);
printf("%la\n%la\n", 2.75,ldexp(2.75,-1074));
printf("%la\n%la\n", 3.0 ,ldexp(3.0 ,-1074));
double e=ldexp(3.0,-1074);
printf("%la\n%la\n", g,e);
printf("%la\n%la\n", 9.88131e-324, DBL_TRUE_MIN);
printf("g=%g e=%g \n",g,e);
0x1.6p+1
0x1.8p-1073
0x1.8p+1
0x1.8p-1073
0x1.8p-1073
0x1.8p-1073
0x1p-1073
0x1p-1074
g=1.4822e-323 e=1.4822e-323
Assume that we are working on a platform where the type long double has a strictly greater precision than 64 bits. What is the fastest way to convert a given long double to an ordinary double-precision number with some prescribed rounding (upward, downward, round-to-nearest, etc.)?
To make the question more precise, let me give a concrete example: Let N be a given long double, and M be the double whose value minimizes the (real) value M - N such that M > N. This M would be the upward rounded conversion I want to find.
Can I get away with setting the rounding mode of the FP environment appropriately and performing a simple cast (e.g. (double) N)?
Clarification: You can assume that the platform supports the IEEE Standard for Floating-Point Arithmetic (IEEE 754).
Can I get away with setting the rounding mode of the FP environment appropriately and performing a simple cast (e.g. (double) N)?
Yes, as long as the compiler implements IEEE 754 (most of them do, at least roughly). Conversion from one floating-point format to another is one of the operations to which, according to IEEE 754, the rounding mode applies. To convert from long double to double rounding up, set the rounding mode to upward and perform the conversion.
In C99, which should be accepted by C++ compilers (I'm not sure a syntax is specified for C++):
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
…
fesetround(FE_UPWARD);
double d = (double) ld;
PS: you may discover that your compiler does not implement #pragma STDC FENV_ACCESS ON properly. Welcome to the club.