c++ sqrt guaranteed precision, upper/lower bound

I have to check an inequality containing square roots. To avoid incorrect results due to floating point inaccuracy and rounding, I use std::nextafter() to get an upper/lower bound:
#include <cfloat> // DBL_MAX
#include <cmath> // std::nextafter, std::sqrt
double x = 42.0; //just an example number
double y = std::nextafter(std::sqrt(x), DBL_MAX);
a) Is y*y >= x guaranteed using GCC compiler?
b) Will this work for other operations like + - * / or even std::cos() and std::acos()?
c) Are there better ways to get upper/lower bounds?
Update:
I read this is not guaranteed by the C++ Standard, but should work according to IEEE-754. Will this work with the GCC compiler?

In general, floating point operations will result in some ULP error. IEEE 754 requires that results for most operations be correct to within 0.5 ULP, but errors can accumulate, which means a result may not be within one ULP of the exact result. There are limits to precision as well, so depending on the number of digits in the resulting values, you also may not be working with values of the same magnitude. Transcendental functions are also somewhat notorious for introducing error into calculations.
However, if you're using GNU glibc, sqrt will be correct to within 0.5 ULP (rounded), so your specific example would work (neglecting NaN, +/-0, +/-Inf). It's probably better, though, to define some epsilon as your error tolerance and use that as your bound. For example,
bool gt(double a, double b, double eps) {
    return (a > b - eps);
}
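For instance, a hypothetical use with the sqrt example above (the eps value here is purely illustrative, not a recommendation):
double x = 42.0;
double y = std::sqrt(x);
bool ok = gt(y * y, x, 1e-12); // true if y*y >= x within the chosen tolerance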
Depending on the level of precision you need in calculations, you also may want to use long double instead.
So, to answer your questions...
a) Is y*y >= x guaranteed using GCC compiler?
Assuming you use GNU glibc or SSE2 intrinsics, yes.
b) Will this work for other operations like + - * / or even std::cos() and std::acos()?
Assuming you use GNU glibc and one operation, yes. Although some transcendentals are not guaranteed correctly rounded.
c) Are there better ways to get upper/lower bounds?
You need to know what your error tolerance in calculations is, and use that as an epsilon (which may be larger than one ULP).

For GCC this page suggests that it will work if you use the GCC builtin sqrt function __builtin_sqrt.
Additionally, this behavior depends on how you compile your code and on the machine it runs on:
If the processor supports SSE2 then you should compile your code with the flags -mfpmath=sse -msse2 to ensure that all floating point operations are done using the SSE registers.
If the processor doesn't support SSE2 then you should use the long double type for the floating point values and compile with the flag -ffloat-store to force GCC to not use registers to store floating point values (you'll have a performance penalty for doing this).

Concerning
c) Are there better ways to get upper/lower bounds?
Another way is to use a different rounding mode, i.e. FE_UPWARD or FE_DOWNWARD instead of the default FE_TONEAREST. See https://stackoverflow.com/a/6867722. This may be slower, but gives a true upper/lower bound.
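As a minimal sketch (assuming the hardware sqrt honours the dynamic rounding mode and that the compiler does not move the call across fesetround; with GCC you would want -frounding-math):
#include <cfenv>  // std::fesetround, FE_UPWARD, FE_TONEAREST
#include <cmath>  // std::sqrt

double sqrt_upper_bound(double x)
{
    std::fesetround(FE_UPWARD);    // round subsequent results towards +infinity
    double y = std::sqrt(x);       // y >= exact sqrt(x)
    std::fesetround(FE_TONEAREST); // restore the default mode
    return y;
}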

Related

-Ofast produces incorrect code while using long double

#include <cstdio>
int main(void)
{
    int val = 500;
    printf("%d\n", (int)((long double)val / 500));
    printf("%d\n", (int)((long double)500 / 500));
}
Obviously it should output 1 1. But if you compile it with -Ofast, it will output 0 1, why?
And if you change 500 to other values (such as 400) and compile with -Ofast, it will still output 1 1.
Compiler explorer with -Ofast: https://gcc.godbolt.org/z/YkX7fB
It seems this line causes the problem.
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific [...]
-ffast-math
Sets the options -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, -fcx-limited-range and -fexcess-precision=fast.
This option causes the preprocessor macro __FAST_MATH__ to be defined.
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
Conclusion: Don't use -ffast-math unless you are willing to get surprises like the one you've gotten now.
With -Ofast, -ffast-math is enabled, which can cause some operations to be calculated in a different and faster way. In your case, (long double)val / 500 can be calculated as (long double)val * (1.0L / 500). This can be seen in the generated assembly when you compare -O2 and -Ofast for the following function:
long double f(long double a)
{
    return a / 500.0L;
}
The assembly generated with -O2 involves an fdiv instruction, while the assembly generated with -Ofast involves an fmul instruction, see https://gcc.godbolt.org/z/58VHxb.
Next, 1/500, that is, 0.002, is not representable by long double exactly. Therefore, some rounding occurs and, seemingly, in your case, this rounding happens to be down. This can be checked by the following expression:
500.0L * (1.0L / 500.0L) < 1.0L
which is evaluated as true: https://gcc.godbolt.org/z/zMcjxJ. So, the exact stored multiplier is 0.002 - some very small delta.
Finally, the result of the multiplication is 500 * (0.002 - delta) = 1 - some small value. And when this value is converted to int, it's truncated, therefore the result in int is 0.
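A small self-contained check of this reasoning, with the multiplication written out explicitly the way -Ofast rewrites it (on x86 with 80-bit long double the product comes out just below 1):
#include <cstdio>

int main()
{
    long double q = 1.0L / 500.0L;          // 0.002 rounded; here it rounds down
    printf("%d\n", 500.0L * q < 1.0L);      // prints 1 (true), matching the godbolt link
    printf("%d\n", (int)(500.0L * q));      // prints 0: truncation towards zero
    return 0;
}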
Even if the program snippet shown has a 'problem', it is the wrong way to work with floating point numbers anyway.
You are, more or less, asking the program whether a floating point number has an 'exact value' - in this case 1. To be more precise: whether a value that is 'around' 1 is '< 1' or '>= 1' - exactly at the dividing line between the two answers. But as others have already written (or as can easily be found on Wikipedia), floating point numbers have limited precision, so such deviations can and will happen.
So, coming to a conclusion: you should always use rounding when doing floating point to integer conversions, i.e. '(int) round(floating_point_value)'.
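A one-line sketch of that advice (using std::lround from <cmath>; whether rounding is actually what you want depends on the application):
#include <cmath>

int to_int(double v)
{
    return (int)std::lround(v); // round to nearest instead of truncating
}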
PS. Contrary to what others may say or recommend - I don't see any problem with -ffast-math calculations at all. The only 'problem' would be when (bitwise) comparing the results of some program after letting it run on different computers.
I do all my (scientific) calculations with -ffast-math (actually -Ofast). But that has never been a problem so far - since I expect floating point numbers to have some rounding errors (this is true regardless of whether -ffast-math is used or not) - but that's all, as far as I know. Since I typically use 64-bit floating point (double), the calculations are precise to around 15 to 17 decimal digits - and only the last (few) of them are affected by these inaccuracies - still giving me lots of 'accurate' digits - say, more than 13, depending on how complicated my calculations are.

Is a*b < a for 0 <= b < 1 in IEEE floating point with positive a?

I am writing in C++, using conformant IEEE arithmetic in round-to-nearest mode. If a is a positive short int (16 bits) and b is a float (32 bits), where 0 <= b < 1, does a*b < a always evaluate to true?
Maybe. It depends on how the compiler decide to evaluate floating-point expressions (read about FLT_EVAL_METHOD, invented by C99 but now part of the C++ standard, if you want the gory details.)
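If you want to see what your implementation does, FLT_EVAL_METHOD can be inspected directly (a small sketch; 0 means expressions are evaluated in their operand types, 1 in double, 2 in long double, -1 is indeterminate):
#include <cfloat>
#include <cstdio>

int main()
{
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
    return 0;
}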
As soon as a can be greater than 4, the product a*b expressed as a float will round up to a when b is "big enough", for example b = 1-ε/2 (where ε is the difference between 1.0 and the next representable number, 2^-23). But if the compiler does not perform the rounding in the intermediate evaluation, before the comparison, then the product may be kept in some (better) internal precision where a*b is still different from a, and the comparison done at that internal precision will always be true. And this case is not uncommon: because of the design of the x87 coprocessor, keeping all results as 64-bit long double was typical of the 32-bit x86 architecture, for example; a 53-bit double would also keep all values separate, since 24+16 < 53.
Assuming there are no bugs in your compiler, an explicit cast to float should force the rounding, so (float)(a*b) < a should sometimes evaluate to false. Be especially cautious here, since this area is known to show compiler bugs, particularly since floating-point is declared "reserved to experts" and programmers are generally advised not to rely on details like these. You should particularly take care not to activate the optimization options (like /fp:fast) of your compiler, which are very likely to skip the rounding operation to improve performance.
A safer (but still not completely safe) way to perform the test is to explicitly store the result of the multiplication into a float variable, like in
float c = a * b;
if (c < a) call_Houston();
Here again, the C++ standard requires explicit rounding (which is quite logical since the representation of the expression has to be stored into the 32-bit float variable). But here again, some clever compilers, particularly in optimization mode, might notice that the expression is reused just after, take the short path, and reuse the in-register evaluation (which has more precision), ruining your efforts (and leaving Houston unaware). The GCC compiler used to recommend, in such cases, begging the compiler with code like
volatile float c = a * b;
if (c < a) call_Houston();
and resorting to specific options like -ffloat-store. This does not prevent loss of sanity points. BTW, recent versions of GCC are much saner in this respect (since bug 323 was fixed).

Adding double to double in a more fixed way?

I'm using double instead of float in my code, but unfortunately I faced the following problem:
When I try to add:
1.000000000000020206059048177849 + 0.000000000000020206059048177849
i have this result :
1.000000000000040400000000000000
which drops the last 14 digits. I want the result to be more accurate.
I know this might look silly, but this is really important to me. Can anyone help?
Here's a simple code example:
#include <cstdlib> // system
#include <iomanip>
#include <iostream>
using namespace std;
int main()
{
    double a = 1.000000000000020206059048177849 + 0.000000000000020206059048177849;
    cout << fixed << setprecision(30) << a;
    system("pause");
    return 0;
}
Update: The answer below assumes that the expression is evaluated during run-time, i.e. you are not adding compile-time constants. This is not necessarily true, your compiler may evaluate the expression during compile time. It may use higher precision for this. As suggested in the comments, the way you print out the number might be the root cause for your problem.
If you absolutely need more precision and can't make any other tweaks, your only option is to increase precision. double values provide a precision of about 16 decimal digits. You have the following options:
Use a library that provides higher precision by implementing floating point operations in software. This is slow, but you can get as precise as you want to, see e.g. GMP, the GNU Multiple Precision Library.
The other option is to use a long double, which is at least as precise as double. On some platforms, long double may even provide more precision than a double, but in general it does not. On your typical desktop PC it may be 80 bits long (compared to 64 bits), but this is not necessarily true and depends on your platform and your compiler. It is not portable.
Maybe you can avoid the hassle and tune your implementation a bit in order to avoid floating point errors. Can you reorder operations? Your intermediate results are of the form 1+x. Is there a way to compute x instead of 1+x? Subtracting 1 is not an option here, of course, because by then the precision of x has already been lost.
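A minimal sketch of that reordering idea, using the numbers from the question (the small parts are summed on their own, and 1 is only added back at the very end, where its precision loss is unavoidable anyway):
#include <iomanip>
#include <iostream>

int main()
{
    double x = 0.000000000000020206059048177849; // keep only the small part
    double small_sum = x + x;                    // exact: doubling loses no digits
    std::cout << std::setprecision(25) << small_sum << "\n"; // all digits double can hold for 2*x
    std::cout << std::fixed << std::setprecision(30)
              << 1.0 + small_sum << "\n";        // once 1 is added, accuracy is again limited to ~16 significant digits
    return 0;
}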

Differences in rounded result when calling pow()

OK, I know that there were many questions about the pow function and casting its result to int, but I couldn't find an answer to this somewhat specific question.
OK, this is the C code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
    int i = 5;
    int j = 2;
    double d1 = pow(i, j);
    double d2 = pow(5, 2);
    int i1 = (int)d1;
    int i2 = (int)d2;
    int i3 = (int)pow(i, j);
    int i4 = (int)pow(5, 2);
    printf("%d %d %d %d", i1, i2, i3, i4);
    return 0;
}
And this is the output: "25 25 24 25". Notice that only in the third case, where the arguments to pow are not literals, do we get the wrong result, probably caused by rounding errors. The same thing happens without the explicit cast. Could somebody explain what happens in these four cases?
I'm using CodeBlocks on Windows 7, and the MinGW gcc compiler that came with it.
The result of the pow operation is 25.0000 plus or minus some bit of rounding error. If the rounding error is positive or zero, 25 will result from the conversion to an integer. If the rounding error is negative, 24 will result. Both answers are correct.
What is most likely happening internally is that in one case a higher-precision, 80-bit FPU value is being used directly and in the other case, the result is being written from the FPU to memory (as a 64-bit double) and then read back in (converting it to a slightly different 80-bit value). This can make a microscopic difference in the final result, which is all it takes to change a 25.0000000001 into a 24.999999997.
Another possibility is that your compiler recognizes the constants passed to pow and does the calculation itself, substituting the result for the call to pow. Your compiler may use an internal arbitrary-precision math library or it may just use one that's different.
This is caused by a combination of two problems:
The implementation of pow you are using is not high quality. Floating-point arithmetic is necessarily approximate in many cases, but good implementations take care to ensure that simple cases such as pow(5, 2) return exact results. The pow you are using is returning a result that is less than 25 by an amount greater than 0 but less than or equal to 2^-49. For example, it might be returning 25 - 2^-50.
The C implementation you are using sometimes uses a 64-bit floating-point format and sometimes uses an 80-bit floating-point format. As long as the number is kept in the 80-bit format, it retains the complete value that pow returned. If you convert this value to an integer, it produces 24, because the value is less than 25 and conversion to integer truncates; it does not round. When the number is converted to the 64-bit format, it is rounded. Converting between floating-point formats rounds, so the result is rounded to the nearest representable value, 25. After that, conversion to integer produces 25.
The compiler may switch formats whenever it is “convenient” in some sense. For example, there are a limited number of registers with the 80-bit format. When they are full, the compiler may convert some values to the 64-bit format and store them in memory. The compiler may also rearrange expressions or perform parts of them at compile-time instead of run-time, and these can affect the arithmetic performed and the format used.
It is troublesome when a C implementation mixes floating-point formats, because users generally cannot predict or control when the conversions between formats occur. This leads to results that are not easily reproducible and interferes with deriving or controlling numerical properties of software. C implementations can be designed to use a single format throughout and avoid some of these problems, but your C implementation is apparently not so designed.
To add to the other answers here: just generally be very careful when working with floating point values.
I highly recommend reading this paper (even though it is a long read):
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
Skip to section 3 for practical examples, but don't neglect the previous chapters!
I'm fairly sure this can be explained by "intermediate rounding" and the fact that pow is not simply looping j times multiplying by i, but calculating using exp(log(i)*j) as a floating point calculation. Intermediate rounding may well convert 24.999999999996 into 25.000000000 - even merely storing and reloading the value may cause differences in this sort of behaviour, so depending on how the code is generated, it may make a difference to the exact result.
And of course, in some cases, the compiler may even "know" what pow actually achieves, and replace the calculation with a constant result.
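If the mathematical result is known to be an integer, a common workaround (just a sketch, not taken from the answers above) is to round the result instead of truncating it, so that a value like 24.999999999999996 still becomes 25:
#include <math.h>

int int_pow(int base, int exp)
{
    return (int)llround(pow((double)base, (double)exp)); // round to nearest before converting
}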

Banker's rounding with Visual C++? [duplicate]

I'm porting a CUDA code to C++ and using Visual Studio 2010. The CUDA code uses the rint function, which does not seem to be present in the Visual Studio 2010 math.h, so it seems that I need to implement it by myself.
According to this link, the CUDA rint function
rounds x to the nearest integer value in floating-point format, with halfway cases rounded towards zero.
I think I could use a cast to int, which discards the fractional part, effectively rounding towards zero, so I ended up with the following function
inline double rint(double x)
{
    int temp = (x >= 0. ? (int)(x + 0.5) : (int)(x - 0.5));
    return (double)temp;
}
which has two different castings, one to int and one to double.
I have three questions:
Is the above function fully equivalent to CUDA rint for "small" numbers? Will it fail for "large" numbers that cannot be represented as an int?
Is there any more computationally efficient way (rather than using two casts) of defining rint?
Thank you very much in advance.
The cited description of rint() in the CUDA documentation is incorrect. Roundings to integer with floating-point result map the IEEE-754 (2008) specified rounding modes as follows:
trunc() // round towards zero
floor() // round down (towards negative infinity)
ceil() // round up (towards positive infinity)
rint() // round to nearest or even (i.e. ties are rounded to even)
round() // round to nearest, ties away from zero
Generally, these functions work as described in the C99 standard. For rint(), the standard specifies that the function rounds according to the current rounding mode (which defaults to round to nearest or even). Since CUDA does not support dynamic rounding modes, all functions that are defined to use the current rounding mode use the rounding mode "round to nearest or even". Here are some examples showing the difference between round() and rint():
argument    rint()    round()
1.5         2.0       2.0
2.5         2.0       3.0
3.5         4.0       4.0
4.5         4.0       5.0
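For reference, a small host-side snippet (assuming std::rint and std::round from a C++11 <cmath> and the default round-to-nearest-even mode) reproduces the table:
#include <cmath>
#include <cstdio>

int main()
{
    for (double x : {1.5, 2.5, 3.5, 4.5})
        printf("%.1f  rint=%.1f  round=%.1f\n", x, std::rint(x), std::round(x));
    return 0;
}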
round() can be emulated fairly easily along the lines of the code you posted, I am not aware of a simple emulation for rint(). Please note that you would not want to use an intermediate cast to integer, as 'int' supports a narrower numeric range than integers that are exactly representable by a 'double'. Instead use trunc(), ceil(), floor() as appropriate.
Since rint() is part of both the current C and C++ standards, I am a bit surprised that MSVC does not include this function; I would suggest checking MSDN to see whether a substitute is offered. If your platforms are SSE4 capable, you could use the SSE intrinsics _mm_round_sd(), _mm_round_pd() defined in smmintrin.h, with the rounding mode set to _MM_FROUND_TO_NEAREST_INT, to implement the functionality of CUDA's rint().
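For instance (a sketch only, assuming SSE4.1 hardware support):
#include <smmintrin.h> // SSE4.1 intrinsics

double rint_sse41(double x)
{
    __m128d v = _mm_set_sd(x); // load x into the low lane
    // Round the low double to the nearest integer (ties to even), suppressing exceptions
    v = _mm_round_sd(v, v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm_cvtsd_f64(v);
}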
While (in my experience), the SSE intrinsics are portable across Windows, Linux, and Mac OS X, you may want to avoid hardware specific code. In this case, you could try the following code (lightly tested):
double my_rint(double a)
{
    const double two_to_52 = 4.5035996273704960e+15;
    double fa = fabs(a);
    double r = two_to_52 + fa;
    if (fa >= two_to_52) {
        r = a;
    } else {
        r = r - two_to_52;
        r = _copysign(r, a);
    }
    return r;
}
Note that MSVC 2010 seems to lack the standard copysign() function as well, so I had to substitute _copysign(). The above code assumes that the current rounding mode is round-to-nearest-even (which it is by default). By adding 2**52 it makes sure that rounding occurs at the integer unit bit. Note that this also assumes that pure double-precision computation is performed. On platforms that use some higher precision for intermediate results one might need to declare 'fa' and 'r' as volatile.
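A quick sanity check of my_rint (assuming pure double-precision evaluation and the default rounding mode; the expected values follow the rint() column of the table further up):
#include <stdio.h>

int main()
{
    // Expected output with round-to-nearest-even: 2 2 4 4 -2
    printf("%g %g %g %g %g\n",
           my_rint(1.5), my_rint(2.5), my_rint(3.5), my_rint(4.5), my_rint(-2.5));
    return 0;
}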