On a float rounding error - c++

I do not understand the output of the following program:
#include <iostream>
int main()
{
    float x = 14.567729f;
    float sqr = x * x;
    float diff1 = sqr - x * x;
    double diff2 = double(sqr) - double(x) * double(x);
    std::cout << diff1 << std::endl;
    std::cout << diff2 << std::endl;
    return 0;
}
Output:
6.63225e-006
6.63225e-006
I use VS2010 with the x86 compiler.
I expected a different output:
0
6.63225e-006
Why is diff1 not equal to 0?
To calculate sqr - x * x, the compiler increases float precision to double. Why?

The x87 floating-point registers used by 32-bit x86 code are 80 bits wide.
Within an expression, intermediate results can be kept at that higher register precision. A value is only rounded to 32 bits (float) or 64 bits (double) when it is stored to a location in memory. If everything stays in registers (try compiling with -O3), you may see a different result.
Compiled with -O3:
> ./a.out
0
6.63225e-06

float diff1 = sqr - x * x;
double diff2 = double(sqr) - double(x) * double(x);
Why is diff1 not equal to 0?
Because you have already cached sqr = x*x and forced its representation to be a float.
To calculate sqr - x * x, the compiler increases float precision to double. Why?
Because that is how C did things back before there was a C standard. I don't think modern compilers are bound to that convention, but many still do follow it. If this is the case, the right-hand sides of the calculations of diff1 and diff2 will be identical. The only difference is that after calculating the right-hand side of float diff1 = ..., the double result is converted back to a float.

Apparently the standard allows floats to be automatically promoted to double in expressions like that. See here
Do a find on that page for "automatically promoted" and check out the first paragraph with that phrase in it.
If we go by that paragraph, as I understand it, your sqr=x*x is initially being treated as if it were a double as well, but once it is stored it is being rounded to a float. Then, in your diff1=sqr-x*x, x*x is again being treated like a double, and so is sqr although it's already rounded. Therefore, it yields the same result as casting them all to doubles: sqr is a double then but already rounded to float precision, and again x*x is double precision.

On x86/x64 architectures it is common for compilers to promote all 32-bit floats to 64-bit doubles for computations; check the output assembly to see if the two variants produce the same instructions. The only difference between the types is the storage.

Related

Simpson's Composite Rule giving too large values for when n is very large

I am using Simpson's Composite Rule to calculate the integral from 2 to 1,000 of 1/ln(x). However, when using a large n (usually around 500,000), I start to get results that differ from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. I am wondering why this is, since as n gets larger, the accuracy is supposed to increase and not decrease.
#include <iostream>
#include <cmath>
// the function f(x)
float f(float x)
{
return (float) 1 / std::log(x);
}
float my_simpson(float a, float b, long int n)
{
if (n % 2 == 1) n += 1; // since n has to be even
float area, h = (b-a)/n;
float x, y, z;
for (int i = 1; i <= n/2; i++)
{
x = a + (2*i - 2)*h;
y = a + (2*i - 1)*h;
z = a + 2*i*h;
area += f(x) + 4*f(y) + f(z);
}
return area*h/3;
}
int main()
{
std::cout.precision(20);
int upperBound = 1'000;
int subsplits = 1'000'000;
float approx = my_simpson(2, upperBound, subsplits);
std::cout << "Output: " << approx << std::endl;
return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real number (in the mathematical sense), a float has limited precision.
A typical IEEE 754 32-bit (single precision) floating-point number dedicates only 24 bits (one of which is implicit) to the mantissa, which translates to fewer than 8 significant decimal digits (please take this as a gross simplification).
A double, on the other hand, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations these days.
since as n gets larger, the accuracy is supposed to increase and not decrease.
Unfortunately, that's not how it works. There's a sweet spot, but after that the accumulation of rounding errors prevails and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, because area becomes much greater than f(x) + 4*f(y) + f(z) (e.g. 224678.937 vs. 0.3606823). The bigger n is, the sooner this becomes relevant, making the result diverge from the real one.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).

Can anyone explain the maths of this code to me?

Consider
x_min as -12.5,
x_max as 12.5,
bits as 8,
x as any value between -12.5 and +12.5.
Can someone explain the maths of this snippet to me?
int float_to_uint(float x, float x_min, float x_max, unsigned int bits)
{
float span = x_max - x_min;
return (int) ((x- x_min)*((float)((1<<bits)/span)));
}
If we ignore rounding, types and other little details, you could rearrange the separate parts a bit:
(x-x_min) / (x_max-x_min) * (1<<bits)
This is basically scaling x to values of 0..2^bits (=256) depending on where x is within x_min..x_max.
x     | result
------+-------
-12.5 | 0
...   |
0     | 128
...   |
12.5  | 256
The goal of the function is to map values in the range x_min to x_max to values 0 to 2^bits.
(int) ((x- x_min) / span * (1<<bits));
But there is some trickery being used here to help the optimizer. The last two values are rearranged and computed first. Mathematically it's the same, but with floats it will round differently. The difference is so minor that there is a compiler flag (-ffast-math) allowing the compiler to ignore it.
(int) ((x- x_min) * ((1<<bits) / span));
The cast to float is pointless, as arithmetic promotion already turns 1<<bits into a float and float / float remains float.
Now you might ask: What is the point of this transformation? The result is (about) the same.
Here is my thought on that: In the source, bits, x_min and x_max will be literals or constants, so span is known at compile time too. The transformation allows the compiler to inline the function and compute (1<<bits) / span at compile time. That leaves only one float subtraction and one multiplication at runtime. It will therefore generate code that runs noticeably faster on something like an Arduino that has no FPU.

OpenCL kernel float division gives different result

I have an OpenCL kernel for some computation. I found that only one thread gives a result different from the CPU code. I am using vs2010 x64 release mode.
Testing the OpenCL code with some examples, I found some interesting results. Here are the test cases in the kernel code.
I tested 3 cases in the OpenCL kernel; the precision is checked with printf("%.10f", fval);
case 1:
float fval = (10296184.0) / (float)(x*y*z); // which gives result fval = 3351.6225585938
float fval = (10296184.0f) / (float)(x*y*z); // which gives result fval = 3351.6225585938
The variables are int x, y, z.
Their values are computed by some operations; they are x=12, y=16, z=16.
case 2:
float fval = (10296184.0) / (float)(12*16*16); // which gives result fval = 3351.6223144531
float fval = (10296184.0f) / (float)(12*16*16); // which gives result fval = 3351.6223144531
case 3:
However, when I compute the difference between the two expressions above, the result is 0 when using 10296184.0.
float fval = (10296184.0) / (float)(x*y*z) - (10296184.0) / (float)(12*16*16); // which gives result fval = 0.0000000000
float fval = (10296184.0f) / (float)(x*y*z) - (10296184.0f) / (float)(12*16*16); // which gives result fval = 0.0001812663
Could anyone explain the reason or give me some hints?
Some observations:
The two float values differ by 1 ULP. So the results differ by a minimum amount.
// Float ULP in the 2's place here
// v
0x1.a2f3ea0000000p+11 3351.622314... // OP's lower float value
0x1.a2f3eaaaaaaabp+11 3351.622395... // higher precision quotient
0x1.a2f3ec0000000p+11 3351.622558... // OP's higher float value
(10296184.0) / (float)(12*16*16) is calculated at compile time and is the result closer to the expected mathematical answer.
float fval = (10296184.0) / (float)(x*y*z) is calculated at run time.
Considering that float variables are being used, it is surprising that the code does this division with double math. This is a double constant divided by a double (the promotion of the float product), resulting in a double quotient, which is converted to a float and then saved. I'd expect 10296184.0f (note the f) to have been used; then the math could all have been done as floats.
C allows different rounding modes, denoted by FLT_ROUNDS. The mode may differ between compile time and run time, which may explain the difference. Knowing the result of fegetround() (the function that gets the current rounding direction) would help.
OP may have employed various compiler optimizations that sacrifice precision for speed.
C does not specify the precision of math operations, yet results good to the last ULP should be expected from * / + - sqrt() modf() on quality platforms. I suspect the code suffers from a weak math implementation.

Is there any "standard" way to calculate the numerical gradient?

I am trying to calculate the numerical gradient of a smooth function in C++. The parameter values could vary from zero to a very large number (maybe 1e10 to 1e20?).
I used the function f(x,y) = 10*x^3 + y^3 as a testbench, but I found that if x or y is too large, I can't get the correct gradient.
Here is my code to calculate the gradient:
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y)
{
// black box expensive function
return 10 * pow(x, 3) + pow(y, 3);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+19;
double y = -5.49756e+14;
const double epsi = 1e-4;
double f1 = f(x, y);
double f2 = f(x, y+epsi);
double f3 = f(x, y-epsi);
cout << f1 << endl;
cout << f2 << endl;
cout << f3 << endl;
cout << f1 - f2 << endl; // 0
cout << f2 - f3 << endl; // 0
return 0;
}
If I use the above code to calculate the gradient, the gradient would be zero!
The testbench function, 10*x^3 + y^3, is just a demo, the real problem I need to solve is actually a black box function.
So, is there any "standard" way to calculate the numerical gradient?
In the first place, you should use the central difference scheme, which is more accurate (one more term of the Taylor expansion cancels):
(f(x + h) - f(x - h)) / (2h)
rather than
(f(x + h) - f(x)) / h
Then the choice of h is critical, and using a fixed constant is the worst thing you can do: for small x, h will be too large for the approximation formula to hold, and for large x, h will be too small, so that x + h rounds to (nearly) x and the difference is swamped by rounding error.
A much better choice is to take a relative value, h = x√ε, where ε is the machine epsilon (1 ulp at 1.0), which gives a good tradeoff:
(f(x(1 + √ε)) - f(x(1 - √ε))) / (2x√ε)
Beware that when x = 0, a relative value cannot work and you need to fall back to a constant. But then, nothing tells you which one to use!
You need to consider the precision needed.
At first glance, since |y| = 5.49756e14 and epsi = 1e-4, you need at least ⌈log2(5.49756e14)-log2(1e-4)⌉ = 63 bits of significand precision (that is the number of bits used to encode the digits of your number, also known as mantissa) for y and y+epsi to be considered different.
The double-precision floating-point format only has 53 bits of significand precision (assuming it is 8 bytes). So, currently, f1, f2 and f3 are exactly the same because y, y+epsi and y-epsi are equal.
Now, let's consider the limit : y = 1e20, and the result of your function, 10x^3 + y^3. Let's ignore x for now, so let's take f = y^3. Now we can calculate the precision needed for f(y) and f(y+epsi) to be different : f(y) = 1e60 and f(epsi) = 1e-12. This gives a minimum significand precision of ⌈log2(1e60)-log2(1e-12)⌉ = 240 bits.
Even if you were to use the long double type, assuming it is 16 bytes, your results would not differ : f1, f2 and f3 would still be equal, even though y and y+epsi would not.
If we take x into account, the maximum value of f would be 11e60 (with x = y = 1e20). So the upper limit on precision is ⌈log2(11e60)-log2(1e-12)⌉ = 243 bits, or at least 31 bytes.
One way to solve your problem is to use another type, maybe a bignum used as fixed-point.
Another way is to rethink your problem and deal with it differently. Ultimately, what you want is f1 - f2. You can try to decompose f(y+epsi). Again, if you ignore x, f(y+epsi) = (y+epsi)^3 = y^3 + 3*y^2*epsi + 3*y*epsi^2 + epsi^3. So f(y+epsi) - f(y) = 3*y^2*epsi + 3*y*epsi^2 + epsi^3.
The only way to calculate gradient is calculus.
Gradient is a vector:
g(x, y) = Df/Dx i + Df/Dy j
where (i, j) are unit vectors in x and y directions, respectively.
One way to approximate derivatives is first order differences:
Df/Dx ~ (f(x2, y)-f(x1, y))/(x2-x1)
and
Df/Dy ~ (f(x, y2)-f(x, y1))/(y2-y1)
That doesn't look like what you're doing.
You have a closed form expression:
g(x, y) = 30*x^2 i + 3*y^2 j
You can plug in values for (x, y) and calculate the gradient exactly at any point. Compare that to your differences and see how well your approximation is doing.
How you implement it numerically is your responsibility. (10^19)^3 = 10^57, right?
What is the size of double on your machine? Is it a 64 bit IEEE double precision floating point number?
Use
dx = (1+abs(x))*eps, dfdx = (f(x+dx,y) - f(x,y)) / dx
dy = (1+abs(y))*eps, dfdy = (f(x,y+dy) - f(x,y)) / dy
to get meaningful step sizes for large arguments.
Use eps = 1e-8 for one-sided difference formulas, eps = 1e-5 for central difference quotients.
Explore automatic differentiation (see autodiff.org) for derivatives without difference quotients and thus much smaller numerical errors.
We can examine the behaviour of the error in the derivative using the following program - it calculates the 1-sided derivative and the central difference based derivative using a varying step size. Here I'm using x and y ~ 10^10, which is smaller than what you were using, but should illustrate the same point.
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y) {
return 10 * pow(x, 3) + pow(y, 3);
}
double f_x(double x, double y) {
return 3 * 10 * pow(x,2);
}
double f_y(double x, double y) {
return 3 * pow(y,2);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+10;
double y = -5.49756e+10;
//double x = 10.1;
//double y = -5.2;
double epsi = 1e8;
for(int i=0; i<60; ++i) {
double dfx_n = (f(x+epsi,y) - f(x,y))/epsi;
double dfx_cd = (f(x+epsi,y) - f(x-epsi,y))/(2*epsi);
double dfx = f_x(x,y);
cout<<epsi<<" "<<fabs(dfx-dfx_n)<<" "<<fabs(dfx - dfx_cd)<<std::endl;
epsi/=1.5;
}
return 0;
}
The output shows that a 1-sided difference gets us an optimal error of about 1.37034e+13 at a step length of about 100.0. Note that while this error looks large, as a relative error it is 3.5746632302764072e-09 (since the exact value is 3.833e+21).
In comparison, the 2-sided difference gets an optimal error of about 1.89493e+10 with a step size of about 45109.3. This is three orders of magnitude better (with a much larger step size).
How can we work out the step size? The link in the comments of Yves Daoust's answer gives us a ballpark value:
h=x_c sqrt(eps) for 1-Sided, and h=x_c cbrt(eps) for 2-Sided.
But either way, if the required step size for decent accuracy at x ~ 10^10 is 100.0, the required step size with x ~ 10^20 is going to be 10^10 larger too. So the problem is simply that your step size is way too small.
This can be verified by increasing the starting step size in the above code and resetting the x/y values to the original values.
Then the expected derivative is O(1e39), the best 1-sided error of about O(1e31) occurs near a step length of 5.9e10, and the best 2-sided error of about O(1e29) occurs near a step length of 6.1e13.
As numerical differentiation is ill-conditioned (which means a small error could alter your result significantly), you should consider using Cauchy's integral formula. This way you can calculate the n-th derivative with an integral. This will lead to fewer problems with accuracy and stability.

Division by zero prevention: Checking the divisor's expression doesn't result in zero vs. checking the divisor isn't zero?

Is division by zero possible in the following case due to the floating point error in the subtraction?
float x, y, z;
...
if (y != 1.0)
z = x / (y - 1.0);
In other words, is the following any safer?
float divisor = y - 1.0;
if (divisor != 0.0)
z = x / divisor;
Assuming IEEE-754 floating-point, they are equivalent.
It is a basic theorem of FP arithmetic that for finite x and y, x - y == 0 if and only if x == y, assuming gradual underflow.
If subnormal results are flushed to zero (instead of gradual underflow), this theorem holds only if the result x - y is normal. Because 1.0 is well scaled, y - 1.0 is never subnormal, and so y - 1.0 is zero if and only if y is exactly 1.0, regardless of how underflow is handled.
C++ doesn't guarantee IEEE-754, of course, but the theorem is true for most "reasonable" floating-point systems.
This will prevent you from dividing by exactly zero; however, that does not mean you won't still end up with +/-inf as a result. The denominator could still be small enough that the answer is not representable as a double, and you will end up with an inf. For example:
#include <iostream>
#include <limits>
int main(int argc, char const *argv[])
{
double small = std::numeric_limits<double>::epsilon();
double large = std::numeric_limits<double>::max() / small;
std::cout << "small: " << small << std::endl;
std::cout << "large: " << large << std::endl;
return 0;
}
In this program small is non-zero, but it is so small that large exceeds the range of double and is inf.
There is no difference between the two code snippets. In fact, the optimizer could even compile both fragments to the same binary code, assuming that there are no further uses of the divisor variable.
Note, however, that under IEEE-754 division by a floating point zero 0.0 does not result in a run-time error, but produces an inf or -inf instead (or NaN for 0.0 / 0.0).