Returning 'inf' when finite result is expected - c++

The outputs are not what Python/Mathematica previously calculated when calling test_func(3000, 10). Instead, my code returns inf, and I am not sure why.
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <chrono>
#include <cmath>
using namespace std;
//I0
double I0 (double y0, double y)
{
    return log( (exp(y0+y)-1)/(exp(y0-y)-1) ) - y;
}
//test_func
double test_func (double k, double M)
{
    double k0 = sqrt(pow(k, 2.0)+pow(M, 2.0));
    cout << "k0: " << k0 << endl;
    double Izero = I0(k0, k);
    cout << "I0: " << Izero << endl;
    double k3 = pow(k, 3.0);
    cout << "k3: " << k3 << endl;
    return Izero/k3;
}
int main ()
{
    cout << test_func(3000, 10) << endl;
    return 0;
}
The output I get is
k0: 3000.02
I0: inf
k3: 2.7e+10
inf
but I0 should be 3004.1026690762033, while the result of the function should be test_func(3000, 10)=1.1126306181763716e-07. I am puzzled. Do you know what is wrong with it? I am a C++ beginner, so any help is very welcome.

double has a limited range. On most modern implementations, doubles are in the standard IEEE floating point binary64 format, with range up to about 1.8e308. Anything bigger (such as your exp(6000)) rounds up to infinity.
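A quick way to see the limit (assuming IEEE binary64 doubles, as above):
#include <cfloat>
#include <cmath>
#include <iostream>
int main()
{
    std::cout << DBL_MAX << std::endl;            // about 1.8e308, the largest finite double
    std::cout << std::log(DBL_MAX) << std::endl;  // about 709.78, the largest argument exp can take
    std::cout << std::exp(6000.0) << std::endl;   // overflows the double range: prints inf
    return 0;
}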
Switching to a larger type like long double may not be the best idea. Though the range of floating point types roughly grows exponentially with size, the fact that you use the exp function makes it easy to defeat the extra range. E.g. on an implementation with 80-bit long double (with range up to about 1.2e4932), modifying test_func and I0 to use long double still fails to evaluate test_func(5700, 10).
It is instead possible to redesign I0 to avoid huge numbers. Let's start by splitting the log.
double I0(double y0, double y) {
    return log(exp(y0+y)-1) - log(exp(y0-y)-1) - y;
}
When you're computing log(exp(y0+y) - 1), if exp(y0+y) gives infinity you can recover the computation by using y0+y instead of the log. I.e. we're ignoring the -1, because if our numbers are that large the precision of double isn't enough to actually register the difference in the final result. Also, you may want to replace both exp(x) - 1s with expm1. This is because when x is close to 0, exp(x) - 1 will tend to lose precision. E.g. exp(1e-16) - 1 == 0 but expm1(1e-16) > 0 assuming IEEE. I suspect that's not as important to you.
double I0(double y0, double y) {
    double num = expm1(y0 + y), den = expm1(y0 - y);
    num = isinf(num) ? y0 + y : log(num);
    den = isinf(den) ? y0 - y : log(den); // though I suspect the domain of I0 is such that you don't actually need this
    return num - den - y;
}
This is only a very rudimentary correction. Squeezing the most correctness out of floating point is very difficult in general and is its own whole field of programming. However, this is enough to make your case work, and even works on those large inputs where the naive long double also fails. (I'm no expert, but I also notice that y0-y is itself problematic, since y0 is apparently chosen to be close to y. Subtraction between them loses precision that may (or may not) destroy the result.)
If you don't want to deal with rewriting your formulas to fit within the limitations of floating point (completely understandable, given the potential for bugs!), I would suggest following #M.M's advice and using an arbitrary-precision math library like MPFR. That's similar to replacing double with long double, but now you can keep throwing bits at the problems until they go away whereas you will eventually run out of built-in floating point types.
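For completeness, here is a hedged check of the corrected I0: dropping it into the original test_func should reproduce the value quoted in the question (assuming IEEE doubles).
#include <cmath>
#include <iostream>
using namespace std;
double I0(double y0, double y) {
    double num = expm1(y0 + y), den = expm1(y0 - y);
    num = isinf(num) ? y0 + y : log(num);
    den = isinf(den) ? y0 - y : log(den);
    return num - den - y;
}
double test_func(double k, double M) {
    double k0 = sqrt(k*k + M*M);
    return I0(k0, k) / (k*k*k);
}
int main() {
    cout.precision(17);
    cout << test_func(3000, 10) << endl; // should print roughly 1.1126306181763716e-07
    return 0;
}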

Related

Simpson's Composite Rule giving too large values for when n is very large

I'm using Simpson's Composite Rule to calculate the integral from 2 to 1,000 of 1/ln(x); however, when using a large n (usually around 500,000), I start to get results that differ from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. I'm wondering why this is, since as n gets larger, the accuracy is supposed to increase, not decrease.
#include <iostream>
#include <cmath>
// the function f(x)
float f(float x)
{
    return (float) 1 / std::log(x);
}
float my_simpson(float a, float b, long int n)
{
    if (n % 2 == 1) n += 1; // since n has to be even
    float area, h = (b-a)/n;
    float x, y, z;
    for (int i = 1; i <= n/2; i++)
    {
        x = a + (2*i - 2)*h;
        y = a + (2*i - 1)*h;
        z = a + 2*i*h;
        area += f(x) + 4*f(y) + f(z);
    }
    return area*h/3;
}
int main()
{
    std::cout.precision(20);
    int upperBound = 1'000;
    int subsplits = 1'000'000;
    float approx = my_simpson(2, upperBound, subsplits);
    std::cout << "Output: " << approx << std::endl;
    return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real number (in the mathematical sense), a float has limited precision.
A typical IEEE 754 32-bit (single precision) floating-point binary representation dedicates only 24 bits (one of which is implicit) to the mantissa, and that translates to roughly less than 8 significant decimal digits (please take this as a gross simplification).
A double, on the other hand, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations these days.
since as n gets larger, the accuracy is supposed to increase and not decrease.
Unfortunately, that's not how it works. There's a sweet spot, but after that the accumulation of rounding errors prevails and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, due to the fact that area becomes much greater than f(x) + 4*f(y) + f(z) (e.g. 224678.937 vs. 0.3606823). The bigger n is, the sooner this becomes relevant, making the result diverge from the real one.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).
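For reference, a minimal sketch of the fix implied by the update and the comments, i.e. computing in double and initializing the accumulator (the structure otherwise mirrors the original code):
#include <cmath>
#include <iostream>
// the integrand, now in double precision
double f(double x)
{
    return 1.0 / std::log(x);
}
double my_simpson(double a, double b, long int n)
{
    if (n % 2 == 1) n += 1;      // since n has to be even
    double area = 0.0;           // initialized, unlike the original
    double h = (b - a) / n;
    for (long int i = 1; i <= n / 2; i++)
    {
        double x = a + (2*i - 2) * h;
        double y = a + (2*i - 1) * h;
        double z = a + 2*i * h;
        area += f(x) + 4*f(y) + f(z);
    }
    return area * h / 3.0;
}
int main()
{
    std::cout.precision(10);
    std::cout << my_simpson(2, 1'000, 1'000'000) << std::endl; // should land near 176.5644
    return 0;
}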

How does Cpp work with large numbers in calculations?

I have code that numerically solves the integral of a function over a given interval, using the Trapezoidal Rule (see the formula in Trapezoid method). Now, for the function sin(x) on the interval [-pi/2.0, pi/2.0], the integral is expected to be zero.
In this case, I take the number of partitions n equal to 4. The problem is that when I have pi with 20 decimal places the result is zero, with 14 decimal places it is 8.72e-17, with 11 decimal places it is zero, with 8 decimal places it is 8.72e-17, and with 3 decimal places it is zero. In other words, the integral is zero or a number near zero for different approximations of pi, but there is no clear trend.
I would appreciate your help in understanding why this happens. (I did run it in Dev-C++).
#include <iostream>
#include <math.h>
using namespace std;
#define pi 3.14159265358979323846
//Pi: 3.14159265358979323846
double func(double x){
    return sin(x);
}
int main() {
    double x0 = -pi/2.0, xf = pi/2.0;
    int n = 4;
    double delta_x = (xf-x0)/(n*1.0);
    double sum = (func(x0)+func(xf))/2.0;
    double integral;
    for (int k = 1; k<n; k++){
        // cout<<"func: "<<func(x0+(k*delta_x))<<" "<<"last sum: "<<sum<<endl;
        sum = sum + func(x0+(k*delta_x));
        // cout<<"func + last sum= "<<sum<<endl;
    }
    integral = delta_x*sum;
    cout<<"The value for the integral is: "<<integral<<endl;
    return 0;
}
OP is integrating y=sin(x) from -a to +a. The various tests use different values of a, all near pi/2.
The approach uses a linear summation of values near -1.0, down to 0 and then up to near 1.0.
This summation is sensitive to calculation error in the last terms, because the final mathematical sum is expected to be 0.0. Since the start/end value a varies between tests, the error varies.
A more stable result would be obtained by adding the extreme values first: e.g. sum += sin(x(k=1)), then sum += sin(x(k=3)), then sum += sin(x(k=2)), rather than k=1,2,3. In particular, the x formed for k=3 is likely a bit off from the exact negative of the x formed earlier for k=1, further compounding the issue.
Welcome to the world of numerical analysis.
The problem exists whether the code uses all float or all long double, just to different degrees.
The problem is not due to using an inexact value of pi (an exact value is impossible with FP, since pi is irrational and all finite FP values are rational).
Much is due to the formation of x. You could try the below to form x symmetrically about 0.0, and compare the exact x generated this way to the x formed the original way.
x = (x0 + xf)/2 + (k - n/2)*delta_x
Print out the exact values computed for deeper understanding.
printf("x:%a y:%a\n", x0+(k*delta_x), func(x0+(k*delta_x)));

comparing equal calculation outputs for floating point numbers

The outputs of two calculations are supposed to be the same, as described below, but even after taking machine precision into account they come out unequal. What would be a way around this to get them equal?
#include <iostream>
#include <limits>
#include <math.h>
bool definitelyGreaterThan(double a, double b, double epsilon)
{
    return (a - b) > ( (std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}
bool definitelyLessThan(double a, double b, double epsilon)
{
    return (b - a) > ( (std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}
int main ()
{
    double fig1, fig2;
    double m1 = 235.60242, m2 = 126.734781;
    double n1 = 4.2222, n2 = 2.1111;
    double p1 = 1.245, p2 = 2.394;
    fig1 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2;
    m1 = 1.2*m1, m2 = 1.2*m2; // both scaled equally, numerator and denominator
    n1 = 2.0*n1, n2 = 2.0*n2;
    p1 = 3.0*p1, p2 = 3.0*p2;
    fig2 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2; // same as above
    double epsilon = std::numeric_limits<double>::epsilon();
    std::cout << "\n fig1 " << fig1 << " fig2 " << fig2 << " difference " << fig1 - fig2 << " epl " << epsilon;
    std::cout << "\n if(fig1 < fig2) " << definitelyLessThan(fig1, fig2, epsilon)
              << "\n if(fig1 > fig2) " << definitelyGreaterThan(fig1, fig2, epsilon) << "\n";
}
with output as -
fig1 0.597738 fig2 0.597738 difference 8.88178e-16 epl 2.22045e-16
if(fig1 < fig2) 0
if(fig1 > fig2) 1
The difference between two numbers is greater than machine precision.
The key question is whether there is any universal method to deal with such aspects or solution has to be application dependent?
There are two things to consider:
First, the possible rounding error (introduced by limited machine precision) scales with the number of operations used to calculate the value. For example, storing the result of m1/m2 might introduce some rounding error (depending on the actual value), and multiplying this value with something also multiplies that rounding error and adds another possible rounding error on top of that (by storing that result).
Second, floating point values are not evenly spaced (they use an exponent-and-mantissa format): the bigger a value actually is, the bigger the difference between that value and the next representable value (and therefore also the bigger the possible rounding error). std::numeric_limits<T>::epsilon() only states the difference between 1.0 and the next representable value, so if your value is not exactly 1.0, this epsilon does not exactly represent the machine precision (meaning: the difference between your value and the next representable one).
So, to answer your question: the solution is to select an application-dependent, reasonable maximum rounding error that is allowed for two values to still be considered equal. Since this allowed rounding error depends both on the expected values and on the number of operations (as well as on what's acceptable for the application itself, of course), a truly universal solution is not possible.
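As an illustration only (the scale factor here is an arbitrary assumption, not a universal constant), such an application-chosen comparison might look like:
#include <algorithm>
#include <cmath>
#include <limits>
bool approximatelyEqual(double a, double b, double opFactor = 8.0)
{
    // opFactor is a rough, application-chosen allowance for accumulated rounding
    double eps = std::numeric_limits<double>::epsilon();
    double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= opFactor * eps * scale;
}
With the fig1/fig2 values printed above, this would treat them as equal, since 8*epsilon*0.597738 (about 1.06e-15) exceeds the observed difference of 8.88178e-16.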

Is there any "standard" way to calculate the numerical gradient?

I am trying to calculate the numerical gradient of a smooth function in C++, and the parameter values could vary from zero to a very large number (maybe 1e10 to 1e20?).
I used the function f(x,y) = 10*x^3 + y^3 as a testbench, but I found that if x or y is too large, I can't get a correct gradient.
Here is my code to calculate the gradient:
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y)
{
    // black box expensive function
    return 10 * pow(x, 3) + pow(y, 3);
}
int main()
{
    // double x = -5897182590.8347721;
    // double y = 269857217.0017581;
    double x = 1.13041e+19;
    double y = -5.49756e+14;
    const double epsi = 1e-4;
    double f1 = f(x, y);
    double f2 = f(x, y+epsi);
    double f3 = f(x, y-epsi);
    cout << f1 << endl;
    cout << f2 << endl;
    cout << f3 << endl;
    cout << f1 - f2 << endl; // 0
    cout << f2 - f3 << endl; // 0
    return 0;
}
If I use the above code to calculate the gradient, the gradient would be zero!
The testbench function, 10*x^3 + y^3, is just a demo, the real problem I need to solve is actually a black box function.
So, is there any "standard" way to calculate the numerical gradient?
In the first place, you should use the central difference scheme, which is more accurate (by cancellation of one more term of the Taylor development).
(f(x + h) - f(x - h)) / 2h
rather than
(f(x + h) - f(x)) / h
Then the choice of h is critical, and using a fixed constant is the worst thing you can do, because for small x, h will be too large so that the approximation formula no longer works, while for large x, h will be too small, resulting in severe cancellation (roundoff) error.
A much better choice is to take a relative value, h = x√ε, where ε is the machine epsilon (1 ulp), which gives a good tradeoff.
(f(x(1 + √ε)) - f(x(1 - √ε))) / 2x√ε
Beware that when x = 0, a relative value cannot work and you need to fall back to a constant. But then, nothing tells you which to use!
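A minimal sketch of that scheme (the fallback step at x = 0 is an arbitrary choice, as just noted):
#include <cmath>
#include <limits>
// central difference with a step relative to x, i.e. h = |x|*sqrt(eps)
double central_diff(double (*f)(double), double x)
{
    const double sqrt_eps = std::sqrt(std::numeric_limits<double>::epsilon());
    double h = (x != 0.0) ? std::fabs(x) * sqrt_eps : sqrt_eps; // fallback at x == 0 is a guess
    return (f(x + h) - f(x - h)) / (2.0 * h);
}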
You need to consider the precision needed.
At first glance, since |y| = 5.49756e14 and epsi = 1e-4, you need at least ⌈log2(5.49756e14)-log2(1e-4)⌉ = 63 bits of significand precision (that is the number of bits used to encode the digits of your number, also known as mantissa) for y and y+epsi to be considered different.
The double-precision floating-point format only has 53 bits of significand precision (assuming it is 8 bytes). So, currently, f1, f2 and f3 are exactly the same because y, y+epsi and y-epsi are equal.
Now, let's consider the limit: y = 1e20, and the result of your function, 10x^3 + y^3. Let's ignore x for now, so let's take f = y^3. Now we can calculate the precision needed for f(y) and f(y+epsi) to be different: f(y) = 1e60 and f(epsi) = 1e-12. This gives a minimum significand precision of ⌈log2(1e60)-log2(1e-12)⌉ = 240 bits.
Even if you were to use the long double type, assuming it is 16 bytes, your results would not differ : f1, f2 and f3 would still be equal, even though y and y+epsi would not.
If we take x into account, the maximum value of f would be 11e60 (with x = y = 1e20). So the upper limit on precision is ⌈log2(11e60)-log2(1e-12)⌉ = 243 bits, or at least 31 bytes.
One way to solve your problem is to use another type, maybe a bignum used as fixed-point.
Another way is to rethink your problem and deal with it differently. Ultimately, what you want is f1 - f2. You can try to decompose f(y+epsi). Again, if you ignore x, f(y+epsi) = (y+epsi)^3 = y^3 + 3*y^2*epsi + 3*y*epsi^2 + epsi^3. So f(y+epsi) - f(y) = 3*y^2*epsi + 3*y*epsi^2 + epsi^3.
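To make that last point concrete, a hedged sketch for the y-part of the testbench only (the real black-box function would need its own decomposition):
#include <iostream>
int main()
{
    double y = -5.49756e+14, epsi = 1e-4;
    // f(y+epsi) - f(y) for f(y) = y^3, expanded so the huge y^3 terms never meet
    double diff = 3*y*y*epsi + 3*y*epsi*epsi + epsi*epsi*epsi;
    std::cout << diff << std::endl; // roughly 9.07e+25, whereas the direct subtraction gives 0
    return 0;
}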
The only way to calculate a gradient is calculus.
Gradient is a vector:
g(x, y) = Df/Dx i + Df/Dy j
where (i, j) are unit vectors in x and y directions, respectively.
One way to approximate derivatives is first order differences:
Df/Dx ~ (f(x2, y)-f(x1, y))/(x2-x1)
and
Df/Dy ~ (f(x, y2)-f(x, y1))/(y2-y1)
That doesn't look like what you're doing.
You have a closed form expression:
g(x, y) = 30*x^2 i + 3*y^2 j
You can plug in values for (x, y) and calculate the gradient exactly at any point. Compare that to your differences and see how well your approximation is doing.
How you implement it numerically is your responsibility. (10^19)^3 = 10^57, right?
What is the size of double on your machine? Is it a 64 bit IEEE double precision floating point number?
Use
dx = (1+abs(x))*eps, dfdx = (f(x+dx,y) - f(x,y)) / dx
dy = (1+abs(y))*eps, dfdy = (f(x,y+dy) - f(x,y)) / dy
to get meaningful step sizes for large arguments.
Use eps = 1e-8 for one-sided difference formulas, eps = 1e-5 for central difference quotients.
Explore automatic differentiation (see autodiff.org) for derivatives without difference quotients and thus much smaller numerical errors.
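A hedged sketch of that recipe (the testbench function stands in for the black box; names are illustrative):
#include <cmath>
// testbench stand-in for the black-box function
double f(double x, double y)
{
    return 10 * std::pow(x, 3) + std::pow(y, 3);
}
// one-sided differences with step sizes relative to the argument magnitudes
void gradient(double x, double y, double &dfdx, double &dfdy)
{
    const double eps = 1e-8; // suggested above for one-sided formulas
    double dx = (1.0 + std::fabs(x)) * eps;
    double dy = (1.0 + std::fabs(y)) * eps;
    dfdx = (f(x + dx, y) - f(x, y)) / dx;
    dfdy = (f(x, y + dy) - f(x, y)) / dy;
}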
We can examine the behaviour of the error in the derivative using the following program - it calculates the 1-sided derivative and the central difference based derivative using a varying step size. Here I'm using x and y ~ 10^10, which is smaller than what you were using, but should illustrate the same point.
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y) {
    return 10 * pow(x, 3) + pow(y, 3);
}
double f_x(double x, double y) {
    return 3 * 10 * pow(x,2);
}
double f_y(double x, double y) {
    return 3 * pow(y,2);
}
int main()
{
    // double x = -5897182590.8347721;
    // double y = 269857217.0017581;
    double x = 1.13041e+10;
    double y = -5.49756e+10;
    //double x = 10.1;
    //double y = -5.2;
    double epsi = 1e8;
    for(int i=0; i<60; ++i) {
        double dfx_n = (f(x+epsi,y) - f(x,y))/epsi;
        double dfx_cd = (f(x+epsi,y) - f(x-epsi,y))/(2*epsi);
        double dfx = f_x(x,y);
        cout<<epsi<<" "<<fabs(dfx-dfx_n)<<" "<<fabs(dfx - dfx_cd)<<std::endl;
        epsi/=1.5;
    }
    return 0;
}
The output shows that a 1-sided difference gets us an optimal error of about 1.37034e+13 at a step length of about 100.0. Note that while this error looks large, as a relative error it is 3.5746632302764072e-09 (since the exact value is 3.833e+21).
In comparison, the 2-sided difference gets an optimal error of about 1.89493e+10 with a step size of about 45109.3. This is three orders of magnitude better (with a much larger step size).
How can we work out the step size? The link in the comments of Yves Daoust's answer gives us a ballpark value:
h=x_c sqrt(eps) for 1-Sided, and h=x_c cbrt(eps) for 2-Sided.
But either way, if the required step size for decent accuracy at x ~ 10^10 is 100.0, the required step size with x ~ 10^20 is going to be 10^10 larger too. So the problem is simply that your step size is way too small.
This can be verified by increasing the starting step-size in the above code and resetting the x/y values to the original values.
Then the expected derivative is O(1e39); the best 1-sided error, about O(1e31), occurs near a step length of 5.9e10, and the best 2-sided error, about O(1e29), occurs near a step length of 6.1e13.
As numerical differentiation is ill-conditioned (which means a small error could alter your result significantly), you should consider using Cauchy's integral formula. This way you can calculate the n-th derivative with an integral, which leads to fewer problems with accuracy and stability.

Division by zero prevention: Checking the divisor's expression doesn't result in zero vs. checking the divisor isn't zero?

Is division by zero possible in the following case due to the floating point error in the subtraction?
float x, y, z;
...
if (y != 1.0)
    z = x / (y - 1.0);
In other words, is the following any safer?
float divisor = y - 1.0;
if (divisor != 0.0)
    z = x / divisor;
Assuming IEEE-754 floating-point, they are equivalent.
It is a basic theorem of FP arithmetic that for finite x and y, x - y == 0 if and only if x == y, assuming gradual underflow.
If subnormal results are flushed to zero (instead of gradual underflow), this theorem holds only if the result x - y is normal. Because 1.0 is well scaled, y - 1.0 is never subnormal, and so y - 1.0 is zero if and only if y is exactly 1.0, regardless of how underflow is handled.
C++ doesn't guarantee IEEE-754, of course, but the theorem is true for most "reasonable" floating-point systems.
This will prevent you from dividing by exactly zero; however, that does not mean you won't still end up with +/-inf as a result. The denominator could be small enough that the answer is not representable with a double and you will end up with an inf. For example:
#include <iostream>
#include <limits>
int main(int argc, char const *argv[])
{
    double small = std::numeric_limits<double>::epsilon();
    double large = std::numeric_limits<double>::max() / small;
    std::cout << "small: " << small << std::endl;
    std::cout << "large: " << large << std::endl;
    return 0;
}
In this program small is non-zero, but it is so small that large exceeds the range of double and is inf.
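So, if inf is a concern, a hedged addition is to test the quotient itself rather than only the divisor, for example:
#include <cmath>
#include <iostream>
#include <limits>
int main()
{
    float x = std::numeric_limits<float>::max();
    float divisor = 0.5f;                  // non-zero, yet x / divisor overflows float
    float z = x / divisor;
    if (std::isinf(z))
        std::cout << "quotient overflowed to inf" << std::endl;
    return 0;
}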
There is no difference between the two code snippets - in fact, the optimizer could even optimize both fragments to the same binary code, assuming that there are no further uses of the divisor variable.
Note, however, that division by a floating point zero 0.0 does not result in a run-time error, but produces an inf or -inf instead.