How to increase accuracy of floating point second derivative calculation? - c++

I've written a simple program to calculate the first and second derivative of a function, using function pointers. My program computes the correct answers (more or less), but for some functions, the accuracy is less than I would like.
This is the function I am differentiating:
float f1(float x) {
return (x * x);
}
These are the derivative functions, using the central finite difference method:
// Function for calculating the first derivative.
float first_dx(float (*fx)(float), float x) {
float h = 0.001;
float dfdx;
dfdx = (fx(x + h) - fx(x - h)) / (2 * h);
return dfdx;
}
// Function for calculating the second derivative.
float second_dx(float (*fx)(float), float x) {
float h = 0.001;
float d2fdx2;
d2fdx2 = (fx(x - h) - 2 * fx(x) + fx(x + h)) / (h * h);
return d2fdx2;
}
Main function:
int main() {
pc.baud(9600);
float x = 2.0;
pc.printf("**** Function Pointers ****\r\n");
pc.printf("Value of f(%f): %f\r\n", x, f1(x));
pc.printf("First derivative: %f\r\n", first_dx(f1, x));
pc.printf("Second derivative: %f\r\n\r\n", second_dx(f1, x));
}
This is the output from the program:
**** Function Pointers ****
Value of f(2.000000): 4.000000
First derivative: 3.999948
Second derivative: 1.430511
I'm happy with the accuracy of the first derivative, but I believe the second derivative is too far off (it should be equal to ~2.0).
I have a basic understanding of how floating point numbers are represented and why they are sometimes inaccurate, but how can I make this second derivative result more accurate? Could I be using something better than the central finite difference method, or is there a way I can get better results with the current method?

The accuracy can be increased by choosing a type which has more precision. float is typically an IEEE-754 32-bit number, giving you a precision of ~7.225 decimal digits.
What you want is the 64-bit counterpart: double, with ~15.955 decimal digits of precision.
That should be sufficient for your calculation. Also worth mentioning is Boost's implementation, which offers a quadruple-precision (128-bit) floating-point number.
Finally, the GNU Multiple Precision Arithmetic Library offers types with an arbitrary number of digits of precision.
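As a rough sketch of this suggestion, here is the same central second difference evaluated in double. The name second_derivative and the step rule h = eps^(1/4) * max(1, |x|) are assumptions made for illustration, not part of the original program. The reason float struggles is that the rounding error in the numerator (on the order of the float epsilon times |f(x)|) is divided by h * h = 1e-6, which magnifies it enormously.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Central second difference evaluated in double precision.
// The step rule is an assumed rule of thumb, not taken from the question.
template <typename F>
double second_derivative(F f, double x) {
    const double eps = std::numeric_limits<double>::epsilon();
    // Fourth root of epsilon, scaled by the magnitude of x.
    const double h = std::pow(eps, 0.25) * std::max(1.0, std::fabs(x));
    return (f(x - h) - 2.0 * f(x) + f(x + h)) / (h * h);
}
```

Called as second_derivative([](double x) { return x * x; }, 2.0), this should land very close to 2.0 instead of the 1.43 seen with float.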

Go analytical. ;-) Probably not an option given "with the current method".
Use double instead of float.
Vary the epsilon (h), and combine the results in some way. For example you could try 0.00001, 0.000001, 0.0000001 and average them. In fact, you'd want the result with the smallest h that doesn't overflow/underflow. But it's not clear how to detect overflow and underflow.
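One concrete way to "combine the results" is Richardson extrapolation. Note that this is a swapped-in technique rather than the plain averaging mentioned above, and the function names below are hypothetical. The idea is that the central second difference has an error proportional to h^2, so evaluating it at h and h/2 and forming (4*D(h/2) - D(h)) / 3 cancels that leading term.

```cpp
#include <cassert>
#include <cmath>

// Plain central second difference with an explicit step h.
template <typename F>
double second_dx(F f, double x, double h) {
    return (f(x - h) - 2.0 * f(x) + f(x + h)) / (h * h);
}

// Richardson extrapolation: the leading error term of second_dx is
// proportional to h^2, so this combination removes it.
template <typename F>
double second_dx_richardson(F f, double x, double h) {
    assert(h > 0.0);
    const double coarse = second_dx(f, x, h);
    const double fine = second_dx(f, x, h / 2.0);
    return (4.0 * fine - coarse) / 3.0;
}
```

Because the leading error is cancelled, a comparatively large step such as h = 0.01 can be used, which keeps the rounding error in the numerator from being blown up by a tiny h * h.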

Related

How does Cpp work with large numbers in calculations?

I have code that tries to numerically solve the integral of a function over a given interval, using the Trapezoidal Rule (see the formula in Trapezoid method). For the function sin(x) on the interval [-pi/2.0,pi/2.0], the integral is expected to be zero.
In this case, I take the number of partitions 'n' equal to 4. The problem is that when I have pi with 20 decimal places the result is zero, with 14 decimal places it is 8.72e-17, with 11 decimal places it is zero, with 8 decimal places it is 8.72e-17, and with 3 decimal places it is zero. In other words, the integral is zero or a number near zero for different approximations of pi, but there is no clear trend.
I would appreciate your help in understanding why this happens. (I ran it in Dev-C++.)
#include <iostream>
#include <math.h>
using namespace std;
#define pi 3.14159265358979323846
//Pi: 3.14159265358979323846
double func(double x){
return sin(x);
}
int main() {
double x0 = -pi/2.0, xf = pi/2.0;
int n = 4;
double delta_x = (xf-x0)/(n*1.0);
double sum = (func(x0)+func(xf))/2.0;
double integral;
for (int k = 1; k<n; k++){
// cout<<"func: "<<func(x0+(k*delta_x))<<" "<<"last sum: "<<sum<<endl;
sum = sum + func(x0+(k*delta_x));
// cout<<"func + last sum= "<<sum<<endl;
}
integral = delta_x*sum;
cout<<"The value for the integral is: "<<integral<<endl;
return 0;
}
OP is integrating y=sin(x) from -a to +a. The various tests use different values of a, all near pi/2.
The approach uses a linear summation of values near -1.0, down to 0 and then up to near 1.0.
This summation is sensitive to calculation error with the last terms as the final math sum is expected to be 0.0. Since the start/end a varies, the error varies.
A more stable result would be had by adding the extreme sin(x(k)) values first, e.g. sum += sin(x(k=1)), then sum += sin(x(k=3)), then sum += sin(x(k=2)), rather than k = 1, 2, 3. In particular, the x formed for k=3 is likely a bit off from the negative of the x formed earlier for k=1, further compounding the issue.
Welcome to the world of numerical analysis.
The problem also exists if the code used all float or all long double, just to different degrees.
The problem is not due to using an inexact value of pi (an exact value is impossible with FP, as pi is irrational and all finite FP values are rational).
Much is due to the formation of x. You could try the below to form x symmetrically about the midpoint of the interval (0.0 here). Compare the exact x generated this way to the x generated the original way.
x = (x0+xf)/2 + ((k - n/2.0)*delta_x)
Print out the exact values computed for deeper understanding.
printf("x:%a y:%a\n", x0+(k*delta_x), func(x0+(k*delta_x)));
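The two suggestions above (form x symmetrically, and add mirrored terms together) can be sketched as follows. The function name trapezoid_symmetric and its interface are hypothetical, and it assumes the interval is [-a, a]. Each node is generated as plus or minus the same offset, so mirrored arguments are exact negatives of each other.

```cpp
#include <cassert>
#include <cmath>

// Trapezoidal rule on [-a, a] with n sub-intervals.
// Nodes are formed as +/- offset from the center and summed in mirrored pairs.
template <typename F>
double trapezoid_symmetric(F f, double a, int n) {
    assert(n >= 1);
    const double delta_x = 2.0 * a / n;
    double sum = 0.5 * (f(-a) + f(a));  // end points carry weight 1/2
    for (int k = 1; 2 * k < n; ++k) {
        const double offset = (n / 2.0 - k) * delta_x;  // distance from center
        sum += f(-offset) + f(offset);                  // mirrored pair
    }
    if (n % 2 == 0) {
        sum += f(0.0);  // the middle node only exists for even n
    }
    return delta_x * sum;
}
```

For an odd function such as sin, every pair cancels before it is accumulated, so the result no longer depends on how many digits of pi were typed in.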

constrain a value -pi to pi for precision buff

What is the best way to constrain any value from -pi to pi ?
I currently have:
if (fAngle > XM_PI) {
fAngle = fAngle - XM_2PI;
}
else if (fAngle < -XM_PI) {
fAngle = fAngle - -XM_2PI;
}
However, I fear those if's should instead be while's
For reference, under the Exploit Symmetrical Functions section:
https://developer.arm.com/solutions/graphics-and-gaming/developer-guides/learn-the-basics/understanding-numerical-precision/mitigating-loss-of-precision
Extra bit of precision!
Adding or subtracting XM_2PI cannot restore any accuracy that has been lost. In fact, it adds noise, generally losing more accuracy, because XM_2PI is necessarily only an approximation of 2π. It has some error itself, so adding or subtracting it adds or subtracts the error in the approximation.
What it can do is keep you from losing more accuracy by ensuring that future results remain low in magnitude, thus remaining in a region where the floating-point format has more precision than if the number grew beyond 4, 8, 16, or other points where the exponent changes and the absolute precision becomes worse.
If you already have some value x outside [−π, π] and want its sine or cosine, you should get the best result by using sin(x) or cos(x) directly. Good implementations of sin and cos will reduce the argument using a high-precision value for 2π, so you will get a better result than using sin(x-XM_PI) or cos(x-XM_PI) (unless, by chance, the various errors in these happen to cancel).
So your task with trigonometric functions is not to reduce values you already have but to design your algorithms to keep values from growing. Adding or subtracting 2π is a reasonable way to do this. However, when you do it, add or subtract an extended-precision version of 2π, not just XM_2PI. You can do this by representing 2π as XM_2PI (which should be the value representable in floating-point that is closest to 2π) plus some residue r. r should be the value representable in floating-point that is closest to 2π−XM_2PI. You can calculate that with extended-precision software such as GMP or Maple and can likely find it online. (I do not have it handy or I would paste it here; anybody else is welcome to edit it in.) Then you would update your angle with fAngle = fAngle - XM_2PI - r; or fAngle = fAngle + XM_2PI + r;.
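A minimal sketch of that two-part update, in plain float and without DirectXMath. The names kTwoPiHi, kTwoPiLo and wrap_once are made up for illustration; the residue value is the one that appears as dResidue in the code posted further down in this thread, rounded to float.

```cpp
#include <cmath>

// Two-part representation of 2*pi in float (illustrative names).
constexpr float kTwoPiHi = 6.2831853071795864769f;  // float nearest to 2*pi
constexpr float kTwoPiLo = -1.7484555e-7f;          // approximately 2*pi - kTwoPiHi
constexpr float kPi = 3.14159265358979323846f;

// Brings an angle that has drifted just outside [-pi, pi] back by one period.
inline float wrap_once(float angle) {
    if (angle > kPi) {
        angle = (angle - kTwoPiHi) - kTwoPiLo;  // subtract the large part first
    } else if (angle < -kPi) {
        angle = (angle + kTwoPiHi) + kTwoPiLo;
    }
    return angle;
}
```

The order of operations matters: the large part is removed first, while the result is still exactly representable, and the small correction is applied afterwards.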
An exception is if you have the angle measured in some unit that you can represent or reduce exactly, such as in degrees (which you can reduce by 360º with no error as long as the number of degrees itself is represented with no error) or in time (such as number of seconds for some function with a period of a day or other rational number of seconds, so you can again reduce with no error). In that case, you can let the angle grow as long as you can represent it exactly, and you would reduce it modulo the period prior to converting it to radians.
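For the degrees case, a sketch could look like this. The function name is hypothetical; it relies on std::remainder, which computes its result exactly for IEEE-754 doubles, so the reduction itself adds no error and only the final conversion to radians rounds.

```cpp
#include <cmath>

// Reduces an angle in degrees to [-180, 180] first, then converts to radians.
inline double degrees_to_wrapped_radians(double degrees) {
    constexpr double kPi = 3.14159265358979323846;
    const double reduced = std::remainder(degrees, 360.0);  // exact reduction
    return reduced * (kPi / 180.0);
}
```

With this, an accumulated angle such as 360000090 degrees gives the same radians as 90 degrees, bit for bit.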
The simplest coding way is to use the math library function remainder, as in
fAngle = remainder(fAngle, XM_2PI);
STATIC_INLINE_PURE float const __vectorcall constrain(float const fAngle)
{
static constexpr double const
dPI(std::numbers::pi),
d2PI(2.0 * std::numbers::pi),
dResidue(-1.74845553146951715461909770965576171875e-07); // abs difference between d2PI(double precision) and XM_2PI(float precision)
double dAngle(fAngle);
dAngle = std::remainder(dAngle, d2PI);
if (dAngle > dPI) {
dAngle = dAngle - d2PI - dResidue;
}
else if (dAngle < -dPI) {
dAngle = dAngle + d2PI + dResidue;
}
return((float)dAngle);
}

Is there any "standard" way to calculate the numerical gradient?

I am trying to calculate the numerical gradient of a smooth function in C++. The parameter value could vary from zero to a very large number (maybe 1e10 to 1e20?).
I used the function f(x,y) = 10*x^3 + y^3 as a testbench, but I found that if x or y is too large, I can't get the correct gradient.
Here is my code to calculate the gradient:
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y)
{
// black box expensive function
return 10 * pow(x, 3) + pow(y, 3);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+19;
double y = -5.49756e+14;
const double epsi = 1e-4;
double f1 = f(x, y);
double f2 = f(x, y+epsi);
double f3 = f(x, y-epsi);
cout << f1 << endl;
cout << f2 << endl;
cout << f3 << endl;
cout << f1 - f2 << endl; // 0
cout << f2 - f3 << endl; // 0
return 0;
}
If I use the above code to calculate the gradient, the gradient would be zero!
The testbench function, 10*x^3 + y^3, is just a demo, the real problem I need to solve is actually a black box function.
So, is there any "standard" way to calculate the numerical gradient?
In the first place, you should use the central difference scheme, which is more accurate (by cancellation of one more term of the Taylor expansion):
(f(x + h) - f(x - h)) / 2h
rather than
(f(x + h) - f(x)) / h
Then the choice of h is critical, and using a fixed constant is the worst thing you can do: for small x, h will be too large for the approximation formula to hold, and for large x, h will be too small, resulting in severe rounding (cancellation) error.
A much better choice is to take a relative value, h = x√ε, where ε is the machine epsilon (1 ulp), which gives a good tradeoff.
(f(x(1 + √ε)) - f(x(1 - √ε))) / 2x√ε
Beware that when x = 0, a relative value cannot work and you need to fall back to a constant. But then, nothing tells you which one to use!
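A small sketch of this recipe, for a function of one variable. The name central_difference is hypothetical, and the fallback step of sqrt(eps) at x = 0 is an arbitrary assumption, exactly the kind of constant the caveat above warns about.

```cpp
#include <cmath>
#include <limits>

// Central difference with a step relative to |x|.
template <typename F>
double central_difference(F f, double x) {
    const double root_eps = std::sqrt(std::numeric_limits<double>::epsilon());
    // Relative step h = |x| * sqrt(eps); assumed constant fallback at x == 0.
    const double h = (x != 0.0) ? std::fabs(x) * root_eps : root_eps;
    const double hi = x + h;
    const double lo = x - h;
    return (f(hi) - f(lo)) / (hi - lo);  // divide by the step actually taken
}
```

For f(x) = x^3 at x = 1e10 this picks a step of roughly 150, which is large enough for x + h and x - h to be clearly distinct doubles.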
You need to consider the precision needed.
At first glance, since |y| = 5.49756e14 and epsi = 1e-4, you need at least ⌈log2(5.49756e14)-log2(1e-4)⌉ = 63 bits of significand precision (that is the number of bits used to encode the digits of your number, also known as mantissa) for y and y+epsi to be considered different.
The double-precision floating-point format only has 53 bits of significand precision (assuming it is 8 bytes). So, currently, f1, f2 and f3 are exactly the same because y, y+epsi and y-epsi are equal.
Now, let's consider the limit : y = 1e20, and the result of your function, 10x^3 + y^3. Let's ignore x for now, so let's take f = y^3. Now we can calculate the precision needed for f(y) and f(y+epsi) to be different : f(y) = 1e60 and f(epsi) = 1e-12. This gives a minimum significand precision of ⌈log2(1e60)-log2(1e-12)⌉ = 240 bits.
Even if you were to use the long double type, assuming it is 16 bytes, your results would not differ : f1, f2 and f3 would still be equal, even though y and y+epsi would not.
If we take x into account, the maximum value of f would be 11e60 (with x = y = 1e20). So the upper limit on precision is ⌈log2(11e60)-log2(1e-12)⌉ = 243 bits, or at least 31 bytes.
One way to solve your problem is to use another type, maybe a bignum used as fixed-point.
Another way is to rethink your problem and deal with it differently. Ultimately, what you want is f1 - f2. You can try to decompose f(y+epsi). Again, if you ignore x, f(y+epsi) = (y+epsi)^3 = y^3 + 3*y^2*epsi + 3*y*epsi^2 + epsi^3. So f(y+epsi) - f(y) = 3*y^2*epsi + 3*y*epsi^2 + epsi^3.
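That decomposition can be written down directly. The helper name cube_increment is hypothetical, and it only covers the y^3 term of the demo function; the point is that y^3 itself never has to be subtracted from a nearly equal number.

```cpp
#include <cmath>

// (y + e)^3 - y^3, expanded by hand so the large y^3 terms never
// have to cancel numerically.
inline double cube_increment(double y, double e) {
    return e * (3.0 * y * y + 3.0 * y * e + e * e);
}
```

Dividing cube_increment(y, epsi) by epsi then gives a usable slope even for y = -5.49756e14 and epsi = 1e-4, where the direct subtraction returns 0. Of course this only works when the structure of the function is known, which is not the case for a true black box.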
The only way to calculate gradient is calculus.
Gradient is a vector:
g(x, y) = Df/Dx i + Df/Dy j
where (i, j) are unit vectors in x and y directions, respectively.
One way to approximate derivatives is first order differences:
Df/Dx ~ (f(x2, y)-f(x1, y))/(x2-x1)
and
Df/Dy ~ (f(x, y2)-f(x, y1))/(y2-y1)
That doesn't look like what you're doing.
You have a closed form expression:
g(x, y) = 30*x^2 i + 3*y^2 j
You can plug in values for (x, y) and calculate the gradient exactly at any point. Compare that to your differences and see how well your approximation is doing.
How you implement it numerically is your responsibility. (10^19)^3 = 10^57, right?
What is the size of double on your machine? Is it a 64 bit IEEE double precision floating point number?
Use
dx = (1+abs(x))*eps, dfdx = (f(x+dx,y) - f(x,y)) / dx
dy = (1+abs(y))*eps, dfdy = (f(x,y+dy) - f(x,y)) / dy
to get meaningful step sizes for large arguments.
Use eps = 1e-8 for one-sided difference formulas, eps = 1e-5 for central difference quotients.
Explore automatic differentiation (see autodiff.org) for derivatives without difference quotients and thus much smaller numerical errors.
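The scaled step sizes from this answer can be sketched as follows, here with the central difference quotient and eps = 1e-5. The Gradient struct and the function name numerical_gradient are assumptions for illustration only.

```cpp
#include <cmath>

struct Gradient {
    double dfdx;
    double dfdy;
};

// Central differences with steps scaled by the argument magnitude.
template <typename F>
Gradient numerical_gradient(F f, double x, double y, double eps = 1e-5) {
    const double dx = (1.0 + std::fabs(x)) * eps;
    const double dy = (1.0 + std::fabs(y)) * eps;
    return Gradient{
        (f(x + dx, y) - f(x - dx, y)) / (2.0 * dx),
        (f(x, y + dy) - f(x, y - dy)) / (2.0 * dy),
    };
}
```

One caveat for the test point in the question: with x around 1e19 and y around 1e14, the change in the y^3 term over the step is smaller than the rounding granularity of 10*x^3, so the y component still comes out as zero or noise regardless of the step rule.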
We can examine the behaviour of the error in the derivative using the following program - it calculates the 1-sided derivative and the central difference based derivative using a varying step size. Here I'm using x and y ~ 10^10, which is smaller than what you were using, but should illustrate the same point.
#include <iostream>
#include <cmath>
#include <cassert>
using namespace std;
double f(double x, double y) {
return 10 * pow(x, 3) + pow(y, 3);
}
double f_x(double x, double y) {
return 3 * 10 * pow(x,2);
}
double f_y(double x, double y) {
return 3 * pow(y,2);
}
int main()
{
// double x = -5897182590.8347721;
// double y = 269857217.0017581;
double x = 1.13041e+10;
double y = -5.49756e+10;
//double x = 10.1;
//double y = -5.2;
double epsi = 1e8;
for(int i=0; i<60; ++i) {
double dfx_n = (f(x+epsi,y) - f(x,y))/epsi;
double dfx_cd = (f(x+epsi,y) - f(x-epsi,y))/(2*epsi);
double dfx = f_x(x,y);
cout<<epsi<<" "<<fabs(dfx-dfx_n)<<" "<<fabs(dfx - dfx_cd)<<std::endl;
epsi/=1.5;
}
return 0;
}
The output shows that a 1-sided difference gets us an optimal error of about 1.37034e+13 at a step length of about 100.0. Note that while this error looks large, as a relative error it is 3.5746632302764072e-09 (since the exact value is 3.833e+21).
In comparison, the 2-sided difference gets an optimal error of about 1.89493e+10 with a step size of about 45109.3. This is three orders of magnitude better (with a much larger step size).
How can we work out the step size? The link in the comments of Yves Daoust's answer gives us a ballpark value:
h=x_c sqrt(eps) for 1-Sided, and h=x_c cbrt(eps) for 2-Sided.
But either way, if the required step size for decent accuracy at x ~ 10^10 is 100.0, the required step size with x ~ 10^20 is going to be 10^10 larger too. So the problem is simply that your step size is way too small.
This can be verified by increasing the starting step-size in the above code and resetting the x/y values to the original values.
Then the expected derivative is O(1e39), the best 1-sided error of about O(1e31) occurs near a step length of 5.9e10, and the best 2-sided error of about O(1e29) occurs near a step length of 6.1e13.
Because numerical differentiation is ill-conditioned (a small error in the function values can change the result significantly), you should consider using Cauchy's integral formula, which expresses the n-th derivative as an integral. This leads to fewer problems with accuracy and stability.
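A rough sketch of the idea, under a strong assumption that the question does not guarantee: the function must be analytic and callable with std::complex<double> arguments, which a black box may not allow. The name cauchy_derivative and the default radius and sample count are arbitrary choices for illustration. The contour integral over a circle around x0 is approximated with the trapezoidal rule.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// n-th derivative at x0 from Cauchy's integral formula on a circle:
// f^(n)(x0) ~= n! / (N * r^n) * sum_k f(x0 + r*e^{i*t_k}) * e^{-i*n*t_k}
template <typename F>
double cauchy_derivative(F f, double x0, int order,
                         double radius = 1.0, int samples = 64) {
    assert(order >= 0 && samples > 0 && radius > 0.0);
    const double pi = std::acos(-1.0);
    std::complex<double> sum(0.0, 0.0);
    for (int k = 0; k < samples; ++k) {
        const double theta = 2.0 * pi * k / samples;
        const std::complex<double> z = x0 + std::polar(radius, theta);
        sum += f(z) * std::polar(1.0, -order * theta);
    }
    double factorial = 1.0;
    for (int i = 2; i <= order; ++i) {
        factorial *= i;
    }
    // For a function that is real on the real axis, the imaginary part is ~0.
    return factorial * sum.real() / (samples * std::pow(radius, order));
}
```

No difference of nearly equal numbers is formed here, which is why the approach avoids the cancellation problem of finite differences when its assumptions hold.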

How can I check whether a double has a fractional part?

Basically I have two variables:
double halfWidth = Width / 2;
double halfHeight = Height / 2;
As they are being divided by 2, they will either be a whole number or a decimal. How can I check whether they are a whole number or a .5?
You can use modf, this should be sufficient:
double intpart;
if( modf( halfWidth, &intpart) == 0 )
{
// your code here
}
First, you need to make sure that you're using double-precision floating-point math:
double halfWidth = Width / 2.0;
double halfHeight = Height / 2.0;
Because one of the operands is a double (namely, 2.0), this will force the compiler to convert Width and Height to doubles before doing the math (assuming they're not already doubles). Once converted, the division will be done in double-precision floating-point. So it will have a decimal, where appropriate.
The next step is to simply check it with modf.
double temp;
if(modf(halfWidth, &temp) != 0)
{
//Has fractional part.
}
else
{
//No fractional part.
}
You may discard a fractional part and compare the result with the original value using floor():
if (floor(halfWidth) == halfWidth) {
// halfWidth is a whole number
} else {
// halfWidth has a non-zero fractional part
}
As rightly pointed out by @Dávid Laczkó, it's a better solution than modf() because there's no need for an additional variable.
And according to my benchmarks (Linux, gcc 8.3.0, optimizations -O0...-O3), the floor() call consumes less CPU time than modf() on modern notebook and server processors. The difference even grows with compiler optimizations enabled. This is probably because modf() takes two arguments, whereas floor() takes only one.
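Putting the pieces from these answers together, a small sketch (the helper names are hypothetical, and it assumes the width and height are stored as int):

```cpp
#include <cmath>

// True when value is finite and not a whole number.
inline bool has_fractional_part(double value) {
    return std::isfinite(value) && std::floor(value) != value;
}

// Halves an integer dimension in floating point, so odd inputs keep their .5.
inline double half_of(int dimension) {
    return dimension / 2.0;  // 2.0 forces floating-point division
}
```

Since a half of an integer is always exactly representable in a double, the comparison against floor() is exact here and needs no tolerance.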

What is the optimum epsilon/dx value to use within the finite difference method?

double MyClass::dx = ?????;
double MyClass::f(double x)
{
return 3.0*x*x*x - 2.0*x*x + x - 5.0;
}
double MyClass::fp(double x) // derivative of f(x), that is f'(x)
{
return (f(x + dx) - f(x)) / dx;
}
When using the finite difference method for differentiation, it is critical to choose an optimum dx value. Mathematically, dx must be as small as possible. However, I'm not sure whether it is correct to choose the smallest positive normalized double-precision number (i.e., 2.2250738585072014 × 10^−308).
Is there an optimal numeric interval or exact value to choose a dx in to make the calculation error as small as possible?
(I'm using 64-bit compiler. I will run my program on a Intel i5 processor.)
Choosing the smallest possible value is almost certainly wrong: if dx were that smallest number, then f(x + dx) would be exactly equal to f(x) due to rounding.
So you have a tradeoff: Choose dx too small, and you lose precision to rounding errors. Choose it too large, and your result will be imprecise due to changes in the derivative as x changes.
To judge the numeric errors, consider (f(x + dx) - f(x))/f(x) [1] mathematically. The numerator denotes the difference you want to compute, but the denominator denotes the magnitude of the numbers you're dealing with. If that fraction is about 2^−k, then you can expect to lose approximately k bits of precision in your result.
If you know your function, you can compute what error you'd get from choosing dx too large. You can then balance things, so that the error incurred from this is about the same as the error incurred from rounding. But if you know the function, you might be better off providing a function that directly computes the derivative, as in your example with the polynomial f.
The Wikipedia section that pogorskiy pointed out suggests a value of sqrt(ε)x, or approximately 1.5e-8 * x. Without any more detailed knowledge about the function, such a rule of thumb will provide a reasonable default. Also note that the same section suggests not dividing by dx, but instead by (x + dx) - x, as this takes rounding errors incurred by computing x + dx into account. But I guess that whole article is full of suggestions you might use.
[1] This formula really should divide by f(x), not by dx, even though a past editor thought differently. I'm attempting to compare the number of significant bits remaining after the division, not the slope of the tangent.
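Both suggestions from that Wikipedia section can be sketched like this. The function name forward_difference is hypothetical, the fallback scale of 1.0 at x = 0 is an assumption, and volatile is used here only as a simple way to keep the compiler from simplifying (x + dx) - x back to dx.

```cpp
#include <cmath>
#include <limits>

// Forward difference with step sqrt(eps) * |x|, divided by the step
// that was actually taken after rounding.
template <typename F>
double forward_difference(F f, double x) {
    const double root_eps = std::sqrt(std::numeric_limits<double>::epsilon());
    const double scale = (x != 0.0) ? std::fabs(x) : 1.0;  // assumed fallback at 0
    volatile double shifted = x + scale * root_eps;        // rounded x + dx
    const double dx = shifted - x;                         // step actually taken
    return (f(shifted) - f(x)) / dx;
}
```

For the polynomial from the question, forward_difference(f, 2.0) should agree with the exact derivative 29 to about six or seven digits, which is roughly what a one-sided formula can deliver in double.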
Why not just use the power rule to compute the derivative? You'll get an exact answer:
f(x) = 3x^3 - 2x^2 + x - 5
f'(x) = 9x^2 - 4x + 1
Therefore:
f(x) = 3.0 * x * x * x - 2.0 * x * x + x - 5.0
fp(x) = 9.0 * x * x - 4.0 * x + 1.0