When finding the inverse square root of a double, is it better to clamp invalid non-positive inputs at 0.0 or DBL_MIN? (In my example below, double b may end up negative due to floating-point rounding errors and because the laws of physics are slightly fudged in the game.)
Dividing by 0.0 and dividing by DBL_MIN produce the same outcome in the game because 1/0.0 and 1/DBL_MIN are both effectively infinity. My intuition says DBL_MIN is the better choice, but is there any case for using 0.0? For example, perhaps sqrt(0.0), 1/0.0 and multiplication by 1.#INF000000000000 execute faster because they are special cases.
double b = 1 - v.length_squared()/(c*c);
#ifdef CLAMP_BY_0
if (b < 0.0) b = 0.0;
#endif
#ifdef CLAMP_BY_DBL_MIN
if (b <= 0.0) b = DBL_MIN;
#endif
double lorentz_factor = 1/sqrt(b);
double division in MSVC:
1/0.0 = 1.#INF000000000000
1/DBL_MIN = 4.4942328371557898e+307
When dealing with floating-point math, "infinity" and "effectively infinity" are quite different. Once a number stops being finite, it tends to stay that way. So while the value of lorentz_factor is "effectively" the same for both methods, later computations can be radically different depending on how you use that value. sqrt(lorentz_factor), for instance, remains infinite if you clamp to 0, but will actually be calculated if you clamp to some very small number.
So the answer will largely depend on what you plan on doing with that value once you've clamped it.
Why not just assign INF to lorentz_factor directly, avoiding both the sqrt call and the division?
double lorentz_factor;
if (b <= 0.0)
    lorentz_factor = std::numeric_limits<double>::infinity();
else
    lorentz_factor = 1/sqrt(b);
You'll need to #include <limits> for this.
You can also use ::max() instead of ::infinity(), if that's what you need.
I have code that numerically integrates a function over a given interval using the trapezoidal rule (see the formula for the trapezoid method). For the function sin(x) on the interval [-pi/2.0, pi/2.0], the integral is expected to be zero.
In this case, I take the number of partitions 'n' equal to 4. The problem is that when I give pi with 20 decimal places the result is zero, with 14 decimal places it is 8.72e-17, with 11 decimal places it is zero, with 8 decimal places it is 8.72e-17, and with 3 decimal places it is zero. In other words, the integral is zero or a number near zero for different approximations of pi, but there is no clear trend.
I would appreciate your help in understanding why this happens. (I ran it in Dev-C++.)
#include <iostream>
#include <math.h>
using namespace std;
#define pi 3.14159265358979323846
//Pi: 3.14159265358979323846
double func(double x){
    return sin(x);
}
int main() {
    double x0 = -pi/2.0, xf = pi/2.0;
    int n = 4;
    double delta_x = (xf-x0)/(n*1.0);
    double sum = (func(x0)+func(xf))/2.0;
    double integral;
    for (int k = 1; k<n; k++){
        // cout<<"func: "<<func(x0+(k*delta_x))<<" "<<"last sum: "<<sum<<endl;
        sum = sum + func(x0+(k*delta_x));
        // cout<<"func + last sum= "<<sum<<endl;
    }
    integral = delta_x*sum;
    cout<<"The value for the integral is: "<<integral<<endl;
    return 0;
}
OP is integrating y=sin(x) from -a to +a. The various tests use different values of a, all near pi/2.
The approach uses a linear summation of values near -1.0, down to 0 and then up to near 1.0.
This summation is sensitive to calculation error with the last terms as the final math sum is expected to be 0.0. Since the start/end a varies, the error varies.
A more stable result is obtained by adding the outermost terms first. With x(k) = x0 + k*delta_x, do sum += sin(x(1)), then sum += sin(x(3)), then sum += sin(x(2)), rather than adding in the order k = 1, 2, 3. In particular, the computed x(3) is likely to be slightly off from the exact negative of the earlier x(1), which compounds the issue.
Welcome to the world of numerical analysis.
The problem also exists if the code uses all float or all long double, just to different degrees.
The problem is not due to using an inexact value of pi. (An exact value is impossible with FP, as pi is irrational and all finite FP values are rational.)
Much of it is due to how x is formed. You could try the formula below to form x symmetrically about 0.0, then compare the x values generated this way exactly against those generated the original way.
x = (x0+xf)/2 + ((k - n/2.0)*delta_x)
Print out the exact values computed for deeper understanding.
printf("x:%a y:%a\n", x0+(k*delta_x), func(x0+(k*delta_x)));
I've written a simple program to calculate the first and second derivative of a function, using function pointers. My program computes the correct answers (more or less), but for some functions, the accuracy is less than I would like.
This is the function I am differentiating:
float f1(float x) {
    return (x * x);
}
These are the derivative functions, using the central finite difference method:
// Function for calculating the first derivative.
float first_dx(float (*fx)(float), float x) {
    float h = 0.001;
    float dfdx;
    dfdx = (fx(x + h) - fx(x - h)) / (2 * h);
    return dfdx;
}
// Function for calculating the second derivative.
float second_dx(float (*fx)(float), float x) {
    float h = 0.001;
    float d2fdx2;
    d2fdx2 = (fx(x - h) - 2 * fx(x) + fx(x + h)) / (h * h);
    return d2fdx2;
}
Main function:
int main() {
    pc.baud(9600);
    float x = 2.0;
    pc.printf("**** Function Pointers ****\r\n");
    pc.printf("Value of f(%f): %f\r\n", x, f1(x));
    pc.printf("First derivative: %f\r\n", first_dx(f1, x));
    pc.printf("Second derivative: %f\r\n\r\n", second_dx(f1, x));
}
This is the output from the program:
**** Function Pointers ****
Value of f(2.000000): 4.000000
First derivative: 3.999948
Second derivative: 1.430511
I'm happy with the accuracy of the first derivative, but I believe the second derivative is too far off (it should be equal to ~2.0).
I have a basic understanding of how floating point numbers are represented and why they are sometimes inaccurate, but how can I make this second derivative result more accurate? Could I be using something better than the central finite difference method, or is there a way I can get better results with the current method?
The accuracy can be increased by choosing a type which has more precision. float is currently defined as an IEEE-754 32-bit number, giving you a precision of ~7.225 decimal places.
What you want is the 64-bit counterpart: double, with ~15.955 decimal places of precision.
That should be sufficient for your calculation. Also worth mentioning is Boost's implementation, which offers a quadruple-precision floating-point number (128-bit).
Finally, the GNU Multiple Precision Arithmetic Library offers types with an arbitrary number of decimal places of precision.
Go analytical. ;-) Probably not an option given "with the current method".
Use double instead of float.
Vary the epsilon (h), and combine the results in some way. For example you could try 0.00001, 0.000001, 0.0000001 and average them. In fact, you'd want the result with the smallest h that doesn't overflow/underflow. But it's not clear how to detect overflow and underflow.
As I said, I want to implement my own double-precision cos() function in a compute shader with GLSL, because there is only a built-in version for float.
This is my code:
double faculty[41]; // factorials; values are calculated at the beginning of main()
double myCOS(double x)
{
    double sum,tempExp,sign;
    sum = 1.0;
    tempExp = 1.0;
    sign = -1.0;
    for(int i = 1; i <= 30; i++)
    {
        tempExp *= x;
        if(i % 2 == 0){
            sum = sum + (sign * (tempExp / faculty[i]));
            sign *= -1.0;
        }
    }
    return sum;
}
The result of this code is that sum turns out to be NaN on the shader, but on the CPU the algorithm works well.
I tried to debug this code too and I got the following information:
faculty[i] is positive and not zero for all entries
tempExp is positive in each step
none of the other variables are NaN during each step
the first time sum is NaN is at the step with i=4
And now my question: what exactly can go wrong if each variable is a number and nothing is divided by zero, especially when the algorithm works on the CPU?
Let me guess:
First you determined the problem is in the loop, and you use only the following operations: +, *, /.
The rules for generating NaN from these operations are:
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
You ruled out the possibility for 0/0 and ±∞/±∞ by stating that faculty[] is correctly initialized.
The variable sign is always 1.0 or -1.0 so it cannot generate the NaN through the * operation.
What remains is the + operation, if tempExp ever becomes ±∞.
So the input x is probably too large on entry to your function and tempExp becomes ±∞, which makes sum ±∞ too. On the next even iteration you will trigger the NaN-generating operation ∞ + (−∞). This is because you multiply one side of the addition by sign, and sign switches between positive and negative each time a term is added.
You're trying to approximate cos(x) around 0.0, so you should use the properties of the cos() function to reduce your input to a value near 0.0, ideally in the range [0, pi/4]. For instance, remove multiples of 2*pi, and get the values of cos() in [pi/4, pi/2] by computing sin(x) around 0.0, and so on.
What can go dramatically wrong is a loss of precision. cos(x) usually is implemented by range reduction followed by a dedicated implementation for the range [0, pi/2]. Range reduction uses cos(x+2*pi) = cos(x). But this range reduction isn't perfect. For starters, pi cannot be exactly represented in finite math.
Now what happens if you try something as absurd as cos(1<<30) ? It's quite possible that the range reduction algorithm introduces an error in x that's larger than 2*pi, in which case the outcome is meaningless. Returning NaN in such cases is reasonable.
I'm using:
D = |ax + by + cz + d| / |n| where n is the normal to plane; a, b, c, d are the coefficients of the equation of the plane; x, y, z are the coordinates of the point from the plane.
I use this to calculate the distance from a 3D point to a 3D plane. The issue I'm having is that the distances in question are extremely small, which causes the result (a double) to be represented in scientific notation, which is not handled correctly in if statements. For example:
if( dist == 0 )
{
    //Execute this
}
If dist is any number in scientific notation, the code inside the if statement is executed, even though dist is not 0. My question is: is there any way the number can be converted back into fixed notation to make it usable in if statements like these?
I'm using Visual Studio 2010, C++.
Normally you would use some tolerance value to compare floating-point numbers:
#define EPSILON (1e-6)
// dist == 0.0?
if (dist < EPSILON) {
    // ...
}
Or to compare to any other floating point v:
// dist == v?
if (fabs(dist - v) < EPSILON) {
    // ...
}
Sure, you have to choose EPSILON according to your problem.
dist is not represented in scientific notation (unless you are storing it as a string); that's just how it is printed. As another minor point, it's usually a good idea to compare to a value of the same type: 0 is an integer, 0.0 is a double.
From what I can see from some quick tests, in order for you to be seeing dist == 0 as true, it would actually have to be zero. That means you have all the numbers down to DBL_MIN, which is 2.2250738585072014e-308 for a 64 bit IEEE754 fpu. More likely your maths is wrong, and it is actually zero. Check your numerator before you do the division.
What on earth is physically that size? Well if you are specifying the diameter of an electron in units of "the diameter of the universe", then that's only 3.2×10^-42. I'm not sure there is an easy way to visualize just how small doubles can be. I tried 1 / number of atoms in the universe and it still wasn't small enough.
I stumbled across code like
double x,y = ...;
double n = sqrt(x*x+y*y);
if (n > 0)
{
    double d1 = (x*x)/n;
    double d2 = (x*y)/n;
}
and I am wondering about the numerical stability of such an expression for small values of x and y.
For both expressions, lim (x->0, y->0) (...) = 0, so from a mathematical point of view it looks safe (the numerator is O(x²) whereas the denominator is O(x)).
Nevertheless my question is: Are there any possible numerical problems with this code?
EDIT: If possible I'd like to avoid re-writing the expressions because n is actually used more than twice and to keep readability (it's relatively clear in the context what happens).
If x and y are very close to DBL_MIN, the calculations are
susceptible to underflow or extreme loss of precision: if x is
very close to DBL_MIN, for example, x * x may be 0.0, or
(for somewhat larger values) it may result in what is called
gradual underflow, with extreme loss of precision. E.g. with
IEEE double (most, if not all desktop and laptop PCs), 1E-300
* 1E-300 will be 0.0. Obviously, if this happens for both
x and y, you'll end up with n == 0.0, even if x and
y are both positive.
In C++11, there is a function hypot, which will solve the
problem for n; if x * x is 0.0, however, d1 will still
be 0.0; you'll probably get better results with (x / n) * x
(but I think that there still may be limit cases where you'll
end up with 0.0 or gradual underflow—I've not analyzed it sufficiently to be sure). A better solution
would be to scale the data differently, to avoid such limit
cases.