Numerical stability of division expression - c++

I stumbled across code like
double x,y = ...;
double n = sqrt(x*x+y*y);
if (n > 0) {
    double d1 = (x*x)/n;
    double d2 = (x*y)/n;
}
and I am wondering about the numerical stability of such an expression for small values of x and y.
For both expressions, lim (x->0, y->0) (...) = 0, so from a mathematical point of view it looks safe (the numerator is O(x²) whereas the denominator is O(x)).
Nevertheless my question is: Are there any possible numerical problems with this code?
EDIT: If possible I'd like to avoid re-writing the expressions because n is actually used more than twice and to keep readability (it's relatively clear in the context what happens).

If x and y are very close to DBL_MIN, the calculations are
susceptible to underflow or extreme loss of precision: if x is
very close to DBL_MIN, for example, x * x may be 0.0, or
(for somewhat larger values) it may result in what is called
gradual underflow, with extreme loss of precision. E.g. with
IEEE double (most, if not all, desktop and laptop PCs), 1E-300
* 1E-300 will be 0.0. Obviously, if this happens for both
x and y, you'll end up with n == 0.0, even if x and
y are both positive.
In C++11, there is a function hypot, which will solve the
problem for n; if x * x is 0.0, however, d1 will still
be 0.0. You'll probably get better results with (x / n) * x
(but I think that there still may be limit cases where you'll
end up with 0.0 or gradual underflow; I've not analyzed it
sufficiently to be sure). A better solution
would be to scale the data differently, to avoid such limit
cases.
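A minimal sketch of the two suggestions, assuming IEEE 754 doubles evaluated in double precision (no extended-precision intermediates). The names Ratios, ratiosNaive and ratiosHypot are hypothetical; the sketch only contrasts sqrt(x*x + y*y) with std::hypot and (x*x)/n with (x/n)*x, and is not a full analysis of the limit cases.

```cpp
#include <cmath>

// Hypothetical container for the length and the two quotients from the question.
struct Ratios {
    double n;   // length of (x, y)
    double d1;  // x*x / n
    double d2;  // x*y / n
};

// Original formulation: x*x and y*y can underflow to 0.0 for tiny inputs.
Ratios ratiosNaive(double x, double y) {
    Ratios r{std::sqrt(x * x + y * y), 0.0, 0.0};
    if (r.n > 0) {
        r.d1 = (x * x) / r.n;
        r.d2 = (x * y) / r.n;
    }
    return r;
}

// Alternative: std::hypot avoids the intermediate underflow in n, and
// dividing before multiplying keeps the intermediate x/n within [-1, 1].
Ratios ratiosHypot(double x, double y) {
    Ratios r{std::hypot(x, y), 0.0, 0.0};
    if (r.n > 0) {
        r.d1 = (x / r.n) * x;
        r.d2 = (x / r.n) * y;
    }
    return r;
}
```

With x = y = 1e-300, ratiosNaive ends up with n == 0.0, whereas ratiosHypot gives an n of roughly 1.4e-300 and nonzero d1 and d2.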


The result of my own double-precision cos() implementation in a shader is NaN, but it works well on the CPU. What is going wrong?

As I said, I want to implement my own double-precision cos() function in a compute shader with GLSL, because the built-in version exists only for float.
This is my code:
double faculty[41]; // values are calculated at the beginning of main()
double myCOS(double x)
{
    double sum, tempExp, sign;
    sum = 1.0;
    tempExp = 1.0;
    sign = -1.0;
    for (int i = 1; i <= 30; i++)
    {
        tempExp *= x;
        if (i % 2 == 0) {
            sum = sum + (sign * (tempExp / faculty[i]));
            sign *= -1.0;
        }
    }
    return sum;
}
The result of this code is that sum turns out to be NaN on the shader, but on the CPU the algorithm works well.
I tried to debug this code too and I got the following information:
faculty[i] is positive and not zero for all entries
tempExp is positive in each step
none of the other variables are NaN during each step
the first time sum is NaN is at the step with i=4
And now my question: what exactly can go wrong if each variable is a number and nothing is divided by zero, especially when the algorithm works on the CPU?
Let me guess:
First you determined the problem is in the loop, and you use only the following operations: +, *, /.
The rules for generating NaN from these operations are:
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
You ruled out the possibility for 0/0 and ±∞/±∞ by stating that faculty[] is correctly initialized.
The variable sign is always 1.0 or -1.0 so it cannot generate the NaN through the * operation.
What remains is the + operation if tempExp ever becomes ±∞.
So probably x is too large on entry to your function and tempExp becomes ±∞, which makes sum ±∞ too. At the next iteration that updates sum you will trigger the NaN-generating operation ∞ + (−∞). This is because you multiply one side of the addition by sign, and sign switches between positive and negative each time it is used.
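To illustrate this reasoning, here is a CPU-side C++ sketch (not GLSL) of the same loop, assuming IEEE 754 doubles. The name naiveCos and the locally accumulated factorial are illustrative stand-ins for the question's myCOS and faculty[].

```cpp
#include <cmath>

// CPU-side re-creation of the question's loop. The factorial is accumulated
// locally instead of being read from the faculty[] table.
double naiveCos(double x) {
    double sum = 1.0;
    double tempExp = 1.0;    // x^i
    double factorial = 1.0;  // i!
    double sign = -1.0;
    for (int i = 1; i <= 30; i++) {
        tempExp *= x;
        factorial *= i;
        if (i % 2 == 0) {
            sum = sum + (sign * (tempExp / factorial));
            sign *= -1.0;
        }
    }
    return sum;
}
```

With a small argument such as 0.5 this agrees closely with std::cos. With a very large one such as 1e20, tempExp overflows to infinity partway through the loop, sum becomes infinite, and the next term with the opposite sign produces ∞ + (−∞), so the result is NaN.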
You're trying to approximate cos(x) around 0.0. So you should use the properties of the cos() function to reduce your input value to a value near 0.0. Ideally in the range [0, pi/4]. For instance, remove multiples of 2*pi, and get the values of cos() in [pi/4, pi/2] by computing sin(x) around 0.0 and so on.
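A rough sketch of the range reduction idea, written as CPU-side C++ and assuming IEEE 754 doubles. It only removes multiples of 2*pi with std::remainder, leaving an argument in [-pi, pi], rather than reducing all the way to [0, pi/4]. The name reducedCos is hypothetical, and a GLSL port would need its own double-precision equivalent of std::remainder.

```cpp
#include <cmath>

// Taylor series for cos after reducing the argument to [-pi, pi].
// Each term is derived from the previous one, so no factorial table is needed
// and no intermediate power of x can overflow.
double reducedCos(double x) {
    const double twoPi = 6.283185307179586476925286766559;
    const double r = std::remainder(x, twoPi);  // r lies in [-pi, pi]
    double term = 1.0;  // current term: (-1)^k * r^(2k) / (2k)!
    double sum = 1.0;
    for (int k = 1; k <= 15; k++) {
        term *= -(r * r) / ((2.0 * k - 1.0) * (2.0 * k));
        sum += term;
    }
    return sum;
}
```

For moderate arguments such as 1000.0 this stays close to std::cos. For huge arguments such as 1e20 the result remains finite but carries little accuracy, because twoPi itself is only a rounded value.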
What can go dramatically wrong is a loss of precision. cos(x) usually is implemented by range reduction followed by a dedicated implementation for the range [0, pi/2]. Range reduction uses cos(x+2*pi) = cos(x). But this range reduction isn't perfect. For starters, pi cannot be exactly represented in finite math.
Now what happens if you try something as absurd as cos(1<<30) ? It's quite possible that the range reduction algorithm introduces an error in x that's larger than 2*pi, in which case the outcome is meaningless. Returning NaN in such cases is reasonable.

Best method for scaling a vector to desired length

The question is about the most robust and the fastest implementation of this rather basic operation:
Given a vector (X,Y) compute the collinear vector of the given length desiredLength. There are at least two methods for that:
One. Find the length of (X,Y) and rescale accordingly:
double currentLength = sqrt(X*X + Y*Y);
if(currentLength == 0) { /* Aye, Caramba! */ }
double factor = desiredLength / currentLength;
X *= factor;
Y *= factor;
Two. Find the direction of (X,Y) and form a vector of desiredLength in that direction:
if(X == 0 && Y == 0) { /* Aye, Caramba! */ }
double angle = atan2(Y, X);
X = desiredLength * cos(angle);
Y = desiredLength * sin(angle);
Which method would be preferable for developing a robust app, with better numerical stability, faster execution, etc.?
There's no one right answer, since it will depend on the implementation.
However: on any reasonable modern implementation, the four basic
operations and sqrt will be correctly rounded to the last binary digit. From a
quality of implementation point of view, one would hope that the same
thing would be true for all of the functions in math.h, but it's less
certain. On a machine with IEEE arithmetic (Windows and the mainstream
Unix platforms), the four operations and sqrt will be implemented in
hardware, whereas the trigonometric operations will generally require a
software implementation, often requiring tens of more basic operations.
Although some floating point processors do support them directly, at
least over limited ranges, even then, they are often an order of magnitude slower
than the four basic operations.
All of which speaks in favor of your first implementation, at least with
regards to speed (and probably with regards to numeric stability as
well).
I would expect method one to be better, at least on the performance front, as doing sqrt + 2 multiplications should be cheaper than 3 trig operations.
I would guess it is also better (or not worse) on the other fronts, since it involves one approximation (sqrt) instead of 2 (per axis). The sqrt approximation is also "shared" by both x and y.
Thou shalt better use hypot(x,y) rather than sqrt(x*x+y*y), because a reasonable implementation of hypot can save you from underflow/overflow conditions.
Examples: hypot(1.0e300,1.0e300) or hypot(1.0e-300,1.0e-300)
Then evaluating x/hypot(x,y) is safe, even in case of gradual underflow (denormalized numbers) like x=1.0e-320, y=0, while evaluating desiredLength/hypot(x,y) might well overflow.
So I would write
double h = hypot(x,y);
double xd = desiredLength*(x/h);
double yd = desiredLength*(y/h);
You'll get some division by zero exception and NaN results if both x and y are zero, so don't bother handling it in an if.
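As a sketch, the hypot-based variant could be wrapped in a function like the one below. The Vec2 struct and the name scaleToLength are hypothetical, and the explicit zero-length check is just one possible policy (here the vector is returned unchanged), not something the answers prescribe.

```cpp
#include <cmath>

// Hypothetical 2D vector type for illustration.
struct Vec2 {
    double x;
    double y;
};

// Returns the vector collinear with v that has length desiredLength.
// Normalizing first (v.x / h) keeps the intermediate within [-1, 1], so it
// cannot overflow even when h is subnormal, unlike desiredLength / h.
Vec2 scaleToLength(Vec2 v, double desiredLength) {
    const double h = std::hypot(v.x, v.y);
    if (h == 0.0) {
        return v;  // assumed policy: a zero vector has no direction to scale
    }
    return Vec2{desiredLength * (v.x / h), desiredLength * (v.y / h)};
}
```

For example, scaleToLength(Vec2{3.0, 4.0}, 10.0) yields approximately (6, 8), and the same call works for components as extreme as 1.0e300 or 1.0e-320.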

What is the optimum epsilon/dx value to use within the finite difference method?

double MyClass::dx = ?????;
double MyClass::f(double x)
{
    return 3.0*x*x*x - 2.0*x*x + x - 5.0;
}
double MyClass::fp(double x) // derivative of f(x), that is f'(x)
{
    return (f(x + dx) - f(x)) / dx;
}
When using the finite difference method for differentiation, it is critical to choose an optimal dx value. Mathematically, dx must be as small as possible. However, I'm not sure whether it is correct to choose the smallest positive normalized double-precision number (i.e., 2.2250738585072014 × 10^−308).
Is there an optimal numeric interval or exact value to choose a dx in to make the calculation error as small as possible?
(I'm using a 64-bit compiler. I will run my program on an Intel i5 processor.)
Choosing the smallest possible value is almost certainly wrong: if dx were that smallest number, then f(x + dx) would be exactly equal to f(x) due to rounding.
So you have a tradeoff: Choose dx too small, and you lose precision to rounding errors. Choose it too large, and your result will be imprecise due to changes in the derivative as x changes.
To judge the numeric errors, consider (f(x + dx) - f(x))/f(x) [1] mathematically. The numerator denotes the difference you want to compute, but the denominator denotes the magnitude of numbers you're dealing with. If that fraction is about 2^-k, then you can expect to lose approximately k bits of precision in your result.
If you know your function, you can compute what error you'd get from choosing dx too large. You can then balance things, so that the error incurred from this is about the same as the error incurred from rounding. But if you know the function, you might be better off providing a function that directly computes the derivative, as in your example with the polynomial f.
The Wikipedia section that pogorskiy pointed out suggests a value of sqrt(ε) * x, or approximately 1.5e-8 * x. Without any more detailed knowledge about the function, such a rule of thumb will provide a reasonable default. Also note that the same section suggests dividing not by dx but by (x + dx) - x, as this takes rounding errors incurred by computing x + dx into account. But I guess that whole article is full of suggestions you might use.
[1] This formula really should divide by f(x), not by dx, even though a past editor thought differently. I'm attempting to compare the number of significant bits remaining after the division, not the slope of the tangent.
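The rule of thumb can be sketched as follows. The names defaultStep and forwardDifference are hypothetical, the fallback to a scale of 1.0 when x is zero is an added assumption, and volatile is used only as a common trick to keep the compiler from simplifying (x + h) - x back to h.

```cpp
#include <cmath>
#include <limits>

// Example function from the question.
double f(double x) {
    return 3.0 * x * x * x - 2.0 * x * x + x - 5.0;
}

// Rule-of-thumb step sqrt(epsilon) * |x|. The 1.0 fallback for x == 0 is an
// assumption, added so that the step never becomes zero.
double defaultStep(double x) {
    const double scale = (x != 0.0) ? std::fabs(x) : 1.0;
    return std::sqrt(std::numeric_limits<double>::epsilon()) * scale;
}

// Forward difference that divides by the step actually taken, (x + h) - x,
// rather than by the nominal h.
double forwardDifference(double (*func)(double), double x) {
    const double h = defaultStep(x);
    volatile double xh = x + h;  // volatile: keep xh - x from being folded into h
    const double actualStep = xh - x;
    return (func(xh) - func(x)) / actualStep;
}
```

For the example polynomial, forwardDifference(f, 2.0) should land close to the exact derivative 29, with an error far larger than machine epsilon but small in absolute terms.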
Why not just use the power rule to compute the derivative? You'll get an exact answer:
f(x) = 3x^3 - 2x^2 + x - 5
f'(x) = 9x^2 - 4x + 1
f(x) = 3.0 * x * x * x - 2.0 * x * x + x - 5.0
fp(x) = 9.0 * x * x - 4.0 * x + 1.0

Why does division yield a vastly different result than multiplication by a fraction in floating point?

I understand why floating point numbers can't be compared, and know about the mantissa and exponent binary representation, but I'm no expert and today I came across something I don't get:
Namely, let's say you have something like:
float denominator, numerator, resultone, resulttwo;
resultone = numerator / denominator;
float buff = 1 / denominator;
resulttwo = numerator * buff;
To my knowledge different flops can yield different results and this is not unusual. But in some edge cases these two results seem to be vastly different. To be more specific, in my GLSL code calculating the Beckmann facet slope distribution for the Cook-Torrance lighting model:
float a = 1 / (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
float b = clampedCosHalfNormal * clampedCosHalfNormal - 1.0;
float c = facetSlopeRMS * facetSlopeRMS * clampedCosHalfNormal * clampedCosHalfNormal;
facetSlopeDistribution = a * exp(b/c);
yields very very different results to
float a = (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
facetSlopeDistribution = exp(b/c) / a;
Why does it? The second form of the expression is problematic.
If I, say, try to add the second form of the expression to a color, I get black, even though the expression should always evaluate to a positive number. Am I getting an infinity? A NaN? If so, why?
I didn't go through your mathematics in detail, but you must be aware that small errors get pumped up easily by all these powers and exponentials. You should try and substitute all variables var with var + e(var) (on paper, yes) and derive an expression for the total error - without simplifying in between steps, because that's where the error comes from!
This is also a very common problem in computational fluid dynamics, where you can observe things like 'numeric diffusion' if your grid isn't properly aligned with the simulated flow.
So get a clear grip on where the biggest errors come from, and rewrite equations where possible to minimize the numeric error.
edit: to clarify, an example
Say you have some variable x and an expression y=exp(x). The error in x is denoted e(x) and is small compared to x (say e(x)/x < 0.0001, but note that this depends on the type you are using). Then you could say that
e(y) = y(x+e(x)) - y(x)
e(y) ~ dy/dx * e(x) (for small e(x))
e(y) = exp(x) * e(x)
So there's a magnification of the absolute error by a factor of exp(x), meaning that around x=0 there's really no issue (not a surprise, since at that point the slope of exp(x) equals that of x), but for big x you will notice this.
The relative error would then be
e(y)/y = e(y)/exp(x) = e(x)
whilst the relative error in x was
e(x)/x
so you added a factor of x to the relative error.
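A small numeric check of this estimate, as a sketch: perturb x by a known relative error and compare it with the resulting relative error in exp(x). The name relativeErrorGain is hypothetical.

```cpp
#include <cmath>

// Ratio (relative error in exp(x)) / (relative error in x) when x is
// perturbed to x * (1 + relErrX). The first-order estimate predicts about x.
double relativeErrorGain(double x, double relErrX) {
    const double xPerturbed = x * (1.0 + relErrX);
    const double y = std::exp(x);
    const double yPerturbed = std::exp(xPerturbed);
    const double relErrY = (yPerturbed - y) / y;
    return relErrY / relErrX;
}
```

For x = 20 and a relative perturbation of 1e-6, the returned gain is close to 20, in line with the factor of x derived above.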

C++ double division by 0.0 versus DBL_MIN

When finding the inverse square root of a double, is it better to clamp invalid non-positive inputs at 0.0 or DBL_MIN? (In my example below double b may end up being negative due to floating point rounding errors and because the laws of physics are slightly fudged in the game.)
Both division by 0.0 and DBL_MIN produce the same outcome in the game because 1/0.0 and 1/DBL_MIN are effectively infinity. My intuition says DBL_MIN is the better choice, but would there be any case for using 0.0? Like perhaps sqrt(0.0), 1/0.0 and multiplication by 1.#INF000000000000 execute faster because they are special cases.
double b = 1 - v.length_squared()/(c*c);
#ifdef CLAMP_BY_0
if (b < 0.0) b = 0.0;
#else
if (b <= 0.0) b = DBL_MIN;
#endif
double lorentz_factor = 1/sqrt(b);
double division in MSVC:
1/0.0 = 1.#INF000000000000
1/DBL_MIN = 4.4942328371557898e+307
When dealing with floating point math, "infinity" and "effectively infinity" are quite different. Once a number stops being finite, it tends to stay that way. So while the value of lorentz_factor is "effectively" the same for both methods, depending on how you use that value, later computations can be radically different. sqrt(lorentz_factor) for instance remains infinite if you clamp to 0, but will actually be calculated if you clamp to some very very small number.
So the answer will largely depend on what you plan on doing with that value once you've clamped it.
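A sketch of how the two clamps can diverge downstream, assuming IEEE 754 arithmetic (where 1/0.0 yields infinity). The helper names lorentzClampZero and lorentzClampMin are hypothetical, and they take b directly rather than computing it from v and c.

```cpp
#include <cfloat>
#include <cmath>

// Clamp negative b to 0.0: 1/sqrt(0.0) is +infinity under IEEE 754.
double lorentzClampZero(double b) {
    if (b < 0.0) b = 0.0;
    return 1.0 / std::sqrt(b);
}

// Clamp non-positive b to DBL_MIN: the result is huge but finite.
double lorentzClampMin(double b) {
    if (b <= 0.0) b = DBL_MIN;
    return 1.0 / std::sqrt(b);
}
```

For b = -1e-12, lorentzClampZero returns infinity, and multiplying that by 0.0 later gives NaN. lorentzClampMin returns roughly 6.7e153, and multiplying that by 0.0 gives 0.0.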
Why not just assign INF to lorentz_factor directly, avoiding both the sqrt call and the division?
double lorentz_factor;
if (b <= 0.0)
    lorentz_factor = std::numeric_limits<double>::infinity();
else
    lorentz_factor = 1/sqrt(b);
You'll need to #include <limits> for this.
You can also use ::max() instead of ::infinity(), if that's what you need.