In my code I have this multiplications in a C++ code with all variable types as double[]
f1[0] = (f1_rot[0] * xu[0]) + (f1_rot[1] * yu[0]);
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1]);
f1[2] = (f1_rot[0] * xu[2]) + (f1_rot[1] * yu[2]);
f2[0] = (f2_rot[0] * xu[0]) + (f2_rot[1] * yu[0]);
f2[1] = (f2_rot[0] * xu[1]) + (f2_rot[1] * yu[1]);
f2[2] = (f2_rot[0] * xu[2]) + (f2_rot[1] * yu[2]);
corresponding to these values
Force Rot1 : -5.39155e-07, -3.66312e-07
Force Rot2 : 4.04383e-07, -1.51852e-08
xu: 0.786857, 0.561981, 0.255018
yu: 0.534605, -0.82715, 0.173264
F1: -6.2007e-07, -4.61782e-16, -2.00963e-07
F2: 3.10073e-07, 2.39816e-07, 1.00494e-07
this multiplication in particular produces a wrong value -4.61782e-16 instead of 1.04745e-13
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1]);
I hand verified the other multiplications on a calculator and they all seem to produce the correct values.
this is an open mpi compiled code and the above result is for running a single processor, there are different values when running multiple processors for example 40 processors produces 1.66967e-13 as result of F1[1] multiplication.
Is this some kind of mpi bug ? or a type precision problem ? and why does it work okay for the other multiplications ?
Your problem is an obvious result of what is called catastrophic summations:
As we know, a double precision float can handle numbers of around 16 significant decimals.
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1])
= -3.0299486605499998e-07 + 3.0299497080000003e-07
= 1.0474500005332475e-13
This is what we obtain with the numbers you have given in your example.
Notice that (-7) - (-13) = 6, which corresponds to the number of decimals in the float you give in your example: (ex: -5.39155e-07 -3.66312e-07, each mantissa is of a precision of 6 decimals). It means that you used here single precision floats.
I am sure that in your calculations, the precision of your numbers is bigger, that's why you find a more precise result.
Anyway, if you use single precision floats, you can't expect a better precision. With a double precision, you can find a precision up to 16. You shouldn't trust a difference between two numbers, unless it is bigger than the mantissa:
Simple precision floats: (a - b) / b >= ~1e-7
Double precision floats: (a - b) / b >= ~4e-16
For further information, see these examples ... or the table in this article ...
Related
I am trying to implement Fast Inverse Square Root for a fixed point number, but I'm not getting anywhere.
I am trying to follow exactly the same principle as the article, except instead of writing the number in the floating point format x = (-1) ^ s * (1 + M) * 2 ^ (E-127), I am using the format x = M * 2 ^ -16, which is a 32-bit fixed point number with 16 decimal bits and 16 fractional bits.
The problem is that I cannot find the value of the "magic constant". According to my calculations, it doesn’t exist, but I’m not a mathematician and I think I’m doing everything wrong.
To solve Y = 1 / sqrt (x), I used the following reasoning (I don't know if it is correct).
In the original code we have that Y0 for approximation of newton is given by:
i = 0x5f3759df - (i >> 1);
Which means that we will have as a result a floating point number given by:
y0 = (1 + R2 - M / 2) * 2 ^ (R1 - E / 2);
This is because the operation >> divides exponent and mantissa by 2, and then we perform a subtraction of the numbers as integers.
Following the steps shown in the article, I set the format of x to:
x = M * 2 ^ -16
In an attempt to perform the same logic, I try to define Y0 for:
Y0 = (R2 - M / 2) * 2 ^ (R1 - (-16/2));
I'm trying to find a number, which can minimize the error given by:
error = (Y - Y0) / Y
Regardless of the value of R1, I can do shift operations to correct the exponent value of my final result, having the correct result at a fixed point.
Where am I wrong?
It can't be done.
The fast inverse sqrt is due to the floating point representation, that has already split the number into powers of two (exponent) and the significant.
It can be done.
With the same tricks as done for floating points, it's possible to convert your fixed point into 2^exp * x. Given uint32_t a, uint8_t exp = bias- builtin_count_leading_zeros(a); uint32_t b = a << exp, with the constants (and domain of a) so carefully chosen, that there will be no underflows or overflows.
Thus, you will actually have a custom floating point representation, which is tailored for this specific purpose, omitting the sign bit at least and having the best possible number of bits for the exponent, which might as well be 8.
For an application I'm working on, I need to take two integers and add them together using a particular mathematical formula. This ends up looking like this:
int16_t add_special(int16_t a, int16_t b) {
float limit = std::numeric_limits<int16_t>::max();//32767 as a floating point value
float a_fl = a, b_fl = b;
float numerator = a_fl + b_fl;
float denominator = 1 + a_fl * b_fl / std::pow(limit, 2);
float final_value = numerator / denominator;
return static_cast<int16_t>(std::round(final_value));
}
Any readers with a passing familiarity with physics will recognize that this formula is the same as what is used to calculate the sum of near-speed-of-light velocities, and the calculation here intentionally mirrors that computation.
The code as-written gives the results I need: for low numbers, they nearly add together normally, but for high numbers, they converge to the maximum value of 32767, i.e.
add_special(10, 15) == 25
add_special(100, 200) == 300
add_special(1000, 3000) == 3989
add_special(10000, 25000) == 28390
add_special(30000, 30000) == 32640
Which all appears to be correct.
The problem, however, is that the function as-written involves first transforming the numbers into floating point values before transforming them back into integers. This seems like a needless detour for numbers that I know, as a principle of its domain, will never not be integers.
Is there a faster, more optimized way to perform this computation? Or is this the most optimized version of this function I can create?
I'm building for x86-64, using MSVC 14.X, although methods that also work for GCC would be beneficial. Also, I'm not interested in SSE/SIMD optimizations at this stage; I'm mostly just looking at the elementary operations being performed on the data.
You might avoid floating number and does all computation in integral type:
constexpr int16_t add_special(int16_t a, int16_t b) {
std::int64_t limit = std::numeric_limits<int16_t>::max();
std::int64_t a_fl = a;
std::int64_t b_fl = b;
return static_cast<int16_t>(((limit * limit) * (a_fl + b_fl)
+ ((limit * limit + a_fl * b_fl) / 2)) /* Handle round */
/ (limit * limit + a_fl * b_fl));
}
Demo
but according to Benchmark, it is not faster for those values.
As noted by Johannes Overmann, a big performance boost is gained by avoiding std::round, at the cost of some (little) discrepancies in the results, though.
I tried some other little changes HERE, where it seems that the following is a faster approach (at least for that architecture)
constexpr int32_t i_max = std::numeric_limits<int16_t>::max();
constexpr int64_t i_max_2 = static_cast<int64_t>(i_max) * i_max;
int16_t my_add_special(int16_t a, int16_t b)
{
// integer multipication instead of floating point division
double numerator = (a + b) * i_max_2;
double denominator = i_max_2 + a * b;
// Approximated rounding instead of std::round
return 0.5 + numerator / denominator;
}
Suggestions:
Use 32767.0*32767.0 (which is a constant) instead of std::pow(limit, 2).
Use integer values as much as possible, potentially with fixed points. Just the two divisions are a problem. Use floats just form them, if necessary (depends on the input data ranges).
Make it inline if the function is small and if it is appropriate.
Something like:
int16_t add_special(int16_t a, int16_t b) {
float numerator = int32_t(a) + int32_t(b); // Cannot overflow.
float denominator = 1 + (int32_t(a) * int32_t(b)) / (32767.0 * 32767.0); // Cannot overflow either.
return (numerator / denominator) + 0.5; // Relying on implementation defined rounding. Not good but potentially faster than std::round().
}
The only risk with the above is the omission of the explicit rounding, so you will get some implicit rounding.
I'm looking for a robust algorithm (or a paper describing an algorithm) that can find roots of polynomials (ideally up to the 4th debree, but anything will do) using a closed-form solution. I'm only interested in the real roots.
My first take on solving quadratic equations involved this (I also have code in similar style for cubics / quartics, but let's focus on quadratics right now):
/**
* #brief a simple quadratic equation solver
*
* With double-precision floating-point, this reaches 1e-12 worst-case and 1e-15 average
* precision of the roots (the value of the function in the roots). The roots can be however
* quite far from the true roots, up to 1e-10 worst-case and 1e-18 average absolute difference
* for cases when two roots exist. If only a single root exists, the worst-case precision is
* 1e-13 and average-case precision is 1e-18.
*
* With single-precision floating-point, this reaches 1e-3 worst-case and 1e-7 average
* precision of the roots (the value of the function in the roots). The roots can be however
* quite far from the true roots, up to 1e-1 worst-case and 1e-10 average absolute difference
* for cases when two roots exist. If only a single root exists, the worst-case precision is
* 1e+2 (!) and average-case precision is 1e-2. Do not use single-precision floating point,
* except if pressed by time.
*
* All the precision measurements are scaled by the maximum absolute coefficient value.
*
* #tparam T is data type of the arguments (default double)
* #tparam b_sort_roots is root sorting flag (if set, the roots are
* given in ascending (not absolute) value; default true)
* #tparam n_2nd_order_coeff_log10_thresh is base 10 logarithm of threshold
* on the first coefficient (if below threshold, the equation is a linear one; default -6)
* #tparam n_zero_discriminant_log10_thresh is base 10 logarithm of threshold
* on the discriminant (if below negative threshold, the equation does not
* have a real root, if below threshold, the equation has just a single solution; default -6)
*/
template <class T = double, const bool b_sort_roots = true,
const int n_2nd_order_coeff_log10_thresh = -6,
const int n_zero_discriminant_log10_thresh = -6>
class CQuadraticEq {
protected:
T a; /**< #brief the 2nd order coefficient */
T b; /**< #brief the 1st order coefficient */
T c; /**< #brief 0th order coefficient */
T p_real_root[2]; /**< #brief list of the roots (real parts) */
//T p_im_root[2]; // imaginary part of the roots
size_t n_real_root_num; /**< #brief number of real roots */
public:
/**
* #brief default constructor; solves for roots of \f$ax^2 + bx + c = 0\f$
*
* This finds roots of the given equation. It tends to find two identical roots instead of one, rather
* than missing one of two different roots - the number of roots found is therefore orientational,
* as the roots might have the same value.
*
* #param[in] _a is the 2nd order coefficient
* #param[in] _b is the 1st order coefficient
* #param[in] _c is 0th order coefficient
*/
CQuadraticEq(T _a, T _b, T _c) // ax2 + bx + c = 0
:a(_a), b(_b), c(_c)
{
T _aa = fabs(_a);
if(_aa < f_Power_Static(10, n_2nd_order_coeff_log10_thresh)) { // otherwise division by a yields large numbers, this is then more precise
p_real_root[0] = -_c / _b;
//p_im_root[0] = 0;
n_real_root_num = 1;
return;
}
// a simple linear equation
if(_aa < 1) { // do not divide always, that makes it worse
_b /= _a;
_c /= _a;
_a = 1;
// could copy the code here and optimize away division by _a (optimizing compiler might do it for us)
}
// improve numerical stability if the coeffs are very small
const double f_thresh = f_Power_Static(10, n_zero_discriminant_log10_thresh);
double f_disc = _b * _b - 4 * _a * _c;
if(f_disc < -f_thresh) // only really negative
n_real_root_num = 0; // only two complex roots
else if(/*fabs(f_disc) < f_thresh*/f_disc <= f_thresh) { // otherwise gives problems for double root situations
p_real_root[0] = T(-_b / (2 * _a));
n_real_root_num = 1;
} else {
f_disc = sqrt(f_disc);
int i = (b_sort_roots)? ((_a > 0)? 0 : 1) : 0; // produce sorted roots, if required
p_real_root[i] = T((-_b - f_disc) / (2 * _a));
p_real_root[1 - i] = T((-_b + f_disc) / (2 * _a));
//p_im_root[0] = 0;
//p_im_root[1] = 0;
n_real_root_num = 2;
}
}
/**
* #brief gets number of real roots
* #return Returns number of real roots (0 to 2).
*/
size_t n_RealRoot_Num() const
{
_ASSERTE(n_real_root_num >= 0);
return n_real_root_num;
}
/**
* #brief gets value of a real root
* #param[in] n_index is zero-based index of the root
* #return Returns value of the specified root.
*/
T f_RealRoot(size_t n_index) const
{
_ASSERTE(n_index < 2 && n_index < n_real_root_num);
return p_real_root[n_index];
}
/**
* #brief evaluates the equation for a given argument
* #param[in] f_x is value of the argument \f$x\f$
* #return Returns value of \f$ax^2 + bx + c\f$.
*/
T operator ()(T f_x) const
{
T f_x2 = f_x * f_x;
return f_x2 * a + f_x * b + c;
}
};
The code is horrible, and I hate all the thresholds. But for random equations with roots in the [-100, 100] interval, this is not so bad:
root response precision 1e-100: 6315 cases
root response precision 1e-19: 2 cases
root response precision 1e-17: 2 cases
root response precision 1e-16: 6 cases
root response precision 1e-15: 6333 cases
root response precision 1e-14: 3765 cases
root response precision 1e-13: 241 cases
root response precision 1e-12: 3 cases
2-root solution precision 1e-100: 5353 cases
2-root solution precision 1e-19: 656 cases
2-root solution precision 1e-18: 4481 cases
2-root solution precision 1e-17: 2312 cases
2-root solution precision 1e-16: 455 cases
2-root solution precision 1e-15: 68 cases
2-root solution precision 1e-14: 7 cases
2-root solution precision 1e-13: 2 cases
1-root solution precision 1e-100: 3022 cases
1-root solution precision 1e-19: 38 cases
1-root solution precision 1e-18: 197 cases
1-root solution precision 1e-17: 68 cases
1-root solution precision 1e-16: 7 cases
1-root solution precision 1e-15: 1 cases
Note that this precision is relative to the magnitude of the coefficients, which is typically in the 10^6 range (so finally the precision is far from perfect, but probably mostly usable). Without the thresholds, however, it is near to useless.
I have tried using multiple precision arithmetics, which generally works well, but tends to reject many of the roots simply because the coefficients of the polynomial are not multiple precision and some polynomials cannot be exactly represented (if there is a double root in a 2nd degree polynomial, it mostly either splits it to two roots (which I wouldn't mind) or says that there is no root whatsoever). If I want to recover perhaps even slightly imprecise roots, my code gets complicated and full of thresholds.
So far, I've tried using CCmath, but either I can't use it correctly, or the precision is really bad. Also, it uses iterative (not closed-form) solver in plrt().
I have tried using GNU scientific library gsl_poly_solve_quadratic() but that seems to be a naive approach, and not very numerically stable.
Using std::complex numbers naively also turned out to be a really bad idea, as both the precision and speed can be bad (especially with cubic / quartic equations where the code is heavy with transcendental functions).
Is recovering the roots as complex numbers the only way to go? Then no roots are missed and the user can select how precise the roots need to be (and thus ignore small imaginary components in less precise roots).
This isn't really answering your question but I think you can improve on what you've got since you currently have a 'loss of significance' problem when b^2 >> ac. In such cases, you end up with a formula along the lines of (-b + (b + eps))/(2 * a) where the cancellation of the b's can lose many significant figures from eps.
The correct way of handling this is to use the 'normal' equation for roots of a quadratic for one root and the lesser known 'alternative' or 'upside down' equation for the other root. Which way round you take them depends on the sign of _b.
A change to your code along this lines of the following should reduce the errors resulting from this.
if( _b > 0 ) {
p_real_root[i] = T((-_b - f_disc) / (2 * _a));
p_real_root[1 - i] = T((2 * _c) / (-_b - f_disc));
}
else{
p_real_root[i] = T((2 * _c) / (-_b + f_disc));
p_real_root[1 - i] = T((-_b + f_disc) / (2 * _a));
}
in my code I often compute things like the following piece (here C code for simplicity):
float cos_theta = /* some simple operations; no cosf call! */;
float sin_theta = sqrtf(1.0f - cos_theta * cos_theta); // Option 1
For this example ignore that the argument of the square root might be negative due to imprecisions. I fixed that with additional fdimf call. However, I wondered if the following is more precise:
float sin_theta = sqrtf((1.0f + cos_theta) * (1.0f - cos_theta)); // Option 2
cos_theta is between -1 and +1 so for each choice there will be situations where I subtract similar numbers and thus will loose precision, right? What is the most precise and why?
The most precise way with floats is likely to compute both sin and cos using a single x87 instruction, fsincos.
However, if you need to do the computation manually, it's best to group arguments with similar magnitudes. This means the second option is more precise, especially when cos_theta is close to 0, where precision matters the most.
As the article
What Every Computer Scientist Should Know About Floating-Point Arithmetic notes:
The expression x2 - y2 is another formula that exhibits catastrophic
cancellation. It is more accurate to evaluate it as (x - y)(x + y).
Edit: it's more complicated than this. Although the above is generally true, (x - y)(x + y) is slightly less accurate when x and y are of very different magnitudes, as the footnote to the statement explains:
In this case, (x - y)(x + y) has three rounding errors, but x2 - y2 has only two since the rounding error committed when computing the smaller of x2 and y2 does not affect the final subtraction.
In other words, taking x - y, x + y, and the product (x - y)(x + y) each introduce rounding errors (3 steps of rounding error). x2, y2, and the subtraction x2 - y2 also each introduce rounding errors, but the rounding error obtained by squaring a relatively small number (the smaller of x and y) is so negligible that there are effectively only two steps of rounding error, making the difference of squares more precise.
So option 1 is actually going to be more precise. This is confirmed by dev.brutus's Java test.
I wrote small test. It calcutates expected value with double precision. Then it calculates an error with your options. The first option is better:
Algorithm: FloatTest$1
option 1 error = 3.802792362162126
option 2 error = 4.333273185303996
Algorithm: FloatTest$2
option 1 error = 3.802792362167937
option 2 error = 4.333273185305868
The Java code:
import org.junit.Test;
public class FloatTest {
#Test
public void test() {
testImpl(new ExpectedAlgorithm() {
public double te(double cos_theta) {
return Math.sqrt(1.0f - cos_theta * cos_theta);
}
});
testImpl(new ExpectedAlgorithm() {
public double te(double cos_theta) {
return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
}
});
}
public void testImpl(ExpectedAlgorithm ea) {
double delta1 = 0;
double delta2 = 0;
for (double cos_theta = -1; cos_theta <= 1; cos_theta += 1e-8) {
double[] delta = delta(cos_theta, ea);
delta1 += delta[0];
delta2 += delta[1];
}
System.out.println("Algorithm: " + ea.getClass().getName());
System.out.println("option 1 error = " + delta1);
System.out.println("option 2 error = " + delta2);
}
private double[] delta(double cos_theta, ExpectedAlgorithm ea) {
double expected = ea.te(cos_theta);
double delta1 = Math.abs(expected - t1((float) cos_theta));
double delta2 = Math.abs(expected - t2((float) cos_theta));
return new double[]{delta1, delta2};
}
private double t1(float cos_theta) {
return Math.sqrt(1.0f - cos_theta * cos_theta);
}
private double t2(float cos_theta) {
return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
}
interface ExpectedAlgorithm {
double te(double cos_theta);
}
}
The correct way to reason about numerical precision of some expression is to:
Measure the result discrepancy relative to the correct value in ULPs (Unit in the last place), introduced in 1960. by W. H. Kahan. You can find C, Python & Mathematica implementations here, and learn more on the topic here.
Discriminate between two or more expressions based on the worst case they produce, not average absolute error as done in other answers or by some other arbitrary metric. This is how numerical approximation polynomials are constructed (Remez algorithm), how standard library methods' implementations are analysed (e.g. Intel atan2), etc...
With that in mind, version_1: sqrt(1 - x * x) and version_2: sqrt((1 - x) * (1 + x)) produce significantly different outcomes. As presented in the plot below, version_1 demonstrates catastrophic performance for x close to 1 with error > 1_000_000 ulps, while on the other hand error of version_2 is well behaved.
That is why I always recommend using version_2, i.e. exploiting the square difference formula.
Python 3.6 code that produces square_diff_error.csv file:
from fractions import Fraction
from math import exp, fabs, sqrt
from random import random
from struct import pack, unpack
def ulp(x):
"""
Computing ULP of input double precision number x exploiting
lexicographic ordering property of positive IEEE-754 numbers.
The implementation correctly handles the special cases:
- ulp(NaN) = NaN
- ulp(-Inf) = Inf
- ulp(Inf) = Inf
Author: Hrvoje Abraham
Date: 11.12.2015
Revisions: 15.08.2017
26.11.2017
MIT License https://opensource.org/licenses/MIT
:param x: (float) float ULP will be calculated for
:returns: (float) the input float number ULP value
"""
# setting sign bit to 0, e.g. -0.0 becomes 0.0
t = abs(x)
# converting IEEE-754 64-bit format bit content to unsigned integer
ll = unpack('Q', pack('d', t))[0]
# computing first smaller integer, bigger in a case of ll=0 (t=0.0)
near_ll = abs(ll - 1)
# converting back to float, its value will be float nearest to t
near_t = unpack('d', pack('Q', near_ll))[0]
# abs takes care of case t=0.0
return abs(t - near_t)
with open('e:/square_diff_error.csv', 'w') as f:
for _ in range(100_000):
# nonlinear distribution of x in [0, 1] to produce more cases close to 1
k = 10
x = (exp(k) - exp(k * random())) / (exp(k) - 1)
fx = Fraction(x)
correct = sqrt(float(Fraction(1) - fx * fx))
version1 = sqrt(1.0 - x * x)
version2 = sqrt((1.0 - x) * (1.0 + x))
err1 = fabs(version1 - correct) / ulp(correct)
err2 = fabs(version2 - correct) / ulp(correct)
f.write(f'{x},{err1},{err2}\n')
Mathematica code that produces the final plot:
data = Import["e:/square_diff_error.csv"];
err1 = {1 - #[[1]], #[[2]]} & /# data;
err2 = {1 - #[[1]], #[[3]]} & /# data;
ListLogLogPlot[{err1, err2}, PlotRange -> All, Axes -> False, Frame -> True,
FrameLabel -> {"1-x", "error [ULPs]"}, LabelStyle -> {FontSize -> 20}]
As an aside, you will always have a problem when theta is small, because the cosine is flat around theta = 0. If theta is between -0.0001 and 0.0001 then cos(theta) in float is exactly one, so your sin_theta will be exactly zero.
To answer your question, when cos_theta is close to one (corresponding to a small theta), your second computation is clearly more accurate. This is shown by the following program, that lists the absolute and relative errors for both computations for various values of cos_theta. The errors are computed by comparing against a value which is computed with 200 bits of precision, using GNU MP library, and then converted to a float.
#include <math.h>
#include <stdio.h>
#include <gmp.h>
int main()
{
int i;
printf("cos_theta abs (1) rel (1) abs (2) rel (2)\n\n");
for (i = -14; i < 0; ++i) {
float x = 1 - pow(10, i/2.0);
float approx1 = sqrt(1 - x * x);
float approx2 = sqrt((1 - x) * (1 + x));
/* Use GNU MultiPrecision Library to get 'exact' answer */
mpf_t tmp1, tmp2;
mpf_init2(tmp1, 200); /* use 200 bits precision */
mpf_init2(tmp2, 200);
mpf_set_d(tmp1, x);
mpf_mul(tmp2, tmp1, tmp1); /* tmp2 = x * x */
mpf_neg(tmp1, tmp2); /* tmp1 = -x * x */
mpf_add_ui(tmp2, tmp1, 1); /* tmp2 = 1 - x * x */
mpf_sqrt(tmp1, tmp2); /* tmp1 = sqrt(1 - x * x) */
float exact = mpf_get_d(tmp1);
printf("%.8f %.3e %.3e %.3e %.3e\n", x,
fabs(approx1 - exact), fabs((approx1 - exact) / exact),
fabs(approx2 - exact), fabs((approx2 - exact) / exact));
/* printf("%.10f %.8f %.8f %.8f\n", x, exact, approx1, approx2); */
}
return 0;
}
Output:
cos_theta abs (1) rel (1) abs (2) rel (2)
0.99999988 2.910e-11 5.960e-08 0.000e+00 0.000e+00
0.99999970 5.821e-11 7.539e-08 0.000e+00 0.000e+00
0.99999899 3.492e-10 2.453e-07 1.164e-10 8.178e-08
0.99999684 2.095e-09 8.337e-07 0.000e+00 0.000e+00
0.99998999 1.118e-08 2.497e-06 0.000e+00 0.000e+00
0.99996835 6.240e-08 7.843e-06 9.313e-10 1.171e-07
0.99989998 3.530e-07 2.496e-05 0.000e+00 0.000e+00
0.99968380 3.818e-07 1.519e-05 0.000e+00 0.000e+00
0.99900001 1.490e-07 3.333e-06 0.000e+00 0.000e+00
0.99683774 8.941e-08 1.125e-06 7.451e-09 9.376e-08
0.99000001 5.960e-08 4.225e-07 0.000e+00 0.000e+00
0.96837723 1.490e-08 5.973e-08 0.000e+00 0.000e+00
0.89999998 2.980e-08 6.837e-08 0.000e+00 0.000e+00
0.68377221 5.960e-08 8.168e-08 5.960e-08 8.168e-08
When cos_theta is not close to one, then the accuracy of both methods is very close to each other and to round-off error.
[Edited for major think-o] It looks to me like option 2 will be better, because for a number like 0.000001 for example option 1 will return the sine as 1 while option will return a number just smaller than 1.
No difference in my option since (1-x) preserves the precision not effecting the carried bit. Then for (1+x) the same is true. Then the only thing effecting the carry bit precision is the multiplication. So in both cases there is one single multiplication, so they are both as likely to give the same carry bit error.
I understand why floating point numbers can't be compared, and know about the mantissa and exponent binary representation, but I'm no expert and today I came across something I don't get:
Namely lets say you have something like:
float denominator, numerator, resultone, resulttwo;
resultone = numerator / denominator;
float buff = 1 / denominator;
resulttwo = numerator * buff;
To my knowledge different flops can yield different results and this is not unusual. But in some edge cases these two results seem to be vastly different. To be more specific in my GLSL code calculating the Beckmann facet slope distribution for the Cook-Torrance lighitng model:
float a = 1 / (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
float b = clampedCosHalfNormal * clampedCosHalfNormal - 1.0;
float c = facetSlopeRMS * facetSlopeRMS * clampedCosHalfNormal * clampedCosHalfNormal;
facetSlopeDistribution = a * exp(b/c);
yields very very different results to
float a = (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
facetDlopeDistribution = exp(b/c) / a;
Why does it? The second form of the expression is problematic.
If I say try to add the second form of the expression to a color I get blacks, even though the expression should always evaluate to a positive number. Am I getting an infinity? A NaN? if so why?
I didn't go through your mathematics in detail, but you must be aware that small errors get pumped up easily by all these powers and exponentials. You should try and substitute all variables var with var + e(var) (on paper, yes) and derive an expression for the total error - without simplifying in between steps, because that's where the error comes from!
This is also a very common problem in computational fluid dynamics, where you can observe things like 'numeric diffusion' if your grid isn't properly aligned with the simulated flow.
So get a clear grip on where the biggest errors come from, and rewrite equations where possible to minimize the numeric error.
edit: to clarify, an example
Say you have some variable x and an expression y=exp(x). The error in x is denoted e(x) and is small compared to x (say e(x)/x < 0.0001, but note that this depends on the type you are using). Then you could say that
e(y) = y(x+e(x)) - y(x)
e(y) ~ dy/dx * e(x) (for small e(x))
e(y) = exp(x) * e(x)
So there's a magnification of the absolute error of exp(x), meaning that around x=0 there's really no issue (not a surprise, since at that point the slope of exp(x) equals that of x) , but for big x you will notice this.
The relative error would then be
e(y)/y = e(y)/exp(x) = e(x)
whilst the relative error in x was
e(x)/x
so you added a factor of x to the relative error.