Double precision and periodic decimals - C++

I'm working on a Lisp interpreter and implemented rational numbers. I thought they would have the advantage over doubles of being able to represent numbers like 1/3 exactly. I did some calculations to compare the two and was surprised by the results.
With doubles:
(* 3.0 (/ 1.0 3.0)) -> 1
(* 3.0 (/ 4.0 3.0)) -> 4
(* 81.0 (/ 1.0 81.0)) -> 1
With ratios:
(* 3 (/ 1 3)) -> 1
(* 3 (/ 4 3)) -> 4
(* 81 (/ 1 81)) -> 1
Why are the results of the floating point operations exact? There must be a loss of precision; doubles cannot store an infinite number of digits. Or am I missing something?
I did a quick test with a small C application. Same result.
#include <stdio.h>

int main()
{
    double a1 = 1, b1 = 3;
    double a2 = 1, b2 = 81;
    printf("result : %f\n", a1 / b1 * b1);
    printf("result : %f\n", a2 / b2 * b2);
    return 0;
}
Output is:
result : 1.000000
result : 1.000000
Regards,
Martin

For the first case, the exact result of the multiply is halfway between 1.0 and the largest double that is less than 1.0. Under IEEE 754 round-to-nearest rules, halfway values are rounded to even, in this case to 1.0. In effect, the rounding of the result of the multiply undid the error introduced by rounding of the division result.
This Java program illustrates what is happening. The conversions to BigDecimal and the BigDecimal arithmetic operations are all exact:
import java.math.BigDecimal;

public class Test {
    public static void main(String[] args) {
        double a1 = 1, b1 = 3;
        System.out.println("Final Result: " + ((a1 / b1) * b1));
        BigDecimal divResult = new BigDecimal(a1 / b1);
        System.out.println("Division Result: " + divResult);
        BigDecimal multiplyResult = divResult.multiply(BigDecimal.valueOf(3));
        System.out.println("Multiply Result: " + multiplyResult);
        System.out.println("Error rounding up to 1.0: "
                + BigDecimal.valueOf(1).subtract(multiplyResult));
        BigDecimal nextDown = new BigDecimal(Math.nextAfter(1.0, 0));
        System.out.println("Next double down from 1.0: " + nextDown);
        System.out.println("Error rounding down: "
                + multiplyResult.subtract(nextDown));
    }
}
The output is:
Final Result: 1.0
Division Result: 0.333333333333333314829616256247390992939472198486328125
Multiply Result: 0.999999999999999944488848768742172978818416595458984375
Error rounding up to 1.0: 5.5511151231257827021181583404541015625E-17
Next double down from 1.0: 0.99999999999999988897769753748434595763683319091796875
Error rounding down: 5.5511151231257827021181583404541015625E-17
The output for the second, similar, case is:
Final Result: 1.0
Division Result: 0.012345679012345678327022824305458925664424896240234375
Multiply Result: 0.9999999999999999444888487687421729788184165954589843750
Error rounding up to 1.0: 5.55111512312578270211815834045410156250E-17
Next double down from 1.0: 0.99999999999999988897769753748434595763683319091796875
Error rounding down: 5.55111512312578270211815834045410156250E-17
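For completeness, the same intermediate values can be inspected in plain C, the language of the original test. A minimal sketch (it relies on the C library printing the exact decimal digits when asked for enough precision, which glibc and most modern implementations do):

#include <stdio.h>

int main()
{
    double third = 1.0 / 3.0;      /* rounded slightly below 1/3 */
    double product = 3.0 * third;  /* exact product is halfway between 1.0 and the next double down; ties round to even, giving 1.0 */
    printf("1.0/3.0       = %.60g\n", third);
    printf("3.0*(1.0/3.0) = %.60g\n", product);
    return 0;
}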
This program illustrates a situation in which rounding error can accumulate:
import java.math.BigDecimal;

public class Test {
    public static void main(String[] args) {
        double tenth = 0.1;
        double sum = 0;
        for (int i = 0; i < 10; i++) {
            sum += tenth;
        }
        System.out.println("Sum: " + new BigDecimal(sum));
        System.out.println("Product: " + new BigDecimal(10.0 * tenth));
    }
}
Output:
Sum: 0.99999999999999988897769753748434595763683319091796875
Product: 1
Multiplying by 10 rounds to 1.0. Doing the same multiplication by repeated addition does not get the exact answer.
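Since the original question concerned C/C++, here is a minimal C++ equivalent of the accumulation example above (a sketch; the printed digits assume an IEEE-754 double and a C library that prints exact decimal expansions):

#include <cstdio>

int main()
{
    double tenth = 0.1;   // nearest double to 1/10, slightly above it
    double sum = 0.0;
    for (int i = 0; i < 10; i++) {
        sum += tenth;     // each addition rounds; the errors accumulate
    }
    std::printf("Sum:     %.60g\n", sum);           // lands just below 1
    std::printf("Product: %.60g\n", 10.0 * tenth);  // rounds to exactly 1
    return 0;
}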

Related

Exact value of a floating-point number as a rational

I'm looking for a method to convert the exact value of a floating-point number to a rational quotient of two integers, i.e. a / b, where b is not larger than a specified maximum denominator b_max. If satisfying the condition b <= b_max is impossible, then the result falls back to the best approximation which still satisfies that condition.
Hold on. There are a lot of questions/answers here about the best rational approximation of a truncated real number which is represented as a floating-point number. However, I'm interested in the exact value of a floating-point number, which is itself a rational number with a different representation. More specifically, the mathematical set of floating-point numbers is a subset of the rational numbers. In the case of the IEEE 754 binary floating-point standard it is a subset of the dyadic rationals. Anyway, any finite floating-point number can be converted to a rational quotient of two finite-precision integers a / b.
So, for example assuming IEEE 754 single-precision binary floating-point format, the rational equivalent of float f = 1.0f / 3.0f is not 1 / 3, but 11184811 / 33554432. This is the exact value of f, which is a number from the mathematical set of IEEE 754 single-precision binary floating-point numbers.
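To illustrate that claim, here is a minimal C++ sketch of the conversion for this particular value (it assumes the exponent returned by frexp is small enough for the denominator to fit in an int64_t, which holds here but not for every float):

#include <cfloat>
#include <cmath>
#include <cstdint>
#include <cstdio>

int main()
{
    float f = 1.0f / 3.0f;
    int ex;
    // f = frac * 2^ex with frac in [0.5, 1); scaling frac by 2^FLT_MANT_DIG
    // turns it into an exact integer numerator over the denominator 2^(FLT_MANT_DIG - ex).
    float frac = std::frexp(f, &ex);
    int64_t num = static_cast<int64_t>(std::ldexp(frac, FLT_MANT_DIG));
    int64_t den = int64_t(1) << (FLT_MANT_DIG - ex);
    while (num % 2 == 0 && den % 2 == 0) {  // strip common factors of two
        num /= 2;
        den /= 2;
    }
    std::printf("%lld / %lld\n", (long long)num, (long long)den);  // 11184811 / 33554432
    return 0;
}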
Based on my experience, traversing (by binary search of) the Stern-Brocot tree is not useful here, since that is more suitable for approximating the value of a floating-point number, when it is interpreted as a truncated real instead of an exact rational.
Possibly, continued fractions are the way to go.
Another problem here is integer overflow. Suppose we want to represent the rational as the quotient of two int32_t values, with the maximum denominator b_max = INT32_MAX. We cannot rely on a stopping criterion like b > b_max. So the algorithm must never overflow, or it must detect overflow.
What I found so far is an algorithm from Rosetta Code, which is based on continued fractions, but its source mentions it is "still not quite complete". Some basic tests gave good results, but I cannot confirm its overall correctness and I think it can easily overflow.
// https://rosettacode.org/wiki/Convert_decimal_number_to_rational#C
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>

/* f : number to convert.
 * num, denom: returned parts of the rational.
 * md: max denominator value. Note that machine floating point number
 * has a finite resolution (10e-16 ish for 64 bit double), so specifying
 * a "best match with minimal error" is often wrong, because one can
 * always just retrieve the significand and return that divided by
 * 2**52, which is in a sense accurate, but generally not very useful:
 * 1.0/7.0 would be "2573485501354569/18014398509481984", for example.
 */
void rat_approx(double f, int64_t md, int64_t *num, int64_t *denom)
{
    /* a: continued fraction coefficients. */
    int64_t a, h[3] = { 0, 1, 0 }, k[3] = { 1, 0, 0 };
    int64_t x, d, n = 1;
    int i, neg = 0;

    if (md <= 1) { *denom = 1; *num = (int64_t) f; return; }

    if (f < 0) { neg = 1; f = -f; }

    while (f != floor(f)) { n <<= 1; f *= 2; }
    d = f;

    /* continued fraction and check denominator each step */
    for (i = 0; i < 64; i++) {
        a = n ? d / n : 0;
        if (i && !a) break;

        x = d; d = n; n = x % n;

        x = a;
        if (k[1] * a + k[0] >= md) {
            x = (md - k[0]) / k[1];
            if (x * 2 >= a || k[1] >= md)
                i = 65;
            else
                break;
        }

        h[2] = x * h[1] + h[0]; h[0] = h[1]; h[1] = h[2];
        k[2] = x * k[1] + k[0]; k[0] = k[1]; k[1] = k[2];
    }
    *denom = k[1];
    *num = neg ? -h[1] : h[1];
}
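For reference, a minimal usage sketch of the routine above (assuming it is compiled together with the function; the commented results are what the continued-fraction convergents should reduce to for these inputs):

#include <stdio.h>
#include <stdint.h>

void rat_approx(double f, int64_t md, int64_t *num, int64_t *denom);  /* defined above */

int main(void)
{
    int64_t n, d;
    rat_approx(1.0 / 3.0, 10, &n, &d);
    printf("%lld/%lld\n", (long long)n, (long long)d);  /* 1/3 */
    rat_approx(0.1, 1000000, &n, &d);
    printf("%lld/%lld\n", (long long)n, (long long)d);  /* 1/10 */
    return 0;
}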
All finite doubles are rational numbers, as OP correctly stated.
Use frexp() to break the number into its fraction and exponent. The end result still needs to use double to represent whole-number values due to range requirements. Some numbers are too small (x smaller than 1.0/pow(2.0, DBL_MAX_EXP)), and infinity and not-a-number are issues.
The frexp functions break a floating-point number into a normalized fraction and an integral power of 2. ... interval [1/2, 1) or zero ...
C11 §7.12.6.4 2/3
#include <math.h>
#include <float.h>
#include <stdio.h>

_Static_assert(FLT_RADIX == 2, "TBD code for non-binary FP");

// Return error flag
int split(double x, double *numerator, double *denominator) {
    if (!isfinite(x)) {
        *numerator = *denominator = 0.0;
        if (x > 0.0) *numerator = 1.0;
        if (x < 0.0) *numerator = -1.0;
        return 1;
    }
    int bdigits = DBL_MANT_DIG;
    int expo;
    *denominator = 1.0;
    *numerator = frexp(x, &expo) * pow(2.0, bdigits);
    expo -= bdigits;
    if (expo > 0) {
        *numerator *= pow(2.0, expo);
    }
    else if (expo < 0) {
        expo = -expo;
        if (expo >= DBL_MAX_EXP - 1) {
            *numerator /= pow(2.0, expo - (DBL_MAX_EXP - 1));
            *denominator *= pow(2.0, DBL_MAX_EXP - 1);
            return fabs(*numerator) < 1.0;
        } else {
            *denominator *= pow(2.0, expo);
        }
    }
    while (*numerator && fmod(*numerator, 2) == 0 && fmod(*denominator, 2) == 0) {
        *numerator /= 2.0;
        *denominator /= 2.0;
    }
    return 0;
}

void split_test(double x) {
    double numerator, denominator;
    int err = split(x, &numerator, &denominator);
    printf("e:%d x:%24.17g n:%24.17g d:%24.17g q:%24.17g\n",
            err, x, numerator, denominator, numerator / denominator);
}

int main(void) {
    volatile float third = 1.0f / 3.0f;
    split_test(third);
    split_test(0.0);
    split_test(0.5);
    split_test(1.0);
    split_test(2.0);
    split_test(1.0 / 7);
    split_test(DBL_TRUE_MIN);
    split_test(DBL_MIN);
    split_test(DBL_MAX);
    return 0;
}
Output
e:0 x: 0.3333333432674408 n: 11184811 d: 33554432 q: 0.3333333432674408
e:0 x: 0 n: 0 d: 9007199254740992 q: 0
e:0 x: 0.5 n: 1 d: 2 q: 0.5
e:0 x: 1 n: 1 d: 1 q: 1
e:0 x: 2 n: 2 d: 1 q: 2
e:0 x: 0.14285714285714285 n: 2573485501354569 d: 18014398509481984 q: 0.14285714285714285
e:1 x: 4.9406564584124654e-324 n: 4.4408920985006262e-16 d: 8.9884656743115795e+307 q: 4.9406564584124654e-324
e:0 x: 2.2250738585072014e-308 n: 2 d: 8.9884656743115795e+307 q: 2.2250738585072014e-308
e:0 x: 1.7976931348623157e+308 n: 1.7976931348623157e+308 d: 1 q: 1.7976931348623157e+308
Leave the b_max consideration for later.
More expedient code is possible by replacing pow(2.0, expo) with ldexp(1, expo) (@gammatester) or exp2(expo) (@Bob__); a sketch of such a variant follows below.
The while (*numerator && fmod(*numerator,2) == 0 && fmod(*denominator,2) == 0) loop could also use some performance improvements. But first, let us get the functionality as needed.
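For illustration, a minimal sketch of such a variant (it assumes IEEE-754 binary64 and skips the infinity, NaN and subnormal handling of the full split() above; for very small x the denominator below would overflow to infinity):

#include <float.h>
#include <math.h>
#include <stdio.h>

static void split_simple(double x, double *numerator, double *denominator) {
    int expo;
    double frac = frexp(x, &expo);                    /* x = frac * 2^expo, frac in [0.5, 1) */
    if (expo >= DBL_MANT_DIG) {                       /* x is already a (large) whole number */
        *numerator = x;
        *denominator = 1.0;
        return;
    }
    *numerator = frac * ldexp(1.0, DBL_MANT_DIG);     /* exact integer */
    *denominator = ldexp(1.0, DBL_MANT_DIG - expo);   /* exact power of two */
    while (*numerator && fmod(*numerator, 2) == 0 && fmod(*denominator, 2) == 0) {
        *numerator /= 2;
        *denominator /= 2;
    }
}

int main(void) {
    double n, d;
    split_simple(1.0 / 3.0, &n, &d);
    printf("%.0f / %.0f\n", n, d);   /* 6004799503160661 / 18014398509481984 */
    return 0;
}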

Sum exceeding permissible value in looping floats

I recently created this simple program to find average velocity.
Average velocity = Δx / Δt
I chose x as a function of t as x = t^2
Therefore v = 2t
Also, avg v = (x2 - x1) / (t2 - t1).
I chose the interval to be t = 1 s to 4 s, which implies x goes from 1 to 16.
Therefore avg v = (16 - 1) / (4 - 1) = 5.
Now the program :
#include <iostream>
using namespace std;

int main() {
    float t = 1, v = 0, sum = 0, n = 0; // t = time, v = velocity, sum = Sigma v, n = Sigma 1
    float avgv = 0;
    while (t <= 4) {
        v = 2 * t;
        sum += v;
        t += 0.0001;
        n++;
    }
    avgv = sum / n;
    cout << "\n----> " << avgv << " <----\n";
    return 0;
}
I used very small increments of time to calculate the velocity at many moments. Now, if the increment of t is 0.001, the calculated avg v is 4.99998.
If I set the increment of t to 0.0001, the avg v becomes 5.00007!
Further decreasing the increment to 0.00001 yields avg v = 5.00001.
Why is that so?
Thank you.
In base 2, 0.0001 and 0.001 are periodic numbers, so they don't have an exact representation. One of them is rounded up, the other is rounded down, so when you sum lots of them you get different values.
The same thing happens in decimal representation if you choose the numbers to sum accordingly (assume each variable can hold 3 decimal digits).
Compare:
a = 1 / 3; // a becomes 0.333
b = a * 6; // b becomes 1.998
with:
a = 2 / 3; // a becomes 0.667
b = a * 3; // b becomes 2.001
Both should (theoretically) result in 2, but because of rounding errors they give different results.
In the decimal system, since 10 factors into the primes 2 and 5, only fractions whose denominator is divisible by nothing but 2 and 5 can be represented with a finite number of decimal digits (all other fractions are periodic). In base 2, only fractions whose denominator is a power of 2 can be represented exactly. Try using 1.0/512.0 or 1.0/1024.0 as the step in your loop, as in the sketch below. Also be careful: if you choose a step that is too small, you may not have enough significant digits to represent it in the float data type (i.e., use doubles).
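A minimal sketch of that suggestion, using the loop from the question with a 1/1024 step (a power of two, so the step and every intermediate t are represented exactly and never drift):

#include <iostream>
using namespace std;

int main() {
    double t = 1, sum = 0, n = 0;
    const double step = 1.0 / 1024.0;   // exactly representable in binary
    while (t <= 4) {
        sum += 2 * t;
        t += step;                      // every intermediate t is exact
        n++;
    }
    cout << "\n----> " << sum / n << " <----\n";   // prints 5 (up to tiny rounding in the sum)
    return 0;
}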

Why am I getting a different result from std::fmod and std::remainder

In the example app below I calculate the floating-point remainder of dividing 953 by 0.1, using std::fmod.
What I was expecting is that, since 953.0 / 0.1 == 9530, std::fmod(953, 0.1) == 0 as well.
Instead I'm getting 0.1 - why is this the case?
Note that with std::remainder I get the correct result.
That is:
std::fmod (953, 0.1) == 0.1 // unexpected
std::remainder(953, 0.1) == 0 // expected
Difference between the two functions:
According to cppreference.com
std::fmod calculates the following:
exactly the value x - n*y, where n is x/y with its fractional part truncated
std::remainder calculates the following:
exactly the value x - n*y, where n is the integral value nearest the exact value x/y
Given my inputs I would expect both functions to have the same output. Why is this not the case?
Exemplar app:
#include <iostream>
#include <cmath>

bool is_zero(double in)
{
    return std::fabs(in) < 0.0000001;
}

int main()
{
    double numerator = 953;
    double denominator = 0.1;
    double quotient = numerator / denominator;

    double fmod = std::fmod(numerator, denominator);
    double rem = std::remainder(numerator, denominator);

    if (is_zero(fmod))
        fmod = 0;
    if (is_zero(rem))
        rem = 0;

    std::cout << "quotient: " << quotient << ", fmod: " << fmod << ", rem: " << rem << std::endl;
    return 0;
}
Output:
quotient: 9530, fmod: 0.1, rem: 0
Because they are different functions.
std::remainder(x, y) calculates the IEEE remainder, which is x - (round(x/y)*y), where round means rounding half to even (so in particular round(1.0/2.0) == 0).
std::fmod(x, y) calculates x - trunc(x/y)*y, where x/y is the exact mathematical quotient of the two stored doubles, not the rounded result of the division operator. Because 0.1 stored as a double is slightly more than one tenth, that exact quotient is slightly smaller than 9530, so truncation gives 9529, and as the result you get roughly 953.0 - 952.9 = 0.1.
Welcome to floating point math. Here's what happens: one tenth cannot be represented exactly in binary, just as one third cannot be represented exactly in decimal. As a result, the exact division (the one fmod performs internally) produces a result slightly below 9530. Truncating it produces the integer 9529 instead of 9530, and this leaves roughly 0.1 left over.
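To make the intermediate values visible, a minimal sketch (printed with %.17g; the exact trailing digits depend on the platform's library, but the overall picture should not):

#include <cmath>
#include <cstdio>

int main()
{
    double x = 953.0, y = 0.1;  // y is the double nearest 1/10, slightly above it
    // The rounded result of the division operator happens to be exactly 9530 here,
    // but fmod and remainder work from the mathematically exact quotient x/y,
    // which is slightly below 9530.
    std::printf("x / y           = %.17g\n", x / y);
    std::printf("fmod(x, y)      = %.17g\n", std::fmod(x, y));       // just under 0.1
    std::printf("remainder(x, y) = %.17g\n", std::remainder(x, y));  // tiny negative value
    return 0;
}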

Numerical precision for difference of squares

in my code I often compute things like the following piece (here C code for simplicity):
float cos_theta = /* some simple operations; no cosf call! */;
float sin_theta = sqrtf(1.0f - cos_theta * cos_theta); // Option 1
For this example, ignore that the argument of the square root might be negative due to imprecision; I fixed that with an additional fdimf call. However, I wondered whether the following is more precise:
float sin_theta = sqrtf((1.0f + cos_theta) * (1.0f - cos_theta)); // Option 2
cos_theta is between -1 and +1, so for each choice there will be situations where I subtract similar numbers and thus lose precision, right? Which is the most precise, and why?
The most precise way with floats is likely to compute both sin and cos using a single x87 instruction, fsincos.
However, if you need to do the computation manually, it's best to group arguments of similar magnitude. This means the second option is more precise, especially when cos_theta is close to 1, where precision matters the most.
As the article
What Every Computer Scientist Should Know About Floating-Point Arithmetic notes:
The expression x^2 - y^2 is another formula that exhibits catastrophic cancellation. It is more accurate to evaluate it as (x - y)(x + y).
Edit: it's more complicated than this. Although the above is generally true, (x - y)(x + y) is slightly less accurate when x and y are of very different magnitudes, as the footnote to the statement explains:
In this case, (x - y)(x + y) has three rounding errors, but x^2 - y^2 has only two since the rounding error committed when computing the smaller of x^2 and y^2 does not affect the final subtraction.
In other words, taking x - y, x + y, and the product (x - y)(x + y) each introduce rounding errors (3 steps of rounding error). x^2, y^2, and the subtraction x^2 - y^2 also each introduce rounding errors, but the rounding error obtained by squaring a relatively small number (the smaller of x and y) is so negligible that there are effectively only two steps of rounding error, making the difference of squares more precise.
So option 1 is actually going to be more precise. This is confirmed by dev.brutus's Java test.
I wrote a small test. It calculates the expected value with double precision, then it calculates the error for each of your options. The first option is better:
Algorithm: FloatTest$1
option 1 error = 3.802792362162126
option 2 error = 4.333273185303996
Algorithm: FloatTest$2
option 1 error = 3.802792362167937
option 2 error = 4.333273185305868
The Java code:
import org.junit.Test;

public class FloatTest {

    @Test
    public void test() {
        testImpl(new ExpectedAlgorithm() {
            public double te(double cos_theta) {
                return Math.sqrt(1.0f - cos_theta * cos_theta);
            }
        });
        testImpl(new ExpectedAlgorithm() {
            public double te(double cos_theta) {
                return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
            }
        });
    }

    public void testImpl(ExpectedAlgorithm ea) {
        double delta1 = 0;
        double delta2 = 0;
        for (double cos_theta = -1; cos_theta <= 1; cos_theta += 1e-8) {
            double[] delta = delta(cos_theta, ea);
            delta1 += delta[0];
            delta2 += delta[1];
        }
        System.out.println("Algorithm: " + ea.getClass().getName());
        System.out.println("option 1 error = " + delta1);
        System.out.println("option 2 error = " + delta2);
    }

    private double[] delta(double cos_theta, ExpectedAlgorithm ea) {
        double expected = ea.te(cos_theta);
        double delta1 = Math.abs(expected - t1((float) cos_theta));
        double delta2 = Math.abs(expected - t2((float) cos_theta));
        return new double[]{delta1, delta2};
    }

    private double t1(float cos_theta) {
        return Math.sqrt(1.0f - cos_theta * cos_theta);
    }

    private double t2(float cos_theta) {
        return Math.sqrt((1.0f + cos_theta) * (1.0f - cos_theta));
    }

    interface ExpectedAlgorithm {
        double te(double cos_theta);
    }
}
The correct way to reason about the numerical precision of some expression is to:
1. Measure the result discrepancy relative to the correct value in ULPs (units in the last place), introduced in 1960 by W. H. Kahan. You can find C, Python & Mathematica implementations here, and learn more on the topic here.
2. Discriminate between two or more expressions based on the worst case they produce, not the average absolute error as done in other answers or some other arbitrary metric. This is how numerical approximation polynomials are constructed (Remez algorithm), how standard library methods' implementations are analysed (e.g. Intel atan2), etc.
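As an aside, the same ULP yardstick can be computed in C++ with std::nextafter; a minimal sketch (this measures the gap to the next double above x, a slightly different convention from the Python ulp() below, which measures the gap to the next double below):

#include <cmath>
#include <cstdio>

// One ULP at x: distance from x to the next representable double above it.
static double ulp(double x) {
    return std::nextafter(x, INFINITY) - x;
}

int main()
{
    std::printf("ulp(1.0) = %g\n", ulp(1.0));   // 2^-52, about 2.22e-16
    std::printf("ulp(0.5) = %g\n", ulp(0.5));   // 2^-53, about 1.11e-16
    return 0;
}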
With that in mind, version_1: sqrt(1 - x * x) and version_2: sqrt((1 - x) * (1 + x)) produce significantly different outcomes. As shown by the plot produced by the code below, version_1 demonstrates catastrophic performance for x close to 1, with errors > 1,000,000 ULPs, while the error of version_2 is well behaved.
That is why I always recommend using version_2, i.e. exploiting the difference-of-squares formula.
Python 3.6 code that produces the square_diff_error.csv file:
from fractions import Fraction
from math import exp, fabs, sqrt
from random import random
from struct import pack, unpack

def ulp(x):
    """
    Computing ULP of input double precision number x exploiting
    lexicographic ordering property of positive IEEE-754 numbers.

    The implementation correctly handles the special cases:
      - ulp(NaN) = NaN
      - ulp(-Inf) = Inf
      - ulp(Inf) = Inf

    Author: Hrvoje Abraham
    Date: 11.12.2015
    Revisions: 15.08.2017
               26.11.2017
    MIT License https://opensource.org/licenses/MIT

    :param x: (float) float ULP will be calculated for
    :returns: (float) the input float number ULP value
    """
    # setting sign bit to 0, e.g. -0.0 becomes 0.0
    t = abs(x)

    # converting IEEE-754 64-bit format bit content to unsigned integer
    ll = unpack('Q', pack('d', t))[0]

    # computing first smaller integer, bigger in a case of ll=0 (t=0.0)
    near_ll = abs(ll - 1)

    # converting back to float, its value will be float nearest to t
    near_t = unpack('d', pack('Q', near_ll))[0]

    # abs takes care of case t=0.0
    return abs(t - near_t)

with open('e:/square_diff_error.csv', 'w') as f:
    for _ in range(100_000):
        # nonlinear distribution of x in [0, 1] to produce more cases close to 1
        k = 10
        x = (exp(k) - exp(k * random())) / (exp(k) - 1)

        fx = Fraction(x)
        correct = sqrt(float(Fraction(1) - fx * fx))

        version1 = sqrt(1.0 - x * x)
        version2 = sqrt((1.0 - x) * (1.0 + x))

        err1 = fabs(version1 - correct) / ulp(correct)
        err2 = fabs(version2 - correct) / ulp(correct)

        f.write(f'{x},{err1},{err2}\n')
Mathematica code that produces the final plot:
data = Import["e:/square_diff_error.csv"];
err1 = {1 - #[[1]], #[[2]]} & /@ data;
err2 = {1 - #[[1]], #[[3]]} & /@ data;
ListLogLogPlot[{err1, err2}, PlotRange -> All, Axes -> False, Frame -> True,
FrameLabel -> {"1-x", "error [ULPs]"}, LabelStyle -> {FontSize -> 20}]
As an aside, you will always have a problem when theta is small, because the cosine is flat around theta = 0. If theta is between -0.0001 and 0.0001 then cos(theta) in float is exactly one, so your sin_theta will be exactly zero.
To answer your question, when cos_theta is close to one (corresponding to a small theta), your second computation is clearly more accurate. This is shown by the following program, which lists the absolute and relative errors of both computations for various values of cos_theta. The errors are computed by comparing against a value computed with 200 bits of precision using the GNU MP library and then converted to a float.
#include <math.h>
#include <stdio.h>
#include <gmp.h>

int main()
{
    int i;
    printf("cos_theta abs (1) rel (1) abs (2) rel (2)\n\n");
    for (i = -14; i < 0; ++i) {
        float x = 1 - pow(10, i/2.0);
        float approx1 = sqrt(1 - x * x);
        float approx2 = sqrt((1 - x) * (1 + x));

        /* Use GNU MultiPrecision Library to get 'exact' answer */
        mpf_t tmp1, tmp2;
        mpf_init2(tmp1, 200);          /* use 200 bits precision */
        mpf_init2(tmp2, 200);
        mpf_set_d(tmp1, x);
        mpf_mul(tmp2, tmp1, tmp1);     /* tmp2 = x * x */
        mpf_neg(tmp1, tmp2);           /* tmp1 = -x * x */
        mpf_add_ui(tmp2, tmp1, 1);     /* tmp2 = 1 - x * x */
        mpf_sqrt(tmp1, tmp2);          /* tmp1 = sqrt(1 - x * x) */
        float exact = mpf_get_d(tmp1);

        printf("%.8f %.3e %.3e %.3e %.3e\n", x,
               fabs(approx1 - exact), fabs((approx1 - exact) / exact),
               fabs(approx2 - exact), fabs((approx2 - exact) / exact));
        /* printf("%.10f %.8f %.8f %.8f\n", x, exact, approx1, approx2); */
    }
    return 0;
}
Output:
cos_theta abs (1) rel (1) abs (2) rel (2)
0.99999988 2.910e-11 5.960e-08 0.000e+00 0.000e+00
0.99999970 5.821e-11 7.539e-08 0.000e+00 0.000e+00
0.99999899 3.492e-10 2.453e-07 1.164e-10 8.178e-08
0.99999684 2.095e-09 8.337e-07 0.000e+00 0.000e+00
0.99998999 1.118e-08 2.497e-06 0.000e+00 0.000e+00
0.99996835 6.240e-08 7.843e-06 9.313e-10 1.171e-07
0.99989998 3.530e-07 2.496e-05 0.000e+00 0.000e+00
0.99968380 3.818e-07 1.519e-05 0.000e+00 0.000e+00
0.99900001 1.490e-07 3.333e-06 0.000e+00 0.000e+00
0.99683774 8.941e-08 1.125e-06 7.451e-09 9.376e-08
0.99000001 5.960e-08 4.225e-07 0.000e+00 0.000e+00
0.96837723 1.490e-08 5.973e-08 0.000e+00 0.000e+00
0.89999998 2.980e-08 6.837e-08 0.000e+00 0.000e+00
0.68377221 5.960e-08 8.168e-08 5.960e-08 8.168e-08
When cos_theta is not close to one, the accuracy of both methods is very close to each other and to the round-off error.
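To illustrate the aside about small theta, a minimal sketch in float precision (the exact printed values depend on the platform's libm, but cos of anything this small rounds to exactly 1.0f):

#include <cmath>
#include <cstdio>

int main()
{
    // For very small theta, cos(theta) rounds to exactly 1.0f, so any
    // sin-from-cos reconstruction collapses to exactly 0.
    for (float theta : {1e-4f, 1e-5f, 1e-6f}) {
        float c = std::cos(theta);
        float s = std::sqrt((1.0f + c) * (1.0f - c));
        std::printf("theta=%g  cos=%.9g  reconstructed sin=%.9g  true sin=%.9g\n",
                    theta, c, s, std::sin(theta));
    }
    return 0;
}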
[Edited for major think-o] It looks to me like option 2 will be better, because for a number like 0.000001, option 1 will return the sine as 1 while option 2 will return a number just smaller than 1.
No difference in my opinion: (1-x) preserves the precision without affecting the carried bit, and the same is true for (1+x). Then the only thing affecting the carry-bit precision is the multiplication. In both cases there is a single multiplication, so they are both as likely to give the same carry-bit error.

sqrt(1.0 - pow(1.0,2)) returns -nan [duplicate]

This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
Why pow(10,5) = 9,999 in C++
I've found an interesting floating point problem. I have to calculate several square roots in my code, and the expression is like this:
sqrt(1.0 - pow(pos,2))
where pos goes from -1.0 to 1.0 in a loop. The -1.0 is fine for pow, but when pos = 1.0 I get -nan. Doing some tests with gcc 4.4.5 and icc 12.0, the output of
1.0 - pow(pos,2) = -1.33226763e-15
and
1.0 - pow(1.0,2) = 0
or
poss = 1.0
1.0 - pow(poss,2) = 0
Clearly the first one is going to give problems, being negative. Does anyone know why pow is returning a number smaller than 0? The full offending code is below:
#include <cassert>
#include <cmath>
#include <iomanip>
#include <iostream>
using namespace std;

int main() {
    double n_max = 10;
    double a = -1.0;
    double b = 1.0;
    int divisions = int(5 * n_max);
    assert(!(b == a));
    double interval = b - a;
    double delta_theta = interval / divisions;
    double delta_thetaover2 = delta_theta / 2.0;
    double pos = a;
    //for (int i = 0; i < divisions - 1; i++) {
    for (int i = 0; i < divisions + 1; i++) {
        cout << sqrt(1.0 - pow(pos, 2)) << setw(20) << pos << endl;
        if (isnan(sqrt(1.0 - pow(pos, 2)))) {
            cout << "Danger Will Robinson!" << endl;
            cout << sqrt(1.0 - pow(pos, 2)) << endl;
            cout << "pos " << setprecision(9) << pos << endl;
            cout << "pow(pos,2) " << setprecision(9) << pow(pos, 2) << endl;
            cout << "delta_theta " << delta_theta << endl;
            cout << "1 - pow " << 1.0 - pow(pos, 2) << endl;
            double poss = 1.0;
            cout << "1- poss " << 1.0 - pow(poss, 2) << endl;
        }
        pos += delta_theta;
    }
    return 0;
}
When you keep incrementing pos in a loop, rounding errors accumulate and, in your case, the final value ends up greater than 1.0. Instead, calculate pos from the loop index by multiplication on each iteration, so that only a minimal amount of rounding error is introduced; see the sketch below.
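A minimal sketch of that approach, reusing the names from the question (recomputing pos from the loop index means no error accumulates, and pos lands exactly on 1.0 in the last iteration for these values):

#include <cmath>
#include <cstdio>

int main() {
    const double a = -1.0, b = 1.0;
    const int divisions = 50;
    for (int i = 0; i <= divisions; i++) {
        double pos = a + (b - a) * i / divisions;   // derived from i, not accumulated
        std::printf("%g %g\n", std::sqrt(1.0 - pos * pos), pos);
    }
    return 0;
}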
The problem is that floating point calculations are not exact, and 1 - pos^2 may give a small negative result when pos is slightly above 1, yielding an invalid sqrt computation.
Consider capping your result:
double x = 1. - pow(pos, 2.);
result = sqrt(x < 0 ? 0 : x);
or
result = sqrt(fabs(x) < 1e-12 ? 0 : x);
setprecision(9) is going to cause rounding. Use a debugger to see what the value really is. Short of that, at least set the precision beyond the possible size of the type you're using.
You will almost always have rounding errors when calculating with doubles, because the double type has only about 15-16 significant decimal digits (a 53-bit significand), and a lot of decimal numbers are not convertible to binary floating-point numbers without rounding. The IEEE standard contains a lot of effort to keep those errors low, but by principle it cannot always succeed. For a thorough introduction see this document.
In your case, you should calculate pos on each loop iteration and round it to 14 or fewer digits. That should give you a clean 0 for the sqrt.
You can calculate pos inside the loop as
pos = round(a + interval * i / divisions, 14);
with round defined as
double round(double r, int digits)
{
    double multiplier = pow(10, digits);
    return floor(r * multiplier + 0.5) / multiplier;
}