comparing equal calculation outputs for floating point numbers - c++

The outputs of two calculations are supposed to be the same, as described below, but even after taking machine precision into account, they come out unequal. What would be a way around this to get them to compare equal?
#include <iostream>
#include <limits>
#include <cmath>

bool definitelyGreaterThan(double a, double b, double epsilon)
{
    return (a - b) > ((std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}

bool definitelyLessThan(double a, double b, double epsilon)
{
    return (b - a) > ((std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}

int main()
{
    double fig1, fig2;
    double m1 = 235.60242, m2 = 126.734781;
    double n1 = 4.2222, n2 = 2.1111;
    double p1 = 1.245, p2 = 2.394;
    fig1 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2;

    m1 = 1.2*m1, m2 = 1.2*m2; // both scaled equally, numerator and denominator
    n1 = 2.0*n1, n2 = 2.0*n2;
    p1 = 3.0*p1, p2 = 3.0*p2;
    fig2 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2; // same expression as above

    double epsilon = std::numeric_limits<double>::epsilon();
    std::cout << "\n fig1 " << fig1 << " fig2 " << fig2 << " difference " << fig1 - fig2 << " epl " << epsilon;
    std::cout << "\n if(fig1 < fig2) " << definitelyLessThan(fig1, fig2, epsilon)
              << "\n if(fig1 > fig2) " << definitelyGreaterThan(fig1, fig2, epsilon) << "\n";
}
with output:
fig1 0.597738 fig2 0.597738 difference 8.88178e-16 epl 2.22045e-16
if(fig1 < fig2) 0
if(fig1 > fig2) 1
The difference between the two numbers is greater than machine precision.
The key question is whether there is any universal method to deal with such cases, or whether the solution has to be application-dependent.

There are two things to consider:
First, the possible rounding error (introduced by limited machine precision) scales with the number of operations used to calculate the value. For example, storing the result of m1/m2 might introduce some rounding error (depending on the actual value); multiplying this value by something multiplies that rounding error, and storing that result can add another possible rounding error on top of it.
Second, floating point values are not distributed linearly (they use an exponent-and-mantissa format): the bigger a value actually is, the bigger the difference between that value and the next representable value (and therefore also the possible rounding error). std::numeric_limits<T>::epsilon() only states the difference between 1.0 and the next representable value, so if your value is not exactly 1.0, this epsilon does not exactly represent the machine precision (meaning: the difference between your value and the next representable one).
So, to answer your question: the solution is to select an application-dependent, reasonable maximum rounding error that is allowed for two values to still be considered equal. Since this allowed rounding error depends both on the expected values and on the number of operations (as well as on what's acceptable for the application itself, of course), a truly universal solution is not possible.
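To make the accepted tolerance explicit, here is a minimal sketch of such an application-dependent comparison (my illustration, not part of the original answer; the tolFactor budget of 10 rounding steps is an assumption you would tune per application):

#include <algorithm>
#include <cmath>
#include <limits>

// Sketch: treat a and b as equal if they differ by at most tolFactor
// machine epsilons, scaled to the magnitude of the larger input.
// tolFactor is a hypothetical knob: a rough budget for the number of
// rounding operations that produced a and b.
bool approximatelyEqual(double a, double b, double tolFactor = 10.0)
{
    double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= tolFactor * std::numeric_limits<double>::epsilon() * scale;
}

With the question's numbers (a difference of 8.88e-16 against magnitudes around 0.6, where one epsilon-scaled step is about 1.3e-16), a budget of 10 steps classifies fig1 and fig2 as equal; choosing that budget is exactly the application-dependent decision described above.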

Related

Simpson's Composite Rule giving too large values when n is very large

I'm using Simpson's Composite Rule to calculate the integral of 1/ln(x) from 2 to 1,000. However, when using a large n (usually around 500,000), I start to get results that differ from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. I'm wondering why this is, since as n gets larger, the accuracy is supposed to increase, not decrease.
#include <iostream>
#include <cmath>

// the function f(x)
float f(float x)
{
    return (float) 1 / std::log(x);
}

float my_simpson(float a, float b, long int n)
{
    if (n % 2 == 1) n += 1; // since n has to be even
    float area, h = (b-a)/n;
    float x, y, z;
    for (int i = 1; i <= n/2; i++)
    {
        x = a + (2*i - 2)*h;
        y = a + (2*i - 1)*h;
        z = a + 2*i*h;
        area += f(x) + 4*f(y) + f(z);
    }
    return area*h/3;
}

int main()
{
    std::cout.precision(20);
    int upperBound = 1'000;
    int subsplits = 1'000'000;
    float approx = my_simpson(2, upperBound, subsplits);
    std::cout << "Output: " << approx << std::endl;
    return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real (in the mathematical sense) number, a float has limited precision.
A typical IEEE 754 32-bit (single precision) floating-point number's binary representation dedicates only 24 bits (one of which is implicit) to the mantissa, and that translates to roughly less than 8 significant decimal digits (please take this as a gross simplification).
A double, on the other hand, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations these days.
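These limits can be queried directly; a quick sketch (the values in the comments assume IEEE-754 float and double):

#include <iostream>
#include <limits>

int main()
{
    // significand bits and guaranteed decimal digits:
    std::cout << std::numeric_limits<float>::digits    << ' '   // 24
              << std::numeric_limits<float>::digits10  << '\n'; // 6
    std::cout << std::numeric_limits<double>::digits   << ' '   // 53
              << std::numeric_limits<double>::digits10 << '\n'; // 15
}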
since as n gets larger, the accuracy is supposed to increase and not decrease.
Unfortunately, that's not how it works. There's a sweet spot, but after that the accumulation of rounding errors prevails and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, due to the fact that area becomes much greater than f(x) + 4*f(y) + f(z) (e.g. 224678.937 vs. 0.3606823). The bigger n is, the sooner this becomes relevant, making the result diverge from the real one.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).
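Beyond switching to double, a standard mitigation for this kind of accumulation (my sketch, not from the original answer; it reuses f from the question's code) is compensated (Kahan) summation, which carries along the low-order bits that a plain += would discard:

// Sketch: Simpson's rule with a Kahan-compensated accumulator.
// Accumulating in double while f still works in float already
// removes most of the drift; the compensation removes more.
double my_simpson_kahan(float a, float b, long int n)
{
    if (n % 2 == 1) n += 1;
    double h = (b - a) / static_cast<double>(n);
    double area = 0.0, comp = 0.0; // running sum and lost low-order bits
    for (long int i = 1; i <= n / 2; i++)
    {
        double x = a + (2*i - 2)*h;
        double y = a + (2*i - 1)*h;
        double z = a + 2*i*h;
        double term = f(x) + 4*f(y) + f(z);
        double t = term - comp; // re-inject what was lost last time
        double s = area + t;    // big + small: low bits of t fall off
        comp = (s - area) - t;  // recover exactly what fell off
        area = s;
    }
    return area * h / 3.0;
}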

Returning 'inf' when finite result is expected

The outputs are not what Python/Mathematica previously calculated when calling test_func(3000, 10). Instead, my code returns inf, and I am not sure why.
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <chrono>
#include <cmath>

using namespace std;

// I0
double I0 (double y0, double y)
{
    return (
        log( (exp(y0+y)-1)/(exp(y0-y)-1) ) - y
    );
}

// test_func
double test_func (double k, double M)
{
    double k0 = sqrt(pow(k, 2.0)+pow(M, 2.0));
    cout << "k0: " << k0 << endl;
    double Izero = I0(k0, k);
    cout << "I0: " << Izero << endl;
    double k3 = pow(k, 3.0);
    cout << "k3: " << k3 << endl;
    return (
        Izero/k3
    );
}

int main ()
{
    cout << test_func(3000, 10) << endl;
    return 0;
}
The output I get is
k0: 3000.02
I0: inf
k3: 2.7e+10
inf
but I0 should be 3004.1026690762033, while the result of the function should be test_func(3000, 10)=1.1126306181763716e-07. I am puzzled. Do you know what is wrong with it? I am a C++ beginner, so any help is very welcome.
double has a limited range. On most modern implementations, doubles are in the standard IEEE floating point binary64 format, with range up to about 1.8e308. Anything bigger (such as your exp(6000)) rounds up to infinity.
Switching to a larger type like long double may not be the best idea. Though the range of floating point types roughly grows exponentially with size, the fact that you use the exp function makes it easy to defeat the extra precision. E.g. on an implementation with 80-bit long double (with range up to about 1.2e4932), modifying test_func and I0 to use long double still fails to evaluate test_func(5700, 10).
It is instead possible to redesign I0 to avoid huge numbers. Let's start by splitting the log.
double I0(double y0, double y) {
    return log(exp(y0+y)-1) - log(exp(y0-y)-1) - y;
}
When you're computing log(exp(y0+y) - 1), if exp(y0+y) gives infinity you can recover the computation by using y0+y instead of the log. I.e. we're ignoring the -1, because if our numbers are that large the precision of double isn't enough to actually register the difference in the final result. Also, you may want to replace both exp(x) - 1s with expm1. This is because when x is close to 0, exp(x) - 1 will tend to lose precision. E.g. exp(1e-16) - 1 == 0 but expm1(1e-16) > 0 assuming IEEE. I suspect that's not as important to you.
double I0(double y0, double y) {
    double num = expm1(y0 + y), den = expm1(y0 - y);
    num = isinf(num) ? y0 + y : log(num);
    den = isinf(den) ? y0 - y : log(den); // though I suspect the domain of I0 is such that you don't actually need this
    return num - den - y;
}
This is only a very rudimentary correction. Squeezing the most correctness out of floating point is very difficult in general and is its own whole field of programming. However, this is enough to make your case work, and even works on those large inputs where the naive long double also fails. (I'm no expert, but I also notice that y0-y is itself problematic, since y0 is apparently chosen to be close to y. Subtraction between them loses precision that may (or may not) destroy the result.)
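As an illustration of that last point (my own sketch, not part of the original answer): in test_func we have y0 = sqrt(k*k + M*M) and y = k, so the difference y0 - y can be rewritten without cancellation using the identity a - b = (a*a - b*b) / (a + b):

#include <cmath>

// Sketch: compute sqrt(k*k + M*M) - k without catastrophic cancellation.
// Since (y0 - k)(y0 + k) = y0*y0 - k*k = M*M exactly, we can divide
// instead of subtracting two nearly equal numbers.
double stable_diff(double k, double M)
{
    double y0 = std::sqrt(k*k + M*M);
    return (M * M) / (y0 + k); // == y0 - k, but accurate even for k >> M
}

For k = 3000, M = 10 this gives about 0.0166666... at full precision, whereas computing y0 - k directly has already lost several significant digits to cancellation.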
If you don't want to deal with rewriting your formulas to fit within the limitations of floating point (completely understandable, given the potential for bugs!), I would suggest following #M.M's advice and using an arbitrary-precision math library like MPFR. That's similar to replacing double with long double, but now you can keep throwing bits at the problems until they go away whereas you will eventually run out of built-in floating point types.

C/C++ compare to NaN (different behaviors on different floating point models)

Here is a little test code:
#include <algorithm>
#include <iostream>

int main()
{
    float zeroF = 0.f;
    float naNF = 0.f / zeroF;
    float minimumF = std::min(1.0f, naNF);
    std::cout << "MinimumF " << minimumF << std::endl;

    double zeroD = 0.0;
    double naND = 0.0 / zeroD;
    double minimumD = std::min(1.0, naND);
    std::cout << "MinimumD " << minimumD << std::endl;
}
I executed the code on VS2013.
On the precise model (/fp:precise) the outputs are always "1".
On the fast model (/fp:fast) the outputs will be "NaN" (-1.#IND) if optimization is enabled (/O2) and "1" if optimization is disabled (/Od).
First, what should be the right output according to IEEE 754?
(I read the docs and googled different articles like What is the rationale for all comparisons returning false for IEEE754 NaN values?, and it seems that the right output should be NaN and not 1, but maybe I am wrong.)
Secondly, how does the fast model optimization change the output so drastically?
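For reference on the first question (my illustration; the thread itself contains no answer): the standard specifies std::min(a, b) to behave like (b < a) ? b : a, and every ordered comparison involving a NaN is false under IEEE 754, so a strict floating-point model must return the first argument here, i.e. 1:

#include <algorithm>
#include <cmath>
#include <iostream>

// Sketch of the behavior the standard requires of std::min:
// exactly one comparison, returning b only when b < a.
template <typename T>
T min_like_std(T a, T b)
{
    return (b < a) ? b : a; // NaN < x is false, so a NaN in b is ignored
}

int main()
{
    float nan = std::nanf("");
    std::cout << min_like_std(1.0f, nan) << '\n'; // 1   (NaN as 2nd argument)
    std::cout << min_like_std(nan, 1.0f) << '\n'; // NaN (NaN as 1st argument)
}

Under /fp:fast the compiler is allowed to assume NaNs never occur and may rewrite or reorder the comparison, which is why the optimized build can produce a different result.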

Why am I getting a different result from std::fmod and std::remainder

In the below example app I calculate the floating point remainder from dividing 953 by 0.1, using std::fmod
What I was expecting is that, since 953.0 / 0.1 == 9530, std::fmod(953, 0.1) == 0
I'm getting 0.1 - why is this the case?
Note that with std::remainder I get the correct result.
That is:
std::fmod (953, 0.1) == 0.1 // unexpected
std::remainder(953, 0.1) == 0 // expected
Difference between the two functions:
According to cppreference.com
std::fmod calculates the following:
exactly the value x - n*y, where n is x/y with its fractional part truncated
std::remainder calculates the following:
exactly the value x - n*y, where n is the integral value nearest the exact value x/y
Given my inputs I would expect both functions to have the same output. Why is this not the case?
Exemplar app:
#include <iostream>
#include <cmath>

bool is_zero(double in)
{
    return std::fabs(in) < 0.0000001;
}

int main()
{
    double numerator = 953;
    double denominator = 0.1;
    double quotient = numerator / denominator;
    double fmod = std::fmod(numerator, denominator);
    double rem = std::remainder(numerator, denominator);
    if (is_zero(fmod))
        fmod = 0;
    if (is_zero(rem))
        rem = 0;
    std::cout << "quotient: " << quotient << ", fmod: " << fmod << ", rem: " << rem << std::endl;
    return 0;
}
Output:
quotient: 9530, fmod: 0.1, rem: 0
Because they are different functions.
std::remainder(x, y) calculates IEEE remainder which is x - (round(x/y)*y) where round is rounding half to even (so in particular round(1.0/2.0) == 0)
std::fmod(x, y) calculates x - trunc(x/y)*y. The exact quotient of 953 by the stored value of 0.1 is slightly smaller than 9530, so truncation gives 9529. As the result you get 953.0 - 952.9... ≈ 0.1
Welcome to floating point math. Here's what happens: one tenth cannot be represented exactly in binary, just as one third cannot be represented exactly in decimal. As a result, the exact division produces a result slightly below 9530. Truncation produces the integer 9529 instead of 9530, and this leaves roughly 0.1 over.
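To see this directly, here is a small sketch (the digits in the comments are what a typical IEEE-754 binary64 implementation produces):

#include <cmath>
#include <cstdio>

int main()
{
    // The double nearest to one tenth is slightly ABOVE it:
    std::printf("%.25f\n", 0.1);                   // 0.1000000000000000055511151...
    // So the exact quotient 953 / 0.1000...055 is just under 9530; fmod
    // truncates it to 9529 and the remainder lands just under 0.1:
    std::printf("%.17f\n", std::fmod(953.0, 0.1)); // ~0.09999999999947103
}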

Division by zero prevention: Checking the divisor's expression doesn't result in zero vs. checking the divisor isn't zero?

Is division by zero possible in the following case due to the floating point error in the subtraction?
float x, y, z;
...
if (y != 1.0)
    z = x / (y - 1.0);
In other words, is the following any safer?
float divisor = y - 1.0;
if (divisor != 0.0)
    z = x / divisor;
Assuming IEEE-754 floating-point, they are equivalent.
It is a basic theorem of FP arithmetic that for finite x and y, x - y == 0 if and only if x == y, assuming gradual underflow.
If subnormal results are flushed to zero (instead of gradual underflow), this theorem holds only if the result x - y is normal. Because 1.0 is well scaled, y - 1.0 is never subnormal, and so y - 1.0 is zero if and only if y is exactly 1.0, regardless of how underflows are handled.
C++ doesn't guarantee IEEE-754, of course, but the theorem is true for most "reasonable" floating-point systems.
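A quick illustration of that theorem (my sketch, assuming IEEE-754 binary32): the closest float above 1.0f differs from it by 2^-23, a comfortably normal quantity, so the subtraction cannot collapse to zero unless the values are truly equal:

#include <cmath>
#include <iostream>

int main()
{
    // The closest float above 1.0f differs from it by 2^-23 (~1.19e-7),
    // so y - 1.0f is nonzero exactly when y != 1.0f.
    float y = std::nextafter(1.0f, 2.0f);
    std::cout << (y != 1.0f) << ' ' << ((y - 1.0f) != 0.0f) << '\n'; // 1 1
}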
This will prevent you from dividing by exactly zero; however, that does not mean you still won't end up with +/-inf as a result. The denominator could still be small enough that the answer is not representable as a double, and you will end up with an inf. For example:
#include <iostream>
#include <limits>

int main(int argc, char const *argv[])
{
    double small = std::numeric_limits<double>::epsilon();
    double large = std::numeric_limits<double>::max() / small;
    std::cout << "small: " << small << std::endl;
    std::cout << "large: " << large << std::endl;
    return 0;
}
In this program small is non-zero, but it is so small that large exceeds the range of double and is inf.
There is no difference between the two code snippets; in fact, the optimizer could even compile both fragments to the same binary code, assuming that there are no further uses of the divisor variable.
Note, however, that division by a floating point zero 0.0 does not result in a run-time error, but produces an inf or -inf instead.
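A small demonstration of that last point (my sketch; strictly speaking, division by zero is undefined behavior in the abstract C++ machine, but on IEEE-754 implementations it yields infinities and NaN rather than a crash):

#include <cmath>
#include <iostream>

int main()
{
    double zero = 0.0;
    double pos = 1.0 / zero;    // +inf, no run-time error
    double neg = -1.0 / zero;   // -inf
    double indet = zero / zero; // NaN: 0/0 has no meaningful value
    std::cout << pos << ' ' << neg << ' ' << indet << '\n';
    std::cout << std::isinf(pos) << ' ' << std::isnan(indet) << '\n'; // 1 1
}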