Why am I getting a different result from std::fmod and std::remainder - c++

In the example app below I calculate the floating-point remainder of dividing 953 by 0.1, using std::fmod.
What I was expecting is that since 953.0 / 0.1 == 9530, std::fmod(953, 0.1) == 0.
I'm getting 0.1 - why is this the case?
Note that with std::remainder I get the correct result.
That is:
std::fmod (953, 0.1) == 0.1 // unexpected
std::remainder(953, 0.1) == 0 // expected
Difference between the two functions:
According to cppreference.com
std::fmod calculates the following:
exactly the value x - n*y, where n is x/y with its fractional part truncated
std::remainder calculates the following:
exactly the value x - n*y, where n is the integral value nearest the exact value x/y
Given my inputs I would expect both functions to have the same output. Why is this not the case?
Exemplar app:
#include <iostream>
#include <cmath>

bool is_zero(double in)
{
    return std::fabs(in) < 0.0000001;
}

int main()
{
    double numerator   = 953;
    double denominator = 0.1;
    double quotient    = numerator / denominator;
    double fmod        = std::fmod     (numerator, denominator);
    double rem         = std::remainder(numerator, denominator);

    if (is_zero(fmod))
        fmod = 0;
    if (is_zero(rem))
        rem = 0;

    std::cout << "quotient: " << quotient << ", fmod: " << fmod << ", rem: " << rem << std::endl;
    return 0;
}
Output:
quotient: 9530, fmod: 0.1, rem: 0

Because they are different functions.
std::remainder(x, y) calculates the IEEE remainder, which is x - round(x/y)*y, where round is rounding half to even (so in particular round(1.0/2.0) == 0).
std::fmod(x, y) calculates x - trunc(x/y)*y. The double closest to 0.1 is slightly larger than 0.1, so when you divide 953 by it the exact quotient is slightly smaller than 9530, and truncation gives 9529. As a result you get roughly 953.0 - 952.9 = 0.1.

Welcome to floating-point math. Here's what happens: one tenth cannot be represented exactly in binary, just as one third cannot be represented exactly in decimal. The closest double to 0.1 is slightly larger than one tenth, so the exact quotient 953 / 0.1 is slightly below 9530. Truncating it produces the integer 9529 instead of 9530, and this leaves roughly 0.1 left over.
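A minimal sketch of how to see this directly (assuming IEEE-754 binary64 doubles): printing the values with more digits than the default six shows that the stored 0.1 is slightly above one tenth and the fmod result is slightly below it.

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    std::cout << std::setprecision(20)
              << "0.1 stored as:            " << 0.1                        << '\n'  // slightly above 0.1
              << "std::fmod(953, 0.1):      " << std::fmod(953.0, 0.1)      << '\n'  // slightly below 0.1
              << "std::remainder(953, 0.1): " << std::remainder(953.0, 0.1) << '\n'; // tiny value near zero
}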

Related

How does Cpp work with large numbers in calculations?

I have code that tries to numerically solve the integral of a function over a given interval, using the Trapezoidal Rule (see the formula in Trapezoid method). For the function sin(x) over the interval [-pi/2.0, pi/2.0], the integral is expected to be zero.
In this case, I take the number of partitions n equal to 4. The problem is that when I use pi with 20 decimal places the result is zero, with 14 decimal places it is 8.72e-17, with 11 decimal places it is zero, with 8 decimal places it is 8.72e-17, and with 3 decimal places it is zero. I mean, the integral is zero or a number near zero for the different approximations of pi, but there is no clear trend.
I would appreciate your help in understanding why this happens. (I ran it in Dev-C++.)
#include <iostream>
#include <math.h>
using namespace std;

#define pi 3.14159265358979323846
//Pi: 3.14159265358979323846

double func(double x){
    return sin(x);
}

int main() {
    double x0 = -pi/2.0, xf = pi/2.0;
    int n = 4;
    double delta_x = (xf-x0)/(n*1.0);
    double sum = (func(x0)+func(xf))/2.0;
    double integral;
    for (int k = 1; k<n; k++){
        // cout<<"func: "<<func(x0+(k*delta_x))<<" "<<"last sum: "<<sum<<endl;
        sum = sum + func(x0+(k*delta_x));
        // cout<<"func + last sum= "<<sum<<endl;
    }
    integral = delta_x*sum;
    cout<<"The value for the integral is: "<<integral<<endl;
    return 0;
}
OP is integrating y = sin(x) from -a to +a. The various tests use different values of a, all near pi/2.
The approach uses a linear summation of values that start near -1.0, pass through 0, and end near 1.0.
This summation is sensitive to calculation error in the last terms, as the final mathematical sum is expected to be 0.0. Since the start/end value a varies, the error varies.
A more stable result would be had by adding the extreme sin(x(k)) values first, e.g. sum += sin(x(k=1)), then sum += sin(x(k=3)), then sum += sin(x(k=2)), rather than k = 1, 2, 3. In particular, the formation of the term x(k=3) is likely a bit off from the negative of its earlier counterpart x(k=1), further compounding the issue.
Welcome to the world of numerical analysis.
The problem exists if the code uses all float or all long double, just to different degrees.
The problem is not due to using an inexact value of pi (an exact value is impossible with FP, as pi is irrational and all finite FP values are rational).
Much is due to the formation of x. Code could try the below to form the x values symmetrically about 0.0, then compare the exact x generated this way with the x generated the original way.
x = (x0 + xf)/2 + ((k - n/2)*delta_x)
Print out the exact values computed for deeper understanding.
printf("x:%a y:%a\n", x0+(k*delta_x), func(x0+(k*delta_x)));

comparing equal calculation outputs for floating point numbers

The outputs of two calculations are supposed to be the same, as described below, but even after taking machine precision into account they come out unequal. What would be a way around this to get them to compare equal?
#include <iostream>
#include <limits>
#include <math.h>

bool definitelyGreaterThan(double a, double b, double epsilon)
{
    return (a - b) > ( (std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}

bool definitelyLessThan(double a, double b, double epsilon)
{
    return (b - a) > ( (std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}

int main ()
{
    double fig1, fig2;
    double m1 = 235.60242, m2 = 126.734781;
    double n1 = 4.2222, n2 = 2.1111;
    double p1 = 1.245, p2 = 2.394;

    fig1 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2;

    m1 = 1.2*m1, m2 = 1.2*m2; // both scaled equally, numerator and denominator
    n1 = 2.0*n1, n2 = 2.0*n2;
    p1 = 3.0*p1, p2 = 3.0*p2;

    fig2 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2; // same as above

    double epsilon = std::numeric_limits<double>::epsilon();
    std::cout << "\n fig1 " << fig1 << " fig2 " << fig2 << " difference " << fig1 - fig2 << " epl " << epsilon;
    std::cout << "\n if(fig1 < fig2) " << definitelyLessThan(fig1, fig2, epsilon)
              << "\n if(fig1 > fig2) " << definitelyGreaterThan(fig1, fig2, epsilon) << "\n";
}
with output as -
fig1 0.597738 fig2 0.597738 difference 8.88178e-16 epl 2.22045e-16
if(fig1 < fig2) 0
if(fig1 > fig2) 1
The difference between the two numbers is greater than the machine precision.
The key question is whether there is any universal method to deal with such aspects, or whether the solution has to be application dependent.
There are two things to consider:
First, the possible rounding error (introduced by limited machine precision) scales with the number of operations that were used to calculate the value. For example, storing the result of m1/m2 might introduce some rounding error (depending on the actual value), and multiplying this value with something also multiplies that rounding error and adds another possible rounding error on top of that (by storing that result).
Second, floating point values are not stored linearly (instead, they use an exponent-and-mantissa format): the bigger a value actually is, the bigger the difference between that value and the next representable value (and therefore also the possible rounding error). std::numeric_limits<T>::epsilon() only states the difference between 1.0 and the next representable value, so if your value is not exactly 1.0, this epsilon does not exactly represent the machine precision (meaning: the difference between this value and the next representable one) for that value.
So, to answer your question: the solution is to select an application-dependent, reasonable maximum rounding error that is allowed for two values to still be considered equal. Since this allowed rounding error depends both on the expected values and on the number of operations (as well as on what's acceptable for the application itself, of course), a truly universal solution is not possible.
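A minimal sketch of that idea (the helper name and the tolerance budget are illustrative choices, not a prescription): scale the allowed error by the magnitude of the operands, and pick the budget from how many rounding steps the values are expected to have gone through.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

// Hypothetical helper: a and b are treated as equal if they differ by at most
// `ulps` representable steps at their magnitude. The right budget is
// application dependent; long computations may need a much larger one.
bool nearlyEqual(double a, double b, double ulps)
{
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= ulps * std::numeric_limits<double>::epsilon() * scale;
}

int main()
{
    double fig1 = 0.597738;             // illustrative values, close together
    double fig2 = fig1 + 8.88e-16;      // differ by a few representable steps
    std::cout << std::boolalpha
              << nearlyEqual(fig1, fig2, 1.0)  << '\n'   // false: a 1-ulp budget is too strict here
              << nearlyEqual(fig1, fig2, 16.0) << '\n';  // true: a looser budget accepts them as equal
}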

Precision issues when converting a decimal number to its rational equivalent

I have a problem converting a double (say N) to p/q form (rational form). For this I have the following strategy:
Multiply the double N by a large number, say k = 10^10,
then p = N*k and q = k.
Take gcd(p,q) and find p = p/gcd(p,q) and q = q/gcd(p,q).
When N = 8.2 the answer is correct if we solve it using pen and paper, but since 8.2 is represented as roughly 8.19999999 in N (a double), it causes a problem in its rational-form conversion.
I tried doing it another way (I used a large number 10^k instead of 100):
if(abs(y*100 - round(y*100)) < 0.000001) y = round(y*100)/100
But this approach also doesn't give the right representation all the time.
Is there any way I could carry out the equivalent conversion from double to p/q?
Floating point arithmetic is very difficult. As has been mentioned in the comments, part of the difficulty is that you need to represent your numbers in binary.
For example, the number 0.125 can be represented exactly in binary:
0.125 = 2^-3 = 0b0.001
But the number 0.12 cannot.
To 11 significant figures:
0.12 = 0b0.00011110101
If this is converted back to a decimal then the error becomes obvious:
0b0.00011110101 = 0.11962890625
So if you write:
double a = 0.2;
What the machine actually does is find the closest binary representation of 0.2 that it can hold within a double data type. This is an approximation since as we saw above, 0.2 cannot be exactly represented in binary.
One possible approach is to define an 'epsilon' which determines how close your number can be to the nearest representable binary floating point.
Here is a good article on floating points:
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
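A minimal sketch of that epsilon idea, along the lines the question already attempts (the power of ten and the tolerance are illustrative choices, and this is only a heuristic, not a robust solution): snap N to the nearest decimal fraction with a fixed number of digits if it is close enough, then reduce by the gcd.

#include <cmath>
#include <cstdint>
#include <iostream>
#include <numeric>

int main()
{
    const double N = 8.2;            // stored as the nearest double, roughly 8.1999999999999993
    const int64_t k = 100000;        // 10^5: assume at most 5 decimal digits matter
    const double eps = 1e-6;         // how far N*k may be from a whole number

    const double scaled = N * static_cast<double>(k);
    if (std::fabs(scaled - std::round(scaled)) < eps) {
        int64_t p = std::llround(scaled);
        int64_t q = k;
        const int64_t g = std::gcd(p, q);
        std::cout << N << " ~= " << p / g << "/" << q / g << '\n';   // prints: 8.2 ~= 41/5
    } else {
        std::cout << N << " is not close to a 5-digit decimal fraction\n";
    }
    return 0;
}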
have problem of converting a double (say N) to p/q form
... when N = 8.2
A typical double cannot encode 8.2 exactly. Instead the closest representable double is about
8.19999999999999928945726423989981412887573...
8.20000000000000106581410364015027880668640... // next closest
When code does
double N = 8.2;
It will be the 8.19999999999999928945726423989981412887573... that is converted into rational form.
Converting a double to p/q form:
Multiply double N by a large number say $k = 10^{10}$
This may overflow the double. The first step should be to determine if the double is large, in which case it is a whole number.
Do not multiply by some power of 10, as double certainly uses a binary encoding. Multiplication by 10, 100, etc. may introduce round-off error.
C implementations of double overwhelmingly use a binary encoding, so that FLT_RADIX == 2.
Then every finite double x has a significand that is a fraction of some integer over some power of 2: a binary fraction of DBL_MANT_DIG digits (per @Richard Critten). This is often 53 binary digits.
Determine the exponent of the double. If it is large enough, or if x == 0.0, the double is a whole number.
Otherwise, scale a numerator and denominator up by a power of 2 so the numerator becomes a DBL_MANT_DIG-digit whole number. While the numerator is even, halve both the numerator and denominator. As the denominator is a power of 2, no other prime values need to be considered for simplification.
#include <float.h>
#include <math.h>
#include <stdio.h>

void form_ratio(double x) {
    double numerator = x;
    double denominator = 1.0;
    if (isfinite(numerator) && x != 0.0) {
        int expo;
        frexp(numerator, &expo);
        if (expo < DBL_MANT_DIG) {
            expo = DBL_MANT_DIG - expo;
            numerator = ldexp(numerator, expo);
            denominator = ldexp(1.0, expo);
            while (fmod(numerator, 2.0) == 0.0 && denominator > 1.0) {
                numerator /= 2.0;
                denominator /= 2.0;
            }
        }
    }
    int pre = DBL_DECIMAL_DIG;
    printf("%.*g --> %.*g/%.*g\n", pre, x, pre, numerator, pre, denominator);
}

int main(void) {
    form_ratio(123456789012.0);
    form_ratio(42.0);
    form_ratio(1.0 / 7);
    form_ratio(867.5309);
}
Output
123456789012 --> 123456789012/1
42 --> 42/1
0.14285714285714285 --> 2573485501354569/18014398509481984
867.53089999999997 --> 3815441248019913/4398046511104

Why does (1UL << 53) plus 1.0 not change the value?

Say
int64_t x = (1UL << 53);
cout << x << endl;
x += 1.0;
cout << x << endl;
The result of x is the same, which is '9007199254740992'.
However, x += 1; increments x correctly.
Moreover, for 1UL << 52, adding 1.0 does give the correct result.
I think it could be floating-point imprecision. Could someone give me more details on that?
The line x+= 1.0 is evaluated as
x = (int64_t)((double)x + (double)1.0);
The number 2^53 + 1 = 9007199254740993 can't be represented exactly as IEEE double, so it's rounded to 2^53 = 9007199254740992 (this depends on the current rounding mode, actually) which is then (losslessly) converted to an int64_t.
x+= 1.0;
The expression x + 1.0 is done with floating point arithmetic.
Assuming IEEE-754 is used, the double precision floating point type can represent integers at most 2^53 precisely.
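A small sketch of the cutoff (assuming IEEE-754 binary64): below 2^53 every integer has its own double, so adding 1.0 is exact; from 2^53 upwards adjacent doubles are 2 apart, so 2^53 + 1 rounds back to 2^53.

#include <cstdint>
#include <iostream>

int main()
{
    int64_t a = int64_t{1} << 52;
    int64_t b = int64_t{1} << 53;

    std::cout << std::boolalpha
              << (static_cast<double>(a) + 1.0 == static_cast<double>(a + 1)) << '\n'  // true: 2^52 + 1 is representable
              << (static_cast<double>(b) + 1.0 == static_cast<double>(b))     << '\n'; // true: 2^53 + 1 rounds to 2^53
}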

Division by zero prevention: Checking the divisor's expression doesn't result in zero vs. checking the divisor isn't zero?

Is division by zero possible in the following case due to the floating point error in the subtraction?
float x, y, z;
...
if (y != 1.0)
    z = x / (y - 1.0);
In other words, is the following any safer?
float divisor = y - 1.0;
if (divisor != 0.0)
    z = x / divisor;
Assuming IEEE-754 floating-point, they are equivalent.
It is a basic theorem of FP arithmetic that for finite x and y, x - y == 0 if and only if x == y, assuming gradual underflow.
If subnormal results are flushed to zero (instead of gradual underflow), this theorem holds only if the result x - y is normal. Because 1.0 is well scaled, y - 1.0 is never subnormal, and so y - 1.0 is zero if and only if y is exactly 1.0, regardless of how underflow is handled.
C++ doesn't guarantee IEEE-754, of course, but the theorem is true for most "reasonable" floating-point systems.
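A quick illustration of that theorem (assuming IEEE-754 single precision): even the floats immediately adjacent to 1.0 still produce a nonzero difference when 1.0 is subtracted.

#include <cmath>
#include <iostream>

int main()
{
    float y = std::nextafterf(1.0f, 2.0f);   // smallest float strictly greater than 1.0f
    std::cout << std::boolalpha
              << (y != 1.0f) << ' ' << (y - 1.0f != 0.0f) << '\n';   // true true

    y = std::nextafterf(1.0f, 0.0f);         // largest float strictly less than 1.0f
    std::cout << (y != 1.0f) << ' ' << (y - 1.0f != 0.0f) << '\n';   // true true
}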
This will prevent you from dividing by exactly zero, however that does not mean you still won't end up with +/-inf as a result. The denominator could still be small enough that the answer is not representable as a double, and you will end up with inf. For example:
#include <iostream>
#include <limits>

int main(int argc, char const *argv[])
{
    double small = std::numeric_limits<double>::epsilon();
    double large = std::numeric_limits<double>::max() / small;
    std::cout << "small: " << small << std::endl;
    std::cout << "large: " << large << std::endl;
    return 0;
}
In this program small is non-zero, but it is so small that large exceeds the range of double and is inf.
There is no difference between the two code snippets - in fact, the optimizer could even optimize both fragments to the same binary code, assuming that there are no further uses of the divisor variable.
Note, however, that division by a floating point zero 0.0 does not result in a run-time error, but produces an inf or -inf instead.
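A tiny sketch of that behaviour (assuming IEEE-754 arithmetic and untrapped floating-point exceptions): dividing a finite nonzero value by 0.0 quietly produces an infinity, and 0.0 / 0.0 produces a NaN, rather than a run-time error.

#include <iostream>

int main()
{
    double x = 1.0, zero = 0.0;
    std::cout << x / zero    << '\n'   // inf
              << -x / zero   << '\n'   // -inf
              << zero / zero << '\n';  // nan (the exact spelling is platform dependent)
}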