Exact double division - c++

Consider the following function:
auto f(double a, double b) -> int
{
return std::floor(a/b);
}
So I want to compute the largest integer k such that k * b <= a in a mathematical sense.
As there could be rounding errors, I am unsure whether the above function really computes this k. I do not worry about the case that k could be out of range.
What is the proper way to determine this k for sure?

It depends on how strict you are. Take a double b and an integer n, and calculate a = b*n in double precision. The product will be rounded. If a is rounded down, then it is less than the mathematical value of n*b, and a/b is mathematically less than n. You will get a result of n instead of n-1.
On the other hand, a == b*n will be true. So the “correct” result could be surprising.
Your condition was that “kb <= a”. If we interpret this as “the result of multiplying k by b using double precision is <= a”, then you’re fine. If we interpret it as “the mathematically exact product of k and b is <= a”, then you need to calculate k*b - a using the fma function and check the result. This will tell you the truth, but might return a result of 4 if a was calculated as 5.0 * b and was rounded down.
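A minimal sketch of the fma check, under some assumptions that are not in the question: a >= 0, b > 0, and a/b small enough that k is exactly representable as a double. The name exact_floor_div is hypothetical. The idea is that std::fma(k, b, -a) rounds the exact value of k*b - a only once, so its sign can be trusted, and the first guess from std::floor is corrected by at most a step or two.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the original post): largest integer k with
// k * b <= a in exact arithmetic. Assumes a >= 0, b > 0 and a / b < 2^53.
double exact_floor_div(double a, double b)
{
    assert(a >= 0.0 && b > 0.0);
    double k = std::floor(a / b);   // first guess, may be off by one

    // fma rounds the exact value of k*b - a once, so the sign of the
    // result is the sign of the exact difference.
    while (k > 0.0 && std::fma(k, b, -a) > 0.0)
        k -= 1.0;                   // guess too large: k*b > a exactly
    while (std::fma(k + 1.0, b, -a) <= 0.0)
        k += 1.0;                   // guess too small: (k+1)*b <= a exactly
    return k;
}
```

For a = 1.0 and b = 0.1, the stored b is slightly larger than one tenth, so ten copies of b exceed a: the sketch returns 9 where std::floor(a / b) gives 10.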

The problem is that floating-point division is not exact.
a/b can give 1.9999 instead of 2, and std::floor can then give 1.
One simple solution is to add a small value prior to calling std::floor:
std::floor (a/b + 1.0e-10);
Result:
result = 10 while 11 was expected
With eps added, result = 11
Test code:
#include <iostream>
#include <cmath>
int main () {
double b = atan (1.0);
int x = 11;
double a = x * b;
int y = std::floor (a/b);
std::cout << "result = " << y << " while " << x << " was expected\n";
double eps = 1.0e-10;
int z = std::floor (a/b + eps);
std::cout << "With eps added, result = " << z << "\n";
return 0;
}

Related

Euler's number with stop condition

original outdated code:
Write an algorithm that compute the Euler's number until
My professor from Algorithms course gave me the following homework:
Write a C/C++ program that calculates the value of the Euler's number (e) with a given accuracy of eps > 0.
Hint: The number e = 1 + 1/1! + 1/2! + ... + 1/n! + ... = 2.7182... can be calculated as the sum of elements of the sequence x_0, x_1, x_2, ..., where x_0 = 1, x_1 = 1 + 1/1!, x_2 = 1 + 1/1! + 1/2!, ...; the summation continues as long as the condition |x_(i+1) - x_i| >= eps is valid.
As he further explained, eps is the precision of the algorithm. For example, the precision could be 1/100. Here |x_(i+1) - x_i| is the absolute value of (x_(i+1) - x_i).
Currently, my program looks in the following way:
#include<iostream>
#include<cstdlib>
#include<math.h>
// Euler's number
using namespace std;
double factorial(double n)
{
double result = 1;
for(double i = 1; i <= n; i++)
{
result = result*i;
}
return result;
}
int main()
{
long double euler = 2;
long double counter = 2;
long double epsilon = 1.0/1000;
long double moduloDifference;
do
{
euler+= 1 / factorial(counter);
counter++;
moduloDifference = (euler + 1 / factorial(counter+1) - euler);
} while(moduloDifference >= epsilon);
printf("%.35Lf ", euler );
return 0;
}
Issues:
It seems my epsilon value does not work properly. It is supposed to control the precision. For example, when I want a precision of 5 digits, I initialize it to 1.0/10000, and it outputs 3 digits before they get truncated after 8 (.7180).
When I use the long double data type and epsilon = 1/10000, my epsilon gets the value 0, and my program runs infinitely. Yet if I change the data type from long double to double, it works. Why does epsilon become 0 when using the long double data type?
How can I optimize the algorithm for finding Euler's number? I know I can get rid of the function and calculate Euler's number on the fly, but after each attempt to do that, I get other errors.
One problem with computing Euler's number this way is pretty simple: you're starting with some fairly large numbers, but since the denominator in each term is N!, the amount added by each successive term shrinks very quickly. Using naive summation, you quickly reach a point where the value you're adding is small enough that it no longer affects the sum.
In the specific case of Euler's number, since the terms constantly decrease, one way we can deal with them quite a bit better is to compute and store all the terms, then add them up in reverse order.
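The reverse-order idea could be sketched roughly as follows. The function name euler_reverse_sum and its epsilon parameter are assumptions made for illustration. Each term is derived from the previous one instead of recomputing the factorial, and the stored terms are then added from smallest to largest.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: collects the terms 1/k! that are >= epsilon,
// then sums them starting with the smallest one.
long double euler_reverse_sum(long double epsilon)
{
    assert(epsilon > 0.0L);
    std::vector<long double> terms;
    long double term = 1.0L;                  // 1/0!
    for (int k = 1; term >= epsilon; ++k) {
        terms.push_back(term);
        term /= k;                            // 1/k! from 1/(k-1)!
    }
    long double sum = 0.0L;
    for (auto it = terms.rbegin(); it != terms.rend(); ++it)
        sum += *it;                           // smallest terms first
    return sum;
}
```

Adding the small terms first lets them build up to a value that still registers when it finally meets the large leading terms.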
Another possibility that's more general is to use Kahan's summation algorithm instead. This keeps track of a running error while it's doing the summation, and takes the current error into account as it's adding each successive term.
For example, I've rewritten your code to use Kahan summation to compute to (approximately) the limit of precision of a typical (80-bit) long double:
#include<iostream>
#include<cstdlib>
#include<math.h>
#include <iterator>
#include <vector>
#include <iomanip>
#include <limits>
// Euler's number
using namespace std;
long double factorial(long double n)
{
long double result = 1.0L;
for(int i = 1; i <= n; i++)
{
result = result*i;
}
return result;
}
template <class InIt>
typename std::iterator_traits<InIt>::value_type accumulate(InIt begin, InIt end) {
typedef typename std::iterator_traits<InIt>::value_type real;
real sum = real();
real running_error = real();
for ( ; begin != end; ++begin) {
real difference = *begin - running_error;
real temp = sum + difference;
running_error = (temp - sum) - difference;
sum = temp;
}
return sum;
}
int main()
{
std::vector<long double> terms;
long double epsilon = 1e-19;
double term; // stored as double, so each term is rounded to double precision first
for (int i=0; (term=1.0L/factorial(i)) >= epsilon; i++)
terms.push_back(term);
int width = std::numeric_limits<long double>::digits10;
std::cout << std::setw(width) << std::setprecision(width) << accumulate(terms.begin(), terms.end()) << "\n";
}
Result: 2.71828182845904522
In fairness, I should actually add that I haven't checked what happens with your code using naive summation--it's possible the problem you're seeing is from some other source. On the other hand, this does fit fairly well with a type of situation where Kahan summation stands at least a reasonable chance of improving results.
#include<iostream>
#include<cmath>
#include<iomanip>
#define EPSILON 1.0/10000000
#define AMOUNT 6
using namespace std;
int main() {
long double e = 2.0, e0;
long double factorial = 1;
int counter = 2;
long double moduloDifference;
do {
e0 = e;
factorial *= counter++;
e += 1.0 / factorial;
moduloDifference = fabs(e - e0);
} while (moduloDifference >= EPSILON);
cout << "Wynik:" << endl;
cout << setprecision(AMOUNT) << e << endl;
return 0;
}
This is an optimized version that does not have a separate function to calculate the factorial.
Issue 1: I am still not sure how EPSILON manages the precision.
Issue 2: I do not understand the real difference between long double and double. Regarding my code, why does long double require a decimal point (1.0/someNumber) when double doesn't (1/someNumber)?
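A hedged sketch that touches both issues; the function name euler_number is an assumption. For this particular series, x_n - x_(n-1) is exactly the term 1/n!, and the terms still missing after 1/n! add up to less than 1/n! itself, so stopping once the term drops below eps leaves an error below eps. On the second issue, 1/10000 is a division of two ints and yields 0 whatever type the result is later assigned to; writing 1.0/10000 makes it a floating-point division.

```cpp
#include <cassert>

// Hypothetical helper: partial sum of 1/k!, stopping once the term that
// was just added is smaller than eps.
long double euler_number(long double eps)
{
    assert(eps > 0.0L);
    long double sum = 1.0L;    // x_0 = 1/0!
    long double term = 1.0L;
    int n = 1;
    do {
        term /= n;             // term == 1/n!, i.e. x_n - x_(n-1)
        sum += term;
        ++n;
    } while (term >= eps);
    return sum;
}

// Integer division truncates before any conversion to a floating type.
static_assert(1 / 10000 == 0, "1/10000 is evaluated in int arithmetic");
static_assert(1.0 / 10000 > 0.0, "1.0/10000 is a floating-point division");
```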

Why using double and then cast to float?

I'm trying to improve surf.cpp performance. From line 140, you can find this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
return (float)d;
}
Running an Intel Advisor Vectorization analysis, it reports "1 Data type conversions present", which could be inefficient (especially in vectorization).
But my question is: looking at this function, why would the authors have created d as a double and then cast it to float? If they wanted a decimal number, float would be OK. The only reason that comes to my mind is that double is more precise than float, so it can represent smaller numbers, while the final value is big enough to be stored in a float. I didn't run any test on the value of d, though.
Any other possible reason?
Because the author wants higher precision during the calculation and rounds only the final result. This is the same as carrying more significant digits through a calculation.
More precisely, error accumulates under addition and subtraction, and it can become considerable when a large number of floating-point values are involved.
You questioned the answer saying it's to use higher precision during the summation, but I don't see why. That answer is correct. Consider this simplified version with completely made-up numbers:
#include <iostream>
#include <iomanip>
float w = 0.012345;
float calcFloat(const int* origin, int n )
{
float d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
float calcDouble(const int* origin, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
int main()
{
int o[] = { 1111, 22222, 33333, 444444, 5555 };
std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}
The results are:
6254.77979
6254.7793
So even though the inputs are the same in both cases, you get a different result using double for the intermediate summation. Changing calcDouble to use (double)w doesn't change the output.
This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.
This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:
In general:
Multiplication and division are “safe” operations
Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.
So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double instead of a float doesn't matter nearly as much: you will get something that is approximately as accurate as the float value you start with (as long as the result isn't very, very large or very, very small). But summing float values that could have very different orders of magnitude, even when the individual numbers themselves are representable as float, will accumulate errors and deviate further and further from the true answer.
To see that in action:
float f1 = 1e4, f2 = 1e-4;
std::cout << (f1 + f2) << '\n';
std::cout << (double(f1) + f2) << '\n';
Or equivalently, but closer to the original code:
float f1 = 1e4, f2 = 1e-4;
float f = f1;
f += f2;
double d = f1;
d += f2;
std::cout << f << '\n';
std::cout << d << '\n';
The result is:
10000
10000.0001
Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float.

Taylor Series Resulting in nan after sin(90) and cos(120)

I'm doing a school project. I do not understand why the sine comes out as -NaN from sin(90) onward, and the cosine from cos(120) onward.
Can anyone help me understand this?
Also, when I put this in an online C++ editor it totally works, but when compiled in linux it does not.
// Nick Garver
// taylorSeries
// taylorSeries.cpp
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;
const double PI = atan(1.0)*4.0;
double angle_in_degrees = 0;
double radians = 0;
double degreesToRadians(double d);
double factorial(double factorial);
double mySine(double x);
double myCosine(double x);
int main()
{
cout << "\033[2J\033[1;1H";
cout.width(4); cout << left << "Deg";
cout.width(9); cout << left << "Radians";
cout.width(11); cout << left << "RealSine";
cout.width(11); cout << left << "MySin";
cout.width(12); cout << left << "RealCos";
cout.width(11); cout << left << "MyCos"<<endl;
while (angle_in_degrees <= 360) //radian equivalent of 45 degrees
{
double sine = sin(degreesToRadians(angle_in_degrees));
double cosine = cos(degreesToRadians(angle_in_degrees));
//output
cout.width(4); cout << left << angle_in_degrees;
cout.width(9); cout << left << degreesToRadians(angle_in_degrees);
cout.width(11); cout << left << sine;
cout.width(11); cout << left << mySine(degreesToRadians(angle_in_degrees));
cout.width(12); cout << left << cosine;
cout.width(11); cout << left << myCosine(degreesToRadians(angle_in_degrees))<<endl;
angle_in_degrees = angle_in_degrees + 15;
}
cout << endl;
return 0;
}
double degreesToRadians(double d)
{
double answer;
answer = (d*PI)/180;
return answer;
}
double mySine(double x)
{
double result = 0;
for(int i = 1; i <= 1000; i++) {
if (i % 2 == 1)
result += pow(x, i * 2 - 1) / factorial(i * 2 - 1);
else
result -= pow(x, i * 2 - 1) / factorial(i * 2 - 1);
}
return result;
}
double myCosine(double x)
{
double positive = 0.0;
double negative= 0.0;
double result=0.0;
for (int i=4; i<=1000; i+=4)
{
positive = positive + (pow(x,i) / factorial (i));
}
for (int i=2; i<=1000; i+=4)
{
negative = negative + (pow(x,i) / factorial (i));
}
result = (1 - (negative) + (positive));
return result;
}
double factorial(double factorial)
{
float x = 1;
for (float counter = 1; counter <= factorial; counter++)
{
x = x * counter;
}
return x;
}
(Marcus has good points; I am going to ramble in other directions...)
Look at the terms in a Taylor series. For small angles they become too small to make any difference after fewer than 10 terms, and even at 360 degrees a couple of dozen terms are enough. Asking for 1000 is asking for trouble.
Instead of going for 1000, go until the next term does not add anything, something like:
term = pow(x, i * 2 - 1) / factorial(i * 2 - 1);
if (result + term == result) { break; }
result += term;
The series would run much faster if you iteratively calculated the pow and factorial rather than starting over each time. (But, probably speed is not an issue at this point.)
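A minimal sketch of the iterative form, combined with the stop test from above. The name my_sine_iterative and the cap of 200 on the loop are assumptions. Each term is obtained from the previous one with a single multiplication, so neither pow nor factorial is called and no huge intermediate values appear.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical rewrite of mySine: every term is derived from the previous one.
double my_sine_iterative(double x)
{
    assert(std::isfinite(x));
    double term = x;       // first term: x^1 / 1!
    double result = 0.0;
    for (int n = 1; n < 200; n += 2) {
        if (result + term == result)
            break;         // the term no longer changes the sum
        result += term;
        // x^(n+2)/(n+2)! = (x^n/n!) * x*x / ((n+1)*(n+2)), with alternating sign
        term *= -x * x / ((n + 1.0) * (n + 2.0));
    }
    return result;
}
```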
Float has 24 bits of binary precision. Beginning with 14!, you will get roundoff errors in float. Double, on the other hand, has 53 bits of precision and will last until about 22! without roundoff errors. My point is that you should have done factorial() in double.
Another problem is that the computation of the Taylor series gets somewhat 'unstable' for bigger arguments. Intermediate terms become bigger than the end result, thereby leading to other roundoff errors. To avoid this, a common way to compute sine and cosine is to first fold to between -45 and +45 degrees. No unfolding, except maybe for the sign, is needed later.
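One way the folding could look, as a sketch under assumptions: the helper names are hypothetical, and std::remquo is used here to split the argument into a remainder of at most 45 degrees (pi/4) and a quadrant number. The quadrant only decides whether the sine or the cosine series is evaluated and with which sign.

```cpp
#include <cassert>
#include <cmath>

// Short Taylor series, intended only for |r| <= pi/4.
static double sin_series(double r)
{
    double term = r, sum = r;
    for (int n = 1; n <= 19; n += 2) {
        term *= -r * r / ((n + 1.0) * (n + 2.0));
        sum += term;
    }
    return sum;
}

static double cos_series(double r)
{
    double term = 1.0, sum = 1.0;
    for (int n = 0; n <= 18; n += 2) {
        term *= -r * r / ((n + 1.0) * (n + 2.0));
        sum += term;
    }
    return sum;
}

// Hypothetical helper: folds x (radians) into [-pi/4, pi/4] plus a quadrant.
double folded_sine(double x)
{
    assert(std::isfinite(x));
    const double half_pi = std::atan(1.0) * 2.0;
    int q = 0;
    double r = std::remquo(x, half_pi, &q);   // x is about q * half_pi + r
    switch (((q % 4) + 4) % 4) {              // quadrant, also for negative q
    case 0:  return  sin_series(r);
    case 1:  return  cos_series(r);
    case 2:  return -sin_series(r);
    default: return -cos_series(r);
    }
}
```

Because the series only ever sees arguments up to pi/4, the intermediate terms stay below 1 and ten terms are more than enough for double precision.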
As for why you had trouble on one system but not the other -- Different implementations handle NaN differently.
Once you have gotten the NaN out of the way, try computing the series in reverse order. This will lead to a different set of roundoff errors. Will it make your sin() closer to the real sin?
The 'real' sin is probably computed in hardware with 64-bit fixed-point arithmetic, and will be "correctly rounded" to 53 or 24 bits well over 99% of the time. (This, of course, depends on the chip manufacturer, hence my 'hand-waving' statement.)
To judge how 'close' your value is, you need to compute ULPs (units in the last place). This involves looking at the bits in the float/double. (Beyond the scope of this question.)
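For readers who want to try it anyway, here is a rough sketch of a ULP distance. It assumes IEEE 754 binary64 doubles and two finite values of the same sign; the name ulp_distance is hypothetical. For such values the bit patterns, read as integers, are ordered the same way as the numbers, so the difference of the patterns counts the representable steps between them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: number of representable doubles between a and b.
// Assumes IEEE 754 binary64 and finite inputs with the same sign.
std::int64_t ulp_distance(double a, double b)
{
    static_assert(sizeof(double) == sizeof(std::int64_t), "expects a 64-bit double");
    assert(std::isfinite(a) && std::isfinite(b));
    assert(std::signbit(a) == std::signbit(b));
    std::int64_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);   // read the bits without aliasing problems
    std::memcpy(&ib, &b, sizeof b);
    return ia > ib ? ia - ib : ib - ia;
}
```

Something like ulp_distance(mySine(x), sin(x)) then gives a scale-independent measure of how far the series result is from the library value.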
Sorry about the TMI.
Before I answer this, a few remarks:
It's always helpful for your own debugging to keep your code tidy. Remove unnecessary empty lines, make sure your bracketing style is uniform, and properly indent. I did this for you, but believe me, you'll avoid a lot of bugs if you keep up a consistent style!
You have functions that take double as input and return double, but internally just use float; that should be a red flag!
Your whole degreesToRadians would be easier to read and only one third as long if you just used return (d*PI)/180;
Answers now:
In your factorial function, you calculate a factorial for values up to 1999. Hint: try to figure out the value of 1999! and look up the maximum number that float on your machine can hold. Then look up double's maximum. How many orders of magnitude is 1999! larger?
1999! is ca. 10^5732. A 32-bit float tops out near 3.4 × 10^38 and a 64-bit double near 1.8 × 10^308, so 1999! is more than 5,000 orders of magnitude beyond either of them; its exponent alone is about 150 times the largest float exponent and about 18 times the largest double exponent. For comparison, the distance from the center of the Sun to the center of the Earth is only about 18 orders of magnitude larger than the typical 0.1 µm diameter of a bacterium.
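A small sketch of how the overflow turns into NaN, assuming IEEE 754 arithmetic. float_factorial copies the logic of the question's factorial(); series_term is a hypothetical name for one term of the series as the question computes it. Once the float product overflows it becomes infinity; if pow(x, i) overflows as well, the quotient is infinity divided by infinity, which is NaN, while a finite numerator divided by infinity is simply 0.

```cpp
#include <cassert>
#include <cmath>

// Same logic as factorial() in the question: float arithmetic inside.
double float_factorial(double n)
{
    float x = 1;
    for (float counter = 1; counter <= n; counter++)
        x = x * counter;     // becomes +infinity once the product overflows
    return x;
}

// One term of the series the way the question computes it: pow(x, i) / i!.
double series_term(double x, int i)
{
    assert(i >= 0);
    return std::pow(x, i) / float_factorial(i);
}
```

With x equal to 105 degrees in radians, pow(x, 1000) is still finite and the term is 0; at 120 degrees pow(x, 1000) overflows too and the term is NaN. This is consistent with the cosine breaking at 120 degrees, and the same happens to pow(x, 1999) in the sine at 90 degrees.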

Ensure float to be smaller than exact value

I want to calculate a sum of the following form in C++
float result = float(x1)/y1+float(x2)/y2+....+float(xn)/yn
xi,yi are all integers. The result will be an approximation of the actual value. It is crucial that this approximation is smaller or equal to the actual value. I can assume that all my values are finite and positive.
I tried using nextafterf(x, 0) as in this code snippet.
cout.precision( 15 );
float a = 1.0f / 3.0f * 10; //3 1/3
float b = 2.0f / 3.0f * 10; //6 2/3
float af = nextafterf( a , 0 );
float bf = nextafterf( b , 0 );
cout << a << endl;
cout << b << endl;
cout << af << endl;
cout << bf << endl;
float sumf = 0.0f;
for ( int i = 1; i <= 3; i++ )
{
sumf = sumf + bf;
}
sumf = sumf + af;
cout << sumf << endl;
As one can see, the correct solution would be 3 * 6.666... + 3.333... = 23.333...
But as output I get:
3.33333349227905
6.66666698455811
3.33333325386047
6.66666650772095
23.3333339691162
Even though my summands are smaller than what they should represent, their sum is not. In this case applying nextafterf to sumf will give me 23.3333320617676 which is smaller. But does this always work? Is it possible that the rounding error gets so big that nextafterf still leaves me above the correct value?
I know that I could avoid this by implementing a class for fractions and calculating everything exactly. But I'm curious whether it is possible to achieve my goal with floats.
Try changing the float rounding mode to FE_TOWARDZERO.
See code example here:
Change floating point rounding mode
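Since the linked example is not reproduced here, the following is only a sketch of the idea using std::fesetround from <cfenv>. The function name is hypothetical, and one detail goes beyond the suggestion above: the denominator is converted with FE_UPWARD, because a denominator that is rounded down would make the quotient larger. The volatile variables are there to discourage the compiler from evaluating anything at compile time in the default rounding mode; GCC documents -frounding-math for that purpose.

```cpp
#include <cassert>
#include <cfenv>
#include <cstddef>

// Hypothetical helper: sums x[i]/y[i] so that no rounding step can push the
// result above the exact value. Assumes x[i] >= 0 and y[i] > 0.
float sum_rounded_down(const int* x, const int* y, std::size_t n)
{
    const int old_mode = std::fegetround();
    volatile float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        assert(x[i] >= 0 && y[i] > 0);
        std::fesetround(FE_UPWARD);        // the denominator must not shrink
        volatile float fy = static_cast<float>(y[i]);
        std::fesetround(FE_TOWARDZERO);    // everything else is rounded down
        volatile float fx = static_cast<float>(x[i]);
        volatile float q = fx / fy;
        sum = sum + q;
    }
    std::fesetround(old_mode);             // restore the caller's mode
    return sum;
}
```

For the inputs 20/3 + 20/3 + 20/3 + 10/3 from the question, every intermediate value is truncated, so the result stays below 70/3.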
My immediate reaction is that the approach you're taking is fundamentally flawed.
The problem is that with floating point numbers, the size of step that nextafter will take will depend on the magnitude of the numbers involved. Let's consider a somewhat extreme example:
#include <iostream>
#include <iomanip>
#include <cmath>
int main() {
float num = 1.0e-10f;
float denom = 1.0e10f;
std::cout << std::setprecision(7) << num - std::nextafterf(num, 0) << "\n";
std::cout << std::setprecision(7) << denom - std::nextafterf(denom, 0) << "\n";
}
Result:
6.938894e-018
1024
So, since the numerator is a lot smaller than the denominator, the increment is also much smaller.
The consequence seems fairly clear: stepping both values toward zero also shrinks the denominator, so instead of being slightly smaller than the true quotient, the result can end up larger.
If you want to ensure the result is smaller than the correct number, the obvious choice would be to round the numerator down but the denominator up (i.e. nextafterf(denom, positive_infinity)). This way, you get a smaller numerator and a larger denominator, so the result is always smaller than the un-modified version would have been.
float result = float(x1)/y1+float(x2)/y2+....+float(xn)/yn has 3 places where rounding may occur.
Conversion of int to float - it is not always exact.
Division floating point x/floating point y
Addition: floating point quotient + floating point quotient.
By stepping to the next representable float after each operation (up or down, as the expression requires), the result will certainly not exceed the exact mathematical value. This approach may not generate the float closest to the exact answer, yet it will be close and never larger.
#include <float.h>
#include <math.h>
#include <stddef.h>
float foo(const int *x, const int *y, size_t n) {
    float sum = 0.0;
    for (size_t i=0; i<n; i++) { // assume x[0] is x1, x[1] is x2 ...
        float fx = nextafterf(x[i], 0.0);     // numerator: one step down
        float fy = nextafterf(y[i], FLT_MAX); // denominator: one step up
        // divide slightly smaller by slightly larger
        float q = nextafterf(fx / fy, 0.0);
        sum = nextafterf(sum + q, 0.0);
    }
    return sum;
}

Rounding double values in C++ like MS Excel does it

I've searched all over the net, but I could not find a solution to my problem. I simply want a function that rounds double values like MS Excel does. Here is my code:
#include <iostream>
#include "math.h"
using namespace std;
double Round(double value, int precision) {
return floor(((value * pow(10.0, precision)) + 0.5)) / pow(10.0, precision);
}
int main(int argc, char *argv[]) {
/* The way MS Excel does it:
1.27815 1.27840 -> 1.27828
1.27813 1.27840 -> 1.27827
1.27819 1.27843 -> 1.27831
1.27999 1.28024 -> 1.28012
1.27839 1.27866 -> 1.27853
*/
cout << Round((1.27815 + 1.27840)/2, 5) << "\n"; // *
cout << Round((1.27813 + 1.27840)/2, 5) << "\n";
cout << Round((1.27819 + 1.27843)/2, 5) << "\n";
cout << Round((1.27999 + 1.28024)/2, 5) << "\n"; // *
cout << Round((1.27839 + 1.27866)/2, 5) << "\n"; // *
if(Round((1.27815 + 1.27840)/2, 5) == 1.27828) {
cout << "Hurray...\n";
}
system("PAUSE");
return EXIT_SUCCESS;
}
I have found the function here at stackoverflow, the answer states that it works like the built-in excel rounding routine, but it does not. Could you tell me what I'm missing?
In a sense what you are asking for is not possible:
Floating point values on most common platforms do not have a notion of a "number of decimal places". Numbers like 2.3 or 8.71 simply cannot be represented precisely. Therefore, it makes no sense to ask for any function that will return a floating point value with a given number of non-zero decimal places -- such numbers simply do not exist.
The only thing you can do with floating point types is to compute the nearest representable approximation, and then print the result with the desired precision, which will give you the textual form of the number that you desire. To compute the representation, you can do this:
// requires <cmath> and <limits>
double round(double x, int n)
{
    int e;
    std::frexp(x, &e);
    // |x| >= 2^52 here, so the number is an integer, nothing to do
    if (e >= std::numeric_limits<double>::digits) return x;
    double const f = std::pow(10.0, n);
    double d;
    std::modf(x * f, &d); // d == integral part of 10^n * x (truncated toward zero)
    return d / f;
}
(You can also use modf instead of frexp to determine whether x is already an integer. You should also check that n is non-negative, or otherwise define semantics for negative "precision".)
As an alternative to floating-point types, you could perform fixed-point arithmetic. That is, you store everything as integers, but you treat them as units of, say, 1/1000. Then you could print such a (non-negative) number as follows, zero-padding the fractional part with std::setfill and std::setw from <iomanip>:
std::cout << n / 1000 << "." << std::setfill('0') << std::setw(3) << n % 1000;
Addition works as expected, though you have to write your own multiplication function.
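A possible shape for that multiplication function, as a sketch: the names kScale, fixed_mul and fixed_to_string are hypothetical, values are assumed to be non-negative, and results are truncated rather than rounded. The product of two values scaled by 1000 is scaled by 1000 * 1000, so one factor of 1000 has to be divided out again.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical fixed-point helpers: values are integer multiples of 1/1000.
const std::int64_t kScale = 1000;

// The raw product carries the scale twice, so divide one factor out again.
// The result is truncated toward zero.
std::int64_t fixed_mul(std::int64_t a, std::int64_t b)
{
    return a * b / kScale;
}

// Formats a non-negative scaled value, zero-padding the fractional part.
std::string fixed_to_string(std::int64_t n)
{
    assert(n >= 0);
    std::string frac = std::to_string(n % kScale);
    return std::to_string(n / kScale) + "." + std::string(3 - frac.size(), '0') + frac;
}
```

For example, 1.5 * 2.0 is fixed_mul(1500, 2000), which gives 3000, and fixed_to_string(1005) produces "1.005" rather than "1.5".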
To compare double values, you must specify a range of comparison, where the result could be considered "safe". You could use a macro for that.
Here is one example of what you could use:
#define COMPARE( A, B, PRECISION ) ( ( (A) >= (B) - (PRECISION) ) && ( (A) <= (B) + (PRECISION) ) )
int main()
{
double a = 12.34567;
bool equal = COMPARE( a, 12.34567F, 0.0002 );
equal = COMPARE( a, 15.34567F, 0.0002 );
return 0;
}
Thank you all for your answers! After considering the possible solutions I changed the original Round() function in my code to add 0.6 instead of 0.5 to the value.
The value "127827.5" (I do understand that this is not an exact representation!) becomes "127828.1" and finally through floor() and dividing it becomes "1.27828" (or something more like 1.2782800..001). Using COMPARE suggested by Renan Greinert with a correctly chosen precision I can safely compare the values now.
Here is the final version:
#include <iostream>
#include "math.h"
#define COMPARE(A, B, PRECISION) (((A) >= (B)-(PRECISION)) && ((A) <= (B)+(PRECISION)))
using namespace std;
double Round(double value, int precision) {
return floor(value * pow(10.0, precision) + 0.6) / pow(10.0, precision);
}
int main(int argc, char *argv[]) {
/* The way MS Excel does it:
1.27815 1.27840 -> 1.27828
1.27813 1.27840 -> 1.27827
1.27819 1.27843 -> 1.27831
1.27999 1.28024 -> 1.28012
1.27839 1.27866 -> 1.27853
*/
cout << Round((1.27815 + 1.27840)/2, 5) << "\n";
cout << Round((1.27813 + 1.27840)/2, 5) << "\n";
cout << Round((1.27819 + 1.27843)/2, 5) << "\n";
cout << Round((1.27999 + 1.28024)/2, 5) << "\n";
cout << Round((1.27839 + 1.27866)/2, 5) << "\n";
//Comparing the rounded value against a fixed one
if(COMPARE(Round((1.27815 + 1.27840)/2, 5), 1.27828, 0.000001)) {
cout << "Hurray!\n";
}
//Comparing two rounded values
if(COMPARE(Round((1.27815 + 1.27840)/2, 5), Round((1.27814 + 1.27841)/2, 5), 0.000001)) {
cout << "Hurray!\n";
}
system("PAUSE");
return EXIT_SUCCESS;
}
I've tested it by rounding a hundred double values and then comparing the results to what Excel gives. They were all the same.
I'm afraid the answer is that Round cannot perform magic.
Since 1.27828 is not exactly representable as a double, you cannot compare some double with 1.27828 and hope it will match.
You need to do the maths on scaled values without the decimal part to get those numbers, so something like this:
double dPow = pow(10.0, 5.0);
double a = 1.27815;
double b = 1.27840;
double a2 = 1.27815 * dPow;
double b2 = 1.27840 * dPow;
double c = (a2 + b2) / 2 + 0.5;
Using your function...
double c = (Round(a) + Round(b)) / 2 + 0.5;
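Taken one step further, the averaging can be done entirely in integers. This is a sketch under assumptions, not the answerer's code: to_scaled and average_half_up are hypothetical names, inputs are taken to be non-negative, and 5 decimal places are used as in the question. Converting each input to an integer count of 1/100000 removes the binary representation error before any rounding decision is made.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: turns a decimal such as 1.27815 into the integer
// 127815 (digits = 5); llround absorbs the tiny binary representation error.
long long to_scaled(double value, int digits)
{
    assert(digits >= 0 && digits <= 15);
    double scale = 1.0;
    for (int i = 0; i < digits; ++i)
        scale *= 10.0;
    return std::llround(value * scale);
}

// Average of two non-negative scaled values, with halves rounded up.
long long average_half_up(long long a, long long b)
{
    assert(a >= 0 && b >= 0);
    return (a + b + 1) / 2;
}
```

With the first pair from the question, average_half_up(to_scaled(1.27815, 5), to_scaled(1.27840, 5)) gives 127828, which corresponds to the 1.27828 in the table; the value only becomes a double again when it is divided by 100000 for display.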