Taylor series resulting in NaN after sin(90) and cos(120) - C++

I'm doing a school project, and I don't understand why the output comes out to -NaN after sin(90) and cos(120).
Can anyone help me understand this?
Also, when I run this in an online C++ editor it works fine, but when compiled on Linux it does not.
// Nick Garver
// taylorSeries
// taylorSeries.cpp
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;
const double PI = atan(1.0)*4.0;
double angle_in_degrees = 0;
double radians = 0;
double degreesToRadians(double d);
double factorial(double factorial);
double mySine(double x);
double myCosine(double x);
int main()
{
    cout << "\033[2J\033[1;1H";
    cout.width(4);  cout << left << "Deg";
    cout.width(9);  cout << left << "Radians";
    cout.width(11); cout << left << "RealSine";
    cout.width(11); cout << left << "MySin";
    cout.width(12); cout << left << "RealCos";
    cout.width(11); cout << left << "MyCos" << endl;
    while (angle_in_degrees <= 360) //radian equivalent of 45 degrees
    {
        double sine = sin(degreesToRadians(angle_in_degrees));
        double cosine = cos(degreesToRadians(angle_in_degrees));
        //output
        cout.width(4);  cout << left << angle_in_degrees;
        cout.width(9);  cout << left << degreesToRadians(angle_in_degrees);
        cout.width(11); cout << left << sine;
        cout.width(11); cout << left << mySine(degreesToRadians(angle_in_degrees));
        cout.width(12); cout << left << cosine;
        cout.width(11); cout << left << myCosine(degreesToRadians(angle_in_degrees)) << endl;
        angle_in_degrees = angle_in_degrees + 15;
    }
    cout << endl;
    return 0;
}

double degreesToRadians(double d)
{
    double answer;
    answer = (d * PI) / 180;
    return answer;
}

double mySine(double x)
{
    double result = 0;
    for (int i = 1; i <= 1000; i++) {
        if (i % 2 == 1)
            result += pow(x, i * 2 - 1) / factorial(i * 2 - 1);
        else
            result -= pow(x, i * 2 - 1) / factorial(i * 2 - 1);
    }
    return result;
}

double myCosine(double x)
{
    double positive = 0.0;
    double negative = 0.0;
    double result = 0.0;
    for (int i = 4; i <= 1000; i += 4)
    {
        positive = positive + (pow(x, i) / factorial(i));
    }
    for (int i = 2; i <= 1000; i += 4)
    {
        negative = negative + (pow(x, i) / factorial(i));
    }
    result = (1 - (negative) + (positive));
    return result;
}

double factorial(double factorial)
{
    float x = 1;
    for (float counter = 1; counter <= factorial; counter++)
    {
        x = x * counter;
    }
    return x;
}

(Marcus has good points; I am going to ramble in other directions...)
Look at the terms in a Taylor series. They become too small to make any difference after fewer than 10 terms. Asking for 1000 is asking for trouble.
Instead of going for 1000, go until the next term does not add anything, something like:
term = pow(x, i * 2 - 1) / factorial(i * 2 - 1);
if (result + term == result) { break; }
result += term;
The series would run much faster if you iteratively calculated the pow and factorial rather than starting over each time. (But, probably speed is not an issue at this point.)
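For illustration, here is a sketch of mine (not code from the answer) that combines both ideas, deriving each term from the previous one via term(k+1) = term(k) * (-x*x) / ((2k) * (2k+1)) and stopping once a term no longer contributes:
double mySineIncremental(double x) // sketch: no pow()/factorial() calls needed
{
    double term = x;       // first term: x^1 / 1!
    double result = 0.0;
    for (int k = 1; k <= 1000; k++)
    {
        if (result + term == result) { break; }   // term too small to matter
        result += term;
        // next term reuses the previous one instead of recomputing from scratch
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
    }
    return result;
}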
Float has 24 bits of binary precision. Beginning perhaps with 13!, you will get roundoff errors in float. Double, on the other hand, has 53 bits of precision and will last until about 22! without roundoff errors. My point is that you should have done factorial() in double.
Another problem is that the computation of the Taylor series gets somewhat 'unstable' for bigger arguments. Intermediate terms become bigger than the end result, thereby leading to other roundoff errors. To avoid this, a common way to compute sine and cosine is to first fold to between -45 and +45 degrees. No unfolding, except maybe for the sign, is needed later.
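A minimal sketch of that folding idea (my illustration; it only folds to within ±90 degrees, since folding all the way to ±45 additionally requires switching between the sine and cosine series):
double foldedSine(double x) // x in radians; assumes a corrected mySine()
{
    const double PI = atan(1.0) * 4.0;
    x = fmod(x, 2.0 * PI);            // periodicity: sin(x + 2*pi) = sin(x)
    if (x > PI)  x -= 2.0 * PI;       // now x is in [-pi, pi]
    if (x < -PI) x += 2.0 * PI;
    if (x > PI / 2.0)  x = PI - x;    // mirror: sin(pi - x) = sin(x)
    if (x < -PI / 2.0) x = -PI - x;   // mirror: sin(-pi - x) = sin(x)
    return mySine(x);                 // small argument: series converges fast
}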
As for why you had trouble on one system but not the other -- Different implementations handle NaN differently.
Once you have gotten the NaN out of the way, try computing the series in reverse order. This will lead to a different set of roundoff errors. Will it make your sin() closer to the real sin?
The 'real' sin is probably computed in hardware with 64-bit fixed-point arithmetic, and will be "correctly rounded" to 53 or 24 bits well over 99% of the time. (This, of course, depends on the chip manufacturer, hence my 'hand-waving' statement.)
To judge how 'close' your value is, you need to compute ULPs (units in the last place). This involves looking at the bits in the float/double. (Beyond the scope of this question.)
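For the curious, the bit-inspection part can look like this (a sketch of mine, assuming IEEE-754 floats; it only handles finite values of the same sign, with NaN/infinity and mixed signs omitted):
#include <cstdint>
#include <cstring>

int32_t ulpDistance(float a, float b) // both finite and positive
{
    int32_t ia, ib;
    std::memcpy(&ia, &a, sizeof a);   // reinterpret the bits safely
    std::memcpy(&ib, &b, sizeof b);   // adjacent positive floats differ by exactly 1 here
    return ia > ib ? ia - ib : ib - ia;
}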
Sorry about the TMI.

Before I answer this, a few remarks:
It's always helpful for your own debugging to keep your code tidy. Remove unnecessary empty lines, make sure your bracketing style is uniform, and indent properly. I did this for you, but believe me, you'll avoid a lot of bugs if you keep up a consistent style!
You have functions that take double as input and return double, but internally just use float; that should be a red flag!
Your whole degreesToRadians function would be easier to read, and only a third as long, if you just used return (d * PI) / 180;
Answers now:
In your factorial function, you calculate factorials for values up to 1999. Hint: try to figure out the value of 1999! and look up the maximum number that a float on your machine can hold. Then look up double's maximum. How many orders of magnitude larger is 1999!?
1999! is roughly 10^5732. That is a huge number: its exponent is about 150 times larger than the largest exponent a 32-bit float can represent (~10^38), and still about 18 times larger than that of a 64-bit double (~10^308). To compare, storing 1999! in a double would be like trying to fit the distance from the Sun's center to the Earth's center into the typical 0.1 µm diameter of a bacterium.
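That overflow is exactly where the -NaN comes from: float factorial() overflows to infinity around 35!, pow(x, 1999) overflows too, and infinity divided by infinity is NaN, which then poisons every later sum. A quick demonstration (my sketch, not part of the original answer):
#include <iostream>
#include <cmath>

int main()
{
    float f = 1.0f;
    for (int i = 1; i <= 40; i++) f *= i;   // 40! ~ 8.2e47; float max ~ 3.4e38
    std::cout << f << '\n';                 // prints: inf

    double term = std::pow(6.28, 1999) / f; // inf / inf
    std::cout << term << '\n';              // prints: nan (or -nan)
    std::cout << 1.0 + term << '\n';        // NaN swallows the whole sum
}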

Precision problem in my C++ program for calculating pi using Chudnovsky

My code:
#include <iostream>
#include <iomanip>
#include <cmath>

long double fac(long double num) {
    long double result = 1.0;
    for (long double i = 2.0; i < num; i++)
        result *= i;
    return result;
}

int main() {
    using namespace std;
    long double pi = 0.0;
    for (long double k = 0.0; k < 10.0; k++) {
        pi += (pow(-1.0, k) * fac(6.0 * k) * (13591409.0 + (545140134.0 * k)))
            / (fac(3.0 * k) * pow(fac(k), 3.0) * pow(640320.0, 3.0 * k + 3.0 / 2.0));
    }
    pi *= 12.0;
    cout << setprecision(100) << 1.0 / pi << endl;
    return 0;
}
My output:
3.1415926535897637228433865175247774459421634674072265625
The problem with this output is that it printed 56 digits instead of 100. How do I fix that?
First of all, your factorial is wrong: the loop should be for (long double i=2.0; i<=num; i++) instead of i<num!
As mentioned in the comments, double can hold only ~16 digits, so your 100 digits are not achievable by this method. There are two ways to remedy this:
use a high-precision datatype
There are libraries for this, or you can implement one yourself; you need just a few basic operations. Note that to represent 100 digits you need at least
ceil(100 digits / log10(2)) = 333 bits
of mantissa or fixed-point integer, while double has only 53:
53 * log10(2) = 15.954589770191003346328161420398 digits
use a different method of computing PI
For arbitrary precision I recommend BBP. However, if you want just 100 digits, you can use a simple Taylor-series-based computation on strings (no need for any high-precision datatype nor FPU):
//The following 160 character C program, written by Dik T. Winter at CWI, computes pi to 800 decimal digits.
int a=10000,b=0,c=2800,d=0,e=0,f[2801],g=0;main(){for(;b-c;)f[b++]=a/5;
for(;d=0,g=c*2;c-=14,printf("%.4d",e+d/a),e=d%a)for(b=c;d+=f[b]*a,f[b]=d%--g,d/=g--,--b;d*=b);}
Aside from the obvious precision limits, your implementation is poor from both performance and precision standpoints; that is why you lose precision way sooner, hitting double's limits at very low iterations of k. If you rewrite the iterations so the subresults stay as small as possible (in terms of mantissa bits) and avoid unnecessary computation, it will do much better. A few hints:
Why are you computing the same factorials again and again?
You have k! in a loop where k is incrementing, so why not just multiply a variable holding the actual factorial by k instead? For example:
//for ( k=0;k<10;k++){ ... fac(k) ... }
for (f=1,k=0;k<10;k++){ if (k) f*=k; ... f ... }
Why are you dividing by factorials again and again?
If you think about it a bit, then for (a > b) you can compute this instead:
a! / b! = (1*2*3*4*...*b*...*a) / (1*2*3*4*...*b)
a! / b! = (b+1)*(b+2)*...*(a)
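In code, that identity turns the ratio into a short running product, never forming either huge factorial (a sketch of mine, not the answer's code):
double facRatio(int a, int b) // a! / b! for a > b
{
    double r = 1.0;
    for (int i = b + 1; i <= a; i++) r *= i;   // (b+1)*(b+2)*...*a
    return r;
}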
I would not use pow at all for this.
pow is a "very complex" function that causes further precision and performance losses. For example, pow(-1.0,k) can be done like this:
//for ( k=0;k<10;k++){ ... pow(-1.0,k) ... }
for (s=+1,k=0;k<10;k++){ s=-s; ... s ... }
Also, pow(640320.0, 3.0 * k + 3.0/2.0) can be computed the same way as the factorial, and for pow(fac(k), 3.0) you can just apply the variable holding fac(k) three times instead...
The term pow(640320.0, 3.0 * k + 3.0/2.0) outgrows even (6k)!,
so you can divide it by (6k)! to keep the subresults smaller...
These few simple tweaks will enhance the precision a lot, as you will overflow double precision much, much later, because the subresults stay much smaller than the naive ones (factorials tend to grow really fast).
Putting it all together leads to this:
double pi_Chudnovsky() // no pow,fac lower subresult
{ // https://en.wikipedia.org/wiki/Chudnovsky_algorithm
    double pi, s, f, f3, k, k3, k6, p, dp, q, r;
    for (pi=0.0, s=1.0, f=f3=1, k=k3=k6=0.0, p=640320.0, dp=p*p*p, p*=sqrt(p), r=13591409.0; k<27.0; k++, s=-s)
    {
        if (k) // f=k!, f3=(3k)!, p=pow(640320.0,3k+1.5)*(3k)!/(6k)!, r=13591409.0+(545140134.0*k)
        {
            p*=dp; r+=545140134.0;
            f*=k; k3++; f3*=k3; k6++; p/=k6; p*=k3;
                  k3++; f3*=k3; k6++; p/=k6; p*=k3;
                  k3++; f3*=k3; k6++; p/=k6; p*=k3;
                                k6++; p/=k6;
                                k6++; p/=k6;
                                k6++; p/=k6;
        }
        q=s*r; q/=f; q/=f; q/=f; q/=p; pi+=q;
    }
    return 1.0/(pi*12.0);
}
As you can see, k goes up to 27, while the naive method can go only up to 18 on 64-bit doubles before overflow. However, the result is the same, as the double mantissa is already saturated after 2 iterations (Chudnovsky adds roughly 14 digits per term, so two terms exceed double's ~16 digits)...
I am happy with the following code :)
/*
    Compiled using Cygwin.
    Change "iostream ... using namespace std" OR iostream.h depending on your compiler and OS.
*/
#include <iostream>
#include <iomanip>
#include <cmath>
using namespace std;

long double fac(long double num)
{
    long double result = 1.0;
    for (long double i = 2.0; num > i; ++i)
    {
        result *= i;
    }
    return result;
}

int main()
{
    long double pi = 0.0;
    for (long double k = 0.0; 10.0 > k; ++k)
    {
        pi += (pow(-1.0, k) * fac(6.0 * k) * (13591409.0 + (545140134.0 * k)))
            / (fac(3.0 * k) * pow(fac(k), 3.0) * pow(640320.0, 3.0 * k + 3.0 / 2.0));
    }
    pi *= 12.0;
    cout << "BEFORE USING setprecision VALUE OF DEFAULT PRECISION " << cout.precision() << "\n";
    cout << setprecision(100) << 1.0 / pi << endl;
    cout << "AFTER USING setprecision VALUE OF CURRENT PRECISION WITHOUT USING fixed " << cout.precision() << "\n";
    cout << fixed;
    cout << "AFTER USING setprecision VALUE OF CURRENT PRECISION USING fixed " << cout.precision() << "\n";
    cout << "USING fixed PREVENT THE EARTH'S ROUNDING OFF INSIDE OUR UNIVERSE :)\n";
    cout << setprecision(100) << 1.0 / pi << endl;
    return 0;
}
/*
$ # Sample output:
$ g++ 73256565.cpp -o ./a.out;./a.out
$ ./a.out
BEFORE USING setprecision VALUE OF DEFAULT PRECISION 6
3.14159265358976372457810999350158454035408794879913330078125
AFTER USING setprecision VALUE OF CURRENT PRECISION WITHOUT USING fixed 100
AFTER USING setprecision VALUE OF CURRENT PRECISION USING fixed 100
USING fixed PREVENT THE EARTH'S ROUNDING OFF INSIDE OUR UNIVERSE :)
3.1415926535897637245781099935015845403540879487991333007812500000000000000000000000000000000000000000
*/

How can I get a more accurate result when dividing numbers in C++

I am trying to estimate PI using C++ as a fun math project. I've run into an issue where I can only get it as precise as 6 decimal places.
I have tried using a float instead of a double but found the same result.
My code works by summing all the results of 1/n^2 where n=1 through to a defined limit. It then multiplies this result by 6 and takes the square root.
Written out in mathematical notation, the estimate is pi ≈ sqrt(6 * (1/1^2 + 1/2^2 + ... + 1/N^2)).
Here is my main function. PREC is the predefined limit. It will populate the array with the results of these fractions and get the sum. My guess is that the sqrt function is causing the issue where I cannot get more precise than 6 digits.
int main(int argc, char *argv[]) {
    nthsums = new float[PREC];
    for (int i = 1; i < PREC + 1; i += 1) {
        nthsums[i] = nth_fraction(i);
    }
    float array_sum = sum_array(nthsums);
    array_sum *= 6.000000D;
    float result = sqrt(array_sum);
    std::string resultString = std::to_string(result);
    cout << resultString << "\n";
}
Just for the sake of it, I'll also include my sum function as I suspect that there could be something wrong with that, too.
float sum_array(float *array) {
    float returnSum = 0;
    for (int itter = 0; itter < PREC + 1; itter += 1) {
        if (array[itter] >= 0) {
            returnSum += array[itter];
        }
    }
    return returnSum;
}
I would like to get at least as precise as 10 digits. Is there any way to do this in C++?
So even with long double as the floating-point type used for this, some subtlety is required, because adding two long doubles of substantially different orders of magnitude can cause precision loss. See here for a discussion in Java, but I believe the behavior is basically the same in C++.
Code I used:
#include <iostream>
#include <cmath>
#include <numbers>

long double pSeriesApprox(unsigned long long t_terms)
{
    long double pi_squared = 0.L;
    for (unsigned long long i = t_terms; i >= 1; --i)
    {
        pi_squared += 6.L * (1.L / i) * (1.L / i);
    }
    return std::sqrtl(pi_squared);
}

int main() {
    const long double pi = std::numbers::pi_v<long double>;
    const unsigned long long num_terms = 10'000'000'000;
    std::cout.precision(30);
    std::cout << "Pi == " << pi << "\n\n";
    std::cout << "Pi ~= " << pSeriesApprox(num_terms) << " after " << num_terms << " terms\n";
    return 0;
}
Output:
Pi == 3.14159265358979311599796346854
Pi ~= 3.14159265349430016911469465413 after 10000000000 terms
9 decimal digits of accuracy, which is about what we'd expect from a series converging at this rate: the neglected tail of the sum is about 6/N = 6e-10 in pi squared, i.e. roughly 1e-10 in pi itself.
But if all I do is reverse the order in which the loop in pSeriesApprox runs, adding exactly the same terms but from largest to smallest instead of smallest to largest:
long double pSeriesApprox(unsigned long long t_terms)
{
    long double pi_squared = 0.L;
    for (unsigned long long i = 1; i <= t_terms; ++i)
    {
        pi_squared += 6.L * (1.L / i) * (1.L / i);
    }
    return std::sqrtl(pi_squared);
}
Output:
Pi == 3.14159265358979311599796346854
Pi ~= 3.14159264365071688729358356795 after 10000000000 terms
Suddenly we're down to 7 digits of accuracy, even though we used 10 billion terms. In fact, after about 100 million terms, the approximation to pi stabilizes at this specific value: once the running sum is large, the remaining tiny terms are rounded away entirely. So while using sufficiently large data types to store these computations is important, some additional care is still needed when performing this kind of sum.
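One standard form of that extra care (my addition, not part of the original answer) is Kahan compensated summation, which carries a small correction term so that tiny addends are not swallowed by a large accumulator. A minimal sketch, reusing the includes from the snippet above:
long double pSeriesKahan(unsigned long long t_terms) // same series, compensated
{
    long double sum = 0.L;
    long double c = 0.L;                        // compensation for lost low-order bits
    for (unsigned long long i = 1; i <= t_terms; ++i)
    {
        long double term = 6.L * (1.L / i) * (1.L / i);
        long double y = term - c;               // corrected addend
        long double t = sum + y;                // low bits of y may be lost here...
        c = (t - sum) - y;                      // ...so recover them for the next round
        sum = t;                                // note: assumes no -ffast-math reordering
    }
    return std::sqrtl(sum);
}
With compensation, even the largest-to-smallest loop order should recover the lost digits.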

Why using double and then cast to float?

I'm trying to improve the performance of surf.cpp. At line 140, you can find this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
    return (float)d;
}
Running an Intel Advisor Vectorization analysis shows "1 Data type conversions present", which could be inefficient (especially in vectorization).
But my question is: looking at this function, why would the authors have created d as a double and then cast it to float? If they wanted a decimal number, float would be OK. The only reason that comes to my mind is that since double is more precise than float, it can represent smaller numbers; the final value is big enough to be stored in a float, but I didn't run any tests on d's value.
Any other possible reason?
Because the author wants higher precision during the calculation and only rounds the final result. This is the same as preserving more significant digits during the calculation.
More precisely, with addition and subtraction, errors can accumulate, and the accumulated error can be considerable when a large number of floating-point values is involved.
You questioned that answer, which says it's to use higher precision during the summation, but I don't see why; that answer is correct. Consider this simplified version with completely made-up numbers:
#include <iostream>
#include <iomanip>

float w = 0.012345;

float calcFloat(const int* origin, int n )
{
    float d = 0;
    for( int k = 0; k < n; k++ )
        d += origin[k] * w;
    return (float)d;
}

float calcDouble(const int* origin, int n )
{
    double d = 0;
    for( int k = 0; k < n; k++ )
        d += origin[k] * w;
    return (float)d;
}

int main()
{
    int o[] = { 1111, 22222, 33333, 444444, 5555 };
    std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
    std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}
The results are:
6254.77979
6254.7793
So even though the inputs are the same in both cases, you get a different result using double for the intermediate summation. Changing calcDouble to use (double)w doesn't change the output.
This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.
This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:
In general:
Multiplication and division are “safe” operations
Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.
So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double instead of a float doesn't matter nearly as much: you will get something approximately as accurate as the float value you start with (as long as the result isn't very, very large or very, very small). But summing float values that can have very different orders of magnitude, even when the individual numbers themselves are representable as float, will accumulate errors and deviate further and further from the true answer.
To see that in action:
float f1 = 1e4, f2 = 1e-4;
std::cout << (f1 + f2) << '\n';
std::cout << (double(f1) + f2) << '\n';
Or equivalently, but closer to the original code:
float f1 = 1e4, f2 = 1e-4;
float f = f1;
f += f2;
double d = f1;
d += f2;
std::cout << f << '\n';
std::cout << d << '\n';
The result is:
10000
10000.0001
Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float.
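To make that digit budget concrete (my addition), the standard library can report it directly; digits10 is the number of decimal digits the type is guaranteed to preserve:
#include <iostream>
#include <limits>

int main()
{
    // digits guaranteed to survive a round trip through each type
    std::cout << std::numeric_limits<float>::digits10 << '\n';   // typically 6
    std::cout << std::numeric_limits<double>::digits10 << '\n';  // typically 15
}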

Counting iterations of the Leibniz summation for π in C++

My task is to ask the user how many decimal places of accuracy they want the summation to reach compared to the actual value of pi; so for 2 decimal places, the loop would stop when it reaches 3.14. I have a complete program, but I am unsure whether it actually works as intended. I have checked 0 and 1 decimal places with a calculator and they seem to work, but I don't want to assume it works for all of them. Also, my code may be a little clumsy, since we are still learning the basics; we only just learned loops and nested loops. If there are any obvious mistakes or parts that could be cleaned up, I would appreciate any input.
Edit: I only needed this to work for up to five decimal places. That is why my value of pi was not precise. Sorry for the misunderstanding.
#include <iostream>
#include <cmath>
using namespace std;

int main() {
    const double PI = 3.141592;
    int n, sign = 1;
    double sum = 0, test, m;
    cout << "This program determines how many iterations of the infinite series for\n"
            "pi is needed to get with 'n' decimal places of the true value of pi.\n"
            "How many decimal places of accuracy should there be?" << endl;
    cin >> n;
    double p = PI * pow(10.0, n);
    p = static_cast<double>(static_cast<int>(p) / pow(10, n));
    int counter = 0;
    bool stop = false;
    for (double i = 1; !stop; i = i + 2) {
        sum = sum + (1.0 / i) * sign;
        sign = -sign;
        counter++;
        test = (4 * sum) * pow(10.0, n);
        test = static_cast<double>(static_cast<int>(test) / pow(10, n));
        if (test == p)
            stop = true;
    }
    cout << "The series was iterated " << counter << " times and reached the value of pi\nwithin " << n << " decimal places." << endl;
    return 0;
}
One of the problems with the Leibniz summation is its extremely low convergence rate: it is sublinear, with an error after N terms on the order of 1/N, so each additional correct digit costs roughly ten times as many iterations. In your program you also compare a calculated estimation of π against a given value (a six-digit approximation), while the point of the summation should be to find the correct figures.
You can slightly modify your code to terminate the calculation when the wanted digit doesn't change between iterations (I also added a check on the maximum number of iterations). Remember that you are using doubles, not unlimited-precision numbers, and sooner or later rounding errors will affect the calculation. As a matter of fact, the real limitation of this code is the number of iterations it takes (2,428,700,925 to obtain 3.141592653).
#include <iostream>
#include <cmath>
#include <iomanip>
using std::cout;

// this will take a long long time...
const unsigned long long int MAX_ITER = 100000000000;

int main() {
    int n;
    cout << "This program determines how many iterations of the infinite series for\n"
            "pi is needed to get with 'n' decimal places of the true value of pi.\n"
            "How many decimal places of accuracy should there be?\n";
    std::cin >> n;

    // precalculate some values
    double factor = pow(10.0, n);
    double inv_factor = 1.0 / factor;
    double quad_factor = 4.0 * factor;

    long long int test = 0, old_test = 0, sign = 1;
    unsigned long long int count = 0;
    double sum = 0;
    for ( long long int i = 1; count < MAX_ITER; i += 2 ) {
        sum += 1.0 / (i * sign);
        sign = -sign;
        old_test = test;
        test = static_cast<long long int>(sum * quad_factor);
        ++count;
        // perform the test on integer values
        if ( test == old_test ) {
            cout << "Reached the value of Pi within " << n << " decimal places.\n";
            break;
        }
    }
    double pi_leibniz = static_cast<double>(inv_factor * test);
    cout << "Pi = " << std::setprecision(n + 1) << pi_leibniz << '\n';
    cout << "The series was iterated " << count << " times\n";
    return 0;
}
I have summarized the results of several runs in this table:
digits   Pi              iterations
-----------------------------------
  0      3                          8
  1      3.1                       26
  2      3.14                     628
  3      3.141                  2,455
  4      3.1415               136,121
  5      3.14159              376,848
  6      3.141592           2,886,751
  7      3.1415926         21,547,007
  8      3.14159265       278,609,764
  9      3.141592653    2,428,700,925
 10      3.1415926535  87,312,058,383
Your program will never terminate, because test == p will never be true. This is a comparison between two double-precision numbers that are calculated differently; due to round-off errors, they will not be identical even if you run an infinite number of iterations and your math is correct (and right now it isn't, because the value of PI in your program is not accurate enough).
To help you figure out what's going on, print the value of test in each iteration, as well as the distance between test and pi, as follows:
#include <iostream>
#include <cmath>   // for atan and fabs
using namespace std;

int main() {
    double pi = atan(1.0) * 4; // Make sure you have a precise value of PI
    double sign = 1.0, sum = 0.0;
    for (int i = 1; i < 1000; i += 2) {
        sum = sum + (1.0 / i) * sign;
        sign = -sign;
        double test = 4 * sum;
        cout << test << " " << fabs(test - pi) << "\n";
    }
}
After you make sure the program works well, eventually change the stopping condition to be based on the distance between test and pi:
for (int i=1; fabs(test-pi)>epsilon; i+=2)
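Filling that in (my sketch, with epsilon derived from the requested number of decimals n):
#include <iostream>
#include <cmath>
using namespace std;

int main() {
    int n;
    cin >> n;
    double pi = atan(1.0) * 4;                 // precise reference value
    double epsilon = 0.5 * pow(10.0, -n);      // half a unit in the n-th decimal
    double sign = 1.0, sum = 0.0, test = 0.0;
    long long count = 0;
    for (long long i = 1; fabs(test - pi) > epsilon; i += 2) {
        sum += (1.0 / i) * sign;
        sign = -sign;
        test = 4 * sum;
        ++count;
    }
    cout << "The series was iterated " << count << " times\n";
}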

Ensure float to be smaller than exact value

I want to calculate a sum of the following form in C++:
float result = float(x1)/y1 + float(x2)/y2 + ... + float(xn)/yn
The xi and yi are all integers. The result will be an approximation of the actual value, and it is crucial that this approximation is smaller than or equal to the actual value. I can assume that all my values are finite and positive.
I tried using nextafterf(..., 0), as in this code snippet.
cout.precision( 15 );
float a = 1.0f / 3.0f * 10; // 3 1/3
float b = 2.0f / 3.0f * 10; // 6 2/3
float af = nextafterf( a, 0 );
float bf = nextafterf( b, 0 );
cout << a << endl;
cout << b << endl;
cout << af << endl;
cout << bf << endl;

float sumf = 0.0f;
for ( int i = 1; i <= 3; i++ )
{
    sumf = sumf + bf;
}
sumf = sumf + af;
cout << sumf << endl;
As one can see, the correct solution would be 3 * 6.666... + 3.333... = 23.333...
But as output I get:
3.33333349227905
6.66666698455811
3.33333325386047
6.66666650772095
23.3333339691162
Even though my summands are smaller than what they should represent, their sum is not. In this case, applying nextafterf to sumf gives me 23.3333320617676, which is smaller. But does this always work? Is it possible for the rounding error to get so big that nextafterf still leaves me above the correct value?
I know that I could avoid this by implementing a class for fractions and calculating everything exactly. But I'm curious whether it is possible to achieve my goal with floats.
Try changing the float rounding mode to FE_TOWARDZERO.
See code example here:
Change floating point rounding mode
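For reference, switching the mode looks roughly like this (a sketch of mine; note that compilers may constant-fold floating-point expressions unless FENV_ACCESS is honored, hence the volatile to force runtime arithmetic):
#include <cfenv>
#include <iostream>

int main()
{
    std::fesetround(FE_TOWARDZERO);    // subsequent arithmetic truncates toward zero
    volatile float num = 1.0f, den = 3.0f;
    std::cout.precision(15);
    std::cout << num / den << '\n';    // for positive values, never above the exact 1/3
    std::fesetround(FE_TONEAREST);     // restore the default rounding mode
}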
My immediate reaction is that the approach you're taking is fundamentally flawed.
The problem is that with floating-point numbers, the size of the step that nextafter takes depends on the magnitude of the numbers involved. Let's consider a somewhat extreme example:
#include <iostream>
#include <iomanip>
#include <cmath>

int main() {
    float num = 1.0e-10f;
    float denom = 1.0e10f;
    std::cout << std::setprecision(7) << num - std::nextafterf(num, 0) << "\n";
    std::cout << std::setprecision(7) << denom - std::nextafterf(denom, 0) << "\n";
}
Result:
6.938894e-018
1024
So, since the numerator is a lot smaller than the denominator, the increment is also much smaller.
The consequence seems fairly clear: decrementing the denominator changes the quotient by far more than decrementing the numerator does, so stepping both values down can actually leave the result larger than the exact value rather than smaller.
If you want to ensure the result is smaller than the correct number, the obvious choice is to round the numerator down but the denominator up (i.e., nextafterf(denom, positive_infinity)). This way, you get a smaller numerator and a larger denominator, so the result is always smaller than the unmodified version would have been.
float result = float(x1)/y1 + float(x2)/y2 + ... + float(xn)/yn has 3 places where rounding may occur:
Conversion of int to float - it is not always exact.
Division: floating-point x / floating-point y.
Addition: floating-point quotient + floating-point quotient.
By using the next float (either up or down, as the equation needs), the results will certainly be less than the exact mathematical value. This approach may not generate the float closest to the exact answer, yet it will be close and certainly smaller.
#include <cmath>    // nextafterf
#include <cfloat>   // FLT_MAX
#include <cstddef>  // size_t

float foo(const int *x, const int *y, size_t n) {
    float sum = 0.0;
    for (size_t i = 0; i < n; i++) {                // assume x[0] is x1, x[1] is x2 ...
        float fx = nextafterf(x[i], 0.0);           // numerator rounded down
        float fy = nextafterf(y[i], FLT_MAX);       // denominator rounded up
        // slightly smaller divided by slightly larger, then rounded down
        float q = nextafterf(fx / fy, 0.0);
        sum = nextafterf(sum + q, 0.0);             // running sum rounded down too
    }
    return sum;
}
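As a quick usage check (my addition), feeding the asker's 1/3 + 2/3 example through this function should print a value at or just below the exact answer of 1:
#include <cstdio>

int main()
{
    int x[] = { 1, 2 };
    int y[] = { 3, 3 };
    printf("%.9f\n", foo(x, y, 2));   // every rounding step went down, so <= 1
}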