How does C++ work with large numbers in calculations?

I have code that tries to numerically integrate a function over a given interval using the trapezoidal rule (see the formula in Trapezoid method). For the function sin(x) on the interval [-pi/2.0, pi/2.0], the integral is expected to be zero.
In this case I take the number of partitions n equal to 4. The problem is that with pi given to 20 decimal places the result is zero, with 14 decimal places it is 8.72e-17, with 11 decimal places it is zero again, with 8 decimal places it is 8.72e-17, and with 3 decimal places it is zero. In other words, the integral is zero or a number near zero for different approximations of pi, but there is no clear trend.
I would appreciate your help in understanding why this happens. (I ran it in Dev-C++.)
#include <iostream>
#include <math.h>
using namespace std;
#define pi 3.14159265358979323846
//Pi: 3.14159265358979323846
double func(double x){
    return sin(x);
}
int main() {
    double x0 = -pi/2.0, xf = pi/2.0;
    int n = 4;
    double delta_x = (xf-x0)/(n*1.0);
    double sum = (func(x0)+func(xf))/2.0;
    double integral;
    for (int k = 1; k<n; k++){
        // cout<<"func: "<<func(x0+(k*delta_x))<<" "<<"last sum: "<<sum<<endl;
        sum = sum + func(x0+(k*delta_x));
        // cout<<"func + last sum= "<<sum<<endl;
    }
    integral = delta_x*sum;
    cout<<"The value for the integral is: "<<integral<<endl;
    return 0;
}

OP is integrating y=sin(x) from -a to +a. The various tests use different values of a, all near pi/2.
The approach sums the values in order: starting near -1.0, rising through 0, and ending near 1.0.
This summation is sensitive to calculation error in the last terms, because the mathematically exact sum is 0.0. Since the endpoint a varies between tests, the error varies too.
A more stable result would be had by adding the terms of largest magnitude first, e.g. the terms for k=1 and k=3, and then the term for k=2, rather than in the order k=1,2,3. In particular, the x formed for k=3 is likely to be slightly off from the negative of the x formed for k=1, which compounds the issue.
Welcome to the world of numerical analysis.
The problem also exists if the code uses all float or all long double, just to different degrees.
The problem is not due to using an inexact value of pi. (An exact value is impossible with floating point, as pi is irrational and all finite floating-point values are rational.)
Much of it is due to how x is formed. You could try the following to form x symmetrically about 0.0, and compare the x generated this way with the x generated the original way.
x = (x0+xf)/2 + ((k - n/2.0)*delta_x)
Print out the exact values computed for deeper understanding.
printf("x:%a y:%a\n", x0+(k*delta_x), func(x0+(k*delta_x)));
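As a rough illustration of the suggestion above, the sketch below places the two ways of forming x side by side. The function names are hypothetical and not part of OP's code, and the comment about exact cancellation assumes the library's sin() is an odd function, which is common but not guaranteed.

```cpp
#include <cmath>

// Trapezoidal rule for sin(x) on [x0, xf], forming each sample point
// relative to the interval midpoint (hypothetical helper, not OP's code).
double trapezoid_sin_symmetric(double x0, double xf, int n) {
    const double delta_x = (xf - x0) / n;
    const double mid = (x0 + xf) / 2.0;
    double sum = (std::sin(x0) + std::sin(xf)) / 2.0;
    for (int k = 1; k < n; ++k) {
        // The offsets for k and n-k are exact negatives of each other,
        // so for x0 == -xf the sample points mirror each other exactly.
        const double x = mid + (k - n / 2.0) * delta_x;
        sum += std::sin(x);
    }
    return delta_x * sum;
}

// Same rule with the sample points formed as in the question: x0 + k*delta_x.
double trapezoid_sin_original(double x0, double xf, int n) {
    const double delta_x = (xf - x0) / n;
    double sum = (std::sin(x0) + std::sin(xf)) / 2.0;
    for (int k = 1; k < n; ++k) {
        sum += std::sin(x0 + k * delta_x);
    }
    return delta_x * sum;
}
```

Calling both with the same truncated value of pi (for example -3.14159265/2.0 and 3.14159265/2.0 with n = 4) and printing the results with %a should show whether the residual comes from how x is formed.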


Simpson's Composite Rule giving too large values for when n is very large

I am using the composite Simpson's rule to calculate the integral from 2 to 1,000 of 1/ln(x). However, when using a large n (usually around 500,000), I start to get results that differ from the value my calculator and other sources give me (176.5644). For example, when n = 10,000,000, it gives me a value of 184.1495. I am wondering why this is, since as n gets larger, the accuracy is supposed to increase and not decrease.
#include <iostream>
#include <cmath>
// the function f(x)
float f(float x)
{
    return (float) 1 / std::log(x);
}
float my_simpson(float a, float b, long int n)
{
    if (n % 2 == 1) n += 1; // since n has to be even
    float area, h = (b-a)/n;
    float x, y, z;
    for (int i = 1; i <= n/2; i++)
    {
        x = a + (2*i - 2)*h;
        y = a + (2*i - 1)*h;
        z = a + 2*i*h;
        area += f(x) + 4*f(y) + f(z);
    }
    return area*h/3;
}
int main()
{
    int upperBound = 1'000;
    int subsplits = 1'000'000;
    float approx = my_simpson(2, upperBound, subsplits);
    std::cout << "Output: " << approx << std::endl;
    return 0;
}
Update: Switched from floats to doubles and works much better now! Thank you!
Unlike a real number (in the mathematical sense), a float has limited precision.
A typical IEEE 754 32-bit (single-precision) floating-point number dedicates only 24 bits (one of which is implicit) to the mantissa, which translates to roughly fewer than 8 significant decimal digits (please take this as a gross simplification).
A double, on the other hand, has 53 significand bits, making it more accurate and (usually) the first choice for numerical computations these days.
"since as n gets larger, the accuracy is supposed to increase and not decrease."
Unfortunately, that's not how it works. There's a sweet spot, but beyond it the accumulation of rounding errors prevails and the results diverge from their expected values.
In OP's case, this calculation
area += f(x) + 4*f(y) + f(z);
introduces (and accumulates) rounding errors, because area becomes much greater than f(x) + 4*f(y) + f(z) (e.g. 224678.937 vs. 0.3606823). The bigger n is, the sooner this becomes relevant, making the result diverge from the true one.
As mentioned in the comments, another issue (undefined behavior) is that area isn't initialized (to zero).
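A minimal sketch of the two fixes discussed here (double precision and a zero-initialized accumulator) might look like the following. The names inv_log and simpson are illustrative only, and the loop structure simply mirrors the question's code.

```cpp
#include <cmath>

// Integrand from the question: 1 / ln(x).
double inv_log(double x) { return 1.0 / std::log(x); }

// Composite Simpson's rule in double precision, with the accumulator
// explicitly initialized to zero.
double simpson(double a, double b, long n) {
    if (n % 2 == 1) n += 1;  // n has to be even
    const double h = (b - a) / n;
    double area = 0.0;
    for (long i = 1; i <= n / 2; ++i) {
        const double x = a + (2 * i - 2) * h;
        const double y = a + (2 * i - 1) * h;
        const double z = a + 2 * i * h;
        area += inv_log(x) + 4 * inv_log(y) + inv_log(z);
    }
    return area * h / 3;
}
```

With 53 significand bits, adding a term of about 0.36 to a running total in the hundreds of thousands loses far fewer digits than it does in float, so simpson(2.0, 1000.0, 1000000) should stay close to the reference value quoted in the question.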

Precision issues when converting a decimal number to its rational equivalent

I have the problem of converting a double (say N) to p/q form (rational form). For this I have the following strategy:
Multiply the double N by a large number, say $k = 10^{10}$,
then p = N*k and q = k.
Take gcd(p,q) and set p = p/gcd(p,q) and q = q/gcd(p,q).
When N = 8.2, the answer is correct if we solve with pen and paper, but since 8.2 is represented as 8.19999999 in N (a double), it causes problems in the conversion to rational form.
I also tried it another way (I used a large number 10^k instead of 100):
if(abs(y*100 - round(y*100)) < 0.000001) y = round(y*100)/100
But this approach also doesn't give the right representation all the time.
Is there any way I could carry out the equivalent conversion from double to p/q?
Floating point arithmetic is very difficult. As has been mentioned in the comments, part of the difficulty is that you need to represent your numbers in binary.
For example, the number 0.125 can be represented exactly in binary:
0.125 = 2^-3 = 0b0.001
But the number 0.12 cannot.
To 11 binary places:
0.12 = 0b0.00011110101
If this is converted back to a decimal then the error becomes obvious:
0b0.00011110101 = 0.11962890625
So if you write:
double a = 0.2;
What the machine actually does is find the closest binary representation of 0.2 that it can hold within a double data type. This is an approximation since, like 0.12 above, 0.2 cannot be exactly represented in binary.
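One way to see the approximation is to print the stored value with more digits than the literal had. The helper below is a hypothetical illustration, and the exact digits it shows assume an IEEE 754 binary64 double and a printf implementation that prints exact decimal expansions.

```cpp
#include <cstdio>
#include <string>

// Format a double with 20 digits after the decimal point, exposing the
// value actually stored rather than the literal that was written.
std::string stored_value(double v) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.20f", v);
    return std::string(buf);
}
```

For 0.125 every printed digit after 0.125 is zero, while for 0.2 nonzero digits appear around the 17th decimal place.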
One possible approach is to define an 'epsilon' which determines how close your number can be to the nearest representable binary floating point.
"have problem of converting a double (say N) to p/q form
... when N = 8.2"
A typical double cannot encode 8.2 exactly. Instead, the closest representable doubles are about
8.19999999999999928945726423989981412887573... // closest
8.20000000000000106581410364015027880668640... // next closest
When code does
double N = 8.2;
it is the 8.19999999999999928945726423989981412887573... that is converted into rational form.
Converting a double to p/q form:
Multiply double N by a large number say $k = 10^{10}$
This may overflow the double. The first step should be to determine whether the double is large, in which case it is a whole number.
Do not multiply by some power of 10, as double almost certainly uses a binary encoding. Multiplication by 10, 100, etc. may introduce round-off error.
C implementations of double overwhelmingly use a binary encoding, so that FLT_RADIX == 2.
Then every finite double x has a significand that is a fraction of some integer over some power of 2: a binary fraction of DBL_MANT_DIG digits (thanks @Richard Critten). This is often 53 binary digits.
Determine the exponent of the double. If it is large enough, or x == 0.0, the double is a whole number.
Otherwise, scale the numerator and denominator by DBL_MANT_DIG. While the numerator is even, halve both the numerator and denominator. As the denominator is a power of 2, no other primes need to be considered for simplification.
#include <float.h>
#include <math.h>
#include <stdio.h>
void form_ratio(double x) {
    double numerator = x;
    double denominator = 1.0;
    if (isfinite(numerator) && x != 0.0) {
        int expo;
        frexp(numerator, &expo);
        if (expo < DBL_MANT_DIG) {
            expo = DBL_MANT_DIG - expo;
            numerator = ldexp(numerator, expo);
            denominator = ldexp(1.0, expo);
            while (fmod(numerator, 2.0) == 0.0 && denominator > 1.0) {
                numerator /= 2.0;
                denominator /= 2.0;
            }
        }
    }
    int pre = DBL_DECIMAL_DIG;
    printf("%.*g --> %.*g/%.*g\n", pre, x, pre, numerator, pre, denominator);
}
int main(void) {
    form_ratio(123456789012.0);
    form_ratio(42.0);
    form_ratio(1.0 / 7);
    form_ratio(867.5309);
}
Output:
123456789012 --> 123456789012/1
42 --> 42/1
0.14285714285714285 --> 2573485501354569/18014398509481984
867.53089999999997 --> 3815441248019913/4398046511104
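For callers who want p and q as integers rather than as doubles, a sketch along the same lines could look like the following. It is not the answer's code: Ratio and exact_ratio are hypothetical names, it assumes FLT_RADIX == 2 and a 53-bit significand, and it simply reports failure when the result would not fit in 64-bit integers.

```cpp
#include <cfloat>
#include <cmath>
#include <cstdint>

struct Ratio {
    bool ok;           // false if x is not finite or does not fit in 64-bit integers
    std::int64_t num;
    std::int64_t den;
};

// Exact conversion of a double to num/den with den a power of two.
Ratio exact_ratio(double x) {
    Ratio r = {false, 0, 1};
    if (!std::isfinite(x)) return r;
    if (x == 0.0) { r.ok = true; return r; }

    int expo;
    const double frac = std::frexp(x, &expo);  // x == frac * 2^expo, 0.5 <= |frac| < 1
    if (expo > 62) return r;                   // |x| >= 2^62: treated as too large here

    // frac * 2^DBL_MANT_DIG is an exact integer, so x == m * 2^shift.
    std::int64_t m = static_cast<std::int64_t>(std::ldexp(frac, DBL_MANT_DIG));
    int shift = expo - DBL_MANT_DIG;

    // Cancel common factors of 2; the denominator has no other prime factors.
    while (m % 2 == 0) { m /= 2; ++shift; }

    if (shift >= 0) {
        r.num = m * (static_cast<std::int64_t>(1) << shift);  // whole number
        r.den = 1;
    } else {
        if (-shift > 62) return r;             // denominator would not fit
        r.num = m;
        r.den = static_cast<std::int64_t>(1) << (-shift);
    }
    r.ok = true;
    return r;
}
```

The result is the exact value of the stored double, so exact_ratio(8.2) describes 8.19999999999999928..., not 41/5. Recovering 41/5 needs a separate approximation step with a tolerance.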

Calculations failed because of - nan

My exercise is to write code which will print the value of this expression.
I have written code which should work, but when I try to print the value I get "the value is -nan".
//My Code
#include <iostream>
#include <stdio.h>
#include <cmath>
using namespace std;
int main()
{
    double y;
    double x = 21;
    y = 30 * sqrt(x * (1/(tan(sqrt(3*x) - 2.1))));
    printf ("The value is: \n=> %f", y );
}
My question is how can I print the proper value?
Try this:
printf( "sqrt(3*x) = %lf\n", sqrt(3*x));
printf( "sqrt(3*x) - 2.1 = %lf\n", sqrt(3*x) - 2.1);
printf( "tan(sqrt(3*x) - 2.1) = %lf\n", tan(sqrt(3*x) - 2.1));
Then you will notice that the last one is negative, which results in the square root of a negative number, hence the NaN.
The problem is that, depending on the unit (radians or degrees), you get different results with trigonometric functions. Keep in mind that the tan function expects its argument in radians.
sqrt(3*21)-2.1 = 5.837, and you have to calculate its tangent. It is indeed negative if we work with radians (it is around -0.478), leading to the square root of a negative number which is NaN (Not a Number), but if you use degrees then it is +0.102 and you can complete the calculation. If you want to have the result you would have with degrees, considering the function accepts radians, you must convert the number. The conversion is simple: multiply by Pi and divide by 180. Like this:
y = 30 * sqrt(x * (1/(tan((sqrt(3*x) - 2.1)*M_PI/180))));
In this case the result is 429.967.
If the problem is not related to the conversion to radians (i.e. multiplication by M_PI / 180), consider the general case.
In general, operations that produce NaN (Not a Number) [1] include 0/0, ∞/∞, 0×∞, ∞ − ∞, and the square root of a negative number.
In your case the result of tan() is negative, which leads to a negative input value for the outer sqrt(), the last of the cases listed above.
To resolve the problematic situation you could either use some mathematical trick [2] and try to rewrite the expression such that it doesn't produce a NaN, or, if the problem is the square root of a negative number, you can #include <complex> and use:
std::complex<double> two_i = std::sqrt(std::complex<double>(-4));
The other answers provide a strategy for identifying the source of the NaN by checking each computation involved.
[1] NaNs are bit patterns reserved for special quantities, used to handle exceptional situations such as taking the square root of a negative number without aborting the computation.
[2] Use trigonometric relations.
Here M_PI is defined as: #define M_PI 3.14159265358979323846
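To make the two readings of the expression concrete, here is a small sketch that evaluates it once with the tangent's argument taken as radians and once converted from degrees. The function names and the constant kPi are hypothetical; kPi is used because M_PI is not guaranteed by the C++ standard.

```cpp
#include <cmath>

const double kPi = 3.14159265358979323846;

// 30 * sqrt(x / tan(sqrt(3x) - 2.1)), with the tangent's argument in radians.
// For x = 21 the tangent is negative, so the outer sqrt yields NaN.
double expr_radians(double x) {
    const double t = std::tan(std::sqrt(3 * x) - 2.1);
    return 30 * std::sqrt(x * (1 / t));
}

// Same expression, interpreting sqrt(3x) - 2.1 as degrees and converting
// to radians before calling tan().
double expr_degrees(double x) {
    const double t = std::tan((std::sqrt(3 * x) - 2.1) * kPi / 180);
    return 30 * std::sqrt(x * (1 / t));
}

// Helper for locating the source: true if the outer sqrt would receive a
// negative argument in the radians reading.
bool outer_sqrt_argument_is_negative(double x) {
    return x * (1 / std::tan(std::sqrt(3 * x) - 2.1)) < 0.0;
}
```

Checking std::isnan(expr_radians(21)) after the call is a cheap way to confirm where the NaN originates before deciding which interpretation the exercise intends.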

The result of my own double-precision cos() implementation in a shader is NaN, but it works well on the CPU. What is going wrong?

As I said, I want to implement my own double-precision cos() function in a compute shader with GLSL, because the built-in version exists only for float.
This is my code:
double faculty[41];//values are calculated at the beginning of main()
double myCOS(double x)
{
    double sum,tempExp,sign;
    sum = 1.0;
    tempExp = 1.0;
    sign = -1.0;
    for(int i = 1; i <= 30; i++)
    {
        tempExp *= x;
        if(i % 2 == 0){
            sum = sum + (sign * (tempExp / faculty[i]));
            sign *= -1.0;
        }
    }
    return sum;
}
The result of this code is that sum turns out to be NaN on the shader, but on the CPU the algorithm works well.
I tried to debug this code too and I got the following information:
faculty[i] is positive and not zero for all entries
tempExp is positive in each step
none of the other variables are NaN during each step
the first time sum is NaN is at the step with i=4
And now my question: what exactly can go wrong if each variable is a number and nothing is divided by zero, especially when the algorithm works on the CPU?
Let me guess:
First you determined the problem is in the loop, and you use only the following operations: +, *, /.
The rules for generating NaN from these operations are:
The divisions 0/0 and ±∞/±∞
The multiplications 0×±∞ and ±∞×0
The additions ∞ + (−∞), (−∞) + ∞ and equivalent subtractions
You ruled out the possibility for 0/0 and ±∞/±∞ by stating that faculty[] is correctly initialized.
The variable sign is always 1.0 or -1.0 so it cannot generate the NaN through the * operation.
What remains is the + operation, if tempExp ever becomes ±∞.
So probably the input x is too large on entry to your function and tempExp becomes ±∞, which makes sum ±∞ too. At the next iteration you will trigger the NaN-generating operation ∞ + (−∞). This is because you multiply one side of the addition by sign, and sign switches between positive and negative at each iteration.
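The mechanism guessed at above can be reproduced on the CPU with a plain C++ port of the loop. This is a sketch with hypothetical names, it builds the factorial incrementally instead of using the faculty[] table, and the input 1e20 is chosen only to force the overflow; it is not a value taken from the question.

```cpp
#include <cmath>

// Straight port of the question's series: powers of x and factorials are
// built up separately, so the power can overflow to infinity.
double naive_cos_series(double x) {
    double sum = 1.0;
    double power = 1.0;      // plays the role of tempExp
    double factorial = 1.0;  // plays the role of faculty[i]
    double sign = -1.0;
    for (int i = 1; i <= 30; ++i) {
        power *= x;
        factorial *= i;
        if (i % 2 == 0) {
            sum = sum + (sign * (power / factorial));
            sign *= -1.0;
        }
    }
    return sum;
}
```

For a small argument such as 0.5 the result agrees with std::cos, while for 1e20 the power overflows partway through the loop and the alternating sign then produces ∞ + (−∞).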
You're trying to approximate cos(x) around 0.0. So you should use the properties of the cos() function to reduce your input value to a value near 0.0. Ideally in the range [0, pi/4]. For instance, remove multiples of 2*pi, and get the values of cos() in [pi/4, pi/2] by computing sin(x) around 0.0 and so on.
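A minimal C++ sketch of the range-reduction idea follows. It is a simplification of what the paragraph above recommends: it only folds the argument into [-pi, pi] using fmod rather than going all the way to [0, pi/4], and it computes each series term from the previous one so that no large power or factorial is ever formed. The function name is hypothetical, and porting it to GLSL doubles would need a replacement for fmod.

```cpp
#include <cmath>

// cos(x) via range reduction to [-pi, pi] followed by a Taylor series.
double reduced_cos_series(double x) {
    const double two_pi = 6.283185307179586476925286766559;

    // cos(x + 2*pi) == cos(x): drop whole periods, then fold into [-pi, pi].
    double r = std::fmod(x, two_pi);
    if (r > two_pi / 2) r -= two_pi;
    if (r < -two_pi / 2) r += two_pi;

    // term_k = term_{k-1} * (-r^2) / ((2k-1) * 2k), starting from term_0 = 1.
    const double r2 = r * r;
    double term = 1.0;
    double sum = 1.0;
    for (int k = 1; k <= 15; ++k) {
        term *= -r2 / ((2.0 * k - 1.0) * (2.0 * k));
        sum += term;
    }
    return sum;
}
```

Because the reduced argument is bounded, the terms shrink quickly and cannot overflow. For very large inputs the reduction itself loses accuracy, as the next answer points out, so the result is finite but not necessarily meaningful.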
What can go dramatically wrong is a loss of precision. cos(x) usually is implemented by range reduction followed by a dedicated implementation for the range [0, pi/2]. Range reduction uses cos(x+2*pi) = cos(x). But this range reduction isn't perfect. For starters, pi cannot be exactly represented in finite math.
Now what happens if you try something as absurd as cos(1<<30) ? It's quite possible that the range reduction algorithm introduces an error in x that's larger than 2*pi, in which case the outcome is meaningless. Returning NaN in such cases is reasonable.

C++ double division by 0.0 versus DBL_MIN

When finding the inverse square root of a double, is it better to clamp invalid non-positive inputs at 0.0 or DBL_MIN? (In my example below, double b may end up being negative due to floating-point rounding errors and because the laws of physics are slightly fudged in the game.)
Both division by 0.0 and by DBL_MIN produce the same outcome in the game because 1/0.0 and 1/DBL_MIN are effectively infinity. My intuition says DBL_MIN is the better choice, but would there be any case for using 0.0? For example, perhaps sqrt(0.0), 1/0.0 and multiplication by 1.#INF000000000000 execute faster because they are special cases.
double b = 1 - v.length_squared()/(c*c);
#ifdef CLAMP_BY_0
if (b < 0.0) b = 0.0;
#else
if (b <= 0.0) b = DBL_MIN;
#endif
double lorentz_factor = 1/sqrt(b);
double division in MSVC:
1/0.0 = 1.#INF000000000000
1/DBL_MIN = 4.4942328371557898e+307
When dealing with floating point math, "infinity" and "effectively infinity" are quite different. Once a number stops being finite, it tends to stay that way. So while the value of lorentz_factor is "effectively" the same for both methods, depending on how you use that value, later computations can be radically different. sqrt(lorentz_factor) for instance remains infinite if you clamp to 0, but will actually be calculated if you clamp to some very very small number.
So the answer will largely depend on what you plan on doing with that value once you've clamped it.
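The difference shows up as soon as the clamped value is used in further arithmetic. The sketch below uses a hypothetical helper, factor_from, and assumes IEEE 754 semantics, under which 1.0/0.0 evaluates to +infinity at run time.

```cpp
#include <cfloat>
#include <cmath>

// 1/sqrt(b) for an already-clamped b (hypothetical helper).
// Assumes IEEE 754 arithmetic, so a zero b yields +infinity.
double factor_from(double b) {
    return 1.0 / std::sqrt(b);
}

// Example of a later computation: scaling the factor by some other quantity.
double scaled(double factor, double scale) {
    return factor * scale;
}
```

With b clamped to 0.0, scaled(factor_from(0.0), 0.0) is NaN because ∞ × 0 has no defined value; with b clamped to DBL_MIN the same call gives 0.0, since the factor is large but finite.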
Why not just assign INF to lorentz_factor directly, avoiding both the sqrt call and the division?
double lorentz_factor;
if (b <= 0.0)
    lorentz_factor = std::numeric_limits<double>::infinity();
else
    lorentz_factor = 1/sqrt(b);
You'll need to #include <limits> for this.
You can also use ::max() instead of ::infinity(), if that's what you need.
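Wrapped up as a function, the suggestion might look like the sketch below. The signature is an assumption made for illustration: speed_squared stands in for v.length_squared() from the question, and c is the speed of light in the game's units.

```cpp
#include <cmath>
#include <limits>

// Lorentz factor 1/sqrt(1 - v^2/c^2), returning infinity directly when
// rounding (or fudged physics) pushes b to zero or below.
double lorentz_factor(double speed_squared, double c) {
    const double b = 1.0 - speed_squared / (c * c);
    if (b <= 0.0) {
        return std::numeric_limits<double>::infinity();
    }
    return 1.0 / std::sqrt(b);
}
```

Swapping infinity() for max() in the early return gives the large-but-finite variant, if later code must not see a non-finite value.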