I have the following piece of code
#include <iostream>
#include <iomanip>
int main()
{
    double x = 7033753.49999141693115234375;
    double y = 7033753.499991415999829769134521484375;
    double z = (x + y) / 2.0;
    std::cout << "y is " << std::setprecision(40) << y << "\n";
    std::cout << "x is " << std::setprecision(40) << x << "\n";
    std::cout << "z is " << std::setprecision(40) << z << "\n";
    return 0;
}
When the above code is run I get,
y is 7033753.499991415999829769134521484375
x is 7033753.49999141693115234375
z is 7033753.49999141693115234375
When I do the same in Wolfram Alpha the value of z is completely different
z = 7033753.4999914164654910564422607421875 #Wolfram answer
I am familiar with floating point precision and that large numbers away from zero cannot be exactly represented. Is that what is happening here? Is there any way in C++ where I can get the same answer as Wolfram without any performance penalty?
large numbers away from zero cannot be exactly represented. Is that what is happening here?
Yes.
Note that there are also infinitely many rational numbers near zero that cannot be represented. But the distance between representable values does grow exponentially as you move into larger value ranges.
Is there any way in C++ where I can get the same answer as Wolfram ...
You can potentially get the same answer by using long double. My system produces exactly the same result as Wolfram. Note that the precision of long double varies between systems, even among systems that conform to the IEEE 754 standard.
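For instance, here is a minimal sketch of the same computation with long double, using the inputs from the question; whether this reproduces the Wolfram digits depends on how much precision your platform's long double actually has (e.g. the 80-bit extended format with gcc on x86):
#include <iostream>
#include <iomanip>
int main()
{
    // Same values as above, but as long double literals (note the L suffix).
    long double x = 7033753.49999141693115234375L;
    long double y = 7033753.499991415999829769134521484375L;
    long double z = (x + y) / 2.0L;
    std::cout << "z is " << std::setprecision(40) << z << "\n";
    return 0;
}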
More generally though, if you need results that are accurate to many significant digits, then don't use finite precision math.
... without any performance penalty?
No. Precision comes with a cost.
Just telling IOStreams to print to 40 significant decimal figures of precision doesn't mean that the value you're outputting actually has that much precision.
A typical double takes you up to 17 significant decimal figures (ish); beyond that, what you see is completely arbitrary.
Per eerorika's answer, it looks like the Wolfram Alpha answer is also falling foul of this, albeit possibly with a different precision limit than yours.
You can try a different approach like a "bignum" library, or limit yourself to the precision afforded by the types that you've chosen.
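For instance, here is a rough sketch of the original computation with a "bignum" approach using Boost.Multiprecision (this assumes Boost is available; cpp_bin_float_50 carries roughly 50 decimal digits):
#include <iostream>
#include <iomanip>
#include <boost/multiprecision/cpp_bin_float.hpp>
int main()
{
    using big = boost::multiprecision::cpp_bin_float_50;
    // Construct from strings so the decimal literals are not squeezed through double first.
    big x("7033753.49999141693115234375");
    big y("7033753.499991415999829769134521484375");
    std::cout << std::setprecision(40) << (x + y) / 2 << "\n";
    return 0;
}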
Related
Is there a difference in accuracy between pow(a/b,x) and pow(b/a,-x)?
If there is, does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce a more accurate result?
Edit: Let's assume x86_64 processor and gcc compiler.
Edit: I tried comparing using some random numbers. For example:
printf("%.20f",pow(8.72138221/1.761329479,-1.51231)) // 0.08898783049228660424
printf("%.20f",pow(1.761329479/8.72138221, 1.51231)) // 0.08898783049228659037
So, it looks like there is a difference (albeit minuscule in this case), but maybe someone who knows about the algorithm implementation could comment on what the maximum difference is, and under what conditions.
Here's one way to answer such questions: see how floating-point behaves empirically. This is not a 100% correct way to analyze such questions, but it gives a general idea.
Let's generate random numbers. Calculate v0=pow(a/b, n) and v1=pow(b/a, -n) in float precision, and calculate ref=pow(a/b, n) in double precision and round it to float. We use ref as a reference value (we suppose that double has much more precision than float, so we can trust that ref is the best possible value; this is true for IEEE-754 most of the time). Then sum the differences v0-ref and v1-ref, where each difference is measured as the number of representable floating-point numbers between the value and ref.
Note that the results may depend on the range of a, b and n (and on the quality of the random generator; if it's really bad, it may give a biased result). Here, I've used a=[0..1], b=[0..1] and n=[-2..2]. Furthermore, this answer supposes that the float and double implementations of division and pow are of the same kind and have the same characteristics.
On my computer, the summed differences are 2604828 and 2603684, which means that there is no significant precision difference between the two approaches.
Here's the code (note, this code supposes IEEE-754 arithmetic):
#include <cmath>
#include <stdio.h>
#include <stdlib.h>  // rand, RAND_MAX
#include <string.h>
// Distance between two floats, measured as the number of representable float values between them.
long long int diff(float a, float b) {
    unsigned int ai, bi;
    memcpy(&ai, &a, 4);
    memcpy(&bi, &b, 4);
    long long int diff = (long long int)ai - bi;
    if (diff < 0) diff = -diff;
    return diff;
}
int main() {
    long long int e0 = 0;
    long long int e1 = 0;
    for (int i = 0; i < 10000000; i++) {
        float a = 1.0f * rand() / RAND_MAX;
        float b = 1.0f * rand() / RAND_MAX;
        float n = 4.0f * rand() / RAND_MAX - 2.0f;
        if (a == 0 || b == 0) continue;
        float v0 = std::pow(a / b, n);
        float v1 = std::pow(b / a, -n);
        float ref = std::pow((double)a / b, n);  // double-precision reference, rounded to float
        e0 += diff(ref, v0);
        e1 += diff(ref, v1);
    }
    printf("%lld %lld\n", e0, e1);
}
... between pow(a/b,x) and pow(b/a,-x) ... does raising a number less than 1 to a positive power or a number greater than 1 to a negative power produce more accurate result?
Whichever division is more accurate.
Consider z = x^y = 2^(y * log2(x)).
Roughly: the error in y * log2(x) is magnified by the value of z to form the error in z. x^y is very sensitive to the error in x. The larger |log2(x)| is, the greater the concern.
In OP's case, both pow(a/b,p) and pow(b/a,-p), in general, have the same y * log2(x) and the same z and similar errors in z. It is a question of how x and y are formed:
a/b and b/a, in general, both have the same error of +/- 0.5*unit in the last place and so both approaches are of similar error.
Yet with select values of a/b vs. b/a, one quotient will be more exact, and it is that approach that has the lower pow() error.
pow(7777777/4,-p) can be expected to be more accurate than pow(4/7777777,p).
Lacking assurance about the error in the division, the general rule applies: no major difference.
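As a rough, non-rigorous illustration of the power-of-two point, one can compare both forms against a higher-precision reference. This sketch assumes long double is wider than double (as it is with gcc on x86_64, per the question's edit); the exponent p is arbitrary, and for many inputs the two errors will simply be equal:
#include <cmath>
#include <cstdio>
int main()
{
    double a = 7777777.0, b = 4.0, p = 1.51231;
    // a/b is exact because b is a power of two; b/a must round.
    long double ref = std::pow((long double)a / b, (long double)-p);
    double v0 = std::pow(a / b, -p);  // exact quotient
    double v1 = std::pow(b / a, p);   // rounded quotient
    std::printf("exact quotient form  : error %g\n", (double)(v0 - ref));
    std::printf("rounded quotient form: error %g\n", (double)(v1 - ref));
    return 0;
}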
In general, the form with the positive power is slightly better, although by so little it will likely have no practical effect. Specific cases could be distinguished. For example, if either a or b is a power of two, it ought to be used as the denominator, as the division then has no rounding error.
In this answer, I assume IEEE-754 binary floating-point with round-to-nearest-ties-to-even and that the values involved are in the normal range of the floating-point format.
Given a, b, and x with values a, b, and x, and an implementation of pow that computes the representable value nearest the ideal mathematical value (actual implementations are generally not this good), pow(a/b, x) computes (a/b·(1+e0))^x · (1+e1), where e0 is the rounding error that occurs in the division and e1 is the rounding error that occurs in the pow, and pow(b/a, -x) computes (b/a·(1+e2))^(−x) · (1+e3), where e2 and e3 are the rounding errors in this division and this pow, respectively.
Each of the errors e0…e3 lies in the interval [−u/2, u/2], where u is the unit of least precision (ULP) of 1 in the floating-point format. (The notation [p, q] is the interval containing all values from p to q, including p and q.) In case a result is near the edge of a binade (where the floating-point exponent changes and the significand is near 1), the lower bound may be −u/4. At this time, I will not analyze this case.
Rewriting, these are (a/b)^x · (1+e0)^x · (1+e1) and (a/b)^x · (1+e2)^(−x) · (1+e3). This reveals the primary difference is in (1+e0)^x versus (1+e2)^(−x). The 1+e1 versus 1+e3 is also a difference, but this is just the final rounding. [I may consider further analysis of this later but omit it for now.]
Consider (1+e0)^x and (1+e2)^(−x). The potential values of the first expression span [(1−u/2)^x, (1+u/2)^x], while the second spans [(1+u/2)^(−x), (1−u/2)^(−x)]. When x > 0, the second interval is longer than the first:
The length of the first is (1+u/2)^x − (1−u/2)^x.
The length of the second is (1/(1−u/2))^x − (1/(1+u/2))^x.
Multiplying the latter by (1−u^2/4)^x produces ((1−u^2/4)/(1−u/2))^x − ((1−u^2/4)/(1+u/2))^x = (1+u/2)^x − (1−u/2)^x, which is the length of the first interval.
1−u^2/4 < 1, so (1−u^2/4)^x < 1 for positive x.
Since the first length equals the second length times a number less than one, the first interval is shorter.
Thus, the form in which the exponent is positive is better in the sense that it has a shorter interval of potential results.
Nonetheless, this difference is very slight. I would not be surprised if it were unobservable in practice. Also, one might be concerned with the probability distribution of errors rather than the range of potential errors. I suspect this would also favor positive exponents.
For evaluating rounding errors like those in your case, it might be useful to use a multi-precision library, such as Boost.Multiprecision. Then you can compare results for various precisions, e.g., with the following program:
#include <iomanip>
#include <iostream>
#include <boost/multiprecision/cpp_bin_float.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
namespace mp = boost::multiprecision;
template <typename FLOAT>
void comp() {
    FLOAT a = 8.72138221;
    FLOAT b = 1.761329479;
    FLOAT c = 1.51231;
    FLOAT e = mp::pow(a / b, -c);
    FLOAT f = mp::pow(b / a, c);
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << e << std::endl;
    std::cout << std::fixed << std::setw(40) << std::setprecision(40) << f << std::endl;
}
int main() {
    std::cout << "Double: " << std::endl;
    comp<mp::cpp_bin_float_double>();
    std::cout << std::endl;
    std::cout << "Double extended: " << std::endl;
    comp<mp::cpp_bin_float_double_extended>();
    std::cout << std::endl;
    std::cout << "Quad: " << std::endl;
    comp<mp::cpp_bin_float_quad>();
    std::cout << std::endl;
    std::cout << "Dec-100: " << std::endl;
    comp<mp::cpp_dec_float_100>();
    std::cout << std::endl;
}
Its output reads, on my platform:
Double:
0.0889878304922865903670015086390776559711
0.0889878304922866181225771242679911665618
Double extended:
0.0889878304922865999079806265115166752366
0.0889878304922865999012043629334822725241
Quad:
0.0889878304922865999004910375213273866639
0.0889878304922865999004910375213273505527
Dec-100:
0.0889878304922865999004910375213273881004
0.0889878304922865999004910375213273881004
Live demo: https://wandbox.org/permlink/tAm4sBIoIuUy2lO6
For double, the first calculation was more accurate; however, I guess one cannot make any generic conclusions here.
Also, note that your input numbers are not exactly representable with the IEEE 754 double-precision floating-point type (none of them is). The question is whether you care about the accuracy of calculations with those exact numbers or with their closest representations.
I have a program and I'm trying to calculate cos(M_PI*3/2) and instead of getting 0, as I should, I get -1.83691e-016.
What am I missing here? I am working in radians, as I need to be.
First, M_PI is not a very portable macro and is usually good to about 15 decimal places, depending on the compiler you use - my guess is you're using Microsoft's C++ compiler.
Second, if you want a more accurate (and portable) version, use the Boost Math library:
http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/math_toolkit/tutorial/non_templ.html
Third, as Kay has pointed out, pi is an irrational number, and therefore no number of bits (or digits in base 10) would be enough to represent it exactly. Therefore, what you're actually calculating is not cos(3*pi/2) exactly, but "the cosine of 3/2 times the closest approximation of pi given the bits available", which will NOT be 3*pi/2 and therefore won't give zero.
Finally, if you want custom precision for your mathematical constants, read this: http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/math_toolkit/tutorial/user_def.html
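A minimal sketch of the Boost.Math constant mentioned above (assuming Boost is installed); note that pi<double>() is still only the double closest to pi, so the cosine stays tiny but nonzero:
#include <cmath>
#include <iostream>
#include <boost/math/constants/constants.hpp>
int main()
{
    const double pi = boost::math::constants::pi<double>();
    // pi is the nearest double to the true value, so this prints roughly -1.8e-16, not 0.
    std::cout << std::cos(pi * 3 / 2) << "\n";
    return 0;
}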
The number M_PI is only an approximation of π. The cosine that you get back is also an approximation, and it's a pretty good one - it has fifteen correct digits after the decimal point.
Given the discrete nature of double values, the standard margin of error against which to test for numerical equality is numeric_limits<double>::epsilon():
#include <iostream>
#include <limits>
#include <cmath>
using namespace std;
int main()
{
    double x = cos(M_PI*3/2);
    cout << "x = " << x << endl;
    cout << "numeric_limits<double>::epsilon() = "
         << numeric_limits<double>::epsilon() << endl;
    cout << "Is x sufficiently close to 0? "
         << (abs(x) < numeric_limits<double>::epsilon() ? "yes" : "no") << endl;
    return 0;
}
Output:
x = -1.83697e-16
numeric_limits<double>::epsilon() = 2.22045e-16
Is x sufficiently close to 0? yes
As you can see, the absolute value of -1.83697e-16 is within the margin of error given by epsilon 2.22045e-16.
Pi is irrational, so the computer cannot represent the number perfectly. The small error relative to the "correct" value of pi causes the error in the output. Being 1.83691 × 10^-16 off is still pretty good.
If you want to learn more about the restrictions of actual systems and the impact of small errors in the input, then refer to http://en.wikipedia.org/wiki/Numerical_stability.
I compile and run this code with MSVC2008
long double x = 111111111;
long double y = 222222222;
long double Z = x * y;
cout << Z << endl;
When I debug, Z equals
24691357975308640
Mathematically Z should be
24691357975308642
What's going on?
Doubles are only precise to around 16 digits. If I counted right, then you have 17 digits, and are correct up to 16. If you want to do this kind of math, and will only have integers, then use ints. For a number that large, you will need to use uint64_t.
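For instance (a small sketch), with 64-bit unsigned integers the product is exact:
#include <cstdint>
#include <iostream>
int main()
{
    std::uint64_t x = 111111111;
    std::uint64_t y = 222222222;
    std::cout << x * y << "\n";  // prints 24691357975308642 exactly
    return 0;
}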
Nothing is going on. Doubles have a finite amount of precision, and for that precision the value that you obtain is correct. It is an unfortunate shortcoming of the way you chose to print the value that information about the precision (i.e. the significant digits) was lost.
For example, for a 1+11+(1)+52 float (see here), we have 53 bits of precision, giving us 53 × log10(2) decimal digits of precision, i.e. about 15. So we only print 15 digits:
#include <iomanip>
#include <iostream>
std::cout << std::setfill('0') << std::setprecision(15) << std::scientific
<< Z << std::endl;
The result is:
2.469135797530864e+16
Now we made the precision manifest, and the result is indeed correct at that precision.
If you don't like the magic 15 in the code, you should #include <limits> and use:
std::numeric_limits<decltype(Z)>::digits10
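Putting the pieces together, a small sketch follows; how many digits this yields, and whether the product is even exact, depends on what long double is on your platform (with MSVC it is the same as double):
#include <iomanip>
#include <iostream>
#include <limits>
int main()
{
    long double Z = 111111111.0L * 222222222.0L;
    std::cout << std::scientific
              << std::setprecision(std::numeric_limits<decltype(Z)>::digits10)
              << Z << std::endl;
    return 0;
}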
Floating point arithmetic is going on. This is a good read. Basically, computers can have problems storing and dealing with floating point numbers, so you get these sorts of arithmetic errors.
Generally, one could write a book answering your question. Long story short - floating point arithmetic is going on. See Floating Point. Also, converting double values to ASCII (for display) is hard and not exact. You may also want to look at arbitrary-precision arithmetic.
I have a simple question about floating-point numbers:
double temp;
std::cout.precision(std::numeric_limits<double>::digits10);
temp = 12345678901234567890.1234567890;
std::cout << (temp < std::numeric_limits<double>::max()) << std::endl;
std::cout << std::fixed << std::endl;
std::cout << temp << std::endl;
However, the output I get is this,
1
12345678901234567168.000000000000000
The value of temp is still within the range of double; however, the value is completely different. I am wondering what I have done wrong here.
Thanks.
A double has only 15.95 decimal digits of precision. You've already exceeded this number of digits in the integer part of the value, hence the loss of precision in the last few digits, and the lack of any useful digits after the decimal point.
You should probably take a look at this: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html before doing any more work with floating point values.
It's not completely different. It's correct to 16 digits or so. That's about what you can expect from a double.
A double can only store a limited amount of precision. It works out to about 15 decimal digits.
Here's a helpful article on how floating point numbers are represented, and the implications of that representation: Float
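To make the limitation concrete, here is a small sketch that prints the gap between adjacent doubles at the magnitude used in the question; no fractional part can survive when neighbouring representable values are thousands apart:
#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>
int main()
{
    double t = 12345678901234567890.1234567890;  // rounds to the nearest double
    double next = std::nextafter(t, std::numeric_limits<double>::infinity());
    std::cout << std::fixed << std::setprecision(1)
              << "stored value: " << t << "\n"
              << "next double : " << next << "\n"
              << "gap         : " << (next - t) << "\n";
    return 0;
}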
IEEE 754 cannot represent every value exactly - see for example http://www.cprogramming.com/tutorial/floating_point/understanding_floating_point.html and http://support.microsoft.com/kb/42980
-358974.27 can't be represented as a float according to http://ridiculousfish.com/blog/posts/float.html and I remember (though I'm too lazy to test it) that even something "simple" like 2.2 or 2.3 can't be represented exactly even as a double.
I am unable to understand why C++ division behaves the way it does. I have a simple program which divides 1 by 10 (using VS 2003)
double dResult = 0.0;
dResult = 1.0/10.0;
I expect dResult to be 0.1; however, I get 0.10000000000000001.
Why do I get this value? What is the problem with the internal representation of double/float?
How can I get the correct value?
Thanks.
Because almost all modern processors use binary floating-point, which cannot exactly represent 0.1 (there is no way to represent 0.1 as m * 2^e with integer m and e).
If you want to see the "correct value", you can print it out with e.g.:
printf("%.1f\n", dResult);
Double and float cannot represent real numbers exactly: there are infinitely many real values, but only a finite number of bits to represent them in a double or float.
For further reading: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
The ubiquitous IEEE754 floating point format expresses floating point numbers in scientific notation base 2, with a finite mantissa. Since a fraction like 1/5 (and hence 1/10) does not have a representation with finitely many digits in binary scientific notation, you cannot represent the value 0.1 exactly. More generally, the only values that can be represented exactly are those that fit precisely into binary scientific notation with a mantissa of a few (e.g. 24 or 53 or 64) binary digits, and a suitably small exponent.
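For instance, printing the stored value with extra digits makes the rounding visible (a minimal sketch, assuming IEEE-754 double):
#include <cstdio>
int main()
{
    double d = 0.1;
    // The double nearest to 0.1 is slightly larger than 0.1.
    std::printf("%.20f\n", d);  // prints 0.10000000000000000555
    return 0;
}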
Working with integers, floats, and doubles can be tricky. It depends on what your purpose is. If you only want to display the values in a nice format, then you can play with the C++ I/O manipulators: precision, showpoint, noshowpoint. If you are trying to do precise computing with numerical methods, you may have to use a library for exact representation. If you are multiplying lots of small and large numbers, you may have to resort to log transformations. Here is a small test:
float x=1.0000001;
cout << x << endl;
float y=9.9999999999999;
cout << "using default io format " << y/x << endl;
cout << showpoint << "using showpoint " << y/x << endl;
y=9.9999;
cout << "fewer 9 default C++ " << y/x << endl;
cout << showpoint << "fewer 9 showpoint" << y/x << endl;
1
using default io format 10
using showpoint 10.0000
fewer 9 default C++ 9.99990
fewer 9 showpoint9.99990
In the special case where you want to use a double (which may be the result of some complicated algorithm) to represent integer numbers, you have to figure out the proper conversion method. Once I had a situation where I wanted to use a single double value to store two types of values: -1, +1, or a value in (0, 1), to make my code more memory efficient (and faster; heavy memory use tends to reduce performance). It is a little tricky to distinguish between +1 and val < 1. In this case I knew that the values < 1 had a resolution of only 1/500, so I could safely use floor(val+0.000001) to get back the 1 value that I initially stored.
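Here is a minimal sketch of that trick with hypothetical values; it relies on the stated assumption that every value below 1 is a multiple of 1/500, so the 0.000001 offset cannot push a legitimate fraction across an integer boundary:
#include <cmath>
#include <cstdio>
int main()
{
    // Hypothetical packed values: exactly -1.0, exactly +1.0, or k/500.0 with 0 <= k < 500.
    double stored = 499.0 / 500.0;
    bool isPlusOne = std::floor(stored + 0.000001) == 1.0;
    std::printf("stored = %g, treated as +1? %s\n", stored, isPlusOne ? "yes" : "no");
    return 0;
}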