Doubles and Arithmetic

Doubles and Arithmetic - c++

I compile and run this code with MSVC2008
long double x = 111111111;
long double y = 222222222;
long double Z = x * y;
cout << z << endl;
When I debug, z equals
24691357975308640
Mathematically z should be
24691357975308642
What's going on ?

Doubles are only precise to around 16 digits. If I counted right, then you have 17 digits, and are correct up to 16. If you want to do this kind of math, and will only have integers, then use ints. For a number that large, you will need to use uint64_t.

Nothing is going on. Doubles have a finite amount of precision, and for that precision the value that you obtain is correct. It is an unfortunate shortcoming of the way you chose to print the value that information about the precision (i.e. the significant digits) was lost.
For example, for a 1+11+(1)+52 float (see here), we have 53 bits of precision, giving us 53 × log102 decimal digits of precision, i.e. 15. So we only print 15 digits:
#include <iomanip>
#include <iostream>
std::cout << std::setfill('0') << std::setprecision(15) << std::scientific
<< Z << std::endl;
The result is:
2.469135797530864e+16
Now we made the precision manifest, and the result is indeed correct at that precision.
If you don't like the magic 15 in the code, you should #include <limits> and use:
std::numeric_limits<decltype(Z)>::digits10

Floating point arithmetic is going on. This is a good read. Basically, computers can problems storing and dealing with floating point numbers, so you get these sorts of arithmetic errors.

Generally, one can write a book answering your question. Long story short - floating point arithmetic is going on. See Floating Point. Also, converting double values to ASCII (for displaying) is also hard and not precise. You may also want to look at arbitrary precision arithmetics.

Related

Is there a simple explanation to understand floating points representation and std::numeric_limits maxdigits() and digits10()? [duplicate]

I am confused about what max_digits10 represents. According to its documentation, it is 0 for all integral types. The formula for floating-point types for max_digits10 looks similar to int's digits10's.

To put it simple,
digits10 is the number of decimal digits guaranteed to survive text → float → text round-trip.
max_digits10 is the number of decimal digits needed to guarantee correct float → text → float round-trip.
There will be exceptions to both but these values give the minimum guarantee. Read the original proposal on max_digits10 for a clear example, Prof. W. Kahan's words and further details. Most C++ implementations follow IEEE 754 for their floating-point data types. For an IEEE 754 float, digits10 is 6 and max_digits10 is 9; for a double it is 15 and 17. Note that both these numbers should not be confused with the actual decimal precision of floating-point numbers.
Example digits10
char const *s1 = "8.589973e9";
char const *s2 = "0.100000001490116119384765625";
float const f1 = strtof(s1, nullptr);
float const f2 = strtof(s2, nullptr);
std::cout << "'" << s1 << "'" << '\t' << std::scientific << f1 << '\n';
std::cout << "'" << s2 << "'" << '\t' << std::fixed << std::setprecision(27) << f2 << '\n';
Prints
'8.589973e9' 8.589974e+009
'0.100000001490116119384765625' 0.100000001490116119384765625
All digits up to the 6th significant digit were preserved, while the 7th digit didn't survive for the first number. However, all 27 digits of the second survived; this is an exception. However, most numbers become different beyond 7 digits and all numbers would be the same within 6 digits.
In summary, digits10 gives the number of significant digits you can count on in a given float as being the same as the original real number in its decimal form from which it was created i.e. the digits that survived after the conversion into a float.
Example max_digits10
void f_s_f(float &f, int p) {
std::ostringstream oss;
oss << std::fixed << std::setprecision(p) << f;
f = strtof(oss.str().c_str(), nullptr);
}
float f3 = 3.145900f;
float f4 = std::nextafter(f3, 3.2f);
std::cout << std::hexfloat << std::showbase << f3 << '\t' << f4 << '\n';
f_s_f(f3, std::numeric_limits<float>::max_digits10);
f_s_f(f4, std::numeric_limits<float>::max_digits10);
std::cout << f3 << '\t' << f4 << '\n';
f_s_f(f3, 6);
f_s_f(f4, 6);
std::cout << f3 << '\t' << f4 << '\n';
Prints
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdap+1
Here two different floats, when printed with max_digits10 digits of precision, they give different strings and these strings when read back would give back the original floats they are from. When printed with lesser precision they give the same output due to rounding and hence when read back lead to the same float, when in reality they are from different values.
In summary, max_digits10 are at least required to disambiguate two floats in their decimal form, so that when converted back to a binary float, we get the original bits again and not of the one slightly before or after it due to rounding errors.

In my opinion, it is explained sufficiently at the linked site (and the site for digits10):
digits10 is the (max.) amount of "decimal" digits where numbers
can be represented by a type in any case, independent of their actual value.
A usual 4-byte unsigned integer as example: As everybody should know, it has exactly 32bit,
that is 32 digits of a binary number.
But in terms of decimal numbers?
Probably 9.
Because, it can store 100000000 as well as 999999999.
But if take numbers with 10 digits: 4000000000 can be stored, but 5000000000 not.
So, if we need a guarantee for minimum decimal digit capacity, it is 9.
And that is the result of digits10.
max_digits10 is only interesting for float/double... and gives the decimal digit count
which we need to output/save/process... to take the whole precision
the floating point type can offer.
Theoretical example: A variable with content 123.112233445566
If you show 123.11223344 to the user, it is not as precise as it can be.
If you show 123.1122334455660000000 to the user, it makes no sense because
you could omit the trailing zeros (because your variable can´t hold that much anyways)
Therefore, max_digits10 says how many digits precision you have available in a type.

Lets build some context
After going through lots of answers and reading stuff following is the simplest and layman answer i could reach upto for this.
Floating point numbers in computers (Single precision i.e float type in C/C++ etc. OR double precision i.e double in C/C++ etc.) have to be represented using fixed number of bits.
float is a 32-bit IEEE 754 single precision Floating Point Number – 1
bit for the sign, 8 bits for the exponent, and 23* for the value.
float has 7 decimal digits of precision.
And for double type
The C++ double should have a floating-point precision of up to 15
digits as it contains a precision that is twice the precision of the
float data type. When you declare a variable as double, you should
initialize it with a decimal value
What the heck above means to me?
Its possible that sometimes the floating point number which you have cannot fit into the number of bits available for that type. for eg. float value of 0.1 cannot FIT into available number of BITS in a computer. You may ask why. Try converting this value to binary and you will see that the binary representation is never ending and we have only finite number of bits so we need to stop at one point even though the binary conversion logic says keep going on.
If the given floating point number can be represented by the number of bits available, then we are good. If its not possible to represent the given floating point number in the available number of bits, then the bits are stored a value which is as close as possible to the actual value. This is also known as "Rounding the float value" OR "Rounding error". Now how this value is calculated depends of specific implementation but its safe to assume that given a specific implementation, the most closest value is chosen.
Now lets come to std::numeric_limits<T>::digits10
The value of std::numeric_limits::digits10 is the number of
base-10 digits that are necessary to uniquely represent all distinct
values of the type T, such as necessary for
serialization/deserialization to text. This constant is meaningful for
all floating-point types.
What this std::numeric_limits<T>::digits10 is saying is that whenever you fall into a scenario where rounding MUST happen then you can be assured that after given floating point value is rounded to its closest representable value by the computer, then its guarantied that the closest representable value's std::numeric_limits<T>::digits10 number of Decimal digits will be exactly same as your input floating point. For single precision floating point value this number is usually 6 and for double precision float value this number is usually 15.
Now you may ask why i used the word "guarantied". Well i used this because its possible that more number of digits may survive while conversion to float BUT if you ask me give me a guarantee that how many will survive in all the cases, then that number is std::numeric_limits<T>::digits10. Not convinced yet?
OK, consider example of unsigned char which has 8 bits of storage. When you convert a decimal value to unsigned char, then what's the guarantee that how many decimal digits will survive? I will say "2". Then you will say that even 145 will survive, so it should be 3. BUT i will say NO. Because if you take 256, then it won't survive. Of course 255 will survive, but since you are asking for guarantee so i can only guarantee that 2 digits will survive because answer 3 is not true if i am trying to use values higher than 255.
Now use the same analogy for floating number types when someone asks for a guarantee. That guarantee is given by std::numeric_limits<T>::digits10
Now what the heck is std::numeric_limits<T>::max_digits10
Here comes a bit of another level of complexity. BUT I will try to explain as simple as I can
As i mentioned previously that due to limited number of bits available to represent a floating type on a computer, its not possible to represent every float value exactly. Few can be represented exactly BUT not all values. Now lets consider a hypothetical situation. Someone asks you to write down all the possible float values which the computer can represent (ooohhh...i know what you are thinking). Luckily you don't have write all those :)
Just imagine that you started and reached the last float value which a computer can represent. The max float value which the computer can represent will have certain number of decimal digits. These are the number of decimal digits which std::numeric_limits<T>::max_digits10 tells us. BUT an actual explanation for std::numeric_limits<T>::max_digits10 is the maximum number of decimal digits you need to represent all possible representable values. Thats why i asked you to write all the value initially and you will see that you need maximum std::numeric_limits<T>::max_digits10 of decimal digits to write all representable values of type T.
Please note that this max float value is also the float value which can survive the text to float to text conversion but its number of decimal digits are NOT the guaranteed number of digits (remember the unsigned char example i gave where 3 digits of 255 doesn't mean all 3 digits values can be stored in unsigned char?)
Hope this attempt of mine gives people some understanding. I know i may have over simplified things BUT I have spent sleepless night thinking and reading stuff and this is the explanation which was able to give me some peace of mind.
Cheers !!!

How to add two large double precision numbers in c++

I have the following piece of code
#include <iostream>
#include <iomanip>
int main()
{
double x = 7033753.49999141693115234375;
double y = 7033753.499991415999829769134521484375;
double z = (x+ y)/2.0;
std::cout << "y is " << std::setprecision(40) << y << "\n";
std::cout << "x is " << std::setprecision(40) << x << "\n";
std::cout << "z is " << std::setprecision(40) << z << "\n";
return 0;
}
When the above code is run I get,
y is 7033753.499991415999829769134521484375
x is 7033753.49999141693115234375
z is 7033753.49999141693115234375
When I do the same in Wolfram Alpha the value of z is completely different
z = 7033753.4999914164654910564422607421875 #Wolfram answer
I am familiar with floating point precision and that large numbers away from zero can not be exactly represented. Is that what is happening here? Is there anyway in c++ where I can get the same answer as Wolfram without any performance penalty?

large numbers away from zero can not be exactly represented. Is that what is happening here?
Yes.
Note that there are also infinitely many rational numbers that cannot be represented near zero as well. But the distance between representable values does grow exponentially in larger value ranges.
Is there anyway in c++ where I can get the same answer as Wolfram ...
You can potentially get the same answer by using long double. My system produces exactly the same result as Wolfram. Note that precision of long double varies between systems even among systems that conform to IEEE 754 standard.
More generally though, if you need results that are accurate to many significant digits, then don't use finite precision math.
... without any performance penalty?
No. Precision comes with a cost.

Just telling IOStreams to print to 40 significant decimal figures of precision, doesn't mean that the value you're outputting actually has that much precision.
A typical double takes you up to 17 significant decimal figures (ish); beyond that, what you see is completely arbitrary.
Per eerorika's answer, it looks like the Wolfram Alpha answer is also falling foul of this, albeit possibly with some different precision limit than yours.
You can try a different approach like a "bignum" library, or limit yourself to the precision afforded by the types that you've chosen.

printing the largest single-precision floating point number

I have come seeking knowledge.
I am trying to understand floating point numbers.
I am trying to figure out why, when I print the largest floating point number, it does not print correctly.
2-(2^-23) Exponent Bits
1.99999988079071044921875 * (1.7014118346046923173168730371588e+38) =
3.4028234663852885981170418348451e+38
This should be the largest single-precision floating point number:
340282346638528859811704183484510000000.0
So,
float i = 340282346638528859811704183484510000000.0;
printf(TEXT, "Float %.38f", i);
Output: 340282346638528860000000000000000000000.0
Obviously the number is being rounded up, so I am trying to figure out just exactly what is going on.
My questions are:
The Wikipedia documentation states that 3.4028234663852885981170418348451e+38 is the largest number that can be represented in IEEE-754 fixed point.
Is the number stored in the floating point register = 0 11111111 11111111111111111111111 and it is just not being displayed incorrectly?
If I write printf(TEXT, "Float %.38f", FLT_MAX);, I get the same answer.
Perhaps the computer I am using does not use IEEE-754?
I understand errors with calculations, but I don't understand why the number
340282346638528860000000000000000000000.0 is the largest floating point number that can be accurately represented.
Maybe the Mantissa * Exponent is causing calculation errors? If that is true, then 340282346638528860000000000000000000000.0 would be the largest number that can be faithfully represented without calculation errors. I guess that would make sense. Just need a blessing.
Thanks,

Looks like culprit is printf() (I guess because float is implicitly converted to double when passed to it):
#include <iostream>
#include <limits>
int main()
{
std::cout.precision( 38 );
std::cout << std::numeric_limits<float>::max() << std::endl;
}
Output is:
3.4028234663852885981170418348451692544e+38

With float as binary32, the largest finite float is
340282346638528859811704183484516925440.0
printf("%.1f", FLT_MAX) is not obliged to print exactly to 38+ significant digits, so seeing output like the below is not unexpected.
340282346638528860000000000000000000000.0
printf() will print floating point accurately to DECIMAL_DIG significant digits. DECIMAL_DIG is at least 10. If more than DECIMAL_DIG significance is directed, a compliant printf() may round the result at some point. C11dr §7.21.6.1 6 goes into detail.

What is the purpose of max_digits10 and how is it different from digits10?

I am confused about what max_digits10 represents. According to its documentation, it is 0 for all integral types. The formula for floating-point types for max_digits10 looks similar to int's digits10's.

To put it simple,
digits10 is the number of decimal digits guaranteed to survive text → float → text round-trip.
max_digits10 is the number of decimal digits needed to guarantee correct float → text → float round-trip.
There will be exceptions to both but these values give the minimum guarantee. Read the original proposal on max_digits10 for a clear example, Prof. W. Kahan's words and further details. Most C++ implementations follow IEEE 754 for their floating-point data types. For an IEEE 754 float, digits10 is 6 and max_digits10 is 9; for a double it is 15 and 17. Note that both these numbers should not be confused with the actual decimal precision of floating-point numbers.
Example digits10
char const *s1 = "8.589973e9";
char const *s2 = "0.100000001490116119384765625";
float const f1 = strtof(s1, nullptr);
float const f2 = strtof(s2, nullptr);
std::cout << "'" << s1 << "'" << '\t' << std::scientific << f1 << '\n';
std::cout << "'" << s2 << "'" << '\t' << std::fixed << std::setprecision(27) << f2 << '\n';
Prints
'8.589973e9' 8.589974e+009
'0.100000001490116119384765625' 0.100000001490116119384765625
All digits up to the 6th significant digit were preserved, while the 7th digit didn't survive for the first number. However, all 27 digits of the second survived; this is an exception. However, most numbers become different beyond 7 digits and all numbers would be the same within 6 digits.
In summary, digits10 gives the number of significant digits you can count on in a given float as being the same as the original real number in its decimal form from which it was created i.e. the digits that survived after the conversion into a float.
Example max_digits10
void f_s_f(float &f, int p) {
std::ostringstream oss;
oss << std::fixed << std::setprecision(p) << f;
f = strtof(oss.str().c_str(), nullptr);
}
float f3 = 3.145900f;
float f4 = std::nextafter(f3, 3.2f);
std::cout << std::hexfloat << std::showbase << f3 << '\t' << f4 << '\n';
f_s_f(f3, std::numeric_limits<float>::max_digits10);
f_s_f(f4, std::numeric_limits<float>::max_digits10);
std::cout << f3 << '\t' << f4 << '\n';
f_s_f(f3, 6);
f_s_f(f4, 6);
std::cout << f3 << '\t' << f4 << '\n';
Prints
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdap+1
Here two different floats, when printed with max_digits10 digits of precision, they give different strings and these strings when read back would give back the original floats they are from. When printed with lesser precision they give the same output due to rounding and hence when read back lead to the same float, when in reality they are from different values.
In summary, max_digits10 are at least required to disambiguate two floats in their decimal form, so that when converted back to a binary float, we get the original bits again and not of the one slightly before or after it due to rounding errors.

In my opinion, it is explained sufficiently at the linked site (and the site for digits10):
digits10 is the (max.) amount of "decimal" digits where numbers
can be represented by a type in any case, independent of their actual value.
A usual 4-byte unsigned integer as example: As everybody should know, it has exactly 32bit,
that is 32 digits of a binary number.
But in terms of decimal numbers?
Probably 9.
Because, it can store 100000000 as well as 999999999.
But if take numbers with 10 digits: 4000000000 can be stored, but 5000000000 not.
So, if we need a guarantee for minimum decimal digit capacity, it is 9.
And that is the result of digits10.
max_digits10 is only interesting for float/double... and gives the decimal digit count
which we need to output/save/process... to take the whole precision
the floating point type can offer.
Theoretical example: A variable with content 123.112233445566
If you show 123.11223344 to the user, it is not as precise as it can be.
If you show 123.1122334455660000000 to the user, it makes no sense because
you could omit the trailing zeros (because your variable can´t hold that much anyways)
Therefore, max_digits10 says how many digits precision you have available in a type.

gnu c++ floating number precision

I have simple question about floating number,
double temp;
std::cout.precision(std::numeric_limits<double>::digits10);
temp = 12345678901234567890.1234567890;
std::cout << (temp < std::numeric_limits<double>::max()) << std::endl;
std::cout << std::fixed << std::endl;
std::cout << temp << std::endl;
However, the output I get is this,
1
12345678901234567168.000000000000000
The value of temp is still within the range of double, however, the value is completely different. I am wondering what have I done wrong here?
Thanks.

A double has only 15.95 decimal digits of precision. You've already exceeded this number of digits in the integer part of the value, hence the loss of precision in the last few digits, and the lack of any useful digits after the decimal point.
You should probably take a look at this: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html before doing any more work with floating point values.

It's not completely different. It's correct to 16 digits or so. That's about what you can expect from a double.

A double can only store a limited amount of precision. It works out to about 15 decimal digits.
Here's a helpful article on how floating point numbers are represented, and the implications of that representation: Float

IEEE 754 is not precise for any given value - for example http://www.cprogramming.com/tutorial/floating_point/understanding_floating_point.html and http://support.microsoft.com/kb/42980
-358974.27 can't be represented on float according to http://ridiculousfish.com/blog/posts/float.html and I remember (though I'm too lazy to test it) that even something "simple" like 2.2 or 2.3 can't be accurately represented even as a double.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js