Printing double without losing precision - c++

How do you print a double to a stream so that when it is read in you don't lose precision?
I tried:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::setprecision(std::numeric_limits<T>::digits10) << v << " ";
double u;
ss >> u;
std::cout << "precision " << ((u == v) ? "retained" : "lost") << std::endl;
This did not work as I expected.
But I can increase precision (which surprised me as I thought that digits10 was the maximum required).
ss << std::setprecision(std::numeric_limits<T>::digits10 + 2) << v << " ";
// ^^^^^^ +2
It has to do with the number of significant digits and the first two don't count in (0.01).
So has anybody looked at representing floating point numbers exactly?
What is the exact magical incantation on the stream I need to do?
After some experimentation:
The trouble was with my original version. There were non-significant digits in the string after the decimal point that affected the accuracy.
So to compensate for this we can use scientific notation to compensate:
ss << std::scientific
<< std::setprecision(std::numeric_limits<double>::digits10 + 1)
<< v;
This still does not explain the need for the +1 though.
Also if I print out the number with more precision I get more precision printed out!
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10 + 1) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits) << v << "\n";
It results in:
1.000000000000000e-02
1.0000000000000002e-02
1.00000000000000019428902930940239457413554200000000000e-02
Based on #Stephen Canon answer below:
We can print out exactly by using the printf() formatter, "%a" or "%A". To achieve this in C++ we need to use the fixed and scientific manipulators (see n3225: 22.4.2.2.2p5 Table 88)
std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
std::cout << v;
For now I have defined:
template<typename T>
std::ostream& precise(std::ostream& stream)
{
std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
return stream;
}
std::ostream& preciselngd(std::ostream& stream){ return precise<long double>(stream);}
std::ostream& precisedbl(std::ostream& stream) { return precise<double>(stream);}
std::ostream& preciseflt(std::ostream& stream) { return precise<float>(stream);}
Next: How do we handle NaN/Inf?

It's not correct to say "floating point is inaccurate", although I admit that's a useful simplification. If we used base 8 or 16 in real life then people around here would be saying "base 10 decimal fraction packages are inaccurate, why did anyone ever cook those up?".
The problem is that integral values translate exactly from one base into another, but fractional values do not, because they represent fractions of the integral step and only a few of them are used.
Floating point arithmetic is technically perfectly accurate. Every calculation has one and only one possible result. There is a problem, and it is that most decimal fractions have base-2 representations that repeat. In fact, in the sequence 0.01, 0.02, ... 0.99, only a mere 3 values have exact binary representations. (0.25, 0.50, and 0.75.) There are 96 values that repeat and therefore are obviously not represented exactly.
Now, there are a number of ways to write and read back floating point numbers without losing a single bit. The idea is to avoid trying to express the binary number with a base 10 fraction.
Write them as binary. These days, everyone implements the IEEE-754 format so as long as you choose a byte order and write or read only that byte order, then the numbers will be portable.
Write them as 64-bit integer values. Here you can use the usual base 10. (Because you are representing the 64-bit aliased integer, not the 52-bit fraction.)
You can also just write more decimal fraction digits. Whether this is bit-for-bit accurate will depend on the quality of the conversion libraries and I'm not sure I would count on perfect accuracy (from the software) here. But any errors will be exceedingly small and your original data certainly has no information in the low bits. (None of the constants of physics and chemistry are known to 52 bits, nor has any distance on earth ever been measured to 52 bits of precision.) But for a backup or restore where bit-for-bit accuracy might be compared automatically, this obviously isn't ideal.

Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.
Use hexadecimal floating point instead. In C:
printf("%a\n", yourNumber);
C++0x provides the hexfloat manipulator for iostreams that does the same thing (on some platforms, using the std::hex modifier has the same result, but this is not a portable assumption).
Using hex floating point is preferred for several reasons.
First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.

I got interested in this question because I'm trying to (de)serialize my data to & from JSON.
I think I have a clearer explanation (with less hand waiving) for why 17 decimal digits are sufficient to reconstruct the original number losslessly:
Imagine 3 number lines:
1. for the original base 2 number
2. for the rounded base 10 representation
3. for the reconstructed number (same as #1 because both in base 2)
When you convert to base 10, graphically, you choose the tic on the 2nd number line closest to the tic on the 1st. Likewise when you reconstruct the original from the rounded base 10 value.
The critical observation I had was that in order to allow exact reconstruction, the base 10 step size (quantum) has to be < the base 2 quantum. Otherwise, you inevitably get the bad reconstruction shown in red.
Take the specific case of when the exponent is 0 for the base2 representation. Then the base2 quantum will be 2^-52 ~= 2.22 * 10^-16. The closest base 10 quantum that's less than this is 10^-16. Now that we know the required base 10 quantum, how many digits will be needed to encode all possible values? Given that we're only considering the case of exponent = 0, the dynamic range of values we need to represent is [1.0, 2.0). Therefore, 17 digits would be required (16 digits for fraction and 1 digit for integer part).
For exponents other than 0, we can use the same logic:
exponent base2 quant. base10 quant. dynamic range digits needed
---------------------------------------------------------------------
1 2^-51 10^-16 [2, 4) 17
2 2^-50 10^-16 [4, 8) 17
3 2^-49 10^-15 [8, 16) 17
...
32 2^-20 10^-7 [2^32, 2^33) 17
1022 9.98e291 1.0e291 [4.49e307,8.99e307) 17
While not exhaustive, the table shows the trend that 17 digits are sufficient.
Hope you like my explanation.

In C++20 you'll be able to use std::format to do this:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::format("{}", v);
double u;
ss >> u;
assert(v == u);
The default floating-point format is the shortest decimal representation with a round-trip guarantee. The advantage of this method compared to using the precision of max_digits10 (not digits10 which is not suitable for round trip through decimal) from std::numeric_limits is that it doesn't print unnecessary digits.
In the meantime you can use the {fmt} library, std::format is based on. For example (godbolt):
fmt::print("{}", 0.1 * 0.1);
Output (assuming IEEE754 double):
0.010000000000000002
{fmt} uses the Dragonbox algorithm for fast binary floating point to decimal conversion. In addition to giving the shortest representation it is 20-30x faster than common standard library implementations of printf and iostreams.
Disclaimer: I'm the author of {fmt} and C++20 std::format.

A double has the precision of 52 binary digits or 15.95 decimal digits. See http://en.wikipedia.org/wiki/IEEE_754-2008. You need at least 16 decimal digits to record the full precision of a double in all cases. [But see fourth edit, below].
By the way, this means significant digits.
Answer to OP edits:
Your floating point to decimal string runtime is outputing way more digits than are significant. A double can only hold 52 bits of significand (actually, 53, if you count a "hidden" 1 that is not stored). That means the the resolution is not more than 2 ^ -53 = 1.11e-16.
For example: 1 + 2 ^ -52 = 1.0000000000000002220446049250313 . . . .
Those decimal digits, .0000000000000002220446049250313 . . . . are the smallest binary "step" in a double when converted to decimal.
The "step" inside the double is:
.0000000000000000000000000000000000000000000000000001 in binary.
Note that the binary step is exact, while the decimal step is inexact.
Hence the decimal representation above,
1.0000000000000002220446049250313 . . .
is an inexact representation of the exact binary number:
1.0000000000000000000000000000000000000000000000000001.
Third Edit:
The next possible value for a double, which in exact binary is:
1.0000000000000000000000000000000000000000000000000010
converts inexactly in decimal to
1.0000000000000004440892098500626 . . . .
So all of those extra digits in the decimal are not really significant, they are just base conversion artifacts.
Fourth Edit:
Though a double stores at most 16 significant decimal digits, sometimes 17 decimal digits are necessary to represent the number. The reason has to do with digit slicing.
As I mentioned above, there are 52 + 1 binary digits in the double. The "+ 1" is an assumed leading 1, and is neither stored nor significant. In the case of an integer, those 52 binary digits form a number between 0 and 2^53 - 1. How many decimal digits are necessary to store such a number? Well, log_10 (2^53 - 1) is about 15.95. So at most 16 decimal digits are necessary. Let's label these d_0 to d_15.
Now consider that IEEE floating point numbers also have an binary exponent. What happens when we increment the exponet by, say, 2? We have multiplied our 52-bit number, whatever it was, by 4. Now, instead of our 52 binary digits aligning perfectly with our decimal digits d_0 to d_15, we have some significant binary digits represented in d_16. However, since we multiplied by something less than 10, we still have significant binary digits represented in d_0. So our 15.95 decimal digits now occuply d_1 to d_15, plus some upper bits of d_0 and some lower bits of d_16. This is why 17 decimal digits are sometimes needed to represent a IEEE double.
Fifth Edit
Fixed numerical errors

The easiest way (for IEEE 754 double) to guarantee a round-trip conversion is to always use 17 significant digits. But that has the disadvantage of sometimes including unnecessary noise digits (0.1 → "0.10000000000000001").
An approach that's worked for me is to sprintf the number with 15 digits of precision, then check if atof gives you back the original value. If it doesn't, try 16 digits. If that doesn't work, use 17.
You might want to try David Gay's algorithm (used in Python 3.1 to implement float.__repr__).

Thanks to ThomasMcLeod for pointing out the error in my table computation
To guarantee round-trip conversion using 15 or 16 or 17 digits is only possible for a comparatively few cases. The number 15.95 comes from taking 2^53 (1 implicit bit + 52 bits in the significand/"mantissa") which comes out to an integer in the range 10^15 to 10^16 (closer to 10^16).
Consider a double precision value x with an exponent of 0, i.e. it falls into the floating point range range 1.0 <= x < 2.0. The implicit bit will mark the 2^0 component (part) of x. The highest explicit bit of the significand will denote the next lower exponent (from 0) <=> -1 => 2^-1 or the 0.5 component.
The next bit 0.25, the ones after 0.125, 0.0625, 0.03125, 0.015625 and so on (see table below). The value 1.5 will thus be represented by two components added together: the implicit bit denoting 1.0 and the highest explicit significand bit denoting 0.5.
This illustrates that from the implicit bit downward you have 52 additional, explicit bits to represent possible components where the smallest is 0 (exponent) - 52 (explicit bits in significand) = -52 => 2^-52 which according to the table below is ... well you can see for yourselves that it comes out to quite a bit more than 15.95 significant digits (37 to be exact). To put it another way the smallest number in the 2^0 range that is != 1.0 itself is 2^0+2^-52 which is 1.0 + the number next to 2^-52 (below) = (exactly) 1.0000000000000002220446049250313080847263336181640625, a value which I count as being 53 significant digits long. With 17 digit formatting "precision" the number will display as 1.0000000000000002 and this would depend on the library converting correctly.
So maybe "round-trip conversion in 17 digits" is not really a concept that is valid (enough).
2^ -1 = 0.5000000000000000000000000000000000000000000000000000
2^ -2 = 0.2500000000000000000000000000000000000000000000000000
2^ -3 = 0.1250000000000000000000000000000000000000000000000000
2^ -4 = 0.0625000000000000000000000000000000000000000000000000
2^ -5 = 0.0312500000000000000000000000000000000000000000000000
2^ -6 = 0.0156250000000000000000000000000000000000000000000000
2^ -7 = 0.0078125000000000000000000000000000000000000000000000
2^ -8 = 0.0039062500000000000000000000000000000000000000000000
2^ -9 = 0.0019531250000000000000000000000000000000000000000000
2^-10 = 0.0009765625000000000000000000000000000000000000000000
2^-11 = 0.0004882812500000000000000000000000000000000000000000
2^-12 = 0.0002441406250000000000000000000000000000000000000000
2^-13 = 0.0001220703125000000000000000000000000000000000000000
2^-14 = 0.0000610351562500000000000000000000000000000000000000
2^-15 = 0.0000305175781250000000000000000000000000000000000000
2^-16 = 0.0000152587890625000000000000000000000000000000000000
2^-17 = 0.0000076293945312500000000000000000000000000000000000
2^-18 = 0.0000038146972656250000000000000000000000000000000000
2^-19 = 0.0000019073486328125000000000000000000000000000000000
2^-20 = 0.0000009536743164062500000000000000000000000000000000
2^-21 = 0.0000004768371582031250000000000000000000000000000000
2^-22 = 0.0000002384185791015625000000000000000000000000000000
2^-23 = 0.0000001192092895507812500000000000000000000000000000
2^-24 = 0.0000000596046447753906250000000000000000000000000000
2^-25 = 0.0000000298023223876953125000000000000000000000000000
2^-26 = 0.0000000149011611938476562500000000000000000000000000
2^-27 = 0.0000000074505805969238281250000000000000000000000000
2^-28 = 0.0000000037252902984619140625000000000000000000000000
2^-29 = 0.0000000018626451492309570312500000000000000000000000
2^-30 = 0.0000000009313225746154785156250000000000000000000000
2^-31 = 0.0000000004656612873077392578125000000000000000000000
2^-32 = 0.0000000002328306436538696289062500000000000000000000
2^-33 = 0.0000000001164153218269348144531250000000000000000000
2^-34 = 0.0000000000582076609134674072265625000000000000000000
2^-35 = 0.0000000000291038304567337036132812500000000000000000
2^-36 = 0.0000000000145519152283668518066406250000000000000000
2^-37 = 0.0000000000072759576141834259033203125000000000000000
2^-38 = 0.0000000000036379788070917129516601562500000000000000
2^-39 = 0.0000000000018189894035458564758300781250000000000000
2^-40 = 0.0000000000009094947017729282379150390625000000000000
2^-41 = 0.0000000000004547473508864641189575195312500000000000
2^-42 = 0.0000000000002273736754432320594787597656250000000000
2^-43 = 0.0000000000001136868377216160297393798828125000000000
2^-44 = 0.0000000000000568434188608080148696899414062500000000
2^-45 = 0.0000000000000284217094304040074348449707031250000000
2^-46 = 0.0000000000000142108547152020037174224853515625000000
2^-47 = 0.0000000000000071054273576010018587112426757812500000
2^-48 = 0.0000000000000035527136788005009293556213378906250000
2^-49 = 0.0000000000000017763568394002504646778106689453125000
2^-50 = 0.0000000000000008881784197001252323389053344726562500
2^-51 = 0.0000000000000004440892098500626161694526672363281250
2^-52 = 0.0000000000000002220446049250313080847263336181640625

#ThomasMcLeod: I think the significant digit rule comes from my field, physics, and means something more subtle:
If you have a measurement that gets you the value 1.52 and you cannot read any more detail off the scale, and say you are supposed to add another number (for example of another measurement because this one's scale was too small) to it, say 2, then the result (obviously) has only 2 decimal places, i.e. 3.52.
But likewise, if you add 1.1111111111 to the value 1.52, you get the value 2.63 (and nothing more!).
The reason for the rule is to prevent you from kidding yourself into thinking you got more information out of a calculation than you put in by the measurement (which is impossible, but would seem that way by filling it with garbage, see above).
That said, this specific rule is for addition only (for addition: the error of the result is the sum of the two errors - so if you measure just one badly, though luck, there goes your precision...).
How to get the other rules:
Let's say a is the measured number and δa the error. Let's say your original formula was:
f:=m a
Let's say you also measure m with error δm (let that be the positive side).
Then the actual limit is:
f_up=(m+δm) (a+δa)
and
f_down=(m-δm) (a-δa)
So,
f_up =m a+δm δa+(δm a+m δa)
f_down=m a+δm δa-(δm a+m δa)
Hence, now the significant digits are even less:
f_up ~m a+(δm a+m δa)
f_down~m a-(δm a+m δa)
and so
δf=δm a+m δa
If you look at the relative error, you get:
δf/f=δm/m+δa/a
For division it is
δf/f=δm/m-δa/a
Hope that gets the gist across and hope I didn't make too many mistakes, it's late here :-)
tl,dr: Significant digits mean how many of the digits in the output actually come from the digits in your input (in the real world, not the distorted picture that floating point numbers have).
If your measurements were 1 with "no" error and 3 with "no" error and the function is supposed to be 1/3, then yes, all infinite digits are actual significant digits. Otherwise, the inverse operation would not work, so obviously they have to be.
If significant digit rule means something completely different in another field, carry on :-)

Related

Is there a simple explanation to understand floating points representation and std::numeric_limits maxdigits() and digits10()? [duplicate]

I am confused about what max_digits10 represents. According to its documentation, it is 0 for all integral types. The formula for floating-point types for max_digits10 looks similar to int's digits10's.
To put it simple,
digits10 is the number of decimal digits guaranteed to survive text → float → text round-trip.
max_digits10 is the number of decimal digits needed to guarantee correct float → text → float round-trip.
There will be exceptions to both but these values give the minimum guarantee. Read the original proposal on max_digits10 for a clear example, Prof. W. Kahan's words and further details. Most C++ implementations follow IEEE 754 for their floating-point data types. For an IEEE 754 float, digits10 is 6 and max_digits10 is 9; for a double it is 15 and 17. Note that both these numbers should not be confused with the actual decimal precision of floating-point numbers.
Example digits10
char const *s1 = "8.589973e9";
char const *s2 = "0.100000001490116119384765625";
float const f1 = strtof(s1, nullptr);
float const f2 = strtof(s2, nullptr);
std::cout << "'" << s1 << "'" << '\t' << std::scientific << f1 << '\n';
std::cout << "'" << s2 << "'" << '\t' << std::fixed << std::setprecision(27) << f2 << '\n';
Prints
'8.589973e9' 8.589974e+009
'0.100000001490116119384765625' 0.100000001490116119384765625
All digits up to the 6th significant digit were preserved, while the 7th digit didn't survive for the first number. However, all 27 digits of the second survived; this is an exception. However, most numbers become different beyond 7 digits and all numbers would be the same within 6 digits.
In summary, digits10 gives the number of significant digits you can count on in a given float as being the same as the original real number in its decimal form from which it was created i.e. the digits that survived after the conversion into a float.
Example max_digits10
void f_s_f(float &f, int p) {
std::ostringstream oss;
oss << std::fixed << std::setprecision(p) << f;
f = strtof(oss.str().c_str(), nullptr);
}
float f3 = 3.145900f;
float f4 = std::nextafter(f3, 3.2f);
std::cout << std::hexfloat << std::showbase << f3 << '\t' << f4 << '\n';
f_s_f(f3, std::numeric_limits<float>::max_digits10);
f_s_f(f4, std::numeric_limits<float>::max_digits10);
std::cout << f3 << '\t' << f4 << '\n';
f_s_f(f3, 6);
f_s_f(f4, 6);
std::cout << f3 << '\t' << f4 << '\n';
Prints
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdap+1
Here two different floats, when printed with max_digits10 digits of precision, they give different strings and these strings when read back would give back the original floats they are from. When printed with lesser precision they give the same output due to rounding and hence when read back lead to the same float, when in reality they are from different values.
In summary, max_digits10 are at least required to disambiguate two floats in their decimal form, so that when converted back to a binary float, we get the original bits again and not of the one slightly before or after it due to rounding errors.
In my opinion, it is explained sufficiently at the linked site (and the site for digits10):
digits10 is the (max.) amount of "decimal" digits where numbers
can be represented by a type in any case, independent of their actual value.
A usual 4-byte unsigned integer as example: As everybody should know, it has exactly 32bit,
that is 32 digits of a binary number.
But in terms of decimal numbers?
Probably 9.
Because, it can store 100000000 as well as 999999999.
But if take numbers with 10 digits: 4000000000 can be stored, but 5000000000 not.
So, if we need a guarantee for minimum decimal digit capacity, it is 9.
And that is the result of digits10.
max_digits10 is only interesting for float/double... and gives the decimal digit count
which we need to output/save/process... to take the whole precision
the floating point type can offer.
Theoretical example: A variable with content 123.112233445566
If you show 123.11223344 to the user, it is not as precise as it can be.
If you show 123.1122334455660000000 to the user, it makes no sense because
you could omit the trailing zeros (because your variable can´t hold that much anyways)
Therefore, max_digits10 says how many digits precision you have available in a type.
Lets build some context
After going through lots of answers and reading stuff following is the simplest and layman answer i could reach upto for this.
Floating point numbers in computers (Single precision i.e float type in C/C++ etc. OR double precision i.e double in C/C++ etc.) have to be represented using fixed number of bits.
float is a 32-bit IEEE 754 single precision Floating Point Number – 1
bit for the sign, 8 bits for the exponent, and 23* for the value.
float has 7 decimal digits of precision.
And for double type
The C++ double should have a floating-point precision of up to 15
digits as it contains a precision that is twice the precision of the
float data type. When you declare a variable as double, you should
initialize it with a decimal value
What the heck above means to me?
Its possible that sometimes the floating point number which you have cannot fit into the number of bits available for that type. for eg. float value of 0.1 cannot FIT into available number of BITS in a computer. You may ask why. Try converting this value to binary and you will see that the binary representation is never ending and we have only finite number of bits so we need to stop at one point even though the binary conversion logic says keep going on.
If the given floating point number can be represented by the number of bits available, then we are good. If its not possible to represent the given floating point number in the available number of bits, then the bits are stored a value which is as close as possible to the actual value. This is also known as "Rounding the float value" OR "Rounding error". Now how this value is calculated depends of specific implementation but its safe to assume that given a specific implementation, the most closest value is chosen.
Now lets come to std::numeric_limits<T>::digits10
The value of std::numeric_limits::digits10 is the number of
base-10 digits that are necessary to uniquely represent all distinct
values of the type T, such as necessary for
serialization/deserialization to text. This constant is meaningful for
all floating-point types.
What this std::numeric_limits<T>::digits10 is saying is that whenever you fall into a scenario where rounding MUST happen then you can be assured that after given floating point value is rounded to its closest representable value by the computer, then its guarantied that the closest representable value's std::numeric_limits<T>::digits10 number of Decimal digits will be exactly same as your input floating point. For single precision floating point value this number is usually 6 and for double precision float value this number is usually 15.
Now you may ask why i used the word "guarantied". Well i used this because its possible that more number of digits may survive while conversion to float BUT if you ask me give me a guarantee that how many will survive in all the cases, then that number is std::numeric_limits<T>::digits10. Not convinced yet?
OK, consider example of unsigned char which has 8 bits of storage. When you convert a decimal value to unsigned char, then what's the guarantee that how many decimal digits will survive? I will say "2". Then you will say that even 145 will survive, so it should be 3. BUT i will say NO. Because if you take 256, then it won't survive. Of course 255 will survive, but since you are asking for guarantee so i can only guarantee that 2 digits will survive because answer 3 is not true if i am trying to use values higher than 255.
Now use the same analogy for floating number types when someone asks for a guarantee. That guarantee is given by std::numeric_limits<T>::digits10
Now what the heck is std::numeric_limits<T>::max_digits10
Here comes a bit of another level of complexity. BUT I will try to explain as simple as I can
As i mentioned previously that due to limited number of bits available to represent a floating type on a computer, its not possible to represent every float value exactly. Few can be represented exactly BUT not all values. Now lets consider a hypothetical situation. Someone asks you to write down all the possible float values which the computer can represent (ooohhh...i know what you are thinking). Luckily you don't have write all those :)
Just imagine that you started and reached the last float value which a computer can represent. The max float value which the computer can represent will have certain number of decimal digits. These are the number of decimal digits which std::numeric_limits<T>::max_digits10 tells us. BUT an actual explanation for std::numeric_limits<T>::max_digits10 is the maximum number of decimal digits you need to represent all possible representable values. Thats why i asked you to write all the value initially and you will see that you need maximum std::numeric_limits<T>::max_digits10 of decimal digits to write all representable values of type T.
Please note that this max float value is also the float value which can survive the text to float to text conversion but its number of decimal digits are NOT the guaranteed number of digits (remember the unsigned char example i gave where 3 digits of 255 doesn't mean all 3 digits values can be stored in unsigned char?)
Hope this attempt of mine gives people some understanding. I know i may have over simplified things BUT I have spent sleepless night thinking and reading stuff and this is the explanation which was able to give me some peace of mind.
Cheers !!!

Can you help me to understand what "significant digits" means in floating point math?

For what I'm learning, once I convert a floating point value to a decimal one, the "significant digits" I need are a fixed number (17 for double, for example). 17 totals: before and after decimal separator.
So for example this code:
typedef std::numeric_limits<double> dbl;
int main()
{
std::cout.precision(dbl::max_digits10);
//std::cout << std::fixed;
double value1 = 1.2345678912345678912345;
double value2 = 123.45678912345678912345;
double value3 = 123456789123.45678912345;
std::cout << value1 << std::endl;
std::cout << value2 << std::endl;
std::cout << value3 << std::endl;
}
will correctly "show me" 17 values:
1.2345678912345679
123.45678912345679
123456789123.45679
But if I increase precision for the cout (i.e. std::cout.precision(100)), I can see there are other numbers after the 17 range:
1.2345678912345678934769921397673897445201873779296875
123.456789123456786683163954876363277435302734375
123456789123.456787109375
Why should ignore them? They are stored within the variables/double as well, so they will affect the whole "math" later (division, multiplication, sum, and so on).
What does it means "significant digits"? There is other...
Can you help me to understand what “significant digits” means in floating point math?
With FP numbers, like mathematical real numbers, significant digits is the leading digits of a value that do not begin with 0 and then, depending on context, to 1) the decimal point, 2) the last non-zero digit, or 3) the last printed digit.
123. // 3 significant decimal digits
123.125 // 6 significant decimal digits
0.0078125 // 5 significant decimal digits
0x0.00123p45 // 3 significant hexadecimal digits
123000.0 // 3, 6, or 7 significant decimal digits depending on context
When concerned about decimal significant digits and FP types like double. the issue is often "How many decimal significant digits are needed or of concern?"
Nearly all C FP implementations use a binary encoding such that all finite FP are exact sums of power of 2. Each finite FP is exact. Common encoding affords most double to have 53 binary digits is it significand - so 53 significant binary digits. How this appears as a decimal is often the source of confusion.
// Example 0.1 is not an exact sum of powers of 2 so a nearby value is used.
double x = 0.1;
// x takes on the exact value of
// 0.1000000000000000055511151231257827021181583404541015625
// aka 0x1.999999999999ap-4
// aka base2: 0.000110011001100110011001100110011001100110011001100110011010
// The preceding and subsequent doubles
// 0.09999999999999999167332731531132594682276248931884765625
// 0.10000000000000001942890293094023945741355419158935546875
// 123456789012345678901234567890123456789012345678901234567890
Looking at above, one could say x has over 50 decimal significant digits. Yet the value matches the intended 0.1 to 16 decimal significant digits. Or yet since the preceding and subsequent possible double values differ in the 17 place, one could say x has 17 decimal significant digits.
What does it means "significant digits"?
Various meanings of significant digits exist, but for C, 2 common ones are:
The number of decimal significant digits that a textual value to double converts as expected for all double. This is typically 15. C specifies this as DBL_DIG and must be at least 10.
The number of decimal significant digits that a textual value of double needs to be printed to distinguish from another double. This is typically 17. C specifies this as DBL_DECIMAL_DIG and must be at least 10.
Why should ignore them?
It depends of coding goals. Rarely are all digits of the exact value needed. (DBL_TRUE_MIN might have 752 od them.) For most applications, DBL_DECIMAL_DIG is enough. In select apps, DBL_DIG will do. So usually, ignoring digits past 17 does not cause problems.
Keep in mind that floating-point values are not real numbers. There are gaps between the values, and all those extra digits, while meaningful for real numbers, don’t reflect any difference in the floating-point value. When you convert a floating-point value to text, having std::numeric_limits<...>::max_digits10 digits ensures that you can convert the text string back to floating-point and get the original value. The extra digits don’t affect the result.
The extra digits that you see when you ask for more digits are the result of the conversion algorithm trying to do what you asked. The algorithm typically just keeps extracting digits until it reaches the desired precision; it could be written to start outputting zeros after it’s written max_digits10 digits, but that’s an additional complication that nobody bothers with. It wouldn’t really be helpful.
just to add to Pete Becker's answer, I think you're confusing the problem of finding the exact decimal representation of a binary mantissa, with the problem of finding some decimal representation uniquely representing that binary mantissa ( given some fixed rounding scheme ).
Now, regarding the first problem, you always need a finite number of decimal digits to exactly represent a binary mantissa ( because 2 divides 10 ).
For example, you need 18 decimal digits to exactly represent the binary 1.0000000000000001, being 1.00000762939453125 in decimal.
but you need just 17 digits to represent it uniquely as 1.0000076293945312 because no other number having exact value 1.0000076293945312xyz... where 0<=x<5 can exist as a double ( more precisely, the next and prior exactly representable values being 1.0000076293945314720446049250313080847263336181640625 and 1.0000076293945310279553950749686919152736663818359375 ).
Of course, this does not mean that given some decimal number you can ignore all digits past the 17th; it just means that if you apply the same rounding scheme used to produce the decimal at the 17th position and assign it back to a double you'll get the same original double.

How can I avoid this float number rounding issue in C++?

With below code, I get result "4.31 43099".
double f = atof("4.31");
long ff = f * 10000L;
std::cout << f << ' ' << ff << '\n';
If I change "double f" to "float f". I get expected result "4.31 43100". I am not sure if changing "double" to "float" is a good solution. Is there any good solution to assure I get "43100"?
You're not going to be able to eliminate the errors in floating point arithmatic (though with proper analysis you can calculate the error). For casual usage one thing you can do to get more intuitive results is to replace the built-in float to integral conversion (which does truncation), with normal rounding:
double f = atof("4.31");
long ff = std::round(f * 10000L);
std::cout << f << ' ' << ff << '\n';
This should output what you expect: 4.31 43100
Also there's no point in using 10000L, because no matter what kind of integral type you use it still gets converted to f's floating point type for the multiplication. just use std::round(f * 10000.0);
The problem is that floating point is inexact by nature when talking about decimal numbers. A decimal number can be rounded either up or down when converted to binary, depending on which value is closest.
In this case you just want to make sure that if the number was rounded down, it's rounded up instead. You do this by adding the smallest amount possible to the value, which is done with the nextafter function if you have C++11:
long ff = std::nextafter(f, 1.1*f) * 10000L;
If you don't have nextafter you can approximate it with numeric_limits.
long ff = (f * (1.0 + std::numeric_limits<double>::epsilon())) * 10000L;
I just saw your comment that you only use 4 decimal places, so this would be simpler but less robust:
long ff = (f * 1.0000001) * 10000L;
With standard C types - i doubt.
There are many values that cannot be represented in those bits - they actually demand more space to be stored. So floating-point processor just uses the closest possible.
Floating pointing numbers cannot store all the values you think it could - there is only limited amount of bits - you can't put more than 4 billion different values in 32 bits. And that's just the first restriction.
Floating point values(in C) are represented as: sign - one sign bit, power - bits which defines the power of two for the number, significand - the bits that actually make the number.
Your actual number is sign * significand * 2 inpowerof(power - normalization).
Double is 1bit of sign, 15 bits of power(normalized to be positive but that is not the point) and 48 bits to represent the value;
It is a lot but not enough to represent all the values, especially when they cannot be easily represented as finite sum of powers of two: like binary 1010.101101(101). For example it cannot represent precisely such values like 1/3 = 0.333333(3). That's the second restriction.
Try to read - decent understanding of advantages and disadvantages of floating point arithmetic may be very handy:
http://en.wikipedia.org/wiki/Floating_point and http://homepage.cs.uiowa.edu/~atkinson/m170.dir/overton.pdf
There have been some confused answers here! What is happening is this: 4.31 can't be exactly represented as either a single- or double-precision number. It turns out that the nearest representable single-precision number is a little more than 4.31, while the nearest representable double-precision number is a little less than 4.31. When a floating-point value is assigned to an integer variable, it is rounded towards zero (not towards the nearest integer!).
So if f is single-precision, f * 10000L is greater than 43100, so it is rounded down to 43100. And if f is double-precision, f * 10000L is less than 43100, so it is rounded down to 43099.
The comment by n.m. suggests f * 10000L + 0.5, which is I think the best solution.

What is the purpose of max_digits10 and how is it different from digits10?

I am confused about what max_digits10 represents. According to its documentation, it is 0 for all integral types. The formula for floating-point types for max_digits10 looks similar to int's digits10's.
To put it simple,
digits10 is the number of decimal digits guaranteed to survive text → float → text round-trip.
max_digits10 is the number of decimal digits needed to guarantee correct float → text → float round-trip.
There will be exceptions to both but these values give the minimum guarantee. Read the original proposal on max_digits10 for a clear example, Prof. W. Kahan's words and further details. Most C++ implementations follow IEEE 754 for their floating-point data types. For an IEEE 754 float, digits10 is 6 and max_digits10 is 9; for a double it is 15 and 17. Note that both these numbers should not be confused with the actual decimal precision of floating-point numbers.
Example digits10
char const *s1 = "8.589973e9";
char const *s2 = "0.100000001490116119384765625";
float const f1 = strtof(s1, nullptr);
float const f2 = strtof(s2, nullptr);
std::cout << "'" << s1 << "'" << '\t' << std::scientific << f1 << '\n';
std::cout << "'" << s2 << "'" << '\t' << std::fixed << std::setprecision(27) << f2 << '\n';
Prints
'8.589973e9' 8.589974e+009
'0.100000001490116119384765625' 0.100000001490116119384765625
All digits up to the 6th significant digit were preserved, while the 7th digit didn't survive for the first number. However, all 27 digits of the second survived; this is an exception. However, most numbers become different beyond 7 digits and all numbers would be the same within 6 digits.
In summary, digits10 gives the number of significant digits you can count on in a given float as being the same as the original real number in its decimal form from which it was created i.e. the digits that survived after the conversion into a float.
Example max_digits10
void f_s_f(float &f, int p) {
std::ostringstream oss;
oss << std::fixed << std::setprecision(p) << f;
f = strtof(oss.str().c_str(), nullptr);
}
float f3 = 3.145900f;
float f4 = std::nextafter(f3, 3.2f);
std::cout << std::hexfloat << std::showbase << f3 << '\t' << f4 << '\n';
f_s_f(f3, std::numeric_limits<float>::max_digits10);
f_s_f(f4, std::numeric_limits<float>::max_digits10);
std::cout << f3 << '\t' << f4 << '\n';
f_s_f(f3, 6);
f_s_f(f4, 6);
std::cout << f3 << '\t' << f4 << '\n';
Prints
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdcp+1
0x1.92acdap+1 0x1.92acdap+1
Here two different floats, when printed with max_digits10 digits of precision, they give different strings and these strings when read back would give back the original floats they are from. When printed with lesser precision they give the same output due to rounding and hence when read back lead to the same float, when in reality they are from different values.
In summary, max_digits10 are at least required to disambiguate two floats in their decimal form, so that when converted back to a binary float, we get the original bits again and not of the one slightly before or after it due to rounding errors.
In my opinion, it is explained sufficiently at the linked site (and the site for digits10):
digits10 is the (max.) amount of "decimal" digits where numbers
can be represented by a type in any case, independent of their actual value.
A usual 4-byte unsigned integer as example: As everybody should know, it has exactly 32bit,
that is 32 digits of a binary number.
But in terms of decimal numbers?
Probably 9.
Because, it can store 100000000 as well as 999999999.
But if take numbers with 10 digits: 4000000000 can be stored, but 5000000000 not.
So, if we need a guarantee for minimum decimal digit capacity, it is 9.
And that is the result of digits10.
max_digits10 is only interesting for float/double... and gives the decimal digit count
which we need to output/save/process... to take the whole precision
the floating point type can offer.
Theoretical example: A variable with content 123.112233445566
If you show 123.11223344 to the user, it is not as precise as it can be.
If you show 123.1122334455660000000 to the user, it makes no sense because
you could omit the trailing zeros (because your variable can´t hold that much anyways)
Therefore, max_digits10 says how many digits precision you have available in a type.
Lets build some context
After going through lots of answers and reading stuff following is the simplest and layman answer i could reach upto for this.
Floating point numbers in computers (Single precision i.e float type in C/C++ etc. OR double precision i.e double in C/C++ etc.) have to be represented using fixed number of bits.
float is a 32-bit IEEE 754 single precision Floating Point Number – 1
bit for the sign, 8 bits for the exponent, and 23* for the value.
float has 7 decimal digits of precision.
And for double type
The C++ double should have a floating-point precision of up to 15
digits as it contains a precision that is twice the precision of the
float data type. When you declare a variable as double, you should
initialize it with a decimal value
What the heck above means to me?
Its possible that sometimes the floating point number which you have cannot fit into the number of bits available for that type. for eg. float value of 0.1 cannot FIT into available number of BITS in a computer. You may ask why. Try converting this value to binary and you will see that the binary representation is never ending and we have only finite number of bits so we need to stop at one point even though the binary conversion logic says keep going on.
If the given floating point number can be represented by the number of bits available, then we are good. If its not possible to represent the given floating point number in the available number of bits, then the bits are stored a value which is as close as possible to the actual value. This is also known as "Rounding the float value" OR "Rounding error". Now how this value is calculated depends of specific implementation but its safe to assume that given a specific implementation, the most closest value is chosen.
Now lets come to std::numeric_limits<T>::digits10
The value of std::numeric_limits::digits10 is the number of
base-10 digits that are necessary to uniquely represent all distinct
values of the type T, such as necessary for
serialization/deserialization to text. This constant is meaningful for
all floating-point types.
What this std::numeric_limits<T>::digits10 is saying is that whenever you fall into a scenario where rounding MUST happen then you can be assured that after given floating point value is rounded to its closest representable value by the computer, then its guarantied that the closest representable value's std::numeric_limits<T>::digits10 number of Decimal digits will be exactly same as your input floating point. For single precision floating point value this number is usually 6 and for double precision float value this number is usually 15.
Now you may ask why i used the word "guarantied". Well i used this because its possible that more number of digits may survive while conversion to float BUT if you ask me give me a guarantee that how many will survive in all the cases, then that number is std::numeric_limits<T>::digits10. Not convinced yet?
OK, consider example of unsigned char which has 8 bits of storage. When you convert a decimal value to unsigned char, then what's the guarantee that how many decimal digits will survive? I will say "2". Then you will say that even 145 will survive, so it should be 3. BUT i will say NO. Because if you take 256, then it won't survive. Of course 255 will survive, but since you are asking for guarantee so i can only guarantee that 2 digits will survive because answer 3 is not true if i am trying to use values higher than 255.
Now use the same analogy for floating number types when someone asks for a guarantee. That guarantee is given by std::numeric_limits<T>::digits10
Now what the heck is std::numeric_limits<T>::max_digits10
Here comes a bit of another level of complexity. BUT I will try to explain as simple as I can
As i mentioned previously that due to limited number of bits available to represent a floating type on a computer, its not possible to represent every float value exactly. Few can be represented exactly BUT not all values. Now lets consider a hypothetical situation. Someone asks you to write down all the possible float values which the computer can represent (ooohhh...i know what you are thinking). Luckily you don't have write all those :)
Just imagine that you started and reached the last float value which a computer can represent. The max float value which the computer can represent will have certain number of decimal digits. These are the number of decimal digits which std::numeric_limits<T>::max_digits10 tells us. BUT an actual explanation for std::numeric_limits<T>::max_digits10 is the maximum number of decimal digits you need to represent all possible representable values. Thats why i asked you to write all the value initially and you will see that you need maximum std::numeric_limits<T>::max_digits10 of decimal digits to write all representable values of type T.
Please note that this max float value is also the float value which can survive the text to float to text conversion but its number of decimal digits are NOT the guaranteed number of digits (remember the unsigned char example i gave where 3 digits of 255 doesn't mean all 3 digits values can be stored in unsigned char?)
Hope this attempt of mine gives people some understanding. I know i may have over simplified things BUT I have spent sleepless night thinking and reading stuff and this is the explanation which was able to give me some peace of mind.
Cheers !!!

C++ floating-point console output issue

float x = 384.951257;
std::cout << std::fixed << std::setprecision(6) << x << std::endl;
The output is 384.951263. Why? I'm using gcc.
float is usually only 32-bit. With about 3 bits per decimal digit (210 roughly equals 103) that means it can't possibly represent more than about 11 decimal digits, and accounting for other information it also needs to represent, such as magnitude, let's say 6-7 decimal digits. Hey, that's what you got!
Check e.g. Wikipedia for details.
Use double or long double for better precision. double is the default in C++. E.g., the literal 3.14 is of type double.
Floats have a limited resolution. So it gets rounded when you assing the value to x.
All answers here talk as though the issue is due to floating-point numbers and their capacity, but those are just implementation details; the issue is deeper than that. This issue occurs when representing decimal numbers using binary number system. Even something as simple as 0.1)10 is not precisely representable in binary, since it can only represent those numbers as a finite fraction where the denominator is a power of 2. Unfortunately, this does not include most of the numbers that can be represented as finite fraction in base 10, like 0.1.
The single-precision float datatype usually gets mapped to binary32 as called by the IEEE 754 standard, has 32-bits which is partitioned into 1 sign bit, 8 exponent bits and 23 significand bits (excluding the hidden/implicit bit). Thus we've to calculate upto 24 bits when converting to binary32.
Other answers here evade the actual calculations involved, I'll try to do it. This method is explained in greater detail here. So lets convert the real number into a binary number:
Integer part 384)10 = 110000000)2 (using the usual method of successive division by 2)
Fractional part 0.951257)10 can be converted by successive multiplication by 2 and taking the integer part
0.951257 * 2 = 1.902514
0.902514 * 2 = 1.805028
0.805028 * 2 = 1.610056
0.610056 * 2 = 1.220112
0.220112 * 2 = 0.440224
0.440224 * 2 = 0.880448
0.880448 * 2 = 1.760896
0.760896 * 2 = 1.521792
0.521792 * 2 = 1.043584
0.043584 * 2 = 0.087168
0.087168 * 2 = 0.174336
0.174336 * 2 = 0.348672
0.348672 * 2 = 0.697344
0.697344 * 2 = 1.394688
0.394688 * 2 = 0.789376
Gathering the obtined fractional part in binary we've 0.111100111000010)2. The overall number in binary would be 110000000.111100111000010)2; this has 24 bits as required.
Converting this back to decimal would give you 384 + (15585 / 16384) = 384.951232)10. With the rounding mode (round to nearest) enabled this comes to, what you see, 384.951263)10.
This can be verified here.