Suppose I have,
float dt;
I read dt from a text file as
inputFile >> dt;
Then I have a for loop as,
for (float time=dt; time<=maxTime; time+=dt)
{
// some stuff
}
When dt = 0.05 and I output the time with std::cout << time << std::endl; I get,
0.05
0.10
...
7.00001
7.05001
...
So why is the number of digits increasing after a while?
Because not every number can be represented by IEEE754 floating point values. At some point, you'll get a number that isn't quite representable and the computer will have to choose the nearest one.
If you enter 0.05 into Harald Schmidt's excellent online converter and reference the Wikipedia entry on IEEE754-1985, you'll end up with the following bits (my explanation of that follows):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111010 10011001100110011001101
|||||||| |||||||||||||||||||||||
128 -+||||||| ||||||||||||||||||||||+- 1 / 8388608
64 --+|||||| |||||||||||||||||||||+-- 1 / 4194304
32 ---+||||| ||||||||||||||||||||+--- 1 / 2097152
16 ----+|||| |||||||||||||||||||+---- 1 / 1048576
8 -----+||| ||||||||||||||||||+----- 1 / 524288
4 ------+|| |||||||||||||||||+------ 1 / 262144
2 -------+| ||||||||||||||||+------- 1 / 131072
1 --------+ |||||||||||||||+-------- 1 / 65536
||||||||||||||+--------- 1 / 32768
|||||||||||||+---------- 1 / 16384
||||||||||||+----------- 1 / 8192
|||||||||||+------------ 1 / 4096
||||||||||+------------- 1 / 2048
|||||||||+-------------- 1 / 1024
||||||||+--------------- 1 / 512
|||||||+---------------- 1 / 256
||||||+----------------- 1 / 128
|||||+------------------ 1 / 64
||||+------------------- 1 / 32
|||+-------------------- 1 / 16
||+--------------------- 1 / 8
|+---------------------- 1 / 4
+----------------------- 1 / 2
The sign, being 0, is positive. The exponent is indicated by the one-bits mapping to the numbers on the left: 64+32+16+8+2 = 122 - 127 bias = -5, so the multiplier is 2^-5 or 1/32. The 127 bias is there to allow representation of very small numbers (as in close to zero, rather than negative numbers with a large magnitude).
The mantissa is a little more complicated. For each one-bit, you accumulate the numbers down the right hand side (after adding an implicit 1). Hence you can calculate the number as the sum of {1, 1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.
When you add all these up, you get 1.60000002384185791015625.
When you multiply that by the multiplier 1/32 (calculated previously from the exponent bits), you get 0.0500000001, so you can see that 0.05 is already not represented exactly. This bit pattern for the mantissa is actually the same as 0.1 but, with that, the exponent is -4 rather than -5, and it's why 0.1 + 0.1 + 0.1 is rarely equal to 0.3 (this appears to be a favourite interview question).
When you start adding them up, that small error will accumulate: not only will you see an error in the 0.05 itself, errors may also be introduced at each stage of the accumulation - not all of the numbers 0.1, 0.15, 0.2 and so on can be represented exactly either.
Eventually, the errors will get large enough that they'll start showing up in the number if you use the default precision. You can put this off for a bit by choosing your own precision with something like:
#include <iostream>
#include <iomanip>
:
std::cout << std::setprecision (2) << time << '\n';
It won't fix the variable value, but it will give you some more breathing space before the errors become visible.
As an aside, some people recommend avoiding std::endl since it forces a flush of the buffers. If your implementation is behaving itself, this will happen for terminal devices when you send a newline anyway. And if you've redirected standard output to a non-terminal, you probably don't want flushing on every line. Not really relevant to your question and it probably won't make a real difference in the vast majority of cases, just a point I thought I'd bring up.
IEEE floats use the binary number system and therefore can't store decimal numbers exactly. When you add several of them together (sometimes just two is enough), the representational errors can accumulate and become visible.
Some numbers can't be precisely represented using floating point OR base-2 numbers. If I remember correctly, one such number is decimal 0.05 (in base 2 it results in an infinitely repeating fraction). Another issue is that if you print a floating point number to a file (as a base-10 number) and then read it back, you might get a different number - because the base differs, and converting a fractional base-2 number to a fractional base-10 number can cause problems.
If you want better precision you could try searching for a bignum library. This will work much more slowly than floating point, though. Another way to deal with precision problems would be to store numbers as common fractions with a numerator/denominator (i.e. 1/10 instead of 0.1, 1/3 instead of 0.333..., etc. - there's probably a library even for that, but I haven't heard of one), but that won't work with irrational numbers like pi or e.
Related
I am trying to generate a random number between -10 and 10 with step 0.3 (though I want to have these be arbitrary values) and am having issues with double precision floating point accuracy. Float.h's DBL_DIG is meant to be the minimum accuracy at which no rounding error occurs [EDIT: This is false, see Eric Postpischil's comment for a true definition of DBL_DIG], yet when printing to this many digits, I still see rounding error.
#include <stdio.h>
#include <float.h>
#include <stdlib.h>

int main()
{
    for (;;)
    {
        printf("%.*g\n", DBL_DIG, -10 + (rand() % (unsigned long)(20 / 0.3)) * 0.3);
    }
}
When I run this, I get this output:
8.3
-7
1.7
-6.1
-3.1
1.1
-3.4
-8.2
-9.1
-9.7
-7.6
-7.9
1.4
-2.5
-1.3
-8.8
2.6
6.2
3.8
-3.4
9.5
-7.6
-1.9
-0.0999999999999996
-2.2
5
3.2
2.9
-2.5
2.9
9.5
-4.6
6.2
0.799999999999999
-1.3
-7.3
-7.9
Of course, a simple solution would be to just #define DBL_DIG 14 but I feel that is wasting accuracy. Why is this happening and how do I prevent this happening? This is not a duplicate of Is floating point math broken? since I am asking about DBL_DIG, and how to find the minimum accuracy at which no error occurs.
For the specific code in the question, we can avoid excess rounding errors by using integer values until the last moment:
printf("%.*g\n", DBL_DIG,
(-100 + rand() % (unsigned long)(20 / 0.3) * 3.) / 10.);
This was obtained by multiplying each term in the original expression by 10 (−10 becomes −100 and 0.3 becomes 3) and then dividing the whole expression by 10. So all the values we care about in the numerator1 are integers, which floating-point represents exactly (within the range of its precision).
Since the integer values will be computed exactly, there will be just a single rounding error, in the final division by 10, and the result will be the double closest to the desired value.
How many digits should I print to in order to avoid rounding error in most circumstances? (not just in my example above)
Just using more digits is not a solution for general cases. One approach for avoiding error in most cases is to learn about floating-point formats and arithmetic in considerable detail and then write code thoughtfully and meticulously. This approach is generally good but not always successful as it is usually implemented by humans, who continue to make mistakes in spite of all efforts to the contrary.
Footnote
1 Considering (unsigned long)(20 / 0.3) is a longer discussion involving intent and generalization to other values and cases.
generate a random number between -10 and 10 with step 0.3
I would like the program to work with arbitrary values for the bounds and step size.
Why is this happening ....
The source of trouble is assuming that typical real numbers (such as the string "0.3") can be encoded exactly as a double.
A double can encode about 2^64 different values exactly. 0.3 is not one of them.
Instead the nearest double is used.
The exact value and 2 nearest are listed below:
0.29999999999999993338661852249060757458209991455078125
0.299999999999999988897769753748434595763683319091796875 (best 0.3)
0.3000000000000000444089209850062616169452667236328125
So OP's code is really stepping "-10 and 10 with step 0.2999...", and printing "-0.0999999999999996" and "0.799999999999999" is more correct than printing "-0.1" and "0.8".
.... how do I prevent this happening?
Print with a more limited precision.
// reduce the _bit_ output precision by about the root of steps
#define LOG10_2 0.30102999566398119521373889472449
int digits_less = lround(sqrt(20 / 0.3) * LOG10_2);
for (int i = 0; i < 100; i++) {
    printf("%.*e\n", DBL_DIG - digits_less,
           -10 + (rand() % (unsigned long) (20 / 0.3)) * 0.3);
}
9.5000000000000e+00
-3.7000000000000e+00
8.6000000000000e+00
5.9000000000000e+00
...
-1.0000000000000e-01
8.0000000000000e-01
OP's code really is not doing "steps", as that hints toward a loop with a step of 0.3. The above digits_less is based on repetitive "steps"; otherwise OP's equation warrants about a 1 decimal digit reduction. The best reduction in precision depends on estimating the potential cumulative error of all calculations: the "0.3" conversion --> double 0.3 (1/2 bit), division (1/2 bit), multiplication (1/2 bit) and addition (more complicated).
Wait for the next version of C which may support decimal floating point.
I have a program that is finding paths in a graph and outputting the cumulative weight. All of the edges in the graph have an individual weight of 0 to 100 in the form of a float with at most 2 decimal places.
On Windows/Visual Studio 2010, for a particular path consisting of edges with 0 weight, it outputs the correct total weight of 0. However on Linux/GCC the program is saying the path has a weight of 2.35503e-38. I have had plenty of experiences with crazy bugs caused by floats, but when would 0 + 0 ever equal anything other than 0?
The only thing I can think of that is causing this is the program does treat some of the weights as integers and uses implicit coercion to add them to the total. But 0 + 0.0f still equals 0.0f!
As a quick fix I reduce the total to 0 when it is less than 0.00001, and that is sufficient for my needs, for now. But what voodoo causes this?
NOTE: I am 100% confident that none of the weights in the graph exceed the range I mentioned and that all of the weights in this particular path are all 0.
EDIT: To elaborate, I have tried both reading the weights from a file and setting them in the code manually as equal to 0.0f. No other operation is being performed on them other than adding them to the total.
Because it's an IEEE floating point number, and it's not exactly equal to zero.
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
[...] in the form of a float with at most 2 decimal places.
There is no such thing as a float with at most 2 decimal places. Floats are almost always represented as a binary floating point number (fractional binary mantissa and integer exponent). So many (most) numbers with 2 decimal places cannot be represented exactly.
For example, 0.20f may look like an innocent and round fraction, but
printf("%.40f\n", 0.20f);
will print: 0.2000000029802322387695312500000000000000.
See, it does not have 2 decimal places, it has 26!!!
Naturally, for most practical uses the difference in negligible. But if you do some calculations you may end up increasing the rounding error and making it visible, particularly around 0.
It may be that your floats containing values of "0.0f" aren't actually 0.0f (bit representation 0x00000000), but a very, very small number that displays as about 0.0. Because of the way the IEEE754 spec defines float representations, a value with a tiny magnitude, while not equal to absolute 0, will round to 0 when printed with limited precision. However, if you add these numbers together a sufficient number of times, the very small amount will accumulate into a value that eventually becomes non-zero.
Here is an example case which gives the illusion of 0 being non-zero:
float f = 0.1f / 1000000000;
printf("%f, %08x\n", f, *(unsigned int *)&f);
float f2 = f * 10000;
printf("%f, %08x\n", f2, *(unsigned int *)&f2);
If you are assigning literals to your variables and adding them, though, it is possible that the compiler is not translating 0 into 0x0 in memory. If it is, and this is still happening, then it's also possible that your CPU hardware has a bug relating to turning 0s into non-zero values during ALU operations that squeaked by its validation efforts.
However, it is good to remember that IEEE floating point is only an approximation of the reals, not an exact representation of every value. So any floating-point operations are bound to have some amount of error.
I have seen long articles explaining how floating point numbers can be stored and how the arithmetic of those numbers is being done, but please briefly explain why when I write
cout << 1.0 / 3.0 <<endl;
I see 0.333333, but when I write
cout << 1.0 / 3.0 + 1.0 / 3.0 + 1.0 / 3.0 << endl;
I see 1.
How does the computer do this? Please explain just this simple example. It is enough for me.
Check out the article on "What every computer scientist should know about floating point arithmetic"
The problem is that the floating point format represents fractions in base 2.
The first fraction bit is ½, the second ¼, and it goes on as 1/2^n.
And the problem with that is that not every rational number (a number that can be expressed as the ratio of two integers) actually has a finite representation in this base 2 format.
(This makes the floating point format difficult to use for monetary values. Although these values are always rational numbers (n/100), only .00, .25, .50, and .75 actually have exact representations in any number of digits of a base-two fraction.)
Anyway, when you add them back, the system eventually gets a chance to round the result to a number that it can represent exactly.
At some point, it finds itself adding the .666... number to the .333... one, like so:
00111110 1 .o10101010 10101010 10101011
+ 00111111 0 .10101010 10101010 10101011o
------------------------------------------
00111111 1 (1).0000000 00000000 0000000x # the x isn't in the final result
The leftmost bit is the sign, the next eight are the exponent, and the remaining bits are the fraction. In between the exponent and the fraction is an assumed "1" that is always present, and therefore not actually stored, as the normalized leftmost fraction bit. I've written zeroes that aren't actually present as individual bits as o.
A lot has happened here; at each step, the FPU has taken rather heroic measures to round the result. Two extra digits of precision (beyond what will fit in the result) have been kept, and the FPU knows in many cases whether any, or at least one, of the remaining rightmost bits were one. If so, then that part of the fraction is more than 0.5 (scaled) and so it rounds up. The intermediate rounded values allow the FPU to carry the rightmost bit all the way over to the integer part and finally round to the correct answer.
This didn't happen because anyone added 0.5; the FPU just did the best it could within the limitations of the format. Floating point is not, actually, inaccurate. It's perfectly accurate, but most of the numbers we expect to see in our base-10, rational-number world-view are not representable by the base-2 fraction of the format. In fact, very few are.
Let's do the math. For brevity, we assume that you only have four significant (base-2) digits.
Of course, since gcd(2,3)=1, 1/3 is periodic when represented in base-2. In particular, it cannot be represented exactly, so we need to content ourselves with the approximation
A := 1×1/4 + 0×1/8 + 1×1/16 + 1×1/32
which is closer to the real value of 1/3 than
A' := 1×1/4 + 0×1/8 + 1×1/16 + 0×1/32
So, printing A in decimal gives 0.34375 (the fact that you see 0.33333 in your example is just testament to the larger number of significant digits in a double).
When adding these up three times, we get
A + A + A
= ( A + A ) + A
= ( (1/4 + 1/16 + 1/32) + (1/4 + 1/16 + 1/32) ) + (1/4 + 1/16 + 1/32)
= ( 1/4 + 1/4 + 1/16 + 1/16 + 1/32 + 1/32 ) + (1/4 + 1/16 + 1/32)
= ( 1/2 + 1/8 + 1/16 ) + (1/4 + 1/16 + 1/32)
= 1/2 + 1/4 + 1/8 + 1/16 + 1/16 + O(1/32)
The O(1/32) term cannot be represented in the result, so it's discarded and we get
A + A + A = 1/2 + 1/4 + 1/8 + 1/16 + 1/16 = 1
QED :)
As for this specific example: I think the compilers are too clever nowadays, and automatically make sure a const result of primitive types will be exact if possible. I haven't managed to fool g++ into doing an easy calculation like this wrong.
However, it's easy to bypass such things by using non-const variables. Still,
int d = 3;
float a = 1./d;
std::cout << d*a;
will exactly yield 1, although this shouldn't really be expected. The reason, as was already said, is that the operator<< rounds the error away.
As to why it can do this: when you add numbers of similar size or multiply a float by an int, you get pretty much all the precision the float type can maximally offer you - that means, the ratio error/result is very small (in other words, the errors occur in a late decimal place, assuming you have a positive error).
So 3*(1./3), even though, as a float, it is not exactly ==1, has a big correct bias which prevents operator<< from exposing the small error. However, if you then remove this bias by just subtracting 1, the floating point will slip down right to the error, and suddenly it's not negligible at all any more. As I said, this doesn't happen if you just type 3*(1./3)-1, because the compiler is too clever, but try
int d = 3;
float a = 1./d;
std::cout << d*a << " - 1 = " << d*a - 1 << " ???\n";
What I get (g++, 32 bit Linux) is
1 - 1 = 2.98023e-08 ???
This works because the default precision is 6 digits, and rounded to 6 digits the result is 1. See 27.5.4.1 basic_ios constructors in the C++ draft standard (n3092).
How do you print a double to a stream so that when it is read in you don't lose precision?
I tried:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::setprecision(std::numeric_limits<double>::digits10) << v << " ";
double u;
ss >> u;
std::cout << "precision " << ((u == v) ? "retained" : "lost") << std::endl;
This did not work as I expected.
But I can increase precision (which surprised me as I thought that digits10 was the maximum required).
ss << std::setprecision(std::numeric_limits<double>::digits10 + 2) << v << " ";
//                                                         ^^^^^^ +2
It has to do with the number of significant digits and the first two don't count in (0.01).
So has anybody looked at representing floating point numbers exactly?
What is the exact magical incantation on the stream I need to do?
After some experimentation:
The trouble was with my original version. There were non-significant digits in the string after the decimal point that affected the accuracy.
So to compensate for this, we can use scientific notation:
ss << std::scientific
<< std::setprecision(std::numeric_limits<double>::digits10 + 1)
<< v;
This still does not explain the need for the +1 though.
Also if I print out the number with more precision I get more precision printed out!
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits10 + 1) << v << "\n";
std::cout << std::scientific << std::setprecision(std::numeric_limits<double>::digits) << v << "\n";
It results in:
1.000000000000000e-02
1.0000000000000002e-02
1.00000000000000019428902930940239457413554200000000000e-02
Based on #Stephen Canon answer below:
We can print out exactly by using the printf() formatter, "%a" or "%A". To achieve this in C++ we need to use the fixed and scientific manipulators (see n3225: 22.4.2.2.2p5 Table 88)
std::cout.flags(std::ios_base::fixed | std::ios_base::scientific);
std::cout << v;
For now I have defined:
template<typename T>
std::ostream& precise(std::ostream& stream)
{
    stream.flags(std::ios_base::fixed | std::ios_base::scientific);
    return stream;
}
std::ostream& preciselngd(std::ostream& stream){ return precise<long double>(stream);}
std::ostream& precisedbl(std::ostream& stream) { return precise<double>(stream);}
std::ostream& preciseflt(std::ostream& stream) { return precise<float>(stream);}
Next: How do we handle NaN/Inf?
It's not correct to say "floating point is inaccurate", although I admit that's a useful simplification. If we used base 8 or 16 in real life then people around here would be saying "base 10 decimal fraction packages are inaccurate, why did anyone ever cook those up?".
The problem is that integral values translate exactly from one base into another, but fractional values do not, because they represent fractions of the integral step and only a few of them are used.
Floating point arithmetic is technically perfectly accurate. Every calculation has one and only one possible result. There is a problem, and it is that most decimal fractions have base-2 representations that repeat. In fact, in the sequence 0.01, 0.02, ... 0.99, only a mere 3 values have exact binary representations. (0.25, 0.50, and 0.75.) There are 96 values that repeat and therefore are obviously not represented exactly.
Now, there are a number of ways to write and read back floating point numbers without losing a single bit. The idea is to avoid trying to express the binary number with a base 10 fraction.
Write them as binary. These days, everyone implements the IEEE-754 format so as long as you choose a byte order and write or read only that byte order, then the numbers will be portable.
Write them as 64-bit integer values. Here you can use the usual base 10. (Because you are representing the 64-bit aliased integer, not the 52-bit fraction.)
You can also just write more decimal fraction digits. Whether this is bit-for-bit accurate will depend on the quality of the conversion libraries and I'm not sure I would count on perfect accuracy (from the software) here. But any errors will be exceedingly small and your original data certainly has no information in the low bits. (None of the constants of physics and chemistry are known to 52 bits, nor has any distance on earth ever been measured to 52 bits of precision.) But for a backup or restore where bit-for-bit accuracy might be compared automatically, this obviously isn't ideal.
Don't print floating-point values in decimal if you don't want to lose precision. Even if you print enough digits to represent the number exactly, not all implementations have correctly-rounded conversions to/from decimal strings over the entire floating-point range, so you may still lose precision.
Use hexadecimal floating point instead. In C:
printf("%a\n", yourNumber);
C++0x provides the hexfloat manipulator for iostreams that does the same thing (on some platforms, using the std::hex modifier has the same result, but this is not a portable assumption).
Using hex floating point is preferred for several reasons.
First, the printed value is always exact. No rounding occurs in writing or reading a value formatted in this way. Beyond the accuracy benefits, this means that reading and writing such values can be faster with a well tuned I/O library. They also require fewer digits to represent values exactly.
I got interested in this question because I'm trying to (de)serialize my data to & from JSON.
I think I have a clearer explanation (with less hand waving) for why 17 decimal digits are sufficient to reconstruct the original number losslessly:
Imagine 3 number lines:
1. for the original base 2 number
2. for the rounded base 10 representation
3. for the reconstructed number (same as #1 because both in base 2)
When you convert to base 10, graphically, you choose the tic on the 2nd number line closest to the tic on the 1st. Likewise when you reconstruct the original from the rounded base 10 value.
The critical observation I had was that in order to allow exact reconstruction, the base 10 step size (quantum) has to be smaller than the base 2 quantum. Otherwise, two adjacent base 2 tics can map to the same base 10 tic, and the reconstruction inevitably picks the wrong one.
Take the specific case of when the exponent is 0 for the base2 representation. Then the base2 quantum will be 2^-52 ~= 2.22 * 10^-16. The closest base 10 quantum that's less than this is 10^-16. Now that we know the required base 10 quantum, how many digits will be needed to encode all possible values? Given that we're only considering the case of exponent = 0, the dynamic range of values we need to represent is [1.0, 2.0). Therefore, 17 digits would be required (16 digits for fraction and 1 digit for integer part).
For exponents other than 0, we can use the same logic:
exponent base2 quant. base10 quant. dynamic range digits needed
---------------------------------------------------------------------
1 2^-51 10^-16 [2, 4) 17
2 2^-50 10^-16 [4, 8) 17
3 2^-49 10^-15 [8, 16) 17
...
32 2^-20 10^-7 [2^32, 2^33) 17
1022 9.98e291 1.0e291 [4.49e307,8.99e307) 17
While not exhaustive, the table shows the trend that 17 digits are sufficient.
Hope you like my explanation.
In C++20 you'll be able to use std::format to do this:
std::stringstream ss;
double v = 0.1 * 0.1;
ss << std::format("{}", v);
double u;
ss >> u;
assert(v == u);
The default floating-point format is the shortest decimal representation with a round-trip guarantee. The advantage of this method compared to using the precision of max_digits10 (not digits10 which is not suitable for round trip through decimal) from std::numeric_limits is that it doesn't print unnecessary digits.
In the meantime you can use the {fmt} library, std::format is based on. For example (godbolt):
fmt::print("{}", 0.1 * 0.1);
Output (assuming IEEE754 double):
0.010000000000000002
{fmt} uses the Dragonbox algorithm for fast binary floating point to decimal conversion. In addition to giving the shortest representation it is 20-30x faster than common standard library implementations of printf and iostreams.
Disclaimer: I'm the author of {fmt} and C++20 std::format.
A double has the precision of 52 binary digits or 15.95 decimal digits. See http://en.wikipedia.org/wiki/IEEE_754-2008. You need at least 16 decimal digits to record the full precision of a double in all cases. [But see fourth edit, below].
By the way, this means significant digits.
Answer to OP edits:
Your floating point to decimal string runtime is outputting way more digits than are significant. A double can only hold 52 bits of significand (actually 53, if you count a "hidden" 1 that is not stored). That means the resolution is not more than 2^-52 ≈ 2.22e-16.
For example: 1 + 2^-52 = 1.0000000000000002220446049250313 . . . .
Those decimal digits, .0000000000000002220446049250313 . . . . are the smallest binary "step" in a double when converted to decimal.
The "step" inside the double is:
.0000000000000000000000000000000000000000000000000001 in binary.
Note that the binary step is exact, while the decimal step is inexact.
Hence the decimal representation above,
1.0000000000000002220446049250313 . . .
is an inexact representation of the exact binary number:
1.0000000000000000000000000000000000000000000000000001.
Third Edit:
The next possible value for a double, which in exact binary is:
1.0000000000000000000000000000000000000000000000000010
converts inexactly in decimal to
1.0000000000000004440892098500626 . . . .
So all of those extra digits in the decimal are not really significant, they are just base conversion artifacts.
Fourth Edit:
Though a double stores at most 16 significant decimal digits, sometimes 17 decimal digits are necessary to represent the number. The reason has to do with digit slicing.
As I mentioned above, there are 52 + 1 binary digits in the double. The "+ 1" is an assumed leading 1, and is neither stored nor significant. In the case of an integer, those 52 binary digits form a number between 0 and 2^53 - 1. How many decimal digits are necessary to store such a number? Well, log_10 (2^53 - 1) is about 15.95. So at most 16 decimal digits are necessary. Let's label these d_0 to d_15.
Now consider that IEEE floating point numbers also have a binary exponent. What happens when we increment the exponent by, say, 2? We have multiplied our 52-bit number, whatever it was, by 4. Now, instead of our 52 binary digits aligning perfectly with our decimal digits d_0 to d_15, we have some significant binary digits represented in d_16. However, since we multiplied by something less than 10, we still have significant binary digits represented in d_0. So our 15.95 decimal digits now occupy d_1 to d_15, plus some upper bits of d_0 and some lower bits of d_16. This is why 17 decimal digits are sometimes needed to represent an IEEE double.
Fifth Edit
Fixed numerical errors
The easiest way (for IEEE 754 double) to guarantee a round-trip conversion is to always use 17 significant digits. But that has the disadvantage of sometimes including unnecessary noise digits (0.1 → "0.10000000000000001").
An approach that's worked for me is to sprintf the number with 15 digits of precision, then check if atof gives you back the original value. If it doesn't, try 16 digits. If that doesn't work, use 17.
You might want to try David Gay's algorithm (used in Python 3.1 to implement float.__repr__).
Thanks to ThomasMcLeod for pointing out the error in my table computation
To guarantee round-trip conversion using 15 or 16 or 17 digits is only possible for a comparatively few cases. The number 15.95 comes from taking 2^53 (1 implicit bit + 52 bits in the significand/"mantissa") which comes out to an integer in the range 10^15 to 10^16 (closer to 10^16).
Consider a double precision value x with an exponent of 0, i.e. it falls into the floating point range 1.0 <= x < 2.0. The implicit bit will mark the 2^0 component of x. The highest explicit bit of the significand will denote the next lower exponent (from 0) <=> -1 => 2^-1, or the 0.5 component.
The next bit denotes 0.25, the ones after it 0.125, 0.0625, 0.03125, 0.015625 and so on (see table below). The value 1.5 will thus be represented by two components added together: the implicit bit denoting 1.0 and the highest explicit significand bit denoting 0.5.
This illustrates that from the implicit bit downward you have 52 additional, explicit bits to represent possible components, where the smallest is 0 (exponent) - 52 (explicit bits in significand) = -52 => 2^-52, which according to the table below comes out to quite a bit more than 15.95 significant digits (37, to be exact). To put it another way: the smallest number in the 2^0 range that is != 1.0 itself is 2^0 + 2^-52, which is 1.0 plus the value of 2^-52 (below) = (exactly) 1.0000000000000002220446049250313080847263336181640625, a value which I count as being 53 significant digits long. With 17-digit formatting "precision" the number will display as 1.0000000000000002, and this would depend on the library converting correctly.
So maybe "round-trip conversion in 17 digits" is not really a concept that is valid (enough).
2^ -1 = 0.5000000000000000000000000000000000000000000000000000
2^ -2 = 0.2500000000000000000000000000000000000000000000000000
2^ -3 = 0.1250000000000000000000000000000000000000000000000000
2^ -4 = 0.0625000000000000000000000000000000000000000000000000
2^ -5 = 0.0312500000000000000000000000000000000000000000000000
2^ -6 = 0.0156250000000000000000000000000000000000000000000000
2^ -7 = 0.0078125000000000000000000000000000000000000000000000
2^ -8 = 0.0039062500000000000000000000000000000000000000000000
2^ -9 = 0.0019531250000000000000000000000000000000000000000000
2^-10 = 0.0009765625000000000000000000000000000000000000000000
2^-11 = 0.0004882812500000000000000000000000000000000000000000
2^-12 = 0.0002441406250000000000000000000000000000000000000000
2^-13 = 0.0001220703125000000000000000000000000000000000000000
2^-14 = 0.0000610351562500000000000000000000000000000000000000
2^-15 = 0.0000305175781250000000000000000000000000000000000000
2^-16 = 0.0000152587890625000000000000000000000000000000000000
2^-17 = 0.0000076293945312500000000000000000000000000000000000
2^-18 = 0.0000038146972656250000000000000000000000000000000000
2^-19 = 0.0000019073486328125000000000000000000000000000000000
2^-20 = 0.0000009536743164062500000000000000000000000000000000
2^-21 = 0.0000004768371582031250000000000000000000000000000000
2^-22 = 0.0000002384185791015625000000000000000000000000000000
2^-23 = 0.0000001192092895507812500000000000000000000000000000
2^-24 = 0.0000000596046447753906250000000000000000000000000000
2^-25 = 0.0000000298023223876953125000000000000000000000000000
2^-26 = 0.0000000149011611938476562500000000000000000000000000
2^-27 = 0.0000000074505805969238281250000000000000000000000000
2^-28 = 0.0000000037252902984619140625000000000000000000000000
2^-29 = 0.0000000018626451492309570312500000000000000000000000
2^-30 = 0.0000000009313225746154785156250000000000000000000000
2^-31 = 0.0000000004656612873077392578125000000000000000000000
2^-32 = 0.0000000002328306436538696289062500000000000000000000
2^-33 = 0.0000000001164153218269348144531250000000000000000000
2^-34 = 0.0000000000582076609134674072265625000000000000000000
2^-35 = 0.0000000000291038304567337036132812500000000000000000
2^-36 = 0.0000000000145519152283668518066406250000000000000000
2^-37 = 0.0000000000072759576141834259033203125000000000000000
2^-38 = 0.0000000000036379788070917129516601562500000000000000
2^-39 = 0.0000000000018189894035458564758300781250000000000000
2^-40 = 0.0000000000009094947017729282379150390625000000000000
2^-41 = 0.0000000000004547473508864641189575195312500000000000
2^-42 = 0.0000000000002273736754432320594787597656250000000000
2^-43 = 0.0000000000001136868377216160297393798828125000000000
2^-44 = 0.0000000000000568434188608080148696899414062500000000
2^-45 = 0.0000000000000284217094304040074348449707031250000000
2^-46 = 0.0000000000000142108547152020037174224853515625000000
2^-47 = 0.0000000000000071054273576010018587112426757812500000
2^-48 = 0.0000000000000035527136788005009293556213378906250000
2^-49 = 0.0000000000000017763568394002504646778106689453125000
2^-50 = 0.0000000000000008881784197001252323389053344726562500
2^-51 = 0.0000000000000004440892098500626161694526672363281250
2^-52 = 0.0000000000000002220446049250313080847263336181640625
@ThomasMcLeod: I think the significant digit rule comes from my field, physics, and means something more subtle:
If you have a measurement that reads 1.52 and you cannot make out any more detail on the scale, and you then have to add another number to it (say 2, from another measurement whose scale was too coarse), then the result obviously has only 2 decimal places: 3.52.
But likewise, if you add 1.1111111111 to the value 1.52, you get the value 2.63 (and nothing more!).
The reason for the rule is to prevent you from kidding yourself into thinking you got more information out of a calculation than you put in by the measurement (which is impossible, but would seem that way by filling it with garbage, see above).
That said, this specific rule is for addition only (for addition, the error of the result is the sum of the two errors, so if you measure just one of them badly, tough luck, there goes your precision...).
How to get the other rules:
Let's say a is the measured number and δa the error. Let's say your original formula was:
f:=m a
Let's say you also measure m with error δm (let that be the positive side).
Then the actual limit is:
f_up=(m+δm) (a+δa)
and
f_down=(m-δm) (a-δa)
So,
f_up =m a+δm δa+(δm a+m δa)
f_down=m a+δm δa-(δm a+m δa)
Hence, now the significant digits are even less:
f_up ~m a+(δm a+m δa)
f_down~m a-(δm a+m δa)
and so
δf=δm a+m δa
If you look at the relative error, you get:
δf/f=δm/m+δa/a
For division it is
For division the worst-case relative errors likewise add: δf/f=δm/m+δa/a, since the upper limit is f_up=(m+δm)/(a-δa) and expanding to first order again gives the sum. (You only get a minus sign if you track signed deviations instead of worst-case bounds.)
Hope that gets the gist across and hope I didn't make too many mistakes, it's late here :-)
tl,dr: Significant digits mean how many of the digits in the output actually come from the digits in your input (in the real world, not the distorted picture that floating point numbers have).
If your measurements were 1 with "no" error and 3 with "no" error and the function is supposed to be 1/3, then yes, all infinitely many digits are actual significant digits: otherwise the inverse operation would not work, so they have to be.
If significant digit rule means something completely different in another field, carry on :-)
I am writing a piece of code in which I have to convert double values to float. I am using boost::numeric_cast to do the conversion, which will alert me to any overflow/underflow. However, I am also interested in knowing whether the conversion resulted in some loss of precision.
For example
double source = 1988.1012;
float dest = numeric_cast<float>(source);
Produces dest, which has the value 1988.1.
Is there any way I can detect this kind of precision loss/rounding?
You could cast the float back to a double and compare that double to the original; that should give you a fair indication of whether there was a loss of precision.
float dest = numeric_cast<float>(source);
double residual = source - numeric_cast<double>(dest);
Hence, residual contains the "loss" you're looking for.
Look at these articles for single precision and double precision floats. First of all, floats have 8 bits for the exponent vs. 11 for a double. So anything bigger than about 3.4*10^38 or smaller than about 1.2*10^-38 in magnitude will overflow (or underflow) a float, as you mentioned. For the float, you have 23 bits for the actual digits of the number, vs. 52 bits for the double. So obviously you have a lot more digits of precision with a double than with a float.
Say you have a number like 1.1123. This number may not actually be encoded as 1.1123, because the digits in a floating point number are used to add up fractions. For example, if the bits in your mantissa were 11001, then the value would be formed by 1 (implicit) + 1 * 1/2 + 1 * 1/4 + 0 * 1/8 + 0 * 1/16 + 1 * 1/32 + 0 * 1/64 + 0 * 1/128 + ... So the exact value cannot be encoded unless these fractions happen to add up to exactly that number. This is rare. Therefore, there will almost always be a precision loss.
You're going to have a certain level of precision loss, as per Dave's answer. If, however, you want to quantify it and raise an exception when it exceeds a certain bound, you will have to open up the floating point number itself and parse out the mantissa and exponent, then do some analysis to determine whether you've exceeded your tolerance.
But the good news is that it's usually the standard IEEE 754 floating-point format. :-)