From what I have read, a value of data type double has an approximate precision of 15 decimal places. However, when I use a number whose decimal representation repeats, such as 1.0/7.0, I find that the variable holds the value of 0.14285714285714285 - which is 17 places (via the debugger).
I would like to know why it is represented as 17 places internally, and why a precision of 15 is always written at ~15?
An IEEE double has 53 significant bits (that's the value of DBL_MANT_DIG in <cfloat>). That's approximately 15.95 decimal digits (log10(253)); the implementation sets DBL_DIG to 15, not 16, because it has to round down. So you have nearly an extra decimal digit of precision (beyond what's implied by DBL_DIG==15) because of that.
The nextafter() function computes the nearest representable number to a given number; it can be used to show just how precise a given number is.
This program:
#include <cstdio>
#include <cfloat>
#include <cmath>
int main() {
double x = 1.0/7.0;
printf("FLT_RADIX = %d\n", FLT_RADIX);
printf("DBL_DIG = %d\n", DBL_DIG);
printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
printf("%.17g\n%.17g\n%.17g\n", nextafter(x, 0.0), x, nextafter(x, 1.0));
}
gives me this output on my system:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.14285714285714282
0.14285714285714285
0.14285714285714288
(You can replace %.17g by, say, %.64g to see more digits, none of which are significant.)
As you can see, the last displayed decimal digit changes by 3 with each consecutive value. The fact that the last displayed digit of 1.0/7.0 (5) happens to match the mathematical value is largely coincidental; it was a lucky guess. And the correct rounded digit is 6, not 5. Replacing 1.0/7.0 by 1.0/3.0 gives this output:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.33333333333333326
0.33333333333333331
0.33333333333333337
which shows about 16 decimal digits of precision, as you'd expect.
It is actually 53 binary places, which translates to 15 stable decimal places, meaning that if you round a start out with a number with 15 decimal places, convert it to a double, and then round the double back to 15 decimal places you'll get the same number. To uniquely represent a double you need 17 decimal places (meaning that for every number with 17 decimal places, there's a unique closest double) which is why 17 places are showing up, but not all 17-decimal numbers map to different double values (like in the examples in the other answers).
Decimal representation of floating point numbers is kind of strange. If you have a number with 15 decimal places and convert that to a double, then print it out with exactly 15 decimal places, you should get the same number. On the other hand, if you print out an arbitrary double with 15 decimal places and the convert it back to a double, you won't necessarily get the same value back—you need 17 decimal places for that. And neither 15 nor 17 decimal places are enough to accurately display the exact decimal equivalent of an arbitrary double. In general, you need over 100 decimal places to do that precisely.
See the Wikipedia page for double-precision and this article on floating-point precision.
A double holds 53 binary digits accurately, which is ~15.9545898 decimal digits. The debugger can show as many digits as it pleases to be more accurate to the binary value. Or it might take fewer digits and binary, such as 0.1 takes 1 digit in base 10, but infinite in base 2.
This is odd, so I'll show an extreme example. If we make a super simple floating point value that holds only 3 binary digits of accuracy, and no mantissa or sign (so range is 0-0.875), our options are:
binary - decimal
000 - 0.000
001 - 0.125
010 - 0.250
011 - 0.375
100 - 0.500
101 - 0.625
110 - 0.750
111 - 0.875
But if you do the numbers, this format is only accurate to 0.903089987 decimal digits. Not even 1 digit is accurate. As is easy to see, since there's no value that begins with 0.4?? nor 0.9??, and yet to display the full accuracy, we require 3 decimal digits.
tl;dr: The debugger shows you the value of the floating point variable to some arbitrary precision (19 digits in your case), which doesn't necessarily correlate with the accuracy of the floating point format (17 digits in your case).
IEEE 754 floating point is done in binary. There's no exact conversion from a given number of bits to a given number of decimal digits. 3 bits can hold values from 0 to 7, and 4 bits can hold values from 0 to 15. A value from 0 to 9 takes roughly 3.5 bits, but that's not exact either.
An IEEE 754 double precision number occupies 64 bits. Of this, 52 bits are dedicated to the significand (the rest is a sign bit and exponent). Since the significand is (usually) normalized, there's an implied 53rd bit.
Now, given 53 bits and roughly 3.5 bits per digit, simple division gives us 15.1429 digits of precision. But remember, that 3.5 bits per decimal digit is only an approximation, not a perfectly accurate answer.
Many (most?) debuggers actually look at the contents of the entire register. On an x86, that's actually an 80-bit number. The x86 floating point unit will normally be adjusted to carry out calculations to 64-bit precision -- but internally, it actually uses a couple of "guard bits", which basically means internally it does the calculation with a few extra bits of precision so it can round the last one correctly. When the debugger looks at the whole register, it'll usually find at least one extra digit that's reasonably accurate -- though since that digit won't have any guard bits, it may not be rounded correctly.
It is because it's being converted from a binary representation. Just because it has printed all those decimal digits doesn't mean it can represent all decimal values to that precision. Take, for example, this in Python:
>>> 0.14285714285714285
0.14285714285714285
>>> 0.14285714285714286
0.14285714285714285
Notice how I changed the last digit, but it printed out the same number anyway.
In most contexts where double values are used, calculations will have a certain amount of uncertainty. The difference between 1.33333333333333300 and 1.33333333333333399 may be less than the amount of uncertainty that exists in the calculations. Displaying the value of "2/3 + 2/3" as "1.33333333333333" is apt to be more meaningful than displaying it as "1.33333333333333319", since the latter display implies a level of precision that doesn't really exist.
In the debugger, however, it is important to uniquely indicate the value held by a variable, including essentially-meaningless bits of precision. It would be very confusing if a debugger displayed two variables as holding the value "1.333333333333333" when one of them actually held 1.33333333333333319 and the other held 1.33333333333333294 (meaning that, while they looked the same, they weren't equal). The extra precision shown by the debugger isn't apt to represent a numerically-correct calculation result, but indicates how the code will interpret the values held by the variables.
Related
From what I have read, a value of data type double has an approximate precision of 15 decimal places. However, when I use a number whose decimal representation repeats, such as 1.0/7.0, I find that the variable holds the value of 0.14285714285714285 - which is 17 places (via the debugger).
I would like to know why it is represented as 17 places internally, and why a precision of 15 is always written at ~15?
An IEEE double has 53 significant bits (that's the value of DBL_MANT_DIG in <cfloat>). That's approximately 15.95 decimal digits (log10(253)); the implementation sets DBL_DIG to 15, not 16, because it has to round down. So you have nearly an extra decimal digit of precision (beyond what's implied by DBL_DIG==15) because of that.
The nextafter() function computes the nearest representable number to a given number; it can be used to show just how precise a given number is.
This program:
#include <cstdio>
#include <cfloat>
#include <cmath>
int main() {
double x = 1.0/7.0;
printf("FLT_RADIX = %d\n", FLT_RADIX);
printf("DBL_DIG = %d\n", DBL_DIG);
printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
printf("%.17g\n%.17g\n%.17g\n", nextafter(x, 0.0), x, nextafter(x, 1.0));
}
gives me this output on my system:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.14285714285714282
0.14285714285714285
0.14285714285714288
(You can replace %.17g by, say, %.64g to see more digits, none of which are significant.)
As you can see, the last displayed decimal digit changes by 3 with each consecutive value. The fact that the last displayed digit of 1.0/7.0 (5) happens to match the mathematical value is largely coincidental; it was a lucky guess. And the correct rounded digit is 6, not 5. Replacing 1.0/7.0 by 1.0/3.0 gives this output:
FLT_RADIX = 2
DBL_DIG = 15
DBL_MANT_DIG = 53
0.33333333333333326
0.33333333333333331
0.33333333333333337
which shows about 16 decimal digits of precision, as you'd expect.
It is actually 53 binary places, which translates to 15 stable decimal places, meaning that if you round a start out with a number with 15 decimal places, convert it to a double, and then round the double back to 15 decimal places you'll get the same number. To uniquely represent a double you need 17 decimal places (meaning that for every number with 17 decimal places, there's a unique closest double) which is why 17 places are showing up, but not all 17-decimal numbers map to different double values (like in the examples in the other answers).
Decimal representation of floating point numbers is kind of strange. If you have a number with 15 decimal places and convert that to a double, then print it out with exactly 15 decimal places, you should get the same number. On the other hand, if you print out an arbitrary double with 15 decimal places and the convert it back to a double, you won't necessarily get the same value back—you need 17 decimal places for that. And neither 15 nor 17 decimal places are enough to accurately display the exact decimal equivalent of an arbitrary double. In general, you need over 100 decimal places to do that precisely.
See the Wikipedia page for double-precision and this article on floating-point precision.
A double holds 53 binary digits accurately, which is ~15.9545898 decimal digits. The debugger can show as many digits as it pleases to be more accurate to the binary value. Or it might take fewer digits and binary, such as 0.1 takes 1 digit in base 10, but infinite in base 2.
This is odd, so I'll show an extreme example. If we make a super simple floating point value that holds only 3 binary digits of accuracy, and no mantissa or sign (so range is 0-0.875), our options are:
binary - decimal
000 - 0.000
001 - 0.125
010 - 0.250
011 - 0.375
100 - 0.500
101 - 0.625
110 - 0.750
111 - 0.875
But if you do the numbers, this format is only accurate to 0.903089987 decimal digits. Not even 1 digit is accurate. As is easy to see, since there's no value that begins with 0.4?? nor 0.9??, and yet to display the full accuracy, we require 3 decimal digits.
tl;dr: The debugger shows you the value of the floating point variable to some arbitrary precision (19 digits in your case), which doesn't necessarily correlate with the accuracy of the floating point format (17 digits in your case).
IEEE 754 floating point is done in binary. There's no exact conversion from a given number of bits to a given number of decimal digits. 3 bits can hold values from 0 to 7, and 4 bits can hold values from 0 to 15. A value from 0 to 9 takes roughly 3.5 bits, but that's not exact either.
An IEEE 754 double precision number occupies 64 bits. Of this, 52 bits are dedicated to the significand (the rest is a sign bit and exponent). Since the significand is (usually) normalized, there's an implied 53rd bit.
Now, given 53 bits and roughly 3.5 bits per digit, simple division gives us 15.1429 digits of precision. But remember, that 3.5 bits per decimal digit is only an approximation, not a perfectly accurate answer.
Many (most?) debuggers actually look at the contents of the entire register. On an x86, that's actually an 80-bit number. The x86 floating point unit will normally be adjusted to carry out calculations to 64-bit precision -- but internally, it actually uses a couple of "guard bits", which basically means internally it does the calculation with a few extra bits of precision so it can round the last one correctly. When the debugger looks at the whole register, it'll usually find at least one extra digit that's reasonably accurate -- though since that digit won't have any guard bits, it may not be rounded correctly.
It is because it's being converted from a binary representation. Just because it has printed all those decimal digits doesn't mean it can represent all decimal values to that precision. Take, for example, this in Python:
>>> 0.14285714285714285
0.14285714285714285
>>> 0.14285714285714286
0.14285714285714285
Notice how I changed the last digit, but it printed out the same number anyway.
In most contexts where double values are used, calculations will have a certain amount of uncertainty. The difference between 1.33333333333333300 and 1.33333333333333399 may be less than the amount of uncertainty that exists in the calculations. Displaying the value of "2/3 + 2/3" as "1.33333333333333" is apt to be more meaningful than displaying it as "1.33333333333333319", since the latter display implies a level of precision that doesn't really exist.
In the debugger, however, it is important to uniquely indicate the value held by a variable, including essentially-meaningless bits of precision. It would be very confusing if a debugger displayed two variables as holding the value "1.333333333333333" when one of them actually held 1.33333333333333319 and the other held 1.33333333333333294 (meaning that, while they looked the same, they weren't equal). The extra precision shown by the debugger isn't apt to represent a numerically-correct calculation result, but indicates how the code will interpret the values held by the variables.
Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2 -49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
def float_to_bin_parts(number, bits=64):
if bits == 32: # single precision
int_pack = 'I'
float_pack = 'f'
exponent_bits = 8
mantissa_bits = 23
exponent_bias = 127
elif bits == 64: # double precision. all python floats are this
int_pack = 'Q'
float_pack = 'd'
exponent_bits = 11
mantissa_bits = 52
exponent_bias = 1023
else:
raise ValueError, 'bits argument must be 32 or 64'
bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2(# of bits) - 1 - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413x1023
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 252 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2(# of bits) - 1 - 1 to get the true exponent
Mantissa (last component): Divide by 2(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 1011
Which we can then convert from binary to decimal:
1.1499999999999999 x 23 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 1011
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 1011-110100
Convert to decimal:
5179139571476070 x 23-52
Subtract the exponent:
5179139571476070 x 2-49
Turn negative exponent into division:
5179139571476070 / 249
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 1011
Shift the decimal point:
10011 x 1011-100
Subtract the exponent:
10011 x 10-1
Binary to decimal:
19 x 2-1
Negative exponent to division:
19 / 21
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)10 = 0.23
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)10 = 0.510 = 0.12 = 0.1111...3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2b5c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2b)
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.
Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.
Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
I've made a BOMDAS calculator in C++ that uses doubles. Whenever I input an expression like
1000000000000000000000*1000000000000000000000
I get a result like 1000000000000000000004341624882808674582528.000000. I suspect it has something to do with floating-point numbers.
Floating point number represent values with a fixed size representation. A double can represent 16 decimal digits in form where the decimal digits can be restored (internally, it normally stores the value using base 2 which means that it can accurately represent most fractional decimal values). If the number of digits is exceeded, the value will be rounded appropriately. Of course, the upshot is that you won't necessarily get back the digits you're hoping for: if you ask for more then 16 decimal digits either explicitly or implicitly (e.g. by setting the format to std::ios_base::fixed with numbers which are bigger than 1e16) the formatting will conjure up more digits: it will accurately represent the internally held binary values which may produce up to, I think, 54 non-zero digits.
If you want to compute with large values accurately, you'll need some variable sized representation. Since your values are integers a big integer representation might work. These will typically be a lot slower to compute with than double.
A double stores 53 bits of precision. This is about 15 decimal digits. Your problem is that a double cannot store the number of digits you are trying to store. Digits after the 15th decimal digit will not be accurate.
That's not an error. It's exactly because of how floating-point types are represented, as the result is precise to double precision.
Floating-point types in computers are written in the form (-1)sign * mantissa * 2exp so they only have broader ranges, not infinite precision. They're only accurate to the mantissa precision, and the result after every operation will be rounded as such. The double type is most commonly implemented as IEEE-754 64-bit double precision with 53 bits of mantissa so it can be correct to log(253) ≈ 15.955 decimal digits. Doing 1e21*1e21 produces 1e42 which when rounding to the closest value in double precision gives the value that you saw. If you round that to 16 digits it's exactly the same as 1e42.
If you need more range, use double or long double. If you only works with integer then int64_t (or __int128 with gcc and many other compilers on 64-bit platforms) has a much larger precision (64/128 bits compared to 53 bits). If you need even more precision, use an arbitrary-precision arithmetic library instead such as GMP
I've written a program to estimate pi using the Gregory Leibniz formula, however, it will not calculate to 18 decimal points. It will only calculate up to 5 decimal points. Any suggestions?
Use
cout.precision(50);
To increase the precision of the printed output. Here 50 is the number of decimal digits in your output.
The default printing precision for printf is 6
Precision specifies the exact number of digits to appear after the decimal point character. The default precision is 6
Similarly when std::cout was introduced into C++ the same default value was used
Manages the precision (i.e. how many digits are generated) of floating point output performed by std::num_put::do_put.
Returns the current precision.
Sets the precision to the given one. Returns the previous precision.
The default precision, as established by std::basic_ios::init, is 6.
https://en.cppreference.com/w/cpp/io/ios_base/precision
Therefore regardless of how precise the type is, only 6 fractional digits will be printed out. To get more digits you'll need to use std::setprecision or std::cout.precision
However calling std::cout.precision only affects the number of decimal digits in the output, not the number's real precision. Any digits over that type's precision would be just garbage
Most modern systems use IEEE-754 where float is single-precision with 23 bits of mantissa and double maps to double-precision with 52 bits of mantissa. As a result they're accurate to ~6-7 digits and ~15-16 decimal digits respectively. That means they can't represent numbers to 18 decimal points as you expected
On some platforms there may be some extended precision types so you'll be able to store numbers more precisely. For example long double on most compilers on x86 has 64 bits of precision and can represent ~18 significant digits, but it's not 18 digits after the decimal point. Higher precision can be obtained with quadruple-precision on some compilers. To achieve even more precision, the only way is to use a big number library or write one for your own.
equivalent to log10(2^24) ≈ 7.225 decimal digits
Wikipedia
Precision: 7 digits
MSDN
6
std::numeric_limits<float>::digits10
Why numeric_limits return 6 here? Both Wikipedia and MSDN report that floats have 7 decimal digits of precision.
If in doubt, read the spec. The C++ standard says that digits10 is:
Number of base 10 digits that can be represented without change.
That's a little vague; fortunately, there's a footnote:
Equivalent to FLT_DIG, DBL_DIG, LDBL_DIG
Those are defined in the C standard; let's look it up there:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits.
So std::numeric_limits<float>::digits10 is the number of decimal digits such that any floating-point number with that many digits is unchanged if you convert it to a float and back to decimal.
As you say, floats have about 7 digits of decimal precision, but the error in representation of both fixed-width decimals and floats is not uniformly logarithmic. The relative error in rounding a number of the form 1.xxx.. to a fixed number of decimal places is nearly ten times larger than the relative error of rounding 9.xxx.. to the same number of decimal places. Similarly, depending on where a value falls in a binade, the relative error in rounding it to 24 binary digits can vary by a factor of nearly two.
The upshot of this is that not all seven-digit decimals survive the round trip to float and back, but all six digit decimals do. Hence, std::numeric_limits<float>::digits10 is 6.
There are not that many six and seven digit decimals with exponents in a valid range for the float type; you can pretty easily write a program to exhaustively test all of them if you're still not convinced.
It's really only 23 bits in the mantissa (there's an implied 1, so it's effectively 24 bits, but the 1 obviously does not vary). This gives 6.923689900271567 decimal digits of precision, which is not quite 7.