The IEEE 754 double precision floating point format has a binary precision of 53 bits, which translates into log10(2^53) ~ 16 significant decimal digits.
If the double precision format is used to store a floating point number in a 64 bit-long word in the memory, with 52 bits for the significand and 1 hidden bit, but a larger precision is used to output the number to the screen, what data is actually read from the memory and written to the output?
How can it even be read, when the total length of the word is 64 bit, does the read-from-memory operation on the machine just simply read more bits and interprets them as an addition to the significand of the number?
For example, take the number 0.1. It does not have an exact binary floating point representation regardless of the precision used, because it has an indefinitely repeating binary floating point pattern in the significand.
If 0.1 is stored with the double precision, and printed to the screen with the precision >16 like this in the C++ language:
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
double x = 0.1;
cout << setprecision(50) << "x= " << x << endl;
};
The output (on my machine at the point of execution), is:
x = 0.1000000000000000055511151231257827021181583404541
If the correct rounding is used with 2 guard bits and 1 sticky bits, can I trust the decimal values given by the first three non-zero binary floating point digits in the error 5.551115123125783e-17?
Every binary fraction is exactly equal to some decimal fraction. If, as is usually the case, double is a binary floating point type, each double number has an exactly equal decimal representation.
For what follows, I am assuming your system uses IEEE 754 64-bit binary floating point to represent double. That is not required by the standard, but is very common. The closest number to 0.1 in that format has exact value 0.1000000000000000055511151231257827021181583404541015625
Although this number has a lot of digits, it is exactly equal to 3602879701896397/255. Multiplying both numerator and denominator by 555 converts it to a decimal fraction, while increasing the number of digits in the numerator.
One common approach, consistent with the result in the question, is to use round-to-nearest to the number of digits required by the format. That will indeed give useful information about the rounding error on conversion of a string to double.
static_casting from a floating point to an integer simply strips the fractional point of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceif(foo))?
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M•be, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000-x to be represented, where x is some positive value less than 1, e must be negative (because M•be for a non-negative e is an integer). If so, then M•b0 is an integer larger than M•be, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'•b0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609, and 8,388,610. Since they are equally far apart, the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifying small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
Floating point numbers are represented by 3 integers, cbq where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about a range here. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by it's mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-103,845,937,170,696,552,570,609,926,584,40,191, 103,845,937,170,696,552,570,609,926,584,40,191]
[source]
c++ provides digits as a method of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6 This is a little confusing, it's cause of the bias introduced here; logically q = -6 but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 will represent 1.3
b = 10 again the above rules are really only required for base-2 but I've shown them as they would apply to base-10 for the purpose of explaination
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here, is the range over which ceil will have an effect. After the exponent of a floating point is larger than numeric_limits<T>::digits continuing to increase it only introduces trailing zeros to the resulting number, thus calling ceil when q is greater than or equal to numeric_limits<T>::digits - 2LL. And since we know the MSB of c will be used in the number this means that c must be smaller than (1LL << numeric_limits<T>::digits - 1LL) - 1LL Thus for ceil to have an effect on the traditional binary IEEE-754 floating point:
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095
For what I'm learning, once I convert a floating point value to a decimal one, the "significant digits" I need are a fixed number (17 for double, for example). 17 totals: before and after decimal separator.
So for example this code:
typedef std::numeric_limits<double> dbl;
int main()
{
std::cout.precision(dbl::max_digits10);
//std::cout << std::fixed;
double value1 = 1.2345678912345678912345;
double value2 = 123.45678912345678912345;
double value3 = 123456789123.45678912345;
std::cout << value1 << std::endl;
std::cout << value2 << std::endl;
std::cout << value3 << std::endl;
}
will correctly "show me" 17 values:
1.2345678912345679
123.45678912345679
123456789123.45679
But if I increase precision for the cout (i.e. std::cout.precision(100)), I can see there are other numbers after the 17 range:
1.2345678912345678934769921397673897445201873779296875
123.456789123456786683163954876363277435302734375
123456789123.456787109375
Why should ignore them? They are stored within the variables/double as well, so they will affect the whole "math" later (division, multiplication, sum, and so on).
What does it means "significant digits"? There is other...
Can you help me to understand what “significant digits” means in floating point math?
With FP numbers, like mathematical real numbers, significant digits is the leading digits of a value that do not begin with 0 and then, depending on context, to 1) the decimal point, 2) the last non-zero digit, or 3) the last printed digit.
123. // 3 significant decimal digits
123.125 // 6 significant decimal digits
0.0078125 // 5 significant decimal digits
0x0.00123p45 // 3 significant hexadecimal digits
123000.0 // 3, 6, or 7 significant decimal digits depending on context
When concerned about decimal significant digits and FP types like double. the issue is often "How many decimal significant digits are needed or of concern?"
Nearly all C FP implementations use a binary encoding such that all finite FP are exact sums of power of 2. Each finite FP is exact. Common encoding affords most double to have 53 binary digits is it significand - so 53 significant binary digits. How this appears as a decimal is often the source of confusion.
// Example 0.1 is not an exact sum of powers of 2 so a nearby value is used.
double x = 0.1;
// x takes on the exact value of
// 0.1000000000000000055511151231257827021181583404541015625
// aka 0x1.999999999999ap-4
// aka base2: 0.000110011001100110011001100110011001100110011001100110011010
// The preceding and subsequent doubles
// 0.09999999999999999167332731531132594682276248931884765625
// 0.10000000000000001942890293094023945741355419158935546875
// 123456789012345678901234567890123456789012345678901234567890
Looking at above, one could say x has over 50 decimal significant digits. Yet the value matches the intended 0.1 to 16 decimal significant digits. Or yet since the preceding and subsequent possible double values differ in the 17 place, one could say x has 17 decimal significant digits.
What does it means "significant digits"?
Various meanings of significant digits exist, but for C, 2 common ones are:
The number of decimal significant digits that a textual value to double converts as expected for all double. This is typically 15. C specifies this as DBL_DIG and must be at least 10.
The number of decimal significant digits that a textual value of double needs to be printed to distinguish from another double. This is typically 17. C specifies this as DBL_DECIMAL_DIG and must be at least 10.
Why should ignore them?
It depends of coding goals. Rarely are all digits of the exact value needed. (DBL_TRUE_MIN might have 752 od them.) For most applications, DBL_DECIMAL_DIG is enough. In select apps, DBL_DIG will do. So usually, ignoring digits past 17 does not cause problems.
Keep in mind that floating-point values are not real numbers. There are gaps between the values, and all those extra digits, while meaningful for real numbers, don’t reflect any difference in the floating-point value. When you convert a floating-point value to text, having std::numeric_limits<...>::max_digits10 digits ensures that you can convert the text string back to floating-point and get the original value. The extra digits don’t affect the result.
The extra digits that you see when you ask for more digits are the result of the conversion algorithm trying to do what you asked. The algorithm typically just keeps extracting digits until it reaches the desired precision; it could be written to start outputting zeros after it’s written max_digits10 digits, but that’s an additional complication that nobody bothers with. It wouldn’t really be helpful.
just to add to Pete Becker's answer, I think you're confusing the problem of finding the exact decimal representation of a binary mantissa, with the problem of finding some decimal representation uniquely representing that binary mantissa ( given some fixed rounding scheme ).
Now, regarding the first problem, you always need a finite number of decimal digits to exactly represent a binary mantissa ( because 2 divides 10 ).
For example, you need 18 decimal digits to exactly represent the binary 1.0000000000000001, being 1.00000762939453125 in decimal.
but you need just 17 digits to represent it uniquely as 1.0000076293945312 because no other number having exact value 1.0000076293945312xyz... where 0<=x<5 can exist as a double ( more precisely, the next and prior exactly representable values being 1.0000076293945314720446049250313080847263336181640625 and 1.0000076293945310279553950749686919152736663818359375 ).
Of course, this does not mean that given some decimal number you can ignore all digits past the 17th; it just means that if you apply the same rounding scheme used to produce the decimal at the 17th position and assign it back to a double you'll get the same original double.
I used to think that float can use max 6 digit and double 15, after comma. But if I print limits here:
typedef std::numeric_limits<float> fl;
typedef std::numeric_limits<double> dbl;
int main()
{
std::cout << fl::max_digits10 << std::endl;
std::cout << dbl::max_digits10 << std::endl;
}
It prints float 9 and double 17?
You're confusing digits10 and max_digits10.
If digits10 is 6, then any number with six decimal digits can be converted to the floating point type, and back, and when rounded back to six decimal digits, produces the original value.
If max_digits10 is 9, then there exist at least two floating point numbers that when converted to decimal produce the same initial 8 decimal digits.
digits10 is the number you're looking for, based on your description. It's about converting from decimal to binary floating point back to decimal.
max_digits10 is a number about converting from binary floating point to decimal back to binary floating point.
From cppreference:
Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.
For example (I am using http://www.exploringbinary.com/floating-point-converter/ to facilitate the conversion) and double as the precision format:
1.1e308 => 109999999999999997216016380169010472601796114571365898835589230322558260940308155816455878138416026219051443651421887588487855623732463609216261733330773329156055234383563489264255892767376061912596780024055526930962873899746391708729279405123637426157351830292874541601579169431016577315555383826285225574400
Using 16 significant digits:
1.099999999999999e308 => 109999999999999897424000903433019889783160462729437595463026208549681185812946033955861284690212736971153169019636833121365513414107701410594362313651090292197465320141992473263972245213092236035710707805906167798295036672550192042188756649080117981714588407890666666245533825643214495197630622309084729180160
Using 17 significant digits:
1.0999999999999999e308 => 109999999999999997216016380169010472601796114571365898835589230322558260940308155816455878138416026219051443651421887588487855623732463609216261733330773329156055234383563489264255892767376061912596780024055526930962873899746391708729279405123637426157351830292874541601579169431016577315555383826285225574400
which is the same as the original
More than 17 significant digits:
1.09999999999999995555e308 => 109999999999999997216016380169010472601796114571365898835589230322558260940308155816455878138416026219051443651421887588487855623732463609216261733330773329156055234383563489264255892767376061912596780024055526930962873899746391708729279405123637426157351830292874541601579169431016577315555383826285225574400
Continue to be the same as the original.
There isn't an exact correspondence between decimal digits and binary digits.
IEEE 754 single precision uses 23 bits plus 1 for the implicit leading 1. Double precision uses 52+1 bits.
To get the equivalent decimal precision, use
log10(2^binary_digits) = binary_digits*log10(2)
For single precision this is
24*log10(2) = 7.22
and for double precision
53*log10(2) = 15.95
See here and also the Wikipedia page which I don't find to be particularly concise.
I'm seeing some error when simply assigning a floating point value which contains only 4 significant figures. I wrote a short program to debug and I don't understand what the problem is. After verifying the limits of a float on my platform is seems like there shouldn't be any error. What's causing this?
#include <stdlib.h>
#include <stdio.h>
#include <limits>
#include <iostream>
int main(){
printf("float size: %lu\n", sizeof(float));
printf("float max: %e\n", std::numeric_limits<float>::max());
printf("float significant figures: %i\n", std::numeric_limits<float>::digits10);
float a = 760.5e6;
printf("%.9f\n", a);
std::cout.precision(9);
std::cout << a << std::endl;
double b = 760.5e6;
printf("%.9f\n", b);
std::cout << b << std::endl;
return 0;
}
The output:
float size: 4
float max: 3.402823e+38
float significant figures: 6
760499968.000000000
760499968
760500000.000000000
760500000
A float has 24 bits of precision, which is roughly equivalent to 7 decimal digits. A double has 53 bits of precision, which is roughly equivalent to 16 decimal digits.
As mentioned in the comments, 760.5e6 is not exactly representable by float; however, it is exactly representable by double. This is why the printed results for double are exact, and those from float are not.
It is legal to request printing of more decimal digits than are representable by your floating point number, as you did. The results you report are not an error -- they are simply the result of the decimal printing algorithm doing the best it can.
The stored number in your float is 760499968. This is expected behavior for an IEEE 754 binary32 floating point numbers, as floats usually are.
IEEE 754 floating point numbers are stored in three parts: a sign bit, an exponent, and a mantissa. Since all these values are stored as bits the resulting number is sort of the binary equivalent of scientific notation. The mantissa bits are one less than the number of binary digits allowed as significant figures in the binary scientific notation.
Just like with decimal scientific numbers, if the exponent exceeds the significant figures, you're going to lose integer precision.
The analogy only extends so far: the mantissa is a modification of the coefficient found in the decimal scientific notation you might be familiar with, and there are certain bit patterns that have special meaning in the standard.
The ultimate result of this storage mechanism is that the integer 760500000 cannot be exactly represented by IEEE 754 binary32 with its 23-bit mantissa: it loses integer-level precision after the integer at 2^(mantissa_bits + 1), which is 16777217 for 23-bit mantissa floats. The closest integers to 76050000 that can be represented by a float are 760499968 and 76050032, the former of which is chosen for representation due to the round-ties-to-even rule, and printing the integer at a greater precision than the floating point number can represent will naturally result in apparent inaccuracies.
A double, which has 64 bit size in your case, naturally has more precision than a float, which is 32 bit in your case. Therefore, this is an expected result
Specifications do not enforce that any type should correctly represent all numbers less than std::numeric_limits::max() with all their precision.
The number you display is off only in the 8th digit and after. That is well within the 6 digits of accuracy you are guaranteed for a float. If you only printed 6 digits, the output would get rounded and you'd see the value you expect.
printf("%0.6g\n", a);
See http://ideone.com/ZiHYuT