Floating point resolution seems more limited than it ought to be - c++

I'm seeing some error when simply assigning a floating point value which contains only 4 significant figures. I wrote a short program to debug and I don't understand what the problem is. After verifying the limits of a float on my platform, it seems like there shouldn't be any error. What's causing this?
#include <stdlib.h>
#include <stdio.h>
#include <limits>
#include <iostream>

int main() {
    printf("float size: %zu\n", sizeof(float));
    printf("float max: %e\n", std::numeric_limits<float>::max());
    printf("float significant figures: %i\n", std::numeric_limits<float>::digits10);

    float a = 760.5e6;
    printf("%.9f\n", a);
    std::cout.precision(9);
    std::cout << a << std::endl;

    double b = 760.5e6;
    printf("%.9f\n", b);
    std::cout << b << std::endl;
    return 0;
}
The output:
float size: 4
float max: 3.402823e+38
float significant figures: 6
760499968.000000000
760499968
760500000.000000000
760500000

A float has 24 bits of precision, which is roughly equivalent to 7 decimal digits. A double has 53 bits of precision, which is roughly equivalent to 16 decimal digits.
As mentioned in the comments, 760.5e6 is not exactly representable by float; however, it is exactly representable by double. This is why the printed results for double are exact, and those from float are not.
It is legal to request printing of more decimal digits than are representable by your floating point number, as you did. The results you report are not an error -- they are simply the result of the decimal printing algorithm doing the best it can.
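To see this concretely, here is a minimal check of my own (not part of the original answer) comparing the float and double versions of the same literal:

#include <cstdio>

int main() {
    float  a = 760.5e6f;   // rounded to the nearest representable float
    double b = 760.5e6;    // exactly representable as a double
    std::printf("%d\n", (double)a == b);   // prints 0: the float lost the value
    std::printf("%.1f\n", (double)a);      // 760499968.0
    std::printf("%.1f\n", b);              // 760500000.0
}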

The stored number in your float is 760499968. This is expected behavior for an IEEE 754 binary32 floating point number, which is what a float usually is.
IEEE 754 floating point numbers are stored in three parts: a sign bit, an exponent, and a mantissa. Since all these values are stored as bits, the resulting number is essentially the binary equivalent of scientific notation. The stored mantissa has one bit fewer than the number of significant binary digits, because the leading 1 is implicit.
Just like with decimal scientific numbers, if the exponent exceeds the significant figures, you're going to lose integer precision.
The analogy only extends so far: the mantissa is a modification of the coefficient found in the decimal scientific notation you might be familiar with, and there are certain bit patterns that have special meaning in the standard.
The ultimate result of this storage mechanism is that the integer 760500000 cannot be exactly represented by IEEE 754 binary32 with its 23-bit stored mantissa: integer-level precision is lost above 2^(mantissa_bits + 1) = 2^24 = 16777216 (the first unrepresentable integer is 16777217). The closest integers to 760500000 that a float can represent are 760499968 and 760500032; the former is chosen because 760500000 lies exactly halfway between them and the round-ties-to-even rule applies. Printing that stored value at a greater precision than the type can represent then naturally looks like an inaccuracy.
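As an illustrative sketch (mine, not from the original answer), std::nextafterf can be used to list the two neighbours directly:

#include <cmath>
#include <cstdio>

int main() {
    float a = 760.5e6f;
    // The two adjacent representable values around 760500000:
    std::printf("stored value:  %.1f\n", a);                           // 760499968.0
    std::printf("next float up: %.1f\n", std::nextafterf(a, 1e30f));   // 760500032.0
    // 760500000 lies exactly halfway between them, and round-ties-to-even
    // picks 760499968 because its last significand bit is 0.
}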

A double, which is 64 bits in your case, naturally has more precision than a float, which is 32 bits in your case, so this is an expected result.
The specification does not require that a type represent every number smaller than std::numeric_limits<T>::max() with full precision.

The number you display is off only in the 8th digit and after. That is well within the 6 digits of accuracy you are guaranteed for a float. If you only printed 6 digits, the output would get rounded and you'd see the value you expect.
printf("%0.6g\n", a);
See http://ideone.com/ZiHYuT

Related

What data is written out when a higher precision is used to display a number than the one supported by the format?

The IEEE 754 double precision floating point format has a binary precision of 53 bits, which translates into log10(2^53) ~ 16 significant decimal digits.
If the double precision format is used to store a floating point number in a 64 bit-long word in the memory, with 52 bits for the significand and 1 hidden bit, but a larger precision is used to output the number to the screen, what data is actually read from the memory and written to the output?
How can it even be read when the total length of the word is 64 bits? Does the read-from-memory operation on the machine simply read more bits and interpret them as an addition to the significand of the number?
For example, take the number 0.1. It does not have an exact binary floating point representation regardless of the precision used, because its significand has an infinitely repeating binary pattern.
If 0.1 is stored with the double precision, and printed to the screen with the precision >16 like this in the C++ language:
#include <iostream>
#include <iomanip>

using namespace std;

int main()
{
    double x = 0.1;
    cout << setprecision(50) << "x = " << x << endl;
}
The output (on my machine at the point of execution), is:
x = 0.1000000000000000055511151231257827021181583404541
If correct rounding is used with 2 guard bits and 1 sticky bit, can I trust the decimal values given by the first three non-zero digits in the error 5.551115123125783e-17?
Every binary fraction is exactly equal to some decimal fraction. If, as is usually the case, double is a binary floating point type, each double number has an exactly equal decimal representation.
For what follows, I am assuming your system uses IEEE 754 64-bit binary floating point to represent double. That is not required by the standard, but is very common. The closest number to 0.1 in that format has exact value 0.1000000000000000055511151231257827021181583404541015625
Although this number has a lot of digits, it is exactly equal to 3602879701896397/2^55. Multiplying both numerator and denominator by 5^55 converts it to a decimal fraction, while increasing the number of digits in the numerator.
One common approach, consistent with the result in the question, is to use round-to-nearest to the number of digits required by the format. That will indeed give useful information about the rounding error on conversion of a string to double.
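A quick way to see that exact decimal value is a sketch like the one below (mine, assuming a C library such as glibc that prints the full expansion rather than rounding early):

#include <cstdio>

int main() {
    // The double nearest to 0.1 is exactly 3602879701896397 / 2^55, which as a
    // decimal fraction has 55 digits after the decimal point.
    std::printf("%.55f\n", 0.1);
    // expected: 0.1000000000000000055511151231257827021181583404541015625
}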

C++ long double biggest "safe" whole number

IEEE 754 numbers are not uniformly spaced, the larger the numbers, the bigger the difference between two consecutive representable numbers.
The size of my long double in C++ is 16 bytes. So what is the biggest "safe" whole number n that can be represented with this type?
I call n "safe" if n - 1 can be represented but n + 1 cannot.
The IEEE 754 standard defines the parameters of various numerical types:
https://en.wikipedia.org/wiki/IEEE_754
For long doubles that use the 128-bit binary128 format, the mantissa (the part of the floating point number that contains the significant digits) is 113 bits, so integers can be represented with full precision up to 2^113. Larger floating point numbers can be represented, but you start losing integer precision beyond that because the low-order digits simply get rounded. Note that a 16-byte long double is not necessarily binary128: on x86 it is commonly the 80-bit extended format padded to 16 bytes, with a 64-bit mantissa.
As I understand, you're asking for which is the largest contiguous precisely representable integer. It is exactly:
std::pow(std::numeric_limits<long double>::radix, std::numeric_limits<long double>::digits)
or expressed in math: radix^digits, where (quoted from cppreference)
The value of std::numeric_limits<T>::radix is the base of the number system used in the representation of the type. It is 2 for all binary numeric types, but it may be, for example, 10 for IEEE 754 decimal floating-point types ...
The value of std::numeric_limits<T>::digits is the number of digits in base-radix that can be represented by the type T without change. ... For floating-point types, this is the number of digits in the mantissa
In C++ you can get a lot of information using <limits>. For example:
#include <limits>
#include <iostream>
#include <iomanip>

int main() {
    auto p = std::numeric_limits<long double>::max_digits10;
    std::cout << "Max Long Double: "
              << std::setprecision(p)
              << std::setw(p + 7)
              << std::numeric_limits<long double>::max()
              << std::endl;
}
Prints:
Max Long Double: 1.18973149535723176502e+4932
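To check the radix^digits claim directly, here is a small sketch of my own, using the expression from the answer above and assuming the default round-to-nearest-even mode; it shows that n - 1 is still distinct from n while n + 1 rounds back down to n:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    using L = std::numeric_limits<long double>;
    long double n = std::pow((long double)L::radix, L::digits);  // radix^digits
    std::cout.precision(L::max_digits10);
    std::cout << "n         = " << n << '\n';
    std::cout << "n - 1 != n: " << (n - 1.0L != n) << '\n';  // 1: n - 1 is representable
    std::cout << "n + 1 == n: " << (n + 1.0L == n) << '\n';  // 1: n + 1 rounds back to n
}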

Shouldn't float have 6 and double 15 available digits?

I used to think that a float can hold at most 6 digits and a double 15 after the decimal point. But if I print the limits here:
#include <iostream>
#include <limits>

typedef std::numeric_limits<float> fl;
typedef std::numeric_limits<double> dbl;

int main()
{
    std::cout << fl::max_digits10 << std::endl;
    std::cout << dbl::max_digits10 << std::endl;
}
It prints 9 for float and 17 for double?
You're confusing digits10 and max_digits10.
If digits10 is 6, then any number with six decimal digits can be converted to the floating point type, and back, and when rounded back to six decimal digits, produces the original value.
If max_digits10 is 9, then there exist at least two floating point numbers that when converted to decimal produce the same initial 8 decimal digits.
digits10 is the number you're looking for, based on your description. It's about converting from decimal to binary floating point back to decimal.
max_digits10 is a number about converting from binary floating point to decimal back to binary floating point.
From cppreference:
Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.
For example, using double as the precision format (I am using http://www.exploringbinary.com/floating-point-converter/ to facilitate the conversions):
1.1e308 => 109999999999999997216016380169010472601796114571365898835589230322558260940308155816455878138416026219051443651421887588487855623732463609216261733330773329156055234383563489264255892767376061912596780024055526930962873899746391708729279405123637426157351830292874541601579169431016577315555383826285225574400
Using 16 significant digits:
1.099999999999999e308 => 109999999999999897424000903433019889783160462729437595463026208549681185812946033955861284690212736971153169019636833121365513414107701410594362313651090292197465320141992473263972245213092236035710707805906167798295036672550192042188756649080117981714588407890666666245533825643214495197630622309084729180160
Using 17 significant digits:
1.0999999999999999e308 => 109999999999999997216016380169010472601796114571365898835589230322558260940308155816455878138416026219051443651421887588487855623732463609216261733330773329156055234383563489264255892767376061912596780024055526930962873899746391708729279405123637426157351830292874541601579169431016577315555383826285225574400
which is the same as the original
More than 17 significant digits:
1.09999999999999995555e308 => 109999999999999997216016380169010472601796114571365898835589230322558260940308155816455878138416026219051443651421887588487855623732463609216261733330773329156055234383563489264255892767376061912596780024055526930962873899746391708729279405123637426157351830292874541601579169431016577315555383826285225574400
Continue to be the same as the original.
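The difference between the two constants can also be seen in code. The following sketch is my own, using a float value taken from cppreference's digits10 discussion: it round-trips a float through a decimal string, and 6 digits are not always enough to get the same float back, while max_digits10 digits always are.

#include <iostream>
#include <limits>
#include <sstream>

// Write x with the given number of significant digits, then parse it back.
bool round_trips(float x, int precision) {
    std::ostringstream out;
    out.precision(precision);
    out << x;
    std::istringstream in(out.str());
    float back = 0.0f;
    in >> back;
    return back == x;
}

int main() {
    float x = 8589973504.0f;  // the float nearest to 8.589973e9
    std::cout << round_trips(x, std::numeric_limits<float>::digits10) << '\n';      // 0
    std::cout << round_trips(x, std::numeric_limits<float>::max_digits10) << '\n';  // 1
}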
There isn't an exact correspondence between decimal digits and binary digits.
IEEE 754 single precision uses 23 bits plus 1 for the implicit leading 1. Double precision uses 52+1 bits.
To get the equivalent decimal precision, use
log10(2^binary_digits) = binary_digits*log10(2)
For single precision this is
24*log10(2) = 7.22
and for double precision
53*log10(2) = 15.95
See also the Wikipedia page, which I don't find to be particularly concise.

Why do compilers fix the digits of floating point number to 6?

According to The C++ Programming Language - 4th, section 6.2.5:
There are three floating-point types: float (single-precision), double (double-precision), and long double (extended-precision)
Refer to: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to the left of the binary point) with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format but the total precision is 24 bits (equivalent to log10(2^24) ≈ 7.225 decimal digits).
→ So the maximum number of decimal digits for a floating point number in the binary32 interchange format (a computer number format that occupies 4 bytes (32 bits) in computer memory) should be 7.
When I test on different compilers (like GCC, VC compiler)
→ It always outputs 6 as the value.
Take a look into float.h of each compiler
→ I found that 6 is fixed.
Question:
Do you know why there is a difference here (between the theoretical value, 7, and the actual value, 6)?
It sounds like 7 would be more reasonable, because when I test using the code below a 7-digit value is still valid, while an 8-digit value is not.
Why don't compilers check the interchange format when deciding the number of digits represented in floating-point, instead of using a fixed value?
Code:
#include <iostream>
#include <limits>

using namespace std;

int main()
{
    cout << numeric_limits<float>::digits10 << endl;
    float f = -9999999;
    cout.precision(10);
    cout << f << endl;
}
You're not reading the documentation.
std::numeric_limits<float>::digits10 is 6:
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
std::numeric_limits<float>::max_digits10 is 9:
The value of std::numeric_limits<T>::max_digits10 is the number of base-10 digits that are necessary to uniquely represent all distinct values of the type T, such as necessary for serialization/deserialization to text. This constant is meaningful for all floating-point types.
Unlike most mathematical operations, the conversion of a floating-point value to text and back is exact as long as at least max_digits10 were used (9 for float, 17 for double): it is guaranteed to produce the same floating-point value, even though the intermediate text representation is not exact. It may take over a hundred decimal digits to represent the precise value of a float in decimal notation.
std::numeric_limits<float>::digits10 equates to FLT_DIG, which is defined by the C standard:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,

    p * log10(b)              if b is a power of 10
    floor((p - 1) * log10(b)) otherwise
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
The reason for the value 6 (and not 7), is due to rounding errors - not all floating point values with 7 decimal digits can be losslessly represented by a 32-bit float. Rounding errors are limited to 1 bit though, so the FLT_DIG value was calculated based on 23 bits (instead of the full 24) :
23 * log10(2) = 6.92
which is rounded down to 6.
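The counterexample cited above can be reproduced directly. This is my own sketch; %.7g asks for 7 significant digits, one more than FLT_DIG guarantees:

#include <cstdio>

int main() {
    // 8.589973e9 has 7 significant digits, but the nearest float is 8589973504,
    // which prints back as 8.589974e+09 -- the 7th digit did not survive.
    float f = 8.589973e9f;
    std::printf("%.7g\n", f);   // 8.589974e+09
    // With 6 significant digits (FLT_DIG) the original digits are preserved:
    std::printf("%.6g\n", f);   // 8.58997e+09
}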

printing the largest single-precision floating point number

I have come seeking knowledge.
I am trying to understand floating point numbers.
I am trying to figure out why, when I print the largest floating point number, it does not print correctly.
The largest value should be (2 − 2^−23) × 2^127:
1.99999988079071044921875 × (1.7014118346046923173168730371588e+38) = 3.4028234663852885981170418348451e+38
This should be the largest single-precision floating point number:
340282346638528859811704183484510000000.0
So,
float i = 340282346638528859811704183484510000000.0;
printf(TEXT, "Float %.38f", i);
Output: 340282346638528860000000000000000000000.0
Obviously the number is being rounded up, so I am trying to figure out just exactly what is going on.
My questions are:
The Wikipedia documentation states that 3.4028234663852885981170418348451e+38 is the largest number that can be represented in IEEE-754 single precision.
Is the number stored in the floating point register 0 11111110 11111111111111111111111, and is it just being displayed incorrectly?
If I write printf("Float %.38f\n", FLT_MAX);, I get the same answer.
Perhaps the computer I am using does not use IEEE-754?
I understand errors with calculations, but I don't understand why the number
340282346638528860000000000000000000000.0 is the largest floating point number that can be accurately represented.
Maybe the Mantissa * Exponent is causing calculation errors? If that is true, then 340282346638528860000000000000000000000.0 would be the largest number that can be faithfully represented without calculation errors. I guess that would make sense. Just need a blessing.
Thanks,
Looks like the culprit is printf() (I guess because float is implicitly converted to double when passed to it):
#include <iostream>
#include <limits>

int main()
{
    std::cout.precision(38);
    std::cout << std::numeric_limits<float>::max() << std::endl;
}
Output is:
3.4028234663852885981170418348451692544e+38
With float as binary32, the largest finite float is
340282346638528859811704183484516925440.0
printf("%.1f", FLT_MAX) is not obliged to print exactly to 38+ significant digits, so seeing output like the below is not unexpected.
340282346638528860000000000000000000000.0
printf() will print floating point accurately to DECIMAL_DIG significant digits. DECIMAL_DIG is at least 10. If more than DECIMAL_DIG significance is directed, a compliant printf() may round the result at some point. C11dr §7.21.6.1 6 goes into detail.
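As a small illustration (my sketch, assuming FLT_DECIMAL_DIG from C11/C++17 <cfloat> is available): 9 significant digits identify FLT_MAX uniquely, and anything beyond that merely spells out the exact stored value, which an implementation may or may not render in full.

#include <cfloat>
#include <cstdio>

int main() {
    // 9 significant digits (FLT_DECIMAL_DIG) are enough to round-trip any float:
    std::printf("%.*e\n", FLT_DECIMAL_DIG - 1, (double)FLT_MAX);  // 3.40282347e+38
    // Asking for far more digits exposes the exact binary value on some C libraries,
    // while others are permitted to round after DECIMAL_DIG significant digits.
    std::printf("%.*e\n", 45, (double)FLT_MAX);
}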