base10 and floating point representation [duplicate] - c++

This question already has answers here:
Is `double` guaranteed by C++03 to represent small integers exactly?
(2 answers)
Closed 9 years ago.
0.1 in base 10 has no exact counterpart in a double representation.
Is there any guarantee in C++ that a base-10 number with no fractional part, and with at most std::numeric_limits<double>::digits10 digits, has an exact double representation?

Per Dietmar Kühl’s answer, statements in the C++ standard imply that integers in [0, r^d) can be represented exactly, where r is std::numeric_limits<type>::radix and d is std::numeric_limits<type>::digits.
This in turn seems to imply that an integer with no more than std::numeric_limits<type>::digits10 base-10 digits can be represented exactly.
Aside: There are some problems with the C++ standard’s definition of std::numeric_limits<type>::digits10.
The standard says this is the “Number of base 10 digits that can be represented without change.” Is that supposed to be just simple base 10 digits, i.e., integers, or is it a statement about precision throughout the range of the format? A footnote, which is not normative, says this is equivalent to FLT_DIG, DBL_DIG, and LDBL_DIG, which are defined by way of the C standard. The C standard gives two definitions in one statement:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
and:
p log10 b if b is a power of 10
floor((p-1) log10 b) otherwise
I do not believe the former is a good definition. The latter gives us 7 for IEEE-754 32-bit binary floating-point, but 1.5e-45 is a floating-point number with 2 decimal digits, and rounding it to IEEE-754 32-bit binary floating-point and back gives 1.401…e-45 (because it is in the subnormal interval). So it is not true that any floating-point number with 7 decimal digits can be rounded to floating-point and back again without change to the 7 decimal digits.

I believe there is such a guarantee, by definition of std::numeric_limits<double>::digits10 (as long as it's implemented correctly, of course). A related discussion (see top answer): What is the meaning of numeric_limits<double>::digits10

Related

std::numeric_limits<float>::digits10 and precision after the dot

When std::numeric_limits<float>::digits10 returns 7, does it mean that I have 7 significant figures after the dot, or 7 including the part to the left of it?
For instance is it like:
1.123456
12.12345
or is it like
12.1234657
From cppreference
The value of std::numeric_limits<T>::digits10 is the number of base-10 digits that can be represented by the type T without change, that is, any number with this many decimal digits can be converted to a value of type T and back to decimal form, without change due to rounding or overflow. For base-radix types, it is the value of digits (digits-1 for floating-point types) multiplied by log10(radix) and rounded down.
And later
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
That means, e.g.
std::cout << std::numeric_limits<float>::digits10; // 6
std::cout << std::numeric_limits<float>::digits; // 24
The second one is the number of digits in the mantissa, while the first one is the number of decimal digits that can safely be represented across the aforementioned conversions.
TL;DR: it's your first case.

Do floats, doubles, and long doubles have a guaranteed minimum precision?

From my previous question "Is floating point precision mutable or invariant?" I received a response which said,
C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double
counterparts. DBL_DIG indicates the minimum relative decimal
precision. DBL_DECIMAL_DIG can be thought of as the maximum relative
decimal precision.
I looked these macros up. They are found in the header <cfloat>. From the cplusplus reference page they list macros for float, double, and long double.
Here are the macros for minimum precision values.
FLT_DIG 6 or greater
DBL_DIG 10 or greater
LDBL_DIG 10 or greater
If I took these macros at face value, I would assume that a float has a minimum decimal precision of 6, while a double and long double have a minimum decimal precision of 10. However, being a big boy, I know that some things may be too good to be true.
Therefore, I would like to know. Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
If not, why?
Note: Assume we are using programming language C++.
If std::numeric_limits<F>::is_iec559 is true, then the guarantees of the IEEE 754 standard apply to floating point type F.
Otherwise (and anyway), minimum permitted values of symbols such as DBL_DIG are specified by the C standard, which, indisputably for the library, “is incorporated into [the C++] International Standard by reference”, as quoted from C++11 §17.5.1.5/1.
Edit:
As noted by TC in a comment here,
<climits> and <cfloat> are normatively incorporated by §18.3.3 [c.limits]; the minimum values are specified in turn in §5.2.4.2.2 of the C standard
Unfortunately for the formal view, first of all that quote from C++11 is from section 17.5 which is only informative, not normative. And secondly, the wording in the C standard that the values specified there are minimums, is also in a section (the C99 standard's Annex E) that's informative, not normative. So while it can be regarded as an in-practice guarantee, it's not a formal guarantee.
One strong indication that the in-practice minimum precision for float is 6 decimal digits, that no implementation will give less:
output operations default to precision 6, and this is normative text.
Disclaimer: It may be that there is additional wording that provides guarantees that I didn't notice. Not very likely, but possible.
Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
I can't find any place in the standard that guarantees any minimal values for decimal precision.
The following quote from http://en.cppreference.com/w/cpp/types/numeric_limits/digits10 might be useful:
Example
An 8-bit binary type can represent any two-digit decimal number exactly, but 3-digit decimal numbers 256..999 cannot be represented. The value of digits10 for an 8-bit type is 2 (8 * std::log10(2) is 2.41)
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
However, the C standard specifies the minimum values that need to be supported.
From the C Standard:
5.2.4.2.2 Characteristics of floating types
...
9 The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater or equal in magnitude (absolute value) to those shown, with the same sign
...
-- number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
...
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
To be more specific: since my compiler uses the IEEE 754 standard, the precision is guaranteed to be 6 to 9 significant decimal digits for float and 15 to 17 significant decimal digits for double. Also, since a long double on my compiler is the same size as a double, it too has 15 to 17 significant decimal digits.
These ranges can be verified from IEEE 754 single-precision binary floating-point format: binary32 and IEEE 754 double-precision binary floating-point format: binary64 respectively.
The C++ Standard says nothing specific about limits on floating point types. You may interpret the incorporation of the C Standard "by reference" as you wish, but if you take the limits as specified there (N1570), section 5.2.4.2.2 subpoint 15:
EXAMPLE 1
The following describes an artificial floating-point representation that meets the minimum requirements of this International Standard, and the appropriate values in a header for type float:
FLT_RADIX 16
FLT_MANT_DIG 6
FLT_EPSILON 9.53674316E-07F
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -31
FLT_MIN 2.93873588E-39F
FLT_MIN_10_EXP -38
FLT_MAX_EXP +32
FLT_MAX 3.40282347E+38F
FLT_MAX_10_EXP +38
By this section, float, double and long double have these properties at the least*.

How many decimal places does the primitive float and double support? [duplicate]

This question already has answers here:
'float' vs. 'double' precision
(6 answers)
Closed 8 years ago.
I have read that double stores 15 digits and float stores 7 digits.
My question is, are these numbers the number of decimal places supported or total number of digits in a number?
If you are on an architecture using IEEE-754 floating point arithmetic (as in most architectures), then the type float corresponds to single precision, and the type double corresponds to double precision, as described in the standard.
Let's make some numbers:
Single precision:
32 bits represent the number, of which 24 bits form the mantissa: 23 are stored, and the leading bit is the "hidden 1", which is not stored. For a fixed exponent, a value is therefore held with a relative rounding error of at most 2^(-24), which is about 10^(-7.22). What this means is that in scientific notation (3.141592653589 E 25), only about 7.22 decimal digits are significant, which in practice means roughly 7 significant digits can be trusted.
Double precision:
64 bits represent the number, of which 53 bits form the mantissa. Following the same reasoning, expressing 2^(-53) as a power of 10 gives 10^(-15.95), which in turn means that roughly 15 significant digits can be trusted.
Those are the total number of "significant figures" if you will, counting from left to right, regardless of where the decimal point is. Beyond those numbers of digits, accuracy is not preserved.
The counts you listed are for the base 10 representation.
There are macros for the number of decimal places each type supports. The gcc docs explain what they are and also what they mean:
FLT_DIG
This is the number of decimal digits of precision for the float data type. Technically, if p and b are the precision and base (respectively) for the representation, then the decimal precision q is the maximum number of decimal digits such that any floating point number with q base 10 digits can be rounded to a floating point number with p base b digits and back again, without change to the q decimal digits.
The value of this macro is supposed to be at least 6, to satisfy ISO C.
DBL_DIG
LDBL_DIG
These are similar to FLT_DIG, but for the data types double and long double, respectively. The values of these macros are supposed to be at least 10.
On both gcc 4.9.2 and clang 3.5.0, FLT_DIG and DBL_DIG yield 6 and 15, respectively.
are these numbers the number of decimal places supported or total number of digits in a number?
They are the significant digits contained in every number (although you may not need all of them, but they're still there). The mantissa of the same type always contains the same number of bits, so every number consequently contains the same number of valid "digits" if you think in terms of decimal digits. You cannot store more digits than will fit into the mantissa.
The number of "supported" digits is, however, much larger, for example float will usually support up to 38 decimal digits and double will support up to 308 decimal digits, but most of these digits are not significant (that is, "unknown").
Although technically, this is wrong, since float and double do not have universally well-defined sizes like I presumed above (they're implementation-defined). Also, storage sizes are not necessarily the same as the sizes of intermediate results.
The C++ standard is very reluctant at precisely defining any fundamental type, leaving almost everything to the implementation. Floating point types are no exception:
3.9.1 / 8
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
Now of course all of this is not particularly helpful in practice.
In practice, floating point is (usually) IEEE 754 compliant, with float having a width of 32 bits and double having a width of 64 bits (as stored in memory, registers have higher precision on some notable mainstream architectures).
This is equivalent to 24 bits and 53 bits of mantissa, respectively, or 7 and 15 full decimals.

Is `double` guaranteed by C++03 to represent small integers exactly?

Does the C++03 standard guarantee that sufficiently small non-zero integers are represented exactly in double? If not, what about C++11? Note, I am not assuming IEEE compliance here.
I suspect that the answer is no, but I would love to be proved wrong.
When I say sufficiently small, I mean, bounded by some value that can be derived from the guarantees of C++03, and maybe even be calculated from values made available via std::numeric_limits<double>.
EDIT:
It is clear (now that I have checked) that std::numeric_limits<double>::digits is the same thing as DBL_MANT_DIG, and std::numeric_limits<double>::digits10 is the same thing as DBL_DIG, and this is true for both C++03 and C++11.
Furthermore, C++03 defers to C90, and C++11 defers to C99 with respect to the meaning of DBL_MANT_DIG and DBL_DIG.
Both C90 and C99 state that the minimum allowable value for DBL_DIG is 10, i.e., 10 decimal digits.
The question then is, what does that mean? Does it mean that integers of up to 10 decimal digits are guaranteed to be represented exactly in double?
In that case, what is then the purpose of DECIMAL_DIG in C99, and the following remark in C99 §5.2.4.2.2 / 12?
Conversion from (at least) double to decimal with DECIMAL_DIG digits and back
should be the identity function.
Here is what C99 §5.2.4.2.2 / 9 has to say about DBL_DIG:
Number of decimal digits, 'q', such that any floating-point
number with 'q' decimal digits can be rounded into a
floating-point number with 'p' radix 'b' digits and back again
without change to the q decimal digits,
{ p * log10(b) if 'b' is a power of 10
{
{ floor((p-1) * log10(b)) otherwise
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
I'll be happy if someone can help me unpack this.
Well, 3.9.1 [basic.fundamental] paragraph 8 states
... The value representation of floating-point types is implementation-defined. ...
At least, the implementation has to define what representation it uses.
On the other hand, std::numeric_limits<F> defines a couple of members which seem to imply that the representation is of the form significand × radix^exponent:
std::numeric_limits<F>::radix: the radix of the exponent
std::numeric_limits<F>::digits: the number of radix digits
I think these statements imply that you can represent integers in the range of 0 ... radix^digits - 1 exactly.
From the C standard, "Characteristics of floating types <float.h>", which is normative for C++, I would assume that you can combine FLT_RADIX and FLT_MANT_DIG into useful information: The number of digits in the mantissa and the base in which they are expressed.
For example, for a single-precision IEEE754 float, this would be respectively 2 and 24, so you should be able to store integers of absolute value up to 2^24.

Decimal precision of floats

Wikipedia: equivalent to log10(2^24) ≈ 7.225 decimal digits
MSDN: Precision: 7 digits
std::numeric_limits<float>::digits10: 6
Why does numeric_limits return 6 here? Both Wikipedia and MSDN report that floats have 7 decimal digits of precision.
If in doubt, read the spec. The C++ standard says that digits10 is:
Number of base 10 digits that can be represented without change.
That's a little vague; fortunately, there's a footnote:
Equivalent to FLT_DIG, DBL_DIG, LDBL_DIG
Those are defined in the C standard; let's look it up there:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits.
So std::numeric_limits<float>::digits10 is the number of decimal digits such that any floating-point number with that many digits is unchanged if you convert it to a float and back to decimal.
As you say, floats have about 7 digits of decimal precision, but the error in representation of both fixed-width decimals and floats is not uniformly logarithmic. The relative error in rounding a number of the form 1.xxx.. to a fixed number of decimal places is nearly ten times larger than the relative error of rounding 9.xxx.. to the same number of decimal places. Similarly, depending on where a value falls in a binade, the relative error in rounding it to 24 binary digits can vary by a factor of nearly two.
The upshot of this is that not all seven-digit decimals survive the round trip to float and back, but all six digit decimals do. Hence, std::numeric_limits<float>::digits10 is 6.
There are not that many six and seven digit decimals with exponents in a valid range for the float type; you can pretty easily write a program to exhaustively test all of them if you're still not convinced.
It's really only 23 bits in the mantissa (there's an implied 1, so it's effectively 24 bits, but the 1 obviously does not vary). This gives 6.923689900271567 decimal digits of precision, which is not quite 7.