Do floats, doubles, and long doubles have a guaranteed minimum precision? - c++

From my previous question "Is floating point precision mutable or invariant?" I received a response which said,
C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double
counterparts. DBL_DIG indicates the minimum relative decimal
precision. DBL_DECIMAL_DIG can be thought of as the maximum relative
decimal precision.
I looked these macros up. They are found in the header <cfloat>. The cplusplus.com reference page lists the macros for float, double, and long double.
Here are the macros for minimum precision values.
FLT_DIG 6 or greater
DBL_DIG 10 or greater
LDBL_DIG 10 or greater
If I took these macros at face value, I would assume that a float has a minimum decimal precision of 6, while a double and long double have a minimum decimal precision of 10. However, being a big boy, I know that some things may be too good to be true.
Therefore, I would like to know. Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
If not, why?
Note: Assume we are using programming language C++.

If std::numeric_limits<F>::is_iec559 is true, then the guarantees of the IEEE 754 standard apply to floating point type F.
Otherwise (and anyway), minimum permitted values of symbols such as DBL_DIG are specified by the C standard, which, indisputably for the library, “is incorporated into [the C++] International Standard by reference”, as quoted from C++11 §17.5.1.5/1.
Edit:
As noted by TC in a comment here,
“<climits> and <cfloat> are normatively incorporated by §18.3.3 [c.limits]; the minimum values are specified in turn in §5.2.4.2.2 of the C standard.”
Unfortunately for the formal view, first of all that quote from C++11 is from section 17.5, which is only informative, not normative. And secondly, the wording in the C standard that the values specified there are minimums is also in a section (the C99 standard's Annex E) that's informative, not normative. So while it can be regarded as an in-practice guarantee, it's not a formal guarantee.
One strong indication that the in-practice minimum precision for float is 6 decimal digits, i.e. that no implementation will give less, is that output operations default to precision 6, and that is normative text.
Disclaimer: It may be that there is additional wording that provides guarantees that I didn't notice. Not very likely, but possible.
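For completeness, a minimal way to inspect what a given implementation actually provides (a sketch; the printed values assume a typical IEEE-754 implementation and are not themselves guaranteed):

#include <cfloat>
#include <iostream>
#include <limits>

int main() {
    // The IEEE 754 guarantees apply to double if this prints 1.
    std::cout << std::numeric_limits<double>::is_iec559 << '\n';
    // The in-practice minimums discussed above; an IEEE-754 implementation
    // typically prints 6 15 15 (or 18 for an x87 long double).
    std::cout << FLT_DIG << ' ' << DBL_DIG << ' ' << LDBL_DIG << '\n';
    // Output operations default to precision 6, which is normative text:
    std::cout << 3.14159265358979 << '\n';   // prints 3.14159
}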

Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
I can't find any place in the standard that guarantees any minimal values for decimal precision.
The following quote from http://en.cppreference.com/w/cpp/types/numeric_limits/digits10 might be useful:
Example
An 8-bit binary type can represent any two-digit decimal number exactly, but 3-digit decimal numbers 256..999 cannot be represented. The value of digits10 for an 8-bit type is 2 (8 * std::log10(2) is 2.41)
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
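A short sketch reproducing the roundtrip failure mentioned in that quote (it assumes float is IEEE-754 binary32):

#include <cstdio>

int main() {
    double seven_digits = 8.589973e9;              // 7 significant decimal digits
    float f = static_cast<float>(seven_digits);    // round to binary32 and back
    std::printf("%.6e -> %.6e\n", seven_digits, static_cast<double>(f));
    // With an IEEE-754 binary32 float this prints: 8.589973e+09 -> 8.589974e+09
}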
However, the C standard specifies the minimum values that need to be supported.
From the C Standard:
5.2.4.2.2 Characteristics of floating types
...
9 The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater or equal in magnitude (absolute value) to those shown, with the same sign
...
-- number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
...
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10

To be more specific: since my compiler uses the IEEE 754 standard, the precision of my decimal digits is guaranteed to be 6 to 9 significant decimal digits for float and 15 to 17 significant decimal digits for double. Also, since a long double on my compiler is the same size as a double, it too has 15 to 17 significant decimal digits.
These ranges can be verified from IEEE 754 single-precision binary floating-point format: binary32 and IEEE 754 double-precision binary floating-point format: binary64 respectively.
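These ranges can also be checked directly against the implementation; a small sketch, assuming an implementation that provides the C11/C++17 *_DECIMAL_DIG macros:

#include <cfloat>
#include <iostream>

int main() {
    // Minimum and maximum significant decimal digits for each type.
    std::cout << "float:       " << FLT_DIG  << " to " << FLT_DECIMAL_DIG  << '\n'; // 6 to 9
    std::cout << "double:      " << DBL_DIG  << " to " << DBL_DECIMAL_DIG  << '\n'; // 15 to 17
    std::cout << "long double: " << LDBL_DIG << " to " << LDBL_DECIMAL_DIG << '\n';
}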

The C++ Standard says nothing specific about limits on floating point types. You may interpret the incorporation of the C Standard "by reference" as you wish, but if you take the limits as specified there (N1570), section 5.2.4.2.2 subpoint 15:
EXAMPLE 1
The following describes an artificial floating-point representation that meets the minimum requirements of this International Standard, and the appropriate values in a header for type
float:
FLT_RADIX 16
FLT_MANT_DIG 6
FLT_EPSILON 9.53674316E-07F
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -31
FLT_MIN 2.93873588E-39F
FLT_MIN_10_EXP -38
FLT_MAX_EXP +32
FLT_MAX 3.40282347E+38F
FLT_MAX_10_EXP +38
By this section, float, double and long double have these properties at the least*.
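As a cross-check, the table's values follow from its model parameters (b = 16, p = 6, e_min = -31, e_max = +32) via the formulas of C 5.2.4.2.2; a rough sketch, computed in double, so the printed values are only approximations of the exact constants above:

#include <cmath>
#include <cstdio>

int main() {
    const double b = 16, p = 6, e_min = -31, e_max = 32;
    std::printf("FLT_EPSILON     %.8e\n", std::pow(b, 1 - p));                         // 9.53674316e-07
    std::printf("FLT_MIN         %.8e\n", std::pow(b, e_min - 1));                     // 2.93873588e-39
    std::printf("FLT_MAX         %.8e\n", (1 - std::pow(b, -p)) * std::pow(b, e_max)); // 3.40282347e+38
    std::printf("FLT_DIG         %g\n", std::floor((p - 1) * std::log10(b)));          // 6
    std::printf("FLT_DECIMAL_DIG %g\n", std::ceil(1 + p * std::log10(b)));             // 9
}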

Related

How to know how my compiler encodes floating-point data?

How to know how floating-point data are stored in a C++ program?
If I assign the number 1.23 to a double object for example, how can I know how this number is encoded?
The official way to know how floating-point data are encoded is to read the documentation of the C++ implementation, because the 2017 C++ standard says, in 6.9.1 “Fundamental types” [basic.fundamental], paragraph 8, draft N4659:
… The value representation of floating-point types is implementation-defined…
“Implementation-defined” means the implementation must document it (3.12 “implementation-defined behavior” [defns.impl.defined]).
The C++ standard appears to be incomplete in this regard, as it says “… the value representation is a set of bits in the object representation that determines a value…” (6.9 “Types” [basic.types] 4) and “The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T,…” (ibid), but I do not see that it says the implementation must define which of the bits in the object representation are the value representation, or in which order/mapping. Nonetheless, the responsibility of informing you about the characteristics of the C++ implementation lies with the implementation and the implementors, because no other party can do it. (That is, the implementors create the implementation, and they can do so in arbitrary ways, so they are the ones who determine what the characteristics are, so they are the source of that information.)
The C standard defines some mathematical characteristics of floating-point types and requires implementations to describe them in <float.h>. C++ inherits these in <cfloat> (C++ 20.5.1.2 “Header” [headers] 3-4). C 2011 5.2.4.2.2 “Characteristics of floating types <float.h>” defines a model in which a floating-point number x equals s · b^e · Σ(f_k · b^(−k) for k = 1 to p), where s is a sign (±1), b is the base or radix, e is an exponent between e_min and e_max, inclusive, p is the precision (number of base-b digits in the significand), and the f_k are base-b digits of the significand (nonnegative integers less than b). The floating-point type may also contain infinities and Not-a-Number (NaN) “values”, and some values are distinguished as normal or subnormal. Then <float.h> relates the parameters of this model:
FLT_RADIX provides the base, b.
FLT_MANT_DIG, DBL_MANT_DIG, and LDBL_MANT_DIG provide the number of significand digits, also known as the precision, p, for the float, double, and long double types, respectively.
FLT_MIN_EXP, DBL_MIN_EXP, LDBL_MIN_EXP, FLT_MAX_EXP, DBL_MAX_EXP, and LDBL_MAX_EXP provide the minimum and maximum exponents, e_min and e_max.
In addition to providing these in <cfloat>, C++ provides them in the numeric_limits template defined in the <limits> header (21.3.4.1 “numeric_limits members” [numeric.limits.members]) in radix (b), digits (p), min_exponent (e_min) and max_exponent (e_max). For example, std::numeric_limits<double>::digits gives the number of digits in the significand of the double type. That template includes other members that describe the floating-point type, such as whether it supports infinities, NaNs, and subnormal values.
These provide a complete description of the mathematical properties of the floating-point format. However, as stated above, C++ appears to fail to specify that the implementation should document how the value bits that represent a type appear in the object bits.
Many C++ implementations use the IEEE-754 basic 32-bit binary format for float and the 64-bit format for double, and the value bits are mapped to the object bits in the same way as for integers of the corresponding width. If so, for normal numbers, the sign s is encoded in the most significant bit (0 or 1 for +1 or −1, respectively), the exponent e is encoded using the biased value e+126 (float) or e+1022 (double) in the next 8 (float) or 11 (double) bits, and the remaining bits contain the digits fk for k from 2 to p. The first digit, f1, is 1 for normal numbers. For subnormal numbers, the exponent field is zero, and f1 is 0. (Note the biases here are 126 and 1022 instead of the 127 and 1023 used in IEEE-754 because the C model expresses the significand using b−k instead of b1−k as is used in IEEE-754.) Infinities are encoded with all ones in the exponent field and all zeros in the significand field. NaNs are encoded with all ones in the exponent field and not all zeros in the significand field.
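If the implementation does use IEEE-754 binary64 for double with the usual bit ordering, the encoding of a particular value such as 1.23 can be examined by copying its object representation into an integer; a sketch:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    double d = 1.23;
    std::uint64_t bits;
    static_assert(sizeof bits == sizeof d, "assumes a 64-bit double");
    std::memcpy(&bits, &d, sizeof bits);   // copy the object representation
    std::printf("%a = 0x%016llx\n", d, static_cast<unsigned long long>(bits));
    // Typically prints: 0x1.3ae147ae147aep+0 = 0x3ff3ae147ae147ae
    // (sign bit 0, biased exponent 0x3ff, significand bits 0x3ae147ae147ae).
}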
The compiler will use the encoding used by the CPU architecture that you are compiling for. (Unless that architecture doesn't support floating point, in which case the compiler will probably choose the encoding used by its software emulation.)
The vendor that designed the CPU architecture should document the encoding that the CPU uses. You can know what the documentation says by reading it.
The IEEE 754 standard is fairly ubiquitous.

How many decimal places does the primitive float and double support? [duplicate]

This question already has answers here:
'float' vs. 'double' precision
(6 answers)
Closed 8 years ago.
I have read that double stores 15 digits and float stores 7 digits.
My question is, are these numbers the number of decimal places supported or total number of digits in a number?
If you are on an architecture using IEEE-754 floating point arithmetic (as in most architectures), then the type float corresponds to single precision, and the type double corresponds to double precision, as described in the standard.
Let's make some numbers:
Single precision:
32 bits to represent the number, out of which 24 bits are for the mantissa. This means that the least significant bit (LSB) has a relative value of 2^(-24) with respect to the MSB, which is the "hidden 1" and is not stored. Therefore, for a fixed exponent, the smallest representable increment is about 10^(-7.22) times the value of the most significant digit. What this means is that for a number written in base-exponent notation (3.141592653589 E 25), only about "7.22" decimal digits are significant, which in practice means that roughly 7 leading decimal digits are meaningful.
Double precision:
64 bits to represent the number, out of which 53 bits are for the mantissa. Following the same reasoning, expressing 2^(-53) as a power of 10 results in 10^(-15.95), which in turn means that at least 15 decimal digits will always be correct.
Those are the total number of "significant figures" if you will, counting from left to right, regardless of where the decimal point is. Beyond those numbers of digits, accuracy is not preserved.
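The same digit counts can be derived from the mantissa widths with the C standard's formula floor((p - 1) * log10(2)); a sketch, assuming the IEEE-754 sizes used above:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    int p_float  = std::numeric_limits<float>::digits;    // 24 on IEEE-754
    int p_double = std::numeric_limits<double>::digits;   // 53 on IEEE-754
    // floor((p - 1) * log10(2)) gives the guaranteed decimal digit count.
    std::cout << std::floor((p_float  - 1) * std::log10(2.0)) << '\n';  // 6
    std::cout << std::floor((p_double - 1) * std::log10(2.0)) << '\n';  // 15
}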
The counts you listed are for the base 10 representation.
There are macros for the number of decimal places each type supports. The gcc docs explain what they are and also what they mean:
FLT_DIG
This is the number of decimal digits of precision for the float data type. Technically, if p and b are the precision and base (respectively) for the representation, then the decimal precision q is the maximum number of decimal digits such that any floating point number with q base 10 digits can be rounded to a floating point number with p base b digits and back again, without change to the q decimal digits.
The value of this macro is supposed to be at least 6, to satisfy ISO C.
DBL_DIG
LDBL_DIG
These are similar to FLT_DIG, but for the data types double and long double, respectively. The values of these macros are supposed to be at least 10.
On both gcc 4.9.2 and clang 3.5.0, these macros yield 6 and 15, respectively.
are these numbers the number of decimal places supported or total number of digits in a number?
They are the significant digits contained in every number (although you may not need all of them, they're still there). The mantissa of the same type always contains the same number of bits, so every number consequently contains the same number of valid "digits" if you think in terms of decimal digits. You cannot store more digits than will fit into the mantissa.
The number of "supported" digits is, however, much larger, for example float will usually support up to 38 decimal digits and double will support up to 308 decimal digits, but most of these digits are not significant (that is, "unknown").
Although technically, this is wrong, since float and double do not have universally well-defined sizes like I presumed above (they're implementation-defined). Also, storage sizes are not necessarily the same as the sizes of intermediate results.
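The "up to 38" and "up to 308" figures above correspond to the largest power-of-ten exponents, which can be queried rather than presumed; a small sketch:

#include <iostream>
#include <limits>

int main() {
    // Largest power of 10 that is still representable in each type.
    std::cout << std::numeric_limits<float>::max_exponent10  << '\n';  // typically 38
    std::cout << std::numeric_limits<double>::max_exponent10 << '\n';  // typically 308
}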
The C++ standard is very reluctant at precisely defining any fundamental type, leaving almost everything to the implementation. Floating point types are no exception:
3.9.1 / 8
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
Now of course all of this is not particularly helpful in practice.
In practice, floating point is (usually) IEEE 754 compliant, with float having a width of 32 bits and double having a width of 64 bits (as stored in memory, registers have higher precision on some notable mainstream architectures).
This is equivalent to 24 bits and 53 bits of mantissa, respectively, or 7 and 15 full decimals.

Is `double` guaranteed by C++03 to represent small integers exactly?

Does the C++03 standard guarantee that sufficiently small non-zero integers are represented exactly in double? If not, what about C++11? Note, I am not assuming IEEE compliance here.
I suspect that the answer is no, but I would love to be proved wrong.
When I say sufficiently small, I mean, bounded by some value that can be derived from the guarantees of C++03, and maybe even be calculated from values made available via std::numeric_limits<double>.
EDIT:
It is clear (now that I have checked) that std::numeric_limits<double>::digits is the same thing as DBL_MANT_DIG, and std::numeric_limits<double>::digits10 is the same thing as DBL_DIG, and this is true for both C++03 and C++11.
Furthermore, C++03 defers to C90, and C++11 defers to C99, with respect to the meaning of DBL_MANT_DIG and DBL_DIG.
Both C90 and C99 state that the minimum allowable value for DBL_DIG is 10, i.e., 10 decimal digits.
The question then is, what does that mean? Does it mean that integers of up to 10 decimal digits are guaranteed to be represented exactly in double?
In that case, what is then the purpose of DECIMAL_DIG in C99, and the following remark in C99 §5.2.4.2.2 / 12?
Conversion from (at least) double to decimal with DECIMAL_DIG digits and back
should be the identity function.
Here is what C99 §5.2.4.2.2 / 9 has to say about DBL_DIG:
Number of decimal digits, 'q', such that any floating-point
number with 'q' decimal digits can be rounded into a
floating-point number with 'p' radix 'b' digits and back again
without change to the q decimal digits,
p * log10(b)              if 'b' is a power of 10
floor((p-1) * log10(b))   otherwise
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
I'll be happy if someone can help me unpack this.
Well, 3.9.1 [basic.fundamental] paragraph 8 states
... The value representation of floating-point types is implementation-defined. ...
At least, the implementation has to define what representation it uses.
On the other hand, std::numeric_limits<F> defines a couple of members which seem to imply that the representation is of the form significand · radix^exponent:
std::numeric_limits<F>::radix: the radix of the exponent
std::numeric_limits<F>::digits: the number of radix digits
I think these statements imply that you can represent integers in the range 0 ... radix^digits - 1 exactly.
From the C standard, "Characteristics of floating types <float.h>", which is normative for C++, I would assume that you can combine FLT_RADIX and FLT_MANT_DIG into useful information: The number of digits in the mantissa and the base in which they are expressed.
For example, for a single-precision IEEE754 float, this would be respectively 2 and 24, so you should be able to store integers of absolute value up to 2^24.
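A quick sketch of that bound for double (it assumes an IEEE-754 binary64 double, i.e. radix 2 and 53 significand digits):

#include <cstdint>
#include <iostream>
#include <limits>

int main() {
    // Every integer of magnitude up to radix^digits (2^53 here) is exact.
    std::int64_t limit = std::int64_t{1} << std::numeric_limits<double>::digits;
    double exact  = static_cast<double>(limit - 1);   // representable exactly
    double beyond = static_cast<double>(limit + 1);   // rounds to a neighbour
    std::cout << (static_cast<std::int64_t>(exact)  == limit - 1) << '\n';  // 1
    std::cout << (static_cast<std::int64_t>(beyond) == limit + 1) << '\n';  // 0
}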

base10 and floating point representation [duplicate]

This question already has answers here:
Is `double` guaranteed by C++03 to represent small integers exactly?
(2 answers)
Closed 9 years ago.
0.1 in base 10 has no exact correspondence in a double representation.
Is there any guarantee in C++ that a base-10 number with no fractional part, and with a number of digits less than or equal to std::numeric_limits<double>::digits10, necessarily has an exact double representation?
Per Dietmar Kühl’s answer, statements in the C++ standard imply that integers in [0, r^d) can be represented exactly, where r is std::numeric_limits<type>::radix and d is std::numeric_limits<type>::digits.
This in turn seems to imply that an integer with no more than std::numeric_limits<type>::digits10 base-10 digits can be represented exactly.
Aside: There are some problems with the C++ standard’s definition of std::numeric_limits<type>::digits10.
The standard says this is the “Number of base 10 digits that can be represented without change.” Is that supposed to be just simple base 10 digits, i.e., integers, or is it a statement about precision throughout the range of the format? A footnote, which is not normative, says this is equivalent to FLT_DIG, DBL_DIG, and LDBL_DIG, which are defined by way of the C standard. The C standard gives two definitions in one statement:
number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
and:
p log10 b if b is a power of 10
floor((p-1) log10 b) otherwise
I do not believe the former is a good definition. The latter gives us 7 for IEEE-754 32-bit binary floating-point, but 1.5e-45 is a floating-point number with 2 decimal digits, and rounding it to IEEE-754 32-bit binary floating-point and back gives 1.401…e-45 (because it is in the subnormal interval). So it is not true that any floating-point number with 7 decimal digits can be rounded to floating-point and back again without change to the 7 decimal digits.
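That counterexample is easy to reproduce (assuming float is IEEE-754 binary32):

#include <cstdio>

int main() {
    float f = 1.5e-45f;   // nearest float is the smallest subnormal, 2^-149
    std::printf("%.1e\n", static_cast<double>(f));   // prints 1.4e-45, not 1.5e-45
    std::printf("%.9e\n", static_cast<double>(f));   // about 1.401298464e-45
}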
I believe there is such a guarantee, by definition of std::numeric_limits<double>::digits10 (as long as it's implemented correctly, of course). A related discussion (see top answer): What is the meaning of numeric_limits<double>::digits10

Range of representable values of 32-bit, 64-bit and 80-bit float IEEE-754?

In the C++ standard it says of floating literals:
If the scaled value is not in the range of representable values for its type, the program is ill-formed.
The scaled value is the significant part multiplied by 10 ^ exponent part.
Under x86-64:
float is a single-precision IEEE-754
double is a double-precision IEEE-754
long double is an 80-bit extended precision IEEE-754
In this context, what is the range of representable values for each of these three types? Where is this documented? Or how is it calculated?
If you know the number of exponent bits and mantissa bits, then based on the IEEE-754 format, one can establish that the maximum absolute representable value is:
2^(2^(E-1)-1) * (1 + (2^M-1)/2^M)
The minimum absolute value (not including zero or denormals) is:
2^(2-2^(E-1))
For single-precision, E is 8, M is 23.
For double-precision, E is 11, M is 52.
For extended precision, I'm not sure. If you're referring to the 80-bit precision of the x87 FPU, then so far as I can tell, it's not IEEE-754 compliant...
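A small sketch evaluating those formulas for the single- and double-precision parameters given above:

#include <cmath>
#include <cstdio>

// Largest finite value and smallest normal value for E exponent bits, M mantissa bits.
static void show(int E, int M) {
    double max_val    = std::ldexp(2.0 - std::ldexp(1.0, -M), (1 << (E - 1)) - 1);
    double min_normal = std::ldexp(1.0, 2 - (1 << (E - 1)));
    std::printf("E=%2d M=%2d  max=%.9g  min normal=%.9g\n", E, M, max_val, min_normal);
}

int main() {
    show(8, 23);    // binary32: about 3.40282347e+38 and 1.17549435e-38
    show(11, 52);   // binary64: about 1.79769313e+308 and 2.22507386e-308
}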
The answer (if you're on a machine with IEEE floating point) is in float.h: FLT_MAX, DBL_MAX and LDBL_MAX. On a system with full IEEE support, something around 3.4e+38, 1.8e+308 and 1.2e+4932. (The exact values may vary, and may be expressed differently, depending on how the compiler does its input and rounding. g++, for example, defines them to be compiler built-ins.)
EDIT:
WRT your question (since neither I nor the other responders actually answered it): the range of representable values is [-type_MAX ... type_MAX], where type is one of FLT, DBL, or LDBL.
I was looking for the largest number representable by 64 bits and ended up making my own 500-digit floating point calculator. This is what I came up with if all 64 bits are turned on:
18,446,744,073,709,551,615
18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand 615