Range of representable values of 32-bit, 64-bit and 80-bit float IEEE-754? - c++

In the C++ standard it says of floating literals:
If the scaled value is not in the range of representable values for its type, the program is ill-formed.
The scaled value is the significand part multiplied by 10 ^ exponent part.
Under x86-64:
float is a single-precision IEEE-754
double is a double-precision IEEE-754
long double is an 80-bit extended precision IEEE-754
In this context, what is the range of representable values for each of these three types? Where is this documented, or how is it calculated?

If you know the number of exponent bits and mantissa bits, then based on the IEEE-754 format, one can establish that the maximum absolute representable value is:
2^(2^(E-1)-1) * (1 + (2^M-1)/2^M)
The minimum absolute value (not including zero or denormals) is:
2^(2-2^(E-1))
For single-precision, E is 8, M is 23.
For double-precision, E is 11, M is 52.
For extended-precision, I'm not sure. If you're referring to the 80-bit precision of the x87 FPU, then so far as I can tell, it's not IEEE-754 compliant...
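As an illustration (my own sketch, not part of the answer above), the formulas can be checked against <limits> in C++; the helper names ieee_max and ieee_min_normal are invented for this example:

#include <cmath>
#include <iostream>
#include <limits>

// Evaluate the formulas above for a format with E exponent bits and
// M stored mantissa bits, using long double so both float and double fit.
static long double ieee_max(int E, int M) {
    // 2^(2^(E-1)-1) * (1 + (2^M-1)/2^M)  ==  (2 - 2^-M) * 2^(2^(E-1)-1)
    return (2.0L - std::ldexp(1.0L, -M)) * std::ldexp(1.0L, (1 << (E - 1)) - 1);
}
static long double ieee_min_normal(int E) {
    // 2^(2 - 2^(E-1))
    return std::ldexp(1.0L, 2 - (1 << (E - 1)));
}

int main() {
    std::cout << "float  max: " << ieee_max(8, 23)
              << " vs " << std::numeric_limits<float>::max() << '\n';
    std::cout << "float  min: " << ieee_min_normal(8)
              << " vs " << std::numeric_limits<float>::min() << '\n';
    std::cout << "double max: " << ieee_max(11, 52)
              << " vs " << std::numeric_limits<double>::max() << '\n';
    std::cout << "double min: " << ieee_min_normal(11)
              << " vs " << std::numeric_limits<double>::min() << '\n';
}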

The answer (if you're on a machine with IEEE floating point) is in float.h: FLT_MAX, DBL_MAX and LDBL_MAX. On a system with full IEEE support, something around 3.4e+38, 1.8e+308 and 1.2e+4932. (The exact values may vary, and may be expressed differently, depending on how the compiler does its input and rounding. g++, for example, defines them to be compiler built-ins.)
EDIT:
WRT your question (since neither I nor the other responders actually answered it): the range of representable values is [-type_MAX...type_MAX], where type is one of FLT, DBL, or LDBL.
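For reference, a quick way to see these values on your own implementation (a minimal sketch, assuming a hosted C++ compiler):

#include <cfloat>
#include <iostream>
#include <limits>

int main() {
    // The <cfloat> macros and std::numeric_limits report the same values.
    std::cout << "FLT_MAX  = " << FLT_MAX  << '\n';
    std::cout << "DBL_MAX  = " << DBL_MAX  << '\n';
    std::cout << "LDBL_MAX = " << LDBL_MAX << '\n';
    std::cout << "numeric_limits<long double>::max() = "
              << std::numeric_limits<long double>::max() << '\n';
}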

I was looking for the largest number representable in 64 bits and ended up making my own 500-digit floating point calculator. This is what I came up with when all 64 bits are turned on:
18,446,744,073,709,551,615
18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand 615
(That is 2^64 - 1, the largest unsigned 64-bit integer, not the largest value a 64-bit double can represent.)

Related

Double value change during assignment

I know double has some precision issues and it can truncate values during conversion to integer.
In my case I am assigning 690000000000123455 to a double and it gets changed to 690000000000123392 during the assignment.
Why is the number changed so drastically? After all, there is no fractional part. It doesn't seem like a precision issue, since the value doesn't change by 1 but by 63.
Presumably you store 690000000000123455 as a 64 bit integer and assign this to a double.
double d = 690000000000123455;
The closest representable double precision value to 690000000000123455 can be checked here: http://pages.cs.wisc.edu/~rkennedy/exact-float?number=690000000000123455 and is seen to be 690000000000123392.
In other words, everything is as to be expected. Your number cannot be represented exactly as a double precision value and so the closest representable value is chosen.
For more discussion of floating point data types see: Is floating point math broken?
IEEE-754 double precision floats have about 53 bits of precision which equates to about 16 decimal digits (give or take). You'll notice that's about where your two numbers start to diverge.
double's storage size is 8 bytes. Its value ranges from about 2.2E-308 to 1.7E+308, and its precision is about 15 decimal digits. But your number contains 18 digits. That's the reason.
You could use long double, as it has precision of up to about 19 significant decimal digits.
The other answers are already pretty complete, but I want to suggest a website I find very helpful for understanding how floating-point numbers work: IEEE 754 Converter (it only covers 32-bit floats, but the interactivity is still very good).
As we can see, 690000000000123455 is between 2^59 and 2^60, and the finest resolution of the mantissa for double precision is 2^-52 relative to the leading bit, which means the precision step for the given number is 2^(59-52) = 2^7 = 128. The error of 63 you observed is well within that range.
As a side suggestion, it is better to use a 64-bit integer type (long long or int64_t) for storing big integers, as it holds them exactly and does not overflow in your case.
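To make the rounding concrete, here is a small sketch (assuming IEEE 754 binary64 for double) that reproduces the value from the question and prints the spacing between adjacent doubles at that magnitude:

#include <cmath>
#include <iomanip>
#include <iostream>

int main() {
    long long i = 690000000000123455LL;          // the original integer
    double d = static_cast<double>(i);           // same rounding as `double d = 690000000000123455;`

    std::cout << std::fixed << std::setprecision(0);
    std::cout << "stored as: " << d << '\n';     // 690000000000123392
    std::cout << "spacing of doubles here: "
              << std::nextafter(d, INFINITY) - d << '\n';  // 128
}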

Do floats, doubles, and long doubles have a guaranteed minimum precision?

From my previous question "Is floating point precision mutable or invariant?" I received a response which said,
C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double
counterparts. DBL_DIG indicates the minimum relative decimal
precision. DBL_DECIMAL_DIG can be thought of as the maximum relative
decimal precision.
I looked these macros up. They are found in the header <cfloat>. From the cplusplus reference page they list macros for float, double, and long double.
Here are the macros for minimum precision values.
FLT_DIG 6 or greater
DBL_DIG 10 or greater
LDBL_DIG 10 or greater
If I took these macros at face value, I would assume that a float has a minimum decimal precision of 6, while a double and long double have a minimum decimal precision of 10. However, being a big boy, I know that some things may be too good to be true.
Therefore, I would like to know. Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
If not, why?
Note: Assume we are using programming language C++.
If std::numeric_limits<F>::is_iec559 is true, then the guarantees of the IEEE 754 standard apply to floating point type F.
Otherwise (and anyway), minimum permitted values of symbols such as DBL_DIG are specified by the C standard, which, undisputably for the library, “is incorporated into [the C++] International Standard by reference”, as quoted from C++11 §17.5.1.5/1.
Edit:
As noted by TC in a comment here,
<climits> and <cfloat> are normatively incorporated by §18.3.3 [c.limits]; the minimum values are specified in turn in §5.2.4.2.2 of the C standard
Unfortunately for the formal view, first of all that quote from C++11 is from section 17.5 which is only informative, not normative. And secondly, the wording in the C standard that the values specified there are minimums, is also in a section (the C99 standard's Annex E) that's informative, not normative. So while it can be regarded as an in-practice guarantee, it's not a formal guarantee.
One strong indication that the in-practice minimum precision for float is 6 decimal digits, that no implementation will give less:
output operations default to precision 6, and this is normative text.
Disclaimer: It may be that there is additional wording that provides guarantees that I didn't notice. Not very likely, but possible.
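One practical way to pin this down in code (my own suggestion, not from the answer above) is to assert the IEEE 754 guarantee at compile time, so the digit counts under discussion can be relied on:

#include <limits>

// Refuse to build on implementations where the IEEE 754 guarantees do not apply.
static_assert(std::numeric_limits<float>::is_iec559,
              "float is not IEEE 754 single precision");
static_assert(std::numeric_limits<double>::is_iec559,
              "double is not IEEE 754 double precision");

int main() {}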
Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?
I can't find any place in the standard that guarantees any minimal values for decimal precision.
The following quote from http://en.cppreference.com/w/cpp/types/numeric_limits/digits10 might be useful:
Example
An 8-bit binary type can represent any two-digit decimal number exactly, but 3-digit decimal numbers 256..999 cannot be represented. The value of digits10 for an 8-bit type is 2 (8 * std::log10(2) is 2.41)
The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.
However, the C standard specifies the minimum values that need to be supported.
From the C Standard:
5.2.4.2.2 Characteristics of floating types
...
9 The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater or equal in magnitude (absolute value) to those shown, with the same sign
...
-- number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,
...
FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10
To be more specific: since my compiler uses the IEEE 754 standard, the precision of my decimal digits is guaranteed to be 6 to 9 significant decimal digits for float and 15 to 17 significant decimal digits for double. Also, since a long double on my compiler is the same size as a double, it too has 15 to 17 significant decimal digits.
These ranges can be verified from IEEE 754 single-precision binary floating-point format: binary32 and IEEE 754 double-precision binary floating-point format: binary64 respectively.
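These round-trip digit counts can also be queried directly; a minimal sketch (output shown for a typical IEEE 754 implementation):

#include <iostream>
#include <limits>

int main() {
    // digits10: digits guaranteed to survive decimal -> binary -> decimal
    // max_digits10: digits needed to survive binary -> decimal -> binary
    std::cout << "float:       " << std::numeric_limits<float>::digits10
              << " / " << std::numeric_limits<float>::max_digits10 << '\n';   // 6 / 9
    std::cout << "double:      " << std::numeric_limits<double>::digits10
              << " / " << std::numeric_limits<double>::max_digits10 << '\n';  // 15 / 17
    std::cout << "long double: " << std::numeric_limits<long double>::digits10
              << " / " << std::numeric_limits<long double>::max_digits10 << '\n';
}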
The C++ Standard says nothing specific about limits on floating point types. You may interpret the incorporation of the C Standard "by reference" as you wish, but if you take the limits as specified there (N1570), section 5.2.4.2.2 subpoint 15:
EXAMPLE 1
The following describes an artificial floating-point representation that meets the minimum requirements of this International Standard, and the appropriate values in a header for type
float:
FLT_RADIX 16
FLT_MANT_DIG 6
FLT_EPSILON 9.53674316E-07F
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -31
FLT_MIN 2.93873588E-39F
FLT_MIN_10_EXP -38
FLT_MAX_EXP +32
FLT_MAX 3.40282347E+38F
FLT_MAX_10_EXP +38
By this section, float, double and long double have these properties at the least.

How many decimal places does the primitive float and double support? [duplicate]

This question already has answers here:
'float' vs. 'double' precision
(6 answers)
Closed 8 years ago.
I have read that double stores 15 digits and float stores 7 digits.
My question is, are these numbers the number of decimal places supported or total number of digits in a number?
If you are on an architecture using IEEE-754 floating point arithmetic (as in most architectures), then the type float corresponds to single precision, and the type double corresponds to double precision, as described in the standard.
Let's make some numbers:
Single precision:
32 bits to represent the number, of which 24 bits are for the mantissa. This means that the least significant bit (LSB) has a relative value of 2^(-24) with respect to the MSB, which is the "hidden 1" that is not stored. Therefore, for a fixed exponent, the relative precision is about 10^(-7.22). What this means is that, in base-exponent notation (3.141592653589 E 25), only about "7.22" decimal digits are significant, which in practice means that roughly the first 7 digits are always correct.
Double precision:
64 bits to represent the number, of which 53 bits are for the mantissa. Following the same reasoning, expressing 2^(-53) as a power of 10 results in 10^(-15.95), which in turn means that roughly the first 15 decimal digits are always correct.
Those are the total number of "significant figures" if you will, counting from left to right, regardless of where the decimal point is. Beyond those numbers of digits, accuracy is not preserved.
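A quick way to see those significant-figure counts in practice (assuming IEEE 754 float and double) is to store the same long literal in both types and print it back with more digits than either can hold:

#include <iomanip>
#include <iostream>

int main() {
    float  f = 3.141592653589793238f;
    double d = 3.141592653589793238;

    std::cout << std::setprecision(20);
    std::cout << "float : " << f << '\n';   // correct to about 7 significant digits
    std::cout << "double: " << d << '\n';   // correct to about 15-16 significant digits
}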
The counts you listed are for the base 10 representation.
There are macros for the number of decimal places each type supports. The gcc docs explain what they are and also what they mean:
FLT_DIG
This is the number of decimal digits of precision for the float data type. Technically, if p and b are the precision and base (respectively) for the representation, then the decimal precision q is the maximum number of decimal digits such that any floating point number with q base 10 digits can be rounded to a floating point number with p base b digits and back again, without change to the q decimal digits.
The value of this macro is supposed to be at least 6, to satisfy ISO C.
DBL_DIG
LDBL_DIG
These are similar to FLT_DIG, but for the data types double and long double, respectively. The values of these macros are supposed to be at least 10.
On both gcc 4.9.2 and clang 3.5.0, these macros yield 6 and 15, respectively.
are these numbers the number of decimal places supported or total number of digits in a number?
They are the significant digits contained in every number (you may not need all of them, but they're still there). The mantissa of the same type always contains the same number of bits, so every number consequently contains the same number of valid "digits" if you think in terms of decimal digits. You cannot store more digits than will fit into the mantissa.
The number of "supported" digits is, however, much larger, for example float will usually support up to 38 decimal digits and double will support up to 308 decimal digits, but most of these digits are not significant (that is, "unknown").
Although technically, this is wrong, since float and double do not have universally well-defined sizes like I presumed above (they're implementation-defined). Also, storage sizes are not necessarily the same as the sizes of intermediate results.
The C++ standard is very reluctant at precisely defining any fundamental type, leaving almost everything to the implementation. Floating point types are no exception:
3.9.1 / 8
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
Now of course all of this is not particularly helpful in practice.
In practice, floating point is (usually) IEEE 754 compliant, with float having a width of 32 bits and double having a width of 64 bits (as stored in memory, registers have higher precision on some notable mainstream architectures).
This is equivalent to 24 bits and 53 bits of mantissa, respectively, or about 7 and 15 full decimal digits.

Some floating point precision and numeric limits question

I know that there are tons of questions like this one, but I couldn't find my answers. Please read before voting to close (:
According to PC ASM:
The numeric coprocessor has eight floating point registers.
Each register holds 80 bits of data.
Floating point numbers are always stored as 80-bit
extended precision numbers in these registers.
How is that possible, when sizeof shows something different? For example, on the x64 architecture, sizeof(double) is 8, which is far from 80 bits.
Why does std::numeric_limits< long double >::max() give me 1.18973e+4932?! This is a huuuuuuuuuuge number. If this is not the way to get the max of floating point numbers, then why does this compile at all, and moreover, why does it return a value?
what does this mean:
Double precision magnitudes can range from approximately 10^−308 to 10^308
These are huge numbers; you cannot store them in 8 bytes, or even 16 bytes (which is extended precision, and that is only 128 bits)?
Obviously, I'm missing something. Actually, obviously, a lot of things.
1) sizeof is the size in memory, not in a register. sizeof is in bytes, so 8 bytes = 64 bits. When doubles are loaded into the FPU registers for calculation (on this architecture), they get an extra 16 bits for more precise intermediate calculations. When the value is copied back to memory, the extra 16 bits are lost.
2) Why do you think long double doesn't go up to 1.18973e+4932?
3) Why can't you store 10^308 in 8 bytes? I only need 13 bits: 4 to store the 10, and 9 to store the 308.
A double is not an Intel coprocessor 80-bit floating point, it is an IEEE 754 64-bit floating point. With sizeof(double) you will get the size of the latter.
This is the correct way to get the maximum value for long double, so your question is pointless.
You are probably missing that floating point numbers are not exact numbers. A value like 10^308 doesn't store 308 digits, only about 15 to 17 significant digits; the rest of the magnitude is carried by the exponent.
The size of space that the FPU uses and the amount of space used in memory to represent double are two different things. IEEE 754 (which probably most architectures use) specifies 32-bit single precision and 64-bit double precision numbers, which is why sizeof(double) gives you 8 bytes. Intel x86 does floating point math internally using 80 bits.
std::numeric_limits< long double >::max() is giving you the correct maximum value for long double, which is typically the 80-bit type. If you want the max for the 64-bit double you should use double as the template parameter.
For the question about ranges, why do you think you can't store them in 8 bytes? They do in fact fit; what you're missing is that at the extremes of the range there are numbers that can't be represented (for example, with exponents nearing 308, there are many, many integers that can't be represented at all).
See also http://floating-point-gui.de/ for info about floating point.
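Putting the points above together, a small sketch (results shown for x86-64 Linux with the 80-bit x87 long double; other platforms will differ):

#include <iostream>
#include <limits>

int main() {
    std::cout << "sizeof(double)      = " << sizeof(double) << '\n';       // 8
    std::cout << "sizeof(long double) = " << sizeof(long double) << '\n';  // often 16, padded
    std::cout << "double max      = " << std::numeric_limits<double>::max() << '\n';      // ~1.8e308
    std::cout << "long double max = " << std::numeric_limits<long double>::max() << '\n'; // ~1.19e4932
}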
Floating point numbers on computers are represented according to IEEE 754-2008.
It defines several formats, amongst which
binary32 = Single precision,
binary64 = Double precision and
binary128 = Quadruple precision are the most common.
http://en.wikipedia.org/wiki/IEEE_754-2008#Basic_formats
Double precision numbers have 52 stored bits for the significand, which gives the precision, and 11 bits for the exponent, which gives the magnitude of the number.
So doubles are 1.xxx (52 binary digits) * 2^exponent, where the 11-bit exponent field allows exponents up to 1023.
And (2 - 2^-52) * 2^1023 ≈ 1.79 * 10^308,
which is why this is the largest value you can store in a double.
A quadruple precision number has 112 bits of significand and 15 bits for the exponent, so the largest exponent is 16383.
As that gives a maximum of about 1.19 * 10^4932, you see that your C++ test was perfectly correct. However, on x64 your long double is not a quadruple precision number; it is stored in the x87 80-bit extended format, which has the same 15-bit exponent range as quadruple precision and therefore the same maximum of about 1.19 * 10^4932.
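The exponent and significand sizes being discussed can be read back from <limits>; a sketch (the long double figures assume the x87 80-bit format):

#include <iostream>
#include <limits>

int main() {
    std::cout << "double:      digits = " << std::numeric_limits<double>::digits          // 53
              << ", max_exponent = " << std::numeric_limits<double>::max_exponent << '\n';      // 1024
    std::cout << "long double: digits = " << std::numeric_limits<long double>::digits     // 64 on x87
              << ", max_exponent = " << std::numeric_limits<long double>::max_exponent << '\n'; // 16384
}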

Long Integer and Float

If a Long Integer and a float both take 4 bytes to store in memory then why are their ranges different?
Integers are stored like this:
1 bit for the sign (+/-)
31 bits for the value.
Floats are stored differently, giving greater range at the expense of accuracy:
1 bit for the sign (+/-)
N bits for the mantissa S
M bits for the exponent E
Float is represented in the exponential form: (+/-)S*(base)^E
BTW, "long" isn't always 32 bits. See this article.
Different way to encode your numbers.
Long counts from 0 up to 2^(4*8) - 1 (or from -2^31 to 2^31 - 1 when signed).
Float uses only 23 of the 32 bits for the "counting". But it adds range with an exponent in the other bits. So you can have bigger numbers, but they are less accurate (in the lower-order parts):
1.2424 * 10^20 (mantissa * 10^exponent) is certainly bigger than 2^32. But while you can distinguish a long of 2^31 from a long of 2^31 - 1, you can't distinguish a float of 1.24 * 10^20 from a float of 1.24 * 10^20 - 1: the 1 is simply lost in the float representation.
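A short sketch of that "lost 1" (assuming IEEE 754 single precision and a 32-bit long); the value 1.24e20 is just an illustrative float in range:

#include <iostream>

int main() {
    float big = 1.24e20f;
    std::cout << std::boolalpha
              << (big - 1.0f == big) << '\n';  // true: spacing near 1.24e20 is ~8.8e12, so the 1 is absorbed

    long n = 2147483647L;                      // 2^31 - 1
    std::cout << (n - 1 == n) << '\n';         // false: integer arithmetic stays exact
}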
They are not always the same size. But even when they are, their ranges are different because they serve different purposes. One is for integers with no decimal places, and one is for decimals.
This can be explained in terms of why a floating point representation can represent a larger range of numbers than a fixed point representation. This text from the Wikipedia entry:
The advantage of floating-point representation over fixed-point (and integer) representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits, with the decimal point assumed to be positioned after the fifth digit, can represent the numbers 12345.67, 8765.43, 123.00, and so on, whereas a floating-point representation (such as the IEEE 754 decimal32 format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.
Indeed a float takes 4 bytes (32bits), but since it's a float you have to store different things in these 32 bits:
1 bit is used for the sign (+/-)
8 bits are used for the exponent
23 bits are used for the significand (the significant digits)
You can see that the precision of a float directly depends on the number of bits allocated to the significand, while the min/max values (the range) depend on the number of bits allocated to the exponent.
With the upper example:
8 bits for the exponent gives an exponent range of about [-126, 127] for normalized values
--> max is about 2 * 2^127 ≈ 3.4 * 10^38 (since 128 * log10(2) ≈ 38.5)
Regarding a long integer, you've got 1 bit used for the sign and then 31 bits to represent the integer value leading to a max of 2 147 483 647.
You can have a look at Wikipedia for more precise info:
Wikipedia - Floating point
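Those figures line up with the standard macros; a minimal sketch (values shown assume IEEE 754 single precision float and a 32-bit long):

#include <cfloat>
#include <climits>
#include <iostream>

int main() {
    std::cout << "LONG_MAX       = " << LONG_MAX << '\n';        // 2147483647 with a 32-bit long
    std::cout << "FLT_MAX_EXP    = " << FLT_MAX_EXP << '\n';     // 128
    std::cout << "FLT_MAX_10_EXP = " << FLT_MAX_10_EXP << '\n';  // 38
    std::cout << "FLT_MAX        = " << FLT_MAX << '\n';         // ~3.4e38
}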
Their ranges are different because they use different ways of representing numbers.
long (in c) is equivalent to long int. The size of this type varies between processors and compilers, but is often, as you say, 32 bits. At 32 bits, it can represent 2^32 different values. Since we often want to use negative numbers, computers normally represent integers using a format called "two's complement". This way, we can represent numbers from -2^31 up to 2^31 - 1. Counting the number 0, this adds up to 2^32 numbers.
float (in c) is usually a single precision IEEE 754 formatted number. At 32 bits, this data type can also take 2^32 different bit patterns, but they are not used to directly represent whole numbers, as in the long. Instead, they represent a sign, and the mantissa and exponent of a normalized number.
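To see the "same bit patterns, different interpretation" point directly, one can copy a float's bits into a 32-bit integer (a small sketch; memcpy avoids undefined behaviour from type punning):

#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    float f = 1.0f;
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    std::cout << std::hex << bits << '\n';  // 3f800000: sign 0, biased exponent 127, mantissa 0
}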
In general: when you have more range of values (float has up to 10^many), you have less precision.
This is what happens here. If you need integers, 32-bit long will give you more.
In a handwavey, high-level sense, floating point sacrifices integer precision to extend its range. This is done by combining a base value with a scaling factor. For large values, a float will not be able to precisely represent all integers, but for small values it will represent them with finer than integer precision.
No, the size of primitive data types in C is Implementation Defined.
This wiki entry clearly states: The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.