I'm wondering whether there is anything in the IEEE 754 standard to avoid cohorts.
Does the standard imply using the smallest possible exponent, or are cohorts present?
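To illustrate what I mean by cohorts, here is a quick sketch with Python's decimal module (which follows the same arithmetic model as the IEEE 754 decimal formats): 1.0 and 1.00 compare equal but carry different exponents, and arithmetic is not forced to pick the smallest possible exponent.
from decimal import Decimal

a = Decimal('1.0')     # coefficient 10, exponent -1
b = Decimal('1.00')    # coefficient 100, exponent -2
print(a == b)                                        # True: same value
print(a.as_tuple().exponent, b.as_tuple().exponent)  # -1 -2: two members of one cohort

# Arithmetic does not normalize to one canonical exponent; the result's
# exponent follows the standard's "preferred exponent" rules instead.
print(Decimal('2.50') + Decimal('0.5'))   # 3.00
print(Decimal('2.5') + Decimal('0.5'))    # 3.0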
Related
Can all IEEE 754 32 bit floating point numbers be represented exactly by a 64 bit floating point number? Stated another way, is a cast from f32 to f64 ever rounded?
Can all IEEE 754 32 bit floating point numbers be represented exactly by a 64 bit floating point number?
Yes. All numeric values of binary32 are in binary64.
Stated another way, is a cast from f32 to f64 ever rounded?
Not usually. Various languages like C allow intermediate 32-bit FP calculations to employ wider math, so a cast may narrow (round) results. Yet if the value was truly an f32, no rounding error occurs going to f64.
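Here is a small sketch of that in Python, using struct to do the binary32 conversion: narrowing a binary64 value to binary32 may round, but widening a value that already is a binary32 back to binary64 is exact.
import struct

def to_f32(x):
    # Round a Python float (binary64) to the nearest binary32 value,
    # then return that binary32 value held in a binary64 again.
    return struct.unpack('<f', struct.pack('<f', x))[0]

x = 0.1                      # not exactly representable in either format
x32 = to_f32(x)
print(x32 == x)              # False: narrowing to binary32 rounded the value
print(to_f32(x32) == x32)    # True: widening to binary64 and narrowing again gives the same binary32 value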
Aside:
The not-a-number (NaN) payload of a binary32 is 23 bits, and that is fully encodable in a binary64, yet the detailed meaning of those bits is implementation dependent.
The Wikipedia page on the IEEE 754 standard contains a table that summarizes different floating point representation formats. Here is an excerpt.
The meaning of the Decimal digits column is the number of decimal digits the significand of the represented number has if you convert it to decimal. The page states that it is computed as (Significand bits) * log_10(2). I can see how that makes sense.
However, I don't see what the meaning of the Decimal E max column is. It is computed as (E max) * log_10(2) and is supposed to be "the maximum exponent in decimal". But isn't E max already the maximum exponent, written in decimal?
I'm asking because these 'decimal' values are (I think) the values that can be passed to selected_real_kind in Fortran. If you define a real with kind selected_real_kind(6, 37) it will be single precision. There will be (at least) 6 significant decimal digits in your number. So a similar question is: what is the meaning of the 37? This is also the value returned by Fortran's range. The GNU Fortran docs state that "RANGE(X) returns the decimal exponent range in the model of the type of X", but that doesn't help me understand what it means.
I always come up with an answer myself minutes after I've posted it on StackExchange even though I've been thinking about it all day...
In binary the number is represented as m*2^(e), with m the mantissa and e the binary exponent. The maximum value of e for single precision is 127.
Converted to decimal, the number can be represented as m*10^(e), with m the mantissa and e the decimal exponent. To cover the same (single-precision) range here, e has a maximum value of 127*log_10(2) ≈ 38.23. You can also see this by noticing that m*10^(127*log_10(2)) = m*2^(127).
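You can check the same arithmetic numerically; a quick sketch in Python (assuming, as on virtually every platform, that Python's float is IEEE binary64):
import math
import sys

# binary64: 52 stored significand bits, E max = 1023
print(52 * math.log10(2))     # ~15.65 -> 15 decimal digits of precision
print(1023 * math.log10(2))   # ~307.95 -> decimal E max of 308
print(sys.float_info.dig, sys.float_info.max_10_exp)   # 15 308

# binary32, the selected_real_kind(6, 37) case
print(23 * math.log10(2))     # ~6.92 -> 6 decimal digits
print(127 * math.log10(2))    # ~38.23 -> the 37/38 decimal exponent range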
I've been studying IEEE 754 for a while, and there's one thing I can't manage to understand. According to my notes, IEEE single precision has 1 bit for the sign, 8 for the exponent and 23 for the mantissa, making a total of 32 bits. The exponent could be described as follows: the first bit gives the sign, and the remaining 7 bits describe some number, which means that the biggest possible exponent is 2^+127 and the lowest is 2^-127. But according to Wikipedia (and other websites), the lowest possible value is -126, which you get if you treat the exponent as a number determined by e - 127, where e is an integer between 1 and 254. Why can't e take the value 0, which would enable the exponent -127?
Look up 'subnormal' or denormalized numbers; they have a biased exponent value of 0.
A denormal number is represented with a biased exponent of all 0 bits, which represents an exponent of −126 in single precision (not −127).
Also, there are 24 logical bits in the mantissa, but for normal numbers the first is always 1, so it isn't actually stored.
Signed zeros are represented by exponent and mantissa with all bits zero, and the sign bit may be 0 (positive) or 1 (negative).
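To make the three encodings above concrete, here is a small Python sketch that splits a binary32 bit pattern into its sign, biased exponent and trailing significand fields:
import struct

def fields(x):
    # Reinterpret x as a binary32 and split it into its three bit fields.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

print(fields(2.0 ** -126))   # (0, 1, 0): smallest normal number, biased exponent 1
print(fields(2.0 ** -127))   # (0, 0, 4194304): subnormal, biased exponent 0
print(fields(0.0))           # (0, 0, 0): positive zero
print(fields(-0.0))          # (1, 0, 0): negative zero, only the sign bit differs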
Could someone explain to me why these are not equal in IEEE 754 floating point:
(1 + 1e300) - 1e100 and 1 + (1e300 - 1e100)
Many thanks!
It depends on precisely which IEEE floating point format you're working in, and the rounding mode that you are using.
F64
The comments above that say they are equal are most likely checking this in at least double (binary64) precision with e.g. round-to-nearest-even (RNE) rounding. This is what usually happens when using double in C-like languages.
In that case, all of your numbers get converted and rounded to an F64 value. The other numbers are so far apart from 1e300 that any addition will round back to roughly 1e300. To display this number in decimals, it gets rounded again and shown as 1e300.
If your rounding mode is not RNE, your final answers may be slightly different from 1e300, although still probably equal.
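You can see the binary64 collapse directly, since the spacing between adjacent doubles near 1e300 is roughly 1.5e284, far larger than either 1 or 1e100 (Python floats are binary64):
>>> 1e300 + 1 == 1e300
True
>>> 1e300 - 1e100 == 1e300
True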
F32
However, if you are working in single (F32) precision, most of your numbers are much too large to represent and will likely be converted to Inf.
Following the IEEE 754 rules: you end up calculating Inf - Inf in both cases, which should result in NaN. Then, if you end up comparing the values as floats, NaN == NaN must be false, which appears to be consistent with what you are seeing.
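The Inf/NaN behaviour is easy to reproduce even without real F32 arithmetic, since the rules are the same in every IEEE 754 binary format; here is a sketch with Python doubles, using math.inf to stand in for the overflowed 1e300 values:
import math

inf = math.inf            # what 1e300 becomes once it overflows the F32 range
lhs = (1 + inf) - inf     # inf - inf -> nan
rhs = 1 + (inf - inf)     # 1 + nan  -> nan
print(lhs, rhs, lhs == rhs)   # nan nan False: NaN never compares equal to anything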
For Python,
>>> (1 + 1e300) - 1e100 == 1 + (1e300 - 1e100)
True
From what I understand, there are no FLT_MAX type constants in GLSL.
Is there any way to ensure that a float represents the largest possible value without overflow?
EDIT:
Since it was asked what I'm using this for:
I'm basically scaling a point out to "infinity". It's for 2D shadow casting, where I completely reshape the triangle-strip shadows on the GPU. As I can only deal with a single vertex at a time, the w component stores whether it stays on the hull or is projected to infinity.
In the case where both 'shadow boundary' points are on the same edge and the light is almost collinear with that edge, I need to ensure that the triangle still covers the entire screen. It's hard to describe.
In GLSL, IEEE 754 infinity can conveniently be achieved by dividing by zero:
float infinity = 1.0 / 0.0;
GLSL uses the IEEE 754 floating-point definition for float:
As an input value to one of the processing units, a single-precision or double-precision floating-point variable is expected to match the corresponding IEEE 754 floating-point definition for precision and dynamic range. Floating-point variables within a shader are also encoded according to the IEEE 754 specification for single-precision floating-point values (logically, not necessarily physically). While encodings are logically IEEE 754, operations (addition, multiplication, etc.) are not necessarily performed as required by IEEE 754.
The maximum representable float value is (1 − 2^-24) × 2^128
Typical maximum floating-point values
You can therefore use this (taken from Microsoft's float.h)
#define FLT_MAX 3.402823466e+38
#define FLT_MIN 1.175494351e-38
#define DBL_MAX 1.7976931348623158e+308
#define DBL_MIN 2.2250738585072014e-308
Exact maximum floating-point value
Also, since the maximum floating-point value is
7f7f ffff = 0 11111110 11111111111111111111111 = 2139095039
here's another interesting way to get an exact maximum value:
float fMaxFloat = intBitsToFloat(2139095039);
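If you want to sanity-check that bit pattern outside a shader, the same reinterpretation is easy to do on the CPU side; for example, in Python:
import struct

# Reinterpret the bit pattern 0x7F7FFFFF (2139095039) as a binary32 value,
# the CPU-side equivalent of GLSL's intBitsToFloat.
f_max = struct.unpack('<f', struct.pack('<I', 0x7F7FFFFF))[0]
print(f_max)                                # 3.4028234663852886e+38
print(f_max == (1 - 2 ** -24) * 2 ** 128)   # True: matches (1 - 2^-24) * 2^128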
Yes, according to the GLSL language specification, section 4.1.4, they are standard IEEE 754 data types. I suppose the lack of FLT_MAX is simply because there are different lengths available: single precision (float), double precision (double), and sometimes even half precision.
You can use positive and negative infinity, or, if you need a finite max/min value, the largest numbers available in floating point. The exact bit pattern is easy to find if you search Stack Overflow.