A floating-point conversion, as the standard defines it, is a conversion between two floating-point types that isn't a promotion.
The simplest example is double to float:
double d = 0.1;
float f = d;
The standard says [conv.double]:
A prvalue of floating-point type can be converted to a prvalue of another floating-point type.
If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation.
If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values.
Otherwise, the behavior is undefined.
The conversions allowed as floating-point promotions are excluded from the set of floating-point conversions.
In my example above, the source value cannot be exactly represented in the destination type. The value of d is 0.10000000000000001, whereas the value of f is (likely) 0.10000000149011612, and indeed if you cast f back to a double, it doesn't equal d. However, this source value lies between two adjacent destination values: f and the previous representable float value, 0.099999994039535522. So f can be either of these values, but because 0.10000000149011612 is closer to 0.10000000000000001 than 0.099999994039535522 is, that is likely the value the implementation chooses.
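To make the "between two adjacent destination values" case concrete, here is a minimal sketch that finds the two float neighbours of a double. The names FloatNeighbors and bracket are made up for illustration, and the function assumes its argument is finite, inside float's range, and not exactly representable as a float.

```cpp
#include <cassert>
#include <cmath>

// The two adjacent float values that bracket a double (illustrative type).
struct FloatNeighbors {
    float lower;
    float upper;
};

// Finds the float neighbours of d. Assumes d is finite, within float's
// range, and not exactly representable as a float.
FloatNeighbors bracket(double d) {
    // The conversion picks one of the two neighbours (implementation-defined).
    float f = static_cast<float>(d);
    assert(static_cast<double>(f) != d);

    FloatNeighbors n;
    if (static_cast<double>(f) > d) {
        // The implementation rounded up, so the other neighbour is one step down.
        n.upper = f;
        n.lower = std::nextafter(f, -INFINITY);
    } else {
        // The implementation rounded down, so the other neighbour is one step up.
        n.lower = f;
        n.upper = std::nextafter(f, INFINITY);
    }
    return n;
}
```

Calling bracket(0.1) returns the pair discussed above; whichever of the two the conversion itself picked is up to the implementation.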
My question is about the last case:
Otherwise, the behavior is undefined.
Are there any values for which a conversion is undefined behavior? Since floating-point types have representations for +infinity and -infinity, I would assume there cannot be any source value that isn't exactly represented or between two adjacent destination values: any double value is either an exact float value (including NaN) or between -infinity and +infinity, in which case it is between two adjacent float values.
So what is the point of this "otherwise" case? Is it here to cover exotic types that are considered floating-point but aren't float, double, or long double? Can a conversion between float, double, and long double cause undefined behavior?
It turns out some floating-point implementations cannot represent infinities. MBF, as Eljay pointed out, is one of them. It's also implied by the existence of HUGE_VAL, which is the same as INFINITY if possible.
However, this is extremely unlikely, and can be tested with std::numeric_limits<T>::has_infinity. Presumably, if this value is true, then there cannot be any undefined behavior with floating-point conversions.
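A rough sketch of that test follows. The helper conversion_is_defined is a hypothetical name, and it simply encodes the presumption stated above (a destination type with infinities never hits the "otherwise" case); it is not something the standard spells out in this form.

```cpp
#include <cassert>
#include <limits>

// Returns true if converting value to To should not hit the "otherwise" case.
// Encodes the presumption above: if To has infinities, every source value is
// either exactly representable or between two adjacent destination values.
template <typename To, typename From>
bool conversion_is_defined(From value) {
    if (std::numeric_limits<To>::has_infinity) {
        return true;
    }
    // Without infinities, only values inside [lowest, max] are in range.
    // A NaN source fails both comparisons and is reported as not defined.
    return value >= std::numeric_limits<To>::lowest()
        && value <= std::numeric_limits<To>::max();
}

// Small usage example: 0.1 is always in range for float.
void example_usage() {
    assert(conversion_is_defined<float>(0.1));
}
```

On a platform where float has an infinity, conversion_is_defined<float>(1e300) returns true; on a hypothetical platform without one, it would return false.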
Can a floating-point conversion cause undefined behavior?
Yes.
Consider two float values: FLT_MAX and nextafterf(FLT_MAX, 0). Call their difference delta. All three are stored in double variables.
Assume random(a, b) returns a random double in the range (a, b].
double max = FLT_MAX;
double max_before = nextafterf(FLT_MAX, 0);
double delta = max - max_before;
// Conversion to `float` is well defined:
// in_between lies between two adjacent float values (inclusive).
double in_between = max_before + random(0.0, delta);
float in_betweenf = in_between;
// Conversion to `float` can fail because the `double` value can
// exceed FLT_MAX, even if the sum is the smallest `double` greater than `FLT_MAX`.
double above_max = max + random(0.0, delta);
float above_maxf = above_max;
This is primarily the case when float does not support infinity.
If the value being converted is outside the range of values that can be represented, the behavior is undefined. C17dr § 6.3.1.5 1.
Ideally it would be nice if in_between = max_before + delta + random(0.0, 0.5*delta); was well defined, but it is not when float lacks an infinity.
Related
float fv = original_value; // original_value may be any float value
...
double dv = (double)fv;
...
fv = (float)dv;
Should fv be exactly equal to original_value? Can any precision be lost?
Yes, if the value of dv did not change in between.
From section Conversion 6.3.1.5 Real Floating types in C99 specs:
When a float is promoted to double or long double, or a double is
promoted to long double, its value is unchanged.
When a double is
demoted to float, a long double is demoted to double or float, or a
value being represented in greater precision and range than required
by its semantic type (see 6.3.1.8) is explicitly converted to its
semantic type, if the value being converted can be represented exactly
in the new type, it is unchanged. If the value being converted is in
the range of values that can be represented but cannot be represented
exactly, the result is either the nearest higher or nearest lower
representable value, chosen in an implementation-defined manner. If
the value being converted is outside the range of values that can be
represented, the behavior is undefined
For C++, from section 4.6, aka conv.fpprom (draft used: N3337; I believe the final standard has similar wording):
A prvalue of type float can be converted to a prvalue of type double.
The value is unchanged. This conversion is called floating point
promotion.
And section 4.8 aka conv.double
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined. The conversions allowed as floating point
promotions are excluded from the set of floating point conversions
So the values should be equal exactly.
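A small sketch of that round trip, assuming nothing modifies the double in between. The function name round_trips is made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns true if f survives float -> double -> float unchanged.
bool round_trips(float f) {
    double d = f;                        // promotion: the value is unchanged
    float back = static_cast<float>(d);  // d is exactly representable as float
    if (std::isnan(f)) {
        return std::isnan(back);         // NaN never compares equal to itself
    }
    return back == f;
}

// Small usage example covering a few ordinary and extreme float values.
void example_usage() {
    assert(round_trips(0.1f));
    assert(round_trips(-0.0f));
    assert(round_trips(std::numeric_limits<float>::max()));
    assert(round_trips(std::numeric_limits<float>::min()));
}
```

The check relies only on the two quoted rules: the promotion leaves the value unchanged, and the conversion back is exact because the value is representable in float.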
Let's say I have two arithmetic types, an integer one, I, and a floating point one, F. I also assume that std::numeric_limits<I>::max() is smaller than std::numeric_limits<F>::max().
Now, let's say I have a positive integer value i. Because the representable range of F is larger than I, F(i) should always be defined behavior.
However, if I have a floating point value f such that f == F(i), is I(f) well defined? In other words, is I(F(i)) always defined behavior?
Relevant section from the C++14 standard:
4.9 Floating-integral conversions [conv.fpint]
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates;
that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be
represented in the destination type. [ Note: If the destination type is bool, see 4.12. — end note ]
A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating
point type. The result is exact if possible. If the value being converted is in the range of values that can
be represented but the value cannot be represented exactly, it is an implementation-defined choice of either
the next lower or higher representable value. [ Note: Loss of precision occurs if the integral value cannot
be represented exactly as a value of the floating type. — end note ] If the value being converted is outside
the range of values that can be represented, the behavior is undefined. If the source type is bool, the value
false is converted to zero and the value true is converted to one.
However, if I have a floating point value f such that f == F(i), is I(f) well defined? In other words, is I(F(i)) always defined behavior?
No.
Suppose that I is a signed two's complement 32 bit integer type, F is a 32 bit single precision floating point type, and i is the maximum positive integer. This is within the range of the floating point type, but it cannot be represented exactly as a floating point number. Some of those 32 bits are used for the exponent.
Instead, the result of the conversion from integer to floating point is implementation-defined, but it is typically done by rounding to the closest representable value. That rounded value is one beyond the range of the integer type. The conversion back to integer fails (more precisely, it is undefined behavior).
No.
It's possible that i == std::numeric_limits<I>::max(), but that i is not exactly representable in F.
If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value.
Since the next higher representable value may be chosen, it's possible that the result F(i) no longer fits into I, so conversion back would be undefined behavior.
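One way to avoid the undefined conversion is to range-check the floating-point value before converting it back. The sketch below uses a hypothetical helper, float_to_int_checked, and assumes a two's complement int, so that the minimum int value is a negated power of two and converts to float exactly.

```cpp
#include <cassert>
#include <limits>

// Converts f to int only when the truncated value fits in int.
// Returns false (leaving out untouched) otherwise.
// Assumes two's complement int, so INT_MIN is exactly representable as float.
bool float_to_int_checked(float f, int& out) {
    const float lo = static_cast<float>(std::numeric_limits<int>::min());
    const float hi = -lo;  // one past INT_MAX, also exactly representable
    if (!(f >= lo && f < hi)) {  // written this way so that NaN is rejected too
        return false;
    }
    out = static_cast<int>(f);   // truncation toward zero, now known to fit
    return true;
}

// Small usage example: an in-range value is truncated, not rounded.
void example_usage() {
    int out = 0;
    bool ok = float_to_int_checked(123.75f, out);
    assert(ok && out == 123);
}
```

With a 32-bit int and an implementation that rounds F(i) up for i == std::numeric_limits<int>::max(), the helper reports failure instead of performing the undefined conversion.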
No. Regardless of the standard, you cannot expect that in general this conversion will return your original integer. It doesn't make sense mathematically. But if you read into what you quoted, the standard clearly indicates the possibility of a loss of precision upon converting from int to float.
Suppose your types I and F use the same number of bits. All of the bits of I (save possibly one that stores the sign) are used to specify the absolute value of the number. On the other hand, in F, some bits are used to specify the exponent and some are used for the significand. The range will be greater because of the possible exponent. But the significand will have less precision because there are fewer bits devoted to its specification.
Just as a test, I printed
std::numeric_limits<int>::max();
std::numeric_limits<float>::max();
I then converted the first number to float and back again. The max float had an exponent of 38, and the max int had 10 digits, so clearly float has a larger range. But upon converting the max int to float and back, I went from 2147483647 to -2147483648. So it seems the number was incremented by one unit and wrapped around to the negative side.
I didn't check how many bits are actually used for float on my system, but it at least demonstrates the loss of precision, and it shows that gcc "rounded up".
If I try this
float f = (float)numeric_limits<double>::infinity();
Or indeed, try to cast anything bigger than float max down to a float, am I guaranteed to end up with infinity?
It works on GCC, but is it guaranteed by the standard?
float f = (float)numeric_limits<double>::infinity();
This is guaranteed to set f to infinity if your compilation platform offers IEEE 754 arithmetic for floating-point computations (it usually does).
Or indeed, try to cast anything bigger than float max down to a float, am I guaranteed to end up with infinity?
No. In the default IEEE 754 round-to-nearest mode, a few double values above the maximum finite float (that is, FLT_MAX) convert to FLT_MAX. The exact limit is the number midway between FLT_MAX (0x1.fffffep127 in C99 hexadecimal representation) and the next float number that could be represented if the exponent in the single-precision format had a larger range, 0x2.0p127. The limit is thus 0x1.ffffffp127 or approximately 3.4028235677973366e+38 in decimal.
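The threshold can be checked with a short sketch, assuming IEEE 754 binary32 and binary64 with the default round-to-nearest-even mode. The function name overflow_threshold is made up for illustration, and std::ldexp is used instead of a hexadecimal literal so the code also compiles before C++17.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Midpoint between FLT_MAX (2^128 - 2^104) and 2^128, i.e. 0x1.ffffffp127.
// Both terms and their difference are exactly representable in binary64.
double overflow_threshold() {
    return std::ldexp(1.0, 128) - std::ldexp(1.0, 103);
}

// Assumes IEEE 754 formats and round-to-nearest-even.
void example_usage() {
    const double limit = overflow_threshold();
    const double just_below = std::nextafter(limit, 0.0);

    // Strictly below the midpoint: rounds down to the largest finite float.
    assert(static_cast<float>(just_below) == FLT_MAX);
    // At the midpoint or above: IEEE 754 rounds to infinity.
    assert(std::isinf(static_cast<float>(limit)));
}
```

On a platform that uses a different rounding mode or a non-IEEE float, the two assertions are not expected to hold.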
From the C++11 standard, §4.8.1:
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
This implies that
If you cast double infinity to float, you get float infinity.
If you cast a double value, that lies between float max and infinity, to float, then you get float max or float infinity.
Consider two ways of computing something:
data in double precision
-> apply a function with double precision temporaries
-> return result
data in double precision
-> cast to long double
-> apply a function with long double precision temporaries
-> cast to double
-> return result
Can the second solution give a less accurate result compared to the first one and if yes in what case?
Yes. Proof: Let c = 0x1p-53 + 0x1p-64. Evaluate 1+c-c-1 in double and in long double (of the common Intel format, with a 64-bit significand). In double, the result is 0, which is the mathematically exact answer. In long double, the result is -0x1p-64, which is wrong (and remains wrong when cast to double).
In double, 1+c adds slightly more than half the ULP (unit of least precision) of 1 to 1, so it produces 1 plus an ULP. Subtracting c subtracts slightly more than half an ULP, so the closest representable number (in double) to the result is 1, so 1 is produced. Then subtracting 1 yields 0.
In long double, 1+c adds 0x1p-53 plus half an ULP of 1. (In long double, the ULP of 1 is 0x1p-63.) Since the result is exactly the same distance from the two nearest representable numbers (in long double), the one with the low bit zero is returned, 1+0x1p-53. Then the exact result of subtracting c is 1 - 0x1p-64. This is exactly representable, so it is returned. Finally, subtracting 1 yields -0x1p-64.
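The computation can be reproduced with the sketch below. It assumes double is IEEE 754 binary64 and only checks the long double result when long double has a 64-bit significand. The helper name residue is made up for illustration, and volatile is used only to discourage the compiler from keeping intermediates in a wider register.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Evaluates 1 + c - c - 1 left to right, rounding every step to type T.
template <typename T>
T residue(T c) {
    volatile T t = T(1) + c;  // volatile forces each intermediate to be stored as T
    t = t - c;
    t = t - T(1);
    return t;
}

// Assumes IEEE 754 binary64 for double and round-to-nearest-even.
void example_usage() {
    // c = 0x1p-53 + 0x1p-64, exactly representable in double.
    const double c = std::ldexp(1.0, -53) + std::ldexp(1.0, -64);

    // In double the rounding errors cancel and the exact answer 0 comes out.
    assert(residue<double>(c) == 0.0);

    // With a 64-bit significand the result is -0x1p-64 instead.
    if (std::numeric_limits<long double>::digits == 64) {
        assert(residue<long double>(c) == -std::ldexp(1.0L, -64));
    }
}
```

Where long double is the same format as double, or a 113-bit format, the second check is skipped because the analysis above does not apply.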
About long double the draft says:
3.9.1 Fundamental Types
8 There are three floating point types: float, double, and long double. The type double provides at least
as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values
of the type double is a subset of the set of values of the type long double. The value representation of
floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic
types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum
and minimum values of each arithmetic type for an implementation.
As for promotions which is the next most interesting bit:
4.6 Floating point promotion
1 A prvalue of type float can be converted to a prvalue of type
double. The value is unchanged.
2 This conversion is called floating
point promotion.
Note that nothing is said about double to long double. I'd hazard a guess that this is an oversight, though.
Next, conversions, which are what we are interested in when going from long double to double:
4.8 Floating point conversions
1 A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
2 The conversions allowed as floating point
promotions are excluded from the set of floating point conversions.
Now, let's see the effects of narrowing:
6. A narrowing conversion is an implicit conversion
[...]
from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after
conversion is within the range of values that can be represented (even
if it cannot be represented exactly)
There are two takeaways from all this standardese:
Combining the rule about narrowing with the rule about implementation-defined conversions, your results may change across platforms.
If your intermediate long double results fall in a range that a double cannot represent accurately (too large or too small), the errors can accumulate and change the final result that you convert back to double.
As for which is more accurate, I think that depends entirely on your application.
Suppose float a = (1.5 * b) where b is float then how is this expression evaluated?
Is 1.5 treated as double or float?
1.5 is a double; use 1.5f for a float. What the expression actually does is:
float a = (float)(1.5 * (double)b)
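The types involved can be confirmed at compile time. The sketch below is only an illustration; the function names scale_via_double and scale_in_float are made up.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// A double literal times a float operand yields double.
static_assert(std::is_same<decltype(1.5 * std::declval<float>()), double>::value,
              "1.5 * float should have type double");
// A float literal times a float operand yields float.
static_assert(std::is_same<decltype(1.5f * std::declval<float>()), float>::value,
              "1.5f * float should have type float");

// Multiplies in double, then narrows the result back to float.
float scale_via_double(float b) {
    return static_cast<float>(1.5 * b);
}

// Multiplies with both operands of type float.
float scale_in_float(float b) {
    return 1.5f * b;
}

// Small usage example with a value whose product is exact in both types.
void example_usage() {
    assert(scale_via_double(2.0f) == 3.0f);
    assert(scale_in_float(2.0f) == 3.0f);
}
```

The static_assert lines fail to compile if the usual arithmetic conversions did not behave as described, so they double as documentation.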
1.5 is a floating point literal, a double value. C++03 2.13.3 Floating literals has this to say:
A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. ... The type of a floating literal is double unless explicitly specified by a suffix.
Section 13.3.3.1 Standard conversion sequences defines the way in which conversions are handled but it's a little dry to repeat here. Suffice to say that floating point promotion is done and section 4.6 Floating point promotion states that:
An rvalue of type float can be converted to an rvalue of type double. The value is unchanged.
Hence the float b is promoted to a double to perform the multiplication.
Then the calculation is performed using (effectively) a temporary double and the result is shoe-horned back into the float a.
So, effectively:
float b = something;
double xyzzy0 = 1.5;
double xyzzy1 = (double)b;
double xyzzy2 = xyzzy0 * xyzzy1;
float a = xyzzy2;
That last step may be problematic. Section 4.8 Floating point conversions (which doesn't include the safer promotions like float to double) states:
An rvalue of floating point type can be converted to an rvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.
The conversions allowed as floating point promotions are excluded from the set of floating point conversions.
In other words, if the multiplication results in a value outside the range of a float, all bets are off. This is likely to happen if the absolute value of b is more than about 67% of the maximum value of a float (the sign of b doesn't matter).
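If that matters, the double product can be checked against the float range before it is narrowed. This is a minimal sketch; scale_checked is a hypothetical helper, not something the standard provides.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Stores 1.5 * b in out only if the double product is within float's
// finite range. Returns false (leaving out untouched) otherwise.
bool scale_checked(float b, float& out) {
    const double product = 1.5 * b;          // b is promoted to double
    if (!(std::fabs(product) <= FLT_MAX)) {  // rejects overflow and NaN
        return false;
    }
    out = static_cast<float>(product);       // in range, so well defined
    return true;
}

// Small usage example on either side of the 67% mark.
void example_usage() {
    float out = 0.0f;
    assert(scale_checked(0.5f * FLT_MAX, out));   // 0.75 * FLT_MAX fits
    assert(!scale_checked(0.7f * FLT_MAX, out));  // about 1.05 * FLT_MAX does not
}
```

The cutoff follows from 1 / 1.5, which is roughly 0.67: any b whose magnitude exceeds that fraction of FLT_MAX produces a product beyond the largest finite float.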