float fv = original_value; // original_value may be any float value
...
double dv = (double)fv;
...
fv = (float)dv;
Should fv be exactly equal to original_value? Can any precision be lost?
Yes, if the value of dv did not change in between.
From section 6.3.1.5, Real floating types (under Conversions), in the C99 specification:
When a float is promoted to double or long double, or a double is
promoted to long double, its value is unchanged.
When a double is
demoted to float, a long double is demoted to double or float, or a
value being represented in greater precision and range than required
by its semantic type (see 6.3.1.8) is explicitly converted to its
semantic type, if the value being converted can be represented exactly
in the new type, it is unchanged. If the value being converted is in
the range of values that can be represented but cannot be represented
exactly, the result is either the nearest higher or nearest lower
representable value, chosen in an implementation-defined manner. If
the value being converted is outside the range of values that can be
represented, the behavior is undefined.
For C++, from section 4.6 aka conv.fpprom (draft used: n337 and I believe similar lines are available in final specs)
A prvalue of type float can be converted to a prvalue of type double.
The value is unchanged. This conversion is called floating point
promotion.
And section 4.8 aka conv.double
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined. The conversions allowed as floating point
promotions are excluded from the set of floating point conversions
So the values should be equal exactly.
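The round trip is easy to check directly. Below is a minimal sketch, not taken from either standard: the helper name round_trips is made up for this example, and it assumes the double is not modified between the two conversions. NaN is excluded because it never compares equal to anything, including itself.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true if f survives float -> double -> float unchanged.
// Precondition: f is not NaN (a NaN never compares equal, even to itself).
bool round_trips(float f) {
    assert(!std::isnan(f));
    double dv = static_cast<double>(f);   // promotion: the value is unchanged
    float back = static_cast<float>(dv);  // dv is exactly representable as a float
    return back == f;
}
```

Calling round_trips(0.1f) or round_trips(-3.5f) should return true for every non-NaN float, since the intermediate double always holds a value that float can represent exactly.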
Related
A floating-point conversion, as the standard defines it, is a conversion between two floating-point types that isn't a promotion.
The simplest example is double to float:
double d = 0.1;
float f = d;
The standard says [conv.double]:
A prvalue of floating-point type can be converted to a prvalue of another floating-point type.
If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation.
If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values.
Otherwise, the behavior is undefined.
The conversions allowed as floating-point promotions are excluded from the set of floating-point conversions.
In my example above, the source value cannot be exactly represented in the destination type. The value of d is 0.10000000000000001, whereas the value of f is (likely) 0.10000000149011612, and indeed if you cast f back to a double, it doesn't equal d. However, this source value is between two adjacent destination values: f and the previous representable float value, 0.099999994039535522. So the value of f can be either of these values, but because 0.10000000149011612 is closer to 0.10000000000000001 than 0.099999994039535522 is, that's likely the value chosen by the implementation.
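The "between two adjacent destination values" case can be probed with a small sketch. The helper name lands_on_neighbor is hypothetical, and the code assumes the input is finite and within the finite range of float, so only the first two cases of the quoted rule are exercised.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Hypothetical helper: true if d is either exactly representable as a float,
// or lies strictly between the converted float and its adjacent float on the
// other side of d.
// Precondition: d is finite and within [-FLT_MAX, FLT_MAX].
bool lands_on_neighbor(double d) {
    assert(std::isfinite(d) && std::fabs(d) <= FLT_MAX);
    float f = static_cast<float>(d);
    double fd = static_cast<double>(f);  // promotion back: value unchanged
    if (fd == d) {
        return true;  // d was exactly representable as a float
    }
    // The other candidate is the adjacent float on the far side of d.
    float other = std::nextafter(f, fd < d ? INFINITY : -INFINITY);
    double lo = std::fmin(fd, other);
    double hi = std::fmax(fd, other);
    return lo < d && d < hi;
}
```

For d = 0.1 the converted float does not compare equal to d after promotion back to double, yet lands_on_neighbor(0.1) still holds because 0.1 sits between two adjacent float values.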
My question is about the last case:
Otherwise, the behavior is undefined.
Are there any values for which a conversion is undefined behavior? Since floating-point types have representations for +infinity and -infinity, I would assume there cannot be any source value that isn't exactly represented or between two adjacent destination values: any double value is either an exact float value (including NaN) or between -infinity and +infinity, in which case it is between two adjacent float values.
So what is the point of this "otherwise" case? Is it here to cover exotic types that are considered floating-point but aren't float, double, or long double? Can a conversion between float, double, and long double cause undefined behavior?
It turns out some floating-point implementations cannot represent infinities. MBF, as Eljay pointed out, is one of them. It's also implied by the existence of HUGE_VAL, which is the same as INFINITY if possible.
However, this is extremely unlikely, and can be tested with std::numeric_limits<T>::has_infinity. Presumably, if this value is true, then there cannot be any undefined behavior with floating-point conversions.
Can a floating-point conversion cause undefined behavior?
Yes.
Consider two float values: FLT_MAX and nextafterf(FLT_MAX, 0). Call their difference delta, and store all three in doubles.
Assume random(a, b) returns a random double in the range (a, b].
double max = FLT_MAX;
double max_before = nextafterf(FLT_MAX, 0);
double delta = max - max_before;
// Conversion to `float` is well defined
double in_between = max_before + random(0.0, delta);
float in_betweenf = in_between; // in_between lies between two adjacent float values (inclusive).
// Conversion to `float` can fail because the `double` value can
// exceed FLT_MAX, even if the sum is the smallest `double` greater than `FLT_MAX`.
double above_max = max + random(0.0, delta);
float above_maxf = above_max;
This is primarily the case when float does not support infinity.
If the value being converted is outside the range of values that can be represented, the behavior is undefined. C17dr § 6.3.1.5 1.
Ideally it would be nice if in_between = max_before + delta + random(0.0, 0.5*delta); was well defined, but it is not when float lacks an infinity.
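One way to stay clear of the "otherwise" case is to test the range before converting. The following is only a sketch under my own assumptions: to_float_checked is a made-up name, and the policy is deliberately conservative, since it rejects every finite value beyond the largest finite float even on platforms where such a value would simply round to FLT_MAX or infinity.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: converts d to float only when the result is covered by
// the "exactly representable" or "between two adjacent values" cases.
// Returns false and leaves out untouched otherwise.
bool to_float_checked(double d, float& out) {
    using lim = std::numeric_limits<float>;
    if (std::isnan(d)) {
        if (!lim::has_quiet_NaN) return false;  // float cannot hold a NaN
    } else if (std::isinf(d)) {
        if (!lim::has_infinity) return false;   // float cannot hold an infinity
    } else if (std::fabs(d) > lim::max()) {
        return false;                           // finite but beyond the float range
    }
    out = static_cast<float>(d);
    // Sanity check: a finite in-range input must give a finite float.
    assert(!std::isfinite(d) || std::isfinite(out));
    return true;
}
```

With this, to_float_checked(0.5, f) succeeds and stores 0.5f, while to_float_checked(1e300, f) reports failure instead of performing the conversion.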
If I have
float value = 10.50;
and do
int new_value = (int)value;
what rules determine how the number is rounded?
When a finite value of floating type is converted to an integer type, the fractional part is discarded (i.e., the value is truncated toward zero).
So in the case of -10.5, it's converted to -10.
C++11 4.9 Floating-integral conversions [conv.fpint]
An rvalue of a floating point type can be converted to an rvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [ Note: If the destination type is bool, see 4.12. —end note ]
The rule is quite simple: the number simply gets truncated to its integral part, in this case, to 10. The fractional part gets dropped entirely. The same applies to negative numbers: -10.5 would be converted to -10.
When converted to integers, the fractional part of the float is dropped, meaning the float 10.5 will be converted to the integer 10, and the float -10.5 will be converted to the integer -10.
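Since the behavior is undefined when the truncated value does not fit in the destination type, a range check before the cast can be worthwhile. This is a minimal sketch with a made-up helper name (to_int_checked); it assumes a binary floating-point float, in which the magnitude of INT_MIN is exactly representable, and it may conservatively reject a few borderline values on unusual int widths.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Hypothetical helper: truncating float -> int conversion with a range check.
// Returns false (leaving out untouched) for NaN, infinities and values whose
// truncation would not fit in an int.
bool to_int_checked(float value, int& out) {
    const float lower = static_cast<float>(INT_MIN);   // lowest int, as a float
    const float upper = -static_cast<float>(INT_MIN);  // one past INT_MAX
    if (!(value >= lower && value < upper)) {
        return false;  // also rejects NaN: every comparison with NaN is false
    }
    out = static_cast<int>(value);  // fractional part is discarded (toward zero)
    assert(static_cast<float>(out) == std::trunc(value));
    return true;
}
```

to_int_checked(10.5f, n) stores 10 and to_int_checked(-10.5f, n) stores -10, matching the truncation rule quoted above.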
If I try this
float f = (float)numeric_limits<double>::infinity();
Or indeed, try to cast anything bigger than float max down to a float, am I guaranteed to end up with infinity?
It works on GCC, but is it a standard though?
float f = (float)numeric_limits<double>::infinity();
This is guaranteed to set f to infinity if your compilation platform offers IEEE 754 arithmetic for floating-point computations (it usually does).
Or indeed, try to cast anything bigger than float max down to a float, am I guaranteed to end up with infinity?
No. In the default IEEE 754 round-to-nearest mode, a few double values above the maximum finite float (that is, FLT_MAX) convert to FLT_MAX. The exact limit is the number midway between FLT_MAX (0x1.fffffep127 in C99 hexadecimal representation) and the next float number that could be represented if the exponent in the single-precision format had a larger range, 0x2.0p127. The limit is thus 0x1.ffffffp127 or approximately 3.4028235677973366e+38 in decimal.
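The threshold can be checked with a short sketch. It assumes IEEE 754 float and double and the default round-to-nearest mode (the static_assert covers only the first assumption); overflow_threshold and narrow are names made up for this example. The midpoint is built as FLT_MAX + 2^103, which equals 0x1.ffffffp127 and is exactly representable as a double.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 &&
              std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE 754 float and double");

// Midpoint between FLT_MAX (2^128 - 2^104) and 2^128, i.e. 2^128 - 2^103.
double overflow_threshold() {
    const double t = static_cast<double>(FLT_MAX) + std::ldexp(1.0, 103);
    assert(std::isfinite(t) && t > FLT_MAX);
    return t;
}

// Plain narrowing conversion, wrapped in a function for readability.
float narrow(double d) {
    return static_cast<float>(d);
}
```

Under those assumptions, narrow(std::nextafter(overflow_threshold(), 0.0)) gives FLT_MAX, while narrow(overflow_threshold()) gives infinity, because the tie is resolved toward the candidate with the even significand.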
From the C++11 standard, §4.8.1:
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
This implies that
If you cast double infinity to float, you get float infinity.
If you cast a double value, that lies between float max and infinity, to float, then you get float max or float infinity.
Consider two ways of computing something:
data in double precision
-> apply a function with double precision temporaries
-> return result
data in double precision
-> cast to long double
-> apply a function with long double precision temporaries
-> cast to double
-> return result
Can the second solution give a less accurate result compared to the first one and if yes in what case?
Yes. Proof: Let c = 0x1p-53 + 0x1p-64. Evaluate 1+c-c-1 in double and in long double (of the common Intel format, with a 64-bit significand). In double, the result is 0, which is the mathematically exact answer. In long double, the result is -0x1p-64, which is wrong (and remains wrong when cast to double).
In double, 1+c adds slightly more than half the ULP (unit of least precision) of 1 to 1, so it produces 1 plus an ULP. Subtracting c subtracts slightly more than half an ULP, so the closest representable number (in double) to the result is 1, so 1 is produced. Then subtracting 1 yields 0.
In long double, 1+c adds 0x1p-53 plus half an ULP of 1. (In long double, the ULP of 1 is 0x1p-63.) Since the result is exactly the same distance from the two nearest representable numbers (in long double), the one with the low bit zero is returned, 1+0x1p-53. Then the exact result of subtracting c is 1 - 0x1p-64. This is exactly representable, so it is returned. Finally, subtracting 1 yields -0x1p-64.
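The computation can be reproduced with a sketch like the one below. The function name is made up, and the results described above are only expected when double arithmetic is evaluated in double precision (FLT_EVAL_METHOD == 0), long double has a 64-bit significand (LDBL_MANT_DIG == 64), and rounding is to nearest. The volatile qualifiers are there to discourage the compiler from folding the expression or keeping extra precision.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>

// Evaluates ((1 + c) - c) - 1 with c = 2^-53 + 2^-64, rounding each
// intermediate result to type T.
template <typename T>
T one_plus_c_minus_c_minus_one() {
    volatile T one = 1;
    volatile T c = std::ldexp(T(1), -53) + std::ldexp(T(1), -64);
    // c must be exactly representable in T for the argument to apply.
    assert(c - std::ldexp(T(1), -53) == std::ldexp(T(1), -64));
    volatile T sum = one + c;    // rounds differently in double and long double
    volatile T diff = sum - c;
    return diff - one;
}
```

On a platform meeting those assumptions, one_plus_c_minus_c_minus_one<double>() returns 0 and one_plus_c_minus_c_minus_one<long double>() returns -0x1p-64. Where long double has a different format, the second result may differ.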
About long double the draft says:
3.9.1 Fundamental Types
8 There are three floating point types: float, double, and long double. The type double provides at least
as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values
of the type double is a subset of the set of values of the type long double. The value representation of
floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic
types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum
and minimum values of each arithmetic type for an implementation.
As for promotions which is the next most interesting bit:
4.6 Floating point promotion
1 A prvalue of type float can be converted to a prvalue of type
double. The value is unchanged.
2 This conversion is called floating
point promotion.
Note that nothing is said about double to long double. I'd hazard that this is an oversight, though.
Next about conversions which is what we are interested when you go from long double to double:
4.8 Floating point conversions
1 A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
2 The conversions allowed as floating point
promotions are excluded from the set of floating point conversions.
Now, let's see the effects of narrowing:
6. A narrowing conversion is an implicit conversion
[...]
from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after
conversion is within the range of values that can be represented (even
if it cannot be represented exactly)
There are two takeaways from all this standardese:
Combining the bit about narrowing with the bit about implementation defined conversions there may be changes in your results across platforms.
If your intermediate long double results (considering several of them) fall in a range that a double cannot represent accurately (too large or too small), the differences can accumulate and change the final result once it is converted back to double.
As for which is more accurate, I think that depends entirely on your application.
Suppose float a = (1.5 * b) where b is float then how is this expression evaluated?
Is 1.5 treated as double or float?
1.5 is a double; use 1.5f for a float. What the expression actually does is:
float a = (float)(1.5 * (double)b)
1.5 is a floating point literal, a double value. C++03 2.13.3 Floating literals has this to say:
A floating literal consists of an integer part, a decimal point, a fraction part, an e or E, an optionally signed integer exponent, and an optional type suffix. ... The type of a floating literal is double unless explicitly specified by a suffix.
Section 13.3.3.1 Standard conversion sequences defines the way in which conversions are handled but it's a little dry to repeat here. Suffice to say that floating point promotion is done and section 4.6 Floating point promotion states that:
An rvalue of type float can be converted to an rvalue of type double. The value is unchanged.
Hence the float b is promoted to a double to perform the multiplication.
Then the calculation is performed using (effectively) a temporary double and the result is shoe-horned back into the float a.
So, effectively:
float b = something;
double xyzzy0 = 1.5;
double xyzzy1 = (double)b;
double xyzzy2 = xyzzy0 * xyzzy1;
float a = xyzzy2;
That last step may be problematic. Section 4.8 Floating point conversions (which doesn't include the safer promotions like float to double) states:
An rvalue of floating point type can be converted to an rvalue of another floating point type. If the source value can be exactly represented in the destination type, the result of the conversion is that exact representation. If the source value is between two adjacent destination values, the result of the conversion is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.
The conversions allowed as floating point promotions are excluded from the set of floating point conversions.
In other words, if the multiplication results in a value outside the range of a float, all bets are off. This is likely to happen once the absolute value of b exceeds roughly 67% of the maximum finite float value (FLT_MAX), because 1.5 times that is already larger than FLT_MAX; the sign of b doesn't matter.
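Both points, the type of the expression and the range concern, can be sketched in a few lines. The helper name scaled_fits_in_float is made up for this example; the static_asserts only restate the usual arithmetic conversions and are not specific to any compiler.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <type_traits>

// The literal 1.5 is a double, so a float operand is promoted first.
static_assert(std::is_same<decltype(1.5 * 1.0f), double>::value,
              "1.5 * float is evaluated as double");
// With the f suffix both operands are float, and so is the result.
static_assert(std::is_same<decltype(1.5f * 1.0f), float>::value,
              "1.5f * float stays float");

// Hypothetical helper: true if 1.5 * b, computed in double, lies within the
// finite range of float, so narrowing it back to float is well defined.
// Precondition: b is not NaN.
bool scaled_fits_in_float(float b) {
    assert(!std::isnan(b));
    double product = 1.5 * b;  // b is promoted to double here
    return std::fabs(product) <= FLT_MAX;
}
```

scaled_fits_in_float(0.6f * FLT_MAX) holds, whereas scaled_fits_in_float(0.7f * FLT_MAX) does not, which matches the two-thirds threshold mentioned above.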