Difference between narrowing and truncation in C++?

I've been going through a book (C++ Programming Language Stroustrup 4th ed).
An example given in a section related to Initialization as below:
void f(double d, int i)
{
int a{ d }; // error : possible truncation
char b{ i }; // error : possible narrowing
}
What exactly is the difference between truncation and narrowing?

A narrowing conversion is basically any conversion that may cause a loss of information. Strictly speaking, a narrowing conversion is:
an implicit conversion
from a floating-point type to an integer type, or
from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly), or
from an integer type or unscoped enumeration type to a floating-point type, except where the source is a constant expression and the actual value after conversion will fit into the target type and will produce the original value when converted back to the original type, or
from an integer type or unscoped enumeration type to an integer type that cannot represent all the values of the original type, except where the source is a constant expression whose value after integral promotions will fit into the target type, or
from a pointer type or a pointer-to-member type to bool.
Notice that this means both of the conversions you posted are narrowing conversions. int a{ d }; is the first case and char b{ i }; is the fourth.
Truncation only occurs when converting from a floating-point type to an integer type. It generally refers to the fractional part of the floating-point number being lost. This implies that truncation is a subset of narrowing conversions.
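To make the distinction concrete, here is a minimal sketch. The function names are made up for illustration and do not come from the book, and the range check assumes an ordinary 8-bit char.

```cpp
#include <limits>

// Explicit double -> int conversion: the fractional part is discarded
// (truncation toward zero). Undefined if the result does not fit in int.
int truncate_to_int(double d) {
    return static_cast<int>(d);
}

// Hypothetical helper: true when i can be stored in a char unchanged,
// i.e. when the int -> char conversion would not lose information.
bool fits_in_char(int i) {
    return i >= std::numeric_limits<char>::min() &&
           i <= std::numeric_limits<char>::max();
}

// List-initialization from a constant expression that fits is not narrowing.
char constant_fits() {
    char c{65};       // OK: 65 is a constant expression and fits in char
    // int a{3.0};    // still an error: floating-point -> integer is always narrowing
    return c;
}
```

With these, truncate_to_int(3.1415926) gives 3, while fits_in_char(1000) is false, which is the situation the compiler guards against in char b{ i };.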

Truncation shortens a floating-point value such as a float or a double to its integral form (an int), with the fractional bits after the binary point (those with weight 2^-1 and below) removed.
A double can be narrowed to a float as well, with the possibility of an overflow (depending on the size of the value) and the loss of more than half of the precision bits in its binary form (an IEEE 754 double has 52 fraction bits and a float has 23, the types being 64-bit and 32-bit floating points respectively). Strictly speaking, the standard lets this conversion pick either neighbouring float, so it is a rounding rather than a pure truncation.
For an example of a double being converted to a float, consider something that needs more than 23 precision bits (the size of float's mantissa), such as the value of PI, for which BenVoigt gave an example in the comments.
The value of PI as given by a double is:
11.001001000011111101101010100010001000010110100011000
// 3.141592653589793116
Note that there are 52 precision bits (according to the IEEE 754 standard, numbered 0 to 51), i.e. the fraction bits of the significand.
Corresponding truncated float value:
11.0010010000111111011011
// 3.1415927410125732422
Note the inaccuracy in the value of PI relative to the double shown above. It comes from dropping the trailing precision bits when converting from double to float (which has only 23 precision bits, numbered 0 to 22). Note also that the float shown is slightly larger than the double: the last kept bit was rounded up, so the conversion picked the nearest float rather than simply cutting the bits off.
When a floating-point value is converted to integral form, the fractional part is discarded. For non-negative values this acts like a floor function call; for negative values it acts like ceil, i.e. the conversion rounds toward zero.
Narrowing is a shortening of the value as well, as the name implies, but unlike truncation it is not restricted to shortening a floating-point value to an integer value. It applies to other conversions too, such as a long to an int, a pointer type to a boolean, and an integer to a character (as in your example).
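The two behaviours described above can be checked with a short sketch. The helper names are hypothetical, and the float result assumes IEEE 754 types with the usual round-to-nearest conversion, which the standard does not require.

```cpp
#include <cmath>

// double -> float: the result is a neighbouring float (the nearest one on
// typical IEEE 754 implementations), not a bit-for-bit chopped copy.
float narrow_to_float(double d) {
    return static_cast<float>(d);
}

// double -> int: the fractional part is discarded, rounding toward zero.
int toward_zero(double d) {
    return static_cast<int>(d);
}

// For comparison: floor always rounds down, so it differs for negatives.
double rounded_down(double d) {
    return std::floor(d);
}
```

On such an implementation narrow_to_float(3.141592653589793) compares greater than the original double, and toward_zero(-2.5) is -2 whereas rounded_down(-2.5) is -3.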

Maybe it's best understood with an example...
Let's say d == 3.1415926; then in your code a will end up as 3. That is truncation.
On the other hand, if i == 1000 then that is outside the range of char. If char is unsigned (and 8 bits wide) the value will wrap around and you will get 1000%256, i.e. 232, as the value of b. This happens because int has a wider range than char, hence this conversion is called narrowing.

double d=2.345;
int a = d; // a is 2 now, so the number 2.345 is truncated
As for int to char, char has a size of 1 byte, while int typically has 4 bytes (assuming a 32-bit int), so you would be "narrowing" the variable i.
It could be just about English :) Looking the words up in a dictionary may make it clearer.

Related

Precision loss from float to double, and from double to float?

float fv = original_value; // original_value may be any float value
...
double dv = (double)fv;
...
fv = (float)dv;
Should fv be exactly equal to original_value? Can any precision be lost?
Yes, if the value of dv did not change in between.
From section Conversion 6.3.1.5 Real Floating types in C99 specs:
When a float is promoted to double or long double, or a double is
promoted to long double, its value is unchanged.
When a double is
demoted to float, a long double is demoted to double or float, or a
value being represented in greater precision and range than required
by its semantic type (see 6.3.1.8) is explicitly converted to its
semantic type, if the value being converted can be represented exactly
in the new type, it is unchanged. If the value being converted is in
the range of values that can be represented but cannot be represented
exactly, the result is either the nearest higher or nearest lower
representable value, chosen in an implementation-defined manner. If
the value being converted is outside the range of values that can be
represented, the behavior is undefined
For C++, from section 4.6 aka conv.fpprom (draft used: n337 and I believe similar lines are available in final specs)
A prvalue of type float can be converted to a prvalue of type double.
The value is unchanged. This conversion is called floating point
promotion.
And section 4.8 aka conv.double
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined. The conversions allowed as floating point
promotions are excluded from the set of floating point conversions
So the values should be equal exactly.
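A small sketch of that round trip, using a hypothetical helper name. Note that a NaN would compare unequal to itself, so the check below is only meaningful for ordinary values.

```cpp
// Widen a float to double and narrow it back; true if the value survived.
bool survives_round_trip(float original_value) {
    double dv = static_cast<double>(original_value);  // promotion: value unchanged
    float fv = static_cast<float>(dv);                // exactly representable, so unchanged
    return fv == original_value;
}
```

Calling survives_round_trip(0.1f) should return true, even though 0.1 itself is not exactly representable: the float that was stored is, and that is all the round trip has to preserve.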

Is round-trip through floating point always defined behavior if floating point range is bigger?

Let's say I have two arithmetic types, an integer one, I, and a floating point one, F. I also assume that std::numeric_limits<I>::max() is smaller than std::numeric_limits<F>::max().
Now, let's say I have a positive integer value i. Because the representable range of F is larger than I, F(i) should always be defined behavior.
However, if I have a floating point value f such that f == F(i), is I(f) well defined? In other words, is I(F(i)) always defined behavior?
Relevant section from the C++14 standard:
4.9 Floating-integral conversions [conv.fpint]
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates;
that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be
represented in the destination type. [ Note: If the destination type is bool, see 4.12. — end note ]
A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating
point type. The result is exact if possible. If the value being converted is in the range of values that can
be represented but the value cannot be represented exactly, it is an implementation-defined choice of either
the next lower or higher representable value. [ Note: Loss of precision occurs if the integral value cannot
be represented exactly as a value of the floating type. — end note ] If the value being converted is outside
the range of values that can be represented, the behavior is undefined. If the source type is bool, the value
false is converted to zero and the value true is converted to one.
No.
Suppose that I is a signed two's complement 32 bit integer type, F is a 32 bit single precision floating point type, and i is the maximum positive integer. This is within the range of the floating point type, but it cannot be represented exactly as a floating point number. Some of those 32 bits are used for the exponent.
Instead, the result of the conversion from integer to floating point is implementation-defined, but it is typically done by rounding to the closest representable value. That rounded value is one beyond the range of the integer type. The conversion back to integer fails (better said, it's undefined behavior).
No.
It's possible that i == std::numeric_limits<I>::max(), but that i is not exactly representable in F.
If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value.
Since the next higher representable value may be chosen, it's possible that the result F(i) no longer fits into I, so conversion back would be undefined behavior.
No. Regardless of the standard, you cannot expect that in general this conversion will return your original integer. It doesn't make sense mathematically. But if you read into what you quoted, the standard clearly indicates the possibility of a loss of precision upon converting from int to float.
Suppose your types I and F use the same number of bits. All of the bits of I (save possibly one that stores the sign) are used to specify the absolute value of the number. On the other hand, in F, some bits are used to specify the exponent and some are used for the significand. The range will be greater because of the possible exponent. But the significand will have less precision because there are fewer bits devoted to its specification.
Just as a test, I printed
std::numeric_limits<int>::max();
std::numeric_limits<float>::max();
I then converted the first number to float and back again. The max float had an exponent of 38, and the max int had 10 digits, so clearly float has a larger range. But upon converting the max int to float and back, I went from 2147483647 to -2147483648. So it seems the number was incremented by one unit and went around to the negative side (the conversion back is undefined behavior, so this particular result is not guaranteed).
I didn't check how many bits are actually used for float on my system, but it at least demonstrates the loss of precision, and it shows that gcc "rounded up".
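One way to avoid the undefined conversion is to test the float before converting it back. The following is only a sketch: the helper names are invented, and it assumes a two's complement int and an IEEE 754 float that rounds to nearest.

```cpp
#include <limits>

// int -> float: exact if possible, otherwise a neighbouring float is chosen.
float int_to_float(int i) {
    return static_cast<float>(i);
}

// Hypothetical guard: true when f can be converted to int without
// undefined behavior. The minimum int is a power of two (negated), so it
// and its negation are exactly representable as floats.
bool float_fits_in_int(float f) {
    const float lower = static_cast<float>(std::numeric_limits<int>::min());
    const float upper = -lower;  // one past the maximum int
    return f >= lower && f < upper;  // also false for NaN
}
```

Under those assumptions, float_fits_in_int(int_to_float(std::numeric_limits<int>::max())) is false, because the maximum int was rounded up to the next power of two.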

Why Implicit Cast from double to float available?

In C++ we can write something like
float f = 3.55;
and it is a legal statement, even though the type of a floating-point literal is double and we are storing that double into a float. Doesn't that essentially mean storing 8 bytes into 4 bytes (a possible data loss)? My question is that when I write
long l = 333;
int y = l;
I get an error because a long value is converted into an int value (possible data loss). But why don't I encounter a problem when storing an 8-byte double literal in a 4-byte float?
From §4 Standard conversions [conv] C++11:
Standard conversions are implicit conversions with built-in meaning.
Clause 4 enumerates the full set of such conversions. A standard
conversion sequence is a sequence of standard conversions in the
following order:
...
Zero or one conversion from the following set: integral promotions,
floating point promotion, integral conversions, floating point
conversions, floating-integral conversions, pointer conversions,
pointer to member conversions, and boolean conversions.
So conversion between two numeric types is allowed implicitly, which also makes sense if used carefully, for example when you calculate Amount (int) from P (int), R (float) and T (int).
And from §4.8 Floating point conversions [conv.double],
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
The conversions allowed as floating point
promotions are excluded from the set of floating point conversions.
It appears the double to float conversion is performed implicitly by a compliant C++ compiler (at the cost of potentially losing precision).
Your example is not an error and should compile.
When you assign a larger integer type to a smaller integer type (or perform any conversion that doesn't qualify as a promotion), an integral conversion occurs and precision may be lost.
Similarly, floating point conversion occurs when you assign one floating point type to another floating point type; the result is either the same value, or a value close to it, unless the source value exceeds the range of the destination type.
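The snippets from the question can be wrapped in functions to see what actually happens. This is a sketch with made-up function names; a compiler may still warn about the implicit conversions if asked to (for example with a conversion warning flag), but they are well-formed.

```cpp
// Copy-initialization: the double literal is converted to float implicitly.
float stored_as_float() {
    float f = 3.55;   // 3.55 is a double; a neighbouring float is stored
    return f;
}

// The integer case from the question compiles as well.
int long_to_int() {
    long l = 333;
    int y = l;        // integral conversion; 333 fits, so the value is kept
    return y;
}

// Brace-initialization rejects narrowing, except for in-range constants.
float brace_initialized() {
    float g{3.55};    // OK: constant expression within float's range
    // double d = 3.55;
    // float h{d};    // error: narrowing from a non-constant double
    return g;
}
```

Comparing static_cast<double>(stored_as_float()) with 3.55 shows the precision loss: the two values differ, because 3.55 has no exact binary representation and the float keeps fewer bits of it.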

C++ cast to a more precise type and lose accuracy?

Consider two ways of computing something:
data in double precision
-> apply a function with double precision temporaries
-> return result
data in double precision
-> cast to long double
-> apply a function with long double precision temporaries
-> cast to double
-> return result
Can the second solution give a less accurate result compared to the first one and if yes in what case?
Yes. Proof: Let c = 0x1p-53 + 0x1p-64. Evaluate 1+c-c-1 in double and in long double (of the common Intel format, with a 64-bit significand). In double, the result is 0, which is the mathematically exact answer. In long double, the result is -0x1p-64, which is wrong (and remains wrong when cast to double).
In double, 1+c adds slightly more than half the ULP (unit of least precision) of 1 to 1, so it produces 1 plus an ULP. Subtracting c subtracts slightly more than half an ULP, so the closest representable number (in double) to the result is 1, so 1 is produced. Then subtracting 1 yields 0.
In long double, 1+c adds 0x1p-53 plus half an ULP of 1. (In long double, the ULP of 1 is 0x1p-63.) Since the result is exactly the same distance from the two nearest representable numbers (in long double), the one with the low bit zero is returned, 1+0x1p-53. Then the exact result of subtracting c is 1 - 0x1p-64. This is exactly representable, so it is returned. Finally, subtracting 1 yields -0x1p-64.
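The calculation above can be reproduced with the following sketch. The function names are invented for illustration; std::ldexp(1.0, -53) is used to build 0x1p-53 without hexadecimal literals, and volatile forces each intermediate to be rounded to the stated type. The long double result described in the answer only applies where long double has a 64-bit significand (the x87 format); elsewhere the outcome will differ.

```cpp
#include <cmath>
#include <limits>

// Evaluates ((1 + c) - c) - 1 with every intermediate stored as a double.
double evaluate_in_double() {
    const double c = std::ldexp(1.0, -53) + std::ldexp(1.0, -64);
    volatile double t = 1.0 + c;   // rounds up to 1 plus one ULP
    t = t - c;                     // rounds back to exactly 1
    t = t - 1.0;
    return t;
}

// Same expression with long double intermediates.
long double evaluate_in_long_double() {
    const long double c = std::ldexp(1.0L, -53) + std::ldexp(1.0L, -64);
    volatile long double t = 1.0L + c;  // a tie on x87: rounds to even
    t = t - c;
    t = t - 1.0L;
    return t;
}

// True when long double is the 64-bit-significand format the answer assumes.
bool long_double_is_x87_extended() {
    return std::numeric_limits<long double>::digits == 64;
}
```

On a platform where long_double_is_x87_extended() is true, evaluate_in_double() should return 0 while evaluate_in_long_double() returns a small negative value, matching the argument above.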
About long double the draft says:
3.9.1 Fundamental Types
8 There are three floating point types: float, double, and long double. The type double provides at least
as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values
of the type double is a subset of the set of values of the type long double. The value representation of
floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic
types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum
and minimum values of each arithmetic type for an implementation.
As for promotions which is the next most interesting bit:
4.6 Floating point promotion
1 A prvalue of type float can be converted to a prvalue of type
double. The value is unchanged.
2 This conversion is called floating
point promotion.
Note there is nothing being said about double to long double. I'd hazard this as a slip though.
Next, about conversions, which is what we are interested in when you go from long double to double:
4.8 Floating point conversions
1 A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
2 The conversions allowed as floating point
promotions are excluded from the set of floating point conversions.
Now, let's see the effects of narrowing:
6. A narrowing conversion is an implicit conversion
[...]
from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after
conversion is within the range of values that can be represented (even
if it cannot be represented exactly)
There are two takeaways from all this standardese:
Combining the bit about narrowing with the bit about implementation-defined conversions, there may be changes in your results across platforms.
If your intermediate results (considering multiple such results) in long double are in a range that cannot be represented accurately by a double (high or low), these can accumulate to return a different final result which you will want to return back as a double.
As for which is more accurate, I think that depends entirely on your application.

What happens if I assign a number with a decimal point to an integer rather than to a float?

A friend of mine asked me this question earlier, but I found myself clutching at straws trying to give him an adequate explanation.
A float or a double will be truncated. So 2.99 will become 2 and -2.99 will become -2.
The pertinent section from the standard (section 4.9)
1 A prvalue of a floating point type can be converted to a prvalue of
an integer type. The conversion truncates; that is, the fractional
part is discarded. The behavior is undefined if the truncated value
cannot be represented in the destination type.
If you assign a value of any numeric type (integer, floating-point) to an object of another numeric type, the value is implicitly converted to the target type. The same thing happens in an initialization or when passing an argument to a function.
The rules for how the conversion is done vary with what kind of types you're using.
If the target type can represent the value exactly:
short s = 42;
int i = s;
double x = 42.0;
int j = x;
there may be a change in representation, but the mathematical value is unchanged.
If a floating-point value is converted to an integer type and the value can't be represented exactly, it's truncated, as @sashang's answer says -- but if the truncated value can't be represented, the behavior is undefined.
Conversion of an integer (either signed or unsigned) to an unsigned type causes the value to be reduced modulo MAX+1, where MAX is the maximum value of the unsigned type. For example:
unsigned short s = 70000; // sets s to 4464 (70000 - 65536)
// if unsigned short is 16 bits
Conversion of an integer to a signed type, if the value won't fit, is implementation-defined. It typically wraps around in a manner similar to what happens with unsigned types, but the language doesn't guarantee that.
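The unsigned rule can be checked directly. This sketch uses std::uint16_t so that the 16-bit assumption in the example above is explicit; the function names are hypothetical.

```cpp
#include <cstdint>

// Conversion to an unsigned type reduces the value modulo 2^N (N = 16 here).
std::uint16_t to_u16(long value) {
    return static_cast<std::uint16_t>(value);
}

// The same reduction written out by hand, for non-negative inputs only.
long reduce_mod_65536(long value) {
    return value % 65536L;
}
```

Both to_u16(70000L) and reduce_mod_65536(70000L) give 4464, and to_u16(-1L) gives 65535, since -1 is congruent to 65535 modulo 65536.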