In C++ we can write something like
float f = 3.55;
and it is a legal statement, even though a floating-point literal has type double and we are storing that double in a float. Doesn't that essentially mean storing 8 bytes into 4 bytes (a possible data loss)? My question is that when I write
long l = 333;
int y = l;
I get an error because the long value is converted to an int value (possible data loss). But why don't I encounter a problem when storing an 8-byte double literal in a 4-byte float?
From §4 Standard conversions [conv] C++11:
Standard conversions are implicit conversions with built-in meaning.
Clause 4 enumerates the full set of such conversions. A standard
conversion sequence is a sequence of standard conversions in the
following order:
...
Zero or one conversion from the following set: integral promotions,
floating point promotion, integral conversions, floating point
conversions, floating-integral conversions, pointer conversions,
pointer to member conversions, and boolean conversions.
So conversion between two numeric types is allowed implicitly, which makes sense as long as it is used carefully. For example, when you calculate Amount (int) from P (int), R (float) and T (int).
And from §4.8 Floating point conversions [conv.double],
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
The conversions allowed as floating point
promotions are excluded from the set of floating point conversions.
It appears the double-to-float conversion is performed implicitly by a conforming C++ compiler (at the cost of potentially losing precision).
Your example is not an error and should compile.
When you assign a larger integer type to a smaller integer type (or perform any conversion that doesn't qualify as a promotion), an integral conversion occurs and precision may be lost.
Similarly, floating point conversion occurs when you assign one floating point type to another floating point type; the result is either the same value, or a value close to it, unless the source value exceeds the range of the destination type.
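A minimal sketch of the two initializations from the question, assuming a typical implementation where 333 fits in an int and float has less precision than double. The helper names are made up for illustration. Both initializations compile; the float simply ends up holding the nearest representable value.

```cpp
#include <cassert>

// Hypothetical helper: stores a double literal in a float, as in the question.
float store_in_float() {
    float f = 3.55;  // implicit floating point conversion: double -> float
    return f;
}

// Hypothetical helper: stores a long in an int, as in the question.
int store_in_int() {
    long l = 333;
    int y = l;       // implicit integral conversion: long -> int, also legal
    return y;
}

// Checks the observable effect of both conversions.
void check_question_examples() {
    // 3.55 has no exact binary representation, so the float nearest to it
    // is a different value from the double nearest to it.
    assert(static_cast<double>(store_in_float()) != 3.55);
    // 333 fits in an int, so nothing is lost.
    assert(store_in_int() == 333);
}
```

A compiler may still warn about either line when asked to (for example with a conversion warning flag), but neither is ill-formed.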
Related
I've been going through a book (C++ Programming Language Stroustrup 4th ed).
An example given in a section related to Initialization as below:
void f(double d, int i)
{
int a{ d }; // error : possible truncation
char b{ i }; // error : possible narrowing
}
What exactly is the difference between truncation and narrowing?
A narrowing conversion is basically any conversion that may cause a loss of information. Strictly speaking, a narrowing conversion is :
an implicit conversion
from a floating-point type to an integer type, or
from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly), or
from an integer type or unscoped enumeration type to a floating-point type, except where the source is a constant expression and the actual value after conversion will fit into the target type and will produce the original value when converted back to the original type, or
from an integer type or unscoped enumeration type to an integer type that cannot represent all the values of the original type, except where the source is a constant expression whose value after integral promotions will fit into the target type, or
from a pointer type or a pointer-to-member type to bool.
Notice that this means both of the conversions you posted are narrowing conversion. int a{ d }; is the first case and char b{ i }; is the fourth.
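The constant-expression exceptions in the list above are easy to miss, so here is a small sketch of which brace initializations are accepted. The function name is hypothetical, and the ill-formed lines are left commented out so that the block compiles.

```cpp
#include <cassert>

// Hypothetical demo: returns true if the accepted initializations hold the
// expected values.
bool narrowing_demo() {
    constexpr int small = 100;
    char c{small};   // OK: constant expression whose value fits in char
    float f{2.5};    // OK: constant double within float's range
    double d{7};     // OK: constant int that converts exactly

    // int a{2.5};                // ill-formed: floating-point to integer
    // int i = 1000; char b{i};   // ill-formed: i is not a constant expression

    double x = 2.5;
    int a = x;       // no braces: compiles, and the value is truncated to 2

    return c == 100 && f == 2.5f && d == 7.0 && a == 2;
}

void check_narrowing_demo() {
    assert(narrowing_demo());
}
```

The narrowing check applies only to list initialization with braces; the same conversions written with = or parentheses are accepted silently.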
Truncation only occurs when converting a floating-point type to an integer type. It generally refers to the fractional portion of the floating-point number being lost. This implies that truncation is a subset of narrowing conversions.
What exactly is the difference between truncation and narrowing?
Truncation shortens a decimal-based value such as a float or a double to its integral form int, with the fractional bits (those with weights from 2^-1 downward) removed.
A double can be truncated to a float as well, with the possibility of an overflow (depending on the size of the value) and the removal of more than half of the precision bits in its binary form (double and float being 64-bit and 32-bit floating-point types, with 52 and 23 stored precision bits respectively).
For an example of a double being truncated into a float, consider something that needs more than 23 precision bits (considering the mantissa of float), such as the value of PI, regarding which BenVoigt gave an example in the comments.
The value of PI as given by a double is:
11.001001000011111101101010100010001000010110100011000
// 3.141592653589793116
Note that there are 52 precision bits (according to IEEE 754 standard, from 0 to 51), or the bits forming the value after the decimal.
Corresponding truncated float value:
11.0010010000111111011011
// 3.1415927410125732422
Note the inaccuracy in the value of PI relative to the double shown above. It is caused by dropping the trailing bits of precision when converting the value from double to float (which has only 23 precision bits, from 0 to 22). On typical IEEE 754 implementations the conversion rounds to the nearest float rather than simply chopping the bits off, which is why the last bit shown is rounded up and the float value is slightly larger than the double.
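The PI example can be reproduced with a short sketch. It assumes IEEE 754 float and double with round-to-nearest conversion (the standard only promises one of the two adjacent float values), and the function names are made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper: converts the double nearest to pi into a float.
float pi_as_float() {
    double pi = 3.141592653589793;  // nearest double to pi
    float narrowed = pi;            // implicit floating point conversion
    return narrowed;
}

void check_pi_conversion() {
    const double pi = 3.141592653589793;
    const float f = pi_as_float();
    // The float cannot hold all of the double's precision bits.
    assert(static_cast<double>(f) != pi);
    // Assumption: round-to-nearest, which picks the float just above pi.
    assert(f == 3.14159274f);
    assert(static_cast<double>(f) > pi);
}
```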
As for the conversion of floating-point values to integral form, you can say it acts like a floor function call for non-negative values; for negative values it rounds toward zero, so -2.7 becomes -2 rather than -3.
Narrowing is shortening of the value as the name implies as well, but unlike truncation it is not restricted to shortening of a floating-point value to an integer value. It applies to other conversions as well, such as a long to an int, a pointer type to a boolean and an int to a char (as in your example).
Maybe it's best understood with an example...
Let's say d == 3.1415926; then in your code a will end up as 3. That is truncation.
On the other hand, if i == 1000 then that is outside the range of char. If char is unsigned the value will wrap around and you will get 1000%256 as the value of b. This happens because int has a wider range than char, hence this conversion is called narrowing.
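Both effects can be checked with a small sketch, using the values d == 3.1415926 and i == 1000 from this answer. It assumes 8-bit bytes, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: floating-integral conversion, discards the fraction.
int truncate_to_int(double d) {
    return static_cast<int>(d);
}

// Hypothetical helper: integral conversion to an unsigned 8-bit type,
// which is defined to wrap modulo 256.
unsigned char narrow_to_uchar(int i) {
    return static_cast<unsigned char>(i);
}

void check_truncation_and_narrowing() {
    assert(truncate_to_int(3.1415926) == 3);   // truncation
    assert(truncate_to_int(-2.7) == -2);       // toward zero ...
    assert(std::floor(-2.7) == -3.0);          // ... unlike floor
    assert(narrow_to_uchar(1000) == 232);      // 1000 % 256
}
```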
double d=2.345;
int a = d; // a is 2 now, so the number 2.345 is truncated
As for int to char, char has size of 1 byte, while int has 4 bytes (assuming 32 bit), so you would be "narrowing" variable i.
It could be just about English :) You can look the words up in a dictionary, which may make it clearer.
What guarantees does the C++ standard give for narrowing conversion from double to int types?
Is it the same as Java as explained at Q31328190:
No, it's not the same as in Java. If the mathematical result of "truncate the fractional part" cannot be represented by the target type, the behaviour is undefined.
From 4.9 [conv.fpint]/1 ("Floating-integral conversions"):
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.
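Because an out-of-range value makes the conversion undefined, one option is to test the range before converting. The following is only a sketch: to_int_checked is a hypothetical helper, not a standard function, and it assumes int is narrow enough (for example 32 bits) that both bounds are exactly representable as double.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: converts d to int only when the truncated value is
// representable. Returns false (and leaves out untouched) otherwise.
bool to_int_checked(double d, int& out) {
    const double lo = static_cast<double>(std::numeric_limits<int>::min()) - 1.0;
    const double hi = static_cast<double>(std::numeric_limits<int>::max()) + 1.0;
    // Written so that NaN fails the test: every comparison with NaN is false.
    if (!(d > lo && d < hi)) {
        return false;
    }
    out = static_cast<int>(d);  // safe: the fractional part is discarded
    return true;
}

void check_to_int_checked() {
    int value = 0;
    assert(to_int_checked(3.99, value) && value == 3);
    assert(to_int_checked(-3.99, value) && value == -3);
    assert(!to_int_checked(1e20, value));
    assert(!to_int_checked(std::numeric_limits<double>::quiet_NaN(), value));
}
```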
In this other question, the answers discuss how Bjarne Stroustrup said that, just as integral data types narrower than an int (e.g. short) are promoted to an int, floats are promoted to a double. However, unlike the widening of integrals narrower than an int, floating point promotion does not happen in the same way, but instead occurs elsewhere.
I know that if you were to compute float + double the float would be converted to a double before the binary operator(+) is applied. However, this is not floating point promotion according to Learncpp.com. This is usual arithmetic conversion.
When does floating point promotion actually happen?
There is such a thing as "floating point promotion" of float to double per [conv.fpprom].
A prvalue of type float can be converted to a prvalue of type double. The value is unchanged.
This conversion is called floating point promotion.
The answers to the linked question are correct. This promotion should not occur automatically when adding two floats since the usual arithmetic conversions do not promote floating-point operands.
Floating point promotion does occur when passing a float as an operand to an ellipsis, like in printf. That's why the %f format specifier prints either a float or a double: if you pass a float, the function actually receives a double, the result of promotion.
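The ellipsis case can be seen in a small variadic function. In this sketch, sum_as_doubles is a hypothetical function; it always reads its arguments with va_arg(args, double), which works for float arguments only because they have already been promoted.

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical variadic function: sums count floating-point arguments.
double sum_as_doubles(int count, ...) {
    va_list args;
    va_start(args, count);
    double total = 0.0;
    for (int i = 0; i < count; ++i) {
        // A float passed through the ellipsis arrives here as a double.
        total += va_arg(args, double);
    }
    va_end(args);
    return total;
}

void check_variadic_promotion() {
    // Two floats and one double, all read back as double.
    assert(sum_as_doubles(3, 1.5f, 2.25f, 0.25) == 4.0);
}
```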
The existence of the floating point promotion is also important in overload resolution, because integral promotions and floating point promotions have better implicit conversion rank than integral conversions, floating point conversions, and floating-integral conversions.
Example 1:
void f(double);
void f(long double);
f(0.0f);
This calls void f(double) since the promotion to double is better than the conversion to long double. In contrast, consider this perhaps surprising example 2:
void f(long double);
void f(int);
f(0.0f);
This is ambiguous. The conversion from float to long double is no better than the conversion from float to int since they are both not promotions.
Example 3:
struct S {
operator float();
operator int();
};
double d = S();
This calls operator float and then promotes the resulting float value to double to initialize d.
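Examples 1 and 3 can be turned into something runnable. This is a sketch: the function name which, the return values and the const qualifiers are added for illustration. Example 2 is left out because the ambiguous call does not compile.

```cpp
#include <cassert>
#include <cstring>

// Example 1: promotion to double beats conversion to long double.
const char* which(double)      { return "double"; }
const char* which(long double) { return "long double"; }

// Example 3: the conversion function followed by a promotion is preferred.
struct S {
    operator float() const { return 1.5f; }
    operator int() const   { return 7; }
};

double init_from_s() {
    double d = S();  // operator float, then float -> double promotion
    return d;
}

void check_overload_examples() {
    assert(std::strcmp(which(0.0f), "double") == 0);
    assert(init_from_s() == 1.5);
}
```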
The primary (perhaps sole) time that floating point promotions are applied is when passing an argument to a variadic function (e.g., printf).
In this case, the usual arithmetic conversions don't apply (they're for finding a common type between two operands in an expression).
The relevant part of the standard is [expr.call]/7 (at least as of N4296):
When there is no parameter for a given argument, the argument is passed in such a way that the receiving function can obtain the value of the argument by invoking va_arg (18.10).
[...]
If the argument has integral or enumeration type that is subject to the integral promotions (4.5), or a floating point type that is subject to the floating point promotion (4.6), the value of the argument is converted to the promoted type before the call.
I am not sure if promotion just means converting a data type to a larger data type (for example short to int).
Or does promotion mean converting a data type to another "compatible" data type, for example converting a short to an int, which will keep the same bit pattern (the extra space will be filled with zeros)? And does conversion mean converting something like an int to a float, which will create a completely different bit pattern?
There are two things that are called promotions: integral promotions and floating point promotions. Integral promotion refers to integral types (including bitfields and enums) being converted to "larger" integral types and floating point promotion is specifically just float to double.
Both types of promotions are subsets of a wider range of conversions.
char -> int: integral promotion
float -> double: floating point promotion
int -> char: [narrowing] conversion (not a promotion)
int -> float: conversion
const char* -> std::string: conversion
Foo -> Bar: possibly undefined conversion?
etc.
A promotion is a specific kind of conversion for built-in types that is guaranteed not to change the value.
The type you are promoting to must be able to accurately represent any possible value of the type you are promoting from.
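The difference can be checked at compile time with type traits. This is a sketch; the promote overloads are hypothetical helpers that only exist to show that the value survives.

```cpp
#include <cassert>
#include <type_traits>

// Integral promotion: operands narrower than int become int.
static_assert(std::is_same<decltype(+'a'), int>::value,
              "char is promoted to int");
static_assert(std::is_same<decltype(short(1) + short(2)), int>::value,
              "short operands are promoted to int");

// No floating point promotion in arithmetic: float + float stays float.
static_assert(std::is_same<decltype(1.0f + 2.0f), float>::value,
              "float operands are not promoted");
// Mixed operands: the usual arithmetic conversions yield double.
static_assert(std::is_same<decltype(1.0f + 2.0), double>::value,
              "float is converted to double");

// Hypothetical helpers: a promotion never changes the value.
int promote(short s)    { return s; }
double promote(float f) { return f; }

void check_promotions() {
    assert(promote(static_cast<short>(-123)) == -123);
    assert(promote(0.1f) == 0.1f);
}
```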
Here is a list of the applicable conversions.
Promotion
char or short values (signed or unsigned) are promoted to int (or unsigned) before anything else happens
this is done because int is assumed to be the most efficient integral datatype, and it is guaranteed that no information will be lost by going from a smaller datatype to a larger one
Conversion
after integral promotion, the arguments to an operator are checked
if both are the same datatype, evaluation proceeds
if the arguments are of different datatypes, conversion will occur
Casts
the type of an expression can be forced using casts.
a cast is simply any valid datatype enclosed in parentheses and placed next to a constant, variable or expression
Consider two ways of computing something:
data in double precision
-> apply a function with double precision temporaries
-> return result
data in double precision
-> cast to long double
-> apply a function with long double precision temporaries
-> cast to double
-> return result
Can the second solution give a less accurate result compared to the first one and if yes in what case?
Yes. Proof: Let c = 0x1p-53 + 0x1p-64. Evaluate 1+c-c-1 in double and in long double (of the common Intel format, with a 64-bit significand). In double, the result is 0, which is the mathematically exact answer. In long double, the result is -0x1p-64, which is wrong (and remains wrong when cast to double).
In double, 1+c adds slightly more than half the ULP (unit of least precision) of 1 to 1, so it produces 1 plus an ULP. Subtracting c subtracts slightly more than half an ULP, so the closest representable number (in double) to the result is 1, so 1 is produced. Then subtracting 1 yields 0.
In long double, 1+c adds 0x1p-53 plus half an ULP of 1. (In long double, the ULP of 1 is 0x1p-63.) Since the result is exactly the same distance from the two nearest representable numbers (in long double), the one with the low bit zero is returned, 1+0x1p-53. Then the exact result of subtracting c is 1 - 0x1p-64. This is exactly representable, so it is returned. Finally, subtracting 1 yields -0x1p-64.
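The computation can be tried out with a short sketch. It assumes IEEE 754 double with round-to-nearest, and it only checks the long double result when long double has a 64-bit significand (the Intel format discussed here). The function names are made up, and volatile is used to discourage the compiler from keeping excess precision.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: evaluates 1 + c - c - 1 step by step in type T,
// with c = 2^-53 + 2^-64.
template <typename T>
T one_plus_c_minus_c_minus_one() {
    volatile T c = std::ldexp(T(1), -53) + std::ldexp(T(1), -64);
    volatile T r = T(1) + c;
    r = r - c;
    r = r - T(1);
    return r;
}

void check_cancellation() {
    // In double the roundings cancel and the exact answer 0 comes out.
    assert(one_plus_c_minus_c_minus_one<double>() == 0.0);
    // With a 64-bit significand the result is -2^-64 instead.
    if (std::numeric_limits<long double>::digits == 64) {
        assert(one_plus_c_minus_c_minus_one<long double>() ==
               -std::ldexp(1.0L, -64));
    }
}
```

On platforms where long double has a different format, the second check is skipped and the long double result may differ.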
About long double the draft says:
3.9.1 Fundamental Types
8 There are three floating point types: float, double, and long double. The type double provides at least
as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values
of the type double is a subset of the set of values of the type long double. The value representation of
floating-point types is implementation-defined. Integral and floating types are collectively called arithmetic
types. Specializations of the standard template std::numeric_limits (18.3) shall specify the maximum
and minimum values of each arithmetic type for an implementation.
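The guarantees in this paragraph can be inspected through std::numeric_limits. A small sketch, with hypothetical names, that checks the precision ordering and the subset property for float and double:

```cpp
#include <cassert>
#include <limits>

// Number of significand digits (in the type's radix) for each type.
constexpr int float_digits       = std::numeric_limits<float>::digits;
constexpr int double_digits      = std::numeric_limits<double>::digits;
constexpr int long_double_digits = std::numeric_limits<long double>::digits;

static_assert(float_digits <= double_digits,
              "double has at least the precision of float");
static_assert(double_digits <= long_double_digits,
              "long double has at least the precision of double");

// Hypothetical helper: every float value is also a double value, so a
// round trip through double gives back the original float.
bool widening_is_exact(float x) {
    return static_cast<float>(static_cast<double>(x)) == x;
}

void check_floating_types() {
    assert(widening_is_exact(0.1f));
    assert(widening_is_exact(std::numeric_limits<float>::max()));
}
```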
As for promotions which is the next most interesting bit:
4.6 Floating point promotion
1 A prvalue of type float can be converted to a prvalue of type
double. The value is unchanged.
2 This conversion is called floating
point promotion.
Note there is nothing being said about double to long double. I'd hazard this as a slip though.
Next about conversions which is what we are interested when you go from long double to double:
4.8 Floating point conversions
1 A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined.
2 The conversions allowed as floating point
promotions are excluded from the set of floating point conversions.
Now, let's see the effects of narrowing:
6. A narrowing conversion is an implicit conversion
[...]
from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after
conversion is within the range of values that can be represented (even
if it cannot be represented exactly)
There are two takeaways from all this standardese:
Combining the bit about narrowing with the bit about implementation defined conversions there may be changes in your results across platforms.
If your intermediate results (considering multiple such results) in long double are in a range that cannot be represented accurately by a double (high or low), these can accumulate to return a different final result which you will want to return back as a double.
As for which is more accurate, I think that depends entirely on your application.