How does Float round when converting it into integer

If I have
float value = 10.50f;
and do
int new_value = (int)value;
what rules determine how the number is rounded?

When a finite value of floating type is converted to an integer type, the fractional part is discarded (i.e., the value is truncated toward zero).
So in the case of -10.5, it's converted to -10.
C++11 4.9 Floating-integral conversions [conv.fpint]
An rvalue of a floating point type can be converted to an rvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [ Note: If the destination type is bool, see 4.12. —end note ]

The rule is quite simple: the number simply gets truncated to its integral part, in this case, to 10. The fractional part gets dropped entirely. The same applies to negative numbers: -10.5 would be converted to -10.

When converted to integers, the fractional part of the float is dropped, meaning the float 10.5 will be converted to the integer 10, and the float -10.5 will be converted to the integer -10.
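
A minimal sketch of the rule, using the values from the question. The names truncate_to_int and check_truncation are made up for illustration; std::floor and std::lround appear only for contrast with the cast.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the same conversion as (int)value, written as a C++ cast.
int truncate_to_int(float value) {
    return static_cast<int>(value);  // fractional part is discarded (toward zero)
}

// Hypothetical self-check contrasting truncation with floor and round-to-nearest.
void check_truncation() {
    assert(truncate_to_int(10.5f) == 10);
    assert(truncate_to_int(-10.5f) == -10);  // toward zero, not toward -infinity
    assert(std::floor(-10.5f) == -11.0f);    // floor rounds down instead
    assert(std::lround(10.5f) == 11);        // lround rounds halfway cases away from zero
    assert(std::lround(-10.5f) == -11);
}
```

If rounding to nearest is what you actually want, std::lround (or std::round followed by a cast) is the usual choice; the plain cast never rounds up in magnitude.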

Related

Difference between narrowing and truncation in C++?

I've been going through a book (C++ Programming Language Stroustrup 4th ed).
An example given in a section related to Initialization as below:
void f(double d, int i)
{
int a{ d }; // error : possible truncation
char b{ i }; // error : possible narrowing
}
What exactly is the difference between truncation and narrowing?

A narrowing conversion is basically any conversion that may cause a loss of information. Strictly speaking, a narrowing conversion is :
an implicit conversion
from a floating-point type to an integer type, or
from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly), or
from an integer type or unscoped enumeration type to a floating-point type, except where the source is a constant expression and the actual value after conversion will fit into the target type and will produce the original value when converted back to the original type, or
from an integer type or unscoped enumeration type to an integer type that cannot represent all the values of the original type, except where the source is a constant expression whose value after integral promotions will fit into the target type, or
from a pointer type or a pointer-to-member type to bool.
Notice that this means both of the conversions you posted are narrowing conversion. int a{ d }; is the first case and char b{ i }; is the fourth.
Truncation only occurs when converting between a floating point type and an integer type. It generally refers to the decimal portion of the floating point number being lost (source). This implies that truncation is a subset of narrowing conversions.
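
As a sketch of how the function from the question could be made to compile, the narrowing can be spelled out with static_cast. The struct Converted and the names f_explicit and check_f_explicit are hypothetical, introduced only so the result can be inspected.

```cpp
#include <cassert>

// Hypothetical result type, for illustration only.
struct Converted {
    int a;
    char b;
};

// Variant of f from the question in which the narrowing is made explicit.
Converted f_explicit(double d, int i) {
    // int a{ d };   // ill-formed: narrowing from double to int in list-initialization
    // char b{ i };  // ill-formed: narrowing from int to char in list-initialization
    int a{ static_cast<int>(d) };    // explicit cast: truncates toward zero
    char b{ static_cast<char>(i) };  // explicit cast: value preserved only if it fits in char
    return Converted{ a, b };
}

// Hypothetical self-check with values that fit in the target types.
void check_f_explicit() {
    Converted r = f_explicit(3.9, 65);
    assert(r.a == 3);
    assert(r.b == 65);
}
```

The braces still guard against accidental narrowing; the cast just documents that the loss of information is intended.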

What exactly is the difference between truncation and narrowing?
Truncation shortens a floating-point value such as a float or a double to its integral form (an int), removing the fractional bits after the binary point (the bits weighted 2^-1 and smaller).
A double can be shortened to a float as well, with the possibility of an overflow (depending on the size of the value) and the removal of more than half of the precision bits in its binary form (double and float being 64-bit and 32-bit floating-point types, with 52 and 23 stored precision bits respectively).
For an example of a double being shortened to a float, consider something that needs more than 23 precision bits (considering the mantissa of float), such as the value of PI, for which BenVoigt gave an example in the comments.
The value of PI as given by a double is:
11.001001000011111101101010100010001000010110100011000
// 3.141592653589793116
Note that there are 52 precision bits (according to IEEE 754 standard, from 0 to 51), or the bits forming the value after the decimal.
Corresponding truncated float value:
11.0010010000111111011011
// 3.1415927410125732422
Note the inaccuracy in the value of PI relative to the double above. It is caused by the loss of the trailing precision bits when converting from double to float (which has only 23 precision bits, from 0 to 22). Note also that the float shown is slightly larger than the double: the standard makes the result an implementation-defined choice between the two neighbouring float values, and typical IEEE 754 implementations round to the nearest one rather than simply chopping the extra bits.
As for the conversion of floating-point values to integral form, it acts like a floor call for non-negative values only. For negative values it rounds toward zero (like std::trunc), so -10.5 becomes -10, not -11.
Narrowing is shortening of the value as the name implies as well, but unlike truncation it is not restricted to shortening of a floating-point value to an integer value. It applies to other conversions as well, such as a long to an int, a pointer type to a boolean and an int to a char (as in your example).
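
The PI example can be checked with a short sketch. It assumes an IEEE 754 platform in the default round-to-nearest mode (the standard only promises one of the two neighbouring float values); to_float and check_pi_conversion are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: plain double-to-float conversion.
float to_float(double d) {
    return static_cast<float>(d);
}

// Hypothetical self-check; assumes IEEE 754 with round-to-nearest.
void check_pi_conversion() {
    const double pi = 3.141592653589793;
    const float f = to_float(pi);
    // The float is slightly larger than the double, so the bits were not just chopped off.
    assert(static_cast<double>(f) > pi);
    // The next float toward zero is what dropping the extra bits would have produced.
    const float chopped = std::nextafter(f, 0.0f);
    assert(static_cast<double>(chopped) < pi);
}
```

So the double-to-float step loses precision, but on such platforms it does so by rounding, which is why the float above prints as 3.14159274... rather than something below PI.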

Maybe it's best understood with an example...
Let's say d == 3.1415926; then in your code a will end up as 3. That is truncation.
On the other hand, if i == 1000 then that is outside the range of char. If char is unsigned the value will wrap around and you will get 1000%256 as the value of b (that is, 232 for an 8-bit char). This happens because int has a wider range than char, hence this conversion is called narrowing.
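
A small sketch of the wrap-around case, using unsigned char explicitly so the result does not depend on whether plain char is signed. narrow_to_uchar and check_narrowing are hypothetical names, and the 8-bit char is an assumption checked at compile time.

```cpp
#include <cassert>
#include <climits>

static_assert(CHAR_BIT == 8, "this sketch assumes an 8-bit char");

// Hypothetical helper: int to unsigned char is reduced modulo 2^CHAR_BIT.
unsigned char narrow_to_uchar(int i) {
    return static_cast<unsigned char>(i);
}

// Hypothetical self-check for the value used in the answer.
void check_narrowing() {
    assert(narrow_to_uchar(1000) == 1000 % 256);
    assert(narrow_to_uchar(1000) == 232);
    assert(narrow_to_uchar(65) == 65);  // values that fit are unchanged
}
```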

double d=2.345;
int a = d; // a is 2 now, so the number 2.345 is truncated
As for int to char, char has size of 1 byte, while int has 4 bytes (assuming 32 bit), so you would be "narrowing" variable i.
It could just be about English :) Looking the words up in a dictionary may make it clearer.

What happens to decimal part of floating point data if we assign it to an int?

I know that int can only store non decimal numbers, but in this case:
int n = 3.14;
It will only store 3, but my question is what will happen to .14? Will it be lost in memory, or discarded, or temporarily stored?

The fractional part of a floating point number is truncated when it is assigned to an integral type.
From the C++ Standard:
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type. [ Note: If the destination type is bool, see conv.bool. — end note ]

When you assign a floating point value to an integral type, the decimals are truncated.
So in your example
int n = 3.14; // will result in n == 3

The fractional part will disappear entirely. You can see this by assigning the integer variable to a float variable and then printing the float. Subject to rounding (which is not an issue for small integers), it will print out as 3.00....
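
Here is a sketch of that experiment, plus one way to keep the fraction if you need it: split the value with std::modf before converting. The names through_int, fractional_part and check_fraction are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: convert to int, then back to float, as suggested above.
float through_int(double x) {
    int n = static_cast<int>(x);   // the .14 is discarded here
    return static_cast<float>(n);  // converting back cannot restore it
}

// Hypothetical helper: extract the fractional part before it is lost.
double fractional_part(double x) {
    double integral = 0.0;
    return std::modf(x, &integral);  // returns the fraction, stores the integral part
}

// Hypothetical self-check using the value from the question.
void check_fraction() {
    assert(through_int(3.14) == 3.0f);
    assert(std::fabs(fractional_part(3.14) - 0.14) < 1e-12);
}
```

Nothing is kept "somewhere in memory" after the conversion; the only way to have the .14 afterwards is to compute it from the original floating-point value.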

If you store a float in an int, the part after the point gets "sliced" off. The same happens with objects: data that isn't part of the stored type simply isn't recorded in that object/variable.

Precision loss from float to double, and from double to float?

float fv = original_value; // original_value may be any float value
...
double dv = (double)fv;
...
fv = (float)dv;
SHOULD fv be equal to original_value exactly? Can any precision be lost?

SHOULD fv be equal to original_value exactly? Can any precision be lost?
Yes, if the value of dv did not change in between.
From section Conversion 6.3.1.5 Real Floating types in C99 specs:
When a float is promoted to double or long double, or a double is
promoted to long double, its value is unchanged.
When a double is
demoted to float, a long double is demoted to double or float, or a
value being represented in greater precision and range than required
by its semantic type (see 6.3.1.8) is explicitly converted to its
semantic type, if the value being converted can be represented exactly
in the new type, it is unchanged. If the value being converted is in
the range of values that can be represented but cannot be represented
exactly, the result is either the nearest higher or nearest lower
representable value, chosen in an implementation-defined manner. If
the value being converted is outside the range of values that can be
represented, the behavior is undefined
For C++, from section 4.6 aka conv.fpprom (draft used: N3337; I believe similar wording is in the final specs):
A prvalue of type float can be converted to a prvalue of type double.
The value is unchanged. This conversion is called floating point
promotion.
And section 4.8 aka conv.double
A prvalue of floating point type can be converted to a prvalue of
another floating point type. If the source value can be exactly
represented in the destination type, the result of the conversion is
that exact representation. If the source value is between two adjacent
destination values, the result of the conversion is an
implementation-defined choice of either of those values. Otherwise,
the behavior is undefined. The conversions allowed as floating point
promotions are excluded from the set of floating point conversions
So the values should be equal exactly.
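
A quick sketch that exercises the round trip from the question. survives_round_trip and check_round_trip are hypothetical names; the only special case is NaN, which never compares equal to itself.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: widen to double, narrow back, compare with the original.
bool survives_round_trip(float original_value) {
    float fv = original_value;
    double dv = static_cast<double>(fv);  // promotion: value unchanged
    fv = static_cast<float>(dv);          // dv is exactly representable as a float
    return fv == original_value;          // false for NaN, since NaN != NaN
}

// Hypothetical self-check over a few ordinary and extreme values.
void check_round_trip() {
    assert(survives_round_trip(0.1f));
    assert(survives_round_trip(-3.1415927f));
    assert(survives_round_trip(std::numeric_limits<float>::max()));
    assert(survives_round_trip(std::numeric_limits<float>::min()));
}
```

The opposite direction (double to float and back to double) carries no such guarantee, because most double values are not exactly representable as float.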

Is round-trip through floating point always defined behavior if floating point range is bigger?

Let's say I have two arithmetic types, an integer one, I, and a floating point one, F. I also assume that std::numeric_limits<I>::max() is smaller than std::numeric_limits<F>::max().
Now, let's say I have a positive integer value i. Because the representable range of F is larger than I, F(i) should always be defined behavior.
However, if I have a floating point value f such that f == F(i), is I(f) well defined? In other words, is I(F(i)) always defined behavior?
Relevant section from the C++14 standard:
4.9 Floating-integral conversions [conv.fpint]
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates;
that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be
represented in the destination type. [ Note: If the destination type is bool, see 4.12. — end note ]
A prvalue of an integer type or of an unscoped enumeration type can be converted to a prvalue of a floating
point type. The result is exact if possible. If the value being converted is in the range of values that can
be represented but the value cannot be represented exactly, it is an implementation-defined choice of either
the next lower or higher representable value. [ Note: Loss of precision occurs if the integral value cannot
be represented exactly as a value of the floating type. — end note ] If the value being converted is outside
the range of values that can be represented, the behavior is undefined. If the source type is bool, the value
false is converted to zero and the value true is converted to one.

However, if I have a floating point value f such that f == F(i), is I(f) well defined? In other words, is I(F(i)) always defined behavior?
No.
Suppose that I is a signed two's complement 32 bit integer type, F is a 32 bit single precision floating point type, and i is the maximum positive integer. This is within the range of the floating point type, but it cannot be represented exactly as a floating point number. Some of those 32 bits are used for the exponent.
Instead, the conversion from integer to floating point is implementation dependent, but typically is done by rounding to the closest representable value. That rounded value is one beyond the range of the integer type. The conversion back to integer fails (better said, it's undefined behavior).

No.
It's possible that i == std::numeric_limits<I>::max(), but that i is not exactly representable in F.
If the value being converted is in the range of values that can be represented but the value cannot be represented exactly, it is an implementation-defined choice of either the next lower or higher representable value.
Since the next higher representable value may be chosen, it's possible that the result F(i) no longer fits into I, so conversion back would be undefined behavior.

No. Regardless of the standard, you cannot expect that in general this conversion will return your original integer. It doesn't make sense mathematically. But if you read into what you quoted, the standard clearly indicates the possibility of a loss of precision upon converting from int to float.
Suppose your types I and F use the same number of bits. All of the bits of I (save possibly one that stores the sign) are used to specify the absolute value of the number. On the other hand, in F, some bits are used to specify the exponent and some are used for the significand. The range will be greater because of the possible exponent. But the significand will have less precision because there are fewer bits devoted to its specification.
Just as a test, I printed
std::numeric_limits<int>::max();
std::numeric_limits<float>::max();
I then converted the first number to float and back again. The max float had an exponent of 38, and the max int had 10 digits, so clearly float has a larger range. But upon converting the max int to float and back, I went from 2147483647 to -2147483648. So it seems the number was incremented by one unit and went around to the negative side.
I didn't check how many bits are actually used for float on my system, but it at least demonstrates the loss of precision, and it shows that gcc "rounded up".
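
One way to avoid the undefined behavior is to check the range before converting back. The following is only a sketch under stated assumptions: int is two's complement, and to_int_checked and check_to_int_checked are hypothetical names, not standard functions.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: convert only when the truncated value fits in int.
// Returns false (and leaves out untouched) for out-of-range values and NaN.
bool to_int_checked(float f, int& out) {
    // 2^digits is one past INT_MAX and, being a power of two, exact as a float.
    const float upper = std::ldexp(1.0f, std::numeric_limits<int>::digits);
    // Assumption: two's complement, so INT_MIN == -2^digits.
    const float lower = -upper;
    if (!(f >= lower && f < upper)) {  // the comparison is false for NaN
        return false;
    }
    out = static_cast<int>(f);
    return true;
}

// Hypothetical self-check; the INT_MAX case assumes round-to-nearest conversion.
void check_to_int_checked() {
    int out = 0;
    assert(to_int_checked(10.5f, out) && out == 10);
    assert(!to_int_checked(static_cast<float>(std::numeric_limits<int>::max()), out));
    assert(!to_int_checked(std::numeric_limits<float>::quiet_NaN(), out));
}
```

Comparing against std::numeric_limits<int>::max() directly would not work, because that value is itself converted to float for the comparison and typically rounds up to 2^31.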

What happens if I assign a number with a decimal point to an integer rather than to a float?

A friend of mine asked me this question earlier, but I found myself clutching at straws trying to give him an adequate explanation.

A float or a double will be truncated. So 2.99 will become 2 and -2.99 will become -2.
The pertinent section from the standard (section 4.9)
1 A prvalue of a floating point type can be converted to a prvalue of
an integer type. The conversion truncates; that is, the fractional
part is discarded. The behavior is undefined if the truncated value
cannot be represented in the destination type.

If you assign a value of any numeric type (integer, floating-point) to an object of another numeric type, the value is implicitly converted to the target type. The same thing happens in an initialization or when passing an argument to a function.
The rules for how the conversion is done vary with what kind of types you're using.
If the target type can represent the value exactly:
short s = 42;
int i = s;
double x = 42.0;
int j = x;
there may be a change in representation, but the mathematical value is unchanged.
If a floating-point type is converted to an integer type, and the value can't be represented exactly, it's truncated, as @sashang's answer says -- but if the truncated value can't be represented, the behavior is undefined.
Conversion of an integer (either signed or unsigned) to an unsigned type causes the value to be reduced modulo MAX+1, where MAX is the maximum value of the unsigned type. For example:
unsigned short s = 70000; // sets s to 4464 (70000 - 65536)
// if unsigned short is 16 bits
Conversion of an integer to a signed type, if the value won't fit, is implementation-defined. It typically wraps around in a manner similar to what happens with unsigned types, but the language doesn't guarantee that.
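
The unsigned case can be sketched with a fixed-width type, so the "if unsigned short is 16 bits" caveat goes away. to_u16 and check_to_u16 are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: conversion to a 16-bit unsigned type reduces modulo 65536.
std::uint16_t to_u16(long value) {
    return static_cast<std::uint16_t>(value);
}

// Hypothetical self-check using the value from the answer.
void check_to_u16() {
    assert(to_u16(70000) == 4464);  // 70000 - 65536
    assert(to_u16(-1) == 65535);    // negative values wrap as well
    assert(to_u16(42) == 42);       // representable values are unchanged
}
```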
