How does C++ round, if signed/unsigned integers are implicitly converted to floats/doubles?
Like:
int myInt = SomeNumberWeCantExpressWithAFloat;
float myFloat = myInt;
My university script says the following: The resulting value is the representable value nearest to the original value, where ties are broken in an implementation-defined fashion.
Please explain how the "nearest representable value" is calculated and what "where ties are broken in an implementation-defined fashion" is supposed to mean.
Edit:
Since I work most of my time with the GCC, please give additional information about what floating point representation the GCC uses by default, if there is one.
Single-precision floating point numbers have 24-bit mantissa. On systems with 32-bit int representation values above 224 and below -(224) require rounding.
Value 224+1 = 16777217 is the first int that cannot be represented exactly in IEEE binary32 format. Two float representations are available - 16777216, which is below the exact value by 1, and 16777218, which is above the exact value, also by 1. Hence, we have a tie, meaning that C++ is allowed to choose either one of these two representations.
IEEE 754 specifies 5 different rounding modes about how to round integers:
A very common mode is called: Round to nearest, ties to even.
From GCC Wiki:
Without any explicit options, GCC assumes round to nearest or even and
does not care about signalling NaNs. Compare with C99's #pragma STDC
FENV ACCESS OFF. Also, see note on x86 and m68080.
Round to nearest, ties to even
From Wikipedia:
Rounding a number y to the nearest integer requires some tie-breaking
rule for those cases when y is exactly half-way between two integers —
that is, when the fraction part of y is exactly 0.5.
In such a situation the even one would be chosen. This applies for positive as for negative numbers.
Sources:
https://en.wikipedia.org/wiki/Rounding#Tie-breaking
https://gcc.gnu.org/wiki/FloatingPointMath
https://en.wikipedia.org/wiki/IEEE_floating_point
Feel free to edit. Additional information about conversion rules for rational/irrational numbers is appreciated.
Related
I'm having issues with the following line of code:
float temp = std::stof("-8.34416e-46");
My program aborts as soon as it reaches this line. It's clear that this float is less than FLT_MIN, but such floats are allowed to exist. (For example, float temp = -8.34416e-46; works fine. Is the stof method supposed to only work for values between FLT_MIN and FLT_MAX?
If so, what would be a good alternative to get a string ("-8.34416e-46") into a float?
An Alternative
Convert to double with std::stod and then assign to float. Add bounds checking if desired.
Converting “-8.34416e-46”
The C++ standard allows conversions of strings to float to report underflow if the result is in the subnormal range even though it is representable.
When rounding to the nearest representable value is used, −8.34416•10−46 is within the range of float (in C++ implementations that use IEEE-754 binary32 for float, which is common), but it is in the subnormal range. The C++ standard says stof calls strtof and then defers to the C standard to define strtof. The C standard indicates that strtof may underflow, about which it says “The result underflows if the magnitude of the mathematical result is so small that the mathematical result cannot be represented, without extraordinary roundoff error, in an object of the specified type.” That is awkward phrasing, but it refers to the rounding errors that occur when subnormal values are encountered. (Subnormal values are subject to larger relative errors than normal values, so their rounding errors might be said to be extraordinary.)
Thus, a C++ implementation is allowed by the C++ standard to underflow for subnormal values even though they are representable.
The smallest positive magnitude in the binary32 format is 2−149, about 1.4•10−45. 8.34416•10−46 is smaller than this, but it is greater than half of 2−149. That means, between 0 and 2−149, it is closer to the latter, so conversion with rounding to the nearest representable value will produce 2−149 rather than zero. Unfortunately, your strtof implementation chooses to report underflow rather than completing a conversion to the nearest representable value.
Normal and subnormal values
For IEEE-754 32-bit binary floating-point, the normal range is from 2-126 to 2128-2104. Within this range, every number is represented with a signficand (the fraction portion of the floating-point representation) that has a leading 1 bit followed by 23 additional bits, and so the error that occurs when rounding any real number in this range to the nearest representable value is at most 2-24 times the position value of the leading bit.
In additional to this normal range, there is a subnormal range from 2−149 to 2−126−2−149. In this interval, the exponent part of the floating-point format has reached its smallest value and cannot be decreased any more. To represent smaller and smaller numbers in this interval, the significand is reduced below the normal minimum of 1. It starts with a 0 and is followed by 23 additional bits. In this interval, the error that occurs when rounding a real number to the nearest representable value may be larger than 2-24 times the position value of the leading bit. Since the exponent cannot be decreased any further, numbers in this interval have increasing numbers of leading 0 bits as they get smaller and smaller. Thus the relative errors involved with using these numbers grows.
For whatever reasons, the C++ has said that implementations may report underflow in this interval. (The IEEE-754 standard defines underflow in complicated ways and also allows implementations some choices.)
In a specific online judge running 32-bit GCC 7.3.0, this:
#include <iostream>
volatile float three = 3.0f, seven = 7.0f;
int main()
{
float x = three / seven;
std::cout << x << '\n';
float y = three / seven;
std::cout << (x == y) << '\n';
}
Outputs
0.428571
0
To me this seems like it violates IEEE 754, since the standard requires basic operations to be correctly rounded. Now I know there are a couple reasons for IEEE 754 floating-point calculations to be non-deterministic as discussed here, but I don't see how any of them applies to this example. Here are some of the things I considered:
Excess precision and contraction: I'm doing a single calculation and assigning the result to a float, which should force both of the values to be rounded to float precision.
Compile-time calculations: three and seven are volatile so both calculations must be done at runtime.
Floating-point flags: The calculations are done in the same thread almost immediately after each other, so the flags should be the same.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Also, removing the statement printing x, adding a statement to print y, or making y volatile all changes the result. This seems to contradict my understanding of the C++ standard which I think requires the assignments to round off any excess precision.
Thanks to geza for pointing out that this is a known issue. I would still like a definitive answer on whether this conforms to the C++ standard and IEEE 754 though, since the C++ standard appears to require assignments to round off excess precision. Here's the quote from draft N4860 [expr.pre]:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.50
50) The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8 and 7.6.19.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Yes, with minor caveats.
One, C++ cannot just “conform” to IEEE 754. There has to be some specification of how things in C++ bind (connect) to IEEE 754, such as statements that the float format is IEEE-754 binary32, that x / y uses IEEE-754 division, and so on. C++ 2017 draft N4659 refers to LIA-1, but I do not see that it clearly requires LIA-1 be used even if std::numeric_limits<float>::is_iec559 reports true, and LIA-1 apparently only suggests language bindings.
The C++ standard tells us the fact that std::numeric_limits<float>::is_iec559 reports true means the float type conforms to ISO/IEC/IEEE 60559, which is effectively IEEE 754-2008. But, in addition to the binding problem, I do not see a statement in the C++ standard that nullifies 8 [expr] 13 (“The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.”) when is_iec559 is true. Although it is true that the cast and conversion operators must “perform their specific conversions” (footnote 64), and this forces float y = three / seven; to produce the correct IEEE-754 binary32 results even if binary64 or Intel’s 80-bit floating-point are used for the division, it might not force it to produce the correct result if only a little excess precision is used. (If at least 48 bits of precision are used, no double-rounding errors occur for division when rounded to the 24-bits of the binary32 format. If fewer excess bits are used, there may be some cases that experience double rounding errors.)
I believe the intent of is_iec559 is to indicate a sensible binding, and the behavior shown in the question does violate this. In particular, the defect shown in the question is caused by failing to round the excess precision used in the division to the actual float type; it is not caused by the hypothetical use of less-than-enough excess precision mentioned above.
For example, The code below will give undesirable result due to precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So round function can not solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0`”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question the the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not come even with the guarantees that basic operations come with.
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it will be converted to the previous integer instead (e.g. 3-1×10-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it will be converted to the next integer instead (e.g. -3+1×10-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.
I'm writing an algorithm, to round a floating number. The input will be a 64bit IEEE754 double type number, very close to X.5, where X is a integer less than 32. The first solution came into my mind is to use a bit mask, to mask off those least significant bits as they represent very small fractions of 2^-n.(Given the exponent is not large).
But the problem is should I do that? Is there any other ways to accomplish the same thing? I feel using bit operation on float point is very controversy. Thanks!
The langugage I'm using is C++ by the way.
Edit:
Thanks guys, for your comments. I appreciate! Let's say I have a float number, can be 1.4999999... or 21.50000012.... I want to round it to 1.5 or 21.5. My goal is to round any number to its nearest to X.5 form, since it can be stored in a IEEE754 float point number.
If your compiler guarantees that you are using IEEE 754 floating-point, I would recommend that you round according to the method delineated in this blog post: add, and then immediately subtract a large constant so as to send the value in the binade of floating-point numbers where the ULP is 0.5. You won't find any faster method, and it does not involve any bit manipulation.
The appropriate constant to round a number between 0 and 32 to the nearest halt-unit for IEEE 754 double-precision is 2251799813685248.0.
Summary: use x = x + 2251799813685248.0 - 2251799813685248.0;.
You can use any of the functions round(), floor(), ceil(), rint(), nearbyint(), and trunc(). All do rounding in different modes, and all are standard C99. The only thing you need to do is to link against the standard math library by specifying -lm as a compiler flag.
As to trying to achieve rounding by bit manipulations, I would stay away from that: a) it will be much slower than using the functions above (they generally use hardware facilities where possible), b) it is reinventing the wheel with a lot of potential for bugs, and c) the newer C standards don't like you doing bit manipulations on floating point types: they use the so called strict aliasing rules that disallow you to just cast a double* to an uint64_t*. You would either need to do your bit manipulation by casting to a unsigned char* and manipulating the IEEE number byte by byte, or you would have to use memcpy() to copy the bit representation from a double variable into an uint64_t and back again. A lot of hassle for something already available in the form of standardized functions and hardware support.
You want to round x to the nearest value of the form d.5. For a generan number you write:
round(x+0.5)-0.5
For a number close to d.5, less than 0.25 away, you can use Pascal's offering:
round(2*x)*0.5
If you're looking for a bit trick and are guaranteed to have doubles in the ranges you describe, then you could do something like this (inline as you see fit):
void RoundNearestHalf(double &d) {
unsigned const maskshift = ((*(unsigned __int64*)&d >> 52) - 1023);
unsigned __int64 const setmask = 0x0008000000000000 >> maskshift;
unsigned __int64 const clearmask = ~0x0007FFFFFFFFFFFF >> maskshift;
*(unsigned __int64*)&d |= setmask;
*(unsigned __int64*)&d &= clearmask;
}
maskshift is the unbiased exponent. For the input range, we know this will be non-negative and no more than 4 (the trick will work for higher values too, but no more than 51). We use this value to make a setmask which sets the 2^-1 (one-half) place in the mantissa, and clearmask which clears all bits in the mantissa of lower value than 2^-1. The result is d rounded to the nearest half.
Note that it would be worth profiling this against other implementations, perhaps using the standard library to determine whether or not its actually faster.
I can't speak about C++ for sure, but in C99 the use of IEEE 754 standard for floating point will be purely normative (not required). In C99 if the __STDC_IEC_559__ macro is set then it declares that IEC 559 (which is more or less IEEE 754) is used for floating point.
I think it should be pointed out that there are functions to handle many types of rounding for you.
When comparing doubles for equality, we need to give a tolerance level, because floating-point computation might introduce errors. For example:
double x;
double y;
x = f();
y = g();
if (fabs(x-y)<epsilon) {
// they are equal!
} else {
// they are not!
}
However, if I simply assign a constant value, without any computation, do I still need to check the epsilon?
double x = 1;
double y = 1;
if (x==y) {
// they are equal!
} else {
// no they are not!
}
Is == comparison good enough? Or I need to do fabs(x-y)<epsilon again? Is it possible to introduce error in assigning? Am I too paranoid?
How about casting (double x = static_cast<double>(100))? Is that gonna introduce floating-point error as well?
I am using C++ on Linux, but if it differs by language, I would like to understand that as well.
Actually, it depends on the value and the implementation. The C++ standard (draft n3126) has this to say in 2.14.4 Floating literals:
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
In other words, if the value is exactly representable (and 1 is, in IEEE754, as is 100 in your static cast), you get the value. Otherwise (such as with 0.1) you get an implementation-defined close match (a). Now I'd be very worried about an implementation that chose a different close match based on the same input token but it is possible.
(a) Actually, that paragraph can be read in two ways, either the implementation is free to choose either the closest higher or closest lower value regardless of which is actually the closest, or it must choose the closest to the desired value.
If the latter, it doesn't change this answer however since all you have to do is hardcode a floating point value exactly at the midpoint of two representable types and the implementation is once again free to choose either.
For example, it might alternate between the next higher and next lower for the same reason banker's rounding is applied - to reduce the cumulative errors.
No if you assign literals they should be the same :)
Also if you start with the same value and do the same operations, they should be the same.
Floating point values are non-exact, but the operations should produce consistent results :)
Both cases are ultimately subject to implementation defined representations.
Storage of floating point values and their representations take on may forms - load by address or constant? optimized out by fast math? what is the register width? is it stored in an SSE register? Many variations exist.
If you need precise behavior and portability, do not rely on this implementation defined behavior.
IEEE-754, which is a standard common implementations of floating point numbers abide to, requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result. Thus the only imprecision that you will face is rounding after each operation you perform, as well as propagation of rounding errors from the operations performed earlier in the chain. Floats are not per se inexact. And by the way, epsilon can and should be computed, you can consult any numerics book on that.
Floating point numbers can represent integers precisely up to the length of their mantissa. So for example if you cast from an int to a double, it will always be exact, but for casting into into a float, it will no longer be exact for very large integers.
There is one major example of extensive usage of floating point numbers as a substitute for integers, it's the LUA scripting language, which has no integer built-in type, and floating-point numbers are used extensively for logic and flow control etc. The performance and storage penalty from using floating-point numbers turns out to be smaller than the penalty of resolving multiple types at run time and makes the implementation lighter. LUA has been extensively used not only on PC, but also on game consoles.
Now, many compilers have an optional switch that disables IEEE-754 compatibility. Then compromises are made. Denormalized numbers (very very small numbers where the exponent has reached smallest possible value) are often treated as zero, and approximations in implementation of power, logarithm, sqrt, and 1/(x^2) can be made, but addition/subtraction, comparison and multiplication should retain their properties for numbers which can be exactly represented.
The easy answer: For constants == is ok.
There are two exceptions which you should be aware of:
First exception:
0.0 == -0.0
There is a negative zero which compares equal for the IEEE 754 standard. This means
1/INFINITY == 1/-INFINITY which breaks f(x) == f(y) => x == y
Second exception:
NaN != NaN
This is a special caveat of NotaNumber which allows to find out if a number is a NaN
on systems which do not have a test function available (Yes, that happens).