Truncate and casting to int of ceiled double - c++

When performing a std::ceil on a double value, the value is rounded up to a whole number, so 3.3 becomes 4.0. The result can then be cast (truncated) to an int, which 'chops off' the part after the decimal point. So:
int foo = (int)std::ceil(3.3);
So at first glance this will store 4 in foo. However, the double is a floating point value. So it might either be 4.000000001 or 3.999999999. The latter would be truncated to 3.
But in practice I've never seen this behaviour occur. Can I safely assume that any implementation will return 4? Or is it only IEEE-754 that does this? Or have I just been lucky?

Rounding (or ceil-ing) a double will always, always, always be exact.
For floating point numbers below 2^(m+1), where m is the number of mantissal bits, all integers have exact representations, so the result can be exactly represented.
For floating point numbers above 2^(m+1)... they're already integers. Makes sense, if you think about it: there aren't enough mantissal bits to stretch down to the right of the decimal point. So again rounding/ceil-ing is exact.

ceil in C++ behaves like it does in C, where the standard says
The ceil functions compute the smallest integer value not less than x.
The result of ceil is always the floating-point representation of an integer; however, the result may overflow an integral type when truncated.
In your particular case, std::ceil(3.3) must be exactly 4.0, since that's the "smallest integer value not less than" 3.3.
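A minimal sketch of the point made above, assuming a conforming std::ceil. The helper names has_no_fraction and ceil_to_int are made up for illustration; the range of int is deliberately not checked here.

```cpp
#include <cmath>

// True when v is finite and has no fractional part.
bool has_no_fraction(double v) {
    double int_part;
    return std::isfinite(v) && std::modf(v, &int_part) == 0.0;
}

// ceil followed by truncation. The truncation cannot change the value,
// because std::ceil already returned an integral double.
// Caller must make sure the result fits in an int.
int ceil_to_int(double x) {
    return static_cast<int>(std::ceil(x));
}
```

With these, ceil_to_int(3.3) is 4 and has_no_fraction(std::ceil(x)) holds for any finite x, including values far too large for an int.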

Related

Is it safe to use Int(round(x))?

Say you have a double value, and want to round it to an integer...
Many round() functions return a double instead of an integer:
C# - round(double) -> double
C++ - round(double) -> double
Darwin - round(double) -> double
Swift - double.rounded() -> double
Java - Math.round(double) -> long
Ruby - float.round() -> int
(This is most likely because doubles have a much wider range of possible values.)
Given this "default" behavior, it probably explains why you'll commonly see the following recommended:
Int(round(myDouble))
(Here we assume that Int() removes everything after the decimal: 4.9 -> 4.)
So far so good, until you realize how complex floating point really is. For example, 55 might actually be stored as 54.9999999999999999.
Because of this, it sounds like the following might happen:
Int(round(55.4)) // we ask it to round 55.4, expecting 55
Int(54.9999999999999) // it rounded it to "55.0"
54 // the Int() function removed all remaining digits
We were expecting 55.4 rounded to be 55, but it ended up evaluating to 54.
Can something like the above really happen if we use Int(round(x))?
If so, what should we use instead of Int(round())?
Related: Many languages define floor(double) -> double. Is Int(floor(double)) safe?
Floating point models are constructed on these foundations:
a base b
a significand with a limited number of digits in that base (p the precision)
an exponent e for shifting the floating point, also limited in a certain range
a sign
So the floating point values are made like this: (-1)^signBit * significand * b^e
The significand can be represented in a normalized form x.xxxxxxxx with 1 nonzero digit left of the floating point (except for zero, or values near zero that lose precision and gradually underflow), and p-1 digits after the floating point.
But by shifting appropriately the exponent (e+1-p), it can as well be considered as an integer with p digits, xxxxxxxxx.0.
With a reasonable range for the exponent, we see that every integer up to b^p can be represented exactly by such a floating point model. With limited precision, only the last digits in base b are lost, so if we have an integer too large to fit in the significand, it will necessarily have a zero fraction part. Thus, there is no reason for round to answer anything other than an integral value (with zero fraction part).
The only unsafe part as you noted is that Int range might be much smaller than range of floating point values. Thus converting large floating point to Int could result in overflow exception, or worse, silent overflow with undefined behavior...
The conversion to Int is thus not necessary for the sake of eliminating the fraction part. It must be for other purposes (like feeding another part of the program that would only accept an Int).
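The range problem described above can be guarded against explicitly. The following is a sketch in C++ (the question uses Swift-like syntax); round_to_int is a hypothetical helper, and the check assumes that the limits of int convert to double exactly, which holds for a 32-bit int.

```cpp
#include <cmath>
#include <limits>

// Rounds x to the nearest integer and converts it to int.
// Returns false (leaving out untouched) when the rounded value
// does not fit in an int, or when x is NaN.
bool round_to_int(double x, int& out) {
    const double r = std::round(x);  // integral value, or inf/NaN
    const double lo = std::numeric_limits<int>::min();
    const double hi = std::numeric_limits<int>::max();
    if (!(r >= lo && r <= hi)) {     // also rejects NaN
        return false;
    }
    out = static_cast<int>(r);       // exact: r has no fraction part
    return true;
}
```

For 55.4 this stores 55; for something like 1e20 it reports failure instead of invoking undefined behavior.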

How does C++ round int to float/double?

How does C++ round, if signed/unsigned integers are implicitly converted to floats/doubles?
Like:
int myInt = SomeNumberWeCantExpressWithAFloat;
float myFloat = myInt;
My university script says the following: The resulting value is the representable value nearest to the original value, where ties are broken in an implementation-defined fashion.
Please explain how the "nearest representable value" is calculated and what "where ties are broken in an implementation-defined fashion" is supposed to mean.
Edit:
Since I work most of my time with the GCC, please give additional information about what floating point representation the GCC uses by default, if there is one.
Single-precision floating point numbers have a 24-bit mantissa. On systems with a 32-bit int representation, values above 2^24 and below -(2^24) may require rounding.
Value 2^24+1 = 16777217 is the first int that cannot be represented exactly in IEEE binary32 format. Two float representations are available - 16777216, which is below the exact value by 1, and 16777218, which is above the exact value, also by 1. Hence, we have a tie, meaning that C++ is allowed to choose either one of these two representations.
IEEE 754 specifies 5 different rounding modes that determine how an inexact value is rounded:
A very common mode is called: Round to nearest, ties to even.
From GCC Wiki:
Without any explicit options, GCC assumes round to nearest or even and
does not care about signalling NaNs. Compare with C99's #pragma STDC
FENV ACCESS OFF. Also, see note on x86 and m68080.
Round to nearest, ties to even
From Wikipedia:
Rounding a number y to the nearest integer requires some tie-breaking
rule for those cases when y is exactly half-way between two integers —
that is, when the fraction part of y is exactly 0.5.
In such a situation the even one is chosen. This applies to positive as well as negative numbers.
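As an illustration, here is a small sketch of what the conversion does under the assumptions that float is IEEE binary32 and that the default rounding mode (round to nearest, ties to even) is in effect. The function name to_float is made up for this example.

```cpp
#include <limits>

static_assert(std::numeric_limits<float>::digits == 24,
              "this sketch assumes a 24-bit float significand");

// Plain int -> float conversion; values needing more than 24 significant
// bits are rounded to a neighbouring float.
float to_float(int v) {
    return static_cast<float>(v);
}

// Under round to nearest, ties to even:
//   to_float(16777216) == 16777216.0f   (2^24, exact)
//   to_float(16777217) == 16777216.0f   (tie, even neighbour is below)
//   to_float(16777219) == 16777220.0f   (tie, even neighbour is above)
```

"Even" here refers to the last bit of the significand, which is why the tie is resolved downwards in one case and upwards in the other.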
Sources:
https://en.wikipedia.org/wiki/Rounding#Tie-breaking
https://gcc.gnu.org/wiki/FloatingPointMath
https://en.wikipedia.org/wiki/IEEE_floating_point

Should I worry about precision when I use C++ mathematical functions with integers?

For example, the code below will give an undesirable result due to the precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So the round function cannot solve my problem. I know I can solve this problem with a loop, but it does not seem very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give the correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result becomes 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) from being 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question in the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not even come with the guarantees that basic operations come with.
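For the specific task in the question (the largest power of 2 less than or equal to N), one way to avoid the accuracy concern entirely is to stay in integer arithmetic. This is not what the answer above proposes; it is a sketch of the loop the asker mentions, with a made-up function name and the assumption that N is at least 1.

```cpp
#include <cstdint>

// Largest power of two that is <= n, using integer arithmetic only.
// Requires n >= 1 (for n == 0 this returns 1, which is not meaningful).
std::uint64_t largest_pow2_le(std::uint64_t n) {
    std::uint64_t p = 1;
    while (p <= n / 2) {  // comparing against n / 2 avoids overflow of p * 2
        p *= 2;
    }
    return p;
}
```

For N == 8 this returns 8 and for N == 7 it returns 4, with no dependence on the quality of log2 or pow.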
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
(1) a floating point number really really close to a positive integer, but below it, will be converted to the previous integer instead (e.g. 3 - 1×10^-10 = 2.9999999999 will be converted to 2)
(2) a floating point number really really close to a negative integer, but above it, will be converted to the next integer instead (e.g. -3 + 1×10^-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately it returns another floating point number, so you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
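A short sketch contrasting the two approaches; naive_round and proper_round are names invented for this example.

```cpp
#include <cmath>

// The "add 0.5 and truncate" idiom: fine for positive inputs,
// wrong for negative ones because truncation rounds towards zero.
int naive_round(double x) {
    return static_cast<int>(x + 0.5);
}

// std::lround rounds to nearest (halfway cases away from zero)
// and returns an integer type directly.
long proper_round(double x) {
    return std::lround(x);
}

// naive_round(-2.7)  == -2   (wrong: -2.2 is truncated towards zero)
// proper_round(-2.7) == -3
// proper_round(2.9999999999) == 3, while static_cast<int>(2.9999999999) == 2
```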
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because, despite the precision being higher, the results depend on the compiler and compiler options, making reasoning about the computations harder and making the exact result non-portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.
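The 0.1 example can be tried with the sketch below. It assumes IEEE 754 double arithmetic; the volatile accumulator is an attempt to force every partial sum to be stored as a double, so that extended intermediate precision does not hide the effect. Function names are made up.

```cpp
// Adds 0.1 ten times. 0.1 is not exactly representable in binary,
// so rounding errors accumulate.
double sum_tenths() {
    volatile double sum = 0.0;
    for (int i = 0; i < 10; ++i) {
        sum = sum + 0.1;
    }
    return sum;
}

// Adds 1/8 eight times. 1/8 has a power-of-2 denominator,
// so every partial sum is exact.
double sum_eighths() {
    volatile double sum = 0.0;
    for (int i = 0; i < 8; ++i) {
        sum = sum + 0.125;
    }
    return sum;
}
```

Under these assumptions sum_eighths() compares equal to 1.0, while sum_tenths() ends up slightly below 1.0 and truncates to 0 when converted to int.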

At what point do doubles begin to lose precision?

My application needs to perform some operations: >, <, ==, !=, +, -, ++, etc. (but without division) on some numbers. Those numbers are sometimes integer, and more rarely floats.
If I use internally the "double" type (as defined by IEEE 754) even for integers, up until what point can I be safe to use them as if they were ints, without running in strange rounding errors (for example, n == 5 && n == 6 are both true because they round to the same number)?
Obviously the second input of the various operations (+, -, etc.) is always an integer and I know that with 0.000[..]01 I'll have troubles since the start.
As a bonus answer, the same question but for float.
The number of bits in an IEEE-754 double mantissa is 52, and there's an extra implied bit that is always 1. This means every integer up to 2^53, or 9007199254740992, can be represented exactly.
A float mantissa is 23 bits, again with an implied bit. The maximum integer that can be exactly represented is 2^24, or 16777216.
If your intent is to hold integer values only, there's usually a 64-bit integer type that would be more appropriate than a double.
Edit: originally I had 2^53-1 and 2^24-1, but I realized there's no need to subtract 1 - an even number can take advantage of an implied 0 bit to the right of the mantissa.
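These limits can be derived from std::numeric_limits rather than hard-coded. A sketch, assuming a radix-2 floating point type; exact_integer_limit is a hypothetical helper name.

```cpp
#include <cmath>
#include <limits>

// Returns 2^digits for the floating point type T, where digits counts the
// significand bits including the implied one. Every integer whose magnitude
// is at most this value is exactly representable in T.
template <typename T>
T exact_integer_limit() {
    static_assert(std::numeric_limits<T>::radix == 2,
                  "this sketch assumes a binary floating point type");
    return std::ldexp(static_cast<T>(1), std::numeric_limits<T>::digits);
}

// exact_integer_limit<double>() == 9007199254740992.0  (2^53)
// exact_integer_limit<float>()  == 16777216.0f         (2^24)
```

Adding 1.0 to exact_integer_limit<double>() gives back the same value under the default rounding mode, which is the kind of "n == 5 && n == 6" surprise the question asks about.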
C# Refer to:
However, do be aware that the range of the decimal type is smaller than a double. That is double can hold a larger value, but it does so by losing precision. Or, as stated on MSDN:
The decimal keyword denotes a 128-bit
data type. Compared to floating-point
types, the decimal type has a greater
precision and a smaller range, which
makes it suitable for financial and
monetary calculations. The approximate
range and precision for the decimal
type are shown in the following table.
The primary difference between decimal and double is that decimal is a base-10 floating-point type and double is a base-2 floating-point type. That means that decimal stores decimal fractions such as 0.1 exactly, while double can only approximate them with binary fractions. A decimal is 128 bits, so it takes twice the space of a double to store. Calculations on decimal are also slower (measure!).
If you need even larger precision, then BigInteger can be used from .NET 4 (you will need to handle decimal points yourself). Be aware that BigInteger is immutable, so any arithmetic operation on it will create a new instance; if the numbers are large, this might be crippling for performance.
I suggest you look into exactly how much precision you need. Perhaps your algorithm can work with normalized values, which can be smaller? If performance is an issue, one of the built-in floating point types is likely to be faster.

Some questions about floating points

I'm wondering if a number is represented one way in a floating point representation, is it going to be represented in the same way in a representation that has a larger size.
That is, if a number has a particular representation as a float, will it have the same representation if that float is cast to a double and then still the same when cast to a long double.
I'm wondering because I'm writing a BigInteger implementation and any floating point number that is passed in I am sending to a function that accepts a long double to convert it. Which leads me to my next question. Obviously floating points do not always have exact representations, so in my BigInteger class what should I be attempting to represent when given a float. Is it reasonable to try and represent the same number as given by std::cout << std::fixed << someFloat; even if that is not the same as the number passed in. Is that the most accurate representation I will be able to get? If so, ...
What's the best way to extract that value (in a base that is some power of 10)? At the moment I'm just grabbing it as a string and passing it to my string constructor. This will work, but I can't help but feel there's a better way; certainly, taking the remainder when dividing by my base is not accurate with floats.
Finally, I wonder if there is a floating point equivalent of uintmax_t, that is, a typename that will always be the largest floating point type on a system, or is there no point because long double will always be the largest (even if it's the same as a double)?
Thanks, T.
If by "same representation" you mean "exactly the same binary representation in memory except for padding", then no. Double-precision has more bits of both exponent and mantissa, and also has a different exponent bias. But I believe that any single-precision value is exactly representable in double-precision (except possibly denormalised values).
I'm not sure what you mean when you say "floating points do not always have exact representations". Certainly, not all decimal floating-point values have exact binary floating-point values (and vice versa), but I'm not sure that's a problem here. So long as your floating-point input has no fractional part, then a suitably large "BigInteger" format should be able to represent it exactly.
Conversion via a base-10 representation is not the way to go. In theory, all you need is a bit-array of length ~1024, initialise it all to zero, and then shift the mantissa bits in by the exponent value. But without knowing more about your implementation, there's not a lot more I can suggest!
double includes all values of float; long double includes all values of double. So you're not losing any value information by conversion to long double. However, you're losing information about the original type, which is relevant (see below).
In order to follow common C++ semantics, conversion of a floating point value to integer should truncate the value, not round.
The main problem is with large values that are not exact. You can use the frexp function to find the base 2 exponent of the floating point value. You can use std::numeric_limits<T>::digits to check if that's within the integer range that can be exactly represented.
My personal design choice would be an assert that the fp value is within the range that can be exactly represented, i.e. a restriction on the range of any actual argument.
To do that properly you need overloads taking float and double arguments, since the range that can be represented exactly depends on the actual argument's type.
When you have an fp value that is within the allowed range, you can use floor and fmod to extract digits in any numeral system you want.
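A sketch of how the frexp range check and the floor/fmod digit extraction described above could fit together. Both function names are invented for this example, base 10 is chosen arbitrarily, and integer_digits assumes a non-negative value that has already passed the range check.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <string>

// True when v is finite and |v| < 2^digits, i.e. within the range where
// every integer is exactly representable in T.
template <typename T>
bool in_exact_integer_range(T v) {
    if (!std::isfinite(v)) {
        return false;
    }
    int exp = 0;
    std::frexp(v, &exp);  // v == m * 2^exp with 0.5 <= |m| < 1 (or v == 0)
    return exp <= std::numeric_limits<T>::digits;
}

// Decimal digits of the integer part of v, most significant first.
// Precondition: v >= 0 and in_exact_integer_range(v).
std::string integer_digits(double v) {
    v = std::floor(v);  // truncate, following C++ conversion semantics
    if (v == 0.0) {
        return "0";
    }
    std::string digits;
    while (v > 0.0) {
        const double d = std::fmod(v, 10.0);  // exact remainder, 0..9
        digits.push_back(static_cast<char>('0' + static_cast<int>(d)));
        v = (v - d) / 10.0;                   // exact: v - d is a multiple of 10
    }
    std::reverse(digits.begin(), digits.end());
    return digits;
}
```

A BigInteger with a base that is a larger power of 10 could use the same loop with that base in place of 10.0, as long as the base itself is exactly representable.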
Yes, going from IEEE float to double to extended, the bits carry over from the smaller format to the larger format. For example:
single
S EEEEEEEE MMMMMMM.....
double
S EEEEEEEEEEE MMMMM....
6.5 single
0 10000001 101000...
6.5 double
0 10000000001 101000...
13 single
0 10000010 101000...
13 double
0 10000000010 101000...
Left-justify the mantissa and pad it with zeros on the right.
The exponent is right-justified: sign-extend from the bit next to the msbit to fill the wider field, then copy the msbit across unchanged.
Take an exponent of -2, for example. Subtract 1 to get -3. In two's complement, -3 is 0xFD or 0b11111101, but the exponent bits in the format are 0b01111101: the msbit is inverted. For double, a -2 exponent gives -2-1 = -3, or 0b1111...1101, which becomes 0b0111...1101 with the msbit inverted. (exponent bits = twos_complement(exponent-1) with the msbit inverted.)
As we see above, for an exponent of 3: 3-1 = 2, which is 0b000...010; invert the upper bit to get 0b100...010.
So yes, you can take the bits from single precision and copy them to the proper locations in the double-precision number. I don't have an extended float reference handy, but I'm pretty sure it works the same way.
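The bit copying can be sketched as follows, assuming float is IEEE binary32 and double is IEEE binary64. The sketch only handles normal values (no zeros, subnormals, infinities or NaNs), and it adjusts the exponent by re-biasing (subtract 127, add 1023), which yields the same bits as the sign-extension trick above. The function name is made up.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4 && sizeof(double) == 8,
              "this sketch assumes 32-bit float and 64-bit double");

// Builds the double with the same value as f by moving bit fields.
// Only valid for normal floats.
double widen_by_bits(float f) {
    std::uint32_t in;
    std::memcpy(&in, &f, sizeof in);

    const std::uint64_t sign = in >> 31;               // 1 bit
    const std::uint64_t exponent = (in >> 23) & 0xFFu; // 8 bits, bias 127
    const std::uint64_t mantissa = in & 0x7FFFFFu;     // 23 bits

    const std::uint64_t out =
        (sign << 63) |
        ((exponent + (1023 - 127)) << 52) |  // re-bias to 11 bits
        (mantissa << 29);                    // left-justify in 52 bits

    double d;
    std::memcpy(&d, &out, sizeof d);
    return d;
}
```

For the values in the tables above, widen_by_bits(6.5f) and widen_by_bits(13.0f) compare equal to 6.5 and 13.0.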