If I write this code in C++:
long long d = 999999998.9999999994;
cout<<d;
I get the output: 999999999 (rounded up)
But output of this code:
long long d = 999999998.9999994994;
cout<<d;
is 999999998 (rounded down).
Does it have something to do with precision? Is there any way I can change the precision? The floor() function also gives the same output.
I also noticed that if I assign the value 8.9999994994 or 8.9999999994 to d (the variable above), the output is 8.
999999998.9999999994 is not exactly representable in double, so the actual value is one of the two representable numbers on either side of 999999998.9999999994 - either 999999998.99999988079071044921875 or 999999999 (assuming IEEE-754 binary64 format), selected in an implementation-defined manner. Most systems will by default round to nearest, producing 999999999.
The net result is that on those systems when you write 999999998.9999999994 it ends up having the exact same effect as writing 999999999.0. Hence the subsequent conversion yields 999999999 - the conversion from a floating point number to an integer always truncates, but here there is nothing to truncate.
With 999999998.9999994994, the closest representable numbers are 999999998.999999523162841796875 and 999999998.99999940395355224609375. Either one produces 999999998 after truncation. Similarly, with 8.9999999994, the closest representable numbers are 8.999999999399999950355777400545775890350341796875 and 8.9999999994000017267126168007962405681610107421875, and either one will produce 8 after truncation.
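One quick way to see which neighbour was actually stored is to print the value with more digits than the literal has. A minimal sketch, assuming IEEE-754 binary64 and round-to-nearest:

#include <iomanip>
#include <iostream>

int main() {
    double a = 999999998.9999999994;  // nearest double is exactly 999999999.0
    double b = 999999998.9999994994;  // nearest double is just below 999999999
    std::cout << std::setprecision(25) << a << '\n' << b << '\n';
    // Typical output:
    // 999999999
    // 999999998.9999995231628418
}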
long long d = 999999998.9999999994;
The closest value to 999999998.9999999994 that double can represent is 999999999.0 - remember that floating points have finite precision ;).
Therefore, truncating the decimal places yields 999999999, and that's what is saved in d.
Using a literal with L-suffix does indeed lead to 999999998 being saved in d - long double has a higher precision.
long long d = 999999998.9999994994;
The closest value to 999999998.9999994994 that double can represent is actually below 999999999 - approximately 999999998.999999523 on my machine. Truncating the decimal places subsequently yields 999999998, and that is stored in d.
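To illustrate the L-suffix remark above, a minimal sketch; the second result assumes long double is wider than double (e.g. the x86 80-bit extended format):

#include <iostream>

int main() {
    long long d1 = 999999998.9999999994;   // double literal: stored as 999999999.0
    long long d2 = 999999998.9999999994L;  // long double literal: stays below 999999999
    std::cout << d1 << '\n' << d2 << '\n';
    // Typical output on x86: 999999999, then 999999998
}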
Related
I need a strong guarantee that int x = (int) std::round(y) will always give the correct results (y is finite and "humanly", e.g. -50000 to 50000).
std::round(4.1) can give 4.000000000001 or 3.99999999999. In the latter case, casting to int gives 3, right?
To manage this, I reinvented the wheel with this ugly function:
#include <cmath>     // std::fmod
#include <concepts>  // std::integral, std::floating_point

template<std::integral S = int, std::floating_point T>
S roundi(T x)
{
    S r = (S) x;             // truncate toward zero
    T r2 = std::fmod(x, 1);  // fractional part, same sign as x
    if (r2 >= 0.5) return r + 1;
    if (r2 <= -0.5) return r - 1;
    return r;
}
But is this necessary? Or does casting from double to int use the last mantissa bit for rounding?
Assuming int is 32 bits wide and double is 64 bits wide (and assuming IEEE 754), all values of int are exactly representable in a double.
That means std::round(4.1) returns exactly 4.0, nothing more, nothing less. And casting that number to int always gives exactly 4.
std::round(4.1) can give 4.000000000001 or 3.99999999999. In the latter case, casting to int gives 3, right?
No, it cannot. The result of std::round is always an integer, exactly, with no rounding error.
I need a strong guarantee that int x = (int) std::round(y) will always give the correct results (y is finite and "humanly", e.g. -50000 to 50000).
C++ inherits its floating-point model from C, and, per C 2018 5.2.4.2.2 12, double is capable of representing at least ten-digit integers, so [−50,000, +50,000] is well within its range. It is even within the range of float, which is capable of representing six-digit integers. This requirement extends back to C 1990.
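Both guarantees are cheap to check at runtime. A minimal sketch, assuming an IEEE-754-style double:

#include <cassert>
#include <cmath>

int main() {
    // std::round returns a value that is exactly an integer, so the
    // cast can never see 3.999... for an input near 4.
    double r = std::round(4.1);
    assert(r == 4.0);    // exact comparison is fine here
    assert((int)r == 4);

    // Every value in the "humanly" range rounds back exactly.
    for (int i = -50000; i <= 50000; ++i) {
        double y = i + 0.25;              // some finite nearby value
        assert((int)std::round(y) == i);  // always lands exactly on i
    }
}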
Given an int A, is there a strong guarantee that A == (int)(double)A?
No. The C++ standard does not impose an upper limit on the width of int, nor any relationship between the precision of int (the number of bits it uses for the value, excluding the sign bit) and the precision of double (the number of bits or other digits in its significand), so a C++ implementation may have an int with more precision than double.
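If you want a build-time guarantee rather than an assumption, a one-line check is possible. A sketch using <limits>:

#include <limits>

// Fails to compile on an implementation where int has more value bits
// than double's significand, i.e. where the round trip could be inexact.
static_assert(std::numeric_limits<double>::digits >=
                  std::numeric_limits<int>::digits,
              "int -> double -> int round trip may lose precision");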
std::round(4.1) can give 4.000000000001 or 3.99999999999. In the latter case, casting to int gives 3, right?
That's true. 4.1 can be seen as 4.0 (which has an exact representation in floating point, being an integer) plus 0.1, which is exactly 1/10. The problem you will have is if you try to round a number like that to one decimal place after the decimal mark (rounding to an integer multiple of 0.1 or 0.01 or 0.001, etc.).
If you are using decimal floating point (which C compilers normally don't use), then you are lucky, as 0.1 is 10^(-1), which again has an exact representation in the machine. But as a binary floating-point number it has an infinite representation, 0.000110011001100110011001100...b, and depending on where you cut it you will get some value or another; you will never get the exact decimal value with a finite number of binary digits.
But that is not how round() works for rounding to an integer: it first adds 0.5 (which is exactly representable as a binary floating-point number) to the number, an exact operation from which no rounding error emerges, and then takes the integer part (also an exact operation), meaning that you always get an exact integer result (which is perfectly representable as a floating-point value, if the original number was). The rounding is equivalent to this set of operations:
(int)(4.1 + 0.5);
so you will get the integer part of 4.6 after adding the 0.5 (or something like 4.60000000000000003 or 4.59999999999999998; either way it truncates to 4.0, which is also exactly representable in binary floating point), so you will never get a wrong answer when rounding to an integer. You could get a wrong response for something close to 4.5 (which might round to 4.0 instead of the correct 5.0), but 0.5 happens to be exactly 0.1b in binary, and so that case is not affected.
Beware, though, that rounding to multiples of a negative power of ten (0.1, 0.01, ...) is not guaranteed, as none of those numbers is representable exactly in binary floating point. All of them have an infinite binary representation, and because of the cut at some point they are stored as a value slightly above or below (whichever is closer), so the rounding will not be exact.
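The contrast between the integer case and the decimal-places case is easy to show. A sketch; the exact digits printed are platform-typical for IEEE-754 binary64:

#include <cmath>
#include <cstdio>

int main() {
    // Rounding to an integer: the result is exact.
    std::printf("%.20f\n", std::round(4.1));  // 4.00000000000000000000

    // "Rounding to one decimal place" by scaling: the ideal result 4.1
    // is not representable in binary, so the nearest double comes back.
    std::printf("%.20f\n", std::round(4.1 * 10) / 10);  // 4.09999999999999964473
}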
When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    return 0;
}
It appears to me as if an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    printf("Corresponding int to corresponding double: %" PRId64 "\n",
           (int64_t)((double)9223372036854775000LL));
    // Outputs: 9223372036854774784
    return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively, and from my tests, the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating-point standards and the maths behind them could confirm this, that would be really helpful to me. I would also be curious whether any known more aggressive optimizations, like gcc's -Ofast, are known to break any of this.
In the general case, yes, both should be true. The floating-point base needs to be, if not 2, then at least an integer; given that, an integer converted to the nearest floating-point value can never produce a non-zero fraction: either the precision suffices, or the lowest-order digits of the integer (in the base of the floating type) are zeroed. For example, your system uses ISO/IEC/IEEE 60559 binary floating-point numbers, and when the value is inspected in base 2 it can be seen that its trailing digits are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type, should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of unsigned 32-bit value with MSB set (or just double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by b^e for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type¹, and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
¹ It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
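A brute-force spot-check of the argument above, which is of course not a proof (assumes IEEE-754 binary64):

#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

int main() {
    std::mt19937_64 gen(12345);  // arbitrary fixed seed
    for (int i = 0; i < 1000000; ++i) {
        int64_t x = (int64_t)gen();
        double d = (double)x;
        // The conversion may be inexact, but it never lands on a fraction.
        assert(d == std::floor(d));
    }
}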
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For common double: yes, it always lands on a non-fractional number.
When there is no exact match, the result is the closest representable floating-point value above or below, depending on the rounding mode. Given the characteristics of common double, these 2 bounding values are also whole numbers: whenever a value is too big to be represented exactly, the representable values on either side of it are themselves whole numbers.
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail as the converted value could become a FP value above INT64_MAX. Then conversion back to the integer type incurs: "the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." C17dr § 6.3.1.3 3
#include <limits.h>
#include <stdio.h>

int main() {
    long long imaxm1 = LLONG_MAX - 1;
    double max = (double) imaxm1;      // rounds up to 2^63, just above LLONG_MAX
    printf("%lld\n%f\n", imaxm1, max);
    long long imax = (long long) max;  // value does not fit: implementation-defined
    printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Value here is implementation defined.
Deeper exceptions
(Question variation) When an N-bit integer type is cast to a floating-point type and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds the finite floating-point range
Conversion to infinity: with common float and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra-wide integer types.
#include <stdio.h>

int main() {
    unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
    imaxm1 <<= 64;
    imaxm1 |= 0xFFFFFFFFFFFFFFFF;  // now UINT128_MAX
    double fmax = (float) imaxm1;  // exceeds float's finite range: infinity
    double max = (double) imaxm1;  // fits within double's range
    printf("%llde27\n%f\n%f\n",
           (long long) (imaxm1/1000000000/1000000000/1000000000),
           fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating-point precision deeper than its range
On some unicorn implementation with very wide FP precision and a small range, the largest finite value could, in theory (though not in practice), be a non-whole number. Then, with an even wider integer type, the conversion could result in that non-whole value. I do not see this as a legitimate concern of OP's.
How can I round __float128 in C++ to get __int128? I found some rounding functions in quadmath.h, but their result is long long (or something even shorter), or an integer stored in a __float128. This question isn't a duplicate of Why do round() and ceil() not return an integer? because I use 128-bit numbers and casting doesn't work for them.
__int128 can only represent an integer in the range −2^127 (or −2^127 + 1 on some systems) to 2^127 − 1.
__float128 can represent values up to 2^16384 − 2^16271 ≈ 1.1897 × 10^4932, which is much bigger than __int128.
You need to (see the sketch below):
use roundq to get the rounded __float128, then
check that the value stays within [−2^127, 2^127); both bounds are exactly representable in __float128 because they are powers of 2, and 2^127 itself is 1 outside the limit of __int128,
and if it is in that range, cast to __int128.
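Putting those steps together, a sketch assuming GCC with libquadmath (round_to_int128 is our name, not a library function; compile with -lquadmath):

#include <quadmath.h>

bool round_to_int128(__float128 x, __int128 *out) {
    __float128 r = roundq(x);               // round to nearest, ties away from zero
    const __float128 lim = ldexpq(1, 127);  // 2^127, exact since it's a power of 2
    if (!(r >= -lim && r < lim))            // also rejects NaN
        return false;                       // would overflow __int128
    *out = (__int128)r;                     // r is integral and in range: exact
    return true;
}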
Alternatively, per the GCC documentation you can use llroundq: round to the nearest integer value, away from zero. But in this case, note this quote from the libquadmath source code:
else
  {
    /* The number is too large.  It is left implementation defined
       what happens.  */
    return (long long int) x;
  }
Is there an algorithm in C++ that will allow me, given a floating-point value V of type T (e.g. double or float), to find the closest value to V in a given direction (up or down) that can be represented exactly in no more than a specified number of decimal places D?
For example, given
T = double
V = 670000.08267799998
D = 6
For direction = towards +inf I would like the result to be 670000.082678, and for direction = towards -inf I would like the result to be 670000.082677
This is somewhat similar to std::nexttoward(), but with the restriction that the 'next' value needs to be exactly representable using at most D decimal places.
I've considered a naive solution involving separating out the fractional portion and scaling it by 10^D, truncating it, and scaling it again by 10^-D and tacking it back onto the whole number portion, but I don't believe that guarantees that the resulting value will be exactly representable in the underlying type.
I'm hopeful that there's a way to do this properly, but so far I've been unable to find one.
Edit: I think my original explanation didn't properly convey my requirements. At the suggestion of @Patricia Shanahan, I'll try describing my higher-level goal and then reformulate the problem a little differently in that context.
At the highest level, the reason I need this routine is due to some business logic wherein I must take in a double value K and a percentage P, split it into two double components V1 and V2 where V1 ~= P percent of K and V1 + V2 ~= K. The catch is that V1 is used in further calculations before being sent to a 3rd party over a wire protocol that accepts floating-point values in string format with a max of D decimal places. Because the value sent to the 3rd party (in string format) needs to be reconcilable with the results of the calculations made using V1 (in double format), I need to "adjust" V1 using some function F() so that it is as close as possible to being P percent of K while still being exactly representable in string format using at most D decimal places. V2 has none of the restrictions of V1 and can be calculated as V2 = K - F(V1) (it is understood and acceptable that this may result in V2 such that V1 + V2 is very close to but not exactly equal to K).
At the lower level, I'm looking to write that routine to 'adjust' V1 as something with the following signature:
double F(double V, unsigned int D, bool roundUpIfTrueElseDown);
where the output is computed by taking V and (if necessary, and in the direction specified by the bool param) rounding it to the Dth decimal place.
My expectation would be that when V is serialized out as follows
// requires <iostream>, <iomanip>, <limits>, <cassert>
const auto maxD = std::numeric_limits<double>::digits10;
assert(D <= maxD); // D will be less than maxD... e.g. typically 1-6, definitely <= 13
std::cout << std::fixed
          << std::setprecision(maxD)
          << F(V, D, true);
then the output contains only zeros beyond the Dth decimal place.
It's important to note that, for performance reasons, I am looking for an implementation of F() that does not involve conversion back and forth between double and string format. Though the output may eventually be converted to a string format, in many cases the logic will early-out before this is necessary and I would like to avoid the overhead in that case.
This is a sketch of a program that does what is requested. It is presented mainly to find out whether that is really what is wanted. I wrote it in Java, because that language has some guarantees about floating point arithmetic on which I wanted to depend. I only use BigDecimal to get exact display of doubles, to show that the answers are exactly representable with no more than D digits after the decimal point.
Specifically, I depended on double behaving according to IEEE 754 64-bit binary arithmetic. That is likely, but not guaranteed by the standard, for C++. I also depended on Math.pow being exact for simple exact cases, on exactness of division by a power of two, and on being able to get exact output using BigDecimal.
I have not handled edge cases. The big missing piece is dealing with large magnitude numbers with large D. I am assuming that the bracketing binary fractions are exactly representable as doubles. If they have more than 53 significant bits that will not be the case. It also needs code to deal with infinities and NaNs. The assumption of exactness of division by a power of two is incorrect for subnormal numbers. If you need your code to handle them, you will have to put in corrections.
It is based on the concept that a number that is both exactly representable as a decimal with no more than D digits after the decimal point and is exactly representable as a binary fraction must be representable as a fraction with denominator 2 raised to the D power. If it needs a higher power of 2 in the denominator, it will need more than D digits after the decimal point in its decimal form. If it cannot be represented at all as a fraction with a power-of-two denominator, it cannot be represented exactly as a double.
Although I ran some other cases for illustration, the key output is:
670000.082678 to 6 digits Up: 670000.09375 Down: 670000.078125
Here is the program:
import java.math.BigDecimal;

public class Test {
    public static void main(String args[]) {
        testIt(2, 0.000001);
        testIt(10, 0.000001);
        testIt(6, 670000.08267799998);
    }

    private static void testIt(int d, double in) {
        System.out.print(in + " to " + d + " digits");
        System.out.print(" Up: " + new BigDecimal(roundUpExact(d, in)).toString());
        System.out.println(" Down: "
                + new BigDecimal(roundDownExact(d, in)).toString());
    }

    // Round up/down to the nearest multiple of 2^-d. Such a multiple is
    // exactly representable both as a double (given enough significand
    // bits) and as a decimal with at most d fractional digits.
    public static double roundUpExact(int d, double in) {
        double factor = Math.pow(2, d);
        double roundee = factor * in;
        roundee = Math.ceil(roundee);
        return roundee / factor;
    }

    public static double roundDownExact(int d, double in) {
        double factor = Math.pow(2, d);
        double roundee = factor * in;
        roundee = Math.floor(roundee);
        return roundee / factor;
    }
}
In general, decimal fractions are not precisely representable as binary fractions. There are some exceptions, like 0.5 (½) and 16.375 (16⅜), which happen to be binary fractions; every binary fraction is precisely representable as a decimal fraction. (That's because 2 is a factor of 10, but 10 is not a factor of 2, or of any power of two.) But if a number is not a multiple of some (possibly negative) power of 2, its binary representation will be an infinitely-long cyclic sequence, like the representation of ⅓ in decimal (.333...).
The standard C library provides the macro DBL_DIG (normally 15); any decimal number with that many decimal digits of precision can be converted to a double (for example, with scanf) and then converted back to a decimal representation (for example, with printf). To go in the opposite direction without losing information -- start with a double, convert it to decimal and then convert it back -- you need 17 decimal digits (DBL_DECIMAL_DIG). (The values I quote are based on IEEE-754 64-bit doubles).
One way to provide something close to what the question asks would be to consider a decimal number with no more than DBL_DIG digits of precision to be an "exact-but-not-really-exact" representation of a floating-point number if that floating-point number is the one which comes closest to the value of the decimal number. One way to find that floating-point number would be to use scanf or strtod to convert the decimal number to a floating-point number, and then try the floating-point numbers in the vicinity (using nextafter to explore) to find which ones convert to the same representation with DBL_DIG digits of precision.
If you trust the standard library implementation to not be too far off, you could convert your double to a decimal number using sprintf, increment the decimal string at the desired digit position (which is just a string operation), and then convert it back to a double with strtod.
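A rough sketch of that string round-trip idea (round_via_string is our name; it scales and formats rather than incrementing the string in place, so inputs within a rounding error of a D-digit boundary may be adjusted in the wrong direction):

#include <cmath>
#include <cstdio>
#include <cstdlib>

double round_via_string(double v, int d, bool up) {
    double scale = std::pow(10.0, d);
    double scaled = up ? std::ceil(v * scale) : std::floor(v * scale);
    char buf[64];
    // Format with exactly d decimal places...
    std::snprintf(buf, sizeof buf, "%.*f", d, scaled / scale);
    // ...then let strtod pick the nearest representable double.
    return std::strtod(buf, nullptr);
}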
Total re-write.
Based on OP's new requirement, and using power-of-2 scaling as suggested by @Patricia Shanahan, a simple C solution:
double roundedV = ldexp(round(ldexp(V, D)),-D); // for nearest
double roundedV = ldexp(ceil (ldexp(V, D)),-D); // at or just greater
double roundedV = ldexp(floor(ldexp(V, D)),-D); // at or just less
The only thing added here beyond @Patricia Shanahan's fine solution is C code to match OP's tag.
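For completeness, a runnable wrapper around those one-liners, reproducing the key output of the Java program above:

#include <cmath>
#include <cstdio>

int main() {
    double V = 670000.08267799998;
    int D = 6;
    std::printf("up:   %.17g\n", std::ldexp(std::ceil (std::ldexp(V, D)), -D));
    std::printf("down: %.17g\n", std::ldexp(std::floor(std::ldexp(V, D)), -D));
    // Multiples of 2^-6: 670000.09375 and 670000.078125
}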
In C++ integers must be represented in binary, but floating point types can have a decimal representation.
If FLT_RADIX from <limits.h> is 10, or some multiple of 10, then your goal of exact representation of decimal values is attainable.
Otherwise, in general, it's not attainable.
So, as a first step, try to find a C++ implementation where FLT_RADIX is 10.
I wouldn't worry about algorithm or efficiency thereof until the C++ implementation is installed and proved to be working on your system. But as a hint, your goal seems to be suspiciously similar to the operation known as “rounding”. I think, after obtaining my decimal floating point C++ implementation, I’d start by investigating techniques for rounding, e.g., googling that, maybe Wikipedia, …
In Visual C++ 2010, I tried this
double d= DBL_MAX;
double dblmaxintpart;
modf(DBL_MAX, &dblmaxintpart);
In the debugger window I put
d == dblmaxintpart
which gave true as result.
Can I assume that DBL_MAX is equal to its integer part as an always valid assertion?
Yes, the integer part of a double which represents an integer will always be the double itself, even at DBL_MAX. In fact, any double greater than or equal to 2^52 will have itself as its integer part, because doubles of that magnitude don't have enough mantissa bits to represent a fraction.
For similar reasons, not all integers above 2^53 are representable as doubles (though when converted to doubles, they will still be integers).
Finally, the fractional part of any double less than 1 will be exactly itself, and the fractional and integer parts of any double, when added, will produce exactly the original double.
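A short illustration of those three claims (assumes IEEE-754 binary64):

#include <cassert>
#include <cfloat>
#include <cmath>

int main() {
    double ipart;

    // Any double >= 2^52 has no fractional bits, so modf leaves nothing over.
    assert(std::modf(DBL_MAX, &ipart) == 0.0 && ipart == DBL_MAX);

    // Below 1, the fractional part is the value itself.
    assert(std::modf(0.375, &ipart) == 0.375 && ipart == 0.0);

    // Fractional and integer parts always recombine exactly.
    double x = 12345.6789;
    double frac = std::modf(x, &ipart);
    assert(frac + ipart == x);
}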