In Visual C++ 2010, I tried this
double d= DBL_MAX;
double dblmaxintpart;
modf(DBL_MAX, &dblmaxintpart);
In the debugger window I put
d == dblmaxintpart
which gave true as the result.
Can I assume that DBL_MAX is equal to its integer part as an always valid assertion?
Yes, the integer part of a double which represents an integer will always be the double itself, even at DBL_MAX. In fact, any double greater than 2^52 will have itself as an integer part, because doubles of that size don't have enough mantissa bits to represent a fraction.
For similar reasons, not all integers above 2^53 are representable as doubles (though when converted to doubles, they will still be integers).
Finally, the fractional part of any double less than 1 will be exactly itself, and the fractional and integer parts of any double, when added, will produce exactly the original double.
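A quick way to check this yourself (a minimal sketch, assuming an IEEE-754 binary64 double, which is what Visual C++ uses):
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    double intpart;
    double frac = std::modf(DBL_MAX, &intpart);
    // frac is 0 and intpart compares equal to DBL_MAX.
    std::printf("frac = %g, equal = %d\n", frac, intpart == DBL_MAX);

    // The same holds for any double >= 2^52: adjacent doubles are at least 1 apart,
    // so such a value cannot carry a fractional part.
    double big = 4503599627370497.0;   // 2^52 + 1, exactly representable
    std::printf("frac = %g\n", std::modf(big, &intpart));   // 0
    return 0;
}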
Related
When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
return 0;
}
It appears to me as if an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
printf("Corresponding int to corresponding double: %" PRId64 "\n",
(int64_t)((double)9223372036854775000LL));
// Outputs: 9223372036854774784
return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind it could confirm this, that would be really helpful to me. I would also be curious whether more aggressive optimizations like gcc's -Ofast are known to break any of this.
In the general case, yes, both should be true. The floating-point base needs to be an integer (if not 2, then at least some integer), and given that, an integer converted to the nearest floating-point value can never produce a non-zero fraction: either the precision suffices, or the lowest-order digits of the integer, expressed in the base of the floating type, are zeroed. For example, in your case the system uses ISO/IEC/IEEE 60559 binary floating-point numbers. When inspected in base 2, it can be seen that the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type, should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of an unsigned 32-bit value with the MSB set (or just a double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
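The zeroed low-order bits can also be seen directly in C++ by printing the converted value in hexadecimal floating-point notation with %a (a small sketch, assuming an IEEE-754 binary64 double):
#include <cstdint>
#include <cstdio>

int main() {
    int64_t i = 9223372036854775000LL;
    double d = static_cast<double>(i);
    // On a typical implementation this prints 0x1.fffffffffffffp+62, i.e. (2^53 - 1) * 2^10:
    // the ten lowest bits of the original integer were rounded away to zero.
    std::printf("%a\n%f\n", d, d);
    return 0;
}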
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by b^e for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type¹, and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
1 It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For common double: Yes, it always lands on a non-fractional number.
When there is no exact match, the result is the closest representable floating-point value above or below, depending on the rounding mode. Given the characteristics of common double, these 2 bounding values are also whole numbers: whenever an int64_t value is not exactly representable, there is a nearby whole-number double on either side of it.
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail, as the converted value can become an FP value above INT64_MAX. Then conversion back to the integer type incurs: "the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." C17dr § 6.3.1.3 3
#include <limits.h>
#include <stdio.h>
int main() {
long long imaxm1 = LLONG_MAX - 1;
double max = (double) imaxm1;
printf("%lld\n%f\n", imaxm1, max);
long long imax = (long long) max;
printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Value here is implementation defined.
Deeper exceptions
(Question variation) When an N bit integer type is cast to a floating point and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds the finite floating-point range
Conversion to infinity: With common float, and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra wide integer types.
#include <stdio.h>

int main() {
unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
imaxm1 <<= 64;
imaxm1 |= 0xFFFFFFFFFFFFFFFF;
double fmax = (float) imaxm1;
double max = (double) imaxm1;
printf("%llde27\n%f\n%f\n", (long long) (imaxm1/1000000000/1000000000/1000000000),
fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating-point precision deeper than its range
On some unicorn implementation, with very wide FP precision and a small range, the largest finite value could, in theory (not in practice), be a non-whole number. Then with an even wider integer type, the conversion could result in this non-whole-number value. I do not see this as a legitimate concern of OP's.
In particular, I'm interested in whether int32_t is always losslessly converted to double.
Does the following code always return true?
int is_lossless(int32_t i)
{
double d = i;
int32_t i2 = d;
return (i2 == i);
}
What about int64_t?
When is integer to floating point conversion lossless?
When the floating point type has enough precision and range to encode all possible values of the integer type.
Does the following int32_t code always return true? --> Yes.
Does the following int64_t code always return true? --> No.
As DBL_MAX is at least 1E+37, the range is sufficient for at least int122_t, so let us look at precision.
With common double, with its base 2, sign bit, 53-bit significand, and exponent, all values of int54_t with its 53 value bits can be represented exactly. INT54_MIN is also representable. Such a double has DBL_MANT_DIG == 53, which in this case is the number of base-2 digits in the floating-point significand.
The smallest magnitude non-representable value would be INT54_MAX + 2. Type int55_t and wider have values not exactly representable as a double.
With uintN_t types, there is 1 more value bit. The typical double can then encode all uint53_t and narrower.
With other possible double encodings, as C specifies DBL_DIG >= 10, all values of int34_t can round trip.
Code is always true with int32_t, regardless of double encoding.
What about int64_t?
UB potential with int64_t.
The conversion in int64_t i ... double d = i;, when inexact, makes for an implementation-defined result: one of the 2 nearest candidates. This is often round-to-nearest. Then i values near INT64_MAX can convert to a double one more than INT64_MAX.
With int64_t i2 = d;, the conversion of the double value one more than INT64_MAX to int64_t is undefined behavior (UB).
A simple prior test to detect this:
#define INT64_MAX_P1 ((INT64_MAX/2 + 1) * 2.0)
if (d == INT64_MAX_P1) return false; // not lossless
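Putting that together, a sketch of a lossless round-trip test for int64_t that avoids the UB (assuming the usual binary64 double; INT64_MAX + 1 is spelled as (INT64_MAX/2 + 1) * 2.0 because it is a power of 2 and therefore exactly representable):
#include <cstdint>

bool is_lossless64(int64_t i)
{
    double d = static_cast<double>(i);
    // d can never fall below INT64_MIN, since (double)INT64_MIN == -2^63 exactly,
    // but it can reach INT64_MAX + 1; converting that back would be UB.
    if (d >= (INT64_MAX / 2 + 1) * 2.0) return false;   // not lossless
    return static_cast<int64_t>(d) == i;
}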
Question: Does the following code always return true?
Always is a big statement and therefore the answer is no.
The C++ Standard makes no mention of whether or not the floating-point types known to C++ (float, double and long double) are of the IEEE-754 type. The standard explicitly states:
There are three floating-point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined. [Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note] Integral and floating-point types are collectively called arithmetic types. Specialisations of the standard library template std::numeric_limits shall specify the maximum and minimum values of each arithmetic type for an implementation.
source: C++ standard: basic fundamentals
Most commonly, the type double represents the IEEE 754 double-precision binary floating-point format binary64: 1 sign bit, 11 exponent bits and 52 explicitly stored significand bits, decoded (for normal values) as (-1)^sign × 1.significand × 2^(exponent − 1023).
However, there is a plethora of other floating-point formats out there that are decoded differently and do not necessarily have the same properties as the well-known IEEE-754. Nonetheless, they are all broadly similar:
They are n bits long
One bit represents the sign
m bits represent the significand, with or without a hidden first bit
e bits represent some form of an exponent of a given base (2 or 10)
To know whether or not a double can represent every 32-bit signed integer, you must answer the following questions (assuming our floating-point number is in base 2):
Does my floating-point representation have a hidden first bit in the significand? If so, assume m = m+1.
A 32-bit signed integer is represented by 1 sign bit and 31 bits representing the number. Is the significand large enough to hold those 31 bits?
Is the exponent large enough that it can represent a number of the form 1.xxxxx × 2^31?
If you can answer yes to the last two questions, then yes, an int32 can always be represented by the double that is implemented on this particular system (a compile-time sketch of these checks follows below).
Note: I ignored decimal32 and decimal64 numbers, as I have no direct knowledge about them.
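If you want the compiler to answer those questions for your particular implementation, here is a minimal compile-time sketch using std::numeric_limits (the assertions assume a base-2 double; digits already counts the hidden bit):
#include <cstdint>
#include <limits>

static_assert(std::numeric_limits<double>::radix == 2,
              "this check assumes a binary floating-point double");
// int32_t has 31 value bits, so the significand must hold at least 31 bits...
static_assert(std::numeric_limits<double>::digits >= 31,
              "int32_t -> double is lossless");
// ...and the exponent must reach numbers of the form 1.xxxxx * 2^31.
static_assert(std::numeric_limits<double>::max_exponent >= 32,
              "double's exponent range covers 2^31");
// int64_t would need 63 value bits; a 53-bit double cannot provide them:
// static_assert(std::numeric_limits<double>::digits >= 63, "int64_t -> double is lossless");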
Note: my answer supposes the double follows IEEE 754, and both int32_t and int64_t are 2's complement.
Does the following code always return true?
The mantissa/significand of a double is longer than 32 bits, so int32_t => double is always done without error: there is no possible precision error (and there is no possible overflow/underflow; the exponent covers more than the needed range of values).
What about int64_t?
But the 53 bits of mantissa/significand of a double (including 1 implicit bit) are not enough to hold the 64 bits of an int64_t => an int64_t whose upper and lower set bits are far enough apart cannot be stored in a double without a precision error (there is still no possible overflow/underflow; the exponent still covers more than the needed range of values).
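A concrete failing case, under the same 53-bit assumption:
#include <cstdint>
#include <cstdio>

int main() {
    int64_t i = (int64_t{1} << 62) | 1;   // top and bottom value bits set: needs 63 significant bits
    double d = static_cast<double>(i);    // rounds to 2^62; the low bit is lost
    std::printf("%lld -> %.0f, round trip equal: %d\n",
                static_cast<long long>(i), d, static_cast<int64_t>(d) == i);   // equal: 0
    return 0;
}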
If your platform uses IEEE754 for the double, then yes, any int32_t can be represented perfectly in a double. This is not the case for all possible values that an int64_t can have.
(It is possible on some platforms to tweak the mantissa / exponent sizes of floating point types to make the transformation lossy, but such a type would not be an IEEE754 double.)
To test for IEEE754, use
static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 floating point");
How can I round __float128 in C++ to get __int128? I found some rounding functions in quadmath.h, but their result is long long (or something even shorter), or an integer stored in a __float128. This question isn't a duplicate of Why do round() and ceil() not return an integer? because I use 128-bit numbers and casting doesn't work for them.
__int128 can only represent an integer which is in the range -2^127 (or -2^127 + 1 on some systems) to 2^127 - 1.
__float128 can represent values up to 2^16384 - 2^16271 ≈ 1.1897 × 10^4932, which is much bigger than what __int128 can hold.
You need to:
use roundq to get the rounded __float128, then
check that the value stays in the range [-2^127, 2^127): -2^127 is exactly the minimum of __int128, 2^127 is 1 above its maximum, and both bounds can be represented exactly by a __float128 because they are powers of 2.
if it is in that range, cast to __int128 (a sketch combining these steps follows below).
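A minimal sketch of those steps (assuming GCC with libquadmath, linked with -lquadmath; the bounds are built from an unsigned __int128 shift so that no non-standard literal is needed):
#include <quadmath.h>
#include <cstdio>

int main() {
    __float128 x = 1.5e30;                  // some value to convert
    __float128 r = roundq(x);               // nearest integer, halfway cases away from zero

    const __float128 hi = static_cast<__float128>(static_cast<unsigned __int128>(1) << 127); // 2^127
    const __float128 lo = -hi;                                                               // -2^127

    if (r >= lo && r < hi) {
        __int128 n = static_cast<__int128>(r);   // in range, so the cast is well defined
        std::printf("low 64 bits: %llu\n",
                    static_cast<unsigned long long>(static_cast<unsigned __int128>(n)));
    } else {
        std::puts("out of __int128 range");
    }
    return 0;
}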
Alternatively, from the GCC documentation you can use llroundq: round to nearest integer value away from zero. But its result is only a long long, and for values that are too large, quoting the libquadmath source code:
else
{
/* The number is too large. It is left implementation defined
what happens. */
return (long long int) x;
}
If I write this code in C++:
long long d = 999999998.9999999994;
cout<<d;
I get output: 999999999 (rounding up)
But the output of this code:
long long d = 999999998.9999994994;
cout<<d;
is 999999998 (rounding down)
Is it something to do with precision? Is there any way I can change the precision? The floor() function also gives the same output.
I also noticed that if I assign the value 8.9999994994 or 8.9999999994 to d (the variable above), the output is 8.
999999998.9999999994 is not exactly representable in double, so the actual value is one of the two representable numbers on either side of 999999998.9999999994 - either 999999998.99999988079071044921875 or 999999999 (assuming IEEE-754 binary64 format), selected in an implementation-defined manner. Most systems will by default round to nearest, producing 999999999.
The net result is that on those systems when you write 999999998.9999999994 it ends up having the exact same effect as writing 999999999.0. Hence the subsequent conversion yields 999999999 - the conversion from a floating point number to an integer always truncates, but here there is nothing to truncate.
With 999999998.9999994994, the closest representable numbers are 999999998.999999523162841796875 and 999999998.99999940395355224609375. Either one produces 999999998 after truncation. Similarly, with 8.9999999994, the closest representable numbers are 8.999999999399999950355777400545775890350341796875 and 8.9999999994000017267126168007962405681610107421875, and either one will produce 8 after truncation.
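You can watch this happen by printing the stored values with enough digits (a quick sketch; the exact values assume IEEE-754 binary64 with round-to-nearest):
#include <cstdio>

int main() {
    double a = 999999998.9999999994;   // nearest double is exactly 999999999.0
    double b = 999999998.9999994994;   // nearest double is 999999998.999999523162841796875
    std::printf("a = %.17g\nb = %.17g\n", a, b);
    std::printf("(long long)a = %lld\n(long long)b = %lld\n",
                static_cast<long long>(a), static_cast<long long>(b));   // 999999999 and 999999998
    return 0;
}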
long long d = 999999998.9999999994;
The closest value to 999999998.9999999994 that double can represent is 999999999.0 - remember that floating points have finite precision ;).
Therefore, truncating the decimal places yields 999999999, and that's what is saved in d.
Using a literal with L-suffix does indeed lead to 999999998 being saved in d - long double has a higher precision.
long long d = 999999998.9999994994;
The closest value to 999999998.9999994994 that double can represent is actually below 999999999 - approximately 999999998.999999523 on my machine. Truncating the decimal places subsequently yields 999999998, and that is stored in d.
I have a program in C++ where I divide two numbers, and I need to know if the answer is an integer or not. What I am using is:
if(fmod(answer,1) == 0)
I also tried this:
if(floor(answer)==answer)
The problem is that answer usually is a 5 digit number, but with many decimals. For example, answer can be: 58696.000000000000000025658 and the program considers that an integer.
Is there any way I can make this work?
I am dividing double a/double b= double answer
(sometimes there are more than 30 decimals)
Thanks!
EDIT:
a and b are numbers in the thousands (about 100,000) which are then raised to powers of 2 and 3, added together and divided (according to a complicated formula). So I am plugging in various a and b values and looking at the answer. I will only keep the a and b values that make the answer an integer. An example of what I got for one of the answers was: 218624 which my program above considered to be an integer, but it really was: 218624.00000000000000000056982 So I need a code that can distinguish integers with more than 20-30 decimals.
You can use std::modf from <cmath>:
double integral;
if(std::modf(answer, &integral) == 0.0)
The integral part of answer is stored in integral, and the return value of std::modf is the fractional part of answer, with the same sign as answer.
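Be aware that this will still report a zero fractional part for the value in the question, because the literal itself already rounds to an exact integer when stored in a double (a small sketch, assuming binary64):
#include <cmath>
#include <cstdio>

int main() {
    // The fractional part of the literal (about 2.6e-20) is far below the spacing of
    // doubles near 58696 (about 7.3e-12), so the stored value is exactly 58696.0.
    double answer = 58696.000000000000000025658;
    double integral;
    double frac = std::modf(answer, &integral);
    std::printf("frac = %.17g, integral = %.17g\n", frac, integral);   // frac = 0, integral = 58696
    return 0;
}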
The usual solution is to check if the number is within a very short distance of an integer, like this:
bool isInteger(double a){
double b=round(a),epsilon=1e-9; //some small range of error
return (a<=b+epsilon && a>=b-epsilon);
}
This is needed because floating point numbers have limited precision, and numbers that indeed are integers may not be represented perfectly. For example, the following would fail if we do a direct comparison:
double d=sqrt(2); //square root of 2
double answer=2.0/(d*d); //2 divided by 2
Here, answer actually holds the value 0.99999..., so we cannot compare that to an integer, and we cannot check if the fractional part is close to 0.
In general, since the floating point representation of a number can be either a bit smaller or a bit bigger than the actual number, it is not good to check if the fractional part is close to 0. It may be a number like 0.99999999 or 0.000001 (or even their negatives), these are all possible results of a precision loss. That's also why I'm checking both sides (+epsilon and -epsilon). You should adjust that epsilon variable to fit your needs.
Also, keep in mind that the precision of a double is close to 15 digits. You may also use a long double, which may give you some extra digits of precision (or not, it is up to the compiler), but even that only gets you around 18 digits. If you need more precision than that, you will need to use an external library, like GMP.
Floating-point numbers are stored in memory using a very different bit format than integers, and the results of floating-point arithmetic are often not exact. Because of this, comparing them for equality is not likely to work effectively. Instead, you need to test whether the difference is smaller than some epsilon:
const double EPSILON = 0.00000000000000000001; // adjust for whatever precision is useful for you
double remainder = std::fmod(numer, denom);
if(std::fabs(0.0 - remainder) < EPSILON)
{
//...
}
Alternatively, if you want to include values that are just below an integer as well (the remainder returned by std::fmod lies in [0, denom) for positive operands, so it can be close to denom rather than 0), you can modify the check slightly:
if (std::fabs(std::round(d) - d) < EPSILON)
{
// ...
}
Floating-point numbers are generally precise to about 15 significant digits (as a double), but because they are stored as a mantissa (fraction) and an exponent, computed results that are mathematically whole numbers or simple fractions are often not stored exactly. For example,
double d = std::sqrt(2.0) * std::sqrt(2.0); // d is actually 2.0000000000000004, not exactly 2
Because of this, you need to compare the difference from what you expect to some very small number that encompasses the precision you desire (we will call this value epsilon):
double d = std::sqrt(2.0) * std::sqrt(2.0);
bool test = std::fabs(2 - d) < epsilon; // will return true
So when you are trying to compare the remainder from std::fmod, you need to check it against the difference from 0.0 (not for actual equality to 0.0), which is what is done above.
Also, the std::fabs call keeps you from having to do 2 checks, by ensuring the value being compared is always non-negative.
If you desire a precision that is greater than 15-18 decimal places, you cannot use double or long double; you will need to use a high precision floating point library.