Comparing uint64_t and float for numeric equivalence - c++

I am writing a protocol that uses RFC 7049 as its binary representation. The standard states that the protocol may use the 32-bit floating-point representation of a number if its numeric value is equivalent to the respective 64-bit value; the conversion must not lead to loss of precision.
Which 32-bit float values can be bigger than a 64-bit integer and numerically equivalent to it?
Is comparing float x; uint64_t y; with (float)x == (float)y enough to ensure that the values are equivalent? Will this comparison ever be true?
RFC 7049 §3.6. Numbers
For the purposes of this specification, all number representations
for the same numeric value are equivalent. This means that an
encoder can encode a floating-point value of 0.0 as the integer 0.
It, however, also means that an application that expects to find
integer values only might find floating-point values if the encoder
decides these are desirable, such as when the floating-point value is
more compact than a 64-bit integer.

There certainly are numbers for which this is true:
2^33 can be perfectly represented as a floating point number, but clearly cannot be represented as a 32-bit integer. The following code should work as expected:
bool representable_as_float(int64_t value) {
    float repr = value;
    return repr >= -0x1.0p63 && repr < 0x1.0p63 && (int64_t)repr == value;
}
It is important to notice, though, that we are basically doing (int64_t)(float)value and not the other way around - we are interested in whether the cast to float loses any precision.
The check that repr is smaller than the maximum value of int64_t is important: without it we could invoke undefined behavior, because the cast to float may round up to the next higher number, which could then be larger than the maximum value representable in int64_t. (Thanks to @tmyklebu for pointing this out.)
Two samples:
// powers of 2 can easily be represented
assert(representable_as_float(((int64_t)1) << 33));
// Other numbers not so much:
assert(!representable_as_float(std::numeric_limits<int64_t>::max()));
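For reference, here is the whole thing as one self-contained C program (using INT64_MAX from <stdint.h> in place of std::numeric_limits, since nothing else here needs C++):
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Lossless iff the round trip int64_t -> float -> int64_t preserves the value.
bool representable_as_float(int64_t value) {
    float repr = value;
    // Guard the cast back: a value rounded up to 2^63 would overflow int64_t.
    return repr >= -0x1.0p63 && repr < 0x1.0p63 && (int64_t)repr == value;
}

int main(void) {
    // Powers of 2 survive the round trip.
    assert(representable_as_float((int64_t)1 << 33));
    // INT64_MAX rounds up to 2^63 as a float, so it is rejected.
    assert(!representable_as_float(INT64_MAX));
    return 0;
}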

The following is based on Julia's method for comparing floats and integers. This does not require access to 80-bit long doubles or floating point exceptions, and should work under any rounding mode. I believe this should work for any C float type (IEEE754 or not), and not cause any undefined behaviour.
UPDATE: technically this assumes a binary float format, and that the float exponent size is large enough to represent 2⁶⁴: this is certainly true for the standard IEEE754 binary32 (which you refer to in your question), but not, say, binary16.
#include <stdio.h>
#include <stdint.h>

int cmp_flt_uint64(float x, uint64_t y) {
    return (x == (float)y) && (x != 0x1p64f) && ((uint64_t)x == y);
}

int main() {
    float x = 0x1p64f;
    uint64_t y = 0xffffffffffffffff;
    if (cmp_flt_uint64(x, y))
        printf("true\n");
    else
        printf("false\n");
}
The logic here is as follows:
The first equality can be true only if x is a non-negative integer in the interval [0, 2⁶⁴].
The second checks that x (and hence (float)y) is not 2⁶⁴: if this is the case, then y cannot be represented exactly by a float, and so the comparison is false.
Any remaining values of x can be exactly converted to a uint64_t, and so we cast and compare.
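A quick driver for the boundary cases (my own sketch, assuming IEEE754 binary32 float):
#include <stdio.h>
#include <stdint.h>

int cmp_flt_uint64(float x, uint64_t y) {
    return (x == (float)y) && (x != 0x1p64f) && ((uint64_t)x == y);
}

int main(void) {
    // 2^33 is exactly representable as a float: all three steps pass.
    printf("%d\n", cmp_flt_uint64(0x1p33f, (uint64_t)1 << 33)); // 1
    // UINT64_MAX rounds up to 2^64 as a float: step two rejects it.
    printf("%d\n", cmp_flt_uint64(0x1p64f, UINT64_MAX));        // 0
    // A non-integer float fails the first equality.
    printf("%d\n", cmp_flt_uint64(0.5f, 0));                    // 0
    return 0;
}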

No, you need to compare (long double)x == (long double)y on an architecture where the mantissa of a long double can hold 63 bits. This is because some big long long ints will lose precision when you convert them to float, and compare as equal to a non-equivalent float, but if you convert to long double, it will not lose precision on that architecture.
The following program demonstrates this behavior when compiled with gcc -std=c99 -mssse3 -mfpmath=sse on x86, because these settings use wide-enough long doubles but prevent the implicit use of higher-precision types in calculations:
#include <assert.h>
#include <stdint.h>

const int64_t x = (1ULL << 62) - 1ULL;
const float y = (float)(1ULL << 62);

// The mantissa is not wide enough to store
// 63 bits of precision.
int main(void)
{
    assert((float)x == (float)y);
    assert((long double)x != (long double)y);
    return 0;
}
Edit: If you don't have wide enough long doubles, the following might work:
feclearexcept(FE_ALL_EXCEPT);
x == y;
fetestexcept(FE_INEXACT); // from <fenv.h>; nonzero means the conversion was inexact
I think, although I could be mistaken, that an implementation could round off x during the conversion in a way that loses precision.
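Here is a fuller sketch of that idea. Caveats: strictly portable access to the floating-point environment needs #pragma STDC FENV_ACCESS ON, and whether an inexact integer-to-float conversion raises FE_INEXACT is an IEEE754 property, not a guarantee of the C standard itself:
#include <fenv.h>
#include <stdint.h>
#include <stdio.h>

// Nonzero if converting y to float loses no information.
int exact_float_conversion(uint64_t y) {
    volatile uint64_t vy = y;     // keep the conversion at run time
    feclearexcept(FE_ALL_EXCEPT);
    volatile float f = (float)vy; // may raise FE_INEXACT
    (void)f;
    return !fetestexcept(FE_INEXACT);
}

int main(void) {
    printf("%d\n", exact_float_conversion((uint64_t)1 << 33)); // 1: exact
    printf("%d\n", exact_float_conversion(UINT64_MAX));        // 0: inexact
    return 0;
}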
Another strategy that could work is to compare
extern uint64_t x;
extern float y;
const float z = (float)x;
y == z && (uint64_t)z == x;
This should catch losses of precision due to round-off error, but it could conceivably cause undefined behavior if the conversion to z rounds up. It will work if the conversion is set to round toward zero when converting x to z.
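With <fenv.h> you can set that up explicitly; whether the integer-to-float conversion honors the dynamic rounding mode is implementation-specific, so treat this as a sketch:
#include <fenv.h>
#include <stdint.h>
#include <stdio.h>

// Exact comparison of a uint64_t and a float: convert with truncation,
// so z <= x and the cast back to uint64_t can never overflow.
int equals_exactly(uint64_t x, float y) {
    int saved = fegetround();
    fesetround(FE_TOWARDZERO);
    volatile uint64_t vx = x;   // keep the conversion at run time
    float z = (float)vx;        // rounds toward zero, never up past x
    fesetround(saved);
    return y == z && (uint64_t)z == x;
}

int main(void) {
    printf("%d\n", equals_exactly((uint64_t)1 << 33, 0x1p33f)); // 1
    printf("%d\n", equals_exactly(UINT64_MAX, 0x1p64f));        // 0
    return 0;
}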

Related

When a 64bit int is cast to 64bit float in C/C++ and doesn't have an exact match, will it always land on a non-fractional number?

When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    return 0;
}
It appears to me as if an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>

int main(int argc, const char **argv) {
    printf("Corresponding double: %f\n", (double)9223372036854775000LL);
    // Outputs: 9223372036854774784.000000
    printf("Corresponding int to corresponding double: %" PRId64 "\n",
           (int64_t)((double)9223372036854775000LL));
    // Outputs: 9223372036854774784
    return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating-point standards and the maths behind them could confirm this, that would be really helpful to me. I would also be curious whether aggressive optimizations like gcc's -Ofast are known to break any of this.
In the general case, yes, both should be true. The floating-point base needs to be an integer - if not 2, then at least some integer - and given that, an integer converted to the nearest floating-point value can never produce a non-zero fraction: either the precision suffices, or the lowest-order integer digits in the base of the floating type are zeroed. For example, in your case the system uses ISO/IEC/IEEE 60559 binary floating-point numbers. When inspected in base 2, it can be seen that the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type, should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example MSVC currently has a compiler bug where a round-trip conversion of unsigned 32-bit value with MSB set (or just double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by b^e for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type¹, and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
¹ It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
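An empirical spot-check of that conclusion - not a proof, just printing the fractional part of a few conversions (the sample values are arbitrary):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t samples[] = { 9223372036854775000LL, (1LL << 62) - 1, 123456789123456789LL };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        double ipart;
        // modf splits the converted value into integral and fractional parts;
        // the fractional part comes out 0 every time.
        printf("frac((double)%lld) = %g\n",
               (long long)samples[i], modf((double)samples[i], &ipart));
    }
    return 0;
}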
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For common double: Yes, it always lands on a non-fractional number.
When there is no exact match, the result is the closest representable floating-point value above or below, depending on the rounding mode. Given the characteristics of common double, these two bounding values are also whole numbers: a value that is not exactly representable always has a nearby whole number on either side.
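To see those two whole-number bounds concretely (a sketch, assuming IEEE754 binary64, using the OP's sample value):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int64_t x = 9223372036854775000LL;         // not exactly representable
    double below = (double)x;                  // rounds to the bound below x
    double above = nextafter(below, INFINITY); // the bound above x
    // Both bounds are whole numbers; at this magnitude consecutive
    // doubles are 1024 apart.
    printf("below: %.1f\nabove: %.1f\n", below, above);
    return 0;
}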
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail as the converted value could become a FP value above INT64_MAX. Then conversion back to the integer type incurs: "the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." C17dr § 6.3.1.3 3
#include <limits.h>
#include <stdio.h>

int main() {
    long long imaxm1 = LLONG_MAX - 1;
    double max = (double)imaxm1;
    printf("%lld\n%f\n", imaxm1, max);
    long long imax = (long long)max;
    printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Value here is implementation defined.
Deeper exceptions
(Question variation) When an N bit integer type is cast to a floating point and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds finite floating-point range
Conversion to infinity: With common float, and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra wide integer types.
#include <stdio.h>

int main() {
    unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
    imaxm1 <<= 64;
    imaxm1 |= 0xFFFFFFFFFFFFFFFF;
    double fmax = (float)imaxm1;  // converts to float first: infinity
    double max = (double)imaxm1;  // fits within double's range
    printf("%llde27\n%f\n%f\n",
           (long long)(imaxm1 / 1000000000 / 1000000000 / 1000000000),
           fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating-point precision deeper than range
On some unicorn implementation with very wide FP precision and a small range, the largest finite value could, in theory (not in practice), be a non-whole number. Then, with an even wider integer type, the conversion could result in that non-whole value. I do not see this as a legitimate concern of OP's.

Is there any difference between using floating point casts vs floating point suffixes in C and C++?

Is there a difference between this (using floating point literal suffixes):
float MY_FLOAT = 3.14159265358979323846264338328f; // f suffix
double MY_DOUBLE = 3.14159265358979323846264338328; // no suffix
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
vs this (using floating point casts):
float MY_FLOAT = (float)3.14159265358979323846264338328;
double MY_DOUBLE = (double)3.14159265358979323846264338328;
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
in C and C++?
Note: the same would go for function calls:
void my_func(long double value);
my_func(3.14159265358979323846264338328L);
// vs
my_func((long double)3.14159265358979323846264338328);
// etc.
Related:
What's the C++ suffix for long double literals?
https://en.cppreference.com/w/cpp/language/floating_literal
The default is double. Assuming IEEE754 floating point, double is a strict superset of float, and thus you will never lose precision by not specifying f. EDIT: this is only true when specifying values that can be represented exactly by float. If rounding occurs, this might not be strictly true due to double rounding; see Eric Postpischil's answer. So you should also use the f suffix for floats.
This example is also problematic:
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
This first gives a double constant which is then converted to long double. But because you started with a double you have already lost precision that will never come back. Therefore, if you want to use full precision in long double constants you must use the L suffix:
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
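On platforms where long double is wider than double (e.g. the 80-bit x87 format), the loss is easy to observe; where long double and double share a representation, the two lines below print the same value:
#include <stdio.h>

int main(void) {
    long double via_cast   = (long double)3.14159265358979323846264338328;
    long double via_suffix = 3.14159265358979323846264338328L;
    printf("equal: %d\n", via_cast == via_suffix); // 0 where long double is wider
    printf("cast:   %.21Lg\n", via_cast);
    printf("suffix: %.21Lg\n", via_suffix);
    return 0;
}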
There is a difference between using a suffix and a cast; 8388608.5000000009f and (float) 8388608.5000000009 have different values in common C implementations. This code:
#include <stdio.h>

int main(void)
{
    float x = 8388608.5000000009f;
    float y = (float) 8388608.5000000009;
    printf("%.9g - %.9g = %.9g.\n", x, y, x - y);
}
prints “8388609 - 8388608 = 1.” in Apple Clang 11.0 and other implementations that use correct rounding with IEEE-754 binary32 for float and binary64 for double. (The C standard permits implementations to use methods other than IEEE-754 correct rounding, so other C implementations may have different results.)
The reason is that (float) 8388608.5000000009 contains two rounding operations. With the suffix, 8388608.5000000009f is converted directly to float, so the portion that must be discarded in order to fit in a float, .5000000009, is directly examined in order to see whether it is greater than .5 or not. It is, so the result is rounded up to the next representable value, 8388609.
Without the suffix, 8388608.5000000009 is first converted to double. When the portion that must be discarded, .0000000009, is considered, it is found to be less than ½ the low bit at the point of truncation. (The value of the low bit there is .00000000186264514923095703125, and half of it is .000000000931322574615478515625.) So the result is rounded down, and we have 8388608.5 as a double. When the cast rounds this to float, the portion that must be discarded is .5, which is exactly halfway between the representable numbers 8388608 and 8388609. The rule for breaking ties rounds it to the value with the even low bit, 8388608.
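Printing the intermediate double makes the two roundings visible (output assumes binary32 float, binary64 double, and round-to-nearest):
#include <stdio.h>

int main(void) {
    double intermediate = 8388608.5000000009;  // 1st rounding: down to 8388608.5
    float via_cast = (float)intermediate;      // 2nd rounding: tie-to-even -> 8388608
    float via_suffix = 8388608.5000000009f;    // single rounding: up to 8388609
    printf("intermediate: %.17g\n", intermediate); // 8388608.5
    printf("via cast:     %.9g\n", via_cast);      // 8388608
    printf("via suffix:   %.9g\n", via_suffix);    // 8388609
    return 0;
}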
(Another example is “7.038531e-26”; (float) 7.038531e-26 is not equal to 7.038531e-26f. This is the only such numeral with fewer than eight significant digits when float is binary32 and double is binary64, except of course “-7.038531e-26”.)
While you do not lose precision if you omit the f in a float constant, there can be surprises in so doing.
Consider this:
#include <stdio.h>

#define DCN 0.1
#define FCN 0.1f

int main(void)
{
    float f = DCN;
    printf("DCN\t%s\n", f > DCN ? "more" : "not-more");
    float g = FCN;
    printf("FCN\t%s\n", g > FCN ? "more" : "not-more");
    return 0;
}
This (compiled with gcc 9.1.1) produces the output
DCN more
FCN not-more
The explanation is that in f > DCN the compiler takes DCN to have type double and so promotes f to double, and
(double)(float)0.1 > 0.1
holds because converting 0.1 to float and back to double does not recover the original double value.
Personally, on the (rare) occasions when I need float constants, I always use an 'f' suffix.

Are doubles able to represent every int64_t value? [duplicate]

This question already has answers here:
Representing integers in doubles
(4 answers)
Closed 5 years ago.
My question is whether all integer values are guaranteed to have a perfect double representation.
Consider the following code sample that prints "Same":
// Example program
#include <iostream>
#include <string>

int main()
{
    int a = 3;
    int b = 4;
    double d_a(a);
    double d_b(b);
    double int_sum = a + b;
    double d_sum = d_a + d_b;
    if (double(int_sum) == d_sum)
    {
        std::cout << "Same" << std::endl;
    }
}
Is this guaranteed to be true for any architecture, any compiler, any values of a and b? Will any integer i converted to double, always be represented as i.0000000000000 and not, for example, as i.000000000001?
I tried it for some other numbers and it always was true, but was unable to find anything about whether this is coincidence or by design.
Note: This is different from this question (aside from the language) since I am adding the two integers.
Disclaimer (as suggested by Toby Speight): Although IEEE 754 representations are quite common, an implementation is permitted to use any other representation that satisfies the requirements of the language.
The doubles are represented in the form mantissa * 2^exponent, i.e. some of the bits are used for the non-integer part of the double number.
              bits   range                  precision
float          32    1.5E-45 .. 3.4E38      7-8 digits
double         64    5.0E-324 .. 1.7E308    15-16 digits
long double    80    1.9E-4951 .. 1.1E4932  19-20 digits
The fraction part can also be used to represent an integer by using an exponent which removes all the digits after the dot.
E.g. 2.9979 · 10^4 = 29979.
Since a common int is usually 32 bit you can represent all ints as double, but for 64 bit integers of course this is no longer true. To be more precise (as LThode noted in a comment): IEEE 754 double-precision can guarantee this for up to 53 bits (52 bits of significand + the implicit leading 1 bit).
Answer: yes for 32 bit ints, no for 64 bit ints.
(This is correct for server/desktop general-purpose CPU environments, but other architectures may behave differently.)
Practical Answer as Malcom McLean puts it: 64 bit doubles are an adequate integer type for almost all integers that are likely to count things in real life.
For the empirically inclined, try this:
#include <iostream>
#include <limits>
using namespace std;

int main() {
    double test;
    volatile int test_int;
    for (int i = 0; i < std::numeric_limits<int>::max(); i++) {
        test = i;
        test_int = test;
        // compare int with int:
        if (test_int != i)
            std::cout << "found integer i=" << i << ", test=" << test << std::endl;
    }
    return 0;
}
(It runs to completion without finding any mismatch.)
Subquestion:
Regarding the question about fractional differences: is it possible to have an integer which converts to a double that is just off the correct value by a fraction, but which converts back to the same integer due to rounding?
The answer is no, because any integer which converts back and forth to the same value actually represents the same integer value in double. For me the simplest explanation (suggested by ilkkachu) is that with the 2^exponent scaling, the step width between consecutive doubles must always be a power of two. Therefore, beyond the largest 52(+1 implicit)-bit integer there are never two double values with a distance smaller than 2, which rules out the rounding issue.
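The power-of-two step width is easy to check with nextafter (written in C for brevity; assumes IEEE754 binary64):
#include <math.h>
#include <stdio.h>

int main(void) {
    double at53 = 0x1p53;  // spacing between doubles is 2 in [2^53, 2^54)
    double at54 = 0x1p54;  // and 4 in [2^54, 2^55)
    printf("gap just past 2^53: %.0f\n", nextafter(at53, INFINITY) - at53); // 2
    printf("gap just past 2^54: %.0f\n", nextafter(at54, INFINITY) - at54); // 4
    // So beyond 2^53 no double can sit a fraction away from an integer.
    return 0;
}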
No. Suppose you have a 64-bit integer type and a 64-bit floating-point type (which is typical for a double). There are 2^64 possible values for that integer type and there are 2^64 possible values for that floating-point type. But some of those floating-point values (in fact, most of them) do not represent integer values, so the floating-point type can represent fewer integer values than the integer type can.
The answer is no. This only works if ints are 32 bit, which, while true on most platforms, isn't guaranteed by the standard.
The two integers can share the same double representation.
For example, this
#include <cstdint>  // for int64_t
#include <iostream>

int main() {
    int64_t n = 2397083434877565865;
    if (static_cast<double>(n) == static_cast<double>(n - 1)) {
        std::cout << "n and (n-1) share the same double representation\n";
    }
}
will print
n and (n-1) share the same double representation
I.e. both 2397083434877565865 and 2397083434877565864 will convert to the same double.
Note that I used int64_t here to guarantee 64-bit integers, which - depending on your platform - might also be what int is.
You have 2 different questions:
Are all integer values perfectly represented as doubles?
That was already answered by other people (TL;DR: it depends on the precision of int and double).
Consider the following code sample that prints "Same": [...] Is this guaranteed to be true for any architecture, any compiler, any values of a and b?
Your code adds two ints and then converts the result to double. The sum of ints will overflow for certain values, but the sum of the two separately-converted doubles will not (typically). For those values the results will differ.
The short answer is "possibly". The portable answer is "not everywhere".
It really depends on your platform, and in particular, on
the size and representation of double
the range of int
For platforms using IEEE-754 doubles, it may be true if int is 53-bit or smaller. For platforms where int is larger than double, it's obviously false.
You may be able to investigate the properties on your runtime host, using std::numeric_limits and std::nextafter.
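In C the analogous queries live in <float.h> (std::numeric_limits<double>::digits corresponds to DBL_MANT_DIG):
#include <float.h>
#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("FLT_RADIX:    %d\n", FLT_RADIX);     // base b of the FP format
    printf("DBL_MANT_DIG: %d\n", DBL_MANT_DIG);  // significand digits: 53 for binary64
    printf("int bits:     %d\n", (int)(sizeof(int) * CHAR_BIT));
    // Every value of int is exactly representable in double iff the value
    // bits of int (width minus the sign bit) do not exceed DBL_MANT_DIG.
    return 0;
}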

Comparing floating point values converted from strings with literals

This is not a duplicate of the famous Is floating point math broken, even if it looks like one at first sight.
I'm reading a double from a text file using fscanf(file, "%lf", &value); and comparing it with the == operator against a double literal. If the string is the same as the literal, will the comparision using == be true in all cases?
Example
Text file content:
7.7
Code snippet:
double value;
fscanf(file, "%lf", &value); // reading "7.7" from file into value
if (value == 7.7)
    printf("strictly equal\n");
The expected and actual output is
strictly equal
But this supposes that the compiler converts the double literal 7.7 into a double exactly the same way as the fscanf function does - yet the compiler may or may not use the same library for converting strings to double.
Or, asked otherwise: does the conversion from string to double result in a unique binary representation, or may there be slight implementation-dependent differences?
From the C++ standard:
[lex.fcon]
... If the scaled value is in the range
of representable values for its type, the result is the scaled value if representable, else the larger or smaller
representable value nearest the scaled value, chosen in an implementation-defined manner...
emphasis mine.
So you can only rely on equality if the value is strictly representable by a double.
About C++, from cppreference one can read:
[lex.fcon]
The result of evaluating a floating constant is either the nearest representable value or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner (in other words, default rounding direction during translation is implementation-defined).
Since the representation of a floating literal is unspecified, I guess you cannot conclude about its comparison with a scanf result.
About C11 (standard ISO/IEC 9899:2011):
§6.4.4.2
Recommended practice
7 The translation-time conversion of floating constants should match the execution-time conversion of character strings by library functions, such as strtod, given matching inputs suitable for both conversions, the same result format, and default execution-time
rounding.
So clearly for C11, this is not guaranteed to match.
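You can at least probe whether your own toolchain follows the recommended practice - on most mainstream implementations the two conversions agree, but as quoted above that is only a recommendation:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Run-time conversion by the library vs. translation-time conversion
    // of the same decimal string by the compiler.
    double from_string  = strtod("7.7", NULL);
    double from_literal = 7.7;
    printf("%s\n", from_string == from_literal ? "match" : "mismatch");
    return 0;
}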
If the string is the same as the literal, will the comparison using == be true in all cases?
A common consideration not yet explored: FLT_EVAL_METHOD
#include <float.h>
...
printf("%d\n", FLT_EVAL_METHOD);
2 evaluate all operations and constants to the range and precision of the
long double type.
If this returns 2, then the math used in value == 7.7 is long double and 7.7 is treated as 7.7L. In OP's case, this may evaluate to false.
To account for this wider precision, assign the values first - assignment removes any extra range and precision.
fscanf(file, "%lf", &value);
double seven_seven = 7.7;
if (value == seven_seven)
    printf("strictly equal\n");
IMO, this is a more likely occurring problem than variant rounding modes or variations in library/compiler conversions.
Note that this case is akin to the below, a well known issue.
float value;
fscanf(file, "%f", &value);
if (value == 7.7)
    printf("strictly equal\n");
Demonstration
#include <stdio.h>
#include <float.h>

int main() {
    printf("%d\n", FLT_EVAL_METHOD);
    double value;
    sscanf("7.7", "%lf", &value);
    double seven_seven = 7.7;
    if (value == seven_seven) {
        printf("value == seven_seven\n");
    } else {
        printf("value != seven_seven\n");
    }
    if (value == 7.7) {
        printf("value == 7.7\n");
    } else {
        printf("value != 7.7\n");
    }
    return 0;
}
Output
2
value == seven_seven
value != 7.7
Alternative Compare
To compare 2 double that are "near" each other, we need a definition of "near". A useful approach is to consider all the finite double values sorted into an ascending sequence and then compare their sequence numbers. double_distance(x, nextafter(x, 2*x)) --> 1
The following code makes various assumptions about double layout and size.
#include <assert.h>

unsigned long long double_order(double x) {
    union {
        double d;
        unsigned long long ull;
    } u;
    assert(sizeof(double) == sizeof(unsigned long long));
    u.d = x;
    if (u.ull & 0x8000000000000000) {
        u.ull ^= 0x8000000000000000;
        return 0x8000000000000000 - u.ull;
    }
    return u.ull + 0x8000000000000000;
}

unsigned long long double_distance(double x, double y) {
    unsigned long long ullx = double_order(x);
    unsigned long long ully = double_order(y);
    if (x > y) return ullx - ully;
    return ully - ullx;
}

....
printf("%llu\n", double_distance(value, 7.7));                       // 0
printf("%llu\n", double_distance(value, nextafter(value, value*2))); // 1
printf("%llu\n", double_distance(value, nextafter(value, value/2))); // 1
Or just use
if (nextafter(7.7, -INFINITY) <= value && value <= nextafter(7.7, +INFINITY)) {
    puts("Close enough");
}
There's no guarantee.
You can hope that the compiler uses a high quality algorithm for the conversion of literals, and that the standard library implementation uses a high quality conversion as well, and two high quality algorithms should agree quite often.
It's also possible that both use the exact same algorithm (for example, the compiler converts the literal by putting the characters into a char array and calling sscanf).
BTW. I had one bug caused by the fact that a compiler didn't convert the literal 999999999.5 exactly. Replaced it with 9999999995 / 10.0 and everything was fine.
