This is not a duplicate of the famous Is floating point math broken, even if it looks like one at first sight.
I'm reading a double from a text file using fscanf(file, "%lf", &value); and comparing it with the == operator against a double literal. If the string is the same as the literal, will the comparison using == be true in all cases?
Example
Text file content:
7.7
Code snippet:
double value;
fscanf(file, "%lf", &value); // reading "7.7" from file into value
if (value == 7.7)
printf("strictly equal\n");
The expected and actual output is
strictly equal
But this supposes that the compiler converts the double literal 7.7 into a double exactly the same way as the fscanf function does, yet the compiler may or may not use the same library for converting strings to doubles.
Or asked otherwise: does the conversion from string to double result in a unique binary representation or may there be slight implementation dependent differences?
Live demonstration
From the C++ standard:
[lex.fcon]
... If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner...
emphasis mine.
So you can only rely on equality if the value is strictly representable by a double.
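As a minimal sketch (assuming the text contains a value such as 7.75 that is exactly representable in binary), both the compiler and the library are required to produce the exact value, so the comparison must hold:
#include <stdio.h>
int main(void) {
    double value;
    // 7.75 is exactly representable (31/4), so both conversions must yield it exactly
    sscanf("7.75", "%lf", &value);
    if (value == 7.75)
        printf("strictly equal\n"); // guaranteed, unlike the 7.7 case
    return 0;
}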
About C++, from cppreference one can read:
[lex.fcon] (§6.4.4.2)
The result of evaluating a floating constant is either the nearest representable value or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner (in other words, default rounding direction during translation is implementation-defined).
Since the representation of a floating literal is unspecified, I guess you cannot draw any conclusion about how it compares with a scanf result.
About C11 (standard ISO/IEC 9899:2011):
[lex.fcon] (§6.4.4.2)
Recommended practice
7 The translation-time conversion of floating constants should match the execution-time conversion of character strings by library functions, such as strtod, given matching inputs suitable for both conversions, the same result format, and default execution-time rounding.
Since this is only a recommended practice, C11 clearly does not guarantee that the two conversions match.
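A quick way to probe a given implementation (only a sketch, not a guarantee, since the outcome is implementation-dependent) is to compare the translation-time conversion of the literal with the run-time conversion done by strtod:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    // Compare the compiler's conversion of the literal with the library's strtod
    double from_library = strtod("7.7", NULL);
    printf("%s\n", from_library == 7.7 ? "match" : "mismatch"); // usually "match", but C11 does not require it
    return 0;
}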
If the string is the same as the literal, will the comparison using == be true in all cases?
A common consideration not yet explored: FLT_EVAL_METHOD
#include <float.h>
...
printf("%d\n", FLT_EVAL_METHOD);
2 — evaluate all operations and constants to the range and precision of the long double type.
If this returns 2, then the math used in value == 7.7 is long double and 7.7 treated as 7.7L. In OP's case, this may evaluate to false.
To account for this wider precision, assign the values to double objects first, which removes any extra range and precision.
fscanf(file, "%lf", &value);
double seven_seven = 7.7;
if (value == seven_seven)
printf("strictly equal\n");
IMO, this is a more likely occurring problem than variant rounding modes or variations in library/compiler conversions.
Note that this case is akin to the well-known issue below.
float value;
fscanf(file, "%f", &value);
if (value == 7.7)
printf("strictly equal\n");
Demonstration
#include <stdio.h>
#include <float.h>
int main() {
printf("%d\n", FLT_EVAL_METHOD);
double value;
sscanf("7.7", "%lf", &value);
double seven_seven = 7.7;
if (value == seven_seven) {
printf("value == seven_seven\n");
} else {
printf("value != seven_seven\n");
}
if (value == 7.7) {
printf("value == 7.7\n");
} else {
printf("value != 7.7\n");
}
return 0;
}
Output
2
value == seven_seven
value != 7.7
Alternative Compare
To compare 2 double values that are "near" each other, we need a definition of "near". A useful approach is to consider all the finite double values sorted into an ascending sequence and then compare their sequence numbers. double_distance(x, nextafter(x, 2*x)) --> 1
The following code makes various assumptions about double layout and size.
#include <assert.h>
unsigned long long double_order(double x) {
union {
double d;
unsigned long long ull;
} u;
assert(sizeof(double) == sizeof(unsigned long long));
u.d = x;
if (u.ull & 0x8000000000000000) {
u.ull ^= 0x8000000000000000;
return 0x8000000000000000 - u.ull;
}
return u.ull + 0x8000000000000000;
}
unsigned long long double_distance(double x, double y) {
unsigned long long ullx = double_order(x);
unsigned long long ully = double_order(y);
if (x > y) return ullx - ully;
return ully - ullx;
}
....
printf("%llu\n", double_distance(value, 7.7)); // 0
printf("%llu\n", double_distance(value, nextafter(value,value*2))); // 1
printf("%llu\n", double_distance(value, nextafter(value,value/2))); // 1
Or just use
if (nextafter(7.7, -INFINITY) <= value && value <= nextafter(7.7, +INFINITY)) {
puts("Close enough");
}
There's no guarantee.
You can hope that the compiler uses a high quality algorithm for the conversion of literals, and that the standard library implementation uses a high quality conversion as well, and two high quality algorithms should agree quite often.
It's also possible that both use the exact same algorithm (for example, the compiler converts the literal by putting the characters into a char array and calling sscanf).
BTW, I once had a bug caused by a compiler that didn't convert the literal 999999999.5 exactly. Replacing it with 9999999995 / 10.0 fixed it.
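If you suspect such a compiler issue, a hedged sanity check along the same lines is to compare the literal against an equivalent expression whose operands and exact result are all representable (so a correctly rounded division must reproduce the value):
#include <stdio.h>
int main(void) {
    // 9999999995, 10.0 and the exact quotient 999999999.5 are all representable,
    // so a correctly rounded division (e.g. IEEE 754) yields exactly 999999999.5
    if (999999999.5 == 9999999995 / 10.0)
        printf("literal converted exactly\n");
    else
        printf("literal conversion is off\n"); // the buggy-compiler case described above
    return 0;
}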
Related
When int64_t is cast to double and doesn't have an exact match, to my knowledge I get a sort of best-effort-nearest-value equivalent in double. For example, 9223372036854775000 in int64_t appears to end up as 9223372036854774784.0 in double:
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
return 0;
}
It appears to me as if an int64_t cast to a double always ends up as a clean non-fractional number, even in this higher number range where double has really low precision. However, I just observed this from random attempts. Is this guaranteed to happen for any value of int64_t cast to a double?
And if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off? (Assuming it doesn't overflow during the conversion back.) Like here:
#include <inttypes.h>
#include <stdio.h>
int main(int argc, const char **argv) {
printf("Corresponding double: %f\n", (double)9223372036854775000LL);
// Outputs: 9223372036854774784.000000
printf("Corresponding int to corresponding double: %" PRId64 "\n",
(int64_t)((double)9223372036854775000LL));
// Outputs: 9223372036854774784
return 0;
}
Or can it be imprecise and get me the "wrong" int in some corner cases?
Intuitively and from my tests the answer to both points appears to be "yes", but if somebody with a good formal understanding of the floating point standards and the maths behind it could confirm this that would be really helpful to me. I would also be curious if any known more aggressive optimizations like gcc's -Ofast are known to break any of this.
In the general case, yes, both should be true. The floating-point base needs to be, if not 2, then at least an integer, and given that, an integer converted to the nearest floating-point value can never produce a non-zero fraction: either the precision suffices, or the lowest-order digits of the integer in the base of the floating type are zeroed. For example, your system uses ISO/IEC/IEEE 60559 binary floating-point numbers. When inspected in base 2, it can be seen that the trailing digits of the value are indeed zeroed:
>>> bin(9223372036854775000)
'0b111111111111111111111111111111111111111111111111111110011011000'
>>> bin(9223372036854774784)
'0b111111111111111111111111111111111111111111111111111110000000000'
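The same check can be done from C by printing the converted value in hexadecimal floating-point form with %a (assuming IEEE-754 binary64; the exact formatting of the output may vary between implementations):
#include <stdio.h>
int main(void) {
    // %a shows the significand bits directly; the low-order bits of the
    // original integer are gone after the conversion to double
    printf("%a\n", (double)9223372036854775000LL); // e.g. 0x1.fffffffffffffp+62
    return 0;
}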
The conversion of a double without fractions to an integer type, given that the value of the double falls within the range of the integer type, should be exact...
Though you still might encounter a quality-of-implementation issue, or an outright bug - for example, MSVC currently has a compiler bug where a round-trip conversion of an unsigned 32-bit value with the MSB set (or just a double value between 2³¹ and 2³²-1 converted to unsigned int) would "overflow" in the conversion and always result in exactly 2³¹.
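A minimal round-trip sketch for that case (a conforming implementation should print the original value back, since every 32-bit unsigned value is exactly representable in a binary64 double; the buggy behavior described above would yield 2147483648 instead):
#include <stdio.h>
int main(void) {
    unsigned int u = 0x80000001u;        // MSB set
    double d = (double)u;                // exact: binary64 represents all 32-bit values
    unsigned int back = (unsigned int)d; // should round-trip exactly
    printf("%u -> %.1f -> %u\n", u, d, back);
    return 0;
}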
The following assumes the value being converted is positive. The behavior of negative numbers is analogous.
C 2018 6.3.1.4 2 specifies conversions from integer to real and says:
… If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.
This tells us that some integer value x being converted to floating-point can produce a non-integer only if one of the two representable values bounding x is not an integer and x is not representable.
5.2.4.2.2 specifies the model used for floating-point numbers. Each finite floating-point number is represented by a sequence of digits in a certain base b scaled by bᵉ for some exponent e. (b is an integer greater than 1.) Then, if one of the two values bounding x, say p, is not an integer, the scaling must be such that the lowest digit in that floating-point number represents a fraction. But if this is the case, then setting all of the digits in p that represent fractions to 0 must produce a new floating-point number that is an integer. If x < p, this integer must be x, and therefore x is representable in the floating-point format. On the other hand, if p < x, we can add enough to each digit that represents a fraction to make it 0 (and produce a carry to the next higher digit). This will also produce an integer representable in the floating-point type¹, and it must be x.
Therefore, if conversion of an integer x to the floating-point type would produce a non-integer, x must be representable in the type. But then conversion to the floating-point type must produce x. So it is never possible to produce a non-integer.
Footnote
1 It is possible this will carry out of all the digits, as when applying it to a three-digit decimal number 9.99, which produces 10.00. In this case, the value produced is the next power of b, if it is in range of the floating-point format. If it is not, the C standard does not define the behavior. Also note the C standard sets minimum requirements on the range that floating-point formats must support which preclude any format from not being able to represent 1, which avoids a degenerate case in which a conversion could produce a number like .999 because it was the largest representable finite value.
When a 64bit int is cast to 64bit float ... and doesn't have an exact match, will it always land on a non-fractional number?
Is this guaranteed to happen for any value of int64_t cast to a double?
For a common double: yes, it always lands on a non-fractional number.
When there is no exact match, the result is the closest representable floating-point value above or below, depending on the rounding mode. Given the characteristics of a common double, these 2 bounding values are also whole numbers: at magnitudes where an int64_t is no longer exactly representable, every representable double is already a whole number.
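A small sketch illustrating this with nextafter() (assuming binary64 doubles; at this magnitude the spacing between adjacent doubles is 1024, so both neighbors are whole numbers):
#include <math.h>
#include <stdio.h>
int main(void) {
    double d = (double)9223372036854775000LL; // nearest representable double
    printf("%.1f\n", nextafter(d, 0.0));      // neighbor below: a whole number
    printf("%.1f\n", d);                      // 9223372036854774784.0
    printf("%.1f\n", nextafter(d, INFINITY)); // neighbor above: also a whole number
    return 0;
}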
... if I cast this non-fractional double back to int64_t, will I always get the exact corresponding 64bit int with the .0 chopped off?
No. Edge cases near INT64_MAX fail as the converted value could become a FP value above INT64_MAX. Then conversion back to the integer type incurs: "the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." C17dr § 6.3.1.3 3
#include <limits.h>
#include <stdio.h>
int main() {
long long imaxm1 = LLONG_MAX - 1;
double max = (double) imaxm1;
printf("%lld\n%f\n", imaxm1, max);
long long imax = (long long) max;
printf("%lld\n", imax);
}
9223372036854775806
9223372036854775808.000000
9223372036854775807 // Value here is implementation defined.
Deeper exceptions
(Question variation) When an N bit integer type is cast to a floating point and doesn't have an exact match, will it always land on a non-fractional number?
Integer type range exceeds finite float point
Conversion to infinity: With a common float and uint128_t, UINT128_MAX converts to infinity. This is readily possible with extra-wide integer types.
#include <stdio.h>
int main() {
unsigned __int128 imaxm1 = 0xFFFFFFFFFFFFFFFF;
imaxm1 <<= 64;
imaxm1 |= 0xFFFFFFFFFFFFFFFF;
double fmax = (float) imaxm1;
double max = (double) imaxm1;
printf("%llde27\n%f\n%f\n", (long long) (imaxm1/1000000000/1000000000/1000000000),
fmax, max);
}
340282366920e27
inf
340282366920938463463374607431768211456.000000
Floating-point precision deeper than range
On some unicorn implementation with very wide FP precision and a small range, the largest finite value could, in theory (though not in practice), be a non-whole number. Then, with an even wider integer type, the conversion could result in this non-whole value. I do not see this as a legitimate concern of OP's.
Suppose I is some integer type and F some (real) floating point type.
I want to write two functions. The first function shall take a value i of type I and return a boolean indicating whether i converted to F falls into the representable range, i.e. whether (F)i will have defined behavior.
The second function shall take a value f of type F and return a boolean indicating whether f converted to I falls into the representable range, i.e. whether (I)f will have defined behavior.
Is it possible to write such a function that will be, on every implementation conforming to the standard, correct and not exhibit undefined behavior for any input? In particular I do not want to assume that the floating point types are IEEE 754 types.
I am asking about both C and C++ and their respective standard versions separately, in case that changes the answer.
Basically the intention of this question is to figure out whether (sensible) floating-point / integral conversions are possible without relying on IEEE 754 or other standards or hardware details at all. I ask out of curiosity.
Comparing against e.g. INT_MAX or FLT_MAX does not seem to be possible, because it is not clear which type to do the comparison in without already knowing which of the types has wider range.
some float to some int is fairly easy if we can assume FLT_RADIX != 10 (a power-of-2 floating point) and that the range of the FP type exceeds the integer range.
Form exact FP limits
Test if FP has a fraction part that is 0**. (also handles NaN, inf)
Test if too positive.
Test if too negative.
Test if converted to integer value rounds.
Pseudo code
// For now, assume 2's complement.
// With some extra macro magic, could handle all integer encodings.
// Use integer limits whose magnitudes are at or 1 away from a power-of-2
// and form FP power-of-2 limits
// The following will certainly not incur any rounding
#define FLT_INT_MAXP1 ((INT_MAX/2 + 1)*2.0f)
#define FLT_INT_MIN (INT_MIN*1.0f)
status float_to_int_test(float f) {
float ipart;
if (modff(f, &ipart) != 0.0) {
return not_a_whole_number;
}
if (f >= FLT_INT_MAXP1) return too_big;
if (f < FLT_INT_MIN) return too_negative;
if (f != (volatile float) f) return rounding_occurred;
return success;
}
Armed with the above float_to_int test....
status int_to_float_test(int i) {
volatile float f = (float) i;
if (float_to_int_test(f) != success) return fail;
volatile int j = (int) f;
if (i != j) return fail;
return success;
}
Simplifications possible, but something to get OP started.
Extreme cases which need additional code include int128_t or wider having more range than float and FLT_RADIX == 10.
** Hmmm - it appears OP does not care about the fractional part. In that case, conversion from double to int appears to be a good duplicate for half the problem.
There are several posts here about floating point numbers and their nature. It is clear that comparing floats and doubles must always be done cautiously. Asking for equality has also been discussed and the recommendation is clearly to stay away from it.
But what if there is a direct assignement:
double a = 5.4;
double b = a;
assuming a is any non-NaN value - can a == b ever be false?
It seems that the answer is obviously no, yet I can't find any standard defining this behaviour in a C++ environment. IEEE-754 states that two floating point numbers with equal (non-NaN) bitset patterns are equal. Does it now mean that I can continue comparing my doubles this way without having to worry about maintainability? Do I have to worry about other compilers / operating systems and their implementation regarding these lines? Or maybe a compiler that optimizes some bits away and ruins their equality?
I wrote a little program that generates and compares non-NaN random doubles forever - until it finds a case where a == b yields false. Can I compile/run this code anywhere and anytime in the future without having to expect a halt? (ignoring endianness and assuming sign, exponent and mantissa bit sizes / positions stay the same).
#include <cstdint>
#include <cstring>
#include <iostream>
#include <random>
struct double_content {
std::uint64_t mantissa : 52;
std::uint64_t exponent : 11;
std::uint64_t sign : 1;
};
static_assert(sizeof(double) == sizeof(double_content), "must be equal");
void set_double(double& n, std::uint64_t sign, std::uint64_t exponent, std::uint64_t mantissa) {
double_content convert;
memcpy(&convert, &n, sizeof(double));
convert.sign = sign;
convert.exponent = exponent;
convert.mantissa = mantissa;
memcpy(&n, &convert, sizeof(double_content));
}
void print_double(double& n) {
double_content convert;
memcpy(&convert, &n, sizeof(double));
std::cout << "sign: " << convert.sign << ", exponent: " << convert.exponent << ", mantissa: " << convert.mantissa << " --- " << n << '\n';
}
int main() {
std::random_device rd;
std::mt19937_64 engine(rd());
std::uniform_int_distribution<std::uint64_t> mantissa_distribution(0ull, (1ull << 52) - 1);
std::uniform_int_distribution<std::uint64_t> exponent_distribution(0ull, (1ull << 11) - 1);
std::uniform_int_distribution<std::uint64_t> sign_distribution(0ull, 1ull);
double a = 0.0;
double b = 0.0;
bool found = false;
while (!found){
auto sign = sign_distribution(engine);
auto exponent = exponent_distribution(engine);
auto mantissa = mantissa_distribution(engine);
//re-assign exponent for NaN cases
if (mantissa) {
while (exponent == (1ull << 11) - 1) {
exponent = exponent_distribution(engine);
}
}
//force -0.0 to be 0.0
if (mantissa == 0u && exponent == 0u) {
sign = 0u;
}
set_double(a, sign, exponent, mantissa);
b = a;
//here could be more (unmodifying) code to delay the next comparison
if (b != a) { //not equal!
print_double(a);
print_double(b);
found = true;
}
}
}
using Visual Studio Community 2017 Version 15.9.5
The C++ standard clearly specifies in [basic.types]#3:
For any trivially copyable type T, if two pointers to T point to distinct T objects obj1 and obj2, where neither obj1 nor obj2 is a potentially-overlapping subobject, if the underlying bytes ([intro.memory]) making up obj1 are copied into obj2, obj2 shall subsequently hold the same value as obj1.
It gives this example:
T* t1p;
T* t2p;
// provided that t2p points to an initialized object ...
std::memcpy(t1p, t2p, sizeof(T));
// at this point, every subobject of trivially copyable type in *t1p contains
// the same value as the corresponding subobject in *t2p
The remaining question is what a value is. We find in [basic.fundamental]#12 (emphasis mine):
There are three floating-point types: float, double, and long double.
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The value representation of floating-point types is implementation-defined.
Since the C++ standard has no further requirements on how floating point values are represented, this is all you will find as guarantee from the standard, as assignment is only required to preserve values ([expr.ass]#2):
In simple assignment (=), the object referred to by the left operand is modified by replacing its value with the result of the right operand.
As you correctly observed, IEEE-754 requires that non-NaN, non-zero floats compare equal if and only if they have the same bit pattern. So if your compiler uses IEEE-754-compliant floats, you should find that assignment of non-NaN, non-zero floating point numbers preserves bit patterns.
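As a sketch of what that means in practice (assuming IEEE-754 binary64 doubles and a plain assignment with no intervening arithmetic), comparing the object representations after the assignment is expected to show identical bytes:
#include <stdio.h>
#include <string.h>
int main(void) {
    double a = 5.4;
    double b = a; // plain assignment, no arithmetic
    printf("%s\n", memcmp(&a, &b, sizeof a) == 0 ? "same bit pattern"
                                                 : "different bit pattern");
    return 0;
}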
And indeed, your code
double a = 5.4;
double b = a;
should never allow (a == b) to return false. But as soon as you replace 5.4 with a more complicated expression, most of this nicety vanishes. It's not the exact subject of the article, but https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ mentions several possible ways in which innocent looking code can yield different results (which breaks "identical to the bit pattern" assertions). In particular, you might be comparing an 80 bit intermediate result with a 64 bit rounded result, possibly yielding inequality.
There are some complications here. First, note that the title asks a different question than the question. The title asks:
is assigning two doubles guaranteed to yield the same bitset patterns?
while the question asks:
can a == b ever be false?
The first of these asks whether different bits might occur from an assignment (which could be due to either the assignment not recording the same value as its right operand or due to the assignment using a different bit pattern that represents the same value), while the second asks whether, whatever bits are written by an assignment, the stored value must compare equal to the operand.
In full generality, the answer to the first question is no. Using IEEE-754 binary floating-point formats, there is a one-to-one map between non-zero numeric values and their encodings in bit patterns. However, this admits several cases where an assignment could produce a different bit pattern:
The right operand is the IEEE-754 −0 entity, but +0 is stored. This is not a proper IEEE-754 operation, but C++ is not required to conform to IEEE 754. Both −0 and +0 represent mathematical zero and would satisfy C++ requirements for assignment, so a C++ implementation could do this.
IEEE-754 decimal formats have one-to-many maps between numeric values and their encodings. By way of illustration, three hundred could be represented with bits whose direct meaning is 3•10² or bits whose direct meaning is 300•10⁰. Again, since these represent the same mathematical value, it would be permissible under the C++ standard to store one in the left operand of an assignment when the right operand is the other.
IEEE-754 includes many non-numeric entities called NaNs (for Not a Number), and a C++ implementation might store a NaN different from the right operand. This could include either replacing any NaN with a “canonical” NaN for the implementation or, upon assignment of a signaling NaN, indicating the signal in some way and then converting the signaling NaN to a quiet NaN and storing that.
Non-IEEE-754 formats may have similar issues.
Regarding the latter question, can a == b be false after a = b, where both a and b have type double, the answer is no. The C++ standard does require that an assignment replace the value of the left operand with the value of the right operand. So, after a = b, a must have the value of b, and therefore they are equal.
Note that the C++ standard does not impose any restrictions on the accuracy of floating-point operations (although I see this only stated in non-normative notes). So, theoretically, one might interpret assignment or comparison of floating-point values as floating-point operations and say that they do not need to be accurate, so the assignment could change the value or the comparison could return an inaccurate result. I do not believe this is a reasonable interpretation of the standard; the lack of restrictions on floating-point accuracy is intended to allow latitude in expression evaluation and library routines, not simple assignment or comparison.
One should note the above applies specifically to a double object that is assigned from a simple double operand. This should not lull readers into complacency. Several similar but different situations can result in failure of what might seem intuitive mathematically, such as:
After float x = 3.4;, the expression x == 3.4 will generally evaluate as false, since 3.4 is a double and has to be converted to a float for the assignment. That conversion reduces precision and alters the value (a short sketch of this case follows after this list).
After double x = 3.4 + 1.2;, the expression x == 3.4 + 1.2 is permitted by the C++ standard to evaluate to false. This is because the standard permits floating-point expressions to be evaluated with more precision than the nominal type requires. Thus, 3.4 + 1.2 might be evaluated with the precision of long double. When the result is assigned to x, the standard requires that the excess precision be “discarded,” so the value is converted to a double. As with the float example above, this conversion may change the value. Then the comparison x == 3.4 + 1.2 may compare a double value in x to what is essentially a long double value produced by 3.4 + 1.2.
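A short sketch of the first case above (the exact outcome is implementation-dependent, but on IEEE-754 implementations the float and double roundings of 3.4 differ):
#include <stdio.h>
int main(void) {
    float x = 3.4;     // 3.4 is a double constant; it is rounded to float here
    if (x == 3.4)      // x is converted back to double for the comparison
        printf("equal\n");
    else
        printf("not equal\n"); // the usual outcome: (float)3.4 != 3.4
    return 0;
}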
I need to convert normalized integer values to and from real floating-point values. For instance, for int16_t, a value of 1.0 is represented by 32767 and -1.0 is represented by -32768. Although it's a bit tedious to do this for each integer type, both signed and unsigned, it's still easy enough to write by hand.
However, I want to use standard methods whenever possible rather than going off and reinventing the wheel, so what I'm looking for is something like a standard C or C++ header, a Boost library, or some other small, portable, easily-incorporated source that already performs these conversions.
Here's a templated solution using std::numeric_limits:
#include <cstdint>
#include <limits>
template <typename T>
constexpr double normalize (T value) {
return value < 0
? -static_cast<double>(value) / std::numeric_limits<T>::min()
: static_cast<double>(value) / std::numeric_limits<T>::max()
;
}
int main () {
// Test cases evaluated at compile time.
static_assert(normalize(int16_t(32767)) == 1, "");
static_assert(normalize(int16_t(0)) == 0, "");
static_assert(normalize(int16_t(-32768)) == -1, "");
static_assert(normalize(int16_t(-16384)) == -0.5, "");
static_assert(normalize(uint16_t(65535)) == 1, "");
static_assert(normalize(uint16_t(0)) == 0, "");
}
This handles both signed and unsigned integers, and 0 does normalize to 0.
I'd question whether your intent is correct here (or indeed that of most of the answers).
Since you're likely just dealing with something like an integer representation of a "real" value such as that produced by an ADC - I'd argue that in fact a floating point fraction of +32767/32768 (not +1.0) is represented by the integer +32767, as a value of +1.0 can't actually be expressed in this form due to the 2's complement arithmetic used.
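Under that interpretation the scaling is simply a division by 32768, and +1.0 is never produced; a minimal sketch (the helper name q15_to_double is just for illustration):
#include <stdint.h>
#include <stdio.h>
// Interpret a 16-bit two's-complement sample as a fraction of 32768
static double q15_to_double(int16_t s) {
    return s / 32768.0;
}
int main(void) {
    printf("%.15f\n", q15_to_double(32767));  // 0.999969482421875, not 1.0
    printf("%.1f\n", q15_to_double(-32768));  // -1.0 exactly
    return 0;
}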
Although it's a bit tedious to do this for each integer type, both signed and unsigned, it's still easy enough to write by hand.
You certainly don't need to do this for each integer type! Use <limits> instead.
#include <limits>
template<class T> double AsDouble(const T x) {
const double valMin = std::numeric_limits<T>::min();
const double valMax = std::numeric_limits<T>::max();
return 2 * (x - valMin) / (valMax - valMin) - 1; // note: 0 does not become 0.
}
I am writing a protocol that uses RFC 7049 as its binary representation. The standard states that the protocol may use a 32-bit floating point representation of numbers if their numeric value is equivalent to the respective 64-bit numbers. The conversion must not lead to loss of precision.
What 32-bit float numbers can be bigger than 64-bit integer and numerically equivalent with them?
Is comparing float x; uint64_t y; (float)x == (float)y enough for ensuring that the values are equivalent? Will this comparison ever be true?
RFC 7049 §3.6. Numbers
For the purposes of this specification, all number representations
for the same numeric value are equivalent. This means that an
encoder can encode a floating-point value of 0.0 as the integer 0.
It, however, also means that an application that expects to find
integer values only might find floating-point values if the encoder
decides these are desirable, such as when the floating-point value is
more compact than a 64-bit integer.
There certainly are numbers for which this is true:
2^33 can be perfectly represented as a floating point number, but clearly cannot be represented as a 32-bit integer. The following code should work as expected:
bool representable_as_float(int64_t value) {
float repr = value;
return repr >= -0x1.0p63 && repr < 0x1.0p63 && (int64_t)repr == value;
}
It is important to notice, though, that we are basically doing (int64_t)(float)value and not the other way around - we are interested in whether the cast to float loses any precision.
The check to see whether repr is smaller than the maximum value of int64_t is important because otherwise we could invoke undefined behavior: the cast to float may round up to the next higher number, which could then be larger than the maximum value possible in int64_t. (Thanks to @tmyklebu for pointing this out.)
Two samples:
// powers of 2 can easily be represented
assert(representable_as_float(((int64_t)1) << 33));
// Other numbers not so much:
assert(!representable_as_float(std::numeric_limits<int64_t>::max()));
The following is based on Julia's method for comparing floats and integers. This does not require access to 80-bit long doubles or floating point exceptions, and should work under any rounding mode. I believe this should work for any C float type (IEEE754 or not), and not cause any undefined behaviour.
UPDATE: technically this assumes a binary float format, and that the float exponent size is large enough to represent 2⁶⁴: this is certainly true for the standard IEEE754 binary32 (which you refer to in your question), but not, say, binary16.
#include <stdio.h>
#include <stdint.h>
int cmp_flt_uint64(float x,uint64_t y) {
return (x == (float)y) && (x != 0x1p64f) && ((uint64_t)x == y);
}
int main() {
float x = 0x1p64f;
uint64_t y = 0xffffffffffffffff;
if (cmp_flt_uint64(x,y))
printf("true\n");
else
printf("false\n");
}
The logic here is as follows:
The first equality can be true only if x is a non-negative integer in the interval [0, 2⁶⁴].
The second checks that x (and hence (float)y) is not 2⁶⁴: if this is the case, then y cannot be represented exactly by a float, and so the comparison is false.
Any remaining values of x can be exactly converted to a uint64_t, and so we cast and compare.
No, you need to compare (long double)x == (long double)y on an architecture where the mantissa of a long double can hold 63 bits. This is because some big long long ints will lose precision when you convert them to float, and compare as equal to a non-equivalent float, but if you convert to long double, it will not lose precision on that architecture.
The following program demonstrates this behavior when compiled with gcc -std=c99 -mssse3 -mfpmath=sse on x86, because these settings use wide-enough long doubles but prevent the implicit use of higher-precision types in calculations:
#include <assert.h>
#include <stdint.h>
const int64_t x = (1ULL<<62) - 1ULL;
const float y = (float)(1ULL<<62);
// The mantissa is not wide enough to store
// 63 bits of precision.
int main(void)
{
assert ((float)x == (float)y);
assert ((long double)x != (long double)y);
return 0;
}
Edit: If you don’t have wide enough long doubles, the following might work (it requires <fenv.h>):
#include <fenv.h>
feclearexcept(FE_ALL_EXCEPT);
(void)(x == y);  // the implicit conversion of x to float may raise FE_INEXACT
if (fetestexcept(FE_INEXACT)) { /* precision was lost */ }
I think, although I could be mistaken, that an implementation could round off x during the conversion in a way that loses precision.
Another strategy that could work is to compare
extern uint64_t x;
extern float y;
const float z = (float)x;
y == z && (uint64_t)z == x;
This should catch losses of precision due to round-off error, but it could conceivably cause undefined behavior if the conversion to z rounds up. It will work if the conversion is set to round toward zero when converting x to z.
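If the implementation supports changing the rounding direction (C99 <fenv.h>), forcing round-toward-zero around the conversion is one way to realize that last condition. This is only a sketch under those assumptions; FE_TOWARDZERO and the FENV_ACCESS pragma are not guaranteed to be supported everywhere:
#include <fenv.h>
#include <stdint.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON
int main(void) {
    uint64_t x = 0xFFFFFFFFFFFFFFFFu;
    int old = fegetround();
    if (fesetround(FE_TOWARDZERO) != 0) {  // bail out if the mode is unsupported
        printf("round-toward-zero not supported\n");
        return 1;
    }
    volatile float z = (float)x;           // rounds down, so (uint64_t)z cannot overflow
    int ok = (z == (float)x) && ((uint64_t)z == x);
    fesetround(old);                       // restore the previous rounding mode
    printf("%s\n", ok ? "equivalent" : "not equivalent"); // not equivalent for UINT64_MAX
    return 0;
}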