Suppose I is some integer type and F some (real) floating point type.
I want to write two functions. The first function shall take a value i of type I and return a boolean indicating whether i converted to F falls into the representable range, i.e. whether (F)i will have defined behavior.
The second function shall take a value f of type F and return a boolean indicating whether f converted to I falls into the representable range, i.e. whether (I)f will have defined behavior.
Is it possible to write such a function that will be, on every implementation conforming to the standard, correct and not exhibit undefined behavior for any input? In particular I do not want to assume that the floating point types are IEEE 754 types.
I am asking about both C and C++ and their respective standard versions separately, in case that changes the answer.
Basically the intention of this question is to figure out whether (sensible) floating-point / integral conversions are possible without relying on IEEE 754 or other standards or hardware details at all. I ask out of curiosity.
Comparing against e.g. INT_MAX or FLT_MAX does not seem to be possible, because it is not clear which type to do the comparison in without already knowing which of the types has wider range.
Some float to some int is fairly easy if we can assume FLT_RADIX != 10 (i.e. a base 2^N floating point) and that the range of the FP type exceeds the integer range.
1. Form exact FP limits.
2. Test whether the FP value has a nonzero fractional part**. (Also handles NaN, inf.)
3. Test if it is too positive.
4. Test if it is too negative.
5. Test if the value rounds when converted to the integer type.
Pseudo code:
// For now, assume two's complement.
// With some extra macro magic, all integer encodings could be handled.
// Use integer limits whose magnitudes are at or 1 away from a power of 2
// and form FP power-of-2 limits.
// The limits below will certainly not incur any rounding.
#include <limits.h>
#include <math.h>

typedef enum {
  success, not_a_whole_number, too_big, too_negative, rounding_occurred, fail
} status;

#define FLT_INT_MAXP1 ((INT_MAX/2 + 1)*2.0f)
#define FLT_INT_MIN   (INT_MIN*1.0f)

status float_to_int_test(float f) {
  float ipart;
  if (modff(f, &ipart) != 0.0f) {
    return not_a_whole_number;
  }
  if (f >= FLT_INT_MAXP1) return too_big;
  if (f < FLT_INT_MIN) return too_negative;
  if (f != (volatile float) f) return rounding_occurred;
  return success;
}
Armed with the above float_to_int test....
status int_to_float_test(int i) {
  volatile float f = (float) i;
  if (float_to_int_test(f) != success) return fail;
  volatile int j = (int) f;
  if (i != j) return fail;
  return success;
}
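A quick usage sketch (nothing authoritative; it just exercises the two tests with the enum values assumed above):
#include <stdio.h>

int main(void) {
  printf("%d\n", float_to_int_test(3.0f) == success);    // 1: whole number, in range
  printf("%d\n", float_to_int_test(3.5f) == success);    // 0: has a fractional part
  printf("%d\n", int_to_float_test(100) == success);     // 1: small values convert exactly
  printf("%d\n", int_to_float_test(INT_MAX) == success); // 0 with 32-bit int and a 24-bit
                                                          // float significand: rounding occurs
}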
Simplifications are possible, but this should be enough to get OP started.
Extreme cases that need additional code include an int128_t or wider type having more range than float, and FLT_RADIX == 10.
** Hmmm - it appears OP does not care about the fractional part. In that case, conversion from double to int appears to be a good duplicate for half the problem.
Related
There are several posts here about floating point numbers and their nature. It is clear that comparing floats and doubles must always be done cautiously. Asking for equality has also been discussed and the recommendation is clearly to stay away from it.
But what if there is a direct assignment:
double a = 5.4;
double b = a;
assuming a is any non-NaN value - can a == b ever be false?
It seems that the answer is obviously no, yet I can't find any standard defining this behaviour in a C++ environment. IEEE-754 states that two floating point numbers with equal (non-NaN) bit patterns are equal. Does it now mean that I can continue comparing my doubles this way without having to worry about maintainability? Do I have to worry about other compilers / operating systems and their implementation regarding these lines? Or maybe a compiler that optimizes some bits away and ruins their equality?
I wrote a little program that generates and compares non-NaN random doubles forever - until it finds a case where a == b yields false. Can I compile/run this code anywhere and anytime in the future without having to expect a halt? (ignoring endianness and assuming sign, exponent and mantissa bit sizes / positions stay the same).
#include <cstdint>
#include <cstring>
#include <iostream>
#include <random>
struct double_content {
    std::uint64_t mantissa : 52;
    std::uint64_t exponent : 11;
    std::uint64_t sign : 1;
};
static_assert(sizeof(double) == sizeof(double_content), "must be equal");

void set_double(double& n, std::uint64_t sign, std::uint64_t exponent, std::uint64_t mantissa) {
    double_content convert;
    memcpy(&convert, &n, sizeof(double));
    convert.sign = sign;
    convert.exponent = exponent;
    convert.mantissa = mantissa;
    memcpy(&n, &convert, sizeof(double_content));
}

void print_double(double& n) {
    double_content convert;
    memcpy(&convert, &n, sizeof(double));
    std::cout << "sign: " << convert.sign << ", exponent: " << convert.exponent << ", mantissa: " << convert.mantissa << " --- " << n << '\n';
}

int main() {
    std::random_device rd;
    std::mt19937_64 engine(rd());
    std::uniform_int_distribution<std::uint64_t> mantissa_distribution(0ull, (1ull << 52) - 1);
    std::uniform_int_distribution<std::uint64_t> exponent_distribution(0ull, (1ull << 11) - 1);
    std::uniform_int_distribution<std::uint64_t> sign_distribution(0ull, 1ull);

    double a = 0.0;
    double b = 0.0;
    bool found = false;

    while (!found) {
        auto sign = sign_distribution(engine);
        auto exponent = exponent_distribution(engine);
        auto mantissa = mantissa_distribution(engine);

        //re-assign exponent for NaN cases
        if (mantissa) {
            while (exponent == (1ull << 11) - 1) {
                exponent = exponent_distribution(engine);
            }
        }
        //force -0.0 to be 0.0
        if (mantissa == 0u && exponent == 0u) {
            sign = 0u;
        }

        set_double(a, sign, exponent, mantissa);
        b = a;

        //here could be more (unmodifying) code to delay the next comparison
        if (b != a) { //not equal!
            print_double(a);
            print_double(b);
            found = true;
        }
    }
}
using Visual Studio Community 2017 Version 15.9.5
The C++ standard clearly specifies in [basic.types]#3:
For any trivially copyable type T, if two pointers to T point to distinct T objects obj1 and obj2, where neither obj1 nor obj2 is a potentially-overlapping subobject, if the underlying bytes ([intro.memory]) making up obj1 are copied into obj2, obj2 shall subsequently hold the same value as obj1.
It gives this example:
T* t1p;
T* t2p;
// provided that t2p points to an initialized object ...
std::memcpy(t1p, t2p, sizeof(T));
// at this point, every subobject of trivially copyable type in *t1p contains
// the same value as the corresponding subobject in *t2p
The remaining question is what a value is. We find in [basic.fundamental]#12 (emphasis mine):
There are three floating-point types: float, double, and long double.
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The value representation of floating-point types is implementation-defined.
Since the C++ standard has no further requirements on how floating point values are represented, this is all you will find as guarantee from the standard, as assignment is only required to preserve values ([expr.ass]#2):
In simple assignment (=), the object referred to by the left operand is modified by replacing its value with the result of the right operand.
As you correctly observed, IEEE-754 requires that non-NaN, non-zero floats compare equal if and only if they have the same bit pattern. So if your compiler uses IEEE-754-compliant floats, you should find that assignment of non-NaN, non-zero floating point numbers preserves bit patterns.
And indeed, your code
double a = 5.4;
double b = a;
should never allow (a == b) to return false. But as soon as you replace 5.4 with a more complicated expression, most of this nicety vanishes. It's not the exact subject of the article, but https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ mentions several possible ways in which innocent looking code can yield different results (which breaks "identical to the bit pattern" assertions). In particular, you might be comparing an 80 bit intermediate result with a 64 bit rounded result, possibly yielding inequality.
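As a quick sanity check for the simple assignment case, here is a minimal sketch (assuming IEEE-754 doubles; it uses only standard <limits> and <cstring> facilities) that verifies both value equality and bit-pattern equality after a plain assignment:
#include <cassert>
#include <cstring>
#include <limits>

int main() {
    static_assert(std::numeric_limits<double>::is_iec559, "this check assumes IEEE-754 doubles");
    double a = 5.4;
    double b = a;
    assert(a == b);                                    // the value is preserved by assignment
    assert(std::memcmp(&a, &b, sizeof(double)) == 0);  // and, here, so is the bit pattern
}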
There are some complications here. First, note that the title asks a different question than the question. The title asks:
is assigning two doubles guaranteed to yield the same bitset patterns?
while the question asks:
can a == b ever be false?
The first of these asks whether different bits might occur from an assignment (which could be due to either the assignment not recording the same value as its right operand or due to the assignment using a different bit pattern that represents the same value), while the second asks whether, whatever bits are written by an assignment, the stored value must compare equal to the operand.
In full generality, the answer to the first question is no. Using IEEE-754 binary floating-point formats, there is a one-to-one map between non-zero numeric values and their encodings in bit patterns. However, this admits several cases where an assignment could produce a different bit pattern:
The right operand is the IEEE-754 −0 entity, but +0 is stored. This is not a proper IEEE-754 operation, but C++ is not required to conform to IEEE 754. Both −0 and +0 represent mathematical zero and would satisfy C++ requirements for assignment, so a C++ implementation could do this (see the sketch after this list).
IEEE-754 decimal formats have one-to-many maps between numeric values and their encodings. By way of illustration, three hundred could be represented with bits whose direct meaning is 3•10^2 or bits whose direct meaning is 300•10^0. Again, since these represent the same mathematical value, it would be permissible under the C++ standard to store one in the left operand of an assignment when the right operand is the other.
IEEE-754 includes many non-numeric entities called NaNs (for Not a Number), and a C++ implementation might store a NaN different from the right operand. This could include either replacing any NaN with a "canonical" NaN for the implementation or, upon assignment of a signaling NaN, indicating the signal in some way and then converting the signaling NaN to a quiet NaN and storing that.
Non-IEEE-754 formats may have similar issues.
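The zero case can be made concrete (a minimal sketch, assuming IEEE-754 binary doubles): two different bit patterns compare equal, so an assignment that switched between them would still satisfy the value-preservation requirement:
#include <cstring>
#include <iostream>

int main() {
    double pz = 0.0, nz = -0.0;
    std::cout << (pz == nz) << '\n';                               // 1: +0 and -0 compare equal
    std::cout << (std::memcmp(&pz, &nz, sizeof pz) == 0) << '\n';  // 0 on IEEE-754: the bit patterns differ
}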
Regarding the latter question, can a == b be false after a = b, where both a and b have type double, the answer is no. The C++ standard does require that an assignment replace the value of the left operand with the value of the right operand. So, after a = b, a must have the value of b, and therefore they are equal.
Note that the C++ standard does not impose any restrictions on the accuracy of floating-point operations (although I see this stated only in non-normative notes). So, theoretically, one might interpret assignment or comparison of floating-point values as floating-point operations and say that they do not need to be accurate, so the assignment could change the value or the comparison could return an inaccurate result. I do not believe this is a reasonable interpretation of the standard; the lack of restrictions on floating-point accuracy is intended to allow latitude in expression evaluation and library routines, not simple assignment or comparison.
One should note the above applies specifically to a double object that is assigned from a simple double operand. This should not lull readers into complacency. Several similar but different situations can result in failure of what might seem intuitive mathematically, such as:
After float x = 3.4;, the expression x == 3.4 will generally evaluate as false, since 3.4 is a double and has to be converted to a float for the assignment. That conversion reduces precision and alters the value (see the sketch after this list).
After double x = 3.4 + 1.2;, the expression x == 3.4 + 1.2 is permitted by the C++ standard to evaluate to false. This is because the standard permits floating-point expressions to be evaluated with more precision than the nominal type requires. Thus, 3.4 + 1.2 might be evaluated with the precision of long double. When the result is assigned to x, the standard requires that the excess precision be “discarded,” so the value is converted to a double. As with the float example above, this conversion may change the value. Then the comparison x == 3.4 + 1.2 may compare a double value in x to what is essentially a long double value produced by 3.4 + 1.2.
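A minimal sketch of the first case (assuming an ordinary IEEE-754 float/double pair; the exact values are illustrative only):
#include <iostream>

int main() {
    float x = 3.4;                     // 3.4 is a double literal; it is rounded to float here
    std::cout << (x == 3.4) << '\n';   // typically 0: x is widened back to double for the
                                       // comparison, but that value differs from the double 3.4
    std::cout << (x == 3.4f) << '\n';  // 1: both sides are the same float value
}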
In C++, the conversion of an integer value of type I to a floating point type F will be exact, in the sense that static_cast<I>(static_cast<F>(i)) == i, if the range of I is a subset of the range of integral values that F can represent exactly.
Is it possible, and if so how, to calculate the loss of precision of static_cast<F>(i) (without using another floating point type with a wider range)?
As a start, I tried to code a function that would return if a conversion is safe or not (safe, meaning no loss of precision), but I must admit I am not so sure about its correctness.
template <class F, class I>
bool is_cast_safe(I value)
{
    return std::abs(value) < std::numeric_limits<F>::digits;
}
std::cout << is_cast_safe<float>(4) << std::endl; // true
std::cout << is_cast_safe<float>(0x1000001) << std::endl; // false
Thanks in advance.
is_cast_safe can be implemented with:
static const F One = 1;
F ULP = std::scalbn(One, std::ilogb(value) - std::numeric_limits<F>::digits + 1);
I U = std::max(ULP, One);
return value % U == 0;
This sets ULP to the value of the least digit position in the result of converting value to F. ilogb returns the position (as an exponent of the floating-point radix) for the highest digit position, and subtracting one less than the number of digits adjusts to the lowest digit position. Then scalbn gives us the value of that position, which is the ULP.
Then value can be represented exactly in F if and only if it is a multiple of the ULP. To test that, we convert the ULP to I (but substitute 1 if it is less than 1), and then take the remainder of value divided by the ULP (or 1).
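Put into the questioner's template, that fragment might look like the following sketch (it assumes a non-negative value and adds a special case for 0, since ilogb(0) is not usable here):
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

// Sketch: assumes value >= 0; negative values would need their magnitude taken first.
template <class F, class I>
bool is_cast_safe(I value)
{
    if (value == 0) return true;  // zero is always exact
    static const F One = 1;
    // Value of the lowest digit position (the ULP) of value once it lands in F.
    F ULP = std::scalbn(One, std::ilogb(value) - std::numeric_limits<F>::digits + 1);
    I U = static_cast<I>(std::max(ULP, One));  // at least 1, so the % below is well defined
    return value % U == 0;                     // exact iff value is a multiple of its ULP
}

int main()
{
    std::cout << is_cast_safe<float>(4) << '\n';          // 1
    std::cout << is_cast_safe<float>(0x1000001) << '\n';  // 0: needs 25 significant bits
}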
Also, if one is concerned the conversion to F might overflow, code can be inserted to handle this as well.
Calculating the actual amount of the change is trickier. The conversion to floating-point could round up or down, and the rule for choosing is implementation-defined, although round-to-nearest-ties-to-even is common. So the actual change cannot be calculated from the floating-point properties we are given in numeric_limits. It must involve performing the conversion and doing some work in floating-point. This definitely can be done, but it is a nuisance. I think an approach that should work is:
Assume value is non-negative. (Negative values can be handled similarly but are omitted for now for simplicity.)
First, test for overflow in conversion to F. This in itself is tricky, as the behavior is undefined if the value is too large. Some similar considerations were addressed in this answer to a question about safely converting from floating-point to integer (in C).
If the value does not overflow, then convert it. Let the result be x. Divide x by the floating-point radix r, producing y. If y is not an integer (which can be tested using fmod or trunc) the conversion was exact.
Otherwise, convert y to I, producing z. This is safe because y is less than the original value, so it must fit in I.
Then the error due to conversion, taken as the original value minus the converted result, is (value/r - z)*r + value%r, as in the sketch below.
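A sketch of those steps in code (assuming float, a non-negative value, and that the overflow check has already been done; conversion_error is just an illustrative name):
#include <cmath>
#include <iostream>
#include <limits>

// Returns the original value minus the result of converting it to float,
// i.e. the amount lost (or gained) by the int -> float conversion.
// Sketch only: value must be >= 0 and (float)value must not overflow.
long long conversion_error(long long value)
{
    const long long r = std::numeric_limits<float>::radix;
    float x = static_cast<float>(value);  // the possibly rounded result
    float y = x / r;                      // exact: dividing by the radix only shifts the exponent
    if (std::trunc(y) != y) return 0;     // x is not a multiple of r: its ULP is <= 1, so the conversion was exact
    long long z = static_cast<long long>(y);
    return (value / r - z) * r + value % r;  // original minus converted
}

int main()
{
    std::cout << conversion_error(16777216) << '\n';  // 0: 2^24 is exact in float
    std::cout << conversion_error(16777217) << '\n';  // 1: rounds down to 2^24
    std::cout << conversion_error(33554431) << '\n';  // -1: rounds up to 2^25
}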
I loss = abs(static_cast<I>(static_cast<F>(i)) - i) should do the job. The only exception is when i's magnitude is so large that static_cast<F>(i) produces an F outside the range of I.
(I assumed here that an abs function taking and returning I is available.)
The C standard, which C++ relies on for these matters as well, as far as I know, has the following section:
When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.
Is there any way I can check for the last case? It seems to me that this last undefined behaviour is unavoidable. If I have an integral value i and naively check something like
i <= FLT_MAX
I will (apart from other problems related to precision) already trigger it because the comparison first converts i to a float (in this case or to any other floating type in general), so if it is out of range, we get undefined behaviour.
Or is there some guarantee about the relative sizes of integral and floating types that would imply something like "float can always represent all values of int (not necessarily exactly of course)" or at least "long double can always hold everything" so that we could do comparisons in that type? I couldn't find anything like that, though.
This is mainly a theoretical exercise, so I'm not interested in answers along the lines of "on most architectures these conversions always work". Let's try to find a way to detect this kind of overflow without assuming anything beyond the C(++) standard! :)
Detect overflow when converting integral to floating types
FLT_MAX and DBL_MAX are at least 1E+37 per the C spec, so all integers with |value| of 122 bits or fewer will convert to a float without overflow on all compliant platforms. The same goes for double.
To solve this in the general case for integers of 128/256/etc. bits, both FLT_MAX and some_big_integer_MAX need to be reduced, perhaps by taking the log of both. (bit_count() is TBD user code.)
if(bit_count(unsigned_big_integer_MAX) > logbf(FLT_MAX)) problem();
Or if the integer lacks padding
if(sizeof(unsigned_big_integer_MAX)*CHAR_BIT > logbf(FLT_MAX)) problem();
Note: working with an FP function like logbf() may produce an edge case where the comparison disagrees with exact integer math.
Macro magic can use obtuse tests like the following, which take advantage of the fact that BIGINT_MAX is certainly a power of 2 minus 1 and that FLT_MAX divided by a power of 2 is certainly exact (unless FLT_RADIX == 10).
This pre-processor code will complain if conversion from a big integer type to float will be inexact for some big integer.
#define POW2_61 0x2000000000000000u
#if BIGINT_MAX/POW2_61 > POW2_61
// BIGINT is at least a 122 bit integer
#define BIGINT_MAX_PLUS1_div_POW2_61 ((BIGINT_MAX/2 + 1)/(POW2_61/2))
#if BIGINT_MAX_PLUS1_div_POW2_61 > POW2_61
#warning TBD code for an integer wider than 183 bits
#else
_Static_assert(BIGINT_MAX_PLUS1_div_POW2_61 <= FLT_MAX/POW2_61,
"bigint too big for float");
#endif
#endif
[Edit 2]
Is there any way I can check for the last case?
This code will complain if conversion from a big integer type to float will be inexact for a select big integer.
Of course the test needs to occur before the conversion is attempted.
Given various rounding modes or a rare FLT_RADIX == 10, the best that can readily be had is a test that aims a bit low: when it reports true, the conversion will work. Yet a very small range of big integers that report false on the below test would in fact convert OK.
Below is a more refined idea that I need to mull over for a bit, yet I hope it provides a coding idea for the test OP is looking for.
#define POW2_60 0x1000000000000000u
#define POW2_62 0x4000000000000000u
#define MAX_FLT_MIN 1e37
#define MAX_FLT_MIN_LOG2 (122 /* 122.911.. */)
bool intmax_to_float_OK(intmax_t x) {
#if INTMAX_MAX/POW2_60 < POW2_62
    (void) x;
    return true;  // All big integer values work
#elif INTMAX_MAX/POW2_60/POW2_60 < POW2_62
    return x/POW2_60 < (FLT_MAX/POW2_60);
#elif INTMAX_MAX/POW2_60/POW2_60/POW2_60 < POW2_62
    return x/POW2_60/POW2_60 < (FLT_MAX/POW2_60/POW2_60);
#else
#error TBD code
#endif
}
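Hypothetical usage (some_value() is a placeholder), with the test run before the conversion is attempted, as noted above:
intmax_t x = some_value();
if (intmax_to_float_OK(x)) {
    float f = (float) x;  // known not to overflow
} else {
    // handle the out-of-range case without ever performing the conversion
}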
Here's a C++ template function that returns the largest positive integer that fits into both of the given types.
template<typename float_type, typename int_type>
int_type max_convertible()
{
    static const int int_bits = sizeof(int_type) * CHAR_BIT - (std::is_signed<int_type>() ? 1 : 0);
    if ((int)std::ceil(std::log2(std::numeric_limits<float_type>::max())) > int_bits)
        return std::numeric_limits<int_type>::max();
    return (int_type) std::numeric_limits<float_type>::max();
}
If the number you're converting is larger than the return from this function, it can't be converted. Unfortunately I'm having trouble finding a combination of types to test it with; it's very hard to find an integer type that won't fit into the smallest floating point type.
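A hypothetical use of the helper (some_value() and the chosen types are placeholders):
long long value = some_value();
if (value <= max_convertible<float, long long>()) {
    float f = (float) value;  // in range; precision may still be lost
}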
Code
#include <stdio.h>
#include <limits.h>
#include <float.h>

int f(double x, double y, double z) {
    return (x+y)+z == x+(y+z);
}

int ff(long long x, long long y, long long z) {
    return (x+y)+z == x+(y+z);
}

int main()
{
    printf("%d\n", f(DBL_MAX, DBL_MAX, -DBL_MAX));
    printf("%d\n", ff(LLONG_MAX, LLONG_MAX, -LLONG_MAX));
    return 0;
}
Output
0
1
I am unable to understand why both functions work differently. What is happening here?
In the eyes of the C++ and C standards, the integer version definitely invokes, and the floating point version potentially invokes, undefined behavior, because the result of the computation x + y is not representable in the type the arithmetic is performed in.† So both functions may yield anything, or even do anything.
However, many real world platforms offer additional guarantees for floating point operations and implement integers in a certain way that lets us explain the results you get.
Considering f, we note that many popular platforms implement floating point math as described in IEEE 754. Following the rules of that standard, we get for the LHS:
DBL_MAX + DBL_MAX = INF
and
INF - DBL_MAX = INF.
The RHS yields
DBL_MAX - DBL_MAX = 0
and
DBL_MAX + 0 = DBL_MAX
and thus LHS != RHS.
Moving on to ff: many platforms perform signed integer computation in two's complement. Two's complement addition is associative, so the comparison will yield true as long as the optimizer does not change it to something that contradicts two's complement rules.
The latter is entirely possible (for example see this discussion), so you cannot rely on signed integer overflow doing what I explained above. However, it seems that it "was nice" in this case.
†Note that this never applies to unsigned integer arithmetic. In C++, unsigned integers implement arithmetic modulo 2^NumBits where NumBits is the number of bits of the type. In this arithmetic, every integer can be represented by picking a representative of its equivalence class in [0, 2^NumBits - 1]. So this arithmetic can never overflow.
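For instance, the unsigned analogue of ff cannot misbehave; a minimal sketch:
#include <climits>
#include <iostream>

int main() {
    unsigned u = UINT_MAX;
    std::cout << u + 1u << '\n';                            // 0: wraps modulo 2^NumBits, no UB
    std::cout << ((u + 1u) + 1u == u + (1u + 1u)) << '\n';  // 1: modular addition stays associative
}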
For those doubting that the floating point case is potential UB: N4140 5/4 [expr] says
If during the evaluation of an expression, the result is not mathematically defined or not in the range of
representable values for its type, the behavior is undefined.
which is the case. The inf and NaN stuff is allowed, but not required, in C++ and C floating point math. It is only required if std::numeric_limits<T>::is_iec559 is true for the floating point type in question. (Or in C, if the implementation defines __STDC_IEC_559__; otherwise the Annex F stuff need not apply.) If either of those IEC indicators guarantees us IEEE semantics, the behavior is well defined to do what I described above.
Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double?
In other words, will the following assert always be satisfied?
int main()
{
float f = some_random_float();
assert(f == (float)(double)f);
}
Assume that f could acquire any of the special values defined by IEEE, such as NaN and Infinity.
According to IEEE, is there a case where the assert will be satisfied, but the exact bit-level representation is not preserved after the transportation through double?
The code snippet is valid in both C and C++.
You don't even need to assume IEEE. C89 says in 3.1.2.5:
The set of values of the type float is a subset of the set of values
of the type double
And every other C and C++ standard says equivalent things. As far as I know, NaNs and infinities are "values of the type float", albeit values with some special-case rules when used as operands.
The fact that the float -> double -> float conversion restores the original value of the float follows (in general) from the fact that numeric conversions all preserve the value if it's representable in the destination type.
Bit-level representations are a slightly different matter. Imagine that there's a value of float that has two distinct bitwise representations. Then nothing in the C standard prevents the float -> double -> float conversion from switching one to the other. In IEEE that won't happen for "actual values" unless there are padding bits, but I don't know whether IEEE rules out a single NaN having distinct bitwise representations. NaNs don't compare equal to themselves anyway, so there's also no standard way to tell whether two NaNs are "the same NaN" or "different NaNs" other than maybe converting them to strings. The issue may be moot.
One thing to watch out for is non-conforming modes of compilers, in which they keep super-precise values "under the covers", for example intermediate results left in floating-point registers and reused without rounding. I don't think that would cause your example code to fail, but as soon as you're doing floating-point == it's the kind of thing you start worrying about.
From C99:
6.3.1.5 Real floating types
1 When a float is promoted to double or long double, or a double is promoted to long double, its value is unchanged.
2 When a double is demoted to float, a long double is demoted to double or float, or a value being represented in greater precision and range than required by its semantic type (see 6.3.1.8) is explicitly converted to its semantic type, if the value being converted can be represented exactly in the new type, it is unchanged...
I think, this guarantees you that a float->double->float conversion is going to preserve the original float value.
The standard also defines the macros INFINITY and NAN in 7.12 Mathematics <math.h>:
4 The macro INFINITY expands to a constant expression of type float representing positive or unsigned infinity, if available; else to a positive constant of type float that overflows at translation time.
5 The macro NAN is defined if and only if the implementation supports quiet NaNs for the float type. It expands to a constant expression of type float representing a quiet NaN.
So, there's provision for such special values and conversions may just work for them as well (including for the minus infinity and negative zero).
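A small sketch exercising those special values (assuming the implementation provides infinities; NaN is left out because it never compares equal to anything, including itself):
#include <cassert>
#include <cfloat>
#include <cmath>

int main() {
    const float specials[] = { INFINITY, -INFINITY, 0.0f, -0.0f, FLT_MAX, FLT_MIN };
    for (float f : specials) {
        assert(f == (float)(double)f);  // the float -> double -> float round trip preserves the value
    }
}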
The assertion will fail in flush-to-zero and/or denormals-are-zero mode (e.g. code compiled with -mfpmath=sse, -ffast-math, etc., but also by default on plenty of compilers and architectures, such as Intel's C++ compiler) if f is denormalized.
You cannot produce a denormalized float in that mode though, but the scenario is still possible:
a) Denormalized float comes from external source.
b) Some libraries tamper with FPU modes but forget (or intentionally avoid) to restore them after each call into the library, making it possible for the caller to end up with mismatched normalization behavior.
Practical example, which prints the following:
f = 5.87747e-39
f2 = 5.87747e-39
f = 5.87747e-39
f2 = 0
error, f != f2!
The example works for both VC2010 and GCC 4.3, but assumes that VC uses SSE for math by default and GCC uses the FPU for math by default. The example may fail to illustrate the problem otherwise.
#include <limits>
#include <iostream>
#include <cmath>
#ifdef _MSC_VER
#include <xmmintrin.h>
#endif
template <class T> bool normal(T t)
{
    // True when t is zero or has at least normal magnitude, i.e. it is not a denormal.
    return t == 0 || std::fabs(t) >= std::numeric_limits<T>::min();
}

void csr_flush_to_zero()
{
#ifdef _MSC_VER
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
#else
    unsigned csr = __builtin_ia32_stmxcsr();
    csr |= (1 << 15);
    __builtin_ia32_ldmxcsr(csr);
#endif
}

void test_cast(float f)
{
    std::cout << "f = " << f << "\n";
    double d = double(f);
    float f2 = float(d);
    std::cout << "f2 = " << f2 << "\n";
    if (f != f2)
        std::cout << "error, f != f2!\n";
    std::cout << "\n";
}

int main()
{
    float f = std::numeric_limits<float>::min() / 2.0;
    test_cast(f);
    csr_flush_to_zero();
    test_cast(f);
}