I am working on a platform that has terrible stalls when comparing floats to zero. As an optimization I have seen the following code used:
inline bool GreaterThanZero( float value )
{
const int value_as_int = *(int*)&value;
return ( value_as_int > 0 );
}
Looking at the generated assembly the stalls are gone and the function is more performant.
Does this work? I'm confused because all of the optimizations for IEEE tricks use SIGNMASKS and lots of AND/OR operations (https://www.lomont.org/papers/2005/CompareFloat.pdf for example). Does the cast to a signed int help? Testing in a simple harness detects no problems.
Any insight would be good.
The expression *(int*)&value > 0 tests if value is any positive float, from the smallest positive denormal (which has the same representation as 0x00000001) to the largest finite float (with representation 0x7f7fffff) and +inf (which has the same representation as 0x7f800000). The trick detects as positive a number of, but not all, NaN representations (the NaN representations above 0x7f800001). It is fine if you don't care about some values of NaN making the test true.
This all works because of the representation of IEEE 754 formats.
The bit manipulation functions that you saw in the literature for the purpose of emulating IEEE 754 operations were probably aiming for perfect emulation, taking into account the particular behaviors of NaN and signed zeroes. For instance, the variation *(int*)&value >= 0 would not be equivalent to value >= 0.0f, because -0.0f, represented as 0x80000000 as an unsigned int and thus as -0x80000000 as a signed one, makes the latter condition true and the former one false. This can make such functions quite complicated.
Does the cast to a signed int help?
Well, yes, because the sign bits of float and int are in the same place and both indicate a positive number when unset. But the condition value > 0.0f could be implemented by re-interpreting value as an unsigned integer too.
Note: the conversion to int* of the address of value breaks strict aliasing rules, but this may be acceptable if your compiler guarantees that it gives meaning to these programs (perhaps with a command-line option).
Related
I have two floating point values, a and b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where a * b could equal one? I intend to calculate 1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be equal or smaller to a or b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there's a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed by IEEE compliant.
I have two floating point values, a and b…
Since this says we have “values,” not “variables,” it admits a possibility that 1 - a*b may evaluate to 1. When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression a that is sin(x)/x and an expression b that is 1-y*y and then ask about computing 1 - a*b when the code is actually 1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this is that the compiler uses long double arithmetic while computing expressions containing double operands or it uses a fused multiply-add instructions while computing an expression of the format x + y*z.
Suppose expressions a and b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose double were implemented with four decimal digits but a and b were computed with long double with six decimal digits. a and b could both be .999999. Then a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from long double to double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have 1 - 1.000, and the result is zero.
On the other hand, if a and b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any “extra precision.” Then, given 0 < a < 1 and 0 < b < 1, the mathematical value (that, without floating-point rounding) a•b is less than a and is less than b. Then rounding of a•b to the nominal type cannot produce a value greater than a or b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values—it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know a•b is bounded above by both a and b, rounding cannot produce any result greater than the lesser of a and b.)
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in 1 - a*b when a and b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then 1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
So, if a and b are variables, 0 < a < 1 and 0 < b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then 1 - a*b does not evaluate to zero.
There is a mathematical proof that it will never be >= 1. I don't have it handy.... you may want to ask on the math stack overflow site if you are interested in studying the proof. But your instincts are correct. It will never be >= 1.
Now, we must be careful because floating point arithmetic is only an approximation of math and has limitations. I'm not an expert on these limitations, but the floating-point standard is very carefully designed and provides certain guarantees. I'm pretty sure one of them includes (or implies) that x * y where x < 1 and y < 1 is guaranteed to be < 1.
You can check that even if using the highest float or double that is lower than 1, and multiplying by itself, the result will be lower than 1. Any multiplication of numbers lower than that must give a smaller result.
Here is the code I ran, with the results in comments:
float a = nextafterf(1, 0); // 0.999999940
double b = nextafter(1, 0); // 0.99999999999999989
float c = a * a; // 0.999999881
double d = b * b; // 0.99999999999999978
I've built a custom version of frexp:
auto frexp(float f) noexcept
{
static_assert(std::numeric_limits<float>::is_iec559);
constexpr uint32_t ExpMask = 0xff;
constexpr int32_t ExpOffset = 126;
constexpr int MantBits = 23;
uint32_t u;
std::memcpy(&u, &f, sizeof(float)); // well defined bit transformation from float to int
int exp = ((u >> MantBits) & ExpMask) - ExpOffset; // extract the 8 bits of the exponent (it has an offset of 126)
// divide by 2^exp (leaving mantissa intact while placing "0" into the exponent)
u &= ~(ExpMask << MantBits); // zero out the exponent bits
u |= ExpOffset << MantBits; // place 126 into exponent bits (representing 0)
std::memcpy(&f, &u, sizeof(float)); // copy back to f
return std::make_pair(exp, f);
}
By checking is_iec559 I'm making sure that float fulfills
the requirements of IEC 559 (IEEE 754) standard.
My question is: Does this mean that the bit operations I'm doing are well defined and do what I want? If not, is there a way to fix it?
I tested it for some random values and it seems to be correct, at least on Windows 10 compiled with msvc and on wandbox. Note however, that (on purpose) I'm not handling the edge cases of subnormals, NaN, and inf.
If anyone wonders why I'm doing this: In benchmarks I found that this version of frexp is up to 15 times faster than std::frexp on Windows 10. I haven't tested other platforms yet. But I want to make sure that this not just works by coincident and may brake in future.
Edit:
As mentioned in the comments, endianess could be an issue. Does anybody know?
"Does this mean that the bit operations I'm doing are well defined..."
The TL;DR;, by the strict definition of "well defined": no.
Your assumptions are likely correct but not well defined, because there are no guarantees about the bit width, or the implementation of float. From § 3.9.1:
there are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.
The is_iec559 clause only qualifies with:
True if and only if the type adheres to IEC 559 standard
If a literal genie wrote you a terrible compiler, and made float = binary16, double = binary32, and long double = binary64, and made is_iec559 true for all the types, it would still adhere to the standard.
does that mean that I can extract exponent and mantissa in a well defined way?
The TL;DR;, by the limited guarantees of the C++ standard: no.
Assume you use float32_t and is_iec559 is true, and logically deduced from all the rules that it could only be binary32 with no trap representations, and you correctly argued that memcpy is a well defined for conversion between arithmetic types of the same width, and won't break strict aliasing. Even with all those assumptions, the behavior might be well defined but it's only likely and not guaranteed that you can extract the mantissa this way.
The IEEE 754 standard and 2's complement regard bit string encodings, and the behavior of memcpy is described using bytes. While it's plausible to assume the bit string of uint32_t and float32_t would be encoded the same way (e.g. same endianness), there's no guarantee in the standard for that. If the bit strings are stored differently and you shift and mask the copied integer representation to get the mantissa, the answer will be incorrect, despite the memcpy behavior being well defined.
As mentioned in the comments, endianess could be an issue. Does anybody know?
At least a few architectures have used different endianness for floating point registers and integer registers. The same link says that except for small embedded processors, this isn't a concern. I trust Wikipedia entirely for all topics and refuse to do any further research.
I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be of cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you as it is a very bad habit. Things could go wrong as yes, 1e7 is a floating point double type.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By the way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE754 64 bit floating point. (Such as scheme is particularly good at representing powers of 2 exactly - IEEE754 can represent powers of 2 up to 1022 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number before that is 18,446,744,073,709,550,592 (which is 1024 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 is a floating point type. If the compiler was allowed to treat the literal as an integral type then the output of the two loops would be equivalent.
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, you should better define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 105*3 = 1015, exactly. And if int is 32-bit then int has roughly 103*3 = 109 as max value (asking Google search it says that "2**31 - 1" = 2 147 483 647, i.e. twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
If the intention to loop for a exact integer number of iterations, for example if iterating over exactly all the elements in an array then comparing against a floating point value is maybe not such a good idea, solely for accuracy reasons; since the implicit cast of an integer to float will truncate integers toward zero there's no real danger of out-of-bounds access, it will just abort the loop short.
Now the question is: When do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as the exponent is 0 a floating point value is essentially an integer. C double precision floats 52 bits for the mantissa, which gives you integer precision to a value of up to 2^52, which is in the order of about 1e15. Without specifying with a suffix f that you want a floating point literal to be interpreted single precision the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a different package, and later a different chip and as aresult getting values into the FPU registers is a bit awkward on the x86 assembly level. Depending on what your intentions are it might make the difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to to? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition, but if i is used to index an array and for some reason the array length ended up in a floating point variable always truncating toward zero it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.
I have run into a situation in my code where a function returns a double, and it is possible for this double to be a zero, a negative zero, or another value entirely. I need to distinguish between zero and negative zero, but the default double comparison does not. Due to the format of doubles, C++ does not allow for comparison of doubles using bitwise operators, so I am unsure how to procede. How can I distinguish between the two?
Call std::signbit() to determine the state of the sign bit.
Due to the format of doubles, C++ does not allow for comparison of doubles using bitwise operators, so I am unsure how to procede.
First off, C++ doesn't mandate any standard for float/double types.
Assuming you're talking about IEEE 754's binary32 and binary64 formats, they're specifically designed to maintain their order when their bit patterns are interpreted as integers so that a non-FPU can sort them; this is the reason they have a biased exponent.
There're many SO posts discussing such comparisons; here's the most relevant one. A simple check would be
bool is_negative_zero(float val)
{
return ((val == 0.0f) && std::signbit(val));
}
This works since 0.0f == -0.0f, although there're places where the sign makes a difference like atan2 or when dividing by -0 as opposed to +0 leads to the respective infinities.
To test explicitly for a == -0 in C do the following:
if (*((long *)&a) == 0x8000000000000000) {
// a is -0
}
Say I want a function that takes two floats (x and y), and I want to compare them using not their float representation but rather their bitwise representation as a 32-bit unsigned int. That is, a number like -495.5 has bit representation 0b11000011111001011100000000000000 or 0xC3E5C000 as a float, and I have an unsigned int with the same bit representation (corresponding to a decimal value 3286614016, which I don't care about). Is there any easy way for me to perform an operation like <= on these floats using only the information contained in their respective unsigned int counterparts?
You must do a signed compare unless you ensure that all the original values were positive. You must use an integer type that is the same size as the original floating point type. Each chip may have a different internal format, so comparing values from different chips as integers is most likely to give misleading results.
Most float formats look something like this: sxxxmmmm
s is a sign bit
xxx is an exponent
mmmm is the mantissa
The value represented will then be something like: 1mmm << (xxx-k)
1mmm because there is an implied leading 1 bit unless the value is zero.
If xxx < k then it will be a right shift. k is near but not equal to half the largest value that could be expressed by xxx. It is adjusted for the size of the mantissa.
All to say that, disregarding NaN, comparing floating point values as signed integers of the same size will yield meaningful results. They are designed that way so that floating point comparisons are no more costly than integer comparisons. There are compiler optimizations to turn off NaN checks so that the comparisons are straight integer comparisons if the floating point format of the chip supports it.
As an integer, NaN is greater than infinity is greater than finite values. If you try an unsigned compare, all the negative values will be larger than the positive values, just like signed integers cast to unsigned.
If you truly truly don't care about what the conversion yields, it isn't too hard. But the results are extremely non-portable, and you almost certainly won't get an ordering that at all resembles what you'd get by comparing the floats directly.
typedef unsigned int TypeWithSameSizeAsFloat; //Fix this for your platform
bool compare1(float one, float two)
union Convert {
float f;
TypeWithSameSizeAsFloat i;
}
Convert lhs, rhs;
lhs.f = one;
rhs.f = two;
return lhs.i < rhs.i;
}
bool compare2(float one, float two) {
return reinterpret_cast<TypeWithSameSizeAsFloat&>(one)
< reinterpret_cast<TypeWithSameSizeAsFloat&>(two);
}
Just understand the caveats, and chose your second type carefully. Its a near worthless excersize at any rate.
In a word, no. IEEE 754 might allow some kinds of hacks like this, but they do not work all the time and handle all cases, and some platforms do not use that floating point standard (such as doubles on x87 having 80 bit precision internally).
If you're doing this for performance reasons I suggest you strongly reconsider -- if it's faster to use the integer comparison the compiler will probably do it for you, and if it is not, you pay for a float to int conversion multiple times, when a simple comparison may be possible without moving the floats out of registers.
Maybe I'm misreading the question, but I suppose you could do this:
bool compare(float a, float b)
{
return *((unsigned int*)&a) < *((unsigned int*)&b);
}
But this assumes all kinds of things and also warrants the question of why you'd want to compare the bitwise representations of two floats.