assertions on negations of relational operators for floating point number - c++

Suppose I have two variables a and b, either both in type float, or both in type double, which hold some values. Do the following assertions always hold? I mean, do the existence of numerical errors alter the conclusions?
a > b is true if and only if a <= b is false
a < b is true if and only if a >= b is false
a >= b is necessarily true if a == b is true
a <= b is necessarily true if a == b is true
For the third and fourth, I mean for example, does "a == b is true" always give you "a >= b is true"?
EDIT:
Assume neither a or b is NaN or Inf.
EDIT 2:
After reading the IEEE 754 standard in the year 1985, I found the following.
First of all, it said the following
Comparisons are exact and never overflow nor underflow.
I understand it as when doing comparison, there is no consideration of numerical error, numbers are compared as-is. Since addition and subtraction such as a - b requires additional effort to define what the numerical error is, I assume the quote above is saying that comparisons like a > b is not done by judging whether a - b > 0 is true or not. Please let me know if I'm wrong.
Second, it listed the four canonical relations which are mutually exclusive.
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself.
Then in Table 4, it defined the various kinds of operators such as ">" or ">=" in terms of the truth values under these four canonical relations. From the table we immediately have the following:
a >= b is true if and only if a < b is false
a <= b is true if and only if a > b is false
Both a >= b and a <= b are necessarily true if a == b is true
So the assertions in my questions can be concluded as true. However, I wasn't able to find anything in the standard defining whether symmetricity is true or not. In another way, from a > b I don't know if b < a is true or not. Thus I also have no way to derive a <= b is true from b < a is false. So I would be interested to know, in addition to the assertions in the OP, whether the following is always true or not
a < b is true if and only if b > a is true
a <= b is true if and only if b >= a is true
etc.
EDIT 3:
Regarding denormal numbers as mentioned by #Mark Ransom, I read the wikipedia page about it and my current understanding is that the existence of denormal numbers does not alter the conclusions above. In another way, if some hardware claims that it fully support denormal numbers, it also needs to ensure that the definitions of the comparison operators satisfy the above standards.
EDIT 4:
I just read the 2008 revision of IEEE 754, it doesn't say anything about symmetricity either. So I guess this is undefined.
(All the above discussion assumes no NaN or Inf in any of the operands).

If neither number is a NaN or infinity then your assertions hold, if you have an IEEE compatible system. If the standards don't mention it then I'm sure that is just because it was sufficiently obvious to not be worth mentioning. In particular, if "a < b" and "b > a" don't have the same value (even for NaNs and infinities) then we are in crazy town.
Denormals shouldn't affect the conclusions because if you are assuming IEEE compatibility then denormal support is a given.
The only risks I can think of are in cases involving the x87 FPU and it's odd 80-bit long double format. Rounding from 80-bit long double to double or float is expensive, so it is sometimes omitted, and that can lead to an assert like this firing:
assert(sqrt(val) == sqrt(val));
It could fire because the result of the first sqrt() call may be written to memory, and therefore rounded to float or double. It would then be compared against the result of the second sqrt() call, which might not be rounded. However, failing to round the result of the second sqrt() call before doing the comparison is, strictly speaking, not IEEE compliant if sqrt() returns float or double.
Infinity should only be a problem if both numbers are infinity.

Related

IEEE-754 floating point comparisons with special cases

I'm wondering where I can find the standard output (if there is one) when comparing (LT & GT) 2 special single precision IEEE-754 floating point values, being combinations of -inf/+inf/NaN/-0/+0.
I wrote a test program and it gave me following output, but how to check if it is compliant?:
The relevant portion of the IEEE 754-2008 standard, and a late 754-2019 draft, is:
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself. Comparisons shall ignore the sign of zero (so +0 = −0). Infinite operands of the same sign shall compare equal.
From this, we see that none of the comparisons with a NaN would yield less than or greater than, and therefore any test of less than or greater than should yield false, so the results in the table conform in this regard.
Similarly, the comparisons of an infinity to the same infinity and of either zero to either zero should yield false, so the table conforms in this regard too.
The standard does not explicitly detail comparisons of other values; they are inherited from ordinary arithmetic, and we can see the table conforms in this regard too.
See the links in the Standards section of the Wikipedia IEEE 754 page for sources for official versions of the standard.
The standard says:
nan comparisons are always evaluated as false (even nan == nan)
+inf > x is true if x is neither +inf nor nan, and false otherwise
-inf < x is true if x is neither -inf nor nan, and false otherwise

Floating points : Is it true if a > b then a - b > 0?

in c++ for two doubles a and b, is it true if a > b, then a - b > 0? Assume a, b are not Nan, just normal numbers
No, not by the requirements of the C++ standard alone. C++ allows implementations to use floating-point types without support for subnormal numbers. In such a type, you could have a equal to 1.00012•2m, where m is the minimum normal exponent, and b equal to 1.00002•2m. Then a > b but the exact value of a-b is not representable (it would be subnormal), so the computed result is zero.
This does not happen with IEEE-754 arithmetic, which supports subnormals. However, some C++ implementations, possibly not conforming to the C++ standard, use IEEE-754 formats but flush subnormal results to zero. So this will also produce a > b but a-b == 0.
Supposing an implementation is supporting subnormals and is conforming to the C standard, another issue is the standard allows implementations to use extra precision within floating-point expressions. If your a and b are actually expressions rather than single objects, it is possible that the extra precision could cause a > b to be true while a-b == 0 is false. Implementations are permitted to use or not use such extra precision at will, so a > b could be true in some evaluations and not others, and similarly for a-b == 0. If a and b are single objects, this will not happen, because the standard requires implementations to “discard” the excess precision in assignments and casts.
Yes, this is true in IEEE-754 arithmetic, in all rounding modes. So long as your C++ compiler uses IEEE-754 semantics (either via software or hardware). But see Eric Postpischill's answer for when particular implementations might diverge from IEEE-754. Best to check with your compiler documentation.

Comparing double in C++, peer review

I have always had the problem of comparing double values for equality. There are functions around like some fuzzy_compare(double a, double b), but I often enough did not manage to find them in time. So I thought on building a wrapper class for double just for the comparison operator:
typedef union {
uint64_t i;
double d;
} number64;
bool Double::operator==(const double value) const {
number64 a, b;
a.d = this->value;
b.d = value;
if ((a.i & 0x8000000000000000) != (b.i & 0x8000000000000000)) {
if ((a.i & 0x7FFFFFFFFFFFFFFF) == 0 && (b.i & 0x7FFFFFFFFFFFFFFF) == 0)
return true;
return false;
}
if ((a.i & 0x7FF0000000000000) != (b.i & 0x7FF0000000000000))
return false;
uint64_t diff = (a.i & 0x000FFFFFFFFFFFF) - (b.i & 0x000FFFFFFFFFFFF) & 0x000FFFFFFFFFFFF;
return diff < 2; // 2 here is kind of some epsilon, but integer and independent of value range
}
The idea behind it is:
First, compare the sign. If it's different, the numbers are different. Except if all other bits are zero. That is comparing +0.0 with -0.0, which should be equal. Next, compare the exponent. If these are different, the numbers are different. Last, compare the mantissa. If the difference is low enough, the values are equal.
It seems to work, but just to be sure, I'd like a peer review. It could well be that I overlooked something.
And yes, this wrapper class needs all the operator overloading stuff. I skipped that because they're all trivial. The equality operator is the main purpose of this wrapper class.
This code has several problems:
Small values on different sides of zero always compare unequal, no matter how (not) far apart.
More importantly, -0.0 compares unequal with +epsilon but +0.0 compares equal with +epsilon (for some epsilon). That's really bad.
What about NaNs?
Values with different exponents compare unequal, even if one floating point "step" apart (e.g. the double before 1 compares unequal to 1, but the one after 1 compares equal...).
The last point could ironically be fixed by not distinguishing between exponent and mantissa: The binary representations of all positive floats are exactly in the order of their magnitude!
It appears that you want to just check whether two floats are a certain number of "steps" apart. If so, maybe this boost function might help. But I would also question whether that's actually reasonable:
Should the smallest positive non-denormal compare equal to zero? There are still many (denormal) floats between them. I doubt this is what you want.
If you operate on values that are expected to be of magnitude 1e16, then 1 should compare equal to 0, even though half of all positive doubles are between 0 and 1.
It is usually most practical to use a relative + absolute epsilon. But I think it will be most worthwhile to check out this article, which discusses the topic of comparing floats more extensively than I could fit into this answer:
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
To cite its conclusion:
Know what you’re doing
There is no silver bullet. You have to choose wisely.
If you are comparing against zero, then relative epsilons and ULPs based comparisons are usually meaningless. You’ll need to use an absolute epsilon, whose value might be some small multiple of FLT_EPSILON and the inputs to your calculation. Maybe.
If you are comparing against a non-zero number then relative epsilons or ULPs based comparisons are probably what you want. You’ll probably want some small multiple of FLT_EPSILON for your relative epsilon, or some small number of ULPs. An absolute epsilon could be used if you knew exactly what number you were comparing against.
If you are comparing two arbitrary numbers that could be zero or non-zero then you need the kitchen sink. Good luck and God speed.
Above all you need to understand what you are calculating, how stable the algorithms are, and what you should do if the error is larger than expected. Floating-point math can be stunningly accurate but you also need to understand what it is that you are actually calculating.
You store into one union member and then read from another. That causes aliasing problem (undefined behaviour) because the C++ language requires that objects of different types do not alias.
There are a few ways to remove the undefined behaviour:
Get rid of the union and just memcpy the double into uint64_t. The portable way.
Mark union member i type with [[gnu::may_alias]].
Insert a compiler memory barrier between storing into union member d and reading from member i.
Frame the question this way:
We have two numbers, a and b, that have been computed with floating-point arithmetic.
If they had been computed exactly with real-number mathematics, we would have a and b.
We want to compare a and b and get an answer that tells us whether a equals b.
In other words, you are trying to correct for errors that occurred while computing a and b. In general, that is impossible, of course, because we do not know what a and b are. We only have the approximations a and b.
The code you propose falls back to another strategy:
If a and b are close to each other, we will accept that a equals b. (In other words: If a is close to b, it is possible that a equals b, and the differences we have are only because of calculation errors, so we will accept that a equals b without further evidence.)
There are two problems with this strategy:
This strategy will incorrectly accept that a equals b even when it is not true, just because a and b are close.
We need to decide how close to require a and b to be.
Your code attempts to address the latter: It is establishing some tests about whether a and b are close enough. As others have pointed out, it is severely flawed:
It treats numbers as different if they have different signs, but floating-point arithmetic can cause a to be negative even if a is positive, and vice versa.
It treats numbers as different if they have different exponents, but floating-point arithmetic can cause a to have a different exponent from a.
It treats numbers as different if they differ by more than a fixed number of ULP (units of least precision), but floating-point arithmetic can, in general, cause a to differ from a by any amount.
It assumes an IEEE-754 format and needlessly uses aliasing with behavior not defined by the C++ standard.
The approach is fundamentally flawed because it needlessly fiddles with the floating-point representation. The actual way to determine from a and b whether a and b might be equal is to figure out, given a and b, what sets of values a and b have and whether there is any value in common in those sets.
In other words, given a, the value of a might be in some interval, (a−eal, a+ear) (that is, all the numbers from a minus some error on the left to a plus some error on the right), and, given b, the value of b might be in some interval, (b−ebl, b+ebr). If so, what you want to test is not some floating-point representation properties but whether the two intervals (a−eal, a+ear) and (b−ebl, b+ebr) overlap.
To do that, you need to know, or at least have bounds on, the errors eal, ear, ebl, and ebr. But those errors are not fixed by the floating-point format. They are not 2 ULP or 1 ULP or any number of ULP scaled by the exponent. They depend on how a and b were computed. In general, the errors can range from 0 to infinity, and they can also be NaN.
So, to test whether a and b might be equal, you need to analyze the floating-point arithmetic errors that could have occurred. In general, this is difficult. There is an entire field of mathematics for it, numerical analysis.
If you have computed bounds on the errors, then you can just compare the intervals using ordinary arithmetic. There is no need to take apart the floating-point representation and work with the bits. Just use the normal add, subtract, and comparison operations.
(The problem is actually more complicated than I allowed above. Given a computed value a, the potential values of a do not always lie in a single interval. They could be an arbitrary set of points.)
As I have written previously, there is no general solution for comparing numbers containing arithmetic errors: 0 1 2 3.
Once you figure out error bounds and write a test that returns true if a and b might be equal, you still have the problem that the test also accepts false negatives: It will return true even in cases where a and b are not equal. In other words, you have just replaced a program that is wrong because it rejects equality even though a and b would be equal with a program that is wrong in other cases because it accepts equality in cases where a and b are not equal. This is another reason there is no general solution: In some applications, accepting as equal numbers that are not equal is okay, at least for some situations. In other applications, that is not okay, and using a test like this will break the program.

Is it ok to compare floating points to 0.0 without epsilon?

I am aware, that to compare two floating point values one needs to use some epsilon precision, as they are not exact. However, I wonder if there are edge cases, where I don't need that epsilon.
In particular, I would like to know if it is always safe to do something like this:
double foo(double x){
if (x < 0.0) return 0.0;
else return somethingelse(x); // somethingelse(x) != 0.0
}
int main(){
int x = -3.0;
if (foo(x) == 0.0) {
std::cout << "^- is this comparison ok?" << std::endl;
}
}
I know that there are better ways to write foo (e.g. returning a flag in addition), but I wonder if in general is it ok to assign 0.0 to a floating point variable and later compare it to 0.0.
Or more general, does the following comparison yield true always?
double x = 3.3;
double y = 3.3;
if (x == y) { std::cout << "is an epsilon required here?" << std::endl; }
When I tried it, it seems to work, but it might be that one should not rely on that.
Yes, in this example it is perfectly fine to check for == 0.0. This is not because 0.0 is special in any way, but because you only assign a value and compare it afterwards. You could also set it to 3.3 and compare for == 3.3, this would be fine too. You're storing a bit pattern, and comparing for that exact same bit pattern, as long as the values are not promoted to another type for doing the comparison.
However, calculation results that would mathematically equal zero would not always equal 0.0.
This Q/A has evolved to also include cases where different parts of the program are compiled by different compilers. The question does not mention this, my answer applies only when the same compiler is used for all relevant parts.
C++ 11 Standard,
§5.10 Equality operators
6 If both operands are of arithmetic or enumeration type, the usual
arithmetic conversions are performed on both operands; each of the
operators shall yield true if the specified relationship is true and
false if it is false.
The relationship is not defined further, so we have to use the common meaning of "equal".
§2.13.4 Floating literals
1 [...] If the scaled value is in the range of representable values
for its type, the result is the scaled value if representable, else
the larger or smaller representable value nearest the scaled value,
chosen in an implementation-defined manner. [...]
The compiler has to choose between exactly two values when converting a literal, when the value is not representable. If the same value is chosen for the same literal consistently, you are safe to compare values such as 3.3, because == means "equal".
Yes, if you return 0.0 you can compare it to 0.0; 0 is representable exactly as a floating-point value. If you return 3.3 you have to be a much more careful, since 3.3 is not exactly representable, so a conversion from double to float, for example, will produce a different value.
correction: 0 as a floating point value is not unique, but IEEE 754 defines the comparison 0.0==-0.0 to be true (any zero for that matter).
So with 0.0 this works - for every other number it does not. The literal 3.3 in one compilation unit (e.g. a library) and another (e.g. your application) might differ. The standard only requires the compiler to use the same rounding it would use at runtime - but different compilers / compiler settings might use different rounding.
It will work most of the time (for 0), but is very bad practice.
As long as you are using the same compiler with the same settings (e.g. one compilation unit) it will work because the literal 0.0 or 0.0f will translate to the same bit pattern every time. The representation of zero is not unique though. So if foo is declared in a library and your call to it in some application the same function might fail.
You can rescue this very case by using std::fpclassify to check whether the returned value represents a zero. For every finite (non-zero) value you will have to use an epsilon-comparison though unless you stay within one compilation unit and perform no operations on the values.
As written in both cases you are using identical constants in the same file fed to the same compiler. The string to float conversion the compiler uses should return the same bit pattern so these should not only be equal as in a plus or minus cases for zero thing but equal bit by bit.
Were you to have a constant which uses the operating systems C library to generate the bit pattern then have a string to f or something that can possibly use a different C library if the binary is transported to another computer than the one compiled on. You might have a problem.
Certainly if you compute 3.3 for one of the terms, runtime, and have the other 3.3 computed compile time again you can and will get failures on the equal comparisons. Some constants obviously are more likely to work than others.
Of course as written your 3.3 comparison is dead code and the compiler just removes it if optimizations are enabled.
You didnt specify the floating point format nor standard if any for that format you were interested in. Some formats have the +/- zero problem, some dont for example.
It is a common misconception that floating point values are "not exact". In fact each of them is perfectly exact (except, may be, some special cases as -0.0 or Inf) and equal to s·2e – (p – 1), where s, e, and p are significand, exponent, and precision correspondingly, each of them integer. E.g. in IEEE 754-2008 binary32 format (aka float32) p = 24 and 1 is represented as ‭0x‭800000‬‬·20 – 23. There are two things that are really not exact when you deal with floating point values:
Representation of a real value using a FP one. Obviously, not all real numbers can be represented using a given FP format, so they have to be somehow rounded. There are several rounding modes, but the most commonly used is the "Round to nearest, ties to even". If you always use the same rounding mode, which is almost certainly the case, the same real value is always represented with the same FP one. So you can be sure that if two real values are equal, their FP counterparts are exactly equal too (but not the reverse, obviously).
Operations with FP numbers are (mostly) inexact. So if you have some real-value function φ(ξ) implemented in the computer as a function of a FP argument f(x), and you want to compare its result with some "true" value y, you need to use some ε in comparison, because it is very hard (sometimes even impossible) to white a function giving exactly y. And the value of ε strongly depends on the nature of the FP operations involved, so in each particular case there may be different optimal value.
For more details see D. Goldberg. What Every Computer Scientist Should Know About Floating-Point Arithmetic, and J.-M. Muller et al. Handbook of Floating-Point Arithmetic. Both texts you can find in the Internet.

For a floating point value a: Does a*0.0 == 0.0 always evaluate true for finite values of a?

I was always assuming that the following test will always succeed for finite values (no INF, no NAN) of somefloat:
assert(somefloat*0.0==0.0);
In Multiply by 0 optimization it was stated that double a=0.0 and double a=-0.0 are not strictly speaking the same thing.
So I was wondering whether this can lead to problems on some platforms e.g. can the result of the above test depend on a beeing positive or negative.
If your implementation uses IEEE 754 arithmetic (which most do), then positive and negative zero will compare equal. Since the left-hand side of your expression can only be either positive or negative zero for finite a, the assertion will always be true.
If it uses some other kind of arithmetic, then only the implementor, and hopefully the implementation-specific documentation, can tell you. Arguably (see comments) the wording of the standard can be taken to imply that they must compare equal in any case, and certainly no sane implementation would do otherwise.
-0.0 == 0.0 according to double comparison rules.
For non-finite values (+-Inf, Nan) somefloat*0.0 != 0.0.
Your assert can never fail, as long as somefloat is not
infinity or NaN. On systems which don't support infinity or
NaN, the compiler can simply optimize it out.