Is it ok to compare floating points to 0.0 without epsilon? - c++

I am aware, that to compare two floating point values one needs to use some epsilon precision, as they are not exact. However, I wonder if there are edge cases, where I don't need that epsilon.
In particular, I would like to know if it is always safe to do something like this:
double foo(double x){
if (x < 0.0) return 0.0;
else return somethingelse(x); // somethingelse(x) != 0.0
}
int main(){
int x = -3.0;
if (foo(x) == 0.0) {
std::cout << "^- is this comparison ok?" << std::endl;
}
}
I know that there are better ways to write foo (e.g. returning a flag in addition), but I wonder if in general is it ok to assign 0.0 to a floating point variable and later compare it to 0.0.
Or more general, does the following comparison yield true always?
double x = 3.3;
double y = 3.3;
if (x == y) { std::cout << "is an epsilon required here?" << std::endl; }
When I tried it, it seems to work, but it might be that one should not rely on that.

Yes, in this example it is perfectly fine to check for == 0.0. This is not because 0.0 is special in any way, but because you only assign a value and compare it afterwards. You could also set it to 3.3 and compare for == 3.3, this would be fine too. You're storing a bit pattern, and comparing for that exact same bit pattern, as long as the values are not promoted to another type for doing the comparison.
However, calculation results that would mathematically equal zero would not always equal 0.0.
This Q/A has evolved to also include cases where different parts of the program are compiled by different compilers. The question does not mention this, my answer applies only when the same compiler is used for all relevant parts.
C++ 11 Standard,
§5.10 Equality operators
6 If both operands are of arithmetic or enumeration type, the usual
arithmetic conversions are performed on both operands; each of the
operators shall yield true if the specified relationship is true and
false if it is false.
The relationship is not defined further, so we have to use the common meaning of "equal".
§2.13.4 Floating literals
1 [...] If the scaled value is in the range of representable values
for its type, the result is the scaled value if representable, else
the larger or smaller representable value nearest the scaled value,
chosen in an implementation-defined manner. [...]
The compiler has to choose between exactly two values when converting a literal, when the value is not representable. If the same value is chosen for the same literal consistently, you are safe to compare values such as 3.3, because == means "equal".

Yes, if you return 0.0 you can compare it to 0.0; 0 is representable exactly as a floating-point value. If you return 3.3 you have to be a much more careful, since 3.3 is not exactly representable, so a conversion from double to float, for example, will produce a different value.

correction: 0 as a floating point value is not unique, but IEEE 754 defines the comparison 0.0==-0.0 to be true (any zero for that matter).
So with 0.0 this works - for every other number it does not. The literal 3.3 in one compilation unit (e.g. a library) and another (e.g. your application) might differ. The standard only requires the compiler to use the same rounding it would use at runtime - but different compilers / compiler settings might use different rounding.
It will work most of the time (for 0), but is very bad practice.
As long as you are using the same compiler with the same settings (e.g. one compilation unit) it will work because the literal 0.0 or 0.0f will translate to the same bit pattern every time. The representation of zero is not unique though. So if foo is declared in a library and your call to it in some application the same function might fail.
You can rescue this very case by using std::fpclassify to check whether the returned value represents a zero. For every finite (non-zero) value you will have to use an epsilon-comparison though unless you stay within one compilation unit and perform no operations on the values.

As written in both cases you are using identical constants in the same file fed to the same compiler. The string to float conversion the compiler uses should return the same bit pattern so these should not only be equal as in a plus or minus cases for zero thing but equal bit by bit.
Were you to have a constant which uses the operating systems C library to generate the bit pattern then have a string to f or something that can possibly use a different C library if the binary is transported to another computer than the one compiled on. You might have a problem.
Certainly if you compute 3.3 for one of the terms, runtime, and have the other 3.3 computed compile time again you can and will get failures on the equal comparisons. Some constants obviously are more likely to work than others.
Of course as written your 3.3 comparison is dead code and the compiler just removes it if optimizations are enabled.
You didnt specify the floating point format nor standard if any for that format you were interested in. Some formats have the +/- zero problem, some dont for example.

It is a common misconception that floating point values are "not exact". In fact each of them is perfectly exact (except, may be, some special cases as -0.0 or Inf) and equal to s·2e – (p – 1), where s, e, and p are significand, exponent, and precision correspondingly, each of them integer. E.g. in IEEE 754-2008 binary32 format (aka float32) p = 24 and 1 is represented as ‭0x‭800000‬‬·20 – 23. There are two things that are really not exact when you deal with floating point values:
Representation of a real value using a FP one. Obviously, not all real numbers can be represented using a given FP format, so they have to be somehow rounded. There are several rounding modes, but the most commonly used is the "Round to nearest, ties to even". If you always use the same rounding mode, which is almost certainly the case, the same real value is always represented with the same FP one. So you can be sure that if two real values are equal, their FP counterparts are exactly equal too (but not the reverse, obviously).
Operations with FP numbers are (mostly) inexact. So if you have some real-value function φ(ξ) implemented in the computer as a function of a FP argument f(x), and you want to compare its result with some "true" value y, you need to use some ε in comparison, because it is very hard (sometimes even impossible) to white a function giving exactly y. And the value of ε strongly depends on the nature of the FP operations involved, so in each particular case there may be different optimal value.
For more details see D. Goldberg. What Every Computer Scientist Should Know About Floating-Point Arithmetic, and J.-M. Muller et al. Handbook of Floating-Point Arithmetic. Both texts you can find in the Internet.

Related

floating point number comparison why no equal functions

In the c++ standard for floating pointer number, there is std::isgreater for greater comparison, and std::isless for less comparison, so why is there not a std::isequal for equality comparison? Is there a safe and accurate way to check if a double variable is equal to the DBL_MAX constants defined by the standard? The reason we try to do this is we are accessing data through service protocol, and it defines a double field when no data is available it will send DBL_MAX, so in our client code when it's DBL_MAX we need to skip it, and anything else we need to process it.
The interest of isgreater, isless, isgreaterequal, islessequal compared to >, <, >= and <= is that they do not raise FE_INVALID (a floating point exception, these are different beasts than C++ exceptions and are not mapped to C++ exceptions) when comparing with a NaN while the operators do.
As == do not raise a FP exception, there is no need of an additional functionality which does.
Note that there is also islessgreater and isunordered.
If you are not considering NaN or not testing the floating point exception there is no need to worry about these functions.
Considering equality comparison == is what to use if you want to check that the values are the same (ignoring the issues related to signed 0 and NaN). Depending on how you are reaching these values, it is sometimes useful to consider an approximate equality comparison -- but using one systematically is not recommended, for instance such approximate equality is probably not transitive.
In your context of a network protocol, you have to consider how the data is serialized. If the serialization is defined as binary, you can probably reconstruct the exact value and thus == is what you want so compare against DBL_MAX (for other values, check what is specified for signed 0 and NaN an know that there are signalling and quiet NaN are represented by different bit patterns although IEEE 754-2008 recommend now one of them). If the representation is decimal, you'll have to check if the representation is precise enough for the DBL_MAX value be reconstructable (and pay attention to rounding modes).
Note that I'd have considered a NaN for representing the no data available case instead of using a potentially valid value.
Is there a safe and accurate way to check if a double variable is equal to the DBL_MAX constants defined by the standard?
When you obtain a floating point number as a result of evaluating some expression, then == comparison doesn't make sense in most cases due to finite precision and rounding errors. However, if you first set a floating point variable to some value, then you can compare it to that value using == (with some exceptions like positive and negative zeros).
For example:
double v = std::numeric_limits<double>::max();
{ conditional assignment to v of a non-double-max value }
if (v != std::numeric_limits<double>::max())
process(v);
std::numeric_limits<double>::max() is exactly representable as double (it is double), so this comparison should be safe and should never yield true unless v was reassigned.
Equal function between to floating point have no sense, because the floating point numbers have rounding.
So if you want compare 2 floating point numbers for equality the best way is to declare a significant epsilon(error) for your comparison and then verify that the first number is in around the second.
To do this you can verify if first number is greater than second number minus epsilon and the first number is lower than the second number plus epsilon.
Ex.
if ( first > second - epsilon && first < second + epsilon ){
//The numbers are equal
}else{
//The numbers are not equal
}

Floating point equality

It is common knowledge that one has to be careful when comparing floating point values. Usually, instead of using ==, we use some epsilon or ULP based equality testing.
However, I wonder, are there any cases, when using == is perfectly fine?
Look at this simple snippet, which cases are guaranteed to succeed?
void fn(float a, float b) {
float l1 = a/b;
float l2 = a/b;
if (l1==l1) { } // case a)
if (l1==l2) { } // case b)
if (l1==a/b) { } // case c)
if (l1==5.0f/3.0f) { } // case d)
}
int main() {
fn(5.0f, 3.0f);
}
Note: I've checked this and this, but they don't cover (all of) my cases.
Note2: It seems that I have to add some plus information, so answers can be useful in practice: I'd like to know:
what the C++ standard says
what happens, if a C++ implementation follows IEEE-754
This is the only relevant statement I found in the current draft standard:
The value representation of floating-point types is implementation-defined. [ Note: This document imposes no requirements on the accuracy of floating-point operations; see also [support.limits]. — end note ]
So, does this mean, that even "case a)" is implementation defined? I mean, l1==l1 is definitely a floating-point operation. So, if an implementation is "inaccurate", then could l1==l1 be false?
I think this question is not a duplicate of Is floating-point == ever OK?. That question doesn't address any of the cases I'm asking. Same subject, different question. I'd like to have answers specifically to case a)-d), for which I cannot find answers in the duplicated question.
However, I wonder, are there any cases, when using == is perfectly fine?
Sure there are. One category of examples are usages that involve no computation, e.g. setters that should only execute on changes:
void setRange(float min, float max)
{
if(min == m_fMin && max == m_fMax)
return;
m_fMin = min;
m_fMax = max;
// Do something with min and/or max
emit rangeChanged(min, max);
}
See also Is floating-point == ever OK? and Is floating-point == ever OK?.
Contrived cases may "work". Practical cases may still fail. One additional issue is that often optimisation will cause small variations in the way the calculation is done so that symbolically the results should be equal but numerically they are different. The example above could, theoretically, fail in such a case. Some compilers offer an option to produce more consistent results at a cost to performance. I would advise "always" avoiding the equality of floating point numbers.
Equality of physical measurements, as well as digitally stored floats, is often meaningless. So if your comparing if floats are equal in your code you are probably doing something wrong. You usually want greater than or less that or within a tolerance. Often code can be rewritten so these types of issues are avoided.
Only a) and b) are guaranteed to succeed in any sane implementation (see the legalese below for details), as they compare two values that have been derived in the same way and rounded to float precision. Consequently, both compared values are guaranteed to be identical to the last bit.
Case c) and d) may fail because the computation and subsequent comparison may be carried out with higher precision than float. The different rounding of double should be enough to fail the test.
Note that the cases a) and b) may still fail if infinities or NANs are involved, though.
Legalese
Using the N3242 C++11 working draft of the standard, I find the following:
In the text describing the assignment expression, it is explicitly stated that type conversion takes place, [expr.ass] 3:
If the left operand is not of class type, the expression is implicitly converted (Clause 4) to the cv-unqualified type of the left operand.
Clause 4 refers to the standard conversions [conv], which contain the following on floating point conversions, [conv.double] 1:
A prvalue of floating point type can be converted to a prvalue of another floating point type. If the
source value can be exactly represented in the destination type, the result of the conversion is that exact
representation. If the source value is between two adjacent destination values, the result of the conversion
is an implementation-defined choice of either of those values. Otherwise, the behavior is undefined.
(Emphasis mine.)
So we have the guarantee that the result of the conversion is actually defined, unless we are dealing with values outside the representable range (like float a = 1e300, which is UB).
When people think about "internal floating point representation may be more precise than visible in code", they think about the following sentence in the standard, [expr] 11:
The values of the floating operands and the results of floating expressions may be represented in greater
precision and range than that required by the type; the types are not changed thereby.
Note that this applies to operands and results, not to variables. This is emphasized by the attached footnote 60:
The cast and assignment operators must still perform their specific conversions as described in 5.4, 5.2.9 and 5.17.
(I guess, this is the footnote that Maciej Piechotka meant in the comments - the numbering seems to have changed in the version of the standard he's been using.)
So, when I say float a = some_double_expression;, I have the guarantee that the result of the expression is actually rounded to be representable by a float (invoking UB only if the value is out-of-bounds), and a will refer to that rounded value afterwards.
An implementation could indeed specify that the result of the rounding is random, and thus break the cases a) and b). Sane implementations won't do that, though.
Assuming IEEE 754 semantics, there are definitely some cases where you can do this. Conventional floating point number computations are exact whenever they can be, which for example includes (but is not limited to) all basic operations where the operands and the results are integers.
So if you know for a fact that you don't do anything that would result in something unrepresentable, you are fine. For example
float a = 1.0f;
float b = 1.0f;
float c = 2.0f;
assert(a + b == c); // you can safely expect this to succeed
The situation only really gets bad if you have computations with results that aren't exactly representable (or that involve operations which aren't exact) and you change the order of operations.
Note that the C++ standard itself doesn't guarantee IEEE 754 semantics, but that's what you can expect to be dealing with most of the time.
Case (a) fails if a == b == 0.0. In this case, the operation yields NaN, and by definition (IEEE, not C) NaN ≠ NaN.
Cases (b) and (c) can fail in parallel computation when floating-point round modes (or other computation modes) are changed in the middle of this thread's execution. Seen this one in practice, unfortunately.
Case (d) can be different because the compiler (on some machine) may choose to constant-fold the computation of 5.0f/3.0f and replace it with the constant result (of unspecified precision), whereas a/b must be computed at runtime on the target machine (which might be radically different). In fact, intermediate calculations may be performed in arbitrary precision. I've seen differences on old Intel architectures when intermediate computation was performed in 80-bit floating-point, a format that the language didn't even directly support.
In my humble opinion, you should not rely on the == operator because it has many corner cases. The biggest problem is rounding and extended precision. In case of x86, floating point operations can be done with bigger precision than you can store in variables (if you use coprocessors, IIRC SSE operations use same precision as storage).
This is usually good thing, but this causes problems like:
1./2 != 1./2 because one value is form variable and second is from floating point register. In the simplest cases, it will work, but if you add other floating point operations the compiler could decide to split some variables to the stack, changing their values, thus changing the result of the comparison.
To have 100% certainty you need look at assembly and see what operations was done before on both values. Even the order can change the result in non-trivial cases.
Overall what is point of using ==? You should use algorithms that are stable. This means they work even if values are not equal, but they still give the same results. The only place I know where == could be useful is serializing/deserializing where you know what result you want exactly and you can alter serialization to archive your goal.

Floating Point comparison specific

I have a specific question about floating point comparisons. I know that it's not recommended to use the == comparison due to precision issues, but in this specific case, I am wondering if, in all cases / compilers, this statement will hold true?
float a = 1.02f;
float b = 1.02f;
if(a == b)
{
print(true);
}
else
{
print(false);
}
In other words, if I assign floating point numbers exactly, with no addition, subtraction, demotion, or promotion, will this always hold true?
Yes, the compiler should consistently translate floating point constants that are identical to the same value - if not, it's a bug in the compiler [which of course is not impossible]. However, as soon as you do ANYTHING ELSE to the value (such as reading it from a file including cin) or do simple math (add 1.0, then subtract 1.0), the value is not guaranteed to be the same any longer.
As pointed out, the value "Not A Number" or "NaN" is guaranteed to NEVER be equal to anything, and all comparisons except != will be false, no matter what it is compared to - this is part of the specification for floating point. But all other constants, as long as you are JUST using a constant to assign to a variable, it should remain the same throughout the code.
Of course, you can't rely on TWO DIFFERENT compilers coming up with the same exact value from the same source-code [many times they will, but sometimes they simply will do some different rounding of the last bit, or something like that which causes differences]

Should I worry about precision when I use C++ mathematical functions with integers?

For example, The code below will give undesirable result due to precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So round function can not solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0`”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question the the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not come even with the guarantees that basic operations come with.
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it will be converted to the previous integer instead (e.g. 3-1×10-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it will be converted to the next integer instead (e.g. -3+1×10-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.

Floating-point comparison of constant assignment

When comparing doubles for equality, we need to give a tolerance level, because floating-point computation might introduce errors. For example:
double x;
double y;
x = f();
y = g();
if (fabs(x-y)<epsilon) {
// they are equal!
} else {
// they are not!
}
However, if I simply assign a constant value, without any computation, do I still need to check the epsilon?
double x = 1;
double y = 1;
if (x==y) {
// they are equal!
} else {
// no they are not!
}
Is == comparison good enough? Or I need to do fabs(x-y)<epsilon again? Is it possible to introduce error in assigning? Am I too paranoid?
How about casting (double x = static_cast<double>(100))? Is that gonna introduce floating-point error as well?
I am using C++ on Linux, but if it differs by language, I would like to understand that as well.
Actually, it depends on the value and the implementation. The C++ standard (draft n3126) has this to say in 2.14.4 Floating literals:
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
In other words, if the value is exactly representable (and 1 is, in IEEE754, as is 100 in your static cast), you get the value. Otherwise (such as with 0.1) you get an implementation-defined close match (a). Now I'd be very worried about an implementation that chose a different close match based on the same input token but it is possible.
(a) Actually, that paragraph can be read in two ways, either the implementation is free to choose either the closest higher or closest lower value regardless of which is actually the closest, or it must choose the closest to the desired value.
If the latter, it doesn't change this answer however since all you have to do is hardcode a floating point value exactly at the midpoint of two representable types and the implementation is once again free to choose either.
For example, it might alternate between the next higher and next lower for the same reason banker's rounding is applied - to reduce the cumulative errors.
No if you assign literals they should be the same :)
Also if you start with the same value and do the same operations, they should be the same.
Floating point values are non-exact, but the operations should produce consistent results :)
Both cases are ultimately subject to implementation defined representations.
Storage of floating point values and their representations take on may forms - load by address or constant? optimized out by fast math? what is the register width? is it stored in an SSE register? Many variations exist.
If you need precise behavior and portability, do not rely on this implementation defined behavior.
IEEE-754, which is a standard common implementations of floating point numbers abide to, requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result. Thus the only imprecision that you will face is rounding after each operation you perform, as well as propagation of rounding errors from the operations performed earlier in the chain. Floats are not per se inexact. And by the way, epsilon can and should be computed, you can consult any numerics book on that.
Floating point numbers can represent integers precisely up to the length of their mantissa. So for example if you cast from an int to a double, it will always be exact, but for casting into into a float, it will no longer be exact for very large integers.
There is one major example of extensive usage of floating point numbers as a substitute for integers, it's the LUA scripting language, which has no integer built-in type, and floating-point numbers are used extensively for logic and flow control etc. The performance and storage penalty from using floating-point numbers turns out to be smaller than the penalty of resolving multiple types at run time and makes the implementation lighter. LUA has been extensively used not only on PC, but also on game consoles.
Now, many compilers have an optional switch that disables IEEE-754 compatibility. Then compromises are made. Denormalized numbers (very very small numbers where the exponent has reached smallest possible value) are often treated as zero, and approximations in implementation of power, logarithm, sqrt, and 1/(x^2) can be made, but addition/subtraction, comparison and multiplication should retain their properties for numbers which can be exactly represented.
The easy answer: For constants == is ok.
There are two exceptions which you should be aware of:
First exception:
0.0 == -0.0
There is a negative zero which compares equal for the IEEE 754 standard. This means
1/INFINITY == 1/-INFINITY which breaks f(x) == f(y) => x == y
Second exception:
NaN != NaN
This is a special caveat of NotaNumber which allows to find out if a number is a NaN
on systems which do not have a test function available (Yes, that happens).