How to properly avoid SIGFPE and overflow on arithmetic operations - c++

I've been trying to create a Fraction class as complete as possible, to learn C++, classes and related stuff on my own. Among other things, I wanted to ensure some level of "protection" against floating point exceptions and overflows.
Objective:
Avoid overflow and floating point exceptions in arithmetic operations found in common operations, expending the least time/memory. If avoiding is not possible, then at least detect it.
Also, the idea is to not cast to some bigger type. That creates a handful of problems (like there might be no bigger type)
Cases I've found:
Overflow on +, -, *, /, pow, root
Operations are mostly straightforward (a and b are Long):
a+b: if LONG_MAX - b > a then there's overflow. (not enough. a or b might be negatives)
a-b: if LONG_MAX - a > -b then there's overflow. (Idem)
a*b: if LONG_MAX / b > a then there's overflow. (if b != 0)
a/b: might thrown SIGFPE if a << b or overflow if b << 0
pow(a,b): if (pow(LONG_MAX, 1.0/b) > a then there's overflow.
pow(a,1.0/b): Similar to a/b
Overflow on abs(x) when x = LONG_MIN (or equivalent)
This is funny. Every signed type has a range [-x-1,x] of possible values. abs(-x-1) = x+1 = -x-1 because overflow. This means there is a case where abs(x) < 0
SIGFPE with big numbers divided by -1
Found when applying numerator/gcd(numerator,denominator). Sometimes gcd returned -1 and I got a floating point exception.
Easy fixes:
On some operations is easy to check for overflow. If that's the case, I can always cast to double (with the risk of loosing precision over big integers). The idea is to find a better solution, without casting.
In Fraction arithmetics, sometimes I can do extra checking for simplifications: to solve a/b * c/d (co-primes), I can reduce to co-primes a/d and c/b first.
I can always do cascade if's asking if a or b are <0 or > 0. Not the prettiest. Besides that awful choice, I can create a function neg() that will avoid that overflow
T neg(T x){if (x > 0) return -x; else return x;},
I can take abs(x) of gcd and any similar situation (anywhere x > LONG_MIN)
I'm not sure if 2. and 3. are the best solutions, but seems good enough. I'm posting those here so maybe anyone has a better answer.
Ugliest fixes
In most operations I need to do a lot of extra operations to check and avoid overflow. Here is were I'm pretty sure I can learn a thing or two.
Example:
Fraction Fraction::operator+(Fraction f){
double lcm = max(den,f.den);
lcm /= gcd(den, f.den);
lcm *= min(den,f.den);
// a/c + b/d = [a*(lcm/d) + b*(lcm/c)] / lcm //use to create normal fractions
// a/c + b/d = [a/lcm * (lcm/c)] + [b/lcm * (lcm/d)] //use to create fractions through double
double p = (double)num;
p *= lcm / (double)den;
double q = (double)f.num;
q *= lcm / (double)f.den;
if(lcm >= LONG_MAX || (p + q) >= LONG_MAX || (p + q) <= LONG_MIN){
//cerr << "Aproximating " << num << "/" << den << " + " << f.num << "/" << f.den << endl;
p = (double)num / lcm;
p *= lcm / (double)den;
q = (double)f.num / lcm;
q *= lcm / (double)f.den;
return Fraction(p + q);
}
else
return normal(p + q, (long)lcm);
}
Which is the best way to avoid overflow on these arithmetic operations?
Edit: There are a handfull of questions in this site quite similar, but those are not the same (detect instead of avoid, unsigned instead of signed, SIGFPE in specific no-related situations).
Checking all of them I found some answers that upon modification might be usefull to give a propper answer, like:
Detect overflow in unsigned addition (not my case, I'm working with signed):
uint32_t x, y;
uint32_t value = x + y;
bool overflow = value < x; // Alternatively "value < y" should also work
Detect overflow in signed operations. This might be a bit too general, with a lot of branches, and doesn't discuss how to avoid overflow.
The CERT rules mentioned in an answer, are a good starting point, but again only discuss how to detect.
Other answers are too general and I wonder if there are any answer more specific for the cases I'm looking at.

You need to differentiate between floating point operations and integral operations.
Concerning the latter, operations on unsigned types do not normally overflow, except for division by zero which is undefined behaviour by definition IIRC. This is closely related to the fact that C(++) standard mandates a binary representation for unsigned numbers, which virtually makes them a ring.
In contrast, the C(++) standard allows for multiple implementations of signed numbers (sign+magnitude, 1's complement or, most widely used, 2's complement). So signed overflow is defined to be undefined behaviour, possibly to give compiler implementers more freedom to generate efficient code for their target machines. Also this is the reason for your worries with abs(): At least in 2's complement representation, there is no positive number that is equal in magnitude to the largest negative number in magnitude. Refer to CERT rules for elaboration.
On the floating point side SIGFPE has historically been coined for signalling floating point exceptions. However, given the variety of implementations of the arithmetic units in processors nowadays, SIGFPE should be considered a generic signal that reports arithmetic errors. For instance, the glibc reference manual gives a list of possible reasons, explicitely including integral division by zero.
It is worth noting that floating point operations as per ANSI/IEEE Std 754, which is most commonly used today I suppose, are specifically designed to be a kind of error-proof. This means that for example, when an addition overflows it gives a result of infinity and typically sets a flag that you can check later. It is perfectly legal to use this infinite value in further calculations as the floating point operations have been defined for affine arithmetic. This once was meant to allow long running computations (on slow machines) to continue even with intermediate overflows etc. Note that certain operations are forbidden even in affine arithmetic, for example dividing infinity by infinity or subtracting infinity by infinity.
So the bottom line is that floating point computations should not normally cause floating point exceptions. Yet you can have so-called traps which cause SIGFPE (or a similar mechanism) to be triggered whenever the above mentioned flags become raised.

Related

Can multiplying a pair of almost-one values ever yield a result of 1.0?

I have two floating point values, a and b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where a * b could equal one? I intend to calculate 1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be equal or smaller to a or b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there's a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed by IEEE compliant.
I have two floating point values, a and b…
Since this says we have “values,” not “variables,” it admits a possibility that 1 - a*b may evaluate to 1. When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression a that is sin(x)/x and an expression b that is 1-y*y and then ask about computing 1 - a*b when the code is actually 1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this is that the compiler uses long double arithmetic while computing expressions containing double operands or it uses a fused multiply-add instructions while computing an expression of the format x + y*z.
Suppose expressions a and b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose double were implemented with four decimal digits but a and b were computed with long double with six decimal digits. a and b could both be .999999. Then a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from long double to double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have 1 - 1.000, and the result is zero.
On the other hand, if a and b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any “extra precision.” Then, given 0 < a < 1 and 0 < b < 1, the mathematical value (that, without floating-point rounding) a•b is less than a and is less than b. Then rounding of a•b to the nominal type cannot produce a value greater than a or b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values—it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know a•b is bounded above by both a and b, rounding cannot produce any result greater than the lesser of a and b.)
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in 1 - a*b when a and b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then 1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
So, if a and b are variables, 0 < a < 1 and 0 < b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then 1 - a*b does not evaluate to zero.
There is a mathematical proof that it will never be >= 1. I don't have it handy.... you may want to ask on the math stack overflow site if you are interested in studying the proof. But your instincts are correct. It will never be >= 1.
Now, we must be careful because floating point arithmetic is only an approximation of math and has limitations. I'm not an expert on these limitations, but the floating-point standard is very carefully designed and provides certain guarantees. I'm pretty sure one of them includes (or implies) that x * y where x < 1 and y < 1 is guaranteed to be < 1.
You can check that even if using the highest float or double that is lower than 1, and multiplying by itself, the result will be lower than 1. Any multiplication of numbers lower than that must give a smaller result.
Here is the code I ran, with the results in comments:
float a = nextafterf(1, 0); // 0.999999940
double b = nextafter(1, 0); // 0.99999999999999989
float c = a * a; // 0.999999881
double d = b * b; // 0.99999999999999978

Prevent underflow when adding two logarithms

I am using the addition in log space equation described on the Wikipedia log probability article, but I am getting underflow when computing the exp of very large, negative, logarithms. As a result, my program crashes.
Example inputs are a = -2 and b = -1033.4391885529124.
My code, implemented straight from the Wikipedia article, looks like this:
double log_sum(double a, double b)
{
double min_ab = std::min(a, b);
a = std::max(a, b);
b = min_ab;
if (isinf(a) && isinf(b)) {
return -std::numeric_limits<double>::infinity();
} else if (isinf(a)) {
return b;
} else if (isinf(b)) {
return a;
} else {
return a + log2(1 + exp2(b - a));
}
}
I've come up with the following ideas, but can't decide which is best:
Check for out-of-range inputs before evaluation.
Disable (somehow) the exception, and flush or clamp the output after evaluation
Implement custom log and exp functions that do not throw exceptions and automatically flush or clamp the results.
Some other ways?
Additionally, I'd be interested to know what effect the choice of the logarithm base has on the computation. I chose base two because I believed that other log bases would be calculated from log_n(x) = log_2(x) / log_2(n), and would suffer from precision loss due to the division. Is that correct?
According to http://en.cppreference.com/w/cpp/numeric/math/exp:
For IEEE-compatible type double, overflow is guaranteed if 709.8 < arg, and underflow is guaranteed if arg < -708.4
So you can't prevent an underflow. However:
If a range error occurs due to underflow, the correct result (after rounding) is returned.
So there shouldn't be any program crash - "just" a loss of precision.
However, notice that
1 + exp(n)
will loose precision much sooner, i.e. already at n = -53. This is because the next representable number after 1.0 is 1.0 + 2^-52.
So loss of precision due to exp is far less than the precision lost when adding 1.0 + exp(...)
The problem here is accurately computing the expression log(1+exp(x)) without intermediate under/overflow. Fortunately, Martin Maechler (one of the R core developers) details how to do it in section 3 of this vignette.
He uses natural base functions: it should be possible to translate it to base-2 by appropriately scaling the functions, but it uses the log1p function in one part, and I'm not aware of any math library which supplies a base-2 variant.
The choice of base is unlikely to have any effect on accuracy (or performance), and most reasonable math libraries are able to give sub 1-ulp guarantees for both functions (i.e. you will have one of the two floating point values closest to the exact answer). A pretty common approach is to break up the floating point number into its base-2 exponent k and significand 1+f, such that 1/sqrt(2) < 1+f < sqrt(2), and then use a polynomial approximation to compute log(1+f): due to some mathematical quirks (basically, the fact that the 2nd term of the Taylor series can be represented exactly) it turns out to be more accurate to do this in the natural base rather than base-2, so a typical implementation will look like:
log(x) = k*log2 + p(f)
log2(x) = k + p(f)*invlog2
(e.g. see log and log2 in openlibm), so there is no real benefit to using one over the other.

How to catch undefined behaviour without executing it?

In my software I am using the input values from the user at run time and performing some mathematical operations. Consider for simplicity below example:
int multiply(const int a, const int b)
{
if(a >= INT_MAX || B >= INT_MAX)
return 0;
else
return a*b;
}
I can check if the input values are greater than the limits, but how do I check if the result will be out of limits? It is quite possible that a = INT_MAX - 1 and b = 2. Since the inputs are perfectly valid, it will execute the undefined code which makes my program meaningless. This means any code executed after this will be random and eventually may result in crash. So how do I protect my program in such cases?
This really comes down to what you actually want to do in this case.
For a machine where long or long long (or int64_t) is a 64-bit value, and int is a 32-bit value, you could do (I'm assuming long is 64 bit here):
long x = static_cast<long>(a) * b;
if (x > MAX_INT || x < MIN_INT)
return 0;
else
return static_cast<int>(x);
By casting one value to long, the other will have to be converted as well. You can cast both if that makes you happier. The overhead here, above a normal 32-bit multiply is a couple of clock-cycles on modern CPU's, and it's unlikely that you can find a safer solution, that is also faster. [You can, in some compilers, add attributes to the if saying that it's unlikely to encourage branch prediction "to get it right" for the common case of returning x]
Obviously, this won't work for values where the type is as big as the biggest integer you can deal with (although you could possibly use floating point, but it may still be a bit dodgy, since the precision of float is not sufficient - could be done using some "safety margin" tho' [e.g. compare to less than LONG_INT_MAX / 2], if you don't need the entire range of integers.). Penalty here is a bit worse tho', especially transitions between float and integer isn't "pleasant".
Another alternative is to actually test the relevant code, with "known invalid values", and as long as the rest of the code is "ok" with it. Make sure you test this with the relevant compiler settings, as changing the compiler options will change the behaviour. Note that your code then has to deal with "what do we do when 65536 * 100000 is a negative number", and your code didn't expect so. Perhaps add something like:
int x = a * b;
if (x < 0) return 0;
[But this only works if you don't expect negative results, of course]
You could also inspect the assembly code generated and understand the architecture of the actual processor [the key here is to understand if "overflow will trap" - which it won't by default in x86, ARM, 68K, 29K. I think MIPS has an option of "trap on overflow"], and determine whether it's likely to cause a problem [1], and add something like
#if (defined(__X86__) || defined(__ARM__))
#error This code needs inspecting for correct behaviour
#endif
return a * b;
One problem with this approach, however, is that even the slightest changes in code, or compiler version may alter the outcome, so it's important to couple this with the testing approach above (and make sure you test the ACTUAL production code, not some hacked up mini-example).
[1] The "undefined behaviour" is undefined to allow C to "work" on processors that have trapping overflows of integer math, as well as the fact that that a * b when it overflows in a signed value is of course hard to determine unless you have a defined math system (two's complement, one's complement, distinct sign bit) - so to avoid "defining" the exact behaviour in these cases, the C standard says "It's undefined". It doesn't mean that it will definitely go bad.
Specifically for the multiplication of a by b the mathematically correct way to detect if it will overflow is to calculate log₂ of both values. If their sum is higher than the log₂ of the highest representable value of the result, then there is overflow.
log₂(a) + log₂(b) < log₂(UINT_MAX)
The difficulty is to calculate quickly the log₂ of an integer. For that, there are several bit twiddling hacks that can be used, like counting bit, counting leading zeros (some processors even have instructions for that). This site has several implementations
https://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
The simplest implementation could be:
unsigned int log2(unsigned int v)
{
unsigned int r = 0;
while (v >>= 1)
r++;
return r;
}
In your program you only need to check then
if(log2(a) + log2(b) < MYLOG2UINTMAX)
return a*b;
else
printf("Overflow");
The signed case is similar but has to take care of the negative case specifically.
EDIT: My solution is not complete and has an error which makes the test more severe than necessary. The equation works in reality if the log₂ function returns a floating point value. In the implementation I limited thevalue to unsigned integers. This means that completely valid multiplication get refused. Why? Because log2(UINT_MAX) is truncated
log₂(UINT_MAX)=log₂(4294967295)≈31.9999999997 truncated to 31.
We have there for to change the implementation to replace the constant to compare to
#define MYLOG2UINTMAX (CHAR_BIT*sizeof (unsigned int))
You may try this:
if ( b > ULONG_MAX / a ) // Need to check a != 0 before this division
return 0; //a*b invoke UB
else
return a*b;

how might someone divide a double or float by multiples of two?

I know with normal integers you can divide by bitshifting to the right. I'm wondering if theres an easy way to do the same with numbers that aren't perfect integers.
ldexp, ldexpf, and ldexpl do this for doubles, floats, and long doubles respectively. Alternatively, if you have a specific power of two in mind (say, 4), it's probably best to just divide the usual way:
whatever / 4
There is a function ldexp (and siblings) that allows you to "multiply by powers of two" (including negative ones), this is not the same optimisation as using shifts for integers. For powers of two, all double values are "perfect" for both X and 1/X (because if X is 2n, then 1/X = 2-n, both of which are fine to store as a floating point number in IEEE-754 or any other binary floating point format), so there won't be any odd rounding, which means the compiler should be able to replace the divide by a multiply operation - in my experiments, it indeed does.
To manipulate the exponent of floating point values is generally detrimental to performance compared to the "apply multiply by 1/X".
The function ldexp is a few dozen instructions long in glibc, with several branches and a call in the code. It is highly unlikely you'll find any benefit from calling ldexp, as well as confusing people who don't know that x = ldexp(x, -1); is the same as x /= 2.0;.
Just use
double x = 7;
x *= 2;
x /= 2;
assert(x == 7.0);
or
double y = 5;
y = 2*x + 0.5*y;
assert(y == 16.5);
or
double z = 2.5*x;
assert(z == 17.5);
Why? Because your computer can represent all powers of two as floating point values (as long as that power does not exceed the limit of the exponent, that is), and it will do so. Consequently, all of the calculations above are precise, there is no rounding error in the constants. All the assert()s are guaranteed to succeed.
Of course, you can achieve the same effect by doing bit manipulations, but current floating point hardware can do a multiplication within a nanosecond, and it handles all special cases correctly. If you do the bit fiddling, you will either waste time, or handle the special cases incorrectly. So don't even try.

Is floating-point == ever OK?

Just today I came across third-party software we're using and in their sample code there was something along these lines:
// Defined in somewhere.h
static const double BAR = 3.14;
// Code elsewhere.cpp
void foo(double d)
{
if (d == BAR)
...
}
I'm aware of the problem with floating-points and their representation, but it made me wonder if there are cases where float == float would be fine? I'm not asking for when it could work, but when it makes sense and works.
Also, what about a call like foo(BAR)? Will this always compare equal as they both use the same static const BAR?
Yes, you are guaranteed that whole numbers, including 0.0, compare with ==
Of course you have to be a little careful with how you got the whole number in the first place, assignment is safe but the result of any calculation is suspect
ps there are a set of real numbers that do have a perfect reproduction as a float (think of 1/2, 1/4 1/8 etc) but you probably don't know in advance that you have one of these.
Just to clarify. It is guaranteed by IEEE 754 that float representions of integers (whole numbers) within range, are exact.
float a=1.0;
float b=1.0;
a==b // true
But you have to be careful how you get the whole numbers
float a=1.0/3.0;
a*3.0 == 1.0 // not true !!
There are two ways to answer this question:
Are there cases where float == float gives the correct result?
Are there cases where float == float is acceptable coding?
The answer to (1) is: Yes, sometimes. But it's going to be fragile, which leads to the answer to (2): No. Don't do that. You're begging for bizarre bugs in the future.
As for a call of the form foo(BAR): In that particular case the comparison will return true, but when you are writing foo you don't know (and shouldn't depend on) how it is called. For example, calling foo(BAR) will be fine but foo(BAR * 2.0 / 2.0) (or even maybe foo(BAR * 1.0) depending on how much the compiler optimises things away) will break. You shouldn't be relying on the caller not performing any arithmetic!
Long story short, even though a == b will work in some cases you really shouldn't rely on it. Even if you can guarantee the calling semantics today maybe you won't be able to guarantee them next week so save yourself some pain and don't use ==.
To my mind, float == float is never* OK because it's pretty much unmaintainable.
*For small values of never.
The other answers explain quite well why using == for floating point numbers is dangerous. I just found one example that illustrates these dangers quite well, I believe.
On the x86 platform, you can get weird floating point results for some calculations, which are not due to rounding problems inherent to the calculations you perform. This simple C program will sometimes print "error":
#include <stdio.h>
void test(double x, double y)
{
const double y2 = x + 1.0;
if (y != y2)
printf("error\n");
}
void main()
{
const double x = .012;
const double y = x + 1.0;
test(x, y);
}
The program essentially just calculates
x = 0.012 + 1.0;
y = 0.012 + 1.0;
(only spread across two functions and with intermediate variables), but the comparison can still yield false!
The reason is that on the x86 platform, programs usually use the x87 FPU for floating point calculations. The x87 internally calculates with a higher precision than regular double, so double values need to be rounded when they are stored in memory. That means that a roundtrip x87 -> RAM -> x87 loses precision, and thus calculation results differ depending on whether intermediate results passed via RAM or whether they all stayed in FPU registers. This is of course a compiler decision, so the bug only manifests for certain compilers and optimization settings :-(.
For details see the GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
Rather scary...
Additional note:
Bugs of this kind will generally be quite tricky to debug, because the different values become the same once they hit RAM.
So if for example you extend the above program to actually print out the bit patterns of y and y2 right after comparing them, you will get the exact same value. To print the value, it has to be loaded into RAM to be passed to some print function like printf, and that will make the difference disappear...
I'll provide more-or-less real example of legitimate, meaningful and useful testing for float equality.
#include <stdio.h>
#include <math.h>
/* let's try to numerically solve a simple equation F(x)=0 */
double F(double x) {
return 2 * cos(x) - pow(1.2, x);
}
/* a well-known, simple & slow but extremely smart method to do this */
double bisection(double range_start, double range_end) {
double a = range_start;
double d = range_end - range_start;
int counter = 0;
while (a != a + d) // <-- WHOA!!
{
d /= 2.0;
if (F(a) * F(a + d) > 0) /* test for same sign */
a = a + d;
++counter;
}
printf("%d iterations done\n", counter);
return a;
}
int main() {
/* we must be sure that the root can be found in [0.0, 2.0] */
printf("F(0.0)=%.17f, F(2.0)=%.17f\n", F(0.0), F(2.0));
double x = bisection(0.0, 2.0);
printf("the root is near %.17f, F(%.17f)=%.17f\n", x, x, F(x));
}
I'd rather not explain the bisection method used itself, but emphasize on the stopping condition. It has exactly the discussed form: (a == a+d) where both sides are floats: a is our current approximation of the equation's root, and d is our current precision. Given the precondition of the algorithm — that there must be a root between range_start and range_end — we guarantee on every iteration that the root stays between a and a+d while d is halved every step, shrinking the bounds.
And then, after a number of iterations, d becomes so small that during addition with a it gets rounded to zero! That is, a+d turns out to be closer to a then to any other float; and so the FPU rounds it to the closest representable value: to a itself. Calculation on a hypothetical machine can illustrate; let it have 4-digit decimal mantissa and some large exponent range. Then what result should the machine give to 2.131e+02 + 7.000e-3? The exact answer is 213.107, but our machine can't represent such number; it has to round it. And 213.107 is much closer to 213.1 than to 213.2 — so the rounded result becomes 2.131e+02 — the little summand vanished, rounded up to zero. Exactly the same is guaranteed to happen at some iteration of our algorithm — and at that point we can't continue anymore. We have found the root to maximum possible precision.
Addendum
No you can't just use "some small number" in the stopping condition. For any choice of the number, some inputs will deem your choice too large, causing loss of precision, and there will be inputs which will deem your choiсe too small, causing excess iterations or even entering infinite loop. Imagine that our F can change — and suddenly the solutions can be both huge 1.0042e+50 and tiny 1.0098e-70. Detailed discussion follows.
Calculus has no notion of a "small number": for any real number, you can find infinitely many even smaller ones. The problem is, among those "even smaller" ones might be a root of our equation. Even worse, some equations will have distinct roots (e.g. 2.51e-8 and 1.38e-8) — both of which will get approximated by the same answer if our stopping condition looks like d < 1e-6. Whichever "small number" you choose, many roots which would've been found correctly to the maximum precision with a == a+d — will get spoiled by the "epsilon" being too large.
It's true however that floats' exponent has finite limited range, so one actually can find the smallest nonzero positive FP number; in IEEE 754 single precision, it's the 1e-45 denorm. But it's useless! while (d >= 1e-45) {…} will loop forever with single-precision (positive nonzero) d.
At the same time, any choice of the "small number" in d < eps stopping condition will be too small for many equations. Where the root has high enough exponent, the result of subtraction of two neighboring mantissas will easily exceed our "epsilon". For example, 7.00023e+8 - 7.00022e+8 = 0.00001e+8 = 1.00000e+3 = 1000 — meaning that the smallest possible difference between numbers with exponent +8 and 6-digit mantissa is... 1000! It will never fit into, say, 1e-4. For numbers with relatively high exponent we simply have not enough precision to ever see a difference of 1e-4. This means eps = 1e-4 will be too small!
My implementation above took this last problem into account; you can see that d is halved each step — instead of getting recalculated as difference of (possibly huge in exponent) a and b. For reals, it doesn't matter; for floats it does! The algorithm will get into infinite loops with (b-a) < eps on equations with huge enough roots. The previous paragraph shows why. d < eps won't get stuck, but even then — needless iterations will be performed during shrinking d way down below the precision of a — still showing the choice of eps as too small. But a == a+d will stop exactly at precision.
Thus as shown: any choice of eps in while (d < eps) {…} will be both too large and too small, if we allow F to vary.
... This kind of reasoning may seem overly theoretical and needlessly deep, but it's to illustrate again the trickiness of floats. One should be aware of their finite precision when writing arithmetic operators around.
Perfect for integral values even in floating point formats
But the short answer is: "No, don't use ==."
Ironically, the floating point format works "perfectly", i.e., with exact precision, when operating on integral values within the range of the format. This means that you if you stick with double values, you get perfectly good integers with a little more than 50 bits, giving you about +- 4,500,000,000,000,000, or 4.5 quadrillion.
In fact, this is how JavaScript works internally, and it's why JavaScript can do things like + and - on really big numbers, but can only << and >> on 32-bit ones.
Strictly speaking, you can exactly compare sums and products of numbers with precise representations. Those would be all the integers, plus fractions composed of 1 / 2n terms. So, a loop incrementing by n + 0.25, n + 0.50, or n + 0.75 would be fine, but not any of the other 96 decimal fractions with 2 digits.
So the answer is: while exact equality can in theory make sense in narrow cases, it is best avoided.
The only case where I ever use == (or !=) for floats is in the following:
if (x != x)
{
// Here x is guaranteed to be Not a Number
}
and I must admit I am guilty of using Not A Number as a magic floating point constant (using numeric_limits<double>::quiet_NaN() in C++).
There is no point in comparing floating point numbers for strict equality. Floating point numbers have been designed with predictable relative accuracy limits. You are responsible for knowing what precision to expect from them and your algorithms.
It's probably ok if you're never going to calculate the value before you compare it. If you are testing if a floating point number is exactly pi, or -1, or 1 and you know that's the limited values being passed in...
I also used it a few times when rewriting few algorithms to multithreaded versions. I used a test that compared results for single- and multithreaded version to be sure, that both of them give exactly the same result.
Let's say you have a function that scales an array of floats by a constant factor:
void scale(float factor, float *vector, int extent) {
int i;
for (i = 0; i < extent; ++i) {
vector[i] *= factor;
}
}
I'll assume that your floating point implementation can represent 1.0 and 0.0 exactly, and that 0.0 is represented by all 0 bits.
If factor is exactly 1.0 then this function is a no-op, and you can return without doing any work. If factor is exactly 0.0 then this can be implemented with a call to memset, which will likely be faster than performing the floating point multiplications individually.
The reference implementation of BLAS functions at netlib uses such techniques extensively.
In my opinion, comparing for equality (or some equivalence) is a requirement in most situations: standard C++ containers or algorithms with an implied equality comparison functor, like std::unordered_set for example, requires that this comparator be an equivalence relation (see C++ named requirements: UnorderedAssociativeContainer).
Unfortunately, comparing with an epsilon as in abs(a - b) < epsilon does not yield an equivalence relation since it loses transitivity. This is most probably undefined behavior, specifically two 'almost equal' floating point numbers could yield different hashes; this can put the unordered_set in an invalid state.
Personally, I would use == for floating points most of the time, unless any kind of FPU computation would be involved on any operands. With containers and container algorithms, where only read/writes are involved, == (or any equivalence relation) is the safest.
abs(a - b) < epsilon is more or less a convergence criteria similar to a limit. I find this relation useful if I need to verify that a mathematical identity holds between two computations (for example PV = nRT, or distance = time * speed).
In short, use == if and only if no floating point computation occur;
never use abs(a-b) < e as an equality predicate;
Yes. 1/x will be valid unless x==0. You don't need an imprecise test here. 1/0.00000001 is perfectly fine. I can't think of any other case - you can't even check tan(x) for x==PI/2
The other posts show where it is appropriate. I think using bit-exact compares to avoid needless calculation is also okay..
Example:
float someFunction (float argument)
{
// I really want bit-exact comparison here!
if (argument != lastargument)
{
lastargument = argument;
cachedValue = very_expensive_calculation (argument);
}
return cachedValue;
}
I would say that comparing floats for equality would be OK if a false-negative answer is acceptable.
Assume for example, that you have a program that prints out floating points values to the screen and that if the floating point value happens to be exactly equal to M_PI, then you would like it to print out "pi" instead. If the value happens to deviate a tiny bit from the exact double representation of M_PI, it will print out a double value instead, which is equally valid, but a little less readable to the user.
I have a drawing program that fundamentally uses a floating point for its coordinate system since the user is allowed to work at any granularity/zoom. The thing they are drawing contains lines that can be bent at points created by them. When they drag one point on top of another they're merged.
In order to do "proper" floating point comparison I'd have to come up with some range within which to consider the points the same. Since the user can zoom in to infinity and work within that range and since I couldn't get anyone to commit to some sort of range, we just use '==' to see if the points are the same. Occasionally there'll be an issue where points that are supposed to be exactly the same are off by .000000000001 or something (especially around 0,0) but usually it works just fine. It's supposed to be hard to merge points without the snap turned on anyway...or at least that's how the original version worked.
It throws of the testing group occasionally but that's their problem :p
So anyway, there's an example of a possibly reasonable time to use '=='. The thing to note is that the decision is less about technical accuracy than about client wishes (or lack thereof) and convenience. It's not something that needs to be all that accurate anyway. So what if two points won't merge when you expect them to? It's not the end of the world and won't effect 'calculations'.