Floating point equality and tolerances - C++

Comparing two floating point numbers with something like a_float == b_float is asking for trouble, since a_float / 3.0 * 3.0 might not be equal to a_float due to round-off error.
What one normally does is something like fabs(a_float - b_float) < tol.
How does one calculate tol?
Ideally the tolerance should be just larger than the value of the last one or two least significant figures. So if single precision floating point numbers are in use, tol = 10E-6 should be about right. However, this does not work well for the general case, where a_float might be very small or might be very large.
How does one calculate tol correctly for all general cases? I am interested in C or C++ cases specifically.

This blog post contains an example, a fairly foolproof implementation, and the detailed theory behind it:
http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
It is also one of a series, so you can always read more.
In short: use ULP-based comparison for most numbers, use an epsilon for numbers near zero, but there are still caveats. If you want to be sure about your floating point math, I recommend reading the whole series.
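A minimal sketch of that hybrid idea, using a relative epsilon in place of a true ULP count; the function name and both tolerance values are illustrative placeholders, not canonical constants:
#include <algorithm> // std::max
#include <cmath>     // std::fabs
#include <limits>

// Absolute epsilon for numbers near zero, relative tolerance elsewhere.
bool almost_equal(double a, double b,
                  double abs_tol = 1e-12,
                  double rel_tol = 4 * std::numeric_limits<double>::epsilon())
{
    double diff = std::fabs(a - b);
    if (diff <= abs_tol) // covers a and b both close to zero
        return true;
    double largest = std::max(std::fabs(a), std::fabs(b));
    return diff <= largest * rel_tol; // relative, ULP-like comparison
}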

As far as I know, one doesn't.
There is no general "right answer", since it can depend on the application's requirement for precision.
For instance, a 2D physics simulation working in screen-pixels might decide that 1/4 of a pixel is good enough, while a 3D CAD system used to design nuclear plant internals might not.
I can't see a way to programmatically decide this from the outside.

The C header file <float.h> gives you the constants FLT_EPSILON and DBL_EPSILON, the difference between 1.0 and the smallest number larger than 1.0 that a float/double can represent. You can scale that by the magnitude of your numbers and the rounding error you wish to tolerate:
#include <float.h>
#include <math.h> /* frexp, ldexp */

#ifndef DBL_TRUE_MIN
/* DBL_TRUE_MIN is a common non-standard extension for the minimum denormal value.
 * DBL_MIN is the minimum non-denormal value -- use that if TRUE_MIN is not defined. */
#define DBL_TRUE_MIN DBL_MIN
#endif

/* return the difference between |x| and the next larger representable double */
double dbl_epsilon(double x) {
    int exp;
    if (frexp(x, &exp) == 0.0)
        return DBL_TRUE_MIN;
    return ldexp(DBL_EPSILON, exp - 1);
}
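A hypothetical way to use it, scaling the comparison to the larger operand; the factor of 4 is an arbitrary illustration of "a few representable steps of slack", not a recommended constant:
int dbl_near(double a, double b) {
    double big = fabs(a) > fabs(b) ? a : b; /* larger-magnitude operand */
    return fabs(a - b) <= 4.0 * dbl_epsilon(big);
}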

Welcome to the world of traps, snares and loopholes. As mentioned elsewhere, a general purpose solution for floating point equality and tolerances does not exist. Given that, there are tools and axioms that a programmer may use in select cases.
fabs(a_float - b_float) < tol has the shortcoming the OP mentioned: it "does not work well for the general case where a_float might be very small or might be very large." fabs(a_float - ref_float) <= fabs(ref_float * tol) copes with the varying ranges much better.
The OP's suggested tol = 10E-6 for single precision is a bit worrisome, for C and C++ so easily promote float arithmetic to double, and then it is the "tolerance" of double, not float, that comes into play. Consider float f = 1.0; printf("%.20f\n", f/7.0); Many new programmers do not realize that the 7.0 caused a double precision calculation. Recommend using double throughout your code except where large amounts of data need float's smaller size.
C99 provides nextafter(), which can be useful in helping to gauge "tolerance". Using it, one can determine the next representable number. This helps with the OP's "... the full number of significant digits for the storage type minus one ... to allow for roundoff error." if (nextafter(x, -INFINITY) <= y && y <= nextafter(x, +INFINITY)) ...
The kind of tol or "tolerance" used is often the crux of the matter. Most often (IMHO) a relative tolerance is important, e.g. "Are x and y within 0.0001%?" Sometimes an absolute tolerance is needed, e.g. "Are x and y within 0.0001?"
The value of the tolerance is often debatable, for the best value is situation dependent. Comparing within 0.01 may work for a financial application for dollars but not for yen. (Hint: be sure to use a coding style that allows easy updates.)
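A sketch of the two kinds of test described above; the function names and sample tolerances are made up for illustration:
/* relative: "is x within 0.0001% of ref?" -- pass rel_tol = 1e-6 */
int within_relative(double x, double ref, double rel_tol) {
    return fabs(x - ref) <= fabs(ref * rel_tol);
}

/* absolute: "are x and y within 0.0001?" -- pass abs_tol = 0.0001 */
int within_absolute(double x, double y, double abs_tol) {
    return fabs(x - y) <= abs_tol;
}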

Rounding error varies according to the values used in the operations.
Instead of a fixed tolerance, you can use a factor of epsilon, like:
#include <cmath>  // std::nextafter
#include <limits> // std::numeric_limits

bool nearly_equal(double a, double b, int factor /* a factor of epsilon */)
{
    double min_a = a - (a - std::nextafter(a, std::numeric_limits<double>::lowest())) * factor;
    double max_a = a + (std::nextafter(a, std::numeric_limits<double>::max()) - a) * factor;
    return min_a <= b && max_a >= b;
}
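Possible usage (the classic 0.1 + 0.2 case: the sum lands one representable double above 0.3):
#include <cassert>

int main()
{
    double a = 0.1 + 0.2;            // 0.30000000000000004...
    assert(!(a == 0.3));             // exact comparison fails
    assert(nearly_equal(a, 0.3, 2)); // but 0.3 lies within 2 epsilon-steps of a
}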

Although the value of the tolerance depends on the situation, if you are looking for a precision comparison you could use as tolerance the machine epsilon value, std::numeric_limits<T>::epsilon() (from the <limits> header). The function returns the difference between 1 and the smallest value greater than 1 that is representable for the data type.
http://msdn.microsoft.com/en-us/library/6x7575x3.aspx
The value of epsilon differs if you are comparing floats or doubles. For instance, in my computer, if comparing floats the value of epsilon is 1.1920929e-007 and if comparing doubles the value of epsilon is 2.2204460492503131e-016.
For a relative comparison between x and y, multiply the epsilon by the maximum absolute value of x and y.
The result above can then be multiplied by a number of ULPs (units in the last place), which allows you to tune the precision.
#include <algorithm> // std::max
#include <cmath>     // std::abs
#include <iostream>
#include <limits>    // std::numeric_limits

template<class T> bool are_almost_equal(T x, T y, int ulp)
{
    return std::abs(x - y) <= std::numeric_limits<T>::epsilon() * std::max(std::abs(x), std::abs(y)) * ulp;
}
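For example (the sum of ten 0.1s lands one representable step below 1.0):
int main()
{
    double sum = 0.0;
    for (int i = 0; i < 10; ++i) sum += 0.1;
    std::cout << (sum == 1.0) << '\n';                  // 0: exact comparison fails
    std::cout << are_almost_equal(sum, 1.0, 2) << '\n'; // 1: equal within 2 scaled epsilons
}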

When I need to compare floats, I use code like this:
#include <cmath> // fabs

bool same(double a, double b, double error) {
    double x;
    if (a == 0) {
        x = b; // fall back to an absolute comparison
    } else if (b == 0) {
        x = a;
    } else {
        x = (a - b) / a; // relative error with respect to a
    }
    return fabs(x) < error;
}
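For example, with a relative error bound of one part in a thousand (the variable names are illustrative):
if (same(measured, expected, 0.001)) {
    // values agree to within 0.1% of measured (the first argument)
}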

What can std::numeric_limits<double>::epsilon() be used for?

#include <cmath>  // log, ceil
#include <limits> // std::numeric_limits

unsigned int updateStandardStopping(unsigned int numInliers, unsigned int totPoints, unsigned int sampleSize)
{
    double max_hypotheses_ = 85000;
    double n_inliers = 1.0;
    double n_pts = 1.0;
    double conf_threshold_ = 0.95;
    for (unsigned int i = 0; i < sampleSize; ++i)
    {
        n_inliers *= numInliers - i; // n_inliers = I(I-1)...(I-m+1)
        n_pts *= totPoints - i;      // n_pts = N(N-1)(N-2)...(N-m+1)
    }
    double prob_good_model = n_inliers / n_pts;
    if ( prob_good_model < std::numeric_limits<double>::epsilon() )
    {
        return (unsigned int) max_hypotheses_;
    }
    else if ( 1 - prob_good_model < std::numeric_limits<double>::epsilon() )
    {
        return 1;
    }
    else
    {
        double nusample_s = log(1 - conf_threshold_) / log(1 - prob_good_model);
        return (unsigned int) ceil(nusample_s);
    }
}
Here is a selection statement:
if ( prob_good_model < std::numeric_limits<double>::epsilon() )
{...}
To my understanding, the judgement statement is the same as (or an approximation to)
prob_good_model < 0
So: am I right about that, and where else can std::numeric_limits<double>::epsilon() be used?
The point of epsilon is to make it (fairly) easy for you to figure out the smallest difference you could see between two numbers.
You don't normally use it exactly as-is though. You need to scale it based on the magnitudes of the numbers you're comparing. If you have two numbers around 1e-100, then you'd use something on the order of: std::numeric_limits<double>::epsilon() * 1.0e-100 as your standard of comparison. Likewise, if your numbers are around 1e+100, your standard would be std::numeric_limits<double>::epsilon() * 1e+100.
If you try to use it without scaling, you can get drastically incorrect (utterly meaningless) results. For example:
if (std::abs(1e-100 - 1e-200) < std::numeric_limits<double>::epsilon())
Yup, that's going to show up as "true" (i.e., saying the two are equal) even though they differ by 100 orders of magnitude. In the other direction, if the numbers are much larger than 1, comparing to an (unscaled) epsilon is equivalent to saying if (x != y)--it leaves no room for rounding errors at all.
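A sketch of the scaling described; the function name is made up for illustration:
#include <algorithm>
#include <cmath>
#include <limits>

bool scaled_equal(double x, double y)
{
    double scale = std::max(std::fabs(x), std::fabs(y));
    return std::fabs(x - y) < std::numeric_limits<double>::epsilon() * scale;
}
// scaled_equal(1e-100, 1e-200) is now false, as it should be, while
// adjacent representable values near 1e+100 still compare equal.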
At least in my experience, the epsilon specified for the floating point type isn't often of a great deal of use though. With proper scaling, it tells you the smallest difference there could possibly be between two numbers of a given magnitude (for a particular floating point implementation).
In practice, however, it's of relatively little use. A more realistic number will typically be based on the precision of the inputs, and an estimate of the amount of precision you're likely to have lost due to rounding (and such).
For example, let's assume you started with values measured to a precision of 1 part per million, and you did only a few calculations, so you believe you may have lost as much as 2 digits of precision due to rounding errors. In this case, the "epsilon" you care about is roughly 1e-4, scaled to the magnitude of the numbers you're dealing with. That is to say, under those circumstances, you can expect on the order of 4 digits of precision to be meaningful, so if you see a difference in the first four digits, it probably means the values aren't equal, but if they differ only in the fifth (or later) digits, you should probably treat them as equal.
The fact that the floating point type you're using can represent (for example) 16 digits of precision doesn't mean that every measurement you use is going to be nearly that precise--in fact, it's relatively rare that anything based on physical measurements has any hope of being even close to that precise. It does, however, give you a limit on what you can hope for from a calculation--even if you start with a value that's precise to, say, 30 digits, the most you can hope for after calculation is going to be defined by std::numeric_limits<T>::epsilon.
It can be used in situations where a function is undefined, but you still need a value at that point. You lose a bit of accuracy, especially in extreme cases, but sometimes that's alright.
Say you're using 1/x somewhere, but your range of x is [0, n[. You can use 1/(x + std::numeric_limits<double>::epsilon()) instead, so that x == 0 is still defined. That being said, you have to be careful with how the value is used; it might not work for every case.
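For instance (a sketch; the function name is hypothetical, and whether the bias is acceptable depends entirely on the use):
#include <limits>

double safe_inverse(double x)
{
    // defined at x == 0, slightly biased everywhere else
    return 1.0 / (x + std::numeric_limits<double>::epsilon());
}
// safe_inverse(0.0) is roughly 4.5e15 instead of infinity.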

Does casting `std::floor()` and `std::ceil()` to integer type always give the correct result?

I am being paranoid that one of these functions may give an incorrect result like this:
std::floor(2000.0 / 1000.0) --> std::floor(1.999999999999) --> 1
or
std::ceil(18 / 3) --> std::ceil(6.000000000001) --> 7
Can something like this happen? If there is indeed a risk like this, I'm planning to use the functions below in order to work safely. But, is this really necessary?
constexpr long double EPSILON = 1e-10;

intmax_t GuaranteedFloor(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::floor(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::floor(Number) - EPSILON);
    }
}

intmax_t GuaranteedCeil(const long double & Number)
{
    if (Number > 0)
    {
        return static_cast<intmax_t>(std::ceil(Number) + EPSILON);
    }
    else
    {
        return static_cast<intmax_t>(std::ceil(Number) - EPSILON);
    }
}
(Note: I'm assuming that the given 'long double' argument will fit in the 'intmax_t' return type.)
People often get the impression that floating point operations produce results with small, unpredictable, quasi-random errors. This impression is incorrect.
Floating point arithmetic computations are as exact as possible. 18/3 will always produce exactly 6. The result of 1/3 won't be exactly one third, but it will be the closest number to one third that is representable as a floating point number.
So the examples you showed are guaranteed to always work. As for your suggested "guaranteed floor/ceil", it's not a good idea. Certain sequences of operations can easily blow the error far above 1e-10, and certain other use cases will require 1e-10 to be correctly recognized (and ceil'ed) as nonzero.
As a rule of thumb, hardcoded epsilon values are bugs in your code.
In the specific examples you're listing, I don't think those errors would ever occur.
std::floor(2000.0 /*Exactly Representable in 32-bit or 64-bit Floating Point Numbers*/ / 1000.0 /*Also exactly representable*/) --> std::floor(2.0 /*Exactly Representable*/) --> 2
std::ceil(18 / 3 /*both treated as ints, might not even compile if ceil isn't properly overloaded....?*/) --> 6
std::ceil(18.0 /*Exactly Representable*/ / 3.0 /*Exactly Representable*/) --> 6
Having said that, if you have math that depends on these functions behaving exactly correctly for floating point numbers, that may illuminate a design flaw you need to reconsider/reexamine.
As long as the floating-point values x and y exactly represent integers within the limits of the type you're using, there's no problem--x / y will always yield a floating-point value that exactly represents the integer result. Casting to int as you're doing will always work.
However, once the floating-point values go outside the integer-representable range for the type (Representing integers in doubles), epsilons don't help.
Consider this example. 16777217 is the smallest integer not exactly representable as a 32-bit float:
int ix=16777217, iy=97;
printf("%d / %d = %d", ix, iy, ix/iy);
// yields "16777217 / 97 = 172961" which is accurate
float x=ix, y=iy;
printf("%f / %f = %f", x, y, x/y);
// yields "16777216.000000 / 97.000000 = 172960.989691"
In this case, the error is negative; in other cases (try 16777219 / 1549), the error is positive.
While it's tempting to add an epsilon to make floor work, it won't extend the accuracy much. When the values differ by many orders of magnitude, the error becomes greater than 1 and integer accuracy can't be guaranteed. Specifically, once x/y exceeds the largest integer the type can represent exactly, the error can exceed 1.0, so an epsilon is no help.
If this is coming into play, you will have to consider changing your mathematical approach--order of operations, work with logarithms, etc.
Such results are likely to appear when working with doubles. You can use std::round, or you can subtract 0.5 and then use std::ceil.

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that 100.0/4.0 evaluates to say 5.00001 and then ceil will round it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so the actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!
I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.
You may want to write an explicit function for your case. e.g.:
#include <cmath> /* ceil, sqrt */

/* return the smallest positive integer whose square is at least x */
int isqrt(double x) {
    int y1 = ceil(sqrt(x));
    int y2 = y1 - 1;
    if ((y2 * y2) >= x) return y2;
    return y1;
}
This will handle the odd case where the square root of your ratio a/b lands within a rounding error of an exact integer.
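For example, assertions illustrating both sides of the guard:
#include <cassert>

int main()
{
    // If sqrt(25.0) erred high, ceil would give 6; the y1 - 1 check
    // repairs that to 5. If it erred low, ceil already gives 5.
    assert(isqrt(25.0) == 5);
    assert(isqrt(26.0) == 6); // 5*5 = 25 < 26, so 6 is the answer
}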
Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer values.
If you have the case of 100.0/4.0, it should evaluate to exactly 25.0, since 25.0 is exactly representable as a float, as opposed to e.g. 25.1.
Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
#include <cmath> // ceil, floor, sqrt

int min_dimension_greater_than(int items, int buckets)
{
    double target = double(items) / buckets;
    int min_square = ceil(target);
    int dim = floor(sqrt(target));
    int square = dim * dim;
    while (square < min_square) {
        dim += 1; // was "seed += 1" -- dim is the variable being grown
        square = dim * dim;
    }
    return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.
s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).

C++ Should this be easier?

long-time listener, first-time caller. I am relatively new to programming and was looking back at some of the code I wrote for an old lab. Is there an easier way to tell if a double is evenly divisible by an integer?
double num = /* whatever */;
int divisor = /* an integer */;
bool bananas = true;
if (floor(num) != num || static_cast<int>(num) % divisor != 0) {
    bananas = false;
}
if (bananas == true) {
    // do stuff;
}
The question is strange, and the checks are as well. It makes little sense to speak about divisibility of a floating point number, because floating point numbers are represented imprecisely in binary, and divisibility is about exactness.
I encourage you to read this article, by David Goldberg: What Every Computer Scientist Should Know About Floating Point Arithmetic. It is a bit long-winded, so you may appreciate this website, instead: The Floating-Point Guide.
The truth is that floor(num) == num is a strange piece of code.
num is a double
floor(num) returns a double, close to an int
The trouble is that this does not check what you really wanted. For example, suppose (for the sake of example) that 5 cannot be represented exactly as a double, therefore, instead of storing 5, the computer will store 4.999999999999.
double num = 5; // 4.999999999999999
double floored = floor(num); // 4.0
assert(num != floored);
In general exact comparisons are meaningless for floating point numbers, because of rounding errors.
If you insist on using floor, I suggest using floor(num + 0.5), which is better, though slightly biased. A better rounding method is banker's rounding, because it is unbiased, and the article references others if you wish. Note that banker's rounding is what rint and nearbyint perform under the default rounding mode; round, by contrast, rounds halfway cases away from zero...
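To see where the halfway cases land (a small sketch, assuming the default round-to-nearest mode; std::round and std::rint need C++11):
#include <cmath>
#include <cstdio>

int main()
{
    std::printf("%g %g\n", std::floor(0.5 + 0.5), std::floor(1.5 + 0.5)); // 1 2: always up
    std::printf("%g %g\n", std::round(0.5), std::round(1.5));             // 1 2: away from zero
    std::printf("%g %g\n", std::rint(0.5), std::rint(1.5));               // 0 2: banker's, to even
}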
As for your question: first you need a double-aware modulo, fmod; then you need to remember the "avoid exact comparisons" bit.
A first (naive) attempt:
// divisor is deemed non-zero
// epsilon is a constant
double mod = fmod(num, divisor); // divisor will be converted to a double
if (mod <= epsilon) { }
Unfortunately it fails one important test: the magnitude of mod depends on the magnitude of divisor, thus if divisor is smaller than epsilon to begin with, it will always be true.
A second attempt:
// divisor is deemed non-zero
double const epsilon = divisor / 1000.0;
double mod = fmod(num, divisor);
if (mod <= epsilon) { }
Better, but not quite there: mod and epsilon are signed! Yes, it's a bizarre modulo; the sign of mod is the sign of num.
A third attempt:
// divisor is deemed non-zero
double const eps = fabs(divisor / 1000.0);
double mod = fabs(fmod(num, divisor));
if (mod <= eps) { }
Much better.
Should work fairly well too if divisor comes from an integer, as there won't be precision issues... or at least not too much.
EDIT: fourth attempt, by @ybungalobill
The previous attempt does not deal well with situations where num/divisor errs on the wrong side, like fmod(1.999, 1.000) --> 0.999: it's nearly divisor, so we should indicate equality, yet it failed.
// divisor is deemed non-zero
double mod = fabs(fmod(num / divisor, 1));
if (mod <= 0.001 || fabs(1 - mod) <= 0.001) { }
Looks like a never ending task eh ?
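For reference, the fourth attempt wrapped into a self-contained helper; the 0.001 tolerance is an illustrative placeholder, not a universal constant:
#include <math.h>

/* true when num differs from a multiple of divisor by at most 0.1% of divisor */
int is_near_multiple(double num, double divisor)
{
    double mod = fabs(fmod(num / divisor, 1));
    return mod <= 0.001 || fabs(1 - mod) <= 0.001;
}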
There is still cause for troubles though.
double has a limited precision, that is, a limited number of representable digits (about 15-17 significant decimal digits for a 64-bit IEEE double). This precision might be insufficient to represent an integer exactly:
Integer n = 12345678901234567890;
double d = n; // 1.2345678901234567 * 10^19
This truncation means it is impossible to map it back to its original value. It should not cause any issue between double and int; for example, on my platform double is 8 bytes and int is 4 bytes, so every int is exactly representable, but changing double to float or int to long could violate this assumption. Oh hell!
Are you sure you really need floating point, by the way ?
Based on the above comments, I believe you can do this...
double num = /* whatever */;
int divisor = /* an integer */;
if (fmod(num, divisor) == 0) {
    // do stuff;
}
I haven't checked it but why not do this?
if (floor(num) == num && !(static_cast<int>(num) % divisor)) {
    // do stuff...
}

Dealing with accuracy problems in floating-point numbers

I was wondering if there is a way of overcoming an accuracy problem that seems to be the result of my machine's internal representation of floating-point numbers:
For the sake of clarity the problem is summarized as:
// str is "4.600"; atof( str ) is 4.5999999999999996
double mw = atof( str );
// The variables used in the columns calculation below are:
//
// mw = 4.5999999999999996
// p = 0.2
// g = 0.2
// h = 1 (integer)
int columns = (int) ( ( mw - ( h * 11 * p ) ) / ( ( h * 11 * p ) + g ) ) + 1;
Prior to casting to an integer type the result of the columns calculation is 1.9999999999999996; so near yet so far from the desired result of 2.0.
Any suggestions most welcome.
When you use floating point arithmetic, strict equality is almost meaningless. You usually want to compare against a range of acceptable values.
Note that some values can not be represented exactly as floating point values.
See What Every Computer Scientist Should Know About Floating-Point Arithmetic and Comparing floating point numbers.
There's no accuracy problem.
The result you got (1.9999999999999996) differed from the mathematical result (2) by a margin of about 4E-16. That's quite accurate, considering your input "4.600".
You do have a rounding problem, of course. The default float-to-integer conversion in C++ is truncation; you want something similar to Kip's solution. Details depend on your exact domain: do you expect round(-x) == -round(x)?
If you haven't read it, the title of this paper is really correct. Please consider reading it, to learn more about the fundamentals of floating-point arithmetic on modern computers, some pitfalls, and explanations as to why they behave the way they do.
A very simple and effective way to round a floating point number to an integer:
int rounded = (int)(f + 0.5);
Note: this only works if f is always positive. (thanks j random hacker)
If accuracy is really important then you should consider using double precision floating point numbers rather than single precision floats. Though from your question it does appear that you already are. However, you still have a problem with checking for specific values. You need code along the lines of (assuming you're checking your value against zero):
if (fabs(value) < epsilon)
{
    // Do Stuff
}
where "epsilon" is some small, but non zero value.
On computers, floating point numbers are never exact. They are always just a close approximation. (1e-16 is close.)
Sometimes there are hidden bits you don't see. Sometimes basic rules of algebra no longer apply: (a*b)*c != a*(b*c). Sometimes comparing a register to memory shows up these subtle differences. Or using a math coprocessor vs a runtime floating point library. (I've been doing this waayyy tooo long.)
C99 defines these (look in <math.h>):
double round(double x);
float roundf(float x);
long double roundl(long double x);
Or you can roll your own:
template<class TYPE> inline int ROUND(const TYPE & x)
{ return int( (x > 0) ? (x + 0.5) : (x - 0.5) ); }
For floating point equivalence, try:
template<class TYPE> inline TYPE ABS(const TYPE & t)
{ return t>=0 ? t : - t; }
template<class TYPE> inline bool FLOAT_EQUIVALENT(
const TYPE & x, const TYPE & y, const TYPE & epsilon )
{ return ABS(x-y) < epsilon; }
Use decimals: decNumber++
You can read this paper to find what you are looking for.
You can compare the absolute value of the difference against a small tolerance, as seen here:
double x = 0.2;
double y = 0.3;
bool equal = (std::fabs(x - y) < 0.000001);