int division rounding library? - c++

Does anyone know of an open-source C or C++ library with functions implementing every integer division mode one might want? Possible behaviors (for positive result):
round_down, round_up,
round_to_nearest_with_ties_rounding_up,
round_to_nearest_with_ties_rounding_down,
round_to_nearest_with_ties_rounding_to_even,
round_to_nearest_with_ties_rounding_to_odd
with each (aside from round-to-even and round-to-odd) having two variants
// (round relative to 0; -divide(-x, y) == divide(x, y))
negative_mirrors_positive,
// (round relative to -Infinity; divide(x + C*y, y) == divide(x, y) + C)
negative_continuous_with_positive
.
I know how to write it, but surely someone has done so already?
As an example, if we assume (as is common and is mandated in C++11) that built-in signed integral division rounds towards zero, and that built-in modulus is consistent with this, then
int divide_rounding_up_with_negative_mirroring_positive(int dividend, int divisor) {
    // div+mod is often a single machine instruction.
    const int quotient = dividend / divisor;
    const int remainder = dividend % divisor;
    // This ?:'s condition equals whether quotient is positive,
    // but we compute it without depending on quotient for speed
    // (instruction-level parallelism with the divide).
    const int adjustment = (((dividend < 0) == (divisor < 0)) ? 1 : -1);
    if (remainder != 0) {
        return quotient + adjustment;
    } else {
        return quotient;
    }
}
Bonus points: work for multiple argument types; fast; optionally return modulus as well; do not overflow for any argument values (except division by zero and MIN_INT/-1, of course).
If I don't find such a library, I'll write one in C++11, release it, and link to it in an answer here.

So, I wrote something. The implementation is typically ugly template and bitwise code, but it works well. Usage:
divide(dividend, divisor, rounding_strategy<...>())
where rounding_strategy<round_up, negative_mirrors_positive> is an example strategy; see list of variants in my question or in the source code. https://github.com/idupree/Lasercake/blob/ee2ce96d33cad10d376c6c5feb34805ab44862ac/data_structures/numbers.hpp#L80
depending only on C++11 [*], with unit tests (using Boost Test framework) starting at https://github.com/idupree/Lasercake/blob/ee2ce96d33cad10d376c6c5feb34805ab44862ac/tests/misc_utils_tests.cpp#L38
It is polymorphic, decent speed, and does not overflow, but doesn't currently return modulus.
[*] (and on boost::make_signed and boost::enable_if_c, which are trivial to replace with std::make_signed and std::enable_if, and on our caller_error_if() which can be replaced with assert() or if(..){throw ..} or deleted. You can ignore and delete the rest of the file assuming you're not interested in the other things there.)
Each divide_impl's code can be adapted to C by replacing each T with e.g. int and T(CONSTANT) with CONSTANT. In the case of the round_to_nearest_* variant, you'd either want to make the rounding kind be a runtime argument or create six copies of the code (one for each distinct rounding variation it handles). The code relies on '/' rounding towards zero, which is common and also specified by C11 (std draft N1570 6.5.5.6) as well as C++11. For C89/C++98 compatibility, it could use stdlib.h div()/ldiv() which are guaranteed to round towards zero (see http://www.linuxmanpages.com/man3/div.3.php , http://en.cppreference.com/w/cpp/numeric/math/div )
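For illustration, here is a minimal sketch of that C89/C++98-portable route written against std::div (the function name is mine, not part of the library above); it implements the same round-up-with-mirroring behavior as the example in the question:
#include <cstdlib>  // std::div, std::div_t

// Sketch only: std::div's quotient is guaranteed to truncate towards zero,
// even under C89/C++98 rules, so no assumption about '/' is needed.
int divide_rounding_up_mirrored(int dividend, int divisor) {
    const std::div_t qr = std::div(dividend, divisor);
    if (qr.rem != 0) {
        // Same signs => positive quotient => round up; otherwise mirror downwards.
        return qr.quot + (((dividend < 0) == (divisor < 0)) ? 1 : -1);
    }
    return qr.quot;
}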

Related

Is it possible to test whether a type supports negative zero in C++ at compile time?

Is there a way to write a type trait to determine whether a type supports negative zero in C++ (including integer representations such as sign-and-magnitude)? I don't see anything that directly does that, and std::signbit doesn't appear to be constexpr.
To clarify: I'm asking because I want to know whether this is possible, regardless of what the use case might be, if any.
Unfortunately, I cannot imagine a way to do that. The fact is that the C standard considers type representations not to be the programmer's concern (*); they exist only to tell implementors what they must do.
As a programmer all you have to know is that:
two's complement is not the only possible representation for negative integers
a negative zero may exist
an arithmetic operation on integers cannot produce a negative zero; only bitwise operations can
(*) Opinion here: Knowing the internal representation could lead programmers to use the good old optimizations that blindly ignored the strict aliasing rule. If you see a type as an opaque object that can only be used in standard operations, you will have fewer portability questions...
The best one can do is to rule out the possibility of signed zero at compile time, but never be completely positive about its existence at compile time. The C++ standard goes a long way to prevent checking binary representation at compile time:
reinterpret_cast<char*>(&value) is forbidden in constexpr.
using union types to circumvent the above rule in constexpr is also forbidden.
Operations on zero and negative zero of integer types behave exactly the same per the C++ standard, with no way to differentiate them.
For floating-point operations, division by zero is forbidden in a constant expression, so testing 1/0.0 != 1/-0.0 is out of the question.
The only thing one can test is whether the domain of an integer type is dense enough to rule out a signed zero:
template<typename T>
constexpr bool test_possible_signed_zero()
{
    using limits = std::numeric_limits<T>;
    if constexpr (std::is_fundamental_v<T> &&
                  limits::is_exact &&
                  limits::is_integer) {
        auto low = limits::min();
        auto high = limits::max();
        T carry = 1;
        // This is one of the simplest ways to check that
        // the max() - min() + 1 == 2 ** bits
        // without stepping out into undefined behavior.
        for (auto bits = limits::digits ; bits > 0 ; --bits) {
            auto adder = low % 2 + high % 2 + carry;
            if (adder % 2 != 0) return true;
            carry = adder / 2;
            low /= 2;
            high /= 2;
        }
        return false;
    } else {
        return true;
    }
}
template <typename T>
class is_possible_signed_zero:
    public std::integral_constant<bool, test_possible_signed_zero<T>()>
{};

template <typename T>
constexpr bool is_possible_signed_zero_v = is_possible_signed_zero<T>::value;
It is only guaranteed that if this trait returns false then no signed zero is possible. This assurance is very weak, but I can't see any stronger assurance. Also, it says nothing constructive about floating point types. I could not find any reasonable way to test floating point types.
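As a usage sketch (assuming the trait above is visible), one could at least assert the negative case at compile time; std::int32_t is required to be two's complement with no padding bits, so the trait should be able to rule a signed zero out there:
#include <cstdint>

// Hypothetical usage: fail the build if a signed zero cannot be ruled out.
static_assert(!is_possible_signed_zero_v<std::int32_t>,
              "cannot rule out a negative zero for std::int32_t");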
Somebody's going to come by and point out this is all-wrong standards-wise.
Anyway, decimal machines aren't allowed anymore and through the ages there's been only one negative zero. As a practical matter, these tests suffice:
INT_MIN == -INT_MAX && ~0 == 0
but your code doesn't work for two reasons. Despite what the standard says, constexprs are evaluated on the host using host rules, and there exists an architecture where this crashes at compile time.
Trying to massage out the trap is not possible. ~(unsigned)0 == (unsigned)-1 reliably tests for two's complement, so its inverse does indeed check for ones' complement*; however, ~0 is the only way to generate negative zero on ones' complement, and any use of that value as a signed number can trap, so we can't test for its behavior. Even using platform-specific code, we can't catch traps in constexpr, so forget about it.
*barring truly exotic arithmetic but hey
Everybody uses #defines for architecture selection. If you need to know, use it.
If you handed me an actually standards-compliant compiler that yielded a compile error on a trap in a constexpr and evaluated with target platform rules rather than host platform rules with converted results, we could do this:
target.o: target.c++
$(CXX) -c target.c++ || $(CC) -DTRAP_ZERO -c target.c++
bool has_negativezero() {
#ifndef TRAP_ZERO
    return INT_MIN == -INT_MAX && ~0 == 0;
#else
    return 0;
#endif
}
The standard std::signbit function in C++ has an overload that accepts an integral value
bool signbit( IntegralType arg ); (4) (since C++11)
So you can check with static_assert(signbit(-0)). However there's a footnote on that (emphasis mine)
A set of overloads or a function template accepting the arg argument of any integral type. Equivalent to (2) (the argument is cast to double).
which unfortunately means you still have to rely on a floating-point type with negative zero. You can force the use of IEEE-754 with signed zero with std::numeric_limits<double>::is_iec559
Similarly std::copysign has the overload Promoted copysign ( Arithmetic1 x, Arithmetic2 y ); that can be used for this purpose. Unfortunately, neither signbit nor copysign is constexpr according to the current standard, although there are some proposals to change that:
constexpr for cmath and cstdlib
More constexpr for cmath and complex
Constexpr Math Functions
Yet Clang and GCC already treat those functions as constexpr if you don't want to wait for the standard to be updated.
Systems with a negative zero also have a balanced range, so you can just check whether the positive and negative ranges have the same magnitude:
if constexpr(-std::numeric_limits<int>::max() != std::numeric_limits<int>::min() + 1) // or
if constexpr(-std::numeric_limits<int>::max() == std::numeric_limits<int>::min())
// has negative zero
In fact, -INT_MAX - 1 is also how libraries define INT_MIN on two's-complement systems.
But the simplest solution would be eliminating non-two's complement cases, which are pretty much non-existent nowadays
static_assert(-1 == ~0, "This requires the use of 2's complement");
Related:
How to check a double's bit pattern is 0x0 in a C++11 constexpr?

Rounding integer to nearest multiple of another integer

I need to round integers to be the nearest multiple of another integer. Examples for results in the case of multiples of 100:
36->0
99->100
123->100
164->200
and so on.
I came up with the following code, that works, but feels "dirty":
int RoundToMultiple(int toRound, int multiple)
{
    return (toRound + (multiple / 2)) / multiple * multiple;
}
This counts on the truncating properties of integer division to make it work.
Can I count on this code to be portable? Are there any compiler setups where this will fail to give me the desired result? If there are, how can I achieve the same results in a portable way?
If needed for a better answer, it can be assumed that multiples will be powers of 10 (including multiples of 1). Numbers can also be assumed to all be positive.
Yes, you can count on this code to be portable. N4296 (which is the latest open draft of C++14) says in section 5.6 [expr.mul]:
For integral operands the / operator yields the algebraic quotient with any fractional part discarded. [Footnote: This is often called truncation towards zero]
This is not a new feature of the latest C++, it could be relied on in C89 too.
The only caveat is that if toRound is negative, you need to subtract the offset.
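For example, a sketch of such a signed variant (hypothetical name; essentially what a later answer below spells out):
// Sketch: choose the sign of the half-multiple offset from toRound, so
// negatives round symmetrically (ties round away from zero, mirroring the
// positive case).
int RoundToMultipleSigned(int toRound, int multiple)
{
    const int offset = (toRound < 0) ? -(multiple / 2) : multiple / 2;
    return (toRound + offset) / multiple * multiple;
}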
An alternative approach is:
int RoundToMultiple(int toRound, int multiple)
{
    const auto ratio = static_cast<double>(toRound) / multiple;
    const auto iratio = std::lround(ratio);
    return iratio * multiple;
}
This avoids messy +/- offsets, but performance will be worse, and there are problems if toRound is so large that it can't be held precisely in a double. (OTOH, if this is for output, then I suspect multiple will be similarly large in this case, so you will be alright.)
The C++ standard explicitly specifies the behavior of integer division thusly:
[expr.mul]
For integral operands the / operator yields the algebraic quotient with any fractional part discarded.
A.k.a. truncation towards zero. This is as portable as it gets.
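For a concrete sanity check of that guarantee (the values follow directly from truncation towards zero):
#include <cassert>

int main()
{
    assert(  7 / 2 ==  3 &&  7 % 2 ==  1);  // positive operands truncate toward zero
    assert( -7 / 2 == -3 && -7 % 2 == -1);  // negative dividend: also toward zero
    assert((123 + 50) / 100 * 100 == 100);  // RoundToMultiple(123, 100)
    assert((164 + 50) / 100 * 100 == 200);  // RoundToMultiple(164, 100)
}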
Though - as mentioned by others - the integral division behaves as you expect, maybe the following solution looks "less weird" (still opinion-based).
Concerning a solution that converts an int to a double: I personally feel that this is too expensive just for the sake of rounding, but maybe someone can convince me that my feeling is wrong;
Anyway, by using just integral operators, the following solution makes the discussion on whether a double's mantissa can always hold every int superfluous:
int RoundToMultiple(int toRound, int multiple) {
    toRound += multiple / 2;
    return toRound - (toRound % multiple);
}
If you also wanted to include negative values, the code could be slightly adapted as follows (including tests):
#include <stdio.h>

int RoundToMultiple(int toRound, int multiple) {
    toRound += toRound < 0 ? -multiple / 2 : multiple / 2;
    return toRound - (toRound % multiple);
}

int main(int argc, char const *argv[])
{
    int tests[] = { 36, 99, 123, 164, -36, -99, -123, -164, 0 };
    int expectedResults[] = { 0, 100, 100, 200, 0, -100, -100, -200, 0 };
    int i = 0;
    int test = 0, result = 0, expectedResult = 0;
    do {
        test = tests[i];
        result = RoundToMultiple(test, 100);
        expectedResult = expectedResults[i];
        printf("test %d: %d==%d ? %s\n", test, result, expectedResult,
               (expectedResult == result ? "OK" : "NOK!"));
        i++;
    } while (test != 0);
}

Prevent underflow when adding two logarithms

I am using the addition-in-log-space equation described in the Wikipedia log probability article, but I am getting underflow when computing the exp of very large negative logarithms. As a result, my program crashes.
Example inputs are a = -2 and b = -1033.4391885529124.
My code, implemented straight from the Wikipedia article, looks like this:
#include <algorithm>  // std::min, std::max
#include <cmath>      // std::isinf, std::log2, std::exp2
#include <limits>

double log_sum(double a, double b)
{
    double min_ab = std::min(a, b);
    a = std::max(a, b);
    b = min_ab;
    if (std::isinf(a) && std::isinf(b)) {
        return -std::numeric_limits<double>::infinity();
    } else if (std::isinf(a)) {
        return b;
    } else if (std::isinf(b)) {
        return a;
    } else {
        return a + std::log2(1 + std::exp2(b - a));
    }
}
I've come up with the following ideas, but can't decide which is best:
Check for out-of-range inputs before evaluation.
Disable (somehow) the exception, and flush or clamp the output after evaluation
Implement custom log and exp functions that do not throw exceptions and automatically flush or clamp the results.
Some other ways?
Additionally, I'd be interested to know what effect the choice of the logarithm base has on the computation. I chose base two because I believed that other log bases would be calculated from log_n(x) = log_2(x) / log_2(n), and would suffer from precision loss due to the division. Is that correct?
According to http://en.cppreference.com/w/cpp/numeric/math/exp:
For IEEE-compatible type double, overflow is guaranteed if 709.8 < arg, and underflow is guaranteed if arg < -708.4
So you can't prevent an underflow. However:
If a range error occurs due to underflow, the correct result (after rounding) is returned.
So there shouldn't be any program crash - "just" a loss of precision.
However, notice that
1 + exp(n)
will lose precision much sooner, i.e. already at n = -53. This is because the next representable number after 1.0 is 1.0 + 2^-52.
So loss of precision due to exp is far less than the precision lost when adding 1.0 + exp(...)
The problem here is accurately computing the expression log(1+exp(x)) without intermediate under/overflow. Fortunately, Martin Maechler (one of the R core developers) details how to do it in section 3 of this vignette.
He uses natural base functions: it should be possible to translate it to base-2 by appropriately scaling the functions, but it uses the log1p function in one part, and I'm not aware of any math library which supplies a base-2 variant.
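As a rough sketch of the natural-base formulation (a simplification, not a transcription of Maechler's carefully tuned version), the key identity is log(exp(a) + exp(b)) = max + log1p(exp(min - max)), which never feeds exp a large positive argument:
#include <algorithm>
#include <cmath>

// Sketch: log(exp(a) + exp(b)) in the natural base, assuming finite a and b.
// exp(lo - hi) is at most 1, so it can harmlessly underflow to 0 but never
// overflow, and std::log1p stays accurate when its argument is tiny.
double log_sum_natural(double a, double b)
{
    const double hi = std::max(a, b);
    const double lo = std::min(a, b);
    return hi + std::log1p(std::exp(lo - hi));
}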
The choice of base is unlikely to have any effect on accuracy (or performance), and most reasonable math libraries are able to give sub 1-ulp guarantees for both functions (i.e. you will have one of the two floating point values closest to the exact answer). A pretty common approach is to break up the floating point number into its base-2 exponent k and significand 1+f, such that 1/sqrt(2) < 1+f < sqrt(2), and then use a polynomial approximation to compute log(1+f): due to some mathematical quirks (basically, the fact that the 2nd term of the Taylor series can be represented exactly) it turns out to be more accurate to do this in the natural base rather than base-2, so a typical implementation will look like:
log(x) = k*log2 + p(f)
log2(x) = k + p(f)*invlog2
(e.g. see log and log2 in openlibm), so there is no real benefit to using one over the other.

Is it safe to use == on FP in this example

I stumbled onto this code here.
Generators doubleSquares(int value)
{
    Generators result;
    for (int i = 0; i <= std::sqrt(value / 2); i++)             // 1
    {
        double j = std::sqrt(value - std::pow(i, 2));           // 2
        if (std::floor(j) == j)                                 // 3
            result.push_back( { i, static_cast<int>(j) } );     // 4
    }
    return result;
}
Am I wrong to think that //3 is dangerous ?
This code is not guaranteed by the C++ standard to work as desired.
Some low-quality math libraries do not return correctly rounded values for pow, even when the inputs have integer values and the mathematical result can be exactly represented. sqrt may also return an inaccurate value, although this function is easier to implement and so less commonly suffers from defects.
Thus, it is not guaranteed that j is exactly an integer when you might expect it to be.
In a good-quality math library, pow and sqrt will always return correct results (zero error) when the mathematical result is exactly representable. If you have a good-quality C++ implementation, this code should work as desired, up to the limits of the integer and floating-point types used.
Improving the Code
This code has no reason to use pow; std::pow(i, 2) should be i*i. This results in exact arithmetic (up to the point of integer overflow) and completely avoids the question of whether pow is correct.
Eliminating pow leaves just sqrt. If we know the implementation returns correct values, we can accept the use of sqrt. If not, we can use this instead:
for (int i = 0; i*i <= value/2; ++i)
{
    int j = std::round(std::sqrt(value - i*i));
    if (j*j + i*i == value)
        result.push_back( { i, j } );
}
This code only relies on sqrt to return a result accurate within .5, which even a low-quality sqrt implementation should provide for reasonable input values.
There are two different, but related, questions:
Is j an integer?
Is j likely to be the result of a double calculation whose exact result would be an integer?
The quoted code asks the first question. It is not correct for asking the second question. More context would be needed to be certain which question should be being asked.
If the second question should be being asked, you cannot depend only on floor. Consider a double that is greater than 2.99999999999 but less than 3. It could be the result of a calculation whose exact value would be 3. Its floor is 2, and it is greater than its floor by almost 1. You would need to compare for being close to the result of std::round instead.
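A minimal sketch of that kind of check (the tolerance is a placeholder that would have to come from the surrounding calculation):
#include <cmath>

// Sketch: treat j as "really an integer" when it lies within tol of the
// nearest integer; tol must be chosen for the calculation at hand.
bool nearly_integral(double j, double tol)
{
    return std::fabs(j - std::round(j)) <= tol;
}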
I would say it is dangerous. One should always test for "equality" of floating point numbers by comparing the difference between the two numbers with an acceptably small number, e.g.:
#include <cmath>
...
if (std::fabs(std::floor(j) - j) < eps) {
...
... where eps is a number that is acceptably small for one's purpose. This approach is essential unless one can guarantee that the operations return exact results, which may be true for some cases (e.g. IEEE-754-compliant systems) but the C++ standard does not require that this be true. See, for instance Cross-Platform Issues With Floating-Point Arithmetics in C++.

Is floating-point == ever OK?

Just today I came across third-party software we're using and in their sample code there was something along these lines:
// Defined in somewhere.h
static const double BAR = 3.14;
// Code elsewhere.cpp
void foo(double d)
{
if (d == BAR)
...
}
I'm aware of the problem with floating-points and their representation, but it made me wonder if there are cases where float == float would be fine? I'm not asking for when it could work, but when it makes sense and works.
Also, what about a call like foo(BAR)? Will this always compare equal as they both use the same static const BAR?
Yes, you are guaranteed that whole numbers, including 0.0, compare equal with ==
Of course you have to be a little careful with how you got the whole number in the first place; assignment is safe, but the result of any calculation is suspect
PS: there is a set of real numbers that do have an exact representation as a float (think of 1/2, 1/4, 1/8, etc.), but you probably don't know in advance that you have one of these.
Just to clarify: it is guaranteed by IEEE 754 that float representations of integers (whole numbers) within range are exact.
float a=1.0;
float b=1.0;
a==b // true
But you have to be careful how you get the whole numbers
float a=1.0/3.0;
a*3.0 == 1.0 // not true !!
There are two ways to answer this question:
Are there cases where float == float gives the correct result?
Are there cases where float == float is acceptable coding?
The answer to (1) is: Yes, sometimes. But it's going to be fragile, which leads to the answer to (2): No. Don't do that. You're begging for bizarre bugs in the future.
As for a call of the form foo(BAR): In that particular case the comparison will return true, but when you are writing foo you don't know (and shouldn't depend on) how it is called. For example, calling foo(BAR) will be fine but foo(BAR * 2.0 / 2.0) (or even maybe foo(BAR * 1.0) depending on how much the compiler optimises things away) will break. You shouldn't be relying on the caller not performing any arithmetic!
Long story short, even though a == b will work in some cases you really shouldn't rely on it. Even if you can guarantee the calling semantics today maybe you won't be able to guarantee them next week so save yourself some pain and don't use ==.
To my mind, float == float is never* OK because it's pretty much unmaintainable.
*For small values of never.
The other answers explain quite well why using == for floating point numbers is dangerous. I just found one example that illustrates these dangers quite well, I believe.
On the x86 platform, you can get weird floating point results for some calculations, which are not due to rounding problems inherent to the calculations you perform. This simple C program will sometimes print "error":
#include <stdio.h>

void test(double x, double y)
{
    const double y2 = x + 1.0;
    if (y != y2)
        printf("error\n");
}

int main()
{
    const double x = .012;
    const double y = x + 1.0;
    test(x, y);
    return 0;
}
The program essentially just calculates
x = 0.012 + 1.0;
y = 0.012 + 1.0;
(only spread across two functions and with intermediate variables), but the comparison can still yield false!
The reason is that on the x86 platform, programs usually use the x87 FPU for floating point calculations. The x87 internally calculates with a higher precision than regular double, so double values need to be rounded when they are stored in memory. That means that a roundtrip x87 -> RAM -> x87 loses precision, and thus calculation results differ depending on whether intermediate results passed via RAM or whether they all stayed in FPU registers. This is of course a compiler decision, so the bug only manifests for certain compilers and optimization settings :-(.
For details see the GCC bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=323
Rather scary...
Additional note:
Bugs of this kind will generally be quite tricky to debug, because the different values become the same once they hit RAM.
So if for example you extend the above program to actually print out the bit patterns of y and y2 right after comparing them, you will get the exact same value. To print the value, it has to be loaded into RAM to be passed to some print function like printf, and that will make the difference disappear...
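One commonly suggested workaround, besides compiler options such as GCC's -ffloat-store or targeting SSE math, is to force both values through memory so each is rounded to a 64-bit double before the comparison; a sketch of the idea applied to the test function above:
#include <cstdio>

// Sketch: a volatile double must actually be stored, which rounds the x87's
// extended-precision intermediate down to a 64-bit double, so both sides of
// the comparison have gone through the same rounding.
void test(double x, double y)
{
    volatile double y2 = x + 1.0;
    volatile double y1 = y;
    if (y1 != y2)
        std::printf("error\n");
}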
I'll provide a more-or-less real example of legitimate, meaningful and useful testing for float equality.
#include <stdio.h>
#include <math.h>
/* let's try to numerically solve a simple equation F(x)=0 */
double F(double x) {
    return 2 * cos(x) - pow(1.2, x);
}

/* a well-known, simple & slow but extremely smart method to do this */
double bisection(double range_start, double range_end) {
    double a = range_start;
    double d = range_end - range_start;
    int counter = 0;
    while (a != a + d) // <-- WHOA!!
    {
        d /= 2.0;
        if (F(a) * F(a + d) > 0) /* test for same sign */
            a = a + d;
        ++counter;
    }
    printf("%d iterations done\n", counter);
    return a;
}

int main() {
    /* we must be sure that the root can be found in [0.0, 2.0] */
    printf("F(0.0)=%.17f, F(2.0)=%.17f\n", F(0.0), F(2.0));
    double x = bisection(0.0, 2.0);
    printf("the root is near %.17f, F(%.17f)=%.17f\n", x, x, F(x));
}
I'd rather not explain the bisection method used itself, but emphasize the stopping condition. It has exactly the discussed form: (a == a+d) where both sides are floats: a is our current approximation of the equation's root, and d is our current precision. Given the precondition of the algorithm — that there must be a root between range_start and range_end — we guarantee on every iteration that the root stays between a and a+d while d is halved every step, shrinking the bounds.
And then, after a number of iterations, d becomes so small that during addition with a it gets rounded to zero! That is, a+d turns out to be closer to a than to any other float; and so the FPU rounds it to the closest representable value: to a itself. A calculation on a hypothetical machine can illustrate; let it have a 4-digit decimal mantissa and some large exponent range. Then what result should the machine give to 2.131e+02 + 7.000e-3? The exact answer is 213.107, but our machine can't represent such a number; it has to round it. And 213.107 is much closer to 213.1 than to 213.2 — so the rounded result becomes 2.131e+02 — the little summand vanished, rounded away to zero. Exactly the same is guaranteed to happen at some iteration of our algorithm — and at that point we can't continue anymore. We have found the root to the maximum possible precision.
Addendum
No, you can't just use "some small number" in the stopping condition. For any choice of the number, some inputs will deem your choice too large, causing loss of precision, and there will be inputs which will deem your choice too small, causing excess iterations or even entering an infinite loop. Imagine that our F can change — and suddenly the solutions can be both huge 1.0042e+50 and tiny 1.0098e-70. Detailed discussion follows.
Calculus has no notion of a "small number": for any real number, you can find infinitely many even smaller ones. The problem is, among those "even smaller" ones might be a root of our equation. Even worse, some equations will have distinct roots (e.g. 2.51e-8 and 1.38e-8) — both of which will get approximated by the same answer if our stopping condition looks like d < 1e-6. Whichever "small number" you choose, many roots which would've been found correctly to the maximum precision with a == a+d — will get spoiled by the "epsilon" being too large.
It's true however that floats' exponent has finite limited range, so one actually can find the smallest nonzero positive FP number; in IEEE 754 single precision, it's the 1e-45 denorm. But it's useless! while (d >= 1e-45) {…} will loop forever with single-precision (positive nonzero) d.
At the same time, any choice of the "small number" in d < eps stopping condition will be too small for many equations. Where the root has high enough exponent, the result of subtraction of two neighboring mantissas will easily exceed our "epsilon". For example, 7.00023e+8 - 7.00022e+8 = 0.00001e+8 = 1.00000e+3 = 1000 — meaning that the smallest possible difference between numbers with exponent +8 and 6-digit mantissa is... 1000! It will never fit into, say, 1e-4. For numbers with relatively high exponent we simply have not enough precision to ever see a difference of 1e-4. This means eps = 1e-4 will be too small!
My implementation above took this last problem into account; you can see that d is halved each step — instead of getting recalculated as difference of (possibly huge in exponent) a and b. For reals, it doesn't matter; for floats it does! The algorithm will get into infinite loops with (b-a) < eps on equations with huge enough roots. The previous paragraph shows why. d < eps won't get stuck, but even then — needless iterations will be performed during shrinking d way down below the precision of a — still showing the choice of eps as too small. But a == a+d will stop exactly at precision.
Thus as shown: any choice of eps in while (d < eps) {…} will be both too large and too small, if we allow F to vary.
... This kind of reasoning may seem overly theoretical and needlessly deep, but it's to illustrate again the trickiness of floats. One should be aware of their finite precision when writing arithmetic operators around.
Perfect for integral values even in floating point formats
But the short answer is: "No, don't use ==."
Ironically, the floating point format works "perfectly", i.e., with exact precision, when operating on integral values within the range of the format. This means that if you stick with double values, you get perfectly good integers with a little more than 50 bits, giving you about +- 4,500,000,000,000,000, or 4.5 quadrillion.
In fact, this is how JavaScript works internally, and it's why JavaScript can do things like + and - on really big numbers, but can only << and >> on 32-bit ones.
Strictly speaking, you can exactly compare sums and products of numbers with precise representations. Those would be all the integers, plus fractions composed of 1/2^n terms. So, a loop incrementing by n + 0.25, n + 0.50, or n + 0.75 would be fine, but not any of the other 96 decimal fractions with 2 digits.
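For example, a small sketch of a loop that stays exact under that rule (a 0.1 step, by contrast, would accumulate error and would not be guaranteed to hit 10.0 exactly):
// Sketch: every value visited is a multiple of 0.25, which doubles represent
// exactly, so the sums stay exact and the != test terminates exactly at 10.0.
int count_quarter_steps()
{
    int steps = 0;
    for (double x = 0.0; x != 10.0; x += 0.25)
        ++steps;
    return steps;  // 40
}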
So the answer is: while exact equality can in theory make sense in narrow cases, it is best avoided.
The only case where I ever use == (or !=) for floats is in the following:
if (x != x)
{
// Here x is guaranteed to be Not a Number
}
and I must admit I am guilty of using Not A Number as a magic floating point constant (using numeric_limits<double>::quiet_NaN() in C++).
There is no point in comparing floating point numbers for strict equality. Floating point numbers have been designed with predictable relative accuracy limits. You are responsible for knowing what precision to expect from them and your algorithms.
It's probably ok if you're never going to calculate the value before you compare it. If you are testing whether a floating point number is exactly pi, or -1, or 1, and you know those are the limited values being passed in...
I also used it a few times when rewriting a few algorithms to multithreaded versions. I used a test that compared results for the single- and multithreaded versions, to be sure that both of them give exactly the same result.
Let's say you have a function that scales an array of floats by a constant factor:
void scale(float factor, float *vector, int extent) {
    int i;
    for (i = 0; i < extent; ++i) {
        vector[i] *= factor;
    }
}
I'll assume that your floating point implementation can represent 1.0 and 0.0 exactly, and that 0.0 is represented by all 0 bits.
If factor is exactly 1.0 then this function is a no-op, and you can return without doing any work. If factor is exactly 0.0 then this can be implemented with a call to memset, which will likely be faster than performing the floating point multiplications individually.
The reference implementation of BLAS functions at netlib uses such techniques extensively.
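A sketch of those two special cases spelled out, keeping the stated assumptions that 1.0 and 0.0 are represented exactly and that 0.0 is all zero bits:
#include <cstring>  // std::memset

// Sketch: skip work for factor == 1.0f and use memset for factor == 0.0f.
// Note the == against 0.0f also matches -0.0f, since the two compare equal.
void scale(float factor, float *vector, int extent)
{
    if (factor == 1.0f)
        return;                                           // exact no-op
    if (factor == 0.0f) {
        std::memset(vector, 0, extent * sizeof *vector);  // every result is +0.0f
        return;
    }
    for (int i = 0; i < extent; ++i)
        vector[i] *= factor;
}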
In my opinion, comparing for equality (or some equivalence) is a requirement in most situations: standard C++ containers or algorithms with an implied equality comparison functor, like std::unordered_set for example, require that this comparator be an equivalence relation (see C++ named requirements: UnorderedAssociativeContainer).
Unfortunately, comparing with an epsilon as in abs(a - b) < epsilon does not yield an equivalence relation since it loses transitivity. This is most probably undefined behavior, specifically two 'almost equal' floating point numbers could yield different hashes; this can put the unordered_set in an invalid state.
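A tiny sketch of that loss of transitivity (the epsilon here is arbitrary and only for illustration):
#include <cassert>
#include <cmath>

// Sketch: with an arbitrary eps of 0.5, a ~ b and b ~ c hold while a ~ c does
// not, so "almost equal" is not transitive and not an equivalence relation.
bool almost_equal(double a, double b) { return std::fabs(a - b) < 0.5; }

int main()
{
    assert( almost_equal(0.0, 0.4));
    assert( almost_equal(0.4, 0.8));
    assert(!almost_equal(0.0, 0.8));
}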
Personally, I would use == for floating points most of the time, unless any kind of FPU computation would be involved on any operands. With containers and container algorithms, where only read/writes are involved, == (or any equivalence relation) is the safest.
abs(a - b) < epsilon is more or less a convergence criterion, similar to a limit. I find this relation useful if I need to verify that a mathematical identity holds between two computations (for example PV = nRT, or distance = time * speed).
In short, use == if and only if no floating point computations occur;
never use abs(a-b) < e as an equality predicate;
Yes. 1/x will be valid unless x==0. You don't need an imprecise test here. 1/0.00000001 is perfectly fine. I can't think of any other case - you can't even check tan(x) for x==PI/2
The other posts show where it is appropriate. I think using bit-exact compares to avoid needless calculation is also okay.
Example:
float someFunction (float argument)
{
    // lastargument and cachedValue are cache state declared elsewhere
    // (e.g. as statics or members); very_expensive_calculation likewise.
    // I really want bit-exact comparison here!
    if (argument != lastargument)
    {
        lastargument = argument;
        cachedValue = very_expensive_calculation (argument);
    }
    return cachedValue;
}
I would say that comparing floats for equality would be OK if a false-negative answer is acceptable.
Assume for example, that you have a program that prints out floating points values to the screen and that if the floating point value happens to be exactly equal to M_PI, then you would like it to print out "pi" instead. If the value happens to deviate a tiny bit from the exact double representation of M_PI, it will print out a double value instead, which is equally valid, but a little less readable to the user.
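A sketch of that idea (assuming M_PI is available, as in the answer's premise; the function name is mine):
#include <cmath>
#include <iostream>

// Sketch: an exact match prints "pi"; a value off by even one ULP just prints
// the number instead, which is the acceptable false negative described above.
void print_value(double value)
{
    if (value == M_PI)
        std::cout << "pi\n";
    else
        std::cout << value << '\n';
}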
I have a drawing program that fundamentally uses a floating point for its coordinate system since the user is allowed to work at any granularity/zoom. The thing they are drawing contains lines that can be bent at points created by them. When they drag one point on top of another they're merged.
In order to do "proper" floating point comparison I'd have to come up with some range within which to consider the points the same. Since the user can zoom in to infinity and work within that range and since I couldn't get anyone to commit to some sort of range, we just use '==' to see if the points are the same. Occasionally there'll be an issue where points that are supposed to be exactly the same are off by .000000000001 or something (especially around 0,0) but usually it works just fine. It's supposed to be hard to merge points without the snap turned on anyway...or at least that's how the original version worked.
It throws off the testing group occasionally, but that's their problem :p
So anyway, there's an example of a possibly reasonable time to use '=='. The thing to note is that the decision is less about technical accuracy than about client wishes (or lack thereof) and convenience. It's not something that needs to be all that accurate anyway. So what if two points won't merge when you expect them to? It's not the end of the world and won't affect 'calculations'.