A simple way to check a range (diapason) (A, B):
if (X < A && X < B) ...
But it seems that +Inf and NaN also end up inside the diapason.
Your condition is not an interval (diapason). It is functionally equivalent to
X < std::min(A, B)
You only have an upper bound, no lower bound at all.
Exactly how NaN and +Inf behave depends on the floating-point representation, which is not specified by the C++ standard but is CPU-specific.
If we assume the commonly used IEEE-754, then neither X = +Inf nor X = NaN can satisfy the condition for any values of A and B.
This is how you check that a floating point number is between a lower and an upper bound (but equal to neither):
X > low && X < high
or
low < X && X < high
Again, if we assume IEEE-754, then neither X=+Inf, nor X=NaN can satisfy this condition for any values of low and high. But, since IEEE-754 might not be guaranteed, the behaviour of such numbers is not specified. You might need to be explicit to support exotic hardware:
low < X && X < high && std::isfinite(X)
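As a minimal sketch (the helper name in_open_range is mine, not from the answer), an explicit check that also rejects NaN and the infinities regardless of representation could look like this:

#include <cmath>

// Returns true only when low < X < high and X is a finite number.
// Hypothetical helper, for when NaN and +/-Inf must be rejected explicitly.
bool in_open_range(double X, double low, double high)
{
    return std::isfinite(X) && low < X && X < high;
}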
Related
This code sample from https://en.cppreference.com/w/cpp/types/numeric_limits/epsilon
#include <cmath>
#include <limits>
#include <type_traits>

template<class T>
typename std::enable_if<!std::numeric_limits<T>::is_integer, bool>::type
almost_equal(T x, T y, int ulp)
{
    // the machine epsilon has to be scaled to the magnitude of the values used
    // and multiplied by the desired precision in ULPs (units in the last place)
    return std::fabs(x-y) <= std::numeric_limits<T>::epsilon() * std::fabs(x+y) * ulp
        // unless the result is subnormal
        || std::fabs(x-y) < std::numeric_limits<T>::min();
}
Can someone explain why epsilon is scaled to fabs(x+y) instead of std::fmax(std::fabs(x), std::fabs(y))?
We can distinguish two main cases: either x and y are almost_equal, or they aren't.
When x and y are almost equal (down to a few bits), x+y is almost equal to 2*x. It's likely that x and y have the same exponent, but it's also possible that their exponent differs by exactly one (e.g. 0.49999 and 0.500001).
When x and y are not even close, e.g. when they differ by more than a factor of 2, then almost_equal(x,y) will return false anyway; the difference between the current and proposed scalings doesn't really matter there.
So what we're really worried about is the edge case, where x and y differ by about ulp units in the last place. Getting back to the definition of epsilon: 1.0 + epsilon differs from 1.0 by exactly 1 Unit in the Last Place (1 ulp). Differing by N ULPs means differing by ldexp(epsilon, N-1).
But here we need to look carefully at the direction of epsilon. It's defined via the next bigger number: nextafter(1.0, 2.0) - 1.0. That's twice the value of 1.0 - nextafter(1.0, 0.0), the gap to the previous smaller number.
This is because the spacing between adjacent floating-point numbers near x grows roughly in proportion to x, but as a step function: at every power of 2, the step size doubles.
So your proposed lower bound of approximately x indeed works in the best case (just above a power of 2) but can be off by a factor of 2. And x+y, which is almost equal to 2*x when it matters, also covers the worst-case inputs.
So you could just multiply your proposed bound by 2, but std::fmax is already a bit slower due to its hidden conditional.
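For illustration only (this is my sketch of the alternative scaling the question proposes, not the cppreference code), the fmax-based bound would need that extra factor of 2 to cover the worst case:

#include <cmath>
#include <limits>

// Sketch of the fmax-based scaling discussed above; the factor of 2 compensates
// for the case where fmax(|x|, |y|) sits just above a power of 2.
template<class T>
bool almost_equal_fmax(T x, T y, int ulp)
{
    const T scale = 2 * std::fmax(std::fabs(x), std::fabs(y));
    return std::fabs(x - y) <= std::numeric_limits<T>::epsilon() * scale * ulp
        || std::fabs(x - y) < std::numeric_limits<T>::min();
}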
Comparing two floating point numbers with something like a_float == b_float is asking for trouble, since a_float / 3.0 * 3.0 might not be equal to a_float due to round-off error.
What one normally does is something like fabs(a_float - b_float) < tol.
How does one calculate tol?
Ideally the tolerance should be just larger than the value of one or two of the least significant digits. So if single-precision floating point numbers are used, tol = 10E-6 should be about right. However, this does not work well for the general case where a_float might be very small or might be very large.
How does one calculate tol correctly for all general cases? I am interested in C or C++ cases specifically.
This blog post contains an example, a fairly foolproof implementation, and the detailed theory behind it:
http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
It is also one of a series, so you can always read more.
In short: use ULP-based comparison for most numbers, use an epsilon for numbers near zero, but there are still caveats. If you want to be sure about your floating point math, I recommend reading the whole series.
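As a minimal sketch of that advice (my own paraphrase, not code from the blog post; the names and thresholds are mine, and NaN gets no special handling):

#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Absolute epsilon for values near zero, ULP distance for everything else.
bool almost_equal_ulps(float a, float b, float abs_eps, int max_ulps)
{
    if (std::fabs(a - b) <= abs_eps)        // covers values near zero
        return true;
    if (std::signbit(a) != std::signbit(b))
        return false;                       // different signs, not near zero
    std::int32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);        // reinterpret the bit patterns
    std::memcpy(&ib, &b, sizeof ib);
    return std::abs(ia - ib) <= max_ulps;   // distance in representable steps
}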
As far as I know, one doesn't.
There is no general "right answer", since it can depend on the application's requirement for precision.
For instance, a 2D physics simulation working in screen-pixels might decide that 1/4 of a pixel is good enough, while a 3D CAD system used to design nuclear plant internals might not.
I can't see a way to programmatically decide this from the outside.
The C header file <float.h> gives you the constants FLT_EPSILON and DBL_EPSILON, the difference between 1.0 and the smallest number larger than 1.0 that a float/double can represent. You can scale that by the magnitude of your numbers and the rounding error you wish to tolerate:
#include <float.h>
#include <math.h>

#ifndef DBL_TRUE_MIN
/* DBL_TRUE_MIN (the minimum denormal value) is standard since C11 and a common
 * extension before that. DBL_MIN is the minimum normalized value -- use that
 * if DBL_TRUE_MIN is not defined. */
#define DBL_TRUE_MIN DBL_MIN
#endif

/* return the difference between |x| and the next larger representable double */
double dbl_epsilon(double x) {
    int exp;
    if (frexp(x, &exp) == 0.0)
        return DBL_TRUE_MIN;
    return ldexp(DBL_EPSILON, exp - 1);
}
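For instance, a tolerance scaled to the operands might be used like this (a sketch only, assuming dbl_epsilon() from above is in scope and a few representable steps of error are acceptable):

#include <cmath>

// Sketch: accept a difference of up to `ulps` representable steps
// at the magnitude of the larger operand.
bool roughly_equal(double a, double b, int ulps) {
    double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= ulps * dbl_epsilon(scale);
}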
Welcome to the world of traps, snares and loopholes. As mentioned elsewhere, a general purpose solution for floating point equality and tolerances does not exist. Given that, there are tools and axioms that a programmer may use in select cases.
fabs(a_float - b_float) < tol has the shortcoming the OP mentioned: "does not work well for the general case where a_float might be very small or might be very large." fabs(a_float - ref_float) <= fabs(ref_float * tol) copes with widely varying magnitudes much better.
The OP's "single precision floating point number is use tol = 10E-6" is a bit worrisome because C and C++ so easily promote float arithmetic to double, and then it is the "tolerance" of double, not float, that comes into play. Consider float f = 1.0; printf("%.20f\n", f/7.0); So many new programmers do not realize that the 7.0 caused a double-precision calculation. Recommend using double throughout your code except where large amounts of data need float's smaller size.
C99 provides nextafter(), which can be useful in helping to gauge "tolerance". Using it, one can determine the next representable number. This will help with the OP's "... the full number of significant digits for the storage type minus one ... to allow for roundoff error.": if ((nextafter(x, -INF) <= y) && (y <= nextafter(x, +INF))) ...
The kind of tol or "tolerance" used is often the crux of the matter. Most often (IMHO) a relative tolerance is important, e.g. "Are x and y within 0.0001%?" Sometimes an absolute tolerance is needed, e.g. "Are x and y within 0.0001?"
The value of the tolerance is often debatable, for the best value is often situation dependent. Comparing within 0.01 may work for a financial application for Dollars but not Yen. (Hint: be sure to use a coding style that allows easy updates.)
Rounding error varies according to the values and operations used.
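As a sketch of the two kinds of tolerance mentioned above (the function names are mine):

#include <cmath>

// Absolute tolerance: "are x and y within tol of each other?"
bool within_absolute(double x, double y, double tol) {
    return std::fabs(x - y) <= tol;
}

// Relative tolerance: "are x and y within tol (e.g. 1e-6 for 0.0001%)
// of the larger magnitude?"
bool within_relative(double x, double y, double tol) {
    return std::fabs(x - y) <= tol * std::fmax(std::fabs(x), std::fabs(y));
}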
Instead of a fixed tolerance, you can probably use a factor of epsilon like:
#include <cmath>
#include <limits>

bool nearly_equal(double a, double b, int factor /* a factor of epsilon */)
{
    double min_a = a - (a - std::nextafter(a, std::numeric_limits<double>::lowest())) * factor;
    double max_a = a + (std::nextafter(a, std::numeric_limits<double>::max()) - a) * factor;
    return min_a <= b && max_a >= b;
}
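For example (my own test values, not from the answer), this accepts results that differ only by a couple of representable steps:

#include <cstdio>

int main() {
    double a = 0.1 + 0.2;                          // 0.30000000000000004...
    std::printf("%d\n", nearly_equal(a, 0.3, 2));  // expected to print 1 on IEEE-754
}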
Although the value of the tolerance depends on the situation, if you are looking for a precision comparison you could use the machine epsilon as the tolerance: std::numeric_limits<T>::epsilon() (from the <limits> library). The function returns the difference between 1 and the smallest value greater than 1 that is representable for the data type.
http://msdn.microsoft.com/en-us/library/6x7575x3.aspx
The value of epsilon differs if you are comparing floats or doubles. For instance, on my computer, the value of epsilon is 1.1920929e-007 when comparing floats and 2.2204460492503131e-016 when comparing doubles.
For a relative comparison between x and y, multiply the epsilon by the maximum absolute value of x and y.
The result above could be multiplied by the ulps (units in the last place) which allows you to play with the precision.
#include <iostream>
#include <cmath>
#include <limits>
#include <algorithm>

template<class T> bool are_almost_equal(T x, T y, int ulp)
{
    return std::abs(x-y) <= std::numeric_limits<T>::epsilon() * std::max(std::abs(x), std::abs(y)) * ulp;
}
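For example (my own call, relying on the includes and template above, with an arbitrary choice of 2 ULPs):

int main()
{
    std::cout << std::boolalpha
              << are_almost_equal(0.1 + 0.2, 0.3, 2) << '\n';  // true on a typical IEEE-754 system
}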
When I need to compare floats, I use code like this
bool same( double a, double b, double error ) {
    double x;
    if( a == 0 ) {
        x = b;
    } else if( b == 0 ) {
        x = a;
    } else {
        x = (a-b) / a;
    }
    return fabs(x) < error;
}
For given numbers x,y and n, I would like to calculate x-y mod n in C. Look at this example:
int substract_modulu(int x, int y, int n)
{
    return (x-y) % n;
}
As long as x>y, we are fine. In the other case, however, the modulo operation is undefined.
You can think of x,y,n>0. I would like the result to be positive, so if (x-y)<0, then ((x-y)-substract_modulu(x,y,n))/ n shall be an integer.
What is the fastest algorithm you know for that? Is there one which avoids any use of if and of the ?: operator?
As many have pointed out, in current C and C++ standards, x % n is no longer implementation-defined for any values of x and n. It is undefined behaviour in the cases where x / n is undefined [1]. Also, x - y is undefined behaviour in the case of integer overflow, which is possible if the signs of x and y might differ.
So the main problem for a general solution is avoiding integer overflow, either in the division or the subtraction. If we know that x and y are non-negative and n is positive, then overflow and division by zero are not possible, and we can confidently say that (x - y) % n is defined. Unfortunately, x - y might be negative, in which case so will be the result of the % operator.
It's easy to correct for the result being negative if we know that n is positive; all we have to do is unconditionally add n and do another modulo operation. That's unlikely to be the best solution, unless you have a computer where division is faster than branching.
If a conditional load instruction is available (pretty common these days), then the compiler will probably do well with the following code, which is portable and well-defined, subject to the constraints that x,y ≥ 0 ∧ n > 0:
((x - y) % n) + ((x >= y) ? 0 : n)
For example, gcc produces this code for my Core i5 (although it's generic enough to work on any non-Paleozoic Intel chip):
idivq %rcx
cmpq %rsi, %rdi
movl $0, %eax
cmovge %rax, %rcx
leaq (%rdx,%rcx), %rax
which is cheerfully branch-free. (Conditional move is usually a lot faster than branching.)
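Wrapped up as a function (my sketch, not from the answer; it tests the sign of the remainder rather than comparing x and y, which also keeps exact multiples of n at 0, and it is equally friendly to a conditional move):

// Well-defined for x >= 0, y >= 0, n > 0; the result is always in [0, n).
int sub_mod(int x, int y, int n)
{
    int r = (x - y) % n;
    return r + (r < 0 ? n : 0);
}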
Another way of doing this would be (except that the function sign needs to be written):
((x - y) % n) + (sign(x - y) & (unsigned long)n)
where sign is all 1s if its argument is negative, and otherwise 0. One possible implementation of sign (adapted from bithacks) is
#include <limits.h>

unsigned long sign(unsigned long x) {
    return x >> (sizeof(long) * CHAR_BIT - 1);
}
This is portable (casting negative integer values to unsigned is defined), but it may be slow on architectures which lack high-speed shift. It's unlikely to be faster than the previous solution, but YMMV. TIAS.
Neither of these produce correct results for the general case where integer overflow is possible. It's very difficult to deal with integer overflow. (One particularly annoying case is n == -1, although you can test for that and return 0 without any use of %.) Also, you need to decide your preference for the result of modulo of negative n. I personally prefer the definition where x%n is either 0 or has the same sign as n -- otherwise why would you bother with a negative divisor -- but applications differ.
The three-modulo solution proposed by Tom Tanner will work if n is not -1 and n + n does not overflow. n == -1 will fail if either x or y is INT_MIN, and the simple fix of using abs(n) instead of n will fail if n is INT_MIN. The cases where n has a large absolute value could be replaced with comparisons, but there are a lot of corner cases, made more complicated by the fact that the standard does not require 2's-complement arithmetic, so it's not easy to predict what the corner cases are [2].
As a final note, some tempting solutions do not work. You cannot just take the absolute value of (x - y):
(-z) % n == -(z % n), which modulo n is n - (z % n), and that is not z % n (unless z % n happens to be n / 2)
And, for the same reason, you cannot just take the absolute value of the result of modulo.
Also, you cannot just cast (x - y) to unsigned:
(unsigned)z == z + 2^k (for some k) if z < 0
(z + 2^k) % n == (z % n + 2^k % n) % n, which is not z % n unless (2^k % n) == 0
[1] x/n and x%n are both undefined if n==0. But x%n is also undefined if x/n is "not representable" (i.e. there was integer overflow), which will happen on two's-complement machines (that is, all the ones you care about) if x is the most negative representable number and n == -1. It's clear why x/n should be undefined in this case, but slightly less so in the case of x%n, since that value is (mathematically) 0.
[2] Most people who complain about the difficulty of predicting the results of floating-point arithmetic haven't spent much time trying to write truly portable integer arithmetic code :)
If you want to avoid undefined behaviour, without an if, the following would work
return (x % n - y % n + n) % n;
The efficiency depends on the implementation of the modulo operation, but I'd suspect algorithms involving if would be rather faster.
Alternatively you could treat x and y as unsigned. In which case there are no negative numbers involved and no undefined behaviour.
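As a function (my wrapper of the expression above):

// Avoids negative intermediate results for any x and y,
// assuming n > 0 and n + n does not overflow int.
int subtract_mod(int x, int y, int n)
{
    return (x % n - y % n + n) % n;
}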
With C++11 the undefined behaviour was removed. Depending on the exact behaviour you want, you can then just stick with
return (x-y) % n;
For a full explanation read this answer:
https://stackoverflow.com/a/13100805/1149664
You still get undefined behavior for n==0 or if x-y can not be stored in the type you are using.
Whether branching is going to matter will depend on the CPU to some degree. According to the documentation abs (on MSDN) has intrinsic behavior and it might not be a bottleneck at all. This you'll have to test.
If you want to compute things unconditionally, there are several nice methods that can be adapted from the Bit Twiddling Hacks site.
int v;           // we want to find the absolute value of v
unsigned int r;  // the result goes here
int const mask = v >> (sizeof(int) * CHAR_BIT - 1);
r = (v + mask) ^ mask;
However, I don't know if this will be helpful to your situation without more information about hardware targets and testing.
Just out of curiosity I had to test this myself, and when you look at the assembly generated by the compiler you can see there's no real overhead in the use of abs.
unsigned r = abs(i);
====
00381006 cdq
00381007 xor eax,edx
00381009 sub eax,edx
The following is just an alternate form of the above example which according to the Bit Twiddling Site is not patented (while the version used by the Visual C++ 2008 compiler is).
Throughout my answer I have been using MSDN and Visual C++ but I would assume that any sane compiler has similar behavior.
Assuming 0 <= x < n and 0 <= y < n, how about (x + n - y) % n? Then x + n will certainly be larger than y, subtracting y will always result in a positive integer, and the final mod n reduces the result if necessary.
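As a function (my naming), with the stated preconditions spelled out:

// Sketch: assumes 0 <= x < n and 0 <= y < n (and that 2*n fits in an int),
// so x + n - y is always positive and the result lands in [0, n).
int sub_mod_bounded(int x, int y, int n)
{
    return (x + n - y) % n;
}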
I'm going to guess that it's not really the case here, but I'd like to mention that if the value you are taking modulo with is a power of two, then using the "AND" method is a lot quicker (I'm going to ignore the x-y, and just show how it works for a single x, as x-y is not part of the equation here):
int modpow2(int x, int n)
{
    return x & (n-1);
}
If you want to ensure that your code doesn't do anything daft, you could add ASSERT(!(n & n-1)); - this checks that there is only a single bit set in n (so, n is a power of two).
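For example (my illustration of the check described above, using the standard assert rather than a project-specific ASSERT macro):

#include <cassert>

int main()
{
    int n = 8;
    assert((n & (n - 1)) == 0);   // n must be a power of two
    int r = modpow2(13, n);       // 13 mod 8
    return r == 5 ? 0 : 1;        // r is 5
}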
Here is the C++ code I use in competitive programming:
#include <iostream>
#include <bits/stdc++.h>
using namespace std;

#define ll long long
#define mod 1000000007

ll subtraction_modulo(ll x, ll y) {
    return ( ( (x - y) % mod ) + mod ) % mod;
}
Here,
ll -> long long int
mod -> globally defined mod value to be used.
I'm writing a full double to float function for the Arduino (irrelevant, but I couldn't find any "proper" ones) and I do this check:
if (d < 0) {
    d *= -1;
    bin += "-";
}
I know because of floating point imprecisions double equality is finicky. So is it safe to do that? Or should I stick to this (which I use in later parts of my code anyways)
int compareNums(double x, double y) {
    if (abs(x - y) <= EPSILON) {
        return 0;
    } else if (x > y) {
        return 1;
    } else {
        return -1;
    }
}
And a couple quick questions: does it matter if I do d < 0 or d < 0.0?
I'm multiplying a double d by 10 until it has no fractional part, so I do a check similar to d == (int) d. I'm wondering what's a good epsilon to use (I used this here http://msdn.microsoft.com/en-us/library/6x7575x3(v=vs.80).aspx), since I don't want to end up with an infinite loop. According to the article 0.000000119209 is the smallest distinguishable difference for floats or something like that.
Thanks
d < 0 is valid (though I'd prefer to write d < 0.0). In the first case the zero will be "promoted" to double before the comparison.
And comparing a double to zero with < or > is perfectly valid, and does not require an "epsilon".
bin += "-"; is nonsensical.
In general comparing floats/doubles with "==" is invalid and should never be done (except for some special cases such as checking for zero or infinity). Some languages do not even allow "==" (or equivalent) between floats.
d == (int) d is more nonsense.
See my answer to this question:
How dangerous is it to compare floating point values?
Specifically, the recommendations that you should not be using absolute epsilons and should not be using floating point whatsoever until you've thoroughly read and understood What Every Computer Scientist Should Know About Floating-Point Arithmetic.
As for this specific piece of code in your question, where it seems your goal is to print a textual representation of the number, simply testing < 0 is correct. And it does not matter whether you write 0 or 0.0.
Today, I noticed that when I cast a double that is greater than the maximum possible integer to an integer, I get -2147483648. Similarly, when I cast a double that is less than the minimum possible integer, I also get -2147483648.
Is this behavior defined for all platforms?
What is the best way to detect this under/overflow? Is putting if statements for min and max int before the cast the best solution?
When casting floats to integers, overflow causes undefined behavior. From the C99 spec, section 6.3.1.4 Real floating and integer:
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
You have to check the range manually, but don't use code like:
// DON'T use code like this!
if (my_double > INT_MAX || my_double < INT_MIN)
printf("Overflow!");
INT_MAX is an integer constant that may not have an exact floating-point representation. When compared against a floating-point value, it may be rounded to the nearest higher or nearest lower representable floating point value (this is implementation-defined). With 64-bit integers, for example, INT_MAX is 2^63 - 1, which will typically be rounded to 2^63, so the check essentially becomes my_double > INT_MAX + 1. This won't detect an overflow if my_double equals 2^63.
For example with gcc 4.9.1 on Linux, the following program
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    double d = pow(2, 63);
    int64_t i = INT64_MAX;
    printf("%f > %lld is %s\n", d, i, d > i ? "true" : "false");
    return 0;
}
prints
9223372036854775808.000000 > 9223372036854775807 is false
It's hard to get this right if you don't know the limits and internal representation of the integer and double types beforehand. But if you convert from double to int64_t, for example, you can use floating point constants that are exact doubles (assuming two's complement and IEEE doubles):
if (!(my_double >= -9223372036854775808.0   // -2^63
      && my_double < 9223372036854775808.0) // 2^63
   ) {
    // Handle overflow.
}
The construct !(A && B) also handles NaNs correctly. A portable, safe, but slightly inaccurate version for ints is:
if (!(my_double > INT_MIN && my_double < INT_MAX)) {
    // Handle overflow.
}
This errs on the side of caution and will falsely reject values that equal INT_MIN or INT_MAX. But for most applications, this should be fine.
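As a sketch of how the conservative check might be used (the function name and error signalling are mine):

#include <climits>

// Returns false instead of invoking UB; rejects INT_MIN and INT_MAX themselves,
// erring on the side of caution as described above. NaN also lands in the
// false branch because both comparisons fail.
bool safe_double_to_int(double my_double, int& out) {
    if (!(my_double > INT_MIN && my_double < INT_MAX))
        return false;
    out = static_cast<int>(my_double);
    return true;
}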
limits.h has constants for the maximum and minimum possible values of the integer data types; you can check your double variable before casting, like
if (my_double > nextafter(INT_MAX, 0) || my_double < nextafter(INT_MIN, 0))
    printf("Overflow!");
else
    my_int = (int)my_double;
EDIT: nextafter() will solve the problem mentioned by nwellnhof
To answer your question: the behaviour when you cast out-of-range floats is undefined or implementation-specific.
Speaking from experience: I've worked on a MIPS64 system that didn't implement these kinds of casts at all. Instead of doing something deterministic, the CPU threw a CPU exception. The exception handler that ought to emulate the cast returned without doing anything to the result.
I've ended up with random integers. Guess how long it took to trace back a bug to this cause. :-)
You'd better do the range check yourself if you aren't sure that the number can't get out of the valid range.
A portable way for C++ is to use the SafeInt class:
http://www.codeplex.com/SafeInt
The implementation allows normal addition/subtraction/etc. on a C++ number type, including casts. It throws an exception whenever an overflow scenario is detected.
SafeInt<int> s1 = INT_MAX;
SafeInt<int> s2 = 42;
SafeInt<int> s3 = s1 + s2; // throws
I highly advise using this class in any place where overflow is an important scenario. It makes it very difficult for an overflow to slip through silently. In cases where there is a recovery scenario for an overflow, simply catch the SafeIntException and recover as appropriate.
SafeInt now works on GCC as well as Visual Studio
What is the best way to detect this under/overflow?
Compare the truncated double to exact limits near INT_MIN,INT_MAX.
The trick is to exactly convert limits based on INT_MIN,INT_MAX into double values. A double may not exactly represent INT_MAX as the number of bits in an int may exceed that floating point's precision.*1 In that case, the conversion of INT_MAX to double suffers from rounding. The number after INT_MAX is a power-of-2 and is certainly representable as a double. 2.0*(INT_MAX/2 + 1) generates the whole number one greater than INT_MAX.
The same applies to INT_MIN on non-2s-complement machines.
INT_MAX is always a power-of-2 - 1.
INT_MIN is always:
-INT_MAX (not 2's complement) or
-INT_MAX-1 (2's complement)
#include <limits.h>
#include <math.h>

int double_to_int(double x) {
    x = trunc(x);
    if (x >= 2.0*(INT_MAX/2 + 1)) Handle_Overflow();
#if -INT_MAX == INT_MIN
    if (x <= 2.0*(INT_MIN/2 - 1)) Handle_Underflow();
#else
    // Fixed 2022
    // if (x < INT_MIN) Handle_Underflow();
    if (x - INT_MIN < -1.0) Handle_Underflow();
#endif
    return (int) x;
}
To detect NaN as well and avoid trunc():
#include <limits.h>
#include <math.h>

#define DBL_INT_MAXP1 (2.0*(INT_MAX/2+1))
#define DBL_INT_MINM1 (2.0*(INT_MIN/2-1))

int double_to_int(double x) {
    if (x < DBL_INT_MAXP1) {
#if -INT_MAX == INT_MIN
        if (x > DBL_INT_MINM1) {
            return (int) x;
        }
#else
        if (ceil(x) >= INT_MIN) {
            return (int) x;
        }
#endif
        Handle_Underflow();
    } else if (x > 0) {
        Handle_Overflow();
    } else {
        Handle_NaN();
    }
}
[Edit 2022] Corner error corrected after 6 years.
double values in the range (INT_MIN - 1.0 ... INT_MIN) (non-inclusive end-points) convert well to int. Prior code failed those.
*1 This also applies to INT_MIN - 1 when the int precision exceeds the double precision. Although this is rare, the issue readily applies to long long. Consider the difference between:
if (x < LLONG_MIN - 1.0) Handle_Underflow();  // Bad
if (x - LLONG_MIN < -1.0) Handle_Underflow(); // Good
With 2's complement, some_int_type_MIN is a (negative) power-of-2 and exactly converts to a double. Thus x - LLONG_MIN is exact in the range of concern while LLONG_MIN - 1.0 may suffer precision loss in the subtraction.
We ran into the same question. For example:
double d = 9223372036854775807L;
int i = (int)d;
On Linux/Windows, i = -2147483648, but on AIX 5.3, i = 2147483647.
If the double is outside the range of the integer, Linux/Windows always returns INT_MIN, while AIX returns INT_MAX if the double is positive and INT_MIN if it is negative.
Another option is to use boost::numeric_cast which allows for arbitrary conversion between numerical types. It detects loss of range when a numeric type is converted, and throws an exception if the range cannot be preserved.
The website referenced above also provides a small example which should give a quick overview on how this template can be used.
Of course, this isn't plain C anymore ;-)
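A quick sketch of what that can look like (based on my reading of Boost.NumericConversion; check your Boost version's documentation for the exact header and exception types):

#include <iostream>
#include <boost/numeric/conversion/cast.hpp>

int main() {
    try {
        double d = 1e20;                      // far outside int's range
        int i = boost::numeric_cast<int>(d);  // throws when the range is lost
        std::cout << i << '\n';
    } catch (const boost::numeric::bad_numeric_cast& e) {
        std::cout << "Overflow: " << e.what() << '\n';
    }
    return 0;
}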
I am not sure about this, but I think it may be possible to "turn on" floating point exceptions for under/overflow. Take a look at "Dealing with Floating-point Exceptions in MSVC7\8", so you might have an alternative to if/else checks.
I can't tell you for certain whether it is defined for all platforms, but that is pretty much what's happened on every platform I've used. Except, in my experience, it rolls over. That is, if the value of the double is INT_MAX + 2, then the result of the cast ends up being INT_MIN + 2.
As for the best way to handle it, I'm really not sure. I've run up against the issue myself, and have yet to find an elegant way to deal with it. I'm sure someone will respond that can help us both there.