Issue with std::numeric_limits<T>::min() with float/double values - c++

While writing some code to ensure user input is valid, I came across an issue with std::numeric_limits<T>::min(), where T is a floating point type such as double or float.
Using -std::numeric_limits<double>::max() gives me the real minimum value, but you would expect std::numeric_limits<double>::min() to return that.
Why does std::numeric_limits<double>::min() not return the smallest possible value of these types, forcing us to use std::numeric_limits::lowest or -std::numeric_limits<double>::max() instead?

Yes, using -max() will give the lowest value.
Since C++11 there is also std::numeric_limits::lowest, which is consistent for floating-point and integer types.
http://en.cppreference.com/w/cpp/types/numeric_limits/lowest
See also How to workaround the inconsistent definition of numeric_limits<T>::min()?

Why is this?
This is because std::numeric_limits<>::min() returns the implementation-defined FLT_MIN, DBL_MIN or INT_MIN. In this regard the behavior of this method is consistent. But the return value of std::numeric_limits<double>::min(), i.e. DBL_MIN, has a slightly different meaning than INT_MIN: it is the smallest positive normalized value that double can represent. There are values closer to 0 than DBL_MIN, but double can't represent them (as normalized numbers).
why does using min() not do the same thing in this instance?
The rationale behind this is that you can query <limits> for this value to check for this possible underflow specific to floating point arithmetic.
You can use
std::numeric_limits<double>::lowest()
to query for a lowest negative value.
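A minimal illustration of how the three limits relate for double (the variable names are mine):

```cpp
#include <limits>

// For a floating-point type, min() is a tiny *positive* number; the most
// negative finite value is lowest(), which equals -max().
const double tiny_positive = std::numeric_limits<double>::min();    // DBL_MIN, just above 0
const double most_negative = std::numeric_limits<double>::lowest(); // same as -DBL_MAX
const double largest       = std::numeric_limits<double>::max();    // DBL_MAX
```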

Related

floating point number comparison why no equal functions

In the C++ standard for floating-point numbers there is std::isgreater for greater-than comparison and std::isless for less-than comparison, so why is there no std::isequal for equality comparison? Is there a safe and accurate way to check whether a double variable is equal to the DBL_MAX constant defined by the standard? The reason we try to do this is that we access data through a service protocol which defines a double field: when no data is available it sends DBL_MAX, so in our client code we need to skip the field when it is DBL_MAX and process it otherwise.
The interest of isgreater, isless, isgreaterequal, islessequal compared to >, <, >= and <= is that they do not raise FE_INVALID (a floating point exception, these are different beasts than C++ exceptions and are not mapped to C++ exceptions) when comparing with a NaN while the operators do.
Since == does not raise an FP exception when comparing with a NaN, there is no need for an additional function that avoids doing so.
Note that there is also islessgreater and isunordered.
If you are not considering NaN or not testing the floating point exception there is no need to worry about these functions.
Considering equality comparison == is what to use if you want to check that the values are the same (ignoring the issues related to signed 0 and NaN). Depending on how you are reaching these values, it is sometimes useful to consider an approximate equality comparison -- but using one systematically is not recommended, for instance such approximate equality is probably not transitive.
In your context of a network protocol, you have to consider how the data is serialized. If the serialization is binary, you can probably reconstruct the exact value, and thus == is what you want for comparing against DBL_MAX (for other values, check what is specified for signed 0 and NaN, and know that signalling and quiet NaNs are represented by different bit patterns, although IEEE 754-2008 now recommends particular encodings for them). If the representation is decimal, you'll have to check that it is precise enough for the DBL_MAX value to be reconstructable (and pay attention to rounding modes).
Note that I'd have considered a NaN for representing the no data available case instead of using a potentially valid value.
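As a sketch of that alternative design (hypothetical, since the protocol in the question already uses DBL_MAX): a quiet NaN can act as the "no data" sentinel, tested with std::isnan rather than ==:

```cpp
#include <cmath>
#include <limits>

// Hypothetical sentinel: quiet NaN instead of DBL_MAX marks "no data".
const double kNoData = std::numeric_limits<double>::quiet_NaN();

bool has_data(double field)
{
    // NaN compares unequal to everything (including itself),
    // so std::isnan is the correct test, not ==.
    return !std::isnan(field);
}
```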
Is there a safe and accurate way to check if a double variable is equal to the DBL_MAX constants defined by the standard?
When you obtain a floating point number as a result of evaluating some expression, then == comparison doesn't make sense in most cases due to finite precision and rounding errors. However, if you first set a floating point variable to some value, then you can compare it to that value using == (with some exceptions like positive and negative zeros).
For example:
double v = std::numeric_limits<double>::max();
// ... conditional assignment to v of a non-DBL_MAX value ...
if (v != std::numeric_limits<double>::max())
    process(v);
std::numeric_limits<double>::max() is exactly representable as double (it is double), so this comparison should be safe and should never yield true unless v was reassigned.
An exact-equality comparison between two floating-point numbers rarely makes sense, because floating-point arithmetic involves rounding.
So if you want to compare two floating-point numbers for equality, the best way is to declare a significant epsilon (error) for your comparison and then verify that the first number is approximately equal to the second.
To do this, check that the first number is greater than the second number minus epsilon and lower than the second number plus epsilon.
Ex.
if (first > second - epsilon && first < second + epsilon) {
    // the numbers are equal
} else {
    // the numbers are not equal
}
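The same test is often written with std::fabs; a sketch that is equivalent for finite inputs and positive epsilon (the function name is mine):

```cpp
#include <cmath>

// |first - second| < epsilon is the same condition as
// (first > second - epsilon && first < second + epsilon).
bool approx_equal(double first, double second, double epsilon)
{
    return std::fabs(first - second) < epsilon;
}
```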

What's the least possible denominator/divisor value?

I'm writing some code to prevent a zero denominator/divisor, to avoid a NaN value as the result of the division.
I wonder what the least possible denominator value in double in C++ could be, how to find it, and what the reason for it is.
Well, the smallest positive(a) normalised double in C++ can be obtained with std::numeric_limits<double>::min() (from the <limits> header).
However, while you may be able to use that to prevent NaN values(b), it probably won't help with overflows. For example:
std::numeric_limits<double>::max() / 1.0 => ok
std::numeric_limits<double>::max() / 0.5 => overflow
Preventing that will depend on both the denominator and the numerator.
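A sketch of such a combined check (the helper name is mine): num / den overflows to infinity exactly when |num| > DBL_MAX * |den|, and that can only happen when |den| < 1, so the guard also keeps the right-hand product from overflowing:

```cpp
#include <cmath>
#include <limits>

// Returns true if num / den would overflow to +/- infinity.
// |num / den| > DBL_MAX  <=>  |num| > DBL_MAX * |den|; the |den| < 1.0
// guard ensures the product on the right cannot itself overflow.
bool division_overflows(double num, double den)
{
    return std::abs(den) < 1.0 &&
           std::abs(num) > std::numeric_limits<double>::max() * std::abs(den);
}
```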
As for why that is the case: C++ implementations typically use the IEEE-754 double-precision format for their double type, and this is a limitation of that format.
(a) I've chosen positive values here; "smallest" could also be interpreted as the most negative value, in which case it would be -std::numeric_limits<double>::max(). But, given your intent is to avoid NaN, I suspect my assumption is correct.
(b) I'm not entirely sure how you intend to do this, which is why I also discuss overflows - you may want to make it clearer in your question.
The smallest positive normalised float is 1.17549e-38. To expand upon It's coming home's comment, see the answer from here:
#include <limits>
//...
std::numeric_limits<float>::max(); // 3.40282e+38
std::numeric_limits<float>::min(); // 1.17549e-38
std::numeric_limits<float>::infinity();
The float in the above code can be replaced by any data type you want, e.g.:
std::numeric_limits<int>::min();

Floating-point comparison of constant assignment

When comparing doubles for equality, we need to give a tolerance level, because floating-point computation might introduce errors. For example:
double x;
double y;
x = f();
y = g();
if (fabs(x - y) < epsilon) {
    // they are equal!
} else {
    // they are not!
}
However, if I simply assign a constant value, without any computation, do I still need to check the epsilon?
double x = 1;
double y = 1;
if (x == y) {
    // they are equal!
} else {
    // no they are not!
}
Is == comparison good enough? Or do I need to do fabs(x-y)<epsilon again? Is it possible to introduce error in assignment? Am I too paranoid?
How about casting (double x = static_cast<double>(100))? Is that gonna introduce floating-point error as well?
I am using C++ on Linux, but if it differs by language, I would like to understand that as well.
Actually, it depends on the value and the implementation. The C++ standard (draft n3126) has this to say in 2.14.4 Floating literals:
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
In other words, if the value is exactly representable (and 1 is, in IEEE754, as is 100 in your static cast), you get the value. Otherwise (such as with 0.1) you get an implementation-defined close match (a). Now I'd be very worried about an implementation that chose a different close match based on the same input token but it is possible.
(a) Actually, that paragraph can be read in two ways, either the implementation is free to choose either the closest higher or closest lower value regardless of which is actually the closest, or it must choose the closest to the desired value.
If the latter, it doesn't change this answer however since all you have to do is hardcode a floating point value exactly at the midpoint of two representable types and the implementation is once again free to choose either.
For example, it might alternate between the next higher and next lower for the same reason banker's rounding is applied - to reduce the cumulative errors.
No, if you assign literals they should be the same :)
Also if you start with the same value and do the same operations, they should be the same.
Floating point values are non-exact, but the operations should produce consistent results :)
Both cases are ultimately subject to implementation defined representations.
Storage of floating point values and their representations take on many forms - loaded by address or from a constant? optimized out by fast math? what is the register width? is it stored in an SSE register? Many variations exist.
If you need precise behavior and portability, do not rely on this implementation defined behavior.
IEEE-754, which is a standard common implementations of floating point numbers abide to, requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result. Thus the only imprecision that you will face is rounding after each operation you perform, as well as propagation of rounding errors from the operations performed earlier in the chain. Floats are not per se inexact. And by the way, epsilon can and should be computed, you can consult any numerics book on that.
Floating point numbers can represent integers precisely up to the length of their mantissa. So, for example, a cast from an int to a double will always be exact, but a cast from an int to a float will no longer be exact for very large integers.
There is one major example of extensive use of floating point numbers as a substitute for integers: the Lua scripting language, which has no built-in integer type and uses floating-point numbers extensively for logic, flow control, etc. The performance and storage penalty of using floating-point numbers turns out to be smaller than the penalty of resolving multiple types at run time, and it makes the implementation lighter. Lua has been used extensively not only on PCs, but also on game consoles.
Now, many compilers have an optional switch that disables IEEE-754 compatibility. Then compromises are made. Denormalized numbers (very very small numbers where the exponent has reached smallest possible value) are often treated as zero, and approximations in implementation of power, logarithm, sqrt, and 1/(x^2) can be made, but addition/subtraction, comparison and multiplication should retain their properties for numbers which can be exactly represented.
The easy answer: For constants == is ok.
There are two exceptions which you should be aware of:
First exception:
0.0 == -0.0
There is a negative zero which compares equal for the IEEE 754 standard. This means
1/INFINITY == 1/-INFINITY which breaks f(x) == f(y) => x == y
Second exception:
NaN != NaN
This is a special caveat of NaN (Not a Number) which lets you find out whether a number is a NaN
on systems which do not have a test function available (yes, that happens).
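That caveat is what makes the classic self-comparison trick work; a minimal sketch (the function name is mine):

```cpp
#include <limits>

// On IEEE 754 systems, NaN is the only value that compares unequal to
// itself, so x != x is a portable fallback where no isnan() exists.
bool my_is_nan(double x)
{
    return x != x;
}

const double kQuietNaN = std::numeric_limits<double>::quiet_NaN();
```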

Why are FLT_MAX and FLT_MIN not positive and negative infinity, and what is their use?

Logically speaking, given the nature of floating point values, the maximum and minimum representable values of a float are positive and negative infinity, respectively.
Why, then, are FLT_MAX and FLT_MIN not set to them? I understand that this is "just how the standard called for". But then, what use could FLT_MAX or FLT_MIN have as they currently lie in the middle of the representable numeric range of float? Other numeric limits have some utility because they make guarantees about comparisons (e.g. "No INT can test greater than INT_MAX"). Without that kind of guarantee, what use are these float limits at all?
A motivating example for C++:
#include <vector>
#include <limits>

template<typename T>
T find_min(const std::vector<T> &vec)
{
    T result = std::numeric_limits<T>::max();
    for (typename std::vector<T>::const_iterator p = vec.begin(); p != vec.end(); ++p)
        if (*p < result) result = *p;
    return result;
}
This code works fine if T is an integral type, but not if it is a floating point type. This is annoying. (Yes yes, the standard library provides min_element, but that is not the point. The point is the pattern.)
The purpose of FLT_MIN/MAX is to tell you what the smallest and largest representable floating-point numbers are. Infinity isn't a number; it's a limit.
what use could FLT_MAX or FLT_MIN have as they currently lie in the middle of the representable numeric range of float?
They do not lie in the middle of the representable range. There is no positive float value x which you can add to FLT_MAX and get a representable number. You will get +INF. Which, as previously stated, is not a number.
This code works fine if T is an integral type, but not if it is a floating point type. This is annoying. (Yes yes, the standard library provides min_element, but that is not the point. The point is the pattern.)
And how doesn't it "work fine?" It gives you the smallest value. The only situation where it doesn't "work fine" is if the table contains only +INF. And even in that case, it returns an actual number, not an error-code. Which is probably the better option anyway.
FLT_MAX is defined in section 5.2.4.2.2(9) as
maximum representable finite floating-point number
Positive infinity is not finite.
FLT_MIN is defined in section 5.2.4.2.2(10) as
minimum normalized positive floating-point number
Negative infinity is neither normalized nor positive.
Unlike integer types, floating-point types are (almost?) universally symmetric about zero, and I think the C floating-point model requires this.
On two's-complement systems (i.e., almost all modern systems), INT_MIN is -INT_MAX-1; on other systems, it may be -INT_MAX. (Quibble: a two's-complement system can have INT_MIN equal to -INT_MAX if the lowest representable value is treated as a trap representation.) So INT_MIN conveys information that INT_MAX by itself doesn't.
And a macro for the smallest positive value would not be particularly useful; that's just 1.
In floating-point, on the other hand, the negative value with the greatest magnitude is just -FLT_MAX (or -DBL_MAX, or -LDBL_MAX).
As for why they're not Infinity, there's already a way to represent infinite values (at least in C99): the macro INFINITY. That might cause problems for some C++ applications, but these were defined for C, which doesn't have things like std::numeric_limits<T>::max().
Furthermore, not all floating-point systems have representations for infinity (or NaN).
If FLT_MAX were INFINITY (on systems that support it), then there would probably need to be another macro for the largest representable real value.
I would say the broken pattern you're seeing is only an artifact of poor naming in C, whereas in C++ with numeric_limits and templates, it's an actual semantic flaw that breaks template code that wants to handle both integer and floating point values. Of course you can write a little bit of extra code to test if you have an integer or floating point type (e.g. if ((T)1/2) /* floating point */ else /* integer */) and the problem goes away.
As for why somebody would care about the values FLT_MIN and FLT_MAX give you, they're useful for avoiding underflow and overflow. For example, suppose I need to compute sqrt(x²-1). This is well-defined for any floating point x greater than or equal to 1, but performing the squaring, subtraction, and square root could easily overflow and render the result meaningless when x is large. One might want to test whether x > FLT_MAX/x and handle this case some other way (such as simply returning x :-).
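One standard rewriting for that example (a sketch; valid for x >= 1) factors the expression so that x is never squared: sqrt(x*x - 1) equals sqrt(x - 1) * sqrt(x + 1), and each factor stays well within range for any finite x:

```cpp
#include <cmath>

// sqrt(x*x - 1) without squaring x: the naive form overflows once
// x exceeds sqrt(DBL_MAX) (about 1.34e154), while this factored
// form stays finite for any finite x >= 1.
double sqrt_x2_minus_1(double x)
{
    return std::sqrt(x - 1.0) * std::sqrt(x + 1.0);
}
```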

How do I check and handle numbers very close to zero

I have some math (in C++) which seems to be generating some very small, near zero, numbers (I suspect the trig function calls are my real problem), but I'd like to detect these cases so that I can study them in more detail.
I'm currently trying out the following, is it correct?
if (std::abs(x) < DBL_MIN) {
    log_debug("detected small num, %Le, %Le", x, y);
}
Second, the nature of the mathematics is trigonometric in nature (aka using a lot of radian/degree conversions and sin/cos/tan calls, etc), what sort of transformations can I do to avoid mathematical errors?
Obviously for multiplications I can use a log transform - what else?
Contrary to widespread belief, DBL_MIN is not the smallest positive double value but the smallest positive normalized double value. Typically - for 64-bit IEEE 754 doubles - it is 2^-1022, while the smallest positive double value is 2^-1074. Therefore
I'm currently trying out the following, is it correct?
if (std::abs(x) < DBL_MIN) {
    log_debug("detected small num, %Le, %Le", x, y);
}
may have an affirmative answer. The condition checks whether x is a denormalized (also called subnormal) number or ±0.0. Without knowing more about your specific situation, I cannot tell if that test is appropriate. Denormalized numbers can be legitimate results of calculations or the consequence of rounding where the correct result would be 0. It is also possible that rounding produces numbers of far greater magnitude than DBL_MIN when the mathematically correct result would be 0, so a much larger threshold could be sensible.
If x is a double, then one problem with this approach is that you can't distinguish between x being legitimately zero, and x being a positive value smaller than DBL_MIN. So this will work if you know x can never be legitimately zero, and you want to see when underflow occurs.
You could also try catching the SIGFPE signal, which will fire on a POSIX-compliant system any time there's a math error including floating-point underflow. See: http://en.wikipedia.org/wiki/SIGFPE
EDIT: To be clear, DBL_MIN is NOT the largest negative value that a double can hold, it is the smallest positive normalized value that a double can hold. So your approach is fine as long as the value can't be zero.
Another useful constant is DBL_EPSILON, which is the difference between 1.0 and the next representable double greater than 1.0 (informally, the smallest value that can be added to 1.0 without getting 1.0 back). Note that this is a much larger value than DBL_MIN. It may be useful to you since you're doing trigonometric functions that may tend toward 1 instead of tending toward 0.
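A quick illustration of what DBL_EPSILON means under the default round-to-nearest mode:

```cpp
#include <cfloat>

// DBL_EPSILON is the gap between 1.0 and the next larger double: adding it
// to 1.0 is visible, while adding half of it rounds back to exactly 1.0
// (the halfway case resolves to 1.0 under ties-to-even).
const double just_above_one = 1.0 + DBL_EPSILON;
const double still_one      = 1.0 + DBL_EPSILON / 2.0;
```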
Since you are using C++, the most idiomatic is to use std::numeric_limits from header <limits>.
For instance:
#include <cmath>   // std::abs
#include <limits>

template <typename T>
bool is_close_to_zero(T x)
{
    return std::abs(x) < std::numeric_limits<T>::epsilon();
}
The actual tolerance to be used heavily depends on your problem. Please complete your question with a concrete use case so that I can enhance my answer.
There are also std::numeric_limits<T>::min() and std::numeric_limits<T>::denorm_min(), which may be useful. The first is the smallest positive normalized value of type T (equal to FLT/DBL/LDBL_MIN from <cfloat>); the second is the smallest positive value of type T (no <cfloat> equivalent).
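A sketch of how these constants order for double (the variable names are mine):

```cpp
#include <limits>

// smallest positive subnormal < smallest positive normalized (DBL_MIN)
// < epsilon (the gap just above 1.0).
const double tiniest_subnormal = std::numeric_limits<double>::denorm_min();
const double smallest_normal   = std::numeric_limits<double>::min();
const double one_ulp_above_one = std::numeric_limits<double>::epsilon();
```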
[You may find this document useful to read if you aren't at ease with floating point numbers representation.]
The first if check will actually only be true when your value is zero or a subnormal (denormalized) number.
For your second question, you imply lots of conversions. Instead, pick one unit (deg or rad) and do all your computational operations in that unit. Then at the very end do a single conversion to the other value if you need to.