How can the minimal value of a float be smaller than the machine precision? [duplicate] - c++

I was solving an equation using double precision and I got -7.07649e-17 as a solution instead of 0.
I agree it's close enough that I can say it's equal, but I've read that the machine epsilon for the C++ double type is 2^-52, which is larger than the value I get.
So why can I get a value smaller than the machine epsilon?
Why isn't the value rounded to zero?
It's not a big deal, but when I do a logical test it appears that my value is not zero...

There are two different constants in this story. One is epsilon, which is the smallest value that, when added to 1.0, produces a result different from 1.0. If you add a smaller value to 1.0 you will get 1.0 again, because there are physical limits to the representation of a number in a computer. But there are values that are less than epsilon and greater than zero. The smallest such positive normalized number for a double is given by std::numeric_limits<double>::min().
For reference, you get epsilon with std::numeric_limits<double>::epsilon().
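To see the difference, here is a minimal sketch (the commented values assume IEEE 754 doubles):
#include <iostream>
#include <limits>

int main() {
    // epsilon(): gap between 1.0 and the next representable double (~2.22e-16)
    std::cout << std::numeric_limits<double>::epsilon() << '\n';
    // min(): smallest positive normalized double (~2.23e-308)
    std::cout << std::numeric_limits<double>::min() << '\n';
    // denorm_min(): smallest positive subnormal double (~4.94e-324)
    std::cout << std::numeric_limits<double>::denorm_min() << '\n';
    // -7.07649e-17 is below epsilon but far above min(),
    // so it is a perfectly representable double
}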

You are not guaranteed that rounding will take place at any particular time. The C++ standard permits the implementation to use additional precision pretty much anywhere it wants to and many real-world implementations do exactly that.

A common solution for the floating point precision problem is to define an epsilon value yourself and compare to that instead of zero.
e.g.
const double epsilon = 0.00001;
if (std::fabs(value) < epsilon)  // treat value as 0 in your code; std::fabs needs <cmath> (plain abs would pick the integer overload)

Related

Floating-point comparison of constant assignment

When comparing doubles for equality, we need to give a tolerance level, because floating-point computation might introduce errors. For example:
double x;
double y;
x = f();
y = g();
if (fabs(x - y) < epsilon) {
    // they are equal!
} else {
    // they are not!
}
However, if I simply assign a constant value, without any computation, do I still need to check the epsilon?
double x = 1;
double y = 1;
if (x == y) {
    // they are equal!
} else {
    // no they are not!
}
Is == comparison good enough? Or do I need to do fabs(x-y)<epsilon again? Is it possible to introduce error in an assignment? Am I too paranoid?
How about casting (double x = static_cast<double>(100))? Is that going to introduce floating-point error as well?
I am using C++ on Linux, but if it differs by language, I would like to understand that as well.
Actually, it depends on the value and the implementation. The C++ standard (draft n3126) has this to say in 2.14.4 Floating literals:
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
In other words, if the value is exactly representable (and 1 is, in IEEE 754, as is 100 in your static cast), you get the value. Otherwise (such as with 0.1) you get an implementation-defined close match (a). Now, I'd be very worried about an implementation that chose a different close match based on the same input token, but it is possible.
(a) Actually, that paragraph can be read in two ways: either the implementation is free to choose the closest higher or the closest lower value regardless of which is actually closer, or it must choose whichever is closest to the desired value.
If the latter, it doesn't change this answer, since all you have to do is hard-code a floating-point value exactly at the midpoint of two representable values and the implementation is once again free to choose either.
For example, it might alternate between the next higher and next lower for the same reason banker's rounding is applied - to reduce the cumulative errors.
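To illustrate both sides of this, a minimal sketch (assuming IEEE 754 doubles; the second comparison shows why computed values still need a tolerance):
#include <iostream>

int main() {
    double x = 1;   // 1 is exactly representable, so both variables
    double y = 1;   // hold bit-for-bit identical values
    std::cout << (x == y) << '\n';             // prints 1
    // 0.1, 0.2 and 0.3 are all rounded on conversion, and the rounded
    // sum differs from the rounded literal in the last bit
    std::cout << (0.1 + 0.2 == 0.3) << '\n';   // typically prints 0
}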
No, if you assign literals they should be the same :)
Also, if you start with the same value and do the same operations, they should be the same.
Floating-point values are not exact, but the operations should produce consistent results :)
Both cases are ultimately subject to implementation-defined representations.
Storage of floating-point values and their representations takes on many forms - loaded by address or as a constant? optimized out by fast math? what is the register width? is it stored in an SSE register? Many variations exist.
If you need precise behavior and portability, do not rely on this implementation-defined behavior.
IEEE-754, the standard that common floating-point implementations abide by, requires floating-point operations to produce a result that is the nearest representable value to the infinitely-precise result. Thus the only imprecision you will face is rounding after each operation you perform, plus the propagation of rounding errors from operations performed earlier in the chain. Floats are not inexact per se. And by the way, epsilon can and should be computed; you can consult any numerics book on that.
Floating-point numbers can represent integers exactly up to the length of their mantissa. So, for example, a cast from int to double will always be exact, but a cast to float will no longer be exact for very large integers.
There is one major example of extensive use of floating-point numbers as a substitute for integers: the Lua scripting language, which has no built-in integer type and uses floating-point numbers extensively for logic, flow control, and so on. The performance and storage penalty of using floating-point numbers turns out to be smaller than the penalty of resolving multiple types at run time, and it makes the implementation lighter. Lua has been used extensively not only on PCs but also on game consoles.
Now, many compilers have an optional switch that disables IEEE-754 compatibility. Then compromises are made. Denormalized numbers (very, very small numbers where the exponent has reached its smallest possible value) are often treated as zero, and approximations may be made in the implementations of power, logarithm, sqrt, and 1/(x^2), but addition/subtraction, comparison and multiplication should retain their properties for numbers that can be exactly represented.
The easy answer: For constants == is ok.
There are two exceptions which you should be aware of:
First exception:
0.0 == -0.0
There is a negative zero which compares equal under the IEEE 754 standard. This means
1/INFINITY == 1/-INFINITY, which breaks f(x) == f(y) => x == y
Second exception:
NaN != NaN
This is a special caveat of NaN (Not a Number) which allows you to find out whether a number is a NaN on systems that do not have a test function available (yes, that happens).
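Both exceptions are easy to demonstrate; a minimal sketch (assuming IEEE 754 semantics):
#include <cmath>
#include <iostream>

int main() {
    std::cout << (0.0 == -0.0) << '\n';   // prints 1: signed zeros compare equal
    double x = std::nan("");
    std::cout << (x == x) << '\n';        // prints 0: NaN compares unequal to itself
    std::cout << (x != x) << '\n';        // prints 1: the classic portable NaN test
}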

How do I check and handle numbers very close to zero

I have some math (in C++) that seems to be generating some very small, near-zero numbers (I suspect the trig function calls are my real problem), and I'd like to detect these cases so that I can study them in more detail.
I'm currently trying out the following; is it correct?
if ( std::abs(x) < DBL_MIN ) {
    log_debug("detected small num, %Le, %Le", x, y);
}
Second, the mathematics is trigonometric in nature (i.e. using a lot of radian/degree conversions and sin/cos/tan calls), so what sort of transformations can I do to avoid mathematical errors?
Obviously for multiplications I can use a log transform - what else?
Contrary to widespread belief, DBL_MIN is not the smallest positive double value but the smallest positive normalized double value. Typically - for 64-bit IEEE 754 doubles - it's 2^-1022, while the smallest positive double value is 2^-1074. Therefore
I'm currently trying out the following; is it correct?
if ( std::abs(x) < DBL_MIN ) {
    log_debug("detected small num, %Le, %Le", x, y);
}
may have an affirmative answer. The condition checks whether x is a denormalized (also called subnormal) number or ±0.0. Without knowing more about your specific situation, I cannot tell if that test is appropriate. Denormalized numbers can be legitimate results of calculations or the consequence of rounding where the correct result would be 0. It is also possible that rounding produces numbers of far greater magnitude than DBL_MIN when the mathematically correct result would be 0, so a much larger threshold could be sensible.
If x is a double, then one problem with this approach is that you can't distinguish between x being legitimately zero, and x being a positive value smaller than DBL_MIN. So this will work if you know x can never be legitimately zero, and you want to see when underflow occurs.
You could also try catching the SIGFPE signal, which a POSIX-compliant system can raise on math errors, including floating-point underflow (note that floating-point traps usually have to be enabled explicitly, e.g. with feenableexcept on glibc). See: http://en.wikipedia.org/wiki/SIGFPE
EDIT: To be clear, DBL_MIN is NOT the largest negative value that a double can hold; it is the smallest positive normalized value that a double can hold. So your approach is fine as long as the value can't be zero.
Another useful constant is DBL_EPSILON, which is the smallest double value that can be added to 1.0 and produce a result different from 1.0. Note that this is a much larger value than DBL_MIN. But it may be useful to you, since you're doing trigonometric functions that may tend toward 1 instead of tending toward 0.
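If you need to tell exact zero apart from an underflowed (subnormal) result, std::fpclassify from <cmath> can make that distinction; a minimal sketch (report is just an illustrative helper name, not from the question):
#include <cmath>
#include <iostream>

void report(double x) {
    switch (std::fpclassify(x)) {
        case FP_ZERO:      std::cout << "exactly zero\n"; break;
        case FP_SUBNORMAL: std::cout << "subnormal: underflow occurred\n"; break;
        case FP_NORMAL:    std::cout << "normal value\n"; break;
        default:           std::cout << "infinity or NaN\n"; break;
    }
}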
Since you are using C++, the most idiomatic approach is to use std::numeric_limits from the header <limits>.
For instance:
#include <cmath>
#include <limits>

template <typename T>
bool is_close_to_zero(T x)
{
    return std::abs(x) < std::numeric_limits<T>::epsilon();
}
The actual tolerance to be used heavily depends on your problem. Please complete your question with a concrete use case so that I can enhance my answer.
There are also std::numeric_limits<T>::min() and std::numeric_limits<T>::denorm_min() that may be useful. The first one is the smallest positive normalized value of type T (equal to FLT/DBL/LDBL_MIN from <cfloat>), the second one is the smallest positive value of type T (it has no <cfloat> equivalent).
[You may find this document useful to read if you aren't at ease with floating-point number representation.]
The first if check will actually be true only when your value is zero or a denormalized (subnormal) number.
For your second question: you imply lots of conversions. Instead, pick one unit (degrees or radians) and do all your computations in that unit. Then, at the very end, do a single conversion to the other unit if you need to.
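For instance, a minimal sketch of that structure (the function names are illustrative, not from the question; M_PI is a POSIX extension):
#include <cmath>

// work entirely in radians internally
double compute_angle_rad(double y, double x) {
    return std::atan2(y, x);
}

// convert a single time, at the boundary of the program
double compute_angle_deg(double y, double x) {
    return compute_angle_rad(y, x) * 180.0 / M_PI;  // M_PI: POSIX, from <cmath>
}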

Comparing doubles to double literals? [duplicate]

Possible Duplicate:
How should I do floating point comparison?
Is it not recommended to compare a double and a double literal for equality in C++, because I guess it is compiler dependent?
To be more precise, it is not OK to compare a double which is hard-coded (a literal in the source code) with a double which is computed, as the last digit of the result of the calculation can vary from one compiler to another. Is this not standardized?
I heard this is mentioned in Knuth's TeXbook; is that right?
If this is all true, what is the solution?
You've misunderstood the advice a bit. The point is that floating-point computations aren't exact. Rounding errors occur, and precision is gradually lost. Take something as simple as 1.0/10.0. The result should be 0.1, but it isn't, because 0.1 cannot be represented exactly in floating-point format. So the actual result will be slightly different. The same is true for countless other operations, so the point has nothing to do with const doubles. It has to do with not expecting the result to be exact. If you perform some computation where the result should be 1.0, then you should not test it for equality against 1.0, because rounding errors might mean that it actually came out 0.9999999997 instead.
So the usual solution is to test if the result is sufficiently close to 1.0. If it is close, then we assume "it's good enough", and act as if the result had been 1.0.
The bottom line is that strict equality is rarely used for floating-point values. Instead, you should test if the difference between the two values is less than some small value (typically called the epsilon).
The problem you are talking about is due to rounding errors and will happen for every floating-point number. What you can do is define an epsilon and see if the difference between the two floating-point numbers is smaller than it. E.g.:
#include <cmath>

double A = somethingA();
double B = somethingB();
double epsilon = 0.00001;
if (std::fabs(A - B) < epsilon)
    doublesAreEqual();
[Edit] Also see this question: What is the most effective way for float and double comparison?
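A fixed epsilon breaks down when the compared values are very large or very small, which is why discussions like the linked one usually recommend a relative tolerance. A minimal sketch of that idea (the tolerance values are assumptions to be tuned per problem):
#include <algorithm>
#include <cmath>

bool nearly_equal(double a, double b,
                  double rel_eps = 1e-9,    // relative tolerance
                  double abs_eps = 1e-12) { // absolute floor near zero
    double diff = std::fabs(a - b);
    if (diff < abs_eps) return true;  // handles values close to zero
    return diff <= rel_eps * std::max(std::fabs(a), std::fabs(b));
}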
The key problem is how floating-point arithmetic works: it involves rounding that can cause a comparison for equality to evaluate incorrectly. This applies to all floating-point numbers, regardless of whether the variable is declared const or not.
If you do floating-point calculations and you need to compare against certain fixed values, it is always safer to use an epsilon value to take precision errors into account.
Example:
double calcSomeStuf();
if (calcSomeStuf() == 0.1) { ... }
is a bad idea
however:
const double epsilon = 0.005;
double calcSomeStuf();
if (std::fabs(calcSomeStuf() - 0.1) < epsilon) { ... }
is a lot safer (especially considering the fact that 0.1 cannot be represented exactly as a double)
This is necessary because rounding errors accumulate as floating-point operations are chained, and due to the nature of floating point not all numbers can be represented exactly.
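To see the accumulation in action, a minimal sketch (assuming IEEE 754 doubles):
#include <cmath>
#include <iostream>

int main() {
    // 0.1 is not exactly representable, so its rounding error
    // accumulates over ten additions
    double sum = 0.0;
    for (int i = 0; i < 10; ++i)
        sum += 0.1;
    std::cout << (sum == 1.0) << '\n';                  // typically prints 0
    std::cout << (std::fabs(sum - 1.0) < 1e-9) << '\n'; // prints 1
}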

Double versus float

I have a constant for pi in my code:
const float PI = acos(-1);
Would it be better to declare it as a double? An answer to another question on this site said floating point operations aren't exactly precise, and I'd like the constant to be accurate.
"precise" is not a boolean concept. float provides a certain amount of precision. Whether or not that amount is sufficient for your application depends on, well, your application.
Most applications don't need more precision than float provides, though many prefer to use double to (try to) gloss over problems with unstable algorithms, or "just because" of misconceptions like "floating-point operations aren't exactly precise".
In most cases when a float is "not precise enough", the problem is not float, it's the code that uses it.
Edit: That being said, most modern CPUs only do calculations in double precision or greater anyway, so you might as well use double unless you're working with large arrays and memory usage is an issue.
From the standard:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
Of the three (notice that this goes hand in hand with the three versions of acos) you should choose long double if what you are aiming for is precision (but you should also know that beyond some point, further precision may be redundant in some cases).
So you should use this to get the most precise result from acos
long double result = acos(-1.0L);  // -1.0L is a long double literal, selecting the long double overload
(Note: there might be some platform-specific or user-defined types which provide more precision.)
I'd like the constant to be accurate.
There is no such thing as a perfectly accurate floating-point value. Such values cannot be stored with infinite precision, because of their representation in memory; that is only possible for integers (within their range). double gives you double the precision a float offers (who would have guessed). double should fit your needs in almost every case.
I would recommend using M_PI from <cmath>, which should be available in all POSIX-compliant implementations of the standard library.
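For example (note that M_PI is a POSIX extension rather than part of standard C++, so its availability here is an assumption; MSVC needs _USE_MATH_DEFINES defined before including <cmath>):
#include <cmath>  // M_PI (POSIX; on MSVC define _USE_MATH_DEFINES first)

const double PI = M_PI;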
It depends exactly how precise you need to be. I've never had to use doubles because floats were not precise enough.
The most accurate representation of pi is M_PI from math.h
The question boils down to: how much accuracy do you need?
Let's quote Wikipedia:
For example, the decimal representation of π truncated to 11 decimal places is good enough to estimate the circumference of any circle that fits inside the Earth with an error of less than one millimetre, and the decimal representation of π truncated to 39 decimal places is sufficient to estimate the circumference of any circle that fits in the observable universe with precision comparable to the radius of a hydrogen atom.
I've written a small Java program; here's its output:
As string: 3.14159265358979323846264338327950288419716939937510
As double: 3.141592653589793
As float: 3.1415927
Remember that if you want the full precision of a double, all the numbers in your calculation also need to be doubles. (That is not entirely true, but it is close enough.)
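The same comparison as a minimal C++ sketch (assuming IEEE 754 float and double; the digit counts in the comments are typical):
#include <cmath>
#include <iomanip>
#include <iostream>

int main() {
    // pi to 50 decimal places, for reference
    std::cout << "3.14159265358979323846264338327950288419716939937510\n";
    // double keeps roughly 15-17 significant decimal digits
    std::cout << std::setprecision(17) << std::acos(-1.0) << '\n';
    // float keeps roughly 6-9 significant decimal digits
    std::cout << std::setprecision(9) << std::acos(-1.0f) << '\n';
}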
For most applications, float would do just fine for PI. double definitely has more precision, but it doesn't guarantee exactness any more than float can. By that I mean that π has no finite binary representation, so however you store it you will only succeed to the nth digit, where n is determined by how many bits you use to represent the number.
Unfortunately, to hold many digits of PI you'd probably need to store it as a string, though at that point we're talking about the kind of impressive number crunching you might see in molecular simulations, and you're probably not going to need that level of precision.
As this site says, there are three overloaded versions of the acos function.
Therefore the call acos(-1) is ambiguous.
Having said that, you should declare PI as long double to avoid any loss of precision, by using
const long double PI = acos(-1.0L);