Precision of floating point operations

Precision of floating point operations - c++

Floating point operations are typically approximations to the corresponding arithmetic operations, because in many cases the precise arithmetic result cannot be represented by the internal number format. But what happens if I have a program where all numbers can actually be represented exactly by IEEE 754 single precision? Consider for example this:
float a = 7;
float b = a / 32.0;
float c = a + b;
float d = c - a;
assert( d == b );
Is it safe to assume that within this and similar cases, the result of the numerical calculation is identical to the exact arithmetic result? I think this sort of code would work on every actual machine. Is this correct?
Edit This question is not about what the standard says, but rather about the real world. I'm well aware that in theory one could create a conforming engine where this would fail. I don't care, but rather I wonder if this works on existing machines.

No as the c++ standard does not require floating point values to be stored in the IEEE 754 format.

"Real" machines are quite careful in implementing the standard exactly (just remember the Pentium division bug). But be careful to check, the i386 line of machines did use some extra bits in its registers, which were cut off when asigning to/from memory, so computations done only in registers gave different results than if some intermediate results where spilled to memory.
Also check out David Goldberg's What every computer scientist should know about floating point arithmetic.
In any case, it doesn't make any sense to "round" (or otherwise mangle) a number that can be represented exactly.

Related

Is imprecision of a double value guaranteed to be consistent on the same machine?

We know that doubles loose precision of their value as you increase the number of decimal points. But if I use the same double value twice on the same machine will I be guaranteed to have the same imprecision? For example:
double d1 = 123.456;//actually becomes 123.45600001
double d2 = 123.456;//is guaranteed to become 123.45600001?
For simplicity sake lets stick with just C++.

No, you do not have this guarantee. C++ implementations are not bound by IEEE standard, and can choose any binary represenation they want.
While they are unlikely to just invent their own, they usually have fluctuate (even within the same vendor) in how thew represent the 'nonrepresentable' numbers - they can be represented with a bigger number or a smaller number - and this representation changes (I believe, even different floating math options for same gcc can affect this).

In your case, d1 == d2 will return true, and it will almost always return true on any properly operating compiler/architecture. Even if the compiler/architecture aren't IEEE conforming, it's extremely unlikely that providing an identical symbol for 123.456 would return different values on multiple invocations.
But, if you had the following code:
double d1 = 123.456;
double d2 = get_123_456_from_network_service(service);
That guarantee would cease to be. 123.456 will be exactly the same on your computer all the time, but if a different computer attempts to use the same symbol, and isn't IEEE conforming (or if yours isn't), then there's a good chance the values would be different.

If your compiler conforms to the IEEE 754 (IEC 559) standard, they should be the same. You can check if it is IEEE compliant with std::numeric_limits<T>::is_iec559. Most implementations are IEEE compliant, but this is not a guarantee.

Are floating point errors deterministic?

One of the big got'chas of floating point numbers is that some of them cannot be exactly represented in binary. This can make them difficult to work with. However what I'm curious about is whether or not subtle or not-so-subtle errors in floating point are deterministic. Can somebody predict them for example? Here's one example of a random number generator that could take advantage of floating point errors:
#include <cmath>
float constant = M_PI;
float generate()
{
static float state = 1;
state = state * constant;
return state;
}
One would have to know the implementation, the hardware, the compiler settings and so on, which makes it quite difficult to predict what the results would be. Or is my thinking flawed?

Floating point "errors" are deterministic. There is a 1:1 mapping between input and output values for a given operation. Your example will produce the same output sequence every time.
That said, there could be a floating-point implementation or ten out there that will produce different sequences, but this is not something you can consider "random" (i.e. a source of entropy).

Every floating point representation defines the composition of a floating point variable (which part is the mantissa, which part is the exponent, which part is the sign, etc) and the behaviour of every operation.
In any implementation you might choose, it is therefore possible to predict the result of every floating point operation, if you know its operand (or operands) That characteristic is the definition of determinism.
So, yes, floating point operations are deterministic.
Different implementations (compilers, host systems, etc) do support different floating point representations. So there is some variation of results between implementations. However, it is still possible to predict the result of any floating point operation, if you know how floating point variables are represented, and how operations work.
The fact not everyone knows enough about floating point types and operations on them does not make them non-deterministic. Nor does the fact that not everyone can describe the complete set of operations in a complex algorithm. The knowledge is readily available and, with enough effort, understandable well enough so effects of all operations on all possible operands can be reliably predicted before doing the operation.
There are buggy implementations of floating point out there, which do not comply with their own documentation. For example, look up the pentium FDIV bug - where some early pentium CPUs implemented floating point division incorrectly. Even those turned out to be deterministic, once it was understood what the operations actually do.

Should I worry about precision when I use C++ mathematical functions with integers?

For example, The code below will give undesirable result due to precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So round function can not solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?

Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0`”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question the the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not come even with the guarantees that basic operations come with.

Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it will be converted to the previous integer instead (e.g. 3-1×10-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it will be converted to the next integer instead (e.g. -3+1×10-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.

c++ and matlab floating point computation

Could someone tell me if c++ and matlab use the same floating point computation implementations? Will I get the same values in C++ as I would in Matlab?
Currently I have these discrepancies from translating my Matlab code into C++:
Matlab: R = 1.0000000001623, I = -3.07178893432791e-010, C = -3.79693498864242e-011
C++: R = 1.00000000340128 I = -3.96890964537988e-009 Z = 2.66864907949582e-009
If not what is the difference and where can I find more about floating point computation implementations?
Thanks!

Although it's not clear what your numbers actually are, the relative difference of the first (and largest) numbers is about 1e-8, which is the relative tolerance of many double precision algorithms.
Floating point numbers are only an approximation of the real number system, and their finite size (64 bits for double precision) limits their precision. Because of this finite precision, operations that involve floating point numbers can incur round-off error, and are thus not strictly associative. What this means is that A+(B+C) != (A+B)+C. The difference between the two is usually small, depending on their relative sizes, but it's not always zero.
What this means is that you should expect small differences in the relative and absolute values when you compare an algorithm coded in Matlab to one in C++. The difference may be in the libraries (i.e., there's no guarantee that Matlab uses the system math library for routines like sqrt), or it may just be that your C++ and Matlab implementations order their operations differently.
The section on floating point comparison tests in Boost::Test discusses this a bit, and has some good references. In particular, you should probably read What Every Computer Scientist Should Know About Floating-Point Arithmetic and consider picking up a copy of Knuth's TAOCP Vol. II.

Matlab by default uses a double floating point precision, a C float uses a single floating point precision.
The representation for floating points is the same between the two, and is either processor or I believe a standard. But as mentioned, floating points are extremely fickle, you always have to allow for some tolerance. If you do a complex operation, such as the one below, you will frequently get a non-zero number, despite algebra telling you otherwise. The stuff under the hood between how operations are done with matlab and c will allow for some differences. Just make sure they are close.
((3*pi+2)*5-9)/2-7.5*pi-3

Floating-point comparison of constant assignment

When comparing doubles for equality, we need to give a tolerance level, because floating-point computation might introduce errors. For example:
double x;
double y;
x = f();
y = g();
if (fabs(x-y)<epsilon) {
// they are equal!
} else {
// they are not!
}
However, if I simply assign a constant value, without any computation, do I still need to check the epsilon?
double x = 1;
double y = 1;
if (x==y) {
// they are equal!
} else {
// no they are not!
}
Is == comparison good enough? Or I need to do fabs(x-y)<epsilon again? Is it possible to introduce error in assigning? Am I too paranoid?
How about casting (double x = static_cast<double>(100))? Is that gonna introduce floating-point error as well?
I am using C++ on Linux, but if it differs by language, I would like to understand that as well.

Actually, it depends on the value and the implementation. The C++ standard (draft n3126) has this to say in 2.14.4 Floating literals:
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
In other words, if the value is exactly representable (and 1 is, in IEEE754, as is 100 in your static cast), you get the value. Otherwise (such as with 0.1) you get an implementation-defined close match (a). Now I'd be very worried about an implementation that chose a different close match based on the same input token but it is possible.
(a) Actually, that paragraph can be read in two ways, either the implementation is free to choose either the closest higher or closest lower value regardless of which is actually the closest, or it must choose the closest to the desired value.
If the latter, it doesn't change this answer however since all you have to do is hardcode a floating point value exactly at the midpoint of two representable types and the implementation is once again free to choose either.
For example, it might alternate between the next higher and next lower for the same reason banker's rounding is applied - to reduce the cumulative errors.

No if you assign literals they should be the same :)
Also if you start with the same value and do the same operations, they should be the same.
Floating point values are non-exact, but the operations should produce consistent results :)

Both cases are ultimately subject to implementation defined representations.
Storage of floating point values and their representations take on may forms - load by address or constant? optimized out by fast math? what is the register width? is it stored in an SSE register? Many variations exist.
If you need precise behavior and portability, do not rely on this implementation defined behavior.

IEEE-754, which is a standard common implementations of floating point numbers abide to, requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result. Thus the only imprecision that you will face is rounding after each operation you perform, as well as propagation of rounding errors from the operations performed earlier in the chain. Floats are not per se inexact. And by the way, epsilon can and should be computed, you can consult any numerics book on that.
Floating point numbers can represent integers precisely up to the length of their mantissa. So for example if you cast from an int to a double, it will always be exact, but for casting into into a float, it will no longer be exact for very large integers.
There is one major example of extensive usage of floating point numbers as a substitute for integers, it's the LUA scripting language, which has no integer built-in type, and floating-point numbers are used extensively for logic and flow control etc. The performance and storage penalty from using floating-point numbers turns out to be smaller than the penalty of resolving multiple types at run time and makes the implementation lighter. LUA has been extensively used not only on PC, but also on game consoles.
Now, many compilers have an optional switch that disables IEEE-754 compatibility. Then compromises are made. Denormalized numbers (very very small numbers where the exponent has reached smallest possible value) are often treated as zero, and approximations in implementation of power, logarithm, sqrt, and 1/(x^2) can be made, but addition/subtraction, comparison and multiplication should retain their properties for numbers which can be exactly represented.

The easy answer: For constants == is ok.
There are two exceptions which you should be aware of:
First exception:
0.0 == -0.0
There is a negative zero which compares equal for the IEEE 754 standard. This means
1/INFINITY == 1/-INFINITY which breaks f(x) == f(y) => x == y
Second exception:
NaN != NaN
This is a special caveat of NotaNumber which allows to find out if a number is a NaN
on systems which do not have a test function available (Yes, that happens).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js