Computer precision: when should I have to worry about it? - c++

In C++ programming, when do I need to worry about the precision issue? To take a small example (it might not be a perfect one though),
std::vector<double> first (50000, 0.0);
std::vector<double> second (first);
Is it possible that second[619] = 0.00000000000000000000000000001234 (I mean a very small value)? Or that SUM = second[0]+second[1]+...+second[49999] => 1e-31? Or that SUM = second[0]-second[1]-...-second[49999] => -7.987654321e-12?
My questions:
Could there be small disturbances when working with double-type numbers?
What may cause these kinds of small disturbances, i.e. rounding errors becoming large? Could you please list them? How can one take precautions?
If there could be small disturbance in certain operations, does it then mean after these operations, using if (SUM == 0) is dangerous? One should then always use if (SUM < SMALL) instead, where SMALL is defined as a very small value, such as 1E-30?
Lastly, could the small disturbances result in a negative value? Because if that is possible, then I had better use if (abs(SUM) < SMALL) instead.
Any experiences?

This is a good reference document for floating point precision: What Every Computer Scientist Should Know About Floating-Point Arithmetic
One of the more important parts is catastrophic cancellation:
Catastrophic cancellation occurs when the operands are subject to rounding errors. For example in the quadratic formula, the expression b² - 4ac occurs. The quantities b² and 4ac are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps. For example, consider b = 3.34, a = 1.22, and c = 2.28. The exact value of b² - 4ac is .0292. But b² rounds to 11.2 and 4ac rounds to 11.1, hence the final answer is .1 which is an error by 70 ulps, even though 11.2 - 11.1 is exactly equal to .1. The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications.
Benign cancellation occurs when subtracting exactly known quantities.
If x and y have no rounding error, then by Theorem 2 if the
subtraction is done with a guard digit, the difference x-y has a very
small relative error (less than 2ε).
A formula that exhibits catastrophic cancellation can sometimes be
rearranged to eliminate the problem. Again consider the quadratic
formula
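The quote breaks off there, but the rearrangement it refers to is the standard trick of computing the root whose terms add rather than cancel, and then recovering the other root from the product of the roots. A minimal C++ sketch of that idea (my own code, not from the paper; it assumes a non-zero leading coefficient and that b and the discriminant are not both zero):

#include <cmath>
#include <cstdio>

// Roots of a*x^2 + b*x + c = 0, avoiding the cancellation in -b + sqrt(b^2 - 4ac).
void solve_quadratic(double a, double b, double c, double& x1, double& x2) {
    double disc = std::sqrt(b * b - 4.0 * a * c);
    double q = -0.5 * (b + std::copysign(disc, b));   // both terms have the same sign: no cancellation
    x1 = q / a;
    x2 = c / q;                                       // Vieta: x1 * x2 = c / a
}

int main() {
    double x1, x2;
    solve_quadratic(1.0, 1e8, 1.0, x1, x2);           // the naive formula would lose the small root
    std::printf("%.17g %.17g\n", x1, x2);             // about -1e8 and -1e-8
}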

For your specific example, 0 has an exact representation as a double, and adding exactly 0 to a double does not change its value.
Also, like any other values you put in variables, numbers that you initialize in the array are not going to mysteriously change. You only get rounding when the result of a calculation cannot be exactly represented as a floating point number.
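As a quick illustration of that, mirroring the vectors in the question (my own sketch):

#include <cstdio>
#include <vector>

int main() {
    std::vector<double> first(50000, 0.0);
    std::vector<double> second(first);        // copying does not change any value
    double sum = 0.0;
    for (double v : second) sum += v;         // 0.0 + 0.0 is exactly 0.0 every time
    std::printf("%s\n", sum == 0.0 ? "exactly zero" : "nonzero");   // prints "exactly zero"
}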
To give a better opinion about "disturbances" I would need to know the kinds of calculations that your code performs.

Related

Should I worry about precision when I use C++ mathematical functions with integers?

For example, the code below will give an undesirable result due to the limited precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So the round function cannot solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) from being 3 does not exist for basic operations (including *). But it does exist for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
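As a small illustration of that last point (my own example, with operands chosen so that the exact results are representable):

#include <cstdio>

int main() {
    double a = 1.0 / 4.0;                        // 0.25 is exactly representable
    std::printf("%d\n", a * 4.0 == 1.0);         // prints 1: exact operands, exactly representable result
    std::printf("%d\n", 0.5 + 0.25 == 0.75);     // prints 1: all three values are exact
}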
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, prompted by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus producing 99 instead of 100 once converted to an integer.
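A quick way to see what your particular library does is to print its results for exactly representable cases (a diagnostic sketch; the outcome depends on the quality of the implementation):

#include <cmath>
#include <cstdio>

int main() {
    std::printf("log2(8.0)      = %.17g\n", std::log2(8.0));
    std::printf("pow(10.0, 2.0) = %.17g\n", std::pow(10.0, 2.0));
    // A good-quality library prints exactly 3 and 100, but unlike sqrt,
    // IEEE 754 does not guarantee this for log2 and pow.
}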
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question in the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not even come with the guarantees that basic operations come with.
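For the edited question (the largest power of two not exceeding N), the loop the asker already mentions can be written purely with integers, which sidesteps floating point entirely. A sketch (the helper name is mine):

#include <cstdio>

// Largest power of 2 that is <= n, for n >= 1; no floating point involved.
unsigned long long largest_pow2_not_above(unsigned long long n) {
    unsigned long long p = 1;
    while (p <= n / 2) p *= 2;   // comparing against n/2 avoids overflow near the top of the range
    return p;
}

int main() {
    std::printf("%llu\n", largest_pow2_not_above(8));      // 8
    std::printf("%llu\n", largest_pow2_not_above(1000));   // 512
}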
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
(1) a floating point number really, really close to a positive integer but just below it will be converted to the previous integer instead (e.g. 3 - 1×10⁻¹⁰ = 2.9999999999 will be converted to 2)
(2) a floating point number really, really close to a negative integer but just above it will be converted to the next integer instead (e.g. -3 + 1×10⁻¹⁰ = -2.9999999999 will be converted to -2)
The combination of (1) and (2) also means that using int(x + 0.5) will not work reasonably, as it will round negative numbers up (toward zero) rather than to the nearest integer.
There is a reasonable round function, but it unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
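A short demonstration of the three behaviours described above (my own example values):

#include <cmath>
#include <cstdio>

int main() {
    double x = 3.0 - 1e-10;                    // 2.9999999999
    int a = x;                                 // default conversion truncates: 2
    int b = int(std::round(x));                // round to nearest, then convert: 3
    long c = std::lround(x);                   // C99/C++11 one-step version: 3
    std::printf("%d %d %ld\n", a, b, c);

    double y = -3.0 + 1e-10;                   // -2.9999999999
    std::printf("%d %ld\n", int(y + 0.5), std::lround(y));   // -2 vs -3
}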
Note that the only numbers that can be represented exactly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented exactly, but even just 0.1 is impossible to represent exactly, and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
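This is easy to check (a sketch; as noted just below, a compiler that keeps extra intermediate precision may change the outcome):

#include <cstdio>

int main() {
    double sum = 0.0;
    for (int i = 0; i < 10; ++i) sum += 0.1;   // 0.1 itself is already an approximation
    std::printf("%.17g\n", sum);               // typically 0.99999999999999989
    std::printf("%d\n", sum == 1.0);           // typically 0
}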
Even worse, compilers are allowed to use higher precision for intermediate results. This means that adding 0.1 ten times may give back 1 when converted to an integer, if the compiler decides to use higher accuracy and round to the closest double at the end.
This is "worse" because, even though the precision is higher, the results become dependent on the compiler and on compiler options, which makes reasoning about the computations harder and makes the exact results non-portable across systems (even ones that use the same precision and format).
Most compilers have special options to avoid this specific problem.

IEEE-754 floating point: Divide first or multiply first for best precision?

What's better if I want to preserve as much precision as possible in a calculation with IEEE-754 floating point values:
a = b * c / d
or
a = b / d * c
Is there a difference? If there is, does it depend on the magnitudes of the input values? And, if magnitude matters, how is the best ordering determined when general magnitudes of the values are known?
It depends on the magnitude of the values. Obviously, if one divides by zero, all bets are off, but if a multiplication or division results in a denormal, subsequent operations can lose precision.
You may find it useful to study Goldberg's seminal paper What Every Computer Scientist Should Know About Floating-Point Arithmetic which will explain things far better than any answer you're likely to receive here. (Goldberg was one of the original authors of IEEE-754.)
Assuming that none of the operations would yield an overflow or an underflow, and your input values have uniformly distributed significands, then this is equivalent. Well, I suppose that to have a rigorous proof, one should do an exhaustive test (probably not possible in practice for double precision since there are 2^156 inputs), but if there is a difference in the average error, then it is tiny. I could try in low precisions with Sipe.
In any case, in the absence of overflow/underflow, only the exact values of the significands matter, not the exponents.
However if the result a is added to (or subtracted from) another expression and not reused, then starting with the division may be more interesting since you can group the multiplication with the following addition by using a FMA (thus with a single rounding).
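For instance, with the fused multiply-add available since C99/C++11 (a sketch of the grouping described above; the values are arbitrary):

#include <cmath>
#include <cstdio>

int main() {
    double b = 1.0, c = 3.0, d = 7.0, e = 0.125;
    double grouped = std::fma(b / d, c, e);   // (b/d)*c + e: one rounding for the divide, one for the fma
    double naive   = b / d * c + e;           // separate roundings for /, * and +
    std::printf("%.17g\n%.17g\n", grouped, naive);
}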

d0 when taking roots of numbers

So in general, I understand the difference between specifying 3. and 3.0d0 with the difference being the number of digits stored by the computer. When doing arithmetic operations, I generally make sure everything is in double precision. However, I am confused about the following operations:
64^(1./3.) vs. 64^(1.0d0/3.0d0)
It took me a couple of weeks to find an error where I was assigning the output of 64^(1.0d0/3.0d0) to an integer. Because 64^(1.0d0/3.0d0) returns 3.999999, the integer got the value 3 and not 4. However, 64^(1./3.) = 4.00000. Can someone explain to me why it is wise to use 1./3. vs. 1.0d0/3.0d0 here?
The issue isn't so much single versus double precision. All floating point calculations are subject to imprecision compared to true real numbers. In assigning a real to an integer, Fortran truncates. You probably want to use the Fortran intrinsic nint.
This is a peculiar, fortuitous case where the lower precision calculation gives the exact result. You can see this without the integer conversion issue:
write(*,*)4.d0-64**(1./3.),4.d0-64**(1.d0/3.d0)
0.000000000 4.440892E-016
In general this does not happen; here the double precision value is "better":
write(*,*)13.d0-2197**(1./3.),13.d0-2197**(1.d0/3.d0)
-9.5367E-7 1.77E-015
Here, since the s.p. calc comes out slightly high it gives you the correct value on integer conversion, while the d.p. result will get rounded down, hence be wrong, even though the floating point error was smaller.
So in general, no you should not consider use of single precision to be preferred.
In fact, 64 and 125 seem to be the only special cases where the s.p. calc gives a perfect cube root while the d.p. calc does not.

c++ Floating point subtraction error and absolute values

The way I understand it is: when subtracting two double-precision numbers in C++, each is first transformed to a significand (starting with one) times 2 to the power of the exponent. Then one can get an error if the subtracted numbers have the same exponent and many of the same digits in the significand, leading to loss of precision. To test this for my code I wrote the following safe addition function:
#include <cmath>      // frexp, fabs
#include <iostream>

double Sadd(double d1, double d2, int& report, double prec) {
    int exp1, exp2;
    // frexp splits each value into significand * 2^exponent
    double man1 = std::frexp(d1, &exp1), man2 = std::frexp(d2, &exp2);
    if (d1 * d2 < 0) {                            // opposite signs: effectively a subtraction
        if (exp1 == exp2) {                       // same binary exponent
            if (std::fabs(man1 + man2) < prec) {  // significands nearly cancel
                std::cout << "Floating point error" << std::endl;
                report = 0;
            }
        }
    }
    return d1 + d2;
}
However, testing this I notice something strange: it seems that the actual error (not whether the function reports error but the actual one resulting from the computation) seems to depend on the absolute values of the subtracted numbers and not just the number of equal digits in the significand...
For examples, using 1e-11 as the precision prec and subtracting the following numbers:
1) 9.8989898989898-9.8989898989897: The function reports error and I get the highly incorrect value 9.9475983006414e-14
2) 98989898989898-98989898989897: The function reports error but I get the correct value 1
Obviously I have misunderstood something. Any ideas?
If you subtract two floating-point values that are nearly equal, the result will mostly reflect noise in the low bits. Nearly equal here is more than just same exponent and almost the same digits. For example, 1.0001 and 1.0000 are nearly equal, and subtracting them could be caught by a test like this. But 1.0000 and 0.9999 differ by exactly the same amount, and would not be caught by a test like this.
Further, this is not a safe addition function. Rather, it's a post-hoc check for a design/coding error. If you're subtracting two values that are so close together that noise matters you've made a mistake. Fix the mistake. I'm not objecting to using something like this as a debugging aid, but please call it something that implies that that's what it is, rather than suggesting that there's something inherently dangerous about floating-point addition. Further, putting the check inside the addition function seems excessive: an assert that the two values won't cause problems, followed by a plain old floating-point addition, would probably be better. After all, most of the additions in your code won't lead to problems, and you'd better know where the problem spots are; put asserts in the problems spots.
+1 to Pete Becker's answer.
Note that the problem of a degenerate result might also occur with exp1 != exp2
For example, if you subtract
1.0-0.99999999999999
So,
// man1/man2 and exp1/exp2 as obtained from frexp in Sadd above
bool degenerated =
       (exp1 == exp2     && std::fabs(man1 + man2)     < prec)
    || (exp1 == exp2 - 1 && std::fabs(man1 + 2 * man2) < prec)
    || (exp1 == exp2 + 1 && std::fabs(2 * man1 + man2) < prec);
You can omit the check for d1*d2 < 0, or keep it so that the whole test is skipped when the operands have the same sign...
If you want to also handle loss of precision with degenerate denormalized floats, that'll be a bit more involved (it's as if the significand had fewer bits).
It's quite easy to prove that for IEEE 754 floating-point arithmetic, if x/2 <= y <= 2x then calculating x - y is an exact operation and will give the exact result correctly without any rounding error.
And if the result of an addition or subtraction is a denormalised number, then the result is always exact.
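One way to observe that exactness in code is the classic two-sum trick, which recovers the rounding error of an addition: for operands satisfying the x/2 <= y <= 2x condition the recovered error is zero. A sketch (mine; it assumes round-to-nearest and a compiler that does not reassociate floating-point expressions):

#include <cstdio>

// Knuth's two-sum: returns the exact rounding error of a + b.
double two_sum_error(double a, double b) {
    double s  = a + b;
    double bb = s - a;
    return (a - (s - bb)) + (b - bb);
}

int main() {
    double x = 1.0, y = 0.99999999999999;              // y is within [x/2, 2x]
    std::printf("error = %g\n", two_sum_error(x, -y)); // prints 0: the subtraction was exact
}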

How can I get consistent program behavior when using floats?

I am writing a simulation program that proceeds in discrete steps. The simulation consists of many nodes, each of which has a floating-point value associated with it that is re-calculated on every step. The result can be positive, negative or zero.
In the case where the result is zero or less something happens. So far this seems straightforward - I can just do something like this for each node:
if (value <= 0.0f) something_happens();
A problem has arisen, however, after some recent changes I made to the program in which I re-arranged the order in which certain calculations are done. In a perfect world the values would still come out the same after this re-arrangement, but because of the imprecision of floating point representation they come out very slightly different. Since the calculations for each step depend on the results of the previous step, these slight variations in the results can accumulate into larger variations as the simulation proceeds.
Here's a simple example program that demonstrates the phenomena I'm describing:
#include <cstdio>

int main() {
    float f1 = 0.000001f, f2 = 0.000002f;
    f1 += 0.000004f;          // This part happens first here
    f1 += (f2 * 0.000003f);
    printf("%.16f\n", f1);

    f1 = 0.000001f, f2 = 0.000002f;
    f1 += (f2 * 0.000003f);
    f1 += 0.000004f;          // This time this happens second
    printf("%.16f\n", f1);
    return 0;
}
The output of this program is
0.0000050000057854
0.0000050000062402
even though addition is commutative and associative, so both results should be the same. Note: I understand perfectly well why this is happening - that's not the issue. The problem is that these variations can mean that sometimes a value that used to come out negative on step N, triggering something_happens(), now may come out negative a step or two earlier or later, which can lead to very different overall simulation results because something_happens() has a large effect.
What I want to know is whether there is a good way to decide when something_happens() should be triggered that is not going to be affected by the tiny variations in calculation results that result from re-ordering operations so that the behavior of newer versions of my program will be consistent with the older versions.
The only solution I've so far been able to think of is to use some value epsilon like this:
if (value < epsilon) something_happens();
but because the tiny variations in the results accumulate over time I need to make epsilon quite large (relatively speaking) to ensure that the variations don't result in something_happens() being triggered on a different step. Is there a better way?
I've read this excellent article on floating point comparison, but I don't see how any of the comparison methods described could help me in this situation.
Note: Using integer values instead is not an option.
Edit: the possibility of using doubles instead of floats has been raised. This wouldn't solve my problem, since the variations would still be there; they'd just be of a smaller magnitude.
I've worked with simulation models for 2 years and the epsilon approach is the sanest way to compare your floats.
Generally, using suitable epsilon values is the way to go if you need to use floating point numbers. Here are a few things which may help:
If your values are in a known range and you don't need divisions, you may be able to scale the problem and use exact operations on integers. In general, though, these conditions don't apply.
A variation is to use rational numbers to do exact computations. This still has restrictions on the operations available and it typically has severe performance implications: you trade performance for accuracy.
The rounding mode can be changed. This can be used to compute an interval rather than an individual value (possibly with 3 values resulting from rounding up, rounding down, and rounding to closest). Again, it won't work for everything, but you may get an error estimate out of this.
Keeping track of the value and a number of operations (possibly multiple counters) may also be used to estimate the current size of the error.
To possibly experiment with different numeric representations (float, double, interval, etc.) you might want to implement your simulation as templates parameterized for the numeric type.
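A minimal sketch of that last suggestion (the names and the toy step function are hypothetical):

#include <cstdio>

// One simulation step, templated on the numeric type so float, double, or a
// custom interval/error-tracking type can be swapped in for comparison.
template <typename Real>
Real step(Real value) {
    return value + Real(0.000004) + Real(0.000002) * Real(0.000003);
}

int main() {
    std::printf("float : %.16f\n", static_cast<double>(step(0.000001f)));
    std::printf("double: %.16f\n", step(0.000001));
}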
There are many books written on estimating and minimizing errors when using floating point arithmetic. This is the topic of numerical mathematics.
In most cases I'm aware of, people experiment briefly with some of the methods mentioned above, conclude that the model is imprecise anyway, and don't bother with the effort. Also, doing something other than using float may yield better results but is often just too slow, even using double, due to the doubled memory footprint and the reduced opportunity for SIMD operations.
I recommend that you single step - preferably in assembly mode - through the calculations while doing the same arithmetic on a calculator. You should be able to determine which calculation orderings yield results of lesser quality than you expect and which work. You will learn from this and probably write better-ordered calculations in the future.
In the end - given the examples of numbers you use - you will probably need to accept the fact that you won't be able to do equality comparisons.
As to the epsilon approach you usually need one epsilon for every possible exponent. For the single-precision floating point format you would need 256 single precision floating point values as the exponent is 8 bits wide. Some exponents will be the result of exceptions but for simplicity it is better to have a 256 member vector than to do a lot of testing as well.
One way to do this could be to determine your base epsilon in the case where the exponent is 0, i.e. the value to be compared against is in the range 1.0 <= x < 2.0. Preferably the epsilon should be chosen to be base-2 adapted, i.e. a value that can be exactly represented in a single precision floating point format - that way you know exactly what you are testing against and won't have to think about rounding problems in the epsilon as well. For exponent -1 you would use your base epsilon divided by two, for -2 divided by 4 and so on. As you approach the lowest and the highest parts of the exponent range you gradually run out of precision - bit by bit - so you need to be aware that extreme values can cause the epsilon method to fail.
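Here is how that scaling could look in code, using ilogb/ldexp instead of a 256-entry table (my reading of the suggestion; the function name and the sample base epsilon are made up):

#include <cmath>
#include <cstdio>

// base_epsilon is chosen for references in [1.0, 2.0); it is scaled by the
// reference's power-of-two exponent, so the scaling itself is exact.
// reference must be non-zero and finite.
bool nearly_equal(float x, float reference, float base_epsilon) {
    int exp = std::ilogb(reference);            // 0 when reference is in [1.0, 2.0)
    float eps = std::ldexp(base_epsilon, exp);  // base_epsilon * 2^exp
    return std::fabs(x - reference) < eps;
}

int main() {
    std::printf("%d\n", nearly_equal(1.0f + 1e-7f, 1.0f, 1e-6f));   // 1
    std::printf("%d\n", nearly_equal(1024.01f, 1024.0f, 1e-6f));    // 0: eps is only ~0.001 here
}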
If it absolutely has to be floats then using an epsilon value may help but may not eliminate all problems. I would recommend using doubles for the spots in the code you know for sure will have variation.
Another way is to use floats to emulate doubles; there are many techniques out there, and the most basic one is to use 2 floats and do a little bit of math to save most of the number in one float and the remainder in the other (I saw a great guide on this; if I find it I'll link it).
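The basic building block of that two-float technique is an error-free addition such as Dekker's fast-two-sum, which stores the part of a sum that a single float would discard. A minimal sketch (assuming |a| >= |b|, round-to-nearest, and no aggressive reassociation by the compiler):

#include <cstdio>

struct FloatFloat { float hi, lo; };            // value = hi + lo

// Dekker's fast-two-sum: requires |a| >= |b|.
FloatFloat fast_two_sum(float a, float b) {
    float s = a + b;
    float e = b - (s - a);                      // the rounding error of a + b
    return {s, e};
}

int main() {
    FloatFloat x = fast_two_sum(1.0f, 1e-8f);   // 1e-8 would be lost in a plain float sum
    std::printf("hi = %.9g, lo = %.9g\n", x.hi, x.lo);   // hi = 1, lo = 1e-08
}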
Certainly you should be using doubles instead of floats. This will probably reduce the number of flipped nodes significantly.
Generally, using an epsilon threshold is only useful when you are comparing two floating-point numbers for equality, not when you are comparing them to see which is bigger. So (for most models, at least) using epsilon won't gain you anything at all -- it will just change the set of flipped nodes, it won't make that set smaller. If your model itself is chaotic, then it's chaotic.