Floating point subtraction error and absolute values

The way I understand it: when subtracting two double-precision numbers in C++, they are first represented as a significand times 2 to the power of an exponent. One can then lose precision if the subtracted numbers have the same exponent and share many leading digits of the significand. To test for this in my code I wrote the following safe addition function:
#include <cmath>
#include <iostream>
using namespace std;

double Sadd(double d1, double d2, int& report, double prec) {
    int exp1, exp2;
    // frexp splits each value into a significand in [0.5, 1) and a base-2 exponent
    double man1 = frexp(d1, &exp1), man2 = frexp(d2, &exp2);
    if (d1 * d2 < 0) {                       // opposite signs: the addition is really a subtraction
        if (exp1 == exp2) {                  // same binary exponent
            if (fabs(man1 + man2) < prec) {  // significands nearly cancel
                cout << "Floating point error" << endl;
                report = 0;
            }
        }
    }
    return d1 + d2;
}
However, testing this I notice something strange: the actual error (not whether the function reports one, but the error in the computed result) seems to depend on the absolute values of the subtracted numbers, not just on how many significand digits they share...
For example, using 1e-11 as the precision prec and subtracting the following numbers:
1) 9.8989898989898-9.8989898989897: The function reports error and I get the highly incorrect value 9.9475983006414e-14
2) 98989898989898-98989898989897: The function reports error but I get the correct value 1
Obviously I have misunderstood something. Any ideas?

If you subtract two floating-point values that are nearly equal, the result will mostly reflect noise in the low bits. Nearly equal here means more than just having the same exponent and almost the same significand digits. For example, 1.0001 and 1.0000 are nearly equal, and subtracting them could be caught by a test like this. But 1.0000 and 0.9999 differ by exactly the same amount, and would not be caught by a test like this.
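You can see the second case slip through by looking at the frexp exponents (a minimal check, not part of the original answer):
#include <cmath>
#include <cstdio>

int main() {
    int e1, e2, e3;
    std::frexp(1.0001, &e1);  // e1 == 1 (significand ~0.50005)
    std::frexp(1.0000, &e2);  // e2 == 1 (significand  0.5)
    std::frexp(0.9999, &e3);  // e3 == 0 (significand ~0.9999)
    std::printf("%d %d %d\n", e1, e2, e3);  // prints: 1 1 0
    // 1.0001 - 1.0000 shares an exponent, so the check fires;
    // 1.0000 - 0.9999 cancels just as badly but has different exponents.
}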
Further, this is not a safe addition function. Rather, it's a post-hoc check for a design/coding error. If you're subtracting two values that are so close together that noise matters, you've made a mistake. Fix the mistake. I'm not objecting to using something like this as a debugging aid, but please call it something that implies that that's what it is, rather than suggesting that there's something inherently dangerous about floating-point addition. Further, putting the check inside the addition function seems excessive: an assert that the two values won't cause problems, followed by a plain old floating-point addition, would probably be better. After all, most of the additions in your code won't lead to problems, and you'd better know where the problem spots are; put asserts in the problem spots.
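A sketch of that assert-style approach; the helper name and the relative tolerance here are mine, not Pete Becker's:
#include <algorithm>
#include <cassert>
#include <cmath>

// True when a + b would lose almost all of its significant digits to cancellation.
bool cancellation_suspect(double a, double b, double rel_tol = 1e-11) {
    double mag = std::max(std::fabs(a), std::fabs(b));
    return mag > 0.0 && std::fabs(a + b) < rel_tol * mag;  // relative test, not an absolute one
}

// At a known problem spot:
//   assert(!cancellation_suspect(x, -y));
//   double d = x - y;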

+1 to Pete Becker's answer.
Note that the problem of a degenerate result can also occur with exp1!=exp2
For example, if you subtract
1.0-0.99999999999999
So,
bool degenerated =
       (exp1==exp2   && fabs(man1+man2)   < prec)
    || (exp1==exp2-1 && fabs(man1+2*man2) < prec)
    || (exp1==exp2+1 && fabs(2*man1+man2) < prec);
You can omit the check for d1*d2<0, or keep it to avoid the whole test otherwise...
If you want to also handle loss of precision with denormalized (subnormal) floats, that'll be a bit more involved (it's as if the significand had fewer bits).
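A minimal check of the extended test on that example (the variable names mirror the question's function):
#include <cmath>
#include <cstdio>

int main() {
    double d1 = 1.0, d2 = -0.99999999999999, prec = 1e-11;
    int exp1, exp2;
    double man1 = std::frexp(d1, &exp1), man2 = std::frexp(d2, &exp2);
    // exp1 == 1 and exp2 == 0, so the original same-exponent test never fires,
    // but the exp1 == exp2 + 1 branch catches the cancellation:
    bool degenerated =
           (exp1 == exp2     && std::fabs(man1 + man2)     < prec)
        || (exp1 == exp2 - 1 && std::fabs(man1 + 2 * man2) < prec)
        || (exp1 == exp2 + 1 && std::fabs(2 * man1 + man2) < prec);
    std::printf("degenerated = %d, d1 + d2 = %.17g\n", degenerated, d1 + d2);
}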

It's quite easy to prove that for IEEE 754 floating-point arithmetic, if x/2 <= y <= 2x then calculating x - y is an exact operation: the result is computed without any rounding error (this is Sterbenz's lemma).
And if the result of an addition or subtraction is a denormalised number, then the result is always exact.
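That is also why case 2) in the question comes out right: both operands are integers below 2^53, so they are represented exactly, and they are within a factor of two of each other, so the subtraction itself is exact too. A minimal check:
#include <cstdio>

int main() {
    double x = 98989898989898.0, y = 98989898989897.0;
    // x/2 <= y <= 2x, so x - y is computed without rounding error.
    std::printf("%.17g\n", x - y);  // prints 1
}
In case 1) the subtraction is just as exact, but the operands 9.8989898989898 and 9.8989898989897 already carry decimal-to-binary conversion error, and that error is what the exact subtraction exposes.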

Related

What does "double + 1e-6" mean?

The result of this C++ code is 72.740, but I expect it to be 72.741
mx = 72.74050000;
printf("%.3lf \n", mx);
I found a solution on a website that told me to add "+1e-7", and it works:
mx = 72.74050000;
printf("%.3lf \n", mx + 1e-7);
but I don't know the reason this method works; can anyone explain it?
I also tried printing the value with a tiny addend as below, but nothing special happens: it turns out to be 72.7405
mx = 72.74050003;
cout << mx + 1e-10;
To start, your question contains an incorrect assumption. You put 72.7405 (let's assume it's precise) on input and expect 72.741 on output. So you assume that rounding in printf will select the higher of the two possible candidates. Why?
Well, one could consider that this is what your task requires, according to some rules (e.g. fiscal norms for rounding in bills, taxation, etc.) - this is usual. But when you use the de facto standard floating point of C/C++ on x86, ARM, etc., you should take the following specifics into account:
It is binary, not decimal. As a result, all the values you showed in your example are stored with some error.
The standard library tends to use the default rounding mode unless forced to use another one.
The second point means that the default rounding in C floating point is round-to-nearest-ties-to-even (or, for short, half-to-even). With this rounding, 72.7405 would be rounded to 72.740, not 72.741 (but 72.7415 would be rounded to 72.742). To get 72.7405 -> 72.741 you would have to install another rounding mode: round-to-nearest-ties-away-from-zero (for short: round-half-away). IEEE 754 requires that mode to be available for decimal arithmetic, so if you used true decimal arithmetic, it would suffice.
(If negative numbers are not allowed, the same mode can be treated as half-up. But I assume negative numbers do not appear in financial accounting and similar contexts.)
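To see ties-to-even in isolation you need halves that are exactly representable in binary; here is a minimal check, not from the original answer, showing the behaviour of glibc on x86-64:
#include <cstdio>

int main() {
    std::printf("%.0f %.0f %.0f %.0f\n", 0.5, 1.5, 2.5, 3.5);
    // prints "0 2 2 4": each exact tie is rounded to the even neighbour
}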
But the first point here is more important: the inexactness in the representation of such values can be amplified by subsequent operations. I repeat your situation and the proposed solution with more cases:
Code:
#include <stdio.h>
int main()
{
    float mx;
    mx = 72.74050000;
    printf("%.6lf\n", mx);
    printf("%.3lf\n", mx + 1e-7);
    mx *= 3;
    printf("%.6lf\n", mx);
    printf("%.3lf\n", mx + 1e-7);
}
Result (Ubuntu 20.04/x86-64):
72.740501
72.741
218.221497
218.221
So you see that just multiplying your example number by 3 produces a situation where the compensating summand 1e-7 is no longer enough to force rounding half-up, and 218.2215 (the "exact" 72.7405*3) is rounded to 218.221 instead of the desired 218.222. Oops, "Directed by Robert B. Weide"...
How could the situation be fixed? Well, you could start with a cruder but stronger approach. If you need rounding to 3 decimal digits and your inputs carry at most 4, add 0.00005 (half of the least significant digit of your results) instead of this feeble and sluggish 1e-7. That will definitely push the half-way values up.
But all of this only works if the result before rounding has an error strictly less than 0.00005. With cumbersome calculations (e.g. summing hundreds of values), it's easy to accumulate an error larger than this threshold. To avoid such an error, you would have to round intermediate results often (ideally, every value).
And that last conclusion leads us to the final question: if we need to round each intermediate result, why not just migrate to calculations in integers? If you have to keep intermediate results to 4 decimal digits, scale by 10000 and do all calculations in integers. This will also help avoid silent(*) accuracy loss with higher exponents.
(*) Well, IEEE 754 requires raising the "inexact" flag, but with binary floating point nearly any operation on decimal fractions raises it, so the useful signal drowns in a sea of noise.
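A minimal sketch of the scale-by-10000 idea mentioned above (the scale and the formatting are illustrative, and only positive amounts are handled):
#include <cstdio>
#include <cstdint>

int main() {
    // Keep values as integer ten-thousandths: 72.7405 -> 727405.
    std::int64_t mx = 727405;
    std::int64_t tripled = mx * 3;                 // exact: 2182215, i.e. 218.2215

    // Round to 3 decimals (integer thousandths), half away from zero.
    std::int64_t thousandths = (tripled + 5) / 10; // 218.2215 -> 218.222
    std::printf("%lld.%03lld\n",
                (long long)(thousandths / 1000),
                (long long)(thousandths % 1000));  // prints 218.222
}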
The final conclusion is the proper answer, not to your question, but to the underlying task: use fixed-point approaches. The approach with this +1e-7 is, as I showed above, too easy to break. No, don't use it, never. There are lots of proper libraries for fixed-point arithmetic; just pick one and use it.
(It's also interesting why %.6f printed 72.740501 while 218.221497/3 == 72.740499. It suggests that "single" floating point (float in C) is too inaccurate here. Even without this wrong approach, using double would merely postpone the issue, masking it and disguising it as a correct way.)
If you output the value like
printf( "mx = %.16f\n", mx );
you will see
mx = 72.7404999999999973
So to make printf round the output to 72.741 you need the next digit to be 5 instead of 4. It is enough to add 0.00001.
Here is a demonstration program.
#include <iostream>
#include <iomanip>
#include <cstdio>
int main( void )
{
    double mx = 72.74050000;
    printf( "mx = %.3f\n", mx + 0.00001);
    std::cout << "mx = " << std::setprecision( 5 ) << mx + 0.00001 << '\n';
}
The program output is
mx = 72.741
mx = 72.741
0.00001 is the same as 1e-5.

Should I add a tiny amount when trying to use std::round()?

Here is a situation where a round_to_2_digits() function is rounding down when we expected it to round up. This turned out to be the case where a number cannot be represented exactly in a double. I don't remember the exact value, but say this:
double value = 1.155;
double value_rounded = round_to_2_digits( value );
The value was the output of a function, and instead of being exactly 1.155 like the code above, it actually was returning something like 1.15499999999999999. So calling std::round() on it would result in 1.15 instead of 1.16 like we thought.
Questions:
I'm thinking it may be wise to add a tiny value in round_to_2_digits() prior to calling std::round().
Is this standard practice, or at least acceptable? For example, adding 0.0005 to the value being rounded.
Is there a mathematical term for this kind of "fudge factor"?
EDIT: "epsilon" was the term I was looking for.
And since the function rounds to only 2 digits after the decimal point, should I be adding 0.001? 0.005? 0.0005?
The rounding function is quite simple:
double round_to_2_decimals( double value )
{
    value *= 100.0;
    value = std::round(value);
    value /= 100.0;
    return value;
}
Step one is admitting that double may not be the right data type for your application. :-) Consider using a fixed-point type, or multiplying all of your values by 100 (or 1000, or whatever), and working with integer values.
You can't actually guarantee that adding any particular small epsilon won't give you the wrong result in some other situation. What if your value actually was 1.1549999..., and you rounded it up? Then you'd have the wrong value the other way.
The problem with your suggested solution is that at the end of it, you're still going to have a binary fraction that's not necessarily equal to the decimal value you're trying to represent. The only way to fix this is to use a representation that can exactly represent the values you want to use.
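Printing the value at full precision shows what this looks like in practice (a minimal check):
#include <cstdio>
#include <cmath>

int main() {
    double value = 1.155;
    std::printf("%.17g\n", value);                              // 1.1549999999999999
    std::printf("%.17g\n", std::round(value * 100.0) / 100.0);  // 1.1499999999999999
}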
This question doesn't make a lot of sense. POSIX mandates std::round rounds half away from zero. So the result should in fact be 116 not 115. In order to actually replicate your behavior, I had to use a function that pays attention to rounding mode:
std::fesetround(FE_DOWNWARD);
std::cout << std::setprecision(20) << std::rint(1.155 * 100.0);
This was tested on GCC 5.2.0 and Clang 3.7.0.

Should I worry about precision when I use C++ mathematical functions with integers?

For example, the code below will give an undesirable result due to the limited precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So a round function cannot solve my problem. I know I can solve it through a loop, but that seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) from being 3 does not exist for basic operations (including *). But it does exist for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is "if you have a reasonable quality implementation of log2, you can expect the result to be 3.0".
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question in the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not even come with the guarantees that basic operations come with.
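For the asker's concrete task (the largest power of two not exceeding N), a loop over integers sidesteps floating point entirely; a minimal sketch, with a function name of my own:
#include <cstdint>
#include <cstdio>

// Largest power of two <= n, for n >= 1, using only integer arithmetic.
std::uint64_t floor_pow2(std::uint64_t n) {
    std::uint64_t p = 1;
    while (p <= n / 2) p *= 2;   // comparing against n/2 avoids overflow
    return p;
}

int main() {
    std::printf("%llu %llu %llu\n",
                (unsigned long long)floor_pow2(8),     // 8
                (unsigned long long)floor_pow2(1000),  // 512
                (unsigned long long)floor_pow2(1));    // 1
}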
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it, will be converted to the previous integer instead (e.g. 3 - 1e-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it, will be converted to the next integer instead (e.g. -3 + 1e-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately it returns another floating point number, so you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course, when using 0.1, the approximations can occasionally cancel out and leave a correct result, but even just adding 0.1 ten times will not give 1.0 as the result when the computation is done with IEEE 754 double-precision floating point numbers.
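A minimal check of that claim, on a typical x86-64 build where each addition is rounded to double:
#include <cstdio>

int main() {
    double sum = 0.0;
    for (int i = 0; i < 10; ++i) sum += 0.1;     // each 0.1 is already slightly too large
    std::printf("%.17g %d\n", sum, sum == 1.0);  // 0.99999999999999989 0
}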
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.

d0 when taking roots of numbers

So in general, I understand the difference between specifying 3. and 3.0d0 with the difference being the number of digits stored by the computer. When doing arithmetic operations, I generally make sure everything is in double precision. However, I am confused about the following operations:
64^(1./3.) vs. 64^(1.0d0/3.0d0)
It took me a couple of weeks to find an error where I was assigning the output of 64^(1.0d0/3.0d0) to an integer. Because 64^(1.0d0/3.0d0) returns 3.999999, the integer got the value 3 and not 4. However, 64^(1./3.) = 4.00000. Can someone explain to me why it is wise to use 1./3. vs. 1.0d0/3.0d0 here?
The issue isn't so much single versus double precision. All floating point calculations are subject to imprecision compared to true real numbers. In assigning a real to an integer, Fortran truncates. You probably want to use the Fortran intrinsic nint.
This is a peculiar, fortuitous case where the lower-precision calculation gives the exact result. You can see this without the integer conversion issue:
write(*,*)4.d0-64**(1./3.),4.d0-64**(1.d0/3.d0)
0.000000000 4.440892E-016
In general this does not happen; here the double precision value is "better":
write(*,*)13.d0-2197**(1./3.),13.d0-2197**(1.d0/3.d0)
-9.5367E-7 1.77E-015
Here, since the s.p. calc comes out slightly high it gives you the correct value on integer conversion, while the d.p. result will get rounded down, hence be wrong, even though the floating point error was smaller.
So in general, no, you should not consider single precision preferable.
In fact, 64 and 125 seem to be the only special cases where the s.p. calculation gives a perfect cube root while the d.p. calculation does not.

Computer precision: when should I have to worry about it?

In C++ programming, when do I need to worry about the precision issue? To take a small example (it might not be a perfect one though),
std::vector<double> first (50000, 0.0);
std::vector<double> second (first);
Could it be possible that second[619] = 0.00000000000000000000000000001234 (I mean, some very small value)? Or that SUM = second[0]+second[1]+...+second[49999] => 1e-31? Or that SUM = second[0]-second[1]-...-second[49999] => -7.987654321e-12?
My questions:
Could it be some small disturbances in working with the double type numbers?
What may cause these kinds of small disturbances, i.e. rounding errors that become large? Could you please list them? How does one take precautions?
If there could be small disturbance in certain operations, does it then mean after these operations, using if (SUM == 0) is dangerous? One should then always use if (SUM < SMALL) instead, where SMALL is defined as a very small value, such as 1E-30?
Lastly, could the small disturbances result in a negative value? Because if that is possible, then I had better use if (abs(SUM) < SMALL) instead.
Any experiences?
This is a good reference document for floating point precision: What Every Computer Scientist Should Know About Floating-Point Arithmetic
One of the more important parts is catastrophic cancellation
Catastrophic cancellation occurs when the operands are subject to rounding errors. For example in the quadratic formula, the expression b² - 4ac occurs. The quantities b² and 4ac are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps. For example, consider b = 3.34, a = 1.22, and c = 2.28. The exact value of b² - 4ac is .0292. But b² rounds to 11.2 and 4ac rounds to 11.1, hence the final answer is .1, which is an error by 70 ulps, even though 11.2 - 11.1 is exactly equal to .1. The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications.
Benign cancellation occurs when subtracting exactly known quantities. If x and y have no rounding error, then by Theorem 2 if the subtraction is done with a guard digit, the difference x-y has a very small relative error (less than 2ε).
A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula.
For your specific example, 0 has an exact representation as a double, and adding exactly 0 to a double does not change its value.
Also, like any other values you put in variables, numbers that you initialize in the array are not going to mysteriously change. You only get rounding when the result of a calculation cannot be exactly represented as a floating point number.
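A minimal check of both statements, using the question's own setup:
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> first(50000, 0.0);
    std::vector<double> second(first);
    double sum = 0.0;
    for (double v : second) sum += v;            // adding exact zeros changes nothing
    std::printf("%.17g %d\n", sum, sum == 0.0);  // prints: 0 1
}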
To give a better opinion about "disturbances" I would need to know the kinds of calculations that your code performs.