Double rounding error, even when using DBL_DIG - c++

I am trying to generate a random number between -10 and 10 with a step of 0.3 (though I want these to be arbitrary values) and am having issues with double-precision floating-point accuracy. float.h's DBL_DIG is meant to be the minimum accuracy at which no rounding error occurs [EDIT: This is false; see Eric Postpischil's comment for a true definition of DBL_DIG], yet when printing to this many digits, I still see rounding error.
#include <stdio.h>
#include <float.h>
#include <stdlib.h>
int main()
{
    for (;;)
    {
        printf("%.*g\n", DBL_DIG, -10 + (rand() % (unsigned long)(20 / 0.3)) * 0.3);
    }
}
When I run this, I get this output:
8.3
-7
1.7
-6.1
-3.1
1.1
-3.4
-8.2
-9.1
-9.7
-7.6
-7.9
1.4
-2.5
-1.3
-8.8
2.6
6.2
3.8
-3.4
9.5
-7.6
-1.9
-0.0999999999999996
-2.2
5
3.2
2.9
-2.5
2.9
9.5
-4.6
6.2
0.799999999999999
-1.3
-7.3
-7.9
Of course, a simple solution would be to just #define DBL_DIG 14 but I feel that is wasting accuracy. Why is this happening and how do I prevent this happening? This is not a duplicate of Is floating point math broken? since I am asking about DBL_DIG, and how to find the minimum accuracy at which no error occurs.

For the specific code in the question, we can avoid excess rounding errors by using integer values until the last moment:
printf("%.*g\n", DBL_DIG,
(-100 + rand() % (unsigned long)(20 / 0.3) * 3.) / 10.);
This was obtained by multiplying each term in the original expression by 10 (−10 becomes −100 and .3 becomes 3) and then dividing the whole expression by 10. So all the values we care about in the numerator¹ are integers, which floating-point represents exactly (within the range of its precision).
Since the integer values will be computed exactly, there will be just a single rounding error, in the final division by 10, and the result will be the double closest to the desired value.
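For reference, here is one way the whole program could look with that change (a sketch, not the answer's exact code; the infinite loop is replaced by a bounded one so it terminates, and the bound 66 still comes from (unsigned long)(20 / 0.3)):
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
int main(void)
{
    for (int i = 0; i < 20; i++)
    {
        // keep everything in integer tenths; divide by 10 only at the very end
        double value = (-100 + rand() % (unsigned long)(20 / 0.3) * 3.) / 10.;
        printf("%.*g\n", DBL_DIG, value);
    }
    return 0;
}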
How many digits should I print to in order to avoid rounding error in most circumstances? (not just in my example above)
Just using more digits is not a solution for general cases. One approach for avoiding error in most cases is to learn about floating-point formats and arithmetic in considerable detail and then write code thoughtfully and meticulously. This approach is generally good but not always successful as it is usually implemented by humans, who continue to make mistakes in spite of all efforts to the contrary.
Footnote
¹ Considering (unsigned long)(20 / 0.3) is a longer discussion involving intent and generalization to other values and cases.

generate a random number between -10 and 10 with step 0.3
I would like the program to work with arbitrary values for the bounds and step size.
Why is this happening ....
The source of trouble is assuming that typical real numbers (such as the string "0.3") can be encoded exactly as a double.
A double can encode about 2^64 different values exactly. 0.3 is not one of them.
Instead the nearest double is used.
The exact value and 2 nearest are listed below:
0.29999999999999993338661852249060757458209991455078125
0.299999999999999988897769753748434595763683319091796875 (best 0.3)
0.3000000000000000444089209850062616169452667236328125
So OP's code is really attempting "-10 and 10 with step 0.2999...", and printing out "-0.0999999999999996" and "0.799999999999999" is more correct than printing "-0.1" and "0.8".
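You can check the stored value on your own machine (a quick sketch, assuming IEEE-754 binary64 doubles and a C library that prints exact decimal expansions, as glibc does):
#include <stdio.h>
int main(void)
{
    // prints the "best 0.3" value listed above (plus a trailing zero of padding)
    printf("%.55f\n", 0.3);
    return 0;
}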
.... how do I prevent this happening?
Print with a more limited precision.
// reduce the _bit_ output precision by about the square root of the number of steps
// (needs <math.h> for sqrt()/lround(), plus <stdio.h>, <stdlib.h>, <float.h> as before)
#define LOG10_2 0.30102999566398119521373889472449
int digits_less = lround(sqrt(20 / 0.3) * LOG10_2);
for (int i = 0; i < 100; i++) {
    printf("%.*e\n", DBL_DIG - digits_less,
           -10 + (rand() % (unsigned long) (20 / 0.3)) * 0.3);
}
9.5000000000000e+00
-3.7000000000000e+00
8.6000000000000e+00
5.9000000000000e+00
...
-1.0000000000000e-01
8.0000000000000e-01
OP's code really is not doing "steps", as that hints toward a loop with a step of 0.3. The above digits_less is based on repetitive "steps"; otherwise OP's single expression warrants only about a 1 decimal digit reduction. The best reduction in precision depends on estimating the potential cumulative error of all the calculations: the "0.3" conversion --> double 0.3 (1/2 bit), the division (1/2 bit), the multiplication (1/2 bit) and the addition (a more complicated bit count).
Wait for the next version of C which may support decimal floating point.

Related

Accurate percentage in C++

Given 2 numbers, where A <= B, say for example A = 9 and B = 10, I am trying to get the percentage of how much smaller A is compared to B. I need to have the percentage as an int, e.g. if the result is 10.00% the int should be 1000.
Here is my code:
int A = 9;
int B = 10;
int percentage = (((1 - (double)A/B) / 0.01)) * 100;
My code returns 999 instead of 1000. Some precision related to the usage of double is lost.
Is there a way to avoid losing precision in my case?
Seems the formula you're looking for is
int result = 10000 - (A*10000+B/2)/B;
The idea is to do all the computations in integers, delaying the division.
To do the rounding, half of the denominator is added before performing the division (otherwise the division would truncate, and because the result is then subtracted from 100%, the truncation would show up as rounding upward).
For example, with A=9 and B=11 the percentage is 18.18181818..., which rounds to 18.18%; the computation without the rounding would give 1819 instead of the expected result 1818.
Note that the computation is done all in integers so there is a risk of overflow for large values of A and B. For example if int is 32 bit then A can be up to around 200000 before risking an overflow when computing A*10000.
Using A*10000LL instead of A*10000 in the formula will trade in some speed to raise the limit to a much bigger value.
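A minimal check of the formula against the values discussed above (a hypothetical test driver, not part of the answer):
#include <stdio.h>
// how much smaller A is than B, in hundredths of a percent
static int pct(int A, int B)
{
    return 10000 - (A * 10000 + B / 2) / B;
}
int main(void)
{
    printf("%d\n", pct(9, 10));  // 1000  (10.00%)
    printf("%d\n", pct(9, 11));  // 1818  (18.18%, rounded)
    return 0;
}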
Of course there may be precision loss in floating-point numbers. Either you should use fixed-point numbers as #6502 answered, or add a bias to the result to get the intended answer.
You should better do
assert(B != 0);
int percentage = ((A<0) == (B<0) ? 0.5 : -0.5) + (((1 - (double)A/B) / 0.01)) * 100;
Because of precision loss, the result of (((1 - (double)A/B) / 0.01)) * 100 may be slightly less or more than intended. If you add an extra 0.5, it is guaranteed to be slightly more than intended. Now when you cast this value to an integer, you get the intended answer (the floor or ceiling value, depending on whether the fractional part of the result was above or below 0.5).
I tried
float floatpercent = (((1 - (double)A/B) / 0.01)) * 100;
int percentage = (int) floatpercent;
cout<< percentage;
displays 1000
I suspect precision loss on the automatic cast to int as the root problem in your code.
[I alluded to this in a comment to the original question, but I though I'd post it as an answer.]
The core problem is that the form of expression you're using amplifies the unavoidable floating point loss of precision when representing simple fractions of 10.
Your expression (with casts stripped out for now, using standard precedence to also avoid some parens)
((1 - A/B) / 0.01) * 100
is quite a complicated way of representing what you want, although it's algebraically correct. Unfortunately, floating-point numbers can only precisely represent numbers like 1/2, 1/4, 1/8, etc., their multiples, and sums of those. In particular, none of 9/10, 1/10, or 1/100 has a precise representation.
The above expression introduces these errors twice: first in the calculation of A/B, and then in the division by 0.01. These two imprecise values are then divided, which further amplifies the inherent error.
The most direct way to write what you meant (again without needed casts) is
((B-A) / B) * 10000
This produces the correct answer and is considerably easier to read, I would suggest, than the original. The fully correct C form is
((B - A) / (double)B) * 10000
I've tested this and it works reliably. As others have noted, it's generally better to work with doubles instead of floats, as their extra precision makes them less prone (but not immune) to this sort of difficulty.
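For a quick sanity check of that form (a hypothetical driver; the printed result assumes the usual IEEE-754 double arithmetic):
#include <iostream>
int main()
{
    int A = 9, B = 10;
    int percentage = (int)(((B - A) / (double)B) * 10000);
    std::cout << percentage << std::endl;  // prints 1000 here
    return 0;
}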

Why float taking 0.699999 instead of 0.7 [duplicate]

This question already has answers here:
Floating point comparison [duplicate]
(5 answers)
Closed 9 years ago.
Here x takes 0.699999 instead of 0.7, but y takes 0.5 as assigned. Can you tell me the exact reason for this behavior?
#include <iostream>
using namespace std;
int main()
{
    float x = 0.7;
    float y = 0.5;
    if (x < 0.7)
    {
        if (y < 0.5)
            cout << "2 is right" << endl;
        else
            cout << "1 is right" << endl;
    }
    else
        cout << "0 is right" << endl;
    cin.get();
    return 0;
}
There are lots of things on the internet about IEEE floating point.
0.5 = 1/2
so can be written exactly as a sum of powers of two
0.7 = 7/10 = 1/2 + 1/5 = 1/2 + 1/8 + a bit more... etc
The bit more can never be exactly a power of two, so you get the closest it can manage.
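If you want to see the closest values that actually get stored, printing with extra precision shows them (a sketch; the exact digits depend on your float format, IEEE-754 binary32 assumed here):
#include <iostream>
#include <iomanip>
int main()
{
    float x = 0.7;
    float y = 0.5;
    std::cout << std::setprecision(30) << x << '\n';  // something like 0.699999988079071044921875
    std::cout << std::setprecision(30) << y << '\n';  // 0.5 exactly
    return 0;
}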
It is to do with how floating points are represented in memory. They have a limited number of bits (usually 32 for a float). This means there are a limited number of values that can be represented which means that many numbers from the infinite set of real numbers cannot be represented.
This website explains further
If you want to understand exactly why, then have a look at floating point representation of your machine (most probably it's IEEE 754, https://en.wikipedia.org/wiki/IEEE_floating_point ).
If you want to write robust and portable code, never compare floating-point values for equality. You should always compare them with some precision (e.g. instead of x==y you should write fabs(x-y) < eps where eps is say 1e-6).
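A small sketch of that comparison (the helper name nearly_equal and the eps value are just illustrative; pick eps for the scale of your data):
#include <cmath>
#include <iostream>
bool nearly_equal(double x, double y, double eps = 1e-6)
{
    return std::fabs(x - y) < eps;
}
int main()
{
    float x = 0.7;
    std::cout << (x == 0.7 ? "equal" : "not equal") << '\n';        // "not equal": x was rounded to float precision
    std::cout << (nearly_equal(x, 0.7) ? "close" : "far") << '\n';  // "close"
    return 0;
}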
Floating-point representation is only approximate, as you cannot have a precise representation of most real numbers on a computer.
When operating on floats, errors will in general accumulate.
However, there are some reals which can be represented exactly on a digital computer using its native datatype for this purpose (*), 0.5 being one of them.
(*) meaning the format the floating-point unit of the CPU operates on (standardized in IEEE 754). Specialized libraries can represent integer and rational numbers exactly beyond the limits of the processor's internal formats. Rounding errors may still occur when converting into a human-readable decimal expansion, and the alternative also does not extend to irrational numbers (e.g. sqrt(3)). And, of course, these libraries come at the cost of less speed.

C++ How to avoid floating-point arithmetic error [duplicate]

This question already has answers here:
Why does floating-point arithmetic not give exact results when adding decimal fractions?
(31 answers)
Closed 3 years ago.
I am writing a loop that increments with a float, but I have come across a floating-point arithmetic issue illustrated in the following example:
for(float value = -2.0; value <= 2.0; value += 0.2)
std::cout << value << std::endl;
Here is the output:
-2
-1.8
-1.6
-1.4
-1.2
-1
-0.8
-0.6
-0.4
-0.2
1.46031e-07
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Why exactly am I getting 1.46031e-07 instead of 0? I know this has something to do with floating-point errors, but I can't grasp why it is happening and what I should do to prevent this from happening (if there is a way). Can someone explain (or point me to a link) that will help me understand? Any input is appreciated. Thanks!
As everybody else has said, this is due to the fact that the real numbers form an infinite and uncountable set, while floating-point representations use a finite number of bits. Floating-point numbers can only approximate real numbers, and even in many simple cases they are not precise, due to their definition. As you have now seen, 0.2 is not actually 0.2 but is instead a number very close to it. As you add these to value, you accumulate the error at each step.
As an alternative, try using ints for your iteration and dividing the result to get it back in the domain you require:
for (int value = -20; value <= 20; value += 2) {
std::cout << (value / 10.f) << std::endl;
}
For me this gives:
-2
-1.8
-1.6
-1.4
-1.2
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
There's no clear-cut solution for avoiding floating-point precision loss. I would suggest having a look through the following paper: What every computer scientist should know about floating point arithmetic.
This is because floating point numbers have only a certain discrete precision.
The 0.2 is not really a 0.2, but is internally represented as a slightly different number.
That is why you are seeing a difference.
This is common in all floating point calculations, and you really can't avoid it.
Let's do your loop, but with increased output precision.
code:
// std::setprecision needs <iomanip>
for (float value = -2.0; value <= 2.0; value += 0.2)
    std::cout << std::setprecision(100) << value << std::endl;
output:
-2
-1.7999999523162841796875
-1.599999904632568359375
-1.3999998569488525390625
-1.19999980926513671875
-0.999999821186065673828125
-0.79999983310699462890625
-0.599999845027923583984375
-0.3999998569488525390625
-0.19999985396862030029296875
1.460313825418779742904007434844970703125e-07
0.20000015199184417724609375
0.400000154972076416015625
0.6000001430511474609375
0.800000131130218505859375
1.00000011920928955078125
1.20000016689300537109375
1.40000021457672119140625
1.60000026226043701171875
1.80000030994415283203125
Use integers and divide down:
for(int value = -20; value <= 20; value += 2)
std::cout << (value/10.0) << std::endl;
Learn about floating-point representation from an algorithms book or from resources on the internet; there are lots of them out there.
For now, what you want seems to be some way to get zero when the value is something very, very close to zero, and we all know that we call this process "rounding". :) So why don't you use it while printing those numbers? The printf function provides good formatting power for these kinds of things. Check the tables in the following link if you don't know how to format with printf (you can use the formatting for rounding and displaying the numbers correctly).
printf ref: http://www.cplusplus.com/reference/cstdio/printf/?kw=printf
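As a concrete illustration of that suggestion, here is the loop from the question printed with a fixed number of decimal places (a sketch; only the display is rounded, the stored values are unchanged):
#include <cstdio>
int main()
{
    float value = -2.0f;
    for (int i = 0; i <= 20; ++i, value += 0.2f)
        std::printf("%.1f\n", value);  // one decimal place: the tiny error near zero prints as 0.0
    return 0;
}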
-- edit --
Maybe some of you know that, according to mathematics, 1.99999999.... is the same as 2.0. The only difference is the representation, but the number is the same.
Your floating-point problem is a little bit similar to this. (This is just for clarification; your problem is not the same as the 1.9999.... thing.)

Significant digits increasing

Let's,
float dt;
I read dt from a text file as
inputFile >> dt;
Then I have a for loop as,
for (float time=dt; time<=maxTime; time+=dt)
{
// some stuff
}
When dt=0.05 and I output std::cout << time << std::endl; I got,
0.05
0.10
...
7.00001
7.05001
...
So, why does the number of digits increase after a while?
Because not every number can be represented by IEEE754 floating point values. At some point, you'll get a number that isn't quite representable and the computer will have to choose the nearest one.
If you enter 0.05 into Harald Schmidt's excellent online converter and reference the Wikipedia entry on IEEE754-1985, you'll end up with the following bits (my explanation of that follows):
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111010 10011001100110011001101
|||||||| |||||||||||||||||||||||
128 -+||||||| ||||||||||||||||||||||+- 1 / 8388608
64 --+|||||| |||||||||||||||||||||+-- 1 / 4194304
32 ---+||||| ||||||||||||||||||||+--- 1 / 2097152
16 ----+|||| |||||||||||||||||||+---- 1 / 1048576
8 -----+||| ||||||||||||||||||+----- 1 / 524288
4 ------+|| |||||||||||||||||+------ 1 / 262144
2 -------+| ||||||||||||||||+------- 1 / 131072
1 --------+ |||||||||||||||+-------- 1 / 65536
||||||||||||||+--------- 1 / 32768
|||||||||||||+---------- 1 / 16384
||||||||||||+----------- 1 / 8192
|||||||||||+------------ 1 / 4096
||||||||||+------------- 1 / 2048
|||||||||+-------------- 1 / 1024
||||||||+--------------- 1 / 512
|||||||+---------------- 1 / 256
||||||+----------------- 1 / 128
|||||+------------------ 1 / 64
||||+------------------- 1 / 32
|||+-------------------- 1 / 16
||+--------------------- 1 / 8
|+---------------------- 1 / 4
+----------------------- 1 / 2
The sign, being 0, is positive. The exponent is indicated by the one-bits mapping to the numbers on the left: 64+32+16+8+2 = 122 - 127 bias = -5, so the multiplier is 2^-5 or 1/32. The 127 bias is to allow representation of very small numbers (as in close to zero, rather than negative numbers with a large magnitude).
The mantissa is a little more complicated. For each one-bit, you accumulate the numbers down the right hand side (after adding an implicit 1). Hence you can calculate the number as the sum of {1, 1/2, 1/16, 1/32, 1/256, 1/512, 1/4096, 1/8192, 1/65536, 1/131072, 1/1048576, 1/2097152, 1/8388608}.
When you add all these up, you get 1.60000002384185791015625.
When you multiply that by the multiplier 1/32 (calculated previously from the exponent bits), you get 0.0500000001, so you can see that 0.05 is already not represented exactly. This bit pattern for the mantissa is actually the same as 0.1 but, with that, the exponent is -4 rather than -5, and it's why 0.1 + 0.1 + 0.1 is rarely equal to 0.3 (this appears to be a favourite interview question).
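If you want to verify that bit pattern yourself, something along these lines works where float is IEEE-754 binary32 (an assumption; the field widths below rely on it):
#include <cstdio>
#include <cstring>
#include <cstdint>
int main()
{
    float f = 0.05f;
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);      // well-defined way to inspect the representation
    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFF;  // biased exponent
    unsigned mantissa = bits & 0x7FFFFF;      // 23 explicit fraction bits
    std::printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
                sign, exponent, (int)exponent - 127, mantissa);
    // expected: sign=0 exponent=122 (unbiased -5) mantissa=0x4CCCCD
    return 0;
}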
When you start adding them up, that small error will accumulate since not only will you see an error in the 0.05 itself, errors may also be introduced at each stage of the accumulation - not all the numbers 0.1, 0.15, 0.2 and so on can be represented exactly either.
Eventually, the errors will get large enough that they'll start showing up in the number if you use the default precision. You can put this off for a bit by choosing your own precision with something like:
#include <iostream>
#include <iomanip>
:
std::cout << std::setprecision(2) << time << '\n';
It won't fix the variable value, but it will give you some more breathing space before the errors become visible.
As an aside, some people recommend avoiding std::endl since it forces a flush of the buffers. If your implementation is behaving itself, this will happen for terminal devices when you send a newline anyway. And if you've redirected standard output to a non-terminal, you probably don't want flushing on every line. Not really relevant to your question and it probably won't make a real difference in the vast majority of cases, just a point I thought I'd bring up.
IEEE floats use the binary number system and therefore can't store decimal numbers exactly. When you add several of them together (sometimes just two is enough), the representational errors can accumulate and become visible.
Some numbers can't be precisely represented using floating point or base-2 numbers. If I remember correctly, one such number is decimal 0.05 (in base 2 it results in an infinitely repeating fraction). Another issue is that if you print a floating-point value to a file (as a base-10 number) and then read it back, you may well get a different number - because the base differs, and that might cause problems when converting a fractional base-2 number to a fractional base-10 number.
If you want better precision you could try searching for a bignum library. This will work much slower than floating point, though. Another way to deal with precision problems would be to try storing numbers as a "common fraction" with numerator/denominator (i.e. 1/10 instead of 0.1, 1/3 instead of 0.333.., etc. - there's probably a library even for that, but I haven't heard about it), but that won't work with irrational numbers like pi or e.
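A very small sketch of that numerator/denominator idea (it assumes C++17 for std::gcd; a real library would also handle overflow and negative values):
#include <iostream>
#include <numeric>  // std::gcd
struct Fraction { long long num, den; };
Fraction add(Fraction a, Fraction b)
{
    Fraction r{ a.num * b.den + b.num * a.den, a.den * b.den };
    long long g = std::gcd(r.num, r.den);
    r.num /= g;
    r.den /= g;
    return r;
}
int main()
{
    Fraction dt{ 1, 20 };   // 0.05 represented exactly
    Fraction t{ 0, 1 };
    for (int i = 0; i < 140; ++i)
        t = add(t, dt);
    std::cout << t.num << "/" << t.den << "\n";  // 7/1: exactly 7 after 140 steps
    return 0;
}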

Error subtracting floating point numbers when passing through 0.0

The following program:
#include <stdio.h>
int main()
{
    double val = 1.0;
    int i;
    for (i = 0; i < 10; i++)
    {
        val -= 0.2;
        printf("%g %s\n", val, (val == 0.0 ? "zero" : "non-zero"));
    }
    return 0;
}
Produces this output:
0.8 non-zero
0.6 non-zero
0.4 non-zero
0.2 non-zero
5.55112e-17 non-zero
-0.2 non-zero
-0.4 non-zero
-0.6 non-zero
-0.8 non-zero
-1 non-zero
Can anyone tell me what is causing the error when subtracting 0.2 from 0.2? Is this a rounding error or something else? Most importantly, how do I avoid this error?
EDIT: It looks like the conclusion is to not worry about it, given 5.55112e-17 is extremely close to zero (thanks to #therefromhere for that information).
It's because floating-point numbers cannot be stored in memory with exact values, so it is never safe to use == on floating-point values. Using double will increase the precision, but again it will not be exact. The correct way to compare a floating-point value is to do something like this:
val == target; // not safe
// instead do this
// where EPS is some suitable low value like 1e-7
fabs(val - target) < EPS;
EDIT: As pointed out in the comments, the main reason for the problem is that 0.2 can't be stored exactly, so subtracting it from some value introduces a little error every time. If you do this kind of floating-point calculation repeatedly, at some point the error becomes noticeable. What I am trying to say is that not all floating-point values can be stored, as there are infinitely many of them. A slightly wrong value is generally not noticeable, but using it in successive computations leads to a higher cumulative error.
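To see the difference between accumulating and recomputing, compare the two side by side (a sketch; on a typical IEEE-754 system the last line shows roughly 5.6e-17 for the accumulated value but exactly 0 for the recomputed one):
#include <cstdio>
int main()
{
    double accumulated = 1.0;
    for (int i = 1; i <= 5; ++i)
    {
        accumulated -= 0.2;                 // rounding error piles up step by step
        double recomputed = 1.0 - i * 0.2;  // at most a couple of roundings per value
        std::printf("%d: %.17g vs %.17g\n", i, accumulated, recomputed);
    }
    return 0;
}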
0.2 is not a double precision floating-point number, so it is rounded to the nearest double precision number, which is:
0.200000000000000011102230246251565404236316680908203125
That's rather unwieldy, so let's look at it in hex instead:
0x0.33333333333334
Now, let's follow what happens when this value is repeatedly subtracted from 1.0:
0x1.00000000000000
- 0x0.33333333333334
--------------------
0x0.cccccccccccccc
The exact result is not representable in double precision, so it is rounded, which gives:
0x0.ccccccccccccd
In decimal, this is exactly:
0.8000000000000000444089209850062616169452667236328125
Now we repeat the process:
0x0.ccccccccccccd
- 0x0.33333333333334
--------------------
0x0.9999999999999c
rounds to 0x0.999999999999a
(0.600000000000000088817841970012523233890533447265625 in decimal)
0x0.999999999999a
- 0x0.33333333333334
--------------------
0x0.6666666666666c
rounds to 0x0.6666666666666c
(0.400000000000000077715611723760957829654216766357421875 in decimal)
0x0.6666666666666c
- 0x0.33333333333334
--------------------
0x0.33333333333338
rounds to 0x0.33333333333338
(0.20000000000000006661338147750939242541790008544921875 in decimal)
0x0.33333333333338
- 0x0.33333333333334
--------------------
0x0.00000000000004
rounds to 0x0.00000000000004
(0.000000000000000055511151231257827021181583404541015625 in decimal)
Thus, we see that the accumulated rounding that is required by floating-point arithmetic produces the very small non-zero result that you are observing. Rounding is subtle, but it is deterministic, not magic, and not a bug. It's worth taking the time to learn about.
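You can follow the same steps on your own machine with the %a conversion, which prints the exact value as a hexadecimal float (note it uses a normalized form such as 0x1.999999999999ap-1 rather than the fixed-point style above; a small sketch based on the question's loop):
#include <stdio.h>
int main(void)
{
    double val = 1.0;
    for (int i = 0; i < 5; i++)
    {
        val -= 0.2;
        printf("%a   (%.17g)\n", val, val);  // exact bits on the left, 17 significant digits on the right
    }
    return 0;
}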
Floating point arithmetic cannot represent all numbers exactly. Thus rounding errors like you observe are inevitable.
One possible strategy is to use a fixed-point format, e.g. a decimal or currency data type. Such types still can't represent all numbers but would behave as you expect for this example.
To elaborate a bit: if the mantissa of the floating point number is encoded in binary (as is the case in most contemporary FPUs), then only sums of (multiples) of the numbers 1/2, 1/4, 1/8, 1/16, ... can be represented exactly in the mantissa. The value 0.2 is approximated with 1/8 + 1/16 + .... some even smaller numbers, yet the exact value of 0.2 can not be reached with a finite mantissa.
You can try the following:
printf("%.20f", 0.2);
and you'll (probably) see that what you think is 0.2 is not 0.2 but a number that is a tiny amount different (actually, on my computer it prints 0.20000000000000001110). Now you understand why you can never reach 0.
But if you let val = 12.5 and subtract 0.125 in your loop, you could reach zero.
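A quick way to convince yourself of that last point (a sketch; it relies on 12.5 and 0.125 being exact binary fractions, so every intermediate value is representable):
#include <stdio.h>
int main(void)
{
    double val = 12.5;
    int steps = 0;
    while (val != 0.0)  // safe here only because 0.125 is a power of two
    {
        val -= 0.125;
        ++steps;
    }
    printf("hit exactly zero after %d steps\n", steps);  // 100 steps
    return 0;
}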