I am a computer engineering student and tutor for the introductory C++ classes at BYU-Idaho, and a student successfully stumped me.
If I write the code for this:
#include <iostream>
using namespace std;

int main()
{
    float y = .59;
    int x = (int)(y * 100.0);
    cout << x << endl;
    return 0;
}
Result = 58
#include <iostream>
using namespace std;

int main()
{
    double y = .59;
    int x = (int)(y * 100.0);
    cout << x << endl;
    return 0;
}
Result = 59
I told him it was a precision issue and that because the int is more precise than a float it loses information. A double is more precise than a float so it works.
However, I am not sure if what I said is correct. I think it has something to do with the int getting padded with zeros and as a result it gets "truncated" while it gets cast, but I am not sure.
If any of you guys want to explain what is going on "underneath" all of this I would find it interesting!
The problem is that float isn't accurate enough to hold the exact value 0.59. If you store such a value, it will be rounded in binary representation (already during compile time) to something different, in your case this was a value slightly less than 0.59 (it might also be slightly greater than the value you wanted it to be). When multiplying this with 100, you get a value slightly less than 59. Converting such a value to an integer will round it towards 0, so this leads to 58.
0.59 as a float will be stored as (written here as a human-readable decimal number):
0.589999973773956298828125
Now to the double type. While this type has essentially the same problem, there might be two reasons why you get the expected result: either double can hold the exact value you want (this is not the case with 0.59, but for other values it might be), or the compiler rounds it up. Either way, multiplying this by 100 leads to a value which is not less than 59 and is therefore rounded towards 0 to give 59, as expected.
Now note that it might be the case that 0.59 as a double is still being rounded down by the compiler. Indeed, I just checked and it is. 0.59 as a double will be stored as:
0.58999999999999996891375531049561686813831329345703
However, you are multiplying this value by 100 before converting it to an integer. Here comes the interesting point: when multiplied by 100, the small difference between y and 0.59 introduced by the compiler is eliminated, since 0.59 * 100 can again not be stored exactly. In fact, the processor calculates 0.58999999999999996891375531049561686813831329345703 * 100.0, which is rounded up to 59, a number which can be represented exactly in a double!
See this code for details: http://ideone.com/V0essb
Now you might wonder why the same doesn't hold for float, which should behave exactly the same way, only with less accuracy. The problem is that 0.589999973773956298828125 * 100.0 is not rounded up to 59 (which can also be represented in a float): the representation error of the float is large enough that even after the multiplication is rounded, the result stays below 59. Beyond that, the exact rounding behavior after calculations isn't guaranteed by the language.
Indeed, operations on floating point numbers aren't exactly specified, meaning that you can encounter different results on different machines. This makes it possible to implement performance tweaks which lead to slightly incorrect results, even if rounding isn't involved! It might be the case that on another machine you end up with the expected results while on others you are not.
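To see this for yourself, here is a small sketch (assuming IEEE 754 float and double, which is where the values quoted above come from) that prints both stored representations and their products at high precision:

#include <iomanip>
#include <iostream>

int main()
{
    float  yf = 0.59f;
    double yd = 0.59;

    std::cout << std::setprecision(30)
              << "float  0.59  = " << yf << "\n"          // ~0.589999973773956298828125
              << "float  * 100 = " << yf * 100.0 << "\n"  // well below 59, so (int) gives 58
              << "double 0.59  = " << yd << "\n"          // ~0.58999999999999996891...
              << "double * 100 = " << yd * 100.0 << "\n"; // rounds to exactly 59, so (int) gives 59
    return 0;
}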
0.59 is not exactly representable in binary floating-point, so y will actually be a value very slightly above or below 0.59. Which it is may be affected by whether you use float or double, and this in turn determines whether the result of your program is 58 or 59.
It is a precision issue, and has to do with the fact that .59 is not exactly representable in either a double or a float. So y is not .59; it is something very close to .59 (slightly more or slightly less). The multiplication by 100 is exact, but since the original value wasn't, you get something slightly more or slightly less than 59. Conversion to int truncates towards zero, so you get either 59 or 58.
This is the same problem as when you use an old calculator to do 1 / 3 * 3 and it comes up with 2.9999999 or something similar. Combine this with the fact that (int)(float_value) will simply chop the decimals off, so if float_value is 58.999996185, as my machine gets, then 58 will be the result, because although 58.999996185 is NEARLY 59, if you keep only the integer part, it is indeed 58.
Floating point numbers are great for calculating a lot of things, but you have to be VERY careful when it comes to "what is the result". It is an approximation, the precision is not infinite, and rounding of intermediate results will happen.
With double, there are more digits, and it may well be that in the calculation of 0.58999999999999 times 100, the last bit rounds up, so the result is 59.00000000001 or something like that, which then becomes 59 as an integer.
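If what you actually want is the nearest integer rather than truncation, one option (a minimal sketch, assuming C++11 and typical IEEE 754 floats) is to round before converting, for example with std::lround from <cmath>:

#include <cmath>
#include <iostream>

int main()
{
    float y = .59;
    int x1 = (int)(y * 100.0);             // truncates: 58 on a typical system
    int x2 = (int)std::lround(y * 100.0);  // rounds to nearest: 59
    std::cout << x1 << " " << x2 << "\n";
    return 0;
}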
Related
#include <iostream>
using namespace std;

int main() {
    int steps = 1000000000;
    float s = 0;
    for (int i = 1; i < (steps + 1); i++) {
        s += (i / 2.0);
    }
    cout << s << endl;
}
Declaring s as float: 9.0072e+15
Declaring s as double: 2.5e+17 (same result as implementing it in Julia)
I understand double has double the precision of float, but float should still handle numbers up to 10^38.
I did read similar topics where the results were not the same, but in those cases the differences were very small, whereas here the difference is 25x.
I should also add that using long double instead gives me the same result as double. If the issue were just precision, I would have expected something at least slightly different.
The problem is the lack of precision: https://en.wikipedia.org/wiki/Floating_point
After 100 million numbers you are adding 1e8 to 1e16 (or at least numbers of that magnitude), but single precision numbers are only accurate to 7 digits - so it is the same as adding 0 to 1e16; that's why your result is considerably lower for float.
Prefer double over float in most cases.
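The cutoff is easy to demonstrate directly. A minimal sketch (assuming IEEE 754 single precision): once the running sum is the size of your float result, adding a term the size of i/2.0 changes nothing at all.

#include <iostream>

int main()
{
    float big = 9.0072e15f;     // roughly where the float sum in the question stalls
    float before = big;
    big += 5.0e8f;              // a term about the size of i/2.0 near the end of the loop
    std::cout << std::boolalpha << (big == before) << "\n"; // true: the addition is lost entirely
    return 0;
}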
The problem is floating-point precision! Infinitely many real numbers cannot possibly be represented in the finite memory of a computer. Floats, in general, are just approximations of the numbers they are meant to represent.
For more details, please check the following documentation:
https://softwareengineering.stackexchange.com/questions/101163/what-causes-floating-point-rounding-errors
You didn't mention what type of floating point numbers you are using, but I'm going to assume that you use IEEE 754, or similar.
I understand double has double the precision of float
To be more precise with the terminology, double uses twice as many bits. That's not double the number of representable values; it's 4294967296 times as many representable values, despite being named "double precision".
but float should still handle numbers up to 10^38.
Float can handle a few numbers up to that magnitude. But that doesn't mean that float values in that range are precise. For example, 3.4028235E+38 can be represented as a single-precision float. How big would you guess the gap is to the previous value representable by float? Is it the machine epsilon? Perhaps 0.1? Maybe 1? No. The difference is about 2E+31.
Now, your numbers aren't quite in that range. But they're outside the contiguous range of whole integers that can be represented precisely by float. The highest value in that range happens to be 16777216 (2^24), or about 1.7E+7, which is way less than 2.5E+17. So every addition beyond that range adds some error to the result. You perform a billion calculations, so those errors add up.
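That boundary is easy to verify directly; a small sketch, assuming IEEE 754 single precision:

#include <iostream>

int main()
{
    float f = 16777216.0f;   // 2^24, the last float in the contiguous whole-number range
    float g = f + 1.0f;      // exact result 16777217 is not representable and rounds back down
    float h = f + 2.0f;      // 16777218 is representable
    std::cout << std::boolalpha << (g == f) << "\n"   // true
              << (h == f) << "\n";                    // false
    return 0;
}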
Conclusions:
Understand that single precision is way less precise than double precision.
Avoid long sequences of calculations where precision errors can accumulate.
I need to output floating-point numbers with two digits after the decimal point, and I need the numbers to be rounded off. However, sometimes I don't get the results I need. Below is an example.
#include <iomanip>
#include <iostream>
using namespace std;

int main() {
    cout << setprecision(2);
    cout << fixed;
    cout << (1.7/20) << endl;
    cout << (1.1/20) << endl;
}
The results are:
0.08
0.06
Since 1.7/20 = 0.085 and 1.1/20 = 0.055, in theory I should get 0.09 and 0.06. I know it has something to do with the binary representation of floating-point numbers. My question is: how can I get the right results when fixing the number of digits after the decimal point with rounding?
Edit: This is not a duplicate of another question. Using fesetround(FE_UPWARD) will not solve the problem. fesetround(FE_UPWARD) will round (1.0/30) to 0.04 while the correct results should be 0.03. In addition, fesetround(FE_TONEAREST) doesn't help either. (1.7/20) still round to 0.08.
Edit: Now I understand that this behavior might be due to half-to-even rounding. But how can I avoid this? Namely, if the result is exactly half, it should round up.
Yes, you're right - it has to do with the representation in base 2, and the fact that sometimes the base 2 value will be higher than the base 10 number and sometimes it will be lower. But never by much!
If you want something that matches expectations more often, you can do two stage rounding. A double is generally accurate to at least 15 digits (total, including those to the left of the decimal point). Your first rounding will leave you with a number that has more stability for the second phase of rounding. No rounding is going to match the results you would get in decimal 100%, but it's possible to get very close.
#include <cmath>

double round_2digits(double d)
{
    double intermediate = std::floor(d * 100000000000000.0 + 0.5); // first round to 14 digits
    return std::floor(intermediate / 1000000000000.0 + 0.5) / 100.0; // then round to 2 decimal places
}
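A quick usage sketch of the two-stage version (compiled together with the round_2digits definition above; the headers below are the only extra assumption):

#include <iomanip>
#include <iostream>

int main()
{
    std::cout << std::fixed << std::setprecision(2)
              << round_2digits(1.7 / 20) << "\n"   // prints 0.09
              << round_2digits(1.1 / 20) << "\n";  // prints 0.06
    return 0;
}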
For a totally different approach, you can simply ensure that the base 2 number that you start with is always larger than the desired decimal, instead of being larger half the time and smaller half the time. Simply increment the least significant bit of the number with nextafter before rounding.
#include <cmath>
#include <limits>

double round_2digits(double d)
{
    // nudge d up to the next representable double, so it is never just below the intended decimal
    return std::floor(100.0 * std::nextafter(d, std::numeric_limits<double>::max())) / 100.0;
}
You can define your own round_with_precision() function, which scales the value by a power of 10, calls std::round() on it, and then divides by the same factor.
#include <cmath>    // std::round, std::pow
#include <cstddef>  // size_t
#include <iostream>

double round_with_precision(double d, const size_t &prec)
{
    d *= std::pow(10, prec);
    return std::round(d) / std::pow(10, prec);
}

int main() {
    const size_t prec = 2;
    std::cout << round_with_precision(1.7/20, prec) << std::endl; // prints 0.09
    std::cout << round_with_precision(1.1/20, prec) << std::endl; // prints 0.06
}
The issue is due to binary floating-point representation and floating-point constants in C. The fact is that 1.7 and 1.1 are not exactly representable in binary. The ISO C standard says (I suppose that this is similar in C++): "Floating constants are converted to internal format as if at translation-time." This means that the active rounding mode (set by fesetround) will not have any influence at all for the constant (it may have an influence for the roundings that occur at run time).
The division by 20 will introduce another rounding error. Depending on the full code and compiler options, it may or may not be done at compile time, so that the active rounding mode may be ignored. In any case, if you expect 0.085 and 0.055 exactly, this is not possible because these values are not representable exactly in binary.
So, even if you have perfect code that rounds double values to 2 decimal digits, this may not work as you want, because of the rounding errors that occurred before, and it is too late to recover the information in a way that works in all cases.
If you want to be able to handle "midpoint" values such as 0.085 exactly, you need to use a number system that can represent them exactly, such as decimal arithmetic (but you may still get rounding errors in other kinds of operations). You may also want to use integers scaled by a power of 10. There is no general answer because this really depends on the application, as any workaround will have drawbacks.
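As a rough illustration of the scaled-integer idea for the 1.7/20 case (the scaling and names here are just for illustration): keep the values in integer thousandths, where the arithmetic is exact, and round half up by hand at the end.

#include <iomanip>
#include <iostream>

int main()
{
    // 1.7 / 20 done exactly in integer thousandths: 1700 / 20 == 85, i.e. 0.085
    long long thousandths = 1700 / 20;
    // round half up to hundredths: (85 + 5) / 10 == 9
    long long hundredths = (thousandths + 5) / 10;
    std::cout << "0." << std::setw(2) << std::setfill('0') << hundredths << "\n"; // prints 0.09
    return 0;
}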
For more information, see the general articles on floating point and Goldberg's paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
For the following program:
#include <iostream>
#include <iomanip>
using namespace std;
int main()
{
for (float a = 1.0; a < 10; a++)
cout << std::setprecision(30) << 1.0/a << endl;
return 0;
}
I receive the following output:
1
0.5
0.333333333333333314829616256247
0.25
0.200000000000000011102230246252
0.166666666666666657414808128124
0.142857142857142849212692681249
0.125
0.111111111111111104943205418749
This is definitely not right for the lower-place digits, particularly with respect to 1/3, 1/5, 1/7, and 1/9; things just start going wrong around 10^-16. I would expect to see output more resembling:
1
0.5
0.333333333333333333333333333333
0.25
0.2
0.166666666666666666666666666666
0.142857142857142857142857142857
0.125
0.111111111111111111111111111111
Is this an inherent flaw in the float type? Is there a way to overcome this and have proper division? Is there a special datatype for doing precise decimal operations? Am I just doing something stupid or wrong in my example?
There are a lot of numbers that computers cannot represent, even if you use float or double-precision float. 1/3, or .3 repeating, is one of those numbers. So it just does the best it can, which is the result you get.
See http://floating-point-gui.de/, or google float precision, there's a ton of info out there (including many SO questions) on this subject.
To answer your questions -- yes, this is an inherent limitation of both the float and double types. Some mathematical programs (MathCAD, and probably Mathematica) can do "symbolic" math, which allows calculation of the "correct" answers. In many cases the round-off error can be managed, even over really complex computations, such that the top 6-8 decimal places are correct. However, the opposite is true as well -- naive computations can be constructed that return wildly incorrect answers.
For small problems like division of whole numbers, you'll get a decent number of decimal places of accuracy (about 6-7 with float). If you use double-precision floats, that goes up to about 15-16. If you need more... well, I'd start questioning why you want that many decimal places.
First of all, since your code does 1.0/a, it gives you a double (1.0 is a double value; 1.0f is a float), because the rules of C++ (and C) always extend the smaller type to the larger one when the operands of an operation differ in type (so int + char converts the char to an int before adding the values, long + int makes the int a long, and so on).
Second, floating-point values have a set number of bits for the mantissa. In float, that is 23 bits (+1 'hidden' bit), and in double it's 52 bits (+1). You get roughly one decimal digit per log2(10) ≈ 3.3 bits, so a 24-bit mantissa gives approximately 7-8 digits and a 53-bit mantissa approximately 16-17 digits. The remainder is just "noise" caused by the last few bits of the number not evening out when converting to a decimal representation.
To have infinite precision, we would have to either store the value as a fraction, or have an infinite number of bits. And of course, we could have some other finite precision, such as 100 bits, but I'm sure you'd complain about that too, because it would just have another 15 or so digits before it "goes wrong".
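You don't have to memorize those bit counts; the standard library reports them. A small sketch (the values in the comments assume IEEE 754):

#include <iostream>
#include <limits>

int main()
{
    std::cout << "float:  " << std::numeric_limits<float>::digits << " mantissa bits, "
              << std::numeric_limits<float>::digits10 << " reliable decimal digits\n"    // 24 and 6
              << "double: " << std::numeric_limits<double>::digits << " mantissa bits, "
              << std::numeric_limits<double>::digits10 << " reliable decimal digits\n";  // 53 and 15
    return 0;
}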
Floats only have so much precision (23 bits worth to be precise). If you REALLY want to see "0.333333333333333333333333333333" output, you could create a custom "Fraction" class which stores the numerator and denominator separately. Then you could calculate the digit at any given point with complete accuracy.
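A bare-bones sketch of such a Fraction type (hypothetical names, positive values only, no overflow handling or normalization), printing 1/3 and 1/7 exactly by long division:

#include <iostream>

// Stores numerator and denominator separately, so 1/3 stays exactly 1/3.
struct Fraction {
    long long num;
    long long den;
};

// Print the first n decimal digits of the fraction by long division, exactly.
void print_decimal(const Fraction& f, int n)
{
    std::cout << f.num / f.den << ".";
    long long rem = f.num % f.den;
    for (int i = 0; i < n; ++i) {
        rem *= 10;
        std::cout << rem / f.den;
        rem %= f.den;
    }
    std::cout << "\n";
}

int main()
{
    print_decimal(Fraction{1, 3}, 30);  // 0.333333333333333333333333333333
    print_decimal(Fraction{1, 7}, 30);  // 0.142857142857142857142857142857
    return 0;
}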
I have a double of 3.4. However, when I multiply it with 100, it gives 339 instead of 340. It seems to be caused by the precision of double. How could I get around this?
Thanks
First what is going on:
3.4 can't be represented exactly as a binary fraction, so the implementation chooses the closest representable binary fraction. I am not sure whether it always rounds towards zero or not, but in your case the represented number is indeed smaller.
The conversion to integer truncates, that is uses the closest integer with smaller absolute value.
Since both conversions are biased in the same direction, you can always get a rounding error.
Now you need to know what you want, but probably you want to use symmetrical rounding, i.e. find the closest integer be it smaller or larger. This can be implemented as
#include <cmath>

int round_to_int(double x) { return std::floor(x + 0.5); } // std::floor is provided; named round_to_int to avoid clashing with the standard round()
or
int round_to_int(double x) { return x < 0 ? x - 0.5 : x + 0.5; }
I am not completely sure that the conversion to int indeed rounds towards zero, so please verify the latter version if you use it.
If you need full precision, you might want to use something like Boost.Rational.
You could use two integers and multiply the fractional part by multiplier / 10.
E.g.
int d[2] = {3,4};
int n = (d[0] * 100) + (d[1] * 10);
That is, if you really want all that precision on either side of the decimal point. It really does depend on the application.
Floating-point values are seldom exact. Unfortunately, when casting a floating-point value to an integer in C, the value is rounded towards zero. This means that if you have 339.999999, the result of the cast will be 339.
To overcome this, you could add (or, for negative values, subtract) 0.5 before the cast. In this case 339.999999 + 0.5 => 340.499999 => 340 (when converted to an int).
Alternatively, you could use one of the many conversion functions provided by the standard library.
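For example, std::lround from <cmath> (available since C++11) rounds to the nearest integer instead of truncating. A small sketch using the 339.999999 example above:

#include <cmath>
#include <iostream>

int main()
{
    double d = 339.999999;                 // a value "very slightly below" 340
    std::cout << (int)d << "\n"            // 339: the cast truncates toward zero
              << (int)(d + 0.5) << "\n"    // 340: the +0.5 trick described above
              << std::lround(d) << "\n";   // 340: rounds to nearest
    return 0;
}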
You don't have a double with the value of 3.4, since 3.4 isn't representable as a double (at least on the common machines, and most of the exotics as well). What you have is some value very close to 3.4. After multiplication, you have some value very close to 340. But certainly not 339.
Where are you seeing the 339? I'm guessing that you're simply casting to int, using static_cast, because this operation truncates toward zero. Other operations would likely do what you want: outputting in fixed format with 0 positions after the decimal, for example, rounds (in an implementation-defined manner, but all of the implementations I know use round-to-even by default); the function round rounds to nearest, rounding away from zero in halfway cases (but your results will not be anywhere near a halfway case). This is the rounding used in commercial applications.
The real question is what you are doing that requires an exact integral value. Depending on the application, it may be more appropriate to use int or long, scaling the actual values as necessary (i.e. storing 100 times the actual value, or whatever), or some sort of decimal arithmetic package, rather than to use double.
I have a function getSlope which takes 4 doubles as parameters and returns another double calculated using these parameters in the following way:
double QSweep::getSlope(double a, double b, double c, double d) {
    double slope;
    slope = (d-b)/(c-a);
    return slope;
}
The problem is that when calling this function with arguments for example:
getSlope(2.71156, -1.64161, 2.70413, -1.72219);
the returned result is:
10.8557
and this is not a good result for my computations.
I have calculated the slope using Mathematica and the result for the slope for the same parameters is:
10.8452
or with more digits for precision:
10.845222072678331.
The result returned by my program is not good enough for my further computations.
Moreover, I do not understand how the program gets 10.8557 starting from 10.845222072678331 (supposing that this is the approximate result of the division)?
How can I get the good result for my division?
thank you in advance,
madalina
I print the result using the command line:
std::cout<<slope<<endl;
It may be that my parameters are not good, as I read them from another program (which computes a graph; after I read these parameters from this graph I just displayed them to see their values, but maybe the displayed vectors do not have the same internal precision as the calculated values... I do not know, it is really strange. Some numerical errors appear...)
When the graph from which I am reading my parameters is computed, some numerical libraries written in C++ (with templates) are used. No OpenGL is used for this computation.
thank you,
madalina
I've tried with float instead of double and I get 10.845110 as a result. It still looks better than madalina's result.
EDIT:
I think I know why you get these results. If you get the a, b, c and d parameters from somewhere else and print them, you get rounded values. Then if you put those into Mathematica (or calc ;) ) it will give you a different result.
I tried changing one of your parameters a little bit. When I did:
double c = 2.7041304;
I got 10.845806. I only added 0.0000004 to c!
So I think your "errors" aren't errors. Print a, b, c and d with better precision and then put those into Mathematica.
The following code:
#include <iostream>
using namespace std;

double getSlope(double a, double b, double c, double d) {
    double slope;
    slope = (d-b)/(c-a);
    return slope;
}

int main() {
    double s = getSlope(2.71156, -1.64161, 2.70413, -1.72219);
    cout << s << endl;
}
gives a result of 10.8452 with g++. How are you printing out the result in your code?
Could it be that you use DirectX or OpenGL in your project? If so they can turn off double precision and you will get strange results.
You can check your precision settings with
std::sqrt(x) * std::sqrt(x)
The result has to be pretty close to x.
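A quick sketch of that check; with full double precision the product agrees with x to about 15-16 significant digits, while an FPU forced into single precision would already be off around the 7th digit:

#include <cmath>
#include <iomanip>
#include <iostream>

int main()
{
    double x = 2.0;
    std::cout << std::setprecision(17)
              << std::sqrt(x) * std::sqrt(x) << "\n"; // expect something extremely close to 2
    return 0;
}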
I met this problem a long time ago and spent a month checking all the formulas. But then I found
D3DCREATE_FPU_PRESERVE
The problem here is that (c-a) is small, so the rounding errors inherent in floating-point operations are magnified in this example. A general solution is to rework your equation so that you're not dividing by a small number; I'm not sure how you would do it here though.
EDIT:
Neil is right in his comment on this question; I computed the answer in VB using Doubles and got the same answer as Mathematica.
The results you are getting are consistent with 32bit arithmetic. Without knowing more about your environment, it's not possible to advise what to do.
Assuming the code shown is what's running, ie you're not converting anything to strings or floats, then there isn't a fix within C++. It's outside of the code you've shown, and depends on the environment.
As Patrick McDonald and Treb both brought up the accuracy of your inputs and the error on c-a, I thought I'd take a look at that. One technique for examining rounding errors is interval arithmetic, which makes explicit the upper and lower bounds on the value a number represents (they are implicit in floating-point numbers, fixed by the precision of the representation). By treating each value as an upper and lower bound, and by extending the bounds by the error in the representation (approx x * 2^-53 for a double value x), you get a result which gives the lower and upper bounds on the accuracy of a value, taking into account worst-case precision errors.
For example, if you have a value in the range [1.0, 2.0] and subtract from it a value in the range [0.0, 1.0], then the result must lie in the range [below(0.0),above(2.0)] as the minimum result is 1.0-1.0 and the maximum is 2.0-0.0. below and above are equivalent to floor and ceiling, but for the next representable value rather than for integers.
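A minimal sketch of that idea (the names are made up for illustration; each operation's result is simply widened outward by one representable double, which is enough to cover round-to-nearest arithmetic):

#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>

// Step one representable double outward, to absorb the rounding of the operation itself.
double below(double x) { return std::nextafter(x, -std::numeric_limits<double>::infinity()); }
double above(double x) { return std::nextafter(x,  std::numeric_limits<double>::infinity()); }

struct Interval { double lo, hi; };

Interval sub(Interval a, Interval b)  // [a.lo - b.hi, a.hi - b.lo], widened
{
    return { below(a.lo - b.hi), above(a.hi - b.lo) };
}

Interval div(Interval a, Interval b)  // assumes 0 is not inside b; take min/max of the corner quotients
{
    double q[4] = { a.lo / b.lo, a.lo / b.hi, a.hi / b.lo, a.hi / b.hi };
    double mn = q[0], mx = q[0];
    for (double v : q) { if (v < mn) mn = v; if (v > mx) mx = v; }
    return { below(mn), above(mx) };
}

int main()
{
    Interval a{ below(2.71156), above(2.71156) };
    Interval b{ below(-1.64161), above(-1.64161) };
    Interval c{ below(2.70413), above(2.70413) };
    Interval d{ below(-1.72219), above(-1.72219) };

    Interval slope = div(sub(d, b), sub(c, a));
    std::cout << std::setprecision(17)
              << "[" << slope.lo << ", " << slope.hi << "]\n"; // both bounds agree on 10.84522207...
    return 0;
}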
Using intervals which represent worst-case double rounding:
getSlope(
a = [2.7115599999999995262:2.7115600000000004144],
b = [-1.6416099999999997916:-1.6416100000000002357],
c = [2.7041299999999997006:2.7041300000000005888],
d = [-1.7221899999999998876:-1.7221900000000003317])
(d-b) = [-0.080580000000000526206:-0.080579999999999665783]
(c-a) = [-0.0074300000000007129439:-0.0074299999999989383218]
to double precision [10.845222072677243474:10.845222072679954195]
So although c-a is small compared to c or a, it is still large compared to double rounding, so if you were using the worst imaginable double-precision rounding, you could still trust that value to be precise to 12 figures - 10.8452220727. You've lost a few figures off double precision, but you're still working to more than your inputs' significance.
But if the inputs were only accurate to the number of significant figures given, then rather than being the double value 2.71156 +/- eps, the input range would be [2.711555, 2.711565], so you get the result:
getSlope(
a = [2.711555:2.711565],
b = [-1.641615:-1.641605],
c = [2.704125:2.704135],
d = [-1.722195:-1.722185])
(d-b) = [-0.08059:-0.08057]
(c-a) = [-0.00744:-0.00742]
to specified accuracy [10.82930108:10.86118598]
which is a much wider range.
But you would have to go out of your way to track the accuracy in the calculations, and the rounding errors inherent in floating point are not significant in this example - it's precise to 12 figures with the worst case double precision rounding.
On the other hand, if your inputs are only known to 6 figures, it doesn't actually matter whether you get 10.8557 or 10.8452. Both are within [10.82930108:10.86118598].
Better print out the arguments, too. When you are, as I guess, transferring parameters in decimal notation, you will lose precision for each and every one of them. The problem is that 1/5 is an infinite series in binary, so e.g. 0.2 becomes 0.001100110011... Also, decimals are chopped when converting a binary float to a textual representation in decimal.
Besides that, sometimes the compiler chooses speed over precision. This should be a documented compiler switch.
Patrick seems to be right about (c-a) being the main cause:
d-b = -1.72219 - (-1.64161) = -0.08058
c-a = 2.70413 - 2.71156 = -0.00743
S = (d-b)/(c-a) = -0.08058 / -0.00743 = 10.845222
You start out with six digits of precision, and through the subtractions you are reduced to three and four digits. My best guess is that you lose additional precision because the number -0.00743 cannot be represented exactly in a double. Try using intermediate variables with a bigger precision, like this:
double QSweep::getSlope(double a, double b, double c, double d)
{
    double slope;
    long double temp1, temp2;
    temp1 = (d-b);
    temp2 = (c-a);
    slope = temp1/temp2;
    return slope;
}
While the academic discussion going on is great for learning about the limitations of programming languages, you may find the simplest solution to the problem is a data structure for arbitrary-precision arithmetic.
This will have some overhead, but you should be able to find something with accuracy you can reasonably guarantee.