Incorrect floating point math? - c++

Here is a problem that has had me completely baffled for the past few hours...
I have an equation hard coded in my program:
double s2;
s2 = -(0*13)/84+6/42-0/84+24/12+(6*13)/42;
Every time I run the program, the computer spits out 3 as the answer; however, doing the math by hand, I get 4. Even further, after inputting the equation into Matlab, I also get the answer 4. What's going on here?
The only thing I can think of that is going wrong here would be round-off error. However, with a maximum of 5 rounding errors, coupled with using double precision math, my maximum error would be very, very small, so I doubt that is the problem.
Anyone able to offer any solutions?
Thanks in advance,
-Faken

You're not actually doing floating point math there, you're doing integer math, which truncates the results of divisions toward zero.
In C++, 5/4 = 1, not 1.25, because 5 and 4 are both integers, so the result will be an integer, and thus the fractional part of the result is thrown away.
On the other hand, 5.0/4.0 will equal 1.25: as long as at least one of the two operands is a floating-point number, the division is done in floating point and the result is floating point too.

You're confusing integer division with floating point division. 3 is the correct answer with integer division. You'll get 4 if you convert those values to floating point numbers.

Some of this is being evaluated using integer arithmetic. Try adding a decimal point to your numbers, e.g. 6.0 instead of 6, to tell the compiler that you don't want integer arithmetic.

s2 = -(0*13)/84+6/42-0/84+24/12+(6*13)/42;
yields 3
s2 = -(0.*13.)/84.+6./42.-0./84.+24./12.+(6.*13.)/42.;
does what you are expecting.

Related

C++ Rounding float-pointing value to number of fractional digits

I have tried std::round, but it doesn't give me exactly the result I want. I have a program in C# that I am converting to C++, and I ran into this problem: C# Math.Round and C++ round behave differently, which causes wrong calculations.
C# code:
Console.WriteLine(Math.Round(0.850, 1));
Output:
0,8
C++ code:
std::cout << roundf(0.850f) << std::endl;
Output:
1
So, as you can see, they are different. How can I solve this?
The C# version is rounding a double to one decimal place, the C++ version is rounding a float to the nearest integer.
Rounding a binary floating point to a fixed number of decimal places doesn't really make much sense, as the rounded number will still most likely be an approximation. For example 0.8 cannot be represented exactly in binary floating point.
The C++ round function only rounds to the nearest integral value, which, given the above, is a sensible choice.
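You can round to 1 decimal place with std::round(0.850 * 10) / 10. Be aware that std::round rounds halfway cases away from zero, whereas C# Math.Round rounds them to even by default, so this expression typically gives 0.9 rather than 0.8. To follow the round-to-even rule, use std::nearbyint(0.850 * 10) / 10 under the default rounding mode.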
Note that I've dropped the f suffix to match the C# double type.
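As a rough sketch of the scale, round, unscale idea, the two helpers below differ only in how they break ties. The names round_half_away and round_half_even are invented for this example, and the second one assumes the floating-point rounding mode has been left at its default (round to nearest, ties to even).

```cpp
#include <cmath>

// Round x to `places` decimals; halfway cases go away from zero,
// because that is what std::round does.
double round_half_away(double x, int places) {
    const double scale = std::pow(10.0, places);
    return std::round(x * scale) / scale;
}

// Round x to `places` decimals; halfway cases go to the even neighbour.
// std::nearbyint honours the current rounding mode, which is assumed
// here to be the default round-to-nearest-even.
double round_half_even(double x, int places) {
    const double scale = std::pow(10.0, places);
    return std::nearbyint(x * scale) / scale;
}
```

On a typical IEEE 754 double implementation, round_half_away(0.850, 1) comes out near 0.9 and round_half_even(0.850, 1) near 0.8. Either result is still only the closest double to that decimal value.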

I'm trying to round a float to two decimal points but it's incorrect. How to fix this rounding error in C++?

I'm having trouble with rounding floats. I'm solving a task where you need to round your result to two decimal places. But I can't do it when the third decimal digit is 5 because it's stored incorrectly.
For example: My result is equal to 1.005 and that should be rounded to 1.01. But C++ rounds it to 1.00 because the original float is stored as 1.0049999... and not 1.005.
I've already tried always adding a very small float to the result but there are some other test cases which are then rounded up but should be rounded down.
I know how floating-point works and that it is often not completely accurate. I'm just wondering whether anyone has found a way around this specific problem.
When you say "my result is equal to 1.005", you are assuming some count of true decimal digits. This can be 1.005 (three digits of fractional part), 1.0050 (four digits), 1.005000, and so on.
So, you should first round, using some ordinary rounding, to that count of digits. It is simpler to do this in integers: for example, with 6 fractional digits, this means applying round(), rint(), etc. after multiplying by 1,000,000. This step gives you an exact decimal number. After that, you can do the final rounding you actually need.
In your example, this will round 1,004,999.99... to 1,005,000. Then divide by 10,000 (giving 100.5) and round again to get 101, i.e. 1.01.
(Note that there are suggestions to do this rounding in a more specific way. The General Decimal Arithmetic specification and IBM arithmetic manuals suggest that an exact fractional part of 0.5 be rounded away from zero unless the least significant digit of the result becomes 0 or 5, in which case it is rounded toward zero. But if you have no such rounding available, a general away-from-zero rounding is also suitable.)
If you are implementing arithmetic for money accounting, it is reasonable to avoid floating point altogether and use fixed-point arithmetic (emulated with integers, if needed). This is better because the methods I've described for rounding inevitably involve conversion to integers (and back), so it's cheaper to use such integers directly. You will get checking for inexact operations as well (at the cost of explicit integer overflow checks).
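The two-step rounding described above could look roughly like this. The helper name to_hundredths is hypothetical, and the sketch assumes the value carries at most six meaningful fractional digits and is small enough that the scaled value fits in a long long.

```cpp
#include <cmath>

// Returns x rounded to two decimals, expressed as an integer number of
// hundredths (so 1.005 gives 101, meaning 1.01).
long long to_hundredths(double x) {
    // Step 1: snap to an exact decimal with six fractional digits.
    // 1.005 is stored as 1.00499999..., and 1004999.99... rounds to 1005000.
    const long long micro = std::llround(x * 1e6);

    // Step 2: round millionths to hundredths in pure integer arithmetic,
    // with halfway cases going away from zero.
    const long long half = 5000;
    return micro >= 0 ? (micro + half) / 10000
                      : -((-micro + half) / 10000);
}
```

Keeping the result as an integer count of hundredths avoids converting back to a double that could not hold 1.01 exactly anyway; divide by 100.0 only at the point of display if a floating-point value is needed.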
If you can, use a library like Boost with its Multiprecision support.
Another option would be to use a long double, maybe that's precise enough for you.

Improper Double Calculations [duplicate]

Teaching myself C and finding that when I do an equation for a temp conversion it won't work unless I change the fraction to a decimal. ie,
tempC=(.555*(tempF-32)) will work but tempC=((5/9)*(tempF-32)) won't work.
Why?
According to the book "C Primer Plus" it should work as I'm using floats for both tempC and tempF.
It looks like you have integer division in the second case:
tempC=((5/9)*(tempF-32))
The 5 / 9 will get truncated to zero.
To fix that, you need to make one of them a floating-point type:
tempC=((5./9.)*(tempF-32))
When you do 5/9, 5 and 9 are both integers, so integer division happens. The result of integer division is an integer: the quotient of the two operands, with the remainder discarded. The quotient of 5/9 is 0, and since you multiply by 0, tempC comes out to be 0. To avoid integer division, at least one of the two operands must be a floating-point value.
E.g. if you use 5.0/9 or 5/9.0 or 5.0/9.0, it will work as expected.
5/9 is an integer division not a floating point division. That's why you are getting wrong result.
Make 5 or 9 a floating-point value and you will get the correct answer.
Like 5.0/9 OR 5/9.0
5/9 is an integer expression; as such it gets truncated to 0. Your compiler should warn you about this; if it doesn't, you should look into enabling warnings.
If you put 5/9 in parentheses, this will be calculated first, and since those are two integers, it will be done by integer division and the result will be 0, before the rest of the expression is evaluated.
You can rearrange your expression so that the conversion to float occurs first:
tempC=((5/9)*(tempF-32)); → tempC=(5*(tempF-32))/9;
or of course, as the others say, use floating point constants.

d0 when taking roots of numbers

So in general, I understand the difference between specifying 3. and 3.0d0 with the difference being the number of digits stored by the computer. When doing arithmetic operations, I generally make sure everything is in double precision. However, I am confused about the following operations:
64^(1./3.) vs. 64^(1.0d0/3.0d0)
It took me a couple of weeks to find an error where I was assigning the output of 64^(1.0d0/3.0d0) to an integer. Because 64^(1.0d0/3.0d0) returns 3.999999, the integer got the value 3 and not 4. However, 64^(1./3.) = 4.00000. Can someone explain to me why it is wise to use 1./3. vs. 1.0d0/3.0d0 here?
The issue isn't so much single versus double precision. All floating point calculations are subject to imprecision compared to true real numbers. In assigning a real to an integer, Fortran truncates. You probably want to use the Fortran intrinsic nint.
This is a peculiar, fortuitous case where the lower precision calculation gives the exact result. You can see this without the integer conversion issue:
write(*,*)4.d0-64**(1./3.),4.d0-64**(1.d0/3.d0)
0.000000000 4.440892E-016
In general this does not happen. Here the double precision value is "better":
write(*,*)13.d0-2197**(1./3.),13.d0-2197**(1.d0/3.d0)
-9.5367E-7 1.77E-015
Here, since the single precision calculation comes out slightly high, it gives you the correct value on integer conversion, while the double precision result will get truncated down and hence be wrong, even though its floating point error was smaller.
So in general, no, you should not consider use of single precision to be preferred.
In fact, 64 and 125 seem to be the only special cases where the single precision calculation gives a perfect cube root while the double precision calculation does not.
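The same truncation-versus-nearest-integer distinction exists outside Fortran. As a rough C++ analogue (not Fortran, and with invented function names), a plain cast to int plays the role of Fortran's truncating assignment and std::lround plays the role of nint:

```cpp
#include <cmath>

// Truncating conversion: if pow returns a value just below the true root
// (for example 3.9999999999999996), the cast yields one less than expected.
int truncated_cube_root(double x) {
    return static_cast<int>(std::pow(x, 1.0 / 3.0));
}

// Nearest-integer conversion: tolerant of small floating-point error in
// either direction.
long nearest_cube_root(double x) {
    return std::lround(std::pow(x, 1.0 / 3.0));
}
```

nearest_cube_root returns 4 for 64 and 13 for 2197 regardless of which side of the exact root the floating-point result lands on, whereas the truncating version depends on that detail.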

What does the compiler do when it converts a float variable to an integer variable?

What does the compiler do? The aim is to get the number after the point as an integer. I did it like this:
float a = 0;
cin >> a;
int b = (a - (int)a)*10;
Now my problem is this: when I enter, for example, 3.2, I get 2, which is what I want. It also works with .4, .5 and .7. But when I enter, for example, 2.3, I get 2. For 2.7 I get 6, and so on. But when I do it without variables, for example:
(2.3 - (int)2.3)*10;
I get the correct result.
I couldn't figure out what the compiler does. I always thought that when I cast a float to an integer, it simply cuts at the point. This is what the compiler actually does when I use constant numbers. However, when I use variables, the compiler reduces some of them, but not all.
You are most likely not having problems with the compiler, but with the fact that floating point numbers cannot be represented exactly on a binary computer.
So, when you do:
float f = 2.7f;
..what might actually be stored in the computer is:
2.6999999999999999
This is a very well-known characteristic of floating points on binary computers. There are many posts on SO that discuss this.
Basically, the problem comes from the fact that binary has different "infinitely repeating" values than base 10 does. For instance, 1/10 in decimal is 0.1; in binary, it's 0.000110011001100110011001100... Floating point cannot hold 2.3 exactly because it needs an infinite number of binary digits, so it stores a close approximation, probably something like 2.2999999. For most math, that's close enough. But be wary of truncation.
One solution is to round before you truncate.
int b = (a + .05 - (int)(a + .05))*10;
Also note that on some platforms (x87, for example) floating point values have different sizes in memory than in the registers, which means you have to round when comparing whether two floating point values are equal as well.
The reason for the discrepancy is that by default, floating point literals are doubles, which have higher accuracy, and are more closely able to represent the value you're looking for.
Why don't you do it like this?
b = (int)(a*10) % 10;
I find it a lot easier.
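Putting the "round before you truncate" advice into a self-contained form, one possible sketch is below. The helper name first_decimal_digit is invented, and it assumes the input is non-negative and was entered with at most two decimal places.

```cpp
#include <cmath>

// Returns the first digit after the decimal point.
int first_decimal_digit(float a) {
    // Round to the nearest hundredth first, so that 2.3f (stored slightly
    // below 2.3) becomes exactly 230 instead of 229.99999...
    const long hundredths = std::lround(static_cast<double>(a) * 100.0);

    // Integer arithmetic from here on: drop the hundredths digit, then
    // keep the tenths digit.
    return static_cast<int>((hundredths / 10) % 10);
}
```

Rounding at one digit finer than the digit being extracted absorbs the representation error without changing the answer for inputs such as 2.37, which still yields 3.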