Here is a situation where a round_to_2_digits() function rounded down when we expected it to round up. It turned out to be a case where the number cannot be represented exactly in a double. I don't remember the exact value, but say it was this:
double value = 1.155;
double value_rounded = round_to_2_digits( value );
The value was the output of another function, and instead of being exactly 1.155 as in the code above, it was actually something like 1.15499999999999999. So calling std::round() on it produced 1.15 instead of the 1.16 we expected.
Questions:
I'm thinking it may be wise to add a tiny value in round_to_2_digits() prior to calling std::round().
Is this standard practice, or at least acceptable? For example, adding 0.0005 to the value being rounded.
Is there a mathematical term for this kind of "fudge factor"?
EDIT: "epsilon" was the term I was looking for.
And since the function rounds to only 2 digits after the decimal point, should I be adding 0.001? 0.005? 0.0005?
The rounding function is quite simple:
double round_to_2_decimals( double value )
{
    value *= 100.0;
    value = std::round(value);
    value /= 100.0;
    return value;
}
Step one is admitting that double may not be the right data type for your application. :-) Consider using a fixed-point type, or multiplying all of your values by 100 (or 1000, or whatever), and working with integer values.
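For example, a minimal sketch of the integer-cents idea (the names and the use of long long are assumptions of mine, not anything from your code):
#include <cstdio>

int main()
{
    long long price_cents = 115;   // 1.15, stored exactly as an integer
    long long tax_cents   = 1;     // integer arithmetic is exact
    long long total_cents = price_cents + tax_cents;

    // Only convert to a decimal string at the output boundary.
    std::printf("%lld.%02lld\n", total_cents / 100, total_cents % 100);
}
All the arithmetic stays exact; the binary-fraction problem only reappears if you convert back to double somewhere.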
You can't actually guarantee that adding any particular small epsilon won't give you the wrong result in some other situation. What if your value really was supposed to be 1.1549999... (i.e. genuinely just below the boundary) and your epsilon pushed it up? Then you'd have the wrong value the other way.
The problem with your suggested solution is that at the end of it, you're still going to have a binary fraction that's not necessarily equal to the decimal value you're trying to represent. The only way to fix this is to use a representation that can exactly represent the values you want to use.
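To make that concrete, here is a small sketch (the input 1.1546 is just an example I picked): a fixed 0.0005 nudge fixes 1.155 but breaks other inputs that should round down.
#include <cmath>
#include <cstdio>

int main()
{
    double v = 1.1546;   // should round to 1.15
    std::printf("%.2f\n", std::round(v * 100.0) / 100.0);              // prints 1.15
    std::printf("%.2f\n", std::round((v + 0.0005) * 100.0) / 100.0);   // prints 1.16 (wrong)
}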
This question doesn't make a lot of sense. POSIX mandates that std::round rounds halfway cases away from zero, so the result should in fact be 116, not 115. In order to actually replicate your behavior, I had to use a function that pays attention to the current rounding mode:
std::fesetround(FE_DOWNWARD);
std::cout << std::setprecision(20) << std::rint(1.155 * 100.0);
This was tested on GCC 5.2.0 and Clang 3.7.0.
Related
I have a naive question about high-precision number conversion (in C++ here).
Suppose the user assigns 0.1 to the double variable x_d with this statement,
x_d = 0.1;
It is known that the x_d thus obtained is no longer exactly 0.1, due to the inevitable machine rounding.
I wonder whether we still have a way to get back the original, precise string “0.1” from the double variable x_d? Clearly, std::to_string(x_d) is useless here. Even a high-precision library like boost::multiprecision or MPFR seems to be of no help. For example, std::to_string(boost::cpp_dec_float_10000(x_d)) cannot recover the lost precision.
So my question is: can we retrieve the string “0.1” from a double x_d that was assigned with the statement x_d = 0.1;?
Let's assume that during the assignment the decimal number 0.1 is rounded to a double value X that does not equal the decimal number 0.1. Now assume some other computation also results in the value X, without any rounding having occurred. In order to distinguish those two cases, you would have to store the origin of the value somewhere. There is simply no room for that in a double (assuming common implementations), so the answer to your question is "no".
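You can at least see what is actually stored by printing with enough digits (17 significant digits suffice to round-trip an IEEE-754 double); a small sketch:
#include <iomanip>
#include <iostream>

int main()
{
    double x_d = 0.1;
    // Print the value that is actually stored, not the decimal string that was typed.
    std::cout << std::setprecision(17) << x_d << "\n";
    // On an IEEE-754 system this typically prints 0.10000000000000001.
}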
So in general, I understand the difference between specifying 3. and 3.0d0 with the difference being the number of digits stored by the computer. When doing arithmetic operations, I generally make sure everything is in double precision. However, I am confused about the following operations:
64**(1./3.) vs. 64**(1.0d0/3.0d0)
It took me a couple of weeks to find an error where I was assigning the output of 64**(1.0d0/3.0d0) to an integer. Because 64**(1.0d0/3.0d0) returns 3.999999..., the integer got the value 3 and not 4. However, 64**(1./3.) = 4.00000. Can someone explain to me why it is wise to use 1./3. vs. 1.0d0/3.0d0 here?
The issue isn't so much single versus double precision. All floating point calculations are subject to imprecision compared to true real numbers. In assigning a real to an integer, Fortran truncates. You probably want to use the Fortran intrinsic nint.
This is a peculiar, fortuitous case where the lower-precision calculation gives the exact result. You can see this without the integer conversion issue:
write(*,*)4.d0-64**(1./3.),4.d0-64**(1.d0/3.d0)
0.000000000 4.440892E-016
In general this does not happen; here the double-precision value is "better":
write(*,*)13.d0-2197**(1./3.),13.d0-2197**(1.d0/3.d0)
-9.5367E-7 1.77E-015
Here, since the single-precision calculation comes out slightly high, it gives you the correct value on integer conversion, while the double-precision result gets truncated down and is therefore wrong, even though its floating-point error was smaller.
So in general, no, you should not consider single precision preferable.
In fact, 64 and 125 seem to be the only special cases where the single-precision calculation gives a perfect cube root while the double-precision one does not.
The way I understand it: when subtracting two double-precision numbers in C++, each is first represented as a significand (with a leading one) times 2 to the power of the exponent. One can then get an error if the numbers being subtracted have the same exponent and many of the same digits in the significand, leading to loss of precision. To test this for my code I wrote the following safe addition function:
double Sadd(double d1, double d2, int& report, double prec) {
    int exp1, exp2;
    double man1 = frexp(d1, &exp1), man2 = frexp(d2, &exp2);
    if (d1 * d2 < 0) {                           // opposite signs: this addition is really a subtraction
        if (exp1 == exp2) {
            if (fabs(man1 + man2) < prec) {      // the significands nearly cancel
                cout << "Floating point error" << endl;
                report = 0;
            }
        }
    }
    return d1 + d2;
}
However, when testing this I noticed something strange: the actual error (not whether the function reports one, but the error resulting from the computation itself) seems to depend on the absolute values of the numbers being subtracted and not just on the number of equal digits in the significand...
For example, using 1e-11 as the precision prec and subtracting the following numbers:
1) 9.8989898989898-9.8989898989897: The function reports error and I get the highly incorrect value 9.9475983006414e-14
2) 98989898989898-98989898989897: The function reports error but I get the correct value 1
Obviously I have misunderstood something. Any ideas?
If you subtract two floating-point values that are nearly equal, the result will mostly reflect noise in the low bits. Nearly equal here is more than just same exponent and almost the same digits. For example, 1.0001 and 1.0000 are nearly equal, and subtracting them could be caught by a test like this. But 1.0000 and 0.9999 differ by exactly the same amount, and would not be caught by a test like this.
Further, this is not a safe addition function. Rather, it's a post-hoc check for a design/coding error. If you're subtracting two values that are so close together that the noise matters, you've made a mistake; fix the mistake. I'm not objecting to using something like this as a debugging aid, but please call it something that implies that that's what it is, rather than suggesting that there's something inherently dangerous about floating-point addition. Further, putting the check inside the addition function seems excessive: an assert that the two values won't cause problems, followed by a plain old floating-point addition, would probably be better. After all, most of the additions in your code won't lead to problems, and you'd better know where the problem spots are; put asserts in the problem spots.
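A rough sketch of that "assert at the problem spot" idea; the relative threshold of 1e-9 is an arbitrary choice of mine, not a standard value:
#include <algorithm>
#include <cassert>
#include <cmath>

inline double checked_add(double a, double b)
{
    // Flag the case where a and b nearly cancel, i.e. the sum is tiny
    // compared to the inputs and consists mostly of low-bit noise.
    assert(std::fabs(a + b) >=
           1e-9 * std::max(std::fabs(a), std::fabs(b)));
    return a + b;
}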
+1 to Pete Becker's answer.
Note that the problem of a degenerate result can also occur when exp1 != exp2.
For example, if you subtract
1.0-0.99999999999999
So,
bool degenerated =
       (exp1 == exp2     && fabs(man1 + man2)   < prec)
    || (exp1 == exp2 - 1 && fabs(man1 + 2*man2) < prec)
    || (exp1 == exp2 + 1 && fabs(2*man1 + man2) < prec);
You can omit the check for d1*d2 < 0, or keep it so that the whole test is skipped when the signs are the same...
If you also want to handle loss of precision with denormalized (subnormal) floats, that'll be a bit more involved (it's as if the significand had fewer bits).
It's quite easy to prove that for IEEE 754 floating-point arithmetic, if x/2 <= y <= 2x then calculating x - y is an exact operation that incurs no rounding error at all (this is known as Sterbenz's lemma).
And if the result of an addition or subtraction is a denormalised number, then the result is always exact.
What does the compiler do here? The aim is to get the first digit after the decimal point as an integer. I did it like this:
float a = 0;
cin >> a;
int b = (a - (int)a)*10;
Now my problem is this: when I enter, for example, 3.2, I get 2, which is what I want. It also works with .4, .5 and .7. But when I enter, for example, 2.3, I get 2 instead of 3. For 2.7 I get 6, and so on.
(2.3 - (int)2.3)*10;
I get the correct result.
I couldn't figure out what the compiler does. I always thought that when I cast a float to an integer, it simply cuts off everything after the decimal point. This is what the compiler actually does when I use constant numbers. However, when I use variables, some of the results come out too small, but not all.
You are most likely not having problems with the compiler, but with the fact that floating point numbers cannot be represented exactly on a binary computer.
So, when you do:
float f = 2.7f;
...what might actually be stored in the computer is:
2.6999999999999999
This is a very well-known characteristic of floating points on binary computers. There are many posts on SO that discuss this.
Basically, the problem comes from the fact that binary has different "infinitely repeating" values than base 10 does. For instance, 1/10 in decimal is 0.1; in binary it is 0.000110011001100110011001100... The problem is that floating point cannot hold 2.3 exactly, because its binary expansion has an infinite number of digits; it only approximates it closely, probably as something like 2.2999999. For most math that is close enough, but be wary of truncation.
One solution is to round before you truncate.
int b = (int)((a - (int)a) * 10 + 0.5);
Also note that floating point values can have a different size in registers than in memory (for example, the old x87 registers are 80 bits wide while a double in memory is 64), which is one more reason to compare floating point values with a tolerance rather than for exact equality.
The reason for the discrepancy is that, by default, floating point literals are doubles, which have more precision and can represent the value you're looking for more closely.
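A small sketch of that difference (the exact digits may vary slightly, but the pattern is the same on typical IEEE-754 systems):
#include <iomanip>
#include <iostream>

int main()
{
    float a = 2.3f;
    std::cout << std::setprecision(17)
              << (2.3 - (int)2.3) * 10 << "\n"   // double literal: ~2.9999999999999982
              << (a - (int)a) * 10 << "\n";      // float variable: ~2.9999995231628418
}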
Why don't you do it like this?
int b = (int)(a * 10) % 10;
(The % operator needs integer operands, hence the cast.) I find it a lot easier.
In C++,
const double Pi = 3.14159265;
cout << sin(Pi); // displays: 3.58979e-009
it SHOULD display the number zero
I understand this is because Pi is being approximated, but is there any way I can have a value of Pi hardcoded into my program that will return 0 for sin(Pi)? (a different constant maybe?)
In case you're wondering what I'm trying to do: I'm converting polar to rectangular, and while there are some printf() tricks I can do to print it as "0.00", it still doesn't consistently return decent values (in some cases I get "-0.00")
The lines that require sin and cosine are:
x = r*sin(theta);
y = r*cos(theta);
BTW: My Rectangular -> Polar is working fine... it's just the Polar -> Rectangular
Thanks!
edit: I'm looking for a workaround so that I can print sin(some multiple of Pi) as a nice round number to the console (ideally without a thousand if-statements)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (edit: also got linked in a comment) is pretty hardcore reading (I can't claim to have read all of it), but the crux of it is this: you'll never get perfectly accurate floating point calculations. From the article:
Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation.
Don't let your program depend on exact results from floating point calculations - always allow a tolerance range. FYI 3.58979e-009 is about 0.0000000036. That's well within any reasonable tolerance range you choose!
Let's put it this way, 3.58979e-009 is as close to 0 as your 3.14159265 value is to the real Pi. What you got is, technically, what you asked for. :)
Now, since you only put in 9 significant figures (8 decimal places), you can instruct the output to display no more than that, i.e. use:
cout.precision(8);
cout << sin(Pi);
It's equal to zero if your equality comparison has enough tolerance.
Did you try M_PI, available in most <cmath> or <math.h> implementations?
Even so, using floating point in this way will always introduce a certain amount of error.
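For what it's worth, a tiny sketch using M_PI (note that M_PI is a POSIX/common extension rather than standard C++; on MSVC you may need to define _USE_MATH_DEFINES before including <cmath>):
#include <cmath>
#include <iostream>

int main()
{
    std::cout << std::sin(M_PI) << "\n";
    // Still not exactly zero: typically prints something like 1.22465e-16,
    // because M_PI is only the double closest to the real Pi.
}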
This should display zero:
cout << fixed << sin(Pi);
(I don't think you should be trying to round anything. If you are worried about display, deal with the display functions, not with the value itself.)
3.58979e-009 is 0.00000000358979.
It is approximately 0, just as your 3.14159265 is approximately Pi.
You could throw in some more digits to get a better result (try for example 3.1415926535897932384626433832795029L), but you'll still get rounding errors.
Still, you can create your own sin and cos versions that check against your known Pi value and return exactly zero in those cases.
namespace TrigExt
{
    const double PI = 3.14159265358979323846;

    inline double sin(double theta)
    {
        return theta == PI ? 0.0 : std::sin(theta);
    }
}
You may also expand this thing for the other trigonometric functions and to handle Pi multiples.
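A possible way to handle multiples of Pi as well (sketch only; the 1e-9 tolerance is an arbitrary choice of mine):
#include <cmath>

namespace TrigExt
{
    inline double sin(double theta)
    {
        const double PI = 3.14159265358979323846;
        // Distance from theta to the nearest integer multiple of PI.
        double r = std::remainder(theta, PI);
        return (std::fabs(r) < 1e-9) ? 0.0 : std::sin(theta);
    }
}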
You could write a little wrapper function:
double mysin(const double d) {
    double ret = sin(d);
    if (fabs(ret) < 0.0000001) {
        return 0.0;
    } else {
        return ret;
    }
}
As others have noted, floating-point maths is notoriously inexact. You need some kind of tolerance if you want something to appear as exactly zero.
Why not force it to however many digits you need?
int isin = (int)(sin(val) * 1000);
cout << (isin / 1000.0);
sin(PI) should equal 0, for an exact value of PI. You are not entering the exact value of PI. As other people are pointing out, the result you are getting rounded to 7 decimal places is 0, which is pretty good for your approximation.
If you need different behavior you should write your own sine function.
If you use float or double in math operations you will, in general, not get exact results.
The reason is that a computer stores numbers in binary, which does not translate exactly to our decimal number system. (An example: 0.1 has no finite representation in base 2.)
In addition, a double is typically 64 bits and a float 32 bits (this can vary by compiler and platform), so only a limited number of significant digits is available; the resulting rounding errors become noticeable for very large values and for very small ones (0.0000000000xxx).
In order to get exact results you are going to need some arbitrary-precision (bignum) library.
As mentioned in the comments to the question above, see:
http://docs.sun.com/source/806-3568/ncg_goldberg.html
double cut(double value, double cutoff = 1e-7) {
    return (std::fabs(value) > cutoff) * value;   // multiplies by 0 or 1
}
This will zero out values below the threshold; use it like this: cut(sin(Pi)).
More significant figures might help. My C compiler (gcc) uses the constant 3.14159265358979323846 for M_PI in "math.h". Other than that, there aren't many options. Creating your own function to validate the answer (as described in another answer to your question) is probably the best idea.
You know, just for the mathematical correctness out there: sin(3.14159265) isn't zero. It's approximately zero, which is exactly what the program is telling you. For calculations, this number ought to give you a good result. For displaying, it sucks, so whenever you print a float, make sure to format the number.
I don't really think there are any float mechanics at work here... it's just simple math.
About the code though, be careful... don't make your code give the wrong result by applying the approximations before the display; just display the information the right way.