Why is the following not yielding the desired result?
data data1;
  do a = 0.0 to 1.0 by 0.1;
    do b = 0.0 to 1.0 by 0.1;
      do c = 0.0 to 1.0 by 0.1;
        do d = 0.0 to 1.0 by 0.1;
          if (a+b+c+d)=1 then output;
        end;
      end;
    end;
  end;
  format a b c d 4.1;
run;
I'm not really familiar with SAS, but in general, numbers like 0.1 are represented in binary. Since 0.1 can't be represented exactly in binary, math on it doesn't always add up exactly. For instance, adding 0.1 to itself ten times does not yield exactly 1.0 in floating-point arithmetic. In general, don't use equality comparisons when working with floating point.
See http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Numbers in SAS are stored in binary (as they are in most computing applications) and often do not precisely represent their decimal equivalents. Just as 0.33333333 != (1/3) exactly, many multiples of 1/10 do not represent their precise decimal values either.
data data1;
  do a = 0.0 to 1.0 by 0.1;
    do b = 0.0 to 1.0 by 0.1;
      do c = 0.0 to 1.0 by 0.1;
        do d = 0.0 to 1.0 by 0.1;
          if round(a+b+c+d,.1)=1 then output;
        end;
      end;
    end;
  end;
  format a b c d 4.1;
run;
Rounding fixes the issue.
You can read this SAS technical paper for more information.
I have a problem converting a double (say N) to p/q form (rational form). My strategy is:
Multiply the double N by a large number, say $k = 10^{10}$
Then p = N*k and q = k
Take gcd(p,q) and set p = p/gcd(p,q) and q = q/gcd(p,q)
When N = 8.2, the answer is correct if we solve it with pen and paper, but since 8.2 is represented as something like 8.19999999 in N (a double), this causes problems in the rational-form conversion.
I tried doing it another way (using a large number 10^k instead of 100):
if(abs(N*100 - round(N*100)) < 0.000001) N = round(N*100)/100
But this approach doesn't give the right representation all the time either.
Is there any way I could carry out the equivalent conversion from double to p/q?
Floating point arithmetic is very difficult. As has been mentioned in the comments, part of the difficulty is that you need to represent your numbers in binary.
For example, the number 0.125 can be represented exactly in binary:
0.125 = 2^-3 = 0b0.001
But the number 0.12 cannot.
To 11 significant figures:
0.12 = 0b0.00011110101
If this is converted back to a decimal then the error becomes obvious:
0b0.00011110101 = 0.11962890625
So if you write:
double a = 0.2;
What the machine actually does is find the closest binary representation of 0.2 that it can hold within a double data type. This is an approximation, since (like 0.12 above) 0.2 cannot be represented exactly in binary.
One possible approach is to define an 'epsilon' which determines how close your number can be to the nearest representable binary floating point.
Here is a good article on floating points:
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
have problem of converting a double (say N) to p/q form
... when N = 8.2
A typical double cannot encode 8.2 exactly. Instead the closest representable double is about
8.19999999999999928945726423989981412887573...
8.20000000000000106581410364015027880668640... // next closest
When code does
double N = 8.2;
It will be the 8.19999999999999928945726423989981412887573... that is converted into rational form.
Converting a double to p/q form:
Multiply double N by a large number say $k = 10^{10}$
This may overflow the double. The first step should be to determine whether the double is large; in that case, it is already a whole number.
Do not multiply by some power of 10, as double certainly uses a binary encoding. Multiplication by 10, 100, etc. may introduce round-off error.
C implementations of double overwhelmingly use a binary encoding, so that FLT_RADIX == 2.
Then every finite double x has a significand that is a fraction of some integer over some power of 2: a binary fraction of DBL_MANT_DIG digits (per @Richard Critten). This is often 53 binary digits.
Determine the exponent of the double. If it is large enough, or if x == 0.0, the double is a whole number.
Otherwise, scale a numerator and denominator by 2^(DBL_MANT_DIG - exponent). While the numerator is even, halve both the numerator and denominator. As the denominator is a power of 2, no other prime factors need to be considered for simplification.
#include <float.h>
#include <math.h>
#include <stdio.h>

void form_ratio(double x) {
    double numerator = x;
    double denominator = 1.0;
    if (isfinite(numerator) && x != 0.0) {
        int expo;
        frexp(numerator, &expo);
        if (expo < DBL_MANT_DIG) {
            expo = DBL_MANT_DIG - expo;
            numerator = ldexp(numerator, expo);
            denominator = ldexp(1.0, expo);
            while (fmod(numerator, 2.0) == 0.0 && denominator > 1.0) {
                numerator /= 2.0;
                denominator /= 2.0;
            }
        }
    }
    int pre = DBL_DECIMAL_DIG;
    printf("%.*g --> %.*g/%.*g\n", pre, x, pre, numerator, pre, denominator);
}

int main(void) {
    form_ratio(123456789012.0);
    form_ratio(42.0);
    form_ratio(1.0 / 7);
    form_ratio(867.5309);
}
Output
123456789012 --> 123456789012/1
42 --> 42/1
0.14285714285714285 --> 2573485501354569/18014398509481984
867.53089999999997 --> 3815441248019913/4398046511104
I have such a function that calculates weights according to Gaussian distribution:
const float dx = 1.0f / static_cast<float>(points - 1);
const float sigma = 1.0f / 3.0f;
const float norm = 1.0f / (sqrtf(2.0f * static_cast<float>(M_PI)) * sigma);
const float divsigma2 = 0.5f / (sigma * sigma);

m_weights[0] = 1.0f;
for (int i = 1; i < points; i++)
{
    float x = static_cast<float>(i) * dx;
    m_weights[i] = norm * expf(-x * x * divsigma2) * dx;
    m_weights[0] -= 2.0f * m_weights[i];
}
The exact numbers in the calculation above don't matter. The only thing that matters is that m_weights[0] starts at 1.0f, and each time I calculate m_weights[i] I subtract it twice from m_weights[0] like this:
m_weights[0] -= 2.0f * m_weights[i];
to ensure that w[0] + 2 * (w[1] + ... + w[N]) sums to exactly 1.0f. But it does not; this assert fails:
float wSum = 0.0f;
for (size_t i = 0; i < m_weights.size(); ++i)
{
    float w = m_weights[i];
    if (i == 0) {
        wSum += w;
    } else {
        wSum += (w + w);
    }
}
assert(wSum == 1.0 && "Weights sum is not 1.");
How can I ensure the sum to be 1.0f on all platforms?
You can't. Floating point isn't like that. Even adding the same values can produce different results depending on the CPU used.
All you can do is define some accuracy value and ensure that you end up with 1.0 +/- that value.
See: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Because a float has only 24 bits of significand precision (23 explicitly stored; see e.g. https://en.wikipedia.org/wiki/Single-precision_floating-point_format ), rounding error accumulates quickly. So even if the rest of the code is correct, your sum becomes something like 1.0000001 or 0.9999999 (have you watched it in the debugger, or tried printing it to the console, by the way?). To improve precision you can replace float with double, but the sum will still not be exactly 1.0: the error will just be smaller, something like 1e-16 instead of 1e-7.
The second thing to do is to replace strict comparison to 1.0 with a range comparison, like:
assert(fabs(wSum - 1.0) <= 1e-13 && "Weights sum is not 1.");
Here 1e-13 is the epsilon within which you consider two floating-point numbers equal. If you choose to stay with float (not double), you may need an epsilon of around 1e-6.
Depending on how large your weights are and how many points there are, accumulated error can become larger than that epsilon. In that case you would need special algorithms for keeping the precision higher, such as sorting the numbers by their absolute values prior to summing them up starting with the smallest numbers.
How can I ensure the sum to be 1.0f on all platforms?
As the other answers (and comments) have stated, you can't achieve this, due to the inexactness of floating point calculations.
One solution is that, instead of using double, use a fixed point or multi-precision library such as GMP, Boost Multiprecision Library, or one of the many others out there.
My code
#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

double to_radians(double theta)
{
    return (M_PI * theta) / 180.0;
}

int main()
{
    std::vector<std::pair<double, double>> points;
    for (double theta = 0.0; theta <= 360.0; theta += 30.0)
    {
        points.push_back(std::make_pair(std::cos(to_radians(theta)),
                                        std::sin(to_radians(theta))));
    }

    for (auto point : points)
        std::cout << point.first << " " << point.second << "\n";
}
Output I expect
1 0
0.866025 0.5
0.5 0.866025
0 1
-0.5 0.866025
-0.866025 0.5
-1 0
-0.866025 -0.5
-0.5 -0.866025
0 -1
0.5 -0.866025
0.866025 -0.5
1 0
Output I get:
1 0
0.866025 0.5
0.5 0.866025
6.12303e-17 1
-0.5 0.866025
-0.866025 0.5
-1 1.22461e-16
-0.866025 -0.5
-0.5 -0.866025
-1.83691e-16 -1
0.5 -0.866025
0.866025 -0.5
1 -2.44921e-16
As you can see I am getting these strange values instead of zero. Can somebody explain why this is happening?
6.12303e-17, to take an example, represents the value 6.12303×10^-17, or 0.0000000000000000612303.
The reason you obtain this value as result is that you did not apply cos to π/2, which is not representable as a double anyway (it's irrational). The cos function was applied to a double close to π/2, obtained by multiplying 90 by M_PI and dividing by 180. Since the argument is not π/2, the result does not have to be 0. In fact, since floating-point numbers are more dense near zero, it is extremely unlikely for any floating-point format that applying a correctly rounded cos to any floating-point number produces exactly zero as result.
In fact, since the derivative of cos at π/2 is -1, the value obtained for the expression cos(M_PI/2.0) is a close approximation of the difference between M_PI/2 and π/2. That difference is indeed on the order of 10^-17, since the double-precision IEEE 754 format can only represent the first 16 or so decimal digits of an arbitrary number.
Note that the same argument applies to obtaining 0.5 as the result of cos(M_PI/3.0), or even -1.0 as the result of cos(M_PI). The difference is that there are many floating-point numbers, some very small, around 0, and these can represent very precisely the intended non-zero result. In comparison, 0.5 and -1.0 have only a few neighbors, and for inputs close enough to π/3 and π, the numbers 0.5 and -1.0 end up being returned as the nearest representable double-precision value to the respective mathematical result (which isn't 1/2 or -1, since the input is not π/3 or π).
The simplest solution to your problem would be to use hypothetical functions cosdeg and sindeg that would compute directly the cosine and sine of angles in degrees. Since 60 and 90 are representable exactly as double-precision floating-point numbers, these functions would have no excuse not to return 0.5 or 0.0 (also exactly representable as double-precision floating-point numbers). I asked a question in relation to these functions earlier but no-one pointed to any already available implementation.
The functions sinpi and cospi pointed out by njuffa are often available, and they allow computing the sine and cosine of π/2, π/4, or even 7.5π, but not of π/3, since the number 1/3 they would have to be applied to is not exactly representable in binary floating point.
It's a floating-point rounding error. Trig functions are implemented as mathematical series approximated at the computational level, which is why, for results that should be exactly zero, you get very small numbers such as 6.12303e-17 instead.
I've found an interesting floating point problem. I have to calculate several square roots in my code, and the expression is like this:
sqrt(1.0 - pow(pos,2))
where pos goes from -1.0 to 1.0 in a loop. The -1.0 is fine for pow, but when pos = 1.0 I get -nan. Doing some tests with gcc 4.4.5 and icc 12.0, the output of
1.0 - pow(pos,2) = -1.33226763e-15
and
1.0 - pow(1.0,2) = 0
or
poss = 1.0
1.0 - pow(poss,2) = 0
Clearly the first one is going to cause problems, being negative. Does anyone know why pow is returning a number smaller than 0? The full offending code is below:
#include <cassert>
#include <cmath>
#include <iomanip>
#include <iostream>

using namespace std;

int main() {
    double n_max = 10;
    double a = -1.0;
    double b = 1.0;
    int divisions = int(5 * n_max);
    assert(!(b == a));
    double interval = b - a;
    double delta_theta = interval / divisions;
    double delta_thetaover2 = delta_theta / 2.0;
    double pos = a;

    //for (int i = 0; i < divisions - 1; i++) {
    for (int i = 0; i < divisions + 1; i++) {
        cout << sqrt(1.0 - pow(pos, 2)) << setw(20) << pos << endl;
        if (isnan(sqrt(1.0 - pow(pos, 2)))) {
            cout << "Danger Will Robinson!" << endl;
            cout << sqrt(1.0 - pow(pos, 2)) << endl;
            cout << "pos " << setprecision(9) << pos << endl;
            cout << "pow(pos,2) " << setprecision(9) << pow(pos, 2) << endl;
            cout << "delta_theta " << delta_theta << endl;
            cout << "1 - pow " << 1.0 - pow(pos, 2) << endl;
            double poss = 1.0;
            cout << "1 - poss " << 1.0 - pow(poss, 2) << endl;
        }
        pos += delta_theta;
    }
    return 0;
}
When you keep incrementing pos in a loop, rounding errors accumulate, and in your case the final value ends up greater than 1.0. Instead, calculate pos from the loop index by multiplication on each iteration, so only a minimal amount of rounding error is involved.
The problem is that floating point calculations are not exact, and that 1 - 1^2 may be giving small negative results, yielding an invalid sqrt computation.
Consider capping your result:
double x = 1. - pow(pos, 2.);
result = sqrt(x < 0 ? 0 : x);
or
result = sqrt(abs(x) < 1e-12 ? 0 : x);
setprecision(9) is going to cause rounding. Use a debugger to see what the value really is. Short of that, at least set the precision beyond the possible size of the type you're using.
You will almost always have rounding errors when calculating with doubles, because the double type has only about 15 significant decimal digits (52 fraction bits), and many decimal numbers are not convertible to binary floating-point numbers without rounding. The IEEE standard makes considerable effort to keep those errors low, but by principle it cannot always succeed. For a thorough introduction see this document
In your case, you should calculate pos on each loop and round to 14 or less digits. That should give you a clean 0 for the sqrt.
You can calculate pos inside the loop as
pos = round(a + interval * i / divisions, 14);
with round defined as
double round(double r, int digits)
{
    double multiplier = pow(10, digits);
    return floor(r * multiplier + 0.5) / multiplier;
}
Here is my problem, I have several parameters that I need to increment by 0.1.
But my UI only renders x.x, x.xx, or x.xxx for floats, and since 0.1f is not really 0.1 but something like 0.10000000149011612, in the long run my UI will render -0.00, which doesn't make much sense. How can I prevent that in all possible UI cases?
Thank you.
Use integers and divide by 10 (or 1000 etc...) just before displaying. Your parameters will store an integer number of tenths, and you'll increment them by 1 tenth.
If you know that your floating point value will always be a multiple of 0.1, you can round it after every increment to make sure it maintains a sensible value. It still won't be exact (because it physically can't be), but at least the errors won't accumulate and it will display properly.
Instead of:
x += delta;
Do:
x = floor((x + delta) / precision + 0.5) * precision;
Edit: It's useful to turn the rounding into a stand-alone function and decouple it from the increment:
inline double round(double value, double precision = 1.0)
{
    return floor(value / precision + 0.5) * precision;
}

x = round(x + 0.1, 0.1);