The asin() function from cmath library returns inaccurate values - c++

I'm using the asin and acos functions from cmath library to find the angles from their corresponding sine and cosine values but it sometimes returns inaccurate values. For example the result of the code below is not zero:
cout << acos(-1) / 6 - asin(0.5) << endl;
So what can I do? I use acos(-1) as the value of Pi, then somewhere in my code I want to see for example how many asin(0.5) are there in Pi but current method doesn't work.

Inexactness lurks in various places.
double is very commonly encoded using base 2, such as with binary64. As a floating point number, it can be expected to be a Dyadic rational. The key is that a finite floating-point number is a rational number.
Mathematically, arccos(-1) is π. π is an irrational number. Even with many digits of precision, a double cannot represent exactly π. Instead a nearby value for acos(-1) is used called machine pi or perhaps M_PI.
Mathematically, arcsin(0.5) is π/6. As above, asin(0.5) is a nearby value to π/6.
Division by 3 introduces potential inexactness. About 2 of every double/3 results in an inexact, yet rounded, quotient.
Of course it is not zero.
What is curious about this is that acos(-1) / 3 - asin(0.5) is about 1.57..., not 0.0.
I suspect OP was researching acos(-1) / 6 - asin(0.5).
Here subtraction of nearby values incurs catastrophic cancellation and a small, but non-zero difference, is a likely outcome.
So what can I do?
I want to see for example how many asin(0.5) are there in Pi
Tolerate marginal error in the calculation and employ the desired rounding.
double my_pi = acos(-1);
double target = asin(0.5);
long close_integer_quotient = lround(my_pi/target);

Related

Python trig functions not returning matched results

I came across something rather interesting while I am playing around with the math module for trigonometric calculations using tan, sin, and cos.
As stated is all math textbooks, online source, and courses, the following is true:
tan(x) = sin(x) / cos(x)
Although I came across some precision errors while using the three trig functions with the following:
from math import tan, sin, cos
theta = -30
alpha = tan(theta)
omega = sin(theta) / cos(theta)
print(alpha, omega)
print(alpha == omega)
>>> (6.405331196646276, 6.4053311966462765)
>>> (False)
I have tried a couple of different values for theta and the last digit of the results has been off by a tiny bit.
Is there something that I am missing?
This issue is because of the finite floating point precision (not all real numbers can be represented exactly and not all calculations with them are precise). An accessible guide is in the Python docs.
Using the default, "double precision" floating point representation, you can never hope for better than about 15 decimal place precision and calculations involving such numbers will tend to degrade this precision (the rounding error refered to in the above comment). In the same way, you get False from the following:
In [1]: 0.01 == (0.1)**2
Out[1]: False
because the Python isn't squaring 0.1 but the "nearest representable number" to 0.1, which is neither 0.01 nor the nearest representable number to 0.01.
D Stanley has given the correct way to test for "equality" within some absolute tolerance: (abs(a-b) < tol) where tol is some small number you choose to fit your expected precision.
As you have discovered, there is a level of imprecision when comparing floating point numbers. A common way to test for "equality" is to determine a reasonable amount of difference you want to accept (commonly called "epsilon") an compare the difference between the two numbers against that maximum error:
epsilon = 1E-14
print(alpha, omega)
print(alpha == omega)
print(abs(alpha - omega) < epsilon)
First you should notice that the arguments of trigonometric functions are given in arc length, not in degree. Thus theta=-30 refers to an angle of -30*180/pi in degrees.
Second, the processor, and thus the calling math library, has separate internal procedures for the computation of tan and (sin, cos). The extra division operation loses 1/2 to 1 bit of precision, which explains the difference in results.

Should I worry about precision when I use C++ mathematical functions with integers?

For example, The code below will give undesirable result due to precision of floating point numbers.
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
I wonder whether similar problems will show up if I use mathematical functions. For example
int a = sqrt(4); // Do I have guarantee that I will always get 2 here?
int b = log2(8); // Do I have guarantee that I will always get 3 here?
If not, how to solve this problem?
Edit:
Actually, I came across this problem when I was programming for an algorithm task. There I want to get
the largest integer which is power of 2 and is less than or equal to integer N
So round function can not solve my problem. I know I can solve this problem through a loop, but it seems not very elegant.
I want to know if
int a = pow(2, static_cast<int>(log2(N)));
can always give correct result. For example if N==8, is it possible that log2(N) gives me something like 2.9999999999999 and the final result become 4 instead of 8?
Inaccurate operands vs inaccurate results
I wonder whether similar problems will show up if I use mathematical functions.
Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function.
You are confusing two different issues:
double a = 1 / 3.0;
int b = a * 3; // b will be 0 here
In the example above, a is not exactly 1/3, so it is possible that a*3 does not produce 1.0. The product could have happened to round to 1.0, it just doesn't. However, if a somehow had been exactly 1/3, the product of a by 3 would have been exactly 1.0, because this is how IEEE 754 floating-point works: the result of basic operations is the nearest representable value to the mathematical result of the same operation on the same operands. When the exact result is representable as a floating-point number, then that representation is what you get.
Accuracy of sqrt and log2
sqrt is part of the “basic operations”, so sqrt(4) is guaranteed always, with no exception, in an IEEE 754 system, to be 2.0.
log2 is not part of the basic operations. The result of an implementation of this function is not guaranteed by the IEEE 754 standard to be the closest to the mathematical result. It can be another representable number further away. So without more hypotheses on the log2 function that you use, it is impossible to tell what log2(8.0) can be.
However, most implementations of reasonable quality for elementary functions such as log2 guarantee that the result of the implementation is within 1 ULP of the mathematical result. When the mathematical result is not representable, this means either the representable value above or the one below (but not necessarily the closest one of the two). When the mathematical result is exactly representable (such as 3.0), then this representation is still the only one guaranteed to be returned.
So about log2(8), the answer is “if you have a reasonable quality implementation of log2, you can expect the result to be 3.0`”.
Unfortunately, not every implementation of every elementary function is a quality implementation. See this blog post, caused by a widely used implementation of pow being inaccurate by more than 1 ULP when computing pow(10.0, 2.0), and thus returning 99.0 instead of 100.0.
Rounding to the nearest integer
Next, in each case, you assign the floating-point to an int with an implicit conversion. This conversion is defined in the C++ standard as truncating the floating-point values (that is, rounding towards zero). If you expect the result of the floating-point computation to be an integer, you can round the floating-point value to the nearest integer before assigning it. It will help obtain the desired answer in all cases where the error does not accumulate to a value larger than 1/2:
int b = std::nearbyint(log2(8.0));
To conclude with a straightforward answer to the question the the title: yes, you should worry about accuracy when using floating-point functions for the purpose of producing an integral end-result. These functions do not come even with the guarantees that basic operations come with.
Unfortunately the default conversion from a floating point number to integer in C++ is really crazy as it works by dropping the decimal part.
This is bad for two reasons:
a floating point number really really close to a positive integer, but below it will be converted to the previous integer instead (e.g. 3-1×10-10 = 2.9999999999 will be converted to 2)
a floating point number really really close to a negative integer, but above it will be converted to the next integer instead (e.g. -3+1×10-10 = -2.9999999999 will be converted to -2)
The combination of (1) and (2) means also that using int(x + 0.5) will not work reasonably as it will round negative numbers up.
There is a reasonable round function, but unfortunately returns another floating point number, thus you need to write int(round(x)).
When working with C99 or C++11 you can use lround(x).
Note that the only numbers that can be represented correctly in floating point are quotients where the denominator is an integral power of 2.
For example 1/65536 = 0.0000152587890625 can be represented correctly, but even just 0.1 is impossible to represent correctly and thus any computation involving that quantity will be approximated.
Of course when using 0.1 approximations can cancel out leaving a correct result occasionally, but even just adding ten times 0.1 will not give 1.0 as result when doing the computation using IEEE754 double-precision floating point numbers.
Even worse the compilers are allowed to use higher precision for intermediate results. This means that adding 10 times 0.1 may give back 1 when converted to an integer if the compiler decides to use higher accuracy and round to closest double at the end.
This is "worse" because despite being the precision higher the results are compiler and compiler options dependent, making reasoning about the computations harder and making the exact result non portable among different systems (even if they use the same precision and format).
Most compilers have special options to avoid this specific problem.

Should I combine multiplication and division steps when working with floating point values?

I am aware of the precision problems in floats and doubles, which why I am asking this:
If I have a formula such as: (a/PI)*180.0 (where PI is a constant)
Should I combine the division and multiplication, so I can use only one division: a/0.017453292519943295769236, in order to avoid loss of precision ?
Does this make it more precise when it has less steps to calculate the result?
Short answer
Yes, you should in general combine as many multiplications and divisions by constants as possible into one operation. It is (in general(*)) faster and more accurate at the same time.
Neither π nor π/180 nor their inverses are representable exactly as floating-point. For this reason, the computation will involve at least one approximate constant (in addition to the approximation of each of the operations involved).
Because two operations introduce one approximation each, it can be expected to be more accurate to do the whole computation in one operation.
In the case at hand, is division or multiplication better?
Apart from that, it is a question of “luck” whether the relative accuracy to which π/180 can be represented in the floating-point format is better or worse than that of 180/π.
My compiler provides addition precision with the long double type, so I am able to use it as reference for answering this question for double:
~ $ cat t.c
#define PIL 3.141592653589793238462643383279502884197L
#include <stdio.h>
int main() {
long double heop = 180.L / PIL;
long double pohe = PIL / 180.L;
printf("relative acc. of π/180: %Le\n", (pohe - (double) pohe) / pohe);
printf("relative acc. of 180/π: %Le\n", (heop - (double) heop) / heop);
}
~ $ gcc t.c && ./a.out
relative acc. of π/180: 1.688893e-17
relative acc. of 180/π: -3.469703e-17
In usual programming practice, one wouldn't bother and simply multiply by (the floating-point representation of) 180/π, because multiplication is so much faster than division.
As it turns out, in the case of the binary64 floating-point type double almost always maps to, π/180 can be represented with better relative accuracy than 180/π, so π/180 is the constant one should use to optimize accuracy: a / ((double) (π / 180)). With this formula, the total relative error would be approximately the sum of the relative error of the constant (1.688893e-17) and of the relative error of the division (which will depend on the value of a but never be more than 2-53).
Alternative methods for faster and more accurate results
Note that division is so expensive that you could get an even more accurate result faster by using one multiplication and one fma: let heop1 be the best double approximation of 180/π, and heop2 the best double approximation of 180/π - heop1. Then the best value for the result can be computed as:
double r = fma(a, heop1, a * heop2);
The fact that the above is the absolute best possible double approximation to the real computation is a theorem (in fact, it is a theorem with exceptions. The details can be found in the “Handbook of Floating-Point Arithmetic”). But even when the real constant you want to multiply a double by in order to get a double result is one of the exceptions to the theorem, the above computation is still clearly very accurate and only differs from the best double approximation for a few exceptional values of a.
If, like mine, your compiler provides more precision for long double than for double, you can also use one long double multiplication:
// this is more accurate than double division:
double r = (double)((long double) a * 57.295779513082320876798L)
This is not as good as the solution based on fma, but it is good enough that for most values of a, it produces the optimal double approximation to the real computation.
A counter-example to the general claim that operations should be grouped as one
(*) The claim that it is better to group constant is only statistically true for most constants.
If you happened to wish to multiply a by, say, the real constant 0.0000001 * DBL_MIN, you would be better off multiplying first by 0.0000001, then by DBL_MIN, and the end result (which can be a normalized number if a is larger than 1000000 or so) would be more precise than if you had multiplied by the best double representation of 0.0000001 * DBL_MIN. This is because the relative accuracy when representing 0.0000001 * DBL_MIN as a single double value is much worse than the accuracy for representing 0.0000001.

How to correctly normalize a floating point value in C++?

Maybe I don't understand the IEEE754 standard that much, but given a set of floating point values that are float or double, for example :
56.543f 3238.124124f 121.3f ...
you are able to convert them in values ranging from 0 to 1, so you normalize them, by taking an appropriate common factor while considering what is the maximum value and the minimum value in the set.
Now my point is that in this transformation I need a much higher precision for the set of destination that ranges from 0 to 1 if compared to the level of precision that I need in the first one, especially if the values in the first set are covering a wide range of numerical values ( really big and really small values ).
How the float or the double ( or the IEEE 754 standard if you want ) type can handle this situation while providing more precision for the second set of values knowing that I will basically not need an integer part ?
Or it doesn't handle this at all and I need fixed point math with a totally different type ?
Floating point numbers are stored in a format similar to scientific notation. Internally, they align the leading 1 of the binary representation to the top of the significand. Each value is carried with the same number of binary digits of precision relative to its own magnitude.
When you compress your set of floating point values to the range 0..1, the only precision loss you will get will be due to the rounding that occurs in the various steps of the process.
If you're merely compressing by scaling, you will lose only a small amount of precision near the LSBs of the mantissa (around 1 or 2 ulp, where ulp means "units of the last place).
If you also need to shift your data, then things get trickier. If your data is all positive, then subtracting off the smallest number will not damage anything. But, if your data is a mixture of positive and negative data, then some of your values near zero may suffer a loss in precision.
If you do all the arithmetic at double precision, you'll carry 53 bits of precision through the calculation. If your precision needs fit within that (which likely they do), then you'll be fine. Otherwise, the exact numerical performance will depend on the distribution of your data.
Single and double IEEE floats have a format where the exponent and fraction parts have fixed bit-width. So this is not possible (i.e. you will always have unused bits if you only store values between 0 and 1). (See: http://en.wikipedia.org/wiki/Single-precision_floating-point_format)
Are you sure the 52-bit wide fraction part of a double is not precise enough?
Edit: If you use the whole range of the floating format, you will lose precision when normalizing the values. The roundings can be off and enough small values will become 0. Unless you know that this is a problem, don't worry. Otherwise you have to look up some other solution as mentioned in other answers.
Having binary floating point values (with an implicit leading one) expressed as
(1+fraction) * 2^exponent where fraction < 1
A division a/b is:
a/b = (1+fraction(a)) / (1+fraction(b)) * 2^(exponent(a) - exponent(b))
Hence division/multiplication has essentially no loss of precision.
A subtraction a-b is:
a-b = (1+fraction(a)) * 2^(exponent(a) - (1+fraction(b)) * exponent(b))
Hence a subtraction/addition might have a loss of precision (big - tiny == big) !
Clamping a value x in a range [min, max] to [0, 1]
(x - min) / (max - min)
will have precision issues if any subtraction has a loss of precision.
Answering your question:
Nothing is, choose a suitable representation (floating point, fraction, multi precision ...) for your algorithms and expected data.
If you have a selection of doubles and you normalize them to between 0.0 and 1.0, there are a number of sources of precision loss. They are all, however, much smaller than you suspect.
First, you will lose some precision in the arithmetic operations required to normalize them as rounding occurs. This is relatively small -- a bit or so per operation -- and usually relatively random.
Second, the exponent component will no longer be using the positive exponent possibility.
Third, as all the values are positive, the sign bit will also be wasted.
Forth, if the input space does not include +inf or -inf or +NaN or -NaN or the like, those code points will also be wasted.
But, for the most part, you'll waste about 3 bits of information in a 64 bit double in your normalization, one of which being the kind of thing that is nearly unavoidable when you deal with finite-bit-width values.
Any 64 bit fixed point representation of the values from 0 to 1 will have far less "range" than doubles. A double can represent something on the order of 10^-300, while a 64 bit fixed point representation that includes 1.0 can only go as low as 10^-19 or so. (The 64 bit fixed point representation can represent 1 - 10^-19 as being distinct from 1, while the double cannot, but the 64 bit fixed point value can not represent anything smaller than 2^-64, while doubles can).
Some of the numbers above are approximate, and may depend on rounding/exact format.
For higher precision you can try http://www.boost.org/doc/libs/1_55_0/libs/multiprecision/doc/html/boost_multiprecision/tut/floats.html.
Note also, that for the numerical critical operations +,- there are special algorithms that minimize the numerical error introduced by the algorithm:
http://en.wikipedia.org/wiki/Kahan_summation_algorithm

C++ Cosine Problem

I have the following code using C++:
double value = .3;
double result = cos(value);
When I look at the values in the locals window for "value" it shows 0.2999999999
Then, when I get the value of "result" I get: 0.95533648912560598
However, when I run cos(.3) on the computers calculator I get: .9999862922474
So clearly there is something that I am doing wrong.
Any thoughts on what might be causing the difference in results?
I am running Win XP on an Intel processor.
Thanks
The difference in results is because:
Your computer's calculator is returning the cosine of an angle specified in degrees.
The C++ cos() function is returning cosine of an angle specified in radians.
The .2999999999 is due to the way floating point numbers are handled in computers. .3 cannot be represented exactly in a double. For details, I recommend reading What Every Computer Scientist Should Know about Floating Point Arithmetic.
cos(.3 radians) = 0.95533...
cos(.3 degrees) = 0.99998...
cos(0.3) = 0.99998629224742679269138848004408 using degrees
cos(0.3) = 0,95533648912560601964231022756805 using radians
When I look at the values in the locals window for "value" it shows 0.2999999999
Long story short, your calculator uses decimal arithmetic, while your C++ code uses binary arithmetic (double is a binary floating-point number). Decimal number 0.3 cannot be represented exactly as a binary floating-point number. Read What Every Computer Scientist Should Know About Floating-Point Arithmetic, that will explain all implications in more detail.
Your calculator is using degrees. For example:
>>> import math
>>> math.cos (.3)
0.95533648912560598
>>> math.cos (.3 * math.pi / 180) # convert to degrees
0.99998629224742674
C++ does not exactly represent floating point numbers due to the insane amount of storage that would be required to get the infinite precision necessary. For a demonstration of this, try the following:
double ninth = 1.0/9.0;
double result = 9.0 * ninth;
This should yield a value in result of .99999999999
So, in essence, you need to compare floating point values within a small epsilon (I tend to use 1e-7). You can do a strict bit-by-bit comparison, but this consists of converting the memory used by the floating point to an array of characters of length sizeof(float), then comparing the characters.
Another thing to check would be whether or not you are using degrees. The computer's calculator uses degrees for its cosine calculation (notice how the result from the calculator is .99999..., which is very close to 1. The cosine of zero is 1 exactly), whereas the cosine function offered in <math> is in radians. Try multiplying your value by PI/180.0 and seeing if the result is more inline with your expectations.