Fast approximate float division - c++

On modern processors, float division is a good order of magnitude slower than float multiplication (when measured by reciprocal throughput).
I'm wondering if there are any algorithms out there for computing a fast approximation to x/y, given certain assumptions and tolerance levels. For example, if you assume that 0<x<y, and are willing to accept any output that is within 10% of the true value, are there algorithms faster than the built-in FDIV operation?

I hope that this helps, because this is probably as close as you're going to get to what you are looking for.
__inline__ double __attribute__((const)) divide( double y, double x ) {
// calculates y/x
union {
double dbl;
unsigned long long ull;
} u;
u.dbl = x; // x = x
u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> (unsigned char)1;
// pow( x, -0.5 )
u.dbl *= u.dbl; // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0/x
return u.dbl * y; // (1.0/x) * y = y/x
}
See also:
Another post about reciprocal approximation.
The Wikipedia page.

FDIV is usually much slower than FMUL because it can't be pipelined like multiplication and requires multiple clock cycles for the hardware's iterative convergence process.
The easiest way is to recognize that division is nothing more than the multiplication of the dividend y by the inverse of the divisor x. The less straightforward part is remembering that a float value x = m * 2^e has the inverse x^-1 = (1/m)*2^(-e) = (2/m)*2^(-e-1) = p * 2^q, and approximating this new mantissa as p = 2/m ≈ 3-m, for 1<=m<2. This gives a rough piecewise-linear approximation of the inverse function; however, we can do a lot better by using Newton's iterative root-finding method to improve that approximation.
Let w = f(x) = 1/x. The inverse of this function is found by solving for x in terms of w: x = f^(-1)(w) = 1/w. To improve the output with the root-finding method we must first create a function whose zero is the desired output, i.e. g(w) = 1/w - x, with derivative d/dw(g(w)) = -1/w^2.
w[n+1]= w[n] - g(w[n])/g'(w[n]) = w[n] + w[n]^2 * (1/w[n] - x) = w[n] * (2 - x*w[n])
w[n+1] = w[n] * (2 - x*w[n]), when w[n]=1/x, w[n+1]=1/x*(2-x*1/x)=1/x
Putting these pieces together gives the final piece of code:
float inv_fast(float x) {
union { float f; uint32_t i; } v; // uint32_t requires <stdint.h>
float w, sx;
sx = (x < 0) ? -1:1;
x = sx * x;
v.f = x;
v.i = 0x7EF127EA - v.i; // initial guess from the bit pattern of |x|
w = x * v.f;
// Efficient Iterative Approximation Improvement in horner polynomial form.
v.f = v.f * (2 - w); // Single iteration, Err = -3.36e-3 * 2^(-flr(log2(x)))
// v.f = v.f * ( 4 + w * (-6 + w * (4 - w))); // Second iteration, Err = -1.13e-5 * 2^(-flr(log2(x)))
// v.f = v.f * (8 + w * (-28 + w * (56 + w * (-70 + w *(56 + w * (-28 + w * (8 - w))))))); // Third Iteration, Err = +-6.8e-8 * 2^(-flr(log2(x)))
return v.f * sx;
}
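The convergence of the Newton step derived above, w[n+1] = w[n] * (2 - x*w[n]), can be checked in isolation. The following is a minimal sketch, not the answer's code: it assumes x has already been reduced to [1, 2), starts from the linear guess (3 - x)/2 that follows from approximating 2/m by 3 - m, and uses function names made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// One Newton step for g(w) = 1/w - x; uses only a multiply and a subtract.
double newton_recip_step(double x, double w) {
    return w * (2.0 - x * w);
}

// Hypothetical helper: approximate 1/x for x in [1, 2) with `steps` refinements.
double approx_recip(double x, int steps) {
    assert(x >= 1.0 && x < 2.0 && steps >= 0);
    double w = 0.5 * (3.0 - x);  // linear initial guess, relative error at most 12.5%
    for (int i = 0; i < steps; ++i) {
        w = newton_recip_step(x, w);
    }
    return w;
}

// Relative error 1 - x*w; each Newton step squares this quantity.
double recip_rel_error(double x, int steps) {
    return 1.0 - x * approx_recip(x, steps);
}
```

Because the relative error e = 1 - x*w satisfies e[n+1] = e[n]^2, the worst-case 12.5% of the starting guess drops to about 1.6% after one step and about 0.02% after two.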

Related

Efficient floating point scaling in C++

I'm working on my fast (and accurate) sin implementation in C++, and I have a problem regarding the efficient angle scaling into the +- pi/2 range.
My sin function for +-pi/2 using Taylor series is the following
(Note: FLOAT is a macro expanded to float or double just for the benchmark)
/**
* Sin for 'small' angles, accurate on [-pi/2, pi/2], fairly accurate on [-pi, pi]
*/
// To switch between float and double
#define FLOAT float
FLOAT
my_sin_small(FLOAT x)
{
constexpr FLOAT C1 = 1. / (7. * 6. * 5. * 4. * 3. * 2.);
constexpr FLOAT C2 = -1. / (5. * 4. * 3. * 2.);
constexpr FLOAT C3 = 1. / (3. * 2.);
constexpr FLOAT C4 = -1.;
// Correction for sin(pi/2) = 1, due to the ignored taylor terms
constexpr FLOAT corr = -1. / 0.9998431013994987;
const FLOAT x2 = x * x;
return corr * x * (x2 * (x2 * (x2 * C1 + C2) + C3) + C4);
}
So far so good... The problem comes when I try to scale an arbitrary angle into the +-pi/2 range. My current solution is:
FLOAT
my_sin(FLOAT x)
{
constexpr FLOAT pi = 3.141592653589793238462;
constexpr FLOAT rpi = 1 / pi;
// convert to +-pi/2 range
int n = std::nearbyint(x * rpi);
FLOAT xbar = (n * pi - x) * (2 * (n & 1) - 1);
// (2 * (n & 1) - 1) is a sign correction (see below)
return my_sin_small(xbar);
}
I made a benchmark, and I'm losing a lot for the +-pi/2 scaling.
The trick of using int(angle/pi + 0.5) is out, since it is limited to int precision and also requires branching on the sign, and I try to avoid branches...
What should I try to improve the performance for this scaling? I'm out of ideas.
[Figure: benchmark results for float. In the benchmark the angle could be outside the validity range of my_sin_small, but for the benchmark I don't care about that.]
[Figure: benchmark results for double.]
[Figure: sign correction for xbar in my_sin().]
[Figure: algorithm accuracy compared to the Python sin() function.]
Candidate improvements
Convert the radians x to rotations by dividing by 2*pi.
Retain only the fraction so we have an angle (-1.0 ... 1.0). This simplifies the OP's modulo step to a simple "drop the whole number" step instead. Going forward with different angle units simply involves a co-efficient set change. No need to scale back to radians.
For positive values, subtract 0.5 so we have (-0.5 ... 0.5) and then flip the sign. This centers the possible values about 0.0 and makes for better convergence of the approximating polynomial as compared to the math sine function. For negative values - see below.
Call my_sin_small1() that uses this (-0.5 ... 0.5) rotations range rather than [-pi ... +pi] radians.
In my_sin_small1(), fold constants together to drop the corr * step.
Rather than use the truncated Taylor's series, use a more optimal set. IMO, this will provide better answers, especially near +/-pi.
Notes: No int to/from float code. With more analysis, it is possible to get a better set of coefficients that brings my_sin(+/-pi) closer to 0.0. This is just a quick set of code to demo fewer FP steps and good potential results.
C like code for OP to port to C++
FLOAT my_sin_small1(FLOAT x) {
static const FLOAT A1 = -5.64744881E+01;
static const FLOAT A2 = +7.81017968E+01;
static const FLOAT A3 = -4.11145353E+01;
static const FLOAT A4 = +6.27923581E+00;
const FLOAT x2 = x * x;
return x * (x2 * (x2 * (x2 * A1 + A2) + A3) + A4);
}
FLOAT my_sin1(FLOAT x) {
static const FLOAT pi = 3.141592653589793238462;
static const FLOAT pi2i = 1/(pi * 2);
x *= pi2i;
FLOAT xfraction = 0.5f - (x - truncf(x));
return my_sin_small1(xfraction);
}
For negative values, use -my_sin1(-x) or like code to flip the sign - or add 0.5 in the above minus 0.5 step.
Test
#include <math.h>
#include <stdio.h>
int main(void) {
for (int d = 0; d <= 360; d += 20) {
FLOAT x = d / 180.0 * M_PI;
FLOAT y = my_sin1(x);
printf("%12.6f %11.8f %11.8f\n", x, sin(x), y);
}
}
Output
0.000000 0.00000000 -0.00022483
0.349066 0.34202013 0.34221691
0.698132 0.64278759 0.64255589
1.047198 0.86602542 0.86590189
1.396263 0.98480775 0.98496443
1.745329 0.98480775 0.98501128
2.094395 0.86602537 0.86603642
2.443461 0.64278762 0.64260530
2.792527 0.34202022 0.34183803
3.141593 -0.00000009 0.00000000
3.490659 -0.34202016 -0.34183764
3.839724 -0.64278757 -0.64260519
4.188790 -0.86602546 -0.86603653
4.537856 -0.98480776 -0.98501128
4.886922 -0.98480776 -0.98496443
5.235988 -0.86602545 -0.86590189
5.585053 -0.64278773 -0.64255613
5.934119 -0.34202036 -0.34221727
6.283185 0.00000017 -0.00022483
Alternate code below makes for better results near 0.0, yet might cost a tad more time. OP seems more inclined to speed.
FLOAT xfraction = 0.5f - (x - truncf(x));
// vs.
FLOAT xfraction = x - truncf(x);
if (xfraction >= 0.5f) xfraction -= 1.0f;
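For the negative-value caveat noted earlier in this answer, one option is to take the fractional part with floor instead of truncf, so that both signs go through the same path. This is a minimal sketch under that assumption, written in double for clarity. The name reduce_to_rotations is hypothetical, and the result is meant to feed a polynomial that approximates sin(2*pi*u), as my_sin_small1 does.

```cpp
#include <cassert>
#include <cmath>

// Reduce an angle in radians to u in [-0.5, 0.5] rotations such that
// sin(x) == sin(2*pi*u). floor() makes negative x work without a branch.
double reduce_to_rotations(double x) {
    assert(std::isfinite(x));
    const double two_pi = 2.0 * std::acos(-1.0);
    const double t = x / two_pi;         // radians -> rotations
    const double f = t - std::floor(t);  // fractional part, 0 <= f <= 1
    return 0.5 - f;                      // centered about zero
}
```

The identity behind it is sin(2*pi*f) = sin(pi - 2*pi*u) = sin(2*pi*u) for u = 0.5 - f. Whether floor is cheaper than a sign test depends on the target, so it is worth benchmarking both.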
[Edit]
Below is a better set with about 10% reduced error.
-56.0833765f
77.92947047f
-41.0936875f
6.278635918f
Yet another approach:
Spend more time (code) to reduce the range to ±pi/4 (±45 degrees); then it is possible to use only 3 or 2 terms of a polynomial that resembles the usual Taylor series.
float sin_quick_small(float x) {
const float x2 = x * x;
#if 0
// max error about 7e-7
static const FLOAT A2 = +0.00811656036940792f;
static const FLOAT A3 = -0.166597759850666f;
static const FLOAT A4 = +0.999994132743861f;
return x * (x2 * (x2 * A2 + A3) + A4);
#else
// max error about 0.00016
static const FLOAT A3 = -0.160343346851626f;
static const FLOAT A4 = +0.999031566686144f;
return x * (x2 * A3 + A4);
#endif
}
float cos_quick_small(float x) {
return cosf(x); // TBD code.
}
float sin_quick(float x) {
if (x < 0.0) {
return -sin_quick(-x);
}
int quo;
float x90 = remquof(fabsf(x), 3.141592653589793238462f / 2, &quo);
switch (quo % 4) {
case 0:
return sin_quick_small(x90);
case 1:
return cos_quick_small(x90);
case 2:
return sin_quick_small(-x90);
case 3:
return -cos_quick_small(x90);
}
return 0.0;
}
int main() {
float max_x = 0.0;
float max_error = 0.0;
for (int d = -45; d <= 45; d += 1) {
FLOAT x = d / 180.0 * M_PI;
FLOAT y = sin_quick(x);
double err = fabs(y - sin(x));
if (err > max_error) {
max_x = x;
max_error = err;
}
printf("%12.6f %11.8f %11.8f err:%11.8f\n", x, sin(x), y, err);
}
printf("x:%.6f err:%.6f\n", max_x, max_error);
return 0;
}
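cos_quick_small is left as TBD in the code above. As a stand-in, here is a minimal sketch that uses the truncated Taylor series for cosine. This is an assumption for illustration, not the answerer's coefficients; an optimized (minimax-style) set like the one used for sin_quick_small would likely do better for the same number of terms.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the TBD cos_quick_small: 1 - x^2/2 + x^4/24 in Horner form.
// Assumption: |x| <= pi/4, where the first omitted term x^6/720 is below 3.3e-4.
float cos_quick_small_taylor(float x) {
    assert(std::fabs(x) <= 0.7854f);
    const float x2 = x * x;
    return 1.0f + x2 * (-0.5f + x2 * (1.0f / 24.0f));
}
```

The error bound comes from the alternating series: on this range the truncation error is no larger than the first dropped term.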

Representation of Fourier series depends on tabulation points

Well, I had a task to create a function that computes the Fourier series of some mathematical function. I found all the formulas, but the main problem is that when I change the number of points used to draw the series on some interval, I get a very strange artifact:
This is the Fourier series of sin(x) on the interval (-3.14; 3.14) with 100 tabulation points
And this is the same function on the same interval but with 100000 tabulation points
Code for the Fourier series coefficients:
void fourieSeriesDecompose(std::function<double(double)> func, double period, long int iterations, double *&aParams, double *&bParams){
aParams = new double[iterations];
aParams[0] = integrateRiemans(func, 0, period, 1000);
for(int i = 1; i < iterations; i++){
auto sineFunc = [&](double x) -> double { return 2 * (func(x) * cos((2 * x * i * M_PI) / period)); };
aParams[i] = integrateRiemans(sineFunc, -period / 2, period / 2, 1000) / period;
}
bParams = new double[iterations];
for(int i = 1; i < iterations; i++){
auto sineFunc = [&](double x) -> double { return 2 * (func(x) * sin(2 * (x * (i + 1) * M_PI) / period)); };
bParams[i] = integrateRiemans(sineFunc, -period / 2, period / 2, 1000) / period;
}
}
This is the code I use to reproduce the function from the computed coefficients:
double fourieSeriesCompose(double x, double period, long iterations, double *aParams, double *bParams){
double y = aParams[0];
for(int i = 1; i < iterations; i++){
y += sqrt(aParams[i] * aParams[i] + bParams[i] * bParams[i]) * cos((2 * i * x * M_PI) / period - atan(bParams[i] / aParams[i]));
}
return y;
}
And the runner code
double period = M_PI * 2;
auto startFunc = [](double x) -> double{ return sin(x); };
fourieSeriesDecompose(*startFunc, period, 1000, aCoeficients, bCoeficients);
auto readyFunc = [&](double x) -> double{ return fourieSeriesCompose(x, period, 1000, aCoeficients, bCoeficients); };
tabulateFunc(readyFunc);
scaleFunc();
//Draw methods after this
see:
How to compute Discrete Fourier Transform?
So if I deciphered it correctly, aParams and bParams represent the real and imaginary parts of the result. In that case the angles in sin and cos must be the same, but yours differ! You have this:
auto sineFunc = [&](double x) -> double { return 2*(func(x)*cos((2* x* i *M_PI)/period));
auto sineFunc = [&](double x) -> double { return 2*(func(x)*sin( 2*(x*(i+1)*M_PI)/period));
As you can see, it's not the same angle. Also, what is period? You have iterations! If it is the period of the function you want to transform, then it should be applied to the function and not to the kernel ... Also, what does integrateRiemans do? Is it the nested for loop that integrates the Fourier transform? Btw. I hope that func is in the real domain; otherwise the integration/summation needs both the real and imaginary parts, not just one ...
So what you should do is:
create a (cplx) table of the func(x) data on the interval you want, with iterations samples
so a for loop where x = x0+i*(x1-x0)/(iterations-1) and x0,x1 is the range over which you want to sample func. Let's call it f[i]
for (i=0;i<iterations;i++) f[i]=func(x0+i*(x1-x0)/(iterations-1));
Fourier transform it
something like this:
for (i=0;i<iterations;i++) a[i]=b[i]=0;
for (j=0;j<iterations;j++)
for (i=0;i<iterations;i++)
{
a[j]+=f[i]*cos(-2.0*M_PI*i*j/iterations);
b[j]+=f[i]*sin(-2.0*M_PI*i*j/iterations);
}
now a[],b[] should hold your slow DFT result ... beware integer rounding ... depending on the compiler you might need to cast some values to double to avoid integer rounding.
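To make the "same angle in sin and cos" point concrete for the series form used in the question, here is a minimal sketch that computes a[n] and b[n] with one shared angle per harmonic and then recomposes the function. It is an illustration under stated assumptions, not the asker's code: the names fourier_coeffs and fourier_eval are made up, the integral is a plain midpoint rule over [-period/2, period/2], and samples is assumed to be larger than twice the highest harmonic.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

struct FourierCoeffs {
    std::vector<double> a;  // cosine coefficients, a[0] is the mean value
    std::vector<double> b;  // sine coefficients, b[0] is unused and stays 0
};

// Midpoint-rule estimate of the Fourier coefficients of f on [-period/2, period/2].
FourierCoeffs fourier_coeffs(const std::function<double(double)>& f,
                             double period, int harmonics, int samples) {
    assert(period > 0.0 && harmonics >= 0 && samples > 2 * harmonics);
    const double pi = std::acos(-1.0);
    const double dx = period / samples;
    FourierCoeffs c;
    c.a.assign(harmonics + 1, 0.0);
    c.b.assign(harmonics + 1, 0.0);
    for (int n = 0; n <= harmonics; ++n) {
        double sum_cos = 0.0, sum_sin = 0.0;
        for (int k = 0; k < samples; ++k) {
            const double x = -period / 2 + (k + 0.5) * dx;
            const double angle = 2.0 * pi * n * x / period;  // same angle for both
            sum_cos += f(x) * std::cos(angle);
            sum_sin += f(x) * std::sin(angle);
        }
        c.a[n] = 2.0 * sum_cos * dx / period;
        c.b[n] = 2.0 * sum_sin * dx / period;
    }
    c.a[0] *= 0.5;  // the constant term is the plain average
    return c;
}

// Evaluate a[0] + sum over n of a[n]*cos(n*w*x) + b[n]*sin(n*w*x), w = 2*pi/period.
double fourier_eval(const FourierCoeffs& c, double x, double period) {
    const double pi = std::acos(-1.0);
    double y = c.a[0];
    for (std::size_t n = 1; n < c.a.size(); ++n) {
        const double angle = 2.0 * pi * n * x / period;
        y += c.a[n] * std::cos(angle) + c.b[n] * std::sin(angle);
    }
    return y;
}
```

For func = sin(x) with period 2*pi, this should give b[1] close to 1 and every other coefficient close to 0. Summing a[n]*cos + b[n]*sin directly also avoids the atan(b/a) in the question's compose step, which divides by a[n] even when it is zero.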

Coefficients in numerical calculations of exp() function

I am trying to understand the implementation of exp_ps() from http://gruntthepeon.free.fr/ssemath/sse_mathfun.h or exp256_ps() from http://software-lisc.fbk.eu/avx_mathfun/avx_mathfun.h.
I understand almost everything in the calculation, except for how the constant cephes_exp_C2 is determined. It seems to increase the accuracy of the calculation. If it is removed from the calculation, the resulting function is significantly faster and slightly less precise (relative error is still under 1% for values around +/- 10). I found such coefficients in other numerical libraries, but without a closer explanation.
After a bit of searching through the Cephes source, I think it's an error in Pommier's translation. This is not the first time I have seen errors in Pommier's code. I recommend using the math library in Gromacs.
From exp.c in Cephes,
static double C1 = 6.93145751953125E-1;
static double C2 = 1.42860682030941723212E-6;
....
px = floor( LOG2E * x + 0.5 );
n = px;
x -= px * C1;
x -= px * C2;
From Pommier,
_PS_CONST(cephes_exp_C1, 0.693359375);
_PS_CONST(cephes_exp_C2, -2.12194440e-4); <-- Wrong value
....
//
// fx = LOG2E * x + 0.5
//
fx = _mm_mul_ps(x, *(v4sf*)_ps_cephes_LOG2EF);
fx = _mm_add_ps(fx, *(v4sf*)_ps_0p5);
//
// fx = floor(fx)
//
emm0 = _mm_cvttps_epi32(fx);
tmp = _mm_cvtepi32_ps(emm0);
v4sf mask = _mm_cmpgt_ps(tmp, fx);
mask = _mm_and_ps(mask, one);
fx = _mm_sub_ps(tmp, mask);
//
// x -= fx * C1;
// x -= fx * C2; (Using z allows for better ILP in this step)
//
tmp = _mm_mul_ps(fx, *(v4sf*)_ps_cephes_exp_C1);
v4sf z = _mm_mul_ps(fx, *(v4sf*)_ps_cephes_exp_C2);
x = _mm_sub_ps(x, tmp);
x = _mm_sub_ps(x, z);
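As a numerical observation on the two listings above: in both of them C1 + C2 adds up to ln(2) to within rounding (0.693145751953125 + 1.42860682030941723212E-6 for the double version, 0.693359375 - 2.12194440e-4 for the single-precision one). That is consistent with a two-constant argument reduction, often called Cody-Waite reduction, where C1 carries only a few significant bits so that fx * C1 is exact and C2 supplies the remaining digits. The scalar sketch below illustrates that reading; the function name and the input-range assert are assumptions for illustration, not code from either library.

```cpp
#include <cassert>
#include <cmath>

// Single-precision split quoted in the Pommier listing: kC1 + kC2 ~= ln(2).
// kC1 = 355/512 has few significant bits, so fn * kC1 is exact for small integers fn.
const float kC1 = 0.693359375f;
const float kC2 = -2.12194440e-4f;

// Hypothetical helper: reduce x to r = x - n*ln(2) with n = round(x / ln 2).
float reduce_exp_arg(float x, int& n) {
    assert(std::fabs(x) < 80.0f);  // assumed input range for this sketch
    const float log2e = 1.44269504088896341f;
    const float fn = std::floor(x * log2e + 0.5f);
    n = static_cast<int>(fn);
    float r = x - fn * kC1;  // large part of n*ln(2); the product is exact for small fn
    r -= fn * kC2;           // small correction supplies the remaining digits of ln(2)
    return r;
}
```

Dropping the second subtraction leaves r off by fn * kC2, which grows with |x|; that would match the "slightly less precise" behaviour described in the question.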

Fast Inverse Square Root on x64

I found Fast Inverse Square Root on the net at http://en.wikipedia.org/wiki/Fast_inverse_square_root . Does it work properly on x64?
Has anyone used it and tested it seriously?
Originally Fast Inverse Square Root was written for a 32-bit float, so as long as you operate on IEEE-754 floating point representation, there is no way x64 architecture will affect the result.
Note that for "double" precision floating point (64-bit) you should use another constant:
...the "magic number" for 64 bit IEEE754 size type double ... was shown to be exactly 0x5fe6eb50c7b537a9
Here is an implementation for double precision floats:
#include <cstdint>
double invsqrtQuake( double number )
{
double y = number;
double x2 = y * 0.5;
std::int64_t i = *(std::int64_t *) &y;
// The magic number for doubles is from https://cs.uwaterloo.ca/~m32rober/rsqrt.pdf
i = 0x5fe6eb50c7b537a9 - (i >> 1);
y = *(double *) &i;
y = y * (1.5 - (x2 * y * y)); // 1st iteration
// y = y * ( 1.5 - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
I did a few tests and it seems to work fine
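The pointer casts in the code above technically violate C++ strict-aliasing rules, even though they work on common compilers. A hedged variant, assuming only that double is a 64-bit IEEE-754 type, copies the bits with std::memcpy instead; the function name is made up for this sketch and the magic constant is the one quoted above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Same algorithm as invsqrtQuake, with memcpy used for the bit reinterpretation.
double inv_sqrt_memcpy(double number) {
    static_assert(sizeof(double) == sizeof(std::int64_t), "expects 64-bit double");
    assert(number > 0.0 && std::isfinite(number));
    std::int64_t i;
    std::memcpy(&i, &number, sizeof i);   // read the bit pattern of the double
    i = 0x5fe6eb50c7b537a9 - (i >> 1);    // initial guess from the exponent trick
    double y;
    std::memcpy(&y, &i, sizeof y);        // write the bits back as a double
    const double x2 = number * 0.5;
    y = y * (1.5 - x2 * y * y);           // one Newton iteration
    return y;
}
```

Compilers typically turn each memcpy into a single register move, so this form should cost nothing extra.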
Yes, it works if using the correct magic number and corresponding integer type. In addition to the answers above, here's a C++11 implementation that works for both double and float. Conditionals should optimise out at compile time.
template <typename T, char iterations = 2> inline T inv_sqrt(T x) {
static_assert(std::is_floating_point<T>::value, "T must be floating point");
static_assert(iterations == 1 or iterations == 2, "iterations must equal 1 or 2");
typedef typename std::conditional<sizeof(T) == 8, std::int64_t, std::int32_t>::type Tint;
T y = x;
T x2 = y * 0.5;
Tint i = *(Tint *)&y;
i = (sizeof(T) == 8 ? 0x5fe6eb50c7b537a9 : 0x5f3759df) - (i >> 1);
y = *(T *)&i;
y = y * (1.5 - (x2 * y * y));
if (iterations == 2)
y = y * (1.5 - (x2 * y * y));
return y;
}
As for testing, I use the following doctest in my project:
#ifdef DOCTEST_LIBRARY_INCLUDED
TEST_CASE_TEMPLATE("inv_sqrt", T, double, float) {
std::vector<T> vals = {0.23, 3.3, 10.2, 100.45, 512.06};
for (auto x : vals)
CHECK(inv_sqrt<T>(x) == doctest::Approx(1.0 / std::sqrt(x)));
}
#endif

Fast Arc Cos algorithm?

I have my own, very fast cos function:
float sine(float x)
{
const float B = 4/pi;
const float C = -4/(pi*pi);
float y = B * x + C * x * abs(x);
// const float Q = 0.775;
const float P = 0.225;
y = P * (y * abs(y) - y) + y; // Q * y + P * y * abs(y)
return y;
}
float cosine(float x)
{
return sine(x + (pi / 2));
}
But now when I profile, I see that acos() is killing the processor. I don't need intense precision. What is a fast way to calculate acos(x)?
Thanks.
A simple cubic approximation, the Lagrange polynomial for x ∈ {-1, -½, 0, ½, 1}, is:
double acos(double x) {
return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}
It has a maximum error of about 0.18 rad.
Got spare memory? A lookup table (with interpolation, if required) is gonna be fastest.
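A minimal sketch of the lookup-table idea with linear interpolation follows. The table size of 1024 and all names are assumptions for illustration, and the table is filled once with std::acos on first use.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kTableSize = 1024;  // hypothetical resolution

// Table of acos at kTableSize + 1 evenly spaced points on [-1, 1], built once.
const std::array<float, kTableSize + 1>& acos_table() {
    static const std::array<float, kTableSize + 1> table = [] {
        std::array<float, kTableSize + 1> t{};
        for (std::size_t i = 0; i <= kTableSize; ++i) {
            t[i] = static_cast<float>(std::acos(-1.0 + 2.0 * i / kTableSize));
        }
        return t;
    }();
    return table;
}

// Linear interpolation between the two nearest table entries.
float acos_lut(float x) {
    assert(x >= -1.0f && x <= 1.0f);
    const auto& t = acos_table();
    const float pos = (x + 1.0f) * 0.5f * static_cast<float>(kTableSize);
    std::size_t i = static_cast<std::size_t>(pos);
    if (i >= kTableSize) i = kTableSize - 1;  // keep i + 1 in range when x == 1
    const float frac = pos - static_cast<float>(i);
    return t[i] + frac * (t[i + 1] - t[i]);
}
```

Accuracy is good in the middle of the range but degrades next to x = ±1, where the slope of acos is unbounded; with this table size the worst case in the last cell is on the order of 0.01 to 0.02 rad, so a finer or non-uniform table may be needed there.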
nVidia has some great resources that show how to approximate otherwise very expensive math functions, such as: acos
asin
atan2
etc etc...
These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:
// Absolute error <= 6.7e-5
float acos(float x) {
float negate = float(x < 0);
x = abs(x);
float ret = -0.0187293;
ret = ret * x;
ret = ret + 0.0742610;
ret = ret * x;
ret = ret - 0.2121144;
ret = ret * x;
ret = ret + 1.5707288;
ret = ret * sqrt(1.0-x);
ret = ret - 2 * negate * ret;
return negate * 3.14159265358979 + ret;
}
And here are the results for when calculating acos(0.5):
nVidia: result: 1.0471513828611643
math.h: result: 1.0471975511965976
That's pretty close! Depending on your required degree of precision, this might be a good option for you.
I have my own. It's pretty accurate and sort of fast. It works off of a theorem I built around quartic convergence. It's really interesting, and you can see the equation and how fast it can make my natural log approximation converge here: https://www.desmos.com/calculator/yb04qt8jx4
Here's my arccos code:
function acos(x)
local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
local c=0.88-0.77*x c=(c+(2-a)/c)/2
return (8*(c+(2-a)/c)-(b+(2-2*x)/b))/6
end
A lot of that is just square root approximation. It works really well, too, unless you get too close to taking a square root of 0. It has an average error (excluding x=0.99 to 1) of 0.0003. The problem, though, is that at 0.99 it starts going to shit, and at x=1, the difference in accuracy becomes 0.05. Of course, this could be solved by doing more iterations on the square roots (lol nope) or, just a little thing like, if x>0.99 then use a different set of square root linearizations, but that makes the code all long and ugly.
If you don't care about accuracy so much, you could just do one iteration per square root, which should still keep you somewhere in the range of 0.0162 or something as far as accuracy goes:
function acos(x)
local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
local c=0.88-0.77*x c=(c+(2-a)/c)/2
return 8/3*c-b/3
end
If you're okay with it, you can use pre-existing square root code. It will get rid of the equation going a bit crazy at x=1:
function acos(x)
local a = math.sqrt(2+2*x)
local b = math.sqrt(2-2*x)
local c = math.sqrt(2-a)
return 8/3*c-b/3
end
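For readers following along in C++, a port of this square-root form might look like the sketch below. It is an illustration rather than the answerer's code: the function name is made up, and the std::max guard is an added assumption to keep rounding from producing a tiny negative value under the inner square root.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// With t = acos(x): b = 2*sin(t/2) and c = 2*sin(t/4). The combination
// (8*c - b)/3 cancels the cubic error terms of the two sine expansions.
double acos_sqrt_form(double x) {
    assert(x >= -1.0 && x <= 1.0);
    const double a = std::sqrt(2.0 + 2.0 * x);           // 2*cos(t/2)
    const double b = std::sqrt(2.0 - 2.0 * x);           // 2*sin(t/2)
    const double c = std::sqrt(std::max(0.0, 2.0 - a));  // 2*sin(t/4)
    return (8.0 * c - b) / 3.0;
}
```

The error is smallest near x = 1 and grows toward x = -1, where it is a few hundredths of a radian.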
Frankly, though, if you're really pressed for time, remember that you could linearize arccos into 1.57079-1.57079x and just do:
function acos(x)
return 1.57079-1.57079*x
end
Anyway, if you want to see a list of my arccos approximation equations, you can go to https://www.desmos.com/calculator/tcaty2sv8l I know that my approximations aren't the best for certain things, but if you're doing something where my approximations would be useful, please use them, but try to give me credit.
You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.
acos(x) ≈ π/2 + (ax + bx³) / (1 + cx² + dx⁴)
where
a = -0.939115566365855
b = 0.9217841528914573
c = -1.2845906244690837
d = 0.295624144969963174
has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). Here is a plot (the inverse cosine in black, cubic polynomial approximation in red, the above function in blue) for comparison:
The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.
A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)
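A scalar sketch of the rational approximation above, using the coefficients exactly as given in the text. The function name is an assumption for illustration, and a vectorized version with a hardware reciprocal, as described in the performance note, would look different.

```cpp
#include <cassert>
#include <cmath>

// acos(x) ~= pi/2 + (a*x + b*x^3) / (1 + c*x^2 + d*x^4), coefficients from the text.
double acos_rational(double x) {
    assert(x >= -1.0 && x <= 1.0);
    const double a = -0.939115566365855;
    const double b = 0.9217841528914573;
    const double c = -1.2845906244690837;
    const double d = 0.295624144969963174;
    const double half_pi = 1.5707963267948966;
    const double x2 = x * x;
    const double num = x * (a + b * x2);         // odd polynomial in x
    const double den = 1.0 + x2 * (c + d * x2);  // even polynomial in x
    return half_pi + num / den;
}
```

Because the numerator is odd and the denominator even, the result satisfies acos(-x) = π - acos(x) by construction, so no separate sign handling is needed.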
Another approach you could take is to use complex numbers. From de Moivre's formula,
ⅈ^x = cos(π/2*x) + ⅈ*sin(π/2*x)
Let θ = π/2*x. Then x = 2θ/π, so
sin(θ) = ℑ(ⅈ^(2θ/π))
cos(θ) = ℜ(ⅈ^(2θ/π))
How can you calculate powers of ⅈ without sin and cos? Start with a precomputed table for powers of 2:
ⅈ^4 = 1
ⅈ^2 = -1
ⅈ^1 = ⅈ
ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475*ⅈ
ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898*ⅈ
ⅈ^(1/8) = 0.9807852804032304 + 0.19509032201612825*ⅈ
ⅈ^(1/16) = 0.9951847266721969 + 0.0980171403295606*ⅈ
ⅈ^(1/32) = 0.9987954562051724 + 0.049067674327418015*ⅈ
ⅈ^(1/64) = 0.9996988186962042 + 0.024541228522912288*ⅈ
ⅈ^(1/128) = 0.9999247018391445 + 0.012271538285719925*ⅈ
ⅈ^(1/256) = 0.9999811752826011 + 0.006135884649154475*ⅈ
To calculate arbitrary values of ⅈ^x, approximate the exponent as a binary fraction, and then multiply together the corresponding values from the table.
For example, to find sin and cos of 72° = 0.8π/2:
ⅈ^0.8
≈ ⅈ^(205/256)
= ⅈ^(0b0.11001101)
= ⅈ^(1/2) * ⅈ^(1/4) * ⅈ^(1/32) * ⅈ^(1/64) * ⅈ^(1/256)
= 0.3078496400415349 + 0.9514350209690084*ⅈ
sin(72°) ≈ 0.9514350209690084 ("exact" value is 0.9510565162951535)
cos(72°) ≈ 0.3078496400415349 ("exact" value is 0.30901699437494745).
To find asin and acos, you can use this table with the Bisection Method:
For example, to find asin(0.6) (the smallest angle in a 3-4-5 triangle):
ⅈ^0 = 1 + 0*ⅈ. The sin is too small, so increase x by 1/2.
ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475*ⅈ. The sin is too big, so decrease x by 1/4.
ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898*ⅈ. The sin is too small, so increase x by 1/8.
ⅈ^(3/8) = 0.8314696123025452 + 0.5555702330196022*ⅈ. The sin is still too small, so increase x by 1/16.
ⅈ^(7/16) = 0.773010453362737 + 0.6343932841636455*ⅈ. The sin is too big, so decrease x by 1/32.
ⅈ^(13/32) = 0.8032075314806449 + 0.5956993044924334*ⅈ.
Each time you increase x, multiply by the corresponding power of ⅈ. Each time you decrease x, divide by the corresponding power of ⅈ.
If we stop here, we obtain asin(0.6) ≈ 13/32*π/2 = 0.6381360077604268. (The "exact" value is 0.6435011087932844.)
The accuracy, of course, depends on the number of iterations. For a quick-and-dirty approximation, use 10 iterations. For "intense precision", use 50-60 iterations.
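The bisection on the exponent can be sketched as follows. Note that this is a variant of the procedure described above, not a literal transcription: instead of stepping up and down, it tries each power of ⅈ in turn and keeps the multiplication only if the sine does not overshoot, which needs no divisions. The names are made up for illustration, and the table is generated here with repeated complex square roots rather than being precomputed.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// Table of i^(1/2), i^(1/4), ..., i^(1/2^bits), built by repeated square roots.
std::vector<std::complex<double>> power_table(int bits) {
    std::vector<std::complex<double>> table;
    std::complex<double> p(0.0, 1.0);  // i^1
    for (int k = 0; k < bits; ++k) {
        p = std::sqrt(p);              // principal root halves the exponent
        table.push_back(p);
    }
    return table;
}

// asin(s) for 0 <= s < 1: build the exponent x of i^x one binary digit at a time.
double asin_bisect(double s, int bits) {
    assert(s >= 0.0 && s < 1.0 && bits > 0);
    const double half_pi = std::acos(0.0);
    const std::vector<std::complex<double>> table = power_table(bits);
    std::complex<double> z(1.0, 0.0);  // i^0
    double x = 0.0;
    double step = 0.5;
    for (int k = 0; k < bits; ++k) {
        const std::complex<double> trial = z * table[k];
        if (trial.imag() <= s) {       // sine still not too big: keep this digit
            z = trial;
            x += step;
        }
        step *= 0.5;
    }
    return x * half_pi;
}
```

With bits = 5 and s = 0.6 this visits the same exponents as the worked example and ends at x = 13/32. In a real implementation the table would be a constant array, as in the answer.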
A fast arccosine implementation, accurate to about 0.5 degrees, can be based on the observation that for x in [0,1], acos(x) ≈ √(2*(1-x)). An additional scale factor improves accuracy near zero. The optimal factor can be found by a simple binary search. Negative arguments are handled according to acos (-x) = π - acos (x).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
// Approximate acos(a) with relative error < 5.15e-3
// This uses an idea from Robert Harley's posting in comp.arch.arithmetic on 1996/07/12
// https://groups.google.com/forum/#!original/comp.arch.arithmetic/wqCPkCCXqWs/T9qCkHtGE2YJ
float fast_acos (float a)
{
const float PI = 3.14159265f;
const float C = 0.10501094f;
float r, s, t, u;
t = (a < 0) ? (-a) : a; // handle negative arguments
u = 1.0f - t;
s = sqrtf (u + u);
r = C * u * s + s; // or fmaf (C * u, s, s) if FMA support in hardware
if (a < 0) r = PI - r; // handle negative arguments
return r;
}
float uint_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
int main (void)
{
double maxrelerr = 0.0;
uint32_t a = 0;
do {
float x = uint_as_float (a);
float r = fast_acos (x);
double xx = (double)x;
double res = (double)r;
double ref = acos (xx);
double relerr = (res - ref) / ref;
if (fabs (relerr) > maxrelerr) {
maxrelerr = fabs (relerr);
printf ("xx=% 15.8e res=% 15.8e ref=% 15.8e rel.err=% 15.8e\n",
xx, res, ref, relerr);
}
a++;
} while (a);
printf ("maximum relative error = %15.8e\n", maxrelerr);
return EXIT_SUCCESS;
}
The output of the above test scaffold should look similar to this:
xx= 0.00000000e+000 res= 1.56272149e+000 ref= 1.57079633e+000 rel.err=-5.14060021e-003
xx= 2.98023259e-008 res= 1.56272137e+000 ref= 1.57079630e+000 rel.err=-5.14065723e-003
xx= 8.94069672e-008 res= 1.56272125e+000 ref= 1.57079624e+000 rel.err=-5.14069537e-003
xx=-2.98023259e-008 res= 1.57887137e+000 ref= 1.57079636e+000 rel.err= 5.14071269e-003
xx=-8.94069672e-008 res= 1.57887149e+000 ref= 1.57079642e+000 rel.err= 5.14075044e-003
maximum relative error = 5.14075044e-003
Here is a great website with many options:
https://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/arcsin/onlyelem.html
Personally I went with the Chebyshev-Pade quotient approximation, using the following code:
double arccos(double x) {
const double pi = 3.141592653;
return pi / 2 - (.5689111419 - .2644381021*x - .4212611542*(2*x - 1)*(2*x - 1)
+ .1475622352*(2*x - 1)*(2*x - 1)*(2*x - 1))
/ (2.006022274 - 2.343685222*x + .3316406750*(2*x - 1)*(2*x - 1) +
.02607135626*(2*x - 1)*(2*x - 1)*(2*x - 1));
}
If you're using Microsoft VC++, here's an inline __asm x87 FPU code version without all the CRT filler, error checks, etc. Unlike the earliest classic ASM code you can find, it uses an FMUL instead of the slower FDIV. It compiles and works with Microsoft VC++ 2005 Express/Pro, which is what I always stick with for various reasons.
It's a little tricky to setup a function with "__declspec(naked)/__fastcall", pull parameters correctly, handle stack, so not for the faint of heart. If it fails to compile with errors on your version, don't bother unless you're experienced. Or ask me, I can rewrite it in a slightly friendlier __asm{} block. I would manually inline this if it's a critical part of a function in a loop for further performance gains if need be.
extern float __fastcall fs_acos(float x);
extern double __fastcall fs_Acos(double x);
// ACOS(x)- Computes the arccosine of ST(0)
// Allowable range: -1<=x<=+1
// Derivative Formulas: acos(x) = atan(sqrt((1 - x * x)/(x * x))) OR
// acos(x) = atan2(sqrt(1 - x * x), x)
// e.g. acos(-1.0) = 3.1415927
__declspec(naked) float __fastcall fs_acos(float x) { __asm {
FLD DWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
FLD1 ;// Load 1.0
FADD ST, ST(1) ;// Compute 1.0 + 'x'
FLD1 ;// Load 1.0
FSUB ST, ST(2) ;// Compute 1.0 - 'x'
FMULP ST(1), ST ;// Compute (1-x) * (1+x)
FSQRT ;// Compute sqrt(result)
FXCH ST(1)
FPATAN ;// Compute arctangent of result / 'x' (ST1/ST0)
RET 4
}}
__declspec(naked) double __fastcall fs_Acos(double x) { __asm { //
FLD QWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
FLD1 ;// Load 1.0
FADD ST, ST(1) ;// Compute (1.0 + 'x')
FLD1 ;// Load 1.0
FSUB ST, ST(2) ;// Compute (1.0 - 'x')
FMULP ST(1), ST ;// Compute (1-x) * (1+x)
FSQRT ;// Compute sqrt((1-x) * (1+x))
FXCH ST(1)
FPATAN ;// Compute arctangent of result / 'x' (ST1/ST0)
RET 8
}}
Unfortunately I do not have enough reputation to comment.
Here is a small modification of Nvidia's function that guards against inputs slightly greater than 1, which should have been <= 1, while preserving performance as much as possible.
It may be important, since rounding errors can lead a number that should be 1.0 to be (oh so slightly) larger than 1.0.
double safer_acos(double x) {
double negate = double(x < 0);
x = abs(x);
x -= double(x>1.0)*(x-1.0); // <- equivalent to min(1.0,x), but faster
double ret = -0.0187293;
ret = ret * x;
ret = ret + 0.0742610;
ret = ret * x;
ret = ret - 0.2121144;
ret = ret * x;
ret = ret + 1.5707288;
ret = ret * sqrt(1.0-x);
ret = ret - 2 * negate * ret;
return negate * 3.14159265358979 + ret;
// In a single line (no gain using gcc)
//return negate * 3.14159265358979 + (((((-0.0187293*x)+ 0.0742610)*x - 0.2121144)*x + 1.5707288)* sqrt(1.0-x))*(1.0-2.0*negate);
}