Optimizations for pow() with const non-integer exponent? (C++)

I have hot spots in my code where calls to pow() take up around 10-20% of my execution time.
My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:
I have two constant exponents: 2.4 and 1/2.4.
When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!
A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.
I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?

Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebyshev terms. It's best to keep the order odd, as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.
#include <stdlib.h>
#include <math.h>

// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebyshev polynomial coefs.
double pow512norm (double x)
{
   static const int N = 8;

   // Chebyshev polynomial terms.
   // Non-zero terms calculated via
   //   integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
   //   from -1 to 1
   // Zeroth term is similar except it uses 1/pi rather than 2/pi.
   static const double Cn[N] = {
       1.1758200232996901923,
       0.16665763094889061230,
      -0.0083154894939042125035,
       0.00075187976780420279038,
      // Wolfram Alpha doesn't want to compute the remaining terms
      // to more precision (it times out).
      -0.0000832402,
       0.0000102292,
      -1.3401e-6,
       1.83334e-7};

   double Tn[N];
   double u = 2.0*x - 3.0;

   Tn[0] = 1.0;
   Tn[1] = u;
   for (int ii = 2; ii < N; ++ii) {
      Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
   }

   double y = 0.0;
   for (int ii = N-1; ii >= 0; --ii) {
      y += Cn[ii]*Tn[ii];
   }

   return y;
}
// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (double x)
{
   static const double pow2_512[12] = {
      1.0,
      pow(2.0, 5.0/12.0),
      pow(4.0, 5.0/12.0),
      pow(8.0, 5.0/12.0),
      pow(16.0, 5.0/12.0),
      pow(32.0, 5.0/12.0),
      pow(64.0, 5.0/12.0),
      pow(128.0, 5.0/12.0),
      pow(256.0, 5.0/12.0),
      pow(512.0, 5.0/12.0),
      pow(1024.0, 5.0/12.0),
      pow(2048.0, 5.0/12.0)
   };

   double s;
   int iexp;

   s = frexp (x, &iexp);
   s *= 2.0;
   iexp -= 1;

   div_t qr = div (iexp, 12);
   if (qr.rem < 0) {
      qr.quot -= 1;
      qr.rem += 12;
   }

   return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}
Addendum: What's going on here?
Per request, the following explains how the above code works.
Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)
The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:
x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv)^a=(u^a)(v^a) for positive u,v and real a.
s^(5/12) is calculated via pow512norm(s).
(2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
(2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
(2^r)^(5/12) is calculated by the lookup table pow2_512.
Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
That is exactly what the math library function ldexp does.
Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.
I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak havoc on your retirement portfolio.
Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.
Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.
I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)
pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.
I used Wolfram Alpha to find the Chebyshev coefficients Cn for me. For example, the integral shown in the code comments with n=1 calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram Alpha is free; there is a limit on how much computation it will do. It hits that limit on higher-order terms. (If you buy or have access to Mathematica, you will be able to calculate those high-order coefficients to a high degree of precision.)
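If Wolfram Alpha gives out before you have enough digits, the coefficients can also be computed numerically. Here is a minimal sketch (my addition, not part of the original recipe) that evaluates the same weighted integrals with Gauss-Chebyshev quadrature; raise the node count for more digits.

#include <cmath>
#include <cstdio>

// Sketch: Cn[k] = (2/pi) * integral of T_k(u)*((u+3)/2)^(5/12)/sqrt(1-u^2)
// over [-1,1] (1/pi for k = 0). Substituting u = cos(theta) turns the
// weighted integral into a plain sum over the Chebyshev nodes.
int main() {
    const int degree = 8;   // number of coefficients to produce
    const int nodes = 64;   // quadrature nodes; more nodes, more digits
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < degree; ++k) {
        double sum = 0.0;
        for (int j = 0; j < nodes; ++j) {
            double theta = pi * (j + 0.5) / nodes;  // node u = cos(theta)
            double f = std::pow((std::cos(theta) + 3.0) / 2.0, 5.0 / 12.0);
            sum += f * std::cos(k * theta);         // T_k(cos t) = cos(k*t)
        }
        std::printf("Cn[%d] = %.17g\n", k, (k == 0 ? 1.0 : 2.0) * sum / nodes);
    }
}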
The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)
pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).
Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.
To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.
It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.
x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.
But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.
#include <cstdio>
#include <xmmintrin.h>
#include <cmath>

template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
        __m128 ret = arg;
//      std::printf( "arg = %,vg\n", ret );

        // Apply a constant pre-correction factor.
        ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
                * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//      std::printf( "scaled = %,vg\n", ret );

        // Reinterpret arg as integer to obtain logarithm.
        asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "log = %,vg\n", ret );

        // Multiply logarithm by power.
        ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//      std::printf( "powered = %,vg\n", ret );

        // Convert back to "integer" to exponentiate.
        asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "result = %,vg\n", ret );

        return ret;
}
A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.
The key is to match the overestimate with an underestimate, and take the average.
Compute x^0.8: four instructions, error ~ +3%.
Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
Compute x^0.4: one mulps.
Compute x^-0.2: one rsqrtps.
Compute x^2: one mulps.
Compute x^3: one mulps.
x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
Average the above: one addps, one mulps.
Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.
To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!
int main() {
        __m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
        std::printf( "Input: %,vg\n", x0 );

        // Approx 5% accuracy from one call. Always an overestimate.
        __m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
        std::printf( "Direct x^2.4: %,vg\n", x1 );

        // Lower exponents provide lower initial error, but too low causes overflow.
        __m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
        std::printf( "1.38 x^0.8: %,vg\n", xf );

        // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
        __m128 xfm4 = _mm_rsqrt_ps( xf );
        __m128 xf4 = _mm_mul_ps( xf, xfm4 );

        // Precisely calculate x^2 and x^3
        __m128 x2 = _mm_mul_ps( x0, x0 );
        __m128 x3 = _mm_mul_ps( x2, x0 );

        // Overestimate of x^2 * x^0.4
        x2 = _mm_mul_ps( x2, xf4 );

        // Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
        __m128 xfm2 = _mm_rsqrt_ps( xf4 );
        x3 = _mm_mul_ps( x3, xfm4 );
        x3 = _mm_mul_ps( x3, xfm2 );
        std::printf( "x^2 * x^0.4: %,vg\n", x2 );
        std::printf( "x^3 / x^0.6: %,vg\n", x3 );

        x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );

        // Final accuracy about 0.015%, 200x better than x^0.8 calculation.
        std::printf( "average = %,vg\n", x2 );
}
Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .
Update with stats
I implemented a little test harness and two x^(5/12) cases corresponding to the above.
#include <cstdio>
#include <xmmintrin.h>
#include <emmintrin.h> // SSE2 integer intrinsics used by mm_succ_ps
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
        __m128 ret = arg;
//      std::printf( "arg = %,vg\n", ret );

        // Apply a constant pre-correction factor.
        ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
                * pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
//      std::printf( "scaled = %,vg\n", ret );

        // Reinterpret arg as integer to obtain logarithm.
        asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "log = %,vg\n", ret );

        // Multiply logarithm by power.
        ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
//      std::printf( "powered = %,vg\n", ret );

        // Convert back to "integer" to exponentiate.
        asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
//      std::printf( "result = %,vg\n", ret );

        return ret;
}
__m128 pow125_4( __m128 arg ) {
        // Lower exponents provide lower initial error, but too low causes overflow.
        __m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );

        // Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
        __m128 xfm4 = _mm_rsqrt_ps( xf );
        __m128 xf4 = _mm_mul_ps( xf, xfm4 );

        // Precisely calculate x^2 and x^3
        __m128 x2 = _mm_mul_ps( arg, arg );
        __m128 x3 = _mm_mul_ps( x2, arg );

        // Overestimate of x^2 * x^0.4
        x2 = _mm_mul_ps( x2, xf4 );

        // Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
        __m128 xfm2 = _mm_rsqrt_ps( xf4 );
        x3 = _mm_mul_ps( x3, xfm4 );
        x3 = _mm_mul_ps( x3, xfm2 );

        return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}
__m128 pow512_2( __m128 arg ) {
        // 5/12 is too small, so compute the sqrt of 10/12 instead.
        __m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
        return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}
__m128 pow512_4( __m128 arg ) {
        // 5/12 is too small, so compute the 4th root of 20/12 instead.
        // 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
        // weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
        __m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
        __m128 xover = _mm_mul_ps( arg, xf );

        __m128 xfm1 = _mm_rsqrt_ps( xf );
        __m128 x2 = _mm_mul_ps( arg, arg );
        __m128 xunder = _mm_mul_ps( x2, xfm1 );

        // sqrt2 * over + 2 * sqrt2 * under
        __m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
                                  _mm_add_ps( xover, xunder ) );

        xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
        xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
        return xavg;
}
__m128 mm_succ_ps( __m128 arg ) {
        // Step each lane's bit pattern forward by 4, i.e. to the 4th-next
        // representable float; the test below samples every 4th float.
        return _mm_castsi128_ps( _mm_add_epi32( _mm_castps_si128( arg ), _mm_set1_epi32( 4 ) ) );
}
void test_pow( double p, __m128 (*f)( __m128 ) ) {
        __m128 arg;

        for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
              ! isfinite( _mm_cvtss_f32( f( arg ) ) );
              arg = mm_succ_ps( arg ) ) ;

        for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
              arg = mm_succ_ps( arg ) ) ;

        std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );

        int n;
        int const bucket_size = 1 << 25;
        do {
                float max_error = 0;
                double total_error = 0, cum_error = 0;
                for ( n = 0; n != bucket_size; ++ n ) {
                        float result = _mm_cvtss_f32( f( arg ) );

                        if ( ! isfinite( result ) ) break;

                        float actual = ::powf( _mm_cvtss_f32( arg ), p );
                        float error = ( result - actual ) / actual;
                        cum_error += error;
                        error = std::abs( error );
                        max_error = std::max( max_error, error );
                        total_error += error;

                        arg = mm_succ_ps( arg );
                }

                std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
                             max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
        } while ( n == bucket_size );
}
int main() {
        std::printf( "4 insn x^12/5:\n" );
        test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );

        std::printf( "14 insn x^12/5:\n" );
        test_pow( 12./5, & pow125_4 );

        std::printf( "6 insn x^5/12:\n" );
        test_pow( 5./12, & pow512_2 );

        std::printf( "14 insn x^5/12:\n" );
        test_pow( 5./12, & pow512_4 );
}
Output:
4 insn x^12/5:
Domain from 1.36909e-23
error max = inf avg = inf |avg| = inf to 8.97249e-19
error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14
error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09
error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553
error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513
error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06
error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10
error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15
error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max = inf avg = nan |avg| = nan to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411
error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629
error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27
error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22
error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17
error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12
error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07
error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125
error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512
error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07
error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12
error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17
error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21
error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26
error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31
error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36
error max = 0.999329 avg = nan |avg| = nan to -0
14 insn x^5/12:
Domain from 2.64698e-23
error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32
error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
I suspect the accuracy of the more accurate 5/12 routine is being limited by the rsqrt operation.

Ian Stephenson wrote this code which he claims outperforms pow(). He describes the idea as follows:
Pow is basically implemented using logs: pow(a,b) = x^(log_x(a)*b). So we need a fast log and a fast exponent - it doesn't matter what x is, so we use 2. The trick is that a floating point number is already in a log-style format:

a = M * 2^E

Taking the log of both sides gives:

log2(a) = log2(M) + E

or more simply:

log2(a) ~= E

In other words, if we take the floating point representation of a number and extract the exponent, we've got something that's a good starting point as its log. It turns out that when we do this by massaging the bit patterns, the mantissa ends up giving a good approximation to the error, and it works pretty well.

This should be good enough for simple lighting calculations, but if you need something better, you can then extract the mantissa and use that to calculate a quadratic correction factor which is pretty accurate.

First off, using floats isn't going to buy much on most machines nowadays. In fact, doubles can be faster. Your power, 1.0/2.4, is 5/12 or 1/3*(1+1/4). Even though this is calling cbrt (once) and sqrt (twice!) it is still twice as fast as using pow(). (Optimization: -O3, compiler: i686-apple-darwin10-g++-4.2.1).
#include <math.h> // cmath does not provide cbrt; C99 does.
double xpow512 (double x) {
    double cbrtx = cbrt(x);            // x^(1/3)
    return cbrtx*sqrt(sqrt(cbrtx));    // x^(1/3) * x^(1/12) = x^(5/12)
}
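A quick sanity check (my addition, assuming xpow512 from above is in scope) that the composed version agrees with the library:

#include <stdio.h>

int main(void) {
    double x = 0.5;
    printf("%.15g\n", xpow512(x));          /* cbrt/sqrt composition */
    printf("%.15g\n", pow(x, 5.0/12.0));    /* library reference */
    return 0;
}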

This might not answer your question.
The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.
If this is the case, try using a lookup table. Something like:
__attribute__((aligned(64)))
static const unsigned short SRGB_TO_LINEAR[256] = { ... };
__attribute__((aligned(64)))
static const unsigned short LINEAR_TO_SRGB[256] = { ... };

void apply_lut(const unsigned short lut[256], unsigned char *src, ...
If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.
(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)
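For completeness, a hedged sketch (mine, not the answerer's) of filling such a table once at startup, using the exact sRGB transfer function since speed is irrelevant here:

#include <cmath>

static unsigned short srgb_to_linear_lut[256];

void init_srgb_lut() {
    for (int i = 0; i < 256; ++i) {
        double s = i / 255.0;                        // 8-bit sRGB to [0,1]
        double lin = (s <= 0.04045) ? s / 12.92
                                    : std::pow((s + 0.055) / 1.055, 2.4);
        srgb_to_linear_lut[i] = (unsigned short)(lin * 65535.0 + 0.5);  // 16-bit linear
    }
}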

I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.
inline double poly7(double x, double a, double b, double c, double d,
                              double e, double f, double g, double h) {
    double ab, cd, ef, gh, abcd, efgh, x2, x4;
    x2 = x*x; x4 = x2*x2;
    ab = a*x + b; cd = c*x + d;
    ef = e*x + f; gh = g*x + h;
    abcd = ab*x2 + cd; efgh = ef*x2 + gh;
    return abcd*x4 + efgh;
}
inline double srgb_to_linear(double x) {
    if (x <= 0.04045) return x / 12.92;

    // Polynomial approximation of ((x+0.055)/1.055)^2.4.
    return poly7(x,  0.15237971711927983387,
                    -0.57235993072870072762,
                     0.92097986411523535821,
                    -0.90208229831912012386,
                     0.88348956209696805075,
                     0.48110797889132134175,
                     0.03563925285274562038,
                     0.00084585397227064120);
}

inline double linear_to_srgb(double x) {
    if (x <= 0.0031308) return x * 12.92;

    // Piecewise polynomial approximation (divided by x^3)
    // of 1.055 * x^(1/2.4) - 0.055.
    if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
                                      1224.97114922729451791383,
                                      -100.23413743425112443219,
                                         6.60361150127077944916,
                                         0.06114808961060447245,
                                        -0.00022244138470139442,
                                         0.00000041231840827815,
                                        -0.00000000035133685895) / (x*x*x);

    return poly7(x, -0.18730034115395793881,
                     0.64677431008037400417,
                    -0.99032868647877825286,
                     1.20939072663263713636,
                     0.33433459165487383613,
                    -0.01345095746411287783,
                     0.00044351684288719036,
                    -0.00000664263587520855) / (x*x*x);
}
And the sollya input used to generate the polynomials:
suppressmessage(174);
f = ((x+0.055)/1.055)^2.4;
p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
s = 0.0523;
z = 3;
f = 1.055 * x^(1/2.4) - 0.055;
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
print(canonical(p));
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
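A quick hedged round-trip check (my addition, assuming the two functions above are in scope): linear -> sRGB -> linear should reproduce the input to within roughly the combined error of the two approximations.

#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    double worst = 0.0;
    for (int i = 1; i <= 1000; ++i) {
        double x = i / 1000.0;
        double rt = srgb_to_linear(linear_to_srgb(x));
        worst = std::max(worst, std::fabs(rt - x) / x);
    }
    std::printf("worst relative round-trip error: %g\n", worst);
}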

The binomial series can handle a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a.) You'll have to do some analysis to decide how many terms you need for your desired accuracy.
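A hedged sketch of what that looks like (my illustration): accumulate the generalized binomial coefficients incrementally; convergence slows as x approaches 1, which is what the term-count analysis has to account for.

#include <cstdio>

// (1+t)^a for t in [0,1), via C(a,k) = C(a,k-1) * (a-k+1) / k
double binom_pow1p(double t, double a, int terms) {
    double sum = 1.0, coef = 1.0, tk = 1.0;
    for (int k = 1; k < terms; ++k) {
        coef *= (a - (k - 1)) / k;
        tk *= t;
        sum += coef * tk;
    }
    return sum;
}

int main() {
    // x = 1.5 -> t = 0.5 with exponent 5/12; compare against pow(1.5, 5./12).
    std::printf("%.12f\n", binom_pow1p(0.5, 5.0 / 12.0, 30));
}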

For exponents of 2.4, you could either make a lookup table for all your 2.4 values and lerp, or perhaps use a higher-order function to fill in the in-between values if the table isn't accurate enough (basically a huge log table).
Or, value squared times value to the 2/5ths: take the initial squared value from the first half of the expression and then take the 5th root of it. For the 5th root, you could Newton it or do some other fast approximator, though honestly once you get to this point, you're probably better off just doing the exp and log functions with the appropriate abbreviated series yourself.
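A hedged sketch (not the answerer's code) of the "Newton it" step: iterating y <- (4*y + a/y^4)/5 converges to a^(1/5), so x^2.4 = (x*x) * fifth_root(x*x).

#include <math.h>

/* Newton's method on y^5 = a (a > 0):
   y <- y - (y^5 - a)/(5*y^4) = (4*y + a/y^4)/5 */
static double fifth_root(double a) {
    int e;
    frexp(a, &e);                 /* a = m * 2^e */
    double y = ldexp(1.0, e / 5); /* crude seed ~ 2^(e/5) */
    for (int i = 0; i < 6; ++i)   /* each step roughly doubles the correct digits */
        y = (4.0 * y + a / (y * y * y * y)) / 5.0;
    return y;
}

double pow24(double x) {
    double x2 = x * x;
    return x2 * fifth_root(x2);   /* x^2 * x^(2/5) = x^2.4 */
}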

The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x and pow(x, n), you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n) for small delta, with a single multiply and add (more or less). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta you could use and still get the same accuracy.
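A hedged sketch of that amortization (my code): since d/dx x^n = n*x^(n-1) = n*(x^n)/x, a nearby power can be updated from the last exactly-computed (x, y) pair with one multiply-add, re-anchoring with a real pow() whenever delta grows too large.

#include <math.h>

typedef struct { double x, y, n; } pow_cache;   /* invariant: y == pow(x, n) */

double pow_near(pow_cache *c, double x) {
    double delta = x - c->x;
    /* first-order Taylor step: y(x + d) ~= y + (n*y/x) * d */
    return c->y + c->n * c->y / c->x * delta;
}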

So traditionally powf(x, p) = x^p is solved by rewriting x as x = 2^(log2(x)), making powf(x,p) = 2^(p*log2(x)), which transforms the problem into two approximations, exp2() and log2(). This has the advantage of working with larger powers p; the downside is that it is not the optimal solution for a constant power p over a specified input bound 0 ≤ x ≤ 1.
When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4, as can be seen below:
float pow12_5(float x){
    float mp;
    // Minimax Horner polynomials for x^(12/5). Note: choose the accuracy
    // required, then implement with fma() [Fused Multiply Accumulates].
    // mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
    mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4)))); // 2.6079002e-4
    // mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4)))); // 8.61377e-5
    // mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4)))))); // 3.524655e-5
    return(mp);
}
However, when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not converge to the desired accuracy. One option [not really] is to rewrite the problem as y = x^p = x^(p+m)/x^m, where m = 1, 2, 3 is a positive integer, making the new power p + m > 1, but this introduces division, which is inherently slower.
There's however another option which is to decompose the input x as its floating point exponent and mantissa form:
x = mx * 2^(ex), where 1 ≤ mx < 2
y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex);  let ey = floor(5*ex/12) and k = (5*ex) % 12, so
y = mx^(5/12) * 2^(k/12) * 2^(ey)
The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12). The code is below:
#include <stdint.h>
#include <math.h>

float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0, 0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0,
                      0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};

float pow5_12(float x){
    union{float f; uint32_t u;} v, e2;
    float poff, m, e, ei;
    int xe;

    v.f = x;
    xe = ((v.u >> 23) - 127);
    if(xe < -127) return(0.0f);

    // Calculate remainder k in 2^(k/12) to find LUT
    e = xe * (5.0f/12.0f);
    ei = floorf(e);
    poff = powk_12LUT[(int)(12.0f * (e - ei))];

    e2.u = ((int)ei + 127) << 23;                   // Calculate the exponent
    v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero

    // Approximate mx^(5/12) on [1,2), with appropriate degree minimax
    // m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4)); // 7.6125e-4
    // m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8)); // 8.4522726e-5
    m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8)))); // 1.04091259e-5
    // m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8)))); // 1.367401e-6

    return(m * poff * e2.f);
}

Related

Make very small, (or large), exponential calculations

Exponential limit of most 32 bit machines is
exp( +/-700 )
But I would like to do an exponential calculation
res = exp( x ) / exp( d )
when x or d are bigger than 700 I use the fact that
exp( x + y ) = exp( x ) . exp( y )
So my calculation would be something along the line of
res = (exp( x - z ).exp(z)) / (exp( d - z ).exp(z))
or
res = exp( x - z ) / exp( d - z )
where (x-z) < 700
But this approach is flawed in some cases, for example where x = 6000 and d = 10000
If we use z=5300 then
res = exp( 6000 - 5300 ) / exp( 10000 - 5300 )
res = exp( 700 ) / exp( 47000 )
But exp( 47000 ) overflows (is not representable) on a 32 bit machine.
If I replace z = 9300 then I get the opposite effect.
res = exp( -3300 ) / exp( 700 )
So how could I solve the above equations, (that should return a 32bit valid number I think), given the limitations of the computer?
Edit
The reason for doing this is I am using the formula
P( a ) = P(y1) * P(y2) * P(y3) ... P(yN)
In order to prevent overflow I then do
a = log( P(y1) ) + log( P(y2) ) + log (P(y3)) ... log( P(yN) )
b = log( P(z1) ) + log( P(z2) ) + log (P(z3)) ... log( P(zN) )
...
z = log( P(zz1) ) + log( P(zz2) ) + log (P(zz3)) ... log( P(zzN) )
to get a total I do
total = a + b ... z
and to calculate the percentage I do
(exp(a) / exp( total ) ) * 100
but it is possible that "a" and/or "total" are greater than 700
I guess the question could be how could I calculate the percentage without using the exponential
It doesn't matter that the answer should be a 32 bit number if some of the intermediate steps in the calculations aren't.
For math that goes outside the bounds of an int or long type, you probably need to start using something like GMP.
https://gmplib.org/
I assume that you want to compute this:
p = exp(a) / exp(b)
And since a^b/a^c == a^(b-c) this reduces to
p = exp(a - b)
which can be easily computed if that difference is below that critical exponent.
If it isn't, then your result cannot be represented by primitive datatypes like double (because it's either extremely large or extremely small), you then need some kind of arbitrary precision numbers, probably provided by some library.
But if you only need to print the result, or store it somehow, then you can easily compute even extremely large numbers:
For that, you change to base 10 (for displaying), compute the equivalent base-10 exponent (tExponent = eExponent * log10(e)), and get that value into the allowed range between std::numeric_limits<double>::max_exponent10 and std::numeric_limits<double>::min_exponent10, saving the difference as a scaling factor.
For now, I just have a quick and dirty live example, showing
exp(90000) / exp(100) = 1.18556 scaled by 10^39043
(Check at wolfram alpha)
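A hedged reconstruction of that rescaling (assumption: this mirrors what the live example computed): exp(a)/exp(b) = e^(a-b) = 10^((a-b)*log10(e)), then split the base-10 exponent into integer and fractional parts.

#include <cmath>
#include <cstdio>

int main() {
    double a = 90000, b = 100;
    double t = (a - b) * std::log10(std::exp(1.0)); // base-10 exponent of e^(a-b)
    double e10 = std::floor(t);
    double mantissa = std::pow(10.0, t - e10);      // in [1, 10)
    std::printf("%g scaled by 10^%.0f\n", mantissa, e10); // 1.18556 scaled by 10^39043
}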
Note: When I wrote this, it was pretty late in the evening. I'm leaving this here for an "alternative" approach.
Now, generally, there's
a^b = [a^(b/c)]^c
And since
(a/b)^c = (a^c)/(b^c)
holds, too, I guess the easiest approach here is to just divide both exponents as long as one of them is above your critical value, then do the exponentiation, divide the results, and finally use the divisor of the former exponents as exponent for the quotient:
#include <cmath>
#include <stdexcept>

double large_exp_quot(
        double eNum,
        double eDenom,
        unsigned int const critical = 200) {
    if (std::abs(eNum - eDenom) > critical) {
        throw std::out_of_range{"That won't work, resulting exponent is too large"};
    }

    unsigned int eDivisor = 1;
    while (std::abs(eNum) > critical or std::abs(eDenom) > critical) {
        eNum /= 2;
        eDenom /= 2;
        eDivisor *= 2;
    }

    return std::pow(std::exp(eNum) / std::exp(eDenom), eDivisor);
}
But this will only work, if the result of your computation can actually be stored using the C++ primitive datatypes, in this case double. The example you gave ... with exponents 6000 and 10000 ... is obviously not representable with a double (it's e^(-4000) and thus incredibly small)
Numerically unstable computation: exp(a) / exp(b)
Equivalent stable computation: exp(a - b)

Numerically unstable computation: ∏_{i=1..n} p_i
Equivalent stable computation: exp(∑_{i=1..n} log(p_i))
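A hedged side-by-side illustration (my addition): the literal quotient produces inf/inf = nan long before the stable form has any trouble.

#include <cmath>
#include <cstdio>

int main() {
    double a = 6000.0, b = 5999.0;
    // double bad = std::exp(a) / std::exp(b);  // inf / inf = nan
    double good = std::exp(a - b);              // = e ~= 2.71828
    std::printf("exp(a)/exp(b) = %g\n", good);
}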

Kernel Density Estimator ( with Gauss Kernel ) Sum f(x) = 1?

I want to use KDE with the Gaussian Kernel. If I'm correct, the sum of all f(x) must be 1 ( ~ rounding ) ?
My implementation looks like this:
float K( float const& val )
{
    const float p = 1.0 / std::sqrt( 2.0 * M_PI );
    float result = 0.5 * (val*val);
    result = p * std::exp( -result );
    return result;
}

std::vector< std::pair<float, float> > kde( float *val, int len, float h )
{
    std::vector< std::pair<float, float> > density( len );
    const float p = 1.0 / (h * len);
    for (int r = 0; r < len; r++)
    {
        float sum = 0;
        for (int i = 0; i < len; i++)
            sum += K( (val[r] - val[i]) / h );
        density[r] = std::make_pair( val[r], p*sum );
    }
    return density;
}
And I chose h > 0. Am I right that p*sum is the probability for the value val[r]? The sum over all probabilities is > 1 (but it looks OK to me).
You misinterpreted the assumptions on the probability density here. The density integrates to one, whereas its values at certain points are definitely not 1.
Let's discuss it using the following formula from the linked Wikipedia article, which you seem to use:

f_h(x) = (1/(n*h)) * Σ_{i=1..n} K( (x - x_i)/h )

This formula provides the density f_h(x) evaluated at point x.

From my review, your code correctly evaluates this quantity. Yet, you misinterpreted the quantity which should be one. As a density, the integral over the complete space should yield one, i.e.

∫ f_h(x) dx = 1

This property is called normalization of the density.

Moreover, being a density itself, each summand of f_h(x) should yield 1/n when integrated over the whole space, once one also includes the normalization constant. Again, there's no guarantee on the values of the summands.

In one dimension, you can easily confirm the normalization by using the trapezoidal rule or another quadrature scheme (if you provide a working example, I can try to do that).
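A hedged sketch of that check (my code, with made-up sample data): evaluate the KDE on a wide, fine grid and integrate with the trapezoidal rule; the result should be close to 1 for any h > 0.

#include <cmath>
#include <cstdio>

int main() {
    float data[] = { -1.0f, 0.2f, 0.3f, 1.7f };   // hypothetical samples
    const int n = 4;
    const float h = 0.5f;
    const double lo = -10, hi = 10;
    const int steps = 20000;
    double dx = (hi - lo) / steps, integral = 0;
    for (int s = 0; s <= steps; ++s) {
        double x = lo + s * dx;
        double sum = 0;
        for (int i = 0; i < n; ++i) {
            double u = (x - data[i]) / h;
            sum += std::exp(-0.5 * u * u) / std::sqrt(2.0 * M_PI);  // Gauss kernel
        }
        double fh = sum / (n * h);
        integral += ((s == 0 || s == steps) ? 0.5 : 1.0) * fh * dx; // trapezoid weights
    }
    std::printf("integral = %f\n", integral);  // expect ~1
}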

Fast approximate float division

On modern processors, float division is a good order of magnitude slower than float multiplication (when measured by reciprocal throughput).
I'm wondering if there are any algorithms out there for computing a fast approximation to x/y, given certain assumptions and tolerance levels. For example, if you assume that 0<x<y, and are willing to accept any output that is within 10% of the true value, are there algorithms faster than the built-in FDIV operation?
I hope that this helps, because this is probably as close as you're going to get to what you are looking for.
__inline__ double __attribute__((const)) divide( double y, double x ) {
    // calculates y/x
    union {
        double dbl;
        unsigned long long ull;
    } u;
    u.dbl = x;                                              // x = x
    u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> (unsigned char)1;
                                                            // pow( x, -0.5 )
    u.dbl *= u.dbl;                                         // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0/x
    return u.dbl * y;                                       // (1.0/x) * y = y/x
}
See also:
Another post about reciprocal approximation.
The Wikipedia page.
FDIV is usually exceptionally slower than FMUL because it can't be pipelined like multiplication and requires multiple clock cycles for the hardware's iterative convergence process.
The easiest way is to recognize that division is nothing more than multiplication of the dividend y by the inverse of the divisor x. The not-so-straightforward part is remembering that a float value is x = m * 2^e, and its inverse is x^(-1) = (1/m)*2^(-e) = (2/m)*2^(-e-1) = p * 2^q, approximating the new mantissa as p = 2/m ≈ 3 - m for 1 ≤ m < 2. This gives a rough piecewise-linear approximation of the inverse function; however, we can do a lot better by using an iterative Newton root-finding method to improve that approximation.
Let w = f(x) = 1/x; the inverse of this function is found by solving for x in terms of w, i.e. x = f^(-1)(w) = 1/w. To improve the output with the root-finding method we must first create a function whose zero reflects the desired output: g(w) = 1/w - x, with g'(w) = -1/w^2. Then

w[n+1] = w[n] - g(w[n])/g'(w[n]) = w[n] + w[n]^2 * (1/w[n] - x) = w[n] * (2 - x*w[n])

Note the fixed point: when w[n] = 1/x, w[n+1] = (1/x) * (2 - x*(1/x)) = 1/x.
These components then add to get the final piece of code:
#include <stdint.h>

float inv_fast(float x) {
    union { float f; int i; } v;
    float w, sx;

    sx = (x < 0) ? -1 : 1;
    x = sx * x;

    v.i = (int)(0x7EF127EA - *(uint32_t *)&x);
    w = x * v.f;

    // Efficient Iterative Approximation Improvement in Horner polynomial form.
    v.f = v.f * (2 - w);  // Single iteration, Err = -3.36e-3 * 2^(-flr(log2(x)))
    // v.f = v.f * (4 + w * (-6 + w * (4 - w)));  // Second iteration, Err = -1.13e-5 * 2^(-flr(log2(x)))
    // v.f = v.f * (8 + w * (-28 + w * (56 + w * (-70 + w * (56 + w * (-28 + w * (8 - w))))))); // Third iteration, Err = +-6.8e-8 * 2^(-flr(log2(x)))
    return v.f * sx;
}

Newton Raphson with SSE2 - can someone explain me these 3 lines

I'm reading this document: http://software.intel.com/en-us/articles/interactive-ray-tracing
and I stumbled upon these three lines of code:
The SIMD version is already quite a bit faster, but we can do better. Intel has added a fast 1/sqrt(x) function to the SSE2 instruction set. The only drawback is that its precision is limited. We need the precision, so we refine it using Newton-Raphson:
__m128 nr = _mm_rsqrt_ps( x );
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr );
result = _mm_mul_ps( _mm_mul_ps( half, nr ), _mm_sub_ps( three, muls ) );
This code assumes the existence of a __m128 variable named 'half' (four times 0.5f) and a variable 'three' (four times 3.0f).
I know how to use Newton Raphson to calculate a function's zero and I know how to use it to calculate the square root of a number but I just can't see how this code performs it.
Can someone explain it to me please?
Given the Newton iteration y_{n+1} = (y_n/2) * (3 - x*y_n^2), it should be quite straightforward to see this in the source code.
__m128 nr = _mm_rsqrt_ps( x );                       // the initial approximation y_0
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr ); // muls = x*nr*nr == x*(y_n)^2
result = _mm_mul_ps(
             _mm_mul_ps( half, nr ),                 // y_n / 2
             _mm_sub_ps( three, muls ) );            // 3.0 - x*(y_n)^2
And to be precise, this algorithm is for the inverse square root.
Note that this still doesn't give fully a fully accurate result. rsqrtps with a NR iteration gives almost 23 bits of accuracy, vs. sqrtps's 24 bits with correct rounding for the last bit.
The limited accuracy is an issue if you want to truncate the result to integer. (int)4.99999 is 4. Also, watch out for the x == 0.0 case if using sqrt(x) ~= x * sqrt(x), because 0 * +Inf = NaN.
To compute the inverse square root of a, Newton's method is applied to the equation 0=f(x)=a-x^(-2) with derivative f'(x)=2*x^(-3) and thus the iteration step
N(x) = x - f(x)/f'(x) = x - (a*x^3-x)/2
= x/2 * (3 - a*x^2)
This division-free method has -- in contrast to the globally converging Heron's method -- a limited region of convergence, so you need an already good approximation of the inverse square root to get a better approximation.
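A scalar illustration of the same refinement (my addition): starting from a deliberately crude seed, one step lands near float precision and a second step is essentially exact.

#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 2.0f;
    float y = 0.7f;                        /* stand-in for the rsqrtps estimate */
    y = 0.5f * y * (3.0f - x * y * y);     /* one Newton-Raphson step */
    y = 0.5f * y * (3.0f - x * y * y);     /* second step */
    printf("refined = %.9f, exact = %.9f\n", y, 1.0 / sqrt(2.0));
    return 0;
}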

Fast Arc Cos algorithm?

I have my own, very fast cos function:
float sine(float x)
{
    const float B = 4/pi;
    const float C = -4/(pi*pi);

    float y = B * x + C * x * abs(x);

    // const float Q = 0.775;
    const float P = 0.225;

    y = P * (y * abs(y) - y) + y;   // Q * y + P * y * abs(y)
    return y;
}

float cosine(float x)
{
    return sine(x + (pi / 2));
}
But now when I profile, I see that acos() is killing the processor. I don't need intense precision. What is a fast way to calculate acos(x)? Thanks.
A simple cubic approximation, the Lagrange polynomial for x ∈ {-1, -½, 0, ½, 1}, is:
double acos(double x) {
    return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}
It has a maximum error of about 0.18 rad.
Got spare memory? A lookup table (with interpolation, if required) is gonna be fastest.
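A hedged sketch of that (my code): tabulate acos on [-1,1] once, then answer queries with one index computation and a linear interpolation between neighbors. Accuracy scales with the table size.

#include <cmath>

const int N = 1024;
static float acos_table[N + 1];

void init_acos_table() {
    for (int i = 0; i <= N; ++i)
        acos_table[i] = std::acos(-1.0f + 2.0f * i / N);
}

float acos_lut(float x) {               // assumes -1 <= x <= 1
    float t = (x + 1.0f) * (N / 2.0f);  // map [-1,1] onto [0,N]
    int i = (int)t;
    if (i >= N) return acos_table[N];
    float frac = t - i;
    return acos_table[i] + frac * (acos_table[i + 1] - acos_table[i]);
}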
nVidia has some great resources that show how to approximate otherwise very expensive math functions, such as acos, asin, atan2, and so on. These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:
// Absolute error <= 6.7e-5
float acos(float x) {
    float negate = float(x < 0);
    x = abs(x);
    float ret = -0.0187293;
    ret = ret * x;
    ret = ret + 0.0742610;
    ret = ret * x;
    ret = ret - 0.2121144;
    ret = ret * x;
    ret = ret + 1.5707288;
    ret = ret * sqrt(1.0-x);
    ret = ret - 2 * negate * ret;
    return negate * 3.14159265358979 + ret;
}
And here are the results for when calculating acos(0.5):
nVidia: result: 1.0471513828611643
math.h: result: 1.0471975511965976
That's pretty close! Depending on your required degree of precision, this might be a good option for you.
I have my own. It's pretty accurate and sort of fast. It works off of a theorem I built around quartic convergence. It's really interesting, and you can see the equation and how fast it can make my natural log approximation converge here: https://www.desmos.com/calculator/yb04qt8jx4
Here's my arccos code:
function acos(x)
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
    local c=0.88-0.77*x c=(c+(2-a)/c)/2
    return (8*(c+(2-a)/c)-(b+(2-2*x)/b))/6
end
A lot of that is just square root approximation. It works really well, too, unless you get too close to taking a square root of 0. It has an average error (excluding x=0.99 to 1) of 0.0003. The problem, though, is that at 0.99 it starts going to shit, and at x=1, the difference in accuracy becomes 0.05. Of course, this could be solved by doing more iterations on the square roots (lol nope) or, just a little thing like, if x>0.99 then use a different set of square root linearizations, but that makes the code all long and ugly.
If you don't care about accuracy so much, you could just do one iteration per square root, which should still keep you somewhere in the range of 0.0162 or something as far as accuracy goes:
function acos(x)
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
    local c=0.88-0.77*x c=(c+(2-a)/c)/2
    return 8/3*c-b/3
end
If you're okay with it, you can use pre-existing square root code. It will get rid of the the equation going a bit crazy at x=1:
function acos(x)
    local a = math.sqrt(2+2*x)
    local b = math.sqrt(2-2*x)
    local c = math.sqrt(2-a)
    return 8/3*c-b/3
end
Frankly, though, if you're really pressed for time, remember that you could linearize arccos into 1.57079-1.57079x and just do:
function acos(x)
    return 1.57079-1.57079*x
end
Anyway, if you want to see a list of my arccos approximation equations, you can go to https://www.desmos.com/calculator/tcaty2sv8l I know that my approximations aren't the best for certain things, but if you're doing something where my approximations would be useful, please use them, but try to give me credit.
You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.
acos(x) ≈ π/2 + (ax + bx³) / (1 + cx² + dx⁴)
where
a = -0.939115566365855
b = 0.9217841528914573
c = -1.2845906244690837
d = 0.295624144969963174
has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). [Plot from the original post: the inverse cosine in black, the cubic polynomial approximation in red, the above function in blue, for comparison.]
The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.
A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)
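A hedged straight-line implementation of the rational approximation with the coefficients given above (the code itself is mine, not the answerer's):

float acos_rational(float x) {
    const float a = -0.939115566365855f;
    const float b =  0.9217841528914573f;
    const float c = -1.2845906244690837f;
    const float d =  0.295624144969963174f;
    float x2 = x * x;
    float num = x * (a + b * x2);            // a*x + b*x^3
    float den = 1.0f + x2 * (c + d * x2);    // 1 + c*x^2 + d*x^4
    return 1.5707963267948966f + num / den;  // pi/2 + num/den
}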
Another approach you could take is to use complex numbers. From de Moivre's formula,

ⅈ^x = cos(π/2·x) + ⅈ·sin(π/2·x)

Let θ = π/2·x. Then x = 2θ/π, so

sin(θ) = ℑ(ⅈ^(2θ/π))
cos(θ) = ℜ(ⅈ^(2θ/π))

How can you calculate powers of ⅈ without sin and cos? Start with a precomputed table for powers of 2:

ⅈ^4 = 1
ⅈ^2 = -1
ⅈ^1 = ⅈ
ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475·ⅈ
ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898·ⅈ
ⅈ^(1/8) = 0.9807852804032304 + 0.19509032201612825·ⅈ
ⅈ^(1/16) = 0.9951847266721969 + 0.0980171403295606·ⅈ
ⅈ^(1/32) = 0.9987954562051724 + 0.049067674327418015·ⅈ
ⅈ^(1/64) = 0.9996988186962042 + 0.024541228522912288·ⅈ
ⅈ^(1/128) = 0.9999247018391445 + 0.012271538285719925·ⅈ
ⅈ^(1/256) = 0.9999811752826011 + 0.006135884649154475·ⅈ

To calculate arbitrary values of ⅈ^x, approximate the exponent as a binary fraction, and then multiply together the corresponding values from the table.
For example, to find sin and cos of 72° = 0.8·π/2:

ⅈ^0.8 ≈ ⅈ^(205/256) = ⅈ^0b0.11001101 = ⅈ^(1/2) · ⅈ^(1/4) · ⅈ^(1/32) · ⅈ^(1/64) · ⅈ^(1/256) = 0.3078496400415349 + 0.9514350209690084·ⅈ

sin(72°) ≈ 0.9514350209690084 ("exact" value is 0.9510565162951535)
cos(72°) ≈ 0.3078496400415349 ("exact" value is 0.30901699437494745).
To find asin and acos, you can use this table with the Bisection Method:
For example, to find asin(0.6) (the smallest angle in a 3-4-5 triangle):

ⅈ^0 = 1 + 0·ⅈ. The sin is too small, so increase x by 1/2.
ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475·ⅈ. The sin is too big, so decrease x by 1/4.
ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898·ⅈ. The sin is too small, so increase x by 1/8.
ⅈ^(3/8) = 0.8314696123025452 + 0.5555702330196022·ⅈ. The sin is still too small, so increase x by 1/16.
ⅈ^(7/16) = 0.773010453362737 + 0.6343932841636455·ⅈ. The sin is too big, so decrease x by 1/32.
ⅈ^(13/32) = 0.8032075314806449 + 0.5956993044924334·ⅈ.

Each time you increase x, multiply by the corresponding power of ⅈ. Each time you decrease x, divide by the corresponding power of ⅈ.

If we stop here, we obtain asin(0.6) ≈ 13/32·π/2 = 0.6381360077604268 (the "exact" value is 0.6435011087932844).
The accuracy, of course, depends on the number of iterations. For a quick-and-dirty approximation, use 10 iterations. For "intense precision", use 50-60 iterations.
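A hedged sketch of the scheme in C++ (my code; it generates the table on the fly by repeated complex square roots rather than storing it):

#include <complex>
#include <cstdio>

int main() {
    const double s = 0.6;                          // find asin(0.6)
    std::complex<double> z(1.0, 0.0);              // running power i^x
    std::complex<double> root = std::sqrt(std::complex<double>(0.0, 1.0)); // i^(1/2)
    double x = 0.0, step = 0.5;
    for (int k = 0; k < 50; ++k) {                 // 50 iterations: "intense precision"
        std::complex<double> w = z * root;         // try x + step
        if (w.imag() <= s) { z = w; x += step; }   // keep the digit while sin <= s
        root = std::sqrt(root);                    // next table entry i^(1/2^(k+2))
        step *= 0.5;
    }
    std::printf("asin(%g) ~= %.15f\n", s, x * 1.5707963267948966);
}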
A fast arccosine implementation, accurate to about 0.5 degrees, can be based on the observation that for x in [0,1], acos(x) ≈ √(2*(1-x)). An additional scale factor improves accuracy near zero. The optimal factor can be found by a simple binary search. Negative arguments are handled according to acos (-x) = π - acos (x).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

// Approximate acos(a) with relative error < 5.15e-3
// This uses an idea from Robert Harley's posting in comp.arch.arithmetic on 1996/07/12
// https://groups.google.com/forum/#!original/comp.arch.arithmetic/wqCPkCCXqWs/T9qCkHtGE2YJ
float fast_acos (float a)
{
    const float PI = 3.14159265f;
    const float C  = 0.10501094f;
    float r, s, t, u;
    t = (a < 0) ? (-a) : a;   // handle negative arguments
    u = 1.0f - t;
    s = sqrtf (u + u);
    r = C * u * s + s;        // or fmaf (C * u, s, s) if FMA support in hardware
    if (a < 0) r = PI - r;    // handle negative arguments
    return r;
}
float uint_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof(r));
    return r;
}

int main (void)
{
    double maxrelerr = 0.0;
    uint32_t a = 0;
    do {
        float x = uint_as_float (a);
        float r = fast_acos (x);
        double xx = (double)x;
        double res = (double)r;
        double ref = acos (xx);
        double relerr = (res - ref) / ref;
        if (fabs (relerr) > maxrelerr) {
            maxrelerr = fabs (relerr);
            printf ("xx=% 15.8e  res=% 15.8e  ref=% 15.8e  rel.err=% 15.8e\n",
                    xx, res, ref, relerr);
        }
        a++;
    } while (a);
    printf ("maximum relative error = %15.8e\n", maxrelerr);
    return EXIT_SUCCESS;
}
The output of the above test scaffold should look similar to this:
xx= 0.00000000e+000 res= 1.56272149e+000 ref= 1.57079633e+000 rel.err=-5.14060021e-003
xx= 2.98023259e-008 res= 1.56272137e+000 ref= 1.57079630e+000 rel.err=-5.14065723e-003
xx= 8.94069672e-008 res= 1.56272125e+000 ref= 1.57079624e+000 rel.err=-5.14069537e-003
xx=-2.98023259e-008 res= 1.57887137e+000 ref= 1.57079636e+000 rel.err= 5.14071269e-003
xx=-8.94069672e-008 res= 1.57887149e+000 ref= 1.57079642e+000 rel.err= 5.14075044e-003
maximum relative error = 5.14075044e-003
Here is a great website with many options:
https://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/arcsin/onlyelem.html
Personally I went with the Chebyshev-Pade quotient approximation, with the following code:
double arccos(double x) {
    const double pi = 3.141592653;
    return pi / 2 - (.5689111419 - .2644381021*x - .4212611542*(2*x - 1)*(2*x - 1)
                     + .1475622352*(2*x - 1)*(2*x - 1)*(2*x - 1))
                  / (2.006022274 - 2.343685222*x + .3316406750*(2*x - 1)*(2*x - 1)
                     + .02607135626*(2*x - 1)*(2*x - 1)*(2*x - 1));
}
If you're using Microsoft VC++, here's an inline __asm x87 FPU code version without all the CRT filler, error checks, etc., and unlike the earliest classic ASM code you can find, it uses an FMUL instead of the slower FDIV. It compiles/works with Microsoft VC++ 2005 Express/Pro, which I always stick with for various reasons.
It's a little tricky to set up a function with "__declspec(naked)/__fastcall", pull parameters correctly, and handle the stack, so not for the faint of heart. If it fails to compile with errors on your version, don't bother unless you're experienced. Or ask me, I can rewrite it in a slightly friendlier __asm{} block. I would manually inline this if it's a critical part of a function in a loop for further performance gains if need be.
extern float __fastcall fs_acos(float x);
extern double __fastcall fs_Acos(double x);
// ACOS(x)- Computes the arccosine of ST(0)
// Allowable range: -1<=x<=+1
// Derivative Formulas: acos(x) = atan(sqrt((1 - x * x)/(x * x))) OR
// acos(x) = atan2(sqrt(1 - x * x), x)
// e.g. acos(-1.0) = 3.1415927
__declspec(naked) float __fastcall fs_acos(float x) { __asm {
    FLD   DWORD PTR [ESP+4]  ;// Load/Push parameter 'x' to FPU stack
    FLD1                     ;// Load 1.0
    FADD  ST, ST(1)          ;// Compute 1.0 + 'x'
    FLD1                     ;// Load 1.0
    FSUB  ST, ST(2)          ;// Compute 1.0 - 'x'
    FMULP ST(1), ST          ;// Compute (1-x) * (1+x)
    FSQRT                    ;// Compute sqrt(result)
    FXCH  ST(1)
    FPATAN                   ;// Compute arctangent of result / 'x' (ST1/ST0)
    RET   4
}}
__declspec(naked) double __fastcall fs_Acos(double x) { __asm {
    FLD   QWORD PTR [ESP+4]  ;// Load/Push parameter 'x' to FPU stack
    FLD1                     ;// Load 1.0
    FADD  ST, ST(1)          ;// Compute (1.0 + 'x')
    FLD1                     ;// Load 1.0
    FSUB  ST, ST(2)          ;// Compute (1.0 - 'x')
    FMULP ST(1), ST          ;// Compute (1-x) * (1+x)
    FSQRT                    ;// Compute sqrt((1-x) * (1+x))
    FXCH  ST(1)
    FPATAN                   ;// Compute arctangent of result / 'x' (ST1/ST0)
    RET   8
}}
Unfortunately I do not have enough reputation to comment. Here is a small modification of Nvidia's function that clamps inputs which should be <= 1 while preserving performance as much as possible. It may be important, since rounding errors can cause a value that should be exactly 1.0 to end up (oh so slightly) larger than 1.0.
double safer_acos(double x) {
    double negate = double(x < 0);
    x = abs(x);
    x -= double(x > 1.0)*(x - 1.0);  // <- equivalent to min(1.0, x), but faster
    double ret = -0.0187293;
    ret = ret * x;
    ret = ret + 0.0742610;
    ret = ret * x;
    ret = ret - 0.2121144;
    ret = ret * x;
    ret = ret + 1.5707288;
    ret = ret * sqrt(1.0-x);
    ret = ret - 2 * negate * ret;
    return negate * 3.14159265358979 + ret;

    // In a single line (no gain using gcc):
    //return negate * 3.14159265358979 + (((((-0.0187293*x)+ 0.0742610)*x - 0.2121144)*x + 1.5707288)* sqrt(1.0-x))*(1.0-2.0*negate);
}