How do you process exp() with SSE2?

How do you process exp() with SSE2? - c++

I'm making a code that essentially takes advantage of SSE2 on optimizing this code:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pC[sampleIndex] = exp((mMin + std::clamp(pA[sampleIndex] + pB[sampleIndex], 0.0, 1.0) * mRange) * ln2per12);
}
in this:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
// SSE2
__m128d bound_lower = _mm_set1_pd(0.0);
__m128d bound_upper = _mm_set1_pd(1.0);
__m128d rangeLn2per12 = _mm_set1_pd(mRange * ln2per12);
__m128d minLn2per12 = _mm_set1_pd(mMin * ln2per12);
__m128d loaded_a = _mm_load_pd(pA);
__m128d loaded_b = _mm_load_pd(pB);
__m128d result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
double *pCEnd = pC + roundintup8(blockSize);
for (; pC < pCEnd; pA += 8, pB += 8, pC += 8) {
_mm_store_pd(pC, result);
loaded_a = _mm_load_pd(pA + 2);
loaded_b = _mm_load_pd(pB + 2);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 2, result);
loaded_a = _mm_load_pd(pA + 4);
loaded_b = _mm_load_pd(pB + 4);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 4, result);
loaded_a = _mm_load_pd(pA + 6);
loaded_b = _mm_load_pd(pB + 6);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 6, result);
loaded_a = _mm_load_pd(pA + 8);
loaded_b = _mm_load_pd(pB + 8);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
}
And I would say it works pretty well. BUT, can't find any exp function for SSE2, to complete the chain of operations.
Reading this, it seems I need to call standard exp() from library?
Really? Isn't this penalizing? Any other ways? Different builtin function?
I'm on MSVC, /arch:SSE2, /O2, producing 32-bit code.

The simplest way is to use exponent approximation. One possible case based on this limit
For n = 256 = 2^8:
__m128d fastExp1(__m128d x)
{
__m128d ret = _mm_mul_pd(_mm_set1_pd(1.0 / 256), x);
ret = _mm_add_pd(_mm_set1_pd(1.0), ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
return ret;
}
The other idea is the polynomial expansion. In particular, taylor series expansion:
__m128d fastExp2(__m128d x)
{
const __m128d a0 = _mm_set1_pd(1.0);
const __m128d a1 = _mm_set1_pd(1.0);
const __m128d a2 = _mm_set1_pd(1.0 / 2);
const __m128d a3 = _mm_set1_pd(1.0 / 2 / 3);
const __m128d a4 = _mm_set1_pd(1.0 / 2 / 3 / 4);
const __m128d a5 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5);
const __m128d a6 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6);
const __m128d a7 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6 / 7);
__m128d ret = _mm_fmadd_pd(a7, x, a6);
ret = _mm_fmadd_pd(ret, x, a5);
// If fma extention is not present use
// ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
ret = _mm_fmadd_pd(ret, x, a4);
ret = _mm_fmadd_pd(ret, x, a3);
ret = _mm_fmadd_pd(ret, x, a2);
ret = _mm_fmadd_pd(ret, x, a1);
ret = _mm_fmadd_pd(ret, x, a0);
return ret;
}
Note that with the same number of expansion terms, you can get a better approximation if you approximate the function for the specific x range, using for example the least squares method.
All of these methods works in a very limited x range but with continuous derivatives which may be important in some cases.
There is a trick to approximate an exponent in a very wide range but with a noticeable piecewise linear regions. It is based on integers reinterpretation as floating-point numbers. For a more accurate description, I recommend this refs:
Piecewise linear approximation to exponential and logarithm
A Fast, Compact Approximation of the Exponential Function
The possible implementation of this approach:
__m128d fastExp3(__m128d x)
{
const __m128d a = _mm_set1_pd(1.0 / M_LN2);
const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
__m128d t = _mm_fmadd_pd(x, a, b);
return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(t), 11));
}
Despite the simplicity and wide x range for this method, be careful when used in math. In small areas, it gives a piecewise approximation, which can disrupt sensitive algorithms, especially those using differentiation.
To compare the accuracy of different methods, look at the graphics. The first graph is made for the x = [0..1) range. As you can see, the best approximation in this case is given by the method fastExp2(x), slightly worse but acceptable is fastExp1(x). The worst approximation provides by fastExp3(x) - the piecewise stucrure is noticeable, discontinuities of the first derivative is presence.
In the range x = [0..10) fastExp3(x) method provides the best approximation, a bit worse is approximation given by fastExp1(x) - with the same number of calculations, it provides more order than fastExp2(x).
The next step is to improve the accuracy of the fastExp3(x) algorithm. The easiest way to significantly increase accuracy is to use equality exp(x) = exp(x/2)/exp(-x/2) Although it increases the amount of computation, it greatly reduces the error due to mutual error compensation when dividing.
__m128d fastExp5(__m128d x)
{
const __m128d ap = _mm_set1_pd(0.5 / M_LN2);
const __m128d an = _mm_set1_pd(-0.5 / M_LN2);
const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
__m128d tp = _mm_fmadd_pd(x, ap, b);
__m128d tn = _mm_fmadd_pd(x, an, b);
tp = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tp), 11));
tn = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tn), 11));
return _mm_div_pd(tp, tn);
}
Even greater accuracy can be achieved by combining methods from fastExp1(x) or fastExp2(x) and fastExp3(x) algorithms using equality exp(x+dx) = exp(x) *exp(dx). As shown above, the first multiplier can be computed similar to fastExp3(x) approach, for second multiplier fastExp1(x) or fastExp2(x) method can be used. Finding of the optimal solution in this case is quite a difficult task and I would recommend to look at the implementation in the libraries proposed in answers.

There are several libraries that provide vectorized exponential, with more or less accuracy.
SVML, provided with the Intel compiler (it provides intrinsics as well, so if you have a licence, you can use them), has different level of precision (and speed)
you mentioned IPP, also from Intel, that also provide some functionality
MKL also provides some interface for this computation (for this one, fixing the ISA can be done through macros, for instance if you need reproducibility or precision)
fmath is another option, you can tear the code from the vectorized exp to integrate it inside your loop.
From experience, all these are faster and more precise than a custom padde approximation (not even talking about the unstable Taylor expansion that would give you negative number VERY quickly).
For SVML, IPP and MKL, I would check which one is better: calling from inside your loop or calling exp with one call for your full array (as the libraries could use AVX512 instead of just SSE2).

There is no SSE2 implementation of exp so if you don't want to roll your own as suggested above, one option is to use AVX512 instructions on some hardware that supports ERI (Exponential and Reciprocal Instructions). See https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_exponential_and_reciprocal
I think that currently limits you to the Xeon phi (as pointed out by Peter Cordes - I did find one claim about it being on Skylake and Cannonlake but can't corroborate it), and bear in mind as well that the code won't work at all (i.e. will crash) on other architectures.

Related

SSE to C++ code

I am trying to rewrite a code from c++ source code including SSE instructions, to only c++ code. I know i will lose performance, but its an experiment, i am trying to perform.
I was wondering if there is a C++ equivalent for doing the same as , __mm_unpackhi_pd and __mm_unpacklo_pd. I have zero knowledge about SSE.
A snippet of the code for reference which i am trying to convert. Any knowledge or tips would be helpful. Thank you.
for (unsigned chunk = 0; chunk < chunks; chunk++)
{
unsigned start = chunk * chunksize;
unsigned end =
std::min((chunk + 1) * chunksize, (unsigned)2 * w);
__m128d a2b2 =
_mm_load_pd(d_origx +
((2 * init_G_offset + start) & n2_m_1));
unsigned i2_mod_B = 0;
for (unsigned i = start; i < end; i += 2)
{
__m128d ab = a2b2;
a2b2 =
_mm_load_pd(d_origx +
((origx_offset + i) & n2_m_1));
__m128d cd = _mm_load_pd(d_filter + i);
__m128d cc = _mm_unpacklo_pd(cd, cd);
__m128d dd = _mm_unpackhi_pd(cd, cd);
__m128d a0a1 = _mm_unpacklo_pd(ab, a2b2);
__m128d b0b1 = _mm_unpackhi_pd(ab, a2b2);
__m128d ac = _mm_mul_pd(cc, a0a1);
__m128d ad = _mm_mul_pd(dd, a0a1);
__m128d bc = _mm_mul_pd(cc, b0b1);
__m128d bd = _mm_mul_pd(dd, b0b1);
__m128d ac_m_bd = _mm_sub_pd(ac, bd);
__m128d ad_p_bc = _mm_add_pd(ad, bc);
__m128d ab_times_cd = _mm_unpacklo_pd(ac_m_bd, ad_p_bc);
__m128d a2b2_times_cd =
_mm_unpackhi_pd(ac_m_bd, ad_p_bc);
__m128d xy = _mm_load_pd(d_x_sampt + i2_mod_B);
__m128d x2y2 = _mm_load_pd(d_x_sampt + i2_mod_B + 2);
__m128d st = _mm_add_pd(xy, ab_times_cd);
__m128d s2t2 = _mm_add_pd(x2y2, a2b2_times_cd);
_mm_store_pd(d_x_sampt + i2_mod_B, st);
_mm_store_pd(d_x_sampt + i2_mod_B + 2, s2t2);
i2_mod_B += 4;
}
}

Below you find the description of the two functions, I've also linked each function to its reference page. The whole reference is available here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
_mm_unpackhi_p
__m128d _mm_unpackhi_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the high half of a and b, and store the results in dst.
_mm_unpacklo_pd
_m128d _mm_unpacklo_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the low half of a and b, and store the results in dst.

Exactly how to implement it depends on your representation, but basically you return a new value composed of the high (or low) half of a concatenated with the high (or low) half of b. For example:
typedef double[2] __m128d;
__m128d _mm_unpackhi_pd(__m128d a, __m128d b) {
__m128d res;
res[0] = a[1];
res[1] = b[1];
return res;
}
__m128d _mm_unpacklo_pd(__m128d a, __m128d b) {
__m128d res;
res[0] = a[0];
res[1] = b[0];
return res;
}
Wierd timing on this question… I found this issue while implementing this function for SIMDe, and it's only 17 days old. If you want to use SIMDe as a reference, these functions are in sse2.h along with a lot of others. The code in SIMDe is a bit more complex than what's above, but that's mostly just to match the implementations of the other _mm_unpack* functions.

Difference between ldexp(1, x) and exp2(x)

It seems if the floating-point representation has radix 2 (i.e. FLT_RADIX == 2) both std::ldexp(1, x) and std::exp2(x) raise 2 to the given power x.
Does the standard define or mention any expected behavioral and/or performance difference between them? What is the practical experience over different compilers?

exp2(x) and ldexp(x,i) perform two different operations. The former computes 2x, where x is a floating-point number, while the latter computes x*2i, where i is an integer. For integer values of x, exp2(x) and ldexp(1,int(x)) would be equivalent, provided the conversion of x to integer doesn't overflow.
The question about the relative efficiency of these two functions doesn't have a clear-cut answer. It will depend on the capabilities of the hardware platform and the details of the library implementation. While conceptually, ldexpf() looks like simple manipulation of the exponent part of a floating-point operand, it is actually a bit more complicated than that, once one considers overflow and gradual underflow via denormals. The latter case involves the rounding of the significand (mantissa) part of the floating-point number.
As ldexp() is generally an infrequently used function, it is in my experience fairly common that less of an optimization effort is applied to it by math library writers than to other math functions.
On some platforms, ldexp(), or a faster (custom) version of it, will be used as a building block in the software implementation of exp2(). The following code provides an exemplary implementation of this approach for float arguments:
#include <cmath>
/* Compute exponential base 2. Maximum ulp error = 0.86770 */
float my_exp2f (float a)
{
const float cvt = 12582912.0f; // 0x1.8p23
const float large = 1.70141184e38f; // 0x1.0p127
float f, r;
int i;
// exp2(a) = exp2(i + f); i = rint (a)
r = (a + cvt) - cvt;
f = a - r;
i = (int)r;
// approximate exp2(f) on interval [-0.5,+0.5]
r = 1.53720379e-4f; // 0x1.426000p-13f
r = fmaf (r, f, 1.33903872e-3f); // 0x1.5f055ep-10f
r = fmaf (r, f, 9.61817801e-3f); // 0x1.3b2b20p-07f
r = fmaf (r, f, 5.55036031e-2f); // 0x1.c6af7ep-05f
r = fmaf (r, f, 2.40226522e-1f); // 0x1.ebfbe2p-03f
r = fmaf (r, f, 6.93147182e-1f); // 0x1.62e430p-01f
r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+00f
// exp2(a) = 2**i * exp2(f);
r = ldexpf (r, i);
if (!(fabsf (a) < 150.0f)) {
r = a + a; // handle NaNs
if (a < 0.0f) r = 0.0f;
if (a > 0.0f) r = large * large; // + INF
}
return r;
}
Most real-life implementations of exp2() do not invoke ldexp(), but a custom version, for example when fast bit-wise transfer between integer and floating-point data is supported, here represented by internal functions __float_as_int() and __int_as_float() that re-interpret an IEEE-754 binary32 as an int32 and vice versa:
/* For a in [0.5, 4), compute a * 2**i, -250 < i < 250 */
float fast_ldexpf (float a, int i)
{
int ia = (i << 23) + __float_as_int (a); // scale by 2**i
a = __int_as_float (ia);
if ((unsigned int)(i + 125) > 250) { // |i| > 125
i = (i ^ (125 << 23)) - i; // ((i < 0) ? -125 : 125) << 23
a = __int_as_float (ia - i); // scale by 2**(+/-125)
a = a * __int_as_float ((127 << 23) + i); // scale by 2**(+/-(i%125))
}
return a;
}
On other platforms, the hardware provides a single-precision version of exp2() as a fast hardware instruction. Internal to the processor these are typically implemented by a table lookup with linear or quadratic interpolation. On such hardware platforms, ldexp(float) may be implemented in terms of exp2(float), for example:
float my_ldexpf (float x, int i)
{
float r, fi, fh, fq, t;
fi = (float)i;
/* NaN, Inf, zero require argument pass-through per ISO standard */
if (!(fabsf (x) <= 3.40282347e+38f) || (x == 0.0f) || (i == 0)) {
r = x;
} else if (abs (i) <= 126) {
r = x * exp2f (fi);
} else if (abs (i) <= 252) {
fh = (float)(i / 2);
r = x * exp2f (fh) * exp2f (fi - fh);
} else {
fq = (float)(i / 4);
t = exp2f (fq);
r = x * t * t * t * exp2f (fi - 3.0f * fq);
}
return r;
}
Lastly, there are platforms that basically provide both exp2() and ldexp() functionality in hardware, such as the x87 instructions F2XM1 and FSCALE on x86 processors.

How can I add together two SSE registers

I have two SSE registers (128 bits is one register) and I want to add them up. I know how I can add corresponding words in them, for example I can do it with _mm_add_epi16 if I use 16bit words in registers, but what I want is something like _mm_add_epi128 (which does not exist), which would use register as one big word.
Is there any way to perform this operation, even if multiple instructions are needed?
I was thinking about using _mm_add_epi64, detecting overflow in the right word and then adding 1 to the left word in register if needed, but I would also like this approach to work for 256bit registers (AVX2), and this approach seems too complicated for that.

To add two 128-bit numbers x and y to give z with SSE you can do it like this
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
This is based on this link how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually a found a simpler version for SSE4.2 if XOP is not available - see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing this works.
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__ // AMD XOP instruction set
return _mm_comgt_epu64(b,a));
#else // SSE2 instruction set
__m128i sign32 = _mm_set1_epi32(0x80000000); // sign bit of each dword
__m128i aflip = _mm_xor_si128(b,sign32); // a with sign bits flipped
__m128i bflip = _mm_xor_si128(a,sign32); // b with sign bits flipped
__m128i equal = _mm_cmpeq_epi32(b,a); // a == b, dwords
__m128i bigger = _mm_cmpgt_epi32(aflip,bflip); // a > b, dwords
__m128i biggerl = _mm_shuffle_epi32(bigger,0xA0); // a > b, low dwords copied to high dwords
__m128i eqbig = _mm_and_si128(equal,biggerl); // high part equal and low part bigger
__m128i hibig = _mm_or_si128(bigger,eqbig); // high part bigger or high part equal and low part
__m128i big = _mm_shuffle_epi32(hibig,0xF5); // result copied to low part
return big;
#endif
}
int main() {
__m128i x,y,z,c;
x = _mm_set_epi64x(3,0xffffffffffffffffll);
y = _mm_set_epi64x(1,0x2ll);
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
int out[4];
//int64_t out[2];
_mm_storeu_si128((__m128i*)out, z);
printf("%d %d\n", out[2], out[0]);
}
Edit:
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2 which does not exist yet. And even if you have XOP it may only be efficient to add two 128-bit or 256-numbers in parallel (you could do four with AVX if XOP2 existed) to avoid the horizontal instructions such as mm_unpacklo_epi64.
The best solution in general is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4 you can add them like this:
__m256i x4, y4, z4;
uint64_t x[4], uint64_t y[4], uint64_t z[4]
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x,y,z);
z4 = _mm256_loadu_si256((__m256i*)z);
void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
uint64_t c1 = 0, c2 = 0, tmp;
//add low 128-bits
z[0] = x[0] + y[0];
z[1] = x[1] + y[1];
c1 += z[1]<x[1];
tmp = z[1];
z[1] += z[0]<x[0];
c1 += z[1]<tmp;
//add high 128-bits + carry from low 128-bits
z[2] = x[2] + y[2];
c2 += z[2]<x[2];
tmp = z[2];
z[2] += c1;
c2 += z[2]<tmp;
z[3] = x[3] + y[3] + c2;
}
int main() {
uint64_t x[4], y[4], z[4];
x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
//z = x + y (x3,x2,x1,x0) = (2,3,1,0)
//x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
//y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
//z = x + y (x3,x2,x1,x0) = (2,3,0,0)
add_u256(x,y,z);
for(int i=3; i>=0; i--) printf("%u ", z[i]); printf("\n");
}
Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2 I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.
__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);

Optimizations for pow() with const non-integer exponent?

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.
My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:
I have two constant exponents: 2.4 and 1/2.4.
When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!
A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.
I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?

Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebychev terms. It's best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.
#include <stdlib.h>
#include <math.h>
// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebychev polynomial coefs.
double pow512norm (
double x)
{
static const int N = 8;
// Chebychev polynomial terms.
// Non-zero terms calculated via
// integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
// from -1 to 1
// Zeroth term is similar except it uses 1/pi rather than 2/pi.
static const double Cn[N] = {
1.1758200232996901923,
0.16665763094889061230,
-0.0083154894939042125035,
0.00075187976780420279038,
// Wolfram alpha doesn't want to compute the remaining terms
// to more precision (it times out).
-0.0000832402,
0.0000102292,
-1.3401e-6,
1.83334e-7};
double Tn[N];
double u = 2.0*x - 3.0;
Tn[0] = 1.0;
Tn[1] = u;
for (int ii = 2; ii < N; ++ii) {
Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
}
double y = 0.0;
for (int ii = N-1; ii >= 0; --ii) {
y += Cn[ii]*Tn[ii];
}
return y;
}
// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
double x)
{
static const double pow2_512[12] = {
1.0,
pow(2.0, 5.0/12.0),
pow(4.0, 5.0/12.0),
pow(8.0, 5.0/12.0),
pow(16.0, 5.0/12.0),
pow(32.0, 5.0/12.0),
pow(64.0, 5.0/12.0),
pow(128.0, 5.0/12.0),
pow(256.0, 5.0/12.0),
pow(512.0, 5.0/12.0),
pow(1024.0, 5.0/12.0),
pow(2048.0, 5.0/12.0)
};
double s;
int iexp;
s = frexp (x, &iexp);
s *= 2.0;
iexp -= 1;
div_t qr = div (iexp, 12);
if (qr.rem < 0) {
qr.quot -= 1;
qr.rem += 12;
}
return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}
Addendum: What's going on here?
Per request, the following explains how the above code works.
Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)
The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:
x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv)^a=(u^a)(v^a) for positive u,v and real a.
s^(5/12) is calculated via pow512norm(s).
(2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
(2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
(2^r)^(5/12) is calculated by the lookup table pow2_512.
Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
That is exactly what the math library function ldexp does.
Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.
I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak have on your retirement portfolio.
Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.
Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.
I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)
pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.
I used Wolfram alpha to find the Chebyshev coefficients Cn for me. For example, this calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram alpha is free; there is a limit on how much computation it will do. It hits that limit on higher order terms. (If you buy or have access to mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)
The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)
pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.

In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).
Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.
To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.
It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.
x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.
But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.
The key is to match the overestimate with an underestimate, and take the average.
Compute x^0.8: four instructions, error ~ +3%.
Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
Compute x^0.4: one mulps.
Compute x^-0.2: one rsqrtps.
Compute x^2: one mulps.
Compute x^3: one mulps.
x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
Average the above: one addps, one mulps.
Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.
To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!
int main() {
__m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
std::printf( "Input: %,vg\n", x0 );
// Approx 5% accuracy from one call. Always an overestimate.
__m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
std::printf( "Direct x^2.4: %,vg\n", x1 );
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
std::printf( "1.38 x^0.8: %,vg\n", xf );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( x0, x0 );
__m128 x3 = _mm_mul_ps( x2, x0 );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
std::printf( "x^2 * x^0.4: %,vg\n", x2 );
std::printf( "x^3 / x^0.6: %,vg\n", x3 );
x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
// Final accuracy about 0.015%, 200x better than x^0.8 calculation.
std::printf( "average = %,vg\n", x2 );
}
Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .
Update with stats
I implemented a little test harness and two x(5⁄12) cases corresponding to the above.
#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
__m128 pow125_4( __m128 arg ) {
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 x3 = _mm_mul_ps( x2, arg );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}
__m128 pow512_2( __m128 arg ) {
// 5/12 is too small, so compute the sqrt of 10/12 instead.
__m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}
__m128 pow512_4( __m128 arg ) {
// 5/12 is too small, so compute the 4th root of 20/12 instead.
// 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
// weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
__m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
__m128 xover = _mm_mul_ps( arg, xf );
__m128 xfm1 = _mm_rsqrt_ps( xf );
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 xunder = _mm_mul_ps( x2, xfm1 );
// sqrt2 * over + 2 * sqrt2 * under
__m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
_mm_add_ps( xover, xunder ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
return xavg;
}
__m128 mm_succ_ps( __m128 arg ) {
return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}
void test_pow( double p, __m128 (*f)( __m128 ) ) {
__m128 arg;
for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
! isfinite( _mm_cvtss_f32( f( arg ) ) );
arg = mm_succ_ps( arg ) ) ;
for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
arg = mm_succ_ps( arg ) ) ;
std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );
int n;
int const bucket_size = 1 << 25;
do {
float max_error = 0;
double total_error = 0, cum_error = 0;
for ( n = 0; n != bucket_size; ++ n ) {
float result = _mm_cvtss_f32( f( arg ) );
if ( ! isfinite( result ) ) break;
float actual = ::powf( _mm_cvtss_f32( arg ), p );
float error = ( result - actual ) / actual;
cum_error += error;
error = std::abs( error );
max_error = std::max( max_error, error );
total_error += error;
arg = mm_succ_ps( arg );
}
std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
} while ( n == bucket_size );
}
int main() {
std::printf( "4 insn x^12/5:\n" );
test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
std::printf( "14 insn x^12/5:\n" );
test_pow( 12./5, & pow125_4 );
std::printf( "6 insn x^5/12:\n" );
test_pow( 5./12, & pow512_2 );
std::printf( "14 insn x^5/12:\n" );
test_pow( 5./12, & pow512_4 );
}
Output:
4 insn x^12/5:
Domain from 1.36909e-23
error max = inf avg = inf |avg| = inf to 8.97249e-19
error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14
error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09
error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553
error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513
error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06
error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10
error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15
error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max = inf avg = nan |avg| = nan to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411
error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629
error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27
error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22
error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17
error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12
error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07
error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125
error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512
error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07
error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12
error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17
error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21
error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26
error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31
error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36
error max = 0.999329 avg = nan |avg| = nan to -0
14 insn x^5/12:
Domain from 2.64698e-23
error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32
error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
I suspect accuracy of the more accurate 5/12 is being limited by the rsqrt operation.

Ian Stephenson wrote this code which he claims outperforms pow(). He describes the idea as follows:
Pow is basically implemented using
log's: pow(a,b)=x(logx(a)*b). so we
need a fast log and fast exponent - it
doesn't matter what x is so we use 2.
The trick is that a floating point
number is already in a log style
format:
a=M*2E
Taking the log of both sides gives:
log2(a)=log2(M)+E
or more simply:
log2(a)~=E
In other words if we take the floating
point representation of a number, and
extract the Exponent we've got
something that's a good starting point
as its log. It turns out that when we
do this by massaging the bit patterns,
the Mantissa ends up giving a good
approximation to the error, and it
works pretty well.
This should be good enough for simple
lighting calculations, but if you need
something better, you can then extract
the Mantissa, and use that to
calculate a quadratic correction factor
which is pretty accurate.

First off, using floats isn't going to buy much on most machines nowadays. In fact, doubles can be faster. Your power, 1.0/2.4, is 5/12 or 1/3*(1+1/4). Even though this is calling cbrt (once) and sqrt (twice!) it is still twice as fast as using pow(). (Optimization: -O3, compiler: i686-apple-darwin10-g++-4.2.1).
#include <math.h> // cmath does not provide cbrt; C99 does.
double xpow512 (double x) {
double cbrtx = cbrt(x);
return cbrtx*sqrt(sqrt(cbrtx));
}

This might not answer your question.
The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.
If this is the case, try using a lookup table. Something like:
__attribute__((aligned(64))
static const unsigned short SRGB_TO_LINEAR[256] = { ... };
__attribute__((aligned(64))
static const unsigned short LINEAR_TO_SRGB[256] = { ... };
void apply_lut(const unsigned short lut[256], unsigned char *src, ...
If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.
(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)

I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.
inline double poly7(double x, double a, double b, double c, double d,
double e, double f, double g, double h) {
double ab, cd, ef, gh, abcd, efgh, x2, x4;
x2 = x*x; x4 = x2*x2;
ab = a*x + b; cd = c*x + d;
ef = e*x + f; gh = g*x + h;
abcd = ab*x2 + cd; efgh = ef*x2 + gh;
return abcd*x4 + efgh;
}
inline double srgb_to_linear(double x) {
if (x <= 0.04045) return x / 12.92;
// Polynomial approximation of ((x+0.055)/1.055)^2.4.
return poly7(x, 0.15237971711927983387,
-0.57235993072870072762,
0.92097986411523535821,
-0.90208229831912012386,
0.88348956209696805075,
0.48110797889132134175,
0.03563925285274562038,
0.00084585397227064120);
}
inline double linear_to_srgb(double x) {
if (x <= 0.0031308) return x * 12.92;
// Piecewise polynomial approximation (divided by x^3)
// of 1.055 * x^(1/2.4) - 0.055.
if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
1224.97114922729451791383,
-100.23413743425112443219,
6.60361150127077944916,
0.06114808961060447245,
-0.00022244138470139442,
0.00000041231840827815,
-0.00000000035133685895) / (x*x*x);
return poly7(x, -0.18730034115395793881,
0.64677431008037400417,
-0.99032868647877825286,
1.20939072663263713636,
0.33433459165487383613,
-0.01345095746411287783,
0.00044351684288719036,
-0.00000664263587520855) / (x*x*x);
}
And the sollya input used to generate the polynomials:
suppressmessage(174);
f = ((x+0.055)/1.055)^2.4;
p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
s = 0.0523;
z = 3;
f = 1.055 * x^(1/2.4) - 0.055;
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
print(canonical(p));
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));

Binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a). You'll have to do some analysis to decide how many terms you need for your desired accuracy.

For exponents of 2.4, you could either make a lookup table for all your 2.4 values and lirp or perhaps higher-order function to fill in the in-betweem values if the table wasn't accurate enough (basically a huge log table.)
Or, value squared * value to the 2/5s which could take the initial square value from the first half of the function and then 5th root it. For the 5th root, you could Newton it or do some other fast approximator, though honestly once you get to this point, your probably better off just doing the exp and log functions with the appropriate abbreviated series functions yourself.

The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x and pow(x, n), you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n) for small delta, with a single multiply and add (more or less). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta you could use and still get the same accuracy.

So traditionally the powf(x, p) = x^p is solved by rewriting x as x=2^(log2(x)) making powf(x,p) = 2^(p*log2(x)), which transforms the problem into two approximations exp2() & log2(). This has the advantage of working with larger powers p, however the downside is that this is not the optimal solution for a constant power p and over a specified input bound 0 ≤ x ≤ 1.
When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4 as can be seen below:
float pow12_5(float x){
float mp;
// Minimax horner polynomials for x^(5/12), Note: choose the accurarcy required then implement with fma() [Fused Multiply Accumulates]
// mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4)))); // 2.6079002e-4
// mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4)))); // 8.61377e-5
// mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4)))))); // 3.524655e-5
return(mp);
}
However when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem y=x^p=x^(p+m)/x^m where m=1,2,3 is a positive integer, making the new power approximation p > 1 but this introduces division which is inherently slower.
There's however another option which is to decompose the input x as its floating point exponent and mantissa form:
x = mx* 2^(ex) where 1 ≤ mx < 2
y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
= mx^(5/12) * 2^(k/12) * 2^(ey)
The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12). The code is below:
float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0, 0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
float pow5_12(float x){
union{float f; uint32_t u;} v, e2;
float poff, m, e, ei;
int xe;
v.f = x;
xe = ((v.u >> 23) - 127);
if(xe < -127) return(0.0f);
// Calculate remainder k in 2^(k/12) to find LUT
e = xe * (5.0f/12.0f);
ei = floorf(e);
poff = powk_12LUT[(int)(12.0f * (e - ei))];
e2.u = ((int)ei + 127) << 23; // Calculate the exponent
v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero
// Approximate mx^(5/12) on [1,2), with appropriate degree minimax
// m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4)); // 7.6125e-4
// m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8)); // 8.4522726e-5
m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8)))); // 1.04091259e-5
// m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8)))); // 1.367401e-6
return(m * poff * e2.f);
}

Fast Arc Cos algorithm?

I have my own, very fast cos function:
float sine(float x)
{
const float B = 4/pi;
const float C = -4/(pi*pi);
float y = B * x + C * x * abs(x);
// const float Q = 0.775;
const float P = 0.225;
y = P * (y * abs(y) - y) + y; // Q * y + P * y * abs(y)
return y;
}
float cosine(float x)
{
return sine(x + (pi / 2));
}
But now when I profile, I see that acos() is killing the processor. I don't need intense precision. What is a fast way to calculate acos(x)
Thanks.

A simple cubic approximation, the Lagrange polynomial for x ∈ {-1, -½, 0, ½, 1}, is:
double acos(x) {
return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}
It has a maximum error of about 0.18 rad.

Got spare memory? A lookup table (with interpolation, if required) is gonna be fastest.

nVidia has some great resources that show how to approximate otherwise very expensive math functions, such as: acos
asin
atan2
etc etc...
These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:
// Absolute error <= 6.7e-5
float acos(float x) {
float negate = float(x < 0);
x = abs(x);
float ret = -0.0187293;
ret = ret * x;
ret = ret + 0.0742610;
ret = ret * x;
ret = ret - 0.2121144;
ret = ret * x;
ret = ret + 1.5707288;
ret = ret * sqrt(1.0-x);
ret = ret - 2 * negate * ret;
return negate * 3.14159265358979 + ret;
}
And here are the results for when calculating acos(0.5):
nVidia: result: 1.0471513828611643
math.h: result: 1.0471975511965976
That's pretty close! Depending on your required degree of precision, this might be a good option for you.

I have my own. It's pretty accurate and sort of fast. It works off of a theorem I built around quartic convergence. It's really interesting, and you can see the equation and how fast it can make my natural log approximation converge here: https://www.desmos.com/calculator/yb04qt8jx4
Here's my arccos code:
function acos(x)
local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
local c=0.88-0.77*x c=(c+(2-a)/c)/2
return (8*(c+(2-a)/c)-(b+(2-2*x)/b))/6
end
A lot of that is just square root approximation. It works really well, too, unless you get too close to taking a square root of 0. It has an average error (excluding x=0.99 to 1) of 0.0003. The problem, though, is that at 0.99 it starts going to shit, and at x=1, the difference in accuracy becomes 0.05. Of course, this could be solved by doing more iterations on the square roots (lol nope) or, just a little thing like, if x>0.99 then use a different set of square root linearizations, but that makes the code all long and ugly.
If you don't care about accuracy so much, you could just do one iteration per square root, which should still keep you somewhere in the range of 0.0162 or something as far as accuracy goes:
function acos(x)
local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
local c=0.88-0.77*x c=(c+(2-a)/c)/2
return 8/3*c-b/3
end
If you're okay with it, you can use pre-existing square root code. It will get rid of the the equation going a bit crazy at x=1:
function acos(x)
local a = math.sqrt(2+2*x)
local b = math.sqrt(2-2*x)
local c = math.sqrt(2-a)
return 8/3*d-b/3
end
Frankly, though, if you're really pressed for time, remember that you could linearize arccos into 3.14159-1.57079x and just do:
function acos(x)
return 1.57079-1.57079*x
end
Anyway, if you want to see a list of my arccos approximation equations, you can go to https://www.desmos.com/calculator/tcaty2sv8l I know that my approximations aren't the best for certain things, but if you're doing something where my approximations would be useful, please use them, but try to give me credit.

You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.
acos(x) ≈ π/2 + (ax + bx³) / (1 + cx² + dx⁴)
where
a = -0.939115566365855
b = 0.9217841528914573
c = -1.2845906244690837
d = 0.295624144969963174
has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). Here is a plot (the inverse cosine in black, cubic polynomial approximation in red, the above function in blue) for comparison:
The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.
A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)

Another approach you could take is to use complex numbers. From de Moivre's formula,
ⅈx = cos(π/2*x) + ⅈ*sin(π/2*x)
Let θ = π/2*x. Then x = 2θ/π, so
sin(θ) = ℑ(ⅈ^2θ/π)
cos(θ) = ℜ(ⅈ^2θ/π)
How can you calculate powers of ⅈ without sin and cos? Start with a precomputed table for powers of 2:
ⅈ4 = 1
ⅈ2 = -1
ⅈ1 = ⅈ
ⅈ1/2 = 0.7071067811865476 + 0.7071067811865475*ⅈ
ⅈ1/4 = 0.9238795325112867 + 0.3826834323650898*ⅈ
ⅈ1/8 = 0.9807852804032304 + 0.19509032201612825*ⅈ
ⅈ1/16 = 0.9951847266721969 + 0.0980171403295606*ⅈ
ⅈ1/32 = 0.9987954562051724 + 0.049067674327418015*ⅈ
ⅈ1/64 = 0.9996988186962042 + 0.024541228522912288*ⅈ
ⅈ1/128 = 0.9999247018391445 + 0.012271538285719925*ⅈ
ⅈ1/256 = 0.9999811752826011 + 0.006135884649154475*ⅈ
To calculate arbitrary values of ⅈx, approximate the exponent as a binary fraction, and then multiply together the corresponding values from the table.
For example, to find sin and cos of 72° = 0.8π/2:
ⅈ0.8
&approx; ⅈ205/256
= ⅈ0b11001101
= ⅈ1/2 * ⅈ1/4 * ⅈ1/32 * ⅈ1/64 * ⅈ1/256
= 0.3078496400415349 + 0.9514350209690084*ⅈ
sin(72°) &approx; 0.9514350209690084 ("exact" value is 0.9510565162951535)
cos(72°) &approx; 0.3078496400415349 ("exact" value is 0.30901699437494745).
To find asin and acos, you can use this table with the Bisection Method:
For example, to find asin(0.6) (the smallest angle in a 3-4-5 triangle):
ⅈ0 = 1 + 0*ⅈ. The sin is too small, so increase x by 1/2.
ⅈ1/2 = 0.7071067811865476 + 0.7071067811865475*ⅈ . The sin is too big, so decrease x by 1/4.
ⅈ1/4 = 0.9238795325112867 + 0.3826834323650898*ⅈ. The sin is too small, so increase x by 1/8.
ⅈ3/8 = 0.8314696123025452 + 0.5555702330196022*ⅈ. The sin is still too small, so increase x by 1/16.
ⅈ7/16 = 0.773010453362737 + 0.6343932841636455*ⅈ. The sin is too big, so decrease x by 1/32.
ⅈ13/32 = 0.8032075314806449 + 0.5956993044924334*ⅈ.
Each time you increase x, multiply by the corresponding power of ⅈ. Each time you decrease x, divide by the corresponding power of ⅈ.
If we stop here, we obtain acos(0.6) &approx; 13/32*π/2 = 0.6381360077604268 (The "exact" value is 0.6435011087932844.)
The accuracy, of course, depends on the number of iterations. For a quick-and-dirty approximation, use 10 iterations. For "intense precision", use 50-60 iterations.

A fast arccosine implementation, accurate to about 0.5 degrees, can be based on the observation that for x in [0,1], acos(x) ≈ √(2*(1-x)). An additional scale factor improves accuracy near zero. The optimal factor can be found by a simple binary search. Negative arguments are handled according to acos (-x) = π - acos (x).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
// Approximate acos(a) with relative error < 5.15e-3
// This uses an idea from Robert Harley's posting in comp.arch.arithmetic on 1996/07/12
// https://groups.google.com/forum/#!original/comp.arch.arithmetic/wqCPkCCXqWs/T9qCkHtGE2YJ
float fast_acos (float a)
{
const float PI = 3.14159265f;
const float C = 0.10501094f;
float r, s, t, u;
t = (a < 0) ? (-a) : a; // handle negative arguments
u = 1.0f - t;
s = sqrtf (u + u);
r = C * u * s + s; // or fmaf (C * u, s, s) if FMA support in hardware
if (a < 0) r = PI - r; // handle negative arguments
return r;
}
float uint_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
int main (void)
{
double maxrelerr = 0.0;
uint32_t a = 0;
do {
float x = uint_as_float (a);
float r = fast_acos (x);
double xx = (double)x;
double res = (double)r;
double ref = acos (xx);
double relerr = (res - ref) / ref;
if (fabs (relerr) > maxrelerr) {
maxrelerr = fabs (relerr);
printf ("xx=% 15.8e res=% 15.8e ref=% 15.8e rel.err=% 15.8e\n",
xx, res, ref, relerr);
}
a++;
} while (a);
printf ("maximum relative error = %15.8e\n", maxrelerr);
return EXIT_SUCCESS;
}
The output of the above test scaffold should look similar to this:
xx= 0.00000000e+000 res= 1.56272149e+000 ref= 1.57079633e+000 rel.err=-5.14060021e-003
xx= 2.98023259e-008 res= 1.56272137e+000 ref= 1.57079630e+000 rel.err=-5.14065723e-003
xx= 8.94069672e-008 res= 1.56272125e+000 ref= 1.57079624e+000 rel.err=-5.14069537e-003
xx=-2.98023259e-008 res= 1.57887137e+000 ref= 1.57079636e+000 rel.err= 5.14071269e-003
xx=-8.94069672e-008 res= 1.57887149e+000 ref= 1.57079642e+000 rel.err= 5.14075044e-003
maximum relative error = 5.14075044e-003

Here is a great website with many options:
https://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/arcsin/onlyelem.html
Personally I went the Chebyshev-Pade quotient approximation with with the following code:
double arccos(double x) {
const double pi = 3.141592653;
return pi / 2 - (.5689111419 - .2644381021*x - .4212611542*(2*x - 1)*(2*x - 1)
+ .1475622352*(2*x - 1)*(2*x - 1)*(2*x - 1))
/ (2.006022274 - 2.343685222*x + .3316406750*(2*x - 1)*(2*x - 1) +
.02607135626*(2*x - 1)*(2*x - 1)*(2*x - 1));
}

If you're using Microsoft VC++, here's an inline __asm x87 FPU code version without all the CRT filler, error checks, etc. and unlike the earliest classic ASM code you can find, it uses a FMUL instead of the slower FDIV. It compiles/works with Microsoft VC++ 2005 Express/Pro what I always stick with for various reasons.
It's a little tricky to setup a function with "__declspec(naked)/__fastcall", pull parameters correctly, handle stack, so not for the faint of heart. If it fails to compile with errors on your version, don't bother unless you're experienced. Or ask me, I can rewrite it in a slightly friendlier __asm{} block. I would manually inline this if it's a critical part of a function in a loop for further performance gains if need be.
extern float __fastcall fs_acos(float x);
extern double __fastcall fs_Acos(double x);
// ACOS(x)- Computes the arccosine of ST(0)
// Allowable range: -1<=x<=+1
// Derivative Formulas: acos(x) = atan(sqrt((1 - x * x)/(x * x))) OR
// acos(x) = atan2(sqrt(1 - x * x), x)
// e.g. acos(-1.0) = 3.1415927
__declspec(naked) float __fastcall fs_acos(float x) { __asm {
FLD DWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
FLD1 ;// Load 1.0
FADD ST, ST(1) ;// Compute 1.0 + 'x'
FLD1 ;// Load 1.0
FSUB ST, ST(2) ;// Compute 1.0 - 'x'
FMULP ST(1), ST ;// Compute (1-x) * (1+x)
FSQRT ;// Compute sqrt(result)
FXCH ST(1)
FPATAN ;// Compute arctangent of result / 'x' (ST1/ST0)
RET 4
}}
__declspec(naked) double __fastcall fs_Acos(double x) { __asm { //
FLD QWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
FLD1 ;// Load 1.0
FADD ST, ST(1) ;// Compute (1.0 + 'x')
FLD1 ;// Load 1.0
FSUB ST, ST(2) ;// Compute (1.0 - 'x')
FMULP ST(1), ST ;// Compute (1-x) * (1+x)
FSQRT ;// Compute sqrt((1-x) * (1+x))
FXCH ST(1)
FPATAN ;// Compute arctangent of result / 'x' (ST1/ST0)
RET 8
}}

Unfortunately I do not have enough reputation to comment.
Here is a small modification of Nvidia's function, that deals with the fact that numbers that should be <= 1 while preserving performance as much as possible.
It may be important since rounding errors can lead number that should be 1.0 to be (oh so slightly) larger than 1.0.
double safer_acos(double x) {
double negate = double(x < 0);
x = abs(x);
x -= double(x>1.0)*(x-1.0); // <- equivalent to min(1.0,x), but faster
double ret = -0.0187293;
ret = ret * x;
ret = ret + 0.0742610;
ret = ret * x;
ret = ret - 0.2121144;
ret = ret * x;
ret = ret + 1.5707288;
ret = ret * sqrt(1.0-x);
ret = ret - 2 * negate * ret;
return negate * 3.14159265358979 + ret;
// In a single line (no gain using gcc)
//return negate * 3.14159265358979 + (((((-0.0187293*x)+ 0.0742610)*x - 0.2121144)*x + 1.5707288)* sqrt(1.0-x))*(1.0-2.0*negate);
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How do you process exp() with SSE2? - c++

Related

SSE to C++ code

Difference between ldexp(1, x) and exp2(x)

How can I add together two SSE registers

Optimizations for pow() with const non-integer exponent?

Fast Arc Cos algorithm?

Categories

Resources