C fmod function: Floating point error and optimization - c++

I'm trying to calculate the true course from one point to another on the surface of the earth in as few CPU cycles as possible. The result should be a double 0 <= tc < 360, however in a few special cases I get the result 360 (it should be reported as 0). I realize that this is due to machine precision when working with fmod and floating point numbers, but what would be the most efficient workaround for the problem?
#include <stdio.h>
#include <math.h>
#define EPS 1e-15 // EPS a small number ~ machine precision
#define R2D 57.295779513082320876798154814105 //multiply radian with R2D to get degrees
#define D2R 0.01745329251994329576923690768489 //multiply degrees with D2R to get radians
#define TWO_PI 6.283185307179586476925286766559 //2*Pi
/*----------------------------------------------------------------------------
* Course between points
* We obtain the initial course, tc1, (at point 1) from point 1 to point 2
* by the following. The formula fails if the initial point is a pole. We can
* special-case this with an IF statement
----------------------------------------------------------------------------
Implementation
Argument 1: INPUT - Pointer to double containing Latitude of point 1 in degrees
Argument 2: INPUT - Pointer to double containing Longitude of point 1 in degrees
Argument 3: INPUT - Pointer to double containing Latitude of point 2 in degrees
Argument 4: INPUT - Pointer to double containing Longitude of point 2 in degrees
RETURN: Double containing initial course in degrees from point1 to point 2
--------------------------------------------------------------------------*/
double _stdcall CourseInitial (double *lat1, double *lon1, double *lat2, double *lon2)
{
    double radLat1 = D2R * *lat1;
    double radLon1 = D2R * *lon1;
    double radLat2 = D2R * *lat2;
    double radLon2 = D2R * *lon2;
    double tc = 0;
    if (cos(radLat1) < EPS) { // EPS a small number ~ machine precision
        if (radLat1 > 0) {
            tc = 180; // starting from N pole
        } else {
            tc = 0;   // starting from S pole
        }
    } else {
        // Calculate true course [180, 540]
        tc = R2D * atan2(sin(radLon2-radLon1),
                         cos(radLat1) * tan(radLat2) - sin(radLat1) * cos(radLon2-radLon1)
                        ) + 360;
    }
    // Return 0 <= true course < 360
    return fmod(tc, 360);
}
int main(void)
{
    double lat1 = 89;
    double lon1 = 17;
    double lat2 = 68;
    double lon2 = -163;
    double tc = 0;
    tc = CourseInitial(&lat1, &lon1, &lat2, &lon2);
    printf("The course from point 1 to 2 is: %.5f", tc);
    return 0;
}
Output:
The course from point 1 to 2 is: 360.00000

A comparison of the given value of the constant D2R with the closest 64-bit and 80-bit floating point numbers.
80 bit 0x3FF98EFA351294E9C8AF = 1.745329251994329577083321213687439055206596094649285078048706054e-2
D2R = 1.745329251994329576923690768489e-2
80 bit 0x3FF98EFA351294E9C8AE = 1.74532925199432957691391462423657898739293159451335668563842773e-2
64 bit 0x3F91DF46A2529D3A = 1.745329251994329894381863255148346070200204849243164e-2
D2R = 1.745329251994329576923690768489e-2
64 bit 0x3F91DF46A2529D39 = 1.7453292519943295474371680597869271878153085708618164e-2
These are the values chosen by my compiler:
0x3FF98EFA351294E9C8AE = 0 011111111111001 1000111011111010001101010001001010010100111010011100100010101110
0x3F91DF46A2529D39 = 0 01111111001 0001110111110100011010100010010100101001110100111001
The conversion was performed with tools and information on the web pages listed below:
http://www.exploringbinary.com/binary-converter
http://apfloat.appspot.com
http://en.wikipedia.org/wiki/Extended_precision
http://en.wikipedia.org/wiki/Double-precision_floating-point_format

Location of problem
The difference in the result between the two different levels of optimization occurs
when calculating radLon2-radLon1. The result of this calculation is shown here.
-O0 radLon2-radLon1 = 0xC00921FB54442D19
-O1 radLon2-radLon1 = 0xC00921FB54442D18
The difference is in the least significant bit bringing the -O0 result past the real value of pi.
-O0 radLon2-radLon1 = -3.14159265358979356008717331861
Pi with 50 decimals = 3.14159265358979323846264338327950288419716939937510
-O1 radLon2-radLon1 = -3.14159265358979311599796346854
The -O0 calculation loads one value onto the Floating Point Stack and subtracts the other with
the FSUB instruction (line 004014a8). The -O1 calculation loads both values onto the Floating Point Stack
and subtracts them with the FSUBP instruction (line 00401480).
Disassembly after compilation with no optimization -O0
(...)
004014a5: fld QWORD PTR [ebp-0x30] // ST(0) = radLon2 (from [ebp-0x30])
004014a8: fsub QWORD PTR [ebp-0x20] // ST(0) = ST(0) - radLon1 (from [ebp-0x20])
004014ab: fstp QWORD PTR [esp] // [esp] = (radLon2-radLon1) = 0xC00921FB54442D19
004014ae: call 0x4080d0 <sin> // ST(0) = sin(radLon2-radLon1)
(...)
-------------------------------------------------------------------------------------
Significant values:
radLon2 = 0xC006C253F2B53437 (-2.84488668075075734620327239099 )
radLon1 = 0x3FD2FD3B0C77C70D ( 0.296705972839036047350447233839 )
radLon2-radLon1 = 0xC00921FB54442D19 (-3.14159265358979356008717331861 )
sin(radLon2-radLon1) = 0x3CB72D0000000000 ( 3.21628574467824890348310873378e-16 )
Later atan2(y, x) is calculated with these values
x = 0x3FF0B04ED1755590 ( 1.04304391688978981278523860965 )
y = 0x3CB72D0000000000 ( 3.21628574467824890348310873378e-16 )
atan2(y, x) = 0x3CB63828CAA39659 ( 3.08355735803412799607014393888e-16 )
Disassembly after compilation with optimization -O1
(...)
29 double radLon2 = D2R * *lon2;
0040146c: fld QWORD PTR ds:0x40a0c0 // ST(0) = D2R (from [ds:0x40a0c0])
00401472: fmul QWORD PTR [esp+0x30] // ST(0) = ST(0) * lon2 (from [ESP+0x30])
// ST(0) = -2.8448866807507573
27 double radLon1 = D2R * *lon1;
00401476: fld QWORD PTR ds:0x40a0c0 // ST(0) = D2R (from [ds:0x40a0c0])
0040147c: fmul QWORD PTR [esp+0x20] // ST(0) = ST(0) * lon1 (from [ESP+0x20])
// ST(0) = 0.29670597283903605 (radLon1)
// ST(1) = -2.8448866807507573 (radLon2)
00401480: fsubp st(1),st // ST(1) = ST(1)-ST(0) then POP stack
00401482: fst QWORD PTR [esp+0x20] // [esp+0x20] = (radLon2-radLon1) = 0xC00921FB54442D18
(...)
40 tc = R2D * atan2(sin(radLon2-radLon1),
00401492: fld QWORD PTR [esp+0x20] // ST(0)=(radLon2-radLon1)
00401496: fstp QWORD PTR [esp] // [esp]=(radLon2-radLon1)
00401499: call 0x4080e0 <sin> // ST(0)=sin(radLon2-radLon1)
(...)
-------------------------------------------------------------------------------------
Significant values
radLon2 = 0xC006C253F2B53437 (-2.84488668075075734620327239099 )
radLon1 = 0x3FD2FD3B0C77C70D ( 0.296705972839036047350447233839 )
radLon2-radLon1 = 0xC00921FB54442D18 (-3.14159265358979311599796346854 )
sin(radLon2-radLon1) = 0xBCA1A60000000000 (-1.22460635382237725821141793858e-16 )
Later atan2(y, x) is calculated with these values
x = 0x3FF0B04ED1755590 ( 1.04304391688978981278523860965 )
y = 0xBCA1A60000000000 (-1.22460635382237725821141793858e-16 )
atan2(y, x) = 0xBCA0EB8D90F27437 (-1.1740697913027295863036855646E-16 )
=====================================================================================
Attempted solution
In CourseInitial() radLon1 and radLon2 are not used independently, so I tried the following.
double radDeltaLon = D2R * (*lon2-*lon1);
(...)
tc = R2D * atan2(sin(radDeltaLon),
cos(radLat1) * tan(radLat2) - sin(radLat1) * cos(radDeltaLon)
) + 360;
This did not work. Debugging showed that the problematic value close to Pi showed up
in another place in the code and the end result was the same.
One solution
I appended an L to the end of each of the defined constants, converting them to long doubles (80-bit floating point numbers). This is the same precision that the CPU has in its Floating Point Registers, and it solved the problem in some cases.
#define R2D 57.295779513082320876798154814105L //multiply radian with R2D to get degrees
#define D2R 0.01745329251994329576923690768489L //multiply degrees with D2R to get radians
#define TWO_PI 6.283185307179586476925286766559L //2*Pi
Final solution
// Calculate true course [-180, 180)
tc = atan2(sin(radDeltaLon),
           cos(radLat1) * tan(radLat2) - sin(radLat1) * cos(radDeltaLon)
          );
if (fabs(tc) < EPS) {
    tc = 0;    // Prevents fmod(tc, 360) from returning 360 due to rounding error
} else {
    tc *= R2D; // Convert to degrees after tc has been checked for machine precision
    tc += 360; // tc [180, 540)
}
return fmod(tc, 360); // returns tc [0, 360)

Related

Fast weighted mean & variance of 10 bins

I would like to speed up a part of my code, but I don't think there is a better way to do the following calculation:
float invSum = 1.0f / float(sum);
for (int i = 0; i < numBins; ++i)
{
    histVec[i] *= invSum;
}
for (int i = 0; i < numBins; ++i)
{
    float midPoint = (float)i*binSize + binOffset;
    float f = histVec[i];
    fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
    float midPoint = (float)i*binSize + binOffset;
    float f = histVec[i];
    float diff = midPoint - fmean;
    var += f * hwk::sqr(diff);
}
numBins in the for-loops is typically 10, but this bit of code is called very often (at a frequency of 80 frames per second, called at least 8 times per frame).
I tried to use some SSE methods but it only slightly speeds up this code. I think I could avoid calculating midPoint twice but I am not sure how. Is there a better way to compute fmean and var?
Here is the SSE code:
// make hist contain a multiple of 4 valid values
for (int i = numBins; i < ((numBins + 3) & ~3); i++)
hist[i] = 0;
// find sum of bins in inHist
__m128i iSum4 = _mm_set1_epi32(0);
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
iSum4 = _mm_add_epi32(iSum4, a);
}
int iSum = iSum4.m128i_i32[0] + iSum4.m128i_i32[1] + iSum4.m128i_i32[2] + iSum4.m128i_i32[3];
//float stdevB, meanB;
if (iSum == 0.0f)
{
stdev = 0.0;
mean = 0.0;
}
else
{
// Set histVec to normalised values in inHist
__m128 invSum = _mm_set1_ps(1.0f / float(iSum));
for (int i = 0; i < numBins; i += 4)
{
__m128i a = *((__m128i *) &inHist[i]);
__m128 b = _mm_cvtepi32_ps(a);
__m128 c = _mm_mul_ps(b, invSum);
_mm_store_ps(&histVec[i], c);
}
float binSize = 256.0f / (float)numBins;
float halfBinSize = binSize * 0.5f;
float binOffset = halfBinSize;
__m128 binSizeMask = _mm_set1_ps(binSize);
__m128 binOffsetMask = _mm_set1_ps(binOffset);
__m128 fmean4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i += 4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(histVec4, midPoint4));
}
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 01 23 01 23
fmean4 = _mm_hadd_ps(fmean4, fmean4); // 0123 0123 0123 0123
float fmean = fmean4.m128_f32[0];
//fmean4 = _mm_set1_ps(fmean);
__m128 var4 = _mm_set1_ps(0.0f);
for (int i = 0; i < numBins; i+=4)
{
__m128i idx4 = _mm_set_epi32(i + 3, i + 2, i + 1, i);
__m128 idx_m128 = _mm_cvtepi32_ps(idx4);
__m128 histVec4 = _mm_load_ps(&histVec[i]);
__m128 midPoint4 = _mm_add_ps(_mm_mul_ps(idx_m128, binSizeMask), binOffsetMask);
__m128 diff4 = _mm_sub_ps(midPoint4, fmean4);
var4 = _mm_add_ps(var4, _mm_mul_ps(histVec4, _mm_mul_ps(diff4, diff4)));
}
var4 = _mm_hadd_ps(var4, var4); // 01 23 01 23
var4 = _mm_hadd_ps(var4, var4); // 0123 0123 0123 0123
float var = var4.m128_f32[0];
stdev = sqrt(var);
mean = fmean;
}
I might be doing something wrong, since I don't get as much improvement as I was expecting.
Is there something in the SSE code that might possibly slow down the process?
(editor's note: the SSE part of this question was originally asked as https://stackoverflow.com/questions/31837817/foor-loop-optimisation-sse-comparison, which was closed as a duplicate.)
I only just realized that your data array starts out as an array of int, since you didn't have declarations in your code. I can see in the SSE version that you start with integers, and only store a float version of it later.
Keeping everything integer will let us do the loop-counter-vector with a simple ivec = _mm_add_epi32(ivec, _mm_set1_epi32(4)); Aki Suihkonen's answer has some transformations that should let it optimize a lot better. Especially, the auto-vectorizer should be able to do more even without -ffast-math. In fact, it does quite well. You could do better with intrinsics, esp. saving some vector 32bit multiplies and shortening the dependency chain.
My old answer, based on just trying to optimize your code as written, assuming FP input:
You may be able to combine all 3 loops into one, using the algorithm #Jason linked to. It might not be profitable, though, since it involves a division. For small numbers of bins, probably just loop multiple times.
Start by reading the guides at http://agner.org/optimize/. A couple of the techniques in his Optimising Assembly guide will speed up your SSE attempt (which I edited into this question for you).
combine your loops where possible, so you do more with the data for each time it's loaded / stored.
multiple accumulators to hide the latency of loop-carried dependency chains. (Even FP add takes 3 cycles on recent Intel CPUs.) This won't apply for really short arrays like your case.
instead of int->float conversion on every iteration, use a float loop counter as well as the int loop counter. (add a vector of _mm_set1_ps(4.0f) every iteration.) _mm_set... with variable args is something to avoid in loops, when possible. It takes several instructions (esp. when each arg to setr has to be calculated separately.)
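A minimal sketch (untested) of that last point, reusing the variable names from the SSE code in the question: keep a vector of midpoints and advance it by 4*binSize each iteration, so there is no per-iteration integer-to-float conversion and no variable-argument _mm_set.
__m128 step4     = _mm_set1_ps(4.0f * binSize);
__m128 midPoint4 = _mm_setr_ps(binOffset,
                               binOffset + binSize,
                               binOffset + 2*binSize,
                               binOffset + 3*binSize);
__m128 fmean4 = _mm_setzero_ps();
for (int i = 0; i < numBins; i += 4)
{
    __m128 histVec4 = _mm_load_ps(&histVec[i]);
    fmean4 = _mm_add_ps(fmean4, _mm_mul_ps(histVec4, midPoint4));
    midPoint4 = _mm_add_ps(midPoint4, step4);   // float loop counter: one addps per iteration
}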
gcc -O3 manages to auto-vectorize the first loop, but not the others. With -O3 -ffast-math, it auto-vectorizes more. -ffast-math allows it to do FP operations in a different order than the code specifies. e.g. adding up the array in 4 elements of a vector, and only combining the 4 accumulators at the end.
Telling gcc that the input pointer is aligned by 16 lets gcc auto-vectorize with a lot less overhead (no scalar loops for unaligned portions).
// return mean
float fpstats(float histVec[], float sum, float binSize, float binOffset, long numBins, float *variance_p)
{
numBins += 3;
numBins &= ~3; // round up to multiple of 4. This is just a quick hack to make the code fast and simple.
histVec = (float*)__builtin_assume_aligned(histVec, 16);
float invSum = 1.0f / float(sum);
float var = 0, fmean = 0;
for (int i = 0; i < numBins; ++i)
{
histVec[i] *= invSum;
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
fmean += f * midPoint;
}
for (int i = 0; i < numBins; ++i)
{
float midPoint = (float)i*binSize + binOffset;
float f = histVec[i];
float diff = midPoint - fmean;
// var += f * hwk::sqr(diff);
var += f * (diff * diff);
}
*variance_p = var;
return fmean;
}
gcc generates some weird code for the 2nd loop.
# broadcasting fmean after the 1st loop
subss %xmm0, %xmm2 # fmean, D.2466
shufps $0, %xmm2, %xmm2 # vect_cst_.16
.L5: ## top of 2nd loop
movdqa %xmm3, %xmm5 # vect_vec_iv_.8, vect_vec_iv_.8
cvtdq2ps %xmm3, %xmm3 # vect_vec_iv_.8, vect__32.9
movq %rcx, %rsi # D.2465, D.2467
addq $1, %rcx #, D.2465
mulps %xmm1, %xmm3 # vect_cst_.11, vect__33.10
salq $4, %rsi #, D.2467
paddd %xmm7, %xmm5 # vect_cst_.7, vect_vec_iv_.8
addps %xmm2, %xmm3 # vect_cst_.16, vect_diff_39.15
mulps %xmm3, %xmm3 # vect_diff_39.15, vect_powmult_53.17
mulps (%rdi,%rsi), %xmm3 # MEM[base: histVec_10, index: _107, offset: 0B], vect__41.18
addps %xmm3, %xmm4 # vect__41.18, vect_var_42.19
cmpq %rcx, %rax # D.2465, bnd.26
ja .L8 #, ### <--- This is insane.
haddps %xmm4, %xmm4 # vect_var_42.19, tmp160
haddps %xmm4, %xmm4 # tmp160, vect_var_42.21
.L2:
movss %xmm4, (%rdx) # var, *variance_p_44(D)
ret
.p2align 4,,10
.p2align 3
.L8:
movdqa %xmm5, %xmm3 # vect_vec_iv_.8, vect_vec_iv_.8
jmp .L5 #
So instead of just jumping back to the top every iteration, gcc decides to jump ahead to copy a register, and then unconditionally jmp back to the top of the loop. The uop loop buffer may remove the front-end overhead of this silliness, but gcc should have structured the loop so it didn't copy xmm5->xmm3 and then xmm3->xmm5 every iteration, because that's silly. It should have the conditional jump just go to the top of the loop.
Also note the technique gcc used to get a float version of the loop counter: start with an integer vector of 1 2 3 4, and add set1_epi32(4). Use that as an input for packed int->float cvtdq2ps. On Intel HW, that instruction runs on the FP-add port, and has 3 cycle latency, same as packed FP add. gcc prob. would have done better to just add a vector of set1_ps(4.0), even though this creates a 3-cycle loop-carried dependency chain, instead of 1 cycle vector int add, with a 3 cycle convert forking off on every iteration.
small iteration count
You say this will often be used on exactly 10 bins? A specialized version for just 10 bins could give a big speedup, by avoiding all the loop overhead and keeping everything in registers.
With that small a problem size, you can have the FP weights just sitting there in memory, instead of re-computing them with integer->float conversion every time.
Also, 10 bins is going to mean a lot of horizontal operations relative to the amount of vertical operations, since you only have 2 and a half vectors worth of data.
If exactly 10 is really common, specialize a version for that. If under-16 is common, specialize a version for that. (They can and should share the const float weights[] = { 0.0f, 1.0f, 2.0f, ...}; array.)
You probably will want to use intrinsics for the specialized small-problem versions, rather than auto-vectorization.
Having zero-padding after the end of the useful data in your array might still be a good idea in your specialized version(s). However, you can load the last 2 floats and clear the upper 64b of a vector register with a movq instruction. (__m128i _mm_cvtsi64_si128 (__int64 a)). Cast this to __m128 and you're good to go.
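As a rough illustration (untested, and not from the original answer), here is what a specialized 10-bin version might look like, assuming a 16-byte-aligned, already-normalized float histogram and the 256-wide range from the question (binSize = 25.6, binOffset = 12.8). The names fstats10, weights10 and hsum are made up for this sketch:
#include <immintrin.h>   // SSE3 intrinsics (_mm_hadd_ps) and friends
#include <math.h>

// Midpoints for binSize = 256/10 = 25.6, binOffset = 12.8, zero-padded to 12 entries.
static const float weights10[12] __attribute__((aligned(16))) =
    { 12.8f, 38.4f, 64.0f, 89.6f, 115.2f, 140.8f, 166.4f, 192.0f, 217.6f, 243.2f, 0.0f, 0.0f };

static float hsum(__m128 v)                  // horizontal sum of the 4 lanes
{
    v = _mm_hadd_ps(v, v);
    v = _mm_hadd_ps(v, v);
    return _mm_cvtss_f32(v);
}

// histVec: 10 normalized floats (already multiplied by invSum), 16-byte aligned.
float fstats10(const float *histVec, float *stdev)
{
    __m128 h0 = _mm_load_ps(&histVec[0]);
    __m128 h1 = _mm_load_ps(&histVec[4]);
    // movq load: brings in histVec[8..9] and zeroes the upper 64 bits.
    __m128 h2 = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i *)&histVec[8]));

    __m128 w0 = _mm_load_ps(&weights10[0]);
    __m128 w1 = _mm_load_ps(&weights10[4]);
    __m128 w2 = _mm_load_ps(&weights10[8]);

    // mean = sum(h[i] * midpoint[i]); the padded lanes are zero in both h2 and w2.
    __m128 acc = _mm_mul_ps(h0, w0);
    acc = _mm_add_ps(acc, _mm_mul_ps(h1, w1));
    acc = _mm_add_ps(acc, _mm_mul_ps(h2, w2));
    float mean = hsum(acc);

    // var = sum(h[i] * (midpoint[i] - mean)^2); zero h2 lanes kill the padded terms.
    __m128 m  = _mm_set1_ps(mean);
    __m128 d0 = _mm_sub_ps(w0, m), d1 = _mm_sub_ps(w1, m), d2 = _mm_sub_ps(w2, m);
    acc = _mm_mul_ps(h0, _mm_mul_ps(d0, d0));
    acc = _mm_add_ps(acc, _mm_mul_ps(h1, _mm_mul_ps(d1, d1)));
    acc = _mm_add_ps(acc, _mm_mul_ps(h2, _mm_mul_ps(d2, d2)));
    *stdev = sqrtf(hsum(acc));
    return mean;
}
The last load is the movq mentioned above: it reads histVec[8..9] and zeroes the upper two lanes, so the zero padding falls out of the arithmetic for free.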
As peterchen mentioned, these operations are very trivial for current desktop processors. The function is linear, i.e. O(n). What's the typical size of numBins? If it's rather large (say, over 1000000), parallelization will help. This could be simple using a library like OpenMP. If numBins starts approaching MAXINT, you may consider GPGPU as an option (CUDA/OpenCL).
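A minimal sketch (untested) of that OpenMP idea, mirroring the mean loop from the question; the function name fmean_omp is made up here, and this only pays off for very large numBins:
#include <omp.h>

// Mean of the normalized histogram, parallelized across bins with a reduction.
float fmean_omp(const float *histVec, int numBins, float binSize, float binOffset)
{
    float fmean = 0.0f;
    #pragma omp parallel for reduction(+:fmean)
    for (int i = 0; i < numBins; ++i)
    {
        float midPoint = (float)i * binSize + binOffset;
        fmean += histVec[i] * midPoint;   // histVec assumed already scaled by invSum
    }
    return fmean;
}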
All that considered, you should try profiling your application. Chances are good that, if there is a performance constraint, it's not in this method. Michael Abrash's definition of "high-performance code" has helped me greatly in determining if/when to optimize:
Before we can create high-performance code, we must understand what high performance is. The objective (not always attained) in creating high-performance software is to make the software able to carry out its appointed tasks so rapidly that it responds instantaneously, as far as the user is concerned. In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless. Notice that the above definition most emphatically does not say anything about making the software as fast as possible.
Reference:
The Graphics Programming Black Book
The overall function to be calculated is
std = sqrt(SUM_i { hist[i]/sum * (midpoint_i - mean_midpoint)^2 })
Using the identity
Var (aX + b) = Var (X) * a^2
one can reduce the complexity of the overall operation considerably
1) midpoint of a bin doesn't need offset b
2) no need to prescale the bin array elements with the bin width
and
3) no need to normalize histogram entries with reciprocal of sum
The optimized calculation goes as follows
float calcVariance(int histBin[], float binWidth)
{
int i;
int sum = 0;
int mid = 0;
int var = 0;
for (i = 0; i < 10; i++)
{
sum += histBin[i];
mid += i*histBin[i];
}
float inv_sum = 1.0f / (float)sum;
float mid_sum = mid * inv_sum;
for (i = 0; i < 10; i++)
{
int diff = i * sum - mid; // because mid is prescaled by sum
var += histBin[i] * diff * diff;
}
return sqrt(float(var) / (float)(sum * sum * sum)) * binWidth;
}
Minor changes are required if it's float histBin[];
Also I second padding histBin size to a multiple of 4 for better vectorization.
EDIT
Another way to calculate this with floats in the inner loop:
float inv_sum = 1.0f / (float)sum;
float mid_sum = mid * inv_sum;
float var = 0.0f;
for (i = 0; i < 10; i++)
{
float diff = (float)i - mid_sum;
var += (float)histBin[i] * diff * diff;
}
return sqrt(var * inv_sum) * binWidth;
Perform the scaling on the global results only and keep integers as long as possible.
Group all computation in a single loop, using Σ(X-m)²/N = ΣX²/N - m².
// Accumulate the histogram
int mean= 0, var= 0;
for (int i = 0; i < numBins; ++i)
{
mean+= i * histVec[i];
var+= i * i * histVec[i];
}
// Compute the reduced mean and variance
float fmean= (float(mean) / sum);
float fvar= float(var) / sum - fmean * fmean;
// Rescale
fmean= fmean * binSize + binOffset;
fvar= fvar * binSize * binSize;
The required integer type will depend on the maximum value in the bins. The SSE optimization of the loop can exploit the _mm_madd_epi16 instruction.
If the number of bins is as small as 10, consider fully unrolling the loop. Precompute the i and i² vectors in a table.
In the lucky case that the data fits in 16 bits and the sums in 32 bits, the accumulation is done with something like
static short I[16]= { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0 };
static short I2[16]= { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 0, 0, 0, 0, 0, 0 };
// First group
__m128i i= _mm_load_si128((__m128i*)&I[0]);
__m128i i2= _mm_load_si128((__m128i*)&I2[0]);
__m128i h= _mm_load_si128((__m128i*)&inHist[0]);
__m128i mean= _mm_madd_epi16(i, h);
__m128i var= _mm_madd_epi16(i2, h);
// Second group
i= _mm_load_si128((__m128i*)&I[8]);
i2= _mm_load_si128((__m128i*)&I2[8]);
h= _mm_load_si128((__m128i*)&inHist[8]);
mean= _mm_add_epi32(mean, _mm_madd_epi16(i, h));
var= _mm_add_epi32(var, _mm_madd_epi16(i2, h));
CAUTION: unchecked

Floating point maxing out loop doesn't terminate in D, works in C++

I have two similar programs one in C++ and another in D.
The compilation is on Windows 7 64-bit, to 64-bit binaries.
C++ version, VS 2013:
#include <iostream>
#include <string>
int main(int argc, char* argv[])
{
    float eps = 1.0f;
    float f = 0.0f;
    while (f + eps != f)
        f += 1.0f;
    std::cout << "eps = " + std::to_string(eps) + ", max_f = " + std::to_string(f) << std::endl;
    return 0;
}
D version, DMD v2.066.1:
import std.stdio;
import std.conv;
int main(string[] argv)
{
    float eps = 1.0f;
    float f = 0.0f;
    while (f + eps != f)
        f += 1.0f;
    writeln("eps = " ~ to!string(eps) ~ ", max_f = " ~ to!string(f));
    return 0;
}
The C++ version works as expected and finds that f + eps == f when f = 16777216.
But the D version hangs forever. When I put a breakpoint there, I see that in the D version f is also 16777216 (after running for some time) and the Watch window (I use VisualD) shows that (f + eps != f) is 'false', so the loop should terminate, but that is not the case at runtime.
I think assembly could give the answer but I'm not very good with it.
I'm new to D, so it may be that I misused the language/compiler (I compiled with DMD just as 'dmd test.d' without additional options, and also from VS with VisualD with default options). Any ideas what could be wrong with the D version of the program? Thanks!
Disassembly:
C++:
000000013F7D1410 mov rax,rsp
000000013F7D1413 push rbp
000000013F7D1414 lea rbp,[rax-5Fh]
000000013F7D1418 sub rsp,0E0h
000000013F7D141F mov qword ptr [rbp+17h],0FFFFFFFFFFFFFFFEh
000000013F7D1427 mov qword ptr [rax+8],rbx
000000013F7D142B movaps xmmword ptr [rax-18h],xmm6
000000013F7D142F xorps xmm1,xmm1
float eps = 1.0f;
float f = 0.0f;
000000013F7D1432 movss xmm6,dword ptr [__real#3f800000 (013F7D67E8h)]
000000013F7D143A nop word ptr [rax+rax]
f += 1.0f;
000000013F7D1440 addss xmm1,xmm6
while (f + eps != f)
000000013F7D1444 movaps xmm0,xmm1
000000013F7D1447 addss xmm0,xmm6
000000013F7D144B ucomiss xmm0,xmm1
000000013F7D144E jp main+30h (013F7D1440h)
000000013F7D1450 jne main+30h (013F7D1440h)
D:
000000013F761002 mov ebp,esp
000000013F761004 sub rsp,50h
{
float eps = 1.0f;
000000013F761008 xor eax,eax
000000013F76100A mov dword ptr [rbp-50h],eax
000000013F76100D movss xmm0,dword ptr [rbp-50h]
000000013F761012 movss dword ptr [f],xmm0
float f = 0.0f;
while (f + eps != f)
f += 1.0f;
000000013F761017 movss xmm1,dword ptr [__NULL_IMPORT_DESCRIPTOR+1138h (013F7C3040h)]
000000013F76101F movss xmm2,dword ptr [f]
000000013F761024 addss xmm2,xmm1
000000013F761028 movss dword ptr [f],xmm2
000000013F76102D fld dword ptr [f]
000000013F761030 fadd dword ptr [__NULL_IMPORT_DESCRIPTOR+1138h (013F7C3040h)]
000000013F761036 fld dword ptr [f]
000000013F761039 fucomip st,st(1)
000000013F76103B fstp st(0)
000000013F76103D jne D main+17h (013F761017h)
000000013F76103F jp D main+17h (013F761017h)
Summary
Accept harold's answer that program behavior is due to the mixed FPU and SSE usage.
Here's a summary of what happens in the D assembly snippet. In fact, the loop will run forever.
SSE behaves strictly according to IEEE-754: when f reaches 16777216.0 and we add 1.0 to this value (f += 1.0f), we still obtain 16777216.0 in the xmm2 register, and then we store it to memory.
The (f + eps != f) expression is computed on the FPU. Since FPU registers have enough precision, (f+eps) results in 16777217.0. If we stored this result back to memory into a float variable, we would get the expected value 16777216.0 (since 16777217.0 is not representable as a float), (f + eps != f) would be 'false' and the loop would terminate. But we do not store any numbers back to memory and perform the comparison on the FPU (since we have both operands there). It means that we compare one number that is computed strictly according to IEEE-754 (f) and another that is computed with 80-bit accuracy (f+eps). 16777216.0 != 16777217.0 and the loop runs forever.
I'm not an expert in this area, but to me it looks like doing floating point with SSE instructions is more robust, as was demonstrated in the C++ version of the program.
Update
I had a discussion on the D forum http://forum.dlang.org/thread/ucnayusylmpvkpcnbhgh#forum.dlang.org
It turned out that the program behaves correctly - according to the language specification, intermediate calculations can be performed with higher accuracy.
The robust implementation for any D compiler is:
import std.stdio;
int main()
{
    const float eps = 1.0f;
    const float step = 1.0;
    float f = 0.0f;
    float fPlusEps = f + eps;
    while (f != fPlusEps)
    {
        f += step;
        fPlusEps = f + eps;
    }
    writeln("eps = ", eps, ", max_f = ", f);
    return 0;
}
Mixed FPU and SSE code, that's .. really strange. I see absolutely no reason to implement it this way.
But they have, and the result is that f + eps != f is evaluated with 80bit extended precision, while
f += 1.0f is evaluated using 32bit floats.
That means the loop can never end, since f will stop going up before the value that makes
f + eps != f false (which, in 80bit precision, is huge) is reached.
Trying to break a loop with != or == on floating point values is asking for trouble.
The different behavior is most likely due to the float to double to 80-bit internal floating point conversion the compiler may adopt when passing values to the FPU.
In particular, when extending the mantissa, some compilers or optimizers can decide to leave the less significant bits "random" instead of zeroed. So 1.0f, when given to the FPU, may become 1.000000000000000000000012134432, which - to float precision - is still 1.0; but when 1.000000000000000000000012134432 and 1.000000000000000000000089544455 (the two tails are random) are compared by the FPU, they look different.
You should check how the C++ and D compilers treat the floating point extension/reduction and, if necessary, configure the appropriate switches: if the two compilers are not from the same manufacturer, they have probably made different choices for their respective defaults.

Optimizations for pow() with const non-integer exponent?

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.
My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:
I have two constant exponents: 2.4 and 1/2.4.
When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!
A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.
I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?
Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebychev terms. It's best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.
#include <stdlib.h>
#include <math.h>
// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebychev polynomial coefs.
double pow512norm (
double x)
{
static const int N = 8;
// Chebychev polynomial terms.
// Non-zero terms calculated via
// integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
// from -1 to 1
// Zeroth term is similar except it uses 1/pi rather than 2/pi.
static const double Cn[N] = {
1.1758200232996901923,
0.16665763094889061230,
-0.0083154894939042125035,
0.00075187976780420279038,
// Wolfram alpha doesn't want to compute the remaining terms
// to more precision (it times out).
-0.0000832402,
0.0000102292,
-1.3401e-6,
1.83334e-7};
double Tn[N];
double u = 2.0*x - 3.0;
Tn[0] = 1.0;
Tn[1] = u;
for (int ii = 2; ii < N; ++ii) {
Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
}
double y = 0.0;
for (int ii = N-1; ii >= 0; --ii) {
y += Cn[ii]*Tn[ii];
}
return y;
}
// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
double x)
{
static const double pow2_512[12] = {
1.0,
pow(2.0, 5.0/12.0),
pow(4.0, 5.0/12.0),
pow(8.0, 5.0/12.0),
pow(16.0, 5.0/12.0),
pow(32.0, 5.0/12.0),
pow(64.0, 5.0/12.0),
pow(128.0, 5.0/12.0),
pow(256.0, 5.0/12.0),
pow(512.0, 5.0/12.0),
pow(1024.0, 5.0/12.0),
pow(2048.0, 5.0/12.0)
};
double s;
int iexp;
s = frexp (x, &iexp);
s *= 2.0;
iexp -= 1;
div_t qr = div (iexp, 12);
if (qr.rem < 0) {
qr.quot -= 1;
qr.rem += 12;
}
return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}
Addendum: What's going on here?
Per request, the following explains how the above code works.
Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)
The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:
x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv)^a=(u^a)(v^a) for positive u,v and real a.
s^(5/12) is calculated via pow512norm(s).
(2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
(2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
(2^r)^(5/12) is calculated by the lookup table pow2_512.
Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
That is exactly what the math library function ldexp does.
Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.
I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak havoc on your retirement portfolio.
Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.
Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.
I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)
pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.
I used Wolfram alpha to find the Chebyshev coefficients Cn for me. For example, this calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram alpha is free; there is a limit on how much computation it will do. It hits that limit on higher order terms. (If you buy or have access to mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)
The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)
pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.
In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).
Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.
To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.
It's not quite so simple, though. The exponent field in IEEE 754 is biased, with a value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.
x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.
But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.
The key is to match the overestimate with an underestimate, and take the average.
Compute x^0.8: four instructions, error ~ +3%.
Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
Compute x^0.4: one mulps.
Compute x^-0.2: one rsqrtps.
Compute x^2: one mulps.
Compute x^3: one mulps.
x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
Average the above: one addps, one mulps.
Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.
To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!
int main() {
__m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
std::printf( "Input: %,vg\n", x0 );
// Approx 5% accuracy from one call. Always an overestimate.
__m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
std::printf( "Direct x^2.4: %,vg\n", x1 );
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
std::printf( "1.38 x^0.8: %,vg\n", xf );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( x0, x0 );
__m128 x3 = _mm_mul_ps( x2, x0 );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
std::printf( "x^2 * x^0.4: %,vg\n", x2 );
std::printf( "x^3 / x^0.6: %,vg\n", x3 );
x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
// Final accuracy about 0.015%, 200x better than x^0.8 calculation.
std::printf( "average = %,vg\n", x2 );
}
Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .
Update with stats
I implemented a little test harness and two x^(5/12) cases corresponding to the above.
#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
__m128 pow125_4( __m128 arg ) {
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 x3 = _mm_mul_ps( x2, arg );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}
__m128 pow512_2( __m128 arg ) {
// 5/12 is too small, so compute the sqrt of 10/12 instead.
__m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}
__m128 pow512_4( __m128 arg ) {
// 5/12 is too small, so compute the 4th root of 20/12 instead.
// 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
// weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
__m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
__m128 xover = _mm_mul_ps( arg, xf );
__m128 xfm1 = _mm_rsqrt_ps( xf );
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 xunder = _mm_mul_ps( x2, xfm1 );
// sqrt2 * over + 2 * sqrt2 * under
__m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
_mm_add_ps( xover, xunder ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
return xavg;
}
__m128 mm_succ_ps( __m128 arg ) {
return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}
void test_pow( double p, __m128 (*f)( __m128 ) ) {
__m128 arg;
for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
! isfinite( _mm_cvtss_f32( f( arg ) ) );
arg = mm_succ_ps( arg ) ) ;
for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
arg = mm_succ_ps( arg ) ) ;
std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );
int n;
int const bucket_size = 1 << 25;
do {
float max_error = 0;
double total_error = 0, cum_error = 0;
for ( n = 0; n != bucket_size; ++ n ) {
float result = _mm_cvtss_f32( f( arg ) );
if ( ! isfinite( result ) ) break;
float actual = ::powf( _mm_cvtss_f32( arg ), p );
float error = ( result - actual ) / actual;
cum_error += error;
error = std::abs( error );
max_error = std::max( max_error, error );
total_error += error;
arg = mm_succ_ps( arg );
}
std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
} while ( n == bucket_size );
}
int main() {
std::printf( "4 insn x^12/5:\n" );
test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
std::printf( "14 insn x^12/5:\n" );
test_pow( 12./5, & pow125_4 );
std::printf( "6 insn x^5/12:\n" );
test_pow( 5./12, & pow512_2 );
std::printf( "14 insn x^5/12:\n" );
test_pow( 5./12, & pow512_4 );
}
Output:
4 insn x^12/5:
Domain from 1.36909e-23
error max = inf avg = inf |avg| = inf to 8.97249e-19
error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14
error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09
error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553
error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513
error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06
error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10
error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15
error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max = inf avg = nan |avg| = nan to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411
error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629
error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27
error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22
error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17
error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12
error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07
error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125
error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512
error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07
error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12
error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17
error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21
error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26
error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31
error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36
error max = 0.999329 avg = nan |avg| = nan to -0
14 insn x^5/12:
Domain from 2.64698e-23
error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32
error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
I suspect accuracy of the more accurate 5/12 is being limited by the rsqrt operation.
Ian Stephenson wrote this code which he claims outperforms pow(). He describes the idea as follows:
Pow is basically implemented using
logs: pow(a,b) = x^(log_x(a)*b), so we
need a fast log and fast exponent - it
doesn't matter what x is so we use 2.
The trick is that a floating point
number is already in a log style
format:
a = M*2^E
Taking the log of both sides gives:
log2(a)=log2(M)+E
or more simply:
log2(a)~=E
In other words if we take the floating
point representation of a number, and
extract the Exponent we've got
something that's a good starting point
as its log. It turns out that when we
do this by massaging the bit patterns,
the Mantissa ends up giving a good
approximation to the error, and it
works pretty well.
This should be good enough for simple
lighting calculations, but if you need
something better, you can then extract
the Mantissa, and use that to
calculate a quadratic correction factor
which is pretty accurate.
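A minimal sketch (untested, and not Ian Stephenson's actual code) of the bit-pattern idea described above: reinterpret the float's bits as a crude fixed-point log2, scale by the power p, and map back. It assumes a > 0 and a result that stays within float range; the name fastpow_bits is made up:
#include <stdint.h>
#include <string.h>

float fastpow_bits(float a, float p)
{
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);                  // bits ~ (log2(a) + 127) * 2^23
    float log2a = (float)bits / (1 << 23) - 127.0f;  // crude log2(a); the mantissa supplies the correction
    float log2r = p * log2a;                         // log2(a^p)
    bits = (uint32_t)((log2r + 127.0f) * (1 << 23)); // back to the biased-exponent form
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;                                        // rough a^p, fine for lighting-style work
}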
First off, using floats isn't going to buy much on most machines nowadays. In fact, doubles can be faster. Your power, 1.0/2.4, is 5/12 or 1/3*(1+1/4). Even though this is calling cbrt (once) and sqrt (twice!) it is still twice as fast as using pow(). (Optimization: -O3, compiler: i686-apple-darwin10-g++-4.2.1).
#include <math.h> // cmath does not provide cbrt; C99 does.
double xpow512 (double x) {
double cbrtx = cbrt(x);
return cbrtx*sqrt(sqrt(cbrtx));
}
This might not answer your question.
The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.
If this is the case, try using a lookup table. Something like:
__attribute__((aligned(64)))
static const unsigned short SRGB_TO_LINEAR[256] = { ... };
__attribute__((aligned(64)))
static const unsigned short LINEAR_TO_SRGB[256] = { ... };
void apply_lut(const unsigned short lut[256], unsigned char *src, ...
If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.
(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)
I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.
inline double poly7(double x, double a, double b, double c, double d,
double e, double f, double g, double h) {
double ab, cd, ef, gh, abcd, efgh, x2, x4;
x2 = x*x; x4 = x2*x2;
ab = a*x + b; cd = c*x + d;
ef = e*x + f; gh = g*x + h;
abcd = ab*x2 + cd; efgh = ef*x2 + gh;
return abcd*x4 + efgh;
}
inline double srgb_to_linear(double x) {
if (x <= 0.04045) return x / 12.92;
// Polynomial approximation of ((x+0.055)/1.055)^2.4.
return poly7(x, 0.15237971711927983387,
-0.57235993072870072762,
0.92097986411523535821,
-0.90208229831912012386,
0.88348956209696805075,
0.48110797889132134175,
0.03563925285274562038,
0.00084585397227064120);
}
inline double linear_to_srgb(double x) {
if (x <= 0.0031308) return x * 12.92;
// Piecewise polynomial approximation (divided by x^3)
// of 1.055 * x^(1/2.4) - 0.055.
if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
1224.97114922729451791383,
-100.23413743425112443219,
6.60361150127077944916,
0.06114808961060447245,
-0.00022244138470139442,
0.00000041231840827815,
-0.00000000035133685895) / (x*x*x);
return poly7(x, -0.18730034115395793881,
0.64677431008037400417,
-0.99032868647877825286,
1.20939072663263713636,
0.33433459165487383613,
-0.01345095746411287783,
0.00044351684288719036,
-0.00000664263587520855) / (x*x*x);
}
And the sollya input used to generate the polynomials:
suppressmessage(174);
f = ((x+0.055)/1.055)^2.4;
p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
s = 0.0523;
z = 3;
f = 1.055 * x^(1/2.4) - 0.055;
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
print(canonical(p));
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
Binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a). You'll have to do some analysis to decide how many terms you need for your desired accuracy.
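For illustration (untested, with a made-up name), a truncated binomial series for x^a after normalizing x into [1,2), i.e. computing (1+t)^a with t = x-1; the term count is a placeholder for the accuracy analysis mentioned above:
// (1+t)^a = 1 + a*t + a(a-1)/2! * t^2 + ...   with t = x - 1, |t| < 1
double binom_pow(double x, double a, int terms)
{
    double t = x - 1.0;                  // assumes x was normalized into [1,2)
    double sum = 1.0, term = 1.0;
    for (int k = 1; k <= terms; ++k) {
        term *= (a - (k - 1)) * t / k;   // builds C(a,k) * t^k incrementally
        sum += term;                     // converges slowly as t approaches 1
    }
    return sum;
}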
For exponents of 2.4, you could either make a lookup table for all your 2.4 values and lerp (or perhaps use a higher-order function) to fill in the in-between values if the table wasn't accurate enough (basically a huge log table).
Or, value squared * value to the 2/5ths, which could take the initial square value from the first half of the function and then 5th-root it. For the 5th root, you could Newton it or do some other fast approximator, though honestly once you get to this point, you're probably better off just doing the exp and log functions with the appropriate abbreviated series functions yourself.
The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x and pow(x, n), you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n) for small delta, with a single multiply and add (more or less). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta you could use and still get the same accuracy.
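A minimal sketch (untested) of that incremental idea, with a hypothetical helper name: given a base point x0 where y0 = pow(x0, n) has already been paid for, a first-order step covers nearby inputs with one multiply and one add:
// Given y0 = pow(x0, n), approximate pow(x0 + delta, n) without another pow() call.
double pow_near(double x0, double y0, double n, double delta)
{
    double dydx = n * y0 / x0;   // d/dx x^n = n*x^(n-1) = n*y0/x0
    return y0 + dydx * delta;    // first-order step; add a second-derivative term for a quadratic
}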
So traditionally the powf(x, p) = x^p is solved by rewriting x as x=2^(log2(x)) making powf(x,p) = 2^(p*log2(x)), which transforms the problem into two approximations exp2() & log2(). This has the advantage of working with larger powers p, however the downside is that this is not the optimal solution for a constant power p and over a specified input bound 0 ≤ x ≤ 1.
When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4 as can be seen below:
float pow12_5(float x){
float mp;
// Minimax Horner polynomials for x^(12/5). Note: choose the accuracy required, then implement with fma() [Fused Multiply Accumulates]
// mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4)))); // 2.6079002e-4
// mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4)))); // 8.61377e-5
// mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4)))))); // 3.524655e-5
return(mp);
}
However, when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not converge appropriately to the desired accuracy. One option [not really] is to rewrite the problem as y = x^p = x^(p+m)/x^m, where m = 1, 2, 3, ... is a positive integer, so that the power being approximated is greater than 1, but this introduces a division, which is inherently slower.
There's however another option which is to decompose the input x as its floating point exponent and mantissa form:
x = mx* 2^(ex) where 1 ≤ mx < 2
y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
= mx^(5/12) * 2^(k/12) * 2^(ey)
The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12). The code is below:
#include <stdint.h>
#include <math.h>
float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0, 0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
float pow5_12(float x){
union{float f; uint32_t u;} v, e2;
float poff, m, e, ei;
int xe;
v.f = x;
xe = ((v.u >> 23) - 127);
if(xe < -127) return(0.0f);
// Calculate remainder k in 2^(k/12) to find LUT
e = xe * (5.0f/12.0f);
ei = floorf(e);
poff = powk_12LUT[(int)(12.0f * (e - ei))];
e2.u = ((int)ei + 127) << 23; // Calculate the exponent
v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero
// Approximate mx^(5/12) on [1,2), with appropriate degree minimax
// m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4)); // 7.6125e-4
// m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8)); // 8.4522726e-5
m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8)))); // 1.04091259e-5
// m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8)))); // 1.367401e-6
return(m * poff * e2.f);
}

Recursively create a sine wave given a single sine wave value and the period

I am trying to write a .oct function for Octave that, given a single sine wave value, between -1 and 1, and sine wave period, returns a sine wave vector of period length with the last value in the vector being the given sine wave value. My code so far is:
#include <octave/oct.h>
#include <octave/dColVector.h>
#include <math.h>
#define PI 3.14159265
DEFUN_DLD (sinewave_recreate, args, , "args(0) sinewave value, args(1) is period")
{
octave_value_list retval;
double sinewave_value = args(0).double_value ();
double period = args(1).double_value ();
ColumnVector output_sinewave(period);
double degrees_inc = 360 / period;
double output_sinewave_degrees;
output_sinewave_degrees = asin( sinewave_value ) * 180 / PI;
output_sinewave(period-1) = sin( output_sinewave_degrees * PI / 180 );
for (octave_idx_type ii (1); ii < period; ii++) // Start the loop
{
output_sinewave_degrees = output_sinewave_degrees - degrees_inc;
if ( output_sinewave_degrees < 0 )
{
output_sinewave_degrees += 360 ;
}
output_sinewave( period-1-ii ) = sin( output_sinewave_degrees * PI / 180 );
}
retval(0) = output_sinewave;
return retval;
}
but is giving patchy results. By this I mean that it sometimes recreates the sine wave quite accurately and other times it is way off. I have determined this simply by creating a given sine wave, taking the last value in time and plugging this into the function to recreate the sine wave backwards through time and then comparing plots of the two. Obviously I am doing something wrong, but I can't seem to identify what.
Let's start with some trigonometric identities:
sin(x)^2 + cos(x)^2 == 1
sin(x+y) == sin(x)*cos(y) + sin(y)*cos(x)
cos(x+y) == cos(x)*cos(y) - sin(x)*sin(y)
Given the sine and cosine at a point x, we can exactly calculate the values after a step of size d, after precalculating sd = sin(d) and cd = cos(d):
sin(x+d) = sin(x)*cd + cos(x)*sd
cos(x+d) = cos(x)*cd - sin(x)*sd
Given the initial sine value, you can calculate the initial cosine value:
cos(x) = sqrt(1 - sin(x)^2)
Note that there are two possible solutions, corresponding to the two possible square-root values. Also note that all the angles in these identities are in radians, and d needs to be negative if you're going back through the wave.
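Here's a minimal C sketch of that recurrence (not the poster's .oct code; the sign of the square root is simply assumed positive, which is exactly the ambiguity noted above):
#include <math.h>
#define PI 3.14159265358979323846
// Fill out[0..n-1] backwards from a known last sample, stepping by -2*pi/period.
static void sinewave_backwards(double last_sin, double period, double *out, int n)
{
    double d = -2.0 * PI / period;   // negative step: going back through the wave
    double sd = sin(d), cd = cos(d);
    double s = last_sin;
    double c = sqrt(1.0 - s * s);    // two possible signs; positive assumed here
    out[n - 1] = s;
    for (int i = n - 2; i >= 0; i--) {
        double s2 = s * cd + c * sd; // sin(x + d)
        double c2 = c * cd - s * sd; // cos(x + d)
        s = s2;
        c = c2;
        out[i] = s;
    }
}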
Mike's note that there are two possible solutions for cos(x) made me realise that I would need to resolve the phase ambiguity of the sine wave. My second, successful attempt at this function is:
#include <octave/oct.h>
#include <octave/dColVector.h>
#include <math.h>
#define PI 3.14159265
DEFUN_DLD (sinewave_recreate_3, args, , "args(0) sinewave value, args(1) is period, args(2) is the phase")
{
octave_value_list retval;
double sinewave_value = args(0).double_value ();
double period = args(1).double_value ();
double phase = args(2).double_value ();
ColumnVector output_sinewave(period);
double X0 = asin(sinewave_value);
if (sinewave_value < 0 && phase > 180 && phase < 270)
{
X0 = PI + (0 - X0);
}
if (sinewave_value < 0 && phase >= 270)
{
X0 = X0 + 2 * PI;
}
if (sinewave_value > 0 && phase > 90)
{
X0 = PI - X0;
}
if (sinewave_value > 0 && phase < 0)
{
X0 = X0 + PI / 2;
}
double dx = PI / 180 * (360/period);
for (octave_idx_type ii (0); ii < period; ii++) // Start the loop
{
output_sinewave(period-1-ii) = sin(X0 - dx * ii);
}
retval(0) = output_sinewave;
return retval;
}
Thanks are also due to Keynslug.
There is a simple formula. Here is an example in Python:
import math
import numpy as np
# We are supposing the step is equal to 1 degree
T = math.radians(1.0)  # step size in radians
PrevBeforePrevValue = np.sin(math.radians(49.0)) # y(t-2)
PrevValue = np.sin(math.radians(50.0)) # y(t-1)
ValueNowRecursiveFormula = ((2.0*(4.0-T*T))/(4.0+T*T))*PrevValue - PrevBeforePrevValue
print("From RECURSIVE formula - " + str(ValueNowRecursiveFormula))
The details can be found here:
http://howtodoit.com.ua/en/on-the-way-of-developing-recursive-sinewave-generator/
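For reference, here is the same two-sample recurrence written as a C sketch, using the exact coefficient 2*cos(step) rather than the (4 - T^2)/(4 + T^2) approximation above (the function name and interface are mine):
#include <math.h>
// y[t] = 2*cos(step)*y[t-1] - y[t-2]; step is the sample spacing in radians.
static void sine_recursive(double step, double y_prev2, double y_prev1,
                           double *out, int n)
{
    double k = 2.0 * cos(step);
    for (int i = 0; i < n; i++) {
        double y = k * y_prev1 - y_prev2;
        out[i] = y;
        y_prev2 = y_prev1;
        y_prev1 = y;
    }
}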
You might try an easier approach.
Just recall that if
y = sin(x)
then first derivative of y will be equal to
dy/dx = cos(x)
So at every step of computation you add to the current value of y some delta equal to
dy = cos(x) * dx
But that might cut your accuracy down as a side effect. You could experiment to see whether it's acceptable. HTH.
It seems that a slightly improved equation tends to be more accurate:
dy = cos(x + dx/2) * dx
Take a look at this.
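A small C sketch of that midpoint update (names are mine; it still calls cos() once per step, so it only helps if cos is cheaper than whatever you were doing before):
#include <math.h>
// Step a sine wave forward with dy = cos(x + dx/2) * dx.
static void sine_midpoint(double x0, double dx, double *out, int n)
{
    double y = sin(x0);
    for (int i = 0; i < n; i++) {
        out[i] = y;
        y += cos(x0 + (i + 0.5) * dx) * dx;  // derivative at the midpoint of step i
    }
}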

Fast Arc Cos algorithm?

I have my own, very fast cos function:
float sine(float x)
{
const float B = 4/pi;
const float C = -4/(pi*pi);
float y = B * x + C * x * abs(x);
// const float Q = 0.775;
const float P = 0.225;
y = P * (y * abs(y) - y) + y; // Q * y + P * y * abs(y)
return y;
}
float cosine(float x)
{
return sine(x + (pi / 2));
}
But now when I profile, I see that acos() is killing the processor. I don't need intense precision. What is a fast way to calculate acos(x)?
Thanks.
A simple cubic approximation, the Lagrange polynomial for x ∈ {-1, -½, 0, ½, 1}, is:
double acos(double x) {
return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}
It has a maximum error of about 0.18 rad.
Got spare memory? A lookup table (with interpolation, if required) is gonna be fastest.
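Something like this, as a rough sketch (the table size is an arbitrary choice; note that a uniform table is least accurate near ±1, where acos is steep):
#include <math.h>
#define ACOS_N 1024
static float acos_tbl[ACOS_N + 1];
// Fill the table once: acos sampled uniformly on [-1, 1].
static void acos_tbl_init(void)
{
    for (int i = 0; i <= ACOS_N; i++)
        acos_tbl[i] = acosf(-1.0f + 2.0f * i / ACOS_N);
}
// Linear interpolation between the two nearest samples.
static float acos_lut(float x)            // assumes -1 <= x <= 1
{
    float t = (x + 1.0f) * 0.5f * ACOS_N;
    int i = (int)t;
    if (i >= ACOS_N) i = ACOS_N - 1;      // keep i+1 inside the table
    float f = t - i;
    return acos_tbl[i] + f * (acos_tbl[i + 1] - acos_tbl[i]);
}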
nVidia has some great resources that show how to approximate otherwise very expensive math functions, such as acos, asin, atan2, etc.
These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:
// Absolute error <= 6.7e-5
float acos(float x) {
float negate = float(x < 0);
x = abs(x);
float ret = -0.0187293;
ret = ret * x;
ret = ret + 0.0742610;
ret = ret * x;
ret = ret - 0.2121144;
ret = ret * x;
ret = ret + 1.5707288;
ret = ret * sqrt(1.0-x);
ret = ret - 2 * negate * ret;
return negate * 3.14159265358979 + ret;
}
And here are the results when calculating acos(0.5):
nVidia: result: 1.0471513828611643
math.h: result: 1.0471975511965976
That's pretty close! Depending on your required degree of precision, this might be a good option for you.
I have my own. It's pretty accurate and sort of fast. It works off of a theorem I built around quartic convergence. It's really interesting, and you can see the equation and how fast it can make my natural log approximation converge here: https://www.desmos.com/calculator/yb04qt8jx4
Here's my arccos code:
function acos(x)
local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
local c=0.88-0.77*x c=(c+(2-a)/c)/2
return (8*(c+(2-a)/c)-(b+(2-2*x)/b))/6
end
A lot of that is just square root approximation. It works really well, too, unless you get too close to taking a square root of 0. It has an average error (excluding x=0.99 to 1) of 0.0003. The problem, though, is that at 0.99 it starts to fall apart, and at x=1 the error grows to 0.05. Of course, this could be solved by doing more iterations on the square roots (which isn't cheap), or by switching to a different set of square root linearizations when x>0.99, but that makes the code long and ugly.
If you don't care about accuracy so much, you could just do one iteration per square root, which should still keep you somewhere in the range of 0.0162 or something as far as accuracy goes:
function acos(x)
local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
local c=0.88-0.77*x c=(c+(2-a)/c)/2
return 8/3*c-b/3
end
If you're okay with it, you can use pre-existing square root code. It will get rid of the equation going a bit crazy at x=1:
function acos(x)
local a = math.sqrt(2+2*x)
local b = math.sqrt(2-2*x)
local c = math.sqrt(2-a)
return 8/3*c-b/3
end
Frankly, though, if you're really pressed for time, remember that you could linearize arccos into 1.57079 - 1.57079*x and just do:
function acos(x)
return 1.57079-1.57079*x
end
Anyway, if you want to see a list of my arccos approximation equations, you can go to https://www.desmos.com/calculator/tcaty2sv8l. I know that my approximations aren't the best for certain things, but if you're doing something where they would be useful, please use them, but try to give me credit.
You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.
acos(x) ≈ π/2 + (ax + bx³) / (1 + cx² + dx⁴)
where
a = -0.939115566365855
b = 0.9217841528914573
c = -1.2845906244690837
d = 0.295624144969963174
has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). Here is a plot (the inverse cosine in black, cubic polynomial approximation in red, the above function in blue) for comparison:
The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.
A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)
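Written out in C, the approximation above looks roughly like this (the fma() calls are optional, but match the performance note):
#include <math.h>
// acos(x) ~= pi/2 + (a*x + b*x^3) / (1 + c*x^2 + d*x^4) on (-1, 1)
static double acos_rational(double x)
{
    const double a = -0.939115566365855;
    const double b =  0.9217841528914573;
    const double c = -1.2845906244690837;
    const double d =  0.295624144969963174;
    double x2 = x * x;
    double num = x * fma(b, x2, a);           // a*x + b*x^3
    double den = fma(x2, fma(d, x2, c), 1.0); // 1 + c*x^2 + d*x^4
    return 1.5707963267948966 + num / den;
}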
Another approach you could take is to use complex numbers. From de Moivre's formula,
ⅈ^x = cos(π/2*x) + ⅈ*sin(π/2*x)
Let θ = π/2*x. Then x = 2θ/π, so
sin(θ) = ℑ(ⅈ^(2θ/π))
cos(θ) = ℜ(ⅈ^(2θ/π))
How can you calculate powers of ⅈ without sin and cos? Start with a precomputed table for powers of 2:
ⅈ^4 = 1
ⅈ^2 = -1
ⅈ^1 = ⅈ
ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475*ⅈ
ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898*ⅈ
ⅈ^(1/8) = 0.9807852804032304 + 0.19509032201612825*ⅈ
ⅈ^(1/16) = 0.9951847266721969 + 0.0980171403295606*ⅈ
ⅈ^(1/32) = 0.9987954562051724 + 0.049067674327418015*ⅈ
ⅈ^(1/64) = 0.9996988186962042 + 0.024541228522912288*ⅈ
ⅈ^(1/128) = 0.9999247018391445 + 0.012271538285719925*ⅈ
ⅈ^(1/256) = 0.9999811752826011 + 0.006135884649154475*ⅈ
To calculate arbitrary values of ⅈx, approximate the exponent as a binary fraction, and then multiply together the corresponding values from the table.
For example, to find sin and cos of 72° = 0.8π/2:
ⅈ^0.8
≈ ⅈ^(205/256)
= ⅈ^(0b11001101/256)
= ⅈ^(1/2) * ⅈ^(1/4) * ⅈ^(1/32) * ⅈ^(1/64) * ⅈ^(1/256)
= 0.3078496400415349 + 0.9514350209690084*ⅈ
sin(72°) ≈ 0.9514350209690084 ("exact" value is 0.9510565162951535)
cos(72°) ≈ 0.3078496400415349 ("exact" value is 0.30901699437494745).
To find asin and acos, you can use this table with the Bisection Method:
For example, to find asin(0.6) (the smallest angle in a 3-4-5 triangle):
ⅈ^0 = 1 + 0*ⅈ. The sin is too small, so increase x by 1/2.
ⅈ^(1/2) = 0.7071067811865476 + 0.7071067811865475*ⅈ. The sin is too big, so decrease x by 1/4.
ⅈ^(1/4) = 0.9238795325112867 + 0.3826834323650898*ⅈ. The sin is too small, so increase x by 1/8.
ⅈ^(3/8) = 0.8314696123025452 + 0.5555702330196022*ⅈ. The sin is still too small, so increase x by 1/16.
ⅈ^(7/16) = 0.773010453362737 + 0.6343932841636455*ⅈ. The sin is too big, so decrease x by 1/32.
ⅈ^(13/32) = 0.8032075314806449 + 0.5956993044924334*ⅈ.
Each time you increase x, multiply by the corresponding power of ⅈ. Each time you decrease x, divide by the corresponding power of ⅈ.
If we stop here, we obtain asin(0.6) ≈ 13/32 * π/2 = 0.6381360077604268 (the "exact" value is 0.6435011087932844.)
The accuracy, of course, depends on the number of iterations. For a quick-and-dirty approximation, use 10 iterations. For "intense precision", use 50-60 iterations.
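Here is a small C99 sketch of that bisection (my own code, not the answer's; in a real implementation the per-bit powers of ⅈ would come from the precomputed table above rather than from cos/sin):
#include <complex.h>
#include <math.h>
// Bisection on the exponent x so that the imaginary part of i^x matches the target sine.
static double asin_bisect(double t, int iters)   // assumes 0 <= t <= 1
{
    const double PI_2 = 1.5707963267948966;
    double complex z = 1.0;                      // i^0
    double x = 0.0, step = 0.5;
    for (int k = 0; k < iters; k++, step *= 0.5) {
        double complex p = cos(PI_2 * step) + I * sin(PI_2 * step); // i^step
        if (cimag(z) <= t) { z *= p; x += step; }  // sine too small: increase x
        else               { z /= p; x -= step; }  // sine too big:   decrease x
    }
    return x * PI_2;   // asin(t); acos(t) is pi/2 minus this
}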
A fast arccosine implementation, accurate to about 0.5 degrees, can be based on the observation that for x in [0,1], acos(x) ≈ √(2*(1-x)). An additional scale factor improves accuracy near zero. The optimal factor can be found by a simple binary search. Negative arguments are handled according to acos (-x) = π - acos (x).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
// Approximate acos(a) with relative error < 5.15e-3
// This uses an idea from Robert Harley's posting in comp.arch.arithmetic on 1996/07/12
// https://groups.google.com/forum/#!original/comp.arch.arithmetic/wqCPkCCXqWs/T9qCkHtGE2YJ
float fast_acos (float a)
{
const float PI = 3.14159265f;
const float C = 0.10501094f;
float r, s, t, u;
t = (a < 0) ? (-a) : a; // handle negative arguments
u = 1.0f - t;
s = sqrtf (u + u);
r = C * u * s + s; // or fmaf (C * u, s, s) if FMA support in hardware
if (a < 0) r = PI - r; // handle negative arguments
return r;
}
float uint_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
int main (void)
{
double maxrelerr = 0.0;
uint32_t a = 0;
do {
float x = uint_as_float (a);
float r = fast_acos (x);
double xx = (double)x;
double res = (double)r;
double ref = acos (xx);
double relerr = (res - ref) / ref;
if (fabs (relerr) > maxrelerr) {
maxrelerr = fabs (relerr);
printf ("xx=% 15.8e res=% 15.8e ref=% 15.8e rel.err=% 15.8e\n",
xx, res, ref, relerr);
}
a++;
} while (a);
printf ("maximum relative error = %15.8e\n", maxrelerr);
return EXIT_SUCCESS;
}
The output of the above test scaffold should look similar to this:
xx= 0.00000000e+000 res= 1.56272149e+000 ref= 1.57079633e+000 rel.err=-5.14060021e-003
xx= 2.98023259e-008 res= 1.56272137e+000 ref= 1.57079630e+000 rel.err=-5.14065723e-003
xx= 8.94069672e-008 res= 1.56272125e+000 ref= 1.57079624e+000 rel.err=-5.14069537e-003
xx=-2.98023259e-008 res= 1.57887137e+000 ref= 1.57079636e+000 rel.err= 5.14071269e-003
xx=-8.94069672e-008 res= 1.57887149e+000 ref= 1.57079642e+000 rel.err= 5.14075044e-003
maximum relative error = 5.14075044e-003
Here is a great website with many options:
https://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/arcsin/onlyelem.html
Personally I went with the Chebyshev-Pade quotient approximation, using the following code:
double arccos(double x) {
const double pi = 3.141592653;
return pi / 2 - (.5689111419 - .2644381021*x - .4212611542*(2*x - 1)*(2*x - 1)
+ .1475622352*(2*x - 1)*(2*x - 1)*(2*x - 1))
/ (2.006022274 - 2.343685222*x + .3316406750*(2*x - 1)*(2*x - 1) +
.02607135626*(2*x - 1)*(2*x - 1)*(2*x - 1));
}
If you're using Microsoft VC++, here's an inline __asm x87 FPU code version without all the CRT filler, error checks, etc., and unlike the earliest classic ASM code you can find, it uses an FMUL instead of the slower FDIV. It compiles/works with Microsoft VC++ 2005 Express/Pro, which I always stick with for various reasons.
It's a little tricky to set up a function with "__declspec(naked)/__fastcall", pull parameters correctly, and handle the stack, so it's not for the faint of heart. If it fails to compile with errors on your version, don't bother unless you're experienced. Or ask me, and I can rewrite it in a slightly friendlier __asm{} block. I would manually inline this if it's a critical part of a function in a loop, for further performance gains if need be.
extern float __fastcall fs_acos(float x);
extern double __fastcall fs_Acos(double x);
// ACOS(x)- Computes the arccosine of ST(0)
// Allowable range: -1<=x<=+1
// Derivative Formulas: acos(x) = atan(sqrt((1 - x * x)/(x * x))) OR
// acos(x) = atan2(sqrt(1 - x * x), x)
// e.g. acos(-1.0) = 3.1415927
__declspec(naked) float __fastcall fs_acos(float x) { __asm {
FLD DWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
FLD1 ;// Load 1.0
FADD ST, ST(1) ;// Compute 1.0 + 'x'
FLD1 ;// Load 1.0
FSUB ST, ST(2) ;// Compute 1.0 - 'x'
FMULP ST(1), ST ;// Compute (1-x) * (1+x)
FSQRT ;// Compute sqrt(result)
FXCH ST(1)
FPATAN ;// Compute arctangent of result / 'x' (ST1/ST0)
RET 4
}}
__declspec(naked) double __fastcall fs_Acos(double x) { __asm { //
FLD QWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
FLD1 ;// Load 1.0
FADD ST, ST(1) ;// Compute (1.0 + 'x')
FLD1 ;// Load 1.0
FSUB ST, ST(2) ;// Compute (1.0 - 'x')
FMULP ST(1), ST ;// Compute (1-x) * (1+x)
FSQRT ;// Compute sqrt((1-x) * (1+x))
FXCH ST(1)
FPATAN ;// Compute arctangent of result / 'x' (ST1/ST0)
RET 8
}}
Unfortunately I do not have enough reputation to comment.
Here is a small modification of Nvidia's function that deals with inputs that should be <= 1 but may exceed it, while preserving performance as much as possible.
This may be important, since rounding errors can cause a number that should be 1.0 to end up (oh so slightly) larger than 1.0.
double safer_acos(double x) {
double negate = double(x < 0);
x = fabs(x); // fabs: the double-precision absolute value
x -= double(x>1.0)*(x-1.0); // <- equivalent to min(1.0,x), but faster
double ret = -0.0187293;
ret = ret * x;
ret = ret + 0.0742610;
ret = ret * x;
ret = ret - 0.2121144;
ret = ret * x;
ret = ret + 1.5707288;
ret = ret * sqrt(1.0-x);
ret = ret - 2 * negate * ret;
return negate * 3.14159265358979 + ret;
// In a single line (no gain using gcc)
//return negate * 3.14159265358979 + (((((-0.0187293*x)+ 0.0742610)*x - 0.2121144)*x + 1.5707288)* sqrt(1.0-x))*(1.0-2.0*negate);
}