Make very small (or large) exponential calculations - C++

Exponential limit of most 32 bit machines is
exp( +/-700 )
But I would like to do an exponential calculation
res = exp( x ) / exp( d )
when x or d are bigger than 700 I use the fact that
exp( x + y ) = exp( x ) * exp( y )
So my calculation would be something along the line of
res = (exp( x - z ) * exp( z )) / (exp( d - z ) * exp( z ))
or
res = exp( x - z ) / exp( d - z )
where (x-z) < 700
But this approach is flawed in some cases, for example where x = 6000 and d = 10000
If we use z=5300 then
res = exp( 6000 - 5300 ) / exp( 10000 - 5300 )
res = exp( 700 ) / exp( 47000 )
But exp( 47000 ) = 0 on a 32 bit machine.
If I replace z = 9300 then I get the opposite effect.
res = exp( -3300 ) / exp( 700 )
So how could I solve the above equations, (that should return a 32bit valid number I think), given the limitations of the computer?
Edit
The reason for doing this is I am using the formula
P( a ) = P(y1) * P(y2) * P(y3) ... P(yN)
In order to prevent overflow I then do
a = log( P(y1) ) + log( P(y2) ) + log (P(y3)) ... log( P(yN) )
b = log( P(z1) ) + log( P(z2) ) + log (P(z3)) ... log( P(zN) )
...
z = log( P(zz1) ) + log( P(zz2) ) + log (P(zz3)) ... log( P(zzN) )
to get a total I do
total = a + b ... z
and to calculate the percentage I do
(exp(a) / exp( total ) ) * 100
but it is possible that "a" and/or "total" are greater than 700
I guess the question could be how could I calculate the percentage without using the exponential

It doesn't matter that the answer should be a 32 bit number if some of the intermediate steps in the calculations aren't.
For math that goes outside the bounds of an int or long type, you probably need to start using something like GMP.
https://gmplib.org/

I assume that you want to compute this:
p = exp(a) / exp(b)
And since a^b/a^c == a^(b-c) this reduces to
p = exp(a - b)
which can be easily computed if that difference is below that critical exponent.
If it isn't, then your result cannot be represented by primitive datatypes like double (because it's either extremely large or extremely small), you then need some kind of arbitrary precision numbers, probably provided by some library.
But if you only need to print the result, or store it somehow, then you can easily compute even extremely large numbers:
For that, you change to base 10 (for displaying), compute the equivalent base-10 exponent (tExponent = eExponent * log10(e), i.e. eExponent / ln(10)), and bring that value into the allowed range between std::numeric_limits<double>::min_exponent10 and std::numeric_limits<double>::max_exponent10, saving the difference as a scaling factor.
For now, I just have a quick and dirty live example, showing
exp(90000) / exp(100) = 1.18556 scaled by 10^39043
(Check at wolfram alpha)
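A minimal sketch of that base-10 scaling (my own version, not the original live example): convert the exponent difference to base 10, split off the integer part as the power of ten, and keep 10 raised to the fractional part as the mantissa.
#include <cmath>
#include <cstdio>

// Print exp(a) / exp(b) = exp(a - b) as "m * 10^k", even when the result
// itself would overflow or underflow a double.
void print_exp_quotient(double a, double b) {
    double t = (a - b) * std::log10(std::exp(1.0)); // base-10 exponent of e^(a-b)
    double k = std::floor(t);                       // integer power of ten
    double m = std::pow(10.0, t - k);               // mantissa in [1, 10)
    std::printf("exp(%g) / exp(%g) = %g scaled by 10^%.0f\n", a, b, m, k);
}

// print_exp_quotient(90000, 100) prints roughly "1.18556 scaled by 10^39043".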
Note: When I wrote this, it was pretty late in the evening. I'm leaving this here for an "alternative" approach.
Now, generally, there's
a^b = [a^(b/c)]^c
And since
(a/b)^c = (a^c)/(b^c)
holds, too, I guess the easiest approach here is to just divide both exponents as long as one of them is above your critical value, then do the exponentiation, divide the results, and finally use the divisor of the former exponents as exponent for the quotient:
#include <cmath>
#include <stdexcept>

double large_exp_quot(
        double eNum,
        double eDenom,
        unsigned int const critical = 200) {
    if (std::abs(eNum - eDenom) > critical) {
        throw std::out_of_range{"That won't work, resulting exponent is too large"};
    }
    unsigned int eDivisor = 1;
    while (std::abs(eNum) > critical || std::abs(eDenom) > critical) {
        eNum /= 2;
        eDenom /= 2;
        eDivisor *= 2;
    }
    return std::pow(std::exp(eNum) / std::exp(eDenom), eDivisor);
}
But this will only work, if the result of your computation can actually be stored using the C++ primitive datatypes, in this case double. The example you gave ... with exponents 6000 and 10000 ... is obviously not representable with a double (it's e^(-4000) and thus incredibly small)

Numerically unstable computation: exp(a) / exp(b)
Equivalent stable computation: exp(a - b)
Numerically unstable computation: Π_{i=1..n} p_i
Equivalent stable computation: exp(Σ_{i=1..n} log(p_i))
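Applied to the percentage from the question's edit, both stable forms fit in a few lines (the function names here are mine, purely for illustration):
#include <cmath>
#include <vector>

// Stable product of probabilities: sum the logs instead of multiplying.
double log_product(const std::vector<double>& p) {
    double acc = 0.0;
    for (double v : p) acc += std::log(v);
    return acc;   // log(p1 * p2 * ... * pN), without overflow or underflow
}

// Stable quotient exp(a) / exp(total): exponentiate only the difference.
double percentage(double a, double total) {
    return std::exp(a - total) * 100.0;
}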

Related

Why my Gradient is wrong (Coursera, Logistic Regression, Julia)?

I'm trying to do Logistic Regression from Coursera in Julia, but it doesn't work.
The Julia code to calculate the Gradient:
sigmoid(z) = 1 / (1 + e ^ -z)
hypotesis(theta, x) = sigmoid(scalar(theta' * x))
function gradient(theta, x, y)
    (m, n) = size(x)
    h = [hypotesis(theta, x[i,:]') for i in 1:m]
    g = Array(Float64, n, 1)
    for j in 1:n
        g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])
    end
    g
end
If this gradient is used, it produces the wrong results. I can't figure out why; the code seems right.
The full Julia script. In this script the optimal theta is calculated using my gradient descent implementation and using the built-in Optim package, and the results are different.
The gradient is correct (up to a scalar multiple, as @roygvib points out). The problem is with the gradient descent.
If you look at the values of the cost function during your gradient descent, you will see a lot of NaN,
which probably come from the exponential:
lowering the step size (e.g., to 1e-5) will avoid the overflow,
but you will have to increase the number of iterations a lot (perhaps to 10_000_000).
A better (faster) solution would be to let the step size vary.
For instance, one could multiply the step size by 1.1
if the cost function improves after a step
(the optimum still looks far away in this direction: we can go faster),
and divide it by 2 if it does not (we went too fast and ended up past the minimum).
One could also do a line search in the direction of the gradient to find the best step size
(but this is time-consuming and can be replaced by approximations, e.g., Armijo's rule).
Rescaling the predictive variables also helps.
I tried comparing gradient() in the OP's code with numerical derivative of cost_j() (which is the objective function of minimization) using the following routine
function grad_num( theta, x, y )
    g = zeros( 3 )
    eps = 1.0e-6
    disp = zeros( 3 )
    for k = 1:3
        disp[:] = theta[:]
        disp[ k ] = theta[ k ] + eps
        plus = cost_j( disp, x, y )
        disp[ k ] = theta[ k ] - eps
        minus = cost_j( disp, x, y )
        g[ k ] = ( plus - minus ) / ( 2.0 * eps )
    end
    return g
end
But the gradient values obtained from the two routines do not seem to agree very well (at least for the initial stage of minimization)... So I manually derived the gradient of cost_j( theta, x, y ), from which it seems that the division by m is missing:
# OP's code
# g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] )
# modified code
g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] ) / m
Because I am not very sure if the above code and expression are really correct, could you check them by yourself...?
But in fact, regardless of whether I use the original or corrected gradients, the program converges to the same minimum value (0.2034977016, almost the same as obtained from Optim), because the two gradients differ only by a multiplicative factor! Because the convergence was very slow, I also modified the stepsize alpha adaptively following the suggestion by Vincent (here I used more moderate values for acceleration/deceleration):
function gradient_descent(x, y, theta, alpha, n_iterations)
    ...
    c = cost_j( theta, x, y )
    for i = 1:n_iterations
        c_prev = c
        c = cost_j( theta, x, y )
        if c - c_prev < 0.0
            alpha *= 1.01
        else
            alpha /= 1.05
        end
        theta[:] = theta - alpha * gradient(theta, x, y)
    end
    ...
end
and called this routine as
optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ]
The variation of cost_j versus iteration steps is plotted below.

Newton Raphson with SSE2 - can someone explain me these 3 lines

I'm reading this document: http://software.intel.com/en-us/articles/interactive-ray-tracing
and I stumbled upon these three lines of code:
The SIMD version is already quite a bit faster, but we can do better.
Intel has added a fast 1/sqrt(x) function to the SSE2 instruction set.
The only drawback is that its precision is limited. We need the
precision, so we refine it using Newton-Raphson:
__m128 nr = _mm_rsqrt_ps( x );
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr );
result = _mm_mul_ps( _mm_mul_ps( half, nr ), _mm_sub_ps( three, muls ) );
This code assumes the existence of a __m128 variable named 'half'
(four times 0.5f) and a variable 'three' (four times 3.0f).
I know how to use Newton Raphson to calculate a function's zero and I know how to use it to calculate the square root of a number but I just can't see how this code performs it.
Can someone explain it to me please?
Given the Newton iteration y_{n+1} = (y_n / 2) * (3 - x * y_n^2), it should be quite straightforward to see this in the source code.
__m128 nr = _mm_rsqrt_ps( x ); // The initial approximation y_0
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr ); // muls = x*nr*nr == x(y_n)^2
result = _mm_mul_ps(
    _mm_sub_ps( three, muls ),   // this is 3.0 - muls
    _mm_mul_ps( half, nr ) );    // multiplied by y_n / 2, i.e. y_n * 0.5
And to be precise, this algorithm is for the inverse square root.
Note that this still doesn't give a fully accurate result. rsqrtps with a NR iteration gives almost 23 bits of accuracy, vs. sqrtps's 24 bits with correct rounding for the last bit.
The limited accuracy is an issue if you want to truncate the result to integer. (int)4.99999 is 4. Also, watch out for the x == 0.0 case if using sqrt(x) ~= x * sqrt(x), because 0 * +Inf = NaN.
To compute the inverse square root of a, Newton's method is applied to the equation 0=f(x)=a-x^(-2) with derivative f'(x)=2*x^(-3) and thus the iteration step
N(x) = x - f(x)/f'(x) = x - (a*x^3-x)/2
= x/2 * (3 - a*x^2)
This division-free method has -- in contrast to the globally converging Heron's method -- a limited region of convergence, so you need an already good approximation of the inverse square root to get a better approximation.
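For reference, the same refinement step as plain scalar C++ (my paraphrase of the SIMD code above, not code from the article):
// One Newton-Raphson step for y ~= 1/sqrt(x), starting from a rough estimate y0.
float refine_rsqrt(float x, float y0) {
    float muls = x * y0 * y0;          // x * y0^2, close to 1 when y0 is good
    return 0.5f * y0 * (3.0f - muls);  // y1 = y0/2 * (3 - x*y0^2)
}

// Example: refine_rsqrt(2.0f, 0.7f) gives ~0.707, vs. 1/sqrt(2) = 0.70711.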

How to fit the 2D scatter data with a line with C++

I used to work with MATLAB, and for the question I raised I can use p = polyfit(x,y,1) to estimate the best fit line for the scatter data in a plate. I was wondering which resources I can rely on to implement the line fitting algorithm with C++. I understand there are a lot of algorithms for this subject, and for me I expect the algorithm should be fast and meantime it can obtain the comparable accuracy of polyfit function in MATLAB.
This page describes the algorithm easier than Wikipedia, without extra steps to calculate the means etc. : http://faculty.cs.niu.edu/~hutchins/csci230/best-fit.htm . Almost quoted from there, in C++ it's:
#include <vector>
#include <cmath>

struct Point {
    double _x, _y;
};

struct Line {
    double _slope, _yInt;

    double getYforX(double x) {
        return _slope*x + _yInt;
    }

    // Construct line from points
    bool fitPoints(const std::vector<Point> &pts) {
        int nPoints = pts.size();
        if( nPoints < 2 ) {
            // Fail: infinitely many lines passing through this single point
            return false;
        }
        double sumX=0, sumY=0, sumXY=0, sumX2=0;
        for(int i=0; i<nPoints; i++) {
            sumX += pts[i]._x;
            sumY += pts[i]._y;
            sumXY += pts[i]._x * pts[i]._y;
            sumX2 += pts[i]._x * pts[i]._x;
        }
        double xMean = sumX / nPoints;
        double yMean = sumY / nPoints;
        double denominator = sumX2 - sumX * xMean;
        // You can tune the eps (1e-7) below for your specific task
        if( std::fabs(denominator) < 1e-7 ) {
            // Fail: it seems a vertical line
            return false;
        }
        _slope = (sumXY - sumX * yMean) / denominator;
        _yInt = yMean - _slope * xMean;
        return true;
    }
};
Please, be aware that both this algorithm and the algorithm from Wikipedia ( http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line ) fail in case the "best" description of points is a vertical line. They fail because they use
y = k*x + b
line equation, which intrinsically cannot describe vertical lines. If you want to also cover the cases when the data points are "best" described by vertical lines, you need a line fitting algorithm which uses the
A*x + B*y + C = 0
line equation. You can still modify the current algorithm to produce that equation:
y = k*x + b <=>
y - k*x - b = 0 <=>
B=1, A=-k, C=-b
In terms of the above code:
B=1, A=-_slope, C=-_yInt
And in "then" block of the if checking for denominator equal to 0, instead of // Fail: it seems a vertical line, produce the following line equation:
x = xMean <=>
x - xMean = 0 <=>
A=1, B=0, C=-xMean
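In code, that could look like the following hypothetical variant of fitPoints, which always produces the A*x + B*y + C = 0 form instead of failing on vertical lines (Point is the struct from the answer above):
#include <cmath>
#include <vector>

bool fitPointsGeneral(const std::vector<Point>& pts, double& A, double& B, double& C) {
    const int n = pts.size();
    if (n < 2) return false;
    double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0;
    for (const Point& p : pts) {
        sumX  += p._x;       sumY  += p._y;
        sumXY += p._x * p._y; sumX2 += p._x * p._x;
    }
    const double xMean = sumX / n, yMean = sumY / n;
    const double denominator = sumX2 - sumX * xMean;
    if (std::fabs(denominator) < 1e-7) {
        A = 1.0; B = 0.0; C = -xMean;    // vertical line: x - xMean = 0
    } else {
        const double slope = (sumXY - sumX * yMean) / denominator;
        const double yInt  = yMean - slope * xMean;
        A = -slope; B = 1.0; C = -yInt;  // y = slope*x + yInt  <=>  -slope*x + y - yInt = 0
    }
    return true;
}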
I've just noticed that the original article I was referring to has been deleted. And this web page proposes a little different formula for line fitting: http://hotmath.com/hotmath_help/topics/line-of-best-fit.html
double denominator = sumX2 - 2 * sumX * xMean + nPoints * xMean * xMean;
...
_slope = (sumXY - sumY*xMean - sumX * yMean + nPoints * xMean * yMean) / denominator;
The formulas are identical because nPoints*xMean == sumX and nPoints*xMean*yMean == sumX * yMean == sumY * xMean.
I would suggest coding it from scratch. It is a very simple implementation in C++. You can code up both the intercept and gradient for least-squares fit (the same method as polyfit) from your data directly from the formulas here
http://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_line
These are closed form formulas that you can easily evaluate yourself using loops. If you were using higher degree fits then I would suggest a matrix library or more sophisticated algorithms but for simple linear regression as you describe above this is all you need. Matrices and linear algebra routines would be overkill for such a problem (in my opinion).
The equation of a line is A*x + B*y + C = 0.
It can easily (when B is not too close to zero) be converted to y = (-A/B)*x + (-C/B).
#include <array>
#include <cmath>
#include <utility>
#include <vector>

typedef double scalar_type;
typedef std::array< scalar_type, 2 > point_type;
typedef std::vector< point_type > cloud_type;

bool fit( scalar_type & A, scalar_type & B, scalar_type & C, cloud_type const& cloud )
{
    if( cloud.size() < 2 ){ return false; }

    scalar_type X=0, Y=0, XY=0, X2=0, Y2=0;
    for( auto const& point: cloud )
    { // Do all calculation symmetric regarding X and Y
        X  += point[0];
        Y  += point[1];
        XY += point[0] * point[1];
        X2 += point[0] * point[0];
        Y2 += point[1] * point[1];
    }
    X  /= cloud.size();
    Y  /= cloud.size();
    XY /= cloud.size();
    X2 /= cloud.size();
    Y2 /= cloud.size();

    A = - ( XY - X * Y );                    //!< Common for both solutions
    scalar_type Bx = X2 - X * X;
    scalar_type By = Y2 - Y * Y;
    if( std::fabs( Bx ) < std::fabs( By ) )  //!< Test verticality/horizontality
    { // Line is more Vertical.
        B = By;
        std::swap(A, B);
    }
    else
    { // Line is more Horizontal.
        // Classical solution, when we expect more horizontal-like line
        B = Bx;
    }
    C = - ( A * X + B * Y );

    // Optional normalization:
    // scalar_type D = std::sqrt( A*A + B*B );
    // A /= D;
    // B /= D;
    // C /= D;
    return true;
}
You can also use or go over this implementation; there is also documentation here.
Fitting a line can be accomplished in different ways.
Least squares means minimizing the sum of the squared distances.
But you could take another cost function, for example the (not squared) distance. Normally, though, you use the squared distance (least squares).
There is also a possibility to define the distance in different ways. Normally you just use the "y"-axis for the distance. But you could also use the total/orthogonal distance, where the distance is calculated in both the x- and y-direction. This can be a better fit if you also have errors in the x direction (say it is the time of measurement) and you didn't start the measurement at exactly the time saved in the data. For least squares and total least squares line fits, closed-form algorithms exist. So if you fit with one of those, you get the line with the minimal sum of squared distances to the data points. You can't fit a better line in the sense of your definition; you could only change the definition, for example by taking another cost function or defining distance in another way.
There is a lot of material on fitting models to data you could think of, but normally they all use the "least squares line fit" and you should be fine most of the time. But if you have a special case it can be necessary to think about what you're doing. Doing least squares takes maybe a few minutes; thinking about which method best fits the problem involves understanding the math, which can take an indefinite amount of time :-).
Note: This answer is NOT AN ANSWER TO THIS QUESTION but to the question "Line closest to a set of points", which has been flagged as a "duplicate" of this one (incorrectly, in my opinion); there is no way to add new answers to it.
The question asks for:
Find the line whose distance from all the points is minimum ? By
distance I mean the shortest distance between the point and the line.
The most usual interpretation of distance "between the point and the line" is the euclidean distance and the most common interpretation of "from all points" is the sum of distances (in absolute or squared value).
When the target is to minimize the sum of squared Euclidean distances, linear regression (ordinary least squares) is not the algorithm to use. In addition, linear regression cannot produce a vertical line. The algorithm to use is "total least squares". See for example Wikipedia for the problem description and this answer on Math Stack Exchange for details about the formulation.
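For illustration, here is a compact closed-form total least squares fit (my own sketch, not taken from the linked answer): it takes the principal axis of the 2x2 covariance matrix via atan2, minimizes orthogonal distances, and handles vertical lines naturally.
#include <array>
#include <cmath>
#include <vector>

// Total least squares line through 2D points, as A*x + B*y + C = 0 with A^2 + B^2 = 1.
// The line direction is the principal (largest-variance) axis of the covariance matrix.
bool fitLineTLS(const std::vector<std::array<double, 2>>& pts,
                double& A, double& B, double& C) {
    if (pts.size() < 2) return false;
    const double n = static_cast<double>(pts.size());
    double mx = 0, my = 0;
    for (const auto& p : pts) { mx += p[0]; my += p[1]; }
    mx /= n; my /= n;
    double sxx = 0, syy = 0, sxy = 0;
    for (const auto& p : pts) {
        const double dx = p[0] - mx, dy = p[1] - my;
        sxx += dx * dx; syy += dy * dy; sxy += dx * dy;
    }
    const double theta = 0.5 * std::atan2(2.0 * sxy, sxx - syy); // principal axis angle
    A = -std::sin(theta);              // normal to the principal axis
    B =  std::cos(theta);
    C = -(A * mx + B * my);            // line passes through the centroid
    return true;
}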
To fit a line y = param[0]*x + param[1], simply do this:
double sum_x = 0, sum_y = 0, sum_xy = 0, sum_x2 = 0;
// loop over data:
for (int i = 0; i < ninliers; i++) {
    sum_x  += x[i];
    sum_y  += y[i];
    sum_xy += x[i] * y[i];
    sum_x2 += x[i] * x[i];
}
// means
double mean_x = sum_x / ninliers;
double mean_y = sum_y / ninliers;
float varx = sum_x2 - sum_x * mean_x;
float cov  = sum_xy - sum_x * mean_y;
// check for zero varx before dividing
param[0] = cov / varx;                 // slope
param[1] = mean_y - param[0] * mean_x; // intercept
More on the topic: http://easycalculation.com/statistics/learn-regression.php
(the formulas are the same; they are just multiplied and divided by N, the sample size). If you want to fit a plane to 3D data, use a similar approach -
http://www.mymathforum.com/viewtopic.php?f=13&t=8793
Disclaimer: all quadratic fits are linear and optimal in the sense that they reduce the noise in the parameters. However, you might be interested in reducing the noise in the data instead. You might also want to ignore outliers, since they can bias your solution greatly. Both problems can be solved with RANSAC. See my post at:

Calculating proportion with negative float values

I'd like to know if there's any way in C++ to calculate a proportion involving possibly negative values in both the variables and the extremes.
My goal is to sync a float text input widget with fixed extremes ( eg the user can input any double value between A (min) and B (max) with A,B=any_constant_real_number ) with a slider who can only slide between 0 and 100 ( to simplify ).
If A and B are positive, everything is trivial, as
val_slider = ((val_textin-A)*100)/(B-A)
but as A and B can be any real numbers, it looks to me like the only possibility is to use several if-cases, or huge formulas involving a lot of abs() and zero-division checks, which are quite error prone and costly for such an easy task.
Is there any faster and shorter way to achieve the same in c/c++/stl?
Pardon my bad english. Any hint? Thank you.
Your formula should work fine with negative values of A and B as well, as long as A < B.
Example, if you want the user to be able to enter values from -100 to 100, and map these to a slider which goes from 0 - 100, when the user enters -90 you get:
((-90 - A) * 100) / (B - A) = ((-90 - (-100)) * 100) / (100 - (-100))
= 10 * 100 / 200
= 5
An input value of 50 results in a slider value of:
((50 - A) * 100) / (B - A) = ((50 - (-100)) * 100) / (100 - (-100))
= 150 * 100 / 200
= 75
I don't know C++, but I do know Math, so try:
val_slider = 100 * ( val_textin - A ) / ( B - A )
Hey wait. That's exactly what you have. Test case:
A=-200, B=+200, val_textin = 100 (75% of bar, right?)
val_slider = 100 * ( 100 - -200 ) / ( 200 - - 200 )
= ( 300 ) / ( 400 ) * 100
= 75
See, you got it right. The only thing that COULD happen is B==A, but that can't be accounted for with math and requires a single IF. If they are equal, val_slider is exactly B (or A, as they are equal).
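Wrapped up as a tiny helper with that single check (the function name is mine):
// Map a text-input value in [A, B] onto a slider position in [0, 100].
// Works for negative A and B too, as long as A < B; the only special
// case is A == B, which would otherwise divide by zero.
double toSlider(double value, double A, double B) {
    if (B == A) return 0.0;              // degenerate range: handle however the UI prefers
    return (value - A) * 100.0 / (B - A);
}

// toSlider(-90, -100, 100) == 5,  toSlider(100, -200, 200) == 75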

Optimizations for pow() with const non-integer exponent?

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.
My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:
I have two constant exponents: 2.4 and 1/2.4.
When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!
A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.
I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?
Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebychev terms. It's best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.
#include <stdlib.h>
#include <math.h>
// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebychev polynomial coefs.
double pow512norm (
double x)
{
static const int N = 8;
// Chebychev polynomial terms.
// Non-zero terms calculated via
// integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
// from -1 to 1
// Zeroth term is similar except it uses 1/pi rather than 2/pi.
static const double Cn[N] = {
1.1758200232996901923,
0.16665763094889061230,
-0.0083154894939042125035,
0.00075187976780420279038,
// Wolfram alpha doesn't want to compute the remaining terms
// to more precision (it times out).
-0.0000832402,
0.0000102292,
-1.3401e-6,
1.83334e-7};
double Tn[N];
double u = 2.0*x - 3.0;
Tn[0] = 1.0;
Tn[1] = u;
for (int ii = 2; ii < N; ++ii) {
Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
}
double y = 0.0;
for (int ii = N-1; ii >= 0; --ii) {
y += Cn[ii]*Tn[ii];
}
return y;
}
// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
double x)
{
static const double pow2_512[12] = {
1.0,
pow(2.0, 5.0/12.0),
pow(4.0, 5.0/12.0),
pow(8.0, 5.0/12.0),
pow(16.0, 5.0/12.0),
pow(32.0, 5.0/12.0),
pow(64.0, 5.0/12.0),
pow(128.0, 5.0/12.0),
pow(256.0, 5.0/12.0),
pow(512.0, 5.0/12.0),
pow(1024.0, 5.0/12.0),
pow(2048.0, 5.0/12.0)
};
double s;
int iexp;
s = frexp (x, &iexp);
s *= 2.0;
iexp -= 1;
div_t qr = div (iexp, 12);
if (qr.rem < 0) {
qr.quot -= 1;
qr.rem += 12;
}
return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}
Addendum: What's going on here?
Per request, the following explains how the above code works.
Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)
The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:
x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv)^a=(u^a)(v^a) for positive u,v and real a.
s^(5/12) is calculated via pow512norm(s).
(2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
(2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
(2^r)^(5/12) is calculated by the lookup table pow2_512.
Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
That is exactly what the math library function ldexp does.
Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.
I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak havoc on your retirement portfolio.
Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.
Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.
I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)
pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.
I used Wolfram alpha to find the Chebyshev coefficients Cn for me. For example, this calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram alpha is free; there is a limit on how much computation it will do. It hits that limit on higher order terms. (If you buy or have access to mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)
The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)
pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.
In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).
Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.
To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.
It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.
x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.
But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.
The key is to match the overestimate with an underestimate, and take the average.
Compute x^0.8: four instructions, error ~ +3%.
Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
Compute x^0.4: one mulps.
Compute x^-0.2: one rsqrtps.
Compute x^2: one mulps.
Compute x^3: one mulps.
x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
Average the above: one addps, one mulps.
Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.
To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!
int main() {
__m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
std::printf( "Input: %,vg\n", x0 );
// Approx 5% accuracy from one call. Always an overestimate.
__m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
std::printf( "Direct x^2.4: %,vg\n", x1 );
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
std::printf( "1.38 x^0.8: %,vg\n", xf );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( x0, x0 );
__m128 x3 = _mm_mul_ps( x2, x0 );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
std::printf( "x^2 * x^0.4: %,vg\n", x2 );
std::printf( "x^3 / x^0.6: %,vg\n", x3 );
x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
// Final accuracy about 0.015%, 200x better than x^0.8 calculation.
std::printf( "average = %,vg\n", x2 );
}
Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .
Update with stats
I implemented a little test harness and two x^(5/12) cases corresponding to the above.
#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
__m128 pow125_4( __m128 arg ) {
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 x3 = _mm_mul_ps( x2, arg );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}
__m128 pow512_2( __m128 arg ) {
// 5/12 is too small, so compute the sqrt of 10/12 instead.
__m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}
__m128 pow512_4( __m128 arg ) {
// 5/12 is too small, so compute the 4th root of 20/12 instead.
// 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
// weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
__m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
__m128 xover = _mm_mul_ps( arg, xf );
__m128 xfm1 = _mm_rsqrt_ps( xf );
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 xunder = _mm_mul_ps( x2, xfm1 );
// sqrt2 * over + 2 * sqrt2 * under
__m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
_mm_add_ps( xover, xunder ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
return xavg;
}
__m128 mm_succ_ps( __m128 arg ) {
return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}
void test_pow( double p, __m128 (*f)( __m128 ) ) {
__m128 arg;
for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
! isfinite( _mm_cvtss_f32( f( arg ) ) );
arg = mm_succ_ps( arg ) ) ;
for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
arg = mm_succ_ps( arg ) ) ;
std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );
int n;
int const bucket_size = 1 << 25;
do {
float max_error = 0;
double total_error = 0, cum_error = 0;
for ( n = 0; n != bucket_size; ++ n ) {
float result = _mm_cvtss_f32( f( arg ) );
if ( ! isfinite( result ) ) break;
float actual = ::powf( _mm_cvtss_f32( arg ), p );
float error = ( result - actual ) / actual;
cum_error += error;
error = std::abs( error );
max_error = std::max( max_error, error );
total_error += error;
arg = mm_succ_ps( arg );
}
std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
} while ( n == bucket_size );
}
int main() {
std::printf( "4 insn x^12/5:\n" );
test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
std::printf( "14 insn x^12/5:\n" );
test_pow( 12./5, & pow125_4 );
std::printf( "6 insn x^5/12:\n" );
test_pow( 5./12, & pow512_2 );
std::printf( "14 insn x^5/12:\n" );
test_pow( 5./12, & pow512_4 );
}
Output:
4 insn x^12/5:
Domain from 1.36909e-23
error max = inf avg = inf |avg| = inf to 8.97249e-19
error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14
error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09
error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553
error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513
error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06
error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10
error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15
error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max = inf avg = nan |avg| = nan to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411
error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629
error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27
error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22
error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17
error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12
error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07
error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125
error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512
error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07
error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12
error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17
error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21
error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26
error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31
error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36
error max = 0.999329 avg = nan |avg| = nan to -0
14 insn x^5/12:
Domain from 2.64698e-23
error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32
error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
I suspect accuracy of the more accurate 5/12 is being limited by the rsqrt operation.
Ian Stephenson wrote this code which he claims outperforms pow(). He describes the idea as follows:
Pow is basically implemented using
logs: pow(a,b) = x^(log_x(a)*b). So we
need a fast log and fast exponent - it
doesn't matter what x is so we use 2.
The trick is that a floating point
number is already in a log style
format:
a = M * 2^E
Taking the log of both sides gives:
log2(a)=log2(M)+E
or more simply:
log2(a)~=E
In other words if we take the floating
point representation of a number, and
extract the Exponent we've got
something that's a good starting point
as its log. It turns out that when we
do this by massaging the bit patterns,
the Mantissa ends up giving a good
approximation to the error, and it
works pretty well.
This should be good enough for simple
lighting calculations, but if you need
something better, you can then extract
the Mantissa, and use that to
calculate a quadratic correction factor
which is pretty accurate.
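A rough scalar sketch of that bit-pattern idea (my own illustration, not Ian Stephenson's actual code, and without the mantissa correction he describes, so expect errors of a few percent):
#include <cstdint>
#include <cstring>

// Crude pow(a, b) for a > 0: the IEEE-754 bit pattern of a float is roughly
// (log2(a) + 127) * 2^23, so scaling the "logarithm" part by b approximates a^b.
float roughPow(float a, float b) {
    std::uint32_t bits;
    std::memcpy(&bits, &a, sizeof bits);
    const std::int32_t bias = 127 << 23;   // exponent bias expressed in bit-space
    std::int32_t scaled =
        static_cast<std::int32_t>(b * (static_cast<std::int32_t>(bits) - bias)) + bias;
    std::uint32_t out = static_cast<std::uint32_t>(scaled);
    float result;
    std::memcpy(&result, &out, sizeof result);
    return result;
}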
First off, using floats isn't going to buy much on most machines nowadays. In fact, doubles can be faster. Your power, 1.0/2.4, is 5/12 or 1/3*(1+1/4). Even though this is calling cbrt (once) and sqrt (twice!) it is still twice as fast as using pow(). (Optimization: -O3, compiler: i686-apple-darwin10-g++-4.2.1).
#include <math.h> // cmath does not provide cbrt; C99 does.
double xpow512 (double x) {
    double cbrtx = cbrt(x);
    return cbrtx*sqrt(sqrt(cbrtx));
}
This might not answer your question.
The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.
If this is the case, try using a lookup table. Something like:
__attribute__((aligned(64)))
static const unsigned short SRGB_TO_LINEAR[256] = { ... };
__attribute__((aligned(64)))
static const unsigned short LINEAR_TO_SRGB[256] = { ... };
void apply_lut(const unsigned short lut[256], unsigned char *src, ...
If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.
(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)
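Filling such a table is a one-time cost, so the expensive pow() only runs 256 times up front. A sketch using the standard sRGB decode curve (the function name is mine):
#include <cmath>

static unsigned short SRGB_TO_LINEAR[256];

// Build the 8-bit sRGB -> 16-bit linear table once, e.g. at startup.
void init_srgb_to_linear() {
    for (int i = 0; i < 256; ++i) {
        const double s = i / 255.0;
        const double lin = (s <= 0.04045) ? s / 12.92
                                          : std::pow((s + 0.055) / 1.055, 2.4);
        SRGB_TO_LINEAR[i] = static_cast<unsigned short>(lin * 65535.0 + 0.5);
    }
}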
I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.
inline double poly7(double x, double a, double b, double c, double d,
double e, double f, double g, double h) {
double ab, cd, ef, gh, abcd, efgh, x2, x4;
x2 = x*x; x4 = x2*x2;
ab = a*x + b; cd = c*x + d;
ef = e*x + f; gh = g*x + h;
abcd = ab*x2 + cd; efgh = ef*x2 + gh;
return abcd*x4 + efgh;
}
inline double srgb_to_linear(double x) {
if (x <= 0.04045) return x / 12.92;
// Polynomial approximation of ((x+0.055)/1.055)^2.4.
return poly7(x, 0.15237971711927983387,
-0.57235993072870072762,
0.92097986411523535821,
-0.90208229831912012386,
0.88348956209696805075,
0.48110797889132134175,
0.03563925285274562038,
0.00084585397227064120);
}
inline double linear_to_srgb(double x) {
if (x <= 0.0031308) return x * 12.92;
// Piecewise polynomial approximation (divided by x^3)
// of 1.055 * x^(1/2.4) - 0.055.
if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
1224.97114922729451791383,
-100.23413743425112443219,
6.60361150127077944916,
0.06114808961060447245,
-0.00022244138470139442,
0.00000041231840827815,
-0.00000000035133685895) / (x*x*x);
return poly7(x, -0.18730034115395793881,
0.64677431008037400417,
-0.99032868647877825286,
1.20939072663263713636,
0.33433459165487383613,
-0.01345095746411287783,
0.00044351684288719036,
-0.00000664263587520855) / (x*x*x);
}
And the sollya input used to generate the polynomials:
suppressmessage(174);
f = ((x+0.055)/1.055)^2.4;
p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
s = 0.0523;
z = 3;
f = 1.055 * x^(1/2.4) - 0.055;
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
print(canonical(p));
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
The binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a.) You'll have to do some analysis to decide how many terms you need for your desired accuracy.
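As a sketch of that idea (my own, with an arbitrary fixed term count; how many terms you actually need comes from the accuracy analysis mentioned above):
// Truncated generalized binomial series: (1 + t)^a = sum_k C(a,k) * t^k,
// where C(a,k) = a*(a-1)*...*(a-k+1)/k!.  Converges for |t| < 1, so the
// input must first be normalized so that x = 1 + t lies in [1, 2).
double binomialPow(double x, double a, int terms = 12) {
    const double t = x - 1.0;
    double sum = 1.0;     // k = 0 term
    double coeff = 1.0;   // running C(a,k)
    double tpow = 1.0;    // running t^k
    for (int k = 1; k < terms; ++k) {
        coeff *= (a - (k - 1)) / k;
        tpow *= t;
        sum += coeff * tpow;
    }
    return sum;
}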
For an exponent of 2.4, you could either make a lookup table for all your 2.4 values and lerp, or perhaps use a higher-order function to fill in the in-between values if the table isn't accurate enough (basically a huge log table).
Or, compute value squared times value to the 2/5ths: reuse the squared value from the first factor and then take its 5th root. For the 5th root you could use Newton's method or some other fast approximator, though honestly, once you get to this point you're probably better off just implementing the exp and log functions with appropriately abbreviated series yourself. A sketch of the square-times-fifth-root route follows below.
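A minimal sketch of that route (my own illustration; the fixed seed and iteration count are arbitrary choices, loosely tuned for the question's input range of roughly (0.09, 1]):
// Fifth root via Newton's method on f(y) = y^5 - a:  y <- (4*y + a/y^4) / 5.
// With x in (0.09, 1], a = x*x stays in about (0.008, 1], so seeding at 1.0
// converges to several digits within eight iterations or so.
double fifthRoot(double a, int iters = 8) {
    double y = 1.0;
    for (int i = 0; i < iters; ++i)
        y = (4.0 * y + a / (y * y * y * y)) / 5.0;
    return y;
}

// x^2.4 = x^2 * x^(2/5) = x^2 * fifthRoot(x^2)
double pow24(double x) {
    const double x2 = x * x;
    return x2 * fifthRoot(x2);
}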
The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x and pow(x, n), you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n) for small delta, with a single multiply and add (more or less). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta you could use and still get the same accuracy.
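A sketch of that amortization (the class name and the recompute threshold are mine): cache the last exact evaluation and apply the first-order step d(x^n)/dx = n*x^(n-1) for nearby inputs.
#include <cmath>

// Approximates pow(x, n) for a fixed n, reusing the last exact evaluation:
// pow(x0 + d, n) ~= pow(x0, n) + d * n * pow(x0, n) / x0   for small d.
class IncrementalPow {
public:
    explicit IncrementalPow(double n) : n_(n) {}
    double operator()(double x) {
        if (!valid_ || std::fabs(x - x0_) > 0.01 * std::fabs(x0_)) {
            x0_ = x;                 // too far from the cached point: recompute exactly
            p0_ = std::pow(x, n_);
            valid_ = true;
            return p0_;
        }
        const double d = x - x0_;
        return p0_ + d * n_ * p0_ / x0_;   // one multiply-add off the cached value
    }
private:
    double n_, x0_ = 0.0, p0_ = 0.0;
    bool valid_ = false;
};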
So traditionally the powf(x, p) = x^p is solved by rewriting x as x=2^(log2(x)) making powf(x,p) = 2^(p*log2(x)), which transforms the problem into two approximations exp2() & log2(). This has the advantage of working with larger powers p, however the downside is that this is not the optimal solution for a constant power p and over a specified input bound 0 ≤ x ≤ 1.
When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4 as can be seen below:
float pow12_5(float x){
float mp;
// Minimax Horner polynomials for x^(12/5). Note: choose the accuracy required, then implement with fma() [Fused Multiply Accumulates]
// mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4)))); // 2.6079002e-4
// mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4)))); // 8.61377e-5
// mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4)))))); // 3.524655e-5
return(mp);
}
However when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem y=x^p=x^(p+m)/x^m where m=1,2,3 is a positive integer, making the new power approximation p > 1 but this introduces division which is inherently slower.
There's however another option which is to decompose the input x as its floating point exponent and mantissa form:
x = mx* 2^(ex) where 1 ≤ mx < 2
y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
= mx^(5/12) * 2^(k/12) * 2^(ey)
The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires a 12-entry LUT for 2^(k/12). The code is below:
float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0, 0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
float pow5_12(float x){
union{float f; uint32_t u;} v, e2;
float poff, m, e, ei;
int xe;
v.f = x;
xe = ((v.u >> 23) - 127);
if(xe < -127) return(0.0f);
// Calculate remainder k in 2^(k/12) to find LUT
e = xe * (5.0f/12.0f);
ei = floorf(e);
poff = powk_12LUT[(int)(12.0f * (e - ei))];
e2.u = ((int)ei + 127) << 23; // Calculate the exponent
v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero
// Approximate mx^(5/12) on [1,2), with appropriate degree minimax
// m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4)); // 7.6125e-4
// m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8)); // 8.4522726e-5
m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8)))); // 1.04091259e-5
// m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8)))); // 1.367401e-6
return(m * poff * e2.f);
}