Second order differential equation using C++ Boost odeint library - c++

Using boost c++ odeint library, is it possible to solve a second order differential equation defined as follows ?
m*x''[i] + x'[i] = K*\sum{j=1,N} sin(x[j] - x[i]), where i = 1,2,3..N.
m = 1, K = 1
where initial value of x is an vector or array of N uniformly generated random numbers between 0 to 2*pi.
I want to integrate above equation using runge_kutta stepper of odeint ?
I can solve it by writing above eqn. in two first order differential equations, but
then in that case how the odeint stepper's would be written or modified ?

Just transform your equations to a first order ODE and use a state type of length 2 N. The first N entries now handle only the x[i] while the second N entries refer to the velocities x'[i]
void ode( state_type const& x , state_type &dxdt , double t )
{
for( size_t i=0 ; i<N ; ++i )
{
double sum = 0.0;
// calculate sum
dxdt[i] = x[i+N];
dxdt[i+N] = K * sum;
}
}
A complete example might look like
size_t N = 512;
typedef std::vector< double > state_type;
state_type x( 2 * N );
// initialize x
double t_start = 0.0 , t_end = 10.0 , dt = 0.01;
odeint::integrate( ode , x , t_start , t_end , dt );

Related

Cauchy Distribution Secant Method C++ Rcpp MLE/Root

I am quite new to C++ and Rcpp integration. I am required to create a program using C++ with R integration to find the MLE/root of a Cauchy Distribution.
Below is my code thus far.
#include <Rcpp.h>
#include <math.h>
#include <iostream>
#include <cstdlib>
using namespace std;
using namespace Rcpp;
// [[Rcpp::export]]
double Cauchy(double x, double y); //Declare Function
double Cauchy(double x,double y) //Define Function
{
return 1/(M_PI*(1+(pow(x-y,2)))); //write the equation whose roots are to be determined x=chosen y=theta
}
using namespace std;
// [[Rcpp::export]]
int Secant (NumericVector x){
NumericVector xvector(x) ; //Input of x vector
double eplison= 0.001; //Threshold
double a= xvector[3]; //Select starting point
double b= xvector[4];//Select end point
double c= 0.0; //initial value for c
double Theta= 10.6; //median value for theta estimate
int noofIter= 0; //Iterations
double error = 0.0;
if (std::abs(Cauchy(a, Theta) <(std::abs(Cauchy(a, Theta))))) {
do {
a=b;
b=c;
error= (b-(Cauchy(b, Theta)))*((a-b)/(Cauchy(a, Theta)-Cauchy(b, Theta)));
error= Cauchy(c,Theta);
//return number of iterations
noofIter++;
for (int i = 0; i < noofIter; i += 1) {
cout << "The Value is " << c << endl;
cout << "The Value is " << a << endl;
cout << "The Value is " << b << endl;
cout << "The Value is " << Theta << endl;
}
} while (std::abs(error)>eplison);
}
cout<<"\nThe root of the equation is occurs at "<<c<<endl; //print the root
cout << "The number of iterations is " << noofIter;
return 0;
}
With a few amendments, the program either goes into a never ending loop or returns a value which is infinitely small.
My understanding of this mathematics is limited. Therefore any help or correction in that would be greatly appreciated.
The X vector that we have been given as an output is
x <- c( 11.262307 , 10.281078 , 10.287090 , 12.734039 ,
11.731881 , 8.861998 , 12.246509 , 11.244818 ,
9.696278 , 11.557572 , 11.112531 , 10.550190 ,
9.018438 , 10.704774 , 9.515617 , 10.003247 ,
10.278352 , 9.709630 , 10.963905 , 17.314814)
Using previous R code we know that the MLE/Root for this distribution occurs at 10.5935 approx
The code used to obtain this MLE was
optimize(function(theta)-sum(dcauchy(x, location=theta,
log=TRUE)), c(-100,100))
Thanks!
With the optimize()function you are directly searching for an extremum of the likelihood. An alternative is to use a root finding algorithm (e.g. the secant method) together with the derivative of the (log-)likelihood. From Wikipedia we get the formula we have to solve. In R this could look like this:
x <- c( 11.262307 , 10.281078 , 10.287090 , 12.734039 ,
11.731881 , 8.861998 , 12.246509 , 11.244818 ,
9.696278 , 11.557572 , 11.112531 , 10.550190 ,
9.018438 , 10.704774 , 9.515617 , 10.003247 ,
10.278352 , 9.709630 , 10.963905 , 17.314814)
ld <- function(sample, theta){
xp <- outer(sample, theta, FUN = "-")
colSums(xp/(1+xp^2))
}
uniroot(ld, sample = x, lower = 0, upper = 20)$root
#> [1] 10.59724
Note that the derivative of the log-likelihood is vectorized on both arguments. This allows for easy plotting:
theta <- seq(0, 20, length=500)
plot(theta, ld(x, theta), type="l",
xlab=expression(theta), ylab=expression(ld(x, theta)))
From this plot we already see that it will be tricky to find the correct starting points for the secant method to work.
Let's move this to C++ (C++11 to be precise):
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
Rcpp::List secant(const std::function<double(double)>& f,
double a, double b, int maxIterations, double epsilon) {
double c(0.0);
do {
c = b * (1 - (1 - a/b) / (1 - f(a)/f(b)));
a = b;
b = c;
} while (maxIterations-- > 0 && std::abs(a - b) > epsilon);
return Rcpp::List::create(Rcpp::Named("root") = c,
Rcpp::Named("f.root") = f(c),
Rcpp::Named("converged") = (maxIterations > 0));
}
// [[Rcpp::export]]
Rcpp::List mleCauchy(const Rcpp::NumericVector& sample, double a, double b,
int maxIterations = 100, double epsilon = 0.0001) {
auto f = [&sample](double theta) {
Rcpp::NumericVector xp = sample - theta;
xp = xp / (1 + xp * xp);
return Rcpp::sum(xp);
};
return secant(f, a, b, maxIterations, epsilon);
}
/*** R
x <- c( 11.262307 , 10.281078 , 10.287090 , 12.734039 ,
11.731881 , 8.861998 , 12.246509 , 11.244818 ,
9.696278 , 11.557572 , 11.112531 , 10.550190 ,
9.018438 , 10.704774 , 9.515617 , 10.003247 ,
10.278352 , 9.709630 , 10.963905 , 17.314814)
mleCauchy(x, 11, 15)
#-> does not converge
mleCauchy(x, 11, 14)
#-> 10.59721
mleCauchy(x, mean(x), median(x))
#-> 10.59721
*/
The secant() function works for any std::function that takes a double as argument and returns a double. Such a function is than defined as lambda function that depends on the provided sample values. As expected one gets the correct root only when starting with values that are close to the correct value.
Lambda functions might be a bit confusing at first sight, but they are very close to what we are used to in R. Here the same algorithm written in R:
secant <- function(f, a, b, maxIterations, epsilon) {
for (i in seq.int(maxIterations)) {
c <- b * (1 - (1 - a/b) / (1 - f(a)/f(b)))
a <- b
b <- c
if (abs(a - b) <= epsilon)
break
}
list(root = c, f.root = f(c), converged = (i < maxIterations))
}
mleCauchy <- function(sample, a, b, maxIterations = 100L, epsilon = 0.001) {
f <- function(theta) {
xp <- sample - theta
sum(xp/(1 + xp^2))
}
secant(f, a, b, maxIterations, epsilon)
}
x <- c( 11.262307 , 10.281078 , 10.287090 , 12.734039 ,
11.731881 , 8.861998 , 12.246509 , 11.244818 ,
9.696278 , 11.557572 , 11.112531 , 10.550190 ,
9.018438 , 10.704774 , 9.515617 , 10.003247 ,
10.278352 , 9.709630 , 10.963905 , 17.314814)
mleCauchy(x, 11, 12)
#-> 10.59721
The R function f and the lambda function f take the vector sample from the environment where they are defined. In R this happens implicitly, while in C++ we have to explicitly tell that this value should be captured. The numeric theta is the argument supplied when the function is called, i.e. the successive estimates for the root starting with a and b.

C++ - Complex Value mistake, computing Cross Spectral Density (CSD)

Dear community,
I am facing a rather annoying problem. I am calculating the Cross Spectral Density (CSD) between two time signals, which were already proccessed with FFT to two complex frequency vectors(Singal1 =>freqvec, Signal2 => freqvec2).
RowVectorXcd CSD(n_Epochs, fftsize);
for(int j = 0; j < fftsize; j++) {
std::complex<double> cospectrum = freqvec(j).real() * freqvec2(j).real() + freqvec(j).imag() * freqvec2(j).imag() ;
std::complex<double> quadspectrum = freqvec(j).real() * freqvec2(j).imag() - freqvec(j).imag() * freqvec2(j).real() ;
std::cout << "cospectrum:"<<cospectrum<< std::endl;
CSD(j) = sqrt( pow( cospectrum, 2 ) + pow( quadspectrum, 2) ) ;
For further computations I need to get the imaginary part of this calculation correctly.
The calculation does work, but somehow the result always has an imaginary value of zero.

Kernel Density Estimator ( with Gauss Kernel ) Sum f(x) = 1?

I want to use KDE with the Gaussian Kernel. If I'm correct, the sum of all f(x) must be 1 ( ~ rounding ) ?
My Implementation looks like this:
float K( float const& val)
{
const float p=1.0 / std::sqrt( 2.0 * M_PI);
float result = 0.5 * (val*val);
result = p * std::exp(- result);
return result;
};
std::vector< std::pair<float, float> kde( float *val, int len float h)
{
std::vector< std::pair<float, float>> density( len );
const float p = 1.0 / (h * len );
for(int r=0;r<len;r++)
{
float sum = 0;
for(int i=0;i<len;i++)
sum += k( (val[r] - val[i]) / h );
density[r] = std::make_pair( val[r], p*sum );
}
return density;
}
And I choosed h > 0. Am i right that p*sum is the probability for the value val[r] ? The sum over all probability is > 1 ( but looks ok for me ).
You misinterpreted the assumptions on the probability density here. The density integrates to one, whereas its values at certain points are definitely not 1.
Let's discuss it using the following formula from the linked Wikipedia article which you seem to use:
This formula provides the density f_h(x) evaluated at point x.
From my review, your code correctly evaluates this quantity. Yet, you misinterpreted the quantity which should be one. As a density, the integral over the complete space should yield one, i.e.
This property is called normalization of the density.
Moreover, being a density itself, each summand of f_h(x) should yield 1/n when integrated over the whole space, when one also includes the normalization constant. Again, there's no guarantee on the values of the summands.
In one dimension, you can easily confirm the normalization by using the trapezoidal rule or another quadrature scheme (--if you provide a working example, I can try to do that.)

Replacing for-loops using thrust::transform

I am trying to optimize my code by implementing for loops on threads of the GPU. I am trying to eliminate two for loops using thrust::transform. The code in C++ looks like:
ka_index = 0;
for (int i = 0; i < N_gene; i++)
{
for (int j = 0; j < n_ka_d[i]; j++ )
{
co0 = get_coeff0(ka_vec_d[ka_index]);
act[i] += (co0*ka_val_d[ka_index]);
ka_index++;
}
act[i] = pow(act[i],n);
}
I am estimating co-efficients for an ordinary differential equation(ODE) in the above loops
and have transferred all the data onto the device using thrust. Consider the case where the number of genes is represented by N_gene. The fist for loop has to run N_gene number of times. The second for loop is restricted by the number of activators(other friendly genes in the gene pool) of each gene. Each gene has a number of activators(friendly genes whose presence increases the concentration of gene i) represented by elements of n_ka vector. Value of n_ka[i] can vary from 0 to N_gene - 1. ka_val represents the measure of activation for each activator ka. ka_vec_d has the gene index which activates gene i.
I am trying to represent these loops using iterators, but unable to do so. I am familiar with using thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple)) for a single for loop, but having a tough time coming up with a way to implement two for loops using counting_iterator or transform iterators. Any pointers or help to convert these two for loops will be appreciated. Thanks for your time!
This looks like a reduce problem. I think you can use thrust::transform with zip iterators and thrust::reduce_by_key. A sketch of this solution is:
// generate indices
std::vector< int > hindices;
for( size_t i=0 ; i<N_gene ; ++i )
for( size_t j=0 ; j<n_ka_d[i] ; ++j )
hindices.push_back( i );
thrust::device_vector< int > indices = hindices;
// generate tmp
// trafo1 implements get_coeff0( get< 0 >( t ) ) * get< 1 >( t);
thrust::device_vector< double > tmp( N );
thrust::transform(
thrust::make_zip_iterator(
thrust::make_tuple( ka_vec_d.begin() , ka_val_d.begin() ) ) ,
thrust::make_zip_iterator(
thrust::make_tuple( ka_vec_d.end() , ka_val_d.end() ) ) ,
tmp.begin() , trafo1 );
// do the reduction for each ac[i]
thrust::device_vector< int > indices_out( N );
thrust::reduce_by_key( indices.begin() , indices.end() , tmp.begin() ,
ac.begin() , indices_out.begin() );
// do the pow transformation
thrust::transform( ac.begin() , ac.end() , ac.begin() , pow_trafo );
I this this can also be optimized by transform_iterators to reduce the number of calls of thrust::transform and thrust::recuce_by_key.

Optimizations for pow() with const non-integer exponent?

I have hot spots in my code where I'm doing pow() taking up around 10-20% of my execution time.
My input to pow(x,y) is very specific, so I'm wondering if there's a way to roll two pow() approximations (one for each exponent) with higher performance:
I have two constant exponents: 2.4 and 1/2.4.
When the exponent is 2.4, x will be in the range (0.090473935, 1.0].
When the exponent is 1/2.4, x will be in the range (0.0031308, 1.0].
I'm using SSE/AVX float vectors. If platform specifics can be taken advantage of, right on!
A maximum error rate around 0.01% is ideal, though I'm interested in full precision (for float) algorithms as well.
I'm already using a fast pow() approximation, but it doesn't take these constraints into account. Is it possible to do better?
Another answer because this is very different from my previous answer, and this is blazing fast. Relative error is 3e-8. Want more accuracy? Add a couple more Chebychev terms. It's best to keep the order odd as this makes for a small discontinuity between 2^n-epsilon and 2^n+epsilon.
#include <stdlib.h>
#include <math.h>
// Returns x^(5/12) for x in [1,2), to within 3e-8 (relative error).
// Want more precision? Add more Chebychev polynomial coefs.
double pow512norm (
double x)
{
static const int N = 8;
// Chebychev polynomial terms.
// Non-zero terms calculated via
// integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^(5/12)
// from -1 to 1
// Zeroth term is similar except it uses 1/pi rather than 2/pi.
static const double Cn[N] = {
1.1758200232996901923,
0.16665763094889061230,
-0.0083154894939042125035,
0.00075187976780420279038,
// Wolfram alpha doesn't want to compute the remaining terms
// to more precision (it times out).
-0.0000832402,
0.0000102292,
-1.3401e-6,
1.83334e-7};
double Tn[N];
double u = 2.0*x - 3.0;
Tn[0] = 1.0;
Tn[1] = u;
for (int ii = 2; ii < N; ++ii) {
Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
}
double y = 0.0;
for (int ii = N-1; ii >= 0; --ii) {
y += Cn[ii]*Tn[ii];
}
return y;
}
// Returns x^(5/12) to within 3e-8 (relative error).
double pow512 (
double x)
{
static const double pow2_512[12] = {
1.0,
pow(2.0, 5.0/12.0),
pow(4.0, 5.0/12.0),
pow(8.0, 5.0/12.0),
pow(16.0, 5.0/12.0),
pow(32.0, 5.0/12.0),
pow(64.0, 5.0/12.0),
pow(128.0, 5.0/12.0),
pow(256.0, 5.0/12.0),
pow(512.0, 5.0/12.0),
pow(1024.0, 5.0/12.0),
pow(2048.0, 5.0/12.0)
};
double s;
int iexp;
s = frexp (x, &iexp);
s *= 2.0;
iexp -= 1;
div_t qr = div (iexp, 12);
if (qr.rem < 0) {
qr.quot -= 1;
qr.rem += 12;
}
return ldexp (pow512norm(s)*pow2_512[qr.rem], 5*qr.quot);
}
Addendum: What's going on here?
Per request, the following explains how the above code works.
Overview
The above code defines two functions, double pow512norm (double x) and double pow512 (double x). The latter is the entry point to the suite; this is the function that user code should call to calculate x^(5/12). The function pow512norm(x) uses Chebyshev polynomials to approximate x^(5/12), but only for x in the range [1,2]. (Use pow512norm(x) for values of x outside that range and the result will be garbage.)
The function pow512(x) splits the incoming x into a pair (double s, int n) such that x = s * 2^n and such that 1≤s<2. A further partitioning of n into (int q, unsigned int r) such that n = 12*q + r and r is less than 12 lets me split the problem of finding x^(5/12) into parts:
x^(5/12)=(s^(5/12))*((2^n)^(5/12)) via (uv)^a=(u^a)(v^a) for positive u,v and real a.
s^(5/12) is calculated via pow512norm(s).
(2^n)^(5/12)=(2^(12*q+r))^(5/12) via substitution.
2^(12*q+r)=(2^(12*q))*(2^r) via u^(a+b)=(u^a)*(u^b) for positive u, real a,b.
(2^(12*q+r))^(5/12)=(2^(5*q))*((2^r)^(5/12)) via some more manipulations.
(2^r)^(5/12) is calculated by the lookup table pow2_512.
Calculate pow512norm(s)*pow2_512[qr.rem] and we're almost there. Here qr.rem is the r value calculated in step 3 above. All that is needed is to multiply this by 2^(5*q) to yield the desired result.
That is exactly what the math library function ldexp does.
Function Approximation
The goal here is to come up with an easily computable approximation of f(x)=x^(5/12) that is 'good enough' for the problem at hand. Our approximation should be close to f(x) in some sense. Rhetorical question: What does 'close to' mean? Two competing interpretations are minimizing the mean square error versus minimizing the maximum absolute error.
I'll use a stock market analogy to describe the difference between these. Suppose you want to save for your eventual retirement. If you are in your twenties, the best thing to do is to invest in stocks or stock market funds. This is because over a long enough span of time, the stock market on average beats any other investment scheme. However, we've all seen times when putting money into stocks is a very bad thing to do. If you are in your fifties or sixties (or forties if you want to retire young) you need to invest a bit more conservatively. Those downswings can wreak have on your retirement portfolio.
Back to function approximation: As the consumer of some approximation, you are typically worried about the worst-case error rather than the performance "on average". Use some approximation constructed to give the best performance "on average" (e.g. least squares) and Murphy's law dictates that your program will spend a whole lot of time using the approximation exactly where the performance is far worse than average. What you want is a minimax approximation, something that minimizes the maximum absolute error over some domain. A good math library will take a minimax approach rather than a least squares approach because this lets the authors of the math library give some guaranteed performance of their library.
Math libraries typically use a polynomial or a rational polynomial to approximate some function f(x) over some domain a≤x≤b. Suppose the function f(x) is analytic over this domain and you want to approximate the function by some polynomial p(x) of degree N. For a given degree N there exists some magical, unique polynomial p(x) such that p(x)-f(x) has N+2 extrema over [a,b] and such that the absolute values of these N+2 extrema are all equal to one another. Finding this magical polynomial p(x) is the holy grail of function approximators.
I did not find that holy grail for you. I instead used a Chebyshev approximation. The Chebyshev polynomials of the first kind are an orthogonal (but not orthonormal) set of polynomials with some very nice features when it comes to function approximation. The Chebyshev approximation oftentimes is very close to that magical polynomial p(x). (In fact, the Remez exchange algorithm that does find that holy grail polynomial typically starts with a Chebyshev approximation.)
pow512norm(x)
This function uses Chebyshev approximation to find some polynomial p*(x) that approximates x^(5/12). Here I'm using p*(x) to distinguish this Chebyshev approximation from the magical polynomial p(x) described above. The Chebyshev approximation p*(x) is easy to find; finding p(x) is a bear. The Chebyshev approximation p*(x) is sum_i Cn[i]*Tn(i,x), where the Cn[i] are the Chebyshev coefficients and Tn(i,x) are the Chebyshev polynomials evaluated at x.
I used Wolfram alpha to find the Chebyshev coefficients Cn for me. For example, this calculates Cn[1]. The first box after the input box has the desired answer, 0.166658 in this case. That's not as many digits as I would like. Click on 'more digits' and voila, you get a whole lot more digits. Wolfram alpha is free; there is a limit on how much computation it will do. It hits that limit on higher order terms. (If you buy or have access to mathematica you will be able to calculate those high-order coefficients to a high degree of precision.)
The Chebyshev polynomials Tn(x) are calculated in the array Tn. Beyond giving something very close to magical polynomial p(x), another reason for using Chebyshev approximation is that the values of those Chebyshev polynomials are easily calculated: Start with Tn[0]=1 and Tn[1]=x, and then iteratively calculate Tn[i]=2*x*Tn[i-1] - Tn[i-2]. (I used 'ii' as the index variable rather than 'i' in my code. I never use 'i' as a variable name. How many words in the English language have an 'i' in the word? How many have two consecutive 'i's?)
pow512(x)
pow512 is the function that user code should be calling. I already described the basics of this function above. A few more details: The math library function frexp(x) returns the significand s and exponent iexp for the input x. (Minor issue: I want s between 1 and 2 for use with pow512norm but frexp returns a value between 0.5 and 1.) The math library function div returns the quotient and remainder for integer division in one swell foop. Finally, I use the math library function ldexp to put the three parts together to form the final answer.
In the IEEE 754 hacking vein, here is another solution which is faster and less "magical." It achieves an error margin of .08% in about a dozen clock cycles (for the case of p=2.4, on an Intel Merom CPU).
Floating point numbers were originally invented as an approximation to logarithms, so you can use the integer value as an approximation of log2. This is somewhat-portably achievable by applying the convert-from-integer instruction to a floating-point value, to obtain another floating-point value.
To complete the pow computation, you can multiply by a constant factor and convert the logarithm back with the convert-to-integer instruction. On SSE, the relevant instructions are cvtdq2ps and cvtps2dq.
It's not quite so simple, though. The exponent field in IEEE 754 is signed, with a bias value of 127 representing an exponent of zero. This bias must be removed before you multiply the logarithm, and re-added before you exponentiate. Furthermore, bias adjustment by subtraction won't work on zero. Fortunately, both adjustments can be achieved by multiplying by a constant factor beforehand.
x^p
= exp2( p * log2( x ) )
= exp2( p * ( log2( x ) + 127 - 127 ) - 127 + 127 )
= cvtps2dq( p * ( log2( x ) + 127 - 127 - 127 / p ) )
= cvtps2dq( p * ( log2( x ) + 127 - log2( exp2( 127 - 127 / p ) ) )
= cvtps2dq( p * ( log2( x * exp2( 127 / p - 127 ) ) + 127 ) )
= cvtps2dq( p * ( cvtdq2ps( x * exp2( 127 / p - 127 ) ) ) )
exp2( 127 / p - 127 ) is the constant factor. This function is rather specialized: it won't work with small fractional exponents, because the constant factor grows exponentially with the inverse of the exponent and will overflow. It won't work with negative exponents. Large exponents lead to high error, because the mantissa bits are mingled with the exponent bits by the multiplication.
But, it's just 4 fast instructions long. Pre-multiply, convert from "integer" (to logarithm), power-multiply, convert to "integer" (from logarithm). Conversions are very fast on this implementation of SSE. We can also squeeze an extra constant coefficient into the first multiplication.
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
A few trials with exponent = 2.4 show this consistently overestimates by about 5%. (The routine is always guaranteed to overestimate.) You could simply multiply by 0.95, but a few more instructions will get us about 4 decimal digits of accuracy, which should be enough for graphics.
The key is to match the overestimate with an underestimate, and take the average.
Compute x^0.8: four instructions, error ~ +3%.
Compute x^-0.4: one rsqrtps. (This is quite accurate enough, but does sacrifice the ability to work with zero.)
Compute x^0.4: one mulps.
Compute x^-0.2: one rsqrtps.
Compute x^2: one mulps.
Compute x^3: one mulps.
x^2.4 = x^2 * x^0.4: one mulps. This is the overestimate.
x^2.4 = x^3 * x^-0.4 * x^-0.2: two mulps. This is the underestimate.
Average the above: one addps, one mulps.
Instruction tally: fourteen, including two conversions with latency = 5 and two reciprocal square root estimates with throughput = 4.
To properly take the average, we want to weight the estimates by their expected errors. The underestimate raises the error to a power of 0.6 vs 0.4, so we expect it to be 1.5x as erroneous. Weighting doesn't add any instructions; it can be done in the pre-factor. Calling the coefficient a: a^0.5 = 1.5 a^-0.75, and a = 1.38316186.
The final error is about .015%, or 2 orders of magnitude better than the initial fastpow result. The runtime is about a dozen cycles for a busy loop with volatile source and destination variables… although it's overlapping the iterations, real-world usage will also see instruction-level parallelism. Considering SIMD, that's a throughput of one scalar result per 3 cycles!
int main() {
__m128 const x0 = _mm_set_ps( 0.01, 1, 5, 1234.567 );
std::printf( "Input: %,vg\n", x0 );
// Approx 5% accuracy from one call. Always an overestimate.
__m128 x1 = fastpow< 24, 10, 1, 1 >( x0 );
std::printf( "Direct x^2.4: %,vg\n", x1 );
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 8, 10, int( 1.38316186 * 1e9 ), int( 1e9 ) >( x0 );
std::printf( "1.38 x^0.8: %,vg\n", xf );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( x0, x0 );
__m128 x3 = _mm_mul_ps( x2, x0 );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4. Combine with x^-0.4 into x^-0.6 and x^2.4.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
std::printf( "x^2 * x^0.4: %,vg\n", x2 );
std::printf( "x^3 / x^0.6: %,vg\n", x3 );
x2 = _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 ) );
// Final accuracy about 0.015%, 200x better than x^0.8 calculation.
std::printf( "average = %,vg\n", x2 );
}
Well… sorry I wasn't able to post this sooner. And extending it to x^1/2.4 is left as an exercise ;v) .
Update with stats
I implemented a little test harness and two x(5⁄12) cases corresponding to the above.
#include <cstdio>
#include <xmmintrin.h>
#include <cmath>
#include <cfloat>
#include <algorithm>
using namespace std;
template< unsigned expnum, unsigned expden, unsigned coeffnum, unsigned coeffden >
__m128 fastpow( __m128 arg ) {
__m128 ret = arg;
// std::printf( "arg = %,vg\n", ret );
// Apply a constant pre-correction factor.
ret = _mm_mul_ps( ret, _mm_set1_ps( exp2( 127. * expden / expnum - 127. )
* pow( 1. * coeffnum / coeffden, 1. * expden / expnum ) ) );
// std::printf( "scaled = %,vg\n", ret );
// Reinterpret arg as integer to obtain logarithm.
asm ( "cvtdq2ps %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "log = %,vg\n", ret );
// Multiply logarithm by power.
ret = _mm_mul_ps( ret, _mm_set1_ps( 1. * expnum / expden ) );
// std::printf( "powered = %,vg\n", ret );
// Convert back to "integer" to exponentiate.
asm ( "cvtps2dq %1, %0" : "=x" (ret) : "x" (ret) );
// std::printf( "result = %,vg\n", ret );
return ret;
}
__m128 pow125_4( __m128 arg ) {
// Lower exponents provide lower initial error, but too low causes overflow.
__m128 xf = fastpow< 4, 5, int( 1.38316186 * 1e9 ), int( 1e9 ) >( arg );
// Imprecise 4-cycle sqrt is still far better than fastpow, good enough.
__m128 xfm4 = _mm_rsqrt_ps( xf );
__m128 xf4 = _mm_mul_ps( xf, xfm4 );
// Precisely calculate x^2 and x^3
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 x3 = _mm_mul_ps( x2, arg );
// Overestimate of x^2 * x^0.4
x2 = _mm_mul_ps( x2, xf4 );
// Get x^-0.2 from x^0.4, and square it for x^-0.4. Combine into x^-0.6.
__m128 xfm2 = _mm_rsqrt_ps( xf4 );
x3 = _mm_mul_ps( x3, xfm4 );
x3 = _mm_mul_ps( x3, xfm2 );
return _mm_mul_ps( _mm_add_ps( x2, x3 ), _mm_set1_ps( 1/ 1.960131704207789 * 0.9999 ) );
}
__m128 pow512_2( __m128 arg ) {
// 5/12 is too small, so compute the sqrt of 10/12 instead.
__m128 x = fastpow< 5, 6, int( 0.992245 * 1e9 ), int( 1e9 ) >( arg );
return _mm_mul_ps( _mm_rsqrt_ps( x ), x );
}
__m128 pow512_4( __m128 arg ) {
// 5/12 is too small, so compute the 4th root of 20/12 instead.
// 20/12 = 5/3 = 1 + 2/3 = 2 - 1/3. 2/3 is a suitable argument for fastpow.
// weighting coefficient: a^-1/2 = 2 a; a = 2^-2/3
__m128 xf = fastpow< 2, 3, int( 0.629960524947437 * 1e9 ), int( 1e9 ) >( arg );
__m128 xover = _mm_mul_ps( arg, xf );
__m128 xfm1 = _mm_rsqrt_ps( xf );
__m128 x2 = _mm_mul_ps( arg, arg );
__m128 xunder = _mm_mul_ps( x2, xfm1 );
// sqrt2 * over + 2 * sqrt2 * under
__m128 xavg = _mm_mul_ps( _mm_set1_ps( 1/( 3 * 0.629960524947437 ) * 0.999852 ),
_mm_add_ps( xover, xunder ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
xavg = _mm_mul_ps( xavg, _mm_rsqrt_ps( xavg ) );
return xavg;
}
__m128 mm_succ_ps( __m128 arg ) {
return (__m128) _mm_add_epi32( (__m128i) arg, _mm_set1_epi32( 4 ) );
}
void test_pow( double p, __m128 (*f)( __m128 ) ) {
__m128 arg;
for ( arg = _mm_set1_ps( FLT_MIN / FLT_EPSILON );
! isfinite( _mm_cvtss_f32( f( arg ) ) );
arg = mm_succ_ps( arg ) ) ;
for ( ; _mm_cvtss_f32( f( arg ) ) == 0;
arg = mm_succ_ps( arg ) ) ;
std::printf( "Domain from %g\n", _mm_cvtss_f32( arg ) );
int n;
int const bucket_size = 1 << 25;
do {
float max_error = 0;
double total_error = 0, cum_error = 0;
for ( n = 0; n != bucket_size; ++ n ) {
float result = _mm_cvtss_f32( f( arg ) );
if ( ! isfinite( result ) ) break;
float actual = ::powf( _mm_cvtss_f32( arg ), p );
float error = ( result - actual ) / actual;
cum_error += error;
error = std::abs( error );
max_error = std::max( max_error, error );
total_error += error;
arg = mm_succ_ps( arg );
}
std::printf( "error max = %8g\t" "avg = %8g\t" "|avg| = %8g\t" "to %8g\n",
max_error, cum_error / n, total_error / n, _mm_cvtss_f32( arg ) );
} while ( n == bucket_size );
}
int main() {
std::printf( "4 insn x^12/5:\n" );
test_pow( 12./5, & fastpow< 12, 5, 1059, 1000 > );
std::printf( "14 insn x^12/5:\n" );
test_pow( 12./5, & pow125_4 );
std::printf( "6 insn x^5/12:\n" );
test_pow( 5./12, & pow512_2 );
std::printf( "14 insn x^5/12:\n" );
test_pow( 5./12, & pow512_4 );
}
Output:
4 insn x^12/5:
Domain from 1.36909e-23
error max = inf avg = inf |avg| = inf to 8.97249e-19
error max = 2267.14 avg = 139.175 |avg| = 139.193 to 5.88021e-14
error max = 0.123606 avg = -0.000102963 |avg| = 0.0371122 to 3.85365e-09
error max = 0.123607 avg = -0.000108978 |avg| = 0.0368548 to 0.000252553
error max = 0.12361 avg = 7.28909e-05 |avg| = 0.037507 to 16.5513
error max = 0.123612 avg = -0.000258619 |avg| = 0.0365618 to 1.08471e+06
error max = 0.123611 avg = 8.70966e-05 |avg| = 0.0374369 to 7.10874e+10
error max = 0.12361 avg = -0.000103047 |avg| = 0.0371122 to 4.65878e+15
error max = 0.123609 avg = nan |avg| = nan to 1.16469e+16
14 insn x^12/5:
Domain from 1.42795e-19
error max = inf avg = nan |avg| = nan to 9.35823e-15
error max = 0.000936462 avg = 2.0202e-05 |avg| = 0.000133764 to 6.13301e-10
error max = 0.000792752 avg = 1.45717e-05 |avg| = 0.000129936 to 4.01933e-05
error max = 0.000791785 avg = 7.0132e-06 |avg| = 0.000129923 to 2.63411
error max = 0.000787589 avg = 1.20745e-05 |avg| = 0.000129347 to 172629
error max = 0.000786553 avg = 1.62351e-05 |avg| = 0.000132397 to 1.13134e+10
error max = 0.000785586 avg = 8.25205e-06 |avg| = 0.00013037 to 6.98147e+12
6 insn x^5/12:
Domain from 9.86076e-32
error max = 0.0284339 avg = 0.000441158 |avg| = 0.00967327 to 6.46235e-27
error max = 0.0284342 avg = -5.79938e-06 |avg| = 0.00897913 to 4.23516e-22
error max = 0.0284341 avg = -0.000140706 |avg| = 0.00897084 to 2.77556e-17
error max = 0.028434 avg = 0.000440504 |avg| = 0.00967325 to 1.81899e-12
error max = 0.0284339 avg = -6.11153e-06 |avg| = 0.00897915 to 1.19209e-07
error max = 0.0284298 avg = -0.000140597 |avg| = 0.00897084 to 0.0078125
error max = 0.0284371 avg = 0.000439748 |avg| = 0.00967319 to 512
error max = 0.028437 avg = -7.74294e-06 |avg| = 0.00897924 to 3.35544e+07
error max = 0.0284369 avg = -0.000142036 |avg| = 0.00897089 to 2.19902e+12
error max = 0.0284368 avg = 0.000439183 |avg| = 0.0096732 to 1.44115e+17
error max = 0.0284367 avg = -7.41244e-06 |avg| = 0.00897923 to 9.44473e+21
error max = 0.0284366 avg = -0.000141706 |avg| = 0.00897088 to 6.1897e+26
error max = 0.485129 avg = -0.0401671 |avg| = 0.048422 to 4.05648e+31
error max = 0.994932 avg = -0.891494 |avg| = 0.891494 to 2.65846e+36
error max = 0.999329 avg = nan |avg| = nan to -0
14 insn x^5/12:
Domain from 2.64698e-23
error max = 0.13556 avg = 0.00125936 |avg| = 0.00354677 to 1.73472e-18
error max = 0.000564988 avg = 2.51458e-06 |avg| = 0.000113709 to 1.13687e-13
error max = 0.000565065 avg = -1.49258e-06 |avg| = 0.000112553 to 7.45058e-09
error max = 0.000565143 avg = 1.5293e-06 |avg| = 0.000112864 to 0.000488281
error max = 0.000565298 avg = 2.76457e-06 |avg| = 0.000113713 to 32
error max = 0.000565453 avg = -1.61276e-06 |avg| = 0.000112561 to 2.09715e+06
error max = 0.000565531 avg = 1.42628e-06 |avg| = 0.000112866 to 1.37439e+11
error max = 0.000565686 avg = 2.71505e-06 |avg| = 0.000113715 to 9.0072e+15
error max = 0.000565763 avg = -1.56586e-06 |avg| = 0.000112415 to 1.84467e+19
I suspect accuracy of the more accurate 5/12 is being limited by the rsqrt operation.
Ian Stephenson wrote this code which he claims outperforms pow(). He describes the idea as follows:
Pow is basically implemented using
log's: pow(a,b)=x(logx(a)*b). so we
need a fast log and fast exponent - it
doesn't matter what x is so we use 2.
The trick is that a floating point
number is already in a log style
format:
a=M*2E
Taking the log of both sides gives:
log2(a)=log2(M)+E
or more simply:
log2(a)~=E
In other words if we take the floating
point representation of a number, and
extract the Exponent we've got
something that's a good starting point
as its log. It turns out that when we
do this by massaging the bit patterns,
the Mantissa ends up giving a good
approximation to the error, and it
works pretty well.
This should be good enough for simple
lighting calculations, but if you need
something better, you can then extract
the Mantissa, and use that to
calculate a quadratic correction factor
which is pretty accurate.
First off, using floats isn't going to buy much on most machines nowadays. In fact, doubles can be faster. Your power, 1.0/2.4, is 5/12 or 1/3*(1+1/4). Even though this is calling cbrt (once) and sqrt (twice!) it is still twice as fast as using pow(). (Optimization: -O3, compiler: i686-apple-darwin10-g++-4.2.1).
#include <math.h> // cmath does not provide cbrt; C99 does.
double xpow512 (double x) {
double cbrtx = cbrt(x);
return cbrtx*sqrt(sqrt(cbrtx));
}
This might not answer your question.
The 2.4f and 1/2.4f make me very suspicious, because those are exactly the powers used to convert between sRGB and a linear RGB color space. So you might actually be trying to optimize that, specifically. I don't know, which is why this might not answer your question.
If this is the case, try using a lookup table. Something like:
__attribute__((aligned(64))
static const unsigned short SRGB_TO_LINEAR[256] = { ... };
__attribute__((aligned(64))
static const unsigned short LINEAR_TO_SRGB[256] = { ... };
void apply_lut(const unsigned short lut[256], unsigned char *src, ...
If you are using 16-bit data, change as appropriate. I would make the table 16 bits anyway so you can dither the result if necessary when working with 8-bit data. This obviously won't work very well if your data is floating point to begin with -- but it doesn't really make sense to store sRGB data in floating point, so you might as well convert to 16-bit / 8-bit first and then do the conversion from linear to sRGB.
(The reason sRGB doesn't make sense as floating point is that HDR should be linear, and sRGB is only convenient for storing on disk or displaying on screen, but not convenient for manipulation.)
I shall answer the question you really wanted to ask, which is how to do fast sRGB <-> linear RGB conversion. To do this precisely and efficiently we can use polynomial approximations. The following polynomial approximations have been generated with sollya, and have a worst case relative error of 0.0144%.
inline double poly7(double x, double a, double b, double c, double d,
double e, double f, double g, double h) {
double ab, cd, ef, gh, abcd, efgh, x2, x4;
x2 = x*x; x4 = x2*x2;
ab = a*x + b; cd = c*x + d;
ef = e*x + f; gh = g*x + h;
abcd = ab*x2 + cd; efgh = ef*x2 + gh;
return abcd*x4 + efgh;
}
inline double srgb_to_linear(double x) {
if (x <= 0.04045) return x / 12.92;
// Polynomial approximation of ((x+0.055)/1.055)^2.4.
return poly7(x, 0.15237971711927983387,
-0.57235993072870072762,
0.92097986411523535821,
-0.90208229831912012386,
0.88348956209696805075,
0.48110797889132134175,
0.03563925285274562038,
0.00084585397227064120);
}
inline double linear_to_srgb(double x) {
if (x <= 0.0031308) return x * 12.92;
// Piecewise polynomial approximation (divided by x^3)
// of 1.055 * x^(1/2.4) - 0.055.
if (x <= 0.0523) return poly7(x, -6681.49576364495442248881,
1224.97114922729451791383,
-100.23413743425112443219,
6.60361150127077944916,
0.06114808961060447245,
-0.00022244138470139442,
0.00000041231840827815,
-0.00000000035133685895) / (x*x*x);
return poly7(x, -0.18730034115395793881,
0.64677431008037400417,
-0.99032868647877825286,
1.20939072663263713636,
0.33433459165487383613,
-0.01345095746411287783,
0.00044351684288719036,
-0.00000664263587520855) / (x*x*x);
}
And the sollya input used to generate the polynomials:
suppressmessage(174);
f = ((x+0.055)/1.055)^2.4;
p0 = fpminimax(f, 7, [|D...|], [0.04045;1], relative);
p = fpminimax(f/(p0(1)+1e-18), 7, [|D...|], [0.04045;1], relative);
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
s = 0.0523;
z = 3;
f = 1.055 * x^(1/2.4) - 0.055;
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [0.0031308;s], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [0.0031308;s]));
print("absolute:", dirtyinfnorm((f-p), [0.0031308;s]));
print(canonical(p));
p = fpminimax(1.055 * (x^(z+1/2.4) - 0.055*x^z/1.055), 7, [|D...|], [s;1], relative)/x^z;
print("relative:", dirtyinfnorm((f-p)/f, [s;1]));
print("absolute:", dirtyinfnorm((f-p), [s;1]));
print(canonical(p));
Binomial series does account for a constant exponent, but you will be able to use it only if you can normalize all your input to the range [1,2). (Note that it computes (1+x)^a). You'll have to do some analysis to decide how many terms you need for your desired accuracy.
For exponents of 2.4, you could either make a lookup table for all your 2.4 values and lirp or perhaps higher-order function to fill in the in-betweem values if the table wasn't accurate enough (basically a huge log table.)
Or, value squared * value to the 2/5s which could take the initial square value from the first half of the function and then 5th root it. For the 5th root, you could Newton it or do some other fast approximator, though honestly once you get to this point, your probably better off just doing the exp and log functions with the appropriate abbreviated series functions yourself.
The following is an idea you can use with any of the fast calculation methods. Whether it helps things go faster depends on how your data arrives. You can use the fact that if you know x and pow(x, n), you can use the rate of change of the power to compute a reasonable approximation of pow(x + delta, n) for small delta, with a single multiply and add (more or less). If successive values you feed your power functions are close enough together, this would amortize the full cost of the accurate calculation over multiple function calls. Note that you don't need an extra pow calculation to get the derivative. You could extend this to use the second derivative so you can use a quadratic, which would increase the delta you could use and still get the same accuracy.
So traditionally the powf(x, p) = x^p is solved by rewriting x as x=2^(log2(x)) making powf(x,p) = 2^(p*log2(x)), which transforms the problem into two approximations exp2() & log2(). This has the advantage of working with larger powers p, however the downside is that this is not the optimal solution for a constant power p and over a specified input bound 0 ≤ x ≤ 1.
When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4 as can be seen below:
float pow12_5(float x){
float mp;
// Minimax horner polynomials for x^(5/12), Note: choose the accurarcy required then implement with fma() [Fused Multiply Accumulates]
// mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4)))); // 2.6079002e-4
// mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4)))); // 8.61377e-5
// mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4)))))); // 3.524655e-5
return(mp);
}
However when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem y=x^p=x^(p+m)/x^m where m=1,2,3 is a positive integer, making the new power approximation p > 1 but this introduces division which is inherently slower.
There's however another option which is to decompose the input x as its floating point exponent and mantissa form:
x = mx* 2^(ex) where 1 ≤ mx < 2
y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
= mx^(5/12) * 2^(k/12) * 2^(ey)
The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12). The code is below:
float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0, 0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
float pow5_12(float x){
union{float f; uint32_t u;} v, e2;
float poff, m, e, ei;
int xe;
v.f = x;
xe = ((v.u >> 23) - 127);
if(xe < -127) return(0.0f);
// Calculate remainder k in 2^(k/12) to find LUT
e = xe * (5.0f/12.0f);
ei = floorf(e);
poff = powk_12LUT[(int)(12.0f * (e - ei))];
e2.u = ((int)ei + 127) << 23; // Calculate the exponent
v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero
// Approximate mx^(5/12) on [1,2), with appropriate degree minimax
// m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4)); // 7.6125e-4
// m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8)); // 8.4522726e-5
m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8)))); // 1.04091259e-5
// m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8)))); // 1.367401e-6
return(m * poff * e2.f);
}