Worse performance using Eigen than using my own class - c++

A couple of weeks ago I asked a question about the performance of matrix multiplication.
I was told that in order to enhance the performance of my program I should use some specialised matrix classes rather than my own class.
StackOverflow users recommended:
uBLAS
EIGEN
BLAS
At first I wanted to use uBLAS however reading documentation it turned out that this library doesn't support matrix-matrix multiplication.
After all I decided to use EIGEN library. So I exchanged my matrix class to Eigen::MatrixXd - however it turned out that now my application works even slower than before.
Time before using EIGEN was 68 seconds and after exchanging my matrix class to EIGEN matrix program runs for 87 seconds.
Parts of program which take the most time looks like that
TemplateClusterBase* TemplateClusterBase::TransformTemplateOne( vector<Eigen::MatrixXd*>& pointVector, Eigen::MatrixXd& rotation ,Eigen::MatrixXd& scale,Eigen::MatrixXd& translation )
{
for (int i=0;i<pointVector.size();i++ )
{
//Eigen::MatrixXd outcome =
Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
//delete prototypePointVector[i]; // ((rotation*scale)* (*prototypePointVector[i]) + translation).ConvertToPoint();
MatrixHelper::SetX(*prototypePointVector[i],MatrixHelper::GetX(outcome));
MatrixHelper::SetY(*prototypePointVector[i],MatrixHelper::GetY(outcome));
//assosiatedPointIndexVector[i] = prototypePointVector[i]->associatedTemplateIndex = i;
}
return this;
}
and
Eigen::MatrixXd AlgorithmPointBased::UpdateTranslationMatrix( int clusterIndex )
{
double membershipSum = 0,outcome = 0;
double currentPower = 0;
Eigen::MatrixXd outcomePoint = Eigen::MatrixXd(2,1);
outcomePoint << 0,0;
Eigen::MatrixXd templatePoint;
for (int i=0;i< imageDataVector.size();i++)
{
currentPower =0;
membershipSum += currentPower = pow(membershipMatrix[clusterIndex][i],m);
outcomePoint.noalias() += (*imageDataVector[i] - (prototypeVector[clusterIndex]->rotationMatrix*prototypeVector[clusterIndex]->scalingMatrix* ( *templateCluster->templatePointVector[prototypeVector[clusterIndex]->assosiatedPointIndexVector[i]]) ))*currentPower ;
}
outcomePoint.noalias() = outcomePoint/=membershipSum;
return outcomePoint; //.ConvertToMatrix();
}
As You can see, these functions performs a lot of matrix operations. That is why I thought using Eigen would speed up my application. Unfortunately (as I mentioned above), the program works slower.
Is there any way to speed up these functions?
Maybe if I used DirectX matrix operations I would get better performance ?? (however I have a laptop with integrated graphic card).

If you're using Eigen's MatrixXd types, those are dynamically sized. You should get much better results from using the fixed size types e.g Matrix4d, Vector4d.
Also, make sure you're compiling such that the code can get vectorized; see the relevant Eigen documentation.
Re your thought on using the Direct3D extensions library stuff (D3DXMATRIX etc): it's OK (if a bit old fashioned) for graphics geometry (4x4 transforms etc), but it's certainly not GPU accelerated (just good old SSE, I think). Also, note that it's floating point precision only (you seem to be set on using doubles). Personally I'd much prefer to use Eigen unless I was actually coding a Direct3D app.

Make sure to have compiler optimization switched on (e.g. at least -O2 on gcc). Eigen is heavily templated and will not perform very well if you don't turn on optimization.

Which version of Eigen are you using? They recently released 3.0.1, which is supposed to be faster than 2.x. Also, make sure you play a bit with the compiler options. For example, make sure SSE is being used in Visual Studio:
C/C++ --> Code Generation --> Enable Enhanced Instruction Set

You should profile and then optimize first the algorithm, then the implementation. In particular, the posted code is quite innefficient:
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = (rotation*scale)* (*pointVector[i]) + translation;
I don't know the library, so I won't even try to guess the number of unnecessary temporaries that you are creating, but a simple refactor:
Eigen::MatrixXd tmp = rotation*scale;
for (int i=0;i<pointVector.size();i++ )
{
Eigen::MatrixXd outcome = tmp*(*pointVector[i]) + translation;
Can save you a good amount of expensive multiplications (and again, probably new temporary matrices that get discarded right away.

A couple of points.
Why are you multiplying rotation*scale inside of the loop when that product will have the same value each iteration? That is a lot of wasted effort.
You are using dynamically sized matrices rather than fixed sized matrices. Someone else mentioned this already, and you said you shaved off 2 sec.
You are passing arguments as a vector of pointers to matrices. This adds an extra pointer indirection and destroys any guarantee of data locality, which will give poor cache performance.
I hope this isn't insulting, but are you compiling in Release or Debug? Eigen is very slow in debug builds, because it uses lots of trivial templated functions that are optimized out of release but remain in debug.
Looking at your code, I am hesitant to blame Eigen for performance problems. However, most linear algebra libraries (including Eigen) are not really designed for your use case of lots of tiny matrices. In general, Eigen will be better optimized for 100x100 or larger matrices. You very well may be better off using your own matrix class or the DirectX math helper classes. The DirectX math classes are completely independent from your video card.

Looking back at your previous post and the code in there, my suggestion would be to use your old code, but improve its efficiency by moving things around. I'm posting on that previous question to keep the answers separate.

Related

Results (slightly) different after vectorization is enabled

One of our software is using Eigen (3.2.5) to perform some matric/vector related computations. The software was developed carefully in this regard, starting by disabling all options and optimizations (including using -DEIGEN_DONT_VECTORIZE), and setting accuracy tests in place.
Since we are now interested in faster numerical throughputs, we have started enabling vectorization inside Eigen. However, we have noticed that one of our tests now gives a slightly different output: the difference with the reference implementation is around 1e-4, while it was 1e-5 before.
We are going to let loose a bit the precision in this test (because we don't really know the accuracy of the reference data, and we have another test case with synthetic data for which we have an exact solution and that still passes), but out of curiosity: what can be a plausible cause for this variation?
In case it's relevant, this computation involves Euclidean norms.
This has to be expected because when you enable vectorization, floating point operations are not carried out in the exact same order. This typically occurs for expressions involving reductions such as sum, norms, matrix products, etc. For instance, let's consider the following simple sum:
float s = 0;
for(int i=0;i<n;i++)
s += v[i];
A vectorized version might look to something like (pseudo code):
Packet ps = {0,0,0,0};
for(int i=0;i<n;i+=4)
ps += load_packet(&v[i]);
float s = ps[0]+ps[1]+ps[2]+ps[3];
Owing to roundoff errors, each version will return a different value. In Eigen, this aspect even more tricky because reductions are implemented in a way to maximize instruction pipelining.

Eigen: how to speed up a += coeffs * coeffs.transpose()

I need to compute many (about 400k) solutions of small linear least square problems. Each problem contains 10-300 equations with only 7 variables.
To solve these problems i use eigen library. Straight solving takes too much time and i transform each problem to solving 7x7 system of linear equations by deriving derivatives by my hand.
I recieve nice speed-up but i want to increase performance again.
I use vagrind to profile my program and i found that operation with highest self cost is operator += of eigen matrix. This operation takes more than ten calls of a.ldlt().solve(b);
I use this operator to compose A matrix and B vector of each system of equations
//I cal these code to solve each problem
const int nVars = 7;
//i really need double precision
Eigen::Matrix<double, nVars, nVars> a = Eigen::Matrix<double, nVars, nVars>::Zero();
Eigen::Matrix<double, nVars, 1> b = Eigen::Matrix<double, nVars, 1>::Zero();
Eigen::Matrix<double, nVars, 1> equationCoeffs;
//............................
//Somewhere in big cycle.
//equationCoeffs and z are updated on each iteration
a += equationCoeffs * equationCoeffs.transpose();
b += equationCoeffs * z;
Where z is some scalar
So my question is: How can i improve performance of these operations?
PS Sorry for my poor English
Instead of forming the matrix and vector components of the normal equation by hand, one equation at a time, you might try to allocate a large enough matrix once (e.g. 300 x 7) to store all coefficients and then let Eigen's optimized matrix-matrix product kernels do the job for you:
Matrix<double,Dynamic,nbVars> D(300,nbVars);
VectorXd f(300);
for(...)
{
int nb_equations = ...;
for(i=0..nb_equations-1)
{
D.row(i) = equationCoeffs;
f(i) = z;
}
a = D.topRows(nb_equations).transpose() * D.topRows(nb_equations);
b = D.topRows(nb_equations).transpose() * f.head(nb_equations);
// solve ax=b
}
You might bench with both a column-major and row-major storage for the matrix D to see which one is best.
Another possible approach would be to declare a, equationCoeffs, and b as 8x8 or 8x1 matrix or vectors making sure that equationCoeffs(7)==0. This way you maximize SIMD usage. Then use a.topLeftCorners<7,7>(), b.head<7>() when calling LDLT. You might even combine this strategy with the previous one.
Finally, if your CPU support AVX or FMA, you might use the devel branch and compile with -mavx or -mfma to get a significant speedup.
If you can use g++5.1, you might want to take a look at OpenMP
( http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf ).
G++5.1 (or gcc5.1 for C) also has some basic support for OpenACC, you can try that as well. There should be more implementation of OpenACC in the future.
Also if you have access to intel compiler (icc, icpc) it speeded up my code even just by using it.
If you can use nvidia's nvcc, you might use the thrust library
( http://docs.nvidia.com/cuda/thrust/#axzz3g8xJPGHe ), there's a lot of sample code on their github as well
( https://github.com/thrust/thrust ). However, using thrust is not so straight forward and needs some real thinking.
EDIT:
Thrust also requires Nvidia GPU.
For AMD cards I believe there is a library called ArrayFire, which looks very similar to Thrust (I have not tried that one, yet)
I have a single problem Ax=b with 480k float variables. The matrix A is sparse and solving it with Eigen BiCGSTAB took 4.8 seconds.
I also worked with ViennaCL before, so I tried to solve the same problem there, and it took only 1.2 seconds. The increase in spead is realised
by the processing on the GPU.

GLSL conditional penalties

I've written my first couple of GLSL programs for Processing (a visual language similar to Java that can load shaders) recently that make fractals. In the loop that handles the fractal code, I have an escape conditional that breaks if a point would tend to infinity.
It works fine and it is similar to how I generally write the code for non-GLSL. However someone told me that two paths are calculated every time a conditional is executed. I've had a hard time finding exactly how much of a penalty is caused by conditionals in GLSL.
Edit: To the best of my understanding in non-GLSL when an if is encountered a path is assumed. If the "correct" path was assumed everything is great. If the "wrong" path was assumed then "bad" work is discarded and instructions continue along the "correct" path. The penalty might be say 3 (or whatever number) of instructions. I want to know if there is some number (3 or whatever) of instructions that are the penalty or if both paths are calculated all the way through.
Here is the code if the explanation is not clear enough:
// Mandelbrot Set code
int i = 0;
float zr = x;
float zi = y;
for (; i < maxIterations; i++) {
float sqZr = zr*zr;
float sqZi = zi*zi;
float twoZri = 2.0*zr*zi;
zr = sqZr-sqZi+x;
zi = twoZri+y;
if (sqZr+sqZi > 16.0) break;
}
On old GPUs, both sides of an if() clause were executed and the correct result chosen at the end. On newer ones, this is only the case if the compiler thinks it would be more efficient. if() clauses are not free: the generic rule of thumb I have used for some time is: "if() costs 14 clock cycles" though the latest GPUs may be cheaper.
Why is this so? Because GPUs are stream processors, they want to have identical data-loading profiles for all pixels (especially for gradient values like texture colors or values from vertex registers). The principle of SIMD -- even when the devices are not strictly SIMD -- is usually the way to get the most performance from such devices.
When in doubt, see if you can use one of the NVIDIA perf analysis tools on your code, or just try writing the code (it's short!) a few different ways and comparing your performance for your specific GPU.
(BTW Processing is not Java-like: it's Java)

How to optimize matrix multiplication operation [duplicate]

This question already has answers here:
Optimized matrix multiplication in C
(14 answers)
Closed 4 years ago.
I need to perform a lot of matrix operations in my application. The most time consuming is matrix multiplication. I implemented it this way
template<typename T>
Matrix<T> Matrix<T>::operator * (Matrix& matrix)
{
Matrix<T> multipliedMatrix = Matrix<T>(this->rows,matrix.GetColumns(),0);
for (int i=0;i<this->rows;i++)
{
for (int j=0;j<matrix.GetColumns();j++)
{
multipliedMatrix.datavector.at(i).at(j) = 0;
for (int k=0;k<this->columns ;k++)
{
multipliedMatrix.datavector.at(i).at(j) += datavector.at(i).at(k) * matrix.datavector.at(k).at(j);
}
//cout<<(*multipliedMatrix)[i][j]<<endl;
}
}
return multipliedMatrix;
}
Is there any way to write it in a better way?? So far matrix multiplication operations take most of time in my application. Maybe is there good/fast library for doing this kind of stuff ??
However I rather can't use libraries which uses graphic card for mathematical operations, because of the fact that I work on laptop with integrated graphic card.
Eigen is by far one of the fastest, if not the fastest, linear algebra libraries out there. It is well written and it is of high quality. Also, it uses expression template which makes writing code that is more readable. Version 3 just released uses OpenMP for data parallelism.
#include <iostream>
#include <Eigen/Dense>
using Eigen::MatrixXd;
int main()
{
MatrixXd m(2,2);
m(0,0) = 3;
m(1,0) = 2.5;
m(0,1) = -1;
m(1,1) = m(1,0) + m(0,1);
std::cout << m << std::endl;
}
Boost uBLAS I think is definitely the way to go with this sort of thing. Boost is well designed, well tested and used in a lot of applications.
Consider GNU Scientific Library, or MV++
If you're okay with C, BLAS is a low-level library that incorporates both C and C-wrapped FORTRAN instructions and is used a huge number of higher-level math libraries.
I don't know anything about this, but another option might be Meschach which seems to have decent performance.
Edit: With respect to your comment about not wanting to use libraries that use your graphics card, I'll point out that in many cases, the libraries that use your graphics card are specialized implementations of standard (non-GPU) libraries. For example, various implementations of BLAS are listed on it's Wikipedia page, only some are designed to leverage your GPU.
There is a book called Introduction to Algorithms. You may like to check the chapter of Dynamic Programming. It has an excellent matrix multiplication algo using dynamic programming. Its worth a read. Well, this info was in case you want to write your own logic instead of using a library.
There are plenty of algorithms for efficient matrix multiplication.
Algorithms for efficient matrix multiplication
Look at the algorithms, find an implementations.
You can also make a multi-threaded implementation for it.

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:
template<class T> inline T sqr(const T& x){return x*x;}
and another one that calculates
Base dist2(const Point& p) const
{ return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }
These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss
Thanks
First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).
Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.
If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)
To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.
What is Base?
Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.
template<class T> inline T sqr(const T& x){return x*x;}
Base dist2(const Point& p) const {
return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}
If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.
(We can't tell without the class definitions)
You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.
Other performance improvement opportunities are to:
Use non-floating point operations, if you're ok with less precision.
Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
(this is probably obvious, but) Make sure you're compiling with optimization turned on.
I think optimising these functions might be difficult, you might be better off optimising the code that calls these functions to call them less, or to do things differently.
You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
What does your profiler tell you is happening inside dist2. Are you actually using 100% CPU all the time or are you cache missing and waiting for data to load?
To be honest, we really need more details to give you a definitive answer.
If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.
If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).
If you have a number of these to do, and you're doing graphics or "graphic like" tasks (thermal modeling, almost any 3d modeling) you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.
A researcher at a local univeristy offload almost all of his 3d-calculations for AI work to the GPU and achieved much faster results.
There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…
Your code has most everything a vectorizer needs to work, except two constraints: aliasing and alignment.
Aliasing is the problem of two names referring two the same object. For example, my_point.dist2( my_point ) would operate on two copies of my_point. This messes with the vectorizer.
C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.
GCC calls it __restrict__. It may be applied to references or this.
MSVC calls it __restrict. I'd be surprised if support were any different from GCC.
(It is not in C++0x, though.)
#ifdef __GCC__
#define restrict __restrict__
#elif defined _MSC_VER
#define restrict __restrict
#endif
 
Base dist2(const Point& restrict p) const restrict
Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:
#ifdef __GCC__
#define align16 __attribute__((aligned (16)))
#elif defined _MSC_VER
#define align16 __declspec(align (16))
#endif
 
struct Point {
double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
…
};
This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your vectorizer to learn more.
If you really need all the dist2 values, then you have to compute them. It's already low level and cannot imagine speedups apart from distributing on multiple cores.
On the other side, if you're searching for closeness, then you can supply to the dist2() function your current miminum value. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining 2 squares.
Furthermore, you can avoid the first square by going deeper in the double representation. Comparing directly on the exponent value with your current miminum can save even more cycles.
Are you using Visual Studio? If so you may want to look at specifying the floating point unit control using /fp fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -fOPTION floating point optimisations you might want to consider (if, as you say, accuracy is not a huge concern).
I suggest two techniques:
Move the structure members into
local variables at the beginning.
Perform like operations together.
These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.
Here's the code:
Base dist2(const Point& p) const
{
// Load the cache with data values.
register x1 = p.x;
register y1 = p.y;
register z1 = p.z;
// Perform subtraction together
x1 = x - x1;
y1 = y - y1;
z1 = z - z2;
// Perform multiplication together
x1 *= x1;
y1 *= y1;
z1 *= z1;
// Perform final sum
x1 += y1;
x1 += z1;
// Return the final value
return x1;
}
The other alternative is to group by dimension. For example, perform all 'X' operations first, then Y and followed by Z. This may show the compiler that pieces are independent and it can delegate to another core or processor.
If you can't get any more performance out of this function, you should look elsewhere as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance over 25%.
Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.
Hope this helps.
From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE) as others have pointed out. An alternative is to use a different norm, like the 1-norm is just the sum of the absolute values of the terms. Then no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, but it may not matter for your application.
Floating point operations are quite often slower, maybe you can think about modifying the code to use only integer arithmetic and see if this helps?
EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.
Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?
If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.
Just a few thoughts, however unlikely that I will add anything of value after 18 answers :)
If you are spending 80% time in these two functions I can imagine two typical scenarios:
Your algorithm is at least polynomial
As your data seem to be spatial maybe you can bring the O(n) down by introducing spatial indexes?
You are looping over certain set
If this set comes either from data on disk (sorted?) or from loop there might be possibility to cache, or use previous computations to calculate sqrt faster.
Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?
(scratch that!!! sqr != sqrt )
See if the "Fast sqrt" is applicable in your case :
http://en.wikipedia.org/wiki/Fast_inverse_square_root
Look at the context. There's nothing you can do to optimize an operation as simple as x*x.
Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?
Can you perhaps offload entire parts of the algorithm to the GPU?
Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)
Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
And of course, examine the generated assembly and see if it's what you expect.
Is there a reason you are implementing your own sqr operator?
Have you tried the one in libm it should be highly optimized.
The first thing that occurs to me is memoization ( on-the-fly caching of function calls ), but both sqr and dist2 it would seem like they are too low level for the overhead associated with memoization to be made up for in savings due to memoization. However at a higher level, you may find it may work well for you.
I think a more detailed analysis of you data is called for. Saying that most of the time in the program is spent executing MOV and JUMp commands may be accurate, but it's not going to help yhou optimise much. The information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a pre-cached tabled would be 1000 elements--not to big. You can always use code to generate it.
Have you unrolled loops? Broken down matrix opration? Looked for places where you can get by with table lookup instead of actual calculation.
Most drastic would be to adopt the techniques described in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf
though it is admittedly a hard read and you should get some help from someone who knows Common Lisp if you don't.
I'm curious why you made this a template when you said the computation is done using doubles?
Why not write a standard method, function, or just 'x * x' ?
If your inputs can be predictably constrained and you really need speed create an array that contains all the outputs your function can produce. Use the input as the index into the array (A sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.
See the SUBPD, MULPD and DPPD instructions. (DPPD required SSE4)
Depends on your code, but in some cases a stucture-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.