why polymorphism is so costly in haskell(GHC)?

why polymorphism is so costly in haskell(GHC)? - c++

I am asking this question with refernce to this SO question.
Accepted answer by Don stewart : First line says "Your code is highly polymorphic change all float vars to Double .." and it gives 4X performance improvement.
I am interested in doing matrix computations in Haskell, should I make it a habit of writing highly monomorphic code?
But some languages make good use of ad-hoc polymorphism to generate fast code, why GHC won't or can't? (read C++ or D)
why can't we have something like blitz++ or eigen for Haskell? I don't understand how typeclasses and (ad-hoc)polymorphism in GHC work.

With polymorphic code, there is usually a tradeoff between code size and code speed. Either you produce a separate version of the same code for each type that it will operate on, which results in larger code, or you produce a single version that can operate on multiple types, which will be slower.
C++ implementations of templates choose in favor of increasing code speed at the cost of increasing code size. By default, GHC takes the opposite tradeoff. However, it is possible to get GHC to produce separate versions for different types using the SPECIALIZE and INLINABLE pragmas. This will result in polymorphic code that has speed similar to monomorphic code.

I want to supplement Dirk's answer by saying that INLINABLE is usually recommended over SPECIALIZE. An INLINABLE annotation on a function guarantees that the module exports the original source code of the function so that it can be specialized at the point of usage. This usually removes the need to provide separate SPECIALIZE pragmas for every single use case.
Unlike INLINE, INLINABLE does not change GHC's optimization heuristics. It just says "Please export the source code".

I don't understand how typeclasses work in GHC.
OK, consider this function:
linear :: Num x => x -> x -> x -> x
linear a b x = a*x + b
This takes three numbers as input, and returns a number as output. This function accepts any number type; it is polymorphic. How does GHC implement that? Well, essentially the compiler creates a "class dictionary" which holds all the class methods inside it (in this case, +, -, *, etc.) This dictionary becomes an extra, hidden argument to the function. Something like this:
data NumDict x =
NumDict
{
method_add :: x -> x -> x,
method_subtract :: x -> x -> x,
method_multiply :: x -> x -> x,
...
}
linear :: NumDict x -> x -> x -> x -> x
linear dict a b x = a `method_multiply dict` x `method_add dict` b
Whenever you call the function, the compiler automatically inserts the correct dictionary - unless the calling function is also polymorphic, in which case it will have received a dictionary itself, so just pass that along.
In truth, functions that lack polymorphism are typically faster not so much because of a lack of function look-ups, but because knowing the types allows additional optimisations to be done. For example, our polymorphic linear function will work on numbers, vectors, matricies, ratios, complex numbers, anything. Now, if the compiler knows that we want to use it on, say, Double, now all the operations become single machine-code instructions, all the operands can be passed in processor registers, and so on. All of which results in fantastically efficient code. Even if it's complex numbers with Double components, we can make it nice and efficient. If we have no idea what type we'll get, we can't do any of those optimisations... That's where most of the speed difference typically comes from.
For a tiny function like linear, it's highly likely it will be inlined every time it's called, resulting in no polymorphism overhead and a small amount of code duplication - rather like a C++ template. For a larger, more complex polymorphic function, there may be some cost. In general, the compiler decides this, not you - unless you want to start sprinkling pragmas around the place. ;-) Or, if you don't actually use any polymorphism, you can just give everything monomorphic type signatures...

Related

How to write functions like these intrinsic ones?

I just started to learn Fortran and I found that some intrinsic functions look mysteriously for me.
One of them is ALLOCATE: ALLOCATE(array(-5:4, 1:10)). If I want to write a similar function, how would it look? What is its argument? Which type it will have? Because it's not obvious to me what is array(-5:4, 1:10)? array still is not allocated, so what does this expression mean? What is its type?! What will be a difference in array(10) and array(-5:4, 1:10) as a type? Is it some hidden preallocated "meta-object" with some internal attribute like "dimension"? At least it does not look like array pointer in C.
And the next mysterious functions example is the PACK: pack(m, m /= 0). First, I thought that m /= 0 is like a function pointer, i.e. lambda, like in Python pack(m, lambda el: el != 0) or in Haskell pack m (\el -> el /= 0). But then I read somewhere in the Web that it's not a lambda but a list of booleans, once per each m item. But this means that it's very inefficient code - it eats a lot of memory if m is big! So, I cannot understand how do these intrinsic functions work, even more, I have feeling that user cannot write such functions - they are coded in C and not in Fortran itself. Is it truth? How they were written?!

Allocate is not a library function, as pointed out by #dave_thompson_085. Not only that, there are also some types of actual intrinsic functions that cannot be written by the user. Like min(), max(), transfer(). They are not just "library" functions, they are "intrisic", part of the core language and hence can do stuff that user code cannot. They are written in any language that was used to write the compiler. Mostly C, but could probably be also implemented in Fortran - just not like a normal Fortran function, but as a feature the compiler inserts.
When it comes to functions that accept a mask, like PACK, but there are many others that accept it in an optional argument, the mask is a logical array. The compiler is free to implement any optimizations to avoid allocating such an array, but these optimizations cannot be guaranteed. It is not just a library function that is called in a straightforward way, the compiler can insert any code that does what the function is supposed to do.

Efficient evaluation of arbitrary functions given as data, in C++

Consider the following goal:
Create a program that solves: minimize f(x) for an arbitrary f and x supplied as input.
How could one design a C++ program that could receive a description of f and x and process it efficiently?
If the program was actually a C++ library then one could explicitly write the code for f and x (probably inheriting from some base function class for f and state class for x).
However, what should one do if the program is for example a service, and the user is sending the description of f and x in some high level representation, e.g. a JSON object?
Ideas that come to mind
1- Convert f into an internal function representation (e.g. a list of basic operations). Apply those whenever f is evaluated.
Problems: inefficient unless each operation is a batch operation (e.g. if we are doing vector or matrix operations with large vectors / matrices).
2- Somehow generate C++ code and compile the code for representing x and computing f. Is there a way to restrict compilation so that only that code needs to be compiled, but the rest of the code is 'pre-compiled' already?

The usual approach used by the mp library and others is to create an expression tree (or DAG) and use some kind of a nonlinear optimization method that normally relies on derivative information which can be computed using automatic or numeric differentiation.
An expression tree can be efficiently traversed for evaluation using a generic visitor pattern. Using JIT might be an overkill unless the time taken for evaluating a function takes substantial fraction of the optimization time.

Impact performance of the TARGET attribute (not the POINTER attribute)

The question of performance impact due to the presence of the TARGET attribute was asked many times, but the answers lack of concrete examples.
The question Why does a Fortran POINTER require a TARGET? has a nice anwser:
An item that might be pointed to could be aliased to another item, and the compiler must allow for this. Items without the target attribute should not be aliased and the compiler can make assumptions based on this and therefore produce more efficient code.
I realize that the optimizations depend on the compiler and the processor instruction set, but what actually are these optimizations?
Consider the following code:
subroutine complexSubroutine(tab)
double precision, dimension(:) :: tab
!...
! very mysterious complex instructions performed on tab
!...
end subroutine
What are the optimization that the compiler could perform for this code
double precision, dimension(very large number) :: A
call complexSubroutine(A)
and not for this code?
double precision, dimension(very large number), target :: A
call complexSubroutine(A)

Note that the dummy argument tab in complexSubroutine in the question does not have the TARGET attribute. Inside the subroutine the compiler can assume that tab is not aliased, and the below discussion is moot. The issue applies to the scope calling the subroutine.
If the compiler knows that there is no possibility of aliasing, then the compiler knows that there is no possibility of the storage for two objects overlapping. The potential for overlap has direct implications for some operations. As a simple example, consider:
INTEGER, xxx :: forward(3)
INTEGER, xxx :: reverse(3)
forward = [1,2,3]
reverse(1:3:1) = forward(3:1:-1)
The Fortran standard defines the semantics of assignment such that reverse should end up with the value [3,2,1].
If xxx does not include POINTER or TARGET, then the compiler knows that forward and reverse do not overlap. It can execute the array assignment in a straight forward manner, perhaps by working backwards through the elements on the right hand side and forwards through the elements on the left hand, as suggested by the subscript triplets, and doing the element by element assignment "directly".
However, if forward and reverse are TARGETs, then they may well overlap. The straight forward manner suggested above may fail to produce the result required by the standard. If the two names are associated with exactly the same underlying sequence of data objects, then the transfer of reverse(3) to forward(1) will change the value that reverse(1) later references. With the naive approach above that fails to accommodate aliasing, reverse would end up with the value [3,2,3].
To deliver the result required by the standard, compilers may create a temporary object to hold the result of evaluating the right hand side of the assignment, effectively:
INTEGER, TARGET :: forward(3)
INTEGER, TARGET :: reverse(3)
INTEGER :: temp(3)
temp = forward(3:1:-1)
reverse(1:3:1) = temp
The presence and additional operations associated with the temporary may result in a performance impact.
The potential for aliasing to break the otherwise straight forward and simple approach to an operation is a general problem for many situations. In the absence of compiler and runtime smarts to determine that aliasing isn't a problem in a particular situation, the creation and use of temporaries is a general solution, with the general potential for a performance impact.
The potential for aliasing not immediately apparent to the compiler may also prevent the compiler assuming that the value of an object won't change when it doesn't see any explicit reference to the object that would imply a change.
INTEGER, TARGET :: x
...much later...
x = 4
CALL abc
IF (x == 4) THEN
...
The compiler, not knowing anything about procedure abc, cannot, in general, assume that x is not modified inside procedure abc - perhaps a pointer to x is available to the procedure in some way and the procedure has used that pointer to indirectly modify x. If x did not have the TARGET attribute, then the compiler knows that abc could not legitimately change its value. This has implications for the compiler's ability to analyse possible code paths at compile time, and to resequence operations/move operations out of loops, etc.

In what situations lists in F# are optimized by the compiler?

In what situations lists in F# are optimized by F# compiler to arrays, to for-loops, while loops, etc. without creating actual list of single linked data?
For example:
[1..1000] |> List.map something
Could be optimized to for-loop without creating actual list. But I don't know if the compiler is doing that actually.
Mapping over lists that are less in size could be optimized with loop-unfolding, etc.

In what situations lists in F# are optimized by F# compiler to arrays, to for-loops, while loops, etc. without creating actual list of single linked data?
Never.
Your later comments are enlightening because you assume that this is a flaw in F#:
...it should be smart enough to do it. Similar to Haskell compiler...
Somewhat true.
...Haskell compiler is doing a lot of such optimizations...
True.
However, this is actually a really bad idea. Specifically, you are pursuing optimizations when what you really want is performance. Haskell offers lots of exotic optimizations but its performance characteristics are actually really bad. Moreover, the properties of Haskell that make these optimizations tractable require massive sacrifices elsewhere:
Purity makes interoperability much harder so Microsoft killed Haskell.NET and Haskell lives on only with its own incompatible VM.
The GC in Haskell's VM has been optimized for purely functional code at the expense of mutation.
Purely functional data structures are typically 10× slower than their imperative equivalents, sometimes asymptotically slower and in some cases there is no known purely functional equivalent.
Laziness and purity go hand-in-hand ("strict evaluation is a canonical side effect") and laziness not only massively degrades performance but makes it wildly unpredictable.
The enormous numbers of optimizations added to Haskell in an attempt to combat this poor performance (e.g. strictness analysis) render performance even less predictable.
Unpredictable cache behaviour makes scalability unpredictable on multicores.
For a trivial example of these optimizations not paying off look no further than the idiomatic 2-line quicksort in Haskell which, despite all of its optimizations, remains thousands of times slower than Sedgewick's quicksort in C. In theory, a sufficiently smart Haskell compiler could optimize such source code into an efficient program. In practice, the world's most sophisticated Haskell compilers cannot even do this for a trivial two-line program much less real software.
The source code to the Haskell programs on The Computer Language Benchmarks Game provide some enlightening examples of just how horrific Haskell code becomes when you optimize it.
I want a programming language to:
Have a simple method of compilation that keeps performance predictable.
Make it easy to optimize by hand where optimization is required.
Have a high ceiling on performance so that I can get close to optimal performance when necessary.
F# satisfies these requirements.

I think "never" is the answer.

It is easy to see if you look at the disassembly - which is quite easy to read
// method line 4
.method public static
default void main# () cil managed
{
// Method begins at RVA 0x2078
.entrypoint
// Code size 34 (0x22)
.maxstack 6
IL_0000: newobj instance void class Test2/clo#2::'.ctor'()
IL_0005: ldc.i4.1
IL_0006: ldc.i4.1
IL_0007: ldc.i4 1000
IL_000c: call class [mscorlib]System.Collections.Generic.IEnumerable`1<int32> class [FSharp.Core]Microsoft.FSharp.Core.Operators/OperatorIntrinsics::RangeInt32(int32, int32, int32)
IL_0011: call class [mscorlib]System.Collections.Generic.IEnumerable`1<!!0> class [FSharp.Core]Microsoft.FSharp.Core.Operators::CreateSequence<int32> (class [mscorlib]System.Collections.Generic.IEnumerable`1<!!0>)
IL_0016: call class [FSharp.Core]Microsoft.FSharp.Collections.FSharpList`1<!!0> class [FSharp.Core]Microsoft.FSharp.Collections.SeqModule::ToList<int32> (class [mscorlib]System.Collections.Generic.IEnumerable`1<!!0>)
IL_001b: call class [FSharp.Core]Microsoft.FSharp.Collections.FSharpList`1<!!1> class [FSharp.Core]Microsoft.FSharp.Collections.ListModule::Map<int32, int32> (class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<!!0,!!1>, class [FSharp.Core]Microsoft.FSharp.Collections.FSharpList`1<!!0>)
IL_0020: pop
IL_0021: ret
} // end of method $Test2::main#
} // end of class <StartupCode$test2>.$Test2
}
You can see that at 000c and 0011 the enumerable is created, and then at 0016 the sequence is converted to a list
So in this case the optimisation doesn't happen. In fact it would be very hard for the compiler to make such an optimisation as there could be any number of differences between Seq.Map and List.Map (which is the simplest optimisation as it would avoid the temporary list).

Whilst this question was asked some time ago, the current situation is somewhat different.
Many of the list module functions do actually use an array internally.
For example, the current implementation of pairwise is
[<CompiledName("Pairwise")>]
let pairwise (list: 'T list) =
let array = List.toArray list
if array.Length < 2 then [] else
List.init (array.Length-1) (fun i -> array.[i],array.[i+1])
also FoldBack (although only for lists with length greater than 4)
// this version doesn't causes stack overflow - it uses a private stack
[<CompiledName("FoldBack")>]
let foldBack<'T,'State> f (list:'T list) (acc:'State) =
let f = OptimizedClosures.FSharpFunc<_,_,_>.Adapt(f)
match list with
| [] -> acc
| [h] -> f.Invoke(h,acc)
| [h1;h2] -> f.Invoke(h1,f.Invoke(h2,acc))
| [h1;h2;h3] -> f.Invoke(h1,f.Invoke(h2,f.Invoke(h3,acc)))
| [h1;h2;h3;h4] -> f.Invoke(h1,f.Invoke(h2,f.Invoke(h3,f.Invoke(h4,acc))))
| _ ->
// It is faster to allocate and iterate an array than to create all those
// highly nested stacks. It also means we won't get stack overflows here.
let arr = toArray list
let arrn = arr.Length
foldArraySubRight f arr 0 (arrn - 1) acc
Here, foldArraySubRight actually uses an iterative loop to process the array.
Other functions with similar optimisations include almost anything with a name like *Back as well as all the sort* functions and the permute function.

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:
template<class T> inline T sqr(const T& x){return x*x;}
and another one that calculates
Base dist2(const Point& p) const
{ return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }
These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss
Thanks

First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).
Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.
If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)
To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.

What is Base?
Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.
template<class T> inline T sqr(const T& x){return x*x;}
Base dist2(const Point& p) const {
return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}
If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.
(We can't tell without the class definitions)
You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.
Other performance improvement opportunities are to:
Use non-floating point operations, if you're ok with less precision.
Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
(this is probably obvious, but) Make sure you're compiling with optimization turned on.

I think optimising these functions might be difficult, you might be better off optimising the code that calls these functions to call them less, or to do things differently.

You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
What does your profiler tell you is happening inside dist2. Are you actually using 100% CPU all the time or are you cache missing and waiting for data to load?
To be honest, we really need more details to give you a definitive answer.

If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.

If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).

If you have a number of these to do, and you're doing graphics or "graphic like" tasks (thermal modeling, almost any 3d modeling) you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.
A researcher at a local univeristy offload almost all of his 3d-calculations for AI work to the GPU and achieved much faster results.

There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…
Your code has most everything a vectorizer needs to work, except two constraints: aliasing and alignment.
Aliasing is the problem of two names referring two the same object. For example, my_point.dist2( my_point ) would operate on two copies of my_point. This messes with the vectorizer.
C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.
GCC calls it __restrict__. It may be applied to references or this.
MSVC calls it __restrict. I'd be surprised if support were any different from GCC.
(It is not in C++0x, though.)
#ifdef __GCC__
#define restrict __restrict__
#elif defined _MSC_VER
#define restrict __restrict
#endif
 
Base dist2(const Point& restrict p) const restrict
Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:
#ifdef __GCC__
#define align16 __attribute__((aligned (16)))
#elif defined _MSC_VER
#define align16 __declspec(align (16))
#endif
 
struct Point {
double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
…
};
This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your vectorizer to learn more.

If you really need all the dist2 values, then you have to compute them. It's already low level and cannot imagine speedups apart from distributing on multiple cores.
On the other side, if you're searching for closeness, then you can supply to the dist2() function your current miminum value. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining 2 squares.
Furthermore, you can avoid the first square by going deeper in the double representation. Comparing directly on the exponent value with your current miminum can save even more cycles.

Are you using Visual Studio? If so you may want to look at specifying the floating point unit control using /fp fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -fOPTION floating point optimisations you might want to consider (if, as you say, accuracy is not a huge concern).

I suggest two techniques:
Move the structure members into
local variables at the beginning.
Perform like operations together.
These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.
Here's the code:
Base dist2(const Point& p) const
{
// Load the cache with data values.
register x1 = p.x;
register y1 = p.y;
register z1 = p.z;
// Perform subtraction together
x1 = x - x1;
y1 = y - y1;
z1 = z - z2;
// Perform multiplication together
x1 *= x1;
y1 *= y1;
z1 *= z1;
// Perform final sum
x1 += y1;
x1 += z1;
// Return the final value
return x1;
}
The other alternative is to group by dimension. For example, perform all 'X' operations first, then Y and followed by Z. This may show the compiler that pieces are independent and it can delegate to another core or processor.
If you can't get any more performance out of this function, you should look elsewhere as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance over 25%.
Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.
Hope this helps.

From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE) as others have pointed out. An alternative is to use a different norm, like the 1-norm is just the sum of the absolute values of the terms. Then no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, but it may not matter for your application.

Floating point operations are quite often slower, maybe you can think about modifying the code to use only integer arithmetic and see if this helps?
EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.

Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?
If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.

Just a few thoughts, however unlikely that I will add anything of value after 18 answers :)
If you are spending 80% time in these two functions I can imagine two typical scenarios:
Your algorithm is at least polynomial
As your data seem to be spatial maybe you can bring the O(n) down by introducing spatial indexes?
You are looping over certain set
If this set comes either from data on disk (sorted?) or from loop there might be possibility to cache, or use previous computations to calculate sqrt faster.
Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?

(scratch that!!! sqr != sqrt )
See if the "Fast sqrt" is applicable in your case :
http://en.wikipedia.org/wiki/Fast_inverse_square_root

Look at the context. There's nothing you can do to optimize an operation as simple as x*x.
Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?
Can you perhaps offload entire parts of the algorithm to the GPU?
Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)
Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
And of course, examine the generated assembly and see if it's what you expect.

Is there a reason you are implementing your own sqr operator?
Have you tried the one in libm it should be highly optimized.

The first thing that occurs to me is memoization ( on-the-fly caching of function calls ), but both sqr and dist2 it would seem like they are too low level for the overhead associated with memoization to be made up for in savings due to memoization. However at a higher level, you may find it may work well for you.
I think a more detailed analysis of you data is called for. Saying that most of the time in the program is spent executing MOV and JUMp commands may be accurate, but it's not going to help yhou optimise much. The information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a pre-cached tabled would be 1000 elements--not to big. You can always use code to generate it.
Have you unrolled loops? Broken down matrix opration? Looked for places where you can get by with table lookup instead of actual calculation.
Most drastic would be to adopt the techniques described in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf
though it is admittedly a hard read and you should get some help from someone who knows Common Lisp if you don't.

I'm curious why you made this a template when you said the computation is done using doubles?
Why not write a standard method, function, or just 'x * x' ?
If your inputs can be predictably constrained and you really need speed create an array that contains all the outputs your function can produce. Use the input as the index into the array (A sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.

See the SUBPD, MULPD and DPPD instructions. (DPPD required SSE4)
Depends on your code, but in some cases a stucture-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js