DSP performance, what should be avoided?

DSP performance, what should be avoided? - c++

I am starting with dsp programming right now and am writing my first low level classes and functions.
Since I want the functions to be fast (or at last not inefficient), I often wonder what I should use and what I should avoid in functions which get called per sample.
I know that the speed of an instruction varies quite a bit but I think that some of you at least can share a rule of thumb or just experience. :)
conditional statements
If I have to use conditions, switch should be faster than an if / else if block, right?
Are there differences between using two if-statements or an if-else? Somewhere I read that else should be avoided but I don't know why.
Also, compared to a multiplication, is there a rude estimation how much more time an if-block takes? Because in some cases, using multiplications by zero could be used instead of if-statements:
//something could be an int either 1 or 0:
if(something) {
signal += something_else;
}
// or:
signa+ += something*something_else;
functions and function-pointers
Instead of using conditional statements, you could use function-pointer. Instead of using conditions in every call, the pointer could be redirected to a specific function. However, for every call, the pointer had to be interpreted in order to call the right function. So I don't know if this would help or not.
What I also wonder is if calling functions have an impact. If so, boxing functions should be avoided, right?
variables
I would think that defining and using many variables in a function doesn't realy have an impact, at least relative to calculations. Is this true? If not, reusing declared variables would be better than more declaration.
calculations
Is there an order of calculation-types in term of the time they take to execute? I am sure that this highly depends on the context but a rule of thumb would be nice. I often read that people only count the multiplication in an algorithm. Is this because additions are realtively fast?
Does it make a difference between multiplication and division? (*0.5 or /2.0)
I hope that you can share soem experience.
Cheers

here are part of the answers:
calculations (talking about native precision of the processor for example 32bits):
Most DSP microprocessors have single cycle multipliers, that means a
multiply costs exactly the same as an addition in term of cycles.
and multiplication it generally faster then division.
conditional statements:
if/else - when looking in the assembly code you can see that the memory of the if condition is usually loaded by default, so when using if else make sure that the condition that will happen more frequently will be in the if.
but generally if possible you should avoid if/else in a loop to improve the pipe lining.
good luck.

DSP compilers are typically good at optimizing for loops that do not contain function-calls.
Therefore, try to inline every function that you call from within a time-critical for loop.
If your DSP is a fixed-point processor, then floating-point operations are implemented by SW.
This means that every such operation is essentially replaced by the compiler with a library function.
So you should basically avoid performing floating-point operations inside time-critical for loops.
The preprocessor should provide a special #pragma for the number of iterations of a for loop:
Minimum number of iterations
Maximum number of iterations
Multiplicity of the number of iterations
Use this #pragma where possible, in order to help the compiler to perform loop-unrolling where possible.
Finally, DSPs usually support a set of unique operations for enhanced performance.
As an example, consider _dotpu4 on Texas Instruments C64xx, which computes the scalar-product of two integers src1 and src2: For each pair of 8-bit values in src1 and src2, the 8-bit value from src1 is multiplied with the 8-bit value from src2, and the four products are summed together.
Check the data-sheet of your DSP, and see if you can make use of any of these operations.
The compiler should generate an intermediate file, which you can explore in order to analyze the expected performance of each of the optimized for loops in your code.
Based on that, you can try different assembly operations that might yield better results.

Related

Enforcing order of execution

I would like to ensure that the calculations requested are executed exactly in the order I specify, without any alterations from either the compiler or CPU (including the linker, assembler, and anything else you can think of).
Operator left-to-right associativity is assumed in the C language
I am working in C (possibly also interested in C++ solutions), which states that for operations of equal precedence there is an assumed left-to-right operator associativity, and hence
a = b + c - d + e + f - g ...;
is equivalent to
a = (...(((((b + c) - d) + e) + f) - g) ...);
A small example
However, consider the following example:
double a, b = -2, c = -3;
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
So many opportunities for optimisation
For many compilers and pre-processors they may be clever enough to recognise the "+ 2 - 2" is redundant and optimise this away. Similarly they could recognise that the "+= 2*b" followed by the "+= c" can be written using a single FMA. Even if they don't optimise in an FMA, they may switch the order of these operations etc. Furthermore, if the compiler doesn't do any of these optimisations, the CPU may well decide to do some out of order execution, and decide it can do the "+= c" before the "+= 2*b", etc.
As floating-point arithmetic is non-associative, each type of optimisation may result in a different end result, which may be noticeable if the following is inlined somewhere.
Why worry about floating point associativity?
For most of my code I would like as much optimisation as I can have and don't care about floating-point associativity or bit-wise reproduciblilty, but occasionally there is a small snippet (similar to the above example) which I would like to be untampered with and totally respected. This is because I am working with a mathematical method which exactly requires a reproducible result.
What can I do to resolve this?
A few ideas which have come to mind:
Disable compiler optimisations and out of order execution
I don't want this, as I want the other 99% of my code to be heavily optimised. (This seems to be cutting off my nose to spite my face). I also most likely won't have permission to change my hardware settings.
Use a pragma
Write some assembly
The code snippets are small enough that this might be reasonable, although I'm not very confident in this, especially if (when) it comes to debugging.
Put this in a separate file, compile separately as un-optimised as possible, and then link using a function call
Volatile variables
To my mind these are just for ensuring that memory access is respected and un-optimised, but perhaps they might prove useful.
Access everything through judicious use of pointers
Perhaps, but this seems like a disaster in readability, performance, and bugs waiting to happen.
If anyone can think of any feasibly solutions (either from any of the ideas I've suggested or otherwise) that would be ideal. The "pragma" option or "function call" to my mind seem like the best approaches.
The ultimate goal
To have something that marks off a small chuck of simple and largely vanilla C code as protected and untouchable to any (realistically most) optimisations, while allowing for the rest of the code to be heavily optimised, covering optimisations from both the CPU and compiler.

This is not a complete answer, but it is informative, partially answers, and is too long for a comment.
Clarifying the Goal
The question actually seeks reproducibility of floating-point results, not order of execution. Also, order of execution is irrelevant; we do not care if, in (a+b)+(c+d), a+b or c+d is executed first. We care that the result of a+b is added to the result of c+d, without any reassociation or other rewriting of arithmetic unless the result is known to be the same.
Reproducibility of floating-point arithmetic is in general an unsolved technological problem. (There is no theoretical barrier; we have reproducible elementary operations. Reproducibility is a matter of what hardware and software vendors have provided and how hard it is to express the computations we want performed.)
Do you want reproducibility on one platform (e.g., always using the same version of the same math library)? Does your code use any math library routines like sin or log? Do you want reproducibility across different platforms? With multithreading? Across changes of compiler version?
Addressing Some Specific Issues
The samples shown in the question can largely be handled by writing each individual floating-point operation in its own statement, as by replacing:
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
with:
t0 = 1 + 2;
t0 = t0 - 2;
t0 = t0 + 3;
t0 = t0 + 4;
t1 = 2*b;
t0 += t1;
a += c;
The basis for this is that both C and C++ permit an implementation to use “excess precision” when evaluating an expression but require that precision to be “discarded” when an assignment or cast is performed. Limiting each assignment expression to one operation or executing a cast after each operation effectively isolates the operations.
In many cases, a compiler will then generate code using instructions of the nominal type, instead of instructions using a type with excess precision. In particular, this should avoid a fused multiply-add (FMA) being substituted for a multiplication followed by an addition. (An FMA has effectively infinite precision in the product before it is added to the addend, thus falling under the “excess precision is permitted” rule.) There are caveats, however. An implementation might first evaluate an operation with excess precision and then round it to the nominal precision. In general, this can cause a different result than doing a single operation in the nominal precision. For the elementary operations of addition, subtract, multiplication, division, and even square root, this does not happen if the excess precision is sufficient greater than the nominal precision. (There are proofs that a result with sufficient excess precision is always close enough to the infinitely precise result that the rounding to nominal precision gets the same result.) This is true for the case where the nominal precision is the IEEE-754 basic 32-bit binary floating-point format, and the excess precision is the 64-bit format. However, it is not true where the nominal precision is the 64-bit format and the excess precision is Intel’s 80-bit format.
So, whether this workaround works depends on the platform.
Other Issues
Aside from the use of excess precision and features like FMA or the optimizer rewriting expressions, there are other things that affect reproducibility, such as non-standard treatment of subnormals (notably replacing them with zeroes), variations between math library routines. (sin, log, and similar functions return different results on different platforms. Nobody has fully implemented correctly rounded math library routines with known bounded performance.)
These are discussed in other Stack Overflow questions about floating-point reproducibility, as well as papers, specifications, and standards documents.
Irrelevant Issues
The order in which a processor executes floating-point operations is irrelevant. Processor reordering of calculations obeys rigid semantics; the results are identical regardless of the chronological order of execution. (Processor timing can affect results if, for example, a task is partitioned into subtasks, such as assigning multiple threads or processes to process different parts of the arrays. Among other issues, their results could arrive in different orders, and the process receiving their results might then add or otherwise combine their results in different orders.)
Using pointers will not fix anything. As far as C or C++ is concerned, *p where p is a pointer to double is the same as a where a is a double. One the objects has a name (a) and one of them does not, but they are like roses: They smell the same. (There are issues where, if you have some other pointer q, the compiler might not know whether *q and *p refer to the same thing. But that also holds true for *q and a.)
Using volatile qualifiers will not aid in reproducibility regarding the excess precision or expression rewriting issue. That is because only an object (not a value) is volatile, which means it has no effect until you write it or read it. But, if you write it, you are using an assignment expression1, so the rule about discarding excess precision already applies. When reading the object, you would force the compiler to retrieve the actual value from memory, but this value will not be any different than the non-volatile object has after assignment, so nothing is accomplished.
Footnote
1 I would have to check on other things that modify an object, such as ++, but those are likely not significant for this discussion.

Write this critical chunk of code in assembly language.
The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.
If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.

"clever enough to recognise the + 2 - 2 is redundant and optimise this
away"
No ! All decent compilers will apply constant propagation and figure out that a is constant and optimize all your statement away, into something equivalent to a = 1;. Here the example with assembly.
Now if you make a volatile, the compiler has to assume that any change of a could have an impact outside the C++ programme. Constant propagation will still be performed to optimise each of these calculations, but the intermediary assignments are guaranteed to happen. Here the example with assembly.
If you don't want constant propagation to happen, you need to deactivate optimizations. In this case, the best would be to keep your code separate so to compile the rest with all optilizations on.
However this is not ideal. The optimizer could outperform you and with this approach, you'll loose global optimisation across the function boundaries.
Recommendation/quote of the day:
Don't diddle code; Find better algorithms
- B.W.Kernighan & P.J.Plauger

How to use if condition in intrinsics

I want to compare two floating point variables using intrinsics. If the comparison is true, do something else do something. I want to do this as a normal if..else condition. Is there any way using intrinsics?
//normal code
vector<float> v1, v2;
for(int i = 0; i < v1.size(); ++i)
if(v1[i]<v2[i])
{
//do something
}
else
{
//do something
)
How to do this using SSE2 or AVX?

If you expect that v1[i] < v2[i] is almost never true, almost always true, or usually stays the same for a long run (even if overall there might be no particular bias), then an other technique is also applicable which offers "true conditionality" (ie not "do both, discard one result"), a price of course, but you also get to actually skip work instead of just ignoring some results.
That technique is fairly simple, do the comparison (vectorized), gather the comparison mask with _mm_movemask_ps, and then you have 3 cases:
All comparisons went the same way and they were all false, execute the appropriate "do something" code that is now maybe easier to vectorize since the condition is gone.
All comparisons went the same way and they were all true, same.
Mixed, use more complicated logic. Depending on what you need, you could check all bits separately (falling back to scalar code, but now just 1 FP compare for the whole lot), or use one of the "iterate only over (un)set bits" tricks (combines well with bitscan to recover the actual index), or sometimes you can fall back to doing masking and merging as usual.
Not all 3 cases are always relevant, usually you're applying this because the predicate almost always goes the same way, making one of the "all the same" cases so rare that you can just lump it in with "mixed".
This technique is definitely not always useful. The "mixed" case is complicated and slow. The fast-path has to be common and fast enough to be worth testing whether you're can take it.
But it can be useful, maybe one of the sides is very slow and annoying, while the other side of the branch is nice simple vectorizable code that doesn't take all that long in comparison. For example, maybe the slow side has to do argument reduction for an otherwise fast approximated transcendental function, or maybe it has to normalize some vectors before taking their dot product, or orthogonalize a matrix, maybe even get data from disk..
Or, maybe neither side is exactly slow, but they evict each others data from cache (maybe both sides are a loop over an array that fits in cache, but the arrays don't fit in it together) so doing them unconditionally slows both of them down. This is probably a real thing, but I haven't seen it in the wild (yet).
Or, maybe one side cannot be executed unconditionally, doing some generally destructive things, maybe even some IO. For example if you're checking for error conditions and logging them.

SIMD conditional operations are done with branchless techniques. You use a packed-compare instruction to get a vector of elements that are all-zero or all-one.
e.g. you can conditionally add 4 to elements in an accumulator when a corresponding element matches a condition with code like:
__m128i match_counts = _mm_setzero_si128();
for (...) {
__m128 fvec = something;
__m128i condition = _mm_castps_si128( _mm_cmplt_ps(fvec, _mm_setzero_ps()) ); // for elements less than zero
__m128i masked_constant = _mm_and_si128(condition, _mm_set1_epi32(4));
match_counts = _mm_add_epi32(match_counts, masked_constant);
}
Obviously this only works well if you can come up with a branchless way to do both sides of the branch. A blend instruction can often help.
It's likely that you won't get any speedup at all if there's too much work in each side of the branch, especially if your element size is 4 bytes or larger. (SIMD is really powerful when you're doing 16 operations in parallel on 16 separate bytes, less powerful when doing 4 operations on four 32-bit elements).

I found a document which is very useful for conditional SIMD instructions.
It is a perfect solution to my question.
If...else condition
Document: http://saluc.engr.uconn.edu/refs/processors/intel/sse_sse2.pdf

C++ compiler optimization for complex equations

I have some equations that involve multiple operations that I would like to run as fast as possible. Since the c++ compiler breaks it down in to machine code anyway does it matter if I break it up to multiple lines like
A=4*B+4*C;
D=3*E/F;
G=A*D;
vs
G=12*E*(B+C)/F;
My need is more complex than this but the i think it conveys the idea. Also if this is in a function that gets called is in a loop, does defining double A, D cost CPU time vs putting it in as a class variable?

Using a modern compiler, Clang/Gcc/VC++/Intel, it won't really matter, the best thing you should do is worry about how readable your code will be and turn on optimizations, compiler designers are well aware of issues like these and design their compilers to (for the most part) optimize according.
If I were to say which would be slower I would assume the first way since there would be 3 mov instructions, I could be wrong. but this isn't something you should worry about too much.

If these variables are integers, that second code fragment is not a valid optimization of the first. For B=1, C=1, E=1, F=6, you have:
A=4*B+4*C; // 8
D=3*E/F; // 0
G=A*D; // 0
and
G=12*E*(B+C)/F; // 4
If floating point, then it really depends on what compiler, what compiler options, and what cpu you have.

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to concider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instruction. This means I have to choose an operation to wouldn't be performed only once due to compiler optimization (we can't use -o0 flag as school staff does not).
Bad example: var = i; - the compiler would only perform the last command.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S don't know if it matters but I write in CPP

1) This sounds (to me) like an impossible task, if optimizations are (or might be) enabled. You can never be sure on what the compiler will do during optimizations. I'd definitely do something like reusing the previous result. If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
In general, if this answer goes into the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:
template<class T> inline T sqr(const T& x){return x*x;}
and another one that calculates
Base dist2(const Point& p) const
{ return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }
These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss
Thanks

First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).
Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.
If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)
To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.

What is Base?
Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.
template<class T> inline T sqr(const T& x){return x*x;}
Base dist2(const Point& p) const {
return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}
If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.
(We can't tell without the class definitions)
You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.
Other performance improvement opportunities are to:
Use non-floating point operations, if you're ok with less precision.
Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
(this is probably obvious, but) Make sure you're compiling with optimization turned on.

I think optimising these functions might be difficult, you might be better off optimising the code that calls these functions to call them less, or to do things differently.

You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
What does your profiler tell you is happening inside dist2. Are you actually using 100% CPU all the time or are you cache missing and waiting for data to load?
To be honest, we really need more details to give you a definitive answer.

If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.

If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).

If you have a number of these to do, and you're doing graphics or "graphic like" tasks (thermal modeling, almost any 3d modeling) you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.
A researcher at a local univeristy offload almost all of his 3d-calculations for AI work to the GPU and achieved much faster results.

There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…
Your code has most everything a vectorizer needs to work, except two constraints: aliasing and alignment.
Aliasing is the problem of two names referring two the same object. For example, my_point.dist2( my_point ) would operate on two copies of my_point. This messes with the vectorizer.
C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.
GCC calls it __restrict__. It may be applied to references or this.
MSVC calls it __restrict. I'd be surprised if support were any different from GCC.
(It is not in C++0x, though.)
#ifdef __GCC__
#define restrict __restrict__
#elif defined _MSC_VER
#define restrict __restrict
#endif
 
Base dist2(const Point& restrict p) const restrict
Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:
#ifdef __GCC__
#define align16 __attribute__((aligned (16)))
#elif defined _MSC_VER
#define align16 __declspec(align (16))
#endif
 
struct Point {
double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
…
};
This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your vectorizer to learn more.

If you really need all the dist2 values, then you have to compute them. It's already low level and cannot imagine speedups apart from distributing on multiple cores.
On the other side, if you're searching for closeness, then you can supply to the dist2() function your current miminum value. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining 2 squares.
Furthermore, you can avoid the first square by going deeper in the double representation. Comparing directly on the exponent value with your current miminum can save even more cycles.

Are you using Visual Studio? If so you may want to look at specifying the floating point unit control using /fp fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -fOPTION floating point optimisations you might want to consider (if, as you say, accuracy is not a huge concern).

I suggest two techniques:
Move the structure members into
local variables at the beginning.
Perform like operations together.
These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.
Here's the code:
Base dist2(const Point& p) const
{
// Load the cache with data values.
register x1 = p.x;
register y1 = p.y;
register z1 = p.z;
// Perform subtraction together
x1 = x - x1;
y1 = y - y1;
z1 = z - z2;
// Perform multiplication together
x1 *= x1;
y1 *= y1;
z1 *= z1;
// Perform final sum
x1 += y1;
x1 += z1;
// Return the final value
return x1;
}
The other alternative is to group by dimension. For example, perform all 'X' operations first, then Y and followed by Z. This may show the compiler that pieces are independent and it can delegate to another core or processor.
If you can't get any more performance out of this function, you should look elsewhere as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance over 25%.
Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.
Hope this helps.

From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE) as others have pointed out. An alternative is to use a different norm, like the 1-norm is just the sum of the absolute values of the terms. Then no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, but it may not matter for your application.

Floating point operations are quite often slower, maybe you can think about modifying the code to use only integer arithmetic and see if this helps?
EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.

Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?
If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.

Just a few thoughts, however unlikely that I will add anything of value after 18 answers :)
If you are spending 80% time in these two functions I can imagine two typical scenarios:
Your algorithm is at least polynomial
As your data seem to be spatial maybe you can bring the O(n) down by introducing spatial indexes?
You are looping over certain set
If this set comes either from data on disk (sorted?) or from loop there might be possibility to cache, or use previous computations to calculate sqrt faster.
Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?

(scratch that!!! sqr != sqrt )
See if the "Fast sqrt" is applicable in your case :
http://en.wikipedia.org/wiki/Fast_inverse_square_root

Look at the context. There's nothing you can do to optimize an operation as simple as x*x.
Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?
Can you perhaps offload entire parts of the algorithm to the GPU?
Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)
Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
And of course, examine the generated assembly and see if it's what you expect.

Is there a reason you are implementing your own sqr operator?
Have you tried the one in libm it should be highly optimized.

The first thing that occurs to me is memoization ( on-the-fly caching of function calls ), but both sqr and dist2 it would seem like they are too low level for the overhead associated with memoization to be made up for in savings due to memoization. However at a higher level, you may find it may work well for you.
I think a more detailed analysis of you data is called for. Saying that most of the time in the program is spent executing MOV and JUMp commands may be accurate, but it's not going to help yhou optimise much. The information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a pre-cached tabled would be 1000 elements--not to big. You can always use code to generate it.
Have you unrolled loops? Broken down matrix opration? Looked for places where you can get by with table lookup instead of actual calculation.
Most drastic would be to adopt the techniques described in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf
though it is admittedly a hard read and you should get some help from someone who knows Common Lisp if you don't.

I'm curious why you made this a template when you said the computation is done using doubles?
Why not write a standard method, function, or just 'x * x' ?
If your inputs can be predictably constrained and you really need speed create an array that contains all the outputs your function can produce. Use the input as the index into the array (A sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.

See the SUBPD, MULPD and DPPD instructions. (DPPD required SSE4)
Depends on your code, but in some cases a stucture-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js