variables vs long algorithms

variables vs long algorithms - c++

what are the differences between using these two algorithms. I've always wondered how I should be optimising things.How do they differ memory and speed wise? Is one better than the other? Aside from code clarity I mean.
this is the first version I had:
bool Intersects(BoundingSphere boundingSphere)
{
D3DXVECTOR3 vectorBetween = (centre - boundingSphere.centre);
// works out the distance between the sphere centre's using pythag
float distance = sqrt(
pow(vectorBetween.x, 2)
+ pow(vectorBetween.y, 2)
+ pow(vectorBetween.z, 2));
// if two radius's add to more than the distance between the centres
return (radius + boundingSphere.radius > distance);
}
This method is the same, but it doesn't hold any values in variables, it just uses one long calculation
bool Intersects(BoundingSphere boundingSphere)
{
return (radius + boundingSphere.radius >
(sqrt(pow((centre - boundingSphere.centre).x, 2) +
pow((centre - boundingSphere.centre).y, 2) +
pow((centre - boundingSphere.centre).z, 2))));
}

The two algorithms will, under proper optimization options, compile down to exactly the same code. Since the first is far more readable, it is undoubtedly the better of the two.
The correct way to optimize this code is not to get rid of the variables (the compiler can do that for you), but to get rid of the sqrt operation: just compare squared distances.

Which one is easier to maintain?
I think your first version is easier, since it breaks out the parts and makes the vectorBetween obvious, as you reuse that a couple of times.
Now, as to which one is faster... a smart compiler will probably figure out how to make both the same speed. Don't worry too much about that until you need to. We are talking about O(1) differences here anyways, so if you do do this in a tight loop, just test both versions. Then you will know which one is faster!

Do whichever is clearest for you. If there are performance problems,
try the other. Without more exact information concerning the type of
D3DXVECTOR3 and the return value of the operator- you're using and
any number of other factors, it's impossible to even make a guess as to
which is faster. And until you know that this code is critical, it
doesn't really matter.

The best you can hope for with the second variant is that the compiler will optimize it to do exactly the same as the first rather than computing the difference vector three times. In either case, the machine will need to store the intermediate results somewhere; it doesn't really matter whether that intermediate storage is a named C++ variable or an anonymous machine language pointer.
And of course this is premature micro-optimization and completely irrelevant compared to the better readability of the first variant.

Related

What to consider when looking at multiple methods to achieve the same result while coding?

I am currently coding in C++, creating an all rounded calculator that, when finished, will be capable of handling all major and common mathematical procedures.
The current wall I am hitting is from the fact I am still learning about to profession we call being a programmer.
I have several ways of achieving a single result. I am curious as to whether I should pick the method that has a clear breakdown of how it got to that point in the code; or the method that is much shorter - while not sacrificing any of the redability.
Below I have posted snippets from my class showing what I mean.
This function uses if statements to determine whether or not a common denominator is even needed, but is several lines long.
Fraction Fraction::addFraction(Fraction &AddInput)
{
Fraction output;
if (m_denominator != AddInput.m_denominator)
{
getCommonDenominator(AddInput);
output.setWhole(m_whole + AddInput.m_whole);
output.setNumerator((m_numerator * firstchange) + (AddInput.m_numerator * secondchange));
output.setDenominator(commondenominator);
}
else
{
output.setWhole(m_whole + AddInput.m_whole);
output.setNumerator(m_numerator + AddInput.m_numerator);
output.setDenominator(m_denominator);
}
output.simplify();
return output;
}
This function below, gets a common denominator; repeats the steps on the numerators; then simplifies to the lowest terms.
Fraction Fraction::addFraction(Fraction &AddInput)
{
getCommonDenominator(AddInput);
Fraction output(m_whole + AddInput.m_whole, (m_numerator * firstchange) + (AddInput.m_numerator * secondchange), commondenominator);
output.simplify();
return output;
}
Both functions have been tested and always return the accurate result. When it comes to coding standards... do we pick longer and asy to follow? or shorter and easy to understand?

Your first priority with your code should be that it's correct.
Your second priority with code should be "If someone who's never seen this before is going to make a tiny change, which one is he less likely to break?
There's actually a lot that goes into this. How difficult is it to understand at a high level? How abstracted out are arcane details? Are there any surprises? What quirks do you have to know about? Are there edge cases that have to be handled?
The reasons that this second priority is important are:
it's key to preventing you from writing bugs in the first place
it's easier to find bugs later
it's easier to fix bugs later
despite whatever you think, you won't remember the details in 6 months.
Both implementations appear about equally difficult in complexity per branch, but the first one has branches, so I'd lean toward the second for understandability. Details seem abstracted out in both, and if there's surprises or quirks, I don't immediately see them (but that's sort of the point, that they can be easily overlooked). I don't see any special handling for edge cases, so if edge cases exist in either, comments would be good.
Unrelated to picking, but while on the topic of reviewing code, It's unclear how either handles fractions that have no fractional part, but that might be part of the full class documentation, which would be fine. Both codepaths take AddArgument by mutable reference, which is bad, and require this to be mutable as well, which is also bad. Both have methods named get*() that appear to modify (getCommonDenominator), which is bad. The code appears to be using variables that are external (firstchange? secondchange?) which is a major strike against preventing bugs.

Use of Literals, yay/nay in C++

I've recently heard that in some cases, programmers believe that you should never use literals in your code. I understand that in some cases, assigning a variable name to a given number can be helpful (especially in terms of maintenance if that number is used elsewhere). However, consider the following case studies:
Case Study 1: Use of Literals for "special" byte codes.
Say you have an if statement that checks for a specific value stored in (for the sake of argument) a uint16_t. Here are the two code samples:
Version 1:
// Descriptive comment as to why I'm using 0xBEEF goes here
if (my_var == 0xBEEF) {
//do something
}
Version 2:
const uint16_t kSuperDescriptiveVarName = 0xBEEF;
if (my_var == kSuperDescriptiveVarName) {
// do something
}
Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once. Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
Case Study 2: Use of sizeof
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns. Take the two code examples into account. The scenario is that you are computing the offset into a packet buffer (an array of uint8_t) where the first part of the packet is stored as my_packet_header, which let's say is a uint32_t.
Version 1:
const int offset = sizeof(my_packet_header);
Version 2:
const int offset = 4; // good comment telling reader where 4 came from
Clearly, version 1 is preferred, but what about for cases where you have multiple data fields to skip over? What if you have the following instead:
Version 1:
const int offset = sizeof(my_packet_header) + sizeof(data_field1) + sizeof(data_field2) + ... + sizeof(data_fieldn);
Version 2:
const int offset = 47;
Which is preferred in this case? Does is still make sense to show all the steps involved with computing the offset or does the literal usage make sense here?
Thanks for the help in advance as I attempt to better my code practices.

Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
Sounds like you understand the main point... factoring values (and their comments) that are used in multiple places. Further, it can sometimes help to have a group of constants in one place - so their values can be inspected, verified, modified etc. without concern for where they're used in the code. Other times, there are many constants used in proximity and the comments needed to properly explain them would obfuscate the code in which they're used.
Countering that, having a const variable means all the programmers studying the code will be wondering whether it's used anywhere else, keeping it in mind as they inspect the rest of the scope in which it's declared etc. - the less unnecessary things to remember the surer the understanding of important parts of the code will be.
Like so many things in programming, it's "an art" balancing the pros and cons of each approach, and best guided by experience and knowledge of the way the code's likely to be studied, maintained, and evolved.
Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
There's no performance implications in optimised code.
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns.
And other reasons too. A big factor in good programming is reducing the points of maintenance when changes are done. If you can modify the type of a variable and know that all the places using that variable will adjust accordingly, that's great - saves time and potential errors. Using sizeof helps with that.
Which is preferred [for calculating offsets in a struct]? Does is still make sense to show all the steps involved with computing the offset or does the literal usage make sense here?
The offsetof macro (#include <cstddef>) is better for this... again reducing maintenance burden. With the this + that approach you illustrate, if the compiler decides to use any padding your offset will be wrong, and further you have to fix it every time you add or remove a field.
Ignoring the offsetof issues and just considering your this + that example as an illustration of a more complex value to assign, again it's a balancing act. You'd definitely want some explanation/comment/documentation re intent here (are you working out the binary size of earlier fields? calculating the offset of the next field?, deliberately missing some fields that might not be needed for the intended use or was that accidental?...). Still, a named constant might be enough documentation, so it's likely unimportant which way you lean....

In every example you list, I would go with the name.
In your first example, you almost certainly used that special 0xBEEF number at least twice - once to write it and once to do your comparison. If you didn't write it, that number is still part of a contract with someone else (perhaps a file format definition).
In the last example, it is especially useful to show the computation that yielded the value. That way, if you encounter trouble down the line, you can easily see either that the number is trustworthy, or what you missed and fix it.
There are some cases where I prefer literals over named constants though. These are always cases where a name is no more meaningful than the number. For example, you have a game program that plays a dice game (perhaps Yahtzee), where there are specific rules for specific die rolls. You could define constants for One = 1, Two = 2, etc. But why bother?

Generally it is better to use a name instead of a value. After all, if you need to change it later, you can find it more easily. Also it is not always clear why this particular number is used, when you read the code, so having a meaningful name assigned to it, makes this immediately clear to a programmer.
Performance-wise there is no difference, because the optimizers should take care of it. And it is rather unlikely, even if there would be an extra instruction generated, that this would cause you troubles. If your code would be that tight, you probably shouldn't rely on an optimizer effect anyway.

I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
I think kSuperDescriptiveVarName will definitely be used more than once. One for check and at least one for assignment, maybe in different part of your program.
There will be no difference in performance, since an optimization called Constant Propagation exists in almost all compilers. Just enable optimization for your compiler.

What is efficient to check: equal to or not equal to?

I was wondering, if we have if-else condition, then what is computationally more efficient to check: using the equal to operator or the not equal to operator? Is there any difference at all?
E.g., which one of the following is computationally efficient, both cases below will do same thing, but which one is better (if there's any difference)?
Case1:
if (a == x)
{
// execute Set1 of statements
}
else
{
// execute Set2 of statements
}
Case 2:
if (a != x)
{
// execute Set2 of statements
}
else
{
// execute Set1 of statements
}
Here assumptions are most of the time (say 90% of the cases) a will be equal to x. a and x both are of unsigned integer type.

Generally it shouldn't matter for performance which operator you use. However it is recommended for branching that the most likely outcome of the if-statement comes first.

Usually what you should consider is; what is the simplest and clearest way to write this code? IMHO, the first, positive is the simplest (not requiring a !)
In terms of performance there is no differences as the code is likely to compile to the same thing. (Certainly in the JIT for Java it should)
For Java, the JIT can optimise the code so the most common branch is preferred by the branch prediction.

In this simple case, it makes no difference. (assuming a and x are basic types) If they're class-types with overloaded operator == or operator != they might be different, but I wouldn't worry about it.
For subsequent loops:
if ( c1 ) { }
else if ( c2 ) { }
else ...
the most likely condition should be put first, to prevent useless evaluations of the others. (again, not applicable here since you only have one else).

GCC provides a way to inform the compiler about the likely outcome of an expression:
if (__builtin_expect(expression, 1))
…
This built-in evaluates to the value of expression, but it informs the compiler that the likely result is 1 (true for Booleans). To use this, you should write expression as clearly as possible (for humans), then set the second parameter to whichever value is most likely to be the result.

There is no difference.
The x86 CPU architecture has two opcodes for conditional jumps
JNE (jump if not equal)
JE (jump if equal)
Usually they both take the same number of CPU cycles.
And even when they wouldn't, you could expect the compiler to do such trivial optimizations for you. Write what's most readable and what makes your intention more clear instead of worrying about microseconds.

If you ever manage to write a piece of Java code that can be proven to be significantly more efficient one way than the other, you should publish your result and raise an issue against whatever implementation you observed the difference on.
More to the point, just asking this kind of question should be a sign of something amiss: it is an indication that you are focusing your attention and efforts on a wrong aspect of your code. Real-life application performance always suffers from inadequate architecture; never from concerns such as this.

Early optimization is the root of all evil
Even for branch prediction, I think you should not care too much about this, until it is really necessary.
Just as Peter said, use the simplest way.

Let the compiler/optimizer do its work.
It's a general rule of thumb (most nowadays) that the source code should express your intention in the most readable way. You are writing it to another human (and not to the computer), the one year later yourself or your team mate who will need to understand your code with the less effort.

It shouldn't make any difference performance wise but you consider what is easiest to read. Then when you are looking back on your code or if someone is looking at it, you want it to be easy to understand.

it has a little advantage (from point of readability) if the first condition is the one that is true in most cases.
Write the conditions that way that you can read them best. You will not benefit from speed by negating a condition

Most processors use an electrical gate for equality/inequality checks, this means all bits are checked at once. Therefore it should make no difference, but you want to truly optimise your code it is always better to benchmark things yourself and check the results.
If you are wondering whether it's worth it to optimise like that, imagine you would have this check multiple times for every pixel in your screen, or scenarios like that. Imho, it is alwasy worth it to optimise, even if it's only to teach yourself good habits ;)

Only the non-negative approach which you have used at the first seems to be the best .

The only way to know for sure is to code up both versions and measure their performance. If the difference is only a percent or so, use the version that more clearly conveys the intent.
It's very unlikely that you're going to see a significant difference between the two.

Performance difference between them is negligible. So, just think about readability of the code. For readability I prefer the one which has a more lines of code in the If statement.
if (a == x) {
// x lines of code
} else {
// y lines of code where y < x
}

Auto vectorization Region of interest (crop)

I have a library which has some image processing algorithms including a Region of Interest (crop) algorithm. When compiling with GCC, the auto vectorizer speeds up a lot of the code but worsens the performance of the Crop algorithm. Is there a way of flagging a certain loop to be ignored by the vectorizer or is there a better way of structuring the code for better performance?
for (RowIndex=0;RowIndex<Destination.GetRows();++RowIndex)
{
rowOffsetS = ((OriginY + RowIndex) * SizeX) + OriginX;
rowOffsetD = (RowIndex * Destination.GetColumns());
for (ColumnIndex=0;ColumnIndex<Destination.GetColumns();++ColumnIndex)
{
BufferSPtr=BufferS + rowOffsetS + ColumnIndex;
BufferDPtr=BufferD + rowOffsetD + ColumnIndex;
*BufferDPtr=*BufferSPtr;
}
}
Where
SizeX is the width of the source
OriginX is the left of the region of interest
OriginY is the top of the region of interest

I haven't found anything about changing the optimization flags for a loop, however according to the documentation you can use the attribute optimize (look here and here) on a function to override the optimization settings for that function somewhat like this:
void foo() __attribute__((optimize("O2", "inline-functions")))
If you want to change it for several functions, you can use #pragma GCC optimize to set it for all following functions (look here).
So you should be able to compile the function containing crop with a different set of optimization flags, omitting the auto-vectorization. That has the disadvantage of hardcoding the compilation flags for that function, but is the best I found.
With regards to restructuring for better performance the two points I already mentioned in the comments come to mind (assuming the ranges can't overlap):
declaring the pointers as __restrict to tell the compiler that they don't alias (the area pointed to by one pointer won't be accessed by any other means inside the function). The possibility of pointer aliasing is a major stumbling block for the optimizer, since it can't easily reorder the accesses if it doesn't know if writing to BufferD will change the contents of BufferS.
Replacing the inner loop with a call to copy:
std::copy(BufferS + rowOffsetS, BufferS + rowOffsetS + Destination.GetColumns(), BufferD + rowOffsetD);
The copy function is likely to be pretty well optimized (probably forwarding the arguments to memmove), so that might make your code faster, while also making your code shorter (always a plus).

Speedup C++ code

I am writing a C++ number crunching application, where the bottleneck is a function that has to calculate for double:
template<class T> inline T sqr(const T& x){return x*x;}
and another one that calculates
Base dist2(const Point& p) const
{ return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }
These operations take 80% of the computation time. I wonder if you can suggest approaches to make it faster, even if there is some sort of accuracy loss
Thanks

First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).
Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of an SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions should be able to do up to 4 float or 2 double operations at a time depending on the instruction.. but of course it can only do this if you have the inputs to more than one operation on the ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them.. but you can at least set up your code so that it would be possible in theory.
If you're still not satisfied with the speed and you don't trust that your compiler is doing it best, you should look into using compiler intrinsics which will allow you to write potential parallel instructions explicitly.. or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation)
To take it even further, (and as glowcoder has already mentioned) you could perform these operations on a GPU. For your specific case, bear in mind that GPU's often don't support double precision floating point.. though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.

What is Base?
Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.
template<class T> inline T sqr(const T& x){return x*x;}
Base dist2(const Point& p) const {
return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}
If p's member variables are of type Base, you could be calling sqr on Base objects, which will be creating temporaries for the subtracted coordinates, in sqr, and then for each added component.
(We can't tell without the class definitions)
You could probably speed it up by forcing the sqr calls to be on primitves and not using Base until you get to the return type of dist2.
Other performance improvement opportunities are to:
Use non-floating point operations, if you're ok with less precision.
Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
(this is probably obvious, but) Make sure you're compiling with optimization turned on.

I think optimising these functions might be difficult, you might be better off optimising the code that calls these functions to call them less, or to do things differently.

You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
What does your profiler tell you is happening inside dist2. Are you actually using 100% CPU all the time or are you cache missing and waiting for data to load?
To be honest, we really need more details to give you a definitive answer.

If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.

If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).

If you have a number of these to do, and you're doing graphics or "graphic like" tasks (thermal modeling, almost any 3d modeling) you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.
A researcher at a local univeristy offload almost all of his 3d-calculations for AI work to the GPU and achieved much faster results.

There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…
Your code has most everything a vectorizer needs to work, except two constraints: aliasing and alignment.
Aliasing is the problem of two names referring two the same object. For example, my_point.dist2( my_point ) would operate on two copies of my_point. This messes with the vectorizer.
C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.
GCC calls it __restrict__. It may be applied to references or this.
MSVC calls it __restrict. I'd be surprised if support were any different from GCC.
(It is not in C++0x, though.)
#ifdef __GCC__
#define restrict __restrict__
#elif defined _MSC_VER
#define restrict __restrict
#endif
 
Base dist2(const Point& restrict p) const restrict
Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:
#ifdef __GCC__
#define align16 __attribute__((aligned (16)))
#elif defined _MSC_VER
#define align16 __declspec(align (16))
#endif
 
struct Point {
double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
…
};
This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your vectorizer to learn more.

If you really need all the dist2 values, then you have to compute them. It's already low level and cannot imagine speedups apart from distributing on multiple cores.
On the other side, if you're searching for closeness, then you can supply to the dist2() function your current miminum value. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining 2 squares.
Furthermore, you can avoid the first square by going deeper in the double representation. Comparing directly on the exponent value with your current miminum can save even more cycles.

Are you using Visual Studio? If so you may want to look at specifying the floating point unit control using /fp fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -fOPTION floating point optimisations you might want to consider (if, as you say, accuracy is not a huge concern).

I suggest two techniques:
Move the structure members into
local variables at the beginning.
Perform like operations together.
These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.
Here's the code:
Base dist2(const Point& p) const
{
// Load the cache with data values.
register x1 = p.x;
register y1 = p.y;
register z1 = p.z;
// Perform subtraction together
x1 = x - x1;
y1 = y - y1;
z1 = z - z2;
// Perform multiplication together
x1 *= x1;
y1 *= y1;
z1 *= z1;
// Perform final sum
x1 += y1;
x1 += z1;
// Return the final value
return x1;
}
The other alternative is to group by dimension. For example, perform all 'X' operations first, then Y and followed by Z. This may show the compiler that pieces are independent and it can delegate to another core or processor.
If you can't get any more performance out of this function, you should look elsewhere as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance over 25%.
Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.
Hope this helps.

From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE) as others have pointed out. An alternative is to use a different norm, like the 1-norm is just the sum of the absolute values of the terms. Then no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, but it may not matter for your application.

Floating point operations are quite often slower, maybe you can think about modifying the code to use only integer arithmetic and see if this helps?
EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.

Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?
If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.

Just a few thoughts, however unlikely that I will add anything of value after 18 answers :)
If you are spending 80% time in these two functions I can imagine two typical scenarios:
Your algorithm is at least polynomial
As your data seem to be spatial maybe you can bring the O(n) down by introducing spatial indexes?
You are looping over certain set
If this set comes either from data on disk (sorted?) or from loop there might be possibility to cache, or use previous computations to calculate sqrt faster.
Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?

(scratch that!!! sqr != sqrt )
See if the "Fast sqrt" is applicable in your case :
http://en.wikipedia.org/wiki/Fast_inverse_square_root

Look at the context. There's nothing you can do to optimize an operation as simple as x*x.
Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?
Can you perhaps offload entire parts of the algorithm to the GPU?
Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)
Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
And of course, examine the generated assembly and see if it's what you expect.

Is there a reason you are implementing your own sqr operator?
Have you tried the one in libm it should be highly optimized.

The first thing that occurs to me is memoization ( on-the-fly caching of function calls ), but both sqr and dist2 it would seem like they are too low level for the overhead associated with memoization to be made up for in savings due to memoization. However at a higher level, you may find it may work well for you.
I think a more detailed analysis of you data is called for. Saying that most of the time in the program is spent executing MOV and JUMp commands may be accurate, but it's not going to help yhou optimise much. The information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a pre-cached tabled would be 1000 elements--not to big. You can always use code to generate it.
Have you unrolled loops? Broken down matrix opration? Looked for places where you can get by with table lookup instead of actual calculation.
Most drastic would be to adopt the techniques described in:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf
though it is admittedly a hard read and you should get some help from someone who knows Common Lisp if you don't.

I'm curious why you made this a template when you said the computation is done using doubles?
Why not write a standard method, function, or just 'x * x' ?
If your inputs can be predictably constrained and you really need speed create an array that contains all the outputs your function can produce. Use the input as the index into the array (A sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.

See the SUBPD, MULPD and DPPD instructions. (DPPD required SSE4)
Depends on your code, but in some cases a stucture-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js