Effective for loop in assembly - c++

Im currently trying to get used to assembler and I have written a for loop in c++ and then I have looked at it in disassembly. I was wondering if anyone could explain to me what each step does and/or how to improve the loop manually.
for (int i = 0; i < length; i++){
013A17AE mov dword ptr [i],0
013A17B5 jmp encrypt_chars+30h (13A17C0h)
013A17B7 mov eax,dword ptr [i]
013A17BA add eax,1
013A17BD mov dword ptr [i],eax
013A17C0 mov eax,dword ptr [i]
013A17C3 cmp eax,dword ptr [length]
013A17C6 jge encrypt_chars+6Bh (13A17FBh)
temp_char = OChars [i]; // get next char from original string
013A17C8 mov eax,dword ptr [i]
013A17CB mov cl,byte ptr OChars (13AB138h)[eax]
013A17D1 mov byte ptr [temp_char],cl
Thanks in advance.

First, I'd note that what you've posted seems to contain only part of the loop body. Second, it looks like you compiled with all optimization turned off -- when/if you turn on optimization, don't be surprised if the result looks rather different.
That said, let's look at the code line-by-line:
013A17AE mov dword ptr [i],0
This is basically just i=0.
013A17B5 jmp encrypt_chars+30h (13A17C0h)
This is going to the beginning of the loop. Although it's common to put the test at the top of a loop in most higher level languages, that's not always the case in assembly language.
013A17B7 mov eax,dword ptr [i]
013A17BA add eax,1
013A17BD mov dword ptr [i],eax
This is i++ in (extremely sub-optimal) assembly language. It's retrieving the current value of i, adding one to it, then storing the result back into i.
013A17C0 mov eax,dword ptr [i]
013A17C3 cmp eax,dword ptr [length]
013A17C6 jge encrypt_chars+6Bh (13A17FBh)
This is basically if (i==length) /* skip forward to some code you haven't shown */ It's retrieving the value of i and comparing it to the value of length, the jumping somewhere if i was greater than or equal to length.
If you were writing this in assembly language by hand, you'd normally use something like xor eax, eax (or sub eax, eax) to zero a register. In most cases, you'd start from the maximum and count down to zero if possible (avoids a comparison in the loop). You certainly wouldn't store a value into a variable, then immediately retrieve it back out (in fairness, a compiler probably won't do that either, if you turn on optimization).
Applying that, and moving the "variables" into registers, we'd end up with something on this general order:
mov ecx, length
loop_top:
; stuff that wasn't pasted goes here
dec ecx
jnz loop_top

I'll try to interpret this in plain english:
013A17AE mov dword ptr [i],0 ; Move into i, 0
013A17B5 jmp encrypt_chars+30h (13A17C0h) ; Jump to check
013A17B7 mov eax,dword ptr [i] ; Load i into the accumulator (register eax)
013A17BA add eax,1 ; Increment the accumulator
013A17BD mov dword ptr [i],eax ; and put that in it, effectively adding
; 1 to i.
check:
013A17C0 mov eax,dword ptr [i] ; Move i into the accumulator
013A17C3 cmp eax,dword ptr [length] ; Compare it to the value in 'length',
; setting flags
013A17C6 jge encrypt_chars+6Bh (13A17FBh) ; Jump if it's greater or equal. This
; address is not in your code snippet
The compiler preferes EAX for arithmetic. Each register (in the past, I don't know if this is still current) has some type of operation that it is faster at doing.

Here's the part that should be more optimized:
(note: your compiler SHOULD do this, so either you have optimizations turned off, or something in the loop body is preventing this optimization)
mov eax,dword ptr [i] ; Go get "i" from memory, put it in register EAX
add eax,1 ; Add one to register EAX
mov dword ptr [i],eax ; Put register EAX back in memory "i". (now one bigger)
mov eax,dword ptr [i] ; Go get "i" from memory, put it in EAX again.
See how often you're moving values back-n-forth from memory to EAX?
You should be able to load "i" into EAX once at the beginning of the loop, run the full loop directly out of EAX, and put the finished value back into "i" after its all done.
(unless something else in your code prevents this)

Anyway, this code comes from DEBUG build. It is possible to optimize it, but MS compiler produces very good code for such simple cases.
There is no point to do it manually, just re-build it in release mode and read the listing to learn how to do it.

Related

Assembly: loop through a sequence of characters and swap them

My assignment is to Implement a function in assembly that would do the following:
loop through a sequence of characters and swap them such that the end result is the original string in reverse ( 100 points )
Hint: collect the string from user as a C-string then pass it to the assembly function along with the number of characters entered by the user. To find out the number of characters use strlen() function.
i have written both c++ and assembly programs and it works fine for extent: for example if i input 12345 the out put is correctly shown as 54321 , but if go more than 5 characters : the out put starts to be incorrect: for example if i input 123456 the output is :653241. i will greatly appreciate anyone who can point where my mistake is:
.code
_reverse PROC
push ebp
mov ebp,esp ;stack pointer to ebp
mov ebx,[ebp+8] ; address of first array element
mov ecx,[ebp+12] ; the number of elemets in array
mov eax,ebx
mov ebp,0 ;move 0 to base pointer
mov edx,0 ; set data register to 0
mov edi,0
Setup:
mov esi , ecx
shr ecx,1
add ecx,edx
dec esi
reverse:
cmp ebp , ecx
je allDone
mov edx, eax
add eax , edi
add edx , esi
Swap:
mov bl, [edx]
mov bh, [eax]
mov [edx],bh
mov [eax],bl
inc edi
dec esi
cmp edi, esi
je allDone
inc ebp
jmp reverse
allDone:
pop ebp ; pop ebp out of stack
ret ; retunr the value of eax
_reverse ENDP
END
and here is my c++ code:
#include<iostream>
#include <string>
using namespace std;
extern"C"
char reverse(char*, int);
int main()
{
char str[64] = {NULL};
int lenght;
cout << " Please Enter the text you want to reverse:";
cin >> str;
lenght = strlen(str);
reverse(str, lenght);
cout << " the reversed of the input is: " << str << endl;
}
You didn't comment your code, so IDK what exactly you're trying to do, but it looks like you are manually doing the array indexing with MOV / ADD instead of using an addressing mode like [eax + edi].
However, it looks like you're modifying your original value and then using it in a way that would make sense if it was unmodified.
mov edx, eax ; EAX holds a pointer to the start of array, read every iter
add eax , edi ; modify the start of the array!!!
add edx , esi
Swap:
inc edi
dec esi
EAX grows by EDI every step, and EDI increases linearly. So EAX increases geometrically (integral(x * dx) = x^2).
Single-stepping this in a debugger should have found this easily.
BTW, the normal way to do this is to walk one pointer up, one pointer down, and fall out of the loop when they cross. Then you don't need a separate counter, just cmp / ja. (Don't check for JNE or JE, because they can cross each other without ever being equal.)
Overall you the right idea to start at both ends of the string and swap elements until you get to the middle. Implementation is horrible though.
mov ebp,0 ;move 0 to base pointer
This seems to be loop counter (comment is useless or even worse); I guess idea was to swap length/2 elements which is perfectly fine. HINT I'd just compare pointers/indexes and exit once they collide.
mov edx,0 ; set data register to 0
...
add ecx,edx
mov edx, eax
Useless and misleading.
mov edi,0
mov esi , ecx
dec esi
Looks like indexes to start/end of the string. OK. HINT I'd go with pointers to start/end of the string; but indexes work too
cmp ebp , ecx
je allDone
Exit if did length/2 iterations. OK.
mov edx, eax
add eax , edi
add edx , esi
eax and edx point to current symbols to be swapped. Almost OK but this clobbers eax! Each loop iteration after second will use wrong pointers! This is what caused your problem in the first place. This wouldn't have happened if you used pointers instead indexes, or if you'd used offset addressing [eax+edi]/[eax+esi]
...
Swap part is OK
cmp edi, esi
je allDone
Second exit condition, this time comparing for index collision! Generally one exit condition should be enough; several exit conditions usually either superfluous or hint at some flaw in the algorithm. Also equality comparison is not enough - indexes can go from edi<esi to edi>esi during single iteration.

C/C++ Inline asm improper operand type

I have the following code, that is supposed to XOR a block of memory:
void XorBlock(DWORD dwStartAddress, DWORD dwSize, DWORD dwsKey)
{
DWORD dwKey;
__asm
{
push eax
push ecx
mov ecx, dwStartAddress // Move Start Address to ECX
add ecx, dwSize // Add the size of the function to ECX
mov eax, dwStartAddress // Copy the Start Address to EAX
crypt_loop: // Start of the loop
xor byte ptr ds:[eax], dwKey // XOR The current byte with 0x4D
inc eax // Increment EAX with dwStartAddress++
cmp eax,ecx // Check if every byte is XORed
jl crypt_loop; // Else jump back to the start label
pop ecx // pop ECX from stack
pop eax // pop EAX from stack
}
}
However, the argument dwKey gives me an error. The code works perfectly if for example the dwKey is replaced by 0x5D.
I think you have two problems.
First, "xor" can't take two memory operands (ds:[eax] is a memory location and dwKey is a memory location); secondly, you've used "byte ptr" to indicate you want a byte, but you're trying to use a DWORD and assembly can't automatically convert those.
So, you'll probably have to load your value into an 8-bit register and then do it. For example:
void XorBlock(DWORD dwStartAddress, DWORD dwSize, DWORD dwsKey)
{
DWORD dwKey;
__asm
{
push eax
push ecx
mov ecx, dwStartAddress // Move Start Address to ECX
add ecx, dwSize // Add the size of the function to ECX
mov eax, dwStartAddress // Copy the Start Address to EAX
mov ebx, dwKey // <---- LOAD dwKey into EBX
crypt_loop : // Start of the loop
xor byte ptr ds : [eax], bl // XOR The current byte with the low byte of EBX
inc eax // Increment EAX with dwStartAddress++
cmp eax, ecx // Check if every byte is XORed
jl crypt_loop; // Else jump back to the start label
pop ecx // pop ECX from stack
pop eax // pop EAX from stack
}
}
Although, it also looks like dwKey is uninitialized in your code; maybe you should just "mov bl, 0x42". I'm also not sure you need to push and pop the registers; I can't remember what registers you are allowed to clobber with MSVC++ inline assembler.
But, in the end, I think Alan Stokes is correct in his comment: it is very unlikely assembly is actually faster than C/C++ code in this case. The compiler can easily generate this code on its own, and you might find the compiler actually does unexpected optimizations to make it run even faster than the "obvious" assembly does (for example, loop unrolling).

Why access in a for loop is faster than access in a ranged-for in -O0 but not in -O3?

I'm learning performance in C++ (and C++11). And I need to performance in Debug and Release mode because I spend time in debugging and in executing.
I'm surprise with this two tests and how much change with the different compiler flags optimizations.
Test iterator 1:
Optimization 0 (-O0): faster.
Optimization 3 (-O3): slower.
Test iterator 2:
Optimization 0 (-O0): slower.
Optimization 3 (-O3): faster.
P.D.: I use the following clock code.
Test iterator 1:
void test_iterator_1()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
size_t count = v.size();
for (unsigned int i = 0; i < count; ++i) {
v[i] = 1;
}
}
Test iterator 2:
void test_iterator_2()
{
int z = 0;
int nv = 1200000000;
std::vector<int> v(nv);
for (int& i : v) {
i = 1;
}
}
UPDATE: The problem is still the same, but for ranged-for in -O3 the differences is small. So for loop 1 is the best.
UPDATE 2: Results:
With -O3:
t1: 80 units
t2: 74 units
With -O0:
t1: 287 units
t2: 538 units
UPDATE 3: The CODE!. Compile with: g++ -std=c++11 test.cpp -O0 (and then -O3)
Your first test is actually setting the value of each element in the vector to 1.
Your second test is setting the value of a copy of each element in the vector to 1 (the original vector is the same).
When you optimize, the second loop more than likely is removed entirely as it is basically doing nothing.
If you want the second loop to actually set the value:
for (int& i : v) // notice the &
{
i = 1;
}
Once you make that change, your loops are likely to produce assembly code that is almost identical.
As a side note, if you wanted to initialize the entire vector to a single value, the better way to do it is:
std::vector<int> v(SIZE, 1);
EDIT
The assembly is fairly long (100+ lines), so I won't post it all, but a couple things to note:
Version 1 will store a value for count and increment i, testing for it each time. Version 2 uses iterators (basically the same as std::for_each(b.begin(), v.end() ...)). So the code for the loop maintenance is very different (it is more setup for version 2, but less work each iteration).
Version 1 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
push eax
lea ecx, DWORD PTR _v$[ebp]
call ??A?$vector#HV?$allocator#H#std###std##QAEAAHI#Z ; std::vector<int,std::allocator<int> >::operator[]
mov DWORD PTR [eax], 1
Version 2 (just the meat of the loop)
mov eax, DWORD PTR _i$2[ebp]
mov DWORD PTR [eax], 1
When they get optimized, this all changes and (other than the ordering of a few instructions), the output is almost identical.
Version 1 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov ecx, DWORD PTR _v$[ebp+4]
mov edx, DWORD PTR _v$[ebp]
sub ecx, edx
sar ecx, 2 ; this is the only differing instruction
test ecx, ecx
je SHORT $LN3#test_itera
push edi
mov eax, 1
mov edi, edx
rep stosd
pop edi
$LN3#test_itera:
test edx, edx
je SHORT $LN21#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN21#test_itera:
mov esp, ebp
pop ebp
ret 0
Version 2 (optimized)
push ebp
mov ebp, esp
sub esp, 12 ; 0000000cH
push ecx
lea ecx, DWORD PTR _v$[ebp]
mov DWORD PTR _v$[ebp], 0
mov DWORD PTR _v$[ebp+4], 0
mov DWORD PTR _v$[ebp+8], 0
call ?resize#?$vector#HV?$allocator#H#std###std##QAEXI#Z ; std::vector<int,std::allocator<int> >::resize
mov edx, DWORD PTR _v$[ebp]
mov ecx, DWORD PTR _v$[ebp+4]
mov eax, edx
cmp edx, ecx
je SHORT $LN1#test_itera
$LL33#test_itera:
mov DWORD PTR [eax], 1
add eax, 4
cmp eax, ecx
jne SHORT $LL33#test_itera
$LN1#test_itera:
test edx, edx
je SHORT $LN47#test_itera
push edx
call DWORD PTR __imp_??3#YAXPAX#Z
add esp, 4
$LN47#test_itera:
mov esp, ebp
pop ebp
ret 0
Do not worry about how much time each operation takes, that falls squarely under the premature optimization is the root of all evil quote by Donald Knuth. Write easy to understand, simple programs, your time while writing the program (and reading it next week to tweak it, or to find out why the &%$# it is giving crazy results) is much more valuable than any computer time wasted. Just compare your weekly income to the price of an off-the-shelf machine, and think how much of your time is required to shave off a few minutes of compute time.
Do worry when you have measurements showing that the performance isn't adequate. Then you must measure where your runtime (or memory, or whatever else resource is critical) is spent, and see how to make that better. The (sadly out of print) book "Writing Efficient Programs" by Jon Bentley (much of it also appears in his "Programming Pearls") is an eye-opener, and a must read for any budding programmer.
Optimization is pattern matching: The compiler has a number of different situations it can recognize and optimize. If you change the code in a way that makes the pattern unrecognizable to the compiler, suddenly the effect of your optimization vanishes.
So, what you are witnessing is nothing more or less than that the ranged for loop produces more bloated code without optimization, but that in this form the optimizer is able to recognize a pattern that it cannot recognize for the iterator-free case.
In any case, if you are curious, you should take a look at the produced assembler code (compile with -S option).

How to increment an array in x86 assembly?

How would you increment an array using x86 assembly within a for loop. If the loop (made using c++) looked like:
for (int i = 0; i < limit; i++)
A value from an array is put in a register then the altered value is placed in a separate array. How would I increment each array in x86 assembly (I know c++ is simpler but it is practice work), so that each time the loop iterates the value used and the value placed into the arrays is one higher than the previous time? The details of what occur in the loop aside from the array manipulation are unimportant as I would like to know how this can be done in general, not a specific situation?
The loop you write here would be:
xor eax, eax ; clear loop variable
mov ebx, limit
loop:
cmp eax, ebx
je done
inc eax
jmp loop
done:
...
I really don't understand what you mean by "increment an array".
If you mean that you want to load some value from one array, manipulate the value and store the result in a target array, then you should consider this:
Load the pointer for the source array in esi and the target pointer in edi.
mov esi, offset array1
mov edi, offset array2
mov ebx, counter
loop:
mov eax, [esi]
do what you need
move [edi], eax
inc esi
inc edi
dec ebx
jne loop

What is different about C++ math.h abs() compared to my abs()

I am currently writing some glsl like vector math classes in C++, and I just implemented an abs() function like this:
template<class T>
static inline T abs(T _a)
{
return _a < 0 ? -_a : _a;
}
I compared its speed to the default C++ abs from math.h like this:
clock_t begin = clock();
for(int i=0; i<10000000; ++i)
{
float a = abs(-1.25);
};
clock_t end = clock();
unsigned long time1 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));
begin = clock();
for(int i=0; i<10000000; ++i)
{
float a = myMath::abs(-1.25);
};
end = clock();
unsigned long time2 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));
std::cout<<time1<<std::endl;
std::cout<<time2<<std::endl;
Now the default abs takes about 25ms while mine takes 60. I guess there is some low level optimisation going on. Does anybody know how math.h abs works internally? The performance difference is nothing dramatic, but I am just curious!
Since they are the implementation, they are free to make as many assumptions as they want. They know the format of the double and can play tricks with that instead.
Likely (as in almost not even a question), your double is the binary64 format. This means the sign has it's own bit, and an absolute value is merely clearing that bit. For example, as a specialization, a compiler implementer may do the following:
template <>
double abs<double>(const double x)
{
// breaks strict aliasing, but compiler writer knows this behavior for the platform
uint64_t i = reinterpret_cast<const std::uint64_t&>(x);
i &= 0x7FFFFFFFFFFFFFFFULL; // clear sign bit
return reinterpret_cast<const double&>(i);
}
This removes branching and may run faster.
There are well-known tricks for computing the absolute value of a two's complement signed number. If the number is negative, flip all the bits and add 1, that is, xor with -1 and subtract -1. If it is positive, do nothing, that is, xor with 0 and subtract 0.
int my_abs(int x)
{
int s = x >> 31;
return (x ^ s) - s;
}
What is your compiler and settings? I'm sure MS and GCC implement "intrinsic functions" for many math and string operations.
The following line:
printf("%.3f", abs(1.25));
falls into the following "fabs" code path (in msvcr90d.dll):
004113DE sub esp,8
004113E1 fld qword ptr [__real#3ff4000000000000 (415748h)]
004113E7 fstp qword ptr [esp]
004113EA call abs (4110FFh)
abs call the C runtime 'fabs' implementation on MSVCR90D (rather large):
102F5730 mov edi,edi
102F5732 push ebp
102F5733 mov ebp,esp
102F5735 sub esp,14h
102F5738 fldz
102F573A fstp qword ptr [result]
102F573D push 0FFFFh
102F5742 push 133Fh
102F5747 call _ctrlfp (102F6140h)
102F574C add esp,8
102F574F mov dword ptr [savedcw],eax
102F5752 movzx eax,word ptr [ebp+0Eh]
102F5756 and eax,7FF0h
102F575B cmp eax,7FF0h
102F5760 jne fabs+0D2h (102F5802h)
102F5766 sub esp,8
102F5769 fld qword ptr [x]
102F576C fstp qword ptr [esp]
102F576F call _sptype (102F9710h)
102F5774 add esp,8
102F5777 mov dword ptr [ebp-14h],eax
102F577A cmp dword ptr [ebp-14h],1
102F577E je fabs+5Eh (102F578Eh)
102F5780 cmp dword ptr [ebp-14h],2
102F5784 je fabs+77h (102F57A7h)
102F5786 cmp dword ptr [ebp-14h],3
102F578A je fabs+8Fh (102F57BFh)
102F578C jmp fabs+0A8h (102F57D8h)
102F578E push 0FFFFh
102F5793 mov ecx,dword ptr [savedcw]
102F5796 push ecx
102F5797 call _ctrlfp (102F6140h)
102F579C add esp,8
102F579F fld qword ptr [x]
102F57A2 jmp fabs+0F8h (102F5828h)
102F57A7 push 0FFFFh
102F57AC mov edx,dword ptr [savedcw]
102F57AF push edx
102F57B0 call _ctrlfp (102F6140h)
102F57B5 add esp,8
102F57B8 fld qword ptr [x]
102F57BB fchs
102F57BD jmp fabs+0F8h (102F5828h)
102F57BF mov eax,dword ptr [savedcw]
102F57C2 push eax
102F57C3 sub esp,8
102F57C6 fld qword ptr [x]
102F57C9 fstp qword ptr [esp]
102F57CC push 15h
102F57CE call _handle_qnan1 (102F98C0h)
102F57D3 add esp,10h
102F57D6 jmp fabs+0F8h (102F5828h)
102F57D8 mov ecx,dword ptr [savedcw]
102F57DB push ecx
102F57DC fld qword ptr [x]
102F57DF fadd qword ptr [__real#3ff0000000000000 (1022CF68h)]
102F57E5 sub esp,8
102F57E8 fstp qword ptr [esp]
102F57EB sub esp,8
102F57EE fld qword ptr [x]
102F57F1 fstp qword ptr [esp]
102F57F4 push 15h
102F57F6 push 8
102F57F8 call _except1 (102F99B0h)
102F57FD add esp,1Ch
102F5800 jmp fabs+0F8h (102F5828h)
102F5802 mov edx,dword ptr [ebp+0Ch]
102F5805 and edx,7FFFFFFFh
102F580B mov dword ptr [ebp-0Ch],edx
102F580E mov eax,dword ptr [x]
102F5811 mov dword ptr [result],eax
102F5814 push 0FFFFh
102F5819 mov ecx,dword ptr [savedcw]
102F581C push ecx
102F581D call _ctrlfp (102F6140h)
102F5822 add esp,8
102F5825 fld qword ptr [result]
102F5828 mov esp,ebp
102F582A pop ebp
102F582B ret
In release mode, the FPU FABS instruction is used instead (takes 1 clock cycle only on FPU >= Pentium), the dissasembly output is:
00401006 fld qword ptr [__real#3ff4000000000000 (402100h)]
0040100C sub esp,8
0040100F fabs
00401011 fstp qword ptr [esp]
00401014 push offset string "%.3f" (4020F4h)
00401019 call dword ptr [__imp__printf (4020A0h)]
It probably just uses a bitmask to set the sign bit to 0.
There can be several things:
are you sure the first call uses std::abs? It could just as well use the integer abs from C (either call std::abs explicitely, or have using std::abs;)
the compiler might have intrinsic implementation of some float functions (eg. compile them directly into FPU instructions)
However, I'm surprised the compiler doesn't eliminate the loop altogether - since you don't do anything with any effect inside the loop, and at least in case of abs, the compiler should know there are no side-effects.
Your version of abs is inlined and can be computed once and the compiler can trivially know that the value returned isn't going to change, so it doesn't even need to call the function.
You really need to look at the generated assembly code (set a breakpoint, and open the "large" debugger view, this disassembly will be on the bottom left if memory serves), and then you can see what's going on.
You can find documentation on your processor online without too much trouble, it'll tell you what all of the instructions are so you can figure out what's happening. Alternatively, paste it here and we'll tell you. ;)
Probably the library version of abs is an intrinsic function, whose behavior is exactly known by the compiler, which can even compute the value at compile time (since in your case it's known) and optimize the call away. You should try your benchmark with a value known only at runtime (provided by the user or got with rand() before the two cycles).
If there's still a difference, it may be because the library abs is written directly in hand-forged assembly with magic tricks, so it could be a little faster than the generated one.
The library abs function operates on integers while you are obviously testing floats. This means that call to abs with float argument involves conversion from float to int (may be a no-op as you are using constant and compiler may do it at compile time), then INTEGER abs operation and conversion int->float. You templated function will involve operations on floats and this is probably making a difference.