I have the following lookup-and-interpolation code to optimize (a float table of size 128).
It will be used with the Intel compiler on Windows, GCC on OSX, and GCC with NEON on OSX.
for(unsigned int i = 0; i < 4; i++)
{
    const int iIdx = (int)m_fIndex[i];
    const float frac = m_fIndex[i] - iIdx;
    m_fResult[i] = sftable[iIdx].val1 + sftable[iIdx].val2 * frac;
}
I vectorized everything with SSE/NEON (the macros expand to SSE/NEON intrinsics):
VEC_INT iIdx = VEC_FLOAT2INT(m_fIndex);
VEC_FLOAT frac = VEC_SUB(m_fIndex, VEC_INT2FLOAT(iIdx));
m_fResult[0] = sftable[iIdx[0]].val2;
m_fResult[1] = sftable[iIdx[1]].val2;
m_fResult[2] = sftable[iIdx[2]].val2;
m_fResult[3] = sftable[iIdx[3]].val2;
m_fResult=VEC_MUL( m_fResult,frac);
frac[0] = sftable[iIdx[0]].val1;
frac[1] = sftable[iIdx[1]].val1;
frac[2] = sftable[iIdx[2]].val1;
frac[3] = sftable[iIdx[3]].val1;
m_fResult=VEC_ADD( m_fResult,frac);
I think that the table access and the move into aligned memory are the real bottleneck here.
I am not good at assembly, but there are a lot of unpcklps and mov instructions:
10026751 mov eax,dword ptr [esp+4270h]
10026758 movaps xmm3,xmmword ptr [eax+16640h]
1002675F cvttps2dq xmm5,xmm3
10026763 cvtdq2ps xmm4,xmm5
10026766 movd edx,xmm5
1002676A movdqa xmm6,xmm5
1002676E movdqa xmm1,xmm5
10026772 psrldq xmm6,4
10026777 movdqa xmm2,xmm5
1002677B movd ebx,xmm6
1002677F subps xmm3,xmm4
10026782 psrldq xmm1,8
10026787 movd edi,xmm1
1002678B psrldq xmm2,0Ch
10026790 movdqa xmmword ptr [esp+4F40h],xmm5
10026799 mov ecx,dword ptr [eax+edx*8+10CF4h]
100267A0 movss xmm0,dword ptr [eax+edx*8+10CF4h]
100267A9 mov dword ptr [eax+166B0h],ecx
100267AF movd ecx,xmm2
100267B3 mov esi,dword ptr [eax+ebx*8+10CF4h]
100267BA movss xmm4,dword ptr [eax+ebx*8+10CF4h]
100267C3 mov dword ptr [eax+166B4h],esi
100267C9 mov edx,dword ptr [eax+edi*8+10CF4h]
100267D0 movss xmm7,dword ptr [eax+edi*8+10CF4h]
100267D9 mov dword ptr [eax+166B8h],edx
100267DF movss xmm1,dword ptr [eax+ecx*8+10CF4h]
100267E8 unpcklps xmm0,xmm7
100267EB unpcklps xmm4,xmm1
100267EE unpcklps xmm0,xmm4
100267F1 mulps xmm0,xmm3
100267F4 movaps xmmword ptr [eax+166B0h],xmm0
100267FB mov ebx,dword ptr [esp+4F40h]
10026802 mov edi,dword ptr [esp+4F44h]
10026809 mov ecx,dword ptr [esp+4F48h]
10026810 mov esi,dword ptr [esp+4F4Ch]
10026817 movss xmm2,dword ptr [eax+ebx*8+10CF0h]
10026820 movss xmm5,dword ptr [eax+edi*8+10CF0h]
10026829 movss xmm3,dword ptr [eax+ecx*8+10CF0h]
10026832 movss xmm6,dword ptr [eax+esi*8+10CF0h]
1002683B unpcklps xmm2,xmm3
1002683E unpcklps xmm5,xmm6
10026841 unpcklps xmm2,xmm5
10026844 mulps xmm2,xmm0
10026847 movaps xmmword ptr [eax+166B0h],xmm2
When profiling, there is not much benefit from the SSE version on Windows.
Do you have any suggestions for improving it?
Are there any side effects to be expected with NEON/GCC?
Currently I am considering vectorizing only the first part and doing the table readout and interpolation in a scalar loop, hoping that it will benefit from compiler optimization.
OSX? Then it has nothing to do with NEON.
BTW, NEON cannot handle LUTs this large anyway. (I don't know about SSE in this regard.)
Verify first whether SSE can handle LUTs of this size; if yes, I suggest using a different compiler, since GCC tends to make intrinsucks out of intrinsics.
That’s some of the absolute worst compiler codegen I’ve ever seen (assuming that the optimizer is enabled). Worth filing a bug against GCC.
Major issues:
Loading val1 and val2 for each lookup separately.
Getting the index for val1 and val2 into a GPR separately.
Writing the vector of indices to the stack and then loading them into GPRs.
In order to get compilers to generate better code (one load for each table line), you may need to load each table line as though it were a double, then cast the line to a vector of two floats and swizzle the lines to get homogeneous vectors. On both NEON and SSE, this should require only four loads and three or four unpacks (much better than the current eight loads + six unpacks).
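A rough sketch of that approach with SSE2 intrinsics is shown below (assuming the table entries are val1/val2 float pairs, 8 bytes per line; the struct and function names here are illustrative only):
#include <emmintrin.h>
struct Line { float val1, val2; };   // assumed layout of one sftable entry
static inline void load_lines(const Line *table, const int idx[4],
                              __m128 *vVal1, __m128 *vVal2)
{
    // one 64-bit load per table line reads val1 and val2 together (movsd)
    __m128d l0 = _mm_load_sd((const double *)&table[idx[0]]);
    __m128d l1 = _mm_load_sd((const double *)&table[idx[1]]);
    __m128d l2 = _mm_load_sd((const double *)&table[idx[2]]);
    __m128d l3 = _mm_load_sd((const double *)&table[idx[3]]);
    // pair the lines: {v1_0, v2_0, v1_1, v2_1} and {v1_2, v2_2, v1_3, v2_3}
    __m128 p01 = _mm_castpd_ps(_mm_unpacklo_pd(l0, l1));
    __m128 p23 = _mm_castpd_ps(_mm_unpacklo_pd(l2, l3));
    // deinterleave into homogeneous vectors {val1 x4} and {val2 x4}
    *vVal1 = _mm_shuffle_ps(p01, p23, _MM_SHUFFLE(2, 0, 2, 0));
    *vVal2 = _mm_shuffle_ps(p01, p23, _MM_SHUFFLE(3, 1, 3, 1));
}
That is four loads, two unpacks and two shuffles, which matches the instruction count above.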
Getting rid of the superfluous stack traffic may be harder. Make sure that the optimizer is turned on. Fixing the multiple-load issue will halve the stack traffic because you’ll only generate each index once, but to get rid of it entirely it might be necessary to write assembly instead of intrinsics (or to use a newer compiler version).
One of the reasons why the compiler creates "funky" code (with lots of re-loads) here is that it must assume, for correctness, that the data in the sftable[] array may change. To make the generated code better, restructure it to look like this:
VEC_INT iIdx = VEC_FLOAT2INT(m_fIndex);
VEC_FLOAT frac = VEC_SUB(m_fIndex, VEC_INT2FLOAT(iIdx));
VEC_FLOAT fracnew;
// make it explicit that all you want is _four loads_
typeof(*sftable) tbl[4] = {
sftable[iIdx[0]], sftable[iIdx[1]], sftable[iIdx[2]], sftable[iIdx[3]]
};
m_fResult[0] = tbl[0].val2;
m_fResult[1] = tbl[1].val2;
m_fResult[2] = tbl[2].val2;
m_fResult[3] = tbl[3].val2;
fracnew[0] = tbl[0].val1;
fracnew[1] = tbl[1].val1;
fracnew[2] = tbl[2].val1;
fracnew[3] = tbl[3].val1;
m_fResult=VEC_MUL( m_fResult,frac);
m_fResult=VEC_ADD( m_fResult,fracnew);
frac = fracnew;
It might make sense (due to the interleaved layout of what you have in sftable[]) to use intrinsics here too, because both vector float arrays fResult and frac are quite probably loadable from tbl[] with a single instruction each (unpack hi/lo in SSE, unzip in NEON). The "main" table lookup isn't vectorizable without the help of something like AVX2's VGATHER instruction, but it doesn't have to take more than four loads.
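On the NEON side, that gather-and-unzip could look roughly like the sketch below (assuming arm_neon.h and the same two-float line layout; names are illustrative only):
#include <arm_neon.h>
struct Line { float val1, val2; };   // assumed layout of one sftable entry
static inline void load_lines_neon(const Line *table, const int idx[4],
                                   float32x4_t *vVal1, float32x4_t *vVal2)
{
    // one 64-bit load per line: vld1_f32 reads val1 and val2 together
    float32x2_t a = vld1_f32(&table[idx[0]].val1);
    float32x2_t b = vld1_f32(&table[idx[1]].val1);
    float32x2_t c = vld1_f32(&table[idx[2]].val1);
    float32x2_t d = vld1_f32(&table[idx[3]].val1);
    // combine into two interleaved q-registers, then unzip val1/val2
    float32x4_t ab = vcombine_f32(a, b);   // {v1_0, v2_0, v1_1, v2_1}
    float32x4_t cd = vcombine_f32(c, d);   // {v1_2, v2_2, v1_3, v2_3}
    float32x4x2_t uz = vuzpq_f32(ab, cd);  // uz.val[0] = all val1, uz.val[1] = all val2
    *vVal1 = uz.val[0];
    *vVal2 = uz.val[1];
}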
Related
Let's say I have 2 if statements:
if (frequency1_mhz > frequency2_hz * 1000) { /* some code */ }
if (frequency1_mhz / 1000 > frequency2_hz) { /* some code */ }
I'd imagine the two to function exactly the same, yet I'm guessing the first statement with the multiplication is more efficient than the division.
Would a C++ compiler optimize this? Or is this something I should take into account when designing my code?
Yes and no.
The code is not identical:
due to rounding, there can be differences in results (e.g. frequency1_mhz=1001 and frequency2_hz=1)
the first version might overflow sooner than the second one. e.g. a frequency2_hz of 1000000000 would overflow an int (and cause UB)
It's still possible to perform division using multiplication.
When unsure, just look at the generated assembly.
Here's the generated assembly for both versions. The second one is longer, but still contains no division.
version1(int, int):
imul esi, esi, 1000
xor eax, eax
cmp esi, edi
setl al
ret
version2(int, int):
movsx rax, edi
imul rax, rax, 274877907 ; look ma, no idiv!
sar edi, 31
sar rax, 38
sub eax, edi
cmp eax, esi
setg al
movzx eax, al
ret
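For reference, the magic constant works roughly like this: 274877907 is ceil(2^38 / 1000), so for non-negative 32-bit x, (x * 274877907) >> 38 equals x / 1000; the sar/sub pair in version2 then corrects the result toward zero for negative inputs. A minimal sketch of the non-negative case (illustration only):
#include <cstdint>
int32_t div_by_1000(int32_t x)   // sketch: assumes x >= 0
{
    // widen to 64 bits, multiply by the reciprocal constant, shift back down
    return (int32_t)(((int64_t)x * 274877907) >> 38);
}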
No, these are not equivalent statements, because division is not a precise inverse of multiplication for either floats or integers.
Integer division rounds positive fractions down:
constexpr int f1=999;
constexpr int f2=0;
static_assert(f1>f2*1000);
static_assert(f1/1000==f2);
Reciprocals are not precise:
static_assert(10.0!=10*(1.0/10));
If the operands are floats and the code is built with -O3, GCC will generate essentially the same assembly for both (for better or for worse).
bool first(float frequency1_mhz,float frequency2_hz) {
return frequency1_mhz > frequency2_hz * 1000;
}
bool second(float frequency1_mhz,float frequency2_hz) {
return frequency1_mhz / 1000 > frequency2_hz;
}
The assembly
first(float, float):
mulss xmm1, DWORD PTR .LC0[rip]
comiss xmm0, xmm1
seta al
ret
second(float, float):
divss xmm0, DWORD PTR .LC0[rip]
comiss xmm0, xmm1
seta al
ret
.LC0:
.long 1148846080
So, really, it ends up as essentially the same code :-)
The title may seem nonsense but let me explain. I was studying a program the other day when I encountered the following assembly code:
movaps xmm3, xmmword ptr [rbp-30h]
lea rdx, [rdi+1320h]
movaps xmm5, xmm3
movaps xmm6, xmm3
movaps xmm0, xmm3
movss dword ptr [rdx], xmm3
shufps xmm5, xmm3, 55h
shufps xmm6, xmm3, 0AAh
shufps xmm0, xmm3, 0FFh
movaps xmm4, xmm3
movss dword ptr [rdx+4], xmm5
movss dword ptr [rdx+8], xmm6
movss dword ptr [rdx+0Ch], xmm0
mulss xmm4, xmm3
and it seems like it mostly just copies four floats from [rbp-30h] to [rdx]. The shufps instructions are used just to select one of the four floats in xmm3 (e.g. shufps xmm5, xmm3, 55h selects the second float and places it in xmm5).
This makes me wonder if the compiler did so because shufps is actually faster than memory access (something like movss xmm0, dword ptr [rbp-30h] followed by movss dword ptr [rdx], xmm0).
So I wrote some tests to compare the two approaches and found shufps always slower than multiple memory accesses. Now I'm thinking maybe the use of shufps has nothing to do with performance. It might just be there to obfuscate the code so decompilers cannot produce clean output easily (I tried with IDA Pro and it was indeed overly complicated).
While I'll probably never use shufps explicitly anyway (by using _mm_shuffle_ps, for example) in any practical program, as the compiler is most likely smarter than me, I still want to know why the compiler that built this program generated such code. It's neither faster nor smaller. It makes no sense.
Anyway, I'll provide the tests I wrote below.
#include <Windows.h>
#include <xmmintrin.h> // __m128, _mm_shuffle_ps, _mm_store_ss
#include <iostream>
using namespace std;
__declspec(noinline) DWORD profile_routine(void (*routine)(void *), void *arg, int iterations = 1)
{
DWORD startTime = GetTickCount();
while (iterations--)
{
routine(arg);
}
DWORD timeElapsed = GetTickCount() - startTime;
return timeElapsed;
}
struct Struct
{
float x, y, z, w;
};
__declspec(noinline) Struct shuffle1(float *arr)
{
float x = arr[3];
float y = arr[2];
float z = arr[0];
float w = arr[1];
return {x, y, z, w};
}
#define SS0 (0x00)
#define SS1 (0x55)
#define SS2 (0xAA)
#define SS3 (0xFF)
__declspec(noinline) Struct shuffle2(float *arr)
{
Struct r;
__m128 packed = *reinterpret_cast<__m128 *>(arr);
__m128 x = _mm_shuffle_ps(packed, packed, SS3);
__m128 y = _mm_shuffle_ps(packed, packed, SS2);
__m128 z = _mm_shuffle_ps(packed, packed, SS0);
__m128 w = _mm_shuffle_ps(packed, packed, SS1);
_mm_store_ss(&r.x, x);
_mm_store_ss(&r.y, y);
_mm_store_ss(&r.z, z);
_mm_store_ss(&r.w, w);
return r;
}
void profile_shuffle_r1(void *arg)
{
float *arr = static_cast<float *>(arg);
Struct q = shuffle1(arr);
arr[0] += q.w;
arr[1] += q.z;
arr[2] += q.y;
arr[3] += q.x;
}
void profile_shuffle_r2(void *arg)
{
float *arr = static_cast<float *>(arg);
Struct q = shuffle2(arr);
arr[0] += q.w;
arr[1] += q.z;
arr[2] += q.y;
arr[3] += q.x;
}
int main(int argc, char **argv)
{
int n = argc + 3;
float arr1[4], arr2[4];
for (int i = 0; i < 4; i++)
{
arr1[i] = static_cast<float>(n + i);
arr2[i] = static_cast<float>(n + i);
}
int iterations = 20000000;
DWORD time1 = profile_routine(profile_shuffle_r1, arr1, iterations);
cout << "time1 = " << time1 << endl;
DWORD time2 = profile_routine(profile_shuffle_r2, arr2, iterations);
cout << "time2 = " << time2 << endl;
return 0;
}
In the test above, I have two shuffle methods, shuffle1 and shuffle2, that do the same thing. When compiled with MSVC -O2, they produce the following code:
shuffle1:
mov eax,dword ptr [rdx+0Ch]
mov dword ptr [rcx],eax
mov eax,dword ptr [rdx+8]
mov dword ptr [rcx+4],eax
mov eax,dword ptr [rdx]
mov dword ptr [rcx+8],eax
mov eax,dword ptr [rdx+4]
mov dword ptr [rcx+0Ch],eax
mov rax,rcx
ret
shuffle2:
movaps xmm2,xmmword ptr [rdx]
mov rax,rcx
movaps xmm0,xmm2
shufps xmm0,xmm2,0FFh
movss dword ptr [rcx],xmm0
movaps xmm0,xmm2
shufps xmm0,xmm2,0AAh
movss dword ptr [rcx+4],xmm0
movss dword ptr [rcx+8],xmm2
shufps xmm2,xmm2,55h
movss dword ptr [rcx+0Ch],xmm2
ret
shuffle1 is always at least 30% faster than shuffle2 on my machine. I did notice that shuffle2 has two more instructions and shuffle1 actually uses eax instead of xmm0, so I thought that if I added some junk arithmetic operations, the result would be different.
So I modified them as the following:
__declspec(noinline) Struct shuffle1(float *arr)
{
float x0 = arr[3];
float y0 = arr[2];
float z0 = arr[0];
float w0 = arr[1];
float x = x0 + y0 + z0;
float y = y0 + z0 + w0;
float z = z0 + w0 + x0;
float w = w0 + x0 + y0;
return {x, y, z, w};
}
#define SS0 (0x00)
#define SS1 (0x55)
#define SS2 (0xAA)
#define SS3 (0xFF)
__declspec(noinline) Struct shuffle2(float *arr)
{
Struct r;
__m128 packed = *reinterpret_cast<__m128 *>(arr);
__m128 x0 = _mm_shuffle_ps(packed, packed, SS3);
__m128 y0 = _mm_shuffle_ps(packed, packed, SS2);
__m128 z0 = _mm_shuffle_ps(packed, packed, SS0);
__m128 w0 = _mm_shuffle_ps(packed, packed, SS1);
__m128 yz = _mm_add_ss(y0, z0);
__m128 x = _mm_add_ss(x0, yz);
__m128 y = _mm_add_ss(w0, yz);
__m128 wx = _mm_add_ss(w0, x0);
__m128 z = _mm_add_ss(z0, wx);
__m128 w = _mm_add_ss(y0, wx);
_mm_store_ss(&r.x, x);
_mm_store_ss(&r.y, y);
_mm_store_ss(&r.z, z);
_mm_store_ss(&r.w, w);
return r;
}
and now the assembly looks a bit more fair as they have the same number of instructions and both need to use xmm registers.
shuffle1:
movss xmm5,dword ptr [rdx+8]
mov rax,rcx
movss xmm3,dword ptr [rdx+0Ch]
movaps xmm0,xmm5
movss xmm2,dword ptr [rdx]
addss xmm0,xmm3
movss xmm4,dword ptr [rdx+4]
movaps xmm1,xmm2
addss xmm1,xmm5
addss xmm0,xmm2
addss xmm1,xmm4
movss dword ptr [rcx],xmm0
movaps xmm0,xmm4
addss xmm0,xmm2
addss xmm4,xmm3
movss dword ptr [rcx+4],xmm1
addss xmm0,xmm3
addss xmm4,xmm5
movss dword ptr [rcx+8],xmm0
movss dword ptr [rcx+0Ch],xmm4
ret
shuffle2:
movaps xmm4,xmmword ptr [rdx]
mov rax,rcx
movaps xmm3,xmm4
movaps xmm5,xmm4
shufps xmm5,xmm4,0AAh
movaps xmm2,xmm4
shufps xmm2,xmm4,0FFh
movaps xmm0,xmm5
addss xmm0,xmm3
shufps xmm4,xmm4,55h
movaps xmm1,xmm4
addss xmm1,xmm2
addss xmm2,xmm0
addss xmm4,xmm0
addss xmm3,xmm1
addss xmm5,xmm1
movss dword ptr [rcx],xmm2
movss dword ptr [rcx+4],xmm4
movss dword ptr [rcx+8],xmm3
movss dword ptr [rcx+0Ch],xmm5
ret
but it doesn't matter. shuffle1 is still 30% faster!
Without the broader context it is hard to say for sure, but when optimizing for newer processors you have to consider the usage of the different execution ports. See Agner's instruction tables here: http://www.agner.org/optimize/instruction_tables.pdf
In this case, while it may seem unlikely, there are a few possibilities that jump out at me if we assume the assembly is, in fact, optimized.
1. This could appear in a stretch of code where the out-of-order scheduler happens to have more of port 5 available (on Haswell, for example) than ports 2 and 3 (again, using Haswell as an example).
2. Similar to #1, but the same effect might be observed with hyperthreading. This code may be intended not to steal read operations from the sibling hyperthread.
3. Lastly, something specific to this sort of optimization, and where I've used something similar: say you have a branch that is nearly 100% predictable at run time, but not at compile time. Let's imagine, hypothetically, that right after the branch there is a read that is often a cache miss. You want that read to start as soon as possible. The out-of-order scheduler will read ahead and begin executing that read if you don't use the read ports. This could make the shufps instructions essentially "free" to execute. Here's that example:
MOV ecx, [some computed, mostly constant at run-time global]
label loop:
ADD rdi, 16
ADD rbp, 16
CALL shuffle
SUB ecx, 1
JNE loop
MOV rax, [rdi]
;do a read that could be "predicted" properly
MOV rbx, [rax]
Honestly though, it just looks like poorly written assembly or poorly generated machine code, so I wouldn't put much thought into it. The example I'm giving is pretty darned unlikely.
You don't show whether the later code actually uses the results of broadcasting each element to all 4 positions of a vector (e.g. 0x55 is _MM_SHUFFLE(1,1,1,1)). If you already need that for a ...ps instruction later, then you need those shuffles anyway, so there'd be no reason to also do scalar loads.
If you don't, and the only visible side effect is the stores to memory, then this is just a hilariously bad missed optimization by either a human programmer using intrinsics or by a compiler. Just like in your examples of MSVC output for your test functions.
Keep in mind that some compilers (like ICC and MSVC) don't really optimize intrinsics, so if you write 3x _mm_shuffle_ps you get 3x shufps, so this poor decision could have been made by a human using intrinsics, not a compiler.
But Clang on the other hand does aggressively optimize shuffle intrinsics. clang optimizes both of your shuffle functions to one movaps load, one shufps (or pshufd), and one movups store. This is optimal for most CPUs, getting the work done in the fewest instructions and uops.
(gcc auto-vectorizes shuffle1 but not shuffle2. MSVC fails at everything, just using scalar for shuffle1)
(If you just need each scalar float at the bottom of an xmm register for ...ss instructions, you can use the shuffle that creates your store vector as one of them, because it has a different low element than the input. You'd movaps copy first, though, or use pshufd, to avoid destroying the reg with the original low element.)
If tuning specifically for a CPU with slow movups stores (like Intel pre-Nehalem) and the result isn't known to be aligned, then you'd still use one shufps but store the result with movlps and movhps. This is what gcc does if you compile with -mtune=core2.
You apparently know your input vector is aligned, so it still makes a huge amount of sense to load it with movaps. K8 will split a movaps into two 8-byte load uops, but most other x86-64 CPUs can do 16-byte aligned loads as a single uop. (Pentium M / Core 1 were the last mainstream Intel CPUs to split 128-bit vector ops like that, and they didn't support 64-bit mode.)
vbroadcastss requires AVX, so without AVX if you want a dword from memory broadcast into an XMM register, you have to use a shuffle instruction that needs a port 5 ALU uop. (vbroadcastss xmm0, [rsi+4] decodes to a pure load uop on Intel CPUs, no ALU uop needed, so it has 2 per clock throughput instead of 1.)
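For illustration, a hedged sketch of both forms (the function names here are made up; _mm_broadcast_ss is the AVX intrinsic for vbroadcastss):
#include <immintrin.h>
// without AVX: one load plus one shuffle uop to broadcast a float from memory
static inline __m128 bcast_sse(const float *p)
{
    __m128 v = _mm_load_ss(p);          // movss: load one float into lane 0
    return _mm_shuffle_ps(v, v, 0);     // shufps imm 0: broadcast lane 0
}
#ifdef __AVX__
// with AVX: vbroadcastss from memory is a pure load uop on Intel CPUs
static inline __m128 bcast_avx(const float *p)
{
    return _mm_broadcast_ss(p);
}
#endif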
Old CPUs like Merom and K8 have slow shuffle units that are only 64 bits wide, so shufps is pretty slow because it's a full 128-bit shuffle with granularity smaller than 64 bits. You might consider doing 2x movsd or movq loads to feed pshuflw, which is fast because it only shuffles the low 64 bits. But only if you're specifically tuning for old CPUs.
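Sketching what that 2x movq + pshuflw variant might look like with intrinsics (SSE2, reusing the question's Struct; this is only a tuning illustration for those old CPUs, not a general recommendation):
#include <emmintrin.h>
Struct shuffle_old(float *arr)
{
    Struct r;
    __m128i hi = _mm_loadl_epi64((const __m128i *)(arr + 2)); // {arr[2], arr[3]}
    __m128i lo = _mm_loadl_epi64((const __m128i *)(arr + 0)); // {arr[0], arr[1]}
    // swap the two floats in the low 64 bits: 16-bit word order 2,3,0,1
    hi = _mm_shufflelo_epi16(hi, _MM_SHUFFLE(1, 0, 3, 2));    // {arr[3], arr[2]}
    __m128i v = _mm_unpacklo_epi64(hi, lo);                   // {arr[3], arr[2], arr[0], arr[1]}
    _mm_storeu_si128((__m128i *)&r, v);
    return r;
}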
// for gcc, I used __attribute__((ms_abi)) to target the Windows x64 calling convention
Struct shuffle3(float *arr)
{
Struct r;
__m128 packed = _mm_load_ps(arr);
__m128 xyzw = _mm_shuffle_ps(packed, packed, _MM_SHUFFLE(1,0,2,3));
_mm_storeu_ps(&r.x, xyzw);
return r;
}
shuffle1 and shuffle3 both compile to identical code with gcc and clang (on the Godbolt compiler explorer), because they auto-vectorize the scalar assignments. The only difference is using a movups load for shuffle1, because nothing guarantees 16-byte alignment there. (If we promised the compiler an aligned pointer for the pure C scalar version, then it would be exactly identical.)
# MSVC compiles shuffle3 like this as well
# from gcc9.1 -O3 (default baseline x86-64, tune=generic)
shuffle3(float*):
movaps xmm0, XMMWORD PTR [rdx] # MSVC still uses movups even for _mm_load_ps
mov rax, rcx # return the retval pointer
shufps xmm0, xmm0, 75
movups XMMWORD PTR [rcx], xmm0 # store to the hidden retval pointer
ret
With -mtune=core2, gcc still auto-vectorizes shuffle1. It uses split unaligned loads because we didn't promise the compiler aligned memory.
For shuffle3, it does use movaps but still splits _mm_storeu_ps into movlps + movhps. (This is one of the interesting effects that tuning options can have. They don't let the compiler use new instructions; they just change the selection among existing ones.)
# gcc9.1 -O3 -mtune=core2 # auto-vectorizing shuffle1
shuffle1(float*):
movq xmm0, QWORD PTR [rdx]
mov rax, rcx
movhps xmm0, QWORD PTR [rdx+8]
shufps xmm0, xmm0, 75
movlps QWORD PTR [rcx], xmm0 # store in 2 halves
movhps QWORD PTR [rcx+8], xmm0
ret
MSVC doesn't have tuning options, and doesn't auto-vectorize shuffle1.
I'm having a weird error. I have one module compiled by one compiler (MSVC in this case) that calls code loaded from another module compiled by a separate compiler (TCC).
The TCC code provides a callback function that is defined identically for both modules:
typedef float( * ScaleFunc)(float value, float _min, float _max);
The MSVC code calls the code like this:
finalValue = extScale(val, _min, _max);
000007FEECAFCF52 mov rax,qword ptr [this]
000007FEECAFCF5A movss xmm2,dword ptr [rax+0D0h]
000007FEECAFCF62 mov rax,qword ptr [this]
000007FEECAFCF6A movss xmm1,dword ptr [rax+0CCh]
000007FEECAFCF72 movss xmm0,dword ptr [val]
000007FEECAFCF78 mov rax,qword ptr [this]
000007FEECAFCF80 call qword ptr [rax+0B8h]
000007FEECAFCF86 movss dword ptr [finalValue],xmm0
and the function compiled by TCC looks like this:
float linear_scale(float value, float _min, float _max)
{
return value * (_max - _min) + _min;
}
0000000000503DC4 push rbp
0000000000503DC5 mov rbp,rsp
0000000000503DC8 sub rsp,0
0000000000503DCF mov qword ptr [rbp+10h],rcx
0000000000503DD3 mov qword ptr [rbp+18h],rdx
0000000000503DD7 mov qword ptr [rbp+20h],r8
0000000000503DDB movd xmm0,dword ptr [rbp+20h]
0000000000503DE0 subss xmm0,dword ptr [rbp+18h]
0000000000503DE5 movq xmm1,xmm0
0000000000503DE9 movd xmm0,dword ptr [rbp+10h]
0000000000503DEE mulss xmm0,xmm1
0000000000503DF2 addss xmm0,dword ptr [rbp+18h]
0000000000503DF7 jmp 0000000000503DFC
0000000000503DFC leave
0000000000503DFD ret
It seems that TCC expects the arguments in the integer registers rcx, rdx and r8, while MSVC puts them in the SSE registers. I thought that x64 (on Windows) defines one common calling convention? What exactly is going on here, and how can I enforce the same model for both modules?
The same code works correctly in 32-bit mode. Weirdly enough, on OSX (where the other code is compiled by LLVM) it works in both modes (32- and 64-bit). I'll see if I can fetch some assembly from there later.
---- edit ----
I have created a working solution. It is, however, without doubt the dirtiest hack I've ever made (bar questionable inline assembly, which unfortunately isn't available on 64-bit MSVC :)).
// passes the first three floating-point arguments in rcx, rdx and r8
template<typename sseType>
sseType TCCAssemblyHelper(ScaleFunc cb, sseType val, sseType _min, sseType _max)
{
sseType xmm0(val), xmm1(_min), xmm2(_max);
long long rcx, rdx, r8;
rcx = *(long long*)&xmm0;
rdx = *(long long*)&xmm1;
r8 = *(long long*)&xmm2;
typedef float(*interMedFunc)(long long rcx, long long rdx, long long r8);
interMedFunc helperFunc = reinterpret_cast<interMedFunc>(cb);
return helperFunc(rcx, rdx, r8);
}
I'm writing a program which performs millions of modular additions. For more efficiency, I started thinking about how machine-level instructions can be used to implement modular additions.
Let w be the word size of the machine (typically, 32 or 64 bits). If one takes the modulus to be 2^w, then the modular addition can be performed very fast: It suffices to simply add the addends, and discard the carry.
I tested my idea using the following C code:
#include <stdio.h>
#include <time.h>
int main()
{
unsigned int x, y, z, i;
clock_t t1, t2;
x = y = 0x90000000;
t1 = clock();
for(i = 0; i <20000000 ; i++)
z = (x + y) % 0x100000000ULL;
t2 = clock();
printf("%x\n", z);
printf("%u\n", (int)(t2-t1));
return 0;
}
Compiling using GCC with the following options (I used -O0 to prevent GCC from optimizing the loop away):
-S -masm=intel -O0
The relevant part of the resulting assembly code is:
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add eax, edx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
As is evident, no modular arithmetic whatsoever is involved.
Now, if we change the modular addition line of the C code to:
z = (x + y) % 0x0F0000000ULL;
The assembly code changes to (only the relevant part is shown):
mov DWORD PTR [esp+36], -1879048192
mov eax, DWORD PTR [esp+36]
mov DWORD PTR [esp+32], eax
call _clock
mov DWORD PTR [esp+28], eax
mov DWORD PTR [esp+40], 0
jmp L2
L3:
mov eax, DWORD PTR [esp+36]
mov edx, DWORD PTR [esp+32]
add edx, eax
cmp edx, -268435456
setae al
movzx eax, al
mov DWORD PTR [esp+44], eax
mov ecx, DWORD PTR [esp+44]
mov eax, 0
sub eax, ecx
sal eax, 28
mov ecx, edx
sub ecx, eax
mov eax, ecx
mov DWORD PTR [esp+44], eax
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
call _clock
Obviously, a great number of instructions were added between the two calls to _clock.
Considering the increased number of assembly instructions, I expected the performance gain from a proper choice of modulus to be at least 100%. However, running the output, I noted that the speed increased by only 10%. I suspected the OS was using the multi-core CPU to run the code in parallel, but even setting the CPU affinity of the process to 1 didn't change anything.
Could you please provide me with an explanation?
Edit: Running the example with VC++ 2010, I got what I expected: the second version is around 12 times slower than the first example!
Art nailed it.
For the power-of-2 modulus, the code for the computation generated with -O0 and -O3 is identical, the difference is the loop-control code, and the running time differs by a factor of 3.
For the other modulus, the difference in the loop-control code is the same, but the code for the computation is not quite identical (the optimised code looks like it should be a bit faster, but I don't know enough about assembly or my processor to be sure). The difference in running time between unoptimised and optimised code is about 2×.
Running times for both moduli are similar with unoptimised code: about the same as the running time without any modulus, and about the same as the running time of the executable obtained by removing the computation from the generated assembly.
So the running time is completely dominated by the loop control code:
mov DWORD PTR [esp+40], 0
jmp L2
L3:
# snip
inc DWORD PTR [esp+40]
L2:
cmp DWORD PTR [esp+40], 19999999
jbe L3
With optimisations turned on, the loop counter is kept in a register and decremented, and the jump instruction is a jne. That loop control is so much faster that the modulus computation now takes a significant fraction of the running time; removing the computation from the generated assembly now reduces the running time by a factor of 3 and 2, respectively.
So when compiled with -O0, you're not measuring the speed of the computation, but the speed of the loop control code, thus the small difference. With optimisations, you are measuring both, computation and loop control, and the difference of speed in the computation shows clearly.
The difference between the two boils down to the fact that divisions by powers of 2 can be transformed easily into logic instructions.
a / n, where n is a power of two, is equivalent to a >> log2(n)
For the modulo it's the same:
a mod n can be rendered as a & (n-1)
But in your case it goes even further than that:
your value 0x100000000ULL is 2^32. This means that any unsigned 32-bit variable is automatically a modulo-2^32 value.
The compiler was smart enough to remove the operation because it is unnecessary on 32-bit variables. The ULL suffix doesn't change that fact.
For the value 0x0F0000000, which fits in a 32-bit variable, the compiler cannot elide the operation. It uses a transformation that seems faster than a division instruction.
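A minimal sketch of that strength reduction, assuming unsigned operands (the shift/mask identities do not hold as-is for negative signed values):
#include <cstdint>
uint32_t mod_examples(uint32_t a)
{
    uint32_t q = a / 16;   // the compiler emits a >> 4
    uint32_t r = a % 16;   // the compiler emits a & 15
    uint32_t s = a + a;    // reduction mod 2^32 is free: uint32_t already wraps
    return q + r + s;
}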
I am currently writing some GLSL-like vector math classes in C++, and I just implemented an abs() function like this:
template<class T>
static inline T abs(T _a)
{
return _a < 0 ? -_a : _a;
}
I compared its speed to the default C++ abs from math.h like this:
clock_t begin = clock();
for(int i=0; i<10000000; ++i)
{
float a = abs(-1.25);
};
clock_t end = clock();
unsigned long time1 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));
begin = clock();
for(int i=0; i<10000000; ++i)
{
float a = myMath::abs(-1.25);
};
end = clock();
unsigned long time2 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));
std::cout<<time1<<std::endl;
std::cout<<time2<<std::endl;
Now the default abs takes about 25 ms while mine takes 60 ms. I guess there is some low-level optimisation going on. Does anybody know how the math.h abs works internally? The performance difference is nothing dramatic, but I am just curious!
Since they are the implementation, they are free to make as many assumptions as they want. They know the format of the double and can play tricks with that instead.
Likely (as in almost not even a question), your double is the IEEE 754 binary64 format. This means the sign has its own bit, and an absolute value is merely clearing that bit. For example, as a specialization, a compiler implementer may do the following:
template <>
double abs<double>(const double x)
{
// breaks strict aliasing, but compiler writer knows this behavior for the platform
uint64_t i = reinterpret_cast<const std::uint64_t&>(x);
i &= 0x7FFFFFFFFFFFFFFFULL; // clear sign bit
return reinterpret_cast<const double&>(i);
}
This removes branching and may run faster.
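A quick way to verify the idea without the aliasing caveat in the code above is to go through memcpy (a hedged sketch, not the library's actual implementation):
#include <cstdint>
#include <cstring>
static double abs_bits(double x)
{
    std::uint64_t i;
    std::memcpy(&i, &x, sizeof i);      // reinterpret the double's bits
    i &= 0x7FFFFFFFFFFFFFFFULL;         // clear the sign bit
    std::memcpy(&x, &i, sizeof x);
    return x;                           // abs_bits(-1.25) == 1.25, abs_bits(-0.0) == 0.0
}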
There are well-known tricks for computing the absolute value of a two's-complement signed number. If the number is negative, flip all the bits and add 1; that is, xor with -1 and subtract -1. If it is positive, do nothing; that is, xor with 0 and subtract 0.
int my_abs(int x)
{
int s = x >> 31;
return (x ^ s) - s;
}
What is your compiler and settings? I'm sure MS and GCC implement "intrinsic functions" for many math and string operations.
The following line:
printf("%.3f", abs(1.25));
falls into the following "fabs" code path (in msvcr90d.dll):
004113DE sub esp,8
004113E1 fld qword ptr [__real#3ff4000000000000 (415748h)]
004113E7 fstp qword ptr [esp]
004113EA call abs (4110FFh)
abs calls the C runtime 'fabs' implementation in MSVCR90D (rather large):
102F5730 mov edi,edi
102F5732 push ebp
102F5733 mov ebp,esp
102F5735 sub esp,14h
102F5738 fldz
102F573A fstp qword ptr [result]
102F573D push 0FFFFh
102F5742 push 133Fh
102F5747 call _ctrlfp (102F6140h)
102F574C add esp,8
102F574F mov dword ptr [savedcw],eax
102F5752 movzx eax,word ptr [ebp+0Eh]
102F5756 and eax,7FF0h
102F575B cmp eax,7FF0h
102F5760 jne fabs+0D2h (102F5802h)
102F5766 sub esp,8
102F5769 fld qword ptr [x]
102F576C fstp qword ptr [esp]
102F576F call _sptype (102F9710h)
102F5774 add esp,8
102F5777 mov dword ptr [ebp-14h],eax
102F577A cmp dword ptr [ebp-14h],1
102F577E je fabs+5Eh (102F578Eh)
102F5780 cmp dword ptr [ebp-14h],2
102F5784 je fabs+77h (102F57A7h)
102F5786 cmp dword ptr [ebp-14h],3
102F578A je fabs+8Fh (102F57BFh)
102F578C jmp fabs+0A8h (102F57D8h)
102F578E push 0FFFFh
102F5793 mov ecx,dword ptr [savedcw]
102F5796 push ecx
102F5797 call _ctrlfp (102F6140h)
102F579C add esp,8
102F579F fld qword ptr [x]
102F57A2 jmp fabs+0F8h (102F5828h)
102F57A7 push 0FFFFh
102F57AC mov edx,dword ptr [savedcw]
102F57AF push edx
102F57B0 call _ctrlfp (102F6140h)
102F57B5 add esp,8
102F57B8 fld qword ptr [x]
102F57BB fchs
102F57BD jmp fabs+0F8h (102F5828h)
102F57BF mov eax,dword ptr [savedcw]
102F57C2 push eax
102F57C3 sub esp,8
102F57C6 fld qword ptr [x]
102F57C9 fstp qword ptr [esp]
102F57CC push 15h
102F57CE call _handle_qnan1 (102F98C0h)
102F57D3 add esp,10h
102F57D6 jmp fabs+0F8h (102F5828h)
102F57D8 mov ecx,dword ptr [savedcw]
102F57DB push ecx
102F57DC fld qword ptr [x]
102F57DF fadd qword ptr [__real#3ff0000000000000 (1022CF68h)]
102F57E5 sub esp,8
102F57E8 fstp qword ptr [esp]
102F57EB sub esp,8
102F57EE fld qword ptr [x]
102F57F1 fstp qword ptr [esp]
102F57F4 push 15h
102F57F6 push 8
102F57F8 call _except1 (102F99B0h)
102F57FD add esp,1Ch
102F5800 jmp fabs+0F8h (102F5828h)
102F5802 mov edx,dword ptr [ebp+0Ch]
102F5805 and edx,7FFFFFFFh
102F580B mov dword ptr [ebp-0Ch],edx
102F580E mov eax,dword ptr [x]
102F5811 mov dword ptr [result],eax
102F5814 push 0FFFFh
102F5819 mov ecx,dword ptr [savedcw]
102F581C push ecx
102F581D call _ctrlfp (102F6140h)
102F5822 add esp,8
102F5825 fld qword ptr [result]
102F5828 mov esp,ebp
102F582A pop ebp
102F582B ret
In release mode, the FPU FABS instruction is used instead (it takes only 1 clock cycle on any FPU from the Pentium onwards); the disassembly output is:
00401006 fld qword ptr [__real#3ff4000000000000 (402100h)]
0040100C sub esp,8
0040100F fabs
00401011 fstp qword ptr [esp]
00401014 push offset string "%.3f" (4020F4h)
00401019 call dword ptr [__imp__printf (4020A0h)]
It probably just uses a bitmask to set the sign bit to 0.
There can be several things:
are you sure the first call uses std::abs? It could just as well use the integer abs from C (either call std::abs explicitly, or have using std::abs;)
the compiler might have an intrinsic implementation of some float functions (e.g. compiling them directly into FPU instructions)
However, I'm surprised the compiler doesn't eliminate the loop altogether, since you don't do anything with any visible effect inside the loop, and at least in the case of abs, the compiler should know there are no side effects.
Your version of abs is inlined and can be computed once and the compiler can trivially know that the value returned isn't going to change, so it doesn't even need to call the function.
You really need to look at the generated assembly code (set a breakpoint, and open the "large" debugger view, this disassembly will be on the bottom left if memory serves), and then you can see what's going on.
You can find documentation on your processor online without too much trouble, it'll tell you what all of the instructions are so you can figure out what's happening. Alternatively, paste it here and we'll tell you. ;)
Probably the library version of abs is an intrinsic function, whose behavior is exactly known by the compiler, which can even compute the value at compile time (since in your case it's known) and optimize the call away. You should try your benchmark with a value known only at runtime (provided by the user or got with rand() before the two cycles).
If there's still a difference, it may be because the library abs is written directly in hand-forged assembly with magic tricks, so it could be a little faster than the generated one.
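A sketch of the kind of benchmark change suggested above: feed the function a value known only at run time and consume the result, so the compiler cannot fold the call away (the loop count and names here are illustrative):
#include <cmath>
#include <iostream>
int main(int argc, char **argv)
{
    float v = argc * -1.25f;            // not a compile-time constant
    float sum = 0.0f;
    for (int i = 0; i < 10000000; ++i)
        sum += std::fabs(v + i);        // the result is used, so the loop stays
    std::cout << sum << std::endl;      // print it so the work is observable
    return 0;
}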
The library abs function operates on integers while you are obviously testing floats. This means that a call to abs with a float argument involves a conversion from float to int (which may be a no-op here, since you are using a constant and the compiler may do it at compile time), then an INTEGER abs operation, and a conversion from int back to float. Your templated function operates on floats throughout, and this is probably what makes the difference.
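A small sketch of that pitfall made explicit (hedged illustration, not the library source; the cast shows what happens when the integer overload is the one selected):
#include <cstdio>
#include <cstdlib>
int main()
{
    float x = -1.25f;
    float viaIntAbs = (float)std::abs((int)x);  // -1.25f -> -1 -> 1 -> 1.0f
    std::printf("%f\n", viaIntAbs);             // prints 1.000000, not 1.250000
    return 0;
}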