Calling convention mismatch for x64 floating point functions - c++

I'm having a weird error. I have one module compiled by one compiler (MSVC in this case) that calls code loaded from another module compiled by a separate compiler (TCC).
The TCC code provides a callback function that is defined like this in both modules:
typedef float( * ScaleFunc)(float value, float _min, float _max);
The MSVC code calls it like this:
finalValue = extScale(val, _min, _max);
000007FEECAFCF52 mov rax,qword ptr [this]
000007FEECAFCF5A movss xmm2,dword ptr [rax+0D0h]
000007FEECAFCF62 mov rax,qword ptr [this]
000007FEECAFCF6A movss xmm1,dword ptr [rax+0CCh]
000007FEECAFCF72 movss xmm0,dword ptr [val]
000007FEECAFCF78 mov rax,qword ptr [this]
000007FEECAFCF80 call qword ptr [rax+0B8h]
000007FEECAFCF86 movss dword ptr [finalValue],xmm0
and the function compiled by TCC looks like this:
float linear_scale(float value, float _min, float _max)
{
return value * (_max - _min) + _min;
}
0000000000503DC4 push rbp
0000000000503DC5 mov rbp,rsp
0000000000503DC8 sub rsp,0
0000000000503DCF mov qword ptr [rbp+10h],rcx
0000000000503DD3 mov qword ptr [rbp+18h],rdx
0000000000503DD7 mov qword ptr [rbp+20h],r8
0000000000503DDB movd xmm0,dword ptr [rbp+20h]
0000000000503DE0 subss xmm0,dword ptr [rbp+18h]
0000000000503DE5 movq xmm1,xmm0
0000000000503DE9 movd xmm0,dword ptr [rbp+10h]
0000000000503DEE mulss xmm0,xmm1
0000000000503DF2 addss xmm0,dword ptr [rbp+18h]
0000000000503DF7 jmp 0000000000503DFC
0000000000503DFC leave
0000000000503DFD ret
It seems that TCC expects the arguments in the integer registers rcx, rdx and r8, while MSVC puts them in the SSE registers xmm0-xmm2. I thought that x64 (on Windows) defines one common calling convention? What exactly is going on here, and how can I enforce the same model on both sides?
The same code works correctly in 32-bit mode. Weirdly enough, on OSX (where the other code is compiled by LLVM) it works in both modes (32- and 64-bit). I'll see if I can fetch some assembly from there later.
---- edit ----
I have created a working solution. It is, however, without doubt the dirtiest hack I've ever made (barring questionable inline assembly, which unfortunately isn't available on 64-bit MSVC :)).
// passes the first three floating-point arguments in rcx, rdx and r8
template<typename sseType>
sseType TCCAssemblyHelper(ScaleFunc cb, sseType val, sseType _min, sseType _max)
{
    // Copy the floating-point bit patterns into 64-bit integers so that they are
    // passed in the integer argument registers (rcx, rdx, r8) that the TCC code
    // reads from. memcpy (from <cstring>) avoids reading past the 4-byte floats.
    long long rcx = 0, rdx = 0, r8 = 0;
    std::memcpy(&rcx, &val, sizeof(sseType));
    std::memcpy(&rdx, &_min, sizeof(sseType));
    std::memcpy(&r8, &_max, sizeof(sseType));

    typedef float(*interMedFunc)(long long rcx, long long rdx, long long r8);
    interMedFunc helperFunc = reinterpret_cast<interMedFunc>(cb);
    return helperFunc(rcx, rdx, r8);
}
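For completeness, a minimal sketch of how the helper might be wired into the call site shown earlier (assuming extScale still holds the TCC-provided callback):

    // Hypothetical call site: route the TCC callback through the helper so the
    // float arguments travel in rcx/rdx/r8, where the TCC-compiled code reads them.
    finalValue = TCCAssemblyHelper<float>(extScale, val, _min, _max);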

Related

Finding max number between two, which implementation to choose

I am trying to figure out which implementation has an edge over the other when finding the max of two numbers. As an example, let's examine two implementations:
Implementation 1:
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov eax, DWORD PTR [rbp-4]
cmp eax, DWORD PTR [rbp-8]
jle .L2
mov eax, DWORD PTR [rbp-4]
jmp .L4
.L2:
mov eax, DWORD PTR [rbp-8]
.L4:
pop rbp
ret
Implementation 2:
int findMax(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
// Assembly output: (gcc 11.1)
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
sub eax, DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
shr eax, 31
mov DWORD PTR [rbp-8], eax
mov eax, DWORD PTR [rbp-8]
imul eax, DWORD PTR [rbp-4]
mov edx, eax
mov eax, DWORD PTR [rbp-20]
sub eax, edx
mov DWORD PTR [rbp-12], eax
mov eax, DWORD PTR [rbp-12]
pop rbp
ret
The second one produces more assembly instructions, but the first one has a conditional jump. I am just trying to understand whether both are equally good.
First you need to turn on compiler optimizations (I used -O2 for the following). And you should compare to std::max. Then this:
#include <algorithm>
int findMax (int a, int b)
{
return (a > b) ? a : b;
}
int findMax2(int a, int b)
{
int diff, s, max;
diff = a - b;
s = (diff >> 31) & 1;
max = a - (s * diff);
return max;
}
int findMax3(int a,int b){
return std::max(a,b);
}
results in:
findMax(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
findMax2(int, int):
mov ecx, edi
mov eax, edi
sub ecx, esi
mov edx, ecx
shr edx, 31
imul edx, ecx
sub eax, edx
ret
findMax3(int, int):
cmp edi, esi
mov eax, esi
cmovge eax, edi
ret
Your first version results in assembly identical to std::max, while your second variant does more work. Actually, when trying to optimize, you need to specify what you are optimizing for. There are several options that typically require a trade-off: runtime, memory usage, size of the executable, readability of the code, etc. Typically you cannot get it all at once.
When in doubt, do not reinvent the wheel; use the existing, already optimized std::max. And do not forget that the code you write is not instructions for your CPU; rather, it is a high-level, abstract description of what the program should do. It's the compiler's job to figure out how that can best be achieved.
Last but not least, your second variant is actually broken. See the example here, compiled with -O2 -fsanitize=signed-integer-overflow; it results in:
/app/example.cpp:13:10: runtime error: signed integer overflow: -2147483648 - 2147483647 cannot be represented in type 'int'
You should favor correctness over speed. The fastest code is not worth a thing when it is wrong. And because of that, readability is next on the list. Code that is difficult to read and understand is also difficult to prove correct. I was only able to spot the problem in your code with the help of the compiler, while std::max(a,b) is unlikely to cause undefined behavior (and even if it does, at least it isn't your fault ;).
For two ints, you can compute max(a, b) without branching using a technique you probably learnt at school:
a ^ ((a ^ b) & -(a < b));
But no sane person would write this in their code. Always use std::max and trust the compiler to pick the best way. You may well find it adopts the above for int arguments with optimisations set appropriately. Although I conjecture that a compare and jump is probably the best way on the whole, even at the expense of a pipeline flush.
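For illustration, a minimal, self-contained sketch of that trick (note that, unlike the subtraction in findMax2 above, it involves no signed overflow):

    #include <algorithm>
    #include <cassert>

    // Branchless max via masking: -(a < b) is either 0 (all bits clear) or -1
    // (all bits set), so the XOR term is applied only when b is the larger value.
    int branchless_max(int a, int b)
    {
        return a ^ ((a ^ b) & -(a < b));
    }

    int main()
    {
        assert(branchless_max(3, 7) == std::max(3, 7));
        assert(branchless_max(-9, -5) == std::max(-9, -5));
    }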
Using std::max gives the compiler the best optimisation hint.
Implementation 1 performs well on a CISC CPU like a modern x64 AMD/Intel CPU.
Implementation 2 performs well on a RISC-style GPU, like those from NVIDIA or AMD Graphics.
The term "performs well" is only significant in a tight loop.

Differences in custom and std fetch_add on floats

This is an attempt at implementing fetch_add on floats without C++20.
#include <cstdint>

void fetch_add(volatile float* x, float y)
{
    bool success = false;
    auto xi = (volatile std::int32_t*)x;
    while (!success)
    {
        // Type-pun the 32-bit load as a float, add y, and write the bits back
        // only if nobody else modified the value in the meantime.
        union {
            std::int32_t sumint;
            float sum;
        };
        auto tmp = __atomic_load_n(xi, __ATOMIC_RELAXED);
        sumint = tmp;
        sum += y;
        success = __atomic_compare_exchange_n(xi, &tmp, sumint, true,
                                              __ATOMIC_RELAXED, __ATOMIC_RELAXED);
    }
}
To my great confusion, when I compare its assembly with that of the std::atomic<float> version (gcc 10.1, -O2 -std=c++2a, x86-64), they differ.
fetch_add(float volatile*, float):
.L2:
mov eax, DWORD PTR [rdi]
movd xmm1, eax
addss xmm1, xmm0
movd edx, xmm1
lock cmpxchg DWORD PTR [rdi], edx
jne .L2
ret
fetch_add_std(std::atomic<float>&, float):
mov eax, DWORD PTR [rdi]
movaps xmm1, xmm0
movd xmm0, eax
mov DWORD PTR [rsp-4], eax
addss xmm0, xmm1
.L9:
mov eax, DWORD PTR [rsp-4]
movd edx, xmm0
lock cmpxchg DWORD PTR [rdi], edx
je .L6
mov DWORD PTR [rsp-4], eax
movss xmm0, DWORD PTR [rsp-4]
addss xmm0, xmm1
jmp .L9
.L6:
ret
My ability to read assembly is near non-existent, but the custom version looks correct to me, which implies it is either incorrect or inefficient, or somehow the standard library is rather broken. I don't quite believe the third case, which leads me to ask: is the custom version incorrect or inefficient?
After some comments, a second version without reloading after the cmpxchg was written. They still differ.
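The std-based counterpart that the assembly labels fetch_add_std is not quoted above; a plausible pre-C++20 equivalent, written as a sketch under that assumption, would use compare_exchange_weak:

    #include <atomic>

    // Possible shape of the std::atomic-based version (assumed, not taken from the
    // question): compare_exchange_weak writes the freshly observed value back into
    // 'expected' on failure, so no separate reload is needed inside the loop.
    float fetch_add_std(std::atomic<float>& x, float y)
    {
        float expected = x.load(std::memory_order_relaxed);
        while (!x.compare_exchange_weak(expected, expected + y,
                                        std::memory_order_relaxed,
                                        std::memory_order_relaxed))
        {
            // 'expected' now holds the current value; retry with it.
        }
        return expected;
    }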

Getting GCC/Clang to use CMOV

I have a simple tagged union of values. The values can either be int64_ts or doubles. I am performing addition on these unions, with the caveat that if both arguments represent int64_t values then the result should also have an int64_t value.
Here is the code:
#include<stdint.h>
union Value {
int64_t a;
double b;
};
enum Type { DOUBLE, LONG };
// Value + type.
struct TaggedValue {
Type type;
Value value;
};
void add(const TaggedValue& arg1, const TaggedValue& arg2, TaggedValue* out) {
const Type type1 = arg1.type;
const Type type2 = arg2.type;
// If both args are longs then write a long to the output.
if (type1 == LONG && type2 == LONG) {
out->value.a = arg1.value.a + arg2.value.a;
out->type = LONG;
} else {
// Convert argument to a double and add it.
double op1 = type1 == LONG ? (double)arg1.value.a : arg1.value.b; // Why isn't CMOV used?
double op2 = type2 == LONG ? (double)arg2.value.a : arg2.value.b; // Why isn't CMOV used?
out->value.b = op1 + op2;
out->type = DOUBLE;
}
}
The output of gcc at -O2 is here: http://goo.gl/uTve18
Attached here in case the link doesn't work.
add(TaggedValue const&, TaggedValue const&, TaggedValue*):
cmp DWORD PTR [rdi], 1
sete al
cmp DWORD PTR [rsi], 1
sete cl
je .L17
test al, al
jne .L18
.L4:
test cl, cl
movsd xmm1, QWORD PTR [rdi+8]
jne .L19
.L6:
movsd xmm0, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 0
addsd xmm0, xmm1
movsd QWORD PTR [rdx+8], xmm0
ret
.L17:
test al, al
je .L4
mov rax, QWORD PTR [rdi+8]
add rax, QWORD PTR [rsi+8]
mov DWORD PTR [rdx], 1
mov QWORD PTR [rdx+8], rax
ret
.L18:
cvtsi2sd xmm1, QWORD PTR [rdi+8]
jmp .L6
.L19:
cvtsi2sd xmm0, QWORD PTR [rsi+8]
addsd xmm0, xmm1
mov DWORD PTR [rdx], 0
movsd QWORD PTR [rdx+8], xmm0
ret
It produced code with a lot of branches. I know that the input data is pretty random, i.e. it has a random combination of int64_ts and doubles. I'd like to have at least the conversion to a double done with an equivalent of a CMOV instruction. Is there any way I can coax gcc to produce that code? I'd ideally like to run some benchmark on real data to see how the code with a lot of branches does vs. one with fewer branches but more expensive CMOV instructions. It might turn out that the code generated by default by GCC works better, but I'd like to confirm that. I could inline the assembly myself, but I'd prefer not to.
The interactive compiler link is a good way to check the assembly. Any suggestions?
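One idea worth trying (a sketch only, not a guaranteed fix: code generation remains the compiler's choice) is to evaluate both alternatives unconditionally, so the selection becomes a pure data dependency instead of control flow:

    #include <cstdint>

    // Hypothetical helper: always perform both the int64 -> double conversion and
    // the plain load, then select. With the branch removed from the data path, the
    // compiler is more likely to emit a cmov/blend for the ternary.
    static double asDouble(const TaggedValue& v) {
        double fromLong = static_cast<double>(v.value.a); // always converted
        double fromDouble = v.value.b;                    // always loaded
        return v.type == LONG ? fromLong : fromDouble;
    }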

Bug in type conversions! Both in MSVC and GCC

The bug occurs when converting from (unsigned long long) to (double). The error is very small and happens only for some numbers, but it is there. The error is the same in MSVC and GCC.
Check yourself:
#include <stdio.h>
int main( void )
{
long long a = 834146832894220100LL;
unsigned long long b = 834146832894220100ULL;
printf( "%.16f\n%.16f\n",
(( double )a / 1e+18),
(( double )b / 1e+18)
);
getchar( );
return 0;
}
The result is:
0.8341468328942201
0.8341468328942202
This is asm code from MSVC:
// long long a = 834146832894220100LL;
mov dword ptr [a],12D5744h
mov dword ptr [ebp-20h],0B937C20h
// unsigned long long b = 834146832894220100ULL;
mov dword ptr [b],12D5744h
mov dword ptr [ebp-30h],0B937C20h
// This is unsigned to double conversion:
mov eax,dword ptr [b]
mov ecx,dword ptr [ebp-30h]
mov dword ptr [ebp-100h],eax
mov dword ptr [ebp-0FCh],ecx
mov edx,dword ptr [ebp-0FCh]
mov dword ptr [ebp-104h],edx
and dword ptr [ebp-0FCh],7FFFFFFFh
fild qword ptr [ebp-100h]
and dword ptr [ebp-104h],80000000h
mov dword ptr [ebp-108h],0
fild qword ptr [ebp-108h]
fchs
faddp st(1),st
fdiv qword ptr [__real#43abc16d674ec800 (0C07870h)]
mov esi,esp
sub esp,8
fstp qword ptr [esp]
// This is signed to double conversion:
fild qword ptr [a]
fdiv qword ptr [__real#43abc16d674ec800 (0C07870h)]
sub esp,8
fstp qword ptr [esp]
What is your opinion?
I disassembled the code produced by gcc 4.9.1 on i386. It produced the exact same code for each. I'd guess that your compiler is doing constant evaluation differently internally depending on the type. You could try different optimization levels and/or change the order of signed vs unsigned around.
It could also be a bug in the printf code that has to do with setting rounding modes differently for the first/rest of the conversions to text.
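For reference, 834146832894220100 cannot be stored exactly in a double in the first place (it needs about 58 significant bits, more than a double's 53-bit significand), so some rounding has to happen; which neighbouring value you end up with can depend on the exact instruction sequence, like the two x87 paths above. A quick sketch to see the inexactness:

    #include <cstdio>

    int main(void)
    {
        unsigned long long b = 834146832894220100ULL;
        double d = (double)b;                    // must round: the value does not fit in 53 bits
        printf("original : %llu\n", b);
        printf("as double: %.0f\n", d);
        printf("back     : %llu\n", (unsigned long long)d); // round trip differs from the original
        return 0;
    }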

What is different about C++ math.h abs() compared to my abs()

I am currently writing some GLSL-like vector math classes in C++, and I just implemented an abs() function like this:
template<class T>
static inline T abs(T _a)
{
return _a < 0 ? -_a : _a;
}
I compared its speed to the default C++ abs from math.h like this:
clock_t begin = clock();
for(int i=0; i<10000000; ++i)
{
float a = abs(-1.25);
};
clock_t end = clock();
unsigned long time1 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));
begin = clock();
for(int i=0; i<10000000; ++i)
{
float a = myMath::abs(-1.25);
};
end = clock();
unsigned long time2 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));
std::cout<<time1<<std::endl;
std::cout<<time2<<std::endl;
Now the default abs takes about 25ms while mine takes 60. I guess there is some low level optimisation going on. Does anybody know how math.h abs works internally? The performance difference is nothing dramatic, but I am just curious!
Since they are the implementation, they are free to make as many assumptions as they want. They know the format of the double and can play tricks with that instead.
Likely (as in almost not even a question), your double is the binary64 format. This means the sign has its own bit, and the absolute value is merely a matter of clearing that bit. For example, as a specialization, a compiler implementer may do the following:
#include <cstdint>

template <>
double abs<double>(const double x)
{
    // breaks strict aliasing, but the compiler writer knows this behavior for the platform
    std::uint64_t i = reinterpret_cast<const std::uint64_t&>(x);
    i &= 0x7FFFFFFFFFFFFFFFULL; // clear the sign bit
    return reinterpret_cast<const double&>(i);
}
This removes branching and may run faster.
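For code outside the library, a strict-aliasing-safe variant of the same bit trick can be written with memcpy, which compilers typically lower to plain register moves (a sketch, not the actual library implementation):

    #include <cstdint>
    #include <cstring>

    // Copy the bits out, clear the sign bit of the binary64 value, copy them back.
    inline double abs_bits(double x)
    {
        std::uint64_t i;
        std::memcpy(&i, &x, sizeof i);
        i &= 0x7FFFFFFFFFFFFFFFULL;   // clear the sign bit
        std::memcpy(&x, &i, sizeof x);
        return x;
    }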
There are well-known tricks for computing the absolute value of a two's complement signed number. If the number is negative, flip all the bits and add 1, that is, xor with -1 and subtract -1. If it is positive, do nothing, that is, xor with 0 and subtract 0.
int my_abs(int x)
{
int s = x >> 31;
return (x ^ s) - s;
}
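A quick sanity check of that helper (as a sketch; note that, just like std::abs, it has undefined behaviour for INT_MIN, whose absolute value does not fit in an int):

    #include <cassert>
    #include <cstdlib>

    int main()
    {
        // my_abs from above, checked against the library routine for a few values.
        assert(my_abs(5) == std::abs(5));
        assert(my_abs(-5) == std::abs(-5));
        assert(my_abs(0) == 0);
    }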
What is your compiler and settings? I'm sure MS and GCC implement "intrinsic functions" for many math and string operations.
The following line:
printf("%.3f", abs(1.25));
falls into the following "fabs" code path (in msvcr90d.dll):
004113DE sub esp,8
004113E1 fld qword ptr [__real#3ff4000000000000 (415748h)]
004113E7 fstp qword ptr [esp]
004113EA call abs (4110FFh)
abs calls the C runtime 'fabs' implementation in MSVCR90D (rather large):
102F5730 mov edi,edi
102F5732 push ebp
102F5733 mov ebp,esp
102F5735 sub esp,14h
102F5738 fldz
102F573A fstp qword ptr [result]
102F573D push 0FFFFh
102F5742 push 133Fh
102F5747 call _ctrlfp (102F6140h)
102F574C add esp,8
102F574F mov dword ptr [savedcw],eax
102F5752 movzx eax,word ptr [ebp+0Eh]
102F5756 and eax,7FF0h
102F575B cmp eax,7FF0h
102F5760 jne fabs+0D2h (102F5802h)
102F5766 sub esp,8
102F5769 fld qword ptr [x]
102F576C fstp qword ptr [esp]
102F576F call _sptype (102F9710h)
102F5774 add esp,8
102F5777 mov dword ptr [ebp-14h],eax
102F577A cmp dword ptr [ebp-14h],1
102F577E je fabs+5Eh (102F578Eh)
102F5780 cmp dword ptr [ebp-14h],2
102F5784 je fabs+77h (102F57A7h)
102F5786 cmp dword ptr [ebp-14h],3
102F578A je fabs+8Fh (102F57BFh)
102F578C jmp fabs+0A8h (102F57D8h)
102F578E push 0FFFFh
102F5793 mov ecx,dword ptr [savedcw]
102F5796 push ecx
102F5797 call _ctrlfp (102F6140h)
102F579C add esp,8
102F579F fld qword ptr [x]
102F57A2 jmp fabs+0F8h (102F5828h)
102F57A7 push 0FFFFh
102F57AC mov edx,dword ptr [savedcw]
102F57AF push edx
102F57B0 call _ctrlfp (102F6140h)
102F57B5 add esp,8
102F57B8 fld qword ptr [x]
102F57BB fchs
102F57BD jmp fabs+0F8h (102F5828h)
102F57BF mov eax,dword ptr [savedcw]
102F57C2 push eax
102F57C3 sub esp,8
102F57C6 fld qword ptr [x]
102F57C9 fstp qword ptr [esp]
102F57CC push 15h
102F57CE call _handle_qnan1 (102F98C0h)
102F57D3 add esp,10h
102F57D6 jmp fabs+0F8h (102F5828h)
102F57D8 mov ecx,dword ptr [savedcw]
102F57DB push ecx
102F57DC fld qword ptr [x]
102F57DF fadd qword ptr [__real#3ff0000000000000 (1022CF68h)]
102F57E5 sub esp,8
102F57E8 fstp qword ptr [esp]
102F57EB sub esp,8
102F57EE fld qword ptr [x]
102F57F1 fstp qword ptr [esp]
102F57F4 push 15h
102F57F6 push 8
102F57F8 call _except1 (102F99B0h)
102F57FD add esp,1Ch
102F5800 jmp fabs+0F8h (102F5828h)
102F5802 mov edx,dword ptr [ebp+0Ch]
102F5805 and edx,7FFFFFFFh
102F580B mov dword ptr [ebp-0Ch],edx
102F580E mov eax,dword ptr [x]
102F5811 mov dword ptr [result],eax
102F5814 push 0FFFFh
102F5819 mov ecx,dword ptr [savedcw]
102F581C push ecx
102F581D call _ctrlfp (102F6140h)
102F5822 add esp,8
102F5825 fld qword ptr [result]
102F5828 mov esp,ebp
102F582A pop ebp
102F582B ret
In release mode, the FPU FABS instruction is used instead (it takes only 1 clock cycle on Pentium-class and later FPUs); the disassembly output is:
00401006 fld qword ptr [__real#3ff4000000000000 (402100h)]
0040100C sub esp,8
0040100F fabs
00401011 fstp qword ptr [esp]
00401014 push offset string "%.3f" (4020F4h)
00401019 call dword ptr [__imp__printf (4020A0h)]
It probably just uses a bitmask to set the sign bit to 0.
There can be several things:
are you sure the first call uses std::abs? It could just as well use the integer abs from C (either call std::abs explicitly, or have using std::abs;)
the compiler might have intrinsic implementation of some float functions (eg. compile them directly into FPU instructions)
However, I'm surprised the compiler doesn't eliminate the loop altogether - since you don't do anything with any effect inside the loop, and at least in case of abs, the compiler should know there are no side-effects.
Your version of abs is inlined and can be computed once, and the compiler can trivially see that the returned value isn't going to change, so it doesn't even need to call the function.
You really need to look at the generated assembly code (set a breakpoint, and open the "large" debugger view, this disassembly will be on the bottom left if memory serves), and then you can see what's going on.
You can find documentation on your processor online without too much trouble, it'll tell you what all of the instructions are so you can figure out what's happening. Alternatively, paste it here and we'll tell you. ;)
Probably the library version of abs is an intrinsic function, whose behavior is exactly known by the compiler, which can even compute the value at compile time (since in your case it's known) and optimize the call away. You should try your benchmark with a value known only at runtime (provided by the user or obtained with rand() before the two loops).
If there's still a difference, it may be because the library abs is written directly in hand-forged assembly with magic tricks, so it could be a little faster than the generated one.
The library abs function operates on integers while you are obviously testing floats. This means that a call to abs with a float argument involves a conversion from float to int (which may be a no-op, as you are using a constant and the compiler may do it at compile time), then an INTEGER abs operation, and a conversion from int back to float. Your templated function will involve operations on floats, and this is probably what makes the difference.
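If the goal is to benchmark the floating-point routine specifically, calling the overloads explicitly removes that ambiguity (a small sketch):

    #include <cmath>
    #include <cstdlib>

    // Picking the overloads explicitly avoids any hidden float <-> int conversion.
    void benchmark_candidates()
    {
        float f = std::fabs(-1.25f); // floating-point overload from <cmath>
        int i = std::abs(-1);        // integer overload from <cstdlib>
        (void)f; (void)i;
    }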