I have some C++ code that is being compiled to the following assembly using MSVC compiler v14.24:
00007FF798252D4C vmulsd xmm1,xmm1,xmm7
00007FF798252D50 vcvttsd2si rcx,xmm1
00007FF798252D55 vmulsd xmm1,xmm7,mmword ptr [rbx+28h]
00007FF798252D5A mov ecx,ecx
00007FF798252D5C imul rdx,rcx,0BB8h
00007FF798252D63 vcvttsd2si rcx,xmm1
00007FF798252D68 mov ecx,ecx
00007FF798252D6A add rdx,rcx
00007FF798252D6D add rdx,rdx
00007FF798252D70 cmp byte ptr [r14+rdx*8+8],0
00007FF798252D76 je applyActionMovements+15Dh (07FF798252D8Dh)
As you can see, the compiler added two
mov ecx,ecx
instructions that don't make any sense to me, because they move data from and to the same register.
Is there something that I'm missing?
Here is a small Godbolt reproducer: https://godbolt.org/z/UFo2qe
int arr[4000][3000];

inline int foo(double a, double b) {
    return arr[static_cast<unsigned int>(a * 100)][static_cast<unsigned int>(b * 100)];
}

int bar(double a, double b) {
    if (foo(a, b)) {
        return 0;
    }
    return 1;
}
That's an inefficient way to zero-extend ECX into RCX. More efficient would be mov into a different register so mov-elimination could work.
Duplicates of:
Why did GCC generate mov %eax,%eax and what does it mean?
Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?
But your specific test-case needs zero-extension for a slightly non-obvious reason:
x86 only has conversion between FP and signed integers (until AVX512). FP -> unsigned int is efficiently possible on x86-64 by doing FP -> int64_t and then taking the low 32 bits as unsigned int.
This is what this sequence is doing:
vcvttsd2si rcx,xmm1 ; double -> int64_t, unsigned int result in ECX
mov ecx,ecx ; zero-extend to promote unsigned to ptrdiff_t for indexing
add rdx,rcx ; 64-bit integer math on the zero-extended result
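In C++ terms, what the compiler is implementing is roughly the following (a minimal sketch, not the question's code; the helper name to_unsigned_index is made up for illustration):

#include <cstdint>

inline unsigned int to_unsigned_index(double a) {
    // x86-64 has no double -> unsigned conversion instruction before AVX-512,
    // so the compiler converts to int64_t first...
    int64_t wide = static_cast<int64_t>(a * 100.0);      // vcvttsd2si rcx, xmm1
    // ...and keeps the low 32 bits as the unsigned int result.
    unsigned int low = static_cast<unsigned int>(wide);
    return low;   // using this as an array index needs zero-extension: mov ecx,ecx
}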
Related
Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
    unsigned __int128 d = (unsigned __int128)a * (unsigned __int128)b;
    return d / c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits a call to __udivti3 instead of a DIVQ instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already slow) latter, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: Let's assume the quotient fits into 64-bits register.
div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
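For illustration, here is a minimal GNU C inline-asm sketch of a 128/64-bit => 64-bit division (the name udiv128 is made up for this example, not a real library function; the caller must guarantee high < divisor, or div raises #DE):

#include <stdint.h>

static inline uint64_t udiv128(uint64_t high, uint64_t low,
                               uint64_t divisor, uint64_t *remainder)
{
    uint64_t quot, rem;
    // div takes the 128-bit dividend in RDX:RAX; quotient -> RAX, remainder -> RDX.
    __asm__("divq %[d]"
            : "=a"(quot), "=d"(rem)
            : [d] "r"(divisor), "a"(low), "d"(high));
    *remainder = rem;
    return quot;
}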
I have written the following very simple code which I am experimenting with in godbolt's compiler explorer:
#include <cstdint>

uint64_t func(uint64_t num, uint64_t den)
{
    return num / den;
}
GCC produces the following output, which I would expect:
func(unsigned long, unsigned long):
mov rax, rdi
xor edx, edx
div rsi
ret
However Clang 13.0.0 produces the following, involving shifts and a jump even:
func(unsigned long, unsigned long): # #func(unsigned long, unsigned long)
mov rax, rdi
mov rcx, rdi
or rcx, rsi
shr rcx, 32
je .LBB0_1
xor edx, edx
div rsi
ret
.LBB0_1:
xor edx, edx
div esi
ret
When using uint32_t, clang's output is once again "simple" and what I would expect.
It seems this might be some sort of optimization, since clang 10.0.1 produces the same output as GCC; however, I cannot understand what is happening. Why is clang producing this longer assembly?
The assembly seems to be checking if either num or den is larger than 2**32 by shifting right by 32 bits and then checking whether the resulting number is 0.
Depending on the decision, a 64-bit division (div rsi) or 32-bit division (div esi) is performed.
Presumably this code is generated because the compiler writers think the cost of the additional checks and potential branch is outweighed by the savings of doing a cheaper 32-bit division instead of an unnecessarily wide 64-bit one.
If I understand correctly, it just checks whether either operand is larger than 32 bits and uses a different div for the "fits in 32 bits" case and for the full 64-bit case.
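A rough source-level sketch of what clang's generated code does (clang performs this rewrite during codegen, not in your source; the sketch is only for illustration):

#include <cstdint>

uint64_t func_sketch(uint64_t num, uint64_t den)
{
    // If both operands fit in 32 bits, a 32-bit div is considerably cheaper
    // than a 64-bit div on many x86 CPUs.
    if (((num | den) >> 32) == 0)
        return uint32_t(num) / uint32_t(den);   // the div esi path
    return num / den;                           // the div rsi path
}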
I need to support dynamic libraries and static linking of object files for 32-bit platforms (x86): Win32, Linux32 and MacOS32. The problem occurs when passing FPU arguments (float and double). By default, they are passed in SSE registers, not on the stack. I am not against SSE, but I need the arguments and the result to be passed in the standard way: on the stack and through the FPU.
I tried (godbolt) setting the -mno-sse option, and this produces the desired result. But I would not want to completely abandon SSE, I would sometimes like to use intrinsics and/or use MMX/SSE optimizations.
__attribute__((stdcall))
long double test(int* num, float f, double d)
{
    *num = sizeof(long double);
    return f * d;
}
/*-target i386-windows-gnu -c -O3*/
push ebp
mov ebp, esp
and esp, -8
sub esp, 8
movss xmm0, dword ptr [ebp + 12] # xmm0 = mem[0],zero,zero,zero
mov eax, dword ptr [ebp + 8]
cvtss2sd xmm0, xmm0
mov dword ptr [eax], 12
mulsd xmm0, qword ptr [ebp + 16]
movsd qword ptr [esp], xmm0
fld qword ptr [esp]
mov esp, ebp
pop ebp
ret 16
/*-target i386-windows-gnu -mno-sse -c -O3*/
mov eax, dword ptr [esp + 4]
mov dword ptr [eax], 12
fld dword ptr [esp + 8]
fmul qword ptr [esp + 12]
ret 16
Both versions of your function are using the same calling convention.
By default, they are passed in SSE registers, not the stack.
That's not what your asm output shows, and not what happens. Notice that your first function loads its dword float arg from the stack into xmm0, then uses mulsd with the qword double arg, also from the stack. movss xmm0, dword ptr [ebp + 12] is a load that destroys the old contents of XMM0; XMM0 is not an input to this function.
Then, to return the retval in x87 st0 as per the crusty old 32-bit calling convention you're using, it uses a movsd store to the stack and an fld x87 load.
The * operator promotes the float to double to match the other operand, resulting in a double multiply, not long double. Promotion from double to long double doesn't happen until that temporary double result is returned.
It looks like clang defaults to what gcc would call -mfpmath=sse if available. This is normally good, except for small functions where the x87 return-value calling convention gets in the way. (Also note that x87 has "free" promotion from float and double to long double, as part of how fld dword and qword work.) Clang isn't checking to see how much overhead it's going to cost to use SSE math in a small function; here it would obviously have been more efficient to use x87 for one multiply.
But anyway, -mno-sse is not changing the ABI; read your asm more carefully. If it was, the generated asm would suck less!
On Windows, if you're stuck making 32-bit code at all, vectorcall should be a better way to pass/return FP vars when possible: it can use XMM registers to pass/return. Obviously any ABIs that are set in stone (like for existing libraries) need to be declared correctly so the compiler calls them / receives return values from them correctly.
What you currently have is stdcall with FP args on the stack and returned in st0.
BTW, a lot of the code in your first function is from clang aligning the stack to spill/reload the temporary double; the Windows ABI only guarantees 4-byte stack alignment. This amount of work to avoid the risk of a cache-line split is almost certainly not worth it, especially when it could have just destroyed its double d stack arg as scratch space and hoped the caller had aligned it. Optimization is enabled; the frame pointer is only set up so the function can and esp (to align the stack) without losing the old ESP.
You could use return f * (long double)d;
That compiles to identical asm to the -mno-sse version. https://godbolt.org/z/LK0s_5
SSE2 doesn't support 80-bit x87 types, so clang is forced to use fmul. It ends up not messing around at all with SSE, and then the result is where it needs it for a return value.
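Putting that together, a sketch of the adjusted function (identical to the question's code apart from the cast):

__attribute__((stdcall))
long double test(int* num, float f, double d)
{
    *num = sizeof(long double);
    // Promoting to long double before the multiply forces x87 math:
    // SSE2 has no 80-bit type, so clang uses fmul and the result is
    // already in st0 where the return value needs to be.
    return f * (long double)d;
}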
I am currently trying to improve the speed of my program.
I was wondering whether it would help to replace all if-statements of the type:
bool a=1;
int b=0;
if(a){b++;}
with this:
bool a=1;
int b=0;
b+=a;
I am unsure whether the conversion from bool to int could be a problem time-wise.
One rule of thumb when programming is to not micro-optimise.
Another rule is to write clear code.
But in this case, another rule applies. If you are writing optimised code then avoid any code that can cause branches, as a mispredicted branch can cause an unwanted CPU pipeline flush.
Bear in mind also that there are no bool and int types as such in assembler, just registers, so you will probably find that all conversions will be optimised out. Therefore
b += a;
wins for me; it's also clearer.
Compilers are allowed to assume that the underlying value of a bool is always 0 or 1 (i.e. that it hasn't been messed up), so optimizing compilers can avoid the branch.
If we look at the generated code for this artificial test
int with_if_bool(bool a, int b) {
    if (a) { b++; }
    return b;
}

int with_if_char(unsigned char a, int b) {
    if (a) { b++; }
    return b;
}

int without_if(bool a, int b) {
    b += a;
    return b;
}
clang will exploit this fact and generate the exact same branchless code that sums a and b for the bool version, and instead generate actual comparisons with zero in the unsigned char case (although it's still branchless code):
with_if_bool(bool, int): # #with_if_bool(bool, int)
lea eax, [rdi + rsi]
ret
with_if_char(unsigned char, int): # #with_if_char(unsigned char, int)
cmp dil, 1
sbb esi, -1
mov eax, esi
ret
without_if(bool, int): # #without_if(bool, int)
lea eax, [rdi + rsi]
ret
gcc will instead treat bool just as if it was an unsigned char, without exploiting its properties, generating similar code as clang's unsigned char case.
with_if_bool(bool, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
with_if_char(unsigned char, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
without_if(bool, int):
movzx edi, dil
lea eax, [rdi+rsi]
ret
Finally, Visual C++ will treat the bool and the unsigned char versions equally, just as gcc does, although with more naive codegen (it uses a conditional move instead of performing arithmetic with the flags register, which traditionally used to be less efficient, although I don't know about current machines).
a$ = 8
b$ = 16
int with_if_bool(bool,int) PROC ; with_if_bool, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_bool(bool,int) ENDP ; with_if_bool
a$ = 8
b$ = 16
int with_if_char(unsigned char,int) PROC ; with_if_char, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_char(unsigned char,int) ENDP ; with_if_char
a$ = 8
b$ = 16
int without_if(bool,int) PROC ; without_if, COMDAT
movzx eax, cl
add eax, edx
ret 0
int without_if(bool,int) ENDP ; without_if
In all cases, no branches are generated; the only difference is that, on most compilers, some more complex code is generated that depends on a cmp or a test, creating a longer dependency chain.
That being said, I would worry about this kind of micro-optimization only if you actually run your code under a profiler and the results point to this specific code (or to some tight loop that involves it); in general you should write sensible, semantically correct code and focus on using the correct algorithms/data structures. Micro-optimization comes later.
In my program, this wouldn't work as written, because a is actually the result of a comparison: the code is of the form b += (a == c)
This should be even better for the optimizer, as it doesn't even have any doubt about where the bool is coming from - it can just decide straight from the flags register. As you can see, here gcc produces quite similar code for the two cases, clang exactly the same, while VC++ as usual produces something that is more conditional-ish (a cmov) in the if case.
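For example (a minimal sketch, not code from the original post), a comparison feeding directly into the add typically compiles to cmp plus flag-based arithmetic, with no branch:

int count_if_equal(int a, int c, int b) {
    // Typically becomes cmp/sete/add (or a cmp/sbb-style trick); no branch is generated.
    b += (a == c);
    return b;
}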
How can I achieve the following with the minimum number of Intel instructions and without a branch or conditional move:
unsigned compare(unsigned x, unsigned y) {
    return (x == y) ? ~0 : 0;
}
This is on a hot code path and I need to squeeze out as much performance as possible.
GCC solves this nicely, and it knows the negation trick when compiling with -O2 and up:
unsigned compare(unsigned x, unsigned y) {
    return (x == y) ? ~0 : 0;
}

unsigned compare2(unsigned x, unsigned y) {
    return -static_cast<unsigned>(x == y);
}
compare(unsigned int, unsigned int):
xor eax, eax
cmp edi, esi
sete al
neg eax
ret
compare2(unsigned int, unsigned int):
xor eax, eax
cmp edi, esi
sete al
neg eax
ret
Visual Studio generates the following code:
compare2, COMDAT PROC
xor eax, eax
or r8d, -1 ; ffffffffH
cmp ecx, edx
cmove eax, r8d
ret 0
compare2 ENDP
compare, COMDAT PROC
xor eax, eax
cmp ecx, edx
setne al
dec eax
ret 0
compare ENDP
Here it seems the first version avoids the conditional move (note that the order of the functions was changed).
To view other compilers' solutions, try pasting the code into https://gcc.godbolt.org/ (add optimization flags).
Interestingly the first version produces shorter code on icc. Basically you have to measure actual performance with your compiler for each version and choose the best.
Also I would not be so sure a conditional register move is slower than other operations.
I assume you wrote the function just to show us the relevant part of the code, but a function like this would be an ideal candidate for inlining, potentially allowing the compiler to perform much more useful optimizations that involve the code where this is actually used. This may allow the compiler/CPU to parallelize this computation with other code, or merge some operations.
So, assuming this is indeed a function in your code, write it with the inline keyword and put it in a header.
return -int(x==y) is pretty terse C++. It's of course still up to the compiler to turn that into efficient assembly.
Works because int(true)==1 and unsigned (-1)==~0U.
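As a sketch (the function name compare3 is just illustrative), the terse form looks like this, and compilers typically turn it into the same cmp/sete/neg sequence shown above:

unsigned compare3(unsigned x, unsigned y) {
    // bool -> int yields exactly 0 or 1; negating and converting to
    // unsigned turns 1 into 0xFFFFFFFF (~0U) and leaves 0 as 0.
    return -int(x == y);
}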