Too large (overaligned?) stack frame with GCC but not with Clang - c++

Consider this simple code:
class X {
int i_;
public:
X();
};
void f() {
X x;
}
The stack frame of f is 32 bytes long with GCC (8 bytes for the return address plus the 24 subtracted from rsp), which seems unnecessarily large. The return address and x need only 12 bytes, and the Linux/x86_64 ABI requires no more than 16-byte alignment. With Clang, only 16 bytes are allocated. Why does GCC require so much stack space?
GCC assembly:
f():
sub rsp, 24
lea rdi, [rsp+12]
call X::X()
add rsp, 24
ret
Clang assembly:
f():
push rax
mov rdi, rsp
call X::X()
pop rax
ret
Both with -O2. Live demo: https://godbolt.org/z/bcrWW36on

Fascinating rabbit hole, I've changed my analysis three times already.
It seems that this is indeed a missed optimization. While playing around a bit, I found another missed optimization, this time in clang:
If you actually use the x object, Clang caches the address of x in rbx instead of recomputing it. Because rbx is callee-saved, it has to be saved across the function, which extends the used space in the stack frame by 8 bytes (from 12 to 20) and bumps the aligned stack frame to 32 bytes, the same as gcc.
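The frame-size arithmetic can be checked with a small sketch. The helper align_up and the constant names are illustrative assumptions, not part of either compiler; the sizes assume x86-64 Linux with a 4-byte int.

```cpp
#include <cstddef>

// Round `size` up to the next multiple of `alignment` (a power of two).
constexpr std::size_t align_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// Illustrative sizes, assumed for x86-64 Linux.
constexpr std::size_t kReturnAddress = 8;  // pushed by the call instruction
constexpr std::size_t kObjectX       = 4;  // X holds a single int
constexpr std::size_t kSavedRbx      = 8;  // callee-saved register spill

// Return address + x: 12 bytes, rounded up to 16.
static_assert(align_up(kReturnAddress + kObjectX, 16) == 16,
              "minimal frame fits in 16 bytes");
// Saving rbx as well: 20 bytes, rounded up to 32.
static_assert(align_up(kReturnAddress + kObjectX + kSavedRbx, 16) == 32,
              "frame with saved rbx needs 32 bytes");
```

If the static_asserts compile, the 12 to 16 and 20 to 32 figures are at least consistent with 16-byte alignment.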
From a debugging perspective, I'd prefer clang to use sub rsp, 8 instead of push rax to allocate the memory for x, so the memory isn't marked as initialized in valgrind.
GCC assembly:
f():
sub rsp, 24
lea rdi, [rsp+12]
call X::X() [complete object constructor]
lea rdi, [rsp+12]
call g(X&)
add rsp, 24
ret
Clang assembly:
f():
push rbx
sub rsp, 16
lea rbx, [rsp + 8]
mov rdi, rbx
call X::X() [complete object constructor]
mov rdi, rbx
call g(X&)
add rsp, 16
pop rbx
ret
I've checked whether gcc maybe uses 32-byte stack alignment by using a 32-byte vector as a data member. Both gcc and clang generate code to align the stack pointer here, and use the base pointer to implement the variable-length stack frame. I have no idea why Clang allocates 64 bytes for the object here, though.
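The source for this variant is not shown, so the following is only a guess at what such a test might look like. The class name X32 and the alignas(32) byte array are assumptions standing in for the 32-byte vector member.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a class with a 32-byte vector member.
class X32 {
    alignas(32) unsigned char v_[32];
public:
    X32() : v_{} {}
    // True if this object sits on a 32-byte boundary.
    bool is_aligned() const {
        return reinterpret_cast<std::uintptr_t>(this) % 32 == 0;
    }
};

static_assert(alignof(X32) == 32, "member forces 32-byte alignment");
static_assert(sizeof(X32) == 32, "object occupies exactly 32 bytes");

// The ABI only guarantees 16-byte stack alignment at a call, so a local
// X32 requires the function to realign rsp (the `and rsp, -32` pattern).
bool local_is_aligned() {
    X32 x;
    return x.is_aligned();
}
```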
GCC assembly:
f():
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 32
mov rdi, rsp
call X::X() [complete object constructor]
leave
ret
Clang assembly:
f(): # #f()
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 64
mov rdi, rsp
call X::X() [complete object constructor]
mov rsp, rbp
pop rbp
ret
Without actually measuring performance, it is hard to tell which is better -- -O2 will optimize for runtime, not stack frame size, so there could be good reasons for all of these choices.

Related

Implementation of Stackful Coroutine in C++

I am trying to implement a stackful coroutine in C++17 on Windows x64, but I have run into a problem: I can't throw an exception inside my coroutine. If I do, the program terminates immediately with a bad exit code.
Implementation
At the beginning, I allocate a stack for a new coroutine. The code looks something like this:
#include <cstddef>
#include <new>
void* Allocate() {
static constexpr std::size_t kStackSize{524'288};
auto new_stack{::operator new(kStackSize)};
// The stack grows downwards, so return the address one past the end.
return static_cast<std::byte *>(new_stack) + kStackSize;
}
The next step is to set up a trampoline function on the newly allocated stack. The code is written in MASM, since I use MSVC (I would like to use GCC and NASM, but I have a problem with thread_local variables there; see question, if you are interested):
SetTrampoline PROC
mov rax, rsp ; saves the current stack pointer
mov rsp, [rcx] ; sets the new stack pointer
sub rsp, 20h ; shadow space
push rdx ; saves the function pointer
; place for nonvolatile registers
sub rsp, 0e0h
mov [rcx], rsp ; saves the moved stack pointer
mov rsp, rax ; returns the initial stack pointer
ret
SetTrampoline ENDP
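As a sanity check on the constants in SetTrampoline, the sketch below recomputes the reserved sizes from the register list that SwitchContext saves. The constant names are illustrative assumptions. It only verifies the arithmetic and says nothing about whether the layout is sufficient for exception handling.

```cpp
#include <cstddef>

// Nonvolatile registers saved by SwitchContext (Windows x64 convention).
constexpr std::size_t kGprCount = 8;   // rbx, rbp, rdi, rsi, r12-r15
constexpr std::size_t kGprSize  = 8;   // bytes per general-purpose register
constexpr std::size_t kXmmCount = 10;  // xmm6-xmm15
constexpr std::size_t kXmmSize  = 16;  // bytes per xmm register

// Space reserved by `sub rsp, 0e0h`.
constexpr std::size_t kSaveArea =
    kGprCount * kGprSize + kXmmCount * kXmmSize;
static_assert(kSaveArea == 0xe0, "64 + 160 bytes matches 0e0h");

// Everything SetTrampoline places below the initial stack top.
constexpr std::size_t kShadowSpace     = 0x20;  // `sub rsp, 20h`
constexpr std::size_t kFunctionPointer = 8;     // `push rdx`
constexpr std::size_t kInitialFrame =
    kShadowSpace + kFunctionPointer + kSaveArea;
static_assert(kInitialFrame == 0x108, "264 bytes in total");
```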
Then I switch the machine context with this assembly function (I read this calling convention):
SwitchContext PROC
; saves all nonvolatile registers to the caller stack
push rbx
push rbp
push rdi
push rsi
push r12
push r13
push r14
push r15
sub rsp, 10h
movdqu [rsp], xmm6
; ... pushes xmm7 - xmm14 in here, removed for brevity
sub rsp, 10h
movdqu [rsp], xmm15
mov [rdx], rsp ; saves the caller stack pointer
SwitchContextFinally PROC
mov rsp, [rcx] ; sets the callee stack pointer
; takes out the callee registers
movdqu xmm15, [rsp]
add rsp, 10h
; ... pops xmm7 - xmm14 in here, removed for brevity
movdqu xmm6, [rsp]
add rsp, 10h
pop r15
pop r14
pop r13
pop r12
pop rsi
pop rdi
pop rbp
pop rbx
ret
SwitchContextFinally ENDP
SwitchContext ENDP
Problem
Inside the trampoline I just invoke the passed function, and within such a function I can't throw an exception and catch it immediately in the same function. What have I done wrong? Is it possible to throw exceptions in my case? Should I have shadow space in SetTrampoline?
Also, I guarantee that thrown exceptions do not escape the trampoline function.

Function attribute [[clang::minsize]] does not behave as expected

[[clang::minsize]]:
This attribute suggests that optimization passes and code generator passes make choices that keep the code size of this function as small as possible and perform optimizations that may sacrifice runtime performance in order to minimize the size of the generated code.
With the following code:
#include <cstdlib>
[[gnu::noinline]] int bar()
{
return rand();
}
[[gnu::noinline]] int laa()
{
return 129345;
}
[[clang::minsize]] int foo()
{
return bar() + laa();
}
When using the minsize attribute it doesn't appear to perform any optimization:
clang example
I'm expecting something vaguely similar to GCC's [[gnu::optimize("s")]] which works nicely!
gcc example
When clang is configured with -Os:
foo():
push rax
call bar()
add eax, 129345
pop rcx
ret
However, with -O0 plus the attribute:
foo():
push rbp
mov rbp, rsp
sub rsp, 16
call bar()
mov dword ptr [rbp - 4], eax # 4-byte Spill
call laa()
mov ecx, eax
mov eax, dword ptr [rbp - 4] # 4-byte Reload
add eax, ecx
add rsp, 16
pop rbp
ret

C++ std::string initialization better performance (assembly)

I was playing with www.godbolt.org to check which code generates better assembly, and I can't understand why these two approaches generate such different results (in assembly instructions).
The first approach is to declare a string, and then later set a value:
#include <string>
int foo() {
std::string a;
a = "abcdef";
return a.size();
}
With my gcc 7.4 (-O3), this outputs:
.LC0:
.string "abcdef"
foo():
push rbp
mov r8d, 6
mov ecx, OFFSET FLAT:.LC0
xor edx, edx
push rbx
xor esi, esi
sub rsp, 40
lea rbx, [rsp+16]
mov rdi, rsp
mov BYTE PTR [rsp+16], 0
mov QWORD PTR [rsp], rbx
mov QWORD PTR [rsp+8], 0
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long)
mov rdi, QWORD PTR [rsp]
mov rbp, QWORD PTR [rsp+8]
cmp rdi, rbx
je .L1
call operator delete(void*)
.L1:
add rsp, 40
mov eax, ebp
pop rbx
pop rbp
ret
mov rbp, rax
jmp .L3
foo() [clone .cold]:
.L3:
mov rdi, QWORD PTR [rsp]
cmp rdi, rbx
je .L4
call operator delete(void*)
.L4:
mov rdi, rbp
call _Unwind_Resume
So, I imagined that if I initialize the string in the declaration, the output assembly would be shorter:
int bar() {
std::string a {"abcdef"};
return a.size();
}
And indeed it is:
bar():
mov eax, 6
ret
Why this huge difference? What prevents gcc from optimizing the first version the way it does the second?
godbolt link
This is just a guess:
operator= has a strong exception guarantee; which means:
If an exception is thrown for any reason, this function has no effect (strong exception guarantee).
(since C++11)
(source)
So while the constructor can leave the object in any condition it likes, operator= needs to make sure that the object is the same as before; I suspect that's why the call to operator delete is there (to clean up potentially allocated memory).

What does Znwm and ZdlPv mean in assembly?

I'm new to assembly and I'm trying to figure out how C++ handles dynamic dispatch in assembly.
When looking through assembly code, I saw that there were 2 unusual calls:
call _Znwm
call _ZdlPv
These did not have a subroutine that I could trace them to. From examining the code, Znwm seemed to return the address of the object when its constructor was called, but I'm not sure about that. ZdlPv was in a block of code that could never be reached (it was jumped over).
C++:
Fruit * f;
f = new Apple();
x86:
# BB#1:
mov eax, 8
mov edi, eax
call _Znwm
mov rdi, rax
mov rcx, rax
.Ltmp6:
mov qword ptr [rbp - 48], rdi # 8-byte Spill
mov rdi, rax
mov qword ptr [rbp - 56], rcx # 8-byte Spill
call _ZN5AppleC2Ev
Any advice would be appreciated.
Thanks.
_Znwm is the mangled name of operator new(unsigned long).
_ZdlPv is the mangled name of operator delete(void*).
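Names like these can be decoded programmatically. The following is a minimal sketch that assumes a GCC or Clang toolchain, where abi::__cxa_demangle from <cxxabi.h> is available; the wrapper name demangle is made up for illustration.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result
    // with malloc; the caller must release it with free.
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);
    return result;
}
```

Calling demangle("_Znwm") yields "operator new(unsigned long)", demangle("_ZdlPv") yields "operator delete(void*)", and demangle("_ZN5AppleC2Ev") yields "Apple::Apple()". The c++filt command-line tool performs the same translation.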

Why do gcc and clang produce very different code for member function template parameters?

I am trying to understand what is going on when a member function pointer is used as a template parameter. I always thought that function pointers (or member function pointers) are a run-time concept, so I was wondering what happens when they are used as template parameters. For this reason I took a look at the output produced by this code:
struct Foo { void foo(int i){ } };
template <typename T,void (T::*F)(int)>
void callFunc(T& t){ (t.*F)(1); }
void callF(Foo& f){ f.foo(1);}
int main(){
Foo f;
callF(f);
callFunc<Foo,&Foo::foo>(f);
}
where callF is for comparison. gcc 6.2 produces the exact same output for both functions:
callF(Foo&): // void callFunc<Foo, &Foo::foo>(Foo&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov esi, 1
mov rdi, rax
call Foo::foo(int)
nop
leave
ret
while clang 3.9 produces almost the same output for callF():
callF(Foo&): # #callF(Foo&)
push rbp
mov rbp, rsp
sub rsp, 16
mov esi, 1
mov qword ptr [rbp - 8], rdi
mov rdi, qword ptr [rbp - 8]
call Foo::foo(int)
add rsp, 16
pop rbp
ret
but very different output for the template instantiation:
void callFunc<Foo, &Foo::foo>(Foo&): # #void callFunc<Foo, &Foo::foo>(Foo&)
push rbp
mov rbp, rsp
sub rsp, 32
xor eax, eax
mov cl, al
mov qword ptr [rbp - 8], rdi
mov rdi, qword ptr [rbp - 8]
test cl, 1
mov qword ptr [rbp - 16], rdi # 8-byte Spill
jne .LBB3_1
jmp .LBB3_2
.LBB3_1:
movabs rax, Foo::foo(int)
sub rax, 1
mov rcx, qword ptr [rbp - 16] # 8-byte Reload
mov rdx, qword ptr [rcx]
mov rax, qword ptr [rdx + rax]
mov qword ptr [rbp - 24], rax # 8-byte Spill
jmp .LBB3_3
.LBB3_2:
movabs rax, Foo::foo(int)
mov qword ptr [rbp - 24], rax # 8-byte Spill
jmp .LBB3_3
.LBB3_3:
mov rax, qword ptr [rbp - 24] # 8-byte Reload
mov esi, 1
mov rdi, qword ptr [rbp - 16] # 8-byte Reload
call rax
add rsp, 32
pop rbp
ret
Why is that? Is gcc taking some (possibly non-standard) shortcut?
gcc was able to figure out what the template was doing, and generated the simplest code possible. clang didn't. A compiler is permitted to perform any optimization as long as the observable results are compliant with the C++ specification. If that means optimizing away an intermediate function pointer, so be it. Nothing else in the code references the temporary function pointer, so it can be optimized away completely, and the whole thing replaced with a simple function call.
gcc and clang are different compilers, written by different people, with different approaches and algorithms for compiling C++.
It is natural and expected to see different results from different compilers. In this case, gcc was able to figure things out better than clang. I'm sure there are other situations where clang will be able to figure things out better than gcc.
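To make the compile-time nature of the template argument concrete, here is a small variation on the question's code. The member last and the function lastAfterCall are additions for illustration only, so that the effect of the call can be observed.

```cpp
struct Foo {
    int last = 0;                  // records the most recent argument (added for illustration)
    void foo(int i) { last = i; }
};

// F is a non-type template parameter: its value is fixed at compile time,
// so every instantiation knows exactly which member function it calls.
template <typename T, void (T::*F)(int)>
void callFunc(T& t) { (t.*F)(1); }

// Calls Foo::foo through the template and reports what it received.
int lastAfterCall() {
    Foo f;
    callFunc<Foo, &Foo::foo>(f);
    return f.last;
}
```

Because &Foo::foo is a constant known when callFunc<Foo, &Foo::foo> is instantiated, a compiler is free to emit a direct call to Foo::foo, which is what the gcc output in the question shows.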
This test was done without any optimizations requested.
One compiler generated more verbose unoptimized code.
Unoptimized code is, quite simply, uninteresting. It is intended to be correct, easy to debug, and derived directly from some intermediate representation that is easy to optimize.
The details of optimized code are what matter, barring a ridiculous and widespread slowdown that makes debugging painful.
There is nothing of interest to see or explain here.