Why is there no `mov %rsp, %rbp` in function prologue? - c++

In assembly, many functions begin with the following prologue:
00000001004010e0: main(int, char**)+0 push %rbp
00000001004010e1: main(int, char**)+1 mov %rsp,%rbp
Some functions, like the one below, do not:
int MainEntry(){
MainEntry():
0000000100401104: MainEntry()+0 push %rbp
0000000100401105: MainEntry()+1 push %rbx
0000000100401106: MainEntry()+2 sub $0x48,%rsp
000000010040110a: MainEntry()+6 lea 0x80(%rsp),%rbp
vector<int> v;
0000000100401112: MainEntry()+14 lea -0x60(%rbp),%rax
0000000100401116: MainEntry()+18 mov %rax,%rcx
0000000100401119: MainEntry()+21 callq 0x100401b00 <std::vector<int, std::allocator<int> >::vector()>
return 0;
000000010040111e: MainEntry()+26 mov $0x0,%ebx
0000000100401123: MainEntry()+31 lea -0x60(%rbp),%rax
0000000100401127: MainEntry()+35 mov %rax,%rcx
000000010040112a: MainEntry()+38 callq 0x100401b20 <std::vector<int, std::allocator<int> >::~vector()>
000000010040112f: MainEntry()+43 mov %ebx,%eax
}
Here is the C++ code that compiles into this:
int main(int c, char** args){
MainEntry();
return 0;
}
int MainEntry(){
vector<int> v;
return 0;
}
So here are my two questions:
In the MainEntry function, there is a push %rbp, and then a push %rbx. Why is RBX pushed onto the stack?
If I understand correctly, sub $0x48, %rsp allocates 0x48 bytes on the stack, and lea 0x80(%rsp), %rbp moves 0x80 bytes down on the stack and assigns that as the base. Where is RBP going to end up in the local stack frame and how did it get there?

rbx is pushed onto the stack because the calling convention says it is preserved across calls.
This function is compiled without frame pointers. rbp is just another general purpose register when compiling without frame pointers.
About the question in the title (now improved)
The push rsp, rbp instruction doesn't exist. push always takes one argument. Perhaps you meant to ask why rbp isn't pushed. The answer is that nothing uses it and so no instructions are required to preserve it.
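For the second question, where RBP ends up can be worked out from the four prologue instructions alone. The sketch below is plain arithmetic on the disassembly, not something the compiler emits. `kEntryRsp` is an arbitrary made-up value for rsp on entry to MainEntry, and the helper names are hypothetical. The remark about the 32-byte shadow space assumes the Windows x64 convention, which the use of rcx for the first argument suggests but the question does not state.

```cpp
#include <cstdint>

// Hypothetical value of rsp on entry to MainEntry, where it points at the
// return address. Only the differences between addresses matter here.
constexpr std::int64_t kEntryRsp = 0x1000 - 8;

// rsp after the three stack-adjusting prologue instructions.
constexpr std::int64_t RspAfterPrologue(std::int64_t entry_rsp) {
    std::int64_t rsp = entry_rsp;
    rsp -= 8;     // push %rbp
    rsp -= 8;     // push %rbx
    rsp -= 0x48;  // sub $0x48,%rsp
    return rsp;
}

// rbp after: lea 0x80(%rsp),%rbp
constexpr std::int64_t RbpAfterPrologue(std::int64_t entry_rsp) {
    return RspAfterPrologue(entry_rsp) + 0x80;
}

// Address of v, taken from: lea -0x60(%rbp),%rax
constexpr std::int64_t VectorAddress(std::int64_t entry_rsp) {
    return RbpAfterPrologue(entry_rsp) - 0x60;
}

// rsp has dropped by 0x58 bytes in total.
static_assert(kEntryRsp - RspAfterPrologue(kEntryRsp) == 0x58, "frame size");
// rbp lands 0x28 bytes above the return-address slot, not inside the new frame.
static_assert(RbpAfterPrologue(kEntryRsp) - kEntryRsp == 0x28, "rbp bias");
// v sits at rsp + 0x20, directly above 32 bytes (assumed to be shadow space).
static_assert(VectorAddress(kEntryRsp) - RspAfterPrologue(kEntryRsp) == 0x20, "v offset");
```

Under these assumptions, `lea 0x80(%rsp),%rbp` does not move down the stack: it points rbp 0x80 bytes above the new rsp, and locals are then addressed with negative offsets from it.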

Related

Implementation of Stackful Coroutine in C++

I am trying to implement a stackful coroutine in C++17 on Windows x64, but I have run into a problem: I can't throw an exception inside my coroutine. If I do, the program terminates immediately with a bad exit code.
Implementation
At the beginning, I allocate a stack for a new coroutine; the code looks something like this:
void* Allocate() {
static constexpr std::size_t kStackSize{524'288};
auto new_stack{::operator new(kStackSize)};
return static_cast<std::byte *>(new_stack) + kStackSize;
}
The next step is setting up a trampoline function on the newly allocated stack. The code is written in MASM, since I use MSVC (I would like to use GCC and NASM, but I have a problem with thread_local variables there; see question, if you are interested):
SetTrampoline PROC
mov rax, rsp ; saves the current stack pointer
mov rsp, [rcx] ; sets the new stack pointer
sub rsp, 20h ; shadow stack
push rdx ; saves the function pointer
; place for nonvolatile registers
sub rsp, 0e0h
mov [rcx], rsp ; saves the moved stack pointer
mov rsp, rax ; returns the initial stack pointer
ret
SetTrampoline ENDP
Then I switch the machine context with this assembly function (I read this calling convention):
SwitchContext PROC
; saves all nonvolatile registers to the caller stack
push rbx
push rbp
push rdi
push rsi
push r12
push r13
push r14
push r15
sub rsp, 10h
movdqu [rsp], xmm6
; ... pushes xmm7 - xmm14 in here, removed for brevity
sub rsp, 10h
movdqu [rsp], xmm15
mov [rdx], rsp ; saves the caller stack pointer
SwitchContextFinally PROC
mov rsp, [rcx] ; sets the callee stack pointer
; takes out the callee registers
movdqu xmm15, [rsp]
add rsp, 10h
; ... pops xmm7 - xmm14 in here, removed for brevity
movdqu xmm6, [rsp]
add rsp, 10h
pop r15
pop r14
pop r13
pop r12
pop rsi
pop rdi
pop rbp
pop rbx
ret
SwitchContextFinally ENDP
SwitchContext ENDP
Problem
Inside the trampoline I just invoke whatever function was passed in, and within these functions I can't throw exceptions and catch them immediately in the same function. What have I done wrong? Is it possible to throw exceptions in my case? Should I have shadow space in SetTrampoline?
Also, I guarantee that a thrown exception never escapes the trampoline function.

Too large (overaligned?) stack frame with GCC but not with Clang

Consider this simple code:
class X {
int i_;
public:
X();
};
void f() {
X x;
}
The stack frame of f is 32 bytes long with GCC, which is unnecessarily large. The return address and x need just 12 bytes, and only 16-byte alignment should be required according to the Linux/x86_64 ABI. With Clang, only 16 bytes are allocated. Why does GCC require so much stack space?
GCC assembly:
f():
sub rsp, 24
lea rdi, [rsp+12]
call X::X()
add rsp, 24
ret
Clang assembly:
f():
push rax
mov rdi, rsp
call X::X()
pop rax
ret
Both with -O2. Live demo: https://godbolt.org/z/bcrWW36on
Fascinating rabbit hole, I've changed my analysis three times already.
It seems that is indeed a missed optimization. While playing around a bit, I found another missed optimization, this time in clang:
If you actually use the x object, then Clang uses rbx to cache the address of x instead of recomputing it, which means it needs to save rbx across the function, which extends the used space in the stack frame by 8 (from 12 to 20), bumping the aligned stack frame to 32, same as gcc.
From a debugging perspective, I'd prefer clang to use sub rsp, 8 instead of push rax to allocate the memory for x, so the memory isn't marked as initialized in valgrind.
GCC assembly:
f():
sub rsp, 24
lea rdi, [rsp+12]
call X::X() [complete object constructor]
lea rdi, [rsp+12]
call g(X&)
add rsp, 24
ret
Clang assembly:
f():
push rbx
sub rsp, 16
lea rbx, [rsp + 8]
mov rdi, rbx
call X::X() [complete object constructor]
mov rdi, rbx
call g(X&)
add rsp, 16
pop rbx
ret
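The frame sizes quoted above follow from rounding the bytes actually needed up to the 16-byte stack alignment. A minimal sketch of that arithmetic, with `AlignUp` as a hypothetical helper name:

```cpp
#include <cstddef>

// Rounds `bytes` up to the next multiple of `alignment`.
// Assumes `alignment` is a power of two.
constexpr std::size_t AlignUp(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Return address (8 bytes) + x (4 bytes) = 12 bytes, which fits in 16.
static_assert(AlignUp(8 + 4, 16) == 16, "minimal frame for f");

// Adding a saved rbx (8 bytes) gives 20 bytes, which rounds up to 32.
static_assert(AlignUp(8 + 4 + 8, 16) == 32, "frame once rbx is saved");
```

This matches the listings: GCC's `sub rsp, 24` plus the 8-byte return address is 32 bytes, Clang's `push rax` plus the return address is 16, and Clang's `push rbx` with `sub rsp, 16` plus the return address is 32.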
I've checked whether gcc maybe uses 32 bytes stack alignment by using a 32 byte vector as a data member, and both gcc and clang generate code to align the stack pointer here, and use the base pointer to implement the variable-length stack frame. I have no idea why Clang allocates 64 bytes for the object here, though.
GCC assembly:
f():
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 32
mov rdi, rsp
call X::X() [complete object constructor]
leave
ret
Clang assembly:
f(): # #f()
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 64
mov rdi, rsp
call X::X() [complete object constructor]
mov rsp, rbp
pop rbp
ret
Without actually measuring performance, it is hard to tell which is better -- -O2 will optimize for runtime, not stack frame size, so there could be good reasons for all of these choices.

C++ std::string initialization better performance (assembly)

I was playing with www.godbolt.org to check which code generates better assembly, and I can't understand why these two different approaches generate different results (in assembly commands).
The first approach is to declare a string, and then later set a value:
#include <string>
int foo() {
std::string a;
a = "abcdef";
return a.size();
}
Which, in my gcc 7.4 (-O3) outputs:
.LC0:
.string "abcdef"
foo():
push rbp
mov r8d, 6
mov ecx, OFFSET FLAT:.LC0
xor edx, edx
push rbx
xor esi, esi
sub rsp, 40
lea rbx, [rsp+16]
mov rdi, rsp
mov BYTE PTR [rsp+16], 0
mov QWORD PTR [rsp], rbx
mov QWORD PTR [rsp+8], 0
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned long, char const*, unsigned long)
mov rdi, QWORD PTR [rsp]
mov rbp, QWORD PTR [rsp+8]
cmp rdi, rbx
je .L1
call operator delete(void*)
.L1:
add rsp, 40
mov eax, ebp
pop rbx
pop rbp
ret
mov rbp, rax
jmp .L3
foo() [clone .cold]:
.L3:
mov rdi, QWORD PTR [rsp]
cmp rdi, rbx
je .L4
call operator delete(void*)
.L4:
mov rdi, rbp
call _Unwind_Resume
So, I imagined that if I initialize the string in the declaration, the output assembly would be shorter:
int bar() {
std::string a {"abcdef"};
return a.size();
}
And indeed it is:
bar():
mov eax, 6
ret
Why this huge difference? What prevents gcc from optimizing the first version the same way as the second?
godbolt link
This is just a guess:
operator= has a strong exception guarantee; which means:
If an exception is thrown for any reason, this function has no effect (strong exception guarantee).
(since C++11)
(source)
So while the constructor can leave the object in any condition it likes, operator= needs to make sure that the object is the same as before; I suspect that's why the call to operator delete is there (to clean up potentially allocated memory).
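Whatever the reason for the difference, the two functions are observably equivalent, so the gap is purely one of code generation. A small sketch that puts both versions from the question side by side (the explicit casts and `CheckSameResult` are additions, not part of the original code):

```cpp
#include <cassert>
#include <string>

// First approach from the question: default-construct, then assign.
int foo() {
    std::string a;
    a = "abcdef";
    return static_cast<int>(a.size());
}

// Second approach from the question: initialize in the declaration.
int bar() {
    std::string a{"abcdef"};
    return static_cast<int>(a.size());
}

// Hypothetical helper: both versions must report the same length.
void CheckSameResult() {
    assert(foo() == bar());
}
```

Pasting both functions into Compiler Explorer with the same flags is a quick way to see whether a newer compiler version closes the gap.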

Are arguments loaded into the cache for empty functions?

I know that C++ compilers optimize empty (static) functions.
Based on that knowledge, I wrote a piece of code that should get optimized away whenever some identifier is defined (using the -D option of the compiler).
Consider the following dummy example:
#include <iostream>
#ifdef NO_INC
struct T {
static inline void inc(int& v, int i) {}
};
#else
struct T {
static inline void inc(int& v, int i) {
v += i;
}
};
#endif
int main(int argc, char* argv[]) {
int a = 42;
for (int i = 0; i < argc; ++i)
T::inc(a, i);
std::cout << a;
}
The desired behavior would be the following:
Whenever the NO_INC identifier is defined (using -DNO_INC when compiling), all calls to T::inc(...) should be optimized away (due to the empty function body). Otherwise, the call to T::inc(...) should trigger an increment by some given value i.
I have two questions regarding this:
Is my assumption correct that calls to T::inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
I wonder if the variables (a and i) are still loaded into the cache when T::inc(a, i) is called (assuming they are not there yet) although the function body is empty.
Thanks for any advice!
Compiler Explorer is a very useful tool for looking at the assembly of your generated program, because there is no other way to know for sure whether the compiler optimized something. Demo.
With actually incrementing, your main looks like:
main: # #main
push rax
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea esi, [rcx + rdi]
add esi, 41
jmp .LBB0_3
.LBB0_1:
mov esi, 42
.LBB0_3:
mov edi, offset std::cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
As you can see, the compiler completely inlined the call to T::inc and does the incrementing directly.
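Reading the listing more closely, the compiler appears to have gone further than inlining: there is no loop left, only a closed-form expression. The sketch below compares the source loop with the formula read off the `lea`/`imul`/`shr`/`add` sequence. The function names are hypothetical, and `long long` is used instead of `int` to keep overflow out of the picture.

```cpp
// Mirrors the loop in main: start at 42 and add 0, 1, ..., argc - 1.
constexpr long long LoopResult(int argc) {
    long long a = 42;
    for (int i = 0; i < argc; ++i)
        a += i;
    return a;
}

// What the emitted instructions compute when argc > 0:
// (argc - 1) * (argc - 2) / 2 + argc + 41. The jle branch returns 42 otherwise.
constexpr long long ClosedForm(int argc) {
    if (argc <= 0)
        return 42;
    const long long n = argc;
    return (n - 1) * (n - 2) / 2 + n + 41;
}

static_assert(LoopResult(1) == ClosedForm(1), "argc == 1");
static_assert(LoopResult(4) == ClosedForm(4), "argc == 4");
static_assert(LoopResult(100) == ClosedForm(100), "argc == 100");
```

Both expressions reduce to 42 + argc * (argc - 1) / 2, so the cost no longer depends on argc at all.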
For an empty T::inc you get:
main: # #main
push rax
mov edi, offset std::cout
mov esi, 42
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
The compiler optimized away the entire loop!
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
Yes.
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
No, for some definition of "complex". Compilers use heuristics to determine whether it's worth inlining a function, and base their decision on that and nothing else.
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
No, as demonstrated above, the loop doesn't even exist.
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized? If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
You are right. I have modified your example (i.e. removed cout which clutters the assembly) in compiler explorer to make it more obvious what happens.
The compiler optimizes everything away and outputs
main: # #main
movl $42, %eax
retq
Only 42 is loaded into eax and returned.
For the more complex case, however, more instructions are needed to compute the return value. See here
main: # #main
testl %edi, %edi
jle .LBB0_1
leal -1(%rdi), %eax
leal -2(%rdi), %ecx
imulq %rax, %rcx
shrq %rcx
leal (%rcx,%rdi), %eax
addl $41, %eax
retq
.LBB0_1:
movl $42, %eax
retq
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
They are only loaded when the compiler cannot prove that they are unused. See the second example in Compiler Explorer.
By the way: you do not need to create an instance of T (i.e. T t;) in order to call a static function of a class. That defeats the purpose. Call it as T::inc(...) rather than t.inc(...).
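As a sketch of that call style, the two preprocessor variants can also be written as two separate types and selected with a template parameter, which makes it easy to compare both in a single file. This swaps the `#ifdef` from the question for a template; `IncT`, `NoIncT` and `Run` are hypothetical names, not part of the original code.

```cpp
// Stand-in for the variant compiled without -DNO_INC.
struct IncT {
    static constexpr void inc(int& v, int i) { v += i; }
};

// Stand-in for the variant compiled with -DNO_INC: empty body.
struct NoIncT {
    static constexpr void inc(int&, int) {}
};

// Same loop as in main, parameterized on the variant.
template <typename T>
constexpr int Run(int argc) {
    int a = 42;
    for (int i = 0; i < argc; ++i)
        T::inc(a, i);  // static member called on the type; no object is created
    return a;
}

static_assert(Run<IncT>(4) == 48, "42 + 0 + 1 + 2 + 3");
static_assert(Run<NoIncT>(4) == 42, "empty inc leaves a untouched");
```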
Because the inline keyword is used, you can safely assume 1. Using these functions shouldn't negatively affect performance.
Running your code through
g++ -c -Os -g
objdump -S
confirms this. An extract:
int main(int argc, char* argv[]) {
T t;
int a = 42;
1020: b8 2a 00 00 00 mov $0x2a,%eax
for (int i = 0; i < argc; ++i)
1025: 31 d2 xor %edx,%edx
1027: 39 fa cmp %edi,%edx
1029: 7d 06 jge 1031 <main+0x11>
v += i;
102b: 01 d0 add %edx,%eax
for (int i = 0; i < argc; ++i)
102d: ff c2 inc %edx
102f: eb f6 jmp 1027 <main+0x7>
t.inc(a, i);
return a;
}
1031: c3 retq
(I replaced the cout with return for better readability)

Linux 64bit calling convention uses register to pass 'this' pointer, but code is less efficient?

My question
According to Wikipedia, the 64-bit calling convention passes the first parameters in registers (rdi, rsi, etc.).
But I found that in 64-bit code, when a class member function (e.g. a constructor) is called, the code generated by the compiler moves the "this" pointer from the register to memory, and the function then uses that memory.
So it seems to me that using the register as an intermediary is redundant.
Experiment
Observe the constructor generated by gcc and check disassembly via gdb.
First 32bit.
struct Test
{
int i;
Test(){
i=23;
}
};
int main()
{
Test obj1;
return 0;
}
$ gcc Test.cpp -g -o Test -m32
gdb to start it, break at 'i=23', check the disassembly:
(gdb) disassemble
Dump of assembler code for function Test::Test():
0x08048484 <+0>: push %ebp
0x08048485 <+1>: mov %esp,%ebp
=> 0x08048487 <+3>: mov 0x8(%ebp),%eax #'this' pointer in%ebp+8, passed by caller
0x0804848a <+6>: movl $0x17,(%eax) #Put "23" at the first member location
0x08048490 <+12>: nop
0x08048491 <+13>: pop %ebp
0x08048492 <+14>: ret
End of assembler dump.
Question(1)
This 32-bit version seems efficient. But the 'this' pointer is not passed in the 'ecx' register as VC does. Does gcc use 'ecx' to store the 'this' pointer?
Then 64bit:
(gdb) disassemble
Dump of assembler code for function Test::Test():
0x0000000000400584 <+0>: push %rbp
0x0000000000400585 <+1>: mov %rsp,%rbp
0x0000000000400588 <+4>: mov %rdi,-0x8(%rbp) #rdi to store/ restore 'this'
=> 0x000000000040058c <+8>: mov -0x8(%rbp),%rax #Same as 32 bit version.
0x0000000000400590 <+12>: movl $0x17,(%rax)
0x0000000000400596 <+18>: nop
0x0000000000400597 <+19>: pop %rbp
0x0000000000400598 <+20>: retq
End of assembler dump.
Question(2)
This time there are more instructions. The move of the 'this' pointer from %rdi to memory seems to indicate that the register usage is useless, because in the end it has to be in memory to be used as a function parameter.
(2.1) For 64 bit, the first 2 parameters of a function call are stored in rdi and rsi. But here there seems to be no need for rdi to hold the 'this' pointer only to store it into memory again. We could 'push' the 'this' pointer directly and the constructor could use it.
(2.2) And the 64-bit program requires an extra word (a size_t at %rbp-8) on the stack to store the 'this' pointer.
So in all, the space and time efficiency of the 64-bit version are both worse than those of the 32-bit version. Is this due to the 64-bit calling convention, or just because I'm not telling gcc to optimize the code to the last bit of its strength?
When is 64bit faster?
Appreciate your suggestions. Thanks very much.
In fact, the optimizer erases all of your code.
Even after changing main to return obj1.i, the code of Test::Test() is optimized away entirely:
0000000000400470 <main>:
400470: b8 17 00 00 00 mov $0x17,%eax
400475: c3 retq
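The immediate in that `mov` is the constructor's value: 0x17 is 23. A minimal sketch of the modified program, with the body of main moved into a hypothetical `ReadMember` function so it can be checked directly:

```cpp
#include <cassert>

// Same class as in the question.
struct Test {
    int i;
    Test() {
        i = 23;
    }
};

// Hypothetical stand-in for the modified main that returns obj1.i.
int ReadMember() {
    Test obj1;
    return obj1.i;
}

// 0x17 in the disassembly is the decimal 23 assigned in the constructor.
static_assert(0x17 == 23, "hex immediate matches the source constant");

// Hypothetical helper: the function returns exactly that constant.
void CheckReadMember() {
    assert(ReadMember() == 0x17);
}
```

With optimization enabled, the constructor call, the store through 'this' and the stack slot can all disappear, so the unoptimized listings in the question say little about the cost of the calling convention itself.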