To get a better understanding of binary files, I've prepared a small C++ example and used gdb to disassemble it and look at the machine code.
The main() function calls the function func():
int func(void)
{
int a;
int b;
int c;
int d;
d = 4;
c = 3;
b = 2;
a = 1;
return 0;
}
The project is compiled with g++, keeping the debugging information. Next, gdb is used to disassemble the binary. What I got for func() looks like:
0x00000000004004cc <+0>: push %rbp
0x00000000004004cd <+1>: mov %rsp,%rbp
0x00000000004004d0 <+4>: movl $0x4,-0x10(%rbp)
0x00000000004004d7 <+11>: movl $0x3,-0xc(%rbp)
0x00000000004004de <+18>: movl $0x2,-0x8(%rbp)
0x00000000004004e5 <+25>: movl $0x1,-0x4(%rbp)
0x00000000004004ec <+32>: mov $0x0,%eax
0x00000000004004f1 <+37>: pop %rbp
0x00000000004004f2 <+38>: retq
Now my problem is that I expected the stack pointer to be moved 16 bytes toward lower addresses relative to the base pointer, since each integer value needs 4 bytes. But it looks like the values are put on the stack without the stack pointer being moved.
What did I not understand correctly? Is this a problem with the compiler or did the assembler omit some lines?
Best regards,
NouGHt
There's absolutely no problem with your compiler. The compiler is free to choose how to compile your code, and it chose not to modify the stack pointer. There's no need for it to do so since your function doesn't call any other functions. If it did call another function then it would need to create another stack frame so that the callee did not stomp on the caller's stack frame.
As a general rule, you should avoid making assumptions about how the compiler will compile your code. For example, your compiler would be perfectly at liberty to optimize away the body of your function.
Related
I know that C++ compilers optimize empty (static) functions.
Based on that knowledge I wrote a piece of code that should get optimized away whenever some identifier is defined (using the -D option of the compiler).
Consider the following dummy example:
#include <iostream>
#ifdef NO_INC
struct T {
static inline void inc(int& v, int i) {}
};
#else
struct T {
static inline void inc(int& v, int i) {
v += i;
}
};
#endif
int main(int argc, char* argv[]) {
int a = 42;
for (int i = 0; i < argc; ++i)
T::inc(a, i);
std::cout << a;
}
The desired behavior would be the following:
Whenever the NO_INC identifier is defined (using -DNO_INC when compiling), all calls to T::inc(...) should be optimized away (due to the empty function body). Otherwise, the call to T::inc(...) should trigger an increment by some given value i.
I got two questions regarding this:
Is my assumption correct that calls to T::inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
I wonder if the variables (a and i) are still loaded into the cache when T::inc(a, i) is called (assuming they are not there yet) although the function body is empty.
Thanks for any advice!
Compiler Explorer is a very useful tool for looking at the assembly of your generated program; there is no other way to know for sure whether the compiler optimized something. Demo.
With actually incrementing, your main looks like:
main: # #main
push rax
test edi, edi
jle .LBB0_1
lea eax, [rdi - 1]
lea ecx, [rdi - 2]
imul rcx, rax
shr rcx
lea esi, [rcx + rdi]
add esi, 41
jmp .LBB0_3
.LBB0_1:
mov esi, 42
.LBB0_3:
mov edi, offset std::cout
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
As you can see, the compiler completely inlined the call to T::inc and does the incrementing directly.
For an empty T::inc you get:
main: # #main
push rax
mov edi, offset std::cout
mov esi, 42
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
xor eax, eax
pop rcx
ret
The compiler optimized away the entire loop!
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized?
Yes.
If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
No, for some definition of "complex". Compilers use heuristics to determine whether inlining a function is worth it, and base the decision on that alone.
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
No, as demonstrated above, the loop doesn't even exist.
Is my assumption correct that calls to t.inc(...) do not affect the performance negatively when I specify the -DNO_INC option because the call to the empty function is optimized? If my assumption holds, does it also hold for more complex function bodies (in the #else branch)?
You are right. I have modified your example (i.e. removed the cout, which clutters the assembly) in Compiler Explorer to make it more obvious what happens.
The compiler optimizes everything away and outputs
main: # #main
movl $42, %eax
retq
Only 42 is loaded into eax and returned.
For the more complex case, however, more instructions are needed to compute the return value. See here
main: # #main
testl %edi, %edi
jle .LBB0_1
leal -1(%rdi), %eax
leal -2(%rdi), %ecx
imulq %rax, %rcx
shrq %rcx
leal (%rcx,%rdi), %eax
addl $41, %eax
retq
.LBB0_1:
movl $42, %eax
retq
I wonder if the variables (a and i) are still loaded into the cache when t.inc(a, i) is called (assuming they are not there yet) although the function body is empty.
They are only loaded when the compiler cannot prove that they are unused. See the second Compiler Explorer example.
By the way: you do not need to make an instance of T (i.e. T t;) in order to call a static member function; that defeats the purpose. Call it as T::inc(...) rather than t.inc(...).
Because the inline keyword is used, you can safely assume 1: using these functions shouldn't negatively affect performance.
Running your code through
g++ -c -Os -g
objdump -S
confirms this; an extract:
int main(int argc, char* argv[]) {
T t;
int a = 42;
1020: b8 2a 00 00 00 mov $0x2a,%eax
for (int i = 0; i < argc; ++i)
1025: 31 d2 xor %edx,%edx
1027: 39 fa cmp %edi,%edx
1029: 7d 06 jge 1031 <main+0x11>
v += i;
102b: 01 d0 add %edx,%eax
for (int i = 0; i < argc; ++i)
102d: ff c2 inc %edx
102f: eb f6 jmp 1027 <main+0x7>
t.inc(a, i);
return a;
}
1031: c3 retq
(I replaced the cout with return for better readability)
That's what I understood from reading some memory segmentation documents: when a function is called, a few instructions (called the function prologue) save the frame pointer on the stack, copy the value of the stack pointer into the base pointer, and reserve some memory for local variables.
Here's a trivial code I am trying to debug using GDB:
void test_function(int a, int b, int c, int d) {
int flag;
char buffer[10];
flag = 31337;
buffer[0] = 'A';
}
int main() {
test_function(1, 2, 3, 4);
}
The purpose of debugging this code was to understand what happens on the stack when a function is called, so I examined memory at various steps of the program's execution (before calling the function and during its execution). Although I managed to see things like the return address and the saved frame pointer by examining memory around the base pointer, I can't make sense of the issues I describe below the disassembled code.
Disassembling:
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000000400509 <+0>: push rbp
0x000000000040050a <+1>: mov rbp,rsp
0x000000000040050d <+4>: mov ecx,0x4
0x0000000000400512 <+9>: mov edx,0x3
0x0000000000400517 <+14>: mov esi,0x2
0x000000000040051c <+19>: mov edi,0x1
0x0000000000400521 <+24>: call 0x4004ec <test_function>
0x0000000000400526 <+29>: pop rbp
0x0000000000400527 <+30>: ret
End of assembler dump.
(gdb) disassemble test_function
Dump of assembler code for function test_function:
0x00000000004004ec <+0>: push rbp
0x00000000004004ed <+1>: mov rbp,rsp
0x00000000004004f0 <+4>: mov DWORD PTR [rbp-0x14],edi
0x00000000004004f3 <+7>: mov DWORD PTR [rbp-0x18],esi
0x00000000004004f6 <+10>: mov DWORD PTR [rbp-0x1c],edx
0x00000000004004f9 <+13>: mov DWORD PTR [rbp-0x20],ecx
0x00000000004004fc <+16>: mov DWORD PTR [rbp-0x4],0x7a69
0x0000000000400503 <+23>: mov BYTE PTR [rbp-0x10],0x41
0x0000000000400507 <+27>: pop rbp
0x0000000000400508 <+28>: ret
End of assembler dump.
I understand that "saving the frame pointer on the stack" is done by "push rbp" and "copying the value of the stack pointer into the base pointer" is done by "mov rbp, rsp", but what is getting me confused is the lack of a "sub rsp $n_bytes" for "reserving some memory for local variables". I've seen that in a lot of examples (even in some topics here on Stack Overflow).
I also read that arguments should be at a positive offset from the base pointer (after it's loaded with the stack pointer value): since they are stored by the caller, and the stack grows toward lower addresses, it makes perfect sense that the base pointer plus some positive offset points back at them. But my code seems to store them at negative offsets, just like local variables. I also can't understand why they are put in registers (in main); shouldn't they be stored directly at offsets from rsp?
Maybe these differences are due to the fact that I'm using a 64-bit system, but my research didn't turn up anything that would explain what I am seeing.
The System V ABI for x86-64 specifies a red zone of 128 bytes below %rsp. These 128 bytes belong to the function as long as it doesn't call any other function (it is a leaf function).
Signal handlers (and functions called by a debugger) need to respect the red zone, since they are effectively involuntary function calls. All of the local variables of your test_function, which is a leaf function, fit into the red zone, thus no adjustment of %rsp is needed. (Also, the function has no visible side-effects and would be optimized out on any reasonable optimization setting).
You can compile with -mno-red-zone to stop the compiler from using space below the stack pointer. Kernel code has to do this because hardware interrupts don't implement a red-zone.
But my code seems to store them in a negative offset, just like local variables
The first x86_64 arguments are passed in registers, not on the stack. So when rbp is set to rsp, they are not on the stack, and cannot be at a positive offset.
They are pushed to the stack only to:
save register state for a second function call. In this case, that is not required, since this is a leaf function.
make register allocation easier. But an optimizing allocator could do a better job without the memory spill here.
The situation would be different if you had:
x86_64 function with lots of arguments. Those that don't fit on registers go on the stack.
IA-32, where every argument goes on the stack.
the lack of a "sub rsp $n_bytes" for "saving some memory for local variables".
The red-zone/leaf-function part of the question (the missing sub rsp) has already been asked at: Why does the x86-64 GCC function prologue allocate less stack than the local variables?
With reference to the following code
#include <iostream>
using namespace std;
void do_something(int* ptr) {
cout << "Got address " << reinterpret_cast<void*>(ptr) << endl;
}
void func() {
int a;
do_something(&a);
}
int main() {
func();
}
When I disassemble the func function, the x86 (I am not sure whether it is x86 or x86_64) code is:
-> 0x100001140 <+0>: pushq %rbp
0x100001141 <+1>: movq %rsp, %rbp
0x100001144 <+4>: subq $0x10, %rsp
0x100001148 <+8>: leaq -0x4(%rbp), %rdi
0x10000114c <+12>: callq 0x100000f90 ; do_something(int*)
0x100001151 <+17>: addq $0x10, %rsp
0x100001155 <+21>: popq %rbp
0x100001156 <+22>: retq
0x100001157 <+23>: nopw (%rax,%rax)
I understand that the first push statement pushes the previous frame's base pointer onto the stack, and then the stack pointer value is copied into the base pointer. But then why are 16 bytes reserved on the stack?
Does this have to do with alignment somehow? The variable a needs only 4 bytes.
Also, what exactly is the lea instruction doing in this function call? Is it just getting the address of the integer relative to the base pointer, which in this case seems to be 4 bytes below it (assuming that the return address is 4 bytes long and is the first thing on the stack)?
Other architectures seem to reserve more than 16 bytes and have other things stored at the base of the stack frame.
This is x64 code; note the use of the rsp register (x86 code uses esp). The most important implementation detail of the x64 ABI is that the stack must always be aligned to 16 bytes. This isn't actually necessary to run 64-bit code properly, but the alignment guarantee ensures that the compiler can safely emit SSE instructions, whose memory operands require 16-byte alignment. None are actually used in this snippet, but they might be in do_something.
Upon entry of your function, the caller's CALL instruction has pushed 8 bytes on the stack to store the return address. The first PUSH instruction aligns the stack to 16 again, no additional corrections required.
It then creates the stack frame to store the a variable. While only 4 bytes are required, adjusting rsp by only 4 isn't good enough to provide the necessary alignment. So it picks the next suitable value, 16. The extra 12 bytes are simply unused.
The LEA instruction is a very handy one that implements &a. LEA = Load Effective Address = "take the address of". Not a particularly involved calculation here, it gets more convoluted when you use something like &array[ix]. Something that still can be done by a single LEA if the array element size is 1, 2 or 4 bytes long, pretty common.
The -4 is the offset from the start of the stack frame for the a variable. 4 bytes are needed to store an int; your compiler implements the LP64 data model. Keep in mind that the stack grows downward, which is why the offset isn't 0.
Then it is just making the function call, the rdi register is used to pass the 1st argument in the x64 ABI. Then it destroys the stack frame again by re-adjusting rsp and restores rbp.
Do keep in mind that you are looking at unoptimized code. Usually none of this is left after the optimizer is done with it, small functions like this almost always get inlined. So this doesn't teach you that much practical knowledge of the code that actually runs. Have a look-see at the -O2 code.
According to the x86-64 ABI, the stack must be 16-byte aligned prior to a subroutine call.
leaq (mem), reg
is equivalent to the following
reg = &(*mem) = mem
My question
From Wikipedia, the 64-bit ABI calling convention uses registers to pass the first parameters: rdi, rsi, etc.
But I found that in 64-bit code, when calling a class's member function (e.g. the constructor), the compiler-generated code moves the 'this' pointer from a register to memory, and the function then uses that memory.
So the use of the register (as an intermediary) seemed redundant to me.
Experiment
Observe the constructor generated by gcc and check disassembly via gdb.
First 32bit.
struct Test
{
int i;
Test(){
i=23;
}
};
int main()
{
Test obj1;
return 0;
}
$ gcc Test.cpp -g -o Test -m32
Start it under gdb, break at 'i=23', and check the disassembly:
(gdb) disassemble
Dump of assembler code for function Test::Test():
0x08048484 <+0>: push %ebp
0x08048485 <+1>: mov %esp,%ebp
=> 0x08048487 <+3>: mov 0x8(%ebp),%eax #'this' pointer in %ebp+8, passed by caller
0x0804848a <+6>: movl $0x17,(%eax) #Put "23" at the first member location
0x08048490 <+12>: nop
0x08048491 <+13>: pop %ebp
0x08048492 <+14>: ret
End of assembler dump.
Question(1)
This 32-bit version seems efficient. But the 'this' pointer is not passed in the 'ecx' register the way VC does it. Does gcc ever use 'ecx' to store the 'this' pointer?
Then 64bit:
(gdb) disassemble
Dump of assembler code for function Test::Test():
0x0000000000400584 <+0>: push %rbp
0x0000000000400585 <+1>: mov %rsp,%rbp
0x0000000000400588 <+4>: mov %rdi,-0x8(%rbp) #rdi to store/ restore 'this'
=> 0x000000000040058c <+8>: mov -0x8(%rbp),%rax #Same as 32 bit version.
0x0000000000400590 <+12>: movl $0x17,(%rax)
0x0000000000400596 <+18>: nop
0x0000000000400597 <+19>: pop %rbp
0x0000000000400598 <+20>: retq
End of assembler dump.
Question(2)
This time there are more instructions. The move of the 'this' pointer from %rdi to memory seems to indicate that the register usage is useless, because in the end it has to be in memory to serve as the function parameter.
(2.1) For 64-bit, the first two parameters of a function call are stored in rdi and rsi. But here there seems to be no need for rdi to hold the 'this' pointer and then store it to memory again. We could 'push' the this pointer directly and the constructor could use it.
(2.2) And the 64-bit program requires an extra word (size_t, at %rbp-8) on the stack to spill the 'this' pointer.
So all in all, both the space and the time efficiency of the 64-bit version are worse than the 32-bit version. Is this due to the 64-bit calling convention, or just because I'm not telling gcc to optimize the code to the last bit of its strength?
When is 64bit faster?
Appreciate your suggestions. Thanks very much.
In fact, the optimizer erases all of your code.
Even after changing main to return obj1.i, the code of Test::Test() is optimized away entirely:
0000000000400470 <main>:
400470: b8 17 00 00 00 mov $0x17,%eax
400475: c3 retq
Let's take this for example:
class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClass::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClas::functionComplex()
{
/* many complex
instructions
*/
}
void myFunc()
{
TestClass testVar;
testVar.functionInline();
}
Supposing that all comments are in fact lines of code (either a single line or many complex lines), the equivalent code would be (after compilation):
void myFunc()
{
TestClass testVar;
// a single instruction
testVar.functionComplex();
}
or would be:
void myFunc()
{
TestClass testVar;
// a single instruction
/* many complex
instructions
*/
}
In other words, would a normal function be inserted inline if called inside an inline function or not?
If the compiler can see that the function is not called anywhere else (e.g. it is static in the case of a free function), then at least gcc has been inlining it for a long time.
Of course, this also assumes the compiler can actually "see" the source code of the function: only with "whole program optimisation" (available in at least the MS and GCC compilers) does it inline functions that aren't in the source file or in headers included by it.
Obviously, inlining a "large" function has very little benefit (because the overhead of making the call is such a small portion of the total runtime), and if the function gets called more than once (or "may be called more than once" by not being static), the compiler will almost certainly not inline a "large" function.
In summary: maybe the large function is inline, but quite likely not.
Please check the assembly code that I generated with both VC++ 2010 and g++.
Neither compiler actually inlines either function in this example.
Code:
class TestClass {
public:
int functionInline();
int functionComplex();
};
inline int TestClass::functionInline()
{
// a single instruction
return functionComplex();
}
int TestClass::functionComplex()
{
/* many complex
instructions
*/
return 0;
}
int main(){
TestClass t;
t.functionInline();
return 0;
}
VC++ 2010:
int main(){
01372E50 push ebp
01372E51 mov ebp,esp
01372E53 sub esp,0CCh
01372E59 push ebx
01372E5A push esi
01372E5B push edi
01372E5C lea edi,[ebp-0CCh]
01372E62 mov ecx,33h
01372E67 mov eax,0CCCCCCCCh
01372E6C rep stos dword ptr es:[edi]
TestClass t;
t.functionInline();
01372E6E lea ecx,[t]
01372E71 call TestClass::functionInline (1371677h)
return 0;
01372E76 xor eax,eax
}
Linux G++:
main:
.LFB3:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
leaq -1(%rbp), %rax
movq %rax, %rdi
call _ZN9TestClass14functionInlineEv
movl $0, %eax
leave
ret
.cfi_endproc
Both the lines
01372E71 call TestClass::functionInline (1371677h)
and
call _ZN9TestClass14functionInlineEv
indicate that the function functionInline is not inline.
Now have a look at functionInline assembly:
inline int TestClass::functionInline()
{
01372E00 push ebp
01372E01 mov ebp,esp
01372E03 sub esp,0CCh
01372E09 push ebx
01372E0A push esi
01372E0B push edi
01372E0C push ecx
01372E0D lea edi,[ebp-0CCh]
01372E13 mov ecx,33h
01372E18 mov eax,0CCCCCCCCh
01372E1D rep stos dword ptr es:[edi]
01372E1F pop ecx
01372E20 mov dword ptr [ebp-8],ecx
// a single instruction
return functionComplex();
01372E23 mov ecx,dword ptr [this]
01372E26 call TestClass::functionComplex (1371627h)
}
Hence, functionComplex is not inlined either.
No, it would not be inlined. It is impossible because the compiler does not have the body of a non-inlined function available when it is located in another translation unit. (I assume "normal function" means a non-inlined function.)
No: if you want your complex function to be inserted inline, you must specify the inline keyword on it too.
In practice, use the __forceinline keyword on Windows (MSVC) or __attribute__((always_inline)) with GCC/Clang; otherwise the compiler may ignore the plain inline hint if the function contains many instructions.
First of all, inline is just a hint to the compiler. It is not guaranteed that the compiler will do the inlining.
Secondly, when you specify a function as inline, it tells the compiler two things:
1) The function is a candidate for inlining. Whether it will actually be inlined is not guaranteed.
2) The function's definition may appear in multiple translation units without violating the one-definition rule, so it can live in a header. (Note that, contrary to a common misconception, inline does not give the function internal linkage.)
In your case, functionInline is specified as inline but functionComplex is not, so functionComplex's definition is available only in its own translation unit, and the compiler cannot inline it elsewhere without whole-program optimization.
So a plain answer to your question is "no": a normal function (one without the inline keyword, defined outside the class in a separate translation unit) will not be inlined.