Why would inline assembly crash in release only? [duplicate] - c++

I have a scenario in GCC that is causing me problems. The behaviour I get is not the behaviour I expect. To summarise the situation, I am proposing several new instructions for x86-64 which are implemented in a hardware simulator. In order to test these instructions I am taking existing C source code and hand-coding the new instructions in hexadecimal. Because these instructions interact with the existing x86-64 registers, I use the input/output/clobber lists to declare dependencies for GCC.
What's happening is that if I call a function, e.g. printf, the dependent registers aren't saved and restored.
For example
register unsigned long r9 asm ("r9") = 101;
printf("foo %s\n", "bar");
asm volatile (".byte 0x00, 0x00, 0x00, 0x00" : /* no output */ : "q" (r9) );
101 was assigned to r9, and the inline assembly (fake in this example) depends on r9. This runs correctly in the absence of the printf, but when it is present, GCC does not save and restore r9, and a different value is in the register by the time my custom instruction is reached.
I thought perhaps that GCC might have secretly changed the assignment to the variable r9, but when I do this
asm volatile (".byte %0" : /* no output */ : "q" (r9) );
and look at the assembly output, it is indeed using %r9.
I am using gcc 4.4.5. What do you think might be happening? I thought GCC would always save and restore registers around function calls. Is there some way I can enforce that?
Thanks!
EDIT: By the way, I'm compiling the program like this
gcc -static -m64 -mmmx -msse -msse2 -O0 test.c -o test

The ABI, section 3.2.1 says:
Registers %rbp, %rbx and %r12 through %r15 “belong” to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers’ values for its caller. Remaining registers “belong” to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.
so you shouldn't expect registers other than %rbp, %rbx and %r12 through %r15 to be preserved by a function call.

gcc will not make explicit-register variables like this callee-saved. Basically this register notation you're using makes the variable a direct alias for the register, with the assumption you want to be able to read back the value a callee leaves in the register. If you used a callee-saved register instead of a call-clobbered (caller-saved) register, the problem would go away.
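A minimal sketch of that fix, assuming the setup from the question (GCC on x86-64, SysV ABI): the variable is pinned to r12, which is callee-saved, and the .byte placeholder is changed to NOPs here so the sketch is actually safe to run. The register choice and the placeholder bytes are assumptions, not the asker's real instruction.

#include <stdio.h>

int main(void)
{
    /* r12 is callee-saved per the ABI quote above, so printf must
       preserve it, unlike the call-clobbered r9. */
    register unsigned long v asm ("r12") = 101;

    printf("foo %s\n", "bar");

    /* Placeholder for the custom instruction: two NOPs instead of the
       question's zero bytes, so this example executes harmlessly. */
    asm volatile (".byte 0x90, 0x90"
                  : /* no output */
                  : "r" (v));

    return 0;
}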

Related

The implementation difference between the interpreter and the JIT compiler for AtomicInteger.lazySet() in the HotSpot JVM

Refer to the existing discussion at AtomicInteger lazySet vs. set for the background of AtomicInteger.lazySet().
So according to the semantics of AtomicInteger.lazySet(), on an x86 CPU, AtomicInteger.lazySet() is equivalent to a normal write to the value of the AtomicInteger, because the x86 memory model guarantees the order among write operations.
However, the runtime behaviour of AtomicInteger.lazySet() differs between the interpreter and the JIT compiler (the C2 compiler specifically) in the JDK 8 HotSpot JVM, which confuses me.
First, create a simple Java application for demo.
import java.util.concurrent.atomic.AtomicInteger;

public class App {
    public static void main(String[] args) throws Exception {
        AtomicInteger i = new AtomicInteger(0);
        i.lazySet(1);
        System.out.println(i.get());
    }
}
Then, dump the instructions for AtomicInteger.lazySet() which come from the intrinsic method provided by the C2 compiler:
$ java -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation -XX:CompileCommand=print,*AtomicInteger.lazySet App
...
0x00007f1bd927214c: mov %edx,0xc(%rsi) ;*invokevirtual putOrderedInt
; - java.util.concurrent.atomic.AtomicInteger::lazySet@8 (line 110)
As you can see, the operation is, as expected, a normal write.
Then, use GDB to trace the runtime behaviour of the interpreter for AtomicInteger.lazySet().
$ gdb --args java App
(gdb) b Unsafe_SetOrderedInt
0x7ffff69ae836 callq 0x7ffff69b6642 <OrderAccess::release_store_fence(int volatile*, int)>
0x7ffff69b6642:
push %rbp
mov %rsp,%rbp
mov %rdi,-0x8(%rbp)
mov %esi,-0xc(%rbp)
mov -0xc(%rbp),%eax
mov -0x8(%rbp),%rdx
xchg %eax,(%rdx)           // the write operation
mov %eax,-0xc(%rbp)
nop
pop %rbp
retq
As you can see, the operation is actually an XCHG instruction, which has implicit lock semantics and brings exactly the performance overhead that AtomicInteger.lazySet() is intended to eliminate.
Does anyone know why there is such a difference? Thanks.
There is not much sense in optimizing a rarely used operation in the interpreter. This would increase development and maintenance costs for no visible benefit.
It's a common practice in HotSpot to implement optimizations only in a JIT compiler (either C2 or both C1+C2). The interpreter implementation just needs to work; it does not need to be fast, because if the code is performance-sensitive, it will be JIT-compiled anyway.
So, in the interpreter (i.e. on the slow path), Unsafe_SetOrderedInt is exactly the same as a volatile write.
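For comparison only (this is not the JVM's code), the same distinction can be sketched with C11 atomics: on x86-64 a release store typically compiles to a plain mov, like the C2-compiled lazySet(), while a sequentially consistent store typically compiles to an xchg (or mov plus mfence), like the interpreter's release_store_fence path.

#include <stdatomic.h>

atomic_int value;

/* Analogous to AtomicInteger.lazySet(): a release store.
   On x86-64 this is usually emitted as a plain mov. */
void lazy_set(int v)
{
    atomic_store_explicit(&value, v, memory_order_release);
}

/* Analogous to a Java volatile write: a seq_cst store.
   On x86-64 this is usually emitted as xchg (or mov + mfence). */
void volatile_set(int v)
{
    atomic_store_explicit(&value, v, memory_order_seq_cst);
}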

How to force GCC to use jmp instruction instead of ret?

I am using stackful co-routines for network programming, but I am being punished by invalidation of the return stack buffer (see
http://www.agner.org/optimize/microarchitecture.pdf p.36) during the context switch (because we manually change the SP register).
I found out through an assembly-language test that the jmp instruction is better than ret here. However, I have some more functions, written in C++ (compiled by GCC), that indirectly call the context-switch function. How can I force these functions to return using jmp instead of ret in GCC's assembly output?
Some common but not perfect methods:
Using inline assembly to manually set the SP register to __builtin_frame_address + 2*sizeof(void*) and jmp to the return address before the ret?
This is an unsafe solution. In C++, local variables and temporaries (rvalues) are destroyed before the ret instruction; we would skip those instructions if we jmp. What's worse, even in C, callee-saved registers need to be restored before the ret instruction, and we would skip those instructions too.
So what can we do to force GCC to use jmp instead of ret and avoid the problems listed above?
Use an assembler macro:
.macro ret
pop %ecx
jmp *%ecx
.endm
Put that in inline assembler at the top of the file or elsewhere.
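Following that answer, one way to make the macro visible to the assembler from C code compiled with GCC is a file-scope asm statement. This is only a sketch: it assumes x86-64 (hence %rcx rather than the 32-bit %ecx above; %rcx is call-clobbered, so overwriting it on return is harmless), and -fno-toplevel-reorder may be needed so the macro definition stays ahead of the function bodies in the emitted assembly.

/* Redefine 'ret' as an assembler macro for this translation unit, as in
   the answer above: pop the return address and jmp to it instead. */
asm (".macro ret\n\t"
     "pop %rcx\n\t"
     "jmp *%rcx\n\t"
     ".endm");

/* Every 'ret' the compiler emits after this point in the assembly output
   is expanded by the assembler into the pop/jmp sequence. */
int add_one(int x)
{
    return x + 1;
}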

A deeper look into variable initialization [duplicate]

In C, let's say you have a variable called variable_name. Let's say it's located at 0xaaaaaaaa, and at that memory address, you have the integer 123. So in other words, variable_name contains 123.
I'm looking for clarification around the phrasing "variable_name is located at 0xaaaaaaaa". How does the compiler recognize that the string "variable_name" is associated with that particular memory address? Is the string "variable_name" stored somewhere in memory? Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?
Variable names don't exist anymore after the compiler runs (barring special cases like exported globals in shared libraries or debug symbols). The entire act of compilation is intended to take those symbolic names and algorithms represented by your source code and turn them into native machine instructions. So yes, if you have a global variable_name, and compiler and linker decide to put it at 0xaaaaaaaa, then wherever it is used in the code, it will just be accessed via that address.
So to answer your literal questions:
How does the compiler recognize that the string "variable_name" is associated with that particular memory address?
The toolchain (compiler & linker) works together to assign a memory location for the variable. It's the compiler's job to keep track of all the references, and the linker puts in the right addresses later.
Is the string "variable_name" stored somewhere in memory?
Only while the compiler is running.
Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it, and if so, wouldn't it have to use memory in order to make that substitution?
Yes, that's pretty much what happens, except it's a two-stage job with the linker. And yes, it uses memory, but it's the compiler's memory, not anything at runtime for your program.
An example might help you understand. Let's try out this program:
int x = 12;

int main(void)
{
    return x;
}
Pretty straightforward, right? OK. Let's take this program, and compile it and look at the disassembly:
$ cc -Wall -Werror -Wextra -O3 example.c -o example
$ otool -tV example
example:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl 0x00000096(%rip),%eax
0000000100000f6a popq %rbp
0000000100000f6b ret
See that movl line? It's grabbing the global variable (in an instruction-pointer relative way, in this case). No more mention of x.
Now let's make it a bit more complicated and add a local variable:
int x = 12;

int main(void)
{
    volatile int y = 4;
    return x + y;
}
The disassembly for this program is:
(__TEXT,__text) section
_main:
0000000100000f60 pushq %rbp
0000000100000f61 movq %rsp,%rbp
0000000100000f64 movl $0x00000004,0xfc(%rbp)
0000000100000f6b movl 0x0000008f(%rip),%eax
0000000100000f71 addl 0xfc(%rbp),%eax
0000000100000f74 popq %rbp
0000000100000f75 ret
Now there are two movl instructions and an addl instruction. You can see that the first movl is initializing y, which it's decided will be on the stack (base pointer - 4). Then the next movl gets the global x into a register eax, and the addl adds y to that value. But as you can see, the literal x and y strings don't exist anymore. They were conveniences for you, the programmer, but the computer certainly doesn't care about them at execution time.
A C compiler first creates a symbol table, which stores the relationship between the variable name and where it's located in memory. When compiling, it uses this table to replace all instances of the variable with a specific memory location, as others have stated. You can find a lot more on it on the Wikipedia page.
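As a toy illustration of that idea (a deliberately simplified model, not how any real compiler lays out its tables), a symbol table is essentially a name-to-location mapping that exists only while the compiler runs; the names below are hypothetical:

#include <stdio.h>
#include <string.h>

/* A stripped-down symbol table entry: the name exists only in the
   compiler's own memory; the generated code keeps just the address. */
struct symbol {
    const char    *name;     /* "variable_name" */
    unsigned long  address;  /* e.g. 0xaaaaaaaa once a location is assigned */
};

static const struct symbol table[] = {
    { "variable_name", 0xaaaaaaaaUL },
};

static unsigned long lookup(const char *name)
{
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].address;
    return 0;  /* not found */
}

int main(void)
{
    /* The "substitution" step: every use of the name becomes its address. */
    printf("variable_name -> %#lx\n", lookup("variable_name"));
    return 0;
}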
All variables are substituted by the compiler. First they are substituted with references, and later the linker places addresses instead of references.
In other words, the variable names are no longer available once the compiler has run.
This is what's called an implementation detail. While what you describe is the case in all compilers I've ever used, it's not required to be the case. A C compiler could put every variable in a hashtable and look them up at runtime (or something like that), and in fact early JavaScript interpreters did exactly that (now they do Just-In-Time compilation that results in something much more raw).
Specifically for common compilers like VC++, GCC, and LLVM: the compiler will generally assign a variable to a location in memory. Variables of global or static scope get a fixed address that doesn't change while the program is running, while variables within a function get a stack address, that is, an address relative to the current stack pointer, which changes every time a function is called. (This is an oversimplification.) Stack addresses become invalid as soon as the function returns, but have the benefit of effectively zero overhead to use.
Once a variable has an address assigned to it, there is no further need for the name of the variable, so it is discarded. Depending on the kind of name, the name may be discarded at preprocess time (for macro names), compile time (for static and local variables/functions), and link time (for global variables/functions.) If a symbol is exported (made visible to other programs so they can access it), the name will usually remain somewhere in a "symbol table" which does take up a trivial amount of memory and disk space.
Does the compiler just substitute variable_name for 0xaaaaaaaa whenever it sees it
Yes.
and if so, wouldn't it have to use memory in order to make that substitution?
Yes, but that memory belongs to the compiler while it compiles your code; why would you care about it once compilation is done?

C++ inline assembly (Intel compiler): LEA and MOV behaving differently in Windows and Linux

I am converting a huge Windows DLL to work on both Windows and Linux. The DLL has a lot of assembly (and SSE2 instructions) for video manipulation.
The code now compiles fine on both Windows and Linux using Intel compiler included in Intel ComposerXE-2011 on Windows and Intel ComposerXE-2013 SP1 on Linux.
The execution, however, crashes on Linux when trying to call a function pointer. I traced the code in gdb and indeed the function pointer doesn't point to the required function (whereas on Windows it does). Almost everything else works fine.
This is the sequence of code:
...
mov rdi, this
lea rdx, [rdi].m_sSomeStruct
...
lea rax, FUNCTION_NAME # if replaced by 'mov', works in Linux but crashes in Windows
mov [rdx].m_pfnFunction, rax
...
call [rdx].m_pfnFunction # crash in Linux
where:
1) 'this' has a struct member m_sSomeStruct.
2) m_sSomeStruct has a member m_pfnFunction, which is a pointer to a function.
3) FUNCTION_NAME is a free function in the same compilation unit.
4) All those pure assembly functions are declared as naked.
5) 64-bit environment.
What is confusing me the most is that if I replace the 'lea' instruction that is supposed to load the function's address into rax with a 'mov' instruction, it works fine on Linux but crashes on Windows. I traced the code in both Visual Studio and gdb and apparently in Windows 'lea' gives the correct function address, whereas in Linux 'mov' does.
I tried looking into the Intel assembly reference but didn't find much to help me there (unless I wasn't looking in the right place).
Any help is appreciated. Thanks!
Edit More details:
1) I tried using square brackets
lea rax, [FUNCTION_NAME]
but that didn't change the behaviour in Windows nor in Linux.
2) I looked at the disassembly in gdb and in Windows; both seem to show the same instructions that I actually wrote. What's even worse is that I tried putting both lea and mov one after the other, and when I look at them in the disassembly in gdb, the address printed after each instruction following a # sign (which I'm assuming is the address that's going to be stored in the register) is actually the same, and is NOT the correct address of the function.
It looked like this in gdb disassembler
lea 0xOffset1(%rip), %rax # 0xSomeAddress
mov 0xOffset2(%rip), %rax # 0xSomeAddress
where both (SomeAddress) values were identical, and the two offsets differed by exactly the distance between the lea and mov instructions.
But somehow, when I check the contents of the registers after each instruction executes, mov seems to put in the correct value!
3) The member variable m_pfnFunction is of type LOAD_FUNCTION which is defined as
typedef void (*LOAD_FUNCTION)(const void*, void*);
4) The function FUNCTION_NAME is declared in the .h (within a namespace) as
void FUNCTION_NAME(const void* , void*);
and implemented in .cpp as
__declspec(naked) void namespace_name::FUNCTION_NAME(const void* , void*)
{
...
}
5) I tried turning off optimizations by adding
#pragma optimize("", off)
but I still have the same issue
Off hand, I suspect that the way linking to DLLs works in the latter case is that FUNCTION_NAME is a memory location that actually will be set to the loaded address of the function. That is, it's a reference (or pointer) to the function, not the entry point.
I'm familiar with Win (not the other), and I've seen how calling a function might either
(1) generate a CALL to that address, which is filled in at link time. Normal enough for functions in the same module, but if it's discovered at link time that it's in a different DLL, then the Import Library is a stub that the linker treats the same as any normal function, but is nothing more than JMP [????]. The table of addresses to imported functions is arranged to have bytes that code a JMP instruction just before the field that will hold the address. The table is populated at DLL Load time.
(2) If the compiler knows that the function will be in a different DLL, it can generate more efficient code: It codes an indirect CALL to the address located in the import table. The stub function shown in (1) has a symbol name associated with it, and the actual field containing the address has a symbol name too. They both are named for the function, but with different "decorations". In general, a program might contain fixup references to both.
So, I conjecture that the symbol name you used matches the stub function on one platform, and (assuming it works in a similar way) matches the pointer on the other. Maybe the assembler assigns the unmangled name to one or the other depending on whether it is declared as imported, and the options are different on the two toolchains.
Hope that helps. I suppose you could look at run-time in a debugger and see if the above helps you interpret the address and the stuff around it.
After reading about the difference between mov and lea here (What's the purpose of the LEA instruction?), it looks to me like on Linux there is one additional level of indirection added for the function pointer. The mov instruction passes through that extra level of indirection, while on Windows, without the extra indirection, you would use lea.
Are you by any chance compiling with PIC on Linux? I could see that adding the extra indirection layer.
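A hedged sketch of that extra indirection on Linux, using GCC-style AT&T inline assembly (target_function is just a placeholder name, not the asker's FUNCTION_NAME): lea yields the address of the symbol itself, while a mov from the symbol's @GOTPCREL entry loads the address stored in the GOT, which is the value an indirect call through a function pointer needs under PIC.

#include <stdio.h>

/* Placeholder standing in for the question's FUNCTION_NAME. */
void target_function(const void *a, void *b)
{
    (void)a;
    (void)b;
}

int main(void)
{
    void *via_lea, *via_got;

    /* Address of the symbol, computed rip-relative. */
    asm ("lea target_function(%%rip), %0" : "=r" (via_lea));

    /* Address loaded from the symbol's GOT entry: the extra level of
       indirection that position-independent code goes through. */
    asm ("mov target_function@GOTPCREL(%%rip), %0" : "=r" (via_got));

    printf("lea: %p, mov via GOT: %p\n", via_lea, via_got);
    return 0;
}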

Is main() really start of a C++ program?

The section $3.6.1/1 from the C++ Standard reads,
A program shall contain a global function called main, which is the designated start of the program.
Now consider this code,
#include <iostream>

int square(int i) { return i*i; }

int user_main()
{
    for ( int i = 0 ; i < 10 ; ++i )
        std::cout << square(i) << std::endl;

    return 0;
}

int main_ret = user_main();

int main()
{
    return main_ret;
}
This sample code does what I intend it to do, i.e. printing the squares of the integers from 0 to 9, before entering the main() function, which is supposed to be the "start" of the program.
I also compiled it with the -pedantic option, GCC 4.5.0. It gives no error, not even a warning!
So my question is,
Is this code really Standard conformant?
If it's Standard conformant, then doesn't it invalidate what the Standard says? main() is not the start of this program! user_main() executed before main().
I understand that to initialize the global variable main_ret, user_main() executes first, but that is a different thing altogether; the point is that it invalidates the quoted statement $3.6.1/1 from the Standard, as main() is NOT the start of the program; it is in fact the end of this program!
EDIT:
How do you define the word 'start'?
It boils down to the definition of the phrase "start of the program". So how exactly do you define it?
You are reading the sentence incorrectly.
A program shall contain a global function called main, which is the designated start of the program.
The standard is DEFINING the word "start" for the purposes of the remainder of the standard. It doesn't say that no code executes before main is called. It says that the start of the program is considered to be at the function main.
Your program is compliant. Your program hasn't "started" until main is started. The function is called before your program "starts" according to the definition of "start" in the standard, but that hardly matters. A LOT of code is executed before main is ever called in every program, not just this example.
For the purposes of discussion, your function is executed prior to the 'start' of the program, and that is fully compliant with the standard.
No, C++ does a lot of things to "set the environment" prior to the call of main; however, main is the official start of the "user specified" part of the C++ program.
Some of the environment setup is not controllable (like the initial code to set up std::cout); however, some of the environment is controllable, like static global blocks (for initializing static global variables). Note that since you don't have full control prior to main, you don't have full control over the order in which the static blocks get initialized.
After main, your code is conceptually "fully in control" of the program, in the sense that you can both specify the instructions to be performed and the order in which to perform them. Multi-threading can rearrange code execution order; but, you're still in control with C++ because you specified to have sections of code execute (possibly) out-of-order.
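A small sketch of user-visible code that demonstrably runs before main, using the GCC/Clang constructor attribute (a compiler extension, not standard C or C++) as a stand-in for the static-initialization hooks described above:

#include <stdio.h>

/* GCC/Clang extension: functions marked 'constructor' are run by the
   startup code before main() is entered, much like the initialization
   of static globals. */
__attribute__((constructor))
static void before_main(void)
{
    puts("runs before main");
}

int main(void)
{
    puts("main");  /* printed second */
    return 0;
}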
Your program will not link, and thus not run, unless there is a main. However, main() does not mark the start of execution, because objects at file scope have constructors that run beforehand, and it would be possible to write an entire program that runs its course before main() is reached, with main itself having an empty body.
In reality, to enforce this you would have to have one object that is constructed prior to main, and have its constructor invoke the entire flow of the program.
Look at this:
class Foo
{
public:
    Foo()
    {
        // other stuff -- the program's real work could run from here
    }
};

Foo foo;

int main()
{
}
The flow of your program would effectively stem from Foo::Foo()
You tagged the question as "C" too, so speaking strictly about C, your initialization should fail per section 6.7.8 "Initialization" of the ISO C99 standard.
The most relevant in this case seems to be constraint #4 which says:
All the expressions in an initializer for an object that has static storage duration shall be constant expressions or string literals.
So, the answer to your question is that the code is not compliant with the C standard.
You would probably want to remove the "C" tag if you are only interested in the C++ standard.
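A minimal sketch of that constraint (with hypothetical names): compiled as C, the second initializer below is rejected with a diagnostic along the lines of "initializer element is not constant", while the same line is legal C++ (dynamic initialization).

int user_main(void)
{
    return 42;
}

int ok  = 7;            /* constant expression: fine in C and C++ */
int bad = user_main();  /* constraint violation in C (C99 6.7.8p4),
                           but valid dynamic initialization in C++ */

int main(void)
{
    return bad;
}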
Section 3.6 as a whole is very clear about the interaction of main and dynamic initializations. The "designated start of the program" is not used anywhere else and is just descriptive of the general intent of main(). It doesn't make any sense to interpret that one phrase in a normative way that contradicts the more detailed and clear requirements in the Standard.
The compiler often has to add code before main() to be standard compliant, because the standard specifies that initialization of globals/statics must be done before the program is executed. And, as mentioned, the same goes for constructors of objects placed at file scope (globals).
Thus the original question is relevant to C as well, because in a C program you would still have the globals/static initialization to do before the program can be started.
The standards assume that these variables are initialized through "magic", because they don't say how they should be set before program initialization. I think they considered that as something outside the scope of a programming language standard.
Edit: See for example ISO 9899:1999 5.1.2:
All objects with static storage duration shall be initialized (set to their initial values) before program startup. The manner and timing of such initialization are otherwise unspecified.
The theory behind how this "magic" was to be done goes way back to C's birth, when it was a programming language intended to be used only for the UNIX OS, on RAM-based computers. In theory, the program would be able to load all pre-initialized data from the executable file into RAM at the same time as the program itself was loaded into RAM.
Since then, computers and OS have evolved, and C is used in a far wider area than originally anticipated. A modern PC OS has virtual addresses etc, and all embedded systems execute code from ROM, not RAM. So there are many situations where the RAM can't be set "automagically".
Also, the standard is too abstract to know anything about stacks and process memory etc. These things must be done too, before the program is started.
Therefore, pretty much every C/C++ program has some init/"copy-down" code that is executed before main is called, in order to conform with the initialization rules of the standards.
As an example, embedded systems typically have an option called "non-ISO compliant startup" where the whole initialization phase is skipped for performance reasons, and then the code actually starts directly from main. But such systems don't conform to the standards, as you can't rely on the init values of global/static variables.
Your "program" simply returns a value from a global variable. Everything else is initialization code. Thus, the standard holds - you just have a very trivial program and more complex initialization.
main() is a user function called by the C runtime library.
see also: Avoiding the main (entry point) in a C program
Seems like an English semantics quibble. The OP refers to his block of code first as "code" and later as the "program." The user writes the code, and then the compiler writes the program.
Ubuntu 20.04 glibc 2.31 RTFS + GDB
glibc does some setup before main so that some of its functionalities will work. Let's try to track down the source code for that.
hello.c
#include <stdio.h>

int main() {
    puts("hello");
    return 0;
}
Compile and debug:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o hello.out hello.c
gdb hello.out
Now in GDB:
b main
r
bt -past-main
gives:
#0 main () at hello.c:3
#1 0x00007ffff7dc60b3 in __libc_start_main (main=0x555555555149 <main()>, argc=1, argv=0x7fffffffbfb8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffbfa8) at ../csu/libc-start.c:308
#2 0x000055555555508e in _start ()
This already contains the line of the caller of main: https://github.com/cirosantilli/glibc/blob/glibc-2.31/csu/libc-start.c#L308.
The function has a billion ifdefs as can be expected from the level of legacy/generality of glibc, but some key parts which seem to take effect for us should simplify to:
# define LIBC_START_MAIN __libc_start_main
STATIC int
LIBC_START_MAIN (int (*main) (int, char **, char **),
int argc, char **argv,
{
/* Initialize some stuff. */
result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);
exit (result);
}
Before __libc_start_main we are already at _start, which, by adding gcc -Wl,--verbose, we know is the entry point because the linker script contains:
ENTRY(_start)
and is therefore the actual very first instruction executed after the dynamic loader finishes.
To confirm that in GDB, we can get rid of the dynamic loader by compiling with -static:
gcc -ggdb3 -O0 -static -std=c99 -Wall -Wextra -pedantic -o hello.out hello.c
gdb hello.out
and then make GDB stop at the very first instruction executed with starti and print the first instructions:
starti
display/12i $pc
which gives:
=> 0x401c10 <_start>: endbr64
0x401c14 <_start+4>: xor %ebp,%ebp
0x401c16 <_start+6>: mov %rdx,%r9
0x401c19 <_start+9>: pop %rsi
0x401c1a <_start+10>: mov %rsp,%rdx
0x401c1d <_start+13>: and $0xfffffffffffffff0,%rsp
0x401c21 <_start+17>: push %rax
0x401c22 <_start+18>: push %rsp
0x401c23 <_start+19>: mov $0x402dd0,%r8
0x401c2a <_start+26>: mov $0x402d30,%rcx
0x401c31 <_start+33>: mov $0x401d35,%rdi
0x401c38 <_start+40>: addr32 callq 0x4020d0 <__libc_start_main>
By grepping the source for _start and focusing on x86_64 hits we see that this seems to correspond to sysdeps/x86_64/start.S:58:
ENTRY (_start)
/* Clearing frame pointer is insufficient, use CFI. */
cfi_undefined (rip)
/* Clear the frame pointer. The ABI suggests this be done, to mark
the outermost frame obviously. */
xorl %ebp, %ebp
/* Extract the arguments as encoded on the stack and set up
the arguments for __libc_start_main (int (*main) (int, char **, char **),
int argc, char *argv,
void (*init) (void), void (*fini) (void),
void (*rtld_fini) (void), void *stack_end).
The arguments are passed via registers and on the stack:
main: %rdi
argc: %rsi
argv: %rdx
init: %rcx
fini: %r8
rtld_fini: %r9
stack_end: stack. */
mov %RDX_LP, %R9_LP /* Address of the shared library termination
function. */
#ifdef __ILP32__
mov (%rsp), %esi /* Simulate popping 4-byte argument count. */
add $4, %esp
#else
popq %rsi /* Pop the argument count. */
#endif
/* argv starts just at the current stack top. */
mov %RSP_LP, %RDX_LP
/* Align the stack to a 16 byte boundary to follow the ABI. */
and $~15, %RSP_LP
/* Push garbage because we push 8 more bytes. */
pushq %rax
/* Provide the highest stack address to the user code (for stacks
which grow downwards). */
pushq %rsp
#ifdef PIC
/* Pass address of our own entry points to .fini and .init. */
mov __libc_csu_fini@GOTPCREL(%rip), %R8_LP
mov __libc_csu_init@GOTPCREL(%rip), %RCX_LP
mov main@GOTPCREL(%rip), %RDI_LP
#else
/* Pass address of our own entry points to .fini and .init. */
mov $__libc_csu_fini, %R8_LP
mov $__libc_csu_init, %RCX_LP
mov $main, %RDI_LP
#endif
/* Call the user's main function, and exit with its value.
But let the libc call main. Since __libc_start_main in
libc.so is called very early, lazy binding isn't relevant
here. Use indirect branch via GOT to avoid extra branch
to PLT slot. In case of static executable, ld in binutils
2.26 or above can convert indirect branch into direct
branch. */
call *__libc_start_main@GOTPCREL(%rip)
which ends up calling __libc_start_main as expected.
Unfortunately -static makes the bt from main not show as much info:
#0 main () at hello.c:3
#1 0x0000000000402560 in __libc_start_main ()
#2 0x0000000000401c3e in _start ()
If we remove -static and start from starti, we get instead:
=> 0x7ffff7fd0100 <_start>: mov %rsp,%rdi
0x7ffff7fd0103 <_start+3>: callq 0x7ffff7fd0df0 <_dl_start>
0x7ffff7fd0108 <_dl_start_user>: mov %rax,%r12
0x7ffff7fd010b <_dl_start_user+3>: mov 0x2c4e7(%rip),%eax # 0x7ffff7ffc5f8 <_dl_skip_args>
0x7ffff7fd0111 <_dl_start_user+9>: pop %rdx
By grepping the source for _dl_start_user this seems to come from sysdeps/x86_64/dl-machine.h:L147
/* Initial entry point code for the dynamic linker.
The C function `_dl_start' is the real entry point;
its return value is the user program's entry point. */
#define RTLD_START asm ("\n\
.text\n\
.align 16\n\
.globl _start\n\
.globl _dl_start_user\n\
_start:\n\
movq %rsp, %rdi\n\
call _dl_start\n\
_dl_start_user:\n\
# Save the user entry point address in %r12.\n\
movq %rax, %r12\n\
# See if we were run as a command with the executable file\n\
# name as an extra leading argument.\n\
movl _dl_skip_args(%rip), %eax\n\
# Pop the original argument count.\n\
popq %rdx\n\
and this is presumably the dynamic loader entry point.
If we break at _start and continue, this seems to end up in the same location as when we used -static, which then calls __libc_start_main.
When I try a C++ program instead:
hello.cpp
#include <iostream>

int main() {
    std::cout << "hello" << std::endl;
}
with:
g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o hello.out hello.cpp
the results are basically the same, e.g. the backtrace at main is the exact same.
I think the C++ compiler is just calling into hooks to achieve any C++ specific functionality, and things are pretty well factored across C/C++.
TODO:
comment on concrete, easy-to-understand examples of what glibc is doing before main. This gives some ideas: What happens before main in C++?
make GDB show the source itself without us having to look at it separately, possibly with us building glibc ourselves: How to compile my own glibc C standard library from source and use it?
understand how the above source code maps to objects such as crti.o that can be seen with gcc --verbose main.c and which end up getting added to the final link
main is called after initializing all the global variables.
What the standard does not specify is the order of initialization of all the global variables of all the modules and statically linked libraries.
Yes, main is the "entry point" of every C++ program, excepting implementation-specific extensions. Even so, some things happen before main, notably global initialization such as for main_ret.