How are global pointer variables stored in memory? - c++

Suppose we have some simple code:
int* q = new int(13);
int main() {
return 0;
}
Clearly, the variable q is global and initialized. From this answer, we expect q to be stored in the initialized data segment (.data) within the program file, but it is a pointer, so its value (an address in the heap segment) is only determined at run time. So what value is stored in the data segment within the program file?
My attempt:
My thinking is that the compiler allocates some space for q (typically 8 bytes, for a 64-bit address) in the data segment with no meaningful value, and then emits some initialization code in the text segment, run before main, that initializes q at run time. Something like this in assembly:
....
mov edi, 4
call operator new(unsigned long)
mov DWORD PTR [rax], 13 // rax: 64 bit address (pointer value)
// offset : q variable offset in data segment, calculated by compiler
mov QWORD PTR [ds+offset], rax // store address in data segment
....
main:
....
Any idea?

Yes, that is essentially how it works.
Note that in ELF, .data, .bss, and .text are actually sections, not segments. You can look at the assembly yourself by running your compiler:
c++ -S -O2 test.cpp
You will typically see a main function, and some kind of initialization code outside that function. The program entry point (part of your C++ runtime) will call the initialization code and then call main. The initialization code is also responsible for running things like constructors.

int *q will go in the .bss, not the .data section, since it's only initialized at run-time by a non-constant initializer (so this is only legal in C++, not in C). There's no need to have 8 bytes in the executable's data segment for it.
The compiler arranges for the initializer function to be run by putting its address into an array of initializers that the CRT (C Run-Time) startup code calls before calling main.
On the Godbolt compiler explorer, you can see the init function's asm without all the noise of directives. Notice that the addressing mode is just a simple RIP-relative access to q. The linker fills in the right offset from RIP at this point, since that's a link-time constant even though the .text and .bss sections end up in separate segments.
Godbolt's compiler-noise filtering isn't ideal for us. Some of the directives are relevant, but many of them aren't. Below is a hand-chosen mix of gcc6.2 -O3 asm output with Godbolt's "filter directives" option unchecked, for just the int* q = new int(13); statement. (No need to compile a main at the same time, we're not linking an executable).
# gcc6.2 -O3 output
_GLOBAL__sub_I_q: # presumably stands for subroutine
sub rsp, 8 # align the stack for calling another function
mov edi, 4 # 4 bytes
call operator new(unsigned long) # this is the demangled name, like from objdump -dC
mov DWORD PTR [rax], 13
mov QWORD PTR q[rip], rax # clang uses the equivalent `[rip + q]`
add rsp, 8
ret
.globl q
.bss
q:
.zero 8 # reserve 8 bytes in the BSS
There's no reference to the base of the ELF data (or any other) segment.
Also definitely no segment-register overrides. ELF segments have nothing to do with x86 segments. (And the default segment register for this is DS anyway, so the compiler doesn't need to emit [ds:rip+q] or anything. Some disassemblers may be explicit and show DS even though there was no segment-override prefix on the instruction, though.)
This is how the compiler arranges for it to be called before main():
# the "aw" sets options / flags for this section to tell the linker about it.
.section .init_array,"aw"
.align 8
.quad _GLOBAL__sub_I_q # this assembles to the absolute address of the function.
The CRT start code has a loop that knows the size of the .init_array section and uses a memory-indirect call instruction on each function-pointer in turn.
The .init_array section is marked writeable, so it goes into the data segment. I'm not sure what writes it. Maybe the CRT code marks it as already-done by zeroing the pointers after calling them?
There's a similar mechanism in Linux for running initializers in dynamic libraries, which is done by the ELF interpreter while doing dynamic linking. This is why you can call printf() or other glibc stdio functions from _start in a dynamically-linked binary created from hand-written asm, and why that fails in a statically linked binary if you don't call the right init functions. (See this Q&A for more about building static or dynamic binaries that define their own _start or just main(), with or without libc).

Related

Assembly: Why there is an empty memory on stack?

I used an online compiler to compile some simple C++ code:
int main()
{
int a = 4;
int&& b = 2;
}
and the main function part of the assembly code compiled by GCC 11.2 is shown below:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 4
mov eax, 2
mov DWORD PTR [rbp-20], eax
lea rax, [rbp-20]
mov QWORD PTR [rbp-16], rax
mov eax, 0
pop rbp
ret
I notice that when initializing a, a single instruction moves an immediate operand directly to memory, while for the r-value reference b, the code first stores the immediate value into register eax and then moves it to memory. Also, there are 4 unused bytes between [rbp-8] and [rbp-4]. I had thought immediate values simply exist in the instruction stream, so I don't see why the register step is needed (my guess was that it's some kind of signal for initialization). I want to know more about the underlying logic.
So my question is that:
Why does the initialization differ?
Why are there 4 unused bytes on the stack?
Let me address the second question first.
Note that there are actually three objects defined in this function: the int variable a, the reference b (implemented as a pointer), and the unnamed temporary int with a value of 2 that b points to. In unoptimized compilation, each of these objects needs to be stored at some unique location on the stack, and the compiler allocates stack space naively, processing the variables one by one and assigning each one space below the previous. It evidently chooses to handle them in the following order:
The variable a, an int needing 4 bytes. It goes in the first available stack slot, at [rbp-4].
The reference b, stored as a pointer needing 8 bytes. You might think it would go at [rbp-12], but the x86-64 ABI requires that pointers be naturally aligned on 8-byte boundaries. So the compiler moves down another 4 bytes to achieve this alignment, putting b at [rbp-16]. The 4 bytes at [rbp-8] are unused so far.
The temporary int, also needing 4 bytes. The compiler puts it right below the previously placed variable, at [rbp-20]. True, there was space at [rbp-8] that could have been used instead, which would be more efficient; but since you told the compiler not to optimize, it doesn't perform this optimization. It would if you used one of the -O flags.
As to why a is initialized with an immediate store to memory, whereas the temporary is initialized via a register: to really answer this, you'd have to read the details of the GCC source code, and frankly I don't think you'll find that there is anything very interesting behind it. Presumably there are different code paths in the compiler for creating and initializing named variables versus temporaries, and the code for temporaries may happen to be written as two steps.
It may be that for convenience, the programmer chose to create an extra object in the intermediate representation (GIMPLE or RTL), perhaps because it simplifies the compiler code in handling more general cases. They wouldn't take any trouble to avoid this, because they know that later optimization passes will clean it up. But if you have optimization turned off, this doesn't happen and you get actual instructions emitted for this unnecessary transfer.
In
int a = 4;
you declare a (typically) 4-byte variable and ask the compiler to fill it with the bit representation of 4.
In
int&& b = 2;
you declare a reference ("r-value reference") to, well, to what? To a literal? Is it possible? In C++ references are typically translated, on the assembly level, into pointers. So one can expect that b will be "a pointer in disguise", that is, without the * and -> semantics. But it will likely occupy 64 bits on a 64-bit machine. Now, pointers must point to some memory stored in RAM, not in registers, cache(s) etc. So the compiler most likely creates a temporary (unnamed) integer, initializes it with 2, and then binds its address to b. I write "most likely" because I doubt the standard standardizes this in such great detail. What we know for sure is that there is an extra unnamed variable involved in the initialization of b in int&& b = 2;.
As for the assembler, I have too little knowledge of it to dare explain anything to you. I guess, however, that the concept of a temporary variable and a pointer behind the && reference solves all your problems here.

How many nested subroutine calls can x86 processors handle? [duplicate]

This question already has answers here:
C/C++ maximum stack size of program on mainstream OSes
(7 answers)
Closed 1 year ago.
I'm writing a small program to print a polygon with printf("\219"), just to see if what I'm doing is right for my kernel. But it needs to call many functions, and I don't know whether x86 processors can accept that many nested subroutine calls, and I can't find results on Google. So my question is: will it accept so many function calls, and what is the maximum? (I mean something like this:)
function a() {b();}
function b() {c();}
function c() {d();}
...
I've used 5 such levels (you know what I mean, right?)
Your function depth is not limited by the processor, but by the size of your stack. Behind the scenes, calls to C++ functions usually translate to call instructions in x86, which push four (or eight, for x64 programs) bytes onto your program's stack for the return pointer. Sometimes calls are optimized and don't touch the stack at all. Functions might also push additional bytes (e.g. local function state) onto the stack.
To get the exact number of functions you can call, you need to disassemble your code to calculate the number of bytes each function pushes to the stack (minimum four/eight because of the return address, but likely many more), then find the maximum stack size and divide it by the function frame size.
call and ret instructions aren't special; the CPU doesn't know how deeply nested it is. (And doesn't know the difference between calling another function or being recursive.) As described in my answer here, functions are a high-level concept that asm gives you the tools to implement.
All the CPU knows is whether pushing a return address causes a page fault or not, if you run out of stack space. (A "stack overflow"). Usually the result of recursion going too deep, or a huge array in automatic storage (a C++ local var) or alloca.
(As mentioned in comments, not every C function call results in a call instruction; inlining can fully optimize away the fact that it's a separate function. You want small functions to inline, and the design of the template classes in the C++ standard library depends on this for efficiency. Also, a tailcall can be just a jmp, having the next function take over this stack space instead of getting new space.)
Linux typically uses 8 MiB stacks for user-space processes. (ulimit -s). Windows is typically 1 MiB. Inside a kernel, you often use smaller stacks. e.g. Linux kernel thread-stacks are currently 16 kiB for x86-64. Previously as small as 4 kiB (one page) for 32-bit x86, but some code has more local vars that take up stack space.
Related: How does the stack work in assembly language?. When ret executes, it just pops into the program counter. It's up to the programmer (or compiler) to create asm that runs ret when the stack pointer is pointing at somewhere you want to jump. (Typically a return address.)
In modern Linux systems, the smallest stack frame is 16 bytes, because the ABI specifies maintaining 16-byte stack alignment before a call. So best case you can have a call depth of 512k before you overflow the stack. (Unless you're in a thread that was started with an extra large thread-stack).
If you're in 32-bit mode with an old version of the i386 System V ABI that only required 4-byte stack alignment (like gcc -mpreferred-stack-boundary=2 instead of the default 4), with functions that just called without using any other stack space, 8 MiB of stack would give you 8 MiB / 4B = 2 Mi call depth.
In real life, some of that 8MiB space on the main thread's stack is used up by env vars and argv[] already on the stack at the process entry point (copied there by the kernel), and then a bit more as _start calls a function that calls main.
To make this able to actually return at the end, instead of just eventually faulting, you'd need either a huge chain, or some recursion with a termination condition, like
void recurse(int n) {
if (n == 1)
return;
recurse(n - 1);
}
and compile with some optimization but not enough to get the compiler to turn it into a do{}while(--n); loop or optimize away.
If you wanted all different functions, that's ok, the code-size limit is at least 2GiB, and call rel32 / ret takes a total of 6 bytes. (Un-optimized code would default to push ebp or push rbp as well, so you'd have to avoid that for 32-bit code if you wanted to meet that 4-byte stack-frame goal of having all your stack space full with just return addresses).
For example with GCC (see How to remove "noise" from GCC/clang assembly output? for the __attribute__((noipa)) option)
__attribute__((noipa)) // don't let other functions even notice that this is empty, let alone inline it
void foo(void){}
__attribute__((noipa))
void bar(){
foo();
}
void baz(){
bar();
}
compiles with GCC (Godbolt compiler explorer) to this 32-bit asm which just calls without using any stack space for anything else:
## GCC11.2 -O1 -m32 -mpreferred-stack-boundary=2
foo():
ret
bar():
call foo()
ret
baz():
call bar()
ret
g++ -O2 would optimize these call/ret functions into a tailcall like jmp foo, which doesn't push a return address. That's why I only used -O1 or -Og. Of course in real life you do want things to inline and optimize away; defeating that is just for this silly computer trick to achieve the longest finite call depth that actually would crash if you made it longer.
You can repeat this pattern indefinitely; GCC allows long symbol names, so it's not a problem to have many different unique function names. You can split across multiple files, with just a prototype for one of the functions in the other file.
If you reduce the -falign-functions tuning setting, you can probably get it down to either 6 bytes per function (no padding) or 8 (align by 8 thus 2 bytes of padding), down from the default of aligning each function label by 16, wasting 10 bytes per function.
Recursive:
I got GCC to make asm that recurses with no gaps between return address:
void recurse(int n){
if (!--n)
return;
recurse(n);
}
## G++11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1 # --n
jne .L6 # skip over the ret if the result is non-zero
.L4:
ret
.L6:
call recurse(int) # push a 4-byte ret addr and jump to top
jmp .L4 # silly compiler, should put another ret here. But we told it not to optimize too much
Note the -mregparm=3 option, so the first up-to-3 args are passed in registers, instead of on the stack in the inefficient i386 System V calling convention. (It was designed a long time ago; the x86-64 SysV calling convention is much better.)
With -O2, this function optimizes away to just a ret. (It turns the call into a tailcall which becomes a loop, and then it can see it's a non-infinite loop with no side effects so it just removes it.)
Of course in real life you want optimizations like this. Recursion in asm sucks compared to loops. If you're worried about robust code and call depth, don't write recursive functions in the first place, unless you know recursion depth will be shallow. If a debug build doesn't convert your recursion to iteration, you don't want your kernel to crash.
Just for fun, I got GCC to tighten up the asm even at -O1 by convincing it to do the conditional branch over the call, to only have one ret instruction in the function so tail-duplication wouldn't be relevant anyway. And it means the fast path (recursion) involves a not-taken macro-fused conditional branch plus a call.
void recurse(int n){
if (__builtin_expect(!!--n, 1))
recurse(n);
}
Godbolt
## GCC 11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1
je .L1
call recurse(int)
.L1:
ret

Understanding the assembly language for if-else in following code [duplicate]

What happens if I say call instead of jmp? Since there is no return statement written, does control just pass to the next line below, or is it still returned to the line after the call?
start:
mov $0, %eax
jmp two
one:
mov $1, %eax
two:
cmp %eax, $1
call one
mov $10, %eax
The CPU always executes the next instruction in memory, unless a branch instruction sends execution somewhere else.
Labels don't have a width, or any effect on execution. They just allow you to make reference to this address from other places. Execution simply falls through labels, even off the end of your code if you don't avoid that.
If you're familiar with C or other languages that have goto (example), the labels you use to mark places you can goto to work exactly the same as asm labels, and jmp / jcc work exactly like goto or if(EFLAGS_condition) goto. But asm doesn't have special syntax for functions; you have to implement that high-level concept yourself.
If you leave out the ret at the end of a block of code, execution keeps going and decodes whatever comes next as instructions. (Maybe What would happen if a system executes a part of the file that is zero-padded? if that was the last function in an asm source file, or maybe execution falls into some CRT startup function that eventually returns.)
(In which case you could say that the block you're talking about isn't a function, just part of one, unless it's a bug and a ret or jmp was intended.)
You can (and maybe should) try this yourself in a debugger. Single-step through that code and watch RSP and RIP change. The nice thing about asm is that the total state of the CPU (excluding memory contents) is not very big, so it's possible to watch the entire architectural state in a debugger window. (Well, at least the interesting part that's relevant for user-space integer code, so excluding model-specific registers that only the OS can tweak, and excluding the FPU and vector registers.)
call and ret aren't "special" (i.e. the CPU doesn't "remember" that it's inside a "function").
They just do exactly what the manual says they do, and it's up to you to use them correctly to implement function calls and returns. (e.g. make sure the stack pointer is pointing at a return address when ret runs.) It's also up to you to get the calling convention correct, and all that stuff. (See the x86 tag wiki.)
There's also nothing special about a label that you jmp to vs. a label that you call. An assembler just assembles bytes into the output file, and remembers where you put label markers. It doesn't truly "know" about functions the way a C compiler does. You can put labels wherever you want, and it doesn't affect the machine code bytes.
Using the .globl one directive would tell the assembler to put an entry in the symbol table so the linker could see it. That would let you define a label that's usable from other files, or even callable from C. But that's just meta-data in the object file and still doesn't put anything between instructions.
Labels are just part of the machinery that you can use in asm to implement the high-level concept of a "function", aka procedure or subroutine: A label for callers to call to, and code that will eventually jump back to a return address the caller passed, one way or another. But not every label is the start of a function. Some are just the tops of loops, or other targets of conditional branches within a function.
Your code would run exactly the same way if you emulated call with an equivalent push of the return address and then a jmp.
one:
mov $1, %eax
# missing ret so we fall through
two:
cmp %eax, $1
# call one # emulate it instead with push+jmp
pushl $.Lreturn_address
jmp one
.Lreturn_address:
mov $10, %eax
# fall off into whatever comes next, if it ever reaches here.
Note that this sequence only works in non-PIC code, because the absolute return address is encoded into the push imm32 instruction. In 64-bit code with a spare register available, you can use a RIP-relative lea to get the return address into a register and push that before jumping.
Also note that while architecturally the CPU doesn't "remember" past CALL instructions, real implementations run faster by assuming that call/ret pairs will be matched, and use a return-address predictor to avoid mispredicts on the ret.
Why is RET hard to predict? Because it's an indirect jump to an address stored in memory! It's equivalent to pop %internal_tmp / jmp *%internal_tmp, so you can emulate it that way if you have a spare register to clobber (e.g. rcx is not call-preserved in most calling conventions, and not used for return values). Or if you have a red-zone so values below the stack-pointer are still safe from being asynchronously clobbered (by signal handlers or whatever), you could add $8, %rsp / jmp *-8(%rsp).
Obviously for real use you should just use ret, because it's the most efficient way to do that. I just wanted to point out what it does using multiple simpler instructions. Nothing more, nothing less.
Note that functions can end with a tail-call instead of a ret:
(see this on Godbolt)
int ext_func(int a); // something that the optimizer can't inline
int foo(int a) {
return ext_func(a+a);
}
# asm output from clang:
foo:
add edi, edi
jmp ext_func # TAILCALL
The ret at the end of ext_func will return to foo's caller. foo can use this optimization because it doesn't need to make any modifications to the return value or do any other cleanup.
In the SystemV x86-64 calling convention, the first integer arg is in edi. So this function replaces that with a+a, then jumps to the start of ext_func. On entry to ext_func, everything is in the correct state just like it would be if something had run call ext_func. The stack pointer is pointing to the return address, and the args are where they're supposed to be.
Tail-call optimizations can be done more often in a register-args calling convention than in a 32-bit calling convention that passes args on the stack. You often run into situations where you have a problem because the function you want to tail-call takes more args than the current function, so there isn't room to rewrite our own args into args for the function. (And compilers don't tend to create code that modifies its own args, even though the ABI is very clear that functions own the stack space holding their args and can clobber it if they want.)
In a calling convention where the callee cleans the stack (with ret 8 or something to pop another 8 bytes after the return address), you can only tail-call a function that takes exactly the same number of arg bytes.
Your intuition is correct: the control just passes to the next line below after the function returns.
In your case, after call one, execution will jump to mov $1, %eax, then continue down to cmp %eax, $1, and call one again, ending up in an infinite loop.
Beyond just being an infinite loop, your program will eventually exceed its memory limits, since each call instruction pushes a return address (the address of the instruction after the call) onto the stack. Eventually, you'll overflow the stack.

Trouble with C and C++ compiler

I'm having a problem trying to convert a 32 bit product into a 64 bit product. I'm using Visual Studio 2008 and the code is in C and C++. I would like anyone to look at the following two lines of code, one from a C source file and the other from a C++ source file. Both of these files are included in a DLL. I also include the disassembly of both lines of code.
ewxlcom.c
memcpy(pCM->pSecAccInfo->spUserID,userSecurityInfo.spUserID,
sizeof(UserID));
000000000EF33BB9 mov r8d,80h
000000000EF33BBF mov rdx,qword ptr [rsp+828h]
000000000EF33BC7 mov rcx,qword ptr [rsp+1F8h]
000000000EF33BCF mov rcx,qword ptr [rcx+0BDEh]
000000000EF33BD6 call memcpy (0EF40352h)
tcputil.cpp
memcpy(serv_temp+INIT_MSG_USERID_OFFSET, pCM->pSecAccInfo->spUserID, INIT_MSG_USERID_LEN);
000000000EF3B8E6 lea rcx,[rsp+67h]
000000000EF3B8EB mov r8d,80h
000000000EF3B8F1 mov rdx,qword ptr [rsp+3B0h]
000000000EF3B8F9 mov rdx,qword ptr [rdx+0CBEh]
000000000EF3B900 call memcpy (0EF40352h)
As you can see, the first line copies some bytes into the memory pointed to by pCM->pSecAccInfo->spUserID. And the second line copies those same bytes into another place in memory. The ASM memcpy copies bytes from memory pointed to by register rdx to memory pointed to by register rcx. So in the first line a value is moved into register rcx. This I have verified to point to pCM. Then the value pointed to by rcx + 0BDEh is copied into rcx. And the memcpy is called. This works.
But later on in the second line a value is loaded into register rdx. This I have verified to also point to the same pCM as in the first line. It then loads the pointer residing in memory that is offset from pCM (rdx) by 0CBEh. That memory is all zeros, so memcpy crashes.
The question is why the compiler would produce different code for the same source variable. I think it's an alignment problem. Is it the difference between a C file and a C++ file? Does VS use the same compiler for both C and C++? Are there any other things I should be looking at?
Any help would be appreciated.
If you're linking C & C++ code, you might need to be careful about different padding characteristics in your structs. Perhaps create a temporary function to print the offsets of each member of the struct, and copy that same code from a C source file (where you wrote it) to a C++ source file. The two copies of the functions can remain the same, since the C++ one will be mangled, but I'd add a printf() at the top of each to say which version it is. Then call each one from somewhere before the crash so you can compare the offsets. If they're different, you'll need to look into compiler flags to fix that problem. OR... perhaps you need to add lines like this...
#ifdef __cplusplus
extern "C" {
#endif
.
. ...your struct definitions & variables go here...
.
#ifdef __cplusplus
}
#endif
...around your struct definitions to get the C++ side to have the same padding behavior as the C side of your project.

How to read memory address in gdb for i7 processor code disassembly?

I'm trying to read the location of a variable in memory at runtime, using gdb within Eclipse, but can't really see which one is the correct address. Here is the output of gdb when I disassembly my program:
main():
0000000000400634: push %rbp
0000000000400635: mov %rsp,%rbp
5 int i = 7;
0000000000400638: movl $0x7,-0x4(%rbp)
6 int j = 8;
000000000040063f: movl $0x8,-0x8(%rbp)
8 return 0;
0000000000400646: mov $0x0,%eax
9 }
and what I want is the location of the variable i at runtime.
I'm guessing it's -0x4(%rbp), but then how can I figure out what address that is?
Should I take the current value of rbp and subtract 4 from it?
In this case, the value inside rbp is 0x7fffffffe250.
Thus, would the location of i in memory at runtime be 0x7fffffffe250 - 0x4?
Or is it just 0x7fffffffe250?
Your guess is correct: taking the value of %rbp within that function and subtracting 4 gives the address that i is being stored at. This address is not predictable, though, as it depends on the position of the stack at runtime.
Moreover, you should keep in mind that not all variables will have a fixed location, either in memory or in a register -- the compiler may end up moving a value between multiple locations, or optimize an intermediate value out entirely if it's unnecessary.