There are some existing questions about GCC ordering of variables on the stack. However, those usually involve intermixed variables and arrays, and this is not that. I'm working with the GCC 9.2.0 64-bit release, with no special flags on. If I do this:
#include <iostream>

int main() {
    int a = 15, b = 30, c = 45, d = 60;
    // std::cout << &a << std::endl;
    return 0;
}
Then the memory layout is seen as in the disassembly here:
0x000000000040156d <+13>: mov DWORD PTR [rbp-0x4],0xf
0x0000000000401574 <+20>: mov DWORD PTR [rbp-0x8],0x1e
0x000000000040157b <+27>: mov DWORD PTR [rbp-0xc],0x2d
0x0000000000401582 <+34>: mov DWORD PTR [rbp-0x10],0x3c
So: The four variables are in order at offsets 0x04, 0x08, 0x0C, 0x10 from the RBP; that is, sequenced in the same order they were declared. This is consistent and deterministic; I can re-compile, add other lines of code (random printing statements, other later variables, etc.) and the layout remains the same.
However, as soon as I include a line that touches an address or pointer, then the layout changes. For example, this:
#include <iostream>

int main() {
    int a = 15, b = 30, c = 45, d = 60;
    std::cout << &a << std::endl;
    return 0;
}
Produces this:
0x000000000040156d <+13>: mov DWORD PTR [rbp-0x10],0xf
0x0000000000401574 <+20>: mov DWORD PTR [rbp-0x4],0x1e
0x000000000040157b <+27>: mov DWORD PTR [rbp-0x8],0x2d
0x0000000000401582 <+34>: mov DWORD PTR [rbp-0xc],0x3c
So: A scrambled-up layout with the variables at offsets now respectively at 0x10, 0x04, 0x08, 0x0C. Again, this is consistent with any re-compiles, most random code I think to add, etc.
However, if I just touch a different address like so:
#include <iostream>

int main() {
    int a = 15, b = 30, c = 45, d = 60;
    std::cout << &b << std::endl;
    return 0;
}
Then the variables get ordered like this:
0x000000000040156d <+13>: mov DWORD PTR [rbp-0x4],0xf
0x0000000000401574 <+20>: mov DWORD PTR [rbp-0x10],0x1e
0x000000000040157b <+27>: mov DWORD PTR [rbp-0x8],0x2d
0x0000000000401582 <+34>: mov DWORD PTR [rbp-0xc],0x3c
That is, a different sequence at offsets 0x04, 0x10, 0x08, 0x0C. Once again, this is consistent as far as I can tell with recompilations and code changes, excepting if I refer to some other address in the code.
If I didn't know any better, it would seem like the integer variables are placed in declaration order, unless the code does any manipulation with addressing, at which point it starts scrambling them up in some deterministic way.
Some responses that will not satisfy this question are as follows:
"The behavior is undefined in the C++ standard" -- I'm not asking about the C++ standard, I'm asking specifically about how this GCC compiler makes its decision on layout.
"The compiler can do whatever it wants" -- Does not answer how the compiler decides on what it "wants" in this specific, consistent case.
Why does the GCC compiler layout integer variables in this way?
What explains the consistent re-ordering seen here?
Edit: On closer inspection, the variable whose address I take is always placed at [rbp-0x10], and then the other ones are put in declaration order after that. Why would that be beneficial? Note that printing the values of any of these variables doesn't seem to trigger the same re-ordering, from what I can tell.
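A quick way to inspect this from inside the program (a diagnostic sketch, not part of the original experiment) is to print every address; note that taking all four addresses may itself give the compiler a different set of constraints than taking only one:

#include <iostream>
int main() {
    int a = 15, b = 30, c = 45, d = 60;
    // Print each address to observe the variables' relative order in the frame.
    std::cout << &a << ' ' << &b << ' ' << &c << ' ' << &d << std::endl;
    return 0;
}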
You should compile your daniel.cc C++ code with g++ -O -fverbose-asm daniel.cc -S -o daniel.s and look at the generated assembler code in daniel.s.
For your first example, a lot of the constants and slots in your call frame have disappeared, because the code was optimized:
.text
.globl main
.type main, @function
main:
.LFB1644:
.cfi_startproc
endbr64
subq $24, %rsp #,
.cfi_def_cfa_offset 32
# daniel.cc:2: int main() {
movq %fs:40, %rax # MEM[(<address-space-1> long unsigned int *)40B], tmp89
movq %rax, 8(%rsp) # tmp89, D.41631
xorl %eax, %eax # tmp89
# daniel.cc:3: int a = 15, b = 30, c = 45, d = 60;
movl $15, 4(%rsp) #, a
# /usr/include/c++/10/ostream:246: { return _M_insert(__p); }
leaq 4(%rsp), %rsi #, tmp85
leaq _ZSt4cout(%rip), %rdi #,
call _ZNSo9_M_insertIPKvEERSoT_@PLT #
movq %rax, %rdi # tmp88, _4
# /usr/include/c++/10/ostream:113: return __pf(*this);
call _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_@PLT #
# daniel.cc:6: }
movq 8(%rsp), %rax # D.41631, tmp90
subq %fs:40, %rax # MEM[(<address-space-1> long unsigned int *)40B], tmp90
jne .L4 #,
movl $0, %eax #,
addq $24, %rsp #,
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.L4:
.cfi_restore_state
call __stack_chk_fail@PLT #
.cfi_endproc
.LFE1644:
.size main, .-main
.type _GLOBAL__sub_I_main, @function
If for whatever reason you really require your call frame to contain slots in a known order, you need to use a struct as an automatic variable (and that approach is portable to other C++ compilers).
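For example, something along these lines (a sketch): within a struct, members declared with the same access control are laid out in declaration order at increasing addresses, so their relative order survives taking an address:

#include <iostream>
struct Locals {
    int a, b, c, d;   // guaranteed: &a < &b < &c < &d within one object
};
int main() {
    Locals v{15, 30, 45, 60};
    std::cout << &v.a << std::endl;   // taking an address no longer scrambles the layout
    return 0;
}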
If you need to understand why GCC has compiled your code the way it did, download the source code of GCC, read the documentation of GCC internals, study it (it is free software).
You should be interested in the GCC developer options; they dump a lot of things regarding the internal state of the compiler.
Once you have understood a bit of what GCC actually does, subscribe to some GCC mailing list (e.g. gcc@gcc.gnu.org) and ask questions there. Alternatively, code your own GCC plugin to improve its behavior, change the organization of the call frame, or add dumping routines.
If you need to understand or improve GCC, budget several months of full-time work, and read the Dragon book first.
Related
I am writing a program that has state shared between assembly and C++. I declared a global array in the assembly file and access that array in a function within C++. When I call that function from within C++, there are no issues, but when I call that same function from within assembly, I get a segmentation fault. I believe I preserved the right registers across function calls.
Strangely, when I change the type of the pointer within C++ to a uint64_t pointer, it correctly outputs the values but then segmentation faults again after casting it to a uint64_t.
In the following code, the array which keeps giving me errors is currentCPUState.
//CPU.cpp
extern uint64_t currentCPUState[6];

extern "C" {
    void initInternalState(void* instructions, int indexSize);
    void printCPUState();
}

void printCPUState() {
    uint64_t b = currentCPUState[0];
    printf("%d\n", b); //this line DOESNT crash ???
    std::cout << b << "\n"; //this line crashes
    //omitted some code for the sake of brevity
    std::cout << "\n";
}

CPU::CPU() {
    //set initial cpu state
    currentCPUState[AF] = 0;
    currentCPUState[BC] = 0;
    currentCPUState[DE] = 0;
    currentCPUState[HL] = 0;
    currentCPUState[SP] = 0;
    currentCPUState[PC] = 0;

    printCPUState(); //this has no issues
    initInternalState(instructions, sizeof(void*));
}
//cpu.s
.section .data
.balign 8
instructionArr:
.space 8 * 1024, 0
//stores values of registers
//used for transitioning between C and ASM
//uint64_t currentCPUState[6]
.global currentCPUState
currentCPUState:
.quad 0, 0, 0, 0, 0, 0
.section .text
.global initInternalState
initInternalState:
push %rdi
push %rsi
mov %rcx, %rdi
mov %rdx, %rsi
push %R12
push %R13
push %R14
push %R15
call initGBCpu
pop %R15
pop %R14
pop %R13
pop %R12
pop %rsi
pop %rdi
ret
//omitted unimportant code
//initGBCpu(rdi: void* instructions, rsi:int size)
//function initializes the array of opcodes
initGBCpu:
pushq %rdx
//move each instruction into the array in proper order
//also fill the instructionArr
leaq instructionArr(%rip), %rdx
addop inst0x00
addop inst0x01
addop inst0x02
addop inst0x03
addop inst0x04
call loadCPUState
call inst0x04 //inc BC
call saveCPUState
call printCPUState //CRASHES HERE
popq %rdx
ret
Additional details:
OS: Windows, 64-bit
Compiler: MinGW-w64
Architecture: x64
Any insight would be much appreciated
Edit:
addop is a macro:
//adds an opcode to the array of functions
.macro addop lbl
leaq \lbl (%rip), %rcx
mov %rcx, 0(%rdi)
mov %rcx, 0(%rdx)
add %rsi, %rdi
add %rsi, %rdx
.endm
Some x86-64 calling conventions require the stack to be aligned to a 16-byte boundary before calling a function.
When a function is called, an 8-byte return address is pushed onto the stack, so another 8 bytes of data have to be added to the stack to satisfy this alignment requirement. Otherwise, instructions with alignment requirements (like some of the SSE instructions) may crash.
Assuming such a calling convention applies, the initGBCpu function looks OK, but the initInternalState function has to add one more 8-byte adjustment to the stack before calling initGBCpu.
For example:
initInternalState:
push %rdi
push %rsi
mov %rcx, %rdi
mov %rdx, %rsi
push %R12
push %R13
push %R14
push %R15
sub $8, %rsp // adjust stack alignment
call initGBCpu
add $8, %rsp // undo the stack pointer movement
pop %R15
pop %R14
pop %R13
pop %R12
pop %rsi
pop %rdi
ret
They say that the tail recursion optimization works only when the call is immediately before the return from the function, and they show this code as an example of what shouldn't be optimized by C compilers:
long long f(long long n) {
    return n > 0 ? f(n - 1) * n : 1;
}
because the recursive call's result is multiplied by n, which means the last operation is the multiplication, not the recursive call. However, it is optimized even at the -O1 level:
recursion`f:
0x100000930 <+0>: pushq %rbp
0x100000931 <+1>: movq %rsp, %rbp
0x100000934 <+4>: movl $0x1, %eax
0x100000939 <+9>: testq %rdi, %rdi
0x10000093c <+12>: jle 0x10000094e
0x10000093e <+14>: nop
0x100000940 <+16>: imulq %rdi, %rax
0x100000944 <+20>: cmpq $0x1, %rdi
0x100000948 <+24>: leaq -0x1(%rdi), %rdi
0x10000094c <+28>: jg 0x100000940
0x10000094e <+30>: popq %rbp
0x10000094f <+31>: retq
They say that:
Your final rules are therefore sufficiently correct. However, return n * fact(n - 1) does have an operation in the tail position! This is the multiplication *, which will be the last thing the function does before it returns. In some languages, this might actually be implemented as a function call which could then be tail-call optimized.
However, as we see from the ASM listing, the multiplication is still a single instruction, not a separate function call. So I really struggle to see the difference from the accumulator approach:
int fac_times (int n, int acc) {
    return (n == 0) ? acc : fac_times(n - 1, acc * n);
}

int factorial (int n) {
    return fac_times(n, 1);
}
This produces
recursion`fac_times:
0x1000008e0 <+0>: pushq %rbp
0x1000008e1 <+1>: movq %rsp, %rbp
0x1000008e4 <+4>: testl %edi, %edi
0x1000008e6 <+6>: je 0x1000008f7
0x1000008e8 <+8>: nopl (%rax,%rax)
0x1000008f0 <+16>: imull %edi, %esi
0x1000008f3 <+19>: decl %edi
0x1000008f5 <+21>: jne 0x1000008f0
0x1000008f7 <+23>: movl %esi, %eax
0x1000008f9 <+25>: popq %rbp
0x1000008fa <+26>: retq
Am I missing something? Or is it just that compilers have become smarter?
As you see in the assembly code, the compiler is smart enough to turn your code into a loop that is basically equivalent to (disregarding the different data types):
int fac(int n)
{
    int result = n;
    while (--n)
        result *= n;
    return result;
}
GCC is smart enough to know that the state needed by each call to your original f can be kept in two variables (n and result) through the whole recursive call sequence, so that no stack is necessary. It can transform f to fac_times, and both to fac, so to say. This is most likely not only a result of tail call optimization in the strictest sense, but one of the loads of other heuristics that GCC uses for optimization.
(I can't go more into detail regarding the specific heuristics that are used here since I don't know enough about them.)
The non-accumulator f isn't tail-recursive. The compiler's options include turning it into a loop by transforming it, or call / some insns / ret, but they don't include jmp f without other transformations.
tail-call optimization applies in cases like this:
int ext(int a);
int foo(int x) { return ext(x); }
asm output from godbolt:
foo: # #foo
jmp ext # TAILCALL
Tail-call optimization means leaving a function (or recursing) with a jmp instead of a ret. Anything else is not tailcall optimization. Tail-recursion that's optimized with a jmp really is a loop, though.
A good compiler will do further transformations to put the conditional branch at the bottom of the loop when possible, removing the unconditional branch. (In asm, the do{}while() style of looping is the most natural).
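For example, a rotated loop with the test at the bottom corresponds roughly to this shape in C++ (a sketch, not any compiler's literal output; the name fac_loop is made up here):

int fac_loop(int n) {
    int result = 1;
    if (n != 0) {             // the test hoisted out of the loop (test/je)
        do {
            result *= n;      // the multiply (imul)
        } while (--n != 0);   // decrement + conditional branch at the bottom (dec/jne)
    }
    return result;
}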
I wrote the following code in C++.
I want to cast away the const on a variable and change it; this is the code:
#include <iostream>
using namespace std;

int main()
{
    int const a = 5;
    int* ptr = (int*)&a;
    *ptr = 10;
    cout << "a is : " << a << endl;
    system("pause");
}
This code compiles, and I expected the program to print 10 on the screen, but the result on the screen is 5.
When I run the debugger, the memory at &a has been changed to 10 like I expected.
Any idea why?
First of all this is undefined behavior. Don't do it. Second, the compiler optimized away actually looking at the memory at &a when you print out a because you told the compiler a would never change (you said it was const). So it actually turned into...
cout << "a is : "<<5 << endl;
You are invoking undefined behavior with the code in question; trying to change a variable declared as const by casting away the constness is not allowed (unless the const variable is really a reference to a variable which isn't const).
One plausible, and highly likely, explanation for your result is that the compiler knows that the value of a shouldn't change, therefore it can pretty much replace all occurrences of a with 5, i.e. the "look-up" is optimized out.
Why look at the address of a to read its value when it's declared as always being 5?
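Note the parenthetical above: if the const name is merely a const view of an object that was not itself defined const, casting the constness away and writing through the pointer is well defined, and the compiler has to actually read the memory. A minimal sketch of that case (the names x and view are made up for illustration):

#include <iostream>
int main() {
    int x = 5;                           // the underlying object is NOT const
    const int& view = x;                 // 'view' is only a const way of naming x
    int* ptr = const_cast<int*>(&view);  // allowed: the object itself is non-const
    *ptr = 10;                           // well-defined modification of x
    std::cout << "view is : " << view << std::endl;   // prints 10
    return 0;
}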
Let's take a look at what instructions a compiler might turn the snippet into
foo.cpp
void func (int)
{
    /* ... */
}

int
main (int argc, char *argv[])
{
    const int a = 10;
    int * p = &const_cast<int&> (a);
    *p = 20;
    func (a);
}
assembly instructions of main as given by g++ -S foo.cpp
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movl %edi, -20(%rbp)
movq %rsi, -32(%rbp)
movl $10, -12(%rbp)
leaq -12(%rbp), %rax
movq %rax, -8(%rbp)
movq -8(%rbp), %rax # store the adress of `a` in %rax
movl $20, (%rax) # store 20 at the location pointed to by %rax (ie. &a)
movl $10, %edi # put 10 in register %edi (placeholder for first argument to function)
# # notice how the value is not read from `a`
# # but is a constant
call _Z4funci # call `func`
movl $0, %eax
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
As seen above, the value 20 is indeed stored at the address held in %rax, where %rax contains the address of a (movl $20, (%rax)), but the argument to our call to void func (int) is the constant number 10 (movl $10, %edi).
As said earlier, the compiler assumes that the value of a doesn't change, and instead of reading the memory location every time a is used, it replaces it with the constant value 10.
I have been reading through the following series of articles: http://www.altdevblogaday.com/2011/11/09/a-low-level-curriculum-for-c-and-c
The disassembled code shown there and the disassembled code I manage to produce from the same source vary quite significantly, and I lack the understanding to explain the differences.
Is there anyone who can step through it line by line and perhaps explain what it's doing at each step? From the searching around I have done, I get the feeling that the first few lines have something to do with frame pointers; there also seem to be a few extra lines in my disassembled code that make sure registers are cleared before new values are placed into them (absent from the code in the article).
I am running this on OS X (the original author is using Windows) using the g++ compiler from within Xcode 4. I am really clueless as to whether these variances are due to the OS, the architecture (32-bit vs 64-bit maybe?), or the compiler itself. It could even be the code, I guess - mine is wrapped inside the main function declaration, whereas the original code makes no mention of this.
My code:
int main(int argc, const char * argv[])
{
    int x = 1;
    int y = 2;
    int z = 0;

    z = x + y;
}
My disassembled code:
0x100000f40: pushq %rbp
0x100000f41: movq %rsp, %rbp
0x100000f44: movl $0, %eax
0x100000f49: movl %edi, -4(%rbp)
0x100000f4c: movq %rsi, -16(%rbp)
0x100000f50: movl $1, -20(%rbp)
0x100000f57: movl $2, -24(%rbp)
0x100000f5e: movl $0, -28(%rbp)
0x100000f65: movl -20(%rbp), %edi
0x100000f68: addl -24(%rbp), %edi
0x100000f6b: movl %edi, -28(%rbp)
0x100000f6e: popq %rbp
0x100000f6f: ret
The disassembled code from the original article:
mov dword ptr [ebp-8],1
mov dword ptr [ebp-14h],2
mov dword ptr [ebp-20h],0
mov eax, dword ptr [ebp-8]
add eax, dword ptr [ebp-14h]
mov dword ptr [ebp-20h],eax
A full line by line breakdown would be extremely enlightening but any help in understanding this would be appreciated.
All of the code from the original article is in your code; there's just some extra stuff around it. This:
0x100000f50: movl $1, -20(%rbp)
0x100000f57: movl $2, -24(%rbp)
0x100000f5e: movl $0, -28(%rbp)
0x100000f65: movl -20(%rbp), %edi
0x100000f68: addl -24(%rbp), %edi
0x100000f6b: movl %edi, -28(%rbp)
Corresponds directly to the 6 instructions talked about in the article.
There are two major differences between your disassembled code and the article's code.
One is that the article is using the Intel assembler syntax, while your disassembled code is using the traditional Unix/AT&T assembler syntax. Some differences between the two are documented on Wikipedia.
The other difference is that the article omits the function prologue, which sets up the stack frame, and the function epilogue, which destroys the stack frame and returns to the caller. The program he's disassembling has to contain instructions to do those things, but his disassembler isn't showing them. (Actually the stack frame could and probably would be omitted if the optimizer were enabled, but it's clearly not enabled.)
There are also some minor differences: your code is using a slightly different layout for local variables, and your code is computing the sum in a different register.
On the Mac, g++ doesn't support emitting Intel mnemonics, but clang does:
:; clang -S -mllvm --x86-asm-syntax=intel t.c
:; cat t.s
.section __TEXT,__text,regular,pure_instructions
.globl _main
.align 4, 0x90
_main: ## #main
.cfi_startproc
## BB#0:
push RBP
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset rbp, -16
mov RBP, RSP
Ltmp4:
.cfi_def_cfa_register rbp
mov EAX, 0
mov DWORD PTR [RBP - 4], EDI
mov QWORD PTR [RBP - 16], RSI
mov DWORD PTR [RBP - 20], 1
mov DWORD PTR [RBP - 24], 2
mov DWORD PTR [RBP - 28], 0
mov EDI, DWORD PTR [RBP - 20]
add EDI, DWORD PTR [RBP - 24]
mov DWORD PTR [RBP - 28], EDI
pop RBP
ret
.cfi_endproc
.subsections_via_symbols
If you add the -g flag, the compiler will add debug information including source filenames and line numbers. It's too big to put here in its entirety, but this is the relevant part:
.loc 1 4 14 prologue_end ## t.c:4:14
Ltmp5:
mov DWORD PTR [RBP - 20], 1
.loc 1 5 14 ## t.c:5:14
mov DWORD PTR [RBP - 24], 2
.loc 1 6 14 ## t.c:6:14
mov DWORD PTR [RBP - 28], 0
.loc 1 8 5 ## t.c:8:5
mov EDI, DWORD PTR [RBP - 20]
add EDI, DWORD PTR [RBP - 24]
mov DWORD PTR [RBP - 28], EDI
First of all, the assembler listed as "from the original article" is using "Intel" syntax, whereas the "disassembled output" in your post is in "AT&T" syntax. This explains the order of operands being "back to front" [let's not argue about which is right or wrong, OK?], register names being prefixed with %, and constants being prefixed with $. There is also a difference in how memory locations/offsets from registers are referenced: dword ptr [reg+offs] in Intel assembler translates to an l suffix on the instruction and offs(%reg) in AT&T syntax.
The 32-bit vs. 64-bit difference renames some of the registers: %rbp is the 64-bit counterpart of ebp in the article's code.
The actual offsets (e.g. -20) are different partly because the registers are bigger in 64-bit code, but also because you have argc and argv as function arguments, which are stored at the start of the function; I have a feeling the original article is actually disassembling a function other than main.
I have always been interested in assembler, but so far I never had a real chance to tackle it properly. Now that I have some time, I have started coding some small programs using assembler inside C++, but only small ones, i.e. define x, store it somewhere, and so on. I wanted to implement a for loop in assembler, but I couldn't manage it, so I would like to ask whether anyone here has done this before; it would be nice to share it here. An example function would be
for(i=0;i<10;i++) { std::cout<< "A"; }
Does anyone have an idea how to implement this in assembler?
Edit 2: the ISA is x86.
Here's the unoptimized output¹ of GCC for this code:
void some_function(void);
int main()
{
    for (int i = 0; i < 137; ++i) { some_function(); }
}
movl $0, 12(%esp) // i = 0; i is stored at %esp + 12
jmp .L2
.L3:
call some_function // some_function()
addl $1, 12(%esp) // ++i
.L2:
cmpl $136, 12(%esp) // compare i to 136 ...
jle .L3 // ... and repeat loop less-or-equal
movl $0, %eax // return 0
leave // --"--
With -O3 optimization, the addition and comparison are turned into a subtraction:
pushl %ebx // save %ebx
movl $137, %ebx // set %ebx to 137
// some unrelated parts
.L2:
call some_function // some_function()
subl $1, %ebx // subtract 1 from %ebx
jne .L2 // if not equal to 0, repeat loop
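Expressed back in C++, the -O3 loop corresponds roughly to the following count-down form (a sketch; counting down lets the subtraction set the flags, so no separate comparison against 137 is needed):

void some_function(void);

int main()
{
    // Same 137 calls, but iterating downwards and testing for zero,
    // which is what the subl/jne pair in the optimized output does.
    for (int left = 137; left != 0; --left) { some_function(); }
}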
¹ The generated assembly can be examined by invoking GCC with the -S flag.
Try to rewrite the for loop in C++ using a goto and an if statement and you will have the basics for the assembly version.
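For instance, the loop from the question can be rewritten with goto and if like this (a sketch); each line then maps almost one-to-one onto an assembly instruction:

#include <iostream>

int main() {
    int i = 0;              // initialization
    goto check;             // unconditional jump to the condition test
body:
    std::cout << "A";       // loop body
    ++i;                    // increment
check:
    if (i < 10)             // compare
        goto body;          // conditional jump back while the condition holds
    return 0;
}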
You could try the reverse: write the program in C++ or C and look at the disassembled code:
for ( int i = 0 ; i < 10 ; i++ )
00E714EE mov dword ptr [i],0
00E714F5 jmp wmain+30h (0E71500h)
00E714F7 mov eax,dword ptr [i]
00E714FA add eax,1
00E714FD mov dword ptr [i],eax
00E71500 cmp dword ptr [i],0Ah
00E71504 jge wmain+4Bh (0E7151Bh)
cout << "A";
00E71506 push offset string "A" (0E76800h)
00E7150B mov eax,dword ptr [__imp_std::cout (0E792ECh)]
00E71510 push eax
00E71511 call std::operator<<<std::char_traits<char> > (0E71159h)
00E71516 add esp,8
00E71519 jmp wmain+27h (0E714F7h)
then try to make sense of it.