Why there is no `leave` instruction at function epilog on x64? [duplicate]

This question already has answers here:
Why does the x86-64 GCC function prologue allocate less stack than the local variables?
(1 answer)
Why is there no "sub rsp" instruction in this function prologue and why are function parameters stored at negative rbp offsets?
(2 answers)
Closed 4 years ago.
I'm trying to understand how the stack works on x86 and x64 machines. What I've observed, however, is that when I write code myself and disassemble it, the output differs from what I see in code other people post (e.g. in their questions and tutorials). Here is a small example:
Source
int add(int a, int b) {
int c = 16;
return a + b + c;
}
int main () {
add(3,4);
return 0;
}
x86
add(int, int):
push ebp
mov ebp, esp
sub esp, 16
mov DWORD PTR [ebp-4], 16
mov edx, DWORD PTR [ebp+8]
mov eax, DWORD PTR [ebp+12]
add edx, eax
mov eax, DWORD PTR [ebp-4]
add eax, edx
leave (!)
ret
main:
push ebp
mov ebp, esp
push 4
push 3
call add(int, int)
add esp, 8
mov eax, 0
leave (!)
ret
Now x64
add(int, int):
push rbp
mov rbp, rsp
(?) where is `sub rsp, X`?
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov DWORD PTR [rbp-4], 16
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add edx, eax
mov eax, DWORD PTR [rbp-4]
add eax, edx
(?) where is `mov rsp, rbp` before popping rbp?
pop rbp
ret
main:
push rbp
mov rbp, rsp
mov esi, 4
mov edi, 3
call add(int, int)
mov eax, 0
(?) where is `mov rsp, rbp` before popping rbp?
pop rbp
ret
As you can see, my main confusion is that when I compile for x86, I see what I expect. When I compile for x64, the leave instruction (or the equivalent sequence mov rsp, rbp followed by pop rbp) is missing. What's wrong?
UPDATE
It seems leave is missing simply because rsp was never altered after the prologue, so there is nothing to restore before pop rbp. But that raises another question: why is there no allocation for local variables in the frame?
@melpomene gives a pretty straightforward answer to this: the "red zone". Basically, a function that calls no other functions (a leaf function) can use the 128 bytes below the stack pointer without allocating space. So if I insert a call to any other dummy function inside add(), sub rsp, X and add rsp, X will be added to the prologue and epilogue respectively.
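To see the red zone effect described above for yourself, a minimal sketch like the following can be compiled with g++ -O0 -S for an x86-64 System V target (Linux, for example). The names helper and add_nonleaf are made up for illustration; only add comes from the question. Per the update above, the expectation is that the leaf function gets no sub rsp while the non-leaf one does, but the exact output depends on the compiler version and flags.

```cpp
#include <cassert>

// Leaf function from the question: it calls nothing, so at -O0 GCC can keep
// c and the spilled arguments in the red zone below rsp.
int add(int a, int b) {
    int c = 16;
    return a + b + c;
}

// Hypothetical helper, only here so that the next function is not a leaf.
int helper(int x) {
    return x;
}

// Same arithmetic, but with a call inside, so the red zone cannot be relied on
// for data that must survive the call.
int add_nonleaf(int a, int b) {
    int c = 16;
    return helper(a + b + c);
}
```

GCC also has an -mno-red-zone option for x86-64; building with it should make even the leaf version reserve its stack space explicitly.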

Related

Why is the C++ function parameter stored 20 bytes off of the rbp in x86-64 when the method body only has one 4 byte variable?

Consider the following program, compiled using x86-64 GCC 12.2 with flags --std=c++17 -O0:
int square(int num, int num2) {
int foo = 37;
return num * num;
}
int main () {
return square(10, 5);
}
The resulting assembly using godbolt is:
square(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov DWORD PTR [rbp-4], 37
mov eax, DWORD PTR [rbp-20]
imul eax, eax
pop rbp
ret
main:
push rbp
mov rbp, rsp
mov esi, 5
mov edi, 10
call square(int, int)
nop
pop rbp
ret
I read about shadow spaces and it appears that in x64 there must be at minimum 32 bytes allocated: "32 bytes above the return address which the called function owns" ...
With that said, how is the offset -20 determined for the parameter num? If there's 32 bytes from rbp, wouldn't that be -24?
I noticed even if you add more local variables, it'll remain -20 until it gets pushed over to -36, but I cannot understand why. Thanks!

Modulus in Assembly x64 linux question C++ [duplicate]

This question already has answers here:
Why does GCC use multiplication by a strange number in implementing integer division?
(5 answers)
Divide Signed Integer By 2 compiles to complex assembly output, not just a shift
(1 answer)
Closed 1 year ago.
I have these functions in C++
int f1(int a)
{
int x = a / 2;
}
int f2(int a)
{
int y = a % 2;
}
int f3(int a)
{
int z = a % 7;
}
int f4(int a,int b)
{
int xy = a % b;
}
I looked at their assembly code but couldn't understand what it is doing. I couldn't even find a good reference or an explained example. Here is the assembly:
f1(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
mov edx, eax
shr edx, 31
add eax, edx
sar eax
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f2(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
cdq
shr edx, 31
add eax, edx
and eax, 1
sub eax, edx
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f3(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov eax, DWORD PTR [rbp-20]
movsx rdx, eax
imul rdx, rdx, -1840700269
shr rdx, 32
add edx, eax
sar edx, 2
mov esi, eax
sar esi, 31
mov ecx, edx
sub ecx, esi
mov edx, ecx
sal edx, 3
sub edx, ecx
sub eax, edx
mov DWORD PTR [rbp-4], eax
nop
pop rbp
ret
f4(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov eax, DWORD PTR [rbp-20]
cdq
idiv DWORD PTR [rbp-24]
mov DWORD PTR [rbp-4], edx
nop
pop rbp
ret
Can you please explain, with an example or step by step, how it calculates the answers in all these cases, and why this works just as well as a normal divide?
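One way to read the listings above is to transcribe each instruction sequence back into C++ and check it against the built-in operators. The sketch below does that for f1, f2 and f3. The names div2, mod2 and mod7 are made up here, and it assumes a 32-bit int and an arithmetic right shift for negative values (what GCC does on x86-64, and what C++20 guarantees). f4 needs no trick: with a divisor unknown at compile time the compiler just uses idiv and takes the remainder from edx.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// f1: a / 2   (shr edx,31 ; add eax,edx ; sar eax)
constexpr int div2(int a) {
    int bias = static_cast<int>(static_cast<unsigned>(a) >> 31); // 1 if a < 0, else 0
    return (a + bias) >> 1; // the bias makes the shift round toward zero
}

// f2: a % 2   (cdq ; shr edx,31 ; add eax,edx ; and eax,1 ; sub eax,edx)
constexpr int mod2(int a) {
    int bias = static_cast<int>(static_cast<unsigned>(a) >> 31); // 1 if a < 0, else 0
    return ((a + bias) & 1) - bias;
}

// f3: a % 7   (multiply by a "magic" constant instead of dividing)
constexpr int mod7(int a) {
    std::int64_t prod = static_cast<std::int64_t>(a) * -1840700269LL; // movsx ; imul
    int hi = static_cast<int>(prod >> 32); // shr rdx,32, then only edx is used
    int t = (hi + a) >> 2;                 // add edx,eax ; sar edx,2
    int sign = a >> 31;                    // sar esi,31 : -1 if a < 0, else 0
    int q = t - sign;                      // q is now a / 7
    return a - q * 7;                      // the asm forms 7*q as (q << 3) - q
}

static_assert(div2(-7) == -3, "rounds toward zero like a / 2");
static_assert(mod2(-3) == -1, "sign follows the dividend like a % 2");
static_assert(mod7(10) == 3, "10 % 7");
static_assert(mod7(-10) == -3, "-10 % 7");
```

The point of the extra steps is that a plain shift or mask rounds toward negative infinity, while C++ division truncates toward zero, so negative inputs need the small correction.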

GCC generated assembly

Why does a printf call change the prologue?
C code_1:
#include <cstdio>
int main(){
int a = 11;
printf("%d", a);
}
GCC -m32 generated one:
.LC0:
.string "%d"
main:
lea ecx, [esp+4] // What's purpose of this three
and esp, -16 // lines?
push DWORD PTR [ecx-4] //
push ebp
mov ebp, esp
push ecx
sub esp, 20 // why sub 20?
mov DWORD PTR [ebp-12], 11
sub esp, 8
push DWORD PTR [ebp-12]
push OFFSET FLAT:.LC0
call printf
add esp, 16
mov eax, 0
mov ecx, DWORD PTR [ebp-4]
leave
lea esp, [ecx-4]
ret
C code_2:
#include <cstdio>
int main(){
int a = 11;
}
GCC -m32:
main:
push ebp
mov ebp, esp
sub esp, 16
mov DWORD PTR [ebp-4], 11
mov eax, 0
leave
ret
What is the purpose of the first three lines added in the first listing?
Please explain the first assembly listing if you can.
EDIT:
64-bit mode:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 11
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret

The key insight is that the compiler keeps the stack aligned at function calls.
The alignment is 16 bytes.
lea ecx, [esp+4] ;Save original ESP to ECX (ESP+4 actually)
and esp, -16 ;Align stack on 16 bytes (Lower esp)
push DWORD PTR [ecx-4] ;Push main return address (Stack at 16B + 4)
;My guess is to aid debugging tools that expect the RA
;to be at [ebp+04h]
push ebp
mov ebp, esp ;Prolog (Stack at 16B+8)
push ecx ;Save ECX (Original stack pointer) (Stack at 16B+12)
sub esp, 20 ;Reserve 20 bytes (Stack at 16B+0, ALIGNED AGAIN)
;4 for alignment + 1x16 for a variable (variable space is
;allocated in multiple of 16)
mov DWORD PTR [ebp-12], 11 ;a = 11
sub esp, 8 ;Stack at 16B+8 for later alignment
push DWORD PTR [ebp-12] ;a
push OFFSET FLAT:.LC0 ;"%d" (Stack at 16B)
call printf
add esp, 16 ;Remove args+pad from the stack (Stack at 16B)
mov eax, 0 ;Return 0
mov ecx, DWORD PTR [ebp-4] ;Restore ECX without the need to add to esp
leave ;Restore EBP
lea esp, [ecx-4] ;Restore original ESP
ret
I don't know why the compiler saves esp+4 in ecx instead of esp (esp+4 is the address of the first parameter of main).
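The bookkeeping in the annotated listing can be double-checked with a small model that tracks only the value of esp through the same steps. This is a sketch, not anything the compiler produces: Trace and simulate_main are names invented here, and the model ignores the actual memory contents apart from the saved ecx.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: esp at the call to printf and esp just before ret.
struct Trace {
    std::uint32_t esp_at_call;
    std::uint32_t esp_at_ret;
};

constexpr Trace simulate_main(std::uint32_t entry_esp) {
    std::uint32_t esp = entry_esp;
    std::uint32_t ecx = esp + 4;          // lea ecx, [esp+4]
    esp &= ~std::uint32_t{15};            // and esp, -16
    esp -= 4;                             // push DWORD PTR [ecx-4]
    esp -= 4;                             // push ebp
    const std::uint32_t ebp = esp;        // mov ebp, esp
    esp -= 4;                             // push ecx (lands at [ebp-4])
    const std::uint32_t saved_ecx = ecx;  // models the stack slot at [ebp-4]
    esp -= 20;                            // sub esp, 20
    esp -= 8;                             // sub esp, 8
    esp -= 4;                             // push DWORD PTR [ebp-12]
    esp -= 4;                             // push OFFSET FLAT:.LC0
    const std::uint32_t at_call = esp;    // esp when "call printf" executes
    esp += 16;                            // add esp, 16
    ecx = saved_ecx;                      // mov ecx, DWORD PTR [ebp-4]
    esp = ebp + 4;                        // leave (mov esp, ebp ; pop ebp)
    esp = ecx - 4;                        // lea esp, [ecx-4]
    return Trace{at_call, esp};
}

static_assert(simulate_main(0xFFFFD00Cu).esp_at_call % 16 == 0, "aligned at the call");
static_assert(simulate_main(0xFFFFD00Cu).esp_at_ret == 0xFFFFD00Cu, "esp restored for ret");
```

Whatever the incoming esp is, the model ends up 16-byte aligned at the call and back at the entry value before ret, which is why the original esp has to be carried around in ecx.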

Access violation on 'ret' instruction

I've got this function, which consists mostly of inline asm.
long *toarrayl(int members, ...){
__asm{
push esp
mov eax, members
imul eax, 4
push eax
call malloc
mov edx, eax
mov edi, eax
xor ecx, ecx
xor esi, esi
loopx:
cmp ecx, members
je done
mov esi, 4
imul esi, ecx
add esi, ebp
mov eax, [esi+0xC]
mov [edi], eax
inc ecx
add edi, 4
jmp loopx
done:
mov eax, edx
pop esp
ret
}
}
And upon running, I get an access violation on the return instruction.
I'm using VC++ 6, which sometimes actually means the line above, so the fault is possibly on 'pop esp'.
If you could help me out, it'd be great.
Thanks, iDomo.

You are failing to manage the stack pointer correctly. In particular, your call to malloc unbalances the stack, and your pop esp ends up popping the wrong value into esp. The access violation therefore occurs when you try to ret from an invalid stack (the CPU cannot read the return address). It's unclear why you are pushing and popping esp; that accomplishes nothing.

As you spotted, you should never use the instruction POP ESP - when you see that in the code, you know something extremely wrong has happened. Of course, calling malloc inside assembler code is also rather a bad thing to do - you have for example forgotten to check if it returned NULL, so you may well crash. Stick that outside your assembler - and check for NULL, it's much easier to debug "Couldn't allocate memory at line 54 in file mycode.c" than "Somewhere in the assembler, we got a
Here are some suggestions for improvement, which should speed up your loop a bit:
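long *toarrayl(int members, ...){
long *lret; // result pointer, returned after the asm block
__asm{
mov eax, members
imul eax, 4
push eax
call malloc
add esp, 4 ; cdecl: the caller removes the argument
mov edx, eax ; keep the malloc result, eax is reused below
mov edi, eax
mov ecx, members
lea esi, [ebp+0xc] ; first variadic argument
loopx:
mov eax, [esi]
mov [edi], eax
add edi, 4
add esi, 4
dec ecx
jnz loopx
mov lret, edx ; store the saved pointer, not the last copied element
}
return lret; // let the compiler emit the epilogue instead of a bare ret
}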
Improvements: Remove the multiply by four in every iteration; just increment esi instead. Decrement ecx instead of incrementing it, and load it with members before the loop. This allows just one jump in the loop rather than two. Remove the redundant move from edx to eax. Use eax directly.

I've figured out the answer on my own.
For those who have had the same or a similar problem:
The actual exception was occurring after the user code, when VC++ automatically pops/restores the registers to their state before the function was called. Since I misaligned the stack pointer when calling malloc, there was an access violation when popping from the stack. I wasn't able to see this in the editor because it wasn't my code, so it was just shown at the last line of my code in the function.
To correct this, just add an add esp, (size of parameters for the previous call) after each call you make.
Fixed code:
long *toarrayl(int members, ...){
__asm{
mov eax, members
imul eax, 4
push eax
call malloc
add esp, 4
mov edx, eax
mov edi, eax
xor ecx, ecx
xor esi, esi
loopx:
cmp ecx, members
je done
mov esi, 4
imul esi, ecx
add esi, ebp
mov eax, [esi+0xC]
mov [edi], eax
inc ecx
add edi, 4
jmp loopx
done:
mov eax, edx
ret
}
//return (long*)0;
}
Optimized code:
long *toarrayl(int members, ...){
__asm{
mov eax, members
shl eax, 2
push eax
call malloc
add esp, 4
;cmp eax, 0
;je _error
mov edi, eax
mov ecx, members
lea esi, [ebp+0xC]
loopx:
mov edx, [esi]
mov [edi], edx
add edi, 4
add esi, 4
dec ecx
jnz loopx
}
}
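For comparison, the same job can be done without inline assembly by using the standard <cstdarg> macros. This is only a sketch of an equivalent, not the poster's code. It assumes every variadic argument really is passed as a long, and unlike the listings above it checks the malloc result and refuses a non-positive count.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdlib>

// Copies `members` variadic long arguments into a malloc'ed array.
// Returns nullptr if members <= 0 or the allocation fails.
// The caller owns the result and releases it with std::free.
long* toarrayl(int members, ...) {
    if (members <= 0) {
        return nullptr;
    }
    long* out = static_cast<long*>(
        std::malloc(sizeof(long) * static_cast<std::size_t>(members)));
    if (out == nullptr) {
        return nullptr;
    }
    va_list args;
    va_start(args, members);
    for (int i = 0; i < members; ++i) {
        out[i] = va_arg(args, long); // each argument must have been passed as long
    }
    va_end(args);
    return out;
}
```

A call would look like toarrayl(3, 10L, 20L, 30L). The L suffixes matter on platforms where long is wider than int, because va_arg reads exactly the type it is told to.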

How's __RTC_CheckEsp implemented?

__RTC_CheckEsp is a call that verifies the correctness of the esp (stack pointer) register. It is called to ensure that the value of esp was preserved across a function call.
Does anyone know how it's implemented?

Well, a little bit of inspection of the assembly gives it away:
0044EE35 mov esi,esp
0044EE37 push 3039h
0044EE3C mov ecx,dword ptr [ebp-18h]
0044EE3F add ecx,70h
0044EE42 mov eax,dword ptr [ebp-18h]
0044EE45 mov edx,dword ptr [eax+70h]
0044EE48 mov eax,dword ptr [edx+0Ch]
0044EE4B call eax
0044EE4D cmp esi,esp
0044EE4F call #ILT+6745(__RTC_CheckEsp) (42BA5Eh)
There are two lines to note here. First, at 0x44EE35 it stores the current value of esp in esi.
Then, after the function call completes, it does a cmp between esi and esp. They should both be the same now. If they aren't, then someone has either unwound the stack twice or not unwound it at all.
The _RTC_CheckEsp function looks like this:
_RTC_CheckEsp:
00475A60 jne esperror (475A63h)
00475A62 ret
esperror:
00475A63 push ebp
00475A64 mov ebp,esp
00475A66 sub esp,0
00475A69 push eax
00475A6A push edx
00475A6B push ebx
00475A6C push esi
00475A6D push edi
00475A6E mov eax,dword ptr [ebp+4]
00475A71 push 0
00475A73 push eax
00475A74 call _RTC_Failure (42C34Bh)
00475A79 add esp,8
00475A7C pop edi
00475A7D pop esi
00475A7E pop ebx
00475A7F pop edx
00475A80 pop eax
00475A81 mov esp,ebp
00475A83 pop ebp
00475A84 ret
As you can see, the first thing it checks is whether the result of the earlier comparison was "not equal", i.e. esi != esp. If that's the case, it jumps to the failure code. If they ARE the same, the function simply returns.
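The same check can be modelled in a few lines of C++ that follow only the value of esp. This is a sketch of the idea, not Microsoft's implementation: esp_preserved and its parameters are invented for illustration, with callee_pops standing for the bytes removed by a ret N and caller_pops for an add esp, N after the call.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if esp after the call sequence equals the value saved before it.
constexpr bool esp_preserved(std::uint32_t esp,
                             std::uint32_t arg_bytes,
                             std::uint32_t callee_pops,
                             std::uint32_t caller_pops) {
    const std::uint32_t saved = esp; // mov esi, esp
    esp -= arg_bytes;                // push the arguments
    esp -= 4;                        // call pushes the return address
    esp += 4;                        // ret pops the return address
    esp += callee_pops;              // ret N: the callee removes its arguments
    esp += caller_pops;              // add esp, N: the caller removes them
    return saved == esp;             // cmp esi, esp
}

static_assert(esp_preserved(0x1000, 4, 4, 0), "callee cleans up: balanced");
static_assert(esp_preserved(0x1000, 4, 0, 4), "caller cleans up: balanced");
static_assert(!esp_preserved(0x1000, 4, 0, 0), "nobody cleans up: not unwound");
static_assert(!esp_preserved(0x1000, 4, 4, 4), "both clean up: unwound twice");
```

The two failing cases are the ones described above: the stack was either not unwound or unwound twice, which typically points at a calling convention mismatch between caller and callee.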

If you're any good at asm, maybe this helps:
jne (Jump if Not Equal) - jumps if the zero flag is clear (NZ, Not Zero)
_RTC_CheckEsp:
004C8690 jne esperror (4C8693h)
004C8692 ret
esperror:
004C8693 push ebp
004C8694 mov ebp,esp
004C8696 sub esp,0
004C8699 push eax
004C869A push edx
004C869B push ebx
004C869C push esi
004C869D push edi
004C869E mov eax,dword ptr [ebp+4]
004C86A1 push 0
004C86A3 push eax
004C86A4 call _RTC_Failure (4550F8h)
004C86A9 add esp,8
004C86AC pop edi
004C86AD pop esi
004C86AE pop ebx
004C86AF pop edx
004C86B0 pop eax
004C86B1 mov esp,ebp
004C86B3 pop ebp
004C86B4 ret
004C86B5 int 3
004C86B6 int 3
004C86B7 int 3
004C86B8 int 3
004C86B9 int 3
004C86BA int 3
004C86BB int 3
004C86BC int 3
004C86BD int 3
004C86BE int 3
004C86BF int 3
