If we compile some C code with gcc, we often see assembly like the following:
0x08048494 <+0>: push ebp
0x08048495 <+1>: mov ebp,esp
0x08048497 <+3>: and esp,0xfffffff0
0x0804849a <+6>: sub esp,0x130
0x080484a0 <+12>: mov eax,DWORD PTR [ebp+0xc]
0x080484a3 <+15>: mov DWORD PTR [esp+0x1c],eax
0x080484a7 <+19>: mov eax,gs:0x14
This is a simple function prologue.
From the +19 line, we can see that the stack protector value is
read from gs:0x14.
My question is: can I find the actual virtual address of gs:0x14 with gdb?
The gs segment selector value is an index into the GDT.
However, a user-level process such as gdb cannot read the GDT.
How can I figure out the base address of the gs segment using gdb or another debugger?
Is this impossible?
Thank you in advance.
I am trying to implement a stackful coroutine in C++17 on Windows x64, but I have run into a problem: I can't throw an exception inside my coroutine. If I do, the program terminates immediately with a bad exit code.
Implementation
First, I allocate a stack for the new coroutine. The code looks something like this:
void* Allocate() {
    static constexpr std::size_t kStackSize{524'288};
    auto new_stack{::operator new(kStackSize)};
    return static_cast<std::byte *>(new_stack) + kStackSize;
}
The next step is to set up a trampoline function on the newly allocated stack. The code is written in MASM because I use MSVC (I would like to use GCC and NASM, but I have a problem with thread_local variables; see question if you are interested):
SetTrampoline PROC
mov rax, rsp ; saves the current stack pointer
mov rsp, [rcx] ; sets the new stack pointer
sub rsp, 20h ; shadow space
push rdx ; saves the function pointer
; place for nonvolatile registers
sub rsp, 0e0h
mov [rcx], rsp ; saves the moved stack pointer
mov rsp, rax ; returns the initial stack pointer
ret
SetTrampoline ENDP
Then I switch the machine context with this assembly function (I read up on this calling convention):
SwitchContext PROC
; saves all nonvolatile registers to the caller stack
push rbx
push rbp
push rdi
push rsi
push r12
push r13
push r14
push r15
sub rsp, 10h
movdqu [rsp], xmm6
; ... pushes xmm7 - xmm14 in here, removed for brevity
sub rsp, 10h
movdqu [rsp], xmm15
mov [rdx], rsp ; saves the caller stack pointer
SwitchContextFinally PROC
mov rsp, [rcx] ; sets the callee stack pointer
; takes out the callee registers
movdqu xmm15, [rsp]
add rsp, 10h
; ... pops xmm7 - xmm14 in here, removed for brevity
movdqu xmm6, [rsp]
add rsp, 10h
pop r15
pop r14
pop r13
pop r12
pop rsi
pop rdi
pop rbp
pop rbx
ret
SwitchContextFinally ENDP
SwitchContext ENDP
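The sizes used by SetTrampoline and SwitchContextFinally above can be cross-checked with a little arithmetic. This is only a sketch: the constant and function names are made up for illustration, and the numbers are read straight off the two listings (ten xmm registers, eight general-purpose registers, 20h of shadow space, one pushed function pointer).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names; the values are taken from the MASM listings above.
constexpr std::size_t kXmmCount = 10;  // xmm6 .. xmm15
constexpr std::size_t kXmmSize = 16;   // bytes per xmm register
constexpr std::size_t kGprCount = 8;   // rbx, rbp, rdi, rsi, r12 .. r15
constexpr std::size_t kGprSize = 8;    // bytes per general-purpose register

// Bytes that SwitchContextFinally pops before `ret` reaches the function pointer.
constexpr std::size_t saved_register_bytes() {
    return kXmmCount * kXmmSize + kGprCount * kGprSize;
}
static_assert(saved_register_bytes() == 0xe0, "matches `sub rsp, 0e0h`");

// Distance of rsp below the allocated stack top once `ret` has popped
// the function pointer, i.e. when the trampoline target starts running.
constexpr std::size_t entry_offset_below_top() {
    constexpr std::size_t shadow = 0x20;  // sub rsp, 20h
    constexpr std::size_t fn_ptr = 8;     // push rdx
    std::size_t after_set_trampoline = shadow + fn_ptr + 0xe0;      // 0x108
    return after_set_trampoline - saved_register_bytes() - fn_ptr;  // 0x20
}
static_assert(entry_offset_below_top() == 0x20, "rsp = top - 20h at entry");
```

So the 0e0h reserved in SetTrampoline is exactly what SwitchContextFinally pops, and the target function starts with rsp 20h below the stack top. If the top returned by Allocate is 16-byte aligned, rsp is 16-byte aligned at that point too, whereas the Windows x64 convention has rsp 16-byte aligned before a call and therefore 8 bytes past a 16-byte boundary at function entry. Whether that has any bearing on the exception problem is not established here.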
Problem
Inside the trampoline I simply invoke whatever function was passed in, and within those functions I can't throw an exception and catch it right away in the same function. What have I done wrong? Is it possible to throw exceptions in my case? Should I have shadow space in SetTrampoline?
Also, I guarantee that thrown exceptions never escape the trampoline function.
Consider this simple code:
class X {
    int i_;
public:
    X();
};
void f() {
    X x;
}
The stack frame of f is 32 bytes long with GCC, which is unnecessarily large. The return address and x need only 12 bytes, and the Linux/x86_64 ABI requires just 16-byte alignment. With Clang, only 16 bytes are allocated. Why does GCC require so much stack space?
GCC assembly:
f():
sub rsp, 24
lea rdi, [rsp+12]
call X::X()
add rsp, 24
ret
Clang assembly:
f():
push rax
mov rdi, rsp
call X::X()
pop rax
ret
Both with -O2. Live demo: https://godbolt.org/z/bcrWW36on
Fascinating rabbit hole, I've changed my analysis three times already.
It seems that this is indeed a missed optimization. While playing around a bit, I found another missed optimization, this time in Clang:
If you actually use the x object, then Clang uses rbx to cache the address of x instead of recomputing it, which means it needs to save rbx across the function, which extends the used space in the stack frame by 8 (from 12 to 20), bumping the aligned stack frame to 32, same as gcc.
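The frame-size arithmetic in the paragraph above can be written out as a small compile-time check. This is just a sketch: align_up and the constants are illustrative names, and the sizes assume an 8-byte return address, a 4-byte X and 16-byte stack alignment, as in the discussion.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of align (align must be a power of two).
constexpr std::size_t align_up(std::size_t n, std::size_t align) {
    return (n + align - 1) & ~(align - 1);
}

constexpr std::size_t kReturnAddress = 8;  // pushed by `call`
constexpr std::size_t kObjectX = 4;        // X holds a single int
constexpr std::size_t kSavedRbx = 8;       // callee-saved register spill
constexpr std::size_t kStackAlign = 16;    // x86-64 System V stack alignment

// Without the saved register: 8 + 4 = 12 bytes, rounded up to 16.
static_assert(align_up(kReturnAddress + kObjectX, kStackAlign) == 16,
              "minimal frame");
// With rbx saved as well: 12 + 8 = 20 bytes, rounded up to 32.
static_assert(align_up(kReturnAddress + kObjectX + kSavedRbx, kStackAlign) == 32,
              "frame once rbx is preserved");
```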
From a debugging perspective, I'd prefer clang to use sub rsp, 8 instead of push rax to allocate the memory for x, so the memory isn't marked as initialized in valgrind.
GCC assembly:
f():
sub rsp, 24
lea rdi, [rsp+12]
call X::X() [complete object constructor]
lea rdi, [rsp+12]
call g(X&)
add rsp, 24
ret
Clang assembly:
f():
push rbx
sub rsp, 16
lea rbx, [rsp + 8]
mov rdi, rbx
call X::X() [complete object constructor]
mov rdi, rbx
call g(X&)
add rsp, 16
pop rbx
ret
I checked whether GCC perhaps uses 32-byte stack alignment by adding a 32-byte vector as a data member. Both GCC and Clang then generate code to align the stack pointer and use the base pointer to implement the resulting variable-length stack frame. I have no idea why Clang allocates 64 bytes for the object here, though.
GCC assembly:
f():
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 32
mov rdi, rsp
call X::X() [complete object constructor]
leave
ret
Clang assembly:
f(): # #f()
push rbp
mov rbp, rsp
and rsp, -32
sub rsp, 64
mov rdi, rsp
call X::X() [complete object constructor]
mov rsp, rbp
pop rbp
ret
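For anyone who wants to reproduce the over-aligned experiment, here is a minimal sketch. It assumes alignas(32) on a plain byte array as a stand-in for the 32-byte vector member (the exact type used above is not shown), and Wide and local_is_32_byte_aligned are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a class with a 32-byte vector data member.
struct alignas(32) Wide {
    unsigned char bytes[32];
};

static_assert(alignof(Wide) == 32, "stricter than the 16-byte ABI stack alignment");
static_assert(sizeof(Wide) == 32, "no tail padding needed");

// Returns true if a local Wide object sits on a 32-byte boundary.
// The compiler has to realign rsp (and rsp, -32) to guarantee this.
bool local_is_32_byte_aligned() {
    Wide w{};
    auto address = reinterpret_cast<std::uintptr_t>(&w);
    return address % 32 == 0;
}
```

Because the incoming stack pointer is only known to be 16-byte aligned, the amount subtracted by the `and` is not known at compile time, which is why both compilers keep rbp around to restore rsp.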
Without actually measuring performance, it is hard to tell which is better. -O2 optimizes for runtime, not stack frame size, so there could be good reasons for all of these choices.
I'm new to assembly and I'm trying to figure out how C++ handles dynamic dispatch in assembly.
When looking through assembly code, I saw that there were 2 unusual calls:
call _Znwm
call _ZdlPv
These did not have a subroutine that I could trace them to. From examining the code, _Znwm seemed to return the address of the object before its constructor was called, but I'm not sure about that. _ZdlPv was in a block of code that could never be reached (it was jumped over).
C++:
Fruit * f;
f = new Apple();
x86:
# BB#1:
mov eax, 8
mov edi, eax
call _Znwm
mov rdi, rax
mov rcx, rax
.Ltmp6:
mov qword ptr [rbp - 48], rdi # 8-byte Spill
mov rdi, rax
mov qword ptr [rbp - 56], rcx # 8-byte Spill
call _ZN5AppleC2Ev
Any advice would be appreciated.
Thanks.
_Znwm is the mangled name of operator new(unsigned long).
_ZdlPv is the mangled name of operator delete(void*).
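If you run into other mangled names, you can decode them yourself. A minimal sketch, assuming GCC or Clang with the Itanium C++ ABI (which provides abi::__cxa_demangle in <cxxabi.h>); the helper name demangle is my own:

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result{text};
    std::free(text);  // the buffer returned by __cxa_demangle is malloc'ed
    return result;
}
```

With this, demangle("_Znwm") gives "operator new(unsigned long)", demangle("_ZdlPv") gives "operator delete(void*)", and demangle("_ZN5AppleC2Ev") gives "Apple::Apple()". The c++filt command-line tool does the same job.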
(This question is specific to my machine's architecture and calling conventions, Windows x86_64)
I don't exactly remember where I had read this, or if I had recalled it correctly, but I had heard that, when a function should return some struct or object by value, it will either stuff it in rax (if the object can fit in the register width of 64 bits) or be passed a pointer to where the resulting object would be (I'm guessing allocated in the calling function's stack frame) in rcx, where it would do all the usual initialization, and then a mov rax, rcx for the return trip. That is, something like
extern some_struct create_it(); // implemented in assembly
would really have a secret parameter like
extern some_struct create_it(some_struct* secret_param_pointing_to_where_i_will_be);
Did my memory serve me right, or am I incorrect? How are large objects (i.e. wider than the register width) returned by value from functions?
Here's a simple disassembly of some code that demonstrates what you're describing:
typedef struct
{
    int b;
    int c;
    int d;
    int e;
    int f;
    int g;
    char x;
} A;
A foo(int b, int c)
{
    A myA = {b, c, 5, 6, 7, 8, 10};
    return myA;
}
int main()
{
    A myA = foo(5,9);
    return 0;
}
And here's the disassembly of the main function and of the foo function it calls:
main:
push ebp
mov ebp, esp
and esp, 0FFFFFFF0h
sub esp, 30h
call ___main
lea eax, [esp+20] ; placing the addr of myA in eax
mov dword ptr [esp+8], 9 ; param passing
mov dword ptr [esp+4], 5 ; param passing
mov [esp], eax ; passing myA addr as a param
call _foo
mov eax, 0
leave
retn
foo:
push ebp
mov ebp, esp
sub esp, 20h
mov eax, [ebp+12]
mov [ebp-28], eax
mov eax, [ebp+16]
mov [ebp-24], eax
mov dword ptr [ebp-20], 5
mov dword ptr [ebp-16], 6
mov dword ptr [ebp-12], 7
mov dword ptr [ebp-8], 8
mov byte ptr [ebp-4], 0Ah
mov eax, [ebp+8]
mov edx, [ebp-28]
mov [eax], edx
mov edx, [ebp-24]
mov [eax+4], edx
mov edx, [ebp-20]
mov [eax+8], edx
mov edx, [ebp-16]
mov [eax+0Ch], edx
mov edx, [ebp-12]
mov [eax+10h], edx
mov edx, [ebp-8]
mov [eax+14h], edx
mov edx, [ebp-4]
mov [eax+18h], edx
mov eax, [ebp+8]
leave
retn
Now let's go through what just happened. When calling foo, the parameters were passed as follows: 9 at the highest address, then 5, then the address where main's myA begins:
lea eax, [esp+20] ; placing the addr of myA in eax
mov dword ptr [esp+8], 9 ; param passing
mov dword ptr [esp+4], 5 ; param passing
mov [esp], eax ; passing myA addr as a param
Within foo there is a local myA stored in foo's stack frame. Since the stack grows downwards, the lowest address of myA is [ebp-28]. The struct occupies 28 bytes rather than the 25 you might expect from summing its members, because it is padded to a multiple of its 4-byte alignment. As we can see, after foo's local myA has been filled with the parameters and immediate values, it is copied to the address of myA passed from main (this is the actual meaning of return by value):
mov eax, [ebp+8]
mov edx, [ebp-28]
[ebp+8] is where the address of main::myA was stored: memory addresses go upwards, so ebp + saved ebp (4 bytes) + return address (4 bytes) gives ebp+8, which holds the pointer to the first byte of main::myA. As said earlier, foo::myA starts at [ebp-28] because the stack grows downwards.
mov [eax], edx
This places foo::myA.b at the address of the first data member of main::myA, which is main::myA.b.
mov edx, [ebp-24]
mov [eax+4], edx
This loads the value of foo::myA.c into edx and stores it at the address of main::myA.b + 4 bytes, which is main::myA.c.
As you can see, this process repeats itself throughout the function:
mov edx, [ebp-20]
mov [eax+8], edx
mov edx, [ebp-16]
mov [eax+0Ch], edx
mov edx, [ebp-12]
mov [eax+10h], edx
mov edx, [ebp-8]
mov [eax+14h], edx
mov edx, [ebp-4]
mov [eax+18h], edx
mov eax, [ebp+8]
This basically proves that when a struct that does not fit in a register is returned by value, the address where the return value should reside is passed to the function as a parameter, and the called function copies the values of the returned struct to that address.
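The offsets used by the copy loop in foo (0, 4, 8, 0Ch, 10h, 14h, 18h) and the 28-byte size can be confirmed at compile time. A small sketch, assuming a 4-byte int as on the 32-bit target above:

```cpp
#include <cassert>
#include <cstddef>

// Same layout as the struct in the example above.
struct A {
    int b;
    int c;
    int d;
    int e;
    int f;
    int g;
    char x;
};

static_assert(sizeof(int) == 4, "assumption: 4-byte int");
static_assert(offsetof(A, c) == 0x04, "second member, copied via [eax+4]");
static_assert(offsetof(A, g) == 0x14, "sixth member, copied via [eax+14h]");
static_assert(offsetof(A, x) == 0x18, "the char, copied via [eax+18h]");
// 25 bytes of members plus 3 bytes of tail padding for 4-byte alignment.
static_assert(sizeof(A) == 28, "padded struct size");
```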
Hope this example helped you visualize what happens under the hood a little bit better :)
EDIT
I hope you've noticed that my example uses 32-bit assembly. I KNOW you asked about x86-64, but I'm currently unable to disassemble code on a 64-bit machine, so I hope you'll take my word for it that the concept is exactly the same for both 64-bit and 32-bit. The main difference is that the x64 calling convention passes the hidden pointer in a register instead of on the stack.
That is exactly correct. The caller passes an extra argument which is the address of the return value. Normally it will be on the caller's stack frame but there are no guarantees.
The precise mechanics are specified by the platform ABI, but this mechanism is very common.
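To make the extra argument visible, here is a sketch of the same function written twice: once the way the source reads, and once with the pointer spelled out. foo_lowered is a made-up name for illustration; real compilers do this transformation internally, and the exact register or stack slot used depends on the ABI.

```cpp
#include <cassert>

struct A {
    int b, c, d, e, f, g;
    char x;
};

// What the source code says: return a large struct by value.
A foo(int b, int c) {
    A result = {b, c, 5, 6, 7, 8, 10};
    return result;
}

// Roughly what the compiled code does (hypothetical hand-written form):
// the caller supplies the storage, the callee fills it in and hands
// the same address back in the return register.
A* foo_lowered(A* hidden_result, int b, int c) {
    *hidden_result = A{b, c, 5, 6, 7, 8, 10};
    return hidden_result;
}
```

A caller of foo_lowered would declare `A myA;` in its own frame and pass `&myA`, which mirrors what a compiler typically emits for `A myA = foo(5, 9);`.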
Various commenters have left useful links to documentation on calling conventions, so I'll hoist some of them into this answer:
Wikipedia article on x86 calling conventions
Agner Fog's collection of optimization resources, including a summary of calling conventions (Direct link to 57-page PDF document.)
Microsoft Developer Network (MSDN) documentation on calling conventions.
StackOverflow x86 tag wiki has lots of useful links.
After a WPO patch, the way a function that I call through an injected DLL is invoked has changed.
The function uses __fastcall.
The original function looked like
PUSH EAX
MOV EAX,DWORD PTR SS:[ESP]
PUSH EAX
LEA EBX,[ARG.22]
LEA EDI,[ARG.23]
CALL Function
So I could call it via:
Push ebx
Push edi
Push 0
Push 0
lea ebx,dword ptr ss:[ecx]
lea edi,dword ptr ss:[edx]
call Function
Pop edi
Pop ebx
retn
The function only needed 2 ascii strings.
Now after the WPO the function changed to
PUSH 0
LEA EDX,[LOCAL.22]
PUSH EDX
LEA EDX,[LOCAL.23]
XOR ECX,ECX
CALL Function
This is an ordinary fastcall and looks simpler. The problem is that the ebp register now carries a number, while esi and edi hold the same strings in Unicode.
Although the call itself still takes only 2 arguments, these registers hold additional data that the function requires.
So instead of calling the function with 2 ASCII strings in ecx and edx, I wrote a struct that contains the strings as both ASCII and Unicode.
My attempt to solve it looks like this:
pushad
push 0
lea edi,dword ptr ss:[ecx+0x20]
lea esi,dword ptr ss:[ecx]
mov ebp, 100
lea edx,dword ptr ss:[ecx+0x50]
push edx
lea edx,dword ptr ss:[ecx+0x40]
xor ecx, ecx
call Function
pop edx
popad
retn
I followed it in the debugger and the call is processed as it should be, but after the function returns to my asm stub, and the stub returns to my C++ code, my code raises an exception on a write.
Did I make a fundamental asm mistake, such as messing up the order of instructions, that causes the exception?