Implementation of Stackful Coroutine in C++ - c++

Now I am trying to implement stackful coroutine in C++17 on Windows x64 OS, but, unfortunately, I have encountered the problem: I can't throw exception in my coroutine, if I do so, the program is immediately terminated with a bad exit code.
Implementation
At the begining, I allocate a stack for a new coroutine, the code looks something like that:
void* Allocate() {
static constexpr std::size_t kStackSize{524'288};
auto new_stack{::operator new(kStackSize)};
return static_cast<std::byte *>(new_stack) + kStackSize;
}
The next step is setting a trampoline function on the recently allocated stack. The code is written using MASM, since I utilize MVSC (I would like to use GCC and NASM but I have the problem with thread_local variables, see question, if it is interesting):
SetTrampoline PROC
mov rax, rsp ; saves the current stack pointer
mov rsp, [rcx] ; sets the new stack pointer
sub rsp, 20h ; shadow stack
push rdx ; saves the function pointer
; place for nonvolatile registers
sub rsp, 0e0h
mov [rcx], rsp ; saves the moved stack pointer
mov rsp, rax ; returns the initial stack pointer
ret
SetTrampoline ENDP
Then I switch machine context with this assembly function (I read this calling convetion):
SwitchContext PROC
; saves all nonvolatile registers to the caller stack
push rbx
push rbp
push rdi
push rsi
push r12
push r13
push r14
push r15
sub rsp, 10h
movdqu [rsp], xmm6
; ... pushes xmm7 - xmm14 in here, removed for brevity
sub rsp, 10h
movdqu [rsp], xmm15
mov [rdx], rsp ; saves the caller stack pointer
SwitchContextFinally PROC
mov rsp, [rcx] ; sets the callee stack pointer
; takes out the callee registers
movdqu xmm15, [rsp]
add rsp, 10h
; ... pops xmm7 - xmm14 in here, removed for brevity
movdqu xmm6, [rsp]
add rsp, 10h
pop r15
pop r14
pop r13
pop r12
pop rsi
pop rdi
pop rbp
pop rbx
ret
SwitchContextFinally ENDP
SwitchContext ENDP
Problem
Inside the trampoline I just invoke any passed function and within these functions I can't throw exceptions and catch them instantly in the same fucntion. What have I done wrong? Is it possible to throw exceptions in my case? Should I have shadow stack in SetTrampoline?
Also, I guarantee that the exception thrown don't go outside the trampoline function.

Related

C++/Assembly pushing arguments onto the stack for function call

i have a special case where i need to push arguments onto the stack one at a time and then call a function which takes a callable as an argument and pass those arguments that were pushed to the stack to the callable/function. to solve this i created a few functions in assembly to push arguments onto the stack and call a function passing those arguments. however it seams to work but my code started breaking in weird places whenever i made calls to other functions inside the callable passed to CallFunction.
i need this function to take any callable. any help would be much appreciated.
please note that std::bind and other similar functions are not an option. each argument needs to be pushed onto the stack one at a time using a function call and i need to be able to call functions between calls to PushArg and CallFunction should return the return value of the function it executes.
What ive tried:
saving the stack pointer at the beginning of the call and restoring it before i return (current state)
creating my own stack in memory for storing arguments (too complex and messy and not as efficient as just using the registers and existing stack. would like to avoid if possible)
Code:
.data
FunctionPointer QWORD 0
ReturnPointer QWORD 0
StackPointer QWORD 0
.code
ALIGN 16
FT_StartCall PROC
mov [StackPointer],rsp
ret
FT_StartCall ENDP
FT_PushIntPointer PROC
pop ReturnPointer
push rcx
push ReturnPointer
ret
FT_PushIntPointer ENDP
FT_CallFunction PROC
;save return address in Memory
pop ReturnPointer
mov FunctionPointer,rcx
pop rcx
pop rdx
pop r8
pop r9
push r9
push r8
push rdx
push rcx
call FunctionPointer
mov rcx,0
mov rdx,0
mov r8,0
mov r9,0
mov rsp, [StackPointer]
push ReturnPointer
ret
FT_CallFunction ENDP
END
C++:
extern "C" void* __fastcall FT_CallFunction(void* Function);
extern "C" void __fastcall FT_PushIntPointer(void* ArgOrPointer);
extern "C" void __fastcall FT_StartCall();
int ADD(int first,int second){
return first+second;
}
int main(){
FT_StartCall();
FT_PushIntPointer(3);
FT_PushIntPointer(2);
int res = FT_CallFunction(ADD);
return res;
}
Output:
5
As everyone agrees, you are delving into UB territory with this code, but it does have some experimental value, I guess.
You should consider allocating your own private stack for something a bit more robust.
This bit of code:
pop rcx
pop rdx
pop r8
pop r9
push r9
push r8
push rdx
push rcx
Does nothing, besides scrambling rcd, rdx, r8 and r9. You should remove it.
The same goes for this pîece of code:
mov rcx,0
mov rdx,0
mov r8,0
mov r9,0
Did you try something like this?
FT_CallFunction PROC
; save return address in Memory
pop [ReturnPointer]
; call and replace with our own return point.
call [rcx]
mov rsp,[StackPointer]
; return to caller
push [ReturnPointer]
ret
FT_CallFunction ENDP

Why are parameters allocated below the frame pointer instead of above?

I have tried to understand this basing on a square function in c++ at godbolt.org . Clearly, return, parameters and local variables use “rbp - alignment” for this function.
Could someone please explain how this is possible?
What then would rbp + alignment do in this case?
int square(int num){
int n = 5;// just to test how locals are treated with frame pointer
return num * num;
}
Compiler (x86-64 gcc 11.1)
Generated Assembly:
square(int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi. ;\\Both param and local var use rbp-*
mov DWORD PTR[rbp-4], 5. ;//
mov eax, DWORD PTR [rbp-20]
imul eax, eax
pop rbp
ret
This is one of those cases where it’s handy to distinguish between parameters and arguments. In short: arguments are the values given by the caller, while parameters are the variables holding them.
When square is called, the caller places the argument in the rdi register, in accordance with the standard x86-64 calling convention. square then allocates a local variable, the parameter, and places the argument in the parameter. This allows the parameter to be used like any other variable: be read, written into, having its address taken, and so on. Since in this case it’s the callee that allocated the memory for the parameter, it necessarily has to reside below the frame pointer.
With an ABI where arguments are passed on the stack, the callee would be able to reuse the stack slot containing the argument as the parameter. This is exactly what happens on x86-32 (pass -m32 to see yourself):
square(int): # #square(int)
push ebp
mov ebp, esp
push eax
mov eax, dword ptr [ebp + 8]
mov dword ptr [ebp - 4], 5
mov eax, dword ptr [ebp + 8]
imul eax, dword ptr [ebp + 8]
add esp, 4
pop ebp
ret
Of course, if you enabled optimisations, the compiler would not bother with allocating a parameter on the stack in the callee; it would just use the value in the register directly:
square(int): # #square(int)
mov eax, edi
imul eax, edi
ret
GCC allows "leaf" functions, those that don't call other functions, to not bother creating a stack frame. The free stack is fair game to do so as these fns wish.

What does Znwm and ZdlPv mean in assembly?

I'm new to assembly and I'm trying to figure out how C++ handles dynamic dispatch in assembly.
When looking through assembly code, I saw that there were 2 unusual calls:
call _Znwm
call _ZdlPv
These did not have a subroutine that I could trace them to. From examining the code, Znwm seemed to return the address of the object when its constructor was called, but I'm not sure about that. ZdlPv was in a block of code that could never be reached (it was jumped over).
C++:
Fruit * f;
f = new Apple();
x86:
# BB#1:
mov eax, 8
mov edi, eax
call _Znwm
mov rdi, rax
mov rcx, rax
.Ltmp6:
mov qword ptr [rbp - 48], rdi # 8-byte Spill
mov rdi, rax
mov qword ptr [rbp - 56], rcx # 8-byte Spill
call _ZN5AppleC2Ev
Any advice would be appreciated.
Thanks.
_Znwm is operator new.
_ZdlPv is operator delete.

C/C++ returning struct by value under the hood

(This question is specific to my machine's architecture and calling conventions, Windows x86_64)
I don't exactly remember where I had read this, or if I had recalled it correctly, but I had heard that, when a function should return some struct or object by value, it will either stuff it in rax (if the object can fit in the register width of 64 bits) or be passed a pointer to where the resulting object would be (I'm guessing allocated in the calling function's stack frame) in rcx, where it would do all the usual initialization, and then a mov rax, rcx for the return trip. That is, something like
extern some_struct create_it(); // implemented in assembly
would really have a secret parameter like
extern some_struct create_it(some_struct* secret_param_pointing_to_where_i_will_be);
Did my memory serve me right, or am I incorrect? How are large objects (i.e. wider than the register width) returned by value from functions?
Here's a simple disassembling of a code exampling what you're saying
typedef struct
{
int b;
int c;
int d;
int e;
int f;
int g;
char x;
} A;
A foo(int b, int c)
{
A myA = {b, c, 5, 6, 7, 8, 10};
return myA;
}
int main()
{
A myA = foo(5,9);
return 0;
}
and here's the disassembly of the foo function, and the main function calling it
main:
push ebp
mov ebp, esp
and esp, 0FFFFFFF0h
sub esp, 30h
call ___main
lea eax, [esp+20] ; placing the addr of myA in eax
mov dword ptr [esp+8], 9 ; param passing
mov dword ptr [esp+4], 5 ; param passing
mov [esp], eax ; passing myA addr as a param
call _foo
mov eax, 0
leave
retn
foo:
push ebp
mov ebp, esp
sub esp, 20h
mov eax, [ebp+12]
mov [ebp-28], eax
mov eax, [ebp+16]
mov [ebp-24], eax
mov dword ptr [ebp-20], 5
mov dword ptr [ebp-16], 6
mov dword ptr [ebp-12], 7
mov dword ptr [ebp-8], 9
mov byte ptr [ebp-4], 0Ah
mov eax, [ebp+8]
mov edx, [ebp-28]
mov [eax], edx
mov edx, [ebp-24]
mov [eax+4], edx
mov edx, [ebp-20]
mov [eax+8], edx
mov edx, [ebp-16]
mov [eax+0Ch], edx
mov edx, [ebp-12]
mov [eax+10h], edx
mov edx, [ebp-8]
mov [eax+14h], edx
mov edx, [ebp-4]
mov [eax+18h], edx
mov eax, [ebp+8]
leave
retn
now let's go through what just happened, so when calling foo the paramaters were passed in the following way, 9 was at highest address, then 5 then the address the myA in main begins
lea eax, [esp+20] ; placing the addr of myA in eax
mov dword ptr [esp+8], 9 ; param passing
mov dword ptr [esp+4], 5 ; param passing
mov [esp], eax ; passing myA addr as a param
within foo there is some local myA which is stored on the stack frame, since the stack is going downwards, the lowest address of myA begins in [ebp - 28], the -28 offset could be caused by struct alignments so I'm guessing the size of the struct should be 28 bytes here and not 25 as expected. and as we can see in foo after the local myA of foo was created and filled with parameters and immediate values, it is copied and re-written to the address of myA passed from main ( this is the actual meaning of return by value )
mov eax, [ebp+8]
mov edx, [ebp-28]
[ebp + 8] is where the address of main::myA was stored ( memory address go upwards hence ebp + old ebp ( 4 bytes ) + return address ( 4 bytes )) at overall ebp + 8 to get to the first byte of main::myA, as said earlier foo::myA is stored within [ebp-28] as stack goes downwards
mov [eax], edx
place foo::myA.b in the address of the first data member of main::myA which is main::myA.b
mov edx, [ebp-24]
mov [eax+4], edx
place the value that resides in the address of foo::myA.c in edx, and place that value within the address of main::myA.b + 4 bytes which is main::myA.c
as you can see this process repeats itself through out the function
mov edx, [ebp-20]
mov [eax+8], edx
mov edx, [ebp-16]
mov [eax+0Ch], edx
mov edx, [ebp-12]
mov [eax+10h], edx
mov edx, [ebp-8]
mov [eax+14h], edx
mov edx, [ebp-4]
mov [eax+18h], edx
mov eax, [ebp+8]
which basically proves that when returning a struct by val, that could not be placed in as a param, what happens is that the address of where the return value should reside in is passed as a param to the function and within the function being called the values of the returned struct are copied into the address passed as a parameter...
hope this exampled helped you visualize what happens under the hood a little bit better :)
EDIT
I hope that you've noticed that my example was using 32 bit assembler and I KNOW you've asked regarding x86-64, but I'm currently unable to disassemble code on a 64 bit machine so I hope you take my word on it that the concept is exactly the same both for 64 bit and 32 bit, and that the calling convention is nearly the same
That is exactly correct. The caller passes an extra argument which is the address of the return value. Normally it will be on the caller's stack frame but there are no guarantees.
The precise mechanics are specified by the platform ABI, but this mechanism is very common.
Various commentators have left useful links with documentation for calling conventions, so I'll hoist some of them into this answer:
Wikipedia article on x86 calling conventions
Agner Fog's collection of optimization resources, including a summary of calling conventions (Direct link to 57-page PDF document.)
Microsoft Developer Network (MSDN) documentation on calling conventions.
StackOverflow x86 tag wiki has lots of useful links.

C++ inline ASM exception on return

Due to a WPO patch the way a function I called through an injected DLL changed.
The function is a __fastcall
The original function looked like
PUSH EAX
MOV EAX,DWORD PTR SS:[ESP]
PUSH EAX
LEA EBX,[ARG.22]
LEA EDI,[ARG.23]
CALL Function
So I could call it via:
Push ebx
Push edi
Push 0
Push 0
lea ebx,dword ptr ss:[ecx]
lea edi,dword ptr ss:[edx]
call Function
Pop edi
Pop ebx
retn
The function only needed 2 ascii strings.
Now after the WPO the function changed to
PUSH 0
LEA EDX,[LOCAL.22]
PUSH EDX
LEA EDX,[LOCAL.23]
XOR ECX,ECX
CALL Function
A common fastcall, which looks simpler. But the issue started that the ebp register carried a number while esi and edi the same strings but in Unicode.
While the call still needed only 2 arguments the registers contained additional which was required.
So instead of calling the function via 2 Ascii on ecx and edx I wrote a struct which contained the strings as ascii and unicode.
My attempt to solve it looked like
pushad
push 0
lea edi,dword ptr ss:[ecx+0x20]
lea esi,dword ptr ss:[ecx]
mov ebp, 100
lea edx,dword ptr ss:[ecx+0x50]
push edx
lea edx,dword ptr ss:[ecx+0x40]
xor ecx, ecx
call Function
pop edx
popad
retn
I followed it in the debugger and the call is processed as it should be, but after the the function returns to my asmstub and returns to my c++ code my code creates an exception on write.
Did I make a fundamental asm mistake such as messing up the order which causes the exception?