I've been trying to use 'thunking' so I can pass member functions to legacy APIs which expect a C function. I'm trying to use a solution similar to this. This is my thunk structure so far:
struct Thunk
{
byte mov; // ↓
uint value; // mov esp, 'value' <-- replace the return address with 'this' (since this thunk was called with 'call', we can replace the 'pushed' return address with 'this')
byte call; // ↓
int offset; // call 'offset' <-- we want to return here for ESP alignment, so we use call instead of 'jmp'
byte sub; // ↓
byte esp; // ↓
byte num; // sub esp, 4 <-- pop the 'this' pointer from the stack
//perhaps I should use 'ret' here as well/instead?
} __attribute__((packed));
The following code is a test of mine which uses this thunk structure (but it does not yet work):
#include <iostream>
#include <sys/mman.h>
#include <cstdio>
typedef unsigned char byte;
typedef unsigned short ushort;
typedef unsigned int uint;
typedef unsigned long ulong;
#include "thunk.h"
template<typename Target, typename Source>
inline Target brute_cast(const Source s)
{
static_assert(sizeof(Source) == sizeof(Target));
union { Target t; Source s; } u;
u.s = s;
return u.t;
}
void Callback(void (*cb)(int, int))
{
std::cout << "Calling...\n";
cb(34, 71);
std::cout << "Called!\n";
}
struct Test
{
int m_x = 15;
void Hi(int x, int y)
{
printf("X: %d | Y: %d | M: %d\n", x, y, m_x);
}
};
int main(int argc, char * argv[])
{
std::cout << "Begin Execution...\n";
Test test;
Thunk * thunk = static_cast<Thunk*>(mmap(nullptr, sizeof(Thunk),
PROT_EXEC | PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0));
thunk->mov = 0xBC; // mov esp
thunk->value = reinterpret_cast<uint>(&test);
thunk->call = 0xE8; // call
thunk->offset = brute_cast<uint>(&Test::Hi) - reinterpret_cast<uint>(thunk);
thunk->offset -= 10; // Adjust the relative call
thunk->sub = 0x83; // sub
thunk->esp = 0xEC; // esp
thunk->num = 0x04; // 'num'
// Call the function
Callback(reinterpret_cast<void (*)(int, int)>(thunk));
std::cout << "End execution\n";
}
If I use that code, I receive a segmentation fault within the Test::Hi function. The reason is obvious (once you analyze the stack in GDB), but I do not know how to fix it. The stack is not aligned properly.
The x argument contains garbage but the y argument contains the this pointer (see the Thunk code). That means the stack is misaligned by 8 bytes, but I still don't know why this is the case. Can anyone tell why this is happening? x and y should contain 34 and 71 respectively.
NOTE: I'm aware that this does not work in all scenarios (such as MI and the VC++ thiscall convention), but I want to see if I can get this to work, since I would benefit from it a lot!
EDIT: Obviously I also know that I can use static functions, but I see this more as a challenge...
Suppose you have a standalone (non-member, or maybe static) cdecl function:
void Hi_cdecl(int x, int y)
{
printf("X: %d | Y: %d\n", x, y); // no m_x here: a standalone function has no 'this'
}
Another function calls it this way:
push 71
push 34
push (return-address)
call (address-of-hi)
add esp, 8 (stack cleanup)
You want to replace this by the following:
push 71
push 34
push this
push (return-address)
call (address-of-hi)
add esp, 4 (cleanup of this from stack)
add esp, 8 (stack cleanup)
To do this, the thunk has to pop the return address off the stack, push this, and then push the return address again. For the cleanup, add 4 to esp (do not subtract): the stack grows downward, so adding is what releases the slot.
Regarding the return address - since the thunk must do some cleanup after the callee returns, it must store the original return-address somewhere, and push the return-address of the cleanup part of the thunk. So, where to store the original return-address?
In a global variable - might be an acceptable hack (since you probably don't need your solution to be reentrant)
On the stack - requires moving the whole block of parameters (using a machine-language equivalent of memmove), whose length is pretty much unknown
Please also note that the resulting stack is not 16-byte-aligned; this can lead to crashes if the function uses certain types (those that require 8-byte and 16-byte alignment - the SSE ones, for example; also maybe double).
Related
I have read some questions about returning more than one value such as What is the reason behind having only one return value in C++ and Java?, Returning multiple values from a C++ function and Why do most programming languages only support returning a single value from a function?.
I agree with most of the arguments used to show that more than one return value is not strictly necessary, and I understand why such a feature hasn't been implemented, but I still can't understand why we can't use multiple caller-saved registers such as ECX and EDX to return such values.
Wouldn't it be faster to use the registers instead of creating a Class/Struct to store those values or passing arguments by reference/pointers, both of which use memory to store them? If it is possible to do such thing, does any C/C++ compiler use this feature to speed up the code?
Edit:
An ideal code would be like this:
(int, int) getTwoValues(void) { return 1, 2; }
int main(int argc, char** argv)
{
// a and b are actually returned in registers
// so future operations with a and b are faster
(int a, int b) = getTwoValues();
// do something with a and b
return 0;
}
Yes, this is sometimes done. If you read the Wikipedia page on x86 calling conventions under cdecl:
There are some variations in the interpretation of cdecl, particularly in how to return values. As a result, x86 programs compiled for different operating system platforms and/or by different compilers can be incompatible, even if they both use the "cdecl" convention and do not call out to the underlying environment. Some compilers return simple data structures with a length of 2 registers or less in the register pair EAX:EDX, and larger structures and class objects requiring special treatment by the exception handler (e.g., a defined constructor, destructor, or assignment) are returned in memory. To pass "in memory", the caller allocates memory and passes a pointer to it as a hidden first parameter; the callee populates the memory and returns the pointer, popping the hidden pointer when returning.
(emphasis mine)
Ultimately, it comes down to the calling convention. It's possible for your compiler to optimize your code to use whatever registers it wants, but when your code interacts with other code (like the operating system), it needs to follow the standard calling conventions, which typically use one register for returning values.
Returning in stack isn't necessarily slower, because once the values are available in L1 cache (which the stack often fulfills), accessing them will be very fast.
However, in most computer architectures there are at least 2 registers for returning values that are twice (or more) as wide as the word size (edx:eax in x86, rdx:rax in x86_64, $v0 and $v1 in MIPS (Why MIPS assembler has more that one register for return value?), R0:R3 in ARM [1], X0:X7 in ARM64...). The ones that don't are mostly microcontrollers with only one accumulator or a very limited number of registers.
[1] "If the type of value returned is too large to fit in r0 to r3, or whose size cannot be determined statically at compile time, then the caller must allocate space for that value at run time, and pass a pointer to that space in r0."
These registers can also be used for directly returning small structs that fit in 2 registers or fewer (or more, depending on the architecture and ABI).
For example with the following code
struct Point
{
int x, y;
};
struct shortPoint
{
short x, y;
};
struct Point3D
{
int x, y, z;
};
Point P1()
{
Point p;
p.x = 1;
p.y = 2;
return p;
}
Point P2()
{
Point p;
p.x = 1;
p.y = 0;
return p;
}
shortPoint P3()
{
shortPoint p;
p.x = 1;
p.y = 0;
return p;
}
Point3D P4()
{
Point3D p;
p.x = 1;
p.y = 2;
p.z = 3;
return p;
}
Clang emits the following instructions for x86_64:
P1(): # #P1()
movabs rax, 8589934593
ret
P2(): # #P2()
mov eax, 1
ret
P3(): # #P3()
mov eax, 1
ret
P4(): # #P4()
movabs rax, 8589934593
mov edx, 3
ret
For ARM64:
P1():
mov x0, 1
orr x0, x0, 8589934592
ret
P2():
mov x0, 1
ret
P3():
mov w0, 1
ret
P4():
mov x1, 1
mov x0, 0
sub sp, sp, #16
bfi x0, x1, 0, 32
mov x1, 2
bfi x0, x1, 32, 32
add sp, sp, 16
mov x1, 3
ret
As you can see, no stack memory accesses are involved (the ARM64 version of P4 adjusts sp but never loads or stores through it). You can switch to other compilers to see that the values are mainly returned in registers.
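At the source level, the closest C++ gets to the question's `(int a, int b) = getTwoValues();` is probably returning a small struct and unpacking it with structured bindings. The following is a minimal sketch, assuming a C++17 compiler; `TwoInts` and `sumOfTwoValues` are illustrative names that do not appear in the question, and whether the two ints really travel in a register depends on the target ABI, as in the listings above.

```cpp
#include <cassert>

// Small trivially copyable aggregate (hypothetical name): same shape as Point.
struct TwoInts {
    int first;
    int second;
};

// Returns both values by value; the ABI decides whether they go in registers.
TwoInts getTwoValues() {
    return {1, 2};
}

int sumOfTwoValues() {
    // C++17 structured bindings unpack the members into named variables.
    auto [a, b] = getTwoValues();
    return a + b;
}
```

Compiling this with optimizations and inspecting the assembly for `getTwoValues` is a quick way to check what a given compiler and ABI actually do.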
Return data is put on the stack. Returning a struct by copy is literally the same thing as returning multiple values, in that all its data members are put on the stack. If you want multiple return values, that is the simplest way. I know that in Lua that's exactly how it is handled: it just wraps them in a struct. Why was it never implemented? Probably because you could just do it with a struct, so why implement a different method? As for C++, it actually does support multiple return values, but in the form of a special class, really the same way Java handles multiple return values (tuples) as well. So in the end it's all the same: either you copy the data raw (non-pointer/non-reference to a struct/object) or you just copy a pointer to a collection that stores multiple values.
(Intel x86. Turbo Assembler and BorlandC compilers, Turbo Linker.)
My question will be about how to modify my f1.asm (and possibly main1.cpp) code.
In main1.cpp I input integer values which I send to function in f1.asm, add them, and send back and display the result in main1.cpp.
main1.cpp:
#include <iostream.h>
#include <stdlib.h>
#include <math.h>
extern "C" int f1(int, int, int);
int main()
{
int a,b,c;
cout<<"W = a+b+c" << endl ;
cout<<"a = " ;
cin>> a;
cout<<"b = " ;
cin>>b;
cout<<"c = " ;
cin>>c;
cout<<"\nW = "<< f1(a,b,c) ;
return 0;
}
f1.asm:
.model SMALL, C
.data
.code
PUBLIC f1
f1 PROC
push BP
mov BP, SP
mov ax,[bp+4]
add ax,[bp+6]
add ax,[bp+8]
pop BP
ret
f1 ENDP
.stack
db 100 dup(?)
END
I want to make such a function for an arbitrary number of variables by sending a pointer to the array of elements to the f1.asm.
QUESTION: If I make the int f1(int, int, int) function in main1.cpp into a int f1( int* ), and put into it the pointer to the array containing the to-be-added values, then how should my .asm code look to access the first (and subsequent) array elements?
How is the pointer stored? I tried treating it as an offset, and as an offset of an offset, and I tried a few other things, but I still couldn't access the array's elements.
(If I can just access the first few, I can take care of the rest of the problem.)
...Or should I, in this particular case, use something else from .cpp's side than a pointer?
Ouch, long time I haven't seen a call from 16 bits C to assembly ...
C and C++ allow passing a variable number of arguments, provided the callee can determine how many there are, because the caller pushes all arguments in reverse order before calling the function and cleans up the stack after the function returns.
But passing an array is something totally different: you just pass one single value, which is the address (a pointer...) of the array.
Assuming you pass an array of 3 ints in the 16-bit small model (int, data pointers, and code addresses are all 16 bits):
C++
int arr[3] = {1, 2, 3};
int cr;
cr = f1(arr);
ASM
push BP
mov BP, SP
mov ax,[bp+4] ; get the address of the array
mov bp, ax ; BP now points to the array
mov ax, [bp] ; get value of first element
add ax,[bp+2] ; add remaining elements
add ax,[bp+4]
pop BP
ret
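The same idea can be spelled out in C++: the function receives one pointer, and element i lives at byte offset i * sizeof(int) from it (2 bytes per int in the 16-bit small model, hence [bp+2] and [bp+4] above). This is only an illustrative sketch; `sum_via_offsets` and its `count` parameter are assumptions, since the f1(int*) in the question takes the pointer alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Adds `count` ints starting at `arr`, addressing each element by its byte
// offset from the base pointer, the way the assembly version does.
int sum_via_offsets(const int* arr, std::size_t count) {
    assert(arr != nullptr || count == 0);
    const unsigned char* base = reinterpret_cast<const unsigned char*>(arr);
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        int element;
        // Equivalent of "mov/add ax, [base + i * sizeof(int)]".
        std::memcpy(&element, base + i * sizeof(int), sizeof(int));
        total += element;
    }
    return total;
}
```

Passing the element count alongside the pointer is one way to handle an arbitrary number of values; a sentinel value at the end of the array would be another.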
I'm trying to write a thunk for __thiscall using a struct.
I've tested this struct and it works:
#pragma pack(push, 1)
struct Thunk
{
unsigned short leaECX;
unsigned long pThis;
unsigned char movEAX;
unsigned long pMemFunc;
unsigned short jmpEAX;
};
#pragma pack(pop)
I fill this struct with the following bytecode (which I found online):
//Load effective address of this to ECX
//because __thiscall expect to get 'this' in ECX
leaECX = 0x0D8D;
pThis = here goes 'this' pointer;
//Move member function pointer to EAX
movEAX = 0xB8;
pMemFunc = here goes pointer to member function;
//Jump to member function
jmpEAX = 0xE0FF;
My question is: can the movEAX and jmpEAX instructions be replaced with the bytecode for an assembly call instruction?
If so, how do I do it?
I'm allocating this struct using VirtualAlloc with the flags MEM_COMMIT | MEM_RESERVE and PAGE_EXECUTE_READWRITE.
Is this a compact way, or does it waste memory (allocating a whole page instead of sizeof(Thunk))?
You can use call, but then of course execution will return to your thunk, so you need more code afterwards. Also, if you get rid of the mov, I assume you will want to use the call address variant; in that case be mindful that it uses relative encoding, so you can't just poke your address into memory.
You can switch to relative jump to get rid of the mov, using something like this:
#pragma pack(push, 1)
struct Thunk
{
unsigned short leaECX;
unsigned long pThis;
unsigned char jmp;
unsigned long pOffset;
};
#pragma pack(pop)
//Load effective address of this to ECX
//because __thiscall expect to get 'this' in ECX
leaECX = 0x0D8D;
pThis = here goes 'this' pointer;
jmp = 0xE9;
pOffset = (char*)address_of_member - (char*)&thunk.pOffset - 4;
Since memory protections are page granular you will need at least a page (VirtualAlloc does round up for you automatically). If you have multiple thunks you can of course pack them into the same page.
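The relative encoding used by jmp (E9) and call (E8) can be checked in isolation. The sketch below assumes 32-bit x86 addresses and uses made-up helper names (`rel32`, `rel32_from_field`); it only computes the displacement and does not write any executable memory.

```cpp
#include <cassert>
#include <cstdint>

// Displacement for a 5-byte `call rel32` (E8) or `jmp rel32` (E9) whose
// opcode byte sits at instr_addr. The CPU adds the displacement to the
// address of the NEXT instruction, hence the "+ 5". Unsigned arithmetic
// wraps modulo 2^32, which yields the right encoding for backward jumps.
constexpr std::uint32_t rel32(std::uint32_t instr_addr, std::uint32_t target) {
    return target - (instr_addr + 5u);
}

// Same value, computed from the address of the 4-byte displacement field
// itself, which is the form used for pOffset above ("- 4").
constexpr std::uint32_t rel32_from_field(std::uint32_t field_addr, std::uint32_t target) {
    return target - (field_addr + 4u);
}

static_assert(rel32(0x400000u, 0x401000u) == 0x00000FFBu, "forward jump");
static_assert(rel32(0x401000u, 0x400000u) == 0xFFFFEFFBu, "backward jump wraps");
static_assert(rel32_from_field(0x400001u, 0x401000u) == rel32(0x400000u, 0x401000u),
              "both forms agree");
```

In the Thunk above the jmp opcode is the seventh byte of the struct, so the instruction address passed to `rel32` would be the thunk address plus 6.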
I want to print the return value in my tracer. I have two questions:
How do I get the return address?
Is the return position updated before OR after ~Tracer()?
struct Tracer
{
int* _retval;
~Tracer()
{ printf("return value is %d", *_retval); }
};
int foo()
{
Tracer __tracter = { __Question_1_how_to_get_return_address_here__ };
if(cond) {
return 0;
} else {
return 99;
}
//Question-2:
// is the return position updated before OR after ~Tracer() is called ???
}
I found some hints for Question-1, checking Vc code now
For gcc, __builtin_return_address
http://gcc.gnu.org/onlinedocs/gcc/Return-Address.html
For Visual C++, _ReturnAddress
You can't portably or reliably do this in C++. The return value may be in memory or in a register and may or may not be indirected in different cases.
You could probably use inline assembly to make something work on certain hardware/compilers.
One possible way is to make your Tracer a template that takes a reference to a return value variable (when appropriate) and prints that out before destructing.
Also note that identifiers with __ (double underscore) are reserved for the implementation.
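The template idea could look roughly like the sketch below. It is one possible shape, not a definitive implementation: the `std::ostream&` parameter, the `result` local, and the `traced_output` helper are assumptions added so the output can be inspected. It relies on two language rules: locals are destroyed in reverse order of declaration, and the returned value is copied out before the destructors of locals run.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Prints the current value of a referenced variable when the scope exits.
template <typename T>
class Tracer {
public:
    Tracer(const T& retval, std::ostream& out) : retval_(retval), out_(out) {}
    Tracer(const Tracer&) = delete;
    Tracer& operator=(const Tracer&) = delete;
    ~Tracer() { out_ << "return value is " << retval_; }

private:
    const T& retval_;
    std::ostream& out_;
};

int foo(bool cond, std::ostream& out) {
    int result = 0;                   // declared first, so it outlives the tracer
    Tracer<int> tracer(result, out);  // destroyed before result
    if (cond) {
        result = 0;
    } else {
        result = 99;
    }
    return result;  // the value is copied out before ~Tracer() runs
}

// Hypothetical helper: runs foo and returns what the tracer printed.
std::string traced_output(bool cond) {
    std::ostringstream log;
    const int value = foo(cond, log);
    assert(value == (cond ? 0 : 99));
    (void)value;  // avoids an unused-variable warning when NDEBUG is defined
    return log.str();
}
```

The cost of this design is that every return path has to go through the traced variable, rather than returning a literal directly.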
Your question is rather confusing: you're interchangeably using the terms "address" and "value", which are not interchangeable.
Return value is what the function spits out, in x86(_64) that comes in the form of a 4/8 byte value in E/RAX, or EDX:EAX, or XMM0, etc, you can read more about it here.
Return address on the other hand, is what E/RSP point to when a call is made (aka thing on top of the stack), and holds the address of where the function "jumps" back to when it's done (what is by definition called returning).
Now I don't even know what a tracer is tbh, but I can tell you how you'd get either, it's all about hooks.
For the value, and assuming you're doing it internally, just hook the function with one with the same definition, and once it returns you'll have your result.
For the address it's a bit more complicated because you'll have to go a bit lower and possibly do some asm shenanigans. I really have no idea what exactly you are looking to accomplish, but I made a little "stub", if you will, to provide the callee with the return pointer.
Here it is:
void __declspec(noinline) __declspec(naked) __stdcall _replaceFirstArgWithRetPtrAndJump_() {
__asm { //let's call the function we jump to "callee", and the function that called us "caller"
push ebp //save ebp, ESP IS NOW -4
mov ebp, [esp + 4] //save return address
mov eax, [esp + 8] //get callee's address (which is the first param) - eax is volatile so iz fine
mov[esp + 8], ebp //put the return address where the callee's address was (to the callee, it will be the caller)
pop ebp //restore ebp
jmp eax //jump to callee
} }
#define CallFunc_RetPtr(Function, ...) ((decltype(&Function))_replaceFirstArgWithRetPtrAndJump_)(Function, __VA_ARGS__)
unsigned __declspec(noinline) __stdcall printCaller(void* caller, unsigned param1, unsigned param2) {
printf("I'm printCaller, Called By %p; Param1: %u, Param2: %u\n", caller, param1, param2);
return 20;
}
void __declspec(noinline) doshit() {
printf("us: %p\nFunction we're calling: %p\n", doshit, printCaller);
CallFunc_RetPtr(printCaller, 69, 420);
}
Now sure, you could and maybe should use _ReturnAddress() or any different compiler's intrinsics, but if that's not available (which should be a really rare scenario depending on your work) and you know your ASM, this concept should work for any architecture, since however different the instruction set may be, they all follow the same Program Counter design.
I wrote this more because I was looking for an answer for this quite a long time ago for a certain purpose, and I couldn't find a good one since most people just go "hurr durr it's not possible or portable or whatever", and I feel like this would have helped.
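For comparison, the intrinsic route mentioned above needs no assembly. The following is a small sketch, assuming GCC/Clang (`__builtin_return_address`) or MSVC (`_ReturnAddress`); the macro and function names are made up for this example.

```cpp
#include <cassert>
#include <cstdio>

#if defined(_MSC_VER)
#include <intrin.h>
#pragma intrinsic(_ReturnAddress)
#define TRACE_RETURN_ADDRESS() _ReturnAddress()
#define TRACE_NOINLINE __declspec(noinline)
#else
#define TRACE_RETURN_ADDRESS() __builtin_return_address(0)
#define TRACE_NOINLINE __attribute__((noinline))
#endif

// Returns the address this function will return to, i.e. a location
// inside whoever called it. noinline keeps it a real call.
TRACE_NOINLINE void* caller_address() {
    return TRACE_RETURN_ADDRESS();
}

// Prints the return address of the current call.
TRACE_NOINLINE void print_caller() {
    std::printf("called from %p\n", TRACE_RETURN_ADDRESS());
}
```

Note that this yields the return address, not the return value; the printed pointer differs between builds and runs.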
I'm doing reverse-engineery stuff and patching a game's memory via DLL. Usually I stick to the same old way of patching everything in a single or several functions. But it feels like it could be pulled off better by using a struct array which defines the memory writes that need to take place and looping through them all in one go. Much easier to manage, IMO.
I wanna make it constant, though. So the data is all there in one go (in .rdata) instead of having to dynamically allocate memory for such things each patch, which is a simple task with 'bytesize' data, for example:
struct struc_patch
{
BYTE val[8]; // max size of each patch (usually I only use 5 bytes anyway for call and jmp writes)
// I can of course increase this if really needed
void *dest;
char size;
} patches[] =
{
// simply write "01 02 03 04" to 0x400000
{{0x1, 0x2, 0x3, 0x4}, (void*)0x400000, 4},
};
//[...]
for each(struc_patch p in patches)
{
memcpy(p.dest, p.val, p.size);
}
But when I want to get fancier with the types, I find no way to specify an integer like "0x90909090" as the byte array "90 90 90 90". So this won't work:
struct struc_patch
{
BYTE val[8]; // max size of each patch (usually I only use 5 bytes anyway for call and jmp writes)
// I can of course increase this if really needed
void *dest;
char size;
} patches[] =
{
// how to write "jmp MyHook"? Here, the jmp offset will be truncated instead of overlapping in the array. Annoying.
{{0xE9, (DWORD)&MyHook - 0x400005}, (void*)0x400000, 5},
};
Of course the major problem is that &MyHook has to be resolved by the compiler. Any other way to get the desired result and keep it const?
I've got little experience with STL, to be honest. So if there is a solution using that, I might need it explained in detail in order to understand the code properly. I'm a big C/C++/WinAPI junkie lol, but it's for a game written in a similar nature, so it fits.
I don't think anything from the STL will help you with this, not at compile time.
There might be a fancy way of doing with templates what you did with macros. (comma separating the bytes)
But I recommend doing something simple like this:
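// pack to 1 byte: otherwise 3 padding bytes follow opcode and a 5-byte copy is wrong
#pragma pack(push, 1)
struct jump_insn
{
unsigned char opcode;
unsigned long addr;
} jump_insns[] = {
{0xe9, (unsigned long)&MyHook - 0x400005}
};
#pragma pack(pop)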
struct mem
{
unsigned char val[8];
} mems[] = {
{1,2,3,4}
};
struct struc_patch
{
unsigned char *val; // points at the bytes to write, so a patch is no longer limited to 8 bytes
void *dest;
char size;
} patches[] =
{
// simply write "01 02 03 04" to 0x400000
{(unsigned char*)(&mems[0]), (void*)0x400000, 4},
// write "jmp MyHook" (5 bytes: opcode + relative offset) to 0x400000
{(unsigned char*)(&jump_insns[0]), (void*)0x400000, 5},
};
You can't do everything inline and you will need new types for different kind of patches, but they can be arbitrarily long (not just 8 bytes) and everything will be in .rodata.
A better way to handle that is to calculate the address difference on the fly. For instance (source):
#define INST_CALL 0xE8
void InterceptLocalCode(BYTE bInst, DWORD pAddr, DWORD pFunc, DWORD dwLen)
{
BYTE *bCode = new BYTE[dwLen];
::memset(bCode, 0x90, dwLen);
DWORD dwFunc = pFunc - (pAddr + 5);
bCode[0] = bInst;
*(DWORD *)&bCode[1] = dwFunc;
WriteBytes((void*)pAddr, bCode, dwLen);
delete[] bCode;
}
void PatchCall(DWORD dwAddr, DWORD dwFunc, DWORD dwLen)
{
InterceptLocalCode(INST_CALL, dwAddr, dwFunc, dwLen);
}
dwAddr is the address to put the call instruction in, dwFunc is the function to call and dwLen is the length of the instruction to replace (basically used to calculate how many NOPs to put in).
To summarize, my solution (thanks to Nicolas' suggestion):
#pragma pack(push)
#pragma pack(1)
#define POFF(d,a) ((DWORD)(d) - ((a) + 5))
struct jump_insn
{
const BYTE opcode = 0xE9;
DWORD offset;
};
struct jump_short_insn
{
const BYTE opcode = 0xEB;
BYTE offset;
};
struct struc_patch
{
void *data;
void *dest;
char size;
};
#pragma pack(pop)
And in use:
// Patches
jump_insn JMP_HOOK_LoadButtonTextures = {POFF(&HOOK_LoadButtonTextures, 0x400000)};
struc_patch patches[] =
{
{&JMP_HOOK_LoadButtonTextures, IntToPtr(0x400000), sizeof(JMP_HOOK_LoadButtonTextures)},
};
Using const class members I can define everything much more easily and cleanly, and it can all simply be memcpy'd. The pack pragma is of course required to ensure that memcpy doesn't copy the 3 padding bytes between the BYTE opcode and the DWORD offset.
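The effect of the pack pragma can be verified without touching a real process. The sketch below is a self-contained variant with assumed names (`JumpInsn`, `Patch`, `make_jump`, `apply_patches`) and fixed-width types in place of BYTE/DWORD. It writes into an ordinary buffer instead of game memory and assumes a little-endian host such as x86.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct JumpInsn {
    std::uint8_t opcode;   // 0xE9 = jmp rel32
    std::uint32_t offset;  // target - (address of the jmp + 5)
};
#pragma pack(pop)

// Without packing, sizeof(JumpInsn) would typically be 8, not 5.
static_assert(sizeof(JumpInsn) == 5, "opcode and offset must be adjacent");

struct Patch {
    const void* data;  // bytes to write
    void* dest;        // where to write them
    std::size_t size;  // how many bytes
};

// Builds a jmp placed at `site` that lands on `target` (32-bit addresses).
constexpr JumpInsn make_jump(std::uint32_t target, std::uint32_t site) {
    return JumpInsn{0xE9, target - (site + 5u)};
}

// Copies every patch to its destination in one pass.
inline void apply_patches(const Patch* patches, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        std::memcpy(patches[i].dest, patches[i].data, patches[i].size);
    }
}
```

Storing the size explicitly in each entry (for example via `sizeof`) keeps the loop independent of the instruction type being written.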
Thanks all, helped me make my patching methods a lot more robust.