I'm trying to write a jump into memory, and I can't find anything that explains how it works.
typedef UINT(WINAPI* tResetWriteWatch)(LPVOID lpBaseAddress, SIZE_T dwRegionSize);
UINT WINAPI ResetWriteWatchHook(LPVOID lpBaseAddress, SIZE_T dwRegionSize){
printf("Function called\n");
return 0;
}
void main(){
DWORD64 hookAddr = (DWORD64)&ResetWriteWatch;
WriteJump(hookAddr, ResetWriteWatchHook/*Let's say it's 0x7FE12345678*/);//Writes E9 XX XX XX XX to memory
}
My main issue is that I don't understand how to convert the asm JMP 0x7FE12345678 to E9 XX XX XX XX so I can write it at hookAddr.
The process is 64-bit.
This is how it's commonly done in a 32-bit program (I'm not sure how much differs on 64-bit), but it should give you an idea of where to go. It is specific to Windows because of VirtualProtect, but you can use mprotect if you are on Linux.
#include <stdio.h>
#include <windows.h>
void foo() {
printf("Foo!");
}
void my_foo() {
printf("My foo!");
}
int setJMP(void *from, void *to) {
DWORD protection;
if (!VirtualProtect(from, 5, PAGE_EXECUTE_READWRITE, &protection)) { // We must be able to write to it (don't necessarily need execute and read)
return 0;
}
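*(unsigned char *)from = 0xE9; // jmp rel32 opcode
*(int *)((char *)from + 1) = (int)((char *)to - (char *)from - 5); // offset relative to the end of the 5-byte jmp (cast to char* because arithmetic on void* is not standard C)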
return VirtualProtect(from, 5, protection, &protection); // Restore original protection
}
int main() {
setJMP(foo, my_foo);
foo(); // outputs "My foo!"
return 0;
}
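I would suggest using an assembler to generate the correct bytes. Simply create the following file and run it through NASM:
BITS 64
jmp 0x7fe1234
The result is:
e9 2f 12 fe 07 (a hex dump that groups bytes into 16-bit words shows this as 2f e9 fe 12 00 07, due to the LE byte order). This is a relative jump: E9 is followed by a 32-bit offset measured from the end of the instruction, so the bytes depend on where the jump is placed, which makes it harder to generate from high-level code. The far-jump opcode EA takes an absolute address, but it is not valid in 64-bit mode. In a 64-bit process you can instead load the absolute address into a register and jump through it (mov rax, imm64 followed by jmp rax); then you can simply use the absolute address of the location to which you want to jump.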
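To make the E9 XX XX XX XX encoding concrete, here is a minimal sketch in portable C that only fills a byte buffer (it does not patch any memory). The helper name encode_jmp is hypothetical, not part of any API. It assumes the usual x86-64 encodings: E9 rel32 when the target is within +/-2 GB of the end of the instruction, otherwise mov rax, imm64 followed by jmp rax, which clobbers RAX.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: encodes a jump located at address `from` that
   lands on address `to`. Writes the bytes into buf (at least 12 bytes)
   and returns how many bytes were written (5 or 12). */
size_t encode_jmp(uint8_t *buf, uint64_t from, uint64_t to)
{
    /* rel32 is measured from the END of the 5-byte instruction. */
    int64_t disp = (int64_t)(to - (from + 5));

    if (disp >= INT32_MIN && disp <= INT32_MAX) {
        uint32_t d = (uint32_t)(int32_t)disp;
        buf[0] = 0xE9;                          /* jmp rel32 */
        for (int i = 0; i < 4; i++)
            buf[1 + i] = (uint8_t)(d >> (8 * i)); /* little-endian */
        return 5;
    }

    /* Target out of rel32 range: mov rax, imm64 ; jmp rax */
    buf[0] = 0x48;
    buf[1] = 0xB8;                              /* mov rax, imm64 */
    for (int i = 0; i < 8; i++)
        buf[2 + i] = (uint8_t)(to >> (8 * i));  /* little-endian */
    buf[10] = 0xFF;
    buf[11] = 0xE0;                             /* jmp rax */
    return 12;
}
```

With from = 0 and to = 0x7fe1234 this produces e9 2f 12 fe 07, the same bytes NASM emits for a jump assembled at origin 0. In a real hook the returned bytes would then be copied to hookAddr after making the page writable.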
Related
I made some computations to get a relative virtual address(RVA).
I compute a correct RVA (according to the .map file) and I want to translate it to a callable function pointer.
Is there a way to translate it to a physical address?
I have thought of taking the image base address and add it. According to this thread it should be possible via GetModuleHandle(NULL), but the result is "wrong". I only have a good result when I subtract a pointer from a function from its RVA defined in the .map file.
Is there a WINAPI to either convert a RVA to a physical address, or get the image base address, or get the RVA of a physical address?
Here's some example code:
#include <stdio.h>
#include <Windows.h>
#include <WinNT.h>
static int count = 0;
void reference() // According to .map RVA = 0x00401000
{
}
void toCall() // According to .map RVA = 0x00401030
{
printf("Called %d\n", ++count);
}
int main()
{
typedef void (*fnct_t)();
fnct_t fnct;
fnct = (fnct_t) (0x00401030 + (((int) reference) - 0x00401000));
fnct(); // works
fnct = (fnct_t) (0x00401030 + ((int) GetModuleHandle(NULL)) - 0x00400000);
fnct(); // often works
return 0;
}
My main concern is that it seems that sometimes (maybe in threaded contexts) GetModuleHandle(NULL) isn't correct.
To get the image base without hard-coding it at compile time, you can do a simple search: align the current VA down to a 64 KB boundary (and eax, 0FFFF0000h) and step backwards 64 KB at a time until the 'MZ' signature that starts a PE image's DOS header is found at the beginning of the current block.
This assumes that the first 'MZ' found below the function belongs to your own image and not to another PE image. I've prepared a function for you:
DWORD FindImageBase() {
DWORD* VA = (DWORD *) &FindImageBase, ImageBase;
__asm {
mov eax, VA
and eax, 0FFFF0000h
search:
cmp word ptr [eax], 0x5a4d
je stop
sub eax, 00010000h
jmp search
stop:
mov [ImageBase], 0
mov [ImageBase], eax
}
return ImageBase;
}
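The arithmetic behind both computations in the question can be sketched as follows. This assumes the addresses in the .map file are virtual addresses relative to a preferred base of 0x00400000 (as the question's second computation implies), so the true RVA is the map address minus that base. The function name rebase_address is hypothetical; on Windows the actual base would come from something like GetModuleHandle(NULL) cast to an integer.

```c
#include <stdint.h>

/* Hypothetical helper: converts an address taken from the .map file
   (valid only when the image loads at preferred_base) into the address
   that is valid when the image is actually loaded at actual_base. */
uintptr_t rebase_address(uintptr_t map_address,
                         uintptr_t preferred_base,
                         uintptr_t actual_base)
{
    uintptr_t rva = map_address - preferred_base; /* true RVA, e.g. 0x1030 */
    return actual_base + rva;
}
```

If the image was not relocated, actual_base equals preferred_base and the map address comes back unchanged. If it was loaded at, say, 0x00A50000, the map address 0x00401030 becomes 0x00A51030.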
I have a game; I disassembled it and located a jump which I want to rewrite,
but whenever I try to write to the address I get an access violation exception, even when I use VirtualProtect and set the READWRITE permission.
The instruction at 0x0042BD5F is this:
0x0046AACF E9 FF FF 89 FC | jmp some address here
Now, when I try to write to 0x0042BD5F to change the relative jump address, I get an access violation exception.
How do I change the jump on that address?
Code was requested, so here it is:
#define AddVar(Type,Name,Address) Type& Name = *reinterpret_cast<Type*>(Address)
/*
Hooker
1b 0x0042BD5F == E9 <relative jmp>
4b 0x0042BD60 - relative jump offset (always the value 0xFFFF89FC)
*/
AddVar(uqbyte, jump_hook_bytes, 0x0042BD60);
//the user tick function
void(*tick)(void);
void SetTick(void(*passed)(void))
{
tick = passed;
}
void Ticker();
void OnDLLLoad(void(*passed)(void) = nullptr)
{
tick = passed;
//point the game loop end to Ticker()
//replace the jump address
//jmp (DESTINATION_RVA - CURRENT_RVA - 5 [sizeof(E9 xx xx xx xx)])
DWORD old;
VirtualProtect(
(LPVOID)0x0042BD5F,
0x05,
PAGE_EXECUTE | PAGE_EXECUTE_READ | PAGE_EXECUTE_READWRITE,
&old
);
jump_hook_bytes = (((uqbyte)((uqbyte*)&Ticker) - (uqbyte)0x0042BD5F) - (uqbyte)0x0000005);
}
void Ticker()
{
if (tick != nullptr)
{
tick();
}
__asm
{
MOV EAX, 0x0042B9EA;//old address
JMP EAX;
}
}
uqbyte is an unsigned long.
When I call GetLastError, it returns the decimal error code 87 (ERROR_INVALID_PARAMETER).
The documentation for the memory protection constants says:
The following are the memory-protection options; you must specify one of the following values when allocating or protecting a page in memory.
It then lists a number of values, including the three that you combined together. When the documentation says "specify one of the following values" it means exactly one. You must not combine them.
You need to use PAGE_EXECUTE_READWRITE on its own.
I recommend that you add error checking around all your API calls. I also think that you could avoid hard coding the addresses.
You need to use a debugger or a program that enables the process token privilege for debugger mode (like a trainer). I assume this isn't being used for an online cheat (offline shouldn't matter).
Passing PAGE_EXECUTE | PAGE_EXECUTE_READ | PAGE_EXECUTE_READWRITE is wrong; you need to pass only PAGE_EXECUTE_READWRITE to VirtualProtect. Now it works.
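To see why the combined value is rejected, here is a small portable sketch. The K_ constants are hypothetical names for this example; their numeric values are the ones winnt.h defines for PAGE_EXECUTE, PAGE_EXECUTE_READ and PAGE_EXECUTE_READWRITE. The check assumes the basic protection constants occupy the low 8 bits as single-bit values, with modifiers such as PAGE_GUARD sitting above them.

```c
#include <stdint.h>

/* Values as defined in winnt.h; the K_ names are local to this sketch. */
enum {
    K_PAGE_EXECUTE           = 0x10,
    K_PAGE_EXECUTE_READ      = 0x20,
    K_PAGE_EXECUTE_READWRITE = 0x40,
    K_PAGE_GUARD             = 0x100  /* a modifier, may be OR-ed in */
};

/* Returns 1 if exactly one basic protection constant is present in the
   low 8 bits of `protect`, 0 otherwise. */
int is_single_base_protection(uint32_t protect)
{
    uint32_t base = protect & 0xFFu;
    /* A non-zero value with exactly one bit set is a power of two. */
    return base != 0 && (base & (base - 1)) == 0;
}
```

OR-ing the three constants from the question gives 0x70, which has three bits set and matches no single protection value, which is consistent with the ERROR_INVALID_PARAMETER result. PAGE_EXECUTE_READWRITE alone (0x40) passes the check.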
Consider the following code segment:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define ARRAYSIZE(arr) (sizeof(arr)/sizeof(arr[0]))
inline void
clflush(volatile void *p)
{
asm volatile ("clflush (%0)" :: "r"(p));
}
inline uint64_t
rdtsc()
{
unsigned long a, d;
asm volatile ("cpuid; rdtsc" : "=a" (a), "=d" (d) : : "ebx", "ecx");
return a | ((uint64_t)d << 32);
}
inline int func() { return 5;}
inline void test()
{
uint64_t start, end;
char c;
start = rdtsc();
func();
end = rdtsc();
printf("%ld ticks\n", end - start);
}
void flushFuncCache()
{
// Assuming function to be not greater than 320 bytes.
char* fPtr = (char*)func;
clflush(fPtr);
clflush(fPtr+64);
clflush(fPtr+128);
clflush(fPtr+192);
clflush(fPtr+256);
}
int main(int ac, char **av)
{
test();
printf("Function must be cached by now!\n");
test();
flushFuncCache();
printf("Function flushed from cache.\n");
test();
printf("Function must be cached again by now!\n");
test();
return 0;
}
Here, I am trying to flush the instruction cache to remove the code for 'func', and I then expect a performance overhead on the next call to func, but my results don't agree with my expectations:
858 ticks
Function must be cached by now!
788 ticks
Function flushed from cache.
728 ticks
Function must be cached again by now!
710 ticks
I was expecting CLFLUSH to also flush the instruction cache, but apparently it is not doing so. Can someone explain this behavior or suggest how to achieve the desired behavior?
Your code does almost nothing in func, and the little it does gets inlined into test and is probably optimized out, since you never use the return value.
gcc -O3 gives me -
0000000000400620 <test>:
400620: 53 push %rbx
400621: 0f a2 cpuid
400623: 0f 31 rdtsc
400625: 48 89 d7 mov %rdx,%rdi
400628: 48 89 c6 mov %rax,%rsi
40062b: 0f a2 cpuid
40062d: 0f 31 rdtsc
40062f: 5b pop %rbx
...
So you're measuring the time for two moves that are very cheap hardware-wise; your measurement is probably showing the latency of cpuid, which is relatively expensive.
Worse, your clflush would actually flush test as well. This means you pay the re-fetch penalty the next time you access it, which happens outside the rdtsc pair, so it's not measured. The measured code, on the other hand, follows sequentially, so fetching test would probably also fetch the flushed code you measure, and it could already be cached again by the time you measure it.
It works well on my computer.
264 ticks
Function must be cached by now!
258 ticks
Function flushed from cache.
519 ticks
Function must be cached again by now!
240 ticks
From http://lastfrag.com/hotpatching-and-inline-hooking-explained/,
Q1) Does code proceed from high memory to low memory or vice versa?
Q2) More importantly, during the calculation of the replacement offset, why do you have to subtract the size of the function preamble (5)? Is it because the offset starts from the end of the instruction and not the beginning?
DWORD ReplacementAddressOffset = ReplacementAddress - OriginalAddress - 5;
Full Code:
void HookAPI(wchar_t *Module, char *API, DWORD Function)
{
HMODULE hModule = LoadLibrary(Module);
DWORD OriginalAddress = (DWORD)GetProcAddress(hModule, API);
DWORD ReplacementAddress = (DWORD)Function;
DWORD ReplacementAddressOffset = ReplacementAddress - OriginalAddress - 5;
LPBYTE pOriginalAddress = (LPBYTE)OriginalAddress;
LPBYTE pReplacementAddressOffset = (LPBYTE)(&ReplacementAddressOffset);
DWORD OldProtect = 0;
DWORD NewProtect = PAGE_EXECUTE_READWRITE;
VirtualProtect((PVOID)OriginalAddress, 5, NewProtect, &OldProtect);
for (int i = 0; i < 5; i++)
Store[i] = pOriginalAddress[i];
pOriginalAddress[0] = (BYTE)0xE9;
for (int i = 0; i < 4; i++)
pOriginalAddress[i + 1] = pReplacementAddressOffset[i];
VirtualProtect((PVOID)OriginalAddress, 5, OldProtect, &NewProtect);
FlushInstructionCache(GetCurrentProcess(), NULL, NULL);
FreeLibrary(hModule);
}
Q3) In this code, the relative address of a jmp instruction is being replaced; relAddrSet is a pointer to the original destination; to is a pointer to the new destination. I don't understand the calculation of the to address, why is it that you have to add the original destination to the functionForHook + opcodeOffset?
DWORD *relAddrSet = (DWORD *)(currentOpcode + 1);
DWORD_PTR to = (*relAddrSet) + ((DWORD_PTR)functionForHook + opcodeOffset);
*relAddrSet = (DWORD)(to - ((DWORD_PTR)originalFunction + opcodeOffset));
Yes, the relative address is the offset from the end of the instruction; that's why you have to subtract 5.
But, in my opinion, you should just forget the idea of the relative jump and try an absolute jump.
Why? Because it is a lot easier and x86-64 compatible (relative jumps are limited to +/-2GB).
The absolute jump is (x64) :
48 b8 ef cd ab 89 67 45 23 01 mov rax, 0x0123456789abcdef
ff e0 jmp rax
And for x86 :
b8 67 45 23 01 mov eax, 0x01234567
ff e0 jmp eax
Here is the modified code (the loader is now 7 bytes instead of 5):
void HookAPI(wchar_t *Module, char *API, DWORD Function)
{
HMODULE hModule = LoadLibrary(Module);
LPBYTE pOriginalAddress = (LPBYTE)GetProcAddress(hModule, API);
DWORD ReplacementAddress = (DWORD)Function;
DWORD OldProtect = 0;
DWORD NewProtect = PAGE_EXECUTE_READWRITE;
VirtualProtect(pOriginalAddress, 7, NewProtect, &OldProtect);
memcpy(Store, pOriginalAddress, 7); // save the 7 bytes about to be overwritten
memcpy(pOriginalAddress, "\xb8\x00\x00\x00\x00\xff\xe0", 7); // mov eax, imm32 ; jmp eax
memcpy(pOriginalAddress + 1, &ReplacementAddress, sizeof(ReplacementAddress)); // fill in imm32
VirtualProtect(pOriginalAddress, 7, OldProtect, &NewProtect);
FlushInstructionCache(GetCurrentProcess(), NULL, 0);
FreeLibrary(hModule);
}
The code is the same for x64, except that the address is 8 bytes, so the address variables must be pointer-sized (e.g. DWORD_PTR) rather than DWORD. You may also have to add 2 nops (90) at the beginning or the end in order to match the size of the instructions you overwrite, so the loader is "\x48\xb8<8-byte addr>\xff\xe0\x90\x90" (14 bytes).
Q1) The program runs from lower to higher addresses (i.e. the program counter is increased by the size of each instruction, except in the case of jumps, calls or ret). But I am probably missing the point of the question.
Q2) Yes, on x86 a jump is executed after the program counter has been increased by the size of the jump instruction (5 bytes); when the CPU adds the jump offset to the program counter to calculate the target address, the program counter has already been increased by 5.
Q3) This code is quite weird, but it may work. I suppose that *relAddrSet initially contains a jump offset to originalFunction (i.e. *relAddrSet == originalFunction - relativeOffset). If this is true, the final result is that *relAddrSet contains a jump offset to functionForHook. Indeed, the last statement becomes:
*relAddrSet=(originalFunction-relativeOffset)+functionForHook-originalFunction
== functionForHook-relativeOffset
Yes, code runs "forward" if I understand this question correctly. One instruction is executed after another if it is not branching.
An instruction that does a relative jump (JMP, CALL) does the jump relative to the start of the next instruction. That's why you have to subtract the length of the instruction (here: 5) from the difference.
I can't answer your third question. Please give some context and what the code is supposed to do.
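The rule that a rel32 offset is measured from the end of the 5-byte instruction, and the re-targeting idea behind Q3, can be sketched in portable C. The function names are hypothetical. The sketch assumes a 5-byte E9 or E8 instruction and only does the address arithmetic; it does not read or patch code.

```c
#include <stdint.h>

/* Address the instruction at insn_addr jumps to, given its rel32 field. */
uint64_t rel32_target(uint64_t insn_addr, int32_t disp)
{
    return insn_addr + 5 + (uint64_t)(int64_t)disp;
}

/* rel32 field needed for an instruction at insn_addr to reach target.
   Assumes the distance fits in 32 bits. */
int32_t rel32_disp(uint64_t insn_addr, uint64_t target)
{
    return (int32_t)(uint32_t)(target - (insn_addr + 5));
}

/* New rel32 field so that the same instruction, moved from old_addr to
   new_addr, still reaches its original target. */
int32_t relocate_rel32(int32_t disp, uint64_t old_addr, uint64_t new_addr)
{
    return rel32_disp(new_addr, rel32_target(old_addr, disp));
}
```

For example, a jump at 0x1000 to 0x2000 carries the offset 0xFFB (0x2000 - 0x1000 - 5). If that instruction is copied to 0x5000, the offset has to become -0x3005 for it to keep pointing at 0x2000, which is the kind of adjustment the Q3 snippet appears to perform.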
It feels like I'm abusing Stackoverflow with all my questions, but it's a Q&A forum after all :) Anyhow, I have been using detours for a while now, but I have yet to implement one of my own (I've used wrappers earlier). Since I want to have complete control over my code (who doesn't?) I have decided to implement a fully functional detour'er on my own, so I can understand every single byte of my code.
The code (below) is as simple as possible, the problem though, is not. I have successfully implemented the detour (i.e a hook to my own function) but I haven't been able to implement the trampoline.
Whenever I call the trampoline, depending on the offset I use, I get either a "segmentation fault" or an "illegal instruction". Both cases end the same way though: 'core dumped'. I think it is because I've mixed up the 'relative address' (note: I'm pretty new to Linux, so I have far from mastered GDB).
As commented in the code, depending on sizeof(jmpOp)(at line 66) I either get an illegal instruction or a segmentation fault. I'm sorry if it's something obvious, I'm staying up way too late...
// Header files
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include "global.h" // Contains typedefines for byte, ulong, ushort etc...
#include <cstring>
bool ProtectMemory(void * addr, int flags)
{
// Constant holding the page size value
const size_t pageSize = sysconf(_SC_PAGE_SIZE);
// Calculate relative page offset
size_t temp = (size_t) addr;
temp -= temp % pageSize;
// Update address
addr = (void*) temp;
// Update memory area protection
return !mprotect(addr, pageSize, flags);
}
const byte jmpOp[] = { 0xE9, 0x00, 0x00, 0x00, 0x00 };
int Test(void)
{
printf("This is testing\n");
return 5;
}
int MyTest(void)
{
printf("This is ******\n");
return 9;
}
typedef int (*TestType)(void);
int main(int argc, char * argv[])
{
// Fetch addresses
byte * test = (byte*) &Test;
byte * myTest = (byte*) &MyTest;
// Call original
Test();
// Update memory access for 'test' function
ProtectMemory((void*) test, PROT_EXEC | PROT_WRITE | PROT_READ);
// Allocate memory for the trampoline
byte * trampoline = new byte[sizeof(jmpOp) * 2];
// Do copy operations
memcpy(trampoline, test, sizeof(jmpOp));
memcpy(test, jmpOp, sizeof(jmpOp));
// Setup trampoline
trampoline += sizeof(jmpOp);
*trampoline = 0xE9;
// I think this address is incorrect, how should I calculate it? With the current
// status (commented 'sizeof(jmpOp)') the compiler complains about "Illegal Instruction".
// If I uncomment it, and use either + or -, a segmentation fault will occur...
*(uint*)(trampoline + 1) = ((uint) test - (uint) trampoline)/* + sizeof(jmpOp)*/;
trampoline -= sizeof(jmpOp);
// Make the trampoline executable (and read/write)
ProtectMemory((void*) trampoline, PROT_EXEC | PROT_WRITE | PROT_READ);
// Setup detour
*(uint*)(test + 1) = ((uint) myTest - (uint) test) - sizeof(jmpOp);
// Call 'detoured' func
Test();
// Call trampoline (crashes)
((TestType) trampoline)();
return 0;
}
In case of interest, this is the output during a normal run (with the exact code above):
This is testing
This is ******
Illegal instruction (core dumped)
And this is the result if I use +/- sizeof(jmpOp) at line 66:
This is testing
This is ******
Segmentation fault (core dumped)
NOTE: I'm running Ubuntu 32 bit and compile with g++ global.cpp main.cpp -o main -Iinclude
You're not going to be able to indiscriminately copy the first 5 bytes of Test() into your trampoline, followed by a jump to the 6th instruction byte of Test(), because you don't know if the first 5 bytes comprise an integral number of x86 variable-length instructions. To do this, you're going to have to do at least a minimal amount of automated disassembling of the Test() function in order to find an instruction boundary that's 5 or more bytes past the beginning of the function, then copy an appropriate number of bytes to your trampoline, and THEN append your jump (which won't be at a fixed offset within your trampoline). Note that on a typical RISC processor (like PPC), you wouldn't have this problem, as all instructions are the same width.
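The steps described above can be sketched in portable C. Everything here is an assumption for illustration: the function names are hypothetical, the instruction lengths are supplied by the caller (in practice they would come from a length disassembler), and the code only fills a buffer using 32-bit addresses passed in as numbers. It also assumes that none of the copied instructions is position-relative; those would need their own fix-up.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Smallest byte count >= need that ends on an instruction boundary,
   given the lengths of the first `count` instructions. Returns 0 if the
   listed instructions do not cover `need` bytes. */
size_t bytes_to_relocate(const uint8_t *lengths, size_t count, size_t need)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += lengths[i];
        if (total >= need)
            return total;
    }
    return 0;
}

/* Fills `tramp` with the first n bytes of `code`, followed by a
   jmp rel32 back to code_addr + n. tramp_addr and code_addr are the
   addresses the two buffers will have at run time.
   Returns the trampoline size (n + 5). */
size_t build_trampoline(uint8_t *tramp, uint32_t tramp_addr,
                        const uint8_t *code, uint32_t code_addr, size_t n)
{
    /* The jump sits at tramp_addr + n, and its offset is measured from
       its end (tramp_addr + n + 5). */
    uint32_t disp = (code_addr + (uint32_t)n) - (tramp_addr + (uint32_t)n + 5);

    memcpy(tramp, code, n);                 /* relocated whole instructions */
    tramp[n] = 0xE9;                        /* jmp rel32 */
    for (int i = 0; i < 4; i++)
        tramp[n + 1 + i] = (uint8_t)(disp >> (8 * i));
    return n + 5;
}
```

With a typical 32-bit prologue of push ebp (1 byte), mov ebp, esp (2 bytes) and sub esp, 0x18 (3 bytes), covering the 5-byte detour means copying 6 bytes, so the jump back lands at offset 6 in the trampoline rather than at a fixed offset of 5. The trampoline buffer also has to live in memory that is mapped executable.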