Set stack size programmatically on Windows - c++

Is it possible in WinAPI to set stack size for the current thread at runtime like setrlimit does on Linux?
I mean to increase the reserved stack size for the current thread if it is too small for the current requirements.
This is in a library that may be called by threads from other programming languages, so it's not an option to set stack size at compile time.
If not, any ideas about a solution like an assembly trampoline that changes the stack pointer to a dynamically allocated memory block?
FAQ: Proxy thread is a surefire solution (unless the caller thread has extremely small stack). However, thread switching seems a performance killer. I need substantial amount of stack for recursion or for _alloca. This is also for performance, because heap allocation is slow, especially if multiple threads allocate from heap in parallel (they get blocked by the same libc/CRT mutex, so the code becomes serial).

You cannot completely replace the stack of the current thread (allocate a new one, free the old one) from library code, because the old stack holds the return addresses, there may be pointers to variables that live on it, and so on.
You also cannot grow the stack: its virtual memory range is already reserved (and partly committed) and cannot be extended.
However, it is possible to allocate a temporary stack and switch to it for the duration of a call. In that case you must save the old StackBase and StackLimit from NT_TIB (see this structure in winnt.h), set the new values (you need to allocate memory for the new stack), make the call (switching the stack pointer needs some assembly code; it cannot be done in C/C++ alone), and then restore the original StackBase and StackLimit. Kernel mode has built-in support for this: KeExpandKernelStackAndCallout.
User mode, however, has fibers. They are very rarely used, but they look like a perfect match for this task: with a fiber we can create an additional stack/execution context inside the current thread.
So the general solution (for a library) is as follows.
On DLL_THREAD_ATTACH:
convert the thread to a fiber with ConvertThreadToFiber (if it returns NULL, also check GetLastError for ERROR_ALREADY_FIBER; that code is fine too)
create your own fiber by calling CreateFiberEx
This is done only once. Then, every time a procedure that requires a lot of stack space is called:
remember the current fiber by calling GetCurrentFiber
set up the task for your fiber
switch to your fiber by calling SwitchToFiber
call the procedure inside the fiber
return to the original fiber (the one saved from GetCurrentFiber), again with SwitchToFiber
And finally, on DLL_THREAD_DETACH:
delete your fiber with DeleteFiber
convert the fiber back to a thread by calling ConvertFiberToThread, but only if the initial ConvertThreadToFiber succeeded (if it failed with ERROR_ALREADY_FIBER, leave the conversion back to whoever converted the thread to a fiber first; it is not your job in that case)
You need some (usually small) data associated with your fiber/thread. This must of course be a per-thread variable, so declare it with __declspec(thread), use TLS directly, or use whatever modern C++ offers for this (thread_local).
A demo implementation follows:
typedef ULONG (WINAPI * MY_EXPAND_STACK_CALLOUT) (PVOID Parameter);
class FIBER_DATA
{
public:
PVOID _PrevFiber, _MyFiber;
MY_EXPAND_STACK_CALLOUT _pfn;
PVOID _Parameter;
ULONG _dwError;
BOOL _bConvertToThread;
static VOID CALLBACK _FiberProc( PVOID lpParameter)
{
reinterpret_cast<FIBER_DATA*>(lpParameter)->FiberProc();
}
VOID FiberProc()
{
for (;;)
{
_dwError = _pfn(_Parameter);
SwitchToFiber(_PrevFiber);
}
}
public:
~FIBER_DATA()
{
if (_MyFiber)
{
DeleteFiber(_MyFiber);
}
if (_bConvertToThread)
{
ConvertFiberToThread();
}
}
FIBER_DATA()
{
_bConvertToThread = FALSE, _MyFiber = 0;
}
ULONG Create(SIZE_T dwStackCommitSize, SIZE_T dwStackReserveSize);
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
_PrevFiber = GetCurrentFiber();
_pfn = pfn;
_Parameter = Parameter;
SwitchToFiber(_MyFiber);
return _dwError;
}
};
__declspec(thread) FIBER_DATA* g_pData;
ULONG FIBER_DATA::Create(SIZE_T dwStackCommitSize, SIZE_T dwStackReserveSize)
{
if (ConvertThreadToFiber(this))
{
_bConvertToThread = TRUE;
}
else
{
ULONG dwError = GetLastError();
if (dwError != ERROR_ALREADY_FIBER)
{
return dwError;
}
}
return (_MyFiber = CreateFiberEx(dwStackCommitSize, dwStackReserveSize, 0, _FiberProc, this)) ? NOERROR : GetLastError();
}
void OnDetach()
{
if (FIBER_DATA* pData = g_pData)
{
delete pData;
}
}
ULONG OnAttach()
{
if (FIBER_DATA* pData = new FIBER_DATA)
{
if (ULONG dwError = pData->Create(2*PAGE_SIZE, 512 * PAGE_SIZE))
{
delete pData;
return dwError;
}
g_pData = pData;
return NOERROR;
}
return ERROR_NO_SYSTEM_RESOURCES;
}
ULONG WINAPI TestCallout(PVOID param)
{
DbgPrint("TestCallout(%s)\n", param);
return NOERROR;
}
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
if (FIBER_DATA* pData = g_pData)
{
return pData->DoCallout(pfn, Parameter);
}
return ERROR_GEN_FAILURE;
}
if (!OnAttach())//DLL_THREAD_ATTACH
{
DoCallout(TestCallout, "Demo Task #1");
DoCallout(TestCallout, "Demo Task #2");
OnDetach();//DLL_THREAD_DETACH
}
Also note that all fibers of a thread execute in the context of that single thread: they cannot run concurrently, only sequentially, and you control the switch points yourself. So no additional synchronization is needed. SwitchToFiber is a pure user-mode routine: it executes very quickly and never fails (because it never allocates any resources).
Update
Although __declspec(thread) FIBER_DATA* g_pData; is simpler (less code), a better implementation uses TlsGetValue / TlsSetValue directly and allocates FIBER_DATA on the first call inside a thread rather than for every thread. Also, __declspec(thread) did not work correctly (did not work at all) in DLLs on XP. So one possible modification is:
at DLL_PROCESS_ATTACH, allocate your TLS slot: gTlsIndex = TlsAlloc();
and free it on DLL_PROCESS_DETACH:
if (gTlsIndex != TLS_OUT_OF_INDEXES) TlsFree(gTlsIndex);
on every DLL_THREAD_DETACH notification, call:
void OnThreadDetach()
{
if (FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex))
{
delete pData;
}
}
and DoCallout needs to be modified as follows:
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex);
if (!pData)
{
// this code is executed only once, on the first call in this thread
if (!(pData = new FIBER_DATA))
{
return ERROR_NO_SYSTEM_RESOURCES;
}
// this Create takes (commit, reserve), matching the call in OnAttach; use whatever stack size you need
if (ULONG dwError = pData->Create(4*PAGE_SIZE, 512*PAGE_SIZE))
{
delete pData;
return dwError;
}
TlsSetValue(gTlsIndex, pData);
}
return pData->DoCallout(pfn, Parameter);
}
So instead of allocating a stack for every new thread on DLL_THREAD_ATTACH via OnAttach(), it is much better to allocate it only for the threads that really need it (on their first call).
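The "allocate on first call, per thread" pattern can also be expressed with standard C++ thread_local storage. The following is only a minimal sketch of that pattern, not a replacement for the TLS code above: ThreadScratch, g_scratchCreated and threadScratch are hypothetical names, the buffer merely stands in for FIBER_DATA, and whether thread_local is acceptable depends on the Windows versions and DLL loading scenarios you have to support.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for FIBER_DATA: any per-thread resource that is
// expensive to create (here just a heap buffer).
struct ThreadScratch {
    std::vector<unsigned char> buffer;
    explicit ThreadScratch(std::size_t bytes) : buffer(bytes) {}
};

// Counts how many threads actually created their scratch object.
std::atomic<int> g_scratchCreated{0};

// Returns the calling thread's scratch object, creating it on first use.
// Threads that never call this function never pay for the allocation.
// The unique_ptr releases the object automatically when the thread exits.
ThreadScratch& threadScratch() {
    thread_local std::unique_ptr<ThreadScratch> t_scratch;
    if (!t_scratch) {
        t_scratch = std::make_unique<ThreadScratch>(64 * 1024);
        ++g_scratchCreated;
    }
    return *t_scratch;
}
```

Repeated calls on the same thread return the same object, so the creation cost is paid once per thread that actually uses it.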
This code can potentially run into problems if someone else also tries to use fibers. For example, the MSDN sample code does not check for ERROR_ALREADY_FIBER when ConvertThreadToFiber returns 0, so we can expect the main application to handle that case incorrectly if we convert the thread to a fiber first and the application then tries to use fibers after us. Also, ERROR_ALREADY_FIBER does not work on XP (it is available starting with Vista).
So another solution is possible: create the thread stack yourself and temporarily switch to it during the call that requires a lot of stack space. The main point is not only to allocate space for the stack and swap esp (or rsp), but also to set StackBase and StackLimit in NT_TIB correctly. This is a necessary and sufficient condition (otherwise exceptions and guard-page stack extension will not work).
Although this alternative requires more code (creating the thread stack and switching stacks manually), it works on XP too and does not interfere when somebody else also uses fibers in the thread.
typedef ULONG (WINAPI * MY_EXPAND_STACK_CALLOUT) (PVOID Parameter);
extern "C" PVOID __fastcall SwitchToStack(PVOID param, PVOID stack);
struct FIBER_DATA
{
PVOID _Stack, _StackLimit, _StackPtr, _StackBase;
MY_EXPAND_STACK_CALLOUT _pfn;
PVOID _Parameter;
ULONG _dwError;
static void __fastcall FiberProc(FIBER_DATA* pData, PVOID stack)
{
for (;;)
{
pData->_dwError = pData->_pfn(pData->_Parameter);
// StackLimit can change during the _pfn call (the committed part of the stack may grow)
pData->_StackLimit = ((PNT_TIB)NtCurrentTeb())->StackLimit;
stack = SwitchToStack(0, stack);
}
}
ULONG Create(SIZE_T Reserve, SIZE_T Commit);
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
_pfn = pfn;
_Parameter = Parameter;
PNT_TIB tib = (PNT_TIB)NtCurrentTeb();
PVOID StackBase = tib->StackBase, StackLimit = tib->StackLimit;
tib->StackBase = _StackBase, tib->StackLimit = _StackLimit;
_StackPtr = SwitchToStack(this, _StackPtr);
tib->StackBase = StackBase, tib->StackLimit = StackLimit;
return _dwError;
}
~FIBER_DATA()
{
if (_Stack)
{
VirtualFree(_Stack, 0, MEM_RELEASE);
}
}
FIBER_DATA()
{
_Stack = 0;
}
};
ULONG FIBER_DATA::Create(SIZE_T Reserve, SIZE_T Commit)
{
Reserve = (Reserve + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
Commit = (Commit + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
if (Reserve < Commit || !Reserve)
{
return ERROR_INVALID_PARAMETER;
}
if (PBYTE newStack = (PBYTE)VirtualAlloc(0, Reserve, MEM_RESERVE, PAGE_NOACCESS))
{
union {
PBYTE newStackBase;
void** ppvStack;
};
newStackBase = newStack + Reserve;
PBYTE newStackLimit = newStackBase - Commit;
if (newStackLimit = (PBYTE)VirtualAlloc(newStackLimit, Commit, MEM_COMMIT, PAGE_READWRITE))
{
if (Reserve == Commit || VirtualAlloc(newStackLimit - PAGE_SIZE, PAGE_SIZE, MEM_COMMIT, PAGE_READWRITE|PAGE_GUARD))
{
_StackBase = newStackBase, _StackLimit = newStackLimit, _Stack = newStack;
#if defined(_M_IX86)
*--ppvStack = FiberProc;
ppvStack -= 4;// ebp,esi,edi,ebx
#elif defined(_M_AMD64)
ppvStack -= 5;// x64 space
*--ppvStack = FiberProc;
ppvStack -= 8;// r15,r14,r13,r12,rbp,rsi,rdi,rbx
#else
#error "not supported"
#endif
_StackPtr = ppvStack;
return NOERROR;
}
}
VirtualFree(newStack, 0, MEM_RELEASE);
}
return GetLastError();
}
ULONG gTlsIndex;
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex);
if (!pData)
{
// this code executed only once on first call
if (!(pData = new FIBER_DATA))
{
return ERROR_NO_SYSTEM_RESOURCES;
}
if (ULONG dwError = pData->Create(512*PAGE_SIZE, 4*PAGE_SIZE))
{
delete pData;
return dwError;
}
TlsSetValue(gTlsIndex, pData);
}
return pData->DoCallout(pfn, Parameter);
}
void OnThreadDetach()
{
if (FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex))
{
delete pData;
}
}
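The address arithmetic in FIBER_DATA::Create above (page rounding, then placing the committed region and the guard page at the top of the reservation) can be checked in isolation. This is a sketch under stated assumptions: the page size is fixed at 4 KiB, no memory is actually allocated, and StackLayout, roundUpToPage and computeStackLayout are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kPageSize = 4096;  // assumption: 4 KiB pages

// Rounds n up to the next multiple of the page size.
constexpr std::uintptr_t roundUpToPage(std::uintptr_t n) {
    return (n + (kPageSize - 1)) & ~(kPageSize - 1);
}

struct StackLayout {
    std::uintptr_t allocationBase;  // lowest address of the reservation
    std::uintptr_t guardPage;       // 0 when the whole reservation is committed
    std::uintptr_t stackLimit;      // lowest committed address
    std::uintptr_t stackBase;       // one past the highest stack address
};

// Mirrors the address arithmetic of Create without touching memory.
// allocationBase plays the role of the pointer returned by the reserving
// VirtualAlloc call. Returns false for the same parameter combinations
// that Create rejects with ERROR_INVALID_PARAMETER.
bool computeStackLayout(std::uintptr_t allocationBase, std::uintptr_t reserve,
                        std::uintptr_t commit, StackLayout& out) {
    reserve = roundUpToPage(reserve);
    commit = roundUpToPage(commit);
    if (reserve < commit || reserve == 0) {
        return false;
    }
    out.allocationBase = allocationBase;
    out.stackBase = allocationBase + reserve;  // the stack grows downwards from here
    out.stackLimit = out.stackBase - commit;   // committed pages sit at the top
    out.guardPage = (reserve == commit) ? 0 : out.stackLimit - kPageSize;
    return true;
}
```

With a 512-page reservation and 4 committed pages, the guard page ends up directly below the committed region, which is the page whose access lets the system commit more of the reservation.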
And the assembly code for SwitchToStack, on x86:
@SwitchToStack@8 proc
push ebx
push edi
push esi
push ebp
xchg esp,edx
mov eax,edx
pop ebp
pop esi
pop edi
pop ebx
ret
@SwitchToStack@8 endp
and for x64:
SwitchToStack proc
push rbx
push rdi
push rsi
push rbp
push r12
push r13
push r14
push r15
xchg rsp,rdx
mov rax,rdx
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rsi
pop rdi
pop rbx
ret
SwitchToStack endp
A usage/test example:
gTlsIndex = TlsAlloc();//DLL_PROCESS_ATTACH
if (gTlsIndex != TLS_OUT_OF_INDEXES)
{
TestStackMemory();
DoCallout(TestCallout, "test #1");
//play with the stack, exceptions, guard pages
PSTR str = (PSTR)alloca(256);
DoCallout(zTestCallout, str);
DbgPrint("str=%s\n", str);
DoCallout(TestCallout, "test #2");
OnThreadDetach();//DLL_THREAD_DETACH
TlsFree(gTlsIndex);//DLL_PROCESS_DETACH
}
void TestMemory(PVOID AllocationBase)
{
MEMORY_BASIC_INFORMATION mbi;
PVOID BaseAddress = AllocationBase;
while (VirtualQuery(BaseAddress, &mbi, sizeof(mbi)) >= sizeof(mbi) && mbi.AllocationBase == AllocationBase)
{
BaseAddress = (PBYTE)mbi.BaseAddress + mbi.RegionSize;
DbgPrint("[%p, %p) %p %08x %08x\n", mbi.BaseAddress, BaseAddress, (PVOID)(mbi.RegionSize >> PAGE_SHIFT), mbi.State, mbi.Protect);
}
}
void TestStackMemory()
{
MEMORY_BASIC_INFORMATION mbi;
if (VirtualQuery(_AddressOfReturnAddress(), &mbi, sizeof(mbi)) >= sizeof(mbi))
{
TestMemory(mbi.AllocationBase);
}
}
ULONG WINAPI zTestCallout(PVOID Parameter)
{
TestStackMemory();
alloca(5*PAGE_SIZE);
TestStackMemory();
__try
{
*(int*)0=0;
}
__except(EXCEPTION_EXECUTE_HANDLER)
{
DbgPrint("exception %x handled\n", GetExceptionCode());
}
strcpy((PSTR)Parameter, "zTestCallout demo");
return NOERROR;
}
ULONG WINAPI TestCallout(PVOID param)
{
TestStackMemory();
DbgPrint("TestCallout(%s)\n", param);
return NOERROR;
}

The maximum stack size is determined when the thread is created. It cannot be modified after that time.
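Since the reservation is fixed at thread creation, the straightforward way to get a bigger stack is the proxy thread that the question mentions (and rejects for performance reasons). A minimal sketch of that approach, for comparison only: runOnBigStack and bigStackEntry are hypothetical names, the non-Windows branch exists only so the sketch also builds elsewhere, and the task must not throw.

```cpp
#include <cstddef>
#include <functional>
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>
#endif

namespace {
// Thread entry points: both just invoke the std::function passed by pointer.
#ifdef _WIN32
DWORD WINAPI bigStackEntry(LPVOID p) {
    (*static_cast<std::function<void()>*>(p))();
    return 0;
}
#else
void* bigStackEntry(void* p) {
    (*static_cast<std::function<void()>*>(p))();
    return nullptr;
}
#endif
}  // namespace

// Runs task on a helper thread whose stack reservation is stackBytes and
// waits for it to finish. Returns false if the thread could not be started.
bool runOnBigStack(std::size_t stackBytes, std::function<void()> task) {
#ifdef _WIN32
    // STACK_SIZE_PARAM_IS_A_RESERVATION makes the size argument the reserve
    // size instead of the initial commit size.
    HANDLE h = CreateThread(nullptr, stackBytes, bigStackEntry, &task,
                            STACK_SIZE_PARAM_IS_A_RESERVATION, nullptr);
    if (!h) return false;
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return true;
#else
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return false;
    bool ok = pthread_attr_setstacksize(&attr, stackBytes) == 0;
    pthread_t t;
    ok = ok && pthread_create(&t, &attr, bigStackEntry, &task) == 0;
    pthread_attr_destroy(&attr);
    if (!ok) return false;
    pthread_join(t, nullptr);
    return true;
#endif
}
```

The cost is one thread creation and one wait per call (or one context switch per call if the helper thread is kept alive and reused), which is exactly the overhead the fiber and manual stack-switch solutions avoid.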

Related

VEH hook acceleration

I am trying to speed up a VEH hook. I am using the VEH hook class that I found at https://github.com/hoangprod/LeoSpecial-VEH-Hook/blob/master/LeoSpecial.h
#pragma once
#include <Windows.h>
#include <stdio.h>
#include <iostream>
#ifdef _WIN64
#define XIP Rip
#else
#define XIP Eip
#endif
class LeoHook {
public:
static bool Hook(uintptr_t og_fun, uintptr_t hk_fun);
static bool Unhook();
private:
static uintptr_t og_fun;
static uintptr_t hk_fun;
static PVOID VEH_Handle;
static DWORD oldProtection;
static bool AreInSamePage(const uint8_t* Addr1, const uint8_t* Addr2);
static LONG WINAPI LeoHandler(EXCEPTION_POINTERS *pExceptionInfo);
};
uintptr_t LeoHook::og_fun = 0;
uintptr_t LeoHook::hk_fun = 0;
PVOID LeoHook::VEH_Handle = nullptr;
DWORD LeoHook::oldProtection = 0;
bool LeoHook::Hook(uintptr_t original_fun, uintptr_t hooked_fun)
{
LeoHook::og_fun = original_fun;
LeoHook::hk_fun = hooked_fun;
//We cannot hook two functions in the same page, because we will cause an infinite callback
if (AreInSamePage((const uint8_t*)og_fun, (const uint8_t*)hk_fun))
return false;
//Register the Custom Exception Handler
VEH_Handle = AddVectoredExceptionHandler(true, (PVECTORED_EXCEPTION_HANDLER)LeoHandler);
//Toggle PAGE_GUARD flag on the page
if(VEH_Handle && VirtualProtect((LPVOID)og_fun, 1, PAGE_EXECUTE_READ | PAGE_GUARD, &oldProtection))
return true;
return false;
}
bool LeoHook::Unhook()
{
DWORD old;
if (VEH_Handle && //Make sure we have a valid Handle to the registered VEH
VirtualProtect((LPVOID)og_fun, 1, oldProtection, &old) && //Restore old Flags
RemoveVectoredExceptionHandler(VEH_Handle)) //Remove the VEH
return true;
return false;
}
LONG WINAPI LeoHook::LeoHandler(EXCEPTION_POINTERS *pExceptionInfo)
{
if (pExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) //We will catch PAGE_GUARD Violation
{
if (pExceptionInfo->ContextRecord->XIP == (uintptr_t)og_fun) //Make sure we are at the address we want within the page
{
pExceptionInfo->ContextRecord->XIP = (uintptr_t)hk_fun; //Modify EIP/RIP to where we want to jump to instead of the original function
}
pExceptionInfo->ContextRecord->EFlags |= 0x100; //Will trigger an STATUS_SINGLE_STEP exception right after the next instruction get executed. In short, we come right back into this exception handler 1 instruction later
return EXCEPTION_CONTINUE_EXECUTION; //Continue to next instruction
}
if (pExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) //We will also catch STATUS_SINGLE_STEP, meaning we just had a PAGE_GUARD violation
{
DWORD dwOld;
VirtualProtect((LPVOID)og_fun, 1, PAGE_EXECUTE_READ | PAGE_GUARD, &dwOld); //Reapply the PAGE_GUARD flag because everytime it is triggered, it get removes
return EXCEPTION_CONTINUE_EXECUTION; //Continue the next instruction
}
return EXCEPTION_CONTINUE_SEARCH; //Keep going down the exception handling list to find the right handler IF it is not PAGE_GUARD nor SINGLE_STEP
}
bool LeoHook::AreInSamePage(const uint8_t* Addr1, const uint8_t* Addr2)
{
MEMORY_BASIC_INFORMATION mbi1;
if (!VirtualQuery(Addr1, &mbi1, sizeof(mbi1))) //Get Page information for Addr1
return true;
MEMORY_BASIC_INFORMATION mbi2;
if (!VirtualQuery(Addr2, &mbi2, sizeof(mbi2))) //Get Page information for Addr2
return true;
if (mbi1.BaseAddress == mbi2.BaseAddress) //See if the two pages start at the same Base Address
return true; //Both addresses are in the same page, abort hooking!
return false;
}
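The same-page check matters because PAGE_GUARD protects a whole page, not a single address: as far as the handler above shows, every instruction executed anywhere in the guarded page raises a guard-page exception followed by a single-step exception, which would explain why unrelated code sharing the page with the hooked function becomes slow. The check itself boils down to comparing page base addresses. A minimal sketch, assuming 4 KiB pages (real code should query the page size, for example with GetSystemInfo); pageBase and inSamePage are hypothetical names.

```cpp
#include <cstdint>

constexpr std::uintptr_t kAssumedPageSize = 4096;  // assumption: 4 KiB pages

// Rounds an address down to the start of the page that contains it.
constexpr std::uintptr_t pageBase(std::uintptr_t address) {
    return address & ~(kAssumedPageSize - 1);
}

// True when both addresses fall into the same page, i.e. when guarding one
// of them would also guard the other.
constexpr bool inSamePage(std::uintptr_t a, std::uintptr_t b) {
    return pageBase(a) == pageBase(b);
}
```

For example, 0x401000 and 0x401FFF share a page, while 0x401FFF and 0x402000 do not.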
The hook works fine without any problems, but it causes a large FPS drop in the process I inject into. From what I found by searching, most answers say to reduce the amount of logic used in the hook. My hook method:
__declspec(naked) void HookSend()
{
__asm
{
jmp dwAddress
}
}
As you can see, the function contains hardly any logic, but it still costs a lot of FPS.
The second suggestion I found is to copy the whole page that contains the function and then modify the assembly in the copy. But I have no idea how to copy a function's page. So I need some guidance on how to copy the page, or another method to speed up the VEH hook.
Thank you.

Trying to understand asynchrony with a large number of different function calls

I've started learning an asynchronous approach and ran into a problem; please help me with it.
The goal is to get some char data from somewhere and then do something with it (in my case, use it as the text on a button). The code below is very slow. The slowest part is getting the data: the get(int id) function loads data from the internet via WinInet (synchronously), sending POST requests and returning the answer.
void some_func()
{
for(int i(0);i<10;i++)
for(int q(0);q<5;q++)
{
char data[100];
strcpy(data, get(i,q)); // i, q - just some identifier data
button[5*i+(q+1)]=new Button(data);
}
}
First question:
How should this be solved in general (I mean, even if get has nothing to do with the internet but simply runs slowly)? I have only one, rather naive, idea: run each get in a separate thread. If that is the right way, how should I do it? It seems wrong to create 50 threads and call get from each of them. 50 get calls?
Second question:
How do I implement this with WinInet? I have read MSDN, but it is too hard for me as a beginner; could you explain it more simply?
Thanks
For asynchronous programming you need to create an object that maintains the state. In this case i and q must not be local variables of a function but members of the object, together with a mandatory reference count for the object, and usually a file (socket) handle, etc.
some_func() must follow a different pattern. It must be a member function of the object, and it must not call the asynchronous get in a loop: after calling get it must simply return. When the asynchronous operation started by get finishes, your callback is called (if starting the asynchronous operation fails, you need to call this callback yourself with the error code). In the callback you have a pointer to your object and use it to call some_func(). So some_func() must begin by handling the result of the previous get call: check for an error and, if there is none, process the received data. Then it adjusts the object state (i and q in your case) and, if needed, calls get again. To start all this, get has to be called directly the first time:
begin -> get() -> .. callback .. -> some_func() -> exit
^ ┬
└─────────────────────────────┘
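The control flow in the diagram can be sketched without any Windows API. This is a minimal single-threaded illustration, not the answer's implementation: CompletionQueue, asyncGet and Loader are hypothetical names, the queue stands in for whatever delivers completions (a thread pool, an I/O completion port, WinInet callbacks), and reference counting is left out because nothing here runs concurrently.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the completion mechanism: finished operations
// are queued here and their callbacks run later, from run().
struct CompletionQueue {
    std::deque<std::function<void()>> pending;
    void post(std::function<void()> fn) { pending.push_back(std::move(fn)); }
    void run() {
        while (!pending.empty()) {
            auto fn = std::move(pending.front());
            pending.pop_front();
            fn();
        }
    }
};

// Hypothetical asynchronous get: returns immediately and delivers the result
// through the callback once the queue is pumped.
void asyncGet(CompletionQueue& queue, int i, int q,
              std::function<void(int error, std::string data)> done) {
    queue.post([i, q, done = std::move(done)] {
        done(0, "item " + std::to_string(i) + "," + std::to_string(q));
    });
}

// The loop state (i, q) lives in the object, not in local variables.
struct Loader {
    CompletionQueue& queue;
    int i = 0, q = 0;
    std::vector<std::string> labels;
    bool failed = false;

    explicit Loader(CompletionQueue& cq) : queue(cq) {}

    // Starts one request and returns without waiting for the result.
    void beginGet() {
        asyncGet(queue, i, q, [this](int error, std::string data) {
            onComplete(error, std::move(data));
        });
    }

    // Plays the role of some_func: handle the previous result, advance the
    // state, and issue the next request if there is one.
    void onComplete(int error, std::string data) {
        if (error != 0) { failed = true; return; }
        labels.push_back(std::move(data));
        if (++q == 5) {
            q = 0;
            if (++i == 10) return;  // all 50 items received
        }
        beginGet();
    }
};
```

Calling beginGet() once and then pumping the queue walks through all 50 (i, q) pairs in order without ever blocking inside a loop.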
A demo example (with asynchronous file reads):
struct SOME_OBJECT
{
LARGE_INTEGER _ByteOffset;
HANDLE _hFile;
LONG _dwRef;
int _i, _q;
SOME_OBJECT()
{
_i = 0, _q = 0;
_dwRef = 1;
_ByteOffset.QuadPart = 0;
_hFile = 0;
}
void beginGet();
void DoSomething(PVOID pvData, DWORD_PTR cbData)
{
DbgPrint("DoSomething<%u,%u>(%x, %p)\n", _i, _q, cbData, pvData);
}
// some_func
void OnComplete(DWORD dwErrorCode, PVOID pvData, DWORD_PTR cbData)
{
if (dwErrorCode == NOERROR)
{
DoSomething(pvData, cbData);
if (++_q == 5)
{
_q = 0;
if (++_i == 10)
{
return ;
}
}
_ByteOffset.QuadPart += cbData;
beginGet();
}
else
{
DbgPrint("OnComplete - error=%u\n", dwErrorCode);
}
}
~SOME_OBJECT()
{
if (_hFile) CloseHandle(_hFile);
}
void AddRef() { InterlockedIncrement(&_dwRef); }
void Release() { if (!InterlockedDecrement(&_dwRef)) delete this; }
ULONG Create(PCWSTR FileName);
};
struct OPERATION_CTX : OVERLAPPED
{
SOME_OBJECT* _pObj;
BYTE _buf[];
OPERATION_CTX(SOME_OBJECT* pObj) : _pObj(pObj)
{
pObj->AddRef();
hEvent = 0;
}
~OPERATION_CTX()
{
_pObj->Release();
}
VOID CALLBACK CompletionRoutine(DWORD dwErrorCode, DWORD_PTR dwNumberOfBytesTransfered)
{
_pObj->OnComplete(dwErrorCode, _buf, dwNumberOfBytesTransfered);
delete this;
}
static VOID CALLBACK _CompletionRoutine(DWORD dwErrorCode, DWORD dwNumberOfBytesTransfered, OVERLAPPED* lpOverlapped)
{
static_cast<OPERATION_CTX*>(lpOverlapped)->CompletionRoutine(RtlNtStatusToDosError(dwErrorCode), dwNumberOfBytesTransfered);
}
void CheckResult(BOOL fOk)
{
if (!fOk)
{
ULONG dwErrorCode = GetLastError();
if (dwErrorCode != ERROR_IO_PENDING)
{
CompletionRoutine(dwErrorCode, 0);
}
}
}
void* operator new(size_t cb, size_t ex)
{
return ::operator new(cb + ex);
}
void operator delete(PVOID pv)
{
::operator delete(pv);
}
};
ULONG SOME_OBJECT::Create(PCWSTR FileName)
{
HANDLE hFile = CreateFile(FileName, FILE_READ_DATA, FILE_SHARE_READ, 0,
OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0);
if (hFile != INVALID_HANDLE_VALUE)
{
_hFile = hFile;
if (BindIoCompletionCallback(hFile, OPERATION_CTX::_CompletionRoutine, 0))
{
return NOERROR;
}
}
return GetLastError();
}
void SOME_OBJECT::beginGet()
{
const ULONG cbRead = 0x1000;
if (OPERATION_CTX* ctx = new(cbRead) OPERATION_CTX(this))
{
ctx->Offset = _ByteOffset.LowPart;
ctx->OffsetHigh = _ByteOffset.HighPart;
ctx->CheckResult(ReadFile(_hFile, ctx->_buf, cbRead, 0, ctx));
}
}
void ADemo(PCWSTR FileName)
{
if (SOME_OBJECT* pObj = new SOME_OBJECT)
{
if (!pObj->Create(FileName))
{
pObj->beginGet();
}
pObj->Release();
}
}

Recovering Detoured Library Functions

The question is fairly straight forward, what I'm trying to do is restore my process' detoured functions.
When I say detoured I mean the usual jmp instruction to an unknown location.
For example, when the ntdll.dll export NtOpenProcess() is not detoured, the first 5 bytes of the instruction of the function are along the lines of mov eax, *.
(The * offset depending on the OS version.)
When it gets detoured, that mov eax, * turns into a jmp.
What I'm trying to do is restore their bytes to what they were originally before any memory modifications.
My idea was to try and read the information I need from the disk, not from memory, however I do not know how to do that as I'm just a beginner.
Any help or explanation is greatly welcomed, if I did not explain my problem correctly please tell me!
I ended up figuring it out.
Example on NtOpenProcess.
Instead of restoring the bytes I decided to jump over them instead.
First we have to define the base of ntdll.
/* locate ntdll */
#define NTDLL _GetModuleHandleA("ntdll.dll")
Once we've done that, we're good to go. GetOffsetFromRva will calculate the offset of the file based on the address and module header passed to it.
DWORD GetOffsetFromRva(IMAGE_NT_HEADERS * nth, DWORD RVA)
{
PIMAGE_SECTION_HEADER sectionHeader = IMAGE_FIRST_SECTION(nth);
for (unsigned i = 0, sections = nth->FileHeader.NumberOfSections; i < sections; i++, sectionHeader++)
{
if (sectionHeader->VirtualAddress <= RVA)
{
if ((sectionHeader->VirtualAddress + sectionHeader->Misc.VirtualSize) > RVA)
{
RVA -= sectionHeader->VirtualAddress;
RVA += sectionHeader->PointerToRawData;
return RVA;
}
}
}
return 0;
}
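The mapping that GetOffsetFromRva performs can be tried out on made-up section data without parsing a real PE file. A minimal sketch: SectionInfo is a hypothetical, simplified mirror of the three IMAGE_SECTION_HEADER fields used above, and the section values in the usage note are invented for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified mirror of the IMAGE_SECTION_HEADER fields used above.
struct SectionInfo {
    std::uint32_t virtualAddress;    // RVA at which the section is mapped
    std::uint32_t virtualSize;       // size of the section in memory
    std::uint32_t pointerToRawData;  // offset of the section in the file
};

// Returns the file offset that corresponds to rva, or 0 if no section
// contains it (the same "not found" convention as GetOffsetFromRva).
std::uint32_t rvaToFileOffset(const std::vector<SectionInfo>& sections,
                              std::uint32_t rva) {
    for (const SectionInfo& s : sections) {
        if (rva >= s.virtualAddress && rva - s.virtualAddress < s.virtualSize) {
            // Offset inside the section stays the same; only the base differs.
            return rva - s.virtualAddress + s.pointerToRawData;
        }
    }
    return 0;
}
```

With a section mapped at RVA 0x1000 and stored at file offset 0x400, RVA 0x1234 maps to file offset 0x634.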
We call this to get us the file offset that we need in order to find the original bytes of the function.
DWORD GetExportPhysicalAddress(HMODULE hmModule, char* szExportName)
{
if (!hmModule)
{
return 0;
}
DWORD dwModuleBaseAddress = (DWORD)hmModule;
IMAGE_DOS_HEADER* pHeaderDOS = (IMAGE_DOS_HEADER *)hmModule;
if (pHeaderDOS->e_magic != IMAGE_DOS_SIGNATURE)
{
return 0;
}
IMAGE_NT_HEADERS * pHeaderNT = (IMAGE_NT_HEADERS *)(dwModuleBaseAddress + pHeaderDOS->e_lfanew);
if (pHeaderNT->Signature != IMAGE_NT_SIGNATURE)
{
return 0;
}
/* get the export virtual address through a custom GetProcAddress function. */
void* pExportRVA = GetProcedureAddress(hmModule, szExportName);
if (pExportRVA)
{
/* convert the VA to RVA... */
DWORD dwExportRVA = (DWORD)pExportRVA - dwModuleBaseAddress;
/* get the file offset and return */
return GetOffsetFromRva(pHeaderNT, dwExportRVA);
}
return 0;
}
Using the function that gets us the file offset, we can now read the original export bytes.
size_t ReadExportFunctionBytes(HMODULE hmModule, char* szExportName, BYTE* lpBuffer, size_t t_Count)
{
/* get the offset */
DWORD dwFileOffset = GetExportPhysicalAddress(hmModule, szExportName);
if (!dwFileOffset)
{
return 0;
}
/* get the path of the targeted module */
char szModuleFilePath[MAX_PATH];
if (!GetModuleFileNameA(hmModule, szModuleFilePath, MAX_PATH))
{
return 0;
}
/* try to open the file off the disk */
FILE *fModule = fopen(szModuleFilePath, "rb");
if (!fModule)
{
/* we couldn't open the file */
return 0;
}
/* go to the offset and read it; with an element size of 1, fread returns the number of bytes read */
size_t t_Read = 0;
if (fseek(fModule, dwFileOffset, SEEK_SET) == 0)
{
t_Read = fread(lpBuffer, 1, t_Count, fModule);
}
/* close the file on every path, including the failure paths */
fclose(fModule);
/* report success only if all requested bytes were read */
return (t_Read == t_Count) ? t_Read : 0;
}
And we can retrieve the syscall index from the mov instruction originally placed in the first 5 bytes of the export on x86.
DWORD GetSyscallIndex(char* szFunctionName)
{
BYTE buffer[5];
/* an array is never null, so check the read result instead of the buffer */
if (!ReadExportFunctionBytes(NTDLL, szFunctionName, buffer, 5))
{
return 0;
}
return BytesToDword(buffer + 1);
}
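The byte-level reasoning here (a clean stub starts with mov eax, imm32; a detoured one starts with a relative jmp) can be made explicit. A minimal sketch, assuming 32-bit x86 stubs: classifyStub, readImm32, tryGetSyscallIndex and jumpTarget are hypothetical names, readImm32 is only a guess at what BytesToDword does, and the byte sequences in the usage note are invented, not real ntdll contents.

```cpp
#include <cstdint>

enum class StubKind { MovEaxImm32, RelativeJump, Unknown };

// Classifies a 5-byte x86 function prologue by its opcode byte.
StubKind classifyStub(const std::uint8_t* bytes) {
    if (bytes[0] == 0xB8) return StubKind::MovEaxImm32;   // mov eax, imm32
    if (bytes[0] == 0xE9) return StubKind::RelativeJump;  // jmp rel32 (typical detour)
    return StubKind::Unknown;
}

// Reads the little-endian 32-bit operand that follows the opcode byte.
std::uint32_t readImm32(const std::uint8_t* bytes) {
    return static_cast<std::uint32_t>(bytes[1]) |
           (static_cast<std::uint32_t>(bytes[2]) << 8) |
           (static_cast<std::uint32_t>(bytes[3]) << 16) |
           (static_cast<std::uint32_t>(bytes[4]) << 24);
}

// Stores the syscall index and returns true only when the stub really starts
// with mov eax, imm32; a detoured or unexpected stub is reported as failure.
bool tryGetSyscallIndex(const std::uint8_t* bytes, std::uint32_t& index) {
    if (classifyStub(bytes) != StubKind::MovEaxImm32) return false;
    index = readImm32(bytes);
    return true;
}

// Destination of a jmp rel32 located at address: the displacement is relative
// to the end of the 5-byte instruction. Unsigned wrap-around makes negative
// displacements work as well.
std::uint32_t jumpTarget(std::uint32_t address, const std::uint8_t* bytes) {
    return address + 5 + readImm32(bytes);
}
```

For the invented bytes B8 26 00 00 00 this yields index 0x26, and for E9 FB 0F 00 00 located at 0x1000 the jump target is 0x2000. Checking the opcode before trusting the operand avoids interpreting a detour's displacement as a syscall index.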
Get the NtOpenProcess address and add 5 to trampoline over it.
DWORD _ptrNtOpenProcess = (DWORD) GetProcAddress(NTDLL, "NtOpenProcess") + 5;
DWORD _oNtOpenProcess = GetSyscallIndex("NtOpenProcess");
The recovered/reconstructed NtOpenProcess.
__declspec(naked) NTSTATUS NTAPI _NtOpenProcess
(
_Out_ PHANDLE ProcessHandle,
_In_ ACCESS_MASK DesiredAccess,
_In_ POBJECT_ATTRIBUTES ObjectAttributes,
_In_opt_ PCLIENT_ID ClientId
) {
__asm
{
mov eax, [_oNtOpenProcess]
jmp dword ptr ds : [_ptrNtOpenProcess]
}
}
Let's call it.
int main()
{
printf("NtOpenProcess %x index: %x\n", _ptrNtOpenProcess, _oNtOpenProcess);
uint32_t pId = 0;
do
{
pId = GetProcessByName("notepad.exe");
Sleep(200);
} while (pId == 0);
OBJECT_ATTRIBUTES oa;
CLIENT_ID cid;
cid.UniqueProcess = (HANDLE)pId;
cid.UniqueThread = 0;
InitializeObjectAttributes(&oa, NULL, 0, NULL, NULL);
HANDLE hProcess;
NTSTATUS ntStat;
ntStat = _NtOpenProcess(&hProcess, PROCESS_ALL_ACCESS, &oa, &cid);
if (!NT_SUCCESS(ntStat))
{
printf("Couldn't open the process. NTSTATUS: %d", ntStat);
return 0;
}
printf("Successfully opened the process.");
/* clean up. */
NtClose(hProcess);
getchar();
return 0;
}

Modifying the stack on Windows, TIB and exceptions

The origin of my question effectively stems from wanting to provide an implementation of pthreads on Windows which supports user-provided stacks. Specifically, pthread_attr_setstack should do something meaningful. My actual requirements are a bit more involved than this, but this is good enough for the purpose of the post.
There are no public Win32 APIs for providing a stack in either the Fiber or Thread APIs. I've searched around for sneaky backdoors, workarounds and hacks; there's nothing going. In fact, I looked at the winpthread source for inspiration, and that ignores any stack provided to pthread_attr_setstack.
Instead I tried the following "solution" to see if it would work. I create a fiber using the usual combination of ConvertThreadToFiber, CreateFiberEx and SwitchToFiber. In CreateFiberEx I provide a minimal stack size. In the entry point of the fiber I then allocate memory for a stack, change the TIB fields "Stack Base" and "Stack Limit" appropriately (see here: http://en.wikipedia.org/wiki/Win32_Thread_Information_Block) and then set ESP to the high address of my stack.
(In a real-world case I would set up the stack better than this and change EIP as well, so that this step behaves more like the POSIX function swapcontext, but you get the idea.)
If I make any OS calls when on this different stack then I'm pretty much screwed (printf, for example, dies). However, this isn't an issue for me. I can ensure that I never make such calls when on my custom stack (which is why I said my actual requirements are a bit more involved). Except... I need exceptions to work. And they don't! Specifically, if I try to throw and catch an exception on my modified stack then I get an assert
Unhandled exception at 0xXXXXXXXX ....
So my (vague) question is, does anyone have any insight as to how exceptions and a custom stack might not be playing nicely together? I appreciate that this is totally unsupported and can happily accept a nil response or "go away". In fact, I've pretty much decided that I need a different solution and, despite this involving compromise, I'm likely to use one. However, curiosity gets the better of me, so I'd like to know why this doesn't work.
On a related note, I wondered how Cygwin dealt with this for ucontext. The source here http://szupervigyor.ddsi.hu/source/in/openjdk-6-6b18-1.8.13/cacao-0.99.4/src/vm/jit/i386/cygwin/ucontext.c uses GetThreadContext/SetThreadContext to implement ucontext. However, from experimentation I see that this also fails when an exception is thrown from inside a new context. In fact the SetThreadContext call doesn't even update the TIB block!
EDIT (based on the answer from @avakar)
The following code, which is very similar to yours, demonstrates the same failure. The difference is that I don't start the second thread suspended, but suspend it and then try to change its context. This code exhibits the error I was describing when the try-catch block is hit in foo. Perhaps this simply isn't legal. One notable thing is that in this situation the ExceptionList member of the TIB is a valid pointer when modifyThreadContext is called, whereas in your example it's -1. Manually editing this doesn't help.
As mentioned in my comment on your answer, this isn't precisely what I need. I would like to switch contexts from the thread I'm currently on. However, the docs for SetThreadContext warn not to call this on an active thread. So I'm guessing that if the code below doesn't work, then I have no chance of making it work on a single thread.
namespace
{
HANDLE ghSemaphore = 0;
void foo()
{
try
{
throw 6;
}
catch(...){}
ExitThread(0);
}
void modifyThreadContext(HANDLE thread)
{
typedef NTSTATUS WINAPI NtQueryInformationThread_t(HANDLE ThreadHandle, DWORD ThreadInformationClass, PVOID ThreadInformation, ULONG ThreadInformationLength, PULONG ReturnLength);
HMODULE hNtdll = LoadLibraryW(L"ntdll.dll");
auto NtQueryInformationThread = (NtQueryInformationThread_t *)GetProcAddress(hNtdll, "NtQueryInformationThread");
DWORD stackSize = 1024 * 1024;
void * mystack = VirtualAlloc(0, stackSize, MEM_COMMIT, PAGE_READWRITE);
DWORD threadInfo[7];
NtQueryInformationThread(thread, 0, threadInfo, sizeof threadInfo, 0);
NT_TIB * tib = (NT_TIB *)threadInfo[1];
CONTEXT ctx = {};
ctx.ContextFlags = CONTEXT_ALL;
GetThreadContext(thread, &ctx);
ctx.Esp = (DWORD)mystack + stackSize - ((DWORD)tib->StackBase - ctx.Esp);
ctx.Eip = (DWORD)&foo;
tib->StackBase = (PVOID)((DWORD)mystack + stackSize);
tib->StackLimit = (PVOID)((DWORD)mystack);
SetThreadContext(thread, &ctx);
}
DWORD CALLBACK threadMain(LPVOID)
{
ReleaseSemaphore(ghSemaphore, 1, NULL);
while (1)
Sleep(10000);
// Never gets here
return 1;
}
} // namespace
int main()
{
ghSemaphore = CreateSemaphore(NULL, 0, 1, NULL);
HANDLE th = CreateThread(0, 0, threadMain, 0, 0, 0);
while (WaitForSingleObject(ghSemaphore, INFINITE) != WAIT_OBJECT_0);
SuspendThread(th);
modifyThreadContext(th);
ResumeThread(th);
while (WaitForSingleObject(th, 10) != WAIT_OBJECT_0);
return 0;
}
Both exceptions and printf work for me, and I don't see why they shouldn't. If you post your code, we can try to pinpoint what's going on.
#include <windows.h>
#include <stdio.h>
DWORD CALLBACK ThreadProc(LPVOID)
{
try
{
throw 1;
}
catch (int i)
{
printf("%d\n", i);
}
return 0;
}
typedef NTSTATUS WINAPI NtQueryInformationThread_t(HANDLE ThreadHandle, DWORD ThreadInformationClass, PVOID ThreadInformation, ULONG ThreadInformationLength, PULONG ReturnLength);
int main()
{
HMODULE hNtdll = LoadLibraryW(L"ntdll.dll");
auto NtQueryInformationThread = (NtQueryInformationThread_t *)GetProcAddress(hNtdll, "NtQueryInformationThread");
DWORD stackSize = 1024 * 1024;
void * mystack = VirtualAlloc(0, stackSize, MEM_COMMIT, PAGE_READWRITE);
DWORD dwThreadId;
HANDLE hThread = CreateThread(0, 0, &ThreadProc, 0, CREATE_SUSPENDED, &dwThreadId);
DWORD threadInfo[7];
NtQueryInformationThread(hThread, 0, threadInfo, sizeof threadInfo, 0);
NT_TIB * tib = (NT_TIB *)threadInfo[1];
CONTEXT ctx = {};
ctx.ContextFlags = CONTEXT_ALL;
GetThreadContext(hThread, &ctx);
ctx.Esp = (DWORD)mystack + stackSize - ((DWORD)tib->StackBase - ctx.Esp);
tib->StackBase = (PVOID)((DWORD)mystack + stackSize);
tib->StackLimit = (PVOID)((DWORD)mystack);
SetThreadContext(hThread, &ctx);
ResumeThread(hThread);
WaitForSingleObject(hThread, INFINITE);
}
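The Esp adjustment in both listings places the stack pointer at the same distance below the new StackBase as it was below the old one. A minimal sketch of that arithmetic as a pure function follows. The helper name `relocate_sp` and its parameters are hypothetical, and it only models the address computation on 32-bit values; it does not touch a real thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the Esp arithmetic in the listings above.
// old_base  : tib->StackBase of the original stack (its highest address)
// old_sp    : current stack pointer (ctx.Esp), at or below old_base
// new_alloc : lowest address of the replacement block (mystack)
// new_size  : size of the replacement block in bytes (stackSize)
std::uint32_t relocate_sp(std::uint32_t old_base, std::uint32_t old_sp,
                          std::uint32_t new_alloc, std::uint32_t new_size)
{
    assert(old_sp <= old_base);
    std::uint32_t used = old_base - old_sp;  // distance below the old base
    assert(used <= new_size);
    return new_alloc + new_size - used;      // same distance below the new base
}
```

Note that this only moves the pointer. Anything already stored on the old stack stays where it was, which is why the second listing changes the stack before the thread has started running.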

How to create a trampoline function for hook

I'm interested in hooking and decided to see if I could hook some functions. I wasn't interested in using a library like Detours because I want the experience of doing it on my own. With some sources I found on the internet, I was able to create the code below. It's basic, but it works all right. However, when hooking functions that are called by multiple threads, it proves to be extremely unstable: if two calls are made at nearly the same time, it crashes. After some research I think I need to create a trampoline function. After looking for hours, I was not able to find anything other than a general description of what a trampoline is. I could not find anything specifically about writing a trampoline function or how they really work. If anyone could help me write one, post some sources, or at least point me in the right direction by recommending some articles, sites, books, etc., I would greatly appreciate it.
Below is the code I've written. It's really basic but I hope others might learn from it.
test.cpp
#include "stdafx.h"
Hook hook;
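typedef int (WINAPI *tMessageBox)(HWND hWnd, LPCSTR lpText, LPCSTR lpCaption, UINT uType);
// The hook must match MessageBoxA exactly (int return, WINAPI calling convention); otherwise the stack is unbalanced on return.
int WINAPI hMessageBox(HWND hWnd, LPCSTR lpText, LPCSTR lpCaption, UINT uType)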
{
hook.removeHook();
tMessageBox oMessageBox = (tMessageBox)hook.funcPtr;
int ret = oMessageBox(hWnd, lpText, "Hooked!", uType);
hook.applyHook(&hMessageBox);
return ret;
}
void hookMessageBox()
{
printf("Hooking MessageBox...\n");
if(hook.findFunc("User32.dll", "MessageBoxA"))
{
if(hook.applyHook(&hMessageBox))
{
printf("hook applied! \n\n");
} else printf("hook could not be applied\n");
}
}
hook.cpp
#include "stdafx.h"
bool Hook::findFunc(char* libName, char* funcName)
{
Hook::funcPtr = (void*)GetProcAddress(GetModuleHandleA(libName), funcName);
return (Hook::funcPtr != NULL);
}
bool Hook::removeHook()
{
DWORD dwProtect;
if(VirtualProtect(Hook::funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect))
{
WriteProcessMemory(GetCurrentProcess(), (LPVOID)Hook::funcPtr, Hook::origData, 6, 0);
VirtualProtect(Hook::funcPtr, 6, dwProtect, &dwProtect); // lpflOldProtect must not be NULL, or the call fails
return true;
} else return false;
}
bool Hook::reapplyHook()
{
DWORD dwProtect;
if(VirtualProtect(funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect))
{
WriteProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, Hook::hookData, 6, 0);
VirtualProtect(funcPtr, 6, dwProtect, &dwProtect); // lpflOldProtect must not be NULL, or the call fails
return true;
} else return false;
}
bool Hook::applyHook(void* hook)
{
return setHookAtAddress(Hook::funcPtr, hook);
}
bool Hook::setHookAtAddress(void* funcPtr, void* hook)
{
Hook::funcPtr = funcPtr;
BYTE jmp[6] = { 0xE9, //jmp
0x00, 0x00, 0x00, 0x00, //address
0xC3 //retn
};
DWORD dwProtect;
if(VirtualProtect(funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect)) // make memory writable
{
ReadProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, Hook::origData, 6, 0); // save old data
DWORD offset = ((DWORD)hook - (DWORD)funcPtr - 5); //((to)-(from)-5)
memcpy(&jmp[1], &offset, 4); // write address into jmp
memcpy(Hook::hookData, jmp, 6); // save hook data
WriteProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, jmp, 6, 0); // write jmp
VirtualProtect(funcPtr, 6, dwProtect, &dwProtect); // reprotect (lpflOldProtect must not be NULL, or the call fails)
return true;
} else return false;
}
If you want your hook to be safe when called by multiple threads, you don't want to be constantly unhooking and rehooking the original API.
A trampoline is simply a bit of code you generate that replicates the functionality of the first few bytes of the original API (which you overwrote with your jump), then jumps into the API after the bytes you overwrote.
Rather than unhooking the API, calling it and rehooking it you simply call the trampoline.
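To make the layout concrete, here is a minimal sketch that builds the bytes of a 32-bit trampoline in an ordinary buffer: the copied ("stolen") entry bytes followed by a `jmp rel32` back into the API. The names `encode_jmp_rel32` and `build_trampoline` are hypothetical, addresses are passed in as plain 32-bit numbers, and nothing is made executable. It illustrates only the encoding and assumes the stolen bytes end on an instruction boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode "jmp rel32" (E9 + 4-byte displacement) located at address `from`
// and landing on `to`. The displacement is relative to the next
// instruction, i.e. from + 5, and is stored little-endian.
void encode_jmp_rel32(std::uint8_t out[5], std::uint32_t from, std::uint32_t to)
{
    std::uint32_t rel = to - (from + 5);  // wraps modulo 2^32
    out[0] = 0xE9;
    for (int i = 0; i < 4; ++i)
        out[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));
}

// Trampoline image = stolen bytes + jump to the first byte of the API
// that the hook's own jump did not overwrite (api_addr + len).
std::vector<std::uint8_t> build_trampoline(const std::uint8_t* stolen, std::size_t len,
                                           std::uint32_t api_addr, std::uint32_t tramp_addr)
{
    assert(stolen != nullptr && len >= 5);
    std::vector<std::uint8_t> image(stolen, stolen + len);
    std::uint8_t jmp[5];
    std::uint32_t n = static_cast<std::uint32_t>(len);
    encode_jmp_rel32(jmp, tramp_addr + n, api_addr + n);
    image.insert(image.end(), jmp, jmp + 5);
    return image;
}
```

The hook function then calls the trampoline's address instead of the original API, so the patched entry bytes never have to be restored.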
This is moderately complicated to do on x86 because you need (a fairly minimal) disassembler to find the instruction boundaries. You also need to check that the code you copy into your trampoline doesn't do anything relative to the instruction pointer (like a jmp, branch or call).
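As a rough illustration of the second check, the sketch below tests whether a single instruction is one of the common branches whose operand is relative to the instruction pointer. The function name is hypothetical, it assumes no prefix bytes precede the opcode, and it is not a disassembler: a real hook would apply a test like this at every instruction boundary inside the copied region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when the instruction starting at code[0] is a common x86
// branch with an instruction-pointer-relative operand.
// Assumption: no prefix bytes; only the opcodes listed here are recognised.
bool is_ip_relative_branch(const std::uint8_t* code, std::size_t avail)
{
    assert(code != nullptr);
    if (avail == 0) return false;
    std::uint8_t op = code[0];
    if (op == 0xE8 || op == 0xE9 || op == 0xEB) return true;  // call rel32, jmp rel32, jmp rel8
    if (op >= 0x70 && op <= 0x7F) return true;                // jcc rel8
    if (op >= 0xE0 && op <= 0xE3) return true;                // loopne, loope, loop, jecxz
    if (op == 0x0F && avail >= 2 && code[1] >= 0x80 && code[1] <= 0x8F)
        return true;                                          // jcc rel32
    return false;
}
```

If the check fires, the copied instruction cannot be moved verbatim; its displacement would have to be recomputed for the trampoline's address, or the hook abandoned.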
This is sufficient to make calls to the hook thread-safe, but you can't safely install the hook while other threads are using the API. For this, you need to hook the function with a two-byte short jump (which can be written atomically). Windows APIs are frequently preceded by a few NOPs (which can be overwritten with a five-byte relative jump) to provide a target for this short jump.
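A sketch of that two-step patch on a plain byte buffer is shown below. It assumes, beyond what is stated above, that the function has at least five padding bytes directly before its entry and that the entry begins with a two-byte instruction (such as `mov edi, edi`) so that only whole instructions are replaced. The name `hotpatch` and the simulated addresses are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// `image` models the memory around a function; entry_off is the offset of
// the function's first byte inside it, entry_addr the address it stands for.
void hotpatch(std::uint8_t* image, std::size_t entry_off,
              std::uint32_t entry_addr, std::uint32_t hook_addr)
{
    assert(image != nullptr && entry_off >= 5);

    // Step 1: write "jmp rel32" into the padding at entry - 5. No thread
    // executes these bytes yet, so this write does not need to be atomic.
    // rel32 = hook - ((entry - 5) + 5) = hook - entry.
    std::uint32_t rel = hook_addr - entry_addr;
    std::uint8_t* pad = image + entry_off - 5;
    pad[0] = 0xE9;
    for (int i = 0; i < 4; ++i)
        pad[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));

    // Step 2: replace the first two entry bytes with "jmp short -7"
    // (EB F9), which goes from entry + 2 back to entry - 5. In a real
    // patch these two bytes would be written with one 16-bit store.
    image[entry_off]     = 0xEB;
    image[entry_off + 1] = 0xF9;
}
```

A thread entering the function sees either the original two-byte instruction or the complete short jump, never a half-written five-byte jump.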
Doing this on x64 is much more complicated. You can't simply patch the function with a jump that carries a 64-bit target (there is no such single instruction, and the sequences that simulate it are often too long). And, depending on what your trampoline does, you may need to add it to the OS's stack unwind information.
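To show why the x64 case gets long, here is a sketch of one common way to simulate an absolute 64-bit jump, `jmp qword ptr [rip+0]` followed by the 8-byte target (14 bytes in total), along with a check for whether the ordinary 5-byte `jmp rel32` would reach. Both function names are hypothetical, and this particular sequence is just one possible choice, not the method the answer has in mind.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// True when a 5-byte "jmp rel32" placed at `from` can reach `to`,
// i.e. the displacement fits in a signed 32-bit value.
bool fits_rel32(std::uint64_t from, std::uint64_t to)
{
    std::int64_t d = static_cast<std::int64_t>(to - (from + 5));
    return d >= INT32_MIN && d <= INT32_MAX;
}

// 14-byte absolute jump: FF 25 00 00 00 00 is "jmp qword ptr [rip+0]",
// which loads its destination from the 8 bytes that follow it.
std::array<std::uint8_t, 14> encode_abs_jmp64(std::uint64_t target)
{
    std::array<std::uint8_t, 14> code = {{0xFF, 0x25, 0x00, 0x00, 0x00, 0x00}};
    for (int i = 0; i < 8; ++i)
        code[6 + i] = static_cast<std::uint8_t>(target >> (8 * i));  // little-endian
    return code;
}
```

Fourteen bytes usually spans several instructions at the function entry, all of which have to be relocated into the trampoline, which is where most of the extra complexity comes from.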
I hope this isn't too general.
The de facto standard hooking tutorial is from jbremer and available here.
Here is a simple x86 detour and trampoline hook based on this tutorial, using Direct3D's EndScene() function as an example:
bool Detour32(char* src, char* dst, const intptr_t len)
{
if (len < 5) return false;
DWORD curProtection;
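// Make the bytes at src writable; bail out if that fails
if (!VirtualProtect(src, len, PAGE_EXECUTE_READWRITE, &curProtection)) return false;
// rel32 is measured from the end of the 5-byte jmp: dst - (src + 5)
intptr_t relativeAddress = (intptr_t)dst - (intptr_t)src - 5;
*src = (char)'\xE9';
*(intptr_t*)((intptr_t)src + 1) = relativeAddress;
// Restore the original protection
VirtualProtect(src, len, curProtection, &curProtection);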
return true;
}
char* TrampHook32(char* src, char* dst, const intptr_t len)
{
// Make sure the length is at least 5 (the size of a jmp rel32)
if (len < 5) return 0;
// Create the gateway (len + 5 for the overwritten bytes + the jmp)
void* gateway = VirtualAlloc(0, len + 5, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
if (!gateway) return 0;
//Write the stolen bytes into the gateway
memcpy(gateway, src, len);
// Offset from the end of the gateway's jmp (gateway + len + 5) to src + len; the len terms cancel
intptr_t gatewayRelativeAddr = ((intptr_t)src - (intptr_t)gateway) - 5;
// Add the jmp opcode to the end of the gateway
*(char*)((intptr_t)gateway + len) = 0xE9;
// Add the address to the jmp
*(intptr_t*)((intptr_t)gateway + len + 1) = gatewayRelativeAddr;
// Perform the detour
Detour32(src, dst, len);
return (char*)gateway;
}
typedef HRESULT(APIENTRY* tEndScene)(LPDIRECT3DDEVICE9 pDevice);
tEndScene oEndScene = nullptr;
HRESULT APIENTRY hkEndScene(LPDIRECT3DDEVICE9 pDevice)
{
//do stuff in here
return oEndScene(pDevice);
}
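// Just an example: d3d9Device is assumed to hold the IDirect3DDevice9 vtable entries (obtained elsewhere);
// the entry used here is EndScene, and 7 is the number of bytes to steal, which must end on an instruction boundary.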
int main()
{
oEndScene = (tEndScene)TrampHook32((char*)d3d9Device[42], (char*)hkEndScene, 7);
}