How to create a trampoline function for hook - c++

I'm interested in hooking and I decided to see if I could hook some functions. I wasn't interested in using a library like detours because I want to have the experience of doing it on my own. With some sources I found on the internet, I was able to create the code below. It's basic, but it works alright. However when hooking functions that are called by multiple threads it proves to be extremely unstable. If two calls are made at nearly the same time, it'll crash. After some research I think I need to create a trampoline function. After looking for hours all I was not able to find anything other that a general description on what a trampoline was. I could not find anything specifically about writing a trampoline function, or how they really worked. If any one could help me write one, post some sources, or at least point me in the right direction by recommending some articles, sites, books, etc. I would greatly appreciate it.
Below is the code I've written. It's really basic but I hope others might learn from it.
test.cpp
#include "stdafx.h"
Hook hook;
typedef int (WINAPI *tMessageBox)(HWND hWnd, LPCTSTR lpText, LPCTSTR lpCaption, UINT uType);
DWORD hMessageBox(HWND hWnd, LPCTSTR lpText, LPCTSTR lpCaption, UINT uType)
{
hook.removeHook();
tMessageBox oMessageBox = (tMessageBox)hook.funcPtr;
int ret =oMessageBox(hWnd, lpText, "Hooked!", uType);
hook.applyHook(&hMessageBox);
return ret;
}
void hookMessageBox()
{
printf("Hooking MessageBox...\n");
if(hook.findFunc("User32.dll", "MessageBoxA"))
{
if(hook.applyHook(&hMessageBox))
{
printf("hook applied! \n\n");
} else printf("hook could not be applied\n");
}
}
hook.cpp
#include "stdafx.h"
bool Hook::findFunc(char* libName, char* funcName)
{
Hook::funcPtr = (void*)GetProcAddress(GetModuleHandleA(libName), funcName);
return (Hook::funcPtr != NULL);
}
bool Hook::removeHook()
{
DWORD dwProtect;
if(VirtualProtect(Hook::funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect))
{
WriteProcessMemory(GetCurrentProcess(), (LPVOID)Hook::funcPtr, Hook::origData, 6, 0);
VirtualProtect(Hook::funcPtr, 6, dwProtect, NULL);
return true;
} else return false;
}
bool Hook::reapplyHook()
{
DWORD dwProtect;
if(VirtualProtect(funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect))
{
WriteProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, Hook::hookData, 6, 0);
VirtualProtect(funcPtr, 6, dwProtect, NULL);
return true;
} else return false;
}
bool Hook::applyHook(void* hook)
{
return setHookAtAddress(Hook::funcPtr, hook);
}
bool Hook::setHookAtAddress(void* funcPtr, void* hook)
{
Hook::funcPtr = funcPtr;
BYTE jmp[6] = { 0xE9, //jmp
0x00, 0x00, 0x00, 0x00, //address
0xC3 //retn
};
DWORD dwProtect;
if(VirtualProtect(funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect)) // make memory writable
{
ReadProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, Hook::origData, 6, 0); // save old data
DWORD offset = ((DWORD)hook - (DWORD)funcPtr - 5); //((to)-(from)-5)
memcpy(&jmp[1], &offset, 4); // write address into jmp
memcpy(Hook::hookData, jmp, 6); // save hook data
WriteProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, jmp, 6, 0); // write jmp
VirtualProtect(funcPtr, 6, dwProtect, NULL); // reprotect
return true;
} else return false;
}

If you want your hook to be safe when called by multiple threads, you don't want to be constantly unhooking and rehooking the original API.
A trampoline is simply a bit of code you generate that replicates the functionality of the first few bytes of the original API (which you overwrote with your jump), then jumps into the API after the bytes you overwrote.
Rather than unhooking the API, calling it and rehooking it you simply call the trampoline.
This is moderately complicated to do on x86 because you need (a fairly minimal) disassembler to find the instruction boundaries. You also need to check that the code you copy into your trampoline doesn't do anything relative to the instruction pointer (like a jmp, branch or call).
This is sufficient to make calls to the hook thread-safe, but you can't create the hook if multiple threads are using the API. For this, you need to hook the function with a two-byte near jump (which can be written atomically). Windows APIs are frequently preceded by a few NOPs (which can be overwritten with a far jump) to provide a target for this near jump.
Doing this on x64 is much more complicated. You can't simply patch the function with a 64-bit far jump (because there isn't one, and instructions to simulate it are often too long). And, depending on what your trampoline does, you may need to add it to the OS's stack unwind information.
I hope this isn't too general.

The defacto standard hooking tutorial is from jbremer and available here
Here is a simple x86 detour and trampoline hook based on this tutorial using Direct3D's EndScene() function as a example:
bool Detour32(char* src, char* dst, const intptr_t len)
{
if (len < 5) return false;
DWORD curProtection;
VirtualProtect(src, len, PAGE_EXECUTE_READWRITE, &curProtection);
intptr_t relativeAddress = (intptr_t)(dst - (intptr_t)src) - 5;
*src = (char)'\xE9';
*(intptr_t*)((intptr_t)src + 1) = relativeAddress;
VirtualProtect(src, len, curProtection, &curProtection);
return true;
}
char* TrampHook32(char* src, char* dst, const intptr_t len)
{
// Make sure the length is greater than 5
if (len < 5) return 0;
// Create the gateway (len + 5 for the overwritten bytes + the jmp)
void* gateway = VirtualAlloc(0, len + 5, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
//Write the stolen bytes into the gateway
memcpy(gateway, src, len);
// Get the gateway to destination addy
intptr_t gatewayRelativeAddr = ((intptr_t)src - (intptr_t)gateway) - 5;
// Add the jmp opcode to the end of the gateway
*(char*)((intptr_t)gateway + len) = 0xE9;
// Add the address to the jmp
*(intptr_t*)((intptr_t)gateway + len + 1) = gatewayRelativeAddr;
// Perform the detour
Detour32(src, dst, len);
return (char*)gateway;
}
typedef HRESULT(APIENTRY* tEndScene)(LPDIRECT3DDEVICE9 pDevice);
tEndScene oEndScene = nullptr;
HRESULT APIENTRY hkEndScene(LPDIRECT3DDEVICE9 pDevice)
{
//do stuff in here
return oEndScene(pDevice);
}
//just an example
int main()
{
oEndScene = (tEndScene)TrampHook32((char*)d3d9Device[42], (char*)hkEndScene, 7);
}

Related

Set stack size programmatically on Windows

Is it possible in WinAPI to set stack size for the current thread at runtime like setrlimit does on Linux?
I mean to increase the reserved stack size for the current thread if it is too small for the current requirements.
This is in a library that may be called by threads from other programming languages, so it's not an option to set stack size at compile time.
If not, any ideas about a solution like an assembly trampoline that changes the stack pointer to a dynamically allocated memory block?
FAQ: Proxy thread is a surefire solution (unless the caller thread has extremely small stack). However, thread switching seems a performance killer. I need substantial amount of stack for recursion or for _alloca. This is also for performance, because heap allocation is slow, especially if multiple threads allocate from heap in parallel (they get blocked by the same libc/CRT mutex, so the code becomes serial).
you can not full swap stack in current thread (allocate self, delete old) in library code because in old stack - return addresses, may be pointers to variables in stack, etc.
and you can not expand stack (virtual memory for it already allocated (reserved/commit) and not expandable.
however possible allocate temporary stack and switch to this stack during call. you must in this case save old StackBase and StackLimit from NT_TIB (look this structure in winnt.h), set new values (you need allocate memory for new stack), do call (for switch stack you need some assembly code - you can not do this only on c/c++) and return original StackBase and StackLimit. in kernelmode exist support for this - KeExpandKernelStackAndCallout
however in user mode exist Fibers - this is very rare used, but look like perfectly match to task. with Fiber we can create additional stack/execution context inside current thread.
so in general solution is next (for library):
on DLL_THREAD_ATTACH :
convert thread to fiber
(ConvertThreadToFiber) (if it return false check also
GetLastError for ERROR_ALREADY_FIBER - this is also ok code)
and create own Fiber by call CreateFiberEx
we do this only once. than, every time when your procedure is called, which require large stack space:
remember the current fiber by call GetCurrentFiber
setup task for your fiber
switch to your fiber by call SwitchToFiber
call procedure inside fiber
return to original fiber (saved from call GetCurrentFiber)
again by SwitchToFiber
and finally on DLL_THREAD_DETACH you need:
delete your fiber by DeleteFiber
convert fiber to thread by call ConvertFiberToThread but only
in case initial ConvertThreadToFiber return true (if was
ERROR_ALREADY_FIBER- let who first convert thread to fiber convert
it back - this is not your task in this case)
you need some (usual small) data associated with your fiber / thread. this must be of course per thread variable. so you need use __declspec(thread) for declare this data. or direct use TLS (or which modern c++ features exist for this)
demo implementation is next:
typedef ULONG (WINAPI * MY_EXPAND_STACK_CALLOUT) (PVOID Parameter);
class FIBER_DATA
{
public:
PVOID _PrevFiber, _MyFiber;
MY_EXPAND_STACK_CALLOUT _pfn;
PVOID _Parameter;
ULONG _dwError;
BOOL _bConvertToThread;
static VOID CALLBACK _FiberProc( PVOID lpParameter)
{
reinterpret_cast<FIBER_DATA*>(lpParameter)->FiberProc();
}
VOID FiberProc()
{
for (;;)
{
_dwError = _pfn(_Parameter);
SwitchToFiber(_PrevFiber);
}
}
public:
~FIBER_DATA()
{
if (_MyFiber)
{
DeleteFiber(_MyFiber);
}
if (_bConvertToThread)
{
ConvertFiberToThread();
}
}
FIBER_DATA()
{
_bConvertToThread = FALSE, _MyFiber = 0;
}
ULONG Create(SIZE_T dwStackCommitSize, SIZE_T dwStackReserveSize);
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
_PrevFiber = GetCurrentFiber();
_pfn = pfn;
_Parameter = Parameter;
SwitchToFiber(_MyFiber);
return _dwError;
}
};
__declspec(thread) FIBER_DATA* g_pData;
ULONG FIBER_DATA::Create(SIZE_T dwStackCommitSize, SIZE_T dwStackReserveSize)
{
if (ConvertThreadToFiber(this))
{
_bConvertToThread = TRUE;
}
else
{
ULONG dwError = GetLastError();
if (dwError != ERROR_ALREADY_FIBER)
{
return dwError;
}
}
return (_MyFiber = CreateFiberEx(dwStackCommitSize, dwStackReserveSize, 0, _FiberProc, this)) ? NOERROR : GetLastError();
}
void OnDetach()
{
if (FIBER_DATA* pData = g_pData)
{
delete pData;
}
}
ULONG OnAttach()
{
if (FIBER_DATA* pData = new FIBER_DATA)
{
if (ULONG dwError = pData->Create(2*PAGE_SIZE, 512 * PAGE_SIZE))
{
delete pData;
return dwError;
}
g_pData = pData;
return NOERROR;
}
return ERROR_NO_SYSTEM_RESOURCES;
}
ULONG WINAPI TestCallout(PVOID param)
{
DbgPrint("TestCallout(%s)\n", param);
return NOERROR;
}
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
if (FIBER_DATA* pData = g_pData)
{
return pData->DoCallout(pfn, Parameter);
}
return ERROR_GEN_FAILURE;
}
if (!OnAttach())//DLL_THREAD_ATTACH
{
DoCallout(TestCallout, "Demo Task #1");
DoCallout(TestCallout, "Demo Task #2");
OnDetach();//DLL_THREAD_DETACH
}
also note that all fibers executed in single thread context - multiple fibers associated with thread can not execute in concurrent - only sequential, and you yourself control switch time. so not need any additional synchronization. and SwitchToFiber - this is complete user mode proc. which executed very fast, never fail (because never allocate any resources)
update
despite use __declspec(thread) FIBER_DATA* g_pData; more simply (less code), better for implementation direct use TlsGetValue / TlsSetValue and allocate FIBER_DATA on first call inside thread, but not for all threads. also __declspec(thread) not correct worked (not worked at all) in XP for dll. so some modification can be
at DLL_PROCESS_ATTACH allocate your TLS slot gTlsIndex = TlsAlloc();
and free it on DLL_PROCESS_DETACH
if (gTlsIndex != TLS_OUT_OF_INDEXES) TlsFree(gTlsIndex);
on every DLL_THREAD_DETACH notification call
void OnThreadDetach()
{
if (FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex))
{
delete pData;
}
}
and DoCallout need be modified in next way
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex);
if (!pData)
{
// this code executed only once on first call
if (!(pData = new FIBER_DATA))
{
return ERROR_NO_SYSTEM_RESOURCES;
}
if (ULONG dwError = pData->Create(512*PAGE_SIZE, 4*PAGE_SIZE))// or what stack size you need
{
delete pData;
return dwError;
}
TlsSetValue(gTlsIndex, pData);
}
return pData->DoCallout(pfn, Parameter);
}
so instead allocate stack for every new thread on DLL_THREAD_ATTACH via OnAttach() much better alocate it only for threads when really need (at first call)
and this code can potential have problems with fibers, if someone else also try use fibers. say in msdn example code not check for ERROR_ALREADY_FIBER in case ConvertThreadToFiber return 0. so we can wait that this case will be incorrect handled by main application if we before it decide create fiber and it also try use fiber after us. also ERROR_ALREADY_FIBER not worked in xp (begin from vista).
so possible and another solution - yourself create thread stack, and temporary switch to it doring call which require large stack space. main need not only allocate space for stack and swap esp (or rsp) but not forget correct establish StackBase and StackLimit in NT_TIB - it is necessary and sufficient condition (otherwise exceptions and guard page extension will be not worked).
despite this alternate solution require more code (manually create thread stack and stack switch) it will be work on xp too and nothing affect in situation when somebody else also try using fibers in thread
typedef ULONG (WINAPI * MY_EXPAND_STACK_CALLOUT) (PVOID Parameter);
extern "C" PVOID __fastcall SwitchToStack(PVOID param, PVOID stack);
struct FIBER_DATA
{
PVOID _Stack, _StackLimit, _StackPtr, _StackBase;
MY_EXPAND_STACK_CALLOUT _pfn;
PVOID _Parameter;
ULONG _dwError;
static void __fastcall FiberProc(FIBER_DATA* pData, PVOID stack)
{
for (;;)
{
pData->_dwError = pData->_pfn(pData->_Parameter);
// StackLimit can changed during _pfn call
pData->_StackLimit = ((PNT_TIB)NtCurrentTeb())->StackLimit;
stack = SwitchToStack(0, stack);
}
}
ULONG Create(SIZE_T Reserve, SIZE_T Commit);
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
_pfn = pfn;
_Parameter = Parameter;
PNT_TIB tib = (PNT_TIB)NtCurrentTeb();
PVOID StackBase = tib->StackBase, StackLimit = tib->StackLimit;
tib->StackBase = _StackBase, tib->StackLimit = _StackLimit;
_StackPtr = SwitchToStack(this, _StackPtr);
tib->StackBase = StackBase, tib->StackLimit = StackLimit;
return _dwError;
}
~FIBER_DATA()
{
if (_Stack)
{
VirtualFree(_Stack, 0, MEM_RELEASE);
}
}
FIBER_DATA()
{
_Stack = 0;
}
};
ULONG FIBER_DATA::Create(SIZE_T Reserve, SIZE_T Commit)
{
Reserve = (Reserve + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
Commit = (Commit + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
if (Reserve < Commit || !Reserve)
{
return ERROR_INVALID_PARAMETER;
}
if (PBYTE newStack = (PBYTE)VirtualAlloc(0, Reserve, MEM_RESERVE, PAGE_NOACCESS))
{
union {
PBYTE newStackBase;
void** ppvStack;
};
newStackBase = newStack + Reserve;
PBYTE newStackLimit = newStackBase - Commit;
if (newStackLimit = (PBYTE)VirtualAlloc(newStackLimit, Commit, MEM_COMMIT, PAGE_READWRITE))
{
if (Reserve == Commit || VirtualAlloc(newStackLimit - PAGE_SIZE, PAGE_SIZE, MEM_COMMIT, PAGE_READWRITE|PAGE_GUARD))
{
_StackBase = newStackBase, _StackLimit = newStackLimit, _Stack = newStack;
#if defined(_M_IX86)
*--ppvStack = FiberProc;
ppvStack -= 4;// ebp,esi,edi,ebx
#elif defined(_M_AMD64)
ppvStack -= 5;// x64 space
*--ppvStack = FiberProc;
ppvStack -= 8;// r15,r14,r13,r12,rbp,rsi,rdi,rbx
#else
#error "not supported"
#endif
_StackPtr = ppvStack;
return NOERROR;
}
}
VirtualFree(newStack, 0, MEM_RELEASE);
}
return GetLastError();
}
ULONG gTlsIndex;
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex);
if (!pData)
{
// this code executed only once on first call
if (!(pData = new FIBER_DATA))
{
return ERROR_NO_SYSTEM_RESOURCES;
}
if (ULONG dwError = pData->Create(512*PAGE_SIZE, 4*PAGE_SIZE))
{
delete pData;
return dwError;
}
TlsSetValue(gTlsIndex, pData);
}
return pData->DoCallout(pfn, Parameter);
}
void OnThreadDetach()
{
if (FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex))
{
delete pData;
}
}
and assembly code for SwitchToStack : on x86
#SwitchToStack#8 proc
push ebx
push edi
push esi
push ebp
xchg esp,edx
mov eax,edx
pop ebp
pop esi
pop edi
pop ebx
ret
#SwitchToStack#8 endp
and for x64:
SwitchToStack proc
push rbx
push rdi
push rsi
push rbp
push r12
push r13
push r14
push r15
xchg rsp,rdx
mov rax,rdx
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rsi
pop rdi
pop rbx
ret
SwitchToStack endp
usage/test can be next:
gTlsIndex = TlsAlloc();//DLL_PROCESS_ATTACH
if (gTlsIndex != TLS_OUT_OF_INDEXES)
{
TestStackMemory();
DoCallout(TestCallout, "test #1");
//play with stack, excepions, guard pages
PSTR str = (PSTR)alloca(256);
DoCallout(zTestCallout, str);
DbgPrint("str=%s\n", str);
DoCallout(TestCallout, "test #2");
OnThreadDetach();//DLL_THREAD_DETACH
TlsFree(gTlsIndex);//DLL_PROCESS_DETACH
}
void TestMemory(PVOID AllocationBase)
{
MEMORY_BASIC_INFORMATION mbi;
PVOID BaseAddress = AllocationBase;
while (VirtualQuery(BaseAddress, &mbi, sizeof(mbi)) >= sizeof(mbi) && mbi.AllocationBase == AllocationBase)
{
BaseAddress = (PBYTE)mbi.BaseAddress + mbi.RegionSize;
DbgPrint("[%p, %p) %p %08x %08x\n", mbi.BaseAddress, BaseAddress, (PVOID)(mbi.RegionSize >> PAGE_SHIFT), mbi.State, mbi.Protect);
}
}
void TestStackMemory()
{
MEMORY_BASIC_INFORMATION mbi;
if (VirtualQuery(_AddressOfReturnAddress(), &mbi, sizeof(mbi)) >= sizeof(mbi))
{
TestMemory(mbi.AllocationBase);
}
}
ULONG WINAPI zTestCallout(PVOID Parameter)
{
TestStackMemory();
alloca(5*PAGE_SIZE);
TestStackMemory();
__try
{
*(int*)0=0;
}
__except(EXCEPTION_EXECUTE_HANDLER)
{
DbgPrint("exception %x handled\n", GetExceptionCode());
}
strcpy((PSTR)Parameter, "zTestCallout demo");
return NOERROR;
}
ULONG WINAPI TestCallout(PVOID param)
{
TestStackMemory();
DbgPrint("TestCallout(%s)\n", param);
return NOERROR;
}
The maximum stack size is determined when the thread is created. It cannot be modified after that time.

What is the lowest level WinSock API available in the user mode? (For API injection trampoline.)

My goal is to intercept outbound TCP packets from a custom-built application that I do not have source code to. I need to adjust several parameters in the outbound data. It is an older application that the original company no longer sells and the developer is no longer available.
So I was planning to install an API injection trampoline into the send() type raw WinSock API from my DLL that I can inject into the target process. But before writing such DLL, I decided to test this concept in my local process. So I did the following:
#ifdef _M_X64
//Simple code to install "API injection trampoline"
//Compiled as 64-bit process
static int WINAPI TestJump1(SOCKET s, const char *buf, int len, int flags);
{
//This part is just for debugging to make sure that my trampoline method is called
//The actual "working" trampoline will involve additional steps to insure that the original method is also called
::MessageBox(NULL, L"Injected method called!", L"Debugger Message", MB_OK);
return SOCKET_ERROR;
}
HMODULE hModWS2 = ::LoadLibrary(L"Ws2_32.dll");
if(hModWS2)
{
int (WINAPI *pfn_send)(SOCKET s, const char *buf, int len, int flags);
int (WINAPI *pfn_sendto)(SOCKET s, const char *buf, int len, int flags, const struct sockaddr *to, int tolen);
if(pfn_send &&
pfn_sendto)
{
//Long absolute JMP
//48 b8 xx xx xx xx xx xx xx xx mov rax, 0xxxx
//ff e0 jmp rax
BYTE subst[] = {
0x48, 0xb8,
0, 0, 0, 0,
0, 0, 0, 0,
0xff, 0xe0
};
HANDLE hProc = ::GetCurrentProcess();
VOID* pPtrAPI = pfn_send;
//Also tried with
//VOID* pPtrAPI = pfn_sendto;
//Make this address writable
DWORD dwOldProtect = 0;
if(::VirtualProtectEx(hProc, pPtrAPI, sizeof(subst), PAGE_EXECUTE_READWRITE, &dwOldProtect))
{
//Install JMP opcodes
VOID* pJumpPtr = TestJump1;
*(VOID**)(subst + 2) = pJumpPtr;
memcpy(pPtrAPI, subst, sizeof(subst));
//Reset it back
DWORD dwDummy;
if(::VirtualProtectEx(hProc, pPtrAPI, sizeof(subst), dwOldProtect, &dwDummy))
{
if(::FlushInstructionCache(hProc, pPtrAPI, sizeof(subst)))
{
//Try to call our method
//int rz = pfn_send(NULL, "", 0, 0); //This works!
//Try real test with the higher level API
//Download a web page into a file:
//This API must be calling raw sockets at some point internally...
//but my TestJump1() is never called from here...
HRESULT hr = URLDownloadToFile(NULL,
L"http://microsoft.com/",
L"C:\\Users\\User\\Desktop\\file.txt", 0, NULL);
}
}
}
}
}
#endif
So the code works fine, the JMP trampoline is installed and called alright if I call send method explicitly (as shown above) but my further assumption that a higher level API (i.e. URLDownloadToFile) would call it as well does not seem to hold true. My trampoline method is never called from it.
So what am I missing here? Is there an even lower WinSock API?
send() is not the only function available for sending TCP data using Winsock in user code. There is also:
WSASend()
WSASendDisconnect()
WSASendMsg()
TransmitFile()
TransmitPackets()
RIOSend/Ex()
At the very least, apps that don't use send() will usually use WSASend() instead, for use with Overlapped I/O or I/O Completion Ports. That is usually good enough for most situations. The other functions are not used very often, but may be used in certain situations where higher performance is really needed.
the lowest level winsock api is implemented by Winsock Service Provider Interface. during interface initialization the WSPStartup function is called from interface provider (this api must be exported by name from provider dll). for MSAFD Tcpip [TCP/IP] "{E70F1AA0-AB8B-11CF-8CA3-00805F48A192}" it implemented in mswsock.dll by default - look more here (strictly said used dll registered in HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\WinSock2\Parameters\Protocol_Catalog9\Catalog_Entries[64] but by default, this is mswsock.dll )
WSPStartup return to ws2_32 WSPPROC_TABLE - here the lowest user mode api which called. say for sendto - this is lpWSPSendTo function. so for hook all sendto you need replace lpWSPSendTo pointer in WSPPROC_TABLE to own. some api pointers returned via WSPIoctl. so you must replace lpWSPIoctl member in table to own and replace in result returned by original WSPIoctl to own api. example for RIO extension:
#include <ws2spi.h>
#include <mswsock.h>
LPWSPIOCTL g_lpWSPIoctl;
LPWSPSENDTO g_lpWSPSendTo;
LPFN_RIOSENDEX g_RIOSendEx;
int WSPAPI WSPIoctl(
__in SOCKET s,
__in DWORD dwIoControlCode,
__in LPVOID lpvInBuffer,
__in DWORD cbInBuffer,
__out LPVOID lpvOutBuffer,
__in DWORD cbOutBuffer,
__out LPDWORD lpcbBytesReturned,
__in LPWSAOVERLAPPED lpOverlapped,
__in LPWSAOVERLAPPED_COMPLETION_ROUTINE lpCompletionRoutine,
__in LPWSATHREADID lpThreadId,
__out LPINT lpErrno
)
{
int r = g_lpWSPIoctl(s, dwIoControlCode, lpvInBuffer, cbInBuffer, lpvOutBuffer, cbOutBuffer, lpcbBytesReturned, lpOverlapped, lpCompletionRoutine, lpThreadId, lpErrno);
static GUID functionTableId = WSAID_MULTIPLE_RIO;
if (
!r
&&
dwIoControlCode == SIO_GET_MULTIPLE_EXTENSION_FUNCTION_POINTER &&
cbInBuffer == sizeof(GUID) &&
cbOutBuffer >= sizeof(RIO_EXTENSION_FUNCTION_TABLE) &&
!memcmp(lpvInBuffer, &functionTableId, sizeof(GUID))
)
{
PRIO_EXTENSION_FUNCTION_TABLE priot = (PRIO_EXTENSION_FUNCTION_TABLE)lpvOutBuffer;
if (priot->cbSize >= FIELD_OFFSET(RIO_EXTENSION_FUNCTION_TABLE, RIOSendEx))
{
g_RIOSendEx = priot->RIOSendEx;// save original pointer to use
priot->RIOSendEx = RIOSendEx;// this is your hook function
}
}
return r;
}
so we need hook WSPStartup function before it will be called first time by ws2_32.dll

Recovering Detoured Library Functions

The question is fairly straight forward, what I'm trying to do is restore my process' detoured functions.
When I say detoured I mean the usual jmp instruction to an unknown location.
For example, when the ntdll.dll export NtOpenProcess() is not detoured, the first 5 bytes of the instruction of the function are along the lines of mov eax, *.
(The * offset depending on the OS version.)
When it gets detoured, that mov eax, * turns into a jmp.
What I'm trying to do is restore their bytes to what they were originally before any memory modifications.
My idea was to try and read the information I need from the disk, not from memory, however I do not know how to do that as I'm just a beginner.
Any help or explanation is greatly welcomed, if I did not explain my problem correctly please tell me!
I ended up figuring it out.
Example on NtOpenProcess.
Instead of restoring the bytes I decided to jump over them instead.
First we have to define the base of ntdll.
/* locate ntdll */
#define NTDLL _GetModuleHandleA("ntdll.dll")
Once we've done that, we're good to go. GetOffsetFromRva will calculate the offset of the file based on the address and module header passed to it.
DWORD GetOffsetFromRva(IMAGE_NT_HEADERS * nth, DWORD RVA)
{
PIMAGE_SECTION_HEADER sectionHeader = IMAGE_FIRST_SECTION(nth);
for (unsigned i = 0, sections = nth->FileHeader.NumberOfSections; i < sections; i++, sectionHeader++)
{
if (sectionHeader->VirtualAddress <= RVA)
{
if ((sectionHeader->VirtualAddress + sectionHeader->Misc.VirtualSize) > RVA)
{
RVA -= sectionHeader->VirtualAddress;
RVA += sectionHeader->PointerToRawData;
return RVA;
}
}
}
return 0;
}
We call this to get us the file offset that we need in order to find the original bytes of the function.
DWORD GetExportPhysicalAddress(HMODULE hmModule, char* szExportName)
{
if (!hmModule)
{
return 0;
}
DWORD dwModuleBaseAddress = (DWORD)hmModule;
IMAGE_DOS_HEADER* pHeaderDOS = (IMAGE_DOS_HEADER *)hmModule;
if (pHeaderDOS->e_magic != IMAGE_DOS_SIGNATURE)
{
return 0;
}
IMAGE_NT_HEADERS * pHeaderNT = (IMAGE_NT_HEADERS *)(dwModuleBaseAddress + pHeaderDOS->e_lfanew);
if (pHeaderNT->Signature != IMAGE_NT_SIGNATURE)
{
return 0;
}
/* get the export virtual address through a custom GetProcAddress function. */
void* pExportRVA = GetProcedureAddress(hmModule, szExportName);
if (pExportRVA)
{
/* convert the VA to RVA... */
DWORD dwExportRVA = (DWORD)pExportRVA - dwModuleBaseAddress;
/* get the file offset and return */
return GetOffsetFromRva(pHeaderNT, dwExportRVA);
}
return 0;
}
Using the function that gets us the file offset, we can now read the original export bytes.
size_t ReadExportFunctionBytes(HMODULE hmModule, char* szExportName, BYTE* lpBuffer, size_t t_Count)
{
/* get the offset */
DWORD dwFileOffset = GetExportPhysicalAddress(hmModule, szExportName);
if (!dwFileOffset)
{
return 0;
}
/* get the path of the targetted module */
char szModuleFilePath[MAX_PATH];
GetModuleFileNameA(hmModule, szModuleFilePath, MAX_PATH);
if (strnull(szModuleFilePath))
{
return 0;
}
/* try to open the file off the disk */
FILE *fModule = fopen(szModuleFilePath, "rb");
if (!fModule)
{
/* we couldn't open the file */
return 0;
}
/* go to the offset and read it */
fseek(fModule, dwFileOffset, SEEK_SET);
size_t t_Read = 0;
if ((t_Read = fread(lpBuffer, t_Count, 1, fModule)) == 0)
{
/* we didn't read anything */
return 0;
}
/* close file and return */
fclose(fModule);
return t_Read;
}
And we can retrieve the syscall index from the mov instruction originally placed in the first 5 bytes of the export on x86.
DWORD GetSyscallIndex(char* szFunctionName)
{
BYTE buffer[5];
ReadExportFunctionBytes(NTDLL, szFunctionName, buffer, 5);
if (!buffer)
{
return 0;
}
return BytesToDword(buffer + 1);
}
Get the NtOpenProcess address and add 5 to trampoline over it.
DWORD _ptrNtOpenProcess = (DWORD) GetProcAddress(NTDLL, "NtOpenProcess") + 5;
DWORD _oNtOpenProcess = GetSyscallIndex("NtOpenProcess");
The recovered/reconstructed NtOpenProcess.
__declspec(naked) NTSTATUS NTAPI _NtOpenProcess
(
_Out_ PHANDLE ProcessHandle,
_In_ ACCESS_MASK DesiredAccess,
_In_ POBJECT_ATTRIBUTES ObjectAttributes,
_In_opt_ PCLIENT_ID ClientId
) {
__asm
{
mov eax, [_oNtOpenProcess]
jmp dword ptr ds : [_ptrNtOpenProcess]
}
}
Let's call it.
int main()
{
printf("NtOpenProcess %x index: %x\n", _ptrNtOpenProcess, _oNtOpenProcess);
uint32_t pId = 0;
do
{
pId = GetProcessByName("notepad.exe");
Sleep(200);
} while (pId == 0);
OBJECT_ATTRIBUTES oa;
CLIENT_ID cid;
cid.UniqueProcess = (HANDLE)pId;
cid.UniqueThread = 0;
InitializeObjectAttributes(&oa, NULL, 0, NULL, NULL);
HANDLE hProcess;
NTSTATUS ntStat;
ntStat = _NtOpenProcess(&hProcess, PROCESS_ALL_ACCESS, &oa, &cid);
if (!NT_SUCCESS(ntStat))
{
printf("Couldn't open the process. NTSTATUS: %d", ntStat);
return 0;
}
printf("Successfully opened the process.");
/* clean up. */
NtClose(hProcess);
getchar();
return 0;
}

Modifying the stack on Windows, TIB and exceptions

The origin of my question effectively stems from wanting to provide an implementation of pthreads on Windows which supports user provide stacks. Specifically, pthread_attr_setstack should do something meaningful. My actual requirements are a bit more involved than this but this is good enough for the purpose of the post.
There are no public Win APIs for providing a stack in either the Fiber or Thread APIs. I've searched around for sneaky backdoors, workarounds and hacks, there's nothing going. In fact, I looked that the winpthread source for inspiration and that ignores any stack provided to pthread_attr_setstack.
Instead I tried the following "solution" to see if it would work. I create a Fiber using the usual combination of ConvertThreadToFiber, CreateFiberEx and SwitchToFiber. In CreateFiberEx I provide a minimal stack size. In the entry point of the fibre I then allocate memory for a stack, change the TIB fields: "Stack Base" and "Stack Limit" appropriately (see here: http://en.wikipedia.org/wiki/Win32_Thread_Information_Block) and then set ESP to the high address of my stack.
(In a real world case I would setup the stack better than this and change EIP as well so that this step behaves more like the posix funciton swapcontext, but you get the idea).
If I make any OS calls when on this different stack then I'm pretty much screwed (printf for example dies). However this isn't an issue for me. I can ensure that I never make sure calls when on my custom stack (hence why I said my actual requirements are a bit more involved). Except...I need exceptions to work. And they don't! Specifically, if I try to throw and catch an exception on my modified stack then I get an assert
Unhandled exception at 0xXXXXXXXX ....
So my (vague) question is, does anyone have any insight as to how exceptions and a custom stack might not be playing nicely together? I appreciate that this is totally unsupported and can happily except nil response or "go away". In fact, I've pretty much decided that I need a different solution and, despite this involving compromise, I'm likely to use one. However, curiosity gets the better of me so I'd like to know why this doesn't work.
On a related note, I wondered how Cygwin dealt with this for ucontext. The source here http://szupervigyor.ddsi.hu/source/in/openjdk-6-6b18-1.8.13/cacao-0.99.4/src/vm/jit/i386/cygwin/ucontext.c uses GetThreadContext/SetThreadContext to implement ucontext. However, from experimentation I see that this also fails when an exception is thrown from inside a new context. In fact the SetThreadContext call doesn't even update the TIB block!
EDIT (based on the answer from #avakar)
The following code, which is very similar to yours, demonstrates the same failure. The difference is that I don't start the second thread suspended but suspend it then try to change context. This code exhibits the error I was describing when the try-catch block is hit in foo. Perhaps this simply isn't legal. One notable thing is that in this situation the ExceptionList member of the TIB is a valid pointer when modifyThreadContext is called, whereas in your example it's -1. Manually editing this doesn't help.
As mentioned in my comment to your answer. This isn't precisely what I need. I would like to switch contexts from the thread I'm current on. However, the docs for SetThreadContext warn not to call this on an active thread. So I'm guessing that if the below code doesn't work then I have no chance of making it work on a single thread.
namespace
{
HANDLE ghSemaphore = 0;
void foo()
{
try
{
throw 6;
}
catch(...){}
ExitThread(0);
}
void modifyThreadContext(HANDLE thread)
{
typedef NTSTATUS WINAPI NtQueryInformationThread_t(HANDLE ThreadHandle, DWORD ThreadInformationClass, PVOID ThreadInformation, ULONG ThreadInformationLength, PULONG ReturnLength);
HMODULE hNtdll = LoadLibraryW(L"ntdll.dll");
auto NtQueryInformationThread = (NtQueryInformationThread_t *)GetProcAddress(hNtdll, "NtQueryInformationThread");
DWORD stackSize = 1024 * 1024;
void * mystack = VirtualAlloc(0, stackSize, MEM_COMMIT, PAGE_READWRITE);
DWORD threadInfo[7];
NtQueryInformationThread(thread, 0, threadInfo, sizeof threadInfo, 0);
NT_TIB * tib = (NT_TIB *)threadInfo[1];
CONTEXT ctx = {};
ctx.ContextFlags = CONTEXT_ALL;
GetThreadContext(thread, &ctx);
ctx.Esp = (DWORD)mystack + stackSize - ((DWORD)tib->StackBase - ctx.Esp);
ctx.Eip = (DWORD)&foo;
tib->StackBase = (PVOID)((DWORD)mystack + stackSize);
tib->StackLimit = (PVOID)((DWORD)mystack);
SetThreadContext(thread, &ctx);
}
DWORD CALLBACK threadMain(LPVOID)
{
ReleaseSemaphore(ghSemaphore, 1, NULL);
while (1)
Sleep(10000);
// Never gets here
return 1;
}
} // namespace
int main()
{
ghSemaphore = CreateSemaphore(NULL, 0, 1, NULL);
HANDLE th = CreateThread(0, 0, threadMain, 0, 0, 0);
while (WaitForSingleObject(ghSemaphore, INFINITE) != WAIT_OBJECT_0);
SuspendThread(th);
modifyThreadContext(th);
ResumeThread(th);
while (WaitForSingleObject(th, 10) != WAIT_OBJECT_0);
return 0;
}
Both exceptions and printf work for me, and I don't see why they shouldn't. If you post your code, we can try to pinpoint what's going on.
#include <windows.h>
#include <stdio.h>
DWORD CALLBACK ThreadProc(LPVOID)
{
try
{
throw 1;
}
catch (int i)
{
printf("%d\n", i);
}
return 0;
}
typedef NTSTATUS WINAPI NtQueryInformationThread_t(HANDLE ThreadHandle, DWORD ThreadInformationClass, PVOID ThreadInformation, ULONG ThreadInformationLength, PULONG ReturnLength);
int main()
{
HMODULE hNtdll = LoadLibraryW(L"ntdll.dll");
auto NtQueryInformationThread = (NtQueryInformationThread_t *)GetProcAddress(hNtdll, "NtQueryInformationThread");
DWORD stackSize = 1024 * 1024;
void * mystack = VirtualAlloc(0, stackSize, MEM_COMMIT, PAGE_READWRITE);
DWORD dwThreadId;
HANDLE hThread = CreateThread(0, 0, &ThreadProc, 0, CREATE_SUSPENDED, &dwThreadId);
DWORD threadInfo[7];
NtQueryInformationThread(hThread, 0, threadInfo, sizeof threadInfo, 0);
NT_TIB * tib = (NT_TIB *)threadInfo[1];
CONTEXT ctx = {};
ctx.ContextFlags = CONTEXT_ALL;
GetThreadContext(hThread, &ctx);
ctx.Esp = (DWORD)mystack + stackSize - ((DWORD)tib->StackBase - ctx.Esp);
tib->StackBase = (PVOID)((DWORD)mystack + stackSize);
tib->StackLimit = (PVOID)((DWORD)mystack);
SetThreadContext(hThread, &ctx);
ResumeThread(hThread);
WaitForSingleObject(hThread, INFINITE);
}

Execute BYTE array in detoured function

I know it's a long post but it's mostly code and pictures, it's a quick read! First of all, here is what I'm trying to do:
I'm trying to execute a BYTE array in a detoured function in order to go back to the original code as if I didn't detour anyhting Here is my code:
DllMain (DetourAddress is all that matter):
BOOL APIENTRY DllMain( HANDLE hModule, DWORD ul_reason_for_call, LPVOID lpReserved )
{
switch (ul_reason_for_call)
{
case DLL_PROCESS_ATTACH:
AllocConsole();
freopen("CONOUT$", "w", stdout);
DetourAddress((void*)HookAddress, (void*)&DetourFunc);
case DLL_PROCESS_DETACH:
FreeConsole();
break;
}
return TRUE;
}
DetourAddress (code is self-explanatory, I think):
void DetourAddress(void* funcPtr, void* hook)
{
// write jmp
BYTE cmd[5] =
{
0xE9, //jmp
0x00, 0x00, 0x00, 0x00 //address
};
// make memory readable/writable
DWORD dwProtect;
VirtualProtect(funcPtr, 5, PAGE_EXECUTE_READWRITE, &dwProtect);
// read bytes about to be replaced
ReadProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, mem, 5, NULL);
// write jmp in cmd
DWORD offset = ((DWORD)hook - (DWORD)funcPtr - 5); // (dest address) - (source address) - (jmp size)
memcpy(&cmd[1], &offset, 4); // write address into jmp
WriteProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, cmd, 5, 0); // write jmp
// reprotect
VirtualProtect(funcPtr, 5, dwProtect, NULL);
}
DetourFunc:
_declspec(naked) void DetourFunc()
{
__asm
{
PUSHFD
PUSHAD
}
printf("function detoured\n");
__asm
{
POPAD
POPFD
}
// make memory readable/writable
DWORD dwProtect;
VirtualProtect(mem, 6, PAGE_EXECUTE_READWRITE, &dwProtect);
pByteExe();
// reprotect
VirtualProtect(mem, 6, dwProtect, NULL);
__asm
{
jmp HookReturnAddress
}
}
And finaly the global variables, typedef for pByteExe() and includes:
#include <Windows.h>
#include <cstdio>
DWORD HookAddress = 0x08B1418,
HookReturnAddress = HookAddress+5;
typedef void ( * pFunc)();
BYTE mem[6] = { 0x0, 0x0, 0x0, 0x0, 0x0, 0xC3 };
pFunc pByteExe = (pFunc) &mem
As you can see in DetourFunc, I'm trying to execute my byte array (mem) directly. Using OllyDbg, this gets me there:
Which is exactly the bytes I'm trying to execute. Only problem is that it gives me an Access violation error when executing... Any idea why? I would have thought "VirtualProtect(mem, 5, PAGE_EXECUTE_READWRITE, &dwProtect);" would have made it safe to access... Thanks for your help!
EDIT: I just realized something wierd was happening... when I "Step into" with ollydbg, the mem instructions are correct, but as soon as I scroll a little, they change back to this:
Any idea why?
You've forgot the module offset...
DWORD module = (DWORD)GetModuleHandle(NULL);
DWORD real_address = module + (DWORD)ADDRESS;
ADDRESS have to of course relative to your module. (The module offset isn't allways the same)
And btw. why you take WriteProcessMemory, when you inject your DLL? A simple memcpy is enought...