VEH hook acceleration - c++

I am trying to accelerate the speed of VEH hook. Veh hook class that I found from https://github.com/hoangprod/LeoSpecial-VEH-Hook/blob/master/LeoSpecial.h
#pragma once
#include <Windows.h>
#include <stdio.h>
#include <iostream>
#ifdef _WIN64
#define XIP Rip
#else
#define XIP Eip
#endif
class LeoHook {
public:
static bool Hook(uintptr_t og_fun, uintptr_t hk_fun);
static bool Unhook();
private:
static uintptr_t og_fun;
static uintptr_t hk_fun;
static PVOID VEH_Handle;
static DWORD oldProtection;
static bool AreInSamePage(const uint8_t* Addr1, const uint8_t* Addr2);
static LONG WINAPI LeoHandler(EXCEPTION_POINTERS *pExceptionInfo);
};
uintptr_t LeoHook::og_fun = 0;
uintptr_t LeoHook::hk_fun = 0;
PVOID LeoHook::VEH_Handle = nullptr;
DWORD LeoHook::oldProtection = 0;
bool LeoHook::Hook(uintptr_t original_fun, uintptr_t hooked_fun)
{
LeoHook::og_fun = original_fun;
LeoHook::hk_fun = hooked_fun;
//We cannot hook two functions in the same page, because we will cause an infinite callback
if (AreInSamePage((const uint8_t*)og_fun, (const uint8_t*)hk_fun))
return false;
//Register the Custom Exception Handler
VEH_Handle = AddVectoredExceptionHandler(true, (PVECTORED_EXCEPTION_HANDLER)LeoHandler);
//Toggle PAGE_GUARD flag on the page
if(VEH_Handle && VirtualProtect((LPVOID)og_fun, 1, PAGE_EXECUTE_READ | PAGE_GUARD, &oldProtection))
return true;
return false;
}
bool LeoHook::Unhook()
{
DWORD old;
if (VEH_Handle && //Make sure we have a valid Handle to the registered VEH
VirtualProtect((LPVOID)og_fun, 1, oldProtection, &old) && //Restore old Flags
RemoveVectoredExceptionHandler(VEH_Handle)) //Remove the VEH
return true;
return false;
}
LONG WINAPI LeoHook::LeoHandler(EXCEPTION_POINTERS *pExceptionInfo)
{
if (pExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) //We will catch PAGE_GUARD Violation
{
if (pExceptionInfo->ContextRecord->XIP == (uintptr_t)og_fun) //Make sure we are at the address we want within the page
{
pExceptionInfo->ContextRecord->XIP = (uintptr_t)hk_fun; //Modify EIP/RIP to where we want to jump to instead of the original function
}
pExceptionInfo->ContextRecord->EFlags |= 0x100; //Will trigger an STATUS_SINGLE_STEP exception right after the next instruction get executed. In short, we come right back into this exception handler 1 instruction later
return EXCEPTION_CONTINUE_EXECUTION; //Continue to next instruction
}
if (pExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) //We will also catch STATUS_SINGLE_STEP, meaning we just had a PAGE_GUARD violation
{
DWORD dwOld;
VirtualProtect((LPVOID)og_fun, 1, PAGE_EXECUTE_READ | PAGE_GUARD, &dwOld); //Reapply the PAGE_GUARD flag because everytime it is triggered, it get removes
return EXCEPTION_CONTINUE_EXECUTION; //Continue the next instruction
}
return EXCEPTION_CONTINUE_SEARCH; //Keep going down the exception handling list to find the right handler IF it is not PAGE_GUARD nor SINGLE_STEP
}
bool LeoHook::AreInSamePage(const uint8_t* Addr1, const uint8_t* Addr2)
{
MEMORY_BASIC_INFORMATION mbi1;
if (!VirtualQuery(Addr1, &mbi1, sizeof(mbi1))) //Get Page information for Addr1
return true;
MEMORY_BASIC_INFORMATION mbi2;
if (!VirtualQuery(Addr2, &mbi2, sizeof(mbi2))) //Get Page information for Addr1
return true;
if (mbi1.BaseAddress == mbi2.BaseAddress) //See if the two pages start at the same Base Address
return true; //Both addresses are in the same page, abort hooking!
return false;
}
The hook is working fine without any problem. But it cost me a lot of FPS drop from a process that I injected. As I googled the most answer is saying reduce the amount of logic that have been use in hook. My hook method
__declspec(naked) void HookSend()
{
__asm
{
jmp dwAddress
}
}
As you see it doesn't have much logic that include in a function. But it still cost me a lot of fps.
The research number two is saying to copied all page function then modify the ASM of page function. But I have no idea how to copy a page function. So I need some guide how to copy a page function or another method to accelerate the VEH hook.
Thank you.

Related

Trying to understand the asynchrouny with a big amount of different function calls

I've started learing an asynchronous aproach, and
encountered a problem, help me with it.
The purpose is: get from somewhere a char data, and after that do something with it(using as text on the button, in my case). The code, that is pinned below is very slow. The most slowiest moment is a data getting: the fact is that the get(int id) function loads data from internet via WinInet(synchronously), sending the Post methods, and returning the answer.
void some_func()
{
for(int i(0);i<10;i++)
for(int q(0);q<5;q++)
{
char data[100];
strcpy(data, get(i,q)); // i, q - just some identifier data
button[5*i+(q+1)]=new Button(data);
}
}
The first question:
How should it be solved(generaly, I mean, if get has nothing to do with the internet, but runs slow)? I have only one, stupid idea: run get in every separate thread. If it's the right way - how should I do that? Cause, it's wrong to, created 50 threads call from each the get function. 50 get functions?
Second Question
How to realize it with WinInet? Have red MSDN, but it too hardly for me, as for newer, maybe you explain it more simlier?
Thanks
for asynchronous programming you need create some object which will be maintain state - in current case i and q must be not local variables of function but members of object, mandatory reference count to object. and usual file(socket) handle , etc.
function some_func() must have another pattern. it must be member function of object. and it must not call asynchronous get in loop. after call get it must just exit. when asynchronous operation, initiated by get, will be finished - some your callback must be called (if failed initiate asynchronous operation you need yourself just call this callback with error code). in callback you will be have pointer to your object and using it - call some_func(). so some_func() must at begin handle result of previous get call - check for error, handle received data, if no error. than adjust object state (in your case i and q) and if need - call get again. and for initiated all this - need first time call get direct:
begin -> get() -> .. callback .. -> some_func() -> exit
^ ┬
└─────────────────────────────┘
some demo example (with asynchronous read file)
struct SOME_OBJECT
{
LARGE_INTEGER _ByteOffset;
HANDLE _hFile;
LONG _dwRef;
int _i, _q;
SOME_OBJECT()
{
_i = 0, _q = 0;
_dwRef = 1;
_ByteOffset.QuadPart = 0;
_hFile = 0;
}
void beginGet();
void DoSomething(PVOID pvData, DWORD_PTR cbData)
{
DbgPrint("DoSomething<%u,%u>(%x, %p)\n", _i, _q, cbData, pvData);
}
// some_func
void OnComplete(DWORD dwErrorCode, PVOID pvData, DWORD_PTR cbData)
{
if (dwErrorCode == NOERROR)
{
DoSomething(pvData, cbData);
if (++_q == 5)
{
_q = 0;
if (++_i == 10)
{
return ;
}
}
_ByteOffset.QuadPart += cbData;
beginGet();
}
else
{
DbgPrint("OnComplete - error=%u\n", dwErrorCode);
}
}
~SOME_OBJECT()
{
if (_hFile) CloseHandle(_hFile);
}
void AddRef() { InterlockedIncrement(&_dwRef); }
void Release() { if (!InterlockedDecrement(&_dwRef)) delete this; }
ULONG Create(PCWSTR FileName);
};
struct OPERATION_CTX : OVERLAPPED
{
SOME_OBJECT* _pObj;
BYTE _buf[];
OPERATION_CTX(SOME_OBJECT* pObj) : _pObj(pObj)
{
pObj->AddRef();
hEvent = 0;
}
~OPERATION_CTX()
{
_pObj->Release();
}
VOID CALLBACK CompletionRoutine(DWORD dwErrorCode, DWORD_PTR dwNumberOfBytesTransfered)
{
_pObj->OnComplete(dwErrorCode, _buf, dwNumberOfBytesTransfered);
delete this;
}
static VOID CALLBACK _CompletionRoutine(DWORD dwErrorCode, DWORD dwNumberOfBytesTransfered, OVERLAPPED* lpOverlapped)
{
static_cast<OPERATION_CTX*>(lpOverlapped)->CompletionRoutine(RtlNtStatusToDosError(dwErrorCode), dwNumberOfBytesTransfered);
}
void CheckResult(BOOL fOk)
{
if (!fOk)
{
ULONG dwErrorCode = GetLastError();
if (dwErrorCode != ERROR_IO_PENDING)
{
CompletionRoutine(dwErrorCode, 0);
}
}
}
void* operator new(size_t cb, size_t ex)
{
return ::operator new(cb + ex);
}
void operator delete(PVOID pv)
{
::operator delete(pv);
}
};
ULONG SOME_OBJECT::Create(PCWSTR FileName)
{
HANDLE hFile = CreateFile(FileName, FILE_READ_DATA, FILE_SHARE_READ, 0,
OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0);
if (hFile != INVALID_HANDLE_VALUE)
{
_hFile = hFile;
if (BindIoCompletionCallback(hFile, OPERATION_CTX::_CompletionRoutine, 0))
{
return NOERROR;
}
}
return GetLastError();
}
void SOME_OBJECT::beginGet()
{
const ULONG cbRead = 0x1000;
if (OPERATION_CTX* ctx = new(cbRead) OPERATION_CTX(this))
{
ctx->Offset = _ByteOffset.LowPart;
ctx->OffsetHigh = _ByteOffset.HighPart;
ctx->CheckResult(ReadFile(_hFile, ctx->_buf, cbRead, 0, ctx));
}
}
void ADemo(PCWSTR FileName)
{
if (SOME_OBJECT* pObj = new SOME_OBJECT)
{
if (!pObj->Create(FileName))
{
pObj->beginGet();
}
pObj->Release();
}
}

Set stack size programmatically on Windows

Is it possible in WinAPI to set stack size for the current thread at runtime like setrlimit does on Linux?
I mean to increase the reserved stack size for the current thread if it is too small for the current requirements.
This is in a library that may be called by threads from other programming languages, so it's not an option to set stack size at compile time.
If not, any ideas about a solution like an assembly trampoline that changes the stack pointer to a dynamically allocated memory block?
FAQ: Proxy thread is a surefire solution (unless the caller thread has extremely small stack). However, thread switching seems a performance killer. I need substantial amount of stack for recursion or for _alloca. This is also for performance, because heap allocation is slow, especially if multiple threads allocate from heap in parallel (they get blocked by the same libc/CRT mutex, so the code becomes serial).
you can not full swap stack in current thread (allocate self, delete old) in library code because in old stack - return addresses, may be pointers to variables in stack, etc.
and you can not expand stack (virtual memory for it already allocated (reserved/commit) and not expandable.
however possible allocate temporary stack and switch to this stack during call. you must in this case save old StackBase and StackLimit from NT_TIB (look this structure in winnt.h), set new values (you need allocate memory for new stack), do call (for switch stack you need some assembly code - you can not do this only on c/c++) and return original StackBase and StackLimit. in kernelmode exist support for this - KeExpandKernelStackAndCallout
however in user mode exist Fibers - this is very rare used, but look like perfectly match to task. with Fiber we can create additional stack/execution context inside current thread.
so in general solution is next (for library):
on DLL_THREAD_ATTACH :
convert thread to fiber
(ConvertThreadToFiber) (if it return false check also
GetLastError for ERROR_ALREADY_FIBER - this is also ok code)
and create own Fiber by call CreateFiberEx
we do this only once. than, every time when your procedure is called, which require large stack space:
remember the current fiber by call GetCurrentFiber
setup task for your fiber
switch to your fiber by call SwitchToFiber
call procedure inside fiber
return to original fiber (saved from call GetCurrentFiber)
again by SwitchToFiber
and finally on DLL_THREAD_DETACH you need:
delete your fiber by DeleteFiber
convert fiber to thread by call ConvertFiberToThread but only
in case initial ConvertThreadToFiber return true (if was
ERROR_ALREADY_FIBER- let who first convert thread to fiber convert
it back - this is not your task in this case)
you need some (usual small) data associated with your fiber / thread. this must be of course per thread variable. so you need use __declspec(thread) for declare this data. or direct use TLS (or which modern c++ features exist for this)
demo implementation is next:
typedef ULONG (WINAPI * MY_EXPAND_STACK_CALLOUT) (PVOID Parameter);
class FIBER_DATA
{
public:
PVOID _PrevFiber, _MyFiber;
MY_EXPAND_STACK_CALLOUT _pfn;
PVOID _Parameter;
ULONG _dwError;
BOOL _bConvertToThread;
static VOID CALLBACK _FiberProc( PVOID lpParameter)
{
reinterpret_cast<FIBER_DATA*>(lpParameter)->FiberProc();
}
VOID FiberProc()
{
for (;;)
{
_dwError = _pfn(_Parameter);
SwitchToFiber(_PrevFiber);
}
}
public:
~FIBER_DATA()
{
if (_MyFiber)
{
DeleteFiber(_MyFiber);
}
if (_bConvertToThread)
{
ConvertFiberToThread();
}
}
FIBER_DATA()
{
_bConvertToThread = FALSE, _MyFiber = 0;
}
ULONG Create(SIZE_T dwStackCommitSize, SIZE_T dwStackReserveSize);
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
_PrevFiber = GetCurrentFiber();
_pfn = pfn;
_Parameter = Parameter;
SwitchToFiber(_MyFiber);
return _dwError;
}
};
__declspec(thread) FIBER_DATA* g_pData;
ULONG FIBER_DATA::Create(SIZE_T dwStackCommitSize, SIZE_T dwStackReserveSize)
{
if (ConvertThreadToFiber(this))
{
_bConvertToThread = TRUE;
}
else
{
ULONG dwError = GetLastError();
if (dwError != ERROR_ALREADY_FIBER)
{
return dwError;
}
}
return (_MyFiber = CreateFiberEx(dwStackCommitSize, dwStackReserveSize, 0, _FiberProc, this)) ? NOERROR : GetLastError();
}
void OnDetach()
{
if (FIBER_DATA* pData = g_pData)
{
delete pData;
}
}
ULONG OnAttach()
{
if (FIBER_DATA* pData = new FIBER_DATA)
{
if (ULONG dwError = pData->Create(2*PAGE_SIZE, 512 * PAGE_SIZE))
{
delete pData;
return dwError;
}
g_pData = pData;
return NOERROR;
}
return ERROR_NO_SYSTEM_RESOURCES;
}
ULONG WINAPI TestCallout(PVOID param)
{
DbgPrint("TestCallout(%s)\n", param);
return NOERROR;
}
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
if (FIBER_DATA* pData = g_pData)
{
return pData->DoCallout(pfn, Parameter);
}
return ERROR_GEN_FAILURE;
}
if (!OnAttach())//DLL_THREAD_ATTACH
{
DoCallout(TestCallout, "Demo Task #1");
DoCallout(TestCallout, "Demo Task #2");
OnDetach();//DLL_THREAD_DETACH
}
also note that all fibers executed in single thread context - multiple fibers associated with thread can not execute in concurrent - only sequential, and you yourself control switch time. so not need any additional synchronization. and SwitchToFiber - this is complete user mode proc. which executed very fast, never fail (because never allocate any resources)
update
despite use __declspec(thread) FIBER_DATA* g_pData; more simply (less code), better for implementation direct use TlsGetValue / TlsSetValue and allocate FIBER_DATA on first call inside thread, but not for all threads. also __declspec(thread) not correct worked (not worked at all) in XP for dll. so some modification can be
at DLL_PROCESS_ATTACH allocate your TLS slot gTlsIndex = TlsAlloc();
and free it on DLL_PROCESS_DETACH
if (gTlsIndex != TLS_OUT_OF_INDEXES) TlsFree(gTlsIndex);
on every DLL_THREAD_DETACH notification call
void OnThreadDetach()
{
if (FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex))
{
delete pData;
}
}
and DoCallout need be modified in next way
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex);
if (!pData)
{
// this code executed only once on first call
if (!(pData = new FIBER_DATA))
{
return ERROR_NO_SYSTEM_RESOURCES;
}
if (ULONG dwError = pData->Create(512*PAGE_SIZE, 4*PAGE_SIZE))// or what stack size you need
{
delete pData;
return dwError;
}
TlsSetValue(gTlsIndex, pData);
}
return pData->DoCallout(pfn, Parameter);
}
so instead allocate stack for every new thread on DLL_THREAD_ATTACH via OnAttach() much better alocate it only for threads when really need (at first call)
and this code can potential have problems with fibers, if someone else also try use fibers. say in msdn example code not check for ERROR_ALREADY_FIBER in case ConvertThreadToFiber return 0. so we can wait that this case will be incorrect handled by main application if we before it decide create fiber and it also try use fiber after us. also ERROR_ALREADY_FIBER not worked in xp (begin from vista).
so possible and another solution - yourself create thread stack, and temporary switch to it doring call which require large stack space. main need not only allocate space for stack and swap esp (or rsp) but not forget correct establish StackBase and StackLimit in NT_TIB - it is necessary and sufficient condition (otherwise exceptions and guard page extension will be not worked).
despite this alternate solution require more code (manually create thread stack and stack switch) it will be work on xp too and nothing affect in situation when somebody else also try using fibers in thread
typedef ULONG (WINAPI * MY_EXPAND_STACK_CALLOUT) (PVOID Parameter);
extern "C" PVOID __fastcall SwitchToStack(PVOID param, PVOID stack);
struct FIBER_DATA
{
PVOID _Stack, _StackLimit, _StackPtr, _StackBase;
MY_EXPAND_STACK_CALLOUT _pfn;
PVOID _Parameter;
ULONG _dwError;
static void __fastcall FiberProc(FIBER_DATA* pData, PVOID stack)
{
for (;;)
{
pData->_dwError = pData->_pfn(pData->_Parameter);
// StackLimit can changed during _pfn call
pData->_StackLimit = ((PNT_TIB)NtCurrentTeb())->StackLimit;
stack = SwitchToStack(0, stack);
}
}
ULONG Create(SIZE_T Reserve, SIZE_T Commit);
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
_pfn = pfn;
_Parameter = Parameter;
PNT_TIB tib = (PNT_TIB)NtCurrentTeb();
PVOID StackBase = tib->StackBase, StackLimit = tib->StackLimit;
tib->StackBase = _StackBase, tib->StackLimit = _StackLimit;
_StackPtr = SwitchToStack(this, _StackPtr);
tib->StackBase = StackBase, tib->StackLimit = StackLimit;
return _dwError;
}
~FIBER_DATA()
{
if (_Stack)
{
VirtualFree(_Stack, 0, MEM_RELEASE);
}
}
FIBER_DATA()
{
_Stack = 0;
}
};
ULONG FIBER_DATA::Create(SIZE_T Reserve, SIZE_T Commit)
{
Reserve = (Reserve + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
Commit = (Commit + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
if (Reserve < Commit || !Reserve)
{
return ERROR_INVALID_PARAMETER;
}
if (PBYTE newStack = (PBYTE)VirtualAlloc(0, Reserve, MEM_RESERVE, PAGE_NOACCESS))
{
union {
PBYTE newStackBase;
void** ppvStack;
};
newStackBase = newStack + Reserve;
PBYTE newStackLimit = newStackBase - Commit;
if (newStackLimit = (PBYTE)VirtualAlloc(newStackLimit, Commit, MEM_COMMIT, PAGE_READWRITE))
{
if (Reserve == Commit || VirtualAlloc(newStackLimit - PAGE_SIZE, PAGE_SIZE, MEM_COMMIT, PAGE_READWRITE|PAGE_GUARD))
{
_StackBase = newStackBase, _StackLimit = newStackLimit, _Stack = newStack;
#if defined(_M_IX86)
*--ppvStack = FiberProc;
ppvStack -= 4;// ebp,esi,edi,ebx
#elif defined(_M_AMD64)
ppvStack -= 5;// x64 space
*--ppvStack = FiberProc;
ppvStack -= 8;// r15,r14,r13,r12,rbp,rsi,rdi,rbx
#else
#error "not supported"
#endif
_StackPtr = ppvStack;
return NOERROR;
}
}
VirtualFree(newStack, 0, MEM_RELEASE);
}
return GetLastError();
}
ULONG gTlsIndex;
ULONG DoCallout(MY_EXPAND_STACK_CALLOUT pfn, PVOID Parameter)
{
FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex);
if (!pData)
{
// this code executed only once on first call
if (!(pData = new FIBER_DATA))
{
return ERROR_NO_SYSTEM_RESOURCES;
}
if (ULONG dwError = pData->Create(512*PAGE_SIZE, 4*PAGE_SIZE))
{
delete pData;
return dwError;
}
TlsSetValue(gTlsIndex, pData);
}
return pData->DoCallout(pfn, Parameter);
}
void OnThreadDetach()
{
if (FIBER_DATA* pData = (FIBER_DATA*)TlsGetValue(gTlsIndex))
{
delete pData;
}
}
and assembly code for SwitchToStack : on x86
#SwitchToStack#8 proc
push ebx
push edi
push esi
push ebp
xchg esp,edx
mov eax,edx
pop ebp
pop esi
pop edi
pop ebx
ret
#SwitchToStack#8 endp
and for x64:
SwitchToStack proc
push rbx
push rdi
push rsi
push rbp
push r12
push r13
push r14
push r15
xchg rsp,rdx
mov rax,rdx
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rsi
pop rdi
pop rbx
ret
SwitchToStack endp
usage/test can be next:
gTlsIndex = TlsAlloc();//DLL_PROCESS_ATTACH
if (gTlsIndex != TLS_OUT_OF_INDEXES)
{
TestStackMemory();
DoCallout(TestCallout, "test #1");
//play with stack, excepions, guard pages
PSTR str = (PSTR)alloca(256);
DoCallout(zTestCallout, str);
DbgPrint("str=%s\n", str);
DoCallout(TestCallout, "test #2");
OnThreadDetach();//DLL_THREAD_DETACH
TlsFree(gTlsIndex);//DLL_PROCESS_DETACH
}
void TestMemory(PVOID AllocationBase)
{
MEMORY_BASIC_INFORMATION mbi;
PVOID BaseAddress = AllocationBase;
while (VirtualQuery(BaseAddress, &mbi, sizeof(mbi)) >= sizeof(mbi) && mbi.AllocationBase == AllocationBase)
{
BaseAddress = (PBYTE)mbi.BaseAddress + mbi.RegionSize;
DbgPrint("[%p, %p) %p %08x %08x\n", mbi.BaseAddress, BaseAddress, (PVOID)(mbi.RegionSize >> PAGE_SHIFT), mbi.State, mbi.Protect);
}
}
void TestStackMemory()
{
MEMORY_BASIC_INFORMATION mbi;
if (VirtualQuery(_AddressOfReturnAddress(), &mbi, sizeof(mbi)) >= sizeof(mbi))
{
TestMemory(mbi.AllocationBase);
}
}
ULONG WINAPI zTestCallout(PVOID Parameter)
{
TestStackMemory();
alloca(5*PAGE_SIZE);
TestStackMemory();
__try
{
*(int*)0=0;
}
__except(EXCEPTION_EXECUTE_HANDLER)
{
DbgPrint("exception %x handled\n", GetExceptionCode());
}
strcpy((PSTR)Parameter, "zTestCallout demo");
return NOERROR;
}
ULONG WINAPI TestCallout(PVOID param)
{
TestStackMemory();
DbgPrint("TestCallout(%s)\n", param);
return NOERROR;
}
The maximum stack size is determined when the thread is created. It cannot be modified after that time.

How to get thread state (e.g. suspended), memory + CPU usage, start time, priority, etc

How do I get the information if a thread has been suspended with SuspendThread(). There is no API that gives this information. The toolhelp snapshot API is very limited.
There is a lot of misleading information in internet and also on StackOverflow. Some people on StackOverflow even say that this is not possible.
Others post a solution that requires Windows 7. But I need the code to work on XP.
I found the answer myself.
I wrote a class cProcInfo that obtains a lot of information about processes and threads like:
Process and Thread identifiers
Process Parent Identifier
Process Name
Priority
Context Switches
Address
State (running, waiting, suspended, etc..)
Date and Time when process and thread were started
Time spent in Kernel mode
Time spent in User mode
Memory usage
Handle count
Page Faults
My class works on Windows 2000, XP, Vista, 7, 8...
The following code shows how to determine if the thread 640 in the process 1948 is supended:
Main()
{
cProcInfo i_Proc;
DWORD u32_Error = i_Proc.Capture();
if (u32_Error)
{
printf("Error 0x%X capturing processes.\n", u32_Error);
return 0;
}
SYSTEM_PROCESS* pk_Proc = i_Proc.FindProcessByPid(1948);
if (!pk_Proc)
{
printf("The process does not exist.\n");
return 0;
}
SYSTEM_THREAD* pk_Thread = i_Proc.FindThreadByTid(pk_Proc, 640);
if (!pk_Thread)
{
printf("The thread does not exist.\n");
return 0;
}
BOOL b_Suspend;
i_Proc.IsThreadSuspended(pk_Thread, &b_Suspend);
if (b_Suspend) printf("The thread is suspended.\n");
else printf("The thread is not suspended.\n");
return 0;
}
My class first captures all currently running processes and threads with NtQuerySystemInformation() and then searches the information of the requested thread.
You can easily add a function that searches a process by name. (pk_Proc->usName)
Here comes my class. It is only one header file:
#pragma once
#include <winternl.h>
#include <winnt.h>
typedef LONG NTSTATUS;
#define STATUS_SUCCESS ((NTSTATUS) 0x00000000)
#define STATUS_INFO_LENGTH_MISMATCH ((NTSTATUS) 0xC0000004)
enum KWAIT_REASON
{
Executive,
FreePage,
PageIn,
PoolAllocation,
DelayExecution,
Suspended,
UserRequest,
WrExecutive,
WrFreePage,
WrPageIn,
WrPoolAllocation,
WrDelayExecution,
WrSuspended,
WrUserRequest,
WrEventPair,
WrQueue,
WrLpcReceive,
WrLpcReply,
WrVirtualMemory,
WrPageOut,
WrRendezvous,
Spare2,
Spare3,
Spare4,
Spare5,
Spare6,
WrKernel,
MaximumWaitReason
};
enum THREAD_STATE
{
Running = 2,
Waiting = 5,
};
#pragma pack(push,8)
struct CLIENT_ID
{
HANDLE UniqueProcess; // Process ID
HANDLE UniqueThread; // Thread ID
};
// http://www.geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/sysinfo/thread.htm
// Size = 0x40 for Win32
// Size = 0x50 for Win64
struct SYSTEM_THREAD
{
LARGE_INTEGER KernelTime;
LARGE_INTEGER UserTime;
LARGE_INTEGER CreateTime;
ULONG WaitTime;
PVOID StartAddress;
CLIENT_ID ClientID; // process/thread ids
LONG Priority;
LONG BasePriority;
ULONG ContextSwitches;
THREAD_STATE ThreadState;
KWAIT_REASON WaitReason;
};
struct VM_COUNTERS // virtual memory of process
{
ULONG_PTR PeakVirtualSize;
ULONG_PTR VirtualSize;
ULONG PageFaultCount;
ULONG_PTR PeakWorkingSetSize;
ULONG_PTR WorkingSetSize;
ULONG_PTR QuotaPeakPagedPoolUsage;
ULONG_PTR QuotaPagedPoolUsage;
ULONG_PTR QuotaPeakNonPagedPoolUsage;
ULONG_PTR QuotaNonPagedPoolUsage;
ULONG_PTR PagefileUsage;
ULONG_PTR PeakPagefileUsage;
};
// http://www.geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/sysinfo/process.htm
// See also SYSTEM_PROCESS_INROMATION in Winternl.h
// Size = 0x00B8 for Win32
// Size = 0x0100 for Win64
struct SYSTEM_PROCESS
{
ULONG NextEntryOffset; // relative offset
ULONG ThreadCount;
LARGE_INTEGER WorkingSetPrivateSize;
ULONG HardFaultCount;
ULONG NumberOfThreadsHighWatermark;
ULONGLONG CycleTime;
LARGE_INTEGER CreateTime;
LARGE_INTEGER UserTime;
LARGE_INTEGER KernelTime;
UNICODE_STRING ImageName;
LONG BasePriority;
PVOID UniqueProcessId;
PVOID InheritedFromUniqueProcessId;
ULONG HandleCount;
ULONG SessionId;
ULONG_PTR UniqueProcessKey;
VM_COUNTERS VmCounters;
ULONG_PTR PrivatePageCount;
IO_COUNTERS IoCounters; // defined in winnt.h
};
#pragma pack(pop)
typedef NTSTATUS (WINAPI* t_NtQueryInfo)(SYSTEM_INFORMATION_CLASS, PVOID, ULONG, PULONG);
class cProcInfo
{
public:
cProcInfo()
{
#ifdef WIN64
assert(sizeof(SYSTEM_THREAD) == 0x50 && sizeof(SYSTEM_PROCESS) == 0x100);
#else
assert(sizeof(SYSTEM_THREAD) == 0x40 && sizeof(SYSTEM_PROCESS) == 0xB8);
#endif
mu32_DataSize = 1000;
mp_Data = NULL;
mf_NtQueryInfo = NULL;
}
virtual ~cProcInfo()
{
if (mp_Data) LocalFree(mp_Data);
}
// Capture all running processes and all their threads.
// returns an API or NTSTATUS Error code or zero if successfull
DWORD Capture()
{
if (!mf_NtQueryInfo)
{
mf_NtQueryInfo = (t_NtQueryInfo)GetProcAddress(GetModuleHandleA("NtDll.dll"), "NtQuerySystemInformation");
if (!mf_NtQueryInfo)
return GetLastError();
}
// This must run in a loop because in the mean time a new process may have started
// and we need more buffer than u32_Needed !!
while (true)
{
if (!mp_Data)
{
mp_Data = (BYTE*)LocalAlloc(LMEM_FIXED, mu32_DataSize);
if (!mp_Data)
return GetLastError();
}
ULONG u32_Needed = 0;
NTSTATUS s32_Status = mf_NtQueryInfo(SystemProcessInformation, mp_Data, mu32_DataSize, &u32_Needed);
if (s32_Status == STATUS_INFO_LENGTH_MISMATCH) // The buffer was too small
{
mu32_DataSize = u32_Needed + 4000;
LocalFree(mp_Data);
mp_Data = NULL;
continue;
}
return s32_Status;
}
}
// Searches a process by a given Process Identifier
// Capture() must have been called before!
SYSTEM_PROCESS* FindProcessByPid(DWORD u32_PID)
{
if (!mp_Data)
{
assert(mp_Data);
return NULL;
}
SYSTEM_PROCESS* pk_Proc = (SYSTEM_PROCESS*)mp_Data;
while (TRUE)
{
if ((DWORD)(DWORD_PTR)pk_Proc->UniqueProcessId == u32_PID)
return pk_Proc;
if (!pk_Proc->NextEntryOffset)
return NULL;
pk_Proc = (SYSTEM_PROCESS*)((BYTE*)pk_Proc + pk_Proc->NextEntryOffset);
}
}
SYSTEM_THREAD* FindThreadByTid(SYSTEM_PROCESS* pk_Proc, DWORD u32_TID)
{
if (!pk_Proc)
{
assert(pk_Proc);
return NULL;
}
// The first SYSTEM_THREAD structure comes immediately after the SYSTEM_PROCESS structure
SYSTEM_THREAD* pk_Thread = (SYSTEM_THREAD*)((BYTE*)pk_Proc + sizeof(SYSTEM_PROCESS));
for (DWORD i=0; i<pk_Proc->ThreadCount; i++)
{
if (pk_Thread->ClientID.UniqueThread == (HANDLE)(DWORD_PTR)u32_TID)
return pk_Thread;
pk_Thread++;
}
return NULL;
}
DWORD IsThreadSuspended(SYSTEM_THREAD* pk_Thread, BOOL* pb_Suspended)
{
if (!pk_Thread)
return ERROR_INVALID_PARAMETER;
*pb_Suspended = (pk_Thread->ThreadState == Waiting &&
pk_Thread->WaitReason == Suspended);
return 0;
}
private:
BYTE* mp_Data;
DWORD mu32_DataSize;
t_NtQueryInfo mf_NtQueryInfo;
};
// Based on the 32 bit code of Sven B. Schreiber on:
// http://www.informit.com/articles/article.aspx?p=22442&seqNum=5

Synchronization Shared Memory

I have shared memory communication successful between two processes on Windows. I write to it like this:
Don't mind my comments to myself. I thought I'd have some fun with the diagram.
How can I do this faster? In-other words, how can I do this using a Mutex or CreateEvent? I tried hard to understand Mutexes and CreateEvent but it confuses me on MSDN as it uses on application and a thread. An example would be helpful but isn't required.
The way I do current do it is (Very SLOW!):
//Create SharedMemory Using Filename + KnownProcessID so I can have multiple clients and servers with unique mappings. I already have this working and communication successful.
bool SharedMemoryBusy()
{
double* Data = static_cast<double*>(pData); //Pointer to the mapped memory.
return static_cast<int>(Data[1]) > 0 ? true : false;
}
void* RequestSharedMemory()
{
for (int Success = 0; Success < 50; ++Success)
{
if (SharedMemoryBusy())
Sleep(10);
else
break;
}
return pData;
}
bool SharedMemoryReturned()
{
double* Data = static_cast<double*>(pData);
return static_cast<int>(Data[1]) == 2 ? true : false;
}
bool SharedDataFetched()
{
for (int Success = 0; Success < 50; ++Success)
{
if (SharedMemoryReturned())
return true;
if (!SharedMemoryBusy())
return false;
Sleep(10);
}
return false;
}
The reading and writing part (Just an example of on request.. There can be many different types.. Models, Points, Locations, etc):
void* GLHGetModels()
{
double* Data = static_cast<double*>(RequestSharedMemory());
Data[0] = MODELS;
Data[1] = StatusSent;
if (SharedDataFetched())
{
if (static_cast<int>(Data[2]) != 0)
{
unsigned char* SerializedData = reinterpret_cast<unsigned char*>(&Data[3]);
MemDeSerialize(ListOfModels, SerializedData, static_cast<int>(Data[2]));
return Models.data();
}
return NULL;
}
return NULL;
}
void WriteModels()
{
If (pData[0] == MODELS)
{
//Write The Request..
Data[1] = StatusReturned;
Data[2] = ListOfModels.size();
unsigned char* Destination = reinterpret_cast<unsigned char*>(&Data[3]);
MemSerialize(Destination, ListOfModels); //Uses MEMCOPY To copy to Destination.
}
}
You can use events and mutexes between processes. This works fine. You only need to duplicate the handle of the object before passing it to other process:
BOOL WINAPI DuplicateHandle(
_In_ HANDLE hSourceProcessHandle,
_In_ HANDLE hSourceHandle,
_In_ HANDLE hTargetProcessHandle,
_Out_ LPHANDLE lpTargetHandle,
_In_ DWORD dwDesiredAccess,
_In_ BOOL bInheritHandle,
_In_ DWORD dwOptions
);
Each process works with its own handle while they both refer to the same kernel object.
What you need to do is to create 2 events. First party will signal that the data is ready. Other side shoud take the data, place its own data, reset first event and signal the second. After that the first party is doing the same.
Event functions are: CreateEvent, SetEvent, ... CloseHandle.

How to create a trampoline function for hook

I'm interested in hooking and I decided to see if I could hook some functions. I wasn't interested in using a library like detours because I want to have the experience of doing it on my own. With some sources I found on the internet, I was able to create the code below. It's basic, but it works alright. However when hooking functions that are called by multiple threads it proves to be extremely unstable. If two calls are made at nearly the same time, it'll crash. After some research I think I need to create a trampoline function. After looking for hours all I was not able to find anything other that a general description on what a trampoline was. I could not find anything specifically about writing a trampoline function, or how they really worked. If any one could help me write one, post some sources, or at least point me in the right direction by recommending some articles, sites, books, etc. I would greatly appreciate it.
Below is the code I've written. It's really basic but I hope others might learn from it.
test.cpp
#include "stdafx.h"
Hook hook;
typedef int (WINAPI *tMessageBox)(HWND hWnd, LPCTSTR lpText, LPCTSTR lpCaption, UINT uType);
DWORD hMessageBox(HWND hWnd, LPCTSTR lpText, LPCTSTR lpCaption, UINT uType)
{
hook.removeHook();
tMessageBox oMessageBox = (tMessageBox)hook.funcPtr;
int ret =oMessageBox(hWnd, lpText, "Hooked!", uType);
hook.applyHook(&hMessageBox);
return ret;
}
void hookMessageBox()
{
printf("Hooking MessageBox...\n");
if(hook.findFunc("User32.dll", "MessageBoxA"))
{
if(hook.applyHook(&hMessageBox))
{
printf("hook applied! \n\n");
} else printf("hook could not be applied\n");
}
}
hook.cpp
#include "stdafx.h"
bool Hook::findFunc(char* libName, char* funcName)
{
Hook::funcPtr = (void*)GetProcAddress(GetModuleHandleA(libName), funcName);
return (Hook::funcPtr != NULL);
}
bool Hook::removeHook()
{
DWORD dwProtect;
if(VirtualProtect(Hook::funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect))
{
WriteProcessMemory(GetCurrentProcess(), (LPVOID)Hook::funcPtr, Hook::origData, 6, 0);
VirtualProtect(Hook::funcPtr, 6, dwProtect, NULL);
return true;
} else return false;
}
bool Hook::reapplyHook()
{
DWORD dwProtect;
if(VirtualProtect(funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect))
{
WriteProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, Hook::hookData, 6, 0);
VirtualProtect(funcPtr, 6, dwProtect, NULL);
return true;
} else return false;
}
bool Hook::applyHook(void* hook)
{
return setHookAtAddress(Hook::funcPtr, hook);
}
bool Hook::setHookAtAddress(void* funcPtr, void* hook)
{
Hook::funcPtr = funcPtr;
BYTE jmp[6] = { 0xE9, //jmp
0x00, 0x00, 0x00, 0x00, //address
0xC3 //retn
};
DWORD dwProtect;
if(VirtualProtect(funcPtr, 6, PAGE_EXECUTE_READWRITE, &dwProtect)) // make memory writable
{
ReadProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, Hook::origData, 6, 0); // save old data
DWORD offset = ((DWORD)hook - (DWORD)funcPtr - 5); //((to)-(from)-5)
memcpy(&jmp[1], &offset, 4); // write address into jmp
memcpy(Hook::hookData, jmp, 6); // save hook data
WriteProcessMemory(GetCurrentProcess(), (LPVOID)funcPtr, jmp, 6, 0); // write jmp
VirtualProtect(funcPtr, 6, dwProtect, NULL); // reprotect
return true;
} else return false;
}
If you want your hook to be safe when called by multiple threads, you don't want to be constantly unhooking and rehooking the original API.
A trampoline is simply a bit of code you generate that replicates the functionality of the first few bytes of the original API (which you overwrote with your jump), then jumps into the API after the bytes you overwrote.
Rather than unhooking the API, calling it and rehooking it you simply call the trampoline.
This is moderately complicated to do on x86 because you need (a fairly minimal) disassembler to find the instruction boundaries. You also need to check that the code you copy into your trampoline doesn't do anything relative to the instruction pointer (like a jmp, branch or call).
This is sufficient to make calls to the hook thread-safe, but you can't create the hook if multiple threads are using the API. For this, you need to hook the function with a two-byte near jump (which can be written atomically). Windows APIs are frequently preceded by a few NOPs (which can be overwritten with a far jump) to provide a target for this near jump.
Doing this on x64 is much more complicated. You can't simply patch the function with a 64-bit far jump (because there isn't one, and instructions to simulate it are often too long). And, depending on what your trampoline does, you may need to add it to the OS's stack unwind information.
I hope this isn't too general.
The defacto standard hooking tutorial is from jbremer and available here
Here is a simple x86 detour and trampoline hook based on this tutorial using Direct3D's EndScene() function as a example:
bool Detour32(char* src, char* dst, const intptr_t len)
{
if (len < 5) return false;
DWORD curProtection;
VirtualProtect(src, len, PAGE_EXECUTE_READWRITE, &curProtection);
intptr_t relativeAddress = (intptr_t)(dst - (intptr_t)src) - 5;
*src = (char)'\xE9';
*(intptr_t*)((intptr_t)src + 1) = relativeAddress;
VirtualProtect(src, len, curProtection, &curProtection);
return true;
}
char* TrampHook32(char* src, char* dst, const intptr_t len)
{
// Make sure the length is greater than 5
if (len < 5) return 0;
// Create the gateway (len + 5 for the overwritten bytes + the jmp)
void* gateway = VirtualAlloc(0, len + 5, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
//Write the stolen bytes into the gateway
memcpy(gateway, src, len);
// Get the gateway to destination addy
intptr_t gatewayRelativeAddr = ((intptr_t)src - (intptr_t)gateway) - 5;
// Add the jmp opcode to the end of the gateway
*(char*)((intptr_t)gateway + len) = 0xE9;
// Add the address to the jmp
*(intptr_t*)((intptr_t)gateway + len + 1) = gatewayRelativeAddr;
// Perform the detour
Detour32(src, dst, len);
return (char*)gateway;
}
typedef HRESULT(APIENTRY* tEndScene)(LPDIRECT3DDEVICE9 pDevice);
tEndScene oEndScene = nullptr;
HRESULT APIENTRY hkEndScene(LPDIRECT3DDEVICE9 pDevice)
{
//do stuff in here
return oEndScene(pDevice);
}
//just an example
int main()
{
oEndScene = (tEndScene)TrampHook32((char*)d3d9Device[42], (char*)hkEndScene, 7);
}