How do scopes and thread locals work in (V8's) C++?

I am intrigued by how V8's scopes work.
How can a scope object on the stack find other scope objects and contexts further up the stack?
Digging into how HandleScopes work, I found that they rely on thread locals. This has left me wondering how thread locals work in C++. I've found the implementation but still don't feel I understand what's going on.
api.cc -- HandleScope looks for the current Isolate
HandleScope::HandleScope() {
i::Isolate* isolate = i::Isolate::Current();
API_ENTRY_CHECK(isolate, "HandleScope::HandleScope");
v8::ImplementationUtilities::HandleScopeData* current =
isolate->handle_scope_data();
isolate_ = isolate;
prev_next_ = current->next;
prev_limit_ = current->limit;
is_closed_ = false;
current->level++;
}
isolate.cc -- static method looks for the current isolate as thread local
// Returns the isolate inside which the current thread is running.
INLINE(static Isolate* Current()) {
const Thread::LocalStorageKey key = isolate_key();
Isolate* isolate = reinterpret_cast<Isolate*>(
Thread::GetExistingThreadLocal(key));
if (!isolate) {
EnsureDefaultIsolate();
isolate = reinterpret_cast<Isolate*>(
Thread::GetExistingThreadLocal(key));
}
ASSERT(isolate != NULL);
return isolate;
}
platform.h -- calls a low level method to retrieve thread local
static inline void* GetExistingThreadLocal(LocalStorageKey key) {
void* result = reinterpret_cast<void*>(
InternalGetExistingThreadLocal(static_cast<intptr_t>(key)));
ASSERT(result == GetThreadLocal(key));
return result;
}
platform-tls-win32.h -- the magic happens
inline intptr_t InternalGetExistingThreadLocal(intptr_t index) {
const intptr_t kTibInlineTlsOffset = 0xE10;
const intptr_t kTibExtraTlsOffset = 0xF94;
const intptr_t kMaxInlineSlots = 64;
const intptr_t kMaxSlots = kMaxInlineSlots + 1024;
ASSERT(0 <= index && index < kMaxSlots);
if (index < kMaxInlineSlots) {
return static_cast<intptr_t>(__readfsdword(kTibInlineTlsOffset +
kPointerSize * index));
}
intptr_t extra = static_cast<intptr_t>(__readfsdword(kTibExtraTlsOffset));
ASSERT(extra != 0);
return *reinterpret_cast<intptr_t*>(extra +
kPointerSize * (index - kMaxInlineSlots));
}
How exactly is this last method working?
How does it know where to look?
What is the structure of the stack?

You can view InternalGetExistingThreadLocal as an inlined version of the TlsGetValue WinAPI call.
On Windows, in user mode, the fs segment register (on 32-bit x86; x64 uses gs) gives code access to the Thread Information Block (TIB), which contains thread-specific information such as the Thread Local Storage structures.
The layout of the TIB and the way TLS is stored inside it are exposed in the DDK (see http://en.wikipedia.org/wiki/Win32_Thread_Information_Block for a quick overview of the TIB layout). As the constants in the quoted code show, the first 64 TLS slots are stored inline in the TIB starting at offset 0xE10, and the remaining 1024 slots live in a separate array whose pointer is stored at offset 0xF94, which is why the function branches on kMaxInlineSlots.
Given this knowledge and the ability to read data from the TIB via __readfsdword(offs) (equivalent to reading dword ptr fs:[offs]), one can access TLS directly and efficiently without calling TlsGetValue.
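To connect this back to the original question of how a scope on the stack finds the scopes above it: the following is a minimal portable sketch, not V8's actual code. It assumes a simplified Isolate that only tracks the innermost scope, and it uses the standard thread_local keyword in place of the hand-rolled TLS lookup. The names Isolate, Scope, CurrentIsolateSlot and NestedDepth are all hypothetical.

```cpp
#include <cassert>

struct Scope;

// Hypothetical stand-in for an isolate: remembers the innermost live scope.
struct Isolate {
    Scope* current_scope = nullptr;
    int level = 0;
};

// One pointer per thread. This plays the role of the TLS slot that
// Isolate::Current() reads in the quoted code.
inline Isolate*& CurrentIsolateSlot() {
    thread_local Isolate* slot = nullptr;
    return slot;
}

// RAII scope: the constructor links the new scope in front of the previous
// one, the destructor restores the previous one. No scope ever needs to
// walk the machine stack; the chain hangs off the thread-local isolate.
struct Scope {
    Scope() : isolate_(CurrentIsolateSlot()) {
        assert(isolate_ != nullptr && "no isolate entered on this thread");
        prev_ = isolate_->current_scope;
        isolate_->current_scope = this;
        ++isolate_->level;
    }
    ~Scope() {
        isolate_->current_scope = prev_;
        --isolate_->level;
    }
    Scope(const Scope&) = delete;
    Scope& operator=(const Scope&) = delete;

    Isolate* isolate_;
    Scope* prev_ = nullptr;
};

// Enters an isolate on the calling thread, nests two scopes and reports
// the nesting level seen while both are alive.
inline int NestedDepth() {
    Isolate isolate;
    CurrentIsolateSlot() = &isolate;
    int depth = 0;
    {
        Scope outer;
        Scope inner;  // finds `outer` through the thread-local isolate
        depth = isolate.level;
    }
    CurrentIsolateSlot() = nullptr;
    return depth;
}
```

Because the slot is thread_local, each thread that enters its own isolate gets an independent chain of scopes.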

Related

per thread c++ guard to prevent re-entrant function calls

I've got a function that reads the registry; it can fail, and when it does it prints the failure reason.
This function can also be called, directly or indirectly, from a dedicated built-in printing function, and in that case I want to skip printing the reason to avoid endless recursion.
I can use thread_local to define a per-thread flag that stops this function from calling the print function, but I guess this is a rather widespread problem, so I'm looking for a std implementation of such a guard or any other well-debugged code.
Here's an example made just to express the problem.
Each print call comes with a log level, which is compared with the current log level threshold that resides in the registry. If it is lower than the threshold, the function returns without printing. However, reading the threshold can itself trigger a print, so I want a guard that suppresses the print from getPrintLevelFromRegistry when it is called from print.
int getPrintLevelFromRegistry() {
int value = 0;
DWORD res = RegGetValueW("//Software//...//logLevel" , &value);
if (res != ERROR_SUCCESS) {
print("couldn't find registry key");
return 0;
}
return value;
}
void print(const char * text, int printLogLevel) {
if (printLogLevel < getPrintLevelFromRegistry()) {
return;
}
// do the print itself
...
}
Thanks !
The root of the problem is that you are attempting to have your logging code log itself. Rather than adding some complicated guard, consider that you really don't need to log a registry read through the logger. Just have it return a default value and send the error to the debugger output.
int getPrintLevelFromRegistry() {
int value = 0;
DWORD res = RegGetValueW("//Software//...//logLevel" , &value);
if (res != ERROR_SUCCESS) {
OutputDebugStringA("getPrintLevelFromRegistry: Can't read from registry\r\n");
}
return value;
}
Further, reading the registry on every log statement works, but it is redundant.
Better, cache the value after the first successful read:
int getPrintLevelFromRegistry() {
static std::atomic<int> cachedValue(-1);
int value = cachedValue;
if (value == -1) {
DWORD res = RegGetValueW("//Software//...//logLevel" , &value);
if (res == ERROR_SUCCESS) {
cachedValue = value;
}
}
return (value == -1) ? 0 : value; // fall back to the default level if the read failed
}
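For completeness, the per-thread guard the question asks about could be sketched as below. I'm not aware of a ready-made guard for this in the standard library, so this is a hand-rolled RAII class built on thread_local. ReentrancyGuard, InsidePrint, Output and getPrintLevel are hypothetical names, and getPrintLevel is a stand-in that always fails so that the nested print is exercised without touching the registry.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Per-thread flag: true while the current thread is inside print().
inline bool& InsidePrint() {
    thread_local bool inside = false;
    return inside;
}

// RAII guard: sets the flag on entry and restores the previous value on
// exit, including when an exception unwinds the stack.
class ReentrancyGuard {
public:
    ReentrancyGuard() : previous_(InsidePrint()) { InsidePrint() = true; }
    ~ReentrancyGuard() { InsidePrint() = previous_; }
    ReentrancyGuard(const ReentrancyGuard&) = delete;
    ReentrancyGuard& operator=(const ReentrancyGuard&) = delete;
    bool WasAlreadyInside() const { return previous_; }

private:
    bool previous_;
};

// Hypothetical sink so the sketch can be checked without a console.
// It is not synchronized; a real logger would need its own locking.
inline std::vector<std::string>& Output() {
    static std::vector<std::string> lines;
    return lines;
}

void print(const char* text, int printLogLevel);

// Stand-in for the registry lookup: it always "fails" and tries to log.
int getPrintLevel() {
    print("couldn't find registry key", 100);
    return 0;
}

void print(const char* text, int printLogLevel) {
    ReentrancyGuard guard;
    if (guard.WasAlreadyInside()) {
        return;  // nested call made while computing the level: drop it
    }
    if (printLogLevel < getPrintLevel()) {
        return;
    }
    Output().push_back(text);
}
```

With this arrangement the nested message is silently dropped when the lookup runs inside print, but still gets printed when getPrintLevel is called on its own.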

Exporting a Vulkan memory allocation handle causes an out-of-device-memory error

I'm trying to export the handle of a memory allocation made by Vulkan to import it into OpenGL. But when I add a vk::ExportMemoryAllocateInfo in the pNext chain of the vk::MemoryAllocateInfo, vk::Device::allocateMemory throws an OutOfDeviceMemory exception.
When I don't export the handle, the allocation does not fail but the returned handle is not valid.
Here is the guilty code (based on jherico's example: vulkan-opengl-interop-example):
// The physical device and the device are correct
// requirements are for a RGBA image of 1024x1024 pixels
// memory properties is just vk::MemoryPropertyFlagBits::eDeviceLocal
void IMemoryObject::Allocate(vk::PhysicalDevice physicalDevice, vk::Device device, const vk::MemoryRequirements& requirements, vk::MemoryPropertyFlags properties)
{
unsigned int memoryTypeIndex = 0;
bool memoryIndexTypeFound = false;
vk::PhysicalDeviceMemoryProperties memoryProperties = physicalDevice.getMemoryProperties();
for (unsigned int i = 0; i < memoryProperties.memoryTypeCount && !memoryIndexTypeFound; i++)
{
vk::MemoryType memoryType = memoryProperties.memoryTypes[i];
if ((requirements.memoryTypeBits & (1u << i)) && (memoryType.propertyFlags & properties) == properties)
{
memoryTypeIndex = i;
memoryIndexTypeFound = true;
}
}
if (!memoryIndexTypeFound)
throw std::exception();
vk::ExportMemoryAllocateInfo exportAllocInfo;
exportAllocInfo.setHandleTypes(vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32);
vk::MemoryAllocateInfo allocateInfo;
allocateInfo.setPNext(&exportAllocInfo); // Remove this line and the allocation won't fail
allocateInfo.setAllocationSize(requirements.size);
allocateInfo.setMemoryTypeIndex(memoryTypeIndex);
deviceMemory_ = device.allocateMemoryUnique(allocateInfo);
// Calls vkBindBufferMemory or vkBindImageMemory; never reached when allocateInfo.pNext == &exportAllocInfo
BindMemory(*deviceMemory_, 0);
}
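The memory-type search at the top of Allocate can be looked at in isolation. The sketch below uses plain integers instead of the Vulkan types, so FindMemoryType and its parameters are hypothetical stand-ins: typeBits corresponds to requirements.memoryTypeBits, typeFlags to the propertyFlags of each memory type, and required to the requested properties.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the index of the first memory type that is allowed by `typeBits`
// and whose flags contain every bit of `required`, or -1 if none matches.
int FindMemoryType(std::uint32_t typeBits,
                   const std::vector<std::uint32_t>& typeFlags,
                   std::uint32_t required) {
    for (std::uint32_t i = 0; i < typeFlags.size() && i < 32; ++i) {
        const bool allowed = (typeBits & (1u << i)) != 0;
        const bool suitable = (typeFlags[i] & required) == required;
        if (allowed && suitable) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Small fixture used to exercise the search: three memory types, of which
// types 0 and 2 carry flag 0x1 and type 1 carries flags 0x2 | 0x4.
inline std::vector<std::uint32_t> ExampleTypeFlags() {
    return {0x1u, 0x6u, 0x1u};
}
```

Returning -1 instead of throwing keeps the helper easy to test; the caller can still throw as the original code does.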
My system is:
Windows 7
Nvidia Quadro RTX 4000
Driver version 431.02 or 431.94
The validation layer is present in my instance but stays silent, the extensions VK_KHR_external_memory and VK_KHR_external_memory_win32 are available on my device, the allocation size respects the API limits, and the memory type index is correct.
Am I doing something wrong, or is there a limitation I missed?
Thanks !
Edit:
I tried to export the handle as a vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32Kmt and the allocation worked. The code below is how I test whether exporting a given handle type requires a dedicated allocation.
bool RequireDedicatedAllocation(vk::PhysicalDevice physicalDevice, const vk::ImageCreateInfo& createInfo, vk::ExternalMemoryHandleTypeFlagBits handleType)
{
vk::PhysicalDeviceExternalImageFormatInfo externalImageFormatInfo;
externalImageFormatInfo.setHandleType(handleType);
vk::PhysicalDeviceImageFormatInfo2 imageFormatInfo;
imageFormatInfo.setUsage(createInfo.usage)
.setFormat(createInfo.format)
.setTiling(createInfo.tiling)
.setType(createInfo.imageType)
.setPNext(&externalImageFormatInfo);
vk::StructureChain<vk::ImageFormatProperties2, vk::ExternalImageFormatProperties> imageFormatPropertiesChain = physicalDevice.getImageFormatProperties2<vk::ImageFormatProperties2, vk::ExternalImageFormatProperties>(imageFormatInfo);
vk::ExternalImageFormatProperties externalImageProperties = imageFormatPropertiesChain.get<vk::ExternalImageFormatProperties>();
return static_cast<bool>(externalImageProperties.externalMemoryProperties.externalMemoryFeatures & vk::ExternalMemoryFeatureFlagBits::eDedicatedOnly);
}
On my system, it throws an ErrorFormatNotSupported error (from vk::PhysicalDevice::getImageFormatProperties2) with vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32 and returns false with vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32Kmt.
I finally ended up using vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32Kmt. I don't need to interact with the handle through the Windows API, just OpenGL.

Why are there three unexpected worker threads when a Win32 console application starts up? [duplicate]

I created a Visual C++ Win32 Console Application with VS2010. When I started the application, I found that there were four threads: one 'Main Thread' and three worker threads (I didn't write any code).
I don't know where these three worker threads came from.
I would like to know the role of these three threads.
Thanks in advance!
Windows 10 implemented a new way of loading DLLs - several worker threads do it in parallel (LdrpWorkCallback). All Windows 10 processes now have several such threads.
Before Win10, the system (ntdll.dll) always loaded DLLs in a single thread, but starting with Win10 this behaviour changed. Now a "Parallel loader" exists in ntdll. Now the loading task (NTSTATUS LdrpSnapModule(LDRP_LOAD_CONTEXT* LoadContext)) can be executed in worker threads. Almost every DLL has imports (dependent DLLs), so when a DLL is loaded - its dependent DLLs are also loaded and this process is recursive (dependent DLLs have own dependencies).
The function void LdrpMapAndSnapDependency(LDRP_LOAD_CONTEXT* LoadContext) walks the current loaded DLL import table and loads its direct (1st level) dependent DLLs by calling LdrpLoadDependentModule() (which internally calls LdrpMapAndSnapDependency() for the newly loaded DLL - so this process is recursive). Finally, LdrpMapAndSnapDependency() needs to call NTSTATUS LdrpSnapModule(LDRP_LOAD_CONTEXT* LoadContext) to bind imports to the already loaded DLLs. LdrpSnapModule() is executed for many DLLs in the top level DLL load process, and this process is independent for every DLL - so this is a good place to parallelize. LdrpSnapModule() in most cases does not load new DLLs, but only binds import to export from already loaded ones. But if an import is resolved to a forwarded export (which rarely happens) - the new, forwarded DLL, is loaded.
Some current implementation details:
First of all, look at the new field in struct _RTL_USER_PROCESS_PARAMETERS: ULONG LoaderThreads. If set to nonzero, LoaderThreads enables or disables the "Parallel loader" in the new process. When we create a new process with ZwCreateUserProcess(), the 9th argument is PRTL_USER_PROCESS_PARAMETERS ProcessParameters. But if we use CreateProcess[Internal]W(), we cannot pass PRTL_USER_PROCESS_PARAMETERS directly, only STARTUPINFO. RTL_USER_PROCESS_PARAMETERS is partially initialized from STARTUPINFO, but we do not control ULONG LoaderThreads, and it will always be zero (unless we call ZwCreateUserProcess() ourselves or hook that routine).
In the new process initialization phase, LdrpInitializeExecutionOptions() is called (from LdrpInitializeProcess()). This routine checks HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\<app name> for several values (if the <app name> subkey exists - usually it doesn't), including MaxLoaderThreads (REG_DWORD) - if MaxLoaderThreads exists - its value overrides RTL_USER_PROCESS_PARAMETERS.LoaderThreads.
LdrpCreateLoaderEvents() is called. This routine must create 2 global events, HANDLE LdrpWorkCompleteEvent and LdrpLoadCompleteEvent, which are used for synchronization.
NTSTATUS LdrpCreateLoaderEvents()
{
NTSTATUS status = ZwCreateEvent(&LdrpWorkCompleteEvent, EVENT_ALL_ACCESS, 0, SynchronizationEvent, TRUE);
if (0 <= status)
{
status = ZwCreateEvent(&LdrpLoadCompleteEvent, EVENT_ALL_ACCESS, 0, SynchronizationEvent, TRUE);
}
return status;
}
LdrpInitializeProcess() calls void LdrpDetectDetour(). The name speaks for itself. It does not return a value but initializes the global variable BOOLEAN LdrpDetourExist. The routine first checks whether some loader-critical routines are hooked; currently these are 5 routines:
NtOpenFile
NtCreateSection
NtQueryAttributesFile
NtOpenSection
NtMapViewOfSection
If yes - LdrpDetourExist = TRUE;
If not hooked - ThreadDynamicCodePolicyInfo is queried - full code:
void LdrpDetectDetour()
{
if (LdrpDetourExist) return ;
static PVOID LdrpCriticalLoaderFunctions[] = {
NtOpenFile,
NtCreateSection,
ZwQueryAttributesFile,
ZwOpenSection,
ZwMapViewOfSection,
};
static M128A LdrpThunkSignature[5] = {
//***
};
ULONG n = RTL_NUMBER_OF(LdrpCriticalLoaderFunctions);
M128A* ppv = (M128A*)LdrpCriticalLoaderFunctions;
M128A* pps = LdrpThunkSignature;
do
{
if (ppv->Low != pps->Low || ppv->High != pps->High)
{
if (LdrpDebugFlags & 5)
{
DbgPrint("!!! Detour detected, disable parallel loading\n");
LdrpDetourExist = TRUE;
return;
}
}
} while (pps++, ppv++, --n);
BOOL DynamicCodePolicy;
if (0 <= ZwQueryInformationThread(NtCurrentThread(), ThreadDynamicCodePolicyInfo, &DynamicCodePolicy, sizeof(DynamicCodePolicy), 0))
{
if (LdrpDetourExist = (DynamicCodePolicy == 1))
{
if (LdrpMapAndSnapWork)
{
WaitForThreadpoolWorkCallbacks(LdrpMapAndSnapWork, TRUE);//TpWaitForWork
TpReleaseWork(LdrpMapAndSnapWork);//CloseThreadpoolWork
LdrpMapAndSnapWork = 0;
TpReleasePool(LdrpThreadPool);//CloseThreadpool
LdrpThreadPool = 0;
}
}
}
}
LdrpInitializeProcess() calls NTSTATUS LdrpEnableParallelLoading (ULONG LoaderThreads) - as LdrpEnableParallelLoading(ProcessParameters->LoaderThreads):
NTSTATUS LdrpEnableParallelLoading (ULONG LoaderThreads)
{
LdrpDetectDetour();
if (LoaderThreads)
{
LoaderThreads = min(LoaderThreads, 16);// not more than 16 threads allowed
if (LoaderThreads <= 1) return STATUS_SUCCESS;
}
else
{
if (RtlGetSuiteMask() & 0x10000) return STATUS_SUCCESS;
LoaderThreads = 4;// default for 4 threads
}
if (LdrpDetourExist) return STATUS_SUCCESS;
NTSTATUS status = TpAllocPool(&LdrpThreadPool, 1);//CreateThreadpool
if (0 <= status)
{
TpSetPoolWorkerThreadIdleTimeout(LdrpThreadPool, -300000000);// 30 second idle timeout
TpSetPoolMaxThreads(LdrpThreadPool, LoaderThreads - 1);//SetThreadpoolThreadMaximum
TP_CALLBACK_ENVIRON CallbackEnviron = { };
CallbackEnviron.CallbackPriority = TP_CALLBACK_PRIORITY_NORMAL;
CallbackEnviron.Size = sizeof(TP_CALLBACK_ENVIRON);
CallbackEnviron.Pool = LdrpThreadPool;
CallbackEnviron.Version = 3;
status = TpAllocWork(&LdrpMapAndSnapWork, LdrpWorkCallback, 0, &CallbackEnviron);//CreateThreadpoolWork
}
return status;
}
A special loader thread pool, LdrpThreadPool, is created with at most LoaderThreads - 1 threads. The idle timeout is set to 30 seconds (after which a thread exits), and a PTP_WORK, LdrpMapAndSnapWork, is allocated; it is later used in void LdrpQueueWork(LDRP_LOAD_CONTEXT* LoadContext).
Global variables used by the parallel loader:
HANDLE LdrpWorkCompleteEvent, LdrpLoadCompleteEvent;
CRITICAL_SECTION LdrpWorkQueueLock;
LIST_ENTRY LdrpWorkQueue = { &LdrpWorkQueue, &LdrpWorkQueue };
ULONG LdrpWorkInProgress;
BOOLEAN LdrpDetourExist;
PTP_POOL LdrpThreadPool;
PTP_WORK LdrpMapAndSnapWork;
enum DRAIN_TASK {
WaitLoadComplete, WaitWorkComplete
};
struct LDRP_LOAD_CONTEXT
{
UNICODE_STRING BaseDllName;
PVOID somestruct;
ULONG Flags;//some unknown flags
NTSTATUS* pstatus; //final status of load
_LDR_DATA_TABLE_ENTRY* ParentEntry; // of 'parent' loading dll
_LDR_DATA_TABLE_ENTRY* Entry; // this == Entry->LoadContext
LIST_ENTRY WorkQueueListEntry;
_LDR_DATA_TABLE_ENTRY* ReplacedEntry;
_LDR_DATA_TABLE_ENTRY** pvImports;// in same order as in IMAGE_IMPORT_DESCRIPTOR piid
ULONG ImportDllCount;// count of pvImports
LONG TaskCount;
PVOID pvIAT;
ULONG SizeOfIAT;
ULONG CurrentDll; // 0 <= CurrentDll < ImportDllCount
PIMAGE_IMPORT_DESCRIPTOR piid;
ULONG OriginalIATProtect;
PVOID GuardCFCheckFunctionPointer;
PVOID* pGuardCFCheckFunctionPointer;
};
Unfortunately LDRP_LOAD_CONTEXT is not contained in published .pdb files, so my definitions include only partial names.
struct {
ULONG MaxWorkInProgress;//4 - values from explorer.exe at some moment
ULONG InLoaderWorker;//7a (this mean LdrpSnapModule called from worker thread)
ULONG InLoadOwner;//87 (LdrpSnapModule called direct, in same thread as `LdrpMapAndSnapDependency`)
} LdrpStatistics;
// for statistics
void LdrpUpdateStatistics()
{
LdrpStatistics.MaxWorkInProgress = max(LdrpStatistics.MaxWorkInProgress, LdrpWorkInProgress);
NtCurrentTeb()->LoaderWorker ? LdrpStatistics.InLoaderWorker++ : LdrpStatistics.InLoadOwner++;
}
TEB.SameTebFlags now contains 2 new flags:
USHORT LoadOwner : 01; // 0x1000;
USHORT LoaderWorker : 01; // 0x2000;
The last 2 bits are spare (USHORT SpareSameTebBits : 02; // 0xc000)
LdrpMapAndSnapDependency(LDRP_LOAD_CONTEXT* LoadContext) includes the following code:
LDR_DATA_TABLE_ENTRY* Entry = LoadContext->CurEntry;
if (LoadContext->pvIAT)
{
Entry->DdagNode->State = LdrModulesSnapping;
if (LoadContext->PrevEntry)// if recursive call
{
LdrpQueueWork(LoadContext); // !!!
}
else
{
status = LdrpSnapModule(LoadContext);
}
}
else
{
Entry->DdagNode->State = LdrModulesSnapped;
}
Take loading user32.dll as an example. In the first call to LdrpMapAndSnapDependency(), CurEntry points to user32.dll and LoadContext->PrevEntry is always 0. When LdrpMapAndSnapDependency() is called recursively for its dependency gdi32.dll, PrevEntry refers to user32.dll and CurEntry to gdi32.dll. So whenever LoadContext->PrevEntry is set, we do not call LdrpSnapModule(LoadContext) directly but LdrpQueueWork(LoadContext) instead.
LdrpQueueWork() is simply:
void LdrpQueueWork(LDRP_LOAD_CONTEXT* LoadContext)
{
if (0 <= *LoadContext->pstatus)
{
EnterCriticalSection(&LdrpWorkQueueLock);
InsertHeadList(&LdrpWorkQueue, &LoadContext->WorkQueueListEntry);
LeaveCriticalSection(&LdrpWorkQueueLock);
if (LdrpMapAndSnapWork && !RtlGetCurrentPeb()->Ldr->ShutdownInProgress)
{
SubmitThreadpoolWork(LdrpMapAndSnapWork);//TpPostWork
}
}
}
We insert LoadContext into LdrpWorkQueue, and if the "Parallel loader" is started (LdrpMapAndSnapWork != 0) and shutdown is not in progress, we submit work to the loader pool. Even if the pool is not initialized (say, because detours exist), there is no error: the task is processed in LdrpDrainWorkQueue().
In a worker thread callback, this is executed:
void LdrpWorkCallback()
{
if (LdrpDetourExist) return;
EnterCriticalSection(&LdrpWorkQueueLock);
PLIST_ENTRY Entry = RemoveHeadList(&LdrpWorkQueue);
if (Entry != &LdrpWorkQueue)
{
++LdrpWorkInProgress;
LdrpUpdateStatistics();
}
LeaveCriticalSection(&LdrpWorkQueueLock);
if (Entry != &LdrpWorkQueue)
{
LdrpProcessWork(CONTAINING_RECORD(Entry, LDRP_LOAD_CONTEXT, WorkQueueListEntry), FALSE);
}
}
We simply pop an entry from LdrpWorkQueue, convert it to LDRP_LOAD_CONTEXT* (CONTAINING_RECORD(Entry, LDRP_LOAD_CONTEXT, WorkQueueListEntry)) and call void LdrpProcessWork(LDRP_LOAD_CONTEXT* LoadContext, BOOLEAN LoadOwner).
void LdrpProcessWork(LDRP_LOAD_CONTEXT* ctx, BOOLEAN LoadOwner) generally calls LdrpSnapModule(ctx), and at the end executes the following code:
if (!LoadOwner)
{
EnterCriticalSection(&LdrpWorkQueueLock);
BOOLEAN bSetEvent = --LdrpWorkInProgress == 1 && IsListEmpty(&LdrpWorkQueue);
LeaveCriticalSection(&LdrpWorkQueueLock);
if (bSetEvent) ZwSetEvent(LdrpWorkCompleteEvent, 0);
}
So, if we are not the LoadOwner (that is, we are in a worker thread), we decrement LdrpWorkInProgress, and if LdrpWorkQueue is empty we signal LdrpWorkCompleteEvent (the LoadOwner can wait on it).
Finally, LdrpDrainWorkQueue() is called from the LoadOwner (primary thread) to "drain" the work queue. It may pop and directly execute tasks that LdrpQueueWork() pushed to LdrpWorkQueue and that worker threads have not yet popped, either because they have not got to them yet or because the parallel loader is disabled (in that case LdrpQueueWork() still pushes the LDRP_LOAD_CONTEXT but does not post work to a worker thread). It then waits, if needed, on LdrpWorkCompleteEvent or LdrpLoadCompleteEvent.
enum DRAIN_TASK {
WaitLoadComplete, WaitWorkComplete
};
void LdrpDrainWorkQueue(DRAIN_TASK task)
{
BOOLEAN LoadOwner = FALSE;
HANDLE hEvent = task ? LdrpWorkCompleteEvent : LdrpLoadCompleteEvent;
for(;;)
{
PLIST_ENTRY Entry;
EnterCriticalSection(&LdrpWorkQueueLock);
if (LdrpDetourExist && task == WaitLoadComplete)
{
if (!LdrpWorkInProgress)
{
LdrpWorkInProgress = 1;
LoadOwner = TRUE;
}
Entry = &LdrpWorkQueue;
}
else
{
Entry = RemoveHeadList(&LdrpWorkQueue);
if (Entry == &LdrpWorkQueue)
{
if (!LdrpWorkInProgress)
{
LdrpWorkInProgress = 1;
LoadOwner = TRUE;
}
}
else
{
if (!LdrpDetourExist)
{
++LdrpWorkInProgress;
}
LdrpUpdateStatistics();
}
}
LeaveCriticalSection(&LdrpWorkQueueLock);
if (LoadOwner)
{
NtCurrentTeb()->LoadOwner = 1;
return;
}
if (Entry != &LdrpWorkQueue)
{
LdrpProcessWork(CONTAINING_RECORD(Entry, LDRP_LOAD_CONTEXT, WorkQueueListEntry), FALSE);
}
else
{
ZwWaitForSingleObject(hEvent, 0, 0);
}
}
}
void LdrpDropLastInProgressCount()
{
NtCurrentTeb()->LoadOwner = 0;
EnterCriticalSection(&LdrpWorkQueueLock);
LdrpWorkInProgress = 0;
LeaveCriticalSection(&LdrpWorkQueueLock);
ZwSetEvent(LdrpLoadCompleteEvent, 0);
}
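The overall shape of this scheme (an owner thread that queues work, helper threads that pop it, and an owner that helps drain the queue before waiting for the stragglers) can be sketched in portable C++. This is only an illustration of the pattern under simplified assumptions, not the ntdll implementation: SnapQueue, ProcessOne, Drain and RunTasks are hypothetical names, and a condition variable stands in for LdrpWorkCompleteEvent.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class SnapQueue {
public:
    // Owner side: push one unit of work at the head of the queue.
    void Queue(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_front(std::move(task));
    }

    // Pops and runs at most one task. Returns false if the queue was empty.
    bool ProcessOne() {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.empty()) return false;
            task = std::move(queue_.front());
            queue_.pop_front();
            ++in_progress_;
        }
        task();  // run outside the lock so other threads can keep popping
        {
            std::lock_guard<std::mutex> lock(mutex_);
            --in_progress_;
            if (in_progress_ == 0 && queue_.empty()) done_.notify_all();
        }
        return true;
    }

    // Owner side: help with whatever is still queued, then wait until the
    // helpers have finished the tasks they already picked up.
    void Drain() {
        while (ProcessOne()) {}
        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return in_progress_ == 0 && queue_.empty(); });
    }

private:
    std::mutex mutex_;
    std::condition_variable done_;
    std::deque<std::function<void()>> queue_;
    int in_progress_ = 0;
};

// Runs `count` trivial tasks with `workers` helper threads plus the calling
// thread and returns how many tasks were actually executed.
int RunTasks(int count, int workers) {
    SnapQueue queue;
    std::atomic<int> executed{0};
    for (int i = 0; i < count; ++i) queue.Queue([&executed] { ++executed; });
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i)
        pool.emplace_back([&queue] { while (queue.ProcessOne()) {} });
    queue.Drain();
    for (auto& t : pool) t.join();
    return executed.load();
}
```

As in the loader, the work still completes when there are no helper threads at all, because the owner drains the queue itself.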

When using memory BIOs with OpenSSL, how can you find the 'needed size' for the input BIO?

Here's some sample code which shows how I'm using OpenSSL:
BIO *CreateMemoryBIO() {
if (BIO *bio = BIO_new(BIO_s_mem())) {
BIO_set_mem_eof_return(bio, -1);
return bio;
}
throw std::runtime_error("Could not create memory BIO");
}
m_readBIO = CreateMemoryBIO();
m_writeBIO = CreateMemoryBIO();
SSL_set_bio(m_ssl, m_readBIO, m_writeBIO);
Now, if I call SSL_read and get SSL_ERROR_WANT_READ, is there any way to find out how much it tried to read internally? In other words, how much do I need to write to m_readBIO with BIO_write before SSL_read would be satisfied?
A good lower bound would work for me as well. My issue is that I need to report how much data to read to the layer above me, and it will not return control to me until it has read that much data (and I don't want to degenerate into 1-byte reads).
I'm aware that SSL_read and SSL_write may both alternately read and write due to handshaking and such, but I'm interested in the 'current' read that is being done internally.
If it's not possible to do with the standard BIO_s_mem, I assume it could be done if I wrote my own BIO which 'remembered' the size of the last read request which failed, so any pointers to documentation on writing custom BIOs (which, to my knowledge, is supported by OpenSSL) would also be appreciated.
Thanks to CristiFati for suggesting BIO_set_callback; it seems to work. If you want to make your comment into an answer, I'll accept it, but I want to put the details here for posterity.
Inside my 'SSLSocket' class:
in the constructor:
BIO_set_callback(m_readBIO, &BIOCallback);
BIO_set_callback_arg(m_readBIO, reinterpret_cast<char*>(this));
long SSLSocket::BIOCallback(
BIO *in_bio,
int in_operation,
const char* in_arg1,
int in_arg2,
long in_arg3,
long in_returnValue)
{
// in_bio isn't provided for BIO_CB_FREE.
if (BIO_CB_FREE == in_operation)
{
return in_returnValue;
}
assert(in_arg1);
return reinterpret_cast<SSLSocket*>(BIO_get_callback_arg(in_bio))->DoBIOCallback(
in_bio,
in_operation,
in_arg1,
in_arg2,
in_arg3,
in_returnValue);
}
long SSLSocket::DoBIOCallback(
BIO *in_bio,
int in_operation,
const char* in_arg1,
int in_arg2,
long in_arg3,
long in_returnValue)
{
UNUSED(in_arg3);
// We only care about the return callback for BIO_read()
if ((BIO_CB_READ | BIO_CB_RETURN) == in_operation)
{
const int shouldRetry = BIO_should_retry(in_bio);
const int bytesRequested = in_arg2;
assert(bytesRequested > 0);
if ((in_returnValue <= 0) && shouldRetry)
{
m_needBytes = bytesRequested;
}
else if ((in_returnValue > 0) && (in_returnValue < bytesRequested) && shouldRetry)
{
m_needBytes = bytesRequested - in_returnValue;
}
else
{
m_needBytes = 0;
}
}
return in_returnValue;
}
Then I use m_needBytes to decide how much to write in BIO_write().
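The bookkeeping in DoBIOCallback can be pulled out into a pure function, which makes the three cases easier to check without linking against OpenSSL. The sketch below mirrors the branches above; NeededBytes is a hypothetical helper, and shouldRetry stands for the result of BIO_should_retry.

```cpp
#include <cassert>

// How many more bytes the caller should feed into the read BIO, given what
// the last BIO_read asked for and what it returned.
long NeededBytes(int bytesRequested, long returnValue, bool shouldRetry) {
    assert(bytesRequested > 0);
    if (!shouldRetry) {
        return 0;                             // hard failure or clean EOF
    }
    if (returnValue <= 0) {
        return bytesRequested;                // nothing was available
    }
    if (returnValue < bytesRequested) {
        return bytesRequested - returnValue;  // partial read
    }
    return 0;                                 // request fully satisfied
}
```

The value is only a lower bound for the current internal read; a later read in the same SSL_read call may ask for more.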

Create a function with unique function pointer in runtime

When calling WinAPI functions that take callbacks as arguments, there's usually a special parameter for passing arbitrary data to the callback. When there's no such parameter (e.g. SetWinEventHook), the only way to tell which API call resulted in a given callback invocation is to use distinct callbacks. When all the places where the API is called are known at compile time, we can always create a class template with a static method and instantiate it with different template arguments at the different call sites. That's a hell of a lot of work, and I don't like doing so.
How do I create callback functions at runtime so that they have different function pointers?
I saw a solution (sorry, in Russian) with runtime assembly generation, but it wasn't portable across x86/x64 architectures.
You can use the closure API of libffi. It allows you to create trampolines each with a different address. I implemented a wrapping class here, though that's not finished yet (only supports int arguments and return type, you can specialize detail::type to support more than just int). A more heavyweight alternative is LLVM, though if you're dealing only with C types, libffi will do the job fine.
I've come up with this solution which should be portable (but I haven't tested it):
#define ID_PATTERN 0x11223344
#define SIZE_OF_BLUEPRINT 128 // needs to be adapted if uniqueCallbackBlueprint is complex...
typedef int (__cdecl * UNIQUE_CALLBACK)(int arg);
/* blueprint for unique callback function */
int uniqueCallbackBlueprint(int arg)
{
int id = ID_PATTERN;
printf("%x: Hello unique callback (arg=%d)...\n", id, arg);
return (id);
}
/* create a new unique callback */
UNIQUE_CALLBACK createUniqueCallback(int id)
{
UNIQUE_CALLBACK result = NULL;
char *pUniqueCallback;
char *pFunction;
int pattern = ID_PATTERN;
char *pPattern;
char *startOfId;
int i;
int patterns = 0;
pUniqueCallback = malloc(SIZE_OF_BLUEPRINT);
if (pUniqueCallback != NULL)
{
pFunction = (char *)uniqueCallbackBlueprint;
#if defined(_DEBUG)
pFunction += 0x256; // variable offset depending on debug information????
#endif /* _DEBUG */
memcpy(pUniqueCallback, pFunction, SIZE_OF_BLUEPRINT);
result = (UNIQUE_CALLBACK)pUniqueCallback;
/* replace ID_PATTERN with requested id */
pPattern = (char *)&pattern;
startOfId = NULL;
for (i = 0; i < SIZE_OF_BLUEPRINT; i++)
{
if (pUniqueCallback[i] == *pPattern)
{
if (pPattern == (char *)&pattern)
startOfId = &(pUniqueCallback[i]);
if (pPattern == ((char *)&pattern) + sizeof(int) - 1)
{
pPattern = (char *)&id;
for (i = 0; i < sizeof(int); i++)
{
*startOfId++ = *pPattern++;
}
patterns++;
break;
}
pPattern++;
}
else
{
pPattern = (char *)&pattern;
startOfId = NULL;
}
}
printf("%d pattern(s) replaced\n", patterns);
if (patterns == 0)
{
free(pUniqueCallback);
result = NULL;
}
}
return (result);
}
Usage is as follows:
int main(void)
{
UNIQUE_CALLBACK callback;
int id;
int i;
id = uniqueCallbackBlueprint(5);
printf(" -> id = %x\n", id);
callback = createUniqueCallback(0x4711);
if (callback != NULL)
{
id = callback(25);
printf(" -> id = %x\n", id);
}
id = uniqueCallbackBlueprint(15);
printf(" -> id = %x\n", id);
getch();
return (0);
}
I've noted an interesting behavior when compiling with debug information (Visual Studio). The address obtained by pFunction = (char *)uniqueCallbackBlueprint; is off by a variable number of bytes. The difference can be found with the debugger, which displays the correct address. This offset changes from build to build, and I assume it has something to do with the debug information. (A likely cause is incremental linking, where the function's address is that of a jump thunk rather than of the function body.) This is no problem for the release build, so maybe this should be put into a library which is built as "release".
Another thing to consider would be the byte alignment of pUniqueCallback, which may be an issue; aligning the beginning of the function to a 64-bit boundary is not hard to add to this code. Note also that memory returned by malloc is normally not executable on systems that enforce DEP/NX, so the copy would have to live in memory allocated as executable (on Windows, for example, VirtualAlloc with PAGE_EXECUTE_READWRITE).
Within the blueprint function you can implement anything you want (remember to update SIZE_OF_BLUEPRINT so you don't miss the tail of your function). The function is compiled once and the generated code is re-used at runtime. The initial value of id is replaced when creating the unique function, so each copy can process its own id.
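A middle ground between the compile-time template approach mentioned in the question and runtime code generation is a fixed pool of pre-instantiated trampolines that are handed out at runtime. The sketch below is one way to do that in C++14, under the assumption that an upper bound on simultaneously live callbacks is acceptable. Callback, kPoolSize, Handlers, Trampoline, MakeTable and Allocate are hypothetical names, the int(int) signature is only an example, and the pool is not thread-safe.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

using Callback = int (*)(int);        // example C-style callback signature
constexpr std::size_t kPoolSize = 8;  // assumed bound on live callbacks

// One handler per slot; an empty std::function marks a free slot.
inline std::array<std::function<int(int)>, kPoolSize>& Handlers() {
    static std::array<std::function<int(int)>, kPoolSize> handlers;
    return handlers;
}

// Each instantiation is a separate function with its own address that
// forwards to the handler stored in its slot.
template <std::size_t Slot>
int Trampoline(int arg) {
    return Handlers()[Slot](arg);
}

// Expands to { &Trampoline<0>, &Trampoline<1>, ... } at compile time.
template <std::size_t... Slots>
std::array<Callback, sizeof...(Slots)> MakeTable(std::index_sequence<Slots...>) {
    return {{&Trampoline<Slots>...}};
}

// Binds `handler` to the first free slot and returns that slot's function
// pointer, or nullptr when the pool is exhausted.
Callback Allocate(std::function<int(int)> handler) {
    static const std::array<Callback, kPoolSize> table =
        MakeTable(std::make_index_sequence<kPoolSize>{});
    for (std::size_t i = 0; i < kPoolSize; ++i) {
        if (!Handlers()[i]) {
            Handlers()[i] = std::move(handler);
            return table[i];
        }
    }
    return nullptr;
}
```

The template is written once and the index sequence generates all instantiations, so the manual work the question complains about mostly disappears; the price is the fixed pool size.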