SetPerTcpConnectionEStats fails and can't get GetPerTcpConnectionEStats multiple times c++

SetPerTcpConnectionEStats fails and can't get GetPerTcpConnectionEStats multiple times c++ - c++

I am following the example in https://learn.microsoft.com/en-gb/windows/win32/api/iphlpapi/nf-iphlpapi-getpertcp6connectionestats?redirectedfrom=MSDN to get the TCP statistics. Although, I got it working and get the statistics in the first place, still I want to record them every a time interval (which I haven't managed to do so), and I have the following questions.
The SetPerTcpConnectionEStats () fails with status != NO_ERROR and equal to 5. Although, it fails, I can get the statistics. Why?
I want to get the statistics every, let's say 1 second. I have tried two different ways; a) to use a while loop and use a std::this_thread::sleep_for(1s), where I could get the statistics every ~1sec, but the whole app was stalling (is it because of the this), I supposed that I am blocking the operation of the main, and b) (since a) failed) I tried to call TcpStatistics() from another function (in different class) that is triggered every 1 sec (I store clientConnectRow to a global var). However, in that case (b), GetPerTcpConnectionEStats() fails with winStatus = 1214 (ERROR_INVALID_NETNAME) and of course TcpStatistics() cannot get any of the statistics.
a)
ClassB::ClassB()
{
UINT winStatus = GetTcpRow(localPort, hostPort, MIB_TCP_STATE_ESTAB, (PMIB_TCPROW)clientConnectRow);
ToggleAllEstats(clientConnectRow, TRUE);
thread t1(&ClassB::TcpStatistics, this, clientConnectRow);
t1.join();
}
ClassB::TcpStatistics()
{
while (true)
{
GetAndOutputEstats(row, TcpConnectionEstatsBandwidth)
// some more code here
this_thread::sleep_for(milliseconds(1000));
}
}
b)
ClassB::ClassB()
{
MIB_TCPROW client4ConnectRow;
void* clientConnectRow = NULL;
clientConnectRow = &client4ConnectRow;
UINT winStatus = GetTcpRow(localPort, hostPort, MIB_TCP_STATE_ESTAB, (PMIB_TCPROW)clientConnectRow);
m_clientConnectRow = clientConnectRow;
TcpStatistics();
}
ClassB::TcpStatistics()
{
ToggleAllEstats(m_clientConnectRow , TRUE);
void* row = m_clientConnectRow;
GetAndOutputEstats(row, TcpConnectionEstatsBandwidth)
// some more code here
}
ClassB::GetAndOutputEstats(void* row, TCP_ESTATS_TYPE type)
{
//...
winStatus = GetPerTcpConnectionEStats((PMIB_TCPROW)row, type, NULL, 0, 0, ros, 0, rosSize, rod, 0, rodSize);
if (winStatus != NO_ERROR) {wprintf(L"\nGetPerTcpConnectionEStats %s failed. status = %d", estatsTypeNames[type], winStatus); //
}
else { ...}
}
ClassA::FunA()
{
classB_ptr->TcpStatistics();
}

I found a work around for the second part of my question. I am posting it here, in case someone else find it useful. There might be other solutions too, more advanced, but this is how I did it myself. We have to first Obtain MIB_TCPROW corresponding to the TCP connection and then to Enable Estats collection before dumping current stats. So, what I did was to add all of these in a function and call this instead, every time I want to get the stats.
void
ClassB::FunSetTcpStats()
{
MIB_TCPROW client4ConnectRow;
void* clientConnectRow = NULL;
clientConnectRow = &client4ConnectRow;
//this is for the statistics
UINT winStatus = GetTcpRow(lPort, hPort, MIB_TCP_STATE_ESTAB, (PMIB_TCPROW)clientConnectRow); //lPort & hPort in htons!
if (winStatus != ERROR_SUCCESS) {
wprintf(L"\nGetTcpRow failed on the client established connection with %d", winStatus);
return;
}
//
// Enable Estats collection and dump current stats.
//
ToggleAllEstats(clientConnectRow, TRUE);
TcpStatistics(clientConnectRow); // same as GetAllEstats() in msdn
}

Related

per thread c++ guard to prevent re-entrant function calls

I've got function that call the registry that can fail and print the failure reason.
This function can also be called directly or indirectly from the context of a dedicated built-in printing function, and I wish to avoid printing the reason in this case to avoid endless recursion.
I can use thread_local to define per thread flag to avoid calling the print function from this function, but I guess it's rather widespread problem, so I'm looking for std implementation for this guard or any other well debugged code.
Here's an example that just made to express the problem.
Each print function comes with log level, and it's being compared with the current log level threshold that reside in registry. if lower than threshold, the function returns without print. However, in order to get the threshold, additional print can be made, so I wanted to create a guard that will prevent the print from getPrintLevelFromRegistry if it's called from print
int getPrintLevelFromRegistry() {
int value = 0;
DWORD res = RegGetValueW("//Software//...//logLevel" , &value);
if (res != ERROR_SUCCESS) {
print("couldn't find registry key");
return 0;
}
return value;
}
void print(const char * text, int printLoglevel) {
if (printLogLevel < getPrintLevelFromRegistry()) {
return;
}
// do the print itself
...
}
Thanks !

The root of the problem is that you are attempting to have your logging code log itself. Rather than some complicated guard, consider the fact that you really don't need to log a registry read. Just have it return a default value and just log the error to the console.
int getPrintLevelFromRegistry() {
int value = 0;
DWORD res = RegGetValueW("//Software//...//logLevel" , &value);
if (res != ERROR_SUCCESS) {
OutputDebugStringA("getPrintLevelFromRegistry: Can't read from registry\r\n");
}
return value;
}
Further, it's OK to read from the registry on each log statement, but it's redundant and unnecessary.
Better:
int getPrintLevelFromRegistry() {
static std::atomic<int> cachedValue(-1);
int value = cachedValue;
if (value == -1) {
DWORD res = RegGetValueW("//Software//...//logLevel" , &value);
if (res == ERROR_SUCCESS) {
cachedValue = value;
}
}
return value;
}

How to correctly use grpc asynchronously (ClientAsyncReaderWriter)

I can't find a grpc example showing how to use the ClientAsyncReaderWriter (is there one?). I tried something on my own, but am having trouble with the reference counts. My question comes from tracing through the code.
struct grpc_call has a member of type gpr_refcount called ext_ref. The ClientContext C++ object wraps the grpc_call, and holds onto it in a member grpc_call *call_;. Only when ext_ref is 0, can this grpc_call pointer be deleted.
When I use grpc synchronously with ClientReader:
In its implementation it uses CreateCall() and PerformOps() to add to ext_ref (ext_ref == 2).
Then I use Pluck() which subtracts from ext_ref so that (ext_ref == 1).
The last use ~ClientContext() subtracts from ext_ref, so that ext_ref == 0 and deletes the call
But when I use grpc asynchronously with ClientAsyncReaderWriter:
First use asyncXXX(), this API use CreateCall() and register Write() (ext_ref == 2).
Then it uses AsyncNext() to get tag...which must use a write or read operator.
So ext_ref > 1 forever, unless got_event you don't handle.
I'm calling it like this:
struct Notice
{
std::unique_ptr<
grpc::ClientAsyncReaderWriter<ObserveNoticRequest, EventNotice>
> _rw;
ClientContext _context;
EventNotice _rsp;
}
Register Thread
CompletionQueue *cq = new CompletionQueue;
Notice *notice = new Notice;
notice->rw = stub->AsyncobserverNotice(&context, cq, notice);
// here context.call_.ext_ref is 2
Get CompletionQueue Event Thread
void *tag = NULL;
bool ok = false;
CompletionQueue::NextStatus got = CompletionQueue::NextStatus::TIMEOUT;
gpr_timespec deadline;
deadline.clock_type = GPR_TIMESPAN;
deadline.tv_sec = 0;
deadline.tv_nsec = 10000000;
got = cq->AsyncNext<gpr_timespec>(&tag, &ok, deadline);
if (GOT_EVENT == got) {
if (tag != NULL) {
Notice *notice = (Notice *)tag;
notice->_rw->Read(&_rsp, notice);
// here context.call_.ext_ref is 2.
// now I want to stop this CompletionQueue.
delete notice;
// use ~ClientContext(), ext_ref change to 1
// but only ext_ref == 0, call_ be deleted
}
}

Take a look at this file, client_async.cc, for good use of the ClientAsyncReaderWriter. If you still have confusion, please create a very clean reproduction of the issue, and we will look into it further.

using SetJob function in printspooler API

I am using a getJobs function I found to get current print jobs in my printer (not print device). So far I can tell how many print jobs are in the queue of my virtual printer, and I have the information from JOB_INFO_ strucs to mess with, but I am trying to use SetJob() to delete the job from the print queue (after storing the information I want). With this I get an error:
0xC0000005: Access violation reading location 0x00002012.
My question is, what exactly am I doing wrong? I've tried putting 0 as level and NULL for pJob, then I do not get an error but the print job is still in the queue. I can't seem to find anyone else that has examples with explenations.
BOOL getJobs(LPTSTR printerName) {
HANDLE hPrinter; //Printer handle variable
DWORD dwNeeded, dwReturned, i; //Mem needed, jobs found, variable for loop
JOB_INFO_1 *pJobInfo; // pointer to structure
//Find printer handle
if (!OpenPrinter(printerName, &hPrinter, NULL)) {
return FALSE;
}
//Get amount of memory needed
if (!EnumJobs(hPrinter, 0, 0xFFFFFFFF, 1, NULL, 0, &dwNeeded, &dwReturned)) {
if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
ClosePrinter(hPrinter);
return FALSE;
}
}
//Allocate the memory, if you cant end function
if ((pJobInfo = (JOB_INFO_1 *)malloc(dwNeeded)) == NULL) {
ClosePrinter(hPrinter);
return FALSE;
}
//Get job info struc
if (!EnumJobs(hPrinter, 0, 0xFFFFFFFF, 1, (LPBYTE)pJobInfo, dwNeeded, &dwNeeded, &dwReturned)) {
ClosePrinter(hPrinter);
free(pJobInfo);
return FALSE;
}
//If there are printjobs, get document name and data type. put into docinfo1 struc and return true
if (dwReturned > 0){
docinfo1.pDocName = pJobInfo[1].pDocument;
docinfo1.pDatatype = pJobInfo[1].pDatatype;
SetJob(hPrinter, pJobInfo[1].JobId, 2, (LPBYTE)pJobInfo, JOB_CONTROL_DELETE);
ClosePrinter(hPrinter);
free(pJobInfo);
return TRUE;
}
//No print jobs, Free memory and finish up :>
ClosePrinter(hPrinter);
free(pJobInfo);
return FALSE;
}
Help is much appreciated.
EDIT: The issue ended up being a simple mistake in where I told SetJob the wrong struct type.

Aside from specifying a JOB_INFO_2 struct when you're actually passing a JOB_INFO_1 struct (as was pointed out in comments), you're also trying to use the second element of pJobInfo[], which may not even exist:
SetJob(hPrinter, pJobInfo[1].JobId, 2, (LPBYTE)pJobInfo, JOB_CONTROL_DELETE);
Change it to:
SetJob(hPrinter, pJobInfo[0].JobId, 1, (LPBYTE)pJobInfo, JOB_CONTROL_DELETE);
Or better yet, do this because all you need to delete a print job is the job ID:
SetJob(hPrinter, pJobInfo[0].JobId, 0, NULL, JOB_CONTROL_DELETE);

How to get the names of all power schemes in windows 7 in C++?

I have to get the names of all available power schemes in windows 7. I try to enumerate them with the power management functions and I do get the right amount but when I call "PowerReadFriendlyName" (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372740%28v=vs.85%29.aspx) it works sometimes and fails sometimes:
UCHAR displayBuffer[256];
DWORD displayBufferSize = sizeof(displayBuffer);
GUID buffer;
DWORD bufferSize = sizeof(buffer);
int index;
int fail=0,ok=0;
//
for(index = 0 ; ; index++)
{ ZeroMemory(&buffer, sizeof(buffer));
ZeroMemory(&displayBuffer, sizeof(displayBuffer));
if (PowerEnumerate(NULL,NULL,NULL, ACCESS_SCHEME,index,(UCHAR*)&buffer,&bufferSize) == ERROR_SUCCESS)
{ if (PowerReadFriendlyName(NULL, &buffer,&NO_SUBGROUP_GUID,NULL,displayBuffer,&displayBufferSize) == ERROR_SUCCESS)
{ ok++;
// stuff to todo
}
else
{ fail++;
// why?
}
}
else
{ break;
}
}
At first I had 2 custom power schemes and the retrieval of their name always failed. The standard 3 power schemes (high performance, balanced, power saver) always worked.
So I thought it had to do with the custom schemes and I manually added 2 more of them. But as it turns out now one of them actually works and I can get its name (both were derived from balanced).
I then manually added another 2 custom schemes (this time derived from power saver) and this time both seemed to work. I now have 9 in total and I can get the names of 6 of them. I cannot get the name of the 2 original custom power schemes (both derived from balanced) as well as the 2nd of the ones I added the first time.
When I type "powercfg -list" in a console I can get the list of all power schemes, but how can I get the names of all power schemes reliably in c++ without redirecting/parsing the console but using the windows power management functions?

The documentation of the PowerReadFriendlyName() function does not mention that the variable holding the length of the buffer gets overwritten in a successful call with a non-NULL buffer. It has therefore be set before each call of PowerReadFriendlyName() or it can fail:
UCHAR displayBuffer[2048];
DWORD displayBufferSize;
GUID buffer;
DWORD bufferSize = sizeof(buffer);
int index;
int fail=0,ok=0;
//
for(index = 0 ; ; index++)
{ ZeroMemory(&buffer, sizeof(buffer));
ZeroMemory(&displayBuffer, sizeof(displayBuffer));
if (PowerEnumerate(NULL,NULL,NULL, ACCESS_SCHEME,index,(UCHAR*)&buffer,&bufferSize) == ERROR_SUCCESS)
{ displayBufferSize = sizeof(displayBuffer);
if (PowerReadFriendlyName(NULL, &buffer,&NO_SUBGROUP_GUID,NULL,displayBuffer,&displayBufferSize) == ERROR_SUCCESS)
{ ok++;
// stuff to todo
}
else
{ fail++;
}
}
else
{ break;
}
}

Buffer communication speed nightmare

I'm trying to use buffers to communicate between several 'layers' (threads) in my program and now that I have visual output of what's going on inside, I realize there's a devastating amount of time being eaten up in the process of using these buffers.
Here's some notes about what's going on in my code.
when the rendering mode is triggered in this thread, it begins sending as many points as it can to the layer (thread) below it
the points from the lower thread are then processed and returned to this thread via the output buffer of the lower thread
points received back are mapped (for now) as white pixels in the D3D surface
if I bypass the buffer and put the points directly into the surface pixels, it only takes about 3 seconds to do the whole job
if I hand the point down and then have the lower layer pass it right back up, skipping any actual number-crunching, the whole job takes about 30 minutes (which makes the whole program useless)
changing the size of my buffers has no noticeable effect on the speed
I was originally using MUTEXes in my buffers but have eliminated them in attempt the fix the problem
Is there something I can do differently to fix this speed problem I'm having?
...something to do with the way I'm handling these messages???
Here's my code
I'm very sorry that it's such a mess. I'm having to move way too fast on this project and I've left a lot of pieces laying around in comments where I've been experimenting.
DWORD WINAPI CONTROLSUBSYSTEM::InternalExProcedure(__in LPVOID lpSelf)
{
XMSG xmsg;
LPCONTROLSUBSYSTEM lpThis = ((LPCONTROLSUBSYSTEM)lpSelf);
BOOL bStall;
BOOL bRendering = FALSE;
UINT64 iOutstandingPoints = 0; // points that are out being tested
UINT64 iPointsDone = 0;
UINT64 iPointsTotal = 0;
BOOL bAssigning;
DOUBLE dNextX;
DOUBLE dNextY;
while(1)
{
if( lpThis->hwTargetWindow!=NULL && lpThis->d3ddev!=NULL )
{
lpThis->d3ddev->Clear(0,NULL,D3DCLEAR_TARGET,D3DCOLOR_XRGB(0,0,0),1.0f,0);
if(lpThis->d3ddev->BeginScene())
{
lpThis->d3ddev->StretchRect(lpThis->sfRenderingCanvas,NULL,lpThis->sfBackBuffer,NULL,D3DTEXF_NONE);
lpThis->d3ddev->EndScene();
}
lpThis->d3ddev->Present(NULL,NULL,NULL,NULL);
}
//bStall = TRUE;
// read input buffer
if(lpThis->bfInBuffer.PeekMessage(&xmsg))
{
bStall = FALSE;
if( HIBYTE(xmsg.wType)==HIBYTE(CONT_MSG) )
{
// take message off
lpThis->bfInBuffer.GetMessage(&xmsg);
// double check consistency
if( HIBYTE(xmsg.wType)==HIBYTE(CONT_MSG) )
{
switch(LOBYTE(xmsg.wType))
{
case SETRESOLUTION_MSG:
lpThis->iAreaWidth = (UINT)xmsg.dptPoint.X;
lpThis->iAreaHeight = (UINT)xmsg.dptPoint.Y;
lpThis->sfRenderingCanvas->Release();
if(lpThis->d3ddev->CreateOffscreenPlainSurface(
(UINT)xmsg.dptPoint.X,(UINT)xmsg.dptPoint.Y,
D3DFMT_X8R8G8B8,
D3DPOOL_DEFAULT,
&(lpThis->sfRenderingCanvas),
NULL)!=D3D_OK)
{
MessageBox(NULL,"Error resizing surface.","ERROR",MB_ICONERROR);
}
else
{
D3DLOCKED_RECT lrt;
if(D3D_OK == lpThis->sfRenderingCanvas->LockRect(&lrt,NULL,0))
{
lpThis->iPitch = lrt.Pitch;
VOID *data;
data = lrt.pBits;
ZeroMemory(data,lpThis->iPitch*lpThis->iAreaHeight);
lpThis->sfRenderingCanvas->UnlockRect();
MessageBox(NULL,"Surface Resized","yay",0);
}
else
{
MessageBox(NULL,"Error resizing surface.","ERROR",MB_ICONERROR);
}
}
break;
case SETCOLORMETHOD_MSG:
break;
case SAVESNAPSHOT_MSG:
lpThis->SaveSnapshot();
break;
case FORCERENDER_MSG:
bRendering = TRUE;
iPointsTotal = lpThis->iAreaHeight*lpThis->iPitch;
iPointsDone = 0;
MessageBox(NULL,"yay, render something!",":o",0);
break;
default:
break;
}
}// else, lost this message
}
else
{
if( HIBYTE(xmsg.wType)==HIBYTE(MATH_MSG) )
{
XMSG xmsg2;
switch(LOBYTE(xmsg.wType))
{
case RESETFRAME_MSG:
case ZOOMIN_MSG:
case ZOOMOUT_MSG:
case PANUP_MSG:
case PANDOWN_MSG:
case PANLEFT_MSG:
case PANRIGHT_MSG:
// tell self to start a render
xmsg2.wType = CONT_MSG|FORCERENDER_MSG;
if(lpThis->bfInBuffer.PutMessage(&xmsg2))
{
// pass it down
while(!lpThis->lplrSubordinate->PutMessage(&xmsg));
// message passed so pull it from buffer
lpThis->bfInBuffer.GetMessage(&xmsg);
}
break;
default:
// pass it down
if(lpThis->lplrSubordinate->PutMessage(&xmsg))
{
// message passed so pull it from buffer
lpThis->bfInBuffer.GetMessage(&xmsg);
}
break;
}
}
else if( lpThis->lplrSubordinate!=NULL )
// pass message down
{
if(lpThis->lplrSubordinate->PutMessage(&xmsg))
{
// message passed so pull it from buffer
lpThis->bfInBuffer.GetMessage(&xmsg);
}
}
}
}
// read output buffer from subordinate
if( lpThis->lplrSubordinate!=NULL && lpThis->lplrSubordinate->PeekMessage(&xmsg) )
{
bStall = FALSE;
if( xmsg.wType==(REPLY_MSG|TESTPOINT_MSG) )
{
// got point test back
D3DLOCKED_RECT lrt;
if(D3D_OK == lpThis->sfRenderingCanvas->LockRect(&lrt,NULL,0))
{
INT pitch = lrt.Pitch;
VOID *data;
data = lrt.pBits;
INT Y=dRound((xmsg.dptPoint.Y/(DOUBLE)100)*((DOUBLE)lpThis->iAreaHeight));
INT X=dRound((xmsg.dptPoint.X/(DOUBLE)100)*((DOUBLE)pitch));
// decide color
if( xmsg.iNum==0 )
((WORD *)data)[X+Y*pitch] = 0xFFFFFFFF;
else
((WORD *)data)[X+Y*pitch] = 0xFFFFFFFF;
// message handled so remove from buffer
lpThis->lplrSubordinate->GetMessage(&xmsg);
lpThis->sfRenderingCanvas->UnlockRect();
}
}
else if(lpThis->bfOutBuffer.PutMessage(&xmsg))
{
// message sent so pull the real one off the buffer
lpThis->lplrSubordinate->GetMessage(&xmsg);
}
}
if( bRendering && lpThis->lplrSubordinate!=NULL )
{
bAssigning = TRUE;
while(bAssigning)
{
dNextX = 100*((DOUBLE)(iPointsDone%lpThis->iPitch))/((DOUBLE)lpThis->iPitch);
dNextY = 100*(DOUBLE)((INT)(iPointsDone/lpThis->iPitch))/(DOUBLE)(lpThis->iAreaHeight);
xmsg.dptPoint.X = dNextX;
xmsg.dptPoint.Y = dNextY;
//
//xmsg.iNum = 0;
//xmsg.wType = REPLY_MSG|TESTPOINT_MSG;
//
xmsg.wType = MATH_MSG|TESTPOINT_MSG;
/*D3DLOCKED_RECT lrt;
if(D3D_OK == lpThis->sfRenderingCanvas->LockRect(&lrt,NULL,0))
{
INT pitch = lrt.Pitch;
VOID *data;
data = lrt.pBits;
INT Y=dRound((dNextY/(DOUBLE)100)*((DOUBLE)lpThis->iAreaHeight));
INT X=dRound((dNextX/(DOUBLE)100)*((DOUBLE)pitch));
((WORD *)data)[X+Y*pitch] = 0xFFFFFFFF;
lpThis->sfRenderingCanvas->UnlockRect();
}
iPointsDone++;
if( iPointsDone>=iPointsTotal )
{
MessageBox(NULL,"done rendering","",0);
bRendering = FALSE;
bAssigning = FALSE;
}
*/
if( lpThis->lplrSubordinate->PutMessage(&xmsg) )
{
bStall = FALSE;
iPointsDone++;
if( iPointsDone>=iPointsTotal )
{
MessageBox(NULL,"done rendering","",0);
bRendering = FALSE;
bAssigning = FALSE;
}
}
else
{
bAssigning = FALSE;
}
}
}
//if( bStall )
//Sleep(10);
}
return 0;
}
}
(still getting used to this forum's code block stuff)
Edit:
Here's an example that I perceive to be similar in concept, although this example consumes the messages it produces in the same thread.
#include <Windows.h>
#include "BUFFER.h"
int main()
{
BUFFER myBuffer;
INT jobsTotal = 1024*768;
INT currentJob = 0;
INT jobsOut = 0;
XMSG xmsg;
while(1)
{
if(myBuffer.PeekMessage(&xmsg))
{
// do something with message
// ...
// if successful, remove message
myBuffer.GetMessage(&xmsg);
jobsOut--;
}
while( currentJob<jobsTotal )
{
if( myBuffer.PutMessage(&xmsg) )
{
currentJob++;
jobsOut++;
}
else
{
// buffer is full at the moment
// stop for now and put more on later
break;
}
}
if( currentJob==jobsTotal && jobsOut==0 )
{
MessageBox(NULL,"done","",0);
break;
}
}
return 0;
}
This example also runs in about 3 seconds, as opposed to 30 minutes.
Btw, if anybody knows why visual studio keeps trying to make me say PeekMessageA and GetMessageA instead of the actual names I defined, that would be nice to know as well.

Locking and Unlocking an entire rect to change a single point is probably not very efficient, you might be better off generating a list of points you intend to modify and then locking the rect once, iterating over that list and modifying all the points, and then unlocking the rect.
When you lock the rect you are effectively stalling concurrent access to it, so its like a mutex for the GPU in that respect - then you only modify a single pixel. Doing this repeatedly for each pixel will constantly stall the GPU. You could use D3DLOCK_NOSYSLOCK to avoid this to some extent, but I'm not sure if it will play nicely in the larger context of your program.
I'm obviously not entirely sure what the goal of your algorithm is, but if you are trying to parallel process pixels on a d3d surface, then i think the best approach would be via a shader on the GPU.
Where you basically generate an array in system memory, populate it with "input" values on a per point/pixel basis, then generate a texture on a GPU from the array. Next you paint the texture to a full screen quad, and then render it with a pixel shader to some render target. The shader can be coded to process each point in whatever way you like, the GPU will take care of optimizing parallelization. Then you generate a new texture from that render target and then you copy that texture into a system memory array. And then you can extract all your outputs from that array. You can also apply multiple shaders to the render target result back into the render target to pipeline multiple transformations if needed.

A couple notes:
Don't write your own messape-passing code. It may be correct and slow, or fast and buggy. It takes a lot of experience to design code that's fast and then getting it bug-free is really hard, because debugging threaded code is hard. Win32 provides a couple of efficient threadsafe queues: SList and the window message queue.
Your design splits up work in the worst possible way. Passing information between threads is expensive even under the best circumstances, because it causes cache contention, both on the data and on the synchronization objects. It's MUCH better to split your work into distinct non-interacting (or minimize interaction) datasets and give each to a separate thread, that is then responsible for all stages of processing that dataset.

Don't poll.
That's likely to be the heart of the problem. You have a task continually calling peekmessage and probably finding nothing there. This will just eat all available CPU. Any task that wants to post messages is unlikely to receive any CPU time to acheive this.
I can't remember how you'd achieve this with the windows message queue (probably WaitMessage or some variant) but typically you might implement this with a counting semaphore. When the consumer wants data, it waits for the semaphore to be signalled. When the producer has data, it signals the semaphore.

I managed to resolve it by redesigning the whole thing
It now passes huge payloads instead of individual tasks
(I'm the poster)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js