Why does the code sample below cause one thread to execute far more often than the other, while the same code using a mutex does not?
#include <windows.h>
#include <conio.h>
#include <process.h>
#include <iostream>
using namespace std;

typedef struct _THREAD_INFO_ {
    COORD coord;      // a structure containing x and y coordinates
    INT threadNumber; // each thread has its own number
    INT count;
} THREAD_INFO, *PTHREAD_INFO;

void gotoxy(int x, int y);

BOOL g_bRun;
CRITICAL_SECTION g_cs;

unsigned __stdcall ThreadFunc(void* pArguments)
{
    PTHREAD_INFO info = (PTHREAD_INFO)pArguments;
    while(g_bRun)
    {
        EnterCriticalSection(&g_cs);
        //if(TryEnterCriticalSection(&g_cs))
        //{
            gotoxy(info->coord.X, info->coord.Y);
            cout << "T" << info->threadNumber << ": " << info->count;
            info->count++;
            LeaveCriticalSection(&g_cs);
        //}
    }
    ExitThread(0); // note: simply returning is preferred for _beginthreadex threads
    return 0;      // unreachable
}
int main(void)
{
    unsigned int id0, id1;  // a place to store the thread ID returned from _beginthreadex
    HANDLE h0, h1;          // handles to threads
    THREAD_INFO tInfo[2];   // only one of these - not optimal!
    g_bRun = TRUE;
    ZeroMemory(&tInfo, sizeof(tInfo)); // win32 function - memset(&buffer, 0, sizeof(buffer))
    InitializeCriticalSection(&g_cs);

    // set up data for the first thread
    tInfo[0].threadNumber = 1;
    tInfo[0].coord.X = 0;
    tInfo[0].coord.Y = 0;
    h0 = (HANDLE)_beginthreadex(
        NULL,        // no security attributes
        0,           // default stack size
        &ThreadFunc, // pointer to function
        &tInfo[0],   // each thread gets its own data to output
        0,           // 0 for running or CREATE_SUSPENDED
        &id0);       // thread id is returned here

    // set up data for the second thread
    tInfo[1].threadNumber = 2;
    tInfo[1].coord.X = 15;
    tInfo[1].coord.Y = 0;
    h1 = (HANDLE)_beginthreadex(
        NULL,        // no security attributes
        0,           // default stack size
        &ThreadFunc, // pointer to function
        &tInfo[1],   // each thread gets its own data to output
        0,           // 0 for running or CREATE_SUSPENDED
        &id1);       // thread id is returned here

    _getch();
    g_bRun = FALSE;
    // note: main returns without waiting on h0/h1 or closing the handles
    return 0;
}
void gotoxy(int x, int y) // x=column position and y=row position
{
    HANDLE hdl = GetStdHandle(STD_OUTPUT_HANDLE);
    COORD coords;
    coords.X = x;
    coords.Y = y;
    SetConsoleCursorPosition(hdl, coords);
}
That may not answer your question, but the behavior of critical sections changed on Windows Server 2003 SP1 and later.
If you have bugs related to critical sections on Windows 7 that you can't reproduce on an XP machine, you may be affected by that change.
My understanding is that on Windows XP critical sections used a FIFO-based strategy that was fair to all threads, while later versions use a new strategy aimed at reducing context switching between threads.
There's a short note about this on the MSDN page about critical sections.
You may also want to check this forum post.
Critical sections, like mutexes, are designed to protect a shared resource against conflicting access (such as concurrent modification). Critical sections are not meant to replace thread priority.
You have artificially introduced a shared resource (the screen) and made it into a bottleneck. As a result, the critical section is highly contended. Since both threads have equal priority, that gives Windows no reason to prefer one thread over another. Reduction of context switches, however, is a reason to pick one thread over another. As a result of that reduction, the utilization of the shared resource goes up. That is a good thing; it means that one thread will finish a lot earlier and the other thread will finish a bit earlier.
To see the effect graphically, compare
A B A B A B A B A B
to
AAAAA BBBBB
The second sequence is shorter because there's only one switch from A to B.
In hand-wavey terms:
A critical section says the thread wants to keep control while it does several things together.
A mutex sets a marker to show "busy" so others can wait, and notifies on completion so somebody else can start. Somebody already waiting for the mutex will grab it before you can loop around and take it back.
So what you are getting with the critical section is a failure to yield between loop iterations. You might see a difference if you put Sleep(0); after LeaveCriticalSection, as sketched below.
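For example, a minimal sketch of that change against the loop above:

while (g_bRun)
{
    EnterCriticalSection(&g_cs);
    gotoxy(info->coord.X, info->coord.Y);
    cout << "T" << info->threadNumber << ": " << info->count;
    info->count++;
    LeaveCriticalSection(&g_cs);
    Sleep(0); // give up the rest of the timeslice so the waiting thread can take the lock
}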
I can't say why you're observing this particular behavior, but it probably has to do with the specifics of the implementation of each mechanism. What I CAN say is that unlocking and then immediately relocking a mutex is a bad thing. You will observe odd behavior eventually.
From some MSDN docs (http://msdn.microsoft.com/en-us/library/ms682530.aspx):
Starting with Windows Server 2003 with Service Pack 1 (SP1), threads waiting on a critical section do not acquire the critical section on a first-come, first-serve basis. This change increases performance significantly for most code
Related
Instead of having my threads wait, doing nothing, until other threads finish using shared data, I'd like them to do something else in the meantime (like checking for input, or re-rendering the previous frame in the queue, and then coming back to check whether the other thread is done with its task).
I think the code I've written below does that, and it "seems" to work in the tests I've performed, but I don't really understand exactly how std::memory_order_acquire and std::memory_order_release work, so I'd like some expert advice on whether I'm using them correctly to achieve the behaviour I want.
Also, I've never seen multithreading done this way before, which makes me a bit worried. Are there good reasons not to have a thread do other tasks instead of waiting?
/* test program
   intended to test if atomic flags can be used to perform other tasks while shared
   data is in use, instead of blocking
   each thread enters the flag-protected part of the loop 20 times before quitting
   if the flag indicates that the if block is already in use, the thread is intended to
   execute the code in the else block (only up to 5 times to avoid cluttering the output)
   debug note: this doesn't work with std::cout because all the threads are using it at once
   and it's not thread safe, so it all gets garbled. at least it didn't crash
   real world usage:
   one thread renders and draws to the screen, while the other checks for input and
   provides frameData for the renderer to use. neither thread should ever block */
#include <fstream>
#include <atomic>
#include <thread>
#include <string>

struct ThreadData {
    int numTimesToWriteToDebugIfBlockFile;
    int numTimesToWriteToDebugElseBlockFile;
};

class SharedData {
public:
    SharedData() {
        threadData = new ThreadData[10];
        for (int a = 0; a < 10; ++a) {
            threadData[a] = { 20, 5 };
        }
        flag.clear();
    }
    ~SharedData() {
        delete[] threadData;
    }
    void runThread(int threadID) {
        while (this->threadData[threadID].numTimesToWriteToDebugIfBlockFile > 0) {
            // test_and_set returns the flag's previous value, so false means
            // this thread just acquired the "lock"
            if (!this->flag.test_and_set(std::memory_order_acquire)) {
                std::string fileName = "debugIfBlockOutputThread#";
                fileName += std::to_string(threadID);
                fileName += ".txt";
                std::ofstream writeFile(fileName.c_str(), std::ios::app);
                writeFile << threadID << ", running, output #" << this->threadData[threadID].numTimesToWriteToDebugIfBlockFile << std::endl;
                writeFile.close();
                this->threadData[threadID].numTimesToWriteToDebugIfBlockFile -= 1;
                this->flag.clear(std::memory_order_release);
            }
            else {
                if (this->threadData[threadID].numTimesToWriteToDebugElseBlockFile > 0) {
                    std::string fileName = "debugElseBlockOutputThread#";
                    fileName += std::to_string(threadID);
                    fileName += ".txt";
                    std::ofstream writeFile(fileName.c_str(), std::ios::app);
                    writeFile << threadID << ", standing by, output #" << this->threadData[threadID].numTimesToWriteToDebugElseBlockFile << std::endl;
                    writeFile.close();
                    this->threadData[threadID].numTimesToWriteToDebugElseBlockFile -= 1;
                }
            }
        }
    }
private:
    ThreadData* threadData;
    std::atomic_flag flag;
};

void runThread(int threadID, SharedData* sharedData) {
    sharedData->runThread(threadID);
}

int main() {
    SharedData sharedData;
    std::thread thread[10];
    for (int a = 0; a < 10; ++a) {
        thread[a] = std::thread(runThread, a, &sharedData);
    }
    for (int a = 0; a < 10; ++a) {
        thread[a].join();
    }
    return 0;
}
The memory ordering you're using here is correct.
The acquire memory order when you test and set your flag (to take your hand-written lock) has the effect, informally speaking, of preventing any memory accesses of the following code from becoming visible before the flag is tested. That's what you want, because you want to ensure that those accesses are effectively not done if the flag was already set. Likewise, the release order on the clear at the end prevents any of the preceding accesses from becoming visible after the clear, which is also what you need so that they only happen while the lock is held.
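To make that concrete, here is a minimal spinlock sketch distilled from your code (the class and method names are mine, not part of your program):

#include <atomic>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    // test_and_set returns the previous value, so false means we acquired the lock;
    // acquire ordering keeps the critical section's accesses from moving before it
    bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }
    // release ordering keeps the critical section's accesses from moving after it
    void unlock() { flag_.clear(std::memory_order_release); }
};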
However, it's probably simpler to just use a std::mutex. If you don't want to wait to take the lock, but instead do something else if you can't, that's what try_lock is for.
class SharedData {
    // ...
private:
    std::mutex my_lock; // requires <mutex>
};

// ...

if (my_lock.try_lock()) {
    // lock was taken, proceed with critical section
    my_lock.unlock();
} else {
    // lock not taken, do non-critical work
}
This may have a bit more overhead, but avoids the need to think about atomicity and memory ordering. It also gives you the option to easily do a blocking wait if that later becomes useful. If you've designed your program around an atomic_flag and later find a situation where you must wait to take the lock, you may find yourself stuck with either spinning while continually retrying the lock (which is wasteful of CPU cycles), or something like std::this_thread::yield(), which may wait for longer than necessary after the lock is available.
It's true this pattern is somewhat unusual. If there is always non-critical work to be done that doesn't need the lock, commonly you'd design your program to have a separate thread that just does the non-critical work continuously, and then the "critical" thread can just block as it waits for the lock.
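For illustration, your runThread loop rewritten around try_lock might look like this sketch (writeIfBlockFile and writeElseBlockFile are hypothetical helpers standing in for your file-writing code):

void SharedData::runThread(int threadID) {
    while (threadData[threadID].numTimesToWriteToDebugIfBlockFile > 0) {
        if (my_lock.try_lock()) {
            writeIfBlockFile(threadID);   // the work that needs the lock
            my_lock.unlock();
        } else {
            writeElseBlockFile(threadID); // fallback work while the lock is busy
        }
    }
}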
So I'm trying to create a program that implements a function that generates a random number n and, based on n, creates n threads. The main thread is responsible for printing the minimum and maximum of the leaves. The depth of the hierarchy, including the main thread, is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>
using namespace std;

// a structure to keep the needed information of each thread
struct ThreadInfo
{
    long randomN;
    int level;
    bool run;
    int maxOfVals;
    double minOfVals;
};

// The start address (function) of the threads
void ChildWork(void* a) {
    ThreadInfo* info = (ThreadInfo*)a;

    // Generate random value n
    srand(time(NULL));
    double n = rand() % 6 + 1;

    // initialize the thread info with n value
    info->randomN = n;
    info->maxOfVals = n;
    info->minOfVals = n;

    // the depth of recursion should not be more than 3
    if (info->level > 3)
    {
        info->run = false;
    }

    // Create n threads and run them
    ThreadInfo* childInfo = new ThreadInfo[(int)n];
    for (int i = 0; i < n; i++)
    {
        childInfo[i].level = info->level + 1;
        childInfo[i].run = true;
        std::thread tt(ChildWork, &childInfo[i]);
        tt.detach();
    }

    // checks if any child threads are working
    bool anyRun = true;
    while (anyRun)
    {
        anyRun = false;
        for (int i = 0; i < n; i++)
        {
            anyRun = anyRun || childInfo[i].run;
        }
    }

    // once all child threads are done, we find their max and min value
    double maximum = 1, minimum = 6;
    for (int i = 0; i < n; i++)
    {
        // cout << childInfo[i].maxOfVals << endl;
        if (childInfo[i].maxOfVals >= maximum)
            maximum = childInfo[i].maxOfVals;
        if (childInfo[i].minOfVals < minimum)
            minimum = childInfo[i].minOfVals;
    }
    info->maxOfVals = maximum;
    info->minOfVals = minimum;

    // we set the info->run value to false, so that the parent thread of this thread will know that it is done
    info->run = false;
}

int main()
{
    ThreadInfo info;
    srand(time(NULL));
    double n = rand() % 6 + 1;
    cout << "n is: " << n << endl;

    // initializing thread info
    info.randomN = n;
    info.maxOfVals = n;
    info.minOfVals = n;
    info.level = 1;
    info.run = true;

    std::thread t(ChildWork, &info);
    t.join();
    while (info.run);

    info.maxOfVals = max<unsigned long>(info.randomN, info.maxOfVals);
    info.minOfVals = min<unsigned long>(info.randomN, info.minOfVals);
    cout << "Max is: " << info.maxOfVals << " and Min is: " << info.minOfVals;
}
The code compiles with no errors, but when I execute it, it gives me this:
libc++abi.dylib: terminating with uncaught exception of type
std::__1::system_error: thread constructor failed: Resource
temporarily unavailable Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function ChildWork I see two mistakes:
As someone already pointed out in the comments, you check the info level of a thread and then go on to create more threads regardless of that check (a minimal fix is sketched after this list).
Within the for loop that spawns your new threads, you increment the info level right before you spawn the actual thread. However, you increment a freshly created instance of ThreadInfo here: ThreadInfo* childInfo = new ThreadInfo[(int)n]. All instances within childInfo hold a level of 0. Basically, the level of each thread you spawn is 1.
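For the first mistake, a minimal fix is to leave the function as soon as the depth limit is reached:

// the depth of recursion should not be more than 3
if (info->level > 3)
{
    info->run = false;
    return; // without this, the thread goes on to spawn children anyway
}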
In general avoid using threads to achieve concurrency for I/O bound operations (*). Just use threads to achieve concurrency for independent CPU bound operations. As a rule of thumb you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event based system to run pseudo concurrent I/O operations. You do not need any threading to do so. For example a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is ok to have some threads which could be theoretically avoided.
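As a sketch of that rule of thumb, you can size a worker pool by the core count (the fallback of 4 for when the count is unknown is my own choice):

#include <thread>
#include <vector>

int main() {
    unsigned cores = std::thread::hardware_concurrency(); // may return 0 if unknown
    unsigned workers = (cores != 0) ? cores : 4;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([] { /* independent CPU-bound work */ });
    for (auto& t : pool)
        t.join();
}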
Multithreading is still rocket science in 2019. Especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.
I'm trying to write code where the screen is divided into two windows, one of which is modified by a different thread, but the output seems to be very random. Could anyone help? The upper part of the console should be modified by main, and the lower part by thread k.
#include <stdio.h>
#include <ncurses.h>
#include <unistd.h>
#include <thread>

#define WIDTH 30
#define HEIGHT 10

int startx = 0;
int starty = 0;

void kupa(int score_size, int parent_x, int parent_y)
{
    int i = 0;
    WINDOW *dupa = newwin(score_size, parent_x, parent_y - score_size, 0);
    while (true)
    {
        i++;
        mvwprintw(dupa, 0, 0, "You chose choice %d with choice string", i);
        wrefresh(dupa);
        sleep(5);
        wclear(dupa);
    }
    delwin(dupa); // unreachable: the loop above never exits
}

int main()
{
    int parent_x, parent_y;
    int score_size = 10;
    int counter = 0;
    initscr();
    noecho();
    curs_set(FALSE);
    getmaxyx(stdscr, parent_y, parent_x);
    WINDOW *field = newwin(parent_y - score_size, parent_x, 0, 0);
    std::thread k(kupa, score_size, parent_x, parent_y);
    while (true) {
        mvwprintw(field, 0, counter, "Field");
        wrefresh(field);
        sleep(5);
        wclear(field);
        counter++;
    }
    k.join();     // unreachable for the same reason
    delwin(field);
}
The underlying curses/ncurses library is not thread-safe (see for example What is meant by “thread-safe” code? which discusses the term). In the case of curses, this means that the library's WINDOW structures such as stdscr are global variables which are not guarded by mutexes or other methods. The library also has internal static data which is shared across windows. You can only get reliable results for multithreaded code using one of these strategies:
do all of the window management (including input) within one thread
use mutexes, semaphores or whatever concurrency technique seems best to manage separate threads which "own" separate windows. To succeed here, a thread would have to own the whole screen from the point where the curses library blocks while waiting for input, until it updates the screen and resumes waiting for input. That is harder than it sounds.
ncurses 5.7 and up can be compiled to provide rudimentary support for reentrant code and some threaded applications. To do this, it uses mutexes wrapped around its static data, makes the global variables into "getter" functions, and adds functions which explicitly pass the SCREEN pointer which is implied in many calls. For more detail, see the manual page.
Some of ncurses' test-programs illustrate the threading support (these are programs in the test subdirectory of the sources):
ditto shows use_screen.
test_opaque exercises the "getters" for WINDOW properties.
rain shows use_window.
worm shows use_window.
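As an illustration, a minimal use_window sketch (assuming an ncurses build configured with threading support; my_window and my_counter are placeholders):

static int draw_count(WINDOW* win, void* data) {
    int* count = static_cast<int*>(data);
    mvwprintw(win, 0, 0, "count: %d", *count);
    wrefresh(win);
    return 0;
}

// from any thread; the library runs the callback under its internal lock:
use_window(my_window, draw_count, &my_counter);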
I am not sure what you want to do, but this behaviour is quite normal. The thread that is active writes to the window, and when the system makes a task switch, the other thread writes to the window. Normal behaviour is to have only one thread that writes to the window. Other threads are supposed to do only some work.
Anyway, if you are using more than one thread you have to synchronize them using events, mutexes, queues, semaphores or other methods.
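For example, a sketch of the mutex approach applied to the question's code (the helper name is mine):

#include <mutex>

std::mutex curses_mutex;

void locked_draw(WINDOW* w, int y, int x, const char* text) {
    std::lock_guard<std::mutex> guard(curses_mutex);
    mvwprintw(w, y, x, "%s", text);
    wrefresh(w);
}

Both main and the kupa thread would then call locked_draw instead of touching their windows directly.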
I'm working through an example of protecting a global double using mutexes; however, I get the error:
Unhandled exception at 0x77b6308e in
Lab7.exe: 0xC0000005: Access violation
writing location 0x00000068.
I assume this is related to accessing score (the global double)?
#include <windows.h>
#include <iostream>
#include <process.h>

double score = 0.0;
HANDLE threads[10];
CRITICAL_SECTION score_mutex;

unsigned int __stdcall MyThread(void *data)
{
    EnterCriticalSection(&score_mutex);
    score = score + 1.0;
    LeaveCriticalSection(&score_mutex);
    return 0;
}

int main()
{
    InitializeCriticalSection(&score_mutex);
    for (int loop = 0; loop < 10; loop++)
    {
        threads[loop] = (HANDLE)_beginthreadex(NULL, 0, MyThread, NULL, 0, NULL);
    }
    WaitForMultipleObjects(10, threads, 0, INFINITE);
    DeleteCriticalSection(&score_mutex);
    std::cout << score;
    while (true);
}
Update:
After fixing the problem with the loop being set to 1000 instead of 10, the error still occurred. However, when I commented out the pieces of code referring to the mutex, the error did not occur.
CRITICAL_SECTION score_mutex;
EnterCriticalSection(&score_mutex);
LeaveCriticalSection(&score_mutex);
InitializeCriticalSection(&score_mutex);
DeleteCriticalSection(&score_mutex);
Update 2
The threads return 0 as per convention (It's been a long week!)
I tried adding back in the mutex-related code, and the program will compile and run fine (other than the race condition issues with the double of course) with CRITICAL_SECTION, InitializeCriticalSection and DeleteCriticalSection all added back in. The problem appears to be with EnterCriticalSection or LeaveCriticalSection, as the error reoccurs when I add them.
The remaining bug in your code is in the call to WaitForMultipleObjects(). You set the 3rd parameter to 0 (FALSE) such that the main thread unblocks as soon as any of the 10 threads finishes.
This causes the call to DeleteCriticalSection() to execute before all threads are finished, creating an access violation when one of the (possibly) 9 other threads starts and calls EnterCriticalSection().
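Sketched against the question's code, the fix is just the third argument:

// wait for ALL 10 threads to finish before tearing down the critical section
WaitForMultipleObjects(10, threads, TRUE, INFINITE);
DeleteCriticalSection(&score_mutex);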
You're writing beyond the end of your threads[10] array:
for (int loop = 0; loop < 1000; loop++)
{
    threads[loop] = (HANDLE)_beginthreadex(NULL, 0, MyThread, NULL, 0, NULL);
}
threads only has size 10!
Your problem is that WaitForMultipleObjects is not waiting for all the threads to complete, causing the critical section to be prematurely deleted. According to MSDN, the third argument is
bWaitAll [in]
If this parameter is TRUE, the function returns when the state of all objects in the lpHandles array is signaled. If FALSE, the function returns when the state of any one of the objects is set to signaled. In the latter case, the return value indicates the object whose state caused the function to return.
You set this to 0, which returns when ANY ONE of your threads completes. This causes the following DeleteCriticalSection to be run while there are still threads waiting to access the critical section.
You should also declare score as volatile so you don't have a cached-value problem.
I have a program that spawns 3 worker threads that do some number crunching, and waits for them to finish like so:
#define THREAD_COUNT 3

volatile LONG waitCount;
HANDLE pSemaphore;

int main(int argc, char **argv)
{
    // ...
    HANDLE threads[THREAD_COUNT];
    pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
    waitCount = 0;
    for (int j = 0; j < THREAD_COUNT; ++j)
    {
        threads[j] = CreateThread(NULL, 0, Iteration, p+j, 0, NULL);
    }
    WaitForMultipleObjects(THREAD_COUNT, threads, TRUE, INFINITE);
    // ...
}
The worker threads use a custom Barrier function at certain points in the code to wait until all other threads reach the Barrier:
void Barrier(volatile LONG* counter, HANDLE semaphore, int thread_count = THREAD_COUNT)
{
    LONG wait_count = InterlockedIncrement(counter);
    if (wait_count == thread_count)
    {
        *counter = 0;
        ReleaseSemaphore(semaphore, thread_count - 1, NULL);
    }
    else
    {
        WaitForSingleObject(semaphore, INFINITE);
    }
}
(Implementation based on this answer)
The program occasionally deadlocks. If at that point I use VS2008 to break execution and dig around in the internals, there is only 1 worker thread waiting on the Wait... line in Barrier(). The value of waitCount is always 2.
To make things even more awkward, the faster the threads work, the more likely they are to deadlock. If I run in Release mode, the deadlock comes about 8 out of 10 times. If I run in Debug mode and put some prints in the thread function to see where they hang, they almost never hang.
So it seems that some of my worker threads are killed early, leaving the rest stuck on the Barrier. However, the threads do literally nothing except read and write memory (and call Barrier()), and I'm quite positive that no segfaults occur. It is also possible that I'm jumping to the wrong conclusions, since (as mentioned in the question linked above) I'm new to Win32 threads.
What could be going on here, and how can I debug this sort of weird behavior with VS?
How do I debug weird thread behaviour?
Not quite what you asked, but the answer is almost always: understand the code really well, understand all the possible outcomes, and work out which one is happening. A debugger becomes less useful here: you can either follow one thread and miss what is causing the other threads to fail, or follow from the parent, in which case execution is no longer sequential and you end up jumping all over the place.
Now, onto the problem.
pSemaphore = CreateSemaphore(NULL, THREAD_COUNT, THREAD_COUNT, NULL);
From the MSDN documentation:
lInitialCount [in]: The initial count for the semaphore object. This value must be greater than or equal to zero and less than or equal to lMaximumCount. The state of a semaphore is signaled when its count is greater than zero and nonsignaled when it is zero. The count is decreased by one whenever a wait function releases a thread that was waiting for the semaphore. The count is increased by a specified amount by calling the ReleaseSemaphore function.
And here:
Before a thread attempts to perform the task, it uses the WaitForSingleObject function to determine whether the semaphore's current count permits it to do so. The wait function's time-out parameter is set to zero, so the function returns immediately if the semaphore is in the nonsignaled state. WaitForSingleObject decrements the semaphore's count by one.
So what we're saying here is that a semaphore's count tells you how many threads are allowed to perform a given task at once. When you set your count initially to THREAD_COUNT, you are allowing all your threads access to the "resource", which in this case is to continue onwards.
The answer you link uses this creation method for the semaphore:
CreateSemaphore(0, 0, 1024, 0)
Which basically says that none of the threads are permitted to use the resource. In your implementation, the semaphore is signaled (>0), so everything carries on merrily until one of the threads manages to decrease the count to zero, at which point some other thread waits for the semaphore to become signaled again, which probably isn't happening in sync with your counters. Remember that when WaitForSingleObject returns, it decreases the semaphore's count.
In the example you've posted, setting:
::ReleaseSemaphore(sync.Semaphore, sync.ThreadsCount - 1, 0);
Works because each of the WaitForSingleObject calls decreases the semaphore's value by 1, and there are threadcount - 1 of them to do. When all threadcount - 1 WaitForSingleObject calls have returned, the semaphore is back to 0 and therefore unsignaled again, so on the next pass everybody waits, because nobody is allowed to access the resource at once.
So in short, set your initial value to zero and see if that fixes it.
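Sketched against your creation call:

// start with zero available slots so waiting threads actually block
pSemaphore = CreateSemaphore(NULL, 0, THREAD_COUNT, NULL);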
Edit: A little explanation. To think of it a different way, a semaphore is like an n-atomic gate. What you do is usually this:
// Set the number of tickets:
HANDLE Semaphore = CreateSemaphore(0, 20, 200, 0);

// Later on in a thread somewhere...
// Get a ticket in the queue
WaitForSingleObject(Semaphore, INFINITE);
// Only 20 threads can access this area at once. When one thread has
// entered this area the available tickets decrease by one. When there
// are 20 threads here all other threads must wait.
// do stuff
ReleaseSemaphore(Semaphore, 1, 0); // gives back one ticket
So the use we're putting semaphores to here isn't quite the one for which they were designed.
It's a bit hard to guess exactly what you might be running into. Parallel programming is one of those places where (IMO) it pays to follow the philosophy of "keep it so simple it's obviously correct", and unfortunately I can't say that your Barrier code seems to qualify. Personally, I think I'd have something like this:
// define and initialize the array of events used for the barrier:
HANDLE barrier_[thread_count];
for (int i = 0; i < thread_count; i++)
    barrier_[i] = CreateEvent(NULL, true, false, NULL);

// ...

void Barrier(size_t thread_num) {
    // Signal that this thread has reached the barrier:
    SetEvent(barrier_[thread_num]);
    // Then wait for all the threads to reach the barrier:
    WaitForMultipleObjects(thread_count, barrier_, true, INFINITE);
}
Edit:
Okay, now that the intent has been clarified (need to handle multiple iterations), I'd modify the answer, but only slightly. Instead of one array of Events, have two: one for the odd iterations and one for the even iterations:
// define and initialize the arrays of events used for the barrier:
HANDLE barrier_[2][thread_count];
for (int i = 0; i < thread_count; i++) {
    barrier_[0][i] = CreateEvent(NULL, true, false, NULL);
    barrier_[1][i] = CreateEvent(NULL, true, false, NULL);
}

// ...

void Barrier(size_t thread_num, int iteration) {
    // Signal that this thread has reached the barrier:
    SetEvent(barrier_[iteration & 1][thread_num]);
    // Then wait for all the threads to reach the barrier:
    WaitForMultipleObjects(thread_count, barrier_[iteration & 1], true, INFINITE);
    ResetEvent(barrier_[iteration & 1][thread_num]);
}
In your barrier, what prevents this line:

*counter = 0;

from being executed while this other one is executed by another thread?

LONG wait_count = InterlockedIncrement(counter);
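One way to at least make that reset atomic, sketched against the Barrier code above (a partial fix only; it does not by itself make the barrier safely reusable across iterations):

if (wait_count == thread_count)
{
    InterlockedExchange(counter, 0); // atomic store instead of the racy plain write
    ReleaseSemaphore(semaphore, thread_count - 1, NULL);
}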