Wanted: Elegant solution to race condition

I have the following code:
class TimeOutException
{};
template <typename T>
class MultiThreadedBuffer
{
public:
MultiThreadedBuffer()
{
InitializeCriticalSection(&m_csBuffer);
m_evtDataAvail = CreateEvent(NULL, TRUE, FALSE, NULL);
}
~MultiThreadedBuffer()
{
CloseHandle(m_evtDataAvail);
DeleteCriticalSection(&m_csBuffer);
}
void LockBuffer()
{
EnterCriticalSection(&m_csBuffer);
}
void UnlockBuffer()
{
LeaveCriticalSection(&m_csBuffer);
}
void Add(T val)
{
LockBuffer();
m_buffer.push_back(val);
SetEvent(m_evtDataAvail);
UnlockBuffer();
}
T Get(DWORD timeout)
{
T val;
if (WaitForSingleObject(m_evtDataAvail, timeout) == WAIT_OBJECT_0) {
LockBuffer();
if (!m_buffer.empty()) {
val = m_buffer.front();
m_buffer.pop_front();
}
if (m_buffer.empty()) {
ResetEvent(m_evtDataAvail);
}
UnlockBuffer();
} else {
throw TimeOutException();
}
return val;
}
bool IsDataAvail()
{
return (WaitForSingleObject(m_evtDataAvail, 0) == WAIT_OBJECT_0);
}
std::list<T> m_buffer;
CRITICAL_SECTION m_csBuffer;
HANDLE m_evtDataAvail;
};
Unit testing shows that this code works fine when used on a single thread as long as T's default constructor and copy/assignment operators don't throw. Since I'm writing T, that is acceptable.
My problem is the Get method. If there is no data available (i.e. m_evtDataAvail is not set), then several threads can block on the WaitForSingleObject call. When new data becomes available, they all fall through to the LockBuffer() call. Only one will pass, get the data out and move on. After the UnlockBuffer() call, another thread can move through and will find that there is no data. Currently it will return a default-constructed object.
What I want to happen is for that second thread (and others) to go back to the WaitForSingleObject call. I could add an else block that unlocked and did a goto but that just feels evil.
That solution also adds the possibility for an endless loop since each trip back would restart the timeout. I could add some code to check the clock on entry and adjust the timeout on each trip back but then this simple Get method starts to get very complicated.
Any ideas on how to solve these problems while maintaining testability and simplicity?
Oh, for anyone wondering, the IsDataAvail function only exists for testing. It won't be used in production code. The Add and Get are the only methods that will be used in a non-testing environment.

You need to create an auto-reset event instead of a manual-reset event. This guarantees that when multiple threads are waiting on the event and it is set, only one thread is released; all the others remain in the waiting state. You can create an auto-reset event by passing FALSE as the second parameter of the CreateEvent API. Also note that this code is not exception safe: if a statement throws after the buffer has been locked, the critical section will never be unlocked. Use RAII to ensure that the critical section is unlocked even when an exception is thrown.
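A minimal sketch of the RAII advice, in case it helps. BufferLock is a name made up for this example, not part of the original class; it assumes only that the buffer exposes the LockBuffer() and UnlockBuffer() members shown in the question.

```cpp
// Hypothetical RAII helper: locks in the constructor and unlocks in the
// destructor, so the lock is released even if the guarded code throws.
template <typename Buffer>
class BufferLock
{
public:
    explicit BufferLock(Buffer& buffer) : m_buffer(buffer)
    {
        m_buffer.LockBuffer();
    }
    ~BufferLock()
    {
        m_buffer.UnlockBuffer();
    }
private:
    BufferLock(const BufferLock&);            // non-copyable
    BufferLock& operator=(const BufferLock&); // non-assignable
    Buffer& m_buffer;
};
```

Inside Add or Get, a local `BufferLock<MultiThreadedBuffer<T> > lock(*this);` would then replace the explicit LockBuffer()/UnlockBuffer() pair.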

You could use a Semaphore object instead of a generic Event object. The semaphore count should be initialized to 0 and incremented by 1 with ReleaseSemaphore each time Add is called. That way the WaitForSingleObject in Get will never release more threads to read from the buffer than there are values in the buffer.

You will always have to code for the case where the event is signaled but there is no data, even WITH auto-reset events. There is a race condition from the moment WaitForSingleObject wakes until LockBuffer is called, and in that interval another thread can pop the data from the buffer. Your code must place WaitForSingleObject in a loop. Decrease the timeout by the time already spent in each loop iteration...
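The loop-with-shrinking-timeout idea can be sketched roughly as follows. This is not the Win32 code from the question: std::mutex and std::condition_variable stand in for the critical section and the event so the example stays portable, and TimedBuffer, add and get are names invented for the sketch. The point it illustrates is that the deadline is fixed once on entry, so going back to wait never restarts the timeout.

```cpp
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>

class TimeOutException {};

// Portable stand-in for the Win32 version: a mutex plus a condition
// variable replace the critical section and the event.
template <typename T>
class TimedBuffer
{
public:
    void add(const T& val)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_buffer.push_back(val);
        m_dataAvail.notify_one();
    }

    T get(std::chrono::milliseconds timeout)
    {
        // Fix the deadline once, so looping never restarts the timeout.
        const std::chrono::steady_clock::time_point deadline =
            std::chrono::steady_clock::now() + timeout;

        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_buffer.empty()) {
            // Woken without data (another thread took it, or a spurious
            // wakeup): wait again, but only for the time that is left.
            if (m_dataAvail.wait_until(lock, deadline) == std::cv_status::timeout
                && m_buffer.empty()) {
                throw TimeOutException();
            }
        }
        T val = m_buffer.front();
        m_buffer.pop_front();
        return val;
    }

private:
    std::list<T> m_buffer;
    std::mutex m_mutex;
    std::condition_variable m_dataAvail;
};
```

With the Win32 primitives the same shape applies: record the start time on entry, and on each trip back pass WaitForSingleObject only the remaining part of the original timeout.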
As an alternative, may I interest you in more scalable and performant alternatives? Interlocked Singly Linked Lists, the OS thread pool's QueueUserWorkItem and idempotent processing. Add pushes an entry onto the list and submits a work item. The work item pops an entry and, if it is not NULL, processes it. You can go fancy and have extra logic for the processor to loop and keep a state marking its 'active' presence so that Add does not queue unnecessary work items, but that is not strictly required. For even higher scale and multi-core/multi-CPU load spreading I recommend using queued completion ports. The details are described in Rick Vicik's articles; I have a blog entry that links all 3 at once: High Performance Windows programs.

Related

Trying to minimize checks of atomics on every iteration

From a multithreading perspective, is the following correct or incorrect?
I have an app which has 2 threads: the main thread, and a worker thread.
The main thread has a MainUpdate() function that gets called in a continuous loop. As part of its job, that MainUpdate() function might call a ToggleActive() method on the worker objects running on the worker thread. That ToggleActive() method is used to turn the worker objects on/off.
The flow is something like this.
// MainThread
while(true) {
MainUpdate(...);
}
void MainUpdate(...) {
for(auto& obj: objectsInWorkerThread) {
if (foo())
obj.ToggleActive(getBool());
}
}
// Worker thread example worker ------------------------------
struct SomeWorkerObject {
void Execute(...) {
if(mIsActive == false) // %%%%%%% THIS!
return;
Update(...);
}
void ToggleActive(bool active) {
mIsActiveAtom = active; // %%%%%%% THIS!
mIsActive = mIsActiveAtom; // %%%%%%% THIS!
}
private:
void Update(...) {...}
std::atomic_bool mIsActiveAtom = true;
volatile bool mIsActive = true;
};
I'm trying to avoid checking the atomic field on every invocation of Execute(), which gets called on every iteration of the worker thread. There are many worker objects running at any one time, and thus there would be many atomic field checks.
As you can see, I'm using the non-atomic field to check for activeness. The value of the non-atomic field gets its value from the atomic field in ToggleActive().
From my tests, this seems to be working, but I have a feeling that it is incorrect.

A volatile variable only guarantees that accesses to it are not optimized out or reordered by the compiler; it has nothing to do with multithreaded execution. Therefore, your program does have a race condition, since ToggleActive and Execute can modify and read mIsActive at the same time.
About performance: you can check whether your platform supports a lock-free atomic bool. If it does, checking the atomic value can be very fast. I remember seeing a benchmark somewhere showing that std::atomic<bool> has the same speed as volatile bool.
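For illustration, here is a sketch of the worker with the volatile shadow removed and the atomic read directly. Worker, mUpdates and IsLockFree are names made up for this example (mUpdates merely stands in for the real Update(...) work). On mainstream platforms where std::atomic<bool> is lock-free, an acquire load like this typically compiles to an ordinary load instruction, but that is worth confirming with is_lock_free() and a measurement on your own target.

```cpp
#include <atomic>

// Hypothetical worker: the flag is read and written only through the atomic.
struct Worker
{
    void Execute()
    {
        // Acquire load: well-defined even while ToggleActive runs on
        // another thread.
        if (!mIsActive.load(std::memory_order_acquire))
            return;
        ++mUpdates;  // stands in for the real Update(...)
    }

    void ToggleActive(bool active)
    {
        mIsActive.store(active, std::memory_order_release);
    }

    // Reports whether the platform implements the flag without a lock.
    bool IsLockFree() const { return mIsActive.is_lock_free(); }

    int mUpdates = 0;

private:
    std::atomic<bool> mIsActive{true};
};
```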

@hgminh is right, your code is not safe.
Synchronization is a two-way road: if one thread performs a thread-safe write, the other thread must perform a thread-safe read. If one thread uses a lock, the other thread must use the same lock.
Think about inter-thread communication as message passing (incidentally, it works exactly that way in modern CPUs). If both sides don't share a messaging channel (mIsActiveAtom), the message might not be delivered properly.

How to avoid race conditions in a condition variable in VxWorks

We're programming on a proprietary embedded platform sitting atop of VxWorks 5.5. In our toolbox, we have a condition variable, that is implemented using a VxWorks binary semaphore.
Now, POSIX provides a wait function that also takes a mutex. This will unlock the mutex (so that some other task might write to the data) and waits for the other task to signal (it is done writing the data). I believe this implements what's called a Monitor, ICBWT.
We need such a wait function, but implementing it is tricky. A simple approach would do this:
bool condition::wait_for(mutex& mutex) const {
unlocker ul(mutex); // relinquish mutex
return wait(event);
} // ul's dtor grabs mutex again
However, this sports a race condition because it allows another task to preempt this one after the unlocking and before the waiting. The other task can write to the data after it was unlocked and signal the condition before this task starts to wait for the semaphore. (We have tested this and this indeed happens and blocks the waiting task forever.)
Given that VxWorks 5.5 doesn't seem to provide an API to temporarily relinquish a semaphore while waiting for a signal, is there a way to implement this on top of the provided synchronization routines?
Note: This is a very old VxWorks version that has been compiled without POSIX support (by the vendor of the proprietary hardware, from what I understood).

This should be quite easy with native VxWorks; a message queue is what is required here. Your wait_for method can be used as is.
bool condition::wait_for(mutex& mutex) const
{
unlocker ul(mutex); // relinquish mutex
return wait(event);
} // ul's dtor grabs mutex again
but the wait(event) code would look like this:
wait(event)
{
if (msgQRecv(event->q, sigMsgBuf, sigMsgSize, timeoutTime) == OK)
{
// got it...
}
else
{
// timeout, report error or something like that....
}
}
and your signal code would look something like this:
signal(event)
{
msgQSend(event->q, sigMsg, sigMsgSize, NO_WAIT, MSG_PRI_NORMAL);
}
So if the signal gets triggered before you start waiting, then msgQRecv will return immediately with the signal when it eventually gets invoked and you can then take the mutex again in the ul dtor as stated above.
The event->q is a MSG_Q_ID that is created at event creation time with a call to msgQCreate, and the data in sigMsg is defined by you... but can be just a random byte of data, or you can come up with a more intelligent structure with information regarding who signaled or something else that may be nice to know.
Update for multiple waiters (this is a little tricky). I will make a couple of assumptions to simplify things:
1. The number of tasks that will be pending is known at event creation time and is constant.
2. There will be one task that is always responsible for indicating when it is OK to unlock the mutex; all other tasks just want notification when the event is signaled/complete.
This approach uses a counting semaphore, similar to the above with just a little extra logic:
wait(event)
{
if (semTake(event->csm, timeoutTime) == OK)
{
// got it...
}
else
{
// timeout, report error or something like that....
}
}
and your signal code would look something like this:
signal(event)
{
for (int x = 0; x < event->numberOfWaiters; x++)
{
semGive(event->csm);
}
}
The creation of the event is something like this, remember in this example the number of waiters is constant and known at event creation time. You could make it dynamic, but the key is that every time the event is going to happen the numberOfWaiters must be correct before the unlocker unlocks the mutex.
createEvent(numberOfWaiters)
{
event->numberOfWaiters = numberOfWaiters;
event->csm = semCCreate(SEM_Q_FIFO, 0);
return event;
}
You cannot be wishy-washy about the numberOfWaiters :D I will say it again: The numberOfWaiters must be correct before the unlocker unlocks the mutex. To make it dynamic (if that is a requirement) you could add a setNumWaiters(numOfWaiters) function, and call that in the wait_for function before the unlocker unlocks the mutex, so long as it always sets the number correctly.
Now for the last trick, as stated above the assumption is that one task is responsible for unlocking the mutex, the rest just wait for the signal, which means that one and only one task will call the wait_for() function above, and the rest of the tasks just call the wait(event) function.
With this in mind the numberOfWaiters is computed as follows:
The number of tasks who will call wait()
plus 1 for the task that calls wait_for()
Of course you can also make this more complex if you really need to, but chances are this will work because normally 1 task triggers an event, but many tasks want to know it is complete, and that is what this provides.
But your basic flow is as follows:
init()
{
event->createEvent(3);
}
eventHandler()
{
locker l(mutex);
doEventProcessing();
signal(event);
}
taskA()
{
doOperationThatTriggersAnEvent();
wait_for(mutex);
eventComplete();
}
taskB()
{
doWhateverIWant();
// now I need to know if the event has occurred...
wait(event);
coolNowIKnowThatIsDone();
}
taskC()
{
taskCIsFun();
wait(event);
printf("event done!\n");
}
When I write the above I feel like all OO concepts are dead, but hopefully you get the idea. In reality wait and wait_for should take the same parameter, or take no parameter and instead be members of the same class that also holds all the data they need. Nonetheless, that is the overview of how it works.

Race conditions can be avoided if each waiting task waits on a separate binary semaphore.
These semaphores must be registered in a container which the signaling task uses to unblock all waiting tasks. The container must be protected by a mutex.
The wait_for() method obtains a binary semaphore, waits on it and finally deletes it.
void condition::wait_for(mutex& mutex) {
SEM_ID sem = semBCreate(SEM_Q_PRIORITY, SEM_EMPTY);
{
lock l(listeners_mutex); // assure exclusive access to listeners container
listeners.push_back(sem);
} // l's dtor unlocks listeners_mutex again
unlocker ul(mutex); // relinquish mutex
semTake(sem, WAIT_FOREVER);
{
lock l(listeners_mutex);
// remove sem from listeners
// ...
semDelete(sem);
}
} // ul's dtor grabs mutex again
The signal() method iterates over all registered semaphores and unlocks them.
void condition::signal() {
lock l(listeners_mutex);
for_each (listeners.begin(), listeners.end(), /* call semGive()... */ )
}
This approach assures that wait_for() will never miss a signal. A disadvantage is the need of additional system resources.
To avoid creating and destroying semaphores for every wait_for() call, a pool could be used.

From the description, it looks like you may want to implement (or use) a semaphore - it's a standard CS algorithm with semantics similar to condvars, and there are tons of textbooks on how to implement them (https://www.google.com/search?q=semaphore+algorithm).
A random Google result which explains semaphores is at: http://www.cs.cornell.edu/courses/cs414/2007sp/lectures/08-bakery.ppt (see slide 32).

Producer/Consumer - what is the optimal signalling pattern

I am building a high performance app that needs two functions to synchronise threads
void wake_thread(thread)
void sleep_thread(thread)
The app has a single thread (let's call it C) that may fall asleep with a call to sleep_thread. There are multiple threads that will call wake_thread. When wake_thread returns it MUST guarantee that C is either running or will be woken. wake_thread must NEVER block.
The easy way is of course to use a synchronisation event like this:
hEvent = CreateEvent(NULL, FALSE, TRUE, NULL);
void wake_thread(thread) {
SetEvent(hEvent);
}
And:
void sleep_thread(thread)
{
WaitForSingleObject(hEvent, INFINITE);
}
This provides the desired semantics and is free of race conditions for the scenario (There is only one thread waiting, but multiple that can signal). I included it here to show what I am trying to tune.
HOWEVER, I am wondering whether there is a faster way under Windows for this very specific scenario. wake_thread may be called a lot, even when C is not sleeping. This causes a lot of calls to SetEvent that do nothing. Would there be a faster way, using a manual-reset event and reference counters, to make sure SetEvent is only called when there is actually something to set?
Every CPU cycle counts in this scenario.

I haven't tested this (apart from making sure it compiles) but I think this should do the trick. It was, admittedly, a bit trickier than I at first thought. Note that there are some obvious optimizations you could make; I've left it in unoptimized form for clarity and to aid any debugging that may be necessary. I've also omitted error checking.
#include <intrin.h>
HANDLE hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
__declspec(align(4)) volatile LONG thread_state = 2;
// 0 (00): sleeping
// 1 (01): sleeping, wake request pending
// 2 (10): awake, no additional wake request received
// 3 (11): awake, at least one additional wake request
void wake_thread(void)
{
LONG old_state;
old_state = _InterlockedOr(&thread_state, 1);
if (old_state == 0)
{
// This is the first wake request since the consumer thread
// went to sleep. Set the event.
SetEvent(hEvent);
return;
}
if (old_state == 1)
{
// The consumer thread is already in the process of being woken up.
// Any items added to the queue by this thread will be processed,
// so we don't need to do anything.
return;
}
if (old_state == 2)
{
// This is an additional wake request when the consumer thread
// is already awake. We've already changed the state accordingly,
// so we don't need to do anything else.
return;
}
if (old_state == 3)
{
// The consumer thread is already awake, and already has an
// additional wake request registered, so we don't need to do
// anything.
return;
}
BigTrouble();
}
void sleep_thread(void)
{
LONG old_state;
// Debugging only, remove this test in production code.
// The event should never be signaled at this point.
if (WaitForSingleObject(hEvent, 0) != WAIT_TIMEOUT)
{
BigTrouble();
}
old_state = _InterlockedAnd(&thread_state, 1);
if (old_state == 2)
{
// We've changed the state from "awake" to "asleep".
// Go to sleep.
WaitForSingleObject(hEvent, INFINITE);
// We've been buzzed; change the state to "awake"
// and then reset the event.
if (_InterlockedExchange(&thread_state, 2) != 1)
{
BigTrouble();
}
ResetEvent(hEvent);
return;
}
if (old_state == 3)
{
// We've changed the state from "awake with additional
// wake request" to "waking". Change it to "awake"
// and then carry on.
if (_InterlockedExchange(&thread_state, 2) != 1)
{
BigTrouble();
}
return;
}
BigTrouble();
}
Basically this uses a manual-reset event and a two-bit flag to reproduce the behaviour of an automatic-reset event. It may be clearer if you draw a state diagram. The thread safety depends on the rules about which of the functions is allowed to make which transitions, and also on when the event object is allowed to be signaled.
As an editorial: I think it is separating the synchronization code into the wake_thread and sleep_thread functions that makes things a bit awkward. It would probably be more natural, slightly more efficient, and almost certainly clearer if the synchronization code were moved into the queue implementation.

SetEvent() will introduce some latency as it does have to make a system call (sysenter triggers the switch from user to kernel mode) for the object manager to check the state of the event and dispatch it (via a call to KeSetEvent()). I think that the time of the system call might be considered acceptable even in your circumstances, but that is speculation. Most of the latency is likely to be introduced on the receiving side of the event. In other words, it takes more time to wake a thread from a WaitFor*Object() call than it does to signal the event. The Windows scheduler tries to help get to the thread sooner by giving a priority "boost" to a thread whose wait is returning, but that boost only does so much.
To get around this, make sure that you only wait when it is necessary. The typical way to do this is for your consumer, once it has been signaled, to consume every work item it can without waiting on the event again, and only then call sleep_thread().
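One way to picture "consume everything before sleeping again" is a queue that hands over its whole backlog in a single locked operation. The sketch below is an assumption-laden illustration, not the asker's queue: WorkQueue and take_all are invented names, and a std::mutex stands in for whatever lock protects the real data.

```cpp
#include <deque>
#include <mutex>

template <typename T>
class WorkQueue
{
public:
    void push(const T& item)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_items.push_back(item);
    }

    // Moves every queued item into 'batch' with one lock acquisition.
    // Returns false when there was nothing to take.
    bool take_all(std::deque<T>& batch)
    {
        batch.clear();
        std::lock_guard<std::mutex> lock(m_mutex);
        batch.swap(m_items);
        return !batch.empty();
    }

private:
    std::mutex m_mutex;
    std::deque<T> m_items;
};
```

The consumer would loop on take_all(), processing each batch outside the lock, and call sleep_thread() only once take_all() returns false.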
I should point out that SetEvent()/WaitFor*Object() is almost surely faster than everything short of eating 100% CPU and even then it may be quicker as a result of the contention on whatever locking object needs to protect your shared data.
Normally, I would recommend the use of a ConditionVariable, but I have not tested its performance compared to your technique. I have a suspicion that it may be slower, since it also has the overhead of entering a CRITICAL_SECTION object. You may have to measure the performance difference; when in doubt, measure, measure, measure.
The only other thing that I can think to say is that MS does acknowledge that dispatching and waiting on events can be slow, especially when it is performed repeatedly. In order to get around this, they changed the CRITICAL_SECTION object to try for a number of times in user mode to acquire the lock before actually waiting on the event. They call this the spin count. While I wouldn't recommend it, you may be able to do something similar.

Something like:
void consumer_thread(void)
{
while(1)
{
WaitForSingleObject(...);
// Consume all items from queue in a thread safe manner (e.g. critical section)
}
}
void produce()
{
bool queue_was_empty = ...; // in a thread safe manner determine if queue is empty
// thread safe insertion into queue ...
// These two steps should be done in a way that prevents the consumer
// from emptying the queue in between, e.g. a spin lock.
// This guarantees you will never miss the "edge"
if( queue_was_empty )
{
SetEvent(...);
}
}
The general idea is to call SetEvent only on the transition from empty to non-empty. If the threads have the same priority, Windows should let the producer(s) keep running, so you can minimize the number of SetEvent calls per queue insertion. I've found this arrangement (between threads of equal priority) to give the best performance (at least under Windows XP and Win7, YMMV).

Elegant ways to notify consumer when producer is done?

I'm implementing a concurrent_blocking_queue with minimal functions:
//a thin wrapper over std::queue
template<typename T>
class concurrent_blocking_queue
{
std::queue<T> m_internal_queue;
//...
public:
void add(T const & item);
T& remove();
bool empty();
};
I intend to use this for producer-consumer problem (I guess, it is where one uses such data structures?). But I'm stuck on one problem which is:
How to elegantly notify the consumer when the producer is done? How would the producer notify the queue when it is done? By calling a specific member function, say done()? Is throwing an exception from the queue (i.e. from the remove function) a good idea?
I came across many examples, but all of them have an infinite loop, as if the producer will produce items forever. None discussed the issue of a stopping condition, not even the wiki article.

I've simply introduced a dummy "done" product in the past. So if the producer can create "products" of, say, type A and type B, I've invented type "done". When a consumer encounters a product of type "done" it knows that further processing isn't required anymore.
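A toy sketch of that in-band "done" product, for illustration only. Message and consume_until_done are hypothetical names, and a plain std::queue is used so the control flow is easy to follow; a real consumer would block on a thread-safe queue instead.

```cpp
#include <queue>

// Hypothetical message type: 'done' marks the in-band end-of-stream item.
struct Message
{
    bool done;
    int payload;
};

// Consumes messages until the sentinel arrives; returns the sum of the
// payloads seen before it.
int consume_until_done(std::queue<Message>& q)
{
    int sum = 0;
    while (!q.empty()) {
        Message m = q.front();
        q.pop();
        if (m.done)
            break;  // the producer has finished; stop consuming
        sum += m.payload;
    }
    return sum;
}
```

With several consumers, each one needs to see a "done" item, so the producer would enqueue one per consumer (or a consumer would put the item back before exiting).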

It is true that it's common to enqueue a special "we're done" message; however I think OP's original desire for an out-of-band indicator is reasonable. Look at the complexity people are contemplating to set up an in-band completion message! Proxy types, templating; good grief. I'd say a done() method is simpler and easier, and it makes the common case (we're not done yet) faster and cleaner.
I would agree with kids_fox that a try_remove that returns an error code if the queue is done is preferred, but that's stylistic and YMMV.
Edit:
Bonus points for implementing a queue that keeps track of how many producers are remaining in a multiple-producers situation and raises the done exception iff all producers have thrown in the towel ;-) Not going to do that with in-band messages!
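A rough sketch of that out-of-band variant with producer counting. ClosableQueue, producer_done and the constructor argument are assumptions made for the example; it reports "done" through a bool return in the try_remove style rather than by raising an exception.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class ClosableQueue
{
public:
    explicit ClosableQueue(std::size_t producers) : m_producers(producers) {}

    void add(const T& item)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_queue.push(item);
        m_cond.notify_one();
    }

    // Each producer calls this exactly once when it has finished.
    void producer_done()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_producers > 0 && --m_producers == 0)
            m_cond.notify_all();  // wake every consumer so it can see "done"
    }

    // Blocks until an item is available or all producers are done.
    // Returns false only when the queue is both drained and closed.
    bool try_remove(T& out)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (m_queue.empty() && m_producers > 0)
            m_cond.wait(lock);
        if (m_queue.empty())
            return false;
        out = m_queue.front();
        m_queue.pop();
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    std::queue<T> m_queue;
    std::size_t m_producers;
};
```

Items added before the last producer_done() are still delivered; consumers only see false once the backlog is gone.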

My queues have usually used pointers (with an std::auto_ptr in the
interface, to clearly indicate that the sender may no longer access the
pointer); for the most part, the queued objects were polymorphic, so
dynamic allocation and reference semantics were required anyway.
Otherwise, it shouldn't be too difficult to add an “end of
file” flag to the queue. You'd need a special function on the
producer side (close?) to set it (using exactly the same locking
primitives as when you write to the queue), and the loop in the removal
function must wait for either something to be there, or the queue to be
closed. Of course, you'll need to return a Fallible value, so that
the reader can know whether the read succeeded or not. Also, don't
forget that in this case, you need a notify_all to ensure that all
processes waiting on the condition are awoken.
BTW: I don't quite see how your interface is implementable. What does
the T& returned by remove refer to? Basically, remove has to be
something like:
Fallible<T>
MessageQueue<T>::receive()
{
ScopedLock l( myMutex );
while ( myQueue.empty() && ! myIsDone )
myCondition.wait( myMutex );
Fallible<T> results;
if ( !myQueue.empty() ) {
results.validate( myQueue.front() );
myQueue.pop();
}
return results;
}
Even without the myIsDone condition, you have to read the value into a
local variable before removing it from the queue, and you can't return a
reference to a local variable.
For the rest:
void
MessageQueue<T>::send( T const& newValue )
{
ScopedLock l( myMutex );
myQueue.push( newValue );
myCondition.notify_all();
}
void
MessageQueue<T>::close()
{
ScopedLock l( myMutex );
myIsDone = true;
myCondition.notify_all();
}

'Stopping' is not often discussed because it's often never done. In those cases where it is required, it's often just as easy, and more flexible, to enqueue a poison pill using the higher-level P-C protocol itself as it is to build extra functionality into the queue.
If you really want to do this, you could indeed set a flag that causes every consumer to raise an exception, either 'immediately' or whenever it gets back to the queue, but there are problems. Do you need the 'done' method to be synchronous, i.e. do you want all the consumers gone by the time 'done' returns, or asynchronous, i.e. the last consumer thread calls an event parameter when all the other consumers are gone?
How are you going to arrange for those consumers that are currently waiting to wake up? How many are waiting and how many are busy, but will return to the queue when they have done their work? What if one or more consumers are stuck on a blocking call, (perhaps they can be unblocked, but that requires a call from another thread - how are you going to do that)?
How are the consumers going to notify that they have handled their exception and are about to die? Is 'about to die' enough, or do you need to wait on the thread handle? If you have to wait on the thread handle, what is going to do the waiting - the thread requesting the queue shutdown or the last consumer thread to notify?
Oh yes - to be safe, you should arrange for producer threads that turn up with objects to queue up while in 'shutting down' state to raise an exception as well.
I raise these questions because I've done all this once, a long time ago. Eventually, it all worked-ish. The objects queued up all had to have a 'QueuedItem' inserted into their inheritance chain (so that a job-cancellation method could be exposed to the queue), and the queue had to keep a thread-safe list of objects that had been popped off by threads but not processed yet.
After a while, I stopped using the class in favour of a simple P-C queue with no special shutdown mechanism.

Lightest synchronization primitive for worker thread queue

I am about to implement a worker thread with work item queuing, and while I was thinking about the problem, I wanted to know if I'm doing the best thing.
The thread in question will have to have some thread local data (preinitialized at construction) and will loop on work items until some condition will be met.
pseudocode:
volatile bool run = true;
int WorkerThread(param)
{
localclassinstance c1 = new c1();
[other initialization]
while(true) {
[LOCK]
[unqueue work item]
[UNLOCK]
if([hasWorkItem]) {
[process data]
[PostMessage with pointer to data]
}
[Sleep]
if(!run)
break;
}
[uninitialize]
return 0;
}
I guess I will do the locking via critical section, as the queue will be std::vector or std::queue, but maybe there is a better way.
The part with Sleep doesn't look too great, as there will be a lot of extra sleeping with big Sleep values, or lots of extra locking when the Sleep value is small, and that's definitely unnecessary.
But I can't think of a WaitForSingleObject-friendly primitive I could use instead of a critical section, as there might be two threads queuing work items at the same time. So an Event, which seems to be the best candidate, can lose the second work item if the Event was already set, and it doesn't guarantee mutual exclusion.
Maybe there is even a better approach with InterlockedExchange kind of functions that leads to even less serialization.
P.S.: I might need to preprocess the whole queue and drop the obsolete work items during the unqueuing stage.

There are a multitude of ways to do this.
One option is to use a semaphore for the waiting. The semaphore is signalled every time a value is pushed on the queue, so the worker thread will only block if there are no items in the queue. This will still require separate synchronization on the queue itself.
A second option is to use a manual-reset event which is set when there are items in the queue and cleared when the queue is empty. Again, you will need to do separate synchronization on the queue.
A third option is to have an invisible message-only window created on the thread, and use a special WM_USER or WM_APP message to post items to the queue, attaching the item to the message via a pointer.
Another option is to use condition variables. The native Windows condition variables only work if you're targeting Windows Vista or Windows 7, but condition variables are also available for Windows XP with Boost or an implementation of the C++0x thread library. An example queue using boost condition variables is available on my blog: http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html

It is possible to share a resource between threads without using blocking locks at all, if your scenario meets certain requirements.
You need an atomic pointer exchange primitive, such as Win32's InterlockedExchange. Most processor architectures provide some sort of atomic swap, and it's usually much less expensive than acquiring a formal lock.
You can store your queue of work items in a pointer variable that is accessible to all the threads that will be interested in it. (global var, or field of an object that all the threads have access to)
This scenario assumes that the threads involved always have something to do, and only occasionally "glance" at the shared resource. If you want a design where threads block waiting for input, use a traditional blocking event object.
Before anything begins, create your queue or work item list object and assign it to the shared pointer variable.
Now, when producers want to push something onto the queue, they "acquire" exclusive access to the queue object by swapping a null into the shared pointer variable using InterlockedExchange. If the swap returns a null, then somebody else is currently modifying the queue object. Sleep(0) to release the rest of your thread's time slice, then loop to retry the swap until it returns non-null. Even if you end up looping a few times, this is many, many times faster than making a kernel call to acquire a mutex object. Kernel calls require hundreds of clock cycles to transition into kernel mode.
When you successfully obtain the pointer, make your modifications to the queue, then swap the queue pointer back into the shared pointer.
When consuming items from the queue, you do the same thing: swap a null into the shared pointer and loop until you get a non-null result, operate on the object in the local var, then swap it back into the shared pointer var.
This technique is a combination of atomic swap and brief spin loops. It works well in scenarios where the threads involved are not blocked and collisions are rare. Most of the time the swap will give you exclusive access to the shared object on the first try, and as long as the length of time the queue object is held exclusively by any thread is very short then no thread should have to loop more than a few times before the queue object becomes available again.
If you expect a lot of contention between threads in your scenario, or you want a design where threads spend most of their time blocked waiting for work to arrive, you may be better served by a formal mutex synchronization object.
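The swap-and-spin scheme described above could be sketched roughly as follows. This is a minimal illustration, not the answer's own code: it assumes std::atomic<...>::exchange as a portable stand-in for the Win32 interlocked call and std::this_thread::yield in place of Sleep(0), and the class and method names (SwapGuardedQueue, Push, TryPop, Acquire, Release) are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <thread>

// Hypothetical sketch: the shared pointer doubles as the lock.
// A null pointer means "some thread currently owns the queue".
template <typename T>
class SwapGuardedQueue
{
public:
    SwapGuardedQueue() : m_shared(new std::deque<T>()) {}
    ~SwapGuardedQueue() { delete m_shared.load(); }

    void Push(const T& value)
    {
        std::deque<T>* q = Acquire();
        q->push_back(value);
        Release(q);
    }

    // Returns false if the queue was empty when examined; never blocks
    // waiting for data, matching the "occasional glance" scenario.
    bool TryPop(T& out)
    {
        std::deque<T>* q = Acquire();
        bool found = !q->empty();
        if (found) {
            out = q->front();
            q->pop_front();
        }
        Release(q);
        return found;
    }

private:
    // Swap null in; if we got null back, another thread holds the queue,
    // so give up the time slice and retry.
    std::deque<T>* Acquire()
    {
        std::deque<T>* q;
        while ((q = m_shared.exchange(nullptr)) == nullptr)
            std::this_thread::yield();
        return q;
    }

    // Swap the queue pointer back so other threads can obtain it.
    void Release(std::deque<T>* q)
    {
        assert(q != nullptr);
        m_shared.exchange(q);
    }

    std::atomic<std::deque<T>*> m_shared;
};
```

Note that this sketch is not exception-safe: if copying T throws while the pointer is held, the pointer is never swapped back and every other thread spins forever. A small RAII guard around Acquire/Release would address that.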

The fastest locking primitive is usually a spin-lock or spin-sleep-lock. CRITICAL_SECTION is just such a (user-space) spin-sleep-lock.
(Well, aside from not using locking primitives at all of course. But that means using lock-free data-structures, and those are really really hard to get right.)
As for avoiding the Sleep: have a look at condition-variables. They're designed to be used together with a "mutex", and I think they're much easier to use correctly than Windows' EVENTs.
Boost.Thread has a nice portable implementation of both, fast user-space spin-sleep-locks and condition variables:
http://www.boost.org/doc/libs/1_44_0/doc/html/thread/synchronization.html#thread.synchronization.condvar_ref
A work-queue using Boost.Thread could look something like this:
template <class T>
class Queue : private boost::noncopyable
{
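public:
// m_maxSize is const, so it must be initialized by a constructor
explicit Queue(size_t maxSize) : m_maxSize(maxSize) {}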
void Enqueue(T const& t)
{
unique_lock lock(m_mutex);
// wait until the queue is not full
while (m_backingStore.size() >= m_maxSize)
m_queueNotFullCondition.wait(lock); // releases the lock temporarily
m_backingStore.push_back(t);
m_queueNotEmptyCondition.notify_all(); // notify waiters that the queue is not empty
}
T DequeueOrBlock()
{
unique_lock lock(m_mutex);
// wait until the queue is not empty
while (m_backingStore.empty())
m_queueNotEmptyCondition.wait(lock); // releases the lock temporarily
T t = m_backingStore.front();
m_backingStore.pop_front();
m_queueNotFullCondition.notify_all(); // notify waiters that the queue is not full
return t;
}
private:
typedef boost::recursive_mutex mutex;
typedef boost::unique_lock<boost::recursive_mutex> unique_lock;
size_t const m_maxSize;
mutex mutable m_mutex;
boost::condition_variable_any m_queueNotEmptyCondition;
boost::condition_variable_any m_queueNotFullCondition;
std::deque<T> m_backingStore;
};
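The queue above blocks indefinitely, whereas the question asks for a Get with a timeout. One way to get that, sketched here with the standard library's std::condition_variable rather than Boost (an assumption made to keep the example self-contained), is to wait with a predicate. The names TimedQueue, Enqueue and Dequeue are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical sketch of an unbounded queue whose Dequeue gives up
// after a timeout instead of blocking forever.
template <typename T>
class TimedQueue
{
public:
    void Enqueue(const T& value)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_items.push_back(value);
        }
        m_notEmpty.notify_one(); // one new item, so wake one waiter
    }

    // Returns true and fills `out` if an item arrived before the timeout.
    bool Dequeue(T& out, std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // wait_for with a predicate re-checks the queue after every wakeup
        // and measures the timeout against a single deadline, so a thread
        // that loses the race for an item goes back to waiting for only
        // the time that remains.
        if (!m_notEmpty.wait_for(lock, timeout,
                                 [this] { return !m_items.empty(); }))
            return false; // timed out with the queue still empty
        assert(!m_items.empty());
        out = m_items.front();
        m_items.pop_front();
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_notEmpty;
    std::deque<T> m_items;
};
```

Because the emptiness check and the wait happen under the same mutex, no thread can fall through and find the queue empty, and there is no need to restart the timeout by hand.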

There are various ways to do this.
For one, you could create an event called 'run' that the main thread signals when the worker thread should terminate. Instead of Sleep, the worker then calls WaitForSingleObject on that event with a timeout; that way it quits as soon as the event is signaled instead of waiting out the full sleep interval.
Another way is to accept messages in your loop and then invent a user-defined message that you post to the thread.
EDIT: depending on the situation, it may also be wise to have yet another thread that monitors this thread to check whether it is dead or not. This can be done with the above-mentioned message queue: replying to a certain message within x ms would mean that the thread hasn't locked up.

I'd restructure a bit:
WorkItem GetWorkItem()
{
while(true)
{
WaitForSingleObject(queue.Ready, INFINITE); // the timeout argument is required
{
ScopeLock lock(queue.Lock);
if(!queue.IsEmpty())
{
return queue.GetItem();
}
}
}
}
int WorkerThread(param)
{
bool done = false;
do
{
WorkItem work = GetWorkItem();
if( work.IsQuitMessage() )
{
done = true;
}
else
{
work.Process();
}
} while(!done);
return 0;
}
Points of interest:
ScopeLock is a RAII class to make critical section usage safer.
Block on event until workitem is (possibly) ready - then lock while trying to dequeue it.
Don't use a global "IsDone" flag; enqueue special quit-message WorkItems instead.

You can have a look at another approach here that uses C++0x atomic operations
http://www.drdobbs.com/high-performance-computing/210604448

Use a semaphore instead of an event.
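A semaphore carries a count, so each Add releases exactly one waiter and a thread that gets past the wait is guaranteed to find an item. The sketch below illustrates the idea under some assumptions: CountingSemaphore is a small hand-rolled, portable stand-in for a Win32 semaphore handle (with the Win32 API this would presumably map to CreateSemaphore, ReleaseSemaphore and WaitForSingleObject), and SemaphoreBuffer and its members are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <list>
#include <mutex>

// Minimal counting semaphore, a portable stand-in (an assumption of this
// sketch) for a Win32 semaphore handle.
class CountingSemaphore
{
public:
    void Release()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_count;
        }
        m_cv.notify_one();
    }

    // Takes one count; returns false if none became available in time.
    bool TryAcquireFor(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        if (!m_cv.wait_for(lock, timeout, [this] { return m_count > 0; }))
            return false;
        --m_count;
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    long m_count = 0;
};

template <typename T>
class SemaphoreBuffer
{
public:
    void Add(const T& val)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_buffer.push_back(val);
        }
        m_itemsAvail.Release(); // one count per queued item
    }

    // Returns false on timeout instead of throwing, to keep the sketch short.
    bool Get(T& val, std::chrono::milliseconds timeout)
    {
        if (!m_itemsAvail.TryAcquireFor(timeout))
            return false;
        std::lock_guard<std::mutex> lock(m_mutex);
        assert(!m_buffer.empty()); // holding a count guarantees an item
        val = m_buffer.front();
        m_buffer.pop_front();
        return true;
    }

private:
    std::mutex m_mutex;
    std::list<T> m_buffer;
    CountingSemaphore m_itemsAvail;
};
```

The semaphore count always equals the number of queued items not yet claimed, so there is no event to reset and no retry loop in Get.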

Keep the signaling and synchronizing separate. Something along these lines...
// in main thread
HANDLE events[2];
events[0] = CreateEvent(...); // for shutdown
events[1] = CreateEvent(...); // for work to do
// start thread and pass the events
// in worker thread
DWORD ret;
while (true)
{
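ret = WaitForMultipleObjects(2, events, FALSE, <timeout val or INFINITE>);
if (ret == WAIT_OBJECT_0) // events[0]: shutdown
return;
else if (ret == WAIT_OBJECT_0 + 1) // events[1]: work to do
{
// enter crit sec
// dequeue work
// leave crit sec
// etc.
}
else if (ret == WAIT_TIMEOUT)
{
// do something else that has to be done
}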
}

Given that this question is tagged windows, I'll answer thus:
Don't create 1 worker thread. Your worker thread jobs are presumably independent, so you can process multiple jobs at once? If so:
In your main thread, call CreateIoCompletionPort to create an I/O completion port object.
Create a pool of worker threads. The number you need to create depends on how many jobs you might want to service in parallel. Some multiple of the number of CPU cores is a good start.
Each time a job comes in, call PostQueuedCompletionStatus(), passing a pointer to the job struct as the lpOverlapped argument.
Each worker thread calls GetQueuedCompletionStatus(), retrieves the work item from the lpOverlapped pointer, and does the job before returning to GetQueuedCompletionStatus.
This looks heavy, but io completion ports are implemented in kernel mode and represent a queue that can be deserialized into any of the worker threads associated with the queue (i.e. waiting on a call to GetQueuedCompletionStatus). The io completion port knows how many of the threads that are processing an item are actually using a CPU vs blocked on an IO call - and will release more worker threads from the pool to ensure that the concurrency count is met.
So, it's not lightweight, but it is very, very efficient. An I/O completion port can be associated with pipe and socket handles, for example, and can dequeue the results of asynchronous operations on those handles. I/O completion port designs can scale to handling tens of thousands of socket connections on a single server, but on the desktop side of the world they also make a very convenient way of scaling the processing of jobs over the 2 or 4 cores now common in desktop PCs.
