Waiting for all asynchronous consumers to complete before producer continues

We have a problem set that is very close to the producer-consumer problem. The actual use case is for a thread (producer) that runs through a directory listing (approx. 2000 entries), then feeds these entries to 4 threads (consumers) that process specific files in those directories.
The problem we are attempting to resolve is how to make the producer thread wait for the final consumer to complete before continuing on. There is post-processing required once we have all the files in memory that can only be done once all the files have been read.
We have implemented a very naive counter solution based on a busy wait that polls a class counter (counter incremented by producer, decremented by consumer, protected by a mutex):
while(fileCnt > 0) {
    usleep(10000);
}
This is of course not a nice solution.
Is there any way of doing this via conditionals/semaphores/something else?
We are limited to non-C++11 implementations (pthread based).
Thanks.

Hmm.. this is actually quite difficult to do in an efficient manner for a general case. If you know before submitting your first entry how many objects you are going to submit to the queue, (as you seem to do), it's easier:
Set an atomic integer to the number of objects to be submitted. Load a callback in each item queued that the threads call when they have finished processing each object. The callback decrements the int towards zero. When a thread decrements it to zero, it signals a synchro object upon which the producer is waiting after queueing its last object.
I'm still thinking about what to do if the producer is iterating some list and does not know where the end is before queueing its first item:(
That case may require an actual lock in the callback so that the producer can enter it and check 'atomically' if all the queued operations are finished yet and, if not, wait on the synchro object after exiting the lock. It's safer if the synchro object maintains state, eg. a semaphore, so that a signal made after exiting the lock, but before the waiting, is not missed, (?? not sure how to do it safely with a condvar??).
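For what it's worth, here is a minimal pthread-only sketch of how the condvar variant might look. It is not taken from the question's code: PendingCounter, consume and run_demo are made-up names, and the "file processing" is just a counter increment. The assumption is that the producer calls add() before queueing each entry and calls wait_all() only after queueing the last one, so the count can never be seen at zero while work is still outstanding. Because the counter is only touched with the mutex held, a signal cannot be missed between the check and the wait.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical helper (not from the question): counts outstanding work items.
class PendingCounter {
public:
    PendingCounter() : pending_(0) {
        pthread_mutex_init(&mutex_, NULL);
        pthread_cond_init(&all_done_, NULL);
    }
    ~PendingCounter() {
        pthread_cond_destroy(&all_done_);
        pthread_mutex_destroy(&mutex_);
    }
    // Producer: call once per entry, before queueing it.
    void add() {
        pthread_mutex_lock(&mutex_);
        ++pending_;
        pthread_mutex_unlock(&mutex_);
    }
    // Consumer: call after an entry has been fully processed.
    void done() {
        pthread_mutex_lock(&mutex_);
        if (--pending_ == 0)
            pthread_cond_broadcast(&all_done_);
        pthread_mutex_unlock(&mutex_);
    }
    // Producer: call after queueing the last entry; sleeps instead of polling.
    void wait_all() {
        pthread_mutex_lock(&mutex_);
        while (pending_ > 0)  // the loop also absorbs spurious wakeups
            pthread_cond_wait(&all_done_, &mutex_);
        pthread_mutex_unlock(&mutex_);
    }
private:
    PendingCounter(const PendingCounter&);             // non-copyable, C++98 style
    PendingCounter& operator=(const PendingCounter&);
    long pending_;
    pthread_mutex_t mutex_;
    pthread_cond_t all_done_;
};

struct Worker { PendingCounter* counter; int items; int processed; };

extern "C" void* consume(void* arg) {
    Worker* w = static_cast<Worker*>(arg);
    for (int i = 0; i < w->items; ++i) {
        ++w->processed;       // stand-in for reading one file
        w->counter->done();
    }
    return NULL;
}

// Simulates 2000 entries shared by 4 consumers; returns how many entries
// had been processed at the moment wait_all() returned.
int run_demo() {
    PendingCounter counter;
    for (int i = 0; i < 2000; ++i) counter.add();
    Worker workers[4];
    pthread_t threads[4];
    for (int i = 0; i < 4; ++i) {
        workers[i].counter = &counter;
        workers[i].items = 500;
        workers[i].processed = 0;
        pthread_create(&threads[i], NULL, consume, &workers[i]);
    }
    counter.wait_all();       // post-processing could start right here
    int total = 0;
    for (int i = 0; i < 4; ++i) total += workers[i].processed;
    for (int i = 0; i < 4; ++i) pthread_join(threads[i], NULL);
    return total;
}
```

Since add() does not need to know the final total, this should also cover the case where the producer is still iterating when the first consumers finish.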

Related

C++ wait notify in threads with synchronized queues

I have a program structured like this: one thread that receives tasks and writes them to an input queue, multiple threads that process them and write to an output queue, and one that responds with results from it. When a queue is empty, the thread sleeps for several milliseconds. The queue has a mutex inside it; pushing does lock(), and popping does try_lock() and returns if there is nothing in the queue.
This is processing thread for example:
//working - atomic bool
while (working) {
    if (!inputQue_->pop(msg)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        continue;
    } else {
        string reply = messageHandler_->handle(msg);
        if (!reply.empty()) {
            outputQue_->push(reply);
        }
    }
}
And the thing that I don't like is that the time from receiving a task until responding, as I have measured with high_resolution_clock, is almost 0 when there is no sleeping. When there is sleeping, it becomes bigger.
I don't want CPU resources to be wasted and want to do something like this: when the receiving thread gets a task, it notifies one of the processing threads, which does wait_for, and when the processing task is done, it notifies the responding thread the same way. As a result I think I will spend less time and CPU resources will not be wasted. And I have some questions:
Will this work the way that I see it supposed to, and the only difference will be waking up on notifying?
To do this, I have to create 2 condition variables: first same for receiving thread and all processing, second same for all processing and responding? And mutex in processing threads has to be common for all of them or unique?
Can I place creation of unique_lock(mutex) and wait_for() in if branch just instead of sleep_for?
If some processing threads are busy, is it possible that notify_one() can try to wake up one of them, but not the free thread? I need to use notify_all()?
Is it possible that notify will not wake up any of threads? If yes, does it have high probability?

Will this work the way that I see it supposed to, and the only difference will be waking up on notifying?
Yes, assuming you do it correctly.
To do this, I have to create 2 condition variables: first same for receiving thread and all processing, second same for all processing and responding? And mutex in processing threads has to be common for all of them or unique?
You can use a single mutex and a single condition variable, but that makes it a bit more complex. I'd suggest a single mutex, but one condition variable for each condition a thread might want to wait for.
Can I place creation of unique_lock(mutex) and wait_for() in if branch just instead of sleep_for?
Absolutely not. You need to hold the mutex while you check whether the queue is empty and continue to hold it until you call wait_for. Otherwise, you destroy the entire logic of the condition variable. The mutex associated with the condition variable must protect the condition that the thread is going to wait for, which in this case is the queue being non-empty.
If some processing threads are busy, is it possible that notify_one() can try to wake up one of them, but not the free thread? I need to use notify_all()?
I don't know what you mean by the "free thread". As a general rule, you can use notify_one if it's not possible for a thread to be blocked on the condition variable that can't handle the condition. You should use notify_all if either more than one thread might need to be awoken or there's a possibility that more than one thread will be blocked on the condition variable and the "wrong thread" could be woken, that is, there could be at least one thread that can't do whatever it is that needs to be done.
Is it possible that notify will not wake up any of threads? If yes, does it have high probability?
Sure, it's quite possible. But that would mean no threads were blocked on the condition. In that case, no thread can block on the condition because threads must check the condition before they wait, and they do it while holding a mutex. To provide this atomic "unlock and wait" semantic is the entire purpose of a condition variable.
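To make the "check the condition while holding the mutex, then wait" rule concrete, here is a rough sketch of a queue whose pop blocks instead of sleeping. BlockingQueue and demo are invented names, not the asker's queue class, and the 5 second timeout is arbitrary. The predicate overload of wait_for re-checks the queue under the lock on every wakeup, so a push cannot slip in between the check and the wait.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for the question's queue class.
template <class T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        // Every waiter can handle an item, so waking one is enough.
        not_empty_.notify_one();
    }

    // Waits up to `timeout` for an item; returns false if none arrived.
    bool pop(T& out, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        // The mutex is held while the predicate runs and is released
        // atomically when the thread actually goes to sleep.
        if (!not_empty_.wait_for(lock, timeout,
                                 [this] { return !items_.empty(); }))
            return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;                   // protects items_
    std::condition_variable not_empty_;  // signalled when items_ gains an element
    std::queue<T> items_;
};

// Usage sketch: the consumer blocks in pop() until the producer pushes.
std::string demo() {
    BlockingQueue<std::string> queue;
    std::string received;
    std::thread consumer([&] {
        queue.pop(received, std::chrono::milliseconds(5000));
    });
    queue.push("task");
    consumer.join();
    return received;
}
```

With this shape the processing loop would call pop with a timeout and only use the timeout to re-check its `working` flag.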

The mechanism you have is called polling. The thread repeatedly checks (polls) if there is data available. As you mentioned, it has the drawback of wasting time. (But it is simple). What you mentioned you would like to use is called a blocking mechanism. This deschedules the thread until the moment that work becomes available.
1) Yes (although I don't know exactly what you're imagining)
2) a) Yes, 2 condition variables is one way to do it. b) Common mutex is best
3) You would probably place those within pop, and calling pop would have the potential to block.
4) No. notify_one will only wake a thread that is currently waiting from having called wait. Also, if multiple are waiting, it is not necessarily guaranteed which will receive the notification. (OS/library dependent)
5) No. If 1+ threads are waiting, notify_one is guaranteed to wake one. BUT if no threads are waiting, the notification is lost (and has no effect). Note that under certain edge conditions, notify_one may actually wake more than one. Also, a thread may wake from wait without anyone having called notify_one ("Spurious wake up"). The fact that this can happen at all means that you always have to re-check the condition after waking up.
This is called the producer/consumer problem btw.

In general, your considerations about condition variable are correct. My proposal is more connected to design and reusability of such functionality.
The main idea is to implement the ThreadPool pattern: a class whose constructor takes the number of worker threads, with methods submitTask, shutdown and join.
Having such a class, you will use 2 instances of pools: one multithreaded for processing, the second (single-threaded, by your choice) for sending results.
The pool consists of a Blocking Queue of Tasks and an array of Worker threads, each performing the same "pop Task and run" loop. The Blocking Queue encapsulates the mutex and cond_var. The Task is a common functor.
This also brings your design to a Task-oriented approach, which has a lot of advantages for the future of your application.
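A rough sketch of what such a pool might look like, using only the C++11 standard library. The method names follow the description above, but everything else (the member names, the choice that shutdown lets the workers drain the queue before exiting, the demo function) is an assumption made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t workers) : stopping_(false) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~ThreadPool() { shutdown(); join(); }

    void submitTask(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

    // Assumption: workers finish whatever is already queued, then exit.
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
    }

    void join() {
        for (std::thread& t : threads_)
            if (t.joinable()) t.join();
    }

private:
    // The "pop Task and run" loop executed by every worker.
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()> > tasks_;
    std::vector<std::thread> threads_;
    bool stopping_;
};

// Usage sketch: 4 processing workers hand their results to 1 sending worker.
int demo() {
    std::atomic<int> sent(0);
    ThreadPool senders(1);
    ThreadPool processors(4);
    for (int i = 0; i < 100; ++i)
        processors.submitTask([&senders, &sent] {
            senders.submitTask([&sent] { ++sent; });  // "send the reply"
        });
    processors.shutdown();
    processors.join();  // every reply has been queued by now
    senders.shutdown();
    senders.join();
    return sent.load();
}
```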
You are welcome to ask more questions about implementation details if you like this idea.
Best regards, Daniel

Interrupting threads if not joined

I am looking for a way (preferably with boost threads) to interrupt a thread if it has not joined. I start multiple threads, and would like to end any of them that have not finished within 200 milliseconds. I tried something like this:
boost::thread_group tgroup;
tgroup.create_thread(boost::bind(&print_f));
tgroup.create_thread(boost::bind(&print_g));
boost::this_thread::sleep(boost::posix_time::milliseconds(200));
tgroup.interrupt_all();
Now this works, and all threads are ended after 200 milliseconds; however I would like to try and join these threads if they finish before 200 milliseconds, is there a way to join and interrupt if not finished by a certain amount of time?
Edit: reason why I need join to happen before timeout:
I am creating a server where speed is very important. Unfortunately I have to make requests to other servers for some information. So I would like to make these calls in parallel, and finish as soon as possible. If a server is taking too long, I have to just ignore the information coming from that server, and continue on without it. So my timeout is the maximum amount of time I can wait. It will be extremely beneficial to me to be able to continue on with computation when all responses are received, instead of waiting for the timeout timer. So what my program will do:
-Get a request from a client.
-Parse information.
Create threads
-Send information to multiple other servers.
-Get information back from servers.
-Put information from servers on a shared queue.
End Threads
-Parse information from shared queue.
-Return information back to client

What you want to use is probably a set of scoped threads, and call terminate on all the remaining threads after timeout. thread groups and scoped threads are not useable together unfortunately.
The thread group class is actually a very simple container: you cannot remove a thread of it if you don't have a pointer to it already, and you cannot get a pointer to a thread which has been created by the group. The class API doesn't provide much either. This is a bit hindering for management in your situation.
The remaining solutions rely on creating the threads outside the group, and have each of them do a specific task just before finishing. It could:
remove itself from the group,
then add itself to another group
The managing thread will have to call join_all on the latter group, and act as before with the former.
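typedef std::map<boost::thread::id, boost::thread*> thread_map;
boost::mutex thmap_mutex; // protects thmap

// Runs the task, then moves the calling thread from group t1 to group t2.
void thread_end(thread_map& thmap, boost::thread_group& t1, boost::thread_group& t2,
                boost::function<void()> task) {
    task();
    boost::thread* self;
    {
        // Blocks until the creator has registered this thread in thmap and t1.
        boost::lock_guard<boost::mutex> lock(thmap_mutex);
        self = thmap[boost::this_thread::get_id()];
    }
    t1.remove_thread(self);
    t2.add_thread(self);
}

// Helper (name made up here): creates a thread and registers it before it can finish.
void start_tracked(thread_map& thmap, boost::thread_group& running,
                   boost::thread_group& finished, boost::function<void()> task) {
    boost::lock_guard<boost::mutex> lock(thmap_mutex);
    boost::thread* th = new boost::thread(boost::bind(&thread_end, boost::ref(thmap),
        boost::ref(running), boost::ref(finished), task));
    thmap[th->get_id()] = th;
    running.add_thread(th);
}

thread_map thmap;
boost::thread_group trunninggroup;
boost::thread_group tfinishedgroup;
start_tracked(thmap, trunninggroup, tfinishedgroup, &print_f);
start_tracked(thmap, trunninggroup, tfinishedgroup, &print_g);
boost::this_thread::sleep(boost::posix_time::milliseconds(200));
tfinishedgroup.join_all();
trunninggroup.interrupt_all();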
But this is not ideal if you actually want the managing thread to be notified of a thread end when it actually happens (and I'm not really certain it does anything useful anyway). A solution for getting notified is perhaps to:
do the group migration as above
then trigger a condition variable on which the management thread is doing a timed_wait
but you will have to do some time computation to keep track of the remaining time after being notified, and resume sleep with that time left. That would be entirely dependent on the Duration class used for that task.
Update
Seeing the big picture, I would try a completely different approach: I don't think that terminating a thread which is already finished is a problem, so I would leave them all in the group, and use the group to create them, as your code demonstrates.
However, I would try to wake up the managing thread as soon as all the threads are done, or after timeout. This is not doable with what the thread_group class offers alone, but it can be done with a custom made semaphore, or a patched version of boost::barrier to allow a timed wait.
Basically, you set a barrier to the number of threads in the group plus one (the main thread), and have the main thread time wait on it. Each worker thread does its work, and when finished, post its result in the queue, and wait on the barrier. If all the worker threads finish their task, everyone will wait and the barrier gets triggered.
Then the main thread (as well as all the others, but it doesn't matter) wakes up and can proceed by terminating the group and processing the result. Otherwise, it will be awakened at timeout and do the same anyway.
The patching of boost::barrier should not be too difficult, you should only need to duplicate the wait method and replace the condition variable wait inside by a timed_wait (I didn't look at the code, this assumption might be totally off the mark though). Otherwise I provided a sample semaphore implementation for this question, which shouldn't be difficult to patch either.
One last consideration: terminating a thread is usually not the best approach. You should instead try to signal the threads that they have to abort, and wait for them, or somehow have them pass their unfinished task to an auxiliary thread which should clean things up serially. Then your thread group would be ready to take on the next task, and you wouldn't have to destroy and create threads all the time, which is a somewhat costly operation. It will require formalizing the idea of a task in the context of your application, and making the threads run in a loop that takes new tasks and processes them.
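As an illustration of the "wake the manager when everyone is done, or at timeout" idea, here is a small sketch. Note that it swaps in the C++11 standard library (std::condition_variable) rather than a patched boost::barrier, and that TimedLatch and all_done_in_time are invented names. The predicate form of wait_for keeps track of the remaining time internally, so no manual time arithmetic is needed after a wakeup.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: a countdown the manager can wait on with a deadline.
class TimedLatch {
public:
    explicit TimedLatch(int count) : remaining_(count) {}

    // Worker: call once, after posting the result to the shared queue.
    void count_down() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--remaining_ == 0)
            done_.notify_all();
    }

    // Manager: returns true if every worker finished before the timeout.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        return done_.wait_for(lock, timeout, [this] { return remaining_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable done_;
    int remaining_;
};

// Usage sketch: two workers that each "call a server" for work_time,
// with the manager willing to wait at most 200 ms.
bool all_done_in_time(std::chrono::milliseconds work_time) {
    TimedLatch latch(2);
    std::vector<std::thread> workers;
    for (int i = 0; i < 2; ++i)
        workers.emplace_back([&latch, work_time] {
            std::this_thread::sleep_for(work_time);  // stand-in for the request
            latch.count_down();
        });
    bool in_time = latch.wait_for(std::chrono::milliseconds(200));
    // std::thread has no interrupt(), so this sketch simply joins late workers;
    // with Boost threads this is where interrupt_all() would go.
    for (std::thread& w : workers) w.join();
    return in_time;
}
```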

If you're using a very recent Boost and C++11, use try_join_for() (http://www.boost.org/doc/libs/1_53_0/doc/html/thread/thread_management.html#thread.thread_management.thread.try_join_for). Otherwise, use timed_join() (http://www.boost.org/doc/libs/1_53_0/doc/html/thread/thread_management.html#thread.thread_management.thread.timed_join).

C++, What and/or where is a pthread executing?

I have built a multi-threaded producer-consumer (add to a Queue, consume off the queue using numerous threads), but I am trying to optimize this further by sending a new produce() directly to the execution threads, if they are idle (instead of enqueue-ing it onto the queue).
So, I need to figure out where a thread is currently executing (is it currently conditionally waiting, or is it executing something). Can anyone suggest a way to do this?

If the execution thread is idle, won't it be waiting on the queue? The fastest way to get it some work to do is probably just pushing the work onto the queue.
Do you have reason to believe that the queue is a bottleneck?

That's what the queue should already do.
First, the thread can't be idle unless the queue is empty, right?
So what does your "enqueue and signal" operation do? It puts a pointer to the data where the thread can find it and then tells the thread to work on the data. That's the minimum task to do what you want to do anyway.
So no optimization should be possible.

You could have a global flag for each thread indicating whether it is waiting or not. Just set the flag before going into a pthread_cond_wait and reset it when released.
Having said this, I really don't see why you would want to venture away from the classic task queue pattern. It works well in most cases.

You can do this, but whether you actually want to is another matter - see the other posts.
First, forget about all the consumer threads waiting on a common semaphore. To do what you seem to want, waiting consumer threads have to be addressed by instance. To do this, a consumer that turns up, locks the queue and finds it empty needs to wait on an event of its very own. Also, the consumer needs to provide, in its 'pop' call, the address of where it wants the object put. So, in addition to the 'normal' object queue, consumer threads that need to wait need a struct containing a pointer and an event to wait on. You could create an array, or circular buffer, of these wait_structs when you create the P-C queue.
Then you're set.
PRODUCER: (calls push with an object ref/ptr)
Acquires queue lock and checks the list of wait_structs. If there is an entry, it loads its object into the address pointed to by the wait_struct pointer, (so 'sending a new produce() directly to the execution thread'), and signals the wait_struct event. If there is no entry in the list of wait_structs, the producer queues its object in the object queue. Oh yes - releases the queue lock :)
CONSUMER: (calls pop with the address where it wants an object ref put)
Acquires queue lock and checks the object queue count. If it's non-zero, it pops the object, shoves it into the target address it provided, releases the lock and runs on. If the object queue is empty, the consumer gets a free wait_struct in the list of wait_structs, sets the pointer to the value it passed in, releases the queue lock and waits on the event. When the event gets signaled, the consumer already has its object, (shoved in by the producer), and can just run on - no need to visit the PC-queue again.
Yes, this design works, (in Delphi, anyway - should work in C++), and is faster than a 'classic' semaphore-based PC-queue, (which is faster than a Windows Message Queue, which is faster than an IOCP queue).
I have got it working with a timeout - I'll let you figure out how to do that. (Hint - you have to make use of the consumer object location, (that is addressed by the pointer passed in), as temporary storage :)
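A C++11 sketch of the PRODUCER/CONSUMER steps described above, without the timeout. This is only an approximation of the design: HandoffQueue and demo are made-up names, a per-waiter std::condition_variable plays the role of the "event", and the wait_struct lives on the waiting consumer's stack instead of in a preallocated array.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

template <class T>
class HandoffQueue {
    struct Waiter {                     // the wait_struct
        T* slot;                        // where the consumer wants its object
        bool ready;                     // set by the producer once slot is filled
        std::condition_variable event;  // this consumer's very own event
    };

public:
    // PRODUCER: hand the object straight to an idle consumer if there is one.
    void push(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!waiters_.empty()) {
            Waiter* w = waiters_.front();
            waiters_.pop_front();
            *w->slot = std::move(value);
            w->ready = true;
            // Notify while still holding the lock: the Waiter sits on the
            // consumer's stack and must stay alive until we are done with it.
            w->event.notify_one();
        } else {
            items_.push_back(std::move(value));
        }
    }

    // CONSUMER: take a queued object, or register and wait for a direct handoff.
    void pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!items_.empty()) {
            out = std::move(items_.front());
            items_.pop_front();
            return;
        }
        Waiter self;
        self.slot = &out;
        self.ready = false;
        waiters_.push_back(&self);
        // On return the object is already in `out`; no second queue visit.
        self.event.wait(lock, [&self] { return self.ready; });
    }

private:
    std::mutex mutex_;               // the queue lock
    std::deque<T> items_;            // the 'normal' object queue
    std::deque<Waiter*> waiters_;    // consumers currently waiting
};

// Usage sketch: whichever path each push takes, the consumer sees 40 then 2.
int demo() {
    HandoffQueue<int> queue;
    int first = 0, second = 0;
    std::thread consumer([&] { queue.pop(first); queue.pop(second); });
    queue.push(40);
    queue.push(2);
    consumer.join();
    return first + second;
}
```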

how to pass data to running thread

When using pthread, I can pass data at thread creation time.
What is the proper way of passing new data to an already running thread?
I'm considering making a global variable and make my thread read from that.
Thanks

That will certainly work. Basically, threads are just lightweight processes that share the same memory space. Global variables, being in that memory space, are available to every thread.
The trick is not with the readers so much as the writers. If you have a simple chunk of global memory, like an int, then assigning to that int will probably be safe. But consider something a little more complicated, like a struct. Just to be definite, let's say we have
struct S { int a; float b; } s1, s2;
Now s1,s2 are variables of type struct S. We can initialize them
s1 = { 42, 3.14f };
and we can assign them
s2 = s1;
But when we assign them the processor isn't guaranteed to complete the assignment to the whole struct in one step -- we say it's not atomic. So let's now imagine two threads:
thread 1:
while (true){
    printf("{%d,%f}\n", s2.a, s2.b );
    sleep(1);
}
thread 2:
while(true){
    sleep(1);
    s2 = s1;
    s1.a += 1;
    s1.b += 3.14f ;
}
We can see that we'd expect s2 to have the values {42, 3.14}, {43, 6.28}, {44, 9.42} ....
But what we see printed might be anything like
{42,3.14}
{43,3.14}
{43,6.28}
or
{43,3.14}
{44,6.28}
and so on. The problem is that thread 1 may get control and "look at" s2 at any time during that assignment.
The moral is that while global memory is a perfectly workable way to do it, you need to take into account the possibility that your threads will cross over one another. There are several solutions to this, with the basic one being to use semaphores. A semaphore has two operations, confusingly named from Dutch as P and V.
P waits until the semaphore's counter is greater than 0 and then goes on, subtracting 1 from the counter; V adds 1 to the counter. With the counter starting at 1, at most one thread can be between P and V at a time. The only thing special is that they do this atomically -- they can't be interrupted.
Now, you code it as
thread 1:
while (true){
    P();
    printf("{%d,%f}\n", s2.a, s2.b );
    V();
    sleep(1);
}
thread 2:
while(true){
    sleep(1);
    P();
    s2 = s1;
    V();
    s1.a += 1;
    s1.b += 3.14f ;
}
and you're guaranteed that you'll never have thread 2 half-completing an assignment while thread 1 is trying to print.
(Pthreads has semaphores, by the way.)
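For the record, a sketch of the same idea with real POSIX calls, where sem_wait plays the role of P and sem_post the role of V. In POSIX terms the semaphore is created with a count of 1; sem_wait blocks until the count is positive and decrements it, and sem_post increments it. The writer, reader and run_demo names, the iteration counts and the "b is always half of a" invariant are made up for the test; unnamed semaphores (sem_init) are assumed to be available, which is the case on Linux but not on macOS.

```cpp
#include <pthread.h>
#include <semaphore.h>
#include <cassert>

struct S { int a; float b; };

static S s2 = { 0, 0.0f };
static sem_t s2_guard;  // count 1: at most one thread may touch s2 at a time

extern "C" void* writer(void*) {
    for (int i = 1; i <= 100000; ++i) {
        S next = { i, i * 0.5f };  // invariant for the demo: b == a * 0.5
        sem_wait(&s2_guard);       // P
        s2 = next;                 // the whole struct is assigned under the semaphore
        sem_post(&s2_guard);       // V
    }
    return NULL;
}

extern "C" void* reader(void* torn_count) {
    int* torn = static_cast<int*>(torn_count);
    for (int i = 0; i < 100000; ++i) {
        sem_wait(&s2_guard);       // P
        S copy = s2;
        sem_post(&s2_guard);       // V
        if (copy.b != copy.a * 0.5f)
            ++*torn;               // would indicate a half-completed assignment
    }
    return NULL;
}

// Returns how many inconsistent snapshots the reader observed.
int run_demo() {
    int torn = 0;
    sem_init(&s2_guard, 0, 1);     // 0: shared between threads of this process only
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, &torn);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    sem_destroy(&s2_guard);
    return torn;
}
```

A pthread_mutex_t would do the same job here; the semaphore is used only because that is what the text above describes.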

I have been using the message-passing, producer-consumer queue-based, comms mechanism, as suggested by asveikau, for decades without any problems specifically related to multiThreading. There are some advantages:
1) The 'threadCommsClass' instances passed on the queue can often contain everything required for the thread to do its work - member/s for input data, member/s for output data, methods for the thread to call to do the work, somewhere to put any error/exception messages and a 'returnToSender(this)' event to call so returning everything to the requester by some thread-safe means that the worker thread does not need to know about. The worker thread then runs asynchronously on one set of fully encapsulated data that requires no locking. 'returnToSender(this)' might queue the object onto another P-C queue, it might PostMessage it to a GUI thread, it might release the object back to a pool or just dispose() it. Whatever it does, the worker thread does not need to know about it.
2) There is no need for the requesting thread to know anything about which thread did the work - all the requestor needs is a queue to push on. In an extreme case, the worker thread on the other end of the queue might serialize the data and communicate it to another machine over a network, only calling returnToSender(this) when a network reply is received - the requestor does not need to know this detail - only that the work has been done.
3) It is usually possible to arrange for the 'threadCommsClass' instances and the queues to outlive both the requester thread and the worker thread. This greatly eases those problems when the requester or worker are terminated and dispose()'d before the other - since they share no data directly, there can be no AV/whatever. This also blows away all those 'I can't stop my work thread because it's stuck on a blocking API' issues - why bother stopping it if it can be just orphaned and left to die with no possibility of writing to something that is freed?
4) A threadpool reduces to a one-line for loop that creates several work threads and passes them the same input queue.
5) Locking is restricted to the queues. The more mutexes, condVars, critical-sections and other synchro locks there are in an app, the more difficult it is to control it all and the greater the chance of an intermittent deadlock that is a nightmare to debug. With queued messages, (ideally), only the queue class has locks. The queue class must work 100% with multiple producers/consumers, but that's one class, not an app full of uncoordinated locking, (yech!).
6) A threadCommsClass can be raised anytime, anywhere, in any thread and pushed onto a queue. It's not even necessary for the requester code to do it directly, eg. a call to a logger class method, 'myLogger.logString("Operation completed successfully");' could copy the string into a comms object, queue it up to the thread that performs the log write and return 'immediately'. It is then up to the logger class thread to handle the log data when it dequeues it - it may write it to a log file, it may find after a minute that the log file is unreachable because of a network problem. It may decide that the log file is too big, archive it and start another one. It may write the string to disk and then PostMessage the threadCommsClass instance on to a GUI thread for display in a terminal window, whatever. It doesn't matter to the log requesting thread, which just carries on, as do any other threads that have called for logging, without significant impact on performance.
7) If you do need to kill off a thread waiting on a queue, rather than waiting for the OS to kill it on app close, just queue it a message telling it to terminate.
There are surely disadvantages:
1) Shoving data directly into thread members, signaling it to run and waiting for it to finish is easier to understand and will be faster, assuming that the thread does not have to be created each time.
2) Truly asynchronous operation, where the thread is queued some work and, sometime later, returns it by calling some event handler that has to communicate the results back, is more difficult to handle for developers used to single-threaded code and often requires state-machine type design where context data must be sent in the threadCommsClass so that the correct actions can be taken when the results come back. If there is the occasional case where the requestor just has to wait, it can send an event in the threadCommsClass that gets signaled by the returnToSender method, but this is obviously more complex than simply waiting on some thread handle for completion.
Whatever design is used, forget the simple global variables as other posters have said. There is a case for some global types in thread comms - one I use very often is a thread-safe pool of threadCommsClass instances, (this is just a queue that gets pre-filled with objects). Any thread that wishes to communicate has to get a threadCommsClass instance from the pool, load it up and queue it off. When the comms is done, the last thread to use it releases it back to the pool. This approach prevents runaway new(), and allows me to easily monitor the pool level during testing without any complex memory-managers, (I usually dump the pool level to a status bar every second with a timer). Leaking objects, (level goes down), and double-released objects, (level goes up), are easily detected and so get fixed.
MultiThreading can be safe and deliver scaleable, high-performance apps that are almost a pleasure to maintain/enhance, (almost:), but you have to lay off the simple globals - treat them like Tequila - quick and easy high for now but you just know they'll blow your head off tomorrow.
Good luck!
Martin

Global variables are bad to begin with, and even worse with multi-threaded programming. Instead, the creator of the thread should allocate some sort of context object that's passed to pthread_create, which contains whatever buffers, locks, condition variables, queues, etc. are needed for passing information to and from the thread.

You will need to build this yourself. The most typical approach requires some cooperation from the other thread as it would be a bit of a weird interface to "interrupt" a running thread with some data and code to execute on it... That would also have some of the same trickiness as something like POSIX signals or IRQs, both of which it's easy to shoot yourself in the foot while processing, if you haven't carefully thought it through... (Simple example: You can't call malloc inside a signal handler because you might be interrupted in the middle of malloc, so you might crash while accessing malloc's internal data structures which are only partially updated.)
The typical approach is to have your thread creation routine basically be an event loop. You can build a queue structure and pass that as the argument to the thread creation routine. Then other threads can enqueue things and the thread's event loop will dequeue it and process the data. Note this is cleaner than a global variable (or global queue) because it can scale to have multiple of these queues.
You will need some synchronization on that queue data structure. Entire books could be written about how to implement your queue structure's synchronization, but the most simple thing would have a lock and a semaphore. When modifying the queue, threads take a lock. When waiting for something to be dequeued, consumer threads would wait on a semaphore which is incremented by enqueuers. It's also a good idea to implement some mechanism to shut down the consumer thread.
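A bare-bones sketch of that arrangement: a context object handed to pthread_create, a queue protected by a lock, a semaphore counting queued items, and a shutdown message. All names (WorkerContext, enqueue, worker_loop, run_demo) and the convention that a negative value means "shut down" are assumptions for illustration; error handling is omitted and unnamed POSIX semaphores are assumed to be available.

```cpp
#include <pthread.h>
#include <semaphore.h>
#include <cassert>
#include <queue>

// Hypothetical context object: everything the worker needs, no globals.
struct WorkerContext {
    std::queue<int> items;  // pending messages; a negative value means "shut down"
    pthread_mutex_t lock;   // protects items
    sem_t available;        // counts queued messages
    long sum;               // result produced by the worker
};

// Called by any other thread to pass new data to the running worker.
void enqueue(WorkerContext& ctx, int value) {
    pthread_mutex_lock(&ctx.lock);
    ctx.items.push(value);
    pthread_mutex_unlock(&ctx.lock);
    sem_post(&ctx.available);  // wakes the worker if it is waiting
}

// The thread routine is an event loop over the queue.
extern "C" void* worker_loop(void* arg) {
    WorkerContext* ctx = static_cast<WorkerContext*>(arg);
    for (;;) {
        while (sem_wait(&ctx->available) != 0) {
            // interrupted by a signal: just wait again
        }
        pthread_mutex_lock(&ctx->lock);
        int value = ctx->items.front();
        ctx->items.pop();
        pthread_mutex_unlock(&ctx->lock);
        if (value < 0)
            break;             // shutdown message
        ctx->sum += value;     // stand-in for real processing
    }
    return NULL;
}

// Starts a worker, feeds it 1..10, shuts it down and returns its result.
long run_demo() {
    WorkerContext ctx;
    ctx.sum = 0;
    pthread_mutex_init(&ctx.lock, NULL);
    sem_init(&ctx.available, 0, 0);
    pthread_t worker;
    pthread_create(&worker, NULL, worker_loop, &ctx);
    for (int i = 1; i <= 10; ++i)
        enqueue(ctx, i);
    enqueue(ctx, -1);
    pthread_join(worker, NULL);
    sem_destroy(&ctx.available);
    pthread_mutex_destroy(&ctx.lock);
    return ctx.sum;
}
```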

Lightest synchronization primitive for worker thread queue

I am about to implement a worker thread with work item queuing, and while I was thinking about the problem, I wanted to know if I'm doing the best thing.
The thread in question will have to have some thread local data (preinitialized at construction) and will loop on work items until some condition will be met.
pseudocode:
volatile bool run = true;
int WorkerThread(param)
{
    localclassinstance c1 = new c1();
    [other initialization]
    while(true) {
        [LOCK]
        [unqueue work item]
        [UNLOCK]
        if([hasWorkItem]) {
            [process data]
            [PostMessage with pointer to data]
        }
        [Sleep]
        if(!run)
            break;
    }
    [uninitialize]
    return 0;
}
I guess I will do the locking via critical section, as the queue will be std::vector or std::queue, but maybe there is a better way.
The part with Sleep doesn't look too great, as there will be a lot of extra Sleep with big Sleep values, or lots of extra locking when Sleep value is small, and that's definitely unnecessary.
But I can't think of a WaitForSingleObject friendly primitive I could use instead of critical section, as there might be two threads queuing work items at the same time. So Event, which seems to be the best candidate, can lose the second work item if the Event was set already, and it doesn't guarantee a mutual exclusion.
Maybe there is even a better approach with InterlockedExchange kind of functions that leads to even less serialization.
P.S.: I might need to preprocess the whole queue and drop the obsolete work items during the unqueuing stage.

There are a multitude of ways to do this.
One option is to use a semaphore for the waiting. The semaphore is signalled every time a value is pushed on the queue, so the worker thread will only block if there are no items in the queue. This will still require separate synchronization on the queue itself.
A second option is to use a manual-reset event which is set when there are items in the queue and cleared when the queue is empty. Again, you will need to do separate synchronization on the queue.
A third option is to have an invisible message-only window created on the thread, and use a special WM_USER or WM_APP message to post items to the queue, attaching the item to the message via a pointer.
Another option is to use condition variables. The native Windows condition variables only work if you're targeting Windows Vista or Windows 7, but condition variables are also available for Windows XP with Boost or an implementation of the C++0x thread library. An example queue using boost condition variables is available on my blog: http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html

It is possible to share a resource between threads without using blocking locks at all, if your scenario meets certain requirements.
You need an atomic pointer exchange primitive, such as Win32's InterlockedExchange. Most processor architectures provide some sort of atomic swap, and it's usually much less expensive than acquiring a formal lock.
You can store your queue of work items in a pointer variable that is accessible to all the threads that will be interested in it. (global var, or field of an object that all the threads have access to)
This scenario assumes that the threads involved always have something to do, and only occasionally "glance" at the shared resource. If you want a design where threads block waiting for input, use a traditional blocking event object.
Before anything begins, create your queue or work item list object and assign it to the shared pointer variable.
Now, when producers want to push something onto the queue, they "acquire" exclusive access to the queue object by swapping a null into the shared pointer variable using InterlockedExchange. If the result of the swap returns a null, then somebody else is currently modifying the queue object. Sleep(0) to release the rest of your thread's time slice, then loop to retry the swap until it returns non-null. Even if you end up looping a few times, this is many, many times faster than making a kernel call to acquire a mutex object. Kernel calls require hundreds of clock cycles to transition into kernel mode.
When you successfully obtain the pointer, make your modifications to the queue, then swap the queue pointer back into the shared pointer.
When consuming items from the queue, you do the same thing: swap a null into the shared pointer and loop until you get a non-null result, operate on the object in the local var, then swap it back into the shared pointer var.
This technique is a combination of atomic swap and brief spin loops. It works well in scenarios where the threads involved are not blocked and collisions are rare. Most of the time the swap will give you exclusive access to the shared object on the first try, and as long as the length of time the queue object is held exclusively by any thread is very short then no thread should have to loop more than a few times before the queue object becomes available again.
If you expect a lot of contention between threads in your scenario, or you want a design where threads spend most of their time blocked waiting for work to arrive, you may be better served by a formal mutex synchronization object.
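To make the swap-and-spin idea concrete, here is a minimal sketch. It is not Win32 code: it uses C++11's std::atomic exchange in place of InterlockedExchangePointer and std::this_thread::yield() in place of Sleep(0), and the class and method names (SwapGuardedQueue, Push, TryPop) are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <thread>

// Hypothetical wrapper: whoever holds the non-null pointer owns the queue.
template <class T>
class SwapGuardedQueue {
public:
    SwapGuardedQueue() : shared_(new std::deque<T>()) {}
    // Assumes no thread is still using the queue when it is destroyed.
    ~SwapGuardedQueue() { delete shared_.load(); }

    void Push(const T& item) {
        std::deque<T>* q = Acquire();
        q->push_back(item);   // note: if this throws, the pointer is never put back
        Release(q);
    }

    // Returns false if the queue was empty when we looked.
    bool TryPop(T& out) {
        std::deque<T>* q = Acquire();
        bool had_item = !q->empty();
        if (had_item) {
            out = q->front();
            q->pop_front();
        }
        Release(q);
        return had_item;
    }

private:
    // Swap null in; a non-null result means we now have exclusive access.
    std::deque<T>* Acquire() {
        for (;;) {
            std::deque<T>* q = shared_.exchange(nullptr);
            if (q) return q;
            std::this_thread::yield();  // someone else holds it: give up the time slice
        }
    }

    // Swap the queue pointer back so other threads can take it.
    void Release(std::deque<T>* q) {
        std::deque<T>* prev = shared_.exchange(q);
        assert(prev == nullptr);  // nobody else can have stored a pointer meanwhile
        (void)prev;
    }

    std::atomic<std::deque<T>*> shared_;
};
```

Keep the work done between Acquire and Release as short as possible; every other thread spins for as long as the pointer is held.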

The fastest locking primitive is usually a spin-lock or spin-sleep-lock. CRITICAL_SECTION is just such a (user-space) spin-sleep-lock.
(Well, aside from not using locking primitives at all of course. But that means using lock-free data-structures, and those are really really hard to get right.)
As for avoiding the Sleep: have a look at condition-variables. They're designed to be used together with a "mutex", and I think they're much easier to use correctly than Windows' EVENTs.
Boost.Thread has a nice portable implementation of both fast user-space spin-sleep-locks and condition variables:
http://www.boost.org/doc/libs/1_44_0/doc/html/thread/synchronization.html#thread.synchronization.condvar_ref
A work-queue using Boost.Thread could look something like this:
#include <cstddef>
#include <deque>
#include <boost/noncopyable.hpp>
#include <boost/thread.hpp>

template <class T>
class Queue : private boost::noncopyable
{
public:
    // maxSize bounds the queue; Enqueue blocks while the queue is full
    explicit Queue(size_t maxSize) : m_maxSize(maxSize) {}

    void Enqueue(T const& t)
    {
        unique_lock lock(m_mutex);
        // wait until the queue is not full
        while (m_backingStore.size() >= m_maxSize)
            m_queueNotFullCondition.wait(lock); // releases the lock temporarily
        m_backingStore.push_back(t);
        m_queueNotEmptyCondition.notify_all(); // notify waiters that the queue is not empty
    }

    T DequeueOrBlock()
    {
        unique_lock lock(m_mutex);
        // wait until the queue is not empty
        while (m_backingStore.empty())
            m_queueNotEmptyCondition.wait(lock); // releases the lock temporarily
        T t = m_backingStore.front();
        m_backingStore.pop_front();
        m_queueNotFullCondition.notify_all(); // notify waiters that the queue is not full
        return t;
    }

private:
    typedef boost::recursive_mutex mutex;
    typedef boost::unique_lock<boost::recursive_mutex> unique_lock;
    size_t const m_maxSize;
    mutex mutable m_mutex;
    boost::condition_variable_any m_queueNotEmptyCondition;
    boost::condition_variable_any m_queueNotFullCondition;
    std::deque<T> m_backingStore;
};

There are various ways to do this.
For one, you could create an event called 'run' and use it to detect when the thread should terminate; the main thread does the signalling. Instead of Sleep you would then call WaitForSingleObject with a timeout, so that the thread quits immediately instead of waiting out the full sleep interval.
Another way is to accept messages in your loop and invent a user-defined message that you post to the thread.
EDIT: depending on the situation, it may also be wise to have yet another thread that monitors this thread to check whether it is dead. This can be done via the message queue mentioned above: a reply to a certain message within x ms would mean that the thread hasn't locked up.
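As a rough illustration of the first suggestion (wait on a signal with a timeout instead of sleeping), here is a portable sketch. It swaps the Win32 event and WaitForSingleObject for a C++11 condition variable with wait_for; StopSignal and RunUntilStopped are hypothetical names, not part of any library.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a manual-reset "please stop" event.
class StopSignal {
public:
    StopSignal() : stop_(false) {}

    // Called by the main thread to ask the worker to quit.
    void RequestStop() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
    }

    // Waits up to `timeout`. Returns true as soon as a stop has been
    // requested, false if the timeout elapsed first.
    bool WaitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_);
        return cv_.wait_for(lock, timeout, [this] { return stop_; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_;
};

// Worker loop: do one unit of work, then wait on the signal for `period`
// where the original code would have slept. Returns the number of iterations.
template <class Work>
int RunUntilStopped(StopSignal& stop, std::chrono::milliseconds period, Work do_work) {
    assert(period.count() >= 0);
    int iterations = 0;
    do {
        do_work();
        ++iterations;
    } while (!stop.WaitFor(period));
    return iterations;
}
```

The worker wakes up the moment RequestStop() is called, rather than only after the current sleep interval has run out.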

I'd restructure a bit:
WorkItem GetWorkItem()
{
    while(true)
    {
        // block until a producer signals that an item may be available
        WaitForSingleObject(queue.Ready, INFINITE);
        {
            ScopeLock lock(queue.Lock);
            if(!queue.IsEmpty())
            {
                return queue.GetItem();
            }
        }
        // the queue was empty after all: go back to waiting
    }
}
DWORD WINAPI WorkerThread(LPVOID param)
{
    bool done = false;
    do
    {
        WorkItem work = GetWorkItem();
        if( work.IsQuitMessage() )
        {
            done = true;
        }
        else
        {
            work.Process();
        }
    } while(!done);
    return 0;
}
Points of interest:
ScopeLock is an RAII class to make critical section usage safer.
Block on the event until a work item is (possibly) ready, then lock while trying to dequeue it.
Don't use a global "IsDone" flag; enqueue special quit-message WorkItems instead.

You can have a look at another approach here that uses C++0x atomic operations:
http://www.drdobbs.com/high-performance-computing/210604448

Use a semaphore instead of an event.
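The reasoning is presumably that a semaphore keeps a count: every release is remembered, whereas setting an event that is already signalled has no further effect, so several quick posts can collapse into a single wake-up. Below is a minimal portable sketch of a queue built that way. It uses a small hand-rolled semaphore on top of a C++11 mutex and condition variable rather than a Win32 semaphore handle, and all the names (CountingSemaphore, SemaphoreQueue) are invented for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical minimal counting semaphore (stand-in for a Win32 semaphore).
class CountingSemaphore {
public:
    CountingSemaphore() : count_(0) {}

    // Adds one to the count and wakes one waiter, if any.
    void Release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++count_;
        }
        cv_.notify_one();
    }

    // Blocks until the count is positive, then takes one.
    void Acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_;
};

// Work queue where the semaphore count always matches the items not yet claimed.
template <class T>
class SemaphoreQueue {
public:
    void Push(const T& item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            items_.push_back(item);
        }
        available_.Release();  // one count per queued item
    }

    T Pop() {
        available_.Acquire();  // blocks until some item has been pushed
        std::lock_guard<std::mutex> lock(m_);
        assert(!items_.empty());  // guaranteed by the semaphore count
        T item = items_.front();
        items_.pop_front();
        return item;
    }

private:
    std::mutex m_;
    std::deque<T> items_;
    CountingSemaphore available_;
};
```

Because each Push adds exactly one count and each Pop takes exactly one, a worker never wakes up to an empty queue and no queued item is left without a wake-up.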

Keep the signaling and synchronizing separate. Something along these lines...
// in main thread
HANDLE events[2];
events[0] = CreateEvent(...); // for shutdown
events[1] = CreateEvent(...); // for work to do
// start thread and pass the events

// in worker thread
DWORD ret;
while (true)
{
    ret = WaitForMultipleObjects(2, events, FALSE, <timeout val or INFINITE>);
    if (ret == WAIT_OBJECT_0)            // events[0]: shutdown
        return;
    else if (ret == WAIT_OBJECT_0 + 1)   // events[1]: work to do
    {
        // enter crit sec
        // dequeue work
        // leave crit sec
        // etc.
    }
    else if (ret == WAIT_TIMEOUT)
    {
        // do something else that has to be done
    }
}

Given that this question is tagged windows, I'll answer thus:
Don't create 1 worker thread. Your worker thread jobs are presumably independent, so you can process multiple jobs at once? If so:
In your main thread call CreateIoCompletionPort to create an I/O completion port object.
Create a pool of worker threads. The number you need to create depends on how many jobs you might want to service in parallel. Some multiple of the number of CPU cores is a good start.
Each time a job comes in, call PostQueuedCompletionStatus(), passing a pointer to the job struct as the lpOverlapped argument.
Each worker thread calls GetQueuedCompletionStatus(), retrieves the work item from the lpOverlapped pointer and does the job before returning to GetQueuedCompletionStatus.
This looks heavy, but I/O completion ports are implemented in kernel mode and represent a queue that can be drained by any of the worker threads associated with it (i.e. waiting on a call to GetQueuedCompletionStatus). The I/O completion port knows how many of the threads that are processing an item are actually using a CPU vs blocked on an I/O call, and will release more worker threads from the pool to ensure that the concurrency count is met.
So, it's not lightweight, but it is very, very efficient. An I/O completion port can be associated with pipe and socket handles, for example, and can dequeue the results of asynchronous operations on those handles. I/O completion port designs can scale to handling tens of thousands of socket connections on a single server, but on the desktop side of the world they make a very convenient way of scaling processing of jobs over the 2 or 4 cores now common in desktop PCs.
