std::async execution order guarantee - c++

I am trying to introduce logging to a multithreaded application. Currently I am just using std::cout from the different threads. However, in that case the order of the output gets jumbled: even though one thread logged early, its output in stdout comes after the log of another thread.
So one solution could be to move all logging to an extra thread, but I don't want to manage one more thread. So I am thinking of using std::async to do the logging from the different threads. Is this possible? Are there any suggested ways to do this? Also, is the order of execution of std::async guaranteed?
#include <iostream>
#include <future>

void print(int i)
{
    std::cout << i << std::endl;
}

int main()
{
    auto a1 = std::async(std::launch::async, print, 1);
    auto a2 = std::async(std::launch::async, print, 2);
    auto a3 = std::async(std::launch::async, print, 3);
    a3.wait();
    a2.wait();
    a1.wait();
}
For the above code, is it guaranteed that the order of output will be
1
2
3
?

The whole point of std::async(std::launch::async, ...) is that it's asynchronous (thus using not only "async" in the name, but repeating it in the first parameter).
You're not guaranteed much of anything about the relative order of things happening in threads created with std::async, unless you force synchronization using something like an std::mutex, std::condition_variable, or some of the synchronizing primitives in <atomic>.
You say you don't want to manage one more thread, but then you create and manage not only one, but three more threads. I don't quite understand how this makes sense.
My own tendency would be to create a type to handle logging. Whether it uses a separate thread or not is its own internal affair. The threads doing the real work just do something like: log(error) << "error 12345"; and it's up to the logging object to implement that efficiently. Yes, if you have a lot of other threads contending for use of the single logging object, it's likely to be best off running in a thread of its own--but they should neither know nor care one way or the other about that.
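For illustration, here is a minimal sketch (assuming C++17) of what such a logging type's front end might look like. This version just serializes complete messages with an internal mutex, but the same interface could forward to a dedicated logging thread instead; the names and the Level enum are made up for the example:

#include <iostream>
#include <mutex>
#include <sstream>

enum class Level { info, error };

// Accumulates one message, then emits it in a single locked write,
// so concurrent log statements cannot interleave mid-message.
class LogLine
{
public:
    explicit LogLine(Level level) : m_level(level) {}

    template <typename T>
    LogLine& operator<<(T const& value)
    {
        m_stream << value;
        return *this;
    }

    ~LogLine() // the whole line is emitted on destruction, under one lock
    {
        static std::mutex ioMutex;
        std::lock_guard<std::mutex> lock(ioMutex);
        std::cout << (m_level == Level::error ? "[E] " : "[I] ")
                  << m_stream.str() << '\n';
    }

private:
    Level m_level;
    std::ostringstream m_stream;
};

LogLine log(Level level) { return LogLine(level); }

// usage from any thread:
//   log(Level::error) << "error " << 12345;

Whether the destructor writes to std::cout directly, as here, or pushes the finished string onto a queue for a logging thread is exactly the internal affair the calling threads shouldn't care about.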

For the above code, is it guaranteed that the order of output will be
1
2
3
?
No. wait() waits until the value is available. It doesn't start the execution; it merely blocks until it's done.
It may well happen that all of those futures are ready even before you call any of the wait()s. In that case it's quite obvious that the order is not guaranteed (and completely unrelated to the order of those wait() calls).

A way to offload logging from the current thread to elsewhere is to have a queue between the threads wanting to log and a thread actually writing the log to disk (or wherever). A queue is a good object to use, because it preserves order.
There are probably far better ways of doing this than what I'm about to describe (and I did this quite a while ago, so it's bound to have been bettered since). It's possible to adapt log4cpp so as to have a thread accepting logging requests submitted via a std::queue. That's not a multithreaded thing by itself, so what I've done before is to create a log event class, manage instances of it using shared_ptrs, and put the shared_ptrs to log event objects on the std::queue (so the post to the queue is small and fast, with a minimal lock time on the necessary mutex). Then I added in ZeroMQ with a PUSH/PULL pattern to allow multiple PUSHers (posters onto the std::queue) to send 1-byte messages to wake up a single PULL thread (which polls the ZeroMQ socket and pulls from the std::queue). So logging consisted of creating a log event object, acquiring a mutex, pushing a shared_ptr to the log event object onto the std::queue, releasing the mutex, and finally pushing a 1-byte message into a ZeroMQ socket.
Yes, it's a fairly horrific blend of std::queue and ZeroMQ, but it was quick to dispatch an arbitrarily long log event without having to serialise the log event data in the thread sending it.
A possible embellishment would be to turn off the thread-safe reference counting on the shared_ptr (it's not needed here), or just use a raw pointer instead and try to remember to call delete.
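For concreteness, a condensed sketch of the queue part of that scheme; the ZeroMQ wake-up is replaced here by a condition_variable so the example is self-contained, and the LogEvent type is illustrative:

#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>

struct LogEvent
{
    std::string text;
    // timestamp, severity, thread id, ...
};

std::queue<std::shared_ptr<LogEvent>> g_logQueue;
std::mutex g_logMutex;
std::condition_variable g_logWakeup; // stands in for the 1-byte ZeroMQ message

// Called from any worker thread: cheap, only pointer-sized work under the lock.
void PostLogEvent(std::shared_ptr<LogEvent> event)
{
    {
        std::lock_guard<std::mutex> lock(g_logMutex);
        g_logQueue.push(std::move(event));
    }
    g_logWakeup.notify_one();
}

// Run by the single logging thread (the PULL side of the original design).
std::shared_ptr<LogEvent> WaitForLogEvent()
{
    std::unique_lock<std::mutex> lock(g_logMutex);
    g_logWakeup.wait(lock, [] { return !g_logQueue.empty(); });
    auto event = std::move(g_logQueue.front());
    g_logQueue.pop();
    return event;
}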

Related

Best Design Practices for Passing Around Multithread Data?

I'm trying to create a clean and efficient design for passing off events to a background thread for evaluation, then return a selected result to the game thread.
This is my initial design
//Occurrence object passed from director on game thread to background thread OccurrenceQueue
//Execute BackgroundThread::EvaluateQueues()
//If OccurrenceQueue.Dequeue()
//Score Occurrence, then return via Occurrence.GetOwner()->PurposeSelected(Occurrence)
//EvaluateQueues()
This resulted in a clean loop of selecting a chain of purposes from an event. So now I want to move this to a background thread. Here is what I've learned so far:
- Thread safety (in UE) requires absolutely no modification of UObject data from other threads (from what I read, this is due to their custom GC).
- You can lock objects and/or design so that objects on the background thread aren't touched by the game thread, but there is still a risk of unexpected behavior, because the background thread doesn't extend object lifetimes, plus synchronization issues.
- You cannot simply execute a function on an object living on the game thread to move the call stack back to the game thread: calling Occurrence.GetOwner()->PurposeSelected(Occurrence) from a background thread remains in the background thread. This is the main subject I'd like to get a better understanding of. It applies to delegates in UE as well.
- TQueues in UE can be used across threads safely.
From what I've learned above, my current design doesn't appear to be logically possible.
These are my alternatives thus far:
- Use two queues: one to dequeue and score on the background thread, the other to dequeue the result on the game thread via tick.
- Use a delegate living on the game thread, which calls OccurrenceEvaluated.Broadcast() through tick; when a result is scored, bind Occurrence.GetOwner()->PurposeSelected(Occurrence) to OccurrenceEvaluated.
I've seen that C++ has something called std::future for async tasks, and it appears UE has something similar with TFuture<> and TFuture::IsReady(), but I have yet to look deeper into that and how it returns data. Same with FAsyncTask.
I'm hesitant to implement any design which relies on tick to check whether data has been updated/returned from background threads.
Can anyone suggest relevant design practices, or clarify the nature of returning execution to a main thread from a background thread? (I've had a hard time finding the right question to research/info regarding this.)
I found a perfect solution. As these events aren't particularly time sensitive, I just use Unreal Engine's AsyncTask() to schedule a task on the game thread from my background thread, which, as @Pepjin Kramer pointed out, is similar to std::async.
So simple it's basically a slap in the face.
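For reference, the call looks roughly like this (a sketch, assuming UE's AsyncTask helper from Async/Async.h; the surrounding class and types are made up for the example):

#include "Async/Async.h"

// On the background thread, after an Occurrence has been scored, marshal
// the result back by scheduling a task on the game thread.
void FOccurrenceEvaluator::OnScored(FOccurrence Occurrence)
{
    AsyncTask(ENamedThreads::GameThread, [Occurrence]()
    {
        // This lambda runs on the game thread, so touching UObjects is safe here.
        Occurrence.GetOwner()->PurposeSelected(Occurrence);
    });
}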
Good morning,
Threading can help you get things done faster and make an application more responsive, but it has a lot of pitfalls. The short answer: in situations like this I use these kinds of constructs.
#include <future>
#include <memory>
#include <vector>

int main()
{
    // make data shared
    auto data = std::make_shared<std::vector<int>>();
    data->push_back(42);

    // future will be a std::future<int>, deduced from the lambda's return type.
    // std::async runs the lambda in another thread.
    // Capture data by value! That copies the shared_ptr and
    // increases its reference count.
    auto future = std::async(std::launch::async, [data]
    {
        auto value = (*data)[0]; // dereference the shared_ptr before indexing
        return value;
        // the shared_ptr copy goes out of scope here and the reference count is decreased
    });

    // synchronize with the thread and get the "calculated" value
    auto my_value = future.get();

    // now data on the main thread goes out of scope and the reference count is
    // decreased again. Only when both the thread is done AND this function exits
    // is the vector deleted.
    return 0;
}
Most objects take up memory, and during an assignment that memory isn't updated in one clock cycle. You can protect memory with std::mutex and std::scoped_lock; try looking those up.
You often need to synchronize information/processing between threads; look at std::condition_variable (but be aware it has pitfalls: https://www.modernescpp.com/index.php/c-core-guidelines-be-aware-of-the-traps-of-condition-variables).
C++ has good support classes: std::thread and std::async/std::future. Personally I like the solution with std::future because you can return values and exceptions (!) from other threads when you call future.get().
Learn about lambdas and captures; they're vital for use with std::thread/std::async.
Mind the life cycle of objects! When sharing objects between threads you must be sure that one thread doesn't delete objects in use by another thread. Don't use raw pointers and/or new/delete when using threads. I personally use std::make_shared/std::shared_ptr for data/objects shared between threads.
Another tricky thing is that sometimes you cannot be sure work in another thread has started after creating it. E.g. when std::async returns, a thread has been created, but it isn't guaranteed to have really started running (operating system scheduling etc.). If you want to be really sure it has started, then after the async call returns you will have to wait on a condition_variable that you set at the start of the thread function, as in the sketch below.
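For example, a minimal sketch of that start-up handshake (my own illustration, not part of the original remarks):

#include <condition_variable>
#include <future>
#include <mutex>

int main()
{
    std::mutex m;
    std::condition_variable cv;
    bool started = false;

    auto future = std::async(std::launch::async, [&]
    {
        {
            std::lock_guard<std::mutex> lock(m);
            started = true; // first thing the thread does: report it is running
        }
        cv.notify_one();
        // ... actual work ...
        return 42;
    });

    // Block until the worker has really started running.
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return started; });
    }

    return future.get() == 42 ? 0 : 1;
}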
I hope these remarks can get you started.

What is the best way to share data containers between threads in c++

I have an application which has a couple of processing levels like:
InputStream->Pre-Processing->Computation->OutputStream
Each of these entities run in separate thread.
So in my code I have the general thread, which owns the
std::vector<ImageRead> m_readImages;
and then it passes this member variable to each thread:
InputStream input{&m_readImages};
std::thread threadStream{&InputStream::start, &input};
PreProcess pre{&m_readImages};
std::thread preStream{&PreProcess::start, &pre};
...
And each of these classes owns a pointer member to this data:
std::vector<ImageRead>* m_ptrReadImages;
I also have a global mutex defined, which I lock and unlock on each read/write operation to that shared container.
What bothers me is that this mechanism is pretty obscure and sometimes I get confused whether the data is used by another thread or not.
So what is the more straightforward way to share this container between those threads?
The process you described as "Input-->preprocessing-->computation-->Output" is sequential by design: each step depends on the previous one so parallelization in this particular manner is not beneficial as each thread just has to wait for another to complete. Try to find out which step takes most time and parallelize that. Or try to set up multiple parallel processing pipelines that operate sequentially on independent, individual data sets. A usual approach for that would employ a processing queue which distributes the tasks among a set of threads.
It would seem to me that your reading and preprocessing could be done independently of the container.
Naively, I would structure this as a fan-out and then fan-in network of tasks.
First, make a dispatch task (a task is a unit of work that is given to a thread to actually run) that creates the input-and-preprocess tasks.
Use futures as the means for the sub-tasks to communicate back a pointer to the completely loaded image.
Make a second task, the std::vector builder, that just waits on the futures to get the results as they complete and adds them to the std::vector.
I suggest you structure things this way because I suspect that any IO and preprocessing you are doing will take longer than setting a value in the vector. Using tasks instead of threads directly lets you tune the parallel portion of your work.
I hope that's not too abstracted away from the concrete elements. This is a pattern I find to be well balanced between saturating available hardware, reducing thrash / lock contention, and is understandable by future-you debugging it later.
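A small sketch of that fan-out/fan-in shape with the standard library; ImageRead and load_and_preprocess are placeholders for whatever the real IO and preprocessing do:

#include <future>
#include <string>
#include <vector>

struct ImageRead { std::string path; /* pixel data, etc. */ };

// Hypothetical stand-in for the real IO + preprocessing work.
ImageRead load_and_preprocess(std::string path)
{
    return ImageRead{std::move(path)};
}

int main()
{
    std::vector<std::string> paths = {"a.png", "b.png", "c.png"};

    // Fan out: one task per image, each returning its result through a future.
    std::vector<std::future<ImageRead>> tasks;
    for (auto const& p : paths)
        tasks.push_back(std::async(std::launch::async, load_and_preprocess, p));

    // Fan in: the "vector builder" just collects results as they become ready.
    std::vector<ImageRead> readImages;
    for (auto& t : tasks)
        readImages.push_back(t.get());
}

Because each sub-task returns through its own future, the builder is the only writer of the vector and needs no locking.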
I would use 3 separate queues, ready_for_preprocessing which is fed by InputStream and consumed by Pre-processing, ready_for_computation which is fed by Pre-Processing and consumed by Computation, and ready_for_output which is fed by Computation and consumed by OutputStream.
You'll want each queue to be in a class, which has an access mutex (to control actually adding and removing items from the queue) and an "image available" semaphore (to signal that items are available) as well as the actual queue. This would allow multiple instances of each thread. Something like this:
class imageQueue
{
    std::deque<ImageRead> m_readImages;
    std::mutex m_changeQueue;
    Semaphore m_imagesAvailable; // user-supplied counting semaphore type
public:
    bool addImage(ImageRead);
    ImageRead getNextImage();
};
addImage() takes the m_changeQueue mutex, adds the image to m_readImages, then signals m_imagesAvailable.
getNextImage() waits on m_imagesAvailable. When it becomes signaled, it takes m_changeQueue, removes the next image from the list, and returns it.
cf. http://en.cppreference.com/w/cpp/thread
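A possible implementation of those two methods, assuming C++20's std::counting_semaphore stands in for the unspecified Semaphore type:

#include <deque>
#include <mutex>
#include <semaphore>

struct ImageRead { /* ... */ };

class imageQueue
{
    std::deque<ImageRead> m_readImages;
    std::mutex m_changeQueue;
    std::counting_semaphore<> m_imagesAvailable{0}; // starts empty
public:
    bool addImage(ImageRead image)
    {
        {
            std::lock_guard<std::mutex> lock(m_changeQueue);
            m_readImages.push_back(std::move(image));
        }
        m_imagesAvailable.release(); // signal: one more image is available
        return true; // the bool leaves room for a bounded/closed queue later
    }

    ImageRead getNextImage()
    {
        m_imagesAvailable.acquire(); // wait until at least one image is queued
        std::lock_guard<std::mutex> lock(m_changeQueue);
        ImageRead image = std::move(m_readImages.front());
        m_readImages.pop_front();
        return image;
    }
};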
Ignoring the question of "Should each operation run in an individual thread?", it appears that the objects you want to process move from thread to thread. In effect, they are uniquely owned by only one thread at a time (no thread ever needs to access the data from other threads). There is a way to express just that in C++: std::unique_ptr.
Each step then only works on its owned image. All you have to do is find a thread-safe way to move the ownership of your images through the process steps one by one, which means the critical sections are only at the boundaries between tasks. Since you have multiple of these, abstracting it away would be reasonable:
class ProcessBoundary
{
public:
    void setImage(std::unique_ptr<ImageRead> newImage)
    {
        while (running)
        {
            {
                std::lock_guard<std::mutex> guard(m_mutex);
                if (m_imageToTransfer == nullptr)
                {
                    // The previous image has been taken by the next step,
                    // so we can place this one here.
                    m_imageToTransfer = std::move(newImage);
                    return;
                }
            }
            std::this_thread::yield();
        }
    }

    std::unique_ptr<ImageRead> getImage()
    {
        while (running)
        {
            {
                std::lock_guard<std::mutex> guard(m_mutex);
                if (m_imageToTransfer != nullptr)
                {
                    // An image is waiting here; take ownership of it.
                    return std::move(m_imageToTransfer);
                }
            }
            std::this_thread::yield();
        }
        return nullptr; // we were stopped while waiting
    }

    void stop()
    {
        running = false;
    }

private:
    std::mutex m_mutex;
    std::unique_ptr<ImageRead> m_imageToTransfer;
    std::atomic<bool> running; // set to true in the constructor
};
The process steps would then ask for an image with getImage(), which they uniquely own once that function returns. They process it and pass it to the setImage of the next ProcessBoundary.
You could probably improve on this with condition variables, or adding a queue in this class so that threads can get back to processing the next image. However, if some steps are faster than others they will necessarily be stalled by the slower ones eventually.
This is a design pattern problem. I suggest to read about concurrency design pattern and see if there is anything that would help you out.
If you want to add concurrency to the following sequential process:
InputStream->Pre-Processing->Computation->OutputStream
then I suggest using the active object design pattern. This way each step is not blocked by the previous one, and they can run concurrently. It is also very simple to implement (here is an implementation:
http://www.drdobbs.com/parallel/prefer-using-active-objects-instead-of-n/225700095).
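A minimal sketch of an active object in standard C++, assuming a plain std::function task queue (the linked article's version is more complete):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Each pipeline step owns one of these: callers enqueue work and return
// immediately; the step's private thread executes the work in order.
class ActiveObject
{
public:
    ActiveObject() : m_worker([this] { Run(); }) {}

    ~ActiveObject()
    {
        Send([this] { m_done = true; }); // poison task: stops the loop
        m_worker.join();
    }

    void Send(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.push(std::move(task));
        }
        m_ready.notify_one();
    }

private:
    void Run()
    {
        while (!m_done)
        {
            std::unique_lock<std::mutex> lock(m_mutex);
            m_ready.wait(lock, [this] { return !m_tasks.empty(); });
            auto task = std::move(m_tasks.front());
            m_tasks.pop();
            lock.unlock();
            task(); // may set m_done
        }
    }

    std::queue<std::function<void()>> m_tasks;
    std::mutex m_mutex;
    std::condition_variable m_ready;
    bool m_done = false;
    std::thread m_worker; // declared last so it starts after the other members
};

Each step finishes its work and Send()s the next step a task carrying the data along, so no step blocks on another.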
As to your question about each thread sharing a DTO: this is easily solved with a wrapper around the DTO. The wrapper contains write and read functions; the write functions block on a mutex and the reads return const data.
However, I think your problem lies in the design. If the process is sequential as you described, then why is each step sharing the data? The data should be passed to the next step once the current one completes. In other words, each step should be decoupled.
You are correct in using mutexes and locks. For C++11, this is really the most elegant way of accessing complex data between threads.

C++ Input and output to the console window at the same time

I'm writing a server (mainly for Windows, but it would be cool if I could keep it multiplatform) and I just use a normal console window for it. However, I want the server to be able to run commands like say text_to_say_here or kick playername, etc. How can I have asynchronous input/output? I already tried some stuff with the normal printf() and gets_s, but that resulted in some really weird behavior. I mean something like this.
Thanks.
Quick code to take advantage of C++11 features (i.e. cross-platform)
#include <atomic>
#include <iostream>
#include <string>
#include <thread>

void ReadCin(std::atomic<bool>& run)
{
    std::string buffer;
    while (run.load())
    {
        std::cin >> buffer;
        if (buffer == "Quit")
        {
            run.store(false);
        }
    }
}

int main()
{
    std::atomic<bool> run(true);
    std::thread cinThread(ReadCin, std::ref(run));
    while (run.load())
    {
        // main loop
    }
    run.store(false);
    cinThread.join();
    return 0;
}
You can simulate asynchronous I/O using threads, but more importantly, the read and write threads must share a mutex in order to avoid one thread stepping on another and writing to the console on top of the other's output. In other words, std::cout, std::cin, fprintf(), etc. are not synchronized with each other at the message level, and as a result you will get an unpredictable interleaving pattern where a read or write takes place while another read or write is already in progress. You could easily end up with a read happening in the middle of a write, and furthermore, while you're typing an input on the console, another thread could start writing to it, making a visual mess of what you're trying to type.
In order to properly manage your asynchronous read and write threads, it would be best to set up two classes, one for reading and one for writing. In each class, set up a message queue: the read thread pushes messages (most likely std::string) for the main thread to retrieve, and the main thread pushes messages for the write thread to output. You may also want a special version of your read thread that can print a prompt, pushed into its message queue by the main thread, before reading from stdin or std::cin. Both classes then share a common mutex or semaphore to prevent unpredictable interleaving of I/O: by locking the common mutex before any iostream calls (and unlocking it afterwards), interleaved I/O is avoided. Each class also adds a second mutex of its own to maintain exclusive access to its internal message queue. Finally, the message queues in each class can be implemented as a std::queue<std::string>.
If you want to make your program as cross-platform as possible, I would suggest implementing this with either Boost.Thread or the C++11 std::thread library.
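A minimal sketch of the shared-console-mutex part of this design (the per-class message queues are left out; names are illustrative):

#include <iostream>
#include <mutex>
#include <string>

std::mutex g_consoleMutex; // shared by every thread that touches the console

void WriteLine(const std::string& line)
{
    std::lock_guard<std::mutex> lock(g_consoleMutex);
    std::cout << line << '\n';
}

std::string ReadCommand()
{
    std::string command;
    {
        std::lock_guard<std::mutex> lock(g_consoleMutex);
        std::cout << "> " << std::flush; // print the prompt under the lock
    }
    std::getline(std::cin, command); // don't hold the lock while blocked on input
    return command;
}

Note the lock is released before blocking on getline: a writer can still interleave while the user is typing, which is exactly the problem the prompt-aware read thread and the message queues address in the full design.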
If you ditch the console window and use TCP connections for command and control, your server will be much easier to keep multi-platform, and also simpler and more flexible.
You can try placing the input and output on separate threads. I'm not quite sure why you want to do this, but threading should do the job.
:)
http://en.wikibooks.org/wiki/C++_Programming/Threading

how to pass data to running thread

When using pthread, I can pass data at thread creation time.
What is the proper way of passing new data to an already running thread?
I'm considering making a global variable and make my thread read from that.
Thanks
That will certainly work. Basically, threads are just lightweight processes that share the same memory space. Global variables, being in that memory space, are available to every thread.
The trick is not with the readers so much as the writers. If you have a simple chunk of global memory, like an int, then assigning to that int will probably be safe. But consider something a little more complicated, like a struct. Just to be definite, let's say we have
struct S { int a; float b; } s1, s2;
Now s1,s2 are variables of type struct S. We can initialize them
s1 = { 42, 3.14f };
and we can assign them
s2 = s1;
But when we assign them the processor isn't guaranteed to complete the assignment to the whole struct in one step -- we say it's not atomic. So let's now imagine two threads:
thread 1:
while (true) {
    printf("{%d,%f}\n", s2.a, s2.b);
    sleep(1);
}
thread 2:
while (true) {
    sleep(1);
    s2 = s1;
    s1.a += 1;
    s1.b += 3.14f;
}
We can see that we'd expect s2 to have the values {42, 3.14}, {43, 6.28}, {44, 9.42} ....
But what we see printed might be anything like
{42,3.14}
{43,3.14}
{43,6.28}
or
{43,3.14}
{44,6.28}
and so on. The problem is that thread 1 may get control and "look at" s2 at any time during that assignment.
The moral is that while global memory is a perfectly workable way to do it, you need to take into account the possibility that your threads will cross over one another. There are several solutions to this, the most basic being to use semaphores. A semaphore has two operations, confusingly named from Dutch as P and V.
P simply waits until a variable is 0 and then goes on, adding 1 to the variable; V subtracts 1 from the variable. The only thing special is that they do this atomically; they can't be interrupted.
Now, you code it as
thread 1:
while (true) {
    P();
    printf("{%d,%f}\n", s2.a, s2.b);
    V();
    sleep(1);
}
thread 2:
while (true) {
    sleep(1);
    P();
    s2 = s1;
    V();
    s1.a += 1;
    s1.b += 3.14f;
}
and you're guaranteed that you'll never have thread 2 half-completing an assignment while thread 1 is trying to print.
(Pthreads has semaphores, by the way.)
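For example, here is the pair of threads above sketched with POSIX semaphores, where sem_wait plays the role of P (acquire) and sem_post the role of V (release), with the semaphore initialized to 1:

#include <cstdio>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

struct S { int a; float b; } s1 = { 42, 3.14f }, s2;
sem_t sem;

void* reader(void*) // thread 1
{
    while (true) {
        sem_wait(&sem);   // P: acquire exclusive access
        printf("{%d,%f}\n", s2.a, s2.b);
        sem_post(&sem);   // V: release
        sleep(1);
    }
    return nullptr;
}

void* writer(void*) // thread 2
{
    while (true) {
        sleep(1);
        sem_wait(&sem);   // P
        s2 = s1;          // the whole struct copy is now protected
        sem_post(&sem);   // V
        s1.a += 1;
        s1.b += 3.14f;
    }
    return nullptr;
}

int main()
{
    sem_init(&sem, 0, 1); // binary semaphore, initially "free"
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, reader, nullptr);
    pthread_create(&t2, nullptr, writer, nullptr);
    pthread_join(t1, nullptr); // both threads loop forever in this sketch
    return 0;
}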
I have been using the message-passing, producer-consumer, queue-based comms mechanism, as suggested by asveikau, for decades without any problems specifically related to multithreading. There are some advantages:
1) The 'threadCommsClass' instances passed on the queue can often contain everything required for the thread to do its work - member/s for input data, member/s for output data, methods for the thread to call to do the work, somewhere to put any error/exception messages, and a 'returnToSender(this)' event to call that returns everything to the requester by some thread-safe means the worker thread does not need to know about. The worker thread then runs asynchronously on one set of fully encapsulated data that requires no locking. 'returnToSender(this)' might queue the object onto another P-C queue, it might PostMessage it to a GUI thread, it might release the object back to a pool or just dispose() it. Whatever it does, the worker thread does not need to know about it.
2) There is no need for the requesting thread to know anything about which thread did the work - all the requestor needs is a queue to push on. In an extreme case, the worker thread on the other end of the queue might serialize the data and communicate it to another machine over a network, only calling returnToSender(this) when a network reply is received - the requestor does not need to know this detail - only that the work has been done.
3) It is usually possible to arrange for the 'threadCommsClass' instances and the queues to outlive both the requester thread and the worker thread. This greatly eases those problems when the requester or worker are terminated and dispose()'d before the other - since they share no data directly, there can be no AV/whatever. This also blows away all those 'I can't stop my work thread because it's stuck on a blocking API' issues - why bother stopping it if it can be just orphaned and left to die with no possibility of writing to something that is freed?
4) A threadpool reduces to a one-line for loop that creates several work threads and passes them the same input queue.
5) Locking is restricted to the queues. The more mutexes, condVars, critical sections and other synchro locks there are in an app, the more difficult it is to control it all and the greater the chance of an intermittent deadlock that is a nightmare to debug. With queued messages, (ideally), only the queue class has locks. The queue class must work 100% with multiple producers/consumers, but that's one class, not an app full of uncoordinated locking, (yech!).
6) A threadCommsClass can be raised anytime, anywhere, in any thread and pushed onto a queue. It's not even necessary for the requester code to do it directly, eg. a call to a logger class method, 'myLogger.logString("Operation completed successfully");' could copy the string into a comms object, queue it up to the thread that performs the log write and return 'immediately'. It is then up to the logger class thread to handle the log data when it dequeues it - it may write it to a log file, it may find after a minute that the log file is unreachable because of a network problem. It may decide that the log file is too big, archive it and start another one. It may write the string to disk and then PostMessage the threadCommsClass instance on to a GUI thread for display in a terminal window, whatever. It doesn't matter to the log requesting thread, which just carries on, as do any other threads that have called for logging, without significant impact on performance.
7) If you do need to kill off a thread waiting on a queue, rather than waiting for the OS to kill it on app close, just queue it a message telling it to terminate.
There are surely disadvantages:
1) Shoving data directly into thread members, signaling it to run and waiting for it to finish is easier to understand and will be faster, assuming that the thread does not have to be created each time.
2) Truly asynchronous operation, where the thread is queued some work and, sometime later, returns it by calling some event handler that has to communicate the results back, is more difficult to handle for developers used to single-threaded code and often requires state-machine type design where context data must be sent in the threadCommsClass so that the correct actions can be taken when the results come back. If there is the occasional case where the requestor just has to wait, it can send an event in the threadCommsClass that gets signaled by the returnToSender method, but this is obviously more complex than simply waiting on some thread handle for completion.
Whatever design is used, forget the simple global variables as other posters have said. There is a case for some global types in thread comms - one I use very often is a thread-safe pool of threadCommsClass instances, (this is just a queue that gets pre-filled with objects). Any thread that wishes to communicate has to get a threadCommsClass instance from the pool, load it up and queue it off. When the comms is done, the last thread to use it releases it back to the pool. This approach prevents runaway new(), and allows me to easily monitor the pool level during testing without any complex memory-managers, (I usually dump the pool level to a status bar every second with a timer). Leaking objects, (level goes down), and double-released objects, (level goes up), are easily detected and so get fixed.
MultiThreading can be safe and deliver scaleable, high-performance apps that are almost a pleasure to maintain/enhance, (almost:), but you have to lay off the simple globals - treat them like Tequila - quick and easy high for now but you just know they'll blow your head off tomorrow.
Good luck!
Martin
Global variables are bad to begin with, and even worse with multi-threaded programming. Instead, the creator of the thread should allocate some sort of context object that's passed to pthread_create, which contains whatever buffers, locks, condition variables, queues, etc. are needed for passing information to and from the thread.
You will need to build this yourself. The most typical approach requires some cooperation from the other thread as it would be a bit of a weird interface to "interrupt" a running thread with some data and code to execute on it... That would also have some of the same trickiness as something like POSIX signals or IRQs, both of which it's easy to shoot yourself in the foot while processing, if you haven't carefully thought it through... (Simple example: You can't call malloc inside a signal handler because you might be interrupted in the middle of malloc, so you might crash while accessing malloc's internal data structures which are only partially updated.)
The typical approach is to have your thread creation routine basically be an event loop. You can build a queue structure and pass that as the argument to the thread creation routine. Then other threads can enqueue things and the thread's event loop will dequeue it and process the data. Note this is cleaner than a global variable (or global queue) because it can scale to have multiple of these queues.
You will need some synchronization on that queue data structure. Entire books could be written about how to implement your queue structure's synchronization, but the simplest approach has a lock and a semaphore. When modifying the queue, threads take the lock. When waiting for something to be dequeued, consumer threads wait on a semaphore, which is incremented by enqueuers. It's also a good idea to implement some mechanism to shut down the consumer thread.
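A sketch of such an event-loop thread in portable C++, using a mutex plus condition variable in place of the lock-plus-semaphore pair, and a quit flag in the message as the shutdown mechanism (all names are illustrative):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Message { std::string payload; bool quit = false; };

class MessageQueue
{
public:
    void Push(Message m)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(m));
        }
        m_cv.notify_one();
    }

    Message Pop() // blocks until a message is available
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return !m_queue.empty(); });
        Message m = std::move(m_queue.front());
        m_queue.pop();
        return m;
    }

private:
    std::queue<Message> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_cv;
};

// The thread's creation routine is just an event loop over the queue.
void Worker(MessageQueue& q)
{
    for (;;)
    {
        Message m = q.Pop();
        if (m.quit)
            break;
        // ... process m.payload ...
    }
}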

Lightest synchronization primitive for worker thread queue

I am about to implement a worker thread with work item queuing, and while I was thinking about the problem, I wanted to know if I'm doing the best thing.
The thread in question will have some thread-local data (preinitialized at construction) and will loop on work items until some condition is met.
pseudocode:
volatile bool run = true;

int WorkerThread(param)
{
    localclassinstance c1 = new c1();
    [other initialization]
    while (true) {
        [LOCK]
        [unqueue work item]
        [UNLOCK]
        if ([hasWorkItem]) {
            [process data]
            [PostMessage with pointer to data]
        }
        [Sleep]
        if (!run)
            break;
    }
    [uninitialize]
    return 0;
}
I guess I will do the locking via critical section, as the queue will be std::vector or std::queue, but maybe there is a better way.
The part with Sleep doesn't look too great: there will be a lot of wasted time with big Sleep values, or lots of extra locking when the Sleep value is small, and that's definitely unnecessary.
But I can't think of a WaitForSingleObject-friendly primitive I could use instead of a critical section, as there might be two threads queuing work items at the same time. An Event, which seems to be the best candidate, can lose the second work item if the Event was already set, and it doesn't guarantee mutual exclusion.
Maybe there is an even better approach with InterlockedExchange-style functions that leads to even less serialization.
P.S.: I might need to preprocess the whole queue and drop the obsolete work items during the unqueuing stage.
There are a multitude of ways to do this.
One option is to use a semaphore for the waiting. The semaphore is signalled every time a value is pushed on the queue, so the worker thread will only block if there are no items in the queue. This will still require separate synchronization on the queue itself.
A second option is to use a manual-reset event which is set when there are items in the queue and cleared when the queue is empty. Again, you will need to do separate synchronization on the queue.
A third option is to have an invisible message-only window created on the thread, and use a special WM_USER or WM_APP message to post items to the queue, attaching the item to the message via a pointer.
Another option is to use condition variables. The native Windows condition variables only work if you're targeting Windows Vista or Windows 7, but condition variables are also available for Windows XP with Boost or an implementation of the C++0x thread library. An example queue using Boost condition variables is available on my blog: http://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html
It is possible to share a resource between threads without using blocking locks at all, if your scenario meets certain requirements.
You need an atomic pointer exchange primitive, such as Win32's InterlockedExchange. Most processor architectures provide some sort of atomic swap, and it's usually much less expensive than acquiring a formal lock.
You can store your queue of work items in a pointer variable that is accessible to all the threads that will be interested in it. (global var, or field of an object that all the threads have access to)
This scenario assumes that the threads involved always have something to do, and only occasionally "glance" at the shared resource. If you want a design where threads block waiting for input, use a traditional blocking event object.
Before anything begins, create your queue or work item list object and assign it to the shared pointer variable.
Now, when producers want to push something onto the queue, they "acquire" exclusive access to the queue object by swapping a null into the shared pointer variable using InterlockedExchange. If the result of the swap is null, then somebody else is currently modifying the queue object. Sleep(0) to release the rest of your thread's time slice, then loop to retry the swap until it returns non-null. Even if you end up looping a few times, this is many, many times faster than making a kernel call to acquire a mutex object. Kernel calls require hundreds of clock cycles to transition into kernel mode.
When you successfully obtain the pointer, make your modifications to the queue, then swap the queue pointer back into the shared pointer.
When consuming items from the queue, you do the same thing: swap a null into the shared pointer and loop until you get a non-null result, operate on the object in the local var, then swap it back into the shared pointer var.
This technique is a combination of atomic swap and brief spin loops. It works well in scenarios where the threads involved are not blocked and collisions are rare. Most of the time the swap will give you exclusive access to the shared object on the first try, and as long as the length of time the queue object is held exclusively by any thread is very short then no thread should have to loop more than a few times before the queue object becomes available again.
If you expect a lot of contention between threads in your scenario, or you want a design where threads spend most of their time blocked waiting for work to arrive, you may be better served by a formal mutex synchronization object.
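A sketch of the swap-a-null-in technique using standard C++ atomics in place of InterlockedExchange (the int queue and function names are just for illustration):

#include <atomic>
#include <queue>
#include <thread>

// Shared pointer variable; nullptr means "someone currently owns the queue".
std::atomic<std::queue<int>*> g_sharedQueue{new std::queue<int>()};

// Acquire exclusive access by atomically swapping a null into the shared slot.
std::queue<int>* AcquireQueue()
{
    for (;;)
    {
        std::queue<int>* q = g_sharedQueue.exchange(nullptr);
        if (q != nullptr)
            return q; // we now own the queue exclusively
        std::this_thread::yield(); // someone else has it; give up our time slice
    }
}

void ReleaseQueue(std::queue<int>* q)
{
    g_sharedQueue.store(q); // put it back for the next thread
}

void Produce(int item)
{
    std::queue<int>* q = AcquireQueue();
    q->push(item);
    ReleaseQueue(q);
}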
The fastest locking primitive is usually a spin-lock or spin-sleep-lock. CRITICAL_SECTION is just such a (user-space) spin-sleep-lock.
(Well, aside from not using locking primitives at all of course. But that means using lock-free data-structures, and those are really really hard to get right.)
As for avoiding the Sleep: have a look at condition-variables. They're designed to be used together with a "mutex", and I think they're much easier to use correctly than Windows' EVENTs.
Boost.Thread has a nice portable implementation of both, fast user-space spin-sleep-locks and condition variables:
http://www.boost.org/doc/libs/1_44_0/doc/html/thread/synchronization.html#thread.synchronization.condvar_ref
A work-queue using Boost.Thread could look something like this:
template <class T>
class Queue : private boost::noncopyable
{
public:
    explicit Queue(size_t maxSize) : m_maxSize(maxSize) {}

    void Enqueue(T const& t)
    {
        unique_lock lock(m_mutex);
        // wait until the queue is not full
        while (m_backingStore.size() >= m_maxSize)
            m_queueNotFullCondition.wait(lock); // releases the lock temporarily
        m_backingStore.push_back(t);
        m_queueNotEmptyCondition.notify_all(); // notify waiters that the queue is not empty
    }

    T DequeueOrBlock()
    {
        unique_lock lock(m_mutex);
        // wait until the queue is not empty
        while (m_backingStore.empty())
            m_queueNotEmptyCondition.wait(lock); // releases the lock temporarily
        T t = m_backingStore.front();
        m_backingStore.pop_front();
        m_queueNotFullCondition.notify_all(); // notify waiters that the queue is not full
        return t;
    }

private:
    typedef boost::recursive_mutex mutex;
    typedef boost::unique_lock<boost::recursive_mutex> unique_lock;

    size_t const m_maxSize;
    mutex mutable m_mutex;
    boost::condition_variable_any m_queueNotEmptyCondition;
    boost::condition_variable_any m_queueNotFullCondition;
    std::deque<T> m_backingStore;
};
There are various ways to do this
For one, you could create an event called 'run' and use that to detect when the thread should terminate; the main thread signals it. Instead of Sleep you would then use WaitForSingleObject with a timeout; that way you will quit promptly instead of waiting out the sleep interval.
Another way is to accept messages in your loop and invent a user-defined message that you post to the thread.
EDIT: depending on the situation it may also be wise to have yet another thread that monitors this thread to check whether it has locked up; this can be done via the above-mentioned message queue, where replying to a certain message within x ms means the thread is still alive.
I'd restructure a bit:
WorkItem GetWorkItem()
{
    while (true)
    {
        WaitForSingleObject(queue.Ready, INFINITE);
        {
            ScopeLock lock(queue.Lock);
            if (!queue.IsEmpty())
            {
                return queue.GetItem();
            }
        }
    }
}

int WorkerThread(param)
{
    bool done = false;
    do
    {
        WorkItem work = GetWorkItem();
        if (work.IsQuitMessage())
        {
            done = true;
        }
        else
        {
            work.Process();
        }
    } while (!done);
    return 0;
}
Points of interest:
ScopeLock is a RAII class to make critical section usage safer.
Block on event until workitem is (possibly) ready - then lock while trying to dequeue it.
Don't use a global "IsDone" flag; enqueue special quit-message WorkItems instead.
You can have a look at another approach here that uses C++0x atomic operations
http://www.drdobbs.com/high-performance-computing/210604448
Use a semaphore instead of an event.
Keep the signaling and synchronizing separate. Something along these lines...
// in main thread
HANDLE events[2];
events[0] = CreateEvent(...); // for shutdown
events[1] = CreateEvent(...); // for work to do
// start thread and pass the events

// in worker thread
DWORD ret;
while (true)
{
    ret = WaitForMultipleObjects(2, events, FALSE, <timeout val or INFINITE>);
    if shutdown
        return
    else if do-work
        enter crit sec
        unqueue work
        leave crit sec
        etc.
    else if timeout
        do something else that has to be done
}
Given that this question is tagged windows, I'll answer thus:
Don't create just one worker thread. Your worker thread jobs are presumably independent, so you can process multiple jobs at once. If so:
In your main thread call CreateIOCompletionPort to create an io completion port object.
Create a pool of worker threads. The number you need to create depends on how many jobs you might want to service in parallel. Some multiple of the number of CPU cores is a good start.
Each time a job comes in call PostQueuedCompletionStatus() passing a pointer to the job struct as the lpOverlapped struct.
Each worker thread calls GetQueuedCompletionStatus() - retrieves the work item from the lpOverlapped pointer and does the job before returning to GetQueuedCompletionStatus().
This looks heavy, but io completion ports are implemented in kernel mode and represent a queue that can be dequeued from any of the worker threads associated with it (i.e. waiting on a call to GetQueuedCompletionStatus). The io completion port knows how many of the threads processing an item are actually using a CPU vs. blocked on an IO call, and will release more worker threads from the pool to ensure that the concurrency count is met.
So it's not lightweight, but it is very, very efficient... io completion ports can be associated with pipe and socket handles, for example, and can dequeue the results of asynchronous operations on those handles. io completion port designs can scale to handling tens of thousands of socket connections on a single server - but on the desktop side of the world they make a very convenient way of scaling processing of jobs over the 2 or 4 cores now common in desktop PCs.
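A bare-bones sketch of that arrangement (error handling omitted; the Job struct and the job source are illustrative):

#include <windows.h>
#include <thread>
#include <vector>

struct Job
{
    OVERLAPPED overlapped = {}; // first member, so we can cast back from LPOVERLAPPED
    int payload = 0;
};

void WorkerLoop(HANDLE port)
{
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    LPOVERLAPPED overlapped = nullptr;
    while (GetQueuedCompletionStatus(port, &bytes, &key, &overlapped, INFINITE))
    {
        if (overlapped == nullptr)
            break; // a null completion is our shutdown signal
        Job* job = reinterpret_cast<Job*>(overlapped);
        // ... do the job ...
        delete job;
    }
}

int main()
{
    // Last argument 0 = allow one concurrently running thread per CPU.
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);

    unsigned threadCount = std::thread::hardware_concurrency() * 2;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threadCount; ++i)
        pool.emplace_back(WorkerLoop, port);

    // Each incoming job is posted to the port as a fake "completion".
    for (int i = 0; i < 10; ++i)
    {
        Job* job = new Job;
        job->payload = i;
        PostQueuedCompletionStatus(port, 0, 0, &job->overlapped);
    }

    // Shut down: one null completion per worker, then join.
    for (unsigned i = 0; i < threadCount; ++i)
        PostQueuedCompletionStatus(port, 0, 0, nullptr);
    for (auto& t : pool)
        t.join();
    CloseHandle(port);
}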