Simple mutex protected queue vs. thread safe queue in C++

I want to use a simple thread-safe std::queue in my program that has multiple threads accessing the same queue. The first thing that came to my mind is protecting the queue operation using mutex as below:
/*Enqueue*/
mutex.lock();
queue.push(val);
mutex.unlock();
/*Dequeue*/
mutex.lock();
val = queue.front();
mutex.unlock();
/*some operation*/
queue.pop(); /* note: this pop() runs outside the lock */
I've seen many robust thread-safe queue implementations that use condition variables, e.g. https://stackoverflow.com/a/16075550/3598205 . Will there be a significant difference in performance if I only have two threads accessing the same queue?

Mutexes and condition variables do two different things, although they are often used together.
To ensure that only one thread can access a resource at a time, use a mutex. The code you posted shows an example of this.
To block a worker thread until there is something for it to do, have it wait on a condition variable (which is then signalled by another thread providing some kind of work item). There is an example of this over at cppreference.
Your first thought when writing multi-threaded code should be to write robust, safe code. It's very easy to make mistakes, especially if you're new to the area, and bugs are very hard to diagnose since they lead to sporadic, unpredictable errors. Worry about performance later.
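
To make the division of labour concrete, here is a minimal sketch of a queue that uses both: the mutex protects the std::queue, and the condition variable lets the consumer sleep until there is something to pop. The class name BlockingQueue and the demo function sum_through_queue are made up for illustration; this is not the implementation from the linked answer.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical wrapper name; a sketch, not a tuned implementation.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        // Notify after releasing the lock so the woken thread can take it at once.
        not_empty_.notify_one();
    }

    // Blocks until an item is available, then removes and returns it.
    // front() and pop() happen under the same lock.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<T> queue_;
};

// Demo: one producer pushes 1..n while the calling thread sums what it pops.
int sum_through_queue(int n) {
    BlockingQueue<int> q;
    std::thread producer([&q, n] {
        for (int i = 1; i <= n; ++i) q.push(i);
    });
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += q.pop();
    producer.join();
    return sum;
}
```

The predicate passed to wait() is re-checked every time the thread wakes, which also covers spurious wake-ups.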

Your application will be limited by one main thing. We call this the bottleneck. In my experience, 90% of the applications I've seen were limited by the bus bandwidth, transferring memory to/from main memory/CPU.
Different apps will have different bottlenecks. It could be GPU performance, or disk access. It could be raw CPU power or, maybe, contention when accessing the queue above.
Very likely, the mutex will be just as good as the fancy lock-free queue. But you don't know until you profile.
It might very well happen that your application is strictly limited by the access to this queue. For example, if your app is a low-latency market data exchange for a financial institution and the queue is holding the buy/sell directives, then it will make a difference. The two threads could be constantly writing to different locations on a queue that has a couple hundred items (so, on different memory pages).
Or it might be that your application is always waiting on the GPU to render frames and the queue holds the player weapon changes that rendering and gameplay threads access just a couple times per frame.
Profile and check.

Related

Mutex is defying the very idea of threads: parallel processing [duplicate]

When I have a block of code like this:
mutex mtx;
void hello(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hello";
}
mtx.unlock();
}
void hi(){
mtx.lock();
for(int i = 0; i < 10; i++){
cout << "hi";
}
mtx.unlock();
}
int main(){
thread x(hello);
thread y(hi);
x.join();
y.join();
}
What is the difference between just calling `hello()` and `hi()`? (Like so)
...
int main(){
hello();
hi();
}
Are threads more efficient? The purpose of thread is to run at the same time, right?
Can someone explain why we use mutexes within thread functions? Thank you!

The purpose of thread is to run at the same time, right?
Yes, threads are used to perform multiple tasks in parallel, especially on different CPUs.
Can someone explain why we use mutexes within thread functions?
To serialize multiple threads with each other, such as when they are accessing a shared resource that is not safe to access concurrently and needs to be protected.

Are threads more efficient?
No. But see final note (below).
On a single core, threads are much, much less efficient (than function/method calls).
As one example, on my Ubuntu 15.10(64), using g++ v5.2.1,
a) a context switch (from one thread to the other) enforced by use of std::mutex takes about 12,000 nanoseconds
b) but invoking 2 simple methods, for instance std::mutex lock() & unlock(), takes < 50 nanoseconds. That is close to 3 orders of magnitude! So context switch vs. function call is no contest.
The purpose of thread is to run at the same time, right?
Yes ... but this can not happen on a single core processor.
And on a multi-core system, context switch time can still dominate.
For example, my Ubuntu system is dual core. The measurement of context switch time I reported above uses a chain of 10 threads, where each thread simply waits for its input semaphore to be unlock()'d. When a thread's input semaphore is unlocked, the thread gets to run ... but the brief thread activity is simply 1) increment a count and check a flag, 2) unlock() the next thread, and 3) lock() its own input mutex, i.e. wait again for the previous task's signal. In that test, the thread we know as main starts the thread-sequencing with unlock() of one of the threads, and stops it with a flag that all threads can see.
During this measurement activity (about 3 seconds), Linux system monitor shows both cores are involved, and reports both cores at about 60% utilization. I expected both cores at 100% .. don't know why they are not.
Can someone explain why we use mutexes within thread functions? Thank you!
I suppose the most conventional use of std::mutex is to serialize access to a memory structure (perhaps a shared-access storage or structure). If your application has data accessible by multiple threads, each write access must be serialized to prevent race conditions from corrupting the data. Sometimes, both read and write access need to be serialized. (See dining philosophers problem.)
In your code, as an example (although I do not know what system you are using), it is possible that std::cout (a shared structure) will 'interleave' text. That is, a thread context switch might happen in the middle of printing a "hello", or even a 'hi'. This behaviour is usually undesired, but might be acceptable.
A number of years ago, I worked with vxWorks and my team learned to use mutexes on access to std::cout to eliminate that interleaving. Such behavior can be distracting, and generally, customers do not like it. (Ultimately, for that app, we did away with the use of the std trio-io (cout, cerr, cin).)
Devices, of various kinds, also might not function properly if you allow more than 1 thread to attempt operations on them 'simultaneously'. For example, I have written software for a device that required 50 us or more to complete its reaction to my software's 'poke', before any additional action to the device should be applied. The device simply ignored my code's actions without the wait.
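
A minimal sketch of that idea, assuming one global mutex that every writer takes before touching the shared stream. The helper names print_line and run_writers are invented for this example, and it writes to a std::ostringstream rather than std::cout only so that the result can be inspected afterwards.

```cpp
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

std::mutex out_mutex;  // guards every write to the shared stream

// Writes one whole line while holding the lock, so lines from different
// threads cannot be interleaved mid-line.
void print_line(std::ostream& out, const std::string& text) {
    std::lock_guard<std::mutex> lock(out_mutex);
    out << text << '\n';
}

// Two threads write their word repeatedly; returns everything that was written.
std::string run_writers(int lines_per_thread) {
    std::ostringstream out;
    auto writer = [&out, lines_per_thread](const std::string& word) {
        for (int i = 0; i < lines_per_thread; ++i) print_line(out, word);
    };
    std::thread x(writer, std::string("hello"));
    std::thread y(writer, std::string("hi"));
    x.join();
    y.join();
    return out.str();
}
```

The order of the lines still depends on scheduling; the lock only guarantees that each line arrives whole.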
You should also know that there are techniques that do not involve semaphores, but instead use a thread and an IPC to provide serialized (i.e. protected) resource access.
From wikipedia, "In concurrent programming, a monitor is a synchronization construct that allows threads to have both mutual exclusion and the ability to wait (block) for a certain condition to become true."
When the os provides a suitable IPC, I prefer to use a Hoare monitor. In my interpretation, the monitor is simply a thread that accepts commands over the IPC, and is the only thread to access the shared structure or device. When only 1 thread accesses a structure, NO mutex is needed. All other threads must send a message (via IPC) to request (or perhaps command) another structure change. The monitor thread handles one request at a time, sequentially out of the IPC.
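
One way to picture this arrangement inside a single process is sketched below. It is only an illustration of my reading of the paragraph above: the Mailbox class stands in for whatever IPC the OS provides (and does use a lock internally), while the counter it protects is touched by the owner thread alone. All names are hypothetical.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical in-process stand-in for the IPC channel: a locked queue of commands.
class Mailbox {
public:
    void send(std::function<void()> cmd) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(cmd));
        }
        cv_.notify_one();
    }
    std::function<void()> receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::function<void()> cmd = std::move(q_.front());
        q_.pop();
        return cmd;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
};

// The owner thread is the only one that touches `counter`, so the counter
// itself needs no mutex. An empty command tells the owner to stop.
int run_owner_thread(int clients, int requests_each) {
    int counter = 0;
    Mailbox box;
    std::thread owner([&] {
        while (auto cmd = box.receive()) cmd();  // one request at a time, in order
    });
    std::vector<std::thread> senders;
    for (int c = 0; c < clients; ++c) {
        senders.emplace_back([&] {
            for (int i = 0; i < requests_each; ++i)
                box.send([&counter] { ++counter; });  // runs later, on the owner thread
        });
    }
    for (auto& s : senders) s.join();
    box.send(nullptr);  // stop request, queued behind every increment
    owner.join();
    return counter;
}
```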
Definition: collision
In the context of 'thread context switch' and 'mutex semaphores', a 'collision' occurs when a thread must block-and-wait for access to a resource, because that resource is already 'in use' (i.e. 'occupied'). This is a forced context switch. See also the term "critical section".
When the shared resource is NOT currently in use, no collision. The lock() and unlock() cost almost nothing (by comparison to context switch).
When there is a collision, the context switch slows things down by a 'bunch'. But this 'bunch' might still be acceptable ... consider when 'bunch' is small compared to the duration of the activity inside the critical section.
Final note ... With this new idea of 'collision':
a) Multiple threads can be far less efficient in the face of many collisions.
As an unexpected example, the function 'new' accesses a thread-shared resource we can call "dynamic memory". In one experience, each thread generated 1000's of new's at start up. One thread could complete that effort in 0.5 seconds. Four threads, started quickly back-to-back, took 40 seconds to complete the 4 start ups. Context switches!
b) Multiple threads can be more efficient, when you have multiple cores and no / or few collisions. Essentially, if the threads seldom interact, they can run (mostly) simultaneously.
Thread efficiency can be anywhere between (a) and (b) when there are multiple cores and some collisions.
For instance, my RAM-based "log" mechanism seems to work well, with one mutex access per log entry. Generally, I intentionally used minimal logging. And when debugging a 'discovered' challenge, I added additional logging (maybe later removed) to determine what was going wrong. Generally, the debugger is better than a general logging technique. But sometimes, adding several log entries worked well.

Threads have at least two advantages over purely serial code.
Convenience in separating logically independent sequences of instructions. This is true even on a single core machine. This gives you logical concurrency without necessarily parallelism.
Having multiple threads allows either the operating system or a user-level threading library to multiplex multiple logical threads over a smaller number of CPU cores, without the application developer having to worry about other threads and processes.
Taking advantage of multiple cores / processors. Threads allow you to scale your execution to the number of CPU cores you have, enabling parallelism.
Your example is a little contrived because the entire thread's execution is locked. Normally, threads perform many actions independently and only take a mutex when accessing a shared resource.
More specifically, under your scenario you would not gain any performance. However, if your entire thread was not under a mutex, then you could potentially gain efficiency. I say potentially because there are overheads to running multiple threads which may offset any efficiency gain you obtain.
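
A rough sketch of the more typical shape, where each thread does its own work without any lock and takes the mutex only for the short moment it touches shared state. The function name parallel_sum and the way the work is split are assumptions made for this example.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Each worker sums its own share of 1..n without locking, then takes the
// mutex only for the brief update of the shared total.
long long parallel_sum(int n, int num_threads) {
    long long total = 0;
    std::mutex total_mutex;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            long long local = 0;  // private to this thread, no lock needed
            for (int i = t + 1; i <= n; i += num_threads) local += i;
            std::lock_guard<std::mutex> lock(total_mutex);
            total += local;       // the only access to shared data
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Whether this is actually faster than a plain loop depends on how much work each thread gets compared with the cost of starting it, so measure before relying on it.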

Threads theoretically run simultaneously, which means that threads could write to the same memory block at the same time. For example, if you have a global var int i;, and two threads try to write different values at the same time, which value remains in i?
A mutex forces serialized access to memory: inside a mutex block (between mutex.lock and mutex.unlock) you guarantee that only one thread at a time accesses the memory, and so avoid memory corruption.
When you call mtx.lock(), JUST ONE THREAD KEEPS RUNNING, and any other thread calling the same mtx.lock() stops, waiting for the mtx.unlock() call.

C/C++ - ring buffer in shared memory (POSIX compatible)

I've an application where producers and consumers ("clients") want to send broadcast messages to each other, i.e. a n:m relationship. All could be different programs so they are different processes and not threads.
To reduce the n:m to something more maintainable I was thinking of a setup that introduces a little, central server. That server would offer a socket that each client connects to.
And each client would send a new message through that socket to the server - resulting in 1:n.
The server would also offer a shared memory that is read only for the clients. It would be organized as a ring buffer where the new messages would be added by the server and overwrite older ones.
This would give the clients some time to process the message - but if it's too slow it's bad luck, it wouldn't be relevant anymore anyway...
The advantage I see by this approach is that I avoid synchronisation as well as unnecessary data copying and buffer hierarchies, the central one should be enough, shouldn't it?
That's the architecture so far - I hope it makes sense...
Now to the more interesting aspect of implementing that:
The index of the newest element in the ring buffer is a variable in shared memory and the clients would just have to wait till it changes. Instead of a stupid while( central_index == my_last_processed_index ) { /* do nothing */ } I want to free CPU resources, e.g. by using a pthread_cond_wait().
But that needs a mutex that I think I don't need. On the other hand, "Why do pthreads’ condition variable functions require a mutex?" gave me the impression that I'd better ask if my architecture makes sense and could be implemented like that...
Can you give me a hint if all of that makes sense and could work?
(Side note: the client programs could also be written in the common scripting languages like Perl and Python. So the communication with the server has to be recreated there and thus shouldn't be too complicated or even proprietary)

If memory serves, the reason for the mutex accompanying a condition variable is that under POSIX, signalling the condition variable causes the kernel to wake up all waiters on the condition variable. In these circumstances, the first thing that consumer threads need to do is check that there is something to consume, by means of accessing a variable shared between producer and consumer threads. The mutex protects against concurrent access to the variable used for this purpose. This of course means that if there are many consumers, n-1 of them are needlessly awoken.
Having implemented precisely the arrangement described above, I can say that the choice of IPC object to use is not obvious. We were buffering audio between high priority real-time threads in separate processes, and didn't want to block the consumer. As the audio was produced and consumed in real-time, we were already getting scheduled regularly on both ends, and if there wasn't anything to consume (or space to produce into) we trashed the data because we'd already missed the deadline.
In the arrangement you describe, you will need a mutex to prevent the consumers concurrently consuming items that are queued (and believe me, on a lightly loaded SMP system, they will). However, you don't need to have the producer contend on this as well.
I don't understand your comment about the consumer having read-only access to the shared memory. In the classic lockless ring buffer implementation, the producer writes the queue tail pointer and the consumer(s) the head, whilst all parties need to be able to read both.
You might of course arrange for the queue head and tails to be in a different shared memory region to the queue data itself.
Also be aware that there is a theoretical data coherency hazard on SMP systems when implementing a ring buffer such as this, namely that write-back to memory of the queue content with respect to the head or tail pointer may occur out of order (they sit in cache, usually per CPU core). There are other variants on this theme to do with synchronization of caches between CPUs. To guard against these, you need memory, load and store barriers to enforce ordering. See Memory Barrier on Wikipedia. You explicitly avoid this hazard by using kernel synchronisation primitives such as mutexes and condition variables.
The C11 atomic operations can help with this.
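
As a rough illustration of how atomic operations supply that ordering, here is a minimal single-producer, single-consumer ring written with C++11 std::atomic (rather than C11 <stdatomic.h>). It is a sketch under several assumptions: it runs between two threads in one process instead of in shared memory between processes, it never overwrites old entries, and it polls instead of sleeping. The names SpscRing and ring_sum are invented here.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <thread>

// One slot is always left empty so that "full" and "empty" can be told apart.
template <typename T, std::size_t N>
class SpscRing {
public:
    bool try_push(const T& v) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        slots_[tail] = v;
        // Release: the slot write above is visible before the new tail is.
        tail_.store(next, std::memory_order_release);
        return true;
    }
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        // Acquire pairs with the producer's release store to tail_.
        if (head == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[head];
        head_.store((head + 1) % N, std::memory_order_release);
        return true;
    }
private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};  // written only by the consumer
    std::atomic<std::size_t> tail_{0};  // written only by the producer
};

// Demo: the producer pushes 1..n, the calling thread pops and sums them.
long long ring_sum(int n) {
    SpscRing<int, 16> ring;
    std::thread producer([&] {
        for (int i = 1; i <= n; ++i)
            while (!ring.try_push(i)) std::this_thread::yield();
    });
    long long sum = 0;
    for (int received = 0; received < n;) {
        int v = 0;
        if (ring.try_pop(v)) { sum += v; ++received; }
        else std::this_thread::yield();
    }
    producer.join();
    return sum;
}
```

The release store on the index and the matching acquire load on the other side are what stand in for the explicit barriers mentioned above.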

You do need a mutex on a pthread_cond_wait() as far as I know. The reason is that pthread_cond_wait() is not atomic. The condition variable could change during the call, unless it's protected by a mutex.
It's possible that you can ignore this situation - the client might sleep past message 1, but when the subsequent message is sent then the client will wake up and find two messages to process. If that's unacceptable then use a mutex.

You probably can have a bit of different design by using sem_t if your system has them; some POSIX systems are still stuck on the 2001 version of POSIX.
You probably don't necessarily need a mutex/condition pair. This is just how it was designed a long time ago for POSIX.
Modern C (C11) and C++ (C++11) now bring you (or will bring you) atomic operations, a feature that is implemented in all modern processors but lacked support in most higher-level languages. Atomic operations are part of the answer to resolving a race condition in a ring buffer such as the one you want to implement. But they are not sufficient, because with them alone you can only do an active wait through polling, which is probably not what you want.
Linux, as an extension to POSIX, has futex, which resolves both problems: it avoids races on updates by using atomic operations, and it can put waiters to sleep via a system call. Futexes are often considered too low level for everyday programming, but I think that they actually aren't too difficult to use. I have written up things here.

Thread safe programming

I keep hearing about thread safe. What is that exactly and how and where can I learn to program thread safe code?
Also, assume I have 2 threads, one that writes to a structure and another one that reads from it. Is that dangerous in any way? Is there anything I should look for? I don't think it is a problem. Both threads will not (well, can't) be accessing the struct at the exact same time.
Also, can someone please tell me how in this example : https://stackoverflow.com/a/5125493/1248779 we are doing a better job in concurrency issues. I don't get it.

It's a very deep topic. At the heart threads are usually about making things go fast by using multiple cores at the same time; or about doing long operations in the background when you don't have a good way to interleave the operation with a 'primary' thread. The latter being very common in UI programming.
Your scenario is one of the classic trouble spots, and one of the first people run into. It's very rare to have a struct where the members are truly independent. It's very common to want to modify multiple values in the structure to maintain consistency. Without any precautions it is very possible to modify the first value, then have the other thread read the struct and operate on it before the second value has been written.
Simple example would be a 'point' struct for 2d graphics. You'd like to move the point from [2,2] to [5,6]. If you had a different thread drawing a line to that point you could end up drawing to [5,2] very easily.
This is the tip of the iceberg really. There are lots of great books, but learning this space usually goes something like this:
Uh oh, I just read from that thing in an inconsistent state.
Uh oh, I just modified that thing from 2 threads and now it's garbage.
Yay! I learned about locks
Whoa, I have a lot of locks and everything seems to just hang sometimes when I have lots of them locking in nested code.
Hrm. I need to stop doing this locking on the fly, I seem to be missing a lot of places; so I should encapsulate them in a data structure.
That data structure thing was great, but now I seem to be locking all the time and my code is just as slow as a single thread.
condition variables are weird
It's fast because I got clever with how I lock things. Hrm. Sometimes data corrupts.
Whoa.... InterlockedWhatDidYouSay?
Hey, look no lock, I do this thing called a spin lock.
Condition variables. Hrm... I see.
You know what, how about I just start thinking about how to operate on this stuff in completely independent ways, pipelining my operations, and having as few cross thread dependencies as possible...
Obviously it's not all about condition variables. But there are many problems that can be solved with threading, and probably almost as many ways to do it, and even more ways to do it wrong.

Thread-safety is one aspect of a larger set of issues under the general heading of "Concurrent Programming". I'd suggest reading around that subject.
Your assumption that two threads cannot access the struct at the same time is not good. First: today we have multi-core machines, so two threads can be running at exactly the same time. Second: even on a single core machine the slices of time given to any other thread are unpredictable. You have to anticipate that at any arbitrary time the "other" thread might be processing. See my "window of opportunity" example below.
The concept of thread-safety is exactly to answer the question "is this dangerous in any way". The key question is whether it's possible for code running in one thread to get an inconsistent view of some data, that inconsistency happening because while it was running another thread was in the middle of changing data.
In your example, one thread is reading a structure and at the same time another is writing. Suppose that there are two related fields:
{ foreground: red; background: black }
and the writer is in the process of changing those
foreground = black;
<=== window of opportunity
background = red;
If the reader reads the values at just that window of opportunity then it sees a "nonsense" combination
{ foreground: black; background: black }
The essence of this pattern is that for a brief time, while we are making a change, the system becomes inconsistent and readers should not use the values. As soon as we finish our changes it becomes safe to read again.
Hence we use the CriticalSection APIs mentioned by Stefan to prevent a thread seeing an inconsistent state.
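
The same idea can be sketched with the portable std::mutex in place of the Win32 CriticalSection calls (a substitution made for this example, not what the other answer uses). The writer holds the lock across both assignments, so the window of opportunity is never visible to the reader. The names SharedColors and count_inconsistent_reads, and the encoding of colours as 0 and 1, are made up here.

```cpp
#include <mutex>
#include <thread>

struct Colors { int foreground; int background; };  // 0 = black, 1 = red

class SharedColors {
public:
    void swap_colors() {
        std::lock_guard<std::mutex> lock(m_);
        int old_fg = c_.foreground;
        c_.foreground = c_.background;  // the inconsistent moment is inside the lock
        c_.background = old_fg;
    }
    Colors read() const {
        std::lock_guard<std::mutex> lock(m_);
        return c_;                      // copy taken while no writer is mid-update
    }
private:
    mutable std::mutex m_;
    Colors c_{1, 0};                    // starts as red on black
};

// Returns how many snapshots the reader saw with both fields the same colour.
int count_inconsistent_reads(int iterations) {
    SharedColors shared;
    std::thread writer([&] {
        for (int i = 0; i < iterations; ++i) shared.swap_colors();
    });
    int bad = 0;
    for (int i = 0; i < iterations; ++i) {
        Colors c = shared.read();
        if (c.foreground == c.background) ++bad;
    }
    writer.join();
    return bad;
}
```

Locking each assignment separately would not help: the reader could still slip in between the two locked writes.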

what is that exactly?
Briefly, a program that may be executed in a concurrent context without errors related to concurrency.
If ThreadA and ThreadB read and/or write data without errors and use proper synchronization, then the program may be threadsafe. It's a design choice -- making an object threadsafe can be accomplished a number of ways, and more complex types may be threadsafe using combinations of these techniques.
and how and where can I learn to program thread safe code?
boost/libs/thread/ would likely be a good introduction. The topic is quite complex.
The C++11 standard library provides implementations for locks, atomics and threads -- any well written programs which use these would be a good read. The standard library was modeled after boost's implementation.
also, assume I have 2 threads one that writes to a structure and another one that reads from it. Is that dangerous in any way? is there anything I should look for?
Yes, it can be dangerous and/or may produce incorrect results. Just imagine that a thread may run out of its time at any point, and then another thread could then read or modify that structure -- if you have not protected it, it may be in the middle of an update. A common solution is a lock, which can be used to prevent another thread from accessing shared resources during reads/writes.

When writing multithreaded C++ programs on WIN32 platforms, you need to protect certain shared objects so that only one thread can access them at any given time from different threads. You can use 5 system functions to achieve this. They are InitializeCriticalSection, EnterCriticalSection, TryEnterCriticalSection, LeaveCriticalSection, and DeleteCriticalSection.
Also maybe these links can help:
how to make an application thread safe?
http://www.codeproject.com/Articles/1779/Making-your-C-code-thread-safe

Thread safety is a simple concept: is it "safe" to perform operation A on one thread whilst another thread is performing operation B, which may or may not be the same as operation A. This can be extended to cover many threads. In this context, "safe" means:
No undefined behaviour
All invariants of the data structures are guaranteed to be observed by the threads
The actual operations A and B are important. If two threads both read a plain int variable, then this is fine. However, if any thread may write to that variable, and there is no synchronization to ensure that the read and write cannot happen together, then you have a data race, which is undefined behaviour, and this is not thread safe.
This applies equally to the scenario you asked about: unless you have taken special precautions, then it is not safe to have one thread read from a structure at the same time as another thread writes to it. If you can guarantee that the threads cannot access the data structure at the same time, through some form of synchronization such as a mutex, critical section, semaphore or event, then there is not a problem.
You can use things like mutexes and critical sections to prevent concurrent access to some data, so that the writing thread is the only thread accessing the data when it is writing, and the reading thread is the only thread accessing the data when it is reading, thus providing the guarantee I just mentioned. This therefore avoids the undefined behaviour mentioned above.
However, you still need to ensure that your code is safe in the wider context: if you need to modify more than one variable then you need to hold the lock on the mutex across the whole operation rather than for each individual access, otherwise you may find that the invariants of your data structure may not be observed by other threads.
It is also possible that a data structure may be thread safe for some operations but not others. For example, a single-producer single-consumer queue will be OK if one thread is pushing items on the queue and another is popping items off the queue, but will break if two threads are pushing items, or two threads are popping items.
In the example you reference, the point is that global variables are implicitly shared between all threads, and therefore all accesses must be protected by some form of synchronization (such as a mutex) if any thread can modify them. On the other hand, if you have a separate copy of the data for each thread, then that thread can modify its copy without worrying about concurrent access from any other thread, and no synchronization is required. Of course, you always need synchronization if two or more threads are going to operate on the same data.
My book, C++ Concurrency in Action covers what it means for things to be thread safe, how to design thread safe data structures, and the C++ synchronization primitives used for the purpose, such as std::mutex.

Thread safe means that a certain block of code is protected from being accessed by more than one thread at a time, so that the data it manipulates always stays in a consistent state.
A common example is the producer consumer problem where one thread reads from a data structure while another thread writes to the same data structure : Detailed explanation

To answer the second part of the question: Imagine two threads both accessing std::vector<int> data:
//first thread
if (data.size() > 0)
{
std::cout << data[0]; //fails if data.size() == 0
}
//second thread
if (rand() % 5 == 0)
{
data.clear();
}
else
{
data.push_back(1);
}
Run these threads in parallel and your program can crash because std::cout << data[0]; might be executed directly after data.clear();.
You need to know that at any point of your thread code, the thread might be interrupted, e.g. after checking that (data.size() > 0), and another thread could become active. Although the first thread looks correct in a single threaded app, it's not in a multi-threaded program.
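
One possible fix, sketched under the assumption that both threads agree to go through the same std::mutex: the size check and the element access are done under one lock, so clear() cannot run in between. The vector is called shared_data here, the helper names are invented, and a step counter replaces rand() so the demo is repeatable.

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> shared_data;
std::mutex data_mutex;  // every access to shared_data takes this lock

// First thread: check and read as one locked step.
bool read_first(int& out) {
    std::lock_guard<std::mutex> lock(data_mutex);
    if (shared_data.empty()) return false;
    out = shared_data[0];
    return true;
}

// Second thread: clear on every fifth step, otherwise append.
void mutate(int step) {
    std::lock_guard<std::mutex> lock(data_mutex);
    if (step % 5 == 0) shared_data.clear();
    else shared_data.push_back(1);
}

// Runs both sides concurrently; returns how many reads found an element.
int run_reader_and_writer(int iterations) {
    std::thread writer([iterations] {
        for (int i = 0; i < iterations; ++i) mutate(i);
    });
    int reads = 0;
    int value = 0;
    for (int i = 0; i < iterations; ++i)
        if (read_first(value) && value == 1) ++reads;
    writer.join();
    return reads;
}
```

How many reads succeed varies from run to run; what the lock guarantees is that no read ever indexes an empty vector.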

what's the advantage of message queue over shared data in thread communication?

I read an article about multithreaded program design, http://drdobbs.com/architecture-and-design/215900465, which says it's a best practice to be "replacing shared data with asynchronous messages. As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuses me is that I don't see the difference between using shared data and message queues. I am now working on a non-GUI project on Windows, so let's use Windows's message queues and take the traditional producer-consumer problem as an example.
Using shared data, there would be a shared container and a lock guarding the container between the producer thread and the consumer thread. When the producer outputs a product, it first waits for the lock, then writes something to the container, then releases the lock.
Using a message queue, the producer could simply call PostThreadMessage without blocking, and this is the async message's advantage. But I think there must be some lock guarding the message queue between the two threads, otherwise the data will definitely be corrupted; the PostThreadMessage call just hides the details. I don't know whether my guess is right, but if it's true, the advantage seems to no longer exist, since both methods do the same thing and the only difference is that the system hides the details when using message queues.
P.S. Maybe the message queue uses a non-blocking container, but I could use a concurrent container in the former way too. I want to know how the message queue is implemented, and is there any performance difference between the two ways?
updated:
I still don't get the concept of an async message if the message queue operations are still blocked somewhere else. Correct me if my guess is wrong: when we use shared containers and locks, we block in our own thread, but when using message queues, my own thread returns immediately and leaves the blocking work to some system thread.

Message passing is useful for exchanging smaller amounts of data, because no conflicts need to be avoided. It's much easier to implement than shared memory for intercomputer communication. Also, as you've already noticed, message passing has the advantage that application developers don't need to worry about the protection details that shared memory requires.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speeds when within a computer. Shared memory is usually faster than message passing, as message passing is typically implemented using system calls and thus requires the more time-consuming task of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish shared-memory regions. Once established, all accesses are treated as normal memory accesses without extra assistance from the kernel.
Edit: One case where you might want to implement your own queue is when there are lots of messages to be produced and consumed, e.g., a logging system. With PostThreadMessage, the queue capacity is limited (10,000 posted messages per queue by default). Once that limit is reached the call fails, so messages will most likely get lost unless the caller checks the return value and retries.
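To illustrate the capacity point, here is a minimal sketch of a home-grown bounded queue that reports a full queue to the caller instead of losing the message silently. It uses only the C++ standard library, not the Windows API; `BoundedLogQueue`, `try_push`, `try_pop` and `count_rejected` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical bounded queue for log lines (not a Windows API type).
class BoundedLogQueue {
public:
    explicit BoundedLogQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns false when the queue is full; the caller decides whether to
    // retry, drop the message, or block.
    bool try_push(std::string message) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= capacity_) return false;
        items_.push(std::move(message));
        return true;
    }

    // Moves the oldest message into `out`; returns false when the queue is empty.
    bool try_pop(std::string& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::queue<std::string> items_;
};

// Posts `count` messages without draining and returns how many were rejected.
std::size_t count_rejected(BoundedLogQueue& queue, std::size_t count) {
    std::size_t rejected = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (!queue.try_push("log line " + std::to_string(i))) ++rejected;
    }
    return rejected;
}
```

With a capacity of 3 and 5 posts that nobody drains, the last two posts are rejected, and the producer can see that happen rather than discovering missing log lines later.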

Imagine you have 1 thread producing data, and 4 threads processing that data (presumably to make use of a multi-core machine). If you have a big global pool of data you are likely to have to lock it when any of the threads needs access, potentially blocking 3 other threads. As you add more processing threads you increase the chance of a lock having to wait and increase how many things might have to wait. Eventually adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread, then the consumers can't block each other. You still have to lock the queue between the producer and each consumer thread, but as you have a separate queue for each thread you have a separate lock, and no thread can block all the others waiting for data.
If you suddenly get a 32 core machine you can add 20 more processing threads (and queues) and expect that performance will scale fairly linearly unlike the first case where the new threads will just run into each other all the time.
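A rough sketch of the one-queue-per-consumer layout described above, using only the standard library. `Channel` and `run_pipeline` are names made up for this illustration, and the round-robin dispatch is an assumption; the answer does not say how work is divided among consumers.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical per-consumer channel: its own queue, lock and condition variable.
class Channel {
public:
    void send(int item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(item);
        }
        ready_.notify_one();
    }

    // Tells the consumer that no more items will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_one();
    }

    // Blocks until an item arrives; returns false once closed and drained.
    bool receive(int& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> items_;
    bool closed_ = false;
};

// One producer deals the integers 1..item_count round-robin to the consumers.
// Each consumer sums what it receives; the function returns the grand total.
long long run_pipeline(std::size_t worker_count, int item_count) {
    assert(worker_count > 0);
    std::vector<std::unique_ptr<Channel>> channels;
    for (std::size_t i = 0; i < worker_count; ++i) {
        channels.emplace_back(new Channel());
    }
    std::vector<long long> sums(worker_count, 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < worker_count; ++i) {
        workers.emplace_back([&channels, &sums, i] {
            int item = 0;
            // Each consumer only ever touches its own channel and its own sum.
            while (channels[i]->receive(item)) sums[i] += item;
        });
    }
    for (int n = 1; n <= item_count; ++n) {
        channels[static_cast<std::size_t>(n) % worker_count]->send(n);
    }
    for (auto& channel : channels) channel->close();
    for (auto& worker : workers) worker.join();
    long long total = 0;
    for (long long sum : sums) total += sum;
    return total;
}
```

Each lock here is only ever contended by two threads (the producer and one consumer), no matter how many consumers are added, which is the scaling property the answer is getting at.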

I have used a shared memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. This is very useful when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. A producer acquires a lock for input to the queue and receives a pointer to the next available storage chunk (usually an object of some sort), populates it and releases it. The consumer will block until the next shared object has been released by the producer. It can then acquire a lock to the storage, process the data and release it back to the pool. A suitably designed queue can perform multiple producer/multiple consumer operations with great efficiency. Think Java thread-safe queue (java.util.concurrent.BlockingQueue) semantics, but for pointers to storage.
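A minimal sketch of that hybrid, assuming a fixed pool of chunks and a single mutex; the real design surely differs in detail. `Chunk`, `ChunkQueue`, its method names and `transfer` are all invented for this illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk {                 // hypothetical storage chunk
    std::vector<char> bytes;
};

// The queue owns the storage and hands out pointers to it under its own lock.
class ChunkQueue {
public:
    explicit ChunkQueue(std::size_t chunk_count) : storage_(chunk_count) {
        assert(chunk_count > 0);
        for (Chunk& chunk : storage_) free_.push(&chunk);
    }

    Chunk* acquire_free() { return take(free_); }        // producer: blocks until a chunk is free
    void publish(Chunk* chunk) { give(filled_, chunk); } // producer: hand over a filled chunk
    Chunk* acquire_filled() { return take(filled_); }    // consumer: blocks until data is ready
    void release(Chunk* chunk) { give(free_, chunk); }   // consumer: return the chunk to the pool

private:
    Chunk* take(std::queue<Chunk*>& from) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [&from] { return !from.empty(); });
        Chunk* chunk = from.front();
        from.pop();
        return chunk;
    }

    void give(std::queue<Chunk*>& to, Chunk* chunk) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            to.push(chunk);
        }
        changed_.notify_all();
    }

    std::vector<Chunk> storage_;   // never resized, so the pointers stay valid
    std::queue<Chunk*> free_;
    std::queue<Chunk*> filled_;
    std::mutex mutex_;
    std::condition_variable changed_;
};

// Sends `messages` chunks of 1024 bytes each from a producer to a consumer
// and returns the number of bytes the consumer saw.
std::size_t transfer(std::size_t chunk_count, std::size_t messages) {
    ChunkQueue queue(chunk_count);
    std::size_t bytes_seen = 0;
    std::thread consumer([&] {
        for (std::size_t i = 0; i < messages; ++i) {
            Chunk* chunk = queue.acquire_filled();
            bytes_seen += chunk->bytes.size();   // only a pointer was passed, not the data
            queue.release(chunk);
        }
    });
    for (std::size_t i = 0; i < messages; ++i) {
        Chunk* chunk = queue.acquire_free();
        chunk->bytes.assign(1024, 'x');
        queue.publish(chunk);
    }
    consumer.join();
    return bytes_seen;
}
```

The lock is held only while a pointer moves between the two internal queues; filling and processing a chunk happen outside it, and a chunk is owned by exactly one thread between acquire and publish/release.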

Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is that when you pass a message, the consumer receives a copy.
"the PostThreadMessage call just hides the details"
Yes, it does, but since it is a WinAPI call, you can be reasonably sure that it does it right.
"I still don't get the concept of async message if the message queue operations are still blocked somewhere else."
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would probably be slow. In that case, you need to use shared memory.
For passing state (e.g. a worker thread reporting progress to the GUI), messages are the way to go.
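The points in this answer can be sketched with a small standard-library queue, which is only a stand-in for the Windows message queue and not how PostThreadMessage is actually implemented. `MessageQueue`, `Progress` and `run_worker_and_collect` are hypothetical names for this example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// The locking lives here and nowhere else; callers cannot forget it.
template <typename Message>
class MessageQueue {
public:
    // Takes the message by value, so the queue stores its own copy.
    void post(Message message) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            messages_.push(std::move(message));
        }
        not_empty_.notify_one();
    }

    // Blocks until a message is available, then returns it.
    Message wait_and_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !messages_.empty(); });
        Message message = std::move(messages_.front());
        messages_.pop();
        return message;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<Message> messages_;
};

struct Progress {              // hypothetical status message
    int percent;
    std::string note;
};

// A worker posts progress updates; the calling thread receives copies.
// Returns the last percentage received.
int run_worker_and_collect() {
    MessageQueue<Progress> inbox;
    std::thread worker([&inbox] {
        Progress progress{0, "working"};
        for (int step = 1; step <= 4; ++step) {
            progress.percent = step * 25;
            inbox.post(progress);   // a copy is queued; `progress` stays private to the worker
        }
    });
    int last = 0;
    while (last < 100) last = inbox.wait_and_pop().percent;
    worker.join();
    return last;
}
```

Neither thread in `run_worker_and_collect` touches a mutex directly, and the receiver never sees the worker's own `progress` object, only copies of it.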

It's quite simple (I'm amazed others wrote such lengthy responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.

I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (needs to be done right ofc). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard to find bugs, and even if the code is perfect it will eat performance because of all the locking.

I had exactly the same question. After reading the answers, my feeling is:
In the most typical use case, queue = async and shared memory (locks) = sync. Indeed, you can build an async version of shared memory, but that's more code, similar to reinventing the message-passing wheel.
Less code = fewer bugs and more time to focus on other stuff.
The pros and cons are already mentioned by previous answers so I will not repeat.
