C/C++ - ring buffer in shared memory (POSIX compatible)

I have an application where producers and consumers ("clients") want to send broadcast messages to each other, i.e. an n:m relationship. All of them could be different programs, so they are separate processes and not threads.
To reduce the n:m relationships to something more maintainable I was thinking of introducing a small, central server. That server would offer a socket that each client connects to.
And each client would send a new message through that socket to the server - resulting in 1:n.
The server would also offer a shared memory that is read only for the clients. It would be organized as a ring buffer where the new messages would be added by the server and overwrite older ones.
This would give the clients some time to process a message - but if a client is too slow, that's bad luck; the message wouldn't be relevant anymore anyway...
The advantage I see in this approach is that I avoid synchronisation as well as unnecessary data copying and buffer hierarchies - the central buffer should be enough, shouldn't it?
That's the architecture so far - I hope it makes sense...
Now to the more interesting aspect of implementing that:
The index of the newest element in the ring buffer is a variable in shared memory and the clients would just have to wait till it changes. Instead of a stupid while( central_index == my_last_processed_index ) { /* do nothing */ } I want to free CPU resources, e.g. by using a pthread_cond_wait().
But that needs a mutex that I think I don't need - on the other hand, "Why do pthreads’ condition variable functions require a mutex?" gave me the impression that I'd better ask whether my architecture makes sense and could be implemented like that...
Can you give me a hint if all of that makes sense and could work?
(Side note: the client programs could also be written in the common scripting languages like Perl and Python. So the communication with the server has to be recreated there and thus shouldn't be too complicated or even proprietary)

If memory serves, the reason for the mutex accompanying a condition variable is that under POSIX, signalling the condition variable causes the kernel to wake up all waiters on it. In these circumstances, the first thing that consumer threads need to do is check that there is something to consume - by accessing a variable shared between producer and consumer threads. The mutex protects against concurrent access to the variable used for this purpose. This of course means that if there are many consumers, n-1 of them are needlessly awoken.
Having implemented precisely the arrangement described above, the choice of IPC object to use is not obvious. We were buffering audio between high-priority real-time threads in separate processes, and didn't want to block the consumer. As the audio was produced and consumed in real time, we were already getting scheduled regularly on both ends, and if there wasn't anything to consume (or space to produce into) we trashed the data because we'd already missed the deadline.
In the arrangement you describe, you will need a mutex to prevent the consumers concurrently consuming items that are queued (and believe me, on a lightly loaded SMP system, they will). However, you don't need to have the producer contend on this as well.
I don't understand your comment about the consumer having read-only access to the shared memory. In the classic lockless ring buffer implementation, the producer writes the queue tail pointer and the consumer(s) the head - whilst all parties need to be able to read both.
You might of course arrange for the queue head and tails to be in a different shared memory region to the queue data itself.
Also be aware that there is a theoretical data coherency hazard on SMP systems when implementing a ring buffer such as this - namely that the write-back to memory of the queue content may occur out of order with respect to that of the head or tail pointer (they sit in caches - usually per CPU core). There are other variants on this theme to do with synchronization of caches between CPUs. To guard against these, you need to use memory, load and store barriers to enforce ordering. See Memory Barrier on Wikipedia. You avoid this hazard by using kernel synchronisation primitives such as mutexes and condition variables, since they include the necessary barriers.
The C11 atomic operations can help with this.
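To make the head/tail arrangement concrete, here is a minimal single-producer/single-consumer sketch using C++11 atomics; the acquire/release pairs supply the barriers discussed above. It is illustrative only (fixed size, int payload, names invented, no overwriting of unread data as in the question's broadcast variant):

#include <atomic>
#include <cstddef>

// Hypothetical SPSC ring buffer: head is advanced only by the consumer,
// tail only by the producer; the indices grow monotonically and are
// reduced modulo N when indexing the array.
struct RingBuffer {
    static const std::size_t N = 1024;           // power of two
    int                      data[N];
    std::atomic<std::size_t> head{0};            // next slot to read
    std::atomic<std::size_t> tail{0};            // next slot to write
};

bool try_push(RingBuffer& rb, int value)
{
    std::size_t tail = rb.tail.load(std::memory_order_relaxed);
    std::size_t head = rb.head.load(std::memory_order_acquire);
    if (tail - head == RingBuffer::N)            // full
        return false;
    rb.data[tail % RingBuffer::N] = value;
    rb.tail.store(tail + 1, std::memory_order_release);   // publish after the write
    return true;
}

bool try_pop(RingBuffer& rb, int& value)
{
    std::size_t head = rb.head.load(std::memory_order_relaxed);
    std::size_t tail = rb.tail.load(std::memory_order_acquire);
    if (head == tail)                            // empty
        return false;
    value = rb.data[head % RingBuffer::N];
    rb.head.store(head + 1, std::memory_order_release);
    return true;
}

The release store on the publishing index ensures the payload write is visible before the index update; the matching acquire load on the other side ensures the reader sees the payload only after it sees the new index.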

You do need a mutex with pthread_cond_wait() as far as I know. The reason is that checking the condition and then waiting on it is not atomic: the condition could change between your check and the call to pthread_cond_wait(), and the wake-up would be missed, unless both steps are protected by a mutex.
It's possible that you can ignore this situation - the client might sleep past message 1, but when the subsequent message is sent then the client will wake up and find two messages to process. If that's unacceptable then use a mutex.
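For reference, the usual shape of that arrangement looks roughly like the sketch below (names such as newest_index are made up for illustration, not taken from the question). The key points are that the shared variable is only touched while the mutex is held, and that the wait sits in a loop that re-checks the condition:

#include <pthread.h>

pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  changed = PTHREAD_COND_INITIALIZER;
int newest_index = -1;                      /* shared state, guarded by lock */

/* Consumer: sleep until the index moves past what we already processed. */
int wait_for_next(int last_processed)
{
    pthread_mutex_lock(&lock);
    while (newest_index == last_processed)  /* re-check after every wake-up */
        pthread_cond_wait(&changed, &lock); /* atomically unlocks and sleeps */
    int latest = newest_index;
    pthread_mutex_unlock(&lock);
    return latest;
}

/* Producer: publish a new index and wake all waiting consumers. */
void publish(int index)
{
    pthread_mutex_lock(&lock);
    newest_index = index;
    pthread_mutex_unlock(&lock);
    pthread_cond_broadcast(&changed);
}

For the cross-process case in the question, the mutex and condition variable would additionally have to be initialised with the PTHREAD_PROCESS_SHARED attribute and placed in the shared memory segment.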

You could probably use a somewhat different design with sem_t, if your system has them; some POSIX systems are still stuck on the 2001 version of POSIX.
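For what it's worth, here is a rough sketch of how process-shared unnamed semaphores could replace the mutex/condition pair for the wake-up part. Everything in it - the struct layout, MAX_CLIENTS, the function names - is invented for illustration. The server posts one semaphore per client, and each client blocks on its own semaphore instead of polling:

#include <semaphore.h>

enum { MAX_CLIENTS = 16 };              /* assumed upper bound on clients */

/* Hypothetical control block placed in the shared memory segment. */
struct SharedControl {
    int   newest_index;
    sem_t new_message[MAX_CLIENTS];     /* one per client: sem_post wakes only one waiter */
};

/* Server, once after mapping the segment: */
void init_control(SharedControl* ctl)
{
    ctl->newest_index = -1;
    for (int i = 0; i < MAX_CLIENTS; ++i)
        sem_init(&ctl->new_message[i], /*pshared=*/1, /*value=*/0);
}

/* Server, after writing a new entry into the ring buffer: */
void announce(SharedControl* ctl, int index)
{
    ctl->newest_index = index;
    for (int i = 0; i < MAX_CLIENTS; ++i)
        sem_post(&ctl->new_message[i]);
}

/* Client i blocks here instead of busy-waiting: */
void wait_for_message(SharedControl* ctl, int i)
{
    sem_wait(&ctl->new_message[i]);
    /* now read ctl->newest_index and process the new ring buffer entry */
}

Note that this conflicts slightly with the plan of mapping the segment read-only for clients: sem_wait() has to modify the semaphore, so at least the control block would need to be writable by them.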
You probably don't necessarily need a mutex/condition pair. That is just how it was designed a long time ago for POSIX.
Modern C (C11) and C++ (C++11) now bring you (or will bring you) atomic operations, a feature that is implemented in all modern processors but long lacked support from most higher-level languages. Atomic operations are part of the answer for resolving the race conditions of a ring buffer as you want to implement it. But they are not sufficient on their own, because with them you can only do an active wait through polling, which is probably not what you want.
Linux, as an extension to POSIX, has futexes, which resolve both problems: they avoid races on updates by using atomic operations, and they can put waiters to sleep via a system call. Futexes are often considered too low-level for everyday programming, but I think they actually aren't too difficult to use. I have written up things here.
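A bare-bones, Linux-only sketch of that futex idea applied to the question's "wait until the shared index changes" problem could look like the following (illustrative names, no error handling; it assumes the std::atomic<int> lives in a process-shared mapping and that its object representation is the plain 32-bit word the kernel expects):

#include <atomic>
#include <climits>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Thin wrapper around the raw futex system call.
static long futex_call(std::atomic<int>* addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

// Client: sleep until the shared index differs from the last value we processed.
void wait_for_change(std::atomic<int>& newest_index, int last_seen)
{
    while (newest_index.load(std::memory_order_acquire) == last_seen) {
        // FUTEX_WAIT goes to sleep only if *addr still equals last_seen,
        // so there is no lost-wakeup window between the check and the sleep.
        futex_call(&newest_index, FUTEX_WAIT, last_seen);
    }
}

// Server: publish the new index and wake every sleeping client.
void publish(std::atomic<int>& newest_index, int new_value)
{
    newest_index.store(new_value, std::memory_order_release);
    futex_call(&newest_index, FUTEX_WAKE, INT_MAX);
}

The nice property compared to pthread_cond_wait() is that no mutex is needed: the kernel itself compares the word against the expected value before putting the caller to sleep.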

Related

Simple mutex protected queue vs. thread safe queue in C++

I want to use a simple thread-safe std::queue in my program, which has multiple threads accessing the same queue. The first thing that came to my mind was protecting the queue operations with a mutex, as below:
/*Enqueue*/
mutex.lock();
queue.push(item);
mutex.unlock();
/*Dequeue*/
mutex.lock();
val = queue.front();
queue.pop();
mutex.unlock();
/*some operation*/
I've seen many robust implementations using condition variables for a thread-safe queue, e.g. https://stackoverflow.com/a/16075550/3598205. Will there be a significant difference in performance if I only have two threads accessing the same queue?
Mutexes and condition variables do two different things, although they are often used together.
To ensure that only one thread can access a resource at a time, use a mutex. The code you posted shows an example of this.
To block a worker thread until there is something for it to do, have it wait on a condition variable (which is then signalled by another thread providing some kind of work item). There is an example of this over at cppreference.
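Along the lines of the cppreference example, a minimal sketch of the two used together might look like this (the class and its names are illustrative, not the implementation from the linked answer):

#include <condition_variable>
#include <mutex>
#include <queue>

// A bare-bones blocking queue: the mutex makes push/pop exclusive,
// the condition variable lets pop() sleep until there is work.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        cond_.notify_one();          // wake one waiting consumer
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !queue_.empty(); });  // re-checks the predicate
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex              mutex_;
    std::condition_variable cond_;
    std::queue<T>           queue_;
};

A producer thread calls push(); worker threads call pop() and simply sleep inside the condition-variable wait while the queue is empty.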
Your first thought when writing multi-threaded code should be to write robust, safe code. It's very easy to make mistakes, especially if you're new to the area, and bugs are very hard to diagnose since they lead to sporadic, unpredictable errors. Worry about performance later.
Your application will be limited by one main thing. We call this the bottleneck. In my experience, 90% of the applications I've seen were limited by bus bandwidth - transferring data between main memory and the CPU.
Different apps will have different bottlenecks. It could be GPU performance, or disk access. It could be raw CPU power or maybe contention accessing the queue above.
Very likely, the mutex will be just as good as the fancy lock-free queue. But you don't know until you profile.
It might very well happen that your application is strictly limited by the access to this queue. For example, if your app is a low-latency market data exchange for a financial institution and the queue is holding the buy/sell directives, then it will make a difference. The two threads could be constantly writing to different locations on a queue that has a couple hundred items (so, on different memory pages).
Or it might be that your application is always waiting on the GPU to render frames and the queue holds the player weapon changes that rendering and gameplay threads access just a couple times per frame.
Profile and check.

Implementing progress visualization in C++

Let's say I have a computationally intensive algorithm running.
For example, let's say it's a routing algorithm, and in a window running on a separate thread I want to show the user which routes are currently being analyzed and such; for whatever reason, the algorithm contains heavily CPU-intensive code.
The important thing is that I don't want to slow down the worker thread just for the sake of displaying progress; it needs to run at full speed as much as possible. It is perfectly OK if the user sees stale data, such as an in-between state that didn't actually occur (say, two active routes at once), because this progress visualization is for informational purposes only, and nothing else.
From a theoretical standpoint, I think that according to the C++ standard, my best option is to use std::atomic with std::memory_order_relaxed on both threads. But that would slow down the code on the worker thread noticeably.
From a practical standpoint, though, I'm just tempted to ignore std::atomic altogether, and just have the worker thread work with all the variables normally. Who cares if the GUI thread reads stale data? I don't, and presumably neither will the user. In reality it won't matter because there is only one worker thread, and only that thread needs to observe valid writes, which in practice is the only thing that'll happen.
What I'm wondering about is:
What is the best way to solve this kind of problem, both in theory and in practice?
Do people just ignore the standard and go for raw primitives, or do they bite the bullet and take the performance hit of using std::atomic?
Or are there other facilities I'm not aware of for solving this problem?
Ignoring proper fences for std::atomic wouldn't buy you much, but you might be at risk of losing the communication between threads completely, mostly on the compiler side. The problem does not exist on the x86 hardware side at all, for example, because every store to memory (if you can ensure your compiler does it as expected) already has the required store-with-release semantics.
Also, I doubt that sharing the progress more often than 30-100 FPS (or Hz) brings any value. On the other hand, it can certainly put an unnecessary burden on system resources (if repeated in a tight loop) and break compiler optimizations, e.g. vectorization.
So, if the overhead for worker thread is the concern, share the info with less frequency. E.g. update the atomic counter once in 1024 iterations:
// worker thread
if( i % 1024 == 0 ) // update the progress info
    my_atomic_progress.store( i, std::memory_order_release ); // regular `mov` on x86

// GUI thread
auto i = my_atomic_progress.load( std::memory_order_consume );
This example also shows the minimal fences necessary to establish the communication, otherwise the compiler is free to optimize the memory operations out of a loop for example.
There is no best way - it depends on how much data you need to send to the display. If it's just a single long integer value, and the display comes with no guarantees anyway, then I'd just write the value and have done with it. Occasionally the reader will read a corrupted value, but it won't matter, so I won't care.
Otherwise, I'd be tempted to send the value to a queue and use an event or condition variable to trigger the read afterwards (as often you do not want the reader running full tilt, and you need some way to inform it that there is new data to read).
I'm not sure the overhead for std::atomic is that great - isn't it going to be implemented in the OS primitives anyway? If so, the primitives (on Windows/x86 at least, via the InterlockedExchange function) end up as a single CPU instruction after the compiler and optimiser have done their thing.

IOCP Critical Section Design

I'm running a fully operational IOCP TCP socket application. Today I was thinking about the critical section design, and now I have one question that keeps going around in my head: global or per-client critical section? I came to this because, as I see it, there is no point in using multiple worker threads if every thread depends on a single lock, right? I mean... right now I don't see any performance issue with 100 simultaneous clients, but what if there were 10000?
My shared resource is a per-client pre-allocated struct, so each client has its own IO context, socket and so on. There is no inter-client resource sharing, so I think that is another point in favour of the per-client CS. I use one accept thread and 8 (processors * 2) worker threads. This application is basically designed for small (< 1KB) packets, but is sometimes used for file streaming.
The "correct" answer probably depends on your design, the number of concurrent clients and the performance that you require from the hardware that you have available.
In general, I find it best to go with the simplest thing that works and then profile to locate hot spots.
However... You say that you have no inter-client shared resources so I assume the only synchronisation that you need to do is around 'per-connection' state.
Since it's per connection the obvious (to me) design would be for the per-connection state to contain its own critical section. What do you perceive to be the downside of this approach?
The problem with a single shared lock is that you introduce contention between connections (and threads) that have no reason to block each other. This will adversely affect performance and will likely become a hot-spot as connection numbers rise.
Once you have a per-connection lock you might want to look at using it as little as possible, by having the IOCP threads simply take the lock to place completions in a per-connection queue for processing; a sketch of this follows below. This has the advantage of allowing a single IOCP thread to work on each connection and preventing a single connection from having additional IOCP threads blocking on it. It also works well with 'skip completion port on success' processing.
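A rough sketch of what that per-connection arrangement could look like (the struct layout and names are invented; error handling and cleanup are omitted):

#include <winsock2.h>
#include <windows.h>
#include <deque>

// Invented per-connection context: each connection carries its own lock and
// its own queue of pending completions, so connections never contend with each other.
struct Connection {
    SOCKET            socket;
    CRITICAL_SECTION  lock;                   // protects only this connection's state
    std::deque<DWORD> pendingCompletions;
    bool              workerActive = false;   // is an IOCP thread already draining us?
};

void InitConnection(Connection& c, SOCKET s)
{
    c.socket = s;
    InitializeCriticalSection(&c.lock);
}

// Called from an IOCP worker thread when a completion for this connection arrives.
// The lock is held only long enough to enqueue; if another thread is already
// processing this connection, the new item is left for it to pick up.
void QueueCompletion(Connection& c, DWORD bytesTransferred)
{
    bool shouldProcess = false;
    EnterCriticalSection(&c.lock);
    c.pendingCompletions.push_back(bytesTransferred);
    if (!c.workerActive) {
        c.workerActive = true;
        shouldProcess = true;
    }
    LeaveCriticalSection(&c.lock);

    if (shouldProcess) {
        // drain c.pendingCompletions here, re-taking the lock briefly per item
    }
}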

what's the advantage of message queue over shared data in thread communication?

I read an article about multithreaded program design, http://drdobbs.com/architecture-and-design/215900465. It says that a best practice is "replacing shared data with asynchronous messages. As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuses me is that I don't see the difference between using shared data and message queues. I am now working on a non-GUI project on Windows, so let's use Windows message queues, and take a traditional producer-consumer problem as an example.
Using shared data, there would be a shared container and a lock guarding the container between the producer thread and the consumer thread. When the producer outputs a product, it first waits for the lock, then writes something to the container, then releases the lock.
Using a message queue, the producer could simply call PostThreadMessage without blocking. And this is the asynchronous message's advantage. But I think there must be some lock guarding the message queue between the two threads, otherwise the data would definitely get corrupted. The PostThreadMessage call just hides the details. I don't know whether my guess is right, but if it is, the advantage seems to no longer exist, since both methods do the same thing and the only difference is that the system hides the details when using message queues.
PS: maybe the message queue uses a non-blocking container, but I could use a concurrent container in the former approach too. I want to know how the message queue is implemented, and whether there is any performance difference between the two ways.
updated:
I still don't get the concept of async message if the message queue operations are still blocked somewhere else. Correct me if my guess is wrong: when we use shared containers and locks, we block in our own thread; but when using message queues, my thread returns immediately, and the blocking work is left to some system thread.
Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It's much easier to implement than is shared memory for intercomputer communication. Also, as you've already noticed, message passing has the advantage that application developers don't need to worry about the details of protections like shared memory.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speeds within a computer. Shared memory is usually faster than message passing, as message passing is typically implemented using system calls and thus requires the more time-consuming task of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish the shared-memory regions. Once established, all accesses are treated as normal memory accesses without extra assistance from the kernel.
Edit: One case where you might want to implement your own queue is when there are lots of messages to be produced and consumed, e.g. a logging system. With the implementation of PostThreadMessage, the queue capacity is fixed. Messages will most likely get lost if that capacity is exceeded.
Imagine you have 1 thread producing data, and 4 threads processing that data (presumably to make use of a multi-core machine). If you have a big global pool of data you are likely to have to lock it when any of the threads needs access, potentially blocking 3 other threads. As you add more processing threads you increase the chance of a lock having to wait and increase how many things might have to wait. Eventually adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread, then they can't block each other. You still have to lock the queue between the producer and consumer threads, but as you have a separate queue for each thread you have a separate lock, and each thread can't block all the others waiting for data.
If you suddenly get a 32 core machine you can add 20 more processing threads (and queues) and expect that performance will scale fairly linearly unlike the first case where the new threads will just run into each other all the time.
I have used a shared memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. This is very useful when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. Producers acquire a lock for input to the queue and receive a pointer to the next available storage chunk (usually an object of some sort), populate it and release it. The consumer will block until the next shared object has been released by the producer. It can then acquire a lock to the storage, process the data and release it back to the pool. A suitably designed queue can perform multiple-producer/multiple-consumer operations with great efficiency. Think Java thread-safe (java.util.concurrent.BlockingQueue) semantics, but for pointers to storage.
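A very condensed sketch of that hybrid for the in-process case (names and sizes invented): only pointers travel through the locked queues, while the payload stays in a pre-allocated pool that the queue owns:

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct Chunk { char payload[4096]; };   // arbitrary example payload size

// Producers borrow a free chunk, fill it, and submit a pointer to it;
// consumers take the pointer, process the data, and return the chunk to the pool.
class ChunkQueue {
public:
    explicit ChunkQueue(std::size_t count) : storage_(count) {
        for (Chunk& c : storage_) free_.push(&c);
    }

    Chunk* acquire() { return pop(free_); }          // producer: get an empty chunk
    void   submit(Chunk* c) { push(ready_, c); }     // producer: publish a filled chunk
    Chunk* consume() { return pop(ready_); }         // consumer: get the next filled chunk
    void   release(Chunk* c) { push(free_, c); }     // consumer: return the chunk to the pool

private:
    void push(std::queue<Chunk*>& q, Chunk* c) {
        { std::lock_guard<std::mutex> l(m_); q.push(c); }
        cv_.notify_all();   // notify_all because both queues share one condition variable
    }
    Chunk* pop(std::queue<Chunk*>& q) {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q.empty(); });
        Chunk* c = q.front(); q.pop();
        return c;
    }

    std::vector<Chunk> storage_;        // the queue owns the shared storage
    std::queue<Chunk*> free_, ready_;
    std::mutex m_;
    std::condition_variable cv_;
};

acquire()/submit() on the producer side and consume()/release() on the consumer side mirror the acquire/populate/release cycle described above, without ever copying the payload.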
Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is when you pass a message, the consumer will receive a copy.
the PostThreadMessage call just hide the details
Yes, it does, but being a WINAPI call, you can be reasonably sure that it does it right.
I still don't get the concept of async message if the message queue operations are still blocked somewhere else.
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would probably be slow. In that case, you need to use shared memory.
For passing state (i.e. worker thread reporting progress to the GUI) the messages are the way to go.
It's quite simple (I'm amazed others wrote such lengthy responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.
I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (needs to be done right ofc). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard to find bugs, and even if the code is perfect it will eat performance because of all the locking.
I had exactly the same question. After reading the answers, I feel:
in most typical use cases, queue = async, shared memory (locks) = sync. Indeed, you can do an async version of shared memory, but that's more code, similar to reinventing the message-passing wheel.
Less code = fewer bugs and more time to focus on other stuff.
The pros and cons are already mentioned by previous answers so I will not repeat.

Thread communication theory

What is the common theory behind thread communication? I have some primitive idea about how it should work but something doesn't settle well with me. Is there a way of doing it with interrupts?
Really, it's just the same as any concurrency problem: you've got multiple threads of control, and it's indeterminate which statements on which threads get executed when. That means there are a large number of POTENTIAL execution paths through the program, and your program must be correct under all of them.
In general the place where trouble can occur is when state is shared among the threads (aka "lightweight processes" in the old days). That happens when there are shared memory areas.
To ensure correctness, what you need to do is ensure that these data areas get updated in a way that can't cause errors. To do this, you need to identify "critical sections" of the program, where sequential operation must be guaranteed. Those can be as little as a single instruction or line of code; if the language and architecture ensure that these are atomic, that is, can't be interrupted, then you're golden.
Otherwise, you identify that section, and put some kind of guard onto it. The classic way is to use a semaphore, which is an atomic primitive that only allows one thread of control past at a time. These were invented by Edsger Dijkstra, and so have names that come from the Dutch, P and V. When you come to a P, only one thread can proceed; all other threads are queued and waiting until the executing thread comes to the associated V operation.
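In today's terms, P and V are usually spelled "wait" and "post"; a tiny illustration with a POSIX semaphore used as a guard (names invented):

#include <semaphore.h>

sem_t guard;

void init_guard(void)
{
    sem_init(&guard, /*pshared=*/0, /*initial value=*/1);   /* 1 = one thread at a time */
}

void critical_work(void)
{
    sem_wait(&guard);   /* P: only one thread of control gets past this point */
    /* ... the critical section: touch the shared data ... */
    sem_post(&guard);   /* V: let the next queued thread proceed */
}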
Because these primitives are a little primitive, and because the Dutch names aren't very intuitive, some other larger-scale approaches have been developed.
Per Brinch Hansen invented the monitor, which is basically just a data structure whose operations are guaranteed to be atomic; they can be implemented with semaphores. Monitors are pretty much what Java synchronized statements are based on; they make an object or code block have that particular behaviour - that is, only one thread can be "in" them at a time - with simpler syntax.
There are other models possible. Haskell and Erlang solve the problem by being functional languages that never allow a variable to be modified once it's created; this means they naturally don't need to worry about synchronization. Some newer languages, like Clojure, instead have a structure called "transactional memory", which basically means that when there is an assignment, you're guaranteed the assignment is atomic and reversible.
So that's it in a nutshell. To really learn about it, the best places to look are Operating Systems texts, like, e.g., Andy Tanenbaum's text.
The two most common mechanisms for thread communication are shared state and message passing.
The most common way for threads to communicate is via some shared data structure, typically a queue. Some threads put information into the queue while others take it out. The queue must be protected by operating system facilities such as mutexes and semaphores. Interrupts have nothing to do with it.
If you're really interested in a theory of thread communications, you may want to look into formalisms like the pi Calculus.
To communicate between threads, you'll need to use whatever mechanism is supplied by your operating system and/or runtime. Interrupts would be unusually low level, although they might be used implicitly if your threads communicate using sockets or named pipes.
A common pattern would be to implement shared state using a shared memory block, relying on an os-supplied synchronization primitive such as a mutex to spare you from busy-waiting when your read from the block. Remember that if you have threads at all, then you must have some kind of scheduler already (whether it's native from the OS or emulated in your language runtime). So this scheduler can provide synchronization objects and a "sleep" function without necessarily having to rely on hardware support.
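Tying this back to the original question: under POSIX, the way to put such an OS-supplied primitive into a memory block shared between processes is to mark it process-shared when initialising it. A rough sketch (error handling omitted; the header layout and names are invented):

#include <pthread.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Invented control block living at the start of the shared memory segment.
struct SharedHeader {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             newest_index;
};

// Server: create the segment and initialise process-shared primitives in it.
SharedHeader* create_shared_header(const char* name)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(SharedHeader));
    void* p = mmap(nullptr, sizeof(SharedHeader),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    SharedHeader* hdr = static_cast<SharedHeader*>(p);

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&hdr->lock, &ma);

    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&hdr->changed, &ca);

    hdr->newest_index = -1;
    return hdr;
}

Note that clients which map the segment strictly read-only could not take this mutex, so in such a design at least the header page would have to be mapped read-write by everyone.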
Sockets, pipes, and shared memory work between processes too. Sometimes a runtime will give you a lighter-weight way of doing synchronization for threads within the same process. Shared memory is cheaper within a single process. And sometimes your runtime will also give you an atomic message-passing mechanism.