Best practices for inter-thread communication of large objects - C++

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Consider the following inter-thread communication:
Thread #1: Processes a large object, call it A, that it receives from network events. This data significantly alters its internal state.
Thread #2: Needs to know about the current, altered internal state of object A to make some set of decisions.
I can think of essentially two methods to proceed:
Thread #2 has a pointer to object A, and when signaled (say, through constantly checking a small object sent via a lockfree queue or by checking a shared atomic bool), Thread #2 locks object A and reads it.
Thread #1 pushes some version of a copy of the object onto the lockfree queue so that thread #2 can simply receive it directly, use it, and dispose of the copy when it is finished.
Method #1 avoids the costly copy of a large object needed for Method #2, but it always requires a lock/unlock and, if I understand things correctly, additional L3 cache hits.
I understand there may not be any simple performance answers... I may simply need to benchmark. I am mostly interested in best-practices advice. Specifically, I'd like to know how to think about the problem at the cache/memory level, to understand how the information is being passed and copied internally.

It really depends on how much information thread 2 needs to access from thread 1. If it needs complete access to all of thread 1's objects/memory, then yes, copying all of that is costly and I would use a lock. But it also depends on timing. If all thread 2 needs is to check a small piece of state, such as one variable, I would have thread 1 update a smaller object outside of itself that can be shared without locks, since thread 2 only reads while thread 1 only writes.
Also, you might not even need locking if thread 2 only reads and is okay with making a decision on the data's state as it is being updated.
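For that single-writer/single-reader case, here is a minimal sketch; the StateSummary type and its fields are made up for illustration, and it assumes the state thread #2 needs can be boiled down to a small, trivially copyable struct:

    #include <atomic>
    #include <cstdint>

    // Hypothetical condensed view of object A's state, small enough to share cheaply.
    struct StateSummary {
        std::uint32_t version;   // bumped by thread #1 on every update
        std::int32_t  mode;      // whatever decision-relevant state thread #2 needs
    };

    std::atomic<StateSummary> g_summary{ StateSummary{} };   // single writer, single reader

    void publish(const StateSummary& s) {     // called by thread #1
        g_summary.store(s, std::memory_order_release);
    }

    StateSummary read_latest() {              // called by thread #2
        return g_summary.load(std::memory_order_acquire);
    }

Whether the store/load is actually lock-free depends on the struct's size; it is worth checking std::atomic<StateSummary>::is_always_lock_free on your target. If the summary can't be squeezed into a few machine words, you're back to choosing between the two methods you describe.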


Is there a C++ design pattern that implements a mechanism or mutex that controls the amount of time a thread can own a locked resource?

I am looking for a way to guarantee that any time a thread locks a specific resource, it is forced to release that resource after a specific period of time (if it has not already released it). Envision a connection where you need to limit the amount of time any specific thread can own that connection for.
I envision this is how it could be used:
{
    std::lock_guard<std::TimeLimitedMutex> lock(this->myTimeLimitedMutex, timeout);
    try {
        // perform some operation with the resource that myTimeLimitedMutex guards.
    }
    catch (MutexTimeoutException ex) {
        // perform cleanup
    }
}
I see that there is a timed_mutex that lets the program time out if a lock cannot be acquired. I need the timeout to occur after the lock is acquired.
There are already some situations where you get a resource that can be taken away unexpectedly. For instance, a TCP socket: once a socket connection is made, code on each side needs to handle the case where the other side drops the connection.
I am looking for a pattern that handles the types of resources that normally time out on their own, but when they don't, need to be reset. It does not have to handle every type of resource.
This can't work, and it never will. It goes against the whole concept of ownership and atomic transactions. When a thread acquires the lock and performs two operations in a row, it expects them to become atomically visible to the outside world. In this scenario it would be entirely possible for the transaction to be torn: the first part of it would be performed, but the second would not.
What's worse, since the lock would be forcefully taken away, the partially executed transaction would become visible to the outside world before the interrupted thread had any chance to roll it back.
This idea runs contrary to every school of multi-threaded thinking.
I support SergeyA's answer. Releasing a locked mutex after a timeout is a bad idea and cannot work. Mutex stands for mutual exclusion, and that is a rock-hard contract which cannot be violated.
But you can do what you want:
Problem: You want to guarantee that your threads do not hold the mutex longer than a certain time T.
Solution: Never lock the mutex for longer than time T. Instead, write your code so that the mutex is locked only for the absolutely necessary operations. It is always possible to find such a time T (modulo the uncertainties and limits imposed by a multitasking, multiuser operating system, of course).
To achieve that (examples):
Never do file I/O inside a locked section.
Never call a system call while a mutex is locked.
Avoid sorting a list while a mutex is locked (*).
Avoid doing a slow operation on each element of a list while a mutex is locked (*).
Avoid memory allocation/deallocation while a mutex is locked (*).
There are exceptions to these rules, but the general guideline is:
Make your code slightly less optimal (e.g. do some redundant copying inside the critical section) to make the critical section as short as possible. This is good multithreading programming.
(*) These are just examples of operations where it is tempting to lock the entire list, do the operation, and then unlock. Instead it is advisable to take a local copy of the list and clear the original while the mutex is locked, ideally by using the swap() operation offered by most STL containers, and then do the slow operation on the local copy outside of the critical section. This is not always possible, but it is always worth considering. Sorting costs at least O(n log n), can degrade to quadratic for some algorithms, and usually needs random access to the entire container; it is better to sort (a copy of) the list outside of the critical section and later check whether elements need to be added or removed. Memory allocations also have quite some complexity behind them, so massive allocation/deallocation inside a critical section should be avoided.
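To make the swap() idea concrete, here is a minimal sketch; the container, the item type and the process() function are placeholders:

    #include <mutex>
    #include <vector>

    std::mutex g_mutex;
    std::vector<int> g_pending;          // shared list of pending items (illustrative)

    void process(int item);              // assumed to be defined elsewhere; potentially slow

    void drain_and_process() {
        std::vector<int> local;
        {
            std::lock_guard<std::mutex> lock(g_mutex);
            local.swap(g_pending);       // critical section is just an O(1) swap of internals
        }
        // The slow part (sorting, per-element work, allocations) happens here,
        // on the private copy, with no lock held.
        for (int item : local) {
            process(item);
        }
    }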
You can't do that with only C++.
If you are using a Posix system, it can be done.
You'll have to trigger a SIGALRM signal that is unmasked only for the thread that should time out. In the signal handler, set a flag and use longjmp to return to the thread's code.
In the thread code, at the setjmp position, you can only arrive there if the signal was triggered, so you can throw the timeout exception from that point.
Please see this answer for how to do that.
Also, on Linux, it seems you can throw directly from the signal handler (so no longjmp/setjmp is needed there).
BTW, if I were you, I would code the opposite. Think about it: you want to tell a thread "hey, you're taking too long, so let's throw away all the (long) work you've done so far so I can make progress".
Ideally, your long-running thread should be more cooperative, doing something like "I've done part A of an ABCD task; let me release the mutex so others can make progress, then check whether I can take it again to do B", and so on.
You probably want to be more fine-grained (have more mutexes on smaller objects, but make sure you always lock them in the same order) or use read/write locks (so that other threads can use the objects while you're not modifying them), etc.
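A rough sketch of that cooperative style, with placeholder names (do_step() stands for the A/B/C/D parts of the long job):

    #include <mutex>

    std::mutex g_resource_mutex;         // guards the shared resource (illustrative)

    void do_step(int step);              // assumed per-step work on the shared resource

    void long_cooperative_task() {
        for (int step = 0; step < 4; ++step) {        // the "ABCD" parts
            std::unique_lock<std::mutex> lock(g_resource_mutex);
            do_step(step);
            // The lock is released at the end of each iteration, giving other
            // threads a chance to acquire the mutex before the next step.
        }
    }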
Such an approach cannot be enforced because the holder of the mutex needs the opportunity to clean up anything which is left in an invalid state part way through the transaction. This can take an unknown arbitrary amount of time.
The typical approach is to release the lock when doing long tasks and re-acquire it as needed. You have to manage this yourself, as everyone's approach will be slightly different.
The only situation I know of where this sort of thing is accepted practice is at the kernel level, especially with respect to microcontrollers (which either have no kernel, or are all kernel, depending on who you ask). You can set an interrupt which modifies the call stack, so that when it is triggered it unwinds the particular operations you are interested in.
"Condition" variables can have timeouts. This allows you to wait until a thread voluntarily releases a resource (with notify_one() or notify_all()), but the wait itself will timeout after a specified fixed amount of time.
Examples in the Boost documentation for "conditions" might make this more clear.
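The standard library offers the same thing as std::condition_variable::wait_for; a minimal sketch, with illustrative names:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    std::mutex m;
    std::condition_variable cv;
    bool resource_released = false;      // set by the thread that frees the resource

    void release_resource() {            // the voluntary release mentioned above
        {
            std::lock_guard<std::mutex> lock(m);
            resource_released = true;
        }
        cv.notify_one();
    }

    bool wait_for_resource(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m);
        // Returns false if the flag is still unset when the timeout expires.
        return cv.wait_for(lock, timeout, [] { return resource_released; });
    }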
If you want to force a release, you have to write the code which will force it though. This could be dangerous. The code written in C++ can be doing some pretty close-to-the-metal stuff. The resource could be accessing real hardware and it could be waiting on it to finish something. It may not be physically possible to end whatever the program is stuck on.
However, if it is possible, then you can handle it in the thread in which the wait() times out.

Multirate threads

I recently ran into a requirement for a multithreaded application whose threads run at different rates.
The questions then become, since I am still learning multithreading:
A scenario is given to put things into perspective:
Say 1st thread runs at 100 Hz "real time"
2nd runs at 10 Hz
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread? Is the common practice to just read whatever is available from the first thread, or does there need to be some kind of decimation to reduce the rate?
Does myData need to be some kind of Singleton with a locking mechanism? Although myData isn't shared as such, it is updated by the first thread and used in the second thread.
How about the opposite case, when the data used in one thread needs to be used at a higher rate in a different thread?
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::deque or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops the data items off of the other end. Be sure to serialize all accesses to the FIFO queue (using a mutex or similar locking mechanism) to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
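A minimal sketch of the mutex-protected FIFO described above, using a condition variable so the consumer can block when the queue is empty (names are illustrative; a real implementation would probably also want a non-blocking try_pop so the slower thread can drain whatever is available when it wakes up):

    #include <condition_variable>
    #include <deque>
    #include <mutex>

    template <typename T>
    class SyncQueue {
    public:
        void push(T value) {                      // producer (e.g. the 100 Hz thread)
            {
                std::lock_guard<std::mutex> lock(m_);
                q_.push_back(std::move(value));
            }
            cv_.notify_one();
        }

        T pop() {                                 // consumer (e.g. the 10 Hz thread); blocks if empty
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            T value = std::move(q_.front());
            q_.pop_front();
            return value;
        }

    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::deque<T> q_;
    };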
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
Does myData need to be some kind of Singleton with a locking mechanism?
Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!)
How about the opposite case, when the data used in one thread needs to be used at a higher rate in a different thread?
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up don't matter, because the communications mechanism will do the right thing regardless of when or how often the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
producer-consumer container that lets the first thread enqueue data and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on the size, after which either data is lost or the 1st thread is forced to slow down and wait before enqueueing further values
there are libraries available (e.g. Boost), or if you want to implement it yourself, google some tutorials/docs on mutexes and condition variables
do something conceptually similar to the above, but where the size limit is 1 so there's just the single myData variable rather than a "container" (see the sketch after this answer) - but all the synchronisation and delay choices remain the same
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer argument to the function(s) run in the threads. Singleton's easily overused and best avoided unless reasons stack up high....
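A sketch of the "container of length one" variant mentioned above, where the consumer only ever sees the most recent value; std::optional is used here just to signal "nothing new yet", and all names are illustrative:

    #include <mutex>
    #include <optional>

    template <typename T>
    class LatestValue {                  // queue of length one: the newest value wins
    public:
        void set(T value) {              // producer simply overwrites whatever is stored
            std::lock_guard<std::mutex> lock(m_);
            value_ = std::move(value);
        }

        std::optional<T> take() {        // consumer grabs the latest value, if any
            std::lock_guard<std::mutex> lock(m_);
            std::optional<T> out = std::move(value_);
            value_.reset();
            return out;
        }

    private:
        std::mutex m_;
        std::optional<T> value_;
    };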

what's the advantage of message queue over shared data in thread communication?

I read an article about multithreaded program design, http://drdobbs.com/architecture-and-design/215900465, which says it is a best practice to "replace shared data with asynchronous messages. As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data".
What confuses me is that I don't see the difference between using shared data and using message queues. I am currently working on a non-GUI project on Windows, so let's use Windows message queues, and take the traditional producer-consumer problem as an example.
Using shared data, there would be a shared container and a lock guarding it between the producer thread and the consumer thread. When the producer outputs a product, it first waits for the lock, then writes something to the container, then releases the lock.
Using a message queue, the producer can simply call PostThreadMessage without blocking, and that is the async message's advantage. But I think there must be some lock guarding the message queue between the two threads; otherwise the data would definitely get corrupted. The PostThreadMessage call just hides the details. I don't know whether my guess is right, but if it is, the advantage seems to vanish, since both methods do the same thing and the only difference is that the system hides the details when using message queues.
PS: maybe the message queue uses a non-blocking container, but I could use a concurrent container in the former approach too. I want to know how the message queue is implemented and whether there is any performance difference between the two ways.
Updated:
I still don't get the concept of async messages if the message queue operations are still blocking somewhere. Correct me if my guess is wrong: when we use shared containers and locks, we block in our own thread, but when using message queues, my own thread returns immediately and the blocking work is left to some system thread.
Message passing is useful for exchanging smaller amounts of data, because no conflicts need be avoided. It is much easier to implement than shared memory for inter-computer communication. Also, as you've already noticed, message passing has the advantage that application developers don't need to worry about the details of protections such as those required for shared memory.
Shared memory allows maximum speed and convenience of communication, as it can be done at memory speed within a computer. Shared memory is usually faster than message passing, as message passing is typically implemented using system calls and thus requires the more time-consuming task of kernel intervention. In contrast, in shared-memory systems, system calls are required only to establish the shared-memory regions. Once established, all accesses are treated as normal memory accesses with no extra assistance from the kernel.
Edit: One case where you might want to implement your own queue is when there are lots of messages to be produced and consumed, e.g. a logging system. With the implementation of PostThreadMessage, the queue capacity is fixed, and messages will most likely get lost if that capacity is exceeded.
Imagine you have 1 thread producing data and 4 threads processing that data (presumably to make use of a multi-core machine). If you have a big global pool of data, you are likely to have to lock it when any of the threads needs access, potentially blocking 3 other threads. As you add more processing threads, you increase the chance of a lock having to wait and increase how many things might have to wait. Eventually adding more threads achieves nothing because all you do is spend more time blocking.
If instead you have one thread sending messages into message queues, one for each consumer thread, then they can't block each other. You still have to lock the queue between the producer and consumer threads, but as you have a separate queue for each thread, you have a separate lock, and no thread can block all the others while waiting for data.
If you suddenly get a 32 core machine you can add 20 more processing threads (and queues) and expect that performance will scale fairly linearly unlike the first case where the new threads will just run into each other all the time.
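A rough sketch of that layout, with one queue per consumer so the consumers never contend with each other; the types and the round-robin dispatch are illustrative:

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>
    #include <vector>

    struct WorkItem { int payload; };        // illustrative

    struct WorkerQueue {                     // one of these per consumer thread
        std::mutex m;
        std::condition_variable cv;
        std::deque<WorkItem> items;
    };

    // Producer: round-robin dispatch; each consumer only ever contends with the
    // producer on its own queue, never with the other consumers.
    void dispatch(std::vector<WorkerQueue>& queues, std::size_t& next, WorkItem item) {
        WorkerQueue& q = queues[next];
        next = (next + 1) % queues.size();
        {
            std::lock_guard<std::mutex> lock(q.m);
            q.items.push_back(item);
        }
        q.cv.notify_one();
    }

    // Consumer: blocks on its own queue only.
    WorkItem take(WorkerQueue& q) {
        std::unique_lock<std::mutex> lock(q.m);
        q.cv.wait(lock, [&] { return !q.items.empty(); });
        WorkItem item = q.items.front();
        q.items.pop_front();
        return item;
    }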
I have used a shared-memory model where the pointers to the shared memory are managed in a message queue with careful locking. In a sense, this is a hybrid between a message queue and shared memory. It is very useful when large quantities of data must be passed between threads while retaining the safety of the message queue.
The entire queue can be packaged in a single C++ class with appropriate locking and the like. The key is that the queue owns the shared storage and takes care of the locking. Producers acquire a lock for input to the queue and receive a pointer to the next available storage chunk (usually an object of some sort), populate it, and release it. The consumer will block until the next shared object has been released by the producer. It can then acquire a lock on the storage, process the data, and release it back to the pool. A suitably designed queue can perform multiple-producer/multiple-consumer operations with great efficiency. Think of Java's thread-safe java.util.concurrent.BlockingQueue semantics, but for pointers to storage.
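A simplified sketch of that hybrid, using std::unique_ptr for the ownership hand-off instead of an explicit storage pool (the pool/recycling part described above is left out for brevity; type names are illustrative):

    #include <condition_variable>
    #include <deque>
    #include <memory>
    #include <mutex>

    struct LargeObject { /* big payload */ };            // illustrative

    class PointerQueue {                                 // queue of pointers to shared storage
    public:
        void push(std::unique_ptr<LargeObject> obj) {    // producer fills the object, then hands it off
            {
                std::lock_guard<std::mutex> lock(m_);
                q_.push_back(std::move(obj));            // only the pointer moves; the payload is never copied
            }
            cv_.notify_one();
        }

        std::unique_ptr<LargeObject> pop() {             // consumer blocks until an object is available
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            std::unique_ptr<LargeObject> obj = std::move(q_.front());
            q_.pop_front();
            return obj;                                  // consumer now owns the object exclusively
        }

    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::deque<std::unique_ptr<LargeObject>> q_;
    };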
Of course there is "shared data" when you pass messages. After all, the message itself is some sort of data. However, the important distinction is when you pass a message, the consumer will receive a copy.
the PostThreadMessage call just hide the details
Yes, it does, but being a WINAPI call, you can be reasonably sure that it does it right.
I still don't get the concept of async message if the message queue operations are still blocked somewhere else.
The advantage is more safety. You have a locking mechanism that is systematically enforced when you are passing a message. You don't even need to think about it, you can't forget to lock. Given that multi-thread bugs are some of the nastiest ones (think of race conditions), this is very important. Message passing is a higher level of abstraction built on locks.
The disadvantage is that passing large amounts of data would probably be slow. In that case, you need to use shared memory.
For passing state (i.e. worker thread reporting progress to the GUI) the messages are the way to go.
It's quite simple (I'm amazed others wrote such lengthy responses!):
Using a message queue system instead of 'raw' shared data means that you have to get the synchronization (locking/unlocking of resources) right only once, in a central place.
With a message-based system, you can think in higher terms of "messages" without having to worry about synchronization issues anymore. For what it's worth, it's perfectly possible that a message queue is implemented using shared data internally.
I think this is the key piece of info there: "As much as possible, prefer to keep each thread’s data isolated (unshared), and let threads instead communicate via asynchronous messages that pass copies of data". I.e. use producer-consumer :)
You can do your own message passing or use something provided by the OS. That's an implementation detail (it needs to be done right, of course). The key is to avoid shared data, as in having the same region of memory modified by multiple threads. This can cause hard-to-find bugs, and even if the code is perfect it will eat performance because of all the locking.
I had exactly the same question. After reading the answers, I feel:
in most typical use cases, queue = async, shared memory (locks) = sync. Indeed, you can do an async version of shared memory, but that's more code, similar to reinventing the message-passing wheel.
Less code = fewer bugs and more time to focus on other stuff.
The pros and cons are already mentioned by previous answers so I will not repeat.

Critical sections better in thread or main program?

I am used to using critical sections (in C++) to block thread execution while accessing shared data, but since a thread must wait until the data is no longer in use before it can proceed, I am wondering whether it is better to use them in the main program or in the threads.
So if I want my main program to have priority and not be blocked, should I use critical sections inside it to block the other threads, or the other way around?
You seem to have rather a misconception over what critical sections are and how they work.
Speaking generically, a critical section (CS) is a piece of code that needs to run "exclusively" -- i.e., you need to ensure that only one thread is executing that piece of code at any given time.
As the term is used in most environments, a CS is really a mutex -- a mutual exclusion semaphore (aka binary semaphore). It's a data structure (and set of functions) you use to ensure that a section of code gets executed exclusively (rather than referring to the code itself).
In any case, a CS only makes sense at all when/if you have some code that will execute in more than one thread, and you need to ensure that it only ever executes in one thread at any given time. This is typically when you have some shared data that could and would be corrupted if more than one thread tried to manipulate it at one time. When/if that arises, you need to "use" the critical section for every thread that manipulates that data to assure that the shared data isn't corrupted.
Assuring that a particular thread remains responsive is a whole separate question. In most cases, this means using a queue (for one possibility) to allow the thread to "hand off" a task to some other thread quickly, with minimal contention (i.e., instead of using a CS for the duration of processing the data, the CS only lasts long enough to put a data structure into a queue, and some other thread takes the processing from there).
You cannot say "I am using critical section in thread A but not in thread B". Critical section is a piece of code that accesses shared resource. When this code is executed from two threads that run in parallel, shared resource might get corrupted so therefore you need to synchronise access to it: you need to use some of synchronisation objects (mutexes, semaphores, events...depending on the platform and API you are using). ThreadA locks the critical section so ThreadB needs to wait till ThreadA releases it.
If you want your main thread to block (wait) less than working thread, set working thread priority to be lower than priority of the main thread.

SetThreadAffinityMask of pooled thread

I am wondering whether it is possible to set the processor affinity of a thread obtained from a thread pool. More specifically, the thread is obtained through the TimerQueue API, which I use to implement periodic tasks.
As a side note: I found TimerQueues the easiest way to implement periodic tasks, but since these are usually long-living tasks, might it be more appropriate to use dedicated threads for this purpose? Furthermore, it is anticipated that synchronization primitives such as semaphores and mutexes will need to be used to synchronize the various periodic tasks. Are pooled threads suitable for these?
Thanks!
EDIT1: As Leo has pointed out, the above question is actually two only loosely related ones. The first one concerns the processor affinity of pooled threads. The second concerns whether pooled threads obtained from the TimerQueue API behave just like manually created threads when it comes to synchronization objects. I will move this second question to a separate topic.
If you do this, make sure you return things to how they were every time you release a thread back to the pool, since you don't own those threads and other code which uses them may have other requirements/assumptions.
Are you sure you actually need to do this, though? It's very, very rare to need to set processor affinity. (I don't think I've ever needed to do it in anything I've written.)
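If you do decide to go ahead, SetThreadAffinityMask returns the previous affinity mask, which makes the save-and-restore straightforward. A rough sketch inside a timer-queue callback; DoPeriodicWork and the chosen mask are placeholders:

    #include <windows.h>

    void DoPeriodicWork();                               // placeholder for the actual task

    VOID CALLBACK TimerCallback(PVOID /*param*/, BOOLEAN /*timerOrWaitFired*/) {
        const DWORD_PTR desiredMask = 1;                 // pin to CPU 0 (illustrative)
        DWORD_PTR previousMask =
            SetThreadAffinityMask(GetCurrentThread(), desiredMask);

        DoPeriodicWork();

        if (previousMask != 0) {                         // 0 means the call failed
            // Put the pooled thread back the way we found it before returning it.
            SetThreadAffinityMask(GetCurrentThread(), previousMask);
        }
    }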
Thread affinity can mean two quite different things. (Thanks to bk1e's comment to my original answer for pointing this out. I hadn't realised myself.)
What I would call processor affinity: Where a thread needs to be run consistently on the same processor. This is what SetThreadAffinityMask deals with, and it's very rare for code to care about it. (Usually it's due to very low-level issues like CPU caching in high-performance code. Usually the OS will do its best to keep threads on the same CPU, and it's usually counterproductive to force it to do otherwise.)
What I would call thread affinity: Where objects use thread-local storage (or some other state tied to the thread they're accessed from) and will go wrong if a sequence of actions is not done on the same thread.
From your question it sounds like you may be confusing #1 with #2. The thread itself will not change while your callback is running. While a thread is running it may jump between CPUs but that is normal and not something you have to worry about (except in very special cases).
Mutexes, semaphores, etc. do not care if a thread jumps between CPUs.
If your callback is executed by the thread pool multiple times, there is (depending on how the pool is used) usually no guarantee that the same thread will be used each time. i.e. Your callback may jump between threads, but not while it is in the middle of running; it may only change threads each time it runs again.
Some synchronization objects will care if your callback code runs on one thread and then, still thinking it holds locks on those objects, runs again on a different thread. (The first thread will still hold the locks, not the second one, although it depends on which kind of synchronization object you use. Some don't care.) That isn't #1, though; that's #2, and not something you'd use SetThreadAffinityMask to deal with.
As an example, Mutexes (CreateMutex) are owned by a thread. If you acquire a mutex on Thread A then any other thread which tries to acquire the mutex will block until you release the mutex on Thread A. (It is also an error for a thread to release a mutex it does not own.) So if your callback acquired a mutex, then exited, then ran again on another thread and released the mutex from there, it would be wrong.
On the other hand, an Event (CreateEvent) does not care which threads create, signal or destroy it. You can signal an event on one thread and then reset it on another and that's fine (normal, in fact).
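A small WinAPI sketch of that difference (handle creation and error handling omitted):

    #include <windows.h>

    // A mutex is owned by the thread that acquired it: the wait and the release
    // must happen on the same thread.
    void UseMutex(HANDLE hMutex) {
        if (WaitForSingleObject(hMutex, INFINITE) == WAIT_OBJECT_0) {
            // ... touch the protected data ...
            ReleaseMutex(hMutex);            // must be called on this same thread
        }
    }

    // An event has no owner: any thread may signal or reset it.
    void SignalFromAnyThread(HANDLE hEvent)   { SetEvent(hEvent); }
    void ResetFromAnotherThread(HANDLE hEvent) { ResetEvent(hEvent); }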
It'd also be rare to hold a synchronization object between two separate runs of your callback (that would invite deadlocks, although there are certainly situations where you could legitimately want to do such a thing). However, if you created (for example) an apartment-threaded COM object, then that would be something you would want to access only from one specific thread.
You shouldn't. You're only supposed to use that thread for the job at hand, on whatever processor it happens to be running on at that point. Apart from the obvious inefficiency, the thread pool might destroy every thread as soon as you're done and create a new one for your next job. The affinity masks wouldn't disappear that soon in practice, but it's even harder to debug if they disappear at random.