Strandify inter coorporating objects for multithread support - c++

My current application owns multiple «activatable» objects*. My intent is to "run" all those object in the same io_context and to add the necessary protection in order to toggle from single to multiple threads (to make it scalable)
If these objects were completely independent from each others, the number of threads running the associated io_context could grow smoothly. But since those objects need to cooperate, the application crashes in multithread despite the strand in each object.
Let's say we have objects of type A and type B, all of them served by the same io_context. Each of those types run asynchronous operations (timers and sockets - their handlers are surrounded with bind_executor(strand, handler)), and can build a cache based on information received via sockets and posted operations to them. Objects of type A needs to get information cached from multiple instances of B in order to perform their own work.
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
If not, what strategy could be adopted to achieve the scalability?
I already tried playing with futures but that strategy leads unsurprisingly to deadlocks.
Thanx
(*) Maybe I'm wrong in the terminology: objects get a reference to an io_context and own their own strand, so I think they are activatable, because they don't own really a running thread

You're mixing vague words a bit. "Activatable", "Strandify", "inter coorporating". They're all close to meaningful concepts, yet, narrowly avoid binding to any precise meaning.
Deconstructing
Let's simplify using more precise concepts.
Let's say we have objects of type A and type B, all of them served by the same io_context
I think it's more fruitful to say "types A and B have associated executors". When you make sure all operations on A and B operate from that executor and you make sure that executor serializes access, then you basically get the Active Object pattern.
[can build a cache based on information received via sockets] and posted operations to them
That's interesting. I take that to mean you don't directly call members of the class, unless they defer the actual execution to the strand. This, again, would be the Active Object.
However, your symptoms suggest that not all operations are "posted to them". Which implies they run on arbitrary threads, leading to your problem.
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
The key to your problems is here. Data dependencies. It's also, ;ole;y going to limit the usefulness of scaling, unless of course the generation of information to retrieve from other threads is a computationally expensive operation.
However, in the light of the phrase _"to get information cached from multiple instances of B'" suggests that in fact, the data is instantaneous, and you'll just be paying synchronization costs for accessing across threads.
Questions
Q. Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
Technically, yes. By making sure all operations go on the strand, and the objects become true active objects.
However, there's an important caveat: strands aren't zero-cost. Only in certain contexts they can be optimized (e.g. in immediate continuations or when the execution context has no concurrency).
But in all other contexts, they end up synchronizing at similar cost as mutexes. The purpose of a strand is not to remove the lock contention. Instead it rather allows one to declaratively specify the synchronization requirements for tasks, so that so that the same code can be correctly synchronized regardless of the methods of async completion (using callbacks, futures, coroutines, awaitables, etc) or the chosen execution context(s).
Example: I recently uncovered a vivid illustration of the cost of strand synchronization even in a simple context (where serial execution was already implicitly guaranteed) here:
sehe mar 15, 23:08 Oh cool. The strands were unnecessary. I add them for safety until I know it's safe to go without. In this case the async call chains form logical strands (there are no timers or full duplex sockets going on, so it's all linear). That... improves the situation :)
Now it's 3.5gbps even with the 1024 byte server buffer
The throughput increased ~7x from just removing the strand.
Q. If not, what strategy could be adopted to achieve the scalability?
I suspect you really want caches that contain shared_futures. So that the first retrieval puts the future for the result in cache, where subsequent retrievals get the already existing shared future immediately.
If you make sure your cache lookup datastructure is threadsafe, likely with a reader/writer lock (shared_mutex), you will be free to access it with minimal overhead from any actor, instead of requiring to go through individual strands of each producer.
Keep in mind that awaiting futures is a blocking operation. So, if you do that from tasks posted on the execution context, you may easily run out of threads. In such cases it maybe better to provide async_get in terms of boost::asio::async_result or boost::asio::async_completion so you can wait in non-blocking fashion.

Related

asio, shared data, Active Object vs mutexes

I want to understand what is true-asio way to use shared data?
reading the asio and the beast examples, the only example of using shared data is http_crawl.cpp. (perhaps I missed something)
in that example the shared object is only used to collect statistics for sessions, that is the sessions do not read that object's data.
as a result I have three questions:
Is it implied that interaction with shared data in asio-style is an Active Object? i.e. should mutexes be avoided?
whether the statement will be correct that for reading the shared data it is also necessary to use "requests" to Active Object, and also no mutexes?
has anyone tried to evaluate the overhead of "requests" to Active Object, compared to using mutexes?
Is it implied that interaction with shared data in asio-style is an Active Object? i.e. should mutexes be avoided?
Starting at the end, yes mutexes should be avoided. This is because all service handlers (initiations and completions) will be executed on the service thread(s) which means that blocking in a handler will block all other handlers.
Whether that leads to Active Object seems to be a choice to me. Yes, a typical approach would be like Active Object (see e.g. boost::asio and Active Object), where operations queue for the data.
However, other approaches are viable and frequently seen, like e.g. the data being moving with their task(s) e.g. through a task flow.
whether the statement will be correct that for reading the shared data it is also necessary to use "requests" to Active Object, and also no mutexes?
Yes, synchronization needs to happen for shared state, regardless of the design pattern chosen (although some design pattern reduce sharing alltogether).
The Asio approach is using strands, which abstract away the scheduling from the control flow. This gives the service the option to optimize for various cases (e.g. continuation on the same strand, the case where there's only one service thread anyway etc.).
has anyone tried to evaluate the overhead of "requests" to Active Object, compared to using mutexes?
Lots of people and lots of times. Often are wary of trying Asio because "it uses locking internally". If you know what you're doing, throughput can be excellent, which goes for most patterns and industrial-strength frameworks.
Specific benchmarks depend heavily on specific implementation choices. I'm pretty sure you can find examples on github, blogs and perhaps even on this site.
(perhaps I missed something)
You're missing the fact that all IO objects are not thread-safe, which means that they themselves are shared data for any composed asynchronous operation (chain)

Cancelling arbitary jobs running in a thread_pool

Is there a way for a thread-pool to cancel a task underway? Better yet, is there a safe alternative for on-demand cancelling opaque function calls in thread_pools?
Killing the entire process is a bad idea and using native handle to perform pthread_cancel or similar API is a last resort only.
Extra
Bonus if the cancellation is immediate, but it's acceptable if the cancellation has some time constraint 'guarantees' (say cancellation within 0.1 execution seconds of the thread in question for example)
More details
I am not restricted to using Boost.Thread.thread_pool or any specific library. The only limitation is compatibility with C++14, and ability to work on at least BSD and Linux based OS.
The tasks are usually data-processing related, pre-compiled and loaded dynamically using C-API (extern "C") and thus are opaque entities. The aim is to perform compute intensive tasks with an option to cancel them when the user sends interrupts.
While launching, the thread_id for a specific task is known, and thus some API can be sued to find more details if required.
Disclaimer
I know using native thread handles to cancel/exit threads is not recommended and is a sign of bad design. I also can't modify the functions using boost::this_thread::interrupt_point, but can wrap them in lambdas/other constructs if that helps. I feel like this is a rock and hard place situation, so alternate suggestions are welcome, but they need to be minimally intrusive in existing functionality, and can be dramatic in their scope for the feature-set being discussed.
EDIT:
Clarification
I guess this should have gone in the 'More Details' section, but I want it to remain separate to show that existing 2 answers are based o limited information. After reading the answers, I went back to the drawing board and came up with the following "constraints" since the question I posed was overly generic. If I should post a new question, please let me know.
My interface promises a "const" input (functional programming style non-mutable input) by using mutexes/copy-by-value as needed and passing by const& (and expecting thread to behave well).
I also mis-used the term "arbitrary" since the jobs aren't arbitrary (empirically speaking) and have the following constraints:
some which download from "internet" already use a "condition variable"
not violate const correctness
can spawn other threads, but they must not outlast the parent
can use mutex, but those can't exist outside the function body
output is via atomic<shared_ptr> passed as argument
pure functions (no shared state with outside) **
** can be lambda binding a functor, in which case the function needs to makes sure it's data structures aren't corrupted (which is the case as usually, the state is a 1 or 2 atomic<inbuilt-type>). Usually the internal state is queried from an external db (similar architecture like cookie + web-server, and the tab/browser can be closed anytime)
These constraints aren't written down as a contract or anything, but rather I generalized based on the "modules" currently in use. The jobs are arbitrary in terms of what they can do: GPU/CPU/internet all are fair play.
It is infeasible to insert a periodic check because of heavy library usage. The libraries (not owned by us) haven't been designed to periodically check a condition variable since it'd incur a performance penalty for the general case and rewriting the libraries is not possible.
Is there a way for a thread-pool to cancel a task underway?
Not at that level of generality, no, and also not if the task running in the thread is implemented natively and arbitrarily in C or C++. You cannot terminate a running task prior to its completion without terminating its whole thread, except with the cooperation of the task.
Better
yet, is there a safe alternative for on-demand cancelling opaque
function calls in thread_pools?
No. The only way to get (approximately) on-demand preemption of a specific thread is to deliver a signal to it (that is is not blocking or ignoring) via pthread_kill(). If such a signal terminates the thread but not the whole process then it does not automatically make any provision for freeing allocated objects or managing the state of mutexes or other synchronization objects. If the signal does not terminate the thread then the interruption can produce surprising and unwanted effects in code not designed to accommodate such signal usage.
Killing the entire process is a bad idea and using native handle to
perform pthread_cancel or similar API is a last resort only.
Note that pthread_cancel() can be blocked by the thread, and that even when not blocked, its effects may be deferred indefinitely. When the effects do occur, they do not necessarily include memory or synchronization-object cleanup. You need the thread to cooperate with its own cancellation to achieve these.
Just what a thread's cooperation with cancellation looks like depends in part on the details of the cancellation mechanism you choose.
Cancelling a non cooperative, not designed to be cancelled component is only possible if that component has limited, constrained, managed interactions with the rest of the system:
the ressources owned by the components should be managed externally (the system knows which component uses what resources)
all accesses should be indirect
the modifications of shared ressources should be safe and reversible until completion
That would allow the system to clean up resource, stop operations, cancel incomplete changes...
None of these properties are cheap; all the properties of threads are the exact opposite of these properties.
Threads only have an implied concept of ownership apparent in the running thread: for a deleted thread, determining what was owned by the thread is not possible.
Threads access shared objects directly. A thread can start modifications of shared objects; after cancellation, such modifications that would be partial, non effective, incoherent if stopped in the middle of an operation.
Cancelled threads could leave locked mutexes around. At least subsequent accesses to these mutexes by other threads trying to access the shared object would deadlock.
Or they might find some data structure in a bad state.
Providing safe cancellation for arbitrary non cooperative threads is not doable even with very large scale changes to thread synchronization objects. Not even by a complete redesign of the thread primitives.
You would have to make thread almost like full processes to be able to do that; but it wouldn't be called a thread then!

Controlled application shut-down strategy

Our (Windows native C++) app is composed of threaded objects and managers. It is pretty well written, with a design that sees Manager objects controlling the lifecycle of their minions. Various objects dispatch and receive events; some events come from Windows, some are home-grown.
In general, we have to be very aware of thread interoperability so we use hand-rolled synchronization techniques using Win32 critical sections, semaphores and the like. However, occasionally we suffer thread deadlock during shut-down due to things like event handler re-entrancy.
Now I wonder if there is a decent app shut-down strategy we could implement to make this easier to develop for - something like every object registering for a shutdown event from a central controller and changing its execution behaviour accordingly? Is this too naive or brittle?
I would prefer strategies that don't stipulate rewriting the entire app to use Microsoft's Parallel Patterns Library or similar. ;-)
Thanks.
EDIT:
I guess I am asking for an approach to controlling object life cycles in a complex app where many threads and events are firing all the time. Giovanni's suggestion is the obvious one (hand-roll our own), but I am convinced there must be various off-the-shelf strategies or frameworks, for cleanly shutting down active objects in the correct order. For example, if you want to base your C++ app on an IoC paradigm you might use PocoCapsule instead of trying to develop your own container. Is there something similar for controlling object lifecycles in an app?
This seems like a special case of the more general question, "how do I avoid deadlocks in my multithreaded application?"
And the answer to that is, as always: make sure that any time your threads have to acquire more than one lock at a time, that they all acquire the locks in the same order, and make sure all threads release their locks in a finite amount of time. This rule applies just as much at shutdown as at any other time. Nothing less is good enough; nothing more is necessary. (See here for a relevant discussion)
As for how to best do this... the best way (if possible) is to simplify your program as much as you can, and avoid holding more than one lock at a time if you can possibly help it.
If you absolutely must hold more than one lock at a time, you must verify your program to be sure that every thread that holds multiple locks locks them in the same order. Programs like helgrind or Intel thread checker can help with this, but it often comes down to simply eyeballing the code until you've proved to yourself that it satisfies this constraint. Also, if you are able to reproduce the deadlocks easily, you can examine (using a debugger) the stack trace of each deadlocked thread, which will show where the deadlocked threads are forever-blocked at, and with that information, you can that start to figure out where the lock-ordering inconsistencies are in your code. Yes, it's a major pain, but I don't think there is any good way around it (other than avoiding holding multiple locks at once). :(
One possible general strategy would be to send an "I am shutting down" event to every manager, which would cause the managers to do one of three things (depending on how long running your event-handlers are, and how much latency you want between the user initiating shutdown, and the app actually exiting).
1) Stop accepting new events, and run the handlers for all events received before the "I am shutting down" event. To avoid deadlocks you may need to accept events that are critical to the completion of other event handlers. These could be signaled by a flag in the event or the type of the event (for example). If you have such events then you should also consider restructuring your code so that those actions are not performed through event handlers (as dependent events would be prone to deadlocks in ordinary operation too.)
2) Stop accepting new events, and discard all events that were received after the event that the handler is currently running. Similar comments about dependent events apply in this case too.
3) Interrupt the currently running event (with a function similar to boost::thread::interrupt()), and run no further events. This requires your handler code to be exception safe (which it should already be, if you care about resource leaks), and to enter interruption points at fairly regular intervals, but it leads to the minimum latency.
Of course you could mix these three strategies together, depending on the particular latency and data corruption requirements of each of your managers.
As a general method, use an atomic boolean to indicate "i am shutting down", then every thread checks this boolean before acquiring each lock, handling each event etc. Can't give a more detailed answer unless you give us a more detailed question.

Is checking current thread inside a function ok?

Is it ok to check the current thread inside a function?
For example if some non-thread safe data structure is only altered by one thread, and there is a function which is called by multiple threads, it would be useful to have separate code paths depending on the current thread. If the current thread is the one that alters the data structure, it is ok to alter the data structure directly in the function. However, if the current thread is some other thread, the actual altering would have to be delayed, so that it is performed when it is safe to perform the operation.
Or, would it be better to use some boolean which is given as a parameter to the function to separate the different code paths?
Or do something totally different?
What do you think?
You are not making all too much sense. You said a non-thread safe data structure is only ever altered by one thread, but in the next sentence you talk about delaying any changes made to that data structure by other threads. Make up your mind.
In general, I'd suggest wrapping the access to the data structure up with a critical section, or mutex.
It's possible to use such animals as reader/writer locks to differentiate between readers and writers of datastructures but the performance advantage for typical cases usually wont merit the additional complexity associated with their use.
From the way your question is stated, I'm guessing you're fairly new to multithreaded development. I highly suggest sticking with the simplist and most commonly used approaches for ensuring data integrity (most books/articles you readon the issue will mention the same uses for mutexes/critical sections). Multithreaded development is extremely easy to get wrong and can be difficult to debug. Also, what seems like the "optimal" solution very often doesn't buy you the huge performance benefit you might think. It's usually best to implement the simplist approach that will work then worry about optimizing it after the fact.
There is a trick that could work in case, as you said, the other threads will only make changes only once in a while, although it is still rather hackish:
make sure your "master" thread can't be interrupted by the other ones (higher priority, non fair scheduling)
check your thread
if "master", just change
if other, put off scheduling, if needed by putting off interrupts, make change, reinstall scheduling
really test to see whether there are no issues in your setup.
As you can see, if requirements change a little bit, this could turn out worse than using normal locks.
As mentioned, the simplest solution when two threads need access to the same data is to use some synchronization mechanism (i.e. critical section or mutex).
If you already have synchronization in your design try to reuse it (if possible) instead of adding more. For example, if the main thread receives its work from a synchronized queue you might be able to have thread 2 queue the data structure update. The main thread will pick up the request and can update it without additional synchronization.
The queuing concept can be hidden from the rest of the design through the Active Object pattern. The activ object may also be able to publish the data structure changes through the Observer pattern to other interested threads.

Threads and simple Dead lock cure

When dealing with threads (specifically in C++) using mutex locks and semaphores is there a simple rule of thumb to avoid Dead Locks and have nice clean Synchronization?
A good simple rule of thumb is to always obtain your locks in a consistent predictable order from everywhere in your application. For example, if your resources have names, always lock them in alphabetical order. If they have numeric ids, always lock from lowest to highest. The exact order or criteria is arbitrary. The key is to be consistent. That way you'll never have a deadlock situation. eg.
Thread 1 locks resource A
Thread 2 locks resource B
Thread 1 waits to obtain a lock on B
Thread 2 waits to obtain a lock on A
Deadlock
The above can never happen if you follow the rule of thumb outlined above. For a more detailed discussion, see the Wikipedia entry on the Dining Philosophers problem.
If at all possible, design your code so that you never have to lock more then a single mutex/semaphore at a time.
If that's not possible, make sure to always lock multiple mutex/semaphores in the same order. So if one part of the code locks mutex A and then takes semaphore B, make sure that no other part of the code takes semaphore B and then locks mutex A.
Try to avoid acquiring one lock and trying to acquire another. This can result into circular dependency and cause for deadlock.
If it is un-avoidable then at least the order of acquire locks should be predictable.
Use RAII ( to make sure lock is release properly in case of exception as well)
There is no simple deadlock cure.
Acquire locks in agreed order: If all calls acquire A->B->C then no deadlock can occur. Deadlocks can occur only if the locking order differs between the two threads (one acquires A->B the second B->A).
In practice is hard to choose an order between arbitrary objects in memory. On a simple trivial project is possible, but on large projects with many individual contributors is very hard. A partial solution is to create hierarchies, by ranking the locks. All locks in module A have rank 1, all locks in module B have rank 2. One can acquire a lock of rank 2 when helding locks of rank 1, but not vice-versa. Of course you need a framework around the locking primitives that tracks and validates the ranking.
One way to ensure the ordering that other folks have talked about is to acquire locks in an order defined by their memory address. If at any point, you try to acquire a lock that should have been earlier in the sequence, you release all the locks and start over.
With a little work, it's possible to do this nearly automatically with some wrapper classes around the system primitives.
There's no practical cure. Specifically, there's no way to simply test code for being synchronizationally correct, or to have your programmers obey the rules of the gentleman with the green V.
There's no way to properly test the multithreaded code, because the program logic may depend on timing of locks acquisition, and therefore, be different from execution to execution, somehow invalidating the concept of QA.
I would say
prefer using threads only as a performance optimization for multi-core machines
only optimize performance when you are sure you need this performance
you may use threads to simplify program logic, but only when you are absolutely sure what you are doing. Be extra careful and all locks are confined to a very small piece of code. Do not let any newbies near such code.
never use threads in a mission-critical system, such as flying an aircraft or operating dangerous machinery
in all cases, threads are seldom cost-effective, due to higher debug and QA costs
If you determined to do threads or maintaining existing codebase:
confine all locks to small and simple pieces of code, which operate on primitives
avoid function calls or getting the program flow away to where the fact of being executed under lock is not immediately visible. This function will change by future authors, widening your lock span without your control.
get locks inside objects to reduce locking scope, wrap non-thread-safe 3rd-party objects with your own thread-safe interfaces.
never send synchronous notifications (callbacks) when executing under lock
use only RAII locks, to reduce the cognitive load when thinking "how else can we exit from here", as in exceptions, etc.
A few words on how to avoid multi-threading.
A single-threaded design usually involves some heart-beat function provided by program components, and called in a loop (called heartbeat cycle) which, when called, gives a chance to all components to do the next piece of work and to surrender control back again. What algorithmists like to think of as "loops" inside the components, will turn into state machines, to identify what is the next thing that should be done when called. State is best maintained as member data of respective objects.
There are plenty of simple "deadlock cures". But none that are easy to apply and work universally.
The simplest of all, of course, is "never have more than one thread".
Assuming you have a multithreaded application though, there are still a number of solutions:
You can try to minimize shared state and synchronization. Two threads that just run in parallel and never interact can never deadlock. Deadlocks only occur when multiple threads try to access the same resource. Why do they do that? Can that be avoided? Can the resource be restructured or divided so that for example, one thread can write to it, and other threads are asynchronously passed the data they need?
Perhaps the resource can be copied, giving each thread its own private copy to work with?
And as already mentioned by every other answer, if and when you try to acquire locks, do so in a global consistent order. To simplify this, you should try to ensure that all the locks a thread is going to need are acquired as a single operation. If a thread needs to acquire locks A, B and C, it should not make three lock() calls at different times and from different places. You'll get confused, and you won't be able to keep track of which locks are held by the thread, and which ones it has yet to acquire, and then you'll mess up the order. If you can acquire all the lock you need once, then you can factor it out into a separate function call which acquires N locks, and does so in the correct order to avoid deadlocks.
Then there are the more ambitious approaches: Techniques like CSP make threading extremely simple and easy to prove correct, even with thousands of concurrent threads. But it requires you to structure your program very differently from what you're used to.
Transactional Memory is another promising option, and one that may be easier to integrate into conventional programs. But production-quality implementations are still very rare.
Read Deadlock: the Problem and a Solution.
"The common advice for avoiding deadlock is to always lock the two mutexes in the same order: if you always lock mutex A before mutex B, then you'll never deadlock. Sometimes this is straightforward, as the mutexes are serving different purposes, but other times it is not so simple, such as when the mutexes are each protecting a separate instance of the same class".
If you want to attack the possibility of a deadlock you must attack one of the 4 crucial conditions for the existence of a deadlock.
The 4 conditions for a deadlock are:
1. Mutual Exclusion - only one thread can enter the critical section at a time.
2. Hold and Wait - a thread doesn't release the resources he acquired as long as he didn't finish his job even if other resources are un available.
3. No preemption - A thread doesn't have a priority over other threads.
4. Resource Cycle - There has to be a cycle chain of threads that waits for resources from other threads.
The easiest condition to attack is the resource cycle by making sure that no cycles are possible.