What's *Deterministic concurrency*?

I heard that there are 3 kinds of concurrency:
Deterministic concurrency
Message-passing concurrency
Shared-state concurrency
I know #2 (=actor model) and #3 (=general threading), but not #1. What's that?

Deterministic concurrency is a concurrent programming model such that programs written in this model have the following property: for a given set of inputs, the output values of a program are the same for any execution schedule. This means that the outputs of the program depend solely on the inputs of the program.
There are ways to ensure this property. One of them is so-called single-assignment programming, where variables don't have to be initialized, but may be assigned at most once. Reading an uninitialized variable stalls until it's assigned a value (possibly by some other thread). The Mozart programming language has support for this.
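The same behaviour can be sketched in plain C++ on top of std::promise/std::shared_future, since a promise may be fulfilled at most once and readers block until it is. This is just a sketch of the idea; the DataflowVar name is invented:

    #include <future>
    #include <iostream>
    #include <thread>

    // Sketch of a single-assignment ("dataflow") variable: reads stall
    // until exactly one writer has assigned a value.
    template <typename T>
    class DataflowVar {
    public:
        DataflowVar() : future_(promise_.get_future().share()) {}
        void assign(T value) { promise_.set_value(std::move(value)); } // 2nd call throws
        const T& read() const { return future_.get(); }                // stalls until assigned
    private:
        std::promise<T> promise_;        // must be declared before future_
        std::shared_future<T> future_;
    };

    int main() {
        DataflowVar<int> x;
        std::thread reader([&] { std::cout << x.read() + 1 << "\n"; }); // blocks
        x.assign(41);                     // wakes the reader
        reader.join();                    // always prints 42
    }

Whatever the schedule, the reader always prints 42, which is exactly the determinism property described above.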
Another way is to use ownership analysis to determine which threads 'own' different references, and to ensure that no 2 threads write to the reference at the same 'time', so there are no data races.

I haven't heard the term before, but coroutines come to mind. They don't provide "true" concurrency, in the sense that only one routine is executing at any particular moment, but they're concurrent in the sense that a group of interacting coroutines can all make progress without having to wait for each other to finish.

Related

Strandify inter coorporating objects for multithread support

My current application owns multiple «activatable» objects*. My intent is to "run" all those objects in the same io_context and to add the necessary protection in order to toggle from single to multiple threads (to make it scalable).
If these objects were completely independent from each other, the number of threads running the associated io_context could grow smoothly. But since those objects need to cooperate, the application crashes in multithread despite the strand in each object.
Let's say we have objects of type A and type B, all of them served by the same io_context. Each of those types runs asynchronous operations (timers and sockets; their handlers are wrapped in bind_executor(strand, handler)), and can build a cache based on information received via sockets and operations posted to them. Objects of type A need to get information cached in multiple instances of B in order to perform their own work.
Would it be possible to access this information by using strands (without adding explicit mutex protection), and if yes, how?
If not, what strategy could be adopted to achieve the scalability?
I already tried playing with futures but that strategy leads unsurprisingly to deadlocks.
Thanks
(*) Maybe I'm wrong in the terminology: the objects get a reference to an io_context and own their own strand, so I think they are "activatable", because they don't really own a running thread.
You're mixing vague words a bit. "Activatable", "Strandify", "inter coorporating". They're all close to meaningful concepts, yet, narrowly avoid binding to any precise meaning.
Deconstructing
Let's simplify using more precise concepts.
Let's say we have objects of type A and type B, all of them served by the same io_context
I think it's more fruitful to say "types A and B have associated executors". When you make sure all operations on A and B operate from that executor and you make sure that executor serializes access, then you basically get the Active Object pattern.
[can build a cache based on information received via sockets] and posted operations to them
That's interesting. I take that to mean you don't directly call members of the class, unless they defer the actual execution to the strand. This, again, would be the Active Object.
However, your symptoms suggest that not all operations are "posted to them". Which implies they run on arbitrary threads, leading to your problem.
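For contrast, here is roughly what "all operations posted to them" looks like with Boost.Asio. This is only a sketch; the class shape and member names are invented:

    #include <boost/asio.hpp>

    namespace asio = boost::asio;

    // Sketch of an Active Object: every public operation is posted to the
    // object's strand, so the internal state needs no mutex.
    class B {
    public:
        explicit B(asio::io_context& io) : strand_(asio::make_strand(io)) {}

        void update(int value) {              // safe to call from any thread
            asio::post(strand_, [this, value] { cached_ = value; });
        }

        template <typename Handler>
        void async_get(Handler handler) {     // result delivered on the strand
            asio::post(strand_, [this, h = std::move(handler)]() mutable { h(cached_); });
        }

    private:
        asio::strand<asio::io_context::executor_type> strand_;
        int cached_ = 0;                      // only ever touched on strand_
    };

If every member that touches cached_ has this shape, running the io_context on many threads is safe; if even one member bypasses the strand, you get exactly the crashes described.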
Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
The key to your problems is here: data dependencies. They're also likely going to limit the usefulness of scaling, unless of course the generation of the information to be retrieved from other threads is a computationally expensive operation.
However, the phrase "to get information cached from multiple instances of B" suggests that, in fact, the data is available instantaneously, and you'll just be paying synchronization costs for accessing it across threads.
Questions
Q. Would it be possible to access this information by using strands (without adding explicit mutex protection) and if yes how ?
Technically, yes: by making sure all operations go on the strand, the objects become true active objects.
However, there's an important caveat: strands aren't zero-cost. Only in certain contexts can they be optimized away (e.g. for immediate continuations, or when the execution context has no concurrency).
In all other contexts, they end up synchronizing at a cost similar to mutexes. The purpose of a strand is not to remove lock contention; rather, it allows one to declaratively specify the synchronization requirements for tasks, so that the same code can be correctly synchronized regardless of the method of async completion (callbacks, futures, coroutines, awaitables, etc.) or the chosen execution context(s).
Example: I recently uncovered a vivid illustration of the cost of strand synchronization even in a simple context (where serial execution was already implicitly guaranteed) here:
sehe mar 15, 23:08 Oh cool. The strands were unnecessary. I add them for safety until I know it's safe to go without. In this case the async call chains form logical strands (there are no timers or full duplex sockets going on, so it's all linear). That... improves the situation :)
Now it's 3.5gbps even with the 1024 byte server buffer
The throughput increased ~7x from just removing the strand.
Q. If not, what strategy could be adopted to achieve the scalability?
I suspect you really want caches that contain shared_futures, so that the first retrieval puts the future for the result in the cache, and subsequent retrievals get the already existing shared future immediately.
If you make sure your cache lookup datastructure is threadsafe, likely with a reader/writer lock (shared_mutex), you will be free to access it with minimal overhead from any actor, instead of having to go through the individual strand of each producer.
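A sketch of that shape (std::shared_mutex is C++17; boost::shared_mutex works the same, and produce() stands in for whatever actually computes the cached information):

    #include <future>
    #include <map>
    #include <shared_mutex>
    #include <string>

    // Sketch: thread-safe cache of shared_futures. The first caller launches
    // the computation; later callers immediately get the same shared_future.
    class Cache {
    public:
        std::shared_future<std::string> get(const std::string& key) {
            {   // fast path: shared (reader) lock
                std::shared_lock<std::shared_mutex> lock(mutex_);
                auto it = entries_.find(key);
                if (it != entries_.end()) return it->second;
            }
            std::unique_lock<std::shared_mutex> lock(mutex_);  // writer lock
            auto it = entries_.find(key);
            if (it == entries_.end())
                it = entries_.emplace(key,
                    std::async(std::launch::async, [key] { return produce(key); })
                        .share()).first;
            return it->second;
        }
    private:
        static std::string produce(const std::string& key) {   // hypothetical producer
            return "value-for-" + key;
        }
        std::shared_mutex mutex_;
        std::map<std::string, std::shared_future<std::string>> entries_;
    };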
Keep in mind that awaiting futures is a blocking operation. So, if you do that from tasks posted on the execution context, you may easily run out of threads. In such cases it may be better to provide an async_get in terms of boost::asio::async_result or boost::asio::async_completion, so you can await in non-blocking fashion.

Cancelling arbitrary jobs running in a thread_pool

Is there a way for a thread-pool to cancel a task underway? Better yet, is there a safe alternative for on-demand cancelling opaque function calls in thread_pools?
Killing the entire process is a bad idea and using native handle to perform pthread_cancel or similar API is a last resort only.
Extra
Bonus if the cancellation is immediate, but it's acceptable if the cancellation comes with some timing 'guarantees' (say, cancellation within 0.1 execution seconds of the thread in question).
More details
I am not restricted to using Boost.Thread's thread_pool or any specific library. The only limitations are compatibility with C++14 and the ability to work on at least BSD- and Linux-based OSes.
The tasks are usually data-processing related, pre-compiled and loaded dynamically using C-API (extern "C") and thus are opaque entities. The aim is to perform compute intensive tasks with an option to cancel them when the user sends interrupts.
While launching, the thread_id for a specific task is known, and thus some API can be used to find more details if required.
Disclaimer
I know using native thread handles to cancel/exit threads is not recommended and is a sign of bad design. I also can't modify the functions to add boost::this_thread::interruption_point(), but I can wrap them in lambdas/other constructs if that helps. I feel like this is a rock-and-a-hard-place situation, so alternate suggestions are welcome, but they need to be minimally intrusive on existing functionality, and they can be dramatic in their scope for the feature-set being discussed.
EDIT:
Clarification
I guess this should have gone in the 'More details' section, but I want it to remain separate to show that the existing 2 answers are based on limited information. After reading the answers, I went back to the drawing board and came up with the following "constraints", since the question I posed was overly generic. If I should post a new question, please let me know.
My interface promises a "const" input (functional programming style non-mutable input) by using mutexes/copy-by-value as needed and passing by const& (and expecting thread to behave well).
I also misused the term "arbitrary", since the jobs aren't arbitrary (empirically speaking) and have the following constraints:
some which download from "internet" already use a "condition variable"
not violate const correctness
can spawn other threads, but they must not outlast the parent
can use mutex, but those can't exist outside the function body
output is via atomic<shared_ptr> passed as argument
pure functions (no shared state with outside) **
** can be a lambda binding a functor, in which case the function needs to make sure its data structures aren't corrupted (which is the case, since usually the state is 1 or 2 atomic<built-in type>). Usually the internal state is queried from an external db (an architecture similar to cookie + web-server, where the tab/browser can be closed anytime).
These constraints aren't written down as a contract or anything; rather, I generalized based on the "modules" currently in use. The jobs are arbitrary in terms of what they can do: GPU/CPU/internet are all fair play.
It is infeasible to insert a periodic check because of heavy library usage. The libraries (not owned by us) haven't been designed to periodically check a condition variable, since that would incur a performance penalty in the general case, and rewriting the libraries is not possible.
Is there a way for a thread-pool to cancel a task underway?
Not at that level of generality, no, and also not if the task running in the thread is implemented natively and arbitrarily in C or C++. You cannot terminate a running task prior to its completion without terminating its whole thread, except with the cooperation of the task.
Better yet, is there a safe alternative for on-demand cancelling opaque function calls in thread_pools?
No. The only way to get (approximately) on-demand preemption of a specific thread is to deliver a signal to it (one that it is not blocking or ignoring) via pthread_kill(). If such a signal terminates the thread but not the whole process, then it does not automatically make any provision for freeing allocated objects or managing the state of mutexes or other synchronization objects. If the signal does not terminate the thread, then the interruption can produce surprising and unwanted effects in code not designed to accommodate such signal usage.
Killing the entire process is a bad idea and using native handle to perform pthread_cancel or similar API is a last resort only.
Note that pthread_cancel() can be blocked by the thread, and that even when not blocked, its effects may be deferred indefinitely. When the effects do occur, they do not necessarily include memory or synchronization-object cleanup. You need the thread to cooperate with its own cancellation to achieve these.
Just what a thread's cooperation with cancellation looks like depends in part on the details of the cancellation mechanism you choose.
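In the common flag-polling form, that cooperation might look like this C++14 sketch (names invented); note that this is precisely what cannot be retrofitted onto opaque library calls:

    #include <atomic>
    #include <chrono>
    #include <thread>

    // Sketch of cooperative cancellation: the task polls a shared flag at
    // points where abandoning the work is safe.
    struct CancelToken {
        std::atomic<bool> requested{false};
    };

    void long_task(CancelToken& token) {
        for (int chunk = 0; chunk < 1000; ++chunk) {
            if (token.requested.load(std::memory_order_relaxed))
                return;                   // safe point: release resources, bail out
            // ... process one bounded chunk of work (stand-in below) ...
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    int main() {
        CancelToken token;
        std::thread worker(long_task, std::ref(token));
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        token.requested.store(true);      // request cancellation
        worker.join();                    // returns at the next safe point
    }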
Cancelling a non cooperative, not designed to be cancelled component is only possible if that component has limited, constrained, managed interactions with the rest of the system:
the resources owned by the components should be managed externally (the system knows which component uses which resources)
all accesses should be indirect
the modifications of shared resources should be safe and reversible until completion
That would allow the system to clean up resources, stop operations, and cancel incomplete changes.
None of these properties come cheap, and the properties of threads are their exact opposite.
Threads have only an implied concept of ownership, apparent only in the running thread: for a deleted thread, determining what it owned is not possible.
Threads access shared objects directly. A thread can start modifying a shared object and be cancelled in the middle of the operation, leaving the modification partial, ineffective, or incoherent.
Cancelled threads could leave locked mutexes behind; at the least, other threads subsequently trying to access the shared object through those mutexes would deadlock.
Or they might find some data structure in a bad state.
Providing safe cancellation for arbitrary, non-cooperative threads is not doable even with very large-scale changes to thread synchronization objects, not even with a complete redesign of the thread primitives.
You would have to make threads almost like full processes to be able to do that; but then they wouldn't be called threads!

Benefits of a multi-threaded program on a unicore system [duplicate]

This question already has answers here:
How can multithreading speed up an application (when threads can't run concurrently)?
My professor casually mentioned that we should write multi-threaded programs even if we are using a unicore processor; however, because of lack of time, he did not elaborate on it.
I would like to know: what are the benefits of a multi-threaded program on a unicore processor?
It won't be as significant as on a multi-core system, but it can still provide some benefits.
Mainly, the benefits you are going to get come from the context switch that can happen when the currently executing thread stalls. The executing thread may be waiting for something: a hardware resource, a branch mis-prediction, or a data transfer after a cache miss.
At that point a waiting thread can be executed to make use of this "waiting time". Of course the context switch itself takes some time, and managing threads in code, rather than straight sequential computation, adds some extra complexity to your program. And, as has been said, some applications need to be multi-threaded, so in those cases there is no escaping the context switch anyway.
Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example: the GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single-threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.
Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.
The benefits can take different forms.
One widely used example is an application with a GUI that is supposed to perform some kind of computation. With a single thread, the user would have to wait for the result before doing anything else with the application; if you run the computation in a separate thread, the user interface remains available during the computation. So a multi-threaded program can emulate a multi-tasking environment even on a unicore system. That's one of the points.
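A bare-bones sketch of that pattern with standard threads, where the print loop stands in for a real UI event loop:

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Sketch: the "GUI" loop stays responsive while a worker computes.
    int main() {
        std::atomic<bool> done{false};
        long long result = 0;

        std::thread worker([&] {
            for (long long i = 0; i < 100000000; ++i) result += i;  // heavy work
            done = true;
        });

        while (!done) {                   // stand-in for the UI event loop
            std::cout << "still responsive...\n";
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        worker.join();                    // result is safe to read after join
        std::cout << "result: " << result << "\n";
    }

Even on a single core, the scheduler interleaves the two threads, so the "UI" keeps printing while the computation runs.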
As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. "Future" and boost::asio provide ways of doing non-blocking stuff without necessarily resorting to multiple threads.
I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking or long-running call that you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code. For example background polling for updates, monitoring some resource (e.g. free storage), checking internet connectivity. If you keep them in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program, the sole communication with the main logic is usually a single 'event'.
In some environments you might get more processing time. This mainly depends on how your OS scheduler works, but if it allocates time per thread, then the more threads you have, the more often your app will be scheduled.
Some benefits long-term:
Where it's not hard to do, you benefit when your hardware evolves. You never know what's going to happen: today your app runs on a single-core embedded device; tomorrow that embedded device gets a quad core. Programming with threads from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash, so that all related operations end up in the same thread. The advantage on a single core is small, but it's not hard to do, since you need few synchronization primitives, so the overhead stays small.
That said, I think there are situations where it's ill-advised:
As soon as the synchronization you need with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). It might still be that multi-threading gives you a benefit once you effectively move to multiple CPUs, but the overhead is huge, both for your single core and for your programming time.
For instance, think about operations that block because of slow peripheral devices (hard disk access, etc.). While these are waiting, even a single core can do other things asynchronously.
In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for completion of IO requests (user input, network/disk IO), critical resources to become available, or any sort of asynchronously triggered events, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchronous IO, coroutines, or fibers come to mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.
As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The number of cores per processor keeps growing, while single-core speeds have largely stopped improving. People learned single-threaded programming for 50 years, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming: it affects low-level aspects like avoiding race conditions and proper synchronization, as well as high-level aspects like general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.

List of concurrency models

I'd like a large list so I can reference it for ideas. Some answers have already been enlightening.
What are some concurrency models? I've heard of message passing, where no memory is shared, and of futures, which return a placeholder right away (so the call doesn't block) and let you dereference the original function's return value later when you need it, blocking if the result isn't ready yet. I've heard of coroutines, software transactional memory, and various others.
I searched for a list or a wiki and couldn't find any good ones (many did not list the 3 I mentioned above), and many results gave me a complicated description explaining how each model works rather than what it does or how it is to be used.
What are some concurrency models and what is a simple description of what they do? One per answer.
Actor Model
I heard of message passing where there is no memory shared.
Is it about Erlang-style Actors?
Scala uses this idea in its Actors framework (so in Scala it's not part of the language, just a library), and it looks quite sexy!
In a few words: Actors are objects that have no shared data at all but can use async messages for interaction. Actors can be located on one host or on different hosts, and they use an interesting error-handling policy (when an error happens, the actor just dies).
You should read more on this in the Erlang and Scala docs; it's a really straightforward and progressive approach!
Chapters 3, 17, 17.11:
http://www.scala-lang.org/sites/default/files/linuxsoft_archives/docu/files/ScalaByExample.pdf
https://en.wikipedia.org/wiki/Actor_model
COM Threading (Concurrency) Model
Single-Threaded Apartments
Multi-Threaded Apartments
Mixed Model Development
COM objects can be used in multiple threads of a process. The terms "Single-threaded Apartment" (STA) and "Multi-threaded Apartment" (MTA) are used to create a conceptual framework for describing the relationship between objects and threads, the concurrency relationships among objects, the means by which method calls are delivered to an object, and the rules for passing interface pointers among threads. Components and their clients choose between the following two apartment models presently supported by COM:
Single-threaded Apartment model (STA): One or more threads in a process use COM and calls to COM objects are synchronized by COM. Interfaces are marshaled between threads. A degenerate case of the single-threaded apartment model, where only one thread in a given process uses COM, is called the single-threading model. Previous Microsoft information and documentation has sometimes referred to the STA model simply as the "apartment model."
Multi-threaded Apartment model (MTA): One or more threads use COM and calls to COM objects associated with the MTA are made directly by all threads associated with the MTA without any interposition of system code between caller and object. Because multiple simultaneous clients may be calling objects more or less simultaneously (simultaneously on multi-processor systems), objects must synchronize their internal state by themselves. Interfaces are not marshaled between threads. Previous Microsoft information and documentation has sometimes referred to this model as the "free-threaded model."
Both the STA model and the MTA model can be used in the same process. This is sometimes referred to as a "mixed-model" process.
Other models according to Wikipedia
There are several models of concurrent computing, which can be used to understand and analyze concurrent systems. These models include:
Actor model
Object-capability model for security
Petri nets
Process calculi such as
Ambient calculus
Calculus of Communicating Systems (CCS)
Communicating Sequential Processes (CSP)
π-calculus
Futures
A future is a place-holder for the undetermined result of a (concurrent) computation. Once the computation delivers a result, the associated future is eliminated by globally replacing it with the result value. That value may be a future on its own. Whenever a future is requested by a concurrent computation, i.e. it tries to access its value, that computation automatically synchronizes on the future by blocking until it becomes determined or failed.
There are four kinds of futures:
concurrent futures stand for the result of a concurrent computation,
lazy futures stand for the result of a computation that is only performed on request,
promised futures stand for a value that is promised to be delivered later by explicit means,
failed futures represent the result of a computation that terminated with an exception.
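All four kinds have close analogues in C++'s std::future; a sketch:

    #include <future>
    #include <iostream>
    #include <stdexcept>

    int main() {
        // Concurrent future: result of a computation running right now.
        std::future<int> concurrent = std::async(std::launch::async, [] { return 6 * 7; });

        // Lazy future: the computation runs only when the value is requested.
        std::future<int> lazy = std::async(std::launch::deferred, [] { return 1; });

        // Promised future: a value delivered later by explicit means.
        std::promise<int> promise;
        std::future<int> promised = promise.get_future();
        promise.set_value(2);

        // Failed future: the computation terminated with an exception.
        std::future<int> failed = std::async(std::launch::async,
                                             []() -> int { throw std::runtime_error("boom"); });

        std::cout << concurrent.get() + lazy.get() + promised.get() << "\n"; // blocks as needed
        try { failed.get(); } catch (const std::exception& e) { std::cout << e.what() << "\n"; }
    }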
Software transactional memory
In computer science, software transactional memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It is an alternative to lock-based synchronization. A transaction in this context is a piece of code that executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions. The idea of providing hardware support for transactions originated in a 1986 paper and patent by Tom Knight[1]. The idea was popularized by Maurice Herlihy and J. Eliot B. Moss[2]. In 1995 Nir Shavit and Dan Touitou extended this idea to software-only transactional memory (STM)[3]. STM has recently been the focus of intense research and support for practical implementations is growing.
There's also map/reduce.
The idea is to spawn many instances of a sub-problem and to combine the answers when they're done. A simple example would be matrix multiplication, where each output element is an independent dot product: you spawn a worker thread for each dot product, and when all the threads are finished you combine the results.
This is how GPUs, functional languages such as LISP/Scheme/APL, and some frameworks (Google's Map/Reduce) handle concurrency.
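A sketch of the idea in standard C++, with one task per dot product of a small matrix-vector multiply:

    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Sketch of map/reduce: "map" spawns one task per dot product,
    // "reduce" collects the finished partial results.
    int main() {
        std::vector<std::vector<int>> matrix = {{1, 2}, {3, 4}};
        std::vector<int> vec = {5, 6};

        std::vector<std::future<int>> parts;              // map phase
        for (const auto& row : matrix)
            parts.push_back(std::async(std::launch::async, [&row, &vec] {
                return std::inner_product(row.begin(), row.end(), vec.begin(), 0);
            }));

        for (auto& part : parts)                          // reduce phase
            std::cout << part.get() << "\n";              // prints 17, then 39
    }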
Coroutines
In computer science, coroutines are program components that generalize subroutines to allow multiple entry points for suspending and resuming execution at certain locations. Coroutines are well-suited for implementing more familiar program components such as cooperative tasks, iterators, infinite lists and pipes.
There's also non-blocking concurrency, built on primitives such as compare-and-swap and load-link/store-conditional instructions. For example, compare-and-swap (CAS) could be declared like so:
bool cas( int new_value, int current_value, int * location );
This operation attempts to set the value at location to the value passed in new_value, but only if the value at location is the same as current_value. It requires only one machine instruction, and it is usually how blocking concurrency primitives (mutexes/semaphores/etc.) are implemented.
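As an illustration, C++ exposes CAS as std::atomic's compare_exchange operations; this toy spinlock (a sketch, not production code) shows a blocking primitive built on top:

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Sketch: a spinlock built on compare-and-swap.
    class SpinLock {
    public:
        void lock() {
            bool expected = false;
            // CAS: set locked_ to true only if it is currently false.
            while (!locked_.compare_exchange_weak(expected, true,
                                                  std::memory_order_acquire))
                expected = false;         // CAS wrote the observed value; retry
        }
        void unlock() { locked_.store(false, std::memory_order_release); }
    private:
        std::atomic<bool> locked_{false};
    };

    int main() {
        SpinLock lock;
        int counter = 0;
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back([&] {
                for (int i = 0; i < 100000; ++i) { lock.lock(); ++counter; lock.unlock(); }
            });
        for (auto& th : threads) th.join();
        std::cout << counter << "\n";     // always 400000
    }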
IPC (including MPI and RMI)
In the wiki pages you can find that MPI (Message Passing Interface) is one instance of the general IPC technique: http://en.wikipedia.org/wiki/Inter-process_communication
Another interesting approach is the remote procedure call. For example, Java's RMI enables you to focus only on your application domain and communication patterns. It's "application level" concurrency.
http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136424.html
There are various design patterns/tools available to aid in shared-memory-model parallelization. Apart from the already-mentioned futures, one can also take advantage of:
1. Thread pool pattern - focuses on task distribution among a fixed number of threads: http://en.wikipedia.org/wiki/Thread_pool_pattern
2. Scheduler pattern - controls thread execution according to a chosen scheduling policy: http://en.wikipedia.org/wiki/Scheduler_pattern
3. Reactor pattern - embeds a single-threaded application in a parallel environment: http://en.wikipedia.org/wiki/Reactor_pattern
4. OpenMP - allows parallelizing parts of the code by means of preprocessor pragmas
Parallel Random Access Machine (PRAM) is useful for complexity/tractability issues (refer to a good book on parallel algorithms for details).
You will also find something about models here (by Blaise Barney).
How about tuple space?
A tuple space is an implementation of the associative memory paradigm for parallel/distributed computing. It provides a repository of tuples that can be accessed concurrently. As an illustrative example, consider that there are a group of processors that produce pieces of data and a group of processors that use the data. Producers post their data as tuples in the space, and the consumers then retrieve data from the space that match a certain pattern. This is also known as the blackboard metaphor. Tuple space may be thought of as a form of distributed shared memory.
LMAX's Disruptor pattern keeps data in place and ensures that only one thread (consumer or producer) owns a data item (= queue slot) at a time.

Thread communication theory

What is the common theory behind thread communication? I have some primitive idea about how it should work but something doesn't settle well with me. Is there a way of doing it with interrupts?
Really, it's just the same as any concurrency problem: you've got multiple threads of control, and it's indeterminate which statements on which threads get executed when. That means there is a large number of potential execution paths through the program, and your program must be correct under all of them.
In general, the place where trouble can occur is where state is shared among the threads (aka "lightweight processes" in the old days). That happens when there are shared memory areas.
To ensure correctness, what you need to do is ensure that these data areas get updated in a way that can't cause errors. To do this, you need to identify "critical sections" of the program, where sequential operation must be guaranteed. Those can be as little as a single instruction or line of code; if the language and architecture ensure that these are atomic, that is, can't be interrupted, then you're golden.
Otherwise, you identify that section and put some kind of guard onto it. The classic way is to use a semaphore, an atomic primitive that only allows one thread of control past at a time. Semaphores were invented by Edsger Dijkstra, and so their operations have names that come from Dutch: P and V. When you come to a P, only one thread can proceed; all other threads are queued and wait until the executing thread comes to the associated V operation.
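A sketch of P and V built on a mutex and condition variable (in practice you'd use the OS semaphore directly):

    #include <condition_variable>
    #include <mutex>

    // Sketch of a counting semaphore with Dijkstra's P (wait) and V (signal).
    class Semaphore {
    public:
        explicit Semaphore(int count) : count_(count) {}
        void P() {                        // block until a unit is available
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return count_ > 0; });
            --count_;
        }
        void V() {                        // release a unit, wake one waiter
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
            cv_.notify_one();
        }
    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        int count_;
    };

Constructed with a count of 1, P() and V() bracket a critical section and give exactly the one-thread-at-a-time behavior described.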
Because these primitives are, well, primitive, and because the Dutch names aren't very intuitive, some other larger-scale approaches have been developed.
Per Brinch Hansen invented the monitor, which is basically a data structure whose operations are guaranteed atomic; monitors can be implemented with semaphores. Monitors are pretty much what Java's synchronized statements are based on; they give an object or code block that particular behavior (only one thread can be "in" it at a time) with simpler syntax.
There are other models possible. Haskell and Erlang attack the problem by being functional languages that never allow a variable to be modified once it's created; this means they naturally don't need to worry about synchronization. Some newer languages, like Clojure, instead have a construct called "transactional memory", which basically means that when there is an assignment, you're guaranteed the assignment is atomic and reversible.
So that's it in a nutshell. To really learn about it, the best places to look are operating systems texts, like, e.g., Andy Tanenbaum's.
The two most common mechanisms for thread communication are shared state and message passing.
The most common way for threads to communicate is via some shared data structure, typically a queue. Some threads put information into the queue while others take it out. The queue must be protected by operating-system facilities such as mutexes and semaphores. Interrupts have nothing to do with it.
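The classic form of that shared structure is a condition-variable-protected queue; a minimal sketch:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    // Sketch: the mutex guards the queue; the condition variable lets
    // consumers sleep instead of busy-waiting.
    template <typename T>
    class SharedQueue {
    public:
        void put(T item) {
            { std::lock_guard<std::mutex> lock(mutex_); items_.push(std::move(item)); }
            cv_.notify_one();
        }
        T take() {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !items_.empty(); });
            T item = std::move(items_.front());
            items_.pop();
            return item;
        }
    private:
        std::mutex mutex_;
        std::condition_variable cv_;
        std::queue<T> items_;
    };

    int main() {
        SharedQueue<int> queue;
        std::thread consumer([&] {
            for (int i = 0; i < 3; ++i) std::cout << queue.take() << "\n";
        });
        for (int i = 1; i <= 3; ++i) queue.put(i * 10);
        consumer.join();
    }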
If you're really interested in a theory of thread communications, you may want to look into formalisms like the π-calculus.
To communicate between threads, you'll need to use whatever mechanism is supplied by your operating system and/or runtime. Interrupts would be unusually low level, although they might be used implicitly if your threads communicate using sockets or named pipes.
A common pattern would be to implement shared state using a shared memory block, relying on an OS-supplied synchronization primitive such as a mutex to spare you from busy-waiting when you read from the block. Remember that if you have threads at all, then you must have some kind of scheduler already (whether it's native from the OS or emulated in your language runtime). So this scheduler can provide synchronization objects and a "sleep" function without necessarily having to rely on hardware support.
Sockets, pipes, and shared memory work between processes too. Sometimes a runtime will give you a lighter-weight way of doing synchronization for threads within the same process. Shared memory is cheaper within a single process. And sometimes your runtime will also give you an atomic message-passing mechanism.