I'd like a large list so I can reference this for ideas. Some answers have already been enlightening.
What are some concurrency models? I have heard of message passing, where no memory is shared. I have heard of futures, which return an object right away (so the call doesn't block) and let you dereference the original function's return value later when you need it, blocking if the result is not ready yet. I have also heard of coroutines, software transactional memory, and various others.
I searched for a list or a wiki and couldn't find any good ones (many did not list the three I mentioned above), and many results gave me a complicated description of how each model works rather than what it does or how it is meant to be used.
What are some concurrency models and what is a simple description of what they do? One per answer.
Actor Model
I have heard of message passing, where no memory is shared.
Is it about Erlang-style Actors?
Scala uses this idea in its Actors framework (so in Scala it's not part of the language, just a library), and it looks quite elegant!
In a few words: actors are objects that share no data at all but interact through asynchronous messages. Actors can be located on the same host or on different hosts, and they use an interesting error-handling policy (when an error happens, the actor simply dies).
You should read more about this in the Erlang and Scala docs; it's a really straightforward and progressive approach!
Chapters 3, 17, 17.11:
http://www.scala-lang.org/sites/default/files/linuxsoft_archives/docu/files/ScalaByExample.pdf
https://en.wikipedia.org/wiki/Actor_model
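As a rough illustration of the idea - not Erlang or the Scala Actors API, just a minimal sketch in Python with made-up names - an actor keeps its state private, owns a mailbox, and interacts with the outside world only through asynchronous messages:

    import queue, threading

    class CounterActor:
        """A toy actor: private state, a mailbox, and a thread that processes messages."""
        def __init__(self):
            self._count = 0                      # private state, never shared
            self._mailbox = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, message):
            self._mailbox.put(message)           # asynchronous: returns immediately

        def _run(self):
            while True:
                message, reply_to = self._mailbox.get()
                if message == "increment":
                    self._count += 1
                elif message == "get":
                    reply_to.put(self._count)    # reply via another queue, still no shared state

    actor = CounterActor()
    actor.send(("increment", None))
    reply = queue.Queue()
    actor.send(("get", reply))
    print(reply.get())                           # -> 1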
COM Threading (Concurrency) Model
Single-Threaded Apartments
Multi-Threaded Apartments
Mixed Model Development
COM objects can be used in multiple threads of a process. The terms "single-threaded apartment" (STA) and "multi-threaded apartment" (MTA) are used to create a conceptual framework for describing the relationship between objects and threads, the concurrency relationships among objects, the means by which method calls are delivered to an object, and the rules for passing interface pointers among threads. Components and their clients choose between the following two apartment models presently supported by COM:
Single-threaded Apartment model (STA): One or more threads in a process use COM, and calls to COM objects are synchronized by COM. Interfaces are marshaled between threads. A degenerate case of the single-threaded apartment model, where only one thread in a given process uses COM, is called the single-threading model. Previous Microsoft information and documentation has sometimes referred to the STA model simply as the "apartment model."
Multi-threaded Apartment model (MTA): One or more threads use COM, and calls to COM objects associated with the MTA are made directly by all threads associated with the MTA, without any interposition of system code between caller and object. Because multiple simultaneous clients may be calling objects more or less simultaneously (simultaneously on multi-processor systems), objects must synchronize their internal state by themselves. Interfaces are not marshaled between threads. Previous Microsoft information and documentation has sometimes referred to this model as the "free-threaded model."
Both the STA model and the MTA model can be used in the same process. This is sometimes referred to as a "mixed-model" process.
Other models according to Wikipedia
There are several models of concurrent computing, which can be used to understand and analyze concurrent systems. These models include:
Actor model
Object-capability model for security
Petri nets
Process calculi such as
Ambient calculus
Calculus of Communicating Systems (CCS)
Communicating Sequential Processes (CSP)
π-calculus
Futures
A future is a place-holder for the undetermined result of a (concurrent) computation. Once the computation delivers a result, the associated future is eliminated by globally replacing it with the result value. That value may be a future on its own. Whenever a future is requested by a concurrent computation, i.e. it tries to access its value, that computation automatically synchronizes on the future by blocking until it becomes determined or failed.
There are four kinds of futures:
concurrent futures stand for the result of a concurrent computation,
lazy futures stand for the result of a computation that is only performed on request,
promised futures stand for a value that is promised to be delivered later by explicit means,
failed futures represent the result of a computation that terminated with an exception.
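As a concrete sketch using Python's standard concurrent.futures module (just one of many futures implementations): submitting work returns a future immediately, and requesting its value blocks until the computation is determined or failed.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def slow_square(x):
        time.sleep(1)            # stand-in for a long computation
        return x * x

    with ThreadPoolExecutor() as executor:
        future = executor.submit(slow_square, 7)   # returns immediately, does not block
        print(future.done())                       # probably False: still computing
        print(future.result())                     # blocks until the value is ready -> 49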
Software transactional memory
In computer science, software transactional memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It is an alternative to lock-based synchronization. A transaction in this context is a piece of code that executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions. The idea of providing hardware support for transactions originated in a 1986 paper and patent by Tom Knight[1]. The idea was popularized by Maurice Herlihy and J. Eliot B. Moss[2]. In 1995 Nir Shavit and Dan Touitou extended this idea to software-only transactional memory (STM)[3]. STM has recently been the focus of intense research and support for practical implementations is growing.
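Python has no built-in STM, but the mechanism can be sketched in a few lines: a transaction buffers its reads and writes, and at commit time it validates that nothing it read has changed, publishing its writes only if validation succeeds (otherwise it retries). This is only a toy illustration, not a production STM (a real one also validates during reads and handles nesting, contention management, and so on):

    import threading

    class TVar:
        """Transactional variable: a value plus a version number."""
        def __init__(self, value):
            self.value = value
            self.version = 0

    _commit_lock = threading.Lock()   # one global lock keeps the sketch simple

    def atomically(tx_fn):
        """Run tx_fn(read, write) as a transaction, retrying on conflict."""
        while True:
            read_set, write_set = {}, {}

            def read(tvar):
                if tvar in write_set:
                    return write_set[tvar]
                read_set.setdefault(tvar, tvar.version)
                return tvar.value

            def write(tvar, value):
                read_set.setdefault(tvar, tvar.version)
                write_set[tvar] = value

            result = tx_fn(read, write)

            with _commit_lock:
                # Validate: nothing we read has changed since we read it.
                if all(tvar.version == seen for tvar, seen in read_set.items()):
                    for tvar, value in write_set.items():
                        tvar.value = value
                        tvar.version += 1
                    return result
            # Conflict detected: fall through and re-run the transaction.

    # Transfer between two accounts; other transactions never see the intermediate state.
    a, b = TVar(100), TVar(0)
    atomically(lambda read, write: (write(a, read(a) - 30), write(b, read(b) + 30)))
    print(a.value, b.value)   # -> 70 30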
There's also map/reduce.
The idea is to spawn many instances of a sub-problem and combine the answers when they're done. A simple example is matrix multiplication: each entry of the result is the dot product of a row and a column. You spawn a worker thread for each dot product, and when all the threads are finished you gather the entries into the result matrix.
This is how GPUs, functional languages such as LISP/Scheme/APL, and some frameworks (Google's Map/Reduce) handle concurrency.
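A minimal sketch of this idea in Python, using a standard multiprocessing pool rather than hand-spawned threads (the tiny matrices are just illustrative): each (row, column) dot product is mapped out to the workers in parallel, and the partial results are combined afterwards.

    from multiprocessing import Pool

    def dot(pair):
        """The 'map' step: a pure, side-effect-free dot product of one (row, column) pair."""
        row, col = pair
        return sum(r * c for r, c in zip(row, col))

    if __name__ == "__main__":
        rows = [(1, 2), (3, 4)]           # rows of the left matrix
        cols = [(5, 7), (6, 8)]           # columns of the right matrix
        pairs = [(r, c) for r in rows for c in cols]

        with Pool() as pool:
            entries = pool.map(dot, pairs)                 # all dot products computed in parallel

        # The 'reduce'/combine step: reassemble the flat list of entries into the product matrix.
        n = len(cols)
        product = [entries[i:i + n] for i in range(0, len(entries), n)]
        print(product)                                     # -> [[19, 22], [43, 50]]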
Coroutines
In computer science, coroutines are program components that generalize subroutines to allow multiple entry points for suspending and resuming execution at certain locations. Coroutines are well-suited for implementing more familiar program components such as cooperative tasks, iterators, infinite lists and pipes.
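Python generators are one concrete form of coroutine: each yield suspends the routine, and each next() resumes it exactly where it left off, which is what makes iterators, infinite lists, and pipes easy to express. A small sketch:

    def naturals():
        """Infinite list: suspends at yield, resumes exactly where it left off."""
        n = 0
        while True:
            yield n
            n += 1

    def squares(numbers):
        """A pipe stage: pulls from an upstream coroutine, pushes transformed values downstream."""
        for n in numbers:
            yield n * n

    pipeline = squares(naturals())
    print(next(pipeline), next(pipeline), next(pipeline))   # -> 0 1 4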
There's also non-blocking concurrency, built on primitives such as compare-and-swap and load-link/store-conditional instructions. For example, compare-and-swap (CAS) could be declared like so:
bool cas( int new_value, int current_value, int * location );  /* returns true if the swap happened */
This operation atomically sets the value at location to new_value, but only if the value currently at location equals current_value, and it reports whether the swap happened. It requires only one instruction and is also how blocking primitives (mutexes, semaphores, etc.) are usually implemented underneath.
IPC (including MPI and RMI)
On the wiki pages you can find that MPI (the Message Passing Interface) is one method of the general IPC technique: http://en.wikipedia.org/wiki/Inter-process_communication
Another interesting approach is the remote procedure call. For example, Java's RMI enables you to focus only on your application domain and communication patterns. It's "application-level" concurrency.
http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136424.html
There are various design patterns and tools available to aid in shared-memory-model parallelization. Apart from the futures mentioned above, one can also take advantage of:
1. Thread pool pattern - focuses on distributing tasks among a fixed number of threads (see the sketch after this list): http://en.wikipedia.org/wiki/Thread_pool_pattern
2. Scheduler pattern - controls thread execution according to a chosen scheduling policy: http://en.wikipedia.org/wiki/Scheduler_pattern
3. Reactor pattern - embeds a single-threaded application in a parallel environment: http://en.wikipedia.org/wiki/Reactor_pattern
4. OpenMP (allows you to parallelize parts of the code by means of preprocessor pragmas)
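A rough sketch of the thread pool pattern, using Python's standard ThreadPoolExecutor (the hashing task is just an illustrative stand-in for real work): a fixed number of worker threads share a queue of tasks, so the thread count stays bounded no matter how many tasks arrive.

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    def job(text):
        """One unit of work: runs on whichever pooled worker thread is free."""
        return hashlib.sha256(text.encode()).hexdigest()

    tasks = ["task-%d" % i for i in range(100)]

    # Four worker threads share all 100 tasks; in practice the tasks would
    # usually be I/O-bound (network calls, DB queries) rather than hashing.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for digest in pool.map(job, tasks):
            print(digest)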
Parallel Random Access Machine (PRAM) is useful for complexity/tractability issues (please refer to a good book for details).
You will also find something about these models here (by Blaise Barney).
How about tuple space?
A tuple space is an implementation of the associative memory paradigm for parallel/distributed computing. It provides a repository of tuples that can be accessed concurrently. As an illustrative example, consider that there is a group of processors that produce pieces of data and a group of processors that use the data. Producers post their data as tuples in the space, and the consumers then retrieve data from the space that match a certain pattern. This is also known as the blackboard metaphor. Tuple space may be thought of as a form of distributed shared memory.
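A toy tuple space can be sketched in Python with a list guarded by a condition variable (only to illustrate the put/take-by-pattern idea; real systems such as Linda or JavaSpaces are far more capable):

    import threading

    class TupleSpace:
        """A shared bag of tuples: put() posts a tuple, take() removes one matching a pattern."""
        def __init__(self):
            self._tuples = []
            self._cond = threading.Condition()

        def put(self, tup):
            with self._cond:
                self._tuples.append(tup)
                self._cond.notify_all()

        def take(self, pattern):
            """Pattern is a tuple where None matches anything; blocks until a match exists."""
            def matches(tup):
                return len(tup) == len(pattern) and all(
                    p is None or p == t for p, t in zip(pattern, tup))
            with self._cond:
                while True:
                    for tup in self._tuples:
                        if matches(tup):
                            self._tuples.remove(tup)
                            return tup
                    self._cond.wait()

    space = TupleSpace()
    space.put(("result", "job-42", 1764))
    print(space.take(("result", "job-42", None)))   # -> ('result', 'job-42', 1764)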
LMAX's Disruptor pattern keeps data in place and ensures that only one thread (consumer or producer) owns a given data item (a queue slot) at a time.
Related
The active object design pattern, as I understand it, ties a (private/dedicated) thread's lifetime to an object and has it work on independent data. From some of the documentation I've read, this paradigm evolved for two reasons: first, managing raw threads is painful, and second, many threads contending for a shared resource don't scale well with mutexes and locks. While I agree with the first reason, I don't fully understand the second. Making an object active just makes that object independent, but problems like contention for a lock/mutex are still there (since we still have a shared queue/buffer); the object has merely delegated the sharing responsibility to its message queue. The only advantage of this design pattern, as I see it, is when I have to perform a long asynchronous task on the shared object (now that I am just posting messages to a shared queue, threads no longer have to block for long on mutexes/locks, though they will still block and contend briefly while publishing messages/tasks). Other than this case, could someone describe more scenarios where this kind of design pattern has other advantages?
My second question (I have just started digging into design patterns) is: what is the conceptual difference between the active object, reactor, and proactor design patterns? How do you decide which pattern is more efficient and suits your requirements better? It would be really nice if someone could give examples showing how the three design patterns behave and where each has a comparative advantage or disadvantage.
I am somewhat confused because I have used both an active object (with a shared thread-safe buffer) and boost::asio (a proactor) to do similar kinds of asynchronous work, and I would like to know whether anyone has more insight into the applicability of the different patterns when approaching a problem.
The ACE website has some very good papers on the Active Object, Proactor, and Reactor design patterns. A short summary of their intents:
The Active Object design pattern decouples method execution from method invocation to enhance concurrency and simplify synchronized access to an object that resides in its own thread of control. Also known as: Concurrent Object, Actor.
The Proactor pattern supports the demultiplexing and dispatching of multiple event handlers, which are triggered by the completion of asynchronous events. This pattern is heavily used in Boost.Asio.
The Reactor design pattern handles service requests that are delivered concurrently to an application by one or more clients. Each service in an application may consist of several methods and is represented by a separate event handler that is responsible for dispatching service-specific requests. Also known as: Dispatcher, Notifier.
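To make the Active Object intent concrete, here is a minimal Python sketch (ActiveCounter is a made-up example, not the ACE implementation): invoking a method only enqueues a request and returns a future, while a private thread inside the object executes the requests one at a time, so callers never lock the object's state themselves.

    import queue, threading
    from concurrent.futures import Future

    class ActiveCounter:
        """Active object: invocation enqueues work; a private thread executes it sequentially."""
        def __init__(self):
            self._count = 0                               # only the private thread touches this
            self._requests = queue.Queue()
            threading.Thread(target=self._scheduler, daemon=True).start()

        def increment(self, by=1):
            future = Future()
            self._requests.put((lambda: self._do_increment(by), future))
            return future                                  # caller decides when (or whether) to wait

        def _do_increment(self, by):
            self._count += by
            return self._count

        def _scheduler(self):
            while True:
                work, future = self._requests.get()
                future.set_result(work())                  # execution decoupled from invocation

    counter = ActiveCounter()
    futures = [counter.increment() for _ in range(100)]
    print(futures[-1].result())                            # -> 100, with no locks in caller code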
I have heard that there are three kinds of concurrency:
Deterministic concurrency
Message-passing concurrency
Shared-state concurrency
I know #2 (=actor model) and #3 (=general threading), but not #1. What's that?
Deterministic concurrency is a concurrent programming model such that programs written in this model have the following property: for a given set of inputs, the output values of a program are the same for any execution schedule. This means that the outputs of the program depend solely on the inputs of the program.
There are ways to ensure this property. One of the ways is the so-called single-assignment programming where variables don't have to be initialized, but may be assigned at most once. Reading an uninitialized variable stalls until it's assigned a value (possibly by some other thread). The Mozart programming language has support for these.
Another way is to use ownership analysis to determine which threads 'own' different references, and to ensure that no 2 threads write to the reference at the same 'time', so there are no data races.
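Python has no built-in dataflow variables, but the single-assignment idea can be sketched with a threading.Event (DataflowVar is a made-up name for illustration): the variable may be bound at most once, and reads block until it is bound, so for a given set of inputs the outputs do not depend on the schedule.

    import threading

    class DataflowVar:
        """Single-assignment variable: bind at most once, reads block until bound."""
        def __init__(self):
            self._bound = threading.Event()
            self._value = None

        def bind(self, value):
            if self._bound.is_set():
                raise ValueError("dataflow variable already bound")
            self._value = value
            self._bound.set()

        def read(self):
            self._bound.wait()          # stall until some thread binds the variable
            return self._value

    x = DataflowVar()
    threading.Thread(target=lambda: x.bind(42)).start()
    print(x.read())                     # always 42, for any execution schedule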
I haven't heard the term before, but coroutines come to mind. They don't provide "true" concurrency, in the sense that only one routine is executing at any particular moment, but they're concurrent in the sense that a group of interacting coroutines can all make progress without having to wait for each other to finish.
Goal
My goal is to better understand how concurrency works within a Java EE environment and how I can make better use of it.
General questions
Let's take a typical servlet container (Tomcat) as an example. It uses one thread to process each request. The thread pool is configured so that it can have at most 80 threads. Let's also take a simple webapp - it does some processing and DB communication during each request.
At peak time I can see 80 threads running in parallel (plus several infrastructure threads). Let's also assume I am running it on an 'm1.large' EC2 instance.
I don't think all these threads can really run in parallel on this hardware, so the scheduler has to decide how best to split CPU time between them. The questions are: how big is the scheduler overhead in this case? How can I find the right balance between the number of threads and processing speed?
Actors comparison
Having 80+ threads on a 4-core CPU doesn't sound healthy to me, especially if most of them are blocked on some kind of IO (DB, filesystem, socket) - they just consume precious resources. What if we detach requests from threads and keep only a reasonable number of threads (8, for instance), just sending processing tasks to them? Of course in this case IO should also be non-blocking, so that I receive an event when data I need is available and send an event when I have results.
As far as I understand, the actor model is all about this. Actors are not bound to threads (at least in Akka and Scala). So I would have a reasonable thread pool and a bunch of actors with mailboxes containing processing tasks.
Now the question is: how does the actor model compare to the traditional thread-per-request model in terms of performance, scheduler overhead, and resource (RAM, CPU) consumption?
Custom threads
I have a few requests (only several) that take too much time to process. I have optimized the code and all the algorithms and added caches, but they still take too long. However, I see that the algorithm can be parallelized. This fits naturally into the actor model - I just split my big task into several tasks and then aggregate the results somehow (if needed). But in the thread-per-request model I need to spawn my own threads (or create my own small thread pool). As far as I know, that's not recommended practice within a Java EE environment, and from my point of view it doesn't fit naturally into the thread-per-request model. The question arises: how big should my thread pool be? Even if I make it a reasonable size for the hardware, I still have the container-managed bunch of threads alongside it; thread management becomes decentralized and goes wild.
So my question is: what is the best way to deal with these situations in the thread-per-request model?
Having 80+ threads on a 4-core CPU doesn't sound healthy to me, especially if most of them are blocked on some kind of IO (DB, filesystem, socket) - they just consume precious resources.
Wrong. In exactly this scenario the processor can handle many more threads than it has cores, since most of the threads at any point in time are blocked waiting for I/O. Fair enough, context switching takes time, but that overhead is usually negligible compared to file/network/DB latency.
The rule of thumb that the number of threads should be equal to - or a little more than - the number of processor cores applies only to computation-intensive tasks, where the cores are kept busy most of the time.
I have a few requests (only several) that take too much time to process. I have optimized the code and all the algorithms and added caches, but they still take too long. However, I see that the algorithm can be parallelized. This fits naturally into the actor model - I just split my big task into several tasks and then aggregate the results somehow (if needed). But in the thread-per-request model I need to spawn my own threads (or create my own small thread pool). As far as I know, that's not recommended practice within a Java EE environment.
I've never heard of that (but I don't claim to be the ultimate Java EE expert). IMHO there is nothing wrong with executing tasks associated with a single request in parallel using e.g. a ThreadPoolExecutor. Note that these threads are not request-handling threads, so they don't directly interfere with the thread pool used by the EJB container - except that they compete for the same resources, of course, so in a careless setup they may slow down or completely starve other request-processing threads.
what is the best way to deal with these situations in the thread-per-request model?
In the end, you can't escape measuring concurrent performance and fine-tuning the size of your thread pool and other parameters for your own specific environment.
The whole point of Java EE is to put common architectural concerns like security, state, and concurrency into the framework and let you provide the bits of business logic or data mappings along with the wiring to connect them. As such, Java EE intentionally hides the nasty bits of concurrency (locking to read/write mutable state) in the framework.
This approach lets a much broader range of developers successfully write correct applications. A necessary side effect though is that these abstractions create overhead and remove control. That's both good (in making it simple and encoding policies as policies not code) and bad (if you know what you're doing and can make choices impossible in the framework).
It is not inherently bad to have 80 threads on a production box. Most will be blocked or waiting on I/O which is fine. There is a (tunable) pool of threads doing the actual computation and Java EE will give you external hooks to tune those knobs.
Actors are a different model. They also let you write islands of code (the actor body) that (can) avoid locking to modify state. You can write your actors to be stateless (capturing the state in the recursive function call parameters) or hide your state completely in an actor instance so the state is all confined (for react style actors you probably still need to explicitly lock around data access to ensure visibility on the next thread that runs your actor).
I can't say that one or the other is better. I think there is adequate proof that both models can be used to write safe, high-throughput systems. To make either perform well, you need to think hard about your problem and build apps that isolate parts of state and the computations on each kind of state. For code where you understand your data well and have a high potential for parallelism I think models outside Java EE make a lot of sense.
Generally, the rule of thumb for sizing compute-bound thread pools is that they should have approximately N + 2 threads, where N is the number of cores. Many frameworks size themselves that way automatically; you can use Runtime.getRuntime().availableProcessors() to get N. If your problem decomposes in a divide-and-conquer style and the number of data items is large, I would strongly suggest checking out fork/join, which can be used now as a separate library and will be part of Java 7.
As far as how to manage this, you're not supposed to spawn threads as such inside Java EE (they want to control that) but you might investigate sending a request to your data-crunching thread pool via a message queue and handling that request via a return message. That can fit in the Java EE model (a bit clumsily of course).
I have a writeup of actors, fork/join, and some other concurrency models here that you might find interesting: http://tech.puredanger.com/2011/01/14/comparing-concurrent-frameworks/
I have a multithreaded application. Each module is executed in a separate thread.
Modules are:
- network module - used to receive/send data from network
- parser module - encode/decode network data to internal presentation
- 2 application modules - perform some application logic on the above data, one after the other
- counter module - used to gather statistics from other modules
- timer module - used to schedule timers
- and much more ...
All threads use message queues for inter-thread communication (a std::deque synchronized with a condition variable and a mutex).
Some modules are used by other ones (e.g. all modules use the timer and counter), and this happens for every message received from the network, which must be handled at very high rates.
This is a pretty complex application and the design looks "reasonable". On the other hand, I'm not sure that such a design - a thread per module - is the "best" one. In particular, I'm afraid it encourages a lot of context switches.
What do you think?
Are there any good guidelines or open-source projects from which to learn how to do a "correct" design of a threaded application?
Thread-per-function designs are naive: they assume that by separating tasks onto threads - one per module - some kind of scalability will be achieved.
This kind of design is inefficient, as very few task breakdowns yield exactly as many tasks as there are CPUs.
A far more rational design is to break tasks down into 'jobs' and then use a thread-pooling mechanism to dispatch those jobs.
Advantages over the thread-per-module approach:
Thread pools take advantage of all cores. With thread-per-module, if you have fewer modules than cores, some cores sit idle.
Thread pools minimize contention and resource use by maintaining parity between active threads and cores. With thread-per-module, if you have more modules than cores, you incur needless extra context switches and (on some platforms) each thread consumes other limited per-process resources (such as virtual memory).
Thread pools let a "module" do multiple jobs at a time. Thread-per-module means that the busiest module still gets only one core.
I wouldn't call myself an expert in multi-threaded design, but I've at least worked with threads enough to have run into various issues trying to design them to work together (communication, locking resources, waiting for threads to end, etc.).
At this point, my general rule of thumb is that I must justify the existence of each new thread. For example, if the network layer I'm using provides both a synchronous and an asynchronous API, can I really justify making the network code use synchronous calls in a new thread instead of just using the asynchronous calls in the main thread? In your case, how many modules actually need a thread of their own for a specific reason? Are there any that could instead just be called in turn from the main thread?
If some threads have no good reason for existing, then you might be able to save yourself some trouble and complexity by just putting that module in the main thread.
Now of course, there are good justifiable reasons for putting things in threads. Such as making synchronous calls that may block for a long time, keeping a GUI thread responsive while performing a long task, or being able to take advantage of parallel processing of a large task on a multi-core system.
I don't know of any particular "correct" way to do it. A lot of it really comes down to the details of what your application is actually supposed to do.
A good guideline is to put operations that might block (such as I/O) in their own threads. Your network module is a definite candidate here. Have your network thread use select (I assume UNIX here) to block on input.
Asynchronous events are good in separate threads as well. Your timer module looks like a good candidate here.
You might want to put your other modules in one thread to decrease complexity of your application. BUT, you might want to split them up if you have a multi-processor system.
Have a good strategy for locking resources and mutex handling to prevent deadlocks. A dependency graph (using a whiteboard!) might help here to get your design correct.
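For the network-module suggestion above, here is a minimal sketch in Python rather than C++ (network_thread and out_queue are illustrative names): one dedicated thread blocks in select() and forwards incoming data to the rest of the application through a message queue, so only this thread ever touches the sockets.

    import select, socket

    def network_thread(listen_port, out_queue):
        """Runs in its own thread: blocks in select() and hands received data to other modules via a queue."""
        server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind(("", listen_port))
        server.listen()
        sockets = [server]
        while True:
            readable, _, _ = select.select(sockets, [], [])   # block until something is ready
            for sock in readable:
                if sock is server:
                    conn, _ = server.accept()                 # new client connection
                    sockets.append(conn)
                else:
                    data = sock.recv(4096)
                    if data:
                        out_queue.put(data)                   # e.g. the parser module's input queue
                    else:
                        sockets.remove(sock)                  # client closed the connection
                        sock.close()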
Good luck! Sounds like a complex system which will cause many hours of fun development!
For what platform?
For instance, for Win32 applications the best model for back-end servers (like yours seems to be) is a thread pool plus I/O completion ports. This is not just hearsay and opinion; there are strong facts behind this claim. Rick Vicik of the Windows Performance team has posted a series of articles describing in greater detail why high-end servers need to follow this model; see High Performance Windows Programs.
There are other factors that come into play, such as the type of protocol your network module has to handle. Request-response protocols are often handled with the one-thread-per-request metaphor and do well enough, but high-throughput, high-scale protocols don't fare well in that model, specifically because of boxcarring requirements.
Ultimately, whether your design is sound or not is hard to tell from this brief description alone. Personally I tend to favor an I/O-completion-driven threading model, as opposed to a logical-module-driven one, but that's just me.
Just to add to the other answers, let's reason about every single thread in your design:
network module
Accepted.
parser module + 2 application modules
Are you sure these three threads can't be merged into one main data-processing thread? If they could, you would then benefit from a thread pool as others suggested, with this processing performed by N threads.
timer module
This one is probably reasonable on most platforms, as you will need a message-processing loop to dispatch timer events. Also, if you ever need a GUI, that could be the place for it.
counter module
This is the one that bothers me most. I can't see the reason for having a separate thread for this. Depending on how often you update it, it will make a nice bottleneck for the application.
I suggest keeping separate counters in each thread and polling them (via the message queue) when you need the totals.
and much more ...
Hope not!
I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the DB and calls a deterministic, side-effect-free function for each one, how would I make it so that the program does not wait for the function to return but instead sets the next calls going, so they can be processed in parallel? A naive approach that demonstrates the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one, and the whole process would then be I/O-bound. I would, however, assume for now that it isn't, and in practice use a DB cluster, memory caching, or something else so as not to be I/O-bound at this point.
Well, if .NET is an option, they have put a lot of effort into Parallel Computing.
If you still plan on using Python, you might want to have a look at the processing package (which later became the standard multiprocessing module). It uses processes rather than threads for parallel computing (because of the Python GIL) and provides classes for distributing "work items" across several processes. Using the Pool class, you can write code like the following:
import processing  # the 'processing' package later became the standard 'multiprocessing' module

def worker(i):
    return i * i

num_workers = 2
pool = processing.Pool(num_workers)
result = pool.imap(worker, range(100000))  # lazy iterator of results computed in the worker processes
This is a parallel version of itertools.imap that distributes the calls over the worker processes. You can also use the pool's apply_async method and store the lazy result objects in a list:
results = []
for i in range(10000):
    results.append(pool.apply_async(worker, (i,)))  # args must be a tuple; call .get() on each entry later
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
I might be missing something here, but this seems fairly straightforward using pthreads.
Set up a small thread pool with N threads in it and have one master thread control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find the next free thread; if no thread is free, wait
Hand over chunk to worker thread
Go back and get next chunk from DB
Meanwhile, the worker threads sit and do:
Mark myself as free
Wait for the master thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex-controlled arrays: one holds the worker threads (the thread pool) and the other indicates whether each corresponding thread is free or busy.
Tweak N to your liking ...
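A minimal sketch of that master/worker loop in Python (using a bounded queue in place of the mutex-controlled arrays; get_chunk_from_db and process_chunk are placeholders for your own code):

    import threading, queue

    NUM_WORKERS = 4
    work_queue = queue.Queue(maxsize=NUM_WORKERS)    # master blocks here when every worker is busy

    def process_chunk(chunk):
        pass                                         # placeholder for the deterministic, side-effect-free function

    def get_chunk_from_db():
        return None                                  # placeholder: return None when the DB is exhausted

    def worker():
        while True:
            chunk = work_queue.get()
            if chunk is None:                        # poison pill: no more work
                break
            process_chunk(chunk)

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    while True:                                      # the master loop
        chunk = get_chunk_from_db()
        if chunk is None:
            break
        work_queue.put(chunk)                        # hands the chunk to the next free worker

    for _ in threads:
        work_queue.put(None)                         # tell every worker to shut down
    for t in threads:
        t.join()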
If you're working with a compiler that supports it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code so that certain loops are parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc 4.2 will support OpenMP, for example.
The same thread-pool approach is used in Java, but there the tasks submitted to the pool can be made serializable, sent to other computers, and deserialized to run.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but not yet accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might be of interest to you also.