Understanding the scalability of Erlang - concurrency

It is said that thousands of processes can be spawned to do similar tasks concurrently and that Erlang is good at handling this. If there is more work to be done, we can simply and safely add more worker processes, and that makes it scalable.
What I fail to understand is this: if the work performed by each worker is itself resource-intensive, how will Erlang be able to handle it? For instance, if entries are being made into a table by several sources and an Erlang application with its hundreds of processes reads rows from the table and does something with them, this is obviously likely to put a burden on resources. Every worker will try to pull a record from the table.
If this is a bad example, consider a worker that has to perform a highly CPU-intensive computation in memory. Thousands of such workers running concurrently will overwork the CPU.
Please rectify my understanding of the scalability in Erlang:
Erlang processes get time slices of the CPU only if there is work available for them. OS processes on the other hand get time slices regardless of whether they are idle.
The startup and shutdown time of Erlang processes is much lower than that of OS processes.
Apart from the above two points is there something about Erlang that makes it scalable?

Scaling in Erlang is not automatic. The Erlang language and runtime provide some tools which make it comparatively easy to write concurrent programs. If these are written correctly, then they are able to scale along several different dimensions:
Parallel execution on multiple cores - since the VM knows how to utilize them all.
Capacity - since you can have a process per task and they are lightweight.
The biggest advantage is that Erlang processes are isolated, like in the OS, but unlike the OS the communication overhead is small. These two traits are what you want to exploit in Erlang programming.
The problem where you have a highly contended data resource is one to avoid if you are targeting high parallel execution. The best way to go around it is to split up your problem so it doesn't occur.
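One way to read "split up your problem" is to give every worker its own disjoint slice of the rows plus a private accumulator, and to merge only once at the end. The sketch below shows that shape in Python rather than Erlang; the function names and the squaring step are made up for illustration, and in a real Erlang system each partition would be owned by its own process.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts disjoint slices so no two workers share a row."""
    return [rows[i::n_parts] for i in range(n_parts)]


def process_partition(rows):
    """Work on a private slice with a private accumulator: nothing to lock."""
    subtotal = 0
    for row in rows:
        subtotal += row * row  # stand-in for the real per-row work
    return subtotal


def process_all(rows, n_workers=4):
    """Run one worker per partition and merge the subtotals once at the end."""
    parts = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(process_partition, parts))
```

The only point where workers meet is the final sum, so adding workers does not add contention on a shared table or counter.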
I have a blog post, http://jlouisramblings.blogspot.dk/2013/01/how-erlang-does-scheduling.html which describes in some more detail how the Erlang scheduler works. You may want to read that.


Why are concurrent programs faster?

I've been reading a lot about concurrent programming and watching a lot of videos online, but I still can't understand one big idea. Provided that a piece of software is written correctly and is not executed on a multi-core processor (i.e. it runs on a single-core machine), why does a concurrent program run faster than a sequential one? I keep trying to figure it out but I really can't understand.
It's not. The argument for writing concurrent code for single processors wasn't on the grounds of speed, it was about organization of tasks. It's cleaner to have different tasks handled by different threads with switching between them done by the OS, otherwise the application has to juggle the tasks itself. Data for a task can be confined to a thread and kept separate from other tasks.

Benefits of a multi thread program in a unicore system [duplicate]

This question already has answers here:
How can multithreading speed up an application (when threads can't run concurrently)?
My professor casually mentioned that we should write multi-threaded programs even if we are using a unicore processor; however, because of a lack of time, he did not elaborate on it.
I would like to know what the benefits of a multi-threaded program on a unicore processor are.
It won't be as significant as a multi-core system but it can still provide some benefits.
Mainly, the benefits you get come from the context switch that happens when the currently executing thread stalls. The executing thread may be waiting for anything, such as a hardware resource, a branch misprediction, or a data transfer after a cache miss.
At this point another thread that is ready to run can be executed to make use of this "waiting time". But of course the context switch will take some time. Also, managing threads inside the code rather than computing sequentially can add some extra complexity to your program. And as has been said, some applications need to be multi-threaded, so there is no escape from the context switch in some cases.
Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example - The GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.
Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.
Benefits could be different.
One widely used example is a GUI application that is supposed to perform some kind of computation. With a single thread, the user has to wait for the result before doing anything else with the application; if you start the computation in a separate thread, the user interface stays available during the computation. So a multi-threaded program can emulate a multi-tasking environment even on a unicore system. That's one of the points.
As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. "Future" and boost::asio provide ways of doing non-blocking stuff without necessarily resorting to multiple threads.
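To make the single-threaded alternative concrete, here is a minimal sketch using Python's asyncio, which plays a role similar to the futures and boost::asio mentioned above. The task names and delays are invented, and asyncio.sleep stands in for a real non-blocking I/O call.

```python
import asyncio


async def fetch(name, delay):
    """Pretend to wait on I/O; await hands control back to the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # Both waits overlap even though only one thread is running.
    return await asyncio.gather(fetch("disk", 0.02), fetch("network", 0.01))


def run():
    """Start an event loop, run both tasks to completion, return their results."""
    return asyncio.run(main())
```

gather returns the results in the order the tasks were passed in, regardless of which one finished first.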
I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking/long-running call you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code. For example background polling for updates, monitoring some resource (e.g. free storage), checking internet connectivity. If you keep them in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program, the sole communication with the main logic is usually a single 'event'.
In some environments you might get more processing time. This mainly depends on how your OS scheduling system works, but if this allocates time per thread, the more threads you have the more your app will be scheduled.
Some benefits long-term:
Where it's not hard to do you benefit if your hardware evolves. You never know what's going to happen, today your app runs on a single-core embedded device, tomorrow that embedded device gets a quad core. Programming threaded from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash all related operations end up in the same thread. The advantage for single cores is 'small' but it's not hard to do as you need little synchronization primitives so the overhead stays small.
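The hash-based assignment in the last point could look roughly like the sketch below. Everything here is illustrative: the (key, amount) events, the function names, and the choice of zlib.crc32 as a stable hash are assumptions. Each worker thread has its own inbox and its own state dictionary, so the only synchronization is the queue itself.

```python
import queue
import threading
import zlib


def route(key, n_workers):
    """Pick a worker from a stable hash so equal keys always land on the same one."""
    return zlib.crc32(key.encode("utf-8")) % n_workers


def worker(inbox, state):
    """Only this thread touches `state`, so no lock is needed around it."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel: no more work
            break
        key, amount = item
        state[key] = state.get(key, 0) + amount


def run(events, n_workers=3):
    """Route (key, amount) events to per-key worker threads and collect the totals."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    states = [{} for _ in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(inboxes[i], states[i]))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for key, amount in events:
        inboxes[route(key, n_workers)].put((key, amount))
    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    merged = {}
    for state in states:
        merged.update(state)  # keys are disjoint across workers
    return merged
```

Because all operations for one key are handled by one thread, they are also processed in the order they were submitted.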
That said, I think there are situations where it's very ill-advised:
As soon as your required synchronization mechanism with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). It might still be then that multi-threading gives you a benefit when effectively moving to multiple CPUs, but the overhead is huge both for your single core and your programming time.
For instance, think about operations that block because of slow peripheral devices (hard disk access etc.). While these are waiting, even a single core can do other things asynchronously.
In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for completion of IO requests (user input, network/disk IO), for critical resources to become available, or for any sort of asynchronously triggered event, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchronous IO, coroutines, or fibers come to mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.
As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The average number of cores will double every 18 months in the future. People have learned single-threaded programming for 50 years now, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming. This refers to low-level aspects like avoiding race conditions and proper synchronization, as well as the high-level aspects like the general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.

Concurrency within Java EE environment

My goal is to better understand how concurrency works within a Java EE environment and how I can make better use of it.
General questions
Let's take typical servlet container (tomcat) as example. For each request it uses 1 thread to process it. Thread pool is configured so, that it can have max 80 threads in pool. Let's also take simple webapp - it makes some processing and DB communication during each request.
At peak time I can see 80 threads running in parallel (+ several other infrastructure threads). Let's also assume I am running it on an 'm1.large' EC2 instance.
I don't think that all these threads can really run in parallel on this hardware. So now scheduler should decide how better to split CPU time between them all. So the questions are - how big is scheduler overhead in this case? How can I find right balance between thread amount and processing speed?
Actors comparison
Having 80+ threads on 4 core CPU doesn't sound healthy to me. Especially if most of them are blocked on some kind of IO (DB, Filesystem, Socket) - they just consume precious resources. What if we will detach request from thread and will have only reasonable amount of threads (8 for instance) and will just send processing tasks to them. Of course in this case IO should be also non-blocking, so that I receive events when some data, that I need, is available and I send event, if I have some results.
As far as I understand, Actor model is all about this. Actors are not bound to threads (at least in Akka and Scala). So I have reasonable thread pool and bunch of actors with mailboxes that contain processing tasks.
Now question is - how actor model compares to traditional thread-per-request model in terms of performance, scheduler overhead and resources (RAM, CPU) consumption?
Custom threads
I have some requests (only several) that take too much time to process. I optimized code and all algorithms, added caches, but it still takes too much time. But I see that the algorithm can be parallelized. It fits naturally in the actor model - I just split my big task into several tasks, and then aggregate the results somehow (if needed). But in the thread-per-request model I need to spawn my own threads (or create my own small thread pool). As far as I know, it's not recommended practice within a Java EE environment. And, from my point of view, it doesn't fit naturally in the thread-per-request model. The question arises: how big should my thread pool be? Even if I make it reasonable in terms of hardware, I still have this bunch of threads managed by the servlet container. Thread management becomes decentralized and goes wild.
So my question - what is the best way to deal with these situations in thread-per-request model?
Having 80+ threads on 4 core CPU doesn't sound healthy to me. Especially if most of them are blocked on some kind of IO (DB, Filesystem, Socket) - they just consume precious resources.
Wrong. Exactly in this scenario the processors can handle many more threads than the number of individual cores, since most of the threads at any point in time are blocked waiting for I/O. Fair enough, context switching takes time, but that overhead is usually irrelevant compared to file/network/DB latency.
The rule of thumb that the number of threads should be equal - or a little more than - the number of processor cores applies only for computation-intensive tasks when the cores are kept busy most of the time.
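The effect is easy to reproduce with a toy model. In the sketch below (Python, since the argument is not Java-specific), handle_request is a made-up stand-in for a request that spends nearly all of its time blocked on I/O, simulated with time.sleep; serve runs a batch of such requests on a pool of a given size and reports the elapsed time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id, io_wait=0.05):
    """Stand-in for a request that spends nearly all its time blocked on I/O."""
    time.sleep(io_wait)  # the thread is parked here and uses no CPU
    return request_id


def serve(n_requests, n_threads):
    """Return (results, elapsed seconds) for a given pool size."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(handle_request, range(n_requests)))
    return results, time.perf_counter() - start
```

With 40 requests, serve(40, 4) has to sit through roughly ten rounds of waiting while serve(40, 40) needs about one, independent of the number of cores, because the threads are asleep rather than competing for the CPU. If handle_request were a pure computation instead, the larger pool would not help.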
I have some requests (only several) that take too much time to process. I optimized code and all algorithms, added caches, but it still takes too much time. But I see, that algorithm can be parallelized. It fits naturally in actor model - I just split my big task in several tasks, and then aggregate results somehow (if needed). But in thread-per-request model I need spawn my own threads (or create my small thread pool). As far as I know, it's not recommended practice within Java EE environment.
Never heard about that (but I don't claim to be the ultimate Java EE expert). IMHO there is nothing wrong with executing tasks associated with a single request in parallel using e.g. a ThreadPoolExecutor. Note that these threads are not request handling threads, so they don't directly interfere with the thread pool used by the EJB container. Except that they compete for the same resources of course, so they may slow down or completely stop other request processing threads in a careless setup.
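As a rough illustration of that fan-out/aggregate pattern, here is a sketch using Python's concurrent.futures.ThreadPoolExecutor, which is modelled on the Java executor API. WORKER_POOL, count_primes and handle_slow_request are hypothetical names, and prime counting is just a placeholder for the slow part of the request. Note that in CPython the GIL keeps pure-Python computation in threads from running in parallel, so this shows the structure rather than the speedup you would get from Java threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical application-wide pool, separate from the container's request threads.
# Keeping it small and shared bounds how much it can compete with request handling.
WORKER_POOL = ThreadPoolExecutor(max_workers=4)


def count_primes(lo, hi):
    """Stand-in for one independent slice of the slow computation."""
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def handle_slow_request(limit, n_parts=4):
    """Fan the request out into sub-tasks, then aggregate the partial results."""
    step = max(1, -(-limit // n_parts))  # ceiling division, at least 1
    ranges = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    futures = [WORKER_POOL.submit(count_primes, lo, hi) for lo, hi in ranges]
    return sum(f.result() for f in futures)
```

The request thread blocks in f.result() until all parts are done, so from the container's point of view the request is still handled by a single thread.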
what is the best way to deal with these situations in thread-per-request model?
In the end, you can't escape measuring concurrent performance and fine-tuning the size of your thread pool and other parameters for your own specific environment.
The whole point of Java EE is to put common architectural concerns like security, state, and concurrency into the framework and let you provide the bits of business logic or data mappings along with the wiring to connect them. As such, Java EE intentionally hides the nasty bits of concurrency (locking to read/write mutable state) in the framework.
This approach lets a much broader range of developers successfully write correct applications. A necessary side effect though is that these abstractions create overhead and remove control. That's both good (in making it simple and encoding policies as policies not code) and bad (if you know what you're doing and can make choices impossible in the framework).
It is not inherently bad to have 80 threads on a production box. Most will be blocked or waiting on I/O which is fine. There is a (tunable) pool of threads doing the actual computation and Java EE will give you external hooks to tune those knobs.
Actors are a different model. They also let you write islands of code (the actor body) that (can) avoid locking to modify state. You can write your actors to be stateless (capturing the state in the recursive function call parameters) or hide your state completely in an actor instance so the state is all confined (for react style actors you probably still need to explicitly lock around data access to ensure visibility on the next thread that runs your actor).
I can't say that one or the other is better. I think there is adequate proof that both models can be used to write safe, high-throughput systems. To make either perform well, you need to think hard about your problem and build apps that isolate parts of state and the computations on each kind of state. For code where you understand your data well and have a high potential for parallelism I think models outside Java EE make a lot of sense.
Generally, the rule of thumb in sizing compute-bound thread pools is that they should be approximately equal to N of cores + 2. Many frameworks size to that automatically. You can use Runtime.getRuntime().availableProcessors() to get N. If your problem decomposes in a divide-and-conquer style algorithm and the number of data items is large, I would strongly suggest checking out fork/join which can be used now as a separate library and will be part of Java 7.
As far as how to manage this, you're not supposed to spawn threads as such inside Java EE (they want to control that) but you might investigate sending a request to your data-crunching thread pool via a message queue and handling that request via a return message. That can fit in the Java EE model (a bit clumsily of course).
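The request/return-message idea can be sketched in a few lines. This is only an in-process model in Python: queue.Queue stands in for whatever message queue the container provides, and request_queue, crunch, start_workers and submit are hypothetical names.

```python
import queue
import threading

request_queue = queue.Queue()  # request handlers drop work here and move on


def crunch(payload):
    """Stand-in for the data-crunching job."""
    return sum(payload)


def crunching_worker():
    """Take requests off the queue and answer each on its own reply queue."""
    while True:
        message = request_queue.get()
        if message is None:  # sentinel: shut this worker down
            break
        payload, reply_to = message
        reply_to.put(crunch(payload))  # the "return message"


def start_workers(n):
    """Start n daemon threads that form the data-crunching pool."""
    threads = [threading.Thread(target=crunching_worker, daemon=True)
               for _ in range(n)]
    for t in threads:
        t.start()
    return threads


def submit(payload):
    """Send a request and return the queue on which the reply will arrive."""
    reply_to = queue.Queue(maxsize=1)
    request_queue.put((payload, reply_to))
    return reply_to
```

A caller would do something like reply = submit([1, 2, 3]) and later reply.get(timeout=5), which keeps the size of the crunching pool in one place no matter how many requests are in flight.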
I have a writeup of actors, fork/join, and some other concurrency models here that you might find interesting: http://tech.puredanger.com/2011/01/14/comparing-concurrent-frameworks/

Large number of simultaneous long-running operations in Qt

I have some long-running operations that number in the hundreds. At the moment they are each on their own thread. My main goal in using threads is not to speed these operations up. The more important thing in this case is that they appear to run simultaneously.
I'm aware of cooperative multitasking and fibers. However, I'm trying to avoid anything that would require touching the code in the operations, e.g. peppering them with things like yieldToScheduler(). I also don't want to prescribe that these routines be stylized to be coded to emit queues of bite-sized task items...I want to treat them as black boxes.
For the moment I can live with these downsides:
Maximum # of threads tend to be O(1000)
Cost per thread is O(1MB)
To address the bad cache performance due to context-switches, I did have the idea of a timer which would juggle the priorities such that only idealThreadCount() threads were ever at Normal priority, with all the rest set to Idle. This would let me widen the timeslices, which would mean fewer context switches and still be okay for my purposes.
Question #1: Is that a good idea at all? One certain downside is it won't work on Linux (docs say no QThread::setPriority() there).
Question #2: Any other ideas or approaches? Is QtConcurrent thinking about this scenario?
(Some related reading: how-many-threads-does-it-take-to-make-them-a-bad-choice, many-threads-or-as-few-threads-as-possible, maximum-number-of-threads-per-process-in-linux)
IMHO, this is a very bad idea. If I were you, I would try really, really hard to find another way to do this. You're combining two really bad ideas: creating a truck load of threads, and messing with thread priorities.
You mention that these operations only need to appear to run simultaneously. So why not try to find a way to make them appear to run simultaneously, without literally running them simultaneously?
It's been 6 months, so I'm going to close this.
Firstly I'll say that threads serve more than one purpose. One is speedup...and a lot of people are focusing on that in the era of multi-core machines. But another is concurrency, which can be desirable even if it slows the system down when taken as a whole. Yet concurrency can be achieved using mechanisms more lightweight than threads, although it may complicate the code.
So this is just one of those situations where the tradeoff of programmer convenience against user experience must be tuned to fit the target environment. It's how Google's approach to a process-per-tab with Chrome would have been ill-advised in the era of Mosaic (even if process isolation was preferable with all else being equal). If the OS, memory, and CPU couldn't give a good browsing experience...they wouldn't do it that way now.
Similarly, creating a lot of threads when there are independent operations you want to be concurrent saves you the trouble of sticking in your own scheduler and yield() operations. It may be the cleanest way to express the code, but if it chokes the target environment then something different needs to be done.
So I think I'll settle on the idea that in the future when our hardware is better than it is today, we'll probably not have to worry about how many threads we make. But for now I'll take it on a case-by-case basis. i.e. If I have 100 of concurrent task class A, and 10 of concurrent task class B, and 3 of concurrent task class C... then switching A to a fiber-based solution and giving it a pool of a few threads is probably worth the extra complication.
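For reference, the kind of fiber-style rewrite being weighed here could look like the following sketch, with Python generators playing the role of fibers. The operation and run_round_robin names are made up, and the yield is exactly the sort of intrusion into the operation's code that the question wanted to avoid, which is the cost side of the tradeoff.

```python
from collections import deque


def operation(name, steps):
    """A long-running operation rewritten to yield between steps."""
    for i in range(steps):
        # ... one bite-sized piece of work would go here ...
        yield f"{name}:{i}"


def run_round_robin(tasks):
    """Advance each task one step at a time until all have finished."""
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))
            ready.append(task)  # not finished: back of the line
        except StopIteration:
            pass  # finished: drop it
    return trace
```

Hundreds of such tasks cost one generator object each instead of one thread stack each, and a scheduler like this could itself run on a small pool of threads.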

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the db and for each one call a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but rather sets the next calls going, so they could be processed in parallel? A naive approach to demonstrate the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.
Well, if .net is an option, they have put a lot of effort into Parallel Computing.
If you still plan on using Python, you might want to have a look at Processing. It uses processes rather than threads for parallel computing (due to the Python GIL) and provides classes for distributing "work items" onto several processes. Using the pool class, you can write code like the following:
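    import processing

    def worker(i):
        # must be a module-level function so it can be sent to the worker processes
        return i*i

    num_workers = 2
    pool = processing.Pool(num_workers)
    # lazily yields worker(0), worker(1), ... as the processes finish them
    result = pool.imap(worker, range(100000))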
This is a parallel version of itertools.imap, which distributes calls over to processes. You can also use the apply_async methods of the pool and store lazy result objects in a list:
    results = []
    for i in range(10000):
        # the arguments are passed as a tuple, hence (i,)
        results.append(pool.apply_async(worker, (i,)))
For further reference, see the documentation of the Pool class.
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
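A minimal single-machine version along those lines might look like this. It is a word-count sketch with hypothetical function names, not the API of Google's library. It uses a thread pool as the answer suggests; in CPython the GIL means threads will not spread pure-Python work across cores, so for real speedups the pool would have to be swapped for a process pool.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle(mapped):
    """Group all emitted values by key so each key is reduced exactly once."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(item):
    """Reduce: collapse the values collected for one key."""
    key, values = item
    return key, sum(values)


def map_reduce(documents, n_workers=4):
    """Run map and reduce on a pool of threads standing in for the 'machines'."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        mapped = list(pool.map(map_phase, documents))
        groups = shuffle(mapped)
        return dict(pool.map(reduce_phase, groups.items()))
```

For example, map_reduce(["a b a", "b c"]) counts the words across both documents.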
I might be missing something here, but this seems fairly straightforward using pthreads.
Set up a small threadpool with N threads in it and have one thread to control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find next free thread; if no thread is free then wait
Hand over chunk to worker thread
Go back and get next chunk from DB
In the meantime the worker threads sit and do:
Mark myself as free
Wait for the master thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex-controlled arrays. One has the worker threads in it (the threadpool) and the other indicates whether each corresponding thread is free or busy.
Tweak N to your liking ...
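The same master/worker loop can be sketched in Python instead of pthreads. This version swaps the two mutex-controlled arrays for a bounded queue.Queue, which does the "wait until a worker is free" bookkeeping implicitly; process_chunk and the list of chunks are placeholders for the real computation and the DB reads.

```python
import queue
import threading


def process_chunk(chunk):
    """Stand-in for the deterministic, side-effect-free function."""
    return sum(chunk)


def worker(tasks, results):
    """Wait for a chunk, process it, repeat until told to stop."""
    while True:
        chunk = tasks.get()  # blocks until the master hands over a chunk
        if chunk is None:  # sentinel: no more chunks
            break
        results.put(process_chunk(chunk))


def master(chunks, n_workers=4):
    """Hand chunks to N worker threads and collect their results."""
    tasks = queue.Queue(maxsize=n_workers)  # put() blocks while all workers are busy
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    n_chunks = 0
    for chunk in chunks:  # "get data chunk from DB"
        tasks.put(chunk)
        n_chunks += 1
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return [results.get() for _ in range(n_chunks)]
```

Results come back in completion order, not submission order, so tag each chunk with an index if the order matters.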
If you're working with a compiler that will support it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code in such a way that certain loops will be parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc4.2 will support openmp, for example.
The same thread pool is used in java. But the threads in threadpools are serialisable and sent to other computers and deserialised to run.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but not yet accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might be of interest to you also.