Determining the number of parallel processes in a multi-instance subprocess - camunda

I am modelling a process which at times will require a very large number of parallel sub-processes (tens of thousands) to be launched. Obviously it’s not possible for these all to run in parallel simultaneously - how will the Camunda process engine handle this? Is it possible to control how many subprocesses will run at a time?

Camunda 7 uses a job executor thread pool. This determines the concurrency level of jobs, such as an asynchronously started call activity.
The number of sub-processes you mentioned is very high, though. What history level did you have in mind? It is likely better to handle this differently.
Camunda 8 was released two days ago. It has a fundamentally different architecture: no relational database, event-streaming concepts, designed for massive volumes. It may be more suitable for your use case.
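For illustration (my sketch, not part of the original answer): assuming the Camunda 7 Spring Boot starter, the job executor's thread pool is sized with properties along these lines, and only activities flagged asynchronous (e.g. asyncBefore on the sub-process) become jobs that this pool throttles:

    # application.properties - placeholder values, tune for your load
    camunda.bpm.job-execution.core-pool-size=3
    camunda.bpm.job-execution.max-pool-size=10
    camunda.bpm.job-execution.queue-capacity=3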

Related

How to execute multiple sessions in tensorflow via one graph in CPU

Currently I am building a TensorFlow graph and then calling graph->getSession()->Run(input,model,output) in C++ on CPU.
I want to achieve concurrency. What are my options for executing in parallel, so I can support multiple requests concurrently?
Can I run sessions in a multi-threaded way?
By executing multiple sessions in parallel, will the processing time stay constant? Example: if one session takes 100 ms, will running 2 sessions concurrently also take approximately 100 ms?
Note: I want to run this on CPU
First thing to note is that TensorFlow will use all cores for processing by default. You have some limited control over this via inter- and intra-op parallelism, discussed in this authoritative answer:
Tensorflow: executing an ops with a specific core of a CPU
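For reference, here is a minimal sketch (mine, not from that answer) of how those two knobs are set with the TF1-style Python API; the thread counts are placeholders:

    import tensorflow.compat.v1 as tf

    tf.disable_eager_execution()

    config = tf.ConfigProto(
        intra_op_parallelism_threads=2,  # threads used inside a single op (e.g. one matmul)
        inter_op_parallelism_threads=2,  # threads used to run independent ops concurrently
    )
    sess = tf.Session(config=config)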
The second point to note is that a session is thread-safe. You can call it from multiple threads. Each call will see a consistent point-in-time snapshot of the variables as they were when the call began; this is a question I asked once upon a time:
How are variables shared between concurrent `session.run(...)` calls in tensorflow?
The moral:
If you are running lots of small, sequential operations, you can run them concurrently against one session and may be able to eke out some improved performance if you limit TensorFlow's use of parallelism. If you are running large operations (such as large matrix multiplications) which benefit more from distributed multi-core processing, you don't need to deal with parallelism yourself; TensorFlow is already distributing across all CPU cores by default.
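A hedged sketch of that first case (my example, not the answerer's): several Python threads sharing one session, each issuing its own run() call:

    import threading
    import tensorflow.compat.v1 as tf

    tf.disable_eager_execution()
    x = tf.placeholder(tf.float32, shape=[None, 4])
    y = tf.reduce_sum(x * 2.0)
    sess = tf.Session()

    def worker(i):
        # each thread calls run() on the shared, thread-safe session
        result = sess.run(y, feed_dict={x: [[i, i, i, i]]})
        print("thread", i, "->", result)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()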
Also, if your graph dependencies lend themselves to any amount of parallelization, TensorFlow handles this as well. You can set up profiling to see this in action:
https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d

What is the point of being able to spawn dozens of processes efficiently if only very few of them can be executed in parallel?

Erlang is very efficient at spawning new processes, but what is the point if the CPU can only execute, say, 4 of them in parallel?
The rest have to wait for an Erlang "context switch".
Do you get more done, faster, by having for example 10k processes than you would using Java/C#/C++?
There are many reasons:
Conceptually, processes are easy to reason about. Asynchronous callbacks and promises in languages like JavaScript are harder to reason about because the code in the callbacks can change the values of variables used by other code in the thread.
Processes provide isolation for the code running inside them. A process can only affect other processes by placing messages in their mailboxes. A process cannot meddle with the state of other processes.
Processes are granular. This means:
If you have 400 processes on a 4-core machine, the scheduler will distribute them across the scheduler threads in such a way as to fully utilize the 4 cores. One core is always going to be handling OS stuff, so the scheduler would likely end up giving the thread running on that core less work than the other 3 threads. But it adapts, so in any situation the scheduler will do its best to make sure processes wait as little as possible and threads always have a queue of processes waiting for CPU time.
Moving to better hardware with more cores doesn't require changes to the code or architecture of the application. Moving your Erlang application from a 4-core machine to a 64-core machine will mean it runs roughly 16 times faster without any changes, assuming your application is structured in such a way that it can take advantage of the extra cores (usually this means making sure tasks that could be done in parallel are executed in separate processes).
Processes are very lightweight, so there is very little overhead. In most applications the benefits provided by processes and the scheduler far outweigh the small overhead from running thousands of processes. Commodity hardware can easily handle hundreds of thousands of processes.
So in closing, whether or not processes execute in parallel isn't that important. The other benefits they provide are enough to justify their usage.

Celery: number of workers vs concurrency

What is the difference between having:
one worker with concurrency 4 or
two workers with concurrency 2 each
for the same queue?
Thanks
I assume that you are running both workers on the same machine. In that case I would recommend maintaining one worker per queue.
Two workers for the same queue do not benefit you in any way; they would just increase memory waste.
Use two or more workers when you have multiple queues, to maintain priority or to allocate a different number of cores to each worker.
Two or more workers for a single queue are useful only if you run the workers on different machines. The workers on different machines consume tasks from the same queue, and you can allocate concurrency based on the cores available on each machine.
I realise I am responding 2+ years later, but I thought I'd put this here for anyone who still has similar doubts.
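To make the two setups from the question concrete, here is a hedged sketch (the app name and broker URL are placeholders) using Celery's Python entry point; the equivalent CLI commands are in the comments:

    from celery import Celery

    app = Celery('proj', broker='redis://localhost:6379/0')

    # Option A: one worker process with a concurrency pool of 4.
    #   CLI equivalent: celery -A proj worker --concurrency=4
    # Note: worker_main() blocks and runs the worker in this process.
    app.worker_main(['worker', '--concurrency=4'])

    # Option B: two workers with concurrency 2 each (run in two shells):
    #   celery -A proj worker --concurrency=2 -n w1@%h
    #   celery -A proj worker --concurrency=2 -n w2@%h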
Interesting question.
Things that I can think of (I'm sure there are a lot more):
For high availability:
You want more than one machine (in case one goes down), so you must run a worker per machine.
Even on one machine, I think it is safer to have 2 workers running in two different processes instead of one worker with high concurrency (correct me if I'm wrong, but I think concurrency is implemented with threads).
In the docs I see the recommendation is to set concurrency to the number of CPUs.
If you want to separate different tasks to different workers..
Of course, you pay a price for that: more processes take more resources (CPU/memory, etc.).
I found this question which is quite similar.

Parallel Thread Execution to achieve performance

I am a little bit confused about multithreading. We create multiple threads to break the main process into sub-tasks, to achieve responsiveness and to remove waiting time.
But here I have a situation where I have to execute the same task using multiple threads in parallel.
My processor can execute 4 threads in parallel, so will it improve performance if I create more than 4 threads (10 or more)? When I put this question to my colleague, he said nothing bad will happen; we are already executing many threads in many other applications (browser threads, kernel threads, etc.), so he told me to create multiple threads for the same task.
But won't creating more than 4 threads that execute in parallel cause more context switches and decrease performance?
Or, even though we create multiple threads to execute in parallel, will they execute one after the other, so the performance stays the same?
So what should I do in the above situations, and is my understanding correct?
edit
1 thread: processing time 120 seconds.
2 threads: processing time about 60 seconds.
3 threads: processing time still about 60 seconds (no change from 2 threads).
Is it because my hardware can only run 2 threads in parallel (being dual-core)?
Software thread = a piece of code.
Hardware thread = a core (processor) for running a software thread.
So my CPU supports only 2 concurrent threads; if I purchase an AMD CPU with 8 or 12 cores, can I achieve higher performance?
Multi-Tasking is pretty complex and performance gains usually depend a lot on the problem itself:
Only a part of the application can be parallelized (there is always a first part that splits up the work into multiple tasks). So the first question is: how much of the work can be done in parallel, and how much of it needs to be synchronized? (In some cases you can stop here, because so little can be done in parallel that the whole effort isn't worth it.)
Multiple tasks may depend on each other (one task may need the result of another task). These tasks cannot be executed in parallel.
Multiple tasks may work on the same data/resources (a read/write situation). Here we need to synchronize access to those data/resources. If all tasks need write access to the same object during the WHOLE process, then we cannot work in parallel.
Basically this means that without the exact definition of the problem (dependencies between tasks, dependencies on data, amount of parallel tasks, ...) it's very hard to tell how much performance you'll gain by using multiple threads (and if it's really worth it).
http://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl's law states, in a nutshell, that the performance boost you receive from parallel execution is limited by the part of your code that must run sequentially.
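As a quick worked illustration (my numbers, purely hypothetical): with parallel fraction p, Amdahl's law gives a speedup on n cores of 1 / ((1 - p) + p / n):

    def speedup(p, n):
        # Amdahl's law: p = parallel fraction, n = number of cores
        return 1.0 / ((1.0 - p) + p / n)

    print(speedup(0.95, 2))    # ~1.9x on 2 cores
    print(speedup(0.95, 4))    # ~3.5x on 4 cores
    print(speedup(0.95, 1e9))  # ~20x: the ceiling, no matter how many cores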
Without knowing your problem space, here are some general things you should look at:
Refactor to eliminate mutex/locks. By definition they force code to run sequentially.
Reduce context-switch overhead by pinning threads to physical cores. This becomes more complicated when threads must wait for work (i.e. blocking on IO), but in general you want to keep your cores as busy as possible running your program, not switching between threads.
Unless you absolutely need to use threads and sync primitives, try to use a task scheduler or parallel-algorithms library to parallelize your work. Examples would be Intel TBB, Thrust, or Apple's libdispatch. (A small oversubscription experiment follows this list.)
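The experiment mentioned above, as a hedged Python sketch (the workload and pool sizes are arbitrary): time the same CPU-bound job with increasingly large process pools; past the machine's core count, the times should stop improving:

    import os
    import time
    from concurrent.futures import ProcessPoolExecutor

    def burn(_):
        # purely CPU-bound placeholder task
        s = 0
        for i in range(10_000_000):
            s += i * i
        return s

    if __name__ == "__main__":
        tasks = 16
        for workers in (1, 2, 4, 8, 16):
            start = time.perf_counter()
            with ProcessPoolExecutor(max_workers=workers) as pool:
                list(pool.map(burn, range(tasks)))
            elapsed = time.perf_counter() - start
            print(f"{workers:2d} workers: {elapsed:.2f}s (cores: {os.cpu_count()})")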

Understanding the scalability of Erlang

It is said that thousands of processes can be spawned to do similar tasks concurrently, and that Erlang is good at handling this. If there is more work to be done, we can simply and safely add more worker processes, and that makes it scalable.
What I fail to understand is that if the work performed by each worker is itself resource-intensive, how will Erlang be able to handle it? For instance, if entries are being made into a table by several sources, and an Erlang application within its hundreds of processes reads rows from the table and does something, this is obviously likely to cause a resource burden. Every worker will try to pull a record from the table.
If this is a bad example, consider a worker that has to perform a highly CPU-intensive computation in memory. Thousands of such workers running concurrently will overwork the CPU.
Please rectify my understanding of the scalability in Erlang:
Erlang processes get time slices of the CPU only if there is work available for them. OS processes on the other hand get time slices regardless of whether they are idle.
The startup and shutdown time of Erlang processes is much lower than that of OS processes.
Apart from the above two points is there something about Erlang that makes it scalable?
Thanks,
Melvyn
Scaling in Erlang is not automatic. The Erlang language and runtime provide some tools which make it comparatively easy to write concurrent programs. If these are written correctly, then they are able to scale along several different dimensions:
Parallel execution on multiple cores - since the VM knows how to utilize them all.
Capacity - since you can have a process per task and they are lightweight.
The biggest advantage is that Erlang processes are isolated, as in the OS, but unlike in the OS the communication overhead is small. These two traits are what you want to exploit in Erlang programming.
The problem where you have a highly contended data resource is one to avoid if you are targeting high parallel execution. The best way to get around it is to split up your problem so the contention doesn't occur.
I have a blog post, http://jlouisramblings.blogspot.dk/2013/01/how-erlang-does-scheduling.html which describes in some more detail how the Erlang scheduler works. You may want to read that.