How do parallel multi-instance loops work in Camunda 7.16.6?

I'm using the camunda-engine 7.16.6.
I have a process with a multi-instance loop like this one that repeats 1000 times in parallel.
My assumption was that n Camunda executors would now start their work, so executor #1 executes Task 2, then Task 3, then Task 4, while executor #2 and all the others do the same. After a short while, at least some of the 1000 iterations should have finished all three tasks in the loop.
However, what I have observed so far is that Task 2 gets executed 1000 times, and only when that is finished does Task 3 get executed 1000 times, and so on.
I also noticed that Camunda itself takes a lot of time, outside of the tasks.
Is my observation correct, and is this behavior documented somewhere? Can this behavior be changed?

I've run some tests and can explain the behavior:
The order of the tasks and the overall time to finish are influenced by whether or not there are transaction boundaries (async after, the red bars in the screenshot).
It's described a bit here.
By setting the asyncBefore='true' attribute we introduce an additional save point at which the process state will be persisted and committed to the database. A separate job executor thread will continue the process asynchronously by using a separate database transaction. In case this transaction fails the service task will be retried and eventually marked as failed - in order to be dealt with by a human operator.
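For illustration, such a boundary is declared on the multi-instance activity itself. A minimal sketch using the fluent builder from camunda-bpmn-model (the process id, task ids, and delegate class are made up, and the builder method names are from memory, so verify them against the javadoc):

import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

public class ParallelLoopModel {
    // a parallel multi-instance service task with a save point before it
    public static BpmnModelInstance build() {
        return Bpmn.createExecutableProcess("parallelLoop")
            .startEvent()
            .serviceTask("task2")
                .camundaClass("org.example.Task2Delegate") // hypothetical delegate
                .camundaAsyncBefore()                      // transaction boundary / save point
                .multiInstance()
                    .parallel()
                    .cardinality("1000")
                .multiInstanceDone()
            .endEvent()
            .done();
    }
}

With asyncBefore set like this, each of the 1000 iterations becomes its own job in the database, which is what produces the 7,7,7,...,8,8,8,... ordering measured below.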
repeat 1000 times, parallel, no transaction
One job executor rushes through the process; the order is 1, [2,3,4|2,3,4|...], 5. Not really parallel. But this is as documented here:
The Job Executor makes sure that jobs from a single process instance are never executed concurrently.
It can be turned off if you are an expert and know what you are doing (and have understood this section).
Overall this took around 5 seconds.
repeat 1000 times, parallel, with transaction
Here, due to the transactions, there will be 1000 waiting jobs for Task 7, and each finished Task 7 creates another job for Task 8. Since jobs are executed in the order in which they appear in the database (see here), the order is 6, [7,7,7,...,8,8,8,...,9,9,9,...], 10.
The transaction handling, which includes maintaining the variables, has a huge impact on the runtime: with transactions, parallel mode takes 06:33 minutes.
If you turn off the exclusive flag it takes around 04:30 minutes, but at the cost of thousands of OptimisticLockingExceptions.
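If I remember the fluent builder API correctly, the exclusive flag can be disabled per activity like this (a sketch, not a recommendation; ids and the delegate class are again invented):

import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

public class NonExclusiveModel {
    public static BpmnModelInstance build() {
        return Bpmn.createExecutableProcess("nonExclusiveLoop")
            .startEvent()
            .serviceTask("task7")
                .camundaClass("org.example.Task7Delegate") // hypothetical delegate
                .camundaAsyncBefore()
                // WARNING: non-exclusive jobs of the same process instance may
                // run concurrently and collide when writing shared variables
                .camundaExclusive(false)
            .endEvent()
            .done();
    }
}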
AFAIK the recommended approach to gain true parallelism would be to move Task 7, Task 8, and Task 9 to a separate process and spawn 1000 instances of that process, as sketched below.
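A sketch of that approach with the engine's Java API (the process key "loopBody" and the variable are made up): each iteration becomes its own process instance, so the exclusive-job rule no longer serializes them against one parent instance:

import java.util.Map;
import org.camunda.bpm.engine.RuntimeService;

public class LoopSpawner {
    // start 1000 independent instances of a separate "loop body" process
    public static void spawnAll(RuntimeService runtimeService) {
        for (int i = 0; i < 1000; i++) {
            Map<String, Object> vars = Map.of("index", i); // per-instance payload
            runtimeService.startProcessInstanceByKey("loopBody", vars);
        }
    }
}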
You can influence the order of execution if you tweak the job executor settings & priority (see here), but that seems to require the exclusive flag, too. If you do that, the order will be 6, [7,7,7|8,9,8,9 (in random order),...], 10.
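For reference, these knobs live on the engine configuration. A sketch against ProcessEngineConfigurationImpl; the thread counts and batch sizes are example values, not recommendations:

import org.camunda.bpm.engine.impl.cfg.ProcessEngineConfigurationImpl;
import org.camunda.bpm.engine.impl.jobexecutor.DefaultJobExecutor;

public class JobExecutorTuning {
    // assumed example values; measure before adopting any of them
    public static void configure(ProcessEngineConfigurationImpl config) {
        DefaultJobExecutor jobExecutor = new DefaultJobExecutor();
        jobExecutor.setCorePoolSize(8);           // executor threads kept alive
        jobExecutor.setMaxPoolSize(16);           // upper bound on threads
        jobExecutor.setQueueSize(32);             // in-memory job queue
        jobExecutor.setMaxJobsPerAcquisition(10); // jobs fetched per acquisition query
        config.setJobExecutor(jobExecutor);
        config.setProducePrioritizedJobs(true);   // enable job priorities
    }
}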
repeat 1000 times, sequential, no transaction
The order is 11, [12,13,14|12,13,14,...], 15.
This takes only 2 seconds.
repeat 1000 times, sequential, with transaction
The order is, as expected, 16, [17,18,19|17,18,19|...], 20.
Due to the transactions, this takes 02:45 minutes.
I heard from colleagues that one should use parallel mode only if it involves long-running/blocking tasks like a human task: in sequential mode there would only be one human task, and only after that one is done will another be created; in parallel mode, you have 1000 human tasks at once, which is more likely the desired behavior.
Parallel performance seems to be improved in Camunda 8.

Related

Reusing a database record created by means of a Celery task

There is a task which creates a database record (R) when it runs for the first time. When the task is started a second time, it should read the database record, perform some calculations, and call an external API. The first and second starts happen in a loop.
In the case of a single start of the task there are no problems, but in the case of loops (at each loop iteration a new task is created and starts at a certain time) there is a problem. In the task queue (we use Flower to monitor it) we have a crashed task on every second iteration.
If we add time.sleep(1) at the end of the loop, the tasks sometimes work properly, but sometimes they do not. How can we avoid this problem? We are afraid that tasks for a different combination of two users, started at the same time, will also crash.
Is there some problem with running tasks in Celery simultaneously? Or is there something we should consider? The tasks are for scheduled payments, so they have to work rock solid.

Processing tasks in parallel in a specific time frame without waiting for them to finish

This is a question about concurrency/parallelism and processes. I am not sure how to express it, so please forgive my ignorance.
It is not related to any specific language, although I'm using Rust lately.
The question is whether it is possible to launch processes concurrently/in parallel, without waiting for them to finish, and within a specific time frame, even when the total time of the processes takes more than the given time frame.
For example: let's say I have 100 HTTP requests that I want to launch in one second, separated by 10 ms each. Each request will take +/- 50 ms. I have a computer with 2 cores to run them.
In parallel that would be 100 tasks / 2 cores, i.e. 50 tasks each. The problem is that 50 tasks * 50 ms each is 2500 ms in total, so two and a half seconds to run the 100 tasks in parallel.
Would it be possible to launch all these tasks in 1s?
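(The arithmetic above only applies to CPU-bound work: HTTP requests are I/O-bound, so one thread can have many requests in flight at once. Since the question is language-agnostic, here is a sketch in Java using the JDK's async HTTP client; the URL is a placeholder:)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class Burst {
    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
            .newBuilder(URI.create("https://example.com/")) // placeholder URL
            .build();
        List<CompletableFuture<HttpResponse<String>>> inFlight = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            // sendAsync returns immediately, so the launching thread is free to
            // fire the next request 10 ms later, regardless of the core count
            inFlight.add(client.sendAsync(request, HttpResponse.BodyHandlers.ofString()));
            Thread.sleep(10);
        }
        // all 100 launched within ~1 s; now wait for the slow ones to complete
        // (join() will surface an exception if any request failed)
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
    }
}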

Dynamically evaluate load and create threads depending on machine performance

Hi, I have started to work on a project where I use parallel computing to split job loads among multiple machines, for things such as hashing and other forms of mathematical calculations. I'm using C++.
It runs on a master/slave (or server/client, if you prefer) model where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
i.e.: Client 1 --> calculate(0 to 333)
Client 2 --> calculate(334 to 666)
Client 3 --> calculate(667 to 999)
I wanted to further enhance the speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know if one of you knows a way to evaluate the load a thread puts on the CPU and extrapolate the number of threads that can be run concurrently on the machine.
There are two ways I see of doing this:
Start threads one by one, evaluating the CPU load each time, and stop when I reach a certain predefined ceiling (50%, 75%, etc.). But this has the flaw that I'll have to stop and re-split the job every time I start a new thread. (A sketch of this idea follows below.)
(And this is the more complex option:)
Run some kind of test thread, calculate its impact on the CPU base load, extrapolate the number of threads that can be run on the machine, and then start the threads and split the jobs accordingly.
Any idea or pointer is welcome. Thanks in advance!
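A sketch of the first idea (the calibration loop), written here in Java rather than the project's C++ to keep it self-contained; the 75% ceiling, the 2-second settle time, and the busy-work function are all assumptions:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadCalibrator {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cores = Runtime.getRuntime().availableProcessors();
        double ceiling = 0.75 * cores; // assumed load ceiling (75% of cores)
        List<Thread> workers = new ArrayList<>();
        while (workers.size() < 4 * cores) { // hard cap as a safety net
            Thread t = new Thread(ThreadCalibrator::busyWork);
            t.setDaemon(true);
            t.start();
            workers.add(t);
            Thread.sleep(2000); // crude: the load average reacts slowly
            double load = os.getSystemLoadAverage(); // -1 where unsupported
            System.out.printf("threads=%d load=%.2f%n", workers.size(), load);
            if (load < 0 || load >= ceiling) break;
        }
        System.out.println("usable worker threads: " + workers.size());
    }

    private static void busyWork() {
        long x = 1;
        while (true) x = x * 31 + 17; // placeholder CPU-bound work
    }
}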

Unbalanced load (v2.0) using MPI

(the problem is embarrassingly parallel)
Consider an array of 12 cells:
|__|__|__|__|__|__|__|__|__|__|__|__|
and four (4) CPUs.
Naively, I would run 4 parallel jobs, feeding 3 cells to each CPU.
|__|__|__|__|__|__|__|__|__|__|__|__|
=========|========|========|========|
1 CPU 2 CPU 3 CPU 4 CPU
BUT it appears that each cell has a different evaluation time; some cells are evaluated very quickly, and some are not.
So, instead of wasting a "relaxed" (idle) CPU, I am thinking of feeding EACH CPU one cell at a time and continuing until the entire job is done.
Namely:
at the beginning:
|____|____|____|____|____|____|____|____|____|____|____|____|
1cpu 2cpu 3cpu 4cpu
If 2cpu finishes its job at cell 2, it can jump to the first empty cell, 5, and continue working:
|____|done|____|____|____|____|____|____|____|____|____|____|
1cpu 3cpu 4cpu 2cpu
|-------------->
If 1cpu finishes, it can take the sixth cell:
|done|done|____|____|____|____|____|____|____|____|____|____|
3cpu 4cpu 2cpu 1cpu
|------------------------>
and so on, until the full array is done.
QUESTION:
I do not know a priori which cell is "quick" and which cell is "slow", so I cannot distribute the CPUs according to the load (more CPUs for slow cells, fewer for quick ones).
How can one implement such an algorithm for dynamic evaluation with MPI?
Thanks!
UPDATE
I use a very simple approach to divide the entire job into chunks with MPI-IO:
Given array[NNN] and nprocs, the number of available working units:
// each rank processes its own fixed-size chunk of the array
for (int i = 0; i < NNN / nprocs; ++i)
{
    do_what_I_need(start + i);
}
MPI_File_write(...);  // write this rank's results to the common output file
where "start" corresponds to particular rank number. In simple words, I divide the entire NNN array into fixed size chunk according to the number of available CPU and each CPU performs its chunk, writes the result to (common) output and relaxes.
IS IT POSSIBLE to change the code (Not to completely re-write in terms of Master/Slave paradigm) in such a way, that each CPU will get only ONE iteration (and not NNN/nprocs) and after it completes its job and writes its part to the file, will Continue to the next cell and not to relax.
Thanks!
There is a well-known parallel programming pattern, known under many names, some of which are: bag of tasks, master/worker, task farm, work pool, etc. The idea is to have a single master process which distributes cells to the other processes (workers). Each worker runs an infinite loop in which it waits for a message from the master, computes something, and then returns the result. The loop is terminated by having the master send a message with a special tag. The wildcard tag value MPI_ANY_TAG can be used by the worker to receive messages with different tags.
The master is more complex. It also runs a loop, but until all cells have been processed. Initially it sends each worker a cell and then starts the loop. In this loop it receives a message from any worker using the wildcard source value MPI_ANY_SOURCE and, if there are more cells to be processed, sends one of them to the same worker that returned the result. Otherwise it sends a message with the tag set to the termination value.
There are many readily available implementations of this model on the Internet and even some on Stack Overflow (for example this one). Mind that this scheme requires one additional MPI process that often does very little work. If this is unacceptable, one can run a worker loop in a separate thread.
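The scheduling idea itself is independent of MPI. Here is a minimal sketch of the bag-of-tasks pattern using plain Java threads and a shared counter instead of a master process and messages (the cell count and per-cell work are made up):

import java.util.concurrent.atomic.AtomicInteger;

public class BagOfTasks {
    static final int CELLS = 12;                 // the 12 cells from the question
    static final AtomicInteger next = new AtomicInteger(0);

    // stand-in for the per-cell work; evaluation time varies per cell
    static void evaluate(int cell) throws InterruptedException {
        Thread.sleep(cell % 3 == 0 ? 200 : 20);
        System.out.println(Thread.currentThread().getName() + " finished cell " + cell);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];        // the 4 "CPUs"
        for (int w = 0; w < workers.length; w++) {
            workers[w] = new Thread(() -> {
                int cell;
                // each worker claims the next unprocessed cell until none remain,
                // so fast workers naturally pick up more cells than slow ones
                while ((cell = next.getAndIncrement()) < CELLS) {
                    try { evaluate(cell); } catch (InterruptedException e) { return; }
                }
            }, "cpu-" + (w + 1));
            workers[w].start();
        }
        for (Thread t : workers) t.join();
    }
}

In MPI, the shared counter is replaced by the master process: a worker "claims" the next cell by sending its result to the master and receiving a new cell index (or the termination tag) in return.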
You want to implement a kind of client-server architecture where workers ask the server for work whenever they are out of work.
Depending on the size of the chunks and the speed of your communication between workers and server, you may want to adjust the size of the chunks sent to workers.
To answer your updated question:
Under the master/slave (or worker pool, if that's how you prefer it to be labelled) model, you will basically need a task scheduler. The master should have information about what work has been done and what still needs to be done. The master gives each process some work to do, then sits and waits until a process completes (using nonblocking receives and a wait-all). Once a process completes, have it send its data to the master and then wait for the master to respond with more work. Continue this until the work is done.

2 different task_group instances not running tasks in parallel

I wanted to replace the use of normal threads with the task_group class from PPL, but I ran into the following problem:
I have a class A with a task_group member;
I create 2 different instances of class A;
I start a task in the task_group of the first A instance (using run);
after a few seconds I start a task in the task_group of the second A instance.
I'm expecting the two tasks to run in parallel, but the second task waits for the first task to finish and only then starts.
This happens only in my application, where the tasks are started from a static function. I tried the same scenario in a test application and the tasks run correctly in parallel.
After spending several hours trying to figure this out, I switched back to normal threads.
Does anyone know why the Concurrency Runtime behaves this way, or how I can avoid it?
EDIT
The problem was that it was running on a single-core CPU, and the Concurrency Runtime optimizes for throughput. I wonder if the Microsoft Parallel Patterns Library has the concept of an active object, or something along those lines, so that you can specify that the task you are about to launch is to be executed in parallel with the thread you start it from...
The response can be found here: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/thread/85a84373-4c3d-4862-bff3-9a21ffe82493
For single-core machines, this is the expected default behavior. It can be changed.
By default, the number of tasks that can run in parallel = the number of hardware threads (number of cores). This improves the raw throughput and efficiency of completing tasks.
However, there are a number of situations where a developer would want many tasks running in parallel, regardless of the number of cores. In this case you have two options:
Oversubscribe locally.
In your example above, you would use:
void lengthyTask()
{
    // temporarily add a virtual processor so this lengthy/blocking task
    // does not prevent other tasks from running concurrently
    Concurrency::Context::Oversubscribe(true);
    // ... do the lengthy task (or a blocking call) ...
    Concurrency::Context::Oversubscribe(false);
}
Oversubscribe the scheduler when you start the application:
// allow twice as many concurrent tasks as there are processor cores
Concurrency::SchedulerPolicy policy(1, Concurrency::MaxConcurrency,
                                    Concurrency::GetProcessorCount() * 2);
Concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);