2 different task_group instances not running tasks in parallel - C++

I wanted to replace the use of normal threads with the task_group class from PPL, but I ran into the following problem:
I have a class A with a task_group member,
create 2 different instances of class A,
start a task in the task_group of the first A instance (using run),
after a few seconds start a task in the task_group of the second A instance.
I'm expecting the two tasks to run in parallel, but the second task waits for the first task to finish and only then starts.
This is happening only in my application, where the tasks are started from a static function. I tried the same scenario in a test application and there the tasks run correctly in parallel.
After spending several hours trying to figure this out I switched back to normal threads.
Does anyone know why the Concurrency Runtime behaves this way, or how I can avoid it?
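A minimal sketch of the setup described (class and task names are invented for illustration):

#include <ppl.h>
#include <chrono>
#include <thread>

// Hypothetical reconstruction of the scenario above.
class A
{
public:
    void startWork()
    {
        m_tasks.run([]
        {
            // simulate a lengthy task
            std::this_thread::sleep_for(std::chrono::seconds(5));
        });
    }
    void wait() { m_tasks.wait(); }
private:
    Concurrency::task_group m_tasks;
};

int main()
{
    A first, second;
    first.startWork();
    std::this_thread::sleep_for(std::chrono::seconds(2));
    second.startWork();   // expected to overlap with the first task
    first.wait();
    second.wait();
}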
EDIT
The problem was that it was running on a single-core CPU, and the Concurrency Runtime optimizes for throughput. I wonder if the Microsoft Parallel Patterns Library has the concept of an active object, or something along those lines, so that you can specify that the task you are about to launch should be executed in parallel with the thread you start it from...

The response can be found here: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/thread/85a84373-4c3d-4862-bff3-9a21ffe82493
For one-core machines, this is the expected "default" behavior. It can be changed.
By default, the number of tasks that can run in parallel = the number of hardware threads (number of cores). This improves the raw throughput and efficiency of completing tasks.
However, there are a number of situations where a developer would want many tasks running in parallel, regardless of the number of cores. In this case you have two options:
Oversubscribe locally.
In your example above, you would use:
void lengthyTask()
{
    Context::Oversubscribe(true);
    // ...do a lengthy task (or a blocking task)...
    Context::Oversubscribe(false);
}
Oversubscribe the scheduler when you start the application.
SchedulerPolicy policy(1, MaxConcurrency, GetProcessorCount() * 2);
Scheduler::SetDefaultSchedulerPolicy(policy);
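Putting the second option together (a sketch based on the answer above; the policy must be set before the Concurrency Runtime creates its default scheduler, i.e. before the first task is started):

#include <concrt.h>
#include <ppl.h>
using namespace Concurrency;

int main()
{
    // One policy key: allow twice as many concurrent tasks as cores.
    SchedulerPolicy policy(1, MaxConcurrency, GetProcessorCount() * 2);
    Scheduler::SetDefaultSchedulerPolicy(policy);

    task_group a, b;
    a.run([] { /* lengthy task 1 */ });
    b.run([] { /* lengthy task 2 */ });   // can now overlap even on one core
    a.wait();
    b.wait();
}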

Related

How does a parallel multi-instance loop work in Camunda 7.16.6

I'm using the camunda-engine 7.16.6.
I have a process with a multi-instance loop like this one that repeats in parallel 1000 times.
This loop is executed in parallel. My assumption was that n Camunda executors would now start their work, so executor #1 executes Task 2, then Task 3, then Task 4, and executor #2 and all the others do the same. So after a short while, at least some of the 1000 iterations would have finished all three tasks in the loop.
However, what I have observed so far is that Task 2 gets executed 1000 times, and only when that is finished does Task 3 get executed 1000 times, and so on.
I also noticed that Camunda itself takes a lot of time, outside of the tasks.
Is my observation correct, and is this behavior documented somewhere? Can you change that behavior?
I've run some tests and can explain the behavior:
The order of tasks and the overall time to finish are influenced by whether or not there are transaction boundaries (async after, the red bars in the screenshot).
It is described a bit here.
By setting the asyncBefore='true' attribute we introduce an additional save point at which the process state will be persisted and committed to the database. A separate job executor thread will continue the process asynchronously by using a separate database transaction. In case this transaction fails the service task will be retried and eventually marked as failed - in order to be dealt with by a human operator.
repeat 1000 times, parallel, no transaction
One job executor rushes through the process; the order is 1, [2,3,4|2,3,4|...], 5. Not really parallel. But this is as documented here:
The Job Executor makes sure that jobs from a single process instance are never executed concurrently.
It can be turned off if you are an expert and know what you are doing (and have understood this section).
Overall this took around 5 seconds.
repeat 1000 times, parallel, with transaction
Here, due to the transactions, there will be 1000 waiting jobs for Task 7, and each finished Task 7 creates another job for Task 8. Since jobs are executed in their database order (see here), the order is 6, [7,7,7...8,8,8...9,9,9...], 10.
The transaction handling, which includes maintaining the variables, has a huge impact on the runtime: with transactions in parallel mode it takes 06:33 minutes.
If you turn off the exclusive flag it takes around 4:30 minutes, but at the cost of thousands of OptimisticLockingExceptions.
AFAIK the recommended approach to gain true parallelism would be to move Task 7, Task 8 and Task 9 to a separate process and spawn 1000 instances of that process.
You can influence the order of execution if you tweak the job executor settings & priority (see here), but that seems to require the exclusive flag, too. If you do that, the order will be 6, [7,7,7|8,9,8,9 (in random order),...], 10.
repeat 1000 times, sequential, no transaction
The order is 11, [12,13,14|12,13,14,...], 15.
This takes only 2 seconds.
repeat 1000 times, sequential, with transaction
The order is, as expected, 16, [17,18,19|17,18,19|...], 20.
Due to the transactions this takes 02:45 minutes.
I heard from colleagues that one should use parallel mode only if it involves long-running/blocking tasks such as a human task: in sequential mode there would be only one human task, and after that one is done, another will be created; in parallel mode, you have 1000 human tasks at once, which is more likely the desired behavior.
Parallel performance seems to be improved in Camunda 8.

Akka Dispatcher Thread creation

I have been working with the Akka actor model. I have a use case where more than 1000 actors will be active and I have to process those actors. I thought of controlling the thread count through the configuration defined in application.conf.
But the number of dispatcher threads created in my application leaves me helpless in tuning the dispatcher configuration. Each time I restart my application, I see a different number of dispatcher threads created (I have checked this via a thread dump each time after starting the application).
The thread count is not even equal to the one I defined in parallelism-min. Due to this low thread count, my application is processing very slowly.
Checking the number of cores on my machine with the code below:
Runtime.getRuntime().availableProcessors();
displays 40. But the number of dispatcher threads created is less than 300, even though I configured the parallelism as 500.
Following is my application.conf file:
consumer-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 500
    parallelism-factor = 20.0
    parallelism-max = 1000
  }
  shutdown-timeout = 1s
  throughput = 1
}
May I know on what basis Akka creates dispatcher threads internally, and how I can increase the dispatcher thread count to increase the parallel processing of actors?
X-Post from discuss.lightbend.com
First let me answer the question directly.
A fork-join-executor will be backed by a java.util.concurrent.ForkJoinPool with its parallelism set to the parallelism implied by the dispatcher config (parallelism-factor * processors, but no larger than the max or smaller than the min). So, in your case, 800.
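To make that sizing rule concrete with the numbers from the question (a quick standalone check, not Akka code):

#include <algorithm>
#include <iostream>

int main()
{
    // parallelism-factor * processors, clamped to [parallelism-min, parallelism-max]
    const double factor = 20.0;
    const int processors = 40, pmin = 500, pmax = 1000;
    const int parallelism = std::clamp(static_cast<int>(factor * processors), pmin, pmax);
    std::cout << parallelism << '\n';   // prints 800
}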
And while I'm no expert on the implementation of the ForkJoinPool, the source for the Java implementation of ForkJoinPool says "All worker thread creation is on-demand, triggered by task submissions, replacement of terminated workers, and/or compensation for blocked workers.", and it has methods like getActiveThreadCount(), so it's clear that ForkJoinPool doesn't just naively create a giant pool of workers.
In other words, what you are seeing is expected: it’s only going to create threads as they are needed. If you really must have a gigantic pool of worker threads you could create a thread-pool-executor with a fixed-pool-size of 800. This would give you the implementation you are looking for.
But before you do so, I think you are entirely missing the point of actors and Akka. One of the reasons people like actors is that they are much more lightweight than threads and can give you far more concurrency than a thread can. (Also note that concurrency != parallelism, as noted in the documentation on concepts.) So trying to create a pool of 800 threads to back 1000 actors is very wasteful. The introduction in the Akka docs highlights that "Millions of actors can be efficiently scheduled on a dozen of threads".
I can't tell you exactly how many threads you need without knowing your application (for example, whether you have blocking behavior), but the defaults (which would give you a parallelism factor of 20) are probably just fine. Benchmark to be certain, but I really don't think you have a problem with too few threads. (The ForkJoinPool behavior you are observing seems to confirm this.)

Dynamically Evaluate load and create Threads depending on machine performance

Hi, I have started to work on a project where I use parallel computing to separate job loads among multiple machines, for things such as hashing and other forms of mathematical calculation. I'm using C++.
It is running on a master/slave (or server/client, if you prefer) model where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
e.g.: Client 1 --> calculate(0 to 333)
Client 2 --> calculate(334 to 666)
Client 3 --> calculate(667 to 999)
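A sketch of that split for any client count (assuming inclusive ranges, as in the example above):

#include <cstdio>

int main()
{
    const int jobs = 1000, clients = 3;
    int base = jobs / clients, extra = jobs % clients, start = 0;
    for (int c = 0; c < clients; ++c)
    {
        // the first 'extra' clients get one job more
        int count = base + (c < extra ? 1 : 0);
        std::printf("Client %d --> calculate(%d to %d)\n", c + 1, start, start + count - 1);
        start += count;
    }
}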
I wanted to further enhance the speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know if one of you knows a way to evaluate the load a thread puts on the CPU and to extrapolate the number of threads that can be run concurrently on the machine.
There are two ways I see of doing this:
I start threads one by one, evaluating the CPU load every time and stopping when I reach a certain predefined ceiling (50%, 75%, etc.), but this has the flaw that I'll have to stop and re-split the job every time I start a new thread.
(And this is the more complex option:)
Run some kind of test thread, calculate its impact on the CPU's base load, extrapolate the number of threads that can be run on the machine, and then start threads and split the jobs accordingly.
Any ideas or pointers are welcome, thanks in advance!
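A common first approximation, before any load measurement, is the concurrency level the hardware itself reports; a minimal sketch:

#include <iostream>
#include <thread>

int main()
{
    // Number of concurrent threads the hardware supports (logical cores);
    // the standard allows this to return 0 when it cannot be determined.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;   // conservative fallback
    std::cout << "suggested worker count: " << n << '\n';
}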

C++, How to implement a thread pool for tasks when each worker thread has to do a few different tasks

I am analyzing a video stream. For each new image (frame) I do the following 3 tasks sequentially:
Reduce the size of the image
Detect faces
Track the 4 most important faces in the image
In order to speed this up on a 4-CPU machine I use 4 worker threads, in the following way:
The main process gets an image, creates 4 worker threads, and splits the image into 4 quarters; each worker resizes its quarter of the image pixels. The main process waits for the threads to finish and assembles the quarters into the final image.
The main process creates 4 new workers for face detection. I detect 4 types of faces (male, female, baby, dog), and each worker thread is responsible for one type. The main process waits for the workers to finish and assembles the results (a list of all the detected faces).
The main process creates 4 new workers for face tracking. The 4 most important faces are selected and each worker tracks 1 face. The main process waits for completion.
The problem with my implementation is that I don't have a thread pool. On each video frame (roughly 30 times per second) the main process spawns and kills 12 workers (4 workers x 3 different tasks), so a lot of time is wasted on thread management. Currently I use the _beginthreadex() function to launch a worker thread for a specific task.
Desired solution: I want to create the 4 worker threads only once (each worker is able to perform all 3 different tasks). Those workers will exist throughout the entire video processing. On each video frame the main process will hand the image-resize task to the workers, then the detection, and later the tracking.
An ugly implementation would be for each worker thread to be one big function implementing all 3 tasks; the main process just tells each worker which task to execute (the worker has a 'switch' statement to select the requested task). This is an ugly solution because in the future, when I have 30 different tasks in the pipeline instead of 3, the code of the workers will become enormous. Moreover, this solution violates encapsulation, because it requires all the tasks to reside in the same function, and for each new task I need to change the code of the worker.
A clean implementation would be for the main process to give each worker a pointer to a function (which task to execute) and some parameters. That way I can easily add new tasks to my video processing pipeline without changing the code of the worker, because the code of the worker is generic (execute a function through a pointer and wait until a request with a new pointer arrives).
But the problem here is that each task has a different number of parameters (a different function interface), and the worker does not know how to call/execute the address of a given function.
What is a good way to use a thread pool in my case, while keeping the code as generic as possible and able to extend from 3 tasks to 30?
P.S. My code runs on any platform (Android, iOS, Linux, Windows Server, Windows Phone, etc.), so I prefer a generic solution over an OS-specific or compiler-specific one.
Your mistake is that you're too focused on using functions.
An old-fashioned approach is to have a base class Task with a member function virtual void operator()();. Then, for anything that should be a task, you make a subclass of Task that contains all of the relevant data needed to run and provides an appropriate override of operator().
A more modern approach is to make tasks instances of std::function<void(void)>, which not only works with the above approach but also covers the cases where you actually have a function with that signature, as well as lambdas. (Or, since you're doing multithreading, you may want something like std::packaged_task<void(void)>; I haven't really looked into how these are used.)
Either way, once a worker thread obtains a reference to a task, it simply invokes task(); to perform the task.
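A minimal sketch of such a pool (a fixed set of workers pulling std::function<void()> tasks off a shared queue; the names are illustrative):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed-size pool: workers run queued tasks until shutdown.
class ThreadPool
{
public:
    explicit ThreadPool(unsigned n)
    {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { run(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }

    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;   // drain, then exit
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();   // parameters travel inside the lambda's captures
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

Because the parameters travel inside each lambda's captures, every task reaches the workers through the same void() interface, e.g. pool.submit([&frame] { resizeQuarter(frame, 0); });, where resizeQuarter stands in for whatever your actual resize routine is.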

Thread pool understanding problem

I'm trying to figure out a way to align tasks in a thread pool.
I have 4 (parallel) threads and, at the beginning of each thread execution step, 10 tasks. My first attempt was to measure the time of each task and, based on that time, find the best combination of tasks in threads to get the best possible result. I'm attempting to write a parallel game engine based on this article: http://software.intel.com/en-us/articles/designing-the-framework-of-a-parallel-game-engine/
The problem is that my 'solution' does not work. Are there any other ways to align tasks?
(The project is in C++.)
To align tasks across parallel threads, use synchronization primitives: semaphores, events, mutexes.
Do not measure the time a task takes: thread scheduling is essentially nondeterministic.
If you're executing 4 tasks in parallel threads, the first two may finish even before the second two begin.
Here is how to do it properly (sketched with C++20's std::binary_semaphore; any semaphore primitive works the same way):
#include <semaphore>

std::binary_semaphore semaphore1{0};   // starts unavailable

void Thread1()
{
    task1();
    semaphore1.release();   // signal: task1 is done
}

void Thread2()
{
    task2();
    semaphore1.acquire();   // block until task1 has finished
    task3();
}
This way, task3 will always be executed after task1 finishes.
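For completeness, wiring the two threads up (a sketch, assuming task1 to task3 are defined elsewhere):

#include <thread>

int main()
{
    std::thread t1(Thread1);
    std::thread t2(Thread2);
    t1.join();
    t2.join();
}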
You should check out both Smoke (their multithreaded tech demo engine) and Nulstein from Intel's Visual Computing site; they both tackle thread-pooling of tasks (Nulstein has two variants: a TBB-based scheduler and a custom-built one).