Processing tasks in parallel in a specific time frame without waiting for them to finish - concurrency

This is a question about concurrency/parallelism and processes. I am not sure how to express it, so please forgive my ignorance.
It is not related to any specific language, although I'm using Rust lately.
The question is whether it is possible to launch processes concurrently/in parallel, without waiting for them to finish, within a specific time frame, even when the total running time of the processes is more than the given time frame.
For example: let's say I have 100 HTTP requests that I want to launch in one second, separated by 10 ms each. Each request will take around 50 ms. I have a computer with 2 cores to run them.
In parallel that would be 100 tasks / 2 cores, 50 tasks each. The problem is that 50 tasks * 50 ms each is 2500 ms in total, so two and a half seconds to run the 100 tasks in parallel.
Would it be possible to launch all these tasks in 1s?
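For intuition: if each request spends its ~50 ms waiting on the network rather than computing, that wait does not occupy a core, so far more than 2 requests can be in flight at once. A minimal C++ sketch of the idea (simulating each request as a 50 ms sleep; in Rust the idiomatic route would be async tasks on a runtime):

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        using namespace std::chrono;
        auto start = steady_clock::now();
        std::vector<std::thread> tasks;
        for (int i = 0; i < 100; ++i) {
            tasks.emplace_back([] {
                std::this_thread::sleep_for(milliseconds(50));  // simulated 50 ms request
            });
            std::this_thread::sleep_for(milliseconds(10));      // 10 ms launch spacing
        }
        for (auto& t : tasks) t.join();
        auto ms = duration_cast<milliseconds>(steady_clock::now() - start);
        std::cout << "total: " << ms.count() << " ms\n";        // ~1050 ms, not 2500 ms
    }

Even on a 2-core machine this finishes in about 1050 ms rather than 2500 ms, because a sleeping (blocked) thread consumes no CPU; an async runtime achieves the same overlap with far fewer OS threads.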

Related

How does a parallel multi-instance loop work in Camunda 7.16.6

I'm using the camunda-engine 7.16.6.
I have a process with a multi-instance loop like this one that repeats in parallel 1000 times.
This loop is executed in parallel. My assumption was that n Camunda executors now start their work, so executor #1 executes Task 2, then Task 3, then Task 4, and executor #2 and all the others do the same. So after a short while, at least some of the 1000 iterations would have finished all three tasks in the loop.
However, what I have observed so far is that Task 2 gets executed 1000 times, and only when that is finished does Task 3 get executed 1000 times, and so on.
I also noticed that Camunda takes a lot of time by itself, outside of the tasks.
Is my observation correct, and is this behavior documented somewhere? Can you change that behavior?
I've run some tests and can explain the behavior:
The order of tasks and the overall time to finish is influenced by whether or not there are transaction boundaries (async after, the red bars in the screenshot).
It is described a bit here.
By setting the asyncBefore='true' attribute we introduce an additional save point at which the process state will be persisted and committed to the database. A separate job executor thread will continue the process asynchronously by using a separate database transaction. In case this transaction fails the service task will be retried and eventually marked as failed - in order to be dealt with by a human operator.
repeat 1000 times, parallel, no transaction
One Job Executor rushes through the process; the order is 1, [2,3,4|2,3,4|...], 5. Not really parallel. But this is as documented here:
The Job Executor makes sure that jobs from a single process instance are never executed concurrently.
It can be turned off if you are an expert and know what you are doing (and have understood this section).
Overall this took around 5 seconds.
repeat 1000 times, parallel, with transaction
Here, due to the transactions, there will be 1000 waiting jobs for Task 7, and each finished Task 7 creates another job for Task 8. Since jobs are executed in the order they appear in the database (see here), the order is 6,[7,7,7...8,8,8...9,9,9...],10.
The transaction handling, which includes maintaining the variables, has a huge impact on the runtime; with transactions, parallel mode takes 06:33 minutes.
If you turn off the exclusive flag it takes around 4:30 minutes, but at the cost of thousands of OptimisticLockingExceptions.
AFAIK the recommended approach to gain true parallelism would be to move Task 7, Task 8 and Task 9 to a separate process and spawn 1000 instances of that process.
You can influence the order of execution if you tweak the job executor settings & priority (see here), but that seems to require the exclusive flag, too. If you do that, the order will be 6,[7,7,7|8,9,8,9 (in random order),...],10.
repeat 1000 times, sequential, no transaction
The order is 11,[12,13,14|12,13,14,...],15.
This takes only 2 seconds.
repeat 1000 times, sequential, with transaction
The order is, as expected, 16,[17,18,19|17,18,19|...],20.
Due to the transactions this takes 02:45 minutes.
I heard from colleagues that one should use parallel mode only if it involves long-running/blocking tasks like a human task: in sequential mode there would only be one human task, and after that one is done, another would be created; in parallel mode, you have 1000 human tasks at once, which is more likely the desired behavior.
Parallel performance seems to be improved in Camunda 8

Performance testing using Ultimate thread group

I want to use the Ultimate Thread Group for my test, with 2100 concurrent users and a Synchronizing Timer with the number of simulated users to group by set to 100.
I want to configure the thread group to run for 10 minutes.
I am not sure how to distribute that across initial delay, startup time, hold load, and shutdown time.
We cannot suggest anything meaningful because we don't know what your desired load pattern is.
Normally people configure thread arrival/departure so that there is:
A ramp-up phase - the load increases gradually, which allows you to correlate the increasing load with changing metrics like response time, transactions per second, errors per second, etc.
A "plateau" phase - check how the system behaves under constant sustained load.
A ramp-down phase - allows you to check whether the system returns to normal when the load decreases.
If you don't have better ideas, go for 33% each for ramp-up, plateau and ramp-down; in your case it will be easier to take 3 minutes for ramp-up and ramp-down and 4 minutes to hold the load.
The relevant Ultimate Thread Group configuration:
With regards to the Synchronizing Timer: it acts as a rendezvous point for all Samplers in its scope. Given a ramp-up of 180 seconds for 2100 users, about 11.7 users (2100 / 180) arrive every second, so the first batch of 100 users will be released at roughly the 9th second of your test (100 / 11.7 ≈ 8.6 s), and from then on requests will be executed in "spikes" of 100 users each.

simultaneous tasks with 8051

Is there any way to run two tasks with the 8051 μC simultaneously? For example,
Task 1:
    Delay 1 sec
    P2.B2 = 1
    Delay 1 sec
    P2.B2 = 0
Task 2:
    If P1.B0 = 1
        P2.B3 = 1
So at any time: when the switch connected to the input pin reads 1, the LED at P2.3 turns ON, while the LED at P2.2 keeps blinking.
A task is something that is typically provided by the underlying OS. If you are running on a bare-metal system without any OS, you have no tasks in the first place.
But your application can build its own tasks. The job is more or less easy: you have to build a scheduler, typically triggered by a hardware clock, for task switching, create stacks for each of the tasks, and some control structures for the maintenance of the tasks. As you have no MMU and no memory protection on bare-metal systems like the 8051, you can simply modify stack pointers to do the task switching.
That is exactly what a library like FreeRTOS can do for you. As far as I know, there is a port for the 8051 available; searching the web returns a lot of links for "8051 FreeRTOS". Maybe there are more libraries offering tasks for you.
But often the overhead of scheduling and all the administrative effort is much too high. Running an endless loop which does some jobs by reading some kind of queues or flags is much easier and often the more efficient solution. Running some jobs in interrupt service routines also fits bare-metal requirements well.
I assume you are running on bare metal with no battery-saving requirements, and that you can already write a program, load it to your device and run it. What I suggest you do is roughly the following.
This program should have a main loop, which at its simplest would be like this:
MAX_TIME is the largest possible value of the system clock, should never be reached
task_table is a table with:
    next execution time as system clock time (MAX_TIME means disabled)
    function pointer
initialize task_table with the three tasks below
forever:
    for each task with time 0:
        set task time to MAX_TIME (disable)
        call task function (task probably enables itself or another task)
    find the task with the lowest non-zero time in task_table
    if that task's time is in the past or is now:
        set task time to MAX_TIME (disable)
        call task function (task probably enables itself or another task)
Time-0 tasks are checked separately, and then the timed tasks, so that the time-0 tasks don't block each other or the timed tasks from ever being called. The same could be achieved in different ways; this is just an example, and a compilable sketch follows the task definitions below.
Then your requirements really call for 3 "tasks":
task_p2_b2_0:
    P2.B2 = 0
    enable task task_p2_b2_1 at current_time + 1 second
task_p2_b2_1:
    P2.B2 = 1
    enable task task_p2_b2_0 at current_time + 1 second
task_p1_b0_poll:
    if P1.B0 = 1:
        P2.B3 = 1
    enable task task_p1_b0_poll at time 0 (or current_time + 10 ms or whatever)
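As a concrete illustration, here is a hedged sketch of that task table in C++, simulated on a host machine; the millisecond clock and the port accessors are stand-ins for what would be a timer-ISR tick counter and SFR bit writes on a real 8051 (where you would write this in C with your toolchain's headers):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Stand-ins (assumptions for this sketch) for the hardware clock and ports.
    static uint64_t now_ms() {
        using namespace std::chrono;
        return duration_cast<milliseconds>(
            steady_clock::now().time_since_epoch()).count();
    }
    static void set_p2_b2(bool on) { std::printf("P2.B2 = %d\n", on); }
    static void set_p2_b3(bool on) { std::printf("P2.B3 = %d\n", on); }
    static bool get_p1_b0()        { return false; }   // pretend the switch is open

    constexpr uint64_t MAX_TIME = UINT64_MAX;          // "disabled" marker

    struct Task { uint64_t next_run; void (*fn)(); };  // next time + function pointer
    static Task task_table[3];

    static void task_p2_b2_0() {
        set_p2_b2(false);
        task_table[1].next_run = now_ms() + 1000;      // enable task_p2_b2_1
    }
    static void task_p2_b2_1() {
        set_p2_b2(true);
        task_table[0].next_run = now_ms() + 1000;      // enable task_p2_b2_0
    }
    static void task_p1_b0_poll() {
        if (get_p1_b0()) set_p2_b3(true);
        task_table[2].next_run = now_ms() + 10;        // poll again in ~10 ms
    }

    int main() {
        task_table[0] = { MAX_TIME, task_p2_b2_0 };
        task_table[1] = { now_ms(), task_p2_b2_1 };    // start with the LED on
        task_table[2] = { now_ms(), task_p1_b0_poll };
        for (;;) {                                     // the "forever" loop
            for (Task& t : task_table) {
                if (t.next_run != MAX_TIME && t.next_run <= now_ms()) {
                    t.next_run = MAX_TIME;             // disable before running
                    t.fn();                            // task re-arms itself or others
                }
            }
        }
    }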
Future development: the above is for a small number of static tasks. Iterating over a 5-10 item table is so fast that there is no point trying to optimize it. Once you have more tasks than that, you should consider using a priority heap to store them. You could also consider making the main loop sleep when it has nothing to do, and using an interrupt to wake it up (timer interrupt, serial port interrupt, pin activation interrupt, etc.). Also, you could have different task types, such as tasks which are activated when there is some I/O (button press, byte from serial port, whatever). At the upper end, adding features like this ends up at a complete operating system, but for simple things what I wrote above is really enough.

Dynamically Evaluate load and create Threads depending on machine performance

Hi, I have started to work on a project where I use parallel computing to split job loads among multiple machines, for things such as hashing and other forms of mathematical calculations. I'm using C++.
It runs on a master/slave (or server/client, if you prefer) model where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
e.g.: client 1 --> calculate(0 to 333)
client 2 --> calculate(334 to 666)
client 3 --> calculate(667 to 999)
I wanted to further enhance the speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know if any of you knows a way to evaluate the load a thread puts on the CPU and extrapolate the number of threads that can be run concurrently on the machine.
There are two ways I see of doing this:
Start threads one by one, evaluating the CPU load each time, and stop when I reach a certain fixed ceiling (50%, 75%, etc.). But this has the flaw that I'll have to stop and re-split the job every time I start a new thread.
(And this is the more complex option.) Run some kind of test thread, calculate its impact on the CPU's base load, extrapolate the number of threads that can be run on the machine, and then start threads and split jobs accordingly.
Any idea or pointer is welcome, thanks in advance!
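One common starting point, rather than probing CPU load empirically: for CPU-bound work like hashing, you rarely gain from more threads than the machine has hardware threads, and std::thread::hardware_concurrency() lets each client discover that number. A minimal sketch (calculate() and the job count are placeholders taken from the question):

    #include <algorithm>
    #include <iostream>
    #include <thread>
    #include <vector>

    void calculate(int lo, int hi) { /* the client's real per-range work */ }

    int main() {
        // hardware_concurrency() may return 0 if unknown; fall back to 2.
        unsigned n = std::max(2u, std::thread::hardware_concurrency());
        const int jobs = 1000;                    // illustrative job count
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; ++i) {
            int lo = static_cast<int>(i * jobs / n);
            int hi = static_cast<int>((i + 1) * jobs / n) - 1;
            pool.emplace_back(calculate, lo, hi); // one worker per hardware thread
        }
        for (auto& t : pool) t.join();
        std::cout << "ran " << n << " workers\n";
    }

A client could report this count to the server at connect time, so the server splits by total workers rather than by client count; measuring live CPU load would mainly matter if the clients share their machines with other work.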

how to detect if a thread or process is getting starved due to OS scheduling

This is on Linux. The app is written in C++ with the ACE library.
I suspect that one of the threads in the process sometimes gets blocked for an unusually long time (5 to 40 seconds). The app runs fine most of the time, except that a couple of times a day it has this issue. There are 5 other similar apps running on the box which are also I/O-bound due to heavy incoming socket data.
I would like to know if there is anything I can do programmatically to see if the thread/process is getting its time slice.
If a process is being starved out, self-monitoring within that process would not be very productive. But if you just want that process to notice it hasn't been run in a while, it can call times periodically and compare the relative difference in elapsed time with the relative difference in scheduled time (sum the tms_utime and tms_cutime fields if you want to count waiting for children as productive time, and sum in the tms_stime and tms_cstime fields if you count kernel time spent on your behalf as productive time). For per-thread times, the only way I know of is to consult the /proc filesystem.
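For instance, a hedged self-check along those lines in C++ (the 1-second sleep stands in for the app's normal work window):

    #include <sys/times.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        struct tms t0, t1;
        clock_t wall0 = times(&t0);
        sleep(1);                                            // stand-in for the real work window
        clock_t wall1 = times(&t1);

        long hz = sysconf(_SC_CLK_TCK);                      // clock ticks per second
        clock_t elapsed   = wall1 - wall0;
        clock_t scheduled = (t1.tms_utime - t0.tms_utime)    // user time
                          + (t1.tms_stime - t0.tms_stime);   // kernel time on our behalf
        // Add the tms_cutime/tms_cstime deltas if waited-for children count too.
        std::printf("elapsed %.2f s, scheduled %.2f s\n",
                    (double)elapsed / hz, (double)scheduled / hz);
    }

A scheduled/elapsed ratio far below what the thread should be getting is the hint that it is being starved.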
A high priority external process or high priority thread could externally monitor processes (and threads) of interest by reading the appropriate /proc/<pid>/stat entries for the process (and /proc/<pid>/task/<tid>/stat for the threads). The user times are found in the 14th and 16th fields of the stat file. The system times are found in the 15th and 17th fields. (The field positions are accurate for my Linux 2.6 kernel.)
Between two time points, you determine the amount of elapsed time that has passed (a monitor process or thread would usually wake up at regular intervals). Then the difference between the cumulative processing times at each of those time points represents how much time the thread of interest got to run during that time. The ratio of processing time to elapsed time would represent the time slice.
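A hedged sketch of that external monitor in C++, reading fields 14 and 15 (utime/stime) from /proc/<pid>/stat; for a thread, point it at /proc/<pid>/task/<tid>/stat instead:

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <cstdio>
    #include <unistd.h>

    // Returns utime + stime in clock ticks for a pid, or -1 on error.
    long scheduled_ticks(int pid) {
        std::ifstream f("/proc/" + std::to_string(pid) + "/stat");
        std::string line;
        if (!std::getline(f, line)) return -1;
        // Field 2 (comm) may contain spaces, so resume after its closing ')'.
        std::istringstream rest(line.substr(line.rfind(')') + 2));
        std::string field;
        long utime = 0, stime = 0;
        for (int i = 3; i <= 15 && (rest >> field); ++i) {
            if (i == 14) utime = std::stol(field);   // user time
            if (i == 15) stime = std::stol(field);   // system time
        }
        return utime + stime;
    }

    int main() {
        // Sample twice; the delta over the elapsed wall time is the time slice.
        long a = scheduled_ticks(getpid());
        sleep(1);
        long b = scheduled_ticks(getpid());
        std::printf("ticks in the last second: %ld\n", b - a);
    }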
One last bit of info: On Linux, I use the following to obtain the tid of the current thread for examining the right task in the /proc/<pid>/task/ directory:
tid = syscall(__NR_gettid);   /* needs <sys/syscall.h> and <unistd.h> */
I do this because I could not find the gettid system call actually exported by any library on my system, even though it was documented. But it might be available on yours.