How does Ray handle a number of jobs higher than the number of resources? - ray

Pretty basic question, but I wasn't able to find the answer in the docs.
I am developing a computationally intensive application in Python and I'm employing Ray to parallelize computation. I only use remote functions (thus no Actors) and I have 40 cores available.
What happens when the main script sends a number of tasks higher than 40? Is Ray able to handle it or should I always control the number of tasks in order to keep it under the number of available cores?

In this scenario, Ray will queue the tasks, and run them as CPUs become available.
For example,
import ray
import time

ray.init(num_cpus=10)

@ray.remote
def inner_task():
    return time.sleep(1)

@ray.remote
def outer_task():
    return ray.get([inner_task.remote() for _ in range(20)])

ray.get([outer_task.remote() for _ in range(20)])
In this scenario, there are 420 tasks (20 outer tasks plus 20 × 20 inner tasks), each of which requires a CPU. Ray will queue and run these tasks so that at most 10 are running at the same time, and will make sure they all finish (in roughly 40 seconds).
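So you don't need to throttle submissions for correctness. If you do want to cap the number of in-flight tasks (for example, to bound the memory held by pending results), a common pattern is ray.wait; here is a minimal sketch, where the cap of 40 is purely illustrative:

import ray

ray.init(num_cpus=10)

@ray.remote
def work(i):
    return i * i

max_in_flight = 40          # illustrative cap, not a Ray requirement
in_flight, results = [], []
for i in range(1000):
    if len(in_flight) >= max_in_flight:
        # block until at least one task finishes before submitting more
        ready, in_flight = ray.wait(in_flight, num_returns=1)
        results.extend(ray.get(ready))
    in_flight.append(work.remote(i))
results.extend(ray.get(in_flight))

This only bounds the resource footprint; Ray's scheduler already ensures that at most num_cpus tasks run at once.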

Related

Processing tasks in parallel in specific time frame without waiting for them to finish

This is a question about concurrency/parallelism and processes. I am not sure how to express it, so please forgive my ignorance.
It is not related to any specific language, although I have been using Rust lately.
The question is whether it is possible to launch processes concurrently/in parallel, without waiting for them to finish, within a specific time frame, even when the total running time of the processes exceeds the given time frame.
For example: let's say I have 100 HTTP requests that I want to launch within one second, separated by 10ms each. Each request will take roughly 50ms. I have a computer with 2 cores to make them.
In parallel that would be 100 tasks / 2 cores, 50 tasks each. The problem is that 50 tasks * 50ms each is 2500ms in total, so two and a half seconds to run the 100 tasks in parallel.
Would it be possible to launch all these tasks in 1s?
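If the requests are I/O-bound, the limiting factor is concurrency, not core count, so this is perfectly doable on 2 cores. A minimal sketch in Python with asyncio (the question is language-agnostic; asyncio.sleep(0.05) stands in for a ~50ms HTTP request):

import asyncio
import time

async def fake_request(i):
    await asyncio.sleep(0.05)      # stands in for a ~50ms HTTP request
    return i

async def main():
    start = time.perf_counter()
    tasks = []
    for i in range(100):
        tasks.append(asyncio.create_task(fake_request(i)))
        await asyncio.sleep(0.01)  # launch one every 10ms without waiting for it
    await asyncio.gather(*tasks)
    print(f"done in {time.perf_counter() - start:.2f}s")  # ~1.05s

asyncio.run(main())

All 100 requests finish roughly 50ms after the last one is launched, i.e. in about 1.05 seconds, because the tasks overlap instead of queuing per core.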

simultaneous tasks with 8051

Is there any way to run two tasks with the 8051 μC simultaneously? For example,
Task 1:
    Delay 1 sec
    P2.B2 = 1
    Delay 1 sec
    P2.B2 = 0
Task 2:
    If P1.B0 = 1
        P2.B3 = 1
So at any time, when the switch connected to P1.0 reads 1, the LED at P2.3 turns ON, while the LED at P2.2 keeps blinking.
A task is something that is typically provided by the underlying OS. If you are running on a bare-metal system without any OS, you have no tasks in the first place.
But your application can build its own tasks. The job is more or less straightforward: you have to build a scheduler, typically triggered by a hardware clock, for task switching, and create stacks for each task plus some control structures for maintaining the tasks. As you have no MMU and no memory protection on bare-metal systems like the 8051, you can simply modify stack pointers to do the task switching.
That is exactly what a library like FreeRTOS can do for you. As far as I know, there is a port for the 8051 available; searching the web returns a lot of links for 8051 FreeRTOS. There may be more libraries offering tasks as well.
But often the overhead of scheduling and all the administrative effort is much too high. Running an endless loop which does its jobs by reading some queues or flags is much easier and often the more efficient solution. Running some jobs in interrupt service routines also fits bare-metal requirements well.
I assume you are running on bare metal with no battery-saving requirements, and that you can already write a program, load it to your device, and run it. What I suggest you do is roughly this.
The program should have a main loop, which at its simplest would look like this:
MAX_TIME is the largest possible value of the system clock; it should never be reached
task_table is a table where each entry has:
    next execution time as system clock time (MAX_TIME means disabled)
    function pointer

initialize task_table with the three tasks below
forever:
    for each task with time 0:
        set task time to MAX_TIME (disable)
        call task function (task probably re-enables itself or another task)
    find the task with the lowest non-zero time in task_table
    if that task's time is in the past or is now:
        set task time to MAX_TIME (disable)
        call task function (task probably re-enables itself or another task)
Time-0 tasks are checked separately from the timed tasks, so that the time-0 tasks don't block each other or prevent the timed tasks from ever being called. The same could be achieved in different ways; this is just an example.
Then your requirements really call for 3 "tasks":

task_p2_b2_0:
    P2.B2 = 0
    enable task task_p2_b2_1 at current_time + 1 second

task_p2_b2_1:
    P2.B2 = 1
    enable task task_p2_b2_0 at current_time + 1 second

task_p1_b0_poll:
    if P1.B0 = 1:
        P2.B3 = 1
    enable task task_p1_b0_poll at time 0 (or current_time + 10 ms or whatever)
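To make the control flow concrete, here is the same scheme sketched in Python, just to pin down the logic; on a real 8051 this would be C or assembly, and the print calls and time.monotonic() stand in for the port writes and the hardware clock:

import time

MAX_TIME = float("inf")       # the "disabled" marker

def now():
    return time.monotonic()   # stands in for the hardware system clock

def enable(name, at):
    task_table[name][0] = at

def task_p2_b2_0():
    print("P2.B2 = 0")        # stands in for the port write
    enable("task_p2_b2_1", now() + 1.0)

def task_p2_b2_1():
    print("P2.B2 = 1")
    enable("task_p2_b2_0", now() + 1.0)

def task_p1_b0_poll():
    # if P1.B0 == 1: P2.B3 = 1  -- no real pin to read here
    enable("task_p1_b0_poll", now() + 0.01)

# task table: name -> [next execution time, function]
task_table = {
    "task_p2_b2_0":    [now(), task_p2_b2_0],
    "task_p2_b2_1":    [MAX_TIME, task_p2_b2_1],
    "task_p1_b0_poll": [now(), task_p1_b0_poll],
}

while True:  # the "forever" loop; the separate time-0 pass is folded into min() here
    name, entry = min(task_table.items(), key=lambda kv: kv[1][0])
    if entry[0] <= now():
        entry[0] = MAX_TIME   # disable before running
        entry[1]()            # the task re-enables itself or another task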
Future development: the above is for a small number of static tasks. Iterating over a 5-10 item table is so fast that there is no point trying to optimize it. Once you have more tasks than that, you should consider storing the tasks in a priority heap. Then you could also consider making the main loop sleep when it has nothing to do, and using an interrupt to wake it up (timer interrupt, serial port interrupt, pin activation interrupt, etc.). Also, you could have different task types, such as tasks which are activated by some IO event (button press, byte from serial port, whatever). At the upper end, adding features like this amounts to a complete operating system, but for simple things what I wrote above is really enough.

Dynamically Evaluate load and create Threads depending on machine performance

Hi, I have started work on a project where I use parallel computing to split job loads, such as hashing and other forms of mathematical calculation, among multiple machines. I'm using C++.
It runs on a master/slave (or server/client, if you prefer) model, where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
i.e.: client 1 --> calculate(0 to 333)
client 2 --> calculate(334 to 666)
client 3 --> calculate(667 to 999)
I wanted to further enhance the speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know if anyone knows a way to evaluate the load a thread puts on the CPU and extrapolate the number of threads that can be run concurrently on the machine.
There are two ways I see of doing this:
Start threads one by one, evaluating the CPU load each time, and stop when I reach a fixed ceiling (50%, 75%, etc.). But this has the flaw that I'll have to stop and re-split the job every time I start a new thread.
(And this is the more complex option.) Run some kind of test thread, calculate its impact on the CPU's base load, extrapolate the number of threads that can be run on the machine, and then start the threads and split the jobs accordingly.
Any ideas or pointers are welcome. Thanks in advance!
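One simple starting point, before measuring load at runtime, is to size the pool from the hardware's reported concurrency; in C++ that query is std::thread::hardware_concurrency(). A sketch of the idea in Python (processes rather than threads, since CPython threads don't parallelize CPU-bound work; calculate() is a placeholder for the real job):

import os
from concurrent.futures import ProcessPoolExecutor

def calculate(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))  # placeholder for the real work

if __name__ == "__main__":
    n_workers = os.cpu_count() or 1           # reported hardware concurrency
    chunks = [(i * 100, (i + 1) * 100) for i in range(10)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        print(list(pool.map(calculate, chunks)))

A load-measuring approach like your second option can then refine this initial guess instead of starting from zero.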

Locust.io: Controlling the request per second parameter

I have been trying to load test my API server using Locust.io on EC2 compute-optimized instances. It provides an easy-to-configure option for setting the wait time between consecutive requests and the number of concurrent users. In theory, rps = #_users / wait_time. However, while testing, this rule breaks down at fairly low values of #_users (in my experiment, around 1200 users). The variables hatch_rate and #_of_slaves, including in a distributed test setting, had little to no effect on the rps.
Experiment info
The test was done on a c3.4xlarge AWS EC2 compute node (AMI image) with 16 vCPUs, General Purpose SSD and 30GB RAM. During the test, CPU utilization peaked at 60% (depending on the hatch rate, which controls the concurrent processes spawned), staying under 30% on average.
Locust.io
Setup: uses pyzmq, with each vCPU core set up as a slave. Single POST request, with request body ~20 bytes and response body ~25 bytes. Request failure rate: < 1%, with a mean response time of 6ms.
Variables: time between consecutive requests set to 450ms (min: 100ms, max: 1000ms), hatch rate at a comfortable 30 per second, and RPS measured by varying #_users.
The RPS follows the equation as predicted for up to 1000 users. Increasing #_users beyond that has diminishing returns, with a cap reached at roughly 1200 users. #_users isn't the only independent variable here; changing the wait time affects the RPS as well. However, changing the experiment setup to a 32-core instance (c3.8xlarge) or 56 cores (in a distributed setup) doesn't affect the RPS at all.
So really, what is the way to control the RPS? Is there something obvious I am missing here?
(one of the Locust authors here)
First, why do you want to control the RPS? One of the core ideas behind Locust is to describe user behavior and let that generate load (requests, in your case). The question Locust is designed to answer is: how many concurrent users can my application support?
I know it is tempting to go after a certain RPS number, and sometimes I "cheat" as well by striving for an arbitrary RPS number.
But to answer your question: are you sure your Locust users don't end up in a deadlock? As in, they complete a certain number of requests and then become idle because they have no other task to perform? It's hard to tell what's happening without seeing the test code.
Distributed mode is recommended for larger production setups, and most real-world load tests I've run have been on multiple but smaller instances. But it shouldn't matter if you are not maxing out the CPU. Are you sure you are not saturating a single CPU core? Not sure which OS you are running, but if it's Linux, what is your load value?
While there is no direct way of controlling rps, you can try the constant_pacing and constant_throughput options for wait_time.
From docs
https://docs.locust.io/en/stable/api.html#locust.wait_time.constant_throughput
In the following example the task will always be executed once every second, no matter the task execution time:

class MyUser(User):
    wait_time = constant_throughput(1)

constant_pacing is the inverse of this.
So if you run with 100 concurrent users, the test will run at 100 rps (assuming each request takes less than 1 second in the first place).
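For completeness, a minimal runnable locustfile along these lines (the host and the /endpoint path are placeholders):

from locust import HttpUser, task
from locust.wait_time import constant_throughput

class MyUser(HttpUser):
    host = "http://localhost:8080"       # placeholder target
    wait_time = constant_throughput(1)   # each user aims for 1 request/s

    @task
    def post_something(self):
        # small POST body, as in the experiment description
        self.client.post("/endpoint", json={"ping": 1})

With 100 users this targets roughly 100 rps, provided each request completes in under a second.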

Unbalanced load (v2.0) using MPI

(the problem is embarrassingly parallel)
Consider an array of 12 cells:
|__|__|__|__|__|__|__|__|__|__|__|__|
and four (4) CPUs.
Naively, I would run 4 parallel jobs and feed 3 cells to each CPU.
|__|__|__|__|__|__|__|__|__|__|__|__|
=========|========|========|========|
  1 CPU    2 CPU    3 CPU    4 CPU
BUT, it appears that each cell has a different evaluation time; some cells are evaluated very quickly, and some are not.
So, instead of wasting a "relaxed CPU", I am thinking of feeding each CPU ONE cell at a time and continuing until the entire job is done.
Namely:
at the beginning:
|____|____|____|____|____|____|____|____|____|____|____|____|
 1cpu  2cpu  3cpu  4cpu
if 2cpu finishes its job at cell "2", it can jump to the first empty cell, "5", and continue working:
|____|done|____|____|____|____|____|____|____|____|____|____|
 1cpu        3cpu  4cpu  2cpu
      |-------------->
if 1cpu finishes, it can take the sixth cell:
|done|done|____|____|____|____|____|____|____|____|____|____|
           3cpu  4cpu  2cpu  1cpu
|------------------------->
and so on, until the full array is done.
QUESTION:
I do not know a priori which cells are "quick" and which are "slow", so I cannot distribute the CPUs according to the load (more CPUs for slow cells, fewer for quick ones).
How can one implement such an algorithm for dynamic evaluation with MPI?
Thanks!!!!!
UPDATE
I use a very simple approach to divide the entire job into chunks, with MPI-IO:
given array[NNN] and nprocs, the number of available working units:
for (int i = 0; i < NNN/nprocs; ++i)
{
    do_what_I_need(start + i);
}
MPI_File_write(...);
where "start" corresponds to particular rank number. In simple words, I divide the entire NNN array into fixed size chunk according to the number of available CPU and each CPU performs its chunk, writes the result to (common) output and relaxes.
IS IT POSSIBLE to change the code (Not to completely re-write in terms of Master/Slave paradigm) in such a way, that each CPU will get only ONE iteration (and not NNN/nprocs) and after it completes its job and writes its part to the file, will Continue to the next cell and not to relax.
Thanks!
There is a well-known parallel programming pattern, known under many names, some of which are: bag of tasks, master/worker, task farm, work pool, etc. The idea is to have a single master process, which distributes cells to the other processes (workers). Each worker runs an infinite loop in which it waits for a message from the master, computes something and then returns the result. The loop is terminated by having the master send a message with a special tag. The wildcard tag value MPI_ANY_TAG can be used by the worker to receive messages with different tags.
The master is more complex. It also runs a loop, but until all cells have been processed. Initially it sends each worker a cell and then starts a loop. In this loop it receives a message from any worker using the wildcard source value MPI_ANY_SOURCE and, if there are more cells to be processed, sends one of them to the same worker that returned the result. Otherwise it sends a message with its tag set to the termination value.
There are many, many readily available implementations of this model on the Internet, and even some on Stack Overflow (for example this one). Mind that this scheme requires one additional MPI process that often does very little work. If this is unacceptable, one can run a worker loop in a separate thread.
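A compact sketch of that pattern with mpi4py (the question's code is C, but the message flow is identical; do_what_I_need is a placeholder for the real per-cell computation, and at least 2 ranks are assumed):

from mpi4py import MPI

def do_what_I_need(cell):              # placeholder for the real work
    return cell * cell

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
NNN = 12                               # number of cells
WORK, STOP = 1, 2                      # message tags

if rank == 0:                          # master: hand out one cell at a time
    next_cell, results = 0, {}
    for w in range(1, size):           # seed each worker with one cell
        if next_cell < NNN:
            comm.send(next_cell, dest=w, tag=WORK)
            next_cell += 1
        else:
            comm.send(None, dest=w, tag=STOP)
    for _ in range(NNN):               # one result per cell
        status = MPI.Status()
        cell, res = comm.recv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        results[cell] = res
        w = status.Get_source()
        if next_cell < NNN:            # more work: reuse the idle worker
            comm.send(next_cell, dest=w, tag=WORK)
            next_cell += 1
        else:                          # no more cells: tell it to stop
            comm.send(None, dest=w, tag=STOP)
    print(results)
else:                                  # worker: loop until the STOP tag
    while True:
        status = MPI.Status()
        cell = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == STOP:
            break
        comm.send((cell, do_what_I_need(cell)), dest=0, tag=WORK)

Run with e.g. mpirun -n 5 python master_worker.py; rank 0 is the master and the other ranks pull one cell at a time.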
You want to implement a kind of client-server architecture where you have workers asking the server for work whenever they are out of work.
Depending on the size of the chunks and the speed of your communication between workers and server, you may want to adjust the size of the chunks sent to workers.
To answer your updated question:
Under the master/slave (or worker pool, if that's how you prefer it to be labelled) model, you will basically need a task scheduler. The master should have information about what work has been done and what still needs to be done. The master gives each process some work to do, then sits and waits until a process completes (using nonblocking receives and MPI_Waitany). Once a process completes, it sends its data to the master and then waits for the master to respond with more work. Continue this until the work is done.