How can I make Ray Tune run parallel trials across multiple GPUs?

I am hoping to make Tune run each trial of a grid search in parallel across multiple GPUs. I have a 4-GPU machine with 24 vCPUs. When I run the following code, I see 3 GPUs being used by nvidia-smi, but only one trial is running.
tune.run("PPO",
config={
"env": "PongNoFrameskip-v4",
"lr": tune.grid_search([0.01, 0.001, 0.0001]),
"num_gpus": 3,
"num_workers": 3
}
)
I can see from the run that Tune is only running one trial.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 4/24 CPUs, 3/4 GPUs, 0.0/190.43 GiB heap, 0.0/12.84 GiB objects
Memory usage on this node: 5.4/220.4 GiB
Result logdir: /home//ray_results/PPO
Number of trials: 3 ({'RUNNING': 1, 'PENDING': 2})
PENDING trials:
- PPO_PongNoFrameskip-v4_1_lr=0.001: PENDING
- PPO_PongNoFrameskip-v4_2_lr=0.0001: PENDING
RUNNING trials:
- PPO_PongNoFrameskip-v4_0_lr=0.01: RUNNING
I tried setting resources_per_trial with "gpu": 1, but Ray raised an error telling me to clear resources_per_trial:
ValueError: Resources for <class 'ray.rllib.agents.trainer_template.PPO'> have been automatically set to Resources(cpu=1, gpu=3, memory=0, object_store_memory=0, extra_cpu=3, extra_gpu=0, extra_memory=0, extra_object_store_memory=0, custom_resources={}, extra_custom_resources={}) by its `default_resource_request()` method. Please clear the `resources_per_trial` option.
What is the way to tell Tune to run all 3 trials in parallel?
Thank you.

Try
tune.run("PPO",
config={
"env": "PongNoFrameskip-v4",
"lr": tune.grid_search([0.01, 0.001, 0.0001]),
"num_gpus": 1,
"num_workers": 3
}
)

An explanation of richiliaw's answer:
Note that the important bit in resources_per_trial is per trial. If, for example, you have 4 GPUs and your grid search has 4 combinations, you must set 1 GPU per trial if you want all 4 of them to run in parallel.
If you set it to 4, each trial will require 4 GPUs, i.e. only 1 trial can run at a time.
This is explained in the Ray Tune docs, with the following code sample:
# If you have 8 GPUs, this will run 8 trials at once.
tune.run(trainable, num_samples=10, resources_per_trial={"gpu": 1})
# If you have 4 CPUs on your machine and 1 GPU, this will run 1 trial at a time.
tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2, "gpu": 1})

Related

CUDA Multi-GPU Latency

I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090; 48GB vRAM) in parallel. The issue I face is that for the simple block of code shown below, I would expect the overall block to complete in the same amount of time regardless of the presence of the Device 2 code, since the copies run asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned memory. All arrays are of same length.
for (int i = 0; i < 2; i++) {
    // ---------- Device 1 code -----------
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);

    // ---------- Device 2 code -----------
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
}
But alas, when I run the block, the Device 1 code alone takes 80 ms, while adding the Device 2 code adds another 20 ms, bringing the execution time for the block to 100 ms. I tried profiling the above code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone though, I get the below profile:
Device 1 alone (image)
I can see that the initial HtoD transfer on Device 1 is extended in duration when I add Device 2, and I'm not sure why this is happening, because as far as I can tell these operations are running independently, on different GPUs.
I realise that I haven't created any separate CPU threads to handle the separate devices, but I'm not sure if that would help. Could someone please help me understand why this elongation of duration happens when I add the Device 2 code?
EDIT:
I tried profiling the code and expected the execution durations to be independent per GPU, although I realise cudaMemcpyAsync involves the host as well, and perhaps the addition of Device 2 puts more stress on the CPU, as it now has to handle additional transfers...?

Why did optimizing my Spark config slow it down?

I'm running Spark 2.2 (legacy codebase constraints) on an AWS EMR cluster (version 5.8), and I run a system of both Spark and Hadoop jobs on the cluster daily. I noticed that the configs submitted to EMR for Spark when spinning up the cluster have some values that are not "optimal" according to AWS's official best-practices guide for managing memory.
When configuring the cluster, the current spark-defaults were set as follows:
{
  "classification": "spark-defaults",
  "properties": {
    "spark.executor.memory": "16g",
    "spark.shuffle.io.maxRetries": "10",
    "spark.driver.memory": "16g",
    "spark.kryoserializer.buffer.max": "2000m",
    "spark.rpc.message.maxSize": "1024",
    "spark.local.dir": "/mnt/tmp",
    "spark.driver.cores": "4",
    "spark.driver.maxResultSize": "10g",
    "spark.executor.cores": "4",
    "spark.speculation": "true",
    "spark.default.parallelism": "960",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.yarn.executor.memoryOverhead": "4g",
    "spark.task.maxFailures": "10"
  }
}
Going by the guide, I made the updates to:
{
  "classification": "spark-defaults",
  "properties": {
    "spark.executor.memory": "27g",
    "spark.shuffle.io.maxRetries": "10",
    "spark.driver.memory": "27g",
    "spark.kryoserializer.buffer.max": "2000m",
    "spark.rpc.message.maxSize": "1024",
    "spark.local.dir": "/mnt/tmp",
    "spark.driver.cores": "3",
    "spark.driver.maxResultSize": "20g",
    "spark.executor.cores": "3",
    "spark.speculation": "true",
    "spark.default.parallelism": "228",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.executor.instances": "37",
    "spark.yarn.executor.memoryOverhead": "3g",
    "spark.task.maxFailures": "10"
  }
}
My cluster is running on 1 master node and 19 core node instances, all on-demand, and all of this type: r3.2xlarge (8 vCores, 61 GiB memory, 160 GB SSD storage).
Note the following changes and the reasoning (the arithmetic is sketched in code after this list):
"spark.executor.cores": "4" -> "3" // The optimal number of cores per executor is 5, but go lower if the math doesn't work out for the instance type. Since my instances have 8 cores and you need 1 for daemon processes, I rounded down to 3 cores per executor, 1 for the daemon (and 1 left over). Having it at 4 meant the executors divided the cores evenly, but nothing was left over for the daemon processes, which I've read is not good.
"spark.executor.memory": "16g" -> "27g" and "spark.yarn.executor.memoryOverhead": "4g" -> "3g": I get 2 executors per instance. Each instance has 61 GiB of memory, and I need to reserve 10% for executor overhead. So (61 / 2) * 90% = 27g, rounded down. Here I'm dramatically increasing the working memory available to each executor, and only slightly decreasing its overhead.
"spark.driver.cores": "4" -> "3": the guide says to keep the driver the same as the executors. This one doesn't make as much sense, because there's just 1 driver running on the master node, so I'm leaving a lot of cores unused there. I also did "spark.driver.memory": "16g" -> "27g" for the same reason of keeping them the same. A big bump again.
"spark.driver.maxResultSize": "10g" -> "20g": this one was more arbitrary. Since my driver on the master node is only using 27g for memory, plus another 3g for overhead, I have ~30g of memory sitting unused (I assumed), so I bumped it up.
"spark.default.parallelism": "960" -> "228": this I brought way down, because the guide says spark.executor.instances * spark.executor.cores * 2. I have 19 core machines with 2 executors each, with 3 cores each, so that's 19 * 2 * 3 * 2 = 228. From my understanding, bringing it down to the range that makes sense for these machines should decrease the parallelism, meaning less context switching.
I tried this out with my new configs, thinking I'd get better performance due to all the extra memory I was giving to the executors and driver, and one of my Spark jobs that normally runs and finishes in 45-50 minutes was still running after 95 minutes. I tried it a few times and cancelled it after this same result 3-4 times.
Any idea why my optimizations were making things worse?
This is the config I got using this cheat sheet for what you were intending. Can you try changing your overhead to the full number (3072) instead of 3g?
Also check your Ganglia monitor while running your job and see whether more of your memory is being utilized compared to the default config.
https://www.c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/

Strange behaviour of Parallel Boost Graph Library example code

I have set up simple tests with the Parallel Boost Graph Library (PBGL), which I have never used before, and observed entirely unexpected behaviour that I would like to have explained.
My steps were as follows:
Dump test data in METIS format (a kind of social graph with 50 million vertices and 100 million edges);
Build a modified PBGL example from graph_parallel\example\dijkstra_shortest_paths.cpp
The example was slightly extended to run the eager Dijkstra, Crauser and delta-stepping algorithms.
Note: building the example required an obscure workaround around the MUTABLE_QUEUE define in crauser_et_al_shortest_paths.hpp (the example code is in fact incompatible with the new mutable_queue)
int lookahead = 1;
delta_stepping_shortest_paths(g, start, dummy_property_map(), get(vertex_distance, g), get(edge_weight, g), lookahead);
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)).lookahead(lookahead));
dijkstra_shortest_paths(g, start, distance_map(get(vertex_distance, g)));
Run
mpiexec -n 1 mytest.exe mydata.me
mpiexec -n 2 mytest.exe mydata.me
mpiexec -n 4 mytest.exe mydata.me
mpiexec -n 8 mytest.exe mydata.me
The observed behaviour:
-n 1:
mem usage: 35 GB in 1 running process, which utilizes exactly 1 device thread (processor load 12.5%)
delta stepping time: about 1 min 20 s
eager time: about 2 min
crauser time: about 3 min 20 s.
-n 2:
a crash at the data-loading stage.
-n 4:
mem usage: 40+ GB in roughly equal parts across 4 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of observation error.
-n 8:
mem usage: 44+ GB in roughly equal parts across 8 running processes, each of which utilizes exactly 1 device thread
calculation times are unchanged within the margin of observation error.
So, apart from the unreasonable memory usage and very low overall performance, the only changes I observe when more MPI processes are running are a slightly increased total memory consumption and a linear rise in processor load.
The fact that the initial graph is somehow partitioned between the processes (probably by vertex number ranges) is nevertheless evident.
What is wrong with this test (and, probably, with my idea of MPI usage as a whole)?
My environment:
- one Win 10 PC with 64 GB RAM and 8 cores;
- MS MPI 10.0.12498.5;
- MSVC 2017, toolset 141;
- Boost 1.71
N.B. See original example code here.

Poor man's Python data parallelism doesn't scale with the # of processors? Why?

I need to run many computational iterations, say 100 or so, with a very complicated function that accepts a number of input parameters. Though the parameters will be varied, each iteration will take nearly the same amount of time to compute. I found this post by Muhammad Alkarouri. I modified his code so as to fork the script N times, where I plan to set N at least equal to the number of cores on my desktop computer.
It's an old MacPro 1,0 running 32-bit Ubuntu 16.04 with 18GB of RAM, and in none of the runs of the test file below does the RAM usage exceed about 15% (swap is never used). That's according to the System Monitor's Resources tab, on which it is also shown that when I try running 4 iterations in parallel all four CPUs are running at 100%, whereas if I only do one iteration only one CPU is 100% utilized and the other 3 are idling. (Here are the actual specs for the computer. And cat /proc/self/status shows a Cpus_allowed of 8, twice the total number of cpu cores, which may indicate hyper-threading.)
So I expected to find that the 4 simultaneous runs would consume only a little more time than one (and going forward I generally expected run times to pretty much scale in inverse proportion to the number of cores on any computer). However, I found to the contrary that instead of the running time for 4 being not much more than the running time for one, it is instead a bit more than twice the running time for one. For example, with the scheme presented below, a sample "time(1) real" value for a single iteration is 0m7.695s whereas for 4 "simultaneous" iterations it is 0m17.733s. And when I go from running 4 iterations at once to 8, the running time scales in proportion.
So my question is, why doesn't it scale as I had supposed (and can anything be done to fix that)? This is, by the way, for deployment on a few desktops; it doesn't have to scale or run on Windows.
Also, I have for now neglected the multiprocessing.Pool() alternative as my function was rejected as not being pickleable.
Here is the modified Alkarouri script, multifork.py:
#!/usr/bin/env python
import os, cPickle, time
import numpy as np
np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def run_in_separate_process(func, *args, **kwds):
    numruns = 4
    result = [ None ] * numruns
    pread = [ None ] * numruns
    pwrite = [ None ] * numruns
    pid = [ None ] * numruns
    for i in range(numruns):
        pread[i], pwrite[i] = os.pipe()
        pid[i] = os.fork()
        if pid[i] > 0:
            pass
        else:
            os.close(pread[i])
            result[i] = func(*args, **kwds)
            with os.fdopen(pwrite[i], 'wb') as f:
                cPickle.dump((0,result[i]), f, cPickle.HIGHEST_PROTOCOL)
            os._exit(0)
    #time.sleep(17)
    while True:
        for i in range(numruns):
            os.close(pwrite[i])
            with os.fdopen(pread[i], 'rb') as f:
                stat, res = cPickle.load(f)
                result[i] = res
            #os.waitpid(pid[i], 0)
        if not None in result: break
    return result

def main():
    print 'Running multifork.py.'
    print run_in_separate_process( function, 3 )

if __name__ == "__main__":
    main()
With multifork.py, uncommenting os.waitpid(pid[i], 0) has no effect. And neither does uncommenting time.sleep() if 4 iterations are being run at once and if the delay is not set to more than about 17 seconds. Given that time(1) real is something like 0m17.733s for 4 iterations done at once, I take that to be an indication that the While True loop is not itself a cause of any appreciable inefficiency (due to the processes all taking the same amount of time) and that the 17 seconds are indeed being consumed solely by the child processes.
Out of a profound sense of mercy I have spared you for now my other scheme, with which I employed subprocess.Popen() in lieu of os.fork(). With that one I had to send the function to an auxiliary script, the script that defines the command that is the first argument of Popen(), via a file. I did however use the same While True loop. And the results? They were the same as with the simpler scheme that I am presenting here--- almost exactly.
Why don't you use the joblib.Parallel feature?
#!/usr/bin/env python
from joblib import Parallel, delayed
import numpy as np
np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def main():
    print 'Running multifork.py.'
    print Parallel(n_jobs=2)(delayed(function)(3) for _ in xrange(4))

if __name__ == "__main__":
    main()
It seems that you have some bottleneck in your computations.
In your example you pass your data over a pipe, which is not a very fast method. To avoid this performance problem you should use shared memory. This is how multiprocessing and joblib.Parallel work.
Also remember that in the single-process case you don't have to serialize and deserialize the data, but with multiple processes you do.
Next, even if you have 8 cores it could be that hyper-threading is enabled, which splits each core into 2 hardware threads; thus if you have 4 HW cores the OS thinks there are 8 of them. There are a lot of advantages and disadvantages to using HT, but the main thing is that if you are going to load all your cores with computations for a long time, you should disable it.
For example, I have an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with 2 HW cores and HT enabled, so in top I see 4 cores. The computation times for me are:
n_jobs=1 - 0:07.13
n_jobs=2 - 0:06.54
n_jobs=4 - 0:07.45
This is what my lscpu output looks like:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6186.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Note the Thread(s) per core line.
So in your example there is not so much computation as data transferring. Your application doesn't have time to reap the advantages of parallelism. If you have a long computation job (about 10 minutes), I think you will see them.
ADDITION:
I've taken a more detailed look at your function. I replaced the multiple executions with just one execution of function(3) and ran it under the profiler:
$ /usr/bin/time -v python -m cProfile so.py
The output is quite long; you can view the full version here (http://pastebin.com/qLBBH5zU). But the main thing is that the program spends most of its time in the numpy.concatenate function. You can see it:
ncalls tottime percall cumtime percall filename:lineno(function)
.........
600 1.375 0.002 1.375 0.002 {numpy.core.multiarray.concatenate}
.........
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
.........
If you run multiple instances of this program, you will see that the execution time of each individual program instance increases considerably. I started 2 copies simultaneously:
$ /usr/bin/time -v python -m cProfile prog.py & /usr/bin/time -v python -m cProfile prog.py &
On the other hand I wrote a small fibo function:
def fibo(x):
    arr = (0, 1)
    for _ in xrange(x):
        arr = (arr[-1], sum(arr))
    return arr[0]
Then I replaced the concatenate line with fibo(10000). In this case the execution time of the single-instance program is 0:22.82, while two instances take almost the same time per instance (0:24.62).
Based on this, I think that perhaps numpy uses some shared resource which leads to a parallelization problem, or it could be a numpy- or scipy-specific issue.
And one last thing about the code: you should replace the block below:
for r in range(3):
    for i in range(200):
        arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
with this single line:
arr = np.concatenate( (arr, np.log(np.arange(2, 5002).repeat(3*200).reshape(-1,3*200))), axis=1 )
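A quick check that the one-liner builds the same array as the original nested loops (a hypothetical standalone snippet, written with Python 3 print syntax unlike the code above):
import numpy as np

# Build the array with the original nested loops.
arr_loop = np.zeros(5000).reshape(-1, 1)
for r in range(3):
    for i in range(200):
        arr_loop = np.concatenate((arr_loop, np.log(np.arange(2, 5002)).reshape(-1, 1)), axis=1)

# Build it with the single concatenate call.
arr_once = np.concatenate(
    (np.zeros(5000).reshape(-1, 1),
     np.log(np.arange(2, 5002).repeat(3 * 200).reshape(-1, 3 * 200))),
    axis=1)

print(arr_loop.shape, arr_once.shape)   # both (5000, 601)
print(np.allclose(arr_loop, arr_once))  # True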
I am providing an answer here so as to not leave such a muddled record. First, I can report that most-serene frist's joblib code works, and note that it's shorter and that it also doesn't suffer from the limitations of the While True loop of my example, which only works efficiently with jobs that each take about the same amount of time. I see that the joblib project has current support and if you don't mind dependency upon a third-party library it may be an excellent solution. I may adopt it.
But with my test function, the time(1) real value (the running time using the time wrapper) is about the same with either joblib or my poor man's code.
To answer the question of why the scaling of the running time in inverse proportion to the number of physical cores doesn't go as hoped, I had diligently prepared my test code that I presented here so as to produce similar output and to take about as long to run as the code of my actual project, per single-run without parallelism (and my actual project is likewise CPU-bound, not I/O-bound). I did so in order to test before undertaking the not-really-easy fitting of the simple code into my rather complicated project. But I have to report that in spite of that similarity, the results were much better with my actual project. I was surprised to see that I did get the sought inverse scaling of running time with the number of physical cores, more or less.
So I'm left supposing--- here's my tentative answer to my own question--- that perhaps the OS scheduler is fickle and very sensitive to the type of job. And there could be effects due to other processes that may be running even if, as in my case, the other processes are hardly using any CPU time (I checked; they weren't).
Tip #1: Never name your joblib test code joblib.py (you'll get an import error). Tip #2: Never rename your joblib.py test-code file and then run the renamed file without deleting the old joblib.py file (you'll get an import error).

Machine Repair Simulation?

The Problem Statement:
A system is composed of 1, 2, or 3 machines and a repairman responsible for
maintaining these machines. Normally, the machines are running and producing
a product. At random points in time, the machines fail and are fixed by the
repairman. If a second or third machine fails while the repairman is busy
fixing the first machine, these machines will wait on the services of the
repairman in a first come, first served order. When repair on a machine is
complete, the machine will begin running again and producing a product.
The repairman will then repair the next machine waiting. When all machines
are running, the repairman becomes idle.
simulate this system for a fixed period of time and calculate the fraction
of time the machines are busy (utilization) and the fraction of time
the repairman is busy (utilization).
Now, the input is 50 running times and 50 repair times; then, for each test case, you are given the period over which to calculate the utilization and the number of machines to simulate.
Sample Input:
7.0 4.5 13.0 10.5 3.0 12.0 ....
9.5 2.5 4.5 12.0 5.7 1.5 ....
20.0 1
20.0 3
0.0 0
Sample Output:
         No. of         Utilization
Case    Machines    Machine    Repairman
  1         1         .525        .475
  2         3         .558        .775
Case 2 Explanation:
Machine Utilization = ((7+3)+(4.5+6)+(13))/(3*20) = .558
Repairman Utilization = (15.5)/20 = .775
My Approach (a sketch of these steps in code follows the list):
1) Load the machines into a minimum heap (called runHeap) and give each of them a run time; the next run time to hand out is always the next unused one of the 50 run times in the input.
2) Compute the minimum of: the smallest remaining run time in runHeap, the remaining repair time at the head of the repair queue, and the remaining time to finish the simulation. Call that value "toGo".
3) Subtract toGo from the remaining run time of every machine in runHeap and from the remaining repair time of the head of repairQueue.
4) Push every machine whose remaining run time == 0 into repairQueue; if the head of repairQueue has remaining repair time == 0, push it into runHeap.
5) Add toGo to the current time.
6) If the current time < simulation time, go to step 2; else return the utilizations.
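A minimal sketch of these steps in Python (the names toGo/runHeap/repairQueue follow the description above; the utilization bookkeeping, the epsilon comparison, and the truncated sample lists are my additions for illustration):
import heapq
from collections import deque

def simulate(run_times, repair_times, period, n_machines):
    run_iter = iter(run_times)      # hands out the next unused run time
    rep_iter = iter(repair_times)   # hands out the next unused repair time

    # Step 1: every machine starts out running with its own run time.
    run_heap = [next(run_iter) for _ in range(n_machines)]
    heapq.heapify(run_heap)
    repair_queue = deque()          # remaining repair times, first come first served
    now = 0.0
    machine_busy = 0.0              # total machine running time, summed over machines
    repairman_busy = 0.0
    eps = 1e-9                      # tolerance for floating-point "== 0" tests

    while now < period - eps:
        # Step 2: toGo is the smallest of the three candidate times.
        to_go = period - now
        if run_heap:
            to_go = min(to_go, run_heap[0])
        if repair_queue:
            to_go = min(to_go, repair_queue[0])

        # Utilization bookkeeping over this time slice.
        machine_busy += to_go * len(run_heap)
        if repair_queue:
            repairman_busy += to_go

        # Step 3: advance every running machine and the machine being repaired.
        run_heap = [t - to_go for t in run_heap]
        heapq.heapify(run_heap)
        if repair_queue:
            repair_queue[0] -= to_go

        # Step 4: move finished machines between the heap and the queue.
        while run_heap and run_heap[0] <= eps:
            heapq.heappop(run_heap)
            repair_queue.append(next(rep_iter))
        if repair_queue and repair_queue[0] <= eps:
            repair_queue.popleft()
            heapq.heappush(run_heap, next(run_iter))

        # Step 5.
        now += to_go

    # Step 6.
    return machine_busy / (n_machines * period), repairman_busy / period

# Sample case 2 from the problem statement: 3 machines over a period of 20.
runs = [7.0, 4.5, 13.0, 10.5, 3.0, 12.0]
reps = [9.5, 2.5, 4.5, 12.0, 5.7, 1.5]
print(simulate(runs, reps, 20.0, 3))   # roughly (.558, .775)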
Now, the question: is this a good approach, or can one figure out a better one?