Unable to use multiple GPUs with Ray

I am using the Asynchronous Hyperband scheduler
https://ray.readthedocs.io/en/latest/tune-schedulers.html?highlight=hyperband
with 2 GPUs.
My machine has 2 GPUs and 12 CPUs.
But still, only one trial runs at a time, whereas 2 trials could
run simultaneously.
I specify:

ray.init(num_gpus=torch.cuda.device_count())

"resources_per_trial": {
    "cpu": 4,
    "gpu": int(args.cuda)
}

Related

Why does an e2-small show 2 vCPUs when creating it, but only 1 vCPU when checking the configuration on GCP?

This is what I get when creating an e2-small machine:
This is what I get when checking the machine after creation:
This is what is shown on the page https://cloud.google.com/compute/vm-instance-pricing:
This is what I get when using cat /proc/cpuinfo:
In your first screenshot, it shows e2-small (2 vCPU, 2 GB memory).
In your second screenshot, it shows 1 shared core.
One CPU core is 2 vCPUs. Therefore, your first and second screenshots are showing the same thing.
A vCPU is a hyper-thread. Each CPU core consists of two hyper-threads.
How does Hyper-Threading work? When Intel® Hyper-Threading Technology
is active, the CPU exposes two execution contexts per physical core.
This means that one physical core now works like two “logical cores”
that can handle different software threads.
(From Intel's "What Is Hyper-Threading?" article.)
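
As a quick sanity check from inside the VM, the logical/physical distinction can also be inspected directly; a small sketch assuming the third-party psutil package is installed:

import psutil

print("vCPUs (logical):", psutil.cpu_count(logical=True))    # e.g. 2 on an e2-small
print("physical cores:", psutil.cpu_count(logical=False))    # e.g. 1 on an e2-small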

Strange behavior of a CUDA kernel with an infinite loop on different NVIDIA GPUs

#include <cstdio>

__global__ void loop(void) {
    int smid = -1;
    if (threadIdx.x == 0) {
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        printf("smid: %d\n", smid);
    }
    while (1);
}

int main() {
    loop<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
This is my source code. The kernel just prints the smid when the thread index is 0 and then enters an infinite loop, while the host simply launches the kernel and waits for it. I ran some experiments under 2 different configurations as follows:
1. GPU (GeForce 940M), OS (Ubuntu 18.04), MPS (enabled), CUDA (v11.0)
2. GPU (GeForce RTX 3050 Ti Mobile), OS (Ubuntu 20.04), MPS (enabled), CUDA (v11.4)
Experiment 1: When I run this code under configuration 1, the GUI seems to freeze: no graphical response can be observed anymore, but as soon as I press Ctrl+C, the phenomenon disappears because the CUDA process is killed.
Experiment 2: When I run this code under configuration 2, the system works fine without any abnormal behavior, and the smid output, such as smid: 2\n, is displayed.
Experiment 3: When I change the block configuration to loop<<<1, 1024>>> and run this new code twice under configuration 2, I get the same smid output both times, such as smid: 2\nsmid: 2\n. (For the GeForce RTX 3050 Ti Mobile, the number of SMs is 20, the maximum number of threads per multiprocessor is 1536, and the maximum number of threads per block is 1024.)
I'm confused by these results, and here are my questions:
1. Why doesn't the system output smid under configuration 1?
2. Why does the GUI seem to freeze under configuration 1?
3. Unlike experiment 1, why does experiment 2 output smid normally?
4. In the third experiment, the block size reaches 1024 threads, which means that two different blocks cannot be scheduled on the same SM. Under MPS, all CUDA contexts are merged into one context and share the GPU resources without time-slicing, so why do I still get the same smid in the third experiment? (Furthermore, when I change the grid configuration to 10 blocks and run it twice, the smid varies from 0 to 19 and each smid appears only once!)
Why doesn't the system output smid under configuration 1?
A safe rule of thumb is that, unlike host code, in-kernel printf output will not be printed to the console at the moment the statement is encountered, but at the point of completion of the kernel and device synchronization with the host. This is the regime in effect in configuration 1, which uses a Maxwell GPU. So no printf output is observed in configuration 1, because the kernel never ends.
Why does the GUI seem to freeze under configuration 1?
For the purpose of this discussion, there are two possible regimes: a pre-Pascal regime in which compute preemption is not possible, and a Pascal-and-later regime in which it is. Your configuration 1 is a Maxwell device, which is pre-Pascal. Your configuration 2 is an Ampere device, which is post-Pascal. So in configuration 2, compute preemption is working. This has a variety of impacts, one of which is that the GPU will service both GUI needs and compute kernel needs "simultaneously" (the low-level behavior is not thoroughly documented, but it is a form of time-slicing, alternating attention between the compute kernel and the GUI). Therefore in configuration 1, pre-Pascal, any kernel that runs for a noticeable amount of time will "freeze" the GUI during kernel execution. In configuration 2, the GPU services both, to some degree.
Unlike experiment 1, why does experiment 2 output smid normally?
Although it's not well documented, the compute preemption process appears to introduce an additional synchronization point, allowing the printf buffer to be flushed, as mentioned in point 1. If you read the documentation linked there, you will see that "synchronization point" covers a number of possibilities, and compute preemption seems to introduce (a new) one.
Sorry, I won't be able to answer your 4th question at this time. A best practice on SO is to ask one question per question. However, I would consider usage of MPS with a GPU that is also servicing a display to be "unusual". Since we've established that compute preemption is in effect here, it may be that, due to compute preemption as well as the need to service a display, the GPU services clients in a round-robin time-slicing fashion (since it must do so anyway to service the display). In that case the behavior under MPS may be different. Compute preemption opens the possibility that the usual limitations you are describing no longer apply: one kernel can completely replace another.

How to make TensorFlow use more of the available CPU

How can I fully utilize each of my EC2 cores?
I'm using a c4.4xlarge AWS Ubuntu EC2 instance and TensorFlow to build a large convolutional neural network. nproc says that my EC2 instance has 16 cores. When I run my convnet training code, the top utility says that I'm only using 400% CPU. I was expecting it to use 1600% CPU because of the 16 cores. The AWS EC2 monitoring tab confirms that I'm only using 25% of my CPU capacity. This is a huge network, and on my new Mac Pro it consumes about 600% CPU and takes a few hours to build, so I don't think the reason is that my network is too small.
I believe the line below ultimately determines CPU usage:
sess = tf.InteractiveSession(config=tf.ConfigProto())
I admit I don't fully understand the relationship between threads and cores, but I tried increasing the number of threads (shown below). It had the same effect as the line above: still 400% CPU.
NUM_THREADS = 16
sess = tf.InteractiveSession(config=tf.ConfigProto(intra_op_parallelism_threads=NUM_THREADS))
EDIT:
htop shows that I am actually using all 16 of my EC2 cores, but each core is only at about 25%.
top shows that my total CPU % is around 400%, but occasionally it will shoot up to 1300% and then almost immediately drop back down to ~400%. This makes me think there could be a deadlock problem.
Several things you can try:
Increase the number of threads
You already tried changing the intra_op_parallelism_threads. Depending on your network it can also make sense to increase the inter_op_parallelism_threads. From the doc:
inter_op_parallelism_threads:
Nodes that perform blocking operations are enqueued on a pool of
inter_op_parallelism_threads available in each process. 0 means the
system picks an appropriate number.
intra_op_parallelism_threads:
The execution of an individual op (for
some op types) can be parallelized on a pool of
intra_op_parallelism_threads. 0 means the system picks an appropriate
number.
(Side note: the values in the configuration file referenced above are just example values, not the actual defaults TensorFlow uses. You can see the actual default configuration by manually inspecting the object returned by tf.ConfigProto().)
TensorFlow uses 0 for the above options by default, meaning it tries to choose appropriate values itself. I don't think TensorFlow picked poor values that caused your problem, but you can try out different values for the above options to be on the safe side.
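For example, a minimal sketch of setting both options explicitly with the TF 1.x ConfigProto used in the question (the value 16 just mirrors the core count from the question, not a recommendation):

import tensorflow as tf  # TF 1.x

NUM_THREADS = 16  # number of cores reported by nproc on the c4.4xlarge
config = tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS,   # threads used within a single op
    inter_op_parallelism_threads=NUM_THREADS,   # threads used to run independent ops concurrently
)
sess = tf.InteractiveSession(config=config)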
Extract traces to see how well your code parallelizes
Have a look at
tensorflow code optimization strategy
In such a trace you can see whether the actual computation happens on far fewer threads than are available; this could also be the case for your network. At potential synchronization points, all threads become active for a short moment, which is potentially the reason for the sporadic peaks in CPU utilization that you experience.
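If you want to produce such a trace yourself, here is a hedged sketch using the TF 1.x timeline API (train_op stands for whatever op your training step runs; sess is the session from the question):

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)  # trace one training step
trace = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())  # open this file in chrome://tracing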
Miscellaneous
Make sure you are not running out of memory (htop)
Make sure you are not doing a lot of I/O or something similar

Speed variation between vCPUs on the same Amazon EC2 instance

I'm exploring the feasibility of running numerical computations on Amazon EC2. I currently have one c4.8xlarge instance running. It has 36 vCPUs, each of which is a hyperthread of a Haswell Xeon chip. The instance runs Ubuntu in HVM mode.
I have a GCC-optimised binary of a completely sequential (i.e. single-threaded) program. I launched 30 copies of it with CPU pinning, thus:
for i in `seq 0 29`; do
    nohup taskset -c $i $BINARY_PATH &> $i.out &
done
The 30 processes run almost identical calculations. There's very little disk activity (a few megabytes every 5 minutes), and there's no network activity or interprocess communication whatsoever. htop reports that all processes run constantly at 100%.
The whole thing has been running for about 4 hours at this point. Six processes (12-17) have already done their task, while processes 0, 20, 24 and 29 look as if they will require another 4 hours to complete. Other processes fall somewhere in between.
My questions are:
Other than resource contention with other users, is there anything else that may be causing the significant variation in performance between the vCPUs within the same instance? As it stands, the instance would be rather unsuitable for any OpenMP or MPI jobs that synchronise between threads/ranks.
Is there anything I can do to achieve a more uniform (hopefully higher) performance across the cores? I have basically excluded hyperthreading as a culprit here since the six "fast" processes are hyperthreads on the same physical cores. Perhaps there's some NUMA-related issue?
My experience is on c3 instances. It's likely similar with c4.
For example, take a c3.2xlarge instance with 8 vCPUs (most of the explanation below is derived from direct discussion with AWS support).
In fact, only the first 4 vCPUs are usable for heavy scientific calculations. The last 4 vCPUs are hyperthreads. For scientific applications it's often not useful to use hyperthreading: it causes context switching and reduces the available cache (and associated bandwidth) per thread.
To find out the exact mapping between the vCPUs and the physical cores, look into /proc/cpuinfo:
"physical id": shows the physical processor id (only one processor in a c3.2xlarge)
"processor": gives the vCPU number
"core id": tells you which physical core each vCPU maps back to
If you put this in a table, you have:
physical_id  processor  core_id
0            0          0
0            1          1
0            2          2
0            3          3
0            4          0
0            5          1
0            6          2
0            7          3
You can also get this from "thread_siblings_list", the kernel's internal map of cpuX's hardware threads within the same core as cpuX (https://www.kernel.org/doc/Documentation/cputopology.txt):
cat /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
When hyper-threading is enabled, each vCPU ("processor") is a logical core.
There will be 2 logical cores associated with each physical core.
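
If you want to rebuild the table above programmatically instead of reading /proc/cpuinfo by hand, here is a small sketch (plain Python, no extra packages; run it on the instance itself):

entries, current = [], {}
with open("/proc/cpuinfo") as f:
    for line in f:
        if ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
        elif current:                      # a blank line ends one processor block
            entries.append(current)
            current = {}
if current:
    entries.append(current)

print("physical_id  processor  core_id")
for e in entries:
    print(e.get("physical id", "?"), e.get("processor", "?"), e.get("core id", "?"), sep="  ")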
So, in your case, one solution is to disable hyperthreading with:
echo 0 > /sys/devices/system/cpu/cpuX/online
where X for a c3.2xlarge would be 4...7.
EDIT: you can observe this behaviour only in HVM instances. In PV instances, this topology is hidden by the hypervisor: all core ids & processor ids in /proc/cpuinfo are '0'.

OpenCL multiple command queues for concurrent NDRange kernel launch

I'm trying to run a vector-addition application where I need to launch multiple kernels concurrently.
For concurrent kernel launches, someone in my last question advised me to use multiple command queues,
which I'm defining in an array:
context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err);

for (i = 0; i < num_ker; ++i)
{
    queue[i] = clCreateCommandQueue(context, device_id, 0, &err);
}
I'm getting the error "command terminated by signal 11" somewhere around the above code.
I'm using a for loop for launching the kernels and enqueueing the data too:
for (i = 0; i < num_ker; ++i)
{
    err = clEnqueueNDRangeKernel(queue[i], kernel, 1, NULL, &globalSize, &localSize,
                                 0, NULL, NULL);
}
The thing is, I'm not sure where I'm going wrong. I saw somewhere that we can make an array of command queues, so that's why I'm using an array.
One more piece of information: when I'm not using a for loop and just manually define multiple command queues, it works fine.
I read your last question as well, and I think you should first rethink what you really want to do and whether OpenCL is really the way to do it.
OpenCL is an API for massively parallel processing and data crunching,
where each kernel (or queued task) operates in parallel on many data
values at the same time, thereby outperforming any serial CPU processing by many orders of magnitude.
The typical use case for OpenCL is 1 kernel running millions of work items.
More advanced applications may need multiple sequences of different kernels and special synchronization between CPU and GPU.
But concurrency is never a requirement. (Otherwise, a single-core CPU would not be able to perform the task, and that is never the case. It will be slower, OK, but it will still be possible to run it.)
Even if 2 tasks need to run at the same time, the total time taken will be the same whether they run concurrently or not:
Not concurrent case:
Kernel 1: *
Kernel 2: -
GPU Core 1: *****-----
GPU Core 2: *****-----
GPU Core 3: *****-----
GPU Core 4: *****-----
Concurrent case:
Kernel 1: *
Kernel 2: -
GPU Core 1: **********
GPU Core 2: **********
GPU Core 3: ----------
GPU Core 4: ----------
In fact, the non-concurrent case is preferable, since at least the first task is already completed and further processing can continue.
What you do want to do, as far as I understand, is run multiple kernels at the same time. So that the kernels run fully concurrently. For example, run 100 kernels (same kernel or different) and run them at the same time.
That does not fit the OpenCL model at all, and in fact it may be far slower than a single CPU thread.
If each kernel is independent of all the others, a core (SIMD unit or CPU) can only be allocated to 1 kernel at a time (because it only has 1 program counter), even though it could run 1k threads at the same time. In an ideal scenario, this turns your OpenCL device into a pool of a few cores (6-10) that serially consume the queued kernels. And that is assuming the API and the device support it, which is not always the case. In the worst case you will have a single device that runs a single kernel and is 99% wasted.
Examples of stuff that can be done in OpenCL:
Data crunching/processing. Multiply vectors, simulate particles, etc..
Image processing, border detection, filtering, etc.
Video compression, editing, generation
Raytracing, complex light math, etc.
Sorting
Examples of stuff that is not suitable for OpenCL:
Attending to async requests (HTTP, traffic, interactive data)
Processing small amounts of data
Processing data that needs completely different processing for each item
From my point of view, the only real use case of using multiple kernels is the latter, and no matter what you do the performance will be horrible in that case.
Better use a multithread pool instead.
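
For completeness, a minimal sketch of the thread-pool alternative, written in Python rather than the question's C host code (handle() is a hypothetical per-task function):

from concurrent.futures import ThreadPoolExecutor

def handle(task):
    # placeholder for per-task work that does not map well onto OpenCL
    return task * task

with ThreadPoolExecutor(max_workers=8) as pool:    # a small pool of CPU threads
    results = list(pool.map(handle, range(100)))   # independent tasks are consumed concurrently

print(results[:5])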