Cloud ML-Engine Quota between P100 and K80 - google-cloud-ml

The Cloud ML-Engine Quota documentation mentions:
Total concurrent number of GPUs: This is the maximum number of GPUs in concurrent use, split per type as follows:
Concurrent number of Tesla K80 GPUs: 30.
Concurrent number of Tesla P100 GPUs: 30.
According to this, I should be able to run 60 jobs at the same time, as long as they are split 30/30 between these two GPU types.
In practice, after starting 30 P100 jobs, my K80 jobs are left in the queue and are not getting scheduled.
Is that expected?
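For reference, here is a minimal sketch of how jobs of each type might be submitted, using the beta gcloud ai-platform surface shown in a related question further down this page (the job names, bucket, and trainer package paths are placeholders):

# Hypothetical K80 job: custom scale tier, one K80 on the master
gcloud beta ai-platform jobs submit training k80_job_01 \
  --region us-central1 \
  --runtime-version 1.14 \
  --scale-tier custom \
  --master-machine-type n1-standard-8 \
  --master-accelerator count=1,type=nvidia-tesla-k80 \
  --module-name trainer.task \
  --package-path ./trainer \
  --job-dir gs://my-bucket/k80_job_01

# Hypothetical P100 job: identical apart from the accelerator type
gcloud beta ai-platform jobs submit training p100_job_01 \
  --region us-central1 \
  --runtime-version 1.14 \
  --scale-tier custom \
  --master-machine-type n1-standard-8 \
  --master-accelerator count=1,type=nvidia-tesla-p100 \
  --module-name trainer.task \
  --package-path ./trainer \
  --job-dir gs://my-bucket/p100_job_01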

The jobs started on their own after 1 hour. I did have both P100 and K80 jobs running eventually; it just took a while for the K80 jobs to start.

Related

GPU 0 utilization higher than other GPUs on Amazon SageMaker SMDP (distributed Training)

When using SageMaker Data Parallelism (SMDP), my team sees higher utilization on GPU 0 than on the other GPUs.
What could be the likely cause here?
Does it have anything to do with the data loader workers that run on the CPU? I would expect SMDP to shard the datasets equally.
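One way to narrow this down is to log per-GPU utilization while the job runs. This is a generic sketch, not SMDP-specific, and it assumes nvidia-smi is available on the training instance:

# Sample utilization and memory for every GPU every 5 seconds; GPU 0 being
# consistently higher usually points at rank-0-only work (evaluation, checkpointing,
# logging) or at a data loader that is not sharded evenly across ranks.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5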

GCP gpu all region quota near the limit

I have activated the free $300 trial of Google Cloud Platform and successfully increased the "GPUs (all regions)" quota to 1, so I could create a notebook with a Tesla T4, and I've been training some models with it so far.
However, when I opened the notebook today, I found that the Tesla T4 is no longer available, and I can't create a new notebook with a GPU: it says the quota is near its limit.
My question is: is this quota limit permanent, or temporary because all GPUs are currently busy? I have only used around half of my free trial credit. Thanks!
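One way to see whether the regional quota itself is the blocker is to compare usage against the limit from the CLI. This is only a sketch; the metric name NVIDIA_T4_GPUS is an assumption about how the T4 quota appears in the region's quota list:

# Show current usage vs. limit for T4-related quotas in the region (replace us-central1)
gcloud compute regions describe us-central1 \
  --flatten=quotas \
  --filter="quotas.metric~T4" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)"

If usage is below the limit, the message is more likely about temporary zone capacity or the all-regions GPU quota than about the regional T4 quota itself.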

Difference between NVIDIA P100 GPUs and Committed NVIDIA P100 GPUs limit name in GCP

Currently, I am trying to increase my quota limit for NVIDIA P100 GPU in GKE. When I filter in Quotas using the Limit name, I get two types of options - NVIDIA P100 GPUs and Committed NVIDIA P100 GPUs. What is the difference between these two?
As their names suggest:
NVIDIA P100 GPUs: the quota of GPUs you can use in your project (attached to a Compute Engine VM). You pay only while a GPU is attached to a running VM.
Committed NVIDIA P100 GPUs: the quota of GPUs you can commit to (reserve) in your project. You pay for these GPUs even when they are not in use or attached to a VM, but you get a committed use discount.
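For context, a commitment is created with the commitments CLI. The following is only a sketch from memory; in particular, the --resources-accelerator flag is an assumption and should be verified against gcloud compute commitments create --help:

# Reserve (and pay for) one P100 plus some vCPU/memory for 12 months in us-central1
# NOTE: --resources-accelerator is my recollection of the flag name; please verify.
gcloud compute commitments create my-p100-commitment \
  --region us-central1 \
  --plan 12-month \
  --resources vcpu=8,memory=32GB \
  --resources-accelerator count=1,type=nvidia-tesla-p100

The Committed NVIDIA P100 GPUs quota limits how many GPUs you can reserve this way.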

AWS BATCH - how to run more concurrent jobs

I have just started working with AWS Batch for my deep learning workload. I have created a compute environment with the following config:
min vCPUs: 0
max vCPUs: 16
Instance type: g4dn family, g3s family, g3 family, p3 family
allocation strategy: BEST_FIT_PROGRESSIVE
The maximum number of vCPUs for my account is 16, and each of my jobs requires 16 GB of memory. I observe that a maximum of 2 jobs can run concurrently at any point in time. I was using the BEST_FIT allocation strategy before and changed it to BEST_FIT_PROGRESSIVE, but I still see that only 2 jobs run concurrently. This limits the amount of experimentation I can do in a given time. What can I do to increase the number of jobs that can run concurrently?
I figured it out myself just now; I'm posting an answer here in case anyone finds it helpful in the future. It turns out that the instances assigned to each of my jobs were g4dn.2xlarge. Each of these instances uses 8 vCPUs, and since my account vCPU limit is 16, only 2 jobs can run concurrently. One solution is to ask AWS to increase the vCPU limit by creating a support case. Another is to modify the compute environment to use GPU instances that consume 4 vCPUs (the smallest available on AWS), in which case a maximum of 4 jobs can run concurrently.
There are two kinds of solutions:
1. Configure your compute environment with EC2 instances whose vCPU count is a multiple of your job definition's vCPUs (a CLI sketch follows below). For example, a compute environment with an 8-vCPU instance type and a limit of 128 vCPUs, together with a job definition that requests 8 vCPUs, lets you execute up to 16 concurrent jobs, because 16 concurrent jobs × 8 vCPUs = 128 vCPUs (also take the allocation strategy and the instance memory into account, which matters because your jobs consume memory too).
2. Multi-node parallel jobs. This is a very interesting solution because in this scenario the instance vCPU count no longer needs to be a multiple of the vCPUs in your job definition, and a job can be spanned across multiple Amazon EC2 instances.
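A rough CLI sketch of the first approach. The environment name, job definition name, and image are placeholders, and raising maxvCpus only helps once the account-level vCPU limit itself has been increased:

# Raise the ceiling of the existing compute environment
aws batch update-compute-environment \
  --compute-environment my-gpu-env \
  --compute-resources minvCpus=0,maxvCpus=128

# Register a job definition whose vCPU request matches the instance size (e.g. an
# 8-vCPU g4dn.2xlarge is fully packed by one job instead of stranding capacity)
aws batch register-job-definition \
  --job-definition-name my-training-job \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",
    "resourceRequirements": [
      {"type": "VCPU",   "value": "8"},
      {"type": "MEMORY", "value": "16384"},
      {"type": "GPU",    "value": "1"}
    ]
  }'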

GCP AI platform training cannot use full GPU quota

On the GCP -> IAM & admin -> Quotas page, the service "Compute Engine API NVIDIA V100 GPUs" for us-central1 shows a Limit of 4. But when I submit a training job on GCP AI Platform using the command below, I get an error saying the maximum allowed number of V100 GPUs is 2.
Here is the command:
gcloud beta ai-platform jobs submit training $JOB_NAME \
--staging-bucket $PACKAGE_STAGING_PATH \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--python-version 3.5 \
--region us-central1 \
--runtime-version 1.14 \
--scale-tier custom \
--master-machine-type n1-standard-8 \
--master-accelerator count=4,type=nvidia-tesla-v100 \
-- \
--data_dir=$DATA_DIR \
--initial_epoch=$INITIAL_EPOCH \
--num_epochs=$NUM_EPOCHS
Here is the error message:
ERROR: (gcloud.beta.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project [PROJECT_ID]. The request for 4 V100 accelerators exceeds the allowed maximum of 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.
- '#type': type.googleapis.com/google.rpc.QuotaFailure
  violations:
  - description: The request for 4 V100 accelerators exceeds the allowed maximum of
      16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4.
    subject: [PROJECT_ID]
The GPUs on Compute Engine page says that 8 NVIDIA® Tesla® V100 GPUs are available in zones us-central1-a, us-central1-b, us-central1-c, and us-central1-f. My default zone is us-central1-c.
What should I do to be able to use all 4 V100 GPUs for training? Thanks.
UPDATE 1 (1/14/2020):
On this page, it says that the global GPU quota needs to be increased to match the per-region quota, but I couldn't find it anywhere on the Quotas page:
To protect Compute Engine systems and users, new projects have a global GPU quota, which limits the total number of GPUs you can create in any supported zone. When you request a GPU quota, you must request a quota for the GPU models that you want to create in each region, and an additional global quota for the total number of GPUs of all types in all zones.
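For anyone else looking for it, the global quota appears at the project level rather than per region. A sketch of checking it from the CLI, assuming the metric is named GPUS_ALL_REGIONS as it appears in Compute Engine quotas:

# Project-wide quotas, filtered to the all-regions GPU metric
gcloud compute project-info describe \
  --flatten=quotas \
  --filter="quotas.metric=GPUS_ALL_REGIONS" \
  --format="table(quotas.metric,quotas.limit,quotas.usage)"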
UPDATE 2 (1/14/2020):
I contacted GCP to increase the global GPU quota to match my regional quota. They replied that this is needed for some projects, but that for my project there was no need to do it.
This documentation link may shed some light on your error:
"The GPUs that you use for prediction are not counted as GPUs for Compute Engine, and the quota for AI Platform Training does not give you access to any Compute Engine VMs using GPUs. If you want to spin up a Compute Engine VM using a GPU, you must request Compute Engine GPU quota, as described in the Compute Engine documentation."
Google support told me: "There is a V100 GPUS quota, and a V100 VWS GPUS quota. The VWS quota in your project is only 1. Not sure which one is needed here, but that might have been the root cause." After they adjusted the quota, I can now attach up to 8 V100 GPUs to training jobs.