GCP AI Platform training cannot use full GPU quota

On the GCP -> IAM & admin -> Quotas page, the service "Compute Engine API NVidia V100 GPUs" for us-central1 shows a limit of 4. But when I submit a training job on GCP AI Platform using the command below, I get an error saying the maximum allowed number of V100 GPUs is 2.
Here is the command:
gcloud beta ai-platform jobs submit training $JOB_NAME \
--staging-bucket $PACKAGE_STAGING_PATH \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--python-version 3.5 \
--region us-central1 \
--runtime-version 1.14 \
--scale-tier custom \
--master-machine-type n1-standard-8 \
--master-accelerator count=4,type=nvidia-tesla-v100 \
-- \
--data_dir=$DATA_DIR \
--initial_epoch=$INITIAL_EPOCH \
--num_epochs=$NUM_EPOCHS
Here is the error message:
ERROR: (gcloud.beta.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project [PROJECT_ID]. The request for 4 V100 accelerators exceeds the allowed maximum of 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.
- '#type': type.googleapis.com/google.rpc.QuotaFailure
violations:
- description: The request for 4 V100 accelerators exceeds the allowed maximum of
16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4.
subject: [PROJECT_ID]
The GPUs on Compute Engine webpage says that 8 NVIDIA® Tesla® V100 GPUs are available in zones us-central1-a, us-central1-b, us-central1-c, and us-central1-f. My default zone is us-central1-c.
What should I do to use all 4 V100 GPUs for the training? Thanks.
UPDATE 1 (1/14/2020):
This page says that the global GPU quota needs to be increased to match the per-region quota, but I couldn't find that quota anywhere on the Quotas page:
"To protect Compute Engine systems and users, new projects have a global GPU quota, which limits the total number of GPUs you can create in any supported zone. When you request a GPU quota, you must request a quota for the GPU models that you want to create in each region, and an additional global quota for the total number of GPUs of all types in all zones."
Update 2 (1/14/2020):
I contacted GCP to increase the global GPU quota to match my region quota. They replied that for some projects this is needed, but for my project there is no need to do it.

This documentation link may shed some light on your error:
"The GPUs that you use for prediction are not counted as GPUs for Compute Engine, and the quota for AI Platform Training does not give you access to any Compute Engine VMs using GPUs. If you want to spin up a Compute Engine VM using a GPU, you must request Compute Engine GPU quota, as described in the Compute Engine documentation."

Google support told me: "There is a V100 GPUS quota, and a V100 VWS GPUS quota. The VWS quota in your project is only 1. Not sure which one is needed here, but that might have been the root cause." After they adjusted the quota, I can now attach up to 8 V100 GPUs to training jobs.
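For what it's worth, the error message itself lists the AI Platform Training limits, which are separate from the Compute Engine quota shown on the Quotas page. Below is a minimal sketch for pulling one accelerator's limit out of that text. The function name `allowed_max` is made up for illustration, and it assumes the message keeps the "allowed maximum of N TYPE, N TYPE, ..." wording shown in the question.

```shell
# Print the allowed maximum for one accelerator type (e.g. V100, P100, T4),
# given the quota error text on stdin.
# Assumption: the text contains "maximum of" followed by a comma-separated
# list of "<count> <type>" pairs, as in the error quoted in the question.
allowed_max() {
  tr '\n' ' ' |                      # undo line wrapping
    sed 's/.*maximum of *//' |       # keep only the list of limits
    tr ',' '\n' |                    # one "<count> <type>" pair per line
    awk -v accel="$1" '{ sub(/\.$/, "", $2) } $2 == accel { print $1; exit }'
}

# Example with the message from the question:
msg='The request for 4 V100 accelerators exceeds the allowed maximum of
16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4.'
printf '%s\n' "$msg" | allowed_max V100   # prints: 2
```

If the number printed here is lower than what the Quotas page shows for Compute Engine, it is the AI Platform Training quota that needs to be raised.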

Related

Starting a GPU instance on Google Cloud Compute

I am trying to set up a GPU instance on Google Compute Cloud like this
gcloud compute instances create another-ubuntu-instance \
--maintenance-policy TERMINATE --restart-on-failure \
--image-project=ubuntu-os-cloud \
--image-family=ubuntu-2004-lts --machine-type=a2-highgpu-1g --zone europe-west4-b
but I get an error message:
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Quota 'NVIDIA_A100_GPUS' exceeded. Limit: 0.0 in region europe-west4.
even though I have a quota (I think):
So, what am I doing wrong?
You can check your quotas in a particular region for the active project using this command:
gcloud compute regions describe europe-west4 | grep -1 A100
To read more about GPU quotas please see here
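If the `grep` output is hard to read, here is a small sketch that prints the metric, limit and usage on one line. The helper name `quota_for` is hypothetical, and it assumes the YAML output lists each quota as `- limit:`, `metric:`, `usage:` in that order (the same layout the `grep -1` above relies on).

```shell
# Read `gcloud compute regions describe REGION` output on stdin and print
# "METRIC LIMIT USAGE" for every quota whose metric name contains PATTERN.
# Assumption: each quota entry appears as three lines in this order:
#   - limit: ...
#     metric: ...
#     usage: ...
quota_for() {
  awk -v pat="$1" '
    $1 == "-" && $2 == "limit:" { limit = $3 }
    $1 == "metric:"             { metric = $2 }
    $1 == "usage:"              { if (index(metric, pat)) print metric, limit, $2 }
  '
}

# Real use (needs gcloud and an active project):
#   gcloud compute regions describe europe-west4 | quota_for A100

# Example on sample text:
quota_for A100 <<'EOF'
quotas:
- limit: 24.0
  metric: CPUS
  usage: 0.0
- limit: 0.0
  metric: NVIDIA_A100_GPUS
  usage: 0.0
EOF
# prints: NVIDIA_A100_GPUS 0.0 0.0
```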
To address your last point, here is some documentation that might explain the error you are experiencing.
If you are using one of Google Cloud's Free programs, note that each program has its own limitations and conditions.
See also the documentation on which GPUs are available for Compute Engine and graphics workloads, and in which regions and zones.
And as discussed in my comment above, you can request a quota increase from Google.

Can't create a GPU powered VM instance on Google Cloud Platform

I'm trying to create a GCP VM instance with a Tesla P100 GPU. The region I chose is europe-west1. I increased some quotas to get a P100 GPU, and I have the following situation:
NVIDIA P100 GPUs for europe-west1 set to 1
Committed NVIDIA P100 GPUs for europe-west1 set to 1
GPUs (all regions) set to 1
When I try to create the instance I get the following error message:
Quota 'GPUS_ALL_REGIONS' exceeded. Limit: 1.0 globally.
I don't know what's wrong with this configuration. I tried contacting GCP support (replying to one of their messages), but they sent me a default support message with no clues.
Thanks to everybody
It seems that you already requested your quota increase for your GPU in all regions.
Please take into consideration that quota increase requests typically take two business days to process.
If you tried before the quota increase had taken effect, you will receive the error message:
Quota 'GPUS_ALL_REGIONS' exceeded. Limit: 1.0 globally.
On the other hand, when a GPU is not available in the zone or region, you might receive a different error:
ZONE_RESOURCE_POOL_EXHAUSTED
Or
ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS
If one of those errors appears, you can check the following documentation:
Troubleshooting VM creation
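To make the distinction concrete, here is a rough sketch that sorts an error message into "quota" (request or wait for a quota increase) versus "capacity" (try another zone or try later). The function name `classify_gpu_error` is made up, and the patterns are only the strings quoted in this thread, not an exhaustive list.

```shell
# Classify a VM creation error message as a quota problem or a capacity problem.
# Assumption: only the message shapes quoted in this thread are recognised.
classify_gpu_error() {
  case "$1" in
    *ZONE_RESOURCE_POOL_EXHAUSTED*|*"does not have enough resources"*)
      echo capacity ;;   # the zone has no free GPUs right now
    *"Quota '"*"' exceeded"*)
      echo quota ;;      # the project's quota is too low
    *)
      echo unknown ;;
  esac
}

classify_gpu_error "Quota 'GPUS_ALL_REGIONS' exceeded. Limit: 1.0 globally."   # prints: quota
classify_gpu_error "ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS"                 # prints: capacity
```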

Unable to start VM instance with Nvidia Tesla K80

I'm unable to start a new VM instance with an NVIDIA Tesla K80 GPU.
Whenever I try to start it, I get the following error message:
Start VM instance "gpu-1"
My First Project
The zone 'projects/XXX/zones/europe-west1-b' does not have enough resources
available to fulfill the request. Try a different zone, or try again
later.
I've tried nearly all zones around the world that have NVIDIA Tesla K80 GPUs. I've also tried at different hours of the day.
Is it true that the rather cheap GPUs are heavily overbooked all around the world most of the time, or is the error message misleading? Or is there some maintenance going on that I did not notice?
GPUs are not available in every zone; you can see which GPUs are available per zone at the following link.
They are also in high demand, so it is sometimes difficult to create an instance with GPUs.
I tried in my own project and hit the same problem in some zones, but in the end I was able to create an instance with nvidia-tesla-k80 using the following command:
gcloud compute instances create test-instance \
--zone=us-west1-b \
--machine-type=n1-standard-1 \
--image-project=eip-images \
--maintenance-policy=TERMINATE \
--accelerator=type=nvidia-tesla-k80,count=1 \
--image=debian-9-drawfork-v20200207 \
--boot-disk-size=50GB \
--boot-disk-type=pd-standard
And I received the following output:
Created [https://www.googleapis.com/compute/v1/projects/projectname/zones/us-west1-b/instances/test-instance].
NAME           ZONE        MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP  STATUS
test-instance  us-west1-b  n1-standard-1               10.x.x.x     34.x.x.x     RUNNING
I hope you find this information useful.
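Since it came down to trying zone after zone, here is a sketch of how that could be scripted. Both function names are hypothetical, the zone list is only an example taken from zones mentioned in this thread, and the create command is a trimmed version of the one above (default image, no disk flags).

```shell
# Run CMD once per zone until it succeeds; print the zone that worked.
# Returns non-zero if every zone failed.
first_zone_that_works() {
  cmd="$1"; shift
  for zone in "$@"; do
    if "$cmd" "$zone" >/dev/null 2>&1; then
      echo "$zone"
      return 0
    fi
  done
  return 1
}

# Hypothetical wrapper: try to create a K80 instance in the given zone.
create_k80_vm() {
  gcloud compute instances create test-instance \
    --zone="$1" \
    --machine-type=n1-standard-1 \
    --maintenance-policy=TERMINATE \
    --accelerator=type=nvidia-tesla-k80,count=1
}

# Real use (needs gcloud, quota and billing):
#   first_zone_that_works create_k80_vm europe-west1-b us-west1-b us-east1-c
```

The create command is passed in as an argument so the loop can be tried with a dummy command first, without creating (and paying for) real instances.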

GPU quota error even though I have quota for NVIDIA P100 in both us-west1 and europe-west4 (raised from 0 to 1)

I just created an account on Google Cloud Platform and am trying to create a VM instance. I have increased my GPU quota in both us-west1 and europe-west4 from 0 to 1.
Yet when I try to create a VM instance using an NVIDIA P100,
it gives me the error: Quota 'GPUS_ALL_REGIONS' exceeded. Limit: 0.0 globally
Any help would be appreciated. If that GPU is not usable, can you suggest a similarly powerful GPU?
As the error says, you need to increase the GPUS_ALL_REGIONS quota; take a look at this SO question.
From Google documentation:
"Similar to virtual CPU quota, GPU quota refers to the total number of virtual GPUs in all VM instances in a region. Check the quotas page to ensure that you have enough GPUs available in your project, and to request a quota increase. In addition, new accounts and projects have a global GPU quota that applies to all regions.
When you request a GPU quota, you must request a quota for the GPU models that you want to create in each region, and an additional global quota for the total number of GPUs of all types in all zones."
*I assume you upgraded your billing account already as it is a requirement to use GPUs.
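Put differently, the number of GPUs you can actually create is capped by whichever of the two quotas is smaller. A tiny sketch of that arithmetic (the function name `effective_gpu_limit` is made up for illustration):

```shell
# Print the smaller of the regional per-model limit and the global
# GPUS_ALL_REGIONS limit, i.e. how many GPUs the quotas actually allow.
effective_gpu_limit() {
  awk -v r="$1" -v g="$2" 'BEGIN { m = (r + 0 < g + 0) ? r + 0 : g + 0; print m }'
}

# The situation in the question: regional P100 quota 1, global quota 0.
effective_gpu_limit 1.0 0.0   # prints: 0
```

As far as I know, the global limit shows up in the output of `gcloud compute project-info describe`, and the regional one in `gcloud compute regions describe REGION`.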

Error on creating instance with GPU on Google Cloud Platform (GCP)

I have increased the GPU quota (Preemptible NVIDIA K80 GPUs) in region us-east1 via a request. However, I still cannot create an instance with GPUs: I get an error saying Quota NVIDIA_K80_GPUS exceeded, whether I try zone us-east1-c or us-east1-d. I contacted support, but technical support costs $150/month. Please let me know if you need additional info for troubleshooting. Thanks.
It turns out preemptible GPUs are not the same as regular GPUs. Based on my experiments, you have to use a preemptible VM in order to use the preemptible GPU quota. Don't mix the two up when sending the quota request.
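To illustrate the difference, here is a sketch that maps a GPU model and VM type to the quota metric it would be counted against. The function name `gpu_quota_metric` is hypothetical. `NVIDIA_K80_GPUS` is the metric from the error above; the `PREEMPTIBLE_` prefix for the other metric is my assumption based on how the quota is named on the Quotas page.

```shell
# Print the quota metric a GPU request is assumed to count against.
#   $1: GPU model, e.g. k80, p100, v100
#   $2: "preemptible" for a preemptible VM, anything else for a regular VM
# Assumption: preemptible metrics are the regular name prefixed with PREEMPTIBLE_.
gpu_quota_metric() {
  model=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  if [ "$2" = "preemptible" ]; then
    echo "PREEMPTIBLE_NVIDIA_${model}_GPUS"
  else
    echo "NVIDIA_${model}_GPUS"
  fi
}

gpu_quota_metric k80 regular       # prints: NVIDIA_K80_GPUS
gpu_quota_metric k80 preemptible   # prints: PREEMPTIBLE_NVIDIA_K80_GPUS
```

So with only the preemptible quota raised, the instance itself has to be created as preemptible, for example by adding `--preemptible` to `gcloud compute instances create`.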