I'm unable to start a new VM instance with an NVIDIA Tesla K80 GPU.
Whenever I try to start it, I get the following error message:
Start VM instance "gpu-1"
My First Project
The zone 'projects/XXX/zones/europe-west1-b' does not have enough resources
available to fulfill the request. Try a different zone, or try again
later.
I've tried nearly all zones around the world that have NVIDIA Tesla K80 GPUs. I've also tried at different hours of the day.
Is it correct that the rather cheap GPUs are heavily overbooked all around the world most of the time, or is the error message I am receiving misleading? Or is some maintenance going on that I did not notice?
GPUs are not available in every zone; you can see which GPUs are available per zone in the following link.
They are also in high demand, so it is sometimes difficult to create an instance with GPUs.
I tried in my own project and hit the same problem in some zones, but in the end I was able to create an instance with nvidia-tesla-k80 using the following command:
gcloud compute instances create test-instance \
--zone=us-west1-b \
--machine-type=n1-standard-1 \
--image-project=eip-images \
--maintenance-policy=TERMINATE \
--accelerator=type=nvidia-tesla-k80,count=1 \
--image=debian-9-drawfork-v20200207 \
--boot-disk-size=50GB \
--boot-disk-type=pd-standard
And I received the following output:
Created [https://www.googleapis.com/compute/v1/projects/projectname/zones/us-west1-b/instances/test-instance].
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
test-instance us-west1-b n1-standard-1 10.x.x.x 34.x.x.x RUNNING
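If you want to automate the zone hunting, here is a minimal sketch of a retry loop. It only assumes that `gcloud compute instances create` exits non-zero when a zone cannot fulfill the request. The function names and the `CREATE_CMD` hook are my own (hypothetical, not part of gcloud), and the zone list you pass in is up to you.

```shell
#!/bin/sh
# Create one K80 instance in the given zone (same flags as the command above,
# minus the image and disk options, which fall back to gcloud defaults).
create_in_zone() {
  gcloud compute instances create "$1" \
    --zone="$2" \
    --machine-type=n1-standard-1 \
    --maintenance-policy=TERMINATE \
    --accelerator=type=nvidia-tesla-k80,count=1
}

# Try each zone in order and print the first one where creation succeeds.
# CREATE_CMD is a hypothetical hook for swapping in another create command;
# it defaults to create_in_zone.
first_available_zone() {
  name=$1
  shift
  for zone in "$@"; do
    # Send the create command's own output to stderr so that stdout
    # carries only the winning zone name.
    if "${CREATE_CMD:-create_in_zone}" "$name" "$zone" >&2; then
      echo "$zone"
      return 0
    fi
  done
  return 1
}

# Example (zones are placeholders):
#   first_available_zone gpu-1 europe-west1-b us-west1-b us-east1-c
```

The loop stops at the first success, so you never end up with more than one instance.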
I hope you find this information useful.
Related
I am trying to set up a GPU instance on Google Compute Cloud like this
gcloud compute instances create another-ubuntu-instance \
--maintenance-policy TERMINATE --restart-on-failure \
--image-project=ubuntu-os-cloud \
--image-family=ubuntu-2004-lts --machine-type=a2-highgpu-1g --zone europe-west4-b
but I get an error message:
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Quota 'NVIDIA_A100_GPUS' exceeded. Limit: 0.0 in region europe-west4.
even though I have a quota (I think):
So, what am I doing wrong?
You can check your quotas in a particular region for active project using this command:
gcloud compute regions describe europe-west4 | grep -1 A100
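If you only want the number rather than the grep context, a small awk filter can pull the limit for one metric out of that output. This is a sketch that assumes the default YAML-style output, where each quota is a list item with `limit:`, `metric:` and `usage:` keys; `quota_limit` is a hypothetical helper name, not a gcloud feature.

```shell
#!/bin/sh
# Read `gcloud compute regions describe REGION` output on stdin and print
# the limit of the quota whose metric equals $1 (prints nothing if absent).
quota_limit() {
  awk -v want="$1" '
    # A line starting with "- " opens a new list item: flush the previous one.
    /^- / {
      if (metric == want) print limit
      metric = ""; limit = ""
      sub(/^- /, "  ")          # drop the dash so the key becomes field 1
    }
    $1 == "limit:"  { limit = $2 }
    $1 == "metric:" { metric = $2 }
    END { if (metric == want) print limit }
  '
}

# Example:
#   gcloud compute regions describe europe-west4 | quota_limit NVIDIA_A100_GPUS
```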
To read more about GPU quotas, please see here.
To answer your last point, here is some documentation that might explain the error you are experiencing.
If you are using one of Google Cloud's free programs, there are limitations and conditions for each program.
For more information on the GPUs that are available for Compute Engine and graphics workloads, and on the regions and zones where they are available, see the documentation.
And as discussed in my comment above, you can request a quota increase from Google.
I have been trying to follow the steps in this https://cloud.google.com/marketplace/docs/partners/vm/build-vm-image#create_a_licensed_vm_image guide to offer a VM solution on Google Cloud Marketplace.
With reference to step 5 in the above link:
The data disk shows up on my system with its boot and a data partition, but it is not mounted anywhere. I am able to access my installed application files on my current boot disk, which is weird since I did not install my application on this boot disk.
Here is my gcloud command for first instance.
gcloud compute instances create vm \
--image-family centos-7 \
--image-project centos-cloud \
--no-restart-on-failure \
--maintenance-policy=TERMINATE --preemptible
Here is my gcloud command for second instance.
gcloud compute instances create vm2 \
--image-family centos-7 \
--image-project centos-cloud \
--no-restart-on-failure \
--maintenance-policy=TERMINATE --preemptible \
--disk=name=vm,mode=rw,boot=no
Could someone please explain step 5 (boot disk cleanup) in the link I mentioned above?
I figured it out. It seems that even though I am attaching the previous boot disk as a data disk, the second instance sometimes boots from the previous boot disk (this behavior is random). lsblk shows two identical boot disks. If I find my application running, it means I booted from the first boot disk; otherwise I booted from the second. The only workaround I found is to create the second instance first, wait for it to boot completely, and then attach the previous boot disk as a data disk.
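Roughly, the workaround looks like the sketch below. The zone, the 60-second wait, and the `run`/`attach_after_boot` helper names are my assumptions (hypothetical, chosen for illustration); `gcloud compute instances attach-disk` is the command that attaches an existing disk to an instance, and the disk has to be free (not attached read-write to another instance) for it to work.

```shell
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Boot the new instance on its own fresh boot disk first, then attach the
# old boot disk as a data disk once the guest has had time to come up.
attach_after_boot() {
  instance=$1
  disk=$2
  zone=$3
  run gcloud compute instances create "$instance" \
    --zone="$zone" --image-family=centos-7 --image-project=centos-cloud
  # Assumption: 60 seconds is enough for the guest OS to finish booting.
  # A RUNNING status alone does not guarantee that, so adjust as needed.
  [ "${DRY_RUN:-0}" = "1" ] || sleep 60
  run gcloud compute instances attach-disk "$instance" \
    --zone="$zone" --disk="$disk" --mode=rw
}

# Example: DRY_RUN=1 prints the two commands without running them.
#   DRY_RUN=1; attach_after_boot vm2 vm us-central1-a
```

Because the old disk is attached only after the boot has finished, the new instance cannot pick it as its boot device.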
On the GCP -> IAM & admin -> Quotas page, the service "Compute Engine API NVidia V100 GPUs" for us-central1 shows a limit of 4. But when I submit a training job on GCP AI Platform using the command below, I get an error saying the maximum allowed number of V100 GPUs is 2.
Here is the command:
gcloud beta ai-platform jobs submit training $JOB_NAME \
--staging-bucket $PACKAGE_STAGING_PATH \
--job-dir $JOB_DIR \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--python-version 3.5 \
--region us-central1 \
--runtime-version 1.14 \
--scale-tier custom \
--master-machine-type n1-standard-8 \
--master-accelerator count=4,type=nvidia-tesla-v100 \
-- \
--data_dir=$DATA_DIR \
--initial_epoch=$INITIAL_EPOCH \
--num_epochs=$NUM_EPOCHS
Here is the error message:
ERROR: (gcloud.beta.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED: Quota failure for project [PROJECT_ID]. The request for 4 V100 accelerators exceeds the allowed maximum of 16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4. To read more about Cloud ML Engine quota, see https://cloud.google.com/ml-engine/quotas.
- '#type': type.googleapis.com/google.rpc.QuotaFailure
violations:
- description: The request for 4 V100 accelerators exceeds the allowed maximum of
16 TPU_V2, 16 TPU_V3, 2 P4, 2 V100, 40 K80, 40 P100, 8 T4.
subject: [PROJECT_ID]
The GPUs on Compute Engine webpage says that 8 NVIDIA® Tesla® V100 GPUs are available in zones us-central1-a, us-central1-b, us-central1-c, and us-central1-f. My default zone is us-central1-c.
What should I do to use all 4 V100 GPUs for the training? Thanks.
UPDATE 1 (1/14/2020):
On this page, it says something about the global GPU quota that needs to be increased to match the per-region quota. But I couldn't find it anywhere on the Quota page.
To protect Compute Engine systems and users, new projects have a global GPU quota, which limits the total number of GPUs you can create in any supported zone. When you request a GPU quota, you must request a quota for the GPU models that you want to create in each region, and an additional global quota for the total number of GPUs of all types in all zones.
Update 2 (1/14/2020):
I contacted GCP to increase the global GPU quota to match my region quota. They replied that for some projects this is needed, but for my project there is no need to do it.
This documentation link may shed some light on your error:
"The GPUs that you use for prediction are not counted as GPUs for Compute Engine, and the quota for AI Platform Training does not give you access to any Compute Engine VMs using GPUs. If you want to spin up a Compute Engine VM using a GPU, you must request Compute Engine GPU quota, as described in the Compute Engine documentation."
Google people told me "there is a V100 GPUS quota, and a V100 VWS GPUS quota. The VWS quota in your project is only 1. Not sure which one is needed here, but that might have been the root cause." After they adjusted the quota, now I can attach up to 8 V100 GPUs for training jobs.
After executing
gcloud compute instances move instance-ba --zone us-east1-b --destination-zone us-east1-c
and waiting for about 5 minutes the following error was thrown
Moving gce instance instance-ba...failed.
ERROR: (gcloud.compute.instances.move) Code: '6562453592928582321'
and the instance was gone from the web interface as well as from zone us-east1-b and us-east1-c
I tried to start the instance with
gcloud compute instances start instance-ba --zone us-east1-b
and
gcloud compute instances start instance-ba --zone us-east1-c
but neither worked.
Thank you in advance for your help.
I have to say that this instance is quite important and I appreciate every input to solve this issue.
Edit
In the Stackdriver Logging I am seeing the following commands executed alternating:
Compute Engine setDiskAutoDelete us-east1-b:instance-ba
compute.instances.setDiskAutoDelete
It seems the instance has been deleted from us-east1-b but not transferred to us-east1-c.
I do not see any error at all. All logs have severity "INFO" or lower.
Edit 2
I recall the steps that preceded the move error as follows:
I tried adding a second Tesla P100 to my instance, which at startup gave the error that there were not enough resources to fulfill the request
I tried moving the instance, which gave the "TERMINATED" error, so I tried to reset the machine with the reset command, which gave the "instance not ready" error
I removed the second Tesla P100 so that I could start the machine
I ran the restart command over and over until it worked and the machine was able to start
Since I needed a second GPU, I tried to move this instance (without the second GPU) from us-east1-b to us-east1-c, which ultimately did not work and gave the error
Edit 3
After some research I found that the procedure actually made a snapshot from my instance and the data is not lost.
However I will keep this question updated concerning the error and the response to it from google.
The documentation gives a short list of cases in which you should move the instance manually rather than use the automatic move. As the procedure says, use the manual move when:
"Your VM is not RUNNING."
"You are moving your VM to a different region and your VM belongs to a subnetwork."
"Your instance has GPUs or local SSDs attached."
In your case, you had one GPU attached to your instance. So the correct way to move it is the following:
Stop your instance
Edit the instance; under “Machine type”, click Customize and set the number of GPUs to “None”. More details here.
Start your instance
Use the gcloud command to move the instance between zones:
$ gcloud compute instances move example-instance --zone us-central1-a --destination-zone us-central1-f
Once the instance is migrated stop it again.
Add the GPU and start the instance.
Keep in mind that every zone has different GPUs available and new projects have limits for GPUs.
"To protect Compute Engine systems and users, new projects have a global GPU quota, which limits the total number of GPUs you can create in any supported zone. When you request a GPU quota, you must request a quota for the GPU models that you want to create in each region, and an additional global quota for the total number of GPUs of all types in all zones."
I have increased the GPU quota (Preemptible NVIDIA K80 GPUs) in region us-east1 via a request. However, I still cannot create an instance with GPUs, and I get the error message Quota NVIDIA_K80_GPUS exceeded whether I try zone us-east1-c or us-east1-d. I have contacted them, but technical support costs $150/month. Please let me know if you need additional info for troubleshooting. Thanks.
It turns out preemptible GPUs are not the same as regular GPUs. Based on my experiments, you have to use a preemptible VM in order to attach preemptible GPUs. Don't mix up the two when sending the quota request.
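To make the difference concrete, here is a small sketch that prints the create command for either case. As far as I can tell, the only change on the command line is the `--preemptible` flag, which makes the request count against the preemptible K80 quota instead of NVIDIA_K80_GPUS. `k80_create_cmd` is a hypothetical helper, and the machine type is just an example.

```shell
#!/bin/sh
# Print the gcloud command for a VM with one K80.
# Pass "preemptible" as the third argument to request a preemptible VM,
# which is what a preemptible GPU quota applies to.
k80_create_cmd() {
  name=$1
  zone=$2
  kind=${3:-regular}
  cmd="gcloud compute instances create $name --zone=$zone"
  cmd="$cmd --machine-type=n1-standard-1 --maintenance-policy=TERMINATE"
  cmd="$cmd --accelerator=type=nvidia-tesla-k80,count=1"
  if [ "$kind" = "preemptible" ]; then
    cmd="$cmd --preemptible"
  fi
  echo "$cmd"
}

# Example:
#   k80_create_cmd gpu-1 us-east1-c preemptible
```

Printing the command rather than running it lets you check the flags before spending any quota.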