Instance Group won't create an instance with GPU: Not enough resources - google-cloud-platform

I'm able to create an instance with NVIDIA-K80 manually, however my instance group shows a warning on the instance:
Instance 'instance-6lqk' creation failed: The zone 'projects/my-project/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
Note: both are created in the same zone.

(Disclosure: I work for Google.)
The error message you received indicates that you did everything right, but the zone couldn't fulfill your request. This happens from time to time in a single zone, for a variety of reasons. My suggestion would be to use multiple zones and/or multiple regions, so that if one zone runs short you can simply create capacity in another.
Note that many of our preemptible GPU users who run large workloads on many GPUs do just this: they ask for quota in many regions and run multi-region instance groups to have the best chance of reaching the most capacity possible.
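One way to do that (a sketch, assuming the instance template from the question is named gpu-template and that us-central1 is the target region) is a regional managed instance group, which spreads instances across several zones so a shortage in one zone is less likely to block creation:

```shell
# Create a regional managed instance group that distributes four
# instances across three zones of us-central1; if one zone is out of
# K80 capacity, the others may still satisfy the request.
gcloud compute instance-groups managed create gpu-group \
  --template=gpu-template \
  --size=4 \
  --region=us-central1 \
  --zones=us-central1-a,us-central1-b,us-central1-c
```

The group name gpu-group is illustrative; GPU quota is still required in each zone you list.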

I created my instance group from an instance template taken from the Google docs, using the same example shown below:
gcloud beta compute instance-templates create gpu-template \
    --machine-type n1-standard-2 \
    --boot-disk-size 250GB \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud \
    --maintenance-policy TERMINATE --restart-on-failure \
    --metadata startup-script='#!/bin/bash
      echo "Checking for CUDA and installing."
      # Check for CUDA and try to install it if missing.
      if ! dpkg-query -W cuda-9-0; then
        curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
        dpkg -i ./cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
        apt-get update
        apt-get install cuda-9-0 -y
      fi'
However, I found some recommendations for creating GPU instances: make sure you have the necessary quotas in the zone, and note Michael's comment about the GPU restrictions.
Hope this is useful to you.
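To check the quota side before creating instances (a sketch; the region is an assumption, so substitute the region your zone belongs to):

```shell
# Show GPU-related quotas for us-central1. The default YAML output
# lists each quota as limit/metric/usage; NVIDIA_K80_GPUS must have
# limit greater than usage for GPU instance creation to succeed.
gcloud compute regions describe us-central1 --format=yaml | grep -B1 -A1 GPUS
```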

Related

GCP - gcloud commands history for actions done via GUI

When I do something in GCP console (by clicking in GUI), I imagine some gcloud command is executed underneath. Is it possible to view this command?
(I created a notebooks instance on Vertex AI and wanted to know what exactly I should put after gcloud notebooks instances create... to get the same result)
I think it's not possible to view the gcloud command behind a GUI action.
You should test your gcloud command by creating another instance alongside the current one, with all the needed parameters.
When the two instances are the same, you know your gcloud command is ready.
The documentation seems clear and complete for this:
https://cloud.google.com/vertex-ai/docs/workbench/user-managed/create-new#gcloud
If it's possible for you, you can also think about Terraform to automate this creation for you with a state management.
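As a sketch of the "compare the two instances" approach (the instance names gui-instance and cli-instance, and the location, are illustrative):

```shell
# Dump both notebook instances as JSON and diff them; an empty diff
# (apart from names and timestamps) means the gcloud command reproduces
# the GUI-created instance.
gcloud notebooks instances describe gui-instance \
  --location=europe-west3-b --format=json > gui.json
gcloud notebooks instances describe cli-instance \
  --location=europe-west3-b --format=json > cli.json
diff gui.json cli.json
```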
Try this for a Python-based user-managed notebook (the GUI version of the Python instance uses the base image as the boot disk, which does not contain Python; the Python suite is installed explicitly via metadata parameters):
export NETWORK_URI="NETWORK URI"
export SUBNET_URI="SUBNET URI"
export INSTANCE_NAME="instance-name-of-your-liking"
export VM_IMAGE_PROJECT="deeplearning-platform-release"
export VM_IMAGE_FAMILY="common-cpu-notebooks-debian-10"
export MACHINE_TYPE="n1-standard-4"
export LOCATION="europe-west3-b"
gcloud notebooks instances create $INSTANCE_NAME \
--no-public-ip \
--vm-image-project=$VM_IMAGE_PROJECT \
--vm-image-family=$VM_IMAGE_FAMILY \
--machine-type=$MACHINE_TYPE \
--location=$LOCATION \
--network=$NETWORK_URI \
--subnet=$SUBNET_URI \
--metadata=framework=NumPy/SciPy/scikit-learn,report-system-health=true,proxy-mode=service_account,shutdown-script=/opt/deeplearning/bin/shutdown_script.sh,notebooks-api=PROD,enable-guest-attributes=TRUE
To get a list of Network URIs in your project:
gcloud compute networks list --uri
To get a list of Subnet URIs in your project:
gcloud compute networks subnets list --uri
Put the corresponding URIs between the quotation marks in the first two variables:
export NETWORK_URI="NETWORK URI"
export SUBNET_URI="SUBNET URI"
Name the instance (keep the quotation marks):
export INSTANCE_NAME="instance-name-of-your-liking"
When done, copy-paste the complete block into your Google Cloud Shell (assuming you are in the correct project).
To additionally enable Secure Boot (which is a tick box in the GUI setup):
gcloud compute instances stop $INSTANCE_NAME
gcloud compute instances update $INSTANCE_NAME --shielded-secure-boot
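You can then start the instance again and, if you like, confirm the setting took effect (a sketch reusing the variables from the block above; the --zone flags are added here because compute commands need the zone when no default is configured):

```shell
# Start the instance again and print the Secure Boot flag, which
# should read True after the update.
gcloud compute instances start $INSTANCE_NAME --zone=$LOCATION
gcloud compute instances describe $INSTANCE_NAME --zone=$LOCATION \
  --format="value(shieldedInstanceConfig.enableSecureBoot)"
```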
Hope it works for you, as it does for me.

Issues creating AlloyDB

Creating new AlloyDB instances has been failing for the past 24 hours. It was working fine a few days ago.
# creating the cluster works
gcloud beta alloydb clusters create dev-cluster \
--password=$PG_RAND_PW \
--network=$PRIVATE_NETWORK_NAME \
--region=us-east4 \
--project=${PROJECT_ID}
# creating primary instance fails
gcloud beta alloydb instances create devdb \
--instance-type=PRIMARY \
--cpu-count=2 \
--region=us-east4 \
--cluster=dev-cluster \
--project=${PROJECT_ID}
Error message is
Operation ID: operation-1660168834702-5e5ea2da8dcd1-d96bdabb-4c686076
Creating instance...failed.
ERROR: (gcloud.beta.alloydb.instances.create) an internal error has occurred
Creating from the console fails also
I have tried from a complete new project also and it still fails.
Any suggestions?
I've managed to replicate your issue. It seems this happens because AlloyDB for PostgreSQL is still in preview, so we may encounter some bugs and errors, according to this documentation:
This product is covered by the Pre-GA Offerings Terms of the Google Cloud Terms of Service. Pre-GA products might have limited support, and changes to pre-GA products might not be compatible with other pre-GA versions. For more information, see the launch stage descriptions.
What worked on my end was following this documentation on creating a cluster and its primary instance using the console. This step creates both the cluster and its primary instance at the same time.
In my test, the instance under the cluster my-cluster ended in an error and was not created, whereas the instance devdb was created successfully by following the link provided above.
It would also be best to raise this as an issue, as per @DazWilkin's comment, if the problem persists in the future.
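If it fails again, the operation ID from the error output can be inspected for more detail (a sketch using the ID and region from the question):

```shell
# Describe the failed long-running operation; its error field sometimes
# carries more context than the CLI message does.
gcloud beta alloydb operations describe \
  operation-1660168834702-5e5ea2da8dcd1-d96bdabb-4c686076 \
  --region=us-east4
```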

Additional persistent disks are not created when using --source-machine-image with 'gcloud beta compute instances create' CLI command

The following works great - creating VM from source image and additional persistent disk(s).
gcloud compute instances create ${INSTANCE_NAME} \
--image-project ${PROJECT_NAME} \
--image ${BASE_IMAGE_NAME} \
--zone=${ZONE_NAME} \
--create-disk=size=128GB,type=pd-balanced,name=${INSTANCE_NAME}-home,device-name=homedisk
The following, however, creates a VM BUT no additional disk(s) are created.
gcloud beta compute instances create ${INSTANCE_NAME} \
--source-machine-image ${BASE_IMAGE_NAME} \
--zone=${ZONE_NAME} \
--create-disk=size=128GB,type=pd-balanced,name=${INSTANCE_NAME}-homedisk,device-name=homedisk
The documentation for the command does not suggest that --source-machine-image and --create-disk cannot work in tandem, and the section on property overrides when creating a VM from a machine image suggests that any of the properties can be overridden.
Any insights as to what might be going on?
The problem here is with the --source-machine-image ${BASE_IMAGE_NAME} flag: your BASE_IMAGE_NAME must already contain the desired additional disk. The disk is not being created because everything is created from BASE_IMAGE_NAME, which does not have an additional disk. Try creating a new machine image with the desired additional disk attached, then run your gcloud beta compute instances create command again (your second command) and confirm that it creates the instance, including the additional disk, based on that machine image.
If you need to create a new instance with one additional disk, you should use --image ${NAME} --image-project ${PROJECT} (your first command).
So --source-machine-image and --image ... --image-project behave very differently.
Here is the documentation for Machine images which may explain this better.
https://cloud.google.com/compute/docs/machine-images
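A sketch of that suggestion (the instance, disk, and zone names are illustrative): attach the extra disk to the source VM first, then capture a machine image that includes it.

```shell
# Attach the additional disk to the VM the machine image is taken from,
# then create a machine image that captures all attached disks.
gcloud compute instances attach-disk source-vm \
  --disk=homedisk --zone=us-central1-a
gcloud compute machine-images create base-image-with-disk \
  --source-instance=source-vm \
  --source-instance-zone=us-central1-a
```

Instances created with --source-machine-image base-image-with-disk should then come up with the additional disk already attached.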

Orchestrating many instances from a local - Too frequent operations from the source resource

I have a local linux machine of the following flavor:
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
On that machine I have gcloud installed with the following version:
$ gcloud -v
Google Cloud SDK 334.0.0
alpha 2021.03.26
beta 2021.03.26
bq 2.0.66
core 2021.03.26
gsutil 4.60
On that machine I have the following shell script:
#!/bin/bash
client_name=$1
instance_name_arg=imageclient${client_name}
z=us-east1-b
gcloud beta compute --project=foo-123 instances create $instance_name_arg --zone=${z} --machine-type=g1-small --subnet=default --network-tier=PREMIUM --source-machine-image projects/foo-123/global/machineImages/envNameimage2
gcloud compute instances start $instance_name_arg --zone=${z}
sleep 10s
gcloud compute ssh --zone=${z} usr@${instance_name_arg} -- '/home/usr/miniconda3/envs/minienvName/bin/python /home/usr/git/bitstamp/envName/client_run.py --payload=client_packages_'${client_name}
gcloud compute instances stop $instance_name_arg --zone=${z}
As I am scaling this project, I am needing to launch this task many times at exactly the same time but with different settings.
As I am launching more and more of these bash scripts, I am starting to get the following error message:
ERROR: (gcloud.beta.compute.instances.create) Could not fetch resource:
- Operation rate exceeded for resource 'projects/foo-123/global/machineImages/envNameimage2'. Too frequent operations from the source resource.
How do I work around this issue?
As architected, my solution is "one machine per launch setting", and I would ideally like to keep it that way.
There must be other large-scale gcloud customers who need a large number of machines spawned in parallel.
Thank you very much
As you are using the machine image envNameimage2 to create the new instance, this is treated like a snapshot of the disk.
You can snapshot your disks at most once every 10 minutes. If you want to issue a burst of requests to snapshot your disks, you can issue at most 6 requests in 60 minutes.
Reference:
Snapshot frequency limits
Creating disks from snapshots
Other API rate limits
A workaround could be to stay within the rate limits, or to create the instance from an existing (available) disk with the --disk flag:
--disk=name=clone-disk-1,device-name=clone-disk-1,mode=rw,boot=yes
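Another sketch (the image and instance names are illustrative, and whether this fits depends on whether a plain disk image captures enough of your machine image's state): bake a regular disk image once, then create many instances from it in a single request, which avoids issuing repeated operations against the machine image.

```shell
# Create a reusable disk image from the boot disk once...
gcloud compute images create envname-image \
  --source-disk=envname-boot-disk --source-disk-zone=us-east1-b

# ...then create several instances from it in one bulk request.
gcloud compute instances create imageclient1 imageclient2 imageclient3 \
  --zone=us-east1-b --machine-type=g1-small --image=envname-image
```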

Google compute engine Change Zones

I saw the following warning message when I connected to the compute engine:
"This zone is deprecated and will go offline soon. When the zone goes offline, all VMs in this zone will be destroyed."
In this case my machine will be deleted, am I right? Shouldn't it be migrated automatically? How can I move my machine to a different zone?
Thanks.
UPDATE:
Google has released new tools in the SDK for moving instances and disks. First, update the gcloud components:
$ gcloud components update
Then you can move instances as follows:
$ gcloud compute instances move my-vm --zone europe-west1-a --destination-zone europe-west1-d
Or disks:
$ gcloud compute disks move my-disk --zone europe-west1-a --destination-zone europe-west1-d
ORIGINAL ANSWER:
You will need to migrate manually when a zone is deprecated.
Source: https://cloud.google.com/compute/docs/zones#zone_deprecation
You can find instructions on migrating here: https://cloud.google.com/compute/docs/instances#moving_an_instance_between_zones
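If the move commands above are not available to you, a manual migration via snapshot looks roughly like this (the disk, snapshot, and instance names are illustrative):

```shell
# Snapshot the disk in the old zone, recreate it in the new zone, and
# boot a new VM from the recreated disk.
gcloud compute disks snapshot my-vm-disk \
  --zone=europe-west1-a --snapshot-names=my-vm-snap
gcloud compute disks create my-vm-disk-new \
  --source-snapshot=my-vm-snap --zone=europe-west1-d
gcloud compute instances create my-vm-new \
  --zone=europe-west1-d --disk=name=my-vm-disk-new,boot=yes
```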