Can't create Deep Learning VM using Tensorflow 2.0 framework - google-cloud-platform

I'm trying to create a Deep Learning Virtual Machine on Google Cloud Platform that uses TensorFlow 2.0, but when I instantiate it I get the following error:
deep-learning-training-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"errors":[{"domain":"global","message":"Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://www.googleapis.com/compute/v1/projects/click-to-deploy-images/global/images/tf-2-0-cu100-experimental-20190909'. The referenced image resource cannot be found.","reason":"invalid"}],"message":"Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://www.googleapis.com/compute/v1/projects/click-to-deploy-images/global/images/tf-2-0-cu100-experimental-20190909'. The referenced image resource cannot be found.","statusMessage":"Bad Request","requestPath":"https://compute.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/instances","httpMethod":"POST"}}
I don't quite understand the error, but I believe GCP is not able to find the right image for my virtual machine, i.e., the image that has this version of TensorFlow in it (maybe because of the TF 2.0 release?).
Has anyone faced this problem before? Is there a way to create a DL VM using TensorFlow 2.0?

It seems to have been a transient issue, since the image is available now.
In addition, you can create your DL VM via gcloud. Here's an example of the command:
gcloud compute instances create INSTANCE_NAME \
--zone=ZONE \
--image-family=tf2-latest-cu100 \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-v100,count=1" \
--metadata="install-nvidia-driver=True,proxy-mode=project_editors" \
--machine-type=n2-highmem-8
There's more information on how to do this in the Deep Learning VM documentation.
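If you want to double-check which TF 2.x images are currently published before creating the VM, you can list them straight from the release project (a quick sketch; the family regex here is an assumption, so adjust it to the family you need):
gcloud compute images list \
--project deeplearning-platform-release \
--no-standard-images \
--filter="family~tf2"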
Also, if you are looking to create a VM with TensorFlow and Jupyter, you can try using AI Platform Notebooks.
When you create a new notebook, you can select TensorFlow 2.0 and further customize it to select the accelerator, machine type, etc.
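If you prefer scripting that route too, the gcloud notebooks surface can create a comparable instance. This is only a sketch: the instance name, location, machine type, and the tf2-latest-gpu image family are assumptions to adapt to your project:
gcloud notebooks instances create my-tf2-notebook \
--location=us-west1-b \
--vm-image-project=deeplearning-platform-release \
--vm-image-family=tf2-latest-gpu \
--machine-type=n1-standard-8 \
--accelerator-type=NVIDIA_TESLA_V100 \
--accelerator-core-count=1 \
--install-gpu-driver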

Related

How to copy artifacts between projects in Google Artifact Registry

Previously, using Container Registry, one could copy a container between projects using this method.
However, I am unable to get this working with Artifact Registry. If I try
gcloud artifacts docker tags add \
us-east4-docker.pkg.dev/source-proj/my-repo/my-image:latest \
us-east4-docker.pkg.dev/dest-proj/my-repo/my-image:latest
It gives the error
ERROR: (gcloud.artifacts.docker.tags.add) Image us-east4-docker.pkg.dev/source-proj/my-repo/my-image
does not match image us-east4-docker.pkg.dev/dest-proj/my-repo/my-image
I have searched and cannot find any examples or documentation on how to do this.
You can use the gcrane tool to copy images in Artifact Registry.
For example, the following command copies image my-image:latest from the repository my-repo in the project source-proj to the repository my-repo in another project called dest-proj.
gcrane cp \
us-east4-docker.pkg.dev/source-proj/my-repo/my-image:latest \
us-east4-docker.pkg.dev/dest-proj/my-repo/my-image:latest
Here is the link to the official Google Cloud documentation.
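If you'd rather not install gcrane, a plain docker pull/tag/push between the two repositories also works, assuming your credentials have read access to the source project and write access to the destination, and docker is authenticated for the region (e.g. via gcloud auth configure-docker us-east4-docker.pkg.dev):
docker pull us-east4-docker.pkg.dev/source-proj/my-repo/my-image:latest
docker tag us-east4-docker.pkg.dev/source-proj/my-repo/my-image:latest \
us-east4-docker.pkg.dev/dest-proj/my-repo/my-image:latest
docker push us-east4-docker.pkg.dev/dest-proj/my-repo/my-image:latest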

PyTorch model deployment in AI Platform

I'm deploying a PyTorch model on Google Cloud AI Platform, and I'm getting the following error:
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
Configuration:
setup.py
from setuptools import setup

REQUIRED_PACKAGES = ['torch']

setup(
    name="iris-custom-model",
    version="0.1",
    scripts=["model.py"],
    install_requires=REQUIRED_PACKAGES
)
Model version creation
MODEL_VERSION='v1'
RUNTIME_VERSION='1.15'
MODEL_CLASS='model.PyTorchIrisClassifier'
!gcloud beta ai-platform versions create {MODEL_VERSION} --model={MODEL_NAME} \
--origin=gs://{BUCKET}/{GCS_MODEL_DIR} \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{GCS_PACKAGE_URI} \
--prediction-class={MODEL_CLASS}
You need to use PyTorch packages compiled to be compatible with Cloud AI Platform (package information here):
This bucket contains compiled packages for PyTorch that are compatible with Cloud AI Platform prediction. The files are mirrored from the official builds at https://download.pytorch.org/whl/cpu/torch_stable.html
From the documentation:
In order to deploy a PyTorch model on Cloud AI Platform Online Predictions, you must add one of these packages to the packageUris field on the version you deploy. Pick the package matching your Python and PyTorch version. The package names follow this template:
Package name = torch-{TORCH_VERSION_NUMBER}-{PYTHON_VERSION}-linux_x86_64.whl, where PYTHON_VERSION = cp35-cp35m for Python 3 with runtime versions < 1.15, cp37-cp37m for Python 3 with runtime versions >= 1.15.
For example, if I were to deploy a PyTorch model based on PyTorch 1.1.0 and Python 3, my gcloud command would look like:
gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME}
...
--package-uris=gs://{MY_PACKAGE_BUCKET}/my_package-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.1.0-cp35-cp35m-linux_x86_64.whl
In summary:
1) Remove torch from your install_requires dependencies in setup.py (a trimmed setup.py sketch follows the command below).
2) Include the torch package when creating your model version:
!gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/text_classification-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl \
--prediction-class=model_prediction.CustomModelPrediction
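For step 1, the trimmed setup.py would look roughly like this (same package metadata as above, with torch dropped from install_requires because it is now supplied through --package-uris):
from setuptools import setup

setup(
    name="iris-custom-model",
    version="0.1",
    scripts=["model.py"],
    install_requires=[]  # torch comes from the --package-uris wheel, not pip
)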

How to get list of all docker-machine images for google cloud

I'm creating docker-machines in Google Cloud with the shell command
docker-machine create --driver google \
--google-project my-project \
--google-zone my-zone \
--google-machine-image debian-cloud/global/images/debian-10-buster-v20191210 \
machine-name
As you can see, I use the image debian-10-buster-v20191210, but I want to switch to a less recent version of the image. The problem is that I can't find where the list of such images (debian-10-buster-v*) is published. Can you please help me find it?
You can list the available images using the gcloud command line:
--show-deprecated indicates you want to see ALL images, not just the latest
--filter= only selects images whose name starts with debian-10-buster
$ gcloud compute images list --filter="name=debian-10-buster" --show-deprecated
NAME                        PROJECT       FAMILY     DEPRECATED  STATUS
debian-10-buster-v20191115  debian-cloud  debian-10  DEPRECATED  READY
debian-10-buster-v20191121  debian-cloud  debian-10  DEPRECATED  READY
debian-10-buster-v20191210  debian-cloud  debian-10              READY
You can find additional information in the gcloud Images List documentation.
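Once you have picked an older image from that list, you can plug its full path into the same create command from the question, for example with the deprecated v20191115 image:
docker-machine create --driver google \
--google-project my-project \
--google-zone my-zone \
--google-machine-image debian-cloud/global/images/debian-10-buster-v20191115 \
machine-name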

Cloud Machine Learning Engine fails to deploy model

I have trained both my own model and the one from the official tutorial.
I'm up to the step of deploying the model to support prediction. However, it keeps giving me an error saying:
"create version failed. internal error happened"
when I attempt to deploy the models by running:
gcloud ml-engine versions create v1 \
--model $MODEL_NAME \
--origin $MODEL_BINARIES \
--python-version 3.5 \
--runtime-version 1.13
*The model binary should be correct, as I pointed it to the folder containing model.pb and the variables folder, e.g. MODEL_BINARIES=gs://$BUCKET_NAME/results/20190404_020134/saved_model/1554343466.
I have also tried to change the region setting for the model as well, but this doesn't help.
It turns out your GCS bucket and the trained model need to be in the same region. This was not explained well in the Cloud ML tutorial, which only says:
Note: Use the same region where you plan on running Cloud ML Engine jobs. The example uses us-central1 because that is the region used in the getting-started instructions.
Also note that a lot of regions cannot be used for both the bucket and model training (e.g. asia-east1).
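A minimal sketch of keeping everything in one region (assuming us-central1 and the same legacy gcloud ml-engine commands used in the question):
# Create the bucket and the model in the same region, then deploy the version
gsutil mb -l us-central1 gs://$BUCKET_NAME
gcloud ml-engine models create $MODEL_NAME --regions us-central1
gcloud ml-engine versions create v1 \
--model $MODEL_NAME \
--origin $MODEL_BINARIES \
--python-version 3.5 \
--runtime-version 1.13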

Google Cloud ML returns empty predictions with object detection model

I am deploying a model to Google Cloud ML for the first time. I have trained and tested the model locally; it still needs work, but it works OK.
I have uploaded it to Cloud ML and tested it with the same example images I test locally that I know get detections (using this tutorial).
When I do this, I get no detections. At first I thought I had uploaded the wrong checkpoint, but I tested and the same checkpoint works with these images offline. I don't know how to debug further.
When I look at the results, the file prediction.results-00000-of-00001 is just empty, and the file prediction.errors_stats-00000-of-00001 contains the following text: ('No JSON object could be decoded', 1)
Is this a sign the detection has run and detected nothing, or is there some problem while running?
Maybe the problem is that I am preparing the images incorrectly for upload?
The logs show no errors at all.
Thank you
EDIT:
I was doing more tests and tried to run the model locally using the command "gcloud ml-engine local predict" instead of my usual local code. I get the same result as online: no answer at all, but also no error message.
EDIT 2:
I am using a TF_Record file, so I don't understand the JSON response. Here is a copy of my command:
gcloud ml-engine jobs submit prediction ${JOB_ID} \
--data-format=tf_record \
--input-paths=gs://MY_BUCKET/data_dir/inputs.tfr \
--output-path=gs://MY_BUCKET/data_dir/version4 \
--region us-central1 \
--model="gcp_detector" \
--version="Version4"
It works with the following commands.
Model export:
# From tensorflow/models
export PYTHONPATH=$PYTHONPATH:/home/[user]/repos/DeepLearning/tools/models/research:/home/[user]/repos/DeepLearning/tools/models/research/slim
cd /home/[user]/repos/DeepLearning/tools/models/research
python object_detection/export_inference_graph.py \
--input_type encoded_image_string_tensor \
--pipeline_config_path /home/[user]/[path]/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /[path_to_checkpoint]/model.ckpt-216593 \
--output_directory /[output_path]/output_inference_graph.pb
Cloud execution
gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=TF_RECORD \
--input-paths=gs://my_inference/data_dir/inputs/* \
--output-path=${YOUR_OUTPUT_DIR} \
--region us-central1 \
--model="model_name" \
--version="version_name"
I don't know exactly which change fixes the issue, but there are some small differences, like tf_record now being TF_RECORD. Hope this helps someone else. Props to Google support for their help (they suggested the changes).
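Once the job finishes, a quick way to check whether predictions were actually written (rather than an empty results file) is to read the output shards directly, for example:
# Inspect the prediction output written to Cloud Storage
gsutil ls ${YOUR_OUTPUT_DIR}
gsutil cat ${YOUR_OUTPUT_DIR}/prediction.results-*
gsutil cat ${YOUR_OUTPUT_DIR}/prediction.errors_stats-*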