PyTorch model deployment in AI Platform - google-cloud-ml

I'm deploying a PyTorch model on Google Cloud AI Platform and I'm getting the following error:
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
Configuration:
setup.py
from setuptools import setup
REQUIRED_PACKAGES = ['torch']
setup(
    name="iris-custom-model",
    version="0.1",
    scripts=["model.py"],
    install_requires=REQUIRED_PACKAGES
)
Model version creation
MODEL_VERSION='v1'
RUNTIME_VERSION='1.15'
MODEL_CLASS='model.PyTorchIrisClassifier'
!gcloud beta ai-platform versions create {MODEL_VERSION} --model={MODEL_NAME} \
--origin=gs://{BUCKET}/{GCS_MODEL_DIR} \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{GCS_PACKAGE_URI} \
--prediction-class={MODEL_CLASS}

You need to use PyTorch packages compiled to be compatible with Cloud AI Platform. Package information is available in the gs://cloud-ai-pytorch bucket:
This bucket contains compiled packages for PyTorch that are compatible with Cloud AI Platform prediction. The files are mirrored from the official builds at https://download.pytorch.org/whl/cpu/torch_stable.html
From the documentation:
In order to deploy a PyTorch model on Cloud AI Platform Online Predictions, you must add one of these packages to the packageUris field on the version you deploy. Pick the package matching your Python and PyTorch version. The package names follow this template:
Package name = torch-{TORCH_VERSION_NUMBER}-{PYTHON_VERSION}-linux_x86_64.whl
where PYTHON_VERSION = cp35-cp35m for Python 3 with runtime versions < 1.15, and cp37-cp37m for Python 3 with runtime versions >= 1.15.
For example, if I were to deploy a PyTorch model based on PyTorch
1.1.0 and Python 3, my gcloud command would look like:
gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME}
...
--package-uris=gs://{MY_PACKAGE_BUCKET}/my_package-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.1.0-cp35-cp35m-linux_x86_64.whl
In summary:
1) Remove torch from the install_requires dependencies in your setup.py (see the setup.py sketch after the command below).
2) Include the torch wheel in --package-uris when creating your model version.
!gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/text_classification-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl \
--prediction-class=model_prediction.CustomModelPrediction
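
For reference, a minimal sketch of the adjusted setup.py under that approach (names taken from the question; emptying the dependency list is the only change):

from setuptools import setup

# torch is intentionally not listed here; the prebuilt wheel is supplied
# via --package-uris when the model version is created.
REQUIRED_PACKAGES = []

setup(
    name="iris-custom-model",
    version="0.1",
    scripts=["model.py"],
    install_requires=REQUIRED_PACKAGES
)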

Related

GCP Composer Airflow - unable to install packages using PyPi

I have created a Composer environment with image version composer-2.0.13-airflow-2.2.5. When I try to install software using PyPI, it fails.
Details below:
Command:
gcloud composer environments update $AIRFLOW --location us-east1 --update-pypi-packages-from-file requirements.txt
requirements.txt
---------------
google-api-core
google-auth
google-auth-oauthlib
google-cloud-bigquery
google-cloud-core
google-cloud-storage
google-crc32c
google-resumable-media
googleapis-common-protos
google-endpoints
joblib
json5
jsonschema
pandas
requests
requests-oauthlib
Error:
Karans-MacBook-Pro:composer_dags karanalang$ gcloud composer environments update $AIRFLOW --location us-east1 --update-pypi-packages-from-file requirements.txt
Waiting for [projects/versa-sml-googl/locations/us-east1/environments/versa-composer3] to be updated with [projects/versa-sml-googl/locations/us-east1/operations/c23b77a9-f46b-4222-bafd-62527bf27239]..
.failed.
ERROR: (gcloud.composer.environments.update) Error updating [projects/versa-sml-googl/locations/us-east1/environments/versa-composer3]: Operation [projects/versa-sml-googl/locations/us-east1/operations/c23b77a9-f46b-4222-bafd-62527bf27239] failed: Failed to install PyPI packages. looker-sdk 22.4.0 has requirement attrs>=20.1.0; python_version >= "3.7", but you have attrs 17.4.0.
Check the Cloud Build log at https://console.cloud.google.com/cloud-build/builds/60ac972a-8f5e-4b4f-a4a7-d81049fb19a3?project=939354532596 for details. For detailed instructions see https://cloud.google.com/composer/docs/troubleshooting-package-installation
Please note:
I have an older Composer cluster (Composer version 1.16.8, Airflow version 1.10.15), where the above command works fine.
However, it is not working with the new cluster.
What needs to be done to debug/fix this?
Thanks in advance!
I was able to get this working using the following code:
from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

# PROJECT_ID, REGION and default_dag_args are assumed to be defined elsewhere in the DAG file.
path = "gs://dataproc-spark-configs/pip_install.sh"

CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id=PROJECT_ID,
    zone="us-east1-b",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=4,
    storage_bucket="dataproc-spark-logs",
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl kafka-python'},
).make()

with models.DAG(
    'Versa-Alarm-Insights-UsingComposer2',
    # Continue to run DAG twice per day
    default_args=default_dag_args,
    schedule_interval='0 0/12 * * *',
    catchup=False,
) as dag:
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        cluster_name="versa-composer2",
        region=REGION,
        cluster_config=CLUSTER_GENERATOR_CONFIG,
    )
The earlier approach of installing packages by reading from a file worked in Composer 1 (Airflow 1.x), but fails with Composer 2.x (Airflow 2.x).
From the error, it is clear that you are running an old version of the attrs package.
Run the command below and try again:
pip install attrs==20.3.0
or
pip install attrs==20.1.0
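
If the conflict comes from the Composer environment build itself, a minimal sketch of the same fix applied through the requirements.txt passed to --update-pypi-packages-from-file (untested here, just pinning attrs explicitly alongside the existing entries):

# requirements.txt (excerpt) - pin attrs so looker-sdk's constraint is satisfied
attrs>=20.1.0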

45 MB model too big for Google AI Platform

I'm trying to use AI Platform to deploy a scikit-learn pipeline. The size of the model.joblib file I'm trying to deploy is 45 megabytes.
Python version: 3.7
Framework: scikit-learn (==0.20.4)
Single Core CPU, Quad Core CPU (Beta)
I've used the following command to deploy, as well as the GUI:
gcloud beta ai-platform versions create v0 \
--model test_watch_model \
--origin gs://rohan_test_watch_model \
--runtime-version=1.15 \
--python-version=3.7 \
--package-uris=gs://rohan_test_watch_model/train_custom-0.1.tar.gz \
--framework=scikit-learn \
--project=xxxx
This is the setup.py file I'm using, in case the problem might lie with the libraries.
from setuptools import setup
setup(
    name='train_custom',
    version='0.1',
    scripts=[
        # 'train_custom.py',
        # 'data_silo_custom.py',
        # 'dataset_custom.py',
        # 'preprocessor_custom.py'
        'all.py'
    ],
    install_requires=[
        "torch==1.5.1",
        "transformers==3.0.2",
        "farm==0.4.6"
    ]
)
I also tried removing PyTorch from setup.py and using torch 1.3 from http://storage.googleapis.com/cloud-ai-pytorch/readme.txt, but that leaves me with the same error message.
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.

Can't create Deep Learning VM using Tensorflow 2.0 framework

I'm trying to create a Deep Learning Virtual Machine on Google Cloud Platform that uses TensorFlow 2.0, but when I instantiate it I get the following error:
deep-learning-training-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"errors":[{"domain":"global","message":"Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://www.googleapis.com/compute/v1/projects/click-to-deploy-images/global/images/tf-2-0-cu100-experimental-20190909'. The referenced image resource cannot be found.","reason":"invalid"}],"message":"Invalid value for field 'resource.disks[0].initializeParams.sourceImage': 'https://www.googleapis.com/compute/v1/projects/click-to-deploy-images/global/images/tf-2-0-cu100-experimental-20190909'. The referenced image resource cannot be found.","statusMessage":"Bad Request","requestPath":"https://compute.googleapis.com/compute/v1/projects/my-project/zones/us-west1-b/instances","httpMethod":"POST"}}
I don't quite understand the error, but I believe that GCP is not able to find the right image for my virtual machine, i.e., the image that has this version of TensorFlow in it (maybe because of the TF 2.0 release?).
Has anyone faced this problem before? Is there a way to create a DL VM using TensorFlow 2.0?
It seems to have been a transient issue, since the image is available now.
In addition, you can create your DL VM via gcloud. Here's an example of the command:
gcloud compute instances create INSTANCE_NAME \
--zone=ZONE \
--image-family=tf2-latest-cu100 \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator="type=nvidia-tesla-v100,count=1" \
--metadata="install-nvidia-driver=True,proxy-mode=project_editors" \
--machine-type=n2-highmem-8
There's more information on how to do this in the Deep Learning VM documentation.
Also, if you are looking to create a VM with TensorFlow and Jupyter, you can try using AI Platform Notebooks.
When you create a new notebook, you can select TensorFlow 2.0 and further customize it to select the accelerator, machine type, etc.

Cloud Machine Learning Engine fails to deploy model

I have trained both my own model and the one from the official tutorial.
I'm at the step of deploying the model to support prediction. However, it keeps giving me an error saying:
"create version failed. internal error happened"
when I attempt to deploy the models by running:
gcloud ml-engine versions create v1 \
--model $MODEL_NAME \
--origin $MODEL_BINARIES \
--python-version 3.5 \
--runtime-version 1.13
*The model binaries should be correct, as I pointed MODEL_BINARIES to the folder containing model.pb and the variables folder, e.g. MODEL_BINARIES=gs://$BUCKET_NAME/results/20190404_020134/saved_model/1554343466.
I have also tried changing the region setting for the model, but this doesn't help.
Turns out your GCS bucket and the trained model need to be in the same region. This was not explained well in the Cloud ML tutorial, which only says:
Note: Use the same region where you plan on running Cloud ML Engine jobs. The example uses us-central1 because that is the region used in the getting-started instructions.
Also note that many regions cannot be used for both the bucket and model training (e.g. asia-east1).
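
A minimal sketch of keeping both in the same region (the bucket name, model name and region below are placeholders, not from the question):

# Create the staging bucket in the region you will use for Cloud ML Engine.
gsutil mb -l us-central1 gs://$BUCKET_NAME

# Create the model in that same region before adding a version.
gcloud ml-engine models create $MODEL_NAME --regions us-central1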

Using tensorflow.contrib.data.Dataset in Cloud ML

Recently I changed my data pipeline in TensorFlow from threading to the new Dataset API, which is pretty convenient once you want to validate your model each epoch.
I've noticed that the current runtime version of TensorFlow in Cloud ML is 1.2. Nevertheless, I've tried to use a nightly build of TensorFlow v1.3, but the pip installation fails with:
AssertionError: tensorflow==1.3.0 .dist-info directory not found
Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'tensorflow-1.3.0-cp27-none-linux_x86_64.whl']' returned non-zero exit status 2
Has anyone succeeded in using tensorflow.contrib.data.Dataset with Cloud ML Engine?
This worked for me: create a setup.py file with the following content:
from setuptools import find_packages
from setuptools import setup
REQUIRED_PACKAGES = ['tensorflow==1.3.0']
setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='Upgrading tf to 1.3')
More info on the setup.py file is available in the Packaging a Training Application documentation.
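
With TensorFlow 1.3 pinned that way, here's a minimal sketch of an input_fn built on the 1.3-era tf.contrib.data API (the feature/label arrays and sizes are illustrative, not from the question):

import tensorflow as tf

def input_fn(features, labels, batch_size=32, num_epochs=None):
    # Build a Dataset from in-memory arrays (illustrative only).
    dataset = tf.contrib.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.shuffle(buffer_size=1000).repeat(num_epochs).batch(batch_size)
    # One-shot iterator: no explicit initializer needed in the training loop.
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels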