45 MB model too big for Google AI Platform - google-cloud-platform

I'm trying to use AI Platform to deploy a scikit-learn pipeline. The size of the model.joblib file I'm trying to deploy is 45 megabytes.
Python version: 3.7
Framework: scikit-learn (==0.20.4)
Machine type: Single Core CPU, Quad Core CPU (Beta)
I've used the following command to deploy (and I also tried the GUI):
gcloud beta ai-platform versions create v0 \
--model test_watch_model \
--origin gs://rohan_test_watch_model \
--runtime-version=1.15 \
--python-version=3.7 \
--package-uris=gs://rohan_test_watch_model/train_custom-0.1.tar.gz \
--framework=scikit-learn \
--project=xxxx
This is the setup.py file I'm using, in case the problem might lie with the libraries.
from setuptools import setup

setup(
    name='train_custom',
    version='0.1',
    scripts=[
        # 'train_custom.py',
        # 'data_silo_custom.py',
        # 'dataset_custom.py',
        # 'preprocessor_custom.py'
        'all.py'
    ],
    install_requires=[
        "torch==1.5.1",
        "transformers==3.0.2",
        "farm==0.4.6"
    ]
)
I also tried removing torch from setup.py and using the torch 1.3 build from http://storage.googleapis.com/cloud-ai-pytorch/readme.txt, but that leaves me with the same error message.
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to experience errors, please contact support.

Related

What is the correct configuration of AWS SageMaker-Python-SDK to achieve local debugging/training with Apple M1 Pro?

I want to run an RL training job on AWS SageMaker (script given below). Since the project is complex, I was hoping to do a test run using SageMaker Local Mode on my M1 MacBook Pro before submitting to paid instances. However, I am struggling to make this local run succeed even with a simple training task.
I have used tensorflow-metal and tensorflow-macos when running local training jobs (without SageMaker), but I don't see anywhere to specify these in framework_version, and I am not sure that "local_gpu", which is the correct argument for a normal Linux machine with a GPU, applies to Apple Silicon (M1 Pro).
I searched all over but cannot find a case where this is addressed. (Very odd; am I doing something wrong? If so, please correct me.) If not, and anyone knows of a configuration, a Docker image, or an example done properly with the M1 Pro, please share.
I tried to run the following code, which hangs after logging in. (If you are trying to run it, use any simple training script as entry_point, and make sure to log in with a command matching your region using the AWS CLI, such as the following.)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
##main.py
import boto3
import sagemaker
import os
import keras
import numpy as np
from keras.datasets import fashion_mnist
from sagemaker.tensorflow import TensorFlow
sess = sagemaker.Session()
#role = <'arn:aws:iam::0000000000000:role/CFN-SM-IM-Lambda-Catalog-sk-SageMakerExecutionRole-BlaBlaBla'> #KINDLY ADD YOUR ROLE HERE
(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()
os.makedirs("./data", exist_ok = True)
np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)
# Train on local data. S3 URIs would work too.
training_input_path = 'file://data/training.npz'
validation_input_path = 'file://data/validation.npz'
# Store model locally. A S3 URI would work too.
output_path = 'file:///tmp/model/'
tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                          role=role,
                          instance_count=1,
                          instance_type='local_gpu',  # use 'local' to train on the local CPU, 'local_gpu' if there is a GPU
                          framework_version='2.1.0',
                          py_version='py3',
                          hyperparameters={'epochs': 1},
                          output_path=output_path
                          )
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
The prebuilt SageMaker Docker Images for Deep Learning don't have Arm-based support yet.
You can see the available Deep Learning Containers images here.
The solution is to build your own Docker image and use it with SageMaker.
This is an example Dockerfile that uses miniconda to install TensorFlow dependencies:
FROM arm64v8/ubuntu
RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
nginx \
ca-certificates \
gcc \
linux-headers-generic \
libc-dev
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-aarch64.sh
RUN chmod a+x Miniconda3-py38_4.9.2-Linux-aarch64.sh
RUN bash Miniconda3-py38_4.9.2-Linux-aarch64.sh -b
ENV PATH /root/miniconda3/bin/:$PATH
COPY ml-dependencies.yml ./
RUN conda env create -f ml-dependencies.yml
ENV PATH /root/miniconda3/envs/ml-dependencies/bin:$PATH
This is the ml-dependencies.yml:
name: ml-dependencies
dependencies:
  - python=3.8
  - numpy
  - pandas
  - scikit-learn
  - tensorflow==2.8.2
  - pip
  - pip:
      - sagemaker-training
And this is how you'll run the training using SageMaker Script Mode:
from sagemaker.estimator import Estimator

image = 'sagemaker-tensorflow2-graviton-training-toolkit-local'

california_housing_estimator = Estimator(
    image_uri=image,
    entry_point='california_housing_tf2.py',
    source_dir='code',
    role=DUMMY_IAM_ROLE,  # placeholder role ARN defined in the sample code
    instance_count=1,
    instance_type='local',
    hyperparameters={'epochs': 10,
                     'batch_size': 64,
                     'learning_rate': 0.1})

inputs = {'train': 'file://./data/train', 'test': 'file://./data/test'}
california_housing_estimator.fit(inputs, logs=True)
You can find the full working sample code on the Amazon SageMaker Local Mode Examples GitHub repository here.
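For reference, here is a minimal sketch of what a script-mode entry point can look like (a hypothetical stand-in, not the actual california_housing_tf2.py from the repository): hyperparameters arrive as command-line arguments, while the data channels and the model directory arrive through SM_* environment variables.
# minimal_entry_point.py -- hypothetical sketch of a SageMaker script-mode entry point
import argparse
import os

import numpy as np
import tensorflow as tf

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters are passed as CLI arguments.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=64)
    parser.add_argument('--learning_rate', type=float, default=0.1)
    # Data channels and the model directory arrive as SM_* environment variables.
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--model_dir_local', type=str, default=os.environ.get('SM_MODEL_DIR'))
    args = parser.parse_args()

    # Assumes the train channel contains simple .npy arrays; adjust to your data format.
    x_train = np.load(os.path.join(args.train, 'x_train.npy'))
    y_train = np.load(os.path.join(args.train, 'y_train.npy'))

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(args.learning_rate), loss='mse')
    model.fit(x_train, y_train, epochs=args.epochs, batch_size=args.batch_size)

    # Anything written under SM_MODEL_DIR is exported as the training artifact.
    model.save(os.path.join(args.model_dir_local, '1'))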

PyTorch model deployment in AI Platform

I'm deploying a PyTorch model on Google Cloud AI Platform, and I'm getting the following error:
ERROR: (gcloud.beta.ai-platform.versions.create) Create Version failed. Bad model detected with error: Model requires more memory than allowed. Please try to decrease the model size and re-deploy. If you continue to have error, please contact Cloud ML.
Configuration:
setup.py
from setuptools import setup

REQUIRED_PACKAGES = ['torch']

setup(
    name="iris-custom-model",
    version="0.1",
    scripts=["model.py"],
    install_requires=REQUIRED_PACKAGES
)
Model version creation
MODEL_VERSION='v1'
RUNTIME_VERSION='1.15'
MODEL_CLASS='model.PyTorchIrisClassifier'
!gcloud beta ai-platform versions create {MODEL_VERSION} --model={MODEL_NAME} \
--origin=gs://{BUCKET}/{GCS_MODEL_DIR} \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{GCS_PACKAGE_URI} \
--prediction-class={MODEL_CLASS}
You need to use the PyTorch packages compiled to be compatible with Cloud AI Platform (package information here):
This bucket contains compiled packages for PyTorch that are compatible with Cloud AI Platform prediction. The files are mirrored from the official builds at https://download.pytorch.org/whl/cpu/torch_stable.html
From the documentation:
In order to deploy a PyTorch model on Cloud AI Platform Online Predictions, you must add one of these packages to the packageUris field on the version you deploy. Pick the package matching your Python and PyTorch version. The package names follow this template:
Package name = torch-{TORCH_VERSION_NUMBER}-{PYTHON_VERSION}-linux_x86_64.whl, where PYTHON_VERSION = cp35-cp35m for Python 3 with runtime versions < 1.15, and cp37-cp37m for Python 3 with runtime versions >= 1.15.
For example, if I were to deploy a PyTorch model based on PyTorch 1.1.0 and Python 3, my gcloud command would look like:
gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME}
...
--package-uris=gs://{MY_PACKAGE_BUCKET}/my_package-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.1.0-cp35-cp35m-linux_x86_64.whl
In summary:
1) Remove torch from your install_requires dependencies in setup.py.
2) Include the torch package in --package-uris when creating your model version:
!gcloud beta ai-platform versions create {VERSION_NAME} --model {MODEL_NAME} \
--origin=gs://{BUCKET}/{MODEL_DIR}/ \
--python-version=3.7 \
--runtime-version={RUNTIME_VERSION} \
--package-uris=gs://{BUCKET}/{PACKAGES_DIR}/text_classification-0.1.tar.gz,gs://cloud-ai-pytorch/torch-1.3.1+cpu-cp37-cp37m-linux_x86_64.whl \
--prediction-class=model_prediction.CustomModelPrediction
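For context, the --prediction-class flag points at a custom prediction routine class packaged inside the tar.gz. Below is a minimal sketch of what such a class could look like; the model.pt filename and the tensor handling are illustrative assumptions, not details from the question.
# model_prediction.py -- illustrative sketch of an AI Platform custom prediction routine
import os

import torch


class CustomModelPrediction(object):
    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # instances is a list of JSON-deserialized input rows.
        inputs = torch.tensor(instances, dtype=torch.float32)
        with torch.no_grad():
            outputs = self._model(inputs)
        return outputs.numpy().tolist()

    @classmethod
    def from_path(cls, model_dir):
        # model_dir is the local copy of the GCS directory passed via --origin.
        # Assumes the model was saved with torch.save(model, 'model.pt').
        model = torch.load(os.path.join(model_dir, 'model.pt'), map_location='cpu')
        model.eval()
        return cls(model)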

Object detection training job fails on GCP

I am running a training job on GCP for object detection using my own dataset. My training job script is like this:
JOB_NAME=object_detection"_$(date +%m_%d_%Y_%H_%M_%S)"
echo $JOB_NAME
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir=gs://$1 \
--scale-tier BASIC_GPU \
--runtime-version 1.12 \
--packages $PWD/models/research/dist/object_detection-0.1.tar.gz,$PWD/models/research/slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \
--module-name $PWD/models/research/object_detection.model_main \
--region europe-west1 \
-- \
--model_dir=gs://$1 \
--pipeline_config_path=gs://$1/data/fast_rcnn_resnet101_coco.config
It fails at the following line:
python -m $PWD/models/research/object_detection.model_main --model_dir=gs://my-hand-detector --pipeline_config_path=gs://my-hand-detector/data/fast_rcnn_resnet101_coco.config --job-dir gs://my-hand-detector/
/usr/bin/python: Import by filename is not supported.
Based on the logs, this is the source of the error as far as I understand it. Any help in this regard would be appreciated. Thank you.
I assume that you are using the model_main.py file from the TensorFlow GitHub repository. Using it, I was able to replicate your error message. After troubleshooting, I successfully submitted the training job and could train the model properly.
In order to address your issue, I suggest you follow this tutorial, paying special attention to the following steps:
Make sure you have an updated version of TensorFlow (1.14 doesn't include all the necessary capabilities)
Properly generate TFRecords from the input data and upload them to the GCS bucket (a minimal sketch of this step follows below)
Configure the object detection pipeline (set the proper paths to the data and the label map)
In my case, I reproduced the workflow using PASCAL VOC input data (see this).
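Regarding the TFRecord step mentioned above, here is a minimal, illustrative sketch in the TF 1.x style used by the Object Detection API's sample conversion scripts; the image filename, dimensions, box coordinates, and the 'hand' label are placeholder assumptions, not values from the question.
# create_tf_record_sketch.py -- minimal, illustrative TFRecord writer (TF 1.x style)
import tensorflow as tf


def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def _float_list_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


with tf.python_io.TFRecordWriter('train.record') as writer:
    with tf.gfile.GFile('image_0001.jpg', 'rb') as f:  # hypothetical input image
        encoded_jpg = f.read()

    # Keys follow the convention of the Object Detection API's sample
    # create_*_tf_record scripts; box coordinates are normalized to [0, 1].
    example = tf.train.Example(features=tf.train.Features(feature={
        'image/encoded': _bytes_feature(encoded_jpg),
        'image/format': _bytes_feature(b'jpeg'),
        'image/height': _int64_feature(480),
        'image/width': _int64_feature(640),
        'image/object/bbox/xmin': _float_list_feature([0.1]),
        'image/object/bbox/xmax': _float_list_feature([0.5]),
        'image/object/bbox/ymin': _float_list_feature([0.2]),
        'image/object/bbox/ymax': _float_list_feature([0.6]),
        'image/object/class/text': _bytes_feature(b'hand'),
        'image/object/class/label': _int64_feature(1),
    }))
    writer.write(example.SerializeToString())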

gcloud ai-platform local train not running in jupyter notebook

This is one unsolved part from another post. I am trying to submit a Google Cloud job that trains a CNN model for MNIST digits.
Here's my system: Windows 10, Anaconda, Jupyter Notebook 6, Python 3.6, TF 1.13.0.
I use the gcloud command for local training. The second cell seems stuck at [*] status, showing nothing until I close and halt the ipynb file. The training starts right after that, and the results are correct, as I monitored on TensorBoard.
I can run it in a terminal without this issue. I also submitted the job to the cloud and it finished successfully.
Any thoughts on the local training problem? The code is below.
import shutil

OUTDIR = 'trained_test'
INPDIR = '..\data'
shutil.rmtree(path=OUTDIR, ignore_errors=True)
!gcloud ai-platform local train \
--module-name=trainer.task \
--package-path=trainer \
-- \
--output_dir=$OUTDIR \
--input_dir=$INPDIR \
--epochs=2 \
--learning_rate=0.001 \
--batch_size=100
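As a side note (not a confirmed fix, just a workaround sketch using the same command and flags as the cell above): running the command through subprocess and printing its output line by line at least makes the trainer's progress visible inside the notebook instead of leaving the cell at [*] with no output.
# Workaround sketch: run the same local-train command via subprocess and stream its output.
import subprocess

cmd = (
    'gcloud ai-platform local train '
    '--module-name=trainer.task '
    '--package-path=trainer '
    '-- '
    '--output_dir=trained_test '
    '--input_dir=..\\data '
    '--epochs=2 --learning_rate=0.001 --batch_size=100'
)

# shell=True lets the gcloud.cmd wrapper resolve on Windows;
# universal_newlines=True keeps this compatible with Python 3.6.
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, universal_newlines=True)
for line in proc.stdout:
    print(line, end='')
proc.wait()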

How to force python versions to sync in a datalab instance spun from a GCP dataproc cluster?

I've created a Dataproc cluster in GCP using image 1.2. I want to run Spark from a Datalab notebook. This works fine if I keep the Datalab notebook running Python 2.7 as its kernel, but if I want to use Python 3 I run into a minor version mismatch. I demonstrate the mismatch with a Datalab script below:
### Configuration
import sys, os
sys.path.insert(0, '/opt/panera/lib')
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'
import google.datalab.storage as storage
from io import BytesIO
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .enableHiveSupport() \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") \
    .getOrCreate()
sc = spark.sparkContext
### import libraries
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
### trivial example
data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(1.0, [1.0]),
    LabeledPoint(1.0, [2.0]),
    LabeledPoint(1.0, [3.0])
]
toyModel = DecisionTree.trainClassifier(sc.parallelize(data), 2, {})
print(toyModel)
The error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, pan-bdaas-prod-jrl6-w-3.c.big-data-prod.internal, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 124, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.6 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Other initialization scripts:
gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh
gs://dataproc-initialization-actions/datalab/datalab.sh
...and scripts that load some of our necessary libraries and utilities
The Python 3 kernel in Datalab is using Python 3.5 rather than Python 3.6.
You could try to set up a 3.6 environment inside Datalab and then install a new kernelspec for it, but it is probably easier to just configure the Dataproc cluster to use Python 3.5.
The instructions for setting up your cluster to use 3.5 are here.
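Once the versions are aligned, a quick sanity check like the one below (a diagnostic sketch that reuses the sc SparkContext from the question) confirms which Python the driver and the executors actually run:
import sys

# Interpreter used by the driver (the Datalab kernel).
print('driver :', sys.version_info[:3])

# Interpreter used by the executors: run one trivial task on a worker and report back.
worker_version = sc.parallelize([0], 1).map(
    lambda _: __import__('sys').version_info[:3]).first()
print('workers:', worker_version)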