Vertex AI Custom Container deployment - google-cloud-platform

I have a simple application that uses a PyTorch model to predict emotions in text. The model gets downloaded inside the container when it starts.
Unfortunately the deployment in Vertex AI fails every time with the message:
Failed to deploy model "emotion_recognition" to endpoint "emotions" due to the error: Error: model server never became ready. Please validate that your model file or container configuration are valid.
Here is my Dockerfile:
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8-slim
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt
WORKDIR /usr/src/emotions
COPY ./schemas/ /emotions/schemas
COPY ./main.py /emotions
COPY ./utils.py /emotions
ENV PORT 8080
ENV HOST "0.0.0.0"
WORKDIR /emotions
EXPOSE 8080
CMD ["uvicorn", "main:app"]
Here's my main.py:
from fastapi import FastAPI, Request
from utils import get_emotion
from schemas.schema import Prediction, Predictions, Response

app = FastAPI(title="People Analytics")

@app.get("/isalive")
async def health():
    message = "The Endpoint is running successfully"
    status = "Ok"
    code = 200
    response = Response(message=message, status=status, code=code)
    return response

@app.post("/predict",
          response_model=Predictions,
          response_model_exclude_unset=True)
async def predict_emotions(request: Request):
    body = await request.json()
    print(body)
    instances = body["instances"]
    print(instances)
    print(type(instances))
    instances = [x["text"] for x in instances]
    print(instances)
    outputs = []
    for text in instances:
        emotion = get_emotion(text)
        outputs.append(Prediction(emotion=emotion))
    return Predictions(predictions=outputs)
I cannot see the cause of the error in Cloud Logging, so I am curious about the reason. Please check whether my health/predict routes are correct for Vertex AI, or whether there is something else I have to change.

I would recommend enabling request/response logging when deploying the endpoint so that you get more meaningful information from the logs.
This issue could be due to several different reasons:
Make sure that the container is configured to use port 8080. Vertex AI sends liveness checks, health checks, and prediction requests to this port on the container, so your container's HTTP server must listen for requests on it (see the CMD sketch after this list).
Make sure that you have the required permissions. For this you can follow this GCP documentation, and also validate that the service account you are using has enough permissions to read your project's GCS bucket.
Vertex AI has quota limits; to verify these you can also follow this GCP documentation.
As per the documentation, Vertex AI uses the default prediction and health routes if you did not specify them.
If none of the suggestions above work, you will need to contact GCP Support by creating a support case, since the community cannot troubleshoot this further without access to internal GCP resources.
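As a concrete illustration of the port point above: with the Dockerfile from the question, uvicorn is started without --host/--port, so it listens on 127.0.0.1:8000 and Vertex AI's health checks on port 8080 never succeed (the ENV PORT/HOST lines do not affect uvicorn's defaults when the CMD is overridden like this). A minimal sketch of a CMD that binds to the expected address, assuming the rest of the Dockerfile stays as in the question:
# Bind explicitly to 0.0.0.0:8080 so Vertex AI's health and prediction
# requests can reach the server ("uvicorn main:app" alone listens on
# 127.0.0.1:8000 by default).
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
Also make sure that /isalive and /predict match the health and prediction routes you configured when importing the model into Vertex AI.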

Related

Cloud RUN job InterfaceError - (pg8000.exceptions.InterfaceError) - when connecting to cloud sql

This question is about Cloud Run jobs (not services).
InterfaceError - (pg8000.exceptions.InterfaceError) Can't create a connection to host 127.0.0.1 and port 5432 (timeout is None and source_address is None).
I have Python code that connects to Cloud SQL and runs a simple select * on the database.
My Cloud SQL instance is public and in the same account & region as Cloud Run.
I added the Cloud SQL connection to the Cloud Run job through the console.
Recreating this error on my local machine using Docker:
When I run the container locally along with the Cloud SQL proxy, as shown below, it works successfully:
docker run --rm --network=host job1
If I remove --network=host then I can recreate the exact error (seen in Cloud Run) locally:
docker run --rm job1
Am I using the wrong host?
On my local machine I set host to 127.0.0.1, as shown in the official example on the GCP GitHub.
On Cloud Run I tried setting the host to 127.0.0.1 and to /cloudsql/project:region:instance. Neither worked.
My Python code that runs on Cloud Run:
import os
import pandas
import sqlalchemy


def execute_sql(query, engine):
    with engine.connect() as connection:
        df = pandas.read_sql(
            con=connection,
            sql=query
        )
    return df


def connect_tcp_socket() -> sqlalchemy.engine.base.Engine:
    db_user = 'postgres'  # e.g. 'my-database-user'
    db_pass = 'abcxyz123'  # e.g. 'my-database-password'
    db_name = 'development'  # e.g. 'my-database'
    db_host = os.getenv('host', '127.0.0.1')
    db_port = os.getenv('port', 5432)

    connect_args = {}
    pool = sqlalchemy.create_engine(
        sqlalchemy.engine.url.URL.create(
            drivername="postgresql+pg8000",
            username=db_user,
            password=db_pass,
            host=db_host,
            port=db_port,
            database=db_name,
        ),
        connect_args=connect_args,
        pool_size=5,
        max_overflow=2,
        pool_timeout=30,  # 30 seconds
        pool_recycle=1800,  # 30 minutes
    )
    return pool


def func1():
    engine = connect_tcp_socket()
    query = 'select * from public.administrator;'
    df = execute_sql(query, engine)
    print(f'df={df}')


if __name__ == '__main__':
    func1()
How is your Cloud SQL instance configured? Is it using a private or a public IP? Is the Cloud SQL instance in the same project, region and network? Usually, when you connect to 127.0.0.1 you are actually connecting to Cloud SQL via the Auth Proxy running locally; however, this doesn't apply to Cloud Run. Depending on your Cloud SQL configuration, make sure that you configured the Cloud SQL connectivity at deployment time, using the following flags if your Cloud SQL instance uses a public IP:
gcloud run deploy \
  --image=IMAGE \
  --add-cloudsql-instances=INSTANCE_CONNECTION_NAME
If your Cloud SQL instance is using a private IP, you want to use the instance's private IP and not 127.0.0.1.
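For completeness, here is a minimal sketch of what the connection code could look like when the instance is attached with --add-cloudsql-instances and reached over the Unix socket that Cloud Run mounts under /cloudsql (the instance connection name and credentials are placeholders):
import sqlalchemy

# Placeholder; replace with your own "project:region:instance".
INSTANCE_CONNECTION_NAME = "project:region:instance"

def connect_unix_socket() -> sqlalchemy.engine.base.Engine:
    # Cloud Run exposes the attached instance as a Unix socket under /cloudsql.
    return sqlalchemy.create_engine(
        sqlalchemy.engine.url.URL.create(
            drivername="postgresql+pg8000",
            username="postgres",
            password="abcxyz123",
            database="development",
            query={"unix_sock": f"/cloudsql/{INSTANCE_CONNECTION_NAME}/.s.PGSQL.5432"},
        ),
        pool_size=5,
        max_overflow=2,
    )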
I was unable to connect to the proxy from the container running in Cloud Run jobs. So instead I started the proxy manually from inside the container (via the Dockerfile below). This way I know the exact port and host to map.
To run the Python script job1.py shown in the question, use the following files:
Dockerfile:
FROM python:buster
ADD . /code
RUN pip install --upgrade pip
RUN pip install pandas sqlalchemy
RUN pip install pg8000 cloud-sql-python-connector
# download the cloudsql proxy binary
RUN mkdir "/workspace"
RUN wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O /workspace/cloud_sql_proxy
RUN chmod +x /workspace/cloud_sql_proxy
RUN chmod +x /code/start.sh
CMD ["/code/start.sh"]
start.sh
#!/bin/sh
/workspace/cloud_sql_proxy -instances=pangea-dev-314501:us-central1:pangea-dev=tcp:5432 -credential_file=/key/sv_account_key.json &
sleep 6
python /code/job1.py
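As a rough sketch of how this image could be wired up (image name, region and secret name are placeholders, and the exact flags may need adjusting for your setup), the service-account key referenced in start.sh can be mounted from Secret Manager when creating the job:
# Build and push the image
gcloud builds submit --tag gcr.io/my-project/job1

# Create the Cloud Run job, mounting the key file the proxy expects
gcloud run jobs create job1 \
  --image=gcr.io/my-project/job1 \
  --region=us-central1 \
  --set-secrets=/key/sv_account_key.json=sv-account-key:latest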
I'd recommend using the Cloud SQL Python Connector package as it offers a consistent way of connecting across all environments (Local machine, Cloud Run, App Engine, Cloud Functions etc.) and provides the following benefits (one of which is not having to worry about IP addresses or needing the Cloud SQL proxy):
IAM Authorization: uses IAM permissions to control who/what can connect to your Cloud SQL instances
Improved Security: uses robust, updated TLS 1.3 encryption and identity verification between the client connector and the server-side proxy, independent of the database protocol.
Convenience: removes the requirement to use and distribute SSL certificates, as well as manage firewalls or source/destination IP addresses.
(optionally) IAM DB Authentication: provides support for Cloud SQL’s automatic IAM DB AuthN feature.
You can find a Flask App example using the Python Connector in the same Github repo you linked in your question.
Basic usage example:
from google.cloud.sql.connector import Connector, IPTypes
import sqlalchemy

# build connection (for creator argument of connection pool)
def getconn():
    # Cloud SQL Python Connector object
    with Connector() as connector:
        conn = connector.connect(
            "project:region:instance",  # Cloud SQL instance connection name
            "pg8000",
            user="my-user",
            password="my-password",
            db="my-db-name",
            ip_type=IPTypes.PUBLIC  # IPTypes.PRIVATE for private IP
        )
    return conn

# create connection pool
pool = sqlalchemy.create_engine(
    "postgresql+pg8000://",
    creator=getconn,
)
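A quick usage sketch for the pool above, continuing from the previous snippet (the table name is a placeholder):
# run a simple query through the pooled connection
with pool.connect() as db_conn:
    rows = db_conn.execute(sqlalchemy.text("SELECT * FROM public.administrator;")).fetchall()
    print(rows)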

Does it make sense to run a non web application on cloud run?

I see that all of the examples per the documentation use some form of a simple web application (For example, Flask in Python). Is it possible to use cloud run as a non web application? For example, deploy cloud run to use a python script and then use GCP Scheduler to invoke cloud run every hour to run that script? Basically my thinking for this is to avoid having to deploy and pay for Compute Engine, and only pay for when the cloud run container is invoked via the scheduler.
It's mandatory to answer HTTP requests; that's the contract of Cloud Run:
Stateless (no volume attached to the container)
Answers HTTP requests
However, if you already have a Python script, it's easy to wrap it in a Flask web server. Let's say you have something like this (I assume that the file name is main.py, which matters for the Dockerfile at the end):
import ....
var = todo(...)
connect = connect(...)
connect(var)
Firstly, wrap it in a function like this
import ....

def my_function(request):
    var = todo(...)
    connect = connect(...)
    connect(var)
    return 'ok', 200
Secondly, add a flask server
from flask import Flask, request
import os
import ....

app = Flask(__name__)

@app.route('/')
def my_function():
    var = todo(...)
    connect = connect(...)
    connect(var)
    return 'ok', 200

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))
Add flask to your requirements.txt.
Build a standard container; here is an example Dockerfile:
FROM python:3-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT 8080
CMD [ "python", "main.py" ]
Build (with Cloud Build, for example) and deploy the service on Cloud Run.
Now you have a URL that you can call with Cloud Scheduler (a sketch of the scheduler command follows below).
Be careful: the max request duration is, for now, limited to 15 minutes (soon 4x more), and the service is limited to 2 vCPU and 2 GB of memory (again, soon more).
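A minimal sketch of the Cloud Scheduler side, where the service URL and service account below are placeholders:
# Invoke the Cloud Run service every hour with an OIDC token,
# so the service does not need to allow unauthenticated access.
gcloud scheduler jobs create http run-my-script \
  --schedule="0 * * * *" \
  --uri="https://my-service-xyz-uc.a.run.app/" \
  --http-method=GET \
  --oidc-service-account-email=scheduler-invoker@my-project.iam.gserviceaccount.com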
It depends on what is being installed in the container image, as there is no requirement to install a web server. For example, with such an image I can build Android applications, triggered whenever a repository changes (file excludes recommended), and could likely even run a headless Android emulator for Gradle test tasks and publish test results to Pub/Sub (at least as long as the test suite doesn't run for too long). One has to understand the possibilities of Cloud Build to understand what Cloud Run can do.
I struggled with deploying a function that doesn't need to handle any requests on Cloud Run by putting it inside a Flask app, and found out that Cloud Run provides two kinds of workloads: services and jobs.
Illustration from codelabs
From the Cloud Run jobs documentation:
This page describes how to create and update Cloud Run jobs from an existing container image. Unlike services, which listen for requests, a job does not serve requests but only runs its tasks and exits when finished.
After you create or update a job, you can execute the job as a one-off, on a schedule or as part of a workflow. You can manage individual job executions and view the execution logs.
You may see that there are now two tabs (Services and Jobs) in the Cloud Run console; a gcloud sketch of the jobs workflow follows below. I am not sure when Cloud Run jobs were introduced.
See the Cloud Run console.
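A minimal sketch of creating and executing a Cloud Run job from the command line (image name and region are placeholders):
# Create a job from an existing container image; it runs to completion,
# no HTTP server required.
gcloud run jobs create my-script-job \
  --image=gcr.io/my-project/my-script \
  --region=us-central1

# Execute it once; this can also be triggered on a schedule via Cloud Scheduler.
gcloud run jobs execute my-script-job --region=us-central1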

Tensorboard - can't connect from Google Cloud Instance

I'm trying to launch TensorBoard from within my Google Cloud VM terminal.
tensorboard --logdir logs --port 6006
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.2.1 at http://localhost:6006/ (Press CTRL+C to quit)
When I click on the link:
In Chrome I get error 400.
In Firefox: Error: Could not connect to Cloud Shell on port 6006. Ensure your server is listening on port 6006 and try again.
I've added a new firewall rule to allow port 6006 for IP range 0.0.0.0/0, but I still can't get this to work. I've also tried using --bind_all, but this doesn't work either.
From Training a Keras Model on Google Cloud ML GPU:
... To train this model now on Google Cloud ML Engine, run the below command in the Cloud SDK terminal:
gcloud ml-engine jobs submit training JOB1 \
  --module-name=trainer.cnn_with_keras \
  --package-path=./trainer \
  --job-dir=gs://keras-on-cloud \
  --region=us-central1 \
  --config=trainer/cloudml-gpu.yaml
Once you have started the training you can watch the logs from google console. Training would take around 5 minutes and the logs should look like below. Also you would be able to view the tensorboard logs in the bucket that we had created earlier named ‘keras-on-cloud’
To visualize the training and changes graphically open the cloud shell by clicking the icon on top right for the same. Once started type the below command to start Tensorboard on port 8080.
tensorboard --logdir=gs://keras-on-cloud --port=8080
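If you would rather keep TensorBoard running on the VM itself instead of Cloud Shell, a commonly used alternative is to tunnel port 6006 over SSH rather than exposing it through a firewall rule; a minimal sketch (the instance name is a placeholder):
# Forward local port 6006 to TensorBoard on the VM, then open http://localhost:6006
gcloud compute ssh my-tensorboard-vm --ssh-flag="-L 6006:localhost:6006"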
For anyone else struggling with this: I decided to output my logs to an S3 bucket, and then, rather than trying to run TensorBoard from within the GCP instance, I just ran it locally, tested with the script below.
I needed to put this into a script rather than calling it directly from the command line, as I needed my AWS credentials to be loaded. I then use subprocess to run the command-line invocation as normal.
The credentials are contained within an env file, found using python-dotenv:
from dotenv import find_dotenv, load_dotenv
import subprocess

load_dotenv(find_dotenv())

if __name__ == '__main__':
    cmd = 'tensorboard --logdir s3://path-to-s3-bucket/Logs/'
    p = subprocess.Popen(cmd, shell=True)
    p.wait()
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.1.1 at http://localhost:6006/ (Press CTRL+C to quit)

Requirements for launching Google Cloud AI Platform Notebooks with custom docker image

On AI Platform Notebooks, the UI lets you select a custom image to launch. If you do so, you're greeted with an info box saying that the container "must follow certain technical requirements":
I assume this means they have a required entrypoint, exposed port, jupyterlab launch command, or something, but I can't find any documentation of what the requirements actually are.
I've been trying to reverse engineer it without much luck. I nmaped a standard instance and saw that it had port 8080 open, but setting my image's CMD to run Jupyter Lab on 0.0.0.0:8080 did not do the trick. When I click "Open JupyterLab" in the UI, I get a 504.
Does anyone have a link to the relevant docs, or experience with doing this in the past?
There are two ways you can create custom containers:
Building a Derivative Container
If you only need to install additional packages, you should create a Dockerfile derived from one of the standard images (for example, FROM gcr.io/deeplearning-platform-release/tf-gpu.1-13:latest), then add RUN commands to install packages using conda/pip/jupyter.
The conda base environment has already been added to the path, so there is no need to conda init/conda activate unless you need to set up another environment. Additional scripts or dynamic environment variables that need to be run prior to bringing up the environment can be added to /env.sh, which is sourced as part of the entrypoint.
For example, let’s say that you have a custom built TensorFlow wheel that you’d like to use in place of the built-in TensorFlow binary. If you need no additional dependencies, your Dockerfile will be similar to:
Dockerfile.example
FROM gcr.io/deeplearning-platform-release/tf-gpu:latest
RUN pip uninstall -y tensorflow-gpu && \
    pip install /path/to/local/tensorflow.whl
Then you’ll need to build and push it somewhere accessible to your GCE service account.
PROJECT="my-gcp-project"
docker build . -f Dockerfile.example -t "gcr.io/${PROJECT}/tf-custom:latest"
gcloud auth configure-docker
docker push "gcr.io/${PROJECT}/tf-custom:latest"
Building Container From Scratch
The main requirement is that the container must expose a service on port 8080.
The sidecar proxy agent that executes on the VM will ferry requests to this port only.
If using Jupyter, you should also make sure your jupyter_notebook_config.py is configured as such:
c.NotebookApp.token = ''
c.NotebookApp.password = ''
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8080
c.NotebookApp.allow_origin_pat = (
'(^https://8080-dot-[0-9]+-dot-devshell\.appspot\.com$)|'
'(^https://colab\.research\.google\.com$)|'
'((https?://)?[0-9a-z]+-dot-datalab-vm[\-0-9a-z]*.googleusercontent.com)')
c.NotebookApp.allow_remote_access = True
c.NotebookApp.disable_check_xsrf = False
c.NotebookApp.notebook_dir = '/home'
This disables notebook token-based auth (auth is instead handled through oauth login on the proxy), and allows cross origin requests from three sources: Cloud Shell web preview, colab (see this blog post), and the Cloud Notebooks service proxy. Only the third is required for the notebook service; the first two support alternate access patterns.
To complete Zain's answer, below you can find a minimal example using the official Jupyter image, inspired by this repo: https://github.com/doitintl/AI-Platform-Notebook-Using-Custom-Container
Dockerfile
FROM jupyter/base-notebook:python-3.9.5
EXPOSE 8080
ENTRYPOINT ["jupyter", "lab", "--ip", "0.0.0.0", "--allow-root", "--config", "/etc/jupyter/jupyter_notebook_config.py"]
COPY jupyter_notebook_config.py /etc/jupyter/
jupyter_notebook_config.py
(almost the same as Zain's, but with an extra pattern enabling the communication with the kernel; the communication didn't work without it)
c.NotebookApp.ip = '*'
c.NotebookApp.token = ''
c.NotebookApp.password = ''
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8080
c.NotebookApp.allow_origin_pat = '(^https://8080-dot-[0-9]+-dot-devshell\.appspot\.com$)|(^https://colab\.research\.google\.com$)|((https?://)?[0-9a-z]+-dot-datalab-vm[\-0-9a-z]*.googleusercontent.com)|((https?://)?[0-9a-z]+-dot-[\-0-9a-z]*.notebooks.googleusercontent.com)|((https?://)?[0-9a-z\-]+\.[0-9a-z\-]+\.cloudshell\.dev)|((https?://)ssh\.cloud\.google\.com/devshell)'
c.NotebookApp.allow_remote_access = True
c.NotebookApp.disable_check_xsrf = False
c.NotebookApp.notebook_dir = '/home'
c.Session.debug = True
And finally, think about this page while troubleshooting: https://cloud.google.com/notebooks/docs/troubleshooting
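Once the custom image is pushed, here is a minimal sketch of launching a notebook instance with it from the command line (project, image, machine type and zone are placeholders; the console UI's custom-image field achieves the same thing):
# Create an AI Platform Notebooks instance from the custom container image
gcloud notebooks instances create my-custom-notebook \
  --container-repository=gcr.io/my-gcp-project/tf-custom \
  --container-tag=latest \
  --machine-type=n1-standard-4 \
  --location=us-central1-a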

_tds.InterfaceError when trying to connect to Azure Data Warehouse through Python 2.7 and ctds

I'm trying to connect a python 2.7 script to Azure SQL Data Warehouse.
The coding part is done and the test cases work in our development environment. We are coding in Python 2.7 on macOS and connecting to ADW via ctds.
The problem appears when we deploy on our Azure Kubernetes pod (running Debian 9).
When we try to instantiate a connection this way:
# init a connection
self._connection = ctds.connect(
    server='myserver.database.windows.net',
    port=1433,
    user="my_user@myserver.database.windows.net",
    timeout=1200,
    password="XXXXXXXX",
    database="my_db",
    autocommit=True
)
we get an exception that only prints the user name:
my_user@myserver.database.windows.net
the type of the exception is
_tds.InterfaceError
The deployed code is exactly the same, and so are the requirements.
The documentation we found for this exception is almost non-existent.
Do you guys recognize it? Do you know how we can get around it?
We also tried our old AWS EC2 instances and AWS Kubernetes (which run the same OS as the Azure ones) and it also doesn't work.
We managed to connect to ADW via sqlcmd, so that proves the pod can in fact connect (I guess).
EDIT: SOLVED. JUST CHANGED TO PYODBC
def connection(self):
    """:rtype: pyodbc.Connection"""
    if self._connection is None:
        env = ''  # whichever way you have to identify it
        # init a connection
        driver = '/usr/local/lib/libmsodbcsql.17.dylib' if env == 'dev' else '{ODBC Driver 17 for SQL Server}'  # my dev env is MacOS and my prod is Debian 9
        connection_string = 'Driver={driver};Server=tcp:{server},{port};Database={db};Uid={user};Pwd={password};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;'.format(
            driver=driver,
            server='myserver.database.windows.net',
            port=1433,
            db='mydb',
            user='myuser@myserver',
            password='XXXXXXXXXXXX'
        )
        self._connection = pyodbc.connect(connection_string, autocommit=True)
    return self._connection
As Ron says, pyodbc is recommended because it enables you to use a Microsoft-supported ODBC Driver.
I'm going to go ahead and guess that ctds is failing on redirect, and you need to force your server into "proxy" mode. See: Azure SQL Connectivity Architecture
E.g.:
# Get SQL Server ID
sqlserverid=$(az sql server show -n sql-server-name -g sql-server-group --query 'id' -o tsv)
# Set URI
id="$sqlserverid/connectionPolicies/Default"
# Get current connection policy
az resource show --ids $id
# Update connection policy
az resource update --ids $id --set properties.connectionType=Proxy