I am training a model with tqdm in the dataloader, and the log prints every update on a separate line, filling up the entire notebook. Is there a way to reduce the verbosity or remove the logs from the Jupyter cell entirely, similar to from IPython.display import clear_output; clear_output(wait=True)?
Usually the logging can be controlled through the API parameters, and SageMaker does not add much additional verbosity on top of that. If you are using a SageMaker notebook instance to run your training, you can also use IPython's clear_output, since the notebook is based on JupyterLab.
You can set logs="None" in the .fit() call.
Here's an example of how I set it:
job_name = "something"

model = sagemaker.estimator.Estimator(image_uri=container_image_uri,
                                      ...)

model.fit(inputs=train_input,
          job_name=job_name,
          wait=True,
          logs="None")
I have been banging my head against this for a while, and Google Cloud does not have much documentation about this issue. What I am trying to do is deploy a custom ML model on Google Cloud Vertex AI by:
1. Uploading the model to the Model Registry in Vertex AI
2. Creating an endpoint
3. Deploying the uploaded model to the created endpoint
Steps 1 and 2 are easy to implement, and I am not facing any issues. However, step 3 always fails for some reason, and even the logs don't give me much information.
For Step 1:
This is the Dockerfile I am using to create a custom image to serve my ML model:
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8-slim
COPY requirements-base.txt requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
COPY serve.py serve.py
COPY model.pkl model.pkl
And this is what my serve.py file looks like:
from fastapi import Request, FastAPI, Response
import json
import catboost
import pickle
import os

app = FastAPI(title="Sentiment Analysis")

AIP_HEALTH_ROUTE = os.environ.get('AIP_HEALTH_ROUTE', '/health')
AIP_PREDICT_ROUTE = os.environ.get('AIP_PREDICT_ROUTE', '/predict')

@app.get(AIP_HEALTH_ROUTE, status_code=200)
async def health():
    return {'health': 'ok'}

@app.post(AIP_PREDICT_ROUTE)
async def predict(request: Request):
    with open('model.pkl', 'rb') as file:
        model = pickle.load(file)

    data = request.get_json()
    input_data = data['input']

    predictions = model.predict(input_data)

    return json.dumps({'predictions': predictions.tolist()})

if __name__ == '__main__':
    app.run(debug=True, host="0.0.0.0", port=8080)
After building the image, I push it to Artifact Registry on Google Cloud.
Is there an issue with how I have written the serve.py file or Dockerfile?
Or is there an easier way to deploy custom ML models on Google Cloud for MLOps and prediction purposes?
I tried a couple of approaches, both manually from the Google Cloud Vertex AI console and using gcloud commands.
In the manual process, after importing the model with the custom image, I clicked on deploy to an endpoint, but this always seems to fail and takes forever.
Similarly, using gcloud, I first create the endpoint, then upload my model to the registry, and then deploy the model to the created endpoint. But this approach also fails.
At the end of the day, I want my model to be successfully deployed on the endpoint and giving the right answers for predictions, or to somehow host my custom ML model on Google Cloud and make predictions with it in a reasonable and manageable way!
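For reference, the three steps above map roughly onto the Vertex AI Python SDK like this (a minimal sketch, not what I actually ran; the project, region, image URI and display names are placeholders):

from google.cloud import aiplatform

aiplatform.init(project="your-project", location="us-central1")

# Step 1: upload the model (with the custom serving container) to the Model Registry
model = aiplatform.Model.upload(
    display_name="sentiment-model",
    serving_container_image_uri="us-docker.pkg.dev/your-project/your-repo/your-image:latest",
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
    serving_container_ports=[8080],
)

# Step 2: create an endpoint
endpoint = aiplatform.Endpoint.create(display_name="sentiment-endpoint")

# Step 3: deploy the uploaded model to the endpoint
model.deploy(endpoint=endpoint, machine_type="n1-standard-2")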
While training a model on AWS SageMaker (let us assume training takes 15 hours or more), if our laptop loses its internet connection in between, the kernel on which it is training will die, but the model continues to train (I confirmed this with the model.save command, and the model did save to the S3 bucket).
I want to know if there is a way to track the status/progress of the model training when the kernel dies in the SageMaker environment.
Note: I know we can create a training job under Training - Training Jobs - Create Training Jobs. I just wanted to know if there is any other way to track progress if we are not creating the training job.
Could you specify the 'Job Name' of the SageMaker training job? You can get the status using an API call if you have the job name: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html
Another note: you can specify the job name of a training job using the 'TrainingJobName' parameter of training requests: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html
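For example, a minimal sketch of that API call with boto3 (the job name below is a placeholder):

import boto3

sm = boto3.client("sagemaker")

# Returns the status, secondary status, metrics, and S3 output path of the job
response = sm.describe_training_job(TrainingJobName="your-training-job-name")
print(response["TrainingJobStatus"], response["SecondaryStatus"])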
Simply check the status
When you run a training job, a log stream is automatically created in CloudWatch within the "/aws/sagemaker/TrainingJobs" log group, named after your job, with one or more sub-streams depending on the number of instances selected.
This already ensures that you can track the status of the job even if the kernel dies or if you simply turn off the notebook instance.
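For example, a rough sketch of reading those logs from anywhere with boto3 (the job name is a placeholder):

import boto3

logs = boto3.client("logs")

# Each training job gets one or more log streams under this group, one per instance
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix="your-training-job-name",
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamName=stream["logStreamName"],
    )
    for event in events["events"]:
        print(event["message"])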
Monitor metrics
For SageMaker's built-in algorithms, no configuration is required, since the monitorable metrics are already defined.
Custom model
For custom models, on the other hand, to get a monitoring graph of your metrics, you can configure them in CloudWatch (Metrics), as the official documentation explains in "Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics" and "Define Metrics".
Basically, you just need to add the parameter metric_definitions to your Estimator (or a subclass of it):
metric_definitions=[
    {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
    {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
]
This will capture, from the print/logger output of your training script, the text matched by the regexes you set (which you can of course change to your liking) and create a metric for it in CloudWatch Metrics.
A complete code example from the docs:
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-own-image-uri",
    role=sagemaker.get_execution_role(),
    sagemaker_session=sagemaker.Session(),
    instance_count=1,
    instance_type='ml.c4.xlarge',
    metric_definitions=[
        {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
        {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
    ]
)
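For these regexes to match anything, the training script only has to print (or log) lines in the corresponding format, for example (assuming train_error and valid_error are values your script computes):

# somewhere in your training loop
print(f"Train_error={train_error};")
print(f"Valid_error={valid_error};")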
I have ~50k compressed (gzip) JSON files daily that need to be uploaded to BQ with some transformation, and no API calls. The size of the files may be up to 1 GB.
What is the most cost-efficient way to do it?
Will appreciate any help.
The most efficient way is to use Cloud Data Fusion.
I would suggest the approach below:
Create a Cloud Function, triggered on every new file upload, to uncompress the file (a rough sketch follows below).
Create a Data Fusion job with the GCS file as source and BigQuery as sink, with the desired transformation.
Refer to my YouTube video below:
https://youtu.be/89of33RcaRw
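A rough sketch of step 1 (the uncompress function) could look like this, assuming the files land as .json.gz objects; the bucket and trigger wiring are placeholders, and for files close to 1 GB you would want a streaming approach rather than reading everything into memory:

import gzip
from google.cloud import storage

def decompress(event, context):
    # Triggered by a Finalize/Create event on the bucket; re-uploads an uncompressed copy
    name = event["name"]
    if not name.endswith(".json.gz"):
        return
    client = storage.Client()
    bucket = client.bucket(event["bucket"])
    data = gzip.decompress(bucket.blob(name).download_as_bytes())
    bucket.blob(name[:-len(".gz")]).upload_from_string(data, content_type="application/json")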
Here is (for example) one way - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json...
... but quickly looking over it, one can see that there are some specific limitations. So perhaps simplicity, customization, and maintainability of the solution should also be added to your "cost" function.
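(If the plain load-job route does fit your files, a minimal sketch with the BigQuery Python client could look like this; the dataset, table and URI are placeholders, and BigQuery reads gzipped newline-delimited JSON directly:)

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
# Load jobs themselves are free; you pay for storage and later queries
load_job = client.load_table_from_uri(
    "gs://your_bucket/path/*.json.gz",
    "your_project.your_dataset.your_table",
    job_config=job_config,
)
load_job.result()  # wait for the job to finish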
Not knowing some details (for example, read the "Limitations" section under my link above; what stack you have or are willing/able to use; the file names; whether your files have nested fields; etc.), my first thought is a Cloud Function ( https://cloud.google.com/functions/pricing ) "listening" (event type = Finalize/Create) to the Cloud Storage bucket where your files land (if you go this route, put your storage and function in the same region if possible, which will make it cheaper).
If you can code in Python, here is some starter code:
main.py
import pandas as pd
from io import BytesIO, StringIO
import numpy as np
from google.cloud import storage, bigquery
import io

def process(event, context):
    file = event
    # check if it is your file; you can also check for patterns in the name
    if file['name'] == 'YOUR_FILENAME':
        try:
            working_file = file['name']
            storage_client = storage.Client()
            bucket = storage_client.get_bucket('your_bucket_here')
            blob = bucket.blob(working_file)
            # https://stackoverflow.com/questions/49541026/how-do-i-unzip-a-zip-file-in-google-cloud-storage
            zipbytes = io.BytesIO(blob.download_as_string())
            # print for logging
            print(f"file downloaded, {working_file}")
            # read the file as a DataFrame --- see https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
            # if nested, you might need to go text --> dictionary and then do some preprocessing
            df = pd.read_json(zipbytes, compression='gzip')
            # write the processed data to BigQuery
            df.to_gbq(destination_table='your_dataset.your_table',
                      project_id='your_project_id',
                      if_exists='append')
            print(f"table bq created, {working_file}")
            # to delete the processed file from storage and save on storage costs, uncomment the 2 lines below
            # blob.delete()
            # print(f"blob deleted, {working_file}")
        except Exception as e:
            print(f"exception occurred {e}, {working_file}")
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-storage
google-cloud-bigquery
pandas
pandas-gbq
PS
Some alternatives include:
1. Starting up a VM, running your script on a schedule, and shutting the VM down once the process is done (for example Cloud Scheduler --> Pub/Sub --> Cloud Function --> which starts up your VM --> which then runs your script)
2. Using App Engine to run your script (similar)
3. Using Cloud Run to run your script (similar)
4. Using Composer/Airflow (not similar to 1, 2 & 3) [could use all kinds of approaches including data transfers etc.; just not sure what stack you are trying to use or what you already have running]
5. Scheduling a Vertex AI notebook (not similar to 1, 2 & 3; basically write up a Jupyter notebook and schedule it to run in Vertex AI)
6. Querying the files directly (https://cloud.google.com/bigquery/external-data-cloud-storage#bq_1) and scheduling that query (https://cloud.google.com/bigquery/docs/scheduling-queries) to run (but again, not sure about your overall pipeline)
The setup for all of these (except #5 & #6) is, to me, not worth the technical debt if you can get away with Cloud Functions.
Best of luck,
I have created a forecasting model using AutoML on Vertex AI. I want to use this model to make batch predictions every week. Is there a way to schedule this?
The data for making those predictions is stored in a BigQuery table, which is updated every week.
There is no automatic scheduling directly in Vertex AutoML yet, but there are many different ways to set this up in GCP.
Two options to try first, using the client libraries available for BigQuery and Vertex:
Use Cloud Scheduler as a cron: https://cloud.google.com/scheduler/docs/quickstart
Use either Cloud Functions or Cloud Run to set up a BigQuery event trigger, and then trigger the AutoML batch prediction (a rough sketch is below). Example to repurpose: https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events
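The batch prediction call itself can be kicked off from that function with the Vertex AI Python SDK, roughly like this (a sketch; the model ID, project, region and BigQuery paths are placeholders):

from google.cloud import aiplatform

def run_weekly_batch_prediction(event=None, context=None):
    aiplatform.init(project="your-project", location="us-central1")
    model = aiplatform.Model("your-automl-model-id")
    # Read the input table from BigQuery and write the predictions back to BigQuery
    model.batch_predict(
        job_display_name="weekly-forecast",
        instances_format="bigquery",
        predictions_format="bigquery",
        bigquery_source="bq://your-project.your_dataset.your_input_table",
        bigquery_destination_prefix="bq://your-project.your_dataset",
    )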
Not sure if you're using Vertex Pipelines to run the prediction job, but if you are, there's a method to schedule your pipeline execution, listed here.
from kfp.v2.google.client import AIPlatformClient  # noqa: F811

api_client = AIPlatformClient(project_id=PROJECT_ID, region=REGION)

# adjust time zone and cron schedule as necessary
response = api_client.create_schedule_from_job_spec(
    job_spec_path="intro_pipeline.json",
    schedule="2 * * * *",
    time_zone="America/Los_Angeles",  # change this as necessary
    parameter_values={"text": "Hello world!"},
    # pipeline_root=PIPELINE_ROOT  # this argument is necessary if you did not specify PIPELINE_ROOT as part of the pipeline definition.
)
I have followed an Amazon tutorial for using SageMaker and have used it to create the model in the tutorial (https://aws.amazon.com/getting-started/tutorials/build-train-deploy-machine-learning-model-sagemaker/).
This is my first time using SageMaker, so my question may be stupid.
How do you actually view the model that it has created? I want to be able to see a) the final formula created, with the parameters etc., and b) graphs of plotted factors etc., as if I were reviewing a GLM, for example.
Thanks in advance.
If you followed the SageMaker tutorial, you must have trained an XGBoost model. SageMaker places the model artifacts in a bucket that you own; check the output S3 location in the AWS SageMaker console.
For more information about XGBoost you can check the AWS SageMaker documentation https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#xgboost-sample-notebooks and the example notebooks, e.g. https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb
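The artifact there is a model.tar.gz; a minimal sketch for pulling it down locally (the bucket and key below are placeholders):

import boto3
import tarfile

s3 = boto3.client("s3")
s3.download_file("your-bucket", "path/to/output/model.tar.gz", "model.tar.gz")

# The archive contains the pickled XGBoost model produced by the training job
with tarfile.open("model.tar.gz") as tar:
    tar.extractall()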
To consume the XGBoost artifact generated by SageMaker, check out the official documentation, which contains the following code:
# SageMaker XGBoost uses the Python pickle module to serialize/deserialize
# the model, which can be used for saving/loading the model.
# To use a model trained with SageMaker XGBoost in open source XGBoost
# Use the following Python code:
import pickle as pkl
model = pkl.load(open(model_file_path, 'rb'))
# prediction with test data
pred = model.predict(dtest)
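Once loaded, you can also inspect the booster itself, e.g. plot feature importances or dump the trees as text (a sketch, assuming xgboost and matplotlib are installed locally):

import matplotlib.pyplot as plt
import xgboost

# b) feature importance chart for the loaded booster
xgboost.plot_importance(model)
plt.show()

# a) the closest thing to a "formula": the text dump of the first boosted tree
print(model.get_dump()[0])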