How to track the model Progress/status when Sagemaker Kernel is dead? - amazon-web-services

While training a model on AWS Sagemaker(let us assume training takes 15 hours or more). If our laptop lose internet connection in between, the Kernal on which it is training will die. But the model continues to train (I confirmed this with model.save command, and the model did save in the s3 bucket).
I want to know if there is a way, to track the status/progress of our model training when Kernel dies at Sagemaker environment.
Note: I know we can create a training job under Training - Training Jobs - Create Training Jobs. I just wanted to know if there is any other approach to track if we are not creating the Training Job.

Could you specify the 'Job Name' of the sagemaker training job? You can get the status using an API call if you have the job name. https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html
Another note: you can specify the job name of a training job using the 'TrainingJobName' parameter of training requests: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html

Simply check of status
When you run a training job, a log tracker is automatically created in CloudWatch within the "/aws/sagemaker/TrainingJobs" group with the name of your job and in turn one or more sub-logs, based on the number of instances selected.
This already ensures you can track the status of the job even if the kernel dies or if you simply turn off the notebook instance.
Monitor metrics
For sagemaker's built-in algorithms, no configuration action is required since the monitorable metrics are already prepared.
Custom model
On custom models, on the other hand, to have a monitoring graph of metrics, you can configure the log group related to them in CloudWatch (Metrics) as the official documentation explains. at "Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics" and "Define Metrics".
Basically, you just need to add the parameter metric_definitions to your Estimator (or a subclass of it):
metric_definitions=[
{'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
{'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
]
this will capture from the print/logger output of your training script the text identified by the regexes you set (which you can clearly change to your liking) and create a tracking within cloudwatch metrics.
A complete code example from doc:
import sagemaker
from sagemaker.estimator import Estimator
estimator = Estimator(
image_uri="your-own-image-uri",
role=sagemaker.get_execution_role(),
sagemaker_session=sagemaker.Session(),
instance_count=1,
instance_type='ml.c4.xlarge',
metric_definitions=[
{'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
{'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
]
)

Related

How do I see monitoing jobs in GCP console?

I have created a monitoring job using create_model_deployment_monitoring_job. How do I view it in GCP Monitoring?
I create the monitoring job thus:
job = vertex_ai_beta.ModelDeploymentMonitoringJob(
display_name=MONITORING_JOB_NAME,
endpoint=endpoint_uri,
model_deployment_monitoring_objective_configs=deployment_objective_configs,
logging_sampling_strategy=sampling_config,
model_deployment_monitoring_schedule_config=schedule_config,
model_monitoring_alert_config=alerting_config,
)
response = job_client_beta.create_model_deployment_monitoring_job(
parent=PARENT, model_deployment_monitoring_job=job
)
AI Platform Training supports two types of jobs: training and batch prediction. The details for each are different, but the basic operation is the same.
As you are using Vertex AI, you can check the job status in the Vertex AI dashboard. In GCP Console search for Vertex AI , enable API or click on this link and follow this Doc for Job status
Then following this Link summarizes the job operations and lists the interfaces you can use to perform them and also to know more information about Jobs follow this link

Schedule batch predictions Vertex AI

I have created a forecasting model using AutoML on Vertex AI. I want to use this model to make batch predictions every week. Is there a way to schedule this?
The data to make those predictions is stored in a bigquery table, which is updated every week.
There is no automatic scheduling directly in Vertex AutoML yet but many different ways to set this up in GCP.
Two options to try first using the client libraries available for BigQuery and Vertex:
Cloud Scheduler to use cron https://cloud.google.com/scheduler/docs/quickstart
use either Cloud Functions or Cloud Run to setup a BigQuery event trigger, and then trigger the AutoML batch prediction. Example to repurpose https://cloud.google.com/blog/topics/developers-practitioners/how-trigger-cloud-run-actions-bigquery-events
Not sure if you're using Vertex pipelines to run the prediction job but if you are there's a method to schedule your pipeline execution listed here.
from kfp.v2.google.client import AIPlatformClient # noqa: F811
api_client = AIPlatformClient(project_id=PROJECT_ID, region=REGION)
# adjust time zone and cron schedule as necessary
response = api_client.create_schedule_from_job_spec(
job_spec_path="intro_pipeline.json",
schedule="2 * * * *",
time_zone="America/Los_Angeles", # change this as necessary
parameter_values={"text": "Hello world!"},
# pipeline_root=PIPELINE_ROOT # this argument is necessary if you did not specify PIPELINE_ROOT as part of the pipeline definition.
)

Is there a way to publish custom metrics from AWS Glue jobs?

I'm using an AWS Glue job to move and transform data across S3 buckets, and I'd like to build custom accumulators to monitor the number of rows that I'm receiving and sending, along with other custom metrics. What is the best way to monitor these metrics? According to this document: https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html I can keep track of general metrics on my glue job but there doesn't seem to be a good way to send custom metrics through cloudwatch.
I have done lots of similar project like this, each micro batch can be:
a file or a bunch of file
a time interval of data from API
a partition of records from database
etc ...
Your use case is can be break down into three question:
given a bunch of input, how could you define a task_id
how you want to define the metrics for your task, you need to define a simple dictionary structure for this metrics data
find a backend data store to store the metrics data
find a way to query the metrics data
In some business use case, you also need to store status information to track each of the input, are they succeeded? failed? in-progress? stuck? and you may want to control retry, and concurrency control (avoid multiple worker working on the same input)
DynamoDB is the perfect backend for this type of use case. It is a super fast, no ops, pay as you go, automatically scaling key-value store.
There's a Python library that implemented this pattern https://github.com/MacHu-GWU/pynamodb_mate-project/blob/master/examples/patterns/status-tracker.ipynb
Here's an example:
put your glue ETL job main logic in a function:
def glue_job() -> dict:
...
return your_metrics
given an input, calculate the task id identifier, then you just need
tracker = Tracker.new(task_id)
# start the job, it will succeed
with tracker.start_job():
# do some work
your_metrics = glue_job()
# save your metrics in dynamodb
tracker.set_data(your_metrics)
Consider enabling continuous logging on your AWS Glue Job. This will allow you to do custom logging via. CloudWatch. Custom logging can include information such as row count.
More specifically
Enable continuous logging for you Glue Job
Add logger = glueContext.get_logger() at the beginning of you Glue Job
Add logger.info("Custom logging message that will be sent to CloudWatch") where you want to log information to CloudWatch. For example if I have a data frame named df I could log the number of rows to CloudWatch by adding logger.info("Row count of df " + str(df.count()))
Your log messages will be located under the CloudWatch log groups /aws-glue/jobs/logs-v2 under the log stream named glue_run_id -driver.
You can also reference the "Logging Application-Specific Messages Using the Custom Script Logger" section of the AWS documentation Enabling Continuous Logging for AWS Glue Jobs for more information on application specific logging.

How to perform Real Time Object Detection with trained AWS model

After successfully training object detection model with AWS SageMaker, how do I use this model to perform real time object detection on RTSP video?
One solution for real-time object detection is as follows.
After training your model, upload your model to S3. You can check this using
$aws sagemaker list-training-jobs --region us-east-1
Then you need to your trained model on an Amazon SageMaker endpoint. You can do this using this:
object_detector = estimator.deploy(initial_instance_count = 1,
instance_type = 'ml.g4dn.xlarge')
After your model is attached to an endpoint you will need to create an API that will allow users to pass input to your trained model for inference. You can do this using Serverless. With Serverless, you can create a template which will create a Lambda function as handler.py and serverless.yml which needs to be configured on how your application will operate. Make sure that in serverless.yml you specify your endpoint name, SAGEMAKER_ENDPOINT_NAMEas well as Resource: ${ssm:sagemakerarn}. This is an allow policy resource (AWS Systems Manager Agent) parameter that needs to be passed in. In your lambda function, make sure you invoke your SageMaker endpoint.
From here you can now deploy your API for real-time detection:
serverless deploy -v
Finally, you can use curl to invoke your API.
See here for a detailed walk-through.
You haven't specified where you will host the model: on AWS, a mobile device or something else. However, the general approach is that your model, assuming a CNN that processes images, will consume one frame at a time. You haven't specified a programming language or libraries so here is the general process in psuedocode:
while True:
video_frame = get_next_rtsp_frame()
detections = model.predict(video_frame)
# There might be multiple objects detected, handle each one:
for detected_object in detections:
(x1, y1, x2, y2, score) = detected_object # bounding box
# use the bounding box information, say to draw a box on the image
One challenge with a real-time video stream requirement is avoiding latency, depending on your platform and what kind of processing you do in your loop. You can skip frames, or don't buffer missed frames, to address this.

Data Preprocessing on AWS SageMaker

I have an endpoint running a trained SageMaker model on AWS, which expects the data on a specific format.
Initially, the data has been processed on the client side of the application, it means, the API Gateway (which receives the POST API calls on AWS) used to receive pre-processed data, but now there's a change, the API Gateway will receive raw data from the client, and the job of pre-processing this data before sending to our SageMaker model is up to our workflow.
What is the best way to create a pre-processing job on this workflow, without needing to re-train the model? My pre-process is just a bunch of dataframe transformations, no standardization or calculation with the training set required (it would not need to save any model file).
Thanks!
After some research, this is the solution I've followed:
First I have created a SKLearn sagemaker model to do all the preprocess setup (I've built a Scikit-Learn custom class to handle all the preprocess steps, following this AWS code)
Trained this preprocess model on my training data. My model, in specific, didn't need to be trained (it does not have any standardization or anything that would need to store training data parameters), but sagemaker requires the model to be trained.
Loaded the trained legacy model that we had using the Model parameter.
Created a PipelineModel with the preprocessing model and legacy model in cascade:
pipeline_model = PipelineModel(name=model_name,
role=role,
models=[
preprocess_model,
trained_model
])
Create a new endpoint, calling the PipelineModel and then changed the Lambda function to call this new endpoint. With this I could send the raw data directly for the same API Gateway and it would call only one endpoint, without needing to pay two endpoints 24/7 to perform the entire process.
I've found this to be a good and "economic" way to perform the preprocess outside the trained model, without having to do hard processing jobs on a Lambda function.
I would create a Lambda, which is getting invoked by the API-Gateway, processing the data and sending it to your SageMaker endpoint.