SageMaker Monitoring 'describe_processing_job' Error - amazon-web-services

I am following the steps in the SageMaker Monitoring Tutorial here: https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_model_monitor/introduction/SageMaker-ModelMonitoring.html
I get the following error for sage.describe_processing_job():
I am not using any directory like /opt/ml... in my code. What is the /opt/ml... directory mentioned in the error, and how can I fix it?

To answer your question, please share the code you are using (the link seems to be broken). In the meantime, I'm sharing reference code for creating and running a processing job to set up model monitoring. With this code you can run the processing job manually (without having to wait for the model monitoring job to run on its hourly schedule), as long as you adhere to the API contracts. The /opt/ml/processing/... paths are directories inside the processing container: SageMaker downloads each ProcessingInput from S3 into its destination path and uploads whatever the job writes to each ProcessingOutput source path back to S3.
Reference code
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

processor = Processor(
    base_job_name='nlp-data-drift-bert-v1',
    role=role,
    image_uri=processing_repository_uri,
    instance_count=1,
    instance_type='ml.m5.large',
    env={'THRESHOLD': '0.5', 'bucket': bucket},
)

processor.run(
    inputs=[ProcessingInput(
        input_name='endpointdata',
        source=f's3://{sagemaker_session.default_bucket()}/{s3_prefix}/endpoint/data_capture',
        # Container path the captured endpoint data is downloaded to
        destination='/opt/ml/processing/input/endpoint',
    )],
    outputs=[ProcessingOutput(
        output_name='result',
        # Container path the job writes its results to; uploaded to `destination` in S3
        source='/opt/ml/processing/resultdata',
        destination=destination,
    )],
)
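For completeness, here is a minimal sketch (not part of the tutorial) of inspecting the resulting job with describe_processing_job; the /opt/ml/processing/... paths it reports are the container-side paths that the inputs and outputs above are mounted to, not directories in your own code.

import boto3

sm_client = boto3.client("sagemaker")

# Assumes processor.run() above has been called; the SDK generates the job name
# from base_job_name plus a timestamp.
job_name = processor.latest_job.job_name

response = sm_client.describe_processing_job(ProcessingJobName=job_name)
print(response["ProcessingJobStatus"])      # e.g. InProgress, Completed, Failed
print(response.get("FailureReason", ""))    # populated only when the job failed
print(response["ProcessingInputs"])         # shows the /opt/ml/processing/... local paths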

Related

How to track the model Progress/status when Sagemaker Kernel is dead?

While training a model on AWS SageMaker (let us assume training takes 15 hours or more), if our laptop loses its internet connection in between, the kernel on which it is training will die, but the model continues to train (I confirmed this with the model.save command, and the model did save in the S3 bucket).
I want to know if there is a way to track the status/progress of our model training when the kernel dies in the SageMaker environment.
Note: I know we can create a training job under Training - Training Jobs - Create Training Jobs. I just want to know if there is any other approach to track it if we are not creating the training job.
Could you specify the 'Job Name' of the SageMaker training job? You can get the status using an API call if you have the job name: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html
Another note: you can specify the job name of a training job using the 'TrainingJobName' parameter of training requests: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html
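For example, a minimal sketch with boto3 (the job name below is a placeholder):

import boto3

sm_client = boto3.client("sagemaker")

job_name = "my-training-job"  # placeholder: the name shown under Training > Training jobs

response = sm_client.describe_training_job(TrainingJobName=job_name)
print(response["TrainingJobStatus"])   # InProgress | Completed | Failed | Stopping | Stopped
print(response["SecondaryStatus"])     # finer-grained state, e.g. Starting, Training, Uploading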
Simply check the status
When you run a training job, log streams are automatically created in CloudWatch under the "/aws/sagemaker/TrainingJobs" log group, named after your job, with one or more sub-logs based on the number of instances selected.
This already ensures you can track the status of the job even if the kernel dies or if you simply turn off the notebook instance.
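As a rough sketch (assuming the default log group and a placeholder job name), the same logs can also be pulled programmatically:

import boto3

logs_client = boto3.client("logs")
job_name = "my-training-job"  # placeholder

# Each training instance writes to a log stream whose name starts with the job name
streams = logs_client.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix=job_name,
)
for stream in streams["logStreams"]:
    events = logs_client.get_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamName=stream["logStreamName"],
        limit=50,
    )
    for event in events["events"]:
        print(event["message"])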
Monitor metrics
For SageMaker's built-in algorithms, no configuration is required, since the monitorable metrics are already defined.
Custom model
For custom models, on the other hand, to get a monitoring graph of your metrics you can publish them to CloudWatch Metrics, as the official documentation explains under "Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics" and "Define Metrics".
Basically, you just need to add the parameter metric_definitions to your Estimator (or a subclass of it):
metric_definitions=[
    {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
    {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
]
This will capture, from the print/logger output of your training script, the text identified by the regexes you set (which you can of course change to your liking) and create a corresponding metric in CloudWatch Metrics.
A complete code example from doc:
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-own-image-uri",
    role=sagemaker.get_execution_role(),
    sagemaker_session=sagemaker.Session(),
    instance_count=1,
    instance_type='ml.c4.xlarge',
    metric_definitions=[
        {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
        {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
    ]
)
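For the regexes above to capture anything, the training script just has to print (or log) lines in the matching format; a hypothetical example with made-up values:

# Hypothetical values computed by your training loop
train_error, valid_error = 0.042, 0.061

# These lines are what the 'Train_error=(.*?);' and 'Valid_error=(.*?);'
# regexes above would pick up from the job's log output.
print(f"Train_error={train_error:.4f};")
print(f"Valid_error={valid_error:.4f};")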

AWS Glue Job using awsglueml.transforms.FindMatches gives timeout error seemingly randomly

I have a Glue ETL Job (using pyspark) that gives a timeout error when trying to access the awsglueml.transforms.FindMatches library seemingly randomly. The error given on the glue dashboard is:
An error occurred while calling z:com.amazonaws.services.glue.ml.FindMatches.apply. The target server failed to respond
Basically, if I run this Glue ETL job late at night, it succeeds most of the time, but if I run it in the middle of the day, it fails with this error. Sometimes just retrying it enough times makes it succeed, but that doesn't seem like a good solution. It seems like the issue is the AWS FindMatches service not having enough capacity to support everyone who wants to use it, but I could be wrong here.
The Glue ETL job was set up using the "A proposed script generated by AWS Glue" option.
The line of code that this is timing out on is a line that was provided by glue when I created this job:
from awsglueml.transforms import FindMatches
...
findmatches2 = FindMatches.apply(
    frame=datasource0,
    transformId="<redacted>",
    computeMatchConfidenceScores=True,
    transformation_ctx="findmatches2",
)
Welcoming any information on this elusive issue.
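Not an answer to the underlying timeout, but since retries do eventually succeed, the retry can at least be automated; a hedged boto3 sketch (job name and retry budget are placeholders):

import time
import boto3

glue_client = boto3.client("glue")
job_name = "my-findmatches-etl-job"  # placeholder

for attempt in range(3):  # arbitrary retry budget
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]

    # Poll until the run reaches a terminal state
    while True:
        state = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED", "ERROR"):
            break
        time.sleep(60)

    if state == "SUCCEEDED":
        break
    print(f"Attempt {attempt + 1} ended in {state}, retrying...")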

VertexAI Batch Inference Failing for Custom Container Model

I'm having trouble executing VertexAI's batch inference, despite endpoint deployment and inference working perfectly. My TensorFlow model has been trained in a custom Docker container with the following arguments:
aiplatform.CustomContainerTrainingJob(
    display_name=display_name,
    command=["python3", "train.py"],
    container_uri=container_uri,
    model_serving_container_image_uri=container_uri,
    model_serving_container_environment_variables=env_vars,
    model_serving_container_predict_route='/predict',
    model_serving_container_health_route='/health',
    model_serving_container_command=[
        "gunicorn",
        "src.inference:app",
        "--bind",
        "0.0.0.0:5000",
        "-k",
        "uvicorn.workers.UvicornWorker",
        "-t",
        "6000",
    ],
    model_serving_container_ports=[5000],
)
I have Flask endpoints defined for predict and health, essentially as below:
@app.get("/health")
def health_check_batch():
    return 200

@app.post("/predict")
def predict_batch(request_body: dict):
    pred_df = pd.DataFrame(request_body['instances'],
                           columns=request_body['parameters']['columns'])
    # do some model inference things
    return {"predictions": predictions.tolist()}
As described, when training a model and deploying to an endpoint, I can successfully hit the API with JSON schema like:
{"instances":[[1,2], [1,3]], "parameters":{"columns":["first", "second"]}}
This also works when using the endpoint Python SDK and feeding in instances/parameters as function arguments.
However, I've tried performing batch inference with a CSV file and a JSONL file, and every time it fails with an Error Code 3. I can't find logs on why it failed in Logs Explorer either. I've read through all the documentation I could find and have seen others successfully invoke batch inference, but I haven't been able to find a guide. Does anyone have recommendations on the batch file structure or the structure of my APIs? Thank you!
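One thing worth checking (an assumption on my part, not verified against this setup): a JSONL batch source contains only instances, and the batch service wraps each line into the "instances" array it sends to /predict, so a handler that requires a top-level "parameters" key may never receive it. A hypothetical sketch of writing such a file:

import json

# Hypothetical rows matching the two-column schema used above
rows = [[1, 2], [1, 3], [4, 5]]

# One JSON value per line; each line is treated as a single instance
# (assumption: no per-request "parameters" block is added for batch jobs).
with open("batch_input.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")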

How to deploy our own TensorFlow Object Detection Model in amazon Sagemaker?

I have my own trained TF Object Detection model. When I try to deploy/implement the same model in AWS SageMaker, it does not work.
I have tried TensorFlowModel() in SageMaker, but there is an argument called entry_point: how do I create that .py file for prediction?
entry_point is an argument that contains the file name inference.py. Once you create an endpoint and try to predict an image using the invoke-endpoint API, an instance is created based on what you specified, and it goes to the inference.py script to execute the process.
Link: Documentation for TensorFlow model deployment in Amazon SageMaker
The inference script must contain the methods input_handler and output_handler (or a single handler that covers both) in the inference.py script, for pre- and post-processing of your image.
Example for deploying the TensorFlow model
In the link above I have mentioned a Medium post; it should be helpful for your doubts.
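A minimal, hedged skeleton of what such an inference.py could look like for the SageMaker TensorFlow Serving container (pre/post-processing only; adapt the JSON handling to your detection model's input format):

# inference.py -- sketch of the pre/post-processing handlers the SageMaker
# TensorFlow Serving container calls; passed to TensorFlowModel via entry_point.
import json

def input_handler(data, context):
    """Turn the incoming request into the JSON body TensorFlow Serving expects."""
    if context.request_content_type == "application/json":
        body = data.read().decode("utf-8")
        # Assumes the client already sends {"instances": [...]} shaped for the model
        return body if len(body) else ""
    raise ValueError(f"Unsupported content type: {context.request_content_type}")

def output_handler(data, context):
    """Post-process the TensorFlow Serving response before returning it to the caller."""
    if data.status_code != 200:
        raise ValueError(data.content.decode("utf-8"))
    response_content_type = context.accept_header
    prediction = data.content
    return prediction, response_content_type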

Perform cloud formation only if any changes in lambda using AWS Code Pipeline

I am using AWS CodePipeline to perform CloudFormation deployments. My source code is committed in a GitHub repository. Whenever a commit happens in my GitHub repository, AWS CodePipeline starts executing and performs the CloudFormation deployment. These functionalities are working fine.
My project has multiple modules, so if a user modifies only one module, the Lambdas of every module are still updated. Is there any way to restrict this using AWS CodePipeline?
My Code Pipeline has 3 stages.
Source
Build
Deploy
The following is the snapshot of my code pipeline.
We had a similar issue and eventually came to the conclusion that this is not exactly possible. Unless you separate your modules into different repos and make a separate pipeline for each of them, it is always going to execute everything.
The good thing is that each execution of the pipeline does not entirely redeploy everything when the CloudFormation is executed. In the deploy stage you can add a Create Changeset action, which detects what has changed since the previous CloudFormation deployment, redeploys only those parts, and does not touch anything else.
This is the exact issue we faced recently, and while I see comments mentioning that it isn't possible to achieve with a single repository, I have found a workaround!
Generally, the pipeline is triggered by a CloudWatch event listening to the GitHub/CodeCommit repository. Rather than triggering the pipeline directly, I made the CloudWatch event trigger a Lambda function. In the Lambda, we can write the logic to execute only the pipeline(s) for the module(s) that have changes. This works really nicely and provides a lot of control over pipeline execution. This way multiple pipelines can be created from a single repository, solving the problem mentioned in the question.
Lambda logic can be something like:
import boto3

# Map config files to pipelines
project_pipeline_mapping = {
    "CodeQuality_ScoreCard": "test-pipeline-code-quality",
    "ProductQuality_ScoreCard": "test-product-quality-pipeline"
}

files_to_ignore = ["readme.md"]

codecommit_client = boto3.client('codecommit')
codepipeline_client = boto3.client('codepipeline')

def lambda_handler(event, context):
    projects_changed = []

    # Extract commits
    print("\n EVENT::: ", event)
    old_commit_id = event["detail"]["oldCommitId"]
    new_commit_id = event["detail"]["commitId"]

    # Get commit differences
    codecommit_response = codecommit_client.get_differences(
        repositoryName="ScorecardAPI",
        beforeCommitSpecifier=str(old_commit_id),
        afterCommitSpecifier=str(new_commit_id)
    )
    print("\n Code commit response: ", codecommit_response)

    # Search commit differences for files that trigger executions
    for difference in codecommit_response["differences"]:
        file_name = difference["afterBlob"]["path"]
        project_name = file_name.split('/')[0]
        print("\nChanged project: ", project_name)

        # If the project corresponds to a pipeline, add it to the pipeline array
        if project_name in project_pipeline_mapping:
            projects_changed.insert(len(projects_changed), project_name)

    projects_changed = list(dict.fromkeys(projects_changed))
    print("pipeline(s) to be executed: ", projects_changed)

    for project in projects_changed:
        codepipeline_response = codepipeline_client.start_pipeline_execution(
            name=project_pipeline_mapping[project]
        )
Check the AWS blog on this topic: Customizing triggers for AWS CodePipeline with AWS Lambda and Amazon CloudWatch Events
Why not model this as a pipeline per module?