How to debug an invocation timeout error in SageMaker batch transform? - amazon-web-services

I am experimenting with SageMaker, using a container from the list here, https://github.com/aws/deep-learning-containers/blob/master/available_images.md, to run my model, and overriding the model_fn and predict_fn functions in my inference.py file for model loading and prediction, as shown here: https://github.com/PacktPublishing/Learn-Amazon-SageMaker-second-edition/blob/main/Chapter%2007/huggingface/src/torchserve-predictor.py
I keep getting an invocations timeout error: "Model server did not respond to /invocations request within 3600 seconds". Am I missing anything in my inference.py code, such as adding something to respond to the ping/health check?
file: inference.py
import json
import torch
from transformers import AutoConfig, AutoTokenizer, DistilBertForSequenceClassification

JSON_CONTENT_TYPE = 'application/json'

def model_fn(model_dir):
    config_path = '{}/config.json'.format(model_dir)
    model_path = '{}/pytorch_model.bin'.format(model_dir)
    config = AutoConfig.from_pretrained(config_path)
    ...

def predict_fn(input_data, model):
    # return predictions
    ...

The issue is not with the health checks. It is with the container not responding to the /invocations request, which can happen when the model takes longer than expected to produce predictions for the input data.
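Since 3600 seconds is already the maximum invocation timeout for a batch transform job, the usual fix is to reduce how much work each /invocations call has to do rather than to wait longer, for example by sending one record per request and capping the payload size. A minimal sketch, assuming the SageMaker Python SDK v2 and an existing model object; the S3 paths are placeholders:

transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    strategy='SingleRecord',                     # one record per /invocations call
    max_payload=6,                               # cap each request payload at 6 MB
    output_path='s3://my-bucket/batch-output/',  # placeholder path
)
transformer.transform(
    data='s3://my-bucket/batch-input/',          # placeholder path
    content_type='application/json',
    split_type='Line',
    # model_client_config maps to the CreateTransformJob ModelClientConfig
    # (available in recent SDK versions); 3600 is the allowed maximum
    model_client_config={'InvocationsTimeoutInSeconds': 3600, 'InvocationsMaxRetries': 1},
)

If a single record still takes longer than that, the model itself (or the instance type) is the bottleneck, and profiling predict_fn locally is the next step.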

Related

Invoke endpoint error - detectron2 on AWS Sagemaker: ValueError: Type [application/x-npy] not support this type yet

I have been following this guide for implementing a Detectron2 model on SageMaker.
It all looks good, both on the training and the batch transform side.
However, I tried to tweak the code a bit to create an endpoint that can be invoked by sending a payload, and I am having some trouble with it.
At the end of this notebook, after creating the SageMaker model object:
model = PyTorchModel(
    name="d2-sku110k-model",
    model_data=training_job_artifact,
    role=role,
    sagemaker_session=sm_session,
    entry_point="predict_sku110k.py",
    source_dir="container_serving",
    image_uri=serve_image_uri,
    framework_version="1.6.0",
    code_location=f"s3://{bucket}/{prefix_code}",
)
I added the following code:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
And I can see that the model has been successfully deployed.
However, when I try to predict an image with:
predictor.predict(input)
I get the following error:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary with message "Type [application/x-npy] not support this type yet
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 126, in transform
result = self._transform_fn(self._model, input_data, content_type, accept)
File "/opt/conda/lib/python3.6/site-packages/sagemaker_inference/transformer.py", line 215, in _default_transform_fn
data = self._input_fn(input_data, content_type)
File "/opt/ml/model/code/predict_sku110k.py", line 98, in input_fn
raise ValueError(err_msg)
ValueError: Type [application/x-npy] not support this type yet
I tried a bunch of different input types: a byte-encoded image (created with cv2.imencode('.jpg', cv_img)[1].tobytes()), a numpy array, a BytesIO object (created with the io module), and a dictionary of the form {'input': image} where image is any of the previous ones (I tried this because that format was used by a TensorFlow endpoint I created some time ago).
As I think it might be relevant, I am also copy-pasting here the inference script used as the entry point:
"""Code used for sagemaker batch transform jobs"""
from typing import BinaryIO, Mapping
import json
import logging
import sys
from pathlib import Path
import numpy as np
import cv2
import torch
from detectron2.engine import DefaultPredictor
from detectron2.config import CfgNode
##############
# Macros
##############
LOGGER = logging.Logger("InferenceScript", level=logging.INFO)
HANDLER = logging.StreamHandler(sys.stdout)
HANDLER.setFormatter(logging.Formatter("%(levelname)s | %(name)s | %(message)s"))
LOGGER.addHandler(HANDLER)
##########
# Deploy
##########
def _load_from_bytearray(request_body: BinaryIO) -> np.ndarray:
npimg = np.frombuffer(request_body, np.uint8)
return cv2.imdecode(npimg, cv2.IMREAD_COLOR)
def model_fn(model_dir: str) -> DefaultPredictor:
r"""Load trained model
Parameters
----------
model_dir : str
S3 location of the model directory
Returns
-------
DefaultPredictor
PyTorch model created by using Detectron2 API
"""
path_cfg, path_model = None, None
for p_file in Path(model_dir).iterdir():
if p_file.suffix == ".json":
path_cfg = p_file
if p_file.suffix == ".pth":
path_model = p_file
LOGGER.info(f"Using configuration specified in {path_cfg}")
LOGGER.info(f"Using model saved at {path_model}")
if path_model is None:
err_msg = "Missing model PTH file"
LOGGER.error(err_msg)
raise RuntimeError(err_msg)
if path_cfg is None:
err_msg = "Missing configuration JSON file"
LOGGER.error(err_msg)
raise RuntimeError(err_msg)
with open(str(path_cfg)) as fid:
cfg = CfgNode(json.load(fid))
cfg.MODEL.WEIGHTS = str(path_model)
cfg.MODEL.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
return DefaultPredictor(cfg)
def input_fn(request_body: BinaryIO, request_content_type: str) -> np.ndarray:
r"""Parse input data
Parameters
----------
request_body : BinaryIO
encoded input image
request_content_type : str
type of content
Returns
-------
np.ndarray
input image
Raises
------
ValueError
ValueError if the content type is not `application/x-image`
"""
if request_content_type == "application/x-image":
np_image = _load_from_bytearray(request_body)
else:
err_msg = f"Type [{request_content_type}] not support this type yet"
LOGGER.error(err_msg)
raise ValueError(err_msg)
return np_image
def predict_fn(input_object: np.ndarray, predictor: DefaultPredictor) -> Mapping:
r"""Run Detectron2 prediction
Parameters
----------
input_object : np.ndarray
input image
predictor : DefaultPredictor
Detectron2 default predictor (see Detectron2 documentation for details)
Returns
-------
Mapping
a dictionary that contains: the image shape (`image_height`, `image_width`), the predicted
bounding boxes in format x1y1x2y2 (`pred_boxes`), the confidence scores (`scores`) and the
labels associated with the bounding boxes (`pred_boxes`)
"""
LOGGER.info(f"Prediction on image of shape {input_object.shape}")
outputs = predictor(input_object)
fmt_out = {
"image_height": input_object.shape[0],
"image_width": input_object.shape[1],
"pred_boxes": outputs["instances"].pred_boxes.tensor.tolist(),
"scores": outputs["instances"].scores.tolist(),
"pred_classes": outputs["instances"].pred_classes.tolist(),
}
LOGGER.info(f"Number of detected boxes: {len(fmt_out['pred_boxes'])}")
return fmt_out
# pylint: disable=unused-argument
def output_fn(predictions, response_content_type):
r"""Serialize the prediction result into the desired response content type"""
return json.dumps(predictions)
Can anyone point out the correct format for invoking the model (or how to tweak the code to use the endpoint)? I am thinking of changing the request_content_type to 'application/json', but I am not sure it will help much.
Edit: I tried a solution inspired by this SO thread but it did not work for my case.
It's been a while since you asked this so I hope you found a solution already, but for people seeing this in the future ...
The error appears to be because you are sending the request with the default content type (you did not specify a content type in the request, nor a serializer), but your code is written so that it only responds to requests that come with the content type "application/x-image".
The default PyTorch predictor serializes the payload as "application/x-npy", which is exactly the type shown in your error message, and your input_fn rejects anything other than "application/x-image".
You have two options here: either amend your code to handle the content type that is actually being sent, or add a content-type header with the right value when you invoke the endpoint. You could do the latter by changing the predict call as below:
instead of:
predictor.predict(input)
try:
predictor.predict(input, initial_args={"ContentType":"application/x-image"})
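If you are on SageMaker Python SDK v2, another option is to attach a serializer to the predictor once instead of passing initial_args on every call. A minimal sketch, assuming the image is read from a local file (the file name is a placeholder):

from sagemaker.serializers import IdentitySerializer

# Send the raw bytes through unchanged, tagged with the content type the entry point expects
predictor.serializer = IdentitySerializer(content_type="application/x-image")

with open("image.jpg", "rb") as f:  # placeholder file name
    payload = f.read()

result = predictor.predict(payload)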

Dialogflow: Agent metadata not found for agentId

I'm trying to use Dialogflow's detect_intent in Python and I keep getting:
404 com.google.apps.framework.request.NotFoundException: Agent metadata not found for agentId: ####-####-####-####-####
Here's a snippet of my code:
import os
import google.cloud.dialogflow as dialogflow
from CONFIG import DIALOGFLOW_PROJECT_ID

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'credentials/dialogflow.json'

def predict_intent(text, language):
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(DIALOGFLOW_PROJECT_ID, SESSION_ID)  # SESSION_ID defined elsewhere
    text_input = dialogflow.TextInput(text=text, language_code=language)
    query_input = dialogflow.QueryInput(text=text_input)
    response = session_client.detect_intent(session=session, query_input=query_input)  # ERROR
    return response.query_result.intent.display_name
I tried running the function multiple times; some of the calls succeed, but most of them end in the exception.
I can train the bot using the same interface and it works fine.
I'm using Python 3.7 and the following Google Cloud modules: google-api-core==2.0.1, google-auth==2.0.2, google-cloud-dialogflow==2.7.1, googleapis-common-protos==1.53.0.

How to run BigQuery after Dataflow job completed successfully

I am trying to run a query in BigQuery right after a Dataflow job completes successfully. I have defined 3 different functions in main.py.
The first one runs the Dataflow job. The second one checks the Dataflow job's status. And the last one runs the query in BigQuery.
The trouble is that the second function checks the Dataflow job status repeatedly, and even after the Dataflow job completes successfully, it does not stop checking the status.
The function deployment then fails with a 'function load attempt timed out' error.
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
import os
import re
import config
from google.cloud import bigquery
import time

global flag

def trigger_job(gcs_path, body):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    request = service.projects().templates().launch(projectId=config.project_id, gcsPath=gcs_path, body=body)
    response = request.execute()

def get_job_status(location, flag):
    credentials = GoogleCredentials.get_application_default()
    dataflow = build('dataflow', 'v1b3', credentials=credentials, cache_discovery=False)
    result = dataflow.projects().jobs().list(projectId=config.project_id, location=location).execute()
    for job in result['jobs']:
        if re.findall(r'' + re.escape(config.job_name) + '', job['name']):
            while flag == 0:
                if job['currentState'] != "JOB_STATE_DONE":
                    print('NOT DONE')
                else:
                    flag = 1
                    print('DONE')
                    break

def bq(sql):
    client = bigquery.Client()
    query_job = client.query(sql, location='US')

gcs_path = config.gcs_path
body = config.body
trigger_job(gcs_path, body)

flag = 0
location = 'us-central1'
get_job_status(location, flag)

sql = """CREATE OR REPLACE TABLE 'table' AS SELECT * FROM 'table'"""
bq(sql)
The Cloud Function timeout is set to 540 seconds, but deployment fails in 3-4 minutes.
Any help is much appreciated.
It appears from the code snippet provided that your HTTP-triggered Cloud Function is not returning an HTTP response.
All HTTP-based Cloud Functions must return an HTTP response for proper termination. From the Google documentation, Ensure HTTP functions send an HTTP response (emphasis mine):
If your function is HTTP-triggered, remember to send an HTTP response,
as shown below. Failing to do so can result in your function executing
until timeout. If this occurs, you will be charged for the entire
timeout time. Timeouts may also cause unpredictable behavior or cold
starts on subsequent invocations, resulting in unpredictable behavior
or additional latency.
Thus, you must have a function in your main.py that returns some sort of value, ideally one that can be coerced into a Flask HTTP response.
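A minimal sketch of what that could look like, reusing the functions from the question (the entry-point name main and the response text are illustrative, not from the original post):

def main(request):
    # HTTP-triggered entry point: run the pipeline, then return a response
    trigger_job(config.gcs_path, config.body)
    get_job_status('us-central1', 0)
    bq(sql)  # sql defined at module level as in the question
    # A (body, status) tuple is coerced into a proper Flask response by Cloud Functions
    return ('Dataflow job finished and BigQuery query submitted', 200)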

aws boto3 client Stubber help stubbing unit tests

I'm trying to write some unit tests for AWS RDS. Currently, the start/stop RDS API calls have not yet been implemented in moto. I tried just mocking out boto3 but ran into all sorts of weird issues. I did some googling and found http://botocore.readthedocs.io/en/latest/reference/stubber.html
So I have tried to implement the example for RDS, but the code appears to behave like the normal client, even though I have stubbed it. I'm not sure what's going on or whether I am stubbing correctly.
import boto3
from botocore.stub import Stubber

from LambdaRdsStartStop.lambda_function import lambda_handler
from LambdaRdsStartStop.lambda_function import AWS_REGION

def tests_turn_db_on_when_cw_event_matches_tag_value(self, mock_boto):
    client = boto3.client('rds', AWS_REGION)
    stubber = Stubber(client)
    response = {u'DBInstances': [some copy pasted real data here], extra_info_about_call: extra_info}
    stubber.add_response('describe_db_instances', response, {})
    with stubber:
        r = client.describe_db_instances()
        lambda_handler({u'AutoStart': u'10:00:00+10:00/mon'}, 'context')
So the mocking WORKS for the first line inside the stubber, and the value of r is returned as my stubbed data. But when execution goes into my lambda_handler method inside lambda_function.py, which should still use the stubbed client, it behaves like a normal, unstubbed client:
lambda_function.py
def lambda_handler(event, context):
    rds_client = boto3.client('rds', region_name=AWS_REGION)
    rds_instances = rds_client.describe_db_instances()
error output:
File "D:\dev\projects\virtual_envs\rds_sloth\lib\site-packages\botocore\auth.py", line 340, in add_auth
raise NoCredentialsError
NoCredentialsError: Unable to locate credentials
You will need to patch boto3 in the module where it is used by the routine that you are testing. Also, Stubber responses appear to be consumed on each call, so each stubbed call requires its own add_response, as below:
def tests_turn_db_on_when_cw_event_matches_tag_value(self, mock_boto):
    client = boto3.client('rds', AWS_REGION)
    stubber = Stubber(client)
    # response data below should match the AWS documentation, otherwise botocore's error handling raises further errors
    response = {u'DBInstances': [{'DBInstanceIdentifier': 'rds_response1'}, {'DBInstanceIdentifier': 'rds_response2'}]}
    stubber.add_response('describe_db_instances', response, {})
    stubber.add_response('describe_db_instances', response, {})
    # patch boto3 in the module where lambda_handler lives, so the handler picks up the stubbed client
    with mock.patch('LambdaRdsStartStop.lambda_function.boto3') as mock_boto3:
        with stubber:
            r = client.describe_db_instances()  # first add_response consumed here
            mock_boto3.client.return_value = client
            response = lambda_handler({u'AutoStart': u'10:00:00+10:00/mon'}, 'context')  # second add_response consumed here
            # assert r == response
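An alternative that avoids patching altogether (a sketch of a design choice, not part of the original answer) is to let the handler accept an injected client, so the test can pass the stubbed client in directly:

# lambda_function.py (illustrative variant)
import boto3

AWS_REGION = 'us-east-1'  # placeholder region

def lambda_handler(event, context, rds_client=None):
    # Tests inject a stubbed client; production falls back to a real one
    rds_client = rds_client or boto3.client('rds', region_name=AWS_REGION)
    rds_instances = rds_client.describe_db_instances()
    return rds_instances

The test would then call lambda_handler(event, 'context', rds_client=client) inside the with stubber: block, and no mock.patch is needed.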

Django HTTP Response object for GAE Cron

I am doing this in a Django / GAE / Python environment:
cron:
#run events every 12 hours
and
def events(request):
    # read all records
    # Do some processing on a few records
    return http.HTTPResponseGone('Some Records are modified')
Result in production:
The job runs on time, but with a 'failed' message.
However, it has done the work on the datastore exactly as required.
No error log entry is seen.
Dev: no errors; it returns the message 'Some Records are modified'.
Is it possible to avoid returning an HTTP response? I have no need for an HttpResponse, but I have kept it because dev server testing fails in its absence. Can someone help me make the code clean?
Gone is error 410. You should return 200 Success if the operation succeeds. When you return a plain HttpResponse, the default status is 200.
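A minimal sketch of the view with that change (the comments stand in for the original processing):

from django.http import HttpResponse

def events(request):
    # read all records
    # do some processing on a few records
    return HttpResponse('Some Records are modified')  # defaults to status 200, so the cron run is reported as successful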