Data Preprocessing on AWS SageMaker

I have an endpoint running a trained SageMaker model on AWS, which expects the data in a specific format.
Initially, the data was pre-processed on the client side of the application, meaning the API Gateway (which receives the POST API calls on AWS) used to receive already pre-processed data. That has now changed: the API Gateway will receive raw data from the client, and pre-processing that data before sending it to our SageMaker model is now the job of our workflow.
What is the best way to add a pre-processing step to this workflow without re-training the model? My pre-processing is just a series of dataframe transformations; there is no standardization or any calculation that depends on the training set (so it would not need to save any model file).
Thanks!

After some research, this is the solution I've followed:
First I created an SKLearn SageMaker model to do all the pre-processing (I built a custom Scikit-Learn class to handle all the pre-processing steps, following this AWS example code).
I trained this pre-processing model on my training data. My model in particular didn't actually need training (it has no standardization or anything else that would need to store parameters from the training data), but SageMaker requires the model to be trained.
I loaded the legacy model that we had already trained, using the Model class.
I created a PipelineModel with the pre-processing model and the legacy model in cascade:
from sagemaker.pipeline import PipelineModel

pipeline_model = PipelineModel(name=model_name,
                               role=role,
                               models=[
                                   preprocess_model,
                                   trained_model
                               ])
Finally, I created a new endpoint serving the PipelineModel and changed the Lambda function to call this new endpoint. With this, I can send the raw data directly to the same API Gateway, and it calls a single endpoint, without having to pay for two endpoints running 24/7 to perform the entire process.
I've found this to be a good and economical way to perform the pre-processing outside the trained model, without having to do heavy processing jobs in a Lambda function.
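For context, here is a minimal sketch of how the two models referenced above might be constructed; the S3 paths, the entry-point script name, the framework version and legacy_image_uri are placeholders, not the values from my project:

from sagemaker.sklearn.model import SKLearnModel
from sagemaker.model import Model

# Pre-processing model: a Scikit-Learn model whose inference script only
# applies the dataframe transformations (no learned parameters required).
preprocess_model = SKLearnModel(
    model_data="s3://my-bucket/preprocess/model.tar.gz",  # placeholder path
    role=role,
    entry_point="preprocess.py",                          # placeholder script name
    framework_version="1.2-1",
)

# Legacy model: the already-trained artifact, wrapped with its original
# inference container image.
trained_model = Model(
    image_uri=legacy_image_uri,                           # placeholder image URI
    model_data="s3://my-bucket/legacy/model.tar.gz",      # placeholder path
    role=role,
)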

I would create a Lambda function that is invoked by the API Gateway, processes the data, and sends it to your SageMaker endpoint.
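A minimal sketch of such a handler, assuming an API Gateway proxy integration, a JSON payload, and a CSV-serialized model input; the endpoint name, field names and transformation logic are placeholders:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # Raw payload forwarded by API Gateway.
    raw = json.loads(event["body"])

    # Placeholder for your pre-processing / dataframe transformations.
    features = [raw["feature_a"], raw["feature_b"]]
    payload = ",".join(str(value) for value in features)

    # Forward the pre-processed record to the SageMaker endpoint.
    response = runtime.invoke_endpoint(
        EndpointName="my-sagemaker-endpoint",  # placeholder name
        ContentType="text/csv",
        Body=payload,
    )
    prediction = response["Body"].read().decode("utf-8")

    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}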

Related

Need recommendation to create an API by aggregating data from multiple source APIs

Before I start doing this I wanted to get advice from the community on the best and most efficient manner to go about doing it.
Here is what I want to do:
Ingest data from multiple APIs that return JSON
Store it in either S3 or DynamoDB
Modify the data to use my JSON structure
Pipe out the aggregate data as an API
The data will be updated twice a day, so I would pull in the data from the source APIs and put it through my pipeline twice a day.
So basically I want to create an API by aggregating data from multiple source APIs.
I've started playing with Lambda and created the following function using Python.
#https://stackoverflow.com/a/41765656
import requests
import json

def lambda_handler(event, context):
    #https://www.nylas.com/blog/use-python-requests-module-rest-apis/ USEFUL!!!
    #https://stackoverflow.com/a/65896274
    response = requests.get("https://remoteok.com/api")
    #print(response.json())
    return {
        'statusCode': 200,
        'body': response.json()
    }
#https://stackoverflow.com/questions/63733410/using-lambda-to-add-json-to-dynamodb DYNAMODB
This works and returns a JSON response.
Here are my questions:
Should I store the data on S3 or DynamoDB?
Which AWS service should I use to aggregate the data into my JSON structure?
Which service should I use to publish the aggregate data as an API, API Gateway?
However, before I go further I would like to know what is the best way to go about doing this.
If you have experience with this I would love to hear from you.
The answer will vary depending on the quantity of data you're planning to mine. Lambdas are designed for short-duration, high-frequency workloads and thus might not be suitable.
I would recommend looking into AWS Glue, as this seems like a fairly typical ETL (Extract, Transform, Load) problem. You can set up Glue jobs to run on a schedule, and as for data aggregation, that's the T in ETL.
It's simple to output the Glue dataframe (the result of a transformation) as S3 files, which can then be queried directly by Amazon Athena (as if they were database tables).
As for exposing that data via an API, the serverless framework or SST are great tools for taking the sting out of spinning up a serverless API and associated resources.
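As a rough illustration, a scheduled Glue (PySpark) job for this kind of pipeline might look like the sketch below. The bucket paths and field names are placeholders, and it assumes a separate ingestion step has already landed the raw JSON from the source APIs in S3:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw JSON previously landed in S3 by the ingestion step.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/raw/"]},  # placeholder path
    format="json",
)

# Transform (the "T" in ETL): keep and rename only the fields the output API needs.
df = raw.toDF().selectExpr("id", "position AS title", "company", "date AS posted_at")

# Write the aggregated data back to S3 in a columnar format Athena can query directly.
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glue_context, "aggregated"),
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/aggregated/"},  # placeholder path
    format="parquet",
)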

Vertex AI custom container batch prediction

I have created a custom container for prediction and successfully uploaded the model to Vertex AI. I was also able to deploy the model to an endpoint and successfully request predictions from the endpoint. Within the custom container code, I use the parameters field as described here, which I then supply later on when making an online prediction request.
My questions are regarding requesting batch predictions from a custom container for prediction.
I cannot find any documentation that describes what happens when I request a batch prediction. Say, for example, I use the my_model.batch_predict function from the Python SDK, set instances_format to "csv", and provide the gcs_source. Now, I have set up my custom container to expect prediction requests at /predict as described in this documentation. Does Vertex AI make a POST request to this path, converting the csv data into the appropriate POST body?
How do I specify the parameters field for batch prediction as I did for online prediction?
Yes, Vertex AI makes a POST request to your custom container during batch prediction.
No, there is no way to pass the parameters field for batch prediction, since the service doesn't know which column is a "parameter"; everything is put into "instances".
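For reference, a minimal sketch of the kind of SDK call being discussed; the project, region, model resource name, bucket paths, job name and machine type are placeholders:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region

# Placeholder model resource name.
model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")

# Each CSV row becomes one instance; there is no parameters field for batch prediction.
batch_job = model.batch_predict(
    job_display_name="my-batch-prediction",            # placeholder name
    gcs_source="gs://my-bucket/input/data.csv",        # placeholder path
    gcs_destination_prefix="gs://my-bucket/output/",   # placeholder path
    instances_format="csv",
    machine_type="n1-standard-4",
)
batch_job.wait()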

How to perform Real Time Object Detection with trained AWS model

After successfully training object detection model with AWS SageMaker, how do I use this model to perform real time object detection on RTSP video?
One solution for real-time object detection is as follows.
After training your model, upload it to S3. You can check the training jobs using:
$ aws sagemaker list-training-jobs --region us-east-1
Then you need to deploy your trained model to an Amazon SageMaker endpoint. You can do this like so:
object_detector = estimator.deploy(initial_instance_count=1,
                                   instance_type='ml.g4dn.xlarge')
After your model is attached to an endpoint, you will need to create an API that allows users to pass input to your trained model for inference. You can do this using Serverless. With Serverless, you can create a template which will generate a Lambda function handler (handler.py) and a serverless.yml that needs to be configured for how your application will operate. Make sure that in serverless.yml you specify your endpoint name, SAGEMAKER_ENDPOINT_NAME, as well as Resource: ${ssm:sagemakerarn}. This is an allow-policy resource (an AWS Systems Manager parameter) that needs to be passed in. In your Lambda function, make sure you invoke your SageMaker endpoint.
From here you can now deploy your API for real-time detection:
serverless deploy -v
Finally, you can use curl to invoke your API.
See here for a detailed walk-through.
You haven't specified where you will host the model: on AWS, a mobile device, or somewhere else. However, the general approach is that your model, assuming a CNN that processes images, will consume one frame at a time. You haven't specified a programming language or libraries, so here is the general process in pseudocode:
while True:
    video_frame = get_next_rtsp_frame()
    detections = model.predict(video_frame)
    # There might be multiple objects detected, handle each one:
    for detected_object in detections:
        (x1, y1, x2, y2, score) = detected_object  # bounding box
        # use the bounding box information, say to draw a box on the image
One challenge with a real-time video stream is latency, which depends on your platform and the kind of processing you do in your loop. You can skip frames, or avoid buffering missed frames, to address this.
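As a concrete illustration of that loop, here is a minimal sketch that reads RTSP frames with OpenCV and sends each one to a SageMaker endpoint; the stream URL, endpoint name, content type and response layout are assumptions, not part of the original answer:

import json
import boto3
import cv2

runtime = boto3.client("sagemaker-runtime")
capture = cv2.VideoCapture("rtsp://camera.local/stream")  # placeholder stream URL

while True:
    ok, frame = capture.read()
    if not ok:
        break  # stream ended or dropped

    # Encode the frame as JPEG bytes for the endpoint.
    _, jpeg = cv2.imencode(".jpg", frame)
    response = runtime.invoke_endpoint(
        EndpointName="my-object-detection-endpoint",  # placeholder name
        ContentType="application/x-image",            # assumed content type
        Body=jpeg.tobytes(),
    )

    # Assumed response layout: [class_id, score, x1, y1, x2, y2] per detection,
    # with normalized coordinates.
    detections = json.loads(response["Body"].read())["prediction"]
    h, w = frame.shape[:2]
    for class_id, score, x1, y1, x2, y2 in detections:
        if score > 0.5:
            cv2.rectangle(frame, (int(x1 * w), int(y1 * h)),
                          (int(x2 * w), int(y2 * h)), (0, 255, 0), 2)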

How to deploy our own TensorFlow Object Detection Model in amazon Sagemaker?

I have my own trained TF Object Detection model. When I try to deploy the same model in AWS SageMaker, it does not work.
I have tried TensorFlowModel() in SageMaker, but there is an argument called entry_point. How do I create that .py file for prediction?
entry_point is the argument that contains the file name of your inference script, inference.py. Once you create an endpoint and try to predict on an image using the InvokeEndpoint API, an instance of the type you specified is created, and it runs the inference.py script to process the request.
Link: Documentation for TensorFlow model deployment in Amazon SageMaker
The inference script must contain either the methods input_handler and output_handler, or a single handler method that covers both; these are for pre- and post-processing of your image.
Example for deploying a TensorFlow model
In the link above, I have mentioned a Medium post; it should be helpful for your doubts.
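For illustration, here is a minimal inference.py sketch using the input_handler/output_handler interface of the SageMaker TensorFlow Serving container; the JSON pass-through is an assumption and would depend on what input your model expects:

def input_handler(data, context):
    """Pre-process the request before it reaches the TensorFlow Serving model."""
    if context.request_content_type == "application/json":
        # Pass JSON instances straight through to TensorFlow Serving.
        return data.read().decode("utf-8")
    raise ValueError("Unsupported content type: {}".format(context.request_content_type))

def output_handler(data, context):
    """Post-process the TensorFlow Serving response before returning it to the client."""
    response_content_type = context.accept_header
    prediction = data.content
    return prediction, response_content_type

The script is then passed to TensorFlowModel via the entry_point argument when creating the model.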

ELK stack (Elasticsearch, Logstash, Kibana) - is logstash a necessary component?

We're currently processing daily mobile app log data with AWS lambda and posting it into redshift. The lambda structures the data but it is essentially raw. The next step is to do some actual processing of the log data into sessions etc, for reporting purposes. The final step is to have something do feature engineering, and then use the data for model training.
The steps are
Structure the raw data for storage
Sessionize the data for reporting
Feature engineering for modeling
For step 2, I am looking at using QuickSight and/or Kibana to create a reporting dashboard. But the typical stack, as I understand it, is to do the log processing with Logstash, then send it to Elasticsearch and finally to Kibana/QuickSight. Since we're already handling the initial log processing through Lambda, is it possible to skip this step and pass the data directly into Elasticsearch? If so, where does this happen: in the Lambda function, or from Redshift after it has been stored in a table? Or can Elasticsearch just read it from the same S3 bucket where I'm posting the data for ingestion into a Redshift table?
Elasticsearch uses JSON to perform all operations. For example, to add a document to an index, you use a PUT operation (copied from docs):
PUT twitter/_doc/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
Logstash exists to collect log messages, transform them into JSON, and make these PUT requests. However, anything that produces correctly-formatted JSON and can perform an HTTP PUT will work. If you already invoke Lambdas to transform your S3 content, then you should be able to adapt them to write JSON to Elasticsearch. I'd use separate Lambdas for Redshift and Elasticsearch, simply to improve manageability.
Performance tip: you're probably processing lots of records at a time, in which case the bulk API will be more efficient than individual PUTs. However, there is a limit on the size of a request, so you'll need to batch your input.
Also: you don't say whether you're using an AWS Elasticsearch cluster or a self-managed one. If the former, you'll also have to deal with authenticated requests, or use an IP-based access policy on the cluster. You don't say what language your Lambdas are written in, but if it's Python you can use the aws-requests-auth library to make authenticated requests.
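To make that concrete, here is a minimal sketch of a Lambda helper writing a batch of documents to an AWS Elasticsearch domain through the bulk API, using aws-requests-auth for request signing; the domain endpoint, region, index name, and document shape are placeholders:

import json
import requests
from aws_requests_auth.boto_utils import BotoAWSRequestsAuth

ES_HOST = "my-domain.us-east-1.es.amazonaws.com"  # placeholder domain endpoint

auth = BotoAWSRequestsAuth(aws_host=ES_HOST,
                           aws_region="us-east-1",   # placeholder region
                           aws_service="es")

def index_documents(documents):
    """Send a batch of documents to Elasticsearch with a single _bulk request."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": "app-logs"}}))  # placeholder index
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"  # the bulk payload must end with a newline

    response = requests.post(
        "https://{}/_bulk".format(ES_HOST),
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        auth=auth,
    )
    response.raise_for_status()
    return response.json()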