Scaling an endpoint that uses ML models in AWS

I have an API (Python + FastAPI) with different endpoints. One of these endpoints handles images and applies different models (up to 4) one after another. Furthermore, there can be preprocessing and postprocessing steps between these models, which are applied in series.
What is the best approach to scale this kind of endpoint using AWS? I have in mind sites such as Hugging Face, where you can try models and they run fast even when they receive many calls at the same time, or pages like https://interiorai.com.
Right now I have an instance inside an Auto Scaling group, which adds images to a queue using Celery + RabbitMQ and then processes them one by one, since Celery can only work with at most one worker if you need to use the GPU.
Should I look at real-time instance inference? Are there better approaches? Do you have a good example of how to start with this?

Your best bet is SageMaker real-time inference, which also lets you enable autoscaling. You can also use SageMaker serial inference pipelines to stitch together multiple models (in containers) behind a single endpoint.
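For illustration, here is a minimal sketch of that setup with the SageMaker Python SDK and Application Auto Scaling, assuming each stage (preprocessing, the models, postprocessing) is already packaged as a SageMaker Model. All image URIs, S3 paths, the role ARN, names and instance types below are placeholders, not a tested configuration.

import boto3
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

# One Model per stage of the serial pipeline; image URIs and artifacts are placeholders.
stages = [
    Model(image_uri="<ecr-image-preprocess>", model_data="s3://my-bucket/preprocess.tar.gz", role=role),
    Model(image_uri="<ecr-image-model-a>", model_data="s3://my-bucket/model_a.tar.gz", role=role),
    Model(image_uri="<ecr-image-model-b>", model_data="s3://my-bucket/model_b.tar.gz", role=role),
    Model(image_uri="<ecr-image-postprocess>", model_data="s3://my-bucket/postprocess.tar.gz", role=role),
]

# Deploy all containers behind a single real-time endpoint.
pipeline = PipelineModel(name="image-pipeline", role=role, models=stages)
pipeline.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # GPU instance; pick per your models
    endpoint_name="image-pipeline-endpoint",
)

# Autoscaling is configured on the endpoint's production variant
# through Application Auto Scaling.
autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/image-pipeline-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # invocations per instance; tune for your latency budget
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)

With a serial inference pipeline, the output of each container is passed as the input to the next, in the order given in models.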

Related

Deploying NLP model to AWS for beginners

I have the task of optimizing search on the website. The search should work for pictures and for text, driven by a text query. I have already developed, trained, tested and selected a machine learning model that transforms images and text into a feature vector (Python, based on OpenAI CLIP). This feature vector will be passed to Elastic Search, which will be configured by another specialist.
The model will first be used to compute the feature vector for all existing images and texts, and then whenever new content is added or existing content is changed.
There is a lot of existing content (several tens of millions of pictures and texts combined). About 100-500 pieces of content are added or changed per day.
I haven't worked much with AWS, but in this case the model needs to be deployed to AWS somehow. I have the model and the entire project locally, and I can write an API app and build a Docker container.
The question is, what is the best method to deploy this application on AWS? The best in terms of speed and ease of implementation (for me as an AWS beginner), as well as cost optimization, taking into account the number of requests the application will receive.
I've seen different possibilities, from simply deploying the application on EC2 (probably the easiest option) to using SageMaker. Also Kubernetes and ECS...
I'd recommend using a SageMaker Hosting endpoint if you need to be able to run vectorization in near-real time at any time of the day, or a SageMaker Training job if you can run vectorization in batches, for example once every few hours.
For both you can use the pre-built framework containers and the SDK, to which you pass your Python code and optionally a requirements.txt, or you can create your own image.
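For illustration, a minimal sketch of the framework-container route with the SageMaker Python SDK, assuming the CLIP-based vectorizer is packaged as a model.tar.gz with an inference.py handler; the bucket, role ARN, paths, versions and instance type are placeholders, not a tested configuration.

from sagemaker.pytorch import PyTorchModel

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

model = PyTorchModel(
    model_data="s3://my-bucket/clip/model.tar.gz",  # your packaged weights
    role=role,
    entry_point="inference.py",      # your model_fn/predict_fn handlers
    source_dir="code",               # may also contain requirements.txt
    framework_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # or a CPU instance if latency allows
    endpoint_name="clip-vectorizer",
)

# predictor.predict(...) would then return the feature vector for an image or text.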

What are the differences between AWS sagemaker and sagemaker_pyspark?

I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are:
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark? Based on this assumption, I do not understand what it would offer regarding using scikit-learn on a SageMaker notebook (in terms of computing capabilities).
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data? I see that what it does behind the scenes is train the model and also create an endpoint to offer its predictive services, and I assume that this endpoint is deployed on an EC2 instance that is created and started at that moment. Correct me if I'm wrong. I assume this from how the estimator is defined:
from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)
xgboost_estimator.setNumRound(1)
If so, is there a way to reuse the same endpoint with different training jobs so that I don't have to wait for a new endpoint to be created each time?
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Do you know if sagemaker_pyspark can perform hyperparameter optimization? From what I see, sagemaker offers the HyperparameterTuner class, but I can't find anything like it in sagemaker_pyspark. I suppose it is a more recent library and there is still a lot of functionality to implement.
I am a bit confused about the concept of entry_point and container/image_name (both possible input arguments for the Estimator object from the sagemaker library): can you deploy models with and without containers? why would you use model containers? Do you always need to define the model externally with the entry_point script? It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input, why? which one is better?
I see the sagemaker library offers SageMaker Pipelines, which seem to be very handy for deploying properly structured ML workflows. However, I don't think this is available with sagemaker_pyspark, so in that case, I would rather create my workflows with a combination of Step Functions (to orchestrate the entire thing), Glue processes (for ETL, preprocessing and feature/target engineering) and SageMaker processes using sagemaker_pyspark.
I also found out that sagemaker has the sagemaker.sparkml.model.SparkMLModel object. What is the difference between this and what sagemaker_pyspark offers?
sagemaker is the SageMaker Python SDK. It calls SageMaker-related AWS service APIs on your behalf. You don't need to use it, but it can make life easier
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark?
No. You can run distributed training jobs using sagemaker (see the instance_count parameter).
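A hedged sketch of what that looks like with a framework estimator; the script name, role ARN, S3 path and versions are placeholders.

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",   # your training script (placeholder)
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_count=2,         # > 1 launches a distributed training cluster
    instance_type="ml.m5.xlarge",
    framework_version="2.1",
    py_version="py310",
)
estimator.fit({"train": "s3://my-bucket/train/"})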
sagemaker_pyspark facilitates calling SageMaker-related AWS service APIs from Spark. Use it if you want to use SageMaker services from Spark
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data?
Yes, it takes a few minutes for an EC2 instance to spin up. Use Local Mode if you want to iterate more quickly on your own machine. Note: Local Mode won't work with SageMaker built-in algorithms, but you can prototype with (non-AWS) XGBoost/scikit-learn.
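A hedged sketch of Local Mode with a framework container (scikit-learn here); the script, role ARN, data path and versions are placeholders. The only change versus a remote job is instance_type="local", which runs the container on your machine instead of spinning up EC2.

from sagemaker.sklearn import SKLearn

estimator = SKLearn(
    entry_point="train.py",   # placeholder script
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_type="local",    # runs in a local Docker container, no EC2 spin-up
    framework_version="1.2-1",
    py_version="py3",
)
estimator.fit({"train": "file://./data/train"})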
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Yes, but you'd probably want to extend SageMakerEstimator, which is where you can provide the trainingImage URI.
Do you know if sagemaker_pyspark can perform hyperparameter optimization?
It does not appear so. It'd probably be easier just to do this from SageMaker itself though
can you deploy models with and without containers?
You can certainly host your own models any way you want. But if you want to use SageMaker model inference hosting, then containers are required
why would you use model containers?
Do you always need to define the model externally with the entry_point script?
The whole Docker approach makes bundling dependencies easier and also keeps things language/runtime-neutral. SageMaker doesn't care whether your algorithm is in Python, Java or Fortran, but it needs to know how to "run" it, so you tell it a working directory and a command to run. This is the entry point.
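To make the distinction concrete, a hedged sketch of the two styles side by side; the image URI, role ARN, script name and versions are placeholders.

from sagemaker.estimator import Estimator
from sagemaker.pytorch import PyTorch

role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

# Style 1: bring your own container; the image itself knows how to train.
byoc = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Style 2: use a pre-built framework container and inject your script via entry_point.
framework = PyTorch(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.1",
    py_version="py310",
)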
It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input, why? which one is better?
Please clarify which "three" you are referring to
6 is not a question, so no answer required :)
What is the difference between this and what sagemaker_pyspark offers?
sagemaker_pyspark lets you call SageMaker services from Spark, whereas SparkML Serving lets you use Spark ML services from SageMaker

Training multiple model in AWS Sagemaker

Can I train multiple models in AWS SageMaker by evaluating the models in the train.py script, and how do I get back multiple metrics from multiple models?
Any links, docs or videos would be useful.
Yes. What you write in a SageMaker training script (assuming you use something that lets you pass custom code, such as your own container or a framework container) is flexible; it does not need to be just one model, or even ML at all. You can definitely write multiple model trainings in a single container and pull all the related metrics using SageMaker metric capture via regex (see, for example, the regex used with the Sklearn random forest).
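As a hedged sketch of that regex-based metric capture (the metric names, regexes, script, role ARN and paths below are made up for illustration):

from sagemaker.sklearn import SKLearn

# Each dict maps a metric name to a regex applied to the job's log output;
# your train.py just has to print lines these patterns can match.
metric_definitions = [
    {"Name": "model_a:accuracy", "Regex": "model_a accuracy: ([0-9.]+)"},
    {"Name": "model_b:accuracy", "Regex": "model_b accuracy: ([0-9.]+)"},
]

estimator = SKLearn(
    entry_point="train.py",   # trains and evaluates both models
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_type="ml.m5.xlarge",
    framework_version="1.2-1",
    py_version="py3",
    metric_definitions=metric_definitions,
)
estimator.fit({"train": "s3://my-bucket/train/"})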
That being said, it is often a better idea to separate things and have one model per SageMaker job, for the following reasons among others:
It allows you to separate model metadata and metrics and compare them easily with the SageMaker metadata service.
It allows you to tailor the hardware to each model and get better economics; each model has its own sweet spot when it comes to CPU, GPU and RAM.
It allows you to use the exact same container for a single training run and also for Bayesian hyperparameter search, a method that can be both faster and cheaper than regular grid search.

AWS Sagemaker init 1K+ models "endpoints"?

Under the assumption that the model training itself is very fast, I'm wondering what the best practice is to spin up ~1K+ model endpoints as fast as possible.
Thanks for any hint
Christian
Assuming these are different models (not production variants for testing), you'll need one endpoint per model and thus one SageMaker instance each. That's probably not the greatest option (cost, time to spin up instances, synchronous calls, API throttling, etc.). For now, I'd use another service to deploy, e.g. an ECS cluster.
Could you please tell me a little more about your use case (business problem, framework, model size, etc)? You're not the first one to ask about this capability and your feedback would be very valuable in building the best solution.
Julien (AWS)

AWS celery and database

I am writing a webapp that runs on AWS. My app requires users to upload their PDF files. I will convert them into images using the "convert" utility on Linux.
Here is my setup on Ubuntu 12.04:
Django
Celery
Django Celery
Boto
I am using apache as my webserver.
The work flow is as follows:
There are three asynchronous tasks and two queues handling all the processing, plus S3 for storing input and output files.
A user uploads a pdf then:
accept_file_task is called: this task takes the user-uploaded PDF, stores it in S3 and then inserts a message into the input_queue (SQS).
check_queue_and_launch_instance_task: a periodic task that monitors the number of messages in the input_queue and launches instances whenever the queue has more messages than the number of EC2 instances.
The instances run a bootstrap script that is a while True: loop (see the sketch after this list). Any of the instances can pick a message from the input_queue, run subprocess.Popen("convert " + input + " " + output), write the processed status to the output_queue, and upload the generated image to the S3 output bucket so it is available as a download link.
output_process_task: another periodic task that polls the output_queue and, whenever a message is available, updates the status in the table mentioned below.
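For reference, a rough sketch of what that bootstrap while True: loop could look like with boto3; the queue names, bucket and file paths are placeholders, and error handling is omitted.

import json
import subprocess
import boto3

sqs = boto3.resource("sqs")
s3 = boto3.client("s3")

input_queue = sqs.get_queue_by_name(QueueName="input_queue")    # placeholder names
output_queue = sqs.get_queue_by_name(QueueName="output_queue")

while True:
    for message in input_queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=1):
        job = json.loads(message.body)          # e.g. {"bucket": ..., "key": ...}
        pdf_path = "/tmp/input.pdf"
        img_path = "/tmp/output.png"

        s3.download_file(job["bucket"], job["key"], pdf_path)
        subprocess.check_call(["convert", pdf_path, img_path])  # ImageMagick convert
        s3.upload_file(img_path, "output-bucket", job["key"] + ".png")

        output_queue.send_message(MessageBody=json.dumps({"key": job["key"], "status": "done"}))
        message.delete()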
I am using a model called Document to store all the status information. I also have users registering and hence a table to store all user information. Also Celery created a lot of tables to store all its task information. Right now I am using a single instance and the sqlite3 database (that comes with python) on that instance.
I am unsure about the following things
How do I scale up the database? Should I go for RDS, SimpleDB or AmazonDB? If it weren't for Celery, I could have easily used SimpleDB. I am really stuck on this one.
How do I get rid of the two periodic tasks, check_queue_and_launch_instance_task and output_process_task? My idea is that Auto Scaling should be used in some way, so that if needed at a later stage an Elastic Load Balancer can be added.
If any of you have designed something similar, please help me figure out how to go about it.
How do I scale up the database? Should I go for RDS, SimpleDB or AmazonDB? If it weren't for Celery, I could have easily used SimpleDB. I am really stuck on this one.
Keep in mind that premature optimization is the root of all evil. The question of RDS (which is really just MySQL, Oracle, or MS SQL) vs. SimpleDB is more of an application design decision than one based on scalability. SimpleDB is just a simple key-value store. RDS, on the other hand, will give you full ACID functionality. If your data is relational, then you should be using a relational database. If you just need a place to store simple strings or integers, then something like SimpleDB would make more sense.
Right now I am using a single instance and the sqlite3 database (that comes with python) on that instance.
Make sure that you understand the consequences of a) creating a single point-of-failure in your design and b) SQLite's limitations compared to using a standalone RDBMS in this application. (You can use it, but it's really intended for single-user applications).
How do I get rid of the two periodic tasks, check_queue_and_launch_instance_task and output_process_task? My idea is that Auto Scaling should be used in some way, so that if needed at a later stage an Elastic Load Balancer can be added.
If you're willing to replace Celery with SQS, you can tie together SQS + SNS + CloudWatch to simplify this part of your app. That said, what you're doing doesn't sound like a bad choice, especially if it's already working well. Your time is probably better spent on the problems in front of you rather than on those that might occur down the road.
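As a rough illustration of the CloudWatch side of that idea, a hedged sketch that alarms on the input queue's backlog and triggers an Auto Scaling policy; the queue name and the scaling policy ARN are placeholders you would create separately.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Scale out when the input queue backs up; the alarm action points at an
# Auto Scaling scaling policy ARN (placeholder below).
cloudwatch.put_metric_alarm(
    AlarmName="input-queue-backlog",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "input_queue"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:placeholder"],
)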