Training multiple models in AWS SageMaker - amazon-web-services

Can I train multiple models in AWS SageMaker by evaluating the models in the train.py script, and how can I get back multiple metrics from multiple models?
Any links, docs or videos would be useful.

Yes, what you write in a SageMaker training script (assuming you use something that lets you pass custom code, such as your own container or a framework container) is flexible, and does not need to be just one model, or even ML. You can definitely write multiple model trainings in a single container, and pull all related metrics using SageMaker metric capture via regex; see an example regex here with the Sklearn random forest.
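For illustration, here is a minimal sketch (not the linked example; the script name, role ARN, and metric names below are placeholders) of a single training job that captures metrics from several models trained inside one train.py, using regex-based metric definitions:
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",  # script that trains and evaluates several models
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    metric_definitions=[
        # assumes train.py prints lines like "model_a_auc: 0.91"
        {"Name": "model_a:auc", "Regex": r"model_a_auc: ([0-9\.]+)"},
        {"Name": "model_b:auc", "Regex": r"model_b_auc: ([0-9\.]+)"},
    ],
)
# estimator.fit({"train": "s3://my-bucket/train/"})
Metrics captured this way show up per training job in the SageMaker console and metadata APIs.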
That being said, it is often a better idea to separate things and have one model per SageMaker job, for the following reasons among others:
- It allows you to separate model metadata and metrics and compare them easily with the SageMaker metadata service.
- It allows you to specialize hardware to each model and get better economics; each model has its own sweet spot when it comes to CPU, GPU and RAM.
- It allows you to use the exact same container for a single training but also for Bayesian hyperparameter search, a method that can be both faster and cheaper than regular grid search.

Related

Scaling endpoint that uses ML models in AWS

I have an API (Python + FastAPI) with different endpoints. One of these endpoints manages images and applies different models (up to 4) one after another. Furthermore, there can be preprocessing and postprocessing steps between these models, which are applied in series.
What is the best approach to scaling this kind of endpoint on AWS? I have in mind pages such as Hugging Face, where you can try models and they run fast even when they get many calls at the same time, or, for example, pages like https://interiorai.com.
Right now I have an instance inside an autoscaling group which adds images to a queue using Celery + RabbitMQ, and then processes images one by one, since Celery can only work with a maximum of 1 worker if you need to use the GPU.
Should I look at real-time inference instances? Are there any better approaches? Do you have a good example of how to start with this?
Your best bet is SageMaker real-time inference; with this you can also enable Auto Scaling. You can also utilize SageMaker serial inference pipelines to stitch together multiple models (in containers) behind just one endpoint.
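As a rough sketch (the image URIs, model artifacts, and role ARN below are placeholders), a serial inference pipeline can be built with the SageMaker Python SDK's PipelineModel, which chains containers behind a single real-time endpoint:
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

preprocess = Model(image_uri="<preprocessing-image-uri>",
                   model_data="s3://my-bucket/preprocess.tar.gz", role=role)
detector = Model(image_uri="<model-image-uri>",
                 model_data="s3://my-bucket/model.tar.gz", role=role)

pipeline = PipelineModel(name="image-pipeline", role=role,
                         models=[preprocess, detector])
predictor = pipeline.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
Auto scaling is then configured on the endpoint's production variant via Application Auto Scaling.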

Best way to run 1000s of training jobs on sagemaker

I have thousands of training jobs that I want to run on SageMaker. Basically I have a list of hyperparameters and I want to train the model for all of those hyperparameters in parallel (not a standard hyperparameter tuning where we just want to optimize the hyperparameters; here we want to train for all of them). I have searched the docs quite extensively, but it surprises me that I couldn't find any info about this, even though it seems like pretty basic functionality.
For example, let's say I have 10,000 training jobs, and my quota is 20 instances, what is the best way to run these jobs utilizing all my available instances? In particular,
Is there a "queue manager" functionality that takes the list of hyperparameters and runs the training jobs in batches of 20 until they are all done (even better if it could keep track of failed/completed jobs).
Is it best practice to run a single training job per instance? If that's the case do I need to ask for a much higher quota on the number of instance?
If this functionality does not exist in sagemaker, is it worth using EC2 instead since it's a bit cheaper?
Your question is very broad and the best way forward would depend on other details of your use-case, so we will have to make some assumptions.
[Queue manager]
SageMaker does not have a queue manager. If at the end you decide you need a queue manager, I would suggest looking towards AWS Batch.
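For example, a hedged sketch of what that could look like with AWS Batch (the job queue and job definition names are assumptions, created beforehand): submit one Batch job per hyperparameter set and let Batch queue and schedule them against your available capacity.
import boto3

batch = boto3.client("batch")

hyperparameter_sets = [{"lr": 0.01}, {"lr": 0.1}]  # illustrative

for i, hp in enumerate(hyperparameter_sets):
    batch.submit_job(
        jobName=f"train-{i}",
        jobQueue="my-training-queue",          # assumed, pre-created job queue
        jobDefinition="my-training-job-def",   # assumed, pre-created job definition
        containerOverrides={
            "environment": [{"name": "LR", "value": str(hp["lr"])}]
        },
    )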
[Single vs multiple training jobs]
Since you need to run tens of thousands of jobs, I assume you are training fairly lightweight models, so to save on time you would be better off reusing instances for multiple training jobs. (Otherwise, with a 20-instance limit, you need 500 rounds of training; at roughly 3 minutes of start-up time per round, depending on instance type, that is about 25 hours spent just waiting. Depending on the complexity of each individual model, these 25 hours might be significant or totally acceptable.)
[Instance limit increase]
You can always ask for a limit increase, but going from a limit of 20 to 10k at once is unlikely to be accepted by the AWS support team, unless you are part of an organisation with a track record of usage on AWS, in which case this might be fine.
[One possible option] (Assuming multiple lightweight models)
You could create a single training job with the instance count set to the number of instances available to you.
Inside the training job, your code can run a for loop and perform all the individual training jobs you need.
In this case, you will need to know which instance is which so you can split the hyperparameter sets across them. SageMaker writes this information to the file /opt/ml/input/config/resourceconfig.json, so using that you can easily have each instance run a subset of the trainings required.
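A minimal sketch of that sharding logic, run inside each training instance (the hyperparameter list here is a stand-in for your real one):
import json

with open("/opt/ml/input/config/resourceconfig.json") as f:
    resource_config = json.load(f)

hosts = sorted(resource_config["hosts"])              # e.g. ["algo-1", "algo-2", ...]
rank = hosts.index(resource_config["current_host"])   # this instance's position

# stand-in for your full list of 10,000 hyperparameter sets
all_hyperparameter_sets = [{"max_depth": d} for d in range(2, 12)]
my_sets = all_hyperparameter_sets[rank::len(hosts)]   # this instance's share

for hp in my_sets:
    # train one lightweight model per hyperparameter set here
    ...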
Another thing to think about is whether you need to save the generated models (which you probably do). You can save everything in the output model directory (the standard SageMaker approach), but this would zip all models into a single model.tar.gz file.
If you don't want this and prefer to have each model saved individually, I'd suggest using the checkpoints directory, which will sync anything written there to your S3 location.
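A short sketch of that idea (assuming scikit-learn-style models and that the estimator was created with a checkpoint_s3_uri): write one file per model under the checkpoint directory, and SageMaker syncs each file to S3 as it appears.
import os
import joblib

CHECKPOINT_DIR = "/opt/ml/checkpoints"
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_model(model, job_index):
    # each model lands in S3 as its own object rather than inside model.tar.gz
    joblib.dump(model, os.path.join(CHECKPOINT_DIR, f"model_{job_index}.joblib"))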

What are the differences between AWS sagemaker and sagemaker_pyspark?

I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are:
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark? Based on this assumption, I do not understand what it would offer regarding using scikit-learn on a SageMaker notebook (in terms of computing capabilities).
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data? I see that what it does underneath is train the model and also create an endpoint to be able to offer its predictive services, and I assume that this endpoint is deployed on an EC2 instance that is created and started at that moment. Correct me if I'm wrong. I assume this from how the estimator is defined:
from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)
xgboost_estimator.setNumRound(1)
If so, is there a way to reuse the same endpoint with different training jobs so that I don't have to wait for a new endpoint to be created each time?
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Do you know if sagemaker_pyspark can perform hyperparameter optimization? From what I see, sagemaker offers the HyperparameterTuner class, but I can't find anything like it in sagemaker_pyspark. I suppose it is a more recent library and there is still a lot of functionality to implement.
I am a bit confused about the concept of entry_point and container/image_name (both possible input arguments for the Estimator object from the sagemaker library): can you deploy models with and without containers? why would you use model containers? Do you always need to define the model externally with the entry_point script? It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input, why? which one is better?
I see the sagemaker library offers SageMaker Pipelines, which seem to be very handy for deploying properly structured ML workflows. However, I don't think this is available with sagemaker_pyspark, so in that case, I would rather create my workflows with a combination of Step Functions (to orchestrate the entire thing), Glue processes (for ETL, preprocessing and feature/target engineering) and SageMaker processes using sagemaker_pyspark.
I also found out that sagemaker has the sagemaker.sparkml.model.SparkMLModel object. What is the difference between this and what sagemaker_pyspark offers?
sagemaker is the SageMaker Python SDK. It calls SageMaker-related AWS service APIs on your behalf. You don't need to use it, but it can make life easier.
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark?
No. You can run distributed training jobs using sagemaker (see instance_count parameter)
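For instance, a brief sketch (the image URI and role are placeholders) of requesting a multi-instance training job with the plain sagemaker SDK:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image-uri>",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=4,          # four instances for a single (distributed) training job
    instance_type="ml.m5.xlarge",
)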
sagemaker_pyspark facilitates calling SageMaker-related AWS service APIs from Spark. Use it if you want to use SageMaker services from Spark
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data?
Yes, it takes a few minutes for an EC2 instance to spin up. Use Local Mode if you want to iterate more quickly locally. Note: Local Mode won't work with SageMaker built-in algorithms, but you can prototype with (non-AWS) XGBoost/scikit-learn.
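A minimal Local Mode sketch, assuming a framework estimator and a local train.py and data folder (both placeholders): the API is the same, but training runs in a local Docker container instead of waiting for EC2.
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",   # your script, assumed to exist locally
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="local",    # or "local_gpu" on a GPU machine
)
estimator.fit({"train": "file://./data/train"})  # local files instead of S3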
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Yes, but you'd probably want to extend SageMakerEstimator. Here you can provide the trainingImage URI
Do you know if sagemaker_pyspark can perform hyperparameter optimization?
It does not appear so. It'd probably be easier just to do this from SageMaker itself though
can you deploy models with and without containers?
You can certainly host your own models any way you want. But if you want to use SageMaker model inference hosting, then containers are required
why would you use model containers?
Do you always need to define the model externally with the entry_point script?
The whole Docker thing makes bundling dependencies easier, and also makes things language/runtime-neutral. SageMaker doesn't care if your algorithm is in Python or Java or Fortran. But it needs to know how to "run" it, so you tell it a working directory and a command to run. This is the entry point
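To make the contrast concrete, here is a hedged sketch of the two styles (image URI, script name, and role are placeholders): a bring-your-own container whose image already knows how to train, versus a framework container that is told what to run via entry_point.
from sagemaker.estimator import Estimator
from sagemaker.sklearn.estimator import SKLearn

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder

# Bring-your-own container: the image itself defines the training command.
byoc = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-algo:latest",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Framework container: AWS supplies the runtime image; entry_point is the script it should run.
framework = SKLearn(
    entry_point="train.py",
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
)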
It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input, why? which one is better?
Please clarify which "three" you are referring to
6 is not a question, so no answer required :)
What is the difference between this and what sagemaker_pyspark offers?
sagemaker_pyspark lets you call SageMaker services from Spark, whereas SparkML Serving lets you use Spark ML services from SageMaker

How to increase performance in AWS Comprehend on custom classification

I trained a custom classifier with simply two tags in a CSV.
I have fed my custom classification model with 1,000 texts for each tag,
but when I run a job on my custom classification model, the job takes ~5 min (in the running state) to analyze one new text. I searched for this issue in AWS, but I didn't find any answer...
How can I speed up / optimize my job for analyzing new text with the model?
Thank you in advance
Prior to Nov 2019, Comprehend only supported asynchronous inference for Custom classification. Asynchronous inference is optimized for bulk processing.
Comprehend has since launched real-time inference for Custom classification to satisfy the real-time needs of our customers.
https://docs.aws.amazon.com/comprehend/latest/dg/custom-sync.html
Note that Custom endpoints are charged by time units even when you're not actively using them. You can also look at the pricing document for details - https://aws.amazon.com/comprehend/pricing/
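A hedged sketch of calling such a real-time endpoint with boto3 once it has been created (the endpoint ARN below is a placeholder):
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.classify_document(
    Text="Text to classify in real time",
    EndpointArn="arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/my-endpoint",  # placeholder
)
print(response["Classes"])  # label/score pairs from the custom classifier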

AWS SageMaker - using cross-validation instead of a dedicated validation set?

When I train my model locally I use a 20% test set and then cross-validation. SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed). Currently I have 20% test and 10% validation, leaving 70% to train, so I lose 10% of my training data compared to when I train locally, and there is some performance loss as a result of this.
I could just take my locally trained models and overwrite the SageMaker models stored in S3, but that seems like a bit of a workaround. Is there a way to use SageMaker without having to have a dedicated validation set?
Thanks
SageMaker seems to allow a single training set, while in cross-validation you iterate between, for example, 5 different training sets, each one validated on a different hold-out set. So it seems that the SageMaker training service is not well suited for cross-validation. Of course, cross-validation is usually useful with small (to be accurate, low-variance) data, so in those cases you can set the training infrastructure to local (so it doesn't take a lot of time) and then iterate manually to achieve cross-validation functionality. But it's not something out of the box.
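A rough sketch of that manual iteration (the dataset path, script, and role are placeholders): split the data with scikit-learn, then launch one Local Mode training per fold.
import numpy as np
from sklearn.model_selection import KFold
from sagemaker.sklearn.estimator import SKLearn

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder
data = np.load("data.npy")                             # assumed local dataset

for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True).split(data)):
    # write data[train_idx] / data[val_idx] out to fold-specific folders here
    estimator = SKLearn(
        entry_point="train.py",
        framework_version="1.2-1",
        role=role,
        instance_type="local",  # Local Mode keeps each fold fast
    )
    estimator.fit({"train": f"file://./folds/{fold}/train",
                   "validation": f"file://./folds/{fold}/validation"})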
Sorry, can you please elaborate which tutorials you are referring to when you say "SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed)"?
SageMaker training exposes the ability to separate datasets into "channels" so you can separate your dataset in whichever way you please.
See here for more info: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata