What are the differences between AWS sagemaker and sagemaker_pyspark?

I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are:
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is; if not, why have they implemented sagemaker_pyspark? Based on this assumption, I do not understand what it would offer over using scikit-learn on a SageMaker notebook (in terms of computing capabilities).
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data? I see that what it does under the hood is train the model and also create an endpoint to offer its predictive services, and I assume that this endpoint is deployed on an EC2 instance that is created and started at that moment. Correct me if I'm wrong. I assume this from how the estimator is defined:
from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Trains on a single ml.m4.xlarge instance; fit() then deploys the
# trained model to a one-instance ml.m4.xlarge endpoint
xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)
xgboost_estimator.setNumRound(1)
If so, is there a way to reuse the same endpoint with different training jobs so that I don't have to wait for a new endpoint to be created each time?
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Do you know if sagemaker_pyspark can perform hyperparameter optimization? From what I see, sagemaker offers the HyperparameterTuner class, but I can't find anything like it in sagemaker_pyspark. I suppose it is a more recent library and there is still a lot of functionality to implement.
I am a bit confused about the concepts of entry_point and container/image_name (both possible input arguments for the Estimator object from the sagemaker library): can you deploy models with and without containers? Why would you use model containers? Do you always need to define the model externally with the entry_point script? It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input. Why? Which one is better?
I see the sagemaker library offers SageMaker Pipelines, which seem to be very handy for deploying properly structured ML workflows. However, I don't think this is available with sagemaker_pyspark, so in that case, I would rather create my workflows with a combination of Step Functions (to orchestrate the entire thing), Glue processes (for ETL, preprocessing and feature/target engineering) and SageMaker processes using sagemaker_pyspark.
I also found out that sagemaker has the sagemaker.sparkml.model.SparkMLModel object. What is the difference between this and what sagemaker_pyspark offers?

sagemaker is the SageMaker Python SDK. It calls SageMaker-related AWS service APIs on your behalf. You don't need to use it, but it can make life easier
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark?
No. You can run distributed training jobs using sagemaker (see the instance_count parameter)
sagemaker_pyspark facilitates calling SageMaker-related AWS service APIs from Spark. Use it if you want to use SageMaker services from Spark
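For the first point, a minimal sketch with the plain sagemaker SDK (v2 parameter names; the image URI, role ARN, and S3 path below are placeholders):

from sagemaker.estimator import Estimator

# instance_count > 1 makes SageMaker provision a multi-instance cluster
estimator = Estimator(
    image_uri="<training-image-uri>",        # placeholder
    role="<execution-role-arn>",             # placeholder
    instance_count=4,                        # training runs across 4 instances
    instance_type="ml.m5.xlarge",
)
estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 path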
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data?
Yes, it takes a few minutes for an EC2 instance to spin up. Use Local Mode if you want to iterate more quickly locally. Note: Local Mode won't work with SageMaker built-in algorithms, but you can prototype with (non-AWS) XGBoost/scikit-learn
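For example, a sketch of Local Mode with a framework estimator (assumes Docker is available locally; the script name, role, and data path are placeholders):

from sagemaker.sklearn import SKLearn

# instance_type="local" runs the training container on this machine,
# so there is no EC2 spin-up wait; train.py is your own script
estimator = SKLearn(
    entry_point="train.py",
    framework_version="1.2-1",
    role="<execution-role-arn>",
    instance_type="local",
    instance_count=1,
)
estimator.fit({"train": "file://./data/train"})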
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Yes, but you'd probably want to extend SageMakerEstimator, which lets you provide the trainingImage URI
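As a rough sketch (untested; the ECR URIs are placeholders, and the serializer/deserializer class names are assumptions that depend on your container's request/response format, so check the sagemaker_pyspark.transformation modules for your version):

from sagemaker import get_execution_role
from sagemaker_pyspark import SageMakerEstimator, IAMRole
# Assumed class names: pick the serializer/deserializer matching your I/O format
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import ProtobufResponseRowDeserializer

estimator = SageMakerEstimator(
    trainingImage="<account>.dkr.ecr.<region>.amazonaws.com/my-algo:latest",
    modelImage="<account>.dkr.ecr.<region>.amazonaws.com/my-algo:latest",
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=ProtobufResponseRowDeserializer(),
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)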
Do you know if sagemaker_pyspark can perform hyperparameter optimization?
It does not appear so. It'd probably be easier just to do this from SageMaker itself though
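For reference, a minimal sketch of the sagemaker SDK's HyperparameterTuner (the estimator is any sagemaker estimator, and the metric name, range, and S3 URIs are placeholders):

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,                      # a sagemaker, not sagemaker_pyspark, estimator
    objective_metric_name="validation:rmse",  # placeholder metric name
    hyperparameter_ranges={"eta": ContinuousParameter(0.01, 0.3)},
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/val/"})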
can you deploy models with and without containers?
You can certainly host your own models any way you want. But if you want to use SageMaker model inference hosting, then containers are required
why would you use model containers?
Do you always need to define the model externally with the entry_point script?
The whole Docker thing makes bundling dependencies easier, and also makes things language/runtime-neutral. SageMaker doesn't care if your algorithm is in Python or Java or Fortran. But it needs to know how to "run" it, so you tell it a working directory and a command to run. This is the entry point
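To make the contrast concrete, a hedged sketch of both styles with the sagemaker SDK (the script name, role ARN, and image URI are placeholders):

from sagemaker.estimator import Estimator
from sagemaker.sklearn import SKLearn

# Framework container: AWS supplies the image, you supply the script
script_estimator = SKLearn(
    entry_point="train.py",                  # your code
    framework_version="1.2-1",
    role="<execution-role-arn>",
    instance_type="ml.m5.large",
)

# Fully custom container: the image bundles code and dependencies,
# so no entry_point is needed
container_estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-algo:latest",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.large",
)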
It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input, why? which one is better?
Please clarify which "three" you are referring to
Your point about Pipelines vs. Step Functions is not a question, so no answer required :)
What is the difference between this and what sagemaker_pyspark offers?
sagemaker_pyspark lets you call SageMaker services from Spark, whereas SparkML Serving lets you use Spark ML services from SageMaker
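For illustration, a minimal sketch of SparkMLModel hosting (assumes an MLeap-serialized Spark ML pipeline; the S3 path and role are placeholders):

from sagemaker.sparkml.model import SparkMLModel

# Hosts a Spark ML pipeline behind a SageMaker endpoint
model = SparkMLModel(
    model_data="s3://my-bucket/mleap-model.tar.gz",  # placeholder
    role="<execution-role-arn>",                     # placeholder
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")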

Related

What if I say "SageMaker Async is nothing but a task queue which uses ML-type instances"?

If we run all the inference code within a Celery task, or in another task queue like RabbitMQ, we will get the same performance (if we ignore the impact of the ML-type instance).
What are the major benefits we will get if we use SageMaker Async for a non-conventional ML model?
SageMaker Async is managed: you don't need a team/skills to choose, develop, test and maintain queuing software.
You can also configure autoscaling, to scale up and down (e.g. to reduce the fleet size to zero when the queue is empty and cut costs).
It also does the S3 interactions for you (copy to/from S3), so you don't need to write S3 code. You would need to add S3 interaction to your ML inference code if you were to develop the same thing from scratch.
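For illustration, a minimal sketch of deploying an async endpoint with the sagemaker SDK (assumes model is an already-built Model object; the S3 output path is a placeholder):

from sagemaker.async_inference import AsyncInferenceConfig

# SageMaker queues requests internally and writes each result to S3
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-results/",  # placeholder
    ),
)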

Clarification on Default SageMaker Distribution Strategy

Context: when using SageMaker distributed training, let's say that when training a network I do not provide any distribution parameter (keeping the default), but provide 2 for the instance_count value in the estimator (which could be any deep-learning-based estimator, e.g., PyTorch).
In this scenario, would there be any distributed training taking place? If so, what strategy is used by default?
NOTE: I can see that both instances' GPUs are actively used, but I am wondering what sort of distributed training takes place by default.
If you're using custom code (a custom Docker image, or custom code in a framework container), the answer is no. Unless you are writing distributed code (Horovod, PyTorch DDP, MPI...), SageMaker will not distribute things for you. It will launch the same Docker or Python code N times, once per instance. Think of the SageMaker Training API as a whiteboard that can create multiple connected and configured machines for you, but the code is still yours to write. The SageMaker Distributed Training Libraries can make distributed code much easier to write, though.
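For example, a sketch of a two-instance PyTorch job (the role and script are placeholders; without a distribution argument the script simply runs once per instance):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                  # placeholder script
    role="<execution-role-arn>",             # placeholder
    instance_count=2,                        # same code launched on 2 instances
    instance_type="ml.g4dn.12xlarge",
    framework_version="1.13",
    py_version="py39",
    # To have SageMaker launch your script under DDP (newer SDK/framework
    # versions), you would opt in explicitly, e.g.:
    # distribution={"pytorchddp": {"enabled": True}},
)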
If you're using a built-in algorithm, the answer is: it depends. Some SageMaker built-in algorithms are natively multi-machine, like SM XGBoost or SM Random Cut Forest.

Deploying an NLP model to AWS for beginners

I have the task of optimizing search on a website. The search should cover both pictures and text, driven by a text query. I have already developed, trained, tested and selected a machine learning model that transforms images and text into a feature vector (Python, based on OpenAI CLIP). This feature vector will be handed off to Elasticsearch. Elasticsearch will be configured by another specialist.
The model will be used first to determine the feature vector on all existing images and texts, and then be used whenever new content is added or existing content is changed.
There is a lot of existing content (approximately several tens of millions of pictures and texts together). About 100-500 pieces of content are added and changed per day.
I haven't worked much with AWS, but in this case the model needs to be deployed to AWS somehow. Of course, I have the model and the entire project locally, I can write an API app and make a Docker container.
The question is, what is the best method to deploy this application on AWS? The best in terms of speed and ease of implementation (for me as an AWS beginner), as well as cost optimization, taking into account the number of requests for the application.
I've seen different possibilities, from simply deploying the application on EC2 (probably the easiest option) to using SageMaker. Also Kubernetes and ECS...
I'd recommend using a SageMaker Hosting endpoint if you need to be able to run vectorization in near-real time at any time of the day, or a SageMaker Training job if you can run vectorization in batches, for example once every few hours.
For both options you can use pre-defined framework containers and the SDK, to which you pass your Python code and optionally a requirements.txt, or you can create your own image.
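As a rough sketch of the hosting route (untested; the S3 path, script name, role, and versions are placeholders), a PyTorchModel wrapping your CLIP code might look like:

from sagemaker.pytorch import PyTorchModel

# inference.py would implement model_fn/predict_fn to load CLIP and
# return the feature vector for an image or a text query
model = PyTorchModel(
    model_data="s3://my-bucket/clip-model.tar.gz",   # placeholder
    entry_point="inference.py",                      # placeholder
    framework_version="1.13",
    py_version="py39",
    role="<execution-role-arn>",                     # placeholder
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")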

Should I run a forecasting model with AWS Lambda or SageMaker?

I've been reading some articles on this topic and have preliminary thoughts about what I should do, but I still want to see if anyone with more experience running machine learning on AWS can share comments. I was doing a project for a professor at school, and we decided to use AWS. I need to find a cost-effective and efficient way to deploy a forecasting model on it.
What we want to achieve is:
read the data from the S3 bucket monthly (there will be new data coming in every month),
run a few Python files (.py) from custom-built packages and install dependencies (including the files, no more than 30 KB),
produce predicted results into a file back in S3 (JSON or CSV works), or push them to other endpoints (most likely some BI tool, e.g. Tableau) - but really this step can be flexible (not web, for sure)
My first thought was AWS SageMaker. However, we'll be using the FB Prophet model to predict the results, and we built a customized package to use in the model; therefore, I don't think the notebook instance is going to help us. (Please correct me if I'm wrong.) My understanding is that SageMaker is an environment to build and train models, but we have already built and trained the model. Plus, we won't be using AWS pre-built models anyway.
Another thing is that if we want to use a custom-built package, we will need to create a container image, and I've never done that before, so I'm not sure about the effort involved.
My 2nd option is to create multiple Lambda functions:
- one that is triggered to run the Python scripts from the S3 bucket (2-3 .py files) every time a new file is imported into the S3 bucket, which will happen monthly;
- one that is triggered after the Python scripts are done running, produces the results, and saves them into the S3 bucket.
My 3rd option would combine both:
- Use a Lambda function to trigger the run of the Python scripts in the S3 bucket when a new file comes in.
- Push the result using a SageMaker endpoint, which means we host the model on SageMaker and deploy from there.
I am still not entirely sure how to put a pre-built model and Python scripts onto a SageMaker instance and host them from there.
I'm hoping whoever has more experience with AWS service can help give me some guidance, in terms of more cost-effective and efficient way to run model.
Thank you!!
I would say it all depends on how heavy your model is / how much data you're running through it. You're right to identify that Lambda will likely be less work. It's quite easy to get a lambda up and running to do the things that you need, and Lambda has a very generous free tier. The problem is:
Lambda functions are fundamentally limited in their processing capacity (they time out after a maximum of 15 minutes).
Your model might be expensive to load.
If you have a lot of data to run through your model, you will need multiple lambdas. Multiple lambdas means you have to load your model multiple times, and that's wasted work. If you're working with "big data" this will get expensive once you get through the free tier.
If you don't have much data, Lambda will work just fine. I would eyeball it as follows: assuming your data processing step is dominated by your model step, and if all your model interactions (loading the model + evaluating all your data) take less than 15min, you're definitely fine. If they take more, you'll need to do a back-of-the-envelope calculation to figure out whether you'd leave the Lambda free tier.
Regarding Lambda: you can literally copy-paste code in to set up a prototype. If your execution takes more than 15 minutes for all your data, you'll need a method of splitting your data up between multiple Lambdas. Consider Step Functions for this.
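As a minimal sketch of that S3-triggered Lambda (the run_forecast helper is hypothetical, standing in for your own scripts):

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    record = event["Records"][0]["s3"]           # S3 ObjectCreated trigger
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    s3.download_file(bucket, key, "/tmp/input.csv")
    results = run_forecast("/tmp/input.csv")     # hypothetical: your model code
    s3.put_object(Bucket=bucket, Key="predictions/output.json", Body=results)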
SageMaker is a set of services, each of which is responsible for a different part of the machine learning process. What you might want to use is the hosted version of Jupyter notebooks in SageMaker. You get a lot of freedom in the size of the instance that you are using (CPU/GPU, memory, and disk), and you can install various packages on that instance (such as FB Prophet). If you need it once a month, you can stop and start the notebook instance between these times and "Run all" the cells in your notebooks on this instance. It will only cost you the minutes of execution.
Regarding the other alternatives: it is not trivial to run FB Prophet in Lambda due to the size limit on the libraries that you can install in Lambda (intended to avoid overly long cold starts). You can also use ECS (the container service), where you can have much larger images, but you need to know how to build a Docker image of your code and endpoint to be able to call it.

Training multiple models in AWS SageMaker

Can I train multiple models in AWS SageMaker by evaluating the models in the train.py script, and also how do I get back multiple metrics from multiple models?
Any links, docs or videos would be useful.
Yes, what you write in a SageMaker training script (assuming you use something that lets you pass custom code, like your own container or a framework container) is flexible, and does not need to be just one model, or even ML. You can definitely run multiple model trainings in a single container, and pull all related metrics using SageMaker metric capture via regex; see an example regex with the Sklearn random forest.
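As a sketch of that metric capture (assuming train.py prints lines like "model_a rmse: 1.23"; the metric names and regexes are hypothetical):

from sagemaker.sklearn import SKLearn

estimator = SKLearn(
    entry_point="train.py",                  # trains several models itself
    framework_version="1.2-1",
    role="<execution-role-arn>",             # placeholder
    instance_type="ml.m5.large",
    metric_definitions=[
        {"Name": "model_a:rmse", "Regex": "model_a rmse: ([0-9\\.]+)"},
        {"Name": "model_b:rmse", "Regex": "model_b rmse: ([0-9\\.]+)"},
    ],
)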
That being said, it is often a better idea to separate things and have one model per SageMaker job, for the following reasons among others:
- It allows you to separate model metadata and metrics, and compare them easily with the SageMaker metadata service
- It allows you to specialize hardware to each model and get better economics; each model has its own sweet spot when it comes to CPU, GPU, and RAM
- It allows you to use the exact same container for a single training but also for Bayesian hyperparameter search, a method that can be both faster and cheaper than regular grid search.