Clarification on Default SageMaker Distribution Strategy - amazon-web-services

Context: When using SageMaker distributed training: Let’s say when training a network I do not provide any distribution parameter (keep it to default), but provide 2 instances for the instance_count value in the estimator (could be any deep learning based estimator, e.g., PyTorch).
In this scenario would there be any distributed training taking place? If so, what strategy is used by default?
NOTE: I could see both instances’ GPUs are actively used but wondering what sort of distributed training take place by default ?

If you're using custom code (custom Docker, custom code in Framework container) The answer is NO. Unless you are writing distributed code (Horovod, PyTorch DDP, MPI...), SageMaker will not distribute things for you. It will launch the same Docker or Python code N times, once per instance. Consider SageMaker Training API like a whiteboard, that can create multiple connected and configured machines for you. But the code is still yours to write. SageMaker Distributed Training Libraries can make distributed code much easier to write though.
If you're using a built-in algorithm, the answer is it depends. Some SageMaker built-in algorithms natively are multi-machine, like SM XGBoost or SM Random Cut Forest.

Related

Can SageMaker distributed training be used for training non-deep learning models?

I am following this documentation page to understand SageMaker's distributed training feature.
It says here that:- 
The SageMaker distributed training libraries are available only through the AWS deep learning containers for the TensorFlow, PyTorch, and HuggingFace frameworks within the SageMaker training platform. 
Does this mean that we cannot use SageMaker distributed training to train machine learning models with traditional machine learning algorithms such as linear regression, random forest or XGBoost?
I have a use cases where the data set is very large and distributed training can help with model parallelism and data parallelism. What other options can be recommended to avoid bringing in large amounts of data in memory on a training instance?
SageMaker training offers various tabular built-in algorithms, the KNN, XGBoost and linear learner, and factorization machines algorithms are parallelizable (can run on multiple devices), and supports data streaming (no limit of data size).
Beyong built-in algorithms, SageMaker also supports bringing your own training script.
I guess the parallel implementation of Sagemaker Xgboost algorithm can be called a "data-parallel" approach since the model is copied over to multiple instances and data is distributed across those instances ( when used with distribution = "ShardedByS3Key" in sagemaker.TrainingInput ) .
The "model parallel" approaches are probably more applicable to Neural Networks . It uses smdistributed.modelparallel.torch as smp to create a decorator #smp.step that shall inform the model about how to split the layers of the Neural Network across instances.

What are the differences between AWS sagemaker and sagemaker_pyspark?

I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are:
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark? Based on this assumption, I do not understand what it would offer regarding using scikit-learn on a SageMaker notebook (in terms of computing capabilities).
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data? I see that what it does below is to train the model and also create an Endpoint to be able to offer its predictive services, and I assume that this endpoint is deployed on an EC2 instance that is created and started at the moment. Correct me if I'm wrong. I assume this from how the estimator is defined:
from sagemaker import get_execution_role
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator
xgboost_estimator = XGBoostSageMakerEstimator (
trainingInstanceType = "ml.m4.xlarge",
trainingInstanceCount = 1,
endpointInstanceType = "ml.m4.xlarge",
endpointInitialInstanceCount = 1,
sagemakerRole = IAMRole(get_execution_role())
)
xgboost_estimator.setNumRound(1)
If so, is there a way to reuse the same endpoint with different training jobs so that I don't have to wait for a new endpoint to be created each time?
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Do you know if sagemaker_pyspark can perform hyperparameter optimization? From what I see, sagemaker offers the HyperparameterTuner class, but I can't find anything like it in sagemaker_pyspark. I suppose it is a more recent library and there is still a lot of functionality to implement.
I am a bit confused about the concept of entry_point and container/image_name (both possible input arguments for the Estimator object from the sagemaker library): can you deploy models with and without containers? why would you use model containers? Do you always need to define the model externally with the entry_point script? It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input, why? which one is better?
I see the sagemaker library offers SageMaker Pipelines, which seem to be very handy for deploying properly structured ML workflows. However, I don't think this is available with sagemaker_pyspark, so in that case, I would rather create my workflows with a combination of Step Functions (to orchestrate the entire thing), Glue processes (for ETL, preprocessing and feature/target engineering) and SageMaker processes using sagemaker_pyspark.
I also found out that sagemaker has the sagemaker.sparkml.model.SparkMLModel object. What is the difference between this and what sagemaker_pyspark offers?
sagemaker is the SageMaker Python SDK. It calls SageMaker-related AWS service APIs on your behalf. You don't need to use it, but it can make life easier
Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark?
No. You can run distributed training jobs using sagemaker (see instance_count parameter)
sagemaker_pyspark facilitates calling SageMaker-related AWS service APIs from Spark. Use it if you want to use SageMaker services from Spark
Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data?
Yes, it takes a few minutes for an EC2 instance to spin-up. Use Local Mode if you want to iterate more quickly locally. Note: Local Mode won't work with SageMaker built-in algorithms, but you can prototype with (non AWS) XGBoost/SciKit-Learn
Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Yes, but you'd probably want to extend SageMakerEstimator. Here you can provide the trainingImage URI
Do you know if sagemaker_pyspark can perform hyperparameter optimization?
It does not appear so. It'd probably be easier just to do this from SageMaker itself though
can you deploy models with and without containers?
You can certainly host your own models any way you want. But if you want to use SageMaker model inference hosting, then containers are required
why would you use model containers?
Do you always need to define the model externally with the entry_point script?
The whole Docker thing makes bundling dependencies easier, and also makes things language/runtime-neutral. SageMaker doesn't care if your algorithm is in Python or Java or Fortran. But it needs to know how to "run" it, so you tell it a working directory and a command to run. This is the entry point
It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input, why? which one is better?
Please clarify which "three" you are referring to
6 is not a question, so no answer required :)
What is the difference between this and what sagemaker_pyspark offers?
sagemaker_pyspark lets you call SageMaker services from Spark, whereas SparkML Serving lets you use Spark ML services from SageMaker

AWS SageMaker Very large Dataset

I have a csv file of 500GB and a mysql database of 1.5 TB of data and I want to run aws sagemaker classification and regression algorithm and random forest on it.
Can aws sagemaker support it? can model be read and trained in batches or chunks? any example for it
Amazon SageMaker is designed for such scales and it is possible to use it to train on very large datasets. To take advantage of the scalability of the service you should consider a few modifications to your current practices, mainly around distributed training.
If you want to use distributed training to allow much faster training (“100 hours of a single instance cost exactly the same as 1 hour of 100 instances, just 100 times faster”), more scalable (“if you have 10 times more data, you just add 10 times more instances and everything just works”) and more reliable, as each instance is only handling a small part of the datasets or the model, and doesn’t go out of disk or memory space.
It is not obvious how to implement the ML algorithm in a distributed way that is still efficient and accurate. Amazon SageMaker has modern implementations of classic ML algorithms such as Linear Learner, K-means, PCA, XGBoost etc. that are supporting distributed training, that can scale to such dataset sizes. From some benchmarking these implementations can be 10 times faster compared to other distributed training implementations such as Spark MLLib. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/notebooks/video-game-sales-xgboost.ipynb
The other aspect of the scale is the data file(s). The data shouldn’t be in a single file as it limits the ability to distribute the data across the cluster that you are using for your distributed training. With SageMaker you can decide how to use the data files from Amazon S3. It can be in a fully replicated mode, where all the data is copied to all the workers, but it can also be sharded by key, that distributed the data across the workers, and can speed up the training even further. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/data_distribution_types
Amazon Sagemaker is built to help you scale your training activities. With large datasets, you might consider two main aspects:
The way data are stored and accessed,
The actual training parallelism.
Data storage: S3 is the most cost-effective way to store your data for training. To get faster startup and training times, you can consider the followings:
If your data is are already stored on Amazon S3, you might want first to consider leveraging the Pipe mode with built-in algorithms or bringing your own. But Pipe mode is not suitable all the time, for example, if your algorithm needs to backtrack or skip ahead within an epoch (the underlying FIFO cannot support lseek() operations) or if it is not easy to parse your training dataset from a streaming source.
In those cases, you may want to leverage Amazon FSx for Lustre and Amazon EFS file systems. If your training data is already in an Amazon EFS, I recommend using it as a data source; otherwise, choose Amazon FSx for Lustre.
Training Parallelism: With large datasets, it is likely you'll want to train on different GPUs. In that case, consider the followings:
If your training is already Horovod ready, you can do it with Amazon SageMaker (notebook).
In December, AWS has released managed data parallelism, which simplifies parallel training over multiple GPUs. As of today, it is available for TensorFlow and PyTorch.
(bonus) Cost Optimisation: Do not forget to leverage Managed Spot training to save up to 90% of the compute costs.
You will find other examples on the Amazon SageMaker Distributed Training documentation page
You can use SageMaker for large scale Machine Learning tasks! It's designed for that. I developed this open source project https://github.com/Kenza-AI/sagify (sagify), it's a CLI tool that can help you train and deploy your Machine Learning/Deep Learning models on SageMaker in a very easy way. I managed to train and deploy all of my ML models whatever library I was using (Keras, Tensorflow, scikit-learn, LightFM, etc)

How to make my datalab machine learning run faster

I got some data, which is 3.2 million entries in a csv file. I'm trying to use CNN estimator in tensorflow to train the model, but it's very slow. Everytime I run the script, it got stuck, like the webpage(localhost) just refuse to respond anymore. Any recommendations? (I've tried with 22 CPUs and I can't increase it anymore)
Can I just run it and use a thread, like the command line python xxx.py & to keep the process going? And then go back to check after some time?
Google offers serverless machine learning with TensorFlow for precisely this reason. It is called Cloud ML Engine. Your workflow would basically look like this:
Develop the program to train your neural network on a small dataset that can fit in memory (iron out the bugs, make sure it works the way you want)
Upload your full data set to the cloud (Google Cloud Storage or BigQuery or &c.) (documentation reference: training steps)
Submit a package containing your training program to ML Cloud (this will point to the location of your full data set in the cloud) (documentation reference: packaging the trainer)
Start a training job in the cloud; this is serverless, so it will take care of scaling to as many machines as necessary, without you having to deal with setting up a cluster, &c. (documentation reference: submitting training jobs).
You can use this workflow to train neural networks on massive data sets - particularly useful for image recognition.
If this is a little too much information, or if this is part of a workflow that you'll be doing a lot and you want to get a stronger handle on it, Coursera offers a course on Serverless Machine Learning with Tensorflow. (I have taken it, and was really impressed with the quality of the Google Cloud offerings on Coursera.)
I am sorry for answering even though I am completely igonorant to what datalab is, but have you tried batching?
I am not aware if it is possible in this scenario, but insert maybe only 10 000 entries in one go and do this in so many batches that eventually all entries have been inputted?

How to choose specific algorithm on Amazon Machine Learning or another cloud platform?

Is is possible to run some specific machine learning algorithms on Amazon Machine Learning? For me it's seems like it works like a black box: you put data there and get some performance without algorithm selection, parameters tuning, etc.
By the way, is it possible somewhere to run specific machine learning algorithm in cloud?
From the Amazon Machine Learning FAQ here, they state that the algorithm being used in this package is logistic regression.
If you want something more sophisticated you'd have to do more work yourself, such as setting up the necessary packages on and EC2 or EMR cluster.