Can SageMaker distributed training be used for training non-deep learning models?

I am following this documentation page to understand SageMaker's distributed training feature.
It says here that:
The SageMaker distributed training libraries are available only through the AWS deep learning containers for the TensorFlow, PyTorch, and HuggingFace frameworks within the SageMaker training platform. 
Does this mean that we cannot use SageMaker distributed training to train machine learning models with traditional machine learning algorithms such as linear regression, random forest or XGBoost?
I have a use case where the data set is very large, and distributed training could help with model parallelism and data parallelism. What other options are recommended to avoid loading large amounts of data into memory on a training instance?

SageMaker training offers various tabular built-in algorithms. The KNN, XGBoost, Linear Learner, and Factorization Machines algorithms are parallelizable (they can run on multiple devices) and support data streaming (no limit on data size).
Beyond built-in algorithms, SageMaker also supports bringing your own training script.

I guess the parallel implementation of the SageMaker XGBoost algorithm can be called a "data-parallel" approach, since the model is copied to multiple instances and the data is distributed across those instances (when used with distribution="ShardedByS3Key" in sagemaker.TrainingInput).
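For illustration, a minimal sketch of such a sharded input channel (the bucket, prefix, and channel name are placeholders):

    from sagemaker.inputs import TrainingInput

    # Each training instance receives a different subset of the objects
    # under this S3 prefix instead of a full copy of the data.
    train_input = TrainingInput(
        s3_data="s3://my-bucket/train/",
        distribution="ShardedByS3Key",   # default is "FullyReplicated"
        content_type="text/csv",
    )

    # estimator.fit({"train": train_input})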
The "model parallel" approaches are probably more applicable to Neural Networks . It uses smdistributed.modelparallel.torch as smp to create a decorator #smp.step that shall inform the model about how to split the layers of the Neural Network across instances.

Related

Clarification on Default SageMaker Distribution Strategy

Context: when using SageMaker distributed training, let's say I train a network without providing any distribution parameter (keeping the default), but set instance_count=2 in the estimator (it could be any deep-learning estimator, e.g., PyTorch).
In this scenario would there be any distributed training taking place? If so, what strategy is used by default?
NOTE: I can see that both instances' GPUs are actively used, but I am wondering what sort of distributed training takes place by default.
If you're using custom code (a custom Docker image, or custom code in a framework container), the answer is no. Unless you are writing distributed code (Horovod, PyTorch DDP, MPI...), SageMaker will not distribute things for you. It will launch the same Docker or Python code N times, once per instance. Think of the SageMaker Training API as a whiteboard that can create multiple connected and configured machines for you, but the code is still yours to write. The SageMaker distributed training libraries can make distributed code much easier to write, though.
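For example, if you bring your own PyTorch DDP code, a minimal bootstrapping sketch using the SM_* environment variables that the SageMaker training toolkit injects could look like this (one process per instance for simplicity; the port and backend are illustrative choices):

    import json
    import os

    import torch.distributed as dist

    hosts = json.loads(os.environ["SM_HOSTS"])      # e.g. ["algo-1", "algo-2"]
    current_host = os.environ["SM_CURRENT_HOST"]    # e.g. "algo-1"
    world_size = len(hosts)
    rank = hosts.index(current_host)

    os.environ["MASTER_ADDR"] = hosts[0]
    os.environ["MASTER_PORT"] = "7777"              # any free port

    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    # ...wrap the model with torch.nn.parallel.DistributedDataParallel here...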
If you're using a built-in algorithm, the answer is: it depends. Some SageMaker built-in algorithms are natively multi-machine, like SM XGBoost or SM Random Cut Forest.

Distributed Spark on Amazon SageMaker

I have built a SparkML collaborative filtering algorithm that I want to train and deploy on Sagemaker. What is the best way to achieve this other than BYOC?
Also, I want to understand how distributed training works in Sagemaker if we go with the BYOC route.
I have tried to look for good resources on this, but the documentation is pretty sparse on the distributed aspect. You can provide instance_count in your Estimator, but how is it used in the BYOC scenario? Do we have to handle it in the training scripts/code? Any example of doing that with SparkML?

Is there a way to access and work with data stored in GCP bucket directly?

I have to do a deep learning project at my university, where I need to work with a medical image database. This database is stored in a Google Cloud Platform bucket.
However, the database's size is over 4 TB, so I can't afford to download the data using gsutil. I can't use a Google Colab notebook either, since its disk storage is only 350 GB.
Is there any way I can access the data and use it to train my network?
I don't think you're approaching this the right way.
When you build your model, you only need to have a representative subset of your dataset to validate your layers and the expected behavior.
Then, when all is done and packaged, you run your training job on a dedicated VM (like a Deep Learning VM). This process can be handled automatically by AI Platform. You can also set up a parameter server and parallelize your training.
In the training phase, you usually work with batches: you load only a subset of your dataset, shuffle it, and perform several training steps on this subset (computing RMSE/cross-entropy, evaluating, and running gradient optimization).
Because you only use a subset of your full dataset in each batch, you don't need to have the 4 TB on your VM at the same time. Your training loop handles it for you (download, train, evaluate, delete).
Like I said before, because you use subsets, you can also parallelize your training across several VMs to reduce your training duration.
I recommend reviewing your training loop. If you tell me which framework (name/version) you work with, I can point you to tutorials and examples.
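For instance, assuming TensorFlow and a dataset stored as TFRecord shards in the bucket (both of these are assumptions; the path and batch size are placeholders), a streaming input pipeline could look like this:

    import tensorflow as tf

    # List the shard files directly in the GCS bucket; nothing is downloaded up front.
    files = tf.data.Dataset.list_files("gs://my-medical-bucket/train/*.tfrecord")

    dataset = (
        tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
        .shuffle(10_000)              # shuffle within a buffer, not the whole dataset
        .batch(32)
        .prefetch(tf.data.AUTOTUNE)
    )

    # model.fit(dataset, ...)  # only a few batches are in memory at any time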

Sagemaker Endpoint throttling exception

I have created an endpoint using Sagemaker, and designed my system so that it is called about 100 times simultaneously. This seemed to cause 'Model error' and take too much time. Do I need to create an endpoint for each event, and make one call per endpoint, instead?
You can check the CloudWatch logs to diagnose your model failure.
Real-time inference traffic scaling can be addressed by working on 3 independent dimensions:
hardware: choosing larger machines or more machines. For example, you can load test your model endpoint with bigger and bigger machines and see when the hardware size gives you acceptable latency. The Autoscaling feature of SageMaker helps you address this automatically (see the sketch after this answer). If deploying a deep neural net, you can also consider using appropriate accelerators, eg GPU (EC2 P3, EC2 G4) or Amazon Elastic Inference Accelerator, to make each prediction much faster.
software: you have 2 levers to tune here:
choosing a serving stack that is lean and fast. Different servers will handle load at different levels of performance. One common trick is to batch the load: for example, instead of hitting your server 100 times, can you hit it only once with a batch of 100 records? If clients cannot batch their requests, can you use micro-asynchrony so that you do the batching yourself after they issue requests? You can usually configure such micro-batching in advanced deep learning servers such as TF Serving or MXNet Model Server (both can be used in SageMaker), but otherwise you can also do it yourself by having a queue (SQS) in front of your server.
model compilation - optimizing the model graph and its runtime. This is a very smart concept that leverages the fact that when you know where you're going to deploy (eg NVIDIA, Intel, ARM, etc), you have an insider edge and can refine your model artifact and create a bespoke runtime that is tailor-made for this specific target platform. This can reduce memory consumption and latency by double-digit percentages, and is an active area of ML research. In the SageMaker ecosystem, such a compilation task can be performed with SageMaker Neo, but the open source ecosystem is developing fast, notably with treelite (paper, doc) for decision tree compilation and TVM (paper, doc) for arbitrary neural net compilation. Both are dependencies of Neo, by the way.
science: some models are slower or heavier than others. If speed and concurrency are your priorities over accuracy, and if you already exploited all possible tricks at level (1) and (2) above, consider using fast-throughput models, eg linear models & logistic regression for structured data, MobileNet or SqueezeNet instead of large Resnets for classification (nice benchmark here), Yolov3 instead of FasterRCNN for detection (nice benchmark here), etc. But be aware that unlike levels (1) and (2), changing model science will alter accuracy.
As mentioned above, those 3 areas of improvements really are about real-time inference; if you can afford to pre-compute all possible model inputs, then the ultimate low-latency high-throughput solution is to pre-compute offline a variety of input-predictions pairs of interest and serve them on demand from a fast database or local read-only store.
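As a concrete example of the autoscaling lever mentioned under hardware above, here is a sketch of attaching a target-tracking scaling policy to an existing endpoint variant (the endpoint name, variant name, capacities, and target value are placeholders):

    import boto3

    client = boto3.client("application-autoscaling")
    resource_id = "endpoint/my-endpoint/variant/AllTraffic"

    # Register the variant's instance count as a scalable target.
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,
        MaxCapacity=10,
    )

    # Scale on invocations per instance per minute.
    client.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 100.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )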

AWS SageMaker Very large Dataset

I have a CSV file of 500 GB and a MySQL database with 1.5 TB of data, and I want to run AWS SageMaker classification and regression algorithms and random forest on it.
Can AWS SageMaker support that? Can the model be read and trained in batches or chunks? Is there any example of this?
Amazon SageMaker is designed for such scales and it is possible to use it to train on very large datasets. To take advantage of the scalability of the service you should consider a few modifications to your current practices, mainly around distributed training.
You will want to use distributed training to allow much faster training (“100 hours of a single instance cost exactly the same as 1 hour of 100 instances, just 100 times faster”), greater scalability (“if you have 10 times more data, you just add 10 times more instances and everything just works”), and greater reliability, as each instance only handles a small part of the dataset or the model and doesn't run out of disk or memory space.
It is not obvious how to implement ML algorithms in a distributed way that is still efficient and accurate. Amazon SageMaker has modern implementations of classic ML algorithms such as Linear Learner, K-Means, PCA, XGBoost, etc. that support distributed training and can scale to such dataset sizes. In some benchmarks these implementations can be 10 times faster than other distributed training implementations such as Spark MLlib. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/notebooks/video-game-sales-xgboost.ipynb
The other aspect of the scale is the data file(s). The data shouldn't be in a single file, as that limits the ability to distribute it across the cluster you are using for distributed training. With SageMaker you can decide how to use the data files from Amazon S3: they can be in fully replicated mode, where all the data is copied to all the workers, or sharded by key, which distributes the data across the workers and can speed up the training even further (see the sketch below). You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/data_distribution_types
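A minimal sketch of a multi-instance built-in XGBoost job with sharded input (the bucket, role ARN, image version, instance choices, and hyperparameters are placeholders):

    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

    xgb = Estimator(
        image_uri=image_uri,
        role="arn:aws:iam::123456789012:role/MySageMakerRole",
        instance_count=4,                # training is distributed over 4 instances
        instance_type="ml.m5.2xlarge",
        output_path="s3://my-bucket/output/",
        sagemaker_session=session,
    )
    xgb.set_hyperparameters(objective="reg:squarederror", num_round=100)

    train = TrainingInput(
        "s3://my-bucket/train/",
        content_type="text/csv",
        distribution="ShardedByS3Key",   # each instance reads a different shard
    )
    xgb.fit({"train": train})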
Amazon SageMaker is built to help you scale your training activities. With large datasets, you might consider two main aspects:
The way data are stored and accessed,
The actual training parallelism.
Data storage: S3 is the most cost-effective way to store your data for training. To get faster startup and training times, you can consider the following:
If your data is already stored on Amazon S3, you might first want to consider leveraging Pipe mode with built-in algorithms or with your own algorithm. But Pipe mode is not suitable all the time; for example, if your algorithm needs to backtrack or skip ahead within an epoch (the underlying FIFO cannot support lseek() operations), or if it is not easy to parse your training dataset from a streaming source.
In those cases, you may want to leverage Amazon FSx for Lustre and Amazon EFS file systems. If your training data is already in an Amazon EFS, I recommend using it as a data source; otherwise, choose Amazon FSx for Lustre.
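A sketch of both input options (the bucket, file system ID, and directory path are placeholders):

    from sagemaker.inputs import FileSystemInput, TrainingInput

    # Pipe mode: data is streamed from S3 into the container over a FIFO
    # instead of being fully downloaded before training starts.
    pipe_input = TrainingInput(
        "s3://my-bucket/train/",
        input_mode="Pipe",
        content_type="text/csv",
    )

    # FSx for Lustre: mount an existing file system into the training job.
    fsx_input = FileSystemInput(
        file_system_id="fs-0123456789abcdef0",
        file_system_type="FSxLustre",
        directory_path="/fsx/train",
        file_system_access_mode="ro",
    )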
Training parallelism: With large datasets, it is likely you'll want to train on different GPUs. In that case, consider the following:
If your training is already Horovod ready, you can do it with Amazon SageMaker (notebook).
In December 2020, AWS released managed data parallelism, which simplifies parallel training over multiple GPUs. As of today, it is available for TensorFlow and PyTorch (a sketch follows at the end of this answer).
(bonus) Cost Optimisation: Do not forget to leverage Managed Spot training to save up to 90% of the compute costs.
You will find other examples on the Amazon SageMaker Distributed Training documentation page.
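A sketch of enabling the managed data-parallel library on a PyTorch estimator (the entry point, role, framework version, and instance choices are assumptions; the library requires specific GPU instance types such as ml.p3.16xlarge or ml.p4d.24xlarge, and the spot settings illustrate the Managed Spot option mentioned above):

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",                      # your DDP-style training script
        role="arn:aws:iam::123456789012:role/MySageMakerRole",
        framework_version="1.12",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p3.16xlarge",
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
        use_spot_instances=True,                     # optional: Managed Spot training
        max_wait=7200,                               # required when using spot instances
        max_run=7200,
    )
    # estimator.fit({"train": "s3://my-bucket/train/"})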
You can use SageMaker for large-scale machine learning tasks! It's designed for that. I developed this open source project https://github.com/Kenza-AI/sagify (sagify); it's a CLI tool that can help you train and deploy your machine learning/deep learning models on SageMaker in a very easy way. I managed to train and deploy all of my ML models whatever library I was using (Keras, TensorFlow, scikit-learn, LightFM, etc.).