We’re training a big Temporal Fusion Transformer using PyTorch.
We’re looking into using distributed training to accelerate our training jobs with SageMaker.
Does anyone have any examples of this? Any pattern you can recommend?
Although there is no direct example for the above-mentioned model, you should be able to follow the documentation below for PyTorch Lightning (PL):
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt-lightning.html
See the notebook below for a full example of using SageMaker DDP with PyTorch Lightning:
https://github.com/aws-samples/sagemaker-distributed-training-workshop/blob/main/1_data_parallel/PyTorch%20Lightning%20on%20SageMaker.ipynb
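For orientation, here is a minimal sketch of how the SageMaker data parallel library is typically switched on from the launcher side. The helper name `build_estimator_kwargs`, the script name `train.py`, the role ARN, and the instance type and framework versions are all assumptions chosen for illustration; check the linked documentation for the combinations that are supported for your model.

```python
def build_estimator_kwargs(role, entry_point="train.py",
                           instance_count=2,
                           instance_type="ml.p4d.24xlarge"):
    """Collect keyword arguments for sagemaker.pytorch.PyTorch.

    All concrete values here are illustrative assumptions, not
    recommendations for a Temporal Fusion Transformer.
    """
    return {
        "entry_point": entry_point,        # hypothetical training script
        "role": role,                      # your SageMaker execution role
        "instance_count": instance_count,
        "instance_type": instance_type,    # assumed multi-GPU instance type
        "framework_version": "1.12",       # assumed PyTorch version
        "py_version": "py38",              # assumed Python version
        # This setting enables the SageMaker data parallel library.
        "distribution": {
            "smdistributed": {"dataparallel": {"enabled": True}}
        },
    }


kwargs = build_estimator_kwargs(
    role="arn:aws:iam::111122223333:role/ExampleRole"  # placeholder ARN
)

# With the SageMaker Python SDK installed and AWS credentials configured,
# the job would then be launched roughly like this:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(**kwargs)
#   estimator.fit("s3://your-bucket/your-prefix")  # placeholder S3 path
```

The training script itself still has to be adapted as described in the documentation above; the `distribution` argument only tells SageMaker how to launch it.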
Related
Context: when using SageMaker distributed training, say I train a network without providing any distribution parameter (leaving it at the default), but set instance_count to 2 in the estimator (this could be any deep-learning estimator, e.g., PyTorch).
In this scenario would there be any distributed training taking place? If so, what strategy is used by default?
NOTE: I can see that both instances’ GPUs are actively used, but I am wondering what sort of distributed training takes place by default.
If you're using custom code (a custom Docker image, or custom code in a framework container), the answer is NO. Unless you write distributed code yourself (Horovod, PyTorch DDP, MPI...), SageMaker will not distribute anything for you. It will launch the same Docker image or Python code N times, once per instance. Think of the SageMaker Training API as a whiteboard that can create multiple connected and configured machines for you, but the code is still yours to write. The SageMaker Distributed Training Libraries can make distributed code much easier to write, though.
If you're using a built-in algorithm, the answer is that it depends. Some SageMaker built-in algorithms are natively multi-machine, like SageMaker XGBoost or SageMaker Random Cut Forest.
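As a rough illustration of "the code is still yours to write": in the framework containers, each copy of your script can find out where it sits in the cluster from the environment variables `SM_HOSTS`, `SM_CURRENT_HOST` and `SM_NUM_GPUS`. The sketch below only derives ranks from them; the helper name `cluster_info` and the one-process-per-GPU layout are assumptions, and you would still have to pass the result to Horovod, PyTorch DDP or similar yourself.

```python
import json
import os


def cluster_info(env=None):
    """Derive basic distributed-training coordinates for this instance.

    Assumes one worker process per GPU (or one per host on CPU instances).
    """
    env = os.environ if env is None else env
    # SM_HOSTS is a JSON list such as '["algo-1", "algo-2"]'.
    # Sorting gives every host the same ordering to derive ranks from.
    hosts = sorted(json.loads(env.get("SM_HOSTS", '["algo-1"]')))
    current = env.get("SM_CURRENT_HOST", hosts[0])
    gpus = int(env.get("SM_NUM_GPUS", "0"))
    procs_per_host = max(gpus, 1)
    return {
        "node_rank": hosts.index(current),
        "num_nodes": len(hosts),
        "procs_per_host": procs_per_host,
        "world_size": len(hosts) * procs_per_host,
        "master_addr": hosts[0],  # convention: first host coordinates
    }


# Simulate what the second of two 4-GPU instances would see.
info = cluster_info({
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "SM_NUM_GPUS": "4",
})
print(info)
```

If nothing in your script uses this kind of information, each instance simply trains its own independent copy of the model, which is why both GPUs can look busy without any distributed training taking place.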
I have built a SparkML collaborative filtering algorithm that I want to train and deploy on SageMaker. What is the best way to achieve this other than BYOC (bring your own container)?
Also, I want to understand how distributed training works in SageMaker if we go the BYOC route.
I have tried to look for good resources on this, but the documentation is pretty sparse on the distributed aspect. You can provide instance_count in your Estimator, but how is it used in a BYOC scenario? Do we have to handle it in the training scripts ourselves? Is there any example of doing that with SparkML?
I want to train Yolact on a custom dataset using Google Colab+.
Is it possible to train on Colab+ or does it time out too easily?
Thank you!
Yes, you can train your model on Colab+. The problem is that a Colab session has a relatively short lifetime compared with other cloud platforms such as AWS SageMaker or Google Cloud. I run the code below to extend that time a bit.
%%javascript
// Click the Colab "Connect" button every 50 seconds to keep the session active.
function ClickConnect() {
  console.log("Working");
  document.querySelector("colab-toolbar-button#connect").click();
}
setInterval(ClickConnect, 50000);
I'm trying to implement an anomaly detection machine learning solution on GCP, but I'm finding it hard to find a specific solution in Google Cloud ML comparable to AWS' Random Cut Forest solution in Kinesis. I'm streaming IoT temperature sensor data for water heaters.
Does anyone know a TensorFlow/Google solution for this, as my company only uses the Google stack?
I've tried using sklearn models, but none of them can be deployed to production for streaming data, so I have to use TensorFlow, where I am a novice. Any suggestions on a good flow to get this done?
I would suggest using the Esper complex event processing engine if your primary concern is analysing the data stream and catching patterns in real time. It provides an SQL-like event processing language that runs as a continuous query over the streaming data. Esper offers abstractions for correlation, aggregation and pattern detection. It is an open source project, and a license is required if you want to run the engine on multiple servers to achieve high availability.
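To make the idea of a continuous query over a sliding window concrete, here is a small plain-Python sketch. It is not Esper and does not use its event processing language; it is a swapped-in rolling z-score check, and the class name, window size, threshold and sample readings are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flag readings that deviate strongly from a sliding window of
    recent values (an illustrative stand-in, not Esper's API)."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # most recent normal readings
        self.threshold = threshold          # z-score cut-off (assumed)
        self.min_samples = min_samples      # warm-up before flagging

    def update(self, x):
        """Process one reading and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0 and abs(x - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Keep flagged readings out of the window so that a spike
            # does not distort the baseline for later readings.
            self.values.append(x)
        return is_anomaly


# Made-up water heater temperatures with one obvious spike.
readings = [60.0, 60.5, 59.8, 60.2, 60.1, 59.9, 60.3, 95.0, 60.0]
detector = RollingZScoreDetector(window=10, threshold=3.0)
flags = [detector.update(r) for r in readings]
print(flags)
```

Because flagged readings never enter the window, this sketch will not adapt to a permanent shift in the temperature level; a real deployment would need a policy for that case.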
Could you help me with the following points:
1- Can we train a custom TensorFlow object detection model in AWS SageMaker?
2- I came across SageMaker's image classification algorithm. Can we use it to detect particular objects in video after training the model?
3- I'm confused by the pricing plan of SageMaker. They say "you are offered a monthly free tier of 250 hours of t2.medium notebook usage". Does that mean we can use a t2.medium notebook free for 250 hours?
The final aim is to train a model for custom object detection at a very low price, as we used to on Paperspace or FloydHub.
Thanks in advance.
1- Sure. You can bring any TensorFlow code to SageMaker. https://docs.aws.amazon.com/sagemaker/latest/dg/tf-examples.html
2- This is a classification model (labels only), not a detection model (labels + bounding boxes). Having said that, yes, you can definitely use it to classify frames extracted from a video.
3- Yes, in the first 12 months following the creation of your AWS account.
Hope this helps.
Any TensorFlow model can be used/ported to SageMaker. You can find examples of TensorFlow models ported to SageMaker here https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk#amazon-sagemaker-examples.