SageMaker Distributed Training in Local Mode (inside Notebook Instances) - amazon-web-services

I've been using SageMaker for a while and have performed several experiments already with distributed training. I am wondering if it is possible to test and run SageMaker distributed training in local mode (using SageMaker Notebook Instances)?

Yes, SageMaker supports distributed training in local mode. But as @Philipp Schmid said, some other features (like Pipe Mode) are not supported.

No, not possible yet. Local mode does not support distributed training with local_gpu, Gzip compression, Pipe Mode, or manifest files for inputs.
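The two answers above can be reconciled in a minimal sketch. It assumes the SageMaker Python SDK v2 parameter names (instance_type, instance_count) and that multi-container runs work with "local" but not with "local_gpu". The helper local_mode_kwargs is hypothetical, not part of the SDK, and IMAGE_URI and ROLE_ARN in the usage comment are placeholders.

```python
def local_mode_kwargs(instance_count=2, use_gpu=False):
    """Build Estimator keyword arguments for SageMaker local mode.

    Hypothetical helper: it only encodes the restriction discussed above,
    namely that local_gpu is assumed to be limited to a single container.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    if use_gpu and instance_count > 1:
        raise ValueError(
            "local_gpu is assumed to be single-container; "
            "use 'local' for multi-container runs"
        )
    return {
        "instance_type": "local_gpu" if use_gpu else "local",
        "instance_count": instance_count,
    }


# Usage (assumes the sagemaker package, Docker, and a valid role and image):
# from sagemaker.estimator import Estimator
# estimator = Estimator(image_uri=IMAGE_URI, role=ROLE_ARN, **local_mode_kwargs(2))
# estimator.fit({"training": "file:///path/to/data"})
```

With instance_count=2 and instance_type="local", the SDK is expected to start two containers on the notebook instance, which is enough to test the distributed code path before paying for real instances.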

Related

Is it possible to use Sagemaker Notebooks with a Docker image as your environment?

I'm currently developing a system that uses some private libraries. I'm developing in local mode, and when I need to process something specific I use SageMaker Processing Jobs. To speed up the process, it would be nice to be able to develop everything in a cloud environment.
I'm wondering if it is possible to use the same Docker image that I use for batch processing (the one that I use for SageMaker Processing Jobs) in the SageMaker Jupyter Notebooks of my cloud environment?
The main problem here is that every time I work in my cloud notebooks I have to deal with dependency conflicts and the like. Using a Docker image would avoid this, and would also allow each member of the team to use the same image to develop in the cloud without having to deal with these kinds of conflicts.
You can use the same Docker image to run a processing job locally using SageMaker local mode (basically by setting the instance_type parameter on the Processor to local).
However, it sounds like you'd want to use the same image as your dev environment in notebooks. In SageMaker notebook instances, the solution would be to create and maintain conda environments with the same requirements and versions (you can also use lifecycle configurations (LCCs) to install a set of packages at notebook start, see some samples here).
An alternative is to use SageMaker Studio, where you can create and bring your own custom image for Studio. There is a detailed tutorial here, and some sample dockerfiles for you to get started here.
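As a rough sketch of the local-mode suggestion above, assuming the SageMaker Python SDK v2 ScriptProcessor class: processor_kwargs and run_processing are hypothetical helper names, and the default cloud instance type is only a placeholder.

```python
def processor_kwargs(image_uri, role, local=True, cloud_instance_type="ml.m5.xlarge"):
    """Keyword arguments for a ScriptProcessor that runs locally or in the cloud.

    Hypothetical helper: the only difference between the two modes is the
    instance_type value, as described in the answer above.
    """
    return {
        "image_uri": image_uri,
        "role": role,
        "command": ["python3"],
        "instance_count": 1,
        "instance_type": "local" if local else cloud_instance_type,
    }


def run_processing(script_path, image_uri, role, local=True):
    """Run script_path inside the given Docker image as a processing job."""
    # Imported lazily: requires the sagemaker package, plus Docker for local mode.
    from sagemaker.processing import ScriptProcessor

    processor = ScriptProcessor(**processor_kwargs(image_uri, role, local))
    processor.run(code=script_path)
    return processor
```

Keeping the image URI identical in both modes is the point of the exercise: the same container is exercised locally and in the real Processing Job.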

GCP run a prediction of a model every day

I have a .py file containing all the instructions to generate the predictions for some data.
Those data are taken from BigQuery and the predictions should be inserted in another BigQuery table.
Right now the code is running on an AI Platform Notebook, but I want to schedule its execution every day. Is there any way to do it?
I ran into AI Platform Jobs, but I can't understand what my code should do or how it should be structured. Is there a step-by-step guide to follow?
You can schedule a Notebook execution using different options:
nbconvert
Different variants of the same technology:
nbconvert: Provides a convenient way to execute the input cells of an .ipynb notebook file and save the results, both input and output cells, as a .ipynb file.
papermill: is a Python package for parameterizing and executing Jupyter Notebooks. (Uses nbconvert --execute under the hood.)
notebook executor: A tool that can be used to schedule the execution of Jupyter notebooks from anywhere (local, GCE, GCP Notebooks) to the Cloud AI Deep Learning VM. You can read more about the usage of this tool here. (Uses gcloud sdk and papermill under the hood.)
KubeFlow Fairing
Is a Python package that makes it easy to train and deploy ML models on Kubeflow. Kubeflow Fairing can also be extended to train or deploy on other platforms. Currently, Kubeflow Fairing has been extended to train on Google AI Platform.
AI Platform Notebook Executor: There are two core functions of the Scheduler extension:
Ability to submit a Notebook to run on AI Platform’s Machine Learning Engine as a training job with a custom container image. This allows you to experiment and write your training code in a cost-effective single VM environment, but scale out to an AI Platform job to take advantage of superior resources (ie. GPUs, TPUs, etc.).
Scheduling a Notebook for recurring runs follows the exact same sequence of steps, but requires a crontab-formatted schedule option.
Nova Plugin: This is the predecessor of the Notebook Scheduler project. Allows you to execute notebooks directly from your Jupyter UI.
Notebook training
A Python package that allows users to run a Jupyter notebook as a Google Cloud AI Platform Training Job.
GCP runner: Allows running any Jupyter notebook function on Google Cloud Platform
Unlike all the other solutions listed above, it allows running training for the whole project, not just a single Python file or Jupyter notebook
Allows running any function with parameters; moving from local execution to the cloud is just a matter of wrapping the function in a gcp_runner.run_cloud(<function_name>, …) call.
This project is production-ready without any modifications
Supports execution on local (for testing purposes), AI Platform, and Kubernetes environments.
A full end-to-end example can be found here:
https://www.github.com/vlasenkoalexey/criteo_nbdev
tensorflow_cloud (Keras for GCP): Provides APIs that make it easy to go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.
Update July 2021:
The recommended option in GCP is Notebook Executor which is already available in EAP.
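For the nbconvert and papermill options listed above, a minimal sketch of headless notebook execution might look like the following. The function names are hypothetical, the notebook paths are up to you, and papermill is imported lazily because it is an optional third-party package. Scheduling the daily run (cron or a managed scheduler) is left out.

```python
import subprocess


def nbconvert_command(input_path, output_path):
    """Command line that executes a notebook and saves the executed copy."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", input_path,
        "--output", output_path,
    ]


def run_with_nbconvert(input_path, output_path):
    """Execute the notebook with nbconvert; raises if any cell fails."""
    subprocess.run(nbconvert_command(input_path, output_path), check=True)


def run_with_papermill(input_path, output_path, parameters=None):
    """Execute the notebook with papermill, optionally injecting parameters."""
    import papermill as pm  # assumes papermill is installed

    pm.execute_notebook(input_path, output_path, parameters=parameters or {})
```

Either function can then be called from a small script that a daily scheduler invokes. If the prediction code already lives in a .py file, the scheduler can call that file directly and the notebook step is unnecessary.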

Understanding Sagemaker Neo

I have a few questions about SageMaker Neo:
1) Can I take advantage of Sagemaker Neo if I have an externally trained tensorflow/mxnet model?
2) Sagemaker provides container image for 'image-classification' and it has released a new image with name 'image-classification-neo' for the neo compilation job. What is the difference between both of them? Do I require a new Neo compatible image for each pre built sagemaker template(container) similarly?
Any help would be appreciated
Thanks!!
1) Yes. Upload your model to an S3 bucket as a model.tar.gz file (similar to what SageMaker would save after training) and you can compile it.
2) The Neo versions use the Neo runtime to load and predict, so yes, the containers are different. Right now, Neo supports the XGBoost and Image Classification built-in algos. Of course, you could build your own custom container and use Neo inside that. For more info: https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html
Julien
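Packaging an externally trained model as model.tar.gz, as described in point 1, could be sketched like this. The function names, bucket, and key are hypothetical; the exact files Neo expects inside the archive depend on the framework, so check the Neo documentation for your model type.

```python
import tarfile
from pathlib import Path


def package_model(model_dir, archive_path="model.tar.gz"):
    """Pack every file under model_dir into a gzip-compressed tar archive.

    Files are stored relative to model_dir, so they sit at the root of the
    archive rather than under a parent folder.
    """
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                tar.add(path, arcname=path.relative_to(model_dir).as_posix())
    return archive_path


def upload_model(archive_path, bucket, key):
    """Upload the archive to S3 (assumes boto3 and configured AWS credentials)."""
    import boto3

    boto3.client("s3").upload_file(str(archive_path), bucket, key)
```

The resulting S3 URI is what a compilation job would take as its input model location.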
It has been a long time since this question was asked, but in case someone turns up here after searching for the same question:
Mainly, SageMaker Neo is an optimizer that makes a model compatible with multiple underlying hardware targets and platforms. Based on the documentation:
"Neo is a new capability of Amazon SageMaker that enables machine learning models to train once and run anywhere in the cloud and at the edge. "
And yes, those two Docker images are different: one of them contains the optimizer code and the other doesn't.
The difference is not in the input, so 'image-classification-neo' can work with the same images that 'image-classification' can.
But the output is different.
The output of 'image-classification-neo' can be used on multiple platforms.
you can check out the supported hardware platforms in the link below:
https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html

Selecting google cloud tool for executing demanding python script

Where should I execute a Python script that processes ~7 GB of data available on GCS? The output will be written to GCS as well.
The script was debugged on datalab notebook with small dataset. I would like to scale up the processing. Should I allocate a big machine? I have no idea what size (resources) of machine is needed.
Many thanks,
Eila
Just in case,
Dataflow can’t work for that kind of data processing
From what I have read about HDF5, it seems that it is not easily parallelizable (see Parallel HDF5 and the h5py multiprocessing_example), so I'll assume that reading that ~7 GB must be done by one worker.
If there is no workaround to it, and you do not encounter memory issues while processing it on the machine you are already using, I do not see a need to upgrade your datalab instance.
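One way to check for memory issues before deciding on machine size is to measure the peak memory of the processing function on the small dataset and extrapolate. The sketch below uses only the standard library. Both function names are hypothetical, linear scaling with input size is an assumption, and tracemalloc only sees allocations made through Python's allocator, so memory held by native libraries may be under-counted.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def extrapolate_peak(sample_peak, sample_size, full_size):
    """Scale a measured peak linearly to the full input size (an assumption)."""
    return sample_peak * full_size / sample_size
```

If the extrapolated figure fits comfortably in the current instance's RAM, there is little reason to move to a larger machine.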

Library for deep learning on Amazon EC2 with CPU and GPU support for convolutional neural network

I want to train a CNN on a bunch of images. I want to run it on Amazon EC2 CPU or GPU clusters. For running deep learning on a cluster, I figured that some of the options are:
h2o (with Spark)
Caffe
Theano
I am not sure which of these options suits my needs. I read through the h2o documentation on deep learning, and it does not seem to support CNNs. Any ideas on how I should proceed?
Another side question:
How do I upload my images to the cluster for training the CNN? I am fairly new to cluster computing.
Just came across your question. In this tutorial you'll also find how to set up an Amazon instance with a GPU to run deep learning frameworks.
The AMI (~computer model) is pre-configured with:
Ubuntu Server 16.04 as OS
Anaconda 4.2.0 (scientific Python distribution)
Python 3.5
Cuda 8.0 (“parallel computing platform and programming model”, used to send code to the GPU)
cuDNN 5.1 (Cuda’s library for Deep Learning used by Tensorflow and Theano)
Tensorflow 0.12 for Python 3.5 and GPU-enabled
Keras 1.1.2 (use with Tensorflow backend)
I believe you can use this setup with Elastic GPUs to scale the system according to your needs, or use a P2 instance.
Anyway, you can follow the tutorial and use another AMI, like Amazon's Deep Learning AMI.
AWS provides the Deep Learning AMI with various Deep Learning Frameworks installed into it, which covers your use case since it has Theano as well as Caffe.
Link to Deep Learning AMI https://aws.amazon.com/machine-learning/amis/.
How do I upload my images to the cluster for training the CNN? I am fairly new to cluster computing.
There are many AWS storage services that give you a way to store your training data (images) so that it is accessible to your cluster. A few of them are:
S3
EBS
EFS
Explore them and see what works best for you.
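For the S3 option, a minimal sketch of uploading a local image folder could look like the following. It assumes boto3 is installed and AWS credentials are configured; the function names, bucket name, key prefix, and the set of image suffixes are all hypothetical choices.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def plan_uploads(image_dir, prefix="training-images"):
    """Return (local_path, s3_key) pairs for every image under image_dir."""
    image_dir = Path(image_dir)
    return [
        (path, f"{prefix}/{path.relative_to(image_dir).as_posix()}")
        for path in sorted(image_dir.rglob("*"))
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    ]


def upload_images(image_dir, bucket, prefix="training-images"):
    """Upload the planned files to S3, one object per image."""
    import boto3  # imported lazily so the planning step runs without it

    s3 = boto3.client("s3")
    for path, key in plan_uploads(image_dir, prefix):
        s3.upload_file(str(path), bucket, key)
```

Each training instance can then download the images from the bucket at startup, which avoids copying them to every machine by hand.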
If you follow the instructions here https://github.com/deeplearningparis/dl-machine then you can set up an AMI with Theano and Torch. There is also a PR on the config to include Caffe by default as well (if you need it, just check out the branch and run the install script as soon as the instance is up).