Multiple environment variables are pre-set and available in the SageMaker runtime during training and serving. Where are they defined and explained?
The SageMaker SDK documentation says:
For the exhaustive list of available environment variables, see the SageMaker Containers documentation.
However, that documentation says:
WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.
And SageMaker Inference Toolkit does not list them, apparently.
Obsolete documentation like this, left un-updated by the SageMaker team, wastes a lot of time. Does AWS not have an internal documentation update process?
Here are the links to the official documentation that were requested in the question:
Training toolkit environment variables
Inference toolkit environment variables
Inference toolkit parameters
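As a quick illustration of the variables those pages document, a training script typically just reads them from the environment. A minimal sketch (the channel name and defaults below are assumptions that depend on how your job is configured):

```python
import json
import os

# Common variables injected by the SageMaker training toolkit at runtime.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")   # where the script should write model artifacts
train_dir = os.environ.get("SM_CHANNEL_TRAINING")             # path of the "training" input channel (channel name depends on your job)
hyperparams = json.loads(os.environ.get("SM_HPS", "{}"))      # hyperparameters passed to the job, as a JSON dict
num_gpus = int(os.environ.get("SM_NUM_GPUS", "0"))            # number of GPUs on the instance

print(model_dir, train_dir, hyperparams, num_gpus)
```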
I've been using SageMaker for a while and have performed several experiments already with distributed training. I am wondering if it is possible to test and run SageMaker distributed training in local mode (using SageMaker Notebook Instances)?
Yes, SageMaker supports distributed training in local mode. But, as @Philipp Schmid said, some other features (like Pipe mode) are not supported.

No, not possible yet. Local mode does not support distributed training with local_gpu, and Gzip compression, Pipe Mode, and manifest files for inputs are not supported either.
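For reference, here is a minimal sketch of what switching an estimator to local mode looks like (the script, role, framework versions and data path are placeholders; whether the distributed settings above are actually honored depends on your SDK version, per the answers in this thread):

```python
# A minimal sketch, assuming the SageMaker Python SDK and Docker are available
# on the notebook instance; the script, role, versions and data path are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                               # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    framework_version="1.13",
    py_version="py39",
    instance_count=2,       # distributed setting; whether local mode honors it depends on your SDK version
    instance_type="local",  # "local" or "local_gpu" switches the SDK into local mode
)

estimator.fit({"training": "file:///home/ec2-user/SageMaker/data"})  # local file input channel
```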
I'm currently developing a system that uses some private libraries. I'm developing in local mode, and when I need to process something specific I use SageMaker Processing Jobs. The thing is that, in order to speed up the process, it would be nice to be able to develop everything in a cloud environment.

I'm wondering if it is possible to use the same Docker image that I use for batch processing (the one I use for SageMaker Processing Jobs) in the SageMaker Jupyter notebooks of my cloud environment?

The main problem here is that every time I work in my cloud notebooks I have to deal with dependency conflicts and so on. Using a Docker image would avoid this, and would also allow each member of the team to use the same image to develop in the cloud without having to deal with these kinds of conflicts.
You can use the same Docker image to run a processing job locally using SageMaker local mode (basically, setting the instance_type parameter on the Processor to local).
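For example, a hedged sketch with a ScriptProcessor pointing at your existing image (the image URI, role and script name are placeholders):

```python
# A minimal sketch, assuming your processing image is already in ECR; the image URI,
# role and script name are placeholders.
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

processor = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-processing-image:latest",  # your existing image
    command=["python3"],
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_type="local",   # local mode: runs the container on this machine via Docker
    instance_count=1,
)

processor.run(
    code="preprocess.py",    # hypothetical processing script
    inputs=[ProcessingInput(source="./data", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)
```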
However, it sounds like you'd want to use the same image as your dev environment in notebooks. In SageMaker notebook instances, the solution would be to create and maintain conda environments with the same requirements and versions (you can also use LCCs to install a set of packages at notebook start, see some samples here).
An alternative is to use SageMaker Studio, where you can create and bring your own custom image for Studio. There is a detailed tutorial here, and some sample dockerfiles for you to get started here.
I have a pre-trained model.pkl file, plus all the other files related to the ML model. I want to deploy it on AWS SageMaker.

But how do I deploy it to AWS SageMaker without training? The fit() method in SageMaker runs the train command and pushes model.tar.gz to an S3 location, and when the deploy() method is used it uses that same S3 location to deploy the model. We don't manually create that S3 location; it is created by SageMaker and named with a timestamp. How do I put my own model.tar.gz file in an S3 location and call deploy() using that same location?
All you need is:
to have your model in an arbitrary S3 location in a model.tar.gz archive
to have an inference script in a SageMaker-compatible docker image that is able to read your model.pkl, serve it and handle inferences.
to create an endpoint associating your artifact to your inference code
When you ask for an endpoint deployment, SageMaker will take care of downloading your model.tar.gz and uncompressing it to the appropriate location in the docker image of the server, which is /opt/ml/model.
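As an illustration, here is a minimal sketch of packaging a pre-trained model.pkl into that layout and uploading it to S3 (the bucket and key are placeholders):

```python
import tarfile

import boto3

# Package the pre-trained model in the layout SageMaker expects (files at the archive root).
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pkl", arcname="model.pkl")

# Upload the archive to any S3 location you control (bucket and key are placeholders).
boto3.client("s3").upload_file("model.tar.gz", "my-bucket", "models/model.tar.gz")
```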
Depending on the framework you use, you may use either a pre-existing docker image (available for Scikit-learn, TensorFlow, PyTorch, MXNet) or you may need to create your own.
Regarding custom image creation, see the specification here and two examples of custom containers for R and sklearn here (the sklearn one is less relevant now that there is a pre-built docker image along with a SageMaker sklearn SDK).

Regarding leveraging the existing containers for Sklearn, PyTorch, MXNet and TF, check this example: Random Forest in SageMaker Sklearn container. In this example, nothing prevents you from deploying a model that was trained elsewhere. Note that with a train/deploy environment mismatch you may run into errors due to software version differences, though.
Regarding your following experience:
when the deploy method is used it uses the same s3 location to deploy the model; we don't manually create that location in s3, as it is created by SageMaker and named using a timestamp
I agree that the demos that use the SageMaker Python SDK (one of the many available SDKs for SageMaker) can sometimes be misleading, in the sense that they often leverage the fact that an Estimator that has just been trained can be deployed (Estimator.deploy(..)) in the same session, without having to instantiate the intermediary model concept that maps inference code to a model artifact. This design presumably favors code brevity, but in real life the training and deployment of a given model may well be done from different scripts running on different systems. It's perfectly possible to deploy a model without having trained it previously in the same session: you need to instantiate a sagemaker.model.Model object and then deploy it.
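For instance, here is a minimal sketch using the Scikit-learn serving container (the S3 path, role and inference script are placeholders; the same pattern works with the generic sagemaker.model.Model plus a custom image_uri):

```python
# A minimal sketch, assuming a scikit-learn model; the S3 path, role and inference
# script are placeholders.
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data="s3://my-bucket/models/model.tar.gz",      # your own artifact location
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    entry_point="inference.py",                           # hypothetical script defining model_fn/predict_fn
    framework_version="1.2-1",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```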
I have a few questions about SageMaker Neo:

1) Can I take advantage of SageMaker Neo if I have an externally trained TensorFlow/MXNet model?

2) SageMaker provides a container image for 'image-classification', and it has released a new image named 'image-classification-neo' for the Neo compilation job. What is the difference between the two? Do I similarly require a new Neo-compatible image for each pre-built SageMaker container?
Any help would be appreciated
Thanks!!
1) Yes. Upload your model to an S3 bucket as a model.tar.gz file (similar to what SageMaker would save after training) and you can compile it.
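For example, a hedged sketch of compiling such an externally trained model with the boto3 CreateCompilationJob API (the bucket, role, input shape and target device below are assumptions):

```python
import boto3

sm = boto3.client("sagemaker")

# Compile an externally trained MXNet model uploaded as model.tar.gz.
# The bucket, role, input shape and target device are assumptions.
sm.create_compilation_job(
    CompilationJobName="my-neo-compilation-job",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    InputConfig={
        "S3Uri": "s3://my-bucket/models/model.tar.gz",
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',     # input name and shape of your model
        "Framework": "MXNET",                                # or TENSORFLOW, etc.
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/neo-output/",
        "TargetDevice": "ml_c5",                             # or an edge target such as jetson_nano
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```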
2) The Neo versions use the Neo runtime to load and predict, so yes, the containers are different. Right now, Neo supports the XGBoost and Image Classification built-in algos. Of course, you could build your own custom container and use Neo inside that. For more info: https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html
Julien
It has been a long time since this question was asked, but in case someone turns up here after searching for the same thing:

Mainly, Amazon Neo is an optimizer that makes a model compatible with multiple underlying hardware platforms. According to the documentation:
"Neo is a new capability of Amazon SageMaker that enables machine learning models to train once and run anywhere in the cloud and at the edge. "
And yes, those two Docker images are different: one of them contains the optimizer code, the other doesn't.

The difference is not in the input, so 'image-classification-neo' can work with the same images that 'image-classification' can.
But the output is different.
The output of 'image-classification-neo' can be used on multiple platforms.
You can check out the supported hardware platforms in the link below:
https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html
I want to train a CNN on a bunch of images. I want to run it on Amazon EC2 CPU or GPU clusters. For running deep learning on a cluster, I figured that some of the options are:
h2o (with Spark)
Caffe
Theano
I am not sure which of these options suit my needs. I read through h2o documentation on deep learning, they do not seem to support CNNs. Any ideas on how I should proceed?
Another side question:
How do I upload my images to the cluster for training the CNN? I am fairly new to cluster computing.
Just came across your question. In this tutorial you'll also find how to set up an Amazon instance with a GPU to run deep learning frameworks.

The AMI (Amazon Machine Image) is pre-configured with:
Ubuntu Server 16.04 as OS
Anaconda 4.2.0 (scientific Python distribution)
Python 3.5
Cuda 8.0 (“parallel computing platform and programming model”, used to send code to the GPU)
cuDNN 5.1 (Cuda’s library for Deep Learning used by Tensorflow and Theano)
Tensorflow 0.12 for Python 3.5 and GPU-enabled
Keras 1.1.2 (use with Tensorflow backend)
I believe you can use this setup with Elastic GPUs to scale the system according to your needs, or use a P2 instance.

In any case, you can follow the tutorial and use another AMI, like Amazon's Deep Learning AMI.
AWS provides the Deep Learning AMI with various Deep Learning Frameworks installed into it, which covers your use case since it has Theano as well as Caffe.
Link to Deep Learning AMI https://aws.amazon.com/machine-learning/amis/.
How do I upload my images to the cluster for training the CNN? I am fairly new to cluster computing.

There are several AWS storage services that let you store your training data (images) so that it is accessible to your cluster. A few of them are:
S3
EBS
EFS
Explore them and see what works best for you.
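For S3 in particular, here is a minimal hedged sketch of uploading a local folder of images with boto3 (the bucket name, local directory and prefix are placeholders):

```python
import os

import boto3

s3 = boto3.client("s3")
bucket = "my-training-data-bucket"  # placeholder bucket name

# Mirror a local directory of images under an S3 prefix.
for root, _, files in os.walk("images"):
    for name in files:
        local_path = os.path.join(root, name)
        key = "cnn-training/" + os.path.relpath(local_path, "images")
        s3.upload_file(local_path, bucket, key)
```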
If you follow the instructions here https://github.com/deeplearningparis/dl-machine then you can set up an AMI image with Theano and Torch. There is also a PR on the config to have caffe by default as well (if you need it, just checkout the branch and run the install script as soon as the instance is up).