Is SageMaker Distributed Data-Parallel (SMDDP) supported for keras models? - amazon-web-services

In the documentation it says: "SageMaker distributed data parallel is adaptable to TensorFlow training scripts composed of tf core modules except tf.keras modules. SageMaker distributed data parallel does not support TensorFlow with Keras implementation." https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp.html
But in the example showing how to modify the training script, I can see that tf.keras and tf.keras.Model are used. https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/v1.0.0/smd_data_parallel_tensorflow.html

The tf.keras.Sequential model is usable. However, not all other Keras submodules are available yet. This may have changed in the subsequent version releases, v1.4 and v1.5.
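For reference, here is a minimal sketch of training a tf.keras.Sequential model with SMDDP's TensorFlow module, based on the v1.x API page linked above; the model definition and hyperparameters are illustrative placeholders, not an official example.
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp
sdp.init()
# Pin each training process to a single GPU based on its local rank.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')
# Placeholder Sequential model; the point is only that tf.keras.Sequential works here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(0.01 * sdp.size())  # scale learning rate by worker count
@tf.function
def train_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    tape = sdp.DistributedGradientTape(tape)  # all-reduce gradients across workers
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Broadcast initial weights from rank 0 so all workers start in sync.
        sdp.broadcast_variables(model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)
    return loss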

Related

How to build custom model using tf.keras on TensorFlow 2.x that supports SageMaker distributed training?

How to create custom models built using tf.keras on TensorFlow 2.x that support distributed training (multiple GPU instances) in Amazon SageMaker?
E.g. using the Distributed Data Parallel Library (DDPL)?
The documentation mentions that tf.keras is not supported by the DDPL library, so that shouldn't be an option. I've seen examples of distributed training using Horovod: https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/frameworks/keras_pipe_mode_horovod/keras_pipe_mode_horovod_cifar10.html
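For context, Horovod is enabled through the distribution argument of the SageMaker TensorFlow estimator rather than through DDPL. A minimal sketch (the role, framework/Python versions, instance settings, and S3 path are placeholders, not taken from the question):
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point='train.py',          # a Horovod-enabled tf.keras training script
    role=role,                       # assumes an existing SageMaker execution role
    framework_version='2.3',
    py_version='py37',
    instance_count=2,                # multiple instances for distributed training
    instance_type='ml.p3.2xlarge',
    distribution={'mpi': {'enabled': True, 'processes_per_host': 1}},
)
estimator.fit('s3://my-bucket/training-data/')  # hypothetical S3 input path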

SageMaker Distributed Training in Local Mode (inside Notebook Instances)

I've been using SageMaker for a while and have performed several experiments already with distributed training. I am wondering if it is possible to test and run SageMaker distributed training in local mode (using SageMaker Notebook Instances)?
Yes, SageMaker supports distributed training in local mode, but as @Philipp Schmid said, some other features (like Pipe mode) are not supported.
No, not possible yet. Local mode does not support distributed training with local_gpu, Gzip compression, Pipe Mode, or manifest files for inputs.
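To illustrate the local-mode path, the only change on the estimator side is the instance type; a minimal sketch (script name, role, versions, and data path are placeholders):
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point='train.py',
    role=role,                  # assumes an existing SageMaker execution role
    framework_version='2.3',
    py_version='py37',
    instance_count=1,           # single instance: distributed local mode is limited as noted above
    instance_type='local',      # or 'local_gpu' on a GPU notebook instance
)
estimator.fit('file:///home/ec2-user/SageMaker/data/')  # local file input channel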

Where are SageMaker runtime Environment Variables explained?

There are multiple environment variables that are pre-set and available in the SageMaker runtime during training and serving. Where are they defined and explained?
The SageMaker SDK documentation says:
For the exhaustive list of available environment variables, see the SageMaker Containers documentation.
However, the documentation says:
WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.
And SageMaker Inference Toolkit does not list them, apparently.
Obsolete documentation like this, left un-updated by the SageMaker team, costs so much time. Does AWS not have an internal documentation-update process?
Here are the links to the official documentation that were requested in the question:
Training toolkit environment variables
Inference toolkit environment variables
Inference toolkit parameters
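To illustrate, a training script reads these straight from the environment; a minimal sketch using a few of the documented training-toolkit variables (the fallback defaults are placeholders for running outside SageMaker):
import os
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
# SM_CHANNEL_TRAINING is set when an input channel named 'training' is configured.
train_dir = os.environ.get('SM_CHANNEL_TRAINING', '/opt/ml/input/data/training')
num_gpus = int(os.environ.get('SM_NUM_GPUS', '0'))
print(f'Saving model artifacts to {model_dir}')
print(f'Reading training data from {train_dir}')
print(f'GPUs available to this container: {num_gpus}')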

Tensorflow 2 on Google Cloud AI platform

Any news when Tensorflow 2 will be supported on Google Cloud AI platform?
According to the list, 1.15 is the last version to be supported: https://cloud.google.com/ml-engine/docs/runtime-version-list
We will support TF 2.1 officially in early February due to large corresponding changes on the service. Thank you for your patience!
Add a dependency on tensorflow-gpu==2.0 to the setup.py file to train with TensorFlow 2.0 on AI Platform runtime version 1.15. TensorFlow 2.1 can't use GPUs due to a mismatch in CUDA libraries, but you can train with CPUs. Be careful that other dependencies don't overwrite tensorflow-gpu. Disclaimer: I trained just one job this way, and this is a really hacky solution.
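A minimal setup.py sketch of that workaround (the package name and version are placeholders); AI Platform installs the listed dependencies on top of runtime version 1.15:
from setuptools import find_packages, setup
setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    install_requires=['tensorflow-gpu==2.0'],  # pin TF 2.0 over the runtime's bundled 1.15
)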
Tensorflow 2.1 runtime version is available as of March 9th
https://cloud.google.com/ai-platform/training/docs/runtime-version-list

Sagemaker Tensorflow 2.0 endpoint

I have a TensorFlow 2.0 model I would like to deploy to an AWS SageMaker endpoint. I have moved the model to an S3 bucket and executed the following code, but I get the error below because there is no TF 2.0 image. If I try to deploy with a different version (e.g. 1.4, 1.8), I get ping timeout errors.
Is it possible to create one easily? I can't find a good tutorial to follow. Or will Amazon deploy one at some point?
Failed. Reason: The image '520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:2.0-cpu-py2' does not exist..
from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data='s3://sagemaker-eu-west-1-273649867642/model/model.tar.gz',
                                  role=role,
                                  framework_version='2.0',
                                  entry_point='train.py')
%%time
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')
Also, no images seem to support Python 3, even though they suggest you specify that when setting up the model.
"The Python 2 tensorflow images will be soon deprecated and may not be supported for newer upcoming versions of the tensorflow images.
Please set the argument "py_version='py3'" to use the Python 3 tensorflow image"
SageMaker does not support TensorFlow 2.0 yet (neither py2 nor py3), but it will be available with SageMaker soon.
As regards Python versions: for the currently supported TensorFlow versions, py2 is supported; however, after Jan. 1, 2020, future framework versions will no longer support py2.
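For reference, the py_version argument mentioned in that warning is set on the SageMaker TensorFlow estimator; a minimal sketch using the SDK v1 parameter names of that era (framework version and instance settings are placeholders):
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point='train.py',
    role=role,                          # assumes an existing SageMaker execution role
    framework_version='1.15',
    py_version='py3',                   # use the Python 3 TensorFlow image
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
)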