How to use SageMaker Estimator for model training and saving - amazon-web-services
The documentation on how to use SageMaker estimators is scattered around and is sometimes obsolete or incorrect. Is there a single location that gives a comprehensive view of how to use the SageMaker SDK Estimator to train and save models?
Answer
There is no single resource from AWS that provides a comprehensive view of how to use the SageMaker SDK Estimator to train and save models.
Alternative Overview Diagram
Below is a diagram and a brief explanation to give an overview of how a SageMaker Estimator runs a training job.
SageMaker sets up a Docker container for a training job in which:
Environment variables are set as described in SageMaker Docker Container Environment Variables (listed below).
Training data is set up under /opt/ml/input/data.
The training script code is set up under /opt/ml/code.
The /opt/ml/model and /opt/ml/output directories are set up to store the training outputs.
/opt/ml
├── input
│ ├── config
│ │ ├── hyperparameters.json <--- From Estimator hyperparameter arg
│ │ └── resourceConfig.json
│ └── data
│ └── <channel_name> <--- From Estimator fit method inputs arg
│ └── <input data>
├── code
│ └── <code files> <--- From Estimator src_dir arg
├── model
│ └── <model files> <--- Location to save the trained model artifacts
└── output
└── failure <--- Training job failure logs
The SageMaker Estimator fit(inputs) method executes the training script. The Estimator hyperparameters are passed to the script as command-line arguments, and the fit method inputs are made available as data channels under /opt/ml/input/data.
The training script saves the model artifacts under /opt/ml/model once training completes.
SageMaker archives the artifacts under /opt/ml/model into model.tar.gz and saves it to the S3 location specified by the Estimator output_path parameter.
You can set the Estimator metric_definitions parameter to extract model metrics from the training logs and then monitor the training progress in the SageMaker console metrics.
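To make the flow concrete, here is a minimal, hedged sketch of how these pieces map to Estimator parameters; the script name, S3 prefixes, and channel name are placeholders, not values from this post:
import sagemaker
from sagemaker.tensorflow import TensorFlow

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

# hyperparameters -> /opt/ml/input/config/hyperparameters.json and command-line arguments
# source_dir / entry_point -> uploaded to S3 and extracted under /opt/ml/code
# output_path -> S3 prefix where model.tar.gz is written after training
estimator = TensorFlow(
    entry_point="train.py",      # placeholder script name
    source_dir="src",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.3.1",
    py_version="py37",
    hyperparameters={"epochs": 2, "batch-size": 64},
    output_path=f"s3://{bucket}/output",
    metric_definitions=[{"Name": "train:loss", "Regex": "loss: ([0-9\\.]+)"}],
)

# Each key in the fit() inputs dict becomes a channel under /opt/ml/input/data/<channel_name>.
estimator.fit({"training": f"s3://{bucket}/data/training"})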
I believe AWS needs to stop mass-producing verbose, redundant, scattered, and obsolete documents. AWS needs to understand that a picture is worth a thousand words: provide diagrams, and piece the document parts together in a context with a clear objective to achieve.
Problem
AWS documentation needs serious re-design and re-structuring. Just to understand how to train and save a model, we are forced to go through dozens of scattered, fragmented, verbose, redundant documents, which are often obsolete, incomplete, and sometimes incorrect.
It is well-summarized in Why I think GCP is better than AWS:
It’s not that AWS is harder to use than GCP, it’s that it is needlessly hard; a disjointed, sprawl of infrastructure primitives with poor cohesion between them.
A challenge is nice, a confusing mess is not, and the problem with AWS is that a large part of your working hours will be spent untangling their documentation and weeding through features and products to find what you want, rather than focusing on cool interesting challenges.
The SageMaker team in particular keeps changing implementations without updating the documents, and roll-outs have been inconsistent. For example, SDK version 2 was rolled out in SageMaker Studio without announcement, breaking the AWS examples on GitHub, while SageMaker notebook instances still had SDK version 1, so the same code worked in a notebook instance but not in Studio.
It is mind-boggling that we have to go through the many documents below just to understand how to use the SageMaker SDK Estimator for training.
Documents for Model Training
Train a Model with Amazon SageMaker
This document gives a 20,000-foot overview of SageMaker training but does not give any clue about what to actually do.
Running a container for Amazon SageMaker training
This document gives an overview of what SageMaker training looks like. However, it is not up to date, as it is based on SageMaker Containers, which is obsolete.
WARNING: This package has been deprecated. Please use the SageMaker Training Toolkit for model training and the SageMaker Inference Toolkit for model serving.
Step 4: Train a Model
This document lays out the steps for training.
The Amazon SageMaker Python SDK provides framework estimators and generic estimators to train your model while orchestrating the machine learning (ML) lifecycle accessing the SageMaker features for training and the AWS infrastructures
Train a Model with the SageMaker Python SDK
To train a model by using the SageMaker Python SDK, you:
Prepare a training script
Create an estimator
Call the fit method of the estimator
Finally, this document gives concrete steps and ideas. However, it is still missing comprehensive details about the environment variables, the directory structure inside the SageMaker Docker container, the S3 locations for uploading code and placing data, the S3 location where the trained model is saved, etc.
Use TensorFlow with the SageMaker Python SDK
This document focuses on the TensorFlow Estimator implementation steps. Use the Training a TensorFlow Model on MNIST GitHub example alongside it to follow the actual implementation.
Documents for passing parameters and data locations
How Amazon SageMaker Provides Training Information
This section explains how SageMaker makes training information, such as training data, hyperparameters, and other configuration information, available to your Docker container.
This document finally gives an idea of how parameters and data are passed around, but again it is not comprehensive.
SageMaker Docker Container Environment Variables
This documentation is marked as deprecated, but it is the only document that explains the SageMaker environment variables.
IMPORTANT ENVIRONMENT VARIABLES
SM_MODEL_DIR
SM_CHANNELS
SM_CHANNEL_{channel_name}
SM_HPS
SM_HP_{hyperparameter_name}
SM_CURRENT_HOST
SM_HOSTS
SM_NUM_GPUS
List of provided environment variables by SageMaker Containers
SM_NUM_CPUS
SM_LOG_LEVEL
SM_NETWORK_INTERFACE_NAME
SM_USER_ARGS
SM_INPUT_DIR
SM_INPUT_CONFIG_DIR
SM_OUTPUT_DATA_DIR
SM_RESOURCE_CONFIG
SM_INPUT_DATA_CONFIG
SM_TRAINING_ENV
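A common pattern (a sketch, not an official recipe) is to read these variables in the training script as argparse defaults, so the same script also runs outside SageMaker; the "train" channel name here is an assumption that depends on what you pass to fit:
import argparse
import json
import os

parser = argparse.ArgumentParser()
# Directories injected by SageMaker; fall back to local paths when run outside a container.
parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
parser.add_argument("--output-data-dir", default=os.environ.get("SM_OUTPUT_DATA_DIR", "./output"))
args, _ = parser.parse_known_args()

# SM_HPS bundles all hyperparameters into a single JSON document.
hyperparameters = json.loads(os.environ.get("SM_HPS", "{}"))
hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1"]'))
num_gpus = int(os.environ.get("SM_NUM_GPUS", "0"))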
Documents for SageMaker Docker Container Directory Structure
Running a container for Amazon SageMaker training
/opt/ml
├── input
│ ├── config
│ │ ├── hyperparameters.json
│ │ └── resourceConfig.json
│ └── data
│ └── <channel_name>
│ └── <input data>
├── model
│ └── <model files>
└── output
└── failure
This document explains the directory structure and purpose of each directory.
The input
/opt/ml/input/config contains information to control how your program runs. hyperparameters.json is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. resourceConfig.json is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn’t support distributed training, we’ll ignore it here.
/opt/ml/input/data/<channel_name>/ (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it’s generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
/opt/ml/input/data/<channel_name>_<epoch_number> (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
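Per the note above that hyperparameter values always arrive as strings, here is a minimal sketch of reading and casting them directly from the config file; the hyperparameter names mirror the example later in this post:
import json

# hyperparameters.json always stores values as strings, so cast them explicitly.
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)

epochs = int(hyperparameters.get("epochs", "10"))
batch_size = int(hyperparameters.get("batch-size", "64"))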
The output
/opt/ml/model/ is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the DescribeTrainingJob result.
/opt/ml/output is a directory where the algorithm can write a file failure that describes why the job failed. The contents of this file will be returned in the FailureReason field of the DescribeTrainingJob result. For jobs that succeed, there is no reason to write this file as it will be ignored.
However, this is not up to date, as it is based on SageMaker Containers, which is obsolete.
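The failure file itself is a contract of the training platform rather than of the deprecated library, so a script can still populate it. A minimal sketch (the training body is a placeholder):
import sys
import traceback

def main():
    pass  # training logic goes here

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Whatever is written here is surfaced in the FailureReason field
        # of the DescribeTrainingJob response.
        with open("/opt/ml/output/failure", "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)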
Documents for Model Saving
The information on where the trained model should be saved and in what format is fundamentally missing. The training script needs to save the model under /opt/ml/model, and the format and sub-directory structure depend on the framework, e.g. TensorFlow or PyTorch. This is because SageMaker deployment uses framework-dependent model serving, e.g. TensorFlow Serving for the TensorFlow framework.
This is not clearly documented and causes confusion. The developer needs to know which format to use and under which sub-directory to save the model.
To use TensorFlow Estimator training and deployment:
Deploy the trained model
Because we’re using TensorFlow Serving for deployment, our training script saves the model in TensorFlow’s SavedModel format.
amazon-sagemaker-examples/frameworks/tensorflow/code/train.py
# Save the model
# A version number is needed for the serving container
# to load the model
version = "00000000"
ckpt_dir = os.path.join(args.model_dir, version)
if not os.path.exists(ckpt_dir):
    os.makedirs(ckpt_dir)
model.save(ckpt_dir)
The code saves the model under /opt/ml/model/00000000 because TensorFlow Serving expects a numeric version sub-directory.
Using the SavedModel format
The save-path follows a convention used by TensorFlow Serving where the last path component (1/ here) is a version number for your model - it allows tools like Tensorflow Serving to reason about the relative freshness.
Train and serve a TensorFlow model with TensorFlow Serving
To load our trained model into TensorFlow Serving we first need to save it in SavedModel format. This will create a protobuf file in a well-defined directory hierarchy, and will include a version number. TensorFlow Serving allows us to select which version of a model, or "servable" we want to use when we make inference requests. Each version will be exported to a different sub-directory under the given path.
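Assuming the version sub-directory convention above, the contents of /opt/ml/model (and therefore of the resulting model.tar.gz) would look roughly like this:
/opt/ml/model
└── 00000000                 <--- version sub-directory expected by TensorFlow Serving
    ├── saved_model.pb
    ├── assets/
    └── variables/
        ├── variables.data-00000-of-00001
        └── variables.index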
Documents for API
The SageMaker SDK Estimator essentially wraps the CreateTrainingJob API for the training part. Hence, it is better to understand how that API is designed and which parameters need to be defined; otherwise, working with Estimators is like walking in the dark.
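One concrete way to see the mapping is to describe the training job that fit() created and compare the response fields with the Estimator arguments. A sketch (the job name is taken from the example run later in this post; substitute your own):
import boto3

sm = boto3.client("sagemaker")

job = sm.describe_training_job(
    TrainingJobName="fashion-mnist-2021-09-03-03-02-02-305"
)

print(job["HyperParameters"])                         # Estimator hyperparameters
print(job["AlgorithmSpecification"]["TrainingImage"])  # framework container image
print(job["OutputDataConfig"]["S3OutputPath"])        # where model.tar.gz goes
print(job["ModelArtifacts"]["S3ModelArtifacts"])      # the saved model.tar.gz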
Example
Jupyter Notebook
import sagemaker
from sagemaker import get_execution_role
sagemaker_session = sagemaker.Session()
role = get_execution_role()
bucket = sagemaker_session.default_bucket()
metric_definitions = [
    {"Name": "train:loss", "Regex": ".*loss: ([0-9\\.]+) - accuracy: [0-9\\.]+.*"},
    {"Name": "train:accuracy", "Regex": ".*loss: [0-9\\.]+ - accuracy: ([0-9\\.]+).*"},
    {
        "Name": "validation:accuracy",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: ([0-9\\.]+).*",
    },
    {
        "Name": "validation:loss",
        "Regex": ".*step - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: ([0-9\\.]+) - val_accuracy: [0-9\\.]+.*",
    },
    {
        "Name": "sec/sample",
        "Regex": ".* - \d+s (\d+)[mu]s/sample - loss: [0-9\\.]+ - accuracy: [0-9\\.]+ - val_loss: [0-9\\.]+ - val_accuracy: [0-9\\.]+",
    },
]
import uuid
checkpoint_s3_prefix = "checkpoints/{}".format(str(uuid.uuid4()))
checkpoint_s3_uri = "s3://{}/{}/".format(bucket, checkpoint_s3_prefix)
from sagemaker.tensorflow import TensorFlow
# --------------------------------------------------------------------------------
# 'trainingJobName' must satisfy regular expression pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}
# --------------------------------------------------------------------------------
base_job_name = "fashion-mnist"
hyperparameters = {
    "epochs": 2,
    "batch-size": 64
}
estimator = TensorFlow(
    entry_point="fashion_mnist.py",
    source_dir="src",
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    role=role,
    input_mode='File',
    framework_version="2.3.1",
    py_version="py37",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    base_job_name=base_job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    model_dir=False
)
estimator.fit()
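# After fit() completes, the S3 URI of the archived model.tar.gz created from
# /opt/ml/model is available on the estimator (a small addition to the original example):
print(estimator.model_data)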
fashion_mnist.py
import os
import argparse
import json
import multiprocessing
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras import backend as K
print("TensorFlow version: {}".format(tf.__version__))
print("Eager execution is: {}".format(tf.executing_eagerly()))
print("Keras version: {}".format(tf.keras.__version__))
image_width = 28
image_height = 28
def load_data():
    fashion_mnist = tf.keras.datasets.fashion_mnist
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    number_of_classes = len(set(y_train))
    print("number_of_classes", number_of_classes)

    x_train = x_train / 255.0
    x_test = x_test / 255.0
    x_full = np.concatenate((x_train, x_test), axis=0)
    print(x_full.shape)
    print(type(x_train))
    print(x_train.shape)
    print(x_train.dtype)
    print(y_train.shape)
    print(y_train.dtype)

    # ## Train
    # * C: Convolution layer
    # * P: Pooling layer
    # * B: Batch normalization layer
    # * F: Fully connected layer
    # * O: Output fully connected softmax layer

    # Reshape data based on channels first / channels last strategy.
    # This is dependent on whether you use TF, Theano or CNTK as backend.
    # Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
    if K.image_data_format() == 'channels_first':
        x_train = x_train.reshape(x_train.shape[0], 1, image_width, image_height)
        x_test = x_test.reshape(x_test.shape[0], 1, image_width, image_height)
        input_shape = (1, image_width, image_height)
    else:
        x_train = x_train.reshape(x_train.shape[0], image_width, image_height, 1)
        x_test = x_test.reshape(x_test.shape[0], image_width, image_height, 1)
        input_shape = (image_width, image_height, 1)

    return x_train, y_train, x_test, y_test, input_shape, number_of_classes
# tensorboard --logdir=/full_path_to_your_logs
validation_split = 0.2
verbosity = 1
use_multiprocessing = True
workers = multiprocessing.cpu_count()
def train(model, x, y, args):
    # SavedModel Output
    tensorflow_saved_model_path = os.path.join(args.model_dir, "tensorflow/saved_model/0")
    os.makedirs(tensorflow_saved_model_path, exist_ok=True)

    # Tensorboard Logs
    tensorboard_logs_path = os.path.join(args.model_dir, "tensorboard/")
    os.makedirs(tensorboard_logs_path, exist_ok=True)

    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=tensorboard_logs_path,
        write_graph=True,
        write_images=True,
        histogram_freq=1,     # How often to log histogram visualizations
        embeddings_freq=1,    # How often to log embedding visualizations
        update_freq="epoch",  # How often to write logs (default: once per epoch)
    )

    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=['accuracy']
    )
    history = model.fit(
        x,
        y,
        shuffle=True,
        batch_size=args.batch_size,
        epochs=args.epochs,
        validation_split=validation_split,
        use_multiprocessing=use_multiprocessing,
        workers=workers,
        verbose=verbosity,
        callbacks=[
            tensorboard_callback
        ]
    )
    return history
def create_model(input_shape, number_of_classes):
    model = Sequential([
        Conv2D(
            name="conv01",
            filters=32,
            kernel_size=(3, 3),
            strides=(1, 1),
            padding="same",
            activation='relu',
            input_shape=input_shape
        ),
        MaxPooling2D(
            name="pool01",
            pool_size=(2, 2)
        ),
        Flatten(),  # 3D shape to 1D.
        BatchNormalization(
            name="batch_before_full01"
        ),
        Dense(
            name="full01",
            units=300,
            activation="relu"
        ),  # Fully connected layer
        Dense(
            name="output_softmax",
            units=number_of_classes,
            activation="softmax"
        )
    ])
    return model
def save_model(model, args):
    # Save the model
    # A version number is needed for the serving container
    # to load the model
    version = "00000000"
    model_save_dir = os.path.join(args.model_dir, version)
    if not os.path.exists(model_save_dir):
        os.makedirs(model_save_dir)
    print(f"saving model at {model_save_dir}")
    model.save(model_save_dir)
def parse_args():
    # --------------------------------------------------------------------------------
    # https://docs.python.org/dev/library/argparse.html#dest
    # --------------------------------------------------------------------------------
    parser = argparse.ArgumentParser()

    # --------------------------------------------------------------------------------
    # Values in the hyperparameters Estimator argument are passed as command-line
    # arguments to the script.
    # --------------------------------------------------------------------------------
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)

    # /opt/ml/model
    # sagemaker.tensorflow.estimator.TensorFlow overrides 'model_dir'.
    # See https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/\
    # sagemaker.tensorflow.html#sagemaker.tensorflow.estimator.TensorFlow
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])

    # /opt/ml/output
    parser.add_argument("--output_dir", type=str, default=os.environ["SM_OUTPUT_DIR"])

    args = parser.parse_args()
    return args
if __name__ == "__main__":
args = parse_args()
print("---------- key/value args")
for key, value in vars(args).items():
print(f"{key}:{value}")
x_train, y_train, x_test, y_test, input_shape, number_of_classes = load_data()
model = create_model(input_shape, number_of_classes)
history = train(model=model, x=x_train, y=y_train, args=args)
print(history)
save_model(model, args)
results = model.evaluate(x_test, y_test, batch_size=100)
print("test loss, test accuracy:", results)
SageMaker Console
Notebook output
2021-09-03 03:02:04 Starting - Starting the training job...
2021-09-03 03:02:16 Starting - Launching requested ML instancesProfilerReport-1630638122: InProgress
......
2021-09-03 03:03:17 Starting - Preparing the instances for training.........
2021-09-03 03:04:59 Downloading - Downloading input data
2021-09-03 03:04:59 Training - Downloading the training image...
2021-09-03 03:05:23 Training - Training image download completed. Training in progress.2021-09-03 03:05:23.966037: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-09-03 03:05:23.969704: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-09-03 03:05:24.118054: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-09-03 03:05:26,842 sagemaker-training-toolkit INFO Imported framework sagemaker_tensorflow_container.training
2021-09-03 03:05:26,852 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:27,734 sagemaker-training-toolkit INFO Installing dependencies from requirements.txt:
/usr/local/bin/python3.7 -m pip install -r requirements.txt
WARNING: You are using pip version 21.0.1; however, version 21.2.4 is available.
You should consider upgrading via the '/usr/local/bin/python3.7 -m pip install --upgrade pip' command.
2021-09-03 03:05:29,028 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,045 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,062 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2021-09-03 03:05:29,072 sagemaker-training-toolkit INFO Invoking user script
Training Env:
{
"additional_framework_parameters": {},
"channel_input_dirs": {},
"current_host": "algo-1",
"framework_module": "sagemaker_tensorflow_container.training:main",
"hosts": [
"algo-1"
],
"hyperparameters": {
"batch-size": 64,
"epochs": 2
},
"input_config_dir": "/opt/ml/input/config",
"input_data_config": {},
"input_dir": "/opt/ml/input",
"is_master": true,
"job_name": "fashion-mnist-2021-09-03-03-02-02-305",
"log_level": 20,
"master_hostname": "algo-1",
"model_dir": "/opt/ml/model",
"module_dir": "s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz",
"module_name": "fashion_mnist",
"network_interface_name": "eth0",
"num_cpus": 4,
"num_gpus": 0,
"output_data_dir": "/opt/ml/output/data",
"output_dir": "/opt/ml/output",
"output_intermediate_dir": "/opt/ml/output/intermediate",
"resource_config": {
"current_host": "algo-1",
"hosts": [
"algo-1"
],
"network_interface_name": "eth0"
},
"user_entry_point": "fashion_mnist.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch-size":64,"epochs":2}
SM_USER_ENTRY_POINT=fashion_mnist.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=fashion_mnist
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch-size":64,"epochs":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"fashion-mnist-2021-09-03-03-02-02-305","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-east-1-316725000538/fashion-mnist-2021-09-03-03-02-02-305/source/sourcedir.tar.gz","module_name":"fashion_mnist","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"fashion_mnist.py"}
SM_USER_ARGS=["--batch-size","64","--epochs","2"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_BATCH-SIZE=64
SM_HP_EPOCHS=2
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages
Invoking script with the following command:
/usr/local/bin/python3.7 fashion_mnist.py --batch-size 64 --epochs 2
TensorFlow version: 2.3.1
Eager execution is: True
Keras version: 2.4.0
---------- key/value args
epochs:2
batch_size:64
model_dir:/opt/ml/model
output_dir:/opt/ml/output
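Once the job completes, one way to verify what was saved (a sketch, not part of the original notebook) is to download the archive the estimator points at and list its contents:
import tarfile
import boto3

# estimator.model_data holds the s3://.../output/model.tar.gz URI for the last job.
model_uri = estimator.model_data
bucket_name, key = model_uri.replace("s3://", "").split("/", 1)

boto3.client("s3").download_file(bucket_name, key, "model.tar.gz")
with tarfile.open("model.tar.gz") as tar:
    tar.list()  # expect the 00000000/ SavedModel directory written by save_model()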
Related
start, monitor and define script of SageMaker processing job from local machine
Advice/Guidance - composer/beam/dataflow on gcp
Cannot deploy a small transformers model for prediction serving using Google Cloud AI Platform due to "Model requires more memory than allowed"
I've had trouble training a model in AWS SageMaker, everything is fine until the model needs to be saved
Script mode py3 and lack of output in s3 after successful training