How to pass the experiment configuration to a SageMakerTrainingOperator while training?

Idea:
Use SageMaker Experiments and Trials to log the training parameters and artifacts in SageMaker while using MWAA as the pipeline orchestrator.
I am using training_config to build the dict that passes the training configuration to the TensorFlow estimator, but there is no parameter for passing the experiment configuration:
tf_estimator = TensorFlow(entry_point='train_model.py',
                          source_dir=source,
                          role=sagemaker.get_execution_role(),
                          instance_count=1,
                          framework_version='2.3.0',
                          instance_type=instance_type,
                          py_version='py37',
                          script_mode=True,
                          enable_sagemaker_metrics=True,
                          metric_definitions=metric_definitions,
                          output_path=output)
model_training_config = training_config(
    estimator=tf_estimator,
    inputs=input,
    job_name=training_jobname,
)
training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=model_training_config,
    aws_conn_id="airflow-sagemaker",
    print_log=True,
    wait_for_completion=True,
    check_interval=60
)

You can use the experiment_config parameter of estimator.fit(). A more detailed example can be found here.
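For example, a minimal sketch (the experiment and trial names below are placeholders and must already exist in SageMaker Experiments):

tf_estimator.fit(
    inputs=input,
    job_name=training_jobname,
    experiment_config={
        "ExperimentName": "my-experiment",        # placeholder
        "TrialName": "my-trial",                  # placeholder
        "TrialComponentDisplayName": "training",
    },
)

Note that this applies when fit() is called directly; it does not by itself route the experiment config through training_config()/SageMakerTrainingOperator.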

The only way that I found right now is to use the CreateTrainingJob API (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-RoleArn). A few things to note:
I am not sure whether this works with the "bring your own script" method, e.g. with a TensorFlow estimator.
It works with a "build your own container" approach.
Using the CreateTrainingJob API shape, I created the config dict myself, which in turn includes all the needed configs (training, experiment, algorithm, etc.) and passed that to SageMakerTrainingOperator, as sketched below.
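A rough sketch of that approach, for illustration only (every name, role ARN, image URI and S3 path below is a placeholder; the dict keys follow the CreateTrainingJob request shape):

model_training_config = {
    "TrainingJobName": training_jobname,
    "AlgorithmSpecification": {
        # placeholder ECR image for a "build your own container" setup
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/my-training-image:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::<account>:role/my-sagemaker-execution-role",  # placeholder
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/train/",  # placeholder
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},  # placeholder
    "ResourceConfig": {
        "InstanceType": instance_type,
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    # The experiment configuration the question asks about:
    "ExperimentConfig": {
        "ExperimentName": "my-experiment",        # placeholder, must exist
        "TrialName": "my-trial",                  # placeholder, must exist
        "TrialComponentDisplayName": "training",
    },
}

# Then pass it to SageMakerTrainingOperator(config=model_training_config, ...)
# exactly as in the DAG snippet above.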

Related

Naming a Sagemaker Processing job using Sagemaker Pipelines ProcessingStep

I am running a Sagemaker Pipeline with the current processor:
from sagemaker.sklearn.processing import SKLearnProcessor
framework_version = "0.23-1"
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="pre-processing-job-name",
    role=role,
)
and the processing step is:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
step_process = ProcessingStep(
    name="AbaloneProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="abalone/preprocessing.py",
)
It looks like base_job_name does nothing, because the processing job that gets created is named pipelines-o6e2jn38g05j-AbaloneProcess-nc2OlXF8jA.
I want to define the processing job name manually. Does SageMaker Pipelines support this? I seem to be going around in circles.
Pipelines sets the processing job name to preserve caching behavior.
An override could be added here to allow manually defining job names, but that has been intentionally left out. Because SageMaker job names must be unique, this would become a sharp edge for users who statically define the name without realizing that any pipeline execution after the first will fail with ResourceAlreadyExists.
I'd suggest filing an issue explaining the use case and seeing if the team picks it up.

Pytorch Lightning Tensorboard Logger Across Multiple Models

I'm relatively new to Lightning and Loggers vs manually tracking metrics. I am trying to train two distinct models and have their accuracy and loss plotted on the same charts in tensorboard (or any other logger) within Colab.
What I have right now is basically:
trainer1 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer2 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer1.fit(Model1, train_loader, val_loader)
trainer2.fit(Model2, train_loader, val_loader)
#Then later:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
What I'd like to see at this point are those logged metrics charted together on the same chart, any help would be appreciated. I've spent some time trying to toy with this but I'm a bit out of my depth on this, thank you!
Which chart is used for a given metric depends on the key name you provide in the .log() call (it's a feature that Lightning inherits from TensorBoard itself):
def validation_step(self, batch, _):
    # This string decides which chart to use in the TB web interface
    # vvvvvvvvv
    self.log('valid_acc', acc)
Just use the same string for both .log() calls and have both runs saved in the same directory:
logger1 = TensorBoardLogger(save_dir='lightning_logs/', name='model1')
logger2 = TensorBoardLogger(save_dir='lightning_logs/', name='model2')
If you run tensorboard --logdir ./lightning_logs, pointing at the parent directory, you should be able to see both metrics in the same chart with the key named valid_acc.
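Putting it together, a minimal sketch that wires those loggers into the two trainers (reusing n_gpus, n_epochs, the models, and the loaders from the question):

import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# One logger per model, both writing under the same parent directory
logger1 = TensorBoardLogger(save_dir='lightning_logs/', name='model1')
logger2 = TensorBoardLogger(save_dir='lightning_logs/', name='model2')

trainer1 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, logger=logger1)
trainer2 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, logger=logger2)

trainer1.fit(Model1, train_loader, val_loader)
trainer2.fit(Model2, train_loader, val_loader)

# Both runs log 'valid_acc', so TensorBoard overlays them on one chart.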

Use TensorBoard with Keras Tuner

I ran into an apparent circular dependency trying to log data for TensorBoard during a hyper-parameter search done with Keras Tuner, for a model built with TF2. The typical setup for the latter passes the TensorBoard callback to the tuner's search() method, which wraps the model's fit() method:
from kerastuner.tuners import RandomSearch
tuner = RandomSearch(build_model,  # this method builds the model
                     hyperparameters=hp,
                     objective='val_accuracy')
tuner.search(x=train_x, y=train_y,
             validation_data=(val_x, val_y),
             callbacks=[tensorboard_cb])
In practice, the tensorboard_cb callback needs to set up the directory where data will be logged, and this directory has to be unique to each trial. A common way to do this is to name the directory after the current timestamp, with code like below.
log_dir = time.strftime('trial_%Y_%m_%d-%H_%M_%S')
tensorboard_cb = TensorBoard(log_dir)
This works when training a model with known hyper-parameters. However, when doing a hyper-parameter search, I have to define and specify the TensorBoard callback before invoking tuner.search(). This is the problem: tuner.search() will invoke build_model() multiple times, and each of these trials should have its own TensorBoard directory. Ideally log_dir would be defined inside build_model(), but the Keras Tuner search API forces the TensorBoard callback to be defined outside of that function.
TL;DR: TensorBoard gets its data through a callback and requires one log directory per trial, but Keras Tuner requires the callback to be defined once for the entire search, before performing it, not per trial. How can unique directories per trial be defined in this case?
Keras Tuner creates a subdirectory for each run (this is probably version dependent).
I guess finding the right version mix is important.
Here is how it works for me, in JupyterLab.
Prerequisites:
1. pip requirements:
keras-tuner==1.0.1
tensorboard==2.1.1
tensorflow==2.1.0
Keras==2.2.4
jupyterlab==1.1.4
2. JupyterLab installed, built and running [standard compile arguments: production:minimize]
Here is the actual code. First I define the log folder and the callbacks:
import datetime
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

# run parameter
log_dir = "logs/" + datetime.datetime.now().strftime("%m%d-%H%M")

# training meta
stop_callback = EarlyStopping(
    monitor='loss', patience=1, verbose=0, mode='auto')
hist_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    embeddings_freq=1,
    write_graph=True,
    update_freq='batch')
print("log_dir", log_dir)
Then I define my hypermodel, which I do not want to disclose. Afterwards I set up the hyper-parameter search:
from kerastuner.tuners import Hyperband
hypermodel = get_my_hypermodel()
tuner = Hyperband(
    hypermodel,
    max_epochs=40,
    objective='loss',
    executions_per_trial=5,
    directory=log_dir,
    project_name='test'
)
which I then execute:
tuner.search(
    train_data,
    labels,
    epochs=10,
    validation_data=(val_data, val_labels),
    callbacks=[hist_callback],
    use_multiprocessing=True)
tuner.search_space_summary()
While the notebook with this code searches for adequate hyper-parameters, I monitor the loss in another notebook. Since TF v2, TensorBoard can be called via a magic function:
Cell 1
import tensorboard
Cell 2
%load_ext tensorboard
Cell 3
%tensorboard --logdir 'logs/'
Side note: since I run JupyterLab in a Docker container, I have to specify the appropriate address and port for TensorBoard and also forward this in the Dockerfile.
The result is not really predictable for me... I have not yet understood when I can expect histograms and distributions in TensorBoard.
For some runs the loading time seems really excessive, so have patience.
Under Scalars I find a list of the runs as follows:
"logdir"/"model_hash"/execution[iter]/[train/validation]
E.g.
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/train
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/validation
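For completeness, another version-dependent pattern that is sometimes used for per-trial log directories is to subclass the tuner and attach a fresh TensorBoard callback per trial. A minimal sketch (the MyHyperband name is illustrative; trial.trial_id comes from Keras Tuner):

import tensorflow as tf
from kerastuner.tuners import Hyperband

class MyHyperband(Hyperband):
    def run_trial(self, trial, *fit_args, **fit_kwargs):
        # Give every trial its own TensorBoard log directory
        callbacks = list(fit_kwargs.pop('callbacks', []))
        callbacks.append(
            tf.keras.callbacks.TensorBoard(log_dir="logs/" + trial.trial_id))
        fit_kwargs['callbacks'] = callbacks
        return super().run_trial(trial, *fit_args, **fit_kwargs)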

Can I add images to Tensorboard through Keras?

I have set up TensorBoard with Keras as a callback, like so:
callback_tb = keras.callbacks.TensorBoard(log_dir=tb_dir, histogram_freq=2, write_graph=True, write_images=True)
callbacks_list = [callback_save, callback_view, callback_tb]
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=callbacks_list,
          verbose=1,
          validation_data=(x_test, y_test),
          shuffle='batch')
This works fine and I can see loss and accuracy graphs on Tensorboard.
I am generating and saving model predictions in another file, but I want to know whether it is possible to view those images in TensorBoard with Keras.
I have found the tf.summary.image function at https://github.com/tensorflow/tensorboard,
but I don't understand how it relates to Keras.
Any help would be appreciated.
I fixed my problem by creating my own Keras callback, based on the TensorBoard callback, in which I could use the tf.summary.image feature.
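A minimal TF2-style sketch of such a callback (the class and argument names are illustrative, and it assumes the model's predictions are image-like tensors with values in [0, 1]):

import tensorflow as tf
from tensorflow import keras

class PredictionImageLogger(keras.callbacks.Callback):
    """Writes a few model predictions to TensorBoard as images each epoch."""

    def __init__(self, log_dir, sample_inputs, max_outputs=4):
        super().__init__()
        self.writer = tf.summary.create_file_writer(log_dir + "/images")
        self.sample_inputs = sample_inputs  # small batch to visualize
        self.max_outputs = max_outputs

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.sample_inputs)
        # preds must be shaped (batch, height, width, channels) for tf.summary.image
        with self.writer.as_default():
            tf.summary.image("predictions", preds, step=epoch,
                             max_outputs=self.max_outputs)
        self.writer.flush()

# e.g. callbacks_list = [callback_save, callback_view, callback_tb,
#                        PredictionImageLogger(tb_dir, x_test[:4])]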

How to log from Python Luigi

I am trying to build a strategy for logging from Luigi such that there is a configurable list of outputs, including stdout and a custom list of files. I would also like to be able to set the logging level at runtime. Our system uses Luigi to call Spark from Jenkins. Thank you in advance.
Inside any of the Task class methods, you can do:
import logging
import luigi

class Agg(luigi.Task):
    _date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("file_%s.txt" % self._date)

    def run(self):
        # Use the luigi-interface logger to log to the console
        logger = logging.getLogger('luigi-interface')
        logger.info("Running --> Agg.Task")
Have you checked the logging_conf_file parameter of the configuration? You can set up all your logging configuration there, using Python's standard logging mechanism.
For some examples see:
https://groups.google.com/forum/#!topic/luigi-user/nKJYq_ng5y8
https://groups.google.com/forum/#!topic/luigi-user/N83ZeePuqJk
https://github.com/spotify/luigi/issues/1401