Naming a SageMaker Processing job using SageMaker Pipelines ProcessingStep

I am running a SageMaker Pipeline with the following processor:
from sagemaker.sklearn.processing import SKLearnProcessor

framework_version = "0.23-1"
sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="pre-processing-job-name",
    role=role,
)
and the processing step is:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="AbaloneProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="abalone/preprocessing.py",
)
It looks like base_job_name does nothing: the processing job that gets created is named pipelines-o6e2jn38g05j-AbaloneProcess-nc2OlXF8jA.
I want to define the processing job name manually. Does SageMaker Pipelines support this? I seem to be going around in circles.

Pipelines sets the processing job name itself in order to preserve caching behavior.
An override could be added to allow manually defining job names, but that has been intentionally left out. Because SageMaker job names must be unique, it would become a sharp edge for users who statically define the name without realizing that any pipeline execution after the first will fail with ResourceAlreadyExists.
I'd suggest filing an issue explaining the use case and seeing if the team picks it up.
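As a point of comparison (not part of the original answer), outside of Pipelines the base_job_name prefix is honored: calling the processor's run() method directly names the job from that prefix plus a timestamp suffix. A minimal sketch reusing the objects defined in the question:
# Running the processor directly (no ProcessingStep) picks up base_job_name.
sklearn_processor.run(
    code="abalone/preprocessing.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)
# The resulting job is named roughly "pre-processing-job-name-<timestamp>".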

Related

How to pass the experiment configuration to a SageMakerTrainingOperator while training?

Idea:
Use SageMaker Experiments and Trials to log the training parameters and artifacts while using MWAA as the pipeline orchestrator.
I am using training_config to build the dict that passes the training configuration to the TensorFlow estimator, but there is no parameter to pass the experiment configuration:
tf_estimator = TensorFlow(
    entry_point='train_model.py',
    source_dir=source,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    framework_version='2.3.0',
    instance_type=instance_type,
    py_version='py37',
    script_mode=True,
    enable_sagemaker_metrics=True,
    metric_definitions=metric_definitions,
    output_path=output,
)

model_training_config = training_config(
    estimator=tf_estimator,
    inputs=input,
    job_name=training_jobname,
)
training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=model_training_config,
    aws_conn_id="airflow-sagemaker",
    print_log=True,
    wait_for_completion=True,
    check_interval=60,
)
You can use the experiment_config argument of estimator.fit(). A more detailed example can be found here.
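A minimal sketch of what that looks like when calling fit() directly (the experiment and trial names below are placeholders, not values from the original post, and must already exist or be created beforehand):
tf_estimator.fit(
    inputs=input,
    job_name=training_jobname,
    experiment_config={
        "ExperimentName": "my-experiment",         # placeholder name
        "TrialName": "my-trial",                   # placeholder name
        "TrialComponentDisplayName": "training",
    },
)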
The only way that I found right now is to use the CreateTrainingJob API (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-RoleArn). A few notes:
I am not sure if this will work with the bring-your-own-script method, e.g. with a TensorFlow estimator.
It works with a build-your-own-container approach.
Using the CreateTrainingJob API I created the config, which in turn includes all the needed configs (training, experiment, algorithm, etc.), and passed that to SageMakerTrainingOperator, as sketched below.
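A rough sketch of that approach, building the CreateTrainingJob request dict by hand so the ExperimentConfig field can be included (the image URI, volume size, timeout, and experiment/trial names are placeholders, not values from the original post):
training_job_config = {
    "TrainingJobName": training_jobname,
    "AlgorithmSpecification": {
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/<image>:latest",  # placeholder
        "TrainingInputMode": "File",
    },
    "RoleArn": sagemaker.get_execution_role(),
    "OutputDataConfig": {"S3OutputPath": output},
    "ResourceConfig": {
        "InstanceType": instance_type,
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    "ExperimentConfig": {
        "ExperimentName": "my-experiment",         # placeholder
        "TrialName": "my-trial",                   # placeholder
        "TrialComponentDisplayName": "training",
    },
}

training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=training_job_config,
    aws_conn_id="airflow-sagemaker",
    print_log=True,
    wait_for_completion=True,
    check_interval=60,
)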

Pytorch Lightning Tensorboard Logger Across Multiple Models

I'm relatively new to Lightning and Loggers vs manually tracking metrics. I am trying to train two distinct models and have their accuracy and loss plotted on the same charts in tensorboard (or any other logger) within Colab.
What I have right now is basically:
trainer1 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer2 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer1.fit(Model1, train_loader, val_loader)
trainer2.fit(Model2, train_loader, val_loader)
#Then later:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
What I'd like to see at this point is those logged metrics charted together on the same chart; any help would be appreciated. I've spent some time toying with this but I'm a bit out of my depth here, thank you!
The exact chart used for logging a specific metric depends on the key name you provide in the .log() call (it's a feature that Lightning inherits from TensorBoard itself):
def validation_step(self, batch, _):
    # This string decides which chart to use in the TB web interface
    #         vvvvvvvvv
    self.log('valid_acc', acc)
Just use the same string for both .log() calls and have both runs saved in the same parent directory:
logger1 = TensorBoardLogger(save_dir='lightning_logs/', name='model1')
logger2 = TensorBoardLogger(save_dir='lightning_logs/', name='model2')
If you run tensorboard --logdir ./lightning_logs pointing at the parent directory, you should be able to see both metrics in the same chart with the key named valid_acc.
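Putting it together, a minimal sketch of wiring each logger to its own Trainer (reusing the names from the question; this is an illustration, not a verbatim copy of the answerer's setup):
from pytorch_lightning.loggers import TensorBoardLogger

logger1 = TensorBoardLogger(save_dir='lightning_logs/', name='model1')
logger2 = TensorBoardLogger(save_dir='lightning_logs/', name='model2')

# Each trainer writes its events under lightning_logs/<name>/version_x/
trainer1 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, logger=logger1)
trainer2 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, logger=logger2)

trainer1.fit(Model1, train_loader, val_loader)
trainer2.fit(Model2, train_loader, val_loader)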

Use TensorBoard with Keras Tuner

I ran into an apparent circular dependency trying to use log data for TensorBoard during a hyper-parameter search done with Keras Tuner, for a model built with TF2. The typical setup for the latter requires setting up the TensorBoard callback in the tuner's search() method, which wraps the model's fit() method.
from kerastuner.tuners import RandomSearch

tuner = RandomSearch(build_model,  # this function builds the model
                     hyperparameters=hp,
                     objective='val_accuracy')
tuner.search(x=train_x, y=train_y,
             validation_data=(val_x, val_y),
             callbacks=[tensorboard_cb])
In practice, the tensorboard_cb callback needs to set up the directory where data will be logged, and this directory has to be unique to each trial. A common way to do this is to name the directory based on the current timestamp, with code like below.
log_dir = time.strftime('trial_%Y_%m_%d-%H_%M_%S')
tensorboard_cb = TensorBoard(log_dir)
This works when training a model with known hyper-parameters. However, when doing a hyper-parameter search, I have to define and specify the TensorBoard callback before invoking tuner.search(). This is the problem: tuner.search() will invoke build_model() multiple times, and each of these trials should have its own TensorBoard directory. Ideally, defining log_dir would be done inside build_model(), but the Keras Tuner search API forces the TensorBoard callback to be defined outside of that function.
TL;DR: TensorBoard gets its data through a callback and requires one log directory per trial, but Keras Tuner requires defining the callback once for the entire search, before performing it, not per trial. How can unique directories per trial be defined in this case?
Keras Tuner creates a subdirectory for each run (this statement is probably version dependent).
I guess finding the right version mix is important.
Here is how it works for me, in JupyterLab.
Prerequisites:
1. pip requirements
keras-tuner==1.0.1
tensorboard==2.1.1
tensorflow==2.1.0
Keras==2.2.4
jupyterlab==1.1.4
2. JupyterLab installed, built and running [standard compile arguments: production:minimize]
Here is the actual code. First I define the log folder and the callbacks:
# run parameter
log_dir = "logs/" + datetime.datetime.now().strftime("%m%d-%H%M")

# training meta
stop_callback = EarlyStopping(
    monitor='loss', patience=1, verbose=0, mode='auto')

hist_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    embeddings_freq=1,
    write_graph=True,
    update_freq='batch')

print("log_dir", log_dir)
Then I define my hypermodel, which I do not want to disclose. Afterwards I set up the hyper-parameter search:
from kerastuner.tuners import Hyperband

hypermodel = get_my_hpyermodel()
tuner = Hyperband(
    hypermodel,
    max_epochs=40,
    objective='loss',
    executions_per_trial=5,
    directory=log_dir,
    project_name='test'
)
which I then execute:
tuner.search(
    train_data,
    labels,
    epochs=10,
    validation_data=(val_data, val_labels),
    callbacks=[hist_callback],
    use_multiprocessing=True)

tuner.search_space_summary()
While the notebook with this code searches for adequate hyper-parameters, I monitor the loss in another notebook. Since TF v2, TensorBoard can be called via a magic function:
Cell 1
import tensorboard
Cell 2
%load_ext tensorboard
Cell 3
%tensorboard --logdir 'logs/'
Side note: since I run JupyterLab in a Docker container, I have to specify the appropriate address and port for TensorBoard and also forward them in the Dockerfile.
The result is not really predictable for me... I have not yet understood when I can expect histograms and distributions in TensorBoard.
For some runs the loading time seems really excessive, so have patience.
Under Scalars I find a list of the runs as follows:
"logdir"/"model_hash"/execution[iter]/[train/validation]
E.g.
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/train
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/validation

Enable AWS Glue Continuous Logging from create_job

I'm creating a Glue job using boto3 create_job. I would like to pass a parameter to enable continuous logging (no filter) for this new job.
Unfortunately, neither here nor here can I find any useful parameter to enable it.
Any suggestions?
Found a way to do it; similar to the other arguments, you just need to pass these log-related arguments in DefaultArguments, like:
glueClient.create_job(
    Name="testBoto",
    Role="Role_name",
    Command={
        'Name': "some_name",
        'ScriptLocation': "some_location"
    },
    DefaultArguments={
        "--enable-continuous-cloudwatch-log": "true",
        "--enable-continuous-log-filter": "true"
    }
)
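As a side note (not part of the original answer), the same special parameters can also be supplied per run through the Arguments field of start_job_run, which overrides the job's DefaultArguments for that run. A minimal sketch reusing the example job name above:
# Per-run override of the logging arguments for the job created above.
glueClient.start_job_run(
    JobName="testBoto",
    Arguments={
        "--enable-continuous-cloudwatch-log": "true",
        "--enable-continuous-log-filter": "true"
    }
)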

Should I be concerned about datastoreRpcErrors?

When I run Dataflow jobs that write to Google Cloud Datastore, I sometimes see the metrics show one or two datastoreRpcErrors:
Since these Datastore writes usually contain a batch of keys, I am wondering whether, in the case of an RpcError, some retry will happen automatically. If not, what would be a good way to handle these cases?
tl;dr: By default, datastoreRpcErrors are retried automatically, up to 5 times.
I dug into the code of datastoreio in the Beam Python SDK. It looks like the final entity mutations are flushed in batches via DatastoreWriteFn():
# Flush the current batch of mutations to Cloud Datastore.
_, latency_ms = helper.write_mutations(
    self._datastore, self._project, self._mutations,
    self._throttler, self._update_rpc_stats,
    throttle_delay=_Mutate._WRITE_BATCH_TARGET_LATENCY_MS/1000)
The RPCError is caught by this block of code in write_mutations in the helper; there is a decorator @retry.with_exponential_backoff on the commit method; the default number of retries is set to 5; and retry_on_rpc_error defines the concrete RPCError and SocketError reasons that trigger a retry.
for mutation in mutations:
    commit_request.mutations.add().CopyFrom(mutation)

@retry.with_exponential_backoff(num_retries=5,
                                retry_filter=retry_on_rpc_error)
def commit(request):
    # Client-side throttling.
    while throttler.throttle_request(time.time()*1000):
        ...  # wait while requests are being throttled
    try:
        response = datastore.commit(request)
        ...
    except (RPCError, SocketError):
        if rpc_stats_callback:
            rpc_stats_callback(errors=1)
        raise
    ...
I think you should first of all determine which kind of error occurred in order to see what your options are.
The official Datastore documentation has a list of all the possible errors and their error codes. Fortunately, they come with recommended actions for each.
My advice is that you implement their recommendations and look for alternatives if they are not effective for you; a generic retry pattern is sketched below.
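For illustration only, here is a generic client-side pattern for retrying retriable errors with exponential backoff. The exception types and the commit_with_backoff helper are hypothetical placeholders, not the Datastore library's own retry logic; substitute the concrete retriable exceptions from the client library you use.
import random
import time

# Placeholder set of exceptions treated as retriable; replace with the
# retriable error types of your Datastore client library.
RETRIABLE_ERRORS = (ConnectionError, TimeoutError)

def commit_with_backoff(commit_fn, max_attempts=5, base_delay=0.5):
    """Call commit_fn(), retrying retriable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return commit_fn()
        except RETRIABLE_ERRORS:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: 0.5s, 1s, 2s, ... plus noise.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage (sketch): commit_with_backoff(lambda: client.put_multi(entities))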