PyTorch Lightning TensorBoard Logger Across Multiple Models

I'm relatively new to Lightning and to loggers (vs. manually tracking metrics). I am trying to train two distinct models and have their accuracy and loss plotted on the same charts in TensorBoard (or any other logger) within Colab.
What I have right now is basically:
trainer1 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer2 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, progress_bar_refresh_rate=20, num_sanity_val_steps=0)
trainer1.fit(Model1, train_loader, val_loader)
trainer2.fit(Model2, train_loader, val_loader)
#Then later:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
What I'd like to see at this point are those logged metrics charted together on the same chart, any help would be appreciated. I've spent some time trying to toy with this but I'm a bit out of my depth on this, thank you!

The exact chart used for logging a specific metric depends on the key name you provide in the .log() call (it's a feature that Lightning inherits from TensorBoard itself).
def validation_step(self, batch, _):
    # This string decides which chart to use in the TB web interface
    #         vvvvvvvvv
    self.log('valid_acc', acc)
Just use the same string for both .log() calls and have both runs saved in the same directory.
from pytorch_lightning.loggers import TensorBoardLogger

logger1 = TensorBoardLogger(save_dir='lightning_logs/', name='model1')
logger2 = TensorBoardLogger(save_dir='lightning_logs/', name='model2')
If you run tensorboard --logdir ./lightning_logs, pointing at the parent directory, you should be able to see both metrics in the same chart under the key named valid_acc.
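For completeness, here is a minimal sketch of wiring those two loggers into the two trainers from the question (assuming Model1 and Model2 are LightningModule instances that both log 'valid_acc', and that n_gpus/n_epochs are defined as in the question):
trainer1 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, logger=logger1)
trainer2 = pl.Trainer(gpus=n_gpus, max_epochs=n_epochs, logger=logger2)
trainer1.fit(Model1, train_loader, val_loader)
trainer2.fit(Model2, train_loader, val_loader)
Both runs then end up under lightning_logs/, and selecting both runs in TensorBoard's run list overlays their valid_acc curves on one chart.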

Related

How to pass the experiment configuration to a SagemakerTrainingOperator while training?

Idea:
To use Experiments and Trials to log the training parameters and artifacts in SageMaker while using MWAA as the pipeline orchestrator.
I am using training_config to create the dict that passes the training configuration of the TensorFlow estimator to the operator, but there is no parameter to pass the experiment configuration:
tf_estimator = TensorFlow(
    entry_point='train_model.py',
    source_dir=source,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    framework_version='2.3.0',
    instance_type=instance_type,
    py_version='py37',
    script_mode=True,
    enable_sagemaker_metrics=True,
    metric_definitions=metric_definitions,
    output_path=output,
)

model_training_config = training_config(
    estimator=tf_estimator,
    inputs=input,
    job_name=training_jobname,
)
training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=model_training_config,
    aws_conn_id="airflow-sagemaker",
    print_log=True,
    wait_for_completion=True,
    check_interval=60
)
You can use the experiment_config argument in estimator.fit. A more detailed example can be found here.
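For reference, a hedged sketch of what that might look like when calling fit() on the estimator directly (the experiment and trial names below are placeholders, not from the original post):
tf_estimator.fit(
    inputs=input,
    job_name=training_jobname,
    experiment_config={
        "ExperimentName": "my-experiment",
        "TrialName": "my-trial",
        "TrialComponentDisplayName": "training",
    },
)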
The only way that I found right now is to use the CreateTrainingJob API (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#sagemaker-CreateTrainingJob-request-RoleArn). The following steps are needed:
I am not sure if this will work with the bring-your-own-script method, e.g. with a TensorFlow estimator.
It works with a build-your-own-container approach.
Using the CreateTrainingJob API I created the config, which in turn includes all the needed sub-configs (training, experiment, algorithm, etc.), and passed that to SageMakerTrainingOperator.
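To illustrate that idea, here is a hedged sketch of a hand-built CreateTrainingJob-style config passed straight to the operator (every image URI, S3 path, and experiment/trial name below is a placeholder assumption, not taken from the original setup):
create_training_job_config = {
    "TrainingJobName": training_jobname,
    "AlgorithmSpecification": {
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/my-training-image:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": sagemaker.get_execution_role(),
    "InputDataConfig": [{
        "ChannelName": "training",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    "OutputDataConfig": {"S3OutputPath": output},
    "ResourceConfig": {
        "InstanceType": instance_type,
        "InstanceCount": 1,
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    "ExperimentConfig": {
        "ExperimentName": "my-experiment",
        "TrialName": "my-trial",
    },
}

training_task = SageMakerTrainingOperator(
    task_id=test_id,
    config=create_training_job_config,
    aws_conn_id="airflow-sagemaker",
    print_log=True,
    wait_for_completion=True,
    check_interval=60,
)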

Update Dataflow Streaming job with Session and Sliding window embedded in DF

In my use case, I'm performing a Session window as well as a Sliding window inside a Dataflow job. My sliding window size is 10 hours with a 4-minute sliding period. Since I'm applying grouping and a max function on top of that, the window fires a pane every 3 minutes, which then goes into a Session window with triggering logic on it. Below is the code for the same.
Window<Map<String, String>> windowMap = Window.<Map<String, String>>into(
        SlidingWindows.of(Duration.standardHours(10)).every(Duration.standardMinutes(4)));

Window<Map<String, String>> windowSession = Window
        .<Map<String, String>>into(Sessions.withGapDuration(Duration.standardHours(10)))
        .discardingFiredPanes()
        .triggering(Repeatedly
                .forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(5))))
        .withAllowedLateness(Duration.standardSeconds(10));
I would like to add a logger on some steps for debugging, so I'm trying to update the current streaming job using the code below:
options.setRegion("asia-east1");
options.setUpdate(true);
options.setStreaming(true);
Previously I had around 10k elements, and after updating the existing pipeline with the above config I'm no longer able to see that much data in the steps of the updated Dataflow job. Please help me understand whether the update preserves the previous job's data or not, as I'm not seeing the previous step counts in the updated job.

Google App Engine deferred.defer task not getting executed

I have a Google App Engine Standard Environment application that has been working fine for a year or more, that, quite suddenly, refuses to enqueue new deferred tasks using deferred.defer.
Here's the Python 2.7 code that is making the deferred call:
# Find any inventory items that reference the product, and change them too.
# because this could take some time, we'll do it as a deferred task, and only
# if needed.
if upd:
    updater = deferredtasks.InvUpdate()
    deferred.defer(updater.run, product_key)
My app.yaml file has the necessary bits to support deferred.defer:
- url: /_ah/queue/deferred
  script: google.appengine.ext.deferred.deferred.application
  login: admin

builtins:
- deferred: on
And my deferred task has logging in it so I should see it running when it does:
#-------------------------------------------------------------------------------
# DEFERRED routine that updates the inventory items for a particular product. Should be called
# when ANY changes are made to the product, because it should trigger a re-download of the
# inventory record for that product to the iPad.
#-------------------------------------------------------------------------------
class InvUpdate(object):
    def __init__(self):
        self.to_put = []
        self.product_key = None
        self.updcount = 0

    def run(self, product_key, batch_size=100):
        updproduct = product_key.get()
        if not updproduct:
            logging.error("DEFERRED ERROR: Product key passed in does not exist")
            return
        logging.info(u"DEFERRED BEGIN: beginning inventory update for: {}".format(updproduct.name))
        self.product_key = product_key
        self._continue(None, batch_size)
...
When I run this in the development environment on my development box, everything works fine. Once I deploy it to the App Engine server, the inventory updates never get done (i.e. the deferred task is not executed), and there are no errors (and in fact no other logging from the deferred task) in the log files on the server. I know that, with the sudden push to get everybody onto Python 3 as quickly as possible, the deferred.defer library has been marked as not recommended because it only works in the Python 2.7 environment, and I planned on moving to task queues for this, but I wasn't expecting deferred.defer to suddenly stop working in the existing Python environment.
Any insight would be greatly appreciated!
I'm pretty sure you can't pass the method of an instance to the App Engine task queue, because that instance will not exist when your task runs, since it will be running in a different process. I actually don't understand how your task ever worked when running remotely in the first place (and running locally is not an accurate representation of how things will run remotely).
Try changing your code to this:
if upd:
    deferred.defer(deferredtasks.InvUpdate.run_cls, product_key)
and then InvUpdate is the same but has a new function run_cls:
class InvUpdate(object):
    @classmethod
    def run_cls(cls, product_key):
        cls().run(product_key)
And I'm still in the process of migrating to Cloud Tasks, and my deferred tasks still work.
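Another option, sketched here under the assumption that a picklable, module-level callable is all deferred.defer really needs, is to avoid the bound method entirely and defer a plain function defined in deferredtasks.py:
# in deferredtasks.py -- a module-level wrapper instead of a bound method
def inv_update_run(product_key, batch_size=100):
    InvUpdate().run(product_key, batch_size)

# at the call site
if upd:
    deferred.defer(deferredtasks.inv_update_run, product_key)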

Use TensorBoard with Keras Tuner

I ran into an apparent circular dependency trying to use log data for TensorBoard during a hyper-parameter search done with Keras Tuner, for a model built with TF2. The typical setup for the latter requires setting up the TensorBoard callback in the tuner's search() method, which wraps the model's fit() method.
from kerastuner.tuners import RandomSearch

tuner = RandomSearch(build_model,  # this method builds the model
                     hyperparameters=hp,
                     objective='val_accuracy')
tuner.search(x=train_x, y=train_y,
             validation_data=(val_x, val_y),
             callbacks=[tensorboard_cb])
In practice, the tensorboard_cb callback needs to set up the directory where data will be logged, and this directory has to be unique to each trial. A common way to do this is by naming the directory based on the current timestamp, with code like the below.
import time
from tensorflow.keras.callbacks import TensorBoard  # import assumed; tf.keras callback

log_dir = time.strftime('trial_%Y_%m_%d-%H_%M_%S')
tensorboard_cb = TensorBoard(log_dir)
This works when training a model with known hyper-parameters. However, when doing a hyper-parameter search, I have to define and specify the TensorBoard callback before invoking tuner.search(). This is the problem: tuner.search() will invoke build_model() multiple times, and each of these trials should have its own TensorBoard directory. Ideally, defining log_dir would be done inside build_model(), but the Keras Tuner search API forces the TensorBoard callback to be defined outside of that function.
TL;DR: TensorBoard gets data through a callback and requires one log directory per trial, but Keras Tuner requires defining the callback once for the entire search, before performing it, not per trial. How can unique directories per trial be defined in this case?
Keras Tuner creates a subdirectory for each run (this statement is probably version dependent).
I guess finding the right version mix is important.
Here is how it works for me, in jupyterlab.
Prerequisites:
1. pip requirements:
   keras-tuner==1.0.1
   tensorboard==2.1.1
   tensorflow==2.1.0
   Keras==2.2.4
   jupyterlab==1.1.4
2. jupyterlab installed, built and running [standard compile arguments: production:minimize]
Here is the actual code. First I define the log folder and the callbacks:
import datetime
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping  # imports assumed for a self-contained snippet

# run parameter
log_dir = "logs/" + datetime.datetime.now().strftime("%m%d-%H%M")

# training meta
stop_callback = EarlyStopping(
    monitor='loss', patience=1, verbose=0, mode='auto')

hist_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    embeddings_freq=1,
    write_graph=True,
    update_freq='batch')

print("log_dir", log_dir)
Then I define my hypermodel, which I do not want to disclose. Afterwards I set up the hyper-parameter search:
from kerastuner.tuners import Hyperband

hypermodel = get_my_hypermodel()
tuner = Hyperband(
    hypermodel,
    max_epochs=40,
    objective='loss',
    executions_per_trial=5,
    directory=log_dir,
    project_name='test'
)
which I then execute:
tuner.search(
    train_data,
    labels,
    epochs=10,
    validation_data=(val_data, val_labels),
    callbacks=[hist_callback],
    use_multiprocessing=True)

tuner.search_space_summary()
While the notebook with this code searches for adequate hyper-parameters, I monitor the loss in another notebook. Since TF v2, TensorBoard can be called via a magic function:
Cell 1
import tensorboard
Cell 2
%load_ext tensorboard
Cell 3
%tensorboard --logdir 'logs/'
Side note: Since I run jupyterlab in a Docker container, I have to specify the appropriate address and port for TensorBoard and also forward these in the Dockerfile.
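For example, the magic call might then look something like this (the host and port values are only an illustration for a typical Docker setup, not taken from the actual configuration):
%tensorboard --logdir 'logs/' --host 0.0.0.0 --port 6006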
The result is not really predictable for me... I did not yet understand when I can expect histograms and distributions in TensorBoard.
For some runs the loading time seems really excessive... so have patience.
Under Scalars I find a list of the runs as follows:
"logdir"/"model_hash"/execution[iter]/[train/validation]
E.g.
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/train
0101-1010/bb7981e03d05b05106d8a35923353ec46570e4b6/execution0/validation

Can I add images to Tensorboard through Keras?

I have set up Tensorboard with Keras as a callback, like so:
callback_tb = keras.callbacks.TensorBoard(log_dir=tb_dir, histogram_freq=2, write_graph=True, write_images=True)
callbacks_list = [callback_save, callack_view, callback_tb]

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          callbacks=callbacks_list,
          verbose=1,
          validation_data=(x_test, y_test),
          shuffle='batch')
This works fine and I can see loss and accuracy graphs on Tensorboard.
I am generating and saving model predictions in another file, but I want to know if it is possible to view these images on Tensorboard with Keras?
I have found the tf.summary.image function on https://github.com/tensorflow/tensorboard
But I don't understand how this relates to Keras.
Any help would be appreciated.
I fixed my problem by creating my own Keras callback based on the TensorBoard callback, where I could use the tf.summary.image feature.
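A minimal sketch of that idea using the TF2 summary API (the names log_dir and sample_images, and the assumption that the model's predictions are image tensors, are illustrative and not from the original answer):
import tensorflow as tf
from tensorflow import keras

class ImageLogger(keras.callbacks.Callback):
    def __init__(self, log_dir, sample_images):
        super(ImageLogger, self).__init__()
        self.writer = tf.summary.create_file_writer(log_dir)
        self.sample_images = sample_images  # shape (N, H, W, C), values in [0, 1]

    def on_epoch_end(self, epoch, logs=None):
        # run the current model on a fixed batch and log its outputs as images
        preds = self.model.predict(self.sample_images)
        with self.writer.as_default():
            tf.summary.image("predictions", preds, step=epoch, max_outputs=4)
        self.writer.flush()

# usage: pass ImageLogger(tb_dir, sample_images) in callbacks_list alongside callback_tb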