My metrics are not being logged when running HPO in Watson Studio using Tensorboard - tensorboard

I'm training a simple MLP using Watson Studio's HPO capability. However, when I view my logs, the metrics are not displayed. Metric logging works for a non-HPO training run, but nothing shows up when running in HPO.
Here's how I defined my TensorBoard callback:

import os
from tensorflow.keras.callbacks import TensorBoard

tb_directory = os.path.join(os.environ["JOB_STATE_DIR"], "logs", "tb")
os.makedirs(tb_directory, exist_ok=True)
tensorboard = TensorBoard(log_dir=tb_directory)

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    callbacks=[tensorboard])

Found the answer. When running HPO, the metrics for each training run must be written to their own subdirectory, otherwise they get overwritten. So I should have set up my TensorBoard log directory like this, with the SUBID environment variable as the per-run subdirectory:

tb_directory = os.path.join(os.environ["JOB_STATE_DIR"], "logs", "tb", os.environ["SUBID"])
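Putting it together with the callback from the question (a sketch; the same Keras setup as above is assumed):

import os
from tensorflow.keras.callbacks import TensorBoard

# One subdirectory per HPO run, keyed by the SUBID env var, so runs
# don't overwrite each other's event files.
tb_directory = os.path.join(os.environ["JOB_STATE_DIR"], "logs", "tb",
                            os.environ["SUBID"])
os.makedirs(tb_directory, exist_ok=True)
tensorboard = TensorBoard(log_dir=tb_directory)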

Related

Unable to start Google Cloud Profiler due to error: Unable to find the job id or job name from env var

We followed the Cloud Profiler documentation to enable the profiler for our Dataflow jobs, but the Profiler is failing to start.
The issue is that Cloud Profiler needs the JOB_NAME and JOB_ID environment variables to start, but the worker VM only has the JOB_ID env var; JOB_NAME is missing.
The question is: why is the JOB_NAME env var missing?
Logs:
jsonPayload: {
  job: "2022-09-16_13_41_20-1177626142222241340"
  logger: "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker_main.py:177"
  message: "Unable to start google cloud profiler due to error: Unable to find the job id or job name from env var"
  portability_worker_id: "sdk-0-13"
  thread: "MainThread"
  worker: "description-embeddings-20-09161341-k27g-harness-qxq2"
}
The following has been done so far:
Cloud Profiler API enabled for the project.
Projects have enough quota.
The service account for the Dataflow job has the appropriate permissions for Profiler.
The following option added to the pipeline:
--dataflow_service_options=enable_google_cloud_profiler
The enable_google_cloud_profiler and enable_google_cloud_heap_sampling flags specified as additional experiments when deploying the pipeline from Dataflow templates.
Edit: Found the cause.
The provisioning API is returning an empty JOB_NAME, causing boot.go to set the JOB_NAME env var to "", which causes the Python SDK code to fail when trying to activate googlecloudprofiler.
There is an open issue on IssueTracker regarding this.
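Until that is fixed, a quick way to confirm what the workers actually see is to log the variables from inside the pipeline (a minimal sketch; the step name and where you wire it in are arbitrary):

import os
import logging
import apache_beam as beam

class LogProfilerEnvVars(beam.DoFn):
    # Hypothetical debugging step: log the env vars Cloud Profiler looks for.
    def process(self, element):
        logging.info("JOB_ID=%r JOB_NAME=%r",
                     os.environ.get("JOB_ID"), os.environ.get("JOB_NAME"))
        yield element

# Wired into an existing pipeline, e.g.:
#   ... | beam.ParDo(LogProfilerEnvVars()) | ...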

Upload tensorboard logs from cloud storage to vertex ai - tensorboard

I created a pipeline with Vertex AI and added the code for creating and storing my TensorBoard logs in Cloud Storage. The next step in the instructions here https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview#getting_started is to use the tb-gcp-uploader command to upload the logs to the TensorBoard experiment page. But I'm getting the message "'tb-gcp-uploader' is not recognized as an internal or external command". Any thoughts?
You should be able to run the command tb-gcp-uploader by installing the following package:
pip install google-cloud-aiplatform[tensorboard]
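Once installed, an upload typically looks like this (a sketch based on the Vertex AI docs; the resource name, log directory, and experiment name are placeholders):

tb-gcp-uploader --tensorboard_resource_name projects/PROJECT_NUMBER/locations/REGION/tensorboards/TENSORBOARD_ID \
    --logdir gs://your-bucket/logs \
    --experiment_name my-experiment \
    --one_shot=True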

How to load an .RData file to AWS sagemaker notebook?

I just started using AWS SageMaker, and I have an XGBoost model saved on my personal laptop using the save (.RData), saveRDS, and xgb.save commands. I have uploaded those files to my SageMaker notebook instance, where my other notebooks are. However, I am unable to load the model into my environment and predict on test data using the following commands:

load("Model.RData")            # restores objects saved with save()
model <- xgb.load("model")     # loads a model written with xgb.save()
model <- readRDS("Model.rds")  # loads an object written with saveRDS()

When I predict, I get NAs as my predictions. These commands work fine in RStudio but not in the SageMaker notebook. Please help.

Tensorboard - can't connect from Google Cloud Instance

I'm trying to run TensorBoard from within my Google Cloud VM terminal:
tensorboard --logdir logs --port 6006
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.2.1 at http://localhost:6006/ (Press CTRL+C to quit)
When I click on the link:
in Chrome I get error 400;
in Firefox: Error: Could not connect to Cloud Shell on port 6006. Ensure your server is listening on port 6006 and try again.
I've added a new firewall rule to allow port 6006 from 0.0.0.0/0, but I still can't get this to work. I've tried using --bind_all too, but that doesn't work either.
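For reference, the firewall rule was created along these lines (a sketch; the rule name is arbitrary):

gcloud compute firewall-rules create allow-tensorboard \
    --allow tcp:6006 --source-ranges 0.0.0.0/0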
From Training a Keras Model on Google Cloud ML GPU:
... To train this model now on google cloud ML engine run the below command on cloud sdk terminal
gcloud ml-engine jobs submit training JOB1 \
    --module-name=trainer.cnn_with_keras \
    --package-path=./trainer \
    --job-dir=gs://keras-on-cloud \
    --region=us-central1 \
    --config=trainer/cloudml-gpu.yaml
Once you have started the training you can watch the logs from the Google Cloud console. Training takes around 5 minutes, and the logs should look like below. You will also be able to view the TensorBoard logs in the bucket we created earlier, named 'keras-on-cloud'.
To visualize the training graphically, open Cloud Shell by clicking the icon at the top right. Once it has started, type the command below to start TensorBoard on port 8080.
tensorboard --logdir=gs://keras-on-cloud --port=8080
For anyone else struggling with this: I decided to write my logs to an S3 bucket, and then, rather than trying to run TensorBoard from within the GCP instance, I just ran it locally, tested with the script below.
I needed to put this in a script rather than calling TensorBoard directly from the command line because my AWS credentials had to be loaded first; the script then uses subprocess to run the command-line invocation as normal.
The credentials are contained within an env file, found using python-dotenv.
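For reference, the .env file holds the standard AWS credential variables (a sketch; values elided):

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=...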
from dotenv import find_dotenv, load_dotenv
import subprocess

# Load the AWS credentials from the .env file so TensorBoard can read from S3
load_dotenv(find_dotenv())

if __name__ == '__main__':
    cmd = 'tensorboard --logdir s3://path-to-s3-bucket/Logs/'
    p = subprocess.Popen(cmd, shell=True)
    p.wait()
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.1.1 at http://localhost:6006/ (Press CTRL+C to quit)

What to define as entrypoint when initializing a pytorch estimator with a custom docker image for training on AWS Sagemaker?

So I created a Docker image for training. In the Dockerfile I have an entrypoint defined such that when docker run is executed, it starts running my Python code.
To use this on AWS SageMaker, my understanding is that I need to create a PyTorch estimator in a Jupyter notebook in SageMaker. I tried something like this:
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='1.3.1',
                    image_name='xxx.ecr.eu-west-1.amazonaws.com/xxx:latest',
                    train_instance_count=1,
                    train_instance_type='ml.p3.xlarge',
                    hyperparameters={})
estimator.fit({})
In the documentation I found that as the image name I can specify the link to my Docker image on AWS ECR. When I try to execute this it keeps complaining:
[Errno 2] No such file or directory: 'train.py'
It complains immediately, so surely I am doing something completely wrong. I would expect my Docker image to run first, and only then could it discover that the entry point does not exist.
But besides this, why do I need to specify an entry point at all? Should it not be clear that the entry to my training is simply docker run?
For better understanding, the entrypoint Python file in my Docker image looks like this:
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=16)
    parser.add_argument('--learning_rate', type=float, default=0.0001)
    # Data and output directories
    parser.add_argument('--output_data_dir', type=str, default=os.environ['OUTPUT_DATA_DIR'])
    parser.add_argument('--train_data_path', type=str, default=os.environ['CHANNEL_TRAIN'])
    parser.add_argument('--valid_data_path', type=str, default=os.environ['CHANNEL_VALID'])
    # Start training
    ...
Later I would like to specify the hyperparameters and data channels, but for now I simply do not understand what to put as the entry point. The documentation says the entry point is required and should be a local/global path to the entrypoint...
If you really want to use a completely separate, self-built Docker image, you should create an Amazon SageMaker algorithm (one of the options in the SageMaker menu). There you specify a link to your Docker image on Amazon ECR, as well as the input parameters, data channels, and so on. When choosing this option, you should not use the PyTorch estimator but the Algorithm estimator; a sketch follows after this answer. This way you indeed don't have to specify an entry point, because it simply runs the Docker image when training, and the default entrypoint can be defined in your Dockerfile.
The PyTorch estimator is for when you have your own model code but would like to run it in an off-the-shelf SageMaker PyTorch Docker image; this is why you have to specify, for example, the PyTorch framework version. In that case, the entrypoint file should by default be placed next to where your Jupyter notebook is stored (just upload the file by clicking the upload button). The PyTorch estimator inherits all options from the Framework estimator, whose documentation describes where to place the entrypoint and model code, for example via source_dir.
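A minimal sketch of that Algorithm-estimator route, assuming the SageMaker Python SDK v1 naming used in the question (the algorithm ARN and data channel are placeholders):

import sagemaker
from sagemaker.algorithm import AlgorithmEstimator

role = sagemaker.get_execution_role()

# The algorithm resource wraps your ECR image, so the container's own
# ENTRYPOINT runs during training and no entry_point script is needed.
estimator = AlgorithmEstimator(
    algorithm_arn='arn:aws:sagemaker:eu-west-1:123456789012:algorithm/xxx',  # placeholder ARN
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p3.xlarge')

estimator.fit({'train': 's3://your-bucket/train'})  # hypothetical data channel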