SageMaker XGBoost hyperparameter tuning (HPO) takes far longer when distributed training is combined with early stopping than when I run a regular XGBoost job, distributed training alone, or early stopping alone. A regular XGBoost training job, with or without early stopping, takes 1-2 hours; distributed training with early stopping takes 18-20 hours, which looks like a bug.
Looking to understand if there are parameters / settings I'm missing or if this is a bug?
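For context, here is a minimal, hypothetical sketch of the kind of setup in question, assuming the built-in XGBoost algorithm and the SageMaker Python SDK's HyperparameterTuner; the role, bucket, ranges, and instance counts are placeholders, not the actual job configuration. It shows the two early-stopping knobs involved: XGBoost's own early_stopping_rounds hyperparameter and the tuner-level early_stopping_type.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1"),
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=4,                      # distributed training across 4 instances
    instance_type="ml.m5.4xlarge",
    output_path="s3://my-bucket/output",   # placeholder bucket
)
xgb.set_hyperparameters(
    objective="binary:logistic",
    eval_metric="auc",
    num_round=1000,
    early_stopping_rounds=10,              # XGBoost's own early stopping
)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,
    max_parallel_jobs=4,
    early_stopping_type="Auto",            # tuner-level early stopping of unpromising jobs
)
# tuner.fit({"train": "s3://my-bucket/train", "validation": "s3://my-bucket/validation"})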
I'm completely confused about how SageMaker Experiments works. I used the SDK to create an Experiment and a Trial. Now I want to track job parameters, metadata, and metrics.
Should I create Trial Components manually with the SDK, or let the SageMaker Estimator's fit() create them for me?
After creating my experiment and trial, I use the code below:
job.fit(inputs,
        experiment_config={
            "ExperimentName": reg_experiment.experiment_name,
            "TrialName": trial1.trial_name,
            "TrialComponentDisplayName": "training-with-RF1"},
        wait=False)
When I look in Studio, I see an automatically created Trial component named "training-with-RF1".
I see here and here that we can (can = must? should? could?...) also create Trials manually, for example with
my_trial = trial.Trial.create('AutoML')
my_tracker = tracker.Tracker.create()
my_tracker.log_parameter('learning_rate', 0.01)
my_trial.add_trial_component(my_tracker)
Or here with
Trial.create(
    trial_name=trial_name,
    experiment_name=mnist_experiment.experiment_name,
    sagemaker_boto_client=sm)
When I create Trials manually like that, they appear as separate, empty Trials, distinct from the Trials created by the SageMaker jobs, see below.
I'm confused because the AWS blog post says we have to create Trials manually; however, SageMaker training jobs seem to create those Trials on our behalf...
I'm completely confused by this service; can someone please help?
The best way to do this is to create an Experiment and a Trial, and then pass the experiment config to the training job. The training job will automatically create a Trial Component and add it to the Trial.
Depending on the type of training job you are using, some metrics will be tracked automatically in the Trial Component. You can set this up through the metric_definitions regexes on the Estimator.
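To make this concrete, here is a hedged sketch of that flow, assuming a script-mode scikit-learn job and the sagemaker-experiments SDK; the experiment and trial names, role, script, bucket, and regexes are placeholders, not anything from the original post.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

sm = sagemaker.Session().sagemaker_client
experiment = Experiment.create(experiment_name="reg-experiment", sagemaker_boto_client=sm)
trial = Trial.create(trial_name="rf-trial-1",
                     experiment_name=experiment.experiment_name,
                     sagemaker_boto_client=sm)

job = SKLearn(
    entry_point="train.py",                               # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_type="ml.m5.xlarge",
    framework_version="0.23-1",
    # Regexes applied to the job's log output; matches are recorded as metrics
    # on the Trial Component the job creates.
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "val_accuracy=([0-9\\.]+)"},
    ],
)

inputs = {"train": "s3://my-bucket/train"}                # placeholder channel
job.fit(inputs,
        experiment_config={
            "ExperimentName": experiment.experiment_name,
            "TrialName": trial.trial_name,
            "TrialComponentDisplayName": "training-with-RF1"},
        wait=False)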
If you are running the training job in script mode, you can install sagemaker-experiments in the container running the job (or from the training script using subprocess.call) and import the Tracker object. You can use the Tracker to log metrics from the training script to the Trial Component.
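A minimal sketch of what that could look like inside the training script, assuming the job's Trial Component can be resolved from the job environment; the parameter and metric names and values are placeholders.
# Inside the training script (script mode).
import subprocess
import sys

# Install sagemaker-experiments at runtime if it is not baked into the container.
subprocess.call([sys.executable, "-m", "pip", "install", "sagemaker-experiments"])

from smexperiments.tracker import Tracker

# With no arguments, Tracker.load() resolves the Trial Component associated
# with the currently running training job from the job environment.
with Tracker.load() as tracker:
    tracker.log_parameter("n_estimators", 100)       # placeholder parameter
    tracker.log_metric("validation:accuracy", 0.91)  # placeholder metric value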
There are some examples here - https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments
This is the documentation for sagemaker-experiments sdk - https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html
I've been using SageMaker for a while and have performed several experiments already with distributed training. I am wondering if it is possible to test and run SageMaker distributed training in local mode (using SageMaker Notebook Instances)?
Yes, SageMaker supports distributed training in local mode. But, as Philipp Schmid said, some other features (like Pipe mode) are not supported.
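If local mode distributed training works for your framework, a hedged sketch of the setup could look like the following; the script, role, and versions are placeholders, and it assumes Docker Compose is available on the notebook instance.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",                               # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=2,                                     # two instances emulated locally
    instance_type="local",                                # use "local_gpu" on a GPU notebook instance
    framework_version="2.4.1",
    py_version="py37",
)
estimator.fit("file://./data")                            # local channel, no S3 round trip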
No, not possible yet. Local mode does not support distributed training with local_gpu, Gzip compression, Pipe mode, or manifest files for inputs.
I am trying the inbuilt object detection algorithm available on AWS for a computer vision problem. The training job ran successfully and I have received the model artifacts in the .tar.gz format in an S3 bucket.
To reduce the model footprint, we need to use SageMaker Neo - a compilation job on the available model artifacts. The compilation job fails with the error "ClientError: OperatorNotImplemented:('One or more operators are not supported in frontend MXNet:\n_contrib_MultiBoxTarget: 1\nMakeLoss: 3".
How can this be resolved?
Around August 2019, SageMaker Neo did not support models trained with the built-in SageMaker Object Detection algorithm. Has this status changed today?
Thanks
I have a .py file containing all the instructions to generate the predictions for some data.
Those data are taken from BigQuery and the predictions should be inserted in another BigQuery table.
Right now the code runs in an AI Platform Notebook, but I want to schedule it to run every day. Is there any way to do that?
I came across AI Platform Jobs, but I can't understand what my code should do or how it should be structured. Is there a step-by-step guide to follow?
You can schedule a Notebook execution using different options:
nbconvert
Different variants of the same technology:
nbconvert: Provides a convenient way to execute the input cells of an .ipynb notebook file and save the results, both input and output cells, as a .ipynb file.
papermill: A Python package for parameterizing and executing Jupyter Notebooks. (Uses nbconvert --execute under the hood; a short usage sketch follows this list.)
notebook executor: This tool can be used to schedule the execution of Jupyter notebooks from anywhere (local, GCE, GCP Notebooks) to a Cloud AI Deep Learning VM. You can read more about its usage here. (Uses the gcloud SDK and papermill under the hood.)
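As a rough illustration of the papermill route mentioned above, the sketch below executes a parameterized notebook and saves the executed copy; the notebook name, output path, and parameters are placeholders, and writing directly to gs:// assumes gcsfs is installed.
import papermill as pm

# Execute the notebook with injected parameters and keep the executed copy
# (inputs and outputs) for inspection; schedule this script with cron,
# Cloud Scheduler plus a small runner, or similar.
pm.execute_notebook(
    "predict.ipynb",                          # input notebook (hypothetical)
    "gs://my-bucket/runs/predict-out.ipynb",  # executed copy with outputs
    parameters={"run_date": "2021-07-01"},    # injected into the notebook's "parameters" cell
)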
KubeFlow Fairing
A Python package that makes it easy to train and deploy ML models on Kubeflow. Kubeflow Fairing can also be extended to train or deploy on other platforms; currently, it has been extended to train on Google AI Platform.
AI Platform Notebook Executor
There are two core functions of the Scheduler extension:
Ability to submit a Notebook to run on AI Platform's Machine Learning Engine as a training job with a custom container image. This allows you to experiment and write your training code in a cost-effective single-VM environment, but scale out to an AI Platform job to take advantage of superior resources (e.g. GPUs, TPUs).
Scheduling a Notebook for recurring runs follows the exact same sequence of steps, but requires a crontab-formatted schedule option.
Nova Plugin: This is the predecessor of the Notebook Scheduler project. Allows you to execute notebooks directly from your Jupyter UI.
Notebook training
A Python package that allows users to run a Jupyter notebook on Google Cloud AI Platform Training Jobs.
GCP runner: Allows running any Jupyter notebook function on Google Cloud Platform
Unlike all the other solutions listed above, it allows running training for the whole project, not just a single Python file or Jupyter notebook.
Allows running any function with parameters; moving from local execution to the cloud is just a matter of wrapping the function in a gcp_runner.run_cloud(<function_name>, …) call.
This project is production-ready without any modifications
Supports execution in local (for testing purposes), AI Platform, and Kubernetes environments. A full end-to-end example can be found here:
https://www.github.com/vlasenkoalexey/criteo_nbdev
tensorflow_cloud (Keras for GCP): Provides APIs that allow you to easily go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.
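A hedged sketch of that route, assuming a Keras training script named train.py and the tensorflow_cloud package; both the script name and the requirements file are placeholders.
import tensorflow_cloud as tfc

# The same script runs locally until tfc.run() packages it and submits it
# to AI Platform for (possibly distributed) training.
tfc.run(
    entry_point="train.py",               # hypothetical Keras training script
    requirements_txt="requirements.txt",  # extra dependencies for the remote job
)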
Update July 2021:
The recommended option in GCP is the Notebook Executor, which is already available in EAP.
I run a hyperparameter tuning job on Cloud ML Engine. I can only get the values of a trial's hyperparameters in the Job details and the Training output once the trial has concluded.
I wonder if there is a way to get the values of hyperparameters while the trial is running.
Edit: I think it's a better idea to dump the hyperparameters during training programmatically using tf.gfile.
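A minimal sketch of that edit, assuming the trainer parses its hyperparameters with argparse and writes them to the job directory with the TF 1.x tf.gfile API (tf.io.gfile in TF 2.x); the flag names and bucket path are placeholders.
import argparse
import json
import os

import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01)  # placeholder hyperparameter
parser.add_argument("--num_layers", type=int, default=2)          # placeholder hyperparameter
parser.add_argument("--job-dir", dest="job_dir", default="gs://my-bucket/job")
args = parser.parse_args()

# During hyperparameter tuning, the trial number is exposed in TF_CONFIG.
trial = json.loads(os.environ.get("TF_CONFIG", "{}")).get("task", {}).get("trial", "0")

# Dump the hyperparameters this trial received so they can be read mid-run.
with tf.gfile.GFile(os.path.join(args.job_dir, "hparams_trial_%s.json" % trial), "w") as f:
    json.dump(vars(args), f)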
Currently, Cloud ML Engine doesn't expose the training output until all trials have finished.
If you want to learn the detailed hyperparameters for each trial, you should be able to find them in the Stackdriver logs. There is a log entry that records the command used to invoke your trainer code for each trial: Running command: python -m trainer.task <params>
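As an illustration, here is a hedged sketch of pulling those entries with the Cloud Logging Python client; the project and job IDs are placeholders, and the filter assumes the legacy ml_job resource type used by ML Engine training jobs.
from google.cloud import logging

client = logging.Client(project="my-project")                # placeholder project
log_filter = (
    'resource.type="ml_job" '
    'AND resource.labels.job_id="my_tuning_job_20200101" '   # placeholder job ID
    'AND textPayload:"Running command"'
)

# Each matching entry contains the full trainer invocation, including the
# hyperparameter flags chosen for that trial.
for entry in client.list_entries(filter_=log_filter):
    print(entry.payload)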