I run a hyperparameter tuning job on Cloud ML Engine. Only when a trial has concluded can I see the hyperparameter values in the Job details and in the Training output.
I wonder if there is a way to get the values of hyperparameters while the trial is running.
Edit: I think it's a better idea to dump the hyperparameters during training programmatically using tf.gfile.
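For example, a rough sketch of that idea, assuming the tuning service passes the tuned values to the trainer as command-line flags (the flag names below are hypothetical) and that output_dir points at a GCS path the job can write to:
import argparse
import json
import os
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float)  # hypothetical tuned hyperparameter
parser.add_argument('--hidden_units', type=int)     # hypothetical tuned hyperparameter
parser.add_argument('--output_dir', type=str)
args, _ = parser.parse_known_args()

# Dump this trial's hyperparameters as soon as training starts,
# so they are visible before the trial concludes.
with tf.gfile.GFile(os.path.join(args.output_dir, 'hparams.json'), 'w') as f:
    json.dump(vars(args), f)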
Currently, Cloud ML Engine doesn't expose the training output until all trials have finished.
If you want to see the exact hyperparameters for each trial, you should be able to find them in the Stackdriver logs. There is a log entry for each trial that records the command used to invoke your trainer code: Running command: python -m trainer.task <params>
I am trying to train a T5 conditional generation model in SageMaker. It runs fine when I pass the arguments directly in the notebook, but it doesn't learn anything when I pass them through an estimator and a train.py script. I followed the documentation provided by Hugging Face as well as AWS, but we are still facing the issue: it says training is completed and saves the model within 663 seconds, whatever the size of the dataset. Kindly give suggestions for this.
Check the Amazon CloudWatch logs to see what took place during training (train.py stdout/stderr). This utility can help with downloading the logs to your local machine or notebook.
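If you prefer to pull the logs programmatically, here is a minimal sketch with boto3 (the training job name is hypothetical):
import boto3

logs = boto3.client('logs')
log_group = '/aws/sagemaker/TrainingJobs'  # log group used by SageMaker training jobs
job_name = 'huggingface-training-2022-01-01-00-00-00'  # hypothetical job name

# Each training job writes one or more log streams prefixed with the job name.
streams = logs.describe_log_streams(
    logGroupName=log_group, logStreamNamePrefix=job_name)['logStreams']
for stream in streams:
    events = logs.get_log_events(
        logGroupName=log_group, logStreamName=stream['logStreamName'])['events']
    for event in events:
        print(event['message'])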
I'm completely confused about how SageMaker Experiments works. I used the SDK to create an Experiment and a Trial. Now I want to track job parameters, metadata, and metrics.
Should I create Trial Components manually with the SDK, or let the SageMaker Estimator's fit call create them for me?
After creating my experiment and trial, I use the code below:
job.fit(inputs,
        experiment_config={
            "ExperimentName": reg_experiment.experiment_name,
            "TrialName": trial1.trial_name,
            "TrialComponentDisplayName": "training-with-RF1"},
        wait=False)
When I look in Studio, I see an automatically created Trial component named "training-with-RF1".
I see here and here that we can (can = must? should? could?...) also create Trials manually, for example with
my_trial = trial.Trial.create('AutoML')
my_tracker = tracker.Tracker.create()
my_tracker.log_parameter('learning_rate', 0.01)
my_trial.add_trial_component(my_tracker)
Or here with
Trial.create(
    trial_name=trial_name,
    experiment_name=mnist_experiment.experiment_name,
    sagemaker_boto_client=sm)
When I create trials manually like that, they appear as separate, empty trials, distinct from the trials created by SageMaker jobs (see below).
I'm confused because the AWS blog post says we have to create Trials manually; however, SageMaker training jobs seem to be creating those trials on our behalf...
I'm completely confused by this service, can someone please help?
The best way to do this is to create an Experiment, a Trial and then pass the experiment config to the Training Job. The training job will automatically create a Trial Component and add it to the Trial.
Depending on the type of training job you are using, some metrics will automatically be tracked in the Trial Component. You can set this up through the metric_definitions regexes on the Estimator.
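For example, a minimal sketch of metric_definitions on a generic Estimator (the image URI and the regexes are hypothetical and must match what your training script actually prints to stdout):
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='<your-training-image>',  # hypothetical training image
    role=role,                          # assumes an execution role is already defined
    instance_count=1,
    instance_type='ml.m5.xlarge',
    # SageMaker scrapes these regexes from the job's CloudWatch logs and records
    # the matched values as metrics on the job's Trial Component.
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'loss: ([0-9\\.]+)'},
        {'Name': 'validation:accuracy', 'Regex': 'val_acc: ([0-9\\.]+)'},
    ],
)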
If you are running the training job in script mode, you can install sagemaker-experiments in the container running the job (or from the Python script using subprocess.call) and import the Tracker object. You can use the Tracker to log metrics from the training script to the Trial Component.
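For example, a rough sketch of what that logging could look like inside the training script, assuming sagemaker-experiments is installed in the container (the parameter and metric names are hypothetical):
from smexperiments.tracker import Tracker

# Inside a training job, Tracker.load() picks up the Trial Component
# that SageMaker created for this job via the experiment_config.
with Tracker.load() as tracker:
    tracker.log_parameter('n_estimators', 100)
    tracker.log_metric('validation:accuracy', 0.93)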
There are some examples here - https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments
This is the documentation for sagemaker-experiments sdk - https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html
I am currently trying to train a model using PyTorch on AWS SageMaker but can't get it to run properly. My main question now is: Is there some workflow step I'm missing? Any help is greatly appreciated.
I managed to get the code running on Colab and on a local machine, for example, but not on SageMaker.
In short, the program should set up a PyTorch model, load the training data from a file system, and run training epochs.
For this, I am trying the following:
The code files (data loaders, helper functions, etc.) along with the entry point are stored in SageMaker Studio in the folder "code".
The training files are stored in an S3 bucket and are transferred in File mode.
I then call the estimator in a Python notebook like this:
estimator = PyTorch(entry_point='entry.py',
                    role=role,
                    py_version='py3',
                    source_dir='code',
                    output_path='s3://XXXXX/XXXXXX/XXXX',
                    framework_version='1.3.1',
                    instance_count=1,
                    instance_type='ml.g4dn.2xlarge',
                    hyperparameters={
                        'epochs': 5,
                        'backend': 'gloo'
                    })
inputs = "s3://XXXXX/XXXXX"
estimator.fit({'training': inputs})
In the output I can see that the training instance is prepared and the data is downloaded, but then the problem arises:
For some reason the program jumps right into the train method. The outputs of the first steps, which should take place before a training epoch (network whitening, for example), are shown after or during the training step. After one training epoch the program freezes without any error message until I manually stop the instance.
Thanks for any help.
Your script gets stuck after one epoch.
There are no error messages to review, nor the code itself.
I suggest troubleshooting this quickly using SageMaker local mode (e.g., instance_type='local_gpu'). This will allow you to retry different configurations in seconds instead of minutes, and potentially to attach a remote debugger.
Note: SageMaker local mode requires Docker support, so you'll need to run this either on your laptop or on a SageMaker notebook instance (like ml.g4dn.2xlarge), but not on a SageMaker Studio instance.
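For example, a minimal sketch of the same estimator in local mode (role and the data paths are assumed to be defined as in the question):
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='entry.py',
                    source_dir='code',
                    role=role,
                    framework_version='1.3.1',
                    py_version='py3',
                    instance_count=1,
                    instance_type='local_gpu',  # use 'local' if no GPU is available
                    hyperparameters={'epochs': 1, 'backend': 'gloo'})

# Local mode can also read training data from a local path instead of S3.
estimator.fit({'training': 'file:///path/to/local/train/data'})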
I have a .py file containing all the instructions to generate the predictions for some data.
Those data are taken from BigQuery and the predictions should be inserted in another BigQuery table.
Right now the code runs on an AI Platform Notebook, but I want to schedule its execution every day. Is there any way to do it?
I came across AI Platform Jobs, but I can't understand what my code should do and how it should be structured. Is there a step-by-step guide to follow?
You can schedule a Notebook execution using different options:
nbconvert
Different variants of the same technology:
nbconvert: Provides a convenient way to execute the input cells of an .ipynb notebook file and save the results, both input and output cells, as a .ipynb file.
papermill: A Python package for parameterizing and executing Jupyter Notebooks (uses nbconvert --execute under the hood); see the sketch after this list.
notebook executor: A tool that can be used to schedule the execution of Jupyter notebooks from anywhere (local, GCE, GCP Notebooks) on a Cloud AI Deep Learning VM. You can read more about the usage of this tool here. (Uses the gcloud SDK and papermill under the hood.)
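For example, a minimal papermill sketch (the notebook names and parameters are hypothetical):
import papermill as pm

# Executes the notebook with injected parameters and saves the executed
# copy, including outputs, to a new file.
pm.execute_notebook(
    'predictions.ipynb',
    'predictions_output.ipynb',
    parameters={'run_date': '2021-07-01', 'bq_table': 'project.dataset.table'},
)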
KubeFlow Fairing
Is a Python package that makes it easy to train and deploy ML models on Kubeflow. Kubeflow Fairing can also be extended to train or deploy on other platforms. Currently, Kubeflow Fairing has been extended to train on Google AI Platform.
AI Platform Notebook Executor
There are two core functions of the Scheduler extension:
Ability to submit a Notebook to run on AI Platform's Machine Learning Engine as a training job with a custom container image. This allows you to experiment and write your training code in a cost-effective single-VM environment, but scale out to an AI Platform job to take advantage of superior resources (e.g., GPUs, TPUs).
Scheduling a Notebook for recurring runs follows the exact same sequence of steps, but requires a crontab-formatted schedule option.
Nova Plugin: This is the predecessor of the Notebook Scheduler project. Allows you to execute notebooks directly from your Jupyter UI.
Notebook training
A Python package that allows users to run a Jupyter notebook on Google Cloud AI Platform Training Jobs.
GCP runner: Allows running any Jupyter notebook function on Google Cloud Platform
Unlike all the other solutions listed above, it allows you to run training for the whole project, not a single Python file or Jupyter notebook.
Allows running any function with parameters; moving from local execution to the cloud is just a matter of wrapping the function in a gcp_runner.run_cloud(<function_name>, …) call.
This project is production-ready without any modifications
Supports execution in local (for testing purposes), AI Platform, and Kubernetes environments. A full end-to-end example can be found here:
https://www.github.com/vlasenkoalexey/criteo_nbdev
tensorflow_cloud (Keras for GCP)
Provides APIs that allow you to easily go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.
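A minimal sketch, assuming the tensorflow_cloud package and a hypothetical train.py Keras script that already runs locally:
import tensorflow_cloud as tfc

# Packages train.py and its dependencies and submits them to AI Platform for training.
tfc.run(
    entry_point='train.py',
    requirements_txt='requirements.txt',
)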
Update July 2021:
The recommended option in GCP is the Notebook Executor, which is already available in EAP.
I am trying to deploy a locally trained model. I followed all of the instructions here for model preparation and I managed to deploy it.
However, when I try to get predictions, online prediction responds with a 502 Server Error and batch prediction returns ('Failed to run the inference graph', 1).
Is there a way to get a better error message to narrow down what's wrong?
Thanks
The error message indicates it occurred when running the session for the inference graph. It might be possible to uncover what is happening with some code that uses the model locally. One way to test it is to create a small input dataset and feed it to the inference graph to check whether you can run the session locally.
You may refer to local_predict.py in samples/mnist/deployable/ in the SDK for how to do that. Here is an example invocation:
python local_predict.py --input=/path/to/my/local/files --model_dir=/path/to/modeldir.
Note that model_dir points to where the TensorFlow meta graph proto and checkpoint files are saved; they are generated by training. Here is the doc link about how to train a model: https://cloud.google.com/ml/docs/how-tos/training-models. The model dir can be on GCS as well.
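As a rough sketch of the idea (assuming a TF 1.x-style meta graph plus checkpoint; the tensor names and input shape are hypothetical and must match your exported graph):
import tensorflow as tf

model_dir = '/path/to/modeldir'
checkpoint = tf.train.latest_checkpoint(model_dir)

# A tiny, hand-made input batch; the shape here is hypothetical.
small_batch = [[0.0] * 784]

with tf.Session() as sess:
    saver = tf.train.import_meta_graph(checkpoint + '.meta')
    saver.restore(sess, checkpoint)
    graph = tf.get_default_graph()
    inputs = graph.get_tensor_by_name('inputs:0')        # hypothetical tensor name
    predictions = graph.get_tensor_by_name('outputs:0')  # hypothetical tensor name
    print(sess.run(predictions, feed_dict={inputs: small_batch}))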
Thanks for bringing this up. We're continually working to improve the overall experience of the service including error reporting.