I am currently trying to train a model using PyTorch on AWS SageMaker but can't get it to run properly. My main question is: is there some workflow step I'm missing? Any help is greatly appreciated.
I managed to get the code running on Colab and on a local machine, for example, but not on SageMaker.
In short, the program should set up a PyTorch model, load the training data from a file system, and run the training epochs.
For this, I am trying the following:
The code files (data loaders, helper functions, etc.) along with the "entry point" are stored in SageMaker Studio in the folder "code".
The training files are stored in an S3 bucket and are transferred in "File" mode.
I then create the estimator in a Python notebook like this:
estimator = PyTorch(entry_point='entry.py',
                    role=role,
                    py_version='py3',
                    source_dir="code",
                    output_path="s3://XXXXX/XXXXXX/XXXX",
                    framework_version='1.3.1',
                    instance_count=1,
                    instance_type='ml.g4dn.2xlarge',
                    hyperparameters={
                        'epochs': 5,
                        'backend': 'gloo'
                    })
inputs = "s3://XXXXX/XXXXX"
estimator.fit({'training': inputs})
In the output I can see that the training instance is prepared and the data is downloaded, but then the problem arises:
For some reason the program jumps straight into the train method. The output of the first steps that should run before a training epoch (network whitening, for example) appears during or after the training step. After one training epoch the program freezes without any error message until I manually stop the instance.
Thanks for any help.
Your script gets stuck after one epoch, and there are no error messages to review, nor the code itself.
I suggest troubleshooting this quickly using SageMaker local mode (e.g., instance_type='local_gpu'). This will let you retry different configurations in seconds instead of minutes, and potentially attach a remote debugger.
Note: SageMaker local mode requires Docker support, so you'll need to run this either on your laptop or on a SageMaker notebook instance (like an ml.g4dn.2xlarge), but not on a SageMaker Studio instance.
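For example, here's a minimal sketch of the estimator from your question switched to local mode (same entry point and versions as above; the local data path is just a placeholder):

from sagemaker.pytorch import PyTorch

# Same job as above, but run against the local Docker backend instead of a remote instance.
# Requires docker (and nvidia-docker for 'local_gpu') on the machine running the notebook.
estimator = PyTorch(entry_point='entry.py',
                    source_dir='code',
                    role=role,
                    py_version='py3',
                    framework_version='1.3.1',
                    instance_count=1,
                    instance_type='local_gpu',   # or 'local' on a CPU-only machine
                    hyperparameters={'epochs': 1, 'backend': 'gloo'})

# A 'file://' channel skips the S3 download entirely, which makes iteration even faster.
estimator.fit({'training': 'file://./train_data'})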
Related
I'm completely confused about how SageMaker Experiments works. I used the SDK to create an Experiment and a Trial. Now I want to track job parameters, metadata, and metrics.
Should I create Trial Components manually with the SDK, or let the SageMaker Estimator's fit call create them for me?
After creating my experiment and trial, I use the code below:
job.fit(inputs,
        experiment_config={
            "ExperimentName": reg_experiment.experiment_name,
            "TrialName": trial1.trial_name,
            "TrialComponentDisplayName": "training-with-RF1"},
        wait=False)
When I look in Studio, I see an automatically created Trial component named "training-with-RF1".
I see here and here that we can (can = must? should? could?...) also create Trials manually, for example with
my_trial = trial.Trial.create('AutoML')
my_tracker = tracker.Tracker.create()
my_tracker.log_parameter('learning_rate', 0.01)
my_trial.add_trial_component(my_tracker)
Or here with
Trial.create(
    trial_name=trial_name,
    experiment_name=mnist_experiment.experiment_name,
    sagemaker_boto_client=sm)
When I create trials manually like that, they appear as separate, empty trials alongside the trials created by the SageMaker jobs.
I'm confused because the AWS blog post says we have to create Trials manually, yet SageMaker training jobs seem to create those trials on our behalf...
I'm completely confused by this service, can someone please help?
The best way to do this is to create an Experiment and a Trial, and then pass the experiment config to the training job. The training job will automatically create a Trial Component and add it to the Trial.
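A minimal sketch of that flow, assuming the sagemaker-experiments package and placeholder names:

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Create the Experiment and the Trial once; the training job then creates the
# Trial Component and attaches it to this Trial for you.
experiment = Experiment.create(experiment_name='my-experiment',
                               description='random forest experiments')
trial = Trial.create(trial_name='rf-trial-1',
                     experiment_name=experiment.experiment_name)

job.fit(inputs,
        experiment_config={
            "ExperimentName": experiment.experiment_name,
            "TrialName": trial.trial_name,
            "TrialComponentDisplayName": "training-with-RF1"},
        wait=False)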
Depending on the type of training job you are using, some metrics will be tracked in the Trial Component automatically. You can set this up through the metric_definitions regexes on the Estimator.
If you are running the training job in script mode, you can install sagemaker-experiments in the container running the job (or from the Python script itself using subprocess.call) and import the Tracker object. You can use the Tracker to log metrics from the training script to the Trial Component.
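A minimal sketch of what that can look like inside the training script (assuming sagemaker-experiments is installed in the container; the parameter and metric names are placeholders):

from smexperiments.tracker import Tracker

# Tracker.load() with no arguments attaches to the Trial Component that
# SageMaker created for this training job.
with Tracker.load() as tracker:
    tracker.log_parameter('n_estimators', 100)
    for epoch in range(5):
        loss = 1.0 / (epoch + 1)   # stand-in for a real training loss
        tracker.log_metric('train:loss', loss, iteration_number=epoch)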
There are some examples here - https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments
This is the documentation for sagemaker-experiments sdk - https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html
I'm running a training job using AWS SageMaker and I'm using a custom Estimator based on an available Docker image from AWS. I wanted to get some feedback on whether my process is correct or not prior to deployment.
I'm running the training job in a Docker container using 'local' mode on a SageMaker notebook instance, and the training job runs successfully. However, after the job completes and saves the model to opt/model/models within the Docker image, the model saved from training is lost once the container exits. Ideally, I'd like to use the model for inference, but I'm not sure about the best way of doing it. I have also tried the training job after pushing the image to ECR, but the same thing happens.
It is my understanding that the Docker state is lost once the container exits; as such, is it possible to persist the model that was produced in training in the image? One option I have thought about is saving the model output to an S3 bucket once the training job is complete, then pulling that model into another Docker image for inference. Is this expected behaviour and the correct way of doing it?
I am fairly new to using SageMaker but I'd like to do it according to best practices. I've looked at a lot of the AWS documents and followed the tutorials, but they don't seem to mention explicitly whether this is how it should be done.
Thanks for any feedback on this.
You can refer to Rok's comment on saving a model file when you're using a custom estimator. That said, SageMaker built-in estimators save the model artifacts to S3. To make inferences using that model, you can either use a real-time inference endpoint for real-time predictions, or a batch transform job to run inferences in batch mode. In both cases, you'll have to point the configuration to the inference container and the model artifacts. The amazon-sagemaker-examples repository has examples for common frameworks; in particular, the scikit-learn example has detailed explanations.
Also, make sure the model is being saved to /opt/ml/model/, not opt/model/models as mentioned in your question.
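For reference, a minimal sketch of the saving step in a script-mode training script (SageMaker sets SM_MODEL_DIR to /opt/ml/model inside the container; the scikit-learn model here is just for illustration):

import os
import joblib
from sklearn.linear_model import LogisticRegression

# ... real training would happen here; a trivial model just for illustration
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

# Whatever is written under /opt/ml/model when the job ends is packaged as
# model.tar.gz and uploaded to the job's S3 output location.
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
joblib.dump(model, os.path.join(model_dir, 'model.joblib'))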
SageMaker is a great tool to train your models, and we save some money by using AWS spot instances. However, training jobs sometimes get stopped in the middle. We are using some mechanisms to continue from the latest checkpoint after a restart. See also the docs.
Still, how do you efficiently test such a mechanism? Can you trigger it yourself? Otherwise you have to wait until the spot instance actually is restarted.
Also, are you expected to use the linked checkpoint_s3_uri argument or model_dir for this? E.g., the TensorFlow estimator docs seem to suggest using model_dir for checkpoints.
Since you can't manually terminate a SageMaker instance, run an Amazon SageMaker managed spot training job for a small number of epochs; Amazon SageMaker will have backed up your checkpoint files to S3. Check that the checkpoints are there. Now run a second training job, but this time provide the first job's checkpoint location to checkpoint_s3_uri. The reference is here; it also answers your second question.
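A minimal sketch of what the spot/checkpoint configuration looks like on the estimator (bucket, paths, and versions are placeholders; the training script itself must look in /opt/ml/checkpoints at startup and resume if files are present):

from sagemaker.pytorch import PyTorch

checkpoint_s3 = 's3://my-bucket/checkpoints/run-1'

estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='1.3.1',
                    py_version='py3',
                    instance_count=1,
                    instance_type='ml.g4dn.xlarge',
                    use_spot_instances=True,
                    max_run=3600,                 # max training time, in seconds
                    max_wait=7200,                # must be >= max_run for spot jobs
                    checkpoint_s3_uri=checkpoint_s3,
                    checkpoint_local_path='/opt/ml/checkpoints',
                    hyperparameters={'epochs': 2})
estimator.fit({'training': 's3://my-bucket/data'})

# For the second run, keep checkpoint_s3_uri pointing at the same prefix so the saved
# checkpoints are downloaded back into /opt/ml/checkpoints before training starts.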
I have the following challenge with SageMaker:
I've downloaded one of the tutorial notebooks (https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_abalone_age_predictor_using_keras/tensorflow_abalone_age_predictor_using_keras.ipynb)
I ran the training locally (successfully) after modifying the following line:
abalone_estimator = TensorFlow(entry_point='abalone.py',
                               role=role,
                               training_steps=100,
                               evaluation_steps=100,
                               hyperparameters={'learning_rate': 0.001},
                               train_instance_count=1,
                               train_instance_type='local')  # <-- the line I modified

abalone_estimator.fit(inputs)
I then wanted to deploy my model to AWS with the following line, but it seems the SDK deploys it locally (it doesn't fail; I just see it running on my machine):
abalone_predictor = abalone_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Any tips on how to either fix this so it gets deployed to AWS, or alternatively how to reload my trained model and deploy it to AWS from scratch?
Many thanks,
Stefan
It's easier to run the training again on SageMaker.
Otherwise, here are the steps you would have to do:
Take the checkpoint files generated during training and convert them into TensorFlow Serving models.
Package them in the expected archive format and upload the archive to S3.
Then create the estimator as you did above and run inference.
If you want details on each of the specific steps above do let me know, but if your dataset is not too big, I would say just retrain on SageMaker.
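If you do go the manual route, here is a rough sketch of the last step with a recent SageMaker Python SDK (the bucket path and framework version are placeholders for your own values):

from sagemaker.tensorflow import TensorFlowModel

# Point the model at the packaged artifact you uploaded to S3, then deploy it
# to a real (non-local) endpoint.
model = TensorFlowModel(model_data='s3://my-bucket/abalone/model.tar.gz',
                        role=role,
                        framework_version='1.15')

abalone_predictor = model.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')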
I am trying to deploy a locally trained model. I followed all of the instructions here for model preparation and I managed to deploy it.
However, when I try to get predictions, the online prediction responds with a 502 Server Error and the batch prediction returns ('Failed to run the inference graph', 1).
Is there a way to get a better error message to narrow down what's wrong?
Thanks
The error message indicates that the failure occurred when running the session for the inference graph. It might be possible to uncover what is happening with some code that uses the model locally. One way to test it is to create a small input dataset and feed it to the inference graph to check whether you can run the session locally.
You may refer to local_predict.py in samples/mnist/deployable/ in the SDK for how to do that. Here is an example use:
python local_predict.py --input=/path/to/my/local/files --model_dir=/path/to/modeldir
Note that model_dir points to where the TensorFlow meta graph proto and checkpoint files are saved; they are generated by training. Here is the doc link about how to train a model: https://cloud.google.com/ml/docs/how-tos/training-models. The model dir can be on GCS as well.
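If you want to poke at the graph without the local_predict.py helper, here is a minimal sketch along the same lines (TF 1.x APIs; the file and tensor names below are hypothetical and need to match your own export):

import tensorflow as tf

model_dir = '/path/to/modeldir'

with tf.Session() as sess:
    # Rebuild the graph from the exported meta graph proto and restore the weights.
    saver = tf.train.import_meta_graph(model_dir + '/export.meta')
    saver.restore(sess, tf.train.latest_checkpoint(model_dir))

    graph = tf.get_default_graph()
    inputs = graph.get_tensor_by_name('input:0')     # replace with your input tensor name
    outputs = graph.get_tensor_by_name('output:0')   # replace with your output tensor name

    # Feed one small example; if this raises, the inference graph itself is the problem.
    print(sess.run(outputs, feed_dict={inputs: [[0.0] * 10]}))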
Thanks for bringing this up. We're continually working to improve the overall experience of the service including error reporting.