Reloading from checkpoing during AWS Sagemaker Training - amazon-web-services

Sagemaker is a great tool to train your models, and we save some money by using AWS spot instances. However, training jobs sometimes get stopped in the middle. We are using some mechanisms to continue from the latest checkpoint after a restart. See also the docs.
Still, how do you efficiently test such a mechanism? Can you trigger it yourself? Otherwise you have to wait until the spot instance actually ís restarted.
Also, are you expected to use the linked checkpoint_s3_uri argument or the model_dir for this? E.g. the TensorFlow estimator docs seem to suggest something model_dirfor checkpoints.

Since you can't manually terminate a sagemaker instance, run an Amazon SageMaker Managed Spot training for a small number of epochs, Amazon SageMaker would have backed up your checkpoint files to S3. Check that checkpoints are there. Now run a second training run, but this time provide the first jobs’ checkpoint location to checkpoint_s3_uri. Reference is here, this also answer your second question.

Related

AWS Sagemaker T5 or huggingface Model training issue

I am trying to train a t5 conditional Generation model in Sagemaker, its running fine when I am passing the arguments directly in notebook but its not learning anything when I am passing estimator and train.py script, I followed the documentation provided by hugging face as well as AWS. But still we are facing issue it is saying training is completed and saving model with in 663 seconds what ever might be the size of dataset. Kindly give suggestions for this.
Check Amazon CloudWatch logs to be able to tell what took place during training (train.py stdout/stderr). This utility can help with downloading logs to your local machine/notebook.

AWS Sagemaker - Custom Training Job not saving Model output

I'm running a training job using AWS SageMaker and i'm using a custom Estimator based on an available docker image from AWS. I wanted to get some feedback on whether my process is correct or not prior to deployment.
I'm running the training job in a docker container using 'local' in a SageMaker notebook instance and the training job runs successfully. However, after the job completes and saves the model to opt/model/models within the docker image, once the docker container exits, the model saved from training is lost. Ideally, i'd like to use the model for inference, however, I'm not sure about the best way of doing it. I have also tried the training job after pushing the image to ECR, but the same thing happens.
It is my understanding that the docker state is lost, once the image exits, as such, is it possible to persist the model that was produced in training in the image? One option I have thought about is saving the model output to an S3 bucket once the training job is complete, then pulling that model into another docker image for inference. Is this expected behaviour and the correct way of doing it?
I am fairly new to using SageMaker but i'd like to do it according to best practices. I've looked at a lot of the AWS documents and followed the tutorials but it doesn't seem to mention explicitly if this is how it should be done.
Thanks for any feedback on this.
You can refer to Rok's comment on saving a model file when you're using a custom estimator. That said, SageMaker built-in estimators save the model artifacts to S3. To make inferences using that model, you can either use a real-time inference endpoint for real time predictions, or a batch transformer to run inferences in batch mode. In both cases, you'll have to point the configuration to the container for inference and the model artifacts. the amazon-sagemaker-examples repository has examples for common frameworks, especially, the scikit-learn example has detailed explanations.
Also, make sure the model is being saved to /opt/ml/model/, not opt/model/models as mentioned in your question.

Setting Instance Type in Sagemaker Experiements

Im pretty new to sagemaker had a quick question
Using experiments in Sagemaker Studio it spins up ml.m5.4xlarge instance automatically
Is there anyway of setting a smaller instance size.... maybe to one that fits under the free tier....
Just playing around with experiments and will require a few runs so want to keep the cost down to start with?
TIA
When you are using a SageMaker Experiments, there is an associated TrainingJob or Processing Job with that Experiment. It is the Training/Processing Job that spins up the instanc. That is where you can edit the instance type. Experiments itself is available to no extra cost when using SageMaker Studio.
I work for AWS & my opinions are my own.

AWS Sagemaker processing Job automatically created?

I haven't used sagemaker for a while and today I started a training job (with the same old settings I always used before), but this time I noticed that a processing job has been automatically created and it's running while my training job run (I presume for debugging purpose).
I'm sure that this is the first time that it happens.. Is that a new feature introduced by sagemaker? I didn't find any related in documentation, but it's important to know because I don't want extra costs..
This is the image used by the processing job, with a instance type of ml.m5.2xlarge which I didn't set anywhere..
929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest
I can answer my question.. it seems to be a new feature as highlighted here. You can turn it off as suggested in the doc:
To disable both monitoring and profiling, include the disable_profiler parameter to your estimator and set it to True.

Re-hosting a trained model on AWS SageMaker

I have started exploring AWS SageMaker starting with these examples provided by AWS. I then made some modifications to this particular setup so that it uses the data from my use case for training.
Now, as I continue to work on this model and tuning, after I delete the inference endpoint once, I would like to be able to recreate the same endpoint -- even after stopping and restarting the notebook instance (so the notebook / kernel session is no longer valid) -- using the already trained model artifacts that gets uploaded to S3 under /output folder.
Now I cannot simply jump directly to this line of code:
bt_endpoint = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')
I did some searching -- including amazon's own example of hosting pre-trained models, but I am a little lost. I would appreciate any guidance, examples, or documentation that I could emulate and adapt to my case.
Your comment is correct - you can re-create an Endpoint given an existing EndpointConfiguration. This can be done via the console, the AWS CLI, or the SageMaker boto client.
https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint