AWS SageMaker - training locally but deploying to AWS? - amazon-web-services

I have the following challenge with SageMaker:
I've downloaded one of the tutorial notebooks (https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_abalone_age_predictor_using_keras/tensorflow_abalone_age_predictor_using_keras.ipynb)
I ran the training locally (successfully) after modifying the following line:
abalone_estimator = TensorFlow(entry_point='abalone.py',
                               role=role,
                               training_steps=100,
                               evaluation_steps=100,
                               hyperparameters={'learning_rate': 0.001},
                               train_instance_count=1,
                               train_instance_type='local')  # the modified line: train locally
abalone_estimator.fit(inputs)
I then wanted to deploy my model to AWS with the following line, but it seems the SDK deploys it locally (it doesn't fail, I just see it running on my machine):
abalone_predictor = abalone_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Any tips on how to either fix it so it gets deployed to AWS or alternatively re-load my training model and deploy it to AWS from scratch?
Many thanks,
Stefan

It's easier to run the training again on SageMaker.
Otherwise, here are the steps that you would have to follow:
Take the checkpoint files generated during the training and convert them into a TensorFlow Serving model.
Archive them in the specific format the serving container expects and upload the archive to S3.
Then create the estimator as you have done above and run the inference.
If you want details on any of the steps above, do let me know, but if your dataset is not too big, I would say just retrain on SageMaker.
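The "archive and upload" step above could be sketched as follows. This is a minimal sketch, not the answerer's own procedure: it assumes the checkpoints have already been exported to a TensorFlow SavedModel directory, and the `export/Servo/1` prefix is an assumption about the layout the legacy SageMaker TensorFlow serving container looks for, so check it against the documentation of the container version you use. The function name and paths are hypothetical, and the S3 upload itself is not shown.

```python
import tarfile
from pathlib import Path
from typing import List


def package_saved_model(saved_model_dir: str, archive_path: str,
                        arc_prefix: str = "export/Servo/1") -> List[str]:
    """Write saved_model_dir into a gzipped tar under arc_prefix.

    arc_prefix is an ASSUMPTION about the layout the serving container expects.
    Returns the member names of the archive so the layout can be inspected.
    """
    src = Path(saved_model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Add every file below the export directory, keeping relative paths.
        for path in sorted(src.rglob("*")):
            if path.is_file():
                arcname = f"{arc_prefix}/{path.relative_to(src).as_posix()}"
                tar.add(str(path), arcname=arcname)
    # Re-open the archive read-only to report what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

The resulting model.tar.gz is what would then be uploaded to S3 and referenced when creating the model for deployment.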

Related

AWS Sagemaker - Custom Training Job not saving Model output

I'm running a training job using AWS SageMaker and I'm using a custom Estimator based on an available Docker image from AWS. I wanted to get some feedback on whether my process is correct or not prior to deployment.
I'm running the training job in a Docker container using 'local' in a SageMaker notebook instance, and the training job runs successfully. However, after the job completes and saves the model to opt/model/models within the Docker image, the model saved from training is lost once the Docker container exits. Ideally, I'd like to use the model for inference, but I'm not sure about the best way of doing it. I have also tried the training job after pushing the image to ECR, but the same thing happens.
It is my understanding that the Docker state is lost once the container exits. Given that, is it possible to persist the model that was produced in training in the image? One option I have thought about is saving the model output to an S3 bucket once the training job is complete, then pulling that model into another Docker image for inference. Is this the expected behaviour and the correct way of doing it?
I am fairly new to using SageMaker, but I'd like to do it according to best practices. I've looked at a lot of the AWS documents and followed the tutorials, but they don't seem to mention explicitly whether this is how it should be done.
Thanks for any feedback on this.
You can refer to Rok's comment on saving a model file when you're using a custom estimator. That said, SageMaker built-in estimators save the model artifacts to S3. To make inferences using that model, you can either use a real-time inference endpoint for real-time predictions, or a batch transformer to run inferences in batch mode. In both cases, you'll have to point the configuration to the container for inference and to the model artifacts. The amazon-sagemaker-examples repository has examples for common frameworks; in particular, the scikit-learn example has detailed explanations.
Also, make sure the model is being saved to /opt/ml/model/, not opt/model/models as mentioned in your question.
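For illustration, a training script could write its model into that directory like this. It is a minimal sketch under assumptions: the model is picklable, `model.pkl` is a hypothetical file name, and `SM_MODEL_DIR` is the environment variable set by SageMaker's training toolkit, which may be absent in a bare custom image (hence the fallback to /opt/ml/model).

```python
import os
import pickle


def save_model(model, model_dir=None):
    """Pickle `model` into the directory SageMaker archives after training."""
    # SM_MODEL_DIR is set by SageMaker's training toolkit; fall back to the
    # default path when it is absent (e.g. in a bare custom image).
    model_dir = model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```

Anything written there is packed into model.tar.gz and copied to the job's output location when training ends, which is how the artifact survives the container exiting.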

AWS Sagemaker Pytorch does not run properly

I am currently trying to train a model using pytorch on AWS Sagemaker but can't get it to run properly. My main question now is: Is there some workflow step I'm missing? Any help is greatly appreciated.
I managed to get the code running on Colab and on a local machine, for example, but not on SageMaker.
In short, the program should set up a PyTorch model, load the training data from a file system and run training epochs.
For this, I am trying the following:
The code files (dataloaders/help functions etc) with the "entry point" are stored at Sagemaker Studio in the folder "code".
The training files are stored in an S3 bucket and are transferred in "file mode".
I then call the estimator in a Python notebook like this:
estimator = PyTorch(entry_point='entry.py',
                    role=role,
                    py_version='py3',
                    source_dir="code",
                    output_path="s3://XXXXX/XXXXXX/XXXX",
                    framework_version='1.3.1',
                    instance_count=1,
                    instance_type='ml.g4dn.2xlarge',
                    hyperparameters={
                        'epochs': 5,
                        'backend': 'gloo'
                    })
inputs = "s3://XXXXX/XXXXX"
estimator.fit({'training': inputs})
In the output I can see that the training instance is prepared and the data is downloaded, but then the problem arises:
For some reason the program jumps right into the train method. The outputs of the first steps which should take place before a train epoch, Network whitening for example, are shown after or during the train step. After one train epoch the program freezes without any error message until I manually stop the instance.
Thanks for any help.
Your script gets stuck after one epoch.
There are no error messages to review, and the code itself isn't shown.
I suggest troubleshooting this quickly using SageMaker Local mode (e.g., instance_type='local_gpu'). This will allow you to retry different configurations in seconds instead of minutes, and potentially to debug it remotely.
Note: SageMaker Local mode requires Docker support, so you'll need to run this either on your laptop or on a SageMaker notebook instance (such as an 'ml.g4dn.2xlarge'), but not on a SageMaker Studio instance.

how to run a pre-trained model in AWS sagemaker?

I have a pre-trained model.pkl file and all the other files related to the ML model. I want to deploy it on AWS SageMaker.
But how do I deploy it to AWS SageMaker without training? The fit() method in AWS SageMaker runs the train command and pushes the model.tar.gz to the S3 location, and when the deploy method is used it uses the same S3 location to deploy the model. We don't manually create that location in S3, as it is created by AWS and named using some timestamp. How can I put my own model.tar.gz file in an S3 location and call the deploy() function using that location?
All you need is:
to have your model in an arbitrary S3 location in a model.tar.gz archive
to have an inference script in a SageMaker-compatible docker image that is able to read your model.pkl, serve it and handle inferences.
to create an endpoint associating your artifact to your inference code
When you ask for an endpoint deployment, SageMaker will take care of downloading your model.tar.gz and uncompressing to the appropriate location in the docker image of the server, which is /opt/ml/model
Depending on the framework you use, you may use either a pre-existing docker image (available for Scikit-learn, TensorFlow, PyTorch, MXNet) or you may need to create your own.
Regarding custom image creation, see here the specification and here two examples of custom containers for R and sklearn (the sklearn one is less relevant now that there is a pre-built docker image along with a sagemaker sklearn SDK)
Regarding leveraging existing containers for Sklearn, PyTorch, MXNet, TF, check this example: Random Forest in SageMaker Sklearn container. In this example, nothing prevents you from deploying a model that was trained elsewhere. Note, though, that with a train/deploy environment mismatch you may run into errors due to software version differences.
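As an illustration of the inference-script side, here is a minimal sketch assuming one of the pre-built framework containers (such as scikit-learn), where the script conventionally exposes `model_fn` and, optionally, `predict_fn`. It assumes the archive contains a file named model.pkl at its root and that the unpickled object has a `predict` method; both are assumptions about your artifact, not requirements of SageMaker.

```python
import os
import pickle


def model_fn(model_dir):
    """Load the model that SageMaker extracted from model.tar.gz into model_dir."""
    # model_dir is /opt/ml/model inside the serving container.
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


def predict_fn(input_data, model):
    """Run inference on already-deserialized input with the loaded model."""
    return model.predict(input_data)
```

Unpickling only works if the libraries the model depends on are installed in the container at compatible versions, which is where the version mismatch mentioned above tends to show up.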
Regarding your following experience:
when deploy method is used it uses the same s3 location to deploy the
model, we don't manual create the same location in s3 as it is created
by the aws model and name it given by using some timestamp
I agree that sometimes the demos that use the SageMaker Python SDK (one of the many available SDKs for SageMaker) may be misleading, in the sense that they often leverage the fact that an Estimator that has just been trained can be deployed (Estimator.deploy(..)) in the same session, without having to instantiate the intermediary model concept that maps inference code to model artifact. This design is presumably chosen for the sake of code compactness, but in real life, training and deployment of a given model may well be done from different scripts running in different systems. It's perfectly possible to deploy a model without training it previously in the same session: you need to instantiate a sagemaker.model.Model object and then deploy it.
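To make the "intermediary model concept" concrete, here is a sketch that uses the lower-level boto3 SageMaker client rather than the sagemaker.model.Model class mentioned above; the two routes lead to the same three resources. The function name is hypothetical, `sm_client` is assumed to be `boto3.client("sagemaker")`, and the image URI, S3 URL and role ARN are placeholders you would supply.

```python
def deploy_from_artifact(sm_client, name, image_uri, model_data_url, role_arn,
                         instance_type="ml.m4.xlarge", instance_count=1):
    """Register an existing model.tar.gz and host it, with no training job involved."""
    # 1. The model maps inference code (container image) to the model artifact.
    sm_client.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # 2. The endpoint configuration says which hardware serves that model.
    sm_client.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": instance_count,
            "InstanceType": instance_type,
        }],
    )
    # 3. The endpoint is the running deployment that can be invoked.
    return sm_client.create_endpoint(EndpointName=name, EndpointConfigName=name)
```

Passing the client in as an argument keeps the function independent of any notebook session, which matches the point that training and deployment can live in different scripts.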

How to schedule a retrain of a sagemaker pipeline model using airflow

I have already implemented a sagemaker pipeline model. In particular for an end-to-end notebook that trains a model, builds a pipeline model and deploys it, I have followed this sample notebook.
Now I would like to retrain and deploy the entire pipeline every day using Airflow, but I have seen here the possibility to retrain and deploy only a single sagemaker model.
Is there a way to retrain and deploy the entire pipeline? Thanks
SageMaker provides 2 options for working with Airflow:
1. Use the APIs in the SageMaker Python SDK to generate the input of the SageMaker operators in Airflow. The blog you linked goes this way. For example, it uses the training_config API in the SageMaker Python SDK and the SageMakerTrainingOperator operator in Airflow.
2. Use the PythonOperator provided by Airflow and write Python code to do what you want.
For 1, SageMaker has only implemented APIs related to training, tuning, single model deployment and transform. Since you are using a pipeline model, I don't think it has the API you want.
But for 2, if you can do what you want in Python code with SageMaker, you should be able to adapt it into Python callables and make them work with PythonOperators. Here's an example of training in this way provided by SageMaker:
https://sagemaker.readthedocs.io/en/stable/using_workflow.html#using-airflow-python-operator
I think you can do similar things to make Airflow work with your pipeline model.
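A rough sketch of what such a callable might look like is below. Everything in it is hypothetical: `train`, `build_pipeline_model` and `deploy` stand in for whatever SageMaker code your notebook already runs, and the Airflow wiring is only indicated in a comment so that the block runs without Airflow installed.

```python
def make_retrain_callable(train, build_pipeline_model, deploy):
    """Chain the notebook's three stages into one callable for a PythonOperator."""

    def retrain_and_deploy(**context):
        # **context absorbs any keyword arguments Airflow passes to the callable.
        artifacts = train()                      # e.g. fit each estimator
        model = build_pipeline_model(artifacts)  # e.g. assemble the pipeline model
        return deploy(model)                     # e.g. create or update the endpoint

    return retrain_and_deploy


# In a DAG file this would be wired up roughly as:
#   PythonOperator(task_id="retrain_pipeline",
#                  python_callable=make_retrain_callable(train, build, deploy),
#                  dag=dag)
```

With a daily schedule on the DAG, the whole train, build and deploy sequence would then run once per day as a single task.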

Re-hosting a trained model on AWS SageMaker

I have started exploring AWS SageMaker starting with these examples provided by AWS. I then made some modifications to this particular setup so that it uses the data from my use case for training.
Now, as I continue to work on this model and its tuning, after I delete the inference endpoint once, I would like to be able to recreate the same endpoint -- even after stopping and restarting the notebook instance (so the notebook / kernel session is no longer valid) -- using the already trained model artifacts that get uploaded to S3 under the /output folder.
Now I cannot simply jump directly to this line of code:
bt_endpoint = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')
I did some searching -- including amazon's own example of hosting pre-trained models, but I am a little lost. I would appreciate any guidance, examples, or documentation that I could emulate and adapt to my case.
Your comment is correct - you can re-create an Endpoint given an existing EndpointConfiguration. This can be done via the console, the AWS CLI, or the SageMaker boto client.
https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint
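With the boto client, re-creating the endpoint could look like the sketch below. It assumes the EndpointConfiguration (and the model it references) still exists, that `sm_client` is `boto3.client("sagemaker")`, and that the names passed in are your own; the function name is hypothetical.

```python
def recreate_endpoint(sm_client, endpoint_name, endpoint_config_name, wait=True):
    """Create an endpoint from an existing endpoint configuration and return its status."""
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )
    if wait:
        # Block until the endpoint reports InService (this can take several minutes).
        sm_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```

Because this only needs the configuration name, it works from a fresh kernel or a different machine, with no reference to the original training session.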