How to schedule a retrain of a SageMaker pipeline model using Airflow

I have already implemented a sagemaker pipeline model. In particular for an end-to-end notebook that trains a model, builds a pipeline model and deploys it, I have followed this sample notebook.
Now I would like to retrain and deploy the entire pipeline every day using Airflow, but the example I have seen here only shows how to retrain and deploy a single SageMaker model.
Is there a way to retrain and deploy the entire pipeline? Thanks

SageMaker provides two options for working with Airflow:
1. Use the APIs in the SageMaker Python SDK to generate the input for the SageMaker operators in Airflow. The blog you linked takes this approach. For example, it uses the training_config API in the SageMaker Python SDK together with the SageMakerTrainingOperator in Airflow.
2. Use the PythonOperator provided by Airflow and write Python code to do what you want.
For option 1, SageMaker has only implemented APIs related to training, tuning, single-model deployment and transform. Since you are working with a pipeline model, I don't think it has the API you want.
For option 2, if you can accomplish what you want in Python code with SageMaker, you should be able to adapt that code into Python callables and run them with PythonOperators. Here's an example of training in this way, provided by SageMaker:
https://sagemaker.readthedocs.io/en/stable/using_workflow.html#using-airflow-python-operator
I think you can do similar things to make Airflow work with your pipeline model.
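As a minimal sketch of option 2, the three steps from the notebook (train, build the pipeline model, deploy) can be chained inside one Python callable and scheduled daily. The names make_retrain_callable, build_dag, the dag_id and the three step functions are hypothetical placeholders for your own SageMaker code, and the DAG arguments assume Airflow 2.4 or later.

```python
from datetime import datetime


def make_retrain_callable(train_fn, build_pipeline_model_fn, deploy_fn):
    """Chain three user-supplied steps into one callable for a PythonOperator.

    The three arguments are hypothetical placeholders for the SageMaker Python
    SDK code from the notebook (fit the estimators, build the pipeline model,
    deploy it).
    """
    def retrain_and_deploy():
        trained = train_fn()
        pipeline_model = build_pipeline_model_fn(trained)
        # Return something small (e.g. the endpoint name): Airflow stores the
        # return value of a python_callable in XCom.
        return deploy_fn(pipeline_model)

    return retrain_and_deploy


def build_dag(retrain_callable):
    """Wrap the callable in a daily DAG (assumes Airflow 2.4 or later)."""
    # Imported lazily so the callable above can be tested without Airflow.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    dag = DAG(
        dag_id="retrain_sagemaker_pipeline_model",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    )
    PythonOperator(
        task_id="retrain_and_deploy",
        python_callable=retrain_callable,
        dag=dag,
    )
    return dag
```

In a DAG file you would assign the result of build_dag(...) to a module-level variable so that the scheduler can discover it.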

Related

How to use packaged model tar.gz inside SageMaker Processing Job?

I am working on deploying a full ML pipeline for SageMaker and Airflow. I would like to separate training and processing part of the pipeline.
I have a question concerning the SageMakerProcessingOperator(source_code). This operator relies on the create_processing_job() function. When using this operator, I would like to extend the base Docker image used for processing in order to use a home-made script. Currently, the processing works fine when I push my container to AWS ECR. However, I would prefer to use part of the script stored inside my packaged model (in tar.gz format).
For training and registering the model, we can specify the image used to extend with sagemaker_submit_directory and SAGEMAKER_PROGRAM env variable (cf aws_doc). However it looks like it is not possible using the SageMakerProcessingOperator.
Below is an extract of the config used in the operator, with no success yet.
    "Environment": {
        "sagemaker_enable_cloudwatch_metrics": "false",
        "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
        "SAGEMAKER_REGION": f"{self.region_name}",
        "SAGEMAKER_SUBMIT_DIRECTORY": f"{self.train_code_path}",
        "SAGEMAKER_PROGRAM": f"{self.processing_entry_point}",
        "sagemaker_job_name": f"{self.process_job_name}",
    },
Did anyone manage to use these parameters with SageMaker create_processing_job()? Or is it limited to AWS ECR only?
SageMaker Processing jobs and SageMaker training jobs are different job types with different underlying architectures, so the settings used for training cannot simply be carried over to a Processing job.
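One possible workaround, which is an assumption and not something confirmed in this thread, is to pass the model archive to the Processing job as an ordinary S3 input and let a small bootstrap script in the image pull the processing script out of it. The sketch below shows only that bootstrap step; the function name and the example paths mentioned afterwards are hypothetical.

```python
import subprocess
import sys
import tarfile
from pathlib import Path


def run_script_from_archive(archive_path, script_name, work_dir):
    """Extract one script from a model.tar.gz and run it with the current Python.

    archive_path: local path of the archive inside the container.
    script_name:  name of the script as stored in the archive.
    work_dir:     directory where the extracted script is written.
    """
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / Path(script_name).name

    with tarfile.open(archive_path, "r:gz") as archive:
        # extractfile raises KeyError if the member does not exist and
        # returns None if it is not a regular file.
        member = archive.extractfile(script_name)
        if member is None:
            raise ValueError(f"{script_name} is not a regular file in the archive")
        target.write_bytes(member.read())

    # check=True raises CalledProcessError if the script exits with an error.
    completed = subprocess.run([sys.executable, str(target)], check=True)
    return completed.returncode
```

Inside the container this could be called with something like run_script_from_archive("/opt/ml/processing/model/model.tar.gz", "code/process.py", "/opt/ml/processing/code"), assuming the archive was mapped to that local path in the job's input configuration.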

AWS Sagemaker - Custom Training Job not saving Model output

I'm running a training job using AWS SageMaker with a custom Estimator based on an available Docker image from AWS. I wanted to get some feedback on whether my process is correct prior to deployment.
I'm running the training job in a Docker container using 'local' in a SageMaker notebook instance, and the training job runs successfully. However, the job saves the model to opt/model/models within the Docker container, and once the container exits, the model saved from training is lost. Ideally, I'd like to use the model for inference, but I'm not sure about the best way of doing it. I have also tried running the training job after pushing the image to ECR, but the same thing happens.
It is my understanding that the Docker state is lost once the container exits. Is it possible to persist the model that was produced in training in the image? One option I have thought about is saving the model output to an S3 bucket once the training job is complete, then pulling that model into another Docker image for inference. Is this the expected behaviour and the correct way of doing it?
I am fairly new to SageMaker, but I'd like to do it according to best practices. I've looked at a lot of the AWS documents and followed the tutorials, but they don't seem to mention explicitly whether this is how it should be done.
Thanks for any feedback on this.
You can refer to Rok's comment on saving a model file when you're using a custom estimator. That said, SageMaker built-in estimators save the model artifacts to S3. To make inferences using that model, you can use either a real-time inference endpoint for real-time predictions or a batch transform job to run inference in batch mode. In both cases, you'll have to point the configuration to the inference container and the model artifacts. The amazon-sagemaker-examples repository has examples for common frameworks; the scikit-learn example in particular has detailed explanations.
Also, make sure the model is being saved to /opt/ml/model/, not opt/model/models as mentioned in your question.
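A minimal sketch of that last point: at the end of a training job, SageMaker archives whatever is in /opt/ml/model and uploads it to the job's S3 output location, so the training script only needs to write there. The function name and the pickle format are illustrative assumptions; SM_MODEL_DIR is the environment variable set by the SageMaker training toolkit in framework containers and may not exist in a fully custom image, hence the fallback.

```python
import os
import pickle
from pathlib import Path


def save_model(model, model_dir=None):
    """Write the trained model where SageMaker collects training artifacts.

    Falls back to SM_MODEL_DIR, then to /opt/ml/model, when no directory
    is passed explicitly.
    """
    model_dir = Path(model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / "model.pkl"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```

The inference container can then load the same file from the model artifacts that SageMaker restores under /opt/ml/model at serving time.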

Custom code containers for google cloud-ml for inference

I am aware that it is possible to deploy custom containers for training jobs on Google Cloud, and I have been able to get this running using the following command:
gcloud ai-platform jobs submit training infer name --region some_region --master-image-uri=path/to/docker/image --config config.yaml
The training job completed successfully and the model was obtained. Now I want to use this model for inference, but part of my code has system-level dependencies, so I have to make some modifications to the architecture in order to keep it running all the time. This was the reason for using a custom container for the training job in the first place.
To the best of my knowledge, the documentation only covers the training part; inference with custom containers (if it is possible at all) is not documented.
The training part documentation is available on this link
My question is, is it possible to deploy custom containers for inference purposes on google cloud-ml?
This response refers to using Vertex AI Prediction, the newest platform for ML on GCP.
Suppose you wrote the model artifacts out to cloud storage from your training job.
The next step is to create the custom container and push to a registry, by following something like what is described here:
https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements
This section describes how you pass the model artifact directory to the custom container to be used for inference:
https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#artifacts
You will also need to create an endpoint in order to deploy the model:
https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api#aiplatform_deploy_model_custom_trained_model_sample-gcloud
Finally, you would use gcloud ai endpoints deploy-model ... to deploy the model to the endpoint:
https://cloud.google.com/sdk/gcloud/reference/ai/endpoints/deploy-model
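To make the container requirements more concrete, here is a minimal sketch of the HTTP server such a container could run, using only the Python standard library. It reads the AIP_* environment variables that Vertex AI sets for the container and answers health checks and prediction requests in the {"instances": ...} / {"predictions": ...} JSON shape. The fallback routes, the function names and the placeholder predict function are assumptions for illustration only.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_config(env=os.environ):
    """Read the settings Vertex AI passes in; the fallbacks are for local runs."""
    return {
        "port": int(env.get("AIP_HTTP_PORT", "8080")),
        "health_route": env.get("AIP_HEALTH_ROUTE", "/health"),
        "predict_route": env.get("AIP_PREDICT_ROUTE", "/predict"),
        "storage_uri": env.get("AIP_STORAGE_URI", ""),  # model artifact location
    }


def predict(instances):
    """Hypothetical stand-in for the real model: returns each input's text length."""
    return [len(str(instance)) for instance in instances]


def make_handler(config):
    class Handler(BaseHTTPRequestHandler):
        def _send_json(self, status, payload):
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_GET(self):
            if self.path == config["health_route"]:
                self._send_json(200, {"status": "ok"})
            else:
                self._send_json(404, {"error": "not found"})

        def do_POST(self):
            if self.path != config["predict_route"]:
                self._send_json(404, {"error": "not found"})
                return
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length))
            self._send_json(200, {"predictions": predict(request["instances"])})

    return Handler


def serve(config):
    """Block forever, serving health and prediction requests."""
    HTTPServer(("0.0.0.0", config["port"]), make_handler(config)).serve_forever()
```

The container's entrypoint would call serve(load_config()); a real implementation would load the model from the artifact location once at startup instead of using the placeholder.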

AWS SageMaker - training locally but deploying to AWS?

I have the following challenge with SageMaker:
I've downloaded one of the tutorial notebooks (https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_abalone_age_predictor_using_keras/tensorflow_abalone_age_predictor_using_keras.ipynb)
I ran the training locally (successfully) after modifying the following line:
    abalone_estimator = TensorFlow(entry_point='abalone.py',
                                   role=role,
                                   training_steps=100,
                                   evaluation_steps=100,
                                   hyperparameters={'learning_rate': 0.001},
                                   train_instance_count=1,
                                   train_instance_type='local')  # the modified argument
    abalone_estimator.fit(inputs)
I then wanted to deploy my model to AWS with the following line, but it seems the SDK deploys it locally (it doesn't fail, I just see it running on my machine):
abalone_predictor = abalone_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Any tips on how to either fix it so it gets deployed to AWS or alternatively re-load my training model and deploy it to AWS from scratch?
Many thanks,
Stefan
It's easier to run the training again on SageMaker.
Otherwise, here are the steps that you would have to follow:
1. Take the checkpoint files generated during the training and convert them into TensorFlow Serving models.
2. Zip them in a specific format and upload them to S3.
3. Then create the estimator as you have done above and do the inference.
If you want details on each of the specific steps above do let me know, but if your dataset is not too big, I would say just retrain on SageMaker.
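A rough sketch of the packaging and upload steps follows, assuming the model has already been exported to a local directory. The function names, bucket and key are hypothetical, and the directory layout expected inside the archive depends on the serving container and SDK version, so check it against the container you deploy with.

```python
import tarfile
from pathlib import Path


def package_model(export_dir, archive_path="model.tar.gz"):
    """Pack the contents of export_dir into a gzipped tar archive.

    Paths inside the archive are relative to export_dir, so the exported
    model sits at the top level of the archive. Keep archive_path outside
    export_dir so the archive does not include itself.
    """
    export_dir = Path(export_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in sorted(export_dir.rglob("*")):
            archive.add(
                path,
                arcname=path.relative_to(export_dir).as_posix(),
                recursive=False,  # rglob already walks the tree
            )
    return Path(archive_path)


def upload_model(archive_path, bucket, key):
    """Upload the archive to S3 and return its S3 URI (requires boto3)."""
    import boto3  # imported lazily so packaging works without AWS libraries

    boto3.client("s3").upload_file(str(archive_path), bucket, key)
    return f"s3://{bucket}/{key}"
```

The returned S3 URI is what you would pass as the model data location when creating the model object that gets deployed to a non-local instance type such as ml.m4.xlarge.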

How to load a training set in AWS SageMaker to build a model?

I am very new to SageMaker. From my first interaction, it looks like AWS SageMaker requires you to start from its Notebook. I have a training set that is ready. Is there a way to bypass setting up the Notebook and just start by uploading the training set? Or does it have to be done through the Notebook? If anyone knows of an example fitting my need, that would be great.
Amazon SageMaker is a combination of multiple services, each of which is independent of the others. You can use the notebook instances if you want to develop your models in the familiar Jupyter environment. But if you just need to train a model, you can use the training jobs without opening a notebook instance.
There are a few ways to launch a training job:
1. Use the high-level SDK for Python, which resembles the way you would start a training step in your own Python code:
    kmeans.fit(kmeans.record_set(train_set[0]))
   Here is the link to the Python library: https://github.com/aws/sagemaker-python-sdk
2. Use the low-level CreateTrainingJob API, through any of the SDKs (Java, Python, JavaScript, C#...) or the CLI:
    import boto3
    # create_training_params is a dict holding the request parameters
    sagemaker = boto3.client('sagemaker')
    sagemaker.create_training_job(**create_training_params)
   Here is a link to the documentation on these options: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
3. Use the Spark interface, which is similar to the interface for creating an MLlib training job:
    val estimator = new KMeansSageMakerEstimator(
        sagemakerRole = IAMRole(roleArn),
        trainingInstanceType = "ml.p2.xlarge",
        trainingInstanceCount = 1,
        endpointInstanceType = "ml.c4.xlarge",
        endpointInitialInstanceCount = 1)
      .setK(10).setFeatureDim(784)
    val model = estimator.fit(trainingData)
   Here is a link to the spark-sagemaker library: https://github.com/aws/sagemaker-spark
4. Create a training job in the Amazon SageMaker console using the wizard there: https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs
Please note that there are also a few options for what you train: you can use the built-in algorithms such as K-Means, Linear Learner or XGBoost (see here for the complete list: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), bring your own model code to the pre-baked Docker images such as TensorFlow (https://docs.aws.amazon.com/sagemaker/latest/dg/tf.html) or MXNet (https://docs.aws.amazon.com/sagemaker/latest/dg/mxnet.html), or bring your own Docker image (https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).
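For the low-level API option, the answer does not show what the request parameters look like. The sketch below builds a plausible minimal request for create_training_job; the function name, channel name, instance type, volume size and time limit are illustrative assumptions, and the image URI, role ARN and S3 locations must come from your own account.

```python
def build_training_params(job_name, image_uri, role_arn, train_s3_uri,
                          output_s3_uri, instance_type="ml.m5.xlarge"):
    """Build a minimal request dict for the CreateTrainingJob API."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,   # built-in algorithm or your own image
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",   # assumed channel name
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,  # where the training set was uploaded
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

With the training set already uploaded to S3, the resulting dict can be passed as boto3.client('sagemaker').create_training_job(**params), with no notebook instance involved.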