Object Detection in AWS + Sagemaker Neo

I am trying the built-in object detection algorithm available on AWS SageMaker for a computer vision problem. The training job ran successfully and I have received the model artifacts in .tar.gz format in an S3 bucket.
To reduce the model footprint, we need to use SageMaker Neo - a compilation job on the available model artifacts. The compilation job fails with the error: "ClientError: OperatorNotImplemented:('One or more operators are not supported in frontend MXNet:\n_contrib_MultiBoxTarget: 1\nMakeLoss: 3"
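For reference, the compilation job was launched roughly like the sketch below (the training job name, S3 bucket, input shape and framework version are placeholders, not the exact values used):

from sagemaker.estimator import Estimator

# Sketch of the Neo compilation step; all names below are hypothetical.
od_estimator = Estimator.attach('object-detection-2019-08-01-00-00-00-000')  # placeholder training job name

compiled_model = od_estimator.compile_model(
    target_instance_family='ml_c5',            # target hardware for Neo
    input_shape={'data': [1, 3, 512, 512]},    # assumed NCHW input size of the SSD network
    output_path='s3://my-bucket/neo-output/',  # placeholder S3 location for compiled artifacts
    framework='mxnet',                         # the built-in Object Detection algorithm is MXNet-based
    framework_version='1.4.1',
)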
How can this be resolved?
As of around August 2019, SageMaker Neo did not support models trained with the built-in SageMaker Object Detection algorithm. Has this status changed since then?
Thanks

Related

AWS Sagemaker T5 or huggingface Model training issue

I am trying to train a T5 conditional generation model in SageMaker. It runs fine when I pass the arguments directly in a notebook, but it does not learn anything when I use an estimator and a train.py script. I followed the documentation provided by Hugging Face as well as AWS, but we are still facing an issue: training is reported as completed and the model is saved within 663 seconds, whatever the size of the dataset. Kindly give suggestions for this.
Check the Amazon CloudWatch logs to tell what took place during training (the stdout/stderr of train.py). This utility can help with downloading the logs to your local machine/notebook.
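If you prefer not to use a utility, a minimal boto3 sketch for pulling those logs looks like this (the training job name is a placeholder):

import boto3

# SageMaker training logs are written to the /aws/sagemaker/TrainingJobs log group,
# one or more streams per training job. The job name below is hypothetical.
logs = boto3.client('logs')
group = '/aws/sagemaker/TrainingJobs'
job_name = 'huggingface-pytorch-training-2022-01-01-00-00-00-000'  # placeholder

streams = logs.describe_log_streams(logGroupName=group, logStreamNamePrefix=job_name)
for stream in streams['logStreams']:
    events = logs.get_log_events(logGroupName=group, logStreamName=stream['logStreamName'])
    for event in events['events']:
        print(event['message'])  # train.py stdout/stderr lines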

AWS Sagemaker Pytorch does not run properly

I am currently trying to train a model using PyTorch on AWS SageMaker but can't get it to run properly. My main question now is: is there some workflow step I'm missing? Any help is greatly appreciated.
I managed to get the code running on Colab or on a local machine, for example, but not on SageMaker.
In short, the program should set up a PyTorch model, load the training data from a file system and run training epochs.
For this, I am trying the following:
The code files (data loaders/helper functions etc.) together with the "entry point" are stored in SageMaker Studio in the folder "code".
The training files are stored in an S3 bucket and are transferred in "file mode".
I then call the estimator in a Python notebook like this:
estimator = PyTorch(entry_point='entry.py',
                    role=role,
                    py_version='py3',
                    source_dir='code',
                    output_path='s3://XXXXX/XXXXXX/XXXX',
                    framework_version='1.3.1',
                    instance_count=1,
                    instance_type='ml.g4dn.2xlarge',
                    hyperparameters={
                        'epochs': 5,
                        'backend': 'gloo'
                    })
inputs = "s3://XXXXX/XXXXX"
estimator.fit({'training': inputs})
In the output I can see, that the train instance is prepared and the data is downloaded but then the problem arises:
For some reason the program jumps right into the train method. The outputs of the first steps that should take place before a training epoch, network whitening for example, only show up after or during the training step. After one training epoch the program freezes without any error message until I manually stop the instance.
Thanks for any help.
Your script gets stuck after one epoch.
There are no error messages to review, nor the code itself.
I suggest trying to troubleshoot this quickly using SageMaker local mode (e.g., instance_type='local_gpu'). This will allow you to retry different configurations in seconds instead of minutes, and potentially to attach a remote debugger.
Note: SageMaker local mode requires Docker support, so you'll need to run this either on your laptop or on a SageMaker notebook instance (e.g., 'ml.g4dn.2xlarge'), but not on a SageMaker Studio instance.
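For example, a local-mode run can reuse the same estimator with only the instance type and inputs swapped (a sketch; the file path is a placeholder):

from sagemaker.pytorch import PyTorch

# Same estimator as in the question, but running in SageMaker local mode (requires Docker).
estimator = PyTorch(entry_point='entry.py',
                    source_dir='code',
                    role=role,
                    py_version='py3',
                    framework_version='1.3.1',
                    instance_count=1,
                    instance_type='local_gpu',   # use 'local' on a machine without a GPU
                    hyperparameters={'epochs': 1, 'backend': 'gloo'})

# 'file://' inputs skip the S3 download while you iterate on the script.
estimator.fit({'training': 'file:///home/ec2-user/SageMaker/data'})  # placeholder path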

How to schedule a retrain of a sagemaker pipeline model using airflow

I have already implemented a SageMaker pipeline model. In particular, I followed this sample notebook, an end-to-end example that trains a model, builds a pipeline model and deploys it.
Now I would like to retrain and deploy the entire pipeline every day using Airflow, but what I have seen here only covers retraining and deploying a single SageMaker model.
Is there a way to retrain and deploy the entire pipeline? Thanks
SageMaker provides two options for integrating with Airflow:
Use the APIs in the SageMaker Python SDK to generate the input for the SageMaker operators in Airflow. The blog you linked goes this way: for example, it uses the training_config API in the SageMaker Python SDK and the SageMakerTrainingOperator in Airflow.
Use the PythonOperator provided by Airflow and write Python code to do whatever you want.
For option 1, SageMaker has only implemented APIs related to training, tuning, single-model deployment and batch transform. Since you are working with a pipeline model, I don't think it has the API you want.
But for option 2, if you can do what you want in plain Python code with SageMaker, you should be able to adapt it into Python callables and run them with PythonOperators. Here's an example for training in this way provided by SageMaker:
https://sagemaker.readthedocs.io/en/stable/using_workflow.html#using-airflow-python-operator
I think you can do similar things to make Airflow work with your pipeline model.
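A minimal sketch of option 2 for a daily retrain might look like this (the DAG name, schedule and callable body are assumptions; the callable would contain the same SageMaker code as the sample notebook):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain_and_deploy_pipeline():
    # Hypothetical callable: re-fit the estimators, rebuild the PipelineModel
    # and redeploy the endpoint, exactly as in the end-to-end notebook.
    ...

with DAG(dag_id='retrain_pipeline_model',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:
    retrain = PythonOperator(task_id='retrain_and_deploy',
                             python_callable=retrain_and_deploy_pipeline)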

Understanding Sagemaker Neo

I have a few questions about SageMaker Neo:
1) Can I take advantage of SageMaker Neo if I have an externally trained TensorFlow/MXNet model?
2) SageMaker provides a container image for 'image-classification', and it has released a new image named 'image-classification-neo' for the Neo compilation job. What is the difference between the two? Do I similarly need a new Neo-compatible image for each pre-built SageMaker algorithm container?
Any help would be appreciated
Thanks!!
1) Yes. Upload your model to an S3 bucket as a model.tar.gz file (similar to what SageMaker would save after training) and you can compile it.
2) The Neo versions use the Neo runtime to load and predict, so yes, the containers are different. Right now, Neo supports the XGBoost and Image Classification built-in algos. Of course, you could build your own custom container and use Neo inside that. For more info: https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html
Julien
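For point 1, a compilation job for an externally trained model can be created directly with boto3, roughly like this (the job name, role ARN, bucket, input shape and target device are placeholders):

import boto3

sm = boto3.client('sagemaker')

# Sketch: compile an externally trained MXNet model already uploaded to S3 as model.tar.gz.
sm.create_compilation_job(
    CompilationJobName='my-external-model-neo',                 # placeholder
    RoleArn='arn:aws:iam::123456789012:role/MySageMakerRole',   # placeholder
    InputConfig={
        'S3Uri': 's3://my-bucket/model/model.tar.gz',
        'DataInputConfig': '{"data": [1, 3, 224, 224]}',        # assumed NCHW input shape
        'Framework': 'MXNET',
    },
    OutputConfig={
        'S3OutputLocation': 's3://my-bucket/compiled/',
        'TargetDevice': 'ml_c5',
    },
    StoppingCondition={'MaxRuntimeInSeconds': 900},
)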
It has been a long time since this question was asked, but in case someone turns up here after searching for the same thing:
Mainly, Amazon Neo is an optimizer that makes the model compatible with multiple underlying hardware platforms. From the documentation:
"Neo is a new capability of Amazon SageMaker that enables machine learning models to train once and run anywhere in the cloud and at the edge. "
And yes, those two Docker images are different: one of them contains the optimizer code, the other doesn't.
The difference is not in the input, so 'image-classification-neo' can work with the same images that 'image-classification' can.
But the output is different.
The output of 'image-classification-neo' can be used on multiple platforms.
You can check the supported hardware platforms at the link below:
https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html

AWS SageMaker - training locally but deploying to AWS?

I have the following challenge with SageMaker:
I've downloaded one of the tutorial notebooks (https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_abalone_age_predictor_using_keras/tensorflow_abalone_age_predictor_using_keras.ipynb)
I ran the training locally (successfully) after modifying the following line:
abalone_estimator = TensorFlow(entry_point='abalone.py',
                               role=role,
                               training_steps=100,
                               evaluation_steps=100,
                               hyperparameters={'learning_rate': 0.001},
                               train_instance_count=1,
                               train_instance_type='local')  # <-- the modified line

abalone_estimator.fit(inputs)
I then wanted to deploy my model to AWS with the following line, but it seems the SDK deploys it locally (it doesn't fail; I just see it running on my machine):
abalone_predictor = abalone_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
Any tips on how to either fix this so it gets deployed to AWS, or alternatively reload my trained model and deploy it to AWS from scratch?
Many thanks,
Stefan
It's easier to run the training again on SageMaker.
Otherwise, here are the steps you would have to follow:
Take the checkpoint files generated during training and convert them into TensorFlow Serving models.
Zip them in the expected format and upload the archive to S3.
Then create a model from those artifacts and deploy it for inference, as shown in the sketch below.
If you want details on each of the specific steps above, do let me know, but if your dataset is not too big, I would say just retrain on SageMaker.
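A rough sketch of the deploy-from-artifacts path might look like this (the bucket and entry point are placeholders, and the exact class and arguments depend on your SageMaker SDK version):

import sagemaker
from sagemaker.tensorflow.model import TensorFlowModel

# Sketch: deploy a TensorFlow Serving model that was trained locally, packaged as
# model.tar.gz and uploaded to S3. Bucket and entry point are placeholders.
role = sagemaker.get_execution_role()

abalone_model = TensorFlowModel(model_data='s3://my-bucket/abalone/model.tar.gz',
                                role=role,
                                entry_point='abalone.py')

abalone_predictor = abalone_model.deploy(initial_instance_count=1,
                                         instance_type='ml.m4.xlarge')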