AWS SageMaker custom training job container: emitting loss metrics

I have created a custom Docker container using an Amazon TensorFlow container as a starting point:
763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04
Inside the container I run a custom Keras (with TF backend) training job via the Docker SAGEMAKER_PROGRAM entry point. I can access the training data fine (from an EFS mount) and can generate output into /opt/ml/model that gets synced back to S3. So input and output are good; what I am missing is real-time monitoring.
A SageMaker training job emits system metrics such as CPU and GPU load, which you can conveniently view in real time on the SageMaker training job console. But I cannot find a way to emit metrics about the progress of the training job, i.e. loss, accuracy, etc., from my Python code.
Ideally I would like to use TensorBoard, but as SageMaker doesn't expose the instance on the EC2 console, I cannot see how to find the IP address of the instance to connect to for TensorBoard.
So the fallback is to try to emit relevant metrics from the training code so that we can monitor the job as it runs.
The basic question is: how do I monitor key metrics in real time for my custom training job running in a container on a SageMaker training job?
- Is a TensorBoard solution possible? If so, how?
- If not, how do I emit metrics from my Python code and have them show up in the training job console or as CloudWatch metrics directly?
BTW: so far I have not been able to get sufficient credentials inside the training job container to access either S3 or CloudWatch.

If you're using custom images for training, you can specify a name and a regular expression for each metric you want to track during training.
from sagemaker.estimator import Estimator

byo_estimator = Estimator(image_name=image_name,
                          role='SageMakerRole',
                          train_instance_count=1,
                          train_instance_type='ml.c4.xlarge',
                          sagemaker_session=sagemaker_session,
                          metric_definitions=[
                              {'Name': 'test:msd', 'Regex': r'#quality_metric: host=\S+, test msd <loss>=(\S+)'},
                              {'Name': 'test:ssd', 'Regex': r'#quality_metric: host=\S+, test ssd <loss>=(\S+)'}])
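SageMaker applies those regexes to the training job's log stream (anything your script writes to stdout/stderr ends up in CloudWatch Logs), so on the container side it is enough to print lines in the matching format. A minimal sketch of a Keras callback that does this, assuming the metric_definitions above (the #quality_metric line layout is just the convention from that example):

import socket
from tensorflow import keras

class SageMakerMetricLogger(keras.callbacks.Callback):
    """Print per-epoch metrics in the format the metric_definitions regexes scrape."""
    def on_epoch_end(self, epoch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None:
            # stdout -> CloudWatch Logs -> metric regex -> training job console graph
            print(f'#quality_metric: host={socket.gethostname()}, test msd <loss>={loss:.6f}')

Attach it with model.fit(..., callbacks=[SageMakerMetricLogger()]) and the matched values show up both on the training job console and as CloudWatch metrics.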

Related

AWS Model Quality Monitoring without Endpoints

Is there any possible way to do model monitoring in AWS without an endpoint? Kindly share any good notebook regarding this if you know of one.
AWS does not give any clear example regarding batch model monitoring.
Amazon SageMaker Model Monitor monitors the quality of Amazon SageMaker machine learning models in production.
You can set up continuous monitoring with a real-time endpoint (or a batch transform job that runs regularly), or on-schedule monitoring for asynchronous batch transform jobs.
Here are some example notebooks:
(1) SageMaker Model Monitor with Batch Transform - Data Quality Monitoring On-Schedule (link)
(2) SageMaker Data Quality Model Monitor for Batch Transform with SageMaker Pipelines On-demand (link)
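As a rough sketch of what those notebooks set up: a data quality schedule can be attached to the capture output of a batch transform job via the SageMaker Python SDK. All names and S3 URIs below are placeholders, and a baseline is assumed to have been produced by an earlier suggest_baseline run:

from sagemaker.model_monitor import DefaultModelMonitor, BatchTransformInput, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import MonitoringDatasetFormat

monitor = DefaultModelMonitor(
    role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder execution role
    instance_count=1,
    instance_type='ml.m5.xlarge',
)

monitor.create_monitoring_schedule(
    monitor_schedule_name='batch-data-quality',  # placeholder
    batch_transform_input=BatchTransformInput(
        data_captured_destination_s3_uri='s3://my-bucket/transform-capture',  # placeholder
        destination='/opt/ml/processing/input',
        dataset_format=MonitoringDatasetFormat.csv(header=True),
    ),
    output_s3_uri='s3://my-bucket/monitoring-reports',  # placeholder
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)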

AWS Batch vs Spring Batch

I have been planning to migrate my batch processing from Spring Batch to AWS Batch. Can someone give me the reasons to choose AWS Batch over Spring Batch?
Whilst both of these will play a role in orchestrating your batch workloads, a key difference is that AWS Batch will also manage the infrastructure you need to run the jobs/pipeline. AWS Batch lets you tailor the underlying cloud instances, or specify a broad array of instance types that will work for you. And it lets you make trade-offs: you can task it with managing a bag of EC2 Spot instances for you (for example), and then ask it to optimize time-to-execution over price (or prefer price to speed); see the sketch below.
(For full disclosure, I work for the engineering team that builds AWS Batch).
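That Spot trade-off is expressed through the compute environment's allocation strategy. A hedged boto3 sketch (all names, ARNs, subnets, and security groups are placeholders):

import boto3

batch = boto3.client('batch')

batch.create_compute_environment(
    computeEnvironmentName='spot-ce',                     # placeholder
    type='MANAGED',
    computeResources={
        'type': 'SPOT',
        'allocationStrategy': 'SPOT_CAPACITY_OPTIMIZED',  # favor interruption-resistant pools
        # 'allocationStrategy': 'BEST_FIT',               # favor lowest price instead
        'minvCpus': 0,
        'maxvCpus': 256,
        'instanceTypes': ['optimal'],                     # let Batch pick from M/C/R families
        'subnets': ['subnet-xxxxxxxx'],                   # placeholder
        'securityGroupIds': ['sg-xxxxxxxx'],              # placeholder
        'instanceRole': 'ecsInstanceRole',                # placeholder
        'spotIamFleetRole': 'arn:aws:iam::123456789012:role/SpotFleetRole',  # placeholder
    },
    serviceRole='arn:aws:iam::123456789012:role/AWSBatchServiceRole',  # placeholder
)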
I believe both work at different levels. Spring Batch provides a framework that reduces the boilerplate code you need in order to write a batch job, e.g. saving job state in a job repository to provide restartability.
AWS Batch, by contrast, is an infrastructure-level service that helps manage the infrastructure and sets some environment variables that help differentiate the master node from worker nodes.
In my opinion both can work together to produce a full-fledged, cost-effective batch job at scale on the AWS cloud.
AWS Batch is a fully managed solution for batch processing.
It has built-in:
- A queue with priority options
- A runtime, which can be self-managed or auto-managed
- A job repository, with Docker images for the job definitions
- Monitoring, dashboards, and integration with other AWS services like SNS (and from SNS to wherever you want)
On the other hand, Spring Batch is a framework that still needs some of your effort to manage it all: employing a queue, scaling, monitoring, etc.
My take is: if your company or app is on AWS, go for AWS Batch; you will save months of time and get to a scale of a million jobs per day in no time (submitting a job is a single API call, as sketched below). If you are on-prem or in a private cloud, go for Spring Batch, with some research.
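To illustrate the single-API-call point, a minimal boto3 sketch of submitting a job, assuming a queue and a job definition (pointing at your Docker image) already exist; all names are placeholders:

import boto3

batch = boto3.client('batch')

response = batch.submit_job(
    jobName='nightly-etl-20240101',   # placeholder
    jobQueue='high-priority-queue',   # placeholder
    jobDefinition='etl-job:3',        # placeholder name:revision
    containerOverrides={
        'command': ['python', 'etl.py', '--date', '2024-01-01'],
    },
)
print(response['jobId'])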

Use AWS Lambda to execute a jupyter notebook on AWS Sagemaker

I made a classifier in Python that uses a lot of libraries. I have uploaded the model to Amazon S3 as a pickle (my_model.pkl). Ideally, every time someone uploads a file to a specific S3 bucket, it should trigger an AWS Lambda that would load the classifier, return predictions and save a few files on an Amazon S3 bucket.
I want to know if it is possible to use a Lambda to execute a Jupyter notebook in AWS SageMaker. This way I would not have to worry about the dependencies, and it would generally make the classification more straightforward.
So, is there a way to use an AWS Lambda to execute a Jupyter Notebook?
Scheduling notebook execution is a bit of a SageMaker anti-pattern, because (1) you would need to manage data I/O (training set, trained model) yourself, (2) you would need to manage metadata tracking yourself, (3) you cannot run on distributed hardware, and (4) you cannot use Spot. Instead, it is recommended for scheduled tasks to leverage the various SageMaker long-running, background job APIs: SageMaker Training, SageMaker Processing, or SageMaker Batch Transform (in the case of batch inference).
That being said, if you still want to schedule a notebook to run, you can do it in a variety of ways:
In the SageMaker CI/CD re:Invent 2018 video, notebooks are launched as CloudFormation templates, and their execution is automated via a SageMaker lifecycle configuration.
AWS released a blog post documenting how to launch notebooks from within Processing jobs.
But again, my recommendation for scheduled tasks would be to take them out of Jupyter, turn them into scripts, and run them in SageMaker Training.
No matter your choice, all of those tasks can be launched as API calls from within a Lambda function, as long as the function's role has the appropriate permissions; see the sketch below.
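A minimal sketch of that last point: a Lambda handler that starts a SageMaker training job with boto3 when a file lands in S3. The role ARN, bucket names, and naming scheme are placeholders, and the training image is just the one from the question above:

import boto3

sm = boto3.client('sagemaker')

def handler(event, context):
    # Triggered by an S3 upload event; kicks off a training job on the new data.
    bucket = event['Records'][0]['s3']['bucket']['name']
    sm.create_training_job(
        TrainingJobName='clf-' + context.aws_request_id[:8],  # placeholder naming scheme
        AlgorithmSpecification={
            'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04',
            'TrainingInputMode': 'File',
        },
        RoleArn='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder
        InputDataConfig=[{
            'ChannelName': 'training',
            'DataSource': {'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://' + bucket + '/',
                'S3DataDistributionType': 'FullyReplicated',
            }},
        }],
        OutputDataConfig={'S3OutputPath': 's3://my-bucket/output'},  # placeholder
        ResourceConfig={'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 50},
        StoppingCondition={'MaxRuntimeInSeconds': 3600},
    )
    return {'status': 'training job started'}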
I agree with Olivier. SageMaker might not be the right tool for notebook execution.
Papermill is the framework to run Jupyter Notebooks in this fashion.
You can consider trying Clouderizer. It allows you to deploy your Jupyter notebook directly as a serverless cloud function and uses Papermill behind the scenes.
Disclaimer: I work for Clouderizer.
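For reference, Papermill's core API is a single call that runs a notebook top to bottom and writes the executed copy; the paths and parameters here are placeholders:

import papermill as pm

# Parameters are injected into the notebook cell tagged 'parameters'.
pm.execute_notebook(
    'classifier.ipynb',         # placeholder input notebook
    'classifier-output.ipynb',  # executed copy, with cell outputs
    parameters={'input_path': 's3://my-bucket/new-file.csv'},  # placeholder
)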
It is totally possible, and not an anti-pattern at all; it really depends on your use case. AWS actually published a great article describing it, which includes a Lambda.

AWS SageMaker on GPU

I am trying to train a neural network (TensorFlow) on AWS. I have some AWS credits. From my understanding, AWS SageMaker is the best one for the job. I managed to load the JupyterLab console on SageMaker and tried to find a GPU kernel, since I know a GPU is best for training neural networks. However, I could not find such a kernel.
Would anyone be able to help in this regard?
Thanks & Best Regards
Michael
You train models on GPU in the SageMaker ecosystem via two different components:
You can instantiate a GPU-powered SageMaker notebook instance, for example p2.xlarge (NVIDIA K80) or p3.2xlarge (NVIDIA V100). This is convenient for interactive development: you have the GPU right under your notebook, can run code on it interactively, and can monitor it via nvidia-smi in a terminal tab, which makes for a great development experience. However, when you develop directly on a GPU-powered machine, there are times when you are not using the GPU, for example when you write code or browse documentation. All that time you pay for a GPU that sits idle, so this may not be the most cost-effective option for your use case.
Another option is to use a SageMaker training job running on a GPU instance. This is the preferred option for training, because training metadata (data and model paths, hyperparameters, cluster specification, etc.) is persisted in the SageMaker metadata store, logs and metrics are stored in CloudWatch, and the instance automatically shuts itself down at the end of training. Developing on a small CPU instance and launching training tasks via the SageMaker Training API will help you make the most of your budget, while retaining the metadata and artifacts of all your experiments. You can find a well-documented TensorFlow example here.
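A minimal sketch of that second option with a recent SageMaker Python SDK; the script name, role, and framework versions are placeholders, and instance_type is what puts the job on a GPU:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',         # placeholder training script
    role='SageMakerRole',           # placeholder execution role
    instance_count=1,
    instance_type='ml.p3.2xlarge',  # single NVIDIA V100 GPU
    framework_version='2.4',        # placeholder TF version
    py_version='py37',
)

estimator.fit('s3://my-bucket/training-data')  # placeholder S3 input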
All Notebook GPU and CPU instance types: AWS Documentation.

GCP Dataflow, Dataproc, Bigtable

I am selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. I want to minimize service costs. I also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should I do?
A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
C!
Use Dataflow on Pub/Sub to transform your data and let it write rows into BigQuery. You can monitor the ETL pipeline straight from Dataflow, and use Stackdriver on top; Stackdriver can also be used to trigger events, etc.
Use autoscaling to minimize the number of manual actions. Basically, when this solution is set up correctly, it needs no work at all; see the sketch below.
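A minimal Apache Beam (Python) sketch of that pipeline; the project, topic, table, and schema are placeholders, and Dataflow's default autoscaling for streaming jobs applies unless you override it:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner='DataflowRunner',
    project='my-project',                # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')  # placeholder
        | 'Parse' >> beam.Map(json.loads)  # JSON message -> dict matching the BQ schema
        | 'Write' >> beam.io.WriteToBigQuery(
            'my-project:analytics.events',                  # placeholder table
            schema='user:STRING,ts:TIMESTAMP,value:FLOAT',  # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )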