AWS SageMaker on GPU - amazon-web-services

I am trying to train a neural network (TensorFlow) on AWS, and I have some AWS credits. From my understanding, AWS SageMaker is the best fit for the job. I managed to load the JupyterLab console on SageMaker and tried to find a GPU kernel, since I know a GPU is best for training neural networks. However, I could not find such a kernel.
Would anyone be able to help in this regard?
Thanks & Best Regards
Michael

You train models on a GPU in the SageMaker ecosystem via two different components:
You can instantiate a GPU-powered SageMaker Notebook Instance, for example ml.p2.xlarge (NVIDIA K80) or ml.p3.2xlarge (NVIDIA V100). This is convenient for interactive development: you have the GPU right under your notebook, can run code on it interactively, and can monitor it via nvidia-smi in a terminal tab - a great development experience. However, when you develop directly on a GPU-powered machine, there are times when you don't use the GPU, for example when you write code or browse documentation. All that time you pay for a GPU that sits idle, so this may not be the most cost-effective option for your use case.
Another option is to use a SageMaker Training Job running on a GPU instance. This is the preferred option for training, because training metadata (data and model paths, hyperparameters, cluster specification, etc.) is persisted in the SageMaker metadata store, logs and metrics are stored in CloudWatch, and the instance automatically shuts itself down at the end of training. Developing on a small CPU instance and launching training tasks via the SageMaker Training API will help you make the most of your budget, while retaining the metadata and artifacts of all your experiments. You can see a well-documented TensorFlow example here, and a minimal sketch below.
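For illustration, here is a minimal sketch of launching such a Training Job with the SageMaker Python SDK (v2). The entry-point script, S3 paths, hyperparameters, and framework/Python versions are placeholders; substitute values that SageMaker currently supports.

```python
# Minimal sketch: launch a SageMaker Training Job on a single GPU instance.
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

estimator = TensorFlow(
    entry_point="train.py",         # your training script (placeholder)
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # one NVIDIA V100 GPU
    framework_version="2.11",       # pick a currently supported TF version
    py_version="py39",
    hyperparameters={"epochs": 10, "batch-size": 64},
)

# Data is read from S3 (placeholder bucket); the instance shuts itself down
# when training finishes, and logs/metrics land in CloudWatch.
estimator.fit({"training": "s3://my-bucket/training-data"})
```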

All Notebook GPU and CPU instance types: AWS Documentation.

Related

AWS Model Quality Monitoring without Endpoints

Is there any possible way to do model monitoring in AWS without an endpoint? Kindly share any good notebook regarding this if you know of one.
AWS does not give any clear example regarding batch model monitoring.
Amazon SageMaker Model Monitor monitors the quality of Amazon SageMaker machine learning models in production.
You can set up continuous monitoring with a real-time endpoint (or a batch transform job that runs regularly), or on-schedule monitoring for asynchronous batch transform jobs. A rough setup sketch follows the notebook list below.
Here are some example notebooks:
(1) SageMaker Model Monitor with Batch Transform - Data Quality Monitoring On-Schedule (link)
(2) SageMaker Data Quality Model Monitor for Batch Transform with SageMaker Pipelines On-demand (link)
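As a rough sketch only, assuming the SageMaker Python SDK v2 (bucket paths, the capture location, instance type, and schedule are all placeholders), an on-schedule data quality monitor for batch transform might be set up like this:

```python
# Sketch: baseline a dataset, then monitor batch transform inputs on a schedule.
from sagemaker import get_execution_role
from sagemaker.model_monitor import (
    BatchTransformInput,
    CronExpressionGenerator,
    DefaultModelMonitor,
)
from sagemaker.model_monitor.dataset_format import (
    DatasetFormat,
    MonitoringDatasetFormat,
)

role = get_execution_role()

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Compute baseline statistics and constraints from training data (placeholders).
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/baseline/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/baseline-results",
)

# Check the data captured by the recurring batch transform job every hour.
monitor.create_monitoring_schedule(
    monitor_schedule_name="batch-data-quality",
    batch_transform_input=BatchTransformInput(
        data_captured_destination_s3_uri="s3://my-bucket/transform-capture",
        destination="/opt/ml/processing/input",
        dataset_format=MonitoringDatasetFormat.csv(header=False),
    ),
    output_s3_uri="s3://my-bucket/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```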

Is it possible to run Vertex AI Workbench on Spot machines?

I'm trying to save budget on Jupyter notebooks on Google Cloud, but couldn't find a way to run Vertex AI Workbench (Notebooks) on Spot machines.
What are my alternatives?
The short answer is no; the better answer is: you have an alternative.
Vertex AI Workbench is indeed a managed service with a Compute Engine VM as the underlying infrastructure. However, it doesn't support Spot/Preemptible instances.
Instead, you can quickly set up a VM from one of Google's deep/machine learning images. See this detailed tutorial.
Deep Learning VMs don't support launching from the GCP Console and lack some features such as collaborative coding. But they do support Spot/Preemptible instances and don't add a management fee. So you get a leaner experience but also pay less; a creation sketch follows.
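For illustration, here is a minimal sketch using the google-cloud-compute client library; the project, zone, machine type, GPU type, and the tf-latest-gpu image family are assumptions/placeholders (check the deeplearning-platform-release project for current image families):

```python
# Sketch: create a Spot VM booted from a Google Deep Learning image.
from google.cloud import compute_v1

PROJECT = "my-project"   # placeholder
ZONE = "us-central1-a"   # placeholder

instance = compute_v1.Instance()
instance.name = "dl-spot-vm"
instance.machine_type = f"zones/{ZONE}/machineTypes/n1-standard-8"

# Boot disk from a Deep Learning VM image family (assumed family name).
instance.disks = [
    compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/deeplearning-platform-release/global/images/family/tf-latest-gpu",
            disk_size_gb=100,
        ),
    )
]

# One T4 GPU; GPU VMs must terminate (not live-migrate) on host maintenance.
instance.guest_accelerators = [
    compute_v1.AcceleratorConfig(
        accelerator_count=1,
        accelerator_type=f"zones/{ZONE}/acceleratorTypes/nvidia-tesla-t4",
    )
]
instance.scheduling = compute_v1.Scheduling(
    provisioning_model="SPOT",
    on_host_maintenance="TERMINATE",
)

# Deep Learning images can auto-install the NVIDIA driver via metadata.
instance.metadata = compute_v1.Metadata(
    items=[compute_v1.Items(key="install-nvidia-driver", value="True")]
)

instance.network_interfaces = [
    compute_v1.NetworkInterface(network="global/networks/default")
]

operation = compute_v1.InstancesClient().insert(
    project=PROJECT, zone=ZONE, instance_resource=instance
)
operation.result()  # block until the create operation completes
```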

Cannot spin up a simple dev project on GCP

Is there a trick to getting a quota increase for NVIDIA GPUs on GCP? I really can't believe the amount of red tape here. I'm going through a course on deploying Kubeflow via GCP and need 4 GPUs for the distributed training module. GCP continually rejects any request over 1 GPU. Are there any data scientists actively using GCP for personal dev projects who have had any luck?

Triggering a training task on Cloud ML when a file arrives in Cloud Storage

I am trying to build an app where the user is able to upload a file to Cloud Storage. This would then trigger a model training process (and prediction later on). Initially I thought I could do this with Cloud Functions/Pub/Sub and Cloud ML, but it seems that Cloud Functions are not able to trigger gsutil commands, which are needed for Cloud ML.
Is my only option to enable Cloud Composer, attach GPUs to a Kubernetes node, and create a Cloud Function that triggers a DAG to boot up a pod on the node with GPUs and mount the bucket with the data? Seems a bit excessive, but I can't think of another way currently.
You're correct. As of now, there is no way to execute a gsutil command from a Google Cloud Function:
Cloud Functions can be written in Node.js, Python, Go, and Java, and are executed in language-specific runtimes.
I really like your second approach with triggering the DAG; a rough sketch of such a trigger follows.
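As a rough sketch only, assuming Cloud Composer 2 (whose Airflow 2 REST API accepts Google-issued access tokens); the web server URL and DAG ID are placeholders, and the function's service account needs the Composer User role:

```python
# Sketch: background Cloud Function fired by a Cloud Storage 'finalize' event
# that triggers an Airflow DAG run in a Cloud Composer 2 environment.
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Placeholders: your environment's Airflow web server URL and DAG ID.
AIRFLOW_WEB_SERVER = "https://example-dot-us-central1.composer.googleusercontent.com"
DAG_ID = "start_gpu_training"

def trigger_dag(event, context):
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)
    # Pass the uploaded object's location to the DAG run as configuration.
    response = session.post(
        f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{DAG_ID}/dagRuns",
        json={"conf": {"bucket": event["bucket"], "name": event["name"]}},
    )
    response.raise_for_status()
```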
Another idea that comes to my mind is to interact with GCP virtual machines within Cloud Composer through the PythonOperator, using the Compute Engine Python API. You can find more information on automating infrastructure, and a deep technical dive into the core features of Cloud Composer, here. A sketch of such an operator follows below.
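A minimal sketch, assuming Airflow 2 on Cloud Composer and the google-api-python-client library; the project, zone, and instance names are placeholders for an existing (stopped) GPU VM:

```python
# Sketch: a DAG whose PythonOperator starts a GPU VM via the Compute Engine API.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from googleapiclient import discovery

PROJECT = "my-project"    # placeholder
ZONE = "us-central1-a"    # placeholder
INSTANCE = "gpu-trainer"  # placeholder: an existing, stopped GPU VM

def start_training_vm():
    # Uses the Composer environment's default credentials.
    compute = discovery.build("compute", "v1")
    compute.instances().start(
        project=PROJECT, zone=ZONE, instance=INSTANCE
    ).execute()

with DAG(
    dag_id="start_gpu_training",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered externally, e.g. by a Cloud Function
) as dag:
    PythonOperator(
        task_id="start_vm",
        python_callable=start_training_vm,
    )
```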
Another solution you can think of is Kubeflow, which aims to make running ML workloads on Kubernetes simple. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter notebooks. Please have a look at the Codelabs tutorial.
I hope you find the above pieces of information useful.

Pros and Cons of Amazon SageMaker VS. Amazon EMR, for deploying TensorFlow-based deep learning models?

I want to build some neural network models for NLP and recommendation applications. The framework I want to use is TensorFlow. I plan to train these models and make predictions on Amazon Web Services. The application will most likely involve distributed computing.
I am wondering: what are the pros and cons of SageMaker and EMR for TensorFlow applications?
They both have TensorFlow integrated.
In general terms, they serve different purposes.
EMR is for when you need to process massive amounts of data and rely heavily on Spark, Hadoop, and MapReduce (EMR = Elastic MapReduce). Essentially, if your data comes in large enough volume to make use of the efficiencies of the Spark, Hadoop, Hive, HDFS, HBase, and Pig stack, then go with EMR.
EMR Pros:
Generally low cost compared to running the same stack yourself on EC2 instances
As the name suggests, it is elastic: you can provision what you need when you need it
Hive, Pig, and HBase out of the box
EMR Cons:
You need a very specific use case to truly benefit from all the offerings in EMR. Most users don't take advantage of its entire offering
SageMaker is an attempt to make machine learning easier and distributed. The marketplace provides out-of-the-box algorithms and models for quick use. It's a great service if you conform to the workflows it enforces, meaning creating training jobs and deploying inference endpoints.
SageMaker Pros:
Easy to get up and running with Notebooks
Rich marketplace to quickly try existing models
Many different example notebooks for popular algorithms
Predefined kernels that minimize configuration
Easy to deploy models
Allows you to distribute inference compute by deploying endpoints
SageMaker Cons:
Expensive!
Enforces a certain workflow, making it hard to be fully custom
Expensive!
From the AWS documentation:
Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. By using these frameworks and related open-source projects, such as Apache Hive and Apache Pig, you can process data for analytics purposes and business intelligence workloads. Additionally, you can use Amazon EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
(...) Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker removes all the barriers that typically slow down developers who want to use machine learning.
Conclusion:
If you want to deploy AI models, just use AWS SageMaker. A minimal deployment sketch follows.
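For illustration, a minimal sketch of deploying a trained TensorFlow model to a SageMaker real-time endpoint with the SageMaker Python SDK (v2); the model artifact path, framework version, instance type, and input shape are placeholders:

```python
# Sketch: deploy a TensorFlow SavedModel artifact to a real-time endpoint.
import numpy as np
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlowModel

role = get_execution_role()

model = TensorFlowModel(
    model_data="s3://my-bucket/output/model.tar.gz",  # placeholder artifact
    role=role,
    framework_version="2.11",  # pick a supported TF Serving version
)

# Provision a real-time inference endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)

# Invoke the endpoint; the input shape depends on your model.
print(predictor.predict(np.zeros((1, 10)).tolist()))

# Tear the endpoint down when done - endpoints bill for as long as they run.
predictor.delete_endpoint()
```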