Cannot spin up a simple dev project on GCP - google-cloud-platform

Is there a trick to getting a quota increase for NVIDIA GPUs on GCP? I really can't believe the amount of red tape here. I'm going through a course on deploying Kubeflow via GCP and need 4 GPUs for the distributed training module. GCP continually rejects any request over 1 GPU. Are there any data scientists actively using GCP for personal dev projects who have had any luck?

Related

Is it possible to run Vertex AI Workbench on Spot machines?

I'm trying to save budget on jupyter notebooks on Google Cloud but couldn't find a way to run Vertex AI Workbench (Notebooks) on spot machines.
What are my alternatives?
The short answer is no; the better answer is: you have an alternative.
Vertex AI Workbench is indeed a managed service with a Compute Engine VM as the underlying infrastructure. However, it doesn't support Spot/Preemptible instances.
Instead, you can quickly create a VM from one of Google's deep/machine learning images. See this detailed tutorial.
Deep Learning VMs don't support launching from the GCP Console and lack extra features like co-coding, but they do support Spot/Preemptible instances and don't introduce a management fee. So you get a lesser experience but also pay less.
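For illustration, here is a minimal sketch of creating such a VM with the google-cloud-compute Python client, assuming a Deep Learning image family and a preemptible T4 instance; the project, zone, image family, machine type and accelerator type are placeholders to adapt:

```python
from google.cloud import compute_v1

project, zone = "my-project", "us-central1-a"  # placeholders
image = "projects/deeplearning-platform-release/global/images/family/common-cu113"

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_image=image, disk_size_gb=100
    ),
)

instance = compute_v1.Instance(
    name="dl-spot-dev",
    machine_type=f"zones/{zone}/machineTypes/n1-standard-8",
    disks=[boot_disk],
    network_interfaces=[
        compute_v1.NetworkInterface(network="global/networks/default")
    ],
    guest_accelerators=[
        compute_v1.AcceleratorConfig(
            accelerator_count=1,
            accelerator_type=f"zones/{zone}/acceleratorTypes/nvidia-tesla-t4",
        )
    ],
    # Preemptible/Spot capacity; GPU instances must terminate on host maintenance.
    scheduling=compute_v1.Scheduling(
        preemptible=True, on_host_maintenance="TERMINATE", automatic_restart=False
    ),
    # Ask the Deep Learning image to install the NVIDIA driver on first boot.
    metadata=compute_v1.Metadata(
        items=[compute_v1.Items(key="install-nvidia-driver", value="True")]
    ),
)

operation = compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
operation.result()  # wait until the VM is created
```

Once the VM is up, you SSH in and use the preinstalled JupyterLab directly, essentially the same stack Workbench runs underneath, minus the management fee.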

Kubeflow deployment on GCP

I have been reading for a few weeks about different approaches to ML in production. I decided to test Kubeflow, and I decided to test it on GCP. I started to deploy Kubeflow on GCP using the guideline on the official Kubeflow website (here https://www.kubeflow.org/docs/gke/). I ran into a lot of issues and it was quite hard to fix them. I started to look into a better approach and noticed that GCP AI Platform now offers deploying Kubeflow Pipelines in just a few simple steps. (https://cloud.google.com/ai-platform/pipelines/docs/connecting-with-sdk)
After easily setting this up, I had a few questions and doubts. If it is this easy to set up and deploy Kubeflow, why do we have to go through such a cumbersome process as suggested on the official Kubeflow website? Since creating a Kubeflow pipeline on GCP basically means I am deploying Kubeflow on GCP, does that mean I can access other Kubeflow services like Katib?
Elnaz
The official Kubeflow website provides the required information in a detailed way, whereas Google Cloud directly provides you the services as a ready-made solution.
Referring to Will Fuks' document: yes, you can access Katib on GCP.
The GCP managed service of Kubeflow Pipelines is just that. You won't have a lot of access to the cluster to make changes. I've deployed a Kubeflow cluster that can still reach the AI Hub as well.
I believe they have plans to expand what can be deployed in the AI Platform but if you don't want to wait, the self-deployment is possible (but not easy) IMO.
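For completeness, connecting to such an AI Platform Pipelines deployment from the Kubeflow Pipelines SDK (per the connecting-with-sdk doc linked in the question) looks roughly like this; the host URL, pipeline package and parameter are placeholders:

```python
import kfp

# Copy the host URL from your Pipelines instance's settings page in the GCP console.
client = kfp.Client(host="https://<your-deployment>.pipelines.googleusercontent.com")

print(client.list_pipelines())  # quick sanity check that the connection works

# Submit a compiled pipeline package (e.g. produced by the KFP compiler).
client.create_run_from_pipeline_package(
    "my_pipeline.yaml",
    arguments={"learning_rate": "0.01"},  # illustrative pipeline parameter
)
```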

DataBricks + Kedro Vs GCP + Kubeflow Vs Server + Kedro + Airflow

We are deploying a data consortium between more than 10 companies. We will deploy several machine learning models (in general, advanced analytics models) for all the companies, and we will administer all the models. We are looking for a solution that administers several servers, clusters and data science pipelines. I love kedro, but I'm not sure what the best option is to administer everything while using kedro.
In summary, we are looking for the best solution to administrate several models, tasks and pipelines in different servers and possibly Spark clusters. Our current options are:
AWS as our data warehouse and Databricks for administering servers, clusters and tasks. I don't feel that Databricks notebooks are a good solution for building pipelines and working collaboratively, so I would like to connect kedro to Databricks (is it good? is it easy to schedule runs of the kedro pipelines using Databricks?)
Using GCP for the data warehouse and using Kubeflow (in GCP) for deploying models and for the administration and scheduling of the pipelines and the needed resources
Setting up servers on AWS or GCP, installing kedro and scheduling the pipelines with Airflow (I see a big problem administering 20 servers and 40 pipelines)
I would like to know if someone knows what is the best option between these alternatives, their downsides and advantages, or if there are more possibilities.
I'll try and summarise what I know, but be aware that I've not been part of a KubeFlow project.
Kedro on Databricks
Our approach was to build our project with CI and then execute the pipeline from a notebook. We did not use the kedro-recommended approach of using databricks-connect due to the large price difference between Jobs and Interactive Clusters (which are needed for DB-connect). If you're working on several TBs of data, this quickly becomes relevant.
As a DS, this approach may feel natural; as a SWE, though, it does not. Running pipelines in notebooks feels hacky. It works, but it feels non-industrialised. Databricks performs well in automatically spinning clusters up and down and taking care of the runtime for you, so their value add is abstracting IaaS away from you (more on that later).
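As a rough sketch of that notebook-driven approach, assuming a Kedro version with KedroSession (0.17+) and a project checked out somewhere the cluster can see (the path and env name below are placeholders):

```python
# Run the Kedro pipeline from a Databricks notebook cell.
from kedro.framework.startup import bootstrap_project
from kedro.framework.session import KedroSession

project_root = "/dbfs/repos/my-kedro-project"  # hypothetical location of the project
bootstrap_project(project_root)

with KedroSession.create(project_path=project_root, env="databricks") as session:
    session.run(pipeline_name="__default__")
```

Scheduling then amounts to pointing a Databricks Job at this notebook, which is what made Jobs-cluster pricing attractive versus keeping an interactive cluster alive for databricks-connect.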
GCP & "Cloud Native"
Pro: GCP's main selling point is BigQuery. It is an incredibly powerful platform, simply because you can be productive from day 0. I've seen people build entire web APIs on top of it. Kubeflow isn't tied to GCP, so you could port this somewhere else later on. Kubernetes will also allow you to run anything else you wish on the cluster: APIs, streaming, web services, websites, you name it.
Con: Kubernetes is complex. If you have 10+ engineers to run this project long-term, you should be OK. But don't underestimate the complexity of Kubernetes. It is to the cloud what Linux is to the OS world. Think log management, noisy neighbours (one cluster for web APIs + batch spark jobs), multi-cluster management (one cluster per department/project), security, resource access etc.
IaaS server approach
Your last alternative, the manual installation of servers, is one I would recommend only if you have a large team, extremely large data, and are building a long-term product whose revenue can sustain the large maintenance costs.
The people behind it
What does the talent market look like in your region? If you can hire experienced engineers with GCP knowledge, I'd go for the 2nd solution. GCP is a mature, "native" platform in the sense that it abstracts a lot away for customers. If your market has mainly AWS engineers, that may be a better road to take. If you have a number of kedro engineers, that also has relevance. Note that kedro is agnostic enough to run anywhere. It's really just Python code.
Subjective advice:
Having worked mostly on AWS projects and a few GCP projects, I'd go for GCP. I'd use the platform's components (BigQuery, Cloud Run, PubSub, Functions, K8S) as a toolbox to choose from and build an organisation around that. Kedro can run in any of these contexts: as a job triggered by the Scheduler, as a container on Kubernetes, or as an ETL pipeline bringing data into (or out of) BigQuery.
While Databricks is "less management" than raw AWS, it's still servers to think about and VPC networking charges to worry over. BigQuery is simply GB queried. Functions are simply invocation count. These high level components will allow you to quickly show value to customers and you only need to go deeper (RaaS -> PaaS -> IaaS) as you scale.
AWS also has these higher level abstractions over IaaS but in general, it appears (to me) that Google's offering is the most mature. Mainly because they have published tools they've been using internally for almost a decade whereas AWS has built new tools for the market. AWS is the king of IaaS though.
Finally, a bit of content: two former colleagues discussed ML industrialisation frameworks earlier this fall.

Amazon Fargate vs EC2 container website hosting

I recently got a project in which I have to build a React / NextJS application that will serve occasional high traffic but will mostly sit idle. We are currently looking for the cheapest option in all categories, but we also want to build a scalable and manageable app with a quick and easy CI/CD pipeline. For the development server, we chose Heroku's free plan and pipeline, as I think it's ideal for the job. For production, we decided to use Docker as it's the best way to set up a CD pipeline, and with 2000 minutes of free GitHub Actions per month, the whole production/development pipeline will be essentially free of cost for us. We were also thinking of using AWS because of its features and because we want to keep the number of bills to manage to a minimum. For the DB we're thinking of using DynamoDB because of the free 25 GB lifetime storage, which will be enough since the only dynamic data on the site will be user data and blogs. And for object storage, the choice is S3.
Here, we're confused between the two AWS offerings when it comes to container hosting: ECS on EC2 and ECS on Fargate. While Fargate definitely feels like the better choice given that the application will sit idle most of the time, we're really confused about resource provisioning for containers in Fargate. The app runs on NextJS, so it'll be server-side rendered.
So my question is: will a combo of 0.5 GB RAM x 0.25 vCPU be enough for a server-side rendered NextJS application? Or should I go for a dedicated EC2? Or maybe another cloud provider?
NextJS is a framework that runs on top of Node.js; since the documentation mentions no specific requirement beyond Node.js 10, you can treat it the way you would treat any Node.js application.
Node.js with V8 suitable for limited memory device?
So my question was: will a combo of 0.5 GB RAM x 0.25 vCPU be enough for a Server Side Rendered NextJS application? Or should I go for a dedicated EC2? Or another cloud provider maybe?
I would not suggest the EC2 launch type for the ECS service; you can go for Fargate with minimal memory and CPU and set up auto-scaling of the ECS service whenever required.
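To make the sizing question concrete, registering a Fargate task definition at the smallest combination mentioned (0.25 vCPU / 0.5 GB) looks roughly like this with boto3; the family name, image URI and role ARN are placeholders:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="nextjs-ssr",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",     # 0.25 vCPU
    memory="512",  # 0.5 GB -- Fargate only accepts certain cpu/memory combinations
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
    containerDefinitions=[
        {
            "name": "nextjs",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/nextjs-app:latest",
            "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
            "essential": True,
        }
    ],
)
```

From there, an ECS service with target-tracking auto scaling can add tasks under load and drop back to a single small task while the app sits idle.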
But I think there is a better option than Fargate: serverless-nextjs.
Serverless deployment dramatically improves reliability and scalability by splitting your application into smaller parts (also called lambdas). In the case of Next.js, each page in the pages directory becomes a serverless lambda.
There are a number of benefits to serverless. The referenced link talks about some of them in the context of Express, but the principles apply universally: serverless allows for distributed points of failure, infinite scalability, and is incredibly affordable with a "pay for what you use" model.
Serverless Nextjs

AWS SageMaker on GPU

I am trying to train a neural network (TensorFlow) on AWS. I have some AWS credits. From my understanding, AWS SageMaker is the best fit for the job. I managed to load the JupyterLab console on SageMaker and tried to find a GPU kernel, since I know a GPU is best for training neural networks. However, I could not find such a kernel.
Would anyone be able to help in this regard?
Thanks & Best Regards
Michael
You train models on GPU in the SageMaker ecosystem via 2 different components:
You can instantiate a GPU-powered SageMaker Notebook Instance, for example p2.xlarge (NVIDIA K80) or p3.2xlarge (NVIDIA V100). This is convenient for interactive development - you have the GPU right under your notebook and can run code on the GPU interactively and monitor the GPU via nvidia-smi in a terminal tab - a great development experience. However when you develop directly from a GPU-powered machine, there are times when you may not use the GPU. For example when you write code or browse some documentation. All that time you pay for a GPU that sits idle. In that regard, it may not be the most cost-effective option for your use-case.
Another option is to use a SageMaker Training Job running on a GPU instance. This is the preferred option for training, because training metadata (data and model path, hyperparameters, cluster specification, etc.) is persisted in the SageMaker metadata store, logs and metrics are stored in CloudWatch, and the instance automatically shuts itself down at the end of training. Developing on a small CPU instance and launching training tasks using the SageMaker Training API will help you make the most of your budget, while helping you retain metadata and artifacts of all your experiments. You can see a well-documented TensorFlow example here.
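A minimal sketch of launching such a Training Job with the SageMaker Python SDK's TensorFlow estimator; the entry point, S3 path, hyperparameters and framework versions are placeholders to adapt:

```python
import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()  # inside SageMaker; otherwise pass an IAM role ARN

estimator = TensorFlow(
    entry_point="train.py",           # your training script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",    # single NVIDIA V100 GPU
    framework_version="2.11",
    py_version="py39",
    hyperparameters={"epochs": 10, "batch-size": 64},  # illustrative
)

# Data is read from S3, the model artifact is written back to S3, and the
# GPU instance shuts down automatically when the job finishes.
estimator.fit({"training": "s3://my-bucket/path/to/training-data"})
```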
All Notebook GPU and CPU instance types: AWS Documentation.