Google Cloud Vertex AI Endpoint deployment is extremely slow - google-cloud-platform

I built a model with Google Cloud Vertex AI and uploaded it to the Model Registry. When I try to create an Endpoint and deploy the model to it, the deployment is extremely slow: it takes more than 30-40 minutes, and sometimes it just gets stuck in the deploying state... Is this normal? Is there any way to speed it up? Or at least, is there somewhere I can see any errors?

Related

Google AutoML Vision API and Google Vision API Custom Algorithm

I am looking at Google AutoML Vision API and Google Vision API. I know that if you use Google AutoML Vision API that it is a custom model because you train ML models based on your own images and define your own labels. And when using Google Vision API, you are using a pretrained model...
However, I am wondering if it is possible to use my own algorithm (one which I created and not provided by Google) and using that instead with Vision / AutoML Vision API ? ...
Sure, you can definitely deploy your own ML algorithm on Google Cloud without being tied to the Vision or AutoML API.
Two approaches that I have used many times for this same use case:
Serverless approach, if your model is relatively light in terms of computational resource requirements: deploy your own custom Cloud Function. More info here.
To be more specific, you just call your Cloud Function, passing your image directly (as base64 or by pointing to a storage location). The platform then automatically allocates the required resources, runs your custom algorithm to process the image and/or run inference, sends the results back, and vanishes (all resources released, no more running costs). Neat :)
Google AI Platform. More info here
Use AI Platform to train your machine learning models at scale, to host your trained model in the cloud, and to use your model to make predictions about new data.
When in doubt, go for AI Platform, as the whole pipeline is nicely lined up for your custom code and models. It is well suited to production deployment as well.
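To make the serverless approach more concrete, here is a minimal sketch of the request-handling logic only. It is not the actual Cloud Functions entry-point signature: handle_request, run_model, and the image_b64 field name are all hypothetical names chosen for illustration, and run_model is a placeholder for your own algorithm.

```python
import base64
import json


def run_model(image_bytes: bytes) -> dict:
    """Hypothetical stand-in for your own algorithm; it only reports the input size."""
    return {"num_bytes": len(image_bytes)}


def handle_request(payload: dict) -> str:
    """Decode a base64 image from the payload, run the model, and return JSON text."""
    encoded = payload.get("image_b64")  # assumed field name, not a Google convention
    if not encoded:
        return json.dumps({"error": "missing image_b64"})
    try:
        # validate=True rejects characters outside the base64 alphabet
        image_bytes = base64.b64decode(encoded, validate=True)
    except (ValueError, TypeError):
        return json.dumps({"error": "image_b64 is not valid base64"})
    return json.dumps(run_model(image_bytes))


if __name__ == "__main__":
    sample = {"image_b64": base64.b64encode(b"fake image bytes").decode("ascii")}
    print(handle_request(sample))
```

In a deployed function the payload would come from the incoming HTTP request rather than a plain dict, and run_model would load and call your real model.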

Object Detection Django Rest API Deployment on Google Cloud Platform or Google ML Engine

I have developed a Django API that accepts images from a live camera feed as base64 in the request. In the API, each image is converted into a NumPy array and passed to a machine learning model (object detection using the TensorFlow Object Detection API). The response is simple text listing the detected objects.
I need a GPU-based cloud instance where I can deploy this application for fast processing and real-time results. I have searched a lot but found no such resource. I believe Google Cloud console instances can be connected to a live API, but I am not sure exactly how.
Thanks
I assume that you're using GPU locally or wherever your Django application is hosted.
First thing is to make sure that you are using tensorflow-gpu and all the necessary setup for Cuda is done.
You can start your GPU instance easily on Google Cloud Platform (GCP). There are multiple ways to do this.
Quick option
Search for notebooks and start a new instance with the required GPU and RAM.
Instead of the notebook instance, you can set up the instance separately if you need some specific OS and more flexibility on choosing the machine.
To access the instance over SSH, simply add your SSH public key to Metadata, which can be seen when you open the instance details.
Set up Django as you would on any server. To test it, simply run it in debug mode on host 0 or 0.0.0.0 and your preferred port.
You can access the APIs through the external IP of the machine, which is shown on the instance details page.
Some suggestions
While the first option is quick and dirty, it is not recommended for production.
It is better to use a deployment service such as tensorflow-serving along with Kubeflow.
If you prefer to handle inference yourself, make sure that you load balance the server properly. Use NGINX or another good web server along with gunicorn/uwsgi.
You can use Redis for queue management. When someone calls the API, a GPU is not necessarily available for inference. You can skip the queue when the API gets very few hits per second. But when scaling up, say to 50 requests per second, which a single GPU can't handle at once, a queue system helps.
All requests should go to Redis first, and the GPU takes jobs from the queue. If required, you can always add more GPUs.
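The queue pattern described above can be sketched with the Python standard library. This is an illustration under loud assumptions: queue.Queue stands in for Redis, worker threads stand in for GPU workers, and infer is a hypothetical placeholder for the real object detection call.

```python
import queue
import threading


def infer(image_id: str) -> str:
    """Hypothetical placeholder for GPU inference on one image."""
    return f"detections-for-{image_id}"


def process_requests(image_ids, num_workers=1):
    """Push every request onto a queue and let workers drain it."""
    jobs = queue.Queue()  # stand-in for a Redis-backed queue
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            image_id = jobs.get()
            if image_id is None:  # sentinel: no more work for this worker
                break
            output = infer(image_id)
            with lock:
                results[image_id] = output

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for image_id in image_ids:
        jobs.put(image_id)  # the API handler would enqueue here and return
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(process_requests(["img-1", "img-2", "img-3"], num_workers=2))
```

Raising num_workers corresponds to adding GPUs: the API side stays unchanged, and only the number of consumers reading from the queue grows.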
Google Cloud offers Cloud GPUs. If your application performs heavier computations and requires real-time capabilities, I would suggest you look at the following link for more information.
https://cloud.google.com/gpu/
Compute Engine also provides GPUs that can be added to your virtual machine instances. Use GPUs to accelerate specific workloads on your instances such as Machine Learning and data processing.
https://cloud.google.com/compute/docs/gpus/
However, if your application requires a lot of resources, you'll need enough GPU quota in your project, so submit a quota increase request if you need more computing power. Also make sure to pick a zone where GPUs are available. https://cloud.google.com/compute/docs/gpus/add-gpus#create-new-gpu-instance
Since you would be using the Tensorflow API for your application on ML Engine I would advise you to take a look at this link below. It provides instructions for creating a Deep Learning VM instance with TensorFlow and other tools pre-installed.
https://cloud.google.com/ai-platform/deep-learning-vm/docs/tensorflow_start_instance

High latency issue of online prediction

I've deployed a linear model for classification on Google Machine Learning Engine and want to predict new data using online prediction.
When I called the API using the Google API client library, it took around 0.5s to get the response for a request with only one instance. I expected the latency to be less than 10 microseconds (because the model is quite simple), and 0.5s was way too long. I also tried to make predictions for the new data offline using the predict_proba method. It took 8.2s to score more than 100,000 instances, which is much faster than using Google ML Engine. Is there a way I can reduce the latency of online prediction? The model and the server that sent the request are hosted in the same region.
I want to make predictions in real time (the response is returned immediately after the API gets the request). Is Google ML Engine suitable for this purpose?
Some more info would be helpful:
Can you measure the network latency from the machine you are calling the service from to GCP? Latency will be lowest if you are calling from a Compute Engine instance in the same region that you deployed the model to.
Can you post your calling code?
Is this the latency to the first request or to every request?
To answer your final question: yes, Cloud ML Engine is designed to support a high number of queries per second.
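To help answer those questions, here is a small sketch that works through the numbers quoted in the question and times the first call separately from later ones. time_calls and fake_predict are hypothetical names; fake_predict is a stand-in that you would replace with your real client call to the prediction service.

```python
import statistics
import time

# Figures quoted in the question.
online_latency_s = 0.5        # one online request with a single instance
offline_total_s = 8.2         # offline predict_proba run
offline_instances = 100_000   # "more than 100,000", so this is a lower bound

# Offline cost per instance, in microseconds (8.2 s / 100,000 = 82 microseconds).
offline_per_instance_us = offline_total_s / offline_instances * 1e6
# How many times slower one online request is than one offline instance.
overhead_ratio = online_latency_s / (offline_total_s / offline_instances)


def time_calls(predict, instance, n=5):
    """Return (first-call latency, median latency of the remaining calls) in seconds."""
    if n < 2:
        raise ValueError("n must be at least 2")
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        predict(instance)
        latencies.append(time.perf_counter() - start)
    return latencies[0], statistics.median(latencies[1:])


def fake_predict(instance):
    """Hypothetical stand-in for the real online prediction call."""
    return [0.0 for _ in instance]


if __name__ == "__main__":
    first, later = time_calls(fake_predict, [1.0, 2.0, 3.0])
    print(f"offline: about {offline_per_instance_us:.0f} microseconds per instance")
    print(f"online request is roughly {overhead_ratio:.0f}x slower per instance")
    print(f"first call: {first:.6f}s, later calls (median): {later:.6f}s")
```

Even offline, the model needs about 82 microseconds per instance, so the 10 microsecond expectation is optimistic, and the gap to 0.5s suggests that most of the online latency comes from something other than the model's own computation. Comparing the first call with later calls shows whether that overhead is paid once or on every request.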

Getting data from local running java app to google cloud app and back

I wanted to dive into the world of distributed systems, cloud computing, IoT, etc., and I gotta be honest, I imagined everything being a little more intuitive than it finally turned out.
I had a tiny test architecture in mind that I'd like to set up with Google Cloud and its services, but I am kinda stuck since I can't get my head around some concepts.
What I basically wanted to do (as a first step) is writing a simple java application that would run locally on my computer. This application should just generate random numbers and send those numbers somehow to the google cloud. On the cloud I wanted to define another java application that would manipulate those random numbers in some kind of way (it doesn't matter actually). Afterwards, the output should somehow get back to me of course. And actually, at the moment, I don't even care about how exactly. It could be somehow back to my local app (with some kind of listener, would that be possible?). But it could also simply store the results somewhere on the google cloud? Or maybe upload them to my google drive?
I guess you already noticed that, at some points, I don't even know exactly what I want, since I'm not sure what is possible and what is not.
Could you provide me some help to get this set up?
The most important questions for me right now are:
Do I need to use a pubsub system, where my generated numbers are sent to, and which then forwards this to the cloud app, that transforms my data?
How do I get my data from the local app to the cloud services?
Would my data transforming app run on Google Dataflow?
Above I wrote "as a first step"... because later I would also like to send config files (for example in JSON or XML format) to the cloud, and the cloud application should transform those config files... if I get the first scenario running, then I guess this would also be no problem, right?
Those are just a few of the questions that are on my mind currently. The most important ones I guess.
It would be a big help. Sorry, if the questions are not very precise, but I really need some kind of pointing into the right direction.
Thank you in advance!
I think it would be good to read up on some of the technologies you mention here:
Google Cloud Pubsub: Pub/Sub enables you to publish messages to a topic and consume them in another place in the (Google) Cloud. You can see some different examples of publishers and consumers in the link. In your case you could, for example, write a Java application that writes random numbers to the Pub/Sub queue, where they will sit for 7 days to be consumed by another component (for example, Google Cloud Dataflow). To get started developing, you can find the SDKs here (there is a Java SDK).
Google Cloud Dataflow: a managed service that runs Apache Beam pipelines to process your data at scale. You can learn about the different concepts here and get started designing your pipeline here. I suggest taking a look at some examples first though, which will make it easier to grasp what is actually going on. Dataflow has a PubSub connector, so in your application you will be able to read from the topic you created before. In Dataflow you can, for example, multiply all your random numbers and write them to a sink (for example Google Cloud Storage, or even BigQuery or PubSub again).
Google Cloud Storage: a cloud storage service where you can put files, for example the output of your Dataflow pipeline. You can download the files manually using the Cloud Console UI, or use one of the SDKs to download the output programmatically.
Hope this gives you an overview and some pointers to start. Whenever you are ready and have a more concrete use case in mind, you can start looking at some more components.
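To make the flow of data through those three components easier to picture, here is a purely local simulation. It is written in Python for brevity rather than Java, and it uses none of the Google SDKs: a queue.Queue stands in for the Pub/Sub topic, a plain function stands in for the Dataflow transform, and a temporary directory stands in for a Cloud Storage bucket. All function names are hypothetical.

```python
import json
import queue
import random
import tempfile
from pathlib import Path


def publish_random_numbers(topic, count, seed=0):
    """Local app: publish random numbers as JSON messages (stand-in for Pub/Sub)."""
    rng = random.Random(seed)
    for _ in range(count):
        topic.put(json.dumps({"value": rng.randint(1, 100)}).encode("utf-8"))


def transform(topic, factor=2):
    """Cloud app: read every message and multiply its value (stand-in for Dataflow)."""
    results = []
    while not topic.empty():
        message = json.loads(topic.get().decode("utf-8"))
        results.append(message["value"] * factor)
    return results


def write_results(results, directory):
    """Sink: write the output to a file (stand-in for a Cloud Storage object)."""
    path = Path(directory) / "results.json"
    path.write_text(json.dumps(results))
    return path


def run_pipeline(count=5, factor=2, seed=0):
    """Run publish -> transform -> sink and read the stored output back."""
    topic = queue.Queue()
    publish_random_numbers(topic, count, seed)
    results = transform(topic, factor)
    with tempfile.TemporaryDirectory() as bucket:
        path = write_results(results, bucket)
        return json.loads(path.read_text())


if __name__ == "__main__":
    print(run_pipeline())
```

In the real setup each stage runs in a different place: the publisher on your machine, the transform in Dataflow, and the sink in Cloud Storage, with the SDKs mentioned above replacing the stand-ins.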

What are some of the most appropriate ways for serving a large scale django app on Google Compute Engine?

I am working on a project that will presumably have a lot of user-uploaded content and also a fairly large user base. I am now looking to deploy this app to Google Compute Engine.
I have looked into the possible options, and nginx+gunicorn seems to be a good one. In the beginning I am going to use a single ns-1 instance with a 100 GB persistent hard drive and Google Cloud SQL for my database.
But I want to make things scalable so that I can add more instances and disk storage without any hassle in the future, and I am very confused about how to do that. So my main concern is:
I want a setup in which I can extend my disk space and the number of Google Compute Engine instances whenever I want.
In order to have a fully scalable architecture, a good approach is to separate computation / serving, from file storage, and both from data storage. Going part by part:
file storage - Google Cloud Storage - by storing common service files in a GCS bucket, you get a central repository that is both highly-redundant, and scalable;
data storage - Google Cloud SQL - gives you a highly reliable, scalable MySQL-like database back-end, which can be resized at will to accommodate increasing database usage;
front-ends - GCE instance group - template-generated web / computation front-ends, setting up a resource pool into which a forwarding rule (load balancer) distributes incoming connections.
In a nutshell, this is one of the most adaptable set-ups I can think of, while you keep control over every aspect of the service and underlying infrastructure.
A simple approach would be to run a Python app on Google App Engine, which will auto-scale your instances (both up and down) and it supports Django, as mentioned by #spirulence in the comments.
Here are some starting points:
Django and Cloud SQL support on App Engine
Running Pure Django Projects on Google App Engine
Third-party Libraries in Python 2.7
The last link shows which versions of Django are currently supported.