Using TPU in custom training job on Vertex AI

I am trying to use a tpu-v2-8 through a custom training job. My job runs fine on a VM, but as a custom training job it runs out of memory (OOM) and also seems slower. It is also quite hard to schedule: it stays pending for more than a few minutes and hits an internal error most of the time (I tried us-central1 and asia-east1).
Furthermore, the monitoring for CPU, memory, network, etc. exists in the web UI but says unavailable. Also, I'm using TF/JAX and the log format conforms to the glog standard, yet all the logging from my application shows up as errors in Cloud Logging instead of at the appropriate levels.
Am I missing something or doing something wrong?

No, everything seems fine on your side. To be specific:
It makes sense that the training process is slower, as all operations are passed through Vertex AI to TPU.
Sometimes it's hard to obtain TPUs via Vertex AI. This could be a capacity issue in Vertex AI itself. Keep trying different regions, including europe-west4.
Yes, unfortunately no metrics are available when using TPUs at the moment, and some log entries are marked as errors.
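One thing that may be worth trying for the log levels is sketched below. It assumes (this is not confirmed by the answer above) that the platform records everything written to stderr as an error, which is where Python's logging module writes by default. The sketch sends DEBUG/INFO records to stdout and WARNING and above to stderr. The logger name "trainer" and the helper names are made up for illustration, and this only affects Python-level logging, not messages emitted by native TF/JAX code.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records strictly below a given level."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno < self.max_level


def configure_split_logging(logger_name="trainer"):
    """Send DEBUG/INFO to stdout and WARNING and above to stderr."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False  # avoid duplicate output via the root logger

    out = logging.StreamHandler(sys.stdout)
    out.setLevel(logging.DEBUG)
    out.addFilter(MaxLevelFilter(logging.WARNING))

    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.WARNING)

    fmt = logging.Formatter("%(levelname)s %(name)s: %(message)s")
    out.setFormatter(fmt)
    err.setFormatter(fmt)
    logger.addHandler(out)
    logger.addHandler(err)
    return logger


if __name__ == "__main__":
    log = configure_split_logging()
    log.info("step 100 finished")      # goes to stdout
    log.error("checkpoint write failed")  # goes to stderr
```

If the severities in Cloud Logging do not change after this, the stderr assumption does not hold for your job and the sketch can be dropped.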

Related

Vertex AI custom prediction vs Google Kubernetes Engine

I have been exploring using Vertex AI for my machine learning workflows. Because deploying different models to the same endpoint utilizing only one node is not possible in Vertex AI, I am considering a workaround. With this workaround, I will be unable to use many Vertex AI features, like model monitoring, feature attribution, etc., and it simply becomes, I think, a managed alternative to running the prediction application on, say, a GKE cluster. So, besides the cost difference, I am exploring whether running the custom prediction container on Vertex AI vs. GKE will involve any limitations; for example, only N1 machine types are available for prediction in Vertex AI.
There is a similar question, but it does not raise the specific questions I hope to have answered.
I am not sure of the available disk space. In Vertex AI, one can specify the machine type, such as n1-standard-2 etc., but I am not sure what disk space will be available and if/how one can specify it? In the custom container code, I may copy multiple model artifacts, or data from outside sources to the local directory before processing them so understanding any disk space limitations is important.
For custom training in Vertex AI, one can use an interactive shell to inspect the container where the training code is running, as described here. Is something like this possible for a custom prediction container? I have not found anything in the docs.
For custom training, one can use a private IP as described here. Again, I have not found anything similar for custom prediction in the docs. Is it possible?
If you know of any other possible limitations, please post.
We don't specify a disk size, so it defaults to 100 GB.
I'm not aware of such a feature right now. But since it's a custom container, you could just run it locally or on GKE for debugging purposes.
Are you looking for this? https://cloud.google.com/vertex-ai/docs/predictions/using-private-endpoints
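On the disk space point, a small check run inside the container can confirm how much room is actually there before model artifacts are copied in. This is only a sketch using the standard library; the helper names are made up for illustration, and the path to check depends on where your container writes its files.

```python
import shutil


def disk_report(path="/"):
    """Return total, used and free space for `path`, in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }


def has_room_for(required_bytes, path="/"):
    """True if `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    print(disk_report())
```

Calling has_room_for with the combined size of the artifacts before downloading them lets the container fail early with a clear message instead of partway through a copy.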

Object Detection Django Rest API Deployment on Google Cloud Platform or Google ML Engine

I have developed a Django API which accepts images from a live camera feed, sent as base64 in the request. In the API, the image is then converted into a NumPy array and passed to a machine learning model, i.e. object detection using the TensorFlow Object Detection API. The response is simple text listing the detected objects.
I need a GPU-based cloud instance where I can deploy this application for fast processing to achieve real-time results. I have searched a lot but found no such resource. I believe Google Cloud console (instances) can be connected to a live API, but I am not sure exactly how.
Thanks
I assume that you're using GPU locally or wherever your Django application is hosted.
The first thing is to make sure that you are using tensorflow-gpu and that all the necessary CUDA setup is done.
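A quick way to confirm that TensorFlow actually sees the GPU is sketched below. It assumes TensorFlow 2.x, where tf.config.list_physical_devices is available; the function name list_gpus is made up for illustration, and it returns an empty list if TensorFlow is not installed.

```python
def list_gpus():
    """Return the names of GPUs TensorFlow can see, or [] if none."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = list_gpus()
    print("GPUs visible to TensorFlow:", gpus or "none")
```

An empty result on a machine that does have a GPU usually points at the driver or CUDA setup rather than at the Django code.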
You can start your GPU instance easily on Google Cloud Platform (GCP). There are multiple ways to do this.
Quick option
Search for notebooks and start a new instance with the required GPU and RAM.
Instead of the notebook instance, you can set up the instance separately if you need some specific OS and more flexibility on choosing the machine.
To access the instance with ssh, simply add your ssh public key to Metadata, which can be seen when you open the instance details.
Set up Django as you would on a server. To test it, simply run it in debug mode on host 0 or 0.0.0.0 and your preferred port.
You can access the APIs with the external IP of the machine which can be found out in the instance details page.
Some suggestions
While the first option is quick and dirty, it's not recommended to use that in production.
It is better to use some deployment services such as tensorflow-serving along with Kubeflow.
If you are handling the inference yourself, make sure that you load balance the server properly. Use NGINX or any other good server along with gunicorn/uwsgi.
You can use Redis for queue management. When someone calls the API, the GPU is not necessarily available for inference. It is fine to skip this when you have very few hits on the API per second. But when scaling up, say to 50 requests per second, which a single GPU can't handle at once, a queue system helps.
All the requests should directly go to redis first and the GPU takes the jobs required to be done from the queue. If required, you can always scale the GPU.
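The queue pattern described above can be sketched with the standard library alone. Here queue.Queue stands in for Redis and a single thread stands in for the GPU worker; run_inference is a hypothetical placeholder for the real model call, and none of these names come from the original answer.

```python
import queue
import threading


def run_inference(image_bytes):
    """Hypothetical stand-in for the object detection model call."""
    return f"detected objects in {len(image_bytes)} bytes"


def gpu_worker(jobs, results):
    """Pull jobs one at a time so the single GPU is never oversubscribed."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut the worker down
            jobs.task_done()
            break
        job_id, image_bytes = job
        results[job_id] = run_inference(image_bytes)
        jobs.task_done()


def process_all(images):
    """Enqueue every request, let one worker drain the queue, return results."""
    jobs = queue.Queue()
    results = {}
    worker = threading.Thread(target=gpu_worker, args=(jobs, results))
    worker.start()
    for job_id, image_bytes in enumerate(images):
        jobs.put((job_id, image_bytes))  # the API handler would do this
    jobs.put(None)
    jobs.join()
    worker.join()
    return results


if __name__ == "__main__":
    print(process_all([b"frame-one", b"frame-two"]))
```

With Redis the shape stays the same: the API handler pushes a job and the worker process next to the GPU pops it. Scaling the GPU side then means starting more workers against the same queue.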
Google Cloud actually offers Cloud GPUs. If you are looking to perform higher level computations with applications that require real-time capabilities, I would suggest you look into the following link for more information.
https://cloud.google.com/gpu/
Compute Engine also provides GPUs that can be added to your virtual machine instances. Use GPUs to accelerate specific workloads on your instances such as Machine Learning and data processing.
https://cloud.google.com/compute/docs/gpus/
However, if your application requires a lot of resources you’ll need to increase your quota to ensure you have enough GPUs available in your project. Make sure to pick a zone where GPUs are available. If this requires much more computing power you would need to submit a request for an increase of your quota. https://cloud.google.com/compute/docs/gpus/add-gpus#create-new-gpu-instance
Since you would be using the Tensorflow API for your application on ML Engine I would advise you to take a look at this link below. It provides instructions for creating a Deep Learning VM instance with TensorFlow and other tools pre-installed.
https://cloud.google.com/ai-platform/deep-learning-vm/docs/tensorflow_start_instance

Economic possibility to execute many workflow tasks

So I have these Airflow DAGs, each of which consists of several tasks. Basically, each task executes some independent analysis steps against a given code snippet, and finally it is decided whether this snippet can be used from a regulatory point of view.
Each task, depending on the code snippet, is quite short (1-25 minutes at most) and mostly boils down to executing some external analysis tool (open source or internal) and processing the output of this tool.
All this works quite nicely on my development machine, but since we are analyzing quite a lot of code snippets during working hours (~50 per hour) and none outside of working hours, I'd like to get all of this up and running somewhere in the cloud (I don't really care whether on Google Cloud, AWS or Azure).
So my question is: what would be an economical way of getting this up and running in the cloud? I thought about using Google Cloud Composer and Google's preemptible VMs (the ones that shut down randomly but are super cheap), but it seems that I cannot use the PVMs together with Cloud Composer.
Since the various steps in the DAG are independent, the PVMs would be great IMO: if they are shut down during task execution, I just retry that one task on a different PVM.
Thank you
On Google Cloud, there are a few options for you.
Run self-managed Airflow on a Compute Engine VM
Run Cloud Composer
The best option will be a mix of how much you want to spend and what features you need. Self-managed Airflow is a great option if you want to have very low cost (less than $100 per month) and are OK self-managing the VM and taking on the risk that the SLA from Google will only cover the VM, so if Airflow malfunctions, you're going to have to detect it and fix it.
The benefit of Composer is the fact that it's integrated so you get things like IAM, Stackdriver, WebUI proxying and so on. You will pay more for the service, however, since it's managed. Presently there is no way to run Composer with preemptible VMs.
Your use case sounds like it could run on a default size cluster on Cloud Composer, though.
It's worth noting that if you go self-managed, you also get the benefit that Google actively contributes to Airflow, so things like the operators should work against the current product APIs. Google also contributes fixes and new operators pretty regularly.
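If you go self-managed and try preemptible VMs anyway, the retry idea from the question maps onto Airflow's per-task retries and retry_delay settings. The fragment below is a minimal sketch with made-up values. Whether a retried task actually lands on a different VM depends on how your executor and workers are set up, which is an assumption here.

```python
from datetime import timedelta

# Hypothetical values; tune them to how often the VMs are preempted.
default_args = {
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=2),  # wait before each retry
}

# In the DAG file this dict would be passed as DAG(..., default_args=default_args),
# so every task in the DAG inherits the same retry behaviour.
```

Since each task runs for 25 minutes at most, a handful of retries with a short delay should cost little compared with rerunning the whole DAG.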

Cloud ML: Varying training time taken for the same data

I am using Google Cloud ML for training jobs. I observe a peculiar behavior: the time taken for the training job to complete varies for the same data. I analyzed the CPU and memory utilization in the Cloud ML console and see very similar utilization in both cases (7 min and 14 min).
Can anyone let me know what would be the reason for the service to take inconsistent time for the job to complete?
I have the same parameters and data in both cases, and I also verified that the time spent in the PREPARING phase is pretty much the same in both cases.
Also, would it matter that I schedule multiple independent training jobs simultaneously on the same project? If so, I would like to know the rationale behind it.
Any help would be greatly appreciated.
The easiest way is to add more logging to inspect where the time was spent. You can also inspect training progress using TensorBoard. There's no VM sharing between multiple jobs, so it's unlikely to be caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters the RUNNING state. Job startup latency varies depending on whether it's a cold or warm start (i.e., we keep the VMs from the previous job running for a while).
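One lightweight way to add that logging is a timing context manager around each phase of the trainer. This is a sketch using only the standard library; the phase names and the timed helper are made up for illustration, and the sleeps are placeholders for real work.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("trainer.timing")


@contextmanager
def timed(phase, durations=None):
    """Log how long the wrapped block took; optionally record it in a dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.2f s", phase, elapsed)
        if durations is not None:
            durations[phase] = elapsed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    durations = {}
    with timed("load_data", durations):
        time.sleep(0.1)  # placeholder for reading the training data
    with timed("train", durations):
        time.sleep(0.2)  # placeholder for the training loop
```

Comparing the per-phase numbers from the 7-minute and the 14-minute run should show whether the extra time goes into data loading, training steps, or writing outputs.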

How to make my datalab machine learning run faster

I have some data: 3.2 million entries in a CSV file. I'm trying to use a CNN estimator in TensorFlow to train the model, but it's very slow. Every time I run the script it gets stuck, and the web page (localhost) just refuses to respond anymore. Any recommendations? (I've tried with 22 CPUs and I can't increase it any further.)
Can I just run it in the background, like the command line python xxx.py &, to keep the process going, and then come back to check after some time?
Google offers serverless machine learning with TensorFlow for precisely this reason. It is called Cloud ML Engine. Your workflow would basically look like this:
Develop the program to train your neural network on a small dataset that can fit in memory (iron out the bugs, make sure it works the way you want)
Upload your full data set to the cloud (Google Cloud Storage, BigQuery, etc.) (documentation reference: training steps)
Submit a package containing your training program to Cloud ML Engine (this will point to the location of your full data set in the cloud) (documentation reference: packaging the trainer)
Start a training job in the cloud; this is serverless, so it will take care of scaling to as many machines as necessary, without you having to deal with setting up a cluster, etc. (documentation reference: submitting training jobs).
You can use this workflow to train neural networks on massive data sets - particularly useful for image recognition.
If this is a little too much information, or if this is part of a workflow that you'll be doing a lot and you want to get a stronger handle on it, Coursera offers a course on Serverless Machine Learning with Tensorflow. (I have taken it, and was really impressed with the quality of the Google Cloud offerings on Coursera.)
I am sorry for answering even though I am completely ignorant of what Datalab is, but have you tried batching?
I am not aware if it is possible in this scenario, but maybe insert only 10,000 entries in one go, and repeat this in as many batches as needed until all entries have been fed in?
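The batching idea can be sketched with the standard library: read the CSV lazily and hand the model at most 10,000 rows at a time, so the whole file never has to sit in memory. The function name read_batches is made up for illustration, and the sketch treats every line as data (it does not skip a header row).

```python
import csv
import io
from itertools import islice


def read_batches(file_obj, batch_size=10_000):
    """Yield lists of at most `batch_size` CSV rows without loading the whole file."""
    reader = csv.reader(file_obj)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    sample = io.StringIO("1,2\n3,4\n5,6\n")
    for batch in read_batches(sample, batch_size=2):
        print(len(batch), "rows")
```

For the real file, open it with open(path, newline="") and pass the file object in; each yielded batch can then be fed to one training step.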