Debugging Google Cloud Dataflow VM Instances - google-cloud-platform

I have a Google Cloud Dataflow streaming job whose system lag is growing. The lag started when I deployed new changes and it keeps growing without subsiding. I also see frequent GCs in the Stackdriver logs, which suggests an inefficiency/bug introduced by the newly deployed changes. I would like to debug this further: what is the best way to debug the JVM on Dataflow instances?
I have tried enabling the Monitoring agent when launching the job, which gives me GC count/time, but that is not very useful for tracking down the source of the issue.

Related

Triggering an alert when multiple dataflow jobs run in parallel in GCP

I am using Google Cloud Dataflow to execute some resource-intensive Dataflow jobs, and at any given time my system must execute no more than 2 jobs in parallel.
Since each job is quite resource-intensive, I am looking for a way to trigger an alert when more than 2 Dataflow jobs are running.
I tried implementing a custom counter that increments at the start of each job, but the custom counter is only reported after the job has executed, and by then it may be too late to trigger an alert.
You could lower the project's dataflow.googleapis.com/job_count quota so that no more than the desired number of jobs can run in parallel in that project. The quota is at the project level, so it would not affect other projects.
Another option is to use a GCP monitoring system that observes the running Dataflow jobs. You can, for example, use Elastic Cloud (available via the Marketplace) to ingest all the relevant metrics and logs; Elastic can visualize and alert on any state you are interested in.
I found this Terraform project very helpful for getting started with that approach.
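If you would rather not run a separate monitoring stack, the same idea can be sketched as a small script that polls the Dataflow API for active jobs and alerts when the count goes above your limit. A minimal sketch, where the project ID, region and the "alert" action are placeholders you would replace:

    # Sketch: count active Dataflow jobs via the Dataflow API and complain when
    # more than the allowed number are running. Assumes google-api-python-client
    # and Application Default Credentials; project and region are placeholders.
    from googleapiclient.discovery import build

    PROJECT = "my-project"   # placeholder
    REGION = "us-central1"   # placeholder
    MAX_PARALLEL_JOBS = 2


    def count_active_jobs():
        dataflow = build("dataflow", "v1b3")
        request = dataflow.projects().locations().jobs().list(
            projectId=PROJECT, location=REGION, filter="ACTIVE"
        )
        active = 0
        while request is not None:
            response = request.execute()
            active += len(response.get("jobs", []))
            request = dataflow.projects().locations().jobs().list_next(
                previous_request=request, previous_response=response
            )
        return active


    if __name__ == "__main__":
        running = count_active_jobs()
        if running > MAX_PARALLEL_JOBS:
            # Replace the print with your own notification mechanism.
            print(f"ALERT: {running} Dataflow jobs running (limit {MAX_PARALLEL_JOBS})")
        else:
            print(f"{running} Dataflow jobs running")

Run it on a schedule (e.g. Cloud Scheduler plus a Cloud Function) so the check happens while jobs are starting, rather than after they finish like the custom counter.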

Triggering a training task on cloud ml when file arrives to cloud storage

I am trying to build an app where the user is able to upload a file to Cloud Storage. This would then trigger a model training process (and prediction later on). Initially I thought I could do this with Cloud Functions/Pub/Sub and Cloud ML, but it seems that Cloud Functions are not able to trigger gsutil commands, which are needed for Cloud ML.
Is my only option to enable Cloud Composer, attach GPUs to a Kubernetes node, and create a Cloud Function that triggers a DAG to boot up a pod on the node with GPUs while mounting the bucket with the data? That seems a bit excessive, but I can't think of another way currently.
You're correct. As of now, there is no way to execute a gsutil command from a Google Cloud Function:
Cloud Functions can be written in Node.js, Python, Go, and Java, and are executed in language-specific runtimes.
I really like your second approach with triggering the DAG.
Another idea that comes to mind is to interact with GCP virtual machines from within Cloud Composer through the PythonOperator, using the Compute Engine Python API; a rough sketch of that idea follows. You can find more information on automating infrastructure and a deep technical dive into the core features of Cloud Composer here.
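A minimal sketch of that Compute Engine idea, assuming a Composer environment with Airflow 1.10-style imports and a pre-created (hypothetical) GPU training VM; the project, zone and instance names are placeholders:

    # Composer DAG with a PythonOperator task that starts a pre-created GPU
    # training VM through the Compute Engine API. The VM's startup script would
    # mount the bucket (e.g. with gcsfuse) and launch the training job.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from googleapiclient.discovery import build

    PROJECT = "my-project"        # placeholder
    ZONE = "us-central1-a"        # placeholder
    INSTANCE = "gpu-training-vm"  # hypothetical, pre-created VM with GPUs


    def start_training_vm():
        # Uses the Application Default Credentials of the Composer environment.
        compute = build("compute", "v1")
        compute.instances().start(
            project=PROJECT, zone=ZONE, instance=INSTANCE
        ).execute()


    with DAG(
        dag_id="trigger_training_on_upload",
        start_date=datetime(2020, 1, 1),
        schedule_interval=None,  # triggered externally, e.g. by a Cloud Function
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="start_training_vm",
            python_callable=start_training_vm,
        )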
Another solution you can think of is Kubeflow, which aims to make running ML workloads on Kubernetes simple and portable. Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks. Please have a look at the Codelabs tutorial.
I hope you find the above pieces of information useful.

Why is there a DAG named 'airflow_monitoring' automatically generated in Cloud Composer?

When creating an Airflow environment on GCP Composer, a DAG named airflow_monitoring is automatically created, and it comes back even when deleted.
Why? How should I handle it? Should I copy this file into my DAG folder and resign myself to making it part of my code? I noticed that each time I upload my code, it stops the execution of this DAG because it cannot be found in the DAG folder, until it magically reappears.
I have already tried deleting it from the DAG folder, deleting the logs, deleting it from the UI, all of this at the same time, etc.
The airflow_monitoring DAG is a per-environment liveness probe/healthcheck that is used to populate the Cloud Composer monitoring metric environment/healthy. It is an indicator of the overall health of your environment, or more specifically, of its ability to schedule DAGs and run tasks. This allows you to use Google Cloud Monitoring features such as metric graphs and alerts that fire when your environment becomes unhealthy.
You can find more information about the metric on the GCP Metrics List, and can explore the metric in Cloud Monitoring under the following:
Resource type: Cloud Composer Environment
Metric: Healthy
This is a Composer-managed DAG and uses very minimal resources from your environment. Ideally, you should leave it untouched, as it has little to no effect on anything else running in your environment.
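If you want to consume the metric programmatically, here is a minimal sketch using the google-cloud-monitoring client (v2.x); the project and environment names are placeholders, and the resource label used in the filter should be double-checked against the Metrics List mentioned above:

    # Sketch: read the last hour of the environment/healthy metric for one
    # Composer environment via the Cloud Monitoring API.
    import time

    from google.cloud import monitoring_v3

    PROJECT = "my-project"           # placeholder
    ENVIRONMENT = "my-composer-env"  # placeholder

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
    )

    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT}",
            "filter": (
                'metric.type = "composer.googleapis.com/environment/healthy" '
                f'AND resource.labels.environment_name = "{ENVIRONMENT}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for series in results:
        for point in series.points:
            # The metric is a boolean: True means the liveness DAG ran on time.
            print(point.interval.end_time, point.value.bool_value)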

Google Cloud Platform unable to run long running process when connectivity drops

I am doing custom object detection training using darkflow on a Google Cloud Platform Compute Engine VM with a GPU, but the long-running process dies whenever I lose connectivity or my laptop goes to sleep. I have tried running it via SSH from my Windows machine, via Google Cloud Shell, via a terminal in a Jupyter Notebook on the cloud platform, and via a Jupyter Notebook directly, but the process fails in all these scenarios due to connectivity loss even though the VM is still running. What is the best way to keep this long-running process going?
P.S. I did realize later that Google Cloud Shell is not suitable for this purpose.
As you already wrote, Cloud Shell is not suitable for that kind of job, and workarounds with screen, tmux or byobu do not help there either. The best practice is just to use a preemptible VM instead.
Some limitations of the CloudShell are mentioned in the documentation:
Usage limits
Cloud Shell is intended for interactive use only. Non-interactive sessions will be ended automatically after a warning. Prolonged usage or computational or network intensive processes are not supported and may result in session termination without a warning.
Cloud Shell also has weekly usage limits. If you reach your usage limit, you'll need to wait until the specified time (listed under Usage Quota, found under the three dots menu icon) before you can use Cloud Shell again.
Nevermind, I found the solution here: https://askubuntu.com/questions/8653/how-to-keep-processes-running-after-ending-ssh-session

Unable to cancel a dataflow on Google Cloud Platform

I have a number of Google Cloud Dataflow jobs marked as "Running" in the Dataflow console, but there are no GCE instances running. I manually terminated the instances to avoid being billed. The jobs seem to be permanently stuck in the "Running" state. If I try to cancel them from the console or the gcloud utility, I receive a warning that the job is already in a "finishing state", so the request was ignored.
I am now at the running-jobs quota of 10, so I am stuck. Is there any solution to this other than creating a new project?
There was an issue in the Dataflow service that caused cancel requests to become stuck. It has since been resolved.
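For reference, a cancel request can also be issued programmatically through the Dataflow REST API by setting the job's requestedState; a minimal sketch with the discovery client, where the project, region and job ID are placeholders:

    # Sketch: request cancellation of a Dataflow job by updating its
    # requestedState to JOB_STATE_CANCELLED. Assumes google-api-python-client
    # and Application Default Credentials.
    from googleapiclient.discovery import build

    PROJECT = "my-project"  # placeholder
    REGION = "us-central1"  # placeholder
    JOB_ID = "2017-01-01_00_00_00-123456789012345678"  # placeholder

    dataflow = build("dataflow", "v1b3")
    job = (
        dataflow.projects()
        .locations()
        .jobs()
        .update(
            projectId=PROJECT,
            location=REGION,
            jobId=JOB_ID,
            body={"requestedState": "JOB_STATE_CANCELLED"},
        )
        .execute()
    )
    print(job.get("currentState"))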