Triggering an alert when multiple dataflow jobs run in parallel in GCP - google-cloud-platform

I am using Google Cloud Dataflow to execute some resource-intensive Dataflow jobs, and at any given time my system must run no more than 2 jobs in parallel.
Since each job is quite resource intensive, I am looking for a way to trigger an alert when more than 2 Dataflow jobs are running.
I tried implementing a custom counter that increments after each job starts, but the custom counter only shows up after the job has executed, and by then it might be too late to trigger an alert.

You could modify the project's dataflow.googleapis.com/job_count quota to limit it to 1, so that no two jobs can run in parallel in that project. The quota is set at the project level, so it does not affect other projects.

Another option is to use a GCP monitoring system that observes the running Dataflow jobs. You can, for example, use Elastic Cloud (available via the Marketplace) to load all relevant metrics and logs; Elastic can visualize and alert on every state you are interested in.
I found this Terraform project very helpful for getting started with that approach.
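Whichever monitoring backend you choose, the alert ultimately comes down to counting active jobs. As a minimal sketch (assuming the google-api-python-client library and application default credentials; the project ID, region, and threshold below are placeholders), you could poll the Dataflow API for ACTIVE jobs and raise an alert when the count exceeds your limit:

```python
# Minimal sketch: count ACTIVE Dataflow jobs and flag when the limit is exceeded.
# Assumes google-api-python-client is installed and application default
# credentials are available; project, region, and limit are placeholders.
from googleapiclient.discovery import build

PROJECT_ID = "my-project"      # placeholder
REGION = "us-central1"         # placeholder; jobs in other regions need their own call
MAX_PARALLEL_JOBS = 2

def count_active_jobs():
    dataflow = build("dataflow", "v1b3")
    response = dataflow.projects().locations().jobs().list(
        projectId=PROJECT_ID, location=REGION, filter="ACTIVE"
    ).execute()
    # Pagination is ignored here for brevity.
    return len(response.get("jobs", []))

if __name__ == "__main__":
    running = count_active_jobs()
    if running > MAX_PARALLEL_JOBS:
        # Replace the print with your alerting hook (Pub/Sub, e-mail, PagerDuty, ...).
        print(f"ALERT: {running} Dataflow jobs are running (limit is {MAX_PARALLEL_JOBS}).")
```

Running something like this every few minutes from Cloud Scheduler or a small Cloud Function gives you an alert without waiting for job-level custom counters to surface.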

Related

Cloud Composer vs Cloud Scheduler

I am currently studying for the GCP Data Engineer exam and have struggled to understand when to use Cloud Scheduler and when to use Cloud Composer.
From reading the docs, I have the impression that Cloud Composer should be used when there are interdependencies between jobs, e.g. we need the output of one job to start another whenever the first one finishes, using dependencies coming from the first job. You can then chain as many of these "workflows" as you want, as well as restart failed jobs, run batch jobs, shell scripts, chain queries and so on.
Cloud Scheduler has very similar capabilities in terms of what tasks it can execute; however, it is meant more for regular jobs that you execute at fixed intervals, not for cases where jobs depend on each other or where one job has to wait for another before starting. It therefore seems more tailored to "simpler" tasks.
These thoughts came after attempting to answer some exam questions I found. However, I was surprised by the "correct answers" given, and I was hoping someone could clarify whether these answers are correct and whether I have understood when to use one over the other.
Here are the example questions that confused me regarding this topic:
Question 1
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times.
Which service should you use to manage the execution of these jobs?
A. Cloud Scheduler
B. Cloud Dataflow
C. Cloud Functions
D. Cloud Composer
Correct Answer: A
Question 2
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day.
Which tool should you use?
A. cron
B. Cloud Composer
C. Cloud Scheduler
D. Workflow Templates on Cloud Dataproc
Correct Answer: D
Question 3
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers.
Which cloud-native service should you use to orchestrate the entire pipeline?
A. Cloud Dataflow
B. Cloud Composer
C. Cloud Dataprep
D. Cloud Dataproc
Correct Answer: D
Any insight on this would be greatly appreciated. Thank you!
Your assumptions are correct: Cloud Composer is a managed Apache Airflow service and serves well when orchestrating interdependent pipelines, while Cloud Scheduler is just a managed cron service.
I don't know where you got these questions and answers, but I assure you (and I just got the GCP Data Engineer certification last month) that the correct answer would be Cloud Composer for every one of them. Just ignore these supposed correct answers and move on.
Cloud Scheduler is essentially cron-as-a-service. All you need to do is enter a schedule and an endpoint (Pub/Sub topic, HTTP, App Engine route). Cloud Scheduler has built-in retry handling, so you can set a fixed number of retries, and it doesn't impose time limits on requests. The functionality is much simpler than Cloud Composer's.
Cloud Composer is managed Apache Airflow that "helps you create, schedule, monitor and manage workflows. Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command line tools, so you can focus on your workflows and not your infrastructure." (https://cloud.google.com/composer/docs/)
Airflow is aimed at data pipelines with all the needed tooling.
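To make the distinction concrete, here is a minimal sketch of the kind of Airflow DAG you would deploy to Cloud Composer for Question 1's scenario: interdependent steps that run in order, each retried a fixed number of times. The DAG ID, schedule, and commands are hypothetical placeholders.

```python
# Minimal sketch of a Cloud Composer (Airflow 2) DAG: ordered, interdependent
# steps with a fixed retry count. All names and commands are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 3,                          # retry failed steps a fixed number of times
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="scheduled_batch_pipeline",     # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",         # the scheduling part, which Cloud Scheduler alone could also do
    catchup=False,
    default_args=default_args,
) as dag:
    run_shell = BashOperator(
        task_id="run_shell_step",
        bash_command="echo 'placeholder for the shell script step'",
    )
    run_hadoop = BashOperator(
        task_id="run_hadoop_step",
        bash_command="echo 'placeholder for submitting the Hadoop job'",
    )
    run_bq = BashOperator(
        task_id="run_bigquery_step",
        bash_command="echo 'placeholder for the BigQuery query'",
    )

    # The explicit ordering of interdependent steps is what Cloud Scheduler cannot express.
    run_shell >> run_hadoop >> run_bq
```

In a real pipeline the echo placeholders would typically be replaced with the dedicated Dataproc and BigQuery operators that Airflow's Google provider package ships.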

Scheduling cron jobs on Google Cloud DataProc

I currently have a PySpark job that is deployed on a Dataproc cluster (1 master and 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (point in polygon). I am able to run this job successfully by itself. However, I want to schedule the job to run on the 7th of every month.
What I am looking for is the most efficient way to set up cron-like scheduling on a Dataproc cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a Dataproc cluster. It would be really helpful to see either an example of a scheduled job on Dataproc or some documentation on Dataproc and Scheduler working together.
Thanks in advance!
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster, while also handling errors), Dataproc's Workflow Templates API is a better choice than trying to orchestrate these steps yourself. A key advantage is that Workflows are fire-and-forget, and any clusters created will also be deleted on completion.
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, a simpler way to schedule it is Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.
Assuming your use case is simply "run the workflow every so often with the same parameters", I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new service account in my project, called workflow-starter@example.iam.gserviceaccount.com, and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in the Developers Console and created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: workflow-starter@example.iam.gserviceaccount.com
Scope: (left blank)
You can test it by clicking Run Now.
Note that you can also copy the entire workflow content into the Body as a JSON payload. The last part of the URL would then become workflowTemplates:instantiateInline?alt=json
Check out this official doc that discusses other scheduling options.
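If you prefer to create the same scheduler job from code rather than the console, a rough sketch with the google-cloud-scheduler client library could look like the following. The project, location, job name, and schedule are the same hypothetical values as in the walkthrough above; check the library docs for the exact fields in your version.

```python
# Rough sketch: create the Cloud Scheduler HTTP job programmatically.
# Assumes the google-cloud-scheduler client library; resource names, the
# location, and the schedule (06:00 on the 7th of each month) are placeholders
# mirroring the console walkthrough above.
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/example/locations/us-central1"   # Scheduler jobs need a concrete location

job = scheduler_v1.Job(
    name=f"{parent}/jobs/terasort-example-monthly",
    schedule="0 6 7 * *",                            # 7th of every month at 06:00
    time_zone="Etc/UTC",
    http_target=scheduler_v1.HttpTarget(
        uri=(
            "https://dataproc.googleapis.com/v1/projects/example/regions/global/"
            "workflowTemplates/terasort-example:instantiate?alt=json"
        ),
        http_method=scheduler_v1.HttpMethod.POST,
        body=b"{}",
        oauth_token=scheduler_v1.OAuthToken(
            service_account_email="workflow-starter@example.iam.gserviceaccount.com",
        ),
    ),
)

created = client.create_job(parent=parent, job=job)
print(f"Created scheduler job: {created.name}")
```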
Please see the other answer for a more comprehensive solution.
What you will have to do is publish an event to a Pub/Sub topic from Cloud Scheduler and then have a Cloud Function react to that event.
Here's a complete example of using a Cloud Function to trigger Dataproc:
How can I run create Dataproc cluster, run job, delete cluster from Cloud Function
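A stripped-down sketch of such a Pub/Sub-triggered Cloud Function, instantiating the workflow template mentioned above, might look like this (the project, region, and template names are placeholders, and google-api-python-client is assumed to be in the function's requirements):

```python
# Sketch of a background Cloud Function triggered by the Cloud Scheduler
# Pub/Sub message. Project, region, and template names are placeholders.
from googleapiclient.discovery import build

PROJECT_ID = "example"          # placeholder
REGION = "global"               # placeholder
TEMPLATE = "terasort-example"   # placeholder

def instantiate_workflow(event, context):
    """Entry point for the Pub/Sub-triggered function."""
    dataproc = build("dataproc", "v1")
    name = f"projects/{PROJECT_ID}/regions/{REGION}/workflowTemplates/{TEMPLATE}"
    # Fire-and-forget: the Workflow Template creates the cluster, runs the
    # job(s), and deletes the cluster when it finishes.
    dataproc.projects().regions().workflowTemplates().instantiate(
        name=name, body={}
    ).execute()
```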

GCP Dataflow, Dataproc, Bigtable

I am selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. I want to minimize service costs. I also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should I do?
A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
C!
Use Dataflow on Pub/Sub to transform your data and let it write rows into BigQuery. You can monitor the ETL pipeline straight from Dataflow and use Stackdriver on top; Stackdriver can also be used to trigger events and alerts.
Use autoscaling to minimize the number of manual actions. Basically, when this solution is set up correctly, it needs no work at all.
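For reference, here is a bare-bones Apache Beam (Python SDK) sketch of option C: a streaming pipeline that reads JSON from Pub/Sub and writes rows to BigQuery, leaving Dataflow's default autoscaling in place. The project, topic, bucket, and table names are placeholders.

```python
# Bare-bones streaming pipeline sketch for option C. All resource names are
# placeholders; Dataflow's default autoscaling is left untouched so worker
# count follows the input volume.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")       # placeholder topic
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",                    # placeholder table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```

You can then monitor the job's system lag in Stackdriver and alert on it, as the answer suggests.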

Debugging Google Cloud Dataflow VM Instances

I have a Google Cloud Dataflow streaming job that has a growing system lag. The lag started when I deployed new changes and has kept growing without subsiding. I see frequent GCs in the Stackdriver logs, which indicates an inefficiency/bug introduced by the newly deployed changes. I'd like to debug this further: what is the best way to debug the JVM on Dataflow instances?
I have tried enabling the Monitoring agent when launching the job, which gives me GC count/time, but that is not very useful for finding the source of the issue.

How to delete files from cloud storage after dataflow job completes

In GCP, I have a Dataflow job that copies files from Cloud Storage to BigQuery. I would like to delete these files once they are successfully inserted into BigQuery. Can someone provide pointers on how to achieve this, and also on how to trigger another job after the previous one has succeeded?
For these types of scenarios, it's normally recommended that you introduce a tool for scheduling and workload orchestration into your architecture. Google Cloud provides Cloud Composer, a managed version of Airflow, to solve exactly this use case. You can schedule a DAG (directed acyclic graph) in Composer to start your Dataflow job and then, on the success of a job run, execute additional tasks for file cleanup or to kick off the next process.
Example DAG
To get started I recommend checking out the Cloud Composer documentation as well as these Cloud Composer Examples which seem similar to your use case.
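For illustration, here is a rough sketch of what such a DAG could look like, using operators from the apache-airflow-providers-google package. The Dataflow template, bucket, prefix, and project names are placeholders, and your actual Dataflow job may be launched differently.

```python
# Sketch of a Composer/Airflow DAG: run the Dataflow job, then delete the
# source files from GCS only if the job succeeded. All resource names are
# placeholders; operators come from apache-airflow-providers-google.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator

with DAG(
    dag_id="gcs_to_bq_with_cleanup",       # hypothetical
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow_job",
        template="gs://my-bucket/templates/gcs_to_bq",   # placeholder template path
        project_id="my-project",                         # placeholder
        location="us-central1",                          # placeholder
        parameters={
            # template-specific parameters would go here
        },
    )

    delete_source_files = GCSDeleteObjectsOperator(
        task_id="delete_processed_files",
        bucket_name="my-input-bucket",                   # placeholder
        prefix="incoming/",                              # placeholder
    )

    # The cleanup task runs only after the Dataflow task succeeds; further
    # tasks can be chained downstream to kick off the next process.
    run_dataflow >> delete_source_files
```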