I currently have a PySpark job that is deployed on a DataProc cluster (1 master & 4 worker nodes with sufficient cores and memory). This job runs on millions of records and performs an expensive computation (Point in Polygon). I am able to successfully run this job by itself. However, I want to schedule the job to be run on the 7th of every month.
What I am looking for is the most efficient way to set up cron jobs on a DataProc cluster. I tried to read up on Cloud Scheduler, but it doesn't exactly explain how it can be used in conjunction with a DataProc cluster. It would be really helpful to see either an example of a cron job on DataProc or some documentation on DataProc working together with Scheduler.
Thanks in advance!
For scheduled Dataproc interactions (create cluster, submit job, wait for job, delete cluster, all while handling errors), Dataproc's Workflow Templates API is a better choice than trying to orchestrate these steps yourself. A key advantage is that Workflows are fire-and-forget: any clusters created for them are also deleted on completion.
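If you ever want to kick off a workflow template from code instead of the console, a minimal sketch with the Dataproc Python client library could look like this (the project, region, and template names are placeholders, not taken from the question):

```python
from google.cloud import dataproc_v1

# Placeholder names for illustration only.
region = "us-central1"
template_name = f"projects/my-project/regions/{region}/workflowTemplates/my-workflow"

# Regional templates use a regional endpoint; templates in "global" use the
# default dataproc.googleapis.com endpoint instead.
client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Fire-and-forget: the returned operation tracks cluster creation, job
# execution and cluster deletion. Call .result() only if you want to block.
operation = client.instantiate_workflow_template(name=template_name)
operation.result()
```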
If your Workflow Template is relatively simple, such that its parameters do not change between invocations, the simplest way to schedule it is with Cloud Scheduler. Cloud Functions are a good choice if you need to run a workflow in response to files in GCS or events in Pub/Sub. Finally, Cloud Composer is great if your workflow parameters are dynamic or there are other GCP products in the mix.
Assuming your use case is the simple one of running the same workflow every so often with the same parameters, I'll demonstrate using Cloud Scheduler:
I created a workflow in my project called terasort-example.
I then created a new Service Account in my project, workflow-starter@example.iam.gserviceaccount.com, and gave it the Dataproc Editor role; however, something more restricted with just dataproc.workflows.instantiate is also sufficient.
After enabling the Cloud Scheduler API, I headed over to Cloud Scheduler in the Developers Console and created a job as follows:
Target: HTTP
URL: https://dataproc.googleapis.com/v1/projects/example/regions/global/workflowTemplates/terasort-example:instantiate?alt=json
HTTP Method: POST
Body: {}
Auth Header: OAuth Token
Service Account: workflow-starter@example.iam.gserviceaccount.com
Scope: (left blank)
You can test it by clicking Run Now.
Note that you can also paste the entire workflow definition into the Body as a JSON payload; in that case the last part of the URL becomes workflowTemplates:instantiateInline?alt=json
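If you want to sanity-check the request before wiring up Cloud Scheduler, you can make the same call from your own machine with application default credentials. A rough sketch (the URL mirrors the job configuration above):

```python
import google.auth
from google.auth.transport.requests import AuthorizedSession

# Same endpoint the Cloud Scheduler job posts to (example project/template).
url = (
    "https://dataproc.googleapis.com/v1/projects/example/regions/global"
    "/workflowTemplates/terasort-example:instantiate?alt=json"
)

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
response = AuthorizedSession(credentials).post(url, json={})
print(response.status_code, response.json())
```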
Check out this official doc that discusses other scheduling options.
Please see the other answer for a more comprehensive solution.
What you will have to do is publish an event to a Pub/Sub topic from Cloud Scheduler and then have a Cloud Function react to that event.
Here's a complete example of using a Cloud Function to trigger Dataproc:
How can I run create Dataproc cluster, run job, delete cluster from Cloud Function
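As a rough illustration of that pattern (not the exact code from the linked answer; project, cluster, bucket and file names are assumptions), a Pub/Sub-triggered Cloud Function could submit the PySpark job like this:

```python
# main.py for a 1st-gen, Pub/Sub-triggered Cloud Function.
from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder


def trigger_dataproc(event, context):
    """Submits a PySpark job to an existing cluster when a message arrives."""
    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": "my-cluster"},                       # placeholder
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},  # placeholder
    }
    client.submit_job(request={"project_id": PROJECT, "region": REGION, "job": job})
```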
Related
I have a Cloud Scheduler job that triggers a Cloud Run Job, which parallelizes the payload into 3 tasks using the CLOUD_RUN_TASK_INDEX environment variable.
I would like to trigger another Cloud Run Job or Cloud Function after the previous Cloud Run Job finishes all 3 tasks. I looked in the Google documentation, but I could not find any reference on how to accomplish this. How can this be done?
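(For context, each task typically derives its shard from the built-in Cloud Run jobs environment variables, roughly like this sketch with a stand-in payload:)

```python
import os

# Environment variables Cloud Run jobs set for each task.
task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))

payload = list(range(30))                    # stand-in for the real payload
my_slice = payload[task_index::task_count]   # each task handles its own shard
print(f"task {task_index}/{task_count} processing {len(my_slice)} items")
```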
In your case, the most efficient approach (IMO) is to use Cloud Workflows with the Cloud Run jobs connector. The connector runs the job and waits for its completion, and then you can continue the workflow with the other tasks you want to execute.
Cloud Workflows also supports parallel execution.
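Roughly speaking, the connector pattern is "run the job, wait on the long-running operation, then continue". If you wanted to sketch that same run-and-wait chain yourself in Python with the Cloud Run Admin client (job names are placeholders), it might look like this:

```python
from google.cloud import run_v2

client = run_v2.JobsClient()

# Start the first job and block until its execution completes
# (the returned operation tracks the execution).
first = client.run_job(
    name="projects/my-project/locations/us-central1/jobs/first-job"
)
first.result()

# Only then kick off the follow-up job (or call a Cloud Function instead).
second = client.run_job(
    name="projects/my-project/locations/us-central1/jobs/second-job"
)
second.result()
```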
Another idea could be to use Eventarc to catch the audit log entry that marks the end of the job's processing, but I prefer the first solution.
I am using Google Cloud Dataflow to execute some resource-intensive Dataflow jobs, and at any given time my system must execute no more than 2 jobs in parallel.
Since each job is quite resource intensive, I am looking for a way to trigger an alert when more than 2 Dataflow jobs are running.
I tried implementing a custom counter that increments after the start of each job, but the custom counter is only reported after the job has executed, and by then it might be too late to trigger an alert.
You could modify the project's dataflow.googleapis.com/job_count quota to cap the number of concurrent jobs (set it to 2 in your case), so that a third job cannot run in parallel in that project. The quota is applied at the project level, so it would not affect other projects.
Another option is to use a monitoring system that observes the running Dataflow jobs. You can, for example, use Elastic Cloud (available via the Marketplace) to ingest all relevant metrics and logs; Elastic can visualize and alert on every state you are interested in.
I found this terraform project very helpful in order to get started with that approach.
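Whatever tool does the observing, the check itself essentially boils down to counting jobs in the ACTIVE state. As a minimal sketch (project and region are placeholders), a scheduled script or Cloud Function could poll the Dataflow API directly:

```python
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
response = (
    dataflow.projects()
    .locations()
    .jobs()
    .list(projectId="my-project", location="us-central1", filter="ACTIVE")
    .execute()
)

active_jobs = response.get("jobs", [])
if len(active_jobs) > 2:
    # Replace with your alerting of choice (Pub/Sub, email, log-based alert, ...).
    print(f"ALERT: {len(active_jobs)} Dataflow jobs are running")
```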
A Bigtable table can be backed up through GCP for up to 30 days.
(https://cloud.google.com/bigtable/docs/backups)
Is it possible to have a custom automatic backup policy?
i.e. trigger automatic backups every X days & keep up to 3 copies at a time.
As mentioned in the comment, the link provides a solution which involves the use of the following GCP Products:
Cloud Scheduler: trigger tasks with a cron-based schedule
Cloud Pub/Sub: pass the message request from Cloud Scheduler to Cloud Functions
Cloud Functions: initiate an operation for creating a Cloud Bigtable backup
Cloud Logging and Monitoring (optional).
Full guide can also be seen on GitHub.
This is a good solution because your specific requirement has to be handled with the client libraries: Bigtable doesn't have an API option that keeps only 3 copies at a time.
For simpler use cases, however, such as triggering automatic backups every X days, another option is to call backups.create directly from a Cloud Scheduler HTTP job, similar to what's done in this answer.
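As a rough sketch of what the Cloud Function (or any scheduled script) might do with the client library (instance, cluster and table names are placeholders), creating a backup and pruning down to the 3 most recent copies could look like this:

```python
from datetime import datetime, timedelta, timezone

from google.cloud import bigtable_admin_v2

client = bigtable_admin_v2.BigtableTableAdminClient()

# Placeholder resource names for illustration.
cluster = "projects/my-project/instances/my-instance/clusters/my-cluster"
table = "projects/my-project/instances/my-instance/tables/my-table"

# Create a backup that Bigtable will expire automatically after 30 days.
now = datetime.now(timezone.utc)
operation = client.create_backup(
    parent=cluster,
    backup_id=now.strftime("backup-%Y%m%d-%H%M%S"),
    backup={"source_table": table, "expire_time": now + timedelta(days=30)},
)
operation.result()  # wait for the backup to finish

# Keep only the 3 most recent backups of this table.
backups = [b for b in client.list_backups(parent=cluster) if b.source_table == table]
backups.sort(key=lambda b: b.start_time, reverse=True)
for old in backups[3:]:
    client.delete_backup(name=old.name)
```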
Here is another thought on a solution:
Instead of using three GCP products, if you are already using k8s or GKE you can replace all this functionality with a k8s CronJob. Put the Bigtable API calls in a container and deploy it on a schedule using the CronJob.
In my opinion, it is a simpler solution if you are already using Kubernetes.
I am currently studying for the GCP Data Engineer exam and have struggled to understand when to use Cloud Scheduler and when to use Cloud Composer.
From reading the docs, I have the impression that Cloud Composer should be used when there are interdependencies between jobs, e.g. we need the output of one job to start another whenever the first finishes, using dependencies coming from the first job. You can then chain as many of these "workflows" as you want, as well as having the opportunity to restart failed jobs, run batch jobs, run shell scripts, chain queries, and so on.
Cloud Scheduler has very similar capabilities in terms of what tasks it can execute; however, it is used more for regular jobs that you execute at fixed intervals, and not necessarily when you have interdependencies between jobs or need to wait for one job to finish before starting another. It therefore seems more tailored to "simpler" tasks.
These thoughts came after attempting to answer some exam questions I found. However, I was surprised by the "correct answers" given, and was hoping someone could clarify whether those answers are correct and whether I have understood when to use one over the other.
Here are the example questions that confused me in regards to this topic:
Question 1
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times.
Which service should you use to manage the execution of these jobs?
A. Cloud Scheduler
B. Cloud Dataflow
C. Cloud Functions
D. Cloud Composer
Correct Answer: A
Question 2
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day.
Which tool should you use?
A. cron
B. Cloud Composer
C. Cloud Scheduler
D. Workflow Templates on Cloud Dataproc
Correct Answer: D
Question 3
Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers.
Which cloud-native service should you use to orchestrate the entire pipeline?
A. Cloud Dataflow
B. Cloud Composer
C. Cloud Dataprep
D. Cloud Dataproc
Correct Answer: D
Any insight on this would be greatly appreciated. Thank you!
Your assumptions are correct: Cloud Composer is a managed Apache Airflow service and serves well when orchestrating interdependent pipelines, while Cloud Scheduler is just a managed cron service.
I don't know where you got these questions and answers, but I assure you (I got the GCP Data Engineer certification myself just last month) that the correct answer would be Cloud Composer for every one of them. Just ignore those supposed correct answers and move on.
Cloud Scheduler is essentially Cron-as-a-service. All you need is to enter a schedule and an endpoint (Pub/Sub topic, HTTP, App Engine route). Cloud Scheduler has built-in retry handling, so you can set a fixed number of retries, and you can configure the attempt deadline for requests. Its functionality is much simpler than Cloud Composer's.
Cloud Composer is managed Apache Airflow that "helps you create, schedule, monitor and manage workflows. Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command line tools, so you can focus on your workflows and not your infrastructure."(https://cloud.google.com/composer/docs/)
Airflow is aimed at data pipelines with all the needed tooling.
In GCP, I have a Dataflow job that copies files from Cloud Storage to BigQuery. I would like to delete these files once they are successfully inserted into BigQuery. Can someone provide pointers on how to achieve this, and also on how to trigger another job after the previous one has succeeded?
For these types of scenarios, it's normally recommended that you introduce a tool for scheduling and workload orchestration into your architecture. Google Cloud provides Cloud Composer, a managed version of Airflow, to solve exactly this use case. You can schedule a DAG (directed acyclic graph) in Composer to start your Dataflow job and then, on the success of that run, execute additional tasks for file cleanup or to kick off the next process.
Example DAG
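A minimal sketch of such a DAG, assuming a templated Dataflow job followed by a cleanup of the source prefix (all names, paths and the schedule are placeholders, not taken from your setup):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.operators.gcs import GCSDeleteObjectsOperator

with DAG(
    dag_id="gcs_to_bq_with_cleanup",   # placeholder
    schedule_interval="@daily",        # placeholder
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_dataflow = DataflowTemplatedJobStartOperator(
        task_id="run_dataflow_job",
        template="gs://example-bucket/templates/gcs_to_bq",  # placeholder template
        location="us-central1",
    )

    cleanup_files = GCSDeleteObjectsOperator(
        task_id="delete_processed_files",
        bucket_name="example-bucket",  # placeholder
        prefix="incoming/",            # placeholder
    )

    # The cleanup task only runs if the Dataflow task succeeds.
    run_dataflow >> cleanup_files
```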
To get started I recommend checking out the Cloud Composer documentation as well as these Cloud Composer Examples which seem similar to your use case.