How to schedule Spark jobs on Google Dataproc? - google-cloud-platform

I want to create an ingestion/aggregation flow on Google Cloud using Dataproc, where once a day/hour I want a Spark job to run on the data collected till then.
Is there any way to schedule the Spark jobs? Or to make this trigger-based, e.g. run on any new data event arriving in the flow?

Dataproc Workflow + Cloud Scheduler might be a solution for you. It supports exactly what you described, e.g. running a flow of jobs on a daily basis.
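If it helps, here is a minimal sketch of that combination, assuming a workflow template (hypothetically named daily-spark-ingest) already contains your Spark job: Cloud Scheduler publishes to a Pub/Sub topic on your cron schedule, and a small Cloud Function subscribed to that topic instantiates the template. The project, region, and template IDs below are placeholders, not anything from the question.

    from google.cloud import dataproc_v1

    PROJECT = "my-project"            # placeholder project ID
    REGION = "us-central1"            # placeholder region
    TEMPLATE = "daily-spark-ingest"   # assumed workflow template containing the Spark job

    def trigger_workflow(event, context):
        """Pub/Sub-triggered Cloud Function; Cloud Scheduler publishes the message."""
        # Workflow templates are served from a regional endpoint.
        client = dataproc_v1.WorkflowTemplateServiceClient(
            client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
        )
        name = f"projects/{PROJECT}/regions/{REGION}/workflowTemplates/{TEMPLATE}"
        # Fire and forget: the workflow (cluster creation + Spark job) runs server-side.
        client.instantiate_workflow_template(request={"name": name})

Cloud Scheduler can also call the Dataproc workflowTemplates.instantiate endpoint directly via an HTTP target; the Cloud Function indirection above is just the more flexible variant.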

Related

Is it Possible to Build a REST API interface on top of the Spark Cluster?

Essentially, we are running a batch ML model using a Spark EMR cluster on AWS. There will be several iterations of the model, so we want to have some sort of model metadata endpoint on top of the Spark cluster. That way, other services that rely on the output of the EMR cluster can ping the Spark cluster's REST API endpoint and be informed of the latest ML system version it is using. I'm not sure if this is feasible or not.
Objective:
We want other services to be able to ping the EMR cluster which runs the latest ML model and obtain the metadata for the model, which includes ML system version.
If I have understood correctly, you want to add metadata (e.g., version, last updated, action performed, etc.) somewhere once the Spark job is finished, right?
There are several possibilities, and all of them integrate into your data pipeline in the same way as any other task, for example triggering the Spark job with a workflow management tool (Airflow/Luigi), a Lambda function, or even cron.
Updating metadata after the Spark job runs
So for the post-Spark-job step, you can add something to your pipeline that writes this metadata to some DB or event store. I am sharing two options, and you can decide which one is more feasible:
Utilize CloudWatch Events and associate a Lambda function with the event; Amazon EMR automatically sends events to a CloudWatch Events stream (a sketch of this option follows the list).
Add a step in your workflow management tool (Airflow/Luigi) that triggers a DB/event-store update step/operator on completion of the EMR step (e.g., using the EmrStepSensor in Airflow to gate the next step that writes to the DB).
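A rough sketch of option 1, assuming a DynamoDB table named ml_model_metadata and that the model version can be derived from the step name; all of these names are placeholders rather than anything from the question:

    # Hypothetical Lambda handler: a CloudWatch Events / EventBridge rule matched on
    # "EMR Step Status Change" invokes this function, which records the model metadata
    # in a DynamoDB table.
    from datetime import datetime, timezone
    import boto3

    TABLE = boto3.resource("dynamodb").Table("ml_model_metadata")  # assumed table

    def handler(event, context):
        detail = event.get("detail", {})
        if detail.get("state") != "COMPLETED":   # only record successful steps
            return
        TABLE.put_item(Item={
            "model_id": "latest",                                 # single "latest" record
            "cluster_id": detail.get("clusterId", "unknown"),
            "step_name": detail.get("name", "unknown"),
            "model_version": detail.get("name", "unversioned"),   # e.g. encode the version in the step name
            "updated_at": datetime.now(timezone.utc).isoformat(),
        })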
For a REST API on top of the DB/event store
Now, once you have a regular update mechanism in place for every EMR Spark step run, you can build a normal REST API on EC2 or a serverless API using AWS Lambda. You will essentially be returning this metadata from the REST service.
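And a matching read side, hypothetically a Lambda behind API Gateway (or a Lambda function URL) that other services can ping for the latest version; it assumes the same ml_model_metadata table as the sketch above:

    import json
    import boto3

    TABLE = boto3.resource("dynamodb").Table("ml_model_metadata")  # assumed table

    def handler(event, context):
        # Fetch the single "latest" metadata record written by the EMR-event Lambda.
        item = TABLE.get_item(Key={"model_id": "latest"}).get("Item")
        if item is None:
            return {"statusCode": 404, "body": json.dumps({"error": "no metadata recorded yet"})}
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(item),
        }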

Can we trigger Dataflow jobs only once from a Cloud Function?

My BigQuery table is being loaded by multiple batch jobs, which creates multiple log entries in Stackdriver Logging. I want to use these log entries to trigger a Dataflow job by sending them to a Pub/Sub topic and then using a Cloud Function to run the Dataflow job.
The issue: we have more than one batch job, and therefore more than one log entry going to the Pub/Sub topic, and the Cloud Function tries to run the same Dataflow job again and again.
I need a solution that runs the Dataflow job once, when all the batch jobs have finished loading data into the BigQuery table. (The number of batch jobs is not constant.)

Create jobs and schedule them on BigQuery

I'm new to BigQuery and need to do some tests on it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I found on another page on the internet that the only available method is to create a bucket in Google Cloud Storage and create a function in Cloud Functions using JavaScript, and write the SQL query inside its body.
Can someone help me here? Is it true?
Your question is a bit confusing, as you mix scheduling jobs with defining a query in a Cloud Function.
There is a difference in scheduling jobs vs scheduling queries.
BigQuery offers Scheduled queries. See docs here.
BigQuery Data Transfer Service (schedules recurring data loads from GCS). See docs here.
If you want to schedule jobs (load, delete, copy jobs, etc.), you're better off doing this with a trigger on the observed resource, like a new Cloud Storage file, a Pub/Sub message, or an HTTP trigger, all wired into a Cloud Function.
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions
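As an illustration of the trigger-based approach above, here is a minimal sketch of a Cloud Function fired by a Cloud Storage finalize event that starts a BigQuery load job for the new file; the dataset/table ID and CSV settings are placeholders:

    from google.cloud import bigquery

    def load_new_file(event, context):
        """Background Cloud Function triggered by google.storage.object.finalize."""
        client = bigquery.Client()
        uri = f"gs://{event['bucket']}/{event['name']}"  # the object that triggered the function
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,  # assumed CSV input
            skip_leading_rows=1,
            autodetect=True,
        )
        # Start the load job and wait for it to finish.
        load_job = client.load_table_from_uri(uri, "my_dataset.my_table", job_config=job_config)
        load_job.result()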

How to execute an SQL script stored in S3 other than with the Data Pipeline service

We have been working on leveraging AWS services for scheduling a few SQL scripts on a daily basis. Data Pipeline is a good option, but we have found issues with the underlying support system, the Task Runner. Are there any other options we can look at? Lambda has a limitation of 300 seconds, and the query we are using will exceed 5 minutes. Any suggestions/workarounds are much appreciated!
You're on the right path. Use Lambda just to kick off the job, not to do the actual workload.
For example, pack your app in a Docker container, push it to ECR, and use a Lambda to periodically submit an AWS Batch job.
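As a sketch (the job name, queue, and job definition below are placeholders), the Lambda body can be as small as a single submit_job call via boto3; the long-running SQL work happens in the containerised Batch job, not inside Lambda:

    import boto3

    batch = boto3.client("batch")

    def handler(event, context):
        # Invoked by a scheduled CloudWatch Events / EventBridge rule.
        response = batch.submit_job(
            jobName="daily-sql-run",
            jobQueue="sql-scripts-queue",        # assumed Batch job queue
            jobDefinition="run-sql-script:1",    # assumed job definition (the Docker image pushed to ECR)
        )
        print("Submitted Batch job", response["jobId"])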

Can AWS Athena queries be run periodically (i.e., on a schedule)?

Is there any support for running Athena queries on a schedule? We want to query some data daily, and dump a summarized CSV file, but it would be best if this happened on an automated schedule.
Schedule an AWS Lambda task to kick this off, or use a cron job on one of your servers.
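For the Lambda route, a rough sketch (the database, query, and output bucket are placeholders): a scheduled CloudWatch Events / EventBridge rule invokes the function, which starts the Athena query, and Athena writes the result CSV to the given S3 output location.

    import boto3

    athena = boto3.client("athena")

    def handler(event, context):
        # Kick off the daily summary query; Athena runs it asynchronously.
        response = athena.start_query_execution(
            QueryString="SELECT day, count(*) AS events FROM my_table GROUP BY day",
            QueryExecutionContext={"Database": "my_database"},
            ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-daily-summary/"},
        )
        print("Started query", response["QueryExecutionId"])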