My BigQuery table is loaded by multiple batch jobs, and each load creates a log entry in Stackdriver Logging. I want to use these log entries to trigger a Dataflow job by exporting them to a Pub/Sub topic and then using a Cloud Function to launch the Dataflow job.
The problem: because there is more than one batch job, more than one log entry reaches the Pub/Sub topic, and the Cloud Function tries to run the same Dataflow job again and again.
I need a solution that runs the Dataflow job once, after all the batch jobs have finished loading data into the BigQuery table. (The number of batch jobs is not constant.)
Related
Currently, we have the following AWS setup for executing Glue jobs. An S3 event triggers a Lambda function whose Python logic triggers 10 AWS Glue jobs.
S3 -> Trigger -> Lambda -> 1 or more Glue Jobs.
With this setup, we see that at a time, multiple different Glue jobs run in parallel. How can I make it so that at any point in time, only one Glue job runs? And any Glue jobs sent for execution wait in a queue until the currently running Glue job is finished?
You can use Step Functions: define each job you want to run as a step, so you control the order in which the jobs run. Once step one completes, the jobs in step two are called, and so on.
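As a rough sketch of the Step Functions idea, the Python below builds an Amazon States Language definition that chains Glue jobs with the `glue:startJobRun.sync` integration, so each step waits for its job run to finish before the next one starts. The job names `job-a` and `job-b` and the `StepN-` state names are hypothetical placeholders, not names from the question.

```python
import json


def sequential_glue_definition(job_names):
    """Build a state machine definition that runs Glue jobs one after another."""
    if not job_names:
        raise ValueError("at least one Glue job name is required")

    # The step index is part of each state name, so repeated job names do not collide.
    state_names = [f"Step{i + 1}-{job}" for i, job in enumerate(job_names)]
    states = {}
    for i, job in enumerate(job_names):
        state = {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for the job run to complete.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job},
        }
        if i + 1 < len(job_names):
            state["Next"] = state_names[i + 1]
        else:
            state["End"] = True
        states[state_names[i]] = state

    return {"StartAt": state_names[0], "States": states}


if __name__ == "__main__":
    # Hypothetical job names, for illustration only.
    print(json.dumps(sequential_glue_definition(["job-a", "job-b"]), indent=2))
```

The printed JSON could be used as the definition when creating the state machine. Note that this only orders the jobs inside one execution; separate executions started by separate S3 events can still overlap unless something else serializes them.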
If you want a job queue so that the Glue jobs are triggered in sequence, you could consider a combination of SQS -> Lambda -> Glue jobs. Please refer to this SO answer for details.
AWS Step Functions is another option, as suggested by Vaquar Khan.
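A minimal sketch of the SQS -> Lambda -> Glue idea, under several assumptions that are not in the original answer: the queue message body is JSON with a hypothetical `job_name` field, the Lambda is configured with an SQS batch size of 1 and a reserved concurrency of 1, and `SERIALIZED_JOBS` lists the (hypothetical) jobs that must not overlap. If any of those jobs is still active, the handler raises, so the message is not deleted and is retried after the queue's visibility timeout.

```python
import json

# Glue job run states that mean "a run is still in progress".
ACTIVE_STATES = {"STARTING", "RUNNING", "STOPPING", "WAITING"}

# Hypothetical job names; replace with the Glue jobs that must not overlap.
SERIALIZED_JOBS = ["glue-job-1", "glue-job-2"]


class GlueBusy(Exception):
    """Raised so the SQS message stays on the queue and is retried later."""


def any_job_active(glue, job_names):
    """Return True if any listed job has a run in an active state.

    Only the first page of runs per job is inspected (an assumption that
    the active run, if any, is among the most recent ones).
    """
    for name in job_names:
        runs = glue.get_job_runs(JobName=name, MaxResults=25)["JobRuns"]
        if any(run["JobRunState"] in ACTIVE_STATES for run in runs):
            return True
    return False


def handle_records(glue, records):
    """Start the requested Glue job for each record, one at a time."""
    started = []
    for record in records:
        job_name = json.loads(record["body"])["job_name"]
        if any_job_active(glue, SERIALIZED_JOBS):
            raise GlueBusy(f"a Glue job is still running; {job_name} will be retried")
        started.append(glue.start_job_run(JobName=job_name)["JobRunId"])
    return started


def lambda_handler(event, context):
    # boto3 is imported here so the helpers above can be tested without it.
    import boto3

    return handle_records(boto3.client("glue"), event["Records"])
```

With a larger batch size, raising partway through would cause already-started jobs to be retried, which is why the sketch assumes one message per invocation.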
I deployed a Dataflow job with Terraform, using the Google-provided Pub/Sub to BigQuery template. Pub/Sub is in one project, while Dataflow and BigQuery are in another. The Dataflow job is created, Compute Engine scales, the subscriptions get created, and the service account has all possible permissions to run the Dataflow job, plus Pub/Sub and Service Account User permissions in the project where Pub/Sub lives. The Pipeline API is enabled. The Dataflow job has the status Running, the BigQuery tables are created, and the table schemas match the message schema. The only problem is that Dataflow doesn't read messages from Pub/Sub. One possible clue: when I open Pipelines (within Dataflow), I see nothing, and the temp location specified in the Terraform code is not created. The service account has Cloud Storage Admin permissions, so this is another indication that the Dataflow job (pipeline) just doesn't start the stream. Any suggestions? Has anybody had a similar issue?
I want to create an ingestion/aggregation flow on Google Cloud using Dataproc, where once a day/hour I want a Spark job to run on the data collected till then.
Is there any way to schedule the Spark jobs? Or to make this trigger-based, e.g. run whenever a new data event arrives in the flow?
Dataproc Workflow + Cloud Scheduler might be a solution for you. It supports exactly what you described, e.g. running a flow of jobs on a daily basis.
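One way this combination might be wired up is a Cloud Scheduler HTTP job that calls the Dataproc `workflowTemplates.instantiate` REST endpoint on a cron schedule. The sketch below only builds the job description as a Python dict in the shape of the Cloud Scheduler REST resource, as I understand it. The project, region, template name and service account are hypothetical, and the account would need permission to instantiate the template.

```python
import base64
import json


def scheduler_job_for_workflow(project, region, template, service_account,
                               schedule="0 * * * *"):
    """Describe a Cloud Scheduler job that instantiates a Dataproc workflow template."""
    uri = (
        f"https://dataproc.googleapis.com/v1/projects/{project}"
        f"/regions/{region}/workflowTemplates/{template}:instantiate"
    )
    return {
        # Unix cron format; the default here means "at minute 0 of every hour".
        "schedule": schedule,
        "timeZone": "Etc/UTC",
        "httpTarget": {
            "uri": uri,
            "httpMethod": "POST",
            # Bytes fields are base64-encoded in the REST representation;
            # this is an empty JSON object as the request body.
            "body": base64.b64encode(b"{}").decode("ascii"),
            # Cloud Scheduler attaches an OAuth token for this account.
            "oauthToken": {"serviceAccountEmail": service_account},
        },
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    job = scheduler_job_for_workflow(
        "my-project", "us-central1", "daily-aggregation",
        "scheduler@my-project.iam.gserviceaccount.com",
        schedule="0 3 * * *",
    )
    print(json.dumps(job, indent=2))
```

A schedule of `0 3 * * *` would run the workflow once a day at 03:00 in the configured time zone.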
I am trying to populate as many Glue job metrics as possible for some testing. Below is the setup I have created:
A crawler reads data (dummy customer data of 500 rows) from a CSV file placed in an S3 bucket.
Another crawler crawls the tables created in the Redshift cluster.
An ETL job finally reads the data from the CSV file in S3 and writes it into a Redshift table.
The job runs without any issue and I can see the final data being written to the Redshift table. However, in the end, only the following 5 CloudWatch metrics are populated:
glue.jvm.heap.usage
glue.jvm.heap.used
glue.s3.filesystem.read_bytes
glue.s3.filesystem.write_bytes
glue.system.cpuSystemLoad
There are approximately 20 more metrics which are not getting populated.
Any suggestions on how to populate those remaining metrics as well?
Double-check that CloudWatch metrics are enabled for your job.
Make sure your job runs long enough, say > 3 minutes, so that CloudWatch has time to push the metrics.
For this you can add a sleep in your code.
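A small sketch of the sleep suggestion: rather than a fixed sleep, pad the script so that its total runtime reaches a minimum. The 180-second value comes from the "> 3 minutes" figure above; the helper name `pad_runtime` is made up for this example.

```python
import time

MIN_RUNTIME_SECONDS = 180  # the "> 3 minutes" suggested above


def pad_runtime(started_at, min_seconds=MIN_RUNTIME_SECONDS,
                now=time.monotonic, sleep=time.sleep):
    """Sleep just long enough for the total runtime to reach min_seconds.

    Returns the number of seconds slept (0.0 if the job already ran long enough).
    """
    remaining = min_seconds - (now() - started_at)
    if remaining > 0:
        sleep(remaining)
    return max(remaining, 0.0)


# Usage inside the Glue script (sketch):
#   started_at = time.monotonic()   # first line of the script
#   ... ETL logic ...
#   pad_runtime(started_at)         # last line of the script
```

The `now` and `sleep` parameters exist only so the helper can be tested without actually waiting.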
Assuming that you are using Glue version 2.0+ for the above job, please be advised that AWS Glue version 2.0+ does not use dynamic allocation, hence the ExecutorAllocationManager metrics are not available. Fall back to Glue 1.0 and you should be able to confirm that all the documented metrics are available.
I met the same issue. Do your glue.s3.filesystem.read_bytes and glue.s3.filesystem.write_bytes metrics have any data?
One possible reason is that AWS Glue job metrics are not emitted if the job completes in less than 30 seconds.
When running the job, enable the metrics option under the Monitoring tab.
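If the job is started from code rather than from the console, the same option can be passed as the `--enable-metrics` special job parameter. A minimal sketch with boto3, where the job name is a hypothetical placeholder:

```python
def start_job_with_metrics(glue, job_name, extra_args=None):
    """Start a Glue job run with job metrics enabled and return the run id."""
    args = dict(extra_args or {})
    # The presence of the key is what enables metrics; "true" is used for clarity.
    args["--enable-metrics"] = "true"
    return glue.start_job_run(JobName=job_name, Arguments=args)["JobRunId"]


# Usage (requires boto3 and AWS credentials; the job name is hypothetical):
#   import boto3
#   run_id = start_job_with_metrics(boto3.client("glue"), "csv-to-redshift")
```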
I'm new to BigQuery and need to run some tests on it. Looking through the BigQuery documentation, I can't find anything about creating jobs and scheduling them.
I read on another page that the only available method is to create a bucket in Google Cloud Storage and a function in Cloud Functions using JavaScript, and to write the SQL query inside its body.
Can someone help me here? Is it true?
Your question is a bit confusing, as you mix scheduling jobs with defining a query in a Cloud Function.
There is a difference between scheduling jobs and scheduling queries.
BigQuery offers Scheduled queries. See docs here.
BigQuery Data Transfer Service (schedule recurring data loads from GCS). See docs here.
If you want to schedule jobs (load, delete, copy jobs, etc.), it is better to do this with a trigger on the observed resource, such as a new Cloud Storage file, a Pub/Sub message, or an HTTP trigger, all wired into a Cloud Function.
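To illustrate the trigger-based option, here is a minimal sketch of a Python Cloud Function with a Cloud Storage trigger that starts a BigQuery load job for each new file. It assumes the `google-cloud-bigquery` client library, CSV files with a header row, and a hypothetical destination table `my-project.my_dataset.my_table`. Cloud Functions also support Python, so JavaScript is not the only choice.

```python
# Hypothetical destination table; replace with your own.
TABLE_ID = "my-project.my_dataset.my_table"


def load_file(client, job_config, bucket, name, table_id=TABLE_ID):
    """Start a BigQuery load job for one Cloud Storage object and wait for it."""
    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # blocks until the load job finishes; raises on failure
    return uri


def gcs_to_bigquery(event, context):
    """Background Cloud Function entry point for a new-object event."""
    # Imported here so load_file can be tested without the client library.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes a header row
        autodetect=True,      # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_file(bigquery.Client(), job_config, event["bucket"], event["name"])
```

For queries on a fixed schedule, the Scheduled queries feature mentioned above is usually simpler than a function.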
Some other related blog posts:
How to schedule a BigQuery ETL job with Dataprep
Scheduling BigQuery Jobs: This time using Cloud Storage & Cloud Functions