I am writing an app to monitor and view Google Dataflow jobs.
To get the metadata about Google Dataflow jobs, I am exploring the REST APIs listed here:
https://developers.google.com/apis-explorer/#search/dataflow/dataflow/v1b3/
I was wondering if there are any APIs that could do the following :
1) Get the job details if we provide a list of job IDs (there is an API for one individual job ID, but I wanted the same for a list of IDs)
2) Search or filter jobs on the basis of job name, or, for that matter, filter jobs by any criteria other than the job state.
3) Get log messages associated with a Dataflow job
4) Get the records of "all" jobs, from the beginning of time. The current APIs seem to give records only of jobs in the last 30 days.
Any help would be greatly appreciated. Thank You
There is additional documentation about the Dataflow REST API at: https://cloud.google.com/dataflow/docs/reference/rest/
Addressing each of your questions separately:
1) Get the job details if we provide a list of job IDs (there is an API for one individual job ID, but I wanted the same for a list of IDs)
No, there is no batch method for a list of jobs. You'll need to query them individually with projects.jobs.get.
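As a minimal sketch (not an official batch API), you can simply loop over your job IDs and call projects.jobs.get for each one. This assumes the google-api-python-client package and application-default credentials; the project and job IDs below are placeholders.

```python
# Not a batch API, just a loop over projects.jobs.get.
from googleapiclient.discovery import build

PROJECT_ID = "my-project"            # placeholder
JOB_IDS = ["job-id-1", "job-id-2"]   # placeholder list of job IDs

dataflow = build("dataflow", "v1b3")

jobs = []
for job_id in JOB_IDS:
    # projects.jobs.get returns the metadata for a single job
    job = dataflow.projects().jobs().get(projectId=PROJECT_ID, jobId=job_id).execute()
    jobs.append(job)

for job in jobs:
    print(job["id"], job.get("name"), job.get("currentState"))
```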
2) Search or filter jobs on the basis of job name, or, for that matter, filter jobs by any criteria other than the job state.
The only other filter currently available is location.
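Since name is not a supported filter, one workaround (not an official feature) is to list the jobs with projects.jobs.list and filter by name on the client side. A minimal sketch, assuming google-api-python-client; the project ID and name fragment are placeholders.

```python
# Client-side name filtering over projects.jobs.list results.
from googleapiclient.discovery import build

PROJECT_ID = "my-project"      # placeholder
NAME_SUBSTRING = "daily-load"  # placeholder name fragment to match

dataflow = build("dataflow", "v1b3")

request = dataflow.projects().jobs().list(projectId=PROJECT_ID, view="JOB_VIEW_SUMMARY")
matches = []
while request is not None:
    response = request.execute()
    matches.extend(
        job for job in response.get("jobs", [])
        if NAME_SUBSTRING in job.get("name", "")
    )
    # follow pagination until all pages are consumed
    request = dataflow.projects().jobs().list_next(request, response)

for job in matches:
    print(job["id"], job["name"], job.get("currentState"))
```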
3) Get log messages associated with a Dataflow job
In Dataflow there are two types of log messages:
"Job Logs" are generated by the Dataflow service and provide high-level information about the overall job execution. These are available via the projects.jobs.messages.list API.
There are also "Worker Logs" written by the SDK and user code running in the pipeline. These are generated on the distributed VMs associated with a pipeline and ingested into Stackdriver. They can be queried via the Stackdriver Logging entries.list API by including in your filter:
resource.type="dataflow_step"
resource.labels.job_id="<YOUR JOB ID>"
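For example, a minimal sketch of pulling Worker Logs with the google-cloud-logging client library, using the filter shown above; the project and job ID are placeholders.

```python
# Query Stackdriver/Cloud Logging for a Dataflow job's worker logs.
from google.cloud import logging

PROJECT_ID = "my-project"  # placeholder
JOB_ID = "my-job-id"       # placeholder

client = logging.Client(project=PROJECT_ID)

log_filter = (
    'resource.type="dataflow_step" '
    f'resource.labels.job_id="{JOB_ID}"'
)

for entry in client.list_entries(filter_=log_filter, page_size=100):
    print(entry.timestamp, entry.severity, entry.payload)
```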
4) Get the records of "all" jobs, from the beginning of time. The current APIs seem to give records only of jobs in the last 30 days.
Dataflow jobs are only retained by the service for 30 days. Older jobs are deleted and thus not available in the UI or APIs.
In our case, we implemented this functionality by tracking the job stages and using schedulers/cron jobs to write the details of the running jobs to a file. That file, stored in a bucket, is watched by another job of ours, which reports all the statuses to our application.
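A rough sketch of that workaround: a script run on a schedule (cron, Cloud Scheduler, etc.) that snapshots the current job metadata into a file in a GCS bucket, so the history survives beyond the 30-day retention window. The project and bucket names are placeholders.

```python
# Snapshot current Dataflow job metadata to a JSON file in a GCS bucket.
import json
from datetime import datetime, timezone

from google.cloud import storage
from googleapiclient.discovery import build

PROJECT_ID = "my-project"         # placeholder
BUCKET = "my-job-history-bucket"  # placeholder

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().jobs().list(
    projectId=PROJECT_ID, view="JOB_VIEW_SUMMARY"
).execute()

snapshot = {
    "taken_at": datetime.now(timezone.utc).isoformat(),
    "jobs": [
        {"id": j["id"], "name": j.get("name"), "state": j.get("currentState")}
        for j in response.get("jobs", [])
    ],
}

blob_name = f"dataflow-jobs/{snapshot['taken_at']}.json"
storage.Client(project=PROJECT_ID).bucket(BUCKET).blob(blob_name).upload_from_string(
    json.dumps(snapshot), content_type="application/json"
)
```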
I'm trying to establish alerts at a project level that send an email when a certain level of query/job concurrency is reached, e.g. 5 concurrent queries. We have a flat-rate pricing model.
I want to set up a similar email notification for when total slot usage exceeds a certain threshold as well, e.g. slot usage reaching 1,000 slots.
As a next step I would like to throttle new incoming queries based on the above-mentioned thresholds. Meaning, if there are already, for example, 5 queries actively running, the 6th one will be put on hold until one of the 5 started earlier has completed.
You may create an Alert Policy in which you set your desired metric type (e.g. slots) and then configure your desired threshold.
When creating an Alert Policy you can also set the notification channel to email, which is covered in the same documentation.
For the available metric types for slots in BigQuery, you may refer to the Google Cloud Metrics for BigQuery documentation.
For your next step, you may write code (Python, Node.js, etc.) that uses the BigQuery API to count the queries actively running (through their job IDs); when the count hits 5, print "query queue is full" and wait for the number of running jobs to drop below 5 before submitting the next query. You may refer to the BigQuery Managing Jobs API documentation.
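A minimal sketch of that throttling idea, using the google-cloud-bigquery client library; the project ID, concurrency threshold, and polling interval are placeholders.

```python
# Hold new queries until the number of running jobs drops below a threshold.
import time
from google.cloud import bigquery

PROJECT_ID = "my-project"  # placeholder
MAX_CONCURRENT = 5         # placeholder concurrency threshold

client = bigquery.Client(project=PROJECT_ID)

def count_running_jobs():
    # state_filter="running" returns only jobs that are still executing
    return sum(1 for _ in client.list_jobs(state_filter="running", all_users=True))

def submit_when_below_threshold(sql):
    while count_running_jobs() >= MAX_CONCURRENT:
        print("query queue is full")
        time.sleep(30)  # wait before checking again
    return client.query(sql)

job = submit_when_below_threshold("SELECT 1")
print(job.job_id)
```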
I am working on a pipeline that takes data and does some partitioning on it. I am trying to load some data into a BigQuery table on GCP, but I got "Too many partitions produced by query, allowed 4000, query produces at least 10000 partitions". I understand that this is a limitation of BigQuery, and I have found several proposed solutions, such as clustering the data or partitioning by week instead of day. The problem is that I have no visibility into the data itself, so I cannot do this. If there are any other ideas, please help.
Also, for the sake of investigation and analysis, how can I find out how many BigQuery jobs were submitted? Is there a way to get the number of BigQuery jobs submitted by a specific Dataflow job?
Thanks
You can view the BigQuery jobs created by a particular Dataflow job by navigating to the Google Cloud Console and clicking through to the Dataflow Job UI. Here is the relevant documentation with screenshots.
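As a supplementary, hedged sketch: you can also count recent BigQuery jobs programmatically via INFORMATION_SCHEMA.JOBS_BY_PROJECT (adjust the region qualifier to your location). Attributing the count to one specific Dataflow job still relies on the Dataflow Job UI or on labels/naming conventions you control; the project below is a placeholder.

```python
# Count BigQuery jobs submitted in the last 24 hours for a project.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

sql = """
SELECT COUNT(*) AS bq_jobs
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

row = list(client.query(sql).result())[0]
print("BigQuery jobs submitted in the last 24 hours:", row.bq_jobs)
```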
We need your guidance on the dataflow design for the below scenario.
Requirement:
We need to build a Dataflow job that reads from an MS SQL database and writes to BigQuery.
We need the Dataflow job to take "the list of table names" (source and target table names) as input, telling it which tables to read from and write to.
Question:
On a daily schedule, would it be possible for a single Dataflow job to take the list of tables (i.e. 50 table names) as input and copy data from source to target, or should this be designed as 50 independent Dataflow jobs?
Would Dataflow automatically adjust the number of workers, without bringing down the source MS SQL server?
Key Info:
Source: MS SQL database
Target: BigQuery
No. of tables: 50
Schedule: Every day, say 8 AM
Write Disposition: Write Truncate (or Write Append)
You have to create a Dataflow template to be able to trigger it on a schedule. In that template, you have to define an input variable in which you can pass your table list.
Then, in the same Dataflow job, you can have 50 independent pipelines, each reading one table and sinking the data into BigQuery. You can't run 50 Dataflow jobs in parallel because of quotas (a limit of 25 concurrent jobs per project). In addition, it would be less cost efficient.
Indeed, Dataflow is able to run different pipelines in parallel on the same worker (in different threads) and to scale the cluster size up and down according to the workload requirements.
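A rough sketch of that "one job, many independent pipelines" idea in the Beam Python SDK. Assumptions: the table list arrives as a --tables option, MS SQL is read through Beam's cross-language JDBC connector, and the JDBC URL, credentials, and target dataset naming are placeholders. Note that with a classic template the table list must be known when the template graph is built; a Flex Template lets it vary per run.

```python
# One Dataflow job containing an independent read/write branch per table.
import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

class CopyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # e.g. --tables=dbo.orders,dbo.customers
        parser.add_argument("--tables", required=True)

options = CopyOptions()
tables = options.tables.split(",")

with beam.Pipeline(options=options) as p:
    for table in tables:
        # one branch per table, all inside the same job
        rows = p | f"Read {table}" >> ReadFromJdbc(
            table_name=table,
            driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
            jdbc_url="jdbc:sqlserver://<HOST>;databaseName=<DB>",  # placeholder
            username="<USER>",                                     # placeholder
            password="<PASSWORD>",                                 # placeholder
        )
        (
            rows
            | f"ToDict {table}" >> beam.Map(lambda row: row._asdict())
            | f"Write {table}" >> beam.io.WriteToBigQuery(
                table=f"my_project:my_dataset.{table.split('.')[-1]}",  # placeholder
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )
```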
We use lots of components in Google Cloud, for example a job may start on App Engine, then do some work in Apache Airflow, then do some Dataflow work which will run a BigQuery insert.
Is there any way we can track the status of a job across all components using Stackdriver? For example, tell Stackdriver somehow about a custom job ID and query for it.
You can use advanced logs filters [1] to include log entries from various products. In the Logging page, search for your BigQuery job ID, click the job ID, and select "Show matching entries". This will open the advanced filter text box with the proper syntax. Then you can add more queries with an OR in between.
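A hedged sketch of combining entries from several products in one advanced logs filter, via the google-cloud-logging client library. The job IDs and the custom App Engine search term are placeholders for whatever identifiers you log from your own code.

```python
# One query spanning BigQuery, Dataflow, and App Engine log entries.
from google.cloud import logging

client = logging.Client(project="my-project")  # placeholder

log_filter = (
    '(resource.type="bigquery_resource" AND "<BQ_JOB_ID>") '
    'OR (resource.type="dataflow_step" AND resource.labels.job_id="<DATAFLOW_JOB_ID>") '
    'OR (resource.type="gae_app" AND "<CUSTOM_JOB_ID>")'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.resource.type, entry.payload)
```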
I am working on Matillion for Amazon Redshift and we have multiple jobs running daily, triggered by SQS messages. Now I am checking the possibility of creating a UI dashboard for stakeholders that will monitor the live progress of jobs and show reports of previous jobs: job name, tables impacted, job status/reason for failure, etc. Does Matillion maintain this kind of information implicitly, or will I have to maintain this information for each job?
Matillion has an API which you can use to obtain details of all task history. Information on the tasks API is here:
https://redshiftsupport.matillion.com/customer/en/portal/articles/2720083-loading-task-information?b_id=8915
You can use this to pull data on either currently running jobs or completed jobs down to component level including name of job, name of component, how long it took to run, whether it ran successfully or not and any applicable error message.
This information can be pulled into a Redshift table using the Matillion API profile which comes built into the product and the API Query component. You could then build your dashboard on top of this table. For further information I suggest you reach out to Matillion via their Support Center.
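If you'd rather call the API directly than use the API Query component, here is a hedged sketch using Python's requests library. The host, credentials, and endpoint path below are placeholders based on the task-information documentation linked above and may differ by Matillion version, so check that documentation for the exact URL.

```python
# Pull Matillion task history for a given date (placeholder endpoint).
import requests

MATILLION_HOST = "https://matillion.example.com"  # placeholder
GROUP, PROJECT = "MyGroup", "MyProject"           # placeholders
DATE = "2019-01-31"                               # the task API accepts a date parameter

url = (
    f"{MATILLION_HOST}/rest/v1/group/name/{GROUP}"
    f"/project/name/{PROJECT}/task/history/date/{DATE}"  # placeholder path
)

response = requests.get(url, auth=("api-user", "api-password"))  # placeholder credentials
response.raise_for_status()

for task in response.json():
    print(task.get("jobName"), task.get("state"), task.get("startTime"))
```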
The API is helpful, but you can only pass a date as a parameter (this is for Matillion for Snowflake; I assume it's the same for Redshift). I've requested the ability to pass a datetime so we can run the jobs throughout the day and not pull back the same set of records every time our API call runs.