Is there an API to send notifications based on job outputs? - qubole

I know there are APIs to configure notifications when a job fails or finishes.
But what if, say, I run a Hive query that counts the number of rows in a table? If the returned result is zero, I want to send out emails to the concerned parties. How can I do that?
Thanks.

You may want to look at Airflow and Qubole's operator for Airflow. We use Airflow to orchestrate all jobs run on Qubole and, in some cases, non-Qubole environments. We use the DataDog API to report success/failure of each task (Qubole or non-Qubole). DataDog could be replaced here by Airflow's email operator; Airflow also has chat operators (such as Slack).
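As a rough illustration of the DataDog reporting piece, a task callback could post an event through the datadog Python client. This is only a sketch: the keys, tags, and the report_task_status helper are placeholders, not part of any Qubole or Airflow API.

from datadog import initialize, api

# Placeholder credentials; in practice these would come from a secrets store.
initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

def report_task_status(task_name, succeeded):
    # Post a DataDog event so dashboards or monitors can alert on failures.
    api.Event.create(
        title="%s %s" % (task_name, "succeeded" if succeeded else "failed"),
        text="Task %s finished with success=%s" % (task_name, succeeded),
        alert_type="success" if succeeded else "error",
        tags=["source:qubole", "task:" + task_name],
    )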

There is no direct API for triggering a notification based on the results of a query.
However, there is a way to do this with Qubole:
- Create a workflow in Qubole with the following steps:
1. Your query (any query), writing its output to a particular location on S3.
2. A shell script that reads the result from S3 and fails the job based on whatever criteria you choose. In your case, fail the job if the result is 0 rows.
- Schedule this workflow using the "Scheduler" API and configure it to notify on failure.
You can also use the "sendmail" shell command to send mail based on the results in step 2 above.
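For illustration, the check in step 2 might look roughly like the sketch below (Python with boto3 rather than a shell script; the bucket, key, and result format are assumptions about where the query in step 1 writes its output):

import sys
import boto3

# Placeholder output location written by the query in step 1.
BUCKET = "my-bucket"
KEY = "qubole/results/row_count.txt"

s3 = boto3.client("s3")
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode().strip()
row_count = int(body or 0)

if row_count == 0:
    # A non-zero exit code fails the job, which in turn triggers the
    # scheduler's on-failure notification (or the sendmail step).
    sys.exit(1)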

Related

GCP BigQuery - Verify successful execution of stored procedure

I have a BigQuery routine that inserts records into a BQ Table.
I am looking to have an Eventarc trigger that triggers Cloud Run and performs some action upon successful execution of the BigQuery routine.
From Cloud Logging, I can see two events that would seem to confirm the successful execution of the BQ Routine.
protoPayload.methodName="google.cloud.bigquery.v2.JobService.InsertJob"
protoPayload.metadata.tableDataChange.insertedRowsCount
However, this does not give me the Job ID.
So, I am looking at the event:
protoPayload.methodName="jobservice.jobcompleted"
Would it be correct to assume that, if protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error is empty, then the stored procedure execution was successful?
Thanks!
I decided to go with protoPayload.methodName="jobservice.jobcompleted" in this case.
It gives the job ID at protoPayload.requestMetadata.resourceName, the status at protoPayload.serviceData.jobCompletedEvent.job.jobStatus.state, and any errors at protoPayload.serviceData.jobCompletedEvent.job.jobStatus.error.
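For reference, a Cloud Run handler receiving that log entry could check those fields with something like the sketch below (the function name and payload traversal are illustrative; only the field paths come from the answer above):

def job_succeeded(log_entry: dict) -> bool:
    # Pull the fields discussed above out of the audit LogEntry payload.
    payload = log_entry.get("protoPayload", {})
    job_id = payload.get("requestMetadata", {}).get("resourceName")
    status = (payload.get("serviceData", {})
                     .get("jobCompletedEvent", {})
                     .get("job", {})
                     .get("jobStatus", {}))
    print("job=%s state=%s" % (job_id, status.get("state")))
    # An empty or missing error object means the job, and the stored
    # procedure it ran, completed successfully.
    return not status.get("error")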

How to apply Path Patterns in GCP Eventarc for BigQuery service's jobCompleted method?

I am developing a solution where a Cloud Function calls a BigQuery procedure, and upon successful completion of this stored proc another Cloud Function should be triggered. For this I am using the Audit Logs "jobservice.jobcompleted" method. The problem with this approach is that it triggers the Cloud Function for every job completed in BigQuery, irrespective of dataset and procedure.
Is there any way to add a Path Pattern to the filter so that it triggers only for the completion of a specific query and not for all of them?
My query starts something like: CALL storedProc() ...
Also, when I tried to create a 2nd gen function from the console with an Eventarc trigger, to my surprise the BigQuery event provider doesn't have an event for jobCompleted.
Now I'm wondering if it's possible to trigger based on the job complete event at all.
Update: I changed my logic to use the google.cloud.bigquery.v2.TableService.InsertTable method, so that after inserting a record into a table an audit log message is produced that can trigger the next service. This insert statement is the last statement in the BigQuery procedure.
After running the procedure, the insert statement inserts the data, but the resource name comes out as projects/<project_name>/jobs.
I was expecting something like projects/<project_name>/tables/<table_name> so that I could apply a path pattern on the resource name.
Do I need to use a different protoPayload.method?
Try creating a log sink for the job-completed entries from a unique principal email (the service account that runs the procedure) and attach Pub/Sub to the sink.
Then use the published Pub/Sub event to run the destination service.
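A rough example of what such a sink filter might look like (the method name comes from the answers above; the service account address is a placeholder):

protoPayload.methodName="jobservice.jobcompleted"
protoPayload.authenticationInfo.principalEmail="proc-runner@my-project.iam.gserviceaccount.com"

Lines in a Cloud Logging filter are implicitly ANDed, so only jobs completed by that service account reach the sink.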

What will be the query to check completion of a workflow?

I have to check, via a SQL query, whether a workflow completed within its scheduled time or not. I also need to send an email with the workflow status, such as 'completed within time' or 'not completed within time'. Please help me out.
You can do it using either option 1 or option 2.
Option 1: You need access to the repository metadata database.
Create a post-session shell script. You can pass the workflow name and a benchmark value to the shell script.
Get the workflow run time from the repository metadata database.
SQL you can use:
SELECT WORKFLOW_NAME,(END_TIME-START_TIME)*24*60*60 diff_seconds
FROM
REP_WFLOW_RUN
WHERE WORKFLOW_NAME='myWorkflow'
You can then compare the above value with the benchmark value. The shell script can send a mail depending on the outcome.
Note that you need to create another workflow to run this check against the first one.
Option 2: If you do not have access to the repository metadata, follow the steps above except the metadata SQL.
Use pmcmd GetWorkflowDetails to check the status, start time, and end time of the workflow:
pmcmd GetWorkflowDetails -sv service -d domain -f folder myWorkflow
You can then grep the start and end time from the output and compare them with your benchmark values. The catch is the output format, so you need a little bit of scripting here.
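Either way, the final compare-and-notify step is small. A minimal sketch (Python purely for illustration; the benchmark, SMTP host, addresses, and the diff_seconds value obtained from the repository SQL or the pmcmd output are all placeholders):

import smtplib
from email.message import EmailMessage

BENCHMARK_SECONDS = 3600   # placeholder SLA for the workflow
diff_seconds = 4200        # value read from REP_WFLOW_RUN or pmcmd output

status = ("completed within time" if diff_seconds <= BENCHMARK_SECONDS
          else "not completed within time")

msg = EmailMessage()
msg["Subject"] = "myWorkflow " + status
msg["From"] = "etl-monitor@example.com"      # placeholder sender
msg["To"] = "team@example.com"               # placeholder recipients
msg.set_content("Run time was %d seconds (benchmark %d seconds)."
                % (diff_seconds, BENCHMARK_SECONDS))

with smtplib.SMTP("smtp.example.com") as server:   # placeholder SMTP host
    server.send_message(msg)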

Which Google Cloud product is preferable to fetch data from an external API into GCP?

This should be a very easy question, but I can't wrap my head around what to use. I would like to create a data pipeline that fetches data from an outside/external API (for example, the Spotify API) and performs some rather simple data cleaning on it, then either writes a JSON file to Cloud Storage or loads the data into BigQuery.
As far as I understand, I could use Composer with DAGs etc., but what I need here is something simpler and more lightweight (mainly UI based) that doesn't cost as much as Composer and is easier to use. What I am looking for is something like Data Factory in Azure.
So, in brief:
Log in to a data source using username/password
Extract data in a well-known format (CSV/JSON)
Transform the data, e.g. remove columns or perform simple filtering such as date filtering.
Reformat the data into another format (JSON/CSV/BigQuery)
...without having to code everything from scratch.
Can I handle all of this with one GCP product, or do I need to use a combination such as Cloud Scheduler, Cloud Functions, etc.?
As always, you have several options...
Cloud Scheduler seems to be a requirement to trigger the process regularly (as often as every minute).
Then, you have 2 options:
Code the process yourself: call the API, transform/clean the data, and sink the data into the destination.
Use Cloud Workflow: you can define the API calls that you want to make:
Call the API
Store the raw data in BigQuery (also an API call; there are connectors to simplify the process)
Run a query in BigQuery to clean/format your data and store it in a final table (also an API call)
You can also mix the two: use Cloud Functions to get the data and clean/format it with a query in BigQuery (a minimal sketch of this mix appears at the end of this answer).
Doing something specific like that without starting from scratch... difficult...
EDIT 1
If you have a look at the documentation, you can see this sample:
- getCurrentTime:
    call: http.get
    args:
      url: https://us-central1-workflowsample.cloudfunctions.net/datetime
    result: currentTime
- readWikipedia:
    call: http.get
    args:
      url: https://en.wikipedia.org/w/api.php
      query:
        action: opensearch
        search: ${currentTime.body.dayOfTheWeek}
    result: wikiResult
- returnResult:
    return: ${wikiResult.body[1]}
The first step, getCurrentTime, performs an external call and stores the response in currentTime (via result: currentTime).
In the next step, you can reuse the currentTime result and extract just the value that you want for another API call.
And you can chain steps like that.
If you need authentication, you can call Secret Manager to get the secret values and then use the result of that call in subsequent steps.
For easier connections to Google APIs, you can use connectors.
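As a rough illustration of the Cloud Functions + BigQuery mix mentioned above, a minimal HTTP-triggered function could look like the sketch below. The API URL, credentials, column names, and dataset/table are placeholders, and it assumes the requests and google-cloud-bigquery client libraries are available:

import requests
from google.cloud import bigquery

def ingest(request):
    # Fetch from the external API (placeholder URL and basic auth).
    resp = requests.get("https://api.example.com/data",
                        auth=("username", "password"))
    resp.raise_for_status()
    rows = resp.json()

    # Simple cleaning/filtering: keep only the columns we care about.
    cleaned = [{"id": r["id"], "value": r["value"]} for r in rows]

    # Load into BigQuery (placeholder dataset and table).
    client = bigquery.Client()
    errors = client.insert_rows_json("my_dataset.my_table", cleaned)
    if errors:
        raise RuntimeError(errors)
    return "ok"

Cloud Scheduler can then hit the function's URL on whatever cadence you need.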

Is there a way to publish custom metrics from AWS Glue jobs?

I'm using an AWS Glue job to move and transform data across S3 buckets, and I'd like to build custom accumulators to monitor the number of rows that I'm receiving and sending, along with other custom metrics. What is the best way to monitor these metrics? According to this document: https://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html I can keep track of general metrics on my Glue job, but there doesn't seem to be a good way to send custom metrics through CloudWatch.
I have done lots of similar projects like this; each micro-batch can be:
a file or a bunch of files
a time interval of data from an API
a partition of records from a database
etc.
Your use case can be broken down into the following questions:
given a bunch of input, how do you define a task_id?
how do you want to define the metrics for your task? You need a simple dictionary structure for this metrics data
find a backend data store to store the metrics data
find a way to query the metrics data
In some business use cases, you also need to store status information to track each input: did it succeed, fail, is it in progress, or is it stuck? You may also want to control retries and concurrency (avoid multiple workers working on the same input).
DynamoDB is the perfect backend for this type of use case. It is a super-fast, no-ops, pay-as-you-go, automatically scaling key-value store.
There's a Python library that implements this pattern: https://github.com/MacHu-GWU/pynamodb_mate-project/blob/master/examples/patterns/status-tracker.ipynb
Here's an example:
Put your Glue ETL job's main logic in a function:
def glue_job() -> dict:
    ...
    return your_metrics
Given an input, calculate the task_id identifier; then you just need:
tracker = Tracker.new(task_id)

# start the job, it will succeed
with tracker.start_job():
    # do some work
    your_metrics = glue_job()
    # save your metrics in dynamodb
    tracker.set_data(your_metrics)
Consider enabling continuous logging on your AWS Glue job. This will allow you to do custom logging via CloudWatch. Custom logging can include information such as row counts.
More specifically:
Enable continuous logging for your Glue job.
Add logger = glueContext.get_logger() at the beginning of your Glue job.
Add logger.info("Custom logging message that will be sent to CloudWatch") wherever you want to log information to CloudWatch. For example, if I have a data frame named df, I could log its number of rows to CloudWatch by adding logger.info("Row count of df " + str(df.count())).
Your log messages will be located under the CloudWatch log group /aws-glue/jobs/logs-v2, in the log stream named <glue_run_id>-driver.
You can also refer to the "Logging Application-Specific Messages Using the Custom Script Logger" section of the AWS documentation Enabling Continuous Logging for AWS Glue Jobs for more information on application-specific logging.
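Putting those steps together, a Glue script using the custom script logger might look roughly like this (the source path is a placeholder, and continuous logging must be enabled on the job for the message to appear under /aws-glue/jobs/logs-v2):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
logger = glueContext.get_logger()

# Placeholder source path.
df = glueContext.spark_session.read.json("s3://my-bucket/incoming/")

# Custom application message; it lands in the driver log stream
# in CloudWatch when continuous logging is enabled.
logger.info("Row count of df " + str(df.count()))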