How to display Airflow DAG status in BigQuery tables - google-cloud-platform

I want to write the final status (success/failure) of an Airflow DAG to a table in BigQuery.
The table would contain columns such as Date-Time, DAG-Name, and Status, and it would get populated according to the final status of each DAG run.
How can this be achieved?

There's no native out-of-the-box method to achieve this in Airflow. However, you could implement a function yourself which writes data to BigQuery and run it via a DAG's on_success_callback and on_failure_callback arguments.
Note: BigQuery is not a transactional database and has limits on the number of inserts per day. For a large number of DAG runs, you might want to consider writing results to BigQuery in batches.
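For example, a minimal sketch of such a callback, assuming a destination table named your_project.monitoring.dag_status (the table, column, and DAG names are placeholders, not part of the original answer):

from datetime import datetime
from airflow import DAG
from google.cloud import bigquery

def record_dag_status(status):
    # Returns a callback that writes one row per DAG run to BigQuery.
    def _callback(context):
        client = bigquery.Client()
        client.insert_rows_json(
            "your_project.monitoring.dag_status",  # placeholder table
            [{
                "date_time": datetime.utcnow().isoformat(),
                "dag_name": context["dag"].dag_id,
                "status": status,
            }],
        )
    return _callback

with DAG(
    "my_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    on_success_callback=record_dag_status("success"),
    on_failure_callback=record_dag_status("failure"),
) as dag:
    ...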

If you need the data in real time, I would go with something along the lines of the approach @Bas has suggested, maybe with Firestore or Cloud SQL. However, note his comments on the inserts per day if you go with BigQuery.
If you can wait for the results on a daily basis, you can set up a log sink to BigQuery as described here:
https://cloud.google.com/bigquery/docs/reference/auditlogs#stackdriver_logging_exports
In the filter criteria you can either bring in all of the Airflow logs or just the ones from the worker/scheduler.
Example criteria:
resource.type="cloud_composer_environment"
logName="projects/{YOUR-PROJECT}/logs/airflow-worker"
In the log textPayload you will see something like:
Marking task as SUCCESS. dag_id=thing, task_id=stuff, executiondate=20220307T111111, start_date=20220307T114858, end_date=20220307T114859
You can then parse out what you need in BigQuery.
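For illustration, a sketch of such a parse using the BigQuery Python client; it assumes the log sink writes date-sharded airflow_worker_* tables into a dataset called airflow_logs (these names are assumptions, adjust them to your sink):

from google.cloud import bigquery

# The sink's project, dataset, and table prefix below are placeholders.
query = r"""
SELECT
  timestamp,
  REGEXP_EXTRACT(textPayload, r'dag_id=([^,]+)') AS dag_name,
  REGEXP_EXTRACT(textPayload, r'Marking task as (\w+)') AS status
FROM `your_project.airflow_logs.airflow_worker_*`
WHERE textPayload LIKE 'Marking task as%'
"""

client = bigquery.Client()
for row in client.query(query):
    print(row.timestamp, row.dag_name, row.status)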

To complement the answer of user Bas Harenslak, there are also these options that you can explore:
You can make use of the TriggerDagRunOperator. With it, you can have one DAG (a recap-dag) that is triggered by your DAGs to populate the record into your destination dataset.
trigger_recap_dag = TriggerDagRunOperator(
    task_id="trigger_recap_dag",
    trigger_dag_id="recap-dag",
    wait_for_completion=False,
    allowed_states=["success"],
    # conf must be JSON-serializable, hence the isoformat()
    conf={"Time": datetime.now().isoformat(), "DAG": "recap-dag", "Status": "success"},
)
ingestion >> transformation >> save >> send_notification >> trigger_recap_dag
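As an illustration, the receiving recap-dag could read that conf and write it to BigQuery along these lines (a minimal sketch; the destination table and column names are placeholders, not part of the original answer):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery

def save_status_to_bq(**context):
    # conf passed in by the TriggerDagRunOperator above
    conf = context["dag_run"].conf or {}
    client = bigquery.Client()
    client.insert_rows_json(
        "your_project.monitoring.dag_status",  # placeholder table
        [{"date_time": conf.get("Time"), "dag_name": conf.get("DAG"), "status": conf.get("Status")}],
    )

with DAG(
    "recap-dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="save_status", python_callable=save_status_to_bq)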
If you see fit, this recap-dag can also be independent, running every hour/day/week of your choosing and checking your DAGs' status.
with DAG(
    'recap-dag',
    schedule_interval='@daily',
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    ...
# Airflow >= 2.0.0
# Inside a PythonOperator callable
from airflow.models import DagRun

def GetRunningDagsInfo():
    dag_runs = DagRun.find(
        dag_id=your_dag_id,
        execution_start_date=your_start_date,
        execution_end_date=your_end_date,
    )
    ...
You can make use of the prior options and come up with a solution like this:
After your DAG (or DAGs) complete, they fire the trigger DAG. This recap-dag saves your DAG records into a custom table or file, and then your independent DAG runs, retrieves the datasets that have been created so far, and pushes the data into your BigQuery table.
Another option is to look into the Airflow metadata database to retrieve run information, known as Data Profiling. It has been deprecated in recent versions due to security concerns.

Related

Process data from BigQuery using Dataflow

I want to retrieve data from BigQuery that arrives every hour, do some processing, and push the newly calculated variables into a new BigQuery table. The thing is that I've never worked with GCP before and I have to for my job now.
I already have my code in Python to process the data, but it only works with a "static" dataset.
As your source and sink are both in BigQuery, I would recommend doing your transformations inside BigQuery.
If you need a scheduled job that runs at a predetermined time, you can use Scheduled Queries.
With Scheduled Queries you can save a query, execute it periodically, and save the results to another table.
To create a scheduled query, follow these steps:
In the BigQuery Console, write your query
After writing the correct query, click Schedule query and then Create new scheduled query
Pay attention to these two fields:
Schedule options: there are some pre-configured schedules such as daily, monthly, etc. If you need to execute it every two hours, for example, you can set the Repeat option to Custom and set your Custom schedule to 'every 2 hours'. In the Start date and run time field, select the time and date when your query should start being executed.
Destination for query results: here you can set the dataset and table where your query's results will be saved. Please keep in mind that this option is not available if you use scripting. In other words, you should use only SQL and not scripting in your transformations.
Click on Schedule
After that your query will start being executed according to your schedule and destination table configurations.
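If you'd rather create the same scheduled query programmatically instead of through the console, the BigQuery Data Transfer Service client can do it. A rough sketch based on the google-cloud-bigquery-datatransfer package (the project, dataset, query, and schedule below are placeholders):

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = client.common_project_path("your_project")  # placeholder project

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="your_dataset",
    display_name="My scheduled transformation",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT * FROM `your_project.your_dataset.source_table`",
        "destination_table_name_template": "result_table",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 2 hours",
)

transfer_config = client.create_transfer_config(
    parent=parent,
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")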
According to Google's recommendation, when your data are in BigQuery and you want to transform them and store them back in BigQuery, it's always quicker and cheaper to do this in BigQuery if you can express your processing in SQL.
That's why I don't recommend Dataflow for your use case. If you don't want to, or can't, use SQL directly, you can create a User Defined Function (UDF) in BigQuery in JavaScript.
EDIT
If you have no information about when the data are updated in BigQuery, Dataflow won't help you here. Dataflow can process data in real time only if those data arrive through Pub/Sub. Otherwise, it's not magic!
Because you don't know when a load is performed, you have to run your process on a schedule. For this, Scheduled Queries is the right solution if you use BigQuery for your processing.

Snowflake to/from S3 Pipeline Recommendations for ETL architecture

I am trying to build a pipeline which sends data from Snowflake to S3 and then from S3 back into Snowflake (after running it through a production ML model on SageMaker). I am new to Data Engineering, so I would love to hear from the community what the recommended path is. The pipeline requirements are the following:
I am looking to schedule a monthly job. Do I specify such in AWS or on the Snowflake side?
For the initial pull, I want to query 12 months' worth of data from Snowflake. However, for any subsequent pull, I only need the last month since this should be a monthly pipeline.
All monthly data pulls should be stored in their own S3 subfolders, like query_01012020, query_01022020, query_01032020, etc.
The data load from S3 back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in Sagemaker.
I want to monitor the performance of the ML model in production over time to catch if the model is losing accuracy (some calibration-like graph perhaps).
I want to get any error notifications in real-time when issues in the pipeline occur.
I hope you are able to guide me on relevant documentation/tutorials for this effort. I would truly appreciate the guidance.
Thank you very much.
Snowflake does not have any orchestration tools like Airflow or Oozie, so you need to consider using one of the Snowflake Partner Ecosystem tools such as Matillion. Alternatively, you can build your own end-to-end flow using Spark, Python, or any other programming language that can connect to Snowflake via the JDBC/ODBC/Python connectors.
To feed data to Snowflake from S3 in real time, you can use the AWS SNS service to invoke a Snowpipe that loads the data into a Snowflake stage, from where the ETL process can take it forward for consumption.
Answers to each one of your questions:
I am looking to schedule a monthly job. Do I specify such in AWS or on the Snowflake side?
Ans: It is not possible in Snowflake; you have to do it via AWS or some other tool.
For the initial pull, I want to query 12 months' worth of data from Snowflake. However, for any subsequent pull, I only need the last month since this should be a monthly pipeline.
Ans: You can pull any size of data, and you can also have some scripting to support that via Snowflake, but the invocation needs to be programmed.
All monthly data pulls should be stored in their own S3 subfolders, like query_01012020, query_01022020, query_01032020, etc.
Ans: Feeding data to Snowflake is possible via AWS SNS (or the REST API) + Snowpipe, but vice versa is not possible.
The data load from S3 back to a specified Snowflake table should be triggered after the ML model has successfully scored the data in Sagemaker.
Ans: This is possible via AWS SNS + SnowPipe.
I want to monitor the performance of the ML model in production over time to catch if the model is losing accuracy (some calibration-like graph perhaps).
Ans: Not possible via Snowflake.
I would approach the problem like this:
Hold 12 months of data in a temporary table (I am sure you know all the required queries; since you asked for tutorials, I am including them as they may be helpful for you as well as others).
-- Initial pull: hold 12 months of data ...
Drop table if exists <TABLE_NAME>;
Create Temporary Table <TABLE_NAME> as (
    Select *
    From <ORIGINAL_TABLE>
    Where date_field between current_date - 365 and current_date
);
-- Export data to S3 ...
copy into 's3://path/to/export/directory'
from DB_NAME.SCHEMA_NAME.TABLE_NAME
file_format = (type = csv field_delimiter = '|' skip_header = 0)
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key');
Once your ML stuff is done, import the data back to Snowflake like this:
-- Import from S3 ...
copy into DB_NAME.SCHEMA_NAME.TABLE_NAME
from 's3://path/to/your/csv_file_name.csv'
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key')
file_format = (type = csv field_delimiter = '|' skip_header = 1);
I am not sure whether Snowflake has released any ML features, or how you will do the ML at your end, etc.
For the scheduling I would suggest either:
Place your code in a shell script or a python script and schedule it to run once a month.
Use Snowflake tasks as follows:
CREATE TASK monthly_task_1
  WAREHOUSE = <your_warehouse>
  SCHEDULE = 'USING CRON 0 0 1 * * America/Chicago'
AS
  <insert your create temporary table query here>;

CREATE TASK monthly_task_2
  WAREHOUSE = <your_warehouse>
  AFTER monthly_task_1
AS
  <insert your S3 export query here>;
You can read more about snowflake tasks here: https://docs.snowflake.com/en/sql-reference/sql/create-task.html
For importing results back to Snowflake from S3 after the ML stuff is done, you can add a few lines to your ML code (presumably in Python) to execute the COPY INTO statement for the import from S3 shown above.
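For instance, a minimal sketch of those few lines using the snowflake-connector-python package (the connection parameters are placeholders; the COPY INTO statement is the one shown above):

import snowflake.connector

# Placeholder credentials and object names; replace with your own.
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="YOUR_WAREHOUSE",
    database="DB_NAME",
    schema="SCHEMA_NAME",
)

copy_sql = """
copy into DB_NAME.SCHEMA_NAME.TABLE_NAME
from 's3://path/to/your/csv_file_name.csv'
credentials=(aws_key_id='your_aws_key_id' aws_secret_key='your_aws_secret_key')
file_format = (type = csv field_delimiter = '|' skip_header = 1)
"""

try:
    conn.cursor().execute(copy_sql)  # runs the import after SageMaker scoring completes
finally:
    conn.close()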

How to monitor if a BigQuery table contains current data and send an alert if not?

I have a BigQuery table and an external data import process that should add entries every day. I need to verify that the table contains current data (with a timestamp of today). Writing the SQL query is not a problem.
My question is how best to set up such monitoring in GCP. Can Stackdriver execute custom BigQuery SQL? Or would a Cloud Function be more suitable? An App Engine application with a cron job? What's the best practice?
Not sure what the best practice is here, but one simple solution is to use a BigQuery scheduled query. Schedule the query, make it fail if something is wrong using the ERROR() function, and configure the scheduled query to notify you (it sends an email) if it fails.
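As an illustration, the freshness check could look like the sketch below (the table and timestamp column are placeholders); the same SQL can be pasted into a scheduled query, and here it is simply run via the Python client:

from google.cloud import bigquery

# Hypothetical table and timestamp column; adjust to your schema.
check_sql = """
SELECT
  IF(
    COUNT(*) = 0,
    ERROR('No rows with a timestamp of today in your_dataset.your_table'),
    COUNT(*)
  ) AS rows_today
FROM `your_project.your_dataset.your_table`
WHERE DATE(event_timestamp) = CURRENT_DATE()
"""

client = bigquery.Client()
# Fetching the results raises if ERROR() fired, which is what makes a
# scheduled-query run fail and trigger its email notification.
rows = list(client.query(check_sql))
print(rows[0].rows_today)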

How to export BigQuery partitions to the GCS using Airflow

I need to create an Airflow job that exports the partitions of a BigQuery table to GCS for a given range of _PARTITIONDATE. I need each partition to be in a separate file named with the partition date. How can I achieve this?
I have tried using Airflow tasks that use SQL to fetch the _PARTITIONDATE, but can I do it programmatically?
For this, I recommend performing a loop in your DAG definition (the loop is in Python code and adds many steps to the DAG; by definition, the DAG itself can't contain loops).
The algorithm should be like this (see the sketch after the steps):
For each day in the range:
Query BigQuery for that day and save the result to a temporary table whose name contains the date. Use the BigQueryOperator.
Extract the temporary table to GCS. Use the BigQueryToCloudStorageOperator.
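A minimal sketch of that loop, using the current provider versions of those operators (BigQueryInsertJobOperator in place of the older BigQueryOperator, BigQueryToGCSOperator in place of BigQueryToCloudStorageOperator); the project, dataset, bucket, and date range are placeholders:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

START = datetime(2022, 3, 1)  # placeholder date range
END = datetime(2022, 3, 7)

with DAG(
    "export_partitions",
    start_date=datetime(2022, 3, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    day = START
    while day <= END:
        suffix = day.strftime("%Y%m%d")

        # Save one partition's rows to a temporary table named with the date.
        query_day = BigQueryInsertJobOperator(
            task_id=f"query_{suffix}",
            configuration={
                "query": {
                    "query": (
                        "SELECT * FROM `your_project.your_dataset.your_table` "
                        f"WHERE _PARTITIONDATE = '{day.strftime('%Y-%m-%d')}'"
                    ),
                    "useLegacySql": False,
                    "destinationTable": {
                        "projectId": "your_project",
                        "datasetId": "your_dataset",
                        "tableId": f"tmp_export_{suffix}",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                },
            },
        )

        # Export that temporary table to its own dated file in GCS.
        export_day = BigQueryToGCSOperator(
            task_id=f"export_{suffix}",
            source_project_dataset_table=f"your_project.your_dataset.tmp_export_{suffix}",
            destination_cloud_storage_uris=[f"gs://your_bucket/exports/{suffix}.csv"],
            export_format="CSV",
        )

        query_day >> export_day
        day += timedelta(days=1)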
Just follow the link below; it is a guide to exporting BigQuery partitions to GCS using Airflow:
https://m.youtube.com/watch?v=wAyu5BN3VpY&t=28s

How to monitor the number of records loaded into a BQ table while using BigQuery streaming?

We are trying to insert data into BigQuery (streaming) using Dataflow. Is there a way we can keep a check on the number of records inserted into BigQuery? We need this data for reconciliation purposes.
Add a step to your Dataflow pipeline which calls the Google API Tables.get, OR run this query before and after the flow (both are equally good).
select row_count, table_id from `dataset.__TABLES__` where table_id = 'audit'
As an example, the query returns the current row_count and table_id for the audit table.
You may also be able to examine the "Elements added" count by clicking on the step that writes to BigQuery in the Dataflow UI.
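For example, a minimal sketch of the Tables.get approach via the BigQuery Python client (the table reference is a placeholder); run it before and after the pipeline and compare the counts:

from google.cloud import bigquery

TABLE_REF = "your_project.dataset.audit"  # placeholder table reference

client = bigquery.Client()

def current_row_count(table_ref):
    # Tables.get under the hood; note that num_rows may not yet include rows
    # still sitting in the streaming buffer.
    return client.get_table(table_ref).num_rows

rows_before = current_row_count(TABLE_REF)
# ... run the Dataflow pipeline ...
rows_after = current_row_count(TABLE_REF)
print(f"Records inserted: {rows_after - rows_before}")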