I have an on-premise CDAP Data Fusion instance with multiple namespaces. How do I trigger a pipeline using Airflow operators? I have explored the available Airflow operators and this page, but it was not very helpful: https://cloud.google.com/data-fusion/docs/reference/cdap-reference#start_a_batch_pipeline
Assuming you have already deployed the pipeline and you know the location, instance name, and pipeline name of the pipeline you want to run, see CloudDataFusionStartPipelineOperator() for the parameters it accepts.
Using the quickstart pipeline, I triggered it with CloudDataFusionStartPipelineOperator(). See the operator usage below:
import datetime

import airflow
from airflow.providers.google.cloud.operators.datafusion import CloudDataFusionStartPipelineOperator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with airflow.DAG(
        'trigger_df',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    start_pipeline = CloudDataFusionStartPipelineOperator(
        location='us-central1',
        pipeline_name='DataFusionQuickstart',
        instance_name='test',
        task_id="start_pipeline",
    )

    start_pipeline
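Since the question mentions multiple namespaces: depending on your apache-airflow-providers-google version, the operator also accepts namespace and runtime_args parameters, so a variant could look like the sketch below. The namespace and runtime argument values are made up for illustration.

    # Sketch only: 'my_namespace' is a hypothetical namespace; check the
    # operator signature in your installed provider version.
    start_pipeline_ns = CloudDataFusionStartPipelineOperator(
        task_id="start_pipeline_ns",
        location='us-central1',
        instance_name='test',
        pipeline_name='DataFusionQuickstart',
        namespace='my_namespace',          # target a specific CDAP namespace
        runtime_args={'key': 'value'},     # optional pipeline runtime arguments
    )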
Success "Graph View":
Logs:
We are using GCP Composer (managed Airflow) as our orchestration tool and BigQuery as the database. I need to push data into a table from another table (both tables are in BigQuery), but the method should be upsert, so I wrote a SQL script that uses MERGE to update or insert.
I have 2 questions:
The merge script is located in the GCP Composer bucket; how can I read the SQL script from the bucket?
After reading the SQL file, how can I run the query on BigQuery?
Thanks
You can use the script below to read a file in GCS. I tested it with an SQL script that does an INSERT and is saved in my Composer bucket.
The read_gcs_op task executes read_gcs_file() and returns the content of the SQL script. That content is then used by execute_query, which runs the query. See the code below:
import datetime
import logging

from airflow import models
from airflow.operators import python
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from google.cloud import bigquery

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

BUCKET_NAME = 'your-composer-bucket'
GCS_FILES = ['sql_query.txt']
PREFIX = 'data'  # populate this if you stored your sql script in a directory in the bucket

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with models.DAG(
        'query_gcs_to_bq',
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    def read_gcs_file(**kwargs):
        hook = GCSHook()
        for gcs_file in GCS_FILES:
            # check if PREFIX is set and build the full object name to download
            if PREFIX:
                object_name = f'{PREFIX}/{gcs_file}'
            else:
                object_name = f'{gcs_file}'
            # download the object content via the GCS hook
            resp_byte = hook.download_as_byte_array(
                bucket_name=BUCKET_NAME,
                object_name=object_name,
            )
        resp_string = resp_byte.decode("utf-8")
        logging.info(resp_string)
        return resp_string

    read_gcs_op = python.PythonOperator(
        task_id='read_gcs',
        provide_context=True,
        python_callable=read_gcs_file,
    )

    sql_query = "{{ task_instance.xcom_pull(task_ids='read_gcs') }}"  # templated value returned by read_gcs_op

    def query_bq(sql):
        hook = BigQueryHook(bigquery_conn_id="bigquery_default", delegate_to=None, use_legacy_sql=False)
        client = bigquery.Client(project=hook._get_field("project"), credentials=hook._get_credentials())
        client.query(sql).result()  # wait for the job to finish; if the query returns rows, capture and return them here

    execute_query = python.PythonOperator(
        task_id='query_bq',
        provide_context=True,
        python_callable=query_bq,
        op_kwargs={
            "sql": sql_query
        },
    )

    read_gcs_op >> execute_query
For testing, I used an INSERT statement as the SQL script consumed by the DAG above:
sql_script.txt
INSERT `your-project.dataset.your_table` (name, age)
VALUES('Brady', 44)
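Since the question is about an upsert, the same setup should work with a MERGE statement saved in the bucket instead of the INSERT above. A rough sketch only; the dataset, table, and column names below are made up:

MERGE `your-project.dataset.your_table` T
USING `your-project.dataset.source_table` S
ON T.name = S.name
WHEN MATCHED THEN
  UPDATE SET T.age = S.age
WHEN NOT MATCHED THEN
  INSERT (name, age) VALUES (S.name, S.age)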
Test done: the return value of the read_gcs task contained the SQL script, and after Composer finished executing read_gcs and query_bq, I checked my table and confirmed the INSERT succeeded (screenshots not included here).
I'm trying to send an email using SendGrid, but the DAG is stuck on running.
I did the following:
set the environment variable SENDGRID_MAIL_FROM to my email
set the environment variable SENDGRID_API_KEY to the API key I generated from SendGrid after confirming my personal email (same as the sender email)
There is no spam in my email inbox.
Nothing is found in the Activity section on the SendGrid page, and nothing is sent.
Can someone maybe point out what I am doing wrong?
My code:
from airflow.models import (DAG, Variable)
import os
from airflow.operators.email import EmailOperator
from datetime import datetime, timedelta

default_args = {
    'start_date': datetime(2020, 1, 1),
    'owner': 'Airflow',
    "email_on_failure": False,
    "email_on_retry": False,
    "emails": ['my@myemail.com']
}

PROJECT_ID = os.environ.get("GCP_PROJECT_ID", "bigquery_default")
PROJECT_ID_GCP = os.environ.get("GCP_PROJECT_ID", "my_progect")

with DAG(
    'retries_test',
    schedule_interval=None,
    catchup=False,
    default_args=default_args
) as dag:

    send_email_notification = EmailOperator(
        task_id="send_email_notification",
        to="test@sendgrid.com",
        subject="test",
        html_content="<h3>Hello</h3>"
    )

    send_email_notification
I have the following default args for an Airflow DAG:
DEFAULT_ARGS = {
    'owner': 'me',
    'depends_on_past': False,
    'email': ['me@me.com'],
    'email_on_failure': True,
    'retries': 4,
    'retry_delay': timedelta(seconds=5)
}
Each time a specific task attempt fails, I get an email alert. However, is it possible to ask Airflow to send an alert only when all the retries/attempts have failed?
Disable the email_on_retry option in default_args:
DEFAULT_ARGS = {
    'owner': 'me',
    'depends_on_past': False,
    'email': ['me@me.com'],
    'email_on_failure': True,
    'retries': 4,
    'email_on_retry': False,
    'retry_delay': timedelta(seconds=5)
}
All of these email options are also available on the base operator, in case you want to apply a different option to each task, e.g. enable email alerts on retry for some tasks only.
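As an illustration, a minimal sketch of overriding the DAG-level default on a single task (the task and command below are made up):

from airflow.operators.bash import BashOperator

# Most tasks inherit email_on_retry=False from DEFAULT_ARGS,
# but this one also alerts on every retry.
noisy_task = BashOperator(
    task_id='noisy_task',
    bash_command='exit 1',
    email=['me@me.com'],
    email_on_retry=True,   # per-task override of the default
    retries=4,
)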
Here is an interesting article on configuring email notifications in Airflow: https://www.astronomer.io/guides/error-notifications-in-airflow
Is there a way to create a DAG file dynamically from code and upload it to Airflow? (Airflow reads from the dags directory, but creating a file for every DAG and uploading it to that folder is slow.)
Is it possible to create a template DAG and populate it with new logic whenever it is needed?
I saw that they are working on an API. The current version only has an option to trigger a DAG.
You can quite easily create multiple DAGs in a single file:
def create_dag(dag_id):
    dag = DAG(...)
    # some tasks added
    return dag

for dag_id in dags_lists:
    globals()[dag_id] = create_dag(dag_id)
If you create proper DAG objects with the template function (create_dag in the above example) and make them available in the globals object, Airflow will recognise them as individual DAGs.
Yes, you can create dynamic DAGs as follows:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


# build a dag for each number in range(1, 10)
for n in range(1, 10):
    dag_id = 'hello_world_{}'.format(str(n))

    default_args = {'owner': 'airflow',
                    'start_date': datetime(2018, 1, 1)}

    schedule = '@daily'
    dag_number = n

    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
Example from https://www.astronomer.io/guides/dynamically-generating-dags/
However, note that this can cause some issues, such as delays between the execution of tasks, because the Airflow scheduler and workers have to parse the entire file when scheduling or executing each task of a single DAG.
Since you would have many DAGs (let's say 100) inside the same file, all 100 DAG objects have to be parsed while executing a single task of DAG1.
I would recommend building a tool that creates a single file per DAG.
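As a rough sketch of that idea (purely illustrative: the template file, config list, and output directory below are all made up), a small generator script could render one DAG file per configuration into the dags folder:

import pathlib

# Hypothetical per-DAG configuration; in practice this could come from a
# YAML/JSON file or a database.
DAG_CONFIGS = [
    {'dag_id': 'hello_world_1', 'schedule': '@daily'},
    {'dag_id': 'hello_world_2', 'schedule': '@hourly'},
]

# Hypothetical template with {dag_id} and {schedule} placeholders.
TEMPLATE = pathlib.Path('dag_template.py.tpl').read_text()
OUTPUT_DIR = pathlib.Path('dags')  # the folder Airflow scans for DAGs

for config in DAG_CONFIGS:
    dag_file = OUTPUT_DIR / f"{config['dag_id']}.py"
    # str.format fills the placeholders in the template with this DAG's values
    dag_file.write_text(TEMPLATE.format(**config))

This way the scheduler parses one small file per DAG instead of one large file containing all of them.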
I am using Composer to run my Dataflow pipeline on a schedule. If the job takes longer than a certain amount of time, I want it to be killed. Is there a way to do this programmatically, either as a pipeline option or as a DAG parameter?
I am not sure how to do it as a pipeline config option, but here is an idea.
You could launch a task queue task with its countdown set to your timeout value. When the task runs, you can check whether your Dataflow job is still running:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
If it is, you can call update on it with job state JOB_STATE_CANCELLED
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/update
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#jobstate
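For the countdown part, a minimal sketch assuming the legacy App Engine task queue API is available; the handler URL, payload, and timeout below are all made up:

from google.appengine.api import taskqueue

TIMEOUT_SECONDS = 2 * 60 * 60  # hypothetical timeout: 2 hours

# Enqueue a task that fires after TIMEOUT_SECONDS and hits a handler that
# lists the Dataflow job and cancels it if it is still running.
taskqueue.add(
    url='/tasks/check_dataflow_job',       # hypothetical handler path
    params={'job_id': 'my-dataflow-job'},  # hypothetical job identifier
    countdown=TIMEOUT_SECONDS,
)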
This is done through the googleapiclient lib: https://developers.google.com/api-client-library/python/apis/discovery/v1
Here is an example of how to use it
from google.appengine.api import app_identity
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials


class DataFlowJobsListHandler(InterimAdminResourceHandler):
    def get(self, resource_id=None):
        """
        Wrapper to this:
        https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
        """
        if resource_id:
            self.abort(405)
        else:
            credentials = GoogleCredentials.get_application_default()
            service = discovery.build('dataflow', 'v1b3', credentials=credentials)
            project_id = app_identity.get_application_id()

            _filter = self.request.GET.pop('filter', 'UNKNOWN').upper()

            jobs_list_request = service.projects().jobs().list(
                projectId=project_id,
                filter=_filter)  # e.g. 'ACTIVE'
            jobs_list = jobs_list_request.execute()

            return {
                '$cursor': None,
                'results': jobs_list.get('jobs', []),
            }
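To actually cancel a job that has exceeded the timeout, here is a minimal sketch along the same lines, assuming you already have the project ID and the job ID of the still-running job (the function name is mine, not part of an existing library):

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials


def cancel_dataflow_job(project_id, job_id):
    """Request cancellation of a running Dataflow job via projects.jobs.update."""
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('dataflow', 'v1b3', credentials=credentials)

    # Setting requestedState to JOB_STATE_CANCELLED asks Dataflow to cancel the job.
    service.projects().jobs().update(
        projectId=project_id,
        jobId=job_id,
        body={'requestedState': 'JOB_STATE_CANCELLED'},
    ).execute()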