Aiflow skipping task on ONE_SUCCESS trigger rule - airflow-scheduler

I am using one_success trigger rule so that if anyone of the parent task passes than the child task run which is happening as expected. However, I am getting issue when both are getting failed. In this case, child task is getting skipped instead of fail. Below is the dag implementation
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.web_hdfs_sensor import WebHdfsSensor
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
circles = ['101','102','103']
def load_hive_partition(circle, **kwargs):
return HiveOperator(
task_id='add_partition_{}'.format(circle),
hql='alter table abc.def add partition (event_date="{{ ds }}", circle="'+circle+'") location "/user/cloudera/hive/abc/def/event_date={{ ds }}/circle='+circle+'"',
trigger_rule=TriggerRule.ONE_SUCCESS,
dag=dag)
def check_hdfs_node1(circle, **kwargs):
return WebHdfsSensor(
task_id='source_data_sensor_node1_{}'.format(circle),
webhdfs_conn_id='webhdfs_default_1',
filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
timeout=60 * 60 * 24,
dag=dag
)
def check_hdfs_node2(circle, **kwargs):
return WebHdfsSensor(
task_id='source_data_sensor_node2_{}'.format(circle),
webhdfs_conn_id='webhdfs_default',
filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
timeout=60 * 60 * 24,
dag=dag
)
args = {
'owner': 'airflow',
'depends_on_past': False,
'run_as_user': 'airflow',
'retries': 3,
'start_date': datetime(year=2020, month=8, day=30),
'retry_delay': timedelta(minutes=10)
}
dag = DAG(dag_id="HiveLoadPartition_circle",
default_args=args,
schedule_interval='30 18 * * *',
catchup=False)
kinit_bash = BashOperator(
task_id='kinit_bash',
bash_command='kinit -kt /usr/local/airflow/keytab.keytab appuser#cloudera.com',
dag=dag)
#start_dag = DummyOperator(task_id='start_dag', dag=dag)
end_dag = DummyOperator(task_id='end_dag',trigger_rule=TriggerRule.ALL_DONE, dag=dag)
for circle in circles:
add_partition = load_hive_partition(circle)
check_hdfs_1 = check_hdfs_node1(circle)
check_hdfs_2 = check_hdfs_node2(circle)
check_hdfs_1.set_upstream(kinit_bash)
check_hdfs_2.set_upstream(kinit_bash)
add_partition.set_upstream(check_hdfs_1)
add_partition.set_upstream(check_hdfs_2)
end_dag.set_upstream(add_partition)
Graph view
How can I make my hiveload task fail in case of both of hdfs_sensor fails?
UPDATE:
I have also tried using trigger rule all_done in end dag. Even then it is triggering end_dag when parent task skips.

I ran into this same issue and believe it is an underlying issue with airflow. I've opened up the following PR to fix this https://github.com/apache/airflow/pull/15467
2021-05-04: the fix has been merged and marked for the next (2.0.3) release.
2021-05-23: the fix has been deployed with the release of 2.1 (2.0.3 was cancelled)
Note: I've re-added my removed response to the meta conversation as I don't have the 50 required reputation to comment on the meta thread. Glancing at the edit guidelines it does not appear to have originally qualified for removal but if I've missed something and the following does not belong here please move this response to the meta thread on my behalf since I'm unable to.
As for the meta conversation: This answer was provided very much in the spirit captured by #Ivar's comment. So it is my intent that this answer helps Ayush and anyone else who stumbles across this issue while using <= 2.0.2. However, given the strong negative (-4 votes at the time of writing) reaction to this answer next time I'll just refrain from contributing back to SO as I only have so much time in the day and it has been made clear that identifying the issue, fixing the issue, and opening a PR does not meet SO's standard. ~Cheers

Related

Apache Airflow - Dag doesn't start even with start_date and schedule_interval defined

I am new at Airflow but I've defined a Dag to send a basic email every day at 9am. My DAG is the following one:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.dates import days_ago
date_log = str(datetime.today())
my_email = ''
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': days_ago(0),
'email': ['my_email'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
'concurrency': 1,
'max_active_runs': 1
}
with DAG('TEST', default_args=default_args, schedule_interval='0 9 * * *',max_active_runs=1, catchup=False) as dag:
t_teste = EmailOperator(dag=dag, task_id='successful_notification',
to='my_email',
subject='Airflow Dag ' + date_log,
html_content="""""")
t_teste
I've all the configurations as I needed, and I have the webserver and scheduler running. Also, I have my Dag active on UI. My problem is that my DAG seems to be doing nothing. It hasn't run for two days, and even if it passes the scheduled time, it doesn't run as expected. I have already tested and run my trigger manually, and it ran successfully. But if I wait for the trigger time, it does nothing.
Do you know how what I am doing wrong?
Thanks!
Your DAG will never be scheduled. Airflow schedule calculates state_date + schedule_interval and schedule the DAG at the END of the interval.
>>> import airflow
>>> from airflow.utils.dates import days_ago
>>> print(days_ago(0))
2021-06-26 00:00:00+00:00
Calculating 2021-06-26 (today) + schedule_interval it means that the DAG will run on 2021-06-27 09:00 however when we reach 2021-06-27 the calculation will produce 2021-06-28 09:00 and so on resulting in DAG never actually runs.
The conclusion is: never use dynamic values in start_date!
To solve your issue simply change:
'start_date': days_ago(0) to some static value like: 'start_date': datetime(2021,6,25)
note that if you are running older versions of Airflow you might also need to change the dag_id.

How to skip a task in airflow without skipping its downstream tasks?

Let’s say this is my dag:
A >> B >> C
If task B raises an exception, I want to skip the task instead of failing it. However, I don’t want to skip task C. I looked into AirflowSkipException and the soft_fail sensor but they both forcibly skip downstream tasks as well. Does anyone have a way to make this work?
Thanks!
Currently posted answers touch on different topic or does not seem to be fully correct.
Adding trigger rule all_failed to Task-C won't work for OP's example DAG: A >> B >> C unless Task-A ends in failed state, which most probably is not desirable.
OP was, in fact, very close because expected behavior can be achieved with mix of AirflowSkipException and none_failed trigger rule:
from datetime import datetime
from airflow.exceptions import AirflowSkipException
from airflow.models import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
with DAG(
dag_id="mydag",
start_date=datetime(2022, 1, 18),
schedule_interval="#once"
) as dag:
def task_b():
raise AirflowSkipException
A = DummyOperator(task_id="A")
B = PythonOperator(task_id="B", python_callable=task_b)
C = DummyOperator(task_id="C", trigger_rule="none_failed")
A >> B >> C
which Airflow executes as follows:
What this rule mean?
Trigger Rules
none_failed: All upstream tasks have not failed or upstream_failed -
that is, all upstream tasks have succeeded or been skipped
So basically we can catch the actual exception in our code and raise mentioned Airflow exception which "force" task state change from failed to skipped.
However, without the trigger_rule argument to Task-C we would end up with Task-B downstream marked as skipped.
You can refer to the Airflow documentation on trigger_rule.
trigger_rule allows you to configure the task's execution dependency. Generally, a task is executed when all upstream tasks succeed. You can change that to other trigger rules provided in Airflow. The all_failed trigger rule only executes a task when all upstream tasks fail, which would accomplish what you outlined.
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule
with DAG(
dag_id="my_dag",
start_date=datetime(2021, 4, 5),
schedule_interval='#once',
) as dag:
p = PythonOperator(
task_id='fail_task',
python_callable=lambda x: 1,
)
t = PythonOperator(
task_id='run_task',
python_callable=lambda: 1,
trigger_rule=TriggerRule.ALL_FAILED
)
p >> t
You can change the trigger_rule in your task declaration.
task = BashOperator(
task_id="task_C",
bash_command="echo hello world",
trigger_rule="all_done",
dag=dag
)

Airflow DAG not triggered at schedule time

Below is my DAG code, it is not getting triggered #scheduled time. If i run manually, it is working fine.
Not sure where is the problem. I tried testing two to three corn expression but without any luck.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#from datetime import timedelta
YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
default_args = {
'start_date': airflow.utils.dates.days_ago(0),
'retries': 1,
'retry_delay': datetime.timedelta(minutes=5)
}
dag = DAG(
'airflow_worker_pod_monitoring',
default_args=default_args,
description='liveness monitoring dag',
schedule_interval='0,00 15 * * *',
start_date=YESTERDAY,
dagrun_timeout=datetime.timedelta(minutes=60))
# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
task_id='monitor_pod',
bash_command='bash /home/airflow/gcs/data/testscript.sh' ,
dag=dag,
depends_on_past=False,
priority_weight=2**31-1)```
How to make this working?
There are 4 potential causes of this:
schedule_interval provided in default_args
Existing DAG's schedule_interval was modified
start_date is set to datetime.now()
start_date not aligned with schedule_interval
Your issue could be caused by 2. or 4. above since 1. & 3. are properly configured.
Airflow documentation recommends using static data https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date . Using dynamic date (such as airflow.utils.dates.days_ago(0)) as start_date is not advisable and may cause issues as the dag gets confused at 00:00 and switch to next day incorrectly. Set your start_date to be a fixed datetime and it should work correctly.
In general, the way the Airflow scheduling works is not always straight forward. I recommend this article going deeper into this topic https://towardsdatascience.com/airflow-schedule-interval-101-bbdda31cc463

Airflow keeps running my DAG, despite catchup=False, schedule_interval=datetime.timedelta(hours=2)

Similar to previous questions, but none of the answers given worked. I have a DAG:
import datetime
import os
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.operators import BashOperator
PROJECT = os.environ['PROJECT']
GCS_BUCKET = os.environ['BUCKET']
API_KEY = os.environ['API_KEY']
default_args = {
'owner': 'me',
'start_date': datetime.datetime(2019, 7, 30),
'depends_on_past': False,
'email': [''],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
'retry_delay': datetime.timedelta(hours=1),
'catchup': False
}
dag = DAG('dag-name',
schedule_interval=datetime.timedelta(hours=2),
default_args=default_args,
max_active_runs=1,
concurrency=1,
catchup=False)
DEFAULT_OPTIONS_TEMPLATE = {
'project': PROJECT,
'stagingLocation': 'gs://{}/staging'.format(GCS_BUCKET),
'tempLocation': 'gs://{}/temp'.format(GCS_BUCKET)
}
def my-dataflow-job(template_location, name):
run_time = datetime.datetime.utcnow()
a_value = run_time.strftime('%Y%m%d%H')
t1 = DataflowTemplateOperator(
task_id='{}-task'.format(name),
template=template_location,
parameters={'an_argument': a_value},
dataflow_default_options=DEFAULT_OPTIONS_TEMPLATE,
poll_sleep=30
)
t2 = BashOperator(
task_id='{}-loader-heartbeat'.format(name),
bash_command='curl --fail -XGET "[a heartbeat URL]" --header "Authorization: heartbeat_service {1}"'.format(name, API_KEY)
)
t1 >> t2
with dag:
backup_bt_to_bq('gs://[path to gcs]'.format(GCS_BUCKET), 'name')
As you can see, I'm trying very hard to prevent Airflow from trying to backfill. Yet, when I deploy the DAG (late in the day, on 7/30/2019), it just keeps running the DAG one after the other, after the other, after the other.
Since this task is moving a bit of data around, this is not desirable. How do I get airflow to respect the "run this every other hour" schedule_interval??
As you can see, I've set catchup: False in both the DAG args AND the default args (just in case, started with them in the DAG args). The retry delay is also a long period.
Each DAG run is reported as a success.
I'm running with the following version:
composer-1.5.0-airflow-1.10.1
My next step is kubernetes cron...
I suspect you did not have catchup=False when you first created the dag. I think airflow may not recognize changes to the catchup parameter after inital dag creation.
Try renaming it and see what happens. E.g. add a v2 and enable it. After enabling it, it will run once even though catchup is false, because there is a valid completed interval (i.e. current time is >= start_time + schedule_interval), but that is all.
Of course, test with a fake operator that doesn't do anything expensive.

GCP Composer (Airflow) operator

I'm using the GCP Composer API (Airflow) and my DAG to scale up the number of workers keep returning me the error below:
Broken DAG: [/home/airflow/gcs/dags/cluster_scale_workers.py] 'module' object has no attribute 'DataProcClusterScaleOperator'
Seems to be something related to the ScaleOperator, however when I look at the Airflow Read the Docs and cross check with my code, seems that nothing is wrong. What am I missing?
Is it related to GCP Airflow version?
Code:
import datetime
import os
from airflow import models
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule
yesterday = datetime.datetime.combine(
datetime.datetime.today() - datetime.timedelta(1),
datetime.datetime.min.time())
default_dag_args = {
'start_date': yesterday,
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': datetime.timedelta(minutes=5),
'project_id': models.Variable.get('gcp_project'),
'cluster_name': 'hive-cluster'
}
with models.DAG(
'scale_workers',
schedule_interval=datetime.timedelta(days=1),
default_args=default_dag_args) as dag:
scale_to_6_workers = dataproc_operator.DataprocClusterScaleOperator(
task_id='scale_dataproc_cluster_6',
cluster_name='hive-cluster',
num_workers=6,
num_preemptible_workers=3,
dag=dag
)
I managed to find the issue and sort it out. The comment provided by Ashish Kumar above is correct.
The problem was that the Airflow version I was using (1.9.0) did not support the DataProcClusterScaleOperator. I created another instance by activating BETA and choosing the version 1.10.0.
Which fixed my issue.