How to skip a task in airflow without skipping its downstream tasks? - airflow-scheduler

Let’s say this is my dag:
A >> B >> C
If task B raises an exception, I want to skip the task instead of failing it. However, I don’t want to skip task C. I looked into AirflowSkipException and the soft_fail sensor but they both forcibly skip downstream tasks as well. Does anyone have a way to make this work?
Thanks!

The currently posted answers touch on a different topic or do not seem to be fully correct.
Adding the all_failed trigger rule to Task-C won't work for OP's example DAG A >> B >> C unless Task-A ends in the failed state, which most probably is not desirable.
OP was, in fact, very close, because the expected behavior can be achieved with a mix of AirflowSkipException and the none_failed trigger rule:
from datetime import datetime
from airflow.exceptions import AirflowSkipException
from airflow.models import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
with DAG(
    dag_id="mydag",
    start_date=datetime(2022, 1, 18),
    schedule_interval="@once",
) as dag:

    def task_b():
        raise AirflowSkipException

    A = DummyOperator(task_id="A")
    B = PythonOperator(task_id="B", python_callable=task_b)
    C = DummyOperator(task_id="C", trigger_rule="none_failed")

    A >> B >> C
When Airflow executes this DAG, Task-B ends up skipped while Task-C still runs.
What does this rule mean?
Trigger Rules
none_failed: All upstream tasks have not failed or upstream_failed -
that is, all upstream tasks have succeeded or been skipped
So basically we can catch the actual exception in our code and raise AirflowSkipException instead, which "forces" the task state to change from failed to skipped.
However, without the trigger_rule argument on Task-C, we would end up with Task-B's downstream tasks marked as skipped as well.
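For instance, a minimal sketch of catching a real exception inside the callable and converting it into a skip (the do_work call and the exception type are hypothetical placeholders, not part of OP's code):

from airflow.exceptions import AirflowSkipException

def task_b():
    try:
        do_work()  # hypothetical placeholder for the task's real logic
    except ValueError as err:  # assumed exception type; substitute whatever your code raises
        # Re-raising as AirflowSkipException marks Task-B as "skipped" instead of "failed";
        # with trigger_rule="none_failed" on Task-C, execution still continues downstream.
        raise AirflowSkipException(f"Skipping Task-B: {err}")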

You can refer to the Airflow documentation on trigger_rule.
trigger_rule allows you to configure the task's execution dependency. Generally, a task is executed when all upstream tasks succeed. You can change that to other trigger rules provided in Airflow. The all_failed trigger rule only executes a task when all upstream tasks fail, which would accomplish what you outlined.
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule
with DAG(
    dag_id="my_dag",
    start_date=datetime(2021, 4, 5),
    schedule_interval='@once',
) as dag:
    p = PythonOperator(
        task_id='fail_task',
        # intentionally fails: the callable expects an argument that is never passed
        python_callable=lambda x: 1,
    )
    t = PythonOperator(
        task_id='run_task',
        python_callable=lambda: 1,
        trigger_rule=TriggerRule.ALL_FAILED,
    )
    p >> t

You can change the trigger_rule in your task declaration.
task = BashOperator(
    task_id="task_C",
    bash_command="echo hello world",
    trigger_rule="all_done",
    dag=dag,
)

Related

GCP ComposerV2 missing log files

We deployed GCP Composer V2 with the most recent Airflow version. It works perfectly, but from time to time the predefined "airflow_monitoring" DAG crashes.
Here are the logs of the issue:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/********/o/logs%2Fairflow_monitoring%2Fecho%2F2021-12-14T12%3A36%3A55%2B00%3A00%2F1.log?alt=media: No such object: ********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
We don't change anything, this issue has happened randomly.
Here is the code of "airflow_monitoring" predefined DAG:
"""A liveness prober dag for monitoring composer.googleapis.com/environment/healthy."""
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'airflow_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval=None,
    dagrun_timeout=timedelta(minutes=20))

# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
    task_id='echo',
    bash_command='echo test',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31-1)
I think the log says everything:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
A Kubernetes environment might from time to time evict a running task (for example when it fails over to another node because a disk crashed or because the machine needs to be restarted).
I think you should set retries to 2, and the task should then automatically be retried in such a case.
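For instance, a minimal sketch of that change in the DAG's default_args (only retries is bumped; everything else stays as in the predefined DAG above):

from datetime import timedelta

import airflow

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 2,                       # retry a second time if the worker pod is evicted
    'retry_delay': timedelta(minutes=5)
}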

Airflow DAG not triggered at schedule time

Below is my DAG code; it is not getting triggered at the scheduled time. If I trigger it manually, it works fine.
I'm not sure where the problem is. I tried testing two or three cron expressions, but without any luck.
import datetime

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0,00 15 * * *',
    start_date=YESTERDAY,
    dagrun_timeout=datetime.timedelta(minutes=60))

# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
    task_id='monitor_pod',
    bash_command='bash /home/airflow/gcs/data/testscript.sh',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31-1)
How can I make this work?
There are 4 potential causes of this:
schedule_interval provided in default_args
Existing DAG's schedule_interval was modified
start_date is set to datetime.now()
start_date not aligned with schedule_interval
Your issue could be caused by 2. or 4. above, since 1. and 3. are properly configured.
The Airflow documentation recommends using static dates: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date. Using a dynamic date (such as airflow.utils.dates.days_ago(0)) as start_date is not advisable and may cause issues, as the DAG gets confused at 00:00 and incorrectly switches to the next day. Set your start_date to a fixed datetime and it should work correctly.
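For example, a minimal sketch with a static start_date (the exact date is an assumption; pick any fixed date in the past):

import datetime

default_args = {
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
    # no dynamic start_date in default_args anymore
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0,00 15 * * *',
    start_date=datetime.datetime(2021, 1, 1),  # static date instead of days_ago(0) / datetime.now()
    dagrun_timeout=datetime.timedelta(minutes=60))

Keep in mind that with a daily 15:00 schedule the run for a given day is only triggered once that scheduling interval has ended.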
In general, the way Airflow scheduling works is not always straightforward. I recommend this article for going deeper into the topic: https://towardsdatascience.com/airflow-schedule-interval-101-bbdda31cc463

Airflow skipping task on ONE_SUCCESS trigger rule

I am using the one_success trigger rule so that if any one of the parent tasks passes then the child task runs, which is happening as expected. However, I am getting an issue when both parents fail: in this case the child task is getting skipped instead of failed. Below is the DAG implementation:
import logging
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.web_hdfs_sensor import WebHdfsSensor
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
circles = ['101','102','103']
def load_hive_partition(circle, **kwargs):
    return HiveOperator(
        task_id='add_partition_{}'.format(circle),
        hql='alter table abc.def add partition (event_date="{{ ds }}", circle="'+circle+'") location "/user/cloudera/hive/abc/def/event_date={{ ds }}/circle='+circle+'"',
        trigger_rule=TriggerRule.ONE_SUCCESS,
        dag=dag)

def check_hdfs_node1(circle, **kwargs):
    return WebHdfsSensor(
        task_id='source_data_sensor_node1_{}'.format(circle),
        webhdfs_conn_id='webhdfs_default_1',
        filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
        timeout=60 * 60 * 24,
        dag=dag
    )

def check_hdfs_node2(circle, **kwargs):
    return WebHdfsSensor(
        task_id='source_data_sensor_node2_{}'.format(circle),
        webhdfs_conn_id='webhdfs_default',
        filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
        timeout=60 * 60 * 24,
        dag=dag
    )

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'run_as_user': 'airflow',
    'retries': 3,
    'start_date': datetime(year=2020, month=8, day=30),
    'retry_delay': timedelta(minutes=10)
}

dag = DAG(dag_id="HiveLoadPartition_circle",
          default_args=args,
          schedule_interval='30 18 * * *',
          catchup=False)

kinit_bash = BashOperator(
    task_id='kinit_bash',
    bash_command='kinit -kt /usr/local/airflow/keytab.keytab appuser@cloudera.com',
    dag=dag)

#start_dag = DummyOperator(task_id='start_dag', dag=dag)
end_dag = DummyOperator(task_id='end_dag', trigger_rule=TriggerRule.ALL_DONE, dag=dag)

for circle in circles:
    add_partition = load_hive_partition(circle)
    check_hdfs_1 = check_hdfs_node1(circle)
    check_hdfs_2 = check_hdfs_node2(circle)
    check_hdfs_1.set_upstream(kinit_bash)
    check_hdfs_2.set_upstream(kinit_bash)
    add_partition.set_upstream(check_hdfs_1)
    add_partition.set_upstream(check_hdfs_2)
    end_dag.set_upstream(add_partition)
How can I make my Hive load task fail when both HDFS sensors fail?
UPDATE:
I have also tried using the trigger rule all_done on end_dag. Even then, end_dag is triggered when the parent task is skipped.
I ran into this same issue and believe it is an underlying issue with Airflow. I've opened the following PR to fix it: https://github.com/apache/airflow/pull/15467
2021-05-04: the fix has been merged and marked for the next (2.0.3) release.
2021-05-23: the fix has been deployed with the release of 2.1 (2.0.3 was cancelled).
Note: I've re-added my removed response to the meta conversation as I don't have the 50 required reputation to comment on the meta thread. Glancing at the edit guidelines it does not appear to have originally qualified for removal but if I've missed something and the following does not belong here please move this response to the meta thread on my behalf since I'm unable to.
As for the meta conversation: This answer was provided very much in the spirit captured by @Ivar's comment. So it is my intent that this answer helps Ayush and anyone else who stumbles across this issue while using <= 2.0.2. However, given the strong negative (-4 votes at the time of writing) reaction to this answer, next time I'll just refrain from contributing back to SO, as I only have so much time in the day and it has been made clear that identifying the issue, fixing the issue, and opening a PR does not meet SO's standard. ~Cheers

How to set a number as retry condition in airflow DAG?

In my Airflow DAG I have 4 tasks:
task_1 >> [task_2, task_3] >> task_4
task_4 runs only after a successful run of both task_2 and task_3.
How do I set a condition such as:
if task_2 fails, retry task_2 after 2 minutes and stop retrying after the 5th attempt?
This is my code :
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator
args = {
    'owner': 'Anti',
    'start_date': days_ago(1)  # 1 means yesterday
}

dag = DAG(dag_id='my_sample_dag', default_args=args, schedule_interval='15 * * * *')

def func1(**context):
    print("ran task 1")

def func2(**context):
    print("ran task 2")

def func3(**context):
    print("ran task 3")

def func4(**context):
    print("ran task 4")

with dag:
    task_1 = PythonOperator(
        task_id='task1',
        python_callable=func1,
        provide_context=True,
    )
    task_2 = PythonOperator(
        task_id='task2',
        python_callable=func2,
        provide_context=True
    )
    task_3 = PythonOperator(
        task_id='task3',
        python_callable=func3,
        provide_context=True
    )
    task_4 = PythonOperator(
        task_id='task4',
        python_callable=func4,
        provide_context=True
    )

    task_1 >> [task_2, task_3] >> task_4  # t2, t3 run in parallel right after t1 has run
Every operator supports retry_delay and retries - see the Airflow documentation:
retries (int) – the number of retries that should be performed before failing the task
retry_delay (datetime.timedelta) – delay between retries
If you want to apply this for all of your tasks, you can just edit your args dictionary:
from datetime import timedelta

args = {
    'owner': 'Anti',
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    'start_date': days_ago(1)  # 1 means yesterday
}
If you just want to apply it to task_2 you can pass it directly to PythonOperator - in that case the other tasks use the default settings.
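For instance, a minimal sketch of what that could look like for task_2 only (the same operator as in your DAG, just with the two extra arguments):

from datetime import timedelta

task_2 = PythonOperator(
    task_id='task2',
    python_callable=func2,
    provide_context=True,
    retries=5,                         # stop retrying after the 5th attempt
    retry_delay=timedelta(minutes=2),  # wait 2 minutes between attempts
)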
One comment on your args: it's not recommended to set a dynamic, relative start_date; use a fixed, absolute date instead.

Airflow trigger_rule=none_failed not working

I have a demo DAG, whose source code is attached below.
The dag is quite simple:
dummy_success >> one_failed >> none_failed
dummy_success is a dummy node and succeeds no matter what.
one_failed is a task with trigger_rule=one_failed, so it would be skipped in the DAG.
none_failed is a task with trigger_rule=none_failed.
As explained in the Airflow documentation, the final task should be triggered because all of its parents are in the success or skipped state (in this case skipped). However, when I ran this in GCP Composer, the final task is skipped too.
I'm wondering why this doesn't behave as expected. And what else can I do if I need my task to be triggered when its parent is success or skipped?
My image version is composer-1.7.2-airflow-1.10.2
import datetime as dt
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
dag = DAG(
    dag_id='test_trigger_rule',
    schedule_interval='@once',
    start_date=dt.datetime(2019, 2, 28)
)

dummy_success = DummyOperator(task_id='dummy_success', dag=dag)
one_failed = DummyOperator(task_id='one_failed', dag=dag, trigger_rule="one_failed")
none_failed = DummyOperator(task_id='none_failed', dag=dag, trigger_rule='none_failed')

dummy_success >> one_failed >> none_failed
I tried adding another dummy node as upstream of the none_failed task, and then it works as expected.
dummy_fix = DummyOperator(task_id='dummy_fix', dag=dag)
dummy_fix >> none_failed
It seems like the none_failed trigger_rule works only when the task has more than one upstream task?