I have a demo DAG, whose source code is attached below.
The DAG is quite simple:
dummy_success >> one_failed >> none_failed
dummy_success is a dummy node that always succeeds.
one_failed is a task with trigger_rule=one_failed, so it gets skipped in this DAG.
none_failed is a task with trigger_rule=none_failed.
As explained in the Airflow documentation, the final task should be triggered because all of its parents are in state success or skipped (in this case, skipped). However, when I ran this in GCP Composer, the final task was skipped too.
I'm wondering why this doesn't behave as expected. What else can I do if I need my task to be triggered when its parent is either success or skipped?
My image version is composer-1.7.2-airflow-1.10.2
import datetime as dt
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
dag = DAG(
    dag_id='test_trigger_rule',
    schedule_interval='@once',
    start_date=dt.datetime(2019, 2, 28)
)
dummy_success = DummyOperator(task_id='dummy_success', dag=dag)
one_failed = DummyOperator(task_id='one_failed', dag=dag, trigger_rule='one_failed')
none_failed = DummyOperator(task_id='none_failed', dag=dag, trigger_rule='none_failed')
dummy_success >> one_failed >> none_failed
I tried adding another dummy node as an upstream of the none_failed task, and then it works as expected.
dummy_fix = DummyOperator(task_id='dummy_fix', dag=dag)
dummy_fix >> none_failed
It seems like the none_failed trigger_rule only works when the task has more than one upstream task?
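For completeness, here is the full DAG with the workaround applied (just the snippets above combined):
import datetime as dt

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='test_trigger_rule',
    schedule_interval='@once',
    start_date=dt.datetime(2019, 2, 28)
)

dummy_success = DummyOperator(task_id='dummy_success', dag=dag)
one_failed = DummyOperator(task_id='one_failed', dag=dag, trigger_rule='one_failed')
none_failed = DummyOperator(task_id='none_failed', dag=dag, trigger_rule='none_failed')
# Extra dummy task whose only purpose is to give none_failed a second upstream.
dummy_fix = DummyOperator(task_id='dummy_fix', dag=dag)

dummy_success >> one_failed >> none_failed
dummy_fix >> none_failed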
We deployed GCP Composer V2 with the most recent Airflow version. It works perfectly, but from time to time the predefined "airflow_monitoring" DAG crashes.
Here are the logs of the issue:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/********/o/logs%2Fairflow_monitoring%2Fecho%2F2021-12-14T12%3A36%3A55%2B00%3A00%2F1.log?alt=media: No such object: ********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)
We didn't change anything; this issue happens randomly.
Here is the code of the predefined "airflow_monitoring" DAG:
"""A liveness prober dag for monitoring composer.googleapis.com/environment/healthy."""
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG(
    'airflow_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval=None,
    dagrun_timeout=timedelta(minutes=20))
# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
    task_id='echo',
    bash_command='echo test',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31-1)
I think the log says everything:
*** Log file is not found: gs://********/logs/airflow_monitoring/echo/2021-12-14T12:36:55+00:00/1.log. The task might not have been executed or worker executing it might have finished abnormally (e.g. was evicted)
The Kubernetes environment might from time to time evict a running task (for example when it fails over to another node because a disk crashed or because the machines need to be restarted).
I think you should set retries to 2, and the task should then be retried automatically in such a case.
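For example, assuming you keep the default_args pattern from the DAG above, that would look roughly like this (a sketch):
import airflow
from datetime import timedelta

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 2,  # allow an extra attempt if the worker pod is evicted
    'retry_delay': timedelta(minutes=5)
}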
I am trying to use the airflow.providers.amazon.aws.operators.s3_list S3ListOperator to list files in an S3 bucket in my AWS account, with the operator below:
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default'
)
I have configured my Extra Connection details in the form of: {"aws_access_key_id": "<MY_ACCESS_KEY>", "aws_secret_access_key": "<MY_SECRET_KEY>"}
When I run my Airflow job, it appears to execute fine and my task status is Success. Here is the log output:
[2021-04-27 11:44:50,009] {base_aws.py:368} INFO - Airflow Connection: aws_conn_id=s3_default
[2021-04-27 11:44:50,013] {base_aws.py:170} INFO - Credentials retrieved from extra_config
[2021-04-27 11:44:50,013] {base_aws.py:84} INFO - Creating session with aws_access_key_id=<MY_ACCESS_KEY> region_name=None
[2021-04-27 11:44:50,027] {base_aws.py:157} INFO - role_arn is None
[2021-04-27 11:44:50,661] {taskinstance.py:1185} INFO - Marking task as SUCCESS. dag_id=two_step, task_id=list_files_in_bucket, execution_date=20210427T184422, start_date=20210427T184439, end_date=20210427T184450
[2021-04-27 11:44:50,676] {taskinstance.py:1246} INFO - 0 downstream tasks scheduled from follow-on schedule check
[2021-04-27 11:44:50,700] {local_task_job.py:146} INFO - Task exited with return code 0
Is there anything I can do to print the files in my bucket to Logs?
TIA
This code is enough, and you don't need to use the print function. Just open the corresponding task instance, go to its XCom, and the returned list is there.
list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='ob-air-pre',
    prefix='data/',
    delimiter='/',
    aws_conn_id='aws'
)
The result from executing S3ListOperator is an XCom object that is stored in the Airflow database after the task instance has completed.
You need to declare another operator to feed in the results from the S3ListOperator and print them out.
For example, in Airflow 2.0.0 and up you can use the TaskFlow API:
from airflow.models import DAG
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

@dag.task(task_id="print_objects")
def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
    dag=dag
)

print_objects(list_bucket.output)
In older versions:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.providers.amazon.aws.operators.s3_list import S3ListOperator
from airflow.utils import timezone

dag = DAG(
    dag_id='my-workflow',
    start_date=timezone.parse('2021-01-14 21:00')
)

def print_objects(objects):
    print(objects)

list_bucket = S3ListOperator(
    dag=dag,
    task_id='list_files_in_bucket',
    bucket='<MY_BUCKET>',
    aws_conn_id='s3_default',
)

print_objects_in_bucket = PythonOperator(
    dag=dag,
    task_id='print_objects_in_bucket',
    python_callable=print_objects,
    op_args=("{{ti.xcom_pull(task_ids='list_files_in_bucket')}}",)
)
list_bucket >> print_objects_in_bucket
Let's say this is my DAG:
A >> B >> C
If task B raises an exception, I want to skip the task instead of failing it. However, I don’t want to skip task C. I looked into AirflowSkipException and the soft_fail sensor but they both forcibly skip downstream tasks as well. Does anyone have a way to make this work?
Thanks!
The currently posted answers touch on a different topic or do not seem to be fully correct.
Adding the trigger rule all_failed to Task-C won't work for OP's example DAG A >> B >> C unless Task-A ends in a failed state, which most probably is not desirable.
OP was, in fact, very close, because the expected behavior can be achieved with a mix of AirflowSkipException and the none_failed trigger rule:
from datetime import datetime
from airflow.exceptions import AirflowSkipException
from airflow.models import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import PythonOperator
with DAG(
    dag_id="mydag",
    start_date=datetime(2022, 1, 18),
    schedule_interval="@once"
) as dag:

    def task_b():
        raise AirflowSkipException

    A = DummyOperator(task_id="A")
    B = PythonOperator(task_id="B", python_callable=task_b)
    C = DummyOperator(task_id="C", trigger_rule="none_failed")

    A >> B >> C
When Airflow executes this DAG, A succeeds, B ends up skipped, and C still runs.
What does this rule mean?
Trigger Rules
none_failed: All upstream tasks have not failed or upstream_failed -
that is, all upstream tasks have succeeded or been skipped
So basically we can catch the actual exception in our code and raise the mentioned Airflow exception, which "forces" the task state to change from failed to skipped.
However, without the trigger_rule argument on Task-C, everything downstream of Task-B would end up marked as skipped.
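For example, a callable that wraps its real work and converts any failure into a skip could look like this (a sketch; the ValueError stands in for whatever the task actually does):
from airflow.exceptions import AirflowSkipException

def task_b():
    try:
        # The task's real work goes here; a failure is simulated for illustration.
        raise ValueError("something went wrong")
    except Exception as exc:
        # Re-raise as a skip so the task ends up skipped instead of failed;
        # downstream tasks with trigger_rule="none_failed" will still run.
        raise AirflowSkipException(f"Skipping task B: {exc}")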
You can refer to the Airflow documentation on trigger_rule.
trigger_rule allows you to configure the task's execution dependency. Generally, a task is executed when all upstream tasks succeed. You can change that to other trigger rules provided in Airflow. The all_failed trigger rule only executes a task when all upstream tasks fail, which would accomplish what you outlined.
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.trigger_rule import TriggerRule
with DAG(
    dag_id="my_dag",
    start_date=datetime(2021, 4, 5),
    schedule_interval='@once',
) as dag:

    p = PythonOperator(
        task_id='fail_task',
        # The callable expects an argument it never receives, so this task fails.
        python_callable=lambda x: 1,
    )

    t = PythonOperator(
        task_id='run_task',
        python_callable=lambda: 1,
        trigger_rule=TriggerRule.ALL_FAILED
    )

    p >> t
You can change the trigger_rule in your task declaration.
task = BashOperator(
    task_id="task_C",
    bash_command="echo hello world",
    trigger_rule="all_done",
    dag=dag
)
Below is my DAG code; it is not getting triggered at the scheduled time. If I trigger it manually, it works fine.
Not sure where the problem is. I tried testing two or three cron expressions, but without any luck.
import datetime

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0,00 15 * * *',
    start_date=YESTERDAY,
    dagrun_timeout=datetime.timedelta(minutes=60))

# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
    task_id='monitor_pod',
    bash_command='bash /home/airflow/gcs/data/testscript.sh',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31-1)
How can I make this work?
There are 4 potential causes of this:
schedule_interval provided in default_args
Existing DAG's schedule_interval was modified
start_date is set to datetime.now()
start_date not aligned with schedule_interval
Your issue could be caused by 2. or 4. above since 1. & 3. are properly configured.
The Airflow documentation recommends using static dates: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date . Using a dynamic date (such as airflow.utils.dates.days_ago(0)) as start_date is not advisable and may cause issues, as the DAG can get confused at 00:00 and switch to the next day incorrectly. Set your start_date to a fixed datetime and it should work correctly.
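For example, a fixed start_date for this DAG could look like this (a sketch; the date itself is just an arbitrary static example):
import datetime

from airflow import DAG

default_args = {
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0,00 15 * * *',
    start_date=datetime.datetime(2021, 1, 1),  # static, not datetime.now() or days_ago(0)
    dagrun_timeout=datetime.timedelta(minutes=60))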
In general, the way Airflow scheduling works is not always straightforward. I recommend this article for a deeper dive into the topic: https://towardsdatascience.com/airflow-schedule-interval-101-bbdda31cc463
In my Airflow DAG I have 4 tasks:
task_1 >> [task_2, task_3] >> task_4
task_4 runs only after a successful run of both task_2 and task_3.
How do I set a condition such as:
if task_2 fails, retry task_2 after 2 minutes and stop retrying after the 5th attempt
This is my code :
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator
args = {
    'owner': 'Anti',
    'start_date': days_ago(1)  # 1 means yesterday
}

dag = DAG(dag_id='my_sample_dag', default_args=args, schedule_interval='15 * * * *')

def func1(**context):
    print("ran task 1")

def func2(**context):
    print("ran task 2")

def func3(**context):
    print("ran task 3")

def func4(**context):
    print("ran task 4")

with dag:
    task_1 = PythonOperator(
        task_id='task1',
        python_callable=func1,
        provide_context=True,
    )
    task_2 = PythonOperator(
        task_id='task2',
        python_callable=func2,
        provide_context=True
    )
    task_3 = PythonOperator(
        task_id='task3',
        python_callable=func3,
        provide_context=True
    )
    task_4 = PythonOperator(
        task_id='task4',
        python_callable=func4,
        provide_context=True
    )

    task_1 >> [task_2, task_3] >> task_4  # task_2 and task_3 run in parallel right after task_1 has run
Every operator supports retry_delay and retries - see the Airflow documentation:
retries (int) – the number of retries that should be performed before
failing the task
retry_delay (datetime.timedelta) – delay between retries
If you want to apply this for all of your tasks, you can just edit your args dictionary:
from datetime import timedelta  # needed for retry_delay

args = {
    'owner': 'Anti',
    'retries': 5,
    'retry_delay': timedelta(minutes=2),
    'start_date': days_ago(1)  # 1 means yesterday
}
If you just want to apply it to task_2, you can pass these arguments directly to the PythonOperator - in that case the other tasks use the default settings.
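For example, the task_2 declaration inside the with dag: block above might become something like this (a sketch; it also needs from datetime import timedelta at the top of the file):
    task_2 = PythonOperator(
        task_id='task2',
        python_callable=func2,
        provide_context=True,
        retries=5,                         # give up after the 5th retry attempt
        retry_delay=timedelta(minutes=2),  # wait 2 minutes between attempts
    )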
One comment on your args: it's not recommended to set a dynamic relative start_date; use a fixed absolute date instead.