Airflow DAG doesn't start based on `start_date`, it starts from now - airflow-scheduler

I have an Airflow DAG which I need to backfill. When I change the start_date and run the DAG, it ignores the start_date and just starts from the current date.
I copied my code to a new Python file, for example from 'dag_xx.py' to 'dag_xx_backfill.py', and changed the name of the DAG itself and of all its tasks. I also used the Delete button in the UI to clear the whole state of the DAG and start it over. Yet it still doesn't start from my desired start_date.
These are the relevant configs in the DAG's default_args and in the DAG definition:
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
    "catchup": True
}

test_dag_backfill = DAG(
    dag_id="test_dag_backfill",
    description="backfill the data",
    default_args=default_args,
    start_date=datetime(2020, 11, 1, 3, 0, tzinfo=local_tz),
    schedule_interval="0 * * * *",  # i.e. @hourly
    max_active_runs=1,
)
As you can see, the start_date is November 1st, but it starts from the current date (December 2nd).
Do you have any idea what I am missing here?

Well, I found the reason. Setting catchup in default_args doesn't work, because catchup is a DAG-level property, while default_args only provides default operator (task-level) arguments. What I did was pass catchup to the DAG constructor directly, and it worked.
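For reference, a minimal sketch of the corrected definition (the same DAG as above, with catchup simply moved out of default_args into the DAG constructor):

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

test_dag_backfill = DAG(
    dag_id="test_dag_backfill",
    description="backfill the data",
    default_args=default_args,
    start_date=datetime(2020, 11, 1, 3, 0, tzinfo=local_tz),
    schedule_interval="0 * * * *",
    max_active_runs=1,
    catchup=True,  # DAG-level property, so it belongs here, not in default_args
)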
Thanks to: https://stackoverflow.com/a/54692189/10874265

Related

Apache Airflow - Dag doesn't start even with start_date and schedule_interval defined

I am new to Airflow, but I've defined a DAG to send a basic email every day at 9 AM. My DAG is the following:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.dates import days_ago
date_log = str(datetime.today())
my_email = ''
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(0),
    'email': ['my_email'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'concurrency': 1,
    'max_active_runs': 1
}

with DAG('TEST', default_args=default_args, schedule_interval='0 9 * * *',
         max_active_runs=1, catchup=False) as dag:
    t_teste = EmailOperator(dag=dag, task_id='successful_notification',
                            to='my_email',
                            subject='Airflow Dag ' + date_log,
                            html_content="""""")
    t_teste
I have all the configuration I need, and the webserver and scheduler are running. My DAG is also active in the UI. My problem is that my DAG seems to do nothing: it hasn't run for two days, and even when the scheduled time passes, it doesn't run as expected. I have already triggered it manually and it ran successfully, but if I wait for the scheduled time, nothing happens.
Do you know what I am doing wrong?
Thanks!
Your DAG will never be scheduled. The Airflow scheduler calculates start_date + schedule_interval and schedules the DAG at the END of the interval.
>>> import airflow
>>> from airflow.utils.dates import days_ago
>>> print(days_ago(0))
2021-06-26 00:00:00+00:00
Calculating 2021-06-26 (today) + schedule_interval means the DAG would run on 2021-06-27 09:00; however, when we reach 2021-06-27, the calculation will produce 2021-06-28 09:00, and so on, so the DAG never actually runs.
The conclusion is: never use dynamic values in start_date!
To solve your issue simply change:
'start_date': days_ago(0) to some static value like: 'start_date': datetime(2021,6,25)
Note that if you are running older versions of Airflow, you might also need to change the dag_id (see the sketch below).
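Putting it together, a minimal sketch of the corrected DAG (the TEST_v2 dag_id and the 2021-06-25 date are just examples of the rename and static start_date suggested above):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.email_operator import EmailOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2021, 6, 25),  # static value instead of days_ago(0)
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('TEST_v2', default_args=default_args, schedule_interval='0 9 * * *',
         max_active_runs=1, catchup=False) as dag:
    t_teste = EmailOperator(
        task_id='successful_notification',
        to='my_email',           # placeholder address, as in the question
        subject='Airflow Dag test',
        html_content='')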

Airflow - GCP - files from DAG folder are not showing up

I'm new to GCP. I have a sample Python script created in a GCP environment which is running fine. I want to schedule this in Airflow. I copied the file into the DAG folder of the environment (gs://us-west2-*******-6f9ce4ef-bucket/dags), but it's not showing up in the Airflow DAG list.
This is the location in airflow config.
dags_folder = /home/airflow/gcs/dags
Please let me know how to get my Python code to show up in Airflow. Do I have to set up anything else? I kept everything at the defaults.
Thanks in advance.
What you did is already correct: placing your Python script in gs://auto-generated-bucket/dags/ is the right location. I'm not sure whether your script actually uses the airflow library, but it needs to, since that library is what defines a DAG and its behavior so Airflow can pick it up. You can see an example in the Cloud Composer quickstart.
You can check an in-depth tutorial of DAGs here.
Sample DAG (test_dag.py) that prints the dag_run.id:
# test_dag.py
import datetime

import airflow
from airflow.operators import bash_operator

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': YESTERDAY,
}

with airflow.DAG(
        'this_is_the_test_dag',  # <-- This string is displayed in the Airflow web interface as the DAG name
        catchup=False,
        default_args=default_args,
        schedule_interval=datetime.timedelta(days=1)) as dag:

    # Print the dag_run id from the Airflow logs
    print_dag_run_conf = bash_operator.BashOperator(
        task_id='print_dag_run_conf', bash_command='echo {{ dag_run.id }}')
(Screenshots in the original answer show the file in the gs://auto-generated-bucket/dags/ GCS location and the DAG listed in the Airflow web server.)

Airflow DAG not triggered at schedule time

Below is my DAG code; it is not getting triggered at the scheduled time. If I run it manually, it works fine.
Not sure where the problem is. I tried two or three cron expressions, but without any luck.
import datetime

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
#from datetime import timedelta

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0,00 15 * * *',
    start_date=YESTERDAY,
    dagrun_timeout=datetime.timedelta(minutes=60))

# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
    task_id='monitor_pod',
    bash_command='bash /home/airflow/gcs/data/testscript.sh',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31-1)
How can I make this work?
There are 4 potential causes of this:
1. schedule_interval provided in default_args
2. Existing DAG's schedule_interval was modified
3. start_date is set to datetime.now()
4. start_date not aligned with schedule_interval
Your issue could be caused by 2. or 4. above since 1. & 3. are properly configured.
The Airflow documentation recommends using a static start_date: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date . Using a dynamic date (such as airflow.utils.dates.days_ago(0)) as start_date is not advisable and may cause issues, as the DAG gets confused at 00:00 and switches to the next day incorrectly. Set your start_date to a fixed datetime and it should work correctly, as in the sketch below.
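A minimal sketch of that change applied to the DAG from the question (the 2021-06-20 date is just an arbitrary static example):

import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0 15 * * *',              # equivalent to the original '0,00 15 * * *'
    start_date=datetime.datetime(2021, 6, 20),   # static, not days_ago(0) or now()
    dagrun_timeout=datetime.timedelta(minutes=60))

t1 = BashOperator(
    task_id='monitor_pod',
    bash_command='bash /home/airflow/gcs/data/testscript.sh',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31-1)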
In general, the way Airflow scheduling works is not always straightforward. I recommend this article, which goes deeper into the topic: https://towardsdatascience.com/airflow-schedule-interval-101-bbdda31cc463

Airflow skipping task on ONE_SUCCESS trigger rule

I am using the one_success trigger rule so that if any one of the parent tasks passes, the child task runs, which is happening as expected. However, I am having an issue when both parents fail: in this case the child task is skipped instead of failed. Below is the DAG implementation.
import logging

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.web_hdfs_sensor import WebHdfsSensor
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

circles = ['101', '102', '103']

def load_hive_partition(circle, **kwargs):
    return HiveOperator(
        task_id='add_partition_{}'.format(circle),
        hql='alter table abc.def add partition (event_date="{{ ds }}", circle="'+circle+'") location "/user/cloudera/hive/abc/def/event_date={{ ds }}/circle='+circle+'"',
        trigger_rule=TriggerRule.ONE_SUCCESS,
        dag=dag)

def check_hdfs_node1(circle, **kwargs):
    return WebHdfsSensor(
        task_id='source_data_sensor_node1_{}'.format(circle),
        webhdfs_conn_id='webhdfs_default_1',
        filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
        timeout=60 * 60 * 24,
        dag=dag
    )

def check_hdfs_node2(circle, **kwargs):
    return WebHdfsSensor(
        task_id='source_data_sensor_node2_{}'.format(circle),
        webhdfs_conn_id='webhdfs_default',
        filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
        timeout=60 * 60 * 24,
        dag=dag
    )

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'run_as_user': 'airflow',
    'retries': 3,
    'start_date': datetime(year=2020, month=8, day=30),
    'retry_delay': timedelta(minutes=10)
}

dag = DAG(dag_id="HiveLoadPartition_circle",
          default_args=args,
          schedule_interval='30 18 * * *',
          catchup=False)

kinit_bash = BashOperator(
    task_id='kinit_bash',
    bash_command='kinit -kt /usr/local/airflow/keytab.keytab appuser@cloudera.com',
    dag=dag)

#start_dag = DummyOperator(task_id='start_dag', dag=dag)
end_dag = DummyOperator(task_id='end_dag', trigger_rule=TriggerRule.ALL_DONE, dag=dag)

for circle in circles:
    add_partition = load_hive_partition(circle)
    check_hdfs_1 = check_hdfs_node1(circle)
    check_hdfs_2 = check_hdfs_node2(circle)
    check_hdfs_1.set_upstream(kinit_bash)
    check_hdfs_2.set_upstream(kinit_bash)
    add_partition.set_upstream(check_hdfs_1)
    add_partition.set_upstream(check_hdfs_2)
    end_dag.set_upstream(add_partition)
(Screenshot: DAG graph view.)
How can I make my Hive load task fail when both HDFS sensors fail?
UPDATE:
I have also tried using the all_done trigger rule on end_dag. Even then, end_dag is triggered when the parent task is skipped.
I ran into this same issue and believe it is an underlying bug in Airflow. I've opened the following PR to fix it: https://github.com/apache/airflow/pull/15467
2021-05-04: the fix has been merged and marked for the next (2.0.3) release.
2021-05-23: the fix has been deployed with the release of 2.1 (2.0.3 was cancelled)
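Until you can upgrade to a release that includes the fix, one possible workaround (my own sketch, not something from the PR or this thread; the _fail_if_all_upstream_failed helper and the check_sensors_<circle> task are hypothetical names) is to add an extra task per circle, downstream of both sensors, with the all_done trigger rule, that explicitly fails when every upstream sensor failed so the DAG run is not silently marked as skipped/successful:

from airflow.exceptions import AirflowException
from airflow.operators.python_operator import PythonOperator
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule

def _fail_if_all_upstream_failed(**context):
    # Look up the state of every direct upstream task in this DAG run
    # and raise if all of them failed.
    task = context['task']
    dag_run = context['dag_run']
    states = [dag_run.get_task_instance(task_id).state
              for task_id in task.upstream_task_ids]
    if states and all(state == State.FAILED for state in states):
        raise AirflowException('All upstream HDFS sensors failed')

# Intended to sit inside the original "for circle in circles:" loop,
# wired downstream of check_hdfs_1 and check_hdfs_2.
check_sensors = PythonOperator(
    task_id='check_sensors_{}'.format(circle),
    python_callable=_fail_if_all_upstream_failed,
    provide_context=True,      # required on Airflow 1.10.x; deprecated/ignored on 2.x
    trigger_rule=TriggerRule.ALL_DONE,
    dag=dag)

check_sensors.set_upstream(check_hdfs_1)
check_sensors.set_upstream(check_hdfs_2)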

Airflow keeps running my DAG, despite catchup=False, schedule_interval=datetime.timedelta(hours=2)

Similar to previous questions, but none of the answers given worked. I have a DAG:
import datetime
import os
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.operators import BashOperator
PROJECT = os.environ['PROJECT']
GCS_BUCKET = os.environ['BUCKET']
API_KEY = os.environ['API_KEY']
default_args = {
    'owner': 'me',
    'start_date': datetime.datetime(2019, 7, 30),
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': datetime.timedelta(hours=1),
    'catchup': False
}

dag = DAG('dag-name',
          schedule_interval=datetime.timedelta(hours=2),
          default_args=default_args,
          max_active_runs=1,
          concurrency=1,
          catchup=False)

DEFAULT_OPTIONS_TEMPLATE = {
    'project': PROJECT,
    'stagingLocation': 'gs://{}/staging'.format(GCS_BUCKET),
    'tempLocation': 'gs://{}/temp'.format(GCS_BUCKET)
}

def my_dataflow_job(template_location, name):
    run_time = datetime.datetime.utcnow()
    a_value = run_time.strftime('%Y%m%d%H')
    t1 = DataflowTemplateOperator(
        task_id='{}-task'.format(name),
        template=template_location,
        parameters={'an_argument': a_value},
        dataflow_default_options=DEFAULT_OPTIONS_TEMPLATE,
        poll_sleep=30
    )
    t2 = BashOperator(
        task_id='{}-loader-heartbeat'.format(name),
        bash_command='curl --fail -XGET "[a heartbeat URL]" --header "Authorization: heartbeat_service {1}"'.format(name, API_KEY)
    )
    t1 >> t2

with dag:
    my_dataflow_job('gs://[path to gcs]'.format(GCS_BUCKET), 'name')
As you can see, I'm trying very hard to prevent Airflow from backfilling. Yet when I deploy the DAG (late in the day, on 7/30/2019), it just keeps launching DAG runs one after the other, after the other, after the other.
Since this task moves a bit of data around, this is not desirable. How do I get Airflow to respect the "run this every two hours" schedule_interval?
As you can see, I've set catchup=False in both the DAG arguments AND the default_args (just in case; I started with it only in the DAG arguments). The retry delay is also a long period.
Each DAG run is reported as a success.
I'm running with the following version:
composer-1.5.0-airflow-1.10.1
My next step is kubernetes cron...
I suspect you did not have catchup=False when you first created the DAG. I think Airflow may not recognize changes to the catchup parameter after the initial DAG creation.
Try renaming it and see what happens, e.g. add a v2 suffix and enable it. After enabling it, it will run once even though catchup is False, because there is one valid completed interval (i.e. the current time is >= start_date + schedule_interval), but that is all.
Of course, test with a fake operator that doesn't do anything expensive, as in the sketch below.
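For instance, a minimal sketch of the renamed test DAG (the 'dag-name-v2' id and the DummyOperator are placeholders for illustration):

import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'me',
    'start_date': datetime.datetime(2019, 7, 30),
    'depends_on_past': False,
    'retries': 0,
}

# Renamed copy of the DAG so the scheduler picks up catchup=False cleanly.
dag = DAG('dag-name-v2',
          schedule_interval=datetime.timedelta(hours=2),
          default_args=default_args,
          max_active_runs=1,
          concurrency=1,
          catchup=False)

# Cheap placeholder task: verify the schedule behaves as expected before
# swapping the real Dataflow and heartbeat tasks back in.
noop = DummyOperator(task_id='noop', dag=dag)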