Apache Airflow - Dag doesn't start even with start_date and schedule_interval defined - airflow-scheduler

I am new to Airflow, but I've defined a DAG to send a basic email every day at 9am. My DAG is the following one:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow.operators.email_operator import EmailOperator
from airflow.utils.dates import days_ago

date_log = str(datetime.today())
my_email = ''

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago(0),
    'email': ['my_email'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'concurrency': 1,
    'max_active_runs': 1
}

with DAG('TEST', default_args=default_args, schedule_interval='0 9 * * *',
         max_active_runs=1, catchup=False) as dag:
    t_teste = EmailOperator(dag=dag, task_id='successful_notification',
                            to='my_email',
                            subject='Airflow Dag ' + date_log,
                            html_content="""""")
    t_teste
I have all the configurations I need, and I have the webserver and scheduler running. I also have my DAG active in the UI. My problem is that my DAG seems to be doing nothing. It hasn't run for two days, and even after the scheduled time passes, it doesn't run as expected. I have already triggered it manually, and it ran successfully. But if I wait for the scheduled time, it does nothing.
Do you know what I am doing wrong?
Thanks!

Your DAG will never be scheduled. The Airflow scheduler calculates start_date + schedule_interval and schedules the DAG at the END of that interval.
>>> import airflow
>>> from airflow.utils.dates import days_ago
>>> print(days_ago(0))
2021-06-26 00:00:00+00:00
Calculating 2021-06-26 (today) + schedule_interval means the DAG would run on 2021-06-27 09:00; however, when we reach 2021-06-27 the calculation produces 2021-06-28 09:00, and so on, so the DAG never actually runs.
The conclusion is: never use dynamic values in start_date!
To solve your issue, simply change:
'start_date': days_ago(0)
to a static value like:
'start_date': datetime(2021, 6, 25)
Note that if you are running an older version of Airflow, you might also need to change the dag_id.
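Putting it together, a minimal sketch of the DAG with a static start_date, as suggested above (the recipient address is a placeholder, not a value from the original post):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.email_operator import EmailOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2021, 6, 25),  # static value, evaluated once per parse
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

with DAG('TEST', default_args=default_args, schedule_interval='0 9 * * *',
         max_active_runs=1, catchup=False) as dag:
    notify = EmailOperator(
        task_id='successful_notification',
        to='someone@example.com',  # placeholder recipient
        subject='Airflow DAG test notification',
        html_content='')

With this static start_date, the interval that opens at 2021-06-25 09:00 closes at 2021-06-26 09:00, so the first run is triggered then and the DAG runs daily at 09:00 afterwards.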

Related

Dataflow job not triggering on cloud from Composer (Airflow)

I am trying to execute an Apache Beam pipeline from Composer and am facing the issue below: the job does not get triggered on GCP.
Job log (some values below are parameterized so as not to reveal company-specific details):
INFO - Running command: java -jar /tmp/dataflow40103bb6-GcsToBqDataIngestion.jar --runner=DataflowRunner --project=<project_id> --zone=northamerica-northeast1-a --stagingLocation=gs:// --maxNumWorkers=1 --tempLocation=<> --region=northamerica-northeast1 --subnetwork=<network_link> --serviceAccount= --usePublicIps=false --pipelineConfig=pipeline_config/pgp_comm_apps.properties --workerMachineType=n1-standard-2 --env=dev --jobName=test-ingestion-7e20f260#-#{"workflow": "fds-test-dataflow", "task-id": "load-data", "execution-date": "2022-08-02T19:31:51.473861+00:00"}
DAG code:
import datetime

from airflow import models
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowJavaOperator
# Operators; we need this to operate!
from airflow.operators.bash_operator import BashOperator

default_dag_args = {
    # The start_date describes when a DAG is valid / can be run. Set this to a
    # fixed point in time rather than dynamically, since it is evaluated every
    # time a DAG is parsed. See:
    # https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime.datetime(2022, 7, 27),
    'email': ['test@test.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'dataflow_default_options': {
        'project': '<prod_id>',
        # "region": "northamerica-northeast1",
        "zone": "northamerica-northeast1-a",
        'stagingLocation': 'gs://location',
    }
    # 'retry_delay': timedelta(minutes=30),
}

# Define a DAG (directed acyclic graph) of tasks.
# Any task you create within the context manager is automatically added to the
# DAG object.
with models.DAG(
        'fds-test-dataflow',
        catchup=False,
        schedule_interval=None,
        # schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    task = DataFlowJavaOperator(
        task_id='load-data',
        gcp_conn_id="gcp_connection",
        job_name='test-ingestion',
        jar='gs://path_to_jar',
        delegate_to="<SA>",
        location='northamerica-northeast1',
        options={
            'maxNumWorkers': '1',
            'project': '<proj_id>',
            'tempLocation': 'gs://location/',
            'region': 'northamerica-northeast1',
            "zone": "northamerica-northeast1-a",
            'subnetwork': 'network',
            'serviceAccount': 'SA',
            'usePublicIps': 'false',
            'pipelineConfig': 'pipeline_config/pgp_comm_apps.properties',
            "currentTms": '"2022-06-28 10:00:00"',
            'labels': {},
            'workerMachineType': 'n1-standard-2',
            'env': 'dev'
        },
        dag=dag,)
    task

Airflow DAG not triggered at schedule time

Below is my DAG code; it is not getting triggered at the scheduled time. If I run it manually, it works fine.
Not sure where the problem is. I tried two or three cron expressions, but without any luck.
import datetime

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
# from datetime import timedelta

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0,00 15 * * *',
    start_date=YESTERDAY,
    dagrun_timeout=datetime.timedelta(minutes=60))

# priority_weight has type int in Airflow DB, uses the maximum.
t1 = BashOperator(
    task_id='monitor_pod',
    bash_command='bash /home/airflow/gcs/data/testscript.sh',
    dag=dag,
    depends_on_past=False,
    priority_weight=2**31 - 1)
How do I make this work?
There are 4 potential causes of this:
schedule_interval provided in default_args
Existing DAG's schedule_interval was modified
start_date is set to datetime.now()
start_date not aligned with schedule_interval
Your issue could be caused by 2. or 4. above since 1. & 3. are properly configured.
The Airflow documentation recommends using a static start_date: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-s-the-deal-with-start-date . Using a dynamic date (such as airflow.utils.dates.days_ago(0)) as start_date is not advisable and may cause issues, as the DAG gets confused at 00:00 and switches to the next day incorrectly. Set your start_date to a fixed datetime and it should work correctly.
In general, the way Airflow scheduling works is not always straightforward. I recommend this article, which goes deeper into the topic: https://towardsdatascience.com/airflow-schedule-interval-101-bbdda31cc463
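For illustration only (not the poster's exact code), a minimal sketch of the same DAG with a static start_date, which is the fix suggested above; the specific date is an arbitrary example:

import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'start_date': datetime.datetime(2021, 1, 1),  # static, evaluated once per parse
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5)
}

dag = DAG(
    'airflow_worker_pod_monitoring',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='0 15 * * *',  # daily at 15:00, equivalent to '0,00 15 * * *'
    dagrun_timeout=datetime.timedelta(minutes=60))

t1 = BashOperator(
    task_id='monitor_pod',
    bash_command='bash /home/airflow/gcs/data/testscript.sh',
    dag=dag)

# Each run is triggered at the end of its interval: the interval that opens at
# 15:00 on day N is executed at 15:00 on day N+1.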

Airflow skipping task on ONE_SUCCESS trigger rule

I am using the one_success trigger rule so that if any one of the parent tasks passes, the child task runs, which is happening as expected. However, I am getting an issue when both parents fail. In this case, the child task is getting skipped instead of failing. Below is the DAG implementation:
import logging

from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.operators.bash_operator import BashOperator
from airflow.sensors.web_hdfs_sensor import WebHdfsSensor
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

circles = ['101', '102', '103']

def load_hive_partition(circle, **kwargs):
    return HiveOperator(
        task_id='add_partition_{}'.format(circle),
        hql='alter table abc.def add partition (event_date="{{ ds }}", circle="'+circle+'") location "/user/cloudera/hive/abc/def/event_date={{ ds }}/circle='+circle+'"',
        trigger_rule=TriggerRule.ONE_SUCCESS,
        dag=dag)

def check_hdfs_node1(circle, **kwargs):
    return WebHdfsSensor(
        task_id='source_data_sensor_node1_{}'.format(circle),
        webhdfs_conn_id='webhdfs_default_1',
        filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
        timeout=60 * 60 * 24,
        dag=dag
    )

def check_hdfs_node2(circle, **kwargs):
    return WebHdfsSensor(
        task_id='source_data_sensor_node2_{}'.format(circle),
        webhdfs_conn_id='webhdfs_default',
        filepath='/user/cloudera/hive/abc/def/event_date={{ds}}/circle='+circle,
        timeout=60 * 60 * 24,
        dag=dag
    )

args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'run_as_user': 'airflow',
    'retries': 3,
    'start_date': datetime(year=2020, month=8, day=30),
    'retry_delay': timedelta(minutes=10)
}

dag = DAG(dag_id="HiveLoadPartition_circle",
          default_args=args,
          schedule_interval='30 18 * * *',
          catchup=False)

kinit_bash = BashOperator(
    task_id='kinit_bash',
    bash_command='kinit -kt /usr/local/airflow/keytab.keytab appuser@cloudera.com',
    dag=dag)

# start_dag = DummyOperator(task_id='start_dag', dag=dag)
end_dag = DummyOperator(task_id='end_dag', trigger_rule=TriggerRule.ALL_DONE, dag=dag)

for circle in circles:
    add_partition = load_hive_partition(circle)
    check_hdfs_1 = check_hdfs_node1(circle)
    check_hdfs_2 = check_hdfs_node2(circle)
    check_hdfs_1.set_upstream(kinit_bash)
    check_hdfs_2.set_upstream(kinit_bash)
    add_partition.set_upstream(check_hdfs_1)
    add_partition.set_upstream(check_hdfs_2)
    end_dag.set_upstream(add_partition)
Graph view
How can I make my hive-load task fail when both hdfs_sensor tasks fail?
UPDATE:
I have also tried using the trigger rule all_done on end_dag. Even then, it triggers end_dag when the parent tasks are skipped.
I ran into this same issue and believe it is an underlying issue with airflow. I've opened up the following PR to fix this https://github.com/apache/airflow/pull/15467
2021-05-04: the fix has been merged and marked for the next (2.0.3) release.
2021-05-23: the fix has been deployed with the release of 2.1 (2.0.3 was cancelled)
Note: I've re-added my removed response to the meta conversation as I don't have the 50 required reputation to comment on the meta thread. Glancing at the edit guidelines it does not appear to have originally qualified for removal but if I've missed something and the following does not belong here please move this response to the meta thread on my behalf since I'm unable to.
As for the meta conversation: This answer was provided very much in the spirit captured by @Ivar's comment. So it is my intent that this answer helps Ayush and anyone else who stumbles across this issue while using <= 2.0.2. However, given the strong negative (-4 votes at the time of writing) reaction to this answer, next time I'll just refrain from contributing back to SO, as I only have so much time in the day and it has been made clear that identifying the issue, fixing the issue, and opening a PR does not meet SO's standard. ~Cheers

Airflow keeps running my DAG, despite catchup=False, schedule_interval=datetime.timedelta(hours=2)

Similar to previous questions, but none of the answers given worked. I have a DAG:
import datetime
import os

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator
from airflow.operators import BashOperator

PROJECT = os.environ['PROJECT']
GCS_BUCKET = os.environ['BUCKET']
API_KEY = os.environ['API_KEY']

default_args = {
    'owner': 'me',
    'start_date': datetime.datetime(2019, 7, 30),
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': datetime.timedelta(hours=1),
    'catchup': False
}

dag = DAG('dag-name',
          schedule_interval=datetime.timedelta(hours=2),
          default_args=default_args,
          max_active_runs=1,
          concurrency=1,
          catchup=False)

DEFAULT_OPTIONS_TEMPLATE = {
    'project': PROJECT,
    'stagingLocation': 'gs://{}/staging'.format(GCS_BUCKET),
    'tempLocation': 'gs://{}/temp'.format(GCS_BUCKET)
}

def my_dataflow_job(template_location, name):
    run_time = datetime.datetime.utcnow()
    a_value = run_time.strftime('%Y%m%d%H')
    t1 = DataflowTemplateOperator(
        task_id='{}-task'.format(name),
        template=template_location,
        parameters={'an_argument': a_value},
        dataflow_default_options=DEFAULT_OPTIONS_TEMPLATE,
        poll_sleep=30
    )
    t2 = BashOperator(
        task_id='{}-loader-heartbeat'.format(name),
        bash_command='curl --fail -XGET "[a heartbeat URL]" --header "Authorization: heartbeat_service {1}"'.format(name, API_KEY)
    )
    t1 >> t2

with dag:
    my_dataflow_job('gs://[path to gcs]'.format(GCS_BUCKET), 'name')
As you can see, I'm trying very hard to prevent Airflow from trying to backfill. Yet, when I deploy the DAG (late in the day, on 7/30/2019), it just keeps running the DAG one after the other, after the other, after the other.
Since this task is moving a bit of data around, this is not desirable. How do I get airflow to respect the "run this every other hour" schedule_interval??
As you can see, I've set catchup: False in both the DAG args AND the default args (just in case, started with them in the DAG args). The retry delay is also a long period.
Each DAG run is reported as a success.
I'm running with the following version:
composer-1.5.0-airflow-1.10.1
My next step is kubernetes cron...
I suspect you did not have catchup=False when you first created the DAG. I think Airflow may not recognize changes to the catchup parameter after the initial DAG creation.
Try renaming it and see what happens, e.g. add a v2 and enable it. After enabling it, it will run once even though catchup is false, because there is a valid completed interval (i.e. the current time is >= start_date + schedule_interval), but that is all.
Of course, test with a fake operator that doesn't do anything expensive.
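A minimal sketch of that rename suggestion, assuming the rest of the DAG file stays the same (the '-v2' suffix is just an example of a new dag_id):

dag = DAG('dag-name-v2',                          # new dag_id, so the scheduler treats it as a brand-new DAG
          schedule_interval=datetime.timedelta(hours=2),
          default_args=default_args,
          max_active_runs=1,
          concurrency=1,
          catchup=False)                          # present from the very first parse of the new DAG

After enabling the renamed DAG, you should still see a single run for the most recent completed interval, as described above, but no further backfilling.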

GCP Composer (Airflow) operator

I'm using the GCP Composer API (Airflow), and my DAG to scale up the number of workers keeps returning the error below:
Broken DAG: [/home/airflow/gcs/dags/cluster_scale_workers.py] 'module' object has no attribute 'DataProcClusterScaleOperator'
It seems to be something related to the scale operator; however, when I look at the Airflow Read the Docs and cross-check with my code, nothing seems to be wrong. What am I missing?
Is it related to the GCP Airflow version?
Code:
import datetime
import os

from airflow import models
from airflow.contrib.operators import dataproc_operator
from airflow.utils import trigger_rule

yesterday = datetime.datetime.combine(
    datetime.datetime.today() - datetime.timedelta(1),
    datetime.datetime.min.time())

default_dag_args = {
    'start_date': yesterday,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': datetime.timedelta(minutes=5),
    'project_id': models.Variable.get('gcp_project'),
    'cluster_name': 'hive-cluster'
}

with models.DAG(
        'scale_workers',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_dag_args) as dag:

    scale_to_6_workers = dataproc_operator.DataprocClusterScaleOperator(
        task_id='scale_dataproc_cluster_6',
        cluster_name='hive-cluster',
        num_workers=6,
        num_preemptible_workers=3,
        dag=dag
    )
I managed to find the issue and sort it out. The comment provided by Ashish Kumar above is correct.
The problem was that the Airflow version I was using (1.9.0) did not support the DataProcClusterScaleOperator. I created another instance by activating BETA and choosing version 1.10.0, which fixed my issue.
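As a quick sanity check (not part of the original answer), you can confirm which Airflow version an environment is running and whether the installed contrib module actually exposes the scale operator; this sketch assumes you can run Python inside the environment:

import airflow
from airflow.contrib.operators import dataproc_operator

# The scale operator is not available in 1.9.0, so check the running version first.
print(airflow.__version__)

# List the operator classes the installed dataproc_operator module actually exposes.
print([name for name in dir(dataproc_operator) if name.endswith('Operator')])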