Why is Airflow DAG not scheduled? - airflow-scheduler

The code below triggers the DAG at 12:37 when Airflow is started at 12:35 (on a Tuesday). But when I remove ", minutes=10" from the code, the run is never scheduled. Why is that?
start_date = datetime.utcnow().replace(tzinfo=pytz.UTC) \
    - timedelta(days=7, minutes=10)
dag = DAG(default_args={'start_date': start_date},
          schedule_interval='37 12 * * tue', dag_id='test1')
task = PythonOperator(..., dag=dag)
The code variant without minutes:
start_date = datetime.utcnow().replace(tzinfo=pytz.UTC) \
    - timedelta(days=7)

I found the solution myself. Because Airflow is "recreating" the DAG all the time, "now" keeps moving (see https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date), and this is why the job is never started: with exactly days=7, by the time the run for last Tuesday's 12:37 tick becomes due, the moving start_date has already slid past that tick, so the scheduler never sees a completed interval. The extra 10 minutes of padding keep the tick ahead of the moving start_date long enough for the run to fire.
So I defined a fixed start_date (anything in the far past) and used the LatestOnlyOperator so that only the latest run does real work. This works:
start_date = datetime(2018, 9, 1, 0, 0, tzinfo=pytz.UTC)
dag = DAG(default_args={'start_date': start_date},
          schedule_interval='37 12 * * tue', dag_id='test1')
latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
task = PythonOperator(..., dag=dag)
latest_only >> task
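To see the moving window concretely, here is a small sketch of my own (not from the original answer) using the croniter package, which Airflow itself uses for cron arithmetic, to compute the first schedulable tick for the moving versus the padded start_date:

from datetime import datetime, timedelta
from croniter import croniter

def first_tick(start_date):
    # first '37 12 * * tue' tick strictly after the given start_date
    return croniter('37 12 * * tue', start_date).get_next(datetime)

now = datetime(2018, 9, 25, 12, 38)  # a Tuesday, just after 12:37

# moving start_date: the first tick is *this* Tuesday's, and its interval
# only ends next week, by which time start_date has moved past it again
print(first_tick(now - timedelta(days=7)))              # 2018-09-25 12:37

# padded start_date: the first tick is *last* Tuesday's, whose interval has
# already ended, so the run fires right away
print(first_tick(now - timedelta(days=7, minutes=10)))  # 2018-09-18 12:37

A run for a tick only fires once the interval that tick opens has ended, which is why only the padded variant ever becomes due.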

Related

Run a Celery task for a maximum of 6 hours; if it takes more than 6 hours, rerun the same task?

I have a Django project using Celery + RabbitMQ. Some of my tasks take 6 hours or more, or even get stuck, so I want to re-run the same task if it takes more than 6 hours. How can I do that? I'm new to Celery.
You could try:
from celery.exceptions import SoftTimeLimitExceeded

@celery.task(soft_time_limit=60 * 60 * 6)  # <--- set the soft time limit to 6 hours
def mytask():
    try:
        return do_work()
    except SoftTimeLimitExceeded:
        raise mytask.retry()  # <--- retry the task after the 6-hour limit is exceeded
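If you also want to cap how often the task re-runs, a bound task can set max_retries and a retry delay. A hedged variant, assuming your Celery app instance is named celery (it reuses the SoftTimeLimitExceeded import above):

@celery.task(bind=True, soft_time_limit=60 * 60 * 6, max_retries=3)
def mytask(self):
    try:
        return do_work()
    except SoftTimeLimitExceeded:
        # re-run at most 3 times, waiting 60 seconds before each attempt
        raise self.retry(countdown=60)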

Airflow max_active_runs is not increasing

My DAG code:
dag = DAG(
    "ETL_s21_{}_impression_product_test".format("every_minute"),
    default_args=default_args,
    schedule_interval="* * * * *",
    max_active_runs=10,
    concurrency=5
)
def export_impression_log(table, item):
    export_impression_log_task = PythonOperator(
        task_id=f"{table}_{item}_export_impression_log",
        python_callable=S3.export_impression_log_from_ad_s3,
        op_kwargs={
            "execution_date": '{{execution_date.in_timezone("Asia/Seoul").strftime("%Y-%m-%d-%H-%M")}}',
            'table': table,
            'item': item
        },
        dag=dag,
        queue='s21',
        task_concurrency=5,
        provide_context=False
    )
    return export_impression_log_task
def add_partition(table, item):
    add_partition_task = PythonOperator(
        task_id=f"{table}_{item}_add_partition",
        python_callable=Athena.add_partition,
        op_kwargs={
            "glue_table_name": 'amplitude_impression_test',
            "execution_date": '{{execution_date.in_timezone("UTC").strftime("%Y-%m-%d-%H-%M")}}'
        },
        dag=dag,
        queue='s21',
        task_concurrency=5,
        provide_context=False
    )
    return add_partition_task
airflow.cfg
dag_concurrency = 16
worker_concurrency = 16
parallelism = 32
max_active_runs_per_dag = 16
The DAG runs every minute with a start_date in the past. I expected it to start from start_date, with up to 5 DAG runs created at once and 5 runs kept going, but the number of active DAG runs stays at 2 and never increases. I think I have already tested the concurrency/parallelism options, but I couldn't solve this problem. What should I check?

DAG schedule in Airflow 2.0

How do I schedule a DAG in Airflow 2.0 so that it does not run on holidays?
Question 1: Run it on the 5th working day of every month?
Question 2: Run it on the 5th working day of the month, and if that day is a holiday, run it on the next day that is not a holiday?
For the moment this can't be done (at least not natively). Airflow DAGs accept either a single cron expression or a timedelta. If the desired scheduling logic can't be expressed with one of those, you cannot have that schedule in Airflow. The good news is that Airflow has AIP-39 Richer scheduler_interval to address this and provide more scheduling capabilities in future versions.
That said, you can work around it by setting your DAG to run with schedule_interval="@daily" and placing a BranchPythonOperator as the first task of the DAG. In the Python callable you write the desired scheduling logic: it returns the task_id to follow, "continue_task" if today is the 5th working day of the month and "stop_task" otherwise, and your workflow branches accordingly. For "continue_task" the DAG keeps executing; for "stop_task" it ends. This is not ideal, but it works. A possible template:
def check_fifth():
    # write the logic of finding if today is the 5th working day of the month
    if logic:
        return "continue_task"
    else:
        return "stop_task"

with DAG(dag_id='stackoverflow',
         default_args=default_args,
         schedule_interval="@daily",
         catchup=False
         ) as dag:

    start_op = BranchPythonOperator(
        task_id='choose',
        python_callable=check_fifth
    )

    stop_op = DummyOperator(
        task_id='stop_task'
    )

    # replace with the operator that you want to actually execute
    continue_op = YourOperator(
        task_id='continue_task'
    )

    start_op >> [stop_op, continue_op]
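For the branching callable itself, here is a minimal sketch of one way to detect the 5th working day; the helper name is_fifth_working_day and the holidays set (of datetime.date objects) are my own illustration, not part of the original answer:

from datetime import date, timedelta

def is_fifth_working_day(today, holidays=frozenset()):
    # collect the working days (Mon-Fri and not a holiday) from the 1st of
    # the month up to and including today
    d = today.replace(day=1)
    working = []
    while d <= today:
        if d.weekday() < 5 and d not in holidays:
            working.append(d)
        d += timedelta(days=1)
    # today is the 5th working day iff exactly 5 working days have passed
    # and today itself is the 5th of them
    return len(working) == 5 and working[-1] == today

def check_fifth():
    return "continue_task" if is_fifth_working_day(date.today()) else "stop_task"

Question 2 can be handled the same way, by checking instead whether today is the first non-holiday day on or after the 5th of the month.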

Running Airflow DAG on Tuesday 00:00:00 every week

I'm trying to run a DAG on Airflow (GCP Cloud Composer, to be exact) on a weekly basis, but the DAG is not run on Tuesdays as I'm specifying in the cron expression.
In all the examples I found, the schedule_interval was set as a preset interval (daily, weekly, and so on). I can't figure out what the error might be in my settings.
default_dag_args = {
    'start_date': datetime.datetime.strptime('07/08/2020 00:00:00', '%d/%m/%Y %H:%M:%S'),
    'depends_on_past': False,
    'catchup': ...,
    'retry_delay': ...,
    'project_id': ...
}

with models.DAG(
        'every_Tues_00_00',
        schedule_interval="0 0 * * 2",
        default_args=default_dag_args) as dag:
    ...
Something to keep in mind is when Airflow actually triggers a run:
"For example, if you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01 will be triggered soon after 2020-01-01T23:59. In other words, the job instance is started once the period it covers has ended. The execution_date available in the context will also be 2020-01-01." [1]
[1] https://airflow.apache.org/docs/stable/dag-run.html
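Applying that rule to the settings above, a small worked illustration of my own: the start_date 07/08/2020 is a Friday, so the first covered interval is Tuesday 2020-08-11 to Tuesday 2020-08-18, and the first run fires only just after 2020-08-18 00:00, stamped with execution_date 2020-08-11:

from datetime import datetime, timedelta

start = datetime.strptime('07/08/2020 00:00:00', '%d/%m/%Y %H:%M:%S')  # a Friday

# first cron tick ("0 0 * * 2", i.e. Tuesday 00:00) at or after start_date
days_until_tuesday = (1 - start.weekday()) % 7  # Monday == 0, so Tuesday == 1
first_tick = start + timedelta(days=days_until_tuesday)

# the run stamped with execution_date == first_tick only fires once the
# weekly interval it covers has ended
first_trigger = first_tick + timedelta(days=7)

print(first_tick)     # 2020-08-11 00:00, the execution_date of the first run
print(first_trigger)  # 2020-08-18 00:00, when that run actually starts

So the schedule itself is fine; the first Tuesday run simply arrives a week later than expected.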

How to run an Airflow DAG once after x minutes?

I need to run a DAG exactly once, but only after waiting 10 minutes:
with models.DAG(
        'bq_executor',
        schedule_interval='@once',
        start_date=datetime.now() + timedelta(minutes=10),
        catchup=False,
        default_args=default_dag_args) as dag:
    # DAG operators here
    ...
but I can't see the execution after 10 minutes. Is something wrong with start_date?
If I use schedule_interval='*/10 * * * *' and start_date=datetime(2019, 8, 1) (a date in the past), I can see the execution every 10 minutes.
Don't use datetime.now(): it is re-evaluated every time the DAG file is parsed, so now() + 10 minutes is always a future timestamp and the DAG never gets scheduled.
Airflow always runs your DAGs AFTER the start_date; with a daily schedule, a start_date of today means the first run comes only after today 23:59.
Scheduling is tricky, so check the documentation and examples:
https://airflow.apache.org/scheduler.html
In your case, just switch the start_date to yesterday (today - 1) and it will start today with yesterday's ds (date stamp).
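A minimal sketch of that fix, assuming the same default_dag_args as in the question (days_ago is Airflow's helper returning midnight n days ago):

from airflow import models
from airflow.utils.dates import days_ago

with models.DAG(
        'bq_executor',
        schedule_interval='@once',
        start_date=days_ago(1),  # midnight yesterday: already in the past, never now()
        catchup=False,
        default_args=default_dag_args) as dag:
    # DAG operators here
    ...

With a past start_date and schedule_interval='@once', the single run is created as soon as the scheduler picks up the DAG.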