airflow max_active_runs is not increasing - airflow-scheduler

My DAG code:

dag = DAG(
    "ETL_s21_{}_impression_product_test".format("every_minute"),
    default_args=default_args,
    schedule_interval="* * * * *",
    max_active_runs=10,
    concurrency=5,
)
def export_impression_log(table, item):
    export_impression_log_task = PythonOperator(
        task_id=f"{table}_{item}_export_impression_log",
        python_callable=S3.export_impression_log_from_ad_s3,
        op_kwargs={
            "execution_date": '{{execution_date.in_timezone("Asia/Seoul").strftime("%Y-%m-%d-%H-%M")}}',
            'table': table,
            'item': item,
        },
        dag=dag,
        queue='s21',
        task_concurrency=5,
        provide_context=False,
    )
    return export_impression_log_task
def add_partition(table, item):
    add_partition_task = PythonOperator(
        task_id=f"{table}_{item}_add_partition",
        python_callable=Athena.add_partition,
        op_kwargs={
            "glue_table_name": 'amplitude_impression_test',
            "execution_date": '{{execution_date.in_timezone("UTC").strftime("%Y-%m-%d-%H-%M")}}'
        },
        dag=dag,
        queue='s21',
        task_concurrency=5,
        provide_context=False,
    )
    return add_partition_task
airflow.cfg
dag_concurrency = 16
worker_concurrency = 16
parallelism = 32
max_active_runs_per_dag = 16
The DAG runs every minute and starts from a date in the past, so it backfills. I expected it to start from start_date, creating up to 5 DAG runs at once and keeping 5 runs going.
But the number of active DAG runs stays at 2 and never increases.
I think I have already tested the concurrency / parallelism options, but I couldn't solve this problem.
What should I check?
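
For reference, a sketch of which knob caps what, based on the documented behavior of these settings in Airflow 1.10 (the version implied by provide_context); the values are the ones from the question:

# airflow.cfg (installation-wide):
# parallelism = 32              -> max task instances running across the whole installation
# dag_concurrency = 16          -> default per-DAG cap on running task instances
# max_active_runs_per_dag = 16  -> default per-DAG cap on active DAG runs
# worker_concurrency = 16       -> tasks a single Celery worker executes at once

dag = DAG(
    "example",
    schedule_interval="* * * * *",
    max_active_runs=10,  # at most 10 active runs of this DAG
    concurrency=5,       # at most 5 running task instances across ALL active runs of this DAG
)

Note that concurrency and the per-task task_concurrency=5 cap task instances, not runs, so a low task cap can hold the number of progressing runs well below max_active_runs.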

Related

Celery beat PeriodicTask max executions

I'm trying to set a maximum number of executions for my PeriodicTask with an IntervalSchedule.
I know about total_run_count, but how can I use it to cap the executions of my PeriodicTask?
@receiver(post_save, sender=AutoTask)
def create_periodic_task(sender, instance, **kwargs):
    interval = IntervalSchedule.objects.get_or_create(
        every=instance.every, period=instance.periodicity)[0]
    PeriodicTask.objects.create(
        name=instance.title,
        task="create_autotask",
        start_time=instance.creation_date,
        total_run_count=instance.repetitions,
        interval=interval,
    )
I coded this, and I can configure the interval (for example, every 2 days), but how can I limit the number of repetitions?
Thanks
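
One possible approach (a sketch, assuming django-celery-beat, where total_run_count is a counter the scheduler maintains rather than a limit you can set): store the desired maximum yourself and disable the task once the counter reaches it. Here max_runs is a hypothetical value you would keep on your own model (e.g. instance.repetitions):

from django_celery_beat.models import PeriodicTask

def disable_if_exhausted(task_name, max_runs):
    # Disable the periodic task once it has run max_runs times;
    # beat stops scheduling tasks whose enabled flag is False.
    task = PeriodicTask.objects.get(name=task_name)
    if task.total_run_count >= max_runs:
        task.enabled = False
        task.save()

You could call disable_if_exhausted(instance.title, instance.repetitions) at the start of the "create_autotask" task body.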

Running Airflow DAG on Tuesday 00:00:00 every week

I'm trying to run a DAG on Airflow (GCP Cloud Composer, to be exact) on a weekly basis.
But the DAG is not run on Tuesdays as I'm specifying in the cron expression.
In all the examples I found, the schedule_interval was set as a preset interval (daily, weekly, and so on). I can't figure out what the error might be in my settings.
default_dag_args = {
    'start_date': datetime.datetime.strptime('07/08/2020 00:00:00', '%d/%m/%Y %H:%M:%S'),
    'depends_on_past': False,
    'catchup': ...,
    'retry_delay': ...,
    'project_id': ...
}

with models.DAG(
        'every_Tues_00_00',
        schedule_interval="0 0 * * 2",
        default_args=default_dag_args) as dag:
    ...
Something to keep in mind is when Airflow triggers the task.
"For example, if you run a DAG on a schedule_interval of one day, the run stamped 2020-01-01 will be triggered soon after 2020-01-01T23:59. In other words, the job instance is started once the period it covers has ended. The execution_date available in the context will also be 2020-01-01." [1]
[1] https://airflow.apache.org/docs/stable/dag-run.html
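A worked example for this question's dates (an illustration, not from the quoted docs): start_date is 07/08/2020, i.e. Friday 7 August 2020. With schedule_interval "0 0 * * 2", the first run gets execution_date Tuesday 11 August 2020 00:00, but it is only triggered once that period has ended, i.e. shortly after Tuesday 18 August 2020 00:00. The DAG does run on Tuesdays, just one full schedule period later than one might expect.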

How to run an Airflow DAG once after x minutes?

I need to run a DAG exactly once, but only after waiting 10 minutes:

with models.DAG(
        'bq_executor',
        schedule_interval='@once',
        start_date=datetime.now() + timedelta(minutes=10),
        catchup=False,
        default_args=default_dag_args) as dag:
    # DAG operators here
but I can't see any execution after 10 minutes. Is something wrong with start_date?
If I use schedule_interval='*/10 * * * *' and start_date=datetime(2019, 8, 1) (a date in the past), I can see the execution every 10 minutes.
Don't use datetime.now(): it keeps changing every time the DAG file is parsed, and now() + 10 minutes is always a future timestamp, so the DAG never gets scheduled.
Airflow always runs your DAGs AFTER the start_date. So if the start_date falls today, the first run starts after today 23:59.
Scheduling is tricky, so check the documentation and examples:
https://airflow.apache.org/scheduler.html
In your case, just switch the start_date to yesterday (today - 1) and it will start today with yesterday's ds (date stamp).
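
Putting both points together, a minimal sketch of the fixed DAG (static past start_date; default_dag_args as in the question). Note that with '@once' it runs as soon as the scheduler picks it up, so the original 10-minute delay would need a separate mechanism, such as a sensor or a sleep in the first task:

from datetime import datetime
from airflow import models

with models.DAG(
        'bq_executor',
        schedule_interval='@once',
        start_date=datetime(2019, 8, 1),  # a fixed date in the past, never datetime.now()
        catchup=False,
        default_args=default_dag_args) as dag:
    ...  # DAG operators here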

Get record's age in seconds if older than 5 minutes (otherwise 0) in Django (with PostgreSQL database)

I'm retrieving all records, and I would like to display each record's age for records that are older than 5 minutes.
The output should be something like this (in this example, two records, 1.8.9.1 and 2.7.3.1, are older than 5 minutes):
ip ... status
---------------------
1.8.9.1 ... 3 hours
2.7.3.1 ... 7 minutes
1.1.1.1 ... up
1.1.1.2 ... up
1.1.1.3 ... up
1.1.1.4 ... up
1.1.1.5 ... up
Here's my current code:

Interfaces.objects.all()
    .annotate(
        age=datetime.utcnow() - F('timestamp'),          # 0:00:08.535704
        age2=Epoch(datetime.utcnow() - F('timestamp')),  # 8.535704
        # age3=int(Epoch(datetime.utcnow() - F('timestamp')) / 300),
        current_time=Value(str(datetime.utcnow()),
                           output_field=null_char_field),
    )
    .order_by('age', 'ip')
age and age2 both work, but the problem is that I want the records older than 5 minutes sorted by age, and the rest by ip.
So I'm trying to set age to 0 if the record is less than 5 minutes old.
If I were doing it directly in PostgreSQL, I'd use this query:

select ip, <other fields>,
       case when extract('epoch' from now() - "timestamp") > 300
            then extract('epoch' from now() - "timestamp")
            else 0
       end

Is there a way to do it in Django?
I figured it out:

Interfaces.objects.all()
    .annotate(
        age=Case(
            When(timestamp__lt=datetime.utcnow() - timedelta(minutes=5),
                 then=Cast(Epoch(datetime.utcnow() - F('timestamp')),
                           NullIntegerField)),
            default=0,
            output_field=NullIntegerField,
        ),
    )
    .order_by('age', 'ip')
By the way, my imports and relevant settings:

from datetime import datetime, timedelta
from django.db.models import F, Func, Case, When, Value, IntegerField
from django.db.models.functions import Coalesce, Cast

NullIntegerField = IntegerField(null=True)

class Epoch(Func):
    function = 'EXTRACT'
    template = "%(function)s('epoch' from %(expressions)s)"
This website ended up being the most helpful: https://micropyramid.com/blog/django-conditional-expression-in-queries/
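
For illustration (not part of the original answer), you can confirm what the custom Epoch function renders to by printing the generated SQL, which should contain the same EXTRACT call as the raw query above:

qs = Interfaces.objects.annotate(age2=Epoch(datetime.utcnow() - F('timestamp')))
print(qs.query)  # ... EXTRACT('epoch' from ...) AS "age2" ...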
You can also do it another way, which may be faster:
get the current time, subtract 5 minutes from it, and then annotate the Interfaces against that cutoff.
Example:

current_time = datetime.now()
older_than_five = current_time - timedelta(minutes=5)

Interfaces.objects.all()
    .annotate(
        age=Case(
            # records newer than five minutes get age 0
            When(timestamp__gt=older_than_five, then=Value(0)),
            default=F('age'),  # assumes an earlier age annotation as in the question
        )
    )
    .order_by('age', 'ip')

Why is Airflow DAG not scheduled?

The code below triggers the DAG at 12:37 when Airflow is started at 12:35 (on a Tuesday).
But when I remove ", minutes=10" from the code, the run is not scheduled.
Why is that?
start_date = datetime.utcnow().replace(tzinfo=pytz.UTC) \
    - timedelta(days=7, minutes=10)
dag = DAG(default_args={'start_date': start_date},
          schedule_interval='37 12 * * tue', dag_id='test1')
task = PythonOperator(...)
dag >> task
The code variant without minutes:

start_date = datetime.utcnow().replace(tzinfo=pytz.UTC) \
    - timedelta(days=7)
I found the solution myself. As Airflow keeps re-evaluating the DAG file, "now" keeps moving (see https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date), and this is why the job is not started: with exactly seven days subtracted, the moving start_date catches up with the previous Tuesday-12:37 schedule boundary almost immediately, so the scheduler never sees a completed schedule period, while the extra 10 minutes kept start_date before that boundary long enough for the run to be created.
So I defined a fixed start_date (anything in the far past) and used the LatestOnlyOperator. This works.
start_date = datetime(2018, 9, 1, 0, 0, tzinfo=pytz.UTC)
dag = DAG(default_args={'start_date': start_date},
          schedule_interval='37 12 * * tue', dag_id='test1')
latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
task = PythonOperator(...)
dag >> latest_only >> task
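
For context (a detail this answer relies on): LatestOnlyOperator skips its downstream tasks for every scheduled run that is not the most recent one, so the far-past start_date creates backfill runs that do no real work, and only the latest week's run actually executes the task.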