Possibility of creating DAGs in Airflow


Is there a way to generate a DAG file dynamically from code and upload it to Airflow? (Airflow reads from the dags directory, but creating a file for every DAG and uploading it to that folder is slow.)
Is it possible to create a template DAG and populate it with new logic whenever needed?
I saw that an API is being worked on, but the current version only has a trigger-DAG option.

You can quite easily create multiple DAGs in a single file:
def create_dag(dag_id):
    dag = DAG(....)
    # some tasks added
    return dag

for dag_id in dags_lists:
    globals()[dag_id] = create_dag(dag_id)
If the template function (create_dag in the example above) returns a proper DAG object and you make each one available in the module's globals(), Airflow will recognise them as individual DAGs.

Yes, you can create dynamic DAGs as follows:
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        # dag_number is read from the enclosing create_dag() call
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag


# build a DAG for each number from 1 to 9
for n in range(1, 10):
    dag_id = 'hello_world_{}'.format(str(n))
    default_args = {'owner': 'airflow',
                    'start_date': datetime(2018, 1, 1)
                    }
    schedule = '@daily'
    dag_number = n
    globals()[dag_id] = create_dag(dag_id,
                                   schedule,
                                   dag_number,
                                   default_args)
Example from https://www.astronomer.io/guides/dynamically-generating-dags/
However, note that this can cause issues such as delays between task executions. This is because the Airflow Scheduler and Worker have to parse the entire file when scheduling or executing each task of a single DAG.
If you have many DAGs (let's say 100) inside the same file, all 100 DAG objects have to be parsed just to execute a single task of DAG1.
I would recommend building a tool that creates a single file per DAG instead.
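A tool like the one recommended above can be quite small. Here is a minimal sketch, not taken from the answer or the Astronomer guide: the template text, the write_dag_files helper and its arguments are all hypothetical, and it only uses the standard library to render one .py file per DAG.

```python
from pathlib import Path
from string import Template

# Hypothetical template for a generated DAG file; $dag_id and $schedule
# are filled in per DAG. The body mirrors the hello_world example above.
DAG_TEMPLATE = Template('''\
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def hello_world_py(*args):
    print('Hello World from $dag_id')


dag = DAG('$dag_id',
          schedule_interval='$schedule',
          default_args={'owner': 'airflow',
                        'start_date': datetime(2018, 1, 1)})

with dag:
    t1 = PythonOperator(task_id='hello_world',
                        python_callable=hello_world_py)
''')


def write_dag_files(dag_specs, dags_dir):
    """Write one <dag_id>.py file per (dag_id, schedule) pair; return the paths."""
    out_dir = Path(dags_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for dag_id, schedule in dag_specs:
        path = out_dir / '{}.py'.format(dag_id)
        path.write_text(DAG_TEMPLATE.substitute(dag_id=dag_id, schedule=schedule))
        paths.append(path)
    return paths
```

You would call write_dag_files with your own list of (dag_id, schedule) pairs and the path of your dags directory. Each generated file then contains exactly one DAG, so parsing it for one task does not drag the other DAGs along.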

Related

Schedule an Airflow DAG for running with parameters

I am trying to trigger an airflow DAG externally and passing some parameters to the DAG. The DAG is scheduled to run every 3 minutes. My problem is that the parameters are only being used by the first DAG run.
from pyexpat import model
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python import PythonOperator
import os

dag_id = "proj"
home_path = os.path.expanduser("~")
runpath = os.path.join(home_path, "airflow/data", dag_id)


def load_data(ti):
    import os

    train = os.path.join(runpath, "mnist")
    test = os.path.join(runpath, "mnist.t")
    model = os.path.join(runpath, "trained.mnist")
    if not os.path.exists(train):
        os.system(
            "curl https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2 --output mnist.bz2"
        )
        os.system("bzip2 -d mnist.bz2")
    if not os.path.exists(test):
        os.system(
            "curl https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.t.bz2 --output mnist.t.bz2"
        )
        os.system("bzip2 -d mnist.t.bz2")
    ti.xcom_push(key="train_path", value=train)
    ti.xcom_push(key="test_path", value=test)
    ti.xcom_push(key="model_path", value=model)


def train(
    **context,
):
    import os

    ti = context["ti"]
    train = ti.xcom_pull(task_ids="load_data", key="train_path")
    model_path = ti.xcom_pull(task_ids="load_data", key="model_path")
    lr = context["dag_run"].conf["lr"]
    epochs = context["dag_run"].conf["epochs"]
    name = context["dag_run"].conf["name"]
    print(lr)
    print(epochs)
    ti.xcom_push(key="model_name", value=model_final_name)


def validate(**context):
    ti = context["ti"]
    test = ti.xcom_pull(task_ids="load_data", key="test_path")
    model_path = ti.xcom_pull(task_ids="train", key="model_name")
    print(test)
    print(model_path)


with DAG(
    dag_id="project",
    default_args={"owner": "airflow"},
    start_date=datetime(2022, 8, 8),
    schedule_interval=timedelta(minutes=3),
    tags=["mnist_4"],
    catchup=False,
) as dag:
    print(runpath)
    os.makedirs(runpath, exist_ok=True)
    os.chdir(runpath)
    read_file = PythonOperator(
        task_id="load_data",
        python_callable=load_data,
        provide_context=True,
    )
    process_train = PythonOperator(
        task_id="train",
        python_callable=train,
        provide_context=True,
    )
    validate = PythonOperator(
        task_id="validate", python_callable=validate, provide_context=True
    )
    read_file >> process_train >> validate
I trigger the DAG with the command
airflow dags trigger project --conf '{"epochs":1,"name":"trial_3","lr":0.001}'
Except for one run, all the other runs have failed with the following error:
KeyError: 'lr'
When I look at the conf for the DAG runs, only one run has the conf; the rest are empty.
In the External Trigger field, only one run is true, which means that only one run was triggered by the command and the rest were scheduled.
I want to know how to pass the config to the scheduled DAG runs as well.

I hope this helps.
The DAG run conf only applies to manually triggered DAG runs; that is the case in which conf can be passed.
For scheduled DAGs, you can set default params in your DAG. This post shows an example:
Airflow how to set default values for dag_run.conf
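One way to get default values is to merge the run's conf over a dict of defaults inside the task. This is only a sketch and not necessarily the approach in the linked post: DEFAULT_CONF, its values, and the resolve_conf helper are hypothetical names, and the train function is a cut-down stand-in for the one in the question.

```python
# Hypothetical defaults; the keys match the ones the question's DAG reads.
DEFAULT_CONF = {"lr": 0.001, "epochs": 1, "name": "scheduled"}


def resolve_conf(dag_run_conf, defaults=DEFAULT_CONF):
    """Return the defaults, overridden by whatever the DAG run's conf supplies."""
    merged = dict(defaults)
    # Scheduled runs carry an empty (or missing) conf, so treat None as {}.
    merged.update(dag_run_conf or {})
    return merged


def train(**context):
    # Read every setting through the merged dict instead of indexing
    # dag_run.conf directly, which raises KeyError on scheduled runs.
    conf = resolve_conf(context["dag_run"].conf)
    print(conf["lr"], conf["epochs"], conf["name"])
    return conf
```

With this, a scheduled run falls back to the defaults, while a run triggered with --conf still uses the values passed on the command line.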

How can I set a timeout on Dataflow?

I am using Composer to run my Dataflow pipeline on a schedule. If the job is taking over a certain amount of time, I want it to be killed. Is there a way to do this programmatically either as a pipeline option or a DAG parameter?

Not sure how to do it as a pipeline config option, but here is an idea.
You could launch a taskqueue task with countdown set to your timeout value. When the task does launch, you could check to see if your task is still running:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
If it is, you can call update on it with job state JOB_STATE_CANCELLED
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/update
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#jobstate
This is done through the googleapiclient lib: https://developers.google.com/api-client-library/python/apis/discovery/v1
Here is an example of how to use it
from google.appengine.api import app_identity
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials


# InterimAdminResourceHandler is the answerer's own request-handler base class.
class DataFlowJobsListHandler(InterimAdminResourceHandler):

    def get(self, resource_id=None):
        """
        Wrapper to this:
        https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
        """
        if resource_id:
            self.abort(405)
        else:
            credentials = GoogleCredentials.get_application_default()
            service = discovery.build('dataflow', 'v1b3', credentials=credentials)
            project_id = app_identity.get_application_id()
            _filter = self.request.GET.pop('filter', 'UNKNOWN').upper()
            jobs_list_request = service.projects().jobs().list(
                projectId=project_id,
                filter=_filter)  # e.g. 'ACTIVE'
            jobs_list = jobs_list_request.execute()
            return {
                '$cursor': None,
                'results': jobs_list.get('jobs', []),
            }
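The cancel step itself is not shown above. The following is a rough sketch of how it could look, under some assumptions: overdue_job_ids and cancel_job are hypothetical helper names, the job dicts are assumed to carry the id, currentState and createTime fields of the Job resource, and elapsed time is measured from job creation. The service object is expected to be the one built with discovery.build as in the handler above.

```python
from datetime import datetime, timedelta, timezone


def overdue_job_ids(jobs, timeout, now=None):
    """Return the ids of running jobs created more than `timeout` ago."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for job in jobs:
        if job.get("currentState") != "JOB_STATE_RUNNING":
            continue
        # createTime looks like "2024-01-01T09:00:00.123456Z"; keep only the
        # whole-second part and treat it as UTC.
        created = datetime.strptime(
            job["createTime"][:19], "%Y-%m-%dT%H:%M:%S"
        ).replace(tzinfo=timezone.utc)
        if now - created > timeout:
            overdue.append(job["id"])
    return overdue


def cancel_job(service, project_id, job_id):
    """Ask Dataflow to cancel a job via projects.jobs.update."""
    return service.projects().jobs().update(
        projectId=project_id,
        jobId=job_id,
        body={"requestedState": "JOB_STATE_CANCELLED"},
    ).execute()
```

The delayed task would list the jobs, pass the 'jobs' entries to overdue_job_ids with your timeout as a timedelta, and call cancel_job for each id that comes back.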

Set dynamic scheduling celerybeat

I have a send_time field in my Notification model. I want to send a notification to all mobile clients at that time.
What I am doing right now is this: I have created a task and scheduled it to run every minute.
tasks.py
@app.task(name='app.tasks.send_notification')
def send_notification():
    # here is logic to filter notifications that fall inside that 1 minute time span
    cron.push_notification()
settings.py
CELERYBEAT_SCHEDULE = {
    'send-notification-every-1-minute': {
        'task': 'app.tasks.send_notification',
        'schedule': crontab(minute="*/1"),
    },
}
Everything is working as expected.
Question:
Is there any way to schedule the task according to the send_time field, so I don't have to run a task every minute?
More specifically, I want to create a new instance of the task whenever my Notification model gets a new entry, and schedule it according to the send_time field of that record.
Note: I am using the new integration of Celery with Django, not the django-celery package.

To execute a task at a specified date and time, you can use the eta argument of apply_async when calling the task, as mentioned in the docs.
After creating the notification object, you can call your task like this:
# here obj is your notification object; you can send extra information in kwargs
send_notification.apply_async(kwargs={'obj_id': obj.id}, eta=obj.send_time)
Note: send_time should be a datetime.

You have to use PeriodicTask and CrontabSchedule, which can be imported from djcelery.models, to schedule the task.
So the code will look like this:
from djcelery.models import PeriodicTask, CrontabSchedule

crontab, created = CrontabSchedule.objects.get_or_create(minute='*/1')
periodic_task_obj, created = PeriodicTask.objects.get_or_create(name='send_notification', task='send_notification', crontab=crontab, enabled=True)
Note: you have to write the full path to the task, like 'app.tasks.send_notification'.
You can schedule the notification task in a post_save handler of the Notification model like this:
import json

from django.db.models.signals import post_save
from django.dispatch import receiver


@receiver(post_save, sender=Notification)
def schedule_notification(sender, instance, *args, **kwargs):
    """
    instance is the notification model object
    """
    # create a crontab according to your notification object.
    # there are more options you can pass, like day, week_day etc., while creating the Crontab object.
    crontab, created = CrontabSchedule.objects.get_or_create(minute=instance.send_time.minute, hour=instance.send_time.hour)
    # name must be unique per periodic task; task is the full path of the Celery task
    periodic_task_obj, created = PeriodicTask.objects.get_or_create(name='send_notification_{}'.format(instance.pk), task='app.tasks.send_notification')
    periodic_task_obj.crontab = crontab
    periodic_task_obj.enabled = True
    # you can also pass kwargs to your task like this
    periodic_task_obj.kwargs = json.dumps({"notification_id": instance.pk})
    periodic_task_obj.save()

How do we trigger multiple airflow dags using TriggerDagRunOperator?

I have a scenario in which a particular DAG needs to trigger multiple DAGs upon completion. I have used TriggerDagRunOperator to trigger a single DAG; is it possible to pass multiple DAGs to TriggerDagRunOperator to trigger all of them?
And is it possible to trigger them only upon successful completion of the current DAG?

I have faced the same problem. There is no solution out of the box, but we can write a custom operator for it.
Here is the code of a custom operator that takes python_callable and trigger_dag_id as arguments:
# Import paths as in Airflow 1.x
from datetime import datetime

from airflow import settings
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator, DagRunOrder
from airflow.utils.decorators import apply_defaults
from airflow.utils.state import State


class TriggerMultiDagRunOperator(TriggerDagRunOperator):

    @apply_defaults
    def __init__(self, op_args=None, op_kwargs=None, *args, **kwargs):
        super(TriggerMultiDagRunOperator, self).__init__(*args, **kwargs)
        self.op_args = op_args or []
        self.op_kwargs = op_kwargs or {}

    def execute(self, context):
        session = settings.Session()
        created = False
        # python_callable yields one DagRunOrder per DAG run to create
        for dro in self.python_callable(context, *self.op_args, **self.op_kwargs):
            if not dro or not isinstance(dro, DagRunOrder):
                break

            if dro.run_id is None:
                dro.run_id = 'trig__' + datetime.utcnow().isoformat()

            dbag = DagBag(settings.DAGS_FOLDER)
            trigger_dag = dbag.get_dag(self.trigger_dag_id)
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                conf=dro.payload,
                external_trigger=True
            )
            created = True
            self.log.info("Creating DagRun %s", dr)

        if created is True:
            session.commit()
        else:
            self.log.info("No DagRun created")
        session.close()
trigger_dag_id is the id of the DAG that we want to run multiple times.
python_callable is a function that should return a list of DagRunOrder objects, one object for each run of the DAG with dag_id trigger_dag_id that should be scheduled.
Code and examples on GitHub: https://github.com/mastak/airflow_multi_dagrun
A bit more description of this code: https://medium.com/@igorlubimov/dynamic-scheduling-in-airflow-52979b3e6b13

In Airflow 2 (2.3 and later), you can use dynamic task mapping. For example:
import uuid
import random
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

dag_args = {
    "start_date": datetime(2022, 9, 9),
    "schedule_interval": None,
    "catchup": False,
}


@task
def define_runs():
    # pick between 3 and 5 runs, each with a random run id
    num_runs = random.randint(3, 5)
    runs = [str(uuid.uuid4()) for _ in range(num_runs)]
    return runs


@dag(**dag_args)
def dynamic_tasks():
    runs = define_runs()
    # one mapped TriggerDagRunOperator task per run id
    run_dags = TriggerDagRunOperator.partial(
        task_id="run_dags",
        trigger_dag_id="hello_world",
        conf=None,
    ).expand(
        trigger_run_id=runs,
    )
    run_dags


dag = dynamic_tasks()
See the Airflow documentation on dynamic task mapping.

You can try looping over it! For example:
# list holds the dag_ids to trigger
for i in list:
    trigger_dag = TriggerDagRunOperator(task_id='trigger_' + i,
                                        trigger_dag_id=i,
                                        python_callable=conditionally_trigger_non_indr,
                                        dag=dag)
Make each of these tasks depend on the task that has to finish first. I have automated something like this for PythonOperator. You could try whether this works for you!

As the API docs state, the method accepts a single dag_id. However, if you want to unconditionally kick off downstream DAGs upon completion, why not just put those tasks in a single DAG and set your dependencies/workflow there? You would then be able to set depends_on_past=True where appropriate.
EDIT: Easy workaround if you absolutely need them in separate DAGs is to create multiple TriggerDagRunOperators and set their dependencies to the same task.

How to clear Django RQ jobs from a queue?

I feel a bit stupid for asking, but it doesn't appear to be in the documentation for RQ. I have a 'failed' queue with thousands of items in it and I want to clear it using the Django admin interface. The admin interface lists them and allows me to delete and re-queue them individually but I can't believe that I have to dive into the django shell to do it in bulk.
What have I missed?

The Queue class has an empty() method that can be accessed like:
import django_rq
q = django_rq.get_failed_queue()
q.empty()
However, in my tests, that only cleared the failed list key in Redis, not the job keys themselves. So your thousands of jobs would still occupy Redis memory. To prevent that from happening, you must remove the jobs individually:
import django_rq

q = django_rq.get_failed_queue()
while True:
    job = q.dequeue()
    if not job:
        break
    job.delete()  # Will delete key from Redis
As for having a button in the admin interface, you'd have to change the django-rq/templates/django-rq/jobs.html template, which extends admin/base_site.html and doesn't seem to give any room for customizing.

The redis-cli allows FLUSHDB, which deletes every key in the current Redis database; great for my local environment as I generate a bizzallion jobs.
With a working Django integration I will update. Just adding $0.02.

You can empty any queue by name using the following code sample:
import django_rq

queue = "default"
q = django_rq.get_queue(queue)
q.empty()
or even have a Django command for that:
import django_rq
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument("-q", "--queue", type=str)

    def handle(self, *args, **options):
        q = django_rq.get_queue(options.get("queue"))
        q.empty()

As @augusto-men's method seems not to work anymore, here is another solution:
You can use the raw connection to delete the failed jobs. Just iterate over the rq:job keys and check each job's status.
from django_rq import get_connection
from rq.job import Job

# delete failed jobs
con = get_connection('default')
for key in con.keys('rq:job:*'):
    job_id = key.decode().replace('rq:job:', '')
    job = Job.fetch(job_id, connection=con)
    if job.get_status() == 'failed':
        con.delete(key)
con.delete('rq:failed:default')  # reset failed jobs registry

The other answers are outdated now that RQ has been updated to use Registries.
Now you need to do the following to loop through and delete failed jobs. This would work for any other Registry as well.
import django_rq
from rq.registry import FailedJobRegistry

failed_registry = FailedJobRegistry('default', connection=django_rq.get_connection())
for job_id in failed_registry.get_job_ids():
    try:
        failed_registry.remove(job_id, delete_job=True)
    except:
        # failed jobs expire in the queue. There's a
        # chance this will raise NoSuchJobError
        pass
Source

You can empty a queue from the command line with:
rq empty [queue-name]
Running rq info will list all the queues.
