How do we trigger multiple Airflow DAGs using TriggerDagRunOperator?

I have a scenario wherein a particular DAG, upon completion, needs to trigger multiple DAGs. I have used TriggerDagRunOperator to trigger a single DAG; is it possible to pass multiple DAGs to the TriggerDagRunOperator to trigger multiple DAGs?
And is it possible to trigger only upon successful completion of the current DAG?

I have faced the same problem. There is no solution out of the box, but we can write a custom operator for it.
So here is the code of a custom operator that takes python_callable and trigger_dag_id as arguments:
from datetime import datetime

from airflow import settings
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator, DagRunOrder
from airflow.utils.decorators import apply_defaults
from airflow.utils.state import State


class TriggerMultiDagRunOperator(TriggerDagRunOperator):
    @apply_defaults
    def __init__(self, op_args=None, op_kwargs=None, *args, **kwargs):
        super(TriggerMultiDagRunOperator, self).__init__(*args, **kwargs)
        self.op_args = op_args or []
        self.op_kwargs = op_kwargs or {}

    def execute(self, context):
        session = settings.Session()
        created = False
        for dro in self.python_callable(context, *self.op_args, **self.op_kwargs):
            if not dro or not isinstance(dro, DagRunOrder):
                break
            if dro.run_id is None:
                dro.run_id = 'trig__' + datetime.utcnow().isoformat()

            dbag = DagBag(settings.DAGS_FOLDER)
            trigger_dag = dbag.get_dag(self.trigger_dag_id)
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                conf=dro.payload,
                external_trigger=True,
            )
            created = True
            self.log.info("Creating DagRun %s", dr)

        if created is True:
            session.commit()
        else:
            self.log.info("No DagRun created")
        session.close()
trigger_dag_id is the id of the DAG we want to run multiple times.
python_callable is a function; it should return a list of DagRunOrder objects, one object per DagRun to schedule for the DAG with dag_id trigger_dag_id.
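For example, a minimal usage sketch (the target DAG id target_dag, the task id, and the payload key item_id are hypothetical):
def generate_dag_run_orders(context):
    # Hypothetical generator: one DagRunOrder per item, each with its own payload.
    for item_id in range(3):
        yield DagRunOrder(payload={'item_id': item_id})

trigger = TriggerMultiDagRunOperator(
    task_id='trigger_target_dag_runs',
    trigger_dag_id='target_dag',  # the DAG to run once per DagRunOrder
    python_callable=generate_dag_run_orders,
    dag=dag,
)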
Code and examples on GitHub: https://github.com/mastak/airflow_multi_dagrun
A little more description of this code: https://medium.com/@igorlubimov/dynamic-scheduling-in-airflow-52979b3e6b13

In Airflow 2, you can use dynamic task mapping. For example:
import uuid
import random
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

dag_args = {
    "start_date": datetime(2022, 9, 9),
    "schedule_interval": None,
    "catchup": False,
}


@task
def define_runs():
    num_runs = random.randint(3, 5)
    runs = [str(uuid.uuid4()) for _ in range(num_runs)]
    return runs


@dag(**dag_args)
def dynamic_tasks():
    runs = define_runs()

    run_dags = TriggerDagRunOperator.partial(
        task_id="run_dags",
        trigger_dag_id="hello_world",
        conf=None,
    ).expand(
        trigger_run_id=runs,
    )

    run_dags


dag = dynamic_tasks()
See the Airflow docs on dynamic task mapping.
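For completeness, a minimal sketch of a hello_world DAG that the example above could trigger (nothing here beyond standard TaskFlow pieces):
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 9, 9), schedule_interval=None, catchup=False)
def hello_world():

    @task
    def greet():
        print("hello world")

    greet()


hello_world_dag = hello_world()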

You can try looping it! For example:
# dag_ids holds the ids of the DAGs to trigger
for i in dag_ids:
    trigger_dag = TriggerDagRunOperator(
        task_id='trigger_' + i,
        trigger_dag_id=i,
        python_callable=conditionally_trigger_non_indr,
        dag=dag,
    )
Set these operators as dependents of whichever task is required. I have automated something like this for PythonOperator. You could try whether this works for you!

As the API docs state, the operator accepts a single dag_id. However, if you want to unconditionally kick off downstream DAGs upon completion, why not just put those tasks in a single DAG and set your dependencies/workflow there? You would then be able to set depends_on_past=True where appropriate.
EDIT: An easy workaround, if you absolutely need them in separate DAGs, is to create multiple TriggerDagRunOperators and set them all downstream of the same task.
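A minimal sketch of that workaround (the DAG ids dag_b and dag_c and the task final_task are hypothetical); since the default trigger rule is all_success, the triggers fire only if the upstream task succeeded, which also covers the second part of the question:
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

trigger_b = TriggerDagRunOperator(task_id='trigger_dag_b', trigger_dag_id='dag_b', dag=dag)
trigger_c = TriggerDagRunOperator(task_id='trigger_dag_c', trigger_dag_id='dag_c', dag=dag)

# Both triggers depend on the same task and run only if it succeeds
# (the default trigger_rule='all_success').
final_task >> [trigger_b, trigger_c]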

RunID is coming as manual even if the DAG is being scheduled using TriggerDagRunOperator

I am triggering my DAG automatically on a schedule using TriggerDagRunOperator, but the run_id comes out as manual__(time). I want the run_id to come out as scheduled__ or something similar, not manual__, so I can distinguish it from a manually triggered DAG. I am using Airflow 2.
Issue: TriggerDagRunOperator also generates a run_id inside its execute method, right? We are using that run_id for our pipeline. The problem is that the run_id comes out as manual__ followed by a timestamp. I want this manual__ prefix replaced with triggered__ or scheduled__.
I share with you a way to override the TriggerDagRunOperator operator.
In this case, we can interact with the current context, apply logic based on the current run_id, and override the run_id param with the newly calculated value in the operator. The code should look like:
from __future__ import annotations

import datetime
from typing import List, Dict, Optional, Union

from airflow.operators.trigger_dagrun import TriggerDagRunOperator


class CustomTriggerDagRunOperator(TriggerDagRunOperator):

    def __init__(self,
                 trigger_dag_id: str,
                 trigger_run_id: Optional[str] = None,
                 conf: Optional[Dict] = None,
                 execution_date: Optional[Union[str, datetime.datetime]] = None,
                 reset_dag_run: bool = False,
                 wait_for_completion: bool = False,
                 poke_interval: int = 60,
                 allowed_states: Optional[List] = None,
                 failed_states: Optional[List] = None,
                 **kwargs) -> None:
        super(CustomTriggerDagRunOperator, self).__init__(
            trigger_dag_id=trigger_dag_id,
            trigger_run_id=trigger_run_id,
            conf=conf,
            execution_date=execution_date,
            reset_dag_run=reset_dag_run,
            wait_for_completion=wait_for_completion,
            poke_interval=poke_interval,
            allowed_states=allowed_states,
            failed_states=failed_states,
            **kwargs
        )

    def execute(self, context):
        current_trigger_run_id = self.trigger_run_id

        # Apply your logic.
        self.trigger_run_id = ...

        super(CustomTriggerDagRunOperator, self).execute(context)
Create a class CustomTriggerDagRunOperator that extends the TriggerDagRunOperator operator.
Override the constructor of the operator.
Apply your logic based on the trigger_run_id field.
Call the execute method on the parent operator.
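For instance, a minimal sketch of the "apply your logic" step that swaps the manual__ prefix for scheduled__ (the prefix handling is hypothetical; adapt it to your pipeline):
from airflow.utils import timezone


class ScheduledRunIdTriggerDagRunOperator(TriggerDagRunOperator):

    def execute(self, context):
        # Hypothetical logic: make the run_id carry a scheduled__ prefix
        # instead of the default manual__ one.
        if self.trigger_run_id is None:
            self.trigger_run_id = 'scheduled__' + timezone.utcnow().isoformat()
        elif self.trigger_run_id.startswith('manual__'):
            self.trigger_run_id = 'scheduled__' + self.trigger_run_id[len('manual__'):]
        super().execute(context)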

Saving a celery task (for re-running) in database

Our workflow is currently built around an old version of Celery, so bear in mind things are already not optimal. We need to run a task and save a record of that task run in the database. If that task fails or hangs (it happens often), we want to re-run it exactly as it was run the first time. This shouldn't happen automatically, though: it needs to be triggered manually depending on the nature of the failure, and the result needs to be logged in the DB to make that decision (via a front end).
How can we save a complete record of a task in the DB so that a subsequent process can grab the record and run a new identical task? The current implementation saves the path of the @task-decorated function in the DB as part of a TaskInfo model. When the task needs to be rerun, we have a get_task() method on the TaskInfo model that gets the path from the DB and imports it using import_module and getattr, and another rerun() method that runs the task again with *args, **kwargs (also saved in the DB).
Like so (these are methods on the TaskInfo model instance):
def get_task(self):
    """Returns the task's decorated function, which can be delayed."""
    module_name, object_name = self.path.rsplit('.', 1)
    module = import_module(module_name)
    task = getattr(module, object_name)
    if inspect.isclass(task):
        task = task()
    # task = current_app.tasks[self.path]
    return task

def rerun(self):
    """Re-run the task, and replace this one.

    - A new task is scheduled to run.
    - The new task's TaskInfo has the same parent as this TaskInfo.
    - This TaskInfo is deleted.
    """
    args, kwargs = self.get_arguments()
    celery_task = self.get_task()
    async_result = celery_task.delay(*args, **kwargs)  # delay() returns an AsyncResult
    defaults = {
        'path': self.path,
        'status': Status.PENDING,
        'timestamp': timezone.now(),
        'args': args,
        'kwargs': kwargs,
        'parent': self.parent,
    }
    TaskInfo.objects.update_or_create(task_id=async_result.id, defaults=defaults)
    self.delete()
There must be a cleaner solution for saving a task in the DB to rerun later, right?
Celery 4.4.0 introduced the result_extended setting. You can set it to True, and then the table in the result backend database (named celery_taskmeta by default) will store the args and kwargs of the task.
Here is a demo:
import time

from celery import Celery

app = Celery('test_result_backend')
app.conf.update(
    broker_url='redis://localhost:6379/10',
    result_backend='db+mysql://root:passwd@localhost/celery_toys',
    result_extended=True,
)


@app.task(bind=True, name='add')
def add(self, x, y):
    self.request.task_name = 'add'  # For saving the task name.
    time.sleep(5)
    return x + y
With the task info recorded in MySQL, you are able to re-run your task easily.
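For example, a minimal re-run sketch that reads the stored metadata back from the result backend table (this assumes the default celery_taskmeta schema with name, args, and kwargs columns, a JSON result serializer, and a hypothetical failed_task_id; it is an illustration, not a documented Celery API):
import json

from sqlalchemy import create_engine, text

# Same database the result backend writes to.
engine = create_engine('mysql://root:passwd@localhost/celery_toys')

with engine.connect() as conn:
    row = conn.execute(
        text("SELECT name, args, kwargs FROM celery_taskmeta WHERE task_id = :tid"),
        {"tid": failed_task_id},  # hypothetical: id of the run to repeat
    ).one()

# With result_serializer='json', args/kwargs are stored as JSON blobs.
args = json.loads(row.args)
kwargs = json.loads(row.kwargs)

# Re-dispatch by task name with the original arguments.
app.send_task(row.name, args=args, kwargs=kwargs)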

How to invoke multiple tasks and collect task statuses with Django/Celery

I am setting up multiple tasks in my tasks.py and calling the tasks from views.py. I want to invoke all the different tasks in a for loop so I can collect the statuses easily and build a progress bar. I am currently able to invoke all the tasks line by line (as shown below).
Here are my questions: how do I invoke the different tasks from views.py in a for loop? It always gives me the error "unicode does not have attribute delay()". Or is there a better way to collect the statuses of different tasks and build the progress bar from them?
I have tried to invoke the functions in views.py like this:
for i in range(1, 6):
    functionName = "calculation" + str(i)
    functionName.delay(accountNumber)
But this gives the error stated above: "unicode does not have attribute delay()". (The tasks are imported from tasks.py into views.py.)
My current tasks.py:
@shared_task
def calculation1(arg):
    ...  # some action here

@shared_task
def calculation2(arg):
    ...  # some action here

@shared_task
def calculation3(arg):
    ...  # some action here

@shared_task
def calculation4(arg):
    ...  # some action here

@shared_task
def calculation5(arg):
    ...  # some action here
My views.py:
result_calculation1 = calculation1.delay(accountNumber)
result_calculation2 = calculation2.delay(accountNumber)
result_calculation3 = calculation3.delay(accountNumber)
result_calculation4 = calculation4.delay(accountNumber)
result_calculation5 = calculation5.delay(accountNumber)
I want to collect all the task statuses in a for loop so I can build a progress bar, but any better suggestion for collecting task statuses and building a progress bar is welcome.
Thank you very much in advance for your help.
You need to use getattr() to retrieve the functions from the tasks.py module once you've built the names:
from myapp import tasks  # Make sure you import the tasks module

for i in range(1, 6):
    functionName = "calculation" + str(i)
    task = getattr(tasks, functionName)  # Get the task by name from the tasks module
After you've retrieved the task function, you can build up a list of signatures:
signatures = []
signatures.append(task.s(accountNumber)) # Add task signature
From the signatures you can create a group and execute the group as a whole:
from celery import group

task_group = group(signatures)
group_result = task_group.apply_async()  # Execute the group
And from the group_result you can access each individual task result and build the progress bar around that (perhaps iterating the results in group_result and checking each result's status):
for result in group_result:
    status = result.status
    # Your progress bar logic...
Putting it all together:
from celery import group
from myapp import tasks  # Make sure you import the tasks module

signatures = []
for i in range(1, 6):
    functionName = "calculation" + str(i)
    task = getattr(tasks, functionName)  # Get the task from the tasks module
    signatures.append(task.s(accountNumber))  # Add each task signature

task_group = group(signatures)
group_result = task_group.apply_async()  # Execute the group

for result in group_result:
    status = result.status
    # Your progress bar logic...
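As a concrete progress metric, GroupResult also exposes completed_count(), so the progress bar value can be computed like this (a sketch building on the group_result above):
total = len(group_result)
completed = group_result.completed_count()  # tasks that finished successfully
progress_percent = 100 * completed / total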
You can put the tasks into a group and receive a GroupResult; refer to the Celery docs on groups.

How can I set a timeout on Dataflow?

I am using Composer to run my Dataflow pipeline on a schedule. If the job is taking over a certain amount of time, I want it to be killed. Is there a way to do this programmatically either as a pipeline option or a DAG parameter?
Not sure how to do it as a pipeline config option, but here is an idea.
You could launch a Task Queue task with its countdown set to your timeout value. When that task runs, you can check whether your Dataflow job is still running:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
If it is, you can call update on it with the job state JOB_STATE_CANCELLED:
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/update
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#jobstate
This is done through the googleapiclient lib: https://developers.google.com/api-client-library/python/apis/discovery/v1
Here is an example of how to use it:
from google.appengine.api import app_identity
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials


class DataFlowJobsListHandler(InterimAdminResourceHandler):
    # InterimAdminResourceHandler is the author's own base handler class.
    def get(self, resource_id=None):
        """
        Wrapper to this:
        https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs/list
        """
        if resource_id:
            self.abort(405)
        else:
            credentials = GoogleCredentials.get_application_default()
            service = discovery.build('dataflow', 'v1b3', credentials=credentials)
            project_id = app_identity.get_application_id()

            _filter = self.request.GET.pop('filter', 'UNKNOWN').upper()
            jobs_list_request = service.projects().jobs().list(
                projectId=project_id,
                filter=_filter)  # e.g. 'ACTIVE'
            jobs_list = jobs_list_request.execute()

            return {
                '$cursor': None,
                'results': jobs_list.get('jobs', []),
            }
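For the cancellation step itself, a minimal sketch (project_id and job_id would come from the listing step above; per the Dataflow REST docs, cancelling is done by updating the job's requestedState):
credentials = GoogleCredentials.get_application_default()
service = discovery.build('dataflow', 'v1b3', credentials=credentials)

# Request cancellation by setting the job's requested state.
service.projects().jobs().update(
    projectId=project_id,
    jobId=job_id,
    body={'requestedState': 'JOB_STATE_CANCELLED'},
).execute()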

Django celery task keep global state

I am currently developing a Django application based on django-tenants-schema. You don't need to look into the actual code of the module, but the idea is that it has a global setting for the current database connection defining which schema to use for the application tenant, e.g.
tenant = tenants_schema.get_tenant()
and, for setting it:
tenants_schema.set_tenant(xxx)
For some of the tasks, I would like them to remember the global tenant that was selected at instantiation time, e.g. in theory:
from celery import Task


class AbstractTask(Task):

    def before_submit(self):
        """Run this method before returning the task future."""
        self.run_args['tenant'] = tenants_schema.get_tenant()

    def before_run(self):
        """This method is run before the related .run() task method."""
        tenants_schema.set_tenant(self.run_args['tenant'])
Is there an elegant way of doing this in Celery?
Celery (as of 3.1) has signals you can hook into to do this. You can alter the kwargs that were passed in, and on the other side, undo your alterations before they're given to the actual task:
from threading import local

from celery import shared_task
from celery.signals import before_task_publish, task_prerun, task_postrun

current_tenant = local()


@before_task_publish.connect
def add_tenant_to_task(body=None, **unused):
    body['kwargs']['tenant_middleware.tenant'] = getattr(current_tenant, 'id', None)
    print('sending tenant: {t}'.format(t=current_tenant.id))


@task_prerun.connect
def extract_tenant_from_task(kwargs=None, **unused):
    tenant_id = kwargs.pop('tenant_middleware.tenant', None)
    current_tenant.id = tenant_id
    print('current_tenant.id set to {t}'.format(t=tenant_id))


@task_postrun.connect
def cleanup_tenant(**kwargs):
    current_tenant.id = None
    print('cleaned current_tenant.id')


@shared_task
def get_current_tenant():
    # Here is where you would do work that relied on current_tenant.id being set.
    import time
    time.sleep(1)
    return current_tenant.id
And if you run the task (not showing logging from the worker):
In [1]: current_tenant.id = 1234; ct = get_current_tenant.delay(); current_tenant.id = 5678; ct.get()
sending tenant: 1234
Out[1]: 1234
In [2]: current_tenant.id
Out[2]: 5678
The signals are not called if no message is sent (when you call the task function directly, without delay() or apply_async()). If you want to filter on the task name, it is available as body['task'] in the before_task_publish signal handler, and the task object itself is available in the task_prerun and task_postrun handlers.
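For example, a minimal sketch of that filtering in the publish-side handler (the fully-qualified task name is hypothetical):
@before_task_publish.connect
def add_tenant_to_task(body=None, **unused):
    # Only attach the tenant for this specific task (hypothetical name).
    if body.get('task') != 'myapp.tasks.get_current_tenant':
        return
    body['kwargs']['tenant_middleware.tenant'] = getattr(current_tenant, 'id', None)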
I am a Celery newbie, so I can't really tell if this is the "blessed" way of doing "middleware"-type stuff in Celery, but I think it will work for me.
I'm not sure what you mean here; is before_submit executed before the task is called by a client?
In that case, I would rather use a with statement here:
from contextlib import contextmanager


@contextmanager
def set_tenant_db(tenant):
    prev_tenant = tenants_schema.get_tenant()
    try:
        tenants_schema.set_tenant(tenant)
        yield
    finally:
        tenants_schema.set_tenant(prev_tenant)


@app.task
def tenant_task(tenant=None):
    with set_tenant_db(tenant):
        do_actions_here()


tenant_task.delay(tenant=tenants_schema.get_tenant())
You can of course create a base task that does this automatically; you can apply the context in Task.__call__, for example, but I'm not sure that saves you much when you can just use the with statement explicitly. A sketch of that variant follows.
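A minimal sketch of that base-task variant, reusing set_tenant_db from above (the task body is hypothetical):
from celery import Task


class TenantTask(Task):
    """Base task that applies the tenant context around the task body (a sketch)."""

    def __call__(self, *args, **kwargs):
        # Pop the tenant kwarg added by the caller and activate it while running.
        tenant = kwargs.pop('tenant', None)
        with set_tenant_db(tenant):
            return super(TenantTask, self).__call__(*args, **kwargs)


@app.task(base=TenantTask)
def tenant_task():
    do_actions_here()


# Usage: the caller passes the tenant; TenantTask strips it before the body runs.
tenant_task.delay(tenant=tenants_schema.get_tenant())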