Why does Celery task autodiscovery never trigger under Django? - django

Everything is working except task autodiscovery under Django. (Celery 5.2.6, Django 3.0.6)
If I run celery worker, the worker triggers autodiscovery, finds all my tasks, and displays them in a list as part of its startup process. However, if I run my Django app or Django shell, this never happens.
Additionally, even though the docs promise that accessing the task registry will trigger autodiscovery, it does not. I print app.tasks, I call app.tasks.keys(), and still no autodiscovery - it only shows the tasks that are built-in or were registered when the module containing them happened to be imported for other reasons.
What do I need to do to trigger task autodiscovery?
PS - If I try adding force=True to app.autodiscover_tasks(), it fails because the Django app registry hasn't finished loading at that time.

I dug into the code and tried a bunch of things. Here's what I learned.
How Autodiscovery is Performed
Autodiscovery is actually performed by app._autodiscover_tasks(), which is invoked by the celery.signals.import_modules signal, which is sent by app.loader.import_default_modules(). The signal is connected to app._autodiscover_tasks() in app.autodiscover_tasks.
In other words, the only way to get autodiscovery to trigger is to:
call app.autodiscover_tasks() to register app._autodiscover_tasks as a receiver for signal celery.signals.import_modules
call app.loader.import_default_modules() to send the signal
The docs instruct me to do the first, but not the second. Thus, in my Django app, autodiscovery does not happen.
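The two-step mechanism described above can be modelled in plain Python. This is a toy sketch, not Celery's actual code: ToySignal, ToyApp, and the discovered attribute are hypothetical names invented for illustration. It only shows why connecting a receiver has no effect until something sends the signal.

```python
class ToySignal:
    """Minimal stand-in for a signal: receivers run only when send() is called."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def send(self):
        for receiver in self.receivers:
            receiver()


class ToyApp:
    """Hypothetical app that mimics the register-then-send flow."""

    def __init__(self):
        self.import_modules = ToySignal()
        self.discovered = False

    def autodiscover_tasks(self):
        # Step 1: only registers the receiver; nothing is discovered yet.
        self.import_modules.connect(self._autodiscover_tasks)

    def _autodiscover_tasks(self):
        self.discovered = True

    def import_default_modules(self):
        # Step 2: sending the signal is what actually runs discovery.
        self.import_modules.send()


toy_app = ToyApp()
toy_app.autodiscover_tasks()
discovered_after_step_one = toy_app.discovered   # still False
toy_app.import_default_modules()
discovered_after_step_two = toy_app.discovered   # now True
```

In this model, skipping the second call leaves discovered at False indefinitely, which mirrors the behaviour seen in the Django process.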
How Autodiscovery is Performed in celery
If you run celery with the commands beat, report, shell, or worker, the resulting code path eventually calls app.loader.import_default_modules() directly. Thus, autodiscovery happens under those conditions - but not when you run your Django app, or open a Django shell, or any other context but those four commands, unless you explicitly call app.loader.import_default_modules().
On Finalize
The docs promise that accessing app.tasks will cause the app to finalize, which I assumed triggered autodiscovery, but it has nothing to do with that. All that finalize does is send an app.on_after_finalize signal, which does nothing.
My Solution
After much frustration and experimentation, I settled on this snippet from my celery.py:
app = Celery("myproj")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

@app.on_after_configure.connect()
def trigger_autodiscovery(sender, **kwargs):
    app.loader.import_default_modules()

Related

Celery task calls endpoint. Is it celery or the django server that does the job?

This is a generic question that I'm seeking an answer to because of a Celery task I saw in my company's codebase, written by a previous employee.
It's a shared task that calls an endpoint, like this:
@shared_task(time_limit=60*60)
def celery_task_here(some_args):
    data = get_data(user, url, server_name)
    # some other logic to build csv and stuff

def get_data(user, url, server_name):
    client = APIClient()
    client.force_authenticate(user=user)
    response = client.get(some_url, format='json', SERVER_NAME=server_name)
and all the logic resides in that endpoint.
My understanding is that this makes the server do all the work and does not take advantage of Celery, but I do see the Celery log producing queries when I run this locally. Who is actually doing the work in this case: Celery or the Django server?
If the task is called via celery_task_here.delay, the task will be pushed to a queue, and the worker process responsible for that queue will actually execute it. That worker is not the "Django server". The worker process could be on the same machine as your Django instance; it depends on your environment.
If you were to call the task as a normal function, celery_task_here(...), it would be executed synchronously in the calling process, which here is the Django server. Note that celery_task_here.s(...) on its own only builds a signature; nothing runs until that signature is called or sent to a worker.
It depends on how the task is called.
If it is called as a Celery task with apply_async or delay, then it is executed as a Celery task by a Celery worker process.
You can still call it as a normal function, without sending it to Celery, by simply calling it directly.
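The difference between the two call styles can be illustrated with a toy model. This is only a sketch using the standard library: a queue.Queue stands in for the broker, a thread named "toy-worker" stands in for the Celery worker (real workers are separate processes), and the delay helper and toy_task are hypothetical names, not Celery's API.

```python
import queue
import threading

task_queue = queue.Queue()   # stand-in for the broker queue
executed_by = []             # records which thread ran each call


def toy_task(label):
    executed_by.append((label, threading.current_thread().name))


def delay(func, *args):
    """Mimic .delay(): enqueue a message instead of running the function."""
    task_queue.put((func, args))


def worker_loop():
    """Mimic a worker: pull messages and execute them until told to stop."""
    while True:
        item = task_queue.get()
        if item is None:
            break
        func, args = item
        func(*args)


worker = threading.Thread(target=worker_loop, name="toy-worker")
worker.start()

toy_task("direct")            # runs in the calling thread
delay(toy_task, "queued")     # runs in the worker thread
task_queue.put(None)          # stop marker for the toy worker
worker.join()

results = dict(executed_by)
```

Here results["queued"] is "toy-worker", while results["direct"] is the name of whichever thread made the call.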

Celery: what happens to running tasks when using app.control.purge()?

Currently I have a Celery batch running with Django like so:
Celery.py:
from __future__ import absolute_import, unicode_literals
import os
import celery
from celery import Celery
from celery.schedules import crontab
from dotenv import load_dotenv  # needed for the load_dotenv() call below
import django

load_dotenv(os.path.join(os.path.dirname(os.path.dirname(__file__)), '.env'))
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'base.settings')
django.setup()
app = Celery('base')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    app.control.purge()
    sender.add_periodic_task(30.0, check_loop.s())
    recursion_function.delay()  # recursive because it must wait for the loop to finish (duration can't be predicted)
    print("setup_periodic_tasks")

@app.task()
def check_loop():
    # .....
    # start = database start number
    # end = database end number
    # call the APIs in a list from id=start to id=end
    # create objects
    # update database (start number = end, end number = end + 3)
    # ....
    pass

@app.task()
def recursion_function(default_retry_delay=10):
    # .....
    # do some looping
    # ....
    # when finished, call itself again
    recursion_function.apply_async(countdown=30)
My aim is that whenever the Celery file is edited, all tasks restart and any queued tasks that have not yet executed are removed. (I'm doing this because recursion_function queues itself again once it finishes checking each record of a table in my database, so I'm not worried about it stopping midway.)
The check_loop function calls a paginated API that returns a list of objects, and I compare that list against the records in a table; if there is a match, I create a new custom record of another model.
My question is: when I purge all messages, will the currently running task be stopped midway, or will it keep running? If check_loop stops midway through the API list, it will run the loop again and create duplicate records, which I don't want.
EXAMPLE:
While check_loop() is running, it creates objects midway (from API list elements id=2 to id=5). The server restarts and the task runs again; now check_loop() starts from the beginning (API list elements id=2 to id=5) and creates objects from that list again, which I definitely don't want.
Is this how it works? I just need confirmation.
EDIT:
https://docs.celeryproject.org/en/4.4.1/faq.html#how-do-i-purge-all-waiting-tasks
I added app.control.purge() because when I restart, recursion_function gets called again in setup_periodic_tasks while the previous recursion_function, queued by recursion_function.apply_async(countdown=30), executes too, so it multiplies.
Yes, the worker will continue executing the currently running task unless the worker itself is also restarted.
Also, the Celery way is to always expect tasks to run in a concurrent environment, with the following considerations:
there are many tasks running concurrently
there are many celery workers executing tasks
same task may run again
multiple instances of the same task may run at the same moment
any task may be terminated any time
Even if you are sure that in your environment only one worker is started and stopped manually and these considerations do not apply, tasks should be written so that all of this can happen safely.
Some useful techniques:
use database transactions
use locking
split long-running tasks into faster ones
if a task has intermediate values that need to be saved or are important (i.e. non-reproducible, like some API calls) and processing them in the next step takes time, consider splitting it into several chained tasks
If you need to run only one instance of a task at a time, use some sort of locking: create or update a lock record in the database or in the cache so that other instances of the same task can see that it is running and either return or wait for the previous one to complete.
For example, recursion_function could also be a periodic task. Being a periodic task ensures that it runs every interval even if the previous run fails for any reason (and thus fails to queue itself again, as a regular non-periodic task would). With locking you can make sure only one instance is running at a time.
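A minimal sketch of the lock-record idea, using only the standard library. ToyCache is an in-memory stand-in for a shared cache whose add() stores a key only if it is absent (Django's cache.add() behaves this way); LOCK_KEY and run_exclusively are hypothetical names chosen for illustration.

```python
import threading


class ToyCache:
    """In-memory stand-in for a shared cache with an atomic add()."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def add(self, key, value):
        # Store the key only if it is absent; report whether we stored it.
        with self._guard:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._guard:
            self._data.pop(key, None)


cache = ToyCache()
LOCK_KEY = "recursion_function-lock"  # hypothetical key name


def run_exclusively(work):
    """Run work() only if no other instance currently holds the lock."""
    if not cache.add(LOCK_KEY, "locked"):
        return "skipped"
    try:
        return work()
    finally:
        cache.delete(LOCK_KEY)  # always release, even if work() raises


def overlapping_attempt():
    # Simulates a second instance starting while the first still runs.
    nested = run_exclusively(lambda: "ran")
    return ("ran", nested)


first = run_exclusively(overlapping_attempt)
second = run_exclusively(lambda: "ran")  # lock was released, so this runs
```

With a real shared cache the lock would normally also carry a timeout, so that a worker killed mid-task does not hold it forever; that detail is left out of this sketch.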
check_loop():
First, it is recommended to save the results in a single database transaction, so that either everything or nothing is saved or modified.
You can also save a marker that records how many objects were saved, or their status, so future tasks can check this marker instead of each object.
Alternatively, check for each element whether it already exists in the database before creating it.
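The transaction and existence-check suggestions above can be sketched with the standard library's sqlite3 module. This is an illustration under assumptions, not the asker's schema: the custom_record table, its api_id and payload columns, and save_batch are hypothetical. In Django, transaction.atomic() and get_or_create() would play similar roles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE custom_record (api_id INTEGER PRIMARY KEY, payload TEXT)"
)


def save_batch(items):
    """Insert all items in one transaction, skipping ids already stored."""
    with conn:  # commits on success, rolls back if an exception escapes
        for api_id, payload in items:
            conn.execute(
                "INSERT OR IGNORE INTO custom_record (api_id, payload) "
                "VALUES (?, ?)",
                (api_id, payload),
            )


batch = [(2, "a"), (3, "b"), (4, "c"), (5, "d")]
save_batch(batch)
save_batch(batch)  # simulated re-run after a restart: no duplicates

# A task that dies midway leaves nothing behind, because of the rollback.
try:
    with conn:
        conn.execute("INSERT INTO custom_record VALUES (6, 'e')")
        raise RuntimeError("simulated crash midway")
except RuntimeError:
    pass

row_count = conn.execute("SELECT COUNT(*) FROM custom_record").fetchone()[0]
partial_rows = conn.execute(
    "SELECT COUNT(*) FROM custom_record WHERE api_id = 6"
).fetchone()[0]
```

The primary key on api_id is what makes the re-run harmless; the transaction is what prevents a half-written batch.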
I am not going to write an essay like Oleg's excellent post above. The answer is simply - all running tasks will continue running. purge is all about the tasks that are in the queue(s), waiting to be picked by Celery workers.

Why does apscheduler not work in uwsgi mode?

I have a flask application. This application mimics routing of vehicles in a city and when a vehicle reaches a designated point, I have to wait for 30-180 seconds before starting it again. I am trying to use apscheduler for this.
When the vehicle arrives I start an apscheduler job (with the 'date' trigger for X seconds). When the job fires, I do my processing.
This works well on my dev machine when I am running the flask app standalone. But when I try to run it on my production server (where the app is running in uwsgi mode) the job never fires. I have already set --enable-threads=true for the app, so that doesn't seem to be the problem.
My relevant code is like this.
At my app initialization.
scheduler = BackgroundScheduler()
scheduler.start()
Whenever the trigger happens.
scheduler.add_job(func=myfunc, trigger='date', run_date=datetime.datetime.now() + datetime.timedelta(seconds=value))
Anything I am missing in using apscheduler in uwsgi mode? Or any other options in flask to achieve what I want to?

How can I get rid of legacy tasks still in the Celery / RabbitMQ queue?

I am running Django + Celery + RabbitMQ. After modifying some task names I started getting "unregistered task" KeyErrors, even after removing tasks with this key from the Periodic tasks table in Django Celery Beat and restarting the Celery worker. They persist even after running with the --purge option.
How can I get rid of them?
To flush out the last of these tasks, you can re-implement them with their old method headers, but no logic.
For example, if you removed the method original and are now getting the error
[ERROR/MainProcess] Received unregistered task of type u'myapp.tasks.original'
Just recreate the original method as follows:
tasks.py
@shared_task
def original():
    # keep legacy task header so that it is flushed out of queue
    # FIXME: this will be removed in the next release
    pass
Once you have run this version in each environment, any remaining tasks will be processed (and do nothing). Ensure that you have removed them from your Periodic tasks table, and that they are no longer being invoked. You can then remove the method before your next deployment, and the issue should not recur.
This is still a workaround, and it would be preferable to be able to review and delete the tasks individually.

Stopping/Purging Periodic Tasks in Django-Celery

I have managed to get periodic tasks working in django-celery by subclassing PeriodicTask. I tried to create a test task and set it running doing something useless. It works.
Now I can't stop it. I've read the documentation and I cannot find out how to remove the task from the execution queue. I have tried using celeryctl and using the shell, but registry.tasks() is empty, so I can't see how to remove it.
I have seen suggestions that I should "revoke" it, but for this I appear to need a task id, and I can't see how I would find the task id.
Thanks.
A task is a message, and a "periodic task" sends task messages at periodic intervals. Each of the tasks sent will have a unique id assigned to it.
revoke will only cancel a single task message. To get the id for a task you have to keep
track of the id sent, but you can also specify a custom id when you send a task.
I'm not sure if you want to cancel a single task message, or if you want to stop the periodic task from sending more messages, so I'll list answers for both.
There is no built-in way to keep the id of a task sent with periodic tasks,
but you could set the id for each task to the name of the periodic task, that way
the id will refer to any task sent with the periodic task (usually the last one).
You can specify a custom id this way,
either with the @periodic_task decorator:
@periodic_task(options={"task_id": "my_periodic_task"})
def my_periodic_task():
    pass
or with the CELERYBEAT_SCHEDULE setting:
CELERYBEAT_SCHEDULE = {name: {"task": task_name,
"options": {"task_id": name}}}
If you want to remove a periodic task you simply remove the @periodic_task from the codebase, or remove the entry from CELERYBEAT_SCHEDULE.
If you are using the Django database scheduler you have to remove the periodic task
from the Django Admin interface.
PS1: revoke doesn't stop a task that has already been started. It only cancels
tasks that haven't been started yet. You can terminate a running task using
revoke(task_id, terminate=True). By default this will send the TERM signal to
the process, if you want to send another signal (e.g. KILL) use
revoke(task_id, terminate=True, signal="KILL").
PS2: revoke is a remote control command so it is only supported by the RabbitMQ
and Redis broker transports.
If you want your task to support cancellation you should do so by storing a cancelled
flag in a database and have the task check that flag when it starts:
from celery.task import Task

class RevokeableTask(Task):
    """Task that can be revoked.

    Example usage:

        @task(base=RevokeableTask)
        def mytask():
            pass
    """

    def __call__(self, *args, **kwargs):
        # Skip execution if the task was flagged as cancelled.
        if revoke_flag_set_in_db_for(self.request.id):
            return
        return super(RevokeableTask, self).__call__(*args, **kwargs)
Just in case this may help someone ... We had the same problem at work, and despite some efforts to find some kind of management command to remove the periodic task, we could not. So here are some pointers.
You should probably first double-check which scheduler class you're using.
The default scheduler is celery.beat.PersistentScheduler, which is simply keeping track of the last run times in a local database file (a shelve).
In our case, we were using the djcelery.schedulers.DatabaseScheduler class.
django-celery also ships with a scheduler that stores the schedule in the Django database
Although the documentation does mention a way to remove the periodic tasks:
Using django-celery's scheduler you can add, modify and remove periodic tasks from the Django Admin.
We wanted to perform the removal programmatically, or via a (celery/management) command in a shell.
Since we could not find a command line, we used the django/python shell:
$ python manage.py shell
>>> from djcelery.models import PeriodicTask
>>> pt = PeriodicTask.objects.get(name='the_task_name')
>>> pt.delete()
I hope this helps!