Celery: what happens to running tasks when using app.control.purge()? - django

Currently I have a Celery batch job running with Django, like so:
Celery.py:
from __future__ import absolute_import, unicode_literals
import os
import celery
from celery import Celery
from celery.schedules import crontab
import django
from dotenv import load_dotenv
load_dotenv(os.path.join(os.path.dirname(os.path.dirname(__file__)), '.env'))
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'base.settings')
django.setup()
app = Celery('base')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    app.control.purge()
    sender.add_periodic_task(30.0, check_loop.s())
    recursion_function.delay()  # needs to be recursive because it has to wait for the loop to finish (the duration can't be predicted)
    print("setup_periodic_tasks")
@app.task()
def check_loop():
    .....
    start = database start number
    end = database end number
    calling apis in a list from id=start to id=end
    create objects
    update database(start number = end, end number = end + 3)
    ....
@app.task()
def recursion_function(default_retry_delay=10):
    .....
    do some looping
    ....
    # when finished, call itself again
    recursion_function.apply_async(countdown=30)
My aim is that whenever the Celery file gets edited, all tasks restart and any queued tasks that have not yet executed are removed. (I'm doing this because recursion_function queues itself again once it has finished checking each record of a table in my database, so I'm not worried about it stopping midway.)
The check_loop function calls a paginated API that returns a list of objects, and I compare each one to the records in a table; if there is a match, I create a new custom record of another model.
My question is: when I purge all messages, will the currently running task be stopped midway, or will it keep running? If check_loop stops midway through the API list, it will run the loop again and create duplicate records, which I don't want.
EXAMPLE:
While check_loop() is running, it creates objects midway (on the API list from element id=2 to id=5). The server restarts and the task runs again; now check_loop() starts from the beginning (on the API list from element id=2 to id=5) and creates objects from that list again, which I definitely don't want.
Is this how it works? I just need a confirmation.
EDIT:
https://docs.celeryproject.org/en/4.4.1/faq.html#how-do-i-purge-all-waiting-tasks
I added app.control.purge() because when I restart, recursion_function gets called again in setup_periodic_tasks while the previous recursion_function queued by recursion_function.apply_async(countdown=30) executes too, so it multiplies itself.

Yes, the worker will continue executing the currently running task unless the worker itself is also restarted.
Also, the Celery way is to always expect tasks to run in a concurrent environment, with the following considerations:
there are many tasks running concurrently
there are many celery workers executing tasks
same task may run again
multiple instances of the same task may run at the same moment
any task may be terminated any time
Even if you are sure that in your environment there is only one worker, started and stopped manually, and these do not apply, tasks should be written so that any of this can happen.
Some useful techniques:
use database transactions
use locking
split long-running tasks into faster ones
if a task has intermediate values that need to be saved or are important (i.e. non-reproducible, like some API calls) and processing them in the next step takes time, consider splitting it into several chained tasks
If you need only one instance of a task to run at a time, use some sort of locking: create / update a lock record in the database or in the cache so that other instances of the same task can check it, know this task is running, and just return or wait for the previous one to complete.
For example, recursion_function can also be a periodic task. Being a periodic task ensures it runs every interval, even if the previous run fails for any reason (and thus fails to queue itself again, as a regular non-periodic task would). With locking you can make sure only one is running at a time.
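The lock-record idea can be sketched with nothing but the standard library. This is only an illustration under assumptions: the task_lock table, the helper names and the 300-second expiry are made up for the example, and SQLite stands in for whatever database or cache the project really uses.

```python
import sqlite3
import time


def acquire_lock(conn, name, ttl_seconds=300):
    """Try to take the named lock; return True on success, False if it is held."""
    now = time.time()
    with conn:  # one transaction: committed on success, rolled back on error
        # Drop a stale lock left behind by a task that was killed mid-run.
        conn.execute(
            "DELETE FROM task_lock WHERE name = ? AND expires_at < ?", (name, now)
        )
        try:
            # The primary key on `name` lets only one holder insert the row.
            conn.execute(
                "INSERT INTO task_lock (name, expires_at) VALUES (?, ?)",
                (name, now + ttl_seconds),
            )
        except sqlite3.IntegrityError:
            return False
    return True


def release_lock(conn, name):
    """Remove the lock record so the next run can start."""
    with conn:
        conn.execute("DELETE FROM task_lock WHERE name = ?", (name,))


def run_exclusively(conn, name, work):
    """Run work() only if no other instance holds the lock."""
    if not acquire_lock(conn, name):
        return "skipped"
    try:
        work()
        return "ran"
    finally:
        release_lock(conn, name)


# Hypothetical lock table; in a Django project this would be a model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_lock (name TEXT PRIMARY KEY, expires_at REAL NOT NULL)"
)
```

A periodic recursion_function would wrap its body in something like run_exclusively, so an overlapping run simply returns instead of doing the work twice. The expiry is there because any task may be terminated at any time and would otherwise leave the lock behind forever.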
check_loop():
First, it is recommended to save the results in a single database transaction, so that either everything or nothing is saved / modified.
You can also save a marker that records how many objects were saved or their status, so future tasks can check just this marker rather than each object.
Or, before creating each element, check whether it already exists in the database.
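A minimal sketch of these three ideas together, assuming a hypothetical custom_record table keyed by the API element id and a one-row progress table holding the start/end window from the question. SQLite is used only so the example is self-contained; the names are not from the original project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the primary key on api_id makes a repeated insert a no-op.
conn.execute("CREATE TABLE custom_record (api_id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "CREATE TABLE progress ("
    "id INTEGER PRIMARY KEY CHECK (id = 1), start_id INTEGER, end_id INTEGER)"
)
conn.execute("INSERT INTO progress VALUES (1, 2, 5)")
conn.commit()


def process_window(conn, api_items):
    """Create records for the current id window and advance it, all or nothing."""
    with conn:  # commits if the block finishes, rolls back if it raises
        start_id, end_id = conn.execute(
            "SELECT start_id, end_id FROM progress WHERE id = 1"
        ).fetchone()
        for api_id, payload in api_items:
            if start_id <= api_id <= end_id:
                # Existence check done by the database: duplicates are ignored.
                conn.execute(
                    "INSERT OR IGNORE INTO custom_record (api_id, payload) "
                    "VALUES (?, ?)",
                    (api_id, payload),
                )
        # The marker moves in the same transaction as the records it describes.
        conn.execute(
            "UPDATE progress SET start_id = ?, end_id = ? WHERE id = 1",
            (end_id, end_id + 3),
        )
```

If the task dies halfway through the loop, neither the new records nor the updated window are committed, so the next run repeats the same window from a clean state. The unique key is a second line of defence for overlapping windows (here id=5 belongs to both 2..5 and 5..8).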

I am not going to write an essay like Oleg's excellent post above. The answer is simply: all running tasks will continue running. purge only affects tasks that are in the queue(s), waiting to be picked up by Celery workers.

Related

Scheduler duplicate email 8 times [duplicate]

We have a web app built with Pyramid and served through gunicorn+nginx. It runs with 8 worker threads/processes.
We needed to schedule jobs, so we chose APScheduler. Here is how we launch it:
from apscheduler.events import EVENT_JOB_EXECUTED, EVENT_JOB_ERROR
from apscheduler.scheduler import Scheduler
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,
                               seconds=JOB_INTERVAL)
The issue is that all the worker processes of gunicorn pick the scheduler up. We tried implementing a file lock but it does not seem like a good enough solution. What would be the best way to make sure at any given time only one of the worker process picks the scheduled event up and no other thread picks it up till next JOB_INTERVAL?
The solution needs to work even with mod_wsgi, in case we decide to switch to apache2+mod_wsgi later. It also needs to work with the single-process development server, which is waitress.
Update from the bounty sponsor
I'm facing the same issue described by the OP, just with a Django app. I'm fairly sure adding this detail won't change much of the original question. For this reason, and to gain a bit more visibility, I also tagged this question with django.
Because Gunicorn is starting with 8 workers (in your example), this forks the app 8 times into 8 processes. These 8 processes are forked from the Master process, which monitors the status of each one and has the ability to add/remove workers.
Each process gets a copy of your APScheduler object, which initially is an exact copy of your Master process's APScheduler. As a result, with n workers each job is executed n times in total, once per worker process.
A hack around this is to run gunicorn with the following options:
env/bin/gunicorn module_containing_app:app -b 0.0.0.0:8080 --workers 3 --preload
The --preload flag tells Gunicorn to "load the app before forking the worker processes". By doing so, each worker is "given a copy of the app, already instantiated by the Master, rather than instantiating the app itself". This means the following code only executes once in the Master process:
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,
                               seconds=JOB_INTERVAL)
Additionally, we need to set the jobstore to be anything other than :memory:. This way, although each worker is its own independent process, incapable of communicating with the other 7, by using a local database (rather than memory) we guarantee a single point of truth for CRUD operations on the jobstore.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
rerun_monitor = Scheduler(
    jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,
                               seconds=JOB_INTERVAL)
Lastly, we want to use the BackgroundScheduler because of its implementation of start(). When we call start() on the BackgroundScheduler, a new thread is spun up in the background, which is responsible for scheduling/executing jobs. This is significant because, as noted in step (1), with the --preload flag we execute start() only once, in the Master Gunicorn process. Forked processes do not inherit the threads of their parent, so the workers don't run the BackgroundScheduler thread.
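from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
rerun_monitor = BackgroundScheduler(
    jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
# APScheduler 3.x (where BackgroundScheduler lives) uses add_job with an 'interval' trigger
rerun_monitor.add_job(job_to_be_run, 'interval',
                      seconds=JOB_INTERVAL)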
As a result of all this, every Gunicorn worker has an APScheduler that has been tricked into a "STARTED" state but isn't actually running, because the forked process dropped its parent's threads. Each instance can still update the jobstore database; it just doesn't execute any jobs.
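The claim that a forked process does not carry over its parent's threads can be checked with a small standard-library experiment. This sketch is POSIX-only (it relies on os.fork) and the function name is made up for the illustration; it has nothing to do with APScheduler's internals.

```python
import os
import threading


def thread_counts_across_fork():
    """Start a background thread, fork, and return (parent_count, child_count)."""
    stop = threading.Event()
    # Stand-in for the scheduler thread: it just waits until told to stop.
    background = threading.Thread(target=stop.wait, daemon=True)
    background.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: only the thread that called fork() exists here.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent process: read what the child reported, then clean up.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        child_count = int(pipe.read())
    os.waitpid(pid, 0)
    parent_count = threading.active_count()
    stop.set()
    background.join()
    return parent_count, child_count
```

The parent sees its main thread plus the background thread, while the child reports a single thread. That is the same effect that leaves each Gunicorn worker with a scheduler object but no scheduler thread. Recent Python versions may emit a DeprecationWarning when forking a multi-threaded process, which is another hint that this setup is a workaround.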
Check out flask-APScheduler for a quick way to run APScheduler in a web-server (like Gunicorn), and enable CRUD operations for each job.
I found a fix that worked with a Django project having a very similar issue. I simply bind a TCP socket the first time the scheduler starts and check against it subsequently. I think the following code can work for you as well with minor tweaks.
import sys, socket
try:
    # Only the first process can bind this port; the socket is kept open on purpose.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 47200))
except socket.error:
    print("!!!scheduler already started, DO NOTHING")
else:
    from apscheduler.schedulers.background import BackgroundScheduler
    scheduler = BackgroundScheduler()
    scheduler.start()
    print("scheduler started")
Short answer: You can't do it properly without consequences.
I'm using Gunicorn as an example, but it is essentially the same for uWSGI. There are various hacks when running multiple processes, to name a few:
use --preload option
use on_starting hook to start the APScheduler background scheduler
use when_ready hook to start the APScheduler background scheduler
They work to some extent but may get the following errors:
worker timing out frequently
scheduler hanging when there are no jobs https://github.com/agronholm/apscheduler/issues/305
APScheduler is designed to run in a single process where it has complete control over the process of adding jobs to job stores. It uses threading.Event's wait() and set() methods to coordinate. If they are run by different processes, the coordination wouldn't work.
It is possible to run it in Gunicorn in a single process:
use only one worker process
use the post_worker_init hook to start the scheduler, this will make sure the scheduler is run only in the worker process but not the master process
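As a rough sketch, the two points above could look like this in a Gunicorn configuration file. post_worker_init is one of Gunicorn's server hooks and add_job with an "interval" trigger is the APScheduler 3.x API; the module path myapp.jobs, the bind address and the interval value are assumptions made for the example.

```python
# gunicorn.conf.py (a sketch; paths and values below are assumptions)

workers = 1  # a single worker process, so only one scheduler ever runs
bind = "0.0.0.0:8080"

JOB_INTERVAL = 60  # seconds; hypothetical value


def post_worker_init(worker):
    """Gunicorn server hook: runs inside the worker after the app is loaded."""
    # Imported here so the master process never loads or starts the scheduler.
    from apscheduler.schedulers.background import BackgroundScheduler
    from myapp.jobs import job_to_be_run  # hypothetical module path

    scheduler = BackgroundScheduler()
    scheduler.add_job(job_to_be_run, "interval", seconds=JOB_INTERVAL)
    scheduler.start()
    worker.scheduler = scheduler  # keep a reference for the worker's lifetime
```

Because the hook runs in the worker rather than in the master, the scheduler thread is created in the process that actually stays alive to serve requests, and nothing depends on threads surviving a fork.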
The author also pointed out that sharing the job store among multiple processes isn't possible: https://apscheduler.readthedocs.io/en/stable/faq.html#how-do-i-share-a-single-job-store-among-one-or-more-worker-processes He also provided a solution using RPyC.
It's entirely doable to wrap APScheduler with a REST interface, but you might want to consider serving it as a standalone app with one worker. In other words, if you have other endpoints, put them in another app where you can use multiple workers.

Can celery-beat trigger only a Celery task, or also a normal function (Django)?

I am working on a Django project with Celery and celery-beat. My main use case is to use celery-beat to set up a periodic task that runs in the background instead of being triggered by a front-end request. I would save the results in a model, then pull the model into a front-end view to show to the user.
My current problem is that no matter how I change the way I call my task, Celery always reports that the task is not registered in its task list.
I am trying to use the celery-beat module to trigger a non-Celery function (which, inside, calls a Celery task).
Below is the pseudo-code.
tasks.py:
from celery import shared_task
@shared_task
def longrunningtask(a):
    res = APIcall(a)
    return res
caller.py:
from .tasks import longrunningtask
def dosomething(input_list):
    res = []
    for ele in input_list:
        # queue one Celery sub-task per element
        res.append(longrunningtask.delay(ele))
    return res
Periodic task:
schedule, created = CrontabSchedule.objects.get_or_create(hour=1, minute=34)
task = PeriodicTask.objects.create(crontab=schedule, name="XXX_task_", task='app.caller.dosomething')
return HttpResponse("Done")
There is nothing special about the periodic task, but this never works for me. It fails with an error saying the task was not detected or not registered unless I make dosomething() a Celery task.
The problem is that I do not want to make the caller function a Celery task, for these reasons:
Inside the for loop I pass a parameter to the task on each iteration, and I would like each iteration to kick off its own long-running Celery task, so that I create multiple sub-tasks instead of one giant running task.
It is not necessary: longrunningtask is the one I need to run as a Celery task; its parent does not need to be a Celery task.
Can someone please help me out of this dilemma? It's super frustrating and has been blocking me for a while.
Any suggestion or idea for this use case would also be super helpful!

How can I get rid of legacy tasks still in the Celery / RabbitMQ queue?

I am running Django + Celery + RabbitMQ. After modifying some task names I started getting "unregistered task" KeyErrors, even after removing tasks with this key from the Periodic tasks table in Django Celery Beat and restarting the Celery worker. They persist even after running with the --purge option.
How can I get rid of them?
To flush out the last of these tasks, you can re-implement them with their old method headers, but no logic.
For example, if you removed the method original and are now getting the error
[ERROR/MainProcess] Received unregistered task of type u'myapp.tasks.original'
Just recreate the original method as follows:
tasks.py
@shared_task
def original():
    # keep legacy task header so that it is flushed out of queue
    # FIXME: this will be removed in the next release
    pass
Once you have run this version in each environment, any remaining tasks will be processed (and do nothing). Ensure that you have removed them from your Periodic tasks table, and that they are no longer being invoked. You can then remove the method before your next deployment, and the issue should not recur.
This is still a workaround, and it would be preferable to be able to review and delete the tasks individually.

Django-celery project, how to handle results from result-backend?

1) I am currently working on a web application that exposes a REST API and uses Django and Celery to handle requests and solve them. To solve a request, a set of Celery tasks has to be submitted to an AMQP queue so that they get executed on workers (located on other machines). Each task is very CPU-intensive and takes a very long time (hours) to finish.
I have configured Celery to also use amqp as the results backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" (the machine running django-celery and publishing tasks to the RabbitMQ queue) has access to this storage DB, so the results from the workers have to somehow get back to this machine.
The question is: how can I process the results of the task executions afterwards? After a worker finishes, its result gets stored in the configured results backend (amqp), but I don't know the best way to get the results from there and process them.
All I could find in the documentation is that you can either check on the result's status from time to time with:
result.state
which basically means I need a dedicated piece of code that runs this command periodically, keeping a whole thread/process busy with only this, or block everything with:
result.get()
until a task finishes, which is not what I wish.
The only solution I can think of is to have an extra thread on the "central node" that periodically runs a function that checks the async_results returned by each task at submission and takes action if a task has a finished status.
Does anyone have any other suggestion?
Also, since the processing of the backend results takes place on the "central node", my aim is to minimize the impact of this operation on that machine.
What would be the best way to do that?
2) How do people usually solve the problem of handling the results returned from the workers and stored in the results backend? (assuming that a results backend has been configured)
I'm not sure if I fully understand your question, but take into account each task has a task id. If tasks are being sent by users you can store the ids and then check for the results using json as follows:
#urls.py
from djcelery.views import is_task_successful
urlpatterns += patterns('',
    url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
        name='celery-is_task_successful'),
)
Another related concept is signals: each finished task emits a signal. A task that finishes successfully emits a task_success signal. More can be found on real time proc.

Stopping/Purging Periodic Tasks in Django-Celery

I have managed to get periodic tasks working in django-celery by subclassing PeriodicTask. I tried to create a test task and set it running doing something useless. It works.
Now I can't stop it. I've read the documentation and I cannot find out how to remove the task from the execution queue. I have tried using celeryctl and using the shell, but registry.tasks() is empty, so I can't see how to remove it.
I have seen suggestions that I should "revoke" it, but for this I appear to need a task id, and I can't see how I would find the task id.
Thanks.
A task is a message, and a "periodic task" sends task messages at periodic intervals. Each of the tasks sent will have a unique id assigned to it.
revoke will only cancel a single task message. To get the id for a task you have to keep
track of the id sent, but you can also specify a custom id when you send a task.
I'm not sure if you want to cancel a single task message, or if you want to stop the periodic task from sending more messages, so I'll list answers for both.
There is no built-in way to keep the id of a task sent with periodic tasks,
but you could set the id for each task to the name of the periodic task, that way
the id will refer to any task sent with the periodic task (usually the last one).
You can specify a custom id this way,
either with the @periodic_task decorator:
@periodic_task(options={"task_id": "my_periodic_task"})
def my_periodic_task():
    pass
or with the CELERYBEAT_SCHEDULE setting:
CELERYBEAT_SCHEDULE = {name: {"task": task_name,
                              "options": {"task_id": name}}}
If you want to remove a periodic task you simply remove the @periodic_task from the codebase, or remove the entry from CELERYBEAT_SCHEDULE.
If you are using the Django database scheduler you have to remove the periodic task
from the Django Admin interface.
PS1: revoke doesn't stop a task that has already been started. It only cancels
tasks that haven't been started yet. You can terminate a running task using
revoke(task_id, terminate=True). By default this will send the TERM signal to
the process, if you want to send another signal (e.g. KILL) use
revoke(task_id, terminate=True, signal="KILL").
PS2: revoke is a remote control command so it is only supported by the RabbitMQ
and Redis broker transports.
If you want your task to support cancellation you should do so by storing a cancelled
flag in a database and have the task check that flag when it starts:
from celery.task import Task
class RevokeableTask(Task):
    """Task that can be revoked.
    Example usage:
        @task(base=RevokeableTask)
        def mytask():
            pass
    """
    def __call__(self, *args, **kwargs):
        # skip the task body if it was flagged as cancelled
        if revoke_flag_set_in_db_for(self.request.id):
            return
        # return the result so the task's return value is not lost
        return super(RevokeableTask, self).__call__(*args, **kwargs)
Just in case this may help someone ... We had the same problem at work, and despite some effort to find some kind of management command to remove the periodic task, we could not. So here are some pointers.
You should probably first double-check which scheduler class you're using.
The default scheduler is celery.beat.PersistentScheduler, which is simply keeping track of the last run times in a local database file (a shelve).
In our case, we were using the djcelery.schedulers.DatabaseScheduler class.
django-celery also ships with a scheduler that stores the schedule in the Django database
Although the documentation does mention a way to remove the periodic tasks:
Using django-celery's scheduler you can add, modify and remove periodic tasks from the Django Admin.
We wanted to perform the removal programmatically, or via a (celery/management) command in a shell.
Since we could not find a command line, we used the django/python shell:
$ python manage.py shell
>>> from djcelery.models import PeriodicTask
>>> pt = PeriodicTask.objects.get(name='the_task_name')
>>> pt.delete()
I hope this helps!