I am a beginner with django, I have celery installed.
I am confused about the working of the celery, if the queued works are handled synchronously or asynchronously. Can other works be queued when the queued work is already being processed?
Celery is a task queuing system, that is backed by a message queuing system, Celery allows you to invoke tasks asynchronously, in a way that won't block your process for the task to finish, you can wait for the task to finish using the AsyncResult.get.
Other tasks can be queued while a task is being processed, and if Celery is running more than one process/thread (which is the default case), tasks will be executed in parallel to each others.
It is your responsibility to make sure that related tasks are executed in the correct order, e.g. if the output of a task A is an input to the other task B then you should make sure that you get the result from task A before you start the task B.
Read Avoid launching synchronous subtasks from Celery documentation.
I think you're possibly a bit confused about what Celery does.
Celery isn't really responsible for queueing at all. That is taken care of by the queue itself - RabbitMQ, Redis, or whatever. The only way Celery gets involved at this end is as a library that you call inside your app to serialize to task into something suitable for putting onto the queue. Since that is done by your web app, it is exactly as synchronous or asynchronous as your app itself: usually, in production, you'd have multiple processes running your site, each of those could put things onto the queue simultaneously, but each queueing action is done in-process.
The main point of Celery is the separate worker processes. This is where the asynchronous bit comes from: the workers run completely separately from your web app, and pick tasks off the queue as necessary. They are not at all involved in the process of putting tasks onto the queue in the first place.
Related
Let's say there is a periodic task scheduled to run every hour. A worker receives the tasks and starts processing. While the task is being processed, the celeryd process (controlled via supervisord) gets restarted (supervisorctl restart all). Even though the task had never finished execution, it won't get re-executed.
How can I re-queue the periodic task right away and prevent the multiple versions of the tasks run at the same time?
There may be a nicer way to do it, but you could just use the periodic task to create a regular task in the queue (e.g., my_actual_task.defer(…)) which will not be removed from the queue until it is completed (assuming you are using acks_late).
If you're not using acks_late, you can put the bulk of the task in a try, and in the corresponding finally put a my_actual_task.retry().
Either way, you should generally avoid killing workers without giving them a chance to finish up what they're doing.
1) I am currently working on a web application that exposes a REST api and uses Django and Celery to handle request and solve them. For a request in order to get solved, there have to be submitted a set of celery tasks to an amqp queue, so that they get executed on workers (situated on other machines). Each task is very CPU intensive and takes very long (hours) to finish.
I have configured Celery to use also amqp as results-backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks in the RabbitMQ queue - has access to this storage DB, so the results from the workers have to return somehow on this machine.
The question is how can I process the results of the tasks execution afterwards? So after a worker finishes, the result from it gets stored in the configured results-backend (amqp), but now I don't know what would be the best way to get the results from there and process them.
All I could find in the documentation is that you can either check on the results's status from time to time with:
result.state
which means that basically I need a dedicated piece of code that runs periodically this command, and therefore keeps busy a whole thread/process only with this, or to block everything with:
result.get()
until a task finishes, which is not what I wish.
The only solution I can think of is to have on the "central node" an extra thread that runs periodically a function that basically checks on the async_results returned by each task at its submission, and to take action if the task has a finished status.
Does anyone have any other suggestion?
Also, since the backend-results' processing takes place on the "central node", what I aim is to minimize the impact of this operation on this machine.
What would be the best way to do that?
2) How do people usually solve the problem of dealing with the results returned from the workers and put in the backend-results? (assuming that a backend-results has been configured)
I'm not sure if I fully understand your question, but take into account each task has a task id. If tasks are being sent by users you can store the ids and then check for the results using json as follows:
#urls.py
from djcelery.views import is_task_successful
urlpatterns += patterns('',
url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
name='celery-is_task_successful'),
)
Other related concept is that of signals each finished task emits a signal. A finnished task will emit a task_success signal. More can be found on real time proc.
SQS expects your application to be idempotent and I've got multiple consumers/producers where (even if SQS had a deliver-once mechanism) I will have race conditions creating duplicates and race conditions consuming because my consumers run via cron jobs.
My current plan is to use the Django 1.4 select_for_update which should block other consumers on the same row, doing something like:
reminders = EmailReminder.objects.select_for_update().filter(id=some_id)
if not reminders[0].finished:
reminder.send()
reminder.update(finished=datetime.now())
# Delete job.
Are there better ways of dealing with this?
Hook up django-celery to SQS and have it designate a periodic job using celerybeat. Then have celeryd worker(s) running on the same queue anywhere you want. Only one will pick up a job at a time and execute it. No need to introduce DB locking on any level.
As long as your worker is guaranteed to finish its current task before celerybeat fires a new one you will never have a need for a lock. Now if you think there is a chance they may overlap you can introduce states for your notifications where:
Any reminder starts in "unsent" state.
Your celerybeat sends a request to process unsent emails to the queue.
Some worker picks it up and grabs all of them.
Immediately the worker transitions all of them to "sending" state.
Proceeds to send them one at a time (or in bulk).
If sending fails for any, revert their state back to unsent.
For all that succeeded transition to sent.
This way if celerybeat fires another job while your original job is not done with the initial batch, you won't have duplicate emails sent. As an added bonus you can scale the solution and distribute the load.
Preconditions: There is a small celery cluster processing some tasks. Each celery instance has few workers running. Everything is running under flask.
Tasks: I need an ability to pause/resume consuming of tasks from a particular node from the code. I.e. task can make a decision if current celery instance and all her workers should pause or resume consuming of tasks.
Didn't find any straight forward way to solve this. Any suggestions?
Thanks in advance!
Control.cancel_consumer(queue, **kwargs) (reference) is all that you probably need for your use case.
Perhaps a better strategy would be to divide the work across several queues.
Have a default queue where all tasks start. The workers watching the default queue can, according to your logic, add subtasks to the other active queues. You may not need this extra queue if you can add tasks to the active queues directly from flask.
That way, each node does not have to worry about whether it's paused or active. It just consumes everything that's been added to its queue. These location-specific queues will be empty (and thus paused) unless the default workers have added subtasks.
I've implemented a small test which uses celery for message queueing and I just want to make sure I understand how it works on a basic level (Django-Celery, Using Redis as a broker).
My understanding is that when I place a call to start an asyncronous task, the task information is placed in redis and then a celeryd instance connected to the broker consumes and executes the task. Is this essentially what is happening?
If I setup a periodic task thats supposed to execute once every hour does that task get executed on all task consumers? If so is there a way to limit it so that only one consumer will ever execute a periodic task?
The workers will consume as many messages as the broker contains. If you have 8 workers, but only 1 message, 1 of the 8 workers will consume the message, executing the task.