Understanding how basic celery message qeueing works - django

I've implemented a small test which uses celery for message queueing and I just want to make sure I understand how it works on a basic level (Django-Celery, Using Redis as a broker).
My understanding is that when I place a call to start an asyncronous task, the task information is placed in redis and then a celeryd instance connected to the broker consumes and executes the task. Is this essentially what is happening?
If I setup a periodic task thats supposed to execute once every hour does that task get executed on all task consumers? If so is there a way to limit it so that only one consumer will ever execute a periodic task?

The workers will consume as many messages as the broker contains. If you have 8 workers, but only 1 message, 1 of the 8 workers will consume the message, executing the task.

Related

Celery: Numerous small tasks or one long running task?

I have a Django app and using Celery to process long running tasks.
Let's say I need to generate a file (takes 5 seconds), attach it to an email and send it to 1000 users, which of these methods are the preferred way?
Method 1: For loop outside task - generates numerates background tasks, each running a couple of seconds
#share_task
def my_task(usr):
#gen file + send email...
def send_to_all_users(users): # called to start task
for usr in users:
my_task.delay(usr)
Method 2: For loop inside task - generates 1 background tasks that could be running for hours
#share_task
def my_task(users):
for usr in users:
#gen file + send email...
def send_to_all_users(users): # called to start task
my_task.delay(users)
With method 1, I can scale up the number of workers to complete the entire task quicker, but creating all those tasks might take a while and I'm not sure if my task queue can fill up and then jobs get discarded?
Method 2 seems simpler, but it might run a very long time and I can't scale up the number of workers.
Not sure if it matters, but my app is running on Heroku and I'm using Redis as the message broker. I'm currently using a single background worker.
Celery docs on Task Granularity:
The task granularity is the amount of computation needed by each
subtask. In general it is better to split the problem up into many
small tasks rather than have a few long running tasks.
With smaller tasks you can process more tasks in parallel and the
tasks won’t run long enough to block the worker from processing other
waiting tasks.
However, executing a task does have overhead. A message needs to be
sent, data may not be local, etc. So if the tasks are too fine-grained
the overhead added probably removes any benefit.
So the first method should be preferred in general, but you have to benchmark your particular case to assess the overhead.

How to enqueue a periodic task if it gets terminated in celery?

Let's say there is a periodic task scheduled to run every hour. A worker receives the tasks and starts processing. While the task is being processed, the celeryd process (controlled via supervisord) gets restarted (supervisorctl restart all). Even though the task had never finished execution, it won't get re-executed.
How can I re-queue the periodic task right away and prevent the multiple versions of the tasks run at the same time?
There may be a nicer way to do it, but you could just use the periodic task to create a regular task in the queue (e.g., my_actual_task.defer(…)) which will not be removed from the queue until it is completed (assuming you are using acks_late).
If you're not using acks_late, you can put the bulk of the task in a try, and in the corresponding finally put a my_actual_task.retry().
Either way, you should generally avoid killing workers without giving them a chance to finish up what they're doing.

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way that the Java Flow SDK does it and the way that you create an ActivityWorker, give it a tasklist, domain, activity implementations, and a few other settings. You set both the setPollThreadCount and setTaskExecutorSize. The polling threads long poll and then hand over work to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up and when wanting to shutdown the workers, you can call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
I ended using a solution where I had another script file that is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If there is an activity worker already present, then the previous poll found a workflow execution and started processing so we refrain from launching another activity worker.

Workflow of celery

I am a beginner with django, I have celery installed.
I am confused about the working of the celery, if the queued works are handled synchronously or asynchronously. Can other works be queued when the queued work is already being processed?
Celery is a task queuing system, that is backed by a message queuing system, Celery allows you to invoke tasks asynchronously, in a way that won't block your process for the task to finish, you can wait for the task to finish using the AsyncResult.get.
Other tasks can be queued while a task is being processed, and if Celery is running more than one process/thread (which is the default case), tasks will be executed in parallel to each others.
It is your responsibility to make sure that related tasks are executed in the correct order, e.g. if the output of a task A is an input to the other task B then you should make sure that you get the result from task A before you start the task B.
Read Avoid launching synchronous subtasks from Celery documentation.
I think you're possibly a bit confused about what Celery does.
Celery isn't really responsible for queueing at all. That is taken care of by the queue itself - RabbitMQ, Redis, or whatever. The only way Celery gets involved at this end is as a library that you call inside your app to serialize to task into something suitable for putting onto the queue. Since that is done by your web app, it is exactly as synchronous or asynchronous as your app itself: usually, in production, you'd have multiple processes running your site, each of those could put things onto the queue simultaneously, but each queueing action is done in-process.
The main point of Celery is the separate worker processes. This is where the asynchronous bit comes from: the workers run completely separately from your web app, and pick tasks off the queue as necessary. They are not at all involved in the process of putting tasks onto the queue in the first place.

Django-celery project, how to handle results from result-backend?

1) I am currently working on a web application that exposes a REST api and uses Django and Celery to handle request and solve them. For a request in order to get solved, there have to be submitted a set of celery tasks to an amqp queue, so that they get executed on workers (situated on other machines). Each task is very CPU intensive and takes very long (hours) to finish.
I have configured Celery to use also amqp as results-backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks in the RabbitMQ queue - has access to this storage DB, so the results from the workers have to return somehow on this machine.
The question is how can I process the results of the tasks execution afterwards? So after a worker finishes, the result from it gets stored in the configured results-backend (amqp), but now I don't know what would be the best way to get the results from there and process them.
All I could find in the documentation is that you can either check on the results's status from time to time with:
result.state
which means that basically I need a dedicated piece of code that runs periodically this command, and therefore keeps busy a whole thread/process only with this, or to block everything with:
result.get()
until a task finishes, which is not what I wish.
The only solution I can think of is to have on the "central node" an extra thread that runs periodically a function that basically checks on the async_results returned by each task at its submission, and to take action if the task has a finished status.
Does anyone have any other suggestion?
Also, since the backend-results' processing takes place on the "central node", what I aim is to minimize the impact of this operation on this machine.
What would be the best way to do that?
2) How do people usually solve the problem of dealing with the results returned from the workers and put in the backend-results? (assuming that a backend-results has been configured)
I'm not sure if I fully understand your question, but take into account each task has a task id. If tasks are being sent by users you can store the ids and then check for the results using json as follows:
#urls.py
from djcelery.views import is_task_successful
urlpatterns += patterns('',
url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
name='celery-is_task_successful'),
)
Other related concept is that of signals each finished task emits a signal. A finnished task will emit a task_success signal. More can be found on real time proc.