1) I am currently working on a web application that exposes a REST api and uses Django and Celery to handle request and solve them. For a request in order to get solved, there have to be submitted a set of celery tasks to an amqp queue, so that they get executed on workers (situated on other machines). Each task is very CPU intensive and takes very long (hours) to finish.
I have configured Celery to use also amqp as results-backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks in the RabbitMQ queue - has access to this storage DB, so the results from the workers have to return somehow on this machine.
The question is how can I process the results of the tasks execution afterwards? So after a worker finishes, the result from it gets stored in the configured results-backend (amqp), but now I don't know what would be the best way to get the results from there and process them.
All I could find in the documentation is that you can either check on the results's status from time to time with:
result.state
which means that basically I need a dedicated piece of code that runs periodically this command, and therefore keeps busy a whole thread/process only with this, or to block everything with:
result.get()
until a task finishes, which is not what I wish.
The only solution I can think of is to have on the "central node" an extra thread that runs periodically a function that basically checks on the async_results returned by each task at its submission, and to take action if the task has a finished status.
Does anyone have any other suggestion?
Also, since the backend-results' processing takes place on the "central node", what I aim is to minimize the impact of this operation on this machine.
What would be the best way to do that?
2) How do people usually solve the problem of dealing with the results returned from the workers and put in the backend-results? (assuming that a backend-results has been configured)
I'm not sure if I fully understand your question, but take into account each task has a task id. If tasks are being sent by users you can store the ids and then check for the results using json as follows:
#urls.py
from djcelery.views import is_task_successful
urlpatterns += patterns('',
url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
name='celery-is_task_successful'),
)
Other related concept is that of signals each finished task emits a signal. A finnished task will emit a task_success signal. More can be found on real time proc.
Related
I'm developing a Django app which relies heavily on Celery task scheduling, using Redis as backend. Tasks can be set to run at a large periods of time, as well as in a few seconds/minutes.
I've read about Redis visibility timeout and consequences of scheduling tasks with timedelta greater than visibility timeout (I'm also in the process of dealing with it in a previous project), so I'm interested if there's anything neater than my solution, which is to have another "helper" task run 5 minutes before the "main" one needs to be executed, scheduling the "main" task to run in required time, storing task id in DB, and then checking in "main" task if the stored task id is the one that is being run. The last part (with task id storing) is required as multiple runs of "helper" task could spawn a lot of "main" task instances, but with this approach each will have different task id.
I really hate how that approach sounds and how it works, as if the task is scheduled to be run a month from current time, "helper" and "main" tasks are executed up to a hundred times.
I also know that it's an open issue, so I'm interested in more a neat workaround than a solution itself.
Having tested available options, in my opinion only using RabbitMQ as broker solves the whole problem.
Although it's a viable option for me, lack of some of redis configuration parameters (e.g. pool size) makes it unusable for those who are using hosting services with some limit on opened broker connection.
I am a beginner with django, I have celery installed.
I am confused about the working of the celery, if the queued works are handled synchronously or asynchronously. Can other works be queued when the queued work is already being processed?
Celery is a task queuing system, that is backed by a message queuing system, Celery allows you to invoke tasks asynchronously, in a way that won't block your process for the task to finish, you can wait for the task to finish using the AsyncResult.get.
Other tasks can be queued while a task is being processed, and if Celery is running more than one process/thread (which is the default case), tasks will be executed in parallel to each others.
It is your responsibility to make sure that related tasks are executed in the correct order, e.g. if the output of a task A is an input to the other task B then you should make sure that you get the result from task A before you start the task B.
Read Avoid launching synchronous subtasks from Celery documentation.
I think you're possibly a bit confused about what Celery does.
Celery isn't really responsible for queueing at all. That is taken care of by the queue itself - RabbitMQ, Redis, or whatever. The only way Celery gets involved at this end is as a library that you call inside your app to serialize to task into something suitable for putting onto the queue. Since that is done by your web app, it is exactly as synchronous or asynchronous as your app itself: usually, in production, you'd have multiple processes running your site, each of those could put things onto the queue simultaneously, but each queueing action is done in-process.
The main point of Celery is the separate worker processes. This is where the asynchronous bit comes from: the workers run completely separately from your web app, and pick tasks off the queue as necessary. They are not at all involved in the process of putting tasks onto the queue in the first place.
I plan to use celery to process incoming web service requests. I understand that celery is used mostly to process asynchronous tasks. However celery has lot of features that I like and could benefit from in my project - priorities, rate limits, distributed architecture etc.
I am just struggling with the design. I would like to have web service that creates and starts the task that will call subtasks. Original task needs results from the subtasks and then when original task is finished I return result back to the client through web service. I know I could call tasks synchronously but that it is not a good practice.
Thanks,
The scatter/gather thing looks like it could be a map/reduce job. If the mapreduce part is important to you, go with a specialised framework like Disco or Hadoop. Otherwise, you need some kind of completion signal, so that you can fire a reply to the user once all subtasks are done or cancelled. For example, a counter of how many subtasks are yet to terminate. The subtask that brings the counter to zero can push a new reply task that pushes a reply to the user and closes the circle.
Look at Mongrel2, an asynchronous web framework, for an example of this kind of circular request path.
One of the characteristics I love most about Google's Task Queue is its simplicity. More specifically, I love that it takes a URL and some parameters and then posts to that URL when the task queue is ready to execute the task.
This structure means that the tasks are always executing the most current version of the code. Conversely, my gearman workers all run code within my django project -- so when I push a new version live, I have to kill off the old worker and run a new one so that it uses the current version of the code.
My goal is to have the task queue be independent from the code base so that I can push a new live version without restarting any workers. So, I got to thinking: why not make tasks executable by url just like the google app engine task queue?
The process would work like this:
User request comes in and triggers a few tasks that shouldn't be blocking.
Each task has a unique URL, so I enqueue a gearman task to POST to the specified URL.
The gearman server finds a worker, passes the url and post data to a worker
The worker simply posts to the url with the data, thus executing the task.
Assume the following:
Each request from a gearman worker is signed somehow so that we know it's coming from a gearman server and not a malicious request.
Tasks are limited to run in less than 10 seconds (There would be no long tasks that could timeout)
What are the potential pitfalls of such an approach? Here's one that worries me:
The server can potentially get hammered with many requests all at once that are triggered by a previous request. So one user request might entail 10 concurrent http requests. I suppose I could have a single worker with a sleep before every request to rate-limit.
Any thoughts?
As a user of both Django and Google AppEngine, I can certainly appreciate what you're getting at. At work I'm currently working on the exact same scenario using some pretty cool open source tools.
Take a look at Celery. It's a distributed task queue built with Python that exposes three concepts - a queue, a set of workers, and a result store. It's pluggable with different tools for each part.
The queue should be battle-hardened, and fast. Check out RabbitMQ for a great queue implementation in Erlang, using the AMQP protocol.
The workers ultimately can be Python functions. You can trigger workers using either queue messages, or perhaps more pertinent to what you're describing - using webhooks
Check out the Celery webhook documentation. Using all these tools you can build a production ready distributed task queue that implements your requirements above.
I should also mention that in regards to your first pitfall, celery implements rate-limiting of tasks using a Token Bucket algorithm.
In my Django app, I need to implement this "timer-based" functionality:
User creates some jobs and for each one defines when (in the same unit the timer works, probably seconds) it will take place.
User starts the timer.
User may pause and resume the timer whenever he wants.
A job is executed when its time is due.
This does not fit a typical cron scenario as time of execution is tied to a timer that the user can start, pause and resume.
What is the preferred way of doing this?
This isn't a Django question. It is a system architecture problem. The http is stateless, so there is no notion of times.
My suggestion is to use Message Queues such as RabbitMQ and use Carrot to interface with it. You can put the jobs on the queue, then create a seperate consumer daemon which will process jobs from the queue. The consumer has the logic about when to process.
If that it too complex a system, perhaps look at implementing the timer in JS and having it call a url mapped to a view that processes a unit of work. The JS would be the timer.
Have a look at Pinax, especially the notifications.
Once created they are pushed to the DB (queue), and processed by the cron-jobbed email-sending (2. consumer).
In this senario you won't stop it once it get fired.
That could be managed by som (ajax-)views, that call system process....
edit
instead of cron-jobs you could use a twisted-based consumer:
write jobs to db with time-information to the db
send a request for consuming (or resuming, pausing, ...) to the twisted server via socket
do the rest in twisted
You're going to end up with separate (from the web server) processes to monitor the queue and execute jobs. Consider how you would build that without Django using command-line tools to drive it. Use Django models to access the the database.
When you have that working, layer on on a web-based interface (using full Django) to manipulate the queue and report on job status.
I think that if you approach it this way the problem becomes much easier.
I used the probably simplest (crudest is more appropriate, I'm afraid) approach possible: 1. Wrote a model featuring the current position and the state of the counter (active, paused, etc), 2. A django job that increments the counter if its state is active, 3. An entry to the cron that executes the job every minute.
Thanks everyone for the answers.
You can always use a client based jquery timer, but remember to initialize the timer with a value which is passed from your backend application, also make sure that the end user didn't edit the time (edit by inspecting).
So place a timer start time (initial value of the timer) and timer end time or timer pause time in the backend (DB itself).
Monitor the duration in the backend and trigger the job ( in you case ).
Hope this is clear.