For a django project i like to run index updated by a celery worker to not hit the page parse time. I noticed celery-haystack that is able to do this but i'm wondering why it's that complicated. A much simpler solution would be to simply apply an async task from a post_save signal and invoke the signal processor from there, so not to apply the async part from within the signal processor but before.
I guess i'm missing something?
I'm aware that instances may not exist any more in case of delete signals...
So celery is only the task distributor, right? And indexing is jobs to be done. Search is the end result. When your resource is limited, tasks will be queued up and scheduled to be ran workers are available. You can pursue your approach just fine, but Celery will optimize by delegating tasks to different workers, which may reside in other machines.
I kind of forgot about the details.... (sorry). But to comment: i ended up not using celery-haystack but instead use django signals (not just post_save but i created more specific custom signals) that trigger async celery tasks (so delegate to other queue's/nodes) and these run the index update using the signal processor. I also extended the signal processor to support update and removal of single objects and iterable of objects.
Paul
Related
In my Django project I'm using Celery with a RabbitMQ broker for asynchronous tasks, how can I record the information of all of my tasks (e.g. created time (task appears in queue), worker consume task time, execution time, status, ...) to monitor how Celery is doing?
I know there are solutions like Flower but that seems to much for what I need, django-celery-results looks like what I want but it's missing a few information I need like task created time.
Thanks!
It seems like you often find the answer yourself after asking on SO. I settled with using celery signals to do all the recording I want and store the results in a database table.
1) I am currently working on a web application that exposes a REST api and uses Django and Celery to handle request and solve them. For a request in order to get solved, there have to be submitted a set of celery tasks to an amqp queue, so that they get executed on workers (situated on other machines). Each task is very CPU intensive and takes very long (hours) to finish.
I have configured Celery to use also amqp as results-backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks in the RabbitMQ queue - has access to this storage DB, so the results from the workers have to return somehow on this machine.
The question is how can I process the results of the tasks execution afterwards? So after a worker finishes, the result from it gets stored in the configured results-backend (amqp), but now I don't know what would be the best way to get the results from there and process them.
All I could find in the documentation is that you can either check on the results's status from time to time with:
result.state
which means that basically I need a dedicated piece of code that runs periodically this command, and therefore keeps busy a whole thread/process only with this, or to block everything with:
result.get()
until a task finishes, which is not what I wish.
The only solution I can think of is to have on the "central node" an extra thread that runs periodically a function that basically checks on the async_results returned by each task at its submission, and to take action if the task has a finished status.
Does anyone have any other suggestion?
Also, since the backend-results' processing takes place on the "central node", what I aim is to minimize the impact of this operation on this machine.
What would be the best way to do that?
2) How do people usually solve the problem of dealing with the results returned from the workers and put in the backend-results? (assuming that a backend-results has been configured)
I'm not sure if I fully understand your question, but take into account each task has a task id. If tasks are being sent by users you can store the ids and then check for the results using json as follows:
#urls.py
from djcelery.views import is_task_successful
urlpatterns += patterns('',
url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
name='celery-is_task_successful'),
)
Other related concept is that of signals each finished task emits a signal. A finnished task will emit a task_success signal. More can be found on real time proc.
My django web-app logic is heavily geared towards background task execution (both periodic as well as stand alone, synchronous as well as asynchronous). All the research seems to point to using Celery being the most recommended approach. I plan to eventually deploy on Heroku and the fact that it has support for Celery + Redis (what I'm using for local development) is a big plus for me.
However I need more extensive scheduling capabilities than celery provides. I need some of my periodic tasks to be able to run schedules like 'run on last sun of the month' etc. So I've implemented my own models in django to store a recurrence rule and other needed parameters.
Now I'm stumped with how to interface my tables with celery. Ideally what I'd like to do is to have my own Job model which has the schedule, the task which should be run when it becomes due as well as the parameters for the task. Sort of like function ptr in C++. Then I would run a daemon which keeps checking the job queue for which job has become due, if its periodic it creates the next job instance and pushes it into queue, then runs the associated task with parameters using celery's delay method or similar.
questions:
Does this approach even make sense?
If not what other alternative approach(es) can I use
If yes how do I go about designing that Job/Event queue...
I'd love to hear a better approach to doing this or if there's an existing implementation of a job queue that might be suitable or a way to use celery's job queue itself...
Thanks heaps..
The periodic tasks in Celery works pretty much like this. There's a dedicated scheduler process (celery beat) which simply sends off tasks when they are due.
You can also create new schedulers to use with beat by subclassing the celery.beat.Scheduler class, and you can create custom schedules too (like the crontab schedule that is already built-in) by subclassing celery.schedules.schedule.
There's a database-backed scheduler implementation in the django-celery extension (djcelery.schedulers.DatabaseScheduler), which uses many tricks to avoid too frequent polling of the database and so on (sadly it's not well commented).
Scheduler: https://github.com/celery/celery/tree/master/celery/beat.py
schedules: https://github.com/celery/celery/tree/master/celery/schedules.py
DatabaseScheduler: https://github.com/celery/django-celery/tree/master/djcelery/schedulers.py
Preconditions: There is a small celery cluster processing some tasks. Each celery instance has few workers running. Everything is running under flask.
Tasks: I need an ability to pause/resume consuming of tasks from a particular node from the code. I.e. task can make a decision if current celery instance and all her workers should pause or resume consuming of tasks.
Didn't find any straight forward way to solve this. Any suggestions?
Thanks in advance!
Control.cancel_consumer(queue, **kwargs) (reference) is all that you probably need for your use case.
Perhaps a better strategy would be to divide the work across several queues.
Have a default queue where all tasks start. The workers watching the default queue can, according to your logic, add subtasks to the other active queues. You may not need this extra queue if you can add tasks to the active queues directly from flask.
That way, each node does not have to worry about whether it's paused or active. It just consumes everything that's been added to its queue. These location-specific queues will be empty (and thus paused) unless the default workers have added subtasks.
In my Django app, I need to implement this "timer-based" functionality:
User creates some jobs and for each one defines when (in the same unit the timer works, probably seconds) it will take place.
User starts the timer.
User may pause and resume the timer whenever he wants.
A job is executed when its time is due.
This does not fit a typical cron scenario as time of execution is tied to a timer that the user can start, pause and resume.
What is the preferred way of doing this?
This isn't a Django question. It is a system architecture problem. The http is stateless, so there is no notion of times.
My suggestion is to use Message Queues such as RabbitMQ and use Carrot to interface with it. You can put the jobs on the queue, then create a seperate consumer daemon which will process jobs from the queue. The consumer has the logic about when to process.
If that it too complex a system, perhaps look at implementing the timer in JS and having it call a url mapped to a view that processes a unit of work. The JS would be the timer.
Have a look at Pinax, especially the notifications.
Once created they are pushed to the DB (queue), and processed by the cron-jobbed email-sending (2. consumer).
In this senario you won't stop it once it get fired.
That could be managed by som (ajax-)views, that call system process....
edit
instead of cron-jobs you could use a twisted-based consumer:
write jobs to db with time-information to the db
send a request for consuming (or resuming, pausing, ...) to the twisted server via socket
do the rest in twisted
You're going to end up with separate (from the web server) processes to monitor the queue and execute jobs. Consider how you would build that without Django using command-line tools to drive it. Use Django models to access the the database.
When you have that working, layer on on a web-based interface (using full Django) to manipulate the queue and report on job status.
I think that if you approach it this way the problem becomes much easier.
I used the probably simplest (crudest is more appropriate, I'm afraid) approach possible: 1. Wrote a model featuring the current position and the state of the counter (active, paused, etc), 2. A django job that increments the counter if its state is active, 3. An entry to the cron that executes the job every minute.
Thanks everyone for the answers.
You can always use a client based jquery timer, but remember to initialize the timer with a value which is passed from your backend application, also make sure that the end user didn't edit the time (edit by inspecting).
So place a timer start time (initial value of the timer) and timer end time or timer pause time in the backend (DB itself).
Monitor the duration in the backend and trigger the job ( in you case ).
Hope this is clear.