The Celery user guide suggests that the Django transaction be manually committed before calling the task.
http://celery.readthedocs.org/en/latest/userguide/tasks.html#database-transactions
I want the system to be as reliable as possible. What is the best practice for recovering from a crash between the transaction commit and the task call (i.e. for making sure the task is always called once the transaction is committed)?
BTW, right now I'm using a database-based job queue I implemented myself, so there is no such problem -- I can enqueue jobs within the transaction. I'm not really convinced I should switch to Celery.
From Django 1.9 onwards, this has been added:
transaction.on_commit(lambda: add_task_to_the_queue())
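A minimal sketch of how that is typically used, assuming a Celery task called add_task and an Item model (both names are illustrative, not from the question):

from django.db import transaction

from myapp.models import Item      # placeholder model
from myapp.tasks import add_task   # placeholder Celery task


def create_item(data):
    with transaction.atomic():
        item = Item.objects.create(**data)
        # Deferred until COMMIT; if the transaction rolls back,
        # the task is never queued.
        transaction.on_commit(lambda: add_task.delay(item.pk))
    return item

The worker then only ever sees rows that are already committed.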
Related
For a Django project I would like index updates to be run by a celery worker so they don't add to the page response time. I noticed celery-haystack, which can do this, but I'm wondering why it's that complicated. A much simpler solution would be to enqueue an async task from a post_save signal and invoke the signal processor from there, so that the asynchrony is applied before the signal processor runs rather than from within it.
I guess I'm missing something?
I'm aware that instances may no longer exist in the case of delete signals...
So Celery is only the task distributor, right? Indexing is the job to be done and search is the end result. When your resources are limited, tasks are queued up and scheduled to run when workers are available. You can pursue your approach just fine, but Celery will optimize things by delegating tasks to different workers, which may reside on other machines.
I have kind of forgotten the details... (sorry). But to comment: I ended up not using celery-haystack. Instead I use Django signals (not just post_save; I created more specific custom signals) that trigger async Celery tasks (so the work is delegated to other queues/nodes), and these run the index update using the signal processor. I also extended the signal processor to support updating and removing both single objects and iterables of objects.
Paul
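For reference, a rough sketch of that signal-plus-task setup (MyModel and update_search_index are made-up names; it assumes Django 1.7+, Celery 3.1+'s shared_task and django-haystack's get_unified_index()/update_object API):

from celery import shared_task
from django.apps import apps
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import MyModel  # placeholder model


@shared_task
def update_search_index(app_label, model_name, pk):
    from haystack import connections

    model = apps.get_model(app_label, model_name)
    instance = model.objects.get(pk=pk)  # may raise DoesNotExist if deleted meanwhile
    index = connections['default'].get_unified_index().get_index(model)
    index.update_object(instance)


@receiver(post_save, sender=MyModel)
def enqueue_index_update(sender, instance, **kwargs):
    # Only the enqueueing happens in the request cycle;
    # the index work itself runs in the Celery worker.
    update_search_index.delay(sender._meta.app_label, sender._meta.model_name, instance.pk)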
I'm running a system with a few workers that take jobs from a message queue, all using Django's ORM.
In one case I'm actually passing a message along from one worker to another via a second queue.
It works like this:
Worker1 in queue1 creates an object (MySQL INSERT) and pushes a message to queue2
Worker2 accepts the new message in queue2 and retrieves the object (MySQL SELECT), using Django's objects.get(pk=object_id)
This works for the first message, but for the second message Worker2 always fails because it can't find the object with id object_id (Django raises DoesNotExist).
This works seamlessly in my local setup with Django 1.2.3 and MySQL 5.1.66, the problem occurs only in my test environment which runs Django 1.3.1 and MySQL 5.5.29.
If I restart worker2 every time before worker1 pushes a message, it works fine. This makes me believe there's some kind of caching going on.
Is there any caching involved in Django's objects.get() that differs between these versions? If that's the case, can I clear it in some way?
The issue is likely related to the use of MySQL transactions. On the sender's side, the transaction must be committed to the database before notifying the receiver of an item to read. On the receiver's side, the transaction isolation level used for the session must be set such that the new data becomes visible in the session after the sender's commit.
By default, MySQL uses the REPEATABLE READ isolation level. This poses problems when more than one process is reading from and writing to the database. One possible solution is to set the isolation level in the Django settings.py file using a DATABASES option like the following:
'OPTIONS': {'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED'},
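For context, this is roughly where that option sits in settings.py (the engine, name and credentials below are placeholders):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'OPTIONS': {
            'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED',
        },
    }
}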
Note however that changing the transaction isolation level may have other side effects, especially when using statement based replication.
The following links provide more useful information:
How do I force Django to ignore any caches and reload data?
Django ticket#13906
My Django web app's logic is heavily geared towards background task execution (both periodic and stand-alone, synchronous as well as asynchronous). All the research seems to point to Celery being the most recommended approach. I plan to eventually deploy on Heroku, and the fact that it supports Celery + Redis (which I'm using for local development) is a big plus for me.
However, I need more extensive scheduling capabilities than Celery provides. I need some of my periodic tasks to run on schedules like 'run on the last Sunday of the month', etc. So I've implemented my own models in Django to store a recurrence rule and other needed parameters.
Now I'm stumped on how to interface my tables with Celery. Ideally, I'd like to have my own Job model which holds the schedule, the task that should run when the job becomes due, and the parameters for that task -- sort of like a function pointer in C++. Then I would run a daemon which keeps checking the job queue for jobs that have become due; if a job is periodic it creates the next job instance and pushes it onto the queue, then runs the associated task with its parameters using Celery's delay method or similar.
Questions:
Does this approach even make sense?
If not, what other alternative approach(es) can I use?
If yes, how do I go about designing that Job/Event queue?
I'd love to hear a better approach to doing this or if there's an existing implementation of a job queue that might be suitable or a way to use celery's job queue itself...
Thanks heaps..
Periodic tasks in Celery work pretty much like this: there's a dedicated scheduler process (celery beat) which simply sends off tasks when they are due.
You can also create new schedulers to use with beat by subclassing the celery.beat.Scheduler class, and you can create custom schedules too (like the crontab schedule that is already built-in) by subclassing celery.schedules.schedule.
There's a database-backed scheduler implementation in the django-celery extension (djcelery.schedulers.DatabaseScheduler), which uses many tricks to avoid too frequent polling of the database and so on (sadly it's not well commented).
Scheduler: https://github.com/celery/celery/tree/master/celery/beat.py
schedules: https://github.com/celery/celery/tree/master/celery/schedules.py
DatabaseScheduler: https://github.com/celery/django-celery/tree/master/djcelery/schedulers.py
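As a rough sketch (not code from Celery itself; the class and helper names are made up and timezone handling is glossed over), a 'last Sunday of the month' schedule built on the is_due/remaining_estimate interface mentioned above could look like this:

import calendar
from datetime import datetime, timedelta

from celery.schedules import schedule


def last_sunday_of(moment):
    """Midnight on the last Sunday of the month that `moment` falls in."""
    last_day = calendar.monthrange(moment.year, moment.month)[1]
    end = moment.replace(day=last_day, hour=0, minute=0, second=0, microsecond=0)
    # weekday(): Monday == 0 ... Sunday == 6
    return end - timedelta(days=(end.weekday() - 6) % 7)


class LastSundaySchedule(schedule):
    """Due at most once per month, on the last Sunday."""

    def is_due(self, last_run_at):
        now = datetime.now()
        target = last_sunday_of(now)
        # Due if the target has passed and we have not run since it passed;
        # either way, ask beat to check this entry again in an hour.
        if last_run_at < target <= now:
            return True, 3600
        return False, 3600

    def remaining_estimate(self, last_run_at):
        now = datetime.now()
        target = last_sunday_of(now)
        if target <= now:
            # This month's slot has passed; aim for next month's last Sunday.
            first_of_next = (now.replace(day=1) + timedelta(days=32)).replace(day=1)
            target = last_sunday_of(first_of_next)
        return target - now

An instance of it would then be used as the schedule of a beat entry, just like crontab or a plain interval.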
I have a time-out set on an entity in my database, and a state (active/finished) assigned to it. What I want is to change that entity's state to finished when the time-out expires. I was thinking of using Celery to create a scheduled task with the associated time-out on object creation, which in turn would trigger a Django signal notifying that the object has 'expired'; I would then set the value to finished in the signal handler. Still, this seems like a bit of an overhead, and I am thinking that there must be a more straightforward way to do this.
Thank you in advance.
Not necessarily lightweight, but when I was faced with this problem I had two solutions.
For the first, I wrote a Django manager that would create a queryset of "to be expired" objects and then delete them. To keep this light, I kept the "to be expired on event" objects in their own table with a one-to-one relationship to the actual objects, and deleted those events once they were processed so the table stays small. The relationship between the "to be expired" object and the object being marked "expired" only causes a database hit on the second table when you dereference the ForeignKey field, so it's fairly lightweight. I would then invoke that manager method every 5 minutes from cron (the Unix job scheduler, if you're not familiar with it). This was fine for an every-hour-or-so timeout; a sketch of the idea follows.
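For illustration only (ExpiryEvent, its 'target' relation and the target's 'state' field are hypothetical names), the cron-driven management command could look roughly like this:

# myapp/management/commands/expire_objects.py
from datetime import datetime

from django.core.management.base import BaseCommand

from myapp.models import ExpiryEvent  # small one-to-one "to be expired" table


class Command(BaseCommand):
    help = "Mark objects whose expiry event is due as finished, then drop the events."

    def handle(self, *args, **options):
        due = ExpiryEvent.objects.filter(expires_at__lte=datetime.now())
        for event in due.select_related('target'):
            event.target.state = 'finished'
            event.target.save()
        due.delete()  # keep the events table small

Cron would then run it every 5 minutes, e.g. */5 * * * * python manage.py expire_objects.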
For more close-to-the-second timeouts, my solution was to run a separate server that receives, via REST calls from the Django app, notices of timeouts. It keeps a sorted list of when the timeouts are due to occur and then triggers the aforementioned management call. It's basically a scheduler of its own, with scheduled events being fed to it by the Django process. To keep it cheap, I wrote it using Node.js.
Both of these worked. The cron job is far easier.
If the state is always active until it's expired and always finished afterwards, it would be simpler to just have a "finished" datetime field. Everything with a datetime in the past would be finished and everything in the future would be active. Unless there is some complexity going on that is not mentioned in your question, that should provide the functionality you want without any scheduling at all.
Example:
import datetime
from django.db import models

class TaskManager(models.Manager):
    def finished(self):
        return self.filter(finish__lte=datetime.datetime.now())

    def active(self):
        return self.filter(finish__gt=datetime.datetime.now())

class Task(models.Model):
    finish = models.DateTimeField()
    objects = TaskManager()  # attach the manager so the filters are usable

    def is_finished(self):
        return self.finish <= datetime.datetime.now()
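With the manager attached as above, Task.objects.active() and Task.objects.finished() give you the two states at query time, with no scheduling involved.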
In my Django app, I need to implement this "timer-based" functionality:
The user creates some jobs and for each one defines when it will take place (in whatever unit the timer works in, probably seconds).
The user starts the timer.
The user may pause and resume the timer whenever they want.
A job is executed when its time is due.
This does not fit a typical cron scenario as time of execution is tied to a timer that the user can start, pause and resume.
What is the preferred way of doing this?
This isn't really a Django question; it is a system architecture problem. HTTP is stateless, so there is no built-in notion of timers.
My suggestion is to use a message queue such as RabbitMQ and use Carrot to interface with it. You can put the jobs on the queue, then create a separate consumer daemon which processes jobs from the queue. The consumer holds the logic about when to process them.
If that is too complex a system, perhaps look at implementing the timer in JS and having it call a URL mapped to a view that processes a unit of work. The JS would act as the timer.
Have a look at Pinax, especially its notifications app.
Once created, notifications are pushed to the DB (the queue) and processed by the cron-jobbed email sender (the consumer).
In this scenario you can't stop a job once it has been fired.
That could be managed by some (AJAX) views that call a system process...
Edit:
Instead of cron jobs you could use a Twisted-based consumer:
write jobs with their timing information to the DB
send a request for consuming (or resuming, pausing, ...) to the Twisted server via a socket
do the rest in Twisted
You're going to end up with processes separate from the web server to monitor the queue and execute jobs. Consider how you would build that without Django, using command-line tools to drive it, and use Django models to access the database.
When you have that working, layer a web-based interface on top (using full Django) to manipulate the queue and report on job status.
I think that if you approach it this way the problem becomes much easier.
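For illustration, a bare-bones queue-monitoring process along those lines, written as a management command so it can reuse the Django models (the Job model, its fields and its execute() method are all hypothetical):

# myapp/management/commands/run_job_consumer.py
import time
from datetime import datetime

from django.core.management.base import BaseCommand

from myapp.models import Job  # placeholder model with status, due_at and execute()


class Command(BaseCommand):
    help = "Poll the job queue and run jobs that have become due."

    def handle(self, *args, **options):
        while True:
            due = Job.objects.filter(status='pending', due_at__lte=datetime.now())
            for job in due:
                job.status = 'running'
                job.save()
                try:
                    job.execute()        # however the job encapsulates its work
                    job.status = 'done'
                except Exception:
                    job.status = 'failed'
                job.save()
            time.sleep(5)                # crude polling interval

You would run this under supervisord or similar, completely outside the web server.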
I used probably the simplest (crudest is more appropriate, I'm afraid) approach possible: 1. I wrote a model holding the current position and the state of the counter (active, paused, etc.); 2. a Django job that increments the counter if its state is active; 3. a cron entry that executes the job every minute.
Thanks everyone for the answers.
You can always use a client-side jQuery timer, but remember to initialize it with a value passed from your backend, and make sure the end user can't edit the time (e.g. by editing it through the browser's inspector).
So store the timer's start time (its initial value) and its end time or pause time in the backend (the DB itself).
Monitor the duration in the backend and trigger the job (in your case) from there.
Hope this is clear.
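For what it's worth, the backend state for such a timer can be quite small; a hypothetical sketch (model, field and method names are made up):

import datetime

from django.db import models


class Timer(models.Model):
    # None while paused; set while the timer is running.
    started_at = models.DateTimeField(null=True, blank=True)
    # Seconds accumulated before the last pause.
    accumulated = models.FloatField(default=0)

    def elapsed_seconds(self):
        if self.started_at is None:  # paused
            return self.accumulated
        running = datetime.datetime.now() - self.started_at
        return self.accumulated + running.total_seconds()

    def pause(self):
        self.accumulated = self.elapsed_seconds()
        self.started_at = None
        self.save()

    def resume(self):
        self.started_at = datetime.datetime.now()
        self.save()

The value you hand to the jQuery timer on page load is then just elapsed_seconds(), and the pause/resume buttons map to the two methods.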