I have a timeout set on an entity in my database, and a state (active/finished) assigned to it. What I want is to change that entity's state to finished when the timeout expires. I was thinking of using Celery to create a scheduled task with that timeout on object creation, which in turn would trigger a Django signal notifying that the object has 'expired'; I would then set the value to finished in the signal handler. Still, this seems like a bit of overhead, and I am thinking that there must be a more straightforward way to do this.
Thank you in advance.
Not necessarily light-weight, but when I was faced with this problem I had two solutions.
For the first, I wrote a Django manager that would create a queryset of "to be expired" objects and then delete them. To make this lighter, I kept the "to be expired on event" objects in their own table with a one-to-one relationship to the actual objects, and deleted these events once they were done to keep that table small. The relationship between the "to be expired" object and the object being marked "expired" only causes a database hit on the second table when you dereference the ForeignKey field, so it's fairly lightweight. I would then run that management command every 5 minutes with cron (the Unix job scheduler, if you're not familiar with it). This was fine for an every-hour-or-so timeout.
For more close-to-the-second timeouts, my solution was to run a separate server that receives, via REST calls from the Django app, notices of timeouts. It keeps a sorted list of when the timeouts are to occur, and then invokes the aforementioned management command. It's basically a scheduler of its own, with scheduled events being fed to it by the Django process. To make it cheap, I wrote it using Node.js.
Both of these worked. The cron job is far easier.
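For illustration, here is a minimal sketch of what that cron-driven cleanup could look like as a custom management command; the ExpiryEvent model, its fields, and the target's state field are assumptions, not the original poster's actual code.

# myapp/management/commands/expire_objects.py
from django.core.management.base import BaseCommand
from django.utils import timezone

from myapp.models import ExpiryEvent  # hypothetical "to be expired on event" model


class Command(BaseCommand):
    help = "Mark objects whose expiry deadline has passed as finished."

    def handle(self, *args, **options):
        # Fetch every expiry event whose deadline has already passed.
        for event in ExpiryEvent.objects.filter(expires_at__lte=timezone.now()):
            target = event.target            # relation to the real object
            target.state = "finished"
            target.save(update_fields=["state"])
            event.delete()                   # keep the events table small

A crontab entry such as */5 * * * * /path/to/manage.py expire_objects would then run it every 5 minutes.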
If the state is always active until it's expired and always finished afterwards, it would be simpler to just have a "finished" datetime field. Everything with a datetime in the past would be finished and everything in the future would be active. Unless there is some complexity going on that is not mentioned in your question, that should provide the functionality you want without any scheduling at all.
Example:
import datetime

from django.db import models


class TaskManager(models.Manager):
    def finished(self):
        return self.filter(finish__lte=datetime.datetime.now())

    def active(self):
        return self.filter(finish__gt=datetime.datetime.now())


class Task(models.Model):
    finish = models.DateTimeField()

    objects = TaskManager()

    def is_finished(self):
        return self.finish <= datetime.datetime.now()
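With a manager wired up like that, a rough usage example (the pk is just a placeholder) would be:

# No scheduler needed: "finished" is derived from the finish datetime.
active_tasks = Task.objects.active()       # finish is still in the future
finished_tasks = Task.objects.finished()   # finish has already passed

task = Task.objects.get(pk=1)              # placeholder pk
if task.is_finished():
    pass  # treat it as finished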
Related
I am building a Django app that uses APScheduler to send out a daily email at a scheduled time each day. Recently the decision was made to bump up the number of instances to two in order to always have something running in case one of the instances crashes. The problem I am now facing is how to prevent the daily email from being sent out by both instances. I've considered having it set some sort of flag in the database (Postgres) so the other instance knows not to send, but I think this method would create race conditions: the first instance wouldn't set the flag in time for the second instance to see it, or some similar scenario. Has anybody come up against this problem, and how did you resolve it?
EDIT:
from apscheduler.schedulers.background import BackgroundScheduler

def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, 'cron', hour=11)
    scheduler.start()
So this is run when my app initializes--this creates a background scheduler that runs the send_daily_emails function at 11am each morning. The send_daily_emails function is exactly that--all it does is send a couple of emails. My problem is that if there are two instances of the app running, two separate background schedulers will be created and thus the emails will be sent twice each day instead of once.
You can use your proposed database solution with select_for_update.
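As a rough sketch of that idea, assuming a hypothetical DailyEmailLog model with a unique date field and a sent boolean: whichever instance wins the row lock sends the email; the other sees sent=True and returns.

from django.db import transaction
from django.utils import timezone

from myapp.models import DailyEmailLog  # hypothetical model: date (unique), sent (bool)


def send_daily_emails_once():
    today = timezone.now().date()
    with transaction.atomic():
        log, _ = DailyEmailLog.objects.get_or_create(date=today)
        # Lock the row so the two app instances serialize on it.
        log = DailyEmailLog.objects.select_for_update().get(pk=log.pk)
        if log.sent:
            return  # the other instance already sent today's emails
        send_daily_emails()
        log.sent = True
        log.save(update_fields=["sent"])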
If you're using celery, why not use celery-beat + django-celery-beat?
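For reference, a beat schedule for a daily email could look roughly like the following; app is your Celery application instance and the task path is an assumption. Only the single beat process triggers the task, so running two web instances no longer matters.

from celery.schedules import crontab

app.conf.beat_schedule = {
    "send-daily-emails": {
        "task": "myapp.tasks.send_daily_emails",  # hypothetical task path
        "schedule": crontab(hour=11, minute=0),   # 11:00 every day
    },
}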
You can use something like the following. Note the max_instances param.
from apscheduler.schedulers.background import BackgroundScheduler

def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, trigger='cron', hour='23', max_instances=1)
    scheduler.start()
The Celery user guide suggests that the Django transaction be manually committed before calling the task:
http://celery.readthedocs.org/en/latest/userguide/tasks.html#database-transactions
I want the system to be as reliable as possible. What is the best practice to recover from a crash between transaction commit and calling task (i.e. make sure task is always called when transaction is committed).
BTW, right now I'm using a database-based job queue that I implemented myself, so there is no such problem: I can send jobs within the transaction. I'm not really convinced whether I should switch to Celery.
From Django 1.9 this has been added:
transaction.on_commit(lambda: add_task_to_the_queue())
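A sketch of how that hook is commonly combined with a Celery task; Entity and process_entity are assumed names, not part of the original question.

from django.db import transaction

from myapp.models import Entity          # hypothetical model
from myapp.tasks import process_entity   # hypothetical Celery task


def create_entity(data):
    with transaction.atomic():
        entity = Entity.objects.create(**data)
        # The lambda runs only if (and after) the transaction commits,
        # so the worker never sees an uncommitted row.
        transaction.on_commit(lambda: process_entity.delay(entity.pk))
    return entity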
I'm running a system with a few workers that take jobs from a message queue, all using Django's ORM.
In one case I'm actually passing a message along from one worker to another in another queue.
It works like this:
Worker1 in queue1 creates an object (MySQL INSERT) and pushes a message to queue2
Worker2 accepts the new message in queue2 and retrieves the object (MySQL SELECT), using Django's objects.get(pk=object_id)
This works for the first message. But for the second message, worker2 always fails because it can't find an object with id object_id (Django raises its DoesNotExist exception).
This works seamlessly in my local setup with Django 1.2.3 and MySQL 5.1.66; the problem occurs only in my test environment, which runs Django 1.3.1 and MySQL 5.5.29.
If I restart worker2 every time before worker1 pushes a message, it works fine. This makes me believe there's some kind of caching going on.
Is there any caching involved in Django's objects.get() that differs between these versions? If that's the case, can I clear it in some way?
The issue is likely related to the use of MySQL transactions. On the sender's site, the transaction must be committed to the database before notifying the receiver of an item to read. On the receiver's side, the transaction level used for a session must be set such that the new data becomes visible in the session after the sender's commit.
By default, MySQL uses the REPEATABLE READ isolation level. This poses problems when more than one process is reading and writing to the database. One possible solution is to set the isolation level in the Django settings.py file using a DATABASES option like the following:
'OPTIONS': {'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED'},
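In context, the settings.py entry would look roughly like this (database name and credentials are placeholders):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',        # placeholder
        'USER': 'myuser',      # placeholder
        'PASSWORD': 'secret',  # placeholder
        'OPTIONS': {
            'init_command': 'SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED',
        },
    },
}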
Note however that changing the transaction isolation level may have other side effects, especially when using statement based replication.
The following links provide more useful information:
How do I force Django to ignore any caches and reload data?
Django ticket#13906
1) I am currently working on a web application that exposes a REST API and uses Django and Celery to handle requests and solve them. For a request to get solved, a set of Celery tasks has to be submitted to an AMQP queue, so that they get executed on workers (situated on other machines). Each task is very CPU intensive and takes very long (hours) to finish.
I have configured Celery to also use AMQP as the result backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored afterwards in a DB, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks in the RabbitMQ queue - has access to this storage DB, so the results from the workers have to return somehow on this machine.
The question is how I can process the results of the task executions afterwards. After a worker finishes, its result gets stored in the configured result backend (AMQP), but now I don't know what would be the best way to get the results from there and process them.
All I could find in the documentation is that you can either check on the result's status from time to time with:
result.state
which means that basically I need a dedicated piece of code that runs this command periodically, and therefore keeps a whole thread/process busy with just this, or to block everything with:
result.get()
until a task finishes, which is not what I wish.
The only solution I can think of is to have on the "central node" an extra thread that periodically runs a function that checks on the async_results returned by each task at its submission, and takes action if a task has a finished status.
Does anyone have any other suggestion?
Also, since the result-backend processing takes place on the "central node", my aim is to minimize the impact of this operation on this machine.
What would be the best way to do that?
2) How do people usually solve the problem of dealing with the results returned from the workers and stored in the result backend? (assuming that a result backend has been configured)
I'm not sure if I fully understand your question, but take into account that each task has a task id. If tasks are being sent by users, you can store the ids and then check for the results using JSON as follows:
# urls.py
from django.conf.urls import patterns, url
from djcelery.views import is_task_successful

urlpatterns += patterns('',
    url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
        name='celery-is_task_successful'),
)
Another related concept is that of signals: each finished task emits a signal. A finished task will emit a task_success signal. More can be found in the real-time processing docs.
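As a rough sketch, hooking that signal looks like the following; note that task_success fires in the process that ran the task, and store_result_in_db is a placeholder.

from celery.signals import task_success


@task_success.connect
def handle_task_success(sender=None, result=None, **kwargs):
    # sender is the task that just finished, result is its return value.
    store_result_in_db(sender.name, result)  # placeholder function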
In my Django app, I need to implement this "timer-based" functionality:
User creates some jobs and for each one defines when (in the same unit the timer works, probably seconds) it will take place.
User starts the timer.
User may pause and resume the timer whenever he wants.
A job is executed when its time is due.
This does not fit a typical cron scenario as time of execution is tied to a timer that the user can start, pause and resume.
What is the preferred way of doing this?
This isn't a Django question; it is a system architecture problem. HTTP is stateless, so there is no notion of timers.
My suggestion is to use a message queue such as RabbitMQ and use Carrot to interface with it. You can put the jobs on the queue, then create a separate consumer daemon which will process jobs from the queue. The consumer has the logic about when to process.
If that is too complex a system, perhaps look at implementing the timer in JS and having it call a URL mapped to a view that processes a unit of work. The JS would be the timer.
Have a look at Pinax, especially the notifications.
Once created, they are pushed to the DB (the queue) and processed by the cron-jobbed email sending (the consumer).
In this scenario you can't stop a job once it has been fired.
That could be managed by some (AJAX) views that call a system process...
Edit:
Instead of cron jobs you could use a Twisted-based consumer (a rough sketch follows the list):
write jobs with their time information to the DB
send a request for consuming (or resuming, pausing, ...) to the Twisted server via a socket
do the rest in Twisted
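A minimal sketch of such a consumer, assuming control commands arrive as plain text lines over TCP and that a hypothetical due_jobs() helper queries the DB for jobs whose time is up:

from twisted.internet import reactor, task
from twisted.internet.protocol import Factory
from twisted.protocols.basic import LineReceiver


class ControlProtocol(LineReceiver):
    # Accepts simple commands like b"pause" / b"resume" from the Django side.
    paused = False

    def lineReceived(self, line):
        ControlProtocol.paused = (line.strip() == b"pause")


def run_due_jobs():
    if ControlProtocol.paused:
        return
    for job in due_jobs():     # hypothetical: fetch jobs whose time has come
        job.execute()          # hypothetical job method


task.LoopingCall(run_due_jobs).start(1.0)                    # poll every second
reactor.listenTCP(9000, Factory.forProtocol(ControlProtocol))
reactor.run()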
You're going to end up with separate (from the web server) processes to monitor the queue and execute jobs. Consider how you would build that without Django, using command-line tools to drive it. Use Django models to access the database.
When you have that working, layer on a web-based interface (using full Django) to manipulate the queue and report on job status.
I think that if you approach it this way the problem becomes much easier.
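For illustration, a bare-bones standalone worker along those lines might start out like this; the settings module and the Job model are assumptions.

# worker.py -- run from the command line, separate from the web server
import os
import time

import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # assumed
django.setup()

from django.utils import timezone
from myapp.models import Job  # hypothetical model with run_at / status fields


def main():
    while True:
        for job in Job.objects.filter(status="pending", run_at__lte=timezone.now()):
            job.run()                          # hypothetical: do the work
            job.status = "done"
            job.save(update_fields=["status"])
        time.sleep(5)


if __name__ == "__main__":
    main()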
I used the probably simplest (crudest is more appropriate, I'm afraid) approach possible:
1. Wrote a model featuring the current position and the state of the counter (active, paused, etc.).
2. A Django job that increments the counter if its state is active.
3. A cron entry that executes the job every minute.
Thanks everyone for the answers.
You can always use a client-based jQuery timer, but remember to initialize the timer with a value passed from your backend application, and also make sure the end user can't edit the time (e.g. by inspecting the page).
So store the timer start time (the timer's initial value) and the timer end time or pause time in the backend (the DB itself).
Monitor the duration in the backend and trigger the job (in your case).
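A rough sketch of that backend bookkeeping, with hypothetical model and field names; the backend, not the client-side timer, decides when the job is due.

from django.db import models
from django.utils import timezone


class JobTimer(models.Model):                  # hypothetical model
    duration_seconds = models.PositiveIntegerField()
    started_at = models.DateTimeField(null=True, blank=True)   # null while paused
    paused_elapsed = models.PositiveIntegerField(default=0)    # seconds banked before a pause

    def elapsed(self):
        if self.started_at is None:            # currently paused
            return self.paused_elapsed
        running = (timezone.now() - self.started_at).total_seconds()
        return self.paused_elapsed + int(running)

    def is_due(self):
        return self.elapsed() >= self.duration_seconds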
Hope this is clear.