I have a Celery task that runs for five minutes at a time to check an Azure Service Bus topic subscription for messages. Throughout the five minutes it polls the service bus every second, and if it finds a message it saves some info to a database. I've noticed that the actual commit to the database only occurs when the task ends at the five-minute mark, not when the model.save() method is called.
I'm wondering, is it a good idea to add some code to force each save to commit immediately, instead of at the end of the five minutes? I'm thinking of using an atomic transaction block to accomplish this.
The code below contains my task. I use a while loop to keep the task going for 5 minutes; inside it, I poll the service bus for a message every second and save it if one is found.
class CheckForUpdates(PeriodicTask):
    run_every = 300

    def run(self, queue_name='bus_queue'):
        end_task_time = _at_five_minutes()
        while time.time() < end_task_time:
            _wait_for_one_second()
            result = _check_service_bus_for_update()
            if _update_was_found(result):
                update = json.loads(result.body)
                logger.info("azure response body: %s", update)
                # code that updates a django model
                model.save()
Is this a good design? Is it ok to let the database commits accumulate for 5 minutes and then save them all consecutively at the end? Should I use transactions, or force the task to save immediately every time?
I recommend using Django's model signals: django.db.models.signals has hooks you can listen to, so you can simply wait for the event to happen. Here is an example:
signal.py
from django.db import transaction
from django.db.models.signals import post_save
from django.dispatch import receiver
from pay.models import RateDeck
from pay.tasks import uploadrate  # import path assumed; this is the celery task

@receiver(post_save, sender=RateDeck)
def post_save_RateDeck(sender, instance, **kwargs):
    idpk = instance.pk
    transaction.on_commit(lambda: uploadrate.apply_async(args=[idpk]))
I hope it is clear that uploadrate is a task that gets called once the transaction is complete: transaction.on_commit runs the callback when the database transaction has finished.
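If the goal is simply to make each save hit the database as soon as it happens, the transaction-per-message idea from the question could be sketched roughly as below. The helper names come from the task above, and model is a placeholder; this is only an illustration, not a drop-in fix.
import json
import time

from django.db import transaction

def run_with_immediate_commits(end_task_time, model):
    # Same loop as the task above, but each save gets its own atomic
    # block, so the row is committed as soon as the block exits.
    while time.time() < end_task_time:
        _wait_for_one_second()
        result = _check_service_bus_for_update()
        if _update_was_found(result):
            update = json.loads(result.body)
            with transaction.atomic():
                # ... apply `update` to the django model here ...
                # NOTE: if an outer transaction is already open, atomic()
                # only creates a savepoint and the commit still waits for
                # the outer transaction to finish.
                model.save()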
Related
I am building a Django app that uses APScheduler to send out a daily email at a scheduled time each day. Recently the decision was made to bump the number of instances up to two in order to always have something running in case one of the instances crashes. The problem I am now facing is how to prevent the daily email from being sent out by both instances. I've considered having it set some sort of flag in the database (Postgres) so the other instance knows not to send, but I think this method would create race conditions: the first instance wouldn't set the flag in time for the second instance to see it, or some similar scenario. Has anybody come up against this problem, and how did you resolve it?
EDIT:
def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, 'cron', hour=11)
    scheduler.start()
So this is run when my app initializes--this creates a background scheduler that runs the send_daily_emails function at 11am each morning. The send_daily_emails function is exactly that--all it does is send a couple of emails. My problem is that if there are two instances of the app running, two separate background schedulers will be created and thus the emails will be sent twice each day instead of once.
You can use your proposed database solution with select_for_update.
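A hedged sketch of what that could look like; the DailyEmailLog model and its fields are hypothetical, not from the question, and the date field is assumed to be unique:
import datetime

from django.db import transaction
from myapp.models import DailyEmailLog  # hypothetical model: date (unique) + sent flag

def send_daily_emails():
    with transaction.atomic():
        # The row lock makes the second instance wait until the first
        # commits; it then sees sent=True and skips the send.
        log, _ = (DailyEmailLog.objects
                  .select_for_update()
                  .get_or_create(date=datetime.date.today()))
        if log.sent:
            return
        # ... build and send the emails here ...
        log.sent = True
        log.save(update_fields=['sent'])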
If you're using celery, why not use celery-beat + django-celery-beat?
You can use something like the following. Note the max_instances param.
def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, trigger='cron', hour='23', max_instances=1)
    scheduler.start()
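If you take the celery-beat suggestion above instead, the schedule lives with the single beat process rather than inside each web instance, so nothing is sent twice. A minimal sketch; the project name, broker URL, and task path are assumptions:
from celery import Celery
from celery.schedules import crontab

app = Celery('proj', broker='redis://localhost:6379/0')  # broker assumed

# Only the single beat process fires this entry, so running two web
# instances no longer means two schedulers and two emails.
app.conf.beat_schedule = {
    'send-daily-emails': {
        'task': 'myapp.tasks.send_daily_emails',  # task path assumed
        'schedule': crontab(hour=11, minute=0),
    },
}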
My django app allows users to send messages to each other, and I pool some of the recent messages together and send them in an email using celery and redis.
Every time a user sends a message, I add a Message to the db and then trigger an async task to pool that user's messages from the last 60 seconds and send them as an email.
tasks.pushMessagePool.apply_async(args = (fromUser,), countdown = 60)
If the user sends 5 messages in the next 60 seconds, then my assumption is that 5 tasks should be created, but only the first task sends the email and the other 4 tasks do nothing. I implemented a simple locking mechanism to make sure that messages were only considered once, and to ensure db locking.
from random import randint
from celery import shared_task

import data.models

@shared_task
def pushMessagePool(fromUser, ignore_result=True):
    lockCode = randint(0, 10**9)
    data.models.Messages.objects.filter(fromUser=fromUser, locked=False).update(locked=True, lockCode=lockCode)
    M = data.models.Messages.objects.filter(fromUser=fromUser, lockCode=lockCode)
    sendEmail(M, lockCode)
With this setup, I still get occasional (~10%) duplicates. The duplicates will fire within 10ms of each other, and they have different lockCodes.
Why doesn't this locking mechanism work? Does celery refer to an old DB snapshot? That wouldn't make any sense.
Djangojack, here is a similar issue, but for SQS. I'm not sure if it applies to Redis too:
When creating your SQS queue you need to set the Default Visibility timeout to some time that's greater than the max time you expect a task to run. This is the time SQS will make a message invisible to all other consumers after delivering to one consumer. I believe the default is 30 seconds. So, if a task takes more than 30 seconds, SQS will deliver the same message to another consumer because it assumes the first consumer died and did not complete the task.
From a comment by @gustavo-ambrozio on this answer.
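The Redis transport exposes the same setting through broker_transport_options, so if the duplicates really come from re-delivery you could raise it well above the task runtime plus the 60-second countdown. A sketch; the broker URL and the value are assumptions:
from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0')  # broker URL assumed

# Redis (like SQS) re-delivers a reserved message to another worker once the
# visibility timeout passes, so set it higher than your longest expected task
# runtime plus any countdown/eta you schedule with.
app.conf.broker_transport_options = {'visibility_timeout': 3600}  # seconds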
I need to run a celery task only when the django request has finished.
Is that possible?
I've found that the best way to make sure your task happens after the request is finished is to write a custom middleware. In the process_response method, you can handle any quick actions that don't impact page load time or performance too much. Anything else, you can hand off to Celery. Any saving or database transactions are completed by the time process_response is called (AFAICT).
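A rough sketch of that middleware approach; the task name and import path are placeholders, not from the answer:
from django.utils.deprecation import MiddlewareMixin
from app.tasks import my_task  # placeholder task

class QueueTaskMiddleware(MiddlewareMixin):
    def process_response(self, request, response):
        # The view and its database writes have completed by the time
        # process_response runs, so the heavy work can go to Celery here.
        my_task.delay()
        return response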
Try something like this:
Django sends request_finished at the end of every request.
You can check the sender argument (the handler class) to filter out requests you don't care about,
from django.dispatch import receiver
from django.core.signals import request_finished
from app.tasks import my_task

@receiver(request_finished)
def add_celery_task(sender, **kwargs):
    if sender.__name__ != 'StaticFilesHandler':
        my_task.delay()
If you are running the server in a development environment, it's good to check the sender's name to avoid adding a celery task for every static file you are serving.
You can run the task in the background using celery's delay method: just before returning the response, call delay to put the task on the queue.
Something like this:
task_name.delay(arg1, arg2, ...)
By doing this your task is put into the background and runs asynchronously, so it does not block the request-response cycle.
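For example, inside a view; the task and its argument are placeholders:
from django.http import JsonResponse
from app.tasks import task_name  # placeholder import path

def my_view(request):
    # Queue the work and return immediately; the worker picks it up
    # from the broker without blocking this response.
    task_name.delay(request.user.id)
    return JsonResponse({'status': 'queued'})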
I have two instances of Celery. I want to be able to notify all users of a particular event via email, push, etc. However, I want to make sure that all users get notified only ONCE. Is there an example of how to loop through users and guarantee each user gets contacted once?
The solution I have is to simply mark the user as having received the notification... But that would be very inefficient. And there could be a race condition where a user gets notified again in between sending and saving the mark.
I tried to read the following regarding this:
http://docs.celeryproject.org/en/latest/userguide/routing.html
http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
[EDIT]
By 2 instances I mean 1 worker on each of two EC2 instances, so 2 workers.
Have you read this one? Ensuring a task is only executed one at a time
I think your approach is good, but storing the flag in a database is slow; putting it in a cache is much quicker. As a simple setup, you could add a cache key for the (hashed) email before sending the email. If the key already exists, don't send the email.
So it would be something like
from hashlib import md5

from django.core.cache import cache
from django.core.mail import send_mail
from celery import task

# 1 day, because I expect the whole email task to finish within a day.
# The same email may be sent out again if this is executed on the next day.
LOCK_TIME = 24 * 60 * 60

@task
def notify_user(email):
    uid = md5(email.encode('utf-8')).hexdigest()  # md5 needs bytes on Python 3
    # cache.add() returns False if the uid already exists in the cache
    if cache.add(uid, 'notified', LOCK_TIME):
        send_mail("Subject", "Message", None, [email])
NB: I don't think deleting the cache key is necessary, since you only care about sending once until it expires.
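To fan the notifications out to everyone, the dispatch side could be as simple as the loop below; the User model query is an assumption about your schema:
from django.contrib.auth import get_user_model

def notify_all_users():
    # Each queued task still passes through the cache.add() guard above,
    # so even if two workers race on the same email, only one sends it.
    for email in get_user_model().objects.values_list('email', flat=True):
        notify_user.delay(email)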
I have a time-out set on an entity in my database, and a state (active/finished) assigned to it. What I want is to change that entity's state to finished when the time-out expires. I was thinking of using celery to create a scheduled task with that time-out on object creation, which in turn would trigger a django signal to notify that the object has 'expired'; the signal handler would then set the value to finished. Still, this seems like a bit of overhead, and I'm thinking there must be a more straightforward way to do this.
Thank you in advance.
Not necessarily light-weight, but when I was faced with this problem I had two solutions.
For the first, I wrote a Django manager that would create a queryset of "to be expired" objects and then delete them. To keep this light, I kept the "to be expired on event" objects in their own table with a one-to-one relationship to the actual objects, and deleted these events once they were done, to keep that table small. The relationship between the "to be expired" event and the object being marked "expired" only causes a database hit on the second table when you dereference the ForeignKey field, so it's fairly lightweight. I would then run that management command every 5 minutes with cron (the Unix job scheduler, if you're not familiar with it). This was fine for an every-hour-or-so timeout; a rough sketch follows.
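A hedged sketch of that first approach as a management command run from cron; the ExpiryEvent model, its fields, and the 'finished' state value are hypothetical:
# myapp/management/commands/expire_objects.py
from django.core.management.base import BaseCommand
from django.utils import timezone
from myapp.models import ExpiryEvent  # hypothetical one-to-one "to be expired" table

class Command(BaseCommand):
    help = "Mark objects whose timeout has passed as finished"

    def handle(self, *args, **options):
        due = ExpiryEvent.objects.filter(expires_at__lte=timezone.now())
        for event in due.select_related('target'):
            event.target.state = 'finished'
            event.target.save(update_fields=['state'])
        due.delete()  # keep the events table small, as described above

# crontab entry (every 5 minutes):
#   */5 * * * * python /path/to/manage.py expire_objects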
For more close-to-the-second timeouts, my solution was to run a separate server that receives, via REST calls from the Django app, notices of timeouts. It keeps a sorted list of when the timeouts are due, and then calls the aforementioned management command. It's basically a scheduler of its own, with scheduled events being fed to it by the Django process. To keep it cheap, I wrote it using Node.js.
Both of these worked. The cron job is far easier.
If the state is always active until it's expired and always finished afterwards, it would be simpler to just have a "finished" datetime field. Everything with a datetime in the past would be finished and everything in the future would be active. Unless there is some complexity going on that is not mentioned in your question, that should provide the functionality you want without any scheduling at all.
Example:
import datetime

from django.db import models

class TaskManager(models.Manager):
    def finished(self):
        return self.filter(finish__lte=datetime.datetime.now())

    def active(self):
        return self.filter(finish__gt=datetime.datetime.now())

class Task(models.Model):
    finish = models.DateTimeField()
    objects = TaskManager()

    def is_finished(self):
        return self.finish <= datetime.datetime.now()
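Usage then needs no scheduler at all; for example:
# Whether something is "active" or "finished" is derived from the clock:
active_tasks = Task.objects.active()       # finish is still in the future
finished_tasks = Task.objects.finished()   # finish time has already passed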