Django Celery issue with multiple instances

I have two instances of Celery. I want to be able to notify all users of a particular event via email, push, etc., but I want to make sure that each user gets notified only ONCE. Is there an example of how to loop through users and guarantee each user gets contacted exactly once?
The solution I have is to simply mark the user as having received the notification... but that would be very inefficient, and there could be a race condition where the user gets notified again before the mark is saved.
I tried to read the following regarding this:
http://docs.celeryproject.org/en/latest/userguide/routing.html
http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
[EDIT]
By 2 instances I mean 1 worker on each of two EC2 instances, so 2 workers.

Have you read this one? Ensuring a task is only executed one at a time
I think your approach is good, but storing the flag in a database is slow; you should use the cache, which is much faster to write to. As a simple setup, you could cache the (hashed) email before sending. If the cache key already exists, don't send the email.
So it would be something like
from hashlib import md5

from celery import task
from django.core.cache import cache
from django.core.mail import send_mail

# 1 day, because I expect the whole email task to finish within a day.
# The same email may be sent again if the task runs on the next day.
LOCK_TIME = 24 * 60 * 60

@task
def notify_user(email):
    # md5 needs bytes, so encode the email address first
    uid = md5(email.encode('utf-8')).hexdigest()
    # cache.add() returns False if the uid already exists in the cache
    if cache.add(uid, 'notified', LOCK_TIME):
        send_mail("Subject", "Message", None, [email])
NB: I think deleting the cache key is not necessary, since you only care about sending once until it expires.

Prevent django send_mail timing attack

I have different REST-API views where I either send a mail (if an account exists) or do not send a mail.
For example, the user can input the email in the forgot-password form and a mail is sent if the account exists.
I am using from django.core.mail import send_mail to send the mail.
The problem is that this takes some time, so requests for valid emails are generally longer than requests for non-existing emails.
This allows an attacker to compare the request times to find out if an account exists or not.
Is there any way that I can call send_mail() without sending the mail?
Or what would be the fix to make request times equally long for both cases?
Note: I could check how long send_mail() takes on average and wait that long when I do not send the mail. But since the app runs on different servers with different configurations, this cannot be done reliably in my case. I would rather not store the average execution time per server in a database to solve this.
It's common practice to use Celery for tasks that take some time to finish. Celery runs the task in a separate worker, so the user doesn't have to wait for it to complete. In your specific case, this is what happens if you use Celery:
You send a send_mail task to Celery and immediately return a successful response to the user.
Celery receives the task and runs it in a worker.
This way, the response time is the same for both cases.
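As a rough sketch of that flow (the send_reset_email task and the helper names are placeholders, not from the question):
# tasks.py
from celery import shared_task
from django.core.mail import send_mail

@shared_task
def send_reset_email(email, reset_link):
    # Runs inside a Celery worker, so the HTTP request never waits on SMTP.
    send_mail("Password reset", "Reset your password: %s" % reset_link, None, [email])

# views.py - the forgot-password view queues the task and returns immediately,
# so the response time no longer depends on whether a mail is actually sent:
# if account_exists(email):
#     send_reset_email.delay(email, build_reset_link(email))
# return Response({"detail": "If the account exists, an email was sent."})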
This is similar to an issue I had, and my solution was actually to always send the email. If the address isn't registered, the email reads something like "You tried to reset your password, but this email isn't registered to an account with us."
From a user's perspective, it can be annoying to have to wait for an email that may or may not arrive, and spend time checking spam/junk etc. Telling them they don't have an account with that email address is quicker and cleaner for them.
We saw a big drop in users enquiring with us about why they hadn't received a PW reset email.
(Sorry for not actually answering the question, I dislike it when people do this on SO but since I experienced the same issue, I thought I'd weigh in.)

Celery: what happens to running tasks when using app.control.purge()?

Currently I have a Celery setup running with Django like so:
Celery.py:
from __future__ import absolute_import, unicode_literals
import os

import django
from celery import Celery
from celery.schedules import crontab
from dotenv import load_dotenv  # python-dotenv, for load_dotenv below

load_dotenv(os.path.join(os.path.dirname(os.path.dirname(__file__)), '.env'))
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'base.settings')
django.setup()

app = Celery('base')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    app.control.purge()
    sender.add_periodic_task(30.0, check_loop.s())
    # need to use recursion because it has to wait for the loop to finish (time can't be predicted)
    recursion_function.delay()
    print("setup_periodic_tasks")

@app.task()
def check_loop():
    # start = database start number
    # end = database end number
    # call APIs in a list from id=start to id=end
    # create objects
    # update database (start number = end, end number = end + 3)
    ...

@app.task()
def recursion_function(default_retry_delay=10):
    # do some looping
    ...
    # when finished, call itself again
    recursion_function.apply_async(countdown=30)
My aim is that whenever the Celery file gets edited, it restarts all the tasks and removes queued tasks that have not yet executed. (I'm doing this because recursion_function will queue itself again once it finishes its job of checking each record of a table in my database, so I'm not worried about it stopping midway.)
The check_loop function calls an API with paging that returns a list of objects, and I compare each one to the records in a table; if it matches, I create a new custom record of another model.
My question is: when I purge all messages, will the currently running task get stopped midway, or will it keep running? Because if the check_loop function stops midway through looping over the API list, it will run the loop again and create duplicate records, which I don't want.
EXAMPLE:
While check_loop() is running, it creates objects midway through the API list (from element id=2 to id=5); the server restarts and runs again, and now check_loop() runs from the beginning (on the API list from element id=2 to id=5) and creates objects from that list again (which I 100% don't want).
Is this how it works? I just need a confirmation.
EDIT:
https://docs.celeryproject.org/en/4.4.1/faq.html#how-do-i-purge-all-waiting-tasks
I added app.control.purge() because when I restart, recursion_function gets called again in setup_periodic_tasks while the previous recursion_function (from recursion_function.apply_async(countdown=30)) executes too, so it multiplies itself.
Yes, the worker will continue executing the currently running task unless the worker itself is also restarted.
Also, the Celery way is to always expect tasks to run in a concurrent environment, with the following considerations:
there are many tasks running concurrently
there are many celery workers executing tasks
same task may run again
multiple instances of the same task may run at the same moment
any task may be terminated any time
Even if you are sure that in your environment there is only one worker, started and stopped manually, and none of this applies, tasks should still be written in such a way that all of it can happen.
Some useful techniques:
use database transactions
use locking
split long-running tasks into faster ones
if a task has intermediate values that need to be saved or that are important (i.e. non-reproducible, like some API calls) and processing them in the next step takes time - consider splitting it into several chained tasks
If you need to run only one instance of a task at a time, use some sort of locking: create/update a lock record in the database or in the cache, so that other instances of the same task can check it, see that the task is already running, and just return or wait for the previous one to complete (a cache-based sketch follows below).
E.g. recursion_function can also be a periodic task. Being a periodic task makes sure it runs every interval, even if a previous run fails for any reason (and thus fails to queue itself again, as it would in the regular non-periodic setup). With locking you can make sure only one instance runs at a time.
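A minimal sketch of cache-based locking for this, assuming a cache backend with an atomic add (memcached or Redis); the lock key and TTL are arbitrary placeholders:
from celery import shared_task
from django.core.cache import cache

LOCK_TTL = 10 * 60  # should comfortably exceed the task's worst-case runtime

@shared_task(bind=True)
def recursion_function(self):
    # cache.add() is atomic: it returns False if the key already exists,
    # so a second concurrent run simply returns instead of duplicating work.
    if not cache.add('lock:recursion_function', self.request.id, LOCK_TTL):
        return
    try:
        pass  # ... the actual looping / API work goes here ...
    finally:
        cache.delete('lock:recursion_function')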
check_loop():
First, it is recommended to save the results in one database transaction, so that either everything or nothing is saved/modified.
You can also save a marker that records how many objects were saved / their status, so future tasks can just check this marker instead of every object.
Or perform a check for each element before creating it, to see whether it already exists in the database (a sketch follows below).
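For the "check each element before creating it" idea, a hedged sketch using a transaction plus get_or_create (ApiRecord and its fields are hypothetical names, not from the question):
from django.db import transaction

def save_api_page(items):
    # All-or-nothing: a crash mid-page leaves no partial rows behind.
    with transaction.atomic():
        for item in items:
            # get_or_create is idempotent, so re-processing the same page
            # after a restart does not create duplicate records.
            ApiRecord.objects.get_or_create(
                external_id=item['id'],
                defaults={'payload': item},
            )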
I am not going to write an essay like Oleg's excellent post above. The answer is simply: all running tasks will continue running. purge is only about the tasks that are in the queue(s), waiting to be picked up by Celery workers.

Django app with multiple instances - how to ensure daily email is only sent once?

I am building a Django app that uses APScheduler to send out a daily email at a scheduled time each day. Recently the decision was made to bump up the number of instances to two in order to always have something running in case one of the instances crashes. The problem I am now facing is how to prevent the daily email from being sent out by both instances. I've considered having it set some sort of flag on the database (Postgres) so the other instance knows not to send, but I think this method would create race conditions--the first instance wouldn't set the flag in time for the second instance to see or some similar scenario. Has anybody come up against this problem and how did you resolve it?
EDIT:
def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, 'cron', hour=11)
    scheduler.start()
So this is run when my app initializes--this creates a background scheduler that runs the send_daily_emails function at 11am each morning. The send_daily_emails function is exactly that--all it does is send a couple of emails. My problem is that if there are two instances of the app running, two separate background schedulers will be created and thus the emails will be sent twice each day instead of once.
You can use your proposed database solution with select_for_update.
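Roughly, that could look like the sketch below, assuming a small bookkeeping model (DailyEmailFlag, with a unique day column and a sent boolean, is a hypothetical name):
from datetime import date
from django.db import transaction

def send_daily_emails_once():
    today = date.today()
    with transaction.atomic():
        # select_for_update() takes a row lock, so the second instance blocks
        # here until the first commits, then sees sent=True and skips.
        flag, _ = DailyEmailFlag.objects.select_for_update().get_or_create(day=today)
        if flag.sent:
            return
        send_daily_emails()  # the existing email-sending function
        flag.sent = True
        flag.save()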
If you're using celery, why not use celery-beat + django-celery-beat?
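A sketch of what the beat schedule could look like, assuming the usual Django/Celery integration where settings are read with the CELERY namespace (the task path is a placeholder; with django-celery-beat the schedule can also live in the database):
# settings.py
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'send-daily-emails': {
        'task': 'myapp.tasks.send_daily_emails',
        'schedule': crontab(hour=11, minute=0),
    },
}
The important part is to run exactly one beat process; the workers on both instances can consume tasks, but each scheduled task is only enqueued once per interval.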
You can use something like the following. Note the max_instances param.
def start():
    scheduler = BackgroundScheduler()
    scheduler.add_job(send_daily_emails, trigger='cron', hour='23', max_instances=1)
    scheduler.start()

Should a celery task save django model updates immediately?

I have a celery task that runs for five minutes at a time to check an Azure Service Bus subscription topic for messages. Throughout the five minutes, it pings the service bus every second to check for a message, and if it finds one, it saves some info to a database. I've noticed that the actual commit to the database only occurs when the task ends at the five-minute mark, and not when the model.save() method is called.
I'm wondering: is it a good idea to add some code to force each save to occur immediately, instead of at the end of the five minutes? I'm thinking of using a statement that involves atomic transactions to accomplish this.
The code below contains my task. I use a while loop to keep the task going for 5 minutes, and inside it, I ping the service bus for messages every second and save them if found.
class CheckForUpdates(PeriodicTask):
    run_every = 300

    def run(self, queue_name='bus_queue'):
        end_task_time = _at_five_minutes()
        while time.time() < end_task_time:
            _wait_for_one_second()
            result = _check_service_bus_for_update()
            if _update_was_found(result):
                update = json.loads(result.body)
                logger.info("azure response body: %s", update)
                # code that updates a django model
                model.save()
Is this a good design? Is it OK to let the database commits accumulate for 5 minutes and then save them all at the end of the 5 minutes? Should I use transactions, or force the task to save immediately every time?
I recommend using Django's model signals. That way you can listen for a signal from the database; django.db.models.signals has hooks you can use, so you can just wait for the event to happen. I will post an example:
signal.py
from django.db import transaction
from django.db.models.signals import post_save
from django.dispatch import receiver

from pay.models import RateDeck
from pay.tasks import uploadrate  # assumed location of the uploadrate task

@receiver(post_save, sender=RateDeck)
def post_save_RateDeck(sender, instance, **kwargs):
    idpk = instance.pk
    transaction.on_commit(lambda: uploadrate.apply_async(args=[idpk]))
So I hope you can see that uploadrate is a task that gets called once the transaction is completed: transaction.on_commit() runs the callback only after the database transaction has been committed.

How to tell if a task has already been queued in django-celery?

Here's my setup:
django 1.3
celery 2.2.6
django-celery 2.2.4
djkombu 0.9.2
In my settings.py file I have
BROKER_BACKEND = "djkombu.transport.DatabaseTransport"
i.e. I'm just using the database to queue tasks.
Now on to my problem: I have a user-initiated task that could take a few minutes to complete. I want the task to only run once per user, and I will cache the results of the task in a temporary file so if the user initiates the task again I just return the cached file. I have code that looks like this in my view function:
task_id = "long-task-%d" % user_id
result = tasks.some_long_task.AsyncResult(task_id)
if result.state == celery.states.PENDING:
    # The next line makes a duplicate task if the user rapidly refreshes the page
    tasks.some_long_task.apply_async(task_id=task_id)
    return HttpResponse("Task started...")
elif result.state == celery.states.STARTED:
    return HttpResponse("Task is still running, please wait...")
elif result.state == celery.states.SUCCESS:
    if cached_file_still_exists():
        return get_cached_file()
    else:
        result.forget()
        tasks.some_long_task.apply_async(task_id=task_id)
        return HttpResponse("Task started...")
This code almost works. But I'm running into a problem when the user rapidly reloads the page. There's a 1-3 second delay between when the task is queued and when the task is finally pulled off the queue and given to a worker. During this time, the task's state remains PENDING which causes the view logic to kick off a duplicate task.
What I need is some way to tell if the task has already been submitted to the queue so I don't end up submitting it twice. Is there a standard way of doing this in celery?
I solved this with Redis. Just set a key in Redis for each task, and then remove the key in the task's after_return method. Redis is lightweight and fast.
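A minimal sketch of that approach with the redis-py client (the TrackedTask name and the one-hour expiry are assumptions; the key simply reuses the task_id from the question):
import redis
from celery import Task

r = redis.Redis()

class TrackedTask(Task):
    # Attach with: @app.task(base=TrackedTask)
    def after_return(self, status, retval, task_id, args, kwargs, einfo):
        # Runs after the task finishes (success or failure): release the marker.
        r.delete(task_id)

# In the view, before queuing (the key doubles as the task_id):
# SET NX is atomic, so only the first of several rapid refreshes enqueues the task.
# task_id = "long-task-%d" % user_id
# if r.set(task_id, "queued", nx=True, ex=3600):
#     tasks.some_long_task.apply_async(task_id=task_id)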
I don't think (as Tomek and others have suggested) that using the database is the way to do this locking. Django has a built-in cache framework, which should be sufficient to accomplish this locking, and is much faster. See:
http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#cookbook-task-serial
Django can be configured to use memcached as its cache backend, and this can be distributed across multiple machines ... this seems better to me. Thoughts?
You can cheat a bit by storing the task state manually in the database. Let me explain how this will help.
For example, if using RDBMS (table with columns - task_id, state, result):
View part:
Use transaction management.
Use SELECT FOR UPDATE to get row where task_id == "long-task-%d" % user_id. SELECT FOR UPDATE will block other requests until this one COMMITs or ROLLBACKs.
If it doesn't exist - set state to PENDING and start the 'some_long_task', end the request.
If the state is PENDING - inform the user.
If the state is SUCCESS - set state to PENDING, start the task, return the file pointed to by the 'result' column. I base this on the assumption that you want to re-run the task after getting the result. COMMIT.
If the state is ERROR - set state to PENDING, start the task, inform the user. COMMIT
Task part:
Prepare the file, wrapped in a try/except block.
On success - UPDATE the proper row with state = SUCCESS and the result.
On failure - UPDATE the proper row with state = ERROR.
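Putting the view part of those steps together, a hedged sketch (TaskState is a hypothetical model with task_id, state and result columns; get_cached_file is the helper from the question, here assumed to accept the stored result):
from django.db import transaction
from django.http import HttpResponse

from myapp import tasks  # wherever some_long_task lives (placeholder path)

def start_long_task(request, user_id):
    task_id = "long-task-%d" % user_id
    with transaction.atomic():
        # Row-level lock: concurrent requests for the same user wait here
        # until this transaction commits or rolls back.
        row, created = TaskState.objects.select_for_update().get_or_create(
            task_id=task_id, defaults={'state': 'PENDING'})
        if not created and row.state == 'PENDING':
            return HttpResponse("Task is still running, please wait...")
        if not created and row.state == 'SUCCESS':
            response = get_cached_file(row.result)  # serve the cached result
        else:
            response = HttpResponse("Task started...")
        # New row, finished, or failed: (re)start the task and mark it PENDING.
        row.state = 'PENDING'
        row.save()
        tasks.some_long_task.apply_async(task_id=task_id)
        return response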