get number of pending tasks for a specific user - django

In one of my applications i want to limit users to make a only a specific number of document conversion each calendar month and want to notify them of the conversions they've made and number of conversions they can still make in that calendar month.
So I do something like the following.
class CustomUser(models.Model):
# user fields here
def get_converted_docs(self):
return self.document_set.filter(date__range=[start, end]).count()
def remaining_docs(self):
converted = self.get_converted_docs()
return LIMIT - converted
Now, document conversion is done in the background using celery. So there may be a situation when a conversion task is pending, so in that case the above methods would let a user make an extra conversion, because the pending task is not being included in the count.
How can i get the number of tasks pending for a specific CustomUser object here ??
update
ok so i tried the following:
from celery.task.control import inspect
def get_scheduled_tasks():
tasks = []
scheduled = inspect().scheduled()
for task in scheduled.values()
tasks.extend(task)
return tasks
This gives me a list of scheduled tasks but now all the values are unicode for the above mentioned task args look like this:
u'args': u'(<Document: test_document.doc>, <CustomUser: Test User>)'
is there a way these can be decoded back to original django objects so that i can filter them ?

Store the state of your documents somewhere else, don't inspect your queue.
Either create a seperate model for that, or eg. have a state on your document model, at least independently from your queue. This should have several advantages:
Inspecting the queue might be expensive - also depending on the backend for that. And as you see it can also turn out to be difficult.
Your queue might not be persistent, if eg. your server crashes and use something like Redis you would loose this information, so it's a good thing to have a log somewhere else to be able to reconstruct the queue)

Related

It's possible to perform db operations asynchronously in django?

I'm writing a command to randomly create 5M orders in a database.
def constrained_sum_sample(
number_of_integers: int, total: Optional[int] = 5000000
) -> int:
"""Return a randomly chosen list of n positive integers summing to total.
Args:
number_of_integers (int): The number of integers;
total (Optional[int]): The total sum. Defaults to 5000000.
Yields:
(int): The integers whose the sum is equals to total.
"""
dividers = sorted(sample(range(1, total), number_of_integers - 1))
for i, j in zip(dividers + [total], [0] + dividers):
yield i - j
def create_orders():
customers = Customer.objects.all()
number_of_customers = Customer.objects.count()
for customer, number_of_orders in zip(
customers,
constrained_sum_sample(number_of_integers=number_of_customers),
):
for _ in range(number_of_orders):
create_order(customer=customer)
number_of_customers will be at least greater than 1k and the create_order function does at least 5 db operations (one to create the order, one to randomly get the order's store, one to create the order item (and this can go up to 30, also randomly), one to get the item's product (or higher but equals to the item) and one to create the sales note.
As you may suspect this take a LONG time to complete. I've tried, unsuccessfully, to perform these operations asynchronously. All of my attempts (dozen at least; most of them using sync_to_async) have raised the following error:
SynchronousOnlyOperation you cannot call this from an async context - use a thread or sync_to_async
Before I continue to break my head, I ask: is it possible to achieve what I desire? If so, how should I proceed?
Thank you very much!
Not yet supported but in development.
Django 3.1 has officially asynchronous support for views and middleware however if you try to call ORM within async function you will get SynchronousOnlyOperation.
if you need to call DB from async function they have provided helpers utils like:
async_to_sync and sync_to_async to change between threaded or coroutine mode as follows:
from asgiref.sync import sync_to_async
results = await sync_to_async(Blog.objects.get, thread_sensitive=True)(pk=123)
If you need to queue call to DB, we used to use tasks queues like celery or rabbitMQ.
By the way if you really know what you are doing you can call it but on your responsibility
just turn off the Async safety but watch out for data lost and integrity errors
#settings.py
DJANGO_ALLOW_ASYNC_UNSAFE=True
The reason this is needed in Django is that many libraries, specifically database adapters, require that they are accessed in the same thread that they were created in. Also a lot of existing Django code assumes it all runs in the same thread, e.g. middleware adding things to a request for later use in views.
More fun news in the release notes:
https://docs.djangoproject.com/en/3.1/topics/async/
It's possible to achieve what you desire, however you need a different perspective to solve this problem.
Try using asynchronous workers, and a simple one would be rq workers or celery.
Use one of these libraries to process async long-running tasks defined in django in different threads or processes.
you can use bulk_create() to create large number of objects , this will speed up the process , additionally put the bulk_create() under a separate thread.

Django Celery shared_task singleton pattern

I have a Django based site that has several background processes that are executed in Celery workers. I have one particular task that can run for a few seconds with several read/writes to the database that are subject to a race condition if a second task tries to access the same rows.
I'm trying to prevent this by ensuring the task is only ever running on a single worker at a time but I'm running into issues getting this to work correctly. I've used this Celery Task Cookbook Recipe as inspiration, trying to make my own version that works for my scenario of ensuring that this specific task is only running on one worker at a time, but it still seems to be possible to encounter situations where it's executed across more than one worker.
So far, in tasks.py I have:
class LockedTaskInProgress(Exception):
"""The locked task is already in progress"""
silent_variable_failure = True
#shared_task(autoretry_for=[LockedTaskInProgress], default_retry_delay=30)
def create_or_update_payments(things=None):
"""
This takes a list of `things` that we want to process payments on. It will
read the thing's status, then apply calculations to make one or more payments
for various users that are owed money for the thing.
`things` - The list of things we need to process payments on.
"""
lock = cache.get('create_or_update_payments') # Using Redis as our cache backend
if not lock:
logger.debug('Starting create/update payments processing. Locking task.')
cache.set('create_or_update_payments', 'LOCKED')
real_create_or_update_payments(things) # Long running function w/ lots of DB read/writes
cache.delete('create_or_update_payments')
logger.debug('Completed create/update payments processing. Lock removed.')
else:
logger.debug('Unable to process create/update payments at this time. Lock detected.')
raise LockedTaskInProgress
The above seems to almost work but there still looks to be a possible race condition between the cache.get and cache.set that has shown up in my testing.
I'd love to get suggestions on how to improve this to make it more robust.
Think I've found a way of doing this, inspired by an older version of the Celery Task Cookbook Recipe I was using earlier.
Here's my implementation:
class LockedTaskInProgress(Exception):
"""The locked task is already in progress"""
silent_variable_failure = True
#shared_task(autoretry_for=[LockedTaskInProgress], default_retry_delay=30)
def create_or_update_payments(things=None):
"""
This takes a list of `things` that we want to process payments on. It will
read the thing's status, then apply calculations to make one or more payments
for various users that are owed money for the thing.
`things` - The list of things we need to process payments on.
"""
LOCK_EXPIRE = 60 * 5 # 5 Mins
lock_id = 'create_or_update_payments'
acquire_lock = lambda: cache.add(lock_id, 'LOCKED', LOCK_EXPIRE)
release_lock = lambda: cache.delete(lock_id)
if acquire_lock():
try:
logger.debug('Starting create/update payments processing. Locking task.')
real_create_or_update_payments(things) # Long running function w/ lots of DB read/writes
finally:
release_lock()
logger.debug('Completed create/update payments processing. Lock removed.')
else:
logger.debug('Unable to process create/update payments at this time. Lock detected.')
raise LockedTaskInProgress
It's very possible that there's a better way of doing this but this seems to work in my tests.

Celery: dynamic queues at object level

I'm writing a django app to make polls which uses celery to put under control the voting system. Right now, I have two queues, default and polls, the first one with concurrency set to 8 and the second one set to 1.
$ celery multi start -A myproject.celery default polls -Q:default default -Q:polls polls -c:default 8 -c:polls 1
Celery routes:
CELERY_ROUTES = {
'polls.tasks.option_add_vote': {
'queue': 'polls',
},
'polls.tasks.option_subtract_vote': {
'queue': 'polls',
}
}
Task:
#app.task
def option_add_vote(pk):
"""
Updates given option id and its poll increasing vote number by 1.
"""
option = Option.objects.get(pk=pk)
try:
with transaction.atomic():
option.vote_quantity += 1
option.save()
option.poll.total_votes += 1
option.poll.save()
except IntegrityError as exc:
raise self.retry(exc=exc)
The option_add_vote method (task) updates the poll-object vote-number value adding 1 to the previous value. So, to avoid concurrency problems, I set the poll queue concurrency to 1. This allow the system to handle thousand of vote requests to be completed successfully.
The problem will be, as I can imagine, a bottle-neck when the system grows up.
So, I was thinking about some kind of dynamic queues where all vote requests to any options of a certain poll where routered to a custom queue. I think this will make the system more reliable and fast.
What do you think? How can I make it?
EDIT1:
I got a new idea thanks to Paul and Plahcinski. I'm storing the votes as objects in their own model (a user-options relationship). When someone votes an option it creates an object from this model, allowing me to count how many votes an option has. This free the system from the voting-concurrency problem, so it could be executed in parallel.
I'm thinking about using CELERYBEAT_SCHEDULE to cron a task that updates poll options based on the result of Vote.objects.get(pk=pk).count(). Maybe I could execute it every hour or do partial updates for those options that are getting new votes...
But, how do I give to the clients updated options in real time?
As Plahcinski says, I can have a cached value for my options in Redis (or any other mem-cached system?) and use it to temporally store this values, giving to any new request the cached value.
How can I mix this with my standar values in django models? Anyone could give me some code references or hints?
Am I in the good way or did I make mistakes?
What I would do is remove your incrementation for the database and move to redis and use the database model as your cached value. Have a celery beat that updates recently incremented redis keys to your database
http://redis.io/commands/INCR
What about just having a simple model that stores vote -1/+1 integers then a celery task that reconciles those with the FK object for atomic transactions and updates?

Is there any way to define task quota in celery?

I have requirements:
I have few heavy-resource-consume task - exporting different reports that require big complex queries, sub queries
There are lot users.
I have built project in django, and queue task using celery
I want to restrict user so that they can request 10 report per minute. The idea is they can put hundreds of request 10 minute, but I want celery to execute 10 task for a user. So that every user gets their turn.
Is there any way so that celery can do this?
Thanks
Celery has a setting to control the RATE_LIMIT (http://celery.readthedocs.org/en/latest/userguide/tasks.html#Task.rate_limit), it means, the number of task that could be running in a time frame.
You could set this to '100/m' (hundred per second) maning your system allows 100 tasks per seconds, its important to notice, that setting is not per user neither task, its per time frame.
Have you thought about this approach instead of limiting per user?
In order to have a 'rate_limit' per task and user pair you will have to do it. I think (not sure) you could use a TaskRouter or a signal based on your needs.
TaskRouters (http://celery.readthedocs.org/en/latest/userguide/routing.html#routers) allow to route tasks to a specify queue aplying some logic.
Signals (http://celery.readthedocs.org/en/latest/userguide/signals.html) allow to execute code in few well-defined points of the task's scheduling cycle.
An example of Router's logic could be:
if task == 'A':
user_id = args[0] # in this task the user_id is the first arg
qty = get_task_qty('A', user_id)
if qty > LIMIT_FOR_A:
return
elif task == 'B':
user_id = args[2] # in this task the user_id is the seconds arg
qty = get_task_qty('B', user_id)
if qty > LIMIT_FOR_B:
return
return {'queue': 'default'}
With the approach above, every time a task starts you should increment by one in some place (for example Redis) the pair user_id/task_type and
every time a task finishes you should decrement that value in the same place.
Its seems kind of complex, hard to maintain and with few failure points for me.
Other approach, which i think could fit, is to implement some kind of 'Distributed Semaphore' (similar to distributed lock) per user and task, so in each task which needs to limit the number of task running you could use it.
The idea is, every time a task which should have 'concurrency control' starts it have to check if there is some resource available if not just return.
You could imagine this idea as below:
#shared_task
def my_task_A(user_id, arg1, arg2):
resource_key = 'my_task_A_{}'.format(user_id)
available = SemaphoreManager.is_available_resource(resource_key)
if not available:
# no resources then abort
return
try:
# the resourse could be acquired just before us for other
if SemaphoreManager.acquire(resource_key):
#execute your code
finally:
SemaphoreManager.release(resource_key)
Its hard to say which approach you SHOULD take because that depends on your application.
Hope it helps you!
Good luck!

How should I implement callback for taskset in celery

Question
I use celery to launch task sets that look like this:
I perform a batch of tasks that can be run in parallel, number of tasks in this batch varies from tens to couple thousands.
I aggregate results of these tasks into single answer, then do something with this answer --- like store to the database, save to special result file and so on. Basically after tasks done executing I have to call function that has following signature:
def callback(result_file_name, task_result_list):
#store in file
def callback(entity_key, task_result_list):
#store in db
For now step 1. is done in Celery queue and step 2 is done outside celery:
tasks = []
# add taksks to tasks list
task_group = group()
task_group.tasks = tasks
result = task_group.apply_async()
res = result.join()
# Aggregate results
# Save results to file, database whatever
This approach is cumbersome since I have to stop a single thread until all tasks are performed (which can take couple of hours).
I would like to somehow move step 2 to celery also --- esentially I would need to add a callback to entire taskset (as far as I know it is unsupported in Celery) or submit a task that is executed after all these subtasks.
Does anyone have idea how to do it? I use it in the django enviorment so I can store some state in the database.
To sum up my recent findings
Chords won't do
I'cant use chords straight forwardly because chords enable me to create callbacks that look this way:
def callback(task_result_list):
#store in file
there is no obvious way to pass additional parameters to callback (especially because these callbacks can't be local functions).
Using the database either
I can store results using TaskSetMeta but this entity has no status field --- so even if I would add a signal to TaskSetMeta i'd have to pool task results which could have siginificant overhead.
Well answer was really straightforward, and I can indeed use chords --- and additional parameters (like report file name and so on) must be passed as kwargs.
Here is chord task:
#task
def print_and_sum(to_sum, file_name):
print file_name
print sum(to_sum)
return file_name, sum(to_sum)
Here is how to instantiate it:
subtasks = [...]
result = chord(subtasks)(print_and_sum.subtask(kwargs={'file_name' : 'report_file.csv'}))