Django Celery application with timer

I am having a problem with my django celery application. Let's assume that I have two related models - Transaction and Account. I am pushing transactions over my API. What I want to achieve is to calculate the balance for a specified account.
I have a Celery task which calculates the balance. The problem is that I need a timer for each account, set to e.g. 60 seconds. When a transaction arrives for the same account, the timer is reset to 60. I want it to work this way because I don't want to run the same task many times. When no transactions arrive for a given account for 60 seconds, the task should be executed.
Any architectural suggestions on how to achieve this? I simply have no idea how to set up these "timers".
Thanks for answers!

You can follow the approach of django-celery-transactions. They subclass the Task class so that the execution logic can be customized.
For your case, you should customize apply_async to check whether there are queued tasks due to be executed within the following 60 seconds (for this you can use the Celery API for inspecting workers). If a scheduled task already exists, you can ignore the execution of the current task; if not, set an execution time 60 seconds in the future and call super().apply_async(...) with it.
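Below is a minimal sketch of that idea, assuming a task called calculate_balance(account_id). The DebouncedTask name is illustrative, and the exact shape of the inspect().scheduled() payload varies between Celery versions, so treat this as an outline rather than a drop-in implementation:

from celery import Celery, Task

app = Celery('transactions', broker='amqp://localhost')

class DebouncedTask(Task):
    abstract = True

    def apply_async(self, args=None, kwargs=None, **options):
        account_id = (args or [None])[0]
        # Ask the workers for tasks that are already scheduled (sent with an eta/countdown).
        scheduled = app.control.inspect().scheduled() or {}
        for entries in scheduled.values():
            for entry in entries:
                request = entry.get('request', {})
                # Depending on the Celery version, 'args' may be a list or its string repr.
                if request.get('name') == self.name and request.get('args') == [account_id]:
                    return None  # a run for this account is already pending; skip this one
        # No pending run: schedule one 60 seconds from now.
        options.setdefault('countdown', 60)
        return super(DebouncedTask, self).apply_async(args, kwargs, **options)

@app.task(base=DebouncedTask)
def calculate_balance(account_id):
    pass  # sum the account's transactions and store the balance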

Related

GAE - how to avoid service request timing out after 1 day

As I explained in this post, I'm trying to scrape tweets from Twitter.
I implemented the suggested solution with services, so that the actual heavy lifting happens in the backend.
The problem is that after about one day, I get this error
"Process terminated because the request deadline was exceeded. (Error code 123)"
I guess this is because, with manual scaling, requests time out after 24 hours.
Is it possible to make it run for more than 24 hours?
You can't make a single request / task run for more than 24 hours, but you can split the work into parts, each lasting a day. It's unwise to have a request run indefinitely; that's why App Engine closes requests after a certain time, to prevent idle or looping requests that never finish.
I would recommend having your task fire a call at the end to trigger the queuing of the next task; that way it's automatic and you don't have to queue a task daily. Make sure there's some cursor or some other way for your task to communicate progress so it won't duplicate work.
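A hedged sketch of that "each task queues the next one" pattern, using the App Engine task queue API; scrape_batch, do_scraping and the /tasks/scrape URL are illustrative names, not from the original post:

from google.appengine.api import taskqueue

def scrape_batch(cursor=None):
    # Do one day's worth of work, tracking progress so nothing is duplicated.
    next_cursor = do_scraping(cursor)  # hypothetical helper; returns None when finished
    if next_cursor is not None:
        # Re-enqueue ourselves so the next batch picks up where this one stopped.
        taskqueue.add(url='/tasks/scrape', params={'cursor': next_cursor})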

Celery: Numerous small tasks or one long running task?

I have a Django app and using Celery to process long running tasks.
Let's say I need to generate a file (takes 5 seconds), attach it to an email and send it to 1000 users. Which of these methods is the preferred way?
Method 1: For loop outside the task - generates numerous background tasks, each running for a couple of seconds
from celery import shared_task

@shared_task
def my_task(usr):
    pass  # gen file + send email...

def send_to_all_users(users):  # called to start task
    for usr in users:
        my_task.delay(usr)
Method 2: For loop inside the task - generates one background task that could run for hours
from celery import shared_task

@shared_task
def my_task(users):
    for usr in users:
        pass  # gen file + send email...

def send_to_all_users(users):  # called to start task
    my_task.delay(users)
With method 1, I can scale up the number of workers to complete the whole job more quickly, but creating all those tasks might take a while, and I'm not sure whether my task queue can fill up and jobs get discarded.
Method 2 seems simpler, but it might run for a very long time and I can't scale up the number of workers.
Not sure if it matters, but my app is running on Heroku and I'm using Redis as the message broker. I'm currently using a single background worker.
Celery docs on Task Granularity:
The task granularity is the amount of computation needed by each
subtask. In general it is better to split the problem up into many
small tasks rather than have a few long running tasks.
With smaller tasks you can process more tasks in parallel and the
tasks won’t run long enough to block the worker from processing other
waiting tasks.
However, executing a task does have overhead. A message needs to be
sent, data may not be local, etc. So if the tasks are too fine-grained
the overhead added probably removes any benefit.
So the first method should be preferred in general, but you have to benchmark your particular case to assess the overhead.
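If benchmarking shows that one task per user adds too much overhead, a middle ground is to batch users into chunks so each task does a bounded amount of work without flooding the queue. A minimal sketch (the chunk size and task names are illustrative, not from the question):

from celery import shared_task, group

@shared_task
def send_to_batch(user_ids):
    for uid in user_ids:
        pass  # generate the file and send the email for this user

def send_to_all_users(user_ids, chunk_size=50):
    # Split the users into fixed-size batches and dispatch one task per batch.
    batches = [user_ids[i:i + chunk_size] for i in range(0, len(user_ids), chunk_size)]
    group(send_to_batch.s(batch) for batch in batches).apply_async()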

Randomly scheduled tasks and celery

Every day I have a set of tasks (a few hundred) that need to be performed at random times, so a periodic task is probably not what I want. It seems like I cannot dynamically change the crontab schedule.
Should I:
Dynamically schedule a task whenever one is received from a user, and let Celery "wake up" at the scheduled time to perform the task? If so, how is it done?
OR
Create a Celery task that wakes up every 60 seconds and looks in the database for tasks scheduled for the current time, so the database acts as a queue. I am wondering if this would put too much load on the server?
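For what it's worth, option 1 corresponds to Celery's eta / countdown arguments to apply_async, which make Celery execute the task at (or after) the given time. A minimal sketch with illustrative task and function names:

from celery import shared_task

@shared_task
def perform_action(item_id):
    pass  # the actual work for one scheduled task

def schedule_for_later(item_id, run_at):
    # run_at is the (random) datetime chosen when the user's request arrives,
    # e.g. datetime.utcnow() + timedelta(hours=3).
    perform_action.apply_async(args=[item_id], eta=run_at)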

Django-celery project, how to handle results from result-backend?

1) I am currently working on a web application that exposes a REST API and uses Django and Celery to handle requests and solve them. For a request to get solved, a set of Celery tasks has to be submitted to an AMQP queue, so that they get executed on workers (situated on other machines). Each task is very CPU intensive and takes a very long time (hours) to finish.
I have configured Celery to also use AMQP as the result backend, and I am using RabbitMQ as Celery's broker.
Each task returns a result that needs to be stored in a DB afterwards, but not by the workers directly. Only the "central node" - the machine running django-celery and publishing tasks to the RabbitMQ queue - has access to this storage DB, so the results from the workers have to somehow make it back to this machine.
The question is: how can I process the results of the task executions afterwards? After a worker finishes, its result gets stored in the configured result backend (AMQP), but I don't know what the best way to get the results from there and process them would be.
All I could find in the documentation is that you can either check on the result's status from time to time with:
result.state
which means that basically I need a dedicated piece of code that runs this command periodically, and therefore keeps a whole thread/process busy with just this, or block everything with:
result.get()
until a task finishes, which is not what I want.
The only solution I can think of is to have an extra thread on the "central node" that periodically runs a function which checks on the async_results returned by each task at submission time, and takes action once a task reaches a finished status.
Does anyone have any other suggestion?
Also, since the processing of the result backend takes place on the "central node", what I aim for is to minimize the impact of this operation on that machine.
What would be the best way to do that?
2) How do people usually solve the problem of dealing with the results returned by the workers and put into the result backend? (assuming that a result backend has been configured)
I'm not sure if I fully understand your question, but take into account that each task has a task id. If tasks are being sent by users, you can store the ids and then check for the results over JSON as follows:
# urls.py
from django.conf.urls import patterns, url
from djcelery.views import is_task_successful

urlpatterns += patterns('',
    url(r'(?P<task_id>[\w\d\-\.]+)/done/?$', is_task_successful,
        name='celery-is_task_successful'),
)
Another related concept is that of signals: each finished task emits a signal; specifically, a finished task will emit a task_success signal. More can be found in the docs on real-time processing.
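For completeness, a minimal sketch of connecting to the task_success signal mentioned above (the handler name is illustrative; note that this handler runs in the worker process that executed the task, not on the central node):

from celery.signals import task_success

@task_success.connect
def on_task_success(sender=None, result=None, **kwargs):
    # sender is the task that just finished, result is its return value.
    print('Task %s finished with result %r' % (sender.name, result))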

Set timeout on Django View Execution

How can I set a time limit on the execution of a Django view? i.e. a view should never take more than, say, 10 seconds to execute, and if it does, it should return partway through execution. My idea is that we could use a decorator, but I am not sure. Looking for a solution. Thanks in advance.
I would suggest considering Celery, which includes built-in time limit support for tasks and would keep your Django app and server responsive:
A single task can potentially run forever, if you have lots of tasks
waiting for some event that will never happen you will block the
worker from processing new tasks indefinitely. The best way to defend
against this scenario happening is enabling time limits.
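A minimal sketch of what that looks like, assuming the slow work is moved out of the view into a task (the 10/15-second limits and the slow_operation name are illustrative):

from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(soft_time_limit=10, time_limit=15)
def slow_operation():
    try:
        pass  # the long-running work that used to live in the view
    except SoftTimeLimitExceeded:
        # The 10-second soft limit was hit: clean up and return a partial result.
        return None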