Every day I have a set of tasks (a few hundred) that need to be performed at random times, so a periodic task is probably not what I want. It also seems that I cannot dynamically change crontab.
Should I:
Dynamically schedule a task whenever one is received from a user, and let Celery "wake up" at the scheduled time to perform it? If so, how is this done?
OR
Create a Celery task that wakes up every 60 seconds and looks in the database for tasks scheduled for the current time, so the database acts as a queue. I am wondering if this would put too much load on the server.
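For the first option, Celery can hold a task until a given moment via the `eta` (or `countdown`) argument to `apply_async`, so no polling loop is needed. A minimal sketch; the task name `process_user_task` is a hypothetical placeholder, and only the small helper below is concrete:

```python
from datetime import datetime, timezone

def countdown_seconds(run_at, now=None):
    """Seconds from `now` until `run_at` (never negative),
    suitable for apply_async(countdown=...)."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (run_at - now).total_seconds())

# Hypothetical usage (assumes a Celery task `process_user_task`):
# process_user_task.apply_async(
#     args=[task_id],
#     countdown=countdown_seconds(scheduled_time),
# )
# or pass the absolute time directly:
# process_user_task.apply_async(args=[task_id], eta=scheduled_time)
```

One caveat worth checking in the Celery docs for your broker: `eta`/`countdown` messages are held by a worker until due, so for very long delays (days) the database-polling option can be the more robust choice.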
Related
I have a Django app and using Celery to process long running tasks.
Let's say I need to generate a file (takes 5 seconds), attach it to an email and send it to 1000 users. Which of these methods is the preferred way?
Method 1: For loop outside the task - generates numerous background tasks, each running a couple of seconds
from celery import shared_task

@shared_task
def my_task(usr):
    # gen file + send email...
    ...

def send_to_all_users(users):  # called to start task
    for usr in users:
        my_task.delay(usr)
Method 2: For loop inside the task - generates 1 background task that could run for hours
from celery import shared_task

@shared_task
def my_task(users):
    for usr in users:
        # gen file + send email...
        ...

def send_to_all_users(users):  # called to start task
    my_task.delay(users)
With method 1, I can scale up the number of workers to complete the entire job more quickly, but creating all those tasks might take a while, and I'm not sure whether my task queue can fill up so that jobs get discarded.
Method 2 seems simpler, but it might run for a very long time and I can't scale up the number of workers.
Not sure if it matters, but my app is running on Heroku and I'm using Redis as the message broker. I'm currently using a single background worker.
Celery docs on Task Granularity:
The task granularity is the amount of computation needed by each
subtask. In general it is better to split the problem up into many
small tasks rather than have a few long running tasks.
With smaller tasks you can process more tasks in parallel and the
tasks won’t run long enough to block the worker from processing other
waiting tasks.
However, executing a task does have overhead. A message needs to be
sent, data may not be local, etc. So if the tasks are too fine-grained
the overhead added probably removes any benefit.
So the first method should generally be preferred, but you have to benchmark your particular case to assess the overhead.
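A middle ground between the two methods is batching: one task per chunk of users, which keeps most of the parallelism without enqueuing 1000 individual messages. A sketch with a hypothetical `send_batch` task (Celery also ships a built-in `chunks` canvas primitive that serves the same purpose):

```python
def chunked(seq, size):
    """Split a sequence into lists of at most `size` items."""
    return [list(seq[i:i + size]) for i in range(0, len(seq), size)]

# Hypothetical dispatch (assumes a Celery task `send_batch(users)`):
# def send_to_all_users(users):
#     for batch in chunked(users, 50):
#         send_batch.delay(batch)
```

With a batch size of 50, the 1000-user job becomes 20 messages, so the per-task overhead shrinks while several workers can still run batches in parallel.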
I am having a problem with my Django Celery application. Let's assume I have two related models, Transaction and Account. I am pushing transactions over my API. What I want to achieve is to calculate the balance for a specified account.
I have a Celery task which calculates the balance. The problem is that I need a timer per account, set to e.g. 60 seconds. When a transaction comes in for the same account, the timer is reset to 60 again. I want it this way because I don't want to run the same task many times. When no transactions arrive for a given account for 60 seconds, the task should be executed.
Any architectural suggestions on how to achieve this? In fact, I just have no idea how to set up these "timers".
Thanks for any answers!
You can follow the approach of django-celery-transactions. It subclasses the Task class so that the execution logic can be customized.
For your case, you should customize apply_async to check whether there are queued tasks to be executed during the following 60 seconds (for this you can use the Celery API for inspecting workers). If a scheduled task already exists, you can skip executing the current one; if not, set an execution time 60 seconds in the future and call super().apply_async(...) with it.
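An alternative that avoids inspecting workers is to store a "last transaction" timestamp per account and let a frequent beat task fire the balance calculation only once an account has been quiet for 60 seconds. A framework-free sketch of that debounce logic (all names are assumptions; in practice `touch` would run in your API view and `due` in a beat task, with the timestamps stored in Redis or the DB rather than in memory):

```python
import time

class Debouncer:
    """Per-key debounce: a key becomes 'due' once `delay` seconds
    pass with no further touch() calls for that key."""

    def __init__(self, delay=60.0):
        self.delay = delay
        self._last_seen = {}

    def touch(self, key, now=None):
        """Record an event for `key` (e.g. a new transaction),
        resetting that key's timer."""
        self._last_seen[key] = time.time() if now is None else now

    def due(self, now=None):
        """Return (and forget) all keys quiet for at least `delay`."""
        now = time.time() if now is None else now
        ready = [k for k, t in self._last_seen.items()
                 if now - t >= self.delay]
        for k in ready:
            del self._last_seen[k]
        return ready
```

Each incoming transaction calls `touch(account_id)`; a beat task running every few seconds calls `due()` and launches the balance calculation for whatever it returns, so the task runs exactly once per quiet period.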
Let's say there is a periodic task scheduled to run every hour. A worker receives the task and starts processing. While the task is being processed, the celeryd process (controlled via supervisord) gets restarted (supervisorctl restart all). Even though the task never finished executing, it won't get re-executed.
How can I re-queue the periodic task right away, while preventing multiple versions of the task from running at the same time?
There may be a nicer way to do it, but you could just use the periodic task to create a regular task in the queue (e.g., my_actual_task.defer(…)) which will not be removed from the queue until it is completed (assuming you are using acks_late).
If you're not using acks_late, you can put the bulk of the task in a try block and call my_actual_task.retry() in the corresponding finally block.
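The try/finally shape described above can be sketched generically. In a real Celery task, `requeue` would be `self.retry(...)` on a task declared with `bind=True`; the names below are illustrative, and note that `finally` runs on exceptions (including the SystemExit a warm worker shutdown raises) but not on a hard kill:

```python
def run_with_requeue(work, requeue):
    """Run `work()`; if it does not complete normally,
    invoke `requeue()` so the job is re-queued, not lost."""
    done = False
    try:
        result = work()
        done = True
        return result
    finally:
        if not done:
            requeue()
```

On success `requeue` is never called; on any failure path it is, mirroring the "retry in finally" pattern from the answer.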
Either way, you should generally avoid killing workers without giving them a chance to finish up what they're doing.
The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way the Java Flow SDK does it: you create an ActivityWorker and give it a task list, domain, activity implementations, and a few other settings. You set both setPollThreadCount and setTaskExecutorSize. The polling threads long-poll and then hand work over to the executor threads so that further polling isn't blocked. You call start on the ActivityWorker to boot it up, and when you want to shut the workers down you call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
I ended up using a solution where another script file is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If an activity worker is already present, the previous poll found a workflow execution and started processing, so we refrain from launching another one.
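The "is a worker already running?" check in that cron script can be done with a pidfile. A sketch (the pidfile path and launcher are assumptions; the worker script would write its own PID to the file on startup):

```python
import os

def worker_is_running(pidfile):
    """True if `pidfile` names a process that is still alive."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return False          # no pidfile, or unreadable contents
    try:
        os.kill(pid, 0)       # signal 0: existence check, sends nothing
    except ProcessLookupError:
        return False          # stale pidfile; process is gone
    except PermissionError:
        return True           # alive, but owned by another user
    return True

# Cron-driven launcher (hypothetical):
# if not worker_is_running("/var/run/activity_worker.pid"):
#     subprocess.Popen(["python", "activity_worker.py"])
```

Checking liveness via `os.kill(pid, 0)` avoids the stale-pidfile trap where the file survives a crashed worker and the cron job never relaunches it.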
I am running (Django) Celery to schedule tasks to remote workers w1, w2, w3. Each worker has its own queue, from which it consumes tasks placed by a "scheduler", which is another Celery task on beat on the master server:
w1: q1
w2: q2
w3: q3
The scheduler schedules tasks based on a DB check, i.e. it will reschedule a task with the same parameters if the DB isn't updated as a result of the task's run. So if one or more of the queues pile up, multiple tasks with the same parameters ("duplicates" from my app's perspective) may sit in multiple queues at the same time.
I'm seeing some strange behavior with this: when there are duplicate tasks in multiple queues and one queue runs its instance of the task, the other queued-up "duplicates" get executed just a few milliseconds later. So all of a sudden all the tasks execute at essentially the same time, even if they were enqueued minutes apart.
Is there any documentation or other reference that explains this behavior? Is it known behavior, and if so, how do I turn it off? I only want one instance of this task to run.
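Independent of why the duplicates fire together, a common guard is a distributed lock keyed on the task's parameters, so only one instance does the work and the rest return immediately. A sketch of the key derivation; the Redis usage in the comment assumes the `redis-py` client and is illustrative:

```python
import hashlib
import json

def task_lock_key(task_name, params):
    """Deterministic lock key for a task plus its parameters, so
    duplicates (same params) map to the same key."""
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return f"lock:{task_name}:{hashlib.sha1(blob).hexdigest()}"

# Inside the task body (assumes a redis-py client `r`):
# if not r.set(task_lock_key("sync_db", kwargs), 1, nx=True, ex=300):
#     return  # duplicate: another instance holds the lock; skip
# try:
#     ...  # do the actual work
# finally:
#     r.delete(task_lock_key("sync_db", kwargs))
```

The `nx=True, ex=300` form sets the key only if it doesn't already exist, with a 300-second expiry as a safety net in case the holder dies before releasing it; both values are assumptions to tune for your task runtime.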