I am using Celery to run background jobs for my Django app, hosted on Heroku, with Redis as the broker, and I want to set up task prioritization.
I am currently using the single default Celery queue, which all the workers feed from. I was thinking about implementing prioritization within that one queue, but it is described everywhere as bad practice.
The consensus on the best approach to deal with the priority problem is to set different Celery queues for each level of priority. Let's say:
Queue1 for highest priority tasks, assigned to x workers
Queue2 the default queue, assigned to all the other workers
The first problem I see with this method is that if there are no high priority tasks at some point, I lose the productivity of those x workers.
Also, let's say my infrastructure scales up and I have more workers available: only the number of "default" workers will be expanded dynamically. Besides, this method prevents me from keeping identical dynos (containers on Heroku), which doesn't look great for scalability.
Is there an efficient way to deal with task prioritization and keep replicable workers at the same time?
I know this is a very old question, but this might be helpful for someone who is still looking for an answer.
In my current project, we have an implementation like this:
1) High --> User actions (user is waiting for this task to get completed)
2) Low ---> Non-User action (There is no wait time)
3) Reporting --> Sending emails/reports to the user
4) Maintenance --> Cleaning up the historical data from DB
All these queues hold different tasks, and the tasks themselves carry a priority (low, medium & high), so we had to implement priority handling for our Celery tasks.
A) Let's assume a scenario in which we are more interested in processing tasks based on their priority.
1) We have 2 or more queues and we push tasks into the queue(s), specifying a priority for each task.
2) All the workers (let's say I have 4 workers) listen to all the queues.
3) In this scenario, say you have 100 tasks in your queues, and of these 100 tasks 20 are high priority, 30 are medium priority and 50 are low priority.
4) The Celery framework then processes the high priority tasks across the queues first, then the medium, and finally the low priority tasks (a sketch of this setup follows below).
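For reference, a minimal sketch of approach A with Celery and a Redis broker could look like the following. All the names here (proj, process_order, the queue names) and the transport options are placeholders for illustration; with the Redis transport, priority is emulated via sub-queues and lower numbers are consumed first.

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# With the Redis broker, Celery emulates priorities by splitting each queue
# into sub-queues; exact behaviour depends on your Celery/Kombu version.
app.conf.broker_transport_options = {
    "queue_order_strategy": "priority",
    "priority_steps": list(range(10)),  # 0 is consumed first with Redis
}

@app.task
def process_order(order_id):
    ...  # placeholder task body

# All workers listen to all queues; priority is set per task at send time.
process_order.apply_async(args=[42], queue="high", priority=0)  # user action
process_order.apply_async(args=[43], queue="low", priority=9)   # background work

Every worker would then be started consuming both queues, e.g. celery -A proj worker -Q high,low, so no worker sits idle when one queue is empty.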
B) Queue1 for highest priority tasks, assigned to x workers
Queue2 the default queue, assigned to all the other workers
1) This approach is helpful **when you are more concerned about the performance of processing the tasks in a specific queue**, for example I have a queue **HIGH** and the tasks in this queue are very important to me irrespective of the priority of each individual task.
2) In that case I should have a dedicated worker processing tasks only from that particular queue. (As you mentioned in the question, this has some limitations; a sketch follows below.)
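A sketch of approach B might look like this; the queue names, task names and worker commands are placeholders, not something prescribed by Celery.

from celery import Celery
from kombu import Queue

app = Celery("proj", broker="redis://localhost:6379/0")

app.conf.task_queues = (
    Queue("high"),     # urgent, user-facing work
    Queue("default"),  # everything else
)
app.conf.task_routes = {
    "proj.tasks.user_action": {"queue": "high"},
}
app.conf.task_default_queue = "default"

# Dedicated worker pools, started separately, e.g.:
#   celery -A proj worker -Q high    -n high@%h    -c 2
#   celery -A proj worker -Q default -n default@%h -c 8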
We have to choose between these two options based on our requirements. I hope this helps,
Suresh.
For this answer, W1 and W2 are workers consuming high and low priority tasks respectively.
You can scale W1 and W2 as separate containers. You can have three containers, all built from the same image: one for the app, two for the workers. If you have a higher number of one kind of task, only that container scales. Also, depending on the kind of dyno you are using, you can set the concurrency of each worker to make better use of its resources.
For your reference, this is something that I did in one of my projects.
I have a large number of tasks, N, that need to be distributed to multiple HTTP worker nodes via a load balancer. Although there are multiple nodes, n, combining all nodes we have a max-concurrency setting, x.
Always
N > x > n
One node can run these tasks in multiple threads. The mean time to complete each task is about 50 sec to 1 min. I use WebClient to distribute tasks to the workers and get a Mono response back.
There is a distributor, and the process is designed as follows:
1. Remove a task from the queue.
2. Send the task via a POST request using WebClient and subscribe immediately with a subscriber instance.
3. Halt new subscriptions when the max concurrency x is reached.
4. When any one of the distributed tasks completes, it calls the on-accept(T) method of the subscriber.
5. If the task queue is not empty, remove and send the next task, i.e. the (x+1)-th task.
6. Keep track of the total number of completed tasks.
7. When all tasks are completed and the queue is empty, mark the CompletableFuture object as complete.
8. Exit.
The above process works fine. Tested with N=5000, n=10 & x=25.
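Not Reactor/WebClient, but the same bounded-concurrency pattern can be sketched in Python with asyncio and a semaphore, which may make the design easier to reason about. The endpoint URL and payload format are placeholders.

import asyncio
import aiohttp

X = 25  # max number of tasks in flight across all worker nodes

async def send_task(session, sem, task):
    async with sem:  # at most X concurrent POSTs, like the x subscriptions
        async with session.post("http://load-balancer/work", json=task) as resp:
            return await resp.json()  # "on-accept" equivalent

async def distribute(tasks):
    sem = asyncio.Semaphore(X)
    async with aiohttp.ClientSession() as session:
        # every finished request frees a semaphore slot for the next task
        return await asyncio.gather(*(send_task(session, sem, t) for t in tasks))

if __name__ == "__main__":
    asyncio.run(distribute([{"id": i} for i in range(5000)]))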
Now the question is about this design: we always have x concurrent subscriptions, and as soon as one ends we create another, until all tasks are completed. What is the impact of this in a large-scale production environment? If the number of concurrent subscriptions going through the HTTP(S) load balancer increases (x > 10,000), is that going to have a serious impact on performance and network latency? Our expected production volume will be something like this:
N=200,000,000
n=100
x=10,000
I would be grateful if someone with Reactor and WebClient expertise could comment on this approach. Our main concern is having too many concurrent subscriptions.
For a website I'm developing on AWS, a user can submit a large job (e.g. select a large number of items and ask to update them all in some way). We don't want to limit the size of the jobs users can submit, so a job can in theory run for a very long time and require a large amount of memory (this rules out AWS Lambda as a compute engine option). We want jobs to be as independent from one another as possible, so we chose to run each job in its own container in Amazon ECS. What we currently do when a user submits a job request is send a message with a job id/reference to an SQS queue, have an AWS Lambda poll that queue and, upon receiving a message, start an ECS task (SQS -> Lambda -> ECS). The problem is that a new ECS task is started with each request, so a new container must be booted up, which can take minutes. This latency is directly visible to the user and is particularly unacceptable when the user's job is not even large yet they still wait minutes for the container to boot up. Additionally, the cost of keeping a container or two constantly running would not be a problem for us.
I've been toying with some ideas for updating this flow.
Attempt 1:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
    message = pollForMessages();
}
processMessage(message);
// task finishes, and container can be brought down
We remove the Lambda from the flow and just have SQS -> ECS rather than SQS -> Lambda -> ECS. In this case there would be no cold start, assuming a container is already up and polling for messages. We could set the minimum number of tasks we want running to a number > 0 to ensure all messages are processed at some point. However, this suffers from the problem that it would not auto-scale as the number of messages in the queue increases, so something needs to spawn more containers when traffic increases.
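For what it's worth, the polling loop above could look roughly like this with boto3; the queue URL and the process_message handler are placeholders.

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

def process_message(body):
    ...  # the actual job

def main():
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling instead of a busy loop
        )
        messages = resp.get("Messages", [])
        if not messages:
            continue
        process_message(messages[0]["Body"])
        sqs.delete_message(
            QueueUrl=QUEUE_URL, ReceiptHandle=messages[0]["ReceiptHandle"]
        )
        break  # task finishes, and the container can be brought down

if __name__ == "__main__":
    main()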
Attempt 2:
In this updated flow we'd create an ECS task that looks like the following:
message = null;
while (message == null) {
    message = pollForMessages();
}
if (number of running tasks < number of messages in queue) {
    spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
This comes with the issue that we could end up over-provisioning containers if multiple containers see that there are more messages in the queue than tasks running. Since these tasks run forever until a message is processed, this could result in a large unnecessary cost. It could also under-provision containers: if a task sees that the number of running tasks >= the number of messages, but those running tasks are already busy processing messages, they will not take any of the new messages off the queue, and we may end up with messages that wait a very long time to be processed.
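As a concrete illustration of what spawnMoreContainers() would have to do (and of exactly the race described above, since the backlog and task count are read without any coordination), here is a boto3 sketch; the queue URL, cluster, task definition and subnet are placeholders.

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder
CLUSTER = "jobs-cluster"   # placeholder
TASK_DEF = "job-worker"    # placeholder

def maybe_spawn_more():
    sqs = boto3.client("sqs")
    ecs = boto3.client("ecs")

    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # ignores pagination; fine for a sketch
    running = len(ecs.list_tasks(cluster=CLUSTER, desiredStatus="RUNNING")["taskArns"])

    if running < backlog:
        ecs.run_task(
            cluster=CLUSTER,
            taskDefinition=TASK_DEF,
            launchType="FARGATE",
            count=min(backlog - running, 10),  # run_task launches at most 10 per call
            networkConfiguration={  # subnet is a placeholder
                "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}
            },
        )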
Attempt 3:
message = null;
while (message == null) {
    message = pollForMessages();
    if (# of containers > min provisioned && this particular container has been running longer than some timeout) {
        // finish this task so this container can be brought down
        return;
    }
}
if (number of running tasks < number of messages in queue) {
    spawnMoreContainers();
}
processMessage(message);
// task finishes, and container can be brought down
While this may save us some cost compared to Attempt 2, so over-provisioning wouldn't be as much of an issue, there is still the possibility of under-provisioning containers, in which case certain job requests would have to wait for potentially long periods before being processed.
Attempt 4:
We can introduce locking (e.g. https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/) to mitigate some of the race conditions. However, we will always have the issue that a running task is not necessarily a task that is available to pick up messages, and Fargate gives us no way of distinguishing between the two, which makes it difficult to determine how many containers to provision (e.g. we see there are 5 running containers and 5 messages, but we don't know whether to provision more containers because we don't know whether those containers are already processing a message or waiting for one). Alternatively, we could introduce some mechanism, either an external orchestrator or some logic within the containers plus a data store, to manage the state of these containers.
Essentially, to deal with each of these problems the architecture becomes more and more complex, and the implementation would be difficult and error prone.
It also seems to me like these solutions are reinventing the wheel, and I feel there must be some service out there that has solved this problem already, but I can’t seem to find it.
The suggestions I’ve seen to deal with this are:
Maybe AWS Batch is better suited to this use case. Indeed, AWS Batch might be the more recommended approach for a workload like this, but we don't remove any of the cold start problem by switching: AWS Batch would still create a new container for each job.
Run the ECS tasks on EC2 rather than Fargate, then cache the container image on the host. With this we'd be managing our own infrastructure, and ideally we'd like this to be serverless.
Have an alarm on the number of messages in the queue and have this alarm trigger a Lambda that boots up more containers. Alarms on the /AWS log group have a minimum period of 1 minute, which means the alarm would not be triggered until a minute after we'd received more requests than our provisioned containers can handle. Additionally, we'd have to set up many alarms to scale at different numbers of messages.
I’m wondering if anyone is aware of potential services/frameworks that could make doing this more feasible? Or if anyone has suggestions on alternative architectures?
If you don't mind a slightly slower response to bursts, you can create an Auto Scaling group (I assume there is something similar for ECS). This group can be governed by a custom metric, e.g. queue length divided by the number of workers. A detailed guide is here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html
In any case, I'd decouple the scaling decision from the worker code, because otherwise there is a varying number of workers that you would need to keep in sync. It's much easier to have one overseer that controls how many workers there should be. Because the overseer is not on the critical path of task processing, you don't need to worry much about its uptime. It's OK if it takes a few minutes to recover after a failure; the workers are still there, processing at least at some capacity.
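A sketch of such an overseer publishing a "backlog per worker" custom metric to CloudWatch, following the pattern in the linked guide; the queue URL, namespace and metric name are assumptions.

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

def publish_backlog_per_worker(worker_count):
    sqs = boto3.client("sqs")
    cloudwatch = boto3.client("cloudwatch")

    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    cloudwatch.put_metric_data(
        Namespace="MyApp/Workers",  # placeholder namespace
        MetricData=[{
            "MetricName": "BacklogPerWorker",
            "Value": backlog / max(worker_count, 1),
            "Unit": "Count",
        }],
    )

A target-tracking scaling policy on that metric then adds or removes workers, and the workers themselves stay simple pollers.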
We're designing a C# scheduled task (it runs every few hours) that will run on AWS ECS instances and grab batched transaction data for thousands of customers from an endpoint, modify the data, then send it on to another web service. We will maintain the state of the last successful batch in a separate database (using something like the created date of the transactions). We need the system to be scalable, so as more customers are added we add additional ECS containers to process the data.
These are the options we're considering:
Each container only processes a specific subset of the data. As more customers are added, more containers are added. We would need to maintain a logical separation of which containers are processing which customers' data.
All the containers process all of the customers. We use some kind of locking flags in the database to let other processes know that a customer's data is being processed.
Some other approach.
I think that option 2 is probably the best, but it adds a lot of complexity around locking and unlocking customers. Are there specific design patterns I could be pointed towards if that is the correct solution?
In both scenarios an important thing to consider is retries in case processing fails for a specific customer. One potential way to distribute jobs across a vast number of containers with retries would be to use AWS SQS.
A single container would run periodically every few hours and act as the job generator. It would create one SQS queue item for each customer that needs to be processed. In response to items appearing in the queue, a number of "worker" containers would be spun up by ECS to consume items from the queue. This can be made to autoscale relative to the number of items in the queue, so many containers that work in parallel can be spun up quickly.
Each container would use its own high-performance concurrent poller similar to this (https://www.npmjs.com/package/squiss) to start grabbing items from the queue and processing them. If a worker fails or crashes due to a bug, SQS automatically redelivers any dropped queue items that worker had been working on to a different worker after they time out.
This approach would give you a great deal of flexibility, and would let you horizontally scale out the number of workers, while letting any of the workers process any jobs from the queue that it grabs. It would also ensure that every queued item gets processed at least once, and that none get dropped forever in case something crashes or goes wrong.
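A rough sketch of the generator side in Python with boto3; the queue URL and the customer lookup are placeholders, and the worker side is just the usual receive/process/delete loop.

import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/customer-jobs"  # placeholder

def customers_to_process():
    # placeholder: e.g. read the "last successful batch" state from the database
    return []

def enqueue_jobs():
    sqs = boto3.client("sqs")
    for customer_id in customers_to_process():
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"customer_id": customer_id}),
        )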
TL;DR
I have producers, tasks and consumers. I need a scalable queuing system which can ensure that a task can be consumed once and only once, and which can sort the tasks according to their priority.
The context:
We have a prototype working, but it's not "scale ready", and today we need to scale...
Below is the prototype "process":
1°) Some customers upload datasets into the database (PostgreSQL).
2°) Every second, an application fetches new datasets from the database and converts them into tasks.
One customer's dataset can generate thousands of tasks (~500K tasks/day, ~30K tasks/customer).
3°) A "Dispatcher" application:
fetches sorted tasks from the database (tasks with the smallest dataset are processed first even if they were submitted later, plus some random value to shuffle them); a sketch of this fetch is shown after the list
performs some validations (checks whether the task has been cancelled or not)
dispatches each task to the appropriate worker.
Each worker can process only one kind of task, but it can process thousands of them concurrently.
4°) The workers receive the tasks and push the results to the database.
5°) A "Monitor" application checks the state of all tasks and retries any task that needs it (e.g. the worker crashed).
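For step 3°, the fetch can be made safe for multiple dispatchers with PostgreSQL's FOR UPDATE SKIP LOCKED. A sketch with psycopg2; the table and column names are assumptions.

import psycopg2

def claim_tasks(conn, batch_size=100):
    # atomically claim a batch of pending tasks so no two dispatchers
    # pick up the same task
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE tasks
               SET state = 'dispatched'
             WHERE id IN (
                   SELECT id
                     FROM tasks
                    WHERE state = 'pending'
                 ORDER BY dataset_size, shuffle_key
                    LIMIT %s
                      FOR UPDATE SKIP LOCKED)
            RETURNING id, kind, payload
            """,
            (batch_size,),
        )
        return cur.fetchall()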
Today the bottleneck is the SQL server. I can tune it, but I would prefer to redesign it the right way, so I was wondering whether there are some best practices for this kind of process.
It seems I need a distributed queuing system (Kafka?) that can guarantee that a task will be processed once and only once, but that can also manage priority.
Preconditions: there is a small Celery cluster processing some tasks. Each Celery instance has a few workers running. Everything is running under Flask.
Task: I need the ability to pause/resume consumption of tasks on a particular node from the code, i.e. a task can decide whether the current Celery instance and all its workers should pause or resume consuming tasks.
I didn't find any straightforward way to solve this. Any suggestions?
Thanks in advance!
Control.cancel_consumer(queue, **kwargs) (reference) is all that you probably need for your use case.
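A sketch of how that could be called from task code, together with add_consumer to resume later; the app import, queue name and worker hostname are placeholders.

from proj.celery import app  # your Celery app instance (placeholder import)

def pause_node(hostname, queue="celery"):
    # stop the given worker from consuming the queue; in-flight tasks still finish
    app.control.cancel_consumer(queue, destination=[hostname])

def resume_node(hostname, queue="celery"):
    app.control.add_consumer(queue, destination=[hostname])

# e.g. from inside a task running on that node:
#   pause_node("worker1@example.com")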
Perhaps a better strategy would be to divide the work across several queues.
Have a default queue where all tasks start. The workers watching the default queue can, according to your logic, add subtasks to the other active queues. You may not need this extra queue if you can add tasks to the active queues directly from Flask.
That way, each node does not have to worry about whether it's paused or active; it just consumes everything that's been added to its queue. These location-specific queues will be empty (and thus effectively paused) unless the default workers have added subtasks.
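A minimal sketch of that layout; the queue names ("default", "node-a", "node-b") and the routing rule are made up for illustration.

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

@app.task
def triage(job):
    # runs on workers consuming the "default" queue; decides which node
    # should do the real work and enqueues the subtask on that node's queue
    target = "node-a" if job.get("region") == "eu" else "node-b"
    do_work.apply_async(args=[job], queue=target)

@app.task
def do_work(job):
    ...  # consumed only by the worker started with -Q node-a (or node-b)

# workers:
#   celery -A proj worker -Q default -n triage@%h
#   celery -A proj worker -Q node-a  -n node-a@%h
#   celery -A proj worker -Q node-b  -n node-b@%h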