Priority Weight is not being honoured when using celery executor - airflow-scheduler

I was using Airflow 1.10.10 with the Celery executor. I defined two DAGs, each with three tasks, and the same pool ID was used for the tasks in both DAGs. The pool was configured with 3 slots. The first DAG (say High_priority) had priority_weight set to 10 for each task. The second DAG (say Low_priority) used the default priority_weight (that is, 1). I first triggered 5 Low_priority DAG runs and waited until 3 low-priority tasks had moved into the running state. Then I triggered 4 High_priority DAG runs. I expected that when a pool slot became available in the next scheduling round, a high-priority task would be moved into the queued state, but the high-priority tasks remained in the scheduled state. I repeated this 10-15 times and observed the same thing every time.
However, this works fine when I switch to the LocalExecutor.
Please suggest a fix or workaround for resolving this priority_weight issue with the CeleryExecutor.
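For reference, a minimal sketch of the kind of setup described above (Airflow 1.10-style API; the DAG IDs, task IDs, and pool name are illustrative, and the pool "shared_pool" is assumed to already exist with 3 slots):

```python
# Minimal sketch of the setup described in the question (Airflow 1.10.x API).
# DAG ids, task ids and the pool name are illustrative, not taken from the
# original environment; the pool "shared_pool" must be created separately.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "airflow", "start_date": datetime(2020, 1, 1)}

with DAG("high_priority", default_args=default_args, schedule_interval=None) as high_dag:
    for i in range(3):
        BashOperator(
            task_id="task_%d" % i,
            bash_command="sleep 60",
            pool="shared_pool",      # same pool as the low-priority DAG
            priority_weight=10,      # expected to win the pool slots first
        )

with DAG("low_priority", default_args=default_args, schedule_interval=None) as low_dag:
    for i in range(3):
        BashOperator(
            task_id="task_%d" % i,
            bash_command="sleep 60",
            pool="shared_pool",      # default priority_weight of 1
        )
```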

Related

Distributing tasks over HTTP using SpringBoot non-blocking WebClient (performance question?)

I have a large number of tasks, N, that need to be distributed to multiple HTTP worker nodes via a load balancer. Although there are multiple nodes, n, the combined max-concurrency setting across all nodes is x.
Always
N > x > n
One node can run these tasks in multiple threads. The mean execution time for each task is about 50 seconds to 1 minute. I am using WebClient to distribute tasks and receiving a Mono response from the workers.
There is a distributor, and the process is designed as follows:
1. Remove a task from the queue.
2. Send the task via a POST request using WebClient and subscribe immediately with a subscriber instance.
3. Halt new subscriptions when the max concurrency reaches x.
4. When any one of the distributed tasks completes, it calls the on-accept(T) method of the subscriber.
5. If the task queue is not empty, remove and send the next, i.e. (x+1)-th, task.
6. Keep track of the total number of completed tasks.
7. If all tasks have completed and the queue is empty, set the CompletableFuture object as complete.
8. Exit.
The above process works fine. Tested with N=5000, n=10 & x=25.
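The dispatch loop itself is language-agnostic. Below is a rough sketch of the same bounded-concurrency pattern, written with Python's asyncio rather than Reactor/WebClient; the POST to a worker is simulated with a sleep, and the scaled-down numbers are illustrative:

```python
# Conceptual sketch of the bounded-concurrency dispatch described above,
# using asyncio instead of Reactor/WebClient. The HTTP POST to a worker is
# simulated with asyncio.sleep; numbers are scaled down from the question.
import asyncio
import random

MAX_CONCURRENCY = 25   # "x" in the question
NUM_TASKS = 500        # "N" in the question

async def send_to_worker(task_id: int) -> int:
    # In the real system this would be a POST through the load balancer.
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return task_id

async def distribute() -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)  # never more than x in flight
    completed = 0

    async def run_one(task_id: int) -> None:
        nonlocal completed
        async with semaphore:              # step 3: hold new work once x is reached
            await send_to_worker(task_id)  # steps 2/4: send and await completion
        completed += 1                     # step 6: count finished tasks

    # steps 1/5: every queued task eventually gets a freed slot
    await asyncio.gather(*(run_one(i) for i in range(NUM_TASKS)))
    print("completed %d tasks" % completed)  # steps 7/8: all done

if __name__ == "__main__":
    asyncio.run(distribute())
```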
Now my concern is that in this design we always have x concurrent subscriptions; as soon as one ends, we create another until all tasks are completed. What is the impact of this in a large-scale production environment? If the number of concurrent subscriptions increases (the value of x > 10,000) through the HTTP(S) load balancer, is that going to have a serious impact on performance and network latency? Our expected production volume will be something like below:
N=200,000,000
n=100
x=10,000
I would be grateful if someone with Reactor and WebClient expertise could comment on this approach. Our main concern is having too many concurrent subscriptions.

Celery worker best practices

I am using Celery to run background jobs for my Django app, hosted on Heroku, with Redis as the broker, and I want to set up task prioritization.
I am currently using the default Celery queue, which all the workers feed from. I was thinking about implementing prioritization within this single queue, but it is described everywhere as a bad practice.
The consensus on the best approach to deal with the priority problem is to set different Celery queues for each level of priority. Let's say:
Queue1 for highest priority tasks, assigned to x workers
Queue2 the default queue, assigned to all the other workers
The first problem I see with this method is that if there are no high-priority tasks at some point, I lose the productivity of x workers.
Also, let's say my infrastructure scales up and I have more workers available; only the number of "default" workers will be expanded dynamically. Besides, this method prevents me from keeping identical dynos (containers on Heroku), which doesn't seem optimal for scalability.
Is there an efficient way to deal with task prioritization and keep replicable workers at the same time?
I know this is a very old question, but this might be helpful for someone who is still looking for an answer.
In my current project, we have an implementation like this:
1) High --> User actions (the user is waiting for this task to get completed)
2) Low --> Non-user actions (there is no wait time)
3) Reporting --> Sending emails/reports to the user
4) Maintenance --> Cleaning up the historical data from the DB
All these queues contain different tasks with different priorities (low, medium & high), so we have to implement priorities for our Celery tasks.
A) Let's assume this scenario in which we are more interested in processing the tasks based on priority.
1) We have 2 or more queues and we are pushing the tasks into the queue(s) by specifying the priority.
2) All the workers (let's say I have 4 workers) are listening to all the queues.
3) In this scenario, suppose you have 100 tasks in your queues, and of these 100 tasks, 20 are high priority, 30 are medium priority, and 50 are low priority.
4) So Celery first processes the high-priority tasks across the queues, then the medium ones, and finally the low-priority tasks (a rough configuration sketch follows below).
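A rough sketch of how scenario A can be wired up with Celery and a Redis broker (the app name, broker URL, and priority_steps values are illustrative, and a running Redis is assumed; with RabbitMQ you would instead declare x-max-priority on the queue):

```python
# Illustrative sketch of scenario A: one app, tasks enqueued with a priority,
# all workers consuming the same queues. Names and values are examples only.
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# With the Redis transport, priorities are emulated via per-priority sub-queues;
# 0 is treated as the highest priority.
app.conf.broker_transport_options = {
    "queue_order_strategy": "priority",
    "priority_steps": list(range(10)),
}

@app.task
def process(item):
    ...

# Enqueue the same task with different priorities:
process.apply_async(args=[1], priority=0)  # high
process.apply_async(args=[2], priority=5)  # medium
process.apply_async(args=[3], priority=9)  # low
```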
B) Queue1 for highest priority tasks, assigned to x workers
Queue2 the default queue, assigned to all the other workers
1) This approach would be helpful **when you are more concerned about the performance of processing the tasks in a specific queue** like I have queue **HIGH** and tasks in this queue are very important for me irrespective of the priority of the task.
2) So, I should be having a dedicated worker which would be processing the tasks only from a particular queue. (As you have mentioned in the question, it has some limitations)
We have to choose between these two options based on our requirements. I hope this is helpful,
Suresh.
For the answer, W1 and W2 are workers consuming high and low priority tasks respectively.
You can scale W1 and W2 as separate containers. You can have three containers, essentially drawn from the same image: one for the app and two for the workers. If you have a higher number of one kind of task, only that container needs to scale. Also, depending on the kind of dyno you are using, you can set the concurrency for the workers to use resources more effectively.
For your reference, this is something that I did in one of my projects.
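As a rough illustration of that layout (not the actual project code; the queue names, routes, broker URL, and worker commands are examples):

```python
# Illustrative sketch of the W1/W2 layout: route tasks to dedicated queues and
# run one worker (container) per queue, all built from the same image.
from celery import Celery
from kombu import Queue

app = Celery("proj", broker="redis://localhost:6379/0")

app.conf.task_queues = (
    Queue("high"),
    Queue("default"),
)
app.conf.task_default_queue = "default"
app.conf.task_routes = {
    "proj.tasks.user_action": {"queue": "high"},        # user is waiting
    "proj.tasks.cleanup_history": {"queue": "default"},  # background work
}

# Then start the workers from the same image, e.g.:
#   W1: celery -A proj worker -Q high    --concurrency=4
#   W2: celery -A proj worker -Q default --concurrency=2
# Scaling W1 or W2 means scaling only the container that runs that command.
```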

Keep the task in the queue even after maximum number of retry limits in google task queue

I am using google task queues and I am setting task_retry_limit on the queue.
The default behavior is that the task is removed from the task queue in the following cases:
1) when the task is executed successfully, or
2) when the task reaches the maximum number of retry attempts set.
In my use case, I have a problem with the second case. I want to keep the task in the task queue even after the maximum number of retries
(I don't want the task to be retried after task_retry_limit, but I do want to keep it in the task queue so that I can run it manually later).
Is there a parameter in queue.yaml which drives this?
I know that a workaround for this would be to set a moderate task_age_limit, but I don't want the task to keep retrying.
No, the task queues aren't presently designed to keep around tasks which reached their maximum number of retries.
I see 2 options you could try from inside your task code, when you detect that it will fail on the final retry:
Create some sort of FailedTask datastore entities with all the info/parameters required to re-create and enqueue copies of the original failing tasks later on, under manual triggers (a rough sketch of this follows below).
re-queue the task on a different queue, configured with an extremely long time between retries - long enough to not actually be retried until the moment you get to trigger them manually (you can do that for any task still pending, in any queue, at any time).
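A minimal sketch of the first option, assuming a Flask push-queue handler and Cloud Datastore (the handler path, entity kind, field names, and do_the_actual_work() are hypothetical; the X-AppEngine-TaskRetryCount header is set by App Engine on push-queue requests, and the exact off-by-one semantics of the retry count may need adjusting):

```python
# Rough sketch of option 1: when the handler detects it is on its final allowed
# attempt, persist the payload as a "FailedTask" entity instead of letting the
# task be silently dropped.
from flask import Flask, request
from google.cloud import datastore

app = Flask(__name__)
ds = datastore.Client()

TASK_RETRY_LIMIT = 5  # keep in sync with retry_parameters in queue.yaml


def do_the_actual_work(payload: bytes) -> None:
    ...  # hypothetical stand-in for the real task logic


@app.route("/work", methods=["POST"])
def work():
    retry_count = int(request.headers.get("X-AppEngine-TaskRetryCount", 0))
    try:
        do_the_actual_work(request.get_data())
        return "", 200
    except Exception:
        if retry_count >= TASK_RETRY_LIMIT - 1:
            # Final attempt: stash enough information to re-enqueue it manually later.
            entity = datastore.Entity(key=ds.key("FailedTask"))
            entity.update({
                "payload": request.get_data(),
                "queue": request.headers.get("X-AppEngine-QueueName"),
                "task_name": request.headers.get("X-AppEngine-TaskName"),
            })
            ds.put(entity)
            return "", 200   # 2xx so the queue stops retrying
        raise                # non-2xx response -> the queue will retry
```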
Somewhat related: handling failure after maximum number of retries in google app engine task queues

Celery: Do duplicate tasks get automatically executed?

I am running (Django) Celery to schedule tasks on remote workers w1, w2, w3. Each of these workers has its own queue from which it consumes tasks placed by a "scheduler", which is another Celery task running on beat on the master server:
w1: q1
w2: q2
w3: q3
The scheduler schedules tasks based on a DB check, i.e. it will reschedule a task with the same parameters if the DB doesn't get updated as a result of the task running. So if one or more of the queues is piling up, multiple tasks with the same parameters ("duplicates" from my app's perspective) may sit in multiple queues at the same time.
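For context, the setup described above would look roughly like this (app/queue/task names and the broker URL are illustrative; pending_work_from_db() is a hypothetical stand-in for the DB check):

```python
# Rough sketch of the setup described in the question: a beat-scheduled
# "scheduler" task on the master re-enqueues work onto per-worker queues
# (q1/q2/q3) whenever the DB check says a previous run hasn't taken effect.
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "reschedule-pending-work": {
        "task": "proj.scheduler",
        "schedule": 60.0,          # run the DB check every minute
    },
}


def pending_work_from_db():
    # Hypothetical DB check: yields (queue_name, params) for runs whose
    # effects haven't shown up in the DB yet.
    return []


@app.task(name="proj.do_work")
def do_work(params):
    ...


@app.task(name="proj.scheduler")
def scheduler():
    for queue_name, params in pending_work_from_db():
        # If q1/q2/q3 are backed up, the same params can end up queued more
        # than once - the "duplicates" described in the question.
        do_work.apply_async(args=[params], queue=queue_name)

# Each remote worker consumes only its own queue:
#   w1: celery -A proj worker -Q q1   (similarly q2 for w2, q3 for w3)
```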
I'm seeing some strange behavior with this: when there are duplicate tasks in multiple queues and one of the queues runs its instance of the task, the other queued-up "duplicates" get executed just a few milliseconds apart. So all of a sudden all of the tasks execute at essentially the same time, even if they were enqueued minutes apart from each other.
Is there any documentation or other reference that explains this behavior? Is it known behavior, if so how do I turn it off? I only want one instance of this task to run.

Heroku/Celery: Simultaneous tasks on one worker?

Is it possible to use Celery to set up multiple updating tasks to run simultaneously on Django/Heroku on just ONE worker? If I schedule certain functions to run every 5 minutes, will they automatically overlap in terms of when they start running, or will they wait until all other tasks are finished? I'm new to Celery and frankly very confused about what it can do ):
By default, Celery uses multiprocessing for concurrent execution of tasks: the Celery worker launches a pool of processes to consume tasks. The number of processes in the pool is set by the --concurrency argument and defaults to the number of CPUs available on the machine.
So if the concurrency level is greater than one, tasks will be processed in parallel.
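As a rough illustration (the app name, task names, and broker URL are examples), two tasks scheduled every 5 minutes will run in parallel on a single worker as long as its concurrency is at least 2:

```python
# Illustrative sketch: two tasks scheduled every 5 minutes on one worker.
# With --concurrency >= 2 (the default is the CPU count) they can run at the
# same time; with --concurrency=1 one would wait for the other to finish.
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "update-feed": {"task": "proj.update_feed", "schedule": 300.0},
    "update-prices": {"task": "proj.update_prices", "schedule": 300.0},
}

@app.task(name="proj.update_feed")
def update_feed():
    ...

@app.task(name="proj.update_prices")
def update_prices():
    ...

# Single Heroku worker dyno, e.g.:
#   celery -A proj worker --beat --concurrency=2
```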