I have a airflow dag with many sub-tasks, I know when certain tasks fail they can be re-run in 5 minutes, while other tasks can be re-run in 60 minutes. How can I set my tasks to rerun on failure as such?
I found this question and answer on stack overflow however this only changes the number of retries.
Operators should support a retry_delay as well - see BaseOperator:
retry_delay (datetime.timedelta) – delay between retries
Related
I'm using google cloud tasks with http triggers to invoke cloud functions. I've setup the cloud task queue retry parameters as follows:
max attempts: 2
Max retry duration: 16
Min backoff: 1
Max backoff: 16
Max doublings: 4
I will often have bursts of tasks which will create around 600 tasks within a second or two. There are times when about 15% of these will fail (this is expected and intentional). I'm expecting these failed tasks to retry according to the queue configuration. Thus I would not expect any task retry schedule to be more than 16 seconds beyond its initially scheduled time. However, I'm seeing some failed tasks scheduled several minutes out. Typically, the first few failed tasks will schedule for retry only a few seconds out, but some of the last few failed tasks in this burst will have these retry schedule for many minutes away.
Why are these retry schedules not honoring my retry config?
If it helps, I also have these settings on the queue:
Max dispatches: 40
Max concurrent dispatches: 40
I have an Apache Airflow managed environment running in which a number of DAGs are defined and enabled. Some DAGs are scheduled, running on a 15 minute schedule, while others are not scheduled. All the DAGs are single-task DAGs. The DAGs are structured in the following way:
level 2 DAGs -> (triggers) level 1 DAG -> (triggers) level 0 DAG
The scheduled DAGs are the level 2 DAGs, while the level 1 and level 0 DAGs are unscheduled. The level 0 DAG uses ECSOperator to call a pre-defined Elastic Container Service (ECS) task, to call a Python ETL script inside a Docker container defined in the ECS task. The level 2 DAGs wait on the level 1 DAG to complete, which in turns waits on the level 0 DAG to complete. The full Python logs produced by the ETL scripts are visible in the CloudWatch logs from the ECS task runs, while the Airflow task logs only show high-level logging.
The singular tasks in the scheduled DAGs (level 2) have depends_on_past set to False, and I expected that as a result successive scheduled runs of a level 2 DAG would not depend on each other, i.e. that if a particular run failed it would not prevent the next scheduled run from occurring. But what is happening is that Airflow is overriding this and I can clearly see in the UI that a failure of a particular level 2 DAG run is preventing the next run from being selected by the scheduler - the next scheduled run state is being set to None, and I have to manually clear the failed DAG run state before the scheduler can schedule it again.
Why does this happen? As far as I know, there is no Airflow configuration option that should override the task-level setting of False for depends_on_past in the level 2 DAG tasks. Any pointers would be greatly appreciated.
Answering the question "why is this happening?". I understand that the behavior you are watching is explained by the fact that Tasks are being defined with wait_for_downstream = True. The docs state the following about it:
wait_for_downstream (bool) -- when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully or be skipped before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used. Also note that only tasks immediately downstream of the previous task instance are waited for; the statuses of any tasks further downstream are ignored.
Keep in mind that the term previous instances of task X refers to the task_instance of the last scheduled dag_run, not the upstream Task (in a DAG with a daily schedule, that would be the task_instance from "yesterday").
This also explains why your Task are being executed once you clear the state of the previous DAG Run.
I hope it helps you clarifying things up!
I have a project in google cloud where there are 2 task queues: process-request to receive requests and process them, send-result to send the result of the processed request to another server. They are both running on an instance called remote-processing
My problem is that I see the tasks being enqueued in send-result but they are only executed after the process-request queue is empty and has processed all requests.
This is the instance config:
instance_class: B4
basic_scaling:
max_instances: 8
Here is the queue config:
- name: send-result
max_concurrent_requests: 20
rate: 1/s
retry_parameters:
task_retry_limit: 10
min_backoff_seconds: 5
max_backoff_seconds: 20
target: remote-processing
- name: process-request
bucket_size: 50
max_concurrent_requests: 10
rate: 10/s
target: remote-processing
Clarification : I don't need for the queues to run in an specific order, but I find it very strange that it looks like the insurance only runs one queue at a time, so it will only run the tasks in another queue after its done with the current queue.
over what period of time is this all happening?
how long does a process-request task take to run vs a send-result task
One thing that sticks out is that your rate for process-request is much higher than your rate for send-result. So maybe a couple send-result tasks ARE squeezing off, but it then hits the rate cap and has to run process-request tasks instead.
Same note for bucket_size. The bucket_size for process-request is huge compared to it's rate:
The bucket size limits how fast the queue is processed when many tasks
are in the queue and the rate is high. The maximum value for bucket
size is 500. This allows you to have a high rate so processing starts
shortly after a task is enqueued, but still limit resource usage when
many tasks are enqueued in a short period of time.
If you don't specify bucket_size for a queue, the default value is 5.
We recommend that you set this to a larger value because the default
size might be too small for many use cases: the recommended size is
the processing rate divided by 5 (rate/5).
https://cloud.google.com/appengine/docs/standard/python/config/queueref
Also, by setting max_instances: 8 does a big backlog of work build-up in these queues?
Let's try a two things:
set bucket_size and rate to be the same for both process-request and send-result. If fixes it, then start fiddling with the values to get the desired balance
bump up max_instances: 8 to see if removing that bottleneck fixes it
In one of my ECS clusters I have a scheduled Fargate task that's meant to spin up 8 instances of it's given target. However, when the task procs it starts up waaayyyy more than 8 tasks. Sometimes as many as 50. Does anyone know what could be causing this to happen?
Details:
Cron Expression: cron(40 16 ? * 1-5 *)
Target Definition:
For anyone who might run into this problem in the future:
This problem occurred because we had too many tasks running the cluster. As of the writing of this answer AWS set of limit of 50 tasks running in a single cluster. Before the rule triggered there was already close to 50 tasks running. The rule would proc and would start spinning up new tasks trying to get to the desired number (8).
However, due to the limit it would never be able to get 8 because new tasks over the limit would just get shutdown. So it would keep trying, and keep trying, and keep trying to spin up tasks which led to there being a huge pending queue of tasks that would seemingly push (nearly) all of our tasks out of the cluster and we'd be left with way more tasks than we had asked for.
The solution: we just moved the scheduled task into a new cluster to avoid the 50 task limit.
We set up a schedule to execute a command.
It is scheduled to run every 5 minutes as follows: 20090201T235900|20190201T235900|127|00:05:00
However, from the logs we see it runs only every hour.
Is there a reason for this?
check scheduling frequency in your sitecore.config file
<sitecore>
<scheduling>
<!-- Time between checking for scheduled tasks waiting to execute -->
<frequency>00:05:00</frequency>
</scheduling>
</sitecore>
The scheduling interval is based on the the scheduler interval and the job interval. Every scheduler interval period, all the configured jobs are evaluated. This is logged. During that evaluation, each job checked against the last time it ran, if that interval is greater that the configured job interval, the job is started.
It's fairly simple, but it's important to understand the mechanism. You can also see how it allows no way of inherently running jobs at a specific time, only at approximate intervals.
You can also see that jobs can never run more frequently than the scheduler interval regardless of the job interval. It is not unreasonable to set the scheduler to one-minute intervals to reduce the inaccuracy of job timings to no more than a minute.
In a worse case, with a 5 minute sheduler interval and a 5 minute job interval. The delay to job starting could be up to 9 minutes 59 seconds.