How to improve Airflow task concurrency

I have a question about DAG and task concurrency.
Scenario:
I have two DAG files.
DAG 1 has only one task.
DAG 2 has three tasks. One of the three calls a third-party API (the API response time is around 900 milliseconds; it is a simple weather API that returns the current weather for a given city, e.g. https://api.weatherapi.com/v1/current.json?key=api_key&q=London), and the other two tasks just print log statements.
I trigger DAG 1 with a custom payload containing 1000 records
(for example:
conf: {
    [
        {
            "city": "London",
            ...
        },
        {
            ...
        }
    ]
}
)
The DAG 1 task just loops through the records and triggers DAG 2 1000 times, once per record.
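For reference, a minimal sketch of the fan-out described above, not the actual code: it assumes the payload arrives as a plain list under dag_run.conf, that DAG 2 is registered as "dag_2", and that the Airflow 2.2+ import path is available.

# Sketch only: a single DAG 1 task that loops over the records and triggers DAG 2 per record.
from datetime import datetime

from airflow import DAG
from airflow.api.common.trigger_dag import trigger_dag  # Airflow 2.2+ import path
from airflow.operators.python import PythonOperator
from airflow.utils import timezone


def fan_out(**context):
    # Assumption: dag_run.conf holds the list of ~1000 city records.
    records = context["dag_run"].conf or []
    for i, record in enumerate(records):
        trigger_dag(
            dag_id="dag_2",                                           # assumed DAG 2 id
            run_id=f"fan_out__{timezone.utcnow().isoformat()}__{i}",  # unique run id per record
            conf=record,                                              # each DAG 2 run gets one record
        )


with DAG(
    dag_id="dag_1",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="fan_out", python_callable=fan_out)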
So first, I want to ask about this approach: is splitting the work across two DAGs like this a good way to process a list of data, or is there a better way to do it?
My concern is that it takes 17 minutes for DAG 2 to complete all 1000 executions.
I am using Amazon Managed Workflows for Apache Airflow (MWAA); the configuration is as below:
Environment class: mw1.large
Scheduler count: 4
Maximum worker count: 25
Minimum worker count: 20
Region: us-west-2
core.max_active_runs_per_dag: 1000
core.max_active_tasks_per_dag: 5000
Default MWAA task configuration as per the AWS documentation
(https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html)
core.parallelism: 10000
core.dag_concurrency: 10000
Can anyone guide me on how to improve my AWS Managed Airflow performance to increase the parallelism of DAG runs?
I also want to understand the parallelism and concurrency settings: if they are set as high as in the configs above, why does it still take Airflow 17 minutes to complete the tasks?
Thanks!

Related

Failed cloud tasks are not being retried with task queue retry config

I'm using Google Cloud Tasks with HTTP triggers to invoke Cloud Functions. I've set up the Cloud Tasks queue retry parameters as follows:
Max attempts: 2
Max retry duration: 16s
Min backoff: 1s
Max backoff: 16s
Max doublings: 4
I will often have bursts of tasks that create around 600 tasks within a second or two. At times, about 15% of these will fail (this is expected and intentional). I expect these failed tasks to retry according to the queue configuration, so I would not expect any task's retry to be scheduled more than 16 seconds beyond its initially scheduled time. However, I'm seeing some failed tasks scheduled several minutes out. Typically, the first few failed tasks will be scheduled for retry only a few seconds out, but some of the last few failed tasks in the burst will have their retries scheduled many minutes away.
Why are these retry schedules not honoring my retry config?
If it helps, I also have these settings on the queue:
Max dispatches: 40
Max concurrent dispatches: 40
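For context, here is a hedged sketch of the same queue settings expressed with the google-cloud-tasks Python client, just to make the fields and units explicit; the project, location, and queue names are placeholders, and this is not the asker's code.

# Sketch: apply the retry and rate-limit settings from the question via the Cloud Tasks client.
from google.cloud import tasks_v2
from google.protobuf import duration_pb2, field_mask_pb2

client = tasks_v2.CloudTasksClient()

queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-central1", "my-queue"),  # placeholders
    retry_config=tasks_v2.RetryConfig(
        max_attempts=2,
        max_retry_duration=duration_pb2.Duration(seconds=16),
        min_backoff=duration_pb2.Duration(seconds=1),
        max_backoff=duration_pb2.Duration(seconds=16),
        max_doublings=4,
    ),
    rate_limits=tasks_v2.RateLimits(
        max_dispatches_per_second=40,
        max_concurrent_dispatches=40,
    ),
)

# Only touch the retry and rate-limit fields on the existing queue.
client.update_queue(
    queue=queue,
    update_mask=field_mask_pb2.FieldMask(paths=["retry_config", "rate_limits"]),
)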

Managed Workflows with Apache Airflow (MWAA) - how to disable task run dependency on previous run

I have an Apache Airflow managed environment running in which a number of DAGs are defined and enabled. Some DAGs are scheduled, running on a 15 minute schedule, while others are not scheduled. All the DAGs are single-task DAGs. The DAGs are structured in the following way:
level 2 DAGs -> (triggers) level 1 DAG -> (triggers) level 0 DAG
The scheduled DAGs are the level 2 DAGs, while the level 1 and level 0 DAGs are unscheduled. The level 0 DAG uses ECSOperator to call a pre-defined Elastic Container Service (ECS) task, which runs a Python ETL script inside a Docker container defined in the ECS task. The level 2 DAGs wait on the level 1 DAG to complete, which in turn waits on the level 0 DAG to complete. The full Python logs produced by the ETL scripts are visible in the CloudWatch logs from the ECS task runs, while the Airflow task logs only show high-level logging.
The singular tasks in the scheduled DAGs (level 2) have depends_on_past set to False, and I expected that as a result successive scheduled runs of a level 2 DAG would not depend on each other, i.e. that if a particular run failed it would not prevent the next scheduled run from occurring. But what is happening is that Airflow is overriding this and I can clearly see in the UI that a failure of a particular level 2 DAG run is preventing the next run from being selected by the scheduler - the next scheduled run state is being set to None, and I have to manually clear the failed DAG run state before the scheduler can schedule it again.
Why does this happen? As far as I know, there is no Airflow configuration option that should override the task-level setting of False for depends_on_past in the level 2 DAG tasks. Any pointers would be greatly appreciated.
Answering the question "why is this happening?": I understand the behavior you are observing is explained by the tasks being defined with wait_for_downstream=True. The docs state the following about it:
wait_for_downstream (bool) -- when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully or be skipped before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used. Also note that only tasks immediately downstream of the previous task instance are waited for; the statuses of any tasks further downstream are ignored.
Keep in mind that the term previous instances of task X refers to the task_instance of the last scheduled dag_run, not the upstream Task (in a DAG with a daily schedule, that would be the task_instance from "yesterday").
This also explains why your tasks are executed once you clear the state of the previous DAG run.
I hope this helps clear things up!
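For illustration, a minimal, hedged sketch of how the single level 2 task could be declared so that neither flag blocks the next scheduled run; the DAG id, task id, and level 1 DAG id are assumptions, not the asker's actual names.

# Sketch: level 2 DAG with one task; wait_for_downstream left False so depends_on_past is not forced to True.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="level_2_dag",                 # assumed name
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",     # the 15-minute schedule from the question
    catchup=False,
) as dag:
    TriggerDagRunOperator(
        task_id="trigger_level_1",        # assumed task id
        trigger_dag_id="level_1_dag",     # assumed level 1 DAG id
        depends_on_past=False,
        wait_for_downstream=False,        # if True, depends_on_past is forced to True
    )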

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow environment in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want, and dump the filtered results to S3 bucket B. It needs to run every minute since the data is coming in every minute. Every run processes about 200 MB of JSON data.
My initial setup used environment class mw1.small with 10 worker machines. If I run the task only once in this setting, each run takes about 8 minutes to finish, but once I start the every-minute schedule, most runs cannot finish, start to take much longer (around 18 minutes), and display the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried expanding the environment class to mw1.large with 15 workers; more jobs were able to complete before the error showed up, but it still could not keep up with the every-minute ingestion rate. The Negsignal.SIGKILL error would still appear before even reaching the worker machine maximum.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this: for MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This makes sure each worker machine runs one job at a time, preventing multiple jobs from sharing the worker, which saves memory and reduces runtime.
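For completeness, a hedged sketch of setting those two options programmatically with boto3 instead of through the console; the environment name and region are placeholders.

# Sketch: push the two Airflow configuration options to an existing MWAA environment.
import boto3

mwaa = boto3.client("mwaa", region_name="us-east-1")  # placeholder region
mwaa.update_environment(
    Name="my-mwaa-environment",  # placeholder environment name
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
        "celery.worker_autoscale": "1,1",
    },
)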

Understanding AWS Glue detailed job metrics

Please see the attached screenshot of the CPU Load: Driver and Executors. It looks fine in the first 6 minutes: multiple executors are active. But after 6 minutes the chart only shows the Executor Average and Driver lines. When I hover over the line, there is no usage data for any of the 17 executors. Does that mean all the executors are inactive after 6 minutes? How is the Executor Average calculated?
Thank you.
After talking to AWS support, I finally got the answer to why, after 04:07, there are no lines for individual executors but only the Executor Average and the Driver.
I was told there are 62 executors for each job; however, at any moment at most 17 executors are in use. So the Executor Average is the average over a different set of 17 executors at different moments. The default CPU Load chart only shows Executors 1 to 17, not 18 to 62. To show the other executors, you need to manually add their metrics.

AWS SWF cancelling child workflows automatically

I have an AWS SWF workflow which creates many child workflows at runtime based on the number of input files. For x input files, it creates x child workflows. It works fine when the number of input files is around 400, successfully creating and executing 400 child workflows.
The issue is that when my input has around 500 files or more, it starts that many child workflows successfully but then automatically cancels some of them. I have tried different configurations, but nothing worked.
I think the AWS limit on the number of child workflows is 1000, so that should not be the issue.
Current child workflow config:
Execution Start To Close Timeout: 2 hours 1 minute
Task Start To Close Timeout: 1 minute 30 seconds
Main workflow config:
Execution Start To Close Timeout: 9 hours
Task Start To Close Timeout: 1 minute 30 seconds
My guess is that some exception is thrown in the workflow code, which by default cancels workflows in the same cancellation scope. Read the TryCatchFinally documentation for more info about the cancellation semantics.
In general I wouldn't recommend that many child workflows in SWF; you can always do it hierarchically. For example, 30 children, each of which starts 30 children of its own, gives 900 workflows.
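To make the hierarchical fan-out concrete, here is a small, illustrative Python sketch of just the partitioning step; the file names are made up, and the actual child-workflow starts (done via the AWS Flow Framework) are not shown.

# Sketch: split the input files into batches so a parent workflow starts one
# mid-level child per batch and each mid-level child starts one leaf child per file.
def partition(files, batch_size=30):
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


batches = partition([f"file_{i}.json" for i in range(900)])  # hypothetical input
assert len(batches) == 30 and all(len(b) == 30 for b in batches)  # 30 x 30 = 900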