Airflow scheduler does not schedule tasks (or schedules them slowly) when there are a lot of tasks - airflow-scheduler

I am working with airflow on Google Cloud Composer (version: composer-1.10.2-airflow-1.10.6).
I realized that the scheduler doesn't schedule tasks when there are a lot of tasks to process (see the Gantt view below).
(Don't pay attention to the colours: the red tasks are "createTable" operators that fail if the table already exists, so they have to fail 5 times before the next part of the DAG, the important one, runs.)
There are gaps of hours between tasks! (For example, 5 hours between 10 AM and 3 PM where nothing happened.)
Normally it works fine with ~40 DAGs of about 100-200 tasks each (sometimes a bit more). But recently I added 2 DAGs with a lot of tasks (~5000 each), and now the scheduler is very slow or doesn't schedule tasks at all.
On the screenshot, I paused the 2 DAGs with a lot of tasks at 3 PM, and the scheduler is back again, doing its work fine.
Do you have any solution for this?
Airflow is meant to be a tool that handles an "infinite" number of tasks.
Here is some information about my environment:
version: composer-1.10.2-airflow-1.10.6
cluster size: 6 (12vCPUs, 96GB of memory)
Here is some information about the Airflow configuration:
╔════════════════════════════════╦═══════╗
║ Airflow parameter              ║ value ║
╠════════════════════════════════╬═══════╣
║ -(celery)-                     ║       ║
║ worker_concurrency             ║ 32    ║
║ -(webserver)-                  ║       ║
║ default_dag_run_display_number ║ 2     ║
║ workers                        ║ 2     ║
║ worker_refresh_interval        ║ 60    ║
║ -(core)-                       ║       ║
║ max_active_runs_per_dag        ║ 1     ║
║ dagbag_import_timeout          ║ 600   ║
║ parallelism                    ║ 200   ║
║ min_file_process_interval      ║ 60    ║
║ -(scheduler)-                  ║       ║
║ processor_poll_interval        ║ 5     ║
║ max_threads                    ║ 2     ║
╚════════════════════════════════╩═══════╝
Thank you for your help
EDIT:
26 of my DAGs are generated by a single .py file that parses a huge JSON Variable to create all the DAGs and tasks.
Maybe the problem comes from this, because today Airflow is scheduling tasks from other DAGs than the 26 (especially the 2 big DAGs) I described.
More precisely, Airflow sometimes schedules the tasks of my 26 DAGs, but it schedules the tasks of the other DAGs much more easily and much more often.
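For illustration, the generation pattern looks roughly like this (a minimal sketch only; the Variable name "dag_definitions", the JSON layout and the dummy tasks are placeholders, not my real code):

# Minimal sketch of generating many DAGs from one .py file and one JSON Variable.
# "dag_definitions" and the JSON structure are hypothetical placeholders.
import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy_operator import DummyOperator

# e.g. {"dag_a": {"tasks": ["t1", "t2"]}, "dag_b": {"tasks": ["t1"]}}
definitions = json.loads(Variable.get("dag_definitions", default_var="{}"))

for dag_id, conf in definitions.items():
    dag = DAG(
        dag_id=dag_id,
        schedule_interval="@daily",
        start_date=datetime(2020, 1, 1),
        catchup=False,
    )
    for task_id in conf.get("tasks", []):
        DummyOperator(task_id=task_id, dag=dag)
    # The scheduler only picks up DAGs exposed at module level.
    globals()[dag_id] = dag

If it matters: the scheduler re-parses this file regularly (see min_file_process_interval in the table above), and each parse has to fetch the Variable and rebuild every DAG, so a very large JSON variable could make each parsing pass, and therefore scheduling, slower.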

High inter-task latency is usually an indicator that there is a scheduler-related bottleneck (as opposed to something worker-related). Even when running the same DAGs over and over again, it's still possible for a Composer environment to suffer from performance bottlenecks like this, because work can be distributed differently each time, or there may be different processes running in the background.
To start, I would recommend increasing the number of threads available to the scheduler (scheduler.max_threads), and then ensuring that your scheduler is not consuming all CPU of the node it resides on. You can check CPU metrics for the node the scheduler resides on by identifying where it is, then checking in the Cloud Console. To find the node name:
# Obtain the Composer namespace name
kubectl get namespaces | grep composer
# Find the scheduler pod and the node it is running on
# ($NAMESPACE is the Composer namespace found above)
kubectl get pods -n $NAMESPACE -o wide | grep scheduler
If the above doesn't help, then it's also possible that the scheduler is intentionally blocking on a condition. To inspect all the conditions that are evaluated when the scheduler is checking for tasks to run, set core.logging_level=DEBUG. In the scheduler logs (which you can filter for in Cloud Logging), you can then check all the conditions that passed or failed in order for a task to run or to stay queued.

I feel you should upgrade to Composer version 1.10.4; having the latest patches always helps.
What database are you working with? Having all those failed tasks is highly inadvisable. Can you use CREATE TABLE IF NOT EXISTS ...?
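If the database supports it, a single idempotent DDL task avoids the 5 deliberate failures entirely. A minimal sketch, assuming BigQuery (a common choice on Cloud Composer) and the contrib BigQueryOperator from Airflow 1.10; the project, dataset and table names are placeholders:

# Hypothetical sketch: replace a fail-if-exists createTable task with an
# idempotent CREATE TABLE IF NOT EXISTS. All names below are placeholders.
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

create_table = BigQueryOperator(
    task_id="create_my_table_if_missing",
    sql="""
        CREATE TABLE IF NOT EXISTS `my_project.my_dataset.my_table` (
            id   INT64,
            name STRING
        )
    """,
    use_legacy_sql=False,  # DDL statements require standard SQL
    dag=dag,               # assumes `dag` is the DAG this task belongs to
)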

Related

Confusion About Cloudwatch Alarms

I have a CloudWatch alarm which receives data from a canary. My canary attempts to visit a website: if the website is up and responding, the datapoint is 0; if the server returns some sort of error, the datapoint is 1. Pretty standard canary stuff, I hope. This canary runs every 30 minutes.
My Cloudwatch alarm is configured as follows:
With the expected behaviour that if my canary cannot reach the website 3 times in a row, then the alarm should go off.
Unfortunately, this is not what's happening. My alarm was triggered with the following canary data:
Feb 8 @ 7:51 PM (MST)
Feb 8 @ 8:22 PM (MST)
Feb 8 @ 9:52 PM (MST)
How is it possible that these three datapoints would trigger my alarm?
My actual email was received as follows:
You are receiving this email because your Amazon CloudWatch Alarm "...." in the US West (Oregon) region has entered the ALARM state, because "Threshold Crossed: 3 out of the last 3 datapoints [1.0 (09/02/21 04:23:00), 1.0 (09/02/21 02:53:00), 1.0 (09/02/21 02:23:00)] were greater than or equal to the threshold (1.0) (minimum 3 datapoints for OK -> ALARM transition)." at "Tuesday 09 February, 2021 04:53:30 UTC".
I am even more confused because the times on these datapoints do not align. If I convert these times to MST, we have:
Feb 8 @ 7:23 PM
Feb 8 @ 7:53 PM
Feb 8 @ 9:23 PM
The time range on the reported datapoints is a two-hour window, whereas I have clearly specified my evaluation period as 1.5 hours.
If I view the "metrics" chart in cloudwatch for my alarm it makes even less sense:
The points in this chart are shown as:
Feb 9 @ 2:30 UTC
Feb 9 @ 3:00 UTC
Feb 9 @ 4:30 UTC
Which, again, appears to be a 2-hour evaluation period.
Help? I don't understand this.
How can I configure my alarm to fire if my canary cannot reach the website 3 times in a row (waiting 30 minutes in-between checks)?
I have two things to say about this:
Every time a canary runs, 1 datapoint is sent to CloudWatch. So if you are checking for 3 failures within 30 minutes to trigger the alarm, your canary should run at a 10-minute interval. That gives 3 datapoints in 30 minutes, and all 3 have to be failed datapoints for the alarm to be triggered.
For some reason the statistic option was not working for me, so I used the count option. Maybe this might help.
My suggestion is to run the canary every 5 minutes. That gives 6 datapoints in 30 minutes; then create the alarm to fire if count = 4.
The way I read your config, your alarm is expecting to find 3 datapoints within a 30-minute window, but your metric is only updated every 30 minutes, so this condition will never be true.
You need to increase the period so that 3 or more datapoints are available to evaluate, in order for the alarm to be able to trigger.
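For reference, here is a boto3 sketch of an alarm definition that matches "3 consecutive failed checks, 30 minutes apart": period of 1800 seconds and 3 evaluation periods. The alarm name, namespace, metric and dimension are placeholders for whatever your canary actually publishes:

# Sketch of an alarm that fires after 3 consecutive failed 30-minute checks.
# Alarm/namespace/metric/dimension names are placeholders, not the OP's setup.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="website-canary-3-consecutive-failures",
    Namespace="CloudWatchSynthetics",   # assumption: a Synthetics canary metric
    MetricName="Failed",
    Dimensions=[{"Name": "CanaryName", "Value": "my-canary"}],
    Statistic="Maximum",
    Period=1800,              # one datapoint per 30-minute canary run
    EvaluationPeriods=3,      # look at the last 3 periods (1.5 hours)
    DatapointsToAlarm=3,      # ...and require all 3 to breach
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",  # optional: missing runs don't count as failures
)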

AWS Elasticache backed by memcached CPU usage flat line at 1%

I've created an ElastiCache cluster in AWS with the node type t3.micro (500 MB, 2 vCPUs and network up to 5 Gigabit). My current setup has 3 nodes for high availability, with each node in a different AZ.
I'm using the AWS labs memcached client for Java (https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-java) that allows auto discovery of nodes, i.e. I only need to provide the cluster DNS record and the client will automatically discover all nodes within that cluster.
I intermittently get some timeout errors:
1) Error in custom provider, net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { /XXX.XX.XX.XXX:11211 active: false, authed: true, last read: 44,772 ms ago /XXX.XX.XX.XXX:11211 active: true, authed: true, last read: 4 ms ago /XXX.XX.XX.XXX:11211 active: true, authed: true, last read: 6 ms ago
I'm trying to understand what the problem is, but nothing really stands out when looking at the CloudWatch metrics.
The only thing that looks a bit weird is the CPU utilization graph:
The CPU always maxes out at 1% during peak hours, so I'm trying to understand how to read this value, and whether this is really 1% or more like 100%, indicating that there's a bottleneck on the CPU.
Any help on this?
Just one question: why are you using such small instances? How is the memory usage? My guess is the same as yours: the CPU is causing the trouble. 3 micro instances are not much.
I would try increasing the instance size, but it is just a guess.

AWS Glue job error - Command failed with exit code 137

I am executing an AWS Glue job with a Python shell. It fails inconsistently with the error "Command failed with exit code 137", and sometimes it executes perfectly fine with no changes.
What does this error signify? Are there any changes we can make in the job configuration to handle it?
Error Screenshot
Exit code 137 usually means the process was killed (SIGKILL), typically because it ran out of memory. Adding the worker type to the job properties will resolve the issue. Based on the file size, please select the worker type as below (see the boto3 sketch after this list):
Standard – When you choose this type, you also provide a value for Maximum capacity. Maximum capacity is the number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The Standard worker type has a 50 GB disk and 2 executors.
G.1X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 1 DPU (4 vCPU, 16 GB of memory, 64 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs.
G.2X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 2 DPU (8 vCPU, 32 GB of memory, 128 GB disk), and provides 1 executor per worker. We recommend this worker type for memory-intensive jobs and jobs that run ML transforms.
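As a rough sketch, the same change can be made with boto3 instead of the console; the job name and sizes below are placeholders, and note that update_job overwrites the existing definition, so the current Role and Command are copied over first:

# Rough sketch: raising the worker type / capacity via boto3.
# Job name and sizes are placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# JobUpdate replaces the whole job definition, so keep the existing Role/Command.
job = glue.get_job(JobName="my-glue-job")["Job"]

glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        # Spark job: pick a bigger worker type and worker count.
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
        # Python shell job: WorkerType does not apply; set "MaxCapacity": 1.0
        # (instead of the default 0.0625 DPU) in place of the two keys above.
    },
)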

Understanding AWS Glue detailed job metrics

Please see the attached screenshot of the CPU Load: Driver and Executors. It looks fine in the first 6 minutes: multiple executors are active. But after 6 minutes the chart only shows the Executor Average and Driver lines. When I put the mouse on the lines, there is no usage data for any of the 17 executors. Does that mean all the executors are inactive after 6 minutes? How is the Executor Average calculated?
Thank you.
After talking to AWS support, I finally got the answer for why, after 04:07, there are no lines for individual executors but only the Executor Average and the Driver.
I was told there are 62 executors for each job; however, at any given moment at most 17 executors are used. So the Executor Average is the average over a different set of 17 executors at each moment. The default CPU Load chart only shows Executors 1 to 17, not 18 to 62. In order to show the other executors, you need to manually add their metrics.

Spark 1.5 - unexpected number of executors

I am running a Spark job on AWS (cr1.8xlarge instances, 32 cores and 240 GB of memory per node) with the following configuration:
(The cluster has one master and 25 slaves, and I want each slave node to have 2 executors)
However, in the job tracker, it has only 25 executors:
Why does it have only 25 executors when I explicitly asked it to make 50? Thanks!