Scheduler node always above 100% utilization - airflow-scheduler

We're running Cloud Composer on a 5 node, n1-standard-2 cluster running composer-1.11.3-airflow-1.10.9. Private IP is enabled. Python 3 is selected. We currently have around 30 DAGs, some containing over 100 tasks. Most DAGs run once a day.
The node running the airflow scheduler workload is consistently running at around 150% CPU utilisation regardless of the number of running tasks. The only way to lower the CPU usage is to remove DAGs until only 5 or 6 remain (obviously not an option). What we've tried:
We have followed this Medium article detailing how to run the scheduler service on a dedicated node; however, we cannot find a configuration that reduces the CPU usage. We've tried a node as powerful as an e2-highcpu-32 running on a 100 GB SSD. Usage remained at 150%.
We've tried updating the airflow.cfg variables to reduce how frequently the DAGs directory is parsed, via settings such as store_serialized_dags and max_threads. Again, this did not have any impact on the CPU usage.
For reference the other nodes all run at 30-70% CPU for the majority of the time, spiking to over 100% for short periods when big DAGs are running. No node has any issue with memory usage, with between 2 GB and 4 GB used.
We plan on adding more DAGs in the future and are concerned the scheduler may become a bottleneck with the current setup. Are there any other configuration options available to reduce the CPU usage and allow for a future increase in the number of DAGs?
Edit in response to Ines' answer:
I'm seeing the CPU usage as a percentage in the Monitoring tab; the node running the scheduler service is coloured orange:
Additionally, when I look at the pod running airflow scheduler this is the CPU usage, pretty much always 100%:

Please have a look at the official documentation, which describes the CPU usage per node metric. Can you elaborate on where you see the percentage values, since the documentation mentions a core time usage ratio:
A chart showing the usage of CPU cores aggregated over all running
Pods in the node, measured as a core time usage ratio. This does not
include CPU usage of the App Engine instance used for the Airflow UI
or Cloud SQL instance. High CPU usage is often the root cause of
Worker Pod evictions. If you see very high usage, consider scaling out
your Composer environment or changing the schedule of your DAG runs.
In the meantime, there is an ongoing workaround that would be worth trying. You should follow these steps to limit CPU usage on the syncing pod:
Go to environment configuration page and click view cluster workloads
Click airflow-scheduler, then edit
Find the container named gcs-syncd and add:
resources:
  limits:
    cpu: 300m   # or another value; you can start by trying 300m
  requests:
    cpu: 10m
then click save (at the bottom).
Repeat the procedure for airflow-worker.
We also have to edit the airflow-scheduler section of the airflow-scheduler workload itself. Click edit on the YAML file and, for the airflow-scheduler section, add:
resources:
  limits:
    cpu: 750m
  requests:
    cpu: 300m
It would be great if you could try the aforementioned steps and see if it improves the performance.
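If you prefer the command line to the console editor, the same resource limits can be sketched with kubectl patch. This is an illustration, not part of the original workaround: the namespace ("default") and the container index (0) are assumptions you should verify against your own cluster first.

```shell
# Hedged sketch: apply the scheduler CPU limits from the steps above with
# kubectl instead of the console editor. The namespace ("default") and the
# container index (0) are assumptions; check them first with:
#   kubectl get deployment airflow-scheduler -o yaml
kubectl patch deployment airflow-scheduler --namespace default --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/resources","value":{"limits":{"cpu":"750m"},"requests":{"cpu":"300m"}}}]'
```

The same pattern works for the airflow-worker Deployment.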
Sometimes the bucket's /logs folder can contain a large number of files, which causes gcs-syncd to use a lot of CPU while doing an internal synchronization of the logs. You can try removing some of the oldest logs from the bucket gs://<composer-env-name>/logs. For example, to remove all logs from May, use the following command:
gsutil -m rm -r gs://europe-west1-td2-composter-a438b8eb-bucket/logs/*/*/2020-05*
Ideally, the GCE instances shouldn't be running over 70% CPU at all times, otherwise the Composer environment may become unstable during periods of high resource usage.

Related

Are there any problems with running same cron job that takes 2 hours to complete every 10 minutes?

I have a script that takes two hours to run, and I want to run it every 15 minutes as a cron job on a cloud VM.
I noticed that my CPU is often at 100% usage. Should I resize the memory and/or number of cores?
Each time you execute your cron job, a new process will be created.
So if your job takes 120 minutes (2 hours) to complete and you start a new job every 15 minutes, you will have 8 jobs running at the same time (120/15).
Thus, if the jobs are resource intensive, you will observe issues, such as 100% cpu usage.
So the question of whether to scale up or not really depends on the nature of these jobs. What do they do, and how much CPU and memory do they take? Based on your description you are already running at 100% CPU often, so an upgrade would be warranted in my view.
It would depend on your cron, but outside of resourcing for your server/application the following issues should be considered:
Is there overlap in data? I.e. do you retrieve a pool of data that will be processed multiple times?
Will duplicate critical actions happen? I.e. will a customer receive an email multiple times, or will a payment be processed multiple times?
Is there a chance of a race condition that causes the script to exit early?
Will there be any collisions in the processing, i.e. duplicate bookings made, etc.?
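One common mitigation for the overlap concerns above (a sketch, not part of the original answer) is to wrap the cron job in flock, so that a new invocation exits immediately if the previous run is still in progress:

```shell
# Guard a long-running cron job with an exclusive lock. If the previous run
# still holds the lock, -n makes flock exit immediately instead of waiting.
# /tmp/report.lock and the echo body are placeholders for your real job.
LOCK=/tmp/report.lock

if flock -n "$LOCK" -c 'echo "running the job"'; then
  echo "job completed"
else
  echo "previous run still in progress, skipping this one"
fi
```

In a crontab this collapses to a one-liner such as `*/15 * * * * flock -n /tmp/report.lock /path/to/job.sh`, so overlapping runs simply never start.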
You will need to increase the CPU and memory specification of your VM instance (in GCP) due to the high CPU load. The document [1] describes upgrading the machine type of your VM instance; to do this you need to shut down the VM and change its machine type.
To learn about the different machine types in GCP, please see the link [2].
On the other hand, you can autoscale based on the average CPU utilization if you use managed instance group (MIG) [3]. Using this policy tells the autoscaler to collect the CPU utilization of the instances in the group and determine whether it needs to scale. You set the target CPU utilization the autoscaler should maintain and the autoscaler works to maintain that level.
[1] https://cloud.google.com/compute/docs/instances/changing-machine-type-of-stopped-instance
[2] https://cloud.google.com/compute/docs/machine-types
[3] https://cloud.google.com/compute/docs/autoscaler/scaling-cpu-load-balancing#scaling_based_on_cpu_utilization
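The stop / change machine type / restart cycle from [1] can be sketched with gcloud as follows. The instance name, zone, and target machine type below are placeholder values, not details from the question:

```shell
# Hedged sketch of the procedure in [1]: stop the VM, change its machine
# type, then start it again. "my-vm", "us-central1-a" and "e2-standard-4"
# are placeholders; substitute your own instance name, zone and type.
gcloud compute instances stop my-vm --zone=us-central1-a
gcloud compute instances set-machine-type my-vm \
  --zone=us-central1-a --machine-type=e2-standard-4
gcloud compute instances start my-vm --zone=us-central1-a
```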

GCE shows only ~15% usage but top shows ~99%

I was running a load test but seeing only ~20% CPU on GCE. That was surprising, so I decided to SSH into my machine and run top, which showed me 99.7% utilization.
I found a question that was very similar: Google Cloud Compute engine CPU usage shows 100% but dashboard only shows 10% usage
However, I am certain I only have one core (1 vCPU, 3.75 GB memory).
Here's top running that shows 99.7% utilization:
What could be the reason for this?
Even in the single-core case, when the workload is not split between several cores, the curve shape and the Y-coordinate of the CPU Utilization chart depend on the aggregation and alignment settings you use: max vs. mean, a 1m vs. 1h alignment period, and so on. In the case of a short peak load, for example, a wide time window acts as a big denominator for the mean aligner, so you'll get lower values on the chart.
For more details please see:
Google Monitoring > Documentation > Alignment
Google Cloud Blog > Stackdriver tips and tricks: Understanding metrics and building charts
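As a toy illustration of that denominator effect (the numbers here are made up, not from any real chart): sixty per-second samples containing a 6-second burst at 100% average out to under 12% under a 1-minute mean aligner, while a max aligner still reports the full spike:

```shell
# Toy model: 60 per-second CPU samples, a 6-second spike at 100% and
# 2% idle otherwise. The mean over the 1-minute window dilutes the spike;
# the max preserves it.
awk 'BEGIN {
  for (i = 1; i <= 60; i++) {
    v = (i <= 6) ? 100 : 2      # 6-second spike, then idle
    sum += v
    if (v > max) max = v
  }
  printf "mean aligner: %.1f%%  max aligner: %.0f%%\n", sum / 60, max
}'
```

This is why top (a near-instantaneous view) and the dashboard (an aligned, aggregated view) can disagree so dramatically.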

Neo4j performance discrepancies local vs cloud

I am encountering drastic performance differences between a local Neo4j instance running on a VirtualBox-hosted VM and a basically identical Neo4j instance hosted in Google Cloud (GCP). The task involves performing a simple load from a Postgres instance also located in GCP. The entire load takes 1-2 minutes on the VirtualBox-hosted VM instance and 1-2 hours on the GCP VM instance. The local hardware setup is a 10-year-old 8 core, 16GB desktop running VirtualBox 6.1.
With both VirtualBox and GCP I perform these similar tasks:
provision a 4 core, 8GB Ubuntu 18 LTS instance
install Neo4j Community Edition 4.0.2
use wget to download the latest apoc and postgres jdbc jars into the plugins dir
(only in GCP is the neo4j.conf file changed from defaults. I uncomment the "dbms.default_listen_address=0.0.0.0" line to permit non-localhost connections. Corresponding GCP firewall rule also created)
restart neo4j service
install and start htop and iotop for hardware monitoring
log in to the empty neo4j instance via the browser console
load jdbc driver and run load statement
The load statement uses apoc.periodic.iterate to call apoc.load.jdbc. I've varied the "batchSize" parameter in both environments from 100-10000 but only saw marginal changes in either system. The "parallel" parameter is set to false because true causes lock errors.
Watching network I/O, both take the first ~15-25 seconds to pull the ~700k rows (8 columns) from the database table. Watching CPU, both keep one core maxed at 100% while another core varies from 0-100%. Watching memory, neither takes more than 4GB and swap stays at 0. Initially, I did use the config recommendations from "neo4j-admin memrec" but those didn't seem to significantly change anything either in mem usage or overall execution time.
Watching disk, that is where there are differences. But I think these are symptoms and not the root cause: the local VM consistently writes 1-2 MB/s throughout the entire execution time (1-2 minutes). The GCP VM burst writes 300-400 KB/s for 1 second every 20-30 seconds. But I don't think the GCP disks are slow or the problem (I've tried with both GCP's standard disk and their SSD disk). If the GCP disks were slow, I would expect to see sustained write activity and a huge write-to-disk queue. It seems whenever something should be written to disk, it gets done quickly in GCP. It seems the bottleneck is before the disk writes.
All I can think of are that my 10-year-old cores are way faster than a current GCP vCPU, or that there is some memory heap thing going on. I don't know much about java except heaps are important and can be finicky.
Do you have the exact same :schema on both systems? If you're missing a critical index used in your LOAD query that could easily explain the differences you're seeing.
For example, if you're using a MATCH or a MERGE on a node by a certain property, it's the difference between doing a quick lookup of the node via the index, or performing a label scan of all nodes of that label checking every single one to see if the node exists or if it's the right node. Understand also that this process repeats for every single row, so in the worst case it's not a single label scan, it's n times that.
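As a hedged illustration of checking and fixing this (the :Person label and id property are placeholders, not names from the question), comparing and creating indexes in Neo4j 4.x looks roughly like:

```cypher
// List the indexes present on each system and compare the two outputs.
CALL db.indexes();

// Create the missing index on the label/property your MATCH or MERGE uses.
// :Person and id are placeholder names for your actual label and property.
CREATE INDEX person_id FOR (n:Person) ON (n.id);
```

If the index exists locally but not in GCP, that alone could account for a minutes-to-hours difference over ~700k rows.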

Cloud computing service to run thousands of containers in parallel

Is there any provider that offers such an option out of the box? I need to run at least 1,000 concurrent sessions (Docker containers) of headless web browsers (Firefox) for complex UI tests. I have a Docker image that I just want to deploy and scale to 1,000 1CPU/1GB instances in seconds, without spending time maintaining a cluster of servers (I need to shut them all down after the job is done); I just want to focus on the code. The closest thing I've found so far is Amazon ECS/Fargate, but its limits make no sense to me ("Run containerized applications in production" -> max limit: 50 tasks -> production -> ok). Am I missing something?
I think that AWS Batch might be a better solution for your use case. You define a "compute environment" that provides a certain level of capacity, then submit tasks that are run on that compute environment.
I don't think you'll find anything that can start up an environment and deploy a large number of tasks in "one second": in my experience it takes about a minute or two of ramp-up time for Batch, although once the machines are up and running they are able to sequence jobs quickly. You should also consider whether it makes sense to run all 1,000 jobs concurrently; that will depend on what you're trying to get out of your tests.
You'll also need to be aware of any places where you might be throttled (for example, retrieving configuration from the AWS Parameter Store). This talk from last year's NY Summit covers some of the issues that the speaker ran into when deploying multiple-thousands of concurrent tasks.
You could use Lambda layers to run headless browsers (I know there are several implementations for Chromium/Selenium on GitHub; I'm not sure about Firefox).
Alternatively, you could try contacting the AWS team to see how much the limit for concurrent tasks on Fargate can be increased. As the documentation shows, the 50-task limit is a soft limit and can be raised.
Be aware that if you start tasks via Fargate, there is an API limit on requests per second. You need to make sure you throttle your API calls, or use the ECS CreateService API instead.
In any case, starting 1000 tasks would require 1000 seconds, which is probably not what you expect.
Those limits are not there if you use ECS, but in that case you need to manage the cluster, so it might be a good idea to explore the lambda option.
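The batching this implies can be sketched as follows. This is a sketch under stated assumptions, not a turnkey script: the cluster, task definition, and subnet names are placeholders, and you should check the current RunTask per-call maximum and rate limits before relying on these numbers.

```shell
# Hedged sketch: launch 1,000 Fargate tasks in batches of 10 (the RunTask
# per-call maximum), sleeping between calls to stay under the API rate
# limit. Cluster, task definition and subnet ID are placeholder values.
for i in $(seq 1 100); do
  aws ecs run-task \
    --cluster browser-tests \
    --launch-type FARGATE \
    --task-definition headless-firefox \
    --count 10 \
    --network-configuration 'awsvpcConfiguration={subnets=[subnet-0abc1234],assignPublicIp=ENABLED}'
  sleep 1   # throttle RunTask calls
done
```

Even batched like this, ramp-up takes on the order of minutes rather than seconds, which matches the answer above.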

What is the scale of the CPU utilisation for f1-micro on GCP?

The instance details of my f1-micro instance shows a graph of CPU utilisation fluctuating between 8% and 15%, but what is the scale? The f1-micro has 0.2 CPU so is my max 20%? Or does the 100% in the graph mark my 20% of the CPU? Occasionally the graph has gone above 20% but is it bursting then? Or does the bursting start at 100% in the graph?
The recommendation to increase performance is always displayed. Is it just sales tactics? The VM is a watchdog so it is not doing much.
I tried to build up a small test in order to answer your question and if interested you can do the same to double check.
TEST
I created two instances, one f1-micro and one n1-standard-1, and then forced a CPU burst using stress, but you can use any tool of your choice.
$ sudo apt-get install stress
$ stress --cpu 1 & top
In this way we can compare the output of top on the two instances with what the dashboard shows. Since the operating system is not aware that it is sharing the CPU, we expect to see 100% from inside the machine.
RESULTS
While the output of top for both instances showed, as expected, that 99.9% of the CPU was in use, the output of the dashboard is more interesting.
n1-standard-1 showed a stable value around 100% the whole time.
The f1-micro showed an initial spike to 250% (because it was using a bigger share of the CPU than originally assigned, i.e. it was running in bursting mode) and then dropped back to 100%.
I repeated the test several times and got the same behaviour each time; therefore the percentage refers to the share of the CPU that you are currently using.
This feature is documented here:
"f1-micro machine types offer bursting capabilities that allow instances to use additional physical CPU for short periods of time. Bursting happens automatically when your instance requires more physical CPU than originally allocated"
On the other hand, if you want to know more about those recommendations and how they work, you can check the Official Documentation.