I was running a load test but seeing only ~20% CPU on GCE. That was surprising, so I decided to SSH into my machine and run top, which showed me 99.7% utilization.
I found a question that was very similar: Google Cloud Compute engine CPU usage shows 100% but dashboard only shows 10% usage
However, I am certain I only have one core (1 vCPU, 3.75 GB memory).
Here's top running, showing 99.7% utilization:
What could be the reason for this?
Even in the single-core case, when the workload is not split between several cores, the shape and the Y values of the CPU Utilization chart depend on the aggregation and alignment settings you use: for instance max or mean, a 1m or 1h alignment period, etc. In the case of a short peak load, a wide alignment window acts as a big denominator for the mean aligner, so you'll get lower values on the chart.
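If you want to see the effect of those settings outside of the console charts, you can pull the same metric from the Monitoring API with explicit aggregation parameters. A minimal sketch (the project ID and time range are placeholders; switching the aligner to ALIGN_MAX or widening the alignmentPeriod changes the peak heights you get back):
$ TOKEN=$(gcloud auth print-access-token)
$ curl -s -G -H "Authorization: Bearer ${TOKEN}" \
    "https://monitoring.googleapis.com/v3/projects/my-project/timeSeries" \
    --data-urlencode 'filter=metric.type="compute.googleapis.com/instance/cpu/utilization"' \
    --data-urlencode 'interval.startTime=2020-06-01T00:00:00Z' \
    --data-urlencode 'interval.endTime=2020-06-01T01:00:00Z' \
    --data-urlencode 'aggregation.alignmentPeriod=60s' \
    --data-urlencode 'aggregation.perSeriesAligner=ALIGN_MEAN'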
For more details please see:
Google Monitoring > Documentation > Alignment
Google Cloud Blog > Stackdriver tips and tricks: Understanding metrics and building charts
We're running Cloud Composer on a 5 node, n1-standard-2 cluster running composer-1.11.3-airflow-1.10.9. Private IP is enabled. Python 3 is selected. We currently have around 30 DAGs, some containing over 100 tasks. Most DAGs run once a day.
The node running the airflow scheduler workload is consistently running at around 150% CPU utilisation regardless of the number of running tasks. The only way to lower the CPU usage is to remove DAGs until only 5 or 6 remain (obviously not an option). What we've tried:
We have followed this Medium article detailing how to run the scheduler service on a dedicated node; however, we cannot find a configuration that reduces the CPU usage. We've tried a node as powerful as an e2-highcpu-32 running on a 100 GB SSD. Usage remained at 150%.
We've tried updating the airflow.cfg variables to reduce the frequency with which the DAGs directory is parsed, via settings such as store_serialized_dags and max_threads (a sketch of how we applied these is below). Again, this did not have any impact on the CPU usage.
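For reference, a rough sketch of how we applied those overrides (the environment name, location and values are placeholders, and some options may be blocked by Composer):
$ gcloud composer environments update my-composer-env \
    --location europe-west1 \
    --update-airflow-configs=core-store_serialized_dags=True,scheduler-max_threads=2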
For reference the other nodes all run at 30-70% CPU for the majority of the time, spiking to over 100% for short periods when big DAGs are running. No node has any issue with memory usage, with between 2 GB and 4 GB used.
We plan on adding more DAGs in the future and are concerned the scheduler may become a bottleneck with this current setup. Are there any other configuration options available to reduce the CPU usage, to allow for a future increase in the number of DAGs?
Edit in response to Ines' answer:
I'm seeing the CPU usage as a percentage in the Monitoring tab, the node running the scheduler service is coloured orange:
Additionally, when I look at the pod running the airflow scheduler, this is the CPU usage, pretty much always 100%:
Please have a look at the official documentation, which describes the CPU usage per node metric. Can you elaborate on where you see the percentage values, since the documentation describes a core time usage ratio:
A chart showing the usage of CPU cores aggregated over all running Pods in the node, measured as a core time usage ratio. This does not include CPU usage of the App Engine instance used for the Airflow UI or Cloud SQL instance. High CPU usage is often the root cause of Worker Pod evictions. If you see very high usage, consider scaling out your Composer environment or changing the schedule of your DAG runs.
In the meantime, there is a known workaround that is worth trying. You should follow these steps to limit CPU usage on the syncing pod:
Go to the environment configuration page and click view cluster workloads
Click airflow-scheduler, then edit
Find name: gcs-syncd and add:
resources:
  limits:
    cpu: 300m    # starting value to try; adjust as needed
  requests:
    cpu: 10m
then click save (at the bottom).
Repeat the procedure for airflow-worker.
We also have to edit the airflow-scheduler container section of the airflow-scheduler workload. Click edit the YAML file and, for the airflow-scheduler section, add:
resources:
  limits:
    cpu: 750m
  requests:
    cpu: 300m
It would be great if you could try the aforementioned steps and see if it improves the performance.
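If you prefer the command line over the console UI, the same limits can, as far as I know, be applied with kubectl after fetching the cluster credentials. A rough sketch (cluster name, zone and namespace are placeholders for your environment's values):
$ gcloud container clusters get-credentials <cluster-name> --zone <zone>
# limit the log-sync sidecar on both workloads
$ kubectl -n <composer-namespace> set resources deployment airflow-scheduler --containers=gcs-syncd --limits=cpu=300m --requests=cpu=10m
$ kubectl -n <composer-namespace> set resources deployment airflow-worker --containers=gcs-syncd --limits=cpu=300m --requests=cpu=10m
# and cap the scheduler container itself
$ kubectl -n <composer-namespace> set resources deployment airflow-scheduler --containers=airflow-scheduler --limits=cpu=750m --requests=cpu=300m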
Sometimes the bucket's /logs folder contains a lot of files, which causes gcs-syncd to use a lot of CPU while doing an internal synchronization of the logs. You can try to remove some of the oldest logs from the bucket gs://<composer-env-name>/logs. As an example, if you would like to remove all logs from May, use the following command:
gsutil -m rm -r gs://europe-west1-td2-composter-a438b8eb-bucket/logs/*/*/2020-05*
Ideally, the GCE instances shouldn't be running at over 70% CPU all the time, or the Composer environment may become unstable during periods of high resource usage.
I have an RDS DB with a low number of connections (usually around 30), but it shows high CPU load all the time (about 25%). The DB instance class is r3.2xlarge.
As shown in the Enhanced Monitoring screenshot below, there are some processes with high CPU and memory utilization. What do the numbers that I have marked in rectangles mean? I thought they were the thread IDs of queries, but in SHOW PROCESSLIST I can't see those numbers!
So briefly:
What do those numbers (in rectangles) mean?
Is there any way to know which query is taking the most CPU and memory (in real time, not via the slow log)?
What do those numbers (in rectangles) mean?
They are just process/thread IDs; by themselves they don't mean anything.
Is there any way to know which query is taking the most CPU and memory (in real time, not via the slow log)?
Since you're using the MySQL flavor of RDS, connect to your instance with any MySQL client and use the SHOW PROCESSLIST; or SHOW FULL PROCESSLIST; commands to see the list of running queries.
https://dev.mysql.com/doc/refman/5.7/en/show-processlist.html
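For example, a rough sketch using the standard mysql client (the endpoint and user are placeholders), sorting the process list by how long each statement has been running:
$ mysql -h <rds-endpoint> -u <user> -p \
    -e "SELECT id, user, db, time, state, LEFT(info, 80) AS query
        FROM information_schema.processlist
        WHERE command <> 'Sleep'
        ORDER BY time DESC LIMIT 10;"
Note that the process list shows running statements and their elapsed time, not CPU per query.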
While studying basic ML algorithms on the MNIST database, I noticed that my netbook is too weak for this purpose. I started a free trial on Google Cloud and successfully set up a VM instance with 1 vCPU. However, it only boosts performance about 3x, and I need much more computing power for some specific algorithms.
I want to do the following:
use 1 vCPU for setting up an algorithm
switch to plenty of vCPU to perform a single algorithm
go back to 1 vCPU
Unfortunately, I am not sure how Google will charge me for such a maneuver. I am afraid that it will drain the $300 of credit I have on my account. It is my very first day playing with VMs and using clouds for computing purposes, so I really need good advice from someone with experience.
Question: How do I manage the number of vCPUs on Google Cloud Compute Engine to run single expensive algorithms?
COSTS
The quick answer is that you pay for what you use: if you use 16 vCPUs for 1 hour, you pay for 16 vCPUs for 1 hour.
In order to have a rough idea of the cost, I would advise you to take a look at the Pricing Calculator and create your own estimate with the resources you are going to use.
Having a machine with 1 vCPU and 3.75 GB of RAM running for one day costs around $0.80 (if it is not a preemptible instance and without any committed use discounts); a machine with 32 vCPUs and 120 GB of RAM, on the other hand, would cost around $25/day.
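As a back-of-the-envelope check of those figures (assuming an illustrative effective rate of about $0.033 per vCPU-hour for the standard 3.75 GB-per-vCPU shape; real rates depend on region and discounts):
$ echo '0.033 * 24' | bc        # 1 vCPU for a day   -> ~0.79 USD
$ echo '0.033 * 24 * 32' | bc   # 32 vCPUs for a day -> ~25.3 USD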
Remember the rule: when it is running, you are paying for it. You can change the machine type as many times as you want according to your needs, and while the instance is stopped you pay just for the persistent disk. Therefore it makes sense to switch off the machine whenever you are not using it.
Consider that you will also have to pay for networking and storage, but in your use case those costs are marginal; for example, 100 GB of storage for one day costs about $0.13.
Notice that since September 2017 Google has extended per-second billing, with a one-minute minimum, to Compute Engine. I believe this is how most cloud providers work.
ADDING VCPU
When the machine is off, you can modify the number of vCPUs and the amount of memory from the edit menu; here you can find a step-by-step official guide you can follow through the process. You can also change the machine type through the command line, for example setting a custom machine type with 4 vCPUs and 4 GB of memory (custom machine types require at least 0.9 GB of memory per vCPU, so 1 GB would be too little for 4 vCPUs):
$ gcloud compute instances set-machine-type INSTANCE-NAME --machine-type custom-4-4096
As soon as you are done with your computation, stop the instance and reduce the size of the machine (or leave it off).
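Putting it together, a rough sketch of that cycle (the instance name, zone and machine types are placeholders):
$ gcloud compute instances stop my-instance --zone europe-west1-b
$ gcloud compute instances set-machine-type my-instance --zone europe-west1-b --machine-type n1-highcpu-16
$ gcloud compute instances start my-instance --zone europe-west1-b
# ... run the expensive algorithm ...
$ gcloud compute instances stop my-instance --zone europe-west1-b
$ gcloud compute instances set-machine-type my-instance --zone europe-west1-b --machine-type n1-standard-1
$ gcloud compute instances start my-instance --zone europe-west1-b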
I ran a Cloud ML job with BASIC_GPU.
I would like to check the worker's CPU, GPU, and memory usage. Is that possible?
The reason is that I requested one GPU, but I want to see the change in GPU usage when I run two jobs (scaleTier: BASIC_GPU).
Thanks.
The CPU and memory utilization charts are available on the ML Engine job page on your Google Cloud Console, and the GPU utilization metrics are under development.
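In the meantime, a small sketch for locating the job and following its logs from the CLI before opening its page in the console (the job name is a placeholder):
$ gcloud ml-engine jobs describe my_training_job
$ gcloud ml-engine jobs stream-logs my_training_job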
The instance details of my f1-micro instance shows a graph of CPU utilisation fluctuating between 8% and 15%, but what is the scale? The f1-micro has 0.2 CPU so is my max 20%? Or does the 100% in the graph mark my 20% of the CPU? Occasionally the graph has gone above 20% but is it bursting then? Or does the bursting start at 100% in the graph?
The recommendation to increase performance is always displayed. Is it just sales tactics? The VM is a watchdog so it is not doing much.
I built a small test in order to answer your question; if you're interested, you can do the same to double-check.
TEST
I created two instances, one f1-micro and one n1-standard-1, and then I forced a CPU burst using stress, but you can use any tool of your choice.
$ sudo apt-get install stress
$ stress --cpu 1 & top
In this way we can compare the output of top on the two instances with what is shown in the dashboard; since the operating system is not aware that it is sharing the CPU, we expect to see 100% from inside the machine.
RESULTS
While the output of top on both instances showed, as expected, that 99.9% of the CPU was currently in use, the output of the dashboard is more interesting.
n1-standard-1 showed a stable value around 100% the whole time.
The f1-micro showed an initial spike to 250% (because it was using a bigger share of the physical CPU than originally allocated, i.e. it was running in bursting mode) and then settled back to 100%.
I repeated the test several times and each time I got the same behaviour; therefore the percentage refers to the share of CPU allocated to your machine type, and values above 100% indicate bursting.
This feature is documented here:
"f1-micro machine types offer bursting capabilities that allow instances to use additional physical CPU for short periods of time. Bursting happens automatically when your instance requires more physical CPU than originally allocated"
On the other hand, if you want to know more about those recommendations and how they work, you can check the official documentation.