How to scale EKS Pods and Nodes?

In the AWS console, under CloudWatch > Container Insights > Performance Monitoring, I selected EKS Pods. There are two charts: Node CPU Utilization and Pod CPU Utilization.
I have several images deployed.
During my load testing, Node CPU Utilization shows Image A spiking up to 98%. However, Pod CPU Utilization shows Image A below 0.1%.
Does anybody understand what these two metrics mean? Do they mean I should increase the number of nodes instead of the number of pods?
Example of the dashboard: [screenshot]
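As a first sanity check (my own suggestion, not from the post), it can help to compare what the cluster itself reports against the Container Insights charts. A minimal sketch, assuming metrics-server is installed; the namespace and pod names are placeholders:

    # Point-in-time CPU/memory per node
    kubectl top nodes

    # Point-in-time CPU/memory per pod and per container
    kubectl top pods --containers -n my-namespace

    # Requests/limits the pods declare; a utilization percentage depends on
    # what usage is measured against (node capacity vs. pod requests/limits)
    kubectl describe pod my-pod -n my-namespace

One common reason the two charts diverge is that a pod using little CPU relative to the whole node can still sit on a node whose total usage is near capacity because of other pods or system processes.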

Related

In GCP, I am able to view the GKE Monitoring dashboard. How do I create alerts for CPU and memory utilization for a Kubernetes container?

I have enabled default GCP Monitoring in my Google Kubernetes Engine cluster, so a GKE dashboard containing system metrics was created. Now I need to set up alerts for the Kubernetes containers' CPU and memory utilization based on the GKE dashboard. I tried to create my own alert, but it didn't match the metrics defined in the GKE dashboard.
Here are Guide1 and Guide2 for monitoring Kubernetes Engine; they cover alerting and how to monitor your system. If you are already familiar with those, here is a list of the metrics for the new Kubernetes Engine compared to the previous metrics. Additionally, the complete list of metrics, which is always useful, can be found here.
In the Monitoring dashboard, CPU and memory utilization are displayed over a time range:
CPU utilization: the CPU utilization of containers that can be attributed to a resource within the selected time span. The metric used is linked here; check "For CPU Utilization".
Memory utilization: the memory utilization of containers that can be attributed to a resource within the selected time span. The metric used is linked here; check "For Memory Utilization".
The command "kubectl top node" displays resource (CPU/memory/storage) usage at that moment, not over a time span.
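If the goal is an alert that matches the dashboard metrics, here is a hedged sketch of creating one from the CLI; the metric type, threshold, and policy file name are my assumptions, not something taken from the GKE dashboard itself:

    gcloud alpha monitoring policies create --policy-from-file=policy.json

where policy.json looks something like:

    {
      "displayName": "Container CPU limit utilization",
      "combiner": "OR",
      "conditions": [{
        "displayName": "CPU above 80% of limit for 5 minutes",
        "conditionThreshold": {
          "filter": "metric.type=\"kubernetes.io/container/cpu/limit_utilization\" resource.type=\"k8s_container\"",
          "comparison": "COMPARISON_GT",
          "thresholdValue": 0.8,
          "duration": "300s",
          "aggregations": [{
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }]
        }
      }]
    }

The memory-side policy would be analogous, using kubernetes.io/container/memory/limit_utilization.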

Get memory and CPU usage from an AWS Fargate task

What APIs are available for tasks running under an ECS Fargate service to get their own memory and CPU usage?
My use case is load shedding/adjusting: the task is an executor that retrieves work items from a queue and processes them in parallel. If load is low, it should take on more work items; if load is high, it should shed work or take on fewer items.
You can look at CloudWatch Container Insights. Container Insights reports CPU utilization relative to instance capacity. So if the container is using only 0.2 vCPU on an instance with 2 CPUs and nothing else is running on the instance, the CPU utilization will be reported as only 10%.
Average CPU utilization, by contrast, is the ratio of CPU usage to the reservation. So if the container reserves 0.25 vCPU and is actually using 0.2 vCPU, the average CPU utilization (assuming a single task) is 80%. More details about the ECS metrics can be found here.
You can get those metrics in CloudWatch by enabling Container Insights. Note that there is an added cost for enabling that.
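As an illustration, once the metrics exist you can pull them from the CLI; a minimal sketch using the standard ECS service-level metric, with the cluster/service names and time window as placeholders:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ECS \
      --metric-name CPUUtilization \
      --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
      --statistics Average \
      --period 60 \
      --start-time 2024-01-01T00:00:00Z \
      --end-time 2024-01-01T01:00:00Z

An executor could poll this periodically and throttle how many work items it takes from the queue when the average climbs.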

OpenShift K8s cluster CPU and memory metrics issues

We have an AWS-hosted OpenShift cluster running; at the moment we have 10 worker nodes and 3 control plane nodes. We are using New Relic as the monitoring platform. Our problem is as follows: overall cluster resource usage is low, that is,
CPU usage - average 25%
Memory usage - 37%.
But under load, the metrics show that some nodes are fully occupied, at maximum CPU and memory usage, while others are not, and overall cluster resource usage is still low.
We have a feeling that we have over-provisioned compute resources; we have actually noted the same using AWS Compute Optimizer.
How do we make cluster resource utilization optimal, e.g. overall utilization above 70%?
Why are some worker nodes utilized to the maximum while others are seriously underutilized?
Any links on k8s cluster optimization would be appreciated.
Use node taints and tolerations to assign some workloads to certain worker nodes.
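A minimal sketch of that approach; the node name and taint key/value are placeholders, and you would still pair the toleration with a nodeSelector or node affinity to actively pin the workload to those nodes:

    # Taint the dedicated nodes so that ordinary pods are not scheduled there
    kubectl taint nodes worker-1 workload=batch:NoSchedule

The workloads that should run there then tolerate the taint in their pod spec:

    tolerations:
    - key: "workload"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"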

Is there a way to get the metric that initiated pod scaling in Kubernetes?

I have a Kubernetes cluster deployed in AWS. The pods are set to autoscale based on average memory and CPU usage and on the request latency of the services.
Scaling works as expected. I am wondering whether there is a Kubernetes event that fires on scaling and indicates which metric triggered it, i.e. whether the scaling happened due to memory usage, CPU usage, etc.
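With a HorizontalPodAutoscaler, the rescale events usually carry the triggering metric in their message; a quick way to check, with the HPA name as a placeholder:

    # The Events section lists SuccessfulRescale entries whose messages name
    # the metric, e.g. "New size: 4; reason: cpu resource utilization
    # (percentage of request) above target"
    kubectl describe hpa my-hpa

    # Or pull the raw events for all HPAs
    kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler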

AWS EC2 Auto Scaling Average CPU utilization vs. Grafana CPU utilization

We want to use AWS predictive scaling to forecast load and CPU, which will certainly help us move away from manually launching instances based on load. We created a new scaling plan by choosing the EC2 Auto Scaling group and enabling predictive scaling (forecast only for now). But we noticed that the CPU graph in Grafana differs from the AWS Average CPU utilization. Grafana gets its data from Elasticsearch, which receives logs directly from services running on EC2. I am not sure why they don't show the same CPU utilization percentage, and I wonder why the AWS CPU utilization is lower than the CPU shown in Grafana. Can autoscaling still scale the instances correctly?
AWS Auto Scaling group Average CPU utilization [screenshot]
Grafana average CPU graph [screenshot]
AWS has its own method of computing CPU utilization, based on "EC2 compute units", so it is possible the value will differ from one computed another way.
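To compare like with like, it may help to pull the exact series the Auto Scaling group acts on and overlay it on the Grafana panel; a minimal sketch, with the group name and time window as placeholders:

    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name CPUUtilization \
      --dimensions Name=AutoScalingGroupName,Value=my-asg \
      --statistics Average \
      --period 300 \
      --start-time 2024-01-01T00:00:00Z \
      --end-time 2024-01-01T06:00:00Z

Scaling decisions are driven by this CloudWatch series, so thresholds should be tuned against it rather than against the Grafana numbers.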