OpenShift K8s cluster CPU and memory metrics issues - amazon-web-services

We have an AWS-hosted OpenShift cluster with 10 worker nodes and 3 control plane nodes, and we use New Relic as the monitoring platform. Our problem is as follows: overall cluster resource usage is low, that is
CPU usage - average 25%
Memory usage - 37%
But under load, the metrics show that some nodes are fully occupied, at maximum CPU and memory usage, while others are not, and overall cluster resource usage is still low.
We suspect we have over-provisioned compute resources; AWS Compute Optimizer has flagged the same.
How do we make cluster resource utilization optimal, e.g. overall utilization above 70%?
Why are some worker nodes utilized to the maximum while others are seriously underutilized?
Any links on K8s cluster optimization would be appreciated.
We are using node tolerations to assign some workloads to certain worker nodes.
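For context on the uneven spread: the Kubernetes scheduler places pods based on their resource requests, taints/tolerations, and affinity rules, not on live usage, so a few nodes can saturate while others idle when requests are skewed or many workloads are pinned to a subset of nodes. Below is a minimal diagnostic sketch, assuming the official kubernetes Python client and a working kubeconfig; names are illustrative and not from the original post.

```python
# Sketch: compare requested vs. allocatable CPU per node to see why
# scheduling is uneven. Assumes the `kubernetes` Python client and kubeconfig access.
from kubernetes import client, config

def cpu_to_cores(value: str) -> float:
    """Convert Kubernetes CPU quantities like '500m' or '2' to cores."""
    return float(value[:-1]) / 1000 if value.endswith("m") else float(value)

config.load_kube_config()
v1 = client.CoreV1Api()

# Allocatable CPU per node.
allocatable = {
    node.metadata.name: cpu_to_cores(node.status.allocatable["cpu"])
    for node in v1.list_node().items
}

# Sum the CPU requests of running pods on each node.
requested = {name: 0.0 for name in allocatable}
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
    node = pod.spec.node_name
    if node not in requested:
        continue
    for c in pod.spec.containers:
        reqs = (c.resources.requests or {}) if c.resources else {}
        requested[node] += cpu_to_cores(reqs.get("cpu", "0"))

for node, alloc in sorted(allocatable.items()):
    pct = 100 * requested[node] / alloc if alloc else 0
    print(f"{node}: {requested[node]:.2f} / {alloc:.2f} cores requested ({pct:.0f}%)")
```

If requested CPU is close to allocatable on the hot nodes and near zero elsewhere, right-sizing requests or relaxing the taints/node selectors is usually the first thing to look at.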

Related

Get memory and CPU usage from an AWS Fargate task

What APIs are available for tasks running under an ECS Fargate service to get their own memory and CPU usage?
My use case is load shedding/adjusting: the task is an executor that retrieves work items from a queue and processes them in parallel. If load is low it should take on more work; if load is high, it should shed work or take on less.
You can look at CloudWatch Container Insights. Container Insights reports CPU utilization relative to instance capacity. So if the container is using only 0.2 vCPU on an instance with 2 CPUs and nothing else is running on the instance, the CPU utilization will be reported as only 10%.
Average CPU utilization, by contrast, is based on the ratio of CPU utilization to the reservation. So if the container reserves 0.25 vCPU and is actually using 0.2 vCPU, the average CPU utilization (assuming a single task) is 80%. More details about the ECS metrics can be found here.
You can get those metrics in CloudWatch by enabling Container Insights. Note that there is an added cost for enabling it.
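To make the two ratios above concrete, here is the answer's arithmetic as a tiny Python sketch, with the values taken straight from the example:

```python
# The two CPU utilization definitions described above, using the example numbers.
cpu_used = 0.2        # vCPU actually consumed by the container
instance_cpus = 2.0   # vCPUs on the instance
cpu_reserved = 0.25   # vCPU reserved by the task definition

instance_relative = cpu_used / instance_cpus     # utilization vs. instance capacity
reservation_relative = cpu_used / cpu_reserved   # utilization vs. reservation

print(f"Relative to instance capacity: {instance_relative:.0%}")    # 10%
print(f"Relative to reservation:       {reservation_relative:.0%}") # 80%
```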

How to scale EKS Pods and Nodes?

In the AWS console, under CloudWatch > Container Insights > Performance Monitoring, I selected EKS Pods. There are two charts: Node CPU Utilization and Pod CPU Utilization.
I have several images deployed.
During my load testing, Node CPU Utilization shows Image A spiking up to 98%, but Pod CPU Utilization shows Image A below 0.1%.
Does anybody understand what these two metrics mean? Does it mean I should increase the number of nodes instead of the number of pods?
Example of the dashboard: (screenshot)

AWS Fargate task prices

I have set up a task definition with a maximum CPU allocation of 1024 units and 2048 MiB of memory, with Fargate as the launch type. When I looked at the costs it was way more expensive than I thought ($1.00 per day or $0.06 per hour [us-east-1]). What I did was reduce it to 256 units, and I am waiting to see if the cost goes down. But how does the task maximum allocation work? Is the task definition's maximum allocation responsible for Fargate provisioning a more powerful server at a higher cost, even if I don't use 100% of it?
The apps in containers running 24/7 are a NestJS application + Apache (do not ask why) + Redis, and I can see that CPU usage is low, but the price is too high for me. Is Fargate the wrong choice for this? Should I go for EC2 instances with ECS?
When you run a task, Fargate provisions a container with the resources you have requested. It's not a question of "use up to this maximum CPU and memory," but rather "use this much CPU and memory." You pay for that much CPU and memory for as long as the task runs, as per AWS Fargate pricing. At current rates, for the CPU and memory you listed (1024 CPU units, 2048 MiB), the cost comes to $0.04937/hour, or $1.18488/day, or $35.55/month.
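The figure above follows from the published per-vCPU-hour and per-GB-hour rates. Here is a quick sketch of the arithmetic, using the us-east-1 on-demand Linux/x86 rates quoted at the time of the answer (verify current pricing before relying on it):

```python
# Fargate on-demand pricing arithmetic (us-east-1 Linux/x86 rates at the time
# of the answer; check current AWS pricing before using these numbers).
VCPU_HOUR = 0.04048   # USD per vCPU-hour
GB_HOUR = 0.004445    # USD per GB-hour

vcpus = 1024 / 1024       # 1024 CPU units = 1 vCPU
memory_gb = 2048 / 1024   # 2048 MiB = 2 GB

hourly = vcpus * VCPU_HOUR + memory_gb * GB_HOUR
print(f"${hourly:.5f}/hour  ${hourly * 24:.5f}/day  ${hourly * 24 * 30:.2f}/month")
# -> $0.04937/hour  $1.18488/day  $35.55/month
```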
Whether Fargate is the right or wrong choice is subjective; it depends on what you're optimizing for. If you just want to hand off a container and let AWS manage everything about how it runs, it's hard to beat ECS Fargate. On the other hand, if you are optimizing for lowest cost, on-demand Fargate is probably not the best choice. You could use Fargate Spot ($10.66/month) if you can tolerate the constraints of Spot. Alternatively, you could use an EC2 instance (t3.small @ $14.98/month), but then you'll be responsible for managing everything yourself.
You didn't mention how you're running Redis, which will factor in here as well. If you're running Redis on ElastiCache, you'll incur that cost too, but you won't have to manage anything. If you end up using an EC2 instance, you could run Redis on the same instance, saving latency and expense, with the trade-off that you'll have to install and operate Redis yourself.
Ultimately, you're making tradeoffs between time saved and money spent on managed services.

AWS EC2 Auto Scaling Average CPU utilization vs. Grafana CPU utilization

We want to use AWS predictive scaling to forecast load and CPU, which should help us move away from manually launching instances based on load. We created a new scaling plan by choosing the EC2 Auto Scaling group and enabling predictive scaling (forecast only for now). But we noticed that the CPU graph in Grafana is different from the AWS Average CPU utilization. Grafana gets its alerts from Elasticsearch, which gets logs directly from services running on EC2. I am not sure why they don't show the same CPU utilization percentage, and I'm wondering why the AWS CPU utilization is lower than the CPU shown in Grafana. If so, can autoscaling scale the instances correctly?
AWS Auto Scaling group Average CPU utilization (screenshot)
Grafana average CPU graph (screenshot)
AWS computes CPU utilization in its own way, based on "EC2 compute units," so it's possible the value will differ from the same metric calculated another way (such as OS-level metrics shipped through Elasticsearch).
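One way to check is to pull the exact CPUUtilization series the Auto Scaling group acts on and lay it next to the Grafana panel. A rough sketch, assuming boto3 with credentials configured; the ASG name is a placeholder:

```python
# Fetch the CloudWatch CPUUtilization series for the Auto Scaling group,
# for side-by-side comparison with the Grafana/Elasticsearch numbers.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"]:.1f}%')
```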

What is the number of cores in the aws.data.highio.i3 Elastic Cloud instance given for the 14-day trial period?

I want to make some performance calculations, so I need to know the number of cores that this aws.data.highio.i3 instance deployed by Elastic Cloud on AWS has. I know it has 4 GB of RAM, so if anyone can help me with the number of cores that would be very helpful.
I am working with Elasticsearch deployed on Elastic Cloud, and my use case requires roughly 40 million writes per day, so suggestions for machines that fit this use case and are also I/O optimized would be appreciated.
The instance used by Elastic Cloud for aws.data.highio.i3 in the background is i3.8xlarge, see here. That means it has 32 virtual CPUs or 16 cores, see here.
But you don't own the instance in Elastic Cloud; from the reference hardware page:
Host machines are shared between deployments, but containerization and
guaranteed resource assignment for each deployment prevent a noisy
neighbor effect.
Each ES process runs on a large multi-tenant server with resources carved out using cgroups, and ES scales the thread pool sizing automatically. You can see the number of times the CPU was throttled by the cgroup if you go to Stack Monitoring -> Advanced and scroll down to the graphs Cgroup CPU Performance and Cgroup CFS Stats.
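If you prefer raw numbers over the Stack Monitoring graphs, the same cgroup counters are exposed by the node stats API. A small sketch assuming Python requests; the endpoint and credentials are placeholders for an Elastic Cloud deployment:

```python
# Read the cgroup CPU throttling counters from the node stats API,
# the data behind the Stack Monitoring graphs mentioned above.
import requests

ES_URL = "https://my-deployment.es.us-east-1.aws.found.io:9243"  # placeholder endpoint
AUTH = ("elastic", "changeme")                                    # placeholder credentials

stats = requests.get(f"{ES_URL}/_nodes/stats/os", auth=AUTH, timeout=30).json()

for node_id, node in stats["nodes"].items():
    cgroup = node.get("os", {}).get("cgroup", {})
    cpu_stat = cgroup.get("cpu", {}).get("stat", {})
    print(
        node["name"],
        "elapsed periods:", cpu_stat.get("number_of_elapsed_periods"),
        "times throttled:", cpu_stat.get("number_of_times_throttled"),
    )
```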
That being said, if you need full CPU availability all the time, you are better off with the AWS Elasticsearch service or hosting your own cluster.