AWS ElastiCache for Redis Engine CPU Utilization metrics, how to interpret?

We are using AWS ElastiCache for Redis for our application, and we need some help in understanding the metrics. During high load we saw CPU Utilization at about 30%, but Engine CPU Utilization was showing almost 80%. Could someone please elaborate on the difference between these two metrics, and what limits we should stay within for good performance?
Thanks in advance.

Now I have a better understanding of both metrics. CPU Utilization is the total CPU utilization of the whole host, whereas Engine CPU Utilization is specific to the Redis engine thread that handles all the Redis queries. Since Redis processes commands on a single thread, on a node with 4 cores only one core is used for query processing, so in that case the maximum CPU Utilization attributable to Redis is about 25%.

CPU Utilization shows you the CPU resources being consumed across the entire host, whereas Engine CPU Utilization shows you the CPU consumed by the Redis engine thread, i.e. on a single core.
Because Redis is single-threaded, on a node with two cores a 90% threshold for the engine corresponds to a threshold of 90/2 = 45% on the host-level CPU Utilization metric.
For reference, you can check out: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/CacheMetrics.WhichShouldIMonitor.html
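As a concrete illustration, here is a minimal sketch of pulling both metrics from CloudWatch so you can compare them side by side. It assumes boto3; the region, cache cluster ID and node ID are placeholders.

    import boto3
    from datetime import datetime, timedelta, timezone

    # Sketch only: region, CacheClusterId and CacheNodeId are placeholders.
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    end = datetime.now(timezone.utc)

    resp = cw.get_metric_data(
        MetricDataQueries=[
            {
                "Id": name.lower(),
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/ElastiCache",
                        "MetricName": name,
                        "Dimensions": [
                            {"Name": "CacheClusterId", "Value": "my-redis-001"},
                            {"Name": "CacheNodeId", "Value": "0001"},
                        ],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
            }
            for name in ("CPUUtilization", "EngineCPUUtilization")
        ],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
    )

    for result in resp["MetricDataResults"]:
        print(result["Id"], result["Values"])

On a node with more than a couple of cores you would expect the EngineCPUUtilization values to be noticeably higher than CPUUtilization, exactly as described above.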

Related

Get mem and CPU usage from AWS Fargate task

What APIs are available for tasks running under an ECS Fargate service to get their own memory and CPU usage?
My use case is load shedding / adjusting: the task is an executor that retrieves work items from a queue and processes them in parallel. If load is low it should take on more work; if load is high it should shed work or take on less.
You can look at CloudWatch Container Insights. Container Insights reports CPU utilization relative to instance capacity. So if the container is using only 0.2 vCPU on an instance with 2 vCPUs and nothing else is running on the instance, the CPU utilization is reported as only 10%.
The service's average CPU utilization, by contrast, is based on the ratio of CPU used to CPU reserved. So if the container reserves 0.25 vCPU and is actually using 0.2 vCPU, the average CPU utilization (assuming a single task) is 80%. More details about the ECS metrics can be found here.
You can get those metrics in CloudWatch by enabling Container Insights. Note that there is an added cost for enabling that.
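As a rough sketch of what reading those Container Insights metrics could look like from the executor itself (assuming boto3, Container Insights already enabled, and placeholder cluster/service names), you can compare CpuUtilized against CpuReserved and use the ratio to decide whether to shed work:

    import boto3
    from datetime import datetime, timedelta, timezone

    # Sketch only: cluster and service names are placeholders.
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    end = datetime.now(timezone.utc)

    def latest_average(metric_name):
        resp = cw.get_metric_statistics(
            Namespace="ECS/ContainerInsights",
            MetricName=metric_name,
            Dimensions=[
                {"Name": "ClusterName", "Value": "my-cluster"},
                {"Name": "ServiceName", "Value": "my-executor-service"},
            ],
            StartTime=end - timedelta(minutes=15),
            EndTime=end,
            Period=300,
            Statistics=["Average"],
        )
        points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
        return points[-1]["Average"] if points else None

    used = latest_average("CpuUtilized")
    reserved = latest_average("CpuReserved")
    if used is not None and reserved:
        print(f"CPU used/reserved: {used / reserved:.0%}")  # e.g. shed work above ~80%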

Cloud Memorystore Redis high CPU utilisation

We are using a Cloud Memorystore Redis instance to add a caching layer to our mission-critical, Internet-facing application. The total number of calls (including get, set and key expiry operations) to the Memorystore instance is around 10-15K per second. CPU utilisation has been consistently around 75-80% and we expect it to go even higher.
Currently, we are using M4 capacity tier under Standard service tier.
https://cloud.google.com/memorystore/docs/redis/pricing
Need some clarity around the following pointers.
How many CPU cores does the M4 capacity tier correspond to?
Is it really alarming to have more than 100% CPU utilisation? Do we expect any noticeable performance issues?
What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)? Will switching to the M5 capacity tier address the high CPU consumption and the corresponding issues?
Our application is really CPU-intensive and we don't see any way to optimize it further. Looking forward to some helpful references.
Addressing your questions.
1. How many CPU cores does the M4 capacity tier correspond to?
Cloud Memorystore for Redis is a Google-managed service, which means Google keeps the inner details (resources) of the virtual machine running the Redis service to itself. Still, it is expected that the higher the capacity tier, the more resources (CPU) the virtual machine will have. In your case in particular, adding CPUs will not solve issues around CPU usage, because the Redis service itself is single-threaded.
As the Redis documentation itself notes:
To maximize CPU usage you can start multiple instances of Redis.
If you want to use multiple CPUs you can start thinking of some way to shard earlier.
2. Is it really alarming to have more than 100% CPU utilisation?
Yes, it is alarming to have high CPU utilization because it can result in connection errors or high latency.
CPU utilization is important, but so is whether the Redis instance can sustain your throughput at an acceptable latency. You can check Redis latency with the command redis-cli --latency while the CPU % is high.
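If your client is in Python, a quick in-process way to sample the same thing is to time a loop of PINGs. A minimal sketch, assuming the redis-py client and a placeholder Memorystore IP:

    import time
    import redis

    # Sketch only: host is a placeholder for your Memorystore instance IP.
    r = redis.Redis(host="10.0.0.5", port=6379)

    samples = []
    for _ in range(100):
        start = time.perf_counter()
        r.ping()
        samples.append((time.perf_counter() - start) * 1000)  # milliseconds

    samples.sort()
    print(f"min={samples[0]:.2f} ms  p50={samples[50]:.2f} ms  max={samples[-1]:.2f} ms")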
3. Do we expect any noticeable performance issues?
This is really hard to say or predict, because it depends on several factors (client services, the commands run within a time frame, the workload). Some of the most common causes of high latency and performance issues are:
Client VMs or services are overloaded and not consuming the messages from Redis: when a client opens a TCP connection to Redis, the Redis server keeps a buffer of messages to send on that connection. If a client service has its CPU maxed out, leaving the kernel no time to receive messages from Redis, those buffers fill up on the Redis server.
The commands executed are consuming a lot of CPU: the following commands are known to be potentially very expensive to process (a cheaper alternative to KEYS is sketched just after this list):
EVAL/EVALSHA
KEYS
LRANGE
ZRANGE/ZREVRANGE
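For instance, KEYS blocks the single Redis thread while it walks the entire keyspace. If you need pattern matching, an incremental SCAN spreads that cost over many small, cheap calls. A minimal sketch with the redis-py client (the host and key pattern are placeholders):

    import redis

    # Sketch only: host and key pattern are placeholders.
    r = redis.Redis(host="10.0.0.5", port=6379)

    # Instead of r.keys("session:*"), which blocks the server on a big keyspace,
    # scan_iter issues repeated SCAN calls and yields keys in small batches.
    matched = 0
    for key in r.scan_iter(match="session:*", count=500):
        matched += 1
    print(f"matched {matched} keys without a single long-running command")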
4. What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)?
This question revolves mainly around the scaling design of your implementation. Since Redis is single-threaded, a better approach to reducing CPU % is to shard your data across multiple Redis instances and put a proxy in front of them to distribute the load. Please take a look at the graph under the Twemproxy section of this link.
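If running a proxy is not an option, the same idea can be approximated on the client side by hashing each key onto one of several Memorystore instances. A minimal sketch with redis-py (the instance IPs are placeholders, and plain modulo hashing is used instead of Twemproxy's consistent hashing):

    import hashlib
    import redis

    # Sketch only: these are placeholder Memorystore instance IPs.
    SHARDS = [
        redis.Redis(host="10.0.0.5", port=6379),
        redis.Redis(host="10.0.0.6", port=6379),
        redis.Redis(host="10.0.0.7", port=6379),
    ]

    def shard_for(key: str) -> redis.Redis:
        # A stable hash of the key decides which instance owns it,
        # so each instance's single Redis thread only sees roughly 1/N of the load.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return SHARDS[digest % len(SHARDS)]

    shard_for("user:42:profile").set("user:42:profile", "{...}", ex=3600)
    value = shard_for("user:42:profile").get("user:42:profile")

Note that plain modulo hashing remaps most keys whenever you add or remove a shard, which is one reason proxies such as Twemproxy use consistent hashing instead.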
5. Will switching to M5 capacity tier address the high CPU consumption and the corresponding issues?
Switching to a higher capacity tier should help with latency temporarily, but this is vertical scaling, which is limited to the tiers that Cloud Memorystore offers.
Redis Enterprise solves all the issues you are facing. Redis Enterprise can be configured in a clustered configuration and utilize all the resources of the machine as well as scale out over multiple machines.
The Redis Enterprise Software is responsible for watching over the CPU utilization and other resource management tasks so you do not need to.
It is offered on GCP and on the GCP Marketplace as well.
https://redis.com/redis-enterprise-cloud/pricing/

Discrepancy between CPU utilization of ECS service and EC2 instance

I'm seeing some discrepancy between ECS service and EC2 in terms of CPU utilization metrics.
We have an EC2 instance of type t2.small with two different ECS containers running on it. I have allocated 512 CPU units to one container and 128 CPU units to the other. The problem is that the CPU utilization of the ECS service goes above 90%, as shown in the first screenshot,
while the CPU utilization of the underlying EC2 instance doesn't even reach 40%, as shown in the second screenshot.
What could be the reason for this discrepancy? What could have been gone wrong?
Well, if you assign CPU units to your containers, CloudWatch reports CPU usage in relation to that assigned capacity. Your container with 512 CPU units has access to 0.5 vCPU and the one with 128 units has access to 0.125 vCPU, which is not a lot, so high utilization of those allocations is easy to reach.
Since the CPU utilization of the t2.small, which has about 1 vCPU (ignoring the credit/bursting system for now), is hovering around 20%, my guess is that the first graph is from the smaller container.
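A rough back-of-the-envelope calculation illustrates that guess (the actual vCPU consumption is assumed here, not taken from the screenshots):

    # Sketch only: used_vcpu is an assumed figure for illustration.
    reserved_vcpu = 128 / 1024   # the smaller container's reservation: 0.125 vCPU
    instance_vcpu = 1            # a t2.small has 1 vCPU
    used_vcpu = 0.115            # suppose the container is actually burning ~0.115 vCPU

    service_cpu = used_vcpu / reserved_vcpu * 100   # what the ECS service graph reports
    host_share = used_vcpu / instance_vcpu * 100    # that container's share of the host

    print(f"ECS service CPU utilization: {service_cpu:.0f}%")  # ~92%
    print(f"Share of the EC2 instance:   {host_share:.0f}%")   # ~12%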

AWS EC2 instance cost far above estimate, why?

I have a script that I run 24/7 that uses 90-100% CPU constantly. I am running this script in multiple virtual machines from Google Cloud Platform. I run one script per VM.
I am trying to reduce cost by using AWS EC2. I looked at the price per hour of t3.micro (2 vCPU) instances and it says the cost is around $0.01/h, which is cheaper than GCP's equivalent instance with 2 vCPUs.
Now, I tried to run the script in one t3.micro instance, just to get a real estimate of how much each t3 instance running my script will cost. I was expecting the monthly cost per instance to be ~$7.20 (720 h/month * $0.01/h). The thing is that I have been running the script for 2-3 days, and the cost reports already show a cost of more than $4.
I am trying to understand why the cost is so far from my estimate (and from the AWS monthly calculator's estimate). All this extra cost seems to come from "EC2 Other" and "CPU Credit", but I don't understand these costs.
I suspect they come from my 24/7 full CPU usage, but could someone explain what these costs are and whether there is a way to reduce them?
The EC2 instance allows a certain baseline CPU usage: 10% for a t3.micro. When the instance is operating below that threshold it accumulates CPU credits, which are then applied to usage above the threshold. A t3.micro can accumulate up to 12 credits an hour (with one credit being equal to 100% CPU utilisation of one vCPU for 1 minute). If you regularly use more CPU credits than the instance earns, the excess usage is charged at a higher rate, which I understand to be about 5c per vCPU-hour.
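As a rough estimate using those numbers (and assuming the script keeps about 1.5 of the 2 vCPUs busy around the clock; the 5c per vCPU-hour surplus rate is the figure mentioned above, so check current t3 pricing for your region), the surplus-credit charge alone gets close to the bill you are seeing:

    # Sketch only: the sustained vCPU usage and the surplus rate are assumptions.
    earned_per_hour = 12                 # CPU credits a t3.micro accrues per hour
    busy_vcpus = 1.5                     # assumed sustained usage; 1 credit = 1 vCPU-minute at 100%
    spent_per_hour = busy_vcpus * 60     # credits burned per hour
    surplus_per_hour = spent_per_hour - earned_per_hour

    surplus_cost_per_hour = (surplus_per_hour / 60) * 0.05   # surplus credits billed per vCPU-hour
    print(f"~${surplus_cost_per_hour:.3f}/h in surplus credits, "
          f"i.e. ~${surplus_cost_per_hour * 72:.2f} over 3 days on top of the ~$0.01/h instance price")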
It may be that t3.micro is not your best choice for that type of workload and you may need to select a different instance type or a bigger instance.
The purple in your chart is CPU credits, not instance usage.
Looks like you enabled “T2/T3 Unlimited” when launching your instance and your script is causing it to burst beyond the provided baseline capacity. When you burst beyond the baseline capacity, you’re charged for that usage at the prevailing rate. You can read more about T2/T3 Unlimited and burstable performance here.
To bring these costs down, disable T2/T3 Unlimited by following the instructions here.
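If you would rather do it from code than the console, a minimal boto3 sketch (the region and instance ID are placeholders) switches the credit specification back to standard:

    import boto3

    # Sketch only: region and instance ID are placeholders.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.modify_instance_credit_specification(
        InstanceCreditSpecifications=[
            # "standard" caps the instance at its earned credits; "unlimited" allows billed bursting.
            {"InstanceId": "i-0123456789abcdef0", "CpuCredits": "standard"}
        ]
    )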

Amazon EC2 Upgrade

We are considering upgrading from a t2.micro AWS server instance to an m3.medium instance, based on the recommendation here and some research offline. We feel the need to upgrade primarily because of speed issues and to ensure Google's bots crawl our fast-growing site quickly enough. We have upwards of 8000 products (on Magento) and that number will grow.
While trying to understand what exactly the constraint of the current t2.micro instance could be, we ran through a lot of logs but couldn't find anything specific that would indicate a bottleneck in the current usage.
Could anyone help point out
1. What are the clues that can be found in logs which could show potential bottleneck issues (if any) with the current t2.micro instance?
2. How could we find out whether Googlebot had issues while crawling and stopped crawling due to server performance issues?
There are two things to note about t2.micro instances:
They have CPU limitations based upon a CPU credits system
They have limited network bandwidth
CPU credits
The T2 family is very powerful (see comparison between t2.medium and m3.medium), but there is a limit on the amount of CPU that can be used.
From the T2 documentation:
Each T2 instance starts with a healthy initial CPU credit balance and then continuously (at a millisecond-level resolution) receives a set rate of CPU credits per hour, depending on instance size. The accounting process for whether credits are accumulated or spent also happens at a millisecond-level resolution, so you don't have to worry about overspending CPU credits; a short burst of CPU takes a small fraction of a CPU credit.
Therefore, you should look at the CloudWatch CPUCreditBalance metric for the instance to determine whether it has consumed all available credits. If so, then the CPU will be limited to 10% of the time and you either need a larger T2 instance, or you should move away from the T2 family.
In general, T2 instances are great for bursty workloads, where the CPU only spikes at certain times. They are not good for sustained workloads.
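One practical way to act on this is an alarm on CPUCreditBalance, so you hear about credit exhaustion before the throttling starts. A minimal boto3 sketch (the instance ID, alarm name and threshold are placeholders, and no notification action is attached here):

    import boto3

    # Sketch only: instance ID and threshold are placeholders; add AlarmActions
    # (e.g. an SNS topic) if you want to be notified rather than just see the alarm state.
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    cw.put_metric_alarm(
        AlarmName="t2-micro-credit-balance-low",
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=20,
        ComparisonOperator="LessThanThreshold",
    )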
Network Bandwidth
Each Amazon EC2 instance type has a limited amount of network bandwidth. This is done to prevent noisy neighbour situations. While AWS only describes bandwidth as Low/Moderate/High, there are some better details at: EC2 Instance Types's EXACT Network Performance?
You can monitor network traffic of your EC2 instances using CloudWatch. Pay attention to NetworkIn and NetworkOut to determine whether the instances are hitting limits.
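For example, a small boto3 sketch (the region and instance ID are placeholders) that sums the last 24 hours of NetworkIn and NetworkOut per hour, which you can compare against the bandwidth you expect for the instance type:

    import boto3
    from datetime import datetime, timedelta, timezone

    # Sketch only: region and instance ID are placeholders.
    cw = boto3.client("cloudwatch", region_name="us-east-1")
    end = datetime.now(timezone.utc)

    for metric in ("NetworkIn", "NetworkOut"):
        resp = cw.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName=metric,
            Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
            StartTime=end - timedelta(hours=24),
            EndTime=end,
            Period=3600,               # one datapoint per hour
            Statistics=["Sum"],        # bytes transferred during each hour
        )
        for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
            print(metric, point["Timestamp"], f'{point["Sum"] / 1e9:.2f} GB')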