After reading the AWS documentation, I am still not clear about the CloudWatch metric statistics Average and Maximum, specifically for ECS CPUUtilization.
I have an AWS ECS cluster on Fargate, with a service that has a minimum count of 2 healthy tasks. I have enabled autoscaling using the AWS/ECS CPUUtilization metric for
my ClusterName and ServiceName. A CloudWatch alarm is configured to trigger when average CPU utilization is more than 75% for 3 data points with a one-minute period.
I also have a health check set up with a frequency of 30 seconds and a timeout of 5 mins.
I ran a performance script to test the autoscaling behavior, but I am noticing that the service gets marked as unhealthy and new tasks get created. When I check the CPUUtilization metric, the Average statistic shows around 44% utilization but the Maximum statistic shows more than 100% (screenshots attached).
[Screenshot: CPUUtilization graph, Average statistic]
[Screenshot: CPUUtilization graph, Maximum statistic]
So what do Average and Maximum mean here? Is Average the average CPU utilization across both of my tasks, and does Maximum show that one of my tasks used more than 100% CPU?
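Average and Maximum here measure the average CPU usage over a 1-minute period and the maximum CPU usage over a 1-minute period. For a service-level metric the samples come from all tasks in the service, so Average is the mean across your tasks and Maximum is the highest value any single task reported in that period.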
In terms of configuring autoscaling rules, you want to use the average metric.
The maximum metric usually reflects random, short bursts that can be caused by things like garbage collection.
The average metric, however, is the arithmetic mean of the samples in the period. It is not the p50 (the median), so the CPU usage is not necessarily above it half of the time and below it the other half, although for evenly spread load the two values are close.
You most likely want to scale up on the average metric when your CPU reaches, say, 75-85% (keep in mind that you need to give new tasks time to warm up).
The maximum metric can generally be ignored for autoscaling use cases.
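To make the difference concrete, here is a minimal Python sketch. It assumes, purely for illustration, that two tasks each report a few CPU samples within one 1-minute period. The task names and sample values are made up, chosen so the result resembles the numbers in the question.

```python
from statistics import mean

# Hypothetical CPU samples (percent) reported within one 1-minute period.
# Task names and values are invented for illustration only.
samples_by_task = {
    "task-a": [12.0, 15.0, 110.0, 14.0],  # one short burst above 100%
    "task-b": [40.0, 45.0, 60.0, 56.0],
}

def period_statistics(samples_by_task):
    """Pool every sample from every task and summarize the period."""
    pooled = [s for samples in samples_by_task.values() for s in samples]
    return {
        "Average": mean(pooled),
        "Maximum": max(pooled),
        "Minimum": min(pooled),
        "SampleCount": len(pooled),
    }

stats = period_statistics(samples_by_task)
print(stats)
```

A single short burst on one task is enough to push Maximum above 100% while Average stays around 44%, which is why an alarm on Average does not fire in this situation.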
Related
I have started an EC2 instance (with standard monitoring).
From my understanding, the EC2 service will publish 1 datapoint every 5 minutes for CPUUtilization to CloudWatch.
Hence my question is: why are the graphs different for a 5-minute visualization with different statistics (Min, Max, Avg, ...)?
Since there is only 1 datapoint per 5 minutes, the Min, Max or Average of a single datapoint should be the same, right?
Example:
Just by changing the "average" statistic to the "max", the graph changes (I don't understand why).
Thanks
Just to add on to @jccampanero's answer, I'd like to explain it in a bit more detail.
From my understanding, the EC2 service will publish 1 datapoint every 5 minutes for the CPUUtilization to CloudWatch.
Yes, your understanding is correct, but there are two types of datapoints. One type is called "raw data", and the other type is called "statistic set". Both types use the same PutMetricData API to publish metrics to CloudWatch, but they use different options.
Since there is only 1 datapoint per 5 minutes, the Min, Max or Average of a single datapoint should be the same right?
Not quite. This is only true when all datapoints are of type "raw data", which you can think of as a single number. If you have statistic sets, then the Min, Max and Average of a single datapoint can be different, which is exactly what happens here.
If you choose the SampleCount statistic, you can see that one datapoint here is an aggregation of 5 samples. To give you a concrete example, let's take the one in @jccampanero's answer.
In this period of time on average the CPU utilization was 40%, with a maximum of 90%, and a minimum of 20%.
Translated to code (e.g. AWS CLI), it's something like
aws cloudwatch put-metric-data \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--unit Percent \
--statistic-values Sum=200,Minimum=20,Maximum=90,SampleCount=5 \
--dimensions InstanceId=i-123456789
If EC2 were using AWS CLI to push the metrics to CloudWatch, this would be it. I think you get the idea now, and it's quite common to aggregate the data to save some money on the CloudWatch bill.
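The relationship between raw samples and the four statistic-set fields can be sketched in a few lines of Python. This is only an illustration: the `StatisticSet` class and `from_raw` helper are hypothetical names, and the five raw values are invented so that they match the Sum=200, Minimum=20, Maximum=90, SampleCount=5 example above.

```python
from dataclasses import dataclass

@dataclass
class StatisticSet:
    """Hypothetical container mirroring the four --statistic-values fields."""
    sum: float
    minimum: float
    maximum: float
    sample_count: int

    @property
    def average(self) -> float:
        # Average is derived from Sum and SampleCount, not stored.
        return self.sum / self.sample_count

def from_raw(values):
    """Aggregate raw samples into one statistic set."""
    return StatisticSet(
        sum=sum(values),
        minimum=min(values),
        maximum=max(values),
        sample_count=len(values),
    )

# Five invented one-minute samples aggregated into one 5-minute datapoint.
point = from_raw([20.0, 25.0, 30.0, 35.0, 90.0])
print(point, point.average)

# A single raw value: Min, Max and Average all coincide.
single = from_raw([40.0])
print(single.minimum, single.maximum, single.average)
```

With five samples folded into one datapoint, Minimum, Maximum and Average differ. With a single raw value they are identical, which is the case the question assumed.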
Honestly I have never thought about it carefully but from my understanding the following is going on.
Amazon EC2 sends metric data to CloudWatch in the configured period of time, five minutes in this case unless you enable detailed monitoring for the instance.
This metric data will not consist only of the average, but also the maximum and minimum CPU utilization percentage observed during that period of time. I mean, it will tell CloudWatch: in this period of time on average the CPU utilization was 40%, with a maximum of 90%, and a minimum of 20%. I hope you get the idea.
That explains why your graphs look different depending on the statistic chosen.
Please consider reading this entry in the AWS documentation, which explains how the CloudWatch statistics definitions work.
I have set up an AWS CloudWatch alarm: CPU utilization > 90%: https://www.screencast.com/t/BPs3hlY2hEZ
I have added the alarm / CPU % utilization metric to a dashboard: https://content.screencast.com/users/MartinBakDK/folders/Jing/media/5c01c414-95d7-4a20-ab7d-e0a6c9debc01/2020-05-20_2158.png
I do know that the metric shows a 5-minute average, but for more than 1 hour now the actual CPU usage of my EC2 instance has been 100% (so it should show 100% as well):
https://www.screencast.com/t/BvITivn0ff
Because CloudWatch only registers 20% and not the actual 100%, CloudWatch is useless as a monitoring system.
Can this really be true? Please tell me what is going on and how AWS can provide such a "service".
Look at the CPUCreditBalance metric. If it is at 0, your CPU will be capped at a fixed percentage (this is why you see the straight line).
Your host sees itself at 100% because it cannot use any more CPU.
All T instances are burstable, so once their CPU credits are depleted the CPU is capped at the baseline performance. You can either change the instance type or enable unlimited mode (there will be additional cost).
Further reading: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-performance-instances-monitoring-cpu-credits.html
T3 instances can burst above their baseline, so how should this be taken into account when auto-scaling on CPUUtilization?
Let's say we use t3.small instances with:
24 CPU credits earned per hour
2 vCPUs
20% Baseline performance per vCPU
I would set this scaling trigger for CPUUtilization:
Statistic: Average
Unit: Percent
Period: 5min
Breach duration: 5min
Upper threshold: 15%
Lower threshold: 5%
So the upper threshold is set just below the baseline to avoid the additional cost when staying above the baseline for too long (and using up CPU credits).
In addition I set a CloudWatch alert when CPUUtilization of any instance stays above 20% for too long. This would at least get triggered when auto scaling has reached the max. number of allowed instances.
Does this all make sense?
It is not advisable to use CPU Utilization as a metric for Auto Scaling when using T1/2/3 burstable instance types.
The reason is that the CPU on these instances can be artificially limited, thereby giving a false impression of how busy they are.
If you activate the "Unlimited" option, then this is okay because the instance can burst as required without limit. Do not be afraid of the extra charge: it only applies if the instance's average CPU utilization over a rolling 24-hour period (or its lifetime, if shorter) exceeds the baseline, so you only pay for CPU that was actually used while the instances were busy.
Alternatively, choose a different (non-burstable) instance type.
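The credit arithmetic behind the t3.small numbers in the question can be sketched as follows. One CPU credit corresponds to one vCPU running at 100% for one minute. The function names and the starting balance of 144 credits are assumptions made for this example, and the model ignores the maximum credit balance and unlimited mode.

```python
# t3.small figures taken from the question above.
VCPUS = 2
CREDITS_EARNED_PER_HOUR = 24.0

def credits_spent_per_hour(avg_cpu_percent: float, vcpus: int = VCPUS) -> float:
    """Credits consumed in one hour at a constant average utilization.

    One credit = one vCPU at 100% for one minute.
    """
    return vcpus * avg_cpu_percent * 60 / 100

def net_credits_per_hour(avg_cpu_percent: float) -> float:
    """Positive: the balance grows. Negative: the balance drains."""
    return CREDITS_EARNED_PER_HOUR - credits_spent_per_hour(avg_cpu_percent)

def hours_until_depleted(balance: float, avg_cpu_percent: float):
    """Hours until the balance reaches 0, or None if it never drains."""
    net = net_credits_per_hour(avg_cpu_percent)
    if net >= 0:
        return None
    return balance / -net

print(net_credits_per_hour(20))        # at the 20% baseline: break-even
print(net_credits_per_hour(15))        # at the proposed upper threshold
print(hours_until_depleted(144, 50))   # hypothetical balance of 144 credits
```

At the 20% baseline the instance spends exactly the 24 credits it earns per hour. At 15% it banks 6 credits per hour, and at a sustained 50% a balance of 144 credits would last 4 hours before the CPU is throttled to the baseline.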
My primary requirement is as follows:
When CPU consumption on an instance exceeds 50%, adjust the capacity of the autoscaling group to 5 instances; when CPU consumption exceeds 80%, adjust the capacity to 10 instances.
However if I use cloudwatch alarms to set capacity I can imagine the following race condition:
5 instances exist
CPU consumption exceeds 80 %
Alarm is triggered
Capacity is changed to 10 instances
CPU consumption drops below 50 %
Eventually CPU consumption again exceeds 50% but now capacity will be changed to 5 instances (which is something I don't want to happen)
So what I would ideally like is that, in response to an alarm, the capacity is set to at least the value for the corresponding threshold, and never reduced below the current capacity.
I am aware that this can be done by manually setting the capacity through the AWS SDK, which could be triggered in response to lifecycle events monitored by a supervisor, but is there a better approach, preferably one that does not require setting up additional supervisors or webhooks for alarms?
A general approach is to make the scaling actions more fine-grained.
Do not jump that far in one step:
if the ASG avg CPU is over 70%: add an instance
if the ASG avg CPU is over 90%: add "n" instances
if the ASG avg CPU is under 40%: remove an instance
if the ASG avg CPU is under 10%: remove "n" instances
All of these values are averages over the last 5 minutes. So if you have a really fast spike, you need more aggressive scaling. Even with small steps, you can easily add 6 servers or more within half an hour.
Also, scaling works better with higher numbers. So if your system needs only 1-3 instances, it may make sense to decrease the instance size so you can have 2-6 instances. It gives some extra flexibility to your system.
But again, the question is: what is your expected load? Big spikes, or a predictable rise and fall during the day?
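The tiers listed above can be written down as a small decision function. This is a sketch of the logic only, not an Auto Scaling API: `scaling_adjustment` is a hypothetical name, and the default of 3 for "n" is an assumption chosen for the example.

```python
def scaling_adjustment(avg_cpu: float, n: int = 3) -> int:
    """Return how many instances to add (positive) or remove (negative).

    avg_cpu is the group's average CPU over the last 5 minutes, in percent.
    n is the size of the larger step and is an assumed value here.
    """
    if avg_cpu > 90:
        return n       # heavy load: add "n" instances
    if avg_cpu > 70:
        return 1       # moderate load: add one instance
    if avg_cpu < 10:
        return -n      # nearly idle: remove "n" instances
    if avg_cpu < 40:
        return -1      # light load: remove one instance
    return 0           # between 40% and 70%: leave capacity alone

for cpu in (95, 75, 55, 30, 5):
    print(cpu, scaling_adjustment(cpu))
```

The band between 40% and 70% where nothing happens is what keeps the group from flapping between scale-out and scale-in.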
I would suggest looking into an AWS Lambda function, triggered by an SNS message from CloudWatch. It should give you free rein to put as much logic into the scaling decision as you want.
Good Luck!
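As a rough sketch of the decision logic such a Lambda function could contain, the "at least this many instances" requirement from the question reduces to taking a maximum. The function name and structure are hypothetical. Reading the current capacity and applying the result (for example with boto3's `set_desired_capacity`) are deliberately left out, so this is only the pure calculation.

```python
# Thresholds and capacities taken from the question: (cpu_percent, min_capacity),
# ordered from the highest threshold down.
CAPACITY_FLOORS = [(80, 10), (50, 5)]

def desired_capacity(current_capacity: int, cpu_percent: float) -> int:
    """Raise capacity to the floor for the breached threshold, never lower it."""
    for threshold, floor in CAPACITY_FLOORS:
        if cpu_percent > threshold:
            # max() prevents the race in the question: a later 50% alarm
            # cannot shrink a group that was already scaled to 10.
            return max(current_capacity, floor)
    return current_capacity

print(desired_capacity(5, 85))    # 80% breached: go to 10
print(desired_capacity(10, 55))   # 50% breached after scaling out: stay at 10
print(desired_capacity(3, 55))    # 50% breached from below: go to 5
```

Scaling back in would still need its own rule. This sketch only covers the scale-out side described in the question.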
I have an auto scale group with triggers as follows:
Average CPU Utilization > 90% scale up 1 instance
Average CPU Utilization < 25% scale down 1 instance
The metric is being calculated every 2 minutes and the breach limit is 10 minutes.
The problem I am experiencing is that the triggers seem to be firing constantly. Instances are being created and destroyed every 10 minutes. I have been monitoring the CPU utilization and it never surpasses the scale-up threshold. The maximum it hits is around 80%, and this only happened once; most of the time it is in the 20 to 25% range. I normally have only 1 instance running, but every 10 minutes ELB will create a new instance, and soon after it will terminate it.
Is there anything I am doing wrong here? Am I misunderstanding how the average CPU utilization works?
The new EC2 instances are being created by Auto-Scaling (not Load Balancer).
There is a "Scaling History" tab in the Auto Scaling group that might provide some hints as to what is triggering the scale-out policy.
Check whether "Detailed Monitoring" is enabled on the Auto Scaling group and/or Launch Configuration. This causes metrics (e.g. CPU) to be collected every 1 minute instead of the default 5 minutes.
Check the setting on your CloudWatch chart to match the metric collection interval -- if metrics are being collected every minute, set the CloudWatch chart to 1-minute also. Otherwise, you might be viewing metrics at a lower "resolution" than the alarm itself.
Worst case, increase the timing settings for the Alarm, such as "Above 90% for 2 consecutive periods" rather than just one period.
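The effect of requiring more consecutive periods can be illustrated with a simplified model of alarm evaluation. This is an assumption-laden sketch: `alarm_states` is a hypothetical helper, the CPU values are invented, and real CloudWatch alarms have additional behavior (such as "M out of N" datapoints and missing-data handling) that is not modeled here.

```python
def alarm_states(datapoints, threshold, periods):
    """For each datapoint, report whether the alarm would be in ALARM.

    Simplified rule: ALARM only when the most recent `periods` datapoints
    are all above `threshold`.
    """
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - periods + 1): i + 1]
        breached = len(window) == periods and all(v > threshold for v in window)
        states.append(breached)
    return states

# Invented per-period average CPU values with two short spikes.
cpu = [22, 24, 95, 23, 21, 92, 94, 20]

print(alarm_states(cpu, threshold=90, periods=1))
print(alarm_states(cpu, threshold=90, periods=2))
print(alarm_states(cpu, threshold=90, periods=3))
```

With one period, every isolated spike triggers a scale-out. With two periods only the sustained breach does, and with three periods none of these spikes is long enough.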