fargate cloudwatch cpu utilization percentile - amazon-web-services

I've created a dashboard for a service deployed to Fargate using the following:
Namespace
AWS/ECS
Metric name
CPUUtilization
Both Average and Max can be viewed, but neither of these is that useful: Max is really high and Average is really low.
When choosing p99 from the drop-down selector, no data is returned and nothing is plotted on the chart. Is it just the case that p99 isn't supported by the CPUUtilization metric on Fargate?
Is there a way to get this stat manually on the dashboard, and add it to an alarm as a threshold?

Related

Why do the Max, Min, and Avg statistics have different results on a 5 min metric in CloudWatch?

I have started an EC2 instance (with standard monitoring).
From my understanding, the EC2 service will publish 1 datapoint every 5 minutes for the CPUUtilization to CloudWatch.
Hence my question is: why are the graphs different for a 5-minute visualization with different statistics (Min, Max, Avg, ...)?
Since there is only 1 datapoint per 5 minutes, the Min, Max, or Average of a single datapoint should be the same, right?
Example:
Just by changing the "average" statistic to the "max", the graph changes (I don't understand why).
Thanks
Just to add on to #jccampanero's answer, I'd like to explain it in a bit more detail.
From my understanding, the EC2 service will publish 1 datapoint every 5 minutes for the CPUUtilization to CloudWatch.
Yes, your understanding is correct, but there are two types of datapoint. One type is called "raw data", and the other type is called "statistic set". Both types use the same PutMetricData API to publish metrics to CloudWatch, but they use different options.
Since there is only 1 datapoint per 5 minutes, the Min, Max or Average of a single datapoint should be the same right?
Not quite. This is only true when all datapoints are of type "raw data". Basically, just think of a raw datapoint as a single number. If you have statistic sets, then the Min, Max, and Average of a single datapoint can be different, which is exactly what happens here.
If you choose the SampleCount statistic, you can see that one datapoint here is an aggregation of 5 samples. To give you a concrete example, let's take the one in #jccampanero's answer:
In this period of time, on average the CPU utilization was 40%, with a maximum of 90% and a minimum of 20%. I hope you get the idea.
Translated to code (e.g. AWS CLI), it's something like
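# One aggregated "statistic set" datapoint: Average = Sum / SampleCount = 200 / 5 = 40%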
aws cloudwatch put-metric-data \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--unit Percent \
--statistic-values Sum=200,Minimum=20,Maximum=90,SampleCount=5 \
--dimensions InstanceId=i-123456789
If EC2 were using the AWS CLI to push its metrics to CloudWatch, this would be it. I think you get the idea now; it's quite common to aggregate data like this to save some money on the CloudWatch bill.
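To see the effect for yourself, you could read that single aggregated datapoint back with different statistics and watch them disagree. A rough sketch (the instance ID and time window here are placeholders):
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-123456789 \
--statistics Average Minimum Maximum SampleCount \
--period 300 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T00:05:00Z
For the statistic set above, the same datapoint would come back as Average 40, Minimum 20, Maximum 90, and SampleCount 5.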
Honestly, I have never thought about it carefully, but from my understanding the following is going on.
Amazon EC2 sends metric data to CloudWatch at the configured period, five minutes in this case unless you enable detailed monitoring for the instance.
This metric data does not consist only of the average; it also includes the maximum and minimum CPU utilization percentage observed during that period of time. I mean, it will tell CloudWatch: in this period of time, on average the CPU utilization was 40%, with a maximum of 90% and a minimum of 20%. I hope you get the idea.
That explains why your graphs look different depending on the statistic chosen.
Please consider reading this entry in the AWS documentation, in which they explain how the CloudWatch statistics definitions work.

How to increase resolution of CpuUtilization metric of ECS cluster past 1 min mark?

I'm trying to create a robust autoscaling process for my ECS cluster but am facing problems with the resolution of the CpuUtilization metric. I have turned on 'Detailed metrics' for 1-min resolution, but am not able to achieve good scaling results. I am deploying an ML model which takes roughly 1.5s to infer. I am not facing any memory bottleneck and hence am using CpuUtilization for scaling.
I need fast scaling, as when requests start piling up the response time easily shoots up to 3-5s. Currently, with 'Detailed Metrics' enabled, the scale-out takes around 3-5 minutes to start, as 3 datapoints are checked for 1-min resolution metrics. If I had a 5-10s resolution metric, I could look at 6 data points within 30s and start the scale-out job faster.
I tried using Lambda, Step Functions, and EventBridge from this blog, but I am not able to get CpuUtilization or MemoryUtilization, only the task, service, and container counts.
Is there a way to get CPU and memory metrics directly from ECS? I know we can use cloudwatch.get_metric_statistics(), but that only returns datapoints that are already reported to CloudWatch, so it is not useful here.
You can't change that; the 1-minute resolution is set by AWS. The only thing you can do to get better resolution is to create your own custom metrics. Custom metrics can have a resolution of 1 second.
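As a minimal sketch (the namespace, dimensions, and value here are placeholders; you would need your own collector, e.g. a sidecar reading the ECS task metadata endpoint, to produce the value), a high-resolution custom metric is published with --storage-resolution 1:
# Publish a custom CPU metric with 1-second storage resolution
aws cloudwatch put-metric-data \
--namespace Custom/ECS \
--metric-name CPUUtilization \
--dimensions ClusterName=my-cluster,ServiceName=my-service \
--unit Percent \
--value 73.5 \
--storage-resolution 1
You can then alarm on that custom metric with a 10- or 30-second period for faster scale-out.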

Amazon RDS: plot number of instances in CloudWatch

How do I plot the number of instances in my AWS Aurora RDS cluster over time in CloudWatch?
There doesn't seem to be a metric for that.
Indeed there is no metric for that.
UPDATE: the below trick is not 100% foolproof: when the dashboard range is set to 1d or more, the display period automatically changes to 5 Minutes, which leads to values being off by a factor of 5.
The trick is to pick any RDS aggregated metric (for example CPUUtilization, aggregated per DB role), then select Statistic: Sample Count and Period: 1 Minute.
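As a sketch of the same trick with the CLI (the cluster identifier and time window are placeholders), the SampleCount per 1-minute period roughly equals the number of instances reporting the metric:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBClusterIdentifier,Value=my-aurora-cluster \
--statistics SampleCount \
--period 60 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T01:00:00Z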

AWS/ECS CPUUtilization average vs maximum

After reading the AWS documentation I am still not clear about the CloudWatch metric statistics Average and Maximum, specifically for ECS CPUUtilization.
I have an AWS ECS Fargate cluster set up, with a service that has a minimum count of 2 healthy tasks. I have enabled autoscaling using AWS/ECS CPUUtilization for my ClusterName and ServiceName. A CloudWatch alarm is configured to trigger when average CPU utilization is more than 75% for a one-minute period for 3 data points.
I also have a health check set up with a frequency of 30 seconds and a timeout of 5 minutes.
I ran a performance script to test the autoscaling behavior, but I am noticing that the service gets marked as unhealthy and new tasks get created. When I check the CPUUtilization metric, the Average statistic shows around 44% utilization, but the Maximum statistic shows more than one hundred percent; screenshots attached.
(Screenshots: Average statistic, Maximum statistic)
So what are Average and Maximum here? Does this mean Average is the average CPU utilization of both my instances, and Maximum shows one of my instances' CPU utilization going above 100%?
Average and Maximum here measure the average CPU usage over a 1-minute period and the max CPU usage over a 1-minute period.
In terms of configuring autoscaling rules, you want to use the average metric.
The Maximum metric usually reflects random short burst spikes that can be caused by things like garbage collection.
The Average metric, however, behaves roughly like the p50 CPU usage, so half of the time the CPU usage is more than that and half is less. (Yeah, technically that describes the median, but for now it doesn't matter as much.)
You most likely want to scale up using the Average metric when your CPU goes to, say, 75-85% (keep in mind, you need to give new tasks time to warm up).
The Max metric can generally be ignored for autoscaling use cases.
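If you scale on the average, the simplest setup is a target tracking policy on the service's average CPU. A rough sketch with the AWS CLI (cluster name, service name, target value, and cooldowns are placeholders, and the service must already be registered as a scalable target with register-scalable-target):
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/my-cluster/my-service \
--policy-name cpu-target-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
  "TargetValue": 75.0,
  "PredefinedMetricSpecification": { "PredefinedMetricType": "ECSServiceAverageCPUUtilization" },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 300
}'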

Define a Cloudwatch alert on replication lag

I have a Master-Slave configuration on AWS RDS MySQL.
I want to set an alert when the replication lag goes above a certain threshold (e.g. 10 seconds).
How can it be done?
If it is not possible, is there another way to achieve similar result? (without using 3rd party tools / custom scripting)
You can track replica lag using the ReplicaLag metric on your slave instance. Note that this metric is measured in seconds. It is reported automatically by RDS every minute.
You can create a CloudWatch alarm to monitor the ReplicaLag metric. You should set this alarm to breach if the Sum of ReplicaLag over an evaluation period of 1 minute is greater than 0.
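A sketch of that alarm with the AWS CLI (the replica identifier and SNS topic ARN are placeholders; raise the threshold to 10 if you only want to be alerted when the lag exceeds 10 seconds, as in the question):
aws cloudwatch put-metric-alarm \
--alarm-name rds-replica-lag \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=my-read-replica \
--statistic Sum \
--period 60 \
--evaluation-periods 1 \
--threshold 0 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts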