AWS Cloudwatch alarm for fluctuating datapoints - amazon-web-services

I am monitoring a Fargate service on ECS and want to know when containers are bouncing a lot (a container comes up, fails its health check, gets killed by ECS, and a new one gets scheduled and does the same).
I replicated the scenario I'm interested in, and using the "Sample count" aggregation for CPUUtilization from ECS I can see this graph:
In an ideal world the value would be 1, but as we can see here, ECS schedules a new container to replace the unhealthy one, that one eventually gets killed too, and we see this bouncing behavior.
I would like to set up a CloudWatch alarm that fires when the value fluctuates a lot from the ideal value in a short period of time, but I can't quite figure out if this is possible. Maybe with some metric math, but I can't quite get it. I also looked into Anomaly Detection and I think that would work, but it incurs extra cost that I don't think is warranted.
I'm just looking to set off an alarm if the value bounces around multiple y-axis points in, let's say, a 5-minute time frame.

You can do something like this with metric math:
RUNNING_SUM(ABS(DIFF(myMetric)))
This will return a rolling sum of all the absolute changes in the metric for your period. You can then create an alarm based on this, adjusting the desired period and threshold.
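To get a feel for what that expression measures, here is a minimal Python sketch of the same arithmetic outside CloudWatch. The function name `running_abs_diff` and the sample values are made up for illustration. It assumes DIFF yields each value minus the previous one (so the first datapoint produces no output) and that RUNNING_SUM accumulates from the start of the series.

```python
from itertools import accumulate

def running_abs_diff(values):
    """Cumulative sum of absolute changes between consecutive datapoints."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return list(accumulate(diffs))

steady = [1, 1, 1, 1, 1, 1]      # one healthy task the whole time
bouncing = [1, 2, 1, 3, 1, 2]    # tasks being replaced over and over

print(running_abs_diff(steady))    # [0, 0, 0, 0, 0]
print(running_abs_diff(bouncing))  # [1, 2, 4, 6, 7]
```

A steady series stays at 0 while the bouncing one keeps climbing, so a threshold somewhere in between separates the two cases.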

Related

How to increase resolution of CpuUtilization metric of ECS cluster past 1 min mark?

I'm trying to create a robust autoscaling process for my ECS cluster but am facing problems with resolution of CpuUtilization metric. I have turned on 'Detailed metrics' for 1-min resolution, but am not able to achieve good scaling results. I am deploying an ML model which takes roughly 1.5s to infer. I am not facing any memory bottleneck and hence, am using CpuUtilization for scaling.
I need fast scaling because when requests start piling up, the response time easily shoots up to 3-5s. Currently, with 'Detailed Metrics' enabled, the scale-out takes around 3-5 minutes to start, as 3 datapoints are checked for 1-min resolution metrics. If I had a metric with 5-10s resolution, I could look at 6 datapoints within 30s and start the scale-out job faster.
I tried using Lambda, StepFunctions and EventBridge from this blog. But, I am not able to get CpuUtilization or MemoryUtilization, only the task, service and container counts.
Is there a way to get Cpu and Memory metrics directly from ECS? I know we can use cloudwatch.get_metric_statistics(). But, we can only get datapoints that are reported to CloudWatch. So, not useful.
You can't change that. The 1-minute resolution is set by AWS. The only way to get better resolution is to publish your own custom metrics, which can have a resolution of 1 second.
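As a rough sketch of what publishing such a custom metric could look like with boto3: the namespace `Custom/ECS`, the `ServiceName` dimension, and the helper names below are assumptions for illustration, and how you measure CPU inside the container is left out. The part that matters is `StorageResolution`, which PutMetricData accepts as 1 (high resolution) or 60 (standard).

```python
from datetime import datetime, timezone

def build_datum(metric_name, value, service_name, resolution_seconds=1):
    """Build one MetricData entry for CloudWatch PutMetricData."""
    if resolution_seconds not in (1, 60):
        raise ValueError("StorageResolution must be 1 (high) or 60 (standard)")
    return {
        "MetricName": metric_name,
        # Hypothetical dimension, pick whatever identifies your service
        "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": "Percent",
        "StorageResolution": resolution_seconds,
    }

def publish(datum, namespace="Custom/ECS"):
    """Send the datum to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_datum works without boto3
    client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=namespace, MetricData=[datum])

datum = build_datum("CpuUtilization", 72.5, "my-service")
print(datum["StorageResolution"])  # 1
```

You would call `publish(datum)` from a loop or sidecar every few seconds, then point the scaling alarm at the custom metric instead of the AWS one.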

Auto scale rule based on custom Cloudwatch alarm

I have an auto-scaling group of EC2 servers that run a number of processes.
This number of processes changes with the load and I'd like to trigger a scaling (up/down) based on the number of processes.
I've successfully set up a script that sends the number of processes on every server to CloudWatch every minute, and I can see these values in CloudWatch. (I haven't set a dimension, so that I can get the value across all the servers.)
Then I created an alarm that uses the average of the values sent. If it reaches a certain limit, it triggers "Add a new server" on the Auto Scaling group, and when it stops being in alarm, it triggers "Remove a server".
My issue is that when I add the new server, the average drops, since there is one more server now. That moves the alarm to the OK state, which removes the server, which increases the average again, which triggers the alarm again, and so on.
For instance, the limit is set to 10 processes on average. With 3 servers, if the average becomes 11, the alarm goes into the ALARM state and adds a server. Now, with the new server, I have 33 processes (3 x 11) across 4 servers: 8.25 processes on average, which puts the alarm back in the OK state.
My question is: is it possible to set up an alarm based on the number of processes without the new server causing an up-down-up-down issue?
Instead of average, I can use something else to trigger the alarm, such as min/max/I-don't-know.
Thank you for your help. Happy to provide any other details if needed.
You should not create an alarm that adds instances when True and removes instances when False. This will cause a continual 'flip-flop' situation rather than trying to find a steady-state.
You could have each server regularly send a custom metric to Amazon CloudWatch. You could then use this with a target tracking scaling policy for Amazon EC2 Auto Scaling, which will calculate the average value of the metric and automatically launch/terminate instances to keep the value around the target of 10.
This would work well with long-running processes (perhaps 5+ minutes with several processes running concurrently), but would not be good with short sub-minute processes because it takes time to launch new instances.
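To make the difference concrete, here is a small simulation using the numbers from the question (33 processes, 3 servers, limit of 10). It is a simplified model, not how Auto Scaling is implemented internally: `single_alarm_history` mimics the add-on-ALARM/remove-on-OK setup, and `capacity_for_target` is a hypothetical stand-in for sizing the group toward a target value.

```python
import math

def single_alarm_history(total_processes, servers, limit, steps):
    """Add a server while the average is above the limit, remove one otherwise."""
    history = []
    for _ in range(steps):
        average = total_processes / servers
        if average > limit:
            servers += 1
        else:
            servers = max(1, servers - 1)
        history.append(servers)
    return history

def capacity_for_target(total_processes, target):
    """Smallest server count that keeps the average at or below the target."""
    return max(1, math.ceil(total_processes / target))

print(single_alarm_history(33, 3, 10, 6))  # [4, 3, 4, 3, 4, 3]
print(capacity_for_target(33, 10))         # 4
```

The single alarm never settles, while sizing toward a target lands on 4 servers and stays there as long as the load does not change.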
I think you could look at metric math. Instead of triggering your alarm directly on your process-count metric, you could perhaps calculate the average count yourself using metric math. You could use the GroupTotalInstances metric from your ASG, or just publish a second custom metric containing the number of instances.
In both cases, your metric for the alarm would use metric math to divide number of processes by size of ASG for each evaluation period.
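A sketch of what the metric math part might look like as the `Metrics` argument of boto3's `put_metric_alarm`. The namespace `Custom/App` and metric name `ProcessCount` are placeholders for whatever your script publishes, and this assumes group metrics are enabled on the ASG so that GroupTotalInstances is available.

```python
def average_processes_queries(asg_name, period=60):
    """Metric math queries: total processes divided by ASG size."""
    return [
        {
            "Id": "procs",
            "MetricStat": {
                # Placeholder custom metric, published without dimensions
                "Metric": {"Namespace": "Custom/App", "MetricName": "ProcessCount"},
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "size",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": "GroupTotalInstances",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
                },
                "Period": period,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        # Only the expression is returned, so the alarm evaluates this value
        {"Id": "avg", "Expression": "procs / size", "Label": "ProcessesPerInstance", "ReturnData": True},
    ]

queries = average_processes_queries("my-asg")
print([q["Id"] for q in queries])  # ['procs', 'size', 'avg']
```

The list would be passed as `Metrics=queries` together with the usual threshold and evaluation settings. Note that the `Sum` statistic only gives the total process count if each server reports exactly once per period.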

AWS High Resolution Metrics for faster ECS scaling

I have a complex REST API deployed in AWS ECS. The autoscaling policy for the same is based on RequestCount of 2000.
The scale-out happens when RequestCount is consistently higher than 2000 at the standard resolution of 60 seconds. This takes at least 2 minutes before scaling happens. This is becoming a problem with short request surges, when the request count increases to 10k and above: the containers start rejecting requests (throttling).
I need scaling to happen more quickly, at least within a minute if not within seconds. AWS CloudWatch seems to offer high-resolution metrics, but there's very little information about:
Can I enable high resolution for specific metrics? Is it possible to have request counts resolved at a high granularity of 5 seconds and CPUUtilization at the standard granularity of 1 minute?
How can I enable high resolution on AWS metrics?
The AWS CloudWatch Documentation seems to be insufficient to understand this process.
There are two different things that can be 'high resolution': the alarm and the metric.
A high-resolution metric just means the source is pushing values more frequently. You can't control this if you're using an AWS metric, and most of them don't push more often than once a minute.
A high-resolution alarm is one where the period is less than 60 seconds, and it is billed at a higher rate than standard alarms. However, this isn't very useful in most cases if the metric you're basing it on only gets pushed once per minute.
EDIT:
To directly answer your questions
No, I don't think any of the AWS RequestCount metrics for things like ELB have a 'high resolution on/off' toggle (although ELB might push more frequently than once a minute by default, I'm not sure).
It's based on how often the source pushes data points to CloudWatch. If the AWS metrics don't work for what you need, you would need to add something like the CloudWatch agent (or just a script on your instance) that pushes the metric more frequently. Be careful about the CloudWatch API call charges if you do this from a lot of sources at a high frequency, though.
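To put a number on "a lot of sources at a high frequency", here is a quick back-of-the-envelope calculation. It assumes a 30-day month and one API call per datapoint per source (batching several datapoints into one call would lower it); the function name is made up, and no prices are included, so check the current CloudWatch pricing page for the per-call cost.

```python
def put_calls_per_month(sources, interval_seconds, days=30):
    """Number of metric pushes per month for a fleet of sources."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    seconds_per_month = days * 24 * 60 * 60
    return sources * (seconds_per_month // interval_seconds)

print(put_calls_per_month(20, 60))  # 864000
print(put_calls_per_month(20, 5))   # 10368000
```

Going from a 60-second to a 5-second interval multiplies the call volume by 12, which is worth knowing before rolling this out across many containers.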

Best values to set duration alarm on Lambda Function

I have a Lambda function which does some work. I want to create a CloudWatch alarm on its duration, i.e. how long the function takes to run.
I tried the following values for the alarm, but I am running into an issue with it, probably due to the cold start problem. These are the values I am setting:
Statistic : Average
ComparisonOperator : "GreaterThanThreshold"
Threshold: 1000
EvaluationPeriods: 5
Period: 60
Unit: Milliseconds
The issue is that it keeps going into alarm, probably because of cold starts, since the function does not get called that often.
What are the best values to set for Lambda? How do other people set alarms on Lambda?
Also, how long does a Lambda function have to go uncalled before it gets shut down and a cold start can occur?
Use Blue Matador. The thresholds are dynamic, account for daily variation and cold starts, and use machine learning to detect real anomalies. It does the same thing for all the services that Lambda interacts with (Dynamo, SQS, API gateway, RDS, Kinesis, S3, etc.).
disclaimer: i'm the founder of Blue Matador
If you're looking to do it yourself with Cloudwatch, I would recommend timing out after a certain period of time and returning an error. Then, you can use the Errors metric to tell how many failed over a given time period. It's not a perfect solution, but it could correctly ignore cold starts. We wrote a blog about How to Monitor AWS Lambda with CloudWatch and it includes errors, throttles, and more metrics to watch out for.
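One possible reading of that suggestion, sketched in Python: wrap the handler so that an invocation running past a time budget raises, which makes it show up in the Errors metric. This is only an illustration, and the answer may simply mean lowering the function's configured timeout. The names `with_duration_budget` and `DurationBudgetExceeded` are made up, and the sketch raises only after the work has finished, so callers see a failure (and asynchronous invocations may be retried).

```python
import time

class DurationBudgetExceeded(Exception):
    """Raised when a handler runs longer than its time budget."""

def with_duration_budget(handler, budget_ms, clock=time.monotonic):
    """Wrap a Lambda-style handler so slow invocations count as errors."""
    def wrapped(event, context):
        start = clock()
        result = handler(event, context)
        elapsed_ms = (clock() - start) * 1000
        if elapsed_ms > budget_ms:
            raise DurationBudgetExceeded(
                f"took {elapsed_ms:.0f} ms, budget is {budget_ms} ms"
            )
        return result
    return wrapped

def do_work(event, context):
    return {"status": "done"}

# 1000 ms matches the alarm threshold from the question
lambda_handler = with_duration_budget(do_work, budget_ms=1000)
print(lambda_handler({}, None))
```

Only the handler body is timed here, so time spent before the handler is invoked is not counted against the budget.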

How can I tell if CPUUtilization in AWS is the reason my website sign-up is timing out?

Our website is hosted on AWS in a t2.small instance. User-facing sign-up is currently timing out.
Initially, I was getting a loadbalancer latency alarm notification for this instance, so I increased the limit, which seemed to work as a temporary solution.
However, once I increased the limit, I started getting 2 other alarm notifications, which were as follows:
1) production-remove-capacity-alarm
Description: None
Threshold: CPUUtilization <= 40 for 3 datapoints within 15 minutes
2) AWSEBCloudwatchAlarmLow
Description: ElasticBeanstalk Default Scale Down alarm
Threshold: NetworkOut < 2,000,000 for 1 datapoints within 5 minutes
It seems to me that I should simply change the alarm notifications so that I'm no longer alerted to #2, as I don't see how this is interfering with anything, but please correct me if I seem to be missing something.
Regarding #1, does it seem likely that somehow adjusting CPU Utilization in AWS will solve the timeout issue with website sign-up?
And if so, what specifically ought to be done?
Everything is okay. Don't panic.
The first priority is that your application operates correctly. Hopefully your adjustment to the instance type satisfactorily fixed this (but it is still worth watching).
The above two alarms are basically saying:
CPU is under 40%
There's not a lot of network traffic
These alarms can be used to scale-in instances (reduce the number of instances) so that you are not paying for excess capacity. There would be similar alarms that let you scale-out (add additional instances).
ALARM simply means the check is True. That is, the condition has been satisfied. It does not necessarily indicate a problem.
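As an illustration of "the check is True", here is a simplified model of how a condition like "CPUUtilization <= 40 for 3 datapoints within 15 minutes" could be evaluated, assuming 5-minute datapoints. The function name and sample values are made up, and the sketch ignores missing data and the INSUFFICIENT_DATA state that real CloudWatch alarms also have.

```python
def alarm_state(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Return ALARM if enough recent datapoints are at or below the threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value <= threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"

# Three 5-minute datapoints cover the 15-minute window
quiet_site = [55, 30, 25, 20]
busy_site = [30, 25, 60]

print(alarm_state(quiet_site, 40, 3, 3))  # ALARM
print(alarm_state(busy_site, 40, 3, 3))   # OK
```

Here ALARM just reports that CPU has been low for the whole window, which is exactly the situation a scale-in alarm is designed to detect.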
I'm going to presume that you currently have only one instance running. If so, you can ignore those alarms (and Auto Scaling will ignore them too) because you are already at the minimum capacity.
If Auto Scaling has been configured to scale-out to more instances, these alarms would later scale-in to save you money. They're probably a bit trigger-happy, only looking at 15 minutes of CPU and 5 minutes of network traffic; it would normally be better to wait for a longer period before deciding to remove capacity.
Bottom line: If your application is running correctly and you are only operating a single instance, there's nothing to worry about. It's all working as expected.