I created a service in AWS ECS and configured the maximum number of tasks as 30, with a desired count of 2. I also configured CPUUtilization >= 75 as the scale-up trigger and CPUUtilization < 25 as the scale-down trigger.
I use Artillery to send 200 requests per second to my service on ECS, and I can see the CPU usage reach 100%. The test lasts 2 minutes, so 24,000 requests in total.
But the number of tasks under the service stays at 2. It seems the service never scales up based on CPU usage. I wonder what I have missed in the configuration.
Check your alarms in CloudWatch > Alarms; there you can see whether they are actually firing. You will also see a threshold description like "CPUUtilization < 10 for 10 datapoints within 10 minutes".
My guess would be that your threshold is misconfigured, i.e. you check too many datapoints over too long a period of time. Start with something like "1 datapoint within 1 minute" and work from there.
It also helps to check "Activity History" in EC2 > Auto Scaling Groups. Together with CloudWatch, this normally gives you a pretty good picture of what was scaled (or not) and why.
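If you prefer the SDK, here is a minimal boto3 sketch for listing alarms with their current state, to check whether the scale-up alarm has ever actually gone into ALARM (a sketch only; nothing in it is specific to your setup):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # List every alarm with its current state and the reason for that state.
    paginator = cloudwatch.get_paginator("describe_alarms")
    for page in paginator.paginate():
        for alarm in page["MetricAlarms"]:
            print(alarm["AlarmName"], alarm["StateValue"], alarm.get("StateReason", ""))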
I would like to get a list of all AWS EC2 instances that are idle and underutilized. I need to filter them based on CPU utilization below 2% and network I/O below 5 MB for the last 30 days.
Can you please provide the commands or scripts to get this list, or guide me on how to achieve it?
I need this list so I can terminate those instances for cost management.
You can create a CloudWatch alarm that is triggered when the average CPU utilization percentage has been lower than 2 percent for 24 hours, signaling that it is idle and no longer in use. You can adjust the threshold, duration, and period to suit your needs, plus you can add an SNS notification so that you will receive an email when the alarm is triggered.
Kindly refer to this documentation to create an alarm to terminate an idle instance using the Amazon CloudWatch console.
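If you would rather script it than use the console, here is a minimal boto3 sketch of such an alarm; the instance ID, region, and SNS topic ARN are placeholders, and the terminate action is optional:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm when the instance averages under 2% CPU for 24 hours.
    cloudwatch.put_metric_alarm(
        AlarmName="idle-instance-i-0123456789abcdef0",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=3600,              # one datapoint per hour...
        EvaluationPeriods=24,     # ...for 24 hours
        Threshold=2.0,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[
            "arn:aws:sns:us-east-1:123456789012:idle-instances",  # email via SNS
            "arn:aws:automate:us-east-1:ec2:terminate",           # terminate the instance
        ],
    )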
I'm load testing my auto-scaling AWS ECS Fargate stack, which consists of:
Application Load Balancer (ALB) with a target group pointing to ECS,
ECS Cluster, Service, Task, ApplicationAutoScaling::ScalableTarget, and ApplicationAutoScaling::ScalingPolicy,
the Application Auto Scaling policy defines a target tracking policy (see the sketch after this list):
type: TargetTrackingScaling,
PredefinedMetricType: ALBRequestCountPerTarget,
threshold = 1000 requests
the alarm is triggered when 1 datapoint breaches the threshold within a 1-minute evaluation period.
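A rough boto3 equivalent of this configuration, for reference (resource names are taken from the stack described here; the min/max capacities and the target group ResourceLabel are placeholders):

    import boto3

    aas = boto3.client("application-autoscaling")

    # Make the ECS service's desired count scalable (capacities are placeholders).
    aas.register_scalable_target(
        ServiceNamespace="ecs",
        ResourceId="service/my-ecs-cluster/my-service",
        ScalableDimension="ecs:service:DesiredCount",
        MinCapacity=1,
        MaxCapacity=10,
    )

    # Target tracking on ALB request count per target, with a target of 1000.
    aas.put_scaling_policy(
        PolicyName="alb-requests-per-target-per-minute",
        ServiceNamespace="ecs",
        ResourceId="service/my-ecs-cluster/my-service",
        ScalableDimension="ecs:service:DesiredCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 1000.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                "ResourceLabel": "app/my-alb/<alb-id>/targetgroup/my-tg/<tg-id>",
            },
        },
    )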
This all works fine. The alarms do get triggered and I see the scale out actions happening. But it feels slow to detect the "threshold breach". This is the timing of my load test and AWS events (collated here from different places in the JMeter logs and the AWS console):
10:44:32 start load test (this is the first request timestamp entry in JMeter logs)
10:44:36 4 seconds later (in the JMeter logs), we see that the load test reaches its 1000th request to the ALB. At this point, we're above the threshold and waiting for AWS to detect that...
10:46:10 1m34s later, I can finally see the spike show up in the graph on the CloudWatch alarm detail page, BUT the alarm is still in the OK state!
NOTE: notice the 1m34s delay in detecting the spike; if CloudWatch gets a datapoint every 60 seconds, it should take at most 60 seconds to detect it: my load test blasts out 1000 requests every 4 seconds!
10:46:50 the alarm finally goes from OK to ALARM state
NOTE: at this point, we're 2m14s past the moment when requests started pounding the server at a rate of 1000 requests every 4 seconds!
NOTE: 3 seconds later, after the alarm finally went off, the "scale out" action gets called (awesome, that part is quick):
10:46:53 Action Successfully executed action arn:aws:autoscaling:us-east-1:MYACCOUNTID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster/my-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868
After that, I follow the actions in the ECS "events tab":
10:46:53 Message: Successfully set desired count to 6. Waiting for change to be fulfilled by ecs. Cause: monitor alarm TargetTracking-service/my-ecs-cluster-cce/my-service-AlarmHigh-fae560cc-e2ee-4c6b-8551-9129d3b5a6d3 in state ALARM triggered policy alb-requests-per-target-per-minute
10:47:08 service my-service has started 5 tasks: task 7e9612fa981c4936bd0f33c52cbded72 task e6cd126f265842c1b35c0186c8f9b9a6 task ba4ffa97ceeb49e29780f25fe6c87640 task 36f9689711254f0e9d933890a06a9f45 task f5dd3dad76924f9f8f68e0d725a770c0.
10:47:41 service my-service registered 3 targets in target-group my-tg
10:47:52 service my-service registered 2 targets in target-group my-tg
10:49:05 service my-service has reached a steady state.
NOTE: starting the tasks took 33 seconds; this is very acceptable because I set HealthCheckGracePeriodSeconds to 30 seconds and the health check interval is 30 seconds as well.
NOTE: 3m09s between the time the load started pounding the server and the time the first new ECS tasks were up.
NOTE: most of this time (3m09s) is spent waiting for the alarm to go off (2m20s)! The rest is normal: waiting for the new tasks to start.
Q1: Is there a way to make the alarm trigger faster, and/or as soon as the threshold is breached? To me, this is taking about 1m20s too long. It should really scale up in around 1m30s (1m max to detect the ALARM HIGH state + 30 seconds to start the task)...
Note: I documented my CloudFormation stack in this other question I opened today:
Cloudformation ECS Fargate autoscaling target tracking: 1 custom alarm in 1 minute: Failed to execute action
You can't do much about it. ALB sends metrics to CloudWatch in 1-minute intervals. These metrics are not real-time anyway, so delays are expected, even up to a few minutes, as explained by AWS support and reported in the comments here:
Some delay in metrics is expected, which is inherent for any monitoring systems, as they depend on several variables such as delay with the service publishing the metric, propagation delays, and ingestion delay within CloudWatch, to name a few. I do understand that a consistent 3 or 4 minute delay for ALB metrics is on the higher side.
You would either have to over-provision your ECS service to sustain the increased load until the alarm fires and the scale-out executes, or reduce your thresholds.
Alternatively, you can create your own custom metrics, e.g. published from your app. These metrics can have intervals as short as 1 second. Your app could also "manually" trigger the alarm. This would allow you to reduce the delay you observe.
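For example, a minimal boto3 sketch of publishing a 1-second (high-resolution) custom metric from the app; the namespace, metric name, and dimension are made up for illustration:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a high-resolution (1-second) custom metric; a high-resolution
    # alarm on it can evaluate on 10-second periods instead of waiting for
    # the 1-minute ALB metrics.
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{
            "MetricName": "RequestsPerTarget",
            "Dimensions": [{"Name": "Service", "Value": "my-service"}],
            "Value": 1000.0,
            "StorageResolution": 1,
        }],
    )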
Our website is hosted on AWS on a t2.small instance. User-facing sign-up is currently timing out.
Initially, I was getting a load balancer latency alarm notification for this instance, so I increased the instance size, which seemed to work as a temporary solution.
However, once I increased the limit, I started getting 2 other alarm notifications, which were as follows:
1) production-remove-capacity-alarm
Description: None
Threshold: CPUUtilization <= 40 for 3 datapoints within 15 minutes
2) AWSEBCloudwatchAlarmLow
Description: ElasticBeanstalk Default Scale Down alarm
Threshold: NetworkOut < 2,000,000 for 1 datapoints within 5 minutes
It seems to me that I should simply change the alarm notifications so that I'm no longer alerted to #2, as I don't see how this is interfering with anything, but please correct me if I seem to be missing something.
Regarding #1, does it seem likely that somehow adjusting CPU Utilization in AWS will solve the timeout issue with website sign-up?
And if so, what specifically ought to be done?
Everything is okay. Don't panic.
The first priority is that your application operates correctly. Hopefully your adjustment to the instance type satisfactorily fixed this (but it is still worth watching).
The above two alarms are basically saying:
CPU is under 40%
There's not a lot of network traffic
These alarms can be used to scale-in instances (reduce the number of instances) so that you are not paying for excess capacity. There would be similar alarms that let you scale-out (add additional instances).
ALARM simply means the check is True. That is, the condition has been satisfied. It does not necessarily indicate a problem.
I'm going to presume that you currently have only one instance running. If so, you can ignore those alarms (and Auto Scaling will ignore them too) because you are already at the minimum capacity.
If Auto Scaling has been configured to scale out to more instances, these alarms would later scale in to save you money. They're probably a bit trigger-happy, only looking at 15 minutes of CPU and 5 minutes of network traffic; it would normally be better to wait for a longer period before deciding to remove capacity.
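If you do want the scale-in alarm to be more patient, here is a minimal boto3 sketch of widening its evaluation window (the alarm name and threshold are taken from the question; in practice you would keep the alarm's existing dimensions and actions, which are omitted here for brevity):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Require 6 five-minute datapoints (30 minutes) under 40% CPU before the
    # scale-in alarm fires, instead of 3 datapoints within 15 minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="production-remove-capacity-alarm",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Statistic="Average",
        Period=300,
        EvaluationPeriods=6,
        Threshold=40.0,
        ComparisonOperator="LessThanOrEqualToThreshold",
    )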
Bottom line: If your application is running correctly and you are only operating a single instance, there's nothing to worry about. It's all working as expected.
My primary requirement is as follows:
When CPU consumption on an instance exceeds 50%, adjust the capacity of the Auto Scaling group to 5 instances; when it exceeds 80%, adjust the capacity to 10 instances.
However, if I use CloudWatch alarms to set capacity, I can imagine the following race condition:
5 instances exist
CPU consumption exceeds 80%
Alarm is triggered
Capacity is changed to 10 instances
CPU consumption drops below 50 %
Eventually CPU consumption again exceeds 50%, but now capacity will be changed to 5 instances (which is something I don't want to happen)
So what I would ideally like is that, in response to alarm triggers, capacity is set to at least the corresponding level.
I am aware that this can be done by manually setting the capacity through the AWS SDK, triggered in response to lifecycle events monitored by a supervisor, but is there a better approach, preferably one that does not require setting up additional supervisors or webhooks for alarms?
A general approach is to make the scaling actions more fine-grained. Do not jump in such big steps:
if the ASG avg CPU is over 70% > Add an instance
if the ASG avg CPU is over 90% > Add "n" instances
if the ASG avg CPU is under 40% > remove an instance
if the ASG avg CPU is under 10% > remove "n" instances
All of these values are the average over the last 5 minutes. So if you have a really fast spike, you need more aggressive scaling. That way, in half an hour you can easily add 6 servers or even more.
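As a sketch, those tiers could be expressed as an EC2 Auto Scaling step scaling policy in boto3 (the group name is a placeholder; the returned policy ARN would then be set as the action of a CPU alarm with a 70% threshold):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Step scaling: the bounds are offsets from the alarm threshold (70%),
    # so 0-20 covers 70-90% CPU and 20+ covers everything above 90%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="my-asg",
        PolicyName="cpu-scale-out-steps",
        PolicyType="StepScaling",
        AdjustmentType="ChangeInCapacity",
        MetricAggregationType="Average",
        StepAdjustments=[
            {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 20.0, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 20.0, "ScalingAdjustment": 3},
        ],
    )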
Also, scaling works better with higher numbers. So if your system needs only 1-3 instances, it may make sense to decrease the instance size so you run 2-6 instances instead. It gives your system some extra flexibility.
But again, the question is: what is your expected load? Big spikes, or a predictable rise and fall during the day?
I would suggest looking into an AWS Lambda function, triggered by an SNS message from CloudWatch; it gives you free rein to put as much logic into the scaling decision as you want.
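For instance, a minimal sketch of such a Lambda handler, assuming alarms named cpu-over-50 and cpu-over-80 publish to an SNS topic that triggers the function (the group name and alarm names are illustrative):

    import json
    import boto3

    autoscaling = boto3.client("autoscaling")

    # Minimum capacity each alarm should guarantee (illustrative names).
    CAPACITY_FOR_ALARM = {
        "cpu-over-50": 5,
        "cpu-over-80": 10,
    }

    def handler(event, context):
        message = json.loads(event["Records"][0]["Sns"]["Message"])
        wanted = CAPACITY_FOR_ALARM.get(message.get("AlarmName"))
        if wanted is None:
            return

        group = autoscaling.describe_auto_scaling_groups(
            AutoScalingGroupNames=["my-asg"]
        )["AutoScalingGroups"][0]

        # Only ever raise capacity here; never lower it, which avoids the race
        # described in the question.
        if group["DesiredCapacity"] < wanted:
            autoscaling.set_desired_capacity(
                AutoScalingGroupName="my-asg",
                DesiredCapacity=wanted,
                HonorCooldown=True,
            )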
Good Luck!
I have an auto scale group with triggers as follows:
Average CPU Utilization > 90%: scale up by 1 instance
Average CPU Utilization < 25%: scale down by 1 instance
The metric is being calculated every 2 minutes and the breach limit is 10 minutes.
The problem I am experiencing is that the triggers seem to fire constantly. Instances are being created and destroyed every 10 minutes. I have been monitoring the CPU utilization and it never surpasses the scale-up threshold. The maximum it hits is around 80%, and that only happened once; most of the time it is in the 20 to 25% range. I normally have only 1 instance running, but every 10 minutes ELB will create a new instance, and soon after it will terminate it.
Is there anything I am doing wrong here? Am I misunderstanding how average CPU utilization works?
The new EC2 instances are being created by Auto Scaling (not the Load Balancer).
There is a "Scaling History" tab in the Auto Scaling group that might provide some hints as to what is triggering the scale-out policy.
Check whether "Detailed Monitoring" is enabled on the Auto Scaling group and/or Launch Configuration -- this will cause metrics (eg CPU) to be collected every 1 minute instead of the default 5 minutes.
Check the setting on your CloudWatch chart to match the metric collection interval -- if metrics are being collected every minute, set the CloudWatch chart to 1-minute also. Otherwise, you might be viewing metrics at a lower "resolution" than the alarm itself.
Worst case, increase the timing settings for the Alarm, such as "Above 90% for 2 consecutive periods" rather than just one period.
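If you want to pull that scaling history programmatically rather than through the console, a quick boto3 sketch (the group name is a placeholder):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Show recent scaling activities and what caused them.
    activities = autoscaling.describe_scaling_activities(
        AutoScalingGroupName="my-asg", MaxRecords=20
    )["Activities"]
    for activity in activities:
        print(activity["StartTime"], activity["Description"])
        print("  cause:", activity["Cause"])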