I'm load testing my auto-scaling AWS ECS Fargate stack, which comprises:
Application Load Balancer (ALB) with a target group pointing to ECS,
ECS Cluster, Service, Task, ApplicationAutoScaling::ScalableTarget, and ApplicationAutoScaling::ScalingPolicy,
the application auto scaling policy defines a target tracking policy:
type: TargetTrackingScaling,
PredefinedMetricType: ALBRequestCountPerTarget,
threshold = 1000 requests
the alarm is triggered when 1 datapoint breaches the threshold within a 1-minute evaluation period.
This all works fine. The alarms do get triggered and I see the scale out actions happening. But it feels slow to detect the "threshold breach". This is the timing of my load test and AWS events (collated here from different places in the JMeter logs and the AWS console):
10:44:32 start load test (this is the first request timestamp entry in JMeter logs)
10:44:36 4 seconds later (in the JMeter logs), we see that the load test reaches its 1000th request to the ALB. At this point in time, we're above the threshold and waiting for AWS to detect that...
10:46:10 1m34s later, I can finally see the spike show up in the alarm graph on the CloudWatch alarm detail page, BUT the alarm is still in OK state!
NOTE: notice the 1m34s delay in detecting the spike; if it gets a datapoint every 60 seconds, it should take at most 60 seconds to detect it: my load test blasts out 1000 requests every 4 seconds!!
10:46:50 the alarm finally goes from OK to ALARM state
NOTE: at this point, we're 2m14s past the moment when requests started pounding the server at a rate of 1000 requests every 4 seconds!
NOTE: 3 seconds later, after the alarm finally went off, the "scale out" action gets called (awesome, that part is quick):
14:46:53 Action Successfully executed action arn:aws:autoscaling:us-east-1:MYACCOUNTID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster/my-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868
After that, I follow the actions in the ECS "events tab":
10:46:53 Message: Successfully set desired count to 6. Waiting for change to be fulfilled by ecs. Cause: monitor alarm TargetTracking-service/my-ecs-cluster-cce/my-service-AlarmHigh-fae560cc-e2ee-4c6b-8551-9129d3b5a6d3 in state ALARM triggered policy alb-requests-per-target-per-minute
10:47:08 service my-service has started 5 tasks: task 7e9612fa981c4936bd0f33c52cbded72 task e6cd126f265842c1b35c0186c8f9b9a6 task ba4ffa97ceeb49e29780f25fe6c87640 task 36f9689711254f0e9d933890a06a9f45 task f5dd3dad76924f9f8f68e0d725a770c0.
10:47:41 service my-service registered 3 targets in target-group my-tg
10:47:52 service my-service registered 2 targets in target-group my-tg
10:49:05 service my-service has reached a steady state.
NOTE: starting the tasks took 33 seconds; this is very acceptable because I set HealthCheckGracePeriodSeconds to 30 seconds and the health check interval is 30 seconds as well.
NOTE: 3m09s between the time the load started pounding the server and the time the first new ECS tasks are up
NOTE: most of this time (3m09s) is due to waiting for the alarm to go off (2m20s)!! The rest is normal: waiting for the new tasks to start.
Q1: Is there a way to make the alarm trigger faster and/or as soon as the threshold is breached? To me, this is taking about 1m20s too long. It should really scale up in around 1m30s (1 minute max to detect the ALARM HIGH state + 30 seconds to start the tasks)...
Note: I documented my CloudFormation stack in this other question I opened today:
Cloudformation ECS Fargate autoscaling target tracking: 1 custom alarm in 1 minute: Failed to execute action
You can't do much about it. The ALB publishes its metrics to CloudWatch at 1-minute intervals. These metrics are not real-time anyway, so delays are expected, even up to a few minutes, as explained by AWS support and reported in the comments here:
Some delay in metrics is expected, which is inherent for any monitoring system, as they depend on several variables such as delay with the service publishing the metric, propagation delays and ingestion delay within CloudWatch, to name a few. I do understand that a consistent 3 or 4 minute delay for ALB metrics is on the higher side.
You would either have to over-provision your ECS service to sustain the increased load until the alarm fires and the scale-out executes, or reduce your thresholds.
Alternatively, you can create your own custom metrics, e.g. published from your app. These can be high-resolution metrics with intervals as short as 1 second. Your app could also "manually" trigger the alarm. This would allow you to reduce the delay you observe.
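For example, here is a minimal sketch of publishing such a custom metric from the app with the Node.js aws-sdk v2 (the namespace, metric name, and dimension are made-up placeholders, not anything from your stack):

// Publish a high-resolution (1-second) custom request-count metric from the app.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

async function publishRequestCount(count) {
  await cloudwatch.putMetricData({
    Namespace: 'MyApp',                        // hypothetical namespace
    MetricData: [{
      MetricName: 'RequestsPerTarget',         // hypothetical metric name
      Dimensions: [{ Name: 'Service', Value: 'my-service' }],
      Value: count,
      Unit: 'Count',
      StorageResolution: 1                     // high-resolution metric, 1-second granularity
    }]
  }).promise();
}

A high-resolution alarm on such a metric can evaluate on 10-second periods, which removes most of the detection delay you see with the 1-minute ALB metrics.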
Related
I would like to get a list of all AWS EC2 instances that are idle and underutilized. I need to filter them based on CPU utilization of less than 2% and network I/O of less than 5 MB over the last 30 days.
Can you please provide me with commands or scripts to get that list, or guide me on how to achieve it?
I need this list so that I can terminate those instances for cost management.
You can create a CloudWatch alarm that is triggered when the average CPU utilization percentage has been lower than 2 percent for 24 hours, signaling that it is idle and no longer in use. You can adjust the threshold, duration, and period to suit your needs, plus you can add an SNS notification so that you will receive an email when the alarm is triggered.
Kindly refer to this documentation to create an alarm to terminate an idle instance using the Amazon CloudWatch console.
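If you prefer a script over per-instance alarms, a rough sketch with the Node.js aws-sdk v2 could look like this (the region is an assumption; the 2% and 5 MB thresholds mirror the question):

// List running instances whose 30-day average CPU is below 2%
// and whose total NetworkIn + NetworkOut over 30 days is below 5 MB.
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2({ region: 'us-east-1' });
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

const THIRTY_DAYS_MS = 30 * 24 * 3600 * 1000;

async function metricValue(instanceId, metricName, statistic) {
  const res = await cloudwatch.getMetricStatistics({
    Namespace: 'AWS/EC2',
    MetricName: metricName,
    Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
    StartTime: new Date(Date.now() - THIRTY_DAYS_MS),
    EndTime: new Date(),
    Period: 86400,                              // one datapoint per day
    Statistics: [statistic]
  }).promise();
  const values = res.Datapoints.map(d => d[statistic]);
  if (values.length === 0) return 0;
  const sum = values.reduce((a, b) => a + b, 0);
  return statistic === 'Average' ? sum / values.length : sum;
}

async function findIdleInstances() {
  const data = await ec2.describeInstances({
    Filters: [{ Name: 'instance-state-name', Values: ['running'] }]
  }).promise();
  const ids = data.Reservations.flatMap(r => r.Instances.map(i => i.InstanceId));
  for (const id of ids) {
    const avgCpu = await metricValue(id, 'CPUUtilization', 'Average');
    const netBytes = await metricValue(id, 'NetworkIn', 'Sum')
                   + await metricValue(id, 'NetworkOut', 'Sum');
    if (avgCpu < 2 && netBytes < 5 * 1024 * 1024) {
      console.log(`${id} looks idle: avg CPU ${avgCpu.toFixed(2)}%, network ${netBytes} bytes`);
    }
  }
}

findIdleInstances().catch(console.error);

Review the resulting list before terminating anything; a low 30-day average can still hide short bursts of legitimate activity.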
I have some ECS tasks running on AWS Fargate which, in very rare cases, may "die" internally but still show as RUNNING, so they never fail and trigger a task restart.
What I would like to do, if possible, is check for the absence of logs, e.g. if logs haven't been written in 30 minutes, trigger a Lambda to kill the ECS task, which will cause it to start back up.
The health check functionality isn't sufficient.
If this isn't possible, are there any other approaches I could consider?
You can use a metric with anomaly detection, but it may cost money for the metric to process the logs, and the alarm may cost too. I would rather run a Lambda every 30 minutes that checks whether logs are there and then kills the ECS task as needed. You can run the Lambda on an interval with EventBridge (CloudWatch Events).
Logs are probably sent from your ECS task to a CloudWatch Logs log group. If the log group has a static name, you can use the SDK to describe the streams inside the group. This API call will tell you the timestamp of the last data in each stream.
Inside the Lambda Node.js runtime, aws-sdk v2 is already present, so you can require it without installing. Here are the docs for v2:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/CloudWatchLogs.html#describeLogStreams-property
Set orderBy: "LastEventTime" and, to save network time, reduce the limit from the default 50 to 1 (limit: 1); the result will contain lastEventTimestamp.
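Putting that together, a Lambda handler along these lines could work (the log group, cluster, and service names are placeholders; the 30-minute threshold matches the question):

// Lambda (Node.js, aws-sdk v2): stop the ECS tasks if no log events arrived in the last 30 minutes.
const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs();
const ecs = new AWS.ECS();

const LOG_GROUP = '/ecs/my-service';           // placeholder log group name
const CLUSTER = 'my-cluster';                  // placeholder cluster name
const SERVICE = 'my-service';                  // placeholder service name
const MAX_SILENCE_MS = 30 * 60 * 1000;

exports.handler = async () => {
  const res = await logs.describeLogStreams({
    logGroupName: LOG_GROUP,
    orderBy: 'LastEventTime',
    descending: true,
    limit: 1                                    // only the most recently written stream
  }).promise();

  const last = res.logStreams[0] && res.logStreams[0].lastEventTimestamp;
  if (last && Date.now() - last < MAX_SILENCE_MS) {
    return 'logs are fresh, nothing to do';
  }

  // No recent logs: stop the service's tasks so ECS replaces them.
  const tasks = await ecs.listTasks({ cluster: CLUSTER, serviceName: SERVICE }).promise();
  for (const taskArn of tasks.taskArns) {
    await ecs.stopTask({ cluster: CLUSTER, task: taskArn, reason: 'No logs for 30 minutes' }).promise();
  }
  return `stopped ${tasks.taskArns.length} task(s)`;
};

Keep in mind that lastEventTimestamp is updated on an eventual-consistency basis, so it can lag behind the latest log event; factor that into the threshold you choose.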
anomaly detection:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html
alarms:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
Check the pricing for these; there is a free tier, so maybe it won't cost you anything, yet it's easy to build up real spend with CloudWatch. https://aws.amazon.com/cloudwatch/pricing/
To run the Lambda on an interval, use a scheduled EventBridge (CloudWatch Events) rule.
I have received the following alarm message daily at the same time for my Amazon SQS queue.
You are receiving this email because your Amazon CloudWatch Alarm "Old Messages in SQS" in the {my region} region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Wednesday 30 September, 2020 00:07:22 UTC".
Alarm Details:
Name: Old Messages in SQS
Description: Abc updates take too long. Check the processor and queue.
State Change: OK -> ALARM
Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition).
Timestamp: Wednesday 30 September, 2020 00:07:22 UTC
Threshold:
The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 180.0 for 60 seconds.
Monitored Metric:
MetricNamespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Period: 60 seconds
Statistic: Average
Unit: not specified
State Change Actions:
OK:
INSUFFICIENT_DATA:
So I checked CloudWatch to see what was happening. I identified that the CPU utilization of the instance that processes the SQS messages drops at that same time. So I concluded that messages were piling up in SQS because the server was down.
But I can't identify why the server goes down at the same time every day. I checked the following:
EC2 snapshots - no automated schedules
RDS snapshots - no automated schedules at that time
Cron jobs in the server
Has anyone experienced this kind of issue? Any help identifying the exact cause would be highly appreciated.
This is super late to the party and I assume by now you either figured it out or you moved on. Sharing some thoughts for posterity.
The metric that went over the 3-minute threshold is the age of the oldest item on the SQS queue. The cause of the alert is not necessarily that the instance processing messages went down; it could be that the processing routine is blocked (waiting) on something external, such as a long-running network call.
During the waiting state the CPU is not used much, so CPU utilization could be close to 0 if no other messages are being processed. Again, this is just a guess.
What I would do in this situation is:
Add/check CloudWatch logs that are emitted within the processing routine and verify processing is not stuck.
Check instance events / logs (by these I mean instance starting/stopping, maintenance, etc.) to verify nothing external was affecting the processing instance.
In a number of cases the answer could be an error on Amazon's side, so when stuck, check with Amazon support.
I created a service in AWS ECS and configured the maximum number of tasks to be 30, with a desired count of 2. I also configured CPUUtilization >= 75 as the scale-up trigger and CPUUtilization < 25 as the scale-down trigger.
I use Artillery to send 200 requests per second to my service on ECS, and I can see the CPU usage reach 100%. The test lasts 2 minutes, so 24,000 requests in total.
But the number of tasks under the service stays at 2. It seems that the service never scales up based on CPU usage. I wonder what I have missed in the configuration.
Check your alarms in CloudWatch > Alarms; there you can see whether your alarms are actually firing. You will also see a threshold description like "CPUUtilization < 10 for 10 datapoints within 10 minutes".
My guess would be that your threshold is misconfigured, i.e. you check too many datapoints over too long a period. Start with something like "1 datapoint within 1 minute" and work from there.
It also helps to check the "Activity History" in EC2 > Auto Scaling Groups. Together with CloudWatch, this normally gives you a pretty good picture of what was scaled (or not) and why.
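If you prefer the SDK over the console, a quick way to dump the state, reason, and evaluation settings of your scaling alarms might look like this (the region and alarm-name prefix are assumptions; adjust them to match your alarms):

// Print state, reason, and evaluation settings of the scaling alarms.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

cloudwatch.describeAlarms({ AlarmNamePrefix: 'TargetTracking' }).promise()
  .then(res => {
    for (const a of res.MetricAlarms) {
      console.log(`${a.AlarmName}: ${a.StateValue} (${a.StateReason})`);
      console.log(`  ${a.MetricName} ${a.ComparisonOperator} ${a.Threshold}, ` +
                  `${a.EvaluationPeriods} datapoint(s) x ${a.Period}s`);
    }
  })
  .catch(console.error);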
I am trying to set up an EC2 scaling group that scales depending on how many items are in an SQS queue.
When the SQS queue has visible items I need the scaling group to have 1 instance available, and when the SQS queue is empty (i.e. there are no visible or non-visible messages) I want there to be 0 instances.
Desired instances is set to 0, min is set to 0 and max is set to 1.
I have set up CloudWatch alarms on my SQS queue to trigger when visible messages are greater than zero, and also to trigger an alarm when non-visible messages are less than one (i.e. no more work to do).
Currently the CloudWatch alarm triggers the creation of an instance, but then the scaling group automatically kills the instance to meet the desired-capacity setting. I expected the alarm to adjust the desired instance count within the min and max settings, but this seems not to be the case.
Yes, you can certainly have an Auto Scaling group with:
Minimum = 0
Maximum = 1
Alarm: When ApproximateNumberOfMessagesVisible > 0 for 1 minute, Add 1 Instance
This will cause Auto Scaling to launch an instance when there are messages waiting in the queue. It will keep trying to launch more instances, but the Maximum setting will limit it to 1 instance.
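As a sketch of that scale-out side with the Node.js aws-sdk v2 (the ASG and queue names are placeholders), you could wire the alarm to a simple-scaling policy like this:

// Scale out by 1 instance when the queue has visible messages.
const AWS = require('aws-sdk');
const autoscaling = new AWS.AutoScaling({ region: 'us-east-1' });   // region is an assumption
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

async function createScaleOut() {
  const policy = await autoscaling.putScalingPolicy({
    AutoScalingGroupName: 'my-worker-asg',       // placeholder ASG name
    PolicyName: 'scale-out-on-messages',
    AdjustmentType: 'ChangeInCapacity',
    ScalingAdjustment: 1
  }).promise();

  await cloudwatch.putMetricAlarm({
    AlarmName: 'sqs-messages-visible',
    Namespace: 'AWS/SQS',
    MetricName: 'ApproximateNumberOfMessagesVisible',
    Dimensions: [{ Name: 'QueueName', Value: 'my-queue' }],   // placeholder queue name
    Statistic: 'Sum',
    Period: 60,
    EvaluationPeriods: 1,
    Threshold: 0,
    ComparisonOperator: 'GreaterThanThreshold',
    AlarmActions: [policy.PolicyARN]             // fire the scale-out policy on ALARM
  }).promise();
}

createScaleOut().catch(console.error);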
Scaling in when there are no messages is a little bit trickier.
Firstly, it can be difficult to actually know when to scale in. If there are messages waiting to be processed, then ApproximateNumberOfMessagesVisible will be greater than zero. However, if there are no messages waiting, it doesn't necessarily mean you wish to scale in, because messages might currently be processing ("in flight"), as indicated by ApproximateNumberOfMessagesNotVisible. So, you only want to scale in if both of these are zero. Unfortunately, a CloudWatch alarm can only reference one metric, not two.
Secondly, when an Amazon SQS queue is empty, it does not send metrics to Amazon CloudWatch. This sort of makes sense, because queues are mostly empty, so it would be continually sending a zero metric. However, it causes a problem in that CloudWatch does not receive a metric when the queue is empty. Instead, the alarm will enter the INSUFFICIENT_DATA state.
Therefore, you could create your alarm as:
When ApproximateNumberOfMessagesVisible = 0 for 15 minutes, Remove 1 instance but set the action to trigger on INSUFFICIENT_DATA rather than ALARM
Note the suggested "15 minutes" delay to avoid thrashing instances. This is where instances are added and removed in rapid succession because messages are coming in regularly, but infrequently. Therefore, it is better to wait a while before deciding to scale in.
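Continuing the sketch above (same placeholder names), the scale-in alarm would attach a remove-one-instance policy to the alarm's INSUFFICIENT_DATA action rather than its ALARM action:

// Scale in after ~15 minutes of an empty queue. Because an empty SQS queue stops
// publishing the metric, the policy is attached to InsufficientDataActions.
const AWS = require('aws-sdk');
const autoscaling = new AWS.AutoScaling({ region: 'us-east-1' });
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

async function createScaleIn() {
  const policy = await autoscaling.putScalingPolicy({
    AutoScalingGroupName: 'my-worker-asg',       // same placeholder ASG as above
    PolicyName: 'scale-in-on-empty-queue',
    AdjustmentType: 'ChangeInCapacity',
    ScalingAdjustment: -1
  }).promise();

  await cloudwatch.putMetricAlarm({
    AlarmName: 'sqs-queue-empty',
    Namespace: 'AWS/SQS',
    MetricName: 'ApproximateNumberOfMessagesVisible',
    Dimensions: [{ Name: 'QueueName', Value: 'my-queue' }],
    Statistic: 'Sum',
    Period: 60,
    EvaluationPeriods: 15,                       // 15 minutes, per the suggestion above
    Threshold: 0,
    ComparisonOperator: 'LessThanOrEqualToThreshold',
    InsufficientDataActions: [policy.PolicyARN]  // trigger on INSUFFICIENT_DATA, not ALARM
  }).promise();
}

createScaleIn().catch(console.error);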
This leaves the problem of having instances terminated while they are still processing messages. This can be avoided by taking advantage of Auto Scaling Lifecycle Hooks, which send a signal when an instance is about to be terminated, giving the application the opportunity to delay the termination until work is complete. Your application should then signal that it is ready for termination only when message processing has finished.
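A minimal sketch of such a lifecycle hook, again with placeholder names and an assumed 5-minute heartbeat, might look like:

// Register a termination lifecycle hook so instances can finish in-flight messages.
const AWS = require('aws-sdk');
const autoscaling = new AWS.AutoScaling({ region: 'us-east-1' });

autoscaling.putLifecycleHook({
  AutoScalingGroupName: 'my-worker-asg',         // placeholder ASG name
  LifecycleHookName: 'drain-before-terminate',
  LifecycleTransition: 'autoscaling:EC2_INSTANCE_TERMINATING',
  HeartbeatTimeout: 300,                         // seconds to wait for the worker to finish
  DefaultResult: 'CONTINUE'
}).promise().catch(console.error);

// The worker signals readiness once it has finished processing its current message:
autoscaling.completeLifecycleAction({
  AutoScalingGroupName: 'my-worker-asg',
  LifecycleHookName: 'drain-before-terminate',
  InstanceId: 'i-0123456789abcdef0',             // the worker's own instance id (placeholder)
  LifecycleActionResult: 'CONTINUE'
}).promise().catch(console.error);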
Bottom line
Much of the above depends upon:
How often your application receives messages
How long it takes to process a message
The cost savings involved
If your messages are infrequent and simple to process, it might be worthwhile to continuously run a t2.micro instance. At 2c/hour, the benefit of scaling-in is minor. Also, there is always the risk when adding and removing instances that you might actually pay more, because instances are charged by the hour -- running an instance for 30 minutes, terminating it, then launching another instance for 30 minutes will actually be charged as two hours.
Finally, you could consider using AWS Lambda instead of an Amazon EC2 instance. Lambda is ideal for short-lived code execution without requiring a server. It could totally remove the need to use Amazon EC2 instances, and you only pay while the Lambda function is actually running.
For a simple configuration, with per-second AWS AMI/Ubuntu billing, don't worry about the wasted startup/shutdown time; just terminate the EC2 instance yourself, without any ASG scale-in policy. Add a little bash to the client startup code, or preinstall it in cron, and poll for process presence or CPU load, then terminate the EC2 instance or shut it down (termination is better if you attach volumes and need them to auto-destruct) after processing is done. There's an annoying thing about an ASG defined as 0/0/1 (min/desired/max) with defaults and ApproximateNumberOfMessagesNotVisible on SQS: after an EC2 instance is fired up, the group somehow switches to 1/0/1 and it starts looping, firing EC2 instances even if there's nothing in SQS (I'm doing video transcoding, queuing jobs to SNS/SQS and firing ffmpeg nodes with an ASG triggered on a non-empty SQS queue).