Unknown spike in 'Approximate Age of Oldest Message Average' metric (AWS)

I have been receiving the following alarm message daily, at the same time, from my Amazon SQS queue.
You are receiving this email because your Amazon CloudWatch Alarm "Old Messages in SQS" in the {my region} region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Wednesday 30 September, 2020 00:07:22 UTC".
Alarm Details:
Name: Old Messages in SQS
Description: Abc updates take too long. Check the processor and queue.
State Change: OK -> ALARM
Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition).
Timestamp: Wednesday 30 September, 2020 00:07:22 UTC
Threshold:
The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 180.0 for 60 seconds.
Monitored Metric:
MetricNamespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Period: 60 seconds
Statistic: Average
Unit: not specified
State Change Actions:
OK:
INSUFFICIENT_DATA:
So I checked in CloudWatch to see what is happening. I noticed that the CPU utilization of the instance that processes messages from SQS drops at the same time, so I concluded that messages pile up in SQS because the server is down.
But I can't identify why the server goes down at the same time every day. I have checked the following:
EC2 snapshots - no automated schedules
RDS snapshots - no automated schedules at that time
Cron jobs on the server
Has anyone run into this kind of issue? Any help identifying the exact cause would be highly appreciated.

This is super late to the party and I assume by now you either figured it out or you moved on. Sharing some thoughts for posterity.
The metric that went over the 3-minute threshold is the age of the oldest item on the SQS queue. The cause of the alert is not necessarily the instance processing messages going down; it could be that the processing routine is blocked (waiting) on something external, such as a long-running network call.
While waiting, the CPU is barely used, so CPU utilization could be close to 0 if no other messages are being processed. Again, this is just a guess.
What I would do in this situation is:
Add/check CloudWatch logs emitted within the processing routine and verify processing is not stuck.
Check instance events/logs (instance starting/stopping, maintenance, etc.) to verify nothing external was affecting the processing instance.
In a number of cases the answer could be an error on Amazon's side, so when stuck, check with AWS support.
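If it helps to line the two signals up, here is a minimal boto3 sketch (not from the question; the queue name, instance ID and time window are placeholders) that pulls ApproximateAgeOfOldestMessage and CPUUtilization side by side around the daily window, so you can see whether the CPU drop precedes or follows the age spike.

import boto3
from datetime import datetime, timezone

# Placeholders: substitute your own queue name, instance ID and time window.
QUEUE_NAME = "my-queue"
INSTANCE_ID = "i-0123456789abcdef0"

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "oldest_age",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SQS",
                    "MetricName": "ApproximateAgeOfOldestMessage",
                    "Dimensions": [{"Name": "QueueName", "Value": QUEUE_NAME}],
                },
                "Period": 60,
                "Stat": "Average",
            },
        },
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
                },
                "Period": 60,
                "Stat": "Average",
            },
        },
    ],
    # Window around the time the alarm fires every day (UTC).
    StartTime=datetime(2020, 9, 29, 23, 30, tzinfo=timezone.utc),
    EndTime=datetime(2020, 9, 30, 0, 30, tzinfo=timezone.utc),
)

for result in resp["MetricDataResults"]:
    print(result["Id"])
    for ts, value in zip(result["Timestamps"], result["Values"]):
        print(f"  {ts}  {value:.1f}")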

Related

ECS Fargate autoscaling more rapidly?

I'm load testing my auto-scaling AWS ECS Fargate stack, which consists of:
Application Load Balancer (ALB) with a target group pointing to ECS,
ECS Cluster, Service, Task, ApplicationAutoScaling::ScalableTarget, and ApplicationAutoScaling::ScalingPolicy,
the application auto scaling policy defines a target tracking policy:
type: TargetTrackingScaling,
PredefinedMetricType: ALBRequestCountPerTarget,
threshold = 1000 requests
alarm is triggered when 1 datapoint breaches the threshold in the past 1 minute evaluation period.
This all works fine. The alarms do get triggered and I see the scale out actions happening. But it feels slow to detect the "threshold breach". This is the timing of my load test and AWS events (collated here from different places in the JMeter logs and the AWS console):
10:44:32 start load test (this is the first request timestamp entry in JMeter logs)
10:44:36 4 seconds later (in the JMeter logs), we see that the load test reaches its 1000th request to the ALB. At this point in time, we're above the threshold and waiting for AWS to detect that...
10:46:10 1m34s later, I can finally see the spike show up in alarm graph on the cloudwatch alarm detail page BUT the alarm is still in OK state!
NOTE: notice the 1m34s delay in detecting the spike; if it gets a datapoint every 60 seconds, it should take at most 60 seconds to detect it: my load test blasts out 1000 requests every 4 seconds!!
10:46:50 the alarm finally goes from OK to ALARM state
NOTE: at this point, we're 2m14s past the moment when requests started pounding the server at a rate of 1000 requests every 6 seconds!
NOTE: 3 seconds later, after the alarm finally went off, the "scale out" action gets called (awesome, that part is quick):
10:46:53 Action: Successfully executed action arn:aws:autoscaling:us-east-1:MYACCOUNTID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster/my-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868
After that, I follow the actions in the ECS "events tab":
10:46:53 Message: Successfully set desired count to 6. Waiting for change to be fulfilled by ecs. Cause: monitor alarm TargetTracking-service/my-ecs-cluster-cce/my-service-AlarmHigh-fae560cc-e2ee-4c6b-8551-9129d3b5a6d3 in state ALARM triggered policy alb-requests-per-target-per-minute
10:47:08 service my-service has started 5 tasks: task 7e9612fa981c4936bd0f33c52cbded72 task e6cd126f265842c1b35c0186c8f9b9a6 task ba4ffa97ceeb49e29780f25fe6c87640 task 36f9689711254f0e9d933890a06a9f45 task f5dd3dad76924f9f8f68e0d725a770c0.
10:47:41 service my-service registered 3 targets in target-group my-tg
10:47:52 service my-service registered 2 targets in target-group my-tg
10:49:05 service my-service has reached a steady state.
NOTE: starting the tasks took 33 seconds, which is very acceptable because I set HealthCheckGracePeriodSeconds to 30 seconds and the health check interval is 30 seconds as well.
NOTE: 3m09s elapsed between the time the load started pounding the server and the time the first new ECS tasks were up.
NOTE: most of this time (3m09s) is due to waiting for the alarm to go off (2m20s)!! The rest is normal: waiting for the new tasks to start.
Q1: Is there a way to make the alarm trigger faster and/or as soon as the threshold is breached? To me, this is taking about 1m20s too long. It should really scale up in around 1m30s (1m max to detect the ALARM state + 30 seconds to start the tasks)...
Note: I documented my CloudFormation stack in this other question I opened today:
Cloudformation ECS Fargate autoscaling target tracking: 1 custom alarm in 1 minute: Failed to execute action
You can't do much about it. ALB sends metrics to CloudWatch at 1-minute intervals. Also, these metrics are not real-time anyway, so delays are expected, even up to a few minutes, as explained by AWS support and reported in the comments here:
Some delay in metrics is expected, which is inherent for any monitoring systems- as they depend on several variables such as delay with the service publishing the metric, propagation delays and ingestion delay within CloudWatch to name a few. I do understand that a consistent 3 or 4 minute delay for ALB metrics is on the higher side.
You would either have to over-provision your ECS service to sustain the increased load until the alarm fires and the scale-out executes, or reduce your thresholds.
Alternatively, you can publish your own custom metrics, e.g. from your app. These can be high-resolution metrics with intervals as short as 1 second. Your app could also "manually" trigger the alarm. This would allow you to reduce the delay you observe.
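As a rough illustration of the custom-metric route, here is a boto3 sketch that publishes a high-resolution (1-second) metric from the application itself; the namespace, metric name, dimension and value are made-up examples, not anything ALB or ECS publishes for you.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical value measured by the app over the last second.
requests_last_second = 1200

# StorageResolution=1 marks this as a high-resolution metric, so datapoints
# can be emitted (and alarmed on) at 1-second granularity.
cloudwatch.put_metric_data(
    Namespace="MyApp",  # placeholder custom namespace
    MetricData=[
        {
            "MetricName": "RequestsPerTarget",
            "Dimensions": [{"Name": "Service", "Value": "my-service"}],
            "Value": requests_last_second,
            "Unit": "Count",
            "StorageResolution": 1,
        }
    ],
)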

AWS CloudWatch - No alarm mail sent on Lambda timeout of 15 min

I have a strange problem which I don't understand.
I have created a CloudWatch alarm which should notify me of errors within a Lambda execution (including timeouts).
The relevant parameters of the alarm are the following:
period = "300"
datapoints_to_alarm = "1"
evaluation_periods = "1"
treat_missing_data = "notBreaching"
statistic = "Sum"
threshold = "0"
metric_name = "Errors"
namespace = "AWS/Lambda"
alarm_actions = [aws_sns_topic.alarm.arn]
When my Lambda runs into a timeout after 15 min (the maximum Lambda execution time), no email is sent to me.
When my Lambda runs into a timeout after 2, 6, 10 or 14 minutes, I get the notification email as expected. Even at 14 minutes and 30 seconds, the mail is sent. Beyond 14:30 minutes, the alarm doesn't switch to the ALARM state.
Does anybody know why that happens?
The datapoint (error) is shown correctly in the metric. It seems that the datapoint is timestamped with the start time of the Lambda. Might that be the problem, since 3 evaluation periods have already elapsed since the Lambda started? But then why do I get the alarm mail when it times out after 14 minutes (also more than one evaluation period)?
Already asked this question in AWS Forum but no answer yet.
Can anyone suggest what I'm doing wrong?
Regards Hannes
According to the AWS documentation on Lambda function metrics, the timestamp on a metric reflects when the function was invoked. Depending on the duration of the execution, this can be several minutes before the metric is emitted.
So, for example, if your function has a 15-minute timeout, you should look more than 15 minutes into the past for accurate metrics. Since CloudWatch evaluates alarms over absolute time windows, you should set the alarm's EvaluationPeriods parameter higher than 1, and then set the DatapointsToAlarm and Period parameters according to your Lambda's timeout and how often you want the alarm state to be sampled before raising notifications. More details about these parameters can be found here.
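A minimal boto3 sketch of what such an alarm could look like; the function name, topic ARN and exact numbers are assumptions, chosen so that the evaluation window (4 x 5 minutes) is longer than the 15-minute Lambda timeout.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Sketch only: function name, topic ARN and the exact numbers are assumptions.
# The idea is a 4 x 300 s = 20 min evaluation window, longer than the 15 min
# Lambda timeout, with a single breaching datapoint being enough to alarm.
cloudwatch.put_metric_alarm(
    AlarmName="lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=4,
    DatapointsToAlarm=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alarm-topic"],
)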

AWS Cloudwatch Alarm status

I have set a CloudWatch alarm to trigger SNS mail whenever certain keywords are found in CloudWatch Logs (using a metric filter).
When those keywords are detected, the alarm state changes from INSUFFICIENT_DATA to ALARM and triggers the SNS topic.
However, moving back from ALARM to INSUFFICIENT_DATA takes a seemingly random amount of time.
Is there a specific way this works? I expect it to return to INSUFFICIENT_DATA immediately after the ALARM state.
Any help would be appreciated. Thanks
The alarm has a metric period of 60 seconds and some number of evaluation periods (let's suppose 3, for a total of 3 * 60 = 180 seconds, i.e. a 3-minute evaluation window).
The alarm will be in ALARM state if all of the last 3 datapoints at 60-second intervals are breaching (above the threshold).
If any one of the last 3 datapoints is below the threshold, the alarm will transition to OK.
BUT, if all 3 of the latest datapoints are missing (say your metric filter did not match, and as a result no metric was pushed), the alarm waits longer than 3 periods to transition to INSUFFICIENT_DATA; this is by design, to accommodate network or processing delays.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
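If the goal is simply for the alarm to leave the ALARM state quickly when the metric filter stops matching, one option (a sketch with placeholder names, not the questioner's setup) is to treat missing data as not breaching, so missing datapoints count towards OK rather than leaving the alarm waiting to reach INSUFFICIENT_DATA:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names throughout; the interesting parameter is TreatMissingData,
# which decides what happens when the metric filter pushes no datapoints at all.
cloudwatch.put_metric_alarm(
    AlarmName="log-keyword-alarm",
    Namespace="MyLogMetrics",        # custom namespace of the metric-filter metric
    MetricName="ErrorKeywordCount",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # missing data counts as OK instead of lingering
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alarm-topic"],
)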
I came across the same situation, using a period of 1 minute and a simple x > threshold condition.
The state changes to ALARM immediately whenever the metric exceeds the threshold, but changing back to OK/INSUFFICIENT_DATA takes about 6 minutes. This happens only with missing data.
As per AWS Support this is the expected behavior of CloudWatch alarms; a clear explanation can be found here: https://forums.aws.amazon.com/thread.jspa?threadID=284182

AWS CloudWatch alarm for SQS Number of Messages Visible

I am trying to capture the event of a new message arriving in my FIFO queue (as I want to avoid infinitely polling the queue).
For this purpose I am evaluating the CloudWatch alarm option with the ApproximateNumberOfMessagesVisible metric.
Following is my alarm description:
Threshold: The condition in which the alarm will go to the ALARM state. ApproximateNumberOfMessagesVisible >= 0 for 1 minute
Actions: The actions that will occur when the alarm changes state.
In ALARM:
Send message to topic "topic_for_events_generated_bycloudwatch" (xyz#xyz)
Send message to topic "topic_for_events_generated_bycloudwatch"
Period: The granularity of the datapoints for the monitored metric. 1 minute
Following are my queries -
Assuming there are more than 0 messages in the given queue, will this alarm be raised only once when the condition is met, or every minute?
During a quick test I saw the alarm keep moving between the INSUFFICIENT_DATA and ALARM states in random order, without any configuration changes. What could be the rationale?
Screenshot of ApproximateNumberOfMessagesVisible metric graph
Screenshot of the log activity
Thanks in advance.
Regards,
Rohan K
CloudWatch will invoke the alarm action once when the threshold breach causes a state transition.
From the Docs
Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state, the state must have changed and been maintained for a specified number of periods.
But
After an alarm invokes an action due to a change in state, its subsequent behavior depends on the type of action that you have associated with the alarm. For Amazon EC2 and Auto Scaling actions, the alarm continues to invoke the action for every period that the alarm remains in the new state. For Amazon SNS notifications, no additional actions are invoked.
An Example:
In the following figure, the alarm threshold is set to 3 units and the alarm is evaluated over 3 periods. That is, the alarm goes to ALARM state if the oldest of the 3 periods being evaluated is breaching, and the 2 subsequent periods are either breaching or missing. In the figure, this happens with the third through fifth time periods, and the alarm's state is set to ALARM. At period six, the value dips below the threshold, and the state reverts to OK. Later, during the ninth time period, the threshold is breached again, but for only one period. Consequently, the alarm state remains OK.
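Regarding the second observation in the question: a threshold of ApproximateNumberOfMessagesVisible >= 0 is breached by every datapoint, since the metric can never be negative, so the alarm can only leave the ALARM state when datapoints are missing, which may explain the flipping between ALARM and INSUFFICIENT_DATA. Here is a boto3 sketch of the same alarm with a >= 1 threshold instead; the queue name and topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder queue name and topic ARN. The SNS action fires once per
# transition into ALARM, not every minute the alarm stays in ALARM.
cloudwatch.put_metric_alarm(
    AlarmName="messages-visible-in-queue",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue.fifo"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:topic_for_events_generated_bycloudwatch"],
)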

Amazon Cloudwatch alarm not triggered

I have a CloudWatch alarm configured:
Threshold : "GreaterThan 0" for 1 consecutive period,
Period : 1 minute,
Statistic : Sum
The alarm is configured on the AWS SQS NumberOfMessagesSent metric. The queue was empty and no messages were being published to it. I sent a message manually. I could see the spike in the metric, but the state of the alarm was still OK. I am a bit confused why this alarm is not changing its state even though all the conditions to trigger it are met.
I just overcame this problem with the help of AWS support. You need to set the period on your alarm to ~15 minutes. It's got to do with how SQS marks the event's timestamps as it pushes them to CloudWatch.
Don't worry, as setting the period to a greater number will not affect how quickly you are alerted of an alarm. It will still get data from SQS every 5 minutes.
It could be that the period is set to less than 300 seconds. The free CloudWatch tier checks every 5 minutes, so if you set an alarm period shorter than that, you will sometimes get INSUFFICIENT_DATA.
Sometimes they suffer from something called "delayed metric delivery"; it's more common when the alarm period is narrow, like 1 minute.
When the delayed datapoint arrives, it's too late for the alarm, but not for the graph, which eventually renders it without a gap.
Play with Evaluation Periods and Datapoints to Alarm; rather than 1/1, maybe 3/2 or 3/1 would work fine.
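Putting those suggestions together, here is a hedged boto3 sketch of the same alarm with a 300-second period and a 3/1 evaluation window; the queue name and topic ARN are placeholders, not the asker's actual resources.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Sketch only: queue name and topic ARN are placeholders. A 300-second period
# with a 3/1 evaluation window gives delayed SQS datapoints time to arrive.
cloudwatch.put_metric_alarm(
    AlarmName="messages-sent-to-queue",
    Namespace="AWS/SQS",
    MetricName="NumberOfMessagesSent",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=3,
    DatapointsToAlarm=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:alarm-topic"],
)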