SQS error in which ApproximateAgeOfOldestMessage alarm is high - amazon-web-services

I have an SQS queue with a visibility timeout of 30 minutes, and we have set up an ApproximateAgeOfOldestMessage alarm with a threshold of 14000 for 1 datapoint within 1 minute. The alarm goes into ALARM every time, the issue keeps recurring, and I am not sure what to do. Any suggestions I can follow? The queue's retention period is 1 day. Thanks

Related

Unknown spike in 'Approximate Age of Oldest Message Average' metric AWS

I have received the following alarm message daily at the same time from my Amazon SQS queue.
You are receiving this email because your Amazon CloudWatch Alarm "Old Messages in SQS" in the {my region} region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Wednesday 30 September, 2020 00:07:22 UTC".
Alarm Details:
Name: Old Messages in SQS
Description: Abc updates take too long. Check the processor and queue.
State Change: OK -> ALARM
Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition).
Timestamp: Wednesday 30 September, 2020 00:07:22 UTC
Threshold:
The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 180.0 for 60 seconds.
Monitored Metric:
MetricNamespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Period: 60 seconds
Statistic: Average
Unit: not specified
State Change Actions:
OK:
INSUFFICIENT_DATA:
So I checked CloudWatch to see what was happening, and I identified that CPU utilization of the instance that processes SQS messages drops at the same time. So I concluded that messages accumulated in SQS because the server went down.
But I can't identify why the server goes down at the same time every day. I checked the following:
EC2 snapshots - no automated schedules
RDS snapshots - no automated schedules at that time
Cron jobs on the server
Any input from anyone who has had this kind of experience and can help identify the exact issue would be highly appreciated.
This is super late to the party and I assume by now you either figured it out or you moved on. Sharing some thoughts for posterity.
The metric that went over the 3-minute threshold is the age of the oldest item on the SQS queue. The cause of the alert is not necessarily that an instance processing messages went down; it could be that the processing routine is blocked (waiting) on something external. This could be a network call that takes a long time, or something like that.
While waiting, the CPU is not used much, so CPU utilization could be close to 0 if no other messages are being processed. Again, this is just a guess.
What I would do in this situation (a small metric-correlation sketch follows this list):
Add/check CloudWatch logs emitted within the processing routine and verify processing is not stuck.
Check instance events/logs (by these I mean instance starting/stopping, maintenance, etc.) to verify nothing external was affecting the processing instance.
In a number of cases the answer could be an error on Amazon's side, so when stuck, check with AWS Support.
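Here is a minimal sketch of that correlation check, assuming boto3 and placeholder names for the queue and instance (swap in your own); it pulls the queue's ApproximateAgeOfOldestMessage next to the instance's CPUUtilization around the alarm time so you can see whether the CPU dip and the age spike line up:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Window around the alarm from the email above (placeholder; adjust as needed).
end = datetime(2020, 9, 30, 0, 30, tzinfo=timezone.utc)
start = end - timedelta(hours=1)

def fetch(namespace, metric, dimensions, stat):
    # Return (timestamp, value) pairs for one metric, sorted by time.
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=[stat],
    )
    return sorted((p["Timestamp"], p[stat]) for p in resp["Datapoints"])

# "my-queue" and the instance ID below are hypothetical placeholders.
age = fetch("AWS/SQS", "ApproximateAgeOfOldestMessage",
            [{"Name": "QueueName", "Value": "my-queue"}], "Maximum")
cpu = fetch("AWS/EC2", "CPUUtilization",
            [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}], "Average")

for ts, value in age:
    print("age", ts.isoformat(), value)
for ts, value in cpu:
    print("cpu", ts.isoformat(), value)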

AWS - False Billing Alert at 0.00 USD at the start of the month

I have a billing alarm set as: Whenever charges for: EstimatedCharges is: >= USD $100
So I assume the alarm should trigger when my billing cost is above 100 USD.
But earlier today I received a billing alert for my alarm that said:
The alarm limit you set was $ 100.00 USD. Your total estimated charges accrued for this billing period are currently $ .00 USD as of
Saturday 01 December, 2018 11:24:23 UTC
But when I checked CloudWatch, the alarm state was OK:
State changed to OK at 2018/12/01. Reason: Threshold Crossed: 1 out of the last 1 datapoints [0.0 (01/12/18 05:24:00)] was not greater than or equal to the threshold (100.0) (minimum 1 datapoint for ALARM -> OK transition).
For each alarm, you can set up 3 types of notification depending on the alarm state (OK, ALARM, INSUFFICIENT_DATA).
When you first create an alarm, its initial state is INSUFFICIENT_DATA. Then, depending on the metric value, threshold, and period, it will reach the OK or ALARM state.
If you have set up notifications for all the different states, it's normal that you also get a notification when the alarm goes back from ALARM (end of a month over $100) to OK (new month, new bill).
If you don't want to get notifications when billing is under $100, just remove the OK notification that you have set up.
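If you prefer to do this programmatically, here is a minimal sketch with boto3 (the alarm name "billing-alarm" is a placeholder). Note that put_metric_alarm replaces the whole alarm definition, so the sketch re-submits the existing settings with OKActions emptied out:

import boto3

# Billing metrics live in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# "billing-alarm" is a placeholder; use your alarm's actual name.
alarm = cloudwatch.describe_alarms(AlarmNames=["billing-alarm"])["MetricAlarms"][0]

cloudwatch.put_metric_alarm(
    AlarmName=alarm["AlarmName"],
    Namespace=alarm["Namespace"],
    MetricName=alarm["MetricName"],
    Dimensions=alarm["Dimensions"],
    Statistic=alarm["Statistic"],
    Period=alarm["Period"],
    EvaluationPeriods=alarm["EvaluationPeriods"],
    Threshold=alarm["Threshold"],
    ComparisonOperator=alarm["ComparisonOperator"],
    AlarmActions=alarm["AlarmActions"],
    OKActions=[],  # drop the OK-state notification
)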
Hope it helps!

AWS CloudWatch Alarm status

I have set a CloudWatch alarm to trigger an SNS mail whenever certain keywords are found in CloudWatch Logs (using a metric filter).
When those keywords are detected, the alarm state changes from INSUFFICIENT_DATA to ALARM and triggers the SNS topic.
Now, moving from the ALARM state back to INSUFFICIENT_DATA takes a seemingly random amount of time.
Is there a specific way this works? I expect it to go back to INSUFFICIENT_DATA immediately after the ALARM state.
Any help would be appreciated. Thanks
The alarm has a metric period of 60 seconds and some number of evaluation periods (let's suppose 3; the total evaluation window is 3 * 60 seconds = 3 minutes).
The alarm will be in ALARM state if all of the last 3 datapoints at 60-second intervals are breaching (above the threshold).
If any one of the last 3 datapoints is below the threshold, the alarm will transition to OK.
BUT, if all of the latest 3 datapoints are missing (say your metric filter did not match and as a result no metric was pushed), the alarm waits longer than 3 periods to transition to INSUFFICIENT_DATA; this is by design, to accommodate network and processing delays.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
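To make the evaluation-window idea concrete, here is a toy model in Python. This is only an illustration of the logic described above, not CloudWatch's actual implementation; the threshold and period count are made up:

# Toy model: Period = 60s, EvaluationPeriods = 3, threshold of "> 0".
THRESHOLD = 0
EVALUATION_PERIODS = 3

def evaluate(last_datapoints):
    # last_datapoints is newest-first; None marks a missing datapoint.
    present = [d for d in last_datapoints if d is not None]
    if not present:
        # Only after the whole window is empty does the alarm give up.
        return "INSUFFICIENT_DATA"
    if all(d > THRESHOLD for d in present[:EVALUATION_PERIODS]):
        return "ALARM"
    return "OK"

print(evaluate([5, 7, 6]))           # ALARM: all 3 datapoints breach
print(evaluate([0, 7, 6]))           # OK: the newest datapoint is below the threshold
print(evaluate([None, None, 6]))     # still ALARM: missing datapoints are skipped, not treated as OK
print(evaluate([None, None, None]))  # INSUFFICIENT_DATA: nothing left to evaluate

The third case is why the transition out of ALARM lags when data stops flowing: missing datapoints are skipped rather than counted as recovered.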
I came across the same situation, using a period of 1 minute and a threshold of the form x > threshold.
The state changes to ALARM immediately whenever the metric exceeds the threshold, but changing back to OK / INSUFFICIENT_DATA takes 6 minutes. This happens only for missing data.
As per AWS Support, this is the expected behavior of CloudWatch alarms; a clear explanation can be found here: https://forums.aws.amazon.com/thread.jspa?threadID=284182

AWS CloudWatch Zero Queue Size For One Week alarm

I am wondering if there is a way to set up a CloudWatch alarm that fires if an SQS queue has not received any traffic for 7 days. I currently have a job that runs on my host once a week and is guaranteed to add a message to my SQS queue. I already have a way of alarming if the job doesn't run, but I would also like to alarm if, for some reason, the job does run but does not send any messages to my queue. I understand that the longest alarm period you can set is 1 day. Is there another way to create an alarm that will do what I am looking for?
Edit:
Since my job runs once a week, is there a way to have an alarm that monitors metrics every 7th day, checking whether any traffic hits the queue within a 24-hour window? This is more accurate, since on the 6 days in between I don't expect or care whether there is any traffic; I only care that there is traffic on that 7th day.
CloudWatch Alarms set a limit that period * number_of_datapoints_to_watch must be less than 24 hours. As far as I know, there is no way around that.
To get the behavior you want, you can calculate days since last activity yourself, publish that as a custom metric and alarm on that.
One way to do it (sketched in the code after this list) would be:
Create a Lambda function and have it trigger every hour, for example.
In the lambda, call CloudWatch GetMetricStatistics for the SQS metric you want to monitor.
Get the latest datapoint returned that has value greater than 0 and calculate the difference between now and the timestamp on that datapoint.
Use CloudWatch PutMetricData to publish this value to your new metric days-since-last-activity.
Now you can alarm when the value of your new metric goes above 7 days.
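Here is a sketch of that Lambda, assuming boto3 and a placeholder queue name. It looks back 14 days of hourly NumberOfMessagesSent sums, finds the most recent non-zero datapoint, and publishes the gap in days as the custom metric:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

QUEUE_NAME = "my-weekly-queue"  # placeholder

def handler(event, context):
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="NumberOfMessagesSent",
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=now - timedelta(days=14),
        EndTime=now,
        Period=3600,
        Statistics=["Sum"],
    )
    active = [p["Timestamp"] for p in resp["Datapoints"] if p["Sum"] > 0]
    # If nothing was sent in the whole lookback window, report the full window.
    last_activity = max(active) if active else now - timedelta(days=14)
    days_idle = (now - last_activity).total_seconds() / 86400

    cloudwatch.put_metric_data(
        Namespace="Custom/SQS",
        MetricData=[{
            "MetricName": "days-since-last-activity",
            "Dimensions": [{"Name": "QueueName", "Value": QUEUE_NAME}],
            "Value": days_idle,
            "Unit": "None",
        }],
    )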

Amazon Cloudwatch alarm not triggered

I have a CloudWatch alarm configured:
Threshold : "GreaterThan 0" for 1 consecutive period,
Period : 1 minute,
Statistic : Sum
The alarm is configured on top of the AWS SQS NumberOfMessagesSent metric. The queue was empty and no messages were being published to it. I sent a message manually. I could see the spike in the metric, but the alarm state was still OK. I am a bit confused about why this alarm is not changing its state even though all the conditions to trigger it are met.
I just overcame this problem with the help of AWS support. You need to set the period on your alarm to ~15 minutes. It has to do with how SQS timestamps the events as it pushes them to CloudWatch.
Don't worry: setting a longer period will not affect how quickly you are alerted. CloudWatch still gets data from SQS every 5 minutes.
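For reference, a sketch of that alarm with boto3 (queue name and SNS topic ARN are placeholders):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="sqs-messages-sent",
    Namespace="AWS/SQS",
    MetricName="NumberOfMessagesSent",
    Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],
    Statistic="Sum",
    Period=900,  # ~15 minutes, per the advice above
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-topic"],
)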
It could be that the interval time is set to less than 300 seconds. The free CloudWatch tier checks every 5 minutes, so if you set an alarm period shorter than that, you will sometimes get INSUFFICIENT_DATA.
Metrics sometimes suffer from something called "delayed metric delivery", which is more common when the alarm period is narrow, like 1 minute.
When the delayed datapoint arrives, it is too late for the alarm, but not for the graph, which eventually renders it nicely without a gap.
Play with Evaluation Periods and Datapoints to Alarm: instead of 1/1, maybe 3/2 or 3/1 would work fine.