CloudWatch Alarm triggers more often than expected - amazon-web-services

I have a problem with one of my CloudWatch alarms that I can't quite figure out.
I have a metric for one of my LogGroups that inserts 1 data point whenever a "Fatal Alarm" is logged within the application.
A fatal alarm was logged this night at 01:44:51. My alarm changed to the ALARM state at 01:45 (as expected), but also went into the ALARM state at 02:35 and 03:05.
I have a bunch of screenshots with all the information that I believe is necessary for pointing me in the right direction:
Only one fatal log:
[screenshot]
We can also see there's only one fatal log if we graph the metric:
[screenshot]
Alarm state changes:
[screenshot]
Alarm state changed to ALARM 1 (the expected one):
[screenshot]
Alarm state changed to ALARM 2 (not expected):
[screenshot]
Alarm state changed to ALARM 3 (not expected):
[screenshot]
Alarm configuration:
Am I making some kind of obvious misconfiguration? I'm a bit confused as to what I am doing wrong.
Thanks in advance!
Frederik

There is no issue in your configuration as far as I can see. But if you want to see a graph corresponding to your incidents, you should change the graph period to fifteen minutes or an hour; note that your alarm period is set to 30 minutes.
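For context, a log-based metric like the one described above is usually produced by a metric filter on the log group. Below is a minimal boto3 sketch of that setup; the log group, filter, metric, and namespace names are hypothetical and not taken from the screenshots.

import boto3

logs = boto3.client("logs")

# Metric filter that emits a value of 1 whenever "Fatal Alarm" appears
# in the log group (all names here are hypothetical).
logs.put_metric_filter(
    logGroupName="/my/app/log-group",
    filterName="fatal-alarm-filter",
    filterPattern='"Fatal Alarm"',
    metricTransformations=[
        {
            "metricName": "FatalAlarmCount",
            "metricNamespace": "MyApp",
            "metricValue": "1",
        }
    ],
)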

Related

How to configure Cloudwatch Alarms - check for 5 errors once every 30 minutes

Can someone please explain how I should configure my CloudWatch alarms, as the documentation is terribly confusing.
Use case: I want to check for errors once every 30 minutes and trigger an alarm if I see more than 5 errors in the logs.
Below is my current configuration:
threshold - 5
period - 1800
datapoints_to_alarm - 1
evaluation_periods - 1
comparison_operator - GreaterThanOrEqualToThreshold
statistic - Sum
treat_missing_data- notBreaching
When I tested with the period set to 120, I was able to validate that the alarm gets triggered when there are 5 or more errors in the logs.
However, when I changed the period to 1800, I am seeing that the alarm is triggered instantly on seeing 5 errors in the logs, and it is not coming out of the ALARM state for 1800 seconds (30 minutes). Any ideas how to fix this?
The solution above is the answer to the question. There is no way to get out of the ALARM state before the defined period (period × evaluation_periods) ends.
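To make the configuration concrete, here is a minimal boto3 sketch of an alarm with the settings listed in the question; the alarm name, namespace, and metric name are hypothetical. With Period=1800 and EvaluationPeriods=1, the alarm is re-evaluated only once per 30-minute window, which is why it cannot return to OK any sooner.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm mirroring the configuration above: sum of errors over one
# 30-minute period, alarming at 5 or more (all names are hypothetical).
cloudwatch.put_metric_alarm(
    AlarmName="error-count-alarm",
    Namespace="MyApp/Logs",
    MetricName="ErrorCount",
    Statistic="Sum",
    Period=1800,                 # 30 minutes
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=5,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)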

AWS SQS Polling Issue

I have encountered a weird SQS situation for which I can't find a satisfying answer.
I created a delay queue that should delay (what a surprise) incoming events for 4 seconds, after which they should be processed by a Lambda. Order is not an issue here.
The issue, though, is that the "approximate age of the oldest message" metric (stat. Max) sometimes reaches over 1 minute, which is weird since there aren't that many messages, as you can see in the picture. My expectation would be that an event gets processed immediately after the 4-second delay.
The reserved concurrency of that Lambda is 50, so the SQS poller should have no problem invoking more Lambda instances if there is too much traffic. But traffic isn't really a problem, as you can see.
The queue is configured like this:
Default visibility timeout: 120 sec
Delivery delay: 4 sec
Dead-letter queue: No (it is only one event generated by AWS, so no bad pills)
Message retention period: 4 days
The lambda config:
Batch size: 5 (Tried also 1 or 10. Not much of a difference for the mentioned metric)
Batch window: None
Reserved concurrency: 50
Timeout: 20 secs
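For reference, the queue and the event source mapping described above would look roughly like the boto3 sketch below; the queue name, queue ARN, Lambda function name, and region are hypothetical.

import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

# Delay queue matching the settings above (queue name is hypothetical).
queue = sqs.create_queue(
    QueueName="my-delay-queue",
    Attributes={
        "DelaySeconds": "4",                            # delivery delay
        "VisibilityTimeout": "120",                     # default visibility timeout
        "MessageRetentionPeriod": str(4 * 24 * 3600),   # 4 days
    },
)

# Event source mapping that lets the SQS poller invoke the Lambda
# (function name and queue ARN are hypothetical).
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:my-delay-queue",
    FunctionName="my-consumer-function",
    BatchSize=5,
)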
I can't explain the reason for those old messages (ApproximateAgeOfOldestMessage). Any help would be highly appreciated.
Best
Patrick
I contacted AWS Support. Apparently it is a bug on the AWS side:
Response from AWS Support:
I have just received an update from the backend service team and the team has confirmed that they have identified an issue of unexpected spikes in "ApproximateAgeOfOldestMessage" metrics that triggers when messages are sent to SQS with a configured delay. This issue's root cause is that our internal system uses recently processed delayed messages to calculate the "ApproximateAgeOfOldestMessage", which results in a higher-than-actual value for the "ApproximateAgeOfOldestMessage" metrics. They have now identified a fix for this issue and will start deploying the fix soon. After this update, when messages are sent to Amazon SQS with a configured delay, you may see the "ApproximateAgeOfOldestMessage" metrics value come down for the queues to the accurate value.
So if you encounter the same problem, you have to wait for the mentioned fix. I hope it will come soon.

AWS alarm goes to Ok state unexpectedly

I have an alarm set up in AWS CloudWatch which generates a data point every hour. When its value is greater than or equal to 1, it goes to the ALARM state. Following are the settings:
On 2nd Nov, it got into the ALARM state and then back to the OK state in 3 hours. I'm just trying to understand why it took 3 hours to get back to the OK state instead of 1 hour, because the metric runs every hour.
Here are the logs which show that the alarm transitioned from the ALARM to the OK state in 3 hours.
Following is the graph which shows the data point value every hour.
This is probably because alarms are evaluated over a longer window than your 1-hour period. CloudWatch evaluates an alarm over an evaluation range that can cover more than one period, so in your case the evaluation range could be longer than 1 hour, which is why it takes longer for the state to change.
There is also a thread about this behaviour with extra info on the AWS forum:
Unexplainable delay between Alarm data breach and Alarm state change
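One way to see which datapoints and which evaluation window CloudWatch used for each transition is to look at the alarm history; a small boto3 sketch, assuming a hypothetical alarm name:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Each StateUpdate item carries a summary (and JSON state-reason data)
# describing the evaluated datapoints behind the ALARM/OK decision.
history = cloudwatch.describe_alarm_history(
    AlarmName="my-hourly-alarm",
    HistoryItemType="StateUpdate",
    MaxRecords=10,
)
for item in history["AlarmHistoryItems"]:
    print(item["Timestamp"], item["HistorySummary"])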

AWS Cloudwatch alarm not working: For last few days, alarm stays "OK" even when it passes threshold

I have an alarm that for months has worked properly to manage the size of my ASG. Since Monday (Oct. 12), though, it has stopped working; it stays in the "OK" state even when the graphs clearly show that it is above the threshold. See the attached screenshot.
What may or may not be related is that the alarm will trigger, then fail with no error message. It looks like this happens when the alarm triggers during the cooldown stage of the ASG. Once this happens, the alarm reverts to "OK", then just stays there indefinitely, even though it is above the threshold. Before Monday, it would stay in alarm state, re-triggering repeatedly, until the ASG left cooldown state.
Anybody know what is going on here? How can I fix this? And why did it suddenly change when there were no changes on my side?
I can see some missing data around 15:15, and your missing-data treatment is set to 'Treat missing data as good'. Shall we change this to 'Treat missing data as ignore (maintain the alarm state)' and check?
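If it helps, the current missing-data setting can be checked with boto3 before changing it in the console; a small sketch, assuming a hypothetical alarm name:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Print how each alarm treats missing data: "missing", "ignore",
# "breaching" or "notBreaching" (alarm name is hypothetical).
alarms = cloudwatch.describe_alarms(AlarmNames=["asg-scale-out-alarm"])
for alarm in alarms["MetricAlarms"]:
    print(alarm["AlarmName"], alarm.get("TreatMissingData", "missing"))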

AWS stop alarm not working

I cannot get alarms to work reliably for an AWS EC2 instance. I have a g2.xlarge instance running and want it to stop when it is not in use, i.e. when average usage falls below 2%.
When I try 1 or 2 periods of 1 hour below 2%, it usually works, but then when I start it up again it immediately stops itself as it is in an alarm condition. I have tried 12 periods of 5 minutes, which allows it to start OK, but now it doesn't stop at all despite being in an alarm condition for several hours.
I have tried various options and can't nail down what makes it work and what doesn't. It feels as if sometimes things work and sometimes they don't. Is it buggy or am I missing something?
Here is a screenshot of my setup which has failed to trigger a shutdown...
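For comparison, a stop-on-idle alarm of this kind is usually defined roughly as below, mirroring the 12 × 5-minute variant from the question; the instance ID, region, alarm name, and the assumption that CPUUtilization is the watched metric are all hypothetical, not taken from the screenshot.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Stop the instance after CPU has averaged below 2% for 12 consecutive
# 5-minute periods (instance ID, region, and alarm name are hypothetical).
cloudwatch.put_metric_alarm(
    AlarmName="stop-idle-g2",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=12,
    Threshold=2.0,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:automate:eu-west-1:ec2:stop"],
)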