I have some alarms to check when an instance is left idle. The conditions are when 12 consecutive datapoints (at 5 min each) are found to have an average of <1% CPU usage, the instance should be stopped and a notification email sent out.
The alarm I created reads:
Whenever _Average_ of _CPU Utilization_
is _<_ +1+ Percent
For at least _12_ consecutive periods of _5 minutes_
Alarm
The alarm gets triggered in the use case of the instance being up and running for 1 hour with <1% CPU usage.
However, the alarm is also triggered when the instance is shut off. For instance, if the the instance is turned on, has 30 minutes of data points <1% CPU, and then is turned off, the alarm will be triggered in 30 minutes.
CPU metrics
How can I set this alarm so it is either:
only triggered when the instance is running, or
only triggered when a full set of 12 consecutive data points is actually collected, and not missing points that register as <1%?
The answer to this was actually quite simple. If you go to Cloudwatch, select the alarm and scroll down to Additional Configuration. For Missing Data Treatment, select "Treat missing data as good (not breaching alarm)".
Well as AWS says:
For each alarm you can specify CloudWatch to treat missing data points
as any of the following :
missing: the alarm does not consider missing data points when evaluating whether to change state (default)
notBreaching: missing data points are treated as begin within the threshold
breaching: missing data points are treated as breaching the threshold
ignore: the current alarm state is maintained
Related
Follwing case:
We want an Alarm in AWS that reads the EstimatedCharges Metric of AmazonCloudWatch every 5 minutes (for potential log overflow). But the only timespan I can set are 6 hours, else it gives me "Insufficient" as Status. How can I change the metric so that I can use it with 5 minutes between each check?
And how can I make an action that will stop the Cloudwatch Logs when over X?
According to documentation.
CloudWatch Billing metrics are updated every 6 hours.
Thus, your Alert may change the status only every 6 hours.
Just set Treat Missing Data to notBreaching
notBreaching – Missing data points are treated as "good" and within the threshold,
More info: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
I have set cloudwatch alarm to trigger SNS mail whenever some keywords are found in cloudwatch logs. (using metric filter)
When those keywords are detected, Alarm state gets changed from insufficient data to alarm & triggers SNS topic
Now, to move from Alarm state alarm to insufficient data it takes time randomly.
Is there any specific way it works, I expect it to come back to Alarm state insufficient data immediately after alarm state.
Any help would be appreciated. Thanks
The alarm has a metric period of 60 seconds and some evaluation period (let suppose 3; total equal 3 * 60 = 3 mints evaluation window).
The alarm will be in Alarm state if all the last 3 datapoints at 60 seconds interval are in Alarm State (above the threshold).
If any 1 in last 3 datapoint is below threshold then the Alarm will transition to OK.
BUT, if the latest all 3 datapoints are missing (say your metric filter did not match and as a result no metric was pushed), the Alarm waits longer than 3 periods to transition to InsufficientData and this is by design to accommodate network delays or processing delay.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
Came across the same situation, used a period of 1 min and some x > threshold.
The state changes to Alarm immediately whenever the metric exceeds the threshold. But to change back to OK/ Insufficient data takes 6 mins. This happens only for missing data.
As per AWS Support this is the expected behavior of Cloudwatch Alarms, clear explanation can be found here https://forums.aws.amazon.com/thread.jspa?threadID=284182
I have an app that puts a custom Cloudwatch metric to AWS every minute. This is supposed to act as a heartbeat so I know the app is alive.
Now I want to put an alarm on this metric to notify me if the heartbeat stops. I have tried to accomplish this using different cloudwatch alarm statistics including "average" and "data samples" and setting an alarm threshold less than 1 over a given period. However, in all cases, if my app dies and stops reporting the heartbeat, the alarm will only go into an "Insufficient Data" state and never into an "Alarm" state.
I understand I can put a notification on the "Insufficient Data" state, but I want this to show up as an alarm. Is this possible in Cloudwatch?
Thanks,
Matt
I think that the alarm going into "Insufficient Data" state has to do with how missing data is being handled. As the doc states:
Similar to how each alarm is always in one of three states, each specific data point reported to CloudWatch falls under one of three categories:
Not breaching (within the threshold)
Breaching (violating the threshold)
Missing
You can specify how alarms handle missing data points. Choose whether to treat missing data points as:
missing (The alarm looks back farther in time to find additional data points)
notBreaching (Treated as a data point that is within the threshold)
breaching (Treated as a data point that is breaching the threshold)
ignore (The current alarm state is maintained)
The default behavior is missing.
So i guess that specifying missing data points as breaching would do the trick :)
Instead of pushing in a custom metric to Cloudwatch, consider:
Push a message onto an SNS topic, on the same periodic basis as you were doing, and set up a CloudWatch monitor for the SNS topic's NumberOfMessagesPublished metric. If the number of heartbeats falls below the expected value for the time period you specify, whether its because the app crashed, or server crashed, the metric will go into an Alarm state.
Treat missing data as breaching threshold (step 4)
Check this: https://cloudonaut.io/dead-mans-switch-with-cloudwatch/
I have a cloudwatch alarm configured :
Threshold : "GreaterThan 0" for 1 consecutive period,
Period : 1 minute,
Statistic : Sum
The alarm is configured on top of AWS SQS NumberOfMessagesSent. The queue was empty and no messages were being published to it. I sent a message manually. I could see the spike in metric but state of alarm was still OK. I am a bit confused why this alarm is not changing its state even though all the conditions to trigger this are met.
I just overcame this problem with the help of AWS support. You need to set the period on your alarm to ~15 minutes. It's got to do with how SQS marks the event's timestamps as it pushes them to CloudWatch.
Don't worry, as setting the period to a greater number will not affect how quickly you are alerted of an alarm. It will still get data from SQS every 5 minutes.
It could be that the interval time is set to less than 300 seconds. The free CloudWatch checks every 5 minutes so if you set an alarm for less than that it you will sometimes get INSUFFICIENT_DATA.
Sometimes they suffer something calling "Delayed Metric delivery", it's something more usual when the alarm period is around narrow times, like 1 minute.
When the delayed timestamp arrive, is too late for the alarm, but not for the graph, because it finally print it nicely without gap.
Play with Evalution Periods and Datapoints to Alarm, not 1/1, maybe 3/2 or 3/1 would work fine.
Amazon AWS CloudWatch has the following Alarm in an alarmed state
What caused it to get into this state?
Why is it still in this state, as my application is not currently being used.
CloudWatch alarms have three possible states:
ALARM: This means the condition is TRUE. It is typically associated with a condition that should trigger an alert or an auto-scaling action.
OK: This means the condition is FALSE. It typically means "don't worry, everything's fine".
INSUFFICIENT DATA: This means there is not enough data for the state to be determined. Typically caused by an alarm configured for a period of time (eg Average over 5 minutes) where there is insufficient data (eg less than 5 minutes of data).
The ALARM condition can look scary when associated with a scale-down alarm because it doesn't mean anything is 'wrong'. Rather, it just means TRUE. Sometimes I wish they'd call it something other than 'ALARM' since people sometimes get worried when this state is perfectly OK.
Your alarm triggers if the amount of outgoing network usage is less than the configured threshold. Given that you say that your application is not currently being used it sounds normal for it to be in this state.
When using alarms to trigger scale up/down behaviour, it's normal that the scale down alarm is active when usage is low. It won't actually do anything in general since it can't make the number of instances less than the minimum you've allowed.