I have an app that puts a custom CloudWatch metric to AWS every minute. This is supposed to act as a heartbeat so I know the app is alive.
Now I want to put an alarm on this metric to notify me if the heartbeat stops. I have tried to accomplish this using different CloudWatch alarm statistics, including "average" and "data samples", and setting an alarm threshold of less than 1 over a given period. However, in all cases, if my app dies and stops reporting the heartbeat, the alarm only goes into an "Insufficient Data" state and never into an "Alarm" state.
I understand I can put a notification on the "Insufficient Data" state, but I want this to show up as an alarm. Is this possible in CloudWatch?
Thanks,
Matt
I think that the alarm going into "Insufficient Data" state has to do with how missing data is being handled. As the doc states:
Similar to how each alarm is always in one of three states, each specific data point reported to CloudWatch falls under one of three categories:
Not breaching (within the threshold)
Breaching (violating the threshold)
Missing
You can specify how alarms handle missing data points. Choose whether to treat missing data points as:
missing (The alarm looks back farther in time to find additional data points)
notBreaching (Treated as a data point that is within the threshold)
breaching (Treated as a data point that is breaching the threshold)
ignore (The current alarm state is maintained)
The default behavior is missing.
So I guess that treating missing data points as breaching would do the trick :)
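As a rough sketch of what that could look like with boto3: the alarm name, namespace, metric name, and the three-period window below are hypothetical placeholders, not values from the question. The parameter names themselves (`TreatMissingData`, `ComparisonOperator`, and so on) are the ones `put_metric_alarm` accepts.

```python
def heartbeat_alarm_params(sns_topic_arn):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": "app-heartbeat-missing",  # hypothetical name
        "Namespace": "MyApp",                  # hypothetical namespace
        "MetricName": "Heartbeat",             # hypothetical metric name
        "Statistic": "SampleCount",            # count heartbeats per period
        "Period": 60,                          # the app reports every minute
        "EvaluationPeriods": 3,                # assumed tolerance: 3 silent minutes
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        # The key setting: a period with no data counts as breaching, so a
        # dead app moves the alarm to ALARM instead of INSUFFICIENT_DATA.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_heartbeat_alarm(sns_topic_arn):
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(
        **heartbeat_alarm_params(sns_topic_arn)
    )
```

Calling `create_heartbeat_alarm("<your SNS topic ARN>")` would create the alarm. The builder is kept separate so the settings can be inspected without touching AWS.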
Instead of pushing in a custom metric to Cloudwatch, consider:
Push a message onto an SNS topic on the same periodic basis as before, and set up a CloudWatch alarm on the SNS topic's NumberOfMessagesPublished metric. If the number of heartbeats falls below the expected value for the time period you specify, whether it's because the app crashed or the server crashed, the alarm will go into an Alarm state.
Treat missing data as breaching threshold (step 4)
Check this: https://cloudonaut.io/dead-mans-switch-with-cloudwatch/
I have some alarms to check when an instance is left idle. The conditions are when 12 consecutive datapoints (at 5 min each) are found to have an average of <1% CPU usage, the instance should be stopped and a notification email sent out.
The alarm I created reads:
Whenever _Average_ of _CPU Utilization_
is _<_ _1_ Percent
For at least _12_ consecutive periods of _5 minutes_
Alarm
The alarm gets triggered in the use case of the instance being up and running for 1 hour with <1% CPU usage.
However, the alarm is also triggered when the instance is shut off. For example, if the instance is turned on, has 30 minutes of data points at <1% CPU, and is then turned off, the alarm will be triggered 30 minutes later.
[Screenshot: CPU metrics]
How can I set this alarm so it is either:
only triggered when the instance is running, or
only triggered when a full set of 12 consecutive data points is actually collected, and not missing points that register as <1%?
The answer to this was actually quite simple. Go to CloudWatch, select the alarm, and scroll down to Additional Configuration. For Missing Data Treatment, select "Treat missing data as good (not breaching alarm)".
Well as AWS says:
For each alarm you can specify CloudWatch to treat missing data points
as any of the following :
missing: the alarm does not consider missing data points when evaluating whether to change state (default)
notBreaching: missing data points are treated as being within the threshold
breaching: missing data points are treated as breaching the threshold
ignore: the current alarm state is maintained
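To see why the choice matters for the idle-instance alarm, here is a deliberately simplified model of the four treatments. It is not CloudWatch's actual evaluation algorithm (the real one also looks at data points outside the window); the function name and the rule used for `ignore` are assumptions made for illustration.

```python
def evaluate(window, threshold, treat_missing="missing", current_state="OK"):
    """Simplified state of a 'value < threshold' alarm over one window.

    `window` is a list of readings, oldest first; None marks a missing point.
    """
    if treat_missing not in ("missing", "notBreaching", "breaching", "ignore"):
        raise ValueError(f"unknown treatment: {treat_missing}")

    has_gap = any(value is None for value in window)
    if treat_missing == "ignore" and has_gap:
        return current_state  # simplification: any gap keeps the current state

    verdicts = []  # True means "this point breaches the threshold"
    for value in window:
        if value is not None:
            verdicts.append(value < threshold)
        elif treat_missing == "breaching":
            verdicts.append(True)
        elif treat_missing == "notBreaching":
            verdicts.append(False)
        # "missing": the point is skipped and contributes no verdict

    if not verdicts:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(verdicts) else "OK"


# 30 minutes of <1% CPU, then the instance is stopped (6 missing points).
stopped_instance = [0.5] * 6 + [None] * 6
for treatment in ("missing", "notBreaching", "breaching", "ignore"):
    print(treatment, "->", evaluate(stopped_instance, 1.0, treatment))
```

In this model only `notBreaching` (and `ignore`, if the alarm was already OK) keeps the stopped instance from alarming, which matches the fix described above.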
I have an instance in AWS whose CPU crosses the 90% threshold from time to time.
I have created an alarm for this. However, I received only one notification, during the first 5 minutes, even though the CPU stayed at 100% for 2 hours.
How do I set the metric so I will keep getting notifications all the time?
CloudWatch does not send notifications continuously while the threshold is breached. It sends a notification only when the alarm state changes.
Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state, the state must have changed and been maintained for a specified number of periods.
Ref: AWS Cloudwatch Documentation
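The rule quoted above can be illustrated with a small pure-Python sketch. It is only a model of the "notify on state change" behaviour for an SNS action attached to the ALARM state; the function name is made up for this example.

```python
def notification_periods(states, initial_state="OK"):
    """Return the 0-based indexes of periods where entering ALARM would notify.

    `states` is the alarm state at the end of each period, oldest first.
    """
    sent = []
    previous = initial_state
    for index, state in enumerate(states):
        # A notification goes out only on the transition into ALARM,
        # not for every period the alarm stays there.
        if state == "ALARM" and previous != "ALARM":
            sent.append(index)
        previous = state
    return sent


# CPU pinned at 100% for 2 hours = 24 five-minute periods in ALARM.
print(notification_periods(["ALARM"] * 24))  # [0]
```

A second notification only appears once the alarm has left ALARM and come back, for example `["ALARM", "ALARM", "OK", "ALARM"]` yields `[0, 3]`.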
One possible solution that I can think of is to create multiple CloudWatch alarms with different thresholds.
As the answer above already says, the alarm is not triggered again while it stays in the same state. One thing you can do is change the alarm threshold to a very large value and then back to the original value, so that the state change occurs again.
According to this doc I should consider publishing a value of zero instead of no data, because I "can set a CloudWatch alarm to notify you if your application fails to publish metrics every five minutes".
But I can set a CloudWatch alarm to notify on INSUFFICIENT_DATA too. Is using 0 a more reliable way of doing this? Is using 0 over INSUFFICIENT_DATA recommended by Amazon because it's more reliable?
You can set an alarm via either method.
However, there is a difference between publishing a value of zero and an alarm state of INSUFFICIENT_DATA.
If your service is running, then publish a zero value instead of publishing nothing and having the alarm go into the INSUFFICIENT_DATA state. In the first case you know your service is running. In the second case you have no data. This may or may not be valuable to you, but at least your metric history will not have gaps.
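A minimal sketch of publishing an explicit zero with boto3 might look like this. The namespace and metric name are hypothetical; `put_metric_data` and its `Namespace` / `MetricData` parameters are the real API.

```python
from datetime import datetime, timezone


def zero_datapoint_params(metric_name="ErrorCount", namespace="MyApp"):
    """Build put_metric_data arguments for a 'nothing happened' data point.

    Both default names are hypothetical placeholders.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                "Timestamp": datetime.now(timezone.utc),
                "Value": 0.0,  # explicit zero: the service is up, nothing to count
                "Unit": "Count",
            }
        ],
    }


def publish_zero():
    """Send the zero data point (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("cloudwatch").put_metric_data(**zero_datapoint_params())
```

Calling `publish_zero()` on every reporting interval keeps the metric continuous, so a gap then really does mean the publisher stopped.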
I am trying to capture the event of a new message arriving in my FIFO queue (I want to avoid polling the queue indefinitely).
For this purpose I am evaluating the CloudWatch alarm option with the ApproximateNumberOfMessagesVisible metric.
Following is my Alarm description-
Threshold: The condition in which the alarm will go to the ALARM state. ApproximateNumberOfMessagesVisible >= 0 for 1 minute
Actions: The actions that will occur when the alarm changes state.
In ALARM:
Send message to topic "topic_for_events_generated_bycloudwatch" (xyz#xyz)
Send message to topic "topic_for_events_generated_bycloudwatch"
Period: The granularity of the datapoints for the monitored metric. 1 minute
Following are my queries -
Assuming there are more than 0 messages in the given queue, will this alarm be raised only once when the condition is met, or every minute?
During a quick test I saw the alarm moving between the INSUFFICIENT_DATA and ALARM states in random order without any configuration changes. What could be the reason?
Screenshot of ApproximateNumberOfMessagesVisible metric graph
Screenshot of the log activity
Thanks in advance.
Regards,
Rohan K
CloudWatch invokes the alarm action once, when the threshold breach causes a state transition.
From the Docs
Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state, the state must have changed and been maintained for a specified number of periods.
But
After an alarm invokes an action due to a change in state, its
subsequent behavior depends on the type of action that you have
associated with the alarm. For Amazon EC2 and Auto Scaling actions,
the alarm continues to invoke the action for every period that the
alarm remains in the new state. For Amazon SNS notifications, no additional actions are invoked.
An Example:
In the following figure, the alarm threshold is set to 3 units and the
alarm is evaluated over 3 periods. That is, the alarm goes to ALARM
state if the oldest of the 3 periods being evaluated is breaching, and
the 2 subsequent periods are either breaching or missing. In the
figure, this happens with the third through fifth time periods, and
the alarm's state is set to ALARM. At period six, the value dips below
the threshold, and the state reverts to OK. Later, during the ninth
time period, the threshold is breached again, but for only one period.
Consequently, the alarm state remains OK.
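The figure itself is not reproduced here, so the following sketch replays the same scenario with made-up values. It applies the rule exactly as the quoted text states it (oldest period breaching, later periods breaching or missing); the function name, the sample series, and the choice to report INSUFFICIENT_DATA before a full window exists are assumptions.

```python
def alarm_states(values, threshold, evaluation_periods):
    """State after each period for a 'value > threshold' alarm.

    `values` holds one reading per period, oldest first; None marks missing.
    """
    states = []
    for end in range(len(values)):
        window = values[max(0, end - evaluation_periods + 1): end + 1]
        if len(window) < evaluation_periods:
            states.append("INSUFFICIENT_DATA")  # assumed start-up behaviour
            continue
        oldest, later = window[0], window[1:]
        oldest_breaching = oldest is not None and oldest > threshold
        later_ok = all(v is None or v > threshold for v in later)
        states.append("ALARM" if oldest_breaching and later_ok else "OK")
    return states


# Hypothetical series: periods 3-5 exceed 3 units, period 9 spikes once.
series = [2, 2, 4, 5, 4, 2, 1, 2, 5, 2]
for period, state in enumerate(alarm_states(series, 3, 3), start=1):
    print(period, state)
```

With this series the only ALARM is at period 5, the state is back to OK at period 6, and the single spike at period 9 leaves it OK.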
Amazon AWS CloudWatch has the following Alarm in an alarmed state
What caused it to get into this state?
Why is it still in this state, given that my application is not currently being used?
CloudWatch alarms have three possible states:
ALARM: This means the condition is TRUE. It is typically associated with a condition that should trigger an alert or an auto-scaling action.
OK: This means the condition is FALSE. It typically means "don't worry, everything's fine".
INSUFFICIENT DATA: This means there is not enough data for the state to be determined. Typically caused by an alarm configured for a period of time (eg Average over 5 minutes) where there is insufficient data (eg less than 5 minutes of data).
The ALARM condition can look scary when associated with a scale-down alarm because it doesn't mean anything is 'wrong'. Rather, it just means TRUE. Sometimes I wish they'd call it something other than 'ALARM' since people sometimes get worried when this state is perfectly OK.
Your alarm triggers if the amount of outgoing network usage is less than the configured threshold. Given that you say that your application is not currently being used it sounds normal for it to be in this state.
When using alarms to trigger scale up/down behaviour, it's normal that the scale down alarm is active when usage is low. It won't actually do anything in general since it can't make the number of instances less than the minimum you've allowed.