I have an instance in AWS whose CPU usage crosses the 90% threshold from time to time.
I created an alarm for this, but I received only one notification, within the first 5 minutes, even though the CPU stayed at 100% for 2 hours.
How do I configure the alarm so that I keep getting notifications for as long as the threshold is breached?
CloudWatch does not send notifications continuously while the threshold is breached. It sends a notification only when the alarm state changes.
Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state, the state must have changed and been maintained for a specified number of periods.
Ref: AWS Cloudwatch Documentation
One possible solution that I can think of is to create multiple CloudWatch alarms with different thresholds.
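The multiple-threshold idea could be sketched with boto3 as below. This is only an illustration: the alarm names, the 90/95/99 tiers, the 5-minute period, and the SNS topic ARN are assumptions, and `cloudwatch` is expected to be a client created with `boto3.client("cloudwatch")`.

```python
def build_tiered_alarms(instance_id, topic_arn, thresholds=(90, 95, 99)):
    """Return one put_metric_alarm argument dict per CPU threshold."""
    alarms = []
    for threshold in thresholds:
        alarms.append({
            # Hypothetical naming scheme, one alarm per tier.
            "AlarmName": f"cpu-{instance_id}-over-{threshold}",
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Average",
            "Period": 300,              # assumed 5-minute period
            "EvaluationPeriods": 1,
            "Threshold": float(threshold),
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [topic_arn],
        })
    return alarms


def create_tiered_alarms(cloudwatch, instance_id, topic_arn, thresholds=(90, 95, 99)):
    """Create the alarms; `cloudwatch` is a boto3 CloudWatch client."""
    for kwargs in build_tiered_alarms(instance_id, topic_arn, thresholds):
        cloudwatch.put_metric_alarm(**kwargs)
```

Note that each of these alarms still notifies only once per state change, so a CPU pinned at 100% would produce at most one notification per tier, not a continuous stream.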
As the answer above already says, the alarm is not triggered again while it stays in the same state. One thing you can do is change the alarm threshold to a very large value and then back to the original value, so that the state change occurs again.
Related
I'm trying to set up composite alarms on AWS, and so far I have most of it working. At the moment, I have a composite alarm made up of 3 alarms. If any 2 of these 3 alarms trigger, then the composite alarm also triggers. This part works fine.
However, I am having trouble with part of my use case. I'd also like to make it so that if one of these alarms within the composite alarm stays in alarm for over a certain period of time, then an alert is also sent out.
Here's an example of the situation:
2 out of the 3 alarms are in alarm, for any length of time: Alert should be sent
1 out of the 3 alarms is in alarm for under a certain time period: Alert should not be sent
1 out of the 3 alarms is in alarm for over a certain time period: Alert should be sent
I've tried looking into the settings available on the alarms themselves, and there doesn't seem to be an option for what I'm trying to do.
I'm wondering if this would require a lambda function? Is it possible for a lambda function to keep track of how long an alarm has been in alarm?
As discussed in the comments above, here is a possible solution to your problem. The only limitation is that the alarms can't use different time frames; both must use the same one.
So you will have, for example, Alarm 1 (CPU), which fires if CPU is over 60% for 15 minutes, and Alarm 2 (EFS connections), which fires if there are more than 10 connections for 15 minutes.
The composite alarm will go off when both conditions are true. It will also go off when only Alarm 1 goes off.
This is how you are going to make this alarm.
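As a rough sketch of the rule such a composite alarm could use, the helper below builds an `AlarmRule` string for "at least k of these alarms", optionally OR-ed with extra alarms. The function names and the alarm names are hypothetical. The "sustained" alarms are assumed to be separate metric alarms on the same metric with a longer evaluation window, which is one way to express "one alarm has been breaching for a long time".

```python
from itertools import combinations


def alarm_term(name):
    """Composite alarm rule term that is true while `name` is in ALARM."""
    return f'ALARM("{name}")'


def at_least_k_rule(alarm_names, k):
    """Rule that is true when at least k of the named alarms are in ALARM."""
    groups = [
        "(" + " AND ".join(alarm_term(n) for n in combo) + ")"
        for combo in combinations(alarm_names, k)
    ]
    return " OR ".join(groups)


def composite_rule(alarm_names, k, sustained_alarm_names=()):
    """k-of-n rule, OR-ed with any single long-window alarm."""
    parts = [at_least_k_rule(alarm_names, k)]
    parts.extend(alarm_term(n) for n in sustained_alarm_names)
    return " OR ".join(parts)
```

The resulting string can be passed as the `AlarmRule` argument of boto3's `put_composite_alarm`, for example `composite_rule(["cpu", "efs", "mem"], 2, ["cpu-sustained"])`.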
As for testing, it depends on what type of alarms you are creating. For example, methods for artificially increasing CPU and RAM usage are widely available on Stack Overflow.
You can also change the state of an alarm with the AWS CLI. The forced state is temporary: the alarm returns to its actual state at its next evaluation, often within seconds.
aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
You need to find the method that suits your needs best.
Is it possible to restrict the times at which AWS CloudWatch sends alerts?
For example:
I would like alerts to operate only from Monday to Friday, from 8 am to 6 pm.
A CloudWatch Alarm will trigger whenever the rule is met. It might take a minute or so to react (depending upon the metric and the aggregate function used).
However, it is not possible to schedule when alerts operate.
If you wish, you could create an AWS Lambda function (triggered on a schedule) that will call disable_alarm_actions() and enable_alarm_actions(). The documentation says:
disable_alarm_actions(): Disables the actions for the specified alarms. When an alarm's actions are disabled, the alarm actions do not execute when the alarm state changes.
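A minimal sketch of such a scheduled function is shown below. It assumes the Monday to Friday, 8 am to 6 pm window from the question; the helper names are hypothetical, and `cloudwatch` is expected to be a boto3 CloudWatch client. The caller passes `now` in the desired local time zone, since Lambda itself runs in UTC.

```python
from datetime import datetime


def in_business_hours(now):
    """True from Monday to Friday, 08:00 up to (not including) 18:00."""
    # weekday() returns 0 for Monday, so values below 5 are working days.
    return now.weekday() < 5 and 8 <= now.hour < 18


def sync_alarm_actions(cloudwatch, alarm_names, now):
    """Enable alarm actions inside business hours, disable them outside."""
    if in_business_hours(now):
        cloudwatch.enable_alarm_actions(AlarmNames=alarm_names)
        return "enabled"
    cloudwatch.disable_alarm_actions(AlarmNames=alarm_names)
    return "disabled"
```

Running this on a schedule (for example hourly) keeps the alarms evaluating at all times, but their actions fire only for state changes that happen inside the window.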
I have a few alarms set up with an evaluation period of 5 minutes.
The problem is that I get too many alerts throughout the day because of them getting triggered. Is there a way to schedule those alarms once a day or twice a day?
CloudWatch Alarms only trigger when they cross the threshold. They will not send another notification until they return to the OK state and then cross into ALARM again.
So, if you are receiving multiple notifications, it is because the alarms are frequently going into, and out of, the ALARM state.
If this is too sensitive for your needs, increase the evaluation period or the number of datapoints required to trigger the alarm.
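To see why a longer evaluation window reduces noise, here is a deliberately simplified model of "M out of N datapoints" evaluation. It is not how CloudWatch is implemented (it ignores missing data and the INSUFFICIENT_DATA state), and the function names are made up for illustration.

```python
def alarm_states(values, threshold, evaluation_periods=1, datapoints_to_alarm=1):
    """State after each datapoint under a simplified M-of-N rule."""
    states = []
    for i in range(len(values)):
        # Look at the most recent `evaluation_periods` datapoints.
        window = values[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for v in window if v > threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def count_notifications(states):
    """Count transitions into ALARM, i.e. how many alerts would be sent."""
    count = 0
    previous = "OK"
    for state in states:
        if state == "ALARM" and previous != "ALARM":
            count += 1
        previous = state
    return count


cpu = [95, 50, 95, 50, 95, 95, 95, 50]
noisy = count_notifications(alarm_states(cpu, 90))        # 1 of 1 datapoints
calm = count_notifications(alarm_states(cpu, 90, 3, 3))   # 3 of 3 datapoints
```

With this sample series, the 1-of-1 alarm enters ALARM three times, while the 3-of-3 alarm enters it once. On a real alarm, the corresponding settings are the `EvaluationPeriods` and `DatapointsToAlarm` parameters of `put_metric_alarm`.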
According to this doc I should consider publishing a value of zero instead of no data, because I "can set a CloudWatch alarm to notify you if your application fails to publish metrics every five minutes".
But I can set a CloudWatch alarm to notify on INSUFFICIENT_DATA too. Is using 0 a more reliable way of doing this? Is using 0 over INSUFFICIENT_DATA recommended by Amazon because it's more reliable?
You can set an alarm via either method.
However, there is a difference between publishing a value of zero and an alarm state of INSUFFICIENT_DATA.
If your service is running, then publish a zero value instead of not publishing and letting the alarm go into the INSUFFICIENT_DATA state. In the first case you know your service is running. In the second case you have no data, so you cannot tell an idle service from a dead one. This may or may not be valuable to you, but at least your metric history will not have gaps.
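A small sketch of the "always publish, even when the value is zero" approach follows. The namespace `MyApp` and metric name `RequestsProcessed` are placeholders, and `cloudwatch` is assumed to be a boto3 CloudWatch client.

```python
def publish_request_count(cloudwatch, count, namespace="MyApp"):
    """Publish the count for this interval, including an explicit zero."""
    metric_data = [{
        "MetricName": "RequestsProcessed",  # placeholder metric name
        "Value": float(count),              # 0.0 is a valid datapoint
        "Unit": "Count",
    }]
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=metric_data)
    return metric_data
```

Calling this once per interval, whether or not any work was done, means a missing datapoint can only come from the publisher itself failing, which is the case an INSUFFICIENT_DATA alarm can then catch.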
Amazon AWS CloudWatch has the following Alarm in an alarmed state
What caused it to get into this state?
Why is it still in this state, given that my application is not currently being used?
CloudWatch alarms have three possible states:
ALARM: This means the condition is TRUE. It is typically associated with a condition that should trigger an alert or an auto-scaling action.
OK: This means the condition is FALSE. It typically means "don't worry, everything's fine".
INSUFFICIENT_DATA: This means there is not enough data for the state to be determined. Typically caused by an alarm configured for a period of time (e.g. Average over 5 minutes) where there is insufficient data (e.g. less than 5 minutes of data).
The ALARM condition can look scary when associated with a scale-down alarm because it doesn't mean anything is 'wrong'. Rather, it just means TRUE. Sometimes I wish they'd call it something other than 'ALARM' since people sometimes get worried when this state is perfectly OK.
Your alarm triggers if the amount of outgoing network usage is less than the configured threshold. Given that you say that your application is not currently being used it sounds normal for it to be in this state.
When using alarms to trigger scale up/down behaviour, it's normal that the scale down alarm is active when usage is low. It won't actually do anything in general since it can't make the number of instances less than the minimum you've allowed.