I have to check if EC2 machines are on for more than 6 hours, so I set a CloudWatch alarm to fire if CPUUtilization > 0 for 24 periods of 15 minutes. The issue is that even if the machine is used for less time, after it is switched off I receive emails like the following one. As you can see, the threshold was crossed for only 6 / 24 points, but I receive the alarm anyway. Any clue what is going on here?
Alarm Details:
- Name: Machine up
- Description: ATTENTION: the machine has been up for more than 6 hours.
- State Change: INSUFFICIENT_DATA -> ALARM
- Reason for State Change: Threshold Crossed: 6 out of the last 24 datapoints were greater than or equal to the threshold (0.0). The most recent datapoints which crossed the threshold: [4.2 (26/09/19 13:16:00), 4.8 (26/09/19 13:01:00), 6.933333333333334 (26/09/19 12:46:00), 4.0 (26/09/19 12:31:00), 24.4 (26/09/19 12:16:00)] (minimum 24 datapoints for OK -> ALARM transition).
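For reference, a minimal CloudFormation sketch of the alarm as described above might look like this; the instance ID and SNS topic are placeholders, not the poster's actual values:

MachineUpAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: "Machine up"
    AlarmDescription: "ATTENTION: the machine has been up for more than 6 hours."
    Namespace: AWS/EC2
    MetricName: CPUUtilization
    Dimensions:
      - Name: InstanceId
        Value: "i-0123456789abcdef0"    # placeholder instance
    Statistic: Average
    Period: 900                         # 15 minute periods
    EvaluationPeriods: 24               # 24 x 15 min = 6 hours
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold  # the alarm email above suggests >= was actually configured
    AlarmActions:
      - !Ref MachineUpTopic             # placeholder SNS topic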
Can someone please explain how I should configure my CloudWatch alarms, as the documentation is terribly confusing?
Use case: I want to check for errors once every 30 minutes and trigger an alarm if I see more than 5 errors in the logs.
Below is my current configuration:
threshold - 5
period - 1800
datapoints_to_alarm - 1
evaluation_periods - 1
comparison_operator - GreaterThanOrEqualToThreshold
statistic - Sum
treat_missing_data - notBreaching
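For comparison, the same settings expressed as a CloudFormation alarm might look like the sketch below; the namespace and metric name are placeholders for whatever the log metric filter publishes:

ErrorCountAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "5 or more errors in the logs within 30 minutes"
    Namespace: "MyApp/Logs"             # placeholder
    MetricName: "ErrorCount"            # placeholder log metric filter metric
    Statistic: Sum
    Period: 1800                        # 30 minutes
    EvaluationPeriods: 1
    DatapointsToAlarm: 1
    Threshold: 5
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching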
When I tested with the period set to 120, I was able to validate that the alarm gets triggered when there are 5 or more errors in the logs.
However, when I changed the period to 1800, I am seeing that the alarm is triggered instantly on seeing 5 errors in the logs, and it does not come out of the alarm state for 1800 seconds (30 minutes). Any ideas how to fix this?
The above is the answer to the question: there is no way to get out of the alarm state before the defined period (the evaluation_periods × period window) ends.
I have a CloudWatch alarm which receives data from a canary. My canary attempts to visit a website; if the website is up and responding, the datapoint is 0, and if the server returns some sort of error, the datapoint is 1. Pretty standard canary stuff, I hope. This canary runs every 30 minutes.
My Cloudwatch alarm is configured as follows:
With the expected behaviour that if my canary cannot reach the website 3 times in a row, then the alarm should go off.
Unfortunately, this is not what's happening. My alarm was triggered with the following canary data:
Feb 8 @ 7:51 PM (MST)
Feb 8 @ 8:22 PM (MST)
Feb 8 @ 9:52 PM (MST)
How is it possible that these three datapoints would trigger my alarm?
My actual email was received as follows:
You are receiving this email because your Amazon CloudWatch Alarm "...." in the US West (Oregon) region has entered the ALARM state, because "Threshold Crossed: 3 out of the last 3 datapoints [1.0 (09/02/21 04:23:00), 1.0 (09/02/21 02:53:00), 1.0 (09/02/21 02:23:00)] were greater than or equal to the threshold (1.0) (minimum 3 datapoints for OK -> ALARM transition)." at "Tuesday 09 February, 2021 04:53:30 UTC".
I am even more confused because the times on these datapoints do not align. If I convert these times to MST, we have:
Feb 8 @ 7:23 PM
Feb 8 @ 7:53 PM
Feb 8 @ 9:23 PM
The time range on the reported datapoints is a two hour window, when I have clearly specified my evaluation period as 1.5 hours.
If I view the "metrics" chart in CloudWatch for my alarm, it makes even less sense:
The points in this chart are shown as:
Feb 9 @ 2:30 UTC
Feb 9 @ 3:00 UTC
Feb 9 @ 4:30 UTC
Which, again, appears to be a 2 hour evaluation period.
Help? I don't understand this.
How can I configure my alarm to fire if my canary cannot reach the website 3 times in a row (waiting 30 minutes in-between checks)?
I have two things to say in answer to this:
Every time a canary runs, 1 datapoint is sent to CloudWatch. So if you are checking for 3 failures within 30 minutes to trigger the alarm, your canary should run at a 10 minute interval. That gives 3 datapoints in 30 minutes, and all 3 must be failed datapoints for the alarm to be triggered.
For some reason the statistic option was not working for me, so I used the count option. Maybe this will help.
My suggestion is to run the canary every 5 minutes. That gives 6 datapoints in 30 minutes, and you can create the alarm to fire if count = 4.
The way I read your config, your alarm is expecting to find 3 datapoints within a 30 minute window - but your metric is only updated every 30 minutes, so this condition will never be true.
You need to increase the period so there are 3 or more datapoints available in order to trigger the alarm.
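As a rough illustration, an alarm that matches the 30 minute canary schedule and requires 3 consecutive failures could look like the CloudFormation sketch below; the namespace and metric name are placeholders for the canary's custom metric (0 = up, 1 = down):

WebsiteDownAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Canary failed 3 consecutive checks"
    Namespace: "MyCanary"               # placeholder
    MetricName: "SiteDown"              # placeholder, 0 or 1 per canary run
    Statistic: Maximum
    Period: 1800                        # matches the 30 minute canary schedule
    EvaluationPeriods: 3
    DatapointsToAlarm: 3
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold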
I have an alarm set up in AWS CloudWatch which receives a data point every hour. When its value is greater than or equal to 1, it goes to the ALARM state. Following are the settings:
On 2nd Nov, it got into the ALARM state and then went back to the OK state 3 hours later. I'm just trying to understand why it took 3 hours to get back to the OK state instead of 1 hour, because the metric runs every hour.
Here are the logs, which show that the alarm transitioned from the ALARM to the OK state in 3 hours.
Following is the graph which shows the data point value every hour.
This is probably because alarms are evaluated over a longer window than your 1 hour period; this window is the evaluation range. In your case, the evaluation range could be longer than your 1 hour, so it takes longer for the state to change.
There is also a thread about this behavior, with extra info, on the AWS forum:
"Unexplainable delay between Alarm data breach and Alarm state change"
I have an alarm tracking the metric for LoadBalancer 5xx errors on a single ALB. This should be in an "In alarm" state if 1 out of the last 1 datapoints is above the threshold of 2. The period is set to 1 minute. See the alarm details:
On 2020-09-23 at 17:18 UTC the Load Balancer started to return 502 errors. This is shown in the Cloudwatch metric chart below, and I've confirmed the times are correct (this was a forced 502 response so I know when I triggered it and I can see the 17:18 timestamp in the ALB logs)
But in the alarm log, the "In Alarm" state was only triggered at 17:22 UTC - 4 minutes after the 17:18 period had more than 2 errors. This isn't a delay in receiving a notification - it's about a delay in the state change compared to my expectation. Notifications were correctly received within seconds of the state change.
Here is the Alarm log with the state change timestamps:
We treat missing data as GOOD, so based on the metric graph I assume it should have recovered to OK at 17:22 (after the 17:21 period with 0 errors), but it only returned to OK at 17:27 - a 5 minute delay.
I then expected it to return to "In alarm" at 17:24, but this didn't happen until 17:28.
Finally, I expected it to have returned to OK at 17:31, but it took until 17:40 - a full 9 minutes later.
Why is there a 4-9 minute delay between when I expect a state transition and it actually happening?
I think the explanation is given in the following AWS forum thread:
"Unexplainable delay between Alarm data breach and Alarm state change"
Basically, alarms are evaluated over a longer window than the one you set, not only 1 minute. This window is the evaluation range, and you, as a user, don't have direct control over it.
From the forum:
The reporting criteria for the HTTPCode_Target_4XX_Count metric is if there is a non-zero value. That means data point will only be reported if a non-zero value is generated, otherwise nothing will be pushed to the metric.
CloudWatch standard alarm evaluates its state every minute and no matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods (1 in this case). The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The time frame of the data points that it attempts to retrieve is the evaluation range. Treat missing data as setting is applied if all the data in the evaluation range is missing, and not just if the data in evaluation period is missing.
Hence, CloudWatch alarms will look at some previous data points to evaluate its state, and will use the treat missing data as setting if all the data in evaluation range is missing. In this case, for the time when alarm did not transition to OK state, it was using the previous data points in the evaluation range to evaluate its state, as expected.
The alarm evaluation in case of missing data is explained in detail here, that will help in understanding this further:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-evaluating-missing-data
I had a similar issue with Lambda invocations, trying to detect zero invocations in an interval. The delay until the alarm fired was consistently three times the period, no matter whether the period was 10 minutes or 1 day.
This thread in the AWS forums also mentions the evaluation range, and suggests using the FILL() metric math function to work around this restriction.
Here's a CloudFormation sample that worked for me. The alarm is triggered after about 10-11 minutes of no invocations - as configured - instead of 30 minutes as before. That's good enough for me.
Caveat: It works around the evaluation range issue, it cannot help with ingestion delays of CloudWatch.
ManualCfMathAlarm:
  Type: AWS::CloudWatch::Alarm
  DependsOn:
    - ManualCfAlarmNotificationTopic
  Properties:
    AlarmDescription: Notifies on ZERO invocations, based on MATH
    AlarmName: ${self:service}-${self:provider.stage}-ManualCfMathAlarm
    OKActions:
      - !Ref ManualCfAlarmNotificationTopic
    AlarmActions:
      - !Ref ManualCfAlarmNotificationTopic
    InsufficientDataActions:
      - !Ref ManualCfAlarmNotificationTopic
    EvaluationPeriods: 1
    DatapointsToAlarm: 1
    Threshold: 1.0
    ComparisonOperator: LessThanThreshold
    TreatMissingData: "missing" # doesn't matter, because of FILL()
    Metrics:
      - Id: "e1"
        Expression: "FILL(m1, 0)"
        Label: "MaxFillInvocations"
        ReturnData: true
      - Id: "m1"
        MetricStat:
          Metric:
            Namespace: "AWS/Lambda"
            MetricName: "Invocations"
            Dimensions:
              - Name: "FunctionName"
                Value: "alarms-test-dev-AlarmsTestManual"
              - Name: "Resource"
                Value: "alarms-test-dev-AlarmsTestManual"
          Period: 600
          Stat: "Sum"
        ReturnData: false
We need to pay attention to the behavior of CloudWatch alarms when there are missing data points involved, as documented here.
If some data points in the evaluation range are missing, and the number of actual data points that were retrieved is lower than the alarm's number of Evaluation Periods, CloudWatch fills in the missing data points with the result you specified for how to treat missing data, and then evaluates the alarm. However, all real data points in the evaluation range are included in the evaluation. CloudWatch uses missing data points only as few times as possible.
One great way to automatically fill in missing data points is the FILL metric math expression.
For example, applying the expression FILL(METRICS(), 0) will fill in the missing values with 0.
Now we won't have any missing data points, so it is the evaluation 'period' that will be considered and not the evaluation 'range'. There shouldn't be any delay, and we can apply the alarm to the resulting metric.
Using the console, it looks something like this:
[Screenshot: metric math FILL expression in the AWS Console]
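Expressed in a CloudFormation alarm's Metrics block, the same idea might look like the sketch below; the Lambda function name and period are placeholder assumptions, and only the filled expression (e1) is returned for evaluation:

Metrics:
  - Id: "e1"
    Expression: "FILL(METRICS(), 0)"    # replace missing values with 0
    Label: "InvocationsFilled"
    ReturnData: true                    # the alarm evaluates this series
  - Id: "m1"
    MetricStat:
      Metric:
        Namespace: "AWS/Lambda"
        MetricName: "Invocations"
        Dimensions:
          - Name: "FunctionName"
            Value: "my-function"        # placeholder
      Period: 300
      Stat: "Sum"
    ReturnData: false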
I've set up an AWS CloudWatch alarm with the following parameters:
ActionsEnabled: true
AlarmActions: "some SNS topic"
AlarmDescription: "Too many HTTP 5xx errors"
ComparisonOperator: GreaterThanOrEqualToThreshold
DatapointsToAlarm: 1
Dimensions:
  - Name: ApiName
    Value: "some API"
EvaluationPeriods: 20
MetricName: 5XXError
Namespace: AWS/ApiGateway
Period: 300
Statistic: Average
Threshold: 0.1
TreatMissingData: ignore
The idea is to receive a mail when there are too many HTTP 500 errors. I believe the above gives me an alarm that evaluates time periods of 5 minutes (300s). If 1 out of 20 data points exceeds the limit (10% of the requests) I should receive an email.
This works, and I receive the email. But even when the number of errors drops below the threshold again, I keep receiving emails. It seems to continue for more or less the entire duration of the evaluation interval (1h40min = 20 x 5 minutes). Also, I receive these mails every 5 minutes, which leads me to think there must be a connection with my configuration.
This question implies that this shouldn't happen, which seems logical to me. In fact, I'd expect not to receive an email for at least 1 hour and 40 minutes (20 x 5 minutes), even if the threshold is breached again.
This is the graph of my metric/alarm:
Correction: I actually received 22 mails.
Have I made an error in my configuration?
Update
I can see that the state is set from Alarm to OK 3 minutes after it was set from OK to Alarm:
This is what we've found and how we fixed it.
We're evaluating blocks of 5 minutes and taking the average of the number of errors. But AWS evaluates at shorter intervals than every 5 minutes. The distribution of your errors can be such that at a given point in time, a 5 minute block has an average of 12%. But a bit later, this block could be split in two, giving you two blocks with different averages, possibly lower than the threshold.
That's what we believe is going on.
We've fixed it by changing our Period to 60 seconds and adjusting our DatapointsToAlarm and EvaluationPeriods settings accordingly.
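As a rough sketch of what that revised configuration could look like (the DatapointsToAlarm and EvaluationPeriods values below are illustrative assumptions, not the poster's actual numbers):

ComparisonOperator: GreaterThanOrEqualToThreshold
DatapointsToAlarm: 5                  # assumed: 5 breaching minutes...
EvaluationPeriods: 5                  # ...out of the last 5
MetricName: 5XXError
Namespace: AWS/ApiGateway
Period: 60
Statistic: Average
Threshold: 0.1
TreatMissingData: ignore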