I have a CloudWatch alarm which receives data from a canary. My canary attempts to visit a website; if the website is up and responding, the datapoint is 0, and if the server returns some sort of error, the datapoint is 1. Pretty standard canary stuff, I hope. This canary runs every 30 minutes.
My CloudWatch alarm is configured as follows:
The expected behaviour is that if my canary cannot reach the website 3 times in a row, then the alarm should go off.
Unfortunately, this is not what's happening. My alarm was triggered with the following canary data:
Feb 8 @ 7:51 PM (MST)
Feb 8 @ 8:22 PM (MST)
Feb 8 @ 9:52 PM (MST)
How is it possible that these three datapoints would trigger my alarm?
My actual email was received as follows:
You are receiving this email because your Amazon CloudWatch Alarm "...." in the US West (Oregon) region has entered the ALARM state, because "Threshold Crossed: 3 out of the last 3 datapoints [1.0 (09/02/21 04:23:00), 1.0 (09/02/21 02:53:00), 1.0 (09/02/21 02:23:00)] were greater than or equal to the threshold (1.0) (minimum 3 datapoints for OK -> ALARM transition)." at "Tuesday 09 February, 2021 04:53:30 UTC".
I am even more confused because the times on these datapoints do not align. If I convert these times to MST, we have:
Feb 8 @ 7:23 PM
Feb 8 @ 7:53 PM
Feb 8 @ 9:23 PM
The time range on the reported datapoints is a two hour window, when I have clearly specified my evaluation period as 1.5 hours.
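The conversion above can be double-checked with a short script. This is a minimal sketch that assumes MST is a fixed UTC-7 offset (no daylight saving) and that the timestamps in the email are in day/month/year order; the names `MST`, `email_timestamps` and `to_mst` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MST is a fixed UTC-7 offset with no daylight saving.
MST = timezone(timedelta(hours=-7), "MST")

# Datapoint timestamps copied from the alarm email (UTC, day/month/year).
email_timestamps = ["09/02/21 04:23:00", "09/02/21 02:53:00", "09/02/21 02:23:00"]


def to_mst(stamp: str) -> datetime:
    """Parse a UTC timestamp from the email and convert it to MST."""
    utc_time = datetime.strptime(stamp, "%d/%m/%y %H:%M:%S").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(MST)


converted = sorted(to_mst(s) for s in email_timestamps)
for t in converted:
    print(t.strftime("%b %d %I:%M %p %Z"))

# Span between the oldest and newest reported datapoint.
span = converted[-1] - converted[0]
print("span:", span)
```

The span between the first and last reported datapoint comes out as two hours, which is the mismatch with the 1.5 hour evaluation period described above.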
If I view the "metrics" chart in CloudWatch for my alarm, it makes even less sense:
The points in this chart are shown as:
Feb 9 @ 2:30 UTC
Feb 9 @ 3:00 UTC
Feb 9 @ 4:30 UTC
Which, again, appears to be a 2 hour evaluation period.
Help? I don't understand this.
How can I configure my alarm to fire if my canary cannot reach the website 3 times in a row (waiting 30 minutes in-between checks)?
I have two things to answer this:
Every time a canary runs, one datapoint is sent to CloudWatch. So if you want the alarm to trigger on 3 failures within 30 minutes, your canary should run at an interval of 10 minutes. That gives 3 datapoints in 30 minutes, and all 3 must be failures for the alarm to trigger.
For some reason the statistic was not working for me, so I used the count option. This might help.
My suggestion is to run the canary every 5 minutes. That gives 6 datapoints in 30 minutes; create the alarm for count=4.
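The arithmetic behind these suggestions is just the window length divided by the canary interval. A small sketch, where `datapoints_per_window` is a hypothetical helper name and one datapoint per canary run is assumed:

```python
def datapoints_per_window(window_minutes: int, interval_minutes: int) -> int:
    """Number of canary runs (one datapoint each) that fit in a window."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return window_minutes // interval_minutes


for interval in (30, 10, 5):
    count = datapoints_per_window(30, interval)
    print(f"canary every {interval} min -> {count} datapoint(s) in 30 min")
```

With a 30-minute interval only one datapoint lands in a 30-minute window, so a rule that needs 3 datapoints has to look across a longer window or use a shorter canary interval.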
The way I read your config, your alarm expects to find 3 data points within a 30-minute window, but your metric is only updated every 30 minutes, so this condition will never be true.
You need to increase the period so that there are 3 or more data points available to trigger the alarm.
Related
Can someone please explain how I should configure my CloudWatch alarms, as the documentation is terribly confusing?
Use case: I want to check for errors once every 30 mins and trigger an alarm if I see more than 5 errors in the logs.
Below is my current configuration:
threshold - 5
period - 1800
datapoints_to_alarm - 1
evaluation_periods - 1
comparison_operator - GreaterThanOrEqualToThreshold
statistic - Sum
treat_missing_data - notBreaching
When I tested with period set to 120, I was able to validate that the alarm gets triggered when there are 5 or more errors in the logs.
However, when I changed the period to 1800, I see that the alarm is triggered instantly on seeing 5 errors in the logs and it does not come out of the alarm state for 1800 sec (30 mins). Any ideas how to fix this?
The above solution is the answer to the question. There is no way to get out of the alarm state before the window defined by period and evaluation_periods ends.
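To see why the alarm holds for the whole period, it may help to model how a Sum statistic is bucketed. The sketch below is a simplified model, not CloudWatch's actual evaluation logic: it assumes errors are grouped into fixed, aligned periods and that each period's Sum is compared with the threshold. The helper `period_sums` and the sample timestamps are made up for illustration.

```python
from collections import Counter


def period_sums(error_times_s, period_s, total_s):
    """Sum of errors per fixed period; periods with no errors sum to 0."""
    counts = Counter(t // period_s for t in error_times_s)
    return [counts.get(i, 0) for i in range(total_s // period_s)]


# Hypothetical data: five errors logged within the first minute of an hour.
errors = [5, 12, 20, 33, 41]
THRESHOLD = 5

for period in (120, 1800):
    sums = period_sums(errors, period, total_s=3600)
    breaching = [s >= THRESHOLD for s in sums]
    print(f"period={period}s first sums={sums[:3]} "
          f"breaching for {sum(breaching) * period}s")
```

In both cases exactly one period breaches, but with a period of 1800 that single datapoint covers 30 minutes, so in this model the Sum stays at 5 until the period rolls over.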
I have an alarm set up in AWS CloudWatch on a metric which generates a data point every hour. When its value is greater than or equal to 1, the alarm goes to the ALARM state. The settings are as follows:
On 2nd Nov, it got into the ALARM state and then back to the OK state after 3 hours. I'm just trying to understand why it took 3 hours to get back to the OK state instead of 1 hour, because the metric runs every hour.
Here are the logs which show that the alarm transitioned from the ALARM to the OK state after 3 hours.
The following graph shows the data point value every hour.
This is probably because alarms are evaluated over a longer period than your 1 hour. What matters is the evaluation range. In your case, the evaluation range could be longer than your 1 hour, so it takes longer for the state to change.
There is also a thread about this behavior, with extra info, on the AWS forum:
Unexplainable delay between Alarm data breach and Alarm state change
I've set up an AWS CloudWatch alarm with the following parameters:
ActionsEnabled: true
AlarmActions: "some SNS topic"
AlarmDescription: "Too many HTTP 5xx errors"
ComparisonOperator: GreaterThanOrEqualToThreshold
DatapointsToAlarm: 1
Dimensions:
  - Name: ApiName
    Value: "some API"
EvaluationPeriods: 20
MetricName: 5XXError
Namespace: AWS/ApiGateway
Period: 300
Statistic: Average
Threshold: 0.1
TreatMissingData: ignore
The idea is to receive a mail when there are too many HTTP 500 errors. I believe the above gives me an alarm that evaluates time periods of 5 minutes (300s). If 1 out of 20 data points exceeds the limit (10% of the requests) I should receive an email.
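Read this way, the configuration is an "M out of N" rule with M = DatapointsToAlarm = 1 and N = EvaluationPeriods = 20. The sketch below is a simplified model of that rule, not CloudWatch's real evaluator (it ignores missing data and the evaluation range); `in_alarm` and the sample series are hypothetical.

```python
def in_alarm(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """True if at least M of the last N datapoints are >= threshold."""
    window = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in window if value >= threshold)
    return breaching >= datapoints_to_alarm


PERIOD_S = 300
EVALUATION_PERIODS = 20
print("window length:", PERIOD_S * EVALUATION_PERIODS / 60, "minutes")

# Hypothetical series: one breaching 5-minute average followed by quiet periods.
series = [0.12]
for step in range(1, 22):
    state = in_alarm(series, 0.1, EVALUATION_PERIODS, 1)
    print(step, "ALARM" if state else "OK")
    series.append(0.0)
```

In this model a single breaching datapoint keeps the rule satisfied for 20 consecutive evaluations, i.e. 100 minutes, and the rule only stops matching once that datapoint falls out of the 20-period window.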
This works. I receive the email. But even if the number of errors drops below the threshold again, I seem to keep receiving emails. This seems to last more or less for the entire duration of the evaluation interval (1h40min = 20 x 5 minutes). Also, I receive these mails every 5 minutes, leading me to think there must be a connection with my configuration.
This question implies that this shouldn't happen, which seems logical to me. In fact, I'd expect not to receive an email for at least 1 hour and 40 minutes (20 x 5 minutes), even if the threshold is breached again.
This is the graph of my metric/alarm:
Correction: I actually received 22 mails.
Have I made an error in my configuration?
Update
I can see that the state is set from Alarm to OK 3 minutes after it was set from OK to Alarm:
This is what we've found and how we fixed it.
So we're evaluating blocks of 5 minutes and taking the average of the errors. But AWS is evaluating at shorter intervals than 5 minutes. The distribution of your errors can be such that at a given point in time, a 5-minute block has an average of 12%. But a bit later, this block could be split in two, giving you two blocks with different averages, possibly lower than the threshold.
That's what we believe is going on.
We've fixed it by changing our Period to 60s and adjusting our DatapointsToAlarm and EvaluationPeriods settings.
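The effect described above can be illustrated with made-up numbers. This sketch assumes one error-rate value per minute and shows how the average of a 5-minute block depends on where the block boundaries fall. It only illustrates the hypothesis; it is not a description of how CloudWatch actually aligns periods, and `block_average` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical per-minute 5xx error rates (fraction of requests) over 10 minutes.
per_minute = [0.0, 0.0, 0.0, 0.0, 0.3, 0.3, 0.0, 0.0, 0.0, 0.0]
THRESHOLD = 0.1
BLOCK = 5


def block_average(values, start, size=BLOCK):
    """Average of `size` consecutive per-minute values starting at `start`."""
    return mean(values[start:start + size])


for start in (2, 0, 5):
    avg = block_average(per_minute, start)
    print(f"block starting at minute {start}: "
          f"average={avg:.2f}, breaching={avg >= THRESHOLD}")
```

The same burst averages 12% when it falls inside one block (minutes 2 to 6), but only 6% in each block when the boundary splits it (minutes 0 to 4 and 5 to 9), which is below the 10% threshold.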
I have to check whether EC2 machines are on for more than 6 hours, so I set a CloudWatch alarm to fire if CPUUtilization > 0 for 24 periods of 15 min. The issue is that even if the machine is used for less time, after it is switched off I receive emails like the following one. As you can see, the threshold was crossed for only 6 of 24 datapoints, but I receive the alarm anyway. Any clue what is going on here?
Alarm Details:
- Name: Machine up
- Description: ATTENTION: the machine has been up for more than 6 hours.
- State Change: INSUFFICIENT_DATA -> ALARM
- Reason for State Change: Threshold Crossed: 6 out of the last 24 datapoints were greater than or equal to the threshold (0.0). The most recent datapoints which crossed the threshold: [4.2 (26/09/19 13:16:00), 4.8 (26/09/19 13:01:00), 6.933333333333334 (26/09/19 12:46:00), 4.0 (26/09/19 12:31:00), 24.4 (26/09/19 12:16:00)] (minimum 24 datapoints for OK -> ALARM transition).
I want to trigger my AWS Lambda function on the 15th of every month, but my function is triggering every 30 minutes. My function in Serverless.yml is:
monthlyTbAlert:
  warmup: true
  handler: handlers/monthly-tbalert/index.monthlyTbAlert
  timeout: 60
  events:
    - schedule: cron(0 0 10 15 1/1 ? *)
      enabled: true
If you want to debug your cron expressions before deploying them, you can go to CloudWatch -> Rules and test them there. It's a very useful playground if you're unsure about what may be going on.
If we grab the expression provided in @Stargazer's answer (which, by the way, is very accurate) and paste it in CloudWatch Rules, we can see when the next triggers will happen:
By using yours, however, we can see no events are shown. If you say it is running every 30 minutes, then there potentially is a bug in CloudWatch rules that triggers invalid expressions every 30 minutes:
According to the AWS docs, the format is cron(Minutes Hours Day-of-month Month Day-of-week Year)
So you should use this:
0 - Minute 0 of the hour
10 - Hour of the day, so 10:00
15 - 15th day of the month
* - Execute it every month
? - Regardless of the day of the week
* - Every year
So, your cron expression should be cron(0 10 15 * ? *) to execute on the 15th day of every month at 10:00 AM.
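A quick field count catches this class of mistake before deploying. The helper below is a hypothetical sketch that only checks the number of fields and labels them in the order given in the AWS docs quoted above; it does not validate the values themselves.

```python
AWS_CRON_FIELDS = ["Minutes", "Hours", "Day-of-month", "Month", "Day-of-week", "Year"]


def label_cron_fields(expression: str) -> dict:
    """Map each field of a cron(...) schedule expression to its AWS field name.

    Raises ValueError if the expression does not have exactly six fields.
    """
    body = expression.strip()
    if body.startswith("cron(") and body.endswith(")"):
        body = body[len("cron("):-1]
    fields = body.split()
    if len(fields) != len(AWS_CRON_FIELDS):
        raise ValueError(
            f"expected {len(AWS_CRON_FIELDS)} fields, got {len(fields)}: {fields}"
        )
    return dict(zip(AWS_CRON_FIELDS, fields))


print(label_cron_fields("cron(0 10 15 * ? *)"))

try:
    label_cron_fields("cron(0 0 10 15 1/1 ? *)")  # expression from the question
except ValueError as err:
    print("invalid:", err)
```

The expression from the question has seven fields (it looks like a seconds-first format), so this check rejects it, while the six-field version maps cleanly onto minute 0, hour 10, day 15.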