I have an alarm tracking the metric for LoadBalancer 5xx errors on a single ALB. It should go into the "In alarm" state if 1 datapoint within the past 1 minute is above the threshold of 2. The period is set to 1 minute. See the alarm details:
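For reference, a roughly equivalent CloudFormation definition might look like the sketch below; the metric name, the LoadBalancer dimension value, and the actions are assumptions based on the description above, not the exact alarm in the screenshot:

Alb5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: ALB is returning 5xx errors
    Namespace: AWS/ApplicationELB
    MetricName: HTTPCode_ELB_5XX_Count   # assumption; could also be HTTPCode_Target_5XX_Count
    Dimensions:
      - Name: LoadBalancer
        Value: app/my-alb/1234567890abcdef   # placeholder ALB dimension
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    DatapointsToAlarm: 1
    Threshold: 2
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching   # "missing data as GOOD", as described below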
On 2020-09-23 at 17:18 UTC the load balancer started to return 502 errors. This is shown in the CloudWatch metric chart below, and I've confirmed the times are correct (this was a forced 502 response, so I know when I triggered it, and I can see the 17:18 timestamp in the ALB logs).
But in the alarm log, the "In alarm" state was only entered at 17:22 UTC - 4 minutes after the 17:18 period had more than 2 errors. This isn't a delay in receiving a notification - notifications were correctly received within seconds of the state change - it's a delay in the state change itself compared to my expectation.
Here is the Alarm log with the state change timestamps:
We treat missing data as GOOD, so based on the metric graph I assume it should have recovered to OK at 17:22 (after the 17:21 period with 0 errors), but it only returned to OK at 17:27 - a 5-minute delay.
I then expected it to return to "In alarm" at 17:24, but this didn't return until 17:28.
Finally, I expected it to return to OK at 17:31, but it took until 17:40 - a full 9 minutes later.
Why is there a 4-9 minute delay between when I expect a state transition and it actually happening?
I think the explanation is given in the following AWS forum thread:
Unexplainable delay between Alarm data breach and Alarm state change
Basically, alarms are evaluated over a longer window than the period you set, not just 1 minute. This window is the evaluation range, and as a user you don't have direct control over it.
From the forum:
The reporting criteria for the HTTPCode_Target_4XX_Count metric is a non-zero value. That means a data point will only be reported if a non-zero value is generated; otherwise nothing is pushed to the metric.
CloudWatch standard alarms evaluate their state every minute, and no matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods (1 in this case). The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether the metric has standard or high resolution. The time frame of the data points that it attempts to retrieve is the evaluation range. The "treat missing data as" setting is applied only if all the data in the evaluation range is missing, not just if the data in the evaluation period is missing.
Hence, CloudWatch alarms look at some previous data points to evaluate their state, and only fall back to the "treat missing data as" setting if all the data in the evaluation range is missing. In this case, for the times when the alarm did not transition to the OK state, it was using previous data points from the evaluation range to evaluate its state, as expected. For example, at 17:22 the 17:21 period had published no datapoint (zero errors are simply not reported), so CloudWatch reached back within the evaluation range, found the earlier breaching datapoints, and kept the alarm in the ALARM state until those aged out of the range.
The alarm evaluation in the case of missing data is explained in detail here, which will help in understanding this further:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-evaluating-missing-data
I have a similar issue with Lambda invocations, trying to detect zero invocations in an interval. The delay until the alarm fired was consistently three times the period, no matter whether the period was 10 minutes or 1 day.
This thread in the AWS forums also mentions the evaluation range, and suggests using the FILL() metric math function to work around this restriction.
Here's a CloudFormation sample that worked for me. The alarm is triggered after about 10-11 minutes of no invocations - as configured - instead of 30 minutes as before. That's good enough for me.
Caveat: It works around the evaluation range issue, it cannot help with ingestion delays of CloudWatch.
ManualCfMathAlarm:
  Type: AWS::CloudWatch::Alarm
  DependsOn:
    - ManualCfAlarmNotificationTopic
  Properties:
    AlarmDescription: Notifies on ZERO invocations, based on MATH
    AlarmName: ${self:service}-${self:provider.stage}-ManualCfMathAlarm
    OKActions:
      - !Ref ManualCfAlarmNotificationTopic
    AlarmActions:
      - !Ref ManualCfAlarmNotificationTopic
    InsufficientDataActions:
      - !Ref ManualCfAlarmNotificationTopic
    EvaluationPeriods: 1
    DatapointsToAlarm: 1
    Threshold: 1.0
    ComparisonOperator: LessThanThreshold
    TreatMissingData: "missing" # doesn't matter, because of FILL()
    Metrics:
      - Id: "e1"
        Expression: "FILL(m1, 0)"
        Label: "MaxFillInvocations"
        ReturnData: true
      - Id: "m1"
        MetricStat:
          Metric:
            Namespace: "AWS/Lambda"
            MetricName: "Invocations"
            Dimensions:
              - Name: "FunctionName"
                Value: "alarms-test-dev-AlarmsTestManual"
              - Name: "Resource"
                Value: "alarms-test-dev-AlarmsTestManual"
          Period: 600
          Stat: "Sum"
        ReturnData: false
We need to pay attention to the behavior of CloudWatch alarms when there are missing data points involved, as documented here:
If some data points in the evaluation range are missing, and the number of actual data points that were retrieved is lower than the alarm's number of Evaluation Periods, CloudWatch fills in the missing data points with the result you specified for how to treat missing data, and then evaluates the alarm. However, all real data points in the evaluation range are included in the evaluation. CloudWatch uses missing data points only as few times as possible.
One great way to automatically fill in missing data points is to use the FILL metric math expression.
For example, applying the expression FILL(METRICS(), 0) will fill in the missing values with 0.
Now we won't have any missing data points, so it is the evaluation 'period' that is considered rather than the evaluation 'range'. There shouldn't be any delay, and we can apply the alarm to the resulting metric.
Using the console, it looks something like this:
Screenshot of the AWS Console metric math setup
Related
I am trying to wrap my head around Threshold value in the context of creating an alerting policy for an external HTTP load balancer:
Resource type: https_lb_rule
Metric: https/request_count
The docs say the following:
Enter when the value of a metric violates the threshold by using the Threshold position and Threshold value fields. For example, if you set these values to Above threshold and 0.3, then any measurement higher than 0.3 violates the threshold.
Now what does the value 0.3 indicate? Is there a unit, and how do I relate it to the https/request_count metric for https_lb_rule?
Threshold value: a threshold is an amount, level, or limit on a scale, reflecting the level at which the required effect is achieved. When the threshold is reached, something else happens or changes; in an alerting policy, it is the point or level at which the alert should trigger.
As for your concern: for the Google Cloud HTTP(S) Load Balancing rule, the request count threshold is a plain numeric value - it counts the requests received by the load balancer.
If you set a threshold value of 10, the alert triggers when the count of received requests goes above 10, and you will keep getting alerts while the count stays above that value. There is an option in the advanced settings, the restart window (shown in the image below), that helps reduce the number of notifications: if the restart window is set to 1 hour, the alerting policy only opens an incident and notifies you when the count stays above the threshold (i.e. >10) for that full hour.
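For reference, the same condition expressed as a Cloud Monitoring AlertPolicy looks roughly like the sketch below; the aggregation settings and display names are assumptions, not something from your console setup:

displayName: "HTTPS LB request count above 10"
combiner: OR
conditions:
  - displayName: "https/request_count > 10"
    conditionThreshold:
      filter: 'metric.type="loadbalancing.googleapis.com/https/request_count" AND resource.type="https_lb_rule"'
      aggregations:
        - alignmentPeriod: 60s          # assumption: sum requests over 1-minute windows
          perSeriesAligner: ALIGN_SUM
      comparison: COMPARISON_GT
      thresholdValue: 10                # the "Threshold value": a plain request count, no unit
      duration: 0s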
I have an alarm set up in AWS CloudWatch whose metric generates a data point every hour. When its value is greater than or equal to 1, the alarm goes to the ALARM state. Here are the settings:
On 2nd Nov, it went into the ALARM state and then back to the OK state in 3 hours. I'm just trying to understand why it took 3 hours to get back to the OK state instead of 1 hour, given the metric runs every hour.
Here are the logs which show that the alarm transitioned from the ALARM to the OK state in 3 hours.
The following graph shows the data point value every hour.
This is probably because alarms are evaluated over a window longer than your 1-hour period. That window is the evaluation range. In your case, the evaluation range can span several periods, so it takes longer for the state to change.
There is also a thread about this behavior, with extra info, on the AWS forum:
Unexplainable delay between Alarm data breach and Alarm state change
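If the extra delay matters, and if your metric (like the ALB metrics above) only publishes non-zero datapoints, the FILL() workaround from the earlier answer can be adapted to an hourly period. A minimal sketch, where the namespace, metric name, and dimension are placeholders:

HourlyMetricAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Goes to ALARM when the hourly datapoint is >= 1
    EvaluationPeriods: 1
    DatapointsToAlarm: 1
    Threshold: 1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: "missing" # doesn't matter, because of FILL()
    Metrics:
      - Id: "e1"
        Expression: "FILL(m1, 0)"   # missing hours become explicit zeros
        ReturnData: true
      - Id: "m1"
        MetricStat:
          Metric:
            Namespace: "Custom/MyApp"        # placeholder
            MetricName: "HourlyErrorCount"   # placeholder
            Dimensions:
              - Name: "Service"
                Value: "my-service"          # placeholder
          Period: 3600
          Stat: "Sum"
        ReturnData: false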
We have an AWS Elasticsearch cluster set up. However, our error rate alarm goes off at regular intervals. The way we are trying to calculate our error rate is:
((sum(4xx) + sum(5xx))/sum(ElasticsearchRequests)) * 100
However, if you look at the screenshot below, at 7:15 the 4xx count was 4 while the ElasticsearchRequests value was only 2. Based on the metrics info on the AWS Elasticsearch documentation page, ElasticsearchRequests should be the total number of requests, so it should clearly be greater than or equal to the 4xx count.
Can someone please help me understand what I am doing wrong here?
AWS definitions of these metrics are:
OpenSearchRequests (previously ElasticsearchRequests): The number of requests made to the OpenSearch cluster. Relevant statistics: Sum
2xx, 3xx, 4xx, 5xx: The number of requests to the domain that resulted in the given HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant statistics: Sum
Please note the different terms used for the subjects of the metrics: cluster vs domain
To my understanding, OpenSearchRequests only counts requests that actually reach the underlying OpenSearch/Elasticsearch cluster, so some of the 4xx requests might not be included (e.g. 403 errors), hence the difference in the metrics.
Also, AWS only recommends comparing 5xx to OpenSearchRequests:
5xx alarms >= 10% of OpenSearchRequests: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.
I know this was posted a while back but I've additionally struggled with this issue and maybe I can add a few pointers.
First off, make sure your metrics are properly configured. For instance, some responses (4xx for example) can take up to 5 minutes to register, while OpenSearchRequests is refreshed every minute. This makes for a very wonky graph that will definitely throw off your error rate.
In the picture above, I send a request that returns 400 every 5 seconds, and a request that returns 200 every 0.5 seconds. The period in this case is 1 minute, so on average the error rate should be around 10%. As you can see from the green line, the requests sent are summed every minute, whereas the 4xx responses are summed every 5 minutes and are 0 in the minutes in between, which produces an error rate spike every 5 minutes (since the OpenSearch requests are not multiplied by 5).
In the next image, the period is set to 5 minutes. Notice how this time the error rate is around 10 percent.
When I look at your graph, I see metrics that look like they are based on different periods.
The second pointer I can add is to make sure to account for periods when no data is coming in. The alarm's behavior may vary based on how you define the "treat missing data" parameter. In some cases, if no data comes in, your expression might keep the alarm in the ALARM state when in fact there is simply no new data arriving. Some metrics return no value when no requests are made, while others return 0. In the former case, you can use the FILL(metric, value) function to specify what to return when no value is reported. Also experiment with what happens to your error rate when you divide by zero.
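To make those two pointers concrete, here is a minimal sketch of an error-rate alarm that applies FILL() to the 4xx/5xx metrics and guards the division with IF(), so periods with zero requests produce a 0% error rate instead of a division by zero. The domain name and account ID dimensions are placeholders:

ErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: OpenSearch error rate above 10%
    EvaluationPeriods: 1
    DatapointsToAlarm: 1
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    Metrics:
      - Id: "errorRate"
        # Periods with no requests yield 0% instead of a divide-by-zero
        Expression: "IF(reqs, 100 * (FILL(err4xx, 0) + FILL(err5xx, 0)) / reqs, 0)"
        ReturnData: true
      - Id: "reqs"
        MetricStat:
          Metric:
            Namespace: "AWS/ES"
            MetricName: "OpenSearchRequests"
            Dimensions:
              - Name: "DomainName"
                Value: "my-domain"        # placeholder
              - Name: "ClientId"
                Value: "123456789012"     # placeholder account ID
          Period: 300                     # 5-minute buckets, per the point above about matching periods
          Stat: "Sum"
        ReturnData: false
      - Id: "err4xx"
        MetricStat:
          Metric:
            Namespace: "AWS/ES"
            MetricName: "4xx"
            Dimensions:
              - Name: "DomainName"
                Value: "my-domain"
              - Name: "ClientId"
                Value: "123456789012"
          Period: 300
          Stat: "Sum"
        ReturnData: false
      - Id: "err5xx"
        MetricStat:
          Metric:
            Namespace: "AWS/ES"
            MetricName: "5xx"
            Dimensions:
              - Name: "DomainName"
                Value: "my-domain"
              - Name: "ClientId"
                Value: "123456789012"
          Period: 300
          Stat: "Sum"
        ReturnData: false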
Hope this message helps clarify a bit.
I've set up an AWS CloudWatch alarm with the following parameters:
ActionsEnabled: true
AlarmActions: "some SNS topic"
AlarmDescription: "Too many HTTP 5xx errors"
ComparisonOperator: GreaterThanOrEqualToThreshold
DatapointsToAlarm: 1
Dimensions:
  - Name: ApiName
    Value: "some API"
EvaluationPeriods: 20
MetricName: 5XXError
Namespace: AWS/ApiGateway
Period: 300
Statistic: Average
Threshold: 0.1
TreatMissingData: ignore
The idea is to receive an email when there are too many HTTP 500 errors. I believe the above gives me an alarm that evaluates time periods of 5 minutes (300s). If 1 out of 20 data points exceeds the limit (10% of the requests), I should receive an email.
This works. I receive the email. But even when the number of errors drops below the threshold again, I keep receiving emails, more or less for the entire duration of the evaluation interval (1h40min = 20 x 5 minutes). Also, I receive these emails every 5 minutes, which leads me to think there must be a connection with my configuration.
This question implies that this shouldn't happen, which seems logical to me. In fact, I'd expect not to receive an email for at least 1 hour and 40 minutes (20 x 5 minutes), even if the threshold is breached again.
This is the graph of my metric/alarm:
Correction: I actually received 22 mails.
Have I made an error in my configuration?
Update
I can see that the state is set from Alarm to OK 3 minutes after it was set from OK to Alarm:
This is what we've found and how we fixed it.
So we're evaluating blocks of 5 minutes and taking the average error rate. But AWS evaluates at intervals faster than 5 minutes. The distribution of your errors can be such that, at a given point in time, a 5-minute block has an average of 12%, but a bit later that block is split in two, giving you two blocks with different averages, possibly below the threshold.
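As a rough illustration (the numbers are made up): a 5-minute block containing 25 requests, 3 of which fail, has an average 5XXError of 3/25 = 12%, which breaches the 10% threshold. A few minutes later the same errors may straddle two adjacent blocks of, say, 2/24 ≈ 8% and 1/26 ≈ 4%, neither of which breaches, and a later re-slicing can breach again - which is consistent with the alarm repeatedly flipping state and sending a mail every 5 minutes.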
That's what we believe is going on.
We've fixed it by changing our Period to 60s and adjusting our DatapointsToAlarm and EvaluationPeriods settings accordingly.
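For reference, the reworked alarm looked roughly like the sketch below; the EvaluationPeriods and DatapointsToAlarm values here are illustrative, not necessarily the exact numbers we settled on:

ApiGateway5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Too many HTTP 5xx errors"
    Namespace: AWS/ApiGateway
    MetricName: 5XXError
    Dimensions:
      - Name: ApiName
        Value: "some API"
    Statistic: Average
    Period: 60                  # evaluate 1-minute buckets instead of 5-minute ones
    EvaluationPeriods: 5        # illustrative
    DatapointsToAlarm: 3        # illustrative: 3 of the last 5 minutes above 10%
    Threshold: 0.1
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: ignore
    AlarmActions:
      - "some SNS topic"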
I have an alerts system which is currently powered by Prometheus and I need to port it to CloudWatch.
Prometheus is aware of counter resets so I can, let's say, calculate the rate() in the last 24h seamlessly, without handling the counter resets myself.
Is CloudWatch aware of this too?
The RATE function is available in CloudWatch Metric Math and is defined as:
Returns the rate of change of the metric per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values.
Unlike Prometheus, it is not aware of counter resets, so you would need to modify the way you emit the metric so the counter doesn't reset. A possible workaround is to increase the number of datapoints to alarm: for example, you can configure your alarm to go into the ALARM state only if 2 or more datapoints are less than or equal to (<=) 0; this way you avoid getting an alarm when a single reset occurs.
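A minimal sketch of that workaround as a CloudFormation alarm, where the namespace, metric name, and dimension are placeholders for your counter metric:

CounterStalledAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Counter stopped increasing (rate <= 0)
    EvaluationPeriods: 2
    DatapointsToAlarm: 2        # two consecutive non-positive rates, so a single reset doesn't alarm
    Threshold: 0
    ComparisonOperator: LessThanOrEqualToThreshold
    TreatMissingData: missing
    Metrics:
      - Id: "r1"
        Expression: "RATE(m1)"  # per-second rate; goes negative when the counter resets
        ReturnData: true
      - Id: "m1"
        MetricStat:
          Metric:
            Namespace: "Custom/MyApp"       # placeholder
            MetricName: "RequestsTotal"     # placeholder counter metric
            Dimensions:
              - Name: "Service"
                Value: "my-service"         # placeholder
          Period: 300
          Stat: "Maximum"
        ReturnData: false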