Sorry for the large image, but it's the best way to convey what I'm struggling to understand.
This is a simple alarm that should trigger when a Lambda function generates 10 or more errors within 1 hour. Should be very simple, basic stuff.
So why does this alarm go into the ALARM state when the metric doesn't cross the threshold, as shown in the image (green boxes)? The new(?) bar at the bottom is the state of the alarm.
All relevant settings should be in the screenshot; it's just "sum of errors over 1 hour".
I could just adjust the threshold to account for this weirdness, but I'm guessing this is not an AWS error but a failure to understand on my part. I want to understand.
I did find this text hidden under an information icon: "The alarming datapoints can appear different to the metric line because of aggregation when displaying at a higher period or because of delayed data that arrived after the alarm was evaluated."
I guess that is the answer, but I still don't like that the visualization doesn't use the same logic as the alarm. Kind of defeats the purpose.
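For reference, the alarm described above ("sum of errors over 1 hour", alarm at 10 or more) corresponds to something like the following as a CloudFormation resource; the function name is a placeholder and the TreatMissingData setting is an assumption:

"LambdaErrorAlarm": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "Namespace": "AWS/Lambda",
    "MetricName": "Errors",
    "Dimensions": [{ "Name": "FunctionName", "Value": "my-function" }],
    "Statistic": "Sum",
    "Period": 3600,
    "EvaluationPeriods": 1,
    "Threshold": 10,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "TreatMissingData": "notBreaching"
  }
}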
TL;DR
What is the proper way to create a metric so that it generates reliable information about the Logs Insights results?
What is desired
The current Logs Insights results look similar to the following:
However, it becomes easier to analyse these logs using the metrics (mostly because you can have multiple sources of data in the same plot and even perform math operations between them).
Solution according to docs
Allegedly, a log can be converted to a metric filter by following a guide like this. However, this approach does not seem to work entirely correctly (I guess because of the time frames that have to be imposed on the metric plots), providing incorrect information. For example:
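For context, the kind of metric filter that guide walks you through looks roughly like this as a CloudFormation resource (the log group name, filter pattern and metric namespace below are placeholders):

"ErrorMetricFilter": {
  "Type": "AWS::Logs::MetricFilter",
  "Properties": {
    "LogGroupName": "/aws/lambda/my-function",
    "FilterPattern": "ERROR",
    "MetricTransformations": [
      {
        "MetricNamespace": "MyApp",
        "MetricName": "ErrorCount",
        "MetricValue": "1",
        "DefaultValue": 0
      }
    ]
  }
}

Since each matching log event emits a value of 1, the graph statistic needs to be Sum for the metric to count events; from what I can tell, mismatches like the one below tend to come from the graph's period and time range not lining up exactly with the Logs Insights query.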
Issue with solution
In the previous image I've created a dashboard containing the metric count (the number 7), corresponding to the sum of events every 5 minutes. I've also added a preview of the Logs Insights query corresponding to the information used to create the metric.
However, as can be seen, the number of logs is 4, but the event count displays 7. Changing the time frame in the metric generates other types of issues (e.g., selecting a very small time frame like 1 second won't retrieve any data, and a slightly smaller time frame will provide another wrong number: 3, when there are 4 logs, for example).
P.S.
I've also tried converting the Logs Insights results to metrics using this Lambda function, as suggested by Danil Smirnov, to no avail, as it seems to generate the same issues.
So I am trying to visualize an AWS CloudWatch metric with tolerance lines via AWS Managed Grafana.
For example, this is my current graph (and I want to add tolerance lines to it):
I want to add some tolerance lines to see which spikes go outside of the expected range.
I could technically do this by enabling CloudWatch anomaly detection and using ANOMALY_DETECTION_BAND(a) as my metric math, but I am trying to replicate a dashboard we currently have, which uses a 6-week rolling average with a simple multiplier as the upper and lower thresholds.
My thought was that I can accomplish this using metric math by leveraging a combination of SLICE, RUNNING_SUM, and DATAPOINT_COUNT but no matter what combination I try I can't seem to find the right mix.
Does anyone know how I can use Metric Math to create a time series where each data point is either:
The average of the data points from the last x amount of time (e.g., the last 6 days' worth of data points)
The average of the last x data points.
If I can figure out either of these I can do the rest, but I am having a hard time just getting "the last x data points" instead of referencing the entire query when doing any metric math operation.
I could maybe find a way to do this with built-in Grafana functionality as well, but I couldn't find a great way to do it (still new to Grafana).
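For reference, the ANOMALY_DETECTION_BAND fallback mentioned above would look something like the following as a CloudWatch dashboard metrics array (the namespace, metric name and band width of 2 standard deviations are placeholders; the same expression should also be usable as a metric math query in Grafana's CloudWatch data source). Note that CloudWatch needs an anomaly detection model on the metric before the band renders:

"metrics":[
  ["MyApp","RequestCount",{"id":"m1"}],
  [{"expression":"ANOMALY_DETECTION_BAND(m1, 2)","label":"ExpectedRange","id":"band"}]
]

This isn't the 6-week rolling average described above, but it does give upper and lower tolerance lines with very little setup.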
I have a service and I want to know how many errors it throws.
So I've created a metric and an alert based on that metric.
The metric is a counter, and it filters out all the unneeded logs, leaving only the relevant ones.
The alert is using the metric, with an aggregator of type 'count' and an aligner of type 'delta', resulting in a value of '1' when the metric catches any errors.
The condition for the alert is to check if the most recent value is above 0.99.
After an incident from that alert has been fired, it just won't close.
I went to the summary page and it shows that for some reason the condition is still being met (at least that is what I understand from the red lines that keep increasing) even though the errors were last thrown a few hours ago.
In the picture you can see the red lines, which indicate the duration of the incident, and below them in the graph you can see three small points where an error was detected. The first one caused the incident to fire.
Any help on how to get the incident to resolve?
Thanks!
I was able to fix the problem as soon as I set the aggregator to 'sum' instead of 'count'.
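For anyone hitting the same issue, the working condition ends up looking roughly like this in the Cloud Monitoring AlertPolicy JSON (the metric filter and alignment period are placeholders); the only change from the broken version is the cross-series reducer:

"conditionThreshold": {
  "filter": "metric.type=\"logging.googleapis.com/user/my-error-metric\"",
  "aggregations": [
    {
      "alignmentPeriod": "300s",
      "perSeriesAligner": "ALIGN_DELTA",
      "crossSeriesReducer": "REDUCE_SUM"
    }
  ],
  "comparison": "COMPARISON_GT",
  "thresholdValue": 0.99,
  "duration": "0s"
}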
From the docs:
No matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods. The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The timeframe of the data points that it attempts to retrieve is the evaluation range.
The docs go on to give an example of an alarm with 'EvaluationPeriods' and 'DatapointsToAlarm' set to 3. They state that CloudWatch chooses the 5 most recent data points. Part of my question is: where are they getting 5? It's not clear from the docs.
The second part of my question is: why have this behavior at all (or at least, why have it by default)? If I set my evaluation period to 3, my Datapoints to Alarm to 3, and tell CloudWatch to 'TreatMissingData' as 'breaching', I'm going to expect 3 periods of missing data to trigger an alarm state. This doesn't necessarily happen, as illustrated by an example in the docs.
I had the same questions. As best I can tell, the 5 can be explained if I am thinking about standard collection intervals vs standard resolution correctly. In other words, if we assume a standard collection interval of 5 minutes and a standard 1-minute resolution, then within the 5 minutes of the collection interval, 5 separate data points are collected. The example states you need 3 data points over 3 evaluation periods, which is less than the 5 data points CloudWatch has collected. CloudWatch would then have all the data points it needs within the 5-data-point evaluation range defined by a single collection. As an example, if 4 of the 5 expected data points are missing from the collection, you have one defined data point and thus need 2 more within the evaluation range to reach the three needed for alarm evaluation. These 2 (not the 4 that are actually missing from the collection) are considered the "missing" data points in the documentation - I find this confusing. The tables in the AWS documentation provide examples for how the different treatments of the "missing" 2 data points impact the alarm evaluations.
Regardless of whether this is the correct interpretation, this could be better explained in the documentation.
I also agree that this behavior is unexpected, and the fact that you can't configure it is quite frustrating. However, there does seem to be an easy workaround depending on your use case.
I also wanted the same behavior as you specified; i.e. a missing data point is a breaching data point plain and simple:
If I set my evaluation period to 3, my Datapoints to Alarm to 3, and tell CloudWatch to 'TreatMissingData' as 'breaching', I'm going to expect 3 periods of missing data to trigger an alarm state.
I had a use case which is basically like a push-style health monitor. We needed a particular on-premises service to report a "healthy" metric daily to CloudWatch, and an alarm in case this report didn't come through due to network issues or anything disruptive. Semantically, missing data is the same as reporting a metric of value 0 (the "healthy" metric is value 1).
So I was able to use metric math's FILL function to replace every missing data point with 0. Setting a 1-out-of-1 alarm on this new expression, with "alarm when < 1" as the condition, provides exactly the needed behavior without involving any kind of "missing data".
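As a sketch of what that looks like as a CloudFormation metric-math alarm (the namespace, metric name and daily period are placeholders for my health-report use case):

"HealthAlarm": {
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "Metrics": [
      {
        "Id": "raw",
        "MetricStat": {
          "Metric": { "Namespace": "MyApp", "MetricName": "HealthyReport" },
          "Period": 86400,
          "Stat": "Maximum"
        },
        "ReturnData": false
      },
      {
        "Id": "filled",
        "Expression": "FILL(raw, 0)",
        "ReturnData": true
      }
    ],
    "EvaluationPeriods": 1,
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold"
  }
}

The FILL expression turns every empty period into a 0, so the "missing data" setting never comes into play.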
I have a custom CloudWatch metric with unit Seconds (representing the age of a cache).
As usual values are around 125,000, I'd like to convert them into hours for better readability.
Is that possible?
This has changed with the addition of Metric Math. You can do all sorts of transformations on your data, both manually (from the console) and from CloudFormation dashboard templates.
From the console: see the link above, which says:
To add a math expression to a graph
Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
Create or edit a graph or line widget.
Choose Graphed metrics.
Choose Add a math expression. A new line appears for the expression.
For the Details column, type the math expression. The tables in the following section list the functions you can use in the expression.
To use a metric or the result of another expression as part of the formula for this expression, use the value shown in the Id column. For example, m1+m2 or e1-MIN(e1).
From a CloudFormation Template
You can add new metrics which are Metric Math expressions, transforming existing metrics. You can add, subtract, multiply, etc. metrics and scalars. In your case, you probably just want division, like in this example:
Say you have the following bucket request latency metrics object in your template:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName"]
]
The latency default is in milliseconds. Let's plot it in seconds, just for fun. 1s = 1,000ms so we'll add the following:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName",{"id": "timeInMillis"}],
[{"expression":"timeInMillis / 1000", "label":"LatencyInSeconds","id":"timeInSeconds"}]
]
Note that the expression has access to the ID of the other metrics. Helpful naming can be useful when things get more complicated, but the key thing is just to match the variables you put in the expression to the ID you assign to the corresponding metric.
This leaves us with a graph with two metrics on it: one milliseconds, the other seconds. If we want to lose the milliseconds, we can, but we need to keep the metric values around to compute the math expression, so we use the following work-around:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName",{"id": "timeInMillis","visible":false}],
[{"expression":"timeInMillis / 1000", "label":"LatencyInSeconds","id":"timeInSeconds"}]
]
Making the metric invisible takes it off the graph while still allowing us to compute our expression off of it.
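Applying the same pattern to the original question (Seconds to Hours, so dividing by 3,600), and assuming a custom namespace and metric name, it would look something like:

"metrics":[
  ["MyCustomNamespace","CacheAgeSeconds",{"id": "ageSeconds","visible":false}],
  [{"expression":"ageSeconds / 3600", "label":"CacheAgeHours","id":"ageHours"}]
]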
CloudWatch does not do any unit conversion (e.g., seconds into hours). So you cannot use the AWS console to display your 'Seconds' datapoint values converted to hours.
You could publish your metric values as hours instead (leaving the Unit field blank or setting it to 'None').
Otherwise, if you still want to publish the datapoints with unit 'Seconds', you could retrieve the datapoints (using the GetMetricStatistics API) and graph the values using some other dashboard/graphing solution.
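For reference, the parameters for such a GetMetricStatistics call would look roughly like this (shown as the JSON you would pass to an SDK; the namespace, metric name and time range are placeholders), after which you would divide each datapoint by 3,600 yourself before graphing:

{
  "Namespace": "MyCustomNamespace",
  "MetricName": "CacheAgeSeconds",
  "StartTime": "2023-01-01T00:00:00Z",
  "EndTime": "2023-01-02T00:00:00Z",
  "Period": 300,
  "Statistics": ["Average"]
}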