Math expression on AWS CloudWatch metrics is not giving expected output

I have created two metrics (m1 and m2) on my logs, each giving the sum of a filter pattern. I wanted to add a math expression to sum these two metrics, so I added SUM([m1,m2]), but it is not giving me the actual sum. Please refer to the snapshot below.
I also tried the expression m1 + m2, but still no luck. One thing I noticed: m1 + 2 gives me the exact sum of 5. Not sure if anything is missing here.
Update (2019-07-18):
Adding a stacked snapshot:

The SUM() function sums values per datapoint. At your last datapoint you have the value 2 for Completed and no value for Failed, so the sum is 2 + 0 = 2. The Number widget, on the other hand, displays the last value returned, which for the Failed count is 3, but that 3 didn't happen at the last observed time period; it happened earlier.
You can do a few things here:
Update the metric filter on the logs to emit the value 0 as a default if no Failed events are encountered.
Add a new expression to your graph, FILL(m1, 0), with ID e3 for example, which will give you a continuous line with zeros when there are no failures and the number of failures otherwise. Then you can update your SUM expression to be SUM([m2, e3]).
You can do this on both of your metrics, so you don't have gaps in either of them. This will make the graphing and alarming more consistent and intuitive (see the sketch below).
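A minimal boto3 sketch of both options; the log group, namespace, and metric names are hypothetical placeholders:

import boto3
from datetime import datetime, timedelta

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Option 1: have the metric filter emit 0 for periods with no matching events.
logs.put_metric_filter(
    logGroupName="/my/app/logs",                 # hypothetical log group
    filterName="FailedEvents",
    filterPattern='"Failed"',
    metricTransformations=[{
        "metricName": "FailedCount",             # hypothetical metric name
        "metricNamespace": "MyApp",
        "metricValue": "1",
        "defaultValue": 0.0,                     # emit 0 when nothing matches
    }],
)

# Option 2: fill the gaps at query time, then sum per datapoint.
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "m1", "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "FailedCount"},
                                    "Period": 300, "Stat": "Sum"}, "ReturnData": False},
        {"Id": "m2", "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "CompletedCount"},
                                    "Period": 300, "Stat": "Sum"}, "ReturnData": False},
        {"Id": "e3", "Expression": "FILL(m1, 0)", "ReturnData": False},
        {"Id": "e4", "Expression": "SUM([m2, e3])", "Label": "Total"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=12),
    EndTime=datetime.utcnow(),
)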

Related

CloudWatch log filter count metric values are < 1

I followed the instructions here:
https://docs.amazonaws.cn/en_us/AmazonCloudWatch/latest/logs/CountOccurrencesExample.html
and created a log filter metric to count occurrences of a particular logged term.
But when I graph the metric I get:
I don't see how a value of < 1 is possible for a count metric.
It seems like it is calculating something else, perhaps the ratio of hits for the log filter query vs the total number of log entries. But that's a meaningless stat, because these are application logs, so it's not even the ratio of hits vs the number of requests.
The shape of the graph looks right, but the units don't make sense.
How do I get a meaningful count from a log filter metric?
After thinking about this further I realised what maybe should have been obvious already... I was graphing the average rate of a count.
This can very easily be < 1.
One option is to instead graph the sum (per time bucket) of the count, which is an easy way to get "occurrences per minute" (or per second, or whatever).
I eventually realised that what I really wanted was the percentage of a specific subset of log lines (potential matches) where the log term matched.
I achieved this by creating another metric which counted both the matching and non-matching instances of this log term, for a specific log path (e.g. requests to a particular endpoint, or calls to a specific function).
Then I can hide both of those metric lines from the graph and instead add a 'math expression' like m1 / m2 * 100 to graph the percentage of requests which feature the log term of interest (sketched below).
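A short boto3 sketch of that percentage expression; the namespace and metric names are hypothetical:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        # m1 and m2 are hidden inputs (ReturnData=False); only the percentage comes back.
        {"Id": "m1", "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "TermMatches"},
                                    "Period": 60, "Stat": "Sum"}, "ReturnData": False},
        {"Id": "m2", "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "TermCandidates"},
                                    "Period": 60, "Stat": "Sum"}, "ReturnData": False},
        {"Id": "e1", "Expression": "m1 / m2 * 100", "Label": "Match %"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
)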

GCP Console: How are percentile charts calculated?

I do not understand how the charts that show percentiles are calculated inside the Google Cloud Platform Monitoring UI.
Here is how I am creating the standard chart:
Example log events
Creating a log-based metric for request durations
https://cloud.google.com/logging/docs/logs-based-metrics/distribution-metrics
https://console.cloud.google.com/logs/viewer
Here I have configured a histogram of 20 buckets, starting from 0, with each bucket 100 ms wide:
0-100 ms, 100-200 ms, ..., 1900-2000 ms (up to 2 seconds)
Creating a chart to show percentiles over time
https://console.cloud.google.com/monitoring/metrics-explorer
I do not understand how these histogram buckets work with "aggregator", "aligner" and "alignment period".
The UI forces using an "aligner" and "alignment period".
Questions
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
B. Do the histogram buckets configured for the log-based metric affect these sums?
I've been looking into the same question for a couple of days and found the Understanding distribution percentiles section in the official docs quite helpful.
The percentile value is a computed value. The computation takes into account the number of buckets, the width of the buckets, and the total count of samples. Because the actual measured values aren't known, the percentile computation can't rely on this data.
They have a good example with buckets [0, 1), [1, 2), [2, 4), [4, 8), [8, 16), [16, 32), [32, 64), [64, 128), [128, 256) and only one measurement in the last bucket [128, 256) (none in all other buckets).
You use the bucket counts to determine that the [128, 256) bucket contains the 50th percentile.
You assume that the measured values within the selected bucket are uniformly distributed and therefore the best estimate of the 50th percentile is the bucket midpoint.
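Here is a small Python sketch of that estimate, using the example buckets above; it illustrates the idea (linear interpolation within the bucket, which reduces to the midpoint here) rather than Cloud Monitoring's exact implementation:

def estimate_percentile(bounds, counts, p):
    # bounds[i], bounds[i+1] delimit bucket i; counts[i] is its sample count
    total = sum(counts)
    rank = p / 100 * total               # position of the percentile among the samples
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if c > 0 and seen >= rank:
            lo, hi = bounds[i], bounds[i + 1]
            within = (rank - (seen - c)) / c   # assume values uniform in the bucket
            return lo + (hi - lo) * within
    return bounds[-1]

bounds = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
counts = [0, 0, 0, 0, 0, 0, 0, 0, 1]     # a single measurement in [128, 256)
print(estimate_percentile(bounds, counts, 50))   # 192.0, the bucket midpoint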
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
I find the GCP Console UI for Metrics explorer a little misleading/confusing in its wording as well (but maybe it's just me being unfamiliar with their terms). The key concepts here are Alignment and Reduction, I think.
The aligner produces a single value placed at the end of each alignment period.
A reducer is a function that is applied to the values across a set of time series to produce a single value.
The difference between the two is horizontal vs. vertical aggregation. In the UI, the Aggregators (both primary and secondary) are reducers.
Back to the question: a sum aligner applied before a percentile reducer seems more useful in use cases other than yours. In short, a mean or max aligner may be more useful for your "duration_ms" metric, but they're not available in the dropdown in the UI, and to be honest I haven't figured out how to implement them in the MQL editor either; I'm just referencing the docs here. There are other aligners that may also be useful, but I'll leave them out for now.
B. Do the histogram buckets configured for the log-based metric affect these sums?
Same as @Anthony, I'm not quite sure what the question is implying either. I'll just assume you're asking whether you can align/reduce log-based metrics using these aligners/reducers, and the answer would be yes. However, you'll need to know which metric type you're using (counter vs. distribution) and aggregate it in the corresponding way as needed.
Before we look at your questions, we must understand histograms.
In the documentation you provided in the post, there is a section that explains histogram buckets. Looking at that section and reflecting on your setup, we can see that you are using the Linear type to specify the boundaries between histogram buckets for distribution metrics.
Furthermore, the Linear type has three values for its calculations:
offset (the start value)
width (the bucket width)
N (the number of buckets)
Every bucket has the same width, and the boundaries are calculated using the following formula: offset + width × i (where i = 0, 1, 2, ..., N).
For example, if the start value is 5, the number of buckets is 4, and the bucket width is 15, then the bucket ranges are as follows:
(-INF, 5), [5, 20), [20, 35), [35, 50), [50, 65), [65, +INF)
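A quick Python sketch of that formula, reproducing the ranges above:

# Linear buckets: offset=5, width=15, N=4 (values from the example above).
offset, width, n = 5, 15, 4
bounds = [offset + width * i for i in range(n + 1)]   # [5, 20, 35, 50, 65]
ranges = ["(-INF, %d)" % bounds[0]]
ranges += ["[%d, %d)" % (bounds[i], bounds[i + 1]) for i in range(n)]
ranges.append("[%d, +INF)" % bounds[-1])
print(", ".join(ranges))
# (-INF, 5), [5, 20), [20, 35), [35, 50), [50, 65), [65, +INF)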
Now that we understand the formula, we can look at your questions and answer them:
How are percentile charts calculated?
If we look at this documentation on Selecting metrics, we can see that there is a section that explains how Aggregation works. I would suggest looking into this part to understand how aggregation works in GCP.
The formula to calculate the percentile rank is the following:
R = (P / 100) × (N + 1)
where R represents the rank order of the score, P represents the percentile rank, and N represents the number of scores in the distribution.
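As a quick worked example in Python:

# The 50th percentile among N = 9 scores:
p, n = 50, 9
r = p / 100 * (n + 1)
print(r)   # 5.0 -> the 5th score in sorted order is the median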
If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
In the same section, it also explains what the Alignment Period is, but for the most part, the alignment period determines the length of time for subdividing the time series. For example, you can break a time series into one-minute chunks or one-hour chunks. The data in each period is summarized so that a single value represents that period. The default alignment period is one minute.
Although you can set the alignment interval for your data, time series might be realigned when you change the time interval displayed on a chart or change the zoom level.
Do the histogram buckets configured for the log-based metric affect these sums?
I am not too sure what you are asking here: are you asking whether, when logs are created, the sums would be altered by the newly generated logs?
I hope this helps!

AWS CloudWatch metric math with a cumulative metric's value 30 minutes ago to show rate of change

I have an AWS CloudWatch custom metric that represents a cumulative value which continues to increase over time. I will add that metric to a dashboard, but I also want to show the rate of change of this metric over the last 30 minutes. Ideally I would like a function that returns the metric's value from 30 minutes ago so I can subtract it from the current value. The RATE() function does not seem to help.
I could submit the metric's value a second time with a timestamp that is 30 minutes in the future and subtract the two metrics, but I am hoping for a solution that uses metric math and does not force me to submit another metric. I can think of other use cases where I might want to do math with metrics from different time periods.
Hope I am just missing something here!
You can use some arithmetic to obtain the previous value, and then you can calculate the percentage of change as you want.
The value you want is: (value_now - value_before) / value_before
Breaking this into 2 parts:
Obtain value_now - value_before. This is the absolute delta of the values.
Obtain value_before. This is the value of the metric at the previous datapoint.
Assuming that your metric in CloudWatch is m:
Step 1: The absolute delta
The absolute_delta can be obtained with: absolute_delta = RATE(m) * PERIOD(m).
Step 2: The previous value
With some arithmetic it is possible to obtain value_before. Given the definition of the absolute delta:
absolute_delta = value_now - value_before
Since we have value_now = m and absolute_delta, it's a matter of rearranging the equation:
value_before = value_now - absolute_delta
Final equation
Just plug everything together and you have your final metric:
change_percentage = 100 * absolute_delta / value_before
In CloudWatch terms:
Metric math function RATE() calculates the rate of change per second.
Returns the rate of change of the metric, per second. This is calculated as the difference between the latest data point value and the previous data point value, divided by the time difference in seconds between the two values.
From https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
So to get the rate of change for your period you could do this:
RATE(m1)*PERIOD(m1)
and set the period of the dashboard to the wanted value.
The problem in your case is that you need it for a period of 30 minutes, and I don't think you can set 30 minutes as the period on the CloudWatch dashboard. The closest values would be 15 minutes or 1 hour.
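Putting the whole derivation together as GetMetricData queries (a sketch; the namespace and metric name are hypothetical, and unlike the dashboard dropdown, the GetMetricData API accepts any period that is a multiple of 60 seconds, so 1800 works):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        # Maximum is one reasonable stat for a cumulative counter; adjust as needed.
        {"Id": "m1", "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "CumulativeCounter"},
                                    "Period": 1800, "Stat": "Maximum"}, "ReturnData": False},
        {"Id": "e1", "Expression": "RATE(m1) * PERIOD(m1)", "ReturnData": False},  # absolute delta
        {"Id": "e2", "Expression": "m1 - e1", "ReturnData": False},                # previous value
        {"Id": "e3", "Expression": "100 * e1 / e2", "Label": "Change %"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
)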

Cumulative sum of AWS Cloudwatch Metric

AWS CloudWatch receives a count of 1 every time I start an image download. I am downloading thousands of images (on a cluster of EC2 instances) and would like to track the total progress.
I can't find any documentation on how to plot the cumulative sum of a metric. The AWS CloudWatch Math Expressions looked promising, but they do not have an integrate function.
Currently, I can plot the sum of the started image downloads but only for periods, as seen below. Ideally, I'd like to plot the integral of this plot:
You can get a cumulative sum over the current range by using the SUM() function and dividing it back over a version of the original series that contains only the number one (1). Remember, you're looking for a single number in the end, so it's not much of a graph, but you need to turn the single-value sum back into a time series.
Define m1 as your metric. This is the metric you will want to use SUM() on.
Define an expression e1 as m1/m1. This results in a time series with every value equal to 1. This is what will allow you to convert that SUM back to a time series.
Define an expression e2 as SUM(m1) / e1. This is, effectively, the cumulative sum of m1 divided by one for every data-point in the original time-series. It will be a horizontal line on the graph, which will have every point on that horizontal line being the cumulative sum of metric m1. This is required because Cloudwatch can only plot a time-series on the chart, not a single value.
Make m1 and e1 invisible. You need them, but you don't need to see them.
Finally, change the chart type from Line to Number, since you only wanted the cumulative sum anyway.
The reason you can't use SUM() directly is that it is a single value. By dividing it by a time series containing all 1's, the entire graph becomes the result of the SUM(). Then, changing the chart to a Number effectively hides all the math and presents only the "final result".
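A sketch of the same divide-by-one trick via the GetMetricData API (namespace and metric name are hypothetical):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "m1", "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "ImageDownloadStarted"},
                                    "Period": 60, "Stat": "Sum"}, "ReturnData": False},
        {"Id": "e1", "Expression": "m1 / m1", "ReturnData": False},   # all-ones series
        {"Id": "e2", "Expression": "SUM(m1) / e1", "Label": "Cumulative downloads"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
)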
It looks like RUNNING_SUM() has been added, which does what you need:
Graph with RUNNING_SUM
You can find RUNNING_SUM() under "Add math" -> "All functions".
You are correct. All Amazon CloudWatch metrics are for a defined period.
The maximum period for a metric is one day, so this is not suitable for a cumulative counter that you wish to continue beyond one day.
You would need to find an alternate method of storing the count, such as an Amazon DynamoDB table. Use an atomic counter via UpdateItem to increment the count.
You can also use a very long period.
Change your stat to SUM, and set your metric's period to 7 days. You'll get a time series of 1 point with the cumulative sum of all the downloads.
If you give each download a unique dimension value, you can keep your queries separate.
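A sketch of that long-period approach with boto3 (namespace, metric name, and dimension are hypothetical):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "m1",
         "MetricStat": {"Metric": {"Namespace": "MyApp", "MetricName": "ImageDownloadStarted",
                                   "Dimensions": [{"Name": "JobId", "Value": "job-42"}]},
                        "Period": 604800,   # 7 days -> one datapoint holding the total
                        "Stat": "Sum"}},
    ],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
)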

Show CloudWatch metric with unit Seconds in Hours

I have a custom CloudWatch metric with unit Seconds (representing the age of a cache).
As the usual values are around 125,000, I'd like to convert them into hours for better readability.
Is that possible?
This has changed with the addition of Metric Math. You can do all sorts of transformations on your data, both manually (from the console) and from CloudFormation dashboard templates.
From the console: see the link above, which says:
To add a math expression to a graph
Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
Create or edit a graph or line widget.
Choose Graphed metrics.
Choose Add a math expression. A new line appears for the expression.
For the Details column, type the math expression. The tables in the following section list the functions you can use in the expression.
To use a metric or the result of another expression as part of the formula for this expression, use the value shown in the Id column. For example, m1+m2 or e1-MIN(e1).
From a CloudFormation Template
You can add new metrics that are Metric Math expressions, transforming existing metrics. You can add, subtract, multiply, etc., metrics and scalars. In your case, you probably just want division, as in this example:
Say you have the following bucket request latency metrics object in your template:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName"]
]
The latency is in milliseconds by default. Let's plot it in seconds, just for fun. 1 s = 1,000 ms, so we'll add the following:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName",{"id": "timeInMillis"}],
[{"expression":"timeInMillis / 1000", "label":"LatencyInSeconds","id":"timeInSeconds"}]
]
Note that the expression has access to the ID of the other metrics. Helpful naming can be useful when things get more complicated, but the key thing is just to match the variables you put in the expression to the ID you assign to the corresponding metric.
This leaves us with a graph with two metrics on it: one milliseconds, the other seconds. If we want to lose the milliseconds, we can, but we need to keep the metric values around to compute the math expression, so we use the following work-around:
"metrics":[
["AWS/S3","TotalRequestLatency","BucketName","MyBucketName",{"id": "timeInMillis","visible":false}],
[{"expression":"timeInMillis / 1000", "label":"LatencyInSeconds","id":"timeInSeconds"}]
]
Making the metric invisible takes it off the graph while still allowing us to compute our expression off of it.
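Applying the same work-around to the original question (a metric in Seconds, displayed in hours), a sketch with a hypothetical custom metric would look like:
"metrics":[
["MyApp","CacheAgeSeconds",{"id": "ageInSeconds","visible":false}],
[{"expression":"ageInSeconds / 3600", "label":"CacheAgeHours","id":"ageInHours"}]
]
Any metric math expression can scale by a constant, so dividing by 3600 turns seconds into hours the same way dividing by 1000 turned milliseconds into seconds.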
CloudWatch does not do any unit conversion (i.e. seconds into hours, etc.), so you cannot use the AWS console to display your 'Seconds' datapoint values converted to hours.
You could publish your metric values as hours instead (leaving the Unit field blank or setting it to 'None').
Otherwise, if you still want to publish the datapoints with unit 'Seconds', you could retrieve the datapoints (using the GetMetricStatistics API) and graph the values using some other dashboard/graphing solution.
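A sketch of that retrieval path with boto3, converting to hours client-side (namespace and metric name are hypothetical):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
stats = cloudwatch.get_metric_statistics(
    Namespace="MyApp",
    MetricName="CacheAgeSeconds",
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
    Unit="Seconds",
)
for dp in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], dp["Average"] / 3600, "hours")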