GCP Console: How are percentile charts calculated? - google-cloud-platform

I do not understand how the charts that show percentiles are calculated inside the Google Cloud Platform Monitoring UI.
Here is how I am creating the standard chart:
Example log events
Creating a log-based metric for request durations
https://cloud.google.com/logging/docs/logs-based-metrics/distribution-metrics
https://console.cloud.google.com/logs/viewer
Here I have configured a histogram of 20 buckets, starting from 0, each bucket taking 100ms.
0 - 100,
100 - 200,
... until 2 seconds
Creating a chart to show percentiles over time
https://console.cloud.google.com/monitoring/metrics-explorer
I do not understand how these histogram buckets work with "aggregator", "aligner" and "alignment period".
The UI forces using an "aligner" and "alignment period".
Questions
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
B. Do the histogram buckets configured for the log-based metric affect these sums?

I've been looking into the same question for a couple of days and found the Understanding distribution percentiles section in official docs quite helpful.
The percentile value is a computed value. The computation takes into account the number of buckets, the width of the buckets, and the total count of samples. Because the actual measured values aren't known, the percentile computation can't rely on them; it has to work from the bucket information alone.
They have a good example with buckets [0, 1) [1, 2) [2, 4) [4, 8) [8, 16) [16, 32) [32, 64) [64, 128) [128, 256) and only one measurement in the last bucket [128, 256) (none in all other buckets).
You use the bucket counts to determine that the [128, 256) bucket contains the 50th percentile.
You assume that the measured values within the selected bucket are uniformly distributed and therefore the best estimate of the 50th percentile is the bucket midpoint.
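To make that concrete, here is a minimal sketch of the interpolation idea in plain Python (an illustration only, not the exact code Cloud Monitoring runs), using the bucket boundaries and the single sample from the docs example above:
# Estimate a percentile from distribution bucket counts alone (no raw samples),
# assuming values are uniformly distributed within the selected bucket.
def estimate_percentile(bounds, counts, pct):
    """bounds are explicit boundaries, e.g. [0, 1, 2, 4, ...];
    counts[i] is the number of samples in [bounds[i], bounds[i+1])."""
    total = sum(counts)
    target = pct / 100 * total                   # rank of the sample we're looking for
    seen = 0
    for lo, hi, count in zip(bounds, bounds[1:], counts):
        if count > 0 and seen + count >= target:
            fraction = (target - seen) / count   # linear interpolation inside the bucket
            return lo + fraction * (hi - lo)
        seen += count
    return bounds[-1]

# The docs example: buckets [0,1) [1,2) [2,4) ... [128,256), one sample in [128,256).
bounds = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
counts = [0, 0, 0, 0, 0, 0, 0, 0, 1]
print(estimate_percentile(bounds, counts, 50))   # 192.0, the midpoint of [128, 256)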
A. If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
I find the GCP Console UI for Metrics Explorer a little misleading/confusing in its wording as well (but maybe it's just that I'm unfamiliar with their terms). The key concepts here are Alignment and Reduction, I think.
The aligner produces a single value placed at the end of each alignment period.
A reducer is a function that is applied to the values across a set of time series to produce a single value.
The difference between the two is horizontal vs. vertical aggregation. In the UI, the Aggregator (both primary and secondary) is a reducer.
Back to the question: a sum aligner applied before a percentile reducer seems more useful in use cases other than yours. In short, a mean or max aligner may be a better fit for your "duration_ms" metric, but they're not available in the dropdown in the UI, and to be honest I haven't figured out how to implement them in the MQL editor either; I'm just referencing the docs here. There are other aligners that may also be useful, but I'll leave them out for now.
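As a rough illustration of that horizontal vs. vertical distinction, here is a toy sketch in plain Python with made-up duration_ms samples (not the actual Monitoring implementation): the aligner first collapses each series over time within one alignment period, then the reducer collapses across series.
# Two time series (e.g. two instances), raw points within one alignment period.
series = {
    "instance-a": [120, 340, 90, 600],   # made-up duration_ms samples
    "instance-b": [200, 150, 80],
}

# Aligner: one value per series per alignment period (horizontal, along time).
align_mean = {name: sum(pts) / len(pts) for name, pts in series.items()}
align_sum = {name: sum(pts) for name, pts in series.items()}

# Reducer ("Aggregator" in the UI): one value across series per aligned point (vertical).
reduce_mean = sum(align_mean.values()) / len(align_mean)   # mean of the per-series means
reduce_sum = sum(align_sum.values())                       # sum of the per-series sums

print(align_mean)    # {'instance-a': 287.5, 'instance-b': 143.33...}
print(align_sum)     # {'instance-a': 1150, 'instance-b': 430}
print(reduce_mean)   # 215.41..., a single chart point for this alignment period
print(reduce_sum)    # 1580, a single chart point if both steps use sum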
B. Do the histogram buckets configured for the log-based metric affect these sums?
Same as @Anthony, I'm not quite sure what the question is implying either. I'm just going to assume you're asking whether you can align/reduce log-based metrics using these aligners/reducers, and the answer is yes. However, you'll need to know which metric type you're using (counter vs. distribution) and aggregate it in the corresponding way.

Before we look at your questions, we must understand Histograms.
In the documentation you provided in the post, there is a section that explains Histogram Buckets. Looking at that section and comparing it with your setup, we can see that you are using the Linear type to specify the boundaries between histogram buckets for distribution metrics.
Furthermore, the Linear type has three values for its calculations:
offset value (Start value [a])
width value (Bucket width [b])
N value (Number of finite buckets [N])
Every bucket has the same width, and the boundaries are calculated using the following formula: offset + width × i (where i = 0, 1, 2, ..., N).
For example, if the start value is 5, the number of buckets is 4, and the bucket width is 15, then the bucket ranges are as follows:
[-INF, 5), [5, 20), [20, 35), [35, 50), [50, 65), [65, +INF]
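As a quick sketch of how those boundaries fall out of offset + width × i, here is a plain Python helper (illustration only), applied to the docs example above and to the 20 × 100 ms setup from the question:
# Linear bucket boundaries: offset + width * i for i = 0..N,
# plus an underflow bucket below the offset and an overflow bucket above the top.
def linear_buckets(offset, width, num_finite_buckets):
    bounds = [offset + width * i for i in range(num_finite_buckets + 1)]
    buckets = [("-INF", bounds[0])]             # underflow bucket
    buckets += list(zip(bounds, bounds[1:]))    # the N finite buckets
    buckets += [(bounds[-1], "+INF")]           # overflow bucket
    return buckets

print(linear_buckets(5, 15, 4))
# [('-INF', 5), (5, 20), (20, 35), (35, 50), (50, 65), (65, '+INF')]

print(linear_buckets(0, 100, 20)[:3])
# [('-INF', 0), (0, 100), (100, 200)] ... and so on up to (2000, '+INF')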
Now that we understand the formula, we can look at your questions and answer them:
How are percentile charts calculated?
If we look at the documentation on Selecting metrics, we can see that there is a section that explains how aggregation works. I would suggest reading that part to understand aggregation in GCP.
The formula to calculate the Percentile is the following:
R = (P / 100) × (N + 1)
where R represents the rank order of the score, P represents the percentile rank, and N represents the number of scores in the distribution.
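As a tiny worked example of that rank formula (a generic percentile-rank formula; for distribution metrics the Console interpolates within histogram buckets as described in the other answer):
# Worked example of R = (P / 100) * (N + 1).
def rank_for_percentile(p, n):
    return (p / 100) * (n + 1)

print(rank_for_percentile(50, 9))    # 5.0  -> the 5th score in sorted order (the median)
print(rank_for_percentile(95, 19))   # 19.0 -> the 19th (largest) of the 19 sorted scores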
If I am trying to compute percentiles, why would I want to sum all of my response times every "alignment period"?
In the same section, it also explains what the Alignment Period is, but for the most part, the alignment period determines the length of time for subdividing the time series. For example, you can break a time series into one-minute chunks or one-hour chunks. The data in each period is summarized so that a single value represents that period. The default alignment period is one minute.
Although you can set the alignment interval for your data, time series might be realigned when you change the time interval displayed on a chart or change the zoom level.
Do the histogram buckets configured for the log-based metric affect these sums?
I am not too sure what you are asking here: are you asking whether, when logs are created, the sums would be altered by the logs being generated?
I hope this helps!

Related

AWS CloudWatch interpreting insights graph -- how many read/write IOs will be billed?

Introduction
We are trying to "measure" the cost of usage of a specific use case on one of our Aurora DBs that is not used very often (we use it for staging).
Yesterday at 18:18 hrs. UTC we issued some representative queries to it and today we were examining the resulting graphs via Amazon CloudWatch Insights.
Since we are being billed USD 0.22 per million read/write IOs, we need to know how many of those there were during our little experiment yesterday.
A complicating factor is that in the Cost Explorer it is not possible to group the final billed costs for read/write IOs per DB instance! Therefore, the only way we can think of to estimate the cost is from the read/write volume IO graphs on CloudWatch Insights.
So we went to CloudWatch Insights and selected the graphs for read/write IOs. Then we selected the period of time in which we did our experiment. Finally, we examined the graphs with different options: "Number" and "Lines".
Graph with "number"
This shows us the picture below, suggesting a total billable IO count of 266 + 510 = 776. Since we have chosen the "Sum" statistic, we assume this would indicate a cost of about USD 0.00017 in total.
Graph with "lines"
However, if we choose the "Lines" option, we see another picture, with 5 points on the line: the first and last around 500 (for read IOs) and the last one at approx. 750, suggesting a total of 5,000 read/write IOs.
Our question
We are not really sure which interpretation to go with and the difference is significant.
So our question is now: How much did our little experiment cost us and, equivalently, how to interpret these graphs?
Edit:
Using 5 minute intervals (as suggested in the comments) we get (see below) a horizontal line with points at 255 (read IOs) for a whole hour around the time we did our experiment. But the experiment took less than 1 minute at 19:18 (UTC).
Will the (read) billing be for 12 * 255 IOs or 255 ... (or something else altogether)?
Note: This question triggered another follow-up question created here: AWS CloudWatch insights graph — read volume IOs are up much longer than actual reading
From Aurora RDS documentation
VolumeReadIOPs
The number of billed read I/O operations from a cluster volume within a 5-minute interval.
Billed read operations are calculated at the cluster volume level, aggregated from all instances in the Aurora DB cluster, and then reported at 5-minute intervals. The value is calculated by taking the value of the Read operations metric over a 5-minute period. You can determine the amount of billed read operations per second by taking the value of the Billed read operations metric and dividing by 300 seconds. For example, if the Billed read operations returns 13,686, then the billed read operations per second is 45 (13,686 / 300 = 45.62).
You accrue billed read operations for queries that request database pages that aren't in the buffer cache and must be loaded from storage. You might see spikes in billed read operations as query results are read from storage and then loaded into the buffer cache.
Imagine AWS reports these data points every 5 minutes:
[100, 150, 200, 70, 140, 10]
and you use the Sum over 15 minutes statistic, as in your screenshot.
(Struck out, superseded by the edit below:) First, the "number" visualization represents only the last aggregated group. In your case of 15-minute aggregation, it would be (70+140+10).
Edit: the "number" visualization represents the whole selected duration, aggregated, which would be the total of (100+150+200+70+140+10).
The "line" visualization will represent all the aggregated groups. which would in this case be 2 points (100+150+200) and (70+140+10)
It can be a little bit hard to understand at first if you are not used to data points and aggregations. So I suggest that you set your "line" chart to Sum of 5 minutes you will need to get value of each points and devide by 300 as suggested by the doc then sum them all
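Here is a small plain-Python sketch of the same idea, using the made-up numbers above and assuming 5-minute data points with a 15-minute Sum statistic:
# Six 5-minute data points and a "Sum of 15 minutes" statistic:
# the "Lines" view shows one point per 15-minute group,
# the "Number" view shows the total over the whole selected duration.
points = [100, 150, 200, 70, 140, 10]   # one value per 5 minutes
group_size = 3                          # 15 minutes / 5 minutes per point

line_view = [sum(points[i:i + group_size]) for i in range(0, len(points), group_size)]
number_view = sum(points)

print(line_view)     # [450, 220] -> the two points on the "Lines" graph
print(number_view)   # 1000       -> the single value in the "Number" view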
Added images for easier visualization

clarification regarding difference between ALIGN_MEAN and ALIGN_SUM in google cloud metric explorer

I am collecting memory metrics using compute.googleapis.com/guest/memory/bytes_used in the Google Cloud Metrics Explorer. I selected a particular instance ID and set the alignment period to 1 day, so that I get one data point per day.
For the same alignment period:
In advanced aggregation I selected the aligner mean and got 114.526 KiB for the free category of memory.
In advanced aggregation I selected the aligner sum and got 63.750 MiB for the free category of memory.
I do not understand how ALIGN_MEAN and ALIGN_SUM are calculated. I have set the alignment period to 1 day. Can anyone give me the formula and an explanation?
Thanks a lot for your help.
It's only a graphical representation. You chose an alignment period of 1 day, so you have 1 value per day.
If you look at the graph over 1 day, for the sum, you have 1 value, equal to 63 MiB. I think the line goes down slightly because the value for the day before is slightly higher.
Now, take the same data, but say: I want to see the mean value during the day. You have 1 value per day, 63 MiB, so the graph shows you an interpolation of the mean per hour/minute. If you change the timeframe, the line changes. Even if you change the size of your screen it could change!
Go to the Week or the Month timeframe view. The "alignment sum" should grow by about 60 MiB per day, and the "alignment mean" should be flat around 60 MiB (at the month view).
We can read about difference ALIGN_MEAN and ALIGN_SUM in this GCP documentation [1]:
ALIGN_MEAN: Align the time series by returning the mean value in each alignment period. This aligner is valid for GAUGE and DELTA metrics with numeric values. The valueType of the aligned result is DOUBLE.
ALIGN_SUM: Align the time series by returning the sum of the values in each alignment period. This aligner is valid for GAUGE and DELTA metrics with numeric and distribution values. The valueType of the aligned result is the same as the valueType of the input.
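As a toy sketch of why the two aligners give such different numbers over the same one-day alignment period, here is some plain Python with made-up per-minute samples (it does not reproduce the exact values from the question, since the real sampling rate isn't known):
# One day of per-minute "free memory" samples (made up, roughly 44 KiB each),
# aligned with a 1-day alignment period.
samples_per_day = 24 * 60
free_kib = [44.0 + (i % 5) * 0.1 for i in range(samples_per_day)]

align_mean = sum(free_kib) / len(free_kib)   # one value per day: the average sample, ~44 KiB
align_sum = sum(free_kib)                    # one value per day: every sample added up, ~62 MiB

print(f"ALIGN_MEAN: {align_mean:.3f} KiB")
print(f"ALIGN_SUM:  {align_sum / 1024:.3f} MiB")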
I would also like to share this other document [2].
Best Wishes!
[1] https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.alertPolicies#aligner
[2] https://cloud.google.com/monitoring/charts/metrics-selector#alignment

aws cloudwatch metrics - AVG over a range

I want to make an average graph of the CDCLatencySource and CDCLatencyTarget of a few ARNs.
CDCLatencySource are m1,m2,m3,m4
CDCLatencyTarget are m5,m6,m7,m8
So I made another row, AVG([m1,m4]), for the Source and the same for the target.
But it looks like it averages only m1 & m4 and not the whole range.
What am I missing?
You will need to include all metrics, so for your CDCLatencySource it would be AVG([m1,m2,m3,m4]).
Similarly for CDCLatencyTarget the value would be AVG([m5,m6,m7,m8])
The functions do not accept ranges, instead they accept each metric id individually in the list that is passed into the function.
More information for metric math is available here for further reading.
From the docs:
AVG The AVG of a single time series returns a scalar representing the average of all the data points in the metric. The AVG of an array of time series returns a single time series. Missing values are treated as 0.
Thus you need to provide full array of time series:
AVG([m1,m2,m3,m4])
AVG([m5,m6,m7,m8])
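If you build the same graph programmatically, the metric math works the same way. Below is a hedged sketch using boto3's get_metric_data; the namespace, metric name, dimension and task identifiers are illustrative placeholders, not values taken from the question:
# Sketch: the same AVG([...]) metric math via the CloudWatch GetMetricData API.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

def latency_query(query_id, task_id):
    # Placeholder namespace/metric/dimension; adjust to your actual DMS task metrics.
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/DMS",
                "MetricName": "CDCLatencySource",
                "Dimensions": [{"Name": "ReplicationTaskIdentifier", "Value": task_id}],
            },
            "Period": 300,
            "Stat": "Average",
        },
        "ReturnData": False,   # hide the individual series, keep only the AVG expression
    }

queries = [latency_query(f"m{i + 1}", task) for i, task in enumerate(["task-1", "task-2", "task-3", "task-4"])]
queries.append({"Id": "e1", "Expression": "AVG([m1,m2,m3,m4])", "ReturnData": True})

resp = cloudwatch.get_metric_data(
    MetricDataQueries=queries,
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
)
for result in resp["MetricDataResults"]:
    print(result["Id"], result["Values"])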

Understanding GCP Dataproc billing and how it is affected by labels

I'm trying to make sure I have a clear understanding of how my organisation gets billed for Google Cloud Platform Dataproc.
We have exported our billing history to BigQuery so that we can analyse it. This morning we had two dataproc clusters running and the screenshot below shows a subset of the billing history for those two clusters. I have filtered on labels.key = "goog-dataproc-cluster-uuid" or labels.key = "goog-dataproc-cluster-name" or labels.key = "goog-dataproc-location". Here is a subset of the results
I've drawn boxes around the costs for two kinds of SKU. Let's take a look at the Standard Intel N1 16 VCPU running in EMEA items.
I only have two clusters, yet for each of those two clusters there are three lines. The reason is that there are three labels applied to each Dataproc cluster, hence the costs 1.271852 & 3.815556 appear three times each.
My simple question then is...how do I get the total cost of my dataproc clusters? Do I add up all of these numbers (thus implying that the total cost is split equally over all of the labels) or do I take just one of the values (implying that the cost is repeated for each label)?
Here's another way of phrasing my question. Does this query give the total cost of running cluster data-dev-dataplatform-dataproc for one day:
SELECT sum(cost)
FROM [dh-billing-179310:billing.gcp_billing_export_XXXXXXXX]
WHERE labels.key = "goog-dataproc-cluster-name"
and labels.value = "data-dev-dataplatform-dataproc"
and usage_start_time >= "2018-07-05 00:00:00"
and usage_end_time <= "2018-07-06 00:00:00"
or do I need to include other labels in order to get the total cost?
In that flattened view of billing export data, the cost is repeated for each label; you should pick a single label value for any particular calculation. If you're trying to calculate the Dataproc total, it's probably most convenient to use one of the Dataproc-inserted "goog-dataproc-*" labels.
The idea here is that you can use different sets of labels to easily organize your total Dataproc-related costs attributed to any given subproject, so that you can then filter your billing queries along different dimensions.
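In query terms, the "pick a single label" advice might look roughly like the sketch below, run here through the BigQuery Python client; the table name keeps the same placeholder as in the question, and the column names are the usual billing-export ones:
# Sketch: sum Dataproc cost while filtering on a single label key,
# so each cost row is counted exactly once. Table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT SUM(cost) AS total_cost
FROM `dh-billing-179310.billing.gcp_billing_export_XXXXXXXX`
WHERE EXISTS (
  SELECT 1 FROM UNNEST(labels) AS label
  WHERE label.key = 'goog-dataproc-cluster-name'
    AND label.value = 'data-dev-dataplatform-dataproc'
)
  AND usage_start_time >= TIMESTAMP('2018-07-05 00:00:00 UTC')
  AND usage_end_time <= TIMESTAMP('2018-07-06 00:00:00 UTC')
"""

for row in client.query(query).result():
    print(row.total_cost)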

Cumulative sum of AWS Cloudwatch Metric

AWS Cloudwatch receives a count of 1 every time I start an image download. I am downloading 1,000s of images (on a cluster of EC2 instances) and would like to track the total progress.
I can't find any documentation on how to plot the cumulative sum of a metric. The AWS Cloudwatch Math Expressions looked promising, but they do not have an integrate function.
Currently, I can plot the sum of the started image downloads but only for periods, as seen below. Ideally, I'd like to plot the integral of this plot:
You can get a cumulative sum over the current range by using the SUM() function together with a helper time series that contains only the number one (1). Remember, you're looking for a single number in the end, so it's not much of a graph, but you need to turn the single-value sum back into a time series.
Define m1 as your metric. This is the metric you will want to use SUM() on.
Define an expression e1 as m1/m1. This results in a time-series with every value equal to 1. This is what will allow you convert that SUM back to a time-series.
Define an expression e2 as SUM(m1) / e1. This is, effectively, the cumulative sum of m1 divided by one for every data-point in the original time-series. It will be a horizontal line on the graph, which will have every point on that horizontal line being the cumulative sum of metric m1. This is required because Cloudwatch can only plot a time-series on the chart, not a single value.
Make m1 and e1 invisible. You need them, but you don't need to see them.
Finally, change the chart type from Line to Number, since you only wanted the cumulative sum anyway.
The reason you can't use SUM() directly is because it is a single value. By dividing by a time-series containing all 1's, the entire graph is the result of the SUM(). Then, changing the chart to a Number effectively hides all the math and presents only the "final result".
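For reference, here is a hedged sketch of those three definitions expressed through the GetMetricData API with boto3; the namespace, metric name and dimension are illustrative placeholders:
# Sketch: m1 (raw count), e1 = m1/m1 (all ones), e2 = SUM(m1)/e1 (flat cumulative-sum line).
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "m1",   # the "download started" count; placeholder namespace/metric/dimension
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyApp",
                    "MetricName": "ImageDownloadStarted",
                    "Dimensions": [{"Name": "Cluster", "Value": "workers"}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,   # hidden, like making m1 invisible on the chart
        },
        {"Id": "e1", "Expression": "m1/m1", "ReturnData": False},      # every point equals 1
        {"Id": "e2", "Expression": "SUM(m1)/e1", "ReturnData": True},  # cumulative sum at every point
    ],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
)
for result in resp["MetricDataResults"]:
    print(result["Id"], result["Values"])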
Looks like RUNNING_SUM() has been added, which does what you need:
Graph with RUNNING_SUM
You can find RUNNING_SUM() under "Add math"->"All functions"
You are correct. All Amazon CloudWatch metrics are for a defined period.
The maximum period for a metric is one day, so this is not suitable for a cumulative counter that you wish to continue beyond one day.
You would need to find an alternate method of storing the count, such as an Amazon DynamoDB table. Use an atomic counter via UpdateItem to increment the count.
You can also use a very long period.
Change your stat to SUM, and set your metric's period to 7 days. You'll get a time series of 1 point with the cumulative sum of all the downloads.
If you give each download a unique dimension value, you can keep your queries separate.