AWS EC2 CloudWatch metrics interpretation

How should I interpret the AWS EC2 CloudWatch NetworkIn and NetworkOut metrics?
What does the Statistic: Average in the chart refer to?

The docs state that "the units for the Amazon EC2 NetworkIn metric are Bytes because NetworkIn tracks the number of bytes that an instance receives on all network interfaces".
When viewing the chart below, Network In (Bytes), with Statistic: Average and a Period: 5 Minutes (note that the time window is zoomed in to around five hours, not one week), it is not immediately obvious how the average is calculated.
Instance i-aaaa1111 (orange) at 15:29: 2664263.8
If I change Statistic to “Sum”, I get this:
The same instance (i-aaaa1111), now at 15:31: 13321319
It turns out 13321319/5 = 2664263.8, suggesting that incoming network traffic during those five minutes was, on average, 2664263.8 Bytes/minute.
=> 2664263.8/60 ≈ 44404.4 Bytes/second
=> 44404.4/1024 ≈ 43.4 KB/s
=> 43.4*8 ≈ 347 Kbps
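The same arithmetic as a quick Python sanity check (the values are the data points quoted above):

sum_bytes = 13321319                      # "Sum" for the 5-minute period
avg_per_minute = sum_bytes / 5            # 2664263.8, which matches the "Average" value
bytes_per_second = avg_per_minute / 60    # ~44404.4 B/s
kb_per_second = bytes_per_second / 1024   # ~43.4 KB/s
kbps = kb_per_second * 8                  # ~347 Kbps
print(avg_per_minute, bytes_per_second, kb_per_second, kbps)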
I tested this by repeatedly copying a large file from one instance to another, transferring at an average speed of 30.1MB/s. The CloudWatch metric was 1916943925 Bytes (Average) => around 30.5MB/s

The metric, "Network In (Bytes)", refers to bytes/minute.
It appears in my case that the average is computed over the period specified. In other words: for '15 Minutes', it divides the sum of bytes for the 15-minute period by 15, for '5 Minutes', it divides the sum for the 5-minute period by 5.
Here is why I believe this: I used this chart to debug an upload where rsync was reporting ~710kB/sec (~727,000 bytes / sec) when I expected a faster upload. After selecting lots of different sum values in the EC2 plot, I determined that the sums were correct numbers of bytes for the period specified (selecting a 15 minute period tripled the sum compared to a 5 minute period). Then viewing the average and selecting different periods shows that I get the same value of ~45,000,000 when I select a period of "5 Minutes", "15 Minutes", or "1 Hour".
45,000,000 (bytes/???) / 730,000 (bytes/sec) is approximately 60, so ??? is a minute (60 seconds). In fact, ~45,000,000 / 1024 / 60 = ~730 kB/sec and this is within 3% of what rsync was reporting.
Incidentally, my 'bug' was user error - I had failed to pass the '-z' option to rsync and therefore was not getting the compression boost I expected.
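To check this outside the console, here is a minimal boto3 sketch (assuming detailed monitoring; the instance ID is the placeholder from the question) that pulls NetworkIn with both Sum and Average over a 5-minute period and converts the Average, read as bytes per minute, to MB/s:

import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)

resp = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='NetworkIn',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-aaaa1111'}],  # placeholder instance ID
    StartTime=start,
    EndTime=end,
    Period=300,                    # 5-minute period, as in the chart above
    Statistics=['Sum', 'Average'],
)
for point in sorted(resp['Datapoints'], key=lambda p: p['Timestamp']):
    # Per the reasoning above, Average works out to bytes per minute (Sum / 5).
    mb_per_sec = point['Average'] / 60 / 1024 / 1024
    print(point['Timestamp'], point['Sum'], point['Average'], round(mb_per_sec, 2))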

Related

What's the size of my logs in CloudWatch?

I would like to know the actual size of my logs and how fast they grow.
Looking at CloudWatch > Metrics > Account > IncomingBytes and choosing the SUM statistic for:
the last 3 months with a period of 30 days, I get 43 GB; but with a period of 7 days I get 17 GB, and with a period of 1 day 45 MB
the last 4 weeks with a period of 30 days, I get 63 GB; but with a period of 7 days I get 784 KB, and with a period of 1 day 785 KB.
I do not understand it. How can I get the size of my logs right now, and how can I find out how it increases over time (for example, per day)?
CloudWatch Logs doesn't publish a metric for "bytes right now." And a sum of IncomingBytes will just show the bytes received in whatever period you look at; it doesn't account for existing bytes or bytes that are removed due to a retention policy or deleted stream.
However, you can get the current reported bytes from the log group description. Here's a Python program that iterates all log groups and prints the answer:
import boto3

# Iterate over every log group and print its current stored size in bytes.
client = boto3.client('logs')
paginator = client.get_paginator('describe_log_groups')
for page in paginator.paginate():
    for group in page['logGroups']:
        print(f"{group['logGroupName']}: {group['storedBytes']}")
If it's important to track this over time, I'd wrap it in a Lambda that runs nightly (or however often you want) and reports the number as a custom metric.
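A rough sketch of such a handler (the Custom/Logs namespace and TotalStoredBytes metric name are arbitrary choices for illustration):

import boto3

logs = boto3.client('logs')
cloudwatch = boto3.client('cloudwatch')

def handler(event, context):
    # Add up storedBytes across all log groups and publish the total as a custom metric.
    total = 0
    paginator = logs.get_paginator('describe_log_groups')
    for page in paginator.paginate():
        for group in page['logGroups']:
            total += group.get('storedBytes', 0)
    cloudwatch.put_metric_data(
        Namespace='Custom/Logs',              # arbitrary namespace for this example
        MetricData=[{
            'MetricName': 'TotalStoredBytes', # arbitrary metric name
            'Value': total,
            'Unit': 'Bytes',
        }],
    )
    return total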
The problem was related to the CloudWatch configuration (Graph options): "latest value" was selected when it should have been "time range value".
After changing it, CloudWatch showed that I have some TB of logs, and modifying the period always showed the same values.

Why is the value of "Sum CPUCreditBalance" so high?

I have 3 EC2 instances which were created by Elastic Beanstalk. Their current CPU credit balances are as follows:
And this is the monitoring page in Elastic Beanstalk:
Why is "Sum CPUCreditBalance" equal to 1.8K?
As you can see from the first picture, the CPU credit balances of the 3 EC2 instances are all below 120. 120 * 3 = 360 is far smaller than 1.8K = 1800.
How is 1.8K calculated?
Here are the options I used when creating Sum CPUCreditBalance:
It is the sum of all data points (CPU Credit Balance) in the graph.
Roughly calculating data points: 11x20 + 7x50 + 110x11 = 1780
SUM() isn't a meaningful aggregation of a sampled statistic like CPU Credit Balance. You're adding up all the values from the samples recorded in the time range, and that provides no useful information for this type of measurement.
SUM() only makes sense when the metric itself is a raw count of things per sampling period, such as the number of HTTP requests or errors.
Sum -- All values submitted for the matching metric added together. This statistic can be useful for determining the total volume of a metric.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Statistic
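For a sampled balance like this, Average or Maximum per instance is the meaningful request. A minimal boto3 sketch (the instance ID is a placeholder):

import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=3)

resp = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUCreditBalance',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=['Average', 'Maximum'],   # meaningful for a sampled balance, unlike Sum
)
for point in sorted(resp['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Average'], point['Maximum'])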

Interpreting the Stackdriver metrics description "Sampled every 60 seconds. After sampling, data is not visible for up to 240 seconds"

We are in the process of identifying Stackdriver metrics.
I am specifically looking at the GCP predefined metric subscription/ack_message_count with the description "Cumulative count of messages acknowledged by Acknowledge requests, grouped by delivery type. Sampled every 60 seconds. After sampling, data is not visible for up to 240 seconds."
Can anyone help me understand the highlighted part: what does "Sampled every 60 seconds. After sampling, data is not visible for up to 240 seconds." mean?
Once I check this metric, will it not be available for the next 240 seconds?
Thanks
"Sampled every" refers to granularity. In this case, you'll get a data point for every minute.
"not visible" refers to freshness. In this case, the newest data point will describe the system as it was 4 minutes ago. Put another way, if you do something and watch the graphs you won't see the metric reflect the change for 4 minutes.
From my understanding, the data is polled every 60 seconds, but after sampling it can take up to 240 seconds until the data becomes visible. The BigQuery section makes this a bit clearer, because its numbers would not be feasible under any other interpretation:
Example: Scanned bytes. Sampled every 60 seconds. After sampling, data is not visible for up to 21720 seconds.
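In practice, these delays just mean padding the query window by the ingest delay. A small illustration in plain Python (no monitoring client; the 60 s and 240 s values come from the ack_message_count description above):

import datetime

SAMPLE_PERIOD = datetime.timedelta(seconds=60)   # "Sampled every 60 seconds"
INGEST_DELAY = datetime.timedelta(seconds=240)   # "not visible for up to 240 seconds"

now = datetime.datetime.utcnow()
# The newest visible data point describes the system as it was roughly here:
freshest_visible = now - INGEST_DELAY
# So a polling query should end at freshest_visible and expect one point per minute:
query_start = freshest_visible - 10 * SAMPLE_PERIOD
print(f"query window: {query_start} .. {freshest_visible}")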

Amazon elasticsearch interpretation of FreeStorageSpace metrics

I have 6 instances of type m3.large.elasticsearch with storage type "instance".
I don't really get what Average, Minimum, Maximum, etc. mean here.
I am not getting any logs into my cluster right now although it shows FreeStorageSpace as 14.95GB here:
But my FreeStorageSpace graph for "Minimum" has reached zero!
What is happening here?
I was also confused by this. Minimum means the free space on a single data node, the one with the least free space. And Sum means the free space of the entire cluster (a summation of the free space on all data nodes). I got this info from the following link:
http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-managedomains.html
We ran into the same confusion. Avg, Min and Max are computed across the individual nodes, and Sum combines the free/used space for the whole cluster.
We had assumed that Average FreeStorageSpace means the average free storage space of the whole cluster and set an alarm with the following calculation in mind:
Per day index = 1 TB
Max days to keep indices = 10
Hence we had an average utilization of 10 TB at any point in time. Assuming we would go 2x, i.e. 20 TB, our actual storage need as per https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/sizing-domains.html#aes-bp-storage
with a replication factor of 2 is:
(20 * 2 * 1.1 / 0.95 / 0.8) = 57.89 =~ 60 TB
So we provisioned 18 x 3.8 TB instances =~ 68 TB to accommodate the 2x = 60 TB.
So we set an alarm that if we go below 8 TB of free storage, it means we have hit our 2x limit and should scale up. Hence we set the alarm:
FreeStorageSpace <= 8388608.00 for 4 datapoints within 5 minutes + Statistic=Average + Duration=1minute
FreeStorageSpace is in MB, hence 8 TB = 8388608 MB.
But we immediately got alerted, because the average free storage per node was of course below 8 TB.
After realizing that to get the cluster-wide storage you need to take the Sum of FreeStorageSpace over 1 minute, we set the alarm as:
FreeStorageSpace <= 8388608.00 for 4 datapoints within 5 minutes + Statistic=Sum + Duration=1minute
The above calculation checked out and we were able to set the right alarms.
The same applies for ClusterUsedSpace calculation.
You should also track the actual free space percentage using CloudWatch metric math:
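One way to do that is a metric-math alarm. Here is a rough boto3 sketch (the domain name, account ID, and 20% threshold are placeholders), combining the cluster-wide Sums of FreeStorageSpace and ClusterUsedSpace into a percentage:

import boto3

cloudwatch = boto3.client('cloudwatch')
es_dimensions = [
    {'Name': 'DomainName', 'Value': 'my-domain'},   # placeholder domain
    {'Name': 'ClientId', 'Value': '123456789012'},  # placeholder account ID
]

def metric_query(metric_name, query_id):
    # Cluster-wide total, i.e. Stat=Sum over a 1-minute period, as discussed above.
    return {
        'Id': query_id,
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ES',
                'MetricName': metric_name,
                'Dimensions': es_dimensions,
            },
            'Period': 60,
            'Stat': 'Sum',
        },
        'ReturnData': False,
    }

cloudwatch.put_metric_alarm(
    AlarmName='es-free-storage-percent',
    ComparisonOperator='LessThanOrEqualToThreshold',
    Threshold=20.0,            # e.g. alert when less than 20% of total storage is free
    EvaluationPeriods=4,
    Metrics=[
        metric_query('FreeStorageSpace', 'free'),
        metric_query('ClusterUsedSpace', 'used'),
        {
            'Id': 'pct',
            'Expression': '100 * free / (free + used)',
            'Label': 'FreeStoragePercent',
            'ReturnData': True,
        },
    ],
)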

Metric-based Auto scaling policies in Amazon EC2

I have defined the following policies on t2.micro instance:
Take action A whenever {maximum} of CPU Utilization is >= 80% for at least 2 consecutive period(s) of 1 minute.
Take action B whenever {Minimum} of CPU Utilization is <= 20% for at least 2 consecutive period(s) of 1 minute.
Is my interpretation wrong that if the min (max) of CPU drops below (goes beyond) 20 (80) for 2 minutes, these rules should be activated?
Because my collected stats show, for example, that the Max of CPU reached 90% in two consecutive 1-minute periods, but I got no alarm!
Cheers
It seems my interpretation is not correct! The policy works based on the Average of the metric for every minute. It means the first policy will be triggered if the average of the data points within a minute is >= 80% for two consecutive 1-minute periods. The reason is simple: CloudWatch does not consider data points at less than 1-minute granularity. So if I go for a 5-minute period, Max and Min show the expected behavior.
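For reference, a sketch of the scale-up alarm as boto3 would express it (the Auto Scaling group name and scaling policy ARN are placeholders), using Average as the evaluated statistic per the explanation above:

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='scale-up-on-high-cpu',
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': 'my-asg'}],  # placeholder group
    Statistic='Average',        # what actually gets evaluated each minute
    Period=60,
    EvaluationPeriods=2,        # 2 consecutive periods of 1 minute
    Threshold=80.0,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    AlarmActions=['arn:aws:autoscaling:region:account:scalingPolicy:...'],  # placeholder ARN
)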