Network data out - nmon/nload vs AWS CloudWatch disparity

We are running a video conferencing server in an EC2 instance.
Since this is a data out (egress) heavy app, we want to monitor the network data out closely (since we are charged heavily for that).
As seen in the screenshot above, in our test, nmon (top right) and nload (left) on our EC2 server show the network out as 138 Mbit/s (nload) and 17,263 KB/s (nmon), which are very close (138/8 = 17.25).
But when we check the network out (bytes) in AWS CloudWatch (bottom right), the number shown is much higher (~1 GB/s), which makes more sense for the test we are running, and this is the number we are ultimately charged for.
Why is there such a big difference between nmon/nload and AWS CloudWatch?
Are we missing some understanding here? Are we not looking at the AWS Cloudwatch metrics correctly?
Thank you for your help!
Edit:
Adding a screenshot of a longer test, which shows the average network out metric in AWS CloudWatch flat at around 1 GB for the test duration, while nmon shows an average network out of 15,816 KB/s.

Just figured out the answer to this.
The following link talks about the periods of data capture in AWS:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
Periods
A period is the length of time associated with a specific
Amazon CloudWatch statistic. Each statistic represents an aggregation
of the metrics data collected for a specified period of time. Periods
are defined in numbers of seconds, and valid values for period are 1,
5, 10, 30, or any multiple of 60. For example, to specify a period of
six minutes, use 360 as the period value. You can adjust how the data
is aggregated by varying the length of the period. A period can be as
short as one second or as long as one day (86,400 seconds). The
default value is 60 seconds.
Only custom metrics that you define with a storage resolution of 1
second support sub-minute periods. Even though the option to set a
period below 60 is always available in the console, you should select
a period that aligns to how the metric is stored. For more information
about metrics that support sub-minute periods, see High-resolution
metrics.
As seen in the link above, if we don't publish a custom metric with a 1-second storage resolution, AWS does not capture sub-minute data by default, so the finest resolution available is one data point per minute.
So, in our case, the network out data within 60 seconds is aggregated and captured as a single data point.
Even if I change the statistic to Average and the period to 1 second, it still shows 1-minute data.
Now, if I divide the 1.01 GB shown by AWS by 60, I get the per-second rate, roughly 16.8 MB/s, which is very close to the values shown by nmon and nload.
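If you want to pull the per-second figure programmatically instead of dividing by hand, a minimal boto3 sketch (the instance ID and time window below are placeholders) could look like this:

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# NetworkOut is reported as a Sum of bytes per period, so dividing by the
# period length gives bytes/second.
period = 60  # detailed (one-minute) monitoring
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(minutes=10)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=period,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Sum"] / period / 1e6, 1), "MB/s")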

From the AWS docs:
NetworkOut: The number of bytes sent out by the instance on all network interfaces. This metric identifies the volume of outgoing network traffic from a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
The NetworkOut graph in your case does not represent the current speed; it represents the number of bytes sent out by all network interfaces in the last 5 minutes. If my calculations are correct, we should get the following values:
1.01 GB ~= 1027 MB (reading from your graph)
To get the average speed for the last 5 minutes:
1027 MB / 300 = 3.42333 MB/s ~= 27.38 Mbits/s
It is still more than what you are expecting, although this is just an average for the last 5 minutes.
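If you would rather see a rate directly in the CloudWatch console, a metric math expression along these lines (a sketch using the same NetworkOut metric) plots average Mbit/s instead of raw bytes per period:
m1 = NetworkOut - Sum
Expression: (m1*8)/PERIOD(m1)/1000000
PERIOD(m1) returns the period length in seconds, so the same expression works for both one-minute and five-minute monitoring.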

Related

AWS Elasticsearch publishing wrong total request metric

We have an AWS Elasticsearch cluster setup. However, our Error rate alarm goes off at regular intervals. The way we are trying to calculate our error rate is:
((sum(4xx) + sum(5xx))/sum(ElasticsearchRequests)) * 100
However, if you look at the screenshot below, at 7:15 the 4xx count was 4, while the ElasticsearchRequests value is only 2. Based on the metrics info on the AWS Elasticsearch documentation page, ElasticsearchRequests should be the total number of requests, so it should clearly be greater than or equal to the 4xx count.
Can someone please help me understand what I am doing wrong here?
AWS definitions of these metrics are:
OpenSearchRequests (previously ElasticsearchRequests): The number of requests made to the OpenSearch cluster. Relevant statistics: Sum
2xx, 3xx, 4xx, 5xx: The number of requests to the domain that resulted in the given HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant statistics: Sum
Please note the different terms used for the subjects of the metrics: cluster vs domain
To my understanding, OpenSearchRequests only counts requests that actually reach the underlying OpenSearch/Elasticsearch cluster, so some of the 4xx requests might not (e.g. 403 errors), hence the difference between the metrics.
Also, AWS only recommends comparing 5xx to OpenSearchRequests:
5xx alarms >= 10% of OpenSearchRequests: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.
I know this was posted a while back but I've additionally struggled with this issue and maybe I can add a few pointers.
First off, make sure your metrics are properly configured. For instance, some responses (4xx for example) take up to 5 minutes to register, while OpenSearchRequests is refreshed every minute. This makes for a very wonky graph that will definitely throw off your error rate.
In the picture above, I send a request that returns 400 every 5 seconds and a request that returns 200 every 0.5 seconds, with the period set to 1 minute, so on average the error rate should be around 10%. As you can see by the green line, the requests sent are summed up every minute, whereas the 4xx are summed up every 5 minutes and are 0 in between, which makes for an error rate spike every 5 minutes (since the OpenSearchRequests are not multiplied by 5).
In the next image, the period is set to 5 minutes. Notice how this time the error rate is around 10 percent.
When I look at your graph, I see metrics that look like they are based on different periods.
The second pointer I can add is to account for periods when no data is coming in. The behavior of the alarm may vary based on how you define the "treat missing data" setting. In some cases, if no data comes in, your expression might keep the alarm in the ALARM state when in fact there is simply no new data arriving. Some metrics return no value when no requests are made, while others return 0. In the former case, you can use the FILL(metric, value) function to specify what to return when no value is reported. Experiment with what happens to your error rate if you divide by zero.
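For example, an expression along these lines (a sketch; the metric IDs are whatever you assign in the console, and filling the request count with 1 is just one way to avoid dividing by zero) fills the gaps before doing the math:
m1 = 4xx - Sum
m2 = 5xx - Sum
m3 = OpenSearchRequests - Sum
Expression: 100*(FILL(m1,0)+FILL(m2,0))/FILL(m3,1)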
Hope this message helps clarify a bit.

CloudWatch Metrics for Volume IOPS, Volume Throughput (MiB/s) and Network (Gbps)

I had to troubleshoot an application on AWS, and it was not easy to use all the CloudWatch metric graphs to interpret the health of the environment, so I decided to share my experience here.
CloudWatch gives us metrics for CPU, Memory*, Disk and Network.
* To get memory metrics you need to install the CloudWatch agent.
CPU and Memory give us the metric as a percentage, which is clear and straightforward to interpret.
But Disk and Network are not that easy; for example, I wanted to check IOPS and Throughput (MiB/s) for my volumes and Network (Gbps) for my instance.
I needed those values because AWS defines EBS limits in IOPS and Throughput (MiB/s) and instance network limits in Gbps.
Total IOPS
EBS volumes give us the metrics VolumeReadOps and VolumeWriteOps. Let me quote the AWS documentation.
VolumeReadOps - The total number of read operations in a specified period of time.
To calculate the average read operations per second (read IOPS) for the period, divide the total read operations in the period by the number of seconds in that period.
VolumeWriteOps - The total number of write operations in a specified period of time.
To calculate the average write operations per second (write IOPS) for the period, divide the total write operations in the period by the number of seconds in that period.
Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cloudwatch_ebs.html
To get the Total IOPS we need to compute (VolumeReadOps + VolumeWriteOps) / SecondsInPeriod.
Luckily, CloudWatch helps us with metric math expressions. Use the expression below; the PERIOD function is our friend here.
m1 = VolumeWriteOps - Sum
m2 = VolumeReadOps - Sum
Expression: (m1+m2)/PERIOD(m1)
Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
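The same expression can also be used outside the console, for example with boto3's get_metric_data (the volume ID and time range below are placeholders, just to sketch the idea):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
volume = [{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}]  # placeholder

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=1)

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "m1",
         "MetricStat": {"Metric": {"Namespace": "AWS/EBS",
                                   "MetricName": "VolumeWriteOps",
                                   "Dimensions": volume},
                        "Period": 300, "Stat": "Sum"},
         "ReturnData": False},
        {"Id": "m2",
         "MetricStat": {"Metric": {"Namespace": "AWS/EBS",
                                   "MetricName": "VolumeReadOps",
                                   "Dimensions": volume},
                        "Period": 300, "Stat": "Sum"},
         "ReturnData": False},
        # Same expression as above: total operations divided by the period length.
        {"Id": "iops", "Expression": "(m1+m2)/PERIOD(m1)", "Label": "Total IOPS"},
    ],
    StartTime=start,
    EndTime=end,
)

for result in resp["MetricDataResults"]:
    print(result["Label"], result["Values"])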
Total Throughput (MiB/s)
EBS volumes give us the metrics VolumeReadBytes and VolumeWriteBytes. Let me quote the AWS documentation.
VolumeReadBytes - Provides information on the read operations in a specified period of time. The Sum statistic reports the total number of bytes transferred during the period.
VolumeWriteBytes - Provides information on the write operations in a specified period of time. The Sum statistic reports the total number of bytes transferred during the period.
Both metrics give us the value in bytes, but we want them in MiB, so to convert we need to divide by 1048576, which is the result of 1024 * 1024. Let me explain in detail.
1024 bytes = 1 KiB
1024 KiB = 1 MiB
To get the Total Throughput in MiB/s we need to compute ((VolumeReadBytes + VolumeWriteBytes) / 1048576) / SecondsInPeriod.
Use the expression below; again, the PERIOD function is our friend here.
m1 = VolumeWriteBytes - Sum
m2 = VolumeReadBytes - Sum
Expression: ((m1+m2)/1048576)/PERIOD(m1)
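Quick sanity check with made-up numbers: 300 MiB transferred over one 300-second period should come out as 1 MiB/s.
bytes_in_period = 300 * 1048576              # hypothetical: 300 MiB in a 5-minute period
print((bytes_in_period / 1048576) / 300)     # 1.0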
Total Network (Gbps)
EC2 instances give us the metrics NetworkIn and NetworkOut. Let me quote the AWS documentation.
NetworkIn - The number of bytes received on all network interfaces by the instance. This metric identifies the volume of incoming network traffic to a single instance.
The number reported is the number of bytes received during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
NetworkOut - The number of bytes sent out on all network interfaces by the instance. This metric identifies the volume of outgoing network traffic from a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html
Both metrics give us the value in bytes per period, but we want it in gigabits per second (Gbps).
To convert from "period" to "second" we just need to divide by 300 (as I am using basic five-minute monitoring).
To convert from bytes to gigabits we need to divide by 125,000,000, which is the result of 1,000,000,000 / 8 (multiply by 8 to get bits, then divide by 1,000,000,000 to get gigabits). Let me explain in detail.
1000 bits = 1 kilobit
1000 kilobits = 1 megabit
1000 megabits = 1 gigabit
1 byte = 8 bits
To get the Total Network in Gbps we need to compute ((NetworkIn + NetworkOut) / 300) / 125000000.
m1 = NetworkIn - Sum
m2 = NetworkOut - Sum
Expression: ((m1+m2)/300)/125000000
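Quick sanity check with hypothetical numbers: an instance pushing a steady 1 Gbps sends 125,000,000 bytes per second, i.e. 37,500,000,000 bytes per 5-minute period, and the expression returns 1.0.
bytes_in_period = 125000000 * 300            # 1 Gbps sustained for 300 seconds
print((bytes_in_period / 300) / 125000000)   # 1.0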

Where and how to set up a function that makes a GET request every second?

I am trying to set up a function that will run somewhere on a server. It is a simple GET request and I want to trigger it every second.
I tried Google Cloud Functions and AWS. Neither of them has a straightforward way to run it every second (every 1 minute only).
Could you please suggest a service, or a combination of services, that will allow me to do this (preferably not costly)?
Here are some options on AWS ...
Launch a t2.nano EC2 instance to run a script that issues the GET, sleeps for 1 second, and repeats (a minimal sketch is shown below). You can't use cron, which doesn't support every-second schedules. This costs about 13 cents per day.
If you are going to do this for months/years then reduce the cost by using Reserved Instances.
If you can tolerate periods where the GET requests don't happen then reduce the cost even further by using Spot instances.
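The poller script itself can be as small as this sketch (the URL is a placeholder):

#!/usr/bin/env python3
# Minimal poller: one GET per second, forever.
import time
import urllib.request

TARGET_URL = "https://example.com/endpoint"  # placeholder

while True:
    try:
        urllib.request.urlopen(TARGET_URL, timeout=5).read()
    except Exception as exc:
        print("GET failed:", exc)
    time.sleep(1)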
That said, why do you need to issue a GET request every second? Perhaps there is a better solution here.
You can create an AWS Lambda function that simply loops, issues the GET request every second, and exits after 240 requests (i.e. 4 minutes). Then create a CloudWatch event that fires every 4 minutes and calls the Lambda function.
Every 4 minutes because the maximum timeout you can set for a Lambda function is 5 minutes.
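A minimal handler along those lines (the URL is again a placeholder, and error handling is kept to a bare minimum) might look like this:

import time
import urllib.request

TARGET_URL = "https://example.com/endpoint"  # placeholder

def lambda_handler(event, context):
    # One GET per second for 4 minutes, then exit well before the timeout.
    for _ in range(240):
        started = time.time()
        try:
            urllib.request.urlopen(TARGET_URL, timeout=5).read()
        except Exception as exc:
            print("GET failed:", exc)
        # Sleep for whatever is left of the current second.
        time.sleep(max(0.0, 1.0 - (time.time() - started)))
    return {"requests": 240}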
This setup will likely incur only some trivial cost:
At 1 event per 4 minutes, it's $1/month for the CloudWatch events generated.
At 1 call per 4 minutes to a minimally configured (128MB) Lambda function, it's 324,000 GB-second worth of execution per month, just within the free tier of 400,000 GB-second.
Since network transfer into AWS is free, the response size of your GET request is irrelevant. And the first 1GB of transfer out to the Internet is free, which should cover all the GET requests themselves.

CloudWatch Agent: batch size equal to "1" - is it a bad idea?

If I understand correctly, the CloudWatch agent publishes log events to CloudWatch using a kind of batching, the size of which is specified by these two params:
batch_count:
Specifies the max number of log events in a batch, up to 10000. The
default value is 1000.
batch_size
Specifies the max size of log events in a batch, in bytes, up to
1048576 bytes. The default value is 32768 bytes. This size is
calculated as the sum of all event messages in UTF-8, plus 26 bytes
for each log event.
I guess that, in order to eliminate the possibility of losing any log data if the EC2 instance is terminated, batch_count should be set to 1 (because on instance termination all local logs are destroyed). Am I right that this is the only way to achieve it, and how can this affect performance? Will it have any noticeable side effects?
Yes, it's a bad idea. You are probably more likely to lose data that way. The PutLogEvents API that the agent uses is limited to 5 requests per second per log stream (source). With a batch_count of 1, you'd only be able to publish 5 log events per second. If the application were to produce more than that consistently, the agent wouldn't be able to keep up.
If you absolutely can't afford to lose any log data, maybe you should be writing that data to a database instead. There will always be some risk of losing log data, even with a batch_count of 1. The host could always crash before the agent polls the log file... which, by the way, happens every 5 seconds by default (source).

Converting a high performance web service from Nginx on AWS EC2 to AWS Lambda

On a project I'm working on, there are a number of web services implemented on AWS. The services that are relatively simple (a DynamoDB insert or lookup) and will be used relatively infrequently have been implemented as Lambdas, which were perfect for the task. There is also a more complex web service that does a lot of string processing and regex matching and needs to be highly performant; it has been implemented in C++ (roughly 5K LOC) as an Nginx module and can handle in the region of 20K requests/s running on an EC2 instance (the service just takes in a small JSON payload, does a lot of string processing and regex matching against some reference data that sits in static data files on S3, and returns a JSON response under 1 KB in size).
There is a push from management to unify our use of AWS services and have all the web services implemented as Lambdas.
My question is: can a high-performance web service such as this compiled C++ Nginx module running on EC2, which is expected to run continuously and handle 20K to 100K req/s, actually be converted to AWS Lambda (in Python) with the same performance, or is it better left as-is on EC2? What are the performance considerations to be aware of when converting to Lambda?
Can Lambda do it? Yes. Should Lambda do it? No.
Why? Cost.
First, let's say you do handle 20k Requests / Second, every second for an entire day. That will then equate to 1.728 Billion requests in that day. In the free tier, you do get 1 Million requests free, so that drops the billable requests down to 1.727 Billion. Lambda charges $0.20 / Million Requests, so:
1.727 Billion requests * $0.20 / Million requests = $345.40
I'm pretty sure your cost for EC2 is lower than that per day. Taking the m4.16xlarge instance, with on-demand pricing, we get:
$3.20 / Hour * 24 Hours = $76.80
See the difference? But, Lambda also charges for compute time!
Let's say you include the C++ executable in your Lambda function (called from Python or Node), so we won't take into account the performance hit of going from C++ to an interpreted language. Since Lambda charges in 100-millisecond blocks, rounded up, for this estimate we will assume that all requests finish within 100 milliseconds.
Say you use the smallest memory size, 128 MB. That will give you 3.2 Million Seconds within the free tier, or 32 Million Requests given that they are all under 100 Milliseconds, free. But that still leaves you with 1.696 Billion Requests billable. The cost for the 128 MB size is $0.000000208 / 100 Milliseconds. Given that each request finishes under 100 Milliseconds, the cost for the execution time will be:
$0.000000208 / 100 Milliseconds * 1.696 Billion 100 Millisecond Units = $352.77
Adding that cost to the cost of the requests, you get:
$345.40 + $352.77 = $698.17
EC2: $76.80
Lambda: $698.17
Note, this is just using the 20k Requests / Second number that you gave and is for a single day. If the actual number of requests differs, the requests take longer than 100 Milliseconds, or you need more memory than 128 MB, the cost estimate will go up or down accordingly.
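If you want to redo the estimate with your own numbers, here is a small sketch using the same per-request and per-100 ms prices assumed above:

# Rough daily Lambda cost, using the prices assumed in this answer.
REQUEST_PRICE = 0.20 / 1000000        # $ per request
COMPUTE_PRICE_128MB = 0.000000208     # $ per 100 ms block at 128 MB
FREE_REQUESTS = 1000000               # free tier requests (applied to one day here)
FREE_COMPUTE_UNITS = 32000000         # 400,000 GB-s at 128 MB, in 100 ms blocks

def daily_lambda_cost(req_per_sec, avg_ms=100):
    requests = req_per_sec * 86400
    blocks_per_request = -(-avg_ms // 100)   # billed in 100 ms blocks, rounded up
    request_cost = max(0, requests - FREE_REQUESTS) * REQUEST_PRICE
    compute_cost = max(0, requests * blocks_per_request - FREE_COMPUTE_UNITS) * COMPUTE_PRICE_128MB
    return request_cost + compute_cost

print(round(daily_lambda_cost(20000), 2))    # ~698.17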
Lambda has its place, but EC2 does also. Just because you can put it on Lambda doesn't mean you should.