AWS Elasticsearch publishing wrong total request metric - amazon-web-services

We have an AWS Elasticsearch cluster setup. However, our Error rate alarm goes off at regular intervals. The way we are trying to calculate our error rate is:
((sum(4xx) + sum(5xx))/sum(ElasticsearchRequests)) * 100
However, if you look at the screenshot below, at 7:15 the 4xx count was 4, but the ElasticsearchRequests value was only 2. Based on the metrics info on the AWS Elasticsearch documentation page, ElasticsearchRequests should be the total number of requests, so it should clearly be greater than or equal to 4xx.
Can someone please help me understand what I am doing wrong here?

AWS definitions of these metrics are:
OpenSearchRequests (previously ElasticsearchRequests): The number of requests made to the OpenSearch cluster. Relevant statistics: Sum
2xx, 3xx, 4xx, 5xx: The number of requests to the domain that resulted in the given HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant statistics: Sum
Please note the different terms used for the subjects of the metrics: cluster vs domain
To my understanding, OpenSearchRequests only counts requests that actually reach the underlying OpenSearch/Elasticsearch cluster, and some of the 4xx requests might not (e.g. 403 errors), hence the difference in metrics.
Also, AWS only recommends comparing 5xx to OpenSearchRequests:
5xx alarms >= 10% of OpenSearchRequests: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.

I know this was posted a while back, but I've also struggled with this issue and maybe I can add a few pointers.
First off, make sure your metrics are properly configured. For instance, some responses (4xx for example) take up to 5 minutes to register, while OpenSearchRequests are refreshed every minute. This makes for a very wonky graph that will definitely throw off your error rate.
In the picture above, I send a request that returns 400 every 5 seconds and a request that returns 200 every 0.5 seconds. The period in this case is 1 minute. On average, that should give an error rate of around 10%. As you can see from the green line, the requests sent are summed every minute, whereas the 4xx responses are summed every 5 minutes and read 0 for the minutes in between. This produces an error rate spike every 5 minutes (since the OpenSearch requests are not multiplied by 5).
In the next image, the period is set to 5 minutes. Notice how this time the error rate is around 10 percent.
When I look at your graph, I see metrics that look like they are based off of a different period.
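The period mismatch described above can be reproduced with a small simulation. This is only a sketch under assumptions taken from the test setup in this answer: 12 requests per minute return 400, 120 per minute return 200, all of them are counted in OpenSearchRequests, and the 4xx metric arrives as one lump sum every fifth minute. The helper names are made up for illustration.

```python
def error_rate(errors, requests):
    """Error rate in percent; 0.0 when there were no requests."""
    return 100.0 * errors / requests if requests else 0.0


def resample(values, width):
    """Sum consecutive groups of `width` one-minute values (a longer period)."""
    return [sum(values[i:i + width]) for i in range(0, len(values), width)]


MINUTES = 10
# Assumed traffic: 120 x 200 + 12 x 400 per minute, all counted as requests.
requests_1min = [132] * MINUTES
# Assumed publishing behaviour: 4xx arrives as a 5-minute sum, 0 in between.
errors_1min = [60 if (m + 1) % 5 == 0 else 0 for m in range(MINUTES)]

rates_1min = [error_rate(e, r) for e, r in zip(errors_1min, requests_1min)]
rates_5min = [
    error_rate(e, r)
    for e, r in zip(resample(errors_1min, 5), resample(requests_1min, 5))
]

print([round(x, 1) for x in rates_1min])
print([round(x, 1) for x in rates_5min])
```

With these assumed numbers, the 1-minute series sits at 0 and spikes to about 45% every fifth minute, while the 5-minute series stays near 9%.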
The second pointer I would add is to account for periods when no data is coming in. The alarm's behavior may vary based on how you define the "treat missing data" parameter. In some cases, if no data comes in, your expression may keep the alarm in the ALARM state when in fact there is simply no new data. Some metrics return no value when no requests are made, while others return 0. In the former case, you can use the FILL(metric, value) function to specify what to return when no value is returned. Also experiment with what happens to your error rate when you divide by zero.
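To see what filling missing values and guarding the division look like, here is a plain Python sketch. It is not CloudWatch metric math: `fill` only imitates the idea behind FILL, `None` stands in for a period where a metric reported nothing, and the datapoints are invented.

```python
def fill(series, value):
    """Replace missing datapoints (None) with `value`, similar in spirit to FILL."""
    return [value if x is None else x for x in series]


def error_rate_series(errors_4xx, errors_5xx, requests):
    """Per-period error rate in percent; periods with no requests yield 0.0."""
    rates = []
    for e4, e5, total in zip(fill(errors_4xx, 0), fill(errors_5xx, 0), fill(requests, 0)):
        rates.append(100.0 * (e4 + e5) / total if total > 0 else 0.0)
    return rates


# Hypothetical datapoints: None marks a period where the metric reported nothing.
rates = error_rate_series(
    errors_4xx=[4, None, 1, None],
    errors_5xx=[None, None, 1, None],
    requests=[40, 10, 20, None],
)
print(rates)
```

Whether a period with no requests should count as 0% or as missing is a judgment call; the sketch picks 0% so that an idle period does not look like an error spike.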
Hope this message helps clarify a bit.

Related

Network data out - nmon/nload vs AWS Cloudwatch disparity

We are running a video conferencing server in an EC2 instance.
Since this is a data out (egress) heavy app, we want to monitor the network data out closely (since we are charged heavily for that).
As seen in the screenshot above, in our test, using nmon (top right) or nload (left) in our EC2 server shows the network out as 138 Mbits/s in nload and 17263 KB/s in nmon which are very close (138/8 = 17.25).
But, when we check the network out (bytes) in AWS Cloudwatch (bottom right), the number shown is very high (~ 1 GB/s) (which makes more sense for the test we are running), and this is the number for which we are finally charged.
Why is there such a big difference between nmon/nload and AWS Cloudwatch?
Are we missing some understanding here? Are we not looking at the AWS Cloudwatch metrics correctly?
Thank you for your help!
Edit:
Adding the screenshot of a longer test which shows the average network out metric in AWS Cloudwatch to be flat around 1 GB for the test duration while nmon shows average network out of 15816 KB/s.
Just figured out the answer to this.
The following link talks about the periods of data capture in AWS:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
Periods
A period is the length of time associated with a specific
Amazon CloudWatch statistic. Each statistic represents an aggregation
of the metrics data collected for a specified period of time. Periods
are defined in numbers of seconds, and valid values for period are 1,
5, 10, 30, or any multiple of 60. For example, to specify a period of
six minutes, use 360 as the period value. You can adjust how the data
is aggregated by varying the length of the period. A period can be as
short as one second or as long as one day (86,400 seconds). The
default value is 60 seconds.
Only custom metrics that you define with a storage resolution of 1
second support sub-minute periods. Even though the option to set a
period below 60 is always available in the console, you should select
a period that aligns to how the metric is stored. For more information
about metrics that support sub-minute periods, see High-resolution
metrics.
As seen in the link above, if we don't set a custom metric with custom periods, AWS by default does not capture sub-minute data. So, the lowest resolution of data available is every 1 minute.
So, in our case, the network out data within 60 seconds is aggregated and captured as a single data point.
Even if I change the statistic to Average and the period to 1 second, it still shows one data point per minute.
Now, if I divide 1.01 GB (shown by AWS) by 60, I get the per-second figure, which is roughly 16.8 MB/s and very close to the number shown by nmon or nload.
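The conversion can be written out as a short Python sketch. It assumes the 1.01 GB datapoint is a Sum over a 60-second period and uses decimal units (1 GB = 1e9 bytes); the function name is made up for illustration.

```python
def throughput(bytes_in_period, period_seconds):
    """Convert a NetworkOut Sum datapoint into average MB/s and Mbit/s."""
    bytes_per_second = bytes_in_period / period_seconds
    megabytes_per_second = bytes_per_second / 1_000_000
    megabits_per_second = megabytes_per_second * 8
    return megabytes_per_second, megabits_per_second


# 1.01 GB read off the graph, taken here as 1.01e9 bytes (decimal units assumed).
mb_s, mbit_s = throughput(1.01e9, period_seconds=60)
print(f"{mb_s:.1f} MB/s, {mbit_s:.1f} Mbit/s")
```

Under these assumptions the result is about 16.8 MB/s, or roughly 135 Mbit/s, which is in line with the 138 Mbit/s reported by nload.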
From the AWS docs:
NetworkOut: The number of bytes sent out by the instance on all network interfaces. This metric identifies the volume of outgoing network traffic from a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
The NetworkOut graph in your case does not represent the current speed, it represents the number of bytes sent out by all network interfaces in the last 5 minutes. If my calculations are correct, we should get the following values:
1.01 GB ~= 1027 MB (reading from your graph)
To get the average speed for the last 5 minutes:
1027 MB / 300 = 3.42333 MB/s ~= 27.39 Mbits/s
It is still more than what you are expecting, although this is just an average for the last 5 minutes.

On WSO2, rate limit works successfully but the quota does not

When setting throttling limits for our API, it appears that the Rate Limit works successfully but the Quota does not.
We created a subscription that limits to 10 requests/second, and when running tests, we obtain a 429 response upon sending an 11th query in one second, which is exactly what we want and expect.
However, the filter also has a Quota of 100 requests/minute, yet we are able to run over 100 requests (have tested up to 300 queries and still gotten entirely 200 response codes) in the span of a minute without getting throttled.

AWS WAF How to rate limit path by IP below the minimum of 2000 requests/minute

I have a path (mysite.com/myapiendpoint for sake of example) that is both resource intensive to service, and very prone to bot abuse. I need to rate limit access to that specific path to something like 10 requests per minute per client IP address. How can this be done?
I'm hosting off an EC2 instance with CloudFront and AWS WAF in front. I have the standard "Rate Based Rule" enabled, but its 2,000 requests per minute per IP address minimum is absolutely unusable for my application.
I was considering using API Gateway for this, and have used it in the past, but its rate limiting as I understand it is not based on IP address, so bots would simply use up the limit and legitimate users would constantly be denied usage of the endpoint.
My site does not use sessions of any sort, so I don't think I could do any sort of rate limiting in the server itself. Also please bear in mind my site is a one-man-operation and I'm somewhat new to AWS :)
How can I limit the usage per IP to something like 10 requests per minute, preferably in WAF?
[Edit]
After more research I'm wondering if I could enable header forwarding to the origin (running node/express) and use a rate-limiter package. Is this a viable solution?
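For reference, the idea of limiting at the origin can be sketched as a per-IP sliding window. This is Python rather than node/express and does not represent any particular rate-limiter package; the class name and the example address are hypothetical. It keeps state in memory in a single process, and behind CloudFront the client address would have to be taken from a forwarded header such as X-Forwarded-For rather than from the socket.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most `limit` requests per `window` seconds for each client IP."""

    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # the caller would answer with HTTP 429
        hits.append(now)
        return True


limiter = PerIpRateLimiter(limit=10, window=60.0)
# One request per second from the same (documentation-range) address.
results = [limiter.allow("203.0.113.7", now=float(t)) for t in range(12)]
print(results.count(True), results.count(False))
```

In this run the first 10 requests are allowed and the next 2 are rejected; the address is allowed again once the oldest request is 60 seconds old.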
I don't know if this is still useful to you - but I just got a tip from AWS support. If you add the rate limit rule multiple times, it effectively reduces the number of requests each time. Basically what happens is each time you add the rule, it counts an extra request for each IP. So say an IP makes a single request. If you have 2 rate limit rules applied, the request is counted twice. So basically, instead of 2000 requests, the IP only has to make 1000 before it gets blocked. If you add 3 rules, it will count each request 3 times - so the IP will be blocked at 667 requests.
The other thing they clarified is that the "window" is 5 minutes, but if the total is breached anywhere in that window, the IP will be blocked. I thought the WAF would only evaluate the requests after a 5 minute period. For example, say you have a single rule for 2000 requests in 5 minutes, and an IP makes 2000 requests in the 1st minute, then only 10 requests over the next 4 minutes. I initially understood that the IP would only be blocked after minute 5 (because WAF evaluates a 5 minute window). But apparently, if the IP exceeds the limit anywhere in that window, it will be blocked immediately. So if that IP makes 2000 requests in minute 1, it will actually be blocked during minutes 2, 3, 4 and 5, and then allowed again from minute 6 onward.
This clarified a lot for me. Having said that, I haven't tested this yet. I assume the AWS support techie knows what he's talking about - but definitely worth testing first.
AWS have now finally released an update which allows the rate limit to go as low as 100 requests every 5 minutes.
Announcement post: https://aws.amazon.com/about-aws/whats-new/2019/08/lower-threshold-for-aws-waf-rate-based-rules/
Using the rule twice will not work: WAF rate-based rules count requests on a CloudWatch logs basis, and each rule counts its 2,000 requests separately, so it would not work for you.
You can use the AWS WAF automation CloudFront template and choose the Lambda/Athena parser. This way, request counting is performed on the S3 logs, and you will also be able to block SQL injection, XSS, and bad bot requests.

Where and how to set up a function which is doing GET request every second?

I am trying to set up a function that will run somewhere on a server. It is a simple GET request and I want to trigger it every second.
I tried Google Cloud Functions and AWS. Neither of them has a straightforward way to run it every second (every 1 minute only).
Could you please suggest a service, or a combination of services, that would allow me to do this (preferably not costly)?
Here are some options on AWS ...
Launch a t2.nano EC2 instance to run a script that issues GET, then sleeps for 1 second, and repeats. You can't use cron (doesn't support every second). This costs about 13 cents per day.
If you are going to do this for months/years then reduce the cost by using Reserved Instances.
If you can tolerate periods where the GET requests don't happen then reduce the cost even further by using Spot instances.
That said, why do you need to issue a GET request every second? Perhaps there is a better solution here.
You can create an AWS Lambda function which simply loops and issues the GET request every second, and exits after 240 requests (i.e. 4 minutes). Then create a CloudWatch event that fires every 4 minutes and calls the Lambda function.
Every 4 minutes because the maximum timeout you can set for a Lambda function is 5 minutes.
This setup will likely incur only some trivial cost:
At 1 event per 4 minutes, it's $1/month for the CloudWatch events generated.
At 1 call per 4 minutes to a minimally configured (128MB) Lambda function, it's 324,000 GB-second worth of execution per month, just within the free tier of 400,000 GB-second.
Since network transfer into AWS is free, the response size of your GET request is irrelevant. And the first 1GB of transfer out to the Internet is free, which should cover all the GET requests themselves.
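A minimal Python sketch of the looping function described above follows. The target URL is a placeholder, the helper names are made up, and the 240 iterations and 1-second interval come from this answer. The same loop could also run as a plain script on an EC2 instance.

```python
import time
import urllib.request

TARGET_URL = "https://example.com/ping"  # hypothetical endpoint


def fetch(url):
    """Issue one GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status


def run_loop(url, iterations=240, interval=1.0, fetch=fetch, sleep=time.sleep):
    """Call `fetch(url)` roughly once per `interval` seconds, `iterations` times."""
    statuses = []
    for _ in range(iterations):
        started = time.monotonic()
        try:
            statuses.append(fetch(url))
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            statuses.append(repr(exc))
        # Sleep only for what is left of the interval after the request.
        sleep(max(0.0, interval - (time.monotonic() - started)))
    return statuses


def lambda_handler(event, context):
    """Entry point for the scheduled invocation (every 4 minutes in the answer)."""
    statuses = run_loop(TARGET_URL)
    return {"requests": len(statuses)}
```

Subtracting the request duration from the sleep keeps the loop close to one request per second; a request that takes longer than the interval simply delays the next one.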

Client Error Youtube API python

I have a Python program which queries YouTube to get video details. I use version 3 of the API. I run m Python processes, each with a pool of 10 worker processes.
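from multiprocessing import Pool
# getVideo and songs_list are defined elsewhere in the program
songs_pool = Pool(processes=10)  # one pool of 10 worker processes
return_pool = songs_pool.map(getVideo, songs_list)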
I get some client errors when the value of m is increased to more than 2 and the pool is increased to more than 5. The errors are "forbidden" errors. When I check the number of requests in the Google analytics, it shows 250 requests per second, but according to the documentation the limit is 3000 requests per second. I don't understand why I am getting the client errors. Can you tell me if there is a way to avoid these errors and run the program faster?
If m = 2 and the pool has 10 processes, I get no errors, but it takes a very long time to complete.
But if I increase them, I get client errors for about 5% of the total requests.
The per-user-limit is 3000 requests per second from a single IP address, and as soon as you go above that in a given second you'll start getting the forbidden errors. The analytics you see in the developers console will only report your average number of requests over a 5 minute period; therefore, if you had zero requests for 4 minutes, then started running your routine, the console may show only 250 requests per second (as an average) but your app likely is overrunning the limit in a given period of time or two.
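The gap between a 5-minute average and a per-second limit can be shown with a few lines of Python. The traffic pattern is invented purely for illustration: idle for 280 seconds, then a 20-second burst.

```python
WINDOW_SECONDS = 300    # the console averages over 5 minutes, per this answer
LIMIT_PER_SECOND = 3000

# Hypothetical traffic: idle for 280 s, then 3,750 requests/s for 20 s.
per_second = [0] * 280 + [3750] * 20

average = sum(per_second) / WINDOW_SECONDS
peak = max(per_second)
seconds_over_limit = sum(1 for n in per_second if n > LIMIT_PER_SECOND)

print(average, peak, seconds_over_limit)
```

In this made-up pattern the reported average is 250 requests per second, yet every one of the 20 busy seconds exceeds the 3000 per second limit.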
It seems that you're handling it in the best way possible if speed is your concern; you'll want to run it fast enough to get a very small number of errors (so you know you're staying up there at your limit). Another option, though, might be to look into using etags; if you find yourself requesting info on the same videos a lot, you can let etags tell you whether or not any info has changed (and if the API responds that nothing has changed, it doesn't count against either your quota or your requests/sec).