I have an alerts system which is currently powered by Prometheus and I need to port it to CloudWatch.
Prometheus is aware of counter resets so I can, let's say, calculate the rate() in the last 24h seamlessly, without handling the counter resets myself.
Is CloudWatch aware of this too?
The RATE function is available in CloudWatch Metric Math, defined as:
Returns the rate of change of the metric per second. This is
calculated as the difference between the latest data point value and
the previous data point value, divided by the time difference in
seconds between the two values.
So you would need to modify the way you emit the metric so that the counter is not reset. A possible workaround is to increase the number of datapoints to alarm: for example, you can configure your alarms so they only transition to ALARM if 2 or more datapoints are less than or equal to (<=) 0; this way you avoid getting an alarm when the reset occurs.
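As a rough illustration of the idea above, here is a minimal boto3 sketch (the namespace, metric name and dimension are hypothetical placeholders, not the poster's setup) that queries a counter-style metric over the last 24 hours and clamps the negative sample that RATE() produces right after a counter reset:

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyApp/Counters",   # hypothetical namespace
                    "MetricName": "RequestCount",    # hypothetical counter metric
                    "Dimensions": [{"Name": "Service", "Value": "api"}],
                },
                "Period": 300,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        },
        {
            # RATE() goes negative in the period where the counter resets;
            # IF() replaces that sample with 0 instead of letting it skew the series.
            "Id": "rate_per_second",
            "Expression": "IF(RATE(m1) >= 0, RATE(m1), 0)",
            "Label": "requests per second",
        },
    ],
    StartTime=end - timedelta(hours=24),
    EndTime=end,
)

The same expression can also be used in a Metric Math alarm, which is another way to keep a single reset from tripping it.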
Related
I am trying to wrap my head around Threshold value in the context of creating an alerting policy for an external HTTP load balancer:
Resource type: https_lb_rule
Metric: https/request_count
The docs say the following:
Enter when the value of a metric violates the threshold by using the Threshold position and Threshold value fields. For example, if you set these values to Above threshold and 0.3, then any measurement higher than 0.3 violates the threshold.
Now what does the value 0.3 indicate? Is there any unit and how do I relate to it in the context of the https/request_count metric for https_lb_rule?
Threshold value reflects the minimum performance required to achieve the required operational effect. A threshold is an amount, level, or limit on a scale; when the threshold is reached, something else happens or changes. In other words, the threshold is the point or level at which the alert should trigger.
As per your concern, the Google Cloud HTTP/S Load Balancing Rule - Request count metric takes a numeric threshold value, meaning it counts the requests received by the load balancer.
If you give a threshold value of 10, the alert will trigger when the count of received requests goes above 10, and you will keep getting alerts frequently once the value is crossed. There is an option in the advanced options, the restart window (shown in the image below), that helps reduce the notifications. If the restart window value is 1 hour, then the alert policy will open an incident and send you a notification only when the count stays above the threshold (i.e. > 10) for 1 hour.
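As a hedged sketch only (the project ID, display names, threshold and the 1-hour duration below are placeholders, and this uses the google-cloud-monitoring Python client rather than the console flow described above), such a policy could look roughly like this:

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project-id"  # placeholder project

policy = monitoring_v3.AlertPolicy(
    display_name="LB request count above threshold",  # placeholder name
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="request_count > 10 for 1h",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # Metric and resource type taken from the question above.
                filter=(
                    'resource.type = "https_lb_rule" AND '
                    'metric.type = "loadbalancing.googleapis.com/https/request_count"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=10,
                # Roughly the "restart window" idea: the condition must hold for 1 hour.
                duration={"seconds": 3600},
                aggregations=[
                    monitoring_v3.Aggregation(
                        alignment_period={"seconds": 60},
                        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
                    )
                ],
            ),
        )
    ],
)
client.create_alert_policy(name=project_name, alert_policy=policy)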
We are running a video conferencing server in an EC2 instance.
Since this is a data out (egress) heavy app, we want to monitor the network data out closely (since we are charged heavily for that).
As seen in the screenshot above, in our test, using nmon (top right) or nload (left) in our EC2 server shows the network out as 138 Mbits/s in nload and 17263 KB/s in nmon which are very close (138/8 = 17.25).
But, when we check the network out (bytes) in AWS Cloudwatch (bottom right), the number shown is very high (~ 1 GB/s) (which makes more sense for the test we are running), and this is the number for which we are finally charged.
Why is there such a big difference between nmon/nload and AWS Cloudwatch?
Are we missing some understanding here? Are we not looking at the AWS Cloudwatch metrics correctly?
Thank you for your help!
Edit:
Adding the screenshot of a longer test which shows the average network out metric in AWS Cloudwatch to be flat around 1 GB for the test duration while nmon shows average network out of 15816 KB/s.
Just figured out the answer to this.
The following link talks about the periods of data capture in AWS:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html
Periods
A period is the length of time associated with a specific
Amazon CloudWatch statistic. Each statistic represents an aggregation
of the metrics data collected for a specified period of time. Periods
are defined in numbers of seconds, and valid values for period are 1,
5, 10, 30, or any multiple of 60. For example, to specify a period of
six minutes, use 360 as the period value. You can adjust how the data
is aggregated by varying the length of the period. A period can be as
short as one second or as long as one day (86,400 seconds). The
default value is 60 seconds.
Only custom metrics that you define with a storage resolution of 1
second support sub-minute periods. Even though the option to set a
period below 60 is always available in the console, you should select
a period that aligns to how the metric is stored. For more information
about metrics that support sub-minute periods, see High-resolution
metrics.
As seen in the link above, unless we publish a custom metric with a 1-second storage resolution, AWS does not capture sub-minute data, so the lowest resolution available is 1 minute.
So, in our case, the network out data within 60 seconds is aggregated and captured as a single data point.
Even if I change the statistic to Average and the period to 1 second, CloudWatch still shows one data point per minute.
Now, if I divide the 1.01 GB shown by AWS by 60, I get the per-second figure, which is roughly 16.8 MB/s and very close to the numbers shown by nmon and nload.
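A small boto3 sketch of that conversion (the instance ID is a placeholder): pull NetworkOut as 1-minute sums and divide each datapoint by the 60-second period to get the per-second throughput that nmon/nload report.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=60,               # 1-minute datapoints (detailed monitoring)
    Statistics=["Sum"],
    Unit="Bytes",
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    bytes_per_second = point["Sum"] / 60   # divide the per-period sum by the period length
    print(
        point["Timestamp"],
        f"{bytes_per_second / 1e6:.1f} MB/s",
        f"{bytes_per_second * 8 / 1e6:.1f} Mbit/s",
    )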
From the AWS docs:
NetworkOut: The number of bytes sent out by the instance on all network interfaces. This metric identifies the volume of outgoing network traffic from a single instance.
The number reported is the number of bytes sent during the period. If you are using basic (five-minute) monitoring, you can divide this number by 300 to find Bytes/second. If you have detailed (one-minute) monitoring, divide it by 60.
The NetworkOut graph in your case does not represent the current speed, it represents the number of bytes sent out by all network interfaces in the last 5 minutes. If my calculations are correct, we should get the following values:
1.01 GB ~= 1027 MB (reading from your graph)
To get the average speed for the last 5 minutes:
1027 MB / 300 = 3.42333 MB/s ~= 27.38 Mbits/s
It is still more than what you are expecting, although this is just an average for the last 5 minutes.
I have an alarm set up in AWS CloudWatch which generates a data point every hour. When its value is greater than or equal to 1, it goes to ALARM state. Following are the settings:
On 2nd Nov, it got into ALARM state and then back to OK state in 3 hours. I'm just trying to understand why it took 3 hours to get back to the OK state instead of 1 hour, given that the metric runs every hour.
Here are the logs which show that the metric transitioned from ALARM to OK state in 3 hours.
Following is the graph which shows the data point value every hour.
This is probably because the alarm is evaluated over a period longer than your 1 hour. Alarms look at an evaluation range, and in your case that evaluation range could span more than one of your hourly datapoints, so it takes longer for the state to change.
There is also a thread about this behavior with extra info on the AWS forum:
Unexplainable delay between Alarm data breach and Alarm state change
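As a hedged illustration (the question doesn't show the exact alarm configuration, so the names and numbers below are assumptions): a "1 out of 3" alarm on a 1-hour period goes into ALARM on a single breaching datapoint but only returns to OK once the whole 3-hour evaluation range is clear, which would produce exactly this kind of lag.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: Period=3600 with EvaluationPeriods=3 and DatapointsToAlarm=1
# gives an evaluation range of 3 hours. One breaching hourly datapoint triggers
# ALARM, but the alarm only transitions back to OK after all 3 most recent
# datapoints are below the threshold.
cloudwatch.put_metric_alarm(
    AlarmName="hourly-error-count",   # placeholder name
    Namespace="MyApp",                # placeholder namespace
    MetricName="ErrorCount",          # placeholder metric
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=3,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)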
We have an AWS Elasticsearch cluster setup. However, our Error rate alarm goes off at regular intervals. The way we are trying to calculate our error rate is:
((sum(4xx) + sum(5xx))/sum(ElasticsearchRequests)) * 100
However, if you look at the screenshot below, at 7:15 the 4xx count was 4, while the ElasticsearchRequests value was only 2. Based on the metrics info on the AWS Elasticsearch documentation page, ElasticsearchRequests should be the total number of requests, so it should clearly be greater than or equal to the 4xx count.
Can someone please help me understand what I am doing wrong here?
AWS definitions of these metrics are:
OpenSearchRequests (previously ElasticsearchRequests): The number of requests made to the OpenSearch cluster. Relevant statistics: Sum
2xx, 3xx, 4xx, 5xx: The number of requests to the domain that resulted in the given HTTP response code (2xx, 3xx, 4xx, 5xx). Relevant statistics: Sum
Please note the different terms used for the subjects of the metrics: cluster vs domain
To my understanding, OpenSearchRequests only considers requests that actually reach the underlying OpenSearch/Elasticsearch cluster, so some of the 4xx requests might never get there (e.g. 403 errors), hence the difference between the metrics.
Also, AWS only recommends comparing 5xx to OpenSearchRequests:
5xx alarms >= 10% of OpenSearchRequests: One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture.
I know this was posted a while back but I've additionally struggled with this issue and maybe I can add a few pointers.
First off, make sure your metrics are properly configured. For instance, some responses (4xx for example) take up to 5 minutes to register, while OpenSearchRequests are refreshed every minute. This makes for a very wonky graph that will definitely throw off your error rate.
In the picture above, I send a request that returns 400 every 5 seconds, and a request that returns 200 every 0.5 seconds. The period in this case is 1 minute, so on average the error rate should be around 10%. As you can see from the green line, the requests sent are summed up every minute, whereas the 4xx are summed up every 5 minutes and are 0 in between, which makes for an error-rate spike every 5 minutes (since the OpenSearch requests are not multiplied by 5).
In the next image, the period is set to 5 minutes. Notice how this time the error rate is around 10 percent.
When I look at your graph, I see metrics that look like they are based on different periods.
The second pointer I may add is to make sure to account for the case when no data is coming in. The behavior of the alarm may vary based on how you define the "treat missing data" parameter. In some cases, if no data comes in, your expression might keep the alarm in the ALARM state when in fact there is simply no new data arriving. Some metrics return no value when no requests are made, while others return 0. In the former case, you can use the FILL(metric, value) function to specify what to return when no value is reported. Also experiment with what happens to your error rate if you divide by zero.
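For example, here is a hedged sketch of an error-rate alarm that uses FILL() so missing 4xx/5xx datapoints count as 0 and a missing request count does not cause a division by zero (the domain name, account ID, period and thresholds are placeholders, not the setup from the question):

import boto3

cloudwatch = boto3.client("cloudwatch")
dimensions = [
    {"Name": "DomainName", "Value": "my-domain"},      # placeholder domain
    {"Name": "ClientId", "Value": "123456789012"},     # placeholder account ID
]

def stat(metric_id, metric_name):
    # Helper that builds one Sum metric query for the AWS/ES namespace.
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {"Namespace": "AWS/ES", "MetricName": metric_name, "Dimensions": dimensions},
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

cloudwatch.put_metric_alarm(
    AlarmName="opensearch-error-rate",   # placeholder name
    Metrics=[
        stat("m4xx", "4xx"),
        stat("m5xx", "5xx"),
        stat("mreq", "OpenSearchRequests"),
        {
            # FILL() substitutes a value when a metric has no datapoint for a period,
            # so the expression neither divides by zero nor spikes on missing data.
            "Id": "error_rate",
            "Expression": "100 * (FILL(m4xx, 0) + FILL(m5xx, 0)) / FILL(mreq, 1)",
            "Label": "error rate (%)",
            "ReturnData": True,
        },
    ],
    EvaluationPeriods=3,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)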
Hope this message helps clarify a bit.
I want to change Kinesis stream polling frequency of AWS Lambda function. I was going through this article:
https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
but, no luck.
The only information it conveys is that AWS Lambda then polls the stream periodically (once per second) for new records.
I was also looking for answers in threads, but no luck:
https://forums.aws.amazon.com/thread.jspa?threadID=229037
There is another option, though, which can be used if a specific frequency is required:
https://docs.aws.amazon.com/lambda/latest/dg/with-scheduled-events.html
So, my question is: can we decrease AWS Lambda's polling frequency to, let's say, 1-2 minutes? Or do we have to go with AWS Lambda with Scheduled Events?
As far as I know there is no way to decrease the polling frequency if you are using an event source mapping.
These are all the settings you can set (source: https://docs.aws.amazon.com/de_de/lambda/latest/dg/API_CreateEventSourceMapping.html):
{
    "BatchSize": number,
    "Enabled": boolean,
    "EventSourceArn": "string",
    "FunctionName": "string",
    "StartingPosition": "string",
    "StartingPositionTimestamp": number
}
So going with a scheduled event seems to be the only feasible option.
An alternative would be to let the Lambda function sleep before exiting so that it only polls again after the desired time. But of course this means you are paying for that idle time, so it is probably not what you want.
I haven't seen a way to decrease the polling frequency, but you can get the same effect as a decreased polling frequency by increasing the MaximumBatchingWindowInSeconds parameter.
Reference: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-property-function-kinesis.html#sam-function-kinesis-maximumbatchingwindowinseconds
Let's say you have new records arriving at an average of 1 record/s. Regardless of BatchSize, your Lambda might trigger every second, since it polls once per second. But if you increase BatchSize to, let's say, 60 and set MaximumBatchingWindowInSeconds to 60, then your Lambda is invoked on average only once every minute, as if you had changed the polling frequency to once per minute.
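A minimal sketch of that setup with boto3 (the stream ARN and function name are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# BatchSize=60 plus MaximumBatchingWindowInSeconds=60 makes the function fire
# roughly once per minute when records arrive at about 1 record/s, even though
# the underlying poll still happens every second.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",  # placeholder
    FunctionName="my-consumer-function",                                       # placeholder
    StartingPosition="LATEST",
    BatchSize=60,
    MaximumBatchingWindowInSeconds=60,
)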