How to configure a CloudWatch alarm to evaluate once every X minutes - amazon-web-services

I would like to configure a CloudWatch alarm to:
sum the last 30 minutes of the ApplicationRequestsTotal metric once every 30 minutes
alarm if the sum is equal to 0
I have configured the custom CloudWatch ApplicationRequestsTotal metric to emit once every 60 seconds for my service.
I have configured the alarm as:
{
    "MetricAlarms": [
        {
            "AlarmName": "radio-silence-alarm",
            "AlarmDescription": "Alarm if 0 or less requests are received for 1 consecutive period(s) of 30 minutes.",
            "ActionsEnabled": true,
            "OKActions": [],
            "InsufficientDataActions": [],
            "MetricName": "ApplicationRequestsTotal",
            "Namespace": "AWS/ElasticBeanstalk",
            "Statistic": "Sum",
            "Dimensions": [
                {
                    "Name": "EnvironmentName",
                    "Value": "service-environment"
                }
            ],
            "Period": 1800,
            "EvaluationPeriods": 1,
            "Threshold": 0.0,
            "ComparisonOperator": "LessThanOrEqualToThreshold",
            "TreatMissingData": "missing"
        }
    ],
    "CompositeAlarms": []
}
I have set up many alarms like this, and each one seems to:
sum the last 30 minutes of the ApplicationRequestsTotal metric once EVERY minute
For example, this service started receiving 0 ApplicationRequestsTotal at 8:36 AM, and right at 9:06 AM CloudWatch triggered an alarm.
The aws cloudwatch describe-alarm-history for the above time period:
{
"AlarmName": "radio-silence-alarm",
"AlarmType": "MetricAlarm",
"Timestamp": "2021-09-29T09:06:37.929000+00:00",
"HistoryItemType": "StateUpdate",
"HistorySummary": "Alarm updated from OK to ALARM",
"HistoryData": "{
"version":"1.0",
"oldState":{
"stateValue":"OK",
"stateReason":"Threshold Crossed: 1 datapoint [42.0 (22/09/21 08:17:00)] was not less than or equal to the threshold (0.0).",
"stateReasonData":{
"version":"1.0",
"queryDate":"2021-09-22T08:47:37.930+0000",
"startDate":"2021-09-22T08:17:00.000+0000",
"statistic":"Sum",
"period":1800,
"recentDatapoints":[
42.0
],
"threshold":0.0,
"evaluatedDatapoints":[
{
"timestamp":"2021-09-22T08:17:00.000+0000",
"sampleCount":30.0,
"value":42.0
}
]
}
},
"newState":{
"stateValue":"ALARM",
"stateReason":"Threshold Crossed: 1 datapoint [0.0 (29/09/21 08:36:00)] was less than or equal to the threshold (0.0).",
"stateReasonData":{
"version":"1.0",
"queryDate":"2021-09-29T09:06:37.926+0000",
"startDate":"2021-09-29T08:36:00.000+0000",
"statistic":"Sum",
"period":1800,
"recentDatapoints":[
0.0
],
"threshold":0.0,
"evaluatedDatapoints":[
{
"timestamp":"2021-09-29T08:36:00.000+0000",
"sampleCount":30.0,
"value":0.0
}
]
}
}
}"
}
What have I configured incorrectly?

That is not how Amazon CloudWatch works.
When creating an Alarm in CloudWatch, you specify:
A metric (eg CPU Utilization, or perhaps a Custom Metric being sent to CloudWatch)
A time period (eg the previous 30 minutes)
An aggregation method (eg Average, Sum, Count)
For example, CloudWatch can trigger an Alarm if the Average of the metric exceeded a threshold over the previous 30 minutes. This is continually evaluated as a sliding window. It does not look at metrics in distinct 30-minute blocks.
Using your example, it would send an alert whenever the Sum of the metric is zero for the previous 30 minutes, on a continual basis.
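To make the sliding-window behaviour concrete, here is a minimal conceptual sketch in Python. It only illustrates the evaluation logic described above (the function and variable names are made up; the real evaluation happens inside CloudWatch):

# Conceptual sketch only: with Period=1800 and one metric datapoint per
# minute, the alarm effectively re-computes the trailing 30-minute Sum
# every time it evaluates, rather than waiting for fixed 30-minute blocks.
def radio_silence_state(per_minute_sums, threshold=0.0):
    # per_minute_sums: 1-minute Sum values for the metric, oldest first
    if len(per_minute_sums) < 30:
        return "INSUFFICIENT_DATA"
    trailing_sum = sum(per_minute_sums[-30:])  # sliding 30-minute window
    return "ALARM" if trailing_sum <= threshold else "OK"

For instance, radio_silence_state([1] * 10 + [0] * 30) returns "ALARM", which matches the 8:36-to-9:06 timeline in the question.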

I think your answer can be found directly in the documentation: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
Quoting the docs:
When you create an alarm, you specify three settings to enable CloudWatch to evaluate when to change the alarm state:
Period is the length of time to evaluate the metric or expression to create each individual data point for an alarm. It is expressed in seconds. If you choose one minute as the period, the alarm evaluates the metric once per minute.
Evaluation Periods is the number of the most recent periods, or data points, to evaluate when determining alarm state.
Datapoints to Alarm is the number of data points within the Evaluation Periods that must be breaching to cause the alarm to go to the ALARM state. The breaching data points don't have to be consecutive, but they must all be within the last number of data points equal to Evaluation Periods.
When you configure Evaluation Periods and Datapoints to Alarm as different values, you're setting an "M out of N" alarm. Datapoints to Alarm is ("M") and Evaluation Periods is ("N"). The evaluation interval is the number of data points multiplied by the period. For example, if you configure 4 out of 5 data points with a period of 1 minute, the evaluation interval is 5 minutes. If you configure 3 out of 3 data points with a period of 10 minutes, the evaluation interval is 30 minutes.
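Applied to the alarm in the question, those three settings correspond to a single 1800-second period evaluated over one evaluation period. As a sketch, the equivalent boto3 call would look roughly like this (DatapointsToAlarm is written out even though it defaults to the EvaluationPeriods value; alarm actions are omitted):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Sketch of the 30-minute "radio silence" alarm from the question.
# The single 1800-second datapoint is still evaluated on a sliding
# window, not in fixed 30-minute blocks.
cloudwatch.put_metric_alarm(
    AlarmName='radio-silence-alarm',
    Namespace='AWS/ElasticBeanstalk',
    MetricName='ApplicationRequestsTotal',
    Dimensions=[{'Name': 'EnvironmentName', 'Value': 'service-environment'}],
    Statistic='Sum',
    Period=1800,              # length of each evaluated datapoint, in seconds
    EvaluationPeriods=1,      # "N": number of most recent datapoints considered
    DatapointsToAlarm=1,      # "M": breaching datapoints needed to alarm
    Threshold=0.0,
    ComparisonOperator='LessThanOrEqualToThreshold',
    TreatMissingData='missing',
)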

Related

CloudWatch custom metric math

I have a custom metric in CloudWatch that has a value of 1 or 0. I need to create a pie widget in a dashboard that will represent the percentage of 1s and 0s I have in a selected time period. Is this possible only with metric math? If so, how? If not, how else?
Thanks.
If you're only publishing 0s and 1s, then the Average statistic will give you the percentage of 1s. You simply add your metric to the graph, set the Statistic to Average, and set the id to m1.
For a pie chart you need two metrics: the percentage of 1s and the percentage of 0s. Since these are the only two values, the percentage of 0s is 1 - percentage of 1s. You can get this by adding the metric math expression 1 - m1.
A pie chart will by default only display the value of the last datapoint. You need to change this on the Options tab of the edit-graph view: set Widget type to Pie and select Time range, so the value shown is calculated from the entire time range.
Example source of the graph would be:
{
    "metrics": [
        [ { "expression": "1-m1", "label": "zeros", "id": "e1", "region": "eu-west-1" } ],
        [ YOUR METRIC DEFINITION, { "id": "m1", "label": "ones" } ]
    ],
    "period": 60,
    "view": "pie",
    "stacked": false,
    "stat": "Average",
    "setPeriodToTimeRange": true,
    "sparkline": false,
    "region": "YOUR REGION"
}

How do the timings work for online_followers in insights on instagram graph api?

According to the documentation, the end_time is the cutoff point for when the data starts:
The end_time property indicates a data set's lookback cutoff date; data older than this value is not included in the data set's calculation.
When looking at online_followers in insights, the data looks like this:
{
    "value": {
        "0": 18634,
        "1": 18604,
        "2": 19849,
        "3": 21491,
        "4": 23519,
        "5": 25000,
        "6": 24772,
        "7": 25081,
        "8": 25408,
        "9": 25883,
        "10": 26216,
        "11": 26591,
        "12": 27182,
        "13": 27398,
        "14": 25384,
        "15": 19336,
        "16": 13968,
        "17": 11596,
        "18": 10770,
        "19": 10156,
        "20": 9967,
        "21": 11243,
        "22": 14837,
        "23": 18040
    },
    "end_time": "2021-07-01T07:00:00+0000"
}
Do the numbers refer to the hour of the day? Or do they refer to the number of hours that have passed since 07:00:00? If the latter, would this data be for 2021-07-01 and 2021-07-02?
According to the documentation:
"Metrics that support lifetime periods will have results returned in an array of 24 hour periods, with periods ending on UTC−07:00"
UTC−07:00 corresponds to midnight (the start of a new day) in US Pacific Daylight Time (PDT).
So, for anyone in that time zone, the number of hours that have passed since that UTC timestamp is equal to the hour of the day. In your example, the data is for every hour of the day 2021-07-01 in PDT.
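If it helps to see that interpretation in code, here is a small sketch (the values dict is a truncated sample from the question, and the mapping assumes the 24-hour window starts at end_time, per the quoted lookback-cutoff wording):

from datetime import datetime, timedelta, timezone

# Sketch: interpret each key as an hour of the PDT day that begins at
# end_time (2021-07-01T07:00:00+0000, i.e. midnight PDT).
values = {"0": 18634, "1": 18604, "23": 18040}  # truncated sample
end_time = datetime(2021, 7, 1, 7, 0, tzinfo=timezone.utc)
pdt = timezone(timedelta(hours=-7))

for hour, followers in values.items():
    local = (end_time + timedelta(hours=int(hour))).astimezone(pdt)
    print(f"{local:%Y-%m-%d %H}:00 PDT -> {followers} followers online")

Key "0" prints as 2021-07-01 00:00 PDT and key "23" as 2021-07-01 23:00 PDT, i.e. every hour of that single PDT day.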

In AWS Kinesis, what happens if we call GetShardIterator with a stale/expired SequenceNumber?

Normally, we call GetShardIterator with the SequenceNumber of the last read record (if our previous ShardIterator is expired).
It is assumed that the SequenceNumber belongs to a valid record within the retention period (24 hours by default).
But what if it is outside the Kinesis retention period (e.g. 25 hours ago)? Then that record/SequenceNumber would have been deleted from the stream.
Will GetShardIterator throw an exception? What kind of exception? Or will it return no records?
This was interesting enough to me that I tried it out.
TL;DR: it works as I expected: starting with a sequence number that's past the trim horizon is equivalent to starting from the trim horizon.
To test, yesterday morning I posted a record on a dedicated stream:
aws kinesis put-record --stream-name test-expiration --partition-key irrelevant --data "this is a test"
{
"ShardId": "shardId-000000000000",
"SequenceNumber": "49616057638370363251266361760650016619879524195517857794"
}
Then I waited almost 24 hours (good thing I didn't decide to sleep in this morning), and ran a utility that I wrote to verify that the record was still on the stream:
> kinesis_reader.py test-expiration TRIM_HORIZON 1
{"SequenceNumber": "49616057638370363251266361760650016619879524195517857794", "ApproximateArrivalTimestamp": "2021-03-04T11:33:13.254000+00:00", "Data": "this is a test", "PartitionKey": "irrelevant"}
Lastly, I took the code from that utility, put it into a Jupyter Notebook, and executed it after the record had been in the stream for more than 24 hours:
Retrieve the shard iterator:
import boto3

client = boto3.client('kinesis')
stream_name = "test-expiration"
shard_id = "shardId-000000000000"
sequence_number = "49616057638370363251266361760650016619879524195517857794"
resp = client.get_shard_iterator(StreamName=stream_name, ShardId=shard_id, ShardIteratorType='AT_SEQUENCE_NUMBER', StartingSequenceNumber=sequence_number)
shard_itx = resp['ShardIterator']
This returned an iterator (which I'll omit because it's a lot of opaque text). I was wondering whether it would throw, but there's no documented exception corresponding to a stale sequence number.
Use this iterator to retrieve records:
client.get_records(ShardIterator=shard_itx)
{'Records': [],
'NextShardIterator': 'AAAAAAAAAAE8Pi3/Ykdggje538B61BxObso1tCZAK4MJIGMc//IGiqJlNdUz2PgTGXhMAW3GLJIFSsaSmWW72Y2qBuwk8+WvKse0Al8DhjBNUmCdB5T/FbUa/67NeUjgSsktcke3ZiCs+rnHXFkAv08rR8egQsJCDmcHkELeEKTaa5pnlMB9kUDB+NT+yFCO7oFNaDdz4OUSH094IN0+Y/w6n5K+XTLsVvhPmM6pYdTv2xllzJJnTA==',
'MillisBehindLatest': 44741000,
'ResponseMetadata': {'RequestId': 'fd58bcf1-6596-0186-a5e4-a7359063274d',
'HTTPStatusCode': 200,
'HTTPHeaders': {'x-amzn-requestid': 'fd58bcf1-6596-0186-a5e4-a7359063274d',
'x-amz-id-2': 'jK9tGfx5eSyi5ysHhnANVn0IvJrwWwYzbxRGTRyFnk1OgjfQ+D2KtzqfF3FXVg5wwBH0m/QBoXdwJ+cEQSeBCktkKgFWOUx5',
'date': 'Fri, 05 Mar 2021 11:44:04 GMT',
'content-type': 'application/x-amz-json-1.1',
'content-length': '315'},
'RetryAttempts': 0}}
As you can see, there are no records in the response.
Surprisingly, it only indicates that I'm 44741000 milliseconds behind the latest record, which I added this morning. I would have expected something closer to 86400000 millis (one day).
As a final experiment, I wrote a loop that would count how many times I had to read the stream to find a record that I put on the stream this morning (which was, by now, a half hour old):
count = 0
while True:
    count += 1
    resp = client.get_records(ShardIterator=shard_itx)
    print(f"{count}: {resp['MillisBehindLatest']} millis behind latest")
    if resp['Records']:
        print(resp)
        break
    shard_itx = resp['NextShardIterator']
The answer: 99 reads, with the shard iterator advancing approximately 500 seconds each time.
I'm going to keep this stream around for a while: I want to see if Kinesis will update its internal pointers so that subsequent requests return a shard iterator that's closer to the present time.
Update
I ran through this code again, approximately an hour later than the first try. When I retrieved records using the iterator, it incorrectly told me that I was 0 milliseconds behind latest. A subsequent retrieve (using the iterator from the first) reported 49915000.
Moral: don't rely on MillisBehindLatest unless you've been actively processing records.

How to show the percentage of uptime of an AWS service on the dashboard of CloudWatch?

I want to build a dashboard that displays the percentage of the uptime for each month of an Elastic Beanstalk service in my company.
So I used the boto3 get_metric_data API to retrieve the EnvironmentHealth CloudWatch metric data and calculate the percentage of time my service was not severe.
from datetime import datetime
import boto3

SEVERE = 25

client = boto3.client('cloudwatch')

metric_data_queries = [
    {
        'Id': 'healthStatus',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ElasticBeanstalk',
                'MetricName': 'EnvironmentHealth',
                'Dimensions': [
                    {
                        'Name': 'EnvironmentName',
                        'Value': 'ServiceA'
                    }
                ]
            },
            'Period': 300,
            'Stat': 'Maximum'
        },
        'Label': 'EnvironmentHealth',
        'ReturnData': True
    }
]

response = client.get_metric_data(
    MetricDataQueries=metric_data_queries,
    StartTime=datetime(2019, 9, 1),
    EndTime=datetime(2019, 9, 30),
    ScanBy='TimestampAscending'
)

health_data = response['MetricDataResults'][0]['Values']
total_times = len(health_data)
severe_times = health_data.count(SEVERE)

print(f'total_times: {total_times}')
print(f'severe_times: {severe_times}')
print(f'healthy percent: {1 - (severe_times/total_times)}')
Now I'm wondering how to show the percentage on a CloudWatch dashboard. I want to show something like the following:
Does anyone know how to upload the healthy percentage I've calculated to a CloudWatch dashboard?
Or is there any other tool that is more appropriate for displaying the uptime of my service?
You can do math with CloudWatch metrics:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html
You can create a metric math expression with the metrics you have in metric_data_queries and get the result on the graph. Metric math also works with the GetMetricData API, so you could move the calculation you do into MetricDataQuery and get the number you need directly from CloudWatch.
It looks like you need a number saying what percentage of datapoints in the last month had a value other than 25 (i.e. were not severe).
You can calculate it like this (this is the graph source; you can use it on the Source tab in the CloudWatch console; make sure the region matches your region and the metric name matches your metric):
{
    "metrics": [
        [
            "AWS/ElasticBeanstalk",
            "EnvironmentHealth",
            "EnvironmentName",
            "ServiceA",
            {
                "label": "metric",
                "id": "m1",
                "visible": false,
                "stat": "Maximum"
            }
        ],
        [
            {
                "expression": "25",
                "label": "Value for severe",
                "id": "severe_c",
                "visible": false
            }
        ],
        [
            {
                "expression": "m1*0",
                "label": "Constant 0 time series",
                "id": "zero_ts",
                "visible": false
            }
        ],
        [
            {
                "expression": "1-AVG(CEIL(ABS(m1-severe_c)/MAX(m1)))",
                "label": "Percentage of times value equals severe",
                "id": "severe_pct",
                "visible": false
            }
        ],
        [
            {
                "expression": "(zero_ts+(1-severe_pct))*100",
                "label": "Service Uptime",
                "id": "e1"
            }
        ]
    ],
    "view": "singleValue",
    "stacked": false,
    "region": "eu-west-1",
    "period": 300
}
To explain what is going on there (what is the purpose of each element above, by id):
m1 - This is your original metric. Setting stat to Maximum.
severe_c - Constant you want to use for your SEVERE value.
zero_ts - Creating a constant time series with all values equal zero. This is needed because constants can't be graphed and the final value will be constant. So to graph it, we'll just add the constant to this time series of zeros.
severe_pct - this is where you actually calculate the fraction of values that equal SEVERE.
m1-severe_c - sets the datapoints with a value equal to SEVERE to 0.
ABS(m1-severe_c) - makes all values positive, keeps SEVERE datapoints at 0.
ABS(m1-severe_c)/MAX(m1) - dividing by the maximum value ensures that all values are now between 0 and 1.
CEIL(ABS(m1-severe_c)/MAX(m1)) - snaps all values that are different from 0 to 1, keeps SEVERE at 0.
AVG(CEIL(ABS(m1-severe_c)/MAX(m1))) - because the metric is now all 1s and 0s, with 0 meaning SEVERE, taking the average gives you the fraction of non-severe datapoints.
1-AVG(CEIL(ABS(m1-severe_c)/MAX(m1))) - subtracting from 1 flips this around, giving the fraction of severe datapoints.
e1 - severe_pct is a constant between 0 and 1 representing the severe (down) fraction, and you need the uptime as a value between 0 and 100. This expression gives you that: (zero_ts+(1-severe_pct))*100. Note that this is the only result being returned; all the other expressions have "visible": false.
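Since metric math also works with the GetMetricData API (as mentioned above), you could let CloudWatch return the uptime percentage directly instead of counting datapoints in Python. This is only a sketch: the ids uptime_fraction and e1 are illustrative, the SEVERE constant 25 is inlined into the expression, and the namespace/dimension values are the ones from the question:

import boto3
from datetime import datetime

client = boto3.client('cloudwatch')

queries = [
    {
        'Id': 'm1',
        'MetricStat': {
            'Metric': {
                'Namespace': 'AWS/ElasticBeanstalk',
                'MetricName': 'EnvironmentHealth',
                'Dimensions': [{'Name': 'EnvironmentName', 'Value': 'ServiceA'}],
            },
            'Period': 300,
            'Stat': 'Maximum',
        },
        'ReturnData': False,
    },
    {
        # Fraction of datapoints that are not severe (between 0 and 1).
        'Id': 'uptime_fraction',
        'Expression': 'AVG(CEIL(ABS(m1-25)/MAX(m1)))',
        'ReturnData': False,
    },
    {
        # Add the all-zero time series so the constant can be returned,
        # then scale it to a percentage.
        'Id': 'e1',
        'Expression': '(m1*0+uptime_fraction)*100',
        'Label': 'Service Uptime',
        'ReturnData': True,
    },
]

response = client.get_metric_data(
    MetricDataQueries=queries,
    StartTime=datetime(2019, 9, 1),
    EndTime=datetime(2019, 9, 30),
)
print(response['MetricDataResults'][0]['Values'])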

Sampling rate for data returned with boto3 get_metric_statistics()

The documentation is here...
http://boto3.readthedocs.io/en/latest/reference/services/cloudwatch.html#CloudWatch.Client.get_metric_statistics
Here is our call
import datetime
import boto3

cloudwatch = boto3.client('cloudwatch')
now = datetime.datetime.utcnow()

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',  # reported every 5 minutes
    Dimensions=[
        {
            'Name': 'AutoScalingGroupName',
            'Value': 'Celery-AutoScalingGroup'
        },
    ],
    StartTime=now - datetime.timedelta(minutes=12),
    EndTime=now,
    Period=60,  # I can't figure out what exactly changing this is doing
    Statistics=['Average', 'SampleCount', 'Sum', 'Minimum', 'Maximum'],
)
Here is our response
>>> response['Datapoints']
[ {u'SampleCount': 5.0, u'Timestamp': datetime.datetime(2017, 8, 25, 12, 46, tzinfo=tzutc()), u'Average': 0.05, u'Maximum': 0.17, u'Minimum': 0.0, u'Sum': 0.25, u'Unit': 'Percent'},
{u'SampleCount': 5.0, u'Timestamp': datetime.datetime(2017, 8, 25, 12, 51, tzinfo=tzutc()), u'Average': 0.034, u'Maximum': 0.08, u'Minimum': 0.0, u'Sum': 0.17, u'Unit': 'Percent'}
]
Here is my question
Look at the first dictionary in the returned list. A SampleCount of 5 makes sense, I guess, because our Period is 60 (seconds) and CloudWatch supplies the 'CPUUtilization' metric every 5 minutes.
But if I change Period to, say, 3 minutes (180), I am still getting a SampleCount of 5 (I'd expect 1 or 2).
This is a problem because I want the Average, but I think it is averaging 5 datapoints, only 2 of which are valid (the beginning and end, which correspond to the Min and Max, that is, the CloudWatch metric at some time t and the next reporting of that metric at time t+5 minutes).
It is averaging this with 3 intermediate 0-value datapoints so that the Average is (Minimum+Maximum+0+0+0)/5.
I can just get the Minimum and Maximum, add them, and divide by 2 for a better reading - but I was hoping somebody could explain just exactly what that 'Period' parameter is doing.
Like I said, changing it to 360 didn't change SampleCount, but when I changed it to 600, suddenly my SampleCount was 10.0 for one datapoint (that does make sense).
Data can be published to CloudWatch in two different ways:
You can publish your observations one by one and let CloudWatch do the aggregation.
You can aggregate the data yourself and publish the statistic set (SampleCount, Sum, Minimum, Maximum).
If data is published using method 1, you would get the behaviour you were expecting. But if the data is published using method 2, you are limited by the granularity of the published data.
If EC2 is aggregating the data over 5 minutes and then publishing a statistic set, there is no point in requesting data at the 3-minute level. However, if you request data with a period that is a multiple of the period the data was published with (e.g. 10 minutes), the stats can be calculated, which is what CloudWatch does.
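For completeness, "method 2" looks roughly like this with put_metric_data. This is a sketch using a hypothetical custom metric (EC2's own publishing is internal to AWS); it just shows what a pre-aggregated statistic set is:

import boto3

cloudwatch = boto3.client('cloudwatch')

# Sketch: publish a pre-aggregated statistic set for a hypothetical custom
# metric. CloudWatch then only knows these four statistics for the whole
# aggregation window, so it cannot recompute them at a finer Period than
# the one the data was aggregated over.
cloudwatch.put_metric_data(
    Namespace='MyApp',                   # hypothetical namespace
    MetricData=[{
        'MetricName': 'RequestLatency',  # hypothetical metric
        'StatisticValues': {
            'SampleCount': 5,
            'Sum': 0.25,
            'Minimum': 0.0,
            'Maximum': 0.17,
        },
        'Unit': 'Seconds',
    }],
)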