CloudWatch alarm doesn't leave ALARM state and doesn't retrigger - amazon-web-services

I created a custom metric with the unit Count. The requirement is to check every 24 hours whether the sum of the metric is >= 1. If so, a message should be sent to an SNS topic, which triggers a Lambda function that posts a message to a Slack channel.
Metric behaviour: Currently the custom metric is always higher than one. I create a datapoint every 10 seconds.
Alarm behaviour: The alarm instantly switches into the ALARM state and sends a message to the SNS topic. But it never leaves the ALARM state, and it doesn't send a new message to the SNS topic 24 hours later.
How should I configure my alarm to meet this requirement?
Thanks in advance,
Patrick
Here is the aws cloudwatch describe-alarms result:
{
"MetricAlarms": [
{
"AlarmName": "iot-data-platform-stg-InvalidMessagesAlarm-1OS91W5YCQ8E9",
"AlarmArn": "arn:aws:cloudwatch:eu-west-1:xxxxxx:alarm:iot-data-platform-stg-InvalidMessagesAlarm-1OS91W5YCQ8E9",
"AlarmDescription": "Invalid Messages received",
"AlarmConfigurationUpdatedTimestamp": "2020-04-03T18:11:15.076Z",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"arn:aws:sns:eu-west-1:xxxxx:iot-data-platform-stg-InvalidMessagesTopic-FJQ0WUJY9TZC"
],
"InsufficientDataActions": [],
"StateValue": "ALARM",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [3.0 (30/03/20 11:49:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-03-31T11:49:03.417+0000\",\"startDate\":\"2020-03-30T11:49:00.000+0000\",\"statistic\":\"Sum\",\"period\":86400,\"recentDatapoints\":[3.0],\"threshold\":1.0}",
"StateUpdatedTimestamp": "2020-03-31T11:49:03.421Z",
"MetricName": "InvalidMessages",
"Namespace": "Message validation",
"Statistic": "Sum",
"Dimensions": [
{
"Name": "stream",
"Value": "raw events"
},
{
"Name": "stage",
"Value": "stg"
}
],
"Period": 86400,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 1.0,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"TreatMissingData": "notBreaching"
}
]
}
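The behaviour described above matches how CloudWatch alarm actions work: an SNS action fires when the alarm changes state, not each time the alarm is evaluated while it stays in the same state. The following is a deliberately simplified model of that rule, not CloudWatch's actual evaluation logic. The function name `notification_days` and the idea of one evaluation per day are assumptions made for illustration only.

```python
def notification_days(daily_sums, threshold=1.0):
    """Return the indexes of the days on which an ALARM notification is sent.

    Simplified model (an assumption, not the real evaluator): one evaluation
    per day, and an action fires only on a transition into ALARM.
    """
    state = "INSUFFICIENT_DATA"
    sent = []
    for day, total in enumerate(daily_sums):
        new_state = "ALARM" if total >= threshold else "OK"
        # The action fires on a state change, never on a repeated ALARM.
        if new_state == "ALARM" and state != "ALARM":
            sent.append(day)
        state = new_state
    return sent


# Metric stays above the threshold: a single notification on the first day.
always_breaching = notification_days([3, 5, 2])

# Metric drops below the threshold on the second day, then breaches again.
with_recovery = notification_days([3, 0, 2])
```

Under this model, a metric whose daily sum never drops below 1 produces exactly one notification, which is what the question observes.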

Related

Linking to AWS Cloudwatch Logs from Alarm SNS

I have CloudWatch alarms sending SNS messages back with error information, and I'm using that along with the slackWebhook to send alarm messages to our Slack channel. I'd like to be able to include a link to the relevant logs, but right now all I'm seeing that may be useful is the alarm Arn. Can I use this somehow, or is there a way to scrape the aws error logs for that Arn and link to that somehow?
Here's the JSON from the SNS message:
{
"AlarmName": "EmailErrorsFF58B22B-HFUJGANB6BDD",
"AlarmDescription": "Some Description",
"AWSAccountId": "<REMOVED>",
"AlarmConfigurationUpdatedTimestamp": "2022-03-24T12:20:22.195+0000",
"NewStateValue": "ALARM",
"NewStateReason": "Threshold Crossed: 1 datapoint [1.0 (25/03/22 15:39:00)] was greater than the threshold (0.0).",
"StateChangeTime": "2022-03-25T15:44:45.495+0000",
"Region": "US East (N. Virginia)",
"AlarmArn": "arn:aws:cloudwatch:<REMOVED>",
"OldStateValue": "OK",
"OKActions": [],
"AlarmActions": [
"arn:aws:sns:<REMOVED>"
],
"InsufficientDataActions": [],
"Trigger": {
"MetricName": "Errors",
"Namespace": "AWS/Lambda",
"StatisticType": "Statistic",
"Statistic": "SUM",
"Unit": null,
"Dimensions": [
{
"value": "Email-production",
"name": "FunctionName"
}
],
"Period": 300,
"EvaluationPeriods": 1,
"ComparisonOperator": "GreaterThanThreshold",
"Threshold": 0,
"TreatMissingData": "",
"EvaluateLowSampleCountPercentile": ""
}
}
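One possible approach, sketched below, is to derive the log group from the alarm's Trigger dimensions rather than from the alarm ARN. It assumes the function writes to the default Lambda log group `/aws/lambda/<FunctionName>`, that the region can be read from the fourth field of the unredacted AlarmArn, and that the console accepts the `#logsV2:log-groups/log-group/...` URL form with `$252F` standing in for `/`. The URL form is based on observed console behaviour and is an assumption, and `log_group_console_url` is a hypothetical helper name.

```python
import json
from urllib.parse import quote


def log_group_console_url(sns_message):
    """Build a CloudWatch Logs console link from an alarm SNS message (JSON string)."""
    alarm = json.loads(sns_message)
    # ARN layout: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    region = alarm["AlarmArn"].split(":")[3]
    dims = {d["name"]: d["value"] for d in alarm["Trigger"]["Dimensions"]}
    # Assumption: the function uses the default Lambda log group name.
    log_group = "/aws/lambda/" + dims["FunctionName"]
    # Assumed console encoding: URL-encode twice, then swap '%' for '$'.
    encoded = quote(quote(log_group, safe=""), safe="").replace("%", "$")
    return (
        f"https://{region}.console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#logsV2:log-groups/log-group/{encoded}"
    )


# Example message; the account id is a placeholder, not a real value.
example_message = json.dumps({
    "AlarmArn": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:EmailErrors",
    "Trigger": {
        "Namespace": "AWS/Lambda",
        "Dimensions": [{"value": "Email-production", "name": "FunctionName"}],
    },
})
url = log_group_console_url(example_message)
```

The resulting link points at the whole log group, not at the specific log events that caused the alarm.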

How to get resource tags from CloudWatch alarm's trigger in lambda

I'm trying to get the tags of the resource that breached a CloudWatch alarm, from within a Lambda function.
Say I have two CloudWatch alarms, one for CPU utilization and another for Lambda errors, and both publish to the same SNS topic.
A Lambda function is then triggered from that SNS topic. This function needs to know which resource triggered the CloudWatch alarm so that it can, I assume, call list_tags_for_resource() on the ARN of that resource.
However, the payload from CloudWatch doesn't include the ARN of the resource. Example:
{
"AlarmName": "LessThanThreshold-CPUUtilization",
"AlarmDescription": "Created from EC2 Console",
"AWSAccountId": "xx",
"AlarmConfigurationUpdatedTimestamp": "2022-03-01T03:29:21.832+0000",
"NewStateValue": "ALARM",
"NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints [0.161290322580642 (01/03/22 03:34:00)] was less than the threshold (0.99) (minimum 1 datapoint for OK -> ALARM transition).",
"StateChangeTime": "2022-03-01T03:35:27.613+0000",
"Region": "US East (N. Virginia)",
"AlarmArn": "arn:aws:cloudwatch:us-east-1:xx:alarm:LessThanThreshold-CPUUtilization",
"OldStateValue": "OK",
"Trigger": {
"MetricName": "CPUUtilization",
"Namespace": "AWS/EC2",
"StatisticType": "Statistic",
"Statistic": "AVERAGE",
"Unit": null,
"Dimensions": [
{
"value": "i-xx",
"name": "InstanceId"
}
],
"Period": 60,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"ComparisonOperator": "LessThanThreshold",
"Threshold": 0.99,
"TreatMissingData": "",
"EvaluateLowSampleCountPercentile": ""
}
}
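Although the payload carries no resource ARN, it does carry the namespace, the dimensions, the account id, and (inside AlarmArn) the region, which is enough to rebuild an ARN for some resource types. The sketch below covers only the two cases mentioned in the question. `ARN_TEMPLATES` and `resource_arns` are hypothetical names, and the mapping would need extending for other namespaces.

```python
from typing import List

# ARN formats for the two resource types in the question; extend as needed.
ARN_TEMPLATES = {
    ("AWS/EC2", "InstanceId"): "arn:aws:ec2:{region}:{account}:instance/{value}",
    ("AWS/Lambda", "FunctionName"): "arn:aws:lambda:{region}:{account}:function:{value}",
}


def resource_arns(alarm: dict) -> List[str]:
    """Rebuild resource ARNs from a parsed CloudWatch alarm SNS message."""
    # ARN layout: arn:aws:cloudwatch:<region>:<account>:alarm:<name>
    region = alarm["AlarmArn"].split(":")[3]
    account = alarm["AWSAccountId"]
    namespace = alarm["Trigger"]["Namespace"]
    arns = []
    for dim in alarm["Trigger"]["Dimensions"]:
        template = ARN_TEMPLATES.get((namespace, dim["name"]))
        if template is not None:  # skip dimensions without a known mapping
            arns.append(
                template.format(region=region, account=account, value=dim["value"])
            )
    return arns


# Values taken from the example payload above (placeholders included).
example_alarm = {
    "AWSAccountId": "xx",
    "AlarmArn": "arn:aws:cloudwatch:us-east-1:xx:alarm:LessThanThreshold-CPUUtilization",
    "Trigger": {
        "Namespace": "AWS/EC2",
        "Dimensions": [{"value": "i-xx", "name": "InstanceId"}],
    },
}
arns = resource_arns(example_alarm)
```

The rebuilt ARN can then be handed to whichever tagging call fits the service in question.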

How do I use Lambda to get EC2 information via CloudWatch?

The following flow triggers a Lambda function:
Monitor log files on EC2 with CloudWatch Logs
Detect the monitored strings with a metric filter
Execute Lambda from the alarm
I would like to know how to get the following information within Lambda:
Path of the log file being monitored
Instance name
Instance id
Alarm name
I am writing in Python and trying to get it using boto3.
You can achieve this in two ways:
Create an EventBridge (CloudWatch Events) rule with the event type "CloudWatch Alarm State Change".
Whenever your alarm changes state (for example, enters the ALARM state), it sends an event. Configure the target of this rule as a Lambda function or an SNS topic, whichever suits your need.
Sample event from this rule:
{
"version": "0",
"id": "c4c1c1c9-6542-e61b-6ef0-8c4d36933a92",
"detail-type": "CloudWatch Alarm State Change",
"source": "aws.cloudwatch",
"account": "123456789012",
"time": "2019-10-02T17:04:40Z",
"region": "us-east-1",
"resources": ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServerCpuTooHigh"],
"detail": {
"alarmName": "ServerCpuTooHigh",
"configuration": {
"description": "Goes into alarm when server CPU utilization is too high!",
"metrics": [{
"id": "30b6c6b2-a864-43a2-4877-c09a1afc3b87",
"metricStat": {
"metric": {
"dimensions": {
"InstanceId": "i-12345678901234567"
},
"name": "CPUUtilization",
"namespace": "AWS/EC2"
},
"period": 300,
"stat": "Average"
},
"returnData": true
}]
},
"previousState": {
"reason": "Threshold Crossed: 1 out of the last 1 datapoints [0.0666851903306472 (01/10/19 13:46:00)] was not greater than the threshold (50.0) (minimum 1 datapoint for ALARM -> OK transition).",
"reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-01T13:56:40.985+0000\",\"startDate\":\"2019-10-01T13:46:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[0.0666851903306472],\"threshold\":50.0}",
"timestamp": "2019-10-01T13:56:40.987+0000",
"value": "OK"
},
"state": {
"reason": "Threshold Crossed: 1 out of the last 1 datapoints [99.50160229693434 (02/10/19 16:59:00)] was greater than the threshold (50.0) (minimum 1 datapoint for OK -> ALARM transition).",
"reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-02T17:04:40.985+0000\",\"startDate\":\"2019-10-02T16:59:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[99.50160229693434],\"threshold\":50.0}",
"timestamp": "2019-10-02T17:04:40.989+0000",
"value": "ALARM"
}
}
}
Alternatively, add an SNS topic as an alarm action on your CloudWatch alarm; the alarm information is then delivered in the SNS message. If you want to process it further, you can subscribe a Lambda function to the SNS topic.
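For the EventBridge route, a handler along the following lines could pull the alarm name and any InstanceId dimension out of the sample event shown above. This is a minimal sketch: the return shape is an arbitrary choice, and an alarm built on a metric filter may carry no InstanceId dimension at all, in which case the list is empty. The log file path is not part of the sample event, so it is not extracted here.

```python
def lambda_handler(event, context):
    """Extract alarm name, new state, and instance ids from an alarm state change event."""
    detail = event["detail"]
    instance_ids = []
    for metric in detail["configuration"]["metrics"]:
        # Metric math entries have no "metricStat", so fall back to empty dicts.
        dims = metric.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dims:
            instance_ids.append(dims["InstanceId"])
    return {
        "alarm_name": detail["alarmName"],
        "state": detail["state"]["value"],
        "instance_ids": instance_ids,
    }


# Trimmed copy of the sample event, keeping only the fields the handler reads.
sample_event = {
    "detail": {
        "alarmName": "ServerCpuTooHigh",
        "configuration": {
            "metrics": [{
                "metricStat": {
                    "metric": {
                        "dimensions": {"InstanceId": "i-12345678901234567"},
                        "name": "CPUUtilization",
                        "namespace": "AWS/EC2",
                    },
                    "period": 300,
                    "stat": "Average",
                },
                "returnData": True,
            }]
        },
        "state": {"value": "ALARM"},
    }
}
result = lambda_handler(sample_event, None)
```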

CloudWatch alarm in ALARM state despite no data

I have a CloudWatch alarm based on the DynamoDB metric WriteThrottleEvents. There was a throttle datapoint in September which caused the alarm to enter the ALARM state, however there have been no other throttle datapoints since then, yet the alarm is still in ALARM state. The alarm previously had 'Treat Missing Data' set to ignore (which initially explained why it stayed in ALARM state), however I have now changed it to missing, yet the alarm is still in ALARM state, despite no datapoints. Why has it not changed state to 'INSUFFICIENT DATA'?
{
"MetricAlarms": [
{
"AlarmName": "WriteThrottleEvents_Alarm",
"AlarmArn": "******************",
"AlarmConfigurationUpdatedTimestamp": "2021-02-25T20:07:44.960000+00:00",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": ["******************"],
"InsufficientDataActions": [],
"StateValue": "ALARM",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (22/09/20 18:21:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-09-22T18:22:44.912+0000\",\"startDate\":\"2020-09-22T18:21:00.000+0000\",\"unit\":\"Count\",\"statistic\":\"Average\",\"period\":60,\"recentDatapoints\":[1.0],\"threshold\":1.0}",
"StateUpdatedTimestamp": "2020-09-22T18:22:44.915000+00:00",
"MetricName": "WriteThrottleEvents",
"Namespace": "AWS/DynamoDB",
"Statistic": "Average",
"Dimensions": [
{
"Name": "TableName",
"Value": "table-one"
}
],
"Period": 60,
"Unit": "Count",
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 1.0,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"TreatMissingData": "missing"
}
],
"CompositeAlarms": []
}
You can set missing data to notBreaching. This way, a lack of data points is treated as good:
Missing data points are treated as "good" and within the threshold.
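As a sketch of how that change could be applied programmatically: `put_metric_alarm` replaces the whole alarm configuration, so the existing settings need to be passed along with the new TreatMissingData value. The helper below only builds the keyword arguments from a describe-alarms record. `PUT_KEYS` and `with_missing_data_policy` are hypothetical names, and the key list covers the fields that appear in the output above plus a few common optional ones.

```python
# Fields that describe-alarms returns and put_metric_alarm also accepts.
# Read-only fields such as StateValue or AlarmArn are deliberately left out.
PUT_KEYS = (
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "ExtendedStatistic", "Dimensions", "Period", "Unit",
    "EvaluationPeriods", "DatapointsToAlarm", "Threshold",
    "ComparisonOperator", "EvaluateLowSampleCountPercentile",
)


def with_missing_data_policy(alarm, policy="notBreaching"):
    """Return put_metric_alarm keyword arguments with a new TreatMissingData value."""
    kwargs = {key: alarm[key] for key in PUT_KEYS if key in alarm}
    kwargs["TreatMissingData"] = policy
    return kwargs


# Abbreviated record based on the describe-alarms output above.
existing = {
    "AlarmName": "WriteThrottleEvents_Alarm",
    "StateValue": "ALARM",
    "MetricName": "WriteThrottleEvents",
    "Namespace": "AWS/DynamoDB",
    "Statistic": "Average",
    "Dimensions": [{"Name": "TableName", "Value": "table-one"}],
    "Period": 60,
    "EvaluationPeriods": 1,
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "TreatMissingData": "missing",
}
kwargs = with_missing_data_policy(existing)
```

With boto3 installed and credentials configured, the result could then be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`.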

AWS CloudWatch Alarm triggering wrongly

I would like to understand how alarms are triggered.
The case is as follows:
Currently, I have set up a custom metric for a log group. These are the logs of an R script running on AWS ECS, and the custom metric is set to 1 whenever it finds the keyword done in the logs; otherwise, it is set to 0.
The alarm was set up to check 1 datapoint, and it is triggered whenever it finds 0.
Now the issue is that I have been receiving the alarm seemingly at random, even though the logs indicate the script has been running successfully (all of them contain the keyword done). Then after a while, and before the configured period elapses again, the status changes back to OK.
{
"MetricAlarms": [
{
"AlarmName": "XXXXXXXXX",
"AlarmArn": "XXXXXXXXX",
"AlarmConfigurationUpdatedTimestamp": "2020-07-27T07:08:34.498000+00:00",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"XXXXXXXXX:slack-monitoring"
],
"InsufficientDataActions": [],
"StateValue": "OK",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (26/07/20 07:10:00)] was not less than or equal to the threshold (0.0) (minimum 1 datapoint for ALARM -> OK transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-07-27T07:10:51.284+0000\",\"startDate\":\"2020-07-26T07:10:00.000+0000\",\"statistic\":\"Maximum\",\"period\":86400,\"recentDatapoints\":[1.0],\"threshold\":0.0}",
"StateUpdatedTimestamp": "2020-07-27T07:10:51.285000+00:00",
"MetricName": "XXXXXXXXX",
"Namespace": "XXXXXXXXX",
"Statistic": "Maximum",
"Dimensions": [],
"Period": 86400,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 0.0,
"ComparisonOperator": "LessThanOrEqualToThreshold",
"TreatMissingData": "missing"
}
],
"CompositeAlarms": []
}
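When investigating an alarm like this, it can help to decode StateReasonData, which is a JSON string embedded in the describe-alarms output, and look at which datapoints the last state change was based on. The sketch below is a debugging aid only and does not diagnose the cause. `evaluated_window` is a hypothetical helper, and it assumes the timestamp format seen in the output above.

```python
import json
from datetime import datetime, timedelta


def evaluated_window(alarm):
    """Decode StateReasonData and report the span of the first recent datapoint."""
    data = json.loads(alarm["StateReasonData"])
    # Timestamps look like 2020-07-26T07:10:00.000+0000
    start = datetime.strptime(data["startDate"], "%Y-%m-%dT%H:%M:%S.%f%z")
    end = start + timedelta(seconds=data["period"])
    return {
        "start": start,
        "end": end,
        "statistic": data["statistic"],
        "datapoints": data["recentDatapoints"],
        "threshold": data["threshold"],
    }


# StateReasonData copied from the alarm above.
alarm = {
    "StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-07-27T07:10:51.284+0000\",\"startDate\":\"2020-07-26T07:10:00.000+0000\",\"statistic\":\"Maximum\",\"period\":86400,\"recentDatapoints\":[1.0],\"threshold\":0.0}"
}
window = evaluated_window(alarm)
```

Comparing this window with the times at which the script actually logged done shows whether a given evaluation had a matching log line inside it.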