AWS CloudWatch Alarm triggering wrongly - amazon-web-services

I would like to understand how alarms are triggered.
The case is as follows:
I have set up a custom metric for a log group. These are the logs of an R script running on AWS ECS, and the custom metric is set to 1 whenever the keyword "done" is found in the logs; otherwise it is set to 0.
The alarm was set up to evaluate 1 data point and to go into ALARM whenever it sees a 0.
The issue is that I have been receiving the alarm seemingly at random, even though the logs indicate the script has been running successfully (all of them contain the keyword "done"). Then, after a while and before the next evaluation period elapses, the state changes back to OK.
Here is the alarm, as returned by aws cloudwatch describe-alarms:
{
"MetricAlarms": [
{
"AlarmName": "XXXXXXXXX",
"AlarmArn": "XXXXXXXXX",
"AlarmConfigurationUpdatedTimestamp": "2020-07-27T07:08:34.498000+00:00",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"XXXXXXXXX:slack-monitoring"
],
"InsufficientDataActions": [],
"StateValue": "OK",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (26/07/20 07:10:00)] was not less than or equal to the threshold (0.0) (minimum 1 datapoint for ALARM -> OK transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-07-27T07:10:51.284+0000\",\"startDate\":\"2020-07-26T07:10:00.000+0000\",\"statistic\":\"Maximum\",\"period\":86400,\"recentDatapoints\":[1.0],\"threshold\":0.0}",
"StateUpdatedTimestamp": "2020-07-27T07:10:51.285000+00:00",
"MetricName": "XXXXXXXXX",
"Namespace": "XXXXXXXXX",
"Statistic": "Maximum",
"Dimensions": [],
"Period": 86400,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 0.0,
"ComparisonOperator": "LessThanOrEqualToThreshold",
"TreatMissingData": "missing"
}
],
"CompositeAlarms": []
}
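For context, a log-based metric of the kind described above is usually produced by a metric filter on the log group. A minimal sketch with the AWS CLI, assuming hypothetical names for the log group, filter, metric, and namespace (none of these are shown in the question); only the "done"/0-or-1 behaviour comes from the description above:
# All names below are placeholders.
aws logs put-metric-filter \
  --log-group-name /ecs/r-script-logs \
  --filter-name done-keyword \
  --filter-pattern '"done"' \
  --metric-transformations \
    metricName=ScriptDone,metricNamespace=Custom/RScript,metricValue=1,defaultValue=0
# metricValue=1 is published when a log event contains "done";
# defaultValue=0 is published when ingested log events do not match the pattern.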

Related

AWS ECS - autoscaling using CloudWatch alarm does not work as expected

I have configured a CloudWatch alarm in order to trigger auto scaling for an ECS service.
The alarm is configured to trigger auto scaling when there's a certain number of scheduled activities in a step function.
But when testing with 1000+ scheduled activities in the step function, the alarm is raised, but the number of scheduled activities shown in the CloudWatch metric is much lower than the number of scheduled activities I see in the step function itself.
As a result, either no scale-up occurs, or far fewer machines are started than needed.
This is the alarm configuration:
{
"Type": "AWS::CloudWatch::Alarm",
"Properties": {
"AlarmName": "thumbnails-generator-scaling-alarm",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"arn:aws:autoscaling:us-west-2:111111111111:scalingPolicy:8694a867-85ee-4740-ba70-7b439c3e5fb3:resource/ecs/service/prod/thumbnails-generator:policyName/thumbnails-generator-scaling-policy"
],
"InsufficientDataActions": [],
"MetricName": "ActivitiesScheduled",
"Namespace": "AWS/States",
"Statistic": "SampleCount",
"Dimensions": [
{
"Name": "ActivityArn",
"Value": "arn:aws:states:us-west-2:111111111111:activity:thumbnails-generator-activity-prod"
}
],
"Period": 300,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 0,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"TreatMissingData": "missing"
}
}
This is the auto scale configuration:
Please advise what I can do to make the auto scaling work properly.
I guess you need to use the Sum statistic instead of SampleCount.
SampleCount is the number of data points reported during the period, not the sum of their values.
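In the template above that means changing the Statistic property from SampleCount to Sum. As a sketch of the same alarm expressed with the AWS CLI (all values taken from the configuration in the question), the change looks like:
aws cloudwatch put-metric-alarm \
  --alarm-name thumbnails-generator-scaling-alarm \
  --namespace AWS/States \
  --metric-name ActivitiesScheduled \
  --dimensions Name=ActivityArn,Value=arn:aws:states:us-west-2:111111111111:activity:thumbnails-generator-activity-prod \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --threshold 0 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data missing \
  --alarm-actions arn:aws:autoscaling:us-west-2:111111111111:scalingPolicy:8694a867-85ee-4740-ba70-7b439c3e5fb3:resource/ecs/service/prod/thumbnails-generator:policyName/thumbnails-generator-scaling-policy
# Identical to the alarm above except --statistic Sum, so the metric reflects
# the total number of activities scheduled rather than the number of samples.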

Linking to AWS Cloudwatch Logs from Alarm SNS

I have CloudWatch alarms sending SNS messages back with error information, and I'm using that along with the slackWebhook to send alarm messages to our Slack channel. I'd like to include a link to the relevant logs, but right now the only thing I see that might be useful is the alarm ARN. Can I use this somehow, or is there a way to look up the AWS error logs for that ARN and link to them?
Here's the JSON from the SNS message:
{
"AlarmName": "EmailErrorsFF58B22B-HFUJGANB6BDD",
"AlarmDescription": "Some Description",
"AWSAccountId": "<REMOVED>",
"AlarmConfigurationUpdatedTimestamp": "2022-03-24T12:20:22.195+0000",
"NewStateValue": "ALARM",
"NewStateReason": "Threshold Crossed: 1 datapoint [1.0 (25/03/22 15:39:00)] was greater than the threshold (0.0).",
"StateChangeTime": "2022-03-25T15:44:45.495+0000",
"Region": "US East (N. Virginia)",
"AlarmArn": "arn:aws:cloudwatch:<REMOVED>",
"OldStateValue": "OK",
"OKActions": [],
"AlarmActions": [
"arn:aws:sns:<REMOVED>"
],
"InsufficientDataActions": [],
"Trigger": {
"MetricName": "Errors",
"Namespace": "AWS/Lambda",
"StatisticType": "Statistic",
"Statistic": "SUM",
"Unit": null,
"Dimensions": [
{
"value": "Email-production",
"name": "FunctionName"
}
],
"Period": 300,
"EvaluationPeriods": 1,
"ComparisonOperator": "GreaterThanThreshold",
"Threshold": 0,
"TreatMissingData": "",
"EvaluateLowSampleCountPercentile": ""
}
}
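One possible approach, assuming (as the payload above suggests) that the alarm watches a Lambda function: the Trigger.Dimensions entry carries the FunctionName, and a Lambda function's logs go to the log group /aws/lambda/<FunctionName>, so the notification handler can derive a console link from the SNS message. A rough sketch with jq; the region code and the console deep-link format are assumptions to verify:
# message.json contains the SNS Message body shown above
FN=$(jq -r '.Trigger.Dimensions[] | select(.name == "FunctionName") | .value' message.json)
LOG_GROUP="/aws/lambda/${FN}"
# The Region field in the payload is the long name ("US East (N. Virginia)"),
# so the region code has to be supplied or mapped separately.
REGION=us-east-1
# The console fragment appears to expect "/" encoded as "$252F"; verify in a browser.
ENCODED=$(printf '%s' "$LOG_GROUP" | sed 's#/#$252F#g')
echo "https://${REGION}.console.aws.amazon.com/cloudwatch/home?region=${REGION}#logsV2:log-groups/log-group/${ENCODED}"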

CloudWatch alarm in ALARM state despite no data

I have a CloudWatch alarm based on the DynamoDB metric WriteThrottleEvents. There was a throttle data point in September which caused the alarm to enter the ALARM state, but there have been no other throttle data points since then, yet the alarm is still in the ALARM state. The alarm previously had 'Treat Missing Data' set to ignore (which initially explained why it stayed in ALARM), but I have now changed it to missing and the alarm is still in the ALARM state despite there being no data points. Why has it not changed state to INSUFFICIENT_DATA?
{
"MetricAlarms": [
{
"AlarmName": "WriteThrottleEvents_Alarm",
"AlarmArn": "******************",
"AlarmConfigurationUpdatedTimestamp": "2021-02-25T20:07:44.960000+00:00",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": ["******************"],
"InsufficientDataActions": [],
"StateValue": "ALARM",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (22/09/20 18:21:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-09-22T18:22:44.912+0000\",\"startDate\":\"2020-09-22T18:21:00.000+0000\",\"unit\":\"Count\",\"statistic\":\"Average\",\"period\":60,\"recentDatapoints\":[1.0],\"threshold\":1.0}",
"StateUpdatedTimestamp": "2020-09-22T18:22:44.915000+00:00",
"MetricName": "WriteThrottleEvents",
"Namespace": "AWS/DynamoDB",
"Statistic": "Average",
"Dimensions": [
{
"Name": "TableName",
"Value": "table-one"
}
],
"Period": 60,
"Unit": "Count",
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 1.0,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"TreatMissingData": "missing"
}
],
"CompositeAlarms": []
}
You can set missing data to notBreaching. This way, a lack of data points will be treated as good. As the documentation puts it, missing data points are treated as "good" and within the threshold.
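Applied to the alarm above, that change could be made with the AWS CLI roughly as follows (put-metric-alarm replaces the whole alarm definition, so the existing settings from the describe-alarms output are repeated; the alarm action ARN is redacted in the question and therefore omitted here):
aws cloudwatch put-metric-alarm \
  --alarm-name WriteThrottleEvents_Alarm \
  --namespace AWS/DynamoDB \
  --metric-name WriteThrottleEvents \
  --dimensions Name=TableName,Value=table-one \
  --statistic Average \
  --unit Count \
  --period 60 \
  --evaluation-periods 1 \
  --datapoints-to-alarm 1 \
  --threshold 1.0 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --treat-missing-data notBreaching
# With notBreaching, periods with no data are evaluated as not crossing the
# threshold, so the alarm returns to OK when no throttle events are reported.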

Cloudwatch Alarm doesn't leave alarm state and doesn't retrigger

I created a custom metric with the unit Count. The requirement is to check every 24 hours whether the sum of the metric is >= 1. If so, a message should be sent to an SNS topic, which triggers a Lambda that sends a message to a Slack channel.
Metric behaviour: currently the custom metric is always higher than one. I create a data point every 10 seconds.
Alarm behaviour: the alarm instantly switches into the ALARM state and sends a message to the SNS topic, but it never leaves the ALARM state and does not trigger a new message to the SNS topic 24 hours later.
How should I configure my alarm to achieve this requirement?
Thanks in advance,
Patrick
Here is the aws cloudwatch describe-alarms result:
{
"MetricAlarms": [
{
"AlarmName": "iot-data-platform-stg-InvalidMessagesAlarm-1OS91W5YCQ8E9",
"AlarmArn": "arn:aws:cloudwatch:eu-west-1:xxxxxx:alarm:iot-data-platform-stg-InvalidMessagesAlarm-1OS91W5YCQ8E9",
"AlarmDescription": "Invalid Messages received",
"AlarmConfigurationUpdatedTimestamp": "2020-04-03T18:11:15.076Z",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"arn:aws:sns:eu-west-1:xxxxx:iot-data-platform-stg-InvalidMessagesTopic-FJQ0WUJY9TZC"
],
"InsufficientDataActions": [],
"StateValue": "ALARM",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [3.0 (30/03/20 11:49:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-03-31T11:49:03.417+0000\",\"startDate\":\"2020-03-30T11:49:00.000+0000\",\"statistic\":\"Sum\",\"period\":86400,\"recentDatapoints\":[3.0],\"threshold\":1.0}",
"StateUpdatedTimestamp": "2020-03-31T11:49:03.421Z",
"MetricName": "InvalidMessages",
"Namespace": "Message validation",
"Statistic": "Sum",
"Dimensions": [
{
"Name": "stream",
"Value": "raw events"
},
{
"Name": "stage",
"Value": "stg"
}
],
"Period": 86400,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 1.0,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"TreatMissingData": "notBreaching"
}
]
}
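For reference, the data point published every 10 seconds (as described above) could be sent with something along these lines, using the namespace and dimensions from the alarm definition; the actual publishing code is not shown in the question:
aws cloudwatch put-metric-data \
  --namespace "Message validation" \
  --metric-name InvalidMessages \
  --dimensions 'stream=raw events,stage=stg' \
  --unit Count \
  --value 1
# Note: put-metric-data takes --dimensions as Name=Value pairs,
# not the Name=...,Value=... form used by put-metric-alarm.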

Strange CloudWatch alarm behaviour

I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script and CloudWatch's Alarms to get notified whenever the script runs into problems.
The script puts a data point on a CloudWatch metric after every successful backup:
mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
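For reference, mon-put-data belongs to the legacy CloudWatch command-line tools; the equivalent call with the current AWS CLI is:
aws cloudwatch put-metric-data --namespace Backup --metric-name "$metric" --unit Count --value 1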
I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
In order to test this setup, after a day I stopped putting data into the metric (i.e., I commented out the mon-put-data command). Good: eventually the alarm went to the ALARM state and I got an email notification, as expected.
The problem is that, some time later, the alarm went back to the OK state, even though no new data had been added to the metric!
The two transitions (OK => ALARM, then ALARM => OK) have been logged, and I reproduce the logs below. Note that, although both show "period": 21600 (i.e., 6 hours), the second one shows a 12-hour span between startDate and queryDate; I see that this might explain the transition, but I cannot understand why CloudWatch would consider a 12-hour span to calculate a statistic with a 6-hour period.
What am I missing here? How do I configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?
Here is the first transition:
{
"Timestamp": "2013-03-06T15:12:01.069Z",
"HistoryItemType": "StateUpdate",
"AlarmName": "alarm-backup-svn",
"HistoryData": {
"version": "1.0",
"oldState": {
"stateValue": "OK",
"stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-05T21:12:44.081+0000",
"startDate": "2013-03-05T15:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
3
],
"threshold": 3
}
},
"newState": {
"stateValue": "ALARM",
"stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T15:12:01.052+0000",
"startDate": "2013-03-06T09:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
1
],
"threshold": 2
}
}
},
"HistorySummary": "Alarm updated from OK to ALARM"
}
The second one, which I simply cannot understand:
{
"Timestamp": "2013-03-06T17:46:01.063Z",
"HistoryItemType": "StateUpdate",
"AlarmName": "alarm-backup-svn",
"HistoryData": {
"version": "1.0",
"oldState": {
"stateValue": "ALARM",
"stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T15:12:01.052+0000",
"startDate": "2013-03-06T09:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
1
],
"threshold": 2
}
},
"newState": {
"stateValue": "OK",
"stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T17:46:01.041+0000",
"startDate": "2013-03-06T05:46:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
3
],
"threshold": 2
}
}
},
"HistorySummary": "Alarm updated from ALARM to OK"
}
This behaviour (your alarm not transitioning into the INSUFFICIENT_DATA state) occurs because CloudWatch accepts 'pre-timestamped' metric data points, so (for a 6-hour alarm) if no data exists in the current open 6-hour window, it will take data from the previous 6-hour window (hence the 12-hour span you see above).
To increase the 'fidelity' of your alarm, reduce the alarm period to 1 hour (3600 s) and increase the number of evaluation periods to however many failing periods you want to alarm on. That will ensure your alarm transitions into INSUFFICIENT_DATA as you expect.
How to configure the alarms to achieve what I want (i.e., get notified if backups are not being made)?
A possible architecture for your alarm would be to publish 1 if your job is successful and 0 if it failed. Then create an alarm with a threshold of < 1 over 3 consecutive 3600-second periods, meaning the alarm will go into ALARM if the job is failing (i.e. running, but failing). If you also set an INSUFFICIENT_DATA action on that alarm, you will also get notified if your job is not running at all.
Hope that makes sense.
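As a concrete sketch of that suggestion (the namespace, metric variable, and alarm name are taken from the question; the SNS topic ARN is a placeholder):
# In the backup script: publish 1 on success, 0 on failure
aws cloudwatch put-metric-data --namespace Backup --metric-name "$metric" --unit Count --value 1
# Alarm over 3 consecutive 1-hour periods, notifying on both ALARM and INSUFFICIENT_DATA
aws cloudwatch put-metric-alarm \
  --alarm-name alarm-backup-svn \
  --namespace Backup \
  --metric-name "$metric" \
  --statistic Sum \
  --period 3600 \
  --evaluation-periods 3 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:REGION:ACCOUNT_ID:backup-alerts \
  --insufficient-data-actions arn:aws:sns:REGION:ACCOUNT_ID:backup-alerts
# The ALARM action fires when the job runs but keeps failing (Sum < 1);
# the INSUFFICIENT_DATA action fires when no data is being published at all.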