I have configured a CloudWatch alarm in order to trigger auto scaling for an ECS service.
The alarm is configured to trigger auto scaling when there's a certain number of scheduled activities in a step function.
But when testing with 1000+ scheduled activities in the step function, the alarm is raised, yet the number of scheduled activities shown in the CloudWatch metric is much lower than the number of scheduled activities I see in the step function itself.
Therefore either no scale-up occurs, or far fewer machines are started than expected.
This is the alarm configuration:
{
"Type": "AWS::CloudWatch::Alarm",
"Properties": {
"AlarmName": "thumbnails-generator-scaling-alarm",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"arn:aws:autoscaling:us-west-2:111111111111:scalingPolicy:8694a867-85ee-4740-ba70-7b439c3e5fb3:resource/ecs/service/prod/thumbnails-generator:policyName/thumbnails-generator-scaling-policy"
],
"InsufficientDataActions": [],
"MetricName": "ActivitiesScheduled",
"Namespace": "AWS/States",
"Statistic": "SampleCount",
"Dimensions": [
{
"Name": "ActivityArn",
"Value": "arn:aws:states:us-west-2:111111111111:activity:thumbnails-generator-activity-prod"
}
],
"Period": 300,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 0,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"TreatMissingData": "missing"
}
}
This is the auto scale configuration:
Please advise what I can do to make the auto scaling work properly.
I guess you need to use the Sum statistic instead of SampleCount.
SampleCount is just the number of data points reported during the period, not their values, so it will be much lower than the number of scheduled activities; Sum adds up the reported values and gives the total activities scheduled in the period.
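For example, the alarm could be recreated with the Sum statistic. Here is a rough boto3 sketch using the values from your configuration above:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Same alarm as above, but with Statistic="Sum" so the alarm compares the
# total number of scheduled activities in the period, not the number of
# data points.
cloudwatch.put_metric_alarm(
    AlarmName="thumbnails-generator-scaling-alarm",
    ActionsEnabled=True,
    AlarmActions=[
        "arn:aws:autoscaling:us-west-2:111111111111:scalingPolicy:8694a867-85ee-4740-ba70-7b439c3e5fb3:resource/ecs/service/prod/thumbnails-generator:policyName/thumbnails-generator-scaling-policy"
    ],
    MetricName="ActivitiesScheduled",
    Namespace="AWS/States",
    Statistic="Sum",
    Dimensions=[{
        "Name": "ActivityArn",
        "Value": "arn:aws:states:us-west-2:111111111111:activity:thumbnails-generator-activity-prod",
    }],
    Period=300,
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="missing",
)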
Related
I'm trying to get resource tags from the resource that breached a CloudWatch alarm in a Lambda function
Say I have 2 CloudWatch alarms: one for CPU utilization, another for Lambda errors; both publish to the same SNS topic.
Then a Lambda function is triggered from that SNS topic. This Lambda function needs to know which resource triggered the CloudWatch alarm, and then, I assume, call list_tags_for_resource() on the ARN of said resource.
However, the payload from CloudWatch doesn't include the ARN of the resource. Example:
{
"AlarmName": "LessThanThreshold-CPUUtilization",
"AlarmDescription": "Created from EC2 Console",
"AWSAccountId": "xx",
"AlarmConfigurationUpdatedTimestamp": "2022-03-01T03:29:21.832+0000",
"NewStateValue": "ALARM",
"NewStateReason": "Threshold Crossed: 1 out of the last 1 datapoints [0.161290322580642 (01/03/22 03:34:00)] was less than the threshold (0.99) (minimum 1 datapoint for OK -> ALARM transition).",
"StateChangeTime": "2022-03-01T03:35:27.613+0000",
"Region": "US East (N. Virginia)",
"AlarmArn": "arn:aws:cloudwatch:us-east-1:xx:alarm:LessThanThreshold-CPUUtilization",
"OldStateValue": "OK",
"Trigger": {
"MetricName": "CPUUtilization",
"Namespace": "AWS/EC2",
"StatisticType": "Statistic",
"Statistic": "AVERAGE",
"Unit": null,
"Dimensions": [
{
"value": "i-xx",
"name": "InstanceId"
}
],
"Period": 60,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"ComparisonOperator": "LessThanThreshold",
"Threshold": 0.99,
"TreatMissingData": "",
"EvaluateLowSampleCountPercentile": ""
}
}
The following flow executes a Lambda function:
Monitor log files on EC2 with CloudWatch Logs
Detect monitored strings with a metric filter
Trigger the Lambda function from the alarm
I would like to know how to get the following information within Lambda.
Path of the log file being monitored
Instance name
Instance id
Alarm name
I am writing in Python and trying to get this using boto3.
You can achieve this in two ways:
Create an EventBridge rule with the event type "CloudWatch Alarm State Change".
Whenever your alarm enters the ALARM state it will emit an event; configure the rule's target as a Lambda function or an SNS topic, whichever suits your need.
Sample event from this rule:
{
"version": "0",
"id": "c4c1c1c9-6542-e61b-6ef0-8c4d36933a92",
"detail-type": "CloudWatch Alarm State Change",
"source": "aws.cloudwatch",
"account": "123456789012",
"time": "2019-10-02T17:04:40Z",
"region": "us-east-1",
"resources": ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServerCpuTooHigh"],
"detail": {
"alarmName": "ServerCpuTooHigh",
"configuration": {
"description": "Goes into alarm when server CPU utilization is too high!",
"metrics": [{
"id": "30b6c6b2-a864-43a2-4877-c09a1afc3b87",
"metricStat": {
"metric": {
"dimensions": {
"InstanceId": "i-12345678901234567"
},
"name": "CPUUtilization",
"namespace": "AWS/EC2"
},
"period": 300,
"stat": "Average"
},
"returnData": true
}]
},
"previousState": {
"reason": "Threshold Crossed: 1 out of the last 1 datapoints [0.0666851903306472 (01/10/19 13:46:00)] was not greater than the threshold (50.0) (minimum 1 datapoint for ALARM -> OK transition).",
"reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-01T13:56:40.985+0000\",\"startDate\":\"2019-10-01T13:46:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[0.0666851903306472],\"threshold\":50.0}",
"timestamp": "2019-10-01T13:56:40.987+0000",
"value": "OK"
},
"state": {
"reason": "Threshold Crossed: 1 out of the last 1 datapoints [99.50160229693434 (02/10/19 16:59:00)] was greater than the threshold (50.0) (minimum 1 datapoint for OK -> ALARM transition).",
"reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-02T17:04:40.985+0000\",\"startDate\":\"2019-10-02T16:59:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[99.50160229693434],\"threshold\":50.0}",
"timestamp": "2019-10-02T17:04:40.989+0000",
"value": "ALARM"
}
}
}
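For illustration, a minimal Python handler for this event could look like the sketch below (the field names are taken from the sample event above; error handling is omitted):

import json

def lambda_handler(event, context):
    # The rule delivers the alarm state change event directly to the function.
    alarm_name = event["detail"]["alarmName"]
    alarm_arn = event["resources"][0]
    # The metric dimensions identify the resource the alarm is watching.
    metric = event["detail"]["configuration"]["metrics"][0]["metricStat"]["metric"]
    instance_id = metric["dimensions"].get("InstanceId")
    print(json.dumps({"alarmName": alarm_name, "alarmArn": alarm_arn, "instanceId": instance_id}))
    return {"alarmName": alarm_name, "instanceId": instance_id}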
Alternatively, inside your CloudWatch alarm there is an alarm action where you can add an SNS topic; the notification will contain the event information, and if you want to process it further you can subscribe a Lambda function to the SNS topic.
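With the SNS route, the alarm notification arrives as a JSON string inside the SNS record, so a handler would look roughly like this (a sketch; the field names match the SNS payload shown earlier in this thread):

import json

def lambda_handler(event, context):
    # SNS wraps the CloudWatch alarm notification in Records[0].Sns.Message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    alarm_name = message["AlarmName"]
    dimensions = message["Trigger"]["Dimensions"]  # e.g. [{"value": "i-xx", "name": "InstanceId"}]
    instance_id = next((d["value"] for d in dimensions if d["name"] == "InstanceId"), None)
    return {"alarmName": alarm_name, "instanceId": instance_id}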
I would like to understand how the alarms are triggered.
The case is as follows:
Currently, I have set up a custom metric for a log group. These are the logs of an R script running on AWS ECS, and the custom metric is set to 1 whenever it finds the keyword done in the logs; otherwise it is set to 0.
The alarm was set up to check 1 data point and is triggered whenever it finds a 0.
Now the issue is that the alarm has been firing seemingly at random, as the logs indicate the script has been running successfully (all of them contain the keyword done). Then, after a while, and before the configured period elapses again, the status changes back to OK.
{
"MetricAlarms": [
{
"AlarmName": "XXXXXXXXX",
"AlarmArn": "XXXXXXXXX",
"AlarmConfigurationUpdatedTimestamp": "2020-07-27T07:08:34.498000+00:00",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"XXXXXXXXX:slack-monitoring"
],
"InsufficientDataActions": [],
"StateValue": "OK",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [1.0 (26/07/20 07:10:00)] was not less than or equal to the threshold (0.0) (minimum 1 datapoint for ALARM -> OK transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-07-27T07:10:51.284+0000\",\"startDate\":\"2020-07-26T07:10:00.000+0000\",\"statistic\":\"Maximum\",\"period\":86400,\"recentDatapoints\":[1.0],\"threshold\":0.0}",
"StateUpdatedTimestamp": "2020-07-27T07:10:51.285000+00:00",
"MetricName": "XXXXXXXXX",
"Namespace": "XXXXXXXXX",
"Statistic": "Maximum",
"Dimensions": [],
"Period": 86400,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 0.0,
"ComparisonOperator": "LessThanOrEqualToThreshold",
"TreatMissingData": "missing"
}
],
"CompositeAlarms": []
}
I created a custom metric with the unit Count. The requirement is to check every 24 hours whether the sum of the metric is >= 1. If so, a message should be sent to an SNS topic, which triggers a Lambda function that sends a message to a Slack channel.
Metric behaviour: currently the custom metric is always higher than one; I create a data point every 10 seconds.
Alarm behaviour: the alarm instantly switches into the ALARM state and sends a message to the SNS topic, but it never leaves the ALARM state and doesn't send a new message to the SNS topic 24 hours later.
How should I configure my alarm to achieve this requirement?
Thanks in advance,
Patrick
Here is the aws cloudwatch describe-alarms result:
{
"MetricAlarms": [
{
"AlarmName": "iot-data-platform-stg-InvalidMessagesAlarm-1OS91W5YCQ8E9",
"AlarmArn": "arn:aws:cloudwatch:eu-west-1:xxxxxx:alarm:iot-data-platform-stg-InvalidMessagesAlarm-1OS91W5YCQ8E9",
"AlarmDescription": "Invalid Messages received",
"AlarmConfigurationUpdatedTimestamp": "2020-04-03T18:11:15.076Z",
"ActionsEnabled": true,
"OKActions": [],
"AlarmActions": [
"arn:aws:sns:eu-west-1:xxxxx:iot-data-platform-stg-InvalidMessagesTopic-FJQ0WUJY9TZC"
],
"InsufficientDataActions": [],
"StateValue": "ALARM",
"StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [3.0 (30/03/20 11:49:00)] was greater than or equal to the threshold (1.0) (minimum 1 datapoint for OK -> ALARM transition).",
"StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2020-03-31T11:49:03.417+0000\",\"startDate\":\"2020-03-30T11:49:00.000+0000\",\"statistic\":\"Sum\",\"period\":86400,\"recentDatapoints\":[3.0],\"threshold\":1.0}",
"StateUpdatedTimestamp": "2020-03-31T11:49:03.421Z",
"MetricName": "InvalidMessages",
"Namespace": "Message validation",
"Statistic": "Sum",
"Dimensions": [
{
"Name": "stream",
"Value": "raw events"
},
{
"Name": "stage",
"Value": "stg"
}
],
"Period": 86400,
"EvaluationPeriods": 1,
"DatapointsToAlarm": 1,
"Threshold": 1.0,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"TreatMissingData": "notBreaching"
}
]
}
I have a backup script that runs every 2 hours. I want to use CloudWatch to track the successful executions of this script and CloudWatch's Alarms to get notified whenever the script runs into problems.
The script puts a data point on a CloudWatch metric after every successful backup:
mon-put-data --namespace Backup --metric-name $metric --unit Count --value 1
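For reference, the equivalent call with boto3 would look roughly like this (the metric name below is only a placeholder for whatever $metric expands to in the script):

import boto3

cloudwatch = boto3.client("cloudwatch")

# One data point per successful backup, equivalent to the mon-put-data call above.
cloudwatch.put_metric_data(
    Namespace="Backup",
    MetricData=[{"MetricName": "my-backup-metric", "Unit": "Count", "Value": 1}],  # placeholder metric name
)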
I have an alarm that goes to ALARM state whenever the statistic "Sum" on the metric is less than 2 in a 6-hour period.
To test this setup, after a day I stopped putting data in the metric (i.e., I commented out the mon-put-data command). Eventually the alarm went to the ALARM state and I got an email notification, as expected.
The problem is that, some time later, the alarm went back to the OK state, even though there's no new data being added to the metric!
The two transitions (OK => ALARM, then ALARM => OK) have been logged, and I reproduce the logs in this question. Note that, although both show "period: 21600" (i.e., 6 hours), the second one shows a 12-hour time span between startDate and queryDate; I see that this might explain the transition, but I cannot understand why CloudWatch is considering a 12-hour time span to calculate a statistic with a 6-hour period!
What am I missing here? How to configure the alarms to achieve what I want (ie, get notified if backups are not being made)?
{
"Timestamp": "2013-03-06T15:12:01.069Z",
"HistoryItemType": "StateUpdate",
"AlarmName": "alarm-backup-svn",
"HistoryData": {
"version": "1.0",
"oldState": {
"stateValue": "OK",
"stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (3.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-05T21:12:44.081+0000",
"startDate": "2013-03-05T15:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
3
],
"threshold": 3
}
},
"newState": {
"stateValue": "ALARM",
"stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T15:12:01.052+0000",
"startDate": "2013-03-06T09:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
1
],
"threshold": 2
}
}
},
"HistorySummary": "Alarm updated from OK to ALARM"
}
The second one, which I simply cannot understand:
{
"Timestamp": "2013-03-06T17:46:01.063Z",
"HistoryItemType": "StateUpdate",
"AlarmName": "alarm-backup-svn",
"HistoryData": {
"version": "1.0",
"oldState": {
"stateValue": "ALARM",
"stateReason": "Threshold Crossed: 1 datapoint (1.0) was less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T15:12:01.052+0000",
"startDate": "2013-03-06T09:12:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
1
],
"threshold": 2
}
},
"newState": {
"stateValue": "OK",
"stateReason": "Threshold Crossed: 1 datapoint (3.0) was not less than the threshold (2.0).",
"stateReasonData": {
"version": "1.0",
"queryDate": "2013-03-06T17:46:01.041+0000",
"startDate": "2013-03-06T05:46:00.000+0000",
"statistic": "Sum",
"period": 21600,
"recentDatapoints": [
3
],
"threshold": 2
}
}
},
"HistorySummary": "Alarm updated from ALARM to OK"
}
This behavior (that your alarm did not transition into the INSUFFICIENT_DATA state) happens because CloudWatch considers 'pre-timestamped' metric data points, so for a 6-hour alarm, if no data exists in the current open 6-hour window it will take data from the previous 6-hour window (hence the 12-hour time span you see above).
To increase the 'fidelity' of your alarm, reduce the alarm period to 1 hour (3600 s) and increase the number of evaluation periods to however many periods of failure you want to alarm on. That will ensure your alarm transitions into INSUFFICIENT_DATA as you expect.
How to configure the alarms to achieve what I want (ie, get notified if backups are not being made)?
A possible architecture for your alarm would be to publish 1 if your job is successful and 0 if it failed. Then create an alarm with a threshold of < 1 over three 3600 s periods, meaning that your alarm will go into ALARM if the job is failing (i.e., running but failing). If you also set an INSUFFICIENT_DATA action on that alarm, you will also get notified if your job is not running at all. A rough sketch of this setup follows.
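Here is a rough boto3 sketch of that setup (the metric and alarm names are illustrative, and the SNS topic ARNs are placeholders for your own):

import boto3

cloudwatch = boto3.client("cloudwatch")

# At the end of each backup run, publish 1 on success and 0 on failure.
def report_backup_result(succeeded):
    cloudwatch.put_metric_data(
        Namespace="Backup",
        MetricData=[{"MetricName": "backup-job-status", "Unit": "Count",
                     "Value": 1 if succeeded else 0}],
    )

# Sum < 1 over three consecutive 1-hour periods -> the job is running but
# failing; the INSUFFICIENT_DATA action covers the job not running at all.
cloudwatch.put_metric_alarm(
    AlarmName="backup-job-failing",                # illustrative name
    MetricName="backup-job-status",
    Namespace="Backup",
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111111111111:backup-alerts"],            # placeholder topic
    InsufficientDataActions=["arn:aws:sns:us-east-1:111111111111:backup-alerts"], # placeholder topic
)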
Hope that makes sense.