I set up a metric alarm for a Lambda function whose timeout is 15 minutes, with:
threshold: 900000
statistic: Maximum
period: 1
But nothing happens when the conditions are satisfied.
Here is my metric:
I want to be notified when the conditions are satisfied.
I have an AWS CloudWatch alarm set up with the metric TargetResponseTime. My current condition to trigger the alarm is below.
This triggers the alarm every time the response time exceeds 45 seconds. I want the alarm to trigger only if more than one response takes 45 seconds within a 1-minute time frame.
Is my metric condition or threshold type correct? If not, please let me know the right metric and condition for my case.
I want to monitor a Lambda function based on the Invocations metric. Specifically, I want the alarm to trigger if the function hasn't been invoked at least once. I've set up the following alarm and it hasn't triggered in 24 hours. I've made sure the Lambda has not been invoked; however, the alarm never goes into the "ALARM" state. Any advice would be helpful.
Threshold: Invocations < 1 for 1 datapoints within 5 minutes
Statistic: Sum
Period: 5 minutes
Metric Name: Invocations
Namespace: AWS/Lambda
Datapoints to alarm: 1 out of 1
The Invocations metric is only written if there are invocations (i.e. it won't write a value of 0).
Therefore, for use cases such as this, where you want the alarm to trigger when the metric has not been written, you need to configure it to treat missing data as breaching.
Documentation for this is here.
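As a concrete illustration, here is a minimal boto3 sketch of the alarm above with treat-missing-data set to breaching; the alarm name, function name, and SNS topic ARN are placeholders, not values from the question:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="lambda-not-invoked",                                  # placeholder name
    Namespace="AWS/Lambda",
    MetricName="Invocations",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],   # placeholder function
    Statistic="Sum",
    Period=300,                       # 5 minutes
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",     # missing datapoints count as breaching
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-topic"],    # placeholder ARN
)
```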
I have received the following alarm message daily, at the same time, from my Amazon SQS alarm.
You are receiving this email because your Amazon CloudWatch Alarm "Old Messages in SQS" in the {my region} region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Wednesday 30 September, 2020 00:07:22 UTC".
Alarm Details:
Name: Old Messages in SQS
Description: Abc updates take too long. Check the processor and queue.
State Change: OK -> ALARM
Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition).
Timestamp: Wednesday 30 September, 2020 00:07:22 UTC
Threshold:
The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 180.0 for 60 seconds.
Monitored Metric:
MetricNamespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Period: 60 seconds
Statistic: Average
Unit: not specified
State Change Actions:
OK:
INSUFFICIENT_DATA:
So I checked in CloudWatch to see what is happening. I identified that the CPU utilization of the instance that processes messages from SQS drops at the same time, so I concluded that messages were piling up in SQS because the server was down.
But I can't identify why the server goes down at the same time every day. I checked the following:
EC2 snapshots - no automated schedules
RDS snapshots - no automated schedules at that time
Cron jobs in the server
Any help from someone who has had this kind of experience in identifying the exact issue would be highly appreciated.
This is super late to the party and I assume by now you either figured it out or you moved on. Sharing some thoughts for posterity.
The metric that went over the 3-minute threshold is the age of the oldest item on the SQS queue. The cause of the alert is not necessarily the message-processing instance going down; it could be that the processing routine is blocked (waiting) on something external, such as a long-running network call.
While waiting, the CPU is barely used, so CPU utilization could be close to 0 if no other messages are being processed. Again, this is just a guess.
What I would do in this situation is:
Add/check CloudWatch logs emitted within the processing routine and verify processing is not stuck (see the sketch after this list).
Check instance events/logs (instance starting/stopping, maintenance, etc.) to verify nothing external was affecting the processing instance.
In a number of cases the answer could be an error on Amazon's side, so when stuck, check with Amazon Support.
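As a rough sketch of the first suggestion, the worker loop could emit a custom heartbeat metric per processed message; the queue URL, namespace, and metric name below are made up for illustration:

```python
import time
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def process(message):
    ...  # actual message handling goes here

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for message in resp.get("Messages", []):
        start = time.time()
        process(message)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
        # Heartbeat: if this metric stops arriving, or the duration spikes,
        # the processing routine is stuck or blocked on something external.
        cloudwatch.put_metric_data(
            Namespace="MyApp/Worker",  # hypothetical custom namespace
            MetricData=[{
                "MetricName": "MessageProcessingTime",
                "Value": time.time() - start,
                "Unit": "Seconds",
            }],
        )
```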
I have set a CloudWatch alarm to trigger an SNS mail whenever certain keywords are found in CloudWatch Logs (using a metric filter).
When those keywords are detected, the alarm state changes from INSUFFICIENT_DATA to ALARM and triggers the SNS topic.
Now, moving from the ALARM state back to INSUFFICIENT_DATA takes a seemingly random amount of time.
Is there a specific way this works? I expect it to go back to INSUFFICIENT_DATA immediately after the alarm state.
Any help would be appreciated. Thanks
The alarm has a metric period of 60 seconds and some number of evaluation periods (let's suppose 3, giving a 3 * 60 = 180-second evaluation window).
The alarm will be in the ALARM state if all of the last 3 datapoints at 60-second intervals are breaching (above the threshold).
If any one of the last 3 datapoints is below the threshold, the alarm will transition to OK.
BUT if all of the latest 3 datapoints are missing (say your metric filter did not match, and as a result no metric was pushed), the alarm waits longer than 3 periods to transition to INSUFFICIENT_DATA. This is by design, to accommodate network and processing delays.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
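For reference, a minimal boto3 sketch of an alarm with that 3-period evaluation window; the alarm name, namespace, metric name, and topic ARN are placeholders for a metric-filter metric, not values from the question:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="keyword-match-alarm",     # placeholder name
    Namespace="LogMetrics",              # placeholder metric-filter namespace
    MetricName="KeywordMatches",         # placeholder metric-filter metric
    Statistic="Sum",
    Period=60,                           # one datapoint per 60 seconds
    EvaluationPeriods=3,                 # 3 * 60 = 180-second evaluation window
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="missing",          # default: missing datapoints delay the
                                         # transition to INSUFFICIENT_DATA, as described above
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-topic"],  # placeholder ARN
)
```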
I came across the same situation, using a period of 1 minute and a condition of x > threshold.
The state changes to ALARM immediately whenever the metric exceeds the threshold, but changing back to OK/INSUFFICIENT_DATA takes 6 minutes. This happens only for missing data.
As per AWS Support, this is the expected behavior of CloudWatch alarms; a clear explanation can be found here: https://forums.aws.amazon.com/thread.jspa?threadID=284182
I am trying to capture the event of a new message arriving in my FIFO queue (as I want to avoid infinite polling of the queue).
For this purpose I am evaluating the CloudWatch alarm option with the metric ApproximateNumberOfMessagesVisible.
Following is my alarm description:
Threshold: The condition in which the alarm will go to the ALARM state: ApproximateNumberOfMessagesVisible >= 0 for 1 minute
Actions: The actions that will occur when the alarm changes state.
In ALARM:
Send message to topic "topic_for_events_generated_bycloudwatch" (xyz#xyz)
Send message to topic "topic_for_events_generated_bycloudwatch"
Period: The granularity of the datapoints for the monitored metric: 1 minute
Following are my queries:
Assuming there are more than 0 messages in the given queue, will this alarm be raised only once when the condition is met, or every minute?
During a quick test I saw the alarm moving between the INSUFFICIENT_DATA and ALARM states in random order without any configuration changes; what could be the reason?
Screenshot of ApproximateNumberOfMessagesVisible metric graph
Screenshot of the log activity
CloudWatch will alarm once, on the state transition, when the threshold is breached.
From the docs:
Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state, the state must have changed and been maintained for a specified number of periods.
But
After an alarm invokes an action due to a change in state, its subsequent behavior depends on the type of action that you have associated with the alarm. For Amazon EC2 and Auto Scaling actions, the alarm continues to invoke the action for every period that the alarm remains in the new state. For Amazon SNS notifications, no additional actions are invoked.
An Example:
In the following figure, the alarm threshold is set to 3 units and the alarm is evaluated over 3 periods. That is, the alarm goes to ALARM state if the oldest of the 3 periods being evaluated is breaching, and the 2 subsequent periods are either breaching or missing. In the figure, this happens with the third through fifth time periods, and the alarm's state is set to ALARM. At period six, the value dips below the threshold, and the state reverts to OK. Later, during the ninth time period, the threshold is breached again, but for only one period. Consequently, the alarm state remains OK.
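To make the quoted example concrete, here is a minimal sketch (not AWS code) that evaluates a series of datapoints the way the answers above describe: the alarm is in ALARM only while every datapoint in the sliding window breaches the threshold; missing-data handling is left out for simplicity.

```python
def evaluate_alarm(datapoints, threshold=3, evaluation_periods=3):
    """Simplified CloudWatch-style evaluation: ALARM only while every
    datapoint in the sliding window breaches the threshold."""
    state = "OK"
    for period in range(evaluation_periods, len(datapoints) + 1):
        window = datapoints[period - evaluation_periods:period]
        new_state = "ALARM" if all(v > threshold for v in window) else "OK"
        if new_state != state:
            print(f"period {period}: {state} -> {new_state}  window={window}")
            state = new_state
    return state

# Values loosely mirroring the figure: periods 3-5 breach, so the alarm enters
# ALARM at period 5; period 6 dips below the threshold and the state reverts to
# OK; a single breaching period (period 9) is not enough to re-enter ALARM.
evaluate_alarm([0, 2, 5, 6, 4, 1, 2, 1, 5, 2])
```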