AWS CloudWatch - No alarm mail sent on Lambda timeout of 15 min - amazon-web-services

I have a strange problem which I don't understand.
I have created a CloudWatch alarm which should notify me about errors within a Lambda execution (including timeouts).
The relevant parameters of the alarm are the following:
period = "300"
datapoints_to_alarm = "1"
evaluation_periods = "1"
treat_missing_data = "notBreaching"
statistic = "Sum"
threshold = "0"
metric_name = "Errors"
namespace = "AWS/Lambda"
alarm_actions = [aws_sns_topic.alarm.arn]
When my Lambda runs into a timeout after 15 minutes (the maximum Lambda execution time), no email is sent to me.
When my Lambda runs into a timeout after 2, 6, 10 or 14 minutes, I get the notification email as expected. Even at 14 minutes and 30 seconds the mail is sent. Above 14:30 minutes, the alarm doesn't switch to the ALARM state.
Does anybody know why that happens?
The datapoint (error) is shown correctly in the metric. It seems that the datapoint (error) is timestamped with the start time of the Lambda invocation. Might that be the problem, because three evaluation periods have already elapsed since the Lambda started? But why do I get the alarm mail when it times out after 14 minutes (also more than one evaluation period)?
I already asked this question in the AWS forum but got no answer yet.
Can anyone suggest what I'm doing wrong?
Regards Hannes

According to the AWS documentation regarding Lambda function metrics, the timestamp on a metric reflects when the function was invoked. Depending on the duration of the execution, this can be several minutes before the metric is emitted.
So, for example, if your function has a 15-minute timeout, you should look more than 15 minutes into the past for accurate metrics. Since CloudWatch evaluates alarms against fixed (absolute) time windows, you should set the alarm's EvaluationPeriods parameter to be higher than 1, and then set the DatapointsToAlarm and Period parameters according to your Lambda's timeout and how often you want to sample the alarm state in CloudWatch before raising notifications. More details about these parameters can be found here.
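As an illustration, here is a minimal Terraform sketch of the adjusted alarm, assuming a function named my-function and the SNS topic aws_sns_topic.alarm from the question; the idea is simply to widen the evaluation window so the back-dated error datapoint still falls inside it:

resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-errors"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  statistic           = "Sum"
  period              = 300            # 5-minute buckets, as in the original alarm
  evaluation_periods  = 4              # look back 20 minutes, i.e. longer than the 15-minute timeout
  datapoints_to_alarm = 1              # a single breaching datapoint is enough
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  dimensions = {
    FunctionName = "my-function"       # hypothetical function name
  }
  alarm_actions = [aws_sns_topic.alarm.arn]
}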

Related

AWS CloudWatch Alarm - TargetResponseTime: Count number of responses taking more than 40 sec within a specific time window

I have an AWS CloudWatch alarm set up with the metric condition TargetResponseTime. My current condition to trigger the alarm is below.
This triggers the alarm every time the response time exceeds 45 sec. I want to trigger the alarm only if more than one response takes 45 sec within a 1-minute time frame.
Is my metric condition or threshold type correct? Or please let me know the right metric and condition to match my case.

Unknown spike in 'Approximate Age of Oldest Message Average' metric AWS

I have received the following alarm message daily at the same time from my Amazon SQS queue.
You are receiving this email because your Amazon CloudWatch Alarm "Old Messages in SQS" in the {my region} region has entered the ALARM state, because "Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition)." at "Wednesday 30 September, 2020 00:07:22 UTC".
Alarm Details:
Name: Old Messages in SQS
Description: Abc updates take too long. Check the processor and queue.
State Change: OK -> ALARM
Reason for State Change: Threshold Crossed: 1 out of the last 1 datapoints [183.0 (30/09/20 00:06:00)] was greater than or equal to the threshold (180.0) (minimum 1 datapoint for OK -> ALARM transition).
Timestamp: Wednesday 30 September, 2020 00:07:22 UTC
Threshold:
The alarm is in the ALARM state when the metric is GreaterThanOrEqualToThreshold 180.0 for 60 seconds.
Monitored Metric:
MetricNamespace: AWS/SQS
MetricName: ApproximateAgeOfOldestMessage
Period: 60 seconds
Statistic: Average
Unit: not specified
State Change Actions:
OK:
INSUFFICIENT_DATA:
So I checked in CloudWatch to see what was happening, and I identified that the CPU utilization of the instance that processes the SQS messages drops at the same time. So I concluded that messages piled up in SQS because the server was down.
But I can't identify why the server goes down at the same time every day. I checked the following:
EC2 snapshots - no automated schedules
RDS snapshots - no automated schedules at that time
Cron jobs in the server
If anyone has experience with this kind of issue, any help identifying the exact cause would be highly appreciated.
This is super late to the party and I assume by now you either figured it out or you moved on. Sharing some thoughts for posterity.
The metric that went over the 3-minute threshold is the age of the oldest item on the SQS queue. The cause of the alert is not necessarily an instance processing messages going down; it could be that the processing routine is blocked (waiting) on something external, such as a long-running network call.
During that waiting state the CPU is not used much, so CPU utilization could be close to 0 if no other messages are being processed. Again, this is just a guess.
What I would do in this situation is:
Add/check CloudWatch logs that are emitted within the processing routine and verify processing is not stuck.
Check instance events/logs (by these I mean instance starting/stopping, maintenance, etc.) to verify nothing external was affecting the processing instance.
In a number of cases the answer could be an error on Amazon's side, so when stuck, check with AWS Support.

Set AWS Cloudwatch Alarm datapoint timespan and action to shut it down

Following case:
We want an alarm in AWS that reads the EstimatedCharges metric of AmazonCloudWatch every 5 minutes (to catch a potential log overflow). But the only timespan I can set is 6 hours, otherwise it gives me "Insufficient" as the status. How can I change the metric so that I can use it with 5 minutes between each check?
And how can I make an action that will stop the CloudWatch Logs when the charges go over X?
According to the documentation:
CloudWatch billing metrics are updated every 6 hours.
Thus, your alarm may change its status only every 6 hours.
Just set Treat Missing Data to notBreaching.
notBreaching – Missing data points are treated as "good" and within the threshold.
More info: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
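As an illustration, a minimal Terraform sketch of such a billing alarm (the threshold value and SNS topic are hypothetical); note that billing metrics are only published in the us-east-1 region:

resource "aws_cloudwatch_metric_alarm" "estimated_charges" {
  alarm_name          = "estimated-charges"
  namespace           = "AWS/Billing"
  metric_name         = "EstimatedCharges"
  statistic           = "Maximum"
  period              = 21600          # 6 hours; billing metrics are only updated roughly every 6 hours
  evaluation_periods  = 1
  threshold           = 10             # hypothetical USD threshold
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching" # missing datapoints between updates are treated as "good"
  dimensions = {
    Currency = "USD"
  }
  alarm_actions = [aws_sns_topic.billing.arn]   # hypothetical SNS topic
}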

AWS Cloudwatch Alarm status

I have set up a CloudWatch alarm to trigger an SNS mail whenever certain keywords are found in CloudWatch Logs (using a metric filter).
When those keywords are detected, the alarm state changes from INSUFFICIENT_DATA to ALARM and triggers the SNS topic.
Now, moving back from ALARM to INSUFFICIENT_DATA takes a seemingly random amount of time.
Is there a specific way this works? I expect it to return to INSUFFICIENT_DATA immediately after the ALARM state.
Any help would be appreciated. Thanks
The alarm has a metric period of 60 seconds and some number of evaluation periods (let's suppose 3; in total a 3 * 60 s = 3-minute evaluation window).
The alarm will be in the ALARM state if all of the last 3 datapoints at 60-second intervals are breaching (above the threshold).
If any one of the last 3 datapoints is below the threshold, the alarm will transition to OK.
BUT, if all of the latest 3 datapoints are missing (say your metric filter did not match and as a result no metric was pushed), the alarm waits longer than 3 periods to transition to INSUFFICIENT_DATA; this is by design, to accommodate network or processing delays.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
I came across the same situation, using a period of 1 minute and a condition of x > threshold.
The state changes to ALARM immediately whenever the metric exceeds the threshold, but changing back to OK/INSUFFICIENT_DATA takes 6 minutes. This happens only for missing data.
As per AWS Support, this is the expected behavior of CloudWatch alarms; a clear explanation can be found here: https://forums.aws.amazon.com/thread.jspa?threadID=284182

Get notifications when AWS Lambda times out

Is there a way to get notifications when my AWS Lambda function times out?
I am unable to find any documentation. The only way as of now is to search through the Cloudwatch logs for timeout notifications of all the Lambda functions I have. Is there a better way?
According to the docs, a timeout should be counted in the Errors metric. I observed weird behaviour with the count (e.g. an error count of 0.5), hence I made a CloudWatch alarm for the error count > 0 (not >= 1).
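For illustration, a minimal Terraform sketch of such an alarm, assuming a hypothetical function named my-function and a hypothetical SNS topic for the notification:

resource "aws_cloudwatch_metric_alarm" "lambda_timeout_errors" {
  alarm_name          = "lambda-timeout-errors"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"   # > 0 rather than >= 1, as described above
  treat_missing_data  = "notBreaching"
  dimensions = {
    FunctionName = "my-function"                 # hypothetical function name
  }
  alarm_actions = [aws_sns_topic.notify.arn]     # hypothetical SNS topic
}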
You could also do something with the REPORT message or with
Task timed out after 25.00 seconds
which can be found in the Cloudwatch logs.
I've created an alarm in CloudWatch for the Lambda "Duration" metric with the "Maximum" statistic to alert me when the execution duration is greater than or equal to 30000 ms (= 30 seconds), for a Lambda function configured with a timeout of 30 seconds.
If the duration of a single execution (the "Maximum" of the period) reaches the timeout, you will be notified. It is working fine for me.
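For reference, a minimal Terraform sketch of such a Duration-based alarm, assuming a 30-second timeout and hypothetical resource names (Duration is reported in milliseconds):

resource "aws_cloudwatch_metric_alarm" "lambda_duration" {
  alarm_name          = "lambda-duration-near-timeout"
  namespace           = "AWS/Lambda"
  metric_name         = "Duration"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 30000          # milliseconds; equal to the 30-second function timeout
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  dimensions = {
    FunctionName = "my-function"       # hypothetical function name
  }
  alarm_actions = [aws_sns_topic.notify.arn]   # hypothetical SNS topic
}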
You can have CloudWatch trigger an alarm when a certain message shows up in the logs. I can't seem to find any official documentation on this, but you create a "Metric Filter" in CloudWatch Logs, and then you can create an alarm from that. This blog post seems to describe the process well.
I could receive an SNS notification (email) by creating a metric filter and an alarm whenever a Lambda function timed out or the provisioned throughput was exceeded on a DynamoDB table -
:
error: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
:
:
2022-06-08T06:09:07.427+05:30 REPORT RequestId: c6acc2ca-ee60-495a-a554-bb76a9943430 Duration: 10019.19 ms Billed Duration: 10000 ms Memory Size: 1024 MB Max Memory Used: 236 MB
2022-06-08T06:09:07.427+05:30 2022-06-08T00:39:07.426Z c6acc2ca-ee60-495a-a554-bb76a9943430 Task timed out after 10.02 seconds
:
Refer to the doc https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudwatch-alarms-for-cloudtrail.html for the detailed steps on setting up a metric filter and an alarm.
Summary of steps:
a) Identify the pattern you are looking for in the CloudWatch log.
b) Create a metric filter on the log group related to the Lambda function,
providing a filter pattern to match any one of the possible texts in the log:
?"timed out" ?error ?"ProvisionedThroughputExceededException"
c) Create an alarm for the filter,
create an SNS topic,
and provide an email address to receive notifications on.
This helped in tuning the capacity (RCU, WCU) set on the DynamoDB table and also the timeout settings on the Lambda function. Hope this helps someone.
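As a rough illustration of the steps above, here is a minimal Terraform sketch; the log group name, topic name, and email address are hypothetical placeholders:

resource "aws_cloudwatch_log_metric_filter" "lambda_errors" {
  name           = "lambda-error-patterns"
  log_group_name = "/aws/lambda/my-function"   # hypothetical log group
  pattern        = "?\"timed out\" ?error ?\"ProvisionedThroughputExceededException\""

  metric_transformation {
    name      = "LambdaErrorPatternCount"
    namespace = "Custom/Lambda"
    value     = "1"
  }
}

resource "aws_sns_topic" "lambda_error_mail" {
  name = "lambda-error-mail"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.lambda_error_mail.arn
  protocol  = "email"
  endpoint  = "ops@example.com"   # hypothetical address; the subscription must be confirmed by mail
}

resource "aws_cloudwatch_metric_alarm" "lambda_error_patterns" {
  alarm_name          = "lambda-error-patterns"
  namespace           = "Custom/Lambda"
  metric_name         = "LambdaErrorPatternCount"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.lambda_error_mail.arn]
}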