I'm using an AWS Lambda (hourly triggered by a Cloudwatch rule) to trigger the creation of an EMR cluster to execute a job. The EMR cluster once finished its steps write a result file in a S3 bucket. The key path is the hour of the day
/bucket/2017/04/28/00/result.txt
/bucket/2017/04/28/01/result.txt
..
/bucket/2017/04/28/23/result.txt
I wanted to put some alert in case for some reason the EMR job failed to create the result.txt for the hour.
I have already put some alerts on the Lambda invocation count and on the lambda error count but I didn't manage to find the appropriate alert to test that the EMR actually correctly finishes its job.
Note that the Lambda is triggered every 3 min past the hour and takes about 15 minutes to complete. Would a good solution be to create an other Lambda that is triggered every 30min past the hour and checks that the correct key is present in the bucket? if not then write some logs to cloudwatch that I could monitor and use them to create my alert?
What other way could I achieve this alerting?
S3 offers free metrics on object count per bucket, but doesn't publish often enough for your use case.
CloudWatch Alarm on S3 Request Metrics
For a cost, you can enable CloudWatch metrics for S3 requests to enable request metrics that write data in 1-minute periods. You could, for example, create a relevant alarm on the following S3 CloudWatch metrics:
PutRequests sum <= 0 over each hour
4xxErrors sum >= 1 over 1 minute
5xxErrors sum >= 1 over 1 minute
The HTTP status code alarms on much shorter intervals (down to 1 minute), will offer feedback nearer to when these failures occur.
CloudWatch Alarm on Put Events
If you don't want to incur the cost of S3 request metrics, you could instead configure an event to publish a message to an SNS topic on S3 put. You can use CloudWatch to set up alerting on the sum of messages published (or lack thereof).
You could then create a CloudWatch alarm based on this topic failing to publish a message.
Dimensions: TopicName = YOURSNSTOPIC
Namespace: AWS/SNS
Metric Name: NumberOfMessagesPublished
Threshold: NumberOfMessagesPublished <= 0 for 60 minutes (4 periods)
Statistic: Sum
Period: 15 minutes
Treat missing data as: breaching
Actions: Send notification to another, separate SNS topic that sends you an email/sms, or otherwise publishes to some alerting service.
Discussion
Note that both CloudWatch solutions have the caveat that they won't fire alerts exactly at 30 minutes past the hour, but they will capture your entire monitoring period.
You may be able to further configure from these base examples by adjusting your period or how cloudwatch treats missing data to get better results.
A lambda that triggers 30 minutes past the hour (via cron-style scheduling) to check S3 request metrics or the SNS topic's "NumberOfMessagesPublished" metric instead of relying on CloudWatch alarms could also accomplish this. This may be a better alternative if firing exactly 30 minutes past the hour is important, as the CloudWatch alarm's firing time will not be as precise.
Further Reading
AWS Documentation - Configuring Amazon S3 Event Notifications
AWS Documentation - SNS CloudWatch Metrics
AWS Documentation - S3 CloudWatch Metrics
Related
In our team's infrastructure we have a Databricks job which sends data to an SQS queue which triggers a Lambda function. The Databricks job runs one in every 30 minutes. A week ago the Databricks job was failing continuously so it was not sending data, therefore the Lambda function was not triggered. Is there any way to set up an alert so that I get notified if the lambda function is not triggered for a period of 2 hours?
When I searched for a solution I was only able to see to get an alert if and when a Lambda fails or if a specific log type is found in its cloudwatch logs etc, but couldn't see any solution for the above scenario.
You can create a Cloudwatch alarm for the Invocation metrics for that lambda; you can configure the alarm so that if there are no invocations over a timespan of two hours, it goes into an ALARM state.
If you wish to be notified, you can also configure the Cloudwatch alarm to send a message to an SNS topic, which can then be configured to trigger SES so that it sends you an email (for example).
I have created AWS Cloudwatch Alarm on a Target group for my Elastic Load Balancer.
Once the alarm state goes from OK to Alarm it is should send notification to an AWS SNS Topic with a Lambda function as subscriber.
What I have tested:
The lambda function works with a dummy SNS event
Publishing a test message to the SNS Topic results in all subscribers being notified
The CloudWatch Alarm successfully goes into state Alarm.
The CloudWatch Alarm History says "2019-08-04 13:18:00 Action Successfully executed action arn:arn:aws:sns:eu-west-1:something"
It seems like there is something wrong in the link between the CloudWatch Alarm and the SNS Topic. The SNS topic Access Policy has been configured to "Everyone" and still no luck.
The exact settings on my CloudWatch Alarm
Threshold
UnHealthyHostCount > 0 for 1 datapoints within 1 minute
ARN
arn:aws:cloudwatch:eu-west-1:xxxxx:alarm:awsapplicationelb-targetgroup-API-Workers-xxxxxx-High-Unhealthy-Hosts
Namespace
AWS/ApplicationELB
Metric name
UnHealthyHostCount
LoadBalancer
app/Backend-API-HTTPS/xxxxxx
TargetGroup
targetgroup/API-Workers/xxxxx
Statistic
Sum
Period
1 minute
Datapoints to alarm
1 out of 1
Missing data treatment
Treat missing data as bad (breaching threshold)
Percentiles with low samples
evaluate
The actions configured for this were selected from a list of "Select an existing SNS topic". I also tried configuring the topic ARN directly. In both cases the configuration of the actions are marked as valid by the AWS console.
I have an application publishing a custom cloudwatch metric using boto's put_metric_data. The metric shows the number of tasks waiting in a redis queue.
The 1-minute max shows '3', 1-minute min shows '0' and 1-minute average shows '1.5'.
It seems that the application is correctly setting the value to zero, but some other process is overwriting it with 3 at the same time, but I can't find this to stop it.
Is it possible to see logs for PutMetricData to diagnose where this value might be coming from?
Normally, Amazon CloudTrail would be the ideal way to discover information about API calls being made to your AWS account. Unfortunately, PutMetricData is not captured in Amazon CloudTrail.
From Logging Amazon CloudWatch API Calls in AWS CloudTrail:
The CloudWatch GetMetricStatistics, ListMetrics, and PutMetricData API actions are not supported.
Is there a way to get notifications when my AWS Lambda function times out?
I am unable to find any documentation. The only way as of now is to search through the Cloudwatch logs for timeout notifications of all the Lambda functions I have. Is there a better way?
According to the docs, a timeout should be in the Errors metric. I observed weird behaviour with the count (e.g. having an error count of 0.5). Hence I made a CloudWatch alarm for the errors count > 0 (not >= 1).
You could also do something with the REPORT message or with
Task timed out after 25.00 seconds
which can be found in the Cloudwatch logs.
I've created an alarm in CloudWatch for a Lambda metric of type "Duration" and selected the Statistic of "Maximum" to alert me when the execution duration is greater/equal 30000 (= 30 seconds) for a Lambda function configured with a timeout of 30 seconds.
If the duration of a single execution ("maximum" of the period) exceeds the timeout time, you will be notified. It is working fine for me.
You can have CloudWatch trigger an alarm when a certain message shows up in the logs. I can't seem to find any official documentation on this, but you create a "Metric Filter" in CloudWatch Logs, and then you can create an alarm from from that. This blog post seems to describe the process well.
I could receive SNS notification (email) by creating a metric filter and an alarm whenever a lambda function timed out or provisioned throughput exceeded on a Dynamo table -
:
error: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.
:
:
2022-06-08T06:09:07.427+05:30 REPORT RequestId: c6acc2ca-ee60-495a-a554-bb76a9943430 Duration: 10019.19 ms Billed Duration: 10000 ms Memory Size: 1024 MB Max Memory Used: 236 MB
2022-06-08T06:09:07.427+05:30 2022-06-08T00:39:07.426Z c6acc2ca-ee60-495a-a554-bb76a9943430 Task timed out after 10.02 seconds
:
Refer to doc: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudwatch-alarms-for-cloudtrail.html
for the detailed steps on setting up "metric filter and an alarm"
Summary of steps,
a) identify the pattern you are looking for in the cloudwatch log
b) Create a metric filter on the log group related to the lambda function
provide a filter pattern to match any one of the possible text in the log:
?"timed out" ?error ?"ProvisionedThroughputExceededException"
c) create an alarm for the filter
create sns topic
provide email to receive notification on
This helped in tuning the capacity (RCU, WCU) set on the Dynamo table and also the timeout settings on lambda function. Hope this helps someone..
Suppose I have an ec2 instance with service /etc/init/my_service.conf with contents
script
exec my_exec
end script
How can I monitor that ec2 instance such that if my_service stopped running I can act on it?
You can publish a custom metric to CloudWatch in the form of a "heart beat".
Have a small script running via cron on your server checking the
process list to see whether my_service is running and if it is, make
a put-metric-data call to CloudWatch.
The metric could be as simple as pushing the number "1" to your custom metric in CloudWatch.
Set up a CloudWatch alarm that triggers if the average for the metric falls below 1
Make the period of the alarm be >= the period that the cron runs e.g. cron runs every 5 minutes, make the alarm alarm if it sees the average is below 1 for two 5 minute periods.
Make sure you also handle the situation in which the metric is not published (e. g. cron fails to run or whole machine dies). you would want to setup an alert in case the metric is missing. (see here: AWS Cloudwatch Heartbeat Alarm)
Be aware that the custom metric will add an additional cost of 50c to your AWS bill (not a big deal for one metric - but the equation changes drastically if you want to push hundred/thousands of metrics - i.e. good to know it's not free as one would expect)
See here for how to publish a custom metric: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html
I am not sure if CloudWatch is the right route for checking if the service is running - it would be easier with Nagios kind of solution.
Nevertheless, you may try the CloudWatch Custom metrics approach. You add Additional lines of code which publishes say an integer 1 to CloudWatch Custom Metrics every 5 mins. Your can then configure CloudWatch alarms to do a SNS Notification / Mail Notification for the conditions like Sample Count or sum deviating your anticipated value.
script
exec my_exec
publish cloudwatch custom metrics value
end script
More Info
Publish Custom Metrics - http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html