I have an application publishing a custom CloudWatch metric using boto's put_metric_data. The metric shows the number of tasks waiting in a Redis queue.
The 1-minute max shows '3', the 1-minute min shows '0' and the 1-minute average shows '1.5'.
It seems that the application is correctly setting the value to zero, but some other process is overwriting it with 3 at the same time, and I can't find that process to stop it.
Is it possible to see logs for PutMetricData to diagnose where this value might be coming from?
Normally, Amazon CloudTrail would be the ideal way to discover information about API calls being made to your AWS account. Unfortunately, PutMetricData is not captured in Amazon CloudTrail.
From Logging Amazon CloudWatch API Calls in AWS CloudTrail:
The CloudWatch GetMetricStatistics, ListMetrics, and PutMetricData API actions are not supported.
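Without API logs, the statistics themselves give a hint. CloudWatch aggregates every datapoint published to the same metric within a period rather than overwriting earlier ones, so a minimum of 0, a maximum of 3 and an average of 1.5 are what two datapoints (0 and 3) in the same minute would produce. The sketch below is plain Python that mimics that aggregation; it is an illustration, not a CloudWatch API call, and the datapoints are hypothetical.

```python
from statistics import mean


def summarize(datapoints):
    """Mimic the per-period statistics CloudWatch reports for a set of datapoints."""
    return {
        "SampleCount": len(datapoints),
        "Minimum": min(datapoints),
        "Maximum": max(datapoints),
        "Average": mean(datapoints),
    }


# Hypothetical minute: the application publishes 0 and another publisher sends 3.
stats = summarize([0, 3])
print(stats)
```

If the application publishes once per minute, checking the SampleCount statistic of the real metric should show whether more datapoints are arriving than expected.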
Question
I have set up a Laravel project that connects to AWS MediaLive for streaming.
Everything is working fine, and I am able to stream, but I couldn't find a way to see if a channel that was running had anyone connected to it.
What I need
I want to be able to see if a running channel has anyone connected to it via the php SDK.
Why
I want to show a stream on the user's side only if there is someone connected to it.
I want to stop a channel that has had no one connected to it for too long (say, an hour).
Other
I tried looking at the docs but the closest thing I could find was the DescribeChannel command.
This, however, does not return any information about the alerts. I also tried comparing the output of DescribeChannel when someone was connected and when no one was connected, but there was no difference.
On the AWS site I can see the alerts on the channel page, but I cannot find how to view them from my Laravel application.
Update
I tried running these from the SDK:
CloudWatch->DescribeAlarms();
CloudWatchLogs->GetLogEvents(['logGroupName'=>'ElementalMediaLive', 'logStreamName'=>'channel-log-stream-name']);
But it seems to me that their output didn't change after a channel started running without anyone connected to it.
I went on the console's CloudWatch and it was the same.
Do I need to first set up Egress Points for alerts to show here?
I looked into SNS Topics and Lambda functions, but it seems they are for sending messages and notifications. Can I also use them to stop or delete a channel that has been disconnected for over an hour? Are there any docs that could help me?
I'm using AWS MediaStore, but I'm guessing I can do the same as with AWS MediaPackage? How can the threshold tell me whether, and for how long, no one has been connected to a MediaLive channel?
Overall
After looking here and there in the docs I am assuming I have to:
1. set up a metric alarm that detects when a channel had no input for over an hour
2. Send the alarm message to the CloudWatchLogs
3. retrieve the alarm message from the SDK and/or the SNS Topic
4. stop/delete the channel that sent the alarm message
Did I understand this correctly?
Thanks for your post.
Channel alerts will go to your AWS CloudWatch logs. You can poll these alarms from the SDK or CLI using a command of the form 'aws cloudwatch describe-alarms'. Related log events may be retrieved with a command of the form 'aws logs get-log-events'.
You can also configure a CloudWatch rule to propagate selected service alerts to an SNS Topic, to which various clients, including a Lambda function, can subscribe and then take actions on your behalf. This approach works well for aggregating the alerts from multiple channels or services.
Measuring the connected sessions is possible for MediaPackage endpoints, using the 2xx Egress Request Count metric. You can set a metric alarm on this metric such that when its value drops below a given threshold, an alarm message will be sent to the CloudWatch logs mentioned above.
With regard to your list:
1. set up a metric alarm that detects when a channel had no input for over an hour
-----> CORRECT.
2. Send the alarm message to the CloudWatchLogs
-----> The alarm message goes directly to an SNS Topic, and will be echoed to your CloudWatch logs. See: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
3 and 4. retrieve the alarm message and stop/delete the channel that sent it
-----> A Lambda function will need to be created to process new messages arriving on the SNS topic mentioned above and take the desired action. This Lambda function can make API or CLI calls to stop/delete the channel that sent the alarm message. You can also have email alerts or other actions triggered from the SNS Topic; refer to https://docs.aws.amazon.com/sns/latest/dg/sns-common-scenarios.html
Alternatively, you could do everything in one Lambda function that queries the same MediaPackage metric (EgressRequestCount), evaluates the response and takes a yes/no decision on shutting down a specified channel. This Lambda function could be scheduled to run every 5 minutes to achieve the desired result. This approach would be simpler to implement, but is limited in scope to the metrics and actions coded into the Lambda function. The Channel Alert->SNS->Lambda approach would allow you to take multiple actions based on any one alert hitting the SNS Topic.
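A minimal sketch of that single-function approach, in Python with boto3, might look like the following. The `Channel` dimension name, the `channel_name`/`channel_id` event fields and the one-hour window are assumptions for illustration; check the MediaPackage metrics documentation for the dimensions available on your endpoints (a StatusCodeRange dimension would be needed to count only 2xx responses). The clients are passed in so the logic can be exercised without AWS access.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "AWS/MediaPackage"
METRIC = "EgressRequestCount"


def total_requests(cloudwatch, channel_name, now, window=timedelta(hours=1)):
    """Sum EgressRequestCount over the trailing window for one channel."""
    response = cloudwatch.get_metric_statistics(
        Namespace=NAMESPACE,
        MetricName=METRIC,
        Dimensions=[{"Name": "Channel", "Value": channel_name}],  # assumed dimension
        StartTime=now - window,
        EndTime=now,
        Period=300,
        Statistics=["Sum"],
    )
    # No datapoints at all is treated the same as zero requests.
    return sum(point["Sum"] for point in response["Datapoints"])


def stop_if_idle(cloudwatch, medialive, channel_name, channel_id, now=None):
    """Stop the MediaLive channel if nothing was requested during the window."""
    now = now or datetime.now(timezone.utc)
    if total_requests(cloudwatch, channel_name, now) > 0:
        return False
    medialive.stop_channel(ChannelId=channel_id)
    return True


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers above run without AWS access

    stopped = stop_if_idle(
        boto3.client("cloudwatch"),
        boto3.client("medialive"),
        channel_name=event["channel_name"],  # hypothetical event fields
        channel_id=event["channel_id"],
    )
    return {"stopped": stopped}
```

Scheduling this function every 5 minutes with a one-hour window gives the "idle for an hour" behaviour described in the question, at the cost of hard-coding the metric and the action.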
My app writes logs to /var/log/my_app.log. I have logrotate set up to rotate the log daily, so presumably when the rotation condition is met it will copy my_app.log to my_app<date>.log. I also have the CloudWatch agent on the same EC2 instance sending the files to CloudWatch Logs. I assume they will stay there indefinitely (or until a retention time set in the AWS console). Is it correct to assume that CloudWatch will always keep the logs it has already received, regardless of how I rotate the actual log files on the EC2 instance? That is to say, no matter what happens with the rotated logs, will I always have ALL the logs that have been created because they've been sent to CloudWatch?
Any log that is sent to CloudWatch will not be deleted because of log rotation. Check out the FAQ section at the following link, which answers some important questions, including the log rotation naming schemes and the scenarios in which log events can be truncated or skipped.
(Search for CloudWatch Logs Agent FAQs in the following link)
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
Your assumption is correct on the log retention. CloudWatch logs are stored indefinitely by default.
Here is the quote from Amazon documentation
Log Retention – By default, logs are kept indefinitely and never expire. You can adjust the retention policy for each log group, keeping the indefinite retention, or choosing a retention period between 10 years and one day.
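If indefinite retention is not what you want, the retention period can also be set from code. The following is a small sketch using boto3's `put_retention_policy`; the log group name in the usage comment is hypothetical, and the client is passed in so the function can be tried against a stub.

```python
def set_retention(logs_client, log_group, days):
    """Set how long CloudWatch Logs keeps events in one log group."""
    logs_client.put_retention_policy(logGroupName=log_group, retentionInDays=days)


# Usage (requires AWS credentials; the log group name is hypothetical):
#   import boto3
#   set_retention(boto3.client("logs"), "/var/log/my_app.log", 30)
```

CloudWatch Logs only accepts specific values for the number of days, so check the API reference before choosing one.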
The AWS documentation indicates that multiple log event records are provided to Lambda when streaming logs from CloudWatch.
logEvents
The actual log data, represented as an array of log event
records. The "id" property is a unique identifier for every log event.
How does CloudWatch group these logs?
Time? Count? Randomly, from my perspective?
Currently you get one Lambda invocation for every PutLogEvents batch that CloudWatch Logs receives for that log group. However, you should probably not rely on that, because AWS could change it at any time (for example, by batching more events together).
You can observe this behavior by running the CWL -> Lambda example in the AWS docs.
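Because the batch size is not guaranteed, a handler should iterate over whatever arrives. A minimal sketch of a Python handler follows; it relies on the documented shape of the subscription event, where `awslogs.data` is base64-encoded, gzip-compressed JSON containing the `logEvents` array.

```python
import base64
import gzip
import json


def decode_log_events(event):
    """Return the list of log event records from a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return payload["logEvents"]


def lambda_handler(event, context):
    events = decode_log_events(event)
    # One invocation may carry one record or many, so always iterate.
    for record in events:
        print(record["id"], record["timestamp"], record["message"])
    return len(events)
```

Logging `len(events)` per invocation is an easy way to observe how your own log group is being batched.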
Some AWS services, such as Elastic Load Balancing, allow you to configure the log interval; there you can choose between five- and sixty-minute intervals. You may not see a specific increment or parameter in the docs because the interval is configured per service.
I'm using an AWS Lambda (triggered hourly by a CloudWatch rule) to create an EMR cluster that executes a job. Once the EMR cluster has finished its steps, it writes a result file to an S3 bucket. The key path contains the hour of the day:
/bucket/2017/04/28/00/result.txt
/bucket/2017/04/28/01/result.txt
..
/bucket/2017/04/28/23/result.txt
I want to set up an alert in case the EMR job fails, for whatever reason, to create the result.txt for the hour.
I have already put alerts on the Lambda invocation count and on the Lambda error count, but I haven't found an appropriate alert to check that the EMR cluster actually finishes its job correctly.
Note that the Lambda is triggered at 3 minutes past every hour and the job takes about 15 minutes to complete. Would a good solution be to create another Lambda that is triggered at 30 minutes past every hour and checks that the correct key is present in the bucket? If it is not, it would write some logs to CloudWatch that I could monitor and use to create my alert.
What other way could I achieve this alerting?
S3 offers free metrics on object count per bucket, but doesn't publish often enough for your use case.
CloudWatch Alarm on S3 Request Metrics
For a cost, you can enable CloudWatch request metrics for S3, which write data in 1-minute periods. You could, for example, create a relevant alarm on the following S3 CloudWatch metrics:
PutRequests sum <= 0 over each hour
4xxErrors sum >= 1 over 1 minute
5xxErrors sum >= 1 over 1 minute
The HTTP status code alarms, evaluated over much shorter intervals (down to 1 minute), will offer feedback nearer to when these failures occur.
CloudWatch Alarm on Put Events
If you don't want to incur the cost of S3 request metrics, you could instead configure an event to publish a message to an SNS topic on S3 put. You can use CloudWatch to set up alerting on the sum of messages published (or lack thereof).
You could then create a CloudWatch alarm based on this topic failing to publish a message.
Dimensions: TopicName = YOURSNSTOPIC
Namespace: AWS/SNS
Metric Name: NumberOfMessagesPublished
Threshold: NumberOfMessagesPublished <= 0 for 60 minutes (4 periods)
Statistic: Sum
Period: 15 minutes
Treat missing data as: breaching
Actions: Send notification to another, separate SNS topic that sends you an email/sms, or otherwise publishes to some alerting service.
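The settings above map fairly directly onto boto3's `put_metric_alarm`. The sketch below builds the arguments as a plain dictionary so they can be inspected before anything is created; the alarm name and the two topic arguments are hypothetical placeholders.

```python
def missing_upload_alarm(topic_name, notify_topic_arn):
    """Build put_metric_alarm arguments matching the settings listed above."""
    return {
        "AlarmName": f"no-s3-put-{topic_name}",  # hypothetical naming scheme
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfMessagesPublished",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "Statistic": "Sum",
        "Period": 900,  # 15 minutes, in seconds
        "EvaluationPeriods": 4,  # 4 x 15 minutes = 60 minutes
        "Threshold": 0,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "TreatMissingData": "breaching",
        "AlarmActions": [notify_topic_arn],  # the separate alerting topic
    }


def create_alarm(cloudwatch, topic_name, notify_topic_arn):
    """Create the alarm with a boto3 CloudWatch client passed in by the caller."""
    cloudwatch.put_metric_alarm(**missing_upload_alarm(topic_name, notify_topic_arn))
```

Calling `create_alarm(boto3.client("cloudwatch"), "YOURSNSTOPIC", "<arn of the alerting topic>")` would create it, assuming credentials with permission to manage alarms.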
Discussion
Note that both CloudWatch solutions have the caveat that they won't fire alerts exactly at 30 minutes past the hour, but they will capture your entire monitoring period.
You may be able to refine these base examples by adjusting your period or how CloudWatch treats missing data to get better results.
A Lambda that triggers at 30 minutes past the hour (via cron-style scheduling) and checks S3 request metrics or the SNS topic's "NumberOfMessagesPublished" metric, instead of relying on CloudWatch alarms, could also accomplish this. This may be a better alternative if firing exactly 30 minutes past the hour is important, as the CloudWatch alarm's firing time will not be as precise.
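A variant of the scheduled-Lambda idea, closer to what the question proposes, is to check for the expected key directly. The sketch below assumes the bucket is literally named `bucket` as in the example paths, that the hour in the key is in UTC, and that the check runs in the same hour as the job; all three are assumptions to adjust.

```python
from datetime import datetime, timezone


def expected_key(run_time):
    """Key the EMR job should have written for the hour containing run_time."""
    return run_time.strftime("%Y/%m/%d/%H/result.txt")


def result_exists(s3, bucket, run_time):
    """True if the expected result file for that hour is present in the bucket."""
    key = expected_key(run_time)
    response = s3.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    # Compare exactly, since a prefix match alone would also accept longer keys.
    return any(obj["Key"] == key for obj in response.get("Contents", []))


def lambda_handler(event, context):
    import boto3  # lazy import so the helpers can be exercised without AWS access

    now = datetime.now(timezone.utc)
    found = result_exists(boto3.client("s3"), "bucket", now)
    if not found:
        # This line lands in CloudWatch Logs, where a metric filter could count it.
        print(f"MISSING_RESULT {expected_key(now)}")
    return {"found": found}
```

A metric filter on the `MISSING_RESULT` marker (a made-up string) could then feed an ordinary CloudWatch alarm.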
Further Reading
AWS Documentation - Configuring Amazon S3 Event Notifications
AWS Documentation - SNS CloudWatch Metrics
AWS Documentation - S3 CloudWatch Metrics
Suppose I have an ec2 instance with service /etc/init/my_service.conf with contents
script
exec my_exec
end script
How can I monitor that ec2 instance such that if my_service stopped running I can act on it?
You can publish a custom metric to CloudWatch in the form of a "heartbeat".
Have a small script running via cron on your server that checks the process list to see whether my_service is running and, if it is, makes a put-metric-data call to CloudWatch.
The metric could be as simple as pushing the number "1" to your custom metric in CloudWatch.
Set up a CloudWatch alarm that triggers if the average for the metric falls below 1.
Make the period of the alarm >= the period at which the cron job runs. For example, if cron runs every 5 minutes, have the alarm fire when the average is below 1 for two 5-minute periods.
Make sure you also handle the situation in which the metric is not published at all (e.g. cron fails to run or the whole machine dies). You would want to set up an alert for missing data in that case (see here: AWS Cloudwatch Heartbeat Alarm).
Be aware that the custom metric will add an additional cost of 50c to your AWS bill. That is not a big deal for one metric, but the equation changes drastically if you want to push hundreds or thousands of metrics, so it is good to know that it is not free.
See here for how to publish a custom metric: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html
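A minimal sketch of such a cron script in Python follows. It assumes `pgrep` is available on the instance and that the process shows up under the executable name (`my_exec` in the question); the namespace, metric name and dimension are made-up examples. The process check is passed in as a parameter so it can be swapped out.

```python
import subprocess


def process_is_running(name):
    """True if a process with exactly this name is in the process list (uses pgrep)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def publish_heartbeat(cloudwatch, process_name, check=process_is_running):
    """Push the value 1 to a custom metric, but only while the process is alive."""
    if not check(process_name):
        return False
    cloudwatch.put_metric_data(
        Namespace="Custom/Heartbeat",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "Heartbeat",  # hypothetical metric name
                "Dimensions": [{"Name": "Process", "Value": process_name}],
                "Value": 1,
            }
        ],
    )
    return True


# Usage from cron (requires AWS credentials on the instance):
#   import boto3
#   publish_heartbeat(boto3.client("cloudwatch"), "my_exec")
```

Because nothing is published while the process is down, the alarm on this metric needs to treat missing data as breaching, as noted above.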
I am not sure CloudWatch is the right route for checking whether the service is running; it would be easier with a Nagios-style solution.
Nevertheless, you may try the CloudWatch custom metrics approach. You add additional lines of code that publish, say, the integer 1 to a CloudWatch custom metric every 5 minutes. You can then configure CloudWatch alarms to send an SNS notification or email when statistics such as Sample Count or Sum deviate from your anticipated value.
script
exec my_exec
publish cloudwatch custom metrics value
end script
More Info
Publish Custom Metrics - http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html