AWS Route53 - get uptime percentage

I need to find a way to report on website uptime (as a percentage) based on AWS Route53 monitoring. This reporting is generally done quarterly.
My initial thought was to have CloudWatch send ALARM and OK states via SNS to SQS, and then process this queue into a database for later reporting. As far as I can tell, however, CloudWatch will only send emails even though an SQS queue is subscribed to the topic.
Any suggestions on how I might achieve this?

Amazon Route 53 can be configured to send health check data to Amazon SQS.
It worked for me -- here are the steps I took:
Create an Amazon SNS notification topic in us-east-1 (where Route 53 performs its health checks)
Create an Amazon SQS queue in us-east-1 (same region as the notification topic)
Subscribe the Amazon SQS queue to the Amazon SNS topic (via the Queue Actions menu option)
Create an Amazon Route 53 Health Check. Set Create Alarm to Yes. Configure it to Send notification to Existing SNS topic and choose the topic created above.
An Amazon CloudWatch alarm will be automatically created by Amazon Route 53.
This will result in health notifications arriving in the SQS queue. However, it will only send an ALARM notification -- there is no notification when it becomes healthy again. To receive a "now healthy" notification, edit the CloudWatch alarm and add a new Notification that triggers when "State is OK".
Here is an example of a failure notification retrieved from the SQS queue:
{
"Type" : "Notification",
"MessageId" : "4768e8e4-0026-51c7-aa6e-a696bf02f808",
"TopicArn" : "arn:aws:sns:us-east-1:123456789012:r53-east",
"Subject" : "ALARM: \"awsroute53--4c2f-9816-a42c50ec8671-High-HealthCheckStatus\" in US - N. Virginia",
"Message" : "{\"AlarmName\":\"awsroute53-4c2f-9816-a42c50ec8671-High-HealthCheckStatus\",\"AlarmDescription\":null,\"AWSAccountId\":\"743112987576\",\"NewStateValue\":\"ALARM\",\"NewStateReason\":\"Threshold Crossed: 1 datapoint (0.0) was less than the threshold (1.0).\",\"StateChangeTime\":\"2015-09-16T00:50:44.591+0000\",\"Region\":\"US - N. Virginia\",\"OldStateValue\":\"OK\",\"Trigger\":{\"MetricName\":\"HealthCheckStatus\",\"Namespace\":\"AWS/Route53\",\"Statistic\":\"MINIMUM\",\"Unit\":null,\"Dimensions\":[{\"name\":\"HealthCheckId\",\"value\":\"4c2f-9816-a42c50ec8671\"}],\"Period\":60,\"EvaluationPeriods\":1,\"ComparisonOperator\":\"LessThanThreshold\",\"Threshold\":1.0}}",
"Timestamp" : "2015-09-16T00:50:44.656Z",
"SignatureVersion" : "1",
"Signature" : "KvCHsBh95q...cw8A==",
"SigningCertURL" : "https://sns.us-east-1.amazonaws.com/SimpleNotificationService-90147a5624348ee.pem",
"UnsubscribeURL" : "https://sns.us-east-1.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-1:123456789012:r53-east:4b5d-8318-57bd58f0b3a4"
}
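To turn those notifications into an uptime report, you can poll the queue and record each state change in a database, then compute the percentage from the ALARM/OK intervals per quarter. Below is a minimal sketch using the AWS SDK for Java v1 and Jackson; the queue name is a placeholder and the persistence step is left as a comment.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class HealthCheckQueuePoller {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        // Same region as the SNS topic and SQS queue created above.
        AmazonSQS sqs = AmazonSQSClientBuilder.standard().withRegion("us-east-1").build();
        String queueUrl = sqs.getQueueUrl("r53-east-queue").getQueueUrl();   // placeholder queue name

        for (Message m : sqs.receiveMessage(queueUrl).getMessages()) {
            // The SQS body is the SNS envelope; the alarm itself is the JSON string in "Message".
            JsonNode envelope = MAPPER.readTree(m.getBody());
            JsonNode alarm = MAPPER.readTree(envelope.get("Message").asText());

            String state = alarm.get("NewStateValue").asText();        // "ALARM" or "OK"
            String changedAt = alarm.get("StateChangeTime").asText();  // e.g. 2015-09-16T00:50:44.591+0000

            // Persist (state, changedAt) here; uptime % for a quarter is then
            // (total time - sum of ALARM->OK intervals) / total time.
            System.out.println(changedAt + " " + state);

            sqs.deleteMessage(queueUrl, m.getReceiptHandle());
        }
    }
}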

One option is to offload Route53 health check statistics to Axibase Time Series Database and enable scheduled reports as discussed in the uptime reports article.
Implementation notes:
You need a read-only account that can query CloudWatch statistics and Route 53 health check metadata (see the sketch after these notes).
The offload task has a latency of 5-15 minutes (configurable).
The offload task ensures there are no data gaps in the copied CloudWatch statistics, e.g. when collection temporarily stops for whatever reason.
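As an illustration of the kind of read-only CloudWatch query such an offload relies on (a generic sketch with the AWS SDK for Java v1, not Axibase code): Route 53 publishes HealthCheckStatus as 1 when healthy and 0 when unhealthy, so its average over a period is the uptime fraction. The health check ID below is the one from the example message above; replace it with your own.

import java.util.Date;
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.Datapoint;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;

public class QuarterlyUptime {
    public static void main(String[] args) {
        // Route 53 health check metrics live in us-east-1.
        AmazonCloudWatch cw = AmazonCloudWatchClientBuilder.standard().withRegion("us-east-1").build();

        long now = System.currentTimeMillis();
        long quarter = 90L * 24 * 60 * 60 * 1000;   // roughly one quarter, in milliseconds

        GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
                .withNamespace("AWS/Route53")
                .withMetricName("HealthCheckStatus")
                .withDimensions(new Dimension().withName("HealthCheckId")
                        .withValue("4c2f-9816-a42c50ec8671"))   // your health check ID
                .withStartTime(new Date(now - quarter))
                .withEndTime(new Date(now))
                .withPeriod(6 * 3600)       // 6-hour datapoints stay under the 1,440-datapoint limit
                .withStatistics("Average");

        // HealthCheckStatus is 1 (healthy) or 0 (unhealthy), so the mean of the averages
        // approximates the uptime fraction for the whole period.
        double uptime = cw.getMetricStatistics(request).getDatapoints().stream()
                .mapToDouble(Datapoint::getAverage)
                .average()
                .orElse(0.0);
        System.out.printf("Uptime: %.3f%%%n", uptime * 100);
    }
}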
Available reports:
Base report: average uptime over the period (day, week, month, quarter).
Enhanced uptime report with additional check configuration details.
Filter results by protocol or custom tag.
Aggregate uptime by custom tag such as GEO, environment.
Filter results by hours of the day or working days.
Aggregate uptime by day of week.
Downtime incident count.
Longest downtime incidents.
The reports can be generated interactively via a web-based console, delivered by email, or displayed on portals.
Disclaimer: I work for Axibase.

Related

CloudWatch Monitoring and Notifications

I am using various AWS services (Lambda, Glue, S3, Redshift, EC2) for ETL processing. I am trying to create a 'log ETL flow' to have monitoring and notifications sent out (email or otherwise) when a step fails in the process.
I have checked that each service I am using has metrics being logged in CloudWatch. I am now trying to figure out a smart way of processing this data in order to send out notifications when a step fails and/or have a central monitoring of the entire flow.
Are there any best practices or examples of this setup?
This seems to be a perfect case for CloudWatch Alarms.
You can create a CloudWatch alarm that watches a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. The action can be an Amazon EC2 action, an Amazon EC2 Auto Scaling action, or a notification sent to an Amazon SNS topic.
You can create a chain: CloudWatch Alarm -> SNS.
You can use SNS to notify users via SMS or push notifications.
Or you can go one step further, SNS -> SES, to deliver emails.
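For example, here is a rough sketch of that chain with the AWS SDK for Java v1: an alarm on the Errors metric of a hypothetical "etl-transform" Lambda function, with an SNS topic (placeholder ARN) as the alarm action.

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.*;

public class EtlStepAlarm {
    public static void main(String[] args) {
        AmazonCloudWatch cw = AmazonCloudWatchClientBuilder.defaultClient();

        // Go to ALARM whenever the function reports any error in a 5-minute window;
        // the alarm action publishes to an SNS topic, whose subscribers (email, SMS,
        // an SES pipeline, ...) deliver the actual notification.
        cw.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("etl-transform-errors")
                .withNamespace("AWS/Lambda")
                .withMetricName("Errors")
                .withDimensions(new Dimension().withName("FunctionName").withValue("etl-transform"))
                .withStatistic(Statistic.Sum)
                .withPeriod(300)
                .withEvaluationPeriods(1)
                .withThreshold(0.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:sns:eu-west-1:123456789012:etl-failures"));  // placeholder topic ARN
    }
}

The same pattern applies to your Glue, Redshift, or EC2 metrics; only the namespace, metric name, and dimensions change.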

AWS SNS not registering successfully executed action from AWS Cloudwatch

I have created AWS Cloudwatch Alarm on a Target group for my Elastic Load Balancer.
Once the alarm state goes from OK to ALARM, it should send a notification to an AWS SNS topic with a Lambda function as subscriber.
What I have tested:
The lambda function works with a dummy SNS event
Publishing a test message to the SNS Topic results in all subscribers being notified
The CloudWatch Alarm successfully goes into state Alarm.
The CloudWatch Alarm History says "2019-08-04 13:18:00 Action Successfully executed action arn:arn:aws:sns:eu-west-1:something"
It seems like there is something wrong in the link between the CloudWatch Alarm and the SNS Topic. The SNS topic Access Policy has been configured to "Everyone" and still no luck.
The exact settings on my CloudWatch alarm:
Threshold: UnHealthyHostCount > 0 for 1 datapoints within 1 minute
ARN: arn:aws:cloudwatch:eu-west-1:xxxxx:alarm:awsapplicationelb-targetgroup-API-Workers-xxxxxx-High-Unhealthy-Hosts
Namespace: AWS/ApplicationELB
Metric name: UnHealthyHostCount
LoadBalancer: app/Backend-API-HTTPS/xxxxxx
TargetGroup: targetgroup/API-Workers/xxxxx
Statistic: Sum
Period: 1 minute
Datapoints to alarm: 1 out of 1
Missing data treatment: Treat missing data as bad (breaching threshold)
Percentiles with low samples: evaluate
The actions configured for this alarm were selected from the "Select an existing SNS topic" list. I also tried configuring the topic ARN directly. In both cases the action configuration is marked as valid by the AWS console.

AWS Alert to monitor that a key is periodically created in a bucket

I'm using an AWS Lambda function (triggered hourly by a CloudWatch rule) to create an EMR cluster that executes a job. Once the EMR cluster has finished its steps, it writes a result file to an S3 bucket. The key path contains the hour of the day:
/bucket/2017/04/28/00/result.txt
/bucket/2017/04/28/01/result.txt
..
/bucket/2017/04/28/23/result.txt
I want to set up an alert in case the EMR job fails to create result.txt for the hour.
I have already set alerts on the Lambda invocation count and error count, but I haven't found an appropriate alert to verify that the EMR job actually finishes correctly.
Note that the Lambda is triggered 3 minutes past every hour and takes about 15 minutes to complete. Would a good solution be to create another Lambda, triggered 30 minutes past every hour, that checks that the correct key is present in the bucket, and if not, writes some logs to CloudWatch that I could monitor and use to create my alert?
What other way could I achieve this alerting?
S3 offers free metrics on object count per bucket, but doesn't publish often enough for your use case.
CloudWatch Alarm on S3 Request Metrics
For a cost, you can enable CloudWatch request metrics for S3, which write data in 1-minute periods. You could, for example, create a relevant alarm on the following S3 CloudWatch metrics:
PutRequests sum <= 0 over each hour
4xxErrors sum >= 1 over 1 minute
5xxErrors sum >= 1 over 1 minute
The HTTP status code alarms, on much shorter intervals (down to 1 minute), will offer feedback closer to when these failures occur.
CloudWatch Alarm on Put Events
If you don't want to incur the cost of S3 request metrics, you could instead configure an event to publish a message to an SNS topic on S3 put. You can use CloudWatch to set up alerting on the sum of messages published (or lack thereof).
You could then create a CloudWatch alarm based on this topic failing to publish a message.
Dimensions: TopicName = YOURSNSTOPIC
Namespace: AWS/SNS
Metric Name: NumberOfMessagesPublished
Threshold: NumberOfMessagesPublished <= 0 for 60 minutes (4 periods)
Statistic: Sum
Period: 15 minutes
Treat missing data as: breaching
Actions: Send notification to another, separate SNS topic that sends you an email/sms, or otherwise publishes to some alerting service.
Discussion
Note that both CloudWatch solutions have the caveat that they won't fire alerts exactly at 30 minutes past the hour, but they will capture your entire monitoring period.
You may be able to refine these base examples by adjusting the period or how CloudWatch treats missing data.
A Lambda that triggers 30 minutes past the hour (via cron-style scheduling) and checks the S3 request metrics or the SNS topic's "NumberOfMessagesPublished" metric, instead of relying on CloudWatch alarms, could also accomplish this. It may be a better alternative if firing exactly 30 minutes past the hour is important, as a CloudWatch alarm's firing time will not be as precise.
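For the scheduled-Lambda route, here is a minimal sketch that checks for the expected key directly (as proposed in the question) rather than querying metrics; it assumes the AWS SDK for Java v1, and the bucket name and topic ARN are placeholders.

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;

// Scheduled via a CloudWatch Events rule, e.g. cron(30 * * * ? *), i.e. 30 minutes past every hour.
public class ResultFileChecker implements RequestHandler<Object, String> {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final AmazonSNS sns = AmazonSNSClientBuilder.defaultClient();

    @Override
    public String handleRequest(Object input, Context context) {
        // Key layout from the question: yyyy/MM/dd/HH/result.txt; adjust which hour you
        // check to match when your EMR job actually writes its output.
        String hourPrefix = ZonedDateTime.now(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"));
        String key = hourPrefix + "/result.txt";

        if (s3.doesObjectExist("bucket", key)) {   // placeholder bucket name
            return "OK: " + key;
        }
        // Placeholder topic ARN -- point this at whatever alerting topic you use.
        sns.publish("arn:aws:sns:eu-west-1:123456789012:emr-result-missing",
                "result.txt missing for hour " + hourPrefix);
        return "MISSING: " + key;
    }
}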
Further Reading
AWS Documentation - Configuring Amazon S3 Event Notifications
AWS Documentation - SNS CloudWatch Metrics
AWS Documentation - S3 CloudWatch Metrics

How can we monitor a process with CloudWatch

I have a Java process that runs on EC2. I would like to set up an alert in CloudWatch for when the process goes down or is in a bad state (e.g. it has not sent a heartbeat to CloudWatch for the last 10 seconds or so).
What is the best way to do this? I think I need custom metrics, but I did not find any documentation specifically about monitoring a process.
I can use the AWS SDK if needed.
You can write a custom script with ps or jps and push that metric to CloudWatch. But if you are looking for 10-second granularity, CloudWatch is not the right solution, since its minimum granularity is 60 seconds.
From: AWS Resource and Custom Metrics Monitoring
Q: What is the minimum granularity for the data that Amazon CloudWatch receives and aggregates?
The minimum granularity supported by CloudWatch is 1 minute data points. Many metrics are received and aggregated at 1-minute intervals. Some are received at 3-minute or 5-minute intervals.
Though it is possible to create an alarm using the CLI or SDK, I suggest you use the AWS CloudWatch dashboard. Wait for your custom metric to appear there, then click Create Alarm, select your metric, and define your alarm.
Here the metric is named Applications; in your case it will be whatever name you choose. Under Actions, create a new notification and specify your email. Now, if the count goes below 1 for one period, you will get an alarm.
AWS custom metrics can be used to publish the health of the program.
The Java code below can be used to publish the heartbeat; an alarm on the custom metric can then be configured in CloudWatch.
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.*;

// Publish a heartbeat datapoint to a custom namespace; a CloudWatch alarm on this
// metric can then detect when the process stops reporting.
AmazonCloudWatch amazonCloudWatch = AmazonCloudWatchClientBuilder.standard()
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                "monitoring.us-west-1.amazonaws.com", "us-west-1")).build();
PutMetricDataRequest putMetricDataRequest = new PutMetricDataRequest();
putMetricDataRequest.setNamespace("CUSTOM/SQS");
MetricDatum metricDatum1 = new MetricDatum().withMetricName("MessageCount")
        .withDimensions(new Dimension().withName("Personalization").withValue("123"));
metricDatum1.setValue(-1.00);   // heartbeat value; adjust to suit your alarm threshold
metricDatum1.setUnit(StandardUnit.Count);
putMetricDataRequest.getMetricData().add(metricDatum1);
PutMetricDataResult result = amazonCloudWatch.putMetricData(putMetricDataRequest);
The best way to monitor a process is to use the AWS CloudWatch agent's procstat plugin. First create a CloudWatch agent configuration file on the EC2 instance that points at the process's PID file and collects the memory_rss measurement. The idea is that the memory consumption metric will never be less than or equal to zero for a running process.
{
  "agent": {
    "run_as_user": "cwagent"
  },
  "metrics": {
    "metrics_collected": {
      "procstat": [
        {
          "pid_file": "/var/run/sshd.pid",
          "measurement": [
            "cpu_usage",
            "memory_rss"
          ]
        }
      ]
    }
  }
}
Then start the CloudWatch agent and configure the alarm as described in the AWS documentation.

Not getting complete information in the SQS message generated from an AWS CloudWatch alarm

I have configured an alarm on CloudTrail events. The alarm's metric triggers when the logs show that an instance has been terminated. The alarm sends a message to an SNS topic, which in turn delivers it to SQS.
It is all working as of now. However, when I read from SQS I can only see the alarm information, whereas I would like to obtain details of the instance that was terminated. For example, below is what I see:
{
"Type" : "Notification",
"MessageId" : "1744f315-1042-5248-99a8-bd637aac7da4",
"TopicArn" : "arn:aws:sns:us-east-1:873150696559:chefterm",
"Subject" : "ALARM: \"terminatedchefnodes\" in US - N. Virginia",
"Message" : "{\"AlarmName\":\"terminatedchefnodes\",\"AlarmDescription\":\"terminatedchefnodes\",\"AWSAccountId\":\"873150696559\",\"NewStateValue\":\"ALARM\",\"NewStateReason\":\"Threshold Crossed: 1 datapoint (1.0) was greater than the threshold (0.0).\",\"StateChangeTime\":\"2015-09-18T19:40:30.459+0000\",\"Region\":\"US - N. Virginia\",\"OldStateValue\":\"INSUFFICIENT_DATA\",\"Trigger\":{\"MetricName\":\"TestChefMetric\",\"Namespace\":\"CloudTrailMetrics\",\"Statistic\":\"AVERAGE\",\"Unit\":null,\"Dimensions\":[],\"Period\":900,\"EvaluationPeriods\":1,\"ComparisonOperator\":\"GreaterThanThreshold\",\"Threshold\":0.0}}",
"Timestamp" : "2015-09-18T19:40:30.506Z",
"SignatureVersion" : "1",
"Signature" : "XpE8xR8S8sZPW0Yp642c2lpfiqP9qpXg1w8HCiD4YyWLRyHaQSR5RfSBk7yANJOtApw2nIUGRgpWzBE0j5RkfW4cvRrEcRLudAqO2N5QhCJfjvl48/AxWh1qmDiyrHmr0sTpSTg4zPbMQUs7nDRrW1QwQ6cqy04PTNJuZfBNfAXBlJNCkmeyJ8+klq6edmDijMy6M4D8kAUQ+trmTqTO29/jvT0+yOtBWBIOwiRDHxRfNIJ2vOWz8mjvyU43YDYZD1AG3hDBuSbs7li/8jkY7arsK2R5mDBhYI+o/w8D/W7qdBOGJlby1umVHX4mLQBwuOdLmSxN0P34cG9feuqdlg==",
"SigningCertURL" : "https://sns.us-east-1.amazonaws.com/SimpleNotificationService-bb750dd426d95ee9390147a5624348ee.pem",
"UnsubscribeURL" : "https://sns.us-east-1.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-1:873150696559:chefterm:467b007c-bb58-4ad6-b05b-ccd159c0515d"
}
But instead I want to see the instance ID information that was in the CloudTrail logs.
AWS CloudTrail delivers log files to your Amazon S3 bucket approximately every 5 minutes. The delivery of these files can then be used to 'trigger' some code that checks whether a certain activity has occurred. And a good way to run this code is AWS Lambda.
The basic flow is:
AWS CloudTrail creates a log file in Amazon S3
This triggers a call to AWS Lambda, with custom code that can determine whether the event is of interest
The custom code can publish a message to Amazon SNS, which can deliver it via email, HTTP, etc.
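A rough sketch of what that custom-code step could look like for terminated instances, assuming the AWS SDK for Java v1 and Jackson (the class name, method, and topic ARN are just placeholders):

import java.util.zip.GZIPInputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class TerminatedInstanceNotifier {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final AmazonS3 S3 = AmazonS3ClientBuilder.defaultClient();
    private static final AmazonSNS SNS = AmazonSNSClientBuilder.defaultClient();

    // Call this from the S3-triggered Lambda handler with the bucket/key of the new CloudTrail log file.
    public static void processLogFile(String bucket, String key) throws Exception {
        // CloudTrail log files are gzipped JSON documents with a top-level "Records" array.
        JsonNode records;
        try (GZIPInputStream in = new GZIPInputStream(S3.getObject(bucket, key).getObjectContent())) {
            records = MAPPER.readTree(in).path("Records");
        }

        for (JsonNode event : records) {
            if (!"TerminateInstances".equals(event.path("eventName").asText())) {
                continue;   // not an event of interest
            }
            // Instance IDs are under requestParameters.instancesSet.items[].instanceId.
            for (JsonNode item : event.path("requestParameters").path("instancesSet").path("items")) {
                SNS.publish("arn:aws:sns:us-east-1:123456789012:chefterm-details",   // placeholder topic ARN
                        "Instance terminated: " + item.path("instanceId").asText()
                                + " at " + event.path("eventTime").asText());
            }
        }
    }
}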
Here are two articles that describe such a setup:
AWS Lambda Walkthrough 5: Handling AWS CloudTrail Events
How to Receive Alerts When Specific APIs Are Called by Using AWS CloudTrail, Amazon SNS, and AWS Lambda