How do I set up CloudWatch to detect when an EC2 instance goes down?

I've got an app running on AWS. How do I set up Amazon CloudWatch to notify me when the EC2 instance fails or is no longer responsive?
I went through the CloudWatch screens, and it appears that you can monitor certain statistics, like CPU or disk utilization, but I didn't see a way to monitor an event like "the instance got an http request and took more than X seconds to respond."

Amazon's Route 53 Health Check is the right tool for the job.
Route 53 can monitor the health and performance of your application as well as your web servers and other resources.
You can set up HTTP resource checks in Route 53 that will trigger an e-mail notification if the server is down or responding with an error.
http://eladnava.com/monitoring-http-health-email-alerts-aws/

To monitor an event in CloudWatch you create an Alarm, which monitors a metric against a given threshold.
When creating an alarm you can add an "action" for sending a notification. AWS handles notifications through SNS (Simple Notification Service). You can subscribe to a notification topic and then you'll receive an email for your alarm.
For EC2 metrics like CPU or disk utilization this is the guide from the AWS docs: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/US_AlarmAtThresholdEC2.html
As answered already, use an ELB to monitor HTTP.
This is the list of available metrics for ELB:
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/US_MonitoringLoadBalancerWithCW.html#available_metrics
To answer your specific question, for monitoring X seconds for the http response, you would set up an alarm to monitor the ELB "Latency".
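As a rough sketch of the latency alarm described above, here is the parameter object you might pass to CloudWatch's PutMetricAlarm call (for example `putMetricAlarm` in the AWS SDK). The alarm name, period, evaluation count, load balancer name and topic ARN are all placeholder assumptions; the snippet only builds the object and does not call AWS.

```javascript
// Builds the parameter object for CloudWatch's PutMetricAlarm API.
// Nothing is sent to AWS here; pass the result to your SDK client's putMetricAlarm call.
function buildLatencyAlarmParams(loadBalancerName, thresholdSeconds, snsTopicArn) {
  return {
    AlarmName: `${loadBalancerName}-high-latency`, // hypothetical naming scheme
    Namespace: 'AWS/ELB',
    MetricName: 'Latency',
    Dimensions: [{ Name: 'LoadBalancerName', Value: loadBalancerName }],
    Statistic: 'Average',
    Period: 60,                  // seconds per datapoint (assumed)
    EvaluationPeriods: 3,        // consecutive breaching periods before ALARM (assumed)
    Threshold: thresholdSeconds, // the classic ELB Latency metric is reported in seconds
    ComparisonOperator: 'GreaterThanThreshold',
    AlarmActions: [snsTopicArn]  // SNS topic that sends the email
  };
}

// Placeholder values for illustration only.
const latencyAlarm = buildLatencyAlarmParams(
  'my-elb', 2, 'arn:aws:sns:us-east-1:123456789012:ops-alerts');
console.log(JSON.stringify(latencyAlarm, null, 2));
```

Here "X seconds" from the question maps directly to `Threshold`.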

CloudWatch monitoring is just like you have discovered. You will be able to infer that one of your instances is frozen by taking a look at the metrics, but CloudWatch won't, for example, send you an email when your app is down or too slow.
If you are looking for some sort of notification when your app or instance is down, I suggest you use a monitoring service. Pingdom is a good option. You can also set up a new instance on AWS and install a monitoring tool, like Nagios, which would be my preferred option.
Good practices that always pay off in the long run: load balancing (Amazon ELB), running more than one instance of your app, Auto Scaling (when an instance goes down, Amazon automatically starts a new one and maintains your SLA), and custom monitoring.
My team has used a custom monitoring script for a long time, and we always knew of failures as soon as they occurred. Basically, if we had two nodes running our app, node 1 sent HTTP requests to node 2 and node 2 to 1. If any request took more than expected, or returned an unexpected HTTP status or response body, the script sent an email to the system admins. Nowadays, we rely on more robust approaches, like Nagios, which can even monitor operating system stuff (threads, etc), application servers (connection pools health, etc) and so on. It's worth every cent invested in setting it up.
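The decision logic of a peer-check script like the one described above might look roughly like this. It is not the original script; the field names (`elapsedMs`, `maxElapsedMs`, `bodyContains`) are hypothetical, and the HTTP request and the email sending are left out.

```javascript
// Decides whether a single HTTP probe of a peer node should trigger an alert.
// `result` is what the probe observed; `expected` holds the limits (all names are hypothetical).
function evaluateCheck(result, expected) {
  const problems = [];
  if (result.elapsedMs > expected.maxElapsedMs) {
    problems.push(`slow response: ${result.elapsedMs} ms`);
  }
  if (result.status !== expected.status) {
    problems.push(`unexpected status: ${result.status}`);
  }
  if (!result.body.includes(expected.bodyContains)) {
    problems.push('unexpected response body');
  }
  return problems; // an empty array means the peer looks healthy
}

const expected = { maxElapsedMs: 2000, status: 200, bodyContains: 'OK' };
const healthy = evaluateCheck({ elapsedMs: 150, status: 200, body: 'OK' }, expected);
const failing = evaluateCheck(
  { elapsedMs: 5200, status: 503, body: 'Service Unavailable' }, expected);
console.log(healthy.length, failing);
```

A non-empty result is what you would put in the email to the system admins.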

CloudWatch recently added "status check" metrics that will answer one of your questions: whether an instance is down or not. It does not send a request to your web server but rather performs a system-level check. As previous answers suggest, use ELB for HTTP health checks.

You could always have another instance for tools/testing. That instance would send the HTTP request on a schedule and measure the response time. You could then publish that response time to CloudWatch as a custom metric and set an alarm for when it goes over a certain threshold.
You could even do that from the instance itself.
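A minimal sketch of the publishing step, assuming you have already measured the response time yourself: this builds the parameter object for CloudWatch's PutMetricData call. The namespace, metric name and dimension are made-up examples, and nothing is actually sent to AWS here.

```javascript
// Builds the parameter object for CloudWatch's PutMetricData API from one timing measurement.
function buildResponseTimeMetric(url, elapsedMs, when) {
  return {
    Namespace: 'Custom/HttpChecks',  // hypothetical custom namespace
    MetricData: [{
      MetricName: 'ResponseTime',    // hypothetical metric name
      Dimensions: [{ Name: 'Url', Value: url }],
      Timestamp: when,
      Unit: 'Milliseconds',
      Value: elapsedMs
    }]
  };
}

// Example measurement: 340 ms for a placeholder URL.
const metric = buildResponseTimeMetric('http://example.com/status', 340, new Date());
console.log(JSON.stringify(metric, null, 2));
```

An alarm on `ResponseTime` in that namespace would then cover the "took more than X seconds" case.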

As Kurst Ursan mentioned above, using "Status Check" metrics is the way to go. In some cases you won't be able to browse those metrics (e.g. if you're using AWS OpsWorks), so you'll have to report that custom metric on your own. However, you can set up an alarm built on a metric that always matches (so it stays in the OK state) and have the alarm trigger when the state changes to INSUFFICIENT_DATA. Technically this means CloudWatch can't tell whether the state is OK or ALARM because it can't reach your instance, i.e. your instance is offline.
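One way this could be expressed, as a sketch only: PutMetricAlarm accepts `InsufficientDataActions` alongside `AlarmActions`, so the same SNS topic can be notified both when the status check fails and when data stops arriving. The alarm name, period and evaluation count below are assumptions, and the snippet only builds the parameter object.

```javascript
// Parameter object for PutMetricAlarm on the EC2 status check metric.
// Notifies on ALARM (check failed) and on INSUFFICIENT_DATA (no datapoints arriving).
function buildInstanceDownAlarmParams(instanceId, snsTopicArn) {
  return {
    AlarmName: `${instanceId}-status-check`, // hypothetical naming scheme
    Namespace: 'AWS/EC2',
    MetricName: 'StatusCheckFailed',         // 0 = passed, 1 = failed
    Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
    Statistic: 'Maximum',
    Period: 60,                              // assumed
    EvaluationPeriods: 2,                    // assumed
    Threshold: 1,
    ComparisonOperator: 'GreaterThanOrEqualToThreshold',
    AlarmActions: [snsTopicArn],
    InsufficientDataActions: [snsTopicArn]
  };
}

// Placeholder instance ID and topic ARN.
const downAlarm = buildInstanceDownAlarmParams(
  'i-0123456789abcdef0', 'arn:aws:sns:us-east-1:123456789012:ops-alerts');
console.log(JSON.stringify(downAlarm, null, 2));
```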

There are a bunch of ways to get instance health info. Here are a couple.
Watch for instance status checks and EC2 events (planned downtime) in the EC2 API. You can poll those and send them to CloudWatch to create an alarm.
Create a simple daemon on the server which writes to DynamoDB every second (this has better granularity than CloudWatch). Have a second process query the heartbeats and alert when they are missing.
Put all instances in a load balancer with a dummy port open that gives a TCP response. Set up TCP health checks on the ELB, and alert on unhealthy instances.
Unless you use a product like Blue Matador (automatically notifies you of production issues), it's actually quite heinous to set something like this up, let alone maintain it. That said, if you're going down that road and want some help getting started with CloudWatch (terminology, alerts, logs, etc.), start with this blog: How to Monitor Amazon EC2 with CloudWatch
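For the heartbeat option above, the checking process boils down to comparing each instance's last heartbeat with the current time. A minimal sketch, assuming the heartbeats have already been read from the table into a plain object (the shape of that object and the 5-second gap are assumptions):

```javascript
// Given the last heartbeat timestamp (ms since epoch) per instance, list the instances
// whose most recent heartbeat is older than the allowed gap.
function findMissingHeartbeats(lastSeenByInstance, nowMs, maxGapMs) {
  return Object.entries(lastSeenByInstance)
    .filter(([, lastSeenMs]) => nowMs - lastSeenMs > maxGapMs)
    .map(([instanceId]) => instanceId);
}

const now = 1000000; // fixed "current time" so the example is deterministic
const lastSeen = { 'i-aaa': now - 500, 'i-bbb': now - 15000 };
const missing = findMissingHeartbeats(lastSeen, now, 5000);
console.log(missing);
```

Anything returned here is a candidate for an alert.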

You can use a CloudWatch Events rule to monitor whenever any EC2 instance goes down. You can create an event rule from the CloudWatch console as follows:
In the CloudWatch console, choose Events -> Rules
For Event Pattern, in Service Name choose EC2
For Event Type, choose EC2 Instance State-change Notification
For Specific States, choose Stopped
In Targets, choose any previously created SNS topic for sending a notification
Source : Create a Rule - https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/CloudWatch-Events-Input-Transformer-Tutorial.html#input-transformer-create-rule
This is not exactly a CloudWatch alarm, but it serves the purpose of monitoring/notification.
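For reference, the console choices above correspond to an event pattern along these lines. The matcher function is only a simplified local illustration for this one pattern shape, not the service's full matching logic, and the sample event is trimmed down to the fields used.

```javascript
// Event pattern equivalent to: Service = EC2, Event Type = state-change, State = stopped.
const stoppedInstancePattern = {
  source: ['aws.ec2'],
  'detail-type': ['EC2 Instance State-change Notification'],
  detail: { state: ['stopped'] }
};

// Simplified local matcher for this specific pattern shape (illustration only).
function matchesPattern(event, pattern) {
  return pattern.source.includes(event.source) &&
    pattern['detail-type'].includes(event['detail-type']) &&
    pattern.detail.state.includes(event.detail.state);
}

// Trimmed-down sample event with a placeholder instance ID.
const sampleEvent = {
  source: 'aws.ec2',
  'detail-type': 'EC2 Instance State-change Notification',
  detail: { 'instance-id': 'i-0123456789abcdef0', state: 'stopped' }
};
const isStopped = matchesPattern(sampleEvent, stoppedInstancePattern);
console.log(isStopped);
```

Adding more values to the `state` array (for example `terminated`) would widen what the rule catches.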

Related

Beats installation on AWS-ec2 to send to on-prem ELK

I have to set up JBoss on an AWS EC2 Windows server, which will also scale up as requirements dictate. We use ELK for infrastructure monitoring, so we will be installing Beats there to send the data to our on-prem Logstash. On the Logstash side we onboard servers with their hostname and IP.
Now the problem is: how can we achieve this when autoscaling is involved?
Please advise.
Thanks,
Abhishek
You could create one EC2 instance, build an AMI from it, and have the Auto Scaling group launch instances from that AMI. That way the config can be part of the image.
If by onboarding you mean adding the server to the allowed list, you could use Direct Connect or a VPC with a custom CIDR block and add that subnet to the allowed list in advance.
AFAIK you need to change the Logstash config file on disk to add new hosts, and it should notice the updated config automatically and "just work".
I would suggest a local script that can read/write the config file and that polls an SQS queue, "listening" for autoscaling events. You can have your ASG send SNS messages when it scales and then subscribe an SQS queue to receive them. Messages will be retained for up to 14 days and there are options to add delays if required. The message you receive from SQS will indicate the region, instance ID and operation (launched or terminated), from which you can look up the IP address/hostname to add to or remove from the config file (and the message should be deleted from the queue once processed successfully). Editing the config file is just simple string operations to locate the right line and insert the new one. This approach only requires outbound HTTPS access for your local script to work and some IAM permissions, but there is a (probably trivial) cost implication.
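The message-handling part of such a script might look roughly like this. It assumes the SQS queue is subscribed to the SNS topic without raw message delivery, so the body is an SNS envelope whose `Message` field holds the Auto Scaling notification as a JSON string; the group name and instance ID are placeholders, and polling, IP lookup and config editing are left out.

```javascript
// Extracts the instance ID and operation from one SQS message body.
// Returns null for notifications that are neither a launch nor a terminate.
function parseScalingMessage(sqsBody) {
  const envelope = JSON.parse(sqsBody);              // SNS envelope delivered to SQS
  const notification = JSON.parse(envelope.Message); // Auto Scaling notification
  const operations = {
    'autoscaling:EC2_INSTANCE_LAUNCH': 'launched',
    'autoscaling:EC2_INSTANCE_TERMINATE': 'terminated'
  };
  const operation = operations[notification.Event];
  if (!operation) return null; // e.g. test notifications or failed launches
  return { instanceId: notification.EC2InstanceId, operation };
}

// Hand-built sample body with placeholder values.
const sampleBody = JSON.stringify({
  Type: 'Notification',
  Message: JSON.stringify({
    Event: 'autoscaling:EC2_INSTANCE_LAUNCH',
    EC2InstanceId: 'i-0123456789abcdef0',
    AutoScalingGroupName: 'jboss-asg'
  })
});
const parsed = parseScalingMessage(sampleBody);
console.log(parsed);
```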
Another option is a UserData script that's executed on each instance at startup (part of the Launch Template of your Auto Scaling group). Exactly how it might communicate with your on-prem environment depends on your architecture/capabilities; anything's possible. You could write a simple web service to manage the config file and have the instances call it, but that's a lot more effort and somewhat risky in my opinion.
FYI: if you use SQS, look at long polling if you're checking the queues frequently or want the message to propagate as quickly as possible (TL;DR: faster and cheaper than polling any more than twice a minute). It's good practice to use a dead-letter queue with SQS; messages that get retrieved but not removed from the queue end up there. Set up alarms on the queue and the dead-letter queue to alert you via email if messages are failing to be processed or are not getting picked up in sensible time (i.e. your script has crashed, etc.).

Difference between AWS CloudWatch and AWS CloudWatch Events

I was studying Amazon Web Services fundamentals when I came across these two concepts:
Amazon CloudWatch
Amazon CloudWatch Events
Even after going through the official AWS documentation, I couldn't find the difference between the two, even though Amazon says they are different. The excerpts are:
CloudWatch provides you with data and actionable insights to monitor
your applications, respond to system-wide performance changes,
optimize resource utilization, and get a unified view of operational
health. CloudWatch collects monitoring and operational data in the
form of logs, metrics, and events, providing you with a unified view
of AWS resources, applications, and services that run on AWS and
on-premises servers. You can use CloudWatch to detect anomalous behavior in your environments, set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep your applications
running smoothly.
Documentation of AWS CloudWatch
Amazon CloudWatch Events delivers a near real-time stream of system
events that describe changes in Amazon Web Services (AWS) resources.
Using simple rules that you can quickly set up, you can match events
and route them to one or more target functions or streams. CloudWatch
Events becomes aware of operational changes as they occur. CloudWatch
Events responds to these operational changes and takes corrective
action as necessary, by sending messages to respond to the
environment, activating functions, making changes, and capturing
state information.
Documentation of AWS CloudWatch Events
CloudWatch
CloudWatch is a monitoring service for your AWS resources. You can store your log files in it; by default, resources created within AWS log to CloudWatch (CW). You can also monitor resource performance, for example the CPU utilisation of your EC2 instances. You can set alarms on thresholds for your resources and get an SNS alert when they are crossed. For example, you can create an alarm for your DynamoDB table if write capacity is being exceeded. You can set an alarm for your billing too. So basically CW is used as a monitoring solution.
CloudWatch Events
CW Events is also part of CloudWatch. CloudWatch Events is helpful when you want to schedule something. Say you want to run your Lambda every other day: you can create a rule for that. You can also trigger your Lambda with an event pattern. CloudWatch Events supports a number of services as targets, not just Lambda, and you can use any of them. Event buses are used to send your events to other accounts as well. For example, if you have a CI/CD account where you bake a new AMI every month, you can use event buses to notify all accounts; after receiving the event, the other accounts can trigger their own tasks.
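To make the scheduling case concrete, here is a sketch of the two parameter objects you might pass to the PutRule and PutTargets calls for an "every other day" Lambda. The rule name, target ID and function ARN are placeholders, nothing is sent to AWS, and the Lambda would separately need permission to be invoked by the rule.

```javascript
// Builds parameters for PutRule (the schedule) and PutTargets (what to invoke).
function buildScheduledRule(ruleName, lambdaArn) {
  return {
    rule: {
      Name: ruleName,
      ScheduleExpression: 'rate(2 days)', // run every other day
      State: 'ENABLED'
    },
    targets: {
      Rule: ruleName,
      Targets: [{ Id: 'lambda-target', Arn: lambdaArn }] // Id is an arbitrary label
    }
  };
}

// Placeholder rule name and function ARN.
const scheduled = buildScheduledRule(
  'run-every-other-day',
  'arn:aws:lambda:us-east-1:123456789012:function:my-function');
console.log(JSON.stringify(scheduled, null, 2));
```

Swapping `ScheduleExpression` for an `EventPattern` gives the event-driven variant instead of the scheduled one.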

Detect thrashing on AWS Auto Scale Group

Sometimes if there are conditions that prevent the app from starting, say a bad config, the auto scaler will continue to start up instances one after the other.
Anybody know of a good way to alert on this?
Most of our servers receive network traffic so we put a CloudWatch monitor on the NetworkIn metric.
I would suggest configuring the start-up script to Terminate/Shutdown the instance upon failure and sending an alert using CloudWatch custom metrics or any other service like NewRelic.
I don't think there is a way to tell the Auto Scaling group to stop spinning up instances. You could set the max instances limit and have an alert upon reaching this number.
You could alert based on the CloudWatch metric:
Auto Scaling / Group Metrics / GroupTerminatingInstances
See the doc page for more details

Amazon EC2 ELB alarm - which instance is unhealthy?

We have hosted some apps on Amazon EC2 and are using an Elastic Load Balancer (ELB) to manage several instances of one app. Also, we have set up ELB alarms to get notified about Unhealthy Hosts, i.e. when an instance has gone down.
So far, I could not figure out where to check which instance exactly has gone down when the alarm goes off, except for the ELB status page in the AWS console. However, if the instance comes back to In Service state again, this won't help me either.
The e-mail notification sent out by the ELB does not contain this information; and I couldn't find it anywhere in the alarms history in the console either.
Is there a way to tell which instance an ELB alarm has been triggered for, even if the instance has come back into OK state in the meantime?
Cheers, Alex
Sadly Amazon does not provide a health check log, so it's impossible to find out afterwards which instance failed the health check, assuming the server is no longer unhealthy. You can only use per-AZ metrics to know which AZ the instance is in.
But you could find out which instance is down if you query the AWS API during the issue. So I have thought of a possible workaround:
Set up a new SNS topic, and add an HTTP action to a custom URL that triggers a job that enumerates the instances and sends you that info by mail.
Then set up a CloudWatch alarm for UnHealthyHostCount > 0 and set its action to the SNS topic.
The difficult part is that your URL should handle the SNS subscription & confirmation described here.
The command to know which instance is currently OutOfService is:
elb-describe-instance-health *LoadBalancerName* --region *YourRegion*
You could probably use the AWS SDK gem or other AWS library that can get status. Use it to create a cron task that regularly gets the status of each instance and records it somewhere. Either that will give you what you need or the disappearance of the status for one instance will tell you which one went bad.
We are using the following Lambda function to make up for the lack of Health Check logging:
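    'use strict';
    const AWS = require('aws-sdk');
    const elb = new AWS.ELB();

    // Logs the health of every instance behind the load balancer each time it runs.
    exports.handler = (event, context, callback) => {
        const params = {
            LoadBalancerName: "<elb_name_here>"
        };
        elb.describeInstanceHealth(params, (err, data) => {
            if (err) {
                console.log(err, err.stack); // an error occurred
                return callback(err);        // report the failure to Lambda
            }
            console.log(data);               // successful response
            callback(null, data);            // signal that the invocation finished
        });
    };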
It does not produce the prettiest logs in CloudWatch, but the data is there. It allows us to see if there is a particular instance which tends to drop more often, etc. It is set up much like Gerardo Grignoli's answer above. I added a CloudWatch alarm to send an SNS message to the Lambda function when the alarm was triggered. It doesn't do anything with the message itself - the message is merely the triggering mechanism for the Lambda function to run and log the instance status.
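If you want more readable log lines, one possible tweak is to reduce the response to just the instances that are not in service before logging. This is a sketch rather than part of the function above; it assumes the usual `InstanceStates` array in the `describeInstanceHealth` response, and the sample data is hand-written.

```javascript
// Turns a describeInstanceHealth response into one line per instance that is not InService.
function summarizeInstanceHealth(data) {
  return data.InstanceStates
    .filter((s) => s.State !== 'InService')
    .map((s) => `${s.InstanceId}: ${s.State} (${s.Description})`);
}

// Hand-written sample response with placeholder instance IDs.
const sampleResponse = {
  InstanceStates: [
    { InstanceId: 'i-aaa', State: 'InService', ReasonCode: 'N/A', Description: 'N/A' },
    { InstanceId: 'i-bbb', State: 'OutOfService', ReasonCode: 'Instance',
      Description: 'Instance has failed health checks.' }
  ]
};
const unhealthy = summarizeInstanceHealth(sampleResponse);
unhealthy.forEach((line) => console.log(line));
```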
No. The ELB metrics in CloudWatch do not provide that level of detail, and IMHO from a design perspective they should not. If a host is unhealthy, the monitoring on that specific host should report the details, not the ELB. If a node goes out of service in the ELB, that should not be a problem for the ELB. That said, at the load balancer level it does make sense to define an alarm state for when, say, 3 out of 6 of your machines are out of service. Take a look at the CloudWatch metrics.
Go to Load Balancers in the console and find the load balancer in question. Then look at the instances that are OutOfService.

Can I use AWS CloudWatch to hit a status URI?

Is it possible to use CloudWatch or other AWS services to hit a URI, e.g. www.mysite.com/status, and send me error alerts when that doesn't return a 200 result? I want service-level monitoring for a small site (and don't want to do any work).
Ideally, I'd like to hit the /status endpoint on a particular EC2 host, with the HTTP hostname parameter set.
Thanks in advance.
edit: I recall something similar is available in auto-scaling groups, where hosts are automatically taken down if they don't meet health checks. I'm looking for something similar, but I just want email, not hosts taken down. (Since I'm working on small sites on a shared host.)
You can't do it directly from CloudWatch, but you could set up a monitor on a separate server, construct the test, and then send a custom metric to CloudWatch using the CLI tools. Custom metrics (and the CloudWatch CLI) are covered here:
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html
From a separate server you could then run a simple script which tries to load your health page, and sends 0 for healthy, 1 for unhealthy, or whatever works for you, to CloudWatch.
Doing this with CloudWatch and SNS is not straightforward. You could do it with Route 53 and DNS failover, but for what you need, have a look at Pingdom. They have a free plan somewhere if you search for it.