AWS Auto Scaling adds an additional node, resulting in 5xx errors

I am new to AWS. I have auto scaling enabled for my Elastic Beanstalk based server. For some reason the healthd process below is consuming almost all of the CPU, which causes the auto scaler to add a new instance, since I have set the scaling policy to add a new instance to the Beanstalk environment when resource utilization is above 70%.
healthd 20 0 1024648 43660 9876 S 75.7 1.1 121:16.03 ruby
I have two questions:
How can I avoid 5xx (network) errors when a new instance is added?
What is the healthd process needed for? Why is it running when I didn't start it, and how can I prevent it from draining the CPU?
Maybe the load balancer starts sending traffic to the new instance before the application is fully up on that instance, and that is why I am getting the network errors. How can I verify that this is the cause, and how can I avoid it?
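One way to test whether this is the cause is to point the environment's health check at the application itself instead of a plain TCP port, so the load balancer only sends traffic to a new instance once the app actually responds. A minimal sketch using the AWS CLI, where the environment name my-env and the /health path are placeholders:

# make the load balancer's health check hit an application URL instead of a bare TCP check
aws elasticbeanstalk update-environment \
  --environment-name my-env \
  --option-settings "Namespace=aws:elasticbeanstalk:application,OptionName=Application Healthcheck URL,Value=/health"

If the 5xx errors still appear after that, the HealthCheckGracePeriod setting shown in the related answer below is the other knob worth looking at.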

Related

Why is Elastic Beanstalk Traffic Splitting deploy strategy ignoring HTTP errors?

I am using AWS Elastic Beanstalk. In there, I selected a Traffic Splitting deploy strategy, with a 100% split (so that 100% of new instances will have the new version and have their health evaluated).
Here's how (according to their documentation) that is supposed to work:
During a traffic-splitting deployment, Elastic Beanstalk creates a new set of instances in a separate temporary Auto Scaling group. Elastic Beanstalk then instructs the load balancer to direct a certain percentage of your environment's incoming traffic to the new instances. Then, for a configured amount of time, Elastic Beanstalk tracks the health of the new set of instances. If all is well, Elastic Beanstalk shifts remaining traffic to the new instances and attaches them to the environment's original Auto Scaling group, replacing the old instances. Then Elastic Beanstalk cleans up—terminates the old instances and removes the temporary Auto Scaling group.
And more specifically:
Rolling back the deployment to the previous application version is quick and doesn't impact service to client traffic. If the new instances don't pass health checks, or if you choose to abort the deployment, Elastic Beanstalk moves traffic back to the old instances and terminates the new ones.
However, it seems silly that it only looks at my internal /health health checks, and not at the environment's overall health status derived from the HTTP status codes it already has information on.
I tried the following scenario:
Deploy a new version.
As soon as the "health evaluation period" begins, flood the server with error 500s (from an endpoint I made specifically for this purpose).
AWS then marks all my instances as "degraded" and "unhealthy", but then seems to ignore that and carries on anyway.
See the following two log dump screenshots (they are oldest-first).
Is there any way I can make AWS respect the HTTP-status-based health checks it already performs during a traffic split? Or am I bound to rely entirely on custom-developed health checks?
Update 1: Even weirder, I tried making my own health checks always fail too, but it still deploys the new version despite the failing health check!
Update 2: I noticed that the temporary Auto Scaling group it creates while assessing health only has an "EC2" type health check, not "ELB". I think that might be the root cause. If I could only get it to use "ELB" instead.
That is interesting! I cannot say whether setting the health check type to "ELB" will do the job, because we use CodeDeploy, which has far better rollback capabilities than AWS Elastic Beanstalk.
However, there is a well-documented way in the docs [1] to apply the setting you are looking for:
[...] By default, the Auto Scaling group created for your environment uses Amazon EC2 status checks. If an instance in your environment fails an Amazon EC2 status check, Auto Scaling takes it down and replaces it.
Amazon EC2 status checks only cover an instance's health, not the health of your application, server, or any Docker containers running on the instance. If your application crashes, but the instance that it runs on is still healthy, it may be kicked out of the load balancer, but Auto Scaling won't replace it automatically. [...]
If you want Auto Scaling to replace instances whose application has stopped responding, you can use a configuration file to configure the Auto Scaling group to use Elastic Load Balancing health checks. The following example sets the group to use the load balancer's health checks, in addition to the Amazon EC2 status check, to determine an instance's health.
Example .ebextensions/autoscaling.config
Resources:
  AWSEBAutoScalingGroup:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
It does not mention the new traffic splitting deployment feature, though.
Thus, I cannot confirm this is the actual solution, but at least you can give it a shot.
[1] https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/environmentconfig-autoscaling-healthchecktype.html
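If you try it, one way to confirm the setting took effect is to look up the Auto Scaling group that Elastic Beanstalk created and read back its health check type. A rough sketch, assuming the environment is named my-env:

# find the Auto Scaling group Elastic Beanstalk manages for the environment (my-env is a placeholder)
asg_name=$(aws elasticbeanstalk describe-environment-resources \
  --environment-name my-env \
  --query 'EnvironmentResources.AutoScalingGroups[0].Name' --output text)

# should print ELB and the configured grace period once the config file has been deployed
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$asg_name" \
  --query 'AutoScalingGroups[0].[HealthCheckType,HealthCheckGracePeriod]' --output text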
Once upon a time I thought that the Immutable Deployment option in Elastic Beanstalk was a holy panacea -- but it only works when a deployment involves no changes to the application's database schema.
We've now resorted to blue-green deployments. However, this only works if you control the DNS. If you are a SaaS solution and you allow customers to create a CNAME, then blue/green often fails spectacularly because the enterprise: a) sets a very high TTL, and/or b) has internal DNS or firewalls that cache the underlying IP addresses of the ALB (which are dynamic and, of course, replaced when you swap the URLs of the blue and green environments).
Traffic splitting is listed as an option in the Elastic Beanstalk documentation.
But it's not actually available in the configuration section of the console.
This wouldn't be the first time I've seen Elastic Beanstalk's docs be out of date, so it could be that AWS has removed the feature.
Since AWS introduced CodeStar I suspect Elastic Beanstalk is getting the cold shoulder.

EC2 instances getting removed from Elastic Beanstalk

EB dashboard:
Removed instance [i-0c6e4cba4392d1ace] from your environment.
And if I'm on the EC2 instance, I get these messages on console:
Broadcast message from root@ip-172-31-20-119
(unknown) at 21:20 ...
The system is going down for power off NOW!
Connection to 54.186.171.133 closed by remote host.
Connection to 54.186.171.133 closed.
Any pointers on why this is happening and how I can debug it? Are there any logs I can look at after the instance has terminated?
It is likely that the Auto Scaling group associated with your Elastic Beanstalk application decided to scale in the number of instances.
You can go to Auto Scaling in the EC2 console, find the Auto Scaling group, and look at the History tab to determine why it happened (e.g. due to low CPU load).
It might also be because the instance failed a Health Check, so Auto Scaling removed it.
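The same information is available from the CLI if you prefer scripting it; this is just a sketch, with my-asg standing in for the group name Elastic Beanstalk created:

# list recent scaling activities together with the cause Auto Scaling recorded
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-asg \
  --max-items 10 \
  --query 'Activities[].[StartTime,StatusCode,Cause]' --output table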

ELB backend connection errors when deregistering EC2 instances

I've written a custom release script to manage releases for an EC2 auto scaling application. The process works like this (steps 4-7 are sketched below):
1. Create an AMI based on an application git tag.
2. Create a launch config.
3. Configure the ASG to use the new launch config.
4. Find the current desired capacity for the ASG.
5. Set the desired capacity to 2x the previous capacity.
6. Wait for the new instances to become healthy by querying the ELB.
7. Set the desired capacity back to the previous value.
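Roughly, steps 4-7 look like this (a simplified sketch; it assumes a classic ELB and that $asg_name and $elb_name are already set):

# steps 4-5: read the current desired capacity and double it
current=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$asg_name" \
  --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity $((current * 2))

# step 6: wait until the ELB reports the doubled fleet as InService
until [ "$(aws elb describe-instance-health --load-balancer-name "$elb_name" \
    --query "length(InstanceStates[?State=='InService'])" --output text)" -ge $((current * 2)) ]; do
  sleep 15
done

# step 7: scale back down to the previous capacity
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity "$current"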
This all works fairly well, except that whenever I run it, the ELB monitoring shows a lot of backend connection errors.
I don't know why this would be occurring, as (based on my understanding) the ELB should still service existing connections when the "Connection draining" option is enabled for the ELB (which it is).
I thought perhaps the ASG was terminating the instances before the connections could finish, so I changed my script to first deregister the instances from the ELB and then wait a while before changing the desired capacity on the ASG. This didn't make any difference, however. As soon as the instances were deregistered from the ELB (even though they were still running and healthy), the backend connection errors occurred.
It seems as though the ELB is ignoring the connection draining option and simply dropping connections as soon as an instance is deregistered.
This is the command I'm using to deregister the instances...
aws elb deregister-instances-from-load-balancer --load-balancer-name $elb_name --instances $old_instances
Is there some preferred method to gracefully remove the instances from the ELB before removing them from the ASG?
Further investigation suggests that the backend connection errors occur because the new instances aren't yet ready to take the full load when the old instances are removed from the ELB. They're healthy, but seem to need a bit more warming up.
I'm working on tweaking the health check settings to give the instances a bit more time before they start serving requests. I may also need to change the apache2 settings so they are ready sooner.
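A sketch of one way to confirm draining is enabled and to wait for the deregistered instances to actually drop out of the ELB before shrinking the ASG (it reuses $elb_name and $old_instances from the command above, and assumes the ELB keeps reporting an instance until draining finishes):

# confirm connection draining is enabled and check its timeout
aws elb describe-load-balancer-attributes \
  --load-balancer-name "$elb_name" \
  --query 'LoadBalancerAttributes.ConnectionDraining'

aws elb deregister-instances-from-load-balancer \
  --load-balancer-name "$elb_name" --instances $old_instances

# wait until none of the old instances are still reported by the load balancer
for id in $old_instances; do
  while aws elb describe-instance-health --load-balancer-name "$elb_name" \
      --query 'InstanceStates[].InstanceId' --output text | grep -qw "$id"; do
    sleep 10
  done
done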

ELB always reports instances as InService

I am using an AWS ELB to report the status of my instances to an Auto Scaling group, so that a non-functional instance gets terminated and replaced by a new one. The ELB is configured to ping TCP:3000 every 60 seconds and to wait for a timeout of 10 seconds before counting it as a health check failure. The unhealthy threshold is 5 consecutive failed checks.
However, the ELB always reports my instances as healthy and InService, even though I periodically come across an instance that is timing out and have to terminate it manually and launch a new one, with the ELB reporting it as InService the whole time.
Why does this happen?
After investigating a little, I found the cause.
I was trying to assess the health of the app through an API call to a web app running on the instance, waiting for the response to time out to declare the instance faulty. I needed to use HTTP as the protocol and call port 3000 with a custom path through the load balancer, instead of TCP.
Note: the API needs to return a status code of 200 for the load balancer to consider the instance healthy. It now works perfectly.
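In case it helps anyone else, the change boils down to something like the following (my-elb and /status are placeholders, the interval, timeout and unhealthy threshold mirror the values above, and the healthy threshold of 2 is just an example):

# switch the health check from a plain TCP:3000 ping to an HTTP check with a custom path
aws elb configure-health-check \
  --load-balancer-name my-elb \
  --health-check Target=HTTP:3000/status,Interval=60,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2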

EC2 Auto Scaling

I have one EC2 instance running a Tomcat service. I know how to configure auto scaling when the CPU usage goes up, etc. But I am not sure how to configure auto scaling to launch a new instance when my Tomcat service goes down even though the EC2 instance is up, or when the Tomcat service hangs even though the Tomcat process is still running.
If this is not possible with EC2 auto scaling, is it possible with ELB and Beanstalk?
If you go to the Auto Scaling page in the web console and click edit, you can choose either the EC2 or the ELB health check. EC2 monitors instance performance characteristics; ELB health checks can be used to monitor server response. As the name implies, the Auto Scaling health status is then controlled by the response given to a load balancer. This can range from a TCP check on port 80 that just verifies the server is there, listening and responding, all the way up to a custom HTTP check against a page you define, e.g. you could serve hostname/myserverstatus and have a script at that page check server status, database availability etc., and then return either a success or an error. See http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html
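The same change can be scripted instead of clicking through the console; a minimal sketch, with my-asg as a placeholder for your group's name:

# tell the Auto Scaling group to use the load balancer's health check,
# and give new instances 300 seconds to boot before failed checks count against them
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-type ELB \
  --health-check-grace-period 300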
Good Luck!
There are some standard Unix tools that do this for you. Upstart will watch your server and restart it if it goes down; I don't know about detecting a hang. If you run on Beanstalk, you can set up a call that the load balancer will make to see if your app is responsive, and it can then message you to let you know there is a problem. You can probably also set it up to reboot the box or restart the process.