ELB backend connection errors when deregistering EC2 instances - amazon-web-services

I've written a custom release script to manage releases for an EC2 auto-scaling application. The process works like so (a rough sketch of the commands follows the list):
Create an AMI based on an application git tag.
Create launch config.
Configure ASG to use new launch config.
Find current desired capacity for ASG.
Set desired capacity to 2x previous capacity.
Wait for new instances to become healthy by querying ELB.
Set desired capacity back to previous value.
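For context, here is a hedged sketch of the commands such a script might run; names like $app_tag, $asg_name, $elb_name and $build_instance are placeholders, not the actual script:

#!/bin/bash
# Hypothetical sketch of the release flow described above.
ami_id=$(aws ec2 create-image --instance-id "$build_instance" \
  --name "myapp-$app_tag" --query 'ImageId' --output text)

aws autoscaling create-launch-configuration \
  --launch-configuration-name "myapp-$app_tag" \
  --image-id "$ami_id" --instance-type t3.medium

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name "$asg_name" \
  --launch-configuration-name "myapp-$app_tag"

# Double the desired capacity so new instances come up alongside the old ones.
desired=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$asg_name" \
  --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity $((desired * 2))

# Wait until the ELB reports all instances InService, then scale back down,
# which terminates the old instances (subject to the ASG termination policy).
while [ "$(aws elb describe-instance-health --load-balancer-name "$elb_name" \
  --query 'length(InstanceStates[?State==`InService`])' --output text)" -lt "$((desired * 2))" ]; do
  sleep 15
done
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity "$desired"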
This all works fairly well, except that whenever I run it, the ELB monitoring shows a lot of backend connection errors.
I don't know why this would be occurring, as (based on my understanding) it should still serve existing connections if the "Connection draining" option is enabled for the ELB (which it is).
I thought perhaps the ASG was terminating the instances before the connections could finish, so I changed my script to first deregister the instances from the ELB, and then wait a while before changing the desired capacity at the ASG. This however didn't make any difference. As soon as the instances were deregistered from the ELB (even though they're still running and healthy) the backend connection errors occur.
It seems as though it's ignoring the connection draining option and simply dropping connections as soon as the instance has been deregistered.
This is the command I'm using to deregister the instances...
aws elb deregister-instances-from-load-balancer --load-balancer-name $elb_name --instances $old_instances
Is there some preferred method to gracefully remove the instances from the ELB before removing them from the ASG?
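One approach (a sketch only, not confirmed in this thread) is to deregister the old instances and then poll the ELB until they disappear from its instance list, i.e. until connection draining has actually finished, before lowering the ASG's desired capacity. $asg_name and $desired are placeholders here, and the polling condition assumes drained instances drop out of describe-instance-health:

# Deregister, then wait for connection draining to complete.
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name "$elb_name" --instances $old_instances

for instance in $old_instances; do
  while aws elb describe-instance-health --load-balancer-name "$elb_name" \
      --query 'InstanceStates[].InstanceId' --output text | grep -q "$instance"; do
    sleep 10   # still listed, most likely draining; keep waiting
  done
done

# Only now shrink the ASG so it terminates the old, already-drained instances.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity "$desired"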

Further investigation suggests that the back-end connection errors are occurring because the new instances aren't yet ready to take the full load when the old instances are removed from the ELB. They're healthy, but seem to need a bit more warm-up time.
I'm working on tweaking the health-check settings to give the instances a bit more time before they start trying to serve requests. I may also need to change the apache2 settings to get them ready more quickly.
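For example, raising the healthy threshold and interval on the classic ELB health check delays the point at which a new instance starts receiving traffic; a hedged sketch (the target path and numbers are placeholders, not the actual settings):

aws elb configure-health-check \
  --load-balancer-name "$elb_name" \
  --health-check Target=HTTP:80/health,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=10

With HealthyThreshold=10 and Interval=30, an instance needs roughly five minutes of consecutive passing checks before the ELB marks it InService.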

Related

EC2 server loses internet connection and application fails to send email, SMS and even yum updates

I have 5 EC2 servers in the same VPC, and all of a sudden yesterday all of my applications started failing to send email and SMS. So I tried doing a git pull of my project and it also timed out. Then I tried to install telnet using yum, and that too failed with a timeout. I have checked almost everything, including network ACLs, security groups, subnets, iptables, etc., and everything is correct. I am not sure why this is happening.
The weird thing is that if I reboot the server, the internet comes back for a brief amount of time and then it disconnects again.
Attaching below are the errors I am facing:
Error while Generating the Tiny URL. Error: {"errno":-110,"code":"ETIMEDOUT","syscall":"connect","address":"XXX.XX.XXX.XX","port":443}
Error SendEmail UnknownEndpoint: Inaccessible host: `email.ap-south-1.amazonaws.com'. This service may not be available in the `ap-south-1' region.
Attaching screenshots of my Network ACLs, Security Groups, Subnets, and iptables:
Please help me figure out what I am doing wrong, or whether this is an issue with AWS EC2. My goal is to make sure my application works without timeouts and that git and yum start working.
Did you try terminating and reprovisioning the instances, rather than rebooting them? There may be some problem with the underlying hardware. When you terminate and recreate an instance, it will likely end up in a different rack in the datacenter, which may solve the problem.
If the above helps, you should consider setting up an application load balancer with an auto scaling group, with health checks enabled for both, so that the auto scaling group terminates unhealthy instances and replaces them with new ones automatically.
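For example, switching an Auto Scaling group to use the load balancer's health checks can be done with something along these lines (the group name and grace period are placeholders):

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-type ELB \
  --health-check-grace-period 300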
You may also consider using Simple Notification Service and stop worrying about the underlying compute for email and SMS distribution altogether!

Shutdown scripts for Google managed instance groups

I am running a managed instance group in Google Cloud. The instances are behind a load balancer and it is working fine. The problem is that when the managed instance group scales down, the load balancer does not notice until after the instance has been killed, so some requests are sent to an instance that is dead, causing the application to not work properly for a while.
On this page https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions I have read that shutdown scripts can be used. I tried to add one that tells the instance it is about to be shut down, so that it starts reporting unhealthy when the load balancer does a health check; the script then waits for a while to give the load balancer time to check it. It does not seem to work, however. The script seems to be called, but too late, so the instance just shuts down.
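For reference, a minimal sketch of the kind of shutdown script described above; the /tmp/draining flag file and the health endpoint behaviour are assumptions about the application, not something given in the question:

#!/bin/bash
# Hypothetical GCE shutdown script: make the app report unhealthy, then wait so the
# load balancer has time to observe several failed health checks before shutdown proceeds.
touch /tmp/draining   # assume the app's health handler returns 503 while this file exists
sleep 90              # roughly (check interval x unhealthy threshold) plus a margin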
Anyone know how to write a shutdown script for this scenario?
Seems like this was not the problem after all. After inspecting the logs, it turned out that the health checks timed out, causing the load balancer to not find any nodes in a healthy state.

Why is Elastic Beanstalk Traffic Splitting deploy strategy ignoring HTTP errors?

I am using AWS Elastic Beanstalk. In there, I selected a Traffic Splitting deploy strategy, with a 100% split (so that 100% of new instances will have the new version and have their health evaluated).
Here's how (according to their documentation) that is supposed to work:
During a traffic-splitting deployment, Elastic Beanstalk creates a new set of instances in a separate temporary Auto Scaling group. Elastic Beanstalk then instructs the load balancer to direct a certain percentage of your environment's incoming traffic to the new instances. Then, for a configured amount of time, Elastic Beanstalk tracks the health of the new set of instances. If all is well, Elastic Beanstalk shifts remaining traffic to the new instances and attaches them to the environment's original Auto Scaling group, replacing the old instances. Then Elastic Beanstalk cleans up—terminates the old instances and removes the temporary Auto Scaling group.
And more specifically:
Rolling back the deployment to the previous application version is quick and doesn't impact service to client traffic. If the new instances don't pass health checks, or if you choose to abort the deployment, Elastic Beanstalk moves traffic back to the old instances and terminates the new ones.
However, it seems silly that it only looks at my internal /health health checks, and not at the overall health status of the environment derived from the HTTP status codes it already has information on.
I tried the following scenario:
Deploy a new version.
As soon as the "health evaluation period" begins, flood the server with error 500s (from an endpoint I made specifically for this purpose).
AWS then moves all my instances into a "degraded", "unhealthy" state, but then seems to ignore it and goes on anyway.
See the following two log dump screenshots (they are oldest-first).
Is there any way that I can make AWS respect the HTTP status based health checks that it already performs, during a traffic split? Or am I bound to only rely on custom-developed health checks entirely?
Update 1: Even weirder, I tried making my own health checks always fail too, but it still decides to deploy the new version despite the failed health check!
Update 2: I noticed that the temporary Auto Scaling group it creates while assessing health only has an "EC2" type health check, not "ELB". I think that might be the root cause. If only I could get it to use "ELB" instead.
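(For reference, the current health check type of an Auto Scaling group can be inspected with something like the following; the group name is a placeholder for the temporary group Elastic Beanstalk creates.)

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <temporary-asg-name> \
  --query 'AutoScalingGroups[0].[HealthCheckType,HealthCheckGracePeriod]'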
That is interesting! I do not know whether setting the health check type to "ELB" will do the job, because we use CodeDeploy, which has far better rollback capabilities than AWS Elastic Beanstalk.
However, there is a well-documented way in the docs [1] to apply the setting you are looking for:
[...] By default, the Auto Scaling group created for your environment uses Amazon EC2 status checks. If an instance in your environment fails an Amazon EC2 status check, Auto Scaling takes it down and replaces it.
Amazon EC2 status checks only cover an instance's health, not the health of your application, server, or any Docker containers running on the instance. If your application crashes, but the instance that it runs on is still healthy, it may be kicked out of the load balancer, but Auto Scaling won't replace it automatically. [...]
If you want Auto Scaling to replace instances whose application has stopped responding, you can use a configuration file to configure the Auto Scaling group to use Elastic Load Balancing health checks. The following example sets the group to use the load balancer's health checks, in addition to the Amazon EC2 status check, to determine an instance's health.
Example .ebextensions/autoscaling.config
Resources:
  AWSEBAutoScalingGroup:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
It does not mention the new traffic splitting deployment feature, though.
Thus, I cannot confirm this is the actual solution, but at least you can give it a shot.
[1] https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/environmentconfig-autoscaling-healthchecktype.html
Once upon a time I thought that the Immutable Deployment option in Elastic Beanstalk was a holy panacea -- but it only works when a deployment involves no changes to the application's database schema.
We've now resorted to blue-green deployments. However, this only works if you control the DNS. If you are a SaaS solution and you allow customers to create a CNAME, then B/G is often a spectacular failure, as the enterprise: a) sets a very high TTL, and/or b) their internal DNS or firewalls cache the underlying IP addresses of the ALB (which are dynamic and, of course, replaced when you swap the URLs of the blue and green environments).
Traffic splitting is written as an option in the Elastic Beanstalk documentation.
But it's not actually an option in the configuration section in the console.
This wouldn't be the first time I've seen that Elastic Beanstalk's docs are out of date, so it could be that AWS has removed that feature.
Since AWS introduced CodeStar I suspect Elastic Beanstalk is getting the cold shoulder.

AWS CodeDeploy: stuck on install step

I'm running through this tutorial to create a deployment pipeline with my custom .NET-based Docker image.
But when I start a deployment, it gets stuck on the install phase, so I have to stop it manually:
After that I get a couple of running tasks with different task definitions (note :1 and :4, because I've tried to run the deployment 4 times by now):
They also change their state RUNNING->PROVISIONING->PENDING all the time. And the list of stopped tasks grows:
Q:
So, how do I hunt down the issue with CodeDeploy? Why is it running forever?
UPDATE:
It is connected to health checks.
UPDATE:
I'm getting this:
(service dataapi-dev-service, taskSet ecs-svc/9223370487815385540) (port 80) is unhealthy in target-group dataapi-dev-tg1 due to (reason Health checks failed with these codes: [404]).
I don't quite understand why it is failing for the newly created container, because the original one passes the health check.
While the ECS task is running, the ELB (Elastic Load Balancer) will constantly health-check the container, as configured in the target group, to check whether the container is still responding.
From your debug message, the container (api) responded to the health check path with a 404.
I suggest you configure the health check path in the target group dataapi-dev-tg1.
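For example, something along these lines would point the target group's health check at a path the container actually serves with a 200 (the ARN and path are placeholders):

aws elbv2 modify-target-group \
  --target-group-arn <dataapi-dev-tg1-arn> \
  --health-check-path /health \
  --matcher HttpCode=200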
For those who are still hitting this issue: in my case the ECS cluster had no outbound connectivity.
Possible solutions to this problem:
make sure the security groups you use with your VPC allow outbound traffic
make sure that the route table you use with the VPC has subnet associations with the subnets you use with your load balancer (examine the route tables)
I was able to figure it out because I enabled CloudWatch during ECS cluster creation and got CannotPullContainerError. For more information on solving this problem, look into the Cannot Pull Container Image Error.
Make sure your Internet Gateway is attached to your Subnets through the Route Table (Routes), if your Load Balancer is internet facing.
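A quick way to sanity-check the routing (the VPC ID is a placeholder) is something like:

aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=<vpc-id> \
  --query 'RouteTables[].{Id:RouteTableId,Gateways:Routes[].GatewayId,Subnets:Associations[].SubnetId}'

An internet-facing setup should show an igw-... gateway on the route tables associated with the load balancer's subnets.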
The error is due to a health check which detected an unhealthy target.
Make sure to check your configuration in the target group settings.

Grace Period? - AWS EC2 Container Service and Elastic Load Balancers

When an elastic load balancer (ELB) is associated with an auto-scaling group, it is possible to specify a grace period during which new EC2 instances will not be terminated even if they are marked as unhealthy by the ELB. Is it possible to specify a similar grace period, during which new ECS tasks will not be killed and restarted by their associated ECS service, even if the ECS instance on which a task is running has been marked unhealthy by the ELB?
Update:
In our current use case, the docker container being run as an ECS task contains a JBoss instance that loads a number of caches on startup. These caches can take several minutes to load. However, the ECS service registers the container instance with the ELB, as soon as the container has started. This means that traffic can be routed to the new container before it is ready to accept it. We could increase the health check interval and the "healthy/unhealthy thresholds" on the ELB to prevent the ELB from routing traffic to the instance and the ECS service from restarting the container until the caches have been loaded. However, increasing the health check interval and thresholds is not desirable, because if an instance is marked as unhealthy after the caches have been loaded, the ECS service should restart the container as soon as possible (which necessitates a shorter health check interval and smaller thresholds).
Thus, is it possible to apply a grace period during which traffic will not be routed to a new container by the ELB and the ECS service will not restart the container (even if it fails the health checks)? Or failing that, are there any suggestions regarding a solution for our use case?
In case anyone else finds themselves here via Google: in the linked support thread it is noted that this has since been added to AWS; it is called healthCheckGracePeriodSeconds. https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_CreateService.html#ECS-CreateService-request-healthCheckGracePeriodSeconds
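For example, the grace period can be set when the service is created; all names and values below are placeholders, and the parameter is only relevant when the service is registered behind a load balancer:

aws ecs create-service \
  --cluster my-cluster \
  --service-name my-service \
  --task-definition my-task:1 \
  --desired-count 2 \
  --health-check-grace-period-seconds 300 \
  --load-balancers targetGroupArn=<target-group-arn>,containerName=app,containerPort=8080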
After a discussion with the support team, it turns out that ECS cannot support our current use case.
There is a workaround that solves one of the issues we are facing. That workaround is to create a separate, essential, health-check container in the same ECS task as the actual application container. The purpose of the health-check container is to monitor the application container to determine when the application has started completely. If it detects that the application has failed to start, it exits, causing the ECS service to cycle the task. The ELB is then configured to perform its health checks against the health-check container, which will always report that it is up via the relevant port. This workaround prevents the ECS service from cycling the ECS task due to failed health checks.
However, the ELB will begin routing traffic to the application container immediately. It will do so even if the application container is not yet ready to receive traffic (for example, because it is still waiting for a cache to load). Currently, there is no way to delay the ELB from sending traffic to the application container, as the ECS service provides no support for a grace period. We have managed to work around this issue by delivering messages to our application containers via SQS and only having them pull from the queue once their caches are fully loaded. However, we have future use cases (such as serving web requests) where this is not a feasible option. To this end, I intend to raise a feature request for the grace period.
As an aside, both Kubernetes (http://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#application-health-checking) and Marathon (https://mesosphere.github.io/marathon/docs/health-checks.html) already support this option for health checking, if someone reading this is happy not to use a managed service.
Use env var ECS_CONTAINER_STOP_TIMEOUT
See https://github.com/aws/amazon-ecs-agent/issues/126