ECS health check from container passed, but container was unresponsive

The other day, we came across an issue where one of the containers in our ECS cluster was unresponsive. The troubling part was that the container's health checks, administered via Docker, seemed to indicate that nothing was wrong.
In addition to ECS, we use Route53 service discovery, and all of our containers use these service entries to communicate.
For reference, Docker health checks can be used by ECS to determine if a container should be replaced. We use something like the following (pseudo code):
HEALTHCHECK curl localhost:3000 ... more docker params here
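Fully written out, that health check might look something like this (the port, path, and timing values here are illustrative, not our exact settings):
# Hypothetical Dockerfile excerpt -- values are placeholders, not the real ones
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/ || exit 1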
When the incident occurred, connections to the container were timing out, but I could see in the task logs that the health checks were completing successfully. I even tried logging into a Cloud9 instance in the VPC and connecting to the task, and received the same timeouts. Nothing else seemed out of the ordinary.
All that was required to fix the issue was stopping the container in question.
How can this be avoided? In an ideal situation, ECS should have detected that the container was inaccessible. Is there a way to health check at the container level, AKA "can the container accept connections?", rather than just at the application level?

Related

Docker container restarts, and I want to find the reason

I see Docker containers restart after the host receives a huge spike in network requests.
(I'm running ECS on an EC2 instance.)
I guess the spike in network requests somehow makes the instance unstable and the Docker containers decide to restart.
What logs should I look at to narrow down what causes the container restarts?

ELB backend connection errors when deregistering EC2 instances

I've written a custom release script to manage releases for an EC2 auto-scaling application. The process works like so (a rough AWS CLI sketch follows the list)...
Create an AMI based on an application git tag.
Create launch config.
Configure ASG to use new launch config.
Find current desired capacity for ASG.
Set desired capacity to 2x previous capacity.
Wait for the new instances to become healthy by querying the ELB.
Set desired capacity back to previous value.
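A rough AWS CLI sketch of those steps (resource names, the polling interval, and the health test are placeholders, not the actual script):
#!/bin/bash
# Hypothetical sketch of the scale-out/scale-in steps; $asg_name and $elb_name
# stand in for the real resource names.
desired=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$asg_name" \
  --query 'AutoScalingGroups[0].DesiredCapacity' --output text)

aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity $((desired * 2))

# Wait until no instance registered with the ELB reports OutOfService/Unknown.
while aws elb describe-instance-health --load-balancer-name "$elb_name" \
        --query 'InstanceStates[].State' --output text \
        | grep -qE 'OutOfService|Unknown'; do
  sleep 15
done

aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity "$desired"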
This all works fairly well, except whenever I run this, the monitoring for the ELB is showing a lot of backend connection errors.
I don't know why this would be occurring, as the ELB should (based on my understanding) still serve existing connections if the "Connection draining" option is enabled (which it is).
I thought perhaps the ASG was terminating the instances before the connections could finish, so I changed my script to first deregister the instances from the ELB, and then wait a while before changing the desired capacity at the ASG. This however didn't make any difference. As soon as the instances were deregistered from the ELB (even though they're still running and healthy) the backend connection errors occur.
It seems as though it's ignoring the connection draining option and simply dropping connections as soon as the instance has been deregistered.
This is the command I'm using to deregister the instances...
aws elb deregister-instances-from-load-balancer --load-balancer-name $elb_name --instances $old_instances
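For what it's worth, one thing I'm considering is polling after that call until draining has actually finished before touching the ASG, roughly like this (the description string is a guess at what the classic ELB reports while draining):
# Hypothetical wait loop: keep polling while the ELB still reports the old
# instances as being in the middle of deregistration (connection draining).
while aws elb describe-instance-health --load-balancer-name "$elb_name" \
        --instances $old_instances \
        --query 'InstanceStates[].Description' --output text \
        | grep -qi 'deregistration currently in progress'; do
  sleep 10
done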
Is there some preferred method to gracefully remove the instances from the ELB before removing them from the ASG?
Further investigation suggests that the backend connection errors are occurring because the new instances aren't yet ready to take the full load when the old instances are removed from the ELB. They're healthy, but seem to need a bit more warming up.
I'm working on tweaking the health-check settings to give the instances a bit more time before they start trying to serve requests. I may also need to change the apache2 settings to get them ready more quickly.
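For example, something along these lines on the classic ELB (the target path and numbers are placeholders I'm still tuning):
# Hypothetical: ~10 consecutive passing checks at 15s intervals keeps a newly
# registered instance out of rotation for a couple of minutes after it starts
# responding, giving it time to warm up.
aws elb configure-health-check --load-balancer-name "$elb_name" \
  --health-check Target=HTTP:80/health,Interval=15,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=10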

How to deploy continuously using just one EC2 instance with ECS

I want to deploy my Node.js web app continuously using just one EC2 instance with ECS. I cannot create multiple instances for this app.
My current continuous integration process:
Travis builds the code from GitHub, builds and tags the Docker image, pushes it, and deploys to ECS via an ECS deploy shell script.
Every time a deployment happens, the following error occurs, because port 80 is always in use by my web app:
The closest matching container-instance ffa4ec4ccae9
is already using a port required by your task
Is it actually possible to use ECS with one instance? (The documentation isn't clear.)
How do I get rid of this port issue on ECS (i.e. stop the running container first)?
Is there a way to get this done without using a load balancer?
Is there anything I've missed, or anything that departs from best practices?
The main issue is the port conflict, which occurs when a second copy of the task is deployed onto the same node in the cluster. Apart from that, nothing stops you from running multiple copies of the task on a single container instance (e.g. when not using a load balancer, or when not binding to any fixed host ports at all).
To solve this issue, Amazon introduced a dynamic ports feature in a recent update:
Dynamic ports makes it easier to start tasks in your cluster without having to worry about port conflicts. Previously, to use Elastic Load Balancing to route traffic to your applications, you had to define a fixed host port in the ECS task. This added operational complexity, as you had to track the ports each application used, and it reduced cluster efficiency, as only one task could be placed per instance. Now, you can specify a dynamic port in the ECS task definition, which gives the container an unused port when it is scheduled on the EC2 instance. The ECS scheduler automatically adds the task to the application load balancer’s target group using this port. To get started, you can create an application load balancer from the EC2 Console or using the AWS Command Line Interface (CLI). Create a task definition in the ECS console with a container that sets the host port to 0. This container automatically receives a port in the ephemeral port range when it is scheduled.
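Concretely, the dynamic-port part of a task definition is just a hostPort of 0 in the port mappings; a minimal, hypothetical registration (image name, memory, and ports are placeholders):
# Hypothetical task definition with a dynamic host port (hostPort: 0).
aws ecs register-task-definition --cli-input-json '{
  "family": "webapp",
  "networkMode": "bridge",
  "containerDefinitions": [{
    "name": "webapp",
    "image": "myrepo/webapp:latest",
    "memory": 256,
    "essential": true,
    "portMappings": [{ "containerPort": 3000, "hostPort": 0 }]
  }]
}'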
Here's a way to do it using the green/blue deployment pattern:
Host your containers on ports 8080 and 8081 (or whatever ports you want). Let's call 8080 green and 8081 blue. (You may have to switch the networking mode from bridge to host to get this to work on a single instance.)
Use Elastic Load Balancing to redirect the traffic from 80/443 to green or blue.
When you deploy, use a script to swap the active listener on the ELB to the other color/container (a rough sketch follows below).
This also allows you to roll back to a 'last known good' state.
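A hedged sketch of the listener swap on a classic ELB (ports and the variable name are assumptions; note that a classic ELB listener has to be deleted and recreated to point it at a different instance port, which causes a brief gap):
# Hypothetical swap from green (instance port 8080) to blue (8081).
aws elb delete-load-balancer-listeners \
  --load-balancer-name "$elb_name" --load-balancer-ports 80
aws elb create-load-balancer-listeners --load-balancer-name "$elb_name" \
  --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=8081"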
See http://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-load-balancing.html for more information.

Connect ECS instances from different task definitions

We are testing the ECS infrastructure to run an application that requires a backend service (a MySQL instance) as well as a few web servers. Since we'd like to restart and redeploy the front-end web servers independently from the backend MySQL service, we were considering defining them as separate task definitions, as suggested here.
However, since the container names are autogenerated by ECS, we have no means of referring to the container running the MySQL instance, and links can only be defined between containers running in the same task.
How can I make a reference to a container from a different task?
PS: I'd like to keep everything running within ECS, and not rely on RDS, at least for now.
What you're asking about is generally called service discovery, and there are a number of different alternatives. ECS has integration with ELB through the service feature, where tasks are automatically registered with and deregistered from the ELB as appropriate. If you want to avoid ELB, another pattern might be an ambassador container (there's a sample called ecs-task-kite that uses the ECS API), or you might be interested in an overlay network (Weave has a fairly detailed getting-started guide for their solution).
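To illustrate the ELB route, the backend task can be wrapped in an ECS service with a load balancer attached, so registration and deregistration are handled for you and the web servers only need the ELB's DNS name (all names below are placeholders):
# Hypothetical: put the MySQL task behind an internal classic ELB via a service.
aws ecs create-service --cluster my-cluster --service-name mysql \
  --task-definition mysql-task:1 --desired-count 1 \
  --role ecsServiceRole \
  --load-balancers loadBalancerName=mysql-internal-elb,containerName=mysql,containerPort=3306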

Grace Period? - AWS EC2 Container Service and Elastic Load Balancers

When an elastic load balancer (ELB) is associated with an auto-scaling group, it is possible to specify a grace period during which new EC2 instances will not be terminated even if they are marked as unhealthy by the ELB. Is it possible to specify a similar grace period, during which new ECS tasks will not be killed and restarted by their associated ECS service, even if the ECS instance on which a task is running has been marked unhealthy by the ELB?
Update:
In our current use case, the docker container being run as an ECS task contains a JBoss instance that loads a number of caches on startup. These caches can take several minutes to load. However, the ECS service registers the container instance with the ELB, as soon as the container has started. This means that traffic can be routed to the new container before it is ready to accept it. We could increase the health check interval and the "healthy/unhealthy thresholds" on the ELB to prevent the ELB from routing traffic to the instance and the ECS service from restarting the container until the caches have been loaded. However, increasing the health check interval and thresholds is not desirable, because if an instance is marked as unhealthy after the caches have been loaded, the ECS service should restart the container as soon as possible (which necessitates a shorter health check interval and smaller thresholds).
Thus, is it possible to apply a grace period during which traffic will not be routed to a new container by the ELB and the ECS service will not restart the container (even if it fails the health checks)? Or failing that, are there any suggestions regarding a solution for our use case?
In case anyone else finds themselves here via Google: in the linked support thread it is noted that this has since been added to AWS; it is called healthCheckGracePeriodSeconds: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_CreateService.html#ECS-CreateService-request-healthCheckGracePeriodSeconds
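For example (the names and the 300-second value are placeholders), it can be set on an existing service like so:
# Hypothetical: ignore load balancer health checks for the first 5 minutes
# after each task starts, giving the caches time to load.
aws ecs update-service --cluster my-cluster --service my-jboss-service \
  --health-check-grace-period-seconds 300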
After a discussion with the support team, it turns out that ECS cannot support our current use case.
There is a workaround that solves one of the issues we are facing. That workaround is to create a separate, essential health-check container in the same ECS task as the actual application container. The purpose of the health-check container is to monitor the application container to determine when the application has started completely. If it detects that the application has failed to start, it exits, causing the ECS service to cycle the task. The ELB is then configured to perform its health checks against the health-check container, which will always report that it is up via the relevant port. This workaround prevents the ECS service from cycling the ECS task due to failed health checks.
However, the ELB will begin routing traffic to the application container immediately. It will do so even if the application container is not yet ready to receive traffic (for example, because it is still waiting for a cache to load). Currently, there is no way to delay the ELB from sending traffic to the application container, as the ECS service provides no support for a grace period. We have managed to work around this issue by delivering messages to our application containers via SQS and only having them pull from the queue once their caches are fully loaded. However, we have future use cases (such as serving web requests) where this is not a feasible option. To this end, I intend to raise a feature request for the grace period.
As an aside, both Kubernetes (http://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#application-health-checking) and Marathon (https://mesosphere.github.io/marathon/docs/health-checks.html) already support this option for health checking, if someone reading this is happy not to use a managed service.
Use env var ECS_CONTAINER_STOP_TIMEOUT
See https://github.com/aws/amazon-ecs-agent/issues/126
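That variable is read by the ECS agent from its config file on the container instance; for example (the 2m value and the restart commands are assumptions that depend on the AMI):
# Hypothetical: give containers up to 2 minutes to stop gracefully before the
# agent kills them, then restart the agent so it picks up the new setting.
echo "ECS_CONTAINER_STOP_TIMEOUT=2m" | sudo tee -a /etc/ecs/ecs.config
sudo stop ecs && sudo start ecs   # upstart on the Amazon Linux 1 ECS AMI; use systemctl on newer AMIs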