docker restarts, and want to find the reason - amazon-web-services

I see docker containers restarts after the host receives huge network requests spike.
(I'm running ECS on ec2 instance)
I guess network requests spike somehow makes the instance unstable and somehow docker container decides to restart.
What logs should I look for to narrow down what causes the container restart?

Related

Does AWS Fargate docker image with express app listening and waiting for requests consume cpu?

I configured an AWS Fargate cluster with a docker image that runs nodejs express app and listens on port 80. Now I can browse to the public IP and successfully the request is handled by AWS Fargate.
Is it right that the docker container now is running and still waiting for requests?
Isn't it consuming CPU and so I have to pay as long as the docker container is running?
Do I have to build a docker image that just handles a single request and exits to be really serverless?
Thank you
Is it right that the docker container now is running and still waiting for requests? Isn't it consuming CPU and so I have to pay as long as the docker container is running?
Yes, that's how ECS Fargate works. It's really no different from running a docker container on your local computer. It has to be up and running all the time in order to handle requests that come in.
Do I have to build a docker image that just handles a single request and exits to be really serverless?
The term "serverless" is a vague marketing term and means different things depending on who you ask. Amazon calls ECS Fargate serverless because you don't have to manage, or even know the details of, the server that is running the container. In contrast to ECS EC2 deployments, where you have to have EC2 servers up and running ahead of time and ECS just starts the containers on those EC2 servers.
If you want something that only runs, and only charges you, when a request comes in, then you would need to reconfigure your application to run on AWS Lambda instead of ECS Fargate.

ECS health check from container passed, but container was unresponsive

The other day, we came across an issue where one of the containers in our ECS Cluster was unresponsive. The troubling part was that the instances health checks, administered via Docker, seemed to indicate that nothing was wrong.
In addition to ECS, we use Route53 service discovery, and all of our containers use these service entries to communicate.
For reference, Docker health checks can be used by ECS to determine if a container should be replaced. We use something like the following (pseudo code):
HEALTHCHECK curl localhost:3000 ... more docker params here
When the incident occurred, connections to the container were timing out, but I could see in the task logs that the health checks were completing successfully. I even tried logging into Cloud9 instance in the VPC and connecting to the task, and was receiving these timeouts. Nothing else seemed out of the ordinary.
All that was required to fix the issue was stopping the container in question.
How can this be avoided? In an ideal situation, ECS should have detected that the container was inaccessible. Is there a way to health check at the container level, AKA "can the container accept connections?", rather than just at the application level?

Reading Docker container status inside application

I have a C++ application containerised as docker container. I have come across a scenario where the docker daemon on a host where the docker containers are running ,goes down. Thereby the container run time also goes down. How do i read this status inside the application which is residing in the container itself?
Thanks in advance
Use --network="host" in your docker run command, then 127.0.0.1 in your docker container will point to your docker host. You can then ping 127.0.0.1 from your container to get the status of your host. But I don't see a point of doing it because if your host is going down, anyways containers would be the first to loose network connectivity and you won't be able to read the status inside the application ever.
Recommended way is to perform container orchestration using swarm, k8s, openshift etc to avoid this issue.

How to deploy continuously using just One EC2 instance with ECS

I want to deploy my nodejs webapp continuously using just One EC2 instance with ECS. I cannot create multiple instances for this app.
My current continuous integration process:
Travis build the code from github, build tag and push docker image and deployed to ECS via ECS Deploy shell script.
Everytime the deployment happen, following error occurs. Because the port 80 is always used by my webapp.
The closest matching container-instance ffa4ec4ccae9
is already using a port required by your task
Is it actually possible to use ECS with one instance? (documentation not clear)
How to get rid of this port issue on ECS? (stop the running container)
What is the way to get this done without using a Load Balancer?
Anything I missed or doing apart from the best practises?
The main issue is the port conflict, which occurs when deploying a second instance of the task on the same node in the cluster. Nothing should stop you from having multiple container instances apart from that (e.g. when not using a load balancer; binding to any ports at all).
To solve this issue, Amazon introduced a dynamic ports feature in a recent update:
Dynamic ports makes it easier to start tasks in your cluster without having to worry about port conflicts. Previously, to use Elastic Load Balancing to route traffic to your applications, you had to define a fixed host port in the ECS task. This added operational complexity, as you had to track the ports each application used, and it reduced cluster efficiency, as only one task could be placed per instance. Now, you can specify a dynamic port in the ECS task definition, which gives the container an unused port when it is scheduled on the EC2 instance. The ECS scheduler automatically adds the task to the application load balancer’s target group using this port. To get started, you can create an application load balancer from the EC2 Console or using the AWS Command Line Interface (CLI). Create a task definition in the ECS console with a container that sets the host port to 0. This container automatically receives a port in the ephemeral port range when it is scheduled.
Here's a way to do it using the green/blue deployment pattern:
Host your containers on port 8080 & 8081 (or whatever port you want). Let's call 8080 green and 8081 blue. (You may have to switch the networking mode from bridge to host to get this to work on a single instance).
Use Elastic Load Balancing to redirect the traffic from 80/443 to green or blue.
When you deploy, use a script to swap the active listener on the ELB to the other color/container.
This also allows you to roll back to a 'last known good' state.
See http://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-load-balancing.html for more information.

Grace Period? - AWS EC2 Container Service and Elastic Load Balancers

When an elastic load balancer (ELB) is associated with an auto-scaling group, it is possible to specify a grace period during which new EC2 instances will not be terminated even if they are marked as unhealthy by the ELB. Is it possible to specify a similar grace period, during which new ECS tasks will not be killed and restarted by their associated ECS service, even if the ECS instance on which a task is running has been marked unhealthy by the ELB?
Update:
In our current use case, the docker container being run as an ECS task contains a JBoss instance that loads a number of caches on startup. These caches can take several minutes to load. However, the ECS service registers the container instance with the ELB, as soon as the container has started. This means that traffic can be routed to the new container before it is ready to accept it. We could increase the health check interval and the "healthy/unhealthy thresholds" on the ELB to prevent the ELB from routing traffic to the instance and the ECS service from restarting the container until the caches have been loaded. However, increasing the health check interval and thresholds is not desirable, because if an instance is marked as unhealthy after the caches have been loaded, the ECS service should restart the container as soon as possible (which necessitates a shorter health check interval and smaller thresholds).
Thus, is it possible to apply a grace period during which traffic will not be routed to a new container by the ELB and the ECS service will not restart the container (even if it fails the health checks)? Or failing that, are there any suggestions regarding a solution for our use case?
In case anyone else finds themselves here via google, in the linked support thread, it is noted that this has been added to AWS, it is called healthCheckGracePeriodSeconds https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_CreateService.html#ECS-CreateService-request-healthCheckGracePeriodSeconds
After a discussion with the support team, it turns out that ECS cannot support our current use case.
There is a workaround that solves one of the issues we are facing. That workaround is to create a separate, essential, health-check container and in the same ECS task as the actual application container. The purpose of the health-check container is to monitor the application container to determine when the application has been started completely. If it detects that the application has failed to start, it exits, causing the ECS service to cycle the task. The ELB is then configured to perform its health checks against the health-check container, which will always report that it is up via the relevant port. This workaround will prevent the ECS service from cycling the ECS task due to failed health checks.
However, the ELB will begin routing traffic to the application container immediately. It will do so, even if the application container is not yet ready to receive traffic (for example, because it is still waiting for a cache to load). Currently, there is no way to delay the ELB from sending traffic to the application container, as the ECS service provides no support a grace period. We have managed to workaround this issue by providing messages to our application containers via SQS and only having them pull from the queue when their caches are fully loaded. However, we have future use cases (such as serving web requests) where this is not a feasible option. To this end, I intend to raise a feature request for the grace period.
As an aside, both Kubernetes (http://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#application-health-checking) and Marathon (https://mesosphere.github.io/marathon/docs/health-checks.html) already support this option for health checking, if someone reading this is happy not to use a managed service.
Use env var ECS_CONTAINER_STOP_TIMEOUT
See https://github.com/aws/amazon-ecs-agent/issues/126