We are trying to transition our AWS EC2 instance to an AWS ECS service running behind an Elastic Load Balancer. This EC2 instance is responsible for receiving a lot of network connections and saving the received data to an S3 bucket and SQL. The problem is that no matter how many tasks we scale up to, or how big they are, they still fail the container health check once they are used in production. The container health check we are using is this bash script:
#!/bin/bash
# Due to the high load Apache deals with, we can't ping it directly; we only check the status of each service
cgi-fcgi -bind -connect localhost:9001 \
&& service apache2 status \
&& service cron status \
&& service ssh status \
&& service awslogs status \
&& service td-agent status \
|| exit 1
Currently I have 1 ECS task with a max capacity of 2. Each task has 4 vCPUs and 16 GB of RAM. Our original EC2 instance is an m5.2xlarge with 8 vCPUs and 32 GB of RAM. The EC2 instance never reaches 100% CPU or memory, so the ECS tasks should be big enough to handle this network load.
Once we start sending requests to the ECS task, responses from the server become slower (requests take 45 seconds), then the health check script eventually fails and ECS kills the task.
I have also tried using the apache2buddy script to tune the Apache web server, but it did not help, as the script did not report any issues. Tailing syslog and the Apache error log is not helpful either, since there are no obvious errors logged. So I am trying to find more ways to troubleshoot this and to figure out why the EC2 instance has no problems with all these network requests while the Docker container running on ECS does.
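One more thing I plan to try (a rough sketch only, assuming shell access to the running container via docker exec locally or ECS Exec on Fargate): run each sub-check on its own to see which one actually starts failing under load, instead of the combined && chain:
#!/bin/bash
# One-off troubleshooting script, not the production health check:
# run every sub-check separately and report which ones fail.
checks=(
  "cgi-fcgi -bind -connect localhost:9001"
  "service apache2 status"
  "service cron status"
  "service ssh status"
  "service awslogs status"
  "service td-agent status"
)
for c in "${checks[@]}"; do
  if $c > /dev/null 2>&1; then
    echo "OK:   $c"
  else
    echo "FAIL: $c"
  fi
done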
Related
I configured an AWS Fargate cluster with a Docker image that runs a Node.js Express app and listens on port 80. I can now browse to the public IP and the request is handled successfully by AWS Fargate.
Is it right that the Docker container is now running and still waiting for requests?
Isn't it consuming CPU, so I have to pay as long as the Docker container is running?
Do I have to build a Docker image that handles just a single request and then exits, to be really serverless?
Thank you
Is it right that the Docker container is now running and still waiting for requests? Isn't it consuming CPU, so I have to pay as long as the Docker container is running?
Yes, that's how ECS Fargate works. It's really no different from running a docker container on your local computer. It has to be up and running all the time in order to handle requests that come in.
Do I have to build a Docker image that handles just a single request and then exits, to be really serverless?
The term "serverless" is a vague marketing term and means different things depending on who you ask. Amazon calls ECS Fargate serverless because you don't have to manage, or even know the details of, the server that is running the container. In contrast to ECS EC2 deployments, where you have to have EC2 servers up and running ahead of time and ECS just starts the containers on those EC2 servers.
If you want something that only runs, and only charges you, when a request comes in, then you would need to reconfigure your application to run on AWS Lambda instead of ECS Fargate.
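For illustration only (the names and ARNs below are placeholders): if the app is already containerized, one path is to push the image to ECR and create a container-image-based Lambda function with the CLI. Note that the image has to implement the Lambda Runtime API (for example by building on one of the AWS Lambda base images), so a plain Express server would need some adaptation first:
aws lambda create-function \
  --function-name my-express-fn \
  --package-type Image \
  --code ImageUri=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-express-app:latest \
  --role arn:aws:iam::123456789012:role/my-lambda-execution-role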
So I set up an AWS ECS cluster to run a Docker image of the Valhalla service almost as-is.
Issue: the target group does not seem to be able to check the cluster's health; it is as if the health-check request reaches the cluster, but the container does not "forward" the request to Valhalla.
Description:
I created a repository on AWS ECR and pushed a Docker image of gisops/valhalla with only the valhalla.json file changed.
Here is the valhalla configuration I used.
Note that I changed the default listening port from 8002 to 80.
I created an ECS Fargate cluster and a service that uses this task definition to launch a container that runs Valhalla.
The service receives traffic from an application load balancer via port 80.
The target group is checking /status path on port 80.
All set, the task then starts, and the task logs show that Valhalla initializes perfectly and runs.
However, the target group is not able to check the health status: the request seems to time out.
If the request were reaching Valhalla, the task logs would at least show it (because Valhalla logs every incoming request by default), but they don't.
Therefore Fargate kills the task with Task failed ELB health checks in (target-group {my-target-group-uri}), which shows that the health request was indeed reaching the cluster service.
I don't think the issue is with the Valhalla configuration, because I can run the same Docker image locally and it works perfectly, using:
docker run -dt -p 3000:80 -v /local/path/to/valhalla-files:/custom_files/ --name valhalla gisops/valhalla:latest
And then checking localhost:3000/status
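For completeness, the target group's own view can be queried with the CLI (the ARN below is a placeholder); each registered target gets a TargetHealth block with a State and a Reason, such as Target.Timeout:
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/my-target-group/0123456789abcdef \
  --query 'TargetHealthDescriptions[].TargetHealth'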
Does anyone have an idea of what the issue could be?
I've already spent a lot of time on this, and I'm out of ideas. Thanks for your help!
I have a Django app hosted on AWS Elastic Beanstalk.
Users upload documents to the site. Sometimes a user uploads a document and the server completely shuts down: it instantly starts returning 500s, goes offline for about 4 minutes, and then magically the app is back up and running.
Obviously, something is happening that overwhelms the app.
The only thing I get from Elastic Beanstalk is this message:
Environment health has transitioned from Ok to Severe. 100.0 % of the requests are failing with HTTP 5xx. ELB processes are not healthy on all instances. ELB health is failing or not available for all instances.
Then about 4 minutes later:
Environment health has transitioned from Severe to Ok.
I have one t2.medium EC2 instance. I've set the environment up as load balanced, but with Min 1 / Max 1, so I don't take advantage of the load balancing features.
Here's a screenshot of my health tab:
My app shut off on 7/10 as can be seen in picture 1. My CPU spiked at this time, but I can't imagine 20% CPU was enough to overwhelm my server.
How can I determine what might be causing these short 500 errors? Is there somewhere else I can look to discover the source of this? I don't see anything helpful in my access_log or error_log. I don't know where to start looking.
I was having similar problems with Elastic Beanstalk without using a load balancer. When I faced the problem, my application would simply crash and I needed to rebuild the environment from scratch. Further digging revealed that the EC2 instance sometimes ran out of memory, which caused Elastic Beanstalk to shut the application down. The solution was to add a swap area (I went with 2048 MB of swap) to prevent these sudden out-of-memory situations.
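To check whether the same thing is happening in your case, the kernel logs its OOM kills; a quick look on the instance (log path assumes Amazon Linux) would be:
sudo grep -iE "out of memory|oom-killer" /var/log/messages
sudo dmesg | grep -i "killed process"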
Here is how to add a swap area to the Elastic Beanstalk instance:
.ebextensions/swap-area.sh:
#!/usr/bin/env bash
SWAPFILE=/var/swapfile
SWAP_MEGABYTES=2048

# Skip if the swap file already exists (e.g. on a redeploy)
if [ -f $SWAPFILE ]; then
  echo "$SWAPFILE found, ignoring swap setup..."
  exit 0
fi

# Allocate the swap file, restrict its permissions, format it and enable it
/bin/dd if=/dev/zero of=$SWAPFILE bs=1M count=$SWAP_MEGABYTES
/bin/chmod 600 $SWAPFILE
/sbin/mkswap $SWAPFILE
/sbin/swapon $SWAPFILE
.ebextensions/00-swap-area.config:
container_commands:
  00_swap_area:
    command: "bash .ebextensions/swap-area.sh"
Then, after the deployment, you can check the swap area with commands such as top on your EC2 instance.
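For example, after SSHing into the instance:
free -m        # the Swap line should show roughly 2048 MB total
swapon -s      # lists /var/swapfile as an active swap device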
I have set up a Docker container running on port 5566 with a small Django application. The Docker image is uploaded to ECR and later used by the Fargate container(s).
I have set up an ECS cluster with a VPC.
After creating the Task Definition and Service, the Service starts up 2 tasks (as it is supposed to):
Here's the Service's Network Access (with health check grace period on 300s):
I also set up an Application Load Balancer (with DNS) with a target group for the service, but the health checks seem to be failing:
Here's the health check configuration:
Because the health checks are failing, the tasks are terminated and new ones are started roughly every 5 minutes.
Here's the container's port mapping:
As one cannot access the Fargate container (via SSH for example) and the logs are empty, how should I troubleshoot the issue?
I have tried to follow every step in the Troubleshoot Your Application Load Balancer.
Feel free to ask additional information.
Can you confirm that your application is working on port 5566 inside Docker?
You can check the logs in CloudWatch; you'll find the link under cluster -> service -> tasks -> your task.
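If the CLI is handier, the same logs can be tailed directly with AWS CLI v2 (the log group name below is a placeholder, and this assumes the task definition's log configuration uses the awslogs driver):
aws logs tail /ecs/my-django-service --follow --since 15m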
Can you post your ALB configuration and your target group port?
I deployed a Docker image using AWS Fargate. When I created a service out of the task definition, the logs show that Tomcat has no errors and the app is up and running, but new tasks keep getting spun up because the health check is failing.
Health Checks (On target group tied to the service)
Protocol: HTTP
Path: /Sampler/data/ping
Port: traffic/port
What is the right path for health check?
I tried giving the service name too, but it did not work; for example: /servicename/data/ping
Can you please suggest what I am missing?
I have deployed the same WAR file locally by running docker run -p 8080:8080 sampler:latest (the same image that was pushed to ECR), and when I hit http://localhost:8080/Sampler/data/ping, I get a 200 status code.
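For comparison (the hostname and private IP below are placeholders), hitting the same path both through the load balancer and directly against the task from something inside the VPC, e.g. a bastion host, helps separate a wrong health-check path from a wrong port mapping:
curl -i http://my-alb-1234567890.us-east-1.elb.amazonaws.com/Sampler/data/ping
curl -i http://10.0.1.23:8080/Sampler/data/ping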
Dockerfile
FROM tomcat:9.0-jre8-alpine
COPY target/Sampler-*.war $CATALINA_HOME/webapps/Sampler.war
EXPOSE 80
The path for the health check depends on your application. Based on the information you have provided, I suspect the issue could be related to healthCheckGracePeriodSeconds.
healthCheckGracePeriodSeconds
The period of time, in seconds, that the Amazon ECS service scheduler ignores unhealthy
Elastic Load Balancing target health checks after a task has first started.
https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Service.html
When ECS tasks took a long time to start, Elastic Load Balancing (ELB) health checks could mark the task as unhealthy and the service scheduler would shut the task down.
You can specify a health check grace period as an ECS service definition parameter. This instructs the service scheduler to ignore ELB health checks for a predefined period after a task has been started.
https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-ecs-adds-elb-health-check-grace-period/
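For example, with placeholder names, the grace period can be set or raised on an existing service like this:
aws ecs update-service \
  --cluster my-cluster \
  --service my-sampler-service \
  --health-check-grace-period-seconds 300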