Why does ECS suddenly restart? - amazon-web-services

Our example.com suddenly returns a 503 Service Temporarily Unavailable error.
I checked the monitoring and got the following:
It looks like the Docker container restarted for some reason. What should I check?

As you can see, UnHealthyHostCount jumps from 0 to 2. This suggests that the tasks of the ECS service were restarted and, for some reason, are not yet passing the health check, or the ALB has not yet received a successful HTTP status code.
You can look into the ECS cluster:
Select cluster -> ECS service -> Events
There you will see the reason for the health check failure, or whether the tasks were restarted by a user or some other service.
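The same events can also be read programmatically. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the function name and the cluster and service names are hypothetical. It relies on the `describe_services` call of the boto3 ECS client, which returns the service's recent events, each with a `createdAt` timestamp and a `message`.

```python
def recent_service_events(ecs_client, cluster, service, limit=10):
    """Return the newest `limit` event lines for one ECS service.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    response = ecs_client.describe_services(cluster=cluster, services=[service])
    services = response.get("services", [])
    if not services:
        # Unknown service or cluster: nothing to report.
        return []
    # Sort explicitly so the newest event always comes first.
    events = sorted(
        services[0].get("events", []),
        key=lambda event: event["createdAt"],
        reverse=True,
    )
    return [f'{e["createdAt"].isoformat()} {e["message"]}' for e in events[:limit]]


# Hypothetical usage (names are placeholders):
#   import boto3
#   for line in recent_service_events(boto3.client("ecs"), "my-cluster", "my-service"):
#       print(line)
```

Messages that mention failed ELB health checks or stopped tasks usually point to the time window worth investigating.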

Related

AWS ECS Fargate request not reaching task container

So I set up an AWS ECS cluster to run a Docker image of the Valhalla service almost as-is.
Issue: the target group does not seem to be able to check the cluster's health. It looks as if the health request reaches the cluster, but the container is not "forwarding" the request to Valhalla.
Description :
I created a repository on AWS ECR, and pushed a docker image of gisops/valhalla with only the valhalla.json file changed.
Here is the valhalla configuration I used.
Note that I changed the default listening port from 8002 to 80.
I created an ECS Fargate cluster, and a service that uses this task definition to launch a container that runs Valhalla.
The service receives traffic from an application load balancer via port 80.
The target group is checking /status path on port 80.
With everything set, the task is created, and the task logs show that Valhalla initializes and runs without errors.
However, the target group is not able to check the health status: the request seems to time out.
If the request were reaching Valhalla, the task logs would at least show it (because Valhalla logs every incoming request by default), but they don't.
Therefore Fargate kills the task (Task failed ELB health checks in (target-group {my-target-group-uri})), which shows that the health request was indeed reaching the cluster service.
I don't think the issue is with the Valhalla configuration, because I can run the same Docker image locally and it works perfectly, using:
docker run -dt -p 3000:80 -v /local/path/to/valhalla-files:/custom_files/ --name valhalla gisops/valhalla:latest
And then checking localhost:3000/status
Does anyone have an idea what the issue could be?
I have already spent a lot of time on this, and I'm out of ideas. Thanks for your help!

GCP: How to check the log for changes in the health state of a load balancer backend

I'm using an unmanaged external HTTP load balancer on GCP, and I have several Nginx servers running as its backend.
I wanted to check the logs for changes in the health status of the Nginx servers, but could not find a way to do so.
First, I enabled logging for the health check assigned to the backend service. Then, I stopped the service of one Nginx server to test it. Then I selected GCE Health Check from the Resource Type in the Logs Explorer, but only logs related to the creation, update, and deletion of the health check itself appeared.
Next, I enabled logging for the backend service and did the same experiment. However, similarly, only logs related to the creation, update, and deletion of the backend service itself appeared.
I have three questions:
Doesn't the logging of the health check log the changes in the health status of the monitored service?
Doesn't the logging of backend services log the health status changes of the instances belonging to the backend?
How can I check when each Nginx server became unhealthy and when it became healthy in the log?

ECS: What is causing 503 Service Temporarily Unavailable?

Our web server sporadically shows 503 Service Temporarily Unavailable.
It gets back to normal after a short period.
Because ECS restarts, I don't get to see the logs. How do I go about debugging this?
ECS logs are stored in CloudWatch. But most probably the cause is failing health checks for the service. You can check the health check status in the target group.
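The target group status can also be queried from a script. This is a rough sketch, assuming boto3 and a known target group ARN; the function name is hypothetical. It uses the `describe_target_health` call of the boto3 `elbv2` client and keeps only the targets whose state is not `healthy`, together with the reason the load balancer reports.

```python
def unhealthy_targets(elbv2_client, target_group_arn):
    """List targets in a target group that are not in the 'healthy' state.

    `elbv2_client` is expected to behave like boto3.client("elbv2").
    """
    response = elbv2_client.describe_target_health(TargetGroupArn=target_group_arn)
    report = []
    for description in response.get("TargetHealthDescriptions", []):
        health = description.get("TargetHealth", {})
        if health.get("State") == "healthy":
            continue
        target = description.get("Target", {})
        report.append({
            "id": target.get("Id"),
            "port": target.get("Port"),
            "state": health.get("State"),
            # Reason and Description are only present for non-healthy targets.
            "reason": health.get("Reason"),
            "description": health.get("Description"),
        })
    return report


# Hypothetical usage (the ARN is a placeholder):
#   import boto3
#   for row in unhealthy_targets(boto3.client("elbv2"), "<target-group-arn>"):
#       print(row)
```

A reason such as a timeout or a response code mismatch narrows down whether the container is unreachable or answering with an unexpected status.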

AWS ECS Fargate ALB Error (Request Timed Out)

I have set up a Docker container running on port 5566 with a small Django application. The Docker image is uploaded into the ECR and later used by Fargate container(s).
I have set up an ECS cluster with a VPC.
After creating the Task Definition and Service, the Service starts up 2 tasks (as it is supposed to):
Here's the Service's Network Access (with the health check grace period set to 300s):
I also set up an Application Load Balancer (with DNS) with a target group for the service, but the health checks seem to be failing:
Here's the health check configuration:
Because the health checks are failing, the tasks are terminated and new ones are started roughly every 5 minutes.
Here's the container's port mapping:
As one cannot access the Fargate container (via SSH for example) and the logs are empty, how should I troubleshoot the issue?
I have tried to follow every step in the Troubleshoot Your Application Load Balancer guide.
Feel free to ask for additional information.
Can you confirm that your application is working on port 5566 inside Docker?
You can check the logs in CloudWatch. You'll find the link under cluster -> service -> tasks -> your task.
Can you post your ALB configuration? What is your target group port?
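One way to confirm that the application answers on the expected port is to run the image locally with the port published and request the health check path yourself. The helper below is a small sketch using only the Python standard library; the function name and the example URL are assumptions, and the path should be whatever the target group's health check is configured to call.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    # An empty ProxyHandler forces a direct connection, similar to how a
    # load balancer reaches its targets without any proxy in between.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return None


# Hypothetical usage against a locally running container:
#   print(probe("http://localhost:5566/"))
```

A result of `None` means nothing answered on that port at all, while a status outside the success codes configured on the target group means the application is reachable but the health check would still fail.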

What should the path of healthcheck be in the target group created for Fargate Service

I deployed a Docker image using AWS Fargate. When I created a service out of the task definition, the logs show that Tomcat has no errors and the app is up and running, but new instances are constantly being spun up because the health check is failing.
Health Checks (On target group tied to the service)
Protocol: HTTP
Path: /Sampler/data/ping
Port: traffic port
What is the right path for health check?
I tried giving servicename too, but it did not work
for example: /servicename/data/ping
Can you please suggest what I am missing?
I deployed the same WAR file locally by running docker run -p 8080:8080 sampler:latest (the same image that was pushed from my machine to ECR), and when I hit the URL http://localhost:8080/Sampler/data/ping, I get a 200 status code.
Dockerfile
FROM tomcat:9.0-jre8-alpine
COPY target/Sampler-*.war $CATALINA_HOME/webapps/Sampler.war
EXPOSE 80
The path for the health check depends on your application. Based on the information you have provided, I suspect the issue could be related to healthCheckGracePeriodSeconds:
healthCheckGracePeriodSeconds
The period of time, in seconds, that the Amazon ECS service scheduler ignores unhealthy
Elastic Load Balancing target health checks after a task has first started.
https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Service.html
When ECS tasks take a long time to start, Elastic Load Balancing (ELB) health checks can mark the task as unhealthy, and the service scheduler then shuts the task down.
You can specify a health check grace period as an ECS service definition parameter. This instructs the service scheduler to ignore ELB health checks for a predefined time period after a task has been instantiated.
https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-ecs-adds-elb-health-check-grace-period/
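As an illustration, the grace period of an existing service can be changed with the `update_service` call of the boto3 ECS client. This is a minimal sketch, assuming boto3 and a service that is attached to a load balancer; the function name and the cluster and service names are hypothetical, and the value of 120 seconds is only an example that should be tuned to how long the application takes to start.

```python
def set_health_check_grace_period(ecs_client, cluster, service, seconds):
    """Set healthCheckGracePeriodSeconds on a service and return the new value.

    `ecs_client` is expected to behave like boto3.client("ecs").
    """
    if seconds < 0:
        raise ValueError("grace period must not be negative")
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        healthCheckGracePeriodSeconds=seconds,
    )
    # The response echoes the updated service description.
    return response["service"]["healthCheckGracePeriodSeconds"]


# Hypothetical usage (names and value are placeholders):
#   import boto3
#   set_health_check_grace_period(boto3.client("ecs"), "my-cluster", "my-service", 120)
```

If tasks are still replaced after the grace period ends, the health check path or port is the more likely culprit than the startup time.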