ECS: What is causing 503 Service Temporarily Unavailable?

Our web server sporadically returns 503 Service Temporarily Unavailable.
It gets back to normal after a short period.
Because ECS restarts, I don't get to see the logs. How do I go about debugging this?

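ECS logs are stored in CloudWatch (provided the task definition uses the awslogs log driver), so they survive a restart. Most probably, though, the cause is failing health checks on the service. You can check the health check status in the load balancer's target group.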
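If you would rather check the target group from a script than from the console, something like the following might help. It is only a sketch: it assumes the service sits behind an ALB/NLB target group, that boto3 and AWS credentials are available, and the helper names are made up for illustration. The response shape follows the ELBv2 describe_target_health call.

```python
from collections import Counter


def fetch_target_health(target_group_arn):
    """Call the ELBv2 API. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the summary helper works without it

    client = boto3.client("elbv2")
    return client.describe_target_health(TargetGroupArn=target_group_arn)


def summarize_target_health(response):
    """Return (counts per state, list of targets that are not healthy)."""
    counts = Counter()
    problems = []
    for desc in response.get("TargetHealthDescriptions", []):
        health = desc.get("TargetHealth", {})
        state = health.get("State", "unknown")
        counts[state] += 1
        if state != "healthy":
            problems.append(
                {
                    "target": desc.get("Target", {}).get("Id"),
                    "state": state,
                    # Reason/Description explain why the check is failing
                    "reason": health.get("Reason"),
                    "description": health.get("Description"),
                }
            )
    return dict(counts), problems
```

Calling summarize_target_health(fetch_target_health(arn)) while the 503s are happening should show which targets are failing and the reason the load balancer reports for each.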

Related

GCP: How to check the log for changes in the health state of a load balancer backend

I'm using an unmanaged external HTTP load balancer on GCP, and I have several Nginx servers running on its backend.
I wanted to check the logs for changes in the health status of the Nginx servers, but could not find a way to do so.
First, I enabled logging for the health check assigned to the backend service. Then, I stopped the service of one Nginx server to test it. Then I selected GCE Health Check from the Resource Type in the Logs Explorer, but only logs related to the creation, update, and deletion of the health check itself appeared.
Next, I enabled logging for the backend service and did the same experiment. However, similarly, only logs related to the creation, update, and deletion of the backend service itself appeared.
I have three questions:
Doesn't the logging of the health check log the changes in the health status of the monitored service?
Doesn't the logging of backend services log the health status changes of the instances belonging to the backend?
How can I check when each Nginx server became unhealthy and when it became healthy in the log?

Why does ECS suddenly restart?

Our example.com suddenly returns a 503 Service Temporarily Unavailable error.
I checked the monitoring and got the following.
It looks like the Docker container restarted for some reason. What should I check?
As you can see, UnHealthyHostCount jumps from 0 to 2. This suggests the ECS service tasks were restarted and, for some reason, have not yet passed the health check, or the ALB has not yet received a successful HTTP status code from them.
You can look into the ECS cluster:
Select the cluster -> ECS service -> Events
There you will see the reason for the health check failure, or whether the tasks were restarted by a user or some other service.
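The same events can be pulled with boto3 if the console is inconvenient. This is a rough sketch, assuming boto3 and credentials are set up. The keyword list is my own guess at what is worth filtering on, not an official list of ECS event messages, and the function names are hypothetical.

```python
# Assumed keywords; adjust after looking at the real event messages.
KEYWORDS = ("unhealthy", "health check", "stopped", "failed")


def fetch_service_events(cluster, service):
    """Return the recent event list for one ECS service (needs boto3)."""
    import boto3  # imported lazily so the filter below works without it

    client = boto3.client("ecs")
    response = client.describe_services(cluster=cluster, services=[service])
    services = response.get("services", [])
    return services[0].get("events", []) if services else []


def health_related_events(events, keywords=KEYWORDS):
    """Keep events whose message mentions a keyword, oldest first."""
    matches = [
        event
        for event in events
        if any(word in event.get("message", "").lower() for word in keywords)
    ]
    return sorted(matches, key=lambda event: event["createdAt"])
```

Printing createdAt and message for each result gives a timeline you can line up against the UnHealthyHostCount graph.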

How to change AWS ELB status to InService?

A WordPress application is deployed in AWS Elastic Beanstalk behind a load balancer. I sometimes see ELB 5XX errors. So that an instance only goes OutOfService after a higher number of failed checks, I set Unhealthy Threshold to 10. But the health check still fails sometimes and health becomes Severe. I sometimes get the error "% of the requests to the ELB are failing with HTTP 5xx". I checked the ELB access logs: sometimes requests get a timeout (504) error, and after a number of consecutive 504s the ELB marks the instance OutOfService. I am trying to find out which request is failing.
What I don't know is whether it is possible to get the instance back "InService" as quickly as possible. Sometimes the instance is OutOfService for 2-3 hours, which is really bad. Is there a good way to handle this situation? It looks like once the instance is out of service, there is nothing I can do. I am relatively new to AWS. Please help.
To solve this issue:
1) HTTP 504 means timeout. The resource that the load balancer is accessing on your backend is failing to respond. Determine the path for the healthcheck from the AWS console.
2) In your browser verify that you can access the healthcheck path going around the load balancer. This may mean temporarily assigning an EIP to the EC2 instance. If the load balancer healthcheck is "/test/myhealthpage.php" then use "http://REPLACE_WITH_EIP/test/myhealthpage.php". For HTTPS listeners use https in your path.
3) Debug why the path that you specified is timing out and fix it.
Note: Healthcheck paths should not be to pages that do complicated tests or operations. A healthcheck should be a quick and simple GO / NO GO type of page.
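To illustrate both points, here is a minimal Python sketch: a GO / NO GO style health handler that does no real work, and a probe that requests a URL directly and times it. The /health path and the names are my own assumptions, and a WordPress site would implement the page in PHP rather than like this. The probe ignores any configured HTTP proxy so the request really goes straight to the instance.

```python
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with a fixed 200 and does no other work."""

    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep health-check polling out of the console output


def probe(url, timeout=5.0):
    """Request url directly and return (status, elapsed_seconds).

    status is None when the request times out or the connection fails.
    """
    # An empty ProxyHandler bypasses any proxy set in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with 4xx/5xx
    except (urllib.error.URLError, OSError):
        status = None  # timeout, refused connection, DNS failure
    return status, time.monotonic() - start
```

For step 2 you would call something like probe("http://REPLACE_WITH_EIP/test/myhealthpage.php") with a timeout close to the one configured on the load balancer. If the status comes back as None or the elapsed time is near the timeout, the health page itself is the problem. The handler can be served for a local test with HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever().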

AWS Health Check Restart API

I have an AWS load balancer with 2 EC2 instances serving an API in Python.
If 10K requests come in at the same time as the AWS health check, the health check fails, and there are 502/504 gateway errors because the instances restart due to the failed health check.
I checked the instances: CPU usage maxed at 30%, and memory maxed at 25%.
What's the best option to fix this?
A few things to consider here:
Keep the health check API fairly light, but ensure that the health check API/URL indeed returns correct responses based on the health of the app.
You can configure the health check to mark the instance as failed only after X failed checks. You can tune this parameter and the Health check frequency to match your needs.
You can disable the EC2 restart from failed health-check by configuring your autoscaling group health-check type to EC2. This will prevent instances from being terminated due to a failed ELB health-check.
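The last two points could be scripted roughly as below. This is a sketch under assumptions: it uses the ELBv2 (ALB/NLB target group) API, so a Classic Load Balancer would need different calls, and the function names are hypothetical. The boto3 clients are passed in, e.g. boto3.client("elbv2") and boto3.client("autoscaling").

```python
def seconds_until_unhealthy(interval_seconds, unhealthy_threshold):
    """Approximate time of continuous failure before a target is marked
    unhealthy: one failed check per interval, threshold checks in a row."""
    return interval_seconds * unhealthy_threshold


def tolerate_failed_checks(elbv2_client, target_group_arn,
                           interval_seconds, unhealthy_threshold):
    """Tune check frequency and the number of failures that are tolerated."""
    elbv2_client.modify_target_group(
        TargetGroupArn=target_group_arn,
        HealthCheckIntervalSeconds=interval_seconds,
        UnhealthyThresholdCount=unhealthy_threshold,
    )
    return seconds_until_unhealthy(interval_seconds, unhealthy_threshold)


def use_ec2_health_checks(autoscaling_client, group_name):
    """Stop the Auto Scaling group from replacing instances on ELB check
    failures; only EC2 status checks count after this."""
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        HealthCheckType="EC2",
    )
```

For example, an interval of 30 seconds with an unhealthy threshold of 5 means a burst has to starve the health check for roughly 150 seconds before the instance is taken out. The trade-off is that a genuinely dead instance also stays in rotation that long.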

Diagnosing occasional HTTP 5xx errors in Elastic Beanstalk and Elastic Load Balancer

My monitoring tab in Elastic Beanstalk is showing occasional HTTP 5xx errors, both from the EB instance and the ELB that performs its load balancing.
The trouble is that I generally only see these a few hours after they occur, and by the time I log into the EB instance the logs have rotated and I see no trace of the error.
What's the best way to record the request and response associated with these errors for later viewing?
The best and cheapest option is to set up a cron job on the EC2 instance that moves the logs to an AWS S3 bucket every 15 minutes or so. In other words, store the logs in S3 so you can analyze them whenever you want.
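A script for such a cron job might look like this sketch. It assumes boto3 is installed on the instance and that the instance role can write to the bucket. The bucket, prefix, and glob pattern are placeholders, and it copies the files rather than moving them so that log rotation on the instance is left alone.

```python
import time
from pathlib import Path


def upload_logs(s3_client, log_dir, bucket, prefix, pattern="*.log*"):
    """Copy matching log files to S3 under a timestamped key prefix.

    s3_client is a boto3 S3 client, e.g. boto3.client("s3").
    Returns the list of object keys that were written.
    """
    # A UTC timestamp in the key keeps each run's copies separate.
    stamp = time.strftime("%Y/%m/%d/%H%M", time.gmtime())
    uploaded = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        key = f"{prefix}/{stamp}/{path.name}"
        s3_client.upload_file(str(path), bucket, key)
        uploaded.append(key)
    return uploaded
```

A crontab entry such as */15 * * * * running a small wrapper that calls upload_logs(boto3.client("s3"), "/var/log/httpd", "my-log-bucket", "httpd") would then cover the 15 minute schedule. The wrapper path and bucket name are up to you.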
Here are some things I've found out in the past few weeks (I'll maybe edit into a more coherent answer later):
Consider the layering here: we've got ELB -> httpd -> Tomcat (in my example). I'd forgotten about httpd (Apache 2.2 atm)
You can enable ELB logging into an S3 bucket of your choice. This allows you to see the results returned to the client
From there, trace through to httpd to see if there are any errors in /var/log/httpd
And then from there, trace through to the Tomcat logs to see if the same errors pop up there
I was seeing errors in ELB and httpd that weren't showing in Tomcat
I was also seeing a number of error messages similar to:
"proxy: error reading status line from remote server"
"(103)Software caused connection abort: proxy: pass request body failed"
Reading around, these may be caused by bugs in mod_proxy.
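Once ELB logging to S3 is on, the 5xx entries can be picked out with a few lines of Python. This is a sketch that assumes the Classic Load Balancer access log layout (space-separated fields with the request and user agent in double quotes). An Application Load Balancer writes a different set of fields, so FIELDS would need adjusting there.

```python
import shlex

# Field order of a Classic Load Balancer access log entry.
FIELDS = (
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time",
    "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes",
    "request", "user_agent", "ssl_cipher", "ssl_protocol",
)


def parse_line(line):
    """Split one log line into named fields, honouring quoted values."""
    return dict(zip(FIELDS, shlex.split(line)))


def server_errors(lines):
    """Yield parsed entries where the ELB returned a 5xx to the client."""
    for line in lines:
        if not line.strip():
            continue
        entry = parse_line(line)
        status = entry.get("elb_status_code", "")
        if status.isdigit() and int(status) >= 500:
            yield entry
```

Comparing elb_status_code with backend_status_code for each hit shows whether the error came back from the instance or was produced at the load balancer, and the timestamp and request tell you where to look in the httpd and Tomcat logs.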