"ELB health is failing or not available for all instances" + "Request timed out" on Elastic Beanstalk

I have been trying for two days to get rid of this error. It also often says "100.0 % of the requests are failing with HTTP 5xx". I have been reading the troubleshooting guide at https://aws.amazon.com/premiumsupport/knowledge-center/elb-fix-failing-health-checks-alb/ but nothing is working. I have tried changing the health check path from '/' to '/healthCheck', as that has worked for some other people.
INFO:
I am using an Application Load Balancer so that I can use HTTPS. I am using a t3.micro, although I have also tried t3.small and t3.medium.
Here are my load balancer settings in the configuration part of console:
My security group for the instance has the same two inbound rules with source 0.0.0.0/0, and allows all outbound traffic to 0.0.0.0/0.
And here is some target group info:
Where is the best place to look for this error?

Based on the comments: the cause of the issue was never determined, so a new Elastic Beanstalk environment was created in an effort to resolve the problem.

I had a similar problem; I'm not sure yet how prevalent it will be, but it happened to me twice after a number of broken deploys during testing. I even created a specific endpoint that returns 200, without luck. It seems to be more of a load balancer problem than an instance problem. A new environment cleared up the problem for me.

Related

Cannot access a public ALB

I have been trying to troubleshoot some connection issues, and I'm struggling with a relatively simple setup.
On my (relatively new) AWS account, I create a new Application Load balancer. I configure it in the following way:
Internet facing
Use the default VPC that came with the account
Across all availability zones
Uses default security group for VPC
Listens on HTTP:80 and returns a fixed response (status 404)
When I then try to use the newly assigned DNS name, it just hangs. When using curl -v I can see it says:
Trying :80
dig also responds with 3 IPs (I'm assuming for the different zones).
It feels like I'm missing something obvious, but I'm struggling to find it myself.
Can anyone see what I may be missing?
Can you please share a screenshot of the default security group and the LB configuration?
I am almost certain that the default security group allows ALL inbound traffic, but only from itself (i.e., from sources in the same security group).

Load balancer giving failed_to_pick_backend with internet network endpoint group

I have a load balancer setup pointing to an external url via internet network endpoint group (internet NEG)
Now the load balancer returns a 502 status code and I see failed_to_pick_backend in the logs. Also, the monitoring tab of the load balancer shows INVALID_BACKEND next to the internet NEG I've defined. I've attached screenshots of the view for clarity; the latter one is the one that's failing. I've checked the NEGs and they seem identical.
All the suggestions so far mention fixing health checks, but as the docs state, internet NEGs do not support health checks.
I was able to create a working setup through the UI, but when replicating the setup via Terraform, things started to fail. The only difference I saw was that in the setup done via the UI, the forwarding rule had ipVersion: IPV4; that was not possible to set up through Terraform, since the resource takes either ipVersion or ip, and I gave it an ip.
So, what could cause failed_to_pick_backend & INVALID_BACKEND with setup like mine?
I found the answer to my question from another post: https://serverfault.com/a/1065279/965524
google_compute_global_network_endpoint needs to be created before google_compute_backend_service, so you need to set depends_on = [google_compute_global_network_endpoint.endpoint] on your google_compute_backend_service. Otherwise you will hit errors like the ones described in the question.
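Sketched in Terraform, the ordering fix looks like this (all names, FQDNs, and ports are placeholders, not taken from the original setup):

```terraform
resource "google_compute_global_network_endpoint_group" "neg" {
  name                  = "example-neg"
  network_endpoint_type = "INTERNET_FQDN_PORT"
  default_port          = 443
}

resource "google_compute_global_network_endpoint" "endpoint" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.neg.id
  fqdn                          = "backend.example.com"
  port                          = 443
}

resource "google_compute_backend_service" "service" {
  name     = "example-backend-service"
  protocol = "HTTPS"

  backend {
    group = google_compute_global_network_endpoint_group.neg.id
  }

  # Without this, Terraform may create the backend service before the
  # endpoint exists, producing failed_to_pick_backend / INVALID_BACKEND.
  depends_on = [google_compute_global_network_endpoint.endpoint]
}
```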

Unexpected latency issues AWS-API Gateway

I need help troubleshooting AWS API Gateway latency issues. We have the same configuration, and even the same data, in both environments, but we are seeing high latency in non-prod. We are using an NLB and a VPC link for API Gateway. Please find the values below.
We have copied the data from the dev MongoDB to the test environment to make sure the same volume of data is present in both places. We hit /test/16 from both environments, but experience very high latency in dev compared to sandbox.
Test:
Request: /test/16
Status: 200
Latency: 213 ms
Dev:
Request: /test/16
Status: 200
Latency: 4896 ms
Have you checked your VPC flow logs to see the flow paths for the requests? If not, I suggest starting there.
FYI, you can learn about VPC flow logs at https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#working-with-flow-logs.
What is behind the load balancer? Anything you are reaching for with DNS names or just IPs?
We had a similar problem at one point; looking at the monitoring of the load balancer (ELB), we found that the problem was downstream.
The monitoring even showed that we got 504s at the load balancer.
In our case it was caused by DNS caching: the target instances had been replaced, and the DNS entries in some nginx instances on the network path to the target had not been updated.
The nginx instances had to be updated to use dynamic DNS resolution, since by default nginx only resolves the target at startup.
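The change described above can be sketched as an nginx config fragment like this (the hostname and resolver address are placeholders; use your VPC's DNS resolver):

```nginx
server {
    listen 80;

    # Re-resolve the upstream name at most every 10 seconds instead of
    # only once at startup. 169.254.169.253 is the Amazon-provided VPC
    # DNS endpoint; substitute your own resolver if needed.
    resolver 169.254.169.253 valid=10s;

    location / {
        # proxy_pass with a variable forces nginx to resolve the name
        # at request time via the resolver above.
        set $backend "http://target.internal.example.com";
        proxy_pass $backend;
    }
}
```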
Without knowing your architecture, however, it is hard to say what could be causing your problems. Here is another DNS story, with some debugging examples: https://srvaroa.github.io/kubernetes/migration/latency/dns/java/aws/microservices/2019/10/22/kubernetes-added-a-0-to-my-latency.html 🍿
Good luck.

How to change AWS ELB status to InService?

A WordPress application is deployed in AWS Elastic Beanstalk behind a load balancer. I sometimes see an ELB 5XX error. To allow more failed checks before an instance is marked OutOfService, I set the Unhealthy Threshold to 10. But sometimes the health check fails and health is Severe. I sometimes get the error "% of the requests to the ELB are failing with HTTP 5xx". I checked the ELB access logs: some requests hit the timeout (504) error, and after a consecutive number of 504s the ELB marks the instance OutOfService. I am trying to find out which request is failing.
What I don't know is whether it is possible to bring the instance back "InService" as quickly as possible, because sometimes an instance is OutOfService for 2-3 hours, which is really bad. Is there a good way to handle this situation? I am really in trouble here; once the service is out, it seems there is nothing I can do. I am relatively new to AWS. Please help.
To solve this issue:
1) HTTP 504 means timeout. The resource that the load balancer is accessing on your backend is failing to respond. Determine the health check path from the AWS console.
2) In your browser verify that you can access the healthcheck path going around the load balancer. This may mean temporarily assigning an EIP to the EC2 instance. If the load balancer healthcheck is "/test/myhealthpage.php" then use "http://REPLACE_WITH_EIP/test/myhealthpage.php". For HTTPS listeners use https in your path.
3) Debug why the path that you specified is timing out and fix it.
Note: Healthcheck paths should not be to pages that do complicated tests or operations. A healthcheck should be a quick and simple GO / NO GO type of page.
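As an illustration of that kind of quick GO / NO GO page, here is a minimal health endpoint using only the Python standard library (the path /healthcheck and port 8080 are example choices, not from the question):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal GO / NO GO health endpoint: no database calls, no heavy work."""

    def do_GET(self):
        if self.path == "/healthcheck":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep health check noise out of the logs

def run(port=8080):
    # Listen on the port your target group actually forwards traffic to.
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Point the load balancer's health check path at /healthcheck (or whatever path you pick), and make sure the instance security group allows the load balancer to reach the chosen port.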

Why does Elastic Load Balancing report 'Out of Service'?

I am trying to set up Elastic Load Balancing (ELB) in AWS to split the requests between multiple instances. I have created several images of my webserver based on the same AMI, and I am able to ssh into each individually and access the site via each distinct public DNS.
I have added each of my instances to the load balancer, but they all come back with the Status: Out of Service because they failed the health check. I'm mostly confused because I can access each instance from its public DNS, but I get a timeout whenever I visit the load balancer DNS name.
I've been trying to read through all the docs and googling it, but I'm stuck. Any pointers or links in the right direction would be greatly appreciated.
I contacted AWS support about this same issue. Apparently their system doesn't know how to handle cases where all of the instances behind the ELB are stopped for an extended amount of time. AWS support can manually refresh the statuses if you need them up immediately.
The suggested fix is to deregister the EC2 instances from the ELB instead of just stopping them, and re-register them when you start them again.
The health check is (by default) made by accessing index.html on each instance behind the load balancer. If you don't have index.html in the instance's document root, the default health check will fail. You can set a custom protocol, port, and path for the health check when creating the Elastic Load Balancer.
Finally I got this working. The issue was with the Amazon security groups: I had restricted access to port 80 to a few machines in my development area, so the load balancer could not reach the Apache server on the instance. Once the load balancer gained access to my instance, it went In Service.
I checked it with tail -f /var/log/apache2/access.log on my instance, to verify that the load balancer was trying to access my server and to see the response the server gave the load balancer.
Hope this helps.
If your web server is running fine, then the health check is hitting a URL that doesn't return 200.
A trick that works for me: log on to the instance and run curl localhost:80/pathofyourhealthcheckurl.
Then adjust your health check URL so that it always returns a 200 response.
In my case, the rules on security groups assigned to the instance and the load balancer were not allowing traffic to pass between the two. This caused the health check to fail.
I also faced the same issue; I changed the Ping Protocol from HTTPS to SSL and it worked!
Go to Health Check --> click on Edit Health Check --> change the Ping Protocol from HTTPS to SSL:
Ping Target SSL:443
Timeout 5 seconds
Interval 30 seconds
Unhealthy Threshold 5
Healthy Threshold 10
For anyone else who finds this thread, since this isn't listed yet:
Check that the health check targets the port that the responding server is listening on.
E.g., if node.js is running on port 3000, point the health check at port 3000,
not port 80 or 443; those are what your ALB listeners use.
I spent a morning on this.
I would like to suggest a general way to solve this problem. When you have set up your web server (Apache or nginx, for example), read its access log file to see what happened. In my case, it reported a 401 error because I had added basic auth in nginx. Of course, as #ivankoni reminded, it may also be that the document being checked does not exist.
I was working on the AWS Tutorial on hosting a web app and ran into this problem. Step 7b states the following:
"Set Ping Path to /. This sends queries to your default page, whether
it is named index.html or something else."
They could have put the forward slash in quotation marks, like "/". Make sure your health check path is "/" and not "/." (with the trailing period).
Adding this because I've spent hours trying to figure it out...
If you configured your health check endpoint but it still says Out of Service, it might be because your server is redirecting the request (i.e. returning a 301 or 302 response).
For example, if your endpoint is supposed to be /app/health/ but you only enter /app/health (no trailing slash) into the health check endpoint field on your ELB, you will not get a 200 response, so the health check will fail.
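One way to see this failure mode is to check the raw status code without following redirects, which is what the load balancer effectively does; Python's http.client never follows redirects. The little server below is a stand-in app that redirects the no-trailing-slash path, purely for illustration:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class App(BaseHTTPRequestHandler):
    """Stand-in app: serves /app/health/ but redirects /app/health."""

    def do_GET(self):
        if self.path == "/app/health/":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        elif self.path == "/app/health":
            # Many frameworks redirect to the trailing-slash form.
            self.send_response(301)
            self.send_header("Location", "/app/health/")
            self.end_headers()
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # silence request logging for the demo

def raw_status(port, path):
    # http.client does not follow redirects, so we see what the ELB sees.
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("GET", path)
    status = conn.getresponse().status
    conn.close()
    return status

server = HTTPServer(("127.0.0.1", 0), App)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

print(raw_status(port, "/app/health"))   # 301: health check would fail
print(raw_status(port, "/app/health/"))  # 200: health check passes
server.shutdown()
```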
I had a similar issue. The problem appears to have been caused by my using an HTTP health check while also using .htaccess to password-protect the site.
I got the same error; in my case I had to copy the particular HTML file from the S3 bucket to /var/www/html, the same HTML file referenced in the load balancer's health check path.
The issue was resolved after copying the HTML file.
I had this issue too, and it was due to both my inbound and outbound rule for the Load Balancer's Security Group only allowing HTTP traffic on port 80. I needed to add another rule for HTTPS traffic on port 443.
I was also facing the same issue, where the ELB (Classic Load Balancer) requests /index.html, not / (root), during the health check.
If it is unable to find the /index.html resource, it reports 'OutOfService'. Be sure index.html is available.