http 502 errors when new instance is being created in a group

http 502 errors when new instance is being created in a group - google-cloud-platform

We are using cross region load balancing. When we get heavy traffic all at once, within 1 region, it begins to spin up new instances. While it is starting new instances, we get random HTTP 502 errors. Screenshots of configurations below. Is there any way to avoid the 502 errors while it is scaling up?
Image links of configuration below.
Instance Group Configuration (same setting on all regions)
Load Balancer
Thanks in advance for the help!

HTTP load balancer and the instances will have different external IPs.
1) Try accessing through one instance's external IP first to make sure the backend works. If it doesn't work, usually it's firewall settings problem.
2) HTTP 502 from load balancer usually indicates the health check of the load balancer thought the backend is unhealthy, check your health check config then.
See another similar question Google Load-balancer randomly failing requests to backend

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using Route 53 domain and ssl on my url. I set the ALB redirect requests on port 80 to 443, and forward requests on port 443 to the target group (the EC2). However, the target group is returning 5xx error code sometimes when handling the request. Please see the screenshots for the metrics and configurations for the ALB.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteady, sometimes it returns a 502 or 503 service unavailable (seems like it's a connnection timeout).
I have set up the ALB idle timeout 4000 secs.
ALB configuration
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork worker Maxclient number as 1000, which should be enough to handle the requests on the application.
The EC2 is a t2.Large resource, the CPU and Memory look enough to handle the processing.
It seems like if I directly request the IP address but not the domain, the amount of 5xx errors significantly reduced (but still exists).
I also have Wordpress application host on this EC2 in a subdomain (CNAME). I have never encountered any 5xx errors on this subdomain site, which makes me guess there might be some errors in my application code but not on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 in the target group see if they can have at lease one healthy instance to handle the requests. However, the application is using a third-party API and has strict IP whitelist policy. I did some research that the Elastic IP I got from AWS cannot be attached to 2 different EC2s.

First of all, if your application is prone to stutters, increase healthcheck retries and timeouts, which will affect your initial question of flapping health.
To what I see from your screenshot, most of your 5xx are due to either server or application (you know obviously better what's the culprit since you have access to their logs).
To answer your question about 5xx errors coming from LB: this happens directly after LB kicks out unhealthy instance and if there's none to replace (which shouldn't be the case because you're supposed to have ASG if you enable evaluation of target health for LB), it can't produce meaningful output and thus crumbles with 5xx.
This should be enough information for you to make adjustments and logs investigation.

Error: Server Error The server encountered a temporary error and could not complete your request. Please try again in 30 seconds.(GCP)

I've configured a HTTP(S) Load balancer as per the documentation on https://cloud.google.com/compute/docs/load-balancing/http/
When I try to access the site via the Public IP address associated with the Load balancer. I'm getting a 502 response with the message:
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
I believe this is coming from the load balancer.
Anyone have any insight into what might be going on, what more I should be looking at?

Have a look at the documentation Troubleshooting HTTP(S) Load Balancing section Unexplained 502 errors:
If 502 errors persist longer than a few minutes after you complete the
load balancer configuration, it's likely that either:
There's no firewall rule configured to allow health checks.
The software on the backends isn't running.
To verify that health check traffic reaches your backend VMs, enable
health check logging and search for successful log entries.
To create an ingress rule that allows traffic from the Google Cloud health checking systems (130.211.0.0/22 and 35.191.0.0/16) you can use Cloud Console or this command:
gcloud compute firewall-rules create fw-allow-health-check \
--network=default \
--action=allow \
--direction=ingress \
--source-ranges=130.211.0.0/22,35.191.0.0/16 \
--target-tags=allow-health-check \
--rules=tcp
In this command target tag allow-health-check used to identify VMs.

I had the same problem. After a day of searching, it was a health checker problem. The health test was on TCP, I changed it to HTTP, the problem was solved.

Could it be that the load balancer depends on you using the URL and not an IP address?
There are a couple of reasons that might be the case.
The URL points to the load balancer and the load balancer has a list of server IP addresses that service that URL; then it picks a server and forwards the request. To do that, it must receive the oritinal URL because the load balancer may be serving multiple sets of servers.
If the IP address points to the load balancer, it won't know which set of servers to choose from. If the IP address points to a server, the load balancer will be bypassed.
That's as much as I can think of...
Jam

GCP external http load balancer 502 server error:"failed_to_connect_to_backend"

I have configured a http external load balancer on GCP and all my vm instances are healthy in backend.
But when i am trying to access my server(installed on VM) from frontend static IP that is reserved at load balancer it is giving me 502 status error.
As a result of which i am unable to launch my application server using load balancer. Help me fix this issue.
Thanking you in advance.

To troubleshoot 502 response from the Load Balancer due to "failed_to_connect_to_backend." I would check the followings:
Usually, "failed_to_connect_to_backend" error message indicates that the load balancer is failing to connect to backends, investigating URL map rules is also a good point to start. I would also suggest reviewing your Load Balancer's URL map to make sure that Host rules, Path matcher, and Path rules are correctly defined and comply with descriptions in this article.
Also check if the backend instances are exhausting their resources, If a backend server is overwhelmed, it will refuse incoming requests, potentially causing the load balancer to give up on it and return the specific 502 error you're experiencing. Also, check the output on how many established connections are present at any one time using 'netstat' and watch command.
I would also recommend testing again with the HTTP(S) request directly to the instance, request the same URL that reporting 502. You might do this test in another VM instance in your VPC network.

maybe you should check if the time taken for the API to return the response is exceeded the timeout that will trigger the 502. The default value is 30 seconds.
Ref: https://cloud.google.com/load-balancing/docs/backend-service#timeout-setting

How to change AWS ELB status to InService?

A WordPress application is deployed in AWS Elastic Beanstalk that has a load balancer. I see sometimes there is ELB 5XX error. To make the instance OutOfService for the higher number of unhealthy threshold count, I set Unhealthy Threshold to 10. But sometimes health check fails and health is Severe. I get sometimes the error "% of the requests to the ELB are failing with HTTP 5xx". I checked the ELB access logs and sometimes request get the timeout (504) error and after a consecutive number of 504, ELB makes the instance OutOfService. I am trying to fix which request is failing.
What I don't know, is it possible to make the instance "InService" as quickly as possible. Because sometimes instance is OutOfService for 2-3 hours, which is really bad. Is there any good way to handle this situation. I am really in trouble with this situation. Looks like after the service is out, I have nothing to do. I am relatively new to AWS. Please help.

To solve this issue:
1) HTTP 504 means timeout. The resource that the load balancer is accessing on your backend is failing to respond. Determine what the path for the healthcheck from the AWS console.
2) In your browser verify that you can access the healthcheck path going around the load balancer. This may mean temporarily assigning an EIP to the EC2 instance. If the load balancer healthcheck is "/test/myhealthpage.php" then use "http://REPLACE_WITH_EIP/test/myhealthpage.php". For HTTPS listeners use https in your path.
3) Debug why the path that you specified is timing out and fix it.
Note: Healthcheck paths should not be to pages that do complicated tests or operations. A healthcheck should be a quick and simple GO / NO GO type of page.

AWS load balancer and maintenance page

I'm using AWS Load Balancer with 3 EC2 servers, and I'm trying to serve a Maintenance page when site is under maintenance.
This page need to return 503 HTTP code, because it is a proper code for a maintenance mode and will prevent possible problems with SEO.
When I return 503 code from any of my servers, Load Balancer makes it "Not In Service", and when all servers return 503, website returns a blank page (because all servers are disconnected).
My questions are:
1) Is there any way to serve a custom static page with a message for visitors from Load balancer if there is no healthy servers?
2) Or how to configure Load Balancer's Health Check that it will not consider 503 as a reason to mark server as "unhealthy"?
Thanks!

I've been searching for a quick way to do this. We need to return a 503 error to the world during DB upgrade, but white list a few IPs of developers so they can test it before opening back up to public.
Found a one spot solution::
Go to the Loader Balancer in EC2 and select the load balancer you would like to target. Below, you should see Listeners. Click on a listener, and edit the rule. Create a rule like this:
Now everyone gets a pretty maintenance page returned with a 503 error code, and only two IP addresses in the first rule will be able to browse to the site. Order is important, where the two IP exceptions are on top, then it goes down the list. The last item is always there by default.
Listener Rules for Your Application Load Balancer:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/listener-update-rules.html

You could implement an additional route in your app server, let's say /hcm (for health check maintenance), that always responds 200 OK. When it's time for maintenance, you programmatically modify the ELB health check to use the /hcm target which returns 200 OK rather than / or /index.html, which both return 503 Service Unavailable. Revert these changes when exiting maintenance.

Might not meet your 503 requirement but a good option for this is using s3 and dns failover: https://aws.amazon.com/blogs/aws/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting/

The load balancer will serve a 503 for you when you no longer have any healthy server behind it so you should not do anything special.
If you return anything but a 200 on the health check, ELB will take the machine out of the load balancer after it fails the configured number of health checks.
So to recap, you can potentially serve 503 from your app when in maintenance, but you have to return 200 for health checks all the time. If you don't care about the content of the page, you can simply remove the machines from the load balancer (or fail health checks) and the LB will do the right thing for you.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js