Google cloud load balancer causing error 502 - failed_to_pick_backend

Google cloud load balancer causing error 502 - failed_to_pick_backend - google-cloud-platform

I've got an error 502 when I use google cloud balancer with CDN, the thing is, I am pretty sure I must have done something wrong setting up the load balancer because when I remove the load balancer, my website runs just fine.
This is how I configure my load balancer
here
Should I use HTTP or HTTPS healthcheck, because when I set up HTTPS
healthcheck, my website was up for a bit and then it down again
I have checked this link, they seem to have the same problem but it is not working for me.
I have followed a tutorial from openlitespeed forum to set Keep-Alive Timeout (secs) = 60s in server admin panel and configure instance to accepts long-lived connections ,still not working for me.
I have added these 2 firewall rules following this google cloud link to allow google health check ip but still didn’t work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking load balancer log message, it shows an error saying failed_to_pick_backend . I have tried to re-configure load balancer but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!

Posting an answer - based on OP's finding to improve user experience.
Solution to the error 502 - failed_to_pick_backend was changing Load Balancer from HTTP to TCP protocol and at the same type changing health check from HTTP to TCP also.
After that LB passes through all incoming connections as it should and the error dissapeared.
Here's some more info about various types of health checks and how to chose correct one.

The error message that you're facing it's "failed_to_pick_backend".
This error message means that HTTP responses code are generated when a GFE was not able to establish a connection to a backend instance or was not able to identify a viable backend instance to connect to
I noticed in the image that your health-check failed causing the aforementioned error messages, this Health Check failing behavior could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check if the running services were responding with a 200 (OK) to the Health Check probes and Verify your Backend Service timeout. The Backend Service timeout works together with the configured Health Check values to define the amount of time an instance has to respond before being considered unhealthy.
Additionally, You can see this troubleshooting guide to face some error messages (Including this).

Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using Route 53 domain and ssl on my url. I set the ALB redirect requests on port 80 to 443, and forward requests on port 443 to the target group (the EC2). However, the target group is returning 5xx error code sometimes when handling the request. Please see the screenshots for the metrics and configurations for the ALB.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteady, sometimes it returns a 502 or 503 service unavailable (seems like it's a connnection timeout).
I have set up the ALB idle timeout 4000 secs.
ALB configuration
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork worker Maxclient number as 1000, which should be enough to handle the requests on the application.
The EC2 is a t2.Large resource, the CPU and Memory look enough to handle the processing.
It seems like if I directly request the IP address but not the domain, the amount of 5xx errors significantly reduced (but still exists).
I also have Wordpress application host on this EC2 in a subdomain (CNAME). I have never encountered any 5xx errors on this subdomain site, which makes me guess there might be some errors in my application code but not on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 in the target group see if they can have at lease one healthy instance to handle the requests. However, the application is using a third-party API and has strict IP whitelist policy. I did some research that the Elastic IP I got from AWS cannot be attached to 2 different EC2s.

First of all, if your application is prone to stutters, increase healthcheck retries and timeouts, which will affect your initial question of flapping health.
To what I see from your screenshot, most of your 5xx are due to either server or application (you know obviously better what's the culprit since you have access to their logs).
To answer your question about 5xx errors coming from LB: this happens directly after LB kicks out unhealthy instance and if there's none to replace (which shouldn't be the case because you're supposed to have ASG if you enable evaluation of target health for LB), it can't produce meaningful output and thus crumbles with 5xx.
This should be enough information for you to make adjustments and logs investigation.

GCP external http load balancer 502 server error:"failed_to_connect_to_backend"

I have configured a http external load balancer on GCP and all my vm instances are healthy in backend.
But when i am trying to access my server(installed on VM) from frontend static IP that is reserved at load balancer it is giving me 502 status error.
As a result of which i am unable to launch my application server using load balancer. Help me fix this issue.
Thanking you in advance.

To troubleshoot 502 response from the Load Balancer due to "failed_to_connect_to_backend." I would check the followings:
Usually, "failed_to_connect_to_backend" error message indicates that the load balancer is failing to connect to backends, investigating URL map rules is also a good point to start. I would also suggest reviewing your Load Balancer's URL map to make sure that Host rules, Path matcher, and Path rules are correctly defined and comply with descriptions in this article.
Also check if the backend instances are exhausting their resources, If a backend server is overwhelmed, it will refuse incoming requests, potentially causing the load balancer to give up on it and return the specific 502 error you're experiencing. Also, check the output on how many established connections are present at any one time using 'netstat' and watch command.
I would also recommend testing again with the HTTP(S) request directly to the instance, request the same URL that reporting 502. You might do this test in another VM instance in your VPC network.

maybe you should check if the time taken for the API to return the response is exceeded the timeout that will trigger the 502. The default value is 30 seconds.
Ref: https://cloud.google.com/load-balancing/docs/backend-service#timeout-setting

(GCP) : Server Error The server encountered a temporary error and could not complete your request. Please try again in 30 seconds

I have created a "Load Balancer" in Google Cloud and connected 2 virtual machines to it. When I send some requests to "Load Balancer", sometimes it gets passed to virtual machines attached to load balancer and sometimes it throws following error even health check is 100% OK at that time.
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.

This answer was created to support the community based on the limited information delivered by the OP and the comments written above.
The most accurate decision to make when you try to determine the root cause of an HTTP load balancer issue is review the log entries.
According to the official google documentation. HTTP(S) Load Balancing log entries contain information useful for monitoring and debugging your HTTP(S) traffic.
Log entries contain the following types of information:
General information, such as severity, project ID, project number, and timestamp.
HttpRequest log fields. However, HttpRequest.protocol is not populated for HTTP(S) Load Balancing Cloud Logging logs.
A statusDetails field inside the structPayload. This field holds a string that explains why the load balancer returned the
HTTP status that it did. The tables below contain further
explanations of these log strings. The statusDetails field is not
available for regional external HTTP(S) load balancers.
Redirects (HTTP response status code 302 Found) issued from the load balancer are not logged. Redirects issued from the backend
instances are logged.
To enable the log entries in an HTTP Load Balancer please follow this guide.
The message “Error: Server Error The server encountered a temporary error and could not complete your request.” Could be caused for several reason reasons including:
There's no firewall rule configured to allow health checks.
The software on the backends isn't running.
In this page you can find a detailed guide to perform a complete troubleshooting related to general connectivity issues.
I found these posts related to HTTP Load balancer and 502 response, you can find useful information in these threads.
Debugging Load Balancer issues
Compute Engine HTTP Load Balancing 502 error
Google Cloud HTTP balancer returns 502 error
Error: Server Error The server encountered a temporary error and
could not complete your request. Please try again in 30
seconds.(GCP)

In my case issue was with health check not returning 200.
It returned 302 instead (Found) when calling default / and redirected to other url with 200 (which Loadbalancer checks ignored) and deemed that node as "unhealthy" and instead to route incoming http/s request to broken node removed it out of rotation and returned that 502 error message to client.
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
Underneath my LoadBalancer was GKE cluster with gke ingress->service-> pod and no explicit liveness/readiness probes configured so by default healthchecks hit / with 302/Found/redirect.
After adding those probes to deployment manifest and pointing them to endpoint that return OK/200 (/-/healthy, /-/ready in my case of prometheus running inside the pod)issue was fixed.
Unfortunately gke ingress had un-informative message UNHEALTY only in annotations, so it took me a while to understand what causes that issue.

Does Google Cloud HTTPS load balancer log back end errors?

Looking for way to debug why backend for NIFI is failing. I created a NIFI cluster (verison 1.9.0, HDF 3.1.1.4, AMBARI 2.7.3) on Google cloud. Created HTTPS load balancer terminating https front end, and back end is the instance group for SSL enabled NIFI cluster. Getting a 502 back end error in the browser when I hit the url for the load balancer. Is there a way for Google Cloud to log the error ? There must be an error returned somewhere to troubleshoot the root cause. I don't see messages in the nifi log or the vm instance /var/log/messages. Stackdriver hasn't shown me errors. I created the keystore and truststore and followed the NIFI SSL enable instructions. It might be related to the SSL configs, or possibly firewall rules are not correct. But I am looking for some more helpful information to find the error.

If I am understanding the question properly, you are looking for a way to get HTTPS load balancer logs due to back end errors and your intention is to find out the root cause.Load balancer basically return 502 error due to unhealthy backend services or for unhealthy backend VM 's.If your stackdriver logging is enabled, you might get this log using advanced filter or can search by selecting the load balancer name and look for/search 502:
Advanced filter for 502 responses due to failures to connect to backends:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="failed_to_connect_to_backend"
Advanced filter for 502 responses due to backend timeouts:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_timeout"
Advanced filter for 502 responses due to prematurely closed connections:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_connection_closed_before_data_sent_to_client"
The URL Map is same as the name of the load balancer for HTTP(S) for cloud console.If we create the various components of the load balancer manually, need to use the URL Map for advanced filter.
Most common root causes for "failed_to_connect_to_backend" are: 1. Firewall blocking traffic, 2. Web server software not running on backend instance, 3. Web server software misconfigured on backend instance, 4. Server resources exhausted and not accepting connections (CPU usage too high to respond, Memory usage too high, process killed ,the maximum amount of workers spawned and all are busy, Maximum established TCP connections), 5. Poorly written server implementation struggling under load or non-standard behavior.
Most common root causes for “backend_timeout” are 1. the backend instance took longer than the Backend Service timeout to respond, meaning either the application is overloaded or the Backend Service Timeout is set too low, 2. The backend instance didn't respond at all (Crashing during a request).
Most Common Root causes for” backend_connection_closed_before_data_sent_to_client” is usually caused because the keepalive configuration parameter for the web server software running on the backend instance is less than the fixed (10 minute) keepalive (HTTP idle) timeout of the GFE. There are some situations where the backend may close a connection too soon while the GFE is still sending the HTTP request.

The previous response was spot on. The nifi ssl configuration is misconfigured, causing the backend health check to fail with a bad certificate. I will open a new question to address the nifi ssl configuration.

How can I configure an automatic timeout for an Elastic Load Balancer?

Does anyone know of a way to make Amazon's Elastic Load Balancers timeout if an HTTP response has not been received from upstream in a set timeframe?
Occasionally Amazon's Elastic Beanstalk will fail an update and any requests to the specified resource (running Nginx + Node if tht's any use) will hang any request pages whilst the resource attempts to load.
I'd like to keep the request timeout under 2s, and if the upstream server has no response by then, to automatically fail over to a default 503 response.
Is this possible with ELB?
Cheers

You can Configure Health Check Settings for Elastic Load Balancing to achieve this:
Elastic Load Balancing routinely checks the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances. For more information on configuring health check, see Health Check.
For example, you simply need to specify an appropriate Ping Path for the HTTP health check, a Response Timeout of 2 seconds and an UnhealthyThreshold of 1 to approximate your specification.
See my answer to What does the Amazon ELB automatic health check do and what does it expect? for more details on how the ELB health check system work.

TLDR - Set your timeout in Nginx.
Let's see if we can walkthrough the issues.
Problem:
The client should be presented with something quickly. It's okay if it's a 500 page. However, the ELB currently waits 60 seconds until giving up (https://forums.aws.amazon.com/thread.jspa?messageID=382182) which means it takes a minute before the user is shown anything.
Solutions:
Change the timeout of the ELB
Looks like AWS support will help increase the timeout (https://forums.aws.amazon.com/thread.jspa?messageID=382182) so I imagine that you'll be able to ask for the reverse. Thus, we can see that it's not user/api tunable and requires you to interact with support. This takes a bit of lead time and more importantly, seems like an odd dial to tune when future developers working on this project will be surprised by such a short timeout.
Change the timeout of the nginx server
This seems like the right level of change. You can use proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout) to do what you're looking for. Tune it to something small (and in particular, you can set it for a particular location if you would like).
Change the way the request happens.
It may be beneficial to change how your client code works. You could imagine shipping a really simple html/js page that 1. pings to see if the job is done and 2. keeps the user updated on the progress. This takes a bit more work then just throwing the 500 page.

Recently, AWS added a way to configure timeouts for ELB. See this blog post:
http://aws.amazon.com/blogs/aws/elb-idle-timeout-control/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js