502 - failed_to_connect_to_backend from LB on HEALTHY target group - google-cloud-platform

I have a global load balancer on GCP. This balancer should send requests to an instance group with two backend services.
When I send requests, I randomly get 502 failed_to_connect_to_backend errors from the load balancer.
I can get a successful response seven times in a row, and then a 502 error 2-3 times for the same request.
In the Monitoring Dashboard, both of my services show as healthy.
The instance groups overview shows a 100% healthy status too.
The URL map rules are the defaults.
I also don't see any problems with resource consumption.
And, unfortunately, I couldn't get any logs from the backend side for the 502 errors; I only have logs from the load balancer.

After hours of coffee and liters of manuals (I'm not very well versed in GCP yet), the "problem" is solved. At some point I noticed that the execution time of all failed requests was ~9 seconds.
So I searched for similar symptoms, and found an answer on Google Groups.
In my case, the trouble was with the port mapping: two ports were mapped (80 and 6000), and port 80 was not being listened on by the backend.
After removing the closed port from the mapping, the 502s went away.
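One quick way to confirm this kind of symptom, before touching the load balancer config, is to test every port in the mapping directly against a backend instance. A minimal sketch (the host and ports are placeholders for your own instance and mapping):

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholders: substitute a backend instance's IP and the ports
# from your load balancer's port mapping (80 and 6000 in the question).
backend_host = "127.0.0.1"
mapped_ports = [80, 6000]

for port in mapped_ports:
    status = "listening" if port_is_listening(backend_host, port) else "NOT listening"
    print(f"{backend_host}:{port} is {status}")
```

Any port that reports "NOT listening" either has nothing bound to it on the backend or is blocked by a firewall rule, and should be removed from the mapping or fixed on the instance.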

Could it be that port 80 was not allowed by the firewall rule applied to the backend instances?

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using a Route 53 domain with SSL on my URL. I set the ALB to redirect requests on port 80 to 443, and to forward requests on port 443 to the target group (the EC2). However, the target group sometimes returns a 5xx error code when handling requests. Please see the screenshots for the ALB metrics and configuration.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteadily; sometimes it returns a 502 or 503 Service Unavailable (it seems like a connection timeout).
I have set the ALB idle timeout to 4000 seconds.
ALB configuration
The application uses Nuxt.js + PHP 7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork MaxClients to 1000, which should be enough to handle the requests to the application.
The EC2 is a t2.large instance; CPU and memory look sufficient to handle the processing.
It seems that if I request the IP address directly instead of the domain, the number of 5xx errors is significantly reduced (but they still occur).
I also have a WordPress application hosted on this EC2 in a subdomain (CNAME). I have never encountered any 5xx errors on that subdomain, which makes me guess there might be errors in my application code rather than on the server side.
Is the 5xx error from my application or from the server?
I also tried adding another EC2 to the target group to see if at least one healthy instance could handle the requests. However, the application uses a third-party API with a strict IP whitelist policy, and from my research, the Elastic IP I got from AWS cannot be attached to two different EC2 instances.
First of all, if your application is prone to stutters, increase the health check retries and timeouts; that addresses your initial issue of flapping target health.
From your screenshots, most of your 5xx errors come from either the server or the application (you obviously know better which is the culprit, since you have access to their logs).
To answer your question about 5xx errors coming from the LB: this happens right after the LB kicks out an unhealthy instance. If there is no instance to replace it (which shouldn't be the case, because you're supposed to have an ASG if you enable target-health evaluation for the LB), the LB can't produce meaningful output and responds with a 5xx itself.
This should be enough information for you to make adjustments and investigate the logs.
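To make the "increase retries and timeouts" advice concrete: an ALB marks a target unhealthy only after a number of consecutive failed checks (the target group's UnhealthyThresholdCount), spaced HealthCheckIntervalSeconds apart, so their product approximates how long the app may stutter before the target gets kicked out. A rough sketch with illustrative numbers (check your own target group settings):

```python
def time_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Approximate seconds of continuous failures before the ALB marks a
    target unhealthy: unhealthy_threshold consecutive failed checks,
    one check every interval_s seconds."""
    return interval_s * unhealthy_threshold

# Illustrative values only.
default = time_to_unhealthy(interval_s=30, unhealthy_threshold=2)   # 60 s
relaxed = time_to_unhealthy(interval_s=30, unhealthy_threshold=5)   # 150 s

print(f"default: target is removed after {default} s of failed checks")
print(f"relaxed: target tolerates up to {relaxed} s before removal")
```

Raising the threshold or interval makes health flapping less likely during brief stutters, at the cost of keeping a genuinely broken instance in rotation longer.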

Putting an ALB-NLB-ALB route for requests is giving 502 for the application

We had a primary ALB listening for all our apps, mapped through R53 records. Now we have a listener-rule crunch, as an ALB doesn't support more than 100 rules. So a solution was proposed: put an NLB behind the primary ALB, and then a secondary ALB behind the NLB.
So flow will be:
Requests--->R53--->ALB1--->NLB--->ALB2--->Apps
ALB1 has a default rule which allows unmatched requests to pass through to NLB and then ultimately to ALB2 where new rules are evaluated.
Rule configuration at ALB1 is:
Default rule --Forwardto-->
Rule at NLB:
TCP-443 listener rule --ForwardTo--> ALB2 TG with fargate application ip
But we're seeing intermittent 502 responses from the primary ALB while testing. We are not seeing any 502s logged on ALB2, so possibly the NLB is terminating connections, as we have seen the target reset count climbing in the NLB metrics.
Also, nothing is logged in the application logs.
In another test, we routed traffic directly to ALB2 through R53, and we didn't see any 502 responses there.
Any suggestions on how to go about debugging this?
I think I have the answer to my problem now, so I'm sharing it for a wider audience. The reason for the intermittent 502s was the inconsistency of the idle_timeout_value across the LBs and the backend application.
Since the NLB's idle_timeout_value is fixed at 350 seconds and can't be changed, we had inconsistent values across the LBs: the first and last ALBs had a value of 600 seconds.
Ideally, the application should have the highest idle_timeout_value, with each LB in the chain set lower than the hop behind it.
So setting the first ALB to 300 seconds and the second ALB to 500 seconds solved the problem, and we haven't seen a single 5xx since this change.
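The ordering rule described above can be expressed as a small check: every hop, from the client-facing LB down to the application, should time out idle connections sooner than the hop behind it, so an upstream hop never reuses a connection its downstream peer has already closed. A sketch using the values from this answer (the application's 650 s value is an assumption for illustration):

```python
def check_idle_timeouts(chain):
    """chain: list of (name, idle_timeout_seconds) ordered from the
    client-facing hop to the application. Flags any hop whose idle
    timeout is not strictly lower than the next hop's."""
    problems = []
    for (name_a, t_a), (name_b, t_b) in zip(chain, chain[1:]):
        if t_a >= t_b:
            problems.append(f"{name_a} ({t_a}s) must be lower than {name_b} ({t_b}s)")
    return problems

# Before the fix: ALB1 at 600 s sits in front of the NLB's fixed 350 s.
broken = [("ALB1", 600), ("NLB", 350), ("ALB2", 600), ("app", 650)]
# After the fix: 300 < 350 < 500 < 650, increasing toward the application.
fixed  = [("ALB1", 300), ("NLB", 350), ("ALB2", 500), ("app", 650)]

print(check_idle_timeouts(broken))  # flags ALB1 vs NLB
print(check_idle_timeouts(fixed))   # no problems
```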

Google cloud load balancer causing error 502 - failed_to_pick_backend

I've got a 502 error when I use the Google Cloud load balancer with CDN. The thing is, I'm pretty sure I must have done something wrong setting up the load balancer, because when I remove it, my website runs just fine.
This is how I configure my load balancer:
here
Should I use an HTTP or HTTPS health check? When I set up an HTTPS health check, my website was up for a bit and then went down again.
I have checked this link; they seem to have the same problem, but their fix is not working for me.
I followed a tutorial from the OpenLiteSpeed forum to set Keep-Alive Timeout (secs) = 60 in the server admin panel and configured the instance to accept long-lived connections; still not working for me.
I have added these two firewall rules, following these Google Cloud docs, to allow the Google health check IP ranges, but it still didn't work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking the load balancer logs, I see an error saying failed_to_pick_backend. I have tried to re-configure the load balancer, but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited; it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!
Posting an answer based on OP's finding, to improve user experience.
The solution to the error 502 - failed_to_pick_backend was changing the load balancer from HTTP to TCP, and at the same time changing the health check from HTTP to TCP as well.
After that, the LB passes through all incoming connections as it should, and the error disappeared.
Here's some more info about the various types of health checks and how to choose the correct one.
The error message that you're facing is "failed_to_pick_backend".
This error means the HTTP response code is generated when a GFE was not able to establish a connection to a backend instance, or was not able to identify a viable backend instance to connect to.
I noticed in the image that your health check failed, causing the aforementioned error messages. This health check failure could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check whether the running services respond with a 200 (OK) to the health check probes, and verify your backend service timeout. The backend service timeout works together with the configured health check values to define how long an instance has to respond before being considered unhealthy.
Additionally, you can see this troubleshooting guide for these error messages (including this one).
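If the health check is failing because the web server isn't answering probes, the first step is to make sure something fast and reliable serves the probe path. A minimal sketch of such an endpoint, using Python's standard library (the /healthz path is an assumption; configure your health check to request whatever path you choose, and in production the handler should reflect real dependency status rather than always returning 200):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = b"ok"
            self.send_response(200)                      # the code the probe must see
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):                        # silence per-probe log noise
        pass

server = HTTPServer(("127.0.0.1", 0), HealthHandler)     # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Simulate one probe request against the endpoint.
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/healthz") as resp:
    status, body = resp.status, resp.read().decode()

server.shutdown()
print(status, body)  # 200 ok
```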
Those experienced with Kubernetes on other platforms may be confused as to why their Ingresses call their backends "UNHEALTHY".
Health checks are not the same thing as readiness probes and liveness probes.
Health checks are an independent mechanism used by GCP's load balancers. They perform much the same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

HTTP ERROR 408 - After setting up kubernetes , along with AWS ELB and NGINX Ingress

I am finding it extremely hard to debug this issue. I have a Kubernetes cluster set up, along with services for the pods. They are connected to an NGINX ingress, which sits behind an AWS Classic ELB, which in turn is pointed to by an AWS Route 53 DNS record for my domain name. Everything works fine with that, but I'm facing an issue where my domains do not behave the way I would like them to.
The domains in my nginx-ingress rules are connected to a service which serves an "alive" page when hit via the domain; but when I do that, I get this page instead.
Please help me figure out what to do to resolve this quickly. Thanks in advance!
Talk to you soon
When you run web servers behind an ELB, be aware that they generate a lot of 408 responses due to the ELB's health checks.
Possible solutions:
1. Set RequestReadTimeout header=0 body=0
This disables the 408 responses if a request times out.
2. Disable logging for the ELB IP addresses with:
SetEnvIf Remote_Addr "10\.0\.0\.5" exclude_from_log
CustomLog logs/access_log common env=!exclude_from_log
3. Set up different port for ELB health check.
4. Adjust your request timeout to be higher than 60 seconds.
5. Make sure that the idle time configured on the Elastic Loadbalancer is slightly lower than the idle timeout configured for the Apache httpd running on each of the instances.
Take a look: amazon-aws-http-408, haproxy-elb, 408-http-elb.

AWS/EKS: Getting frequent 504 gateway timeout errors from ALB

I'm using EKS to deploy a service, with ingress running on top of alb-ingress-controller.
All in all I have about 10 replicas of a single pod, with a single service of type NodePort which forwards traffic to them. The replicas run on 10 nodes, established with eksctl, and spread across 3 availability zones.
The problem I'm seeing is very strange - inside the cluster, all the logs show that requests are being handled in less than 1s, mostly around 20-50 millis. I know this because I used linkerd to show the percentiles of request latencies, as well as the app logs themselves. However, the ALB logs/monitoring tell a very different story. I see a relatively high request latency (often approaching 20s or more), and often also 504 errors returned from the ELB (sometimes 2-3 every 5 minutes).
When trying to read the access logs for the ALB, I noticed that the 504 lines look like this:
https 2019-12-10T14:56:54.514487Z app/1d061b91-XXXXX-3e15/19297d73543adb87 52.207.101.56:41266 192.168.32.189:30246 0.000 -1 -1 504 - 747 308 "GET XXXXXXXX" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:750977848747:targetgroup/1d061b91-358e2837024de757a3d/e59bbbdb58407de3 "Root=1-5defb1fa-cbcdd248dd043b5bf1221ad8" "XXXX" "XXXX" 1 2019-12-10T14:55:54.514000Z "forward" "-" "-" "192.168.32.189:30246" "-"
Here the request processing time is 0 and the target processing time is -1, indicating the request never made it to the backend and the response was returned immediately.
I tried to play with the backend HTTP keepalive timeout (currently at 75s) and with the ALB idle time (currently at 60s) but nothing seems to change much for this behavior.
If anyone can point me to how to proceed and investigate this, or what the cause can be, I'd appreciate it very much.
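Since the diagnosis hinges on spotting entries whose target_processing_time is -1, a small sketch that classifies ALB access-log lines can help quantify how often the LB answers without the backend completing the request (field positions follow the documented ALB access-log format; the sample is the log line from the question, and shlex is used here simply to handle the quoted fields):

```python
import shlex

def lb_side_504(log_line: str) -> bool:
    """True when an ALB access-log entry is a 504 whose
    target_processing_time is -1, i.e. the LB answered without the
    backend ever completing the request.  Per the ALB log format,
    target_processing_time is field 6 and elb_status_code is field 8."""
    fields = shlex.split(log_line)
    return fields[8] == "504" and fields[6] == "-1"

# The sample log line from the question.
line = ('https 2019-12-10T14:56:54.514487Z app/1d061b91-XXXXX-3e15/19297d73543adb87 '
        '52.207.101.56:41266 192.168.32.189:30246 0.000 -1 -1 504 - 747 308 '
        '"GET XXXXXXXX" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 '
        'arn:aws:elasticloadbalancing:eu-west-1:750977848747:targetgroup/'
        '1d061b91-358e2837024de757a3d/e59bbbdb58407de3 '
        '"Root=1-5defb1fa-cbcdd248dd043b5bf1221ad8" "XXXX" "XXXX" 1 '
        '2019-12-10T14:55:54.514000Z "forward" "-" "-" "192.168.32.189:30246" "-"')

print(lb_side_504(line))  # True
```

Running this over a day's worth of log lines and counting the True results gives a concrete rate of LB-side drops to compare against what the in-cluster metrics report.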
We faced a similar type of issue with the EKS and ALB combination. If the target processing time shows -1, there is a chance that the request waiting queue is full on the target side, so the ALB immediately drops the request.
Try running an ab benchmark that skips the ALB and sends requests directly to the service or the private IP address. Doing this will help you identify where the problem is.
For us, 1 out of 10 requests failed when we sent traffic via the ALB. We saw no failures when sending requests directly to the service.
The AWS recommendation is to use an NLB over the ALB. An NLB has more advantages and is well suited to Kubernetes. There is a blog post which explains this: Using a Network Load Balancer with the NGINX Ingress Controller on Amazon EKS.
We changed to an NLB and are no longer getting 5xx errors.