AWS ELB randomly returns 504 /503 response

AWS ELB randomly returns 504 /503 response - amazon-web-services

We have a UI application deployed on Apache web server, within EC2 instance (t2.small). We have 2 such instances behind ELB (Classic load balacner), in 2 availability zones. From browser request is made to get javascript and css resources from server, however sometimes the network request to get these resources fails, throwing 504 timeout and sometimes 503. When we refresh the page multiple times, sometimes we get 504/503 and then on next refresh it loads fine.
I don't see any error in Apache access logs, and nothing useful in ELB logs.
Following response header is returned for 504 timeout, which I don't see for 2xx response:
"server: awselb/2.0".
I have tried keeping just one instance as well, but still this issue is reproducable.
Any debugging pointers appreciated. TIA

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using Route 53 domain and ssl on my url. I set the ALB redirect requests on port 80 to 443, and forward requests on port 443 to the target group (the EC2). However, the target group is returning 5xx error code sometimes when handling the request. Please see the screenshots for the metrics and configurations for the ALB.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteady, sometimes it returns a 502 or 503 service unavailable (seems like it's a connnection timeout).
I have set up the ALB idle timeout 4000 secs.
ALB configuration
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork worker Maxclient number as 1000, which should be enough to handle the requests on the application.
The EC2 is a t2.Large resource, the CPU and Memory look enough to handle the processing.
It seems like if I directly request the IP address but not the domain, the amount of 5xx errors significantly reduced (but still exists).
I also have Wordpress application host on this EC2 in a subdomain (CNAME). I have never encountered any 5xx errors on this subdomain site, which makes me guess there might be some errors in my application code but not on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 in the target group see if they can have at lease one healthy instance to handle the requests. However, the application is using a third-party API and has strict IP whitelist policy. I did some research that the Elastic IP I got from AWS cannot be attached to 2 different EC2s.

First of all, if your application is prone to stutters, increase healthcheck retries and timeouts, which will affect your initial question of flapping health.
To what I see from your screenshot, most of your 5xx are due to either server or application (you know obviously better what's the culprit since you have access to their logs).
To answer your question about 5xx errors coming from LB: this happens directly after LB kicks out unhealthy instance and if there's none to replace (which shouldn't be the case because you're supposed to have ASG if you enable evaluation of target health for LB), it can't produce meaningful output and thus crumbles with 5xx.
This should be enough information for you to make adjustments and logs investigation.

Google cloud load balancer causing error 502 - failed_to_pick_backend

I've got an error 502 when I use google cloud balancer with CDN, the thing is, I am pretty sure I must have done something wrong setting up the load balancer because when I remove the load balancer, my website runs just fine.
This is how I configure my load balancer
here
Should I use HTTP or HTTPS healthcheck, because when I set up HTTPS
healthcheck, my website was up for a bit and then it down again
I have checked this link, they seem to have the same problem but it is not working for me.
I have followed a tutorial from openlitespeed forum to set Keep-Alive Timeout (secs) = 60s in server admin panel and configure instance to accepts long-lived connections ,still not working for me.
I have added these 2 firewall rules following this google cloud link to allow google health check ip but still didn’t work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking load balancer log message, it shows an error saying failed_to_pick_backend . I have tried to re-configure load balancer but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!

Posting an answer - based on OP's finding to improve user experience.
Solution to the error 502 - failed_to_pick_backend was changing Load Balancer from HTTP to TCP protocol and at the same type changing health check from HTTP to TCP also.
After that LB passes through all incoming connections as it should and the error dissapeared.
Here's some more info about various types of health checks and how to chose correct one.

The error message that you're facing it's "failed_to_pick_backend".
This error message means that HTTP responses code are generated when a GFE was not able to establish a connection to a backend instance or was not able to identify a viable backend instance to connect to
I noticed in the image that your health-check failed causing the aforementioned error messages, this Health Check failing behavior could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check if the running services were responding with a 200 (OK) to the Health Check probes and Verify your Backend Service timeout. The Backend Service timeout works together with the configured Health Check values to define the amount of time an instance has to respond before being considered unhealthy.
Additionally, You can see this troubleshooting guide to face some error messages (Including this).

Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

GCP external http load balancer 502 server error:"failed_to_connect_to_backend"

I have configured a http external load balancer on GCP and all my vm instances are healthy in backend.
But when i am trying to access my server(installed on VM) from frontend static IP that is reserved at load balancer it is giving me 502 status error.
As a result of which i am unable to launch my application server using load balancer. Help me fix this issue.
Thanking you in advance.

To troubleshoot 502 response from the Load Balancer due to "failed_to_connect_to_backend." I would check the followings:
Usually, "failed_to_connect_to_backend" error message indicates that the load balancer is failing to connect to backends, investigating URL map rules is also a good point to start. I would also suggest reviewing your Load Balancer's URL map to make sure that Host rules, Path matcher, and Path rules are correctly defined and comply with descriptions in this article.
Also check if the backend instances are exhausting their resources, If a backend server is overwhelmed, it will refuse incoming requests, potentially causing the load balancer to give up on it and return the specific 502 error you're experiencing. Also, check the output on how many established connections are present at any one time using 'netstat' and watch command.
I would also recommend testing again with the HTTP(S) request directly to the instance, request the same URL that reporting 502. You might do this test in another VM instance in your VPC network.

maybe you should check if the time taken for the API to return the response is exceeded the timeout that will trigger the 502. The default value is 30 seconds.
Ref: https://cloud.google.com/load-balancing/docs/backend-service#timeout-setting

(GCP) : Server Error The server encountered a temporary error and could not complete your request. Please try again in 30 seconds

I have created a "Load Balancer" in Google Cloud and connected 2 virtual machines to it. When I send some requests to "Load Balancer", sometimes it gets passed to virtual machines attached to load balancer and sometimes it throws following error even health check is 100% OK at that time.
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.

This answer was created to support the community based on the limited information delivered by the OP and the comments written above.
The most accurate decision to make when you try to determine the root cause of an HTTP load balancer issue is review the log entries.
According to the official google documentation. HTTP(S) Load Balancing log entries contain information useful for monitoring and debugging your HTTP(S) traffic.
Log entries contain the following types of information:
General information, such as severity, project ID, project number, and timestamp.
HttpRequest log fields. However, HttpRequest.protocol is not populated for HTTP(S) Load Balancing Cloud Logging logs.
A statusDetails field inside the structPayload. This field holds a string that explains why the load balancer returned the
HTTP status that it did. The tables below contain further
explanations of these log strings. The statusDetails field is not
available for regional external HTTP(S) load balancers.
Redirects (HTTP response status code 302 Found) issued from the load balancer are not logged. Redirects issued from the backend
instances are logged.
To enable the log entries in an HTTP Load Balancer please follow this guide.
The message “Error: Server Error The server encountered a temporary error and could not complete your request.” Could be caused for several reason reasons including:
There's no firewall rule configured to allow health checks.
The software on the backends isn't running.
In this page you can find a detailed guide to perform a complete troubleshooting related to general connectivity issues.
I found these posts related to HTTP Load balancer and 502 response, you can find useful information in these threads.
Debugging Load Balancer issues
Compute Engine HTTP Load Balancing 502 error
Google Cloud HTTP balancer returns 502 error
Error: Server Error The server encountered a temporary error and
could not complete your request. Please try again in 30
seconds.(GCP)

In my case issue was with health check not returning 200.
It returned 302 instead (Found) when calling default / and redirected to other url with 200 (which Loadbalancer checks ignored) and deemed that node as "unhealthy" and instead to route incoming http/s request to broken node removed it out of rotation and returned that 502 error message to client.
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
Underneath my LoadBalancer was GKE cluster with gke ingress->service-> pod and no explicit liveness/readiness probes configured so by default healthchecks hit / with 302/Found/redirect.
After adding those probes to deployment manifest and pointing them to endpoint that return OK/200 (/-/healthy, /-/ready in my case of prometheus running inside the pod)issue was fixed.
Unfortunately gke ingress had un-informative message UNHEALTY only in annotations, so it took me a while to understand what causes that issue.

AWS ELB Logs not showing 5xx errors in

I am seeing an strange issues with AWS ELB, I am getting High-Sum-HTTP-5XX from ELB but when I go to log I do not see any request in access log which have 5XX errors.
Does elb access log does not have 5XX errors reported there. Where can I see which request were having 5XX error it will help me to find root cause. I do not see anything in my server log as well.

I'm speculating, you are running a CLB (Classic Load Balancer). The access log with HTTP 5xx errors entries should be analyzed using elb_status_code and a backend_status_code
entries.
This could be off the topic but from AWS's documentation, it looks like some of these HTTP messages cannot be parsed by Classic Load Balancer (This could happen if there is reverse proxy in place on the instance that is sending an error that the ELB doesn't understand and hence are not recorded in the access logs. I could see the 404 errors in the access logs).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js