Google Load Balancer retrying the request on web server after timeout expires

We are using Google Load Balancer with Tomcat Server. We have configured a specific timeout on the load balancer through the Cloud Console. Whenever a request takes longer than that timeout, GLB returns a 502, which is expected.
Here is the problem -
Whenever a request takes longer than the configured timeout, Tomcat receives the same request a second time exactly after the timeout expires, e.g. with a 30-second timeout we see the same request hit Tomcat again exactly 30 seconds later.
In the browser, the response time for the 502 is exactly twice the timeout. (It might be network turnaround time, but why is it always exactly twice?)

I assume you are referring to HTTP(S) Load Balancing. In this scenario, a reverse proxy sits in front of your application, handling requests and forwarding them to your backends. This proxy (a GFE) will retry as documented:
HTTP(S) load balancing retries failed GET requests in certain circumstances, such as when the response timeout is exhausted. It does not retry failed POST requests. Retries are limited to two attempts. Retried requests only generate one log entry for the final response. Refer to Logging for more information.
That retry also explains both symptoms: the first attempt times out after 30 seconds, the GFE retries, and the second attempt times out after another 30 seconds, so Tomcat sees the request twice and the browser gets the 502 after roughly twice the configured timeout.
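If you want to confirm on the Tomcat side that the duplicate request is the GFE retry rather than the browser re-sending it, a minimal servlet filter sketch along these lines (assuming Tomcat 9 / Servlet 4.0; the class name and log format are only illustrative) can log each arrival together with the proxy headers the load balancer adds:

import java.io.IOException;
import java.time.Instant;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletRequest;

// Logs every incoming request with a timestamp and the proxy headers, so a
// second identical GET arriving exactly one timeout later can be attributed
// to the GFE retry rather than to the client.
@WebFilter("/*")
public class RetryLoggingFilter implements Filter {

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        System.out.printf("%s %s %s via=%s xff=%s%n",
                Instant.now(),
                req.getMethod(),
                req.getRequestURI(),
                req.getHeader("Via"),               // the GFE adds "1.1 google"
                req.getHeader("X-Forwarded-For"));  // original client IP
        chain.doFilter(request, response);
    }
}

Two log lines for the same GET, spaced by exactly the configured timeout and both carrying the load balancer's headers, point at the documented retry rather than at the browser.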

Related

Websocket connection being closed on Google Compute Engine

I have a set of apps deployed in Docker containers that use websockets to communicate. One is the backend and one is the frontend.
I have both VM instances inside instance groups and served up through load balancers so that I can host them at https domains.
The problem I'm having is that in Google Compute Engine, the websocket connection is being closed after 30 seconds.
When running locally, the websockets do not time out. I've searched the issue and found these possible reasons, but I can't find a solution:
Websockets might time out on their own if you don't pass "keep alive" messages to keep them active. So I pass a keep alive from the frontend to the backend, and have the backend respond to the frontend, every 10 seconds.
According to the GCE websocket support docs, some type of "upgrade" handshake needs to occur between the backend and the frontend for the websocket connection to be kept alive. But according to MDN, "if you're opening a new connection using the WebSocket API, or any library that does WebSockets, most or all of this is done for you." I am using that API, and indeed, when I inspect the headers, I see those fields.
The GCE backend service timeout docs say:
For external HTTP(S) load balancers and internal HTTP(S) load balancers, if the HTTP connection is upgraded to a WebSocket, the backend service timeout defines the maximum amount of time that a WebSocket can be open, whether idle or not.
This seems to be in conflict with the GCE websocket support docs that say:
When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection.
Which is it? I want to keep sockets open once they're established, but requests to initialize websocket connections should still time out if they take longer than 30 seconds. And I don't want to allow other standard REST calls to block forever, either.
What should I do? Do I have to set the timeout on the backend service to forever, and deal with the fact that other non-websocket REST calls may be susceptible to never timing out? Or is there a better way?
As mentioned in this GCP doc "When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection. If the backend instance does not return a successful Upgrade response, the load balancer closes the connection."
In addition, the websocket timeout is a combination of the LB timeout and the backend timeout. I understand that you have already modified the backend timeout, so you can also adjust the load balancer timeout according to your needs; please keep in mind that the default value is 30 seconds.
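Because the backend service timeout bounds how long a WebSocket can stay open even when it is active, the client side also has to cope with the load balancer eventually closing the connection. A rough reconnect-on-close sketch using the standard javax.websocket client API (assuming a JSR 356 implementation such as Tyrus on the classpath; the wss://example.com/ws URL and class name are placeholders):

import java.net.URI;
import javax.websocket.ClientEndpoint;
import javax.websocket.CloseReason;
import javax.websocket.ContainerProvider;
import javax.websocket.OnClose;
import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.WebSocketContainer;

// Client that simply reopens the socket whenever the peer (here, the load
// balancer at the end of the backend service timeout) closes it.
@ClientEndpoint
public class ReconnectingClient {

    private static final URI ENDPOINT = URI.create("wss://example.com/ws"); // placeholder URL

    @OnMessage
    public void onMessage(String message) {
        System.out.println("received: " + message);
    }

    @OnClose
    public void onClose(Session session, CloseReason reason) {
        System.out.println("closed by peer (" + reason + "), reconnecting...");
        try {
            Thread.sleep(1_000); // fixed delay; a real client would use backoff
            connect();           // reopen the socket
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static void connect() throws Exception {
        WebSocketContainer container = ContainerProvider.getWebSocketContainer();
        container.connectToServer(ReconnectingClient.class, ENDPOINT);
    }

    public static void main(String[] args) throws Exception {
        connect();
        Thread.currentThread().join(); // keep the JVM alive while the socket runs
    }
}

This lets you keep a moderate timeout as a guard for ordinary REST calls while WebSocket clients transparently re-establish their sessions.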
We have a similar, strange issue on GCP, and this is without even using a load balancer. We have a WSS server; we can connect to it fine, but it randomly stops the feed: no disconnect message, it just stops sending the WSS feed to the client. It could be after 2-3 minutes, it could be after 15-20 minutes, but it usually never lasts longer than that before dropping the connection.
We take the exact same code, the exact same build (it's all containerized), drop it on AWS, and the problem with WSS magically disappears.
There is no question this issue is GCP related.

AWS Loadbalancer terminating http call before complete http response is sent

We are making a REST call to a Spring Boot application hosted in a PCF environment. There is an AWS load balancer in front of our application to handle traffic management.
We are consuming the http request in a streaming manner using the Apache Commons FileUpload library [https://commons.apache.org/proper/commons-fileupload/]. While processing the request, we immediately send the response back without waiting for the whole request to arrive. The size of the http request is normally large, in the range of 100 MB.
This implementation works fine without the AWS load balancer in between. When the AWS load balancer is present, it terminates the http call after a few bytes of the response have been sent.
If we defer sending the response until the whole request has been received on the server side, requests go through without any failures.
If the size of the http request is small, the implementation also works fine.
Any idea why the AWS load balancer terminates the http call if we start sending the http response before receiving the full http request?
I assume in your case you are using an Application Load Balancer. I'd recommend using a Network Load Balancer instead; a Classic Load Balancer may work too, as both are more transparent to your requests than an Application Load Balancer, and they are also recommended for API implementations.
If this still does not solve your case, consider implementing an HAProxy load balancer:
https://www.loadbalancer.org/blog/transparent-load-balancing-with-haproxy-on-amazon-ec2/
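As noted in the question, deferring the response until the entire request has been read avoids the failure. A minimal sketch of that workaround for a Spring Boot controller (assuming Spring's own multipart parsing is disabled with spring.servlet.multipart.enabled=false so the raw stream reaches the controller; the /upload path is hypothetical):

import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;
import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

// Streams the multipart upload but does not write the response until every
// part has been fully read, so the load balancer never sees response bytes
// while request bytes are still in flight.
@RestController
public class UploadController {

    @PostMapping("/upload")  // hypothetical endpoint path
    public ResponseEntity<String> upload(HttpServletRequest request) throws Exception {
        ServletFileUpload upload = new ServletFileUpload();
        FileItemIterator parts = upload.getItemIterator(request);
        long totalBytes = 0;
        while (parts.hasNext()) {
            FileItemStream part = parts.next();
            try (InputStream in = part.openStream()) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    totalBytes += read; // process each chunk here instead of only counting it
                }
            }
        }
        // Only now, after the full request body has been consumed, is the response sent.
        return ResponseEntity.ok("received " + totalBytes + " bytes");
    }
}

This trades the early response for compatibility with proxies that do not expect the response to overlap the still-inbound request.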

Does Google Cloud HTTPS load balancer log back end errors?

Looking for a way to debug why the backend for NiFi is failing. I created a NiFi cluster (version 1.9.0, HDF 3.1.1.4, Ambari 2.7.3) on Google Cloud. I created an HTTPS load balancer with an HTTPS-terminating front end, and the back end is the instance group for the SSL-enabled NiFi cluster. I get a 502 backend error in the browser when I hit the URL for the load balancer. Is there a way for Google Cloud to log the error? There must be an error returned somewhere to troubleshoot the root cause. I don't see messages in the NiFi log or the VM instance's /var/log/messages. Stackdriver hasn't shown me errors. I created the keystore and truststore and followed the NiFi SSL-enablement instructions. It might be related to the SSL configs, or possibly the firewall rules are not correct. But I am looking for some more helpful information to find the error.
If I am understanding the question properly, you are looking for a way to get HTTPS load balancer logs for backend errors, and your intention is to find the root cause. The load balancer basically returns a 502 error due to unhealthy backend services or unhealthy backend VMs. If Stackdriver logging is enabled, you can find these entries using an advanced filter, or search by selecting the load balancer name and looking for 502:
Advanced filter for 502 responses due to failures to connect to backends:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="failed_to_connect_to_backend"
Advanced filter for 502 responses due to backend timeouts:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_timeout"
Advanced filter for 502 responses due to prematurely closed connections:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_connection_closed_before_data_sent_to_client"
The URL Map is the same as the name of the load balancer for HTTP(S) in the Cloud Console. If you create the various components of the load balancer manually, you need to use the URL Map in the advanced filter.
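If you prefer to pull these entries programmatically rather than through the console, a small sketch using the Cloud Logging Java client library (assuming the google-cloud-logging dependency and Application Default Credentials; replace [URL Map] and the statusDetails value as needed):

import com.google.cloud.logging.LogEntry;
import com.google.cloud.logging.Logging;
import com.google.cloud.logging.Logging.EntryListOption;
import com.google.cloud.logging.LoggingOptions;

// Lists recent 502 responses from the HTTP(S) load balancer logs using the
// same advanced filter shown above.
public class List502Entries {

    public static void main(String[] args) throws Exception {
        String filter = "resource.type=\"http_load_balancer\""
                + " AND resource.labels.url_map_name=\"[URL Map]\""  // your URL map / LB name
                + " AND httpRequest.status=502"
                + " AND jsonPayload.statusDetails=\"failed_to_connect_to_backend\"";

        try (Logging logging = LoggingOptions.getDefaultInstance().getService()) {
            for (LogEntry entry : logging.listLogEntries(EntryListOption.filter(filter)).iterateAll()) {
                System.out.println(entry);  // each entry carries the statusDetails explaining the 502
            }
        }
    }
}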
Most common root causes for "failed_to_connect_to_backend" are:
1. Firewall blocking traffic.
2. Web server software not running on the backend instance.
3. Web server software misconfigured on the backend instance.
4. Server resources exhausted and not accepting connections: CPU usage too high to respond, memory usage too high, process killed, the maximum number of workers spawned and all of them busy, or the maximum number of established TCP connections reached.
5. Poorly written server implementation struggling under load, or non-standard behavior.
Most common root causes for "backend_timeout" are:
1. The backend instance took longer than the Backend Service timeout to respond, meaning either the application is overloaded or the Backend Service timeout is set too low.
2. The backend instance didn't respond at all (crashing during a request).
The most common root cause for "backend_connection_closed_before_data_sent_to_client" is a keepalive configuration parameter on the web server software running on the backend instance that is lower than the fixed (10 minute) keepalive (HTTP idle) timeout of the GFE. There are also some situations where the backend may close a connection too soon while the GFE is still sending the HTTP request.
The previous response was spot on. The NiFi SSL configuration was misconfigured, causing the backend health check to fail with a bad certificate. I will open a new question to address the NiFi SSL configuration.

AWS ELB - will it retry request if node fails?

I have an ELB and 3 nodes behind it.
Can someone please explain to me what ELB will do in these scenarios:
Client Request -> ELB -> Node1 fails in the middle of the request (ELB timeout)
Client Request -> ELB -> Node1 times out (server timeout, and the health check hasn't kicked in yet)
In particular, I'm wondering whether ELB retries the request on another node.
I made a test and it doesn't seem to, but maybe there's a setting that I've missed.
Thanks,
This may have been a matter of the passage of time, but these days ELBs do retry requests that abort:
Either because of an Idle Timeout (60s by default);
Or because the instance went unhealthy due to failing health checks, with Connection Draining disabled (default is enabled)
However, this holds only if you haven't sent any response bytes yet. If you have sent incomplete headers, you will get a 408 Request Timeout. If you have sent full headers (but didn't send a body or got halfway through the body) the client will get the incomplete response as-is.
The experiments I've performed were all with a single HTTP request per connection. If you use Keep-Alive connections, the behavior might be different.
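For reference, this kind of experiment can be reproduced with a deliberately slow test endpoint that optionally flushes the headers before stalling past the idle timeout. This is only a sketch of such a harness (the /slow path and the 90-second sleep are arbitrary), not the exact setup used above:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Test endpoint behind the ELB: with ?flush=true it sends headers and a few
// body bytes before stalling; without it the connection idles with nothing
// sent, so the different abort cases described above can be observed.
@WebServlet("/slow")  // hypothetical test path
public class SlowServlet extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        boolean flush = Boolean.parseBoolean(req.getParameter("flush"));
        resp.setContentType("text/plain");
        if (flush) {
            resp.getWriter().write("partial body\n");
            resp.flushBuffer();   // headers plus some body go out before the timeout
        }
        try {
            Thread.sleep(90_000); // stall past the 60 s default idle timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        resp.getWriter().write("done\n");
    }
}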
The AWS Elastic Load Balancing service uses Health Checks to identify healthy/unhealthy Amazon EC2 instances. If an instance is marked as Unhealthy, then no new traffic will be sent to that server. Once it is identified as Healthy, traffic will once again be sent to the instance.
If a request is sent to an instance and no response is received (either because the app fails or a timeout is triggered), the request will not be resent nor sent to another server. It would be up to the originator (eg a user or an app) to resend the request.

Why would AWS ELB (Elastic Load Balancer) sometimes returns 504 (gateway timeout) right away?

ELB occasionally returns 504 to our clients right away (in under 1 second).
The problem is, it's totally random; when we repeat the request right away, it works as it should.
Anyone have same issue or any idea on this?
Does this answer your question:
Troubleshooting Elastic Load Balancing: HTTP Errors
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause: The application takes longer to respond than the configured idle timeout.
Solution: Monitor the HTTPCode_ELB_5XX and Latency CloudWatch metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your back-end capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete.
Or this:
504 gateway timeout LB and EC2
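If you decide to raise the idle timeout on a classic ELB programmatically, a sketch with the AWS SDK for Java v1 (assuming the aws-java-sdk-elasticloadbalancing dependency, default credentials, and a placeholder load balancer name) might look like this:

import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancing;
import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancingClientBuilder;
import com.amazonaws.services.elasticloadbalancing.model.ConnectionSettings;
import com.amazonaws.services.elasticloadbalancing.model.LoadBalancerAttributes;
import com.amazonaws.services.elasticloadbalancing.model.ModifyLoadBalancerAttributesRequest;

// Raises the idle timeout on a classic ELB so long-running requests can
// complete before the load balancer returns a 504.
public class RaiseIdleTimeout {

    public static void main(String[] args) {
        AmazonElasticLoadBalancing elb = AmazonElasticLoadBalancingClientBuilder.defaultClient();

        elb.modifyLoadBalancerAttributes(new ModifyLoadBalancerAttributesRequest()
                .withLoadBalancerName("my-load-balancer")      // placeholder name
                .withLoadBalancerAttributes(new LoadBalancerAttributes()
                        .withConnectionSettings(new ConnectionSettings()
                                .withIdleTimeout(120))));      // seconds; default is 60

        System.out.println("Idle timeout updated");
    }
}

Note that this only helps with genuinely slow responses; the sub-second 504s described in the question cannot have hit a 60-second idle timeout, so they likely have a different backend-connection cause.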