How do I configure keep-alive time between ELB and server? - amazon-web-services

We are seeing 504 errors in our ELB logs, but there are no corresponding errors in the application logs. We have increased the idle timeout on the ELB and can see that no requests are taking more time than that.
Going through the AWS documentation, we found that we need to configure the keep-alive time on the EC2 instances to be equal to or greater than the idle timeout, to keep the connection open between the ELB and the backend server.
We couldn't find any way to configure the keep-alive time between the ELB and the backend server. Any suggestions would be helpful.
We are using tomcat-ebs for the backend servers.

You need to set keepAliveTimeout="xxxx" in your Tomcat connector settings to avoid tearing down idle connections.
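For example, in Tomcat's server.xml, a minimal sketch assuming the default HTTP/1.1 connector (keepAliveTimeout is in milliseconds; the 65000 value assumes a 60-second ELB idle timeout, so the backend holds idle connections slightly longer than the ELB does; the port numbers are placeholders):
<!-- Keep idle keep-alive connections open longer than the ELB idle timeout. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           keepAliveTimeout="65000"
           maxKeepAliveRequests="-1"
           redirectPort="8443" />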

Related

Websocket connection being closed on Google Compute Engine

I have a set of apps deployed in Docker containers that use websockets to communicate. One is the backend and one is the frontend.
I have both VM instances inside instance groups and served up through load balancers so that I can host them at https domains.
The problem I'm having is that in Google Compute Engine, the websocket connection is being closed after 30 seconds.
When running locally, the websockets do not time out. I've searched the issue and found these possible reasons, but I can't find a solution:
Websockets might time out on their own if you don't pass "keep alive" messages to keep them active. So I pass a keep alive from the frontend to the backend, and have the backend respond to the frontend, every 10 seconds.
According to the GCE websocket support docs, some type of "upgrade" handshake needs to occur between the backend and the frontend for the websocket connection to be kept alive. But according to MDN, "if you're opening a new connection using the WebSocket API, or any library that does WebSockets, most or all of this is done for you." I am using that API, and indeed when I inspect the headers, I see those fields.
The GCE backend service timeout docs say:
For external HTTP(S) load balancers and internal HTTP(S) load balancers, if the HTTP connection is upgraded to a WebSocket, the backend service timeout defines the maximum amount of time that a WebSocket can be open, whether idle or not.
This seems to be in conflict with the GCE websocket support docs that say:
When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection.
Which is it? I want to keep sockets open once they're established, but requests to initialize websocket connections should still time out if they take longer than 30 seconds. And I don't want to allow other standard REST calls to block forever, either.
What should I do? Do I have to set the timeout on the backend service to forever, and deal with the fact that other non-websocket REST calls may be susceptible to never timing out? Or is there a better way?
As mentioned in this GCP doc "When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection. If the backend instance does not return a successful Upgrade response, the load balancer closes the connection."
In addition, the websocket timeout is a combination of the LB timeout and the backend timeout. I understand that you have already modified the backend timeout, so you can also adjust the load balancer timeout according to your needs; please keep in mind that the default value is 30 seconds.
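As a sketch, assuming a global external HTTP(S) load balancer and a placeholder backend service name, the backend service timeout can be raised with gcloud:
# Raise the backend service timeout (in seconds); BACKEND_SERVICE_NAME is a placeholder.
gcloud compute backend-services update BACKEND_SERVICE_NAME \
    --global \
    --timeout=86400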
We have a similar, strange issue on GCP, and this is without even using a load balancer. We have a WSS server; we can connect to it fine, but it randomly stops the feed, with no disconnect message, it just stops sending WSS feeds to the client. It could be after 2-3 minutes or after 15-20 minutes, but it usually never makes it longer than that before dropping the connection.
We take the exact same code, the exact same build (it's all containerized), drop it on AWS, and the problem with WSS magically disappears.
There is no question this issue is GCP related.

Does Google Cloud HTTPS load balancer log back end errors?

Looking for a way to debug why the backend for NiFi is failing. I created a NiFi cluster (version 1.9.0, HDF 3.1.1.4, Ambari 2.7.3) on Google Cloud. I created an HTTPS load balancer terminating HTTPS on the front end, and the back end is the instance group for the SSL-enabled NiFi cluster. I am getting a 502 backend error in the browser when I hit the URL for the load balancer. Is there a way for Google Cloud to log the error? There must be an error returned somewhere to troubleshoot the root cause. I don't see messages in the NiFi log or the VM instance /var/log/messages. Stackdriver hasn't shown me errors. I created the keystore and truststore and followed the NiFi SSL-enable instructions. It might be related to the SSL configs, or possibly the firewall rules are not correct. But I am looking for some more helpful information to find the error.
If I am understanding the question properly, you are looking for a way to get HTTPS load balancer logs for backend errors, and your intention is to find the root cause. A load balancer basically returns a 502 error due to unhealthy backend services or unhealthy backend VMs. If your Stackdriver logging is enabled, you can find these logs using an advanced filter, or you can search by selecting the load balancer name and looking for 502:
Advanced filter for 502 responses due to failures to connect to backends:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="failed_to_connect_to_backend"
Advanced filter for 502 responses due to backend timeouts:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_timeout"
Advanced filter for 502 responses due to prematurely closed connections:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_connection_closed_before_data_sent_to_client"
The URL map name is the same as the load balancer name when the HTTP(S) load balancer is created from the Cloud Console. If you create the various components of the load balancer manually, you need to use the URL map name in the advanced filter.
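These filters can be pasted into the Logs Viewer, or, as a sketch, run from the command line with gcloud (the example below assumes the first filter; adjust statusDetails for the other cases):
# Query load balancer logs for 502s caused by failures to connect to backends.
gcloud logging read \
    'resource.type="http_load_balancer" AND httpRequest.status=502 AND jsonPayload.statusDetails="failed_to_connect_to_backend"' \
    --limit=10 --format=json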
Most common root causes for "failed_to_connect_to_backend":
1. Firewall blocking traffic.
2. Web server software not running on the backend instance.
3. Web server software misconfigured on the backend instance.
4. Server resources exhausted and not accepting connections: CPU usage too high to respond, memory usage too high, process killed, the maximum number of workers spawned and all busy, maximum established TCP connections reached.
5. A poorly written server implementation struggling under load, or non-standard behavior.
Most common root causes for "backend_timeout":
1. The backend instance took longer than the backend service timeout to respond, meaning either the application is overloaded or the backend service timeout is set too low.
2. The backend instance didn't respond at all (crashing during a request).
The most common root cause for "backend_connection_closed_before_data_sent_to_client" is that the keepalive configuration parameter for the web server software running on the backend instance is less than the fixed (10-minute) keepalive (HTTP idle) timeout of the GFE. There are also some situations where the backend may close a connection too soon while the GFE is still sending the HTTP request.
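For Apache on the backend, a minimal sketch of a keepalive configuration that stays above the GFE's 10-minute idle timeout (the 620-second value is an assumed margin, not something taken from this thread):
# httpd.conf: keep idle connections open longer than the GFE's 10-minute keepalive.
KeepAlive On
MaxKeepAliveRequests 0
KeepAliveTimeout 620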
The previous response was spot on. The NiFi SSL configuration was misconfigured, causing the backend health check to fail with a bad certificate. I will open a new question to address the NiFi SSL configuration.

HTTP 504 errors returned by ELB even when hosts are healthy and able to serve request

I have a service which is deployed on Amazon Web Services (AWS), specifically 2 instances behind an Elastic Load Balancer (ELB). All three availability zones (us-west-2a, b, and c) are selected,
but only 2 of the 3 zones have instances running in them.
The issue is that even though the traffic/load is not too high, I still get HTTP 504 errors from the ELB often enough.
The log lines read like this:
-1 -1 -1 504 0 0 0
In order, --request_processing_time --backend_processing_time --response_processing_time --elb_status_code --backend_status_code --received_bytes --sent_bytes.
A description of what each field and response means can be found here.
The ELB idle timeout is 60 seconds. KeepAlive is 'On' on the backend instances. Request latency at the ELB is within limits. I have tried increasing KeepAliveTimeout, but to no avail.
Does anyone have any idea about how to proceed? I don't even know the root cause of this issue.
PS: More of a second question: there are a few cases (far fewer than the 504s returned by the ELB when the backend does not even accept the request) where the backend itself returns a 504 and the ELB forwards it to the client. To the best of my knowledge, HTTP 504 should be returned by a proxy only when the backend is timing out. How can a server itself return a 504?
So that it might assist others in the future, I am publishing my findings here:
1) These 504 0 HTTP errors (ELB 504, no backend response) were mainly because logrotate was reloading Apache instead of performing a graceful restart.
The current AWS config does the following
/sbin/service httpd reload > /dev/null 2>/dev/null || true
so replace the service command with either apachectl -k graceful or /sbin/service httpd graceful
File location on my ec2 instance: /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
2) The logrotate frequency was too high by default in AWS (once every hour), at least for my use case, and that in turn was reloading Apache every hour, so I reduced that as well.
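For reference, a minimal sketch of what the corrected postrotate hook might look like (the log glob and the other directives are assumptions, not the stock Elastic Beanstalk file):
# logrotate.elasticbeanstalk.httpd.conf (sketch): reload Apache gracefully after rotation.
/var/log/httpd/*log {
    missingok
    notifempty
    sharedscripts
    postrotate
        /sbin/service httpd graceful > /dev/null 2>/dev/null || true
    endscript
}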
When a backend connection times out, the ELB puts -1 in the backend_processing_time column of its access log. I think what's happening is that some of your requests take longer than 60s for your backend to process. To confirm this, can you check your latency metric? Please switch to Maximum when viewing this metric. It will confirm my guess if you see latency frequently reaching 60s.
Once that's confirmed, you might want to increase the idle timeout of both your ELB and your backend.
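A sketch of pulling that Maximum latency from the command line (the load balancer name and time window are placeholders):
# Fetch Maximum Latency for a Classic ELB in 5-minute buckets.
aws cloudwatch get-metric-statistics \
    --namespace AWS/ELB \
    --metric-name Latency \
    --dimensions Name=LoadBalancerName,Value=my-elb \
    --start-time 2015-01-01T00:00:00Z --end-time 2015-01-02T00:00:00Z \
    --period 300 --statistics Maximum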

Why would AWS ELB (Elastic Load Balancer) sometimes returns 504 (gateway timeout) right away?

The ELB occasionally returns a 504 to our clients right away (in under 1 second).
The problem is, it's totally random; when we repeat the request right away, it works as it should.
Has anyone had the same issue, or any idea on this?
Does this answer your question:
Troubleshooting Elastic Load Balancing: HTTP Errors
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause: The application takes longer to respond than the configured idle timeout.
Solution: Monitor the HTTPCode_ELB_5XX and Latency CloudWatch metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your back-end capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete.
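If you go the idle-timeout route, a sketch for a Classic ELB (the load balancer name and the 120-second value are placeholders):
# Raise the Classic ELB idle timeout to 120 seconds.
aws elb modify-load-balancer-attributes \
    --load-balancer-name my-load-balancer \
    --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":120}}"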
Or this:
504 gateway timeout LB and EC2

408 timeouts on my Amazon ELB

We are seeing a lot of 408 timeouts on our ELB access logs. Have come across this thread https://serverfault.com/questions/485063/getting-408-errors-on-our-logs-with-no-request-or-user-agent
and also https://forums.aws.amazon.com/thread.jspa?messageID=307846
These are just two sample threads I found but others suggest the same solutions with no joy.
We have set the web server timeout to be less than the ELB idle timeout, equal to it, and greater than it; same result, our logs are polluted with these 408s. A bigger problem, though, is that they also throw off the average latency response time of our ELB, which is what we trigger our auto scaler with.
We use Tomcat on our backend instances. No logs appear in Tomcat to indicate a request was received, but the ELB still shows as if requests had timed out.
In our ELB access logs there is no backend IP given for the 408s, so in my opinion the requests never reached an instance at all, but Amazon disagrees :(.
Has anyone had this problem and found a reliable solution for it?
Following the suggestion of milsonspt in the linked thread, I added a virtual host to my server that listens on a different port instead of 80, so all health checks will be executed on that host (replace CUSTOM_PORT with any port you want to be used for the ELB health check).
Listen CUSTOM_PORT
<VirtualHost *:CUSTOM_PORT>
    # Log health-check hits to a separate, auto-rotated access log.
    CustomLog "|/usr/sbin/rotatelogs /var/log/httpd/access_log_elb_health_check_rotated_%Y-%m-%d-%H_%M_%S 10M" combined
</VirtualHost>
Make sure that the ELB does NOT have a listener on that port.
That configuration removed the 408 errors and logs all health checks separately, so you get an uncluttered regular access log and a dedicated log for health checks.
This can happen when the ELB is waiting on the client for a complete request. If a partial request comes in, with incomplete headers, the ELB just waits. It does nothing with the partial request headers and eventually responds with a 408 req_timeout when the idle timeout expires on the TCP connection.
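A hypothetical way to reproduce this (the ELB DNS name is a placeholder): send incomplete headers with no terminating blank line and hold the connection open; after the idle timeout, an entry with a 408 and no backend IP should appear in the access log.
# Send a request with incomplete headers and keep the connection open.
{ printf 'GET / HTTP/1.1\r\nHost: example.com\r\n'; sleep 120; } | nc my-elb-1234.us-west-2.elb.amazonaws.com 80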