Websocket connection being closed on Google Compute Engine

Websocket connection being closed on Google Compute Engine - google-cloud-platform

I have a set of apps deployed in Docker containers that use websockets to communicate. One is the backend and one is the frontend.
I have both VM instances inside instance groups and served up through load balancers so that I can host them at https domains.
The problem I'm having is that in Google Compute Engine, the websocket connection is being closed after 30 seconds.
When running locally, the websockets do not time out. I've searched the issue and found these possible reasons, but I can't find a solution:
Websockets might time out on their own if you don't pass "keep alive" messages to keep them active. So I pass a keep alive from the frontend to the backend, and have the backend respond to the frontend, every 10 seconds.
According to the GCE websocket support docs, some type of "upgrade" handshake needs to occur between the backend and the frontend for the websocket connection to be kept alive. But according to MDN, "if you're opening a new connection using the WebSocket API, or any library that does WebSockets, most or all of this is done for you." I am using that API, and indeed when I inspect the headers, I see those fields:
The GCE background service timeout docs say:
For external HTTP(S) load balancers and internal HTTP(S) load
balancers, if the HTTP connection is upgraded to a WebSocket, the
backend service timeout defines the maximum amount of time that a
WebSocket can be open, whether idle or not.
This seems to be in conflict with the GCE websocket support docs that say:
When the load balancer recognizes a WebSocket Upgrade request from an
HTTP(S) client followed by a successful Upgrade response from the
backend instance, the load balancer proxies bidirectional traffic for
the duration of the current connection.
Which is it? I want to keep sockets open once they're established, but requests to initialize websocket connections should still time out if they take longer than 30 seconds. And I don't want to allow other standard REST calls to block forever, either.
What should I do? Do I have to set the timeout on the backend service to forever, and deal with the fact that other non-websocket REST calls may be susceptible to never timing out? Or is there a better way?

As mentioned in this GCP doc "When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection. If the backend instance does not return a successful Upgrade response, the load balancer closes the connection."
In addition, the websocket timeout is a combination of the LB timeout and the bakckend time out. I understand that you already have modified the backend timeout. so you can also adjust the load balancer timeout according to your needs, pleas keep in mind that the default value is 30 seconds.

We have a similar, strange issue from GCP but this is without even using a load balancer. We have a WSS server, we can connect to it fine, it randomly stops the feed, no disconnect message, just stops sending WSS feeds to the client. It could be after 2-3 minutes it could be after 15-20 minutes but usually never makes it longer than that before dropping the connection.
We take the exact same code, the exact same build,(its all containerized) and we drop it on AWS and the problem with WSS magically disappears.
There is no question this issue is GCP related.

Related

How to create a health check for websocket in Google cloud?

I am trying to set up a load balancer in GCP for my application over websoket. When creating a health check to determine which instances are up and working, the only header I can modify for a HTTP health check is the host. So I can not add the Upgrade header and other websocket related headers (like in here) needed to establish a connection.
The documents mentions that websockets are supported by default, but does not mention how health check rule should be defined. What is the best practice for using GCP load balancer with websockets? Is there a way to work around this on my end e.g. by defining an endpoint that automatically upgrades to websocket or any other methods?

The Load balancer can accept and forward the websocket traffic to the correct service/backend. However, from my experience, Load balancer can't perform health checks on websocket.
The meaning of health check is a simple HTTP endpoint that answer "Ok, all is running well" or "Arg, I have a problem, if that continue, restart (or replace) me". Doing this in websocket (meaning continuous communication/streaming) is "strange".
If you look at the documentation, the success criteria are an answer in 200 (and you can optionally add content validation if you want). When you request a websocket endpoint, the HTTP answer is 101 Switching protocol, that is not 200, thus not valid at the Load balancer point of view.
Add a standard endpoint on your service that perform the health check inside your app and answer the correct HTTP code.

Google cloud load balancer causing error 502 - failed_to_pick_backend

I've got an error 502 when I use google cloud balancer with CDN, the thing is, I am pretty sure I must have done something wrong setting up the load balancer because when I remove the load balancer, my website runs just fine.
This is how I configure my load balancer
here
Should I use HTTP or HTTPS healthcheck, because when I set up HTTPS
healthcheck, my website was up for a bit and then it down again
I have checked this link, they seem to have the same problem but it is not working for me.
I have followed a tutorial from openlitespeed forum to set Keep-Alive Timeout (secs) = 60s in server admin panel and configure instance to accepts long-lived connections ,still not working for me.
I have added these 2 firewall rules following this google cloud link to allow google health check ip but still didn’t work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking load balancer log message, it shows an error saying failed_to_pick_backend . I have tried to re-configure load balancer but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!

Posting an answer - based on OP's finding to improve user experience.
Solution to the error 502 - failed_to_pick_backend was changing Load Balancer from HTTP to TCP protocol and at the same type changing health check from HTTP to TCP also.
After that LB passes through all incoming connections as it should and the error dissapeared.
Here's some more info about various types of health checks and how to chose correct one.

The error message that you're facing it's "failed_to_pick_backend".
This error message means that HTTP responses code are generated when a GFE was not able to establish a connection to a backend instance or was not able to identify a viable backend instance to connect to
I noticed in the image that your health-check failed causing the aforementioned error messages, this Health Check failing behavior could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check if the running services were responding with a 200 (OK) to the Health Check probes and Verify your Backend Service timeout. The Backend Service timeout works together with the configured Health Check values to define the amount of time an instance has to respond before being considered unhealthy.
Additionally, You can see this troubleshooting guide to face some error messages (Including this).

Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

Does Google Cloud HTTPS load balancer log back end errors?

Looking for way to debug why backend for NIFI is failing. I created a NIFI cluster (verison 1.9.0, HDF 3.1.1.4, AMBARI 2.7.3) on Google cloud. Created HTTPS load balancer terminating https front end, and back end is the instance group for SSL enabled NIFI cluster. Getting a 502 back end error in the browser when I hit the url for the load balancer. Is there a way for Google Cloud to log the error ? There must be an error returned somewhere to troubleshoot the root cause. I don't see messages in the nifi log or the vm instance /var/log/messages. Stackdriver hasn't shown me errors. I created the keystore and truststore and followed the NIFI SSL enable instructions. It might be related to the SSL configs, or possibly firewall rules are not correct. But I am looking for some more helpful information to find the error.

If I am understanding the question properly, you are looking for a way to get HTTPS load balancer logs due to back end errors and your intention is to find out the root cause.Load balancer basically return 502 error due to unhealthy backend services or for unhealthy backend VM 's.If your stackdriver logging is enabled, you might get this log using advanced filter or can search by selecting the load balancer name and look for/search 502:
Advanced filter for 502 responses due to failures to connect to backends:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="failed_to_connect_to_backend"
Advanced filter for 502 responses due to backend timeouts:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_timeout"
Advanced filter for 502 responses due to prematurely closed connections:
resource.type="http_load_balancer"
resource.labels.url_map_name="[URL Map]"
httpRequest.status=502
jsonPayload.statusDetails="backend_connection_closed_before_data_sent_to_client"
The URL Map is same as the name of the load balancer for HTTP(S) for cloud console.If we create the various components of the load balancer manually, need to use the URL Map for advanced filter.
Most common root causes for "failed_to_connect_to_backend" are: 1. Firewall blocking traffic, 2. Web server software not running on backend instance, 3. Web server software misconfigured on backend instance, 4. Server resources exhausted and not accepting connections (CPU usage too high to respond, Memory usage too high, process killed ,the maximum amount of workers spawned and all are busy, Maximum established TCP connections), 5. Poorly written server implementation struggling under load or non-standard behavior.
Most common root causes for “backend_timeout” are 1. the backend instance took longer than the Backend Service timeout to respond, meaning either the application is overloaded or the Backend Service Timeout is set too low, 2. The backend instance didn't respond at all (Crashing during a request).
Most Common Root causes for” backend_connection_closed_before_data_sent_to_client” is usually caused because the keepalive configuration parameter for the web server software running on the backend instance is less than the fixed (10 minute) keepalive (HTTP idle) timeout of the GFE. There are some situations where the backend may close a connection too soon while the GFE is still sending the HTTP request.

The previous response was spot on. The nifi ssl configuration is misconfigured, causing the backend health check to fail with a bad certificate. I will open a new question to address the nifi ssl configuration.

Is any aws service suitable for sending real time updates to browser?

I'm developing a stocks app and have to keep users browser updated with pricing changes
I don't need to access past data, browser just have to get current data whenever it changes
is it possible to filter a dynamodb stream and expose an endpoint (behind api gateway) that could be used with a javascript EventSource?

I realize this is not using Server Sent Events but AWS just announced Serverless WebSockets for API Gateway. Pricing is based on minutes connected and number of messages sent.
Product Launch Article: https://aws.amazon.com/about-aws/whats-new/2018/12/amazon-api-gateway-launches-support-for-websocket-apis/
Documentation: https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-websocket-api.html
Pricing: https://aws.amazon.com/api-gateway/pricing/

API Gateway is a store-and-forward service. It collects the response from whatever the back-end may happen to be (Lambda, an HTTP server, etc.) and then returns it en block to the browser -- it doesn't stream the response, so it would not be suited for use as an Eventsource.
AWS doesn't currently have a managed service offering that is obviously suited to this use case... you'd need a server (or more than one) on EC2, consuming the data stream and relaying it back to the connected browsers.
Assuming that running EC2 servers is an acceptable option, you then need HTTPS and load balancing. Application Load Balancer supports web sockets, so it also might also support an eventsource. A Classic ELB in TCP (not HTTP) mode should support an eventsource without a problem, though it might not correctly signal to the back-end when the browser connection is lost. Both of those balancers can also offload HTTPS for you. Network Load Balancer would definitely work for balancing an eventsource, but your instances would need to provide the HTTPS, since NLB doesn't offload it for you.
A somewhat unorthodox alternative might actually be AWS IoT, which has built-in websocket support... Not the same as eventsource, of course, but a streaming connection nonetheless... in such an environment, I suppose each browser user could be an addressable "thing."

GCP HTTP Load balancer - Websocket getting upgrade but no frames from the client

I just noticed that gcp http(s) load balancer now supports websockets. I went to try it out and am having some problems. I have a gloabl https load balancer setup with Cloud CDN and a simple, no url-map, backend (Node.js). When I go to make websocket connection, I get a successful upgrade response but when I go to send frames to the server, they are never received. The server can send frames back to the client just fine. It is almost like the load balancer doesn't know that the connection has been upgraded and therefore doesn't allow any data sent from the client.
When I look in the logs for the https load balancer, I see the 101 Switching Protocols response and then statusDetails is "client_disconnected_after_partial_response" almost like it was a normal http request.
Any help would be appreciated.

After some investigation it seems that when a GCP load balancer is going through the GCP CDN, client messages never make it to the backend.
I worked around this by duplicating the backend configuration I wanted in a new backend called "websocket-prototype," and then just changing the "use cdn" setting for the websocket prototype. Finally, I mapped the path to my websocket server to the configuration without the CDN.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js