How to create a health check for websocket in Google cloud?

I am trying to set up a load balancer in GCP for my application over websocket. When creating a health check to determine which instances are up and working, the only header I can modify for an HTTP health check is the host, so I cannot add the Upgrade header and the other websocket-related headers (like in here) needed to establish a connection.
The documentation mentions that websockets are supported by default, but does not mention how the health check rule should be defined. What is the best practice for using a GCP load balancer with websockets? Is there a way to work around this on my end, e.g. by defining an endpoint that automatically upgrades to websocket, or any other method?

The load balancer can accept and forward websocket traffic to the correct service/backend. However, from my experience, the load balancer can't perform health checks over websocket.
A health check is a simple HTTP endpoint that answers "OK, all is running well" or "Argh, I have a problem; if this continues, restart (or replace) me". Doing this over websocket (meaning continuous communication/streaming) is "strange".
If you look at the documentation, the success criterion is a 200 response (and you can optionally add content validation if you want). When you request a websocket endpoint, the HTTP answer is 101 Switching Protocols, which is not 200 and is therefore not valid from the load balancer's point of view.
Add a standard HTTP endpoint to your service that performs the health check inside your app and answers with the correct HTTP status code.
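For illustration, a minimal sketch of that pattern, assuming a Node.js backend using Express and the ws package (the /healthz and /ws paths and the port are arbitrary choices, not anything GCP requires):

```typescript
import express from "express";
import { createServer } from "http";
import { WebSocketServer } from "ws";

const app = express();

// Plain HTTP endpoint for the load balancer health check: it answers 200,
// never 101, so the check can pass. Put any real internal checks
// (DB reachable, worker alive, ...) inside this handler.
app.get("/healthz", (_req, res) => {
  res.status(200).send("OK");
});

const server = createServer(app);

// The actual websocket traffic is served on a separate path of the same server.
const wss = new WebSocketServer({ server, path: "/ws" });
wss.on("connection", (socket) => {
  socket.on("message", (message) => socket.send(message)); // echo, as a placeholder
});

server.listen(8080);
```

Point the GCP HTTP health check at /healthz (port 8080 in this sketch) and leave /ws for websocket clients.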

Related

Websocket connection being closed on Google Compute Engine

I have a set of apps deployed in Docker containers that use websockets to communicate. One is the backend and one is the frontend.
I have both VM instances inside instance groups and served up through load balancers so that I can host them at https domains.
The problem I'm having is that in Google Compute Engine, the websocket connection is being closed after 30 seconds.
When running locally, the websockets do not time out. I've searched the issue and found these possible reasons, but I can't find a solution:
Websockets might time out on their own if you don't pass "keep alive" messages to keep them active. So I pass a keep alive from the frontend to the backend, and have the backend respond to the frontend, every 10 seconds.
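(For reference, the keep-alive is roughly the following shape; this is a simplified sketch with a placeholder URL and message format, not the app's real code.)

```typescript
// Simplified sketch of the keep-alive (browser WebSocket API); the URL,
// interval, and message format here are placeholders.
const socket = new WebSocket("wss://backend.example.com/ws");

socket.addEventListener("open", () => {
  // Send a keep-alive every 10 seconds while the socket is open.
  const timer = setInterval(() => {
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(JSON.stringify({ type: "keepalive" }));
    }
  }, 10_000);
  socket.addEventListener("close", () => clearInterval(timer));
});

// The backend answers each keep-alive; its replies (and normal messages) arrive here.
socket.addEventListener("message", (event) => {
  console.log("received:", event.data);
});
```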
According to the GCE websocket support docs, some type of "upgrade" handshake needs to occur between the backend and the frontend for the websocket connection to be kept alive. But according to MDN, "if you're opening a new connection using the WebSocket API, or any library that does WebSockets, most or all of this is done for you." I am using that API, and indeed when I inspect the headers, I see those fields:
The GCE backend service timeout docs say:
For external HTTP(S) load balancers and internal HTTP(S) load balancers, if the HTTP connection is upgraded to a WebSocket, the backend service timeout defines the maximum amount of time that a WebSocket can be open, whether idle or not.
This seems to be in conflict with the GCE websocket support docs that say:
When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection.
Which is it? I want to keep sockets open once they're established, but requests to initialize websocket connections should still time out if they take longer than 30 seconds. And I don't want to allow other standard REST calls to block forever, either.
What should I do? Do I have to set the timeout on the backend service to forever, and deal with the fact that other non-websocket REST calls may be susceptible to never timing out? Or is there a better way?
As mentioned in this GCP doc: "When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection. If the backend instance does not return a successful Upgrade response, the load balancer closes the connection."
In addition, the websocket timeout is a combination of the load balancer timeout and the backend timeout. I understand that you have already modified the backend timeout, so you can also adjust the load balancer timeout according to your needs. Please keep in mind that the default value is 30 seconds.
We have a similar, strange issue on GCP, and this is without even using a load balancer. We have a WSS server; we can connect to it fine, but it randomly stops the feed: no disconnect message, it just stops sending WSS feeds to the client. It could be after 2-3 minutes or after 15-20 minutes, but it usually never lasts longer than that before dropping the connection.
We take the exact same code and the exact same build (it's all containerized), drop it on AWS, and the problem with WSS magically disappears.
There is no question this issue is GCP related.

Google cloud load balancer causing error 502 - failed_to_pick_backend

I've got an error 502 when I use Google Cloud Load Balancer with CDN. The thing is, I am pretty sure I must have done something wrong setting up the load balancer, because when I remove the load balancer, my website runs just fine.
This is how I configure my load balancer: here
Should I use an HTTP or an HTTPS health check? When I set up an HTTPS health check, my website was up for a bit and then it went down again.
I have checked this link; they seem to have the same problem, but the fix is not working for me.
I have followed a tutorial from the OpenLiteSpeed forum to set Keep-Alive Timeout (secs) = 60 in the server admin panel and configured the instance to accept long-lived connections, but it is still not working for me.
I have added these two firewall rules following these Google Cloud links to allow the Google health check IP ranges, but it still didn't work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking the load balancer log messages, it shows an error saying failed_to_pick_backend. I have tried to re-configure the load balancer, but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!
Posting an answer based on the OP's findings to improve user experience.
The solution to the error 502 - failed_to_pick_backend was changing the load balancer from the HTTP to the TCP protocol and, at the same time, changing the health check from HTTP to TCP as well.
After that, the LB passed through all incoming connections as it should and the error disappeared.
Here's some more info about the various types of health checks and how to choose the correct one.
The error message that you're facing is "failed_to_pick_backend".
This error message means that the HTTP response is generated when a GFE was not able to establish a connection to a backend instance, or was not able to identify a viable backend instance to connect to.
I noticed in the image that your health check failed, causing the aforementioned error messages. This health check failure could be due to:
- Web server software not running on the backend instance
- Web server software misconfigured on the backend instance
- Server resources exhausted and not accepting connections:
  - CPU usage too high to respond
  - Memory usage too high, process killed or can't malloc()
  - Maximum number of workers spawned and all are busy (think mpm_prefork in Apache)
  - Maximum number of established TCP connections reached
Check whether the running services respond with a 200 (OK) to the health check probes, and verify your backend service timeout. The backend service timeout works together with the configured health check values to define the amount of time an instance has to respond before being considered unhealthy.
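As a quick way to verify the first point, you can hit the health check path directly from the instance and watch both the status code and the response time (a sketch, assuming Node 18+ for the global fetch; the URL is a placeholder for your own health check path):

```typescript
// Sketch: request the health check path directly and confirm it answers 200
// quickly enough. The URL below is a placeholder for your instance's endpoint.
async function probe(url: string): Promise<void> {
  const started = Date.now();
  const res = await fetch(url);
  const elapsedMs = Date.now() - started;
  console.log(`status: ${res.status} (expected 200), answered in ${elapsedMs} ms`);
}

probe("http://localhost/").catch((err) => {
  console.error("probe failed:", err);
  process.exit(1);
});
```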
Additionally, you can look at this troubleshooting guide, which covers several error messages (including this one).
Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

AWS Application Load Balancer with HTTP2

I have a RESTful app deployed on a number of EC2 instances sitting behind a Load Balancer.
Authentication is handled in part by a custom request header called "X-App-Key".
I have just migrated my Classic Load Balancers to Application Load Balancers and I'm starting to experience intermittent issues where some valid requests (tested with curl) are failing authentication for some users. It looks like the custom request header is only intermittently being passed through. Using Apache Bench, approximately 100 of 500 requests failed.
If I test with a classic Load Balancer all 500 succeed.
I looked into this a bit more and found that the users for whom this is failing are using a slightly newer version of curl, and specifically that the requests coming from these users use HTTP/2. If I add "--http1.1" to the curl request, they all pass fine.
So the issue seems to be specific to using a custom request header with the new-generation Application Load Balancers and HTTP/2.
Am I doing something wrong?!
I found the answer on this post...
AWS Application Load Balancer transforms all headers to lower case
It seems the headers come through from the ALB in lowercase. I needed to update my backend to support this.
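For example, with an Express backend the check ends up looking something like this (a sketch; the header name matches the one in the question, the rest is illustrative):

```typescript
import express from "express";

const app = express();

// Node/Express exposes incoming header names in lowercase regardless of how
// the client sent them, so look the custom header up as "x-app-key",
// not "X-App-Key". This works for both HTTP/1.1 and HTTP/2 clients.
app.use((req, res, next) => {
  const appKey = req.headers["x-app-key"];
  if (!appKey) {
    res.status(401).send("Missing X-App-Key header");
    return;
  }
  next();
});

app.get("/", (_req, res) => res.send("authenticated"));

app.listen(8080);
```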
You probably have to enable sticky sessions in your load balancer.
They are needed to keep the session linked to the same instance.
However, the need to keep a session active is an application-level concern, and sticky sessions are not really useful for some kinds of services (depending on the nature of your system, they are not really recommended), as they reduce performance in REST-like systems.

Gremlin-server health check endpoint for AWS ELB

Is there any HTTP/TCP endpoint for a gremlin-server health check? Currently, we are using the default TCP port but it doesn't seem to indicate the gremlin-server's health.
We noticed that gremlin-server crashed and was not running but the health check kept passing. We are using AWS Classic Load Balancer.
Have you enabled an HTTP endpoint for the Gremlin service? The document above explains:
While the default behavior for Gremlin Server is to provide a WebSocket-based connection, it can also be configured to support plain HTTP web service. The HTTP endpoint provides for a communication protocol familiar to most developers, with a wide support of programming languages, tools and libraries for accessing it.
If so, you can use an ELB HTTP health check to a target like this:
HTTP:8182/?gremlin=100-1
With a properly configured service, this query will return a 200 HTTP status code, which will indicate to the ELB that the service is healthy.
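To sanity-check that target before pointing the ELB at it, you can issue the same request yourself (a sketch, assuming Node 18+ for the global fetch and Gremlin Server's HTTP endpoint listening locally on port 8182):

```typescript
// Sketch: make the same request the ELB health check would make and confirm
// it returns HTTP 200. Assumes the HTTP endpoint is enabled on localhost:8182.
async function checkGremlinHealth(): Promise<void> {
  const res = await fetch("http://localhost:8182/?gremlin=100-1");
  console.log(`status: ${res.status}`); // expect 200 when healthy
  console.log(await res.text());        // JSON body containing the evaluated result (99)
}

checkGremlinHealth().catch((err) => {
  console.error("health check request failed:", err);
  process.exit(1);
});
```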

AWS ALB health check pass HTTP but not Websocket

I have an application deployed in a Docker Swarm which has two publicly reachable services, HTTP and WS.
I created two target groups, one for each service, and the registered instances are the managers of the Docker Swarm. Then I created the ALB and added two HTTPS listeners, each one pointing to the specific target group.
Now comes the problem. The HTTP health check passes with no problem, but the websocket check is always unhealthy, and I don't know why. According to http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-listeners.html, using an HTTP/HTTPS listener should work for WS/WSS as well.
In the WS check, I have tried both / and the path the application actually uses, /ws. Neither of them passes the health check.
It's not a problem related to the firewall either. Security groups are wide open and there are no iptables rules, so connections are possible in both directions.
I launched the websocket container out of Docker Swarm, just to test if it was something related to Swarm (which I was pretty sure it was not, but hell.. for testing's sake), and it did not work either, so now I'm a little out of hope.
What configuration might I be missing, so that HTTP services work but websocket services don't?
UPDATE
I'm back with this issue, and after further researching, the problem seems to be the Target Group, not the ALB per se. Reading through the documentation http://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html, I had forgotten to enable the stickiness option. However, I just did, and the problem persists.
UPDATE 2
It looks like the ELB is not upgrading the connection from HTTP to WebSocket.
ALBs do not support websocket health checks per:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
" Health checks do not support WebSockets."
The issue is that, despite AWS claiming that the ALB supports HTTP/2, in fact it downgrades everything to HTTP/1, does its thing, and then upgrades it back to HTTP/2, which breaks everything.
I had a similar issue. When the ALB checks the WS service, it receives an HTTP status of 101 (Switching Protocols) from it. And, as noted in other answers, that's not a good enough response for the ALB. I attempted changing the matcher code in the health check config, but it doesn't allow anything outside the 200-299 range.
In my setup I had Socket.io running on top of Express.js, so I solved it by changing the Socket.io path (do not confuse path with namespace) to /ws and letting Express.js answer the requests for /. I pointed the health check to / and that did it.
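In code, that layout looks roughly like this (a sketch, assuming Express and Socket.io; the port and placeholder event are illustrative):

```typescript
import express from "express";
import { createServer } from "http";
import { Server } from "socket.io";

const app = express();

// Plain HTTP route on "/": this is what the ALB health check is pointed at,
// and it answers 200 instead of 101.
app.get("/", (_req, res) => {
  res.sendStatus(200);
});

const httpServer = createServer(app);

// Socket.io moved off its default "/socket.io" path onto "/ws" (this is the
// engine path, not a namespace), leaving "/" free for Express.
const io = new Server(httpServer, { path: "/ws" });
io.on("connection", (socket) => {
  socket.emit("hello"); // placeholder event
});

httpServer.listen(8080);
```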