MQTT over WebSockets periodically disconnecting - google-cloud-platform

We have been running eclipse-mosquitto 1.6.10 on Google Cloud Platform for months without issue. We use both the plain MQTT protocol and MQTT over WebSockets. We have one Google Cloud TCP load balancer for each protocol - one running on port 443 for MQTT and one on port 9200 for WebSockets. They provide TLS for both connections. There is only one broker, which both load balancers point to.
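For context, the broker side of this is just two plain listeners in mosquitto.conf, roughly like the sketch below; the internal ports 1883 and 9001 are assumptions, since TLS is terminated at the load balancers rather than by Mosquitto itself:

# plain MQTT listener (TLS terminated upstream at the TCP load balancer)
listener 1883

# MQTT-over-WebSockets listener
listener 9001
protocol websockets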
Last night at 19:30 BST (UTC +1), the broker started having timeout issues and we were unable to find the root cause. In the end, we spun up a new server with a fresh image and new load balancers. The timeout issues stopped, and regular MQTT connections are now working perfectly. However, websocket connections are now intermittently failing. The browser attempts to connect, and after a delay of ~20 seconds, either connects or fails. Even successful connections are then dropping shortly after - anywhere between 20 seconds and 8 minutes.
We have raised the Linux open-file limit and have also tried upgrading to 1.6.14. Restarting Mosquitto, or even rebooting the entire server, doesn't fix the problem, but it does seem to make connections slightly more likely to succeed for a short time.
The server that the broker is running on is not under high load.
Any insights would be appreciated. Thanks.

Related

How to handle cloud run websocket connection timeout server side

I'm running a Cloud Run service which establishes a WebSocket connection to a third-party endpoint that sends events, which I listen for on the server. Unfortunately, WebSocket connections are treated as long-running HTTPS requests by Cloud Run, so I need to reconnect at the 60-minute timeout. What is the best way to do this server-side (since I don't control the client)? Schedule a new connection every 59 minutes and drop the old one? I also don't want to miss any events during the reconnection. Would appreciate any ideas :)
I could implement a scheduler, but that's not very elegant in my opinion.
WebSocket requests are treated as long-running HTTP requests in Cloud Run. They are subject to the request timeout (currently configurable up to 60 minutes, with a default of 5 minutes) even if your application server does not enforce any timeouts.
Accordingly, if the client keeps the connection open longer than the request timeout configured for the Cloud Run service, the client will be disconnected when the request times out.
Therefore, WebSockets clients connecting to Cloud Run should handle reconnecting to the server if the request times out or the server disconnects. You can achieve this in browser-based clients by using libraries such as reconnecting-websocket or by handling "disconnect" events if you are using the SocketIO library.
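For the server-side case in the question, where the Cloud Run service is itself the WebSocket client, the same idea applies: reconnect whenever the connection closes. A minimal sketch using the JDK 11 java.net.http WebSocket client; the endpoint URL and the fixed two-second backoff are placeholders, not values from the question:

import java.net.URI
import java.net.http.{HttpClient, WebSocket}
import java.util.concurrent.CompletionStage

object ReconnectingFeed {
  private val client = HttpClient.newHttpClient()

  def connect(uri: URI): Unit = {
    client.newWebSocketBuilder().buildAsync(uri, new WebSocket.Listener {
      override def onText(ws: WebSocket, data: CharSequence, last: Boolean): CompletionStage[_] = {
        println(s"event: $data") // handle the incoming event
        ws.request(1)            // ask for the next frame
        null
      }
      override def onClose(ws: WebSocket, statusCode: Int, reason: String): CompletionStage[_] = {
        Thread.sleep(2000)       // crude backoff; use a proper scheduler in real code
        connect(uri)             // re-establish the connection
        null
      }
    })
  }

  def main(args: Array[String]): Unit = {
    connect(URI.create("wss://example.com/events")) // hypothetical third-party endpoint
    Thread.currentThread().join()                   // keep the process alive
  }
}

In practice you would also reconnect from onError, and briefly overlap the old and new connections around the 60-minute mark if you cannot afford to miss events during the switchover.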

Websocket connection being closed on Google Compute Engine

I have a set of apps deployed in Docker containers that use websockets to communicate. One is the backend and one is the frontend.
I have both VM instances inside instance groups, served up through load balancers so that I can host them at HTTPS domains.
The problem I'm having is that in Google Compute Engine, the websocket connection is being closed after 30 seconds.
When running locally, the websockets do not time out. I've searched the issue and found these possible reasons, but I can't find a solution:
Websockets might time out on their own if you don't pass "keep alive" messages to keep them active. So I pass a keep alive from the frontend to the backend, and have the backend respond to the frontend, every 10 seconds.
According to the GCE WebSocket support docs, some type of "upgrade" handshake needs to occur between the backend and the frontend for the WebSocket connection to be kept alive. But according to MDN, "if you're opening a new connection using the WebSocket API, or any library that does WebSockets, most or all of this is done for you." I am using that API, and when I inspect the request headers I do indeed see the Upgrade fields.
The GCE backend service timeout docs say: "For external HTTP(S) load balancers and internal HTTP(S) load balancers, if the HTTP connection is upgraded to a WebSocket, the backend service timeout defines the maximum amount of time that a WebSocket can be open, whether idle or not."
This seems to be in conflict with the GCE WebSocket support docs, which say: "When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection."
Which is it? I want to keep sockets open once they're established, but requests to initialize websocket connections should still time out if they take longer than 30 seconds. And I don't want to allow other standard REST calls to block forever, either.
What should I do? Do I have to set the timeout on the backend service to forever, and deal with the fact that other non-websocket REST calls may be susceptible to never timing out? Or is there a better way?
As mentioned in this GCP doc "When the load balancer recognizes a WebSocket Upgrade request from an HTTP(S) client followed by a successful Upgrade response from the backend instance, the load balancer proxies bidirectional traffic for the duration of the current connection. If the backend instance does not return a successful Upgrade response, the load balancer closes the connection."
In addition, the WebSocket timeout is a combination of the load balancer timeout and the backend timeout. I understand that you have already modified the backend timeout, so you can also adjust the load balancer timeout according to your needs; please keep in mind that the default value is 30 seconds.
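On the load balancer side, the setting in question is the backend service timeout, which can be raised with something like the following; the backend service name is a placeholder and 3600 seconds is only an example value:

gcloud compute backend-services update my-backend-service --global --timeout=3600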
We have a similar, strange issue on GCP, and this is without even using a load balancer. We have a WSS server; we can connect to it fine, but it randomly stops the feed - no disconnect message, it just stops sending WSS data to the client. It could be after 2-3 minutes or after 15-20 minutes, but it usually never makes it longer than that before dropping the connection.
We take the exact same code and the exact same build (it's all containerized), drop it on AWS, and the problem with WSS magically disappears.
There is no question this issue is GCP related.

How do I configure keep-alive time between elb and server?

We are seeing 504 errors in our ELB logs; however, there are no corresponding errors in the application logs. We have increased the idle timeout on the ELB and can see that no requests are taking longer than that.
Going through the AWS documentation, we found that we need to configure the keep-alive time on the EC2 instances to be equal to or greater than the idle timeout, in order to keep the connection open between the ELB and the backend server.
We couldn't find any way to configure the keep-alive time between the ELB and the backend server. Any suggestions would be appreciated.
We are using tomcat-ebs for backend servers.
You need to set keepAliveTimeout="xxxx" in your Tomcat connector settings to avoid tearing down idle connections.
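In server.xml that looks roughly like the connector below; the port, protocol, and the 65000 ms value are assumptions - the point is that keepAliveTimeout (in milliseconds) should exceed the ELB idle timeout:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           keepAliveTimeout="65000"
           maxKeepAliveRequests="-1" />

Setting maxKeepAliveRequests to -1 removes the cap on how many requests Tomcat will serve over a single kept-alive connection.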

Diagnosing Kafka Connection Problems

I have tried to build as much diagnostics into my Kafka connection setup as possible, but it still leads to mystery problems. In particular, the first thing I do is use the Kafka Admin Client to get the clusterId, because if this operation fails, nothing else is likely to succeed.
import org.apache.kafka.clients.admin.DescribeClusterResult
import scala.util.{Failure, Success, Try}

// futureTimeout is a scala.concurrent.duration.Duration defined elsewhere in our code.
def getKafkaClusterId(describeClusterResult: DescribeClusterResult): Try[String] = {
  try {
    // Block for at most half the configured timeout waiting on the KafkaFuture.
    val clusterId = describeClusterResult.clusterId().get(futureTimeout.length / 2, futureTimeout.unit)
    Success(clusterId)
  } catch {
    case cause: Exception =>
      Failure(cause)
  }
}
In testing this usually works, and everything is fine. It generally only fails when the endpoint is not reachable somehow. It fails because the future times out, so I have no other diagnostics to go by. To test these problems, I usually telnet to the endpoint, for example
$ telnet blah 9094
Trying blah...
Connected to blah.
Escape character is '^]'.
Connection closed by foreign host.
Generally if I can telnet to a Kafka broker, I can connect to Kafka from my server. So my questions are:
What does it mean if I can reach the Kafka brokers via telnet, but I cannot connect via the Kafka Admin Client?
What other diagnostic techniques are there to troubleshoot Kafka broker connection problems?
In this particular case, I am running Kafka on AWS, via a Docker Swarm, and trying to figure out why my server cannot connect successfully. I can see in the broker logs when I try to telnet in, so I know the brokers are reachable. But when my server tries to connect to any of 3 brokers, the logs are completely silent.
This is a good article that explains the steps that happen when you first connect to a Kafka broker:
https://community.hortonworks.com/articles/72429/how-kafka-producer-work-internally.html
If you can telnet to the bootstrap server then it is listening for client connections and requests.
However, clients don't know which brokers are the leaders for each partition of a topic, so the first request they send to a bootstrap server is always a metadata request for the full topic metadata. The client uses the metadata response from the bootstrap server to know where it can then make new connections to each of the Kafka brokers hosting the active leaders for the partitions of the topic you are trying to produce to.
That is where your misconfigured broker problem comes into play. When you misconfigure the advertised.listeners port, the results of the first metadata request redirect the client to connect to unreachable IP addresses or hostnames. It's that second connection that is timing out, not the first one on the port you are telnetting into.
Another way to think of it is that you have to configure a Kafka server to work properly as both a bootstrap server and a regular pub/sub message broker, since it provides both services to clients. Yours are configured correctly as pub/sub servers but incorrectly as bootstrap servers, because the internal and external IP addresses are different in AWS (as they are in Docker containers, or behind a NAT or a proxy).
It might seem counterintuitive in small clusters, where your bootstrap servers are often the same brokers that the client eventually connects to, but it is actually a very helpful architectural design that allows Kafka to scale and fail over seamlessly without needing to provide a static list of 20 or more brokers in your bootstrap server list, or to maintain extra load balancers and health checks to know which broker to redirect client requests to.
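In broker configuration terms, the distinction looks roughly like this (the hostname is a placeholder):

# what the broker actually binds to inside the VM or container
listeners=PLAINTEXT://0.0.0.0:9092
# what is handed back to clients in the metadata response; it must be
# resolvable and reachable from wherever the clients run
advertised.listeners=PLAINTEXT://broker1.example.com:9092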
If you do not configure listeners and advertised.listeners correctly, Kafka essentially just does not work: even though telnet shows something listening on the ports you've configured, the Kafka client library fails silently.
I consider this a defect in the Kafka design which leads to unnecessary confusion.
Sharing Anand Immannavar's answer from another question:
Along with ADVERTISED_HOST_NAME, you need to add ADVERTISED_LISTENERS to the container environment.
ADVERTISED_LISTENERS - the broker will register this value in ZooKeeper, and when the outside world wants to connect to your Kafka cluster, it connects over the address you provide in the ADVERTISED_LISTENERS property.
example:
environment:
  - ADVERTISED_HOST_NAME=<Host IP>
  - ADVERTISED_LISTENERS=PLAINTEXT://<Host IP>:9092

Vertx Clustered EventBus not sending messages

[Diagram of the setup omitted]
I've set up TCP discovery using Hazelcast, where parts of the cluster exist inside and outside of the AWS cloud.
Inside AWS I can send and receive messages without a problem, but not externally.
Looking at the members, all 3 servers are in the list, but no messages are sent to server 3 on my local machine.
For testing the AWS machines have their firewalls disabled, so the only thing I can think of is a firewall issue on my local network.
I tried creating a new Vertx instance on all servers with the EventBus port set to 80, but that stopped all messages.
Servers 1 and 2 are not reporting any failed-to-send issues, but I'm not sure what the problem is.
Does anybody have any ideas as to why server 3 cannot send or receive messages despite being in the cluster?
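For what it's worth, this is the classic symptom of a member behind NAT: Hazelcast can form the cluster (so all 3 members show up), while the event bus TCP connections to the NATed node fail because it advertises an address the other members cannot reach. A minimal sketch of the knobs involved, assuming Vert.x 3.x with the vertx-hazelcast cluster manager on the classpath; the bind address, port, and public IP below are placeholders, not values from the question:

import io.vertx.core.eventbus.{EventBusOptions, Message}
import io.vertx.core.{AsyncResult, Vertx, VertxOptions}

object ClusteredNode {
  def main(args: Array[String]): Unit = {
    val ebOptions = new EventBusOptions()
      .setHost("0.0.0.0")                   // interface the event bus binds to locally
      .setPort(42042)                       // fixed local port so it can be port-forwarded
      .setClusterPublicHost("198.51.100.7") // placeholder: address reachable from the AWS members
      .setClusterPublicPort(42042)          // placeholder: port forwarded to this machine

    val options = new VertxOptions().setEventBusOptions(ebOptions)

    Vertx.clusteredVertx(options, (res: AsyncResult[Vertx]) => {
      if (res.succeeded()) {
        val vertx = res.result()
        // register a consumer so the other members have an address to send to
        vertx.eventBus().consumer[String]("test.address",
          (msg: Message[String]) => println(s"got: ${msg.body()}"))
      } else {
        res.cause().printStackTrace()
      }
    })
  }
}

The cluster-public host and port are what the other members use to open event bus connections back to this node, so they have to be reachable (for example, port-forwarded) from the AWS side; if they are not set, the node advertises whatever local address it happened to bind to.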