Vertx Clustered EventBus not sending messages - amazon-web-services

Diagram of Setup
I've set up TCP discovery using Hazelcast, with parts of the cluster inside and outside the AWS cloud.
Inside AWS I can send and receive messages with no problem, but not externally.
Looking at the members, all 3 servers are in the list, but no messages are sent to server 3 on my local machine.
For testing, the AWS machines have their firewalls disabled, so the only thing I can think of is a firewall issue on my local network.
I tried making a new instance of Vertx on all servers with the EventBus port set to 80, but that stopped all messages.
Servers 1 and 2 are not reporting any failed-to-send issues, but I'm not sure what the problem is.
Does anybody have any ideas as to why server 3 cannot send or receive messages despite being in the cluster?

Related

Can't connect Burrow to my amazon MSK cluster

I've tried several ways to connect the Burrow application from my EC2 instance to my Kafka cluster to get the consumer lag metrics. I can produce and consume messages from the console on the instance, but the moment I try to connect Burrow it throws this error in the logs.
"name":"kafkatestingcluster","error":"kafka: client has run out of available brokers to talk to (Is your cluster reachable?)"
I have checked the bootstrap servers twice, and the ZooKeeper address as well, and they are okay. I have also tried with the cluster running versions 1.1.0 and 2.2.1, and with different client versions in Burrow's config file.
Am I missing a step?
Would you mind sharing your configuration with us? Did you enter the correct port?
Have you tried running a simple telnet test from the host Burrow runs on to the Kafka brokers? Have you checked your inbound and outbound SG rules on AWS?
I would suggest testing those things first. If everything is good at that layer, switch Burrow's logging level to debug; it should give you a better understanding of what's going on.
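The telnet test suggested above can also be scripted so that every broker is checked in one pass. This is a minimal sketch using only the Python standard library; the function names and the broker address in the usage comment are hypothetical, and a successful TCP connect only shows that the port is open from this host, not that Burrow's Kafka client settings are correct.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_brokers(brokers, timeout: float = 3.0):
    """Map each 'host:port' string to whether a TCP connect succeeded."""
    results = {}
    for broker in brokers:
        host, _, port = broker.rpartition(":")
        results[broker] = can_connect(host, int(port), timeout)
    return results


# Usage (hypothetical address, replace with your own bootstrap servers):
#   check_brokers(["b-1.example.internal:9092"])
```

If a broker shows up as unreachable here, the security group rules are the first thing to re-check before looking at Burrow itself.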

Azure relay hybrid connection listener not reestablishing when internet is disrupted

I have a custom Azure hybrid connection listener service running on premises with the code below, as MSDN suggests, but the listener is not reestablished when the on-premises internet connection is disrupted.
Only about 1 time out of 10 is the listener reestablished with the code below after the on-premises internet connection is disrupted.
// Opening the listener establishes the control channel to
// the Azure Relay service. The control channel is continuously
// maintained, and is reestablished when connectivity is disrupted
await listener.OpenAsync(cts.Token);
// The handler below is not called when the internet connection is
// unplugged from the machine running the listener
listener.Offline += listener_Offline;
What changes are required so that the listener reconnects to the Azure hybrid connection 10 times out of 10? Please advise.

Limit port 8080 access to SNS

I am using SNS to connect to a Tomcat server on port 8080. The server runs on AWS/EC2. It is not a public server; I use it only to execute my code (triggered by the notification delivered to it).
How can I set up the "inbound rules" on my EC2 box so that only SNS can reach it? When I block the port, messages are not delivered. If I restrict it to the EC2 internal or external IP, the same. Apparently Notifications are delivered from "somewhere" that is not documented. And/or it is not a fixed IP [range]?
[I know how to secure the Tomcat server itself, but it would be nice if random port-scans can't even get to the server. I do see a number of (so far unsuccessful) access attempts in the Tomcat log.]

Diagnosing Kafka Connection Problems

I have tried to build as much diagnostics into my Kafka connection setup as possible, but it still leads to mystery problems. In particular, the first thing I do is use the Kafka Admin Client to get the clusterId, because if this operation fails, nothing else is likely to succeed.
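import org.apache.kafka.clients.admin.DescribeClusterResult
import scala.util.{Failure, Success, Try}

// futureTimeout is defined elsewhere (assumed here to be a FiniteDuration).
def getKafkaClusterId(describeClusterResult: DescribeClusterResult): Try[String] = {
  try {
    // Block on the KafkaFuture for at most half of the overall timeout.
    val clusterId = describeClusterResult.clusterId().get(futureTimeout.length / 2, futureTimeout.unit)
    Success(clusterId)
  } catch {
    case cause: Exception =>
      Failure(cause)
  }
}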
In testing this usually works, and everything is fine. It generally only fails when the endpoint is not reachable somehow. It fails because the future times out, so I have no other diagnostics to go by. To test these problems, I usually telnet to the endpoint, for example
$ telnet blah 9094
Trying blah...
Connected to blah.
Escape character is '^]'.
Connection closed by foreign host.
Generally if I can telnet to a Kafka broker, I can connect to Kafka from my server. So my questions are:
What does it mean if I can reach the Kafka brokers via telnet, but I cannot connect via the Kafka Admin Client?
What other diagnostic techniques are there to troubleshoot Kafka broker connection problems?
In this particular case, I am running Kafka on AWS, via a Docker Swarm, and trying to figure out why my server cannot connect successfully. I can see in the broker logs when I try to telnet in, so I know the brokers are reachable. But when my server tries to connect to any of 3 brokers, the logs are completely silent.
This is a good article that explains the steps that happen when you first connect to a Kafka broker:
https://community.hortonworks.com/articles/72429/how-kafka-producer-work-internally.html
If you can telnet to the bootstrap server then it is listening for client connections and requests.
However, clients don't know which brokers are the leaders for each partition of a topic, so the first request they always send to a bootstrap server is a metadata request to get the full topic metadata. The client uses the metadata response from the bootstrap server to learn where it can then make new connections to each of the Kafka brokers hosting the active leaders for the partitions of the topic you are trying to produce to.
That is where your misconfigured broker problem comes into play. When you misconfigure advertised.listeners, the result of the first metadata request redirects the client to unreachable IP addresses or hostnames. It's that second connection that is timing out, not the first one on the port you are telnetting into.
Another way to think of it is that you have to configure a Kafka server to work properly as both a bootstrap server and a regular pub/sub message broker, since it provides both services to clients. Yours are configured correctly as a pub/sub server but incorrectly as a bootstrap server, because the internal and external IP addresses are different in AWS (also in Docker containers or behind a NAT or a proxy).
It might seem counterintuitive in small clusters, where your bootstrap servers are often the same brokers that the client eventually connects to, but it is actually a very helpful architectural design. It allows Kafka to scale and to fail over seamlessly without needing a static list of 20 or more brokers in your bootstrap server list, or extra load balancers and health checks to decide which broker to redirect client requests to.
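The two-stage connection described in this answer can be turned into a small diagnostic that tells you which stage is failing. The sketch below is not Kafka client code: it assumes you already have the list of advertised broker endpoints (for example from a metadata listing made on a host that can reach the cluster) and a probe function that reports whether a TCP connect to a host and port succeeds. The name diagnose, the result keys, and the addresses in the usage example are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Endpoint = Tuple[str, int]


def diagnose(bootstrap: Endpoint,
             advertised: List[Endpoint],
             probe: Callable[[str, int], bool]) -> Dict[str, object]:
    """Separate the bootstrap connection from the follow-up broker connections."""
    # Stage 1: the address the client is configured with (what telnet tests).
    if not probe(*bootstrap):
        return {"stage": "bootstrap", "unreachable": [bootstrap]}
    # Stage 2: the addresses the brokers advertise in the metadata response.
    unreachable = [endpoint for endpoint in advertised if not probe(*endpoint)]
    if unreachable:
        return {"stage": "advertised", "unreachable": unreachable}
    return {"stage": "ok", "unreachable": []}


# Usage with a fake probe: only the public address is reachable from outside,
# but the broker advertises its internal address.
reachable = {("203.0.113.10", 9094)}
result = diagnose(("203.0.113.10", 9094),
                  [("10.0.0.5", 9094)],
                  lambda host, port: (host, port) in reachable)
print(result["stage"])  # prints "advertised"
```

A result of "advertised" matches the situation in the question: telnet to the bootstrap port works, yet the client times out because the endpoints it is redirected to cannot be reached from where it runs.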
If you do not configure listeners and advertised.listeners correctly, basically Kafka just does not listen. Even though telnet can connect to the ports you've configured, the Kafka client library silently fails.
I consider this a defect in the Kafka design which leads to unnecessary confusion.
Sharing Anand Immannavar's answer from another question:
Along with ADVERTISED_HOST_NAME, you need to add ADVERTISED_LISTENERS to the container environment.
ADVERTISED_LISTENERS - The broker registers this value in ZooKeeper, and when the external world wants to connect to your Kafka cluster, it connects over the network address you provide in the ADVERTISED_LISTENERS property.
example:
environment:
  - ADVERTISED_HOST_NAME=<Host IP>
  - ADVERTISED_LISTENERS=PLAINTEXT://<Host IP>:9092

Tablet Server Access for Accumulo Running on AWS

I am attempting to run a simple driver to write some data to an Accumulo 1.5 instance running on AWS, on a single-node cluster managed by CDH 4.7. The client successfully connects to ZooKeeper but then fails with the following message:
2015-06-26 12:12:13 WARN ServerClient:163 - Failed to find an available server in the list of servers: [172.31.13.210:10011:9997 (120000)]
I tried applying the solution listed here, but this has not resolved the issue. The IP that is set for the master/slave is the internal AWS IP for the server.
Other than the warning message, I have not been able to find anything else in the Accumulo logs that indicates what is preventing the connection to the master server. Any suggestions on where to look next?
--EDIT--
It looks like ZooKeeper is returning connectors to the remote client that contain references to the internal IP of the AWS server. The remote client cannot use these connectors because it does not know about the internal IP. When I changed the internal IPs in the thrift connector objects to the public IP, the connection worked fine. In essence, I can't figure out how to get ZooKeeper to return public IPs rather than AWS internal ones for remote clients.
172.31.13.210:10011:9997
This looks really strange. This should be an IP/hostname and a port. It looks like you have two ports somehow.
Did you list ports in the slaves file in ACCUMULO_CONF_DIR? This file should only contain the hostname/IP. If you want to change the port that a TabletServer listens on, you need to change tserver.port.client.
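As a quick way to check the slaves file for this, the sketch below flags any entry that appears to carry a port. It is a hypothetical helper, not part of Accumulo: it assumes one entry per line, treats blank lines and lines starting with # as ignorable, and uses a plain colon test, so it would also flag IPv6 literals. An entry such as 172.31.13.210:10011 with a port then appended to it is one way a string like the one in the warning could arise.

```python
def find_entries_with_ports(lines):
    """Return (line_number, entry) pairs for entries that appear to include a port."""
    flagged = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.strip()
        # Assumption: blank lines and '#' comment lines carry no host entry.
        if not entry or entry.startswith("#"):
            continue
        # A bare hostname or IPv4 address never contains a colon.
        if ":" in entry:
            flagged.append((number, entry))
    return flagged


# Usage (the path is whatever ACCUMULO_CONF_DIR points to on your system):
#   with open("/path/to/conf/slaves") as handle:
#       print(find_entries_with_ports(handle))
```

Any entry it reports should be reduced to just the hostname or IP, with the port set through tserver.port.client instead.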