Diagnosing Kafka Connection Problems - amazon-web-services

I have tried to build as much diagnostics into my Kafka connection setup as possible, but it still leads to mystery problems. In particular, the first thing I do is use the Kafka Admin Client to get the clusterId, because if this operation fails, nothing else is likely to succeed.
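import org.apache.kafka.clients.admin.DescribeClusterResult
import scala.util.{Failure, Success, Try}

// futureTimeout is defined elsewhere in the setup code (assumed to be a FiniteDuration).
def getKafkaClusterId(describeClusterResult: DescribeClusterResult): Try[String] = {
  try {
    // Block on the KafkaFuture for at most half of the configured timeout.
    val clusterId = describeClusterResult.clusterId().get(futureTimeout.length / 2, futureTimeout.unit)
    Success(clusterId)
  } catch {
    case cause: Exception =>
      Failure(cause)
  }
}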
In testing this usually works, and everything is fine. It generally only fails when the endpoint is not reachable somehow. It fails because the future times out, so I have no other diagnostics to go by. To test these problems, I usually telnet to the endpoint, for example
$ telnet blah 9094
Trying blah...
Connected to blah.
Escape character is '^]'.
Connection closed by foreign host.
Generally if I can telnet to a Kafka broker, I can connect to Kafka from my server. So my questions are:
What does it mean if I can reach the Kafka brokers via telnet, but I cannot connect via the Kafka Admin Client?
What other diagnostic techniques are there to troubleshoot Kafka broker connection problems?
In this particular case, I am running Kafka on AWS, via a Docker Swarm, and trying to figure out why my server cannot connect successfully. I can see in the broker logs when I try to telnet in, so I know the brokers are reachable. But when my server tries to connect to any of 3 brokers, the logs are completely silent.

This is a good article that explains the steps that happen when you first connect to a Kafka broker:
https://community.hortonworks.com/articles/72429/how-kafka-producer-work-internally.html
If you can telnet to the bootstrap server then it is listening for client connections and requests.
However, clients don't know which brokers are the leaders for each partition of a topic, so the first request they send to a bootstrap server is always a metadata request for the full topic metadata. The client then uses the metadata response to open new connections to the brokers that host the active leader of each partition of the topic you are trying to produce to.
That is where your misconfigured broker problem comes into play. When advertised.listeners is misconfigured, the response to that first metadata request directs the client to unreachable IP addresses or hostnames. It is that second connection that times out, not the first one on the port you are telnetting into.
Another way to think of it is that you have to configure a Kafka server to work properly as both a bootstrap server and a regular pub/sub message broker, since it provides both services to clients. Yours are configured correctly as a pub/sub server but incorrectly as a bootstrap server, because the internal and external IP addresses are different in AWS (as they are in Docker containers, behind a NAT, or behind a proxy).
It might seem counterintuitive in small clusters, where the bootstrap servers are often the same brokers the client eventually connects to, but it is actually a very helpful architectural design: it allows Kafka to scale and fail over seamlessly without requiring a static list of 20 or more brokers in your bootstrap server list, or extra load balancers and health checks to decide which broker should receive the client requests.
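The two-step connection described above can be checked from the client host with a plain TCP probe, much like the telnet test but run against every address the brokers advertise rather than only the bootstrap port. The following is a minimal sketch, assuming you already know the advertised host and port pairs (for example from the broker configuration). ListenerProbe, canConnect and probeAll are hypothetical helper names and are not part of the Kafka client library.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object ListenerProbe {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  // Like telnet, this only proves the port accepts connections; it does not
  // speak the Kafka protocol.
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }

  // Probes every endpoint and pairs each one with its result.
  def probeAll(endpoints: Seq[(String, Int)], timeoutMs: Int = 2000): Seq[((String, Int), Boolean)] =
    endpoints.map { case (host, port) => ((host, port), canConnect(host, port, timeoutMs)) }
}
```

If the bootstrap address passes this probe but an advertised address does not, the timeout is most likely happening on the second connection, which points at advertised.listeners rather than at basic network reachability.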

If you do not configure listeners and advertised.listeners correctly, Kafka effectively does not listen for clients. Even though telnet can connect to the ports you've configured, the Kafka client library silently fails.
I consider this a defect in the Kafka design which leads to unnecessary confusion.

Sharing Anand Immannavar's answer from another question:
Along with ADVERTISED_HOST_NAME, you need to add ADVERTISED_LISTENERS to the container environment.
ADVERTISED_LISTENERS - The broker registers this value in ZooKeeper, and clients outside the cluster connect to your Kafka cluster using the address you provide in the ADVERTISED_LISTENERS property.
example:
environment:
  - ADVERTISED_HOST_NAME=<Host IP>
  - ADVERTISED_LISTENERS=PLAINTEXT://<Host IP>:9092
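Because the advertised address has to be one that external clients can actually reach, it can help to sanity-check the configured value before starting the broker. The sketch below is only an illustration, assuming the LISTENER://host:port format shown above, with several entries separated by commas. AdvertisedListeners, parse and isPrivateIpv4 are hypothetical names, and the addresses in the usage note are made up.

```scala
import java.net.InetAddress
import scala.util.Try

object AdvertisedListeners {
  final case class Endpoint(listener: String, host: String, port: Int)

  // Matches entries such as PLAINTEXT://10.0.0.5:9092 (listener name, host, port).
  private val Entry = """^([A-Za-z0-9_.\-]+)://([^/]*):(\d{1,5})$""".r
  private val Ipv4 = """^\d{1,3}(\.\d{1,3}){3}$""".r

  // Splits a comma-separated listener string; entries that do not match are dropped.
  def parse(value: String): Seq[Endpoint] =
    value.split(",").toSeq.map(_.trim).collect {
      case Entry(name, host, port) => Endpoint(name, host, port.toInt)
    }

  // True when the host is a literal private (RFC 1918) IPv4 address, which
  // clients outside the VPC or Docker network usually cannot reach.
  // Hostnames are never resolved here, so they always return false.
  def isPrivateIpv4(host: String): Boolean =
    Ipv4.pattern.matcher(host).matches() &&
      Try(InetAddress.getByName(host).isSiteLocalAddress).getOrElse(false)
}
```

For example, parsing "PLAINTEXT://172.31.13.210:9092" yields one endpoint whose host is flagged by isPrivateIpv4, which suggests that clients outside the private network would be handed an address they cannot connect to.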

Related

Can't connect Burrow to my amazon MSK cluster

I've tried several ways to connect the Burrow application from my EC2 instance to my Kafka cluster to get the consumer lag metrics. I can console produce and consume messages from the instance, but the moment I try to connect Burrow it throws this error in the logs.
"name":"kafkatestingcluster","error":"kafka: client has run out of available brokers to talk to (Is your cluster reachable?)"
I have checked the bootstrap servers twice, and ZooKeeper as well, and they are okay. I have also tried with clusters running versions 1.1.0 and 2.2.1, and with different client versions in Burrow's config file.
Am I missing a step?
Would you mind sharing your configuration with us? Did you enter the correct port?
Have you tried running a simple telnet test from the host Burrow runs on to the Kafka brokers? Have you checked your inbound and outbound SG rules on AWS?
I would suggest testing those things first. If everything is good at that layer, switch the Burrow logging level to debug; I'm sure it will give you a better understanding of what's going on.

Best architecture to deploy a TCP/IP and UDP service on Amazon AWS (without EC2 instances)

I am trying to figure out the best way to deploy a TCP/IP and UDP service on Amazon AWS.
I researched this beforehand and could not find anything. I found other protocols such as HTTP and MQTT, but nothing for plain TCP or UDP.
I need to refactor a GPS tracking service that currently runs on Amazon EC2. The GPS devices send position data using the UDP and TCP protocols. Every time a message is received, the server has to respond with an ACKNOWLEDGE message, confirming reception to the GPS device.
The problems I am facing right now, which are the motivation for the refactor, are:
When the traffic increases, the server is not able to keep up with all the messages.
I tried to solve this issue with a load balancer and autoscaling, but UDP is not supported.
I was wondering if there is something like API Gateway that would give me a TCP or UDP endpoint, leave the message on an SQS queue, and process it with a Lambda function.
Thanks in advance!
Your question really doesn't make a lot of sense - you are asking how to run a service without running a server.
If you have reached the limits of a single instance, and you need to grow, look at using the AWS Network Load Balancer with an autoscaled group of EC2 instances. However, this will not support UDP - if you really need that, then you may have to look at 3rd party support in the AWS Marketplace.
Edit: Serverless architectures are designed for HTTP-based applications, where you send a request and get a response. Since your app is TCP based and uses persistent connections, most existing serverless implementations simply won't support it. You will need to rewrite your app to support HTTP, or use traditional server-based infrastructure that can support persistent connections.
Edit #2: As of Dec. 2018, API Gateway supports WebSockets. This probably doesn't help with the original question, but it opens up other alternatives if you need to run Lambda code behind a long-running connection.
If you want to go more serverless, I think the ECS Container Service has instances that accept TCP and UDP. Also take a look at running Docker containers with Kubernetes. I am not sure if they support those protocols, but I believe they do.
If not, some EC2 instances with load balancing can be your best bet.

Vertx Clustered EventBus not sending messages

Diagram of Setup
I've set up TCP discovery using Hazelcast, where parts of the cluster exist inside and outside the AWS cloud.
Inside AWS I can send and receive messages with no problem, but not externally.
Looking at the members, all 3 servers are in the list, but no messages are sent to server 3 on my local machine.
For testing, the AWS machines have their firewalls disabled, so the only thing I can think of is a firewall issue on my local network.
I tried making a new instance of Vertx on all servers with the EventBus port set to 80, but that stopped all messages.
Servers 1 and 2 are not reporting any failed-to-send issues, but I'm not sure what the problem is.
Does anybody have any ideas as to why server 3 cannot send or receive messages despite being in the cluster?

Tablet Server Access for Accumulo Running on AWS

I am attempting to run a simple driver to write some data to an Accumulo 1.5 instance running on AWS as a single-node cluster managed by CDH 4.7. The client successfully connects to ZooKeeper but then fails with the following message:
2015-06-26 12:12:13 WARN ServerClient:163 - Failed to find an available server in the list of servers: [172.31.13.210:10011:9997 (120000)]
I tried applying the solution listed here, but this has not resolved the issue. The IP that is set for the master/slave is the internal AWS IP for the server.
Other than the warning message, I have not been able to find anything else in the Accumulo logs that indicate what is preventing connection to the master server. Any suggestions on where to look next?
--EDIT--
It looks like ZooKeeper is returning connectors to the remote client that contain references to the internal IP of the AWS server. The remote client cannot use these connectors because it does not know about the internal IP. When I changed the internal IPs in the Thrift connector objects to the public IP, the connection worked fine. In essence, I can't figure out how to get ZooKeeper to return public IPs rather than AWS internal ones to remote clients.
172.31.13.210:10011:9997
This looks really strange. This should be an IP/hostname and a port. It looks like you have two ports somehow.
Did you list ports in the slaves file in ACCUMULO_CONF_DIR? This file should only contain the hostname/IP. If you want to change the port that a TabletServer listens on, you need to change tserver.port.client.

ELB for Websockets SSL

Does AWS support WebSockets with SSL?
Can AWS ELB be used for WebSockets over SSL?
What happens when an EC2 instance (machine) is added to or removed from this ELB? Especially when one is removed: what if a machine goes down? Are the existing sockets routed to some other machine, or are the connections reset?
Can ELB be a bottleneck at any point in time?
If there are any other alternatives, let me know.
This link might prove partially helpful for you. It would appear that you can do WebSockets over SSL, but currently I'm struggling to implement it.
StackOverflow - Websocket with Tomcat 7 on AWS Elastic Beanstalk
Currently AWS ELB doesn't support WebSocket balancing. There is a trick to do it via SSL, but it has some limitations and depends on your app logic. If the WebSocket connection is used only for server-client communication, it will work. But if you have more advanced logic where clients must communicate with each other via a server, then this solution won't work. For example, one client has established a connection for a chatroom, and other clients then connect to that chatroom and communicate with each other.
In that case the only possible way is to use HAProxy: http://blog.haproxy.com/2012/11/07/websockets-load-balancing-with-haproxy/
But the example there only shows how to configure HAProxy with two servers. So if you do not use an Amazon Auto Scaling Group, the solution is good. If you need to use an ASG, adding and removing instances in the HAProxy config is a separate challenge.