Kafka multi-broker setup on EC2 machines: Timed out waiting for a node assignment. Call: createTopics

I am trying to set up Kafka with 3 broker nodes and 1 ZooKeeper node on AWS EC2 instances. I have the following server.properties for each broker:
kafka-1:
broker.id=0
listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
advertised.listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
listener.security.protocol.map=,PLAINTEXT_1:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_1
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-2:
broker.id=1
listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
advertised.listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
listener.security.protocol.map=,PLAINTEXT_2:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_2
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-3:
broker.id=2
listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
advertised.listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
listener.security.protocol.map=,PLAINTEXT_3:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_3
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
zookeeper:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
When I check ZooKeeper, I see that all the brokers are connected.
I also ran telnet from each broker to the others on the broker ports, and they all connect.
However, when I try to create a topic with a replication factor of 2, I get "Timed out waiting for a node assignment".
I cannot understand what is wrong with my setup. I see 3 nodes running in ZooKeeper, but creating a topic fails. BTW, when I set the replication factor to 1 I get the same error. How can I make sure that everything is all right with my cluster?

It's good that telnet confirms the port is open, but it doesn't verify that the Kafka protocol works. You could use the kcat utility for that, but the fix includes:
Setting listeners to either PLAINTEXT://:9092 or PLAINTEXT://0.0.0.0:9092 on every broker, which means using the same port everywhere.
Removing the number from the listener name in the listener mapping and in the advertised listeners property, so that the listener name is the same on each broker.
I'd also recommend looking at Ansible/Terraform/CloudFormation to ensure you modify the cluster consistently rather than editing individual settings manually.
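The listener changes above can be sketched as a small generator that renders the same settings for every broker, so only broker.id and the advertised host differ. This is a minimal sketch, not a definitive configuration: the host names and the render_broker_properties helper are hypothetical, and it assumes a single PLAINTEXT listener on port 9092.

```python
def render_broker_properties(broker_id, advertised_host, zookeeper, port=9092):
    """Render a minimal server.properties for one broker.

    Every broker gets the same listener name (PLAINTEXT) and the same port;
    only broker.id and the advertised host change.
    """
    lines = [
        f"broker.id={broker_id}",
        # Bind on all interfaces instead of the public DNS name.
        f"listeners=PLAINTEXT://0.0.0.0:{port}",
        # The address clients and other brokers are told to connect to.
        f"advertised.listeners=PLAINTEXT://{advertised_host}:{port}",
        "listener.security.protocol.map=PLAINTEXT:PLAINTEXT",
        "inter.broker.listener.name=PLAINTEXT",
        f"zookeeper.connect={zookeeper}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical host names, standing in for the masked EC2 DNS names.
BROKER_HOSTS = ["kafka-1.example.com", "kafka-2.example.com", "kafka-3.example.com"]
ZOOKEEPER = "zookeeper.example.com:2181"

configs = {
    host: render_broker_properties(broker_id, host, ZOOKEEPER)
    for broker_id, host in enumerate(BROKER_HOSTS)
}

if __name__ == "__main__":
    for host, text in configs.items():
        print(f"# {host}\n{text}")
```

Generating the files from one template is also a small-scale version of what a tool such as Ansible would do for the whole cluster.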

Related

Bootstrap IP to internal address conversion

I am new to Kafka and set up an instance on AWS. It runs well.
Then I created another AWS instance and ran this code:
[image of the code, not reproduced]
It prints the messages that I published to Kafka.
If I run the same code on the Kafka server itself, I also get the messages.
However, if I run the same code on my own laptop, I get nothing.
I thought it might be the code, so I used Kafka's own client on my laptop:
bin/kafka-console-consumer.sh --topic test22 --bootstrap-server 34.215.180.111:9092
Now I get an error:
[2021-05-11 16:21:32,252] WARN [Consumer clientId=consumer-console-consumer-94326-1, groupId=console-consumer-94326] Error connecting to node ip-172-31-29-222.us-west-2.compute.internal:9092 (id: 0 rack: null) (org.apache.kafka.clients.NetworkClient)
The name ip-172-31-29-222.us-west-2.compute.internal is actually the AWS instance's internal address:
[image of the instance details, not reproduced]
Then I thought it might be Amazon's issue so I repeated the whole process in Google Cloud and got the same results:
[2021-05-11 17:15:34,840] WARN [Consumer clientId=consumer-console-consumer-2377-1, groupId=console-consumer-2377] Error connecting to node instance-1.us-central1-a.c.seventh-seeker-267203.internal:9092 (id: 0 rack: null) (org.apache.kafka.clients.NetworkClient)
These internal addresses cannot be reached from external computers at all.
Can anybody help? thanks!
The logs are showing you the advertised.listeners of the brokers. A client uses the bootstrap address only to fetch cluster metadata, then connects to whatever address each broker advertises. To connect from outside, you'll need to modify that property so that the brokers advertise addresses the clients can resolve and reach.
https://www.confluent.io/blog/kafka-listeners-explained/
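One common way to do this is to give the broker two listeners: one advertised with the internal name for traffic inside the cloud network, and one advertised with the public address for outside clients. The sketch below renders such settings. The listener names INTERNAL and EXTERNAL, the internal port 19092, and the render_listener_settings helper are assumptions for illustration; the two host values are taken from the question.

```python
def render_listener_settings(internal_host, external_host,
                             internal_port=19092, external_port=9092):
    """Build broker settings with separate internal and external listeners."""
    return {
        # Bind both listeners on all interfaces, on different ports.
        "listeners": (
            f"INTERNAL://0.0.0.0:{internal_port},"
            f"EXTERNAL://0.0.0.0:{external_port}"
        ),
        # Each listener advertises the address that its clients can reach.
        "advertised.listeners": (
            f"INTERNAL://{internal_host}:{internal_port},"
            f"EXTERNAL://{external_host}:{external_port}"
        ),
        "listener.security.protocol.map": "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT",
        "inter.broker.listener.name": "INTERNAL",
    }


settings = render_listener_settings(
    internal_host="ip-172-31-29-222.us-west-2.compute.internal",
    external_host="34.215.180.111",
)
properties_text = "\n".join(f"{key}={value}" for key, value in settings.items())

if __name__ == "__main__":
    print(properties_text)
```

With settings like these, a laptop that bootstraps against 34.215.180.111:9092 is handed the public address back in the metadata, while clients inside the network can keep using the internal name on the other port. The external port still has to be open in the instance's firewall rules.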

Greengrass_HelloWorld lambda doesn't publish to Amazon IoT console

I have been following the documentation in every step, and I didn't face any errors. Configured, deployed and made a subscription to hello/world topic just as the documentation detailed. However, when I arrived at the testing step here: https://docs.aws.amazon.com/greengrass/latest/developerguide/lambda-check.html
No messages were showing up on the IoT console (subscription view hello/world)! I am using Greengrass core daemon which runs on my Ubuntu machine, it is active and listens to port 8000. I don't think there is anything wrong with my local device because the group was deployed successfully and because I see the communications going both ways on Wireshark.
I have these logs on my machine: /home/##/Desktop/greengrass/ggc/var/log/system/runtime.log:
[2019-09-28T06:57:42.492-07:00][INFO]-===========================================
[2019-09-28T06:57:42.492-07:00][INFO]-Greengrass Version: 1.9.3-RC3
[2019-09-28T06:57:42.492-07:00][INFO]-Greengrass Root: /home/##/Desktop/greengrass
[2019-09-28T06:57:42.492-07:00][INFO]-Greengrass Write Directory: /home/##/Desktop/greengrass/ggc
[2019-09-28T06:57:42.492-07:00][INFO]-Group File Directory: /home/##/Desktop/greengrass/ggc/deployment/group
[2019-09-28T06:57:42.492-07:00][INFO]-Default Lambda UID: 122
[2019-09-28T06:57:42.492-07:00][INFO]-Default Lambda GID: 127
[2019-09-28T06:57:42.492-07:00][INFO]-===========================================
[2019-09-28T06:57:42.492-07:00][INFO]-The current core is using the AWS IoT certificates with fingerprint. {"fingerprint": "90##4d"}
[2019-09-28T06:57:42.492-07:00][INFO]-Will persist worker process info. {"dir": "/home/##/Desktop/greengrass/ggc/ggc/core/var/worker/processes"}
[2019-09-28T06:57:42.493-07:00][INFO]-Will persist worker process info. {"dir": "/home/##/Desktop/greengrass/ggc/ggc/core/var/worker/processes"}
[2019-09-28T06:57:42.494-07:00][INFO]-No proxy URL found.
[2019-09-28T06:57:42.495-07:00][INFO]-Started Deployment Agent to listen for updates.
[2019-09-28T06:57:42.495-07:00][INFO]-Connecting with MQTT. {"endpoint": "a6##ws-ats.iot.us-east-2.amazonaws.com:8883", "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.497-07:00][INFO]-The current core is using the AWS IoT certificates with fingerprint. {"fingerprint": "90##4d"}
[2019-09-28T06:57:42.685-07:00][INFO]-MQTT connection successful. {"attemptId": "GVko", "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-MQTT connection established. {"endpoint": "a6##ws-ats.iot.us-east-2.amazonaws.com:8883", "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-MQTT connection connected. Start subscribing. {"clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-Deployment agent connected to cloud.
[2019-09-28T06:57:42.685-07:00][INFO]-Start subscribing. {"numOfTopics": 2, "clientId": "simulators_gg_Core"}
[2019-09-28T06:57:42.685-07:00][INFO]-Trying to subscribe to topic $aws/things/simulators_gg_Core-gda/shadow/update/delta
[2019-09-28T06:57:42.727-07:00][INFO]-Trying to subscribe to topic $aws/things/simulators_gg_Core-gda/shadow/get/accepted
[2019-09-28T06:57:42.814-07:00][INFO]-All topics subscribed. {"clientId": "simulators_gg_Core"}
[2019-09-28T06:58:57.888-07:00][INFO]-Daemon received signal: terminated.
[2019-09-28T06:58:57.888-07:00][INFO]-Shutting down daemon.
[2019-09-28T06:58:57.888-07:00][INFO]-Stopping all workers.
[2019-09-28T06:58:57.888-07:00][INFO]-Lifecycle manager is stopped.
[2019-09-28T06:58:57.888-07:00][INFO]-IPC server stopped.
/home/##/Desktop/greengrass/ggc/var/log/system/localwatch/localwatch.log:
[2019-09-28T06:57:42.491-07:00][DEBUG]-will keep the log files for the following lambdas {"readingPath": "/home/##/Desktop/greengrass/ggc/var/log/user", "lambdas": "map[]"}
[2019-09-28T06:57:42.492-07:00][WARN]-failed to list the user log directory {"path": "/home/##/Desktop/greengrass/ggc/var/log/user"}
Thanks in advance.
I had a similar issue on another platform (Jetson Nano). I could not get a response after going through the AWS instructions for setting up a simple Lambda using IoT Greengrass. In my search for answers I discovered that AWS has a qualification test script for any device you connect.
It goes through an automated process of deploying and testing a Lambda function (as well as other functionality) and reports results for each step, and the docs provide troubleshooting info for failures.
By going through those tests I was able to narrow down the issues with my setup, installation, and configuration. Here is a link to the test: https://docs.aws.amazon.com/greengrass/latest/developerguide/device-tester-for-greengrass-ug.html
If you follow the 'Next Topic' links, it will take you through the complete test. Let me warn you that it's extensive and will take some time, but for me it gave a lot of detailed insight that a hello world does not.

wso2am-analytics 2.2.0 spark on offset 0

After installing wso2am-analytics-2.2.0 with port offset 0, I get error messages such as:
WARN {org.apache.spark.scheduler.TaskSetManager} - Lost task 0.0 in stage 2990.0 (TID 147439, 10.0.11.26): FetchFailed(BlockManagerId(0, someserver.compute.internal, 12001), shuffleId=745, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-10-0-17-131.eu-central-1.compute.internal:12001
Apparently something is configured to connect to port 12001 (while the server seems to listen on 12000).
Where can I configure port 12000?
Thanks
This port is defined in <Product_Home>/repository/conf/analytics/spark/spark-defaults.conf. The property name is spark.blockManager.port. However, you shouldn't configure it manually.
As far as I know, this particular issue is a connectivity problem. DAS uses ports in the 1200x range for Spark executor communication. When there are multiple executors, or when a new executor is spawned after one has been killed, the next port in the range is opened. That traffic must therefore be allowed at the network level as well, so opening that port range on ip-10-0-17-131.eu-central-1.compute.internal should solve your issue.
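To see which ports would need to be open, the incrementing behaviour can be sketched as follows. This assumes plain Spark semantics, where a bind failure on a fixed port is retried on the next port up to spark.port.maxRetries times (16 by default in Spark); whether DAS changes that limit is not confirmed here, and candidate_ports is a hypothetical helper.

```python
def candidate_ports(base_port, max_retries=16):
    """Ports Spark may bind when starting from base_port.

    On a bind failure Spark tries the next port, so the candidates run
    from base_port through base_port + max_retries inclusive.
    """
    return range(base_port, base_port + max_retries + 1)


# 12000 is the port the question reports the server listening on.
ports = candidate_ports(12000)

if __name__ == "__main__":
    print(f"allow TCP {ports[0]}-{ports[-1]} between the nodes")
```

Port 12001 from the error message falls inside this range, which is consistent with a second executor having taken the next port after 12000.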

DSE spark cluster on AWS worker and Executor ports

I am trying to set up a 6-node DSE 5.1 Spark cluster on AWS EC2 machines.
I have referred to the DSE documentation.
Just to start with, I opened all TCP ports. When I checked the logs, I found that the worker, executor, and driver processes are using the ports below:
33xxx
33xxx
33xxx
34xxx
34xxx
34xxx
35xxx
35xxx
35xxx
36xxx
37xxx
37xxx
39xxx
40xxx
40xxx
41xxx
41xxx
43xxx
43xxx
43xxx
43xxx
45xxx
46xxx
The range here is from 33xxx to 46xxx. What is the suggested range of ports to open? Or is there any way to bind the worker and executor ports?
By default the port selection is random.
See the Spark docs, specifically:
spark.blockManager.port
spark.driver.port
While you can lock these down to a specific value by setting them in the SparkConf or on the CLI through Spark Submit, you need to make sure that every application has unique values so they do not collide.
In most cases it makes sense to keep the Driver in the same VPN as the Cluster.
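As a rough sketch of keeping those values unique, the helper below hands each application its own pair of ports and turns them into --conf arguments for spark-submit. The application names, the starting port 40000, and both helper functions are hypothetical; only the two property names come from the Spark configuration.

```python
def assign_ports(app_names, first_port=40000):
    """Give each application its own driver port and block manager port."""
    plan = {}
    next_port = first_port
    for name in app_names:
        plan[name] = {
            "spark.driver.port": next_port,
            "spark.blockManager.port": next_port + 1,
        }
        next_port += 2  # reserve two ports per application
    return plan


def spark_submit_conf_args(conf):
    """Turn a settings dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical application names.
plan = assign_ports(["etl-job", "report-job"])

if __name__ == "__main__":
    for name, conf in plan.items():
        print(name, " ".join(spark_submit_conf_args(conf)))
```

The firewall then only needs to allow the reserved range instead of everything from 33xxx to 46xxx. Leaving a few spare ports after each fixed value may be prudent, since Spark moves to the next port when one is already taken.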

Can Akka Cluster Client Send Messages to Cluster Nodes Not in Initial Contacts?

Using Akka 2.3.14, I'm trying to create an Akka cluster of various services. Until now, I have had all my "services" in one artifact that was clustered across multiple nodes, but now I am trying to break this artifact into multiple services that all exist on the same cluster.
So in breaking this up, we've designed it so that any node on the cluster will first try to connect to the seed nodes. If there is no seed node, it will check whether it is a candidate to run as a seed node (that is, whether it's on a host that a seed node can be on), in which case it will grab an open seed node port and become a seed node. So in this sense, any service in the cluster can become the seed node.
At least, that was the idea. Our API into this system running as a separate service implements a ClusterClient into this system. The initialContacts are set to be the same as the seed nodes. The problem is that the only receptionist actors I can send a message to through the ClusterClient are the actors on the seed nodes.
Here is an example if it helps. Let's say I have a String Service and a Double Service, and the receptionist for each service is a StringActor and a DoubleActor respectively. Now let's say I have a Client Service which sends StringMessages and DoubleMessages to the StringActor and DoubleActor.
So for simplicity, let's say I have two nodes, server1 and server2 then:
seed-nodes = ["akka.tcp://system@server1:2773", "akka.tcp://system@server2:2773"]
My ClusterClient would be initialized like so:
system.actorOf(
  ClusterClient.props(
    Set(
      system.actorSelection("akka.tcp://system@server1:2773/user/receptionist"),
      system.actorSelection("akka.tcp://system@server2:2773/user/receptionist")
    )
  ),
  "clusterClient"
)
Here are the scenarios that are happening for me:
If the StringServices start up on both servers first, then DoubleMessages from the Client Service just disappear into the ether.
If the DoubleServices start up on both servers first, then StringMessages from the Client Service just disappear into the ether.
If the StringService starts up first on serverX and the DoubleService starts up first on serverY, then all StringMessages will be sent to serverX and all DoubleMessages will be sent to serverY, which is not as bad as the above case, but it means it's not really scaling.
This isn't what I expected. It's possible it's just a defect in my code, so I would like to know whether this IS expected behavior or not. And if not, is there another Akka concept that could help me with this?
Arguably, I could just make one service type my entry point, like a RoutingService that could accept StringMessages or DoubleMessages, and then send that to the correct service. But if the Client Service can only send messages to the RoutingService instances that are in the initial contacts, then I can't dynamically scale the RoutingService because no matter how many nodes I add the Client Service can only send to the initial contacts.
I'm also thinking about subscribing to ClusterEvents in my Client Service and seeing if I can add and remove initial contacts from my cluster client as nodes are started up in the cluster, but I'm not sure if this is possible, and it feels like there should be a better solution.
This is what I found out upon more troubleshooting, in case it helps anyone else:
The ClusterClient will attempt to connect to the initial contacts in order, and then sends its messages only across that connection. If you are deploying different services on each node, you will have problems, because the messages from the ClusterClient will only go to the node it has connected to. In this way, you can think of the ClusterClient as a legitimate client: it connects to a URL that you give it, and then continues to communicate with the server through that URL.
Reading the Distributed Workers example, I realized that my Frontend, or in this case my routing service, should actually be part of the cluster, rather than acting as a client. For this I used the DistributedPubSub method instead.