I am installing wso2am-analytics-2.2.0 with port offset 0, and I get error messages such as:
WARN {org.apache.spark.scheduler.TaskSetManager} - Lost task 0.0 in stage 2990.0 (TID 147439, 10.0.11.26): FetchFailed(BlockManagerId(0, someserver.compute.internal, 12001), shuffleId=745, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-10-0-17-131.eu-central-1.compute.internal:12001
Apparently something is configured to connect to port 12001 (while it seems the server listens on 12000).
Where can I configure port 12000?
Thanks
This port is defined in <Product_Home>/repository/conf/analytics/spark/spark-defaults.conf. The property name is spark.blockManager.port. However, you shouldn't normally need to configure it manually.
To my knowledge, this particular issue is a connectivity problem. DAS uses ports in the 1200x range for Spark executor communication. So with multiple executors, or when a new executor is spawned after one is killed, the next incremented port is opened. Hence, at the network level we also need to allow traffic through that port range. Opening that port range on the network interface of ip-10-0-17-131.eu-central-1.compute.internal should solve your issue.
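For illustration, a minimal sketch of the relevant spark-defaults.conf entry and the network rule it implies; the concrete port value and the width of the 1200x range are assumptions, so check the defaults that ship with your analytics version:
# <Product_Home>/repository/conf/analytics/spark/spark-defaults.conf
# Base port for Spark block manager / executor communication (illustrative value)
spark.blockManager.port    12000
# At the network level, allow inbound TCP on the whole range (e.g. 12000-12010)
# between analytics nodes, since each extra executor takes the next free port.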
I am trying to set up Kafka with 3 broker nodes and 1 ZooKeeper node on AWS EC2 instances. I have the following server.properties for each broker:
kafka-1:
broker.id=0
listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
advertised.listeners=PLAINTEXT_1://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
listener.security.protocol.map=,PLAINTEXT_1:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_1
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-2:
broker.id=1
listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
advertised.listeners=PLAINTEXT_2://ec2-**-***-**-43.eu-central-1.compute.amazonaws.com:9093
listener.security.protocol.map=,PLAINTEXT_2:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_2
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
kafka-3:
broker.id=2
listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
advertised.listeners=PLAINTEXT_3://ec2-**-***-**-27.eu-central-1.compute.amazonaws.com:9094
listener.security.protocol.map=,PLAINTEXT_3:PLAINTEXT
inter.broker.listener.name=PLAINTEXT_3
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181
zookeeper:
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
When I run the following command in ZooKeeper, I can see that they are connected.
I also telnetted from each broker to the others on the broker port, and they are all connected.
However, when I try to create a topic with replication factor 2, I get "Timed out waiting for a node assignment".
I cannot understand what is incorrect with my setup; I see 3 nodes running in ZooKeeper, but I have problems when creating a topic. BTW, when I use replication factor 1 I get the same error. How can I make sure that everything is alright with my cluster?
It's good that telnet checks that the port is open, but it doesn't verify that the Kafka protocol works. You could use the kcat utility for that, but the fix includes:
setting listeners to either PLAINTEXT://:9092 or PLAINTEXT://0.0.0.0:9092 on every broker, which means using the same port everywhere
removing the number from the listener name in the protocol map and the advertised.listeners property, so that the configuration is the same on each broker
I'd also recommend looking at Ansible/Terraform/CloudFormation so that you modify the cluster consistently rather than editing individual settings manually.
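For illustration, a sketch of what a corrected server.properties for kafka-1 could look like under those fixes; the other brokers would differ only in broker.id and the advertised hostname, and this is an assumption about your layout rather than a drop-in file:
broker.id=0
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://ec2-**-***-**-17.eu-central-1.compute.amazonaws.com:9092
# listener.security.protocol.map and inter.broker.listener.name can be left at
# their defaults once the listener is simply named PLAINTEXT
zookeeper.connect=ec2-**-***-**-105.eu-central-1.compute.amazonaws.com:2181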
I am new to Kafka, and I set up an instance on AWS. It runs well.
Then I created another AWS instance and ran the code:
[screenshot of the consumer code from the original post]
It can print out messages that I published to Kafka.
If I run the same code on the Kafka server itself, I also get messages.
However, if I run the same code on my own laptop, I can't get anything.
I thought it might be the code, so I used Kafka's own client on my laptop:
bin/kafka-console-consumer.sh --topic test22 --bootstrap-server 34.215.180.111:9092
Now I got an error:
[2021-05-11 16:21:32,252] WARN [Consumer clientId=consumer-console-consumer-94326-1, groupId=console-consumer-94326] Error connecting to node ip-172-31-29-222.us-west-2.compute.internal:9092 (id: 0 rack: null) (org.apache.kafka.clients.NetworkClient)
ip-172-31-29-222.us-west-2.compute.internal
That name is actually the AWS instance's internal address:
[screenshot from the original post showing the instance's internal address]
Then I thought it might be an Amazon issue, so I repeated the whole process on Google Cloud and got the same results:
[2021-05-11 17:15:34,840] WARN [Consumer clientId=consumer-console-consumer-2377-1, groupId=console-consumer-2377] Error connecting to node instance-1.us-central1-a.c.seventh-seeker-267203.internal:9092 (id: 0 rack: null) (org.apache.kafka.clients.NetworkClient)
These internal addresses cannot be accessed from external computers at all.
Can anybody help? Thanks!
The logs are showing you the advertised.listeners of the brokers. If you want clients to connect from outside, you'll need to modify that property so that the brokers advertise addresses that are resolvable by those clients.
https://www.confluent.io/blog/kafka-listeners-explained/
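As a sketch of the pattern described in that post, the broker can expose one listener for internal traffic and one for external clients; the listener names, port 19092, and the reuse of the addresses from your logs are assumptions:
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:19092
advertised.listeners=INTERNAL://ip-172-31-29-222.us-west-2.compute.internal:9092,EXTERNAL://34.215.180.111:19092
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL
Your laptop would then bootstrap against 34.215.180.111:19092, while the brokers keep talking to each other over the internal names.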
I deployed Kafka on Google Cloud and changed listeners to
PLAINTEXT://[internal ip address]:9092
And when I try
sudo ./bin/kafka-topics.sh --list --zookeeper [external IP address]:2181
I can list the topics on the broker. However, when I try to produce a message to the Kafka broker with
sudo ./bin/kafka-console-producer.sh --broker-list [external IP address]:9092 --topic test
the following error shows up:
ERROR Error when sending message to topic test with key: null, value: 5 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test-0: 1506 ms has passed since batch creation plus linger time
I wonder which properties I set wrong and how to fix them.
You need to set advertised.listeners to the external IP so that clients can correctly connect to it. Otherwise they'll try to connect to the internal IP (since advertised.listeners will default to listeners unless explicitly set)
Ref: https://kafka.apache.org/documentation/#brokerconfigs
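As a minimal sketch, keeping the placeholder from your question and assuming a single PLAINTEXT listener, the two properties would look like:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://[external IP address]:9092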
I am trying to set up a 6-node DSE 5.1 Spark cluster on AWS EC2 machines.
I have referred to the DSE documentation.
Just to start with, I have opened all TCP ports. When I checked the logs, I found that the worker, executor, and driver processes are using the ports below:
33xxx
33xxx
33xxx
34xxx
34xxx
34xxx
35xxx
35xxx
35xxx
36xxx
37xxx
37xxx
39xxx
40xxx
40xxx
41xxx
41xxx
43xxx
43xxx
43xxx
43xxx
45xxx
46xxx
The range here is from 33xxx to 46xxx. What is the suggested range of ports to open, or is there any way to bind the worker and executor ports?
By default the port selection is random.
See the Spark docs, specifically:
spark.blockManager.port
spark.driver.port
While you can lock these down to a specific value by setting them in the SparkConf or on the CLI through Spark Submit, you need to make sure that every application has unique values so they do not collide.
In most cases it makes sense to keep the Driver in the same VPN as the Cluster.
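For example, a hedged sketch of pinning those ports at submit time; the port numbers and application name are made up, and plain spark-submit takes the same --conf flags as DSE's dse spark-submit wrapper:
dse spark-submit \
  --conf spark.driver.port=37001 \
  --conf spark.blockManager.port=38001 \
  my_app.py
A second application running at the same time would need its own pair of ports, since two drivers or block managers cannot bind the same one.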
I am setting up Spark 0.9 on AWS and am finding that when I launch the interactive PySpark shell, my executors / remote workers are first registered:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-xx-xx-xxx-xxx.ec2.internal:54110/user/Executor#-862786598] with ID 0
and then disassociated almost immediately, before I have the chance to run anything:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Executor 0 disconnected, so removing it
14/07/08 22:48:05 ERROR scheduler.TaskSchedulerImpl: Lost an executor 0 (already removed): remote Akka client disassociated
Any idea what might be wrong? I've tried adjusting the JVM options spark.akka.frameSize and spark.akka.timeout, but I'm pretty sure this is not the issue since (1) I'm not running anything to begin with, and (2) my executors are disconnecting a few seconds after startup, which is well within the default 100s timeout.
Thanks!
Jack
I had a very similar problem, if not the same.
It started to work for me once the workers were connecting to the master using the very same name the master thought it had.
My log messages were something like:
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@idc1-hrm1.heylinux.com:7078] -> [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]].
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.121.127:7078] -> [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]]
WARN util.Utils: Your hostname, idc1-hrm1 resolves to a loopback address: 127.0.0.1; using 192.168.121.187 instead (on interface eth0)
So check the log of the master and see what name it thinks it has.
Then use that very same name on the workers.
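As a sketch of what that can look like in a standalone deployment, with the hostname taken from the logs above and everything else assumed:
# conf/spark-env.sh on the master: pin the name the master binds and advertises
SPARK_MASTER_IP=idc1-hrm1.heylinux.com
# on each worker: register using exactly that same name
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://idc1-hrm1.heylinux.com:7077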