How to properly configure HDFS high availability using ZooKeeper?

I'm a big data student, and I'm coming to you today with a question about HDFS high availability using ZooKeeper. I'm aware that there have already been plenty of topics dealing with this subject, and I have read a lot of them. I've been browsing the forums for 15 days now without finding what I'm looking for (maybe I'm not looking in the right place either ;-) )
I have followed the procedure described here three times: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html.
As far as I can tell I did everything right, but when I kill one of my namenodes, none of the others takes over.
My architecture is as follows:
- 5 VMs
- VMs 1, 3 and 5 are namenodes
- VMs 1 to 5 are datanodes
I launched my journalnodes, started my DFSZKFailoverControllers, formatted my first namenode, copied the metadata of my first namenode to the two others with -bootstrapStandby, and started my cluster.
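Concretely, the sequence of commands I ran was roughly the following (from memory, so treat it as a sketch; paths and service ids match the config files below):

```shell
# 1. On each journalnode host (10.10.10.15, .17, .19):
hdfs --daemon start journalnode

# 2. On the first namenode only: format the namespace and the ZK znode
hdfs namenode -format
hdfs zkfc -formatZK
hdfs --daemon start namenode

# 3. On the two other namenodes: pull the namespace metadata from the first
hdfs namenode -bootstrapStandby

# 4. Back on the master: start everything (namenodes, datanodes, ZKFCs)
start-dfs.sh
```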
Despite all this, and with no obvious problems in the ZKFC or namenode logs, I can't get a namenode to take over from a dying one.
Does anyone have any idea how to help me?
Many thanks for your help :)
zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=5
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=2
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/home/zookeeper/zoo
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
## Metrics Providers
#
# https://prometheus.io Metrics Exporter
#metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
#metricsProvider.httpPort=7000
#metricsProvider.exportJvmInfo=true
admin.serverPort=7979
server.1=10.10.10.15:2888:3888
server.2=10.10.10.16:2888:3888
server.3=10.10.10.17:2888:3888
server.4=10.10.10.18:2888:3888
server.5=10.10.10.19:2888:3888
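One ZooKeeper detail this file doesn't show: each ensemble member also needs a myid file in dataDir whose content matches its own server.N line, e.g. (assuming the dataDir above):

```shell
# On 10.10.10.15 (server.1); use 2 on 10.10.10.16, 3 on 10.10.10.17, etc.
echo 1 > /home/zookeeper/zoo/myid
```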
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- default filesystem (fs.defaultFS is the current key; fs.default.name is deprecated) -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://my-cluster</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<!-- zookeeper configuration -->
<property>
<name>ha.zookeeper.quorum</name>
<value>10.10.10.15:2181,10.10.10.16:2181,10.10.10.17:2181,10.10.10.18:2181,10.10.10.19:2181</value>
</property>
</configuration>
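Before suspecting HDFS, it can be worth confirming the ZooKeeper ensemble itself answers; with 3.4.x the four-letter-word commands are enabled by default, so a quick loop over the quorum members (IPs as above) should print imok for every healthy server:

```shell
for h in 10.10.10.15 10.10.10.16 10.10.10.17 10.10.10.18 10.10.10.19; do
  # 'ruok' asks the server whether it is running; a healthy server replies 'imok'
  echo "$h: $(echo ruok | nc "$h" 2181)"
done
```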
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- cluster configuration -->
<property>
<name>dfs.nameservices</name>
<value>my-cluster</value>
</property>
<!-- namenode configuration -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdfs/data/nameNode</value>
</property>
<!-- datanode configuration -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdfs/data/dataNode</value>
</property>
<!-- secondary namenode configuration -->
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/home/hdfs/data/secondaryNameNode</value>
</property>
<!-- replication factor -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- webhdfs connector -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<!-- high-availability configuration -->
<property>
<name>dfs.ha.namenodes.my-cluster</name>
<value>nn1,nn2,nn3</value>
</property>
<property>
<name>dfs.namenode.rpc-address.my-cluster.nn1</name>
<value>10.10.10.15:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.my-cluster.nn2</name>
<value>10.10.10.19:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.my-cluster.nn3</name>
<value>10.10.10.17:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.my-cluster.nn1</name>
<value>10.10.10.15:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.my-cluster.nn2</name>
<value>10.10.10.19:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.my-cluster.nn3</name>
<value>10.10.10.17:9870</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://10.10.10.15:8485;10.10.10.19:8485;10.10.10.17:8485/my-cluster</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/hdfs/data/journalNode</value>
</property>
<!-- failover configuration -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.my-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hdfsuser/.ssh/id_rsa</value>
</property>
</configuration>
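With automatic failover enabled as above, a few commands help verify the moving parts before killing anything (assuming the Hadoop and ZooKeeper binaries are on PATH, and the service ids nn1/nn2/nn3 from the config):

```shell
# Which namenode is currently active?
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
hdfs haadmin -getServiceState nn3

# Did the ZKFCs create their election znodes?
zkCli.sh -server 10.10.10.15:2181 ls /hadoop-ha/my-cluster

# sshfence needs non-interactive SSH between the namenodes as the HDFS user;
# if this prompts for a password or fails, fencing (and thus failover) fails too:
ssh -i /home/hdfsuser/.ssh/id_rsa -o BatchMode=yes hdfsuser@10.10.10.19 true
```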
dfs.service
[Unit]
Description=Hadoop DFS namenode and datanode
After=syslog.target network.target remote-fs.target nss-lookup.target network-online.target
Requires=network-online.target
[Service]
User=hdfsuser
Group=hdfsgroup
Type=simple
ExecStart=/apps/hadoop/sbin/start-dfs.sh
ExecStop=/apps/hadoop/sbin/stop-dfs.sh
RemainAfterExit=yes
Restart=on-failure
StartLimitInterval=350
StartLimitBurst=10
[Install]
WantedBy=multi-user.target
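A side note on the unit file: start-dfs.sh only spawns the daemons and then exits, so the usual systemd pattern for such wrapper scripts is Type=oneshot with RemainAfterExit=yes (Restart= is then dropped, since CentOS 7's systemd rejects it for oneshot units). A sketch of the [Service] section under that assumption:

```ini
[Service]
User=hdfsuser
Group=hdfsgroup
Type=oneshot
RemainAfterExit=yes
ExecStart=/apps/hadoop/sbin/start-dfs.sh
ExecStop=/apps/hadoop/sbin/stop-dfs.sh
```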
hadoop-hdfsuser-zkfc-node15-hdfs-spark-master.log
(before I crash a namenode)
2020-04-09 13:32:22,216 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG: host = node15-hdfs-spark-master/10.10.10.15
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.2.1
STARTUP_MSG: classpath = /apps/hadoop/etc/hadoop:/apps/hadoop/share/hadoop/common/lib/kerby-util-1.0.1.jar:/apps/hadoop/share/hadoop/common/lib/kerby-xdr-1.0.1.jar:/apps/hadoop/share/hado$
STARTUP_MSG: build = https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842; compiled by 'rohithsharmaks' on 2019-09-10T15:56Z
STARTUP_MSG: java = 1.8.0_242
************************************************************/
2020-04-09 13:32:22,229 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: registered UNIX signal handlers for [TERM, HUP, INT]
2020-04-09 13:32:22,628 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Failover controller configured for NameNode NameNode at hdfs-0/10.10.10.15:9000
2020-04-09 13:32:22,751 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 00:39 GMT
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=node15
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.8.0_242
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/apps/hadoop/etc/hadoop:/apps/hadoop/share/hadoop/common/lib/kerby-util-1.0.1.jar:/apps/hadoo$
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/apps/hadoop/lib/native
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
2020-04-09 13:32:22,756 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=3.10.0-1062.12.1.el7.x86_64
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=hdfsuser
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/hdfsuser
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/hdfsuser
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=node15:2181,node16:2181,node17:2181,node18:2181,node19:2181 sessionTimeout=10000 wat$
2020-04-09 13:32:22,777 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node19/10.10.10.19:2181. Will not attempt to authenticate using SASL (unknown error)
2020-04-09 13:32:22,784 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to node19/10.10.10.19:2181, initiating session
2020-04-09 13:32:22,817 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node19/10.10.10.19:2181, sessionid = 0x50000a3038f0000, negotiated timeout = 10000
2020-04-09 13:32:22,820 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2020-04-09 13:32:22,864 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue, queueCapacity: 300, scheduler: class org.apache.hadoop.$
2020-04-09 13:32:22,888 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8019
2020-04-09 13:32:22,920 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2020-04-09 13:32:22,920 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8019: starting
2020-04-09 13:32:23,049 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2020-04-09 13:32:23,049 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hdfs-0/10.10.10.15:9000 entered state: SERVICE_HEALTHY
2020-04-09 13:32:23,074 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2020-04-09 13:32:23,085 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0a6d792d636c757374657212036e6e321a06686466732d3420a84628d33e
2020-04-09 13:32:23,088 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at hdfs-4/10.10.10.19:9000
2020-04-09 13:32:23,102 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hdfs-4/10.10.10.19:9000 to standby state without fencing
2020-04-09 13:32:23,102 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/my-cluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2020-04-09 13:32:23,110 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at hdfs-0/10.10.10.15:9000 active...
2020-04-09 13:32:23,759 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hdfs-0/10.10.10.15:9000 to active state
hadoop-hdfsuser-zkfc-node15-hdfs-spark-master.log
(after I crash a namenode)
2020-04-09 13:32:22,216 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DFSZKFailoverController
STARTUP_MSG: host = node15-hdfs-spark-master/10.10.10.15
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.2.1
STARTUP_MSG: classpath = /apps/hadoop/etc/hadoop:/apps/hadoop/share/hadoop/common/lib/kerby-util-1.0.1.jar:/apps/hadoop/share/hadoop/common/lib/kerby-xdr-1.0.1.jar:/apps/hadoop/share/hadoop/common/lib/commons-net-$
STARTUP_MSG: build = https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842; compiled by 'rohithsharmaks' on 2019-09-10T15:56Z
STARTUP_MSG: java = 1.8.0_242
************************************************************/
2020-04-09 13:32:22,229 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: registered UNIX signal handlers for [TERM, HUP, INT]
2020-04-09 13:32:22,628 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Failover controller configured for NameNode NameNode at hdfs-0/10.10.10.15:9000
2020-04-09 13:32:22,751 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 00:39 GMT
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=node15
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.8.0_242
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.242.b08-0.el7_7.x86_64/jre
2020-04-09 13:32:22,752 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=/apps/hadoop/etc/hadoop:/apps/hadoop/share/hadoop/common/lib/kerby-util-1.0.1.jar:/apps/hadoop/share/hadoop/common/lib/$
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.library.path=/apps/hadoop/lib/native
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.name=Linux
2020-04-09 13:32:22,753 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.arch=amd64
2020-04-09 13:32:22,756 INFO org.apache.zookeeper.ZooKeeper: Client environment:os.version=3.10.0-1062.12.1.el7.x86_64
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.name=hdfsuser
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.home=/home/hdfsuser
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Client environment:user.dir=/home/hdfsuser
2020-04-09 13:32:22,757 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=node15:2181,node16:2181,node17:2181,node18:2181,node19:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.$
2020-04-09 13:32:22,777 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server node19/10.10.10.19:2181. Will not attempt to authenticate using SASL (unknown error)
2020-04-09 13:32:22,784 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to node19/10.10.10.19:2181, initiating session
2020-04-09 13:32:22,817 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node19/10.10.10.19:2181, sessionid = 0x50000a3038f0000, negotiated timeout = 10000
2020-04-09 13:32:22,820 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2020-04-09 13:32:22,864 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue, queueCapacity: 300, scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler, i$
2020-04-09 13:32:22,888 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8019
2020-04-09 13:32:22,920 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2020-04-09 13:32:22,920 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8019: starting
2020-04-09 13:32:23,049 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2020-04-09 13:32:23,049 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hdfs-0/10.10.10.15:9000 entered state: SERVICE_HEALTHY
2020-04-09 13:32:23,074 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2020-04-09 13:32:23,085 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0a6d792d636c757374657212036e6e321a06686466732d3420a84628d33e
2020-04-09 13:32:23,088 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at hdfs-4/10.10.10.19:9000
2020-04-09 13:32:23,102 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hdfs-4/10.10.10.19:9000 to standby state without fencing
2020-04-09 13:32:23,102 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/my-cluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2020-04-09 13:32:23,110 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at hdfs-0/10.10.10.15:9000 active...
2020-04-09 13:32:23,759 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at hdfs-0/10.10.10.15:9000 to active state
2020-04-09 13:38:59,910 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at hdfs-0/10.10.10.15:9000
java.io.EOFException: End of File Exception between local host is: "node15-hdfs-spark-master/10.10.10.15"; destination host is: "hdfs-0":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/ha$
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:833)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:791)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
at org.apache.hadoop.ipc.Client.call(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1388)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy9.getServiceStatus(Unknown Source)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.getServiceStatus(HAServiceProtocolClientSideTranslatorPB.java:136)
at org.apache.hadoop.ha.HealthMonitor.doHealthChecks(HealthMonitor.java:202)
at org.apache.hadoop.ha.HealthMonitor.access$600(HealthMonitor.java:49)
at org.apache.hadoop.ha.HealthMonitor$MonitorDaemon.run(HealthMonitor.java:296)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1850)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1183)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1079)
2020-04-09 13:38:59,913 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_NOT_RESPONDING
2020-04-09 13:38:59,913 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hdfs-0/10.10.10.15:9000 entered state: SERVICE_NOT_RESPONDING
2020-04-09 13:38:59,938 WARN org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Can't get local NN thread dump due to Connexion refusée (Connection refused)
2020-04-09 13:38:59,938 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at hdfs-0/10.10.10.15:9000 and marking that fencing is necessary
2020-04-09 13:38:59,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2020-04-09 13:38:59,947 INFO org.apache.zookeeper.ZooKeeper: Session: 0x50000a3038f0000 closed
2020-04-09 13:38:59,947 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x50000a3038f0000
2020-04-09 13:38:59,947 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down for session: 0x50000a3038f0000
2020-04-09 13:39:01,951 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdfs-0/10.10.10.15:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=10$
2020-04-09 13:39:01,953 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at hdfs-0/10.10.10.15:9000
java.net.ConnectException: Call From node15-hdfs-spark-master/10.10.10.15 to hdfs-0:9000 failed on connection exception: java.net.ConnectException: Connexion refusée; For more details see: http://wiki.apache.org/ha$
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:833)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
at org.apache.hadoop.ipc.Client.call(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1388)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy9.getServiceStatus(Unknown Source)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.getServiceStatus(HAServiceProtocolClientSideTranslatorPB.java:136)
at org.apache.hadoop.ha.HealthMonitor.doHealthChecks(HealthMonitor.java:202)
at org.apache.hadoop.ha.HealthMonitor.access$600(HealthMonitor.java:49)
at org.apache.hadoop.ha.HealthMonitor$MonitorDaemon.run(HealthMonitor.java:296)
Caused by: java.net.ConnectException: Connexion refusée
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:804)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:421)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1606)
at org.apache.hadoop.ipc.Client.call(Client.java:1435)
... 8 more
2020-04-09 13:39:03,956 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hdfs-0/10.10.10.15:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=10$
2020-04-09 13:39:03,958 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at hdfs-0/10.10.10.15:9000
java.net.ConnectException: Call From node15-hdfs-spark-master/10.10.10.15 to hdfs-0:9000 failed on connection exception: java.net.ConnectException: Connexion refusée; For more details see: http://wiki.apache.org/ha$
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:833)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
at org.apache.hadoop.ipc.Client.call(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1388)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy9.getServiceStatus(Unknown Source)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.getServiceStatus(HAServiceProtocolClientSideTranslatorPB.java:136)
at org.apache.hadoop.ha.HealthMonitor.doHealthChecks(HealthMonitor.java:202)
at org.apache.hadoop.ha.HealthMonitor.access$600(HealthMonitor.java:49)
at org.apache.hadoop.ha.HealthMonitor$MonitorDaemon.run(HealthMonitor.java:296)
Caused by: java.net.ConnectException: Connexion refusée
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:804)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:421)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1606)
at org.apache.hadoop.ipc.Client.call(Client.java:1435)
... 8 more
...
...
...

The problem with my configuration finally came from two commands that had not been installed when I set up the Hadoop cluster:
- the nc command: fixed by installing the nmap package from yum
- the fuser command: fixed by installing the psmisc package from yum
yum install -y nmap.x86_64
yum install -y psmisc.x86_64
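For anyone hitting the same thing: sshfence logs in to the old active namenode and (as far as I can tell) shells out to fuser to kill whatever owns the namenode port, so a missing binary makes fencing fail even though the logs look clean. A quick presence check, plain shell and nothing Hadoop-specific:

```shell
# Verify the helpers the fencing path relies on exist on every namenode host
for cmd in nc fuser; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: $(command -v "$cmd")"
  else
    echo "$cmd: MISSING"
  fi
done
```

Run it as the HDFS user on each namenode host; any MISSING line means fencing will break there.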
Hopefully this will help some of you.

Exception in thread "main" org.apache.spark.SparkException: Application application_1583867709817_0003 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/03/10 19:51:02 INFO ShutdownHookManager: Shutdown hook called
20/03/10 19:51:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-4c4ea7ac-b2bb-4a61-929d-c371d87417ff
20/03/10 19:51:02 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-a72b6dba-91bb-46b0-b2c3-893ac3b8581f
Command exiting with ret '1'
When I dug through the container logs, I found the script failing with the following error:
boto3 module not found
I ran a bootstrap script that installs boto3 using pip. I even logged in to the master node and confirmed with pip list that boto3 was installed.
You have to use a bootstrap script that installs boto3, but you have to be very specific about the Python version it installs into — it must match the Python that actually runs your job:
sudo pip-3.6 install boto3
On the console homepage, click on Create Cluster; at the top of the page that appears there is a "go to advanced options" link. There you will find the "Auto terminate" option ("After last step completes").
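To catch that mismatch early, here is a minimal sketch (the helper name `interpreter_report` is my own, not an EMR or Spark API): have the PySpark script log which interpreter it is running under before it touches boto3. If `sys.executable` is not the Python that your bootstrap's pip installed into, that would explain the exit code 13.

```python
import sys


def interpreter_report():
    """Report the running interpreter and whether boto3 is importable.

    On EMR, "boto3 module not found" despite a pip install usually means
    the driver/executors run a different Python than the one pip used.
    """
    try:
        import boto3  # noqa: F401
        has_boto3 = True
    except ImportError:
        has_boto3 = False
    return {"executable": sys.executable, "has_boto3": has_boto3}


# Log this at the top of the Spark script, before any boto3 usage.
print(interpreter_report())
```

If the paths disagree, one common fix is to point Spark at the matching interpreter, e.g. by setting the `PYSPARK_PYTHON` environment variable (and `spark.yarn.appMasterEnv.PYSPARK_PYTHON` in cluster mode) to the same python3 that pip-3.6 installed into.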

django-channels does not work with daphne on linux server

I'm using django-eventstream on top of Django Channels to send events to my client app (React, consuming the stream with EventSource).
On my local machine the events are sent to the client correctly, but when I upload the app to my Linux server the event-stream connection opens and stays on 'keep-alive', yet the events never reach the client.
I use Daphne to deploy my ASGI app and Nginx as my gateway.
When I use "python manage.py runserver" (on the Linux server), the client does get all the messages.
Because my clients do get the messages under runserver, I assume that my Nginx configuration is right (correct me if I'm wrong) and the problem is in Daphne somehow.
I don't see the events being sent in the Daphne logs at all.
Does anyone have a clue why this is happening?
Thanks!
the command I used to run Daphne:
daphne --verbosity 3 -p 8001 my_project.asgi:application
here is my Daphne log:
[06/12/2019 14:52:34] INFO [daphne.cli:287] Starting server at tcp:port=8001:interface=127.0.0.1
2019-12-06 14:52:34,376 INFO Starting server at tcp:port=8001:interface=127.0.0.1
[06/12/2019 14:52:34] INFO [daphne.server:311] HTTP/2 support enabled
2019-12-06 14:52:34,377 INFO HTTP/2 support enabled
[06/12/2019 14:52:34] INFO [daphne.server:311] Configuring endpoint tcp:port=8001:interface=127.0.0.1
2019-12-06 14:52:34,377 INFO Configuring endpoint tcp:port=8001:interface=127.0.0.1
[06/12/2019 14:52:34] INFO [daphne.server:131] HTTPFactory starting on 8001
2019-12-06 14:52:34,378 INFO HTTPFactory starting on 8001
[06/12/2019 14:52:34] INFO [daphne.server:131] Starting factory <daphne.http_protocol.HTTPFactory object at 0x7fb1fd600cf8>
2019-12-06 14:52:34,378 INFO Starting factory <daphne.http_protocol.HTTPFactory object at 0x7fb1fd600cf8>
[06/12/2019 14:52:34] INFO [daphne.server:654] Listening on TCP address 127.0.0.1:8001
2019-12-06 14:52:34,379 INFO Listening on TCP address 127.0.0.1:8001
[06/12/2019 14:52:38] DEBUG [daphne.http_protocol:160] HTTP b'GET' request for ['127.0.0.1', 49034]
2019-12-06 14:52:38,215 DEBUG HTTP b'GET' request for ['127.0.0.1', 49034]
[06/12/2019 14:52:38] DEBUG [daphne.http_protocol:246] HTTP 200 response started for ['127.0.0.1', 49034]
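For what it's worth, the symptom (connection opens, keep-alives arrive, events never do) is the classic proxy-buffering problem with server-sent events. A minimal Nginx location block for proxying the stream through to Daphne might look like this — a sketch only, assuming Daphne on 127.0.0.1:8001 as in the command above; paths and timeouts are illustrative:

```nginx
location /events/ {
    proxy_pass http://127.0.0.1:8001;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header Connection "";
    proxy_buffering off;       # SSE breaks if Nginx buffers the response
    proxy_cache off;
    proxy_read_timeout 24h;    # keep the long-lived stream open
}
```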

Hyperledger Sawtooth cannot start devmode consensus engine

I am trying to start up a Hyperledger Sawtooth network on Ubuntu 16.04, following the instructions at https://sawtooth.hyperledger.org/docs/core/releases/latest/app_developers_guide/ubuntu.html.
Starting the validator service works fine, but starting the devmode consensus engine does not. The following happened:
mdi#boromir:~$ sudo -u sawtooth devmode-engine-rust -vv --connect tcp://localhost:5050
ERROR | devmode_engine_rust: | ReceiveError: TimeoutError
DEBUG | sawtooth_sdk::messag | Disconnected outbound channel
DEBUG | sawtooth_sdk::messag | Exited stream
DEBUG | zmq:547 | socket dropped
DEBUG | zmq:547 | socket dropped
DEBUG | zmq:454 | context dropped
mdi#boromir:~$
The validator service was running, as follows:
mdi#boromir:~$ sudo -u sawtooth sawtooth-validator -vv
[sudo] password for mdi:
[2019-03-07 16:40:15.601 WARNING (unknown file)] [src/pylogger.rs: 40] Started logger at level DEBUG
[2019-03-07 16:40:15.919 DEBUG ffi] loading library libsawtooth_validator.so
[2019-03-07 16:40:15.926 DEBUG ffi] loading library libsawtooth_validator.so
[2019-03-07 16:40:16.299 INFO path] Skipping path loading from non-existent config file: /etc/sawtooth/path.toml
[2019-03-07 16:40:16.299 INFO validator] Skipping validator config loading from non-existent config file: /etc/sawtooth/validator.toml
[2019-03-07 16:40:16.300 INFO keys] Loading signing key: /etc/sawtooth/keys/validator.priv
[2019-03-07 16:40:16.306 INFO cli] sawtooth-validator (Hyperledger Sawtooth) version 1.1.4
[2019-03-07 16:40:16.307 INFO cli] config [path]: config_dir = "/etc/sawtooth"; config [path]: key_dir = "/etc/sawtooth/keys"; config [path]: data_dir = "/var/lib/sawtooth"; config [path]: log_dir = "/var/log/sawtooth"; config [path]: policy_dir = "/etc/sawtooth/policy"
[2019-03-07 16:40:16.307 WARNING cli] Network key pair is not configured, Network communications between validators will not be authenticated or encrypted.
[2019-03-07 16:40:16.333 DEBUG state_verifier] verifying state in /var/lib/sawtooth/merkle-00.lmdb
[2019-03-07 16:40:16.337 DEBUG state_verifier] block store file is /var/lib/sawtooth/block-00.lmdb
[2019-03-07 16:40:16.338 INFO state_verifier] Skipping state verification: chain head's state root is present
[2019-03-07 16:40:16.339 INFO cli] Starting validator with serial scheduler
[2019-03-07 16:40:16.339 DEBUG core] global state database file is /var/lib/sawtooth/merkle-00.lmdb
[2019-03-07 16:40:16.340 DEBUG core] txn receipt store file is /var/lib/sawtooth/txn_receipts-00.lmdb
[2019-03-07 16:40:16.341 DEBUG core] block store file is /var/lib/sawtooth/block-00.lmdb
[2019-03-07 16:40:16.342 DEBUG threadpool] Creating thread pool executor Component
[2019-03-07 16:40:16.343 DEBUG threadpool] Creating thread pool executor Network
[2019-03-07 16:40:16.343 DEBUG threadpool] Creating thread pool executor Client
[2019-03-07 16:40:16.343 DEBUG threadpool] Creating thread pool executor Signature
[2019-03-07 16:40:16.345 DEBUG threadpool] Creating thread pool executor FutureCallback
[2019-03-07 16:40:16.346 DEBUG threadpool] Creating thread pool executor FutureCallback
[2019-03-07 16:40:16.352 DEBUG threadpool] Creating thread pool executor Executing
[2019-03-07 16:40:16.353 DEBUG threadpool] Creating thread pool executor Consensus
[2019-03-07 16:40:16.353 DEBUG threadpool] Creating thread pool executor FutureCallback
[2019-03-07 16:40:16.358 DEBUG threadpool] Creating thread pool executor Instrumented
[2019-03-07 16:40:16.368 DEBUG selector_events] Using selector: ZMQSelector
[2019-03-07 16:40:16.376 INFO interconnect] Listening on tcp://127.0.0.1:4004
[2019-03-07 16:40:16.377 DEBUG dispatch] Added send_message function for connection ServerThread
[2019-03-07 16:40:16.377 DEBUG dispatch] Added send_last_message function for connection ServerThread
[2019-03-07 16:40:16.382 DEBUG genesis] genesis_batch_file: /var/lib/sawtooth/genesis.batch
[2019-03-07 16:40:16.384 DEBUG genesis] block_chain_id: not yet specified
[2019-03-07 16:40:16.384 INFO genesis] Producing genesis block from /var/lib/sawtooth/genesis.batch
[2019-03-07 16:40:16.385 DEBUG genesis] Adding 1 batches
I captured this output at 17:29, so no output had been appended for almost an hour.
I tried to see Sawtooth settings:
mdi#boromir:~$ sawtooth settings list
Error: Unable to connect to "http://localhost:8008": make sure URL is correct
mdi#boromir:~$
And I checked what processes were listening to what ports:
mdi#boromir:~$ netstat -plnt
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:4004 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
mdi#boromir:~$
Does anybody know whether it is the validator service or the consensus engine that initiates the connection? What is wrong with my sawtooth settings list command? And does anybody know how to get the consensus engine to work? Thanks.
I found the answers myself. I had another machine with a docker installation of Hyperledger Sawtooth. On that server, the validator log had the line:
[2019-03-08 14:39:02.478 INFO interconnect] Listening on tcp://127.0.0.1:5050
Port 5050 is used for the consensus engine as stated in https://sawtooth.hyperledger.org/docs/core/releases/latest/app_developers_guide/ubuntu.html. This makes clear that the consensus engine initiates the connection to the validator service.
So why didn't the validator service listen on port 5050 on my Ubuntu machine? Because the settings transaction processor had never been run on the Ubuntu machine. I started this processor with the command from the Ubuntu tutorial:
sudo -u sawtooth settings-tp -v
Then the validator proceeded and started listening to port 5050. As a consequence, the consensus engine could be started.
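To verify that hand-off before launching devmode-engine-rust, here is a quick sketch (the `port_open` helper is my own; the port numbers are from the tutorial): check whether the validator has opened the consensus endpoint yet.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# The validator opens its component endpoint (4004) right away, but — as
# observed above — only listens on the consensus endpoint (5050) after the
# settings transaction processor has run and genesis has completed.
for name, port in [("component (4004)", 4004), ("consensus (5050)", 5050)]:
    state = "listening" if port_open("127.0.0.1", port) else "not listening"
    print(name, state)
```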

Vora 1.3 Thriftserver cannot start

I'm deploying Vora 1.3 Services on HDP 2.3 using the Manager web UI, with mostly default configuration and node assignments. I've assigned the Vora Thriftserver service to the node that successfully hosted the same service in Vora 1.2 (which I have already removed).
The service doesn't start, though. Here's the relevant part of the log:
17/01/23 10:04:27 INFO Server: jetty-8.y.z-SNAPSHOT
17/01/23 10:04:27 INFO AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
17/01/23 10:04:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/01/23 10:04:27 INFO SparkUI: Started SparkUI at http://<jumpbox>:4040
17/01/23 10:04:28 INFO SparkContext: Added JAR file:/var/lib/ambari-agent/cache/stacks/HDP/2.3/services/vora-manager/package/lib/vora-spark/lib/spark-sap-datasources-1.3.102-assembly.jar at http://<jumpbox>:41874/jars/spark-sap-datasources-1.3.102-assembly.jar with timestamp 1485126268263
17/01/23 10:04:28 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/01/23 10:04:28 INFO Executor: Starting executor ID driver on host localhost
17/01/23 10:04:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37523.
17/01/23 10:04:28 INFO NettyBlockTransferService: Server created on 37523
17/01/23 10:04:28 INFO BlockManagerMaster: Trying to register BlockManager
17/01/23 10:04:28 INFO BlockManagerMasterEndpoint: Registering block manager localhost:37523 with 530.0 MB RAM, BlockManagerId(driver, localhost, 37523)
17/01/23 10:04:28 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/SparkPlanner
at org.apache.spark.sql.hive.sap.thriftserver.SapSQLEnv$.init(SapSQLEnv.scala:39)
at org.apache.spark.sql.hive.thriftserver.SapThriftServer$.main(SapThriftServer.scala:22)
at org.apache.spark.sql.hive.thriftserver.SapThriftServer.main(SapThriftServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
(.... goes on...)
Spark executable and Java executable paths in the Vora Thriftserver configuration tab are correct.
Did I miss something else?
You are running Vora 1.3, which means you must use HDP 2.4.2, which includes the required Spark 1.6.1 (the NoClassDefFoundError above is what a Spark version mismatch looks like). See the official Vora product availability matrix (PAM).

Storm UI timeout error

I created 3 AWS EC2 servers with RedHat 6 and used this tutorial to deploy Storm.
After creating the ZooKeeper and Nimbus instances I could manually start ZooKeeper and the Nimbus/UI nodes. nimbus:8080 showed me an empty topology list.
The third server was configured as one supervisor/slave node, and I saw it in the UI.
After that I added the supervisord option and changed some EC2 firewall options (unfortunately at the same time).
Now when I start ZooKeeper, Nimbus and the UI (with or without supervisord) and look at the UI, I get this error:
java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection timed out
I have already tried fiddling with the AWS firewall configs, but nothing changes anything, not even opening some ports to all IP addresses.
I used this readme to get the right settings.
The logs are all pretty empty. ZooKeeper seems to accept connections:
2015-12-04 21:47:04,151 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#192] - Accepted socket connection from /52.34.142.187:53935
2015-12-04 21:47:04,164 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#898] - Client attempting to establish new session at /52.34.142.187:53935
2015-12-04 21:47:04,165 [myid:] - INFO [SyncThread:0:FileTxnLog#199] - Creating new log file: log.195
2015-12-04 21:47:04,174 [myid:] - INFO [SyncThread:0:ZooKeeperServer#643] - Established session 0x15170091b370000 with negotiated timeout 20000 for client /52.34.142.187:53935
2015-12-04 21:47:04,177 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor#489] - Processed session termination for sessionid: 0x15170091b370000
2015-12-04 21:47:04,179 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1008] - Closed socket connection for client /52.34.142.187:53935 which had sessionid 0x15170091b370000
2015-12-04 21:47:04,181 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#192] - Accepted socket connection from /52.34.142.187:53936
2015-12-04 21:47:04,183 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#898] - Client attempting to establish new session at /52.34.142.187:53936
2015-12-04 21:47:04,187 [myid:] - INFO [SyncThread:0:ZooKeeperServer#643] - Established session 0x15170091b370001 with negotiated timeout 20000 for client /52.34.142.187:53936
2015-12-04 21:47:04,193 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor#489] - Processed session termination for sessionid: 0x15170091b370001
2015-12-04 21:47:04,194 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1008] - Closed socket connection for client /52.34.142.187:53936 which had sessionid 0x15170091b370001
2015-12-04 21:47:04,201 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#192] - Accepted socket connection from /52.34.142.187:53937
2015-12-04 21:47:04,203 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#898] - Client attempting to establish new session at /52.34.142.187:53937
2015-12-04 21:47:04,204 [myid:] - INFO [SyncThread:0:ZooKeeperServer#643] - Established session 0x15170091b370002 with negotiated timeout 20000 for client /52.34.142.187:53937
2015-12-04 21:47:54,973 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#192] - Accepted socket connection from /52.33.187.63:58714
2015-12-04 21:47:55,034 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#898] - Client attempting to establish new session at /52.33.187.63:58714
2015-12-04 21:47:55,035 [myid:] - INFO [SyncThread:0:ZooKeeperServer#643] - Established session 0x15170091b370003 with negotiated timeout 20000 for client /52.33.187.63:58714
2015-12-04 21:47:56,056 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor#489] - Processed session termination for sessionid: 0x15170091b370003
2015-12-04 21:47:56,058 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1008] - Closed socket connection for client /52.33.187.63:58714 which had sessionid 0x15170091b370003
2015-12-04 21:47:56,063 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#192] - Accepted socket connection from /52.33.187.63:58715
2015-12-04 21:47:56,065 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#898] - Client attempting to establish new session at /52.33.187.63:58715
2015-12-04 21:47:56,066 [myid:] - INFO [SyncThread:0:ZooKeeperServer#643] - Established session 0x15170091b370004 with negotiated timeout 20000 for client /52.33.187.63:58715
2015-12-04 21:49:12,000 [myid:] - INFO [SessionTracker:ZooKeeperServer#353] - Expiring session 0x15170091b370002, timeout of 20000ms exceeded
2015-12-04 21:49:12,000 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor#489] - Processed session termination for sessionid: 0x15170091b370002
2015-12-04 21:49:12,002 [myid:] - INFO [SyncThread:0:NIOServerCnxn#1008] - Closed socket connection for client /52.34.142.187:53937 which had sessionid 0x15170091b370002
2015-12-04 21:49:14,001 [myid:] - INFO [SessionTracker:ZooKeeperServer#353] - Expiring session 0x15170091b370004, timeout of 20000ms exceeded
2015-12-04 21:49:14,001 [myid:] - INFO [ProcessThread(sid:0 cport:2181)::PrepRequestProcessor#489] - Processed session termination for sessionid: 0x15170091b370004
2015-12-04 21:49:14,002 [myid:] - INFO [SyncThread:0:NIOServerCnxn#1008] - Closed socket connection for client /52.33.187.63:58715 which had sessionid 0x15170091b370004
nimbus.log:
2015-12-04 22:11:07.531 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 20082ms for sessionid 0x0, closing socket connection and attempting reconnect
2015-12-04 22:11:08.632 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-04 22:11:18.481 o.a.s.s.o.a.c.ConnectionState [WARN] Connection attempt unsuccessful after 31043 (greater than max timeout of 20000). Resetting connection and trying again with a new connection.
2015-12-04 22:11:27.743 o.a.s.s.o.a.z.ZooKeeper [INFO] Session: 0x0 closed
2015-12-04 22:11:27.743 o.a.s.s.o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=zkserver1:2181 sessionTimeout=20000 watcher=org.apache.storm.shade.org.apache.curator.ConnectionState#185100a6
2015-12-04 22:11:27.747 o.a.s.s.o.a.z.ClientCnxn [INFO] EventThread shut down
2015-12-04 22:11:27.750 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-04 22:11:47.753 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 20006ms for sessionid 0x0, closing socket connection and attempting reconnect
2015-12-04 22:11:48.854 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-04 22:12:03.860 o.a.s.s.o.a.c.ConnectionState [WARN] Connection attempt unsuccessful after 45379 (greater than max timeout of 20000). Resetting connection and trying again with a new connection.
2015-12-04 22:12:07.958 o.a.s.s.o.a.z.ZooKeeper [INFO] Session: 0x0 closed
2015-12-04 22:12:07.959 o.a.s.s.o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=zkserver1:2181 sessionTimeout=20000 watcher=org.apache.storm.shade.org.apache.curator.ConnectionState#185100a6
2015-12-04 22:12:07.959 o.a.s.s.o.a.z.ClientCnxn [INFO] EventThread shut down
2015-12-04 22:12:07.960 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-04 22:12:27.965 o.a.s.s.o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 20006ms for sessionid 0x0, closing socket connection and attempting reconnect
2015-12-04 22:12:29.066 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-04 22:12:44.074 o.a.s.s.o.a.c.ConnectionState [WARN] Connection attempt unsuccessful after 40213 (greater than max timeout of 20000). Resetting connection and trying again with a new connection.
2015-12-04 22:12:48.169 o.a.s.s.o.a.z.ZooKeeper [INFO] Session: 0x0 closed
2015-12-04 22:12:48.169 o.a.s.s.o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=zkserver1:2181 sessionTimeout=20000 watcher=org.apache.storm.shade.org.apache.curator.ConnectionState#185100a6
2015-12-04 22:12:48.170 o.a.s.s.o.a.z.ClientCnxn [INFO] EventThread shut down
2015-12-04 22:12:48.171 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-04 22:13:03.172 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181, initiating session
2015-12-04 22:13:03.177 o.a.s.s.o.a.z.ClientCnxn [INFO] Session establishment complete on server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181, sessionid = 0x15170091b370005, negotiated timeout = 20000
2015-12-04 22:13:03.179 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
2015-12-04 22:13:03.180 b.s.zookeeper [INFO] Zookeeper state update: :connected:none
2015-12-04 22:13:03.186 o.a.s.s.o.a.z.ZooKeeper [INFO] Session: 0x15170091b370005 closed
2015-12-04 22:13:03.186 o.a.s.s.o.a.z.ClientCnxn [INFO] EventThread shut down
2015-12-04 22:13:03.187 b.s.u.StormBoundedExponentialBackoffRetry [INFO] The baseSleepTimeMs [1000] the maxSleepTimeMs [30000] the maxRetries [5]
2015-12-04 22:13:03.188 o.a.s.s.o.a.c.f.i.CuratorFrameworkImpl [INFO] Starting
2015-12-04 22:13:03.190 o.a.s.s.o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=zkserver1:2181/storm sessionTimeout=20000 watcher=org.apache.storm.shade.org.apache.curator.ConnectionState#6c4fb026
2015-12-04 22:13:03.191 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181. Will not attempt to authenticate using SASL (unknown error)
2015-12-04 22:13:03.193 o.a.s.s.o.a.z.ClientCnxn [INFO] Socket connection established to ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181, initiating session
2015-12-04 22:13:03.195 o.a.s.s.o.a.z.ClientCnxn [INFO] Session establishment complete on server ec2-52-34-113-54.us-west-2.compute.amazonaws.com/52.34.113.54:2181, sessionid = 0x15170091b370006, negotiated timeout = 20000
2015-12-04 22:13:03.195 o.a.s.s.o.a.c.f.s.ConnectionStateManager [INFO] State change: CONNECTED
2015-12-04 22:13:03.233 b.s.d.nimbus [INFO] Starting Nimbus server...
Any ideas?
I found out: my hosts file was the same on every machine, so the nimbus entry pointed at the machine's external IP. Nimbus tried to connect through that external IP even though the service was running on localhost, so either the firewall needs to be configured properly or the hosts file adjusted.
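For illustration, the shape of the hosts-file fix looks like this (all addresses and hostnames are placeholders, not the asker's actual values): on each node, resolve the cluster hostnames to the private addresses, so intra-cluster traffic never loops out through the external interface and the firewall.

```text
# /etc/hosts on each cluster node (illustrative addresses)
127.0.0.1      localhost
172.31.0.10    zkserver1    # private IP, NOT the public/elastic IP
172.31.0.11    nimbus
172.31.0.12    supervisor1
```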