Can anyone help me with remote HDFS file uploads? - hdfs

HDFS version: 3.3
Container ports: 9864, 9866, 9869, 9870, 9000
Company router port forwarding:
external ports: 9864, 9866, 9869, 9870, 9000
internal ports: 9864, 9866, 9869, 9870, 9000
Host PC /etc/hosts:
127.0.0.1 localhost
192.168.8.250 namenode
Container /etc/hosts:
127.0.0.1 localhost
172.19.0.2 192.168.8.250 namenode
A single container runs both the NameNode and the DataNode.
~/hdfs-site.xml:
dfs.client.use.datanode.hostname = true
Java client code:
dfs.client.use.datanode.hostname = "true"
With this setup, files can be uploaded remotely from another PC on the same network.
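For reference, a minimal sketch of the Java client upload (the fs.defaultFS address and file paths here are placeholders for my actual values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode RPC address; "namenode" resolves through the client's /etc/hosts
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Ask the client to reach the DataNode by hostname instead of the IP reported by the NameNode
        conf.set("dfs.client.use.datanode.hostname", "true");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Upload a local file into HDFS
            fs.copyFromLocalFile(new Path("test333.csv"), new Path("/user/root/test333.csv"));
        }
    }
}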
However, when uploading files from an external network (i.e. not the local network), the client throws a RemoteException.
Googling only turns up articles about NameNode initialization, running the DataNode, registering hosts file entries, and the dfs.client.use.datanode.hostname setting.
Can anyone help me?
There is nothing unusual in the DataNode log; the NameNode log is as follows:
2022-06-10 15:08:04,612 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741826_1002, replicas=127.0.0.1:9866 for /user/root/test333.csv
2022-06-10 15:08:25,816 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-06-10 15:08:25,816 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2022-06-10 15:08:25,816 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2022-06-10 15:08:25,816 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on default port 9000, call Call#5 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 222.222.222.222:8812
java.io.IOException: File /user/root/test333.csv could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2312)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2939)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:908)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:593)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:532)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1020)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:948)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2952)
2022-06-10 15:08:47,891 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 127.0.0.1
2022-06-10 15:08:47,891 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2022-06-10 15:08:47,891 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 161, 167
2022-06-10 15:08:47,891 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 8 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 1 Number of syncs: 7 SyncTimes(ms): 120
2022-06-10 15:08:47,901 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 8 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 1 Number of syncs: 8 SyncTimes(ms): 130
2022-06-10 15:08:47,901 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /root/hadoop-3.3.0/hdfs/namenode/current/edits_inprogress_0000000000000000161 -> /root/hadoop-3.3.0/hdfs/namenode/current/edits_0000000000000000161-0000000000000000168
2022-06-10 15:08:47,901 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 169

Related

Istio 1.9 integration with a virtual machine (AWS EC2): generated host file is empty

I have installed MySQL in a VM and want my EKS cluster (with Istio 1.9 installed) to talk to it. I am following https://istio.io/latest/docs/setup/install/virtual-machine/, but the host file generated in that step is an empty file.
I proceeded with the empty host file anyway and started the workload on the VM with:
> sudo systemctl start istio
Tailing /var/log/istio/istio.log shows:
2021-03-22T18:44:02.332421Z info Proxy role ips=[10.8.1.179 fe80::dc:36ff:fed3:9eea] type=sidecar id=ip-10-8-1-179.vm domain=vm.svc.cluster.local
2021-03-22T18:44:02.332429Z info JWT policy is third-party-jwt
2021-03-22T18:44:02.332438Z info Pilot SAN: [istiod.istio-system.svc]
2021-03-22T18:44:02.332443Z info CA Endpoint istiod.istio-system.svc:15012, provider Citadel
2021-03-22T18:44:02.332997Z info Using CA istiod.istio-system.svc:15012 cert with certs: /etc/certs/root-cert.pem
2021-03-22T18:44:02.333093Z info citadelclient Citadel client using custom root cert: istiod.istio-system.svc:15012
2021-03-22T18:44:02.410934Z info ads All caches have been synced up in 82.7974ms, marking server ready
2021-03-22T18:44:02.411247Z info sds SDS server for workload certificates started, listening on "./etc/istio/proxy/SDS"
2021-03-22T18:44:02.424855Z info sds Start SDS grpc server
2021-03-22T18:44:02.425044Z info xdsproxy Initializing with upstream address "istiod.istio-system.svc:15012" and cluster "Kubernetes"
2021-03-22T18:44:02.425341Z info Starting proxy agent
2021-03-22T18:44:02.425483Z info dns Starting local udp DNS server at localhost:15053
2021-03-22T18:44:02.427627Z info dns Starting local tcp DNS server at localhost:15053
2021-03-22T18:44:02.427683Z info Opening status port 15020
2021-03-22T18:44:02.432407Z info Received new config, creating new Envoy epoch 0
2021-03-22T18:44:02.433999Z info Epoch 0 starting
2021-03-22T18:44:02.690764Z warn ca ca request failed, starting attempt 1 in 91.93939ms
2021-03-22T18:44:02.693579Z info Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster istio-proxy --service-node sidecar~10.8.1.179~ip-10-8-1-179.vm~vm.svc.cluster.local --local-address-ip-version v4 --bootstrap-version 3 --log-format %Y-%m-%dT%T.%fZ %l envoy %n %v -l warning --component-log-level misc:error --concurrency 2]
2021-03-22T18:44:02.782817Z warn ca ca request failed, starting attempt 2 in 195.226287ms
2021-03-22T18:44:02.978344Z warn ca ca request failed, starting attempt 3 in 414.326774ms
2021-03-22T18:44:03.392946Z warn ca ca request failed, starting attempt 4 in 857.998629ms
2021-03-22T18:44:04.251227Z warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.8.0.2:53: no such host"
2021-03-22T18:44:04.849207Z warn ca ca request failed, starting attempt 1 in 91.182413ms
2021-03-22T18:44:04.940652Z warn ca ca request failed, starting attempt 2 in 207.680983ms
2021-03-22T18:44:05.148598Z warn ca ca request failed, starting attempt 3 in 384.121814ms
2021-03-22T18:44:05.533019Z warn ca ca request failed, starting attempt 4 in 787.704352ms
2021-03-22T18:44:06.321042Z warn sds failed to warm certificate: failed to generate workload certificate: create certificate: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.8.0.2:53: no such host"
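For context: that generated hosts file is what normally lets the VM resolve istiod.istio-system.svc, which is exactly the lookup failing in the log above. When generated correctly it is expected to contain an entry mapping istiod to the east-west gateway address, roughly like the sketch below (the IP is a placeholder, not a real value), appended to /etc/hosts on the VM:

203.0.113.10 istiod.istio-system.svc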

Copying files from HDFS to S3 on EMR cluster using S3DistCp

I am copying 800 Avro files, around 136 MB in size, from HDFS to S3 on an EMR cluster, but I'm getting this exception:
18/06/26 10:53:14 INFO mapreduce.Job: map 100% reduce 91%
18/06/26 10:53:14 INFO mapreduce.Job: Task Id : attempt_1529995855123_0003_r_000006_0, Status : FAILED
Error: java.lang.RuntimeException: Reducer task failed to copy 1 files: hdfs://url-to-aws-emr/user/hadoop/output/part-00258-3a28110a-9270-4639-b389-3e1f7f386ed6-c000.avro etc
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.cleanup(CopyFilesReducer.java:67)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:635)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
The configuration for the EMR cluster is:
core-site: fs.trash.checkpoint.interval = 60
core-site: fs.trash.interval = 60
hadoop-env: export HADOOP_CLIENT_OPTS=-Xmx10g
hdfs-site: dfs.replication = 3
Any help will be appreciated.
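For reference, the copy is run with an s3-dist-cp step roughly like the sketch below (the bucket name and paths are placeholders, not the real values):

s3-dist-cp --src hdfs:///user/hadoop/output --dest s3://my-bucket/output --srcPattern '.*\.avro'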
Edit:
Running the hdfs dfsadmin -report command, gives the following result:
[hadoop@~]$ hdfs dfsadmin -report
Configured Capacity: 79056308744192 (71.90 TB)
Present Capacity: 78112126204492 (71.04 TB)
DFS Remaining: 74356972374604 (67.63 TB)
DFS Used: 3755153829888 (3.42 TB)
DFS Used%: 4.81%
Under replicated blocks: 126
Blocks with corrupt replicas: 0
Missing blocks: 63
Missing blocks (with replication factor 1): 0
This suggests that blocks are missing. Does that mean I have to re-run the job? The report also shows 126 under-replicated blocks, which I take to mean 126 blocks are waiting to be replicated. How can I tell whether the missing blocks will be replicated?
Also, the under-replicated block count has stayed at 126 for the last 30 minutes. Is there any way to force it to replicate more quickly?
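One way to see exactly which files the missing blocks belong to, sketched with standard HDFS tooling rather than anything specific to this cluster:

hdfs fsck / -list-corruptfileblocks
hdfs fsck /user/hadoop/output -files -blocks -locations

Note that a block whose only replica is gone cannot be recovered by re-replication; the affected file would have to be rewritten.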
I got the same "Reducer task failed to copy 1 files" error and I found logs in HDFS /var/log/hadoop-yarn/apps/hadoop/logs related to the MR job that s3-dist-cp kicks off.
hadoop fs -ls /var/log/hadoop-yarn/apps/hadoop/logs
I copied them out to local:
hadoop fs -get /var/log/hadoop-yarn/apps/hadoop/logs/application_nnnnnnnnnnnnn_nnnn/ip-nnn-nn-nn-nnn.ec2.internal_nnnn
And then examined them in a text editor to find more diagnostic information about the detailed results of the Reducer phase. In my case I was getting an error back from the S3 service. You might find a different error.
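For example, something like the following (a generic sketch) pulls the relevant stack traces out of the downloaded log file:

grep -i -B 2 -A 20 'failed to copy' ip-nnn-nn-nn-nnn.ec2.internal_nnnn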

Geth private network problems generating ether

Short description
I have three Ethereum nodes connected in a private network, and I am using geth's interactive JavaScript console.
The problem is that I cannot find a way to get ether into any of the accounts; the balance is always 0.
Details
For all three nodes the configuration and output are similar, differing only in their addresses and account numbers.
File tree before running geth:
~/eth/
    database/
    keystore/
    genesis/
        CustomGenesis.json
Contents of CustomGenesis.json:
{
"config": {
"chainId": 15,
"homesteadBlock": 0,
"eip155Block": 0,
"eip158Block": 0
},
"nonce": "0x0000000000000042",
"timestamp": "0x00",
"parentHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"extraData": "0x00",
"gasLimit": "0x08000000",
"difficulty": "0x0400",
"mixhash": "0x0000000000000000000000000000000000000000000000000000000000000000",
"coinbase": "0xd77821c8b92e3e29bc63c8f2a94a6c6a64b28b53",
"alloc": {
"0x862e90e6b6ebfe0535081d07be8e0f38e422932c": {"balance": "100"},
"0x47e4cf0cc71e7257663f3d2f95e3f8982ece3ad8": {"balance": "200"},
"0x1df2f4f40c03367a9bf42b28a090fed1cccb3068": {"balance": "300"},
"0xd77821c8b92e3e29bc63c8f2a94a6c6a64b28b53": {"balance": "4444444444444444444"},
"0x28685a4b9418c1cb85725318756aa815e8e34497": {"balance": "5555555555555555555"},
"0x86f0526280fea57255c6391a4c7dbdbe8e1181ab": {"balance": "6666666666666666666"}
}
}
While in the directory ~/eth/ I started geth with:
sudo geth --networkid 15 --datadir ./database --nodiscover --maxpeers 2 --rpc --rpcport 8080 --rpccorsdomain * --rpcapi "db,eth,net,web3" --port 30303 --identity TestNet init ./genesis/CustomGenesis.json
... which produced the following output:
INFO [07-12|13:12:46] Starting peer-to-peer node instance=Geth/v1.6.6-stable-10a45cb5/linux-amd64/go1.8.1
INFO [07-12|13:12:46] Allocated cache and file handles database=/home/ethereum6/eth/database/geth/chaindata cache=128 handles=1024
INFO [07-12|13:12:46] Writing default main-net genesis block
INFO [07-12|13:12:47] Initialised chain configuration config="{ChainID: 1 Homestead: 1150000 DAO: 1920000 DAOSupport: true EIP150: 2463000 EIP155: 2675000 EIP158: 2675000 Metropolis: 9223372036854775807 Engine: ethash}"
INFO [07-12|13:12:47] Disk storage enabled for ethash caches dir=/home/ethereum6/eth/database/geth/ethash count=3
INFO [07-12|13:12:47] Disk storage enabled for ethash DAGs dir=/home/ethereum6/.ethash count=2
WARN [07-12|13:12:47] Upgrading db log bloom bins
INFO [07-12|13:12:47] Bloom-bin upgrade completed elapsed=222.754µs
INFO [07-12|13:12:47] Initialising Ethereum protocol versions="[63 62]" network=15
INFO [07-12|13:12:47] Loaded most recent local header number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [07-12|13:12:47] Loaded most recent local full block number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [07-12|13:12:47] Loaded most recent local fast block number=0 hash=d4e567…cb8fa3 td=17179869184
INFO [07-12|13:12:47] Starting P2P networking
INFO [07-12|13:12:47] HTTP endpoint opened: http://127.0.0.1:8080
INFO [07-12|13:12:47] RLPx listener up self="enode://5ded12c388e755791590cfe848635c7bb47d3b007d21787993e0f6259933c78033fd6fa17cbb884ed772f1c90aebaccc64c5c88cddc1260e875ac8f6f07067bf#[::]:30303?discport=0"
INFO [07-12|13:12:47] IPC endpoint opened: /home/ethereum6/eth/database/geth.ipc
The interactive JavaScript console is started in another terminal with:
sudo geth attach ipc:$HOME/eth/database/geth.ipc
... which gives:
Welcome to the Geth JavaScript console!
instance: Geth/v1.6.6-stable-10a45cb5/linux-amd64/go1.8.1
coinbase: 0x1fb9fb0502cb57fb654b88dd2d24e19a0eb91540
at block: 0 (Thu, 01 Jan 1970 03:00:00 MSK)
datadir: /home/ethereum6/eth/database
modules: admin:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
>
Etherbase is set on all nodes with miner.setEtherbase(personal.listAccounts[0]). Each node only has one account. (3 nodes, 3 accounts)
> eth.accounts
["0x1fb9fb0502cb57fb654b88dd2d24e19a0eb91540"]
> personal.listAccounts
["0x1fb9fb0502cb57fb654b88dd2d24e19a0eb91540"]
>
Calling admin.nodeInfo gives:
> admin.nodeInfo
{
enode: "enode://5ded12c388e755791590cfe848635c7bb47d3b007d21787993e0f6259933c78033fd6fa17cbb884ed772f1c90aebaccc64c5c88cddc1260e875ac8f6f07067bf#[::]:30303?discport=0",
id: "5ded12c388e755791590cfe848635c7bb47d3b007d21787993e0f6259933c78033fd6fa17cbb884ed772f1c90aebaccc64c5c88cddc1260e875ac8f6f07067bf",
ip: "::",
listenAddr: "[::]:30303",
name: "Geth/v1.6.6-stable-10a45cb5/linux-amd64/go1.8.1",
ports: {
discovery: 0,
listener: 30303
},
protocols: {
eth: {
difficulty: 17179869184,
genesis: "0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3",
head: "0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3",
network: 15
}
}
}
>
The nodes are connected with admin.addPeer(..) such that each node shows two peers when calling admin.peers.
When I start mining with miner.start(), this is the output that I receive in the interactive js console:
> miner.start()
null
>
... and in the other terminal running the node:
INFO [07-12|13:16:34] Updated mining threads threads=0
INFO [07-12|13:16:34] Transaction pool price threshold updated price=18000000000
INFO [07-12|13:16:34] Starting mining operation
INFO [07-12|13:16:34] Commit new mining work number=1 txs=0 uncles=0 elapsed=749.279µs
After that nothing happens and the balance on all accounts is still 0 when checking with eth.getBalance(eth.accounts[0]).
What options do I have to try and get the nodes on the private network to start mining ether?
Why does the preallocation of ether not work in CustomGenesis.json?
Was the difficulty provided in CustomGenesis.json ignored? admin.nodeInfo showed a different number.
All comments and suggestions are welcome, thanks!
You probably set the genesis difficulty so high that your CPU miners don't have a chance of finding a block. You probably want to set the difficulty to something more reasonable, such as 0x100000 (roughly 1 million).
OK, I'll provide what input I have (bear in mind that I am new as well, so we are in the same boat!).
The part I am confident about is the balance part, so:
1. Make a new account (on whichever node): personal.newAccount("password")
2. Set the new account to be the coinbase of this node: miner.setEtherbase(eth.accounts[0])
3. Start the mining: miner.start()
Then, you can check the balance while you are mining. Try:
web3.fromWei(eth.getBalance(eth.coinbase), "ether")
The problem was apparently the way the genesis block was initialized.
The Wrong Way
By calling geth with init and the other command-line arguments:
geth --networkid 15 --datadir ./database --nodiscover --maxpeers 2 --rpc --rpcport 8080 --rpccorsdomain * --rpcapi "db,eth,net,web3" --port 30303 --identity TestNet init ./genesis/CustomGenesis.json
the node is started with the mainnet genesis block:
...
INFO [07-12|13:12:46] Writing default main-net genesis block
...
and after that, nothing else works the way it is expected to.
The Solution
Call geth with init and --datadir arguments only:
geth --datadir /path/to/database init /path/to/CustomGenesis.json
A short output is given and geth immediately exits when the initialization is finished:
INFO [07-13|10:30:49] Allocated cache and file handles database=/path/to/database/geth/chaindata cache=16 handles=16
INFO [07-13|10:30:49] Writing custom genesis block
INFO [07-13|10:30:49] Successfully wrote genesis state database=chaindata hash=ed4e11…f40ac3
INFO [07-13|10:30:49] Allocated cache and file handles database=/path/to/database/geth/lightchaindata cache=16 handles=16
INFO [07-13|10:30:49] Writing custom genesis block
INFO [07-13|10:30:49] Successfully wrote genesis state database=lightchaindata hash=ed4e11…f40ac3
and after this, everything else works as expected.
Big thanks to Péter for helping me figure this out!
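After the separate init, geth itself is presumably started with the same flags as before, just without the init subcommand and the genesis path; note the quotes around the CORS domain so the shell does not expand the *. A sketch:

sudo geth --networkid 15 --datadir ./database --nodiscover --maxpeers 2 --rpc --rpcport 8080 --rpccorsdomain "*" --rpcapi "db,eth,net,web3" --port 30303 --identity TestNet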

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode. I'm now trying to get the same application to run on an AWS EMR cluster, but currently it's failing.
The message is one I've not seen before, and it implies that the executors are not receiving tasks and are being shut down.
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 5)
16/11/30 14:45:01 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 7
16/11/30 14:45:01 INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 4)
The DAG shows the executors being initialised, then a collect (one that is relatively small), and shortly after that they all fail.
Dynamic allocation was enabled, so one theory was that the driver wasn't sending them any tasks and they timed out. To test that theory I spun up another cluster without dynamic allocation, and the same thing happened.
The master is set to yarn.
Any help is massively appreciated, thanks.
16/11/30 14:49:16 INFO BlockManagerMaster: Removal of executor 21 requested
16/11/30 14:49:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
16/11/30 14:49:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My step is quite simple - spark-submit --deploy-mode client --master yarn --class Run app.jar
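For reference, dynamic allocation can also be pinned off directly at submit time rather than at the cluster level; a sketch, where the executor count is just an arbitrary example value:

spark-submit --deploy-mode client --master yarn --conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=6 --class Run app.jar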

Zookeeper cluster on AWS

I am trying to set up a ZooKeeper cluster on 3 AWS EC2 machines, but I keep getting the same error:
2016-10-19 16:30:23,177 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager#382] - Cannot open channel to 3 at election address /xxx.31.34.102:3888
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:402)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:840)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:762)
2016-10-19 16:30:23,185 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection#849] - Notification time out: 60000
1) I have the same security group for all three machines.
2) I am using the machines' private IPs in the conf:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/opt/data
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeIinterval=1
server.1=0.0.0.0:2888:3888
server.2=x.31.34.105:2888:3888
server.3=x.31.34.102:2888:3888
3) I have even tested with the machine's own private IP instead of "0.0.0.0".
I have not been able to identify what's going wrong.
For each node you must make sure that you specify 0.0.0.0 as that node's own IP address.
i.e., for server 01:
server.1=0.0.0.0:2888:3888
server.2=192.168.10.10:2888:3888
server.3=192.168.2.1:2888:3888
For server 02:
server.1=192.168.x.x:2888:3888
server.2=0.0.0.0:2888:3888
server.3=192.168.2.1:2888:3888
For server 03:
server.1=192.168.x.x:2888:3888
server.2=192.168.10.10:2888:3888
server.3=0.0.0.0:2888:3888
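For completeness, each node also needs a myid file under dataDir (here /opt/data) whose content matches that node's server.N id, e.g. (sketched):

echo 1 > /opt/data/myid   # on server 1
echo 2 > /opt/data/myid   # on server 2
echo 3 > /opt/data/myid   # on server 3

and the security group must allow inbound TCP on ports 2888 and 3888 between the three instances.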