Spark Job Crashes with error in prelaunch.err - amazon-web-services

We are running a Spark job that runs close to 30 scripts one by one. It usually takes 14-15 hours to run, but this time it failed after 13 hours. The details are below:
Command: spark-submit --executor-memory=80g --executor-cores=5 --conf spark.sql.shuffle.partitions=800 run.py
Setup: Running Spark jobs via Jenkins on AWS EMR with 16 spot nodes
Error: Since the YARN log is huge (270 MB+), below are some extracts from it:
[2022-07-25 04:50:08.646]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
ermediates/master/email/_temporary/0/_temporary/attempt_202207250435265404741257029168752_0641_m_000599_168147 s3://memberanalytics-data-out-prod/pipelined_intermediates/master/email/_temporary/0/task_202207250435265404741257029168752_0641_m_000599 using algorithm version 1
22/07/25 04:37:05 INFO FileOutputCommitter: Saved output of task 'attempt_202207250435265404741257029168752_0641_m_000599_168147' to s3://memberanalytics-data-out-prod/pipelined_intermediates/master/email/_temporary/0/task_202207250435265404741257029168752_0641_m_000599
22/07/25 04:37:05 INFO SparkHadoopMapRedUtil: attempt_202207250435265404741257029168752_0641_m_000599_168147: Committed
22/07/25 04:37:05 INFO Executor: Finished task 599.0 in stage 641.0 (TID 168147). 9341 bytes result sent to driver
22/07/25 04:49:36 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Driver ip-10-13-52-109.bjw2k.asg:45383 disassociated! Shutting down.
22/07/25 04:49:36 INFO MemoryStore: MemoryStore cleared
22/07/25 04:49:36 INFO BlockManager: BlockManager stopped
22/07/25 04:50:06 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
22/07/25 04:50:06 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.InterruptedException
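Because the full YARN log is too large to read end to end, one option is to stream it and keep only the WARN and ERROR lines. The following is a minimal sketch, assuming the log uses the "yy/MM/dd HH:mm:ss LEVEL Source: message" layout seen in the extract above. The name iter_problem_lines and the sample list are illustrative and are not part of the original job.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Matches lines such as
# "22/07/25 04:49:36 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting ..."
LOG_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<source>[^:]+): (?P<message>.*)$"
)


def iter_problem_lines(
    lines: Iterable[str], levels: Tuple[str, ...] = ("ERROR", "WARN")
) -> Iterator[Dict[str, str]]:
    """Yield parsed log entries whose level is in `levels`.

    Lines that do not start with a timestamp (stack frames, exception
    names) are skipped, so memory use stays constant for huge files.
    """
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            yield match.groupdict()


# Sample taken from the extract above (not the full log).
sample = [
    "22/07/25 04:37:05 INFO Executor: Finished task 599.0 in stage 641.0 (TID 168147). 9341 bytes result sent to driver",
    "22/07/25 04:49:36 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Driver ip-10-13-52-109.bjw2k.asg:45383 disassociated! Shutting down.",
    "22/07/25 04:49:36 INFO MemoryStore: MemoryStore cleared",
    "22/07/25 04:50:06 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException",
    "java.util.concurrent.TimeoutException",
    "22/07/25 04:50:06 ERROR Utils: Uncaught exception in thread shutdown-hook-0",
]

problems = list(iter_problem_lines(sample))
for entry in problems:
    print(entry["ts"], entry["level"], entry["source"])
```

To run it against a saved log, pass an open file handle instead of the sample list; the file is read line by line rather than loaded whole. In this extract the earliest problem line is the executor reporting that the driver disassociated, so the driver-side log around 04:49 is probably the more informative place to look.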

Related

Grunt - Mapreduce Mode: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 8 time(s); retry policy is RetryUpToMaximum

I'm running Apache Pig version 0.17.0 in mapreduce mode to simply dump a few lines of text data from a file on HDFS (Hadoop 2.7.2).
When I execute the dump command, execution is very slow, although it does complete. I see some failures during execution, shown below:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1589604570386_0002]
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1589604570386_0002]
[main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
[main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
[main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
[main] WARN org.apache.pig.tools.pigstats.mapreduce.MRJobStats - Failed to get map task report
java.io.IOException: java.net.ConnectException: Call From localhost/127.0.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:343)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:428)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:572)
at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:184)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.getTaskReports(MRJobStats.java:528)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addMapReduceStatistics(MRJobStats.java:355)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:232)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:164)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:379)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:290)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1475)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1460)
at org.apache.pig.PigServer.storeEx(PigServer.java:1119)
at org.apache.pig.PigServer.store(PigServer.java:1082)
at org.apache.pig.PigServer.openIterator(PigServer.java:995)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:782)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:383)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:564)
at org.apache.pig.Main.main(Main.java:175)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Is there a way to speed up the mapreduce job?
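One plausible reading of the log above: the job itself succeeds, and the time is lost afterwards while the client retries the connection to 0.0.0.0:10020 (up to 10 times with a 1 second sleep, per the retry policy in the log). Port 10020 is the default address of the MapReduce JobHistory server, so a first check is whether anything is listening there. The sketch below is a generic TCP probe, not a Pig or Hadoop API; the helper name port_is_open is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


# 10020 is the port that appears in the log above.
if port_is_open("127.0.0.1", 10020):
    print("Something is listening on port 10020")
else:
    print("Nothing is listening on port 10020; the JobHistory server may not be running")
```

If the probe fails, starting the history server (in Hadoop 2.x, typically with mr-jobhistory-daemon.sh start historyserver) is worth trying before tuning anything else.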

Sqoop Job Failing via Dataproc [duplicate]

I have submitted a Sqoop job via a GCP Dataproc cluster with the --as-avrodatafile argument set, but it fails with the error below:
/08/12 22:34:34 INFO impl.YarnClientImpl: Submitted application application_1565634426340_0021
19/08/12 22:34:34 INFO mapreduce.Job: The url to track the job: http://sqoop-gcp-ingest-mzp-m:8088/proxy/application_1565634426340_0021/
19/08/12 22:34:34 INFO mapreduce.Job: Running job: job_1565634426340_0021
19/08/12 22:34:40 INFO mapreduce.Job: Job job_1565634426340_0021 running in uber mode : false
19/08/12 22:34:40 INFO mapreduce.Job: map 0% reduce 0%
19/08/12 22:34:45 INFO mapreduce.Job: Task Id : attempt_1565634426340_0021_m_000000_0, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/08/12 22:34:50 INFO mapreduce.Job: Task Id : attempt_1565634426340_0021_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/08/12 22:34:55 INFO mapreduce.Job: Task Id : attempt_1565634426340_0021_m_000000_2, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/08/12 22:35:00 INFO mapreduce.Job: map 100% reduce 0%
19/08/12 22:35:01 INFO mapreduce.Job: Job job_1565634426340_0021 failed with state FAILED due to: Task failed task_1565634426340_0021_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
19/08/12 22:35:01 INFO mapreduce.Job: Counters: 11
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=4
Total time spent by all maps in occupied slots (ms)=41976
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=13992
Total vcore-milliseconds taken by all map tasks=13992
Total megabyte-milliseconds taken by all map tasks=42983424
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
19/08/12 22:35:01 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
19/08/12 22:35:01 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 30.5317 seconds (0 bytes/sec)
19/08/12 22:35:01 INFO mapreduce.ImportJobBase: Retrieved 0 records.
19/08/12 22:35:01 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@61baa894
19/08/12 22:35:01 ERROR tool.ImportTool: Import failed: Import job failed!
19/08/12 22:35:01 DEBUG manager.OracleManager$ConnCache: Caching released connection for jdbc:oracle:thin:@10.25.42.52:1521/uataca.aaamidatlantic.com/GCPREADER
Job output is complete
Without the --as-avrodatafile argument, it works fine.
To fix this issue you need to set mapreduce.job.classloader property value to true when submitting your job:
gcloud dataproc jobs submit hadoop --cluster="${CLUSTER_NAME}" \
--class="org.apache.sqoop.Sqoop" \
--properties="mapreduce.job.classloader=true" \
. . .
-- \
--as-avrodatafile \
. . .

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode. I'm now trying to run the same application on an AWS EMR cluster, but currently it's failing.
The message is one I've not seen before; it implies that the workers are not receiving jobs and are being shut down.
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 5)
16/11/30 14:45:01 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 7
16/11/30 14:45:01 INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 4)
The DAG shows the workers initialised, then a collect (one that is relatively small) and then shortly after they all fail.
Dynamic allocation was enabled, so one theory was that the driver wasn't sending the executors any tasks and they timed out. To test that theory I spun up another cluster without dynamic allocation, and the same thing happened.
The master is set to yarn.
Any help is massively appreciated, thanks.
16/11/30 14:49:16 INFO BlockManagerMaster: Removal of executor 21 requested
16/11/30 14:49:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
16/11/30 14:49:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My step is quite simple - spark-submit --deploy-mode client --master yarn --class Run app.jar
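The logs above mix two different events: executors removed by dynamic allocation for being idle, and YARN containers that failed at launch with exit status 1. Separating them makes it easier to see which one persists when dynamic allocation is switched off. This is a rough sketch that assumes the message wording shown in the logs above; the function name summarize_executor_events is hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# "Removing executor 3 because it has been idle for 60 seconds ..."
IDLE_REMOVAL = re.compile(
    r"Removing executor (?P<executor>\d+) because it has been idle "
    r"for (?P<seconds>\d+) seconds"
)
# "Container marked as failed: <id> on host: <host>. Exit status: 1. ..."
CONTAINER_FAILED = re.compile(
    r"Container marked as failed: (?P<container>\S+) on host: "
    r"(?P<host>\S+)\. Exit status: (?P<status>-?\d+)"
)


def summarize_executor_events(lines: Iterable[str]) -> Dict[str, List[Tuple]]:
    """Group log lines into idle removals and failed containers."""
    idle_removed: List[Tuple[int, int]] = []
    containers_failed: List[Tuple[str, str, int]] = []
    for line in lines:
        match = IDLE_REMOVAL.search(line)
        if match:
            idle_removed.append(
                (int(match.group("executor")), int(match.group("seconds")))
            )
            continue
        match = CONTAINER_FAILED.search(line)
        if match:
            containers_failed.append(
                (match.group("container"), match.group("host"), int(match.group("status")))
            )
    return {"idle_removed": idle_removed, "containers_failed": containers_failed}


# Lines copied from the logs above.
sample = [
    "16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)",
    "16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)",
    "16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.",
]

summary = summarize_executor_events(sample)
print(summary)
```

Idle removals are normal behaviour when dynamic allocation is on and the driver has no tasks to hand out. The container failure is the part that still needs explaining, and the failed container's own stderr usually says more than the driver-side stack trace.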

toPandas() works from Jupyter IPython Notebook but fails on submit - AWS EMR

I have a program that:
1. Reads some data
2. Performs some operations
3. Saves a CSV file
4. Transfers that file to an FTP server
I am using an Amazon EMR cluster and PySpark to accomplish this task.
For step 4, I need to save the CSV on local storage and not on HDFS. For this purpose, I convert the Spark DataFrame to a pandas DataFrame.
A snippet could be:
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.types import StructType, StructField, LongType, StringType
from pyspark.mllib.evaluation import *
from pyspark.sql.functions import *
from pyspark.sql import Row
from time import time
import timeit
from datetime import datetime, timedelta
import numpy as np
import random as rand
import pandas as pd
from itertools import combinations, permutations
from collections import defaultdict
from ftplib import FTP
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("Recommendation").set('spark.driver.memory', '8G').set('spark.executor.memory', '4G')
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
readRdd = sqlContext.read.format('com.databricks.spark.csv').load('s3n://my-bucket/myfile' + path)
df = readRdd.toPandas() # <---------- PROBLEM
print('toPandas() completed')
df.to_csv('./myFile')
The problem is:
When I run this code from a Jupyter IPython notebook on the same cluster, it works like a charm. But when I run this code using spark-submit, or add it as a step to EMR, the code fails on the following line:
df = readRdd.toPandas()
'toPandas() completed' is never printed
In the spark job monitor, I can see that the toPandas() method gets executed but right after that I get the error.
16/10/10 13:17:47 INFO YarnAllocator: Driver requested a total number of 1 executor(s).
16/10/10 13:17:47 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:47 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:47 INFO TaskSetManager: Finished task 1462.0 in stage 17.0 (TID 10624) in 2089 ms on ip-172-31-38-70.eu-west-1.compute.internal (1515/1516)
16/10/10 13:17:47 INFO TaskSetManager: Finished task 1485.0 in stage 17.0 (TID 10647) in 2059 ms on ip-172-31-38-70.eu-west-1.compute.internal (1516/1516)
16/10/10 13:17:47 INFO YarnClusterScheduler: Removed TaskSet 17.0, whose tasks have all completed, from pool
16/10/10 13:17:47 INFO DAGScheduler: ResultStage 17 (toPandas at 20161007_RecPipeline.py:182) finished in 12.609 s
16/10/10 13:17:47 INFO DAGScheduler: Job 4 finished: toPandas at 20161007_RecPipeline.py:182, took 14.646644 s
16/10/10 13:17:47 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
16/10/10 13:17:47 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:47 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:50 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:50 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:53 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:53 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:56 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:56 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:59 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:59 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:02 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:02 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:05 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:05 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:08 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:08 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:11 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:11 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:14 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:14 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:17 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:17 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:20 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:20 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:23 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:23 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:26 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:26 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:29 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:29 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:32 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:32 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:35 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:35 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:36 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
16/10/10 13:18:36 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 13:18:36 INFO SparkUI: Stopped Spark web UI at http://172.31.37.28:45777
16/10/10 13:18:36 INFO YarnClusterSchedulerBackend: Shutting down all executors
16/10/10 13:18:36 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
16/10/10 13:18:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 13:18:36 ERROR PythonRDD: Error while sending iterator
java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:440)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:648)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:648)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:648)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649)
16/10/10 13:18:36 ERROR ApplicationMaster: User application exited with status 143
16/10/10 13:18:36 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
16/10/10 13:18:36 INFO MemoryStore: MemoryStore cleared
16/10/10 13:18:36 INFO BlockManager: BlockManager stopped
16/10/10 13:18:36 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 13:18:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 13:18:36 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/10/10 13:18:36 INFO SparkContext: Successfully stopped SparkContext
16/10/10 13:18:36 INFO ShutdownHookManager: Shutdown hook called
16/10/10 13:18:36 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt3/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-eab43d4e-7201-4bcb-8ee7-0e7b546e8fd8
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-1d88398f-ecd5-4d94-a42a-a406b3d566af/pyspark-34bec23c-a686-475d-85c9-9e9228b23239
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-1d88398f-ecd5-4d94-a42a-a406b3d566af
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt3/yarn/usercache/hadoop/appcache/application_1476100925559_0002/container_1476100925559_0002_01_000001/tmp/spark-96cdee47-e3f3-45f4-8bc7-0df5928ef53c
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt2/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-f6821ea1-6f37-4cc6-8bba-049ac0215786
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt1/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-1827cae8-8a60-4b29-a4e5-368a8e1856fd
My cluster configuration looks like:
spark-defaults spark.driver.maxResultSize 8G
spark-defaults spark.driver.memory 8G
spark-defaults spark.executor.memory 4G
The Spark Submit command looks like:
spark-submit --deploy-mode cluster s3://my-bucket/myPython.py
This is killing me! Someone please give me any pointers to what direction I may look at?
Here was the problem:
spark-submit --deploy-mode cluster s3://my-bucket/myPython.py
In the above command, the deploy mode is set to cluster, which means a node is chosen from the core nodes to run the driver program. Since the allowed driver memory is 8G and the core nodes were smaller physical instances, they would always run out of the required memory.
The solution was to deploy in client mode, where the driver always runs on the master node (a bigger physical instance with more resources in my case) and so does not run out of memory during the process.
Since it was a dedicated cluster, this solution worked in my case.
In case of a shared cluster where deploy mode must be cluster, using bigger instances should work.
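The memory argument above can be checked with a little arithmetic. In cluster mode the driver runs inside a YARN container, which must hold the driver memory plus an overhead that Spark documents as defaulting to 10% of the driver memory, with a 384 MiB floor. The sketch below assumes those defaults and ignores YARN's rounding to its allocation increment. The node capacity in the example is a hypothetical value, since the instance sizes are not given above.

```python
# Assumed Spark defaults for driver memory overhead on YARN.
OVERHEAD_FACTOR = 0.10
MIN_OVERHEAD_MB = 384


def driver_container_mb(driver_memory_mb: int) -> int:
    """Memory YARN must grant for the driver: heap plus overhead."""
    overhead_mb = max(int(driver_memory_mb * OVERHEAD_FACTOR), MIN_OVERHEAD_MB)
    return driver_memory_mb + overhead_mb


def driver_fits(driver_memory_mb: int, node_yarn_memory_mb: int) -> bool:
    """True if the driver container fits in the memory YARN can use on one node."""
    return driver_container_mb(driver_memory_mb) <= node_yarn_memory_mb


driver_mb = 8 * 1024            # spark.driver.memory 8G, as configured above
hypothetical_core_node_mb = 6144  # illustrative only; not from the original post

print("driver container needs", driver_container_mb(driver_mb), "MB")
print("fits on core node:", driver_fits(driver_mb, hypothetical_core_node_mb))
```

With an 8G driver the container request comes to roughly 9 GB, so any core node whose YARN memory is below that cannot host the driver in cluster mode. The real limit on a given cluster is the value of yarn.nodemanager.resource.memory-mb on the core nodes.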

Hadoop MapReduce Job Hangs

I am trying to simulate a Hadoop environment using the latest Hadoop version 2.6.0 and Java SDK 1.70 on my Ubuntu desktop. I configured Hadoop with the necessary environment parameters, and all its processes are up and running, as the following jps command shows:
nandu@nandu-Desktop:~$ jps
2810 NameNode
3149 SecondaryNameNode
3416 NodeManager
3292 ResourceManager
2966 DataNode
4805 Jps
I can also see the above information, plus the DFS files, through the Firefox browser. However, when I try to run a simple WordCount MapReduce job, it hangs and produces neither output nor any error message. After a while I killed the process using the "hadoop job -kill " command. Can you please guide me in finding the cause of this issue and how to resolve it? Below is the console output from job start to kill (end).
If you need additional information, please let me know.
Your help will be highly appreciated.
Thanks,
===================================================================
nandu@nandu-Desktop:~/dev$ hadoop jar wc.jar WordCount /user/nandu/input /user/nandu/output
15/02/27 10:35:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/27 10:35:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/27 10:35:21 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/02/27 10:35:21 INFO input.FileInputFormat: Total input paths to process : 2
15/02/27 10:35:21 INFO mapreduce.JobSubmitter: number of splits:2
15/02/27 10:35:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1425048764581_0003
15/02/27 10:35:22 INFO impl.YarnClientImpl: Submitted application application_1425048764581_0003
15/02/27 10:35:22 INFO mapreduce.Job: The url to track the job: http://nandu-Desktop:8088/proxy/application_1425048764581_0003/
15/02/27 10:35:22 INFO mapreduce.Job: Running job: job_1425048764581_0003
==================== at this point the job was killed ===================
15/02/27 10:38:23 INFO mapreduce.Job: Job job_1425048764581_0003 running in uber mode : false
15/02/27 10:38:23 INFO mapreduce.Job: map 0% reduce 0%
15/02/27 10:38:23 INFO mapreduce.Job: Job job_1425048764581_0003 failed with state KILLED due to: Application killed by user.
15/02/27 10:38:23 INFO mapreduce.Job: Counters: 0
I encountered a similar problem while running the MapReduce sample provided in the Hadoop package. In my case it was hanging due to low disk space on my VM (only about 1.5 GB was free). When I freed some disk space, it ran fine. Also, please check that the other system resource requirements are fulfilled.
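One common way low disk space leads to a silent hang is the NodeManager's disk health check: once a disk passes the configured utilization limit (yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, 90% by default as far as I know), the node is marked unhealthy and the job waits for containers that never arrive. The sketch below only approximates that check from the outside; the helper names and the choice of path are assumptions, and the directories that actually matter are the ones configured for the NodeManager.

```python
import shutil

# Assumed default for the YARN disk health checker threshold.
DEFAULT_MAX_UTILIZATION_PERCENT = 90.0


def disk_utilization_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def exceeds_threshold(
    used_bytes: int,
    total_bytes: int,
    threshold_percent: float = DEFAULT_MAX_UTILIZATION_PERCENT,
) -> bool:
    """True if used/total is above the threshold (pure function, easy to test)."""
    return 100.0 * used_bytes / total_bytes > threshold_percent


usage = shutil.disk_usage("/")
print(f"utilization: {disk_utilization_percent('/'):.1f}%")
print("over threshold:", exceeds_threshold(usage.used, usage.total))
```

If this reports a value over the threshold, freeing space as described above is the simplest fix. The ResourceManager web UI on port 8088 also lists unhealthy nodes and the reason they were marked.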