Oozie MapReduce launcher job does not finish or start the new Pig job - mapreduce

I am trying to run Pig from Oozie.
When the Oozie job creates the Pig job, the MapReduce launcher job never finishes, so the Pig job never starts; it has been running for 6 hours and is still running.
Pig script
A = load '$INPUT' using PigStorage(',') as (id:int, name:chararray, age:int, safary:int);
store A into '$OUTPUT' USING PigStorage(',');
job.properties
nameNode=hdfs://master:9000
jobTracker=master:8032
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
oozie.libpath=/user/root/share/lib/lib_20170611141905
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/pig
workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="pig-wf">
<start to="pig-node"/>
<action name="pig-node">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/pig_out"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
</configuration>
<script>id.pig</script>
<param>INPUT=/testHive.txt</param>
<param>OUTPUT=/user/${wf:user()}/${examplesRoot}/pig_out</param>
</pig>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
I don't see any errors in the log files.
The NodeManager log keeps repeating the same message (over 4000 lines so far, and still growing):
2017-06-12 14:22:07,334 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 65.9 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2017-06-12 14:22:10,346 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 65.9 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2017-06-12 14:22:13,361 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 65.9 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2017-06-12 14:22:16,369 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 65.9 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2017-06-12 14:22:19,382 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 66.2 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2017-06-12 14:22:22,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 66.2 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2017-06-12 14:22:25,399 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 66.6 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
2017-06-12 14:22:28,409 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 3010 for container-id container_1497203056582_0017_01_000001: 67.1 MB of 2 GB physical memory used; 3.3 GB of 4.2 GB virtual memory used
ResourceManager log file
2017-06-12 13:55:49,493 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for root (auth:TOKEN)
2017-06-12 13:55:49,496 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1497203056582_0017_000001 (auth:SIMPLE)
2017-06-12 13:58:00,204 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for root (auth:TOKEN)
2017-06-12 13:59:19,231 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Activating next master key with id: -52240461
2017-06-12 13:59:19,292 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Activating next master key with id: -463771108
2017-06-12 13:59:19,717 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Activating next master key with id: 1973280870
2017-06-12 14:00:01,821 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for root (auth:TOKEN)
2017-06-12 14:00:40,402 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for root (auth:TOKEN)
2017-06-12 14:08:31,967 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for root (auth:TOKEN)
2017-06-12 14:09:08,837 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=root, renewer=oozie mr token, realUser=root, issueDate=1497290948803, maxDate=1497895748803, sequenceNumber=98, masterKeyId=3, currentKey: 3
2017-06-12 14:09:08,968 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing RMDelegation token with sequence number: 98
2017-06-12 14:09:37,379 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=root, renewer=oozie mr token, realUser=root, issueDate=1497290977379, maxDate=1497895777379, sequenceNumber=99, masterKeyId=3, currentKey: 3
2017-06-12 14:09:37,424 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing RMDelegation token with sequence number: 99
2017-06-12 14:20:49,577 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for root (auth:TOKEN)
2017-06-12 14:20:56,681 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=root, renewer=oozie mr token, realUser=root, issueDate=1497291656681, maxDate=1497896456681, sequenceNumber=100, masterKeyId=3, currentKey: 3
2017-06-12 14:20:56,698 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing RMDelegation token with sequence number: 100
2017-06-12 14:21:17,592 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=root, renewer=oozie mr token, realUser=root, issueDate=1497291677592, maxDate=1497896477592, sequenceNumber=101, masterKeyId=3, currentKey: 3
2017-06-12 14:21:17,592 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing RMDelegation token with sequence number: 101


0 bytes of shared 0 bytes (100%) used in Google Workspace

I started using Google Colab through a GCE VM, but I have not been able to initialize or upload any notebooks.
The error I am getting is "The user has exceeded their Drive storage quota", but when I look at my storage, I have 15 GB available: 0 bytes of 15 GB used.
Trying to fix it, I logged in as the admin account and went to Storage, and what it shows is "0 bytes of shared 0 bytes (100%)", in red. So I believe something is wrong with the storage allocation or the storage quota.
The virtual machine was launched with 200 GB of disk space, so I cannot see the issue.
Full error when trying to upload a notebook:
The user has exceeded their Drive storage quota
GapiError: The user has exceeded their Drive storage quota
at YI.YA [as constructor] (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:1433:2101)
at YI.iH [as constructor] (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:2270:222)
at new YI (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:2334:151)
at JGa (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:2396:348)
at xa.program_ (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:2410:131)
at za (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:21:57)
at xa.throw_ (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:20:201)
at Aa.throw (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:22:89)
at c (https://ssl.gstatic.com/colaboratory-static/common/456ec723e499489081e770703ac07ab8/external_polymer_binary.js:22:343)
I have not found a way to troubleshoot or resolve it. Nothing broken shows up in the logs; they suggest everything is working as intended.

Memory usage optimization: High JVM memory but low execution and storage memory?

I'm running a Spark application. After the application finishes, I check the executor section in the Spark log:
The first row is the driver and the second row is the executor. From my understanding (please correct me if I am wrong), the on-heap memory of the executor is mainly divided into 3 parts:
Reserved memory: memory reserved for the system, used to store Spark's internal objects, around 300 MB.
User memory: memory for user-defined data structures / functions / metadata etc.
Spark memory: memory shared between storage and execution.
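That split can be sketched with a small calculation. This is a simplified model of Spark's unified memory manager, assuming the default spark.memory.fraction (0.6) and spark.memory.storageFraction (0.5); these are defaults rather than values taken from this application's config:

```python
# Approximate Spark unified memory model for one executor heap.
RESERVED_MB = 300  # fixed memory reserved for Spark's internal objects


def unified_memory_split(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap into user / spark / storage / execution parts."""
    usable = heap_mb - RESERVED_MB
    spark_mem = usable * memory_fraction    # shared by execution + storage
    user_mem = usable - spark_mem           # user data structures, UDFs, metadata
    storage = spark_mem * storage_fraction  # soft boundary, execution can borrow
    execution = spark_mem - storage
    return {"user": user_mem, "spark": spark_mem,
            "storage": storage, "execution": execution}


# The 8g executor from this question:
split = unified_memory_split(8 * 1024)
```

For an 8g heap this yields roughly 4735 MB of unified Spark memory and roughly 3157 MB of user memory, under those default fractions.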
If this is correct, I don't understand why the peak JVM on-heap memory of the executor is so high (~6.27 GiB in both the Spark log and the utilization log I check in Grafana), even though the peak execution and storage on-heap memory of the executor are low and there is no big user-defined class or UDF in the application.
Back to my questions:
Is my understanding of the on-heap memory correct?
If it is, why is the peak JVM on-heap memory so high?
How can I optimize memory in this case? It seems that neither execution nor storage memory is high.
Thank you so much for your help.
P.S.: I am using Spark 3.2.1 and Delta Lake 1.2.0 on K8s deployed on EC2, with 2 instances of 8 cores and 16 GB RAM each: 1 instance for the driver and 1 for the executor. The driver uses 1 core and 4g of memory; the executor uses 5 cores and 8g of memory.
I found out that this peak JVM on-heap memory varies with the driver and executor memory configuration, although I still can't find the relationship or why the peak is so high.
In fact the transformation doesn't require such high memory: when you lower the memory resources of the Spark application, its peak JVM on-heap memory is lower too.

Warning results in failure when reading from AWS S3 bucket

I am doing a simple inner join between two tables, but I keep getting the warning shown below. I saw in other posts that it is OK to ignore the warning, but my jobs end in failure and do not progress.
The tables are pretty big (12 billion rows), but I am only adding three columns from one table to the other.
When I reduce the dataset to a few million rows and run the script in an Amazon SageMaker Jupyter notebook, it works fine. But when I run it on the EMR cluster for the entire dataset, it fails. I even ran the specific snappy partition that it seemed to fail on, and it worked in SageMaker.
The job has no problem reading from one of the tables; it is the other table that seems to cause the problem:
INFO FileScanRDD: Reading File path:
s3a://path/EES_FD_UVA_HIST/date=2019-10-14/part-00056-ddb83da5-2e1b-499d-a52a-cad16e21bd2c-c000.snappy.parquet,
range: 0-102777097, partition values: [18183] 20/04/06 15:51:58 WARN
S3AbortableInputStream: Not all bytes were read from the
S3ObjectInputStream, aborting HTTP connection. This is likely an error
and may result in sub-optimal behavior. Request only the bytes you
need via a ranged GET or drain the input stream after use. 20/04/06
15:51:58 WARN S3AbortableInputStream: Not all bytes were read from the
S3ObjectInputStream, aborting HTTP connection. This is likely an error
and may result in sub-optimal behavior. Request only the bytes you
need via a ranged GET or drain the input stream after use. 20/04/06
15:52:03 INFO CoarseGrainedExecutorBackend: Driver commanded a
shutdown 20/04/06 15:52:03 INFO MemoryStore: MemoryStore cleared
20/04/06 15:52:03 INFO BlockManager: BlockManager stopped 20/04/06
15:52:03 INFO ShutdownHookManager: Shutdown hook called
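The warning's two suggested remedies, a ranged GET or draining the stream, can be illustrated with a small sketch. The helper below only builds the Range header value; the boto3 call shown in the comment uses hypothetical bucket and key names:

```python
def byte_range(start, length):
    """Range header value for an HTTP ranged GET (end offset is inclusive)."""
    return "bytes={}-{}".format(start, start + length - 1)


# With boto3 this would read only the slice of the object that is needed
# (hypothetical bucket/key):
#   s3.get_object(Bucket="my-bucket", Key="part-00056.snappy.parquet",
#                 Range=byte_range(0, 1024))
# The alternative the warning mentions is draining: call read() until the
# stream is exhausted before closing, so the HTTP connection can be reused
# instead of aborted.
print(byte_range(0, 1024))
```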
This is my code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
uvalim=spark.read.parquet("s3://path/UVA_HIST_WITH_LIMITS")
uvaorg=spark.read.parquet("s3a://path/EES_FD_UVA_HIST")
config=uvalim.select('SEQ_ID','TOOL_ID', 'DATE' ,'UL','LL')
uva=uvaorg.select('SEQ_ID', 'TOOL_ID', 'TIME_STAMP', 'RUN_ID', 'TARGET', 'LOWER_CRITICAL', 'UPPER_CRITICAL', 'RESULT', 'STATUS')
uva_config=uva.join(config, on=['SEQ_ID','TOOL_ID'], how='inner')
uva_config.write.mode("overwrite").parquet("s3a://path/Uvaconfig.parquet")
Is there a way to debug this?
Update: Based on Emerson's suggestion, I ran it with the debug log. It ran for 9 hours with a Fail before I killed the YARN application.
For some reason the stderr did not have much output.
This is the stderr output:
SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found
binding in
[jar:file:/mnt/yarn/usercache/hadoop/filecache/301/__spark_libs__1712836156286367723.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation. SLF4J: Actual binding is of type
[org.slf4j.impl.Log4jLoggerFactory] 20/04/07 05:04:13 INFO
CoarseGrainedExecutorBackend: Started daemon with process name:
5653#ip-10-210-13-51 20/04/07 05:04:13 INFO SignalUtils: Registered
signal handler for TERM 20/04/07 05:04:13 INFO SignalUtils: Registered
signal handler for HUP 20/04/07 05:04:13 INFO SignalUtils: Registered
signal handler for INT 20/04/07 05:04:15 INFO SecurityManager:
Changing view acls to: yarn,hadoop 20/04/07 05:04:15 INFO
SecurityManager: Changing modify acls to: yarn,hadoop 20/04/07
05:04:15 INFO SecurityManager: Changing view acls groups to: 20/04/07
05:04:15 INFO SecurityManager: Changing modify acls groups to:
20/04/07 05:04:15 INFO SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view
permissions: Set(yarn, hadoop); groups with view permissions: Set();
users with modify permissions: Set(yarn, hadoop); groups with modify
permissions: Set() 20/04/07 05:04:15 INFO TransportClientFactory:
Successfully created connection to
ip-10-210-13-51.ec2.internal/10.210.13.51:35863 after 168 ms (0 ms
spent in bootstraps) 20/04/07 05:04:16 INFO SecurityManager: Changing
view acls to: yarn,hadoop 20/04/07 05:04:16 INFO SecurityManager:
Changing modify acls to: yarn,hadoop 20/04/07 05:04:16 INFO
SecurityManager: Changing view acls groups to: 20/04/07 05:04:16 INFO
SecurityManager: Changing modify acls groups to: 20/04/07 05:04:16
INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(yarn, hadoop); groups
with view permissions: Set(); users with modify permissions:
Set(yarn, hadoop); groups with modify permissions: Set() 20/04/07
05:04:16 INFO TransportClientFactory: Successfully created connection
to ip-10-210-13-51.ec2.internal/10.210.13.51:35863 after 20 ms (0 ms
spent in bootstraps) 20/04/07 05:04:16 INFO DiskBlockManager: Created
local directory at
/mnt1/yarn/usercache/hadoop/appcache/application_1569338404918_1241/blockmgr-2adfe133-fd28-4f25-95a4-2ac1348c625e
20/04/07 05:04:16 INFO DiskBlockManager: Created local directory at
/mnt/yarn/usercache/hadoop/appcache/application_1569338404918_1241/blockmgr-3620ceea-8eee-42c5-af2f-6975c894b643
20/04/07 05:04:17 INFO MemoryStore: MemoryStore started with capacity
3.8 GB 20/04/07 05:04:17 INFO CoarseGrainedExecutorBackend: Connecting to driver:
spark://CoarseGrainedScheduler#ip-10-210-13-51.ec2.internal:35863
20/04/07 05:04:17 INFO CoarseGrainedExecutorBackend: Successfully
registered with driver 20/04/07 05:04:17 INFO Executor: Starting
executor ID 1 on host ip-10-210-13-51.ec2.internal 20/04/07 05:04:18
INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port
34073. 20/04/07 05:04:18 INFO NettyBlockTransferService: Server created on ip-10-210-13-51.ec2.internal:34073 20/04/07 05:04:18 INFO
BlockManager: Using
org.apache.spark.storage.RandomBlockReplicationPolicy for block
replication policy 20/04/07 05:04:18 INFO BlockManagerMaster:
Registering BlockManager BlockManagerId(1,
ip-10-210-13-51.ec2.internal, 34073, None) 20/04/07 05:04:18 INFO
BlockManagerMaster: Registered BlockManager BlockManagerId(1,
ip-10-210-13-51.ec2.internal, 34073, None) 20/04/07 05:04:18 INFO
BlockManager: external shuffle service port = 7337 20/04/07 05:04:18
INFO BlockManager: Registering executor with local external shuffle
service. 20/04/07 05:04:18 INFO TransportClientFactory: Successfully
created connection to ip-10-210-13-51.ec2.internal/10.210.13.51:7337
after 19 ms (0 ms spent in bootstraps) 20/04/07 05:04:18 INFO
BlockManager: Initialized BlockManager: BlockManagerId(1,
ip-10-210-13-51.ec2.internal, 34073, None) 20/04/07 05:04:20 INFO
CoarseGrainedExecutorBackend: Got assigned task 0 20/04/07 05:04:20
INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 20/04/07 05:04:21
INFO TorrentBroadcast: Started reading broadcast variable 0 20/04/07
05:04:21 INFO TransportClientFactory: Successfully created connection
to ip-10-210-13-51.ec2.internal/10.210.13.51:38181 after 17 ms (0 ms
spent in bootstraps) 20/04/07 05:04:21 INFO MemoryStore: Block
broadcast_0_piece0 stored as bytes in memory (estimated size 39.4 KB,
free 3.8 GB) 20/04/07 05:04:21 INFO TorrentBroadcast: Reading
broadcast variable 0 took 504 ms 20/04/07 05:04:22 INFO MemoryStore:
Block broadcast_0 stored as values in memory (estimated size 130.2 KB,
free 3.8 GB) 20/04/07 05:04:23 INFO CoarseGrainedExecutorBackend:
eagerFSInit: Eagerly initialized FileSystem at s3://does/not/exist in
5155 ms 20/04/07 05:04:25 INFO Executor: Finished task 0.0 in stage
0.0 (TID 0). 53157 bytes result sent to driver 20/04/07 05:04:25 INFO CoarseGrainedExecutorBackend: Got assigned task 2 20/04/07 05:04:25
INFO Executor: Running task 2.0 in stage 0.0 (TID 2) 20/04/07 05:04:25
INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 53114 bytes
result sent to driver 20/04/07 05:04:25 INFO
CoarseGrainedExecutorBackend: Got assigned task 3 20/04/07 05:04:25
INFO Executor: Running task 3.0 in stage 0.0 (TID 3) 20/04/07 05:04:25
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM 20/04/07
05:04:25 INFO DiskBlockManager: Shutdown hook called 20/04/07 05:04:25
INFO ShutdownHookManager: Shutdown hook called
Can you switch to using s3 instead of s3a? I believe s3a is not recommended for use on EMR. Additionally, you can run your job in debug mode:
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Read the document below, which discusses s3a:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
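Switching schemes is just a URI rewrite; as a small illustrative sketch (the helper name is made up, not part of any API):

```python
def to_emr_uri(uri):
    """Rewrite an s3a:// URI to the s3:// scheme recommended on EMR."""
    if uri.startswith("s3a://"):
        return "s3://" + uri[len("s3a://"):]
    return uri


# e.g. the reads in the question would become:
#   spark.read.parquet(to_emr_uri("s3a://path/EES_FD_UVA_HIST"))
print(to_emr_uri("s3a://path/EES_FD_UVA_HIST"))
```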
So after troubleshooting with the debug logs, I came to the conclusion that it was indeed a memory issue.
The cluster I was using was running out of memory after loading a few days' worth of data. Each day was about 2 billion rows.
So I tried splitting my script to process one day at a time, which it seemed to be able to handle.
However, when handling some days where the data was slightly larger (7 billion rows), it gave me a
executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
error. This post by Jumpman solved the problem by simply extending the spark.dynamicAllocation.executorIdleTimeout value.
So thank you @Emerson and @Jumpman!

emr-6.0.0-beta2 HiveLLAP low vCore allocation and utilization

I have a 21-node Hive LLAP EMR cluster.
The Hive LLAP daemons are not consuming the available cluster vCPU allocation:
160 vCores are available for YARN, but only 1 vCore is used per LLAP daemon.
Each node has 64 GB of memory and 8 vCores. Each node runs 1 LLAP daemon, which is allocated 70% of the memory BUT ONLY 1 vCore.
Some of the properties:
yarn.nodemanager.resource.cpu-vcores=8;
yarn.scheduler.minimum-allocation-vcores=1;
yarn.scheduler.maximum-allocation-vcores=128;
hive.llap.daemon.vcpus.per.instance=4;
hive.llap.daemon.num.executors=4;
Why isn't the daemon allocated more than 1 vCore?
Will the executors be able to use the available vCores, or can they only use the 1 vCore allocated to the daemon?
If you are seeing this in the YARN UI, you probably have to add this:
yarn.scheduler.capacity.resource-calculator: org.apache.hadoop.yarn.util.resource.DominantResourceCalculator
I had the same confusion. While using DefaultResourceCalculator, the YARN UI only accounts for memory usage; behind the scenes the daemon may have been using more than 1 core, but you will see only 1 core used. DominantResourceCalculator, on the other hand, accounts for both cores and memory when allocating resources and shows the actual number of cores and memory used.
You can enable Ganglia or check the EMR metrics for more details.
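The difference between the two calculators can be sketched numerically; the 4096 MB / 4-vCore container below is an invented example, not taken from the question:

```python
def default_share(mem_mb, cluster_mem_mb):
    """DefaultResourceCalculator: resource usage measured by memory only."""
    return mem_mb / cluster_mem_mb


def dominant_share(mem_mb, vcores, cluster_mem_mb, cluster_vcores):
    """DominantResourceCalculator: the larger of the memory and vCore ratios."""
    return max(mem_mb / cluster_mem_mb, vcores / cluster_vcores)


# Invented container on one 64 GB / 8-vCore node:
mem_only = default_share(4096, 64 * 1024)         # memory-only view: 0.0625
dominant = dominant_share(4096, 4, 64 * 1024, 8)  # vCores dominate: 0.5
```

Under the memory-only view the container looks tiny even though it holds half the node's vCores, which matches the "only 1 vCore shown" confusion described above.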

Read data effectively in AWS EMR Spark cluster

I am trying to get acquainted with Amazon's big data tools, and I want to preprocess data from S3 to eventually use it for machine learning.
I am struggling to understand how to read data efficiently into an AWS EMR Spark cluster.
I have a Scala script which takes a lot of time to run; most of that time is taken up by Spark's explode+pivot on my data and then by writing to file with Spark-CSV.
But even reading the raw data files takes too much time, in my view.
Then I created a script that only reads data with sqlContext.read.json() from 4 different folders (data sizes of 0.18 MB, 0.14 MB, 0.0003 MB and 399.9 MB respectively). I used System.currentTimeMillis() before and after each read call to see how much time it takes, and with 4 different instance settings the results were the following:
m1.medium (1) | m1.medium (4) | c4.xlarge (1) | c4.xlarge (4)
1. folder 00:00:34.042 | 00:00:29.136 | 00:00:07.483 | 00:00:06.957
2. folder 00:00:04.980 | 00:00:04.935 | 00:00:01.928 | 00:00:01.542
3. folder 00:00:00.909 | 00:00:00.673 | 00:00:00.455 | 00:00:00.409
4. folder 00:04:13.003 | 00:04:02.575 | 00:03:05.675 | 00:02:46.169
The number after the instance type indicates how many nodes were used: 1 is master only, and 4 is one master plus 3 slaves of the same type.
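The timing method described above (a wall-clock timestamp before and after each read) can be sketched in Python; the call to sum here is just a placeholder for the real sqlContext.read.json() call:

```python
import time


def timed(fn, *args):
    """Return (result, elapsed_seconds), mirroring the measurements in the table."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Placeholder workload instead of sqlContext.read.json("s3://..."):
result, elapsed = timed(sum, range(1000))
```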
Firstly, it is odd that reading the first two similarly sized folders takes different amounts of time.
But still, how can it take whole seconds to read less than 1 MB of data?
I had 1800 MB of data a few days ago, and my job with the data-processing script on c4.xlarge (4 nodes) ran for 1.5 h before it failed with this error:
controller log:
INFO waitProcessCompletion ended with exit code 137 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 4870 seconds
2016-07-01T11:50:38.920Z INFO Step created jobs:
2016-07-01T11:50:38.920Z WARN Step failed with exitCode 137 and took 4870 seconds
stderr log:
16/07/01 11:50:35 INFO DAGScheduler: Submitting 24 missing tasks from ShuffleMapStage 4 (MapPartitionsRDD[21] at json at DataPreProcessor.scala:435)
16/07/01 11:50:35 INFO TaskSchedulerImpl: Adding task set 4.0 with 24 tasks
16/07/01 11:50:36 WARN TaskSetManager: Stage 4 contains a task of very large size (64722 KB). The maximum recommended task size is 100 KB.
16/07/01 11:50:36 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 5330, localhost, partition 0,PROCESS_LOCAL, 66276000 bytes)
16/07/01 11:50:36 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 5331, localhost, partition 1,PROCESS_LOCAL, 66441997 bytes)
16/07/01 11:50:36 INFO Executor: Running task 0.0 in stage 4.0 (TID 5330)
16/07/01 11:50:36 INFO Executor: Running task 1.0 in stage 4.0 (TID 5331)
Command exiting with ret '137'
This data doubled in size over the weekend. So if I am getting ~1 GB of new data each day now (and it will grow soon, and fast), then I will hit big-data sizes very soon and I really need an efficient way to read and process the data quickly.
How can I do that? Is there anything I am missing? I can upgrade my instances, but it does not seem normal to me that reading 0.2 MB of data with 4 c4.xlarge instances (4 vCPU, 16 ECU, 7.5 GiB memory) takes 7 seconds (even with automatic schema inference for ~200 JSON attributes).