AWS Glue Bookmarks - amazon-web-services

How do I verify that my bookmarks are working? I find that when I run the job immediately after the previous run finishes, it still seems to take a long time. Why is that? I thought it would not re-read the files it has already processed? The script looks like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://xxx-glue/testing-csv"], "recurse": True}, format = "csv", format_options = {"withHeader": True}, transformation_ctx="inputGDF")
if bool(inputGDF.toDF().head(1)):
    print("Writing ...")
    inputGDF.toDF() \
        .drop("createdat") \
        .drop("updatedat") \
        .write \
        .mode("append") \
        .partitionBy(["querydestinationplace", "querydatetime"]) \
        .parquet("s3://xxx-glue/testing-parquet")
else:
    print("Nothing to write ...")
job.commit()
import boto3
glue_client = boto3.client('glue', region_name='ap-southeast-1')
glue_client.start_crawler(Name='xxx-testing-partitioned')
The log looks like:
18/12/11 14:49:03 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:03 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
18/12/11 14:49:04 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:04 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
18/12/11 14:49:05 INFO Client: Application report for application_1544537674695_0001 (state: RUNNING)
18/12/11 14:49:05 DEBUG Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.2.72
ApplicationMaster RPC port: 0
queue: default
start time: 1544539297014
final status: UNDEFINED
tracking URL: http://ip-172-31-0-204.ap-southeast-1.compute.internal:20888/proxy/application_1544537674695_0001/
user: root
...
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-15_2018-11-19.csv:0+1194081
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-14_2018-11-18.csv' for reading
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-15_2018-11-19.csv' for reading
18/12/11 14:42:00 INFO Executor: Finished task 89.0 in stage 0.0 (TID 89). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 92
18/12/11 14:42:00 INFO Executor: Running task 92.0 in stage 0.0 (TID 92)
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-16_2018-11-20.csv:0+1137753
18/12/11 14:42:00 INFO Executor: Finished task 88.0 in stage 0.0 (TID 88). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 93
18/12/11 14:42:00 INFO Executor: Running task 93.0 in stage 0.0 (TID 93)
18/12/11 14:42:00 INFO NewHadoopRDD: Input split: s3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-17_2018-11-21.csv:0+1346626
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-16_2018-11-20.csv' for reading
18/12/11 14:42:00 INFO S3NativeFileSystem: Opening 's3://pinfare-glue/testing-csv/2018-09-25/DPS/2018-11-17_2018-11-21.csv' for reading
18/12/11 14:42:00 INFO Executor: Finished task 90.0 in stage 0.0 (TID 90). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO Executor: Finished task 91.0 in stage 0.0 (TID 91). 2088 bytes result sent to driver
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 94
18/12/11 14:42:00 INFO CoarseGrainedExecutorBackend: Got assigned task 95
18/12/11 14:42:00 INFO Executor: Running task 95.0 in stage 0.0 (TID 95)
18/12/11 14:42:00 INFO Executor: Running task 94.0 in stage 0.0 (TID 94)
... I notice the Parquet output is appended with a lot of duplicate data ... Is the bookmark not working? It's already enabled.

Bookmarking Requirements
From the docs
The job must be created with --job-bookmark-option job-bookmark-enable (or, if using the console, via the console options). The job must also have a job name; this will be passed in automatically.
Job must start with a job.init(jobname)
e.g.
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
The job must have a job.commit() to save the state of the bookmark, and it must finish successfully.
The data source must be either an S3 source or JDBC (limited, and not your use case, so I will ignore it).
The example in the docs creates a DynamicFrame from the (Glue/Lake Formation) catalog using the table name, not an explicit S3 path. This implies that reading from the catalog is still considered an S3 source; the underlying files will be on S3 (see the sketch after this list).
Files on S3 must be one of JSON, CSV, Apache Avro, or XML for version 0.9 and above; Parquet and ORC are also supported for version 1.0 and above.
The datasource in the script must have a transformation_ctx parameter.
The docs say
pass the transformation_ctx parameter only to those methods that you
want to enable bookmarks
You could add this to every transform to save state, but the critical one(s) are the data source(s) you want to bookmark.
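For illustration, a minimal sketch of a bookmark-friendly job that ties these requirements together; the catalog database and table names below are hypothetical placeholders, and it assumes the job was created with --job-bookmark-option job-bookmark-enable:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # the job name keys the bookmark state

# Reading from the catalog (an S3-backed table); transformation_ctx is what lets the bookmark track this source
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",          # hypothetical database name
    table_name="my_table",           # hypothetical table name
    transformation_ctx="datasource"
)

# ... transforms and writes go here ...

job.commit()  # persists the bookmark state for the next run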
Troubleshooting
From the docs
Max concurrency must be 1. Higher values break bookmarking
It also mentions job.commit() and using the transformation_ctx as above
For Amazon S3 input sources, job bookmarks check the last modified time of the objects, rather than the file names, to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
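If you want to sanity-check what Glue would see there, you can list the objects' last-modified times yourself; a rough sketch with boto3, using the bucket and prefix from your script:
import boto3

s3 = boto3.client('s3', region_name='ap-southeast-1')
paginator = s3.get_paginator('list_objects_v2')

# Print each CSV under the input prefix with its LastModified timestamp;
# anything modified since the last successful run will be read again.
for page in paginator.paginate(Bucket='xxx-glue', Prefix='testing-csv/'):
    for obj in page.get('Contents', []):
        print(obj['LastModified'], obj['Key'])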
Other things to check
Have you verified that your CSV files in the path "s3://xxx-glue/testing-csv" do not already contain duplicates? You could use a Glue crawler, or write DDL in Athena to create a table over them and look directly. Alternatively, create a dev endpoint and run a Zeppelin or SageMaker notebook to step through your code (a quick Spark check is sketched below).
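As one way to do that check from a notebook, a minimal sketch that compares total and distinct row counts on the input and output paths from your script; a gap between the two numbers indicates duplicate rows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Input CSVs
csv_df = spark.read.option("header", "true").csv("s3://xxx-glue/testing-csv")
print(csv_df.count(), csv_df.distinct().count())

# Output Parquet
parquet_df = spark.read.parquet("s3://xxx-glue/testing-parquet")
print(parquet_df.count(), parquet_df.distinct().count())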
The docs don't mention anywhere that editing your script resets the state. However, if you modify the transformation_ctx of the data source or other stages, that would likely impact the state, although I haven't verified that. The job has a job name which keys the state, along with the run number, attempt number and version number that are used to manage retries and the latest state. That implies minor changes to the script shouldn't affect the state as long as the job name is consistent, but again I haven't verified that.
As an aside, in your code you test inputGDF.toDF().head(1) and then call inputGDF.toDF() again to write the data. Spark is lazily evaluated, but in that case you convert the same DynamicFrame to a DataFrame twice, and Spark can't cache or reuse it. Better to do df = inputGDF.toDF() once before the if and then reuse df for both the check and the write.
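A minimal sketch of that change, using the same paths and columns as in your script:
df = inputGDF.toDF()  # convert once and reuse

if bool(df.head(1)):
    print("Writing ...")
    df.drop("createdat") \
      .drop("updatedat") \
      .write \
      .mode("append") \
      .partitionBy(["querydestinationplace", "querydatetime"]) \
      .parquet("s3://xxx-glue/testing-parquet")
else:
    print("Nothing to write ...")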

Please check this doc about the AWS Glue bookmarking mechanism.
Basically, it requires enabling bookmarks via the console (or CloudFormation) and specifying the transformation_ctx parameter, which is used together with some other attributes (like the job name and source file names) to save checkpointing information. If you change the value of one of these attributes, Glue will treat it as a different checkpoint.

https://docs.aws.amazon.com/glue/latest/dg/monitor-debug-multiple.html can be used to verify whether the bookmark is working or not.
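You can also inspect the stored bookmark state directly; a small sketch with boto3, where the job name is a placeholder for your own:
import boto3

glue = boto3.client('glue', region_name='ap-southeast-1')

# Returns the bookmark entry for the job, including the run/attempt/version
# numbers and the JSON that tracks what has been processed so far.
response = glue.get_job_bookmark(JobName='my-glue-job')  # hypothetical job name
print(response['JobBookmarkEntry'])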

Bookmarks are not supported for the Parquet format in Glue version 0.9. They are supported in Glue version 1.0, though.

Just for the record, since there are no answers yet: I think editing the script seems to affect the bookmarks ... but I thought it should not ...

Related

Spark Job Crashes with error in prelaunch.err

We are running a Spark job which runs close to 30 scripts one by one. It usually takes 14-15 hours to run, but this time it failed after 13 hours. Below are the details:
Command: spark-submit --executor-memory=80g --executor-cores=5 --conf spark.sql.shuffle.partitions=800 run.py
Setup: Running Spark jobs via Jenkins on AWS EMR with 16 spot nodes
Error: Since the YARN log is huge (270 MB+), below are some extracts from it:
[2022-07-25 04:50:08.646]Container exited with a non-zero exit code 1. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : Last 4096 bytes of stderr :
ermediates/master/email/_temporary/0/_temporary/attempt_202207250435265404741257029168752_0641_m_000599_168147 s3://memberanalytics-data-out-prod/pipelined_intermediates/master/email/_temporary/0/task_202207250435265404741257029168752_0641_m_000599 using algorithm version 1
22/07/25 04:37:05 INFO FileOutputCommitter: Saved output of task 'attempt_202207250435265404741257029168752_0641_m_000599_168147' to s3://memberanalytics-data-out-prod/pipelined_intermediates/master/email/_temporary/0/task_202207250435265404741257029168752_0641_m_000599
22/07/25 04:37:05 INFO SparkHadoopMapRedUtil: attempt_202207250435265404741257029168752_0641_m_000599_168147: Committed
22/07/25 04:37:05 INFO Executor: Finished task 599.0 in stage 641.0 (TID 168147). 9341 bytes result sent to driver
22/07/25 04:49:36 ERROR YarnCoarseGrainedExecutorBackend: Executor self-exiting due to : Driver ip-10-13-52-109.bjw2k.asg:45383 disassociated! Shutting down.
22/07/25 04:49:36 INFO MemoryStore: MemoryStore cleared
22/07/25 04:49:36 INFO BlockManager: BlockManager stopped
22/07/25 04:50:06 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
22/07/25 04:50:06 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.InterruptedException

EMR Core nodes are not taking up map reduce jobs

I have a 2-node EMR (version 4.6.0) cluster (1 master (m4.large), 1 core (r4.xlarge)) with HBase installed. I'm using the default EMR configurations. I want to export HBase tables using:
hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.mapreduce.include.deleted.rows=true Table_Name hdfs:/full_backup/Table_Name 1
I'm getting the following error
2022-04-04 11:29:20,626 INFO [main] util.RegionSizeCalculator: Calculating region sizes for table "Table_Name".
2022-04-04 11:29:20,900 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
2022-04-04 11:29:20,900 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x17ff27095680070
2022-04-04 11:29:20,903 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x17ff27095680070
2022-04-04 11:29:20,904 INFO [main] zookeeper.ZooKeeper: Session: 0x17ff27095680070 closed
2022-04-04 11:29:20,980 INFO [main] mapreduce.JobSubmitter: number of splits:1
2022-04-04 11:29:20,994 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2022-04-04 11:29:21,192 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1649071534731_0002
2022-04-04 11:29:21,424 INFO [main] impl.YarnClientImpl: Submitted application application_1649071534731_0002
2022-04-04 11:29:21,454 INFO [main] mapreduce.Job: The url to track the job: http://ip-10-0-2-244.eu-west-1.compute.internal:20888/proxy/application_1649071534731_0002/
2022-04-04 11:29:21,455 INFO [main] mapreduce.Job: Running job: job_1649071534731_0002
2022-04-04 11:29:28,541 INFO [main] mapreduce.Job: Job job_1649071534731_0002 running in uber mode : false
2022-04-04 11:29:28,542 INFO [main] mapreduce.Job: map 0% reduce 0%
It is stuck at this progress and does not proceed. However, when I add a task node and rerun the same command, it finishes within seconds.
Based on the documentation, https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html, the core node itself should handle tasks as well. What could be going wrong?

Dependent jar file not found in mapreduce job

I have 2 almost identical CDH 5.8 clusters, namely Lab and Production. I have a MapReduce job that runs fine in Lab but fails in the Production cluster. I have spent over 10 hours on this already. I made sure I am running exactly the same code and also compared the configurations between the clusters. I couldn't find any difference.
The only difference I could see is that when I run in Production, I see the warnings below. Also note that the path of the cached file starts with "file://null/":
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/commons-httpclient-3.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/commons-httpclient-3.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/hadoop-yarn-server-common.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/hadoop-yarn-server-common.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/stax-api-1.0-2.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop/client/stax-api-1.0-2.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 WARN util.MRApps: cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hbase/lib/snappy-java-1.0.4.1.jar conflicts with cache file (mapreduce.job.cache.files) file://null/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/snappy-java-1.0.4.1.jar This will be an error in Hadoop 2.0
17/08/16 10:13:14 INFO impl.YarnClientImpl: Submitted application application_1502835801144_0005
17/08/16 10:13:14 INFO mapreduce.Job: The url to track the job: http://myserver.com:8088/proxy/application_1502835801144_0005/
17/08/16 10:13:14 INFO mapreduce.Job: Running job: job_1502835801144_0005
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 running in uber mode : false
17/08/16 10:13:15 INFO mapreduce.Job: map 0% reduce 0%
17/08/16 10:13:15 INFO mapreduce.Job: Job job_1502835801144_0005 failed with state FAILED due to: Application application_1502835801144_0005 failed 2 times due to AM Container for appattempt_1502835801144_0005_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://myserver.com:8088/proxy/application_1502835801144_0005/Then, click on links to logs of each attempt.
Diagnostics: java.io.FileNotFoundException: File file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar does not exist
Failing this attempt. Failing the application.
17/08/16 10:13:15 INFO mapreduce.Job: Counters: 0
17/08/16 10:13:16 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25ba0c30a33ea46
17/08/16 10:13:16 INFO zookeeper.ZooKeeper: Session: 0x25ba0c30a33ea46 closed
17/08/16 10:13:16 INFO zookeeper.ClientCnxn: EventThread shut down
As we can see, the job tries to start but fails, saying that a jar file is not found. I made sure the jar file exists on the local fs with ample permissions. I suspect the issue happens when it tries to copy the jar files into the distributed cache and somehow fails.
Here is my shell script that start the MR job:
#!/bin/bash
LIBJARS=`ls -m /var/cdr-ingest-mapreduce/lib/*.jar |tr -d ' '|tr -d '\n'`
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hbase/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop/client/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/*.jar |tr -d ' '|tr -d '\n'`"
LIBJARS="$LIBJARS,`ls -m /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*.jar |tr -d ' '|tr -d '\n'`"
job_start_timestamp=''
if [ -n "$1" ]; then
job_start_timestamp="-overridedJobStartTimestamp $1"
fi
export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
yarn jar `ls /var/cdr-ingest-mapreduce/cdr-ingest-mapreduce-core*.jar` com.blah.CdrIngestor \
-libjars ${LIBJARS} \
-zookeeper 44.88.111.216,44.88.111.220,44.88.111.211 \
-yarnResourceManagerHost 44.88.111.220 \
-yarnResourceManagerPort 8032 \
-yarnResourceManagerSchedulerHost 44.88.111.220 \
-yarnResourceManagerSchedulerPort 8030 \
-mrClientSubmitFileReplication 6 \
-logFile '/var/log/cdr_ingest_mapreduce/cdr_ingest_mapreduce' \
-hdfsTempOutputDirectory '/cdr/temp_cdr_ingest' \
-versions '3' \
-jobConfigDir '/etc/cdr-ingest-mapreduce' \
${job_start_timestamp}
Node Manager Log:
2017-08-16 18:34:28,438 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
2017-08-16 18:34:28,551 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2017-08-16 18:34:31,638 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2017-08-16 18:34:31,851 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: container_1502835801144_0006_01_000001 has no corresponding application!
2017-08-16 18:36:08,221 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
2017-08-16 18:36:08,364 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hdfs OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1502933671610_0001 CONTAINERID=container_1502933671610_0001_01_000001
More logs from the Node Manager showing that the jars were not copied to the cache (I am not sure what the 4th parameter "null" in the message is):
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1502835577753_0001_01_000001
2017-08-15 15:20:09,876 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1502835577753_0001_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/mail-1.4.7.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/commons-lang3-3.4.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/cdr-ingest-mapreduce-core-1.0.3-SNAPSHOT.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_1502835577753_0001_01_000001 sent RELEASE event on a resource request { file:/var/cdr-ingest-mapreduce/lib/opencsv-3.8.jar, 1502740240000, FILE, null } not present in cache.
2017-08-15 15:20:09,877 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download resource { { file:/var/cdr-ingest-mapreduce/lib/dataplatform-common-1.0.7.jar, 1502740240000, FILE, null },pending,[(container_1502835577753_0001_01_000001)],31900834426583787,DOWNLOADING}
Any help is appreciated.
Basically, the mapper/reducer was trying to read the dependent jar file(s) from the node manager's local filesystem. I confirmed that by comparing the configurations between the 2 clusters: the value of "fs.defaultFS" was set to "file:///" for the cluster that wasn't working. That value comes from the file /etc/hadoop/conf/core-site.xml on the server (edge server) where my MapReduce job was started. This file had no configuration because I had no service/role deployed on that edge server. I deployed HDFS/HttpFS on the edge server and redeployed the client configurations across the cluster. Alternatively, one could deploy a gateway role on the server to pull the configurations without having to run any role. Thanks to #tk421 for the tip. This created the contents of /etc/hadoop/conf/core-site.xml and fixed my problem.
For those who don't want to deploy any service/role on the edge server, you could copy the file contents from one of your data nodes.
I added this little code snippet before starting the job to print the configuration values:
// config is the job's Configuration object (iterable as Map.Entry<String, String>)
for (Entry<String, String> entry : config) {
    System.out.println(entry.getKey() + "-->" + entry.getValue());
}
// Start and wait for the job to finish
jobStatus = job.waitForCompletion(true);

toPandas() works from Jupyter iPython Notebook but fails on submit - AWS EMR

I have a program that:
1. reads some data
2. performs some operations
3. saves a CSV file
4. transfers that file to FTP
I am using an Amazon EMR cluster and PySpark to accomplish this task.
For step 4, I need to save the CSV on local storage and not on HDFS. For this purpose, I convert the Spark DataFrame to a Pandas DataFrame.
A snippet could be:
from pyspark import SparkContext
from pyspark.conf import SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.types import StructType, StructField, LongType, StringType
from pyspark.mllib.evaluation import *
from pyspark.sql.functions import *
from pyspark.sql import Row
from time import time
import timeit
from datetime import datetime, timedelta
import numpy as np
import random as rand
import pandas as pd
from itertools import combinations, permutations
from collections import defaultdict
from ftplib import FTP
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("Recommendation").set('spark.driver.memory', '8G').set('spark.executor.memory', '4G')
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
readRdd = sqlContext.read.format('com.databricks.spark.csv').load('s3n://my-bucket/myfile' + path)
df = readRdd.toPandas() # <---------- PROBLEM
print('toPandas() completed')
df.to_csv('./myFile')
The problem is:
When I run this code from a Jupyter IPython notebook on the same cluster, it works like a charm. But when I run this code using spark-submit, or add it as a step to EMR, the code fails on the following line:
df = readRdd.toPandas()
'toPandas() completed' is never printed.
In the Spark job monitor, I can see that the toPandas() method gets executed, but right after that I get the error.
16/10/10 13:17:47 INFO YarnAllocator: Driver requested a total number of 1 executor(s).
16/10/10 13:17:47 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:47 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:47 INFO TaskSetManager: Finished task 1462.0 in stage 17.0 (TID 10624) in 2089 ms on ip-172-31-38-70.eu-west-1.compute.internal (1515/1516)
16/10/10 13:17:47 INFO TaskSetManager: Finished task 1485.0 in stage 17.0 (TID 10647) in 2059 ms on ip-172-31-38-70.eu-west-1.compute.internal (1516/1516)
16/10/10 13:17:47 INFO YarnClusterScheduler: Removed TaskSet 17.0, whose tasks have all completed, from pool
16/10/10 13:17:47 INFO DAGScheduler: ResultStage 17 (toPandas at 20161007_RecPipeline.py:182) finished in 12.609 s
16/10/10 13:17:47 INFO DAGScheduler: Job 4 finished: toPandas at 20161007_RecPipeline.py:182, took 14.646644 s
16/10/10 13:17:47 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
16/10/10 13:17:47 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:47 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:50 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:50 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:53 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:53 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:56 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:56 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:17:59 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:17:59 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:02 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:02 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:05 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:05 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:08 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:08 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:11 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:11 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:14 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:14 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:17 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:17 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:20 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:20 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:23 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:23 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:26 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:26 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:29 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:29 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:32 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:32 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:35 INFO YarnAllocator: Canceling requests for 0 executor containers
16/10/10 13:18:35 WARN YarnAllocator: Expected to find pending requests, but found none.
16/10/10 13:18:36 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
16/10/10 13:18:36 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 13:18:36 INFO SparkUI: Stopped Spark web UI at http://172.31.37.28:45777
16/10/10 13:18:36 INFO YarnClusterSchedulerBackend: Shutting down all executors
16/10/10 13:18:36 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
16/10/10 13:18:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 13:18:36 ERROR PythonRDD: Error while sending iterator
java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:440)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:452)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:648)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:648)
at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:648)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:649)
16/10/10 13:18:36 ERROR ApplicationMaster: User application exited with status 143
16/10/10 13:18:36 INFO ApplicationMaster: Final app status: FAILED, exitCode: 143, (reason: User application exited with status 143)
16/10/10 13:18:36 INFO MemoryStore: MemoryStore cleared
16/10/10 13:18:36 INFO BlockManager: BlockManager stopped
16/10/10 13:18:36 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 13:18:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 13:18:36 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/10/10 13:18:36 INFO SparkContext: Successfully stopped SparkContext
16/10/10 13:18:36 INFO ShutdownHookManager: Shutdown hook called
16/10/10 13:18:36 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt3/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-eab43d4e-7201-4bcb-8ee7-0e7b546e8fd8
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-1d88398f-ecd5-4d94-a42a-a406b3d566af/pyspark-34bec23c-a686-475d-85c9-9e9228b23239
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-1d88398f-ecd5-4d94-a42a-a406b3d566af
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt3/yarn/usercache/hadoop/appcache/application_1476100925559_0002/container_1476100925559_0002_01_000001/tmp/spark-96cdee47-e3f3-45f4-8bc7-0df5928ef53c
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt2/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-f6821ea1-6f37-4cc6-8bba-049ac0215786
16/10/10 13:18:36 INFO ShutdownHookManager: Deleting directory /mnt1/yarn/usercache/hadoop/appcache/application_1476100925559_0002/spark-1827cae8-8a60-4b29-a4e5-368a8e1856fd
My cluster configuration looks like:
spark-defaults spark.driver.maxResultSize 8G
spark-defaults spark.driver.memory 8G
spark-defaults spark.executor.memory 4G
The Spark Submit command looks like:
spark-submit --deploy-mode cluster s3://my-bucket/myPython.py
This is killing me! Can someone please give me any pointers as to what direction I should look in?
Here was the problem:
spark-submit --deploy-mode cluster s3://my-bucket/myPython.py
In the above command, the deploy mode is set to cluster, which means a node is chosen from among the core nodes to run the driver program. Since the allowed driver memory is 8G and the core nodes were smaller physical instances, they would always run out of the required memory.
The solution was to deploy in client mode, where the driver always runs on the master node (in my case a bigger physical instance with more resources) and therefore does not run out of the memory required for the whole process.
Since it was a dedicated cluster, this solution worked in my case.
In the case of a shared cluster where the deploy mode must be cluster, using bigger instances should work.

Hadoop MapReduce Job Hangs

I am trying to simulate the Hadoop environment using the latest Hadoop version 2.6.0 and Java SDK 1.7.0 on my Ubuntu desktop. I configured Hadoop with the necessary environment parameters; all of its processes are up and running and can be seen with the following jps command:
nandu#nandu-Desktop:~$ jps
2810 NameNode
3149 SecondaryNameNode
3416 NodeManager
3292 ResourceManager
2966 DataNode
4805 Jps
I can also see the above information, plus the DFS files, through the Firefox browser. However, when I tried to run a simple WordCount MapReduce job, it hung and didn't produce any output or show any error message(s). After a while I killed the process using the "hadoop job -kill " command. Can you please guide me on finding the cause of this issue and how to resolve it? I am including the job start and kill (end) output below.
If you need additional information, please let me know.
Your help will be highly appreciated.
Thanks,
===================================================================
nandu#nandu-Desktop:~/dev$ hadoop jar wc.jar WordCount /user/nandu/input /user/nandu/output
15/02/27 10:35:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/27 10:35:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/27 10:35:21 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/02/27 10:35:21 INFO input.FileInputFormat: Total input paths to process : 2
15/02/27 10:35:21 INFO mapreduce.JobSubmitter: number of splits:2
15/02/27 10:35:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1425048764581_0003
15/02/27 10:35:22 INFO impl.YarnClientImpl: Submitted application application_1425048764581_0003
15/02/27 10:35:22 INFO mapreduce.Job: The url to track the job: http://nandu-Desktop:8088/proxy/application_1425048764581_0003/
15/02/27 10:35:22 INFO mapreduce.Job: Running job: job_1425048764581_0003
==================== at this point the job was killed ===================
15/02/27 10:38:23 INFO mapreduce.Job: Job job_1425048764581_0003 running in uber mode : false
15/02/27 10:38:23 INFO mapreduce.Job: map 0% reduce 0%
15/02/27 10:38:23 INFO mapreduce.Job: Job job_1425048764581_0003 failed with state KILLED due to: Application killed by user.
15/02/27 10:38:23 INFO mapreduce.Job: Counters: 0
I encountered a similar problem while running the MapReduce sample provided in the Hadoop package. In my case it was hanging due to low disk space on my VM (only about 1.5 GB was free). When I freed some disk space, it ran fine. Also, please check that the other system resource requirements are fulfilled.