amazon emr spark submission from S3 not working - amazon-web-services

I have a cluster up and running. I am trying to add a step to run my code. The code itself works fine on a single instance. Only thing is I can't get it to work off S3.
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--deploy-mode,cluster,--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
This is exactly what examples show I should do. What am I doing wrong?
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-memory, 0.5g, --executor-cores, 2, --primary-py-file, s3://<mybucketname>/mypythonfile.py, --class, org.apache.spark.deploy.PythonRunner)
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
mode)
.
.
.
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
When I specify as this instead:
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
I get this error instead:
Error: Only local python files are supported: Parsed arguments:
master yarn-client
deployMode client
executorMemory 0.5g
executorCores 2
EDIT: IT gets further along when I manually create the python file after SSH'ing into the cluster, and specifying as follows:
aws emr add-steps --cluster-id 'j-XXXXX' --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,1g,/home/hadoop/mypythonfile.py]
But, not doing the job.
Any help appreciated. This is really frustrating as a well documented method on AWS's own blog here https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit does not work.

I will ask, just in case, you used your correct buckets and cluster ID-s?
But anyways, I had similar problems, like I could not use --deploy-mode,cluster when reading from S3.
When I used --deploy-mode,client,--master,local[4] in the arguments, then I think it worked. But I think I still needed something different, can't remember exactly, but I resorted to a solution like this:
Firstly, I use a bootstrap action where a shell script runs the command:
aws s3 cp s3://<mybucket>/wordcount.py wordcount.py
and then I add a step to the cluster creation through the SDK in my Go application, but I can recollect this info and give you the CLI command like this:
aws emr add-steps --cluster-id j-XXXXX --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",--master,local[4],/home/hadoop/wordcount.py,s3://<mybucket>/<inputfile.txt> s3://<mybucket>/<outputFolder>/]

I searched for days and finally discovered this thread which states
PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your python files must be on your local file system.
Until this is supported, the straightforward workaround then is to just
copy the files to your local machine.

Related

AWS EMR pyspark java.lang.illegalArgumentException when using pandas_udf

Using pyspark 2.4.7 and pyarrow 6.0.1.
I know from documentation there is compatibility issue therefore I need to set ARROW_PRE_0_15_IPC_FORMAT = 1 inside spark-env.sh
This solves the problem on my local machine however still getting the same error in AWS Emr 5.33.1
I am usint boto3 and have configured spark-env by passing
[...{'Classification': 'spark-env', 'Configurations':[{'Classification': 'export', 'Properties':{'ARROW_PRE_0_15_IPC_FORMAT':'1'}}],
'Properties':{}
}
and EMR loads property and has its config can be see in EMR UI.
I've read that these config only used for master node, so worker nodes are still getting the same error?

What is the correct way of installing a JDBC driver on EMR for Sqoop to use?

I am running Sqoop 1.4.7 on AWS EMR 5.21.1 and am trying to import data from a database. I have successfully been able to do this manually where I create an EMR instance with Sqoop installed via the EMR Console.
Here are the preliminary steps that I performed in order to run sqoop on EMR
Download the JDBC Driver
Move the JDBC driver to the /usr/lib/sqoop/lib directory
I was able to successfully run a sqoop import when I was sshd into an EMR cluster with these commands:
wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar
sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/
When I try to run these commands from an EMR bootstrap script however I get the error:
usr/lib/sqoop/lib/ No such file or directory
After doing some investigation I realized this is because "Bootstrap actions execute before core services, such as Hadoop or Spark, are installed", as found here
So the /usr/lib/sqoop/lib directory doesnt exist when I run my bootstrap steps.
Here are some solutions which work but they feel like work-arounds
Create the /usr/lib/sqoop/lib directory in my bootstrap script and then place the jar in it
Add the jar to this directory as an EMR step. (Turns out this this is the correct approach, look at below accepted answer)
What is the correct way of installing this JDBC driver on EMR?
The 2nd option is the correct way to do it. The documentation explains running bash scripts as an EMR step.
You can also use the jar command-runner.jar and the arguments to be
bash -c "wget -O mssql-jdbc.jar https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/8.4.0.jre8/mssql-jdbc-8.4.0.jre8.jar;sudo mv mssql-jdbc.jar /usr/lib/sqoop/lib/"

Unable to spark-submit a pyspark file on s3 bucket

I have a pyspark code stored both on the master node of an AWS EMR cluster and in an s3 bucket that fetches over 140M rows from a MySQL database and stores the sum of a column back in the log files on s3.
When I spark-submit the pyspark code on the master node, the job gets completed successfully and the output is stored in the log files on the S3 bucket.
However, when I spark-submit the pyspark code on the S3 bucket using these- (using the below commands on the terminal after SSH-ing to the master node)
spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
This returns a Error: Missing application resource. error.
spark_submit s3://bucket_name/my_script.py
This shows :
20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
... 20 more
I read about having to add a Spark Step on the AWS EMR cluster to submit a pyspark code stored on the S3.
Am I correct in saying that I would need to create a step in order to submit my pyspark job stored on the S3?
In the 'Add Step' window that pops up on the AWS Console, in the 'Application location' field, it says that I'll have to type in the location to the JAR file. What JAR file are they referring to? Does my pyspark script have to be packaged into a JAR file and how do I do that or do I mention the path to my pyspark script?
In the 'Add Step' window that pops up on the AWS Console, in the Spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If no, why not?
I have gone through the AWS EMR documentation. I have so many questions because I dived nose-down into the problem and only researched when an error popped up.
Your spark submit should be this.
spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py
--py-files is used if you want to pass the python dependency modules, not the application code.
When you are adding step in EMR to run spark job, jar location is your python file path. i.e. s3://bucket_name/my_script.py
No its not mandatory to use STEP to submit spark job.
You can also use spark-submit
To submit a pyspark script using STEP please refer aws doc and stackoverflow
For problem 1:
By default spark will use python2.
You need to add 2 config
Go to $SPARK_HOME/conf/spark-env.sh and add
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: If you have any custom bundle add that using --py-files
For problem 2:
A hadoop-assembly jar exists on /usr/share/aws/emr/emrfs/lib/. That contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
You need to add this to your classpath.
A better option to me is to create a symbolic link of hadoop-assembly jar to HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action.

Clear data from HDFS on AWS EMR in Hadoop 1.0.3

For various reasons I'm running some jobs on EMR with AMI 2.4.11/Hadoop 1.0.3. I'm trying to run a cleanup of HDFS after my jobs by adding an additional EMR step. Using boto:
step = JarStep(
'HDFS cleanup',
'command-runner.jar',
action_on_failure='CONTINUE',
step_args=['hadoop', 'dfs', '-rmr', '-skipTrash', 'hdfs:/tmp'])
emr_conn.add_jobflow_steps(cluster_id, [step])
However it regularly fails with nothing in stderr in the EMR console.
Why I am confused is if I ssh into the master node and run the command:
hadoop dfs -rmr -skipTrash hdfs:/tmp
It succeeds with a 0 and a message that it successfully deleted everything. All the normal hadoop commands seem to work as documented. Does anyone know if there's an obvious reason for this? Issue with the Amazon distribution? Undocumented behavior in certain commands?
Note:
I have other jobs running in Hadoop 2 and the documented:
hdfs dfs -rm -r -skipTrash hdfs:/tmp
works as one would expect both as a step and as a command.
My solution generally was to upgrade everything to Hadoop2, in which case this works:
JarStep(
'%s: HDFS cleanup' % self.job_name,
'command-runner.jar',
action_on_failure='CONTINUE',
step_args=['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path]
)
This was the best I could get with Hadoop1 that worked pretty well.
JarStep(
'%s: HDFS cleanup' % self.job_name,
'command-runner.jar',
action_on_failure='CONTINUE',
step_args=['hadoop', 'fs', '-rmr', '-skipTrash',
'hdfs:/tmp/mrjob']
)

S3-Dist-Cp Failing on EMR5

I am facing issues with s3-dist-cp command in emr-5.0.0 version. In my application, I need to push some files from hdfs to S3. I am using s3-dist-cp command to achieve this. It was working fine in emr-4.2.0. But its not working in emr-5.0.0. If I run the command manually it works fine. But it fails in my application. I didn't make any change in my application to run it on emr-5.
Do I need to make any change if I need to use emr-5? Has there been any change in way we use s3-dist-cp command in emr-5?
I am using following command:
s3-dist-cp --src /user/hive/warehouse/abc.text --dest s3n://bucket/abc.text
s3-dist-cp is only available on the master node(s3-dist-cp.jar).
The following is the location of the application.
/usr/share/aws/emr/s3-dist-cp/
The s3-dist-cp.jar is not available in the slave nodes.
You can login into slave machine and verify it.
So the reason your application failure might be, In new emr you might be using some workflow management tool which deploy the application on slaves and start from there. As s3 s3-dist-cp is not available and it fails.
Work Around
First Option
bundle the jar and use following commands
hadoop jar s3-dist-cp.jar --src location --dest location
Second
Boot Strap the s3-dist-cp.jars on the cluster
You can even run it as java program
First thing, s3n:// is now deprecated, start using s3:// for S3 paths.
Secondly, if you're merely copying a file into S3 from a local file on your cluster, you can use aws s3 cp:
aws s3 cp /user/hive/warehouse/abc.text s3://bucket/abc.text
The syntax that you have used for s3-dist-cp is incorrect. Please try again with the command below.
s3-dist-cp --src hdfs:///user/hive/warehouse/abc.text --dest s3n://bucket/abc.text
Let me know if this solves your proble.