I have a PySpark script that reads an XML file (present in S3). I need to add this as a step in AWS EMR. I have used the following command:
aws emr add-steps --cluster-id <cluster_id> --steps Type=spark,Name=POI,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,<s3 location of pyspark script>],ActionOnFailure=CONTINUE
I have downloaded the spark-xml jar to the master node during bootstrap, and it is present under the
/home/hadoop
directory. Also, in the Python script I have included:
conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
But it still shows:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
You have set master as yarn and deploy-mode as cluster. That means your Spark driver will run on one of the CORE nodes.
By default, EMR is configured to create the application master on one of the CORE nodes, and the application master will host the driver.
Please refer to this article for more info.
So you have to put your jar on all CORE nodes (not on the MASTER) and refer to it as file:///home/hadoop/spark-xml_2.11-0.4.1.jar.
Or, better, put it in HDFS (let's say under hdfs:///user/hadoop) and refer to it as hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar.
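As an illustration, here is a minimal sketch of the script side once the jar is in HDFS; the rowTag value and the S3 path below are placeholders, not details taken from the question:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Reference the jar by its HDFS path so the driver (which runs on a CORE
# node in cluster deploy mode) and the executors can all fetch it.
conf = (SparkConf()
        .setAppName('Project')
        .set("spark.jars", "hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar"))

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Read the XML file from S3 with the spark-xml data source.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")                  # placeholder rowTag
      .load("s3://my-bucket/path/to/file.xml"))    # placeholder S3 path
df.printSchema()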
Using pyspark 2.4.7 and pyarrow 6.0.1.
I know from the documentation that there is a compatibility issue, so I need to set ARROW_PRE_0_15_IPC_FORMAT=1 inside spark-env.sh.
This solves the problem on my local machine; however, I am still getting the same error on AWS EMR 5.33.1.
I am using boto3 and have configured spark-env by passing
[
  ...,
  {
    'Classification': 'spark-env',
    'Configurations': [
      {
        'Classification': 'export',
        'Properties': {'ARROW_PRE_0_15_IPC_FORMAT': '1'}
      }
    ],
    'Properties': {}
  }
]
EMR loads the property, and its config can be seen in the EMR UI.
I've read that this config is only applied to the master node; is that why the worker nodes are still getting the same error?
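One possibility (a sketch only, not verified on EMR 5.33.1): spark-env exports may not reach the YARN containers where the executors run, so the variable can also be set through Spark's spark.yarn.appMasterEnv.* and spark.executorEnv.* properties via the spark-defaults classification:
# Sketch: add a spark-defaults classification alongside the existing
# spark-env one so the variable is exported for the YARN application
# master and every executor, not just for processes started via spark-env.sh.
extra_configuration = {
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT': '1',
        'spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT': '1',
    },
}
# Append this dict to the Configurations list passed to boto3's
# run_job_flow call when creating the cluster.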
I have a PySpark script, stored both on the master node of an AWS EMR cluster and in an S3 bucket, that fetches over 140M rows from a MySQL database and writes the sum of a column back to the log files on S3.
When I spark-submit the pyspark code on the master node, the job gets completed successfully and the output is stored in the log files on the S3 bucket.
However, when I spark-submit the PySpark code stored in the S3 bucket using the below commands (on the terminal, after SSH-ing into the master node):
spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
This returns an Error: Missing application resource. error.
spark-submit s3://bucket_name/my_script.py
This shows:
20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
... 20 more
I read about having to add a Spark step on the AWS EMR cluster to submit PySpark code stored on S3.
Am I correct in saying that I would need to create a step in order to submit my PySpark job stored on S3?
In the 'Add Step' window that pops up on the AWS Console, in the 'Application location' field, it says that I'll have to type in the location of the JAR file. What JAR file are they referring to? Does my PySpark script have to be packaged into a JAR file (and how do I do that), or do I just give the path to my PySpark script?
In the 'Add Step' window that pops up on the AWS Console, in the Spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If no, why not?
I have gone through the AWS EMR documentation. I have so many questions because I dived nose-down into the problem and only researched when an error popped up.
Your spark-submit should be this:
spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py
--py-files is used when you want to pass Python dependency modules, not the application code.
When you are adding a step in EMR to run a Spark job, the JAR location is your Python file path, i.e. s3://bucket_name/my_script.py
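As an illustration, here is a minimal boto3 sketch of adding such a step; the cluster ID, bucket, and script name are placeholders. The console's Spark step type boils down to command-runner.jar invoking spark-submit:
import boto3

emr = boto3.client('emr', region_name='us-east-1')   # region is a placeholder

emr.add_job_flow_steps(
    JobFlowId='j-XXXXX',                              # your cluster ID
    Steps=[{
        'Name': 'MyPySparkStep',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            # command-runner.jar executes the given command; the "JAR"
            # location for a PySpark step is simply the .py file on S3.
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     's3://bucket_name/my_script.py'],
        },
    }],
)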
No, it is not mandatory to use a step to submit a Spark job.
You can also use spark-submit directly.
To submit a PySpark script using a step, please refer to the AWS docs and Stack Overflow.
For problem 1:
By default, Spark will use Python 2.
You need to add two configs.
Go to $SPARK_HOME/conf/spark-env.sh and add:
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: if you have any custom bundle, add it using --py-files.
For problem 2:
A hadoop-assembly jar exists under /usr/share/aws/emr/emrfs/lib/; it contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
You need to add this to your classpath.
A better option, in my opinion, is to create a symbolic link from the hadoop-assembly jar into HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action.
I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through SSH), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfigured the Spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}
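In a sparkmagic-based EMR notebook, this kind of JSON is typically applied in a %%configure -f cell before any Spark code runs; a sketch is shown below (the answer does not state the exact mechanism used):
%%configure -f
{"conf": {"spark.pyspark.python": "/usr/bin/python3.7",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type": "native",
          "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"}}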
I'm running emr-5.12.0, with Amazon 2.8.3, Hive 2.3.2, Hue 4.1.0, Livy 0.4.0, Spark 2.2.1 and Zeppelin 0.7.3 on 1 m4.large as my master node and 1 m4.large as core node.
I am trying to execute a bootstrap action that configures some parts of the cluster. One of these includes the line:
sudo sed -i '/zeppelin.pyspark.python/c\ \"zepplin.pyspark.python\" : \"python3\",' /etc/alternatives/zeppelin-conf/interpreter.json
It makes sure that Zeppelin uses Python 3.4 instead of Python 2.7. It works fine if I execute this in the terminal after SSH'ing to the master node, but it fails when I submit it as a Custom JAR step through the AWS web interface. I get the following error:
sed: can't read /etc/alternatives/zeppelin-conf/interpreter.json: No such file or directory
Command exiting with ret '2'
The same thing happens if I use
sudo sed -i '/zeppelin.pyspark.python/c\ \"zepplin.pyspark.python\" : \"python3\",' /etc/zeppelin-conf/interpreter.json
Obviously I could just change it from the Zeppelin UI, but I would like to include it in the bootstrap action.
Thanks!
It turns out that a bootstrap action submitted through the AWS EMR web interface is submitted as a regular EMR step, so it's only run on the master node. This can be seen if you click 'AWS CLI export' in the cluster web interface: the intended bootstrap action is listed as a regular step.
Using the command line to launch a cluster with a bootstrap action bypasses this problem, so I've just used that.
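As an illustration, here is a minimal boto3 sketch of launching a cluster with a genuine bootstrap action; the script path, names, and instance settings are placeholders, and the referenced shell script is a hypothetical file holding the desired per-node setup commands:
import boto3

emr = boto3.client('emr', region_name='us-east-1')   # region is a placeholder

emr.run_job_flow(
    Name='zeppelin-python3-cluster',                  # placeholder name
    ReleaseLabel='emr-5.12.0',
    Applications=[{'Name': 'Spark'}, {'Name': 'Zeppelin'}],
    Instances={
        'MasterInstanceType': 'm4.large',
        'SlaveInstanceType': 'm4.large',
        'InstanceCount': 2,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    # A genuine bootstrap action runs on every node while the cluster is
    # being set up, unlike a step, which runs only once on the master node.
    BootstrapActions=[{
        'Name': 'configure-cluster',
        'ScriptBootstrapAction': {
            'Path': 's3://<mybucket>/configure_cluster.sh',   # hypothetical script
        },
    }],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)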
Edit: Looking back at the web interface, it's pretty clear that I was adding regular steps instead of bootstrap actions. My bad!
I have a cluster up and running. I am trying to add a step to run my code. The code itself works fine on a single instance. Only thing is I can't get it to work off S3.
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--deploy-mode,cluster,--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
This is exactly what examples show I should do. What am I doing wrong?
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-memory, 0.5g, --executor-cores, 2, --primary-py-file, s3://<mybucketname>/mypythonfile.py, --class, org.apache.spark.deploy.PythonRunner)
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
mode)
.
.
.
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
When I specify as this instead:
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
I get this error instead:
Error: Only local python files are supported: Parsed arguments:
master yarn-client
deployMode client
executorMemory 0.5g
executorCores 2
EDIT: It gets further along when I manually create the Python file on the cluster after SSH'ing in and specify it as follows:
aws emr add-steps --cluster-id 'j-XXXXX' --steps Type=spark,Name= SomeSparkApp,Args=[--executor-memory,1g,/home/hadoop/mypythonfile.py]
But that is still not doing the job.
Any help appreciated. This is really frustrating, as a well-documented method on AWS's own blog (https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit) does not work.
I will ask, just in case: did you use the correct bucket and cluster IDs?
Anyway, I had similar problems; for instance, I could not use --deploy-mode,cluster when reading from S3.
When I used --deploy-mode,client,--master,local[4] in the arguments, I think it worked. I still needed something slightly different (I can't remember exactly what), so I resorted to a solution like this:
Firstly, I use a bootstrap action where a shell script runs the command:
aws s3 cp s3://<mybucket>/wordcount.py wordcount.py
and then I add a step to the cluster creation through the SDK in my Go application; recollecting that, the equivalent CLI command looks like this:
aws emr add-steps --cluster-id j-XXXXX --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",--master,local[4],/home/hadoop/wordcount.py,s3://<mybucket>/<inputfile.txt>,s3://<mybucket>/<outputFolder>/]
I searched for days and finally discovered this thread, which states:
PySpark currently only supports local files. This does not mean it only runs in local mode, however; you can still run PySpark on any cluster manager (though only in client mode). All this means is that your python files must be on your local file system. Until this is supported, the straightforward workaround then is to just copy the files to your local machine.