AWS emr driver jars - amazon-web-services

I'm trying to use external drivers in AWS EMR 5.29 on pyspark notebooks via:
#%%configure -f
{ "conf": {"spark.jars":"s3://bucket/spark-redshift_2.10-2.0.1.jar,"
"s3://bucket/minimal-json-0.9.5.jar,"
"s3://bucket/spark-avro_2.11-3.0.0.jar,"
"s3://bucket/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar"}}
As per https://blog.benthem.io/2020/04/21/connect-aws-emr-to-spark.html
However, when trying
from pyspark.sql import SQLContext
sc = spark # existing SparkContext
sql_context = SQLContext(sc)
df = sql_context.read.format("com.databricks.spark.redshift")\
.option("url", jdbcUrl)\
.option("query","select * from test")\
.option("tempdir", "s3://")\
.load()
I get
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift.
How can I troubleshoot this? I can confirm the EMR role has access to the bucket, as I can process a CSV file in the same bucket with Spark. I can also confirm that all the listed jar files are in the bucket.

The way to troubleshoot this is to SSH into the master node and look at the Livy logs:
/mnt/var/log/livy/livy-livy-server.out
and the downloaded jar files at
/var/lib/livy/.ivy2/jars/
Based on what I found there, I changed my configuration to:
%%configure -f
{
"conf": {
"spark.jars" : "s3://bucket/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar",
"spark.jars.packages": "com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4"
}
}
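Both spark.jars and spark.jars.packages take a single comma-separated string, and JSON has no adjacent-string concatenation, so it can help to generate the %%configure body rather than type it by hand. A minimal sketch, assuming the jar path and package coordinates from the configuration above (the helper name configure_payload is made up for illustration):

```python
import json


def configure_payload(jars=(), packages=()):
    """Build the JSON body for a %%configure cell (hypothetical helper)."""
    conf = {}
    if jars:
        # spark.jars expects one comma-separated string of jar locations
        conf["spark.jars"] = ",".join(jars)
    if packages:
        # spark.jars.packages expects comma-separated Maven coordinates
        conf["spark.jars.packages"] = ",".join(packages)
    return json.dumps({"conf": conf}, indent=2)


payload = configure_payload(
    jars=["s3://bucket/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar"],
    packages=[
        "com.databricks:spark-redshift_2.10:2.0.0",
        "org.apache.spark:spark-avro_2.11:2.4.0",
        "com.eclipsesource.minimal-json:minimal-json:0.9.4",
    ],
)
print(payload)
```

Paste the printed JSON under %%configure -f. Since it comes from json.dumps, it is at least guaranteed to parse.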

Related

Pyspark from S3 - java.lang.ClassNotFoundException: com.amazonaws.services.s3.model.MultiObjectDeleteException

I'm trying to get the data from s3 with pyspark from AWS EMR Cluster.
I'm still getting this error - An error occurred while calling o27.parquet. : java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException.
I have tried with different versions of jars/clusters, still no results.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().set("spark.jars","/usr/lib/spark/jars/hadoop-aws-3.2.1.jar,/usr/lib/spark/aws-java-sdk-s3-1.11.873.jar")
sc = SparkContext( conf=conf)
sqlContext = SQLContext(sc)
df2 = sqlContext.read.parquet("s3a://stackdev2prq/Badges/*")
I'm using hadoop-aws-3.2.1.jar and aws-java-sdk-s3-1.11.873.jar.
Spark 3.0.1 on Hadoop 3.2.1 YARN
I know I need the proper version of aws-java-sdk, but how can I check which version I should download?
mvnrepo normally provides the information, but I don't see it for 3.2.1.
Looking in the hadoop-project POM and JIRA versions, HADOOP-15642 says "1.11.375"; the move to 1.11.563 only went in with 3.2.2.
Do put the whole (huge) aws-java-sdk-bundle on the classpath. It shades everything and avoids version-mismatch hell with Jackson, httpclient, etc.
That said: if you are working with EMR, you should just use the s3:// URLs and pick up the EMR team's S3 connector.
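If you do stay on s3a, one way to keep the two artifacts in step is to derive both coordinates from the Hadoop version. A minimal sketch: the version table holds only the two pairings quoted in this answer, and the helper name s3a_packages is made up for illustration.

```python
# SDK bundle versions per Hadoop release, as quoted in the answer above.
SDK_BUNDLE_FOR_HADOOP = {
    "3.2.1": "1.11.375",
    "3.2.2": "1.11.563",
}


def s3a_packages(hadoop_version):
    """Return a spark.jars.packages value pairing hadoop-aws with its SDK bundle."""
    try:
        sdk_version = SDK_BUNDLE_FOR_HADOOP[hadoop_version]
    except KeyError:
        raise ValueError(
            f"No known SDK bundle version for Hadoop {hadoop_version}"
        ) from None
    return ",".join([
        f"org.apache.hadoop:hadoop-aws:{hadoop_version}",
        f"com.amazonaws:aws-java-sdk-bundle:{sdk_version}",
    ])


print(s3a_packages("3.2.1"))
```

The resulting string can be passed as spark.jars.packages (or to --packages) instead of pointing spark.jars at hand-picked jar files.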

Unable to spark-submit a pyspark file on s3 bucket

I have a PySpark script, stored both on the master node of an AWS EMR cluster and in an S3 bucket, that fetches over 140M rows from a MySQL database and writes the sum of a column to the log files on S3.
When I spark-submit the pyspark code on the master node, the job gets completed successfully and the output is stored in the log files on the S3 bucket.
However, when I spark-submit the copy in the S3 bucket (running the commands below in a terminal after SSH-ing into the master node), it fails:
spark-submit --master yarn --deploy-mode cluster --py-files s3://bucket_name/my_script.py
This returns an "Error: Missing application resource." error.
spark-submit s3://bucket_name/my_script.py
This shows:
20/07/02 11:26:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2840)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1911)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:766)
at org.apache.spark.deploy.DependencyUtils$.downloadFile(DependencyUtils.scala:137)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$7.apply(SparkSubmit.scala:356)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:355)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
... 20 more
I read about having to add a Spark Step on the AWS EMR cluster to submit a pyspark code stored on the S3.
Am I correct in saying that I would need to create a step in order to submit my pyspark job stored on the S3?
In the 'Add Step' window that pops up on the AWS Console, in the 'Application location' field, it says that I'll have to type in the location to the JAR file. What JAR file are they referring to? Does my pyspark script have to be packaged into a JAR file and how do I do that or do I mention the path to my pyspark script?
In the 'Add Step' window that pops up on the AWS Console, in the Spark-submit options, how do I know what to write for the --class parameter? Can I leave this field empty? If no, why not?
I have gone through the AWS EMR documentation. I have so many questions because I dived nose-down into the problem and only researched when an error popped up.
Your spark-submit command should be:
spark-submit --master yarn --deploy-mode cluster s3://bucket_name/my_script.py
--py-files is for passing Python dependency modules, not the application code.
When you add a step in EMR to run a Spark job, the JAR location is the path to your Python file, i.e. s3://bucket_name/my_script.py
No, it's not mandatory to use a step to submit a Spark job.
You can also use spark-submit.
To submit a PySpark script using a step, please refer to the AWS docs and Stack Overflow.
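For the programmatic route, the same command line can be wrapped in a step definition. A minimal sketch, assuming the step is run through EMR's command-runner.jar and submitted with boto3's add_job_flow_steps; the helper names, the step name and the ActionOnFailure choice are illustrative, not taken from the question.

```python
def spark_submit_args(script_uri, py_files=()):
    """Build the spark-submit argument list; the script is the application resource."""
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if py_files:
        # --py-files is only for extra dependency modules, never the main script
        args += ["--py-files", ",".join(py_files)]
    args.append(script_uri)
    return args


def emr_step(name, script_uri, py_files=()):
    """Build one EMR step definition that runs spark-submit (hypothetical helper)."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # assumption: keep the cluster running on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": spark_submit_args(script_uri, py_files),
        },
    }


step = emr_step("my_script", "s3://bucket_name/my_script.py")
print(step["HadoopJarStep"]["Args"])
```

The dictionary would then be passed as boto3.client("emr").add_job_flow_steps(JobFlowId=..., Steps=[step]), with the cluster id filled in.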
For problem 1:
By default, Spark will use Python 2.
You need to add two settings.
Go to $SPARK_HOME/conf/spark-env.sh and add
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Note: If you have a custom bundle, add it using --py-files.
For problem 2:
A hadoop-assembly jar exists in /usr/share/aws/emr/emrfs/lib/. It contains com.amazon.ws.emr.hadoop.fs.EmrFileSystem.
You need to add this to your classpath.
In my view, a better option is to create a symbolic link to the hadoop-assembly jar in HADOOP_HOME (/usr/lib/hadoop) in your bootstrap action.
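When a ClassNotFoundException like this shows up, it can help to confirm which jar, if any, actually contains the class before touching the classpath. A minimal sketch using only the standard library; the function name jars_containing is made up, and the directory in the usage note is the one mentioned above.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, jar_dir):
    """Return the jars in jar_dir that contain the given fully qualified class."""
    # A class a.b.C is stored inside the jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    matches.append(jar)
        except zipfile.BadZipFile:
            # Skip files that are named .jar but are not valid archives
            continue
    return matches
```

On the master node this would be called as jars_containing("com.amazon.ws.emr.hadoop.fs.EmrFileSystem", "/usr/share/aws/emr/emrfs/lib"). An empty list means the class is not in that directory at all.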

AWS Glue Python Shell Job Connect Timeout Error

I'm trying to run an AWS Glue Python Shell job, but it gives me a Connect Timeout error.
Error Image : https://i.stack.imgur.com/MHpHg.png
Script : https://i.stack.imgur.com/KQxkj.png
It looks like you haven't added a Secrets Manager endpoint to your VPC. Because the traffic does not leave the AWS network, there is no internet access inside your Glue job's VPC. So if you want to connect to Secrets Manager, you need to add its endpoint to your VPC.
Refer to this on how you can add this to your VPC and this to make sure you have properly configured security groups.
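If you create the endpoint from code rather than the console, the request could look roughly like this. A minimal sketch, assuming an interface endpoint created through boto3's EC2 create_vpc_endpoint call; the helper name and all ids in the example are placeholders.

```python
def secretsmanager_endpoint_params(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for an interface endpoint to Secrets Manager."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Service names for AWS services follow com.amazonaws.<region>.<service>
        "ServiceName": f"com.amazonaws.{region}.secretsmanager",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Assumption: private DNS is wanted so the default hostname resolves in the VPC
        "PrivateDnsEnabled": True,
    }


# Placeholder ids, for illustration only
params = secretsmanager_endpoint_params(
    "us-east-1", "vpc-0123", ["subnet-0123"], ["sg-0123"]
)
print(params["ServiceName"])
```

These would be passed as boto3.client("ec2", region_name="us-east-1").create_vpc_endpoint(**params). The security group still has to allow HTTPS from the Glue job's subnets.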
AWS Glue Git Issue
Hi,
We got AWS Glue Python Shell working with all dependencies as follows. Glue depends on awscli as well as boto3.
AWS Glue Python Shell with Internet
Add awscli and boto3 whl files to Python library path during Glue Job execution. This option is slow as it has to download and install dependencies.
Download the following whl files
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Upload the files to the S3 bucket under your chosen Python library path
Add the S3 paths of the whl files to the Python library path. Give the full S3 path of each whl file, separated by commas
AWS Glue Python Shell without Internet connectivity
Reference: AWS Wrangler Glue dependency build
We followed the steps mentioned above for awscli and boto3 whl files
Below is the latest requirements.txt compiled for the newest versions
colorama==0.4.3
docutils==0.15.2
rsa==4.5.0
s3transfer==0.3.3
PyYAML==5.3.1
botocore==1.19.23
pyasn1==0.4.8
jmespath==0.10.0
urllib3==1.26.2
python_dateutil==2.8.1
six==1.15.0
Download the dependencies to libs folder
pip download -r requirements.txt -d libs
Also move the original main whl files to the libs directory
awscli-1.18.183-py2.py3-none-any.whl
boto3-1.16.23-py2.py3-none-any.whl
Package as a zip file
cd libs
zip ../boto3-depends.zip *
Upload the boto3-depends.zip to s3 and add the path to Glue jobs Referenced files path
Note: It is Referenced files path and not Python library path
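The download-and-zip steps can also be scripted, which is handy if the bundle is rebuilt often. A minimal sketch using only the standard library; the function name bundle_dependencies is made up, and it assumes pip download has already filled the libs folder.

```python
import zipfile
from pathlib import Path


def bundle_dependencies(libs_dir, zip_path):
    """Zip every file in libs_dir at the archive root, like `cd libs; zip ../x.zip *`."""
    files = sorted(p for p in Path(libs_dir).iterdir() if p.is_file())
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # arcname drops the directory so the files sit at the top level of the zip
            zf.write(path, arcname=path.name)
    return [p.name for p in files]
```

For the layout above this would be called as bundle_dependencies("libs", "boto3-depends.zip") from the directory that contains libs.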
Placeholder code to install the latest awscli and boto3 and load them into the AWS Glue Python Shell.
import os.path
import subprocess
import sys
# borrowed from https://stackoverflow.com/questions/48596627/how-to-import-referenced-files-in-etl-scripts
def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
    # Search every directory on sys.path for the referenced file
    for dir_name in sys.path:
        candidate = os.path.join(dir_name, file_name)
        if matchFunc(candidate):
            return candidate
    raise Exception("Can't find file: {}".format(file_name))
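# The archive uploaded to the Referenced files path above is boto3-depends.zip
zip_file = get_referenced_filepath("boto3-depends.zip")
# Assumption: unpack the archive into the working directory
subprocess.run(["unzip", "-o", zip_file], check=True)
# Can't install --user, or without "-t ." because of permissions issues on the filesystem
# Assumption: install the bundled wheels offline into the current directory
subprocess.run("pip3 install --no-index --find-links=. -t . *.whl", shell=True, check=True)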
# Additional code as part of AWS Thread https://forums.aws.amazon.com/thread.jspa?messageID=954344
sys.path.insert(0, '/glue/lib/installation')
# Copy the keys first so modules can be deleted while looping
keys = list(sys.modules.keys())
for k in keys:
    if 'boto' in k:
        del sys.modules[k]
import boto3
print('boto3 version')
print(boto3.__version__)
Check if the code is working with latest AWS CLI API
Thanks
Sarath

EMR 5.21 , Spark 2.4 - Json4s Dependency broken

Issue
In EMR 5.21, the Spark-HBase integration is broken.
df.write.options().format().save() fails.
The reason is json4s-jackson version 3.5.3 in Spark 2.4 on EMR 5.21.
It works fine on EMR 5.11.2 with Spark 2.2 and json4s-jackson version 3.2.11.
The problem is that this is EMR, so I can't rebuild Spark with a lower json4s version.
Is there any workaround?
Error
py4j.protocol.Py4JJavaError: An error occurred while calling o104.save. : java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
Submission
spark-submit --master yarn \
--jars /usr/lib/hbase/ \
--packages com.hortonworks:shc-core:1.1.3-2.3-s_2.11 \
--repositories http://repo.hortonworks.com/content/groups/public/ \
pysparkhbase_V1.1.py s3://<bucket>/ <Namespace> <Table> <cf> <Key>
Code
import sys
from pyspark.sql.functions import concat
from pyspark import SparkContext
from pyspark.sql import SQLContext,SparkSession
spark = SparkSession.builder.master("yarn").appName("PysparkHbaseConnection").config("spark.some.config.option", "PyHbase").getOrCreate()
spark.sql("set spark.sql.parquet.compression.codec=uncompressed")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
df = spark.read.parquet(file)
df.createOrReplaceTempView("view")
.
cat = '{|"table":{"namespace":"' + namespace + '", "name":"' + name + '", "tableCoder":"' + tableCoder + '", "version":"' + version + '"}, \n|"rowkey":"' + rowkey + '", \n|"columns":{'
.
df.write.options(catalog=cat).format(data_source_format).save()
There's no obvious answer. A quick check of the SHC POM doesn't show a direct import of the json file, so you can't just change that pom & build the artifact yourself.
You're going to have to talk to the EMR team to get them to build the connector & HBase in sync.
FWIW, getting Jackson in sync is one of the stress points of releasing a big data stack, and the AWS SDK's habit of updating its requirements on its fortnightly releases is another. Hadoop moved to the AWS shaded SDK purely to stop AWS engineering decisions defining the choices for everyone.
Downgrading json4s to 3.2.10 can resolve it, but I think it's an SHC bug, and SHC needs to be upgraded.
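Before trying a downgrade, it may be worth confirming which json4s builds the cluster actually ships. A minimal sketch that reads versions off the jar file names; it assumes the usual artifact_scalaVersion-version.jar naming (for example json4s-jackson_2.11-3.5.3.jar), and the function name json4s_versions is made up.

```python
import re
from pathlib import Path

# e.g. json4s-jackson_2.11-3.5.3.jar -> ("json4s-jackson", "2.11", "3.5.3")
JAR_PATTERN = re.compile(r"^(json4s-[a-z]+)_(\d+\.\d+)-(\d+(?:\.\d+)*)\.jar$")


def json4s_versions(jar_dir):
    """Map each json4s artifact found in jar_dir to (scala_version, version)."""
    found = {}
    for jar in sorted(Path(jar_dir).glob("json4s-*.jar")):
        match = JAR_PATTERN.match(jar.name)
        if match:
            artifact, scala_version, version = match.groups()
            found[artifact] = (scala_version, version)
    return found
```

On a cluster node, json4s_versions("/usr/lib/spark/jars") would be the obvious first call, assuming Spark's jars live there. Comparing the result between a working and a failing release shows what changed.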

Pyspark AWS credentials

I'm trying to run a PySpark script that works fine when I run it on my local machine.
The issue is that I want to fetch the input files from S3.
No matter what I try, though, I can't seem to find where to set the ID and secret. I found some answers regarding specific files,
e.g. Locally reading S3 files through Spark (or better: pyspark)
but I want to set the credentials for the whole SparkContext as I reuse the sql context all over my code.
So the question is: how do I set the AWS access key and secret in Spark?
P.S. I tried the $SPARK_HOME/conf/hdfs-site.xml and environment variable options. Neither worked.
Thank you
For pyspark we can set the credentials as given below
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
Setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in spark-defaults.conf before establishing a spark session is a nice way to do it.
I have also had success with Spark 2.3.2 and a pyspark shell, setting these dynamically from within a Spark session as follows:
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)
After that, I was able to read from and write to S3 using s3a:
documents = spark.sparkContext.textFile('s3a://bucket_name/key')
I'm not sure if this was true at the time, but as of PySpark 2.4.5 you don't need to access the private _jsc object to set Hadoop properties. You can set Hadoop properties using SparkConf.set(). For example:
import pyspark
conf = (
pyspark.SparkConf()
.setAppName('app_name')
.setMaster(SPARK_MASTER)
.set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
.set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
)
sc = pyspark.SparkContext(conf=conf)
See https://spark.apache.org/docs/latest/configuration.html#custom-hadoophive-configuration
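To keep the keys out of the source file, the same two properties can be filled from the environment. A minimal sketch, assuming the credentials are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; the helper name s3a_credential_conf is made up for illustration.

```python
import os


def s3a_credential_conf(env=os.environ):
    """Return (key, value) pairs for the s3a credential properties."""
    try:
        access_key = env["AWS_ACCESS_KEY_ID"]
        secret_key = env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(
            f"Missing environment variable: {missing.args[0]}"
        ) from None
    return [
        ("spark.hadoop.fs.s3a.access.key", access_key),
        ("spark.hadoop.fs.s3a.secret.key", secret_key),
    ]
```

The pairs can be applied in one go with pyspark.SparkConf().setAll(s3a_credential_conf()) before the SparkContext is created.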
You can see a couple of suggestions here:
http://www.infoobjects.com/2016/02/27/different-ways-of-setting-aws-credentials-in-spark/
I usually do the 3rd one (set hadoopConfig on the SparkContext), as I want the credentials to be parameters within my code. So that I can run it from any machine.
For example:
JavaSparkContext javaSparkContext = new JavaSparkContext();
javaSparkContext.sc().hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "");
javaSparkContext.sc().hadoopConfiguration().set("fs.s3n.awsSecretAccessKey","");
The method where you add the AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY to hdfs-site.xml should ideally work. Just ensure that you run pyspark or spark-submit as follows:
spark-submit --master "local[*]" \
--driver-class-path /usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar \
--jars /usr/src/app/lib/hadoop-aws-2.6.0.jar,/usr/src/app/lib/aws-java-sdk-1.11.443.jar,/usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar \
repl-sql-s3-schema-change.py
pyspark --jars /usr/src/app/lib/hadoop-aws-2.6.0.jar,/usr/src/app/lib/aws-java-sdk-1.11.443.jar,/usr/src/app/lib/mssql-jdbc-6.4.0.jre8.jar
Setting them in core-site.xml, provided that directory is on the classpath, should work.