EMR 5.21, Spark 2.4 - Json4s dependency broken

Issue
In EMR 5.21, the Spark - HBase integration is broken:
df.write.options().format().save() fails.
The reason is json4s-jackson version 3.5.3 in Spark 2.4 / EMR 5.21;
it works fine in EMR 5.11.2, Spark 2.2, json4s-jackson version 3.2.11.
The problem is that this is EMR, so I can't rebuild Spark with a lower json4s version.
Is there any workaround?
Error
py4j.protocol.Py4JJavaError: An error occurred while calling o104.save. : java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse(Lorg/json4s/JsonInput;Z)Lorg/json4s/JsonAST$JValue;
Submission
spark-submit --master yarn \
--jars /usr/lib/hbase/ \
--packages com.hortonworks:shc-core:1.1.3-2.3-s_2.11 \
--repositories http://repo.hortonworks.com/content/groups/public/ \
pysparkhbase_V1.1.py s3://<bucket>/ <Namespace> <Table> <cf> <Key>
Code
import sys
from pyspark.sql.functions import concat
from pyspark import SparkContext
from pyspark.sql import SQLContext,SparkSession
spark = SparkSession.builder.master("yarn").appName("PysparkHbaseConnection").config("spark.some.config.option", "PyHbase").getOrCreate()
spark.sql("set spark.sql.parquet.compression.codec=uncompressed")
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
df = spark.read.parquet(file)
df.createOrReplaceTempView("view")
...
cat = '{"table":{"namespace":"' + namespace + '", "name":"' + name + '", "tableCoder":"' + tableCoder + '", "version":"' + version + '"}, "rowkey":"' + rowkey + '", "columns":{'
...
df.write.options(catalog=cat).format(data_source_format).save()
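For context, a full SHC catalog spells out every column mapping. A minimal, hypothetical example (the table and column names below are made up, not taken from the question) would look like:
# Hypothetical catalog: "key" is the HBase row key, "col1" lives in column family "cf1"
cat = '''{
  "table":   {"namespace": "default", "name": "test_table", "tableCoder": "PrimitiveType", "version": "2.0"},
  "rowkey":  "key",
  "columns": {
    "key":  {"cf": "rowkey", "col": "key",  "type": "string"},
    "col1": {"cf": "cf1",    "col": "col1", "type": "string"}
  }
}'''
df.write.options(catalog=cat).format(data_source_format).save()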

There's no obvious answer. A quick check of the SHC POM doesn't show a direct dependency on json4s, so you can't just change that POM and rebuild the artifact yourself.
You're going to have to talk to the EMR team to get them to build the connector and HBase in sync.
FWIW, keeping Jackson in sync is one of the stress points of releasing a big data stack, and the AWS SDK's habit of updating its requirements on its fortnightly releases is another ... Hadoop moved to the shaded AWS SDK purely to stop AWS engineering decisions defining the choices for everyone.

Downgrading json4s to 3.2.10 can resolve it, but I think it's an SHC bug; SHC needs to be upgraded.
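I haven't verified this on EMR, but one way to attempt the downgrade without rebuilding Spark is to ship an older json4s-jackson with the job (for example 3.2.11, the version that worked on EMR 5.11.2) and ask Spark to prefer the user classpath. These settings are normally passed on the spark-submit command line; setting them in the builder only takes effect if the JVM has not started yet, and userClassPathFirst can have side effects, so treat this as a sketch:
from pyspark.sql import SparkSession

# Sketch only: pull in SHC plus an older json4s-jackson and try to make it win
# over the 3.5.3 that ships with Spark 2.4 on EMR 5.21.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "com.hortonworks:shc-core:1.1.3-2.3-s_2.11,"
                 "org.json4s:json4s-jackson_2.11:3.2.11")
         .config("spark.jars.repositories",
                 "http://repo.hortonworks.com/content/groups/public/")
         .config("spark.driver.userClassPathFirst", "true")
         .config("spark.executor.userClassPathFirst", "true")
         .getOrCreate())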

Related

How to read a csv file from s3 bucket using pyspark

I'm using Apache Spark 3.1.0 with Python 3.9.6. I'm trying to read a CSV file from an AWS S3 bucket, something like this:
spark = SparkSession.builder.getOrCreate()
file = "s3://bucket/file.csv"
c = spark.read\
.csv(file)\
.count()
print(c)
But I'm getting the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o26.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
I understand that I need to add special libraries, but I didn't find any definite information about which ones and which versions. I've tried adding something like this to my code, but I'm still getting the same error:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
How can I fix this?
You need to use hadoop-aws version 3.2.0 for Spark 3. Specifying the hadoop-aws library in --packages is enough to read files from S3.
--packages org.apache.hadoop:hadoop-aws:3.2.0
You need to set the configurations below.
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")
After that, you can read the CSV file:
spark.read.csv("s3a://bucket/file.csv")
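Putting the answer together, a minimal end-to-end sketch (the bucket, key, and credentials are placeholders; spark.jars.packages must be set before the session is created, and instance roles or environment variables are preferable to hard-coded keys):
from pyspark.sql import SparkSession

# hadoop-aws 3.2.0 matches the Hadoop 3.2.x that Spark 3.1.0 ships with
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
         .getOrCreate())

# Placeholder credentials
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<access_key>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<secret_key>")

# Note the s3a:// scheme; the plain s3:// scheme is only wired up on EMR
c = spark.read.csv("s3a://bucket/file.csv").count()
print(c)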
Thanks Mohana for the pointer! After breaking my head for more than a day, I was finally able to figure it out. Summarizing my learnings:
Check which version of Hadoop your Spark comes with:
print(f'pyspark hadoop version: '
      f'{spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()}')
or look for
ls jars/hadoop*.jar
The issue I was having was an older version of Spark that I had installed a while back, which came with Hadoop 2.7 and was messing everything up.
This should give a brief idea of what binaries you need to download.
For me it was Spark 3.2.1 and Hadoop 3.3.1.
Hence I downloaded:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.3.1
https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-bundle/1.11.901 # added this just in case;
Placed these jar files in the spark installation dir:
spark/jars/
spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.1 runner.py
Then run your code snippet that reads from AWS S3.

Pyspark from S3 - java.lang.ClassNotFoundException: com.amazonaws.services.s3.model.MultiObjectDeleteException

I'm trying to get data from S3 with PySpark on an AWS EMR cluster.
I'm still getting this error - An error occurred while calling o27.parquet. : java.lang.NoClassDefFoundError: com/amazonaws/services/s3/model/MultiObjectDeleteException.
I have tried with different versions of jars/clusters, still no results.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().set("spark.jars","/usr/lib/spark/jars/hadoop-aws-3.2.1.jar,/usr/lib/spark/aws-java-sdk-s3-1.11.873.jar")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df2 = sqlContext.read.parquet("s3a://stackdev2prq/Badges/*")
I'm using hadoop-aws-3.2.1.jar and aws-java-sdk-s3-1.11.873.jar.
Spark 3.0.1 on Hadoop 3.2.1 YARN
I know I need the proper version of the aws-java-sdk, but how can I check which version I should download?
mvnrepository provides the information.
I don't see it for 3.2.1. Looking at the hadoop-project POM and the JIRA versions, HADOOP-15642 says "1.11.375"; the move to 1.11.563 only went in with 3.2.2.
Do put the whole (huge) aws-java-sdk-bundle on the classpath. That shades everything and avoids version-mismatch hell with Jackson, httpclient, etc.
That said: if you are working with EMR, you should just use the s3:// URLs and pick up the EMR team's S3 connector.
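If you do run outside EMR (or insist on s3a there), a minimal sketch of pulling the matched pair onto the classpath, using the versions mentioned above (the read path is the one from the question):
from pyspark.sql import SparkSession

# hadoop-aws 3.2.1 was built against aws-java-sdk-bundle 1.11.375 (HADOOP-15642);
# the bundle shades Jackson, httpclient, etc., which avoids version clashes.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-aws:3.2.1,"
                 "com.amazonaws:aws-java-sdk-bundle:1.11.375")
         .getOrCreate())

df = spark.read.parquet("s3a://stackdev2prq/Badges/*")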

AWS emr driver jars

I'm trying to use external drivers in AWS EMR 5.29 on pyspark notebooks via:
#%%configure -f
{ "conf": {"spark.jars":"s3://bucket/spark-redshift_2.10-2.0.1.jar,"
"s3://bucket/minimal-json-0.9.5.jar,"
"s3://bucket/spark-avro_2.11-3.0.0.jar,"
"s3://bucket/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar"}}
As per https://blog.benthem.io/2020/04/21/connect-aws-emr-to-spark.html
However, when trying
from pyspark.sql import SQLContext
sc = spark # existing SparkContext
sql_context = SQLContext(sc)
df = sql_context.read.format("com.databricks.spark.redshift")\
.option("url", jdbcUrl)\
.option("query","select * from test")\
.option("tempdir", "s3://")\
.load()
I get
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.redshift.
How can I troubleshoot this? I can confirm the emr role has access to the bucket as I can process a CSV file on the same bucket with spark. I can also confirm all the listed jar files are in the bucket.
Actually the way to troubleshoot this is to SSH into the master node and then look at the ivy logs:
/mnt/var/log/livy/livy-livy-server.out
and the downloaded jar files at
/var/lib/livy/.ivy2/jars/
Based on what I found out, I changed my code to:
%%configure -f
{
"conf": {
"spark.jars" : "s3://bucket/RedshiftJDBC4-no-awssdk-1.2.41.1065.jar",
"spark.jars.packages": "com.databricks:spark-redshift_2.10:2.0.0,org.apache.spark:spark-avro_2.11:2.4.0,com.eclipsesource.minimal-json:minimal-json:0.9.4"
}
}
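With Livy resolving those packages, the original read should then work. A quick, hypothetical sanity check plus the read (jdbcUrl and the tempdir bucket are placeholders, as in the question):
# Confirm the packages Livy resolved for this session
print(spark.sparkContext.getConf().get("spark.jars.packages", ""))

df = (spark.read.format("com.databricks.spark.redshift")
      .option("url", jdbcUrl)                  # your Redshift JDBC URL
      .option("query", "select * from test")
      .option("tempdir", "s3://bucket/tmp/")   # staging area the connector unloads to
      .load())
df.show()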

Clear data from HDFS on AWS EMR in Hadoop 1.0.3

For various reasons I'm running some jobs on EMR with AMI 2.4.11/Hadoop 1.0.3. I'm trying to run a cleanup of HDFS after my jobs by adding an additional EMR step. Using boto:
step = JarStep(
    'HDFS cleanup',
    'command-runner.jar',
    action_on_failure='CONTINUE',
    step_args=['hadoop', 'dfs', '-rmr', '-skipTrash', 'hdfs:/tmp'])
emr_conn.add_jobflow_steps(cluster_id, [step])
However it regularly fails with nothing in stderr in the EMR console.
What confuses me is that if I SSH into the master node and run the command:
hadoop dfs -rmr -skipTrash hdfs:/tmp
It succeeds with a 0 and a message that it successfully deleted everything. All the normal hadoop commands seem to work as documented. Does anyone know if there's an obvious reason for this? Issue with the Amazon distribution? Undocumented behavior in certain commands?
Note:
I have other jobs running in Hadoop 2 and the documented:
hdfs dfs -rm -r -skipTrash hdfs:/tmp
works as one would expect both as a step and as a command.
My general solution was to upgrade everything to Hadoop 2, in which case this works:
JarStep(
    '%s: HDFS cleanup' % self.job_name,
    'command-runner.jar',
    action_on_failure='CONTINUE',
    step_args=['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path]
)
This was the best I could get working on Hadoop 1, and it worked pretty well:
JarStep(
    '%s: HDFS cleanup' % self.job_name,
    'command-runner.jar',
    action_on_failure='CONTINUE',
    step_args=['hadoop', 'fs', '-rmr', '-skipTrash', 'hdfs:/tmp/mrjob']
)

AWS EMR and Spark 1.0.0

I've been running into some issues recently while trying to use Spark on an AWS EMR cluster.
I am creating the cluster using something like:
./elastic-mapreduce --create --alive \
--name "ll_Spark_Cluster" \
--bootstrap-action s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb \
--bootstrap-name "Spark/Shark" \
--instance-type m1.xlarge \
--instance-count 2 \
--ami-version 3.0.4
The issue is that whenever I try to get data from S3 I get an exception.
So if I start the spark-shell and try something like:
val data = sc.textFile("s3n://your_s3_data")
I get the following exception:
WARN storage.BlockManager: Putting block broadcast_1 failed
java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
This issue was caused by the Guava library: the version on the AMI is 11, while Spark needs version 14.
I edited the AWS bootstrap script to install Spark 1.0.2 and update the Guava library during the bootstrap action; you can get the gist here:
https://gist.github.com/tnbredillet/867111b8e1e600fa588e
Even after updating Guava I still had an issue.
When I tried to save data to S3, an exception was thrown:
lzo.GPLNativeCodeLoader - Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
I solved that by adding the hadoop native library to the java.library.path.
When I run a job I add the option
-Djava.library.path=/home/hadoop/lib/native
or, if I run a job through spark-submit, I add the
--driver-library-path /home/hadoop/lib/native
argument.
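For completeness, a rough PySpark equivalent of those flags (untested on this old AMI) is to set the extra library path on both the driver and the executors before the context starts:
from pyspark import SparkConf, SparkContext

# Point the JVMs at the Hadoop native libraries so the GPL/LZO codec can load
conf = (SparkConf()
        .set("spark.driver.extraLibraryPath", "/home/hadoop/lib/native")
        .set("spark.executor.extraLibraryPath", "/home/hadoop/lib/native"))
sc = SparkContext(conf=conf)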