Old-style mapred API in HBase does not work - mapreduce

I have a MapReduce job, which takes HBase table as the output destination
of my reduce job. My reducer class implements the TableMap interface in
package org.apache.hadoop.hbase.mapred, and I used the initTableReduceJob()
function in TableMapReduceUtil class from
package org.apache.hadoop.hbase.mapred to configure my job.
But when I run my job, I got the following error at reduce stage
java.lang.NullPointerException
at org.apache.hadoop.mapred.Task.getFsStatistics(Task.java:1099)
at
org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:442)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
My HBase version is 0.94.0 and my Hadoop version is 1.0.1.
I found a post similar to my question at:
https://forums.aws.amazon.com/thread.jspa?messageID=394846
Could anyone give me some hint about why this happened? Should I just stick
with the org.apache.hadoop.hbase.mapreduce package?

This error suggests that you may be running HBase on the local filesystem without HDFS. Try installing or running Hadoop HDFS. The org.apache.hadoop.mapred API appears to require HDFS.
As a possible convenience, you may try the Cloudera development VM, which packages both.

Related

AWS EMR Spark error with `Failed to load class of driverClassName com.mysql.jdbc.Driver`

I'm currently trying to add a process in EMR 6.1.0 that will use Spark to store aggregated data in mysql.
However, when I actually run Spark, I get the following error.
Exception in thread "main" java.lang.RuntimeException: Failed to load class of driverClassName com.mysql.jdbc.
This error did not occur in EMR 6.0.0.
In the process of updating from EMR 6.0.0 to 6.1.0, I changed the Spark version from 2.4.4 to 3.0.0.
The code itself has not changed significantly, and we know that it is not a network problem.
I've spent a lot of time looking through the AWS documentation and can't seem to find any hints.
Can anyone help me?
Place the MySQL connector jar under $SPARK_HOME/jars folder or pass the the MySQL connector jar path in spark-shell/spark-submit command using --jars flag.
Spark 3.x depends on HikariCP.
https://github.com/apache/spark/blob/v3.0.0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L1
Preloaded HikariCP can't load your application classes due to ClassLoader.
https://github.com/brettwooldridge/HikariCP/blob/HikariCP-2.5.1/src/main/java/com/zaxxer/hikari/HikariConfig.java#L318
this.getClass().getClassLoader().loadClass(driverClassName)
You should add shade settings if use sbt-assemlby plugin.
assembly / assemblyShadeRules := {
Seq("com.zaxxer.hikari").map { packageName =>
ShadeRule.rename(s"${packageName}.**" -> s"my_app_shade_package.${packageName}.#1").inAll
}
}

Add pyspark script as AWS step

I have a pyspark script to read an xml file(present in S3). I need to add this as a step in aws. I have used the following command
aws emr add-steps — cluster-id <cluster_id> — steps Type=spark,Name=POI,Args=[ — deploy-mode,cluster, — master,yarn, — conf,spark.yarn.submit.waitAppCompletion=true,<s3 location of pyspark script>],ActionOnFailure=CONTINUE
I have downloaded the spark-xml jar to the master node during bootstrap and its present under
/home/hadoop
location. Also in the python script I have included
conf = SparkConf().setAppName('Project').set("spark.jars", "/home/hadoop/spark-xml_2.11-0.4.1.jar").set("spark.driver.extraClassPath", "/home/hadoop/spark-xml_2.11-0.4.1.jar")
But still its showing
py4j.protocol.Py4JJavaError: An error occurred while calling o56.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
You have set master as yarn and deploy-mode as cluster. That means your spark driver will be in one of CORE nodes.
Anyway, EMR by default is configured to create Application master on one of the CORE node and application master will have the driver in it.
Please refer this article for more info.
So you have to put your jar in all CORE nodes (Not in MASTER) and refer the file file:///home/hadoop/spark-xml_2.11-0.4.1.jar in this manner.
Or there is better way to put it in HDFS (Lets say under hdfs:///user/hadoop) and refer that hdfs:///user/hadoop/spark-xml_2.11-0.4.1.jar

AWS EMR pyspark notebook fails with `Failed to run command /usr/bin/virtualenv (...)`

I have created a basic EMR cluster in AWS, and I'm trying to use the Jupyter Notebooks provided through the AWS Console. Launching the notebooks seems to work fine, and I'm also able to run basic python code in notebooks started with the pyspark kernel. Two variables are set up in the notebook: spark is a SparkSession instance, and sc is a SparkContext instance. Displaying sc yields <SparkContext master=yarn appName=livy-session-0> (the output can of course vary slightly depending on the session).
The problem arises once I perform operations that actually hit the spark machinery. For example:
sc.parallelize(list(range(10))).map(lambda x: x**2).collect()
I am no spark expert, but I believe this code should distribute the integers from 0 to 9 across the cluster, square them, and return the results in a list. Instead, I get a lengthy stack trace, mostly from the JVM, but also some python components. I believe the central part of the stack trace is the following:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 116, ip-XXXXXXXXXXXXX.eu-west-1.compute.internal, executor 17): java.lang.RuntimeException: Failed to run command: /usr/bin/virtualenv -p python3 --system-site-packages virtualenv_application_1586243436143_0002_0
The full stack trace is here.
A bit of digging in the AWS portal led me to log output from the nodes. stdout from one of the nodes includes the following:
The path python3 (from --python=python3) does not exist
I tried running the /usr/bin/virtualenv command on the master node manually (after logging in through), and that worked fine, but the error is of course still present after I did that.
While this error occurs most of the time, I was able to get this working in one session, where I could run several operations against the spark cluster as I was expecting.
Technical information on the cluster setup:
emr-6.0.0
Applications installed are "Ganglia 3.7.2, Spark 2.4.4, Zeppelin 0.9.0, Livy 0.6.0, JupyterHub 1.0.0, Hive 3.1.2". Hadoop is also included.
3 nodes (one of them as master), all r5a.2xlarge.
Any ideas what I'm doing wrong? Note that I am completely new to EMR and Spark.
Edit: Added the stdout log and information about running the virtualenv command manually on the master node through ssh.
I have switched to using emr-5.29.0, which seems to resolve the problem. Perhaps this is an issue with emr-6.0.0? In any case, I have a functional workaround.
The issue for me was that the virtualenv was being made on the executors with a python path that didn't exist. Pointing the executors to the right one did the job for me:
"spark.pyspark.python": "/usr/bin/python3.7"
Here is how I reconfiged the spark app at the beginning of the notebook:
{"conf":{"spark.pyspark.python": "/usr/bin/python3.7",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"}
}

Flume sink to HDFS error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument

With:
Java 1.8.0_231
Hadoop 3.2.1
Flume 1.8.0
Have created a hdfs service on 9000 port.
jps:
11688 DataNode
10120 Jps
11465 NameNode
11964 SecondaryNameNode
12621 NodeManager
12239 ResourceManager
Flume conf:
agent1.channels.memory-channel.type=memory
agent1.sources.tail-source.type=exec
agent1.sources.tail-source.command=tail -F /var/log/nginx/access.log
agent1.sources.tail-source.channels=memory-channel
#hdfs sink
agent1.sinks.hdfs-sink.channel=memory-channel
agent1.sinks.hdfs-sink.type=hdfs
agent1.sinks.hdfs-sink.hdfs.path=hdfs://cluster01:9000/system.log
agent1.sinks.hdfs-sink.hdfs.fileType=DataStream
agent1.channels=memory-channel
agent1.sources=tail-source
agent1.sinks=log-sink hdfs-sink
Then start flume:
./bin/flume-ng agent --conf conf -conf-file conf/test1.conf --name agent1 -Dflume.root.logger=INFO,console
Then meet error:
Info: Including Hadoop libraries found via (/usr/local/hadoop-3.2.1/bin/hadoop) for HDFS access
...
2019-11-04 14:48:24,818 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SINK, name: hdfs-sink started
2019-11-04 14:48:28,823 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:57)] Serializer = TEXT, UseRawLocalFileSystem = false
2019-11-04 14:48:28,836 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:226)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:541)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:401)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:226)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:541)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:401)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
I have searched for a while but haven't found same error on net. Is there any advice to solve this problem?
That may caused by lib/guava.
I removed lib/guava-11.0.2.jar, and restart flume, found it works.
outputs:
2019-11-04 16:52:58,062 (hdfs-hdfs-sink-call-runner-0) [WARN - org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:60)] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-04 16:53:01,532 (Thread-9) [INFO - org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:239)] SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
But I still don't know which version of guava it using now.
I had the same issue. It seems to be a bug in flume. It references a class name that does not exist in that version of guava
Replace guava-11.x.x.jar file with guava-27.x.x.jar from hadoop 3 common library, this will work
hadoop-3.3.0/share/hadoop/common/lib/guava-27.0-jre.jar put this file in your flume library, don't forget to delete older version from flume library first
As others said, there is clash between guava-11 (hadoop2/flume 1.8.0/1.9.0) and guava-27 (hadoop3).
Other answers don't explain the root cause of the issue: the script under $FLUME_HOME/bin/flume-ng puts into flume classpath all the jars in our hadoop distribution if $HADOOP_HOME environment variable is set.
Few words on why the suggested actions "fix" the problem: deleting $FLUME_HOME/lib/guava-11.0.2.jar leaves only guava-27.0-jre.jar, no more clash.
So, there is no need to copy it under $FLUME_HOME/lib, and it's no bug from Flume, just a version incompatibility because Flume did not upgrade Guava, while Hadoop 3 did.
I did not dig into the details of the changes between those guava versions, it might happen that everything works fine until it does not (for instance, if there is any backward incompatible change between the two).
So, before using this "fix" in production environment, I suggest to test extensively to reduce the risk of unexpected problems.
The best solution would be to wait (or contribute) a new Flume version where Guava is upgrade to v27.
I do agree with Alessandro S.
Flume communicates with HDFS via the HDFS APIs, and it doesn't matter which version the hadoop platform runs with if the APIs do not change which is the most cases. Actually flume is build with some specific version of hadoop library. The problem is that you use the wrong hadoop library version to run flume。
So just use hadoop library from version 2.x.x to run your 1.8.0 flume

Unable to load S3 parquet with postgresql driver in spark-shell [duplicate]

Trying to read a file located in S3 using spark-shell:
scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12
scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
... etc ...
The IOException: No FileSystem for scheme: s3n error occurred with:
Spark 1.31 or 1.40 on dev machine (no Hadoop libs)
Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.60) which integrates Spark 1.2.1 out of the box
Using s3:// or s3n:// scheme
What is the cause of this error? Missing dependency, Missing configuration, or mis-use of sc.textFile()?
Or may be this is due to a bug that affects Spark build specific to Hadoop 2.60 as this post seems to suggest. I am going to try Spark for Hadoop 2.40 to see if this solves the issue.
Confirmed that this is related to the Spark build against Hadoop 2.60. Just installed Spark 1.4.0 "Pre built for Hadoop 2.4 and later" (instead of Hadoop 2.6). And the code now works OK.
sc.textFile("s3n://bucketname/Filename") now raises another error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
The code below uses the S3 URL format to show that Spark can read S3 file. Using dev machine (no Hadoop libs).
scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey#zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21
scala> lyrics.count
res1: Long = 9
Even Better: the code above with AWS credentials inline in the S3N URI will break if the AWS Secret Key has a forward "/". Configuring AWS Credentials in SparkContext will fix it. Code works whether the S3 file is public or private.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
Despite that this question has already an accepted answer, I think that the exact details of why this is happening are still missing. So I think there might be a place for one more answer.
If you add the required hadoop-aws dependency, your code should work.
Starting Hadoop 2.6.0, s3 FS connector has been moved to a separate library called hadoop-aws.
There is also a Jira for that:
Move s3-related FS connector code to hadoop-aws.
This means that any version of spark, that has been built against Hadoop 2.6.0 or newer will have to use another external dependency to be able to connect to the S3 File System.
Here is an sbt example that I have tried and is working as expected using Apache Spark 1.6.2 built against Hadoop 2.6.0:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"
In my case, I encountered some dependencies issues, so I resolved by adding exclusion:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")
On other related note, I have yet to try it, but that it is recommended to use "s3a" and not "s3n" filesystem starting Hadoop 2.6.0.
The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.
You can add the --packages parameter with the appropriate jar:
to your submission:
bin/spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 code.py
I had to copy the jar files from a hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work.
Details:
Spark 2.3.0
Hadoop downloaded was 2.7.6
Two jar files copied were from (hadoop dir)/share/hadoop/tools/lib/
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
This is a sample spark code which can read the files present on s3
val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", s3Key)
hadoopConf.set("fs.s3.awsSecretAccessKey", s3Secret)
var jobInput = sparkContext.textFile("s3://" + s3_location)
Ran into the same problem in Spark 2.0.2. Resolved it by feeding it the jars. Here's what I ran:
$ spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId",awsAccessKeyId)
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.read.parquet("s3://your-s3-bucket/")
obviously, you need to have the jars in the path where you're running spark-shell from
There is a Spark JIRA, SPARK-7481, open as of today, oct 20, 2016, to add a spark-cloud module which includes transitive dependencies on everything s3a and azure wasb: need, along with tests.
And a Spark PR to match. This is how I get s3a support into my spark builds
If you do it by hand, you must get hadoop-aws JAR of the exact version the rest of your hadoop JARS have, and a version of the AWS JARs 100% in sync with what Hadoop aws was compiled against. For Hadoop 2.7.{1, 2, 3, ...}
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
+ jackson-*-2.6.5.jar
Stick all of these into SPARK_HOME/jars. Run spark with your credentials set up in Env vars or in spark-default.conf
the simplest test is can you do a line count of a CSV File
val landsatCSV = "s3a://landsat-pds/scene_list.gz"
val lines = sc.textFile(landsatCSV)
val lineCount = lines.count()
Get a number: all is well. Get a stack trace. Bad news.
For Spark 1.4.x "Pre built for Hadoop 2.6 and later":
I just copied needed S3, S3native packages from hadoop-aws-2.6.0.jar to
spark-assembly-1.4.1-hadoop2.6.0.jar.
After that I restarted spark cluster and it works.
Do not forget to check owner and mode of the assembly jar.
I was facing the same issue. It worked fine after setting the value for fs.s3n.impl and adding hadoop-aws dependency.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
S3N is not a default file format. You need to build your version of Spark with a version of Hadoop that has the additional libraries used for AWS compatibility. Additional info I found here, https://www.hakkalabs.co/articles/making-your-local-hadoop-more-like-aws-elastic-mapreduce
You probably have to use s3a:/ scheme instead of s3:/ or s3n:/
However, it is not working out of the box (for me) for the spark shell. I see the following stacktrace:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC.<init>(<console>:39)
at $iwC.<init>(<console>:41)
at <init>(<console>:43)
at .<init>(<console>:47)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 68 more
What I think - you have to manually add the hadoop-aws dependency manually http://search.maven.org/#artifactdetails|org.apache.hadoop|hadoop-aws|2.7.1|jar But I have no idea how to add it to spark-shell properly.
Download the hadoop-aws jar from maven repository matching your hadoop version.
Copy the jar to $SPARK_HOME/jars location.
Now in your Pyspark script, setup AWS Access Key & Secret Access Key.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")
// where spark is SparkSession instance
For Spark scala:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")
I was able to to read my S3 parquet files (Spark 3.3.1, Hadoop 3) using the configuration proposed here:
spark = SparkSession.builder.appName("Test_Parquet").master("local[*]")\
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")\
.config("spark.jars.packages",
"org.apache.hadoop:hadoop-aws:3.2.2,"
"com.amazonaws:aws-java-sdk-bundle:1.12.180").getOrCreate()
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl",\
"org.apache.hadoop.fs.s3a.S3A")
df = spark.read.parquet(f"s3a://{bucket_name}/{file_name}")
USe s3a instead of s3n. I had similar issue on a Hadoop job. After switching from s3n to s3a it worked.
e.g.
s3a://myBucket/myFile1.log