Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket - amazon-web-services

It's been a couple of days but I could not download from public Amazon Bucket using Spark :(
Here is spark-shell command:
spark-shell --master yarn
-v
--jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar
--driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar
Application started and shell waiting for prompt:
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> data1.count()
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 49 elided
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.fs.StorageStatistics
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 77 more
scala>
All AWS keys, secret-keys was set in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services
The bucket is public - anyone can download (tested with curl -O)
All .jars as you can see was provided by Hadoop itself from /usr/local/hadoop/share/hadoop/tools/lib/ folder
There's no additional settings in spark-defaults.conf - only what was sent in command line
Both jars does not provide this class:
jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
What should I do ? Did I forget to add another jar ? What the exact configuration of hadoop-aws and aws-java-sdk-bundle ? versions ?

Mmmm.... I found the problem, finally..
The main issue is Spark that I have is pre-installed for Hadoop. It's 'v2.4.0 pre-build for Hadoop 2.7 and later'. This is bit of misleading title as you see my struggles with it above. Actually Spark shipped with different version of hadoop jars. The listing from: /usr/local/spark/jars/ shows that it have:
hadoop-common-2.7.3.jar
hadoop-client-2.7.3.jar
....
it only missing: hadoop-aws and aws-java-sdk. I little bit digging in Maven repository: hadoop-aws-v2.7.3 and it dependency: aws-java-sdk-v1.7.4 and voila ! Downloaded those jar and send them as parameters to Spark. Like this:
spark-shell
--master yarn
-v
--jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar
--driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar
Did the job !!!
I'm just wondering why all jars from Hadoop (and I send all of them as parameter to --jars and --driver-class-path) didn't catch up. Spark somehow automatically choose it jars and not what I send

I advise you not to do what you did.
You are running pre built spark with hadoop 2.7.2 jars on hadoop 2.9.2 and you added to the classpath some more jars to work with s3 from the hadoop 2.7.3 version to solve the issue.
What you should be doing is working with a "hadoop free" spark version - and provide the hadoop file by configuration as you can see in the following link -
https://spark.apache.org/docs/2.4.0/hadoop-provided.html
The main parts:
in conf/spark-env.sh
If hadoop binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
With explicit path to hadoop binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)

I use spark 2.4.5 and this is what I did and it worked for me. I am able to connect to AWS s3 from Spark in my local.
(1) Download spark 2.4.5 from here:https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12.tgz. This spark does not have hadoop in it.
(2) Download hadoop. https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
(3) Update .bash_profile
SPARK_HOME = <SPARK_PATH> #example /home/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12
PATH=$SPARK_HOME/bin
(4) Add Hadoop in spark env
Copy spark-env.sh.template as spark-env.sh
add export SPARK_DIST_CLASSPATH=$(<hadoop_path> classpath)
here <hadoop_path> is path to your hadoop /home/hadoop-3.2.1/bin/hadoop

Related

amazon EMR spark-submit doesn't allow docker Image pattern sha256 diges

I am using Amazon EMR
my question log is following.
Image name '<account_id>.dkr.ecr.ap-northeast-2.amazonaws.com/pyspark-etl#sha256:3d3a07135.......https://forums.aws.amazon.com/....0.87563e51ef5c841841a5d1a6dde9c' doesn't match docker image name pattern
Release label:emr-6.2.0
Hadoop distribution:Amazon 3.2.1
Applications:Spark 3.0.1,
Hive 3.1.2,
JupyterHub 1.1.0,
Ganglia 3.7.2
Zeppelin 0.9.0
Livy 0.7.0 \
Hue 4.8.0
PrestoSQL 343
The reason I had to use sha256 digest is becuase I previously used TAG:latest pyspark image hardcoded in airflow job ALSO containerized in ECR image.
so, when my airflow container runs a EMROperator(SSHoperator precisely) as a CLI spark-submit. It pull :latest spark container which doesn't update because of some reason.
It is strange because when I ssh into core instance, I am able to pull sha256 name pattern from ECR, also update :latest TAG if something is changed(so digest changed).
I think this is something about spark configuration or spark source from AWS which prohibited digest name pattern, but I can't debug this because I do not have spark(Amazon) source on my own. I would appreciate your answer.
Many thanks,
I am editing on my own because I got an answer from someone.
The problem was YARN configuration installed on my EMR master node. YARN default setting for docker default image update is False.
https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java#L344
so, in order to fix default setting, you should go /etc/hadoop/conf/ and find a yarn-sites.xml and fix(or add) this
<property>
<name>yarn.nodemanager.runtime.linux.docker.image-update</name>
<value>false</value>
<description>
Optional. Default option to decide whether to pull the latest image
or not.
</description>
</property>
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/DockerContainers.html

Flume sink to HDFS error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument

With:
Java 1.8.0_231
Hadoop 3.2.1
Flume 1.8.0
Have created a hdfs service on 9000 port.
jps:
11688 DataNode
10120 Jps
11465 NameNode
11964 SecondaryNameNode
12621 NodeManager
12239 ResourceManager
Flume conf:
agent1.channels.memory-channel.type=memory
agent1.sources.tail-source.type=exec
agent1.sources.tail-source.command=tail -F /var/log/nginx/access.log
agent1.sources.tail-source.channels=memory-channel
#hdfs sink
agent1.sinks.hdfs-sink.channel=memory-channel
agent1.sinks.hdfs-sink.type=hdfs
agent1.sinks.hdfs-sink.hdfs.path=hdfs://cluster01:9000/system.log
agent1.sinks.hdfs-sink.hdfs.fileType=DataStream
agent1.channels=memory-channel
agent1.sources=tail-source
agent1.sinks=log-sink hdfs-sink
Then start flume:
./bin/flume-ng agent --conf conf -conf-file conf/test1.conf --name agent1 -Dflume.root.logger=INFO,console
Then meet error:
Info: Including Hadoop libraries found via (/usr/local/hadoop-3.2.1/bin/hadoop) for HDFS access
...
2019-11-04 14:48:24,818 (lifecycleSupervisor-1-1) [INFO - org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:95)] Component type: SINK, name: hdfs-sink started
2019-11-04 14:48:28,823 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.HDFSDataStream.configure(HDFSDataStream.java:57)] Serializer = TEXT, UseRawLocalFileSystem = false
2019-11-04 14:48:28,836 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:447)] process failed
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:226)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:541)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:401)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1679)
at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:226)
at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:541)
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:401)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
at java.lang.Thread.run(Thread.java:748)
I have searched for a while but haven't found same error on net. Is there any advice to solve this problem?
That may caused by lib/guava.
I removed lib/guava-11.0.2.jar, and restart flume, found it works.
outputs:
2019-11-04 16:52:58,062 (hdfs-hdfs-sink-call-runner-0) [WARN - org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:60)] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-11-04 16:53:01,532 (Thread-9) [INFO - org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:239)] SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
But I still don't know which version of guava it using now.
I had the same issue. It seems to be a bug in flume. It references a class name that does not exist in that version of guava
Replace guava-11.x.x.jar file with guava-27.x.x.jar from hadoop 3 common library, this will work
hadoop-3.3.0/share/hadoop/common/lib/guava-27.0-jre.jar put this file in your flume library, don't forget to delete older version from flume library first
As others said, there is clash between guava-11 (hadoop2/flume 1.8.0/1.9.0) and guava-27 (hadoop3).
Other answers don't explain the root cause of the issue: the script under $FLUME_HOME/bin/flume-ng puts into flume classpath all the jars in our hadoop distribution if $HADOOP_HOME environment variable is set.
Few words on why the suggested actions "fix" the problem: deleting $FLUME_HOME/lib/guava-11.0.2.jar leaves only guava-27.0-jre.jar, no more clash.
So, there is no need to copy it under $FLUME_HOME/lib, and it's no bug from Flume, just a version incompatibility because Flume did not upgrade Guava, while Hadoop 3 did.
I did not dig into the details of the changes between those guava versions, it might happen that everything works fine until it does not (for instance, if there is any backward incompatible change between the two).
So, before using this "fix" in production environment, I suggest to test extensively to reduce the risk of unexpected problems.
The best solution would be to wait (or contribute) a new Flume version where Guava is upgrade to v27.
I do agree with Alessandro S.
Flume communicates with HDFS via the HDFS APIs, and it doesn't matter which version the hadoop platform runs with if the APIs do not change which is the most cases. Actually flume is build with some specific version of hadoop library. The problem is that you use the wrong hadoop library version to run flume。
So just use hadoop library from version 2.x.x to run your 1.8.0 flume

Unable to load S3 parquet with postgresql driver in spark-shell [duplicate]

Trying to read a file located in S3 using spark-shell:
scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12
scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
... etc ...
The IOException: No FileSystem for scheme: s3n error occurred with:
Spark 1.31 or 1.40 on dev machine (no Hadoop libs)
Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.60) which integrates Spark 1.2.1 out of the box
Using s3:// or s3n:// scheme
What is the cause of this error? Missing dependency, Missing configuration, or mis-use of sc.textFile()?
Or may be this is due to a bug that affects Spark build specific to Hadoop 2.60 as this post seems to suggest. I am going to try Spark for Hadoop 2.40 to see if this solves the issue.
Confirmed that this is related to the Spark build against Hadoop 2.60. Just installed Spark 1.4.0 "Pre built for Hadoop 2.4 and later" (instead of Hadoop 2.6). And the code now works OK.
sc.textFile("s3n://bucketname/Filename") now raises another error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
The code below uses the S3 URL format to show that Spark can read S3 file. Using dev machine (no Hadoop libs).
scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey#zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21
scala> lyrics.count
res1: Long = 9
Even Better: the code above with AWS credentials inline in the S3N URI will break if the AWS Secret Key has a forward "/". Configuring AWS Credentials in SparkContext will fix it. Code works whether the S3 file is public or private.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
Despite that this question has already an accepted answer, I think that the exact details of why this is happening are still missing. So I think there might be a place for one more answer.
If you add the required hadoop-aws dependency, your code should work.
Starting Hadoop 2.6.0, s3 FS connector has been moved to a separate library called hadoop-aws.
There is also a Jira for that:
Move s3-related FS connector code to hadoop-aws.
This means that any version of spark, that has been built against Hadoop 2.6.0 or newer will have to use another external dependency to be able to connect to the S3 File System.
Here is an sbt example that I have tried and is working as expected using Apache Spark 1.6.2 built against Hadoop 2.6.0:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"
In my case, I encountered some dependencies issues, so I resolved by adding exclusion:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")
On other related note, I have yet to try it, but that it is recommended to use "s3a" and not "s3n" filesystem starting Hadoop 2.6.0.
The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.
You can add the --packages parameter with the appropriate jar:
to your submission:
bin/spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 code.py
I had to copy the jar files from a hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work.
Details:
Spark 2.3.0
Hadoop downloaded was 2.7.6
Two jar files copied were from (hadoop dir)/share/hadoop/tools/lib/
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
This is a sample spark code which can read the files present on s3
val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", s3Key)
hadoopConf.set("fs.s3.awsSecretAccessKey", s3Secret)
var jobInput = sparkContext.textFile("s3://" + s3_location)
Ran into the same problem in Spark 2.0.2. Resolved it by feeding it the jars. Here's what I ran:
$ spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId",awsAccessKeyId)
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.read.parquet("s3://your-s3-bucket/")
obviously, you need to have the jars in the path where you're running spark-shell from
There is a Spark JIRA, SPARK-7481, open as of today, oct 20, 2016, to add a spark-cloud module which includes transitive dependencies on everything s3a and azure wasb: need, along with tests.
And a Spark PR to match. This is how I get s3a support into my spark builds
If you do it by hand, you must get hadoop-aws JAR of the exact version the rest of your hadoop JARS have, and a version of the AWS JARs 100% in sync with what Hadoop aws was compiled against. For Hadoop 2.7.{1, 2, 3, ...}
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
+ jackson-*-2.6.5.jar
Stick all of these into SPARK_HOME/jars. Run spark with your credentials set up in Env vars or in spark-default.conf
the simplest test is can you do a line count of a CSV File
val landsatCSV = "s3a://landsat-pds/scene_list.gz"
val lines = sc.textFile(landsatCSV)
val lineCount = lines.count()
Get a number: all is well. Get a stack trace. Bad news.
For Spark 1.4.x "Pre built for Hadoop 2.6 and later":
I just copied needed S3, S3native packages from hadoop-aws-2.6.0.jar to
spark-assembly-1.4.1-hadoop2.6.0.jar.
After that I restarted spark cluster and it works.
Do not forget to check owner and mode of the assembly jar.
I was facing the same issue. It worked fine after setting the value for fs.s3n.impl and adding hadoop-aws dependency.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
S3N is not a default file format. You need to build your version of Spark with a version of Hadoop that has the additional libraries used for AWS compatibility. Additional info I found here, https://www.hakkalabs.co/articles/making-your-local-hadoop-more-like-aws-elastic-mapreduce
You probably have to use s3a:/ scheme instead of s3:/ or s3n:/
However, it is not working out of the box (for me) for the spark shell. I see the following stacktrace:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC.<init>(<console>:39)
at $iwC.<init>(<console>:41)
at <init>(<console>:43)
at .<init>(<console>:47)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 68 more
What I think - you have to manually add the hadoop-aws dependency manually http://search.maven.org/#artifactdetails|org.apache.hadoop|hadoop-aws|2.7.1|jar But I have no idea how to add it to spark-shell properly.
Download the hadoop-aws jar from maven repository matching your hadoop version.
Copy the jar to $SPARK_HOME/jars location.
Now in your Pyspark script, setup AWS Access Key & Secret Access Key.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")
// where spark is SparkSession instance
For Spark scala:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")
I was able to to read my S3 parquet files (Spark 3.3.1, Hadoop 3) using the configuration proposed here:
spark = SparkSession.builder.appName("Test_Parquet").master("local[*]")\
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")\
.config("spark.jars.packages",
"org.apache.hadoop:hadoop-aws:3.2.2,"
"com.amazonaws:aws-java-sdk-bundle:1.12.180").getOrCreate()
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl",\
"org.apache.hadoop.fs.s3a.S3A")
df = spark.read.parquet(f"s3a://{bucket_name}/{file_name}")
USe s3a instead of s3n. I had similar issue on a Hadoop job. After switching from s3n to s3a it worked.
e.g.
s3a://myBucket/myFile1.log

on Amazon EMR 4.0.0, setting /etc/spark/conf/spark-env.conf is ineffective

I'm launching my spark-based hiveserver2 on Amazon EMR, which has an extra classpath dependency. Due to this bug in Amazon EMR:
https://petz2000.wordpress.com/2015/08/18/get-blas-working-with-spark-on-amazon-emr/
My classpath cannot be submitted through "--driver-class-path" option
So I'm bounded to modify /etc/spark/conf/spark-env.conf to add the extra classpath:
# Add Hadoop libraries to Spark classpath
SPARK_CLASSPATH="${SPARK_CLASSPATH}:${HADOOP_HOME}/*:${HADOOP_HOME}/../hadoop-hdfs/*:${HADOOP_HOME}/../hadoop-mapreduce/*:${HADOOP_HOME}/../hadoop-yarn/*:/home/hadoop/git/datapassport/*"
where "/home/hadoop/git/datapassport/*" is my classpath.
However after launching the server successfully, the Spark environment parameter shows that my change is ineffective:
spark.driver.extraClassPath :/usr/lib/hadoop/*:/usr/lib/hadoop/../hadoop-hdfs/*:/usr/lib/hadoop/../hadoop-mapreduce/*:/usr/lib/hadoop/../hadoop-yarn/*:/etc/hive/conf:/usr/lib/hadoop/../hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
Is this configuration file obsolete? Where is the new file and how to fix this problem?
You can use the --driver-classpath.
Start a spark-shell on the master node from a fresh EMR cluster.
spark-shell --master yarn-client
scala> sc.getConf.get("spark.driver.extraClassPath")
res0: String = /etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
Add your JAR files to the EMR cluster using a --bootstrap-action.
When you call spark-submit prepend (or append) your JAR files to the value of extraClassPath you got from spark-shell
spark-submit --master yarn-cluster --driver-classpath /home/hadoop/my-custom-jar.jar:/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*
This worked for me using EMR release builds 4.1 and 4.2.
The process for building spark.driver.extraClassPath may change between releases, which may be the reason why SPARK_CLASSPATH doesn't work anymore.
Have you tried setting spark.driver.extraClassPath in spark-defaults? Something like this:
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.driver.extraClassPath": "${SPARK_CLASSPATH}:${HADOOP_HOME}/*:${HADOOP_HOME}/../hadoop-hdfs/*:${HADOOP_HOME}/../hadoop-mapreduce/*:${HADOOP_HOME}/../hadoop-yarn/*:/home/hadoop/git/datapassport/*"
}
}
]

AWS EMR and Spark 1.0.0

I've been running into some issues recently while trying to use Spark on an AWS EMR cluster.
I am creating the cluster using something like :
./elastic-mapreduce --create --alive \
--name "ll_Spark_Cluster" \
--bootstrap-action s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb \
--bootstrap-name "Spark/Shark" \
--instance-type m1.xlarge \
--instance-count 2 \
--ami-version 3.0.4
The issue is that whenever I try to get data from S3 I get an exception.
So if I start the spark-shell and try something like :
val data = sc.textFile("s3n://your_s3_data")
I get the following exception :
WARN storage.BlockManager: Putting block broadcast_1 failed
java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
This issue was caused by the guava library,
The version that's on the AMI is 11 while spark needs version 14.
I edited the bootstrap script from AWS to install spark 1.0.2 and update the guava library during the bootstrap action you can get the gist here :
https://gist.github.com/tnbredillet/867111b8e1e600fa588e
Even after updating guava I still had an issue.
When I tried to save data on S3 I had an exception thrown
lzo.GPLNativeCodeLoader - Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
I solved that by adding the hadoop native library to the java.library.path.
When I run a job I add the option
-Djava.library.path=/home/hadoop/lib/native
or if I run a job through spark-submit I add the
--driver-library-path /home/hadoop/lib/native
argument.