Read HBase data into Spark via Apache Phoenix - amazon-web-services

Being new to Spark, Phoenix and HBase, I was trying a few examples, as listed out here and here.
Created the data as per the example for "us_population" here.
However, on trying to query the Table thus created in Phoenix / HBase, via Spark, I get the following error -
scala> val rdd = sc.phoenixTableAsRDD("us_population", Seq("CITY", "STATE", "POPULATION"), zkUrl = Some("random_aws.internal:2181"))
java.lang.NoClassDefFoundError: org/apache/phoenix/jdbc/PhoenixDriver
at org.apache.phoenix.spark.PhoenixRDD.<init>(PhoenixRDD.scala:40)
at org.apache.phoenix.spark.SparkContextFunctions.phoenixTableAsRDD(SparkContextFunctions.scala:39)
... 52 elided
Caused by: java.lang.ClassNotFoundException: org.apache.phoenix.jdbc.PhoenixDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 54 more
Unsure why this error is popping up. Any help for the same would be greatly appreciated!
P.S. I load Spark with the following command -
spark-shell --jars /usr/lib/phoenix/phoenix-spark-4.9.0-HBase-1.2.jar
Am attempting this on a tiny AWS EMR cluster of 1 Master and 1 Name Node (both are R4.xlarge with 20GB SSD external storage)

You got this exception because the class org.apache.phoenix.jdbc.PhoenixDriver is missing from the Spark classpath.
Try to add phoenix-core-4.9.0-HBase-1.2.jar when you start spark-shell.
spark-shell --jars /usr/lib/phoenix/phoenix-spark-4.9.0-HBase-1.2.jar,/usr/lib/phoenix/phoenix-core-4.9.0-HBase-1.2.jar
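Before changing the command line, it can help to confirm which jar actually contains the missing class. The following is a minimal sketch in Python, not part of the original answer; the helper names `jar_contains_class` and `jars_providing` are made up for illustration. It relies only on the fact that a jar is a zip archive in which `org.example.Foo` is stored as `org/example/Foo.class`.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar at jar_path holds the .class entry for class_name."""
    # A jar is a zip archive; org.example.Foo is stored as org/example/Foo.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def jars_providing(class_name, jar_paths):
    """Filter jar_paths down to the jars that actually contain class_name."""
    return [path for path in jar_paths if jar_contains_class(path, class_name)]
```

Calling `jars_providing("org.apache.phoenix.jdbc.PhoenixDriver", [...])` with the jars under /usr/lib/phoenix would show which of them needs to be on the `--jars` list.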

Related

AWS EMR Spark error with `Failed to load class of driverClassName com.mysql.jdbc.Driver`

I'm currently trying to add a process in EMR 6.1.0 that will use Spark to store aggregated data in mysql.
However, when I actually run Spark, I get the following error.
Exception in thread "main" java.lang.RuntimeException: Failed to load class of driverClassName com.mysql.jdbc.
This error did not occur in EMR 6.0.0.
In the process of updating from EMR 6.0.0 to 6.1.0, I changed the Spark version from 2.4.4 to 3.0.0.
The code itself has not changed significantly, and we know that it is not a network problem.
I've spent a lot of time looking through the AWS documentation and can't seem to find any hints.
Can anyone help me?
Place the MySQL connector jar under the $SPARK_HOME/jars folder, or pass the MySQL connector jar path to the spark-shell/spark-submit command using the --jars flag.
Spark 3.x depends on HikariCP.
https://github.com/apache/spark/blob/v3.0.0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L1
Preloaded HikariCP can't load your application classes due to ClassLoader.
https://github.com/brettwooldridge/HikariCP/blob/HikariCP-2.5.1/src/main/java/com/zaxxer/hikari/HikariConfig.java#L318
this.getClass().getClassLoader().loadClass(driverClassName)
You should add shade settings if you use the sbt-assembly plugin.
assembly / assemblyShadeRules := {
  Seq("com.zaxxer.hikari").map { packageName =>
    ShadeRule.rename(s"${packageName}.**" -> s"my_app_shade_package.${packageName}.@1").inAll
  }
}

aws: EMR cluster fails "ERROR UserData: Error encountered while try to get user data" on submitting spark job

Successfully started aws EMR cluster, but any submission fails with:
19/07/30 08:37:42 ERROR UserData: Error encountered while try to get user data
java.io.IOException: File '/var/aws/emr/userData.json' cannot be read
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:296)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1711)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.commons.io.FileUtils.readFileToString(FileUtils.java:1748)
at com.amazon.ws.emr.hadoop.fs.util.UserData.getUserData(UserData.java:62)
at com.amazon.ws.emr.hadoop.fs.util.UserData.<init>(UserData.java:39)
at com.amazon.ws.emr.hadoop.fs.util.UserData.ofDefaultResourceLocations(UserData.java:52)
at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.buildSTSClient(AWSSessionCredentialsProviderFactory.java:52)
at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.<clinit>(AWSSessionCredentialsProviderFactory.java:17)
at com.amazon.ws.emr.hadoop.fs.rolemapping.DefaultS3CredentialsResolver.resolve(DefaultS3CredentialsResolver.java:22)
at com.amazon.ws.emr.hadoop.fs.guice.CredentialsProviderOverrider.override(CredentialsProviderOverrider.java:25)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.executeOverriders(GlobalS3Executor.java:130)
at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:86)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.doesBucketExist(AmazonS3LiteClient.java:90)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:139)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:116)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.initialize(S3NativeFileSystem.java:508)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:111)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2859)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2878)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:392)
at org.apache.spark.deploy.DependencyUtils$.org$apache$spark$deploy$DependencyUtils$$resolveGlobPath(DependencyUtils.scala:190)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:146)
at org.apache.spark.deploy.DependencyUtils$$anonfun$resolveGlobPaths$2.apply(DependencyUtils.scala:144)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:144)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:354)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$3.apply(SparkSubmit.scala:354)
at scala.Option.map(Option.scala:146)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:354)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
userData.json isn't part of my application, looks like it is emr internals.
Any ideas what is wrong? I submit jobs via livy requests.
Cluster setup:
2 core nodes m4.large
7 task nodes m5.4xlarge
1 master node m5.xlarge
The correct way to fix this is by running the following command as part of your bootstrap script when launching EMR (or, if running on a Glue Endpoint, run the following at any point on your endpoint):
chmod 444 /var/aws/emr/userData.json
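The chmod fix suggests that the file is not readable by the user running the job. Here is a minimal sketch in Python for checking and applying the same permission bits; the function names are hypothetical, and the path is whatever the error message reports.

```python
import os
import stat


def is_world_readable(path):
    """Return True if the 'other' read bit is set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)


def make_read_only_for_all(path):
    """Same effect as `chmod 444 path`: read-only for owner, group and others."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

If `is_world_readable("/var/aws/emr/userData.json")` returns False on a node, that would be consistent with the "cannot be read" error above.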
I've faced a similar issue on AWS EMR emr-5.24.1 (Spark 2.4.1), but the jobs never failed.

Unable to load S3 parquet with postgresql driver in spark-shell [duplicate]

Trying to read a file located in S3 using spark-shell:
scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12
scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
... etc ...
The IOException: No FileSystem for scheme: s3n error occurred with:
Spark 1.3.1 or 1.4.0 on dev machine (no Hadoop libs)
Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
Using s3:// or s3n:// scheme
What is the cause of this error? Missing dependency, Missing configuration, or mis-use of sc.textFile()?
Or maybe this is due to a bug that affects Spark builds specific to Hadoop 2.6.0, as this post seems to suggest. I am going to try Spark for Hadoop 2.4.0 to see if this solves the issue.
Confirmed that this is related to the Spark build against Hadoop 2.6.0. Just installed Spark 1.4.0 "Pre built for Hadoop 2.4 and later" (instead of Hadoop 2.6), and the code now works OK.
sc.textFile("s3n://bucketname/Filename") now raises another error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
The code below uses the S3 URL format to show that Spark can read S3 file. Using dev machine (no Hadoop libs).
scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21
scala> lyrics.count
res1: Long = 9
Even better: the code above, with AWS credentials inline in the S3N URI, will break if the AWS Secret Key contains a forward slash "/". Configuring the AWS credentials in the SparkContext fixes this. The code works whether the S3 file is public or private.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
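To see why a "/" in the secret key breaks the inline form, here is a small Python sketch that uses Python's own URI parser as a stand-in for whatever parser Hadoop uses (an assumption). The helper names and the sample keys are made up. A generic parser ends the authority part at the first "/", so the bucket is no longer where it should be.

```python
from urllib.parse import quote, urlsplit


def bucket_from_uri(uri):
    """Return the part a generic URI parser treats as the host (the bucket)."""
    netloc = urlsplit(uri).netloc
    # Everything after the last "@" in the authority is the host
    return netloc.rpartition("@")[2]


def inline_credentials_uri(access_key, secret_key, bucket, key):
    """Build an s3n URI with percent-encoded credentials (illustration only)."""
    return "s3n://{}:{}@{}/{}".format(
        quote(access_key, safe=""), quote(secret_key, safe=""), bucket, key)
```

With a secret such as "abc/def", `bucket_from_uri` on the raw URI yields "AKIA:abc" rather than the bucket name. This sketch does not establish whether a given Hadoop version decodes percent-encoded secrets, so setting the properties on hadoopConfiguration as shown above remains the safer route.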
Although this question already has an accepted answer, I think the exact details of why this happens are still missing, so there may be room for one more answer.
If you add the required hadoop-aws dependency, your code should work.
Starting with Hadoop 2.6.0, the S3 FS connector has been moved to a separate library called hadoop-aws.
There is also a Jira for that:
Move s3-related FS connector code to hadoop-aws.
This means that any version of Spark that has been built against Hadoop 2.6.0 or newer needs this additional external dependency to be able to connect to the S3 file system.
Here is an sbt example that I have tried and is working as expected using Apache Spark 1.6.2 built against Hadoop 2.6.0:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"
In my case, I ran into some dependency conflicts, which I resolved by adding exclusions:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")
On a related note, I have yet to try it, but it is recommended to use the "s3a" filesystem rather than "s3n" starting with Hadoop 2.6.0.
The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.
You can add the --packages parameter with the appropriate jar to your submission:
bin/spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 code.py
I had to copy the jar files from a hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work.
Details:
Spark 2.3.0
Hadoop downloaded was 2.7.6
Two jar files copied were from (hadoop dir)/share/hadoop/tools/lib/
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
This is a sample spark code which can read the files present on s3
val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", s3Key)
hadoopConf.set("fs.s3.awsSecretAccessKey", s3Secret)
var jobInput = sparkContext.textFile("s3://" + s3_location)
Ran into the same problem in Spark 2.0.2. Resolved it by feeding it the jars. Here's what I ran:
$ spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId",awsAccessKeyId)
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.read.parquet("s3://your-s3-bucket/")
Obviously, you need to have the jars in the path where you're running spark-shell from.
There is a Spark JIRA, SPARK-7481, open as of today, Oct 20, 2016, to add a spark-cloud module which includes transitive dependencies on everything s3a and azure wasb: need, along with tests.
And a Spark PR to match. This is how I get s3a support into my spark builds
If you do it by hand, you must get hadoop-aws JAR of the exact version the rest of your hadoop JARS have, and a version of the AWS JARs 100% in sync with what Hadoop aws was compiled against. For Hadoop 2.7.{1, 2, 3, ...}
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
+ jackson-*-2.6.5.jar
Stick all of these into SPARK_HOME/jars. Run Spark with your credentials set up in environment variables or in spark-defaults.conf.
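The version-matching rule above can be checked mechanically. This is a rough Python sketch that assumes jars follow the usual `name-x.y.z.jar` naming and that a `hadoop-common` jar is present in the same directory; the function names are hypothetical.

```python
import os
import re

# Matches e.g. hadoop-aws-2.7.3.jar -> name "hadoop-aws", version "2.7.3"
JAR_VERSION = re.compile(
    r"^(?P<name>hadoop-[a-z-]+?)-(?P<version>\d+\.\d+\.\d+)\.jar$")


def hadoop_jar_versions(jars_dir):
    """Map artifact name -> version for hadoop-*.jar files in jars_dir."""
    versions = {}
    for filename in os.listdir(jars_dir):
        match = JAR_VERSION.match(filename)
        if match:
            versions[match.group("name")] = match.group("version")
    return versions


def hadoop_aws_matches(jars_dir):
    """True if hadoop-aws is present with the same version as hadoop-common."""
    versions = hadoop_jar_versions(jars_dir)
    aws = versions.get("hadoop-aws")
    return aws is not None and aws == versions.get("hadoop-common")
```

Running `hadoop_aws_matches` against SPARK_HOME/jars gives a quick yes/no before starting Spark. It does not check the AWS SDK version, which still has to be the one hadoop-aws was compiled against.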
The simplest test is whether you can do a line count of a CSV file:
val landsatCSV = "s3a://landsat-pds/scene_list.gz"
val lines = sc.textFile(landsatCSV)
val lineCount = lines.count()
Get a number: all is well. Get a stack trace. Bad news.
For Spark 1.4.x "Pre built for Hadoop 2.6 and later":
I just copied needed S3, S3native packages from hadoop-aws-2.6.0.jar to
spark-assembly-1.4.1-hadoop2.6.0.jar.
After that I restarted spark cluster and it works.
Do not forget to check owner and mode of the assembly jar.
I was facing the same issue. It worked fine after setting the value for fs.s3n.impl and adding hadoop-aws dependency.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
S3N is not a default file format. You need to build your version of Spark with a version of Hadoop that has the additional libraries used for AWS compatibility. Additional info I found here, https://www.hakkalabs.co/articles/making-your-local-hadoop-more-like-aws-elastic-mapreduce
You probably have to use the s3a:// scheme instead of s3:// or s3n://
However, it is not working out of the box (for me) for the spark shell. I see the following stacktrace:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC.<init>(<console>:39)
at $iwC.<init>(<console>:41)
at <init>(<console>:43)
at .<init>(<console>:47)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 68 more
What I think: you have to add the hadoop-aws dependency manually http://search.maven.org/#artifactdetails|org.apache.hadoop|hadoop-aws|2.7.1|jar But I have no idea how to add it to spark-shell properly.
Download the hadoop-aws jar from maven repository matching your hadoop version.
Copy the jar to $SPARK_HOME/jars location.
Now in your Pyspark script, setup AWS Access Key & Secret Access Key.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
# where spark is a SparkSession instance
For Spark scala:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")
I was able to read my S3 parquet files (Spark 3.3.1, Hadoop 3) using the configuration proposed here:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test_Parquet").master("local[*]")\
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")\
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.2.2,"
            "com.amazonaws:aws-java-sdk-bundle:1.12.180").getOrCreate()
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl",
                                     "org.apache.hadoop.fs.s3a.S3A")
# bucket_name and file_name are expected to be defined by the caller
df = spark.read.parquet(f"s3a://{bucket_name}/{file_name}")
Use s3a instead of s3n. I had a similar issue on a Hadoop job. After switching from s3n to s3a it worked.
e.g.
s3a://myBucket/myFile1.log
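If many paths have to be switched over, the scheme change can be done in one place. A tiny Python sketch, with a made-up helper name; it only rewrites `s3n://` and leaves every other scheme untouched.

```python
def to_s3a(uri):
    """Rewrite an s3n:// URI to the s3a:// scheme; leave other URIs alone."""
    old_scheme = "s3n://"
    if uri.startswith(old_scheme):
        return "s3a://" + uri[len(old_scheme):]
    return uri
```

The s3a connector still needs the hadoop-aws jar and its matching AWS SDK on the classpath, as discussed in the other answers.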

Hadoop 2.9.2 AWS

I managed to set up Hadoop with 3 datanodes as a small cluster, and everything works OK.
When trying to access an AWS bucket over the S3A protocol I get this error:
hadoop fs -ls s3a://my-bucket/
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3240)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2299)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
... 16 more
What did I do wrong? How do I fix that?
P.S. The bucket on Amazon is fully public. Anyone can download from it.
Amazon credentials were configured in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services
As per the link you shared, the issue seems to be a JAR file missing from the CLASSPATH. Can you check whether it is accessible? If it is not, copy the required JARs as shown below, matching your Hadoop version, and retry.
sudo cp hadoop/share/hadoop/tools/lib/$AWS_JAVA_SDK_VERSION.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/$AWS_HADOOP_VERSION.jar hadoop/share/hadoop/common/lib/
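The two cp commands above can also be scripted. This is a Python sketch under the assumption that the AWS-related jars in tools/lib are named `hadoop-aws-*.jar` and `aws-java-sdk-*.jar`; the function name and the directory arguments are hypothetical, and the script needs write access to the target directory.

```python
import glob
import os
import shutil


def copy_aws_jars(tools_lib_dir, common_lib_dir):
    """Copy hadoop-aws and aws-java-sdk jars into common_lib_dir.

    Returns the file names that were copied, in the order they were handled.
    """
    copied = []
    for pattern in ("hadoop-aws-*.jar", "aws-java-sdk-*.jar"):
        for src in sorted(glob.glob(os.path.join(tools_lib_dir, pattern))):
            shutil.copy2(src, common_lib_dir)
            copied.append(os.path.basename(src))
    return copied
```

An empty return value means tools/lib held no matching jars, which would point to an incomplete Hadoop download rather than a classpath problem.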

Old-style mapred API in HBase does not work

I have a MapReduce job, which takes HBase table as the output destination
of my reduce job. My reducer class implements the TableMap interface in
package org.apache.hadoop.hbase.mapred, and I used the initTableReduceJob()
function in TableMapReduceUtil class from
package org.apache.hadoop.hbase.mapred to configure my job.
But when I run my job, I got the following error at reduce stage
java.lang.NullPointerException
at org.apache.hadoop.mapred.Task.getFsStatistics(Task.java:1099)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:442)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:490)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
My HBase version is 0.94.0 and my Hadoop version is 1.0.1.
I found a post similar to my question at:
https://forums.aws.amazon.com/thread.jspa?messageID=394846
Could anyone give me some hint about why this happened? Should I just stick
with the org.apache.hadoop.hbase.mapreduce package?
This error suggests that you may be running HBase on the local filesystem without HDFS. Try installing or running Hadoop HDFS. The org.apache.hadoop.mapred API appears to require HDFS.
As a possible convenience, you may try the Cloudera development VM, which packages both.