Spark Unit Testing - unit-testing

My entire build.sbt is:
name := """sparktest"""
version := "1.0.0-SNAPSHOT"
scalaVersion := "2.11.8"
scalacOptions := Seq("-unchecked", "-deprecation", "-encoding", "utf8", "-Xexperimental")
parallelExecution in Test := false
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.0.2",
"org.apache.spark" %% "spark-sql" % "2.0.2",
"org.apache.avro" % "avro" % "1.8.1",
"org.scalatest" %% "scalatest" % "3.0.1" % "test",
"com.holdenkarau" %% "spark-testing-base" % "2.0.2_0.4.7" % "test"
)
I have a simple test. Obviously, this is just a starting point, I'd like to test more:
package sparktest
import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.FunSuite
class SampleSuite extends FunSuite with DataFrameSuiteBase {
test("simple test") {
assert(1 + 1 === 2)
}
}
I run sbt clean test and get a failure with:
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf$ConfVars
For my dev environment, I'm using the spark-2.0.2-bin-hadoop2.7.tar.gz
Do I have to configure this environment in any way? Obviously HiveConf is a transitive Spark dependency

As #daniel-de-paula mentions in the comments you will need to add spark-hive as an explicit dependency (you can restrict this to the test scope though if you aren't using hive in your application its self). spark-hive is not a transitive dependency of spark-core which is why this error happened. spark-hive is excluded from spark-testing-base as a dependency so that people who are doing RDD only tests don't need to add it as a dependency.

Related

hadoop-aws and aws-java-sdk version compatibility for Spark 3.1.2

I ran into version compatibility issues updating Spark project utilising both hadoop-aws and aws-java-sdk-s3 to Spark 3.1.2 with Scala 2.12.15 in order to run on EMR 6.5.0.
I checked EMR release notes stating these versions:
AWS SDK for Java v1.12.31
Spark v3.1.2
Hadoop v3.2.1
I am currently running spark locally to ensure compatibility of above versions and get the following error:
java.lang.NoSuchFieldError: SERVICE_ID
at com.amazonaws.services.s3.AmazonS3Client.createRequest(AmazonS3Client.java:4925)
at com.amazonaws.services.s3.AmazonS3Client.createRequest(AmazonS3Client.java:4911)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1441)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1381)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:381)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:380)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:314)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
I also tried checking version of aws-java-sdk hadoop-aws is based on. Hadoop-aws 3.2.1 relies on aws-java-sdk 1.11.375 as it can be found here
However these versions result in a different error:
'org.apache.http.client.methods.HttpRequestBase com.amazonaws.http.HttpResponse.getHttpRequest()'
at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.handle(S3ObjectResponseHandler.java:57)
at com.amazonaws.services.s3.internal.S3ObjectResponseHandler.handle(S3ObjectResponseHandler.java:29)
at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1555)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1272)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4368)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4315)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1416)
at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$reopen$0(S3AInputStream.java:196)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:195)
at org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:346)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$2(Invoker.java:195)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:193)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:215)
at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:339)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:451)
at java.base/java.io.DataInputStream.read(DataInputStream.java:149)
build.sbt:
scalaVersion := "2.12.15"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "3.1.2",
"org.apache.spark" %% "spark-sql" % "3.1.2",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.12.2",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.12.2",
"org.apache.hadoop" % "hadoop-client" % "3.2.1",
"org.apache.hadoop" % "hadoop-aws" % "3.2.1",
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.375"
)
What should be correct versions for these libraries?
the EMR docs says "use our own s3: connector"...if you are running on EMR do exactly that.
you should use the s3a one on other installations, including local ones
And there
mvnrepository a good way to get a view of what dependencies are
* here is its summary for hadoop-aws though its 3.2.1 declaration misses out all the dependencies. it is 1.11.375
the stack traces you are seeing are from trying to get the aws s3 sdk, core sdk, jackson and httpclient in sync.
it's easiest to give up and just go with the full aws-java-sdk-bundle, which has a consistent set of aws artifacts and private versions of the dependencies. It is huge -but takes away all issues related to transitive dependencies
Turns out adding dependency to aws-java-sdk-core explicitely solved my problem, as mentioned here. That way I can avoid heavy aws sdk bundle.
build.sbt:
scalaVersion := "2.12.15"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "3.1.2",
"org.apache.spark" %% "spark-sql" % "3.1.2",
"com.fasterxml.jackson.core" % "jackson-databind" % "2.12.2",
"com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.12.2",
"org.apache.hadoop" % "hadoop-client" % "3.2.1",
"org.apache.hadoop" % "hadoop-aws" % "3.2.1",
"com.amazonaws" % "aws-java-sdk-s3" % "1.11.375",
"com.amazonaws" % "aws-java-sdk-core" % "1.11.375"
)

Scala lambda only failing in AWS

Im writing my first scala lambda, and locally everything connects and works fine. However, when I try to test my lambda in AWS, I get the following error.
{
"errorMessage": "Error loading class FooBar.Main: scala/collection/Seq",
"errorType": "java.lang.NoClassDefFoundError"
}
From my googling, its seems this is cause I needed to add the scala library to my dependencies, which I did.
name := "FooBar"
version := "0.1"
scalaVersion := "2.12.12"
javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")
lazy val root = (project in file(".")).
settings(
name := "FooBar",
version := "1.0",
scalaVersion := "2.12.12",
retrieveManaged := true
)
libraryDependencies += "software.amazon.awssdk" % "ec2" % "2.5.60"
libraryDependencies += "com.amazonaws" % "aws-lambda-java-core" % "1.2.0"
libraryDependencies += "com.amazonaws" % "aws-lambda-java-events" % "2.1.0"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-dynamodb" % "1.11.313"
libraryDependencies += "org.scalikejdbc" %% "scalikejdbc" % "3.4.0"
libraryDependencies += "org.apache.phoenix" % "phoenix-core" % "4.14.3-HBase-1.4"
libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.4.10"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.4.10"
libraryDependencies += "io.spray" %% "spray-json" % "1.3.2"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"
libraryDependencies += "org.scala-lang" % "scala-library" % "2.12.12"
assemblyShadeRules in assembly := Seq(
ShadeRule.keep("x.**").inAll,
ShadeRule.keep("FooBar.**").inProject
)
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs#_*) => MergeStrategy.discard
case x => MergeStrategy.first
}
again, everything works fine locally, just can never execute on AWS. Anyone have an idea?
The sbt-assembly plugins shade rule ShadeRule.keep documentation states
The ShadeRule.keep rule marks all matched classes as "roots". If any
keep rules are defined all classes which are not reachable from the
roots via dependency analysis are discarded when writing the output
jar.
https://github.com/sbt/sbt-assembly#shading
So in this case all the classes of the class path x.* and FooBar.* are retained while creating the fat jar. Rest all other classes including the classes in scala-library are discarded.
To fix this remove all the ShadeRule.keep rules and instead try ShadeRule.zap to selectively discard classes not required.
For example, the following shade rules removes all the HDFS classes from the far jar:
assemblyShadeRules in assembly := Seq(
ShadeRule.zap("org.apache.hadoop.hdfs.**").inAll
)
PS: AWS Lambda had a hard limit of 256MB of code size after unzipping the fat jar.

Spark job reading from S3 on Spark cluster gives IllegalAccessError: tried to access method MutableCounterLong [duplicate]

This question already has answers here:
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
(2 answers)
Closed 4 years ago.
I have a Spark cluster on DC/OS and I am running a Spark job that reads from S3. The versions are the following:
Spark 2.3.1
Hadoop 2.7
The dependency for AWS connection: "org.apache.hadoop" % "hadoop-aws" % "3.0.0-alpha2"
I read in the data by doing the following:
`val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", Config.awsEndpoint)
hadoopConf.set("fs.s3a.access.key", Config.awsAccessKey)
hadoopConf.set("fs.s3a.secret.key", Config.awsSecretKey)
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val data = sparkSession.read.parquet("s3a://" + "path/to/file")
`
The error I am getting is:
Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:215)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:138)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:170)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:44)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:321)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:809)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:182)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:207)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
This job only fails if I submit it as a JAR to the cluster. If I run the code locally or in a docker container, it does not fail and is perfectly able to read in the data.
I would be very grateful if anyone could help me with this!
This is one of the stack traces you get to see when you mix Hadoop-* jars.
As the S3A docs say
Critical: Do not attempt to “drop in” a newer version of the AWS SDK than that which the Hadoop version was built with Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.
Randomly changing hadoop- and aws- JARs in the hope of making a problem “go away” or to gain access to a feature you want, will not lead to the outcome you desire.
I was also facing problem(not exactly same exception) to run docker image on spark cluster(kubernetes), which was perfectly running locally. Then I have changed build.sbt assembly and hadoop verion.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-core" % "2.8.9"
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.9"
dependencyOverrides += "com.fasterxml.jackson.module" % "jackson-module-scala_2.11" % "2.8.9"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.1.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-s3" % "1.11.271"
dependencyOverrides += "org.apache.hadoop" % "hadoop-hdfs" % "3.1.1"
dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "3.1.1"
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "log4j.properties" => MergeStrategy.discard
case m if m.toLowerCase.startsWith("meta-inf/services/") => MergeStrategy.filterDistinctLines
case PathList("META-INF", "services", "org.apache.hadoop.fs.s3a.S3AFileSystem") => MergeStrategy.filterDistinctLines
case "reference.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}
But not sure this will work for you or not. Because the same code is not working with aws-EKS machine and same exception throws if hadoop verion is 2.8.1. Hadoop and aws version is also same,as working fine locally, so trying to reach aws team for help.
Seems like the version of hadoop-aws that you are using is not compatible with the version of hadoop. Can you try with hadoop-aws-2.7.3 this version of hadoop-aws and aws-java-sdk-1.11.123 this version of aws java sdk. Hope this will solve your problem

spark and aws redshift: java.sql.SQLException: No suitable driver found for jdbc:redshift://xxx.us-west-2.redshift.amazonaws.com:5439

os: centos
spark:1.6.1
sbt: build.sbt
libraryDependencies ++= {
Seq(
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"com.amazonaws" % "aws-java-sdk" % "1.10.75",
"com.amazonaws" % "amazon-kinesis-client" % "1.1.0",
"com.amazon.redshift" % "jdbc4" % "1.1.7.1007" % "test"
)
}
resolvers ++= Seq(
"redshift" at "https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC4-1.1.7.1007.jar"
)
spark app:
val redshiftDriver = "com.amazon.redshift.jdbc4.Driver"
Class.forName(redshiftDriver)
I've specified the redshift driver, and updated to url etc., following AWS official documentation here: http://docs.aws.amazon.com/redshift/latest/mgmt/connecting-in-code.html
But I'm still getting error below:
java.sql.SQLException: No suitable driver found for jdbc:redshift://xxx.us-west-2.redshift.amazonaws.com:5439
I googled and someone said the jar should be added to classpath? Could anyone please help here? Thank you very much
Solved:
just clean all cached stuff, and re-build everything from scratch, and then it's working
Add on:
Databricks implemented this lib, which could make our life much easier interacting redshift within Spark
https://github.com/databricks/spark-redshift
// Get some data from a Redshift table
val df: DataFrame = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table")
.option("tempdir", "s3n://path/for/temp/data")
.load()

how to set the contextPath of jetty on xsbt-web-plugin?

I'm on sbt 0.11.1 and xsbt-web-plugin 0.2.10
here goes the build.sbt and plugins.sbt
build.sbt
organization := "org"
name := "demo"
version := "0.1.0-SNAPSHOT"
scalaVersion := "2.9.1"
seq(webSettings :_*)
configurationXml :=
<configuration>
<webApp>
<contextPath>/foo</contextPath>
</webApp>
</configuration>
libraryDependencies ++= Seq(
"org.eclipse.jetty" % "jetty-webapp" % "7.4.5.v20110725" % "container",
"javax.servlet" % "servlet-api" % "2.5" % "provided"
)
resolvers += "Sonatype OSS Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"
project/plugins.sbt
libraryDependencies <+= sbtVersion(v => "com.github.siasia" %% "xsbt-web-plugin" % (v+"-0.2.10"))
It seems the configurationXml doesn't work, after running container:start in sbt console, the contextPath gets the default value "/"
how can I change the contextPath? any tips? thanks in advance!
Here's a solution from scalatra-user group
Add jetty-plus to dependencies:
"org.eclipse.jetty" % "jetty-plus" % "7.4.5.v20110725" % "container"
Add this to build.sbt:
env in Compile := Some(file(".") / "jetty-env.xml" asFile)
In the same directory as build.sbt, create the jetty-env.xml:
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
<Set name="contextPath">/foo</Set>
</Configure>