UnsupportedClassVersionError with MySQL JDBC driver in AWS Data Pipeline

I am trying to run a Data Pipeline job in AWS. I added the field "Jdbc Driver Jar Uri" and placed the jar file in my S3 bucket, per the instructions here, because the Connector/J driver that AWS Data Pipeline installs does not seem to work.
I'm using mysql-connector-java-8.0.23, and my MySQL database is the same version.
java.lang.UnsupportedClassVersionError: com/mysql/jdbc/Driver : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:808)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:443)
at java.net.URLClassLoader.access$100(URLClassLoader.java:65)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.net.URLClassLoader$1.run(URLClassLoader.java:349)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:348)
at java.lang.ClassLoader.loadClass(ClassLoader.java:430)
at java.lang.ClassLoader.loadClass(ClassLoader.java:363)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at amazonaws.datapipeline.database.JdbcDriverInitializer.getDriver(JdbcDriverInitializer.java:75)
at amazonaws.datapipeline.database.ConnectionFactory.getRdsDatabaseConnection(ConnectionFactory.java:158)
at amazonaws.datapipeline.database.ConnectionFactory.getConnection(ConnectionFactory.java:74)
at amazonaws.datapipeline.database.ConnectionFactory.getConnectionWithCredentials(ConnectionFactory.java:302)
at amazonaws.datapipeline.connector.SqlDataNode.createConnection(SqlDataNode.java:100)
at amazonaws.datapipeline.connector.SqlDataNode.getConnection(SqlDataNode.java:94)
at amazonaws.datapipeline.connector.SqlDataNode.prepareStatement(SqlDataNode.java:162)
at amazonaws.datapipeline.connector.SqlInputConnector.open(SqlInputConnector.java:49)
at amazonaws.datapipeline.connector.SqlInputConnector.<init>(SqlInputConnector.java:26)
at amazonaws.datapipeline.connector.SqlDataNode.getInputConnector(SqlDataNode.java:79)
at amazonaws.datapipeline.activity.copy.SingleThreadedCopyActivity.processAll(SingleThreadedCopyActivity.java:47)
at amazonaws.datapipeline.activity.copy.SingleThreadedCopyActivity.runActivity(SingleThreadedCopyActivity.java:35)
at amazonaws.datapipeline.activity.CopyActivity.runActivity(CopyActivity.java:22)
at amazonaws.datapipeline.objects.AbstractActivity.run(AbstractActivity.java:16)
at amazonaws.datapipeline.taskrunner.TaskPoller.executeRemoteRunner(TaskPoller.java:136)
at amazonaws.datapipeline.taskrunner.TaskPoller.executeTask(TaskPoller.java:105)
at amazonaws.datapipeline.taskrunner.TaskPoller$1.run(TaskPoller.java:81)
at private.com.amazonaws.services.datapipeline.poller.PollWorker.executeWork(PollWorker.java:76)
at private.com.amazonaws.services.datapipeline.poller.PollWorker.run(PollWorker.java:53)
at java.lang.Thread.run(Thread.java:748)
I've looked at this question for a solution, but I wasn't able to figure out how to adapt those answers to AWS Data Pipeline.
Can someone explain what steps need to be taken to fix this class version error?
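For reference, "Unsupported major.minor version 52.0" means the classes being loaded were compiled for Java 8 (class-file major version 52) while the JVM loading them is older. A minimal sketch for checking which Java release the driver classes in the jar target, assuming the jar from the S3 bucket has been downloaded locally (the local path is a placeholder):
import struct
import zipfile

JAR_PATH = "mysql-connector-java-8.0.23.jar"  # placeholder: local copy of the jar from S3

with zipfile.ZipFile(JAR_PATH) as jar:
    # Bytes 0-3 of a .class file are the 0xCAFEBABE magic, bytes 4-5 the minor version,
    # and bytes 6-7 the major version (52 = Java 8, 51 = Java 7, 50 = Java 6).
    with jar.open("com/mysql/jdbc/Driver.class") as cls:
        magic, minor, major = struct.unpack(">IHH", cls.read(8))
        print(f"major={major} minor={minor}")  # major 52 matches the error above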

Related

Developing Glue ETL script locally - java.lang.IllegalStateException: Connection pool shut down

I'm currently developing ETL scripts locally using the AWS Glue ETL library.
I'm facing an issue when extracting data from an S3 bucket as a DynamicFrame.
When I convert it to a DataFrame using toDF(), it always triggers this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o52.toDF
...
ERROR Executor: Exception in task 5.0 in stage 3.0 (TID 29)
java.lang.IllegalStateException: Connection pool shut down
at org.apache.http.util.Asserts.check(Asserts.java:34)
at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy15.requestConnection(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1330)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5062)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5008)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:148)
at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at com.amazonaws.services.glue.readers.BufferedStream.read(DynamicRecordReader.scala:91)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:489)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:126)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
I tried the same code on an AWS Glue dev endpoint and it works fine. Any idea how to resolve this?
Switching to Java 8 should resolve the issue.
Check your version with java -version.
I had the same issue when running the dev environment with the following specs:
Scala version 2.11.12
Spark version 2.4.3
Glue 1.0.0
To fix it, add the following line to the Spark configuration in $SPARK_HOME/conf/spark-defaults.conf:
spark.master local
Alternatively, depending on how you are running your job, you can configure this dynamically if you are in control of the Spark context, e.g.:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Force local mode before the context is created
conf = SparkConf()
conf.setMaster("local").setAppName("My app")

# Build the SparkContext with the explicit local master
sc = SparkContext(conf=conf)
I have found this happens when running in local mode with multiple threads. Increasing fs.s3.connection.maximum or fs.s3a.connection.maximum does not fix the issue, although this post indicates that it should: https://kb.databricks.com/jobs/job-fails-connection-pool.html
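For completeness, a minimal sketch of how those Hadoop properties can be passed from PySpark, assuming you control the SparkContext as in the snippet above (the values are placeholders, and as noted, raising them did not help in my case):
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Hadoop filesystem settings can be forwarded through Spark via the "spark.hadoop." prefix.
conf = (
    SparkConf()
    .setMaster("local")
    .setAppName("My app")
    .set("spark.hadoop.fs.s3.connection.maximum", "100")   # placeholder value (EMRFS property)
    .set("spark.hadoop.fs.s3a.connection.maximum", "100")  # placeholder value (S3A property)
)
sc = SparkContext(conf=conf)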

How to save parquet in S3 from AWS SageMaker?

I would like to save a Spark DataFrame from AWS SageMaker to S3. In a notebook, I ran
myDF.write.mode('overwrite').parquet("s3a://my-bucket/dir/dir2/")
and I get:
Py4JJavaError: An error occurred while calling o326.parquet. :
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
How should I do this correctly from the notebook? Many thanks!
The SageMaker notebook instance is not running Spark, and it doesn't have the Hadoop or other Java classes that you are trying to invoke.
The Jupyter notebook in SageMaker usually has Python libraries such as pandas available, and you can use them to write the Parquet file (for example, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_parquet.html).
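A minimal sketch of that approach, assuming the data fits in memory and that pyarrow (or fastparquet) and s3fs are installed in the notebook kernel; the bucket and path are the placeholders from the question:
import pandas as pd

# Placeholder data; if the Spark DataFrame from the question is small enough, it could
# instead be collected into pandas with myDF.toPandas().
pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# pandas hands s3:// paths to s3fs and writes the Parquet file via pyarrow (or fastparquet).
pdf.to_parquet("s3://my-bucket/dir/dir2/data.parquet", index=False)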
Another option is to connect from the Jupyter notebook to an existing (or new) Spark cluster and execute the command remotely there. See here for documentation on how to set this connection up: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/

AWS EMR Mapreduce failure

We have an installation of AWS EMR in a client environment. Encryption in transit and encryption at rest have been enabled using a security configuration. We continue to get the mapreduce errors below when we execute a simple Hive query.
Diagnostic Messages for this Task:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#1
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:377)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:282)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:323)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Please let me know if anyone has faced this error before.

NoSuchMethodError for com.google.protobuf.AbstractMessage.newBuilderForType on Spanner Java Client

I get this error while trying to experiment with Google's Cloud Spanner by running the sample code:
https://github.com/GoogleCloudPlatform/java-docs-samples/blob/master/spanner/cloud-client/src/main/java/com/example/spanner/SpannerSample.java
The stacktrace is as follows:
Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.AbstractMessage.newBuilderForType(Lcom/google/protobuf/AbstractMessage$BuilderParent;)Lcom/google/protobuf/Message$Builder;
at com.google.protobuf.SingleFieldBuilderV3.getBuilder(SingleFieldBuilderV3.java:142)
at com.google.spanner.v1.Mutation$Builder.getInsertBuilder(Mutation.java:3227)
at com.google.cloud.spanner.Mutation.toProto(Mutation.java:377)
at com.google.cloud.spanner.SpannerImpl$TransactionContextImpl.commit(SpannerImpl.java:1223)
at com.google.cloud.spanner.SpannerImpl$TransactionRunnerImpl.run(SpannerImpl.java:1148)
at com.google.cloud.spanner.SpannerImpl$SessionImpl.write(SpannerImpl.java:704)
at com.google.cloud.spanner.SessionPool$PooledSession.write(SessionPool.java:201)
at com.google.cloud.spanner.DatabaseClientImpl.write(DatabaseClientImpl.java:31)
at spanner_test.SpannerSample.writeExampleData(SpannerSample.java:164)
at spanner_test.SpannerSample.run(SpannerSample.java:423)
at spanner_test.SpannerSample.main(SpannerSample.java:501)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
I am using Gradle and my dependencies are as follows:
spanner: com.google.cloud:google-cloud-spanner:0.9.4-beta
protobuf: com.google.protobuf:protobuf-java:3.1.0
The project has been migrated to a different repository, java-spanner, which is currently at version 6.36.0 and uses protobuf-java 3.21.10. I tried your writeExampleData sample with the latest version and did not run into the same issue.

SPARK_CONF not found in AWS EMR

I am trying to deploy a Spark application in EMR and I am facing the following issue:
java.io.FileNotFoundException: File does not exist: hdfs://ip-10-184-176-172.ec2.internal:8020/user/hadoop/.sparkStaging/application_1446113189622_0004/__spark_conf__2712437380309904293.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I am deploying in cluster mode using the EMR console UI. The first line of the log shows the SPARK_CONF zip being uploaded to that HDFS location, but the error says the file was not found at the same location. Has anyone faced a similar issue?
Issue resolved. I was using an unsupported Java version: EMR has Java 7 and my application was built with Java 8.