PySpark read JDBC giving errors. How to fix? - amazon-web-services

I am connecting to RDS MySQL using JDBC in PySpark. I have tried almost everything I found on Stack Overflow for debugging, but I am still unable to make it work.
spark = SparkSession.builder.config("spark.jars", mysql_jar) \
    .master("local[*]").appName("PySpark_MySQL_test").getOrCreate()
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://hostname.amazonaws.com:1150/dbname?user=user_name&password=password") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "table_name") \
    .load()
I have tried the same connection details with Python's pymysql library, and it connects and returns results.
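For reference, that check looked roughly like this (a sketch: the host, port, and credentials mirror the placeholders in the Spark snippet above, not real values):
import pymysql

# Same placeholder connection details as the Spark read above.
conn = pymysql.connect(host="hostname.amazonaws.com", port=1150,
                       user="user_name", password="password", database="dbname")
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())  # prints (1,) when the connection works
conn.close()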
But from PySpark I get the error below and am unable to solve it.
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:827)
at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:447)
at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:237)
at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:199)
at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure

I have experienced the same issue, and it now works for me. The root cause is that Spark uses the master node to establish the MySQL connection but the worker nodes to execute the tasks, so a connection that succeeds from one machine can still raise a communications error from another. Based on this, open the security rules on the MySQL side so that every Spark node can connect to MySQL.
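As a sketch of that fix on AWS (assuming the database sits behind an EC2/RDS security group and boto3 credentials are configured; the group ID and CIDR below are placeholders, not values from the question):
import boto3

ec2 = boto3.client("ec2")
# Hypothetical: open the MySQL port to the subnet the Spark nodes live in.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: the database's security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 1150,  # the MySQL port used in the question
        "ToPort": 1150,
        "IpRanges": [{"CidrIp": "10.0.0.0/16",  # placeholder: the Spark nodes' CIDR
                      "Description": "Spark master and workers"}],
    }],
)
The same rule can also be added by hand in the security group console.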

For anyone coming here for an answer while using Docker, give the solution below a try.
Use the following configuration:
source_df = spark.read.format('jdbc').options(
    url='jdbc:mysql://host.docker.internal:3306/superset?useSSL=false&allowPublicKeyRetrieval=true',
    driver='com.mysql.cj.jdbc.Driver',
    dbtable='table',
    user='root',
    password='root').load()
I tried the host as localhost, 127.0.0.1, and even the IP address from docker inspect, but none of them worked; changing it to host.docker.internal did.

Related

Developing Glue ETL script locally - java.lang.IllegalStateException: Connection pool shut down

I'm currently developing ETL scripts locally using the AWS Glue ETL library.
I'm facing an issue when extracting data from an S3 bucket as a DynamicFrame: whenever I convert it to a DataFrame using toDF(), it triggers this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o52.toDF
...
ERROR Executor: Exception in task 5.0 in stage 3.0 (TID 29)
java.lang.IllegalStateException: Connection pool shut down
at org.apache.http.util.Asserts.check(Asserts.java:34)
at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy15.requestConnection(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1330)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5062)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5008)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:148)
at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at com.amazonaws.services.glue.readers.BufferedStream.read(DynamicRecordReader.scala:91)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:489)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:126)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
I tried the same code on an AWS Glue DevEndpoint and it works fine. Any idea how to resolve this?
Go with Java 8 and your issue will be resolved; check with java -version.
I had the same issue when running the dev environment with the following specs:
Scala version 2.11.12
Spark version 2.4.3
Glue 1.0.0
To fix it, add the following line to the spark configuration in $SPARK_HOME/conf/spark-defaults.conf
spark.master local
Alternatively, depending on how you run your job, you can configure this dynamically if you control the Spark context, e.g.:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
sc = SparkContext(conf=conf)
I have found this happens when running in local mode with multiple threads; increasing fs.s3.connection.maximum or fs.s3a.connection.maximum does not fix the issue for me, although this post indicates that it should: https://kb.databricks.com/jobs/job-fails-connection-pool.html
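For completeness, a sketch of how one would raise that limit anyway (reusing the SparkConf setup from the answer above; the value 100 is arbitrary):
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = SparkConf()
conf.setMaster("local").setAppName("My app")
# The spark.hadoop. prefix forwards the setting into the Hadoop configuration.
conf.set("spark.hadoop.fs.s3a.connection.maximum", "100")  # arbitrary value
sc = SparkContext(conf=conf)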

Selenium grid Kubernetes Could not start a new session. Possible causes are invalid address of the remote server or browser start-up failure

We have a Selenium hub deployed on a Kubernetes cluster on AWS and use ingress-traefik to expose the service.
We also have a Selenium Chrome node registered to this hub on Kubernetes.
On the grid console page I can see the Chrome node attached to the hub.
But when I trigger my automation suite through Jenkins I get the error message below:
org.openqa.selenium.remote.UnreachableBrowserException: Could not start a new session. Possible causes are invalid address of the remote server or browser start-up failure.
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:17:03'
System info: host: 'xxxxx', ip: 'x.x.x.x', os.name: 'Linux', os.arch: 'amd64', os.version: '4.14.165-103.209.amzn1.x86_64', java.version: '1.8.0_221'
Driver info: driver.version: RemoteWebDriver
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:573)
at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:213)
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:131)
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:144)
at stepDefns.SetUp.setUpBrowser(SetUp.java:145)
at stepDefns.OrderSpecTabSteps.user_sets_the_browser_to_and_version(OrderSpecTabSteps.java:25)
at ✽.Given user sets the browser to "chrome" and version "69"(/data/jenkins_home/workspace/FPSAutomation/src/test/java/features/NonRes.feature:4)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
In the logs I can see: Caused by: java.net.SocketTimeoutException: connect timed out
In my Java code I am using the node URL below, which is HTTPS:
String nodeURL = "https://<hostname>/wd/hub";
ChromeOptions remoteOptions = new ChromeOptions();
driver = new RemoteWebDriver(new URL(nodeURL), remoteOptions);
Please let me know how to resolve this issue. Thanks in advance.

Google Cloud LocalDatastoreHelper failing to connect

LocalDatastoreHelper from the Google Cloud API started failing on one Jenkins worker VM, but not on other Jenkins worker VMs nor on various development machines. This known issue looks very similar but has a slightly different stack trace.
(I am using Google Cloud SDK 202.0.0 with cloud-datastore-emulator 2.0.0)
com.google.cloud.datastore.DatastoreException: Deadline exceeded Deadline exceeded
java.net.SocketTimeoutException: connect timed out connect timed out
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1334)
at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1309)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.datastore.v1.client.RemoteRpc.call(RemoteRpc.java:183)
at com.google.datastore.v1.client.Datastore.commit(Datastore.java:87)
at com.google.cloud.datastore.spi.v1.HttpDatastoreRpc.commit(HttpDatastoreRpc.java:153)
at com.google.cloud.datastore.DatastoreImpl$4.call(DatastoreImpl.java:485)
at com.google.cloud.datastore.DatastoreImpl$4.call(DatastoreImpl.java:482)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:89)
at com.google.cloud.RetryHelper.run(RetryHelper.java:74)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:51)
at com.google.cloud.datastore.DatastoreImpl.commit(DatastoreImpl.java:481)
at com.google.cloud.datastore.DatastoreImpl.commitMutation(DatastoreImpl.java:475)
at com.google.cloud.datastore.DatastoreImpl.put(DatastoreImpl.java:435)
at com.google.cloud.datastore.DatastoreHelper.put(DatastoreHelper.java:54)
at com.google.cloud.datastore.DatastoreImpl.put(DatastoreImpl.java:410)
This turned out to be a bug in Google's Cloud library. I opened this ticket, and eventually it was resolved, so this problem is probably obsolete.

AWS EMR Mapreduce failure

We have an installation of AWS EMR in a client environment. Encryption in transit and encryption at rest have been enabled using a security configuration. We continue to get the MapReduce errors below when we execute a simple Hive query.
Diagnostic Messages for this Task:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError:
error in shuffle in fetcher#1
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:377)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:282)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:323)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Please let me know if anyone has faced this error before.

DataStax Enterpise on AWS - Futures Timed Out After [120] when running Spark job

We are currently running into the following error when attempting to run a Spark job on DSE 4.8 Analytics:
ERROR 2016-04-11 20:59:42,825 UserGroupInformation.java:1128 - org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:ubuntu cause:java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.DseSecureRunner.<init>(DseSecureRunner.scala:24)
at org.apache.spark.DseSecureRunner$.main(DseSecureRunner.scala:34)
at org.apache.spark.DseSecureRunner.main(DseSecureRunner.scala)
Caused by: java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1138)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:67)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:146)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:245)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
... 7 more
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1125)
... 11 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:97)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:159)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
... 14 more
This error occurs on the 2 worker nodes, while the node running the driver runs fine.
We have the following security group configuration for a cluster of 3 nodes, created using the 2.6 flavor of the DataStax AMI. My security group is scraped from the documentation, with one minor exception: the following port was ignored:
8983 Custom TCP Rule TCP 0.0.0.0/0 Solr port and demo applications web site port (Portfolio, Search, Search log, Weather Sensors)
The only way to get around this error was to add the following rule (sketched in code after the link below):
ALL TCP | TCP (6) | ALL | cluster-security-group (using the picture as a reference, this would be sg-bbc40aff)
This leads me to believe that some process is trying to communicate with nodes in the cluster via another port.
http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/install/installAMIsecurity.html
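For what it's worth, a sketch of that catch-all rule via boto3 (assuming AWS credentials are configured; the group ID is the one mentioned above):
import boto3

ec2 = boto3.client("ec2")
# Hypothetical: allow all TCP traffic between members of the cluster security group.
ec2.authorize_security_group_ingress(
    GroupId="sg-bbc40aff",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": "sg-bbc40aff"}],  # self-referencing rule
    }],
)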
Has anyone run into this problem running Spark jobs using DSE Analytics on AWS?
Thanks.