Pyspark: No suitable Driver - amazon-web-services

I'm getting the following error: SQLException: No suitable driver
I set the driver class path and jars in my code as follows:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/driver/jars --jars /file/path/to/jars'
And when I call spark-submit I specify the jars to make sure that they're found.
spark-submit --driver-class-path /Users/my_user/tools/spark-3.12-bin-hadoop3.2/jars/postgresql-42.2.23.jar --jars /Users/my_user/tools/spark-3.12-bin-hadoop3.2/jars/postgresql-42.2.23.jar postgres_elt.py
I then added the driver and executor class-path settings to my spark-defaults template, but I'm still getting the same error.
I'm trying to connect to AWS from my local machine. I'm not sure what I'm doing wrong at this point.

To correct this issue, I added .option("driver", "org.postgresql.Driver") to the JDBC reader in my code. This fixed the issue, and I was able to connect without a problem.
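A minimal sketch of what such a reader might look like, assuming a PostgreSQL endpoint on AWS. The host, database, table, and credentials are placeholders that do not come from the original post; the only part taken from the fix is the explicit driver option.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Build the option map for Spark's JDBC data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Naming the driver class explicitly is the fix described in the post.
        "driver": "org.postgresql.Driver",
    }


# Placeholder connection details (assumptions for illustration only).
options = postgres_jdbc_options(
    host="my-db.example.com",
    port=5432,
    database="mydb",
    table="public.my_table",
    user="my_user",
    password="my_password",
)

# With a SparkSession named `spark`, the options would be applied like this:
# df = spark.read.format("jdbc").options(**options).load()
```

The PostgreSQL jar still has to be on the class path (for example through --jars, as in the question); the driver option only tells Spark which class to load from it.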

Related

How can I connect to a TLS-enabled DocumentDB cluster from Databricks Spark?

I want to connect to a DocumentDB cluster that has TLS enabled. I can do that from a Lambda function with rds-combined-ca-bundle.pem packaged alongside the Lambda code. I cannot do the same from Databricks: every node of the cluster needs this file, and when Spark tries to connect it always times out. I tried to create an init script by following the link below:
https://learn.microsoft.com/en-us/azure/databricks/kb/python/import-custom-ca-cert
However, that does not help either. Let me know if anyone has a clue about this kind of use case.
Note: I can connect to a TLS-disabled DocumentDB cluster from the same Databricks instance.
If you are getting connection timeouts when using an init script to import the rds-combined-ca-bundle.pem file on your Spark cluster, try the following steps:
Make sure that the rds-combined-ca-bundle.pem file is available on every node of your Spark cluster, not only on the driver. A Databricks cluster-scoped init script runs on each node at startup, so the copy step belongs there; if a node lacks the file, connections made from that node can fail or time out.
Use the --conf option when starting spark-shell or spark-submit to specify the location of the rds-combined-ca-bundle.pem file:
spark-shell --conf spark.mongodb.ssl.caFile=path/to/rds-combined-ca-bundle.pem
Check the Spark cluster logs to see whether the init script is being executed correctly or whether it is encountering any errors.
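One way to confirm where the bundle is actually present is to run a small check on the driver and on the executors. This is a rough sketch: the path is a placeholder, and the helper name ca_bundle_status is made up for illustration.

```python
import os
import socket


def ca_bundle_status(path):
    """Return (hostname, is_readable_file) for `path` on the current machine."""
    readable = os.path.isfile(path) and os.access(path, os.R_OK)
    return socket.gethostname(), readable


# Hypothetical location; use wherever your init script copies the bundle.
CA_PATH = "/path/to/rds-combined-ca-bundle.pem"

# Check the machine this code runs on (the driver, when run in a notebook).
print(ca_bundle_status(CA_PATH))

# With a SparkSession named `spark`, the same check can be sent to executors:
# results = (spark.sparkContext
#            .parallelize(range(32), 32)
#            .map(lambda _: ca_bundle_status(CA_PATH))
#            .distinct()
#            .collect())
# Any (hostname, False) entry points at a node without the file. Tasks are
# not guaranteed to land on every executor, so treat this as a spot check.
```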

AWS EMR pyspark java.lang.IllegalArgumentException when using pandas_udf

Using pyspark 2.4.7 and pyarrow 6.0.1.
I know from the documentation that there is a compatibility issue, so I need to set ARROW_PRE_0_15_IPC_FORMAT=1 inside spark-env.sh.
This solves the problem on my local machine; however, I am still getting the same error on AWS EMR 5.33.1.
I am using boto3 and have configured spark-env by passing
[...{'Classification': 'spark-env', 'Configurations':[{'Classification': 'export', 'Properties':{'ARROW_PRE_0_15_IPC_FORMAT':'1'}}],
'Properties':{}
}
EMR loads the property, and the configuration is visible in the EMR UI.
I've read that this configuration is only applied on the master node. Is that why the worker nodes are still hitting the same error?
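For reference, here is a sketch of how the full Configurations list could be built in Python. The spark-env entry mirrors the fragment in the question. The spark-defaults entry is an assumption that is not in the question: it uses Spark's spark.executorEnv.* and spark.yarn.appMasterEnv.* properties to pass the same variable to executors and to the YARN application master. Whether this resolves the error on EMR 5.33.1 is untested.

```python
def arrow_compat_configurations():
    """Build an EMR `Configurations` list that sets the Arrow compatibility flag."""
    return [
        {
            # Same classification as in the question: exported from spark-env.sh.
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"ARROW_PRE_0_15_IPC_FORMAT": "1"},
                }
            ],
            "Properties": {},
        },
        {
            # Assumption, not from the question: also hand the variable to the
            # executors and the YARN application master via Spark properties.
            "Classification": "spark-defaults",
            "Properties": {
                "spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT": "1",
                "spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT": "1",
            },
        },
    ]


configurations = arrow_compat_configurations()

# The list would be passed when creating the cluster, for example:
# boto3.client("emr").run_job_flow(..., Configurations=configurations)
```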

AWS EMR Spark error with `Failed to load class of driverClassName com.mysql.jdbc.Driver`

I'm currently trying to add a process in EMR 6.1.0 that will use Spark to store aggregated data in mysql.
However, when I actually run Spark, I get the following error.
Exception in thread "main" java.lang.RuntimeException: Failed to load class of driverClassName com.mysql.jdbc.
This error did not occur in EMR 6.0.0.
In the process of updating from EMR 6.0.0 to 6.1.0, I changed the Spark version from 2.4.4 to 3.0.0.
The code itself has not changed significantly, and we know that it is not a network problem.
I've spent a lot of time looking through the AWS documentation and can't seem to find any hints.
Can anyone help me?
Place the MySQL connector jar under the $SPARK_HOME/jars folder, or pass the MySQL connector jar path to the spark-shell/spark-submit command using the --jars flag.
Spark 3.x depends on HikariCP.
https://github.com/apache/spark/blob/v3.0.0/dev/deps/spark-deps-hadoop-3.2-hive-2.3#L1
The HikariCP that ships with Spark is loaded by Spark's own class loader, so it cannot see classes (such as the JDBC driver) that are packaged in your application jar.
https://github.com/brettwooldridge/HikariCP/blob/HikariCP-2.5.1/src/main/java/com/zaxxer/hikari/HikariConfig.java#L318
this.getClass().getClassLoader().loadClass(driverClassName)
You should add shade settings if you use the sbt-assembly plugin.
assembly / assemblyShadeRules := {
  Seq("com.zaxxer.hikari").map { packageName =>
    ShadeRule.rename(s"${packageName}.**" -> s"my_app_shade_package.${packageName}.@1").inAll
  }
}

amazon emr spark submission from S3 not working

I have a cluster up and running. I am trying to add a step to run my code. The code itself works fine on a single instance. Only thing is I can't get it to work off S3.
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--deploy-mode,cluster,--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
This is exactly what examples show I should do. What am I doing wrong?
Error I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown/unsupported param List(--executor-memory, 0.5g, --executor-cores, 2, --primary-py-file, s3://<mybucketname>/mypythonfile.py, --class, org.apache.spark.deploy.PythonRunner)
Usage: org.apache.spark.deploy.yarn.Client [options]
Options:
--jar JAR_PATH Path to your application's JAR file (required in yarn-cluster
mode)
.
.
.
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Command exiting with ret '1'
When I specify as this instead:
aws emr add-steps --cluster-id j-XXXXX --steps Type=spark,Name=SomeSparkApp,Args=[--executor-memory,0.5g,s3://<mybucketname>/mypythonfile.py]
I get this error instead:
Error: Only local python files are supported: Parsed arguments:
master yarn-client
deployMode client
executorMemory 0.5g
executorCores 2
EDIT: It gets further along when I manually create the Python file after SSH'ing into the cluster and specify it as follows:
aws emr add-steps --cluster-id 'j-XXXXX' --steps Type=spark,Name=SomeSparkApp,Args=[--executor-memory,1g,/home/hadoop/mypythonfile.py]
But it still does not do the job.
Any help appreciated. This is really frustrating, as a well-documented method on AWS's own blog (https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit) does not work.
Just in case: did you use the correct bucket names and cluster ID?
In any case, I had similar problems; for example, I could not use --deploy-mode,cluster when reading from S3.
When I used --deploy-mode,client,--master,local[4] in the arguments, I think it worked. I still needed something different, though. I can't remember exactly what, but I ended up with a solution like this:
Firstly, I use a bootstrap action where a shell script runs the command:
aws s3 cp s3://<mybucket>/wordcount.py wordcount.py
Then I add a step at cluster creation through the SDK in my Go application. The equivalent CLI command would look like this:
aws emr add-steps --cluster-id j-XXXXX --steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",--master,local[4],/home/hadoop/wordcount.py,s3://<mybucket>/<inputfile.txt>,s3://<mybucket>/<outputFolder>/]
I searched for days and finally discovered this thread which states
PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your python files must be on your local file system.
Until this is supported, the straightforward workaround then is to just
copy the files to your local machine.
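The command-runner.jar step from the workaround above can also be expressed as a step definition for an SDK call. This is a sketch in Python rather than the Go SDK mentioned in the answer; the helper name spark_submit_step is made up for illustration, and the bucket and cluster placeholders are kept exactly as in the CLI example.

```python
def spark_submit_step(name, local_script, script_args=(), master="local[4]"):
    """Build an EMR step that runs spark-submit on a script already on the node."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # The script path is local to the master node; the S3 locations
            # are passed through to the script as ordinary arguments.
            "Args": ["spark-submit", "--master", master, local_script, *script_args],
        },
    }


step = spark_submit_step(
    "Spark Program",
    "/home/hadoop/wordcount.py",
    ["s3://<mybucket>/<inputfile.txt>", "s3://<mybucket>/<outputFolder>/"],
)

# With boto3 installed and credentials configured, the step would be added with:
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXX", Steps=[step])
```

This still relies on the bootstrap action that copies the script from S3 to the node's local file system.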

Error starting Spark in EMR 4.0

I created an EMR 4.0 instance in AWS with all available applications, including Spark. I did it manually, through AWS Console. I started the cluster and SSHed to the master node when it was up. There I ran pyspark. I am getting the following error when pyspark tries to create SparkContext:
2015-09-03 19:36:04,195 ERROR Thread-3 spark.SparkContext
(Logging.scala:logError(96)) - -ec2-user, access=WRITE,
inode="/user":hdfs:hadoop:drwxr-xr-x at
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
I haven't added any custom applications or bootstrap actions, and I expected everything to work without errors. I'm not sure what's going on. Any suggestions will be greatly appreciated.
Log in as the user "hadoop" (http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-connect-master-node-ssh.html). That user has the proper environment and related settings for everything to work as expected. The error you are receiving is due to logging in as "ec2-user".
I've been working with Spark on EMR this week, and found a few weird things relating to user permissions and relative paths.
It seems that running Spark from a directory which you don't 'own', as a user, is problematic. In some situations Spark (or some of the underlying Java pieces) want to create files or folders, and they think that pwd - the current directory - is the best place to do that.
Try going to the home directory
cd ~
then running pyspark.
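A quick way to check both suggestions before launching pyspark is to look at the current user and whether the current directory is writable. This is only a diagnostic sketch; the helper name launch_dir_report is made up for illustration.

```python
import getpass
import os


def launch_dir_report(directory=None):
    """Report the current user and whether `directory` (default: cwd) is writable."""
    directory = directory or os.getcwd()
    try:
        user = getpass.getuser()
    except Exception:
        # Some minimal environments have no user database entry.
        user = "unknown"
    return {
        "user": user,
        "directory": directory,
        "writable": os.access(directory, os.W_OK),
    }


report = launch_dir_report()
print(report)
if not report["writable"]:
    print("Current directory is not writable; try `cd ~` before starting pyspark.")
```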