Spark org.postgresql.Driver not found even though it's configured on EMR

I am trying to write a pyspark data frame to a Postgres database with the following code:
mode = "overwrite"
url = "jdbc:postgresql://host/database"
properties = {"user": "user","password": "password","driver": "org.postgresql.Driver"}
dfTestWrite.write.jdbc(url=url, table="test_result", mode=mode, properties=properties)
However, I am getting the following error:
An error occurred while calling o236.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
I've found a few SO questions that address a similar issue, but nothing that helps. I followed the AWS docs here to add the configuration, and from the EMR console it looks as though it was successful.
What am I doing wrong?

The document you followed describes adding a database connector for Presto; it is not a way to add the JDBC driver to Spark. A connector is not the same thing as a driver.
You should download the PostgreSQL JDBC driver and either place it in the Spark lib directory or put it somewhere else and point Spark to it through configuration.
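One way to point Spark at the driver is to pass the JAR when the job is submitted. The following is a minimal sketch, assuming the driver JAR has been downloaded to a hypothetical path /home/hadoop/postgresql.jar on the EMR master node and that the job script is called write_job.py; both names are placeholders, not taken from the question.

```python
def spark_submit_args(driver_jar, script):
    """Build a spark-submit command line that ships a JDBC driver JAR."""
    return [
        "spark-submit",
        # --jars distributes the JAR to the driver and the executors.
        "--jars", driver_jar,
        # --driver-class-path also puts it on the driver's own classpath.
        "--driver-class-path", driver_jar,
        script,
    ]

# Hypothetical paths; adjust to where the JAR and the job actually live.
args = spark_submit_args("/home/hadoop/postgresql.jar", "write_job.py")
print(" ".join(args))
```

The same JAR can instead be named in the spark.jars configuration property when the Spark session is created.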

Related

Exception in thread "main" com.google.cloud.bigquery.dms.common.exceptions.AgentException: Unable to start tbuild command

I am trying to migrate a table from Teradata to BigQuery using GCP's Data Transfer functionality. I have followed the steps suggested at https://cloud.google.com/bigquery-transfer/docs/teradata-migration.
A detailed description of the steps is below:
The APIs suggested on the above page were enabled.
A pre-existing GCS bucket, BigQuery dataset and GCP service account were used in this process.
Google SDK setup was completed on the device.
Google Cloud service account key was set to an environment variable called GOOGLE_APPLICATION_CREDENTIALS.
BigQuery Data Transfer was set up.
Initialized the migration agent
On the command line, a command to run the jar file was issued, with some particular flags e.g.
java -cp C:\migration\tdgssconfig.jar;C:\migration\terajdbc4.jar;C:\migration\mirroring-agent.jar com.google.cloud.bigquery.dms.Agent --initialize
When prompted, the required parameters were entered.
When prompted for a BigQuery Data Transfer Service resource name, the name of the data transfer config from GCP was entered.
After entering all the requested parameters, the migration agent creates a configuration file and puts it into the local path provided in the parameters.
Run the migration agent
The following command was executed, using the classpath to the JDBC drivers and the path to the configuration file created in the previous initialization step, e.g.
java -cp C:\migration\tdgssconfig.jar;C:\migration\terajdbc4.jar;C:\migration\mirroring-agent.jar com.google.cloud.bigquery.dms.Agent --configuration-file=config.json
At this point, the agent failed with the error “Unable to start tbuild command”.
The following steps were taken to try to resolve this error, using the steps given here:
Teradata Parallel Transporter was installed using Teradata Tools and Utilities.
No bin file was found for the installation.
The error message is:
Exception in thread "main" com.google.cloud.bigquery.dms.common.exceptions.AgentException: Unable to start tbuild command
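Since no bin directory was found, one thing that may be worth checking is whether the tbuild executable from Teradata Parallel Transporter can be resolved from the PATH of the shell that starts the agent. The following is a minimal sketch using only the Python standard library. That the agent locates tbuild through PATH is an assumption here, not something the error message confirms.

```python
import shutil

def find_executable(name):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)

# Assumption: the agent needs "tbuild" to be resolvable from PATH.
location = find_executable("tbuild")
if location is None:
    print("tbuild was not found on PATH")
else:
    print("tbuild found at", location)
```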

Reading from Amazon S3 in Flink: UnsupportedFileSystemSchemeException

I am trying to read from an S3 bucket in a Java Flink (version 1.11) application. I am trying to read an S3 file into a DataSet object using the following code:
ExecutionEnvironment executionEnv = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> dataSet = executionEnv.readTextFile("s3a://<bucket>/<key>");
Every time I try to run this I run into this error:
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 's3a'
...
When I look at other posts about this error, the cause seems to be that I need to configure flink-s3-fs-hadoop in my environment. I have been trying to configure it for a while now, but the same error keeps occurring: I have the .jar for flink-s3-fs-hadoop in my plugins directory, but it is not even recognized.
What are the steps for making this work within an IDE? All documentation online is very vague.

How to fix `user must specify LSN` when using AWS DMS for Postgres RDS

I'm trying to migrate and synchronize a PostgreSQL database using AWS DMS and I'm getting the following error.
Last Error Task error notification received from subtask 0, thread 0
[reptask/replicationtask.c:2673] [1020101] When working with Configured Slotname, user must
specify LSN; Error executing source loop; Stream component failed at subtask 0, component
st_0_D27UO7SI6SIKOSZ4V6RH4PPTZQ ; Stream component 'st_0_D27UO7SI6SIKOSZ4V6RH4PPTZQ'
terminated [reptask/replicationtask.c:2680] [1020101] Stop Reason FATAL_ERROR Error Level FATAL
I already created a replication slot and configured its name in the source endpoint.
DMS Engine version: 3.1.4
Does anyone know anything that could help me?
Luan -
I experienced the same issue when trying to replicate data from Postgres to an S3 bucket. I would check two things: your version of Postgres and the DMS version being used.
I downgraded my RDS postgres version to 9.6 and my DMS version to 2.4.5 to get replication working.
You can find more details here -
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
I wanted to try the newer versions of DMS (3.1.4 and 3.3.0[beta]) as they have Parquet support, but I got the same errors you mentioned above.
Hope this helps.
It appears AWS expects you to use the pglogical extension rather than test_decoding. You have to:
add pglogical to shared_preload_libraries in parameter options
reboot
CREATE EXTENSION pglogical;
On DMS 3.4.2 and Postgres 12.3, without the slotName= setting, DMS created the slot for itself. Also make sure you exclude the pglogical schema from the migration task, as it has unsupported data types.
P.S. When DMS hits resource limits it silently fails. After resolving the LSN errors, I continued to get failures of the type Last Error Task 'psql2es' was suspended due to 6 successive unexpected failures Stop Reason FATAL_ERROR Error Level FATAL without any errors in the logs. I resolved this issue using the Advanced task settings > Full load tuning settings and tuning the parameters downward.
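Excluding the pglogical schema, as mentioned above, is done through the task's table mappings. The following is a minimal sketch that builds such a mapping as JSON, assuming the data to migrate lives in a schema called public (an assumption; the schema is not named in the question). The rule IDs and names are arbitrary placeholders.

```python
import json

def table_mappings(include_schema, exclude_schema):
    """Build DMS selection rules: include one schema, exclude another."""
    def rule(rule_id, schema, action):
        return {
            "rule-type": "selection",
            "rule-id": str(rule_id),
            "rule-name": str(rule_id),
            # "%" is the wildcard that matches every table in the schema.
            "object-locator": {"schema-name": schema, "table-name": "%"},
            "rule-action": action,
        }
    return {"rules": [rule(1, include_schema, "include"),
                      rule(2, exclude_schema, "exclude")]}

# "public" is an assumed source schema; "pglogical" is the schema to skip.
mappings = table_mappings("public", "pglogical")
print(json.dumps(mappings, indent=2))
```

The printed JSON can be pasted into the table mappings editor of the migration task.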

Tibco 7.14 connectivity issues with Hive EMR (AWS) using JDBC driver

Unable to view the table contents in Tibco Spotfire Analytics Explorer.
I have copied the Hive EMR jar files to the library folder on the Tibco server. Once the server was up, I tried to configure the data source in Information Designer. I am able to set up the data source and I can see the schema, but when I try to expand the table, I do not get any results.
This is the error message I am getting:
Error message: An issue occurred while creating the default model. It may be partially constructed.
The data source reported a failure.
InformationModelException at Spotfire.Dxp.Data:
Error retrieving metadata: java.lang.NullPointerException (HRESULT: 80131500)
The following is the URL template in Information Designer
jdbc:hive2://<host>:<port10000>/<database>
which I changed to
jdbc:hive2://sitename:10000/db_name.
Please let me know what needs to be changed in the driver config, or anywhere else, so that I can see the contents of the table.

Issue connecting to Databricks table from Azure Data Factory using the Spark odbc connector

We have managed to get a valid connection from Azure Data Factory towards our Azure Databricks cluster using the Spark (odbc) connector. In the list of tables we do get the expected list, but when querying a specific table we get an exception.
ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code:
'0' error message:
'com.databricks.backend.daemon.data.common.InvalidMountException:
Error while using path xxxx for resolving path xxxx within mount at
'/mnt/xxxx'.'.. Activity ID:050ac7b5-3e3f-4c8f-bcd1-106b158231f3
In our case the Databricks tables are backed by mounted Parquet files stored in Azure Data Lake Storage Gen2, which is related to the above exception. Any suggestions on how to solve this issue?
P.S. The same error appears when connecting from Power BI Desktop.
Thanks
Bart
In the configuration you use to mount the lake, can you add this setting:
"fs.azure.createRemoteFileSystemDuringInitialization": "true"
I haven't tried your exact scenario - however this solved a similar problem for me using Databricks-Connect.
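As an illustration of where the setting goes, the following is a minimal sketch of a mount configuration dictionary. It assumes the lake is mounted with OAuth and a service principal, which the question does not state; the client ID, secret and tenant ID are placeholders.

```python
def mount_configs(client_id, client_secret, tenant_id):
    """Build extra_configs for an ADLS Gen2 mount using OAuth (assumed)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/%s/oauth2/token" % tenant_id,
        # The setting suggested above.
        "fs.azure.createRemoteFileSystemDuringInitialization": "true",
    }

# Placeholder credentials; real values would come from a secret scope.
configs = mount_configs("<client-id>", "<client-secret>", "<tenant-id>")

# In a Databricks notebook the dict would be passed as extra_configs, e.g.
# dbutils.fs.mount(
#     source="abfss://<container>@<account>.dfs.core.windows.net/",
#     mount_point="/mnt/xxxx",
#     extra_configs=configs)
```

An existing mount would presumably need to be unmounted and mounted again for the changed configuration to take effect.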