AWS Data Pipeline RedshiftCopy activity cannot find suitable drivers

I've set up a RedshiftCopy activity in AWS Data Pipeline, but it keeps failing with the following error:
java.lang.RuntimeException: java.sql.SQLException: No suitable driver found for <REDACTED>
    at private.com.google.common.base.Throwables.propagate(Unknown Source)
    at amazonaws.datapipeline.database.ConnectionFactory.getConnection(ConnectionFactory.java:145)
    at amazonaws.datapipeline.database.ConnectionFactory.getRedshiftDatabaseConnection(ConnectionFactory.java:80)
    at amazonaws.datapipeline.database.ConnectionFactory.getConnection(ConnectionFactory.java:47)
    at amazonaws.datapipeline.database.ConnectionFactory.getConnectionWithCredentials(ConnectionFactory.java:230)
    at amazonaws.datapipeline.redshift.RedshiftActivityRunnerFactory$RedshiftActivityRunner.<init>(RedshiftActivityRunnerFactory.java:29)
    at amazonaws.datapipeline.redshift.RedshiftActivityRunnerFactory.create(RedshiftActivityRunnerFactory.java:48)
    at amazonaws.datapipeline.activity.RedshiftCopyActivity.runActivity(RedshiftCopyActivity.java:49)
    at amazona
..etc
The "runsOn" EC2 instance is a Data Pipeline-managed resource, so I'm confused by the error, because I assumed that any instance that gets spun up by Data Pipeline, will have all the necessary resources installed.
Has anyone encountered this error before? What, if anything, did you do to fix it?
Thanks in advance.

Apparently, this is a known issue with AWS Data Pipeline. The suggested workaround for now is to use the Postgres JDBC driver instead of the Redshift one.
(Just change the "jdbc:redshift://..." prefix in the pipeline configuration's connection string to "jdbc:postgresql://...", keeping everything else the same.)
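For example, with a hypothetical cluster endpoint, the connection string would change from
jdbc:redshift://my-cluster.abc123.eu-west-1.redshift.amazonaws.com:5439/mydb
to
jdbc:postgresql://my-cluster.abc123.eu-west-1.redshift.amazonaws.com:5439/mydb
with the host, port, and database name left untouched.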

Related

Distcp from S3 to HDFS

I'm trying to copy data from S3 to HDFS using the distcp tool. The problem is that the S3 bucket is reached through a VPC endpoint and I don't know how to properly configure distcp. I have tried several configurations, but none has worked. Currently I'm using the following command:
hadoop distcp \
  -Dfs.s3a.access.key=[KEY] \
  -Dfs.s3a.secret.key=[SECRET] \
  -Dfs.s3a.region=eu-west-1 \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint=https://bucket.vpce-[vpce id].s3.eu-west-1.vpce.amazonaws.com \
  s3a://[BUCKET NAME]/[FILE] \
  hdfs://[DESTINATION]/[FILE]
But I'm getting this error:
22/03/16 09:14:39 ERROR tools.DistCp: Exception encountered org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExistV2 on [BUCKET NAME]: com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'vpce' is wrong; expecting 'eu-west-1'
Any ideas on how distcp should be configured with VPC endpoints?
Thanks in advance
You need Hadoop 3.3.1 for this; then it should work. Ideally use 3.3.2, which is now out.
Grab the cloudstore JAR and use its storediag command to debug this before going near distcp.
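For reference, a hedged sketch of what the command might look like on Hadoop 3.3.1+, where the endpoint region can be pinned per bucket via fs.s3a.endpoint.region (placeholders as in the question):
hadoop distcp \
  -Dfs.s3a.access.key=[KEY] \
  -Dfs.s3a.secret.key=[SECRET] \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint=https://bucket.vpce-[vpce id].s3.eu-west-1.vpce.amazonaws.com \
  -Dfs.s3a.bucket.[BUCKET NAME].endpoint.region=eu-west-1 \
  s3a://[BUCKET NAME]/[FILE] \
  hdfs://[DESTINATION]/[FILE]
And the storediag check suggested above (the JAR file name depends on the cloudstore release you download):
hadoop jar cloudstore-*.jar storediag s3a://[BUCKET NAME]/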

Errors during deployment to AWS using Terraform (cdktf)

I am trying to create or update Lambdas on AWS using Terraform CDKTF. During deployment, I am getting the following error:
"An event source mapping with SQS arn (\" arn:aws:sqs:eu-west-2:*******:*****-*****-******** \") and function (\" ******-******-****** \") already exists. Please update or delete the existing mapping with UUID *******-****-****-****-***********"
(The **** are sensitive values I have swapped out.)
Some of our Lambdas are invoked via SQS, which is what this mapping refers to. I assumed the first fix would be to remove any mappings that already exist (from a previous deployment that may have partly gone through), but I am unsure where to find them, or whether they can even be deleted. I originally assumed that calling cdktf deploy would update these mappings rather than throw this error at all.
Does anyone have any advice?
Your diagnosis seems right: there might be some stray resources left behind by an aborted / unfinished Terraform run. You should be able to clean up after these runs by running terraform destroy in the stack directory ./cdktf.out/stacks/..../. That should delete all previously existing resources created through this Terraform stack.
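If you would rather inspect or remove the stray mapping by hand, a minimal boto3 sketch (the function name is a placeholder; the UUID is the one reported in the error):
import boto3

lambda_client = boto3.client("lambda")

# List the event source mappings attached to the affected function.
response = lambda_client.list_event_source_mappings(FunctionName="my-function")
for mapping in response["EventSourceMappings"]:
    print(mapping["UUID"], mapping["EventSourceArn"], mapping["State"])

# Delete the stray mapping using the UUID from the error message.
# lambda_client.delete_event_source_mapping(UUID="00000000-0000-0000-0000-000000000000")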

How to fix `user must specify LSN` when using AWS DMS for Postgres RDS

I'm trying to migrate and synchronize a PostgreSQL database using AWS DMS and I'm getting the following error.
Last Error Task error notification received from subtask 0, thread 0
[reptask/replicationtask.c:2673] [1020101] When working with Configured Slotname, user must
specify LSN; Error executing source loop; Stream component failed at subtask 0, component
st_0_D27UO7SI6SIKOSZ4V6RH4PPTZQ ; Stream component 'st_0_D27UO7SI6SIKOSZ4V6RH4PPTZQ'
terminated [reptask/replicationtask.c:2680] [1020101] Stop Reason FATAL_ERROR Error Level FATAL
I already created a replication slot and configured its name in the source endpoint.
DMS Engine version: 3.1.4
Does anyone know of anything that could help me?
Luan -
I experienced the same issue while trying to replicate data from Postgres to an S3 bucket. I would check two things: your version of Postgres and the DMS version being used.
I downgraded my RDS postgres version to 9.6 and my DMS version to 2.4.5 to get replication working.
You can find more details here -
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.PostgreSQL.html
I wanted to try the newer versions of DMS (3.1.4 and 3.3.0 [beta]) as they have Parquet support, but I have gotten the same errors you mentioned above.
Hope this helps.
It appears AWS expects you to use the pglogical extension rather than test_decoding. You have to:
add pglogical to shared_preload_libraries in parameter options
reboot
CREATE EXTENSION pglogical;
On DMS 3.4.2 and Postgres 12.3, without the slotName= setting, DMS created the slot for itself. Also make sure you exclude the pglogical schema from the migration task, as it has unsupported data types.
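A minimal boto3 sketch of the shared_preload_libraries / reboot steps above (the parameter group and instance names are placeholders; shared_preload_libraries is a static parameter, so the change only takes effect after the reboot):
import boto3

rds = boto3.client("rds")

# Set shared_preload_libraries to pglogical in the instance's custom parameter
# group (append to any existing value as needed).
rds.modify_db_parameter_group(
    DBParameterGroupName="my-postgres-params",
    Parameters=[
        {
            "ParameterName": "shared_preload_libraries",
            "ParameterValue": "pglogical",
            "ApplyMethod": "pending-reboot",  # static parameter: applied at reboot
        }
    ],
)

# Reboot so the library is loaded, then run CREATE EXTENSION pglogical; on the database.
rds.reboot_db_instance(DBInstanceIdentifier="my-postgres-instance")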
P.S. When DMS hits resource limits, it fails silently. After resolving the LSN errors, I continued to get failures of the type Last Error Task 'psql2es' was suspended due to 6 successive unexpected failures Stop Reason FATAL_ERROR Error Level FATAL, without any errors in the logs. I resolved this by going to Advanced task settings > Full load tuning settings and tuning the parameters downward.

Spark org.postgresql.Driver not found even though it's configured (EMR)

I am trying to write a pyspark data frame to a Postgres database with the following code:
mode = "overwrite"
url = "jdbc:postgresql://host/database"
properties = {"user": "user","password": "password","driver": "org.postgresql.Driver"}
dfTestWrite.write.jdbc(url=url, table="test_result", mode=mode, properties=properties)
However, I am getting the following error:
An error occurred while calling o236.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
I've found a few SO questions that address a similar issue, but haven't found anything that helps. I followed the AWS docs here to add the configuration, and from the EMR console it looks as though it was applied successfully.
What am I doing wrong?
The document you followed is for adding a database connector for Presto; it is not a way to add a JDBC driver to Spark. A connector is not the same thing as a driver.
You should download the PostgreSQL JDBC driver and place it in Spark's lib directory, or somewhere you can reference it from your configuration.
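For example, one hedged way to do this from PySpark is to pull the driver from Maven when the session is created (the coordinate and version below are illustrative):
from pyspark.sql import SparkSession

# Fetch the PostgreSQL JDBC driver from Maven Central at session start-up.
spark = (
    SparkSession.builder
    .appName("postgres-write")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.5")
    .getOrCreate()
)

# Alternatively, download the jar yourself and pass it explicitly, e.g.:
#   spark-submit --jars /home/hadoop/postgresql-42.2.5.jar your_job.py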

An internal error occurred when attempting to deliver data in AWS Firehose data stream

I am implementing an AWS Kinesis Firehose data stream and facing an issue with data delivery from S3 to Redshift. Can you please help me and let me know what is missing?
An internal error occurred when attempting to deliver data. Delivery
will be retried; if the error persists, it will be reported to AWS for
resolution. InternalError 2
This happened to me, and the problem was a mismatch between the format of the input records and the DB table.
Check the AWS docs for the Redshift COPY command to make sure the COPY command parameters are defined properly.
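As an illustration only, a Redshift COPY for JSON records landed in S3 might look like this (table, bucket, prefix, and IAM role are placeholders):
-- Load JSON objects delivered to S3 into the target table.
COPY my_table
FROM 's3://my-firehose-bucket/my-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
JSON 'auto';
-- Add GZIP if the delivery stream compresses the objects it writes.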