AWS Data Pipeline: EC2Resource not able to access Redshift

I am using AWS Data Pipeline for the first time to execute SQL queries on Redshift, which may involve creating/deleting tables.
I created a SQL Activity that "Runs On" an EC2 instance created as part of the data pipeline, and a Redshift Database node with the appropriate credentials.
But while running the pipeline, the EC2 instance could not access the Redshift database. The error thrown is as follows:
Unable to establish connection to jdbc:postgresql://xxxxx/yyyy Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
It may be because the "ResourceRole" parameter of the EC2 resource is set to DataPipelineDefaultResource, and that IAM role may not have the right permissions to access the Redshift DB.
What is the right IAM role if that is the root cause, or could there be some other reason?

Can you connect to the cluster using a normal client? If you can't, then it's likely there's no ingress allowed on the Redshift cluster. Maybe this might help
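If the ingress is indeed missing, a minimal sketch (boto3) of opening the Redshift port to the pipeline's EC2 security group could look like the following; the cluster identifier and the security group ID of the EC2 resource are placeholders:
import boto3

redshift = boto3.client("redshift")
ec2 = boto3.client("ec2")

# Find the VPC security group attached to the Redshift cluster.
cluster = redshift.describe_clusters(ClusterIdentifier="yyyy-cluster")["Clusters"][0]
cluster_sg = cluster["VpcSecurityGroups"][0]["VpcSecurityGroupId"]

# Allow the pipeline's EC2 resource (by its security group) to reach Redshift on 5439.
ec2.authorize_security_group_ingress(
    GroupId=cluster_sg,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,
        "ToPort": 5439,
        "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],  # hypothetical EC2 resource SG
    }],
)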

Related

Orchestration of Redshift Stored Procedures using AWS Managed Airflow

I have many Redshift stored procedures (15-20); some can run asynchronously, while many have to run in a synchronous manner.
I tried scheduling them in async and sync fashion using AWS EventBridge but found many limitations (failure handling and orchestration).
So I went ahead and used AWS Managed Airflow.
How can we set up the Redshift cluster connection in Airflow,
so that we can call our stored procedures in Airflow DAGs and they will run on the Redshift cluster?
Is there any RedshiftOperator available for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
If possible, can we achieve all this using the AWS console only, without the AWS CLI?
How can we set up the Redshift cluster connection in Airflow,
so that we can call our stored procedures in Airflow DAGs and they will run on the Redshift cluster?
You can use Airflow Connections to connect to Redshift. This is the native approach for managing connections to external services such as databases.
Managing Connections (Airflow)
Amazon Redshift Connection (Airflow)
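If you prefer to create the Redshift connection in code rather than through the Airflow UI, a minimal sketch could look like the following; the host, schema and credentials are placeholders:
from airflow import settings
from airflow.models import Connection

# Redshift speaks the Postgres wire protocol, so a "postgres" connection works.
conn = Connection(
    conn_id="redshift_default",
    conn_type="postgres",
    host="my-cluster.xxxx.eu-west-1.redshift.amazonaws.com",  # hypothetical endpoint
    schema="dev",
    login="awsuser",
    password="********",
    port=5439,
)

session = settings.Session()
session.add(conn)
session.commit()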
Is there any RedshiftOperator available for the connection, or can we create a direct connection to the Redshift cluster using the connection option in the Airflow menu?
You can use the PostgresOperator to execute SQL commands in the Redshift cluster. When initializing the PostgresOperator, set the postgres_conn_id parameter to the Redshift connection ID (e.g. redshift_default). Example:
PostgresOperator(
    task_id="call_stored_proc",
    postgres_conn_id="redshift_default",
    sql="sql/stored_proc.sql",
)
PostgresOperator (Airflow)
How-to Guide for PostgresOperator (Airflow)
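For context, here is a minimal DAG sketch that calls a stored procedure through that connection; the DAG id, schedule and procedure name (my_stored_proc) are made up for illustration:
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="redshift_stored_procs",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    call_proc = PostgresOperator(
        task_id="call_stored_proc",
        postgres_conn_id="redshift_default",
        sql="CALL my_stored_proc();",  # the procedure runs inside the Redshift cluster
    )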
If possible, can we achieve all this using the AWS console only, without the AWS CLI?
No, it's not possible to achieve this only using the AWS console.

Using AWS Glue with MySQL in my EC2 Instance

I am trying to connect the MySQL installed on my EC2 instance with Glue; the purpose is to extract some information and move it to Redshift, but I am receiving the following error:
Check that your connection definition references your JDBC database with correct URL syntax, username, and password. Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
This is the format that I am using: jdbc:mysql://host:3306/database
I am using the same VPC, same SG, and same subnet for the instance.
I know the user/password are correct because I can connect to the database with SQL Developer.
What do I need to check? Is it possible to use AWS Glue with MySQL on my instance?
Thanks in advance.
In the JDBC connection URL that you have mentioned, use the private IP of the EC2 instance (where MySQL is installed) as the host:
jdbc:mysql://ec2_instance_private_ip:3306/database
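As a rough sketch (boto3) of what the Glue connection definition could look like with that private-IP URL; the IP, credentials, subnet and security group IDs are placeholders:
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "ec2-mysql",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://10.0.1.25:3306/database",  # EC2 private IP
            "USERNAME": "glue_user",
            "PASSWORD": "********",
        },
        # Same VPC/subnet/security group as the EC2 instance running MySQL.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "eu-west-1a",
        },
    }
)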
Yes, it is possible to use AWS Glue with MySQL running on your EC2 instance, but you should first use DMS to migrate your databases.
Moreover, your target database (Redshift) has a different schema than the source database (MySQL); that's what we call a heterogeneous database migration (the schema structure, data types, and database code of the source and target are quite different), so you need AWS SCT.
Check this out:
As you can see, I'm not sure you can migrate straight from MySQL in an EC2 instance to Amazon Redshift.
Hope this helps

Load data from S3 into Aurora Serverless using AWS Glue

According to Moving data from S3 -> RDS using AWS Glue
I found that an instance is required to add a connection to a data target. However, my RDS is serverless, so there is no instance available. Does Glue support this case?
I tried to connect Aurora MySQL Serverless with AWS Glue recently, and I failed. I got a timeout error:
Check that your connection definition references your JDBC database with correct URL syntax, username, and password. Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
I think the reason was that Aurora Serverless doesn't have any continuously running instances, so you cannot point the connection URL at an instance, and that's why Glue cannot connect.
So you need to make sure the DB is running; only then will your JDBC connection work.
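A minimal sketch (boto3) for checking that the cluster is actually up before starting the Glue job; the cluster identifier is a placeholder, and a paused Serverless v1 cluster typically reports a capacity of 0:
import boto3

rds = boto3.client("rds")

cluster = rds.describe_db_clusters(DBClusterIdentifier="my-aurora-serverless")["DBClusters"][0]
status = cluster["Status"]
capacity = cluster.get("Capacity")  # current ACUs for Serverless v1

if status != "available" or capacity == 0:
    raise RuntimeError(f"Cluster not ready (status={status}, capacity={capacity}); resume it first")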
If your DB runs in a private VPC, you can follow this link:
Nat Creation
EDIT:
Instead of NAT GW, you can also use the VPC endpoint for S3.
Here is a really good blog that explains step by step.
Or AWS documentation
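A minimal sketch (boto3) of adding an S3 gateway endpoint to the connection's VPC instead of a NAT gateway; the region, VPC and route table IDs are placeholders:
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # route table used by the Glue connection's subnet
)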
AWS Glue supports the scenario, i.e., it works well to load data from S3 into Aurora Serverless using an AWS Glue job. The engine version I'm currently using is 8.0.mysql_aurora.3.02.0
Note: if you get an error saying Data source rejected establishment of connection, message from server: "Too many connections", you can increase ACUs (currently mine is set to min 4 - max 8 ACUs for your reference), as the maximum number of connections depends on the capacity of ACUs.
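If you hit that error, a minimal sketch (boto3) of widening the capacity range, assuming a Serverless v2 cluster; the cluster identifier and ACU values are only examples:
import boto3

rds = boto3.client("rds")

rds.modify_db_cluster(
    DBClusterIdentifier="my-aurora-serverless",
    ServerlessV2ScalingConfiguration={"MinCapacity": 4, "MaxCapacity": 8},
    ApplyImmediately=True,
)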
I was able to build the connection using JDBC.
One very important thing: you should have at least one security group rule that opens all TCP ports, but you can restrict the source to the subnet.
With that setting, the connection test passes and the crawler can also create tables.
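A minimal sketch (boto3) of that rule: allow all TCP ports on the connection's security group, with the source restricted to the subnet's CIDR (a self-referencing rule, with the group itself as the source, is the other commonly used variant); the group ID and CIDR are placeholders:
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # security group attached to the Glue connection
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "IpRanges": [{"CidrIp": "10.0.1.0/24"}],  # the connection's subnet CIDR
    }],
)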

Setting up a second connection with AWS Glue to a target RDS/Mysql instance fails

I'm trying to set up an ETL job with AWS Glue that should pull data from the production database on RDS/Aurora, run some very lightweight data manipulation (mainly removing some columns) and then output to another RDS/MySQL instance used as a "data warehouse". Each component is in its own VPC. RDS/Aurora <> AWS Glue works; however, I'm having a hard time figuring out what's wrong with the AWS Glue <> RDS/MySQL connection: the error is a generic "Check that your connection definition references your JDBC database with correct URL syntax, username, and password. Could not create connection to database server."
I've been following this step-by-step guide https://aws.amazon.com/blogs/big-data/connecting-to-and-running-etl-jobs-across-multiple-vpcs-using-a-dedicated-aws-glue-vpc/ and - I think - I covered all points. To debug, I also spun up a new EC2 instance in the same AWS Glue VPC and subnet, and I was able to access the output database from it.
Comparing the first working connection with the second one doesn't reveal any obvious difference, and the fact that I was able to connect from an EC2 instance makes me even more confused about where the problem is.

How to use AWS DMS from one region to another?

I am trying to use AWS DMS to move data from a source database (AWS RDS MySQL) in the Paris region (eu-west-3) to a target database (AWS Redshift) in the Ireland region (eu-west-1). The goal is to continuously replicate ongoing changes.
I am running into this kind of error:
An error occurred (InvalidResourceStateFault) when calling the CreateEndpoint operation: The redshift cluster datawarehouse needs to be in the same region as the current region. The cluster's region is eu-west-1 and the current region is eu-west-3.
The documentation says:
The only requirement to use AWS DMS is that one of your endpoints must be on an AWS service.
So what I am trying to do should be possible. In practice, it seems it's not allowed.
How can I use AWS DMS from one region to another?
In what region should my endpoints be?
In what region should my replication task be?
My replication instance has to be in the same region as the RDS MySQL instance because they need to share a subnet.
AWS provides this whitepaper called "Migrating AWS Resources to a New AWS Region", updated last year. You may want to contact their support, but an idea would be to move your RDS to another RDS in the proper region, before migrating to Redshift. In the whitepaper, they provide an alternative way to migrate RDS (without DMS, if you don't want to use it for some reason):
1. Stop all transactions or take a snapshot (however, changes after this point in time are lost and might need to be reapplied to the target Amazon RDS DB instance).
2. Using a temporary EC2 instance, dump all data from Amazon RDS to a file:
   - For MySQL, make use of the mysqldump tool. You might want to compress this dump (see bzip or gzip). A minimal sketch follows this list.
   - For MS SQL, use the bcp utility to export data from the Amazon RDS SQL DB instance into files. You can use the SQL Server Generate and Publish Scripts Wizard to create scripts for an entire database or for just selected objects. Note: Amazon RDS does not support Microsoft SQL Server backup file restores.
   - For Oracle, use the Oracle Export/Import utility or the Data Pump feature (see http://aws.amazon.com/articles/AmazonRDS/4173109646282306).
   - For PostgreSQL, you can use the pg_dump command to export data.
3. Copy this data to an instance in the target region using standard tools such as CP, FTP, or Rsync.
4. Start a new Amazon RDS DB instance in the target region, using the new Amazon RDS security group.
5. Import the saved data.
6. Verify that the database is active and your data is present.
7. Delete the old Amazon RDS DB instance in the source region.
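For the MySQL dump-and-restore step, a minimal sketch in Python wrapping the mysql client tools; the endpoints, database name and credentials are placeholders:
import subprocess

SRC_HOST = "source-db.xxxx.eu-west-3.rds.amazonaws.com"   # old RDS endpoint
DST_HOST = "target-db.xxxx.eu-west-1.rds.amazonaws.com"   # new RDS endpoint in the target region

# Dump the source database to a local file (compress it for large databases).
with open("dump.sql", "wb") as f:
    subprocess.run(
        ["mysqldump", "-h", SRC_HOST, "-u", "admin", "--password=********", "--databases", "mydb"],
        stdout=f, check=True,
    )

# Load the dump into the new instance in the target region.
with open("dump.sql", "rb") as f:
    subprocess.run(
        ["mysql", "-h", DST_HOST, "-u", "admin", "--password=********"],
        stdin=f, check=True,
    )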
I found a workaround that I am currently testing.
I declare "Postgres" as the engine type for the Redshift cluster. It tricks AWS DMS into thinking it's an external database and AWS DMS no longer checks for regions.
I think it will result in degraded performance, because DMS will probably feed data to Redshift using INSERTs instead of the COPY command.
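A minimal sketch (boto3) of that workaround: register the Redshift cluster as a plain "postgres" target endpoint; the endpoint identifier, host and credentials are placeholders:
import boto3

dms = boto3.client("dms", region_name="eu-west-3")

dms.create_endpoint(
    EndpointIdentifier="redshift-as-postgres",
    EndpointType="target",
    EngineName="postgres",  # instead of "redshift", so DMS skips the same-region check
    ServerName="datawarehouse.xxxx.eu-west-1.redshift.amazonaws.com",
    Port=5439,
    DatabaseName="dev",
    Username="awsuser",
    Password="********",
)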
Currently Redshift has to be in the same region as the replication instance.
The Amazon Redshift cluster must be in the same AWS account and the same AWS Region as the replication instance.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.Redshift.html
So one should create the replication instance in the Redshift region inside a VPC.
Then use VPC peering to enable the replication instance to connect to the VPC of the MySQL instance in the other region:
https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html
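A minimal sketch (boto3) of the cross-region peering request; the VPC IDs are placeholders, and the peer side still has to accept the request and both route tables need routes added:
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # region of the replication instance

response = ec2.create_vpc_peering_connection(
    VpcId="vpc-0aaaaaaaaaaaaaaaa",      # VPC of the DMS replication instance
    PeerVpcId="vpc-0bbbbbbbbbbbbbbbb",  # VPC of the RDS MySQL instance
    PeerRegion="eu-west-3",
)
print(response["VpcPeeringConnection"]["VpcPeeringConnectionId"])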