Load data from S3 into Aurora Serverless using AWS Glue

According to Moving data from S3 -> RDS using AWS Glue,
I found that an instance is required to add a connection to a data target. However, my RDS is serverless, so there is no instance available. Does Glue support this case?

I have tried to connect Aurora MySQL Serverless with AWS Glue recently, and I failed with a timeout error:
Check that your connection definition references your JDBC database with
correct URL syntax, username, and password. Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago.
The driver has not received any packets from the server.
I think the reason is that Aurora Serverless doesn't have any continuously running instances, so there is no instance to point to in the connection URL, and that's why Glue cannot connect.
So you need to make sure that the DB is running; only then will your JDBC connection work.
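For example, here is a minimal boto3 sketch (the cluster identifier and region are placeholders of mine) to check whether the serverless cluster currently has capacity, i.e. is not paused, before you kick off the Glue job:

import boto3

rds = boto3.client("rds", region_name="us-east-1")
cluster = rds.describe_db_clusters(DBClusterIdentifier="my-serverless-cluster")["DBClusters"][0]

print(cluster["Status"])            # e.g. "available"
print(cluster.get("Capacity", 0))   # current ACUs; 0 usually means the cluster is paused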
If your DB runs in a private VPC, you can follow this link:
NAT Creation
EDIT:
Instead of a NAT gateway, you can also use the VPC endpoint for S3.
Here is a really good blog that explains it step by step.
Or the AWS documentation.
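If you go the VPC endpoint route, creating a gateway endpoint for S3 looks roughly like this (the VPC and route table IDs below are placeholders, not from the question):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint for S3, attached to the route table of the private subnets.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)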

AWS Glue supports this scenario, i.e., it works well for loading data from S3 into Aurora Serverless using an AWS Glue job. The engine version I'm currently using is 8.0.mysql_aurora.3.02.0.
Note: if you get an error saying Data source rejected establishment of connection, message from server: "Too many connections", you can increase the ACUs (mine is currently set to min 4 - max 8 ACUs, for reference), as the maximum number of connections depends on the ACU capacity.
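For anyone looking for a starting point, a minimal job script for this scenario could look like the sketch below; the bucket, table, and connection names are placeholders, and it assumes a Glue JDBC connection to the Aurora Serverless cluster already exists in the right VPC/subnet:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source data from S3.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write into Aurora through the pre-created JDBC connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="aurora-serverless-connection",  # placeholder connection name
    connection_options={"dbtable": "my_table", "database": "my_database"},
)

job.commit()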

I was able to build the connection using JDBC.
One very important thing: you should have at least one subnet whose security group opens all TCP ports, but you can scope that rule to the subnet's CIDR range.
With that setting, the connection test passes and the crawler can also create tables.
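If you prefer to script it, the rule described above (all TCP ports, but scoped to the subnet's CIDR) could be added with something like this; the security group ID and CIDR are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allow all TCP ports, but only from the subnet's CIDR range.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "IpRanges": [{"CidrIp": "10.0.1.0/24"}],
    }],
)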

Related

Using AWS Glue with Mysql in my EC2 Instance

I am trying to connect MySQL installed on my EC2 instance with Glue; the purpose is to extract some information and move it to Redshift, but I am receiving the following error:
Check that your connection definition references your JDBC database with correct URL syntax, username, and password. Communications link failure The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
This is the format that I am using: jdbc:mysql://host:3306/database
I am using the same VPC, same SG, and same subnet for the instance.
I know the user/password are correct because I can connect to the database with SQL Developer.
What do I need to check? Is it possible to use AWS Glue with MySQL on my instance?
Thanks in advance.
In the JDBC connection URL that you have mentioned, use the private IP of the EC2 instance (where MySQL is installed) as the host:
jdbc:mysql://ec2_instance_private_ip:3306/database
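If you create the connection programmatically, it might look like this rough sketch (the IP, credentials, subnet, and security group IDs are placeholders for illustration):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "mysql-on-ec2",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://10.0.1.25:3306/database",  # private IP of the EC2 host
            "USERNAME": "glue_user",
            "PASSWORD": "********",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)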
Yes, it is possible to use AWS Glue with MySQL running on your EC2 instance, but before that, you should first use DMS to migrate your databases.
Moreover, your target database (Redshift) has a different schema than the source database (MySQL); that's what we call a heterogeneous database migration (the schema structure, data types, and database code of the source database are quite different), so you need AWS SCT.
Check this out:
As you can see, I'm not sure you can migrate straight from MySQL in an EC2 instance to Amazon Redshift.
Hope this helps

Setting up a second connection with AWS Glue to a target RDS/Mysql instance fails

I'm trying to set up an ETL job with AWS Glue that should pull data from the production database on RDS/Aurora, run some very lightweight data manipulation (mainly removing some columns), and then output to another RDS/MySQL instance for a "data warehouse". Each component is in its own VPC. The RDS/Aurora <> AWS Glue connection works; however, I'm having a hard time figuring out what's wrong with the AWS Glue <> RDS/MySQL connection: the error is a generic "Check that your connection definition references your JDBC database with correct URL syntax, username, and password. Could not create connection to database server."
I've been following this step-by-step guide https://aws.amazon.com/blogs/big-data/connecting-to-and-running-etl-jobs-across-multiple-vpcs-using-a-dedicated-aws-glue-vpc/ and, I think, I covered all points. To debug, I've also tried to spin up a new EC2 instance in the same AWS Glue VPC and subnet, and I was able to access the output database from it.
Comparing the first working connection with the second one doesn't yield any obvious difference, and the fact that I was able to connect from an EC2 instance makes me even more confused about where the problem is.

RDS Aurora AppSync Error : 400 Bad Request

I am fairly new to AWS.
I am trying to create a simple app using Aurora and AppSync. So far, I have been able to create the Aurora database, connect to it using MySQL Workbench, and create the tables that I need.
I have also made the AppSync APIs and done the resolver (connected the resolver to the RDS Aurora DB).
Here is the problem I am facing: when I try to run queries from the AppSync Queries tab, it gives me the following error and message:
"errorType": "400 Bad Request",
"message": "RDSHttp:{\"message\":\"HttpEndPoint is not enabled for
arn:aws:rds:us***:cluster:***\"}" (I replaced some details with ***)
I have made my Aurora cluster accessible to the public, and I have tried to add a few inbound rules to the security group (i.e., allow all).
However, this error still persists. I have spent a few days on it and would appreciate any help I can get to resolve this.
Thanks in advance
AWS AppSync can connect to Aurora Serverless clusters. First, make sure that your Aurora cluster has an engine-mode of serverless. You can verify this via the CLI by using aws rds describe-db-clusters.
Once you've got a cluster that is serverless, enable the Data API for that cluster, which will allow queries via HTTP.
Keep in mind that as of now these features are in beta and not recommended for production usage.
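For example, a rough boto3 sketch to verify the engine mode and turn on the Data API (the cluster identifier is a placeholder):

import boto3

rds = boto3.client("rds", region_name="us-east-1")

cluster = rds.describe_db_clusters(DBClusterIdentifier="my-aurora-cluster")["DBClusters"][0]
print(cluster["EngineMode"])              # should be "serverless"
print(cluster.get("HttpEndpointEnabled")) # False if the Data API is off

# Enable the Data API (HTTP endpoint) so AppSync can query the cluster.
rds.modify_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",
    EnableHttpEndpoint=True,
)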

AWS Data Pipeline. EC2Resource not able to access redshift

I am using AWS Data Pipeline to execute SQL queries on Redshift, which may involve creating/deleting tables, for the first time.
I created a SQL Activity that "Runs On" an EC2 instance created as part of the data pipeline, and a Redshift database node with the appropriate credentials.
But while running the pipeline, the EC2 instance could not access the Redshift database. The error thrown is as follows:
Unable to establish connection to jdbc:postgresql://xxxxx/yyyy Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
It may be because the "ResourceRole" parameter of the EC2 resource is set to DataPipelineDefaultResource and that IAM role may not have the right permissions to access the Redshift DB.
What is the right IAM role if that is the root cause, or could there be some other reason?
Can you connect to the cluster using a normal client? If you can't, then it's likely there's no ingress allowed on the Redshift cluster. Maybe this might help
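A quick way to test that from a normal client (the endpoint, database, and credentials below are placeholders; requires psycopg2):

import psycopg2

try:
    conn = psycopg2.connect(
        host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="mydb",
        user="admin",
        password="********",
        connect_timeout=5,
    )
    print("Connected")
    conn.close()
except psycopg2.OperationalError as exc:
    # "Connection refused" or a timeout here usually points to a security-group
    # or network-path problem rather than bad credentials or IAM roles.
    print("Could not connect:", exc)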

AWS Glue ETL job from AWS Redshift to S3 fails

I am trying out the AWS Glue service to ETL some data from Redshift to S3. The crawler runs successfully and creates the meta table in the data catalog; however, when I run the ETL job (generated by AWS) it fails after around 20 minutes saying "Resource unavailable".
I cannot see AWS Glue logs or error logs created in CloudWatch. When I try to view them it says "Log stream not found. The log stream jr_xxxxxxxxxx could not be found. Check if it was correctly created and retry."
I would appreciate it if you could provide any guidance to resolve this issue.
So basically, the job you add to Glue will only run if there's not too much traffic in the region your Glue is in. If there are no resources available, you need to either manually re-add the job, or you can subscribe to events from CloudWatch via SNS.
Also, there are parameters you can pass to the job, like MaxRetries and Timeout.
If you get a "Resource not available", it won't trigger a retry because the job did not fail; it just never even started. But if you set the timeout to, let's say, 60 minutes, it will trigger an error after that time, decrement your retry pool, and re-launch the job.
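For example, when defining the job with boto3 you can set both of those; the job name, role ARN, and script path below are placeholders:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_job(
    Name="redshift-to-s3-etl",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",           # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/redshift_to_s3.py",  # placeholder script
        "PythonVersion": "3",
    },
    MaxRetries=2,   # re-run the job up to twice if a run fails
    Timeout=60,     # minutes; a run exceeding this is stopped and counted as failed
)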
The closest thing I see to Glue documentation on this is here:
If you encounter errors in AWS Glue, use the following solutions to help you find the source of the problems and fix them. Note: The AWS Glue GitHub repository contains additional troubleshooting guidance in AWS Glue Frequently Asked Questions.
Error: Resource Unavailable. If AWS Glue returns a resource unavailable message, you can view error messages or logs to help you learn more about the issue. The following tasks describe general methods for troubleshooting.
• A custom DNS configuration without reverse lookup can cause AWS Glue to fail. Check your DNS configuration. If you are using Amazon Route 53 or Microsoft Active Directory, make sure that there are forward and reverse lookups. For more information, see Setting Up DNS in Your VPC (p. 23).
• For any connections and development endpoints that you use, check that your cluster has not run out of elastic network interfaces.
I have recently struggled with the Resource Unavailable error thrown by a Glue job.
I was also not able to make a direct connection in Glue using RDS; it said "no suitable security group found".
I faced this issue while trying to connect with AWS RDS and Redshift.
The problem was with the Security Group that Redshift was using. You need to place a self-referencing inbound rule in the Security Group.
For those who don't know what a self-referencing inbound rule is, follow these steps:
1) Go to the Security Group you are using (VPC -> Security Groups)
2) In the Inbound Rules, select Edit Inbound Rules
3) Add a Rule:
a) Type - All Traffic
b) Protocol - All
c) Port Range - All
d) Source - Custom, then in the field type the first characters of your security group and select it
e) Save it
It's done!
If you were missing this rule in your Security Group inbound rules, try creating the connection again; this time you will be able to create it.
The job should also work this time.
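The same self-referencing rule can be added with boto3 if you prefer (the security group ID is a placeholder):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
sg_id = "sg-0123456789abcdef0"

# Allow all traffic whose source is the security group itself (self-reference).
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "-1",                       # all protocols / all ports
        "UserIdGroupPairs": [{"GroupId": sg_id}],
    }],
)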