Does AWS Glue provide the ability to move data from an S3 bucket to an RDS database?
I'm trying to set up a serverless app that picks up dynamic data uploaded to S3 and migrates it to RDS.
Glue provides a Crawlers service that determines the schema.
Glue also provides ETL Jobs, but these seem to support only another S3 bucket as the target.
Any ideas?
Yes, Glue can send to an RDS datastore. If you are using the job wizard, it will give you a target option of "JDBC". If you select JDBC, you can set up a connection to your RDS instance.
(Screenshots: JDBC connection window and RDS option window)
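For reference, a minimal sketch of what such a job script can look like, assuming a crawler has already catalogued the S3 data and a Glue JDBC connection to the RDS instance exists (the database, table, and connection names below are placeholders):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: catalog table created by the crawler over the S3 data (placeholder names)
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_s3_table"
)

# Target: RDS instance reached through the Glue JDBC connection (placeholder names)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="my-rds-connection",
    connection_options={"dbtable": "target_table", "database": "target_db"},
)

job.commit()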
Please check the link below; you will find the solution there:
Load the data from an S3 bucket to an RDS Oracle instance using AWS Glue
https://youtu.be/rBFfYpHP1PM
Related
I am using AWS RDS (MySQL) and I would like to sync this data to AWS Elasticsearch in real time.
I am thinking that the best solution for this is AWS Glue, but I am not sure whether it can do what I want.
Here is some information about my RDS database:
■ RDS
・I would like to sync several MySQL tables to OpenSearch (1 table to 1 index).
・The table schemas will change dynamically.
・New columns may be added or existing columns removed since the previous sync
(so I also have to sync these schema changes).
Could you tell me, roughly, whether I could do these things with AWS Glue?
I wonder if AWS Glue can deal with dynamic schema changes and sync in (near) real time.
Thank you in advance.
Glue now has an OpenSearch connector, but Glue is an ETL tool that handles batch operations very well; event-based or very frequent loads into Elasticsearch might not be the best fit, and the cost can also be high.
https://docs.aws.amazon.com/glue/latest/ug/tutorial-elastisearch-connector.html
DMS can help, though not completely, since as you mentioned the schema keeps changing.
Logstash Solution
Since Elasticsearch 1.5, the jdbc input plugin has been available in Logstash to sync MySQL data into Elasticsearch.
AWS Native solution
You can have a Lambda function triggered on MySQL events; see Invoking a Lambda function from an Amazon Aurora MySQL DB cluster.
The Lambda will write JSON to Kinesis Data Firehose, and Firehose can load it into OpenSearch.
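A minimal sketch of that Lambda, assuming the Aurora trigger passes the changed row as the event payload and that a Firehose delivery stream (here called "mysql-to-opensearch", a placeholder) already delivers to the OpenSearch domain:

import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # `event` carries whatever payload the MySQL trigger passed to the Lambda invocation
    firehose.put_record(
        DeliveryStreamName="mysql-to-opensearch",  # placeholder stream name
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )
    return {"status": "ok"}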
I have a scenario in which my source data is in S3 and I need to load it into Amazon Redshift using AWS Glue. There are around 10 source tables, but I was only able to load one table at a time using AWS Glue Studio. Is there any way to load all 10 tables from S3 to Amazon Redshift in a single job?
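One rough sketch of how a single Glue job could handle several tables, assuming they are all registered in the Data Catalog and a Glue connection to Redshift exists (all names and paths below are placeholders):

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Placeholder list of the catalogued source tables
tables = ["table_01", "table_02", "table_03"]

for table in tables:
    frame = glueContext.create_dynamic_frame.from_catalog(
        database="my_database", table_name=table
    )
    # Write each table to Redshift through the Glue connection (placeholder names)
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=frame,
        catalog_connection="redshift-connection",
        connection_options={"dbtable": table, "database": "dev"},
        redshift_tmp_dir="s3://my-temp-bucket/redshift-staging/",
    )

job.commit()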
I am new to Amazon AWS. I have a use case to read ORC files from one S3 bucket, convert them to JSON files, and write them to another S3 bucket.
The volume is about 100 GB and roughly a thousand files every day.
I should be able to run this on demand or schedule it to run daily. What are the options I should consider?
Any ideas would be helpful.
Amazon Athena
You can use Amazon Athena to convert file formats via the CREATE TABLE AS command. See: Creating a Table from Query Results (CTAS) - Amazon Athena
The question then becomes how to send the commands to Athena. For this, you could schedule an AWS Lambda function to run, which starts an Amazon EC2 instance. Then, run a script on the instance to send all the commands to Amazon Athena. See: Auto-Stop EC2 instances when they finish a task - DEV Community
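For illustration, a hedged sketch of sending one CTAS statement to Athena with boto3; the database, table, and bucket names are placeholders:

import boto3

athena = boto3.client("athena")

# CTAS statement that rewrites the ORC-backed table as JSON (placeholder names)
ctas = """
CREATE TABLE my_db.orc_as_json
WITH (format = 'JSON', external_location = 's3://my-output-bucket/json/')
AS SELECT * FROM my_db.orc_source_table
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
)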
AWS Glue ETL job
Alternatively, you could create an AWS Glue ETL job that uses Spark to transform the data. See: Built-In Transforms - AWS Glue
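As a rough illustration (not the exact job described above), a Glue script can do the ORC-to-JSON conversion with plain Spark reads and writes; the bucket paths are placeholders:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read ORC from the source bucket and write JSON to the target bucket (placeholder paths)
df = spark.read.orc("s3://source-bucket/orc/")
df.write.mode("overwrite").json("s3://target-bucket/json/")

job.commit()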
I am new to AWS. I want to use AWS Glue for an ETL process.
Can we use AWS Glue to analyze an RDS database and store the analyzed data into an RDS MySQL table using an ETL job?
Thanks
Yes, it's possible. We have used S3 to store our raw data, from where we read the data in AWS Glue and perform UPSERTs to RDS Aurora as part of our ETL process. You can use either an AWS Glue trigger or a Lambda S3 event trigger to call the Glue job.
We have used pymysql / mysql.connector in AWS Glue since we have to do UPSERTs. Bulk loading data directly from S3 is also supported for RDS MySQL (Aurora). Let me know if you need help with a code sample.
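For illustration only (the answerer's actual code sample is not shown here), a minimal pymysql UPSERT sketch of the kind a Glue job could run; the host, credentials, table, and columns are placeholders:

import pymysql

# Placeholder connection details for the Aurora/RDS MySQL target
conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="********",
    database="target_db",
)

# Insert new rows, update existing ones on primary-key conflict
upsert_sql = """
INSERT INTO customers (id, name, email)
VALUES (%s, %s, %s)
ON DUPLICATE KEY UPDATE name = VALUES(name), email = VALUES(email)
"""

rows = [(1, "Alice", "alice@example.com"), (2, "Bob", "bob@example.com")]

with conn.cursor() as cur:
    cur.executemany(upsert_sql, rows)
conn.commit()
conn.close()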
I am trying to set up a sync between AWS Aurora and Redshift. What is the best way to achieve this sync?
Possible ways to sync include:
Query the table to find changes (since I am only doing inserts, updates don't matter), export these changes to a flat file in an S3 bucket, and use the Redshift COPY command to insert into Redshift.
Use a Python publisher with Boto3 to publish changes into a Kinesis stream, then consume this stream in Firehose, from where I can copy directly into Redshift (a minimal Boto3 sketch appears below).
Use the Kinesis Agent to detect changes in the binlog (is it possible to detect binlog changes using the Kinesis Agent?) and publish them to Firehose, and from there copy into Redshift.
I haven't explored AWS Data Pipeline yet.
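For option 2 above, a minimal Boto3 publisher sketch; the stream name and record shape are placeholders:

import json
import boto3

kinesis = boto3.client("kinesis")

def publish_change(row: dict) -> None:
    # Push one changed row into the Kinesis stream (placeholder stream name)
    kinesis.put_record(
        StreamName="aurora-changes",
        Data=(json.dumps(row) + "\n").encode("utf-8"),
        PartitionKey=str(row.get("id", "default")),
    )

publish_change({"id": 42, "status": "inserted"})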
As pointed out by @Mark B, the AWS Database Migration Service can migrate data between databases. This can be done as a one-off exercise, or it can run continuously, keeping two databases in sync.
The documentation shows that Amazon Aurora can be a source and Amazon Redshift can be a target.
AWS has just announced this new feature: Amazon Aurora zero-ETL integration with Amazon Redshift
This natively provides near real-time (within seconds) synchronization from Aurora to Redshift.
You can also use federated queries: https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html
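A quick hedged example of running a federated query through the Redshift Data API, assuming an external schema (here "aurora_fed", a placeholder) has already been created over the Aurora database as described in the linked docs; the cluster, database, and user names are also placeholders:

import boto3

rsd = boto3.client("redshift-data")

# Query the Aurora table live from Redshift via the federated external schema
rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT count(*) FROM aurora_fed.orders;",
)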