I am using AWS RDS(MySQL) and I would like to sync this data to AWS elasticsearch in real-time.
I am thinking that the best solution for this is AWS Glue but I am not sure about I could realize what I want.
This is information for my RDS database:
■ RDS
・I would like to sync several tables(MySQL) to opensearch(1 table to 1 index).
・The schema of tables will be changed dynamically.
・The new column will be added or The existing columns will be removed since previous sync.
(so I also have to sync this schema change)
Could you teach me roughly whether I could do these things by AWS Glue?
I wonder if AWS Glue can deal with dynamic schame change and syncing in (near) real-time.
Thank you in advance.
Glue Now have OpenSearch connector but Glue is like a ETL tool and does batch kind of operation very well but event based or very frequent load to elastic search might not be best fit ,and cost also can be high .
https://docs.aws.amazon.com/glue/latest/ug/tutorial-elastisearch-connector.html
DMS can help not completely as you have mentioned schema keeps changing .
Logstash Solution
Since Elasticsearch 1.5, Elasticsearch added jdbc input plugin in Logstash to sync MySQL data into Elasticsearch.
AWS Native solution
You can have a lambda function on MySQL event Invoking a Lambda function from an Amazon Aurora MySQL DB cluster
The lambda will write to Kinesis Firehouse in json and kinesis can load into OpenSearch .
Related
I have a DocumentDB as the data source.
I am running an AWS Glue job that pulls all the data from a certain table, and then inserts it to a RedShift cluster.
Is it possible to avoid adding duplicate data?
I have seen that AWS glue supports bookmarks,
This does not seem to work for DocumentDB as the data source
Thanks.
I am new to AWS and trying to find a way to load the data from S3 to RDS . In my current approach I am using EC2 instance to do that (where my app is running). I was thinking of doing through lambda but my data will have around (22 million records) and my current approach is taking 4hr. And lambda timeout is 15mins (So lambda approach does not work in this case).
The problem with my current approach is This data files comes may be like ones in a month and I don't want to have a EC2 running just of this task. Any alternatives in server-less world would be helpful.Thank You
Note: The data is loaded from S3 to RDS based on SQS, i,e my application is pulling the messages from SQS which will then load the data into RDS
Please try DMS for this. You need to create DMS agent with S3 bucket info as source and target details of your RDS.
I am new in AWS. I want to use AWS glue for ETL process.
Could we use AWS glue for analyzing the RDS database and store the analyzed data into rds mysql table using ETL job
Thanks
Yes, its possible. We have used S3 to store our raw data, from where we read the data in AWS Glue, and perform UPSERTs to RDS Aurora as part of our ETL process. You can either use AWS Glue trigger or a Lambda S3 event triggers for calling the glue job.
We have used pymysql / mysql.connector in AWS Glue since we have to do UPSERTs. Bulk load data directly from S3 is also supported for RDS Mysql (Aurora). Let me know if you need help with code sample
Any suggested architecture ?
For the first full load, using Kinesis, how do I automate it so that it creates different streams for different tables. (Is this the way to do it?)
Incase if there is a new additional table, how do I create a new stream automatically.
3.How do I load to Kinesis incrementally (whenever the data is populated )
Any resources/ architectures will be definitely helpful. Using Kinesis because multiple other down stream consumers might access this data in future.
Recommend looking into AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS). DMS does not necessarily use Kinesis but it is specifically design for this use case.
Start with the walk through in this blog post: "How to Migrate Your Oracle Data Warehouse to Amazon Redshift Using AWS SCT and AWS DMS"
I am trying to setup a sync between AWS Aurora and Redshift. What is the best way to achieve this sync?
Possible ways to sync can be: -
Query table to find changes in a table(since I am only doing inserts, updates don't matter), export these changes to a flat file in S3 bucket and use Redshift copy command to insert into Redshift.
Use python publisher and Boto3 to publish changes into a Kinesis stream and then consume this stream in Firehose from where I can copy directly into Redshift.
Use Kinesis Agent to detect changes in binlog (Is it possible to detect changes int binlog using Kinesis Agent) and publish it to Firehose and from there copy into Firehose.
I haven't explored AWS Datapipeline yet.
As pointed out by #Mark B, the AWS Database Migration Service can migrate data between databases. This can be done as a one-off exercise, or it can run continuously, keeping two databases in sync.
The documentation shows that Amazon Aurora can be a source and Amazon Redshift can be a target.
AWS has just announced this new feature: Amazon Aurora zero-ETL integration with Amazon Redshift
This natively provides near real-time (second) synchronization from Aurora to Redshift.
You can also use federated queries: https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html