Continuously moving data from AWS Aurora MySQL to AWS S3

I am trying to set up a data lake and move all the data to S3.
I have to move Aurora MySQL data to S3 (most probably in Parquet format).
I tried an initial POC using the Database Migration Service (DMS); with that we can move all the data at once. The problem is that every time I run it, it copies the whole data set again.
I want something like a near-real-time reflection of the DB changes in S3.
Thanks in advance.

If you enable binary logging you should be able to do Change Data Capture (CDC) and replicate ongoing changes, as explained in this blog post.
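For reference, on the Aurora side that means setting binlog_format to ROW in the cluster parameter group; on the DMS side you need an S3 target endpoint configured for Parquet and a replication task of type full-load-and-cdc. A rough boto3 sketch (bucket, role and endpoint ARNs below are placeholders, not your actual resources):

```python
import json
import boto3

dms = boto3.client("dms")

# S3 target endpoint that writes full-load and change data as Parquet files
target = dms.create_endpoint(
    EndpointIdentifier="aurora-to-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-data-lake-bucket",                                # placeholder
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3",    # placeholder
        "DataFormat": "parquet",
    },
)

# One full load, then ongoing replication (CDC) from the Aurora binlog
dms.create_replication_task(
    ReplicationTaskIdentifier="aurora-to-s3-cdc",
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",     # existing Aurora MySQL source endpoint
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:...:rep:instance",   # existing replication instance
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "all-tables",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```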

Related

Data copy from Azure Blob to S3 and Synapse to Redshift

There is a requirement to copy 10 TB of data from Azure Blob to S3, and also 10 TB of data from Synapse to Redshift.
What is the best way to achieve these two migrations?
For Redshift - you could export the Azure Synapse Analytics data to Blob Storage in a compatible format, ideally compressed, and then copy the data to S3. It is pretty straightforward to import data from S3 into Redshift.
You may need a VM instance to read from Azure Storage and put the data into AWS S3 (it doesn't matter where it runs). The simplest option seems to be using the default CLIs (Azure and AWS) to read the content onto the migration instance and write it to the target bucket. Personally, though, I'd maybe create an application that writes down checkpoints, so that if the migration is interrupted for any reason it wouldn't need to start from scratch.
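A minimal checkpointing sketch of that idea in Python (assuming the azure-storage-blob and boto3 SDKs; the connection string, container, bucket and checkpoint file names are placeholders):

```python
import boto3
from azure.storage.blob import ContainerClient

AZURE_CONN_STR = "..."                  # placeholder connection string
AZURE_CONTAINER = "source-container"    # placeholder
S3_BUCKET = "target-bucket"             # placeholder
CHECKPOINT_FILE = "copied_blobs.txt"

container = ContainerClient.from_connection_string(AZURE_CONN_STR, AZURE_CONTAINER)
s3 = boto3.client("s3")

# Load the names of blobs already copied in a previous (interrupted) run
try:
    with open(CHECKPOINT_FILE) as f:
        done = set(line.strip() for line in f)
except FileNotFoundError:
    done = set()

for blob in container.list_blobs():
    if blob.name in done:
        continue  # skip blobs copied before the interruption
    # For very large blobs you would stream chunks / use multipart upload
    # instead of reading the whole object into memory.
    data = container.download_blob(blob.name).readall()
    s3.put_object(Bucket=S3_BUCKET, Key=blob.name, Body=data)
    with open(CHECKPOINT_FILE, "a") as f:
        f.write(blob.name + "\n")  # checkpoint after each successful copy
```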
There are a few options you may "tweak" depending on the files to move: whether there are many small files or fewer large files, which region you are moving from and to, and so on.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-upload-large-files/
You may also consider using S3 Transfer Acceleration, which may or may not help.
Please note that every larger cloud provider charges for outbound data egress; for 10 TB it may be a considerable cost.

Build an S3 data lake using DynamoDB as the data source

I'm a data engineer using AWS. We want to build a data pipeline in order to visualise our DynamoDB data in QuickSight. As you know, it's not possible to connect DynamoDB directly to QuickSight; you have to go through S3.
S3 will be our data lake. The issue is that the data updates frequently (for example, a column name can change, a customer's status can evolve, etc.).
So I'm looking for a batch solution that always gets the latest data from DynamoDB into my S3 data lake so I can visualise it in QuickSight.
Thank you.
You can access your tables in the DynamoDB console and export data to S3 under the Streams and Exports tab. This blog post from AWS explains just what you need.
You could also try this approach with Athena instead of S3.
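If you want to script the export instead of clicking through the console, the same feature is available through the API. A rough sketch (note it requires point-in-time recovery to be enabled on the table; the table ARN and bucket below are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Exports a snapshot of the table to S3; requires point-in-time
# recovery (PITR) to be enabled on the source table.
resp = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:eu-west-1:123456789012:table/customers",  # placeholder
    S3Bucket="my-datalake-bucket",                                        # placeholder
    ExportFormat="DYNAMODB_JSON",
)
print(resp["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until the export finishes
```

You could schedule something like this in batches and then point QuickSight at the exported data, either via Athena or directly from S3.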

S3 and Redshift Spectrum with streaming data

Problem statement - I have streaming data coming into an S3 bucket; they are essentially JSON files. I want to use this data with the other tables in Redshift (perform joins).
Here are the options:
Use Lambda with a trigger for every S3 object created. The Lambda will run a COPY command to move the data to Redshift (sketched below).
Problem here - what if there are a lot of connections open to Redshift and this causes a problem?
Use Redshift Spectrum - design a Glue crawler that will run every 15 minutes and check for new partitions.
Problem here - what if the crawler runs while the previous instance of the crawler is still running? And this is not real time.
Should I use Kinesis Firehose?
Problem - I do not want to keep this data in Redshift but would rather leave it in S3, as the data is huge. I also want to join this data in S3 with other Redshift tables, so I can't just use Athena either.
Any ideas on this please?
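For what it's worth, option 1 could look roughly like the following hypothetical Lambda, which goes through the Redshift Data API so it doesn't hold open database connections (cluster, table, and role names are made up for illustration):

```python
import urllib.parse
import boto3

redshift_data = boto3.client("redshift-data")

# Placeholders for illustration
CLUSTER_ID = "my-cluster"
DATABASE = "analytics"
DB_USER = "loader"
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"

def handler(event, context):
    """Triggered by s3:ObjectCreated:*; issues a COPY for each new object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        sql = (
            f"COPY staging.events FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{IAM_ROLE}' FORMAT AS JSON 'auto';"
        )
        # The Data API runs the statement asynchronously, so the Lambda
        # does not keep a database connection open per invocation.
        redshift_data.execute_statement(
            ClusterIdentifier=CLUSTER_ID,
            Database=DATABASE,
            DbUser=DB_USER,
            Sql=sql,
        )
```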

AWS: best approach to process data from S3 to RDS

I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30 GB) from S3, make some transformations, and load it into RDS MySQL, and I want this process to be replicable.
I thought that the best approach was AWS Data Pipeline, but I've found that this service is more designed to load data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found the dataduct wrapper from Coursera, but after some research it seems that this project has been abandoned (the last commit was one year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they are simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little bit confused - can anyone enlighten me?
Thanks in advance
If you are trying to get the data into RDS so you can query it, there are other options that do not require moving the data from S3 to RDS in order to run SQL-like queries.
You can now use Redshift Spectrum to read and query information from S3.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
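To give a feel for steps 3 and 4, the external schema and table definitions are plain SQL; a rough sketch issued through the Redshift Data API (all names, columns, and ARNs below are placeholders) might look like this:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Placeholders for illustration
CLUSTER_ID = "my-cluster"
DATABASE = "analytics"
DB_USER = "admin"

statements = [
    # Step 3: external schema backed by the data catalog, plus an
    # external table pointing at CSV files in S3
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'spectrumdb'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """,
    """
    CREATE EXTERNAL TABLE spectrum.sales (
        sale_id  bigint,
        amount   decimal(10,2),
        sold_at  timestamp
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/sales/';
    """,
    # Step 4: query the S3 data (it can also be joined with local Redshift tables)
    "SELECT count(*) FROM spectrum.sales;",
]

for sql in statements:
    # The Data API is asynchronous; in a real script you would poll
    # describe_statement() and wait for each DDL to finish before the next.
    redshift_data.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        DbUser=DB_USER,
        Sql=sql,
    )
```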
Or you can use Athena to query the data in S3 as well, if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
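Querying with Athena can also be scripted; a minimal sketch with boto3 (the database, table, and results bucket are placeholders):

```python
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT sale_id, amount FROM sales_csv LIMIT 10;",   # placeholder table
    QueryExecutionContext={"Database": "my_datalake"},               # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
# Poll get_query_execution / get_query_results with this ID for the output
print(resp["QueryExecutionId"])
```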
You could use an ETL tool to do the transformations on your CSV data and then load it into your RDS database. There are a number of open-source tools that do not carry large licensing costs. That way you can pull the data into the tool, do your transformations, and then the tool will load the data into your MySQL database. For example, there are Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages) and has JDBC/ODBC-compliant drivers. With it you could create a script that performs your transformations and then loads the data into your MySQL database. You would be using familiar SQL (I'm assuming you can already write SQL scripts), so there isn't a big learning curve.
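If you'd rather hand-roll the job as a small script instead of adopting one of those tools, the shape is the same: stream the CSV from S3, transform rows in batches, and bulk-insert into MySQL. A rough Python sketch (the table, columns, credentials, and the example transformation are made up for illustration):

```python
import csv
import boto3
import pymysql  # assumes the PyMySQL driver; any MySQL client works similarly

# Placeholders for illustration
S3_BUCKET, S3_KEY = "my-bucket", "exports/big_file.csv"
BATCH_SIZE = 5000
INSERT_SQL = "INSERT INTO customers (id, name, email) VALUES (%s, %s, %s)"

s3 = boto3.client("s3")
conn = pymysql.connect(host="my-rds-endpoint", user="app", password="***", database="mydb")

obj = s3.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
# Stream line by line instead of loading the 30 GB file into memory
lines = (line.decode("utf-8") for line in obj["Body"].iter_lines())
reader = csv.reader(lines)
next(reader)  # skip the header row

batch = []
with conn.cursor() as cur:
    for row in reader:
        # Example "transformation": normalise an email column to lower case
        row[2] = row[2].strip().lower()
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            cur.executemany(INSERT_SQL, batch)
            conn.commit()
            batch = []
    if batch:
        cur.executemany(INSERT_SQL, batch)
        conn.commit()
conn.close()
```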

How do I move a local DynamoDB database to the AWS cloud?

I have a static set of data I want to get into AWS DynamoDB. I have downloaded the local version of DynamoDB and tested the code that generates the data on it, and now I have the database with all the data locally.
My question is: is there an efficient way to move the local database into the cloud? I know that I can transfer a CSV file to S3 and use a Data Pipeline from there. Is there a better way that doesn't involve exporting the data and re-importing it?
The data is not that much - about 5 GB (so it's not an Amazon Snowball type of thing).
Thanks!
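For reference, since it's only about 5 GB, one straightforward way is to scan the local table and batch-write the items into the cloud table with the SDK; a rough sketch (assuming boto3, DynamoDB Local on its default port, and a table of the same name and key schema already created in AWS):

```python
import boto3

TABLE_NAME = "my-table"  # placeholder; same name/schema locally and in AWS

# Local DynamoDB (default port 8000) and the real AWS endpoint
local = boto3.resource("dynamodb", endpoint_url="http://localhost:8000")
cloud = boto3.resource("dynamodb")

src = local.Table(TABLE_NAME)
dst = cloud.Table(TABLE_NAME)  # create this table in AWS first, with the same key schema

scan_kwargs = {}
with dst.batch_writer() as writer:  # groups writes into BatchWriteItem calls
    while True:
        page = src.scan(**scan_kwargs)
        for item in page["Items"]:
            writer.put_item(Item=item)
        if "LastEvaluatedKey" not in page:
            break  # no more pages to scan
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```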