Migrate database and data warehouse into AWS

I want to migrate our database and data warehouse into AWS.
Our on-prem database is Oracle, and we use Oracle Data Integrator for data warehousing on IBM AIX.
My first thought was to migrate our database with AWS DMS (Database Migration Service) into a staging area (S3), then use Lambda (triggered whenever data is inserted, updated, or deleted) and Kinesis Firehose (for streaming and ETL) to send the data into Redshift.
The data in Redshift must be a replica of our on-prem data warehouse (containing facts and dimensions, aggregations, and multiple joins). Whenever anything changes in the on-prem database, I want S3 and Redshift to be updated automatically so that Redshift holds near-real-time data.
Is this architecture sound, or is there a better way to do it?
Thank you

Related

How to create tables automatically in Redshift through AWS Glue based on RDS data source

I have dozens of tables in my data source (RDS) and I am ingesting all of this data into Redshift through AWS Glue. I am currently manually creating tables in Redshift (through SQL) and then proceeding with the Crawler and AWS Glue to fill in the Redshift tables with the data flowing from RDS.
Is there a way to create these target tables in Redshift automatically (based on the tables I have in RDS, since they will initially be exact copies) instead of manually creating each one with SQL in the Redshift Query Editor?
Thanks in advance,

Sync Amazon RDS (PostgreSQL) to S3 in near real time

I'm wondering whether it is possible to easily sync an Amazon RDS PostgreSQL database to Amazon S3 in near real time, much as a read replica stays in sync, so that the data can be used with Amazon Athena.
We have several RDS databases and we would like to consolidate all the data in a single repository such as S3.
Thanks.

There is no capability to "export RDS to S3 in real time".
However, Amazon Athena can query Amazon RDS databases, so you could have some of your data in Amazon S3 and some in Amazon RDS.
See: Query any data source with Amazon Athena’s new federated query | AWS Big Data Blog
What you are describing sounds like a data warehouse, where information is extracted from many information sources and is stored in one place for easy querying -- often in 'wide' tables to make querying simpler. However, this is very difficult to do "in real time". It is typically updated nightly, or perhaps hourly.

You might want to consider using AWS Database Migration Service to continuously sync data between RDS and S3: https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-dms-target/
That said, this is only sensible when you don't have a read-only replica of the data and the queries might affect the performance of the source RDS instance.
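
To give a sense of what a continuous DMS sync into S3 leaves you with: in CDC mode with CSV output, DMS writes change rows whose first column is an operation flag (I for insert, U for update, D for delete). The following is a minimal sketch, not DMS code, of folding such rows into a current snapshot before querying. The column layout (flag, primary key, one value) and the sample rows are assumptions made up for illustration.

```python
import csv
import io


def apply_cdc_rows(snapshot, cdc_csv_text):
    """Fold DMS-style CDC rows (op flag, primary key, values...) into a dict."""
    reader = csv.reader(io.StringIO(cdc_csv_text))
    for row in reader:
        if not row:
            continue  # skip blank lines
        op, key, values = row[0], row[1], row[2:]
        if op in ("I", "U"):
            # Inserts and updates both leave the latest values for the key.
            snapshot[key] = values
        elif op == "D":
            # Deletes remove the key; ignore deletes for unknown keys.
            snapshot.pop(key, None)
        else:
            raise ValueError(f"unknown operation flag: {op!r}")
    return snapshot


# Hypothetical change file contents, in the order the changes happened.
SAMPLE_CDC = "I,1,alice\nI,2,bob\nU,1,alicia\nD,2,bob\n"

current = apply_cdc_rows({}, SAMPLE_CDC)
print(current)
```

Athena would see the raw change rows rather than this folded state, so some step like this (or a view that picks the latest row per key) is needed if the goal is a replica-like table.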

How is a crawler better than connecting directly to a DB and retrieving data?

In AWS Glue jobs, there are two approaches for retrieving data from a DB or S3: 1) using a crawler, 2) using a direct connection to the DB or S3.
So my question is: how is a crawler better than connecting directly to a database and retrieving the data?

AWS Glue Crawlers do not retrieve the actual data. A crawler accesses your data stores and progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can be scheduled to run periodically so that they detect newly available data as well as changes to existing data, including changes to table definitions. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.
The AWS Glue Data Catalog becomes a common metadata repository shared by Amazon Athena, Amazon Redshift Spectrum, and Amazon S3. AWS Glue Crawlers help build this metadata repository.
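
Because a crawler only writes metadata, what a job gets back from the Data Catalog is a schema, not rows. Here is a minimal sketch of reading that schema through the Glue `get_table` API as exposed by boto3. The database name `sales_db`, the table name `orders`, and the canned response are hypothetical; the fake client exists only so the sketch runs without AWS credentials.

```python
def catalog_columns(glue_client, database, table):
    """Return (name, type) pairs recorded in the Data Catalog for one table."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    table_def = response["Table"]
    columns = table_def["StorageDescriptor"]["Columns"]
    # Partition columns are stored separately from regular columns.
    partition_keys = table_def.get("PartitionKeys", [])
    return [(c["Name"], c["Type"]) for c in columns + partition_keys]


class FakeGlueClient:
    """Offline stand-in for boto3.client("glue"); returns a canned response."""

    def get_table(self, DatabaseName, Name):
        return {
            "Table": {
                "Name": Name,
                "DatabaseName": DatabaseName,
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "order_id", "Type": "bigint"},
                        {"Name": "amount", "Type": "double"},
                    ]
                },
                "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
            }
        }


columns = catalog_columns(FakeGlueClient(), "sales_db", "orders")
print(columns)
```

Against a real account you would pass `boto3.client("glue")` instead of the fake client. Reading the rows themselves is still done by the job, either through the catalog table or through a direct connection.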

Query and join data between Amazon Redshift and Amazon RDS

We are planning to link Redshift and our PostgreSQL RDS database together for our machine learning function, so that our ML server can query and join the data in a single place.
As far as I know, there are two solutions:
Option 1: Dump the whole RDS data into Redshift and sync every day
Option 2: Create another RDS instance and use dblink to create a view that joins the two databases together
For option 1, what is the best AWS service to use (we prefer to stay with AWS services)?
For option 2, how is the performance (our current Redshift volume is 80 GB, PostgreSQL is 7 GB)?
Are there any other solutions?

From "Amazon Redshift introduces support for federated querying (preview)":
The in-preview Amazon Redshift Federated Query feature allows you to query and analyze data across operational databases, data warehouses, and data lakes. With Federated Query, you can now integrate queries on live data in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL with queries across your Amazon Redshift and Amazon S3 environments.
Federated Query allows you to incorporate live data as part of your business intelligence (BI) and reporting applications. The intelligent optimizer in Redshift pushes down and distributes a portion of the computation directly into the remote operational databases to speed up performance by reducing data moved over the network. Redshift complements query execution, as needed, with its own massively parallel processing capabilities.
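
As a rough illustration of what setting this up can look like: Redshift exposes the remote PostgreSQL database through an external schema, after which its tables can be joined like local ones. The sketch below only renders the SQL text, following the `CREATE EXTERNAL SCHEMA ... FROM POSTGRES` form as I understand it from the Redshift documentation; since the quoted announcement describes a preview, check the current syntax before relying on it. Every name, host, and ARN here is a placeholder, and the tables in the join are hypothetical.

```python
def external_schema_ddl(local_schema, database, remote_schema, host,
                        iam_role_arn, secret_arn, port=5432):
    """Render a CREATE EXTERNAL SCHEMA statement for a PostgreSQL source.

    Values are interpolated as-is, so only pass trusted input.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {local_schema}\n"
        f"FROM POSTGRES\n"
        f"DATABASE '{database}' SCHEMA '{remote_schema}'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


# Placeholder values; none of these identify real resources.
ddl = external_schema_ddl(
    local_schema="rds_live",
    database="appdb",
    remote_schema="public",
    host="example-db.abc123.us-east-1.rds.amazonaws.com",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example",
)

# Hypothetical join between a local Redshift table and a live RDS table.
JOIN_QUERY = (
    "SELECT f.user_id, f.score, u.email\n"
    "FROM analytics.features AS f\n"
    "JOIN rds_live.users AS u ON u.id = f.user_id;"
)

print(ddl)
print(JOIN_QUERY)
```

With 7 GB on the PostgreSQL side, a setup like this would avoid both the daily dump and the extra dblink instance, at the cost of sending part of each query to the operational database.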

Stream Data from SQL Server into Redshift with Kinesis Firehose

The tool below is a batch import method of copying data from SQL Server RDS into Redshift.
AWS Schema Conversion Tool Exports from SQL Server to Amazon Redshift
Is there a more streamlined method that streams data every second from MS SQL Server into Redshift with Kinesis Firehose? I know we can move data from AWS Aurora directly into Redshift with Kinesis.

If your goal is to move data from Microsoft SQL Server into Amazon Redshift, then you could consider using AWS Database Migration Service. It can copy data as a one-off job but can also migrate on a continuing basis.
See:
Using a Microsoft SQL Server Database as a Source for AWS DMS - AWS Database Migration Service
Using an Amazon Redshift Database as a Target for AWS Database Migration Service - AWS Database Migration Service
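
For the "continuing basis" part, the relevant knob is the task's migration type. The sketch below assembles the arguments for the DMS `create_replication_task` call as exposed by boto3, using `full-load-and-cdc` so the task copies existing rows and then keeps applying changes. It only builds the arguments and does not call AWS. The task name, the ARNs, and the `dbo` schema are placeholders chosen for illustration.

```python
import json


def replication_task_args(task_id, source_arn, target_arn, instance_arn,
                          schema="dbo"):
    """Build keyword arguments for a DMS task that loads, then replicates changes."""
    # Selection rule: include every table in the given source schema.
    table_mappings = {
        "rules": [
            {
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "include-schema",
                "object-locator": {"schema-name": schema, "table-name": "%"},
                "rule-action": "include",
            }
        ]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # "full-load" would be the one-off copy; "cdc" would be changes only.
        "MigrationType": "full-load-and-cdc",
        # The API expects the mappings as a JSON string, not a dict.
        "TableMappings": json.dumps(table_mappings),
    }


# Placeholder ARNs; replace with the endpoints and instance you created.
args = replication_task_args(
    task_id="sqlserver-to-redshift",
    source_arn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCEEXAMPLE",
    target_arn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGETEXAMPLE",
    instance_arn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCEEXAMPLE",
)
print(json.dumps(args, indent=2))
```

With boto3 installed and credentials configured, the dict could be passed as `boto3.client("dms").create_replication_task(**args)`. The source and target endpoints and the replication instance have to exist first.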
