Query and join data between Amazon Redshift and Amazon RDS - amazon-web-services

We are going to link Redshift and our PostgreSQL RDS database together for our machine-learning function, so that our ML server can query and join the data in a single place.
As far as I know, there are two solutions:
Option 1: Dump the whole RDS data into Redshift and sync every day
Option 2: Create another RDS and use dblink to create a view to join the two databases together
For option 1, what is the best AWS service we can use (we prefer an AWS service)?
For option 2, how is the performance? (Our current Redshift volume is 80 GB; PostgreSQL is 7 GB.)
Are there any other solutions?
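For context, option 2 typically relies on PostgreSQL's dblink extension, which lets a query in one PostgreSQL-compatible database pull rows from another over a connection string. A minimal sketch, run from the RDS PostgreSQL side, with hypothetical host, credentials, table, and column names:

```sql
-- Enable the dblink extension in the RDS PostgreSQL database.
CREATE EXTENSION IF NOT EXISTS dblink;

-- Join a local RDS table with rows pulled live from Redshift.
-- Host, user, password, and table names below are placeholders.
SELECT c.customer_id,
       c.name,
       r.total_spend
FROM customers AS c
JOIN dblink(
         'host=my-cluster.example.us-east-1.redshift.amazonaws.com
          port=5439 dbname=analytics user=dblink_user password=...',
         'SELECT customer_id, total_spend FROM sales_summary'
     ) AS r(customer_id int, total_spend numeric)
  ON r.customer_id = c.customer_id;
```

Note that dblink fetches the remote result set over the network for every query, so performance depends heavily on how much data the inner SELECT returns.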

From Amazon Redshift introduces support for federated querying (preview):
The in-preview Amazon Redshift Federated Query feature allows you to query and analyze data across operational databases, data warehouses, and data lakes. With Federated Query, you can now integrate queries on live data in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL with queries across your Amazon Redshift and Amazon S3 environments.
Federated Query allows you to incorporate live data as part of your business intelligence (BI) and reporting applications. The intelligent optimizer in Redshift pushes down and distributes a portion of the computation directly into the remote operational databases to speed up performance by reducing data moved over the network. Redshift complements query execution, as needed, with its own massively parallel processing capabilities.
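As an illustration of what Federated Query looks like in practice, the sketch below creates an external schema in Redshift that points at an RDS PostgreSQL instance and then joins a live RDS table with a local Redshift table. The endpoint, IAM role, secret ARN, and table names are all hypothetical placeholders:

```sql
-- Map the RDS PostgreSQL database into Redshift as an external schema.
-- Credentials come from AWS Secrets Manager via the referenced secret.
CREATE EXTERNAL SCHEMA rds_pg
FROM POSTGRES
DATABASE 'mydb' SCHEMA 'public'
URI 'my-rds-instance.abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-secret';

-- Join live RDS data with a table stored in Redshift.
SELECT o.order_id,
       o.status,
       f.revenue
FROM rds_pg.orders AS o          -- live data in RDS PostgreSQL
JOIN sales_fact    AS f          -- local Redshift table
  ON f.order_id = o.order_id;
```

This avoids the daily dump of option 1 entirely: the RDS data stays in place and Redshift queries it on demand.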

Related

What are the limitations of using data sharing between two redshift clusters in two different accounts?

I know that we can share data between two Redshift clusters in different accounts using a datashare. I also saw that after associating the datashare in the consumer account, we need to create a database from the datashare before accessing the data. So, in this case, will all the cross-database sharing limitations apply in the consumer account?
Below are the limitations mentioned in the AWS documentation:
When you work with the cross-database query feature in Amazon Redshift, be aware of the following limitations:
When you query database objects on any other unconnected databases, you have read access only to those database objects.
You can't query views that are created on other databases that refer to objects of yet another database.
You can only create late-binding and materialized views on objects of other databases in the cluster. You can't create regular views on objects of other databases in the cluster.
Amazon Redshift doesn't support tables with column-level privileges for cross-database queries.
Amazon Redshift doesn't support concurrency scaling for queries that read data from other databases.
Amazon Redshift doesn't support querying catalog objects on AWS Glue or federated databases. To query these objects, first create external schemas that refer to those external data sources in each database.
Amazon Redshift doesn't support the result cache for cross-database queries.
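For reference, the consumer-side step the question describes looks roughly like the sketch below. The datashare, account IDs, and namespace GUID are hypothetical, and a cross-account share must also be authorized and associated (via the console or API) between the producer grant and the consumer's CREATE DATABASE:

```sql
-- Producer cluster: create the datashare and grant it to the consumer account.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales;
GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '111122223333';

-- Consumer cluster (account 111122223333), after authorization/association:
CREATE DATABASE sales_db FROM DATASHARE sales_share
    OF ACCOUNT '444455556666' NAMESPACE 'producer-namespace-guid';

-- Reads then go through the cross-database query path, so the
-- limitations listed above (read-only access, no result cache, etc.) apply.
SELECT COUNT(*) FROM sales_db.public.sales;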

Sync Amazon RDS (PostgreSQL) to S3 in near real time

I'm wondering whether it is possible to easily sync an Amazon RDS PostgreSQL database to Amazon S3 in near real time so that data can be used with Amazon Athena, just as read replicas do.
We have several RDS databases and we would like to consolidate all the data in a single repository such as S3.
Thanks.
There is no capability to "export RDS to S3 in real time".
However, Amazon Athena can query Amazon RDS databases, so you could have some of your data in Amazon S3 and some in Amazon RDS.
See: Query any data source with Amazon Athena’s new federated query | AWS Big Data Blog
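To illustrate the Athena federated approach: once an RDS PostgreSQL connector is registered as a data source (the catalog name "rds_pg" below, plus the database and table names, are hypothetical), a single Athena query can join live RDS data with tables backed by S3:

```sql
-- Join an S3-backed Athena table with a live RDS table exposed
-- through a federated query connector (names are placeholders).
SELECT o.order_id,
       o.order_total,
       c.customer_name
FROM my_s3_db.orders        AS o   -- table over files in S3
JOIN rds_pg.public.customers AS c  -- live RDS table via the connector
  ON c.customer_id = o.customer_id;
```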
What you are describing sounds like a data warehouse, where information is extracted from many information sources and is stored in one place for easy querying -- often in 'wide' tables to make querying simpler. However, this is very difficult to do "in real time". It is typically updated nightly, or perhaps hourly.
You might want to consider using AWS Database Migration Service to continuously sync data between RDS and S3: https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-dms-target/
That said, this only makes sense when you don't have a read-only replica of the data, since the queries might otherwise affect source RDS performance.

Does Amazon Redshift have its own storage backend

I'm new to Redshift and would like some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing, and transforming, using temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers in which data is stored?
I ask because, if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find an answer on Amazon's websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and stored in a columnar format, making it suitable for the "wide" tables typical of data warehouses. It is this dedicated storage, and the way the data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (e.g. ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
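As a concrete sketch of the Spectrum pattern just described (the IAM role, bucket, and table definition below are hypothetical), an external schema maps the Glue Data Catalog into Redshift, and an external table describes partitioned Parquet files in S3:

```sql
-- External schema backed by the AWS Glue Data Catalog (placeholder role/db).
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Partitioned, columnar (Parquet) table whose data lives in S3.
CREATE EXTERNAL TABLE spectrum.clicks (
    user_id bigint,
    url     varchar(2048)
)
PARTITIONED BY (event_date date)
STORED AS PARQUET
LOCATION 's3://my-bucket/clicks/';

-- With partitioning, Redshift scans only the files for the matching date.
SELECT COUNT(*) FROM spectrum.clicks WHERE event_date = '2023-01-01';
```

Partition pruning and the columnar format are what make this kind of S3-resident data reasonably fast to query, even though it never lands on Redshift's own disks.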
The newer Amazon Redshift RA3 nodes can also offload less-used data to Amazon S3, and use caching to keep queries fast. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
via Amazon Redshift Spectrum, also lets you query data held in S3 (similar to AWS Athena).
is not a good alternative to a traditional RDBMS as a back-end database, because transactions are very slow.
is a great data warehouse tool; just use it for that!

Migrate database and data warehouse into AWS

I want to migrate our database and data warehouse into AWS.
Our on-premises database is Oracle, and we use Oracle Data Integrator for data warehousing on IBM AIX.
My first thought was to migrate our database with AWS DMS (Database Migration Service) into a staging point (S3), then use Lambda (to create a trigger when data is updated, deleted, or inserted) and Kinesis Firehose (for streaming and ETL) to send the data into Redshift.
The data in Redshift must be a replica of our on-prem data warehouse (containing facts and dimensions, aggregations, and multiple joins), and whenever any changes happen in the on-prem database, I want them to automatically update AWS S3 and Redshift, so I can have near-real-time data in Redshift.
I was wondering if my architecture is correct and/or is there a better way to do it?
Thank you

What are the differences between Amazon Redshift and the new AWS Glue data warehousing services?

I am confused about these two services. It looks like they offer the same service. Probably the only difference is that the Glue catalog can contain a wider range of data sources. Does that mean AWS Glue can replace Redshift?
The comment is right: these two services are not the same. AWS Glue is an ETL service, while Amazon Redshift is a data warehousing service.
According to the AWS documentation:
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
According to the AWS documentation:
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores
You can refer to the documentation provided by AWS for details, but essentially these are totally different services.