Amazon Redshift Framework (Oracle Data Warehouse Migration) - amazon-web-services

We are currently planning to migrate a 50 TB Oracle data warehouse to Amazon Redshift.
Data from the different OLTP data sources is currently staged in an Oracle staging database and then loaded into the data warehouse. The data is transformed by a large number of PL/SQL stored procedures, both within the staging database and as part of loading it into the data warehouse.
OLTP Data Source 1 --> JMS (MQ) Real-time --> Oracle STG Database --> Oracle DW
Note: JMS MQ consumer writes data into staging database
OLTP Data Source 2 --> CDC Incremental Data (every 10 mins) --> Oracle STG Database --> Oracle DW
Note: data captured by CDC on the source side is loaded into the staging database every 10 minutes.
What would be the best framework for migrating this stack entirely to Amazon Redshift? Which AWS components could the different pieces be migrated to?

Wow, sounds like a big piece of work. There are quite a few things going on here that all need to be considered.
Your best starting point is probably AWS Database Migration Service (https://aws.amazon.com/dms/), together with the AWS Schema Conversion Tool. Between them they can do a lot of the work of moving the data and converting your schemas, and they will highlight the areas that you will have to migrate manually.
You should consider S3 to be your primary staging area. You need to land all (or almost all) the data in S3 before loading to Redshift. Give very careful consideration to how the data is laid out. In particular, I recommend that you use partitioning prefixes (s3://my_bucket/YYYYMMDDHHMI/files or s3://my_bucket/year=YYYY/month=MM/day=DD/hour=HH/minute=MI/files).
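As a rough illustration of the second (Hive-style) layout, here is a minimal boto3 sketch that lands an extract file under a minute-level partition prefix; the bucket name and key layout are placeholders only:

```python
# Sketch: upload an extract file to S3 under year/month/day/hour/minute prefixes
# before COPYing it into Redshift. Bucket name and paths are placeholders.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def stage_file(local_path: str, bucket: str = "my_bucket") -> str:
    """Upload one extract file under a minute-level partition prefix."""
    now = datetime.now(timezone.utc)
    key = (
        f"year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"hour={now:%H}/minute={now:%M}/{local_path.rsplit('/', 1)[-1]}"
    )
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"

# Example: print(stage_file("/tmp/orders_000.csv.gz"))
```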
Your PL/SQL logic will not be portable to Redshift. You'll need to convert the non-SQL parts to either bash or Python and use an external tool to run the SQL parts in Redshift. I'd suggest that you start with Apache Airflow (Python) or Azkaban (bash). If you want to stay pure AWS then you can try Data Pipeline (not recommended) or wait for AWS Glue to be released (looks promising - untested).
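To illustrate the "external tool runs the SQL parts" idea, a minimal Python sketch that executes converted transformation SQL against Redshift with psycopg2 is shown below; the DSN, schemas, and statements are placeholders, and the same callable could be wrapped in an Airflow task:

```python
import psycopg2

# Placeholder connection string; in practice this would come from a secrets
# store or an Airflow connection rather than being hard-coded.
REDSHIFT_DSN = (
    "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com "
    "port=5439 dbname=dw user=etl_user password=change_me"
)

# Illustrative statements standing in for the SQL body of a converted procedure.
TRANSFORM_SQL = [
    "DELETE FROM dw.fact_orders USING stg.orders "
    "WHERE dw.fact_orders.order_id = stg.orders.order_id;",
    "INSERT INTO dw.fact_orders SELECT * FROM stg.orders;",
]

def run_transform():
    # psycopg2 commits the transaction when the connection context exits cleanly.
    with psycopg2.connect(REDSHIFT_DSN) as conn:
        with conn.cursor() as cur:
            for statement in TRANSFORM_SQL:
                cur.execute(statement)
```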
You may be able to use Amazon Kinesis Firehose for the work that's currently done by JMS but the ideal use of Kinesis is quite different from the typical use of JMS (AFAICT).
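If you do go the Firehose route, the JMS/MQ consumer could forward each message to a delivery stream (which then batches and loads into Redshift) instead of writing to the staging database. A rough boto3 sketch, with the stream name as a placeholder:

```python
# Sketch: forward each MQ message to a Kinesis Data Firehose delivery stream.
# The delivery stream name is a placeholder.
import json

import boto3

firehose = boto3.client("firehose")

def on_message(message: dict) -> None:
    """Called by the MQ consumer for each message; Firehose batches and delivers them."""
    firehose.put_record(
        DeliveryStreamName="oltp-source-1-to-redshift",
        Record={"Data": (json.dumps(message) + "\n").encode("utf-8")},
    )
```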
Good luck

Related

Automating CSV analysis?

My e-commerce company generates lots of CSV data. To track order status, the team has to download a number of tracker files, create relationships between them, and then analyse them, which is a time-consuming process. Which AWS low-code solution can be used to automate this workflow?
Depending on what 'workflow' you require, a few options are:
Amazon Honeycode, which is a low-code application builder
Amazon S3 Select, which can filter and retrieve data from individual CSV files. This can be scripted via the AWS CLI or an AWS SDK
If you want to run SQL and create JOINs between multiple files, then Amazon Athena is fantastic. This, too, can be scripted (see the sketch after this list).
While not exactly low-code, Amazon Athena uses SQL queries to analyze CSV files, among many other formats
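As a sketch of scripting the Athena route with boto3 (the database name, result location, and query are placeholders, and it assumes tables have already been defined over the CSV files):

```python
# Sketch: run an Athena query over CSV-backed tables and wait for it to finish.
# Database, output location, and SQL are placeholders.
import time

import boto3

athena = boto3.client("athena")

def run_query(sql: str) -> str:
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "ecommerce"},
        ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)

# Example: run_query("SELECT o.order_id, t.status FROM orders o JOIN trackers t ON o.order_id = t.order_id")
```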

Google Cloud Dataflow - is it possible to define a pipeline that reads data from BigQuery and writes to an on-premise database?

My organization plans to store a set of data in BigQuery and would like to periodically extract some of that data and bring it back to an on-premise database. In reviewing what I've found online about Dataflow, the most common examples involve moving data in the other direction - from an on-premise database into the cloud. Is it possible to use Dataflow to bring data back out of the cloud to our systems? If not, are there other tools that are better suited to this task?
Abstractly, yes. If you've got a set of sources and sinks and you want to move data between them with some set of transformations, then Beam/Dataflow should be perfectly suitable for the task. It sounds like you're describing a batch-based periodic workflow rather than a continuous streaming workflow.
In terms of implementation effort, there are more questions to consider. Does an appropriate Beam connector exist for your intended on-premise database? You can see the built-in connectors here: https://beam.apache.org/documentation/io/built-in/ (note the per-language SDK toggle at the top of the page).
Do you need custom transformations? Are you combining data from systems other than just BigQuery? Either implies to me that you're on the right track with Beam.
On the other hand, if your extract process is relatively straightforward (e.g. just run a query once a week and extract it), you may find there are simpler solutions, particularly if you're not moving much data and your database can ingest data in one of the BigQuery export formats.
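For a rough sketch of that batch shape, assuming the built-in BigQuery connector and a hand-rolled writer for the on-premise database (the built-in JDBC connector may be a better fit; the table names, query, and connection details here are placeholders):

```python
# Sketch: batch Beam pipeline that reads from BigQuery and writes rows to an
# on-premise PostgreSQL database. The DoFn-based writer is illustrative only.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class WriteToOnPremPostgres(beam.DoFn):
    def setup(self):
        import psycopg2
        # Placeholder connection string for the on-premise database.
        self.conn = psycopg2.connect("host=onprem-db dbname=reporting user=etl password=change_me")

    def process(self, row):
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO weekly_extract (id, amount) VALUES (%s, %s)",
                (row["id"], row["amount"]),
            )
        self.conn.commit()

    def teardown(self):
        self.conn.close()

def run():
    # e.g. --runner=DataflowRunner --project=... --temp_location=gs://...
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
                query="SELECT id, amount FROM my_dataset.my_table",
                use_standard_sql=True,
            )
            | "WriteToOnPrem" >> beam.ParDo(WriteToOnPremPostgres())
        )
```

Note that the Dataflow workers would also need network connectivity to the on-premise database (e.g. over VPN or Interconnect) for any variant of this to work.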

Using Amazon Redshift for analytics for a Django app with Postgresql as the database

I have a working Django web application that currently uses Postgresql as the database. Moving forward I would like to perform some analytics on the data and also generate reports etc. I would like to make use of Amazon Redshift as the data warehouse for the above goals.
In order not to affect the performance of the existing Django web application, I was thinking of writing a NEW Django application that would essentially leverage a READ-ONLY replica of the Postgresql database and continuously write data from that replica to Amazon Redshift. My thinking is that the NEW Django application could handle some or all of the Extract, Transform and Load functions.
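For illustration, this is roughly how I picture the NEW application pointing at the replica, using Django's multi-database support (host names and credentials are placeholders):

```python
# settings.py sketch: the new app keeps a "replica" alias pointing at the
# read-only Postgres replica so ETL reads never hit the primary.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "HOST": "primary-db.internal",
    },
    "replica": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "app",
        "HOST": "read-replica.internal",
    },
}

# ETL code in the new app would then read explicitly from the replica, e.g.:
# rows = Order.objects.using("replica").filter(created__gte=since).values()
```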
My questions are as follows:
1. Does the Django ORM work well with Amazon Redshift? If yes, how does one handle the model schema translations? Any pointers in this regard would be greatly appreciated.
2. Is there any better alternative to achieve the goals listed above?
Thanks in advance.

Handling a very large volume of data (500 TB) using Spark

I have a large volume of data, nearly 500 TB, and I have to do some ETL on it.
The data is in AWS S3, so I am planning to use an AWS EMR setup to process it, but I am not sure what configuration I should select.
What kind of cluster do I need (master and how many slaves)?
Do I need to process the data chunk by chunk (10 GB) or can I process it all at once?
How much memory (RAM) and storage should the master and slave (executor) nodes have?
What kind of processor (speed) do I need?
Based on this I want to calculate the cost of AWS EMR and start processing the data.
Based upon your question, you have little or no experience with Hadoop. Get some training first so that you understand how the Hadoop ecosystem works. Plan on spending three months to get to a starter level.
You have a lot of choices to make, some are fundamental to a project's success. For example, what language (Scala, Java or Python)? Which tools (Spark, Hive, Pig, etc.). What format is your data in (CSV, XML, JSON, Parquet, etc.). Do you only need batch processing or do you require near real-time analysis, etc. etc. etc.
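To give a sense of the Spark route, here is a minimal PySpark sketch of the kind of job an EMR step would run against S3 (bucket names, paths, and columns are placeholders):

```python
# Sketch: a small PySpark ETL job reading from and writing back to S3 on EMR.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-etl").getOrCreate()

# Spark reads directly from S3 and splits the work into tasks itself; there is
# no need to pre-chunk the data by hand.
df = spark.read.parquet("s3://my-input-bucket/events/")

cleaned = (
    df.filter(F.col("event_type").isNotNull())
      .withColumn("event_date", F.to_date("event_ts"))
)

cleaned.write.partitionBy("event_date").parquet("s3://my-output-bucket/cleaned/")
```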
You may find other AWS services more applicable such as Athena or Redshift depending on what format your data is in and what information you are trying to extract / process.
With 500 TB in AWS, open a ticket with support. Explain what you have, what you want and your time frame. A Solutions Architect will be available to direct you on a path.

AWS: Setting up a kinesis stream from PostgreSQL to Redshift

In reference to my previous question, I got my boss to go ahead and let me set up a DMS from my existing postgres to our new redshift db for our analytics team.
The next issue I'm having is one that three days of searching has given me nothing to help with. My boss wants to use Kinesis to pull real-time data from our PG db into our RS db so our analytics team can pull data from it in real time. I'm trying to get this configured and I'm running into nothing but headaches.
I have a stream set up, and Firehose set up with an S3 bucket that I created called "postgres-stream-bucket", but I'm not sure how to get data from PG to dump into it, or how to make sure that RS picks everything up and uses it, in real time.
However, if there are better options I would love to hear them, but it is imperative that we have real-time (or as close as possible) translated data.
Amazon Kinesis Firehose is ideal if you have streaming data coming into your systems. It will collect the records, batch them and load them into Redshift. However, it is not an ideal solution for what you have described, where your source is a database rather than random streams of data.
Since you already have the Database Migration Service set up, you can continue to use it for continuous data replication between PostgreSQL and Redshift. This would be the simplest and most effective solution.
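For reference, a rough boto3 sketch of creating an ongoing-replication (CDC) task against the endpoints and replication instance you already have; all ARNs and the table mapping are placeholders:

```python
# Sketch: create a DMS task that does a full load and then ongoing replication
# (CDC) from PostgreSQL to Redshift. ARNs and mappings are placeholders.
import json

import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-public-schema",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="pg-to-redshift-cdc",
    SourceEndpointArn="arn:aws:dms:region:account:endpoint:source-postgres",
    TargetEndpointArn="arn:aws:dms:region:account:endpoint:target-redshift",
    ReplicationInstanceArn="arn:aws:dms:region:account:rep:instance",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```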