Right now we have an ETL that extracts data from an API, transforms it, and stores it in one big table in our OLTP database. We want to migrate this table to an OLAP solution. The table is only read to perform some calculations, whose results we store in our OLTP database.
Which service fits the most here?
We are currently evaluating Redshift but have never used the service before. We also considered a snowflake schema (a fact table with dimensions) in RDS, because the table is expected to hold 10 GB to 100 GB, but we don't know how well this approach scales.
IMHO you could do a PoC to see which service is more feasible for you. It really depends on how much data you have, which queries you run, and what load you plan to put on the system.
AWS Redshift is intended for OLAP at petabyte to exabyte scale, handling heavy parallel workloads. Redshift can also aggregate data from other data sources (JDBC, S3, ...). However, Redshift is not an OLTP database, and it requires more static server overhead and extra skills to manage the deployment.
So without more numbers and use cases, one cannot advise anything. The great thing about the cloud is that you can try things out and see what fits you.
AWS Redshift is really great when you only want to read data from the database. Under the hood, Redshift is a column-oriented database, which makes it better suited to analytics. You can transfer all your existing data to Redshift using AWS DMS. DMS reads the binary logs (binlogs) of your existing database and transfers the data automatically, so you don't have to do anything else. From my personal experience, Redshift is really great.
We're building a multi-tenant SaaS application hosted on AWS that exposes and visualizes data in the front end via a REST API.
For storage, we're considering AWS Redshift (Cluster or Serverless?) and then exposing the data using API Gateway and Lambda with the Redshift Data API.
The reason I'm inclined to use Redshift rather than, say, RDS is that it also seems like a nice option for conducting data experiments internally while building our product.
My question is, would this be considered a good strategy?
Redshift is sized for very large data and tables. For example, the minimum storage unit is a 1MB block. That's 1MB for every column on every slice (minimum 2). A table with 5 columns and just a few rows will take 26MB on the smallest Redshift cluster size (default distribution style). Redshift shines when your tables have tens of millions of rows at a minimum. It isn't clear from your case that you will have data sizes that run efficiently on Redshift.
The next concern would be about your workload. Redshift is a powerful analytics engine but is not designed for OLTP workloads. High volumes of small writes will not perform well; it wants batch writes. High concurrency of light reads will not work as well as a database designed for that workload.
At low levels of work Redshift can do these things; it is a database, after all. But if you use it in a way it isn't optimized for, it likely isn't the most cost-effective option and won't scale well. If job A is the SaaS workload and analytics is job B, then choose the right database for job A. If that choice cannot do job B at the performance level you need, then add an analytics engine to the mix.
My $.02 and I'm the Redshift guy. If my assumptions about your workload are wrong please update with specific info.
We are building a customer-facing app. For this app, data is captured by IoT devices owned by a third party and transferred to us from their server via API calls. We store this data in our Amazon DocumentDB cluster. The user app is connected to this cluster and has real-time data feed requirements. Note: the data is time series data.
For long-term data storage and for creating analytics dashboards to share with stakeholders, our data governance folks are asking us to replicate/copy the data daily from the Amazon DocumentDB cluster to BigQuery on their Google Cloud Platform. We could then run queries directly on BigQuery to perform analysis, and send data to, say, explorer or Tableau to create dashboards.
I couldn't find any straightforward solution for this. Any ideas, comments or suggestions are welcome. How do I plan and implement this replication? And how do I make sure the data is copied efficiently in terms of memory and pricing? Also, we don't want to degrade the performance of Amazon DocumentDB, since it supports our user-facing app.
This solution would need some custom implementation. You can use Change Streams and process the data changes at intervals to send them to BigQuery, which gives you a data replication mechanism to run analytics on. One of the documented use cases for Change Streams is analytics with Redshift, so BigQuery should serve a similar purpose.
Using Change Streams with Amazon DocumentDB:
https://docs.aws.amazon.com/documentdb/latest/developerguide/change_streams.html
This document also contains sample Python code for consuming change stream events.
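To make the "process the data changes at intervals" idea concrete, here is a minimal sketch of the batching step only. It assumes the events follow the usual change stream shape (operationType, documentKey, fullDocument); the row layout and the function names event_to_row and batched_rows are made up for illustration and are not part of any AWS or Google API.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def event_to_row(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one change event into a row for an analytics table (hypothetical layout)."""
    return {
        "doc_id": str(event["documentKey"]["_id"]),
        "operation": event["operationType"],
        # fullDocument is absent for deletes, so fall back to an empty payload.
        "payload": json.dumps(event.get("fullDocument") or {}, default=str, sort_keys=True),
    }


def batched_rows(
    events: Iterable[Dict[str, Any]], batch_size: int = 500
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size rows, so the sink gets few, large writes."""
    iterator = iter(events)
    while True:
        batch = [event_to_row(e) for e in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch
```

In a real job the events would probably come from a change stream cursor (for example pymongo's collection.watch()), and each batch would be handed to whatever BigQuery load mechanism you choose. Storing the resume token between runs is left out here.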
I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, and holding temporary storage to pick up the specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers on which data is stored?
The reason I ask: if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find an answer on Amazon's websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas conventional databases can start to lose performance on analytic queries once tables grow past a million or so rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way the data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is excellent at querying large quantities of data, it is not designed for frequently updated data. Thus, it should not be substituted for a normal database that an application uses for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries across all of it.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes can also offload less-used data to Amazon S3, and they use caching to keep queries fast. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, no link to S3.
Amazon Redshift Spectrum allows you to also query data held in S3 (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, just use it for that!
I'm trying to implement what I think is a very simple process, but I don't really know what the best approach is.
I want to read a big CSV file (around 30 GB) from S3, apply some transformations and load it into RDS MySQL, and I want this process to be repeatable.
I thought the best approach was AWS Data Pipeline, but I've found that this service is designed more for loading data from different sources into Redshift after several transformations.
I've also seen that the process of creating a pipeline is slow and a little bit messy.
Then I found Coursera's dataduct wrapper, but after some research it seems that this project has been abandoned (the last commit was one year ago).
So I don't know if I should keep trying with AWS Data Pipeline or take another approach.
I've also read about AWS Simple Workflow and Step Functions, but I don't know if they are simpler.
Then I saw a video about AWS Glue and it looks nice, but unfortunately it's not yet available and I don't know when Amazon will launch it.
As you can see, I'm a little confused. Can anyone enlighten me?
Thanks in advance
If you are trying to get them into RDS so you can query them, there are other options that do not require the data to be moved from S3 to RDS to do SQL like queries.
You can now use Redshift Spectrum to read and query information from S3.
Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables.
Step 1: Create an IAM Role for Amazon Redshift
Step 2: Associate the IAM Role with Your Cluster
Step 3: Create an External Schema and an External Table
Step 4: Query Your Data in Amazon S3
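Steps 3 and 4 come down to a couple of SQL statements. The sketch below only renders those statements as strings, so you can review them before running them through your usual SQL client. The schema, table, column, bucket and role names are placeholders, and it assumes comma-delimited text files without a header row.

```python
from typing import List, Tuple


def external_schema_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Render the statement that maps a data catalog database to a Redshift external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(
    schema: str, table: str, columns: List[Tuple[str, str]], s3_location: str
) -> str:
    """Render an external table definition over delimited text files in S3."""
    cols = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}';"
    )


if __name__ == "__main__":
    # Placeholder names throughout; replace them with your own.
    print(external_schema_sql(
        "spectrum", "spectrumdb", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
    ))
    print(external_table_sql(
        "spectrum", "events",
        [("event_id", "INTEGER"), ("event_name", "VARCHAR(100)")],
        "s3://example-bucket/events/",
    ))
    # Step 4 is then an ordinary query against the external table:
    print("SELECT event_name, COUNT(*) FROM spectrum.events GROUP BY event_name;")
```

The helpers do no escaping, so they are only suitable for identifiers and paths you control.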
Or you can use Athena to query the data in S3 if Redshift is too much horsepower for the job at hand.
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
You could use an ETL tool to do the transformations on your csv data and then load it into your RDS database. There are a number of open source tools that do not require large licensing costs. That way you can pull the data into the tool, do your transformations and then the tool will load the data into your MySQL database. For example there is Talend, Apache Kafka, and Scriptella. Here's some information on them for comparison.
I think Scriptella would be an option for this situation. It can use SQL scripts (or other scripting languages), and has JDBC/ODBC compliant drivers. With this you could create a script that would perform your transformations and then load the data into your MySQL database. And you would be using familiar SQL (I'm assuming you already can create SQL scripts) so there isn't a big learning curve.
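Whichever tool you pick, the core of the job is the same: stream the file, transform each record, and load in batches so that a 30 GB file never has to fit in memory. Here is a minimal sketch of that loop in plain Python. The column names (id, name), the sample transformation and the load callback are assumptions for illustration, not anything specific to Scriptella or Talend.

```python
import csv
import io
from itertools import islice
from typing import Callable, Dict, Iterator, List, TextIO, Tuple

Row = Tuple[str, ...]


def transform(record: Dict[str, str]) -> Row:
    """Placeholder transformation (an assumption): trim whitespace, title-case the name."""
    return (record["id"].strip(), record["name"].strip().title())


def csv_batches(stream: TextIO, batch_size: int) -> Iterator[List[Row]]:
    """Read the CSV lazily and yield transformed rows in lists of at most batch_size."""
    reader = csv.DictReader(stream)
    while True:
        batch = [transform(record) for record in islice(reader, batch_size)]
        if not batch:
            return
        yield batch


def run_etl(stream: TextIO, load: Callable[[List[Row]], None], batch_size: int = 10_000) -> int:
    """Push every batch through load() and return the number of rows processed."""
    total = 0
    for batch in csv_batches(stream, batch_size):
        load(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file; print stands in for the database load.
    sample = io.StringIO("id,name\n1, alice \n2,bob\n")
    run_etl(sample, load=print, batch_size=1)
```

With a MySQL driver, load would typically wrap something like cursor.executemany("INSERT INTO ... VALUES (%s, %s)", batch) followed by a commit, and stream would be a text wrapper around the object read from S3.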
Does anyone know how fast the copy speed is from Amazon S3 to Redshift?
I only want to use Redshift for about an hour a day, to run updates on Tableau reports. The queries are always run against the same database, but I need to run them each night to take into account new data that has come in that day.
I don't want to keep a cluster going 24x7 just to be used for one hour a day, but the only way I can see of doing this is to import the entire database into Redshift each night (I don't think you can suspend or pause a cluster). I have no idea what the copy speed is, so I have no idea whether copying a 10GB file into Redshift every night is going to be relatively quick.
Assuming it's feasible, my thinking is to push the incremental changes from the SQL Server database into S3. Using CloudFormation, I would automate the provisioning of a Redshift cluster at 1am for 1 hour, import the database from S3, and schedule Tableau to run its queries in that window and get its results. I would keep an eye on how long the queries take, and if I need longer than an hour I would just amend the CloudFormation template.
In this way I hope to keep a really 'lean' Tableau server by outsourcing all the ETL to Redshift, and buying only what I consume on Redshift.
Please feel free to critique my solution, or outright blow it out of the water. Otherwise, if the consensus is that importing is relatively quick, that gives me a thumbs up that I'm headed in the right direction with this solution.
Thanks for any assistance!
Redshift loads from S3 are very quick; however, Redshift clusters do not come up or tear down quickly at all. In the above example, most of your time (and money) would be spent waiting for the cluster to come up, the existing data to load, the refreshed data to unload, and the cluster to tear down again.
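For reference, the load itself is a single COPY statement. COPY reads all files under an S3 prefix in parallel, so splitting the nightly export into several compressed files usually loads faster than one 10GB file. The helper below just renders the statement; the table, bucket and role names are placeholders, and it assumes CSV files with a header row.

```python
def copy_from_s3_sql(table: str, s3_prefix: str, iam_role_arn: str, gzip: bool = True) -> str:
    """Render a Redshift COPY statement for CSV files stored under an S3 prefix."""
    options = ["FORMAT AS CSV", "IGNOREHEADER 1"]
    if gzip:
        # Only add GZIP when the files are actually gzip-compressed.
        options.append("GZIP")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        + "\n".join(options)
        + ";"
    )


if __name__ == "__main__":
    # Placeholder names; every object whose key starts with the prefix is loaded.
    print(copy_from_s3_sql(
        "staging.sales",
        "s3://example-bucket/sales/part_",
        "arn:aws:iam::123456789012:role/ExampleCopyRole",
    ))
```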
In my opinion it would be better to use another approach for your overnight processing. I would suggest either:
For a couple of TB, InfiniDB on a largish EC2 instance with the database stored on an EBS volume.
For many TBs, Amazon EMR with the data stored on S3. If you don't want to get into Hadoop too much you can use Xplenty/Syncsort Ironcluster/etc. to orchestrate the Hadoop element.
While this question was written three years ago and it wasn't available at that time, a suitable solution to this now would be to use Amazon Athena, which allows on-demand SQL querying of data held in S3. This works on a pay-per-query model, and is intended for ad-hoc and "quick" workloads like this.
Behind the scenes, Athena uses Presto and Elastic MapReduce, but the only required knowledge for a developer/analyst in practice is SQL.
Tableau also now has a built-in Athena connector (as of 10.3).
More on Athena here: https://aws.amazon.com/athena/
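If you want to script the nightly refresh instead of going through Tableau's connector, a query can be submitted with the AWS SDK. This is a rough sketch that assumes a boto3 Athena client is passed in; the database name, SQL and results bucket are placeholders, and polling for completion is left out.

```python
from typing import Any


def start_athena_query(client: Any, sql: str, database: str, output_location: str) -> str:
    """Submit a query to Athena and return its execution id.

    client is expected to behave like boto3.client("athena"); it is passed in
    so the function can be exercised without AWS credentials.
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location (placeholder bucket below).
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    class _FakeAthena:
        """Stand-in client for a dry run; a real run would use boto3.client("athena")."""

        def start_query_execution(self, **kwargs: Any) -> dict:
            return {"QueryExecutionId": "example-execution-id"}

    print(start_athena_query(
        _FakeAthena(),
        "SELECT COUNT(*) FROM daily_sales",
        "reports",
        "s3://example-bucket/athena-results/",
    ))
```

The returned id can then be passed to the client's get_query_execution call to check when the query has finished.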
You can presort data you are keeping on S3. It will make Vacuum much faster.
This is the classic problem with Redshift. If you are willing to look elsewhere, Microsoft recently announced a new service called SQL Data Warehouse (it uses the PDW engine); I think they want to compete directly with Redshift. The most interesting concept here is the familiar SQL Server query language and toolset (including stored procedure support). They have also decoupled storage and compute, so you can have 1 GB of storage but 10 compute nodes for intensive queries, and vice versa. They claim that compute nodes start in a few seconds and that you don't have to take the cluster offline when you resize it. The cloud data warehouse battle is getting hot :)