We're building a multi-tenant SaaS application hosted on AWS that exposes and visualizes data in the front end via a REST API.
Now, for storage we're considering using AWS Redshift (Cluster or Serverless?) and then exposing the data using API Gateway and Lambda with the Redshift Data API.
The reason I'm inclined to use Redshift as opposed to, e.g., RDS is that it also seems like a nice option for conducting data experiments internally while building our product.
My question is, would this be considered a good strategy?
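For concreteness, the setup we have in mind looks roughly like the sketch below (cluster name, database, user, table, and SQL are all placeholders). The Redshift Data API is asynchronous, so the Lambda polls for completion before fetching results:

```python
import json
import time

def records_to_dicts(column_metadata, records):
    """Flatten Data API cells (e.g. {'stringValue': 'x'}, {'longValue': 1})
    into plain dicts keyed by column name."""
    names = [col["name"] for col in column_metadata]
    return [
        {name: next(iter(cell.values()), None) for name, cell in zip(names, row)}
        for row in records
    ]

def handler(event, context):
    # boto3 is available in the Lambda runtime; imported lazily here.
    import boto3
    client = boto3.client("redshift-data")
    stmt = client.execute_statement(
        ClusterIdentifier="my-cluster",   # placeholder
        Database="analytics",             # placeholder
        DbUser="api_user",                # placeholder
        Sql="SELECT tenant_id, metric, value FROM metrics LIMIT 100",
    )
    # The Data API is asynchronous: poll until the statement completes.
    while True:
        desc = client.describe_statement(Id=stmt["Id"])
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(0.25)
    if desc["Status"] != "FINISHED":
        raise RuntimeError(desc.get("Error", "query did not finish"))
    result = client.get_statement_result(Id=stmt["Id"])
    rows = records_to_dicts(result["ColumnMetadata"], result["Records"])
    return {"statusCode": 200, "body": json.dumps(rows)}
```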
Redshift is sized for very large data and tables. For example, the minimum storage allocation is 1 MB. That's 1 MB for every column, across all the slices (minimum 2). A table with 5 columns and just a few rows will take 26 MB on the smallest Redshift cluster size (default distribution style). Redshift shines when your tables have tens of millions of rows at minimum. It isn't clear from your case that you will have data sizes that will run efficiently on Redshift.
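As a back-of-the-envelope sketch of that arithmetic (AWS's minimum-table-size formula adds 3 hidden system columns per table; the 26 MB figure quoted above evidently counts further per-table overhead on top of this lower bound):

```python
def min_table_size_mb(user_columns, slices, hidden_columns=3, block_mb=1):
    """Lower bound on a Redshift table's footprint: every column
    (user columns plus hidden system columns) allocates at least one
    1 MB block on every populated slice."""
    return (user_columns + hidden_columns) * slices * block_mb

# 5 user columns on the smallest 2-slice cluster:
min_table_size_mb(user_columns=5, slices=2)  # 16 MB by this formula alone
```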
The next concern would be about your workload. Redshift is a powerful analytics engine but is not designed for OLTP workloads. High volumes of small writes will not perform well; it wants batch writes. High concurrency of light reads will not work as well as a database designed for that workload.
At low levels of work Redshift can do these things - it is a database. But if you use it in a way it isn't optimized for, it likely isn't the most cost-effective option and won't scale well. If job A is the SaaS workload and analytics is job B, then choose the right database for job A. If that choice cannot do job B at the performance level you need, then add an analytics engine to the mix.
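In practice "batch writes" usually means staging rows as files in S3 and loading them with a single COPY, rather than issuing many single-row INSERTs. A hypothetical helper that builds such a statement (table, prefix, and role ARN are made up):

```python
def build_copy_statement(table, s3_prefix, iam_role_arn, fmt="CSV"):
    """Build a Redshift COPY statement that bulk-loads staged files
    from S3 in one batch, instead of many single-row INSERTs."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS {fmt};"
    )

stmt = build_copy_statement(
    "events",
    "s3://my-bucket/staged/2024-05-01/",
    "arn:aws:iam::123456789012:role/redshift-copy",
)
```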
My $.02 and I'm the Redshift guy. If my assumptions about your workload are wrong please update with specific info.
Related
Right now we have an ETL that extracts info from an API, transforms it, and stores it in one big table in our OLTP database. We want to migrate this table to some OLAP solution. This table is only read to do some calculations whose results we store back in our OLTP database.
Which service fits the most here?
We are currently evaluating Redshift, but we have never used the service before. We also thought of a snowflake schema (some kind of fact table with dimensions) in RDS, because the table is intended to store 10 GB to 100 GB, but we don't know how far this approach can scale.
IMHO you could do a PoC to see which service is more feasible for you. It really depends on how much data you have and on what queries and load you plan to execute.
AWS Redshift is intended for OLAP at petabyte or exabyte scale, handling heavy parallel workloads. Redshift can also aggregate data from other data sources (JDBC, S3, ...). However, Redshift is not OLTP; it carries more static server overhead and requires extra skills for managing the deployment.
So without more numbers and use cases, one cannot advise anything. The cloud is great in that you can just try things and see what fits.
AWS Redshift is really great when you only want to read data from the database. In the backend, Redshift is a column-oriented database, which makes it more suitable for analytics. You can transfer all your existing data to Redshift using AWS DMS, a service that reads the binlogs of your existing database and transfers your data automatically, so you don't have to do anything. From my personal experience, Redshift is really great.
I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing, and transforming, using temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers on which data is stored?
Because if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find an answer on the Amazon websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance at a million or more rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way the data is stored, that gives Redshift its amazing speed.
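A toy illustration of the row-versus-columnar point, with plain Python lists standing in for storage blocks: an analytic aggregate over one column only has to touch that column's array, not every full row.

```python
# Row-oriented: each record is stored together; a SUM(amount) must
# walk every complete row.
rows = [
    {"id": 1, "region": "eu", "amount": 10.0},
    {"id": 2, "region": "us", "amount": 20.0},
    {"id": 3, "region": "eu", "amount": 5.0},
]
row_total = sum(r["amount"] for r in rows)

# Column-oriented: each column is a contiguous array; the same SUM
# reads only the "amount" column and skips the rest entirely.
columns = {
    "id": [1, 2, 3],
    "region": ["eu", "us", "eu"],
    "amount": [10.0, 20.0, 5.0],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 35.0
```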
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updated data. Thus, it should not be substituted for a normal database being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but it can be improved by using columnar storage formats (e.g. ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
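Partitioning in practice usually means Hive-style key=value prefixes in the S3 paths, so the engine can prune files whose partition values fall outside the query's filter. A small sketch (bucket and table names are made up):

```python
from datetime import date

def partition_prefix(bucket, table, day):
    """Hive-style partition path (year=/month=/day=) that lets Athena or
    Redshift Spectrum skip files outside the queried date range."""
    return (
        f"s3://{bucket}/{table}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
    )

partition_prefix("my-lake", "clicks", date(2020, 5, 13))
# 's3://my-lake/clicks/year=2020/month=05/day=13/'
```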
The newer Amazon Redshift RA3 nodes can also offload less-used data to Amazon S3 and use caching to run fast queries. The benefit is that they separate storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
- has its own data storage, with no link to S3 (although Amazon Redshift Spectrum allows you to also query data held in S3, similar to AWS Athena).
- is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
- is a great data warehouse tool; just use it for that!
We're building a Lambda architecture on the AWS stack. A lack of DevOps knowledge forces us to prefer AWS managed solutions over custom deployments.
Our workflow:
[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views -----+
                                                                 |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views --+
[Speed layer]
We are already using three datastores: ElastiCache, DynamoDB, and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results should be queried by the serving layer, with low-latency random reads.
None of our databases fits the batch-insert & random-read requirements. DynamoDB doesn't fit batch inserts - it's too expensive because of the throughput required for them. Athena is MPP and moreover has a limit of 20 concurrent queries. ElastiCache is used by the streaming layer, and I'm not sure it's a good idea to perform batch inserts there.
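To put rough numbers on the DynamoDB cost claim (assuming items of up to 1 KB, so one standard write consumes one WCU; treat the item size as an assumption):

```python
import math

def required_wcu(rows_per_hour, item_kb=1.0):
    """Steady-state DynamoDB write capacity units: one WCU covers one
    standard write of up to 1 KB per second."""
    writes_per_sec = rows_per_hour / 3600
    return math.ceil(writes_per_sec * math.ceil(item_kb))

required_wcu(6_000_000)  # ~1667 WCU sustained during peak hours
required_wcu(500_000)    # ~139 WCU at the low end
```

Provisioning for the 6M-rows/hour peak means paying for ~1,667 WCU around the clock (or absorbing on-demand write pricing), which is where the "too expensive" judgment comes from.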
Should we introduce a fourth storage solution or stay with the existing ones?
Considered options:
Persist batch output to DynamoDB and ElastiCache (the part of the data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8 GB/day, goes to ElastiCache).
Introduce another database (HBase on EMR over S3, or Amazon Redshift?) as a solution.
Use S3 Select over Parquet to overcome Athena's concurrent query limits. That would also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info.
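For reference, S3 Select is invoked per object through boto3's select_object_content and returns an event stream rather than plain rows. A minimal sketch (bucket, key, and the event_hour column are hypothetical):

```python
def collect_records(event_stream):
    """Concatenate the Records payload chunks of an S3 Select event
    stream (interleaved with Stats/Progress/End events) into bytes."""
    return b"".join(
        ev["Records"]["Payload"] for ev in event_stream if "Records" in ev
    )

def s3_select_last_hour(bucket, key):
    # boto3 imported lazily so the pure helper above needs no AWS setup.
    import boto3
    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT * FROM S3Object s WHERE s.event_hour = '2020-05-13T10'",
        InputSerialization={"Parquet": {}},
        OutputSerialization={"JSON": {}},
    )
    return collect_records(resp["Payload"])
```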
The first option is bad because of the batch inserts into the ElastiCache that the streaming layer uses. Also, does it even follow the Lambda architecture - keeping batch and speed layer views in the same data stores?
The second solution is bad because of the fourth database storage, isn't it?
In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low latency random reads, they could even replace the DynamoDB/ElastiCache component from your solution, since you can write directly to them from the incoming stream (to a different table).
Druid is probably superior for this, but as per your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
I am evaluating Athena & Redshift Spectrum. Both serve the same purpose: Spectrum needs a Redshift cluster in place, whereas Athena is purely serverless. Athena uses Presto, and Spectrum uses Redshift's engine.
Are there any specific disadvantages for Athena or Redshift spectrum?
Any limitations on using Athena or Spectrum ?
I have used both across a few different use cases and conclude:
Advantages of Redshift Spectrum:
- Allows creation of Redshift tables
- Able to join Redshift tables with Redshift Spectrum tables efficiently
If you do not need those things, then you should consider Athena as well.
Athena differences from Redshift Spectrum:
- Billing. This is the major difference, and depending on your use case you may find one much cheaper than the other.
- Performance. I found Athena slightly faster.
- SQL syntax and features. Athena is derived from Presto and is a bit different from Redshift, which has its roots in Postgres.
- Connectivity. It's easy enough to connect to Athena using the API, JDBC, or ODBC, but many more products offer "standard out of the box" connections to Redshift.
Also, for either solution, make sure you use the AWS Glue Data Catalog rather than the Athena-managed catalog, as there are fewer limitations.
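On the billing point: both engines have historically charged on the order of $5 per TB of S3 data scanned (treat the exact rate as an assumption and check current pricing), with Spectrum adding the cost of the running Redshift cluster on top. A trivial sketch of the arithmetic:

```python
def scan_cost_usd(tb_scanned, per_tb=5.0):
    """Per-query scan charge used (historically) by both Athena and
    Redshift Spectrum: a flat rate per TB of S3 data scanned."""
    return tb_scanned * per_tb

# A query scanning 0.2 TB costs about $1 on either engine; with
# Spectrum you also pay for the Redshift cluster hours on top.
scan_cost_usd(0.2)  # 1.0
```

This is why columnar formats and partitioning matter so much for either engine: they shrink the bytes scanned, which is the billed quantity.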
This question has been up for quite a time, but still, I think I can contribute something to the discussion.
What is Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. (From the Doc)
Pretty straightforward, right?
Then comes the question of what Redshift Spectrum is, and why the Amazon folks made it when Athena was already pretty much a solution for external table queries.
So, the AWS folks wanted to create an extension to Redshift (which is pretty popular as a managed columnar datastore at this time) and give it the capability to talk to external tables (typically S3). They wanted to make life easier for Redshift users, mostly analytics people: many analytics tools don't support Athena but do support Redshift at this time. But creating your Redshift cluster and storing data there was a bottleneck - Redshift isn't that horizontally scalable, and it takes some downtime when adding new machines. If you are a Redshift user, making your storage cheaper basically makes your life much easier.
I suggest you use Redshift spectrum in the following cases:
You are an existing Redshift user and you want to store more data in Redshift.
You want to move colder data to an external table but still, want to join with Redshift tables in some cases.
You are unloading your data with Spark, or you just want to import data into Pandas or other tools for analysis.
And Athena can be useful when:
You are a new user and don't have a Redshift cluster. Access to Spectrum requires an active, running Redshift cluster, so Redshift Spectrum is not an option without Redshift.
Spectrum is still a developing tool; features such as transactions are still being added to make it more efficient.
BTW Athena comes with a nice REST API, so go for it if you want that.
All told, Redshift + Redshift Spectrum is indeed powerful, with lots of promise. But it still has a long way to go to mature.
If you are already using a Redshift database, then it is wise to use Spectrum alongside Redshift to get the required performance.
However, if you are just beginning to explore options, then consider Athena as the tool to go ahead with.
I had learned (from Adrian Cantrill's/LA's 2019 SA Pro course) that Redshift Spectrum uses your own Redshift cluster to provide more consistent performance than is available by leveraging the shared capacity that AWS makes available to Athena queries. I appreciate this information might only be useful for the exam; I didn't find his argument convincing.
I wrote this answer because I wasn't satisfied with the leading answer's treatment of Athena outperforming Redshift Spectrum. The rest of that answer is good, and I do not mean to copy any of it here (it hadn't registered with me when I wrote this).
I (again, based solely on hands-off research) would choose Spectrum when the majority of my data is in S3, which would typically be for larger data sets. The recent RA3 instances seem to overlap this niche, though. So I'd say Spectrum is best suited to where you have long-term Redshift clusters that, being OLAP nodes, have spare capacity to query S3.
Why would you use your own estate to perform the queries that Athena would do without such an investment from you? Caching, where it fits, and consistent performance, if I am to believe Adrian Cantrill over Jon Scott. This made me suspect RA3 might be edging Spectrum out - that, and the lack of decent literature on Spectrum. Why would Amazon offer a serverless product in Athena that outperforms the more expensive Redshift Spectrum? That is how they chose to deprecate RRS. I can't believe Spectrum is deprecated, so I offer this answer to contest that. Just look at https://aws.amazon.com/redshift/whats-new/.
I think the picture below (from https://d1.awsstatic.com/events/Summits/AMER2020/May13SummitOnline/Modernize_your_data_warehouse.pdf) shows fairly clearly that compute nodes are influential here, perhaps contrary to #JonScott's valuable insights above.
One final big difference: Athena is limited to IAM for authentication, as depicted in this re:Invent 2018 (ANT201-R1) slide:
One big limitation and differentiating factor is the ability to use structured data: Athena supports it for both JSON and Parquet file formats, while Redshift Spectrum only accepts flat data.
Another is the availability in Athena of GIS functions and also lambdas, which do come in handy sometimes.
Now, a standalone modern Postgres does everything mentioned and more, but as far as the comparison with Redshift (and Spectrum) goes, it's a tool that has stopped evolving.
My web application requires extremely low-latency reads/writes of small data blobs (<10 KB) that can be stored as key-value pairs. I am considering DynamoDB (with DAX), EFS, and ElastiCache. AWS claims that they all offer low latency, but I cannot find any head-to-head comparison, and it is also not clear to me whether these three are even in the same league. Can someone share any insight?
You are trying to compare different storage systems for different use cases with different pricing models.
EFS is a filesystem for which you don't need to provision storage devices, and which you can access from multiple EC2 instances. EFS might work fine for your use case, but you will need to manage files, meaning you will need to structure your data to fit in files. Alternatively, you might need to build a key-value or blob/object storage layer on top, depending on the level of structure and retrieval you need. There are products that solve this problem for you, such as S3, DynamoDB, and ElastiCache (Redis or Memcached).
S3 is blob storage: no structure, no data types, and items can't be updated, only replaced. You may only query by listing the blobs in a bucket. It is typically used for storing static media files.
DynamoDB is a non-relational (aka NoSQL) database, which can be used as a document or key-value store, in which data is structured and strongly typed, and which has query capabilities. It can store items up to 400 KB.
ElastiCache (Redis or Memcached) offers key-value stores that are typically used as a cache in front of a durable data store such as DynamoDB. In this case the application needs to be aware of the different layers, manage different APIs, and handle the caching logic itself.
With DAX, you can seamlessly integrate a cache layer without having the caching logic in the application. DAX currently provides a write-through cache for DynamoDB. The DAX APIs are compatible with the DynamoDB APIs, which makes it seamless to add a cache layer to an application that already uses DynamoDB: you simply substitute the DynamoDB client with the DAX client. Keep in mind that DAX currently supports Java, Node.js, Go, .NET, and Python clients only.
So it really depends on your workload. If you require sub-millisecond latency, without the headache of managing a cache layer, and your application is Java, Node.js, Go, .NET or Python then DAX is for you.
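The client substitution described above can be sketched as follows. The table name, the "pk" key schema, and the DAX endpoint are placeholders, and the DAX client comes from the separate amazondax package; everything downstream of get_table stays identical either way:

```python
def make_item(pk, blob):
    """Shape a small (<10 KB) key-value payload as a DynamoDB item;
    'pk' is a hypothetical partition-key attribute name."""
    return {"pk": pk, "data": blob}

def get_table(table_name, dax_endpoint=None):
    """Return a boto3-style Table, backed by DAX when an endpoint is
    given: the DAX resource is call-compatible with boto3's DynamoDB
    resource, so the calling code does not change."""
    import boto3
    if dax_endpoint:
        from amazondax import AmazonDaxClient  # pip install amazondax
        resource = AmazonDaxClient.resource(endpoint_url=dax_endpoint)
    else:
        resource = boto3.resource("dynamodb")
    return resource.Table(table_name)

# table = get_table("blobs", "daxs://my-dax.abc123.dax-clusters.us-east-1.amazonaws.com")
# table.put_item(Item=make_item("user#42", b"...payload under 10KB..."))
# table.get_item(Key={"pk": "user#42"})
```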