For the majority of use cases, Spark transformations can be run on streaming data or bounded data (say, from Amazon S3) using Amazon EMR, and the transformed data can then be written back to S3.
The transformations can also be achieved in Amazon Redshift by loading the different data sets from S3 into separate Redshift tables, and then loading the data from those tables into a final table. (Now, with Redshift Spectrum, we could also select and transform data directly from S3.)
With that said, I see that the transformations can be done in both EMR and Redshift, with Redshift loads and transformations requiring less development time.
So, should EMR be used mainly for use cases involving streaming/unbounded data? For what other use cases is EMR preferable? I am aware that Spark also provides other libraries (core, SQL, ML), but for transformations alone (involving joins/reducers) I don't see a use case for EMR other than streaming, when the same transformation can also be achieved in Redshift.
Please provide use cases for when to use EMR transformations versus Redshift transformations.
In the first instance I prefer to use Redshift for transformations, as:
Development is easier: SQL rather than Spark
Maintenance / monitoring is easier
Infrastructure costs are lower, assuming you can run during "off-peak" times
Sometimes EMR is a better option. I would consider it in these circumstances:
When you want to have both raw and transformed data on S3, e.g. a "data lake" strategy
Complex transformations are required. Some transformations are just not possible using Redshift, such as managing complex and large JSON columns, or pivoting data dynamically (variable number of attributes)
Third-party libraries are required
Data sizes are so large that a much bigger Redshift cluster would be needed to process the transformations
There are additional options besides Redshift and EMR, and these should also be considered. For example:
Standard Python or another scripting language, to create dynamic transformation SQL that can be run in Redshift, to convert CSV to Parquet or similar, and for scheduling (e.g. Airflow)
AWS Athena, which can be used with S3 (e.g. Parquet) input and output; uses SQL (so some advantages in development time) with Presto syntax, which in some cases is more powerful than Redshift SQL; and can have significant cost benefits, as there are no permanent infrastructure costs and you pay on usage
AWS Batch and AWS Lambda should also be considered.
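To illustrate the "dynamic transformation SQL" option mentioned above, here is a minimal Python sketch that generates a pivot query for a variable number of attributes. It is only an illustration under assumptions: the table and column names (device_readings, device_id, attr_name, attr_value) are hypothetical, and the attribute list would normally come from a SELECT DISTINCT on the attribute column rather than being hard-coded.

```python
from typing import Iterable


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def quote_ident(name: str) -> str:
    """Quote a string as a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_pivot_sql(table: str, key_col: str, attr_col: str,
                    value_col: str, attributes: Iterable[str]) -> str:
    """Build a pivot query with one output column per attribute value."""
    select_items = [quote_ident(key_col)]
    for attr in attributes:
        # One conditional aggregate per attribute turns rows into columns.
        select_items.append(
            f"MAX(CASE WHEN {quote_ident(attr_col)} = {quote_literal(attr)} "
            f"THEN {quote_ident(value_col)} END) AS {quote_ident(attr)}"
        )
    select_clause = "SELECT " + ",\n       ".join(select_items)
    from_clause = f"FROM {quote_ident(table)}"
    group_clause = f"GROUP BY {quote_ident(key_col)};"
    return "\n".join([select_clause, from_clause, group_clause])


if __name__ == "__main__":
    # Hypothetical names; the attribute list is assumed to be known up front.
    print(build_pivot_sql("device_readings", "device_id", "attr_name",
                          "attr_value", ["temperature", "humidity"]))
```

The generated statement is plain SQL text, so it can be submitted to Redshift through whichever database driver or scheduler you already use.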
I'm new to Redshift and would like some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, and holding temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server has its own physical server on which the data is stored?
I ask because I'm migrating from a conventional RDBMS to Redshift due to increased volume. Would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find an answer on the Amazon websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way the data is stored, that gives Redshift its speed.
The trade-off, however, is that while Redshift is excellent for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
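As a rough illustration of what "partitioning files" can look like, the sketch below builds S3 object keys using the common Hive-style key=value folder layout, and then mimics partition pruning by selecting only the keys for one month. The prefix, table name and file name are hypothetical, and the layout is an assumption about how you might organise the data, not something prescribed above.

```python
from datetime import date
from typing import Iterable, List


def partitioned_key(prefix: str, table: str, day: date, filename: str) -> str:
    """Return an object key with Hive-style key=value partition folders."""
    return (
        f"{prefix.strip('/')}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def keys_for_month(keys: Iterable[str], year: int, month: int) -> List[str]:
    """Mimic partition pruning: keep only keys under one year/month partition."""
    marker = f"/year={year:04d}/month={month:02d}/"
    return [k for k in keys if marker in k]


if __name__ == "__main__":
    # Hypothetical prefix, table and file names.
    days = [date(2020, 4, 30), date(2020, 5, 1), date(2020, 5, 13)]
    keys = [partitioned_key("warehouse", "clicks", d, "part-0000.parquet")
            for d in days]
    for key in keys_for_month(keys, 2020, 5):
        print(key)
```

The point of the layout is that a query filtered on year and month only needs to touch the files under the matching folders, instead of scanning everything.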
The newer Amazon Redshift RA3 nodes can also offload less-used data to Amazon S3, and use caching to keep queries fast. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3
can also query data held in S3 through Amazon Redshift Spectrum (similar to AWS Athena)
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow
is a great data warehouse tool, just use it for that!
As AWS Glue ETL can be a Python script, it can be used to perform SQL queries using database interfaces, and the data can be loaded from Amazon S3 into a DynamicFrame. I am trying to understand when it is advantageous to use Amazon Redshift Spectrum to query S3 data.
AWS Glue is used for gathering metadata (crawling) and for ETL. It is not for reporting or analytics. It can apply highly complex transformations (ideal for complex ETL requirements).
Redshift Spectrum is primarily used to produce reports and analysis against data stored in S3, usually combined with data stored in Redshift. However, it CAN also be used for simple ETL. It is much simpler to set up and use than Glue if you just need simple ETL.
There is one other option that you do not mention: Amazon Athena. This is a great tool for running queries directly against S3 data. It is similar to Redshift Spectrum but usually faster and cheaper, depending on your use case. It cannot combine S3 data with Redshift data.
I am evaluating Athena and Redshift Spectrum. Both serve the same purpose; Spectrum needs a Redshift cluster in place, whereas Athena is purely serverless. Athena uses Presto and Spectrum uses Redshift's engine.
Are there any specific disadvantages for Athena or Redshift spectrum?
Any limitations on using Athena or Spectrum ?
I have used both across a few different use cases and conclude:
Advantages of Redshift Spectrum:
Allows creation of Redshift tables
Able to join Redshift tables with Redshift Spectrum tables efficiently
If you do not need those things then you should consider Athena as well.
Athena differences from Redshift Spectrum:
Billing. This is the major difference, and depending on your use case you may find one much cheaper than the other.
Performance. I found Athena slightly faster.
SQL syntax and features. Athena is derived from Presto and is a bit different from Redshift, which has its roots in Postgres.
Connectivity. It's easy enough to connect to Athena using the API, JDBC or ODBC, but many more products offer a "standard out of the box" connection to Redshift.
Also, for either solution, make sure you use the AWS Glue metadata catalog rather than Athena's, as there are fewer limitations.
This question has been up for quite a time, but still, I think I can contribute something to the discussion.
What is Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. (From the Doc)
Pretty straightforward, right?
Then comes the question of what Redshift Spectrum is, and why the Amazon folks made it when Athena was already pretty much a solution for external table queries.
So, the AWS folks wanted to create an extension to Redshift (which is pretty popular as a managed columnar datastore at this time) and give it the capability to talk to external tables (typically on S3). They also wanted to make life easier for Redshift users, mostly analytics people: many analytics tools don't support Athena but do support Redshift at this time. However, creating your Redshift cluster and storing data in it was a bottleneck. Also, Redshift isn't that horizontally scalable, and adding new machines takes some downtime. If you are a Redshift user, making your storage cheaper basically makes your life much easier.
I suggest you use Redshift spectrum in the following cases:
You are an existing Redshift user and you want to store more data in Redshift.
You want to move colder data to an external table but still want to join it with Redshift tables in some cases.
You unload your data with Spark, or you just want to import the data into Pandas or other tools for analysis.
And Athena can be useful when:
You are a new user and don't have a Redshift cluster. Access to Spectrum requires an active, running Redshift instance, so Redshift Spectrum is not an option without Redshift.
Spectrum is still a developing tool, and features such as transactions are still being added to make it more efficient.
BTW, Athena comes with a nice REST API, so go for it if you want that.
All that is to say, Redshift + Redshift Spectrum is indeed powerful and holds a lot of promise. But it still has a long way to go to be mature.
If you are already using a Redshift database, it would be wise to use Spectrum along with Redshift to get the required performance.
However, if you are just beginning to explore options, you can consider Athena as the tool to go ahead with.
I had learned (from Adrian Cantrill's/LA's 2019 SA Pro course) that Redshift Spectrum would use one's own Redshift cluster to provide more consistent performance than is available by leveraging the shared capacity which AWS makes available to Athena queries. I appreciate this information might only be useful for the exam; I didn't find his argument convincing.
I wrote this answer because I wasn't satisfied with the leading answer's treatment of Athena outperforming Redshift Spectrum. The rest of that answer is good and I do not mean to directly copy any of that here (without references it hadn't registered with me when I wrote this).
I (again, based solely on my hands-off research) would choose Spectrum when the majority of my data is in S3, which would typically be for the larger data sets. The recent RA3 instances seem to overlap this niche though. So I say Spectrum is most suited to where we have long term Redshift clusters that, being OLAP nodes, have spare capacity to query S3.
Why would you use your own estate to perform the queries that Athena would do without such an investment from you? Caching, where it fits. And consistent performance, if I am to believe Adrian Cantrill more than Jon Scott. This made me suspect RA3 might be edging Spectrum out; that and the lack of decent literature on Spectrum. Why would Amazon offer a serverless product in Athena that outperforms Redshift Spectrum which is more expensive? This is how they are choosing to deprecate RRS. I can't believe Spectrum is deprecated so must offer this answer to contest this. Just look at https://aws.amazon.com/redshift/whats-new/.
I think the picture below (from https://d1.awsstatic.com/events/Summits/AMER2020/May13SummitOnline/Modernize_your_data_warehouse.pdf) is fairly clear that compute nodes are influential here, and perhaps contrary to @JonScott's valuable insights above.
One final big difference is Athena is limited to IAM for authentication, as depicted in this reinvent 2018 (ANT201-R1) slide:
One big limitation and differing factor is the ability to use structured data. Athena supports it for both JSON and Parquet file formats while Redshift Spectrum only accepts flat data.
Another is the availability of GIS functions that Athena has and also lambdas, which do come in handy sometimes.
Now if you ran a standalone new Postgres then that does everything and more, but as far as comparison between Redshift (and Spectrum) goes - it's a tool that has stopped evolving.
I am confused about these two services. It looks like they offer the same service. Probably the only difference is that the Glue catalog can contain a wider range of data sources. Does this mean that AWS Glue can replace Redshift?
The comment is right: these two services are not the same. AWS Glue is an ETL service, while Amazon Redshift is a data warehousing service.
According to the AWS documentation:
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
According to the AWS documentation:
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
You can refer to the documentation provided by AWS for details, but essentially these are totally different services.
I am using Redshift Spectrum. I created an external table and uploaded a CSV data file with around 5.5 million records to S3. If I fire a query on this external table, it takes ~15 seconds, whereas if I run the same query on Amazon Redshift, I get the same result in ~2 seconds. What could be the reason for this performance lag, when AWS claims it to be a very high-performance platform? Please suggest a solution for getting the same performance using Spectrum.
For your performance optimizations please have a look to understand your query.
Right now, you get the best performance if you have multiple CSV files rather than a single one. Typically, you get great performance if the number of files per query is at least about an order of magnitude larger than the number of nodes in your cluster.
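As a sketch of that advice, the following Python function splits one CSV file into several smaller files, writing rows round-robin so the parts end up roughly the same size. It is only an illustration: the file names are hypothetical, the number of parts is for you to choose based on your cluster, and the header row is repeated in every part, so the external table definition would need to skip it.

```python
import csv
from pathlib import Path
from typing import List


def split_csv(source: Path, out_dir: Path, parts: int,
              has_header: bool = True) -> List[Path]:
    """Split one CSV file into `parts` files, distributing rows round-robin."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{source.stem}_part{i:04d}.csv" for i in range(parts)]
    handles = [p.open("w", newline="") for p in paths]
    try:
        writers = [csv.writer(h) for h in handles]
        with source.open(newline="") as src:
            reader = csv.reader(src)
            if has_header:
                header = next(reader, None)
                if header is not None:
                    # Repeat the header so each part is a valid CSV on its own.
                    for writer in writers:
                        writer.writerow(header)
            for i, row in enumerate(reader):
                writers[i % parts].writerow(row)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

For example, split_csv(Path("records.csv"), Path("parts"), 40) would produce forty files that can then be uploaded under a single S3 prefix for the external table to read.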
In addition, if you use Parquet files you get the advantage of a columnar format on S3, rather than CSV, which requires reading the whole file from S3. This decreases your cost as well.
You can use the script to convert data to Parquet:
The reply from the AWS forum was as follows:
I understand that you have the same query running on Redshift and Redshift Spectrum. However, the results are different: one runs in 2 seconds while the other runs in around 15 seconds.
First of all, we must agree that both Redshift and Spectrum are different services designed differently for different purpose. Their internal structure varies a lot from each other, while Redshift relies on EBS storage, Spectrum works directly with S3. Redshift Spectrum's queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remain in Amazon S3.
Spectrum is also designed to deal with petabytes of structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables, while Redshift offers you the ability to store data efficiently and in a highly optimized manner by means of distribution and sort keys.
AWS does not advertise Spectrum as a faster alternative to Redshift. We offer Amazon Redshift Spectrum as an add-on solution to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena).
In terms of query performance, unfortunately, we can't guarantee performance improvements, since the Redshift Spectrum layer produces query plans completely different from the ones produced by Redshift's database engine interpreter. This reason alone would be enough to discourage any query performance comparison between these services, as it is not fair to either of them.
Regarding your question about on-the-fly nodes: Spectrum adds them based on the demands of your queries, and can potentially use thousands of instances to take advantage of massively parallel processing. There aren't any specific criteria that trigger this behavior; however, by following the best practices on how to improve query performance[1] and how to create data files for queries[2], you can potentially improve Spectrum's overall performance.
Finally, I would like to point to some interesting documentation that clarifies a bit more how to achieve better performance. Please see the references at the end!
I am somewhat late to answer this. As of Feb-2018, AWS supports Redshift Spectrum queries on files in columnar formats like Parquet, ORC etc. In your case, you are storing the file as .CSV. CSV is row-based, which results in pulling out the entire row for any field queried. I suggest you convert the files from .csv to Parquet format before querying. This will certainly result in much faster performance. Details from AWS: Amazon Redshift Spectrum
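To make the row-based versus columnar point concrete, here is a toy Python model (not how Parquet or Spectrum is actually implemented) that counts how many field values have to be touched to sum a single column under each layout. The sample records are made up for the illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sample data: three records with three fields each.
ORDERS: List[Dict[str, object]] = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob", "amount": 12.5},
    {"order_id": 3, "customer": "carol", "amount": 7.5},
]


def sum_amount_row_layout(rows: List[Dict[str, object]]) -> Tuple[float, int]:
    """Row layout: the whole record is read to reach one field."""
    total = 0.0
    fields_read = 0
    for record in rows:
        fields_read += len(record)  # every field of the row is pulled in
        total += float(record["amount"])
    return total, fields_read


def to_columns(rows: List[Dict[str, object]]) -> Dict[str, List[object]]:
    """Re-arrange the records so each column is stored as its own list."""
    columns: Dict[str, List[object]] = {}
    for record in rows:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_amount_column_layout(columns: Dict[str, List[object]]) -> Tuple[float, int]:
    """Columnar layout: only the values of the requested column are read."""
    values = columns["amount"]
    return sum(float(v) for v in values), len(values)


if __name__ == "__main__":
    print(sum_amount_row_layout(ORDERS))
    print(sum_amount_column_layout(to_columns(ORDERS)))
```

Both functions return the same total, but the row layout touches every field of every record while the columnar layout touches only the one column, which is the effect that makes columnar files cheaper to scan.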
These results are to be expected. The whole reason for using Amazon Redshift is that it stores data in a highly-optimized manner to provide fast queries. Data stored outside of Redshift will not run anywhere near as fast.
The intention of Amazon Redshift Spectrum is to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena), but it makes no performance guarantees.