When to use Amazon Redshift Spectrum over AWS Glue ETL to query Amazon S3 data

As an AWS Glue ETL job can be a Python script, it can run SQL queries through database interfaces, and data can be loaded from Amazon S3 into a DynamicFrame. I am trying to understand when it is advantageous to use Amazon Redshift Spectrum to query S3 data instead.

AWS Glue is used for gathering metadata (crawling) and for ETL. It is not meant for reporting or analytics. It can apply highly complex transformations, which makes it a good fit for complex ETL requirements.
Redshift Spectrum is primarily used to produce reports and analysis against data stored in S3, usually combined with data stored in Redshift. However, it can also be used for simple ETL, and it is much simpler to set up and use than Glue if that is all you need.
There is one other option that you do not mention: Amazon Athena, a great tool for running queries directly against S3 data. It is similar to Redshift Spectrum and, depending on your use case, often faster and cheaper. It cannot combine S3 data with Redshift data.

Related

Different ways to create ad-hoc analysis on top of S3

I have a data lake in AWS S3. The data format is Parquet. The daily workload is ~70 GB. I want to build some ad-hoc analytics on top of that data. To do that I see 2 options:
Use AWS Athena to query the data via HiveQL, with the AWS Glue Data Catalog providing the table definitions.
Move the data from S3 into Redshift as a data warehouse and query Redshift to perform ad-hoc analysis.
What is the best way to do ad-hoc analysis in my case? Is there a more efficient way? And what are the pros and cons of the options mentioned?
PS
After 6 months I'm going to move data from S3 to Amazon Glacier, so the maximum data volume to query in S3/Redshift would be ~13 TB.

Does Amazon Redshift have its own storage backend

I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, and holding temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server store data on their own physical servers?
I ask because I'm migrating from a conventional RDBMS to Redshift due to increased volume: would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find an answer on the Amazon websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases can start to lose performance once tables grow past a million or so rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way that data is stored, that gives Redshift its speed.
The trade-off is that while Redshift is excellent for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3 and use caching to keep queries fast. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
can also query data held in S3 through Amazon Redshift Spectrum (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, just use it for that!

Amazon EMR vs Amazon Redshift

For the majority of use cases, Spark transformations can be done on streaming data or bounded data (say from Amazon S3) using Amazon EMR, and the transformed data can then be written back to S3.
The same transformations can also be achieved in Amazon Redshift by loading the different data sets from S3 into separate Redshift tables and then loading the data from those tables into a final table. (Now, with Redshift Spectrum, we can also select and transform data directly from S3.)
With that said, I see that the transformations can be done in both EMR and Redshift, with Redshift loads and transformations taking less development time.
So, should EMR be used mainly for use cases involving streaming/unbounded data? For what other use cases is EMR preferable? I am aware that Spark also provides other core, SQL and ML libraries, but for transformations alone (involving joins/reducers) I don't see a use case for EMR other than streaming, when the same transformation can be achieved in Redshift.
Please provide use cases for when to use EMR transformations vs Redshift transformations.
In the first instance I prefer to use Redshift for transformations, as:
Development is easier: SQL rather than Spark.
Maintenance and monitoring are easier.
Infrastructure costs are lower, assuming you can run during "off-peak" times.
Sometimes EMR is a better option. I would consider it in these circumstances:
When you want to have both raw and transformed data on S3, e.g. a "data lake" strategy.
Complex transformations are required. Some transformations are just not possible using Redshift, such as:
  managing complex and large JSON columns
  pivoting data dynamically (variable number of attributes)
Third-party libraries are required.
Data sizes are so large that a much bigger Redshift cluster would be needed to process the transformations.
There are other options besides Redshift and EMR, and these should also be considered. For example:
Standard Python or another scripting language, for:
  creating dynamic transformation SQL, which can be run in Redshift
  converting from CSV to Parquet or similar
  scheduling (e.g. Airflow)
AWS Athena:
  can be used with S3 (e.g. Parquet) input and output
  uses SQL (so some advantages in development time) with Presto syntax, which in some cases is more powerful than Redshift SQL
  can have significant cost benefits, as no permanent infrastructure costs are needed; you pay per use
AWS Batch and AWS Lambda should also be considered.
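As a rough illustration of the "dynamic transformation SQL" idea mentioned above, the sketch below generates a pivot query for a variable list of attributes using plain Python string building. The table and column names (events, device_id, attr_name, attr_value) are hypothetical, and the generated statement is only printed here; running it against Redshift is left to whatever client you already use.

```python
def build_pivot_sql(table, key_column, attr_column, value_column, attributes):
    """Build a SELECT that turns attribute rows into one column per attribute."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    columns = []
    for attr in attributes:
        literal = attr.replace("'", "''")            # escape quotes inside the SQL literal
        alias = '"' + attr.replace('"', '""') + '"'  # quote the output column name
        columns.append(
            f"MAX(CASE WHEN {attr_column} = '{literal}' THEN {value_column} END) AS {alias}"
        )
    select_list = ",\n    ".join([key_column] + columns)
    return f"SELECT\n    {select_list}\nFROM {table}\nGROUP BY {key_column};"


# Hypothetical names; in practice the attribute list would come from a
# SELECT DISTINCT on the attribute column.
sql = build_pivot_sql("events", "device_id", "attr_name", "attr_value", ["temp", "humidity"])
print(sql)
```

Because the column list is computed at run time, the same script keeps working when new attributes appear, which is the part that is awkward to express in static SQL.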

How to control increasing data volume in Redshift?

I have a data warehouse maintained in AWS Redshift. Both data volume and velocity have increased lately. One option is to keep scaling the cluster horizontally, at the expense of a higher cost of course. I was wondering if there are any archiving options available so that I can query the entire data set as usual (maybe with a compromise on query time) but at low or no additional cost?
One option would be to use external tables and query the data directly from S3, but the tools used for achieving this, like Athena and Glue, have their own costs, charged on a per-query basis.
Easy options:
Ensure all tables have compression; check with SELECT * FROM svv_table_info;.
Maximize compression by changing large tables to use ENCODE zstd.
Switch small tables, < ~50k rows (it depends), to DISTSTYLE ALL (yes, this saves space!).
Switch from SSD-based nodes (dc2) to HDD-based nodes (ds2), which have 8x more storage space.
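A minimal sketch of how the first three checks could be scripted, assuming you have already fetched rows from svv_table_info into dictionaries with the keys schema, table, encoded, diststyle and tbl_rows. The sample rows and the 50,000-row threshold are made up for illustration, and the function only prints suggestions; it does not connect to a cluster.

```python
def suggest_storage_fixes(tables, small_table_rows=50_000):
    """Turn svv_table_info-style rows into a list of suggested actions."""
    suggestions = []
    for t in tables:
        name = f"{t['schema']}.{t['table']}"
        # 'encoded' is reported as a Y/N style flag.
        if not t["encoded"].upper().startswith("Y"):
            suggestions.append(f"-- {name}: no column compression, consider ENCODE zstd")
        already_all = t["diststyle"].upper() in ("ALL", "AUTO(ALL)")
        if t["tbl_rows"] < small_table_rows and not already_all:
            suggestions.append(f"ALTER TABLE {name} ALTER DISTSTYLE ALL;")
    return suggestions


# Hypothetical rows standing in for a real svv_table_info result.
sample_rows = [
    {"schema": "public", "table": "country_codes", "encoded": "Y",
     "diststyle": "EVEN", "tbl_rows": 250},
    {"schema": "public", "table": "clicks", "encoded": "N",
     "diststyle": "KEY(user_id)", "tbl_rows": 900_000_000},
]
for line in suggest_storage_fixes(sample_rows):
    print(line)
```

Review the output before running anything: changing a distribution style redistributes the table, so it is best done during a quiet period.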
Less easy options:
UNLOAD older data from Redshift to S3 and query using Redshift Spectrum.
Convert unloaded data to Parquet or ORC format using AWS Glue or AWS EMR and then query using Redshift Spectrum.
Please experiment with Redshift Spectrum. Query performance is typically very good and gets even better if your data is in a columnar format (Parquet/ORC).
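To make the UNLOAD step more concrete, here is a small sketch that assembles an UNLOAD statement writing partitioned Parquet, which can avoid a separate conversion job. The bucket, IAM role ARN, table and column names are placeholders, and the partition column is assumed to be part of the selected columns.

```python
def build_unload_sql(select_sql, s3_prefix, iam_role_arn, partition_by=None):
    """Assemble a Redshift UNLOAD statement that writes Parquet files to S3."""
    # UNLOAD takes the query as a quoted string, so inner quotes are doubled.
    quoted_select = select_sql.replace("'", "''")
    parts = [
        f"UNLOAD ('{quoted_select}')",
        f"TO '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "FORMAT AS PARQUET",
    ]
    if partition_by:
        parts.append(f"PARTITION BY ({', '.join(partition_by)})")
    return "\n".join(parts) + ";"


# All names below are placeholders, not real resources.
unload_sql = build_unload_sql(
    "SELECT * FROM sales WHERE sale_date < '2019-01-01'",
    "s3://example-bucket/archive/sales/",
    "arn:aws:iam::123456789012:role/ExampleUnloadRole",
    partition_by=["sale_year"],
)
print(unload_sql)
```

Once the files are verified in S3 and exposed as an external table, the archived rows can be deleted from the Redshift table to free space.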
Data stored within Amazon Redshift will provide the highest performance.
However, if you have data that is less frequently accessed, you could export (UNLOAD) it to Amazon S3, preferably as compressed, partitioned data. Storing it as Parquet or ORC would be even better!
You could then use Amazon Redshift Spectrum to Query External Data in Amazon S3. You can even join external data with Redshift data, so you could query historical information and current information in the one query.
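One way to query historical and current information in the one query is a UNION ALL across the local table and the external (Spectrum) table. The sketch below only builds the SQL text; the schema name spectrum_archive, the table names and the columns are hypothetical, and it assumes an external schema has already been created and that both tables share the listed columns.

```python
def build_history_query(columns, local_table, external_table):
    """Combine current (local) and archived (external) rows in a single query."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} FROM {local_table}\n"
        "UNION ALL\n"
        f"SELECT {column_list} FROM {external_table};"
    )


# Hypothetical names: 'public.sales' lives in Redshift,
# 'spectrum_archive.sales' is an external table over files in S3.
history_sql = build_history_query(
    ["sale_id", "sale_date", "amount"],
    "public.sales",
    "spectrum_archive.sales",
)
print(history_sql)
```

Wrapping a query like this in a view lets reporting tools keep using a single table name while older data lives in S3.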
Alternatively, you could use Amazon Athena to query the data directly from Amazon S3. This is similar to Redshift Spectrum, but does not require Redshift. Amazon Athena is based on Presto, so it is super-fast, especially if data is stored as compressed, partitioned, Parquet/ORC.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Please note that Redshift Spectrum and Amazon Athena charge based upon the amount of data read from disk. Therefore, compressed, partitioned Parquet/ORC is both cheaper and faster.
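A back-of-the-envelope calculation shows why this matters. The sketch assumes a price of 5 USD per TB scanned (check current pricing for your region), a 3:1 compression ratio, and a query that reads 2 of 20 similarly sized columns from 1 of 30 daily partitions. All of these numbers are illustrative assumptions, not measurements.

```python
TB = 1024 ** 4  # bytes in one terabyte (binary)


def scan_cost_usd(bytes_scanned, usd_per_tb=5.0):
    """Estimate the per-query charge from the number of bytes scanned."""
    return bytes_scanned / TB * usd_per_tb


raw_bytes = 1 * TB        # assumed size of the uncompressed data set
compression_ratio = 3     # assumed Parquet/ORC compression ratio
columns_read = 2 / 20     # query touches 2 of 20 similarly sized columns
partitions_read = 1 / 30  # query touches 1 of 30 daily partitions

full_scan = scan_cost_usd(raw_bytes)
optimized_bytes = raw_bytes / compression_ratio * columns_read * partitions_read
optimized_scan = scan_cost_usd(optimized_bytes)

print(f"uncompressed full scan: ${full_scan:.2f}")
print(f"partitioned Parquet:    ${optimized_scan:.4f}")
```

The three factors multiply, so even modest compression and pruning cut the bytes scanned by orders of magnitude.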

What are the differences between Amazon Redshift and the new AWS Glue datawarehousing services?

I am confused about these two services. It looks that they are offering the same service. Probably the only difference is that the Glue catalog can contain a wider range of data sources. Does it mean that AWS Glue can replace Redshift?
The comment is right: these two services are not the same. AWS Glue is an ETL service, while Amazon Redshift is a data warehousing service.
According to the AWS documentation:
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It allows you to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
According to the AWS documentation:
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores
You can refer to the documentation provided by AWS for details, but essentially these are totally different services.