How to control increasing data volume in Redshift?

I have a data warehouse maintained in AWS Redshift. Both the data volume and the velocity have increased lately. One option is to keep scaling the cluster horizontally, at the expense of a higher cost of course. I was wondering whether there are any archiving options that would let me query the entire data set as usual (maybe with a compromise on query time) at low or no additional cost.
One option would be to use external tables and query the data directly from S3, but the tools used for this, like Athena and Glue, have their own cost, and that cost is charged on a per-query basis.

Easy options:
Ensure all tables have compression (check with SELECT * FROM svv_table_info;).
Maximize compression by changing large tables to use ENCODE zstd.
Switch small tables (fewer than ~50k rows, though it depends) to DISTSTYLE ALL (yes, this saves space!).
Switch from SSD-based nodes (dc2) to HDD-based nodes (ds2), which have roughly 8x more storage space.
Less easy options:
UNLOAD older data from Redshift to S3 and query using Redshift Spectrum.
Convert unloaded data to Parquet or ORC format using AWS Glue or AWS EMR and then query using Redshift Spectrum.
Please experiment with Redshift Spectrum. Query performance is typically very good and gets even better if your data is in a columnar format (Parquet/ORC).
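
The easy options above come down to a handful of SQL statements. Here is a minimal sketch in Python that only builds the statement text and never connects to a cluster. The table and column names (events, payload, country_codes) are hypothetical, the 50k-row cut-off is just the rule of thumb from the list, and it assumes your cluster supports ALTER COLUMN ... ENCODE (otherwise a deep copy into a new table would be needed).

```python
# Builds SQL text for the "easy options"; nothing here talks to Redshift.
# All table and column names passed in are hypothetical examples.

# Lists every table with its encoding flag, distribution style,
# size (in 1 MB blocks) and row count, largest first.
TABLE_INFO_SQL = (
    'SELECT "table", encoded, diststyle, size, tbl_rows '
    "FROM svv_table_info ORDER BY size DESC;"
)


def zstd_encode_sql(table: str, column: str) -> str:
    """Statement that switches one column to ZSTD compression."""
    return f"ALTER TABLE {table} ALTER COLUMN {column} ENCODE zstd;"


def diststyle_all_sql(table: str) -> str:
    """Statement that switches a table to DISTSTYLE ALL."""
    return f"ALTER TABLE {table} ALTER DISTSTYLE ALL;"


def suggest_diststyle_all(rows_by_table: dict, max_rows: int = 50_000) -> list:
    """Return DISTSTYLE ALL statements for tables below the row threshold."""
    return [
        diststyle_all_sql(table)
        for table, row_count in sorted(rows_by_table.items())
        if row_count < max_rows
    ]


statements = suggest_diststyle_all({"country_codes": 250, "events": 9_000_000})
```

You would feed suggest_diststyle_all with the tbl_rows values returned by the svv_table_info query, review the output, and run the statements by hand.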

Data stored within Amazon Redshift will provide the highest performance.
However, if you have data that is accessed less frequently, you could export (UNLOAD) it to Amazon S3, preferably as compressed, partitioned data. Storing it as Parquet or ORC would be even better!
You could then use Amazon Redshift Spectrum to Query External Data in Amazon S3. You can even join external data with Redshift data, so you could query historical information and current information in the one query.
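
As a rough sketch of that flow, the Python below builds the UNLOAD statement and the external schema definition as text. The bucket, IAM role ARN, schema and table names are all made up for illustration, and the external table itself would still have to be defined (with CREATE EXTERNAL TABLE or a Glue crawler) before the last query could run.

```python
# Builds the SQL for archiving to S3 and reading it back through Spectrum.
# Bucket, role ARN, schema and table names below are hypothetical.

def unload_sql(select: str, s3_prefix: str, iam_role: str, partition_by: list) -> str:
    """UNLOAD statement that writes query results to S3 as partitioned Parquet."""
    quoted = select.replace("'", "''")  # quotes inside UNLOAD's query must be doubled
    columns = ", ".join(partition_by)
    return (
        f"UNLOAD ('{quoted}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET\n"
        f"PARTITION BY ({columns});"
    )


def external_schema_sql(schema: str, catalog_db: str, iam_role: str) -> str:
    """External schema that exposes a data catalog database to Redshift."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


ROLE = "arn:aws:iam::123456789012:role/SpectrumRole"  # hypothetical role

archive_stmt = unload_sql(
    "SELECT * FROM orders WHERE region = 'EU' AND order_year < 2019",
    "s3://my-bucket/archive/orders/",
    ROLE,
    ["order_year"],
)
schema_stmt = external_schema_sql("archive", "archive_db", ROLE)

# Current rows from Redshift and historical rows from S3 in one query.
combined_query = (
    "SELECT order_date, amount FROM public.orders\n"
    "UNION ALL\n"
    "SELECT order_date, amount FROM archive.orders_history;"
)
```

Once the unloaded rows are verified in S3 you can delete them from the Redshift table, which is where the space saving comes from.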
Alternatively, you could use Amazon Athena to query the data directly from Amazon S3. This is similar to Redshift Spectrum, but does not require Redshift. Amazon Athena is based on Presto, so it is super-fast, especially if data is stored as compressed, partitioned, Parquet/ORC.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Please note that Redshift Spectrum and Amazon Athena charge based upon the amount of data read from disk. Therefore, compressed, partitioned Parquet/ORC is both cheaper and faster.
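
To see why the format matters so much for cost, here is some back-of-the-envelope arithmetic in Python. The price of 5 USD per TB scanned is an assumption for illustration (check the current price for your region), and the compression ratio, column fraction and partition fraction are invented numbers, not measurements.

```python
# Back-of-the-envelope cost model for engines that bill by bytes scanned.
# The default price and every ratio used below are illustrative assumptions.

def scan_cost_usd(bytes_scanned: float, usd_per_tb: float = 5.0) -> float:
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / 1e12 * usd_per_tb


def bytes_after_optimisation(
    raw_bytes: float,
    compression_ratio: float = 1.0,   # e.g. 3.0 means the files shrink to one third
    column_fraction: float = 1.0,     # share of columns the query reads (columnar formats)
    partition_fraction: float = 1.0,  # share of partitions the WHERE clause selects
) -> float:
    """Bytes actually read once compression, column pruning and partition pruning apply."""
    return raw_bytes / compression_ratio * column_fraction * partition_fraction


raw = 1e12  # 1 TB of uncompressed text files

full_scan = scan_cost_usd(raw)
optimised = scan_cost_usd(
    bytes_after_optimisation(
        raw,
        compression_ratio=3.0,
        column_fraction=0.2,        # 4 of 20 columns
        partition_fraction=1 / 12,  # one month out of a year
    )
)

print(f"full scan: ${full_scan:.2f}, optimised: ${optimised:.4f}")
```

With these made-up ratios the same query drops from dollars to a few cents, and it also reads far less data, which is why it runs faster too.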

Related

Comparison of Amazon S3, Amazon Athena, and Amazon Athena with partitioning

I wanted to know the performance improvement between using Amazon Athena without partitioning and with partitioning. I know for sure that Athena with partitioning is much better than Athena without it. But does Athena without partitioning give any improvement over Amazon S3?
Partitioning separates data files into separate directories. If the column used for partitioning is part of a query's WHERE clause, it allows Athena to skip over directories that do not contain relevant data. This is highly effective at improving query performance (and lowering cost) because it reduces the need for disk access and memory.
There are several ways to improve the performance of Amazon Athena:
Store data in a columnar format, such as Parquet. This allows Athena to go directly to specific columns without having to read all columns in a wide table. (This is similar to Amazon Redshift.)
Compress data (eg using Snappy compression) to reduce the amount of data that needs to be read from disk. This also reduces the cost of queries since they are charged based on the amount of data read from disk. (Instant savings!)
Partition data to completely skip over input files when the partition key is used in a query's WHERE clause.
For some examples of these benefits, see: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
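
To make the partitioning idea concrete, here is a toy simulation in Python. It assumes Hive-style name=value directory names, and the object keys are invented. It only illustrates which files a partition filter lets the engine ignore; it is not how Athena is actually implemented.

```python
# Toy model of partition pruning over Hive-style object keys.
# The keys below are invented examples.

def partition_of(key: str) -> dict:
    """Extract partition values (name=value path segments) from an object key."""
    values = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            values[name] = value
    return values


def prune(keys: list, **wanted: str) -> list:
    """Keep only the keys whose partition values match every requested value."""
    return [
        key
        for key in keys
        if all(partition_of(key).get(name) == value for name, value in wanted.items())
    ]


keys = [
    "logs/year=2019/month=12/part-0000.parquet",
    "logs/year=2020/month=01/part-0000.parquet",
    "logs/year=2020/month=02/part-0000.parquet",
]

# Equivalent of: WHERE year = '2020' AND month = '01'
to_read = prune(keys, year="2020", month="01")
```

A query that does not filter on year or month gets no such shortcut and has to read every file, which is the unpartitioned case you asked about.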

Does Amazon Redshift have its own storage backend?

I'm new to Redshift and need some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, plus temporary storage to pick up the specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical server on which data is stored?
I ask because I'm migrating from a conventional RDBMS to Redshift due to increased volume. Would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question seems basic, but I'm unable to find an answer on the Amazon websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases can start to lose performance once tables reach 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). It is this dedicated storage, and the way that data is stored, that gives Redshift its speed.
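
If the columnar point is new to you, this toy Python comparison may help. It counts how many values have to be touched to sum one column of a four-column table in a row layout versus a column layout. It is only a simplified model with made-up data, and it ignores compression, block metadata and distribution across nodes.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# Counts values touched, not bytes; the data is made up.

def sum_from_row_store(rows: list, column_index: int) -> tuple:
    """Sum one column; a row store has to pass over every field of every row."""
    total = 0.0
    touched = 0
    for row in rows:
        touched += len(row)
        total += row[column_index]
    return total, touched


def sum_from_column_store(columns: dict, name: str) -> tuple:
    """Sum one column; a column store reads only that column's values."""
    column = columns[name]
    return float(sum(column)), len(column)


rows = [
    (1, "a", 10.0, "x"),
    (2, "b", 20.0, "y"),
    (3, "c", 30.0, "z"),
]
columns = {
    "id": [1, 2, 3],
    "code": ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
    "flag": ["x", "y", "z"],
}

row_total, row_touched = sum_from_row_store(rows, 2)
col_total, col_touched = sum_from_column_store(columns, "amount")
```

Both layouts give the same total, but the row layout touches every value in the table while the column layout touches only the one column, and the gap widens as tables get wider.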
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3, and use caching to run fast queries. The benefit is that this separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, with no link to S3.
can also query data held in S3 through Amazon Redshift Spectrum (similar to AWS Athena).
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, just use it for that!

When to use Amazon Redshift spectrum over AWS Glue ETL to query on Amazon S3 data

As AWS Glue ETL can be a python script, it can be used to perform SQL queries using database interfaces and the data can be loaded from Amazon S3 into a DynamicFrame. I am trying to understand when it is advantageous to use Amazon Redshift spectrum to query on S3 data.
AWS Glue is used for gathering metadata (crawling) and for ETL. It is not for reporting or analytics. It can apply highly complex transformations, which makes it ideal for complex ETL requirements.
Redshift Spectrum is primarily used to produce reports and analysis against data stored in S3, usually combined with data stored in Redshift. However, it can also be used for simple ETL, and it is much simpler to set up and use than Glue if that is all you need.
There is one other option that you do not mention: Amazon Athena, a great tool for running queries directly against S3 data. It is similar to Redshift Spectrum but usually faster and cheaper, depending on your use case. It cannot combine S3 data with Redshift data.
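
By "simple ETL" with Spectrum I mean something like filtering an external table into a local one with plain SQL. Here is a minimal sketch in Python that just builds such a statement; the schema, table and column names are hypothetical, and it assumes the external table is already defined.

```python
# Builds a CREATE TABLE AS statement that copies filtered rows from an
# external (S3-backed) table into a local Redshift table.
# All identifiers used in the example call are hypothetical.

def ctas_from_external_sql(target: str, external_table: str, columns: list, where: str) -> str:
    """Statement that materialises part of an external table inside Redshift."""
    selected = ", ".join(columns)
    return (
        f"CREATE TABLE {target} AS\n"
        f"SELECT {selected}\n"
        f"FROM {external_table}\n"
        f"WHERE {where};"
    )


etl_stmt = ctas_from_external_sql(
    target="public.clicks_2020",
    external_table="spectrum.clicks_raw",
    columns=["user_id", "url", "clicked_at"],
    where="clicked_at >= '2020-01-01'",
)
```

Anything beyond this kind of select, filter and cast is where Glue starts to earn its extra setup effort.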

Are there any benefits to storing data in DynamoDB vs S3 for use with Redshift?

My particular scenario: Expecting to amass TBs or even PBs of JSON data entries which track price history for many items. New data will be written to the data store hundreds or even thousands of times per day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query outside of Redshift or ML.
Question: How do I decide if I should store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores are supported by Redshift, but I did notice Redshift Spectrum exists specifically for S3 data.
Firstly, DynamoDB is far more expensive than S3. S3 is only a storage solution, while DynamoDB is a full-fledged NoSQL database.
If you want to query using Redshift, you have to load the data into Redshift (or, for data in S3, expose it as an external table through Redshift Spectrum). Redshift is itself an independent, full-fledged database (a data warehousing solution).
You can use Athena to query data directly from S3.
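
Loading from either store is done with COPY, so the difference is mostly in cost and in how the load hits the source. Here is a sketch in Python that builds both variants as text. The table names, bucket and role ARN are hypothetical, and the default READRATIO of 50 is just an example value (it is the percentage of the DynamoDB table's provisioned throughput the load may use).

```python
# Builds COPY statements for loading JSON from S3 or rows from DynamoDB.
# Table names, bucket and role ARN in the example calls are hypothetical.

def copy_json_from_s3_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """COPY that loads JSON objects from S3, matching keys to column names."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS JSON 'auto';"
    )


def copy_from_dynamodb_sql(table: str, dynamodb_table: str, iam_role: str, read_ratio: int = 50) -> str:
    """COPY that reads a DynamoDB table, using read_ratio percent of its throughput."""
    if not 1 <= read_ratio <= 200:
        raise ValueError("read_ratio must be between 1 and 200")
    return (
        f"COPY {table}\n"
        f"FROM 'dynamodb://{dynamodb_table}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"READRATIO {read_ratio};"
    )


ROLE = "arn:aws:iam::123456789012:role/LoadRole"  # hypothetical role

s3_stmt = copy_json_from_s3_sql("price_history", "s3://my-bucket/prices/2020/", ROLE)
ddb_stmt = copy_from_dynamodb_sql("price_history", "PriceHistory", ROLE)
```

Note that the DynamoDB variant consumes read capacity on the source table for every load, whereas the S3 variant just reads files.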

Redshift COPY or snapshots?

I'm looking at using AWS Redshift to let users submit queries against the old archived data which isn't available on my web page.
The total data I'm dealing with across all my users is a couple of terabytes. The data is already in an S3 bucket, split up into files by week. Most requests won't deal with more than a few files totaling 100GB.
To keep costs down, should I use snapshots and delete our cluster when not in use, or should I have a smaller cluster which doesn't hold all of the data and only COPY data from S3 into a temporary table when running a query?
If you are just doing occasional queries where cost is more important than speed, you could consider using Amazon Athena, which can query data stored in Amazon S3. (Only in some AWS regions at the moment.) You are only charged for the amount of data read from disk.
To gain an appreciation for making Athena even better value, see: Analyzing Data in S3 using Amazon Athena
Amazon Redshift Spectrum can perform a similar job to Athena but requires an Amazon Redshift cluster to be running.
All other choices are really a trade-off between cost and access to your data. You could start by taking a snapshot of your Amazon Redshift database and then turning it off at night and on the weekends. Then, have a script that can restore it automatically for queries. Use fewer nodes to reduce costs -- this will make queries slower, but that doesn't seem to be an issue for you.
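
The snapshot-and-restore script could look roughly like the Python below. It assumes a boto3 Redshift client, whose delete_cluster and restore_from_cluster_snapshot methods take these keyword arguments as far as I know. The cluster and snapshot identifiers are hypothetical, and DryRunClient is a stand-in I made up so the sketch runs without AWS credentials.

```python
# Sketch of "snapshot, delete, restore later" for a Redshift cluster.
# Identifiers are hypothetical; DryRunClient only records calls for a dry run.

def shut_down_with_snapshot(redshift, cluster_id: str, snapshot_id: str) -> None:
    """Delete the cluster but keep its data in a final snapshot."""
    redshift.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=False,
        FinalClusterSnapshotIdentifier=snapshot_id,
    )


def restore_for_queries(redshift, cluster_id: str, snapshot_id: str) -> None:
    """Recreate the cluster from the snapshot taken at shutdown."""
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
    )


class DryRunClient:
    """Records method calls instead of sending them to AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record


client = DryRunClient()  # with real credentials: boto3.client("redshift")
shut_down_with_snapshot(client, "archive-cluster", "archive-cluster-final")
restore_for_queries(client, "archive-cluster", "archive-cluster-final")
```

Keep in mind that the cluster takes some time to become available after the restore call returns, so the script should wait before sending queries, and each final snapshot needs a fresh identifier.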