I wanted to know the performance improvement when we use Amazon Athena without partitioning and with partitioning. I know for sure that Athena with partitioning is much better than Athena. But does Athena without partitioning give any improvement over Amazon S3?
Partitioning separates data files into separate directories. If the column used for partitioning is part of a query's WHERE clause, it allows Athena to skip-over directories that do not contain relevant data. This is highly effective at improve query performance (and lowering cost) because it reduces the need for disk access and memory.
There are several ways to improve the performance of Amazon Athena:
Store data in a columnar format, such as Parquet. This allows Athena to go directly to specific columns without having to read all columns in a wide table. (This is similar to Amazon Redshift.)
Compress data (eg using Snappy compression) to reduce the amount of data that needs to be read from disk. This also reduces the cost of queries since they are charged based on the amount of data read from disk. (Instant savings!)
Partition data to completely skip-over input files when the partition key is used in a query's WHERE clause
For some examples of these benefits, see: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
I'm new to Redshift and having some clarification on how Redshift operates:
Does Amazon Redshift has their own backend storage platform or it depends on S3 to store the data as objects and Redshift is used only for querying, processing and transforming and has temporary storage to pick up the specific slice from S3 and process it?
In the sense, does redshift has its own backend cloud space like oracle or Microsoft SQL having their own physical server in which data is stored?
Because, if I'm migrating from a conventional RDBMS system to Redshift due to increased volume, If I opt for Redshift alone would do or should I opt for combination of Redshift and S3.
This question seems to be basic, but I'm unable to find answer in Amazon websites or any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and is stored in a columnar format, making it suitable for handling "wide" tables (which are typical in data warehouses). This is what gives Redshift its speed. In fact, it is the dedicated storage, and the way that data is stored, that gives Redshift its amazing speed.
The trade-off, however, means that while Redshift is amazing for queries large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (eg ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3, and uses caching to run fast queries. The benefit is that it separates storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
looking at your question, you may benefit from professional help with your architecture.
However to get you started, Redshift::
has its own data storage, no link to s3.
Amazon Redshift Spectrum allows you to also query data held in s3 (similar to AWS
Athena)
is not a good alternative as a back-end database to replace a
traditional RDBMS as transactions are very slow.
is a great data warehouse tool, just use it for that!
As AWS Glue ETL can be a python script, it can be used to perform SQL queries using database interfaces and the data can be loaded from Amazon S3 into a DynamicFrame. I am trying to understand when it is advantageous to use Amazon Redshift spectrum to query on S3 data.
AWS Glue is used for gathering metadata (crawling) and for ETL. It is not for reporting or analytics. It can apply highly complex transformations (ideal for complex ETL requirement).
Redshift Spectrum is primarily used to produce reports and analysis against data stored in S3, usually combined with data stored on Redshift. However is CAN also be used for simple ETL. Much simpler to set up and use than Glue if you just need simple type ETL.
There is one other option that you do not mention, that is amazon Athena, this is a great tool to run queries directly against S3 data. It is similar to Redshift Spectrum but usually faster and cheaper, depending on your use case. It cannot combine S3 data with Redshift data.
My particular scenario: Expecting to amass TBs or even PBs of JSON data entries which track price history for many items. New data will be written to the data store hundreds or even thousands of times per a day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query outside of Redshift or ML.
Question: How do I decide if I should store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores are supported with redshift, but I did notice Redshift Spectrum exists specifically for S3 data.
Firstly DynamoDB is far more expensive than S3. S3 is only a storage solution; while DynamoDB is a full-fledge NoSQL database.
If you want to query using Redshift; you have to load data into Redshift. Redshift is again an independent full-fledge database ( warehousing solution ).
You can use Athena to query data directly from S3.
i'm looking at using AWS Redshift to let users submit queries against the old archived data which isn't available in my web page.
the total data i'm dealing with across all my users is a couple of terabytes. the data is already in an s3 bucket, split up into files by week. most requests won't deal with more than a few files totaling 100GB.
to keep costs down should i use snapshots and delete our cluster when not in use or should i have a smaller cluster which doesn't hold all of the data and only COPY data from S3 into a temporary table when running a query?
If you are just doing occasional queries where cost is more important than speed, you could consider using Amazon Athena, which can query data stored in Amazon S3. (Only in some AWS regions at the moment.) You are only charged for the amount of data read from disk.
To gain an appreciation for making Athena even better value, see: Analyzing Data in S3 using Amazon Athena
Amazon Redshift Spectrum can perform a similar job to Athena but requires an Amazon Redshift cluster to be running.
All other choices are really a trade-off between cost and access to your data. You could start by taking a snapshot of your Amazon Redshift database and then turning it off at night and on the weekends. Then, have a script that can restore it automatically for queries. Use fewer nodes to reduce costs -- this will make queries slower, but that doesn't seem to be an issue for you.