I wanted to understand the performance difference between using Amazon Athena with partitioning and without partitioning. I know for sure that Athena with partitioning is much better than Athena without it. But does Athena without partitioning give any improvement over reading the data from Amazon S3 directly?
Partitioning separates data files into separate directories. If the column used for partitioning is part of a query's WHERE clause, it allows Athena to skip over directories that do not contain relevant data. This is highly effective at improving query performance (and lowering cost) because it reduces the amount of data that needs to be read from disk and processed in memory.
There are several ways to improve the performance of Amazon Athena:
Store data in a columnar format, such as Parquet. This allows Athena to go directly to specific columns without having to read all columns in a wide table. (This is similar to Amazon Redshift.)
Compress data (e.g. using Snappy compression) to reduce the amount of data that needs to be read from disk. This also reduces the cost of queries, since they are charged based on the amount of data read from disk. (Instant savings!)
Partition data to completely skip over input files when the partition key is used in a query's WHERE clause
For some examples of these benefits, see: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
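To make those three tips concrete, here is a minimal sketch (the bucket, table, and column names are invented for illustration) of a partitioned, Parquet-backed table in Athena and a query that prunes partitions:

    -- Hypothetical table over Snappy-compressed Parquet files, partitioned by date
    CREATE EXTERNAL TABLE sales (
        order_id string,
        amount   double
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://my-example-bucket/sales/';

    -- Pick up Hive-style partition directories (dt=2020-01-01/ etc.) from S3
    MSCK REPAIR TABLE sales;

    -- Only the dt=2020-01-01 directory is scanned, so the query is faster and cheaper
    SELECT SUM(amount) FROM sales WHERE dt = '2020-01-01';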
I'm new to Redshift and would like some clarification on how Redshift operates:
Does Amazon Redshift have its own backend storage platform, or does it depend on S3 to store the data as objects, with Redshift used only for querying, processing and transforming, keeping temporary storage to pick up a specific slice from S3 and process it?
In other words, does Redshift have its own backend storage, the way Oracle or Microsoft SQL Server have their own physical servers on which the data is stored?
I ask because, if I'm migrating from a conventional RDBMS to Redshift due to increased volume, would Redshift alone do, or should I opt for a combination of Redshift and S3?
This question may seem basic, but I'm unable to find an answer on Amazon's websites or in any of the blogs related to Redshift.
Yes, Amazon Redshift uses its own storage.
The prime use-case for Amazon Redshift is running complex queries against huge quantities of data. This is the purpose of a "data warehouse".
Whereas normal databases start to lose performance when there are 1+ million rows, Amazon Redshift can handle billions of rows. This is because data is distributed across multiple nodes and stored in a columnar format, making it well suited to the "wide" tables that are typical in data warehouses. It is this dedicated storage, and the way the data is stored in it, that gives Redshift its speed.
The trade-off, however, is that while Redshift is amazing for querying large quantities of data, it is not designed for frequently updating data. Thus, it should not be substituted for a normal database that is being used by an application for transactions. Rather, Redshift is often used to take that transactional data, combine it with other information (customers, orders, transactions, support tickets, sensor data, website clicks, tracking information, etc.) and then run complex queries that combine all that data.
Amazon Redshift can also use Amazon Redshift Spectrum, which is very similar to Amazon Athena. Both services can read data directly from Amazon S3. Such access is not as efficient as using data stored directly in Redshift, but can be improved by using columnar storage formats (e.g. ORC and Parquet) and by partitioning files. This, of course, is only good for querying data, not for performing transactions (updates) against the data.
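As a rough sketch of how Spectrum is wired up (the schema, table, bucket, and IAM role names below are placeholders, not a definitive setup):

    -- Register an external schema backed by the AWS Glue Data Catalog
    CREATE EXTERNAL SCHEMA spectrum_schema
    FROM DATA CATALOG
    DATABASE 'analytics'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define an external table over partitioned Parquet files in S3
    CREATE EXTERNAL TABLE spectrum_schema.clicks (
        user_id bigint,
        url     varchar(2048)
    )
    PARTITIONED BY (event_date date)
    STORED AS PARQUET
    LOCATION 's3://my-example-bucket/clicks/';

    -- Partitions of an external table are registered explicitly
    ALTER TABLE spectrum_schema.clicks
    ADD IF NOT EXISTS PARTITION (event_date = '2020-01-01')
    LOCATION 's3://my-example-bucket/clicks/event_date=2020-01-01/';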
The newer Amazon Redshift RA3 nodes also have the ability to offload less-used data to Amazon S3, and they use caching to keep queries fast. The benefit is that they separate storage from compute.
Quick summary:
If you need a database for an application, use Amazon RDS
If you are building a data warehouse, use Amazon Redshift
If you have a lot of historical data that is rarely queried, store it in Amazon S3 and query it via Amazon Athena or Amazon Redshift Spectrum
Looking at your question, you may benefit from professional help with your architecture.
However, to get you started, Redshift:
has its own data storage, no link to S3.
(Amazon Redshift Spectrum allows you to also query data held in S3, similar to AWS Athena.)
is not a good alternative as a back-end database to replace a traditional RDBMS, as transactions are very slow.
is a great data warehouse tool, just use it for that!
I have a couple of tables in my S3 bucket. The tables are big, both in size and in the number of files; they are stored in JSON (suboptimal, I know) and have a lot of partitions.
Now I want to enable AWS Glue Data Catalog and AWS Glue Crawlers, however I am terrified by the price of the crawlers going through all of the data.
The schema doesn't change often so it is not necessary to go through all of the files on S3.
Will the Crawlers go through all the files by default? Is it possible to configure a smarter sampling strategy that would look inside just some of the files instead of all of them?
Depending on your bucket structure, you could make use of exclude paths and point the crawlers only at the specific prefixes that you want crawled. If the partitioning is Hive-style, you can use Athena to execute MSCK REPAIR TABLE to add the partitions. Alternatively, you can create the tables manually in Athena and then run MSCK REPAIR TABLE, which is bound to take a very long time if you have too many partitions and the files are huge, as you mentioned.
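A minimal sketch of the manual route in Athena (table, column, bucket, and partition names are made up; this assumes the data sits under Hive-style prefixes such as year=2021/month=05/):

    -- Define the table by hand instead of letting a crawler infer the schema
    CREATE EXTERNAL TABLE price_history (
        item_id string,
        price   double
    )
    PARTITIONED BY (year string, month string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://my-example-bucket/price-history/';

    -- Either walk the whole prefix for Hive-style partitions (slow with many partitions and files) ...
    MSCK REPAIR TABLE price_history;

    -- ... or register only the partitions you need, which avoids scanning the whole bucket
    ALTER TABLE price_history ADD IF NOT EXISTS
    PARTITION (year = '2021', month = '05')
    LOCATION 's3://my-example-bucket/price-history/year=2021/month=05/';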
I have a data warehouse maintained in AWS Redshift. The data volume and velocity have both increased lately. One option is to keep scaling the cluster horizontally, at the expense of a higher cost of course. I was wondering if there are any archiving options available so that I can query the entire data as usual (maybe with a compromise in querying time) but with little or no additional cost?
One option would be to use external tables and query data directly from S3, but the tools used for achieving this, like Athena and Glue, have their own cost, charged on a per-query basis.
Easy options:
Ensure all tables have compression; check with SELECT * FROM svv_table_info (a sketch of these options follows the list).
Maximize compression by changing large tables to use ENCODE zstd.
Switch small tables < ~50k rows (depends) to DISTSTYLE ALL (yes this saves space!).
Switch from SSD-based nodes (dc2) to HDD nodes (ds2), which have 8x more storage space.
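A sketch of the easy options in SQL (table and column names are placeholders; older Redshift releases without ALTER COLUMN ... ENCODE require rebuilding the table via a deep copy instead):

    -- See which tables are uncompressed, their distribution style, and their size in MB
    SELECT "table", encoded, diststyle, size, tbl_rows
    FROM svv_table_info
    ORDER BY size DESC;

    -- Maximize compression on a large table's biggest column
    ALTER TABLE big_fact_table ALTER COLUMN payload ENCODE zstd;

    -- Replicate a small table to every node; for tiny tables this saves space
    -- (one slice per node instead of a minimum block on every slice) and speeds up joins
    ALTER TABLE small_dim_table ALTER DISTSTYLE ALL;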
Less easy options:
UNLOAD older data from Redshift to S3 and query using Redshift Spectrum.
Convert unloaded data to Parquet or ORC format using AWS Glue or AWS EMR and then query using Redshift Spectrum.
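A sketch of the UNLOAD route (table, bucket, and IAM role are placeholders; recent Redshift versions can write partitioned Parquet directly from UNLOAD, which can replace the separate Glue/EMR conversion step):

    -- Export rarely-queried historical rows straight to partitioned Parquet in S3
    -- (sale_year is assumed to be a column of the sales table)
    UNLOAD ('SELECT * FROM sales WHERE sale_date < ''2018-01-01''')
    TO 's3://my-example-bucket/sales-archive/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET
    PARTITION BY (sale_year);

    -- The archived rows can then be deleted from Redshift and queried through a
    -- Spectrum external table pointing at s3://my-example-bucket/sales-archive/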
Please experiment with Redshift Spectrum. Query performance is typically very good and gets even better if your data is in a columnar format (Parquet/ORC).
Data stored within Amazon Redshift will provide the highest performance.
However, if you have data that is less frequently accessed, you could export (UNLOAD) it to Amazon S3, preferably as compressed, partitioned data; storing it as Parquet or ORC would be even better!
You could then use Amazon Redshift Spectrum to Query External Data in Amazon S3. You can even join external data with Redshift data, so you could query historical information and current information in one query.
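For instance (a sketch only; it assumes a Spectrum external table named spectrum_schema.sales_archive has already been defined over the unloaded files, and the table and column names are placeholders):

    -- Combine current rows stored in Redshift with historical rows stored in S3
    SELECT month, SUM(amount) AS total
    FROM (
        SELECT DATE_TRUNC('month', sale_date) AS month, amount
        FROM sales_current
        UNION ALL
        SELECT DATE_TRUNC('month', sale_date) AS month, amount
        FROM spectrum_schema.sales_archive
    ) AS combined
    GROUP BY month
    ORDER BY month;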
Alternatively, you could use Amazon Athena to query the data directly from Amazon S3. This is similar to Redshift Spectrum, but does not require Redshift. Amazon Athena is based on Presto, so it is super-fast, especially if data is stored as compressed, partitioned, Parquet/ORC.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
Please note that Redshift Spectrum and Amazon Athena charge based upon the amount of data read from disk. Therefore, compressed, partitioned Parquet/ORC is both cheaper and faster.
My particular scenario: I expect to amass TBs or even PBs of JSON data entries which track price history for many items. New data will be written to the data store hundreds or even thousands of times per day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query it outside of Redshift or ML.
Question: How do I decide if I should store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores are supported with Redshift, but I did notice that Redshift Spectrum exists specifically for S3 data.
Firstly, DynamoDB is far more expensive than S3. S3 is only a storage solution, while DynamoDB is a full-fledged NoSQL database.
If you want to query using Redshift, you have to load the data into Redshift. Redshift is again an independent, full-fledged database (warehousing solution).
You can use Athena to query data directly from S3.
I am using Redshift Spectrum. I created an external table and uploaded a CSV data file to S3 with around 5.5 million records. If I fire a query on this external table, it takes ~15 seconds, whereas if I run the same query on Amazon Redshift, I get the same result in ~2 seconds. What could be the reason for this performance lag, given that AWS claims it to be a very high performance platform? Please suggest a solution to get the same performance using Spectrum.
For performance optimization, start by having a look at how your query is being executed.
Right now, you get the best performance if you don't have a single CSV file but multiple files. Typically, you get great performance if the number of files per query is at least about an order of magnitude larger than the number of nodes in your cluster.
In addition, if you use Parquet files you get the advantage of a columnar format on S3, rather than CSV, where the whole file must be read from S3 - and it decreases your cost as well.
You can use a script to convert the data to Parquet.
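If you prefer a SQL-only approach (a sketch only, assuming the CSV files are already registered as a table in the Glue Data Catalog; the table and bucket names are placeholders), an Athena CTAS statement can also rewrite the data as Parquet:

    -- Rewrite the CSV-backed table as Snappy-compressed Parquet in a new S3 location
    CREATE TABLE sales_parquet
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://my-example-bucket/sales-parquet/'
    ) AS
    SELECT * FROM sales_csv;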
A reply from the AWS forum follows:
I understand that you have the same query running on Redshift & Redshift Spectrum. However, the results are different: while one runs in 2 seconds, the other runs in around 15 seconds.
First of all, we must agree that Redshift and Spectrum are different services, designed differently for different purposes. Their internal structure varies a lot from each other: while Redshift relies on EBS storage, Spectrum works directly with S3. Redshift Spectrum's queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3.
Spectrum is also designed to deal with petabytes of structured and semi-structured data from files in Amazon S3, without having to load the data into Amazon Redshift tables, while Redshift offers you the ability to store data efficiently and in a highly-optimized manner by means of Distribution and Sort Keys.
AWS does not advertise Spectrum as a faster alternative to Redshift. We offer Amazon Redshift Spectrum as an add-on solution to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena).
In terms of query performance, unfortunately, we can't guarantee performance improvements, since the Redshift Spectrum layer produces query plans completely different from the ones produced by Redshift's database engine interpreter. This particular reason alone would be enough to discourage any query performance comparison between these services, as it is not fair to either of them.
Regarding your question about on-the-fly nodes: Spectrum adds them based on the demands of your queries, and Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing. There aren't any specific criteria that trigger this behavior; however, by following the best practices on how to improve query performance[1] and how to create data files for queries[2], you can potentially improve Spectrum's overall performance.
Lastly, I would like to point you to some interesting documentation that clarifies a bit more how to achieve better performance. Please see the references at the end!
I am somewhat late to answer this. As of Feb 2018, AWS supports Redshift Spectrum queries on files in columnar formats like Parquet, ORC, etc. In your case, you are storing the files as CSV. CSV is row-based, which results in pulling out the entire row for any field queried. I suggest you convert the files from CSV to Parquet format before querying. This will certainly result in much faster performance. Details from AWS: Amazon Redshift Spectrum
These results are to be expected. The whole reason for using Amazon Redshift is that it stores data in a highly-optimized manner to provide fast queries. Data stored outside of Redshift will not run anywhere near as fast.
The intention of Amazon Redshift Spectrum is to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena), but it makes no performance guarantees.