Amazon Redshift equality filter performance and sortkeys

Does Redshift efficiently (i.e., with something like a binary search) find the blocks of a table that is sorted on a column A for a query with an equality condition A = <val>?
As an example, let there be a table T with ~500m rows and ~50 fields, distributed and sorted on field A. Field A has high cardinality, so there are ~4.5m distinct A values, each with roughly the same number of rows in T: ~100 rows per value.
Assume a Redshift cluster with a single XL node.
Field A is not compressed. All other fields have some form of compression, as suggested by ANALYZE COMPRESSION. A compression ratio of 1:20 was reported compared to the uncompressed table.
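For concreteness, a rough sketch of the table described above might look like the following (the column names other than A, the exact types, and the encodings are assumptions, not taken from the original question):
-- Hypothetical sketch of table T: distributed and sorted on A, with A uncompressed.
CREATE TABLE t (
    a numeric(18,0) ENCODE raw,   -- sort/dist key, not compressed
    b numeric(18,0) ENCODE zstd,  -- assumed encoding
    c numeric(18,0) ENCODE zstd   -- assumed encoding
    -- ~47 further columns omitted
)
DISTKEY (a)
SORTKEY (a);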
Given a trivial query:
select avg(B),avg(C) from
(select B,C from T where A = <val>)
After VACUUM and ANALYZE the following explain plan is given:
XN Aggregate (cost=1.73..1.73 rows=1 width=8)
-> XN Seq Scan on T (cost=0.00..1.23 rows=99 width=8)
Filter: (A = <val>::numeric)
This query takes 39 seconds to complete.
The main question is: Is this the expected behavior of Redshift?
According to the documentation at Choosing the best sortkey:
"If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range."
In Choosing sort keys:
"Another optimization that depends on sorted data is the efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max values to rapidly skip over large numbers of blocks during table scans. For example, if a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98% of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned. For more information about these optimizations, see Choosing distribution keys."
Secondary questions:
What is the complexity of the aforementioned skipping scan on a sort key? Is it linear (O(n)) or some variant of binary search (O(log n))?
If a key is sorted - is skipping the only optimization available?
What would this "skipping" optimization look like in the explain plan?
Is the above explain the best one possible for this query?
What is the fastest result redshift can be expected to provide given this scenario?
Does vanilla ParAccel have different behavior in this use case?

This question is answered on the AWS forum: https://forums.aws.amazon.com/thread.jspa?threadID=137610

Related

Convert indexes to sortkeys Redshift

Do zonemaps exist only in memory? Or are they populated into memory from disk, where they are stored persistently? Are they stored along with the 1MB blocks, or in a separate place?
We are migrating from Oracle to Redshift, and there are a bunch of indexes to cater to reporting needs. The nearest equivalent of an index in Redshift is a sortkey. For a bunch of tables, the total number of columns across all the indexes is between 15 and 20 (some are composite indexes, some are single-column indexes). An interleaved key seems to be the best fit, but an interleaved sortkey cannot have more than 8 columns. And if I use a compound sortkey, it won't be effective since the queries might not filter on the prefix columns.
What's the general advice in such cases - which type of sort key to use? How do you convert many indexes from an RDBMS to sort keys in Redshift?
Are high-cardinality columns such as identity columns, dates and timestamps a poor fit for interleaved keys? Would the same be true for compound sortkeys? Are there any disadvantages of interleaved sortkeys to keep in mind?
You are asking the right questions, so let's take these one at a time. First, zonemaps are located on the leader node and stored on disk, while the table data is stored on the compute nodes; they are stored separately from each other. The zonemaps store the min and max values for every column for every 1MB block in a table. Whether or not a column is in your sortkey list, there will be zonemap data for its blocks. When a column shows up in a WHERE clause, Redshift will first compare the predicate to the zonemap data to decide whether a block is needed for the query. If a block is not needed, it won't be read from disk, resulting in significant performance improvements for very large tables. I call this "block rejection". A key point: this really only makes a difference on tables with tens of millions of rows and when there are selective WHERE predicates.
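One hedged way to see how much block rejection a query actually achieved is to look at the scan step in SVL_QUERY_SUMMARY (the same view referenced later in this thread); rows_pre_filter tells you how many rows were sitting in the blocks that could not be rejected:
-- Inspect block rejection for a given query id (<query_id> is a placeholder).
SELECT query, seg, step,
       is_rrscan,         -- 't' when a range-restricted (zone-map) scan was used
       rows_pre_filter,   -- rows read from the blocks that were not rejected
       rows               -- rows remaining after the WHERE filter
FROM svl_query_summary
WHERE query = <query_id>
ORDER BY seg, step;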
So you have a number of reports, each of which looks at the data by different aspects - common. You want all of these to work well, right? Now, the first thing to note is that each table has its own sortkeys; they aren't linked. What is important is how the choice of sortkeys affects the min and max values in the zonemaps for the columns you will use in WHERE clauses. With compound sortkeys you have to think about what impact later keys will have on the composition of the blocks - not much after the 3rd or 4th key. This is greatly affected by the ordinality (i.e., cardinality) of the data, but you get the idea. The good news is that sorting on one column will affect the zonemaps of all the columns, so you don't always have to have a column in the sortkey list to get the benefit.
The question of compound vs. interleaved sortkeys is a complicated one, but remember that you want to get high levels of block rejection as often as possible (and on the biggest tables). When different queries have different WHERE predicates, it can be tricky to find a mix of sortkeys that makes this happen. In general, compound sortkeys are easier to understand and have fewer table-maintenance implications. You can inspect the zonemaps, see what impact your sortkey choices are having, and make informed decisions on how to adjust. If there are columns with low ordinality, put those first so that the next sortkeys can still affect the overall row order and therefore create blocks with different value ranges for these later keys. For these reasons I prefer compound keys over interleaved, but there are cases where things will improve with interleaved keys. When all the columns have high ordinality and they are all equally important, interleaved may be the right answer. I usually learn so much about the data while trying to optimize compound keys that even if I end up with interleaved keys, I can make smart choices about which columns I want in the sortkeys.
Just some metrics to help in your choice. Redshift can store 200,000 row elements in a single block, and I've seen columns with over 2M elements per block. Blocks are distributed across the cluster, so you need a lot of rows to fill up enough blocks that rejecting a high percentage of them is even possible. If you have a table of 5 million rows and you are sweating the sortkeys, you are into the weeds. (Yes, sorting can affect other aspects of the query, like joining, but these are sub-second improvements, not make-or-break performance impacts.) Compression can have a huge impact on the number of row elements per block and therefore on how many rows are represented by an entry in the zonemap. This can increase block rejection but will increase the amount of data that has to be read to scan the entire table - a tradeoff you will want to make sure you are winning (1 query getting faster while 10 get slower is likely not a good tradeoff).
Your question about ordinality is a good one. If I sort by a high-ordinality column first in a compound sortkey list, this will set the overall order of the rows, potentially making all other sortkeys impotent. However, if I sort by a low-ordinality column first, then there is a lot of power left for other sortkeys to change the order of the rows and therefore the zonemap contents. For example, suppose I have Col_A with only 100 unique values and Col_B, which is a timestamp with 1-microsecond resolution. If I sort by Col_B first, all the rows are likely ordered just by sorting on this column. But if I sort by Col_A first, there are lots of rows with the same value, and the later sortkey (Col_B) can order these rows. Interleaved works the same way, except that which column is "first" changes by region of the table. If I interleave-sort based on the same Col_A and Col_B above (just 2 sortkeys), then half the table will be sorted by Col_A first and half by Col_B first. For this example Col_A will be useless half of the time - not the best answer. Interleaved sorting just modifies which column is used as the first sortkey throughout the table (and second and third, if more keys are used). High ordinality in a sort key makes later sortkeys less powerful, and this is independent of sort style - it's just that interleaving changes which columns are early and which are late by region of the table.
Because the ordinality of sortkeys can be such an important factor in gaining block rejection across many WHERE predicates, it is common to add derived columns to tables to hold lower-ordinality versions of other columns. In the example above I might add Col_B2 to the table and have it hold just the year and month (month-truncated date) of Col_B. I would use Col_B2 in my sortkey list, but my queries would still reference Col_B. It "roughly" sorts based on Col_B so that Col_A can still have some sorting power if it were to come later in the sortkey list. This is a common reason for making data model changes when moving to Redshift.
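A hedged sketch of that derived-column pattern, reusing the Col_A/Col_B names from the example (the table and staging names are hypothetical):
-- col_b2 holds a month-truncated copy of col_b and leads the sortkey, so
-- col_a still has sorting power as the second key.
CREATE TABLE fact_example (
    col_a  integer,        -- ~100 unique values
    col_b  timestamp,      -- microsecond-resolution timestamp used in queries
    col_b2 date            -- derived: month truncation of col_b
)
COMPOUND SORTKEY (col_b2, col_a);

-- Populate the derived column at load time, for example:
INSERT INTO fact_example (col_a, col_b, col_b2)
SELECT col_a, col_b, DATE_TRUNC('month', col_b)::date
FROM staging_example;      -- hypothetical staging table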
It is also critical that "block rejecting" WHERE clauses are written against fact table columns, not applied to a dimension table column after the join. Zonemap information is read BEFORE the query starts to execute and is done on the leader node - it can't see through joins. Another common data model change is to denormalize some key information into the fact tables so that these common WHERE predicates can be applied to the fact table and the zonemaps come back into play.
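For example, a hedged before/after sketch of that denormalization (all table and column names here are hypothetical):
-- Before: the selective predicate is on the dimension, so the fact table's
-- zonemaps cannot reject any blocks; the filter only applies after the join.
SELECT SUM(f.amount)
FROM fact_sales f
JOIN dim_store d ON d.store_id = f.store_id
WHERE d.region = 'WEST';

-- After: region is denormalized onto the fact table (and ideally in its
-- sortkey), so blocks can be rejected before any joining happens.
SELECT SUM(f.amount)
FROM fact_sales f
WHERE f.region = 'WEST';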
Sorry for the tome, but this is a deep topic which I've spent years optimizing. I hope this is of use to you; reach out if anything isn't clear (and I hope you have the DISTKEYs sorted out already :) ).

Redshift Query taking too much time

In Redshift, the queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift, and it is difficult to understand the query plan in order to optimise the query.
Sharing one of the queries that we run, along with the Query Plan.
The query is taking 20 seconds to execute.
Query
SELECT
    date_trunc('day', ti) AS date,
    count(distinct deviceID) AS COUNT
FROM
    live_events
WHERE
    brandID = 3927
    AND ti >= '2017-08-02T00:00:00+00:00'
    AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
    1
Primary key
brandID
Interleaved Sort Keys
We have set the following columns as interleaved sort keys -
brandID, ti, event_name
QUERY PLAN
You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here are some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandID, ti or ti, brandID. It will be much more efficient.
A SORTKEY is typically a date column, since dates are often found in WHERE clauses and the table stays effectively sorted if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTSTYLE EVEN so that all slices participate in the query. (It is also the default distribution style if no specific DISTKEY is selected.) Alternatively, set the DISTKEY to a field not shown -- but certainly don't use brandID as the DISTKEY, otherwise only one slice will participate in the query shown.
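Putting the SORTKEY and DISTKEY advice together, a hedged sketch of how the table could be declared (the column list is illustrative, not the real definition of live_events):
-- Illustrative redefinition with a compound sortkey and EVEN distribution.
CREATE TABLE live_events_v2 (
    brandid    integer,
    ti         timestamp,
    deviceid   varchar(64),
    event_name varchar(128)
)
DISTSTYLE EVEN
COMPOUND SORTKEY (brandid, ti);

-- Reload from the existing table, then VACUUM/ANALYZE and swap names:
INSERT INTO live_events_v2 (brandid, ti, deviceid, event_name)
SELECT brandid, ti, deviceid, event_name FROM live_events;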
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data has changed enough to make other settings more efficient.
Sometimes the issue can be due to locks acquired by other processes. See: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/
I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury you should store the date in the format you need in an additional column.
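A hedged sketch of that change (the new column name is hypothetical):
-- Store the day-truncated timestamp once, instead of computing it per query.
ALTER TABLE live_events ADD COLUMN ti_date date;
UPDATE live_events SET ti_date = DATE_TRUNC('day', ti)::date;

-- The query can then group on the precomputed column:
SELECT ti_date AS date, COUNT(DISTINCT deviceID) AS COUNT
FROM live_events
WHERE brandID = 3927
  AND ti >= '2017-08-02T00:00:00+00:00'
  AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY 1;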

Does Redshift optimize inter-block search or just scan the whole block?

I created two tables with 43,547,563 rows each:
CREATE TABLE metrics_compressed (
some_id bigint ENCODE ZSTD,
some_value varchar(200) ENCODE ZSTD distkey,
...,
some_timestamp bigint ENCODE ZSTD,
...,
PRIMARY KEY (some_id, some_timestamp, some_value)
)
sortkey (some_id, some_timestamp);
The second one is exactly like the first one but without any column compressed.
Running this query (it just counts one row):
select count(*)
from metrics_compressed
where some_id = 20906
and some_timestamp = 1475679898584;
shows a table scan of 42,394,071 rows (from the rows_pre_filter column in svl_query_summary, with is_rrscan true), while running it over the uncompressed table scans 3,143,856 rows. I guess the reason for this is that the compressed one uses fewer 1MB blocks, hence the scan reports the total number of rows from the retrieved blocks.
Are the scanned rows a sign of bad performance? Or does Redshift use some kind of binary search within a block for simple queries like this one, and the scanned-rows figure is just confusing info for optimizing queries?
In general, you should let Amazon Redshift determine its own compression types. It does this by loading 100,000 rows and determining the optimal compression type to use for each column based on this sample data. It then drops those rows and restarts the load. This happens automatically when a table is first loaded if there is no compression type specified on the columns.
The SORTKEY is more important for fast queries than compression, because it allows Redshift to totally skip over blocks that do not contain desired data. In your example, using some_id within the WHERE clause allows it to only look at blocks containing that specific value and since it is also the SORTKEY this will be extremely efficient.
Once a block is identified as potentially containing the SORTKEY data, Redshift will read the block from disk and process the contents.
The general rule is to use DISTKEY for columns most used in JOIN and use SORTKEY for columns most used in WHERE statements (but there are also more subtle variations on those general rules).
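To see the effect directly, one hedged approach is to count the 1MB blocks each column occupies in the two tables (the name of the uncompressed table is assumed here, since the question doesn't give it):
-- Compare blocks and row elements per column for the two tables.
SELECT p.name AS table_name,
       b.col,
       COUNT(*)          AS blocks,
       SUM(b.num_values) AS row_elements
FROM stv_blocklist b
JOIN stv_tbl_perm p ON b.tbl = p.id AND b.slice = p.slice
WHERE p.name IN ('metrics_compressed', 'metrics_uncompressed')  -- second name assumed
GROUP BY 1, 2
ORDER BY 1, 2;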

Dist and Sort Keys Redshift

I'm trying to add dist and sort keys to some of the tables in Redshift.
I notice that before adding them the size of a table is 0.50, and after adding them it increases to 0.51 or 0.52. Is this possible? I thought the whole purpose of having dist and sort keys is to decrease the size of the table and help increase read/write performance.
That is not the purpose of having a DISTKEY and SORTKEY.
To decrease the storage size of a table, use compression.
The DISTKEY is used to distribute data amongst slices. By co-locating information on the same slice, queries can run faster. For example, if you had these tables:
customer table, DISTKEY = customer_id
invoices table, DISTKEY = customer_id
...then these tables would be distributed in the same manner. All records in both tables for a given customer_id would be located on the same slice, thereby avoiding the need to transfer data between slices during a join. The DISTKEY should be the column that is most used in JOINs.
The SORTKEY is used to sort data on disk, for the benefit of Zone Maps. Each storage block on disk is 1MB in size and contains data for only one column in one table. The data for this column is sorted, then stored in multiple blocks. The Zone Map associated with each block identifies the minimum and maximum values stored within that block. Then, when a query is run with a WHERE statement, Amazon Redshift only needs to read the blocks that contain the desired range of data. By skipping over blocks that do not contain data within the WHERE clause, Redshift can run queries much faster.
The above can all work together. For example, compressed data requires fewer blocks, which also allows Redshift to skip over more data based on the Zone Maps. To get the best possible performance out of queries, use DISTKEY, SORTKEY and compression together.
(It is often recommended not to compress the first SORTKEY column, because compression causes too many rows to be loaded from a single block.)
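A hedged sketch of how those pieces fit together for the invoices example above (the columns beyond customer_id are hypothetical):
-- DISTKEY for join co-location, compression on most columns, and the leading
-- SORTKEY column left unencoded (ENCODE raw).
CREATE TABLE invoices (
    invoice_id   bigint         ENCODE zstd,
    customer_id  bigint         ENCODE zstd,  -- DISTKEY, co-located with customer
    invoice_date date           ENCODE raw,   -- first sortkey column, uncompressed
    amount       numeric(12,2)  ENCODE zstd
)
DISTKEY (customer_id)
COMPOUND SORTKEY (invoice_date, customer_id);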
See also: Top 10 Performance Tuning Techniques for Amazon Redshift

Redshift performance: encoding on join column

Would encoding on a join column hurt query performance? I let the COPY command decide the encoding type.
In general, no - an encoding on your DISTKEY can even have a positive impact due to the reduction in disk I/O.
According to the AWS table design playbook, there are a few edge cases where an encoding on your DISTKEY will indeed hurt your query performance:
Your query patterns apply range-restricted scans to a column that is very well compressed.
The well-compressed column's blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don't compress well. These columns are > 10x the size of the well-compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.
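For a built-in alternative, ANALYZE COMPRESSION (already mentioned in the first question above) samples an existing table and reports a suggested encoding per column; the table name here is a placeholder:
-- Report suggested encodings based on a sample of the table's data.
ANALYZE COMPRESSION my_table;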
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn't have to read all table data to evaluate a query. Data stored by column should also be encoded, which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn't require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So, to answer your question: it will not hurt query performance, but running without column encoding is not a best practice.
There are a couple of details on this from AWS respondents:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
The DISTKEY can be compressed, but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound), the other sort key columns can be compressed.
Also, it is generally recommended to use a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables, try using the same dist and sort keys on both tables so Redshift can use a faster merge join.
Based on this, I think that as long as both sides of the join have the same compression, Redshift will join on the compressed values safely.
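A hedged illustration of the last point about merge joins (all table and column names here are hypothetical):
-- Both tables distributed and sorted on the join column; the leading sortkey
-- column is left uncompressed per the guidance above.
CREATE TABLE orders (
    order_id    bigint        ENCODE raw,
    customer_id bigint        ENCODE zstd,
    total       numeric(12,2) ENCODE zstd
)
DISTKEY (order_id)
COMPOUND SORTKEY (order_id);

CREATE TABLE order_items (
    order_id bigint      ENCODE raw,
    sku      varchar(64) ENCODE zstd,
    qty      integer     ENCODE zstd
)
DISTKEY (order_id)
COMPOUND SORTKEY (order_id);

-- If both tables are fully sorted (vacuumed), a join like this is a candidate
-- for a merge join instead of a hash join:
SELECT o.order_id, SUM(i.qty)
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.order_id;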