Compound Sort Key vs. Sort Key - amazon-web-services

Let me ask other question about redshift sortkey.
We're planning to set the sortkey with the columns frequently used in WHERE statement.
So far, the best combination for our system seems to be:
DISTSTYLE EVEN + COMPOUND SORTKEY + COMPRESSED Column (except for First SortKey column)
Just wondering which can be more better, simple SORTKEY or COMPOUND SORTKEY for our BI tables which can have diversified queries according to users' analysis.
For example, we set the compound sortkey according to frequency in several queries' WHERE statement as follows.
COMPOUND SORTKEY
(
PURCHASE_DATE <-- set as first sort key since it's date column.
STORE_ID,
CUTOMER_ID,
PRODUCT_ID
)
But sometimes it can be queried only 'PRODUCT ID' in actual queries, not with other listed sort keys, nor queried different from COMPOUND KEY order.
In that case, may I ask 'COMPOUND SORTKEY' can be useless or simple SORT KEY can be more effective ...?
I'd be so grateful if you would tell me about your idea and experiences.

The simple rules for Amazon Redshift are:
Use DISTKEY on the column that is most frequently used with JOIN
Use SORTKEY on the column(s) that is most frequently used with WHERE
You are correct that the above compound sort key would only be used if PURCHASE_DATE is included in the WHERE.
An alternative is to use Interleaved Sort Keys, which give equal weighting to many columns and can be used where different fields are often used in the WHERE. However, Interleaved Sort Keys are much slower to VACUUM and are rarely worth using.
So, aim to use SORTKEY on most of your queries, but don't worry too much about the other queries unless you are having some particular performance problems.
See: Redshift Sort Keys - Choosing Best Sort Style | Hevo Blog

Your compound sort key looks sensible to me. It's important to understand that Redshift sort keys are not an index which is used or not used. The sort key is used to physically arrange the data on disk.
The query optimizer "uses" the sort key by looking at the "zone map" (min and max values) for each block during query execution. This happens for all columns regardless of whether they are in the sort key.
Secondary columns in a compound sort key can still be very effective at reducing the data that has to be scanned from disk, especially when the column values are low cardinality.
See this previous example for a query to check on sort key effectiveness: Is my sort key being used?
Please review our guide for designing tables effectively: "Amazon Redshift Engineering’s Advanced Table Design Playbook". The guide discusses the correct use of Interleaved sort keys but note that they should only be used in very specific circumstances.

Related

Convert indexes to sortkeys Redshift

Do zonemaps exists only in memory? Or its populated in memory from disk where its stored persistently? Is it stored along with the 1MB block, or in a separate place?
We are migrating from oracle to redshift, there are bunch of indexes to cater to reporting needs. The nearest equivalent of index in Redshift is sortkeys. For bunch of tables, the total number of cols of all the indexes are between 15-20 (some are composite indexes, some are single col indexes). Interleaved keys seems to be best fit, but there cannot be more than 8 cols in an interleaved sortkey. But if I use compound sortkey, it wont be effective since the queries might not have prefix colums.
Whats the general advice in such cases - which type of sort key to use? How to convert many indexes from rdbms to sort keys in redshift?
Are high cardinality cols such as identity cols, dates and timestamps not good fit with interleaved keys? Would it be same with compound sortkeys? Any disadvanatges with interleaved sortkeys to keep in consideration?
You are asking the right questions so let's take these down one at a time. First, zonemaps are located on the leader node and stored on disk and the table data is stored on the compute nodes. They are located separate from each other. The zonemaps store the min and max values for every column for every 1MB block in a table. No matter if a column is in your sortkey list or not, there will be zonemap data for the block. When a column shows up in a WHERE clause Redshift will first compare to the zonemap data to decide if the block is needed for the query. If a block is not needed it won't be read from disk resulting in significant performance improvements for very large tables. I call this "block rejection". A few key points - This really only makes a difference on tables will 10s of millions of rows and when there are selective WHERE predicates.
So you have a number of reports each of which looks at the data by different aspects - common. You want all of these to work well, right? Now the first thing to note is that each table can have it's own sortkeys, they aren't linked. What is important is how does the choice of sortkeys affect the min and max values in the zonemaps for the columns you will use as WHERE clauses. With composite sortkeys you have to think about what impact later keys will have on the composition of the block - not much after the 3rd or 4th key. This is greatly impacted by the ordinality of the data but you get the idea. The good news is that sorting on one column will impact the zonemaps of all the columns so you don't always have to have a column in the sortkey list to get the benefit.
The question of compound vs interleaved sortkeys is a complicated one but remember you want to get high levels of block rejection as often as possible (and on the biggest tables). When different queries have different WHERE predicates it can be tricky to get a good mix of sortkeys to make this happen. In general compound sortkeys are easier to understand and have less table maintenance implications. You can inspect the zonemaps and see what impacts your sortkey choices are having and make informed decisions on how to adjust. If there are columns with low ordinality put those first so that the next sortkeys can have impact on the overall row order and therefore make block with different value ranges for these later keys. For these reasons I like compound keys over interleaved but there are cases where things will improve with interleaved keys. When you have high ordinality for all the columns and they are all equally important interleaved may be the right answer. I usually learn about the data trying to optimize compound keys that even if I end up with interleaved keys I can make smart choices about what columns I want in the sortkeys.
Just some metrics to help in you choice. Redshift can store 200,000 row elements in a single block and I've seen columns with over 2M elements per block. Blocks are distributed across the cluster so you need a lot of rows to fill up enough blocks that rejecting a high percentage of them is even possible. If you have a table of 5 million rows and you are sweating the sortkeys you are into the weeds. (Yes sorting can impact other aspects of the query like joining but these are sub-second improvements not make or break performance impacts.) Compression can have a huge impact on the number of row elements per block and therefore how many rows are represented in an entry in the zonemap. This can increase block rejection but will increase the read data needed to scan the entire table - a tradeoff you will want to make sure you are winning (1 query gets faster by 10 get slower is likely not a good tradeoff).
Your question about ordinality is a good one. If I sort my a high ordinality column first in a compound sortkey list this will set the overall order of the rows potentially making all other sortkeys impotent. However if I sort by a low ordinality column first then there is a lot of power left for other sortkeys to change the order of the rows and therefore the zonemap contents. For example if I have Col_A with only 100 unique values and Col_B which is a timestamp with 1microsecond resolution. If I sort by Col_B first all the rows are likely order just by sorting on this column. But if I sort by Col_A first there are lots of rows with the same value and the later sortkey (Col_B) can order these rows. Interleaved works the same way except which column is "first" changes by region of the table. If I interleave sort base on the same Col_A and Col_B above (just 2 sortkeys), then half the table will be sorted by Col_A first and half by Col_B first. For this example Col_A will be useless half of the time - not the best answer. Interleave sorting just modifies which column is use as the first sortkey throughout the table (and second and third if more keys are used). High ordinality in a sort key makes later sortkeys less powerful and this independent of sort style - it's just the interleave changes up which columns are early and which are late by region of the table.
Because ordinality of sortkeys can be such an important factor in gaining block rejection across many WHERE predicates that it is common to add derived columns to tables to hold lower ordinality versions of other columns. In the example above I might add Col_B2 to the table and have if just hold the year and month (month truncated date) of Col_B. I would use Col_B2 in my sortkey list but my queries would still be referencing Col_B. It "roughly" sorts based on Col_B so that Col_A can have some sorting power if it was to come later in the sortkey list. This is a common reason for making data model changes when moving Redshift.
It is also critical that "block rejecting" WHERE clauses on written against the fact table column, not applied to a dimension table column after the join. Zonemap information is read BEFORE the query starts to execute and is done on the leader node - it can't see through joins. Another data model change is to denormalize some key information into the fact tables so these common where predicates can be applied to the fact table and zonemaps will be back in play.
Sorry for the tome but this is a deep topic which I've spent year optimizing. I hope this is of use to you and reach out if anything isn't clear (and I hope you have the DISTKEYS sorted out already :) ).

Redshift DISTSTYLE KEY. Deciding whats the best column to define as KEY

Well I recently got into this area of Redshift, trying to optimize disk usage and performance of my database, and having read lots of information on AWS about the topic, I still have some doubts.
First of all, to my database structure. Per schema, I have 3 master tables, with 3 different IDs, these are now DISTSTLYE ALL tables, being small in size.
Each master table has different amounts of IDs,
the date table --> largest one (#1 most joined)
the store table --> medium one (#3 most joined)
the item table --> smallest one (#2 most joined)
Then I have my core table, which has needed combinations of these IDs to display additional information about them. Anyway, this table should be a DISTSTYLE KEY type, based on my knowledge. Well, which of the 3 IDs should I select to be my DIST KEY?
Whats the criteria for this decision? I understand that for joins I need to look at the Sort Key, well that has been understood and defined to the ID_date, because its the most joined table. So now, what about the distribution per node of this table?
I'm sorry if I'm rambling, I dont want to leave any information out. If I have, feel free to ask! Thanks for taking the time to read!
You'll find the best advice on Amazon Redshift best practices for designing tables. It goes into quite a bit of detail.
However, my rule of thumb is:
The DISTKEY should be the column most used in JOINs between tables
The SORTKEY should be the column most used in WHERE statements
Use DISTSTYLE ALL for small lookup tables

How to Hash an Entire Redshift Table?

I want to hash entire redshift tables in order to check for consistency after upgrades, backups, and other modifications which shouldn't affect table data.
I've found Hashing Tables to Ensure Consistency in Postgres, Redshift and MySQL but the solution still requires spelling out each column name and type so it can't be applied new tables in a generic manner. I'd have to manually change column names and types.
Is there some other function or method by which I could hash / checksum entire tables in order to confirm they are identical? Ideally without spelling out the specific column and column types of that table.
There is certainly no in-built capability in Redshift to hash whole tables.
Also, I'd be a little careful of the method suggested in that article because, from what I can see, it is calculating a hash of all the values in a column but isn't associating the hashed value with a row identifier. Therefore if Row 1 and Row 2 swapped values in a column, the hash wouldn't change. So, it's not strictly calculating an adequate hash (but I could be wrong!).
You could investigate using the new Stored Procedures in Redshift to see whether you can create a generic function that would work for any table.

AWS Redshift : DISTKEY / SORTKEY columns should be compressed?

Let me ask something about column compression on AWS Redshift.
Now we're verifying what can be made better performance using appropriate diststyle, sortkeys and column compression.
If my understanding is correct, the column compression can help to reduce IO cost. I tried "analyze compression table_name;". And mostly Redshift suggests to use 'zstd' or 'lzo' as compression method for our columns.
In general speaking, may I ask the columns set as DISTKEY/SORTKEY should be also compressed like other columns?
I'm totally new to Redshift and any advice would be appreciated.
Sincerly.
DISTKEY can be compressed but the first SORTKEY column should be uncompressed (ENCODE raw). If you have multiple sort keys (compound) the other sort key columns can be compressed.
Also, generally recommend using a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables try using the same dist and sort keys on both tables so Redshift can use a faster merge join.

Redshift performance: encoding on join column

would encoding on join column corrupts the query performance ? I let the "COPY command" to decide the encoding type.
In gernal no - since an encoding on your DIST KEY will even have a positive impact due to the reduction disk I/O.
According to the AWS table design playbook There are a few edge case were indeed an encoding on your DIST KEY will corrupt your query performance:
Your query patterns apply range restricted scans to a column that is
very well compressed.
The well compressed column’s blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don’t compress well. These columns are > 10x the size of the well
compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded , which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So your question it will not corrupt the query performance but not a best practice.
There are a couple of details on this by AWS respondants:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
DISTKEY can be compressed but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound) the other sort key columns can be compressed.
Also, generally recommend using a commonly filtered date/timestamp column,
(if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables try using the same dist
and sort keys on both tables so Redshift can use a faster merge join.
Based on this, i think as long as both sides of the join have the same compression, i think redshift will join on the compressed value safely.