Dist and Sort Keys Redshift - amazon-web-services

I'm trying to add dist and sort keys to some of the tables in redshift.
I notice that before adding the keys the size of the table is 0.50, and after adding them it increases to 0.51 or 0.52. Is this possible? I thought the whole purpose of having dist and sort keys is to decrease the size of the table and help increase read/write performance.

That is not the purpose of having a DISTKEY and SORTKEY.
To decrease the storage size of a table, use compression.
The DISTKEY is used to distribute data amongst slices. By co-locating information on the same slice, queries can run faster. For example, if you had these tables:
customer table, DISTKEY = customer_id
invoices table, DISTKEY = customer_id
...then these tables would be distributed in the same manner. All records in both tables for a given customer_id would be located on the same slice, thereby avoiding the need to transfer data between slices. The DISTKEY should be the column that is most often used in JOINs.
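As a rough illustration of the co-location idea, here is a minimal Python sketch. It is a toy model under stated assumptions, not Redshift's actual implementation: the slice count, the sample rows, and the use of CRC32 as a stand-in for Redshift's internal hash are all hypothetical.

```python
import zlib

NUM_SLICES = 4  # hypothetical cluster size


def slice_for(dist_key_value, num_slices=NUM_SLICES):
    """Map a DISTKEY value to a slice.

    CRC32 is only a stand-in here; Redshift's real hash function is internal.
    """
    return zlib.crc32(str(dist_key_value).encode()) % num_slices


# Hypothetical rows: (customer_id, name) and (invoice_id, customer_id, amount).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
invoices = [(101, 1, 50.0), (102, 2, 75.0), (103, 1, 20.0), (104, 3, 10.0)]

# Both tables are distributed on customer_id, so the same hash picks the slice.
customer_slices = {cid: slice_for(cid) for cid, _ in customers}
invoice_slices = {inv_id: slice_for(cid) for inv_id, cid, _ in invoices}

# Every invoice lands on the same slice as its customer, so a join on
# customer_id never needs rows from another slice.
colocated = all(
    invoice_slices[inv_id] == customer_slices[cid] for inv_id, cid, _ in invoices
)
print(colocated)
```

Because both tables run the same function over the same column, matching rows always share a slice regardless of how many slices there are.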
The SORTKEY is used to sort data on disk, for the benefit of Zone Maps. Each storage block on disk is 1MB in size and contains data for only one column in one table. The data for this column is sorted, then stored in multiple blocks. The Zone Map associated with each block identifies the minimum and maximum values stored within that block. Then, when a query is run with a WHERE clause, Amazon Redshift only needs to read the blocks that contain the desired range of data. By skipping over blocks that cannot contain data matching the WHERE clause, Redshift can run queries much faster.
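The effect of Zone Maps can be sketched with a small Python simulation. This is a toy model, not Redshift's storage engine: the block size of 100 rows, the value range, and the helper names (Block, make_blocks, blocks_to_read) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    values: list

    @property
    def zone_map(self):
        # Minimum and maximum value stored in this block.
        return (min(self.values), max(self.values))


def make_blocks(values, rows_per_block):
    """Split a column into fixed-size blocks, in the order given."""
    return [
        Block(values[i:i + rows_per_block])
        for i in range(0, len(values), rows_per_block)
    ]


def blocks_to_read(blocks, lo, hi):
    """Keep only blocks whose min/max range overlaps the filter range [lo, hi]."""
    return [b for b in blocks if b.zone_map[1] >= lo and b.zone_map[0] <= hi]


# 1000 distinct values arriving in a scrambled order (fixed stride, so the
# example is deterministic), versus the same values sorted on disk.
arrival_order = [(i * 37) % 1000 for i in range(1000)]
unsorted_blocks = make_blocks(arrival_order, 100)
sorted_blocks = make_blocks(sorted(arrival_order), 100)

# WHERE col BETWEEN 250 AND 260
print(len(blocks_to_read(sorted_blocks, 250, 260)))
print(len(blocks_to_read(unsorted_blocks, 250, 260)))
```

With sorted data only one block overlaps the filter range. With scrambled data every block's min/max range spans nearly the whole column, so no block can be skipped.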
The above can all work together. For example, compressed data requires fewer blocks, which also allows Redshift to skip over more data based on the Zone Maps. To get the best possible performance out of queries, use DISTKEY, SORTKEY and compression together.
(It is often recommended not to compress the SORTKEY column because it causes too many rows to be loaded from a single block.)
See also: Top 10 Performance Tuning Techniques for Amazon Redshift

Related

RedShift Deep Copy Without INSERT_XID (Hidden metadata) Column Data

I have a very long, narrow table in AWS Redshift. It has fallen victim to the issue of the hidden metadata column, INSERT_XID, being hugely disproportionate in size compared to the table.
Picture a table of 632K rows that has 22GB of visible data in it and a hidden column with 83GB.
I want to reclaim that space, but VACUUM has no effect on it.
I tried copying the table:
BEGIN;
CREATE TABLE test.copied (like prod.table);
INSERT INTO test.copied (select * from prod.table);
COMMIT;
This results in a true deep copy where the hidden meta data column is still very large. I was hoping that a copy of the table in one go into a new one would allow the hidden INSERT_XID column to compress, but it failed to do so.
Any ideas how I can optimize this hidden column in AWS Redshift?
I measured the size of each column with the following:
SELECT col, attname, COUNT(*) AS "mbs"
FROM stv_blocklist bl
JOIN stv_tbl_perm perm
ON bl.tbl = perm.id AND bl.slice = perm.slice
LEFT JOIN pg_attribute attr ON
attr.attrelid = bl.tbl
AND attr.attnum-1 = bl.col
WHERE perm.name = 'table_name'
GROUP BY col, attname
ORDER BY col;
Update:
I also tried an UNLOAD of this table into S3 and then a single COPY back into a new table, and the size of the hidden column was unchanged. I'm not sure if this is even resolvable.
Thank you!
I did some math on the numbers you provided and I think you may be running into 1MB block size quanta effects. However, the math still doesn't work out.
Redshift stores your data around the cluster per the table's distribution style. For non-diststyle-all tables this means that each column has rows on each slice of the cluster. The minimum storage size on Redshift, a block, is 1MB in size. When you have a small (for Redshift) number of rows in your table, there isn't enough data on each slice to fill up one block, so there is a lot of wasted space on disk.
If you have a table of, say, 2 columns which has 630K rows and you are working on a cluster that has 1024 slices (like 32 nodes of dc2.8xl), then these effects can be quite pronounced. Each slice has only 615 rows (on average), nowhere close to filling up a 1MB block. So the non-metadata portion of this table will take up 2 x 1024 x 1MB = 2,048MB (about 2GB). As you can see, even in this case, I can only get to one tenth of what you are showing.
I could rerun this with 20 columns instead of 2 and I would get up to your 22gb figure but then the size of the metadata columns wouldn't make a whole lot of sense - they aren't that inefficient. It is possible that I'm not looking at configurations like what you have - 4000 slices? 8 columns?
22gb of space is 22,000 blocks spread across the slices and columns of your cluster / table. Knowing your column count and cluster configuration will greatly help in understanding how the data is being stored.
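The arithmetic above can be reproduced with a few lines of Python. This is only a lower-bound sketch that assumes one 1MB block per column per slice; the function name min_footprint_mb is made up for illustration, and the figures are the ones used in this answer, not measurements from your cluster.

```python
BLOCK_MB = 1  # minimum allocation unit: one 1MB block


def min_footprint_mb(columns, slices, block_mb=BLOCK_MB):
    """Lower bound on storage: at least one block per column on every slice
    that holds rows, however few rows that slice has."""
    return columns * slices * block_mb


rows, slices = 630_000, 1024

print(rows // slices)                # average rows per slice: 615
print(min_footprint_mb(2, slices))   # 2 columns  -> 2048 MB, about 2GB
print(min_footprint_mb(20, slices))  # 20 columns -> 20480 MB, about 20GB
```

Plugging in your real column count and slice count should show whether block-size effects alone can explain the numbers you are seeing.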
Recommendation - move this table to DISTSTYLE ALL and you will save greatly in storage space. 600K rows is tiny for Redshift and spreading the data across all the slices is just inefficient. Be advised that DISTSTYLE ALL has query compilation implications - mostly positive but not all, so monitor your query performance if you make this change.

Redshift Query taking too much time

In Redshift, the queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift and it is getting difficult to understand the Query plan to optimise the query.
Sharing one of the queries that we run, along with the Query Plan.
The query is taking 20 seconds to execute.
Query
SELECT
date_trunc('day',
ti) as date,
count(distinct deviceID) AS COUNT
FROM
live_events
WHERE
brandID = 3927
AND ti >= '2017-08-02T00:00:00+00:00'
AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
1
Primary key
brandID
Interleaved Sort Keys
we have set following columns as interleaved sort keys -
brandID, ti, event_name
QUERY PLAN
You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here are some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows do not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.
SORTKEYs are typically a date column, since they are often found in a WHERE clause and the table will be automatically sorted if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTSTYLE EVEN so that all slices participate in the query. (EVEN distribution is also what you get by default if no DISTKEY is selected, although newer Redshift releases default to DISTSTYLE AUTO.) Alternatively, set the DISTKEY to a field not shown, but certainly don't use brandId as the DISTKEY, otherwise only one slice will participate in the query shown.
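To see why distributing on brandId would hurt this particular query, here is a small Python sketch. It is a toy model under assumptions: the slice count, the generated rows, and CRC32 as a stand-in for Redshift's internal hash are all hypothetical, and EVEN distribution is modelled as simple round-robin.

```python
import zlib
from collections import Counter

NUM_SLICES = 4  # hypothetical slice count


def slice_by_key(value, num_slices=NUM_SLICES):
    """Key distribution: CRC32 stands in for Redshift's internal hash."""
    return zlib.crc32(str(value).encode()) % num_slices


# Hypothetical events: every third row belongs to brand 3927.
rows = [
    {"brandID": 3927 if i % 3 == 0 else 1000 + (i % 50), "deviceID": i}
    for i in range(10_000)
]


def rows_per_slice_for_brand(rows, assign, brand=3927):
    """Count, per slice, the rows that WHERE brandID = <brand> must process."""
    counts = Counter()
    for idx, row in enumerate(rows):
        if row["brandID"] == brand:
            counts[assign(idx, row)] += 1
    return counts


# Distributed on brandID: all rows for one brand hash to the same slice.
key_slices = rows_per_slice_for_brand(rows, lambda idx, row: slice_by_key(row["brandID"]))
# EVEN distribution: rows are dealt out round-robin across slices.
even_slices = rows_per_slice_for_brand(rows, lambda idx, row: idx % NUM_SLICES)

print(len(key_slices), len(even_slices))
```

With brandID as the distribution key a single slice does all the work for this filter, while round-robin placement spreads the same rows over every slice.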
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data have changed enough to make other settings more efficient.
Sometimes the issue could be due to locks being acquired by other processes. You can refer to: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/
I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury you should store the date in the format you need in an additional column.

Does Redshift optimize intra-block search or just scan the whole block?

I created two tables with 43,547,563 rows each:
CREATE TABLE metrics_compressed (
some_id bigint ENCODE ZSTD,
some_value varchar(200) ENCODE ZSTD distkey,
...,
some_timestamp bigint ENCODE ZSTD,
...,
PRIMARY KEY (some_id, some_timestamp, some_value)
)
sortkey (some_id, some_timestamp);
The second one is exactly like the first one but without any column compressed.
Running this query (it just counts one row):
select count(*)
from metrics_compressed
where some_id = 20906
and some_timestamp = 1475679898584;
shows a table scan of 42,394,071 rows (from the rows_pre_filter column in svl_query_summary, with is_rrscan true), while running it over the uncompressed table scans 3,143,856 rows. I guess the reason for this is that the compressed one uses fewer 1MB blocks, hence the scan shows the total number of rows from the retrieved blocks.
Are the scanned rows a sign of bad performance? Or does Redshift use some kind of binary search within a block for such simple queries as this one, and the scanned rows is just confusing info for optimizing queries?
In general, you should let Amazon Redshift determine its own compression types. It does this by loading 100,000 rows and determining the optimal compression type to use for each column based on this sample data. It then drops those rows and restarts the load. This happens automatically when a table is first loaded if there is no compression type specified on the columns.
The SORTKEY is more important for fast queries than compression, because it allows Redshift to totally skip over blocks that do not contain desired data. In your example, using some_id within the WHERE clause allows it to only look at blocks containing that specific value and since it is also the SORTKEY this will be extremely efficient.
Once a block is identified as potentially containing the SORTKEY data, Redshift will read the block from disk and process the contents.
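The difference in scanned rows between the two tables can be sketched in Python. This is a toy model with made-up numbers: the rows-per-block figures (1,000 uncompressed versus 20,000 compressed) and the helper rows_scanned are assumptions, chosen only to show that denser blocks mean more rows are read for each block that survives the Zone Map check.

```python
def rows_scanned(sorted_values, rows_per_block, target):
    """Count the rows in every block whose min/max range could contain target.

    The whole block is processed once it has been selected, so the count is
    the block size, not the number of matching rows.
    """
    scanned = 0
    for start in range(0, len(sorted_values), rows_per_block):
        block = sorted_values[start:start + rows_per_block]
        # The block is sorted, so its first and last values are its min and max.
        if block[0] <= target <= block[-1]:
            scanned += len(block)
    return scanned


# Hypothetical sorted column: 10,000 ids with 10 rows each.
values = sorted(i // 10 for i in range(100_000))
matching = values.count(2090)

raw = rows_scanned(values, 1_000, 2090)          # uncompressed: fewer rows per block
compressed = rows_scanned(values, 20_000, 2090)  # compressed: many more rows per block

print(matching, raw, compressed)
```

In both cases a single block is read, but the compressed block holds twenty times as many rows, which is the kind of gap a pre-filter row count would show.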
The general rule is to use DISTKEY for columns most used in JOINs and use SORTKEY for columns most used in WHERE clauses (but there are also more subtle variations on those general rules).

Redshift: Aggregate data on large number of dimensions is slow

I have an Amazon redshift table with about 400M records and 100 columns - 80 dimensions and 20 metrics.
Table is distributed by 1 of the high cardinality dimension columns and includes a couple of high cardinality columns in sort key.
A simple aggregate query:
Select dim1, dim2...dim60, sum(met1),...sum(met15)
From my table
Group by dim1...dim60
is taking too long. The explain plan looks simple: just a sequential scan and a HashAggregate on the table. Any recommendations on how I can optimize it?
1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each) it is faster to group by dimension keys only, and if you really need all dimension attributes join the aggregated result back to dimension tables to get them, like this:
with
groups as (
select dim1_id,dim2_id,...,dim20_id,sum(met1),sum(met2)
from my_table
group by 1,2,...,20
)
select *
from groups
join dim1_table
using (dim1_id)
join dim2_table
using (dim2_id)
...
join dim20_table
using (dim20_id)
If you don't want to normalize your table and you like that a single row has all pieces of information it's fine to keep it as is since in a column database they won't slow the queries down if you don't use them. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) If your dimensions are hierarchical, you can group by the lowest level only and then join the higher-level dimension attributes. For example, if you have country, country region and city with 4 attributes each, there's no need to group by 12 attributes; all you need to do is group by city ID and then join the city's attributes and the country region and country tables to the city ID of each group.
3) you can have the combination of dimension IDs with some delimiter like - in a separate varchar column and use that as a sort key
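The "group by the key only, then join the attributes back" idea from points 1 and 2 can be sketched in Python. The tables, column names and values here are hypothetical, and this shows only the shape of the computation, not how Redshift executes it.

```python
from collections import defaultdict

# Hypothetical dimension table: city_id -> attributes of the city and its country.
city_dim = {
    1: {"city": "Lyon", "country": "France"},
    2: {"city": "Turin", "country": "Italy"},
}

# Hypothetical fact rows: (city_id, met1).
facts = [(1, 10.0), (2, 5.0), (1, 2.5)]

# Step 1: aggregate on the key alone, so the grouping key is one small value
# instead of every descriptive attribute.
totals = defaultdict(float)
for city_id, met1 in facts:
    totals[city_id] += met1

# Step 2: join the attributes back onto the (much smaller) aggregated result.
result = [
    {"city_id": cid, **city_dim[cid], "met1": total}
    for cid, total in sorted(totals.items())
]

for row in result:
    print(row)
```

The join in step 2 touches one row per group rather than one row per fact, which is why it is cheap compared with carrying every attribute through the aggregation.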
Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the results nightly and store them in a table for later querying.

Redshift performance: encoding on join column

Would encoding on a join column hurt query performance? I let the COPY command decide the encoding type.
In general, no. An encoding on your DIST KEY will even have a positive impact due to the reduction in disk I/O.
According to the AWS table design playbook, there are a few edge cases where an encoding on your DIST KEY will indeed hurt your query performance:
Your query patterns apply range restricted scans to a column that is very well compressed.
The well compressed column’s blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don’t compress well. These columns are > 10x the size of the well compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded , which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So, to answer your question: encoding will not hurt query performance; it is running without encoding that is not a best practice.
There are a couple of details on this from AWS respondents:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
DISTKEY can be compressed but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound) the other sort key columns can be compressed.
Also, generally recommend using a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables try using the same dist and sort keys on both tables so Redshift can use a faster merge join.
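To show why matching sort keys make a merge join attractive, here is a minimal merge join sketch in Python. It is a simplified illustration, not Redshift's implementation: it assumes both inputs are already sorted on the join key and that keys are unique on the left side, and the sample rows are made up.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) tuples that are already sorted by key.

    Assumes keys are unique in `left`; `right` may repeat a key.
    Each input is walked once, with no hash table and no re-sorting.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1  # no more right rows can match this left row
        elif left_key > right_key:
            j += 1  # this right row has no partner on the left
        else:
            out.append((left_key, left[i][1], right[j][1]))
            j += 1  # stay on the left row; the next right row may share the key
    return out


# Hypothetical rows, both sorted on customer_id.
customers = [(1, "Ann"), (2, "Bob"), (4, "Dee")]
invoices = [(1, 10.0), (1, 11.0), (3, 30.0), (4, 40.0)]

joined = merge_join(customers, invoices)
print(joined)
```

The single forward pass only works because both sides arrive in key order, which is what having the same sort key on both tables provides.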
Based on this, I think that as long as both sides of the join have the same compression, Redshift will join on the compressed value safely.