Redshift Query taking too much time - amazon-web-services

In Redshift, the queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift and it is getting difficult to understand the Query plan to optimise the query.
Sharing one of the queries that we run, along with the Query Plan.
The query is taking 20 seconds to execute.
Query
SELECT
date_trunc('day',
ti) as date,
count(distinct deviceID) AS COUNT
FROM
live_events
WHERE
brandID = 3927
AND ti >= '2017-08-02T00:00:00+00:00'
AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
1
Primary key
brandID
Interleaved Sort Keys
we have set following columns as interleaved sort keys -
brandID, ti, event_name
QUERY PLAN

You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here's some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows does not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.
SORTKEYs are typically a date column, since they are often found in a WHERE clause and the table will be automatically sorted if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTKEY EVEN so that all slices participate in the query. (It is also the default DISTKEY if no specific DISTKEY is selected.) Alternatively, set DISTKEY to a field not shown -- but certainly don't use brandId as the DISTKEY otherwise only one slice will participate in the query shown.
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data has changed enough to make other settings more efficient.

Some time the issue could be due to locks being acquired by other processes. You can refer: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/

I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury you should store the date in the format you need in an additional column.

Related

AWS Redshift Distkey and Skew

I came across a situation where I am defining the distkey as the column which is used to join it with other tables (to avoid re-distribution). But that column is not the highest cardinality column, so it leads to skew the data distribution.
Example:
Transaction Table (20M rows)
------------------------------
| user_id | int |
| transaction_id | int |
| transaction_date | date |
------------------------------
Let's say most of the joins performed on this table is on user_id, but transaction_id is higher cardinality column. As 1 user can have multiple transactions.
What should be done in this situation?
Distribute the table on transaction_id column? Even though it will need to re-distributing the data when joined on user_id with another table
Distribute on user_id and let the data be skewed? In my case, the skew factor is ~15 which is way higher than AWS Redshift recommended skew factor of 4.0
As John rightly says you LIKELY want to lean towards improving join performance over data skew but this is based on a ton of likely-true assumptions. I'll itemize a few here:
The distribution (disk-based) skew is on a major fact table
The other tables are also distributed on the join-on key
The joins are usually on the raw tables or group-bys are performed on the dist key
Redshift is a networked cluster and the interconnects between nodes is the lowest bandwidth aspect of the architecture (not low bandwidth, just lower than the other aspects). Move very large amounts of data between nodes is an anti-pattern for Redshift and should be avoided whenever possible.
Disk skew is a measure of where the data is stored around the cluster and without query-based-information only impacts how efficiently the data is stored. The bigger impact of disk skew is execution skew - the the difference in the amount of work each CPU (slice) does when executing a query. Since the first step of every query is for each slice to work on the data it "owns", disk skew leads to some amount of execution skew. How much is dependent on many factors but especially the query in question. Disk skew can lead to issues and in some cases this CAN outweigh redistribution costs. Since slice performance of Redshift is high, execution skew OFTEN isn't the #1 factor driving performance.
Now (nearly) all queries have to perform some amount of data redistribution of data when executing. If you do a group-by of two tables by some non-dist-key column and then join them, there will be redistribution needed to perform the join. The good news is that (hopefully) the amount of data post-group-by will be small so the cost of redistribution will be low. Amount of data being redistributed is what matters.
Dist-key of the tables is only one way to control how much data redistributed. Some ways to do this:
If the dimension tables are dist-style ALL then it doesn't (in basic
cases) matter that your fact table is distributed by user_id - the
data to be joined already exists on the nodes it needs to be on.
You can also control how much data is redistributed by reducing how
much data goes into the join. Having where clauses at the earliest
stage in the query can do this. Denormalizing your data so that
needed where clause columns appear in your fact tables can be a huge
win.
In extreme cases you can make derived dist-key columns that align
perfectly to user_id but also have greatly reduced disk and
execution skew. This is a deeper topic that needs to be in this
answer but can be the answer when you need max performance when
redistribution and skew are in conflict.
A quick word on "ordinality". This is a rule-of-thumb metric that a lot of Redshift documents use as a way to keep new users out of trouble but that can also be explained quickly. It's an (somewhat useful) over-simplification. Higher ordinality is not always better and in the extreme is an anti-pattern - think of a table where each row of the dist-key has a unique value, now think about doing a group-by on some other column for this table. The data skew in this example is perfect but performance of the group-by will suck. You want to distribute the data to speed up what work needs to be done - not improve a metric.

Convert indexes to sortkeys Redshift

Do zonemaps exists only in memory? Or its populated in memory from disk where its stored persistently? Is it stored along with the 1MB block, or in a separate place?
We are migrating from oracle to redshift, there are bunch of indexes to cater to reporting needs. The nearest equivalent of index in Redshift is sortkeys. For bunch of tables, the total number of cols of all the indexes are between 15-20 (some are composite indexes, some are single col indexes). Interleaved keys seems to be best fit, but there cannot be more than 8 cols in an interleaved sortkey. But if I use compound sortkey, it wont be effective since the queries might not have prefix colums.
Whats the general advice in such cases - which type of sort key to use? How to convert many indexes from rdbms to sort keys in redshift?
Are high cardinality cols such as identity cols, dates and timestamps not good fit with interleaved keys? Would it be same with compound sortkeys? Any disadvanatges with interleaved sortkeys to keep in consideration?
You are asking the right questions so let's take these down one at a time. First, zonemaps are located on the leader node and stored on disk and the table data is stored on the compute nodes. They are located separate from each other. The zonemaps store the min and max values for every column for every 1MB block in a table. No matter if a column is in your sortkey list or not, there will be zonemap data for the block. When a column shows up in a WHERE clause Redshift will first compare to the zonemap data to decide if the block is needed for the query. If a block is not needed it won't be read from disk resulting in significant performance improvements for very large tables. I call this "block rejection". A few key points - This really only makes a difference on tables will 10s of millions of rows and when there are selective WHERE predicates.
So you have a number of reports each of which looks at the data by different aspects - common. You want all of these to work well, right? Now the first thing to note is that each table can have it's own sortkeys, they aren't linked. What is important is how does the choice of sortkeys affect the min and max values in the zonemaps for the columns you will use as WHERE clauses. With composite sortkeys you have to think about what impact later keys will have on the composition of the block - not much after the 3rd or 4th key. This is greatly impacted by the ordinality of the data but you get the idea. The good news is that sorting on one column will impact the zonemaps of all the columns so you don't always have to have a column in the sortkey list to get the benefit.
The question of compound vs interleaved sortkeys is a complicated one but remember you want to get high levels of block rejection as often as possible (and on the biggest tables). When different queries have different WHERE predicates it can be tricky to get a good mix of sortkeys to make this happen. In general compound sortkeys are easier to understand and have less table maintenance implications. You can inspect the zonemaps and see what impacts your sortkey choices are having and make informed decisions on how to adjust. If there are columns with low ordinality put those first so that the next sortkeys can have impact on the overall row order and therefore make block with different value ranges for these later keys. For these reasons I like compound keys over interleaved but there are cases where things will improve with interleaved keys. When you have high ordinality for all the columns and they are all equally important interleaved may be the right answer. I usually learn about the data trying to optimize compound keys that even if I end up with interleaved keys I can make smart choices about what columns I want in the sortkeys.
Just some metrics to help in you choice. Redshift can store 200,000 row elements in a single block and I've seen columns with over 2M elements per block. Blocks are distributed across the cluster so you need a lot of rows to fill up enough blocks that rejecting a high percentage of them is even possible. If you have a table of 5 million rows and you are sweating the sortkeys you are into the weeds. (Yes sorting can impact other aspects of the query like joining but these are sub-second improvements not make or break performance impacts.) Compression can have a huge impact on the number of row elements per block and therefore how many rows are represented in an entry in the zonemap. This can increase block rejection but will increase the read data needed to scan the entire table - a tradeoff you will want to make sure you are winning (1 query gets faster by 10 get slower is likely not a good tradeoff).
Your question about ordinality is a good one. If I sort my a high ordinality column first in a compound sortkey list this will set the overall order of the rows potentially making all other sortkeys impotent. However if I sort by a low ordinality column first then there is a lot of power left for other sortkeys to change the order of the rows and therefore the zonemap contents. For example if I have Col_A with only 100 unique values and Col_B which is a timestamp with 1microsecond resolution. If I sort by Col_B first all the rows are likely order just by sorting on this column. But if I sort by Col_A first there are lots of rows with the same value and the later sortkey (Col_B) can order these rows. Interleaved works the same way except which column is "first" changes by region of the table. If I interleave sort base on the same Col_A and Col_B above (just 2 sortkeys), then half the table will be sorted by Col_A first and half by Col_B first. For this example Col_A will be useless half of the time - not the best answer. Interleave sorting just modifies which column is use as the first sortkey throughout the table (and second and third if more keys are used). High ordinality in a sort key makes later sortkeys less powerful and this independent of sort style - it's just the interleave changes up which columns are early and which are late by region of the table.
Because ordinality of sortkeys can be such an important factor in gaining block rejection across many WHERE predicates that it is common to add derived columns to tables to hold lower ordinality versions of other columns. In the example above I might add Col_B2 to the table and have if just hold the year and month (month truncated date) of Col_B. I would use Col_B2 in my sortkey list but my queries would still be referencing Col_B. It "roughly" sorts based on Col_B so that Col_A can have some sorting power if it was to come later in the sortkey list. This is a common reason for making data model changes when moving Redshift.
It is also critical that "block rejecting" WHERE clauses on written against the fact table column, not applied to a dimension table column after the join. Zonemap information is read BEFORE the query starts to execute and is done on the leader node - it can't see through joins. Another data model change is to denormalize some key information into the fact tables so these common where predicates can be applied to the fact table and zonemaps will be back in play.
Sorry for the tome but this is a deep topic which I've spent year optimizing. I hope this is of use to you and reach out if anything isn't clear (and I hope you have the DISTKEYS sorted out already :) ).

Does Redshift optimize inter-block search or just scans the whole block?

I created two tables with 43,547,563 rows each:
CREATE TABLE metrics_compressed (
some_id bigint ENCODE ZSTD,
some_value varchar(200) ENCODE ZSTD distkey,
...,
some_timestamp bigint ENCODE ZSTD,
...,
PRIMARY KEY (some_id, some_timestamp, some_value)
)
sortkey (some_id, some_timestamp);
The second one is exactly like the first one but without any column compressed.
Running this query (it just counts one row):
select count(*)
from metrics_compressed
where some_id = 20906
and some_timestamp = 1475679898584;
shows a table scan of 42,394,071 rows (from the rows_pre_filter column in svl_query_summary, column is_rrscan true) and while running it over the uncompressed table it scans 3,143,856. I guess the reason for this is that the compressed one uses less 1MB blocks, hence the scan shows the total number of rows from the retrieved blocks.
Are the scanned rows a sign of bad performance? Or does Redshift use some kind of binary search within a block for such simple queries as this one, and the scanned rows is just confusing info for optimizing queries?
In general, you should let Amazon Redshift determine its own compression types. It does this by loading 100,000 rows and determining the optimal compression type to use for each column based on this sample data. It then drops those rows and restarts the load. This happens automatically when a table is first loaded if there is no compression type specified on the columns.
The SORTKEY is more important for fast queries than compression, because it allows Redshift to totally skip over blocks that do not contain desired data. In your example, using some_id within the WHERE clause allows it to only look at blocks containing that specific value and since it is also the SORTKEY this will be extremely efficient.
Once a block is identified as potentially containing the SORTKEY data, Redshift will read the block from disk and process the contents.
The general rule is to use DISTKEY for columns most used in JOIN and use SORTKEY for columns most used in WHERE statements (but there are also more subtle variations on those general rules).

Dist and Sort Keys Redshift

I'm trying to add dist and sort keys to some of the tables in redshift.
I notice that before adding the size of the table is 0.50 and after adding it gets increased to 0.51 or 0.52. Is this possible ? The whole purpose of having dist and sort keys is to decrease the size of the table and help in increasing the read/write performance.
That is not the purpose of having a DISTKEY and SORTKEY.
To decrease the storage size of a table, use compression.
The DISTKEY is used to distribute data amongst slices. By co-locating information on the same slice, queries can run faster. For example, if you had these tables:
customer table, DISTKEY = customer_id
invoices table, DISTKEY = customer_id
...then these tables would be distributed in the same manner. All records in both tables for a given customer_id would be located on the same slice, thereby avoiding the need to transfer data between slices. The DISTKEY should be the column that is mostly used for JOINS.
The SORTKEY is used to sort data on disk, for the benefit of Zone Maps. Each storage block on disk is 1MB in size and contains data for only one column in one table. The data for this column is sorted, then stored in multiple blocks. The Zone Map associated with each block identifies the minimum and maximum values stored within that block. Then, when a query is run with a WHERE statement, Amazon Redshift only needs to read the blocks that contain the desired range of data. By skipping over blocks that do not contain data within the WHERE clause, Redshift can run queries much faster.
The above can all work together. For example, compressed data requires fewer blocks, which also allows Redshift to skip over more data based on the Zone Maps. To get the best possible performance out of queries, use DISTKEY, SORTKEY and compression together.
(It is often recommended not to compress the SORTKEY column because it causes too many rows to be loaded from a single block.)
See also: Top 10 Performance Tuning Techniques for Amazon Redshift

Best order of joins and append for performance

I'm having huge performance issues with a SAS DI job that I need to get up and running. Therefore I'm looking for clever ways to optimize the job.
One thing in particular that I thought of is that I should perhaps permute the order of some joins and an append. Currently, my job is configured as follows:
there are several similarly structured source tables which I first apply a date filter to (to reduce the number of rows) and sort on two fields, say a and b, then I left join each table to a table with account table on the same fields a and b (I'd like to create indexes for these if possible, but don't know how to do it for temporary work tables in SAS DI). After each of these joins is complete, I append the resulting tables into one dataset.
It occurs to me that I could first append, and then do just one join, but I have no notion of which approach is faster, or if the answer is that it depends I have no notion of what it depends on (though I'd guess the size of the constituent tables).
So, is it better to do many joins then append, or to append then do one join?
EDIT
Here is an update with some relevant information (requested by user Robert Penridge).
The number of source tables here is 7, and the size of these tables ranges from 1500 to 5.2 million. 10 000 is typical. The number of columns is 25. These tables are each being joined with the same table, which has about 5000 rows and 8 columns.
I estimate that the unique key partitions the tables into subsets of roughly equal size; the size reduction here should be between 8% and 30% (the difference is due to the fact that some of the source tables carry much more historical data than others, adding to the percentage of the table grouped into the same number of groups).
I have limited the number of columns to the exact minimum amount required (21).
By default SAS DI creates all temporary datasets as views, and I have not changed that.
The code for the append and joins are auto-generated by SAS DI after constructing them with GUI elements.
The final dataset is not sorted; my reason for sorting the data which feeds the joins is that the section of this link on join performance (page 35) mentions that it should improve performance.
As I mentioned, I'm not sure if one can put indexes on temporary work tables or views in SAS DI.
I cannot say whether the widths of the fields is larger than absolutely necessary, but if so I doubt it is egregious. I hesitate to change this since it would have to be done manually, on several tables, and when new data comes in it might need that extra column width.
Much gratitude
Performance in SAS is mainly about reducing IO (ie. reading/writing to the disk).
Without additional details it's difficult to help but some additional things you can consider are:
limit the columns you are processing by using a keep statement (reduces IO)
if the steps performing the joins are IO intensive, consider using views rather than creating temporary tables
if the joins are still time consuming, consider replacing them with hash table lookups
make sure you are using proc append to append the 2 datasets together to reduce the IO. Append the smaller dataset to the larger dataset.
consider not sorting the final dataset but placing an index on it for consumers of the data.
ensure you are using some type of dataset compression, or ensure your column widths are set appropriately for all columns (ie. you don't have a width of 200 on a field that uses a width of 8)
reduce the number of rows as early in the process as possible (you are already doing this, just listing it here for completeness)
Adjusting the order of left-joins and appends probably won't make as much difference as doing the above.
As per your comments it seems that
1. There are 7 input source tables
2. Join these 7 source tables to 1 table
3. Append the results
In SAS DI studio, use a Lookup to perform the above much faster
1. Connect the 7 Input tables to a Lookup Transform (lets call them SRC 1-7)
2. The table with 5000 records is the tables on which lookup is performed on keys A and B (lets call this LKUP-1)
3. Take the relevant columns from LKUP-1 to propagate into the TARGET tables.
This will be much faster and you don't have to perform JOINs in this case as I suspect you are doing a Many-Many join which is degrading the performance in SAS DIS.