RedShift Deep Copy Without INSERT_XID (Hidden metadata) Column Data

RedShift Deep Copy Without INSERT_XID (Hidden metadata) Column Data - amazon-web-services

I have a very long, narrow table in AWS Redshift. It's that has fallen victim to the issue of the hidden metadata column, INSERT_XID, being hugely disproportionate in size compared to the table.
Picture a table of 632K rows that has 22gb visible data in it and a hidden column with 83gb.
I want to reclaim that space, but Vacuum has no effect on it
I tried copying the table:
BEGIN;
CREATE TABLE test.copied (like prod.table);
INSERT INTO test.copied (select * from prod.table);
COMMIT;
This results in a true deep copy where the hidden meta data column is still very large. I was hoping that a copy of the table in one go into a new one would allow the hidden INSERT_XID column to compress, but it failed to do so.
Any ideas how I can optimize this hidden column in AWS Redshift?
I measured the size of each column with the following:
SELECT col, attname, COUNT(*) AS "mbs"
FROM stv_blocklist bl
JOIN stv_tbl_perm perm
ON bl.tbl = perm.id AND bl.slice = perm.slice
LEFT JOIN pg_attribute attr ON
attr.attrelid = bl.tbl
AND attr.attnum-1 = bl.col
WHERE perm.name = 'table_name'
GROUP BY col, attname
ORDER BY col;
Update:
I also tried an UNLOAD of this table into S3 and then a single COPY back into a new table and he size of the hidden column was unchanged. I'm not sure if this is even resolvable.
Thank you!

I did some math on the numbers you provided and I think you may be running into 1MB block size quanta effects. However, the math still doesn't work out.
Redshift stores you data around the cluster per the table's distribution style. For non-diststyle-all tables this means that each column has rows on each slice of the cluster. The minimum storage size on Redshift, a block, is 1MB in size. When you have small (for Redshift) number of rows in your table there isn't enough data on each slice to fill up one block so there is a lot of wasted space on disk.
If you have a table of say 2 columns which has 630K rows and you are working on a cluster that has 1024 slices (like 32 nodes of dc2.8xl) then these effects can be quite pronounced. Each slice has only 615 rows (on average), no where close to filling up a 1MB block. So the non-metadata portion of this table will take up 2X1024X1MB = 2.048gb. As you can see, even in this case, I can only get to one tenth of what you are showing.
I could rerun this with 20 columns instead of 2 and I would get up to your 22gb figure but then the size of the metadata columns wouldn't make a whole lot of sense - they aren't that inefficient. It is possible that I'm not looking at configurations like what you have - 4000 slices? 8 columns?
22gb of space is 22,000 blocks spread across the slices and columns of your cluster / table. Knowing your column count and cluster configuration will greatly help in understanding how the data is being stored.
Recommendation - move this table to DISTSTYLE ALL and you will save greatly in storage space. 600K rows is tiny for Redshift and spreading the data across all the slices is just inefficient. Be advised that DISTSTLYE ALL has query compilation implications - mostly positive but not all so monitor your query performance if you make this change.

Related

Convert indexes to sortkeys Redshift

Do zonemaps exists only in memory? Or its populated in memory from disk where its stored persistently? Is it stored along with the 1MB block, or in a separate place?
We are migrating from oracle to redshift, there are bunch of indexes to cater to reporting needs. The nearest equivalent of index in Redshift is sortkeys. For bunch of tables, the total number of cols of all the indexes are between 15-20 (some are composite indexes, some are single col indexes). Interleaved keys seems to be best fit, but there cannot be more than 8 cols in an interleaved sortkey. But if I use compound sortkey, it wont be effective since the queries might not have prefix colums.
Whats the general advice in such cases - which type of sort key to use? How to convert many indexes from rdbms to sort keys in redshift?
Are high cardinality cols such as identity cols, dates and timestamps not good fit with interleaved keys? Would it be same with compound sortkeys? Any disadvanatges with interleaved sortkeys to keep in consideration?

You are asking the right questions so let's take these down one at a time. First, zonemaps are located on the leader node and stored on disk and the table data is stored on the compute nodes. They are located separate from each other. The zonemaps store the min and max values for every column for every 1MB block in a table. No matter if a column is in your sortkey list or not, there will be zonemap data for the block. When a column shows up in a WHERE clause Redshift will first compare to the zonemap data to decide if the block is needed for the query. If a block is not needed it won't be read from disk resulting in significant performance improvements for very large tables. I call this "block rejection". A few key points - This really only makes a difference on tables will 10s of millions of rows and when there are selective WHERE predicates.
So you have a number of reports each of which looks at the data by different aspects - common. You want all of these to work well, right? Now the first thing to note is that each table can have it's own sortkeys, they aren't linked. What is important is how does the choice of sortkeys affect the min and max values in the zonemaps for the columns you will use as WHERE clauses. With composite sortkeys you have to think about what impact later keys will have on the composition of the block - not much after the 3rd or 4th key. This is greatly impacted by the ordinality of the data but you get the idea. The good news is that sorting on one column will impact the zonemaps of all the columns so you don't always have to have a column in the sortkey list to get the benefit.
The question of compound vs interleaved sortkeys is a complicated one but remember you want to get high levels of block rejection as often as possible (and on the biggest tables). When different queries have different WHERE predicates it can be tricky to get a good mix of sortkeys to make this happen. In general compound sortkeys are easier to understand and have less table maintenance implications. You can inspect the zonemaps and see what impacts your sortkey choices are having and make informed decisions on how to adjust. If there are columns with low ordinality put those first so that the next sortkeys can have impact on the overall row order and therefore make block with different value ranges for these later keys. For these reasons I like compound keys over interleaved but there are cases where things will improve with interleaved keys. When you have high ordinality for all the columns and they are all equally important interleaved may be the right answer. I usually learn about the data trying to optimize compound keys that even if I end up with interleaved keys I can make smart choices about what columns I want in the sortkeys.
Just some metrics to help in you choice. Redshift can store 200,000 row elements in a single block and I've seen columns with over 2M elements per block. Blocks are distributed across the cluster so you need a lot of rows to fill up enough blocks that rejecting a high percentage of them is even possible. If you have a table of 5 million rows and you are sweating the sortkeys you are into the weeds. (Yes sorting can impact other aspects of the query like joining but these are sub-second improvements not make or break performance impacts.) Compression can have a huge impact on the number of row elements per block and therefore how many rows are represented in an entry in the zonemap. This can increase block rejection but will increase the read data needed to scan the entire table - a tradeoff you will want to make sure you are winning (1 query gets faster by 10 get slower is likely not a good tradeoff).
Your question about ordinality is a good one. If I sort my a high ordinality column first in a compound sortkey list this will set the overall order of the rows potentially making all other sortkeys impotent. However if I sort by a low ordinality column first then there is a lot of power left for other sortkeys to change the order of the rows and therefore the zonemap contents. For example if I have Col_A with only 100 unique values and Col_B which is a timestamp with 1microsecond resolution. If I sort by Col_B first all the rows are likely order just by sorting on this column. But if I sort by Col_A first there are lots of rows with the same value and the later sortkey (Col_B) can order these rows. Interleaved works the same way except which column is "first" changes by region of the table. If I interleave sort base on the same Col_A and Col_B above (just 2 sortkeys), then half the table will be sorted by Col_A first and half by Col_B first. For this example Col_A will be useless half of the time - not the best answer. Interleave sorting just modifies which column is use as the first sortkey throughout the table (and second and third if more keys are used). High ordinality in a sort key makes later sortkeys less powerful and this independent of sort style - it's just the interleave changes up which columns are early and which are late by region of the table.
Because ordinality of sortkeys can be such an important factor in gaining block rejection across many WHERE predicates that it is common to add derived columns to tables to hold lower ordinality versions of other columns. In the example above I might add Col_B2 to the table and have if just hold the year and month (month truncated date) of Col_B. I would use Col_B2 in my sortkey list but my queries would still be referencing Col_B. It "roughly" sorts based on Col_B so that Col_A can have some sorting power if it was to come later in the sortkey list. This is a common reason for making data model changes when moving Redshift.
It is also critical that "block rejecting" WHERE clauses on written against the fact table column, not applied to a dimension table column after the join. Zonemap information is read BEFORE the query starts to execute and is done on the leader node - it can't see through joins. Another data model change is to denormalize some key information into the fact tables so these common where predicates can be applied to the fact table and zonemaps will be back in play.
Sorry for the tome but this is a deep topic which I've spent year optimizing. I hope this is of use to you and reach out if anything isn't clear (and I hope you have the DISTKEYS sorted out already :) ).

Redshift: Aggregate data on large number of dimensions is slow

I have an Amazon redshift table with about 400M records and 100 columns - 80 dimensions and 20 metrics.
Table is distributed by 1 of the high cardinality dimension columns and includes a couple of high cardinality columns in sort key.
A simple aggregate query:
Select dim1, dim2...dim60, sum(met1),...sum(met15)
From my table
Group by dim1...dim60
is taking too long. The explain plan looks simple just a sequential scan and hashaggregate on the able. Any recommendations on how I can optimize it?

1) If your table is heavily denormalized (your 80 dimensions are in fact 20 dimensions with 4 attributes each) it is faster to group by dimension keys only, and if you really need all dimension attributes join the aggregated result back to dimension tables to get them, like this:
with
groups as (
select dim1_id,dim2_id,...,dim20_id,sum(met1),sum(met2)
from my_table
group by 1,2,...,20
)
select *
from groups
join dim1_table
using (dim1_id)
join dim2_table
using (dim2_id)
...
join dim20_table
using (dim20_id)
If you don't want to normalize your table and you like that a single row has all pieces of information it's fine to keep it as is since in a column database they won't slow the queries down if you don't use them. But grouping by 80 columns is definitely inefficient and has to be "pseudo-normalized" in the query.
2) if your dimensions are hierarchical you can group by the lowest level only and then join higher level dimension attributes. For example, if you have country, country region and city with 4 attributes each there's no need to group by 12 attributes, all you can do is group by city ID and then join city's attributes, country region and country tables to the city ID of each group
3) you can have the combination of dimension IDs with some delimiter like - in a separate varchar column and use that as a sort key

Sequential scans are quite normal for Amazon Redshift. Instead of using indexes (which themselves would be Big Data), Redshift uses parallel clusters, compression and columnar storage to provide fast queries.
Normally, optimization is done via:
DISTKEY: Typically used on the most-JOINed column (or most GROUPed column) to localize joined data on the same node.
SORTKEY: Typically used for fields that most commonly appear in WHERE statements to quickly skip over storage blocks that do not contain relevant data.
Compression: Redshift automatically compresses data, but over time the skew of data could change, making another compression type more optimal.
Your query is quite unusual in that you are using GROUP BY on 60 columns across all rows in the table. This is not a typical Data Warehousing query (where rows are normally limited by WHERE and tables are connected by JOIN).
I would recommend experimenting with fewer GROUP BY columns and breaking the query down into several smaller queries via a WHERE clause to determine what is occupying most of the time. Worst case, you could run the results nightly and store them in a table for later querying.

Dist and Sort Keys Redshift

I'm trying to add dist and sort keys to some of the tables in redshift.
I notice that before adding the size of the table is 0.50 and after adding it gets increased to 0.51 or 0.52. Is this possible ? The whole purpose of having dist and sort keys is to decrease the size of the table and help in increasing the read/write performance.

That is not the purpose of having a DISTKEY and SORTKEY.
To decrease the storage size of a table, use compression.
The DISTKEY is used to distribute data amongst slices. By co-locating information on the same slice, queries can run faster. For example, if you had these tables:
customer table, DISTKEY = customer_id
invoices table, DISTKEY = customer_id
...then these tables would be distributed in the same manner. All records in both tables for a given customer_id would be located on the same slice, thereby avoiding the need to transfer data between slices. The DISTKEY should be the column that is mostly used for JOINS.
The SORTKEY is used to sort data on disk, for the benefit of Zone Maps. Each storage block on disk is 1MB in size and contains data for only one column in one table. The data for this column is sorted, then stored in multiple blocks. The Zone Map associated with each block identifies the minimum and maximum values stored within that block. Then, when a query is run with a WHERE statement, Amazon Redshift only needs to read the blocks that contain the desired range of data. By skipping over blocks that do not contain data within the WHERE clause, Redshift can run queries much faster.
The above can all work together. For example, compressed data requires fewer blocks, which also allows Redshift to skip over more data based on the Zone Maps. To get the best possible performance out of queries, use DISTKEY, SORTKEY and compression together.
(It is often recommended not to compress the SORTKEY column because it causes too many rows to be loaded from a single block.)
See also: Top 10 Performance Tuning Techniques for Amazon Redshift

Redshift performance: encoding on join column

would encoding on join column corrupts the query performance ? I let the "COPY command" to decide the encoding type.

In gernal no - since an encoding on your DIST KEY will even have a positive impact due to the reduction disk I/O.
According to the AWS table design playbook There are a few edge case were indeed an encoding on your DIST KEY will corrupt your query performance:
Your query patterns apply range restricted scans to a column that is
very well compressed.
The well compressed column’s blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don’t compress well. These columns are > 10x the size of the well
compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.

Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded , which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So your question it will not corrupt the query performance but not a best practice.

There are a couple of details on this by AWS respondants:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
DISTKEY can be compressed but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound) the other sort key columns can be compressed.
Also, generally recommend using a commonly filtered date/timestamp column,
(if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables try using the same dist
and sort keys on both tables so Redshift can use a faster merge join.
Based on this, i think as long as both sides of the join have the same compression, i think redshift will join on the compressed value safely.

Best order of joins and append for performance

I'm having huge performance issues with a SAS DI job that I need to get up and running. Therefore I'm looking for clever ways to optimize the job.
One thing in particular that I thought of is that I should perhaps permute the order of some joins and an append. Currently, my job is configured as follows:
there are several similarly structured source tables which I first apply a date filter to (to reduce the number of rows) and sort on two fields, say a and b, then I left join each table to a table with account table on the same fields a and b (I'd like to create indexes for these if possible, but don't know how to do it for temporary work tables in SAS DI). After each of these joins is complete, I append the resulting tables into one dataset.
It occurs to me that I could first append, and then do just one join, but I have no notion of which approach is faster, or if the answer is that it depends I have no notion of what it depends on (though I'd guess the size of the constituent tables).
So, is it better to do many joins then append, or to append then do one join?
EDIT
Here is an update with some relevant information (requested by user Robert Penridge).
The number of source tables here is 7, and the size of these tables ranges from 1500 to 5.2 million. 10 000 is typical. The number of columns is 25. These tables are each being joined with the same table, which has about 5000 rows and 8 columns.
I estimate that the unique key partitions the tables into subsets of roughly equal size; the size reduction here should be between 8% and 30% (the difference is due to the fact that some of the source tables carry much more historical data than others, adding to the percentage of the table grouped into the same number of groups).
I have limited the number of columns to the exact minimum amount required (21).
By default SAS DI creates all temporary datasets as views, and I have not changed that.
The code for the append and joins are auto-generated by SAS DI after constructing them with GUI elements.
The final dataset is not sorted; my reason for sorting the data which feeds the joins is that the section of this link on join performance (page 35) mentions that it should improve performance.
As I mentioned, I'm not sure if one can put indexes on temporary work tables or views in SAS DI.
I cannot say whether the widths of the fields is larger than absolutely necessary, but if so I doubt it is egregious. I hesitate to change this since it would have to be done manually, on several tables, and when new data comes in it might need that extra column width.
Much gratitude

Performance in SAS is mainly about reducing IO (ie. reading/writing to the disk).
Without additional details it's difficult to help but some additional things you can consider are:
limit the columns you are processing by using a keep statement (reduces IO)
if the steps performing the joins are IO intensive, consider using views rather than creating temporary tables
if the joins are still time consuming, consider replacing them with hash table lookups
make sure you are using proc append to append the 2 datasets together to reduce the IO. Append the smaller dataset to the larger dataset.
consider not sorting the final dataset but placing an index on it for consumers of the data.
ensure you are using some type of dataset compression, or ensure your column widths are set appropriately for all columns (ie. you don't have a width of 200 on a field that uses a width of 8)
reduce the number of rows as early in the process as possible (you are already doing this, just listing it here for completeness)
Adjusting the order of left-joins and appends probably won't make as much difference as doing the above.

As per your comments it seems that
1. There are 7 input source tables
2. Join these 7 source tables to 1 table
3. Append the results
In SAS DI studio, use a Lookup to perform the above much faster
1. Connect the 7 Input tables to a Lookup Transform (lets call them SRC 1-7)
2. The table with 5000 records is the tables on which lookup is performed on keys A and B (lets call this LKUP-1)
3. Take the relevant columns from LKUP-1 to propagate into the TARGET tables.
This will be much faster and you don't have to perform JOINs in this case as I suspect you are doing a Many-Many join which is degrading the performance in SAS DIS.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js