I have an ETL application where I read messages in 1k batches from a queue & do the following operations:
Parse the messages and create 20 CSV files for 20 different tables and gzip compress them
Copy these files to S3
Create temp tables and load them from S3 with the Redshift COPY command
Compare the master table against the temp table for duplicates and delete the duplicates from the master table
Copy the rows from the temp table to the master table (see the sketch below)
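For context, here is roughly what steps 3-5 look like in SQL (table, column, and S3 names are simplified placeholders, and event_id stands in for whatever uniquely identifies a message):

BEGIN;

CREATE TEMP TABLE stage (LIKE master_table);

-- step 3: load the batch's gzipped CSV files from S3 into the temp table
COPY stage
FROM 's3://my-bucket/batch-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
CSV GZIP;

-- step 4: drop master rows that are re-delivered in this batch
DELETE FROM master_table
USING stage
WHERE master_table.event_id = stage.event_id;

-- step 5: append the staged rows
INSERT INTO master_table
SELECT * FROM stage;

COMMIT;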
Everything was working fine until we noticed a huge drop in message processing throughput. On the Redshift console I see that step 4 for certain tables is now taking more than 10 minutes (these tables now have billions of rows).
Is this expected behaviour?
i.e. as the table size grows, does the dedup operation take longer? Are there any alternatives to this pattern?
EDIT: it looks like the size of the tables in question is in the TBs (5.10 TB for one of the tables).
EDIT 2: I added a WHERE clause to step 4. Earlier this step scanned billions of rows and TBs of data; now it is down to a couple of thousand rows and MBs of data, and the time has come down too.
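Continuing the sketch above, the narrowed step 4 looks roughly like this (assuming the table has a timestamp column, here called created_at, that is also the sort key, and that each batch only touches recent data):

-- restrict the delete to the time window covered by the batch so Redshift
-- can skip most blocks via zone maps instead of scanning the whole table
DELETE FROM master_table
USING stage
WHERE master_table.event_id = stage.event_id
  AND master_table.created_at >= (SELECT MIN(created_at) FROM stage);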
Related
I have a very long, narrow table in AWS Redshift that has fallen victim to the issue of the hidden metadata column, INSERT_XID, being hugely disproportionate in size compared to the table.
Picture a table of 632K rows that has 22 GB of visible data and a hidden column of 83 GB.
I want to reclaim that space, but VACUUM has no effect on it.
I tried copying the table:
BEGIN;
CREATE TABLE test.copied (LIKE prod.table);
INSERT INTO test.copied SELECT * FROM prod.table;
COMMIT;
This results in a true deep copy where the hidden metadata column is still very large. I was hoping that copying the table in one go into a new one would allow the hidden INSERT_XID column to compress, but it failed to do so.
Any ideas how I can optimize this hidden column in AWS Redshift?
I measured the size of each column with the following:
SELECT col, attname, COUNT(*) AS "mbs"
FROM stv_blocklist bl
JOIN stv_tbl_perm perm
ON bl.tbl = perm.id AND bl.slice = perm.slice
LEFT JOIN pg_attribute attr ON
attr.attrelid = bl.tbl
AND attr.attnum-1 = bl.col
WHERE perm.name = 'table_name'
GROUP BY col, attname
ORDER BY col;
Update:
I also tried an UNLOAD of this table to S3 and then a single COPY back into a new table, and the size of the hidden column was unchanged. I'm not sure if this is even resolvable.
Thank you!
I did some math on the numbers you provided and I think you may be running into 1MB block size quanta effects. However, the math still doesn't work out.
Redshift stores your data around the cluster per the table's distribution style. For non-DISTSTYLE ALL tables this means that each column has rows on each slice of the cluster. The minimum storage unit on Redshift, a block, is 1 MB in size. When you have a small (for Redshift) number of rows in your table, there isn't enough data on each slice to fill up one block, so there is a lot of wasted space on disk.
If you have a table of, say, 2 columns with 630K rows and you are working on a cluster that has 1,024 slices (like 32 nodes of dc2.8xl), then these effects can be quite pronounced. Each slice has only 615 rows on average, nowhere close to filling up a 1 MB block. So the non-metadata portion of this table will take up 2 x 1,024 x 1 MB = 2,048 MB, about 2 GB. As you can see, even in this case I can only get to one tenth of what you are showing.
I could rerun this with 20 columns instead of 2 and get up to your 22 GB figure, but then the size of the metadata columns wouldn't make a whole lot of sense - they aren't that inefficient. It is possible that I'm not looking at configurations like what you have - 4,000 slices? 8 columns?
22 GB of space is 22,000 blocks spread across the slices and columns of your cluster / table. Knowing your column count and cluster configuration will greatly help in understanding how the data is being stored.
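For what it's worth, the slice count is easy to check with a query (STV_SLICES is a Redshift system view that maps slices to nodes):

-- number of slices in the cluster
SELECT COUNT(*) AS slices FROM stv_slices;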
Recommendation - move this table to DISTSTYLE ALL and you will save greatly on storage space. 600K rows is tiny for Redshift and spreading the data across all the slices is just inefficient. Be advised that DISTSTYLE ALL has query compilation implications - mostly positive, but not all - so monitor your query performance if you make this change.
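If you decide to try it, the distribution style can (if I'm not mistaken) be changed in place rather than by rebuilding the table; using the table name from your example:

ALTER TABLE prod.table ALTER DISTSTYLE ALL;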
I am running a query through the Athena Query Editor on a table in the Glue Data Catalog and would like to understand why it takes so long to do a simple select * from this data.
Our data is stored in an S3 bucket that is partitioned by year/month/day/hour, with 80 Snappy-compressed Parquet files per partition that are each between 1 and 10 MB in size. When I run the following query:
select stringA, stringB, timestampA, timestampB, bigintA, bigintB
from tableA
where year='2021' and month='2' and day = '2'
It scans 700 MB but takes over 3 minutes to display the results in Athena. I feel that we have already optimized the file format and partitioning for this data, so I am unsure how else we can improve the performance if we're just trying to select this data and display it in a tool like QuickSight.
The select * performance was impacted by the number of files that needed to be scanned, all of which were relatively small. Repartitioning and removing the hour partition resulted in an improvement in both runtime (14% reduction) and data scanned (26% reduction), because Snappy compression gets better gains on larger files.
Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
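For anyone wanting to reproduce this, a rough sketch of the repartitioning via an Athena CTAS (the output location is a placeholder, and the column list matches the query above); the key point is that the new table is partitioned only down to day, so each partition is rewritten as fewer, larger files:

CREATE TABLE tableA_compacted
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/tableA_compacted/',
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT stringA, stringB, timestampA, timestampB, bigintA, bigintB,
       year, month, day            -- partition columns must come last
FROM tableA;

Note that a single CTAS can only write to a limited number of partitions, so a large backfill may need to be split into several runs.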
I have a table on Athena partitioned by day (huge table, TB of data). There's no day column on the table, at least not explicitly. I would expect that a query like the following:
select max(day) from my_table
would scan virtually no data. However, Athena reports that several hundreds of GB are scanned. Any idea why?
===== EDIT 2021-01-14 =====
I've recently bumped into this issue again. It turns out that when the underlying data is Parquet, operations on partition columns don't consume data. For other data formats that I've tried (including ORC) there is an associated data cost. It doesn't make any sense to me.
I don't know the answer for a fact, but my guess is:
Athena just does not have the optimization of reading only the partition names when only they are queried - this is clear from its behaviour - so it scans everything.
Parquet stores min/max values for every column, whereas ORC does so only if an index is present, AFAIU. Thus for Parquet, Athena's query planner can look directly at these rollup values, i.e. no scan is performed. It's different for ORC.
I know it's a little late to answer this question for you, Nicolas, but it's important to also keep some possible solutions here.
Unfortunately, this is the way Athena works: it will read all the data as a table scan just to list the partition values.
A workaround that works well here is to use the partition metadata instead of the data itself, for example:
Instead of using this syntax:
select max(day) from my_table
Try to use this syntax:
SELECT day FROM my_schema."my_table$partitions" ORDER BY day DESC LIMIT 1
This second statement reads only metadata and returns the same data you need.
It does not depend on the format but on the compression algorithm used - mostly Snappy for ORC and GZIP for Parquet. That is what makes the difference.
I'm trying to add dist and sort keys to some of the tables in Redshift.
I notice that before adding them the size of the table is 0.50, and after adding them it increases to 0.51 or 0.52. Is this possible? I thought the whole purpose of having dist and sort keys was to decrease the size of the table and help increase the read/write performance.
That is not the purpose of having a DISTKEY and SORTKEY.
To decrease the storage size of a table, use compression.
The DISTKEY is used to distribute data amongst slices. By co-locating information on the same slice, queries can run faster. For example, if you had these tables:
customer table, DISTKEY = customer_id
invoices table, DISTKEY = customer_id
...then these tables would be distributed in the same manner. All records in both tables for a given customer_id would be located on the same slice, thereby avoiding the need to transfer data between slices. The DISTKEY should be the column most commonly used in JOINs.
The SORTKEY is used to sort data on disk, for the benefit of Zone Maps. Each storage block on disk is 1MB in size and contains data for only one column in one table. The data for this column is sorted, then stored in multiple blocks. The Zone Map associated with each block identifies the minimum and maximum values stored within that block. Then, when a query is run with a WHERE statement, Amazon Redshift only needs to read the blocks that contain the desired range of data. By skipping over blocks that do not contain data within the WHERE clause, Redshift can run queries much faster.
The above can all work together. For example, compressed data requires fewer blocks, which also allows Redshift to skip over more data based on the Zone Maps. To get the best possible performance out of queries, use DISTKEY, SORTKEY and compression together.
(It is often recommended not to compress the SORTKEY column because it causes too many rows to be loaded from a single block.)
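To make that concrete, here is a sketch of the customer/invoices example with all three combined (columns and encodings are illustrative only, not a prescription):

CREATE TABLE customer (
    customer_id   BIGINT       ENCODE raw,   -- sort key column left uncompressed
    customer_name VARCHAR(256) ENCODE lzo
)
DISTKEY (customer_id)
SORTKEY (customer_id);

CREATE TABLE invoices (
    invoice_id   BIGINT        ENCODE az64,
    customer_id  BIGINT        ENCODE az64,  -- matches the customer DISTKEY, so the join is co-located
    invoice_date DATE          ENCODE raw,   -- first sort key column left uncompressed
    amount       DECIMAL(12,2) ENCODE az64
)
DISTKEY (customer_id)
SORTKEY (invoice_date, customer_id);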
See also: Top 10 Performance Tuning Techniques for Amazon Redshift
Would encoding on a join column hurt query performance? I let the COPY command decide the encoding type.
In general no - an encoding on your DISTKEY will even have a positive impact due to the reduction in disk I/O.
According to the AWS table design playbook, there are a few edge cases where an encoding on your DISTKEY will indeed hurt your query performance:
Your query patterns apply range-restricted scans to a column that is very well compressed.
The well-compressed column's blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don't compress well. These columns are more than 10x the size of the well-compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.
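The column encoding utility referenced above is, as far as I know, one of the scripts in AWS's amazon-redshift-utils repository on GitHub. A quicker built-in option is ANALYZE COMPRESSION, which samples the table and suggests an encoding per column (table name is a placeholder):

-- note: ANALYZE COMPRESSION takes an exclusive lock on the table while it runs
ANALYZE COMPRESSION my_table;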
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn't have to read all table data to evaluate a query. Data stored by column should also be encoded, which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn't require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So, to answer your question: it will not hurt your query performance, but running without encoding is not a best practice.
There are a couple of details on this from AWS respondents:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
DISTKEY can be compressed, but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound), the other sort key columns can be compressed.
Also, it is generally recommended to use a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables, try using the same dist and sort keys on both tables so Redshift can use a faster merge join.
Based on this, I think that as long as both sides of the join have the same compression, Redshift will join on the compressed values safely.
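A quick way to sanity-check the merge-join part (assuming two hypothetical tables, fact_orders and fact_payments, that share DISTKEY and SORTKEY on customer_id) is to look at the query plan:

EXPLAIN
SELECT o.customer_id, COUNT(*)
FROM fact_orders o
JOIN fact_payments p ON o.customer_id = p.customer_id
GROUP BY o.customer_id;
-- look for a "Merge Join" step rather than a "Hash Join" step in the plan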